PARALLEL COMPUTATIONAL FLUID DYNAMICS TOWARDS TERAFLOPS, OPTIMIZATION AND NOVEL FORMULATIONS
This Page Intentionally Left Blank
PARALLEL COMPUTATIONAL FLUID DYNAMICS TOWARDS TERAFLOPS,OPTIMIZATION AND NOVEL FORMULATIONS P r o c e e d i n g s of the Parallel C F D ' 9 9 C o n f e r e n c e
Edited by O.
A.
KEYES
Old Dominion University Norfolk, U.S.A. d,
ECER
IUP UI, Indianapolis Indiana, U.S.A.
PERIAUX
N.
Dassault-Aviation Saint-Cloud, France
SATOFU
KA
Kyoto Institute of TechnologF Kyoto, Japan Assistant Editor P,
Fox
I UP UI, Indianapolis Indiana, U.S.A.
N 2000
ELSEVIER A m s t e r d a m - L a u s a n n e - New Y o r k - O x f o r d - S h a n n o n - S i n g a p o r e - Tokyo
ELSEVIER
SCIENCE
B.V.
Sara Burgerhartstraat 25 P.O. Box 211, 1000 AE Amsterdam,
9 2000 Elsevier Science B.V.
The Netherlands
All rights reserved.
This work is protected under copyright by Elsevier Science, and the following terms and conditions apply to its use: Photocopying Single photocopies of single chapters may be made for personal use as allowed by national copyright laws. Permission of the Publisher and payment of a fee is required for all other photocopying, including multiple or systematic copying, copying for advertising or promotional purposes, resale, and all forms of document delivery. Special rates are available for educational institutions that wish to make photocopies for non-profit educational classroom use. Permissions may be sought directly from Elsevier Science Global Rights Department, PO Box 800, Oxford OX5 IDX, UK; phone: (+44) 1865 843830, fax: (+44) 1865 853333, e-mail:
[email protected]. You may also contact Global Rights directly through Elsevier's home page (http://www.elsevier.nl), by selecting 'Obtaining Permissions'. In the USA, users may clear permissions and make payments through the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, USA; phone: (978) 7508400, fax: (978) 7504744, and in the UK through the Copyright Licensing Agency Rapid Clearance Service (CLARCS), 90 Tottenham Court Road, London W1P 0LP, UK; phone: (+44) 171 631 5555; fax: (+44) 171 631 5500. Other countries may have a local reprographic rights agency for payments. Derivative Works Tables of contents may be reproduced for internal circulation, but permission of Elsevier Science is required for external resale or distribution of such material. Permission of the Publisher is required for all other derivative works, including compilations and translations. Electronic Storage or Usage Permission of the Publisher is required to store or use electronically any material contained in this work, including any chapter or part of a chapter. Except as outlined above, no part of this work may be reproduced, stored in a retrieval system or transmitted in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, without prior written permission of the Publisher. Address permissions requests to: Elsevier Global Rights Department, at the mail, fax and e-mail addresses noted above. Notice No responsibility is assumed by the Publisher for any injury and/or damage to persons or property as a matter of products liability, negligence or otherwise, or from any use or operation of any methods, products, instructions or ideas contained in the material herein. Because of rapid advances in the medical sciences, in particular, independent verification of diagnoses and drug dosages should be made.
First edition 2000 L i b r a r y o f C o n g r e s s C a t a l o g i n g in P u b l i c a t i o n D a t a A c a t a l o g r e c o r d f r o m t h e L i b r a r y o f C o n g r e s s h a s b e e n a p p l i e d for.
ISBN: 0-444-82851-6 T h e p a p e r u s e d in this p u b l i c a t i o n m e e t s the r e q u i r e m e n t s o f A N S I f N I S O P r i n t e d in T h e N e t h e r l a n d s .
Z39.48-1992
(Permanence of Paper).
PREFACE Parallel CFD'99, the eleventh in an international series of meetings featuring computational fluid dynamics research on parallel computers, was convened in Williamsburg, Virginia, from the 23 rd to the 26 th of May, 1999. Over 125 participants from 19 countries converged for the conference, which returned to the United States for the first time since 1995. Parallel computing and computational fluid dynamics have evolved not merely simultaneously, but synergistically. This is evident, for instance, in the awarding of the Gordon Bell Prizes for floating point performance on a practical application, from the earliest such awards in 1987 to the latest in 1999, when three of the awards went to CFD entries. Since the time of Von Neumann, CFD has been one the main drivers for computing in general, and the appetite for cycles and storage generated by CFD together with its natural concurrency have constantly pushed computing into greater degrees of parallelism. In turn, the opportunities made available by highend computing, even as driven by other forces such as fast networking, have quickly been seized by computational fluid dynamicists. The emergence of departmental-scale Beowulf-type systems is one of the latest such examples. The exploration and promotion of this synergistic realm is precisely the goal of the Parallel CFD international conferences. Many diverse realms of phenomena in which fluid dynamical simulations play a critical role were featured at the 1999 meeting, as were a large variety of parallel models and architectures. Special emphases included parallel methods in optimization, non-PDE-based formulations of CFD (such as lattice-Boltzmann), and the influence of deep memory hierarchies and high interprocessor latencies on the design of algorithms and data structures for CFD applications. There were ten plenary speakers representing major parallel CFD groups in government, academia, and industry from around the world. Shinichi Kawai of the National Space Development Agency of Japan, described that agency's 30 Teraflop/s "Earth Simulator" computer. European parallel CFD programs, heavily driven by industry, were presented by Jean-Antoine D~sideri of INRIA, Marc Garbey of the University of Lyon, Trond Kvamsdahl of SINTEF, and Mark Cross of the University of Greenwich. Dimitri Mavriplis of ICASE and James Taft of NASA Ames updated conferees on the state of the art in parallel computational aerodynamics in NASA. Presenting new results on the platforms of the Accelerated Strategic Computing
Initiative (ASCI) of the U.S. Department of Energy were John Shadid of Sandia-Albuquerque and Paul Fischer of Argonne. John Salmon of Caltech, a two-time Gordon Bell Prize winner and co-author of "How to Build a Beowulf," presented large-scale astrophysical computations on inexpensive PC clusters. Conferees also heard special reports from Robert Voigt of The College of William & Mary and the U.S. Department of Energy on research taking place under the ASCI Alliance Program and from Douglas McCarthy of Boeing on the new CFD General Notation System (CGNS). A pre-conference tutorial on the Portable Extensible Toolkit for Scientific Computing (PETSc), already used by many of the participants, was given by William Gropp, Lois McInnes, and Satish Balay of Argonne. Contributed presentations were given by over 50 researchers representing the state of parallel CFD art and architecture from Asia, Europe, and North America. Major developments at the 1999 meeting were: (1) the effective use of as many as 2048 processors in implicit computations in CFD, (2) the acceptance that parallelism is now the "easy part" of large-scale CFD compared to the difficulty of getting good per-node performance on the latest fast-clocked commodity processors with cache-based memory systems, (3) favorable prospects for Lattice-Boltzmann computations in CFD (especially for problems that Eulerian and even Lagrangian techniques do not handle well, such as two-phase flows and flows with exceedingly multiply-connected domains or domains with a lot of holes in them, but even for conventional flows already handled well with the continuum-based approaches of PDEs), and (4) the nascent integration of optimization and very large-scale CFD. Further details of Parallel CFD'99, as well as other conferences in this series, are available at h t t p ://www. p a r c f d , org. The Editors
vii
ACKNOWLEDGMENTS Sponsoring Parallel CFD'99 9 the Army 9 the IBM
were:
Research Office Corporation
9 the National Aeronautics
and Space Administration
9 the National Science Foundation 9 the Southeastern
Universities Research Association (SURA).
PCFD'99 is especially grateful to Dr. Stephen Davis of the ARO, Dr. Suga Sugavanum of IBM, Dr. David Rudy of NASA, Dr. Charles Koelbel and Dr. John Foss of NSF, and Hugh Loweth of SURA for their support of the meeting. Logistics and facilities were provided by: 9 the College of William & Mary 9 the Institute for Computer (ICASE) 9 the NASA
in Science and Engineering
Langley Research Center
9 Old Dominion 9 the Purdue
Applications
University
School of Engineering and Technology
at IUPUI.
The Old Dominion University Research Foundation waived standard indirect cost rates in administering Parallel CFD'99, since it was recognized as a deliverable of two "center" projects: an NSF Multidisciplinary Computing Challenges grant and a DOE ASCI Level II subcontract on massively parallel algorithms in transport. Pat Fox of IUPUI, the main source of energy and institutional memory for the Parallel CFD conference series was of constant assistance in the planning and execution of Parallel CFD'99.
viii
Emily Todd, the ICASE Conference Coordinator, was of immeasurable help in keeping conference planning on schedule and making arrangements with local vendors. Scientific Committee members Akin Ecer, David Emerson, Isaac Lopez, and James McDonough offered timely advice and encouragement. Manny Salas, Veer Vatsa, and Bob Voigt of the Local Organizing Committee opened the doors of excellent local facilities to Parallel CFD'99 and its accompanying tutorial. Don Morrison of the A/V office at NASA Langley kept the presentations, which employed every variety of presentation hardware, running smoothly. Ajay Gupta, Matt Glaves, Eric Lewandowski, Chris Robinson, and Clinton Rudd of the Computer Science Department at ODU and Shouben Zhou, the ICASE Systems Manager, provided conferees with fin-de-si~cle network connectivity during their stay in Colonial-era Williamsburg. Jeremy York of IUPUI, Kara Olson of ODU, and Jeanie Samply of ICASE provided logistical support to the conference. Most importantly to this volume, the copy-editing of David Hysom and Kara Olson improved dozens of the chapters herein. The Institute for Scientific Computing Research (ISCR) and the Center for Applied Scientific Computing (CASC) at Lawrence Livermore National Laboratory hosted the lead editor during the production of this proceedings. David Keyes, Conference Chair Parallel CFD'99
SCIENTIFIC
COMMITTEE
of t h e P a r a l l e l C F D C o n f e r e n c e s 1998-1999 Ramesh K. Agarwal, Wichita State University, USA Boris Chetverushkin, Russian Academy of Science, Russia Akin Ecer, IUPUI, USA David R. Emerson, Daresbury Laboratory, UK Pat Fox, IUPUI, USA Marc Garbey, Universite Claude Bernard Lyon I, France Alfred Geiger, HLRS, Germany Carl Jenssen, Statoil, Norway David E. Keyes, Old Dominion University, USA Chao-An Lin, National Tsing Hua University, Taiwan Isaac Lopez, Army Research Laboratory NASA Lewis Campus, USA Doug McCarthy, Boeing Company, USA James McDonough, University of Kentucky, USA Jacques P~riaux, Dassault Aviation, France Richard Pelz, Rutgers University, USA Nobuyuki Satofuka, Kyoto Institute of Technology, Japan Pasquale Schiano, Centro Italiano Ricerche Aerospziali CIRA, Italy Suga Sugavanam, IBM Marli E.S. Vogels, National Aerospace Laboratory NLR, The Netherlands David Weaver, Phillips Laboratory, USA
LIST OF P A R T I C I P A N T S in Parallel C F D ' 9 9 Ramesh Agarwal, Wichita State University Rakhim Aitbayev, University of Colorado, Boulder Hasan Akay, IUPUI Giorgio Amati, CASPUR W. Kyle Anderson, NASA Langley Rsearch Center Rustem Aslan, Istanbul Technical University Mike Ashworth, CLRC Abdelkader Baggag, ICASE Satish Balay, Argonne National Laboratory Oktay Baysal, Old Dominion University Robert Biedron, NASA Langley Research Center George Biros, Carnegie Mellon University Daryl Bonhaus, NASA Langley Research Center Gunther Brenner, University of Erlangen Edoardo Bucchignani, CIRA SCPA Xing Cai, University of Oslo Doru Caraeni, Lund Institute of Technology Mark Carpenter, NASA Langley Research Center Po-Shu Chen, ICASE Jiadong Chen, IUPUI Guilhem Chevalier, CERFACS Stanley Chien, IUPUI Peter Chow, Fujitsu Mark Cross, University of Greenwich Wenlong Dai, University of Minnesota Eduardo D'Azevedo, Oak Ridge National Laboratory Anil Deane, University of Maryland Ayodeji Demuren, Old Dominion University Jean-Antoine D~sideri, INRIA Boris Diskin, ICASE Florin Dobrian, Old Dominion University Akin Ecer, IUPUI David Emerson, CLRC Karl Engel, Daimler Chrysler Huiyu Feng, George Washington University Paul Fischer, Argonne National Laboratory Randy Franklin, North Carolina State University Martin Galle, DLR Marc Garbey, University of Lyon Alfred Geiger, HLRS
Aytekin Gel, University of West Virginia Omar Ghattas, Carnegie Mellon University William Gropp, Argonne National Laboratory X. J. Gu, CLRC Harri Hakula, University of Chicago Xin He, Old Dominion University Paul Hovland, Argonne National Laboratory Weicheng Huang, UIUC Xiangyu Huang, College of. William & Mary David Hysom, Old Dominion University Cos Ierotheou, University of Greenwich Eleanor Jenkins, North Carolina State University Claus Jenssen, SINTEF Andreas Kahari, Uppsala University Boris Kaludercic, Computational Dynamics Limited Dinesh Kaushik, Argonne National Laboratory Shinichi Kawai, NSDA David Keyes, Old Dominion University Matthew Knepley, Purdue University Suleyman Kocak, IUPUI John Kroll, Old Dominion University Stefan Kunze, University of Tiibingen Chia-Chen Kuo, NCHPC Trond Kvamsdal, SINTEF Lawrence Leemis, College of William & Mary Wu Li, Old Dominion University David Lockhard, NASA Langley Research Center Josip Loncaric, ICASE Lian Peet Loo, IUPUI Isaac Lopez, NASA Glenn Research Center Li Shi Luo, ICASE James Martin, ICASE Dimitri Mavriplis, ICASE Peter McCorquodale, Lawrence Berkeley National Laboratory James McDonough, University of Kentucky Lois McInnes, Argonne National Laboratory Piyush Mehrotra, ICASE N. Duane Melson, NASA Langley Research Center Razi Nalim, IUPUI Eric Nielsen, NASA Langley Research Center Stefan Nilsson, Chalmers University of Technology Jacques P~riaux, INRIA Alex Pothen, Old Dominion University Alex Povitsky, ICASE
Luie Rey, IBM Austin Jacques Richard, Illinois State University, Chicago Jacqueline Rodrigues, University of Greenwich Kevin Roe, ICASE Cord Rossow, DLR David Rudy, NASA Langley Research Center Jubaraj Sahu, Army Research Laboratory John Salmon, California Institute of Technology Widodo Samyono, Old Dominion University Nobuyuki Satofuka, Kyoto Institute of Technology Punyam Satya-Narayana, Raytheon Erik Schnetter, University of Tiibingen Kara Schumacher Olson, Old Dominion University John Shadid, Sandia National Laboratory Kenjiro Shimano, Musashi Institute of Technology Manuel Sofia, University Politecnico Catalunya Linda Stals, ICASE Andreas Stathopoulous, College of William & Mary Azzeddine Soulaimani, Ecole Sup~rieure A. (Suga) Sugavanam, IBM Dallas Samuel Sundberg, Uppsala University Madhava Syamlal, Fluent Incorporated James Taft, NASA Ames Research Center Danesh Tafti, UIUC Aoyama Takashi, National Aerospace Laboratory Ilker Tarkan, IUPUI Virginia Torczon, College of William & Mary Damien Tromeur-Dervout, University of Lyon Aris Twerda, Delft University of Technology Ali Uzun, IUPUI George Vahala, College of William & Mary Robert Voigt, College of William & Mary Johnson Wang, Aerospace Corporation Tadashi Watanabe, J AERI Chip Watson, Jefferson Laboratory Peter Wilders, Technological University of Delft Kwai Wong, University of Tennessee Mark Woodgate, University of Glasgow Paul Woodward, University of Minnesota Yunhai Wu, Old Dominion University Shishen Xie, University of Houston Jie Zhang, Old Dominion University
xii
xiii
This Page Intentionally Left Blank
XV
T A B L E OF C O N T E N T S
Preface ................................................................. Acknowledgments
...................................................
List of Scientific Committee
Members .............................
List of Participants ................................................... Conference Photograph
PLENARY
............................................
v vii ix x xiii
PAPERS
J.-A. Ddsideri, L. Fournier, S. Lanteri, N. Marco, B. Mantel, J. Pdriaux, and J. F. Wang Parallel Multigrid Solution and Optimization in Compressible Flow Simulation and Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
P. F. Fischer and H. M. Tufo High-Performance Spectral Element Algorithms and Implementations .. 17
M. Garbey and D. Tromeur-Dervout O p e r a t o r Splitting and Domain Decomposition for Multiclusters . . . . . . . 27
S. Kawai, M. Yokokawa, H. Ito, S. Shingu, K. Tani, and K. Yoshida Development of the E a r t h Simulator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
37
K. McManus, M. Cross, C. Walshaw, S. Johnson, C. Bailey, K. Pericleous, A. Sloan, and P. Chow Virtual Manufacturing and Design in the Real World Implementation and Scalability on H P C C Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
D. Mavriplis Large-Scale Parallel Viscous Flow C o m p u t a t i o n s Using an U n s t r u c t u r e d Multigrid Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
xvi
CONTRIBUTED
PAPERS
R. Agarwal Efficient Parallel Implementation of a Compact Higher-Order Maxwell Solver Using Spatial Filtering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
R. Aitbayev, X.-C. Cai, and M. Paraschivoiu Parallel Two-Level Methods for Three-dimensional Transonic Compressible Flow Simulations on Unstructured Meshes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
T. Aoyama, A. Ochi, S. Saito, and E. Shima Parallel Calculation of Helicopter BVI Noise by a Moving Overlapped Grid Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
A. R. Aslan, U. Gulcat, A. Misirlioglu, and F. O. Edis Domain Decomposition Implementations for Parallel Solutions of 3D NavierStokes Equations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
A. Baggag, H. L. A tkins, and D. E. Keyes Parallel Implementation of the Discontinuous Galerkin Method . . . . . . . 115
J. Bernsdorf, G. Brenner, F. Durst, and M. Baum Numerical Simulations of Complex Flows with Lattice-Boltzmann A u t o m a t a on Parallel Computers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
G. Biros and O. Ghattas Parallel Preconditioners for K K T Systems Arising in Optimal Control of Viscous Incompressible Flows . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131
E. Bucchignani, A. Matrone, and F. Stella Parallel Polynomial Preconditioners for the Analysis of Chaotic Flows in Rayleigh-Benard Convection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
X. Cai, H. P. Langtangen, and O. Munthe An Object-Oriented Software Framework for Building Parallel Navier-Stokes Solvers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
D. Caraeni, C. Bergstrom, and L. Fuchs Parallel NAS3D: An Efficient Algorithm for Engineering Spray Simulations using LES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155
xvii
Y. P. Chien, J. D. Chert, A. Ecer, and H. U. Akay Dynamic Load Balancing for Parallel CFD on NT Networks . . . . . . . . . .
165
P. Chow, C. Bailey, K. McManus, D. Wheeler, H. Lu, M. Cross, and C. Addison Parallel Computer Simulation of a Chip Bonding to a Printed Circuit Board: In the Analysis Phase of a Design and Optimization Process for Electronic Packaging . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 173
W. Dai and P. R. Woodward Implicit-explicit Hybrid Schemes for Radiation Hydrodynamics Suitable for Distributed Computer Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 179
A. Deane Computations of Three-dimensional Compressible Rayleigh-Taylor Instability on S G I / C r a y T3E . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 189
A. Ecer, E. Lemoine, and L Tarkan An Algorithm for Reducing Communication Cost in Parallel Computing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
199
F. O. Edis, U. Gulcat, and A. R. Aslan Domain Decomposition Solution of Incompressible Flows using Unstructured Grids with pP2P1 Elements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 207
D. R. Emerson, Y. F. Hu, and M. Ashworth Parallel Agglomeration Strategies for Industrial Unstructured Solvers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
215
M. Galle, T. Gerhold, and J. Evans Parallel Computation of Turbulent Flows Around Complex Geometries on Hybrid Grids with the DLR-TAU Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 223
A. Gel and L Celik Parallel Implementation of a Commonly Used Internal Combustion Engine Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 231
W. D. Gropp, D. K. Kaushik, D. E. Keyes, and B. F. Smith Towards Realistic Performance Bounds for Implicit CFD Codes . . . . . . . 241
xviii
W. Huang and D. Tafli A Parallel Computing Framework for Dynamic Power Balancing in Adaptive Mesh Refinement Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 249
E. W. Jenkins, R. C. Berger, J. P. Hallberg, S. E. Howington, C. T. Kelley, J. H. Schmidt, A. K. Stagg, and M. D. Tocci A Two-Level Aggregation-Based Newton-Krylov-Schwarz Method for Hydrology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 257
D. E. Keyes The Next Four Orders of Magnitude in Performance for Parallel CFD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
265
M. G. Knepley, A. H. Sameh, and V. Sarin Design of Large-Scale Parallel Simulations . . . . . . . . . . . . . . . . . . . . . . . . . . . .
273
S. Kocak and H. U. Akay An Efficient Storage Technique for Parallel Schur Complement Method and Applications on Different Platforms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 281
S. Kunze, E. Schnetter, and R. Speith Applications of the Smoothed Particle Hydrodynamics method: The Need for Supercomputing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 289
I. M. Llorente, B. Diskin, and N. D. Melson Parallel Multigrid with Blockwise Smoothers for Multiblock Grids . . . . 297
K. Morinishi and N. Satofuka An Artificial Compressibility Solver for Parallel Simulation of Incompressible Two-Phase Flows . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 305
E. J. Nielsen, W. K. Anderson, and D. K. Kaushik Implementation of a Parallel Framework for Aerodynamic Design Optimization on Unstructured Meshes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 313
S. Nilsson Validation of a Parallel Version of the Explicit Projection Method for Turbulent Viscous Flows . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 321
T.-W. Pan, V. Sarin, R. Glowinski, J. Pdriaux, and A. Sameh Parallel Solution of Multibody Store Separation Problems by a Fictitious Domain Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 329
xix
A. Povitsky Efficient Parallel-by-Line Methods in CFD . . . . . . . . . . . . . . . . . . . . . . . . . . . .
337
J. N. Rodrigues, S. P. Johnson, C. Walshaw, and M. Cross An Automatable Generic Strategy for Dynamic Load Balancing in a Parallel Structured Mesh CFD Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 345
J. Sahu, K. R. Heavey, and D. M. Pressel Parallel Performance of a Zonal Navier-Stokes Code on a Missile Flowfield . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
355
N. Satofuka and T. Sakai Parallel Computation of Three-dimensional Two-phase Flows By LatticeBoltzmann Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 363
P. Satya-narayana, R. A vancha, P. Mucci, and R. Pletcher Parallelization and Optimization of a Large Eddy Simulation Code using and O p e n M P for SGI Origin2000 Performance . . . . . . . . . . . . . . . . . . . . . . . . 371
K. Shimano, Y. Hamajima, and C. Arakawa Calculation of Unsteady Incompressible Flows on a Massively Parallel Computer Using the B.F.C. Coupled Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 381
M. Sofia, J. Cadafalch, R. Consul, K. Clararnunt, and A. Oliva A Parallel Algorithm for the Detailed Numerical Simulation of Reactive Flows . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 389
A. Soulaimani, A. Rebaine, and Y. Saad Parallelization of the Edge-based Stabilized Finite Element Method ... 397
A. Twerda, R. L. Verweij, T. W. J. Peeters, and A. F. Bakker The Need for Multigrid for Large Computations . . . . . . . . . . . . . . . . . . . . . .
407
A. Uzun, H. U. Akay, and C. E. Bronnenberg Parallel Computations of Unsteady Euler Equations on Dynamically Deforming Unstructured Grids . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 415
G. Vahala, J. Carter, D. Wah, L. Vahala, and P. Pavlo Parallelization and MPI Performance of T h e r m a l Lattice Boltzmann Codes for Fluid Turbulence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 423
XX
T. Watanabe and K. Ebihara Parallel Computation of Two-Phase Flows Using the Immiscible Lattice Gas . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 431 P. Wilders Parallel Performance Modeling of an Implicit Advection-Diffusion Solver . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
439
M. A. Woodgate, K. J. Badcock, and B. E. Richards A Parallel 3D Fully Implicit Unsteady Multiblock CFD Code Implemented on a Beowulf Cluster . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 447
PLENARY PAPERS
This Page Intentionally Left Blank
Parallel Computational Fluid Dynamics Towards Teraflops, Optimization and Novel Formulations D. Keyes, A. Ecer, J. Periaux, N. Satofuka and P. Fox (Editors) 2000 Elsevier Science B.V. All rights reserved.
Parallel m u l t i g r i d solution and o p t i m i z a t i o n in c o m p r e s s i b l e flow s i m u l a t i o n and design J.-A. D(?sid~ri, L. Fournier, S. Lanteri and N. Marco, INRIA Projet Sinus, 2004 Route des Lucioles, 06902 Sophia-Antipolis Cedex, B. Mantel, J. P~riaux, Dassault Aviation, France, J.F. Wang, Nanjing Institute of Aeronautics and Astronautics, China. This paper describes recent achievements regarding the development of parallel multigrid (PMG) methods and parallel genetic algorithms (PGAs) in the framework of compressible flow simulation and optimization. More precisely, this work is part of a broad research objective aimed at building efficient and robust optimization strategies for complex multi-disciplinary shape-design problems in aerodynamics. Ultimately, such a parallel optimization technique should combine the following ingredients: I1. parallel multigrid methods for the acceleration of the underlying flow calculations; I2. domain decomposition algorithms for an efficient and mathematically well posed distribution of the global optimization process on a network of processors; I3. robust parallel optimization techniques combining non-deterministic algorithms (genetic algorithms in the present context) and efficient local optimization algorithms to accelerate the convergence to the optimal solution; I4. distributed shape parametrization techniques. In this contribution, we address mainly topics I1 and I3 in the context of optimal airfoil design. 1. P A R A L L E L S T R A T E G I E S IN G E N E T I C A L G O R I T H M S 1.1.
Introduction
Genetic algorithms (GAs) are search algorithms based on mechanisms simulating natural selection. They rely on the analogy with Darwin's principle of survival of the fittest. John Holland, in the 1970's, introduced the idea according to which difficult optimization problems could be solved by such an evolutionary approach. The technique operates on a population of potential solutions represented by strings of binary digits (called chromosomes or individuals) which are submitted to several semi-stochastic operators (selection, crossover and mutation). The population evolves during the generations according to the fitness value of the individuals; then, when a stationary state is reached, the population has converged to an/the optimized solution (see [3] for an introduction to the
subject). GAs differ from classical optimization procedures, such as the steepest descent or conjugate gradient method, in many ways: - the entire parameter set is coded; the iteration applies to an entire population of potential solutions, in contrast to classical algorithms, in which a single candidate solution is driven to optimality by successive steps; the iteration is an "evolution" step, or new generation, conducted by semi-stochastic operators; - the search space is investigated (more) globally, enhancing robustness; two keywords are linked to GAs : ezploration and ezploitation. Exploration of the search space is important at the beginning of the GA process, while exploitation is desirable when the GA process is close to the global optimum. GAs have been introduced in aerodynamics shape design problems for about fifteen years (see Kuiper et al. [4], P6riaux et al. in [9], Quagliarella in [10] and Obayashi in [8], who present 3D results for a transonic flow around a wing geometry). The main concern related to the use of GAs for aerodynamic design is the computational effort needed for the accurate evaluation of a design configuration that, in the case of a crude application of the technique, might lead to unacceptable computer time if compared with more classical algorithms. In addition, hard problems need larger populations and this translates directly into higher computational costs. It is a widely accepted position that GAs can be effectively parallelized and can in principle take full advantage of (massively) parallel computer architectures. This point of view is above all motivated by the fact that within a generation (iteration) of the algorithm, the fitness values associated with each individual of the population can be evaluated in parallel. In this study, we have developed a shape optimum design methodology that combines the following ingredients: the underlying flow solver discretizes the Euler or full Navier-Stokes equations using a mixed finite element/finite volume formulation on triangular meshes. Time integration to steady state is achieved using a linearized Euler implicit scheme which results in the solution of a linear system for advancing the solution to the next time step; -
a binary-coded genetic algorithm is used as the main optimization kernel. In our context, the population of individuals is represented by airfoil shapes. The shape parametrization strategy is based on B~zier curves.
1.2. P a r a l l e l i z a t i o n s t r a t e g y Several possible strategies can be considered for the parallelization of the GA-based shape design optimization described above:
- a first strategy stems from the following remark: within a given generation of a GA, the evaluation of the fitness values associated with the population of individuals defines independent processes. This makes GAs particularly well suited for massively parallel systems; we also note that a parent/chuld approach is a standard candidate for the implementation of this first level of parallelism, especially when the size of the populations is greater than the available number of processors; - a second strategy consists of concentrating the parallellization efforts on the process underlying a fitness value evaluation, here the flow solver. This approach finds its main motivation in the fact that, when complex field analysers are used in conjunction with a GA, then the aggregate cost of fitness values evaluations can represent between 80 to 90% of the total optimization time. A SPMD paradigm is particularly well suited for the implementation of this strategy; the third option combines the above two approaches and clearly yields a two-level parallelization strategy which has been considered here and which will be detailed in the sequel. Our choice has been motivated by the following remarks: (1) a parallel version of the two-dimensional flow solver was available and adapted to the present study; (2) we have targetted a distributed memory SPMD implementation and we did not want the resulting optimization tool to be limited by memory capacity constraints, especially since the present study will find its sequel in its adaptation to 3D shape optimization problems, based on more complex aerodynamical models (Navier-Stokes equations coupled with a turbulence model); and (3) we believe that the adopted parallelization strategy will define a good starting-point for the construction and evaluation of sub-populations based parallel genetic algorithms (PGas). In our context, the parallelization strategy adopted for the flow solver combines domain partitioning techniques and a message-passing programming model [6]. The underlying mesh is assumed to be partitioned into several submeshes, each one defining a subdornain. Basically, the same ~%ase" serial code is executed within every subdomain. Applying this parallelization strategy to the flow solver results in modifications occuring in the main time-stepping loop in order to take into account one or several assembly phases of the subdomain results. The coordination of subdomain calculations through information exchange at artificial boundaries is implemented using calls to functions of the MPI library. The paralellization described above aims at reducing the cost of the fitness function evaluation for a given individual. However, another level of parallelism can clearly be exploited here and is directly related to the binary tournament approach and the crossover operator. In practice, during each generation, individuals of the current population are treated pairwise; this applies to the selection, crossover, mutation and fitness function evaluation steps. Here, the main remark is that for this last step, the evaluation of the fitness functions associated with the two selected individuals, defines independent operations. We have chosen to exploit this fact using the notion of process groups which is one of the main features of the MPI environment. Two groups are defined, each of them containing the same number of processes; this number is given by the number of subdomains in the partitioned mesh. Now, each group is responsible for the evaluation
of the fitness function for a given individual. We note in passing that such an approach based on process groups will also be interesting in the context of sub-populations based P G a s (see [1] for a review on the subject); this will be considered in a future work.
1.3. An o p t i m u m shape design case The method has been applied to a direct optimization problem consisting in designing the shape of an airfoil, symbolically denoted by 3', to reduce the shock-induced drag, CD, while preserving the lift, CL, to the reference value, C~AE, corresponding to the RAE2822 airfoil, immersed in an Eulerian flow at 2~ of incidence and a freestream Mach number of 0.73. Thus, the cost functional was given the following form: J(7) =
CD
+
10 (Cc -
cRAE)2
(1)
The non-linear convergence tolerance has been fixed to 10 .6 . The computational mesh consists of 14747 vertices (160 vertices on the airfoil) and 29054 triangles. Here, each "chromosome" represents a candidate airfoil defined by a B~zier spline whose support is made of 7+7 control points at prescribed abscissas for the upper and lower surfaces. A population of 30 individuals has been considered. After 50 generations, the shape has evolved and the shock has been notably reduced; the initial and final flows (iso-Mach values) are shown on Figure 1. Additionally, initial and final values of CD and Cc are given in Table 1. The calculations have been performed on the following systems: an SGI Origin 2000 (equipped with 64 MIPS RI0000/195 Mhz processors) and an experimental Pentium Pro (P6/200 Mhz, running the LINUX system) cluster where the interconnection is realized through F a s t E t h e r n e t (100 Mbits/s) switches. The native MPI implementation has been used on the SGI O r i g i n 2000 system while MPICH 1.1 has been used on the Pentium Pro cluster. Performance results are given for 64 bit arithmetic computations.
Figure 1. Drag reduction: initial and optimized flows (steady iso-Mach lines)
We compare timing measurements for the overall optimization using one and two process groups. Timings are given for a fixed number of generations (generally 5 optimization iterations). In Tables 2 and 3 below, Ng and Np respectively denote the number of process groups and the total number of processes (Ng = 2 and Np = 4 means 2 processes for each of the two groups), "CPU" is the total CPU time, "Flow" is the accumulated flow solver time, and "Elapsed" is the total elapsed time (the distinction between the CPU and the elapsed times is particularly relevant for the Pentium Pro cluster). Finally, S(Np) is the parallel speed-up ratio Elapsed(N 9 = 1, Np = 5)/Elapsed(Ng, Np), the case Nv = 1, Np = 5 serving as a reference. For the multiple processes cases, the given timing measures ("CPU" and "Flow") always correspond to the maximum value over the per-process measures.
Table 1 Drag reduction: initial and optimized values of the Co and CL coefficients "-'~L
0.8068
0.8062
v-"D
0.0089
0.0048
Table 2 Parallel perfomance results on the SGI O r i g i n 2000 N q Np Elapsed CPU Flow S(Np) Min Max 1 2 1
5 10 10
2187 sec 1270sec 1126sec
2173 sec 1261sec 1115sec
1934 sec 1031sec 900sec
1995 sec 1118sec 953sec
1.0 1.7 1.9
Table 3 Parallel perfomance results on the Pentium Pro cluster Ng N v Elapsed CPU Flow Min Max 1 2 1
5 10 10
18099 sec 9539 sec 10764 sec
14974 sec 8945 sec 8866 sec
13022 sec 7291 sec 7000 sec
14387 sec 8744 sec 7947 sec
S(N ) 1.0 1.85 1.7
For both architectures, the optimal speed-up of 2 is close to being achieved. For the Pentium Pro cluster, the communication penalty is larger, and this favors the usage of fewer processors and more groups. For the SGI Origin 2000, the situation is different: communication only involves memory access (shared memory), and parallelization remains
/ Y
...f.---""BezierjSpline ......................... .....................................
~::'z'~'r'~::.::':
...................................................................
--__~ :> •
A
Figure 2. Geometry of multi-element airfoil including slat, main body and flap (shape and position definition)
efficient as the number of processors increases; moreover, additional gains are achieved due to the larger impact of cache memory when subdomains are small. 1.4. High-lift m u l t i - e l e m e n t airfoil o p t i m i z a t i o n by G A s a n d m u l t i - a g e n t s t r a t e gies In this section, we report on numerical experiments conducted to optimize the configuration (shape and position) of a high-lift multi-element airfoil by both conventional GAs and more novel ones, based on multi-agent strategies better fit to parallel computations. These experiments have been also described in [7] in part. The increased performance requirements for high-lift systems as well as the availability of (GA-based) optimization methods tend to renew the emphasis on multi-element aerodynamics. High-lift systems, as depicted on Figure 2, consist of a leading-edge device (slat) whose effect is to delay stall angle, and a trailing-edge device (flap) to increase the lift while maintaining a high L/D ratio. The lift coemcient CL of such an airfoil is very sensitive to the flow features around each element and its relative position to the main body; in particular, the location of the separation point can change rapidly due to the wake/boundary-layer interaction. As a result, the functional is non-convex and presents several local optima, making use of a robust algorithm necessary to a successful optimization. Here, the 2D flow computation is conducted by the Dassault-Aviation code "Damien" which combines an inviscid flow calculation by a panel method with a wake/boundarylayer interaction evaluation. This code, which incorporates data concerning transition criteria, separated zones, and wake/boundary-layer interaction, has been thoroughly calibrated by assessments and global validation through comparisons with ONERA windtunnel measurements. As a result, viscous effects can be computed, and this provides at a very low cost a fairly accurate determination of the aerodynamics coefficients. As a
100
,
-
9
-SO ,
|
..
-100
200
5O
0
9
-100
100
,
......
" -
-
"T""
................ .
0
150
----:/
,
9
I
50
1
t
..... "
I
,,
0
'
'
!
100-
~
,
~
..........
O
e
0
0
9
0
I
!
a
200
300
400
~
'
0 I .......
:.....~~
O
,
-
........
. . . . . .
!
_
500
~
__
6110
-~
'~
-
1 ...................................
................. ]
-5O -TOO -t50 -200
Figure 3. Initial (top) and final (bottom) configurations in position optimization problem
counterpart of this simplified numerical model, the flow solver is non differentiable and can only be treated as a black box in the optimization process. Evolutionary algorithms are thus a natural choice to conduct the optimization in a situation of this type. Figure 3 relates to a first experiment in which only the 6 parameters determining the positions relative to the main body (deflection angle, overlap and gap) of the two high-lift devices (slat and flap) have been optimized by a conventional GA, similar to the one of the previous section. Provided reasonable ranges are given for each parameter, an optimum solution is successfully found by the GA, corresponding to an improved lift coefficient of 4.87. The second experiment is the first step in the optimization of the entire configuration consisting of the shapes of the three elements and both positions of slat and flap. More precisely, only the inverse problem consisting in reconstructing the pressure field is considered presently. If 7s, 7B and 7F denote the design variables associated with the slat, main body and flap respectively, and 7 = (Ts, 7B, 7F) represents a candidate configuration, one minimizes the following functional:
J(7) - 4 ( 7 ) + J . ( 7 ) + J.(7)
(2)
in which, for example:
a~
(3)
l0 is a positive integral extending over one element only (slat) but reflecting in the integrand the interaction of the whole set of design variables. Here, Pt is the given target pressure; similar definitions are made for JB(3') and JR(')'). In order to reduce the number of design variables and enforce smoothness of the geometry, shapes are represented by B~zier curves. This inverse problem has been solved successfully by the GA first by a "global" algorithm in which the chromosomes contain the coded information associated with all design variables indiscriminately. The convergence history of this experiment is indicated on Figure 4 for the first 200 generations. An alternative to this approach is provided by optimization algorithms based on (pseudo) Nash strategies in which the design variables are a priori partitioned into appropriate subsets. The population of a given subset evolves independently at each new generation according to its own GA, with the remaining design variables being held fixed and equal to the best elements in their respective populations found at the previous generation. Evidently, in such an algorithm, the computational task of the different GAs, or "players" to use a term of game theory, can be performed in parallel [11]. Many different groupings of the design variables can be considered, two of which are illustrated on Figure 4: a 3-player strategy (slat shape and position; body shape; flap shape and position) and a 5-player strategy (slat, body and flap shapes, slat and flap positions). The two algorithms achieve about the same global optimum, but the parameters of the GAs (population size) have been adjusted so that the endpoints of the two convergence paths correspond to the same number of functional evaluations as 100 generations of the global algorithm. Thus, this experiment simulates a comparison of algorithms at equal "serial" cost. This demonstrates the effectiveness of the multi-agent approach, which achieves the global optimum by computations that could evidently be performed in parallel. We terminate this section by two remarks concerning Nash strategies. First, in a preliminary experiment related to the slat/flap position inverse problem, a 2-player game in which the main body (TB) is fixed, an attempt was made to let the population of 7s evolve at each generation according to an independent GA minimizing the partial cost functional js(Ts) = Js (~/s, 7B, 5'F) only, (the flap design variables being held fixed to the best element 7F found at the previous generation,) and symmetrically for 7F being driven by jF(TF) -- JF (~s, ")'B, ")IF). Figure 5 indicates that in such a case, the algorithm fails to achieve the desired global optimum. Second, observe that in the case of a general cost function, a Nash equilibrium in which a minimum is found with respect to each subgroup of variables, the other variables being held fixed, does not necessarily realize a global minimum. For example, in the trivial case of a function f(x, y) of two real variables, a standard situation in which the partial functions:
= f ( x , y*),
r
(4)
= f(z*, y)
achieve local minima at x* and y* respectively, is realized typically when:
r
= =
y*) = o , y*) = o ,
r r
= =
y*) > o y*) > o
(5)
11 0.014
0.012
0.01
-7~i~. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
. ......................................................
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .N. .a. s. .h. . . 3. . . p. .i.@ . . .e.r. s. . . . . . . . . . . . . . . . . . . . . . . Global inverse problem
. ..........
: ,
0.008
0.006
',
P~pulation
.
.
.
.
.
.
,~ize : 3 0
iiili~..... §...........i........... '..........~>:iv~:.~:i~:~:~-.,-p~ ~:-.-:!~:~............................................... :-i
0.004
-]......... :.... Globd:-. . . . . . . . . i .................................................
~,:;~i~:~:
:
"
:
:iN'ash ~ p l . a y ~ s
t
N--i:W:.......... :........... !........ 74~;a~;;;.-a~-h;~Kiiia:~i,~4h-~ ..... i.......... !........... ~......... i;~,~i::~ i i : ! 'i' ? i i
0.002 :
',
,,
20
40
60
,
80 Number
1 O0
120
140
160
180
200
of generations
Figure 4. Optimization of shape and position design variables by various strategies all involving the same number of cost functional evaluations
and this does not imply t h a t the Hessian matrix be positive definite. However, in the case of an inverse problem in which each individual positive component of the cost function is driven to 0, the global o p t i m u m is indeed achieved. 2. P A R A L L E L 2.1.
MULTIGRID
ACCELERATION
Introduction
Clearly, reducing the time spent in flow calculations (to the m i n i m u m ) is crucial to make GAs a viable alternative to other optimization techniques. One possible strategy to achieve this goal consists in using a multigrid m e t h o d to accelerate the solution of the linear systems resulting from the linearized implicit time integration scheme. As a first step, we have developed parallel linear multigrid algorithms for the acceleration of compressible steady flow calculations, independently of the optimization framework. This is justified by the fact that the flow solver is mainly used as a black box by the GA. The starting point consists of an existing flow solver based on the averaged compressible Navier-Stokes equations, coupled with a k - c turbulence model [2]. The spatial discretization combines finite element and finite volume concepts and is designed for u n s t r u c t u r e d triangular meshes. Steady state solutions of the resulting semi-discrete equations are obtained by using an Euler implicit time advancing strategy which has the following features: linearization (approximate linearization of the convective fluxes and exact differentiation of the viscous terms); preconditioning (the Jacobian m a t r i x is based on a first-order Godunov
12 Convergence comparison for the fitness
0.014
def'mition
i ...... ~ .......
0.012
---~(
........
,- . . . . . . . . . . . . .
-, . . . . . . . . . . . . . .
!
0.0
~'"
i
~
9
Convergence Convergence Convergence Convergence
i
i
slat position, D2 flap position, D2 slat position, D 1 flap position, D1
iiiiiiiiiiiiiii iiiiii!iiii
0,008I-
............
i
iiiiiiiiiiiiii iiiiiiiiiiiii"
0.006
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
o.oo~
............
,: .............
~..............
i ............
i ................
i::~:;.,i
o
..............
. . . . .
0
0
5
10
15
20
T ..............
25
i
.
. . . . . . . . . . . . . .
30
.
.
.
.
.
.
.
.
.
.
.
.
. . . . . . . . . .
35
4[
Number of generations
Figure 5. Effect of the fitness definition on the convergence of slat/flap position parameters
scheme); and local time stepping and CFL law (a local time step is computed on each control volume). Each pseudo time step requires the solution of two sparse linear systerns (respectively, for the mean flow variables and for the variables associated with the turbulence model). The multigrid strategy is adopted to gain efficiency in the solution of the two subsysterns. In the present method, the coarse-grid approximation is based on the construction of macro-elements, more specifically macro control-volumes by "volume-agglomeration". Starting from the finest mesh, a "greedy" coarsening algorithm is applied to generate automatically the coarse discretizations (see Lallemand et al. [5]). Parallelism is introduced in the overall flow solver by using a strategy that combines mesh partitioning techniques and a message passing programming model. The MPI environment is used for the implementation of the required communication steps. Both the discrete fluxes calculation and the linear systems solution are performed on a submesh basis; in particular, for the basic linear multigrid algorithm which is multiplicative (i.e. the different levels are treated in sequence with inter-dependencies between the partial results produced on the different levels), this can be viewed as an intra-level parallelization which concentrates on the smoothing steps performed on each member of the grid hierarchy, A necessary and important step in this adaptation was the construction of appropriate data structures for the distribution of coarse grid calculations. Here, this has been achieved by developing a parallel variant of the original "greedy" type coarsening al-
13 gorithm, which now includes additional communication steps for a coherent construction of the communication data structures on the partitioned coarse grids. 2.2. L a m i n a r f l o w a r o u n d a NACA0012 a i r f o i l
The test case under consideration is given by the external flow around a NACA0012 airfoil at a freestream Mach number of 0.8, a Reynolds number equal to 73 and an angle of incidence of 10 ~. The underlying mesh contains 194480 vertices and 387584 triangles. We are interested here in comparing the single grid and the multigrid approaches when solving the steady laminar viscous flow under consideration. Concerning the single grid algorithm, the objective is to choose appropriate values for the number of relaxation steps and the tolerance on the linear residual so that a good compromise is obtained between the number of non-linear iterations (pseudo time steps) to convergence and the corresponding elapsed time. For both algorithms, the time step is calculated according to CFL=rnin(500 • it, 106) where it denotes the non-linear iteration. Table 4 compares results of various simulations performed on a 12 nodes Pentium Pro cluster. In this table, oo means that the number of fine mesh Jacobi relaxations (~f) or the number of multigrid V-cycles (Nc) has been set to an arbitrary large value such that the linear solution is driven until the prescribed residual reduction (c) is attained; ~1 and ~2 denote the number of pre- and post-smothing steps (Jacobi relaxations) when using the multigrid algorithm. We observe that the non-linear convergence of the single grid is optimal when driving the linear solution to a two decade reduction of the normalized linear residual. However, the corresponding elapsed time is minimized when fixing the number of fine mesh relaxations to 400. However, one V-cycle with 4 pre- and post-smoothing steps is sufficient for an optimal convergence of the multigrid algorithm. Comparing the two entries of Table 4 corresponding to the case c = 10 -1, it is seen that the multigrid algorithm yields a non-linear convergence in 117 time steps instead of 125 time steps for the single grid algorithm. This indicates that when the requirement on linear convergence is the same, the multigrid non-linear solution demonstrates somewhat better convergence due to a more uniform treatment of the frequency spectrum. We conclude by noting that the multigrid algorithm is about 16 times faster than the single grid algorithm for the present test case, which involves about 0.76 million unknowns.
Table 4 Simulations on a F a s t E t h e r n e t Pentium Pro cluster: Np = 1 2 Ng
Nc
llf
1 1 1 1 1 Ng 6 6 6
Nc c~ c~ 1
oo c~ 350 400 450
[pl, p2] 4/4 4/4 4/4
s
Niter
10 -1 10 -2 10 -1~ 10-1~ 10 -1~
125 117 178 157 142 Niter 117 116 117
s 10 -1 10 -2 10 -1~
Elapsed 9 h 28 mn 9h40mn 9h10mn 9 h 06 mn 9h28mn Elapsed 57 mn 1h56mn 33 mn
CPU 8 h 24 mn 8h48mn 8h17mn 8 h 14 mn 8h20mn CPU 50 mn 1h42mn 29 mn
% CPU 88 91 90 90 88 %CPU 88 88 87
14 3. C O N C L U S I O N S A N D P E R S P E C T I V E S Cost-efficient solutions to the Navier-Stokes equations have been computed by means of (multiplicative) multigrid algorithms made parallel via domain-decomposition methods (DDM) based on mesh partitioning. Current research is focused on additive formulations in which the (fine-grid) residual equations are split into a high-frequency and a lowfrequency subproblems that are solved simultaneously, the communication cost also being reduced (since longer vectors are transferred at fewer communication steps). Genetic algorithms have been shown to be very robust in complex optimization problems such as shape design problems in aerodynamics. In their base formulation, these algorithms may be very costly since they rely on functional evaluations only. As a counterpart, their formulation is very well suited for several forms of parallel computing by ('i) DDM in the flow solver; 5i) grouping the fitness function evaluations; (iii) considering subpopulations evolving independently and migrating information regularly [11] (not shown here); (iv) elaborating adequate multi-agent strategies based on game theory. Consequently, great prospects are foreseen for evolutionary optimization in the context of high-performance computing.
REFERENCES
1. E. Cantfi-Paz. A summary of research on parallel genetic algorithms. Technical Report 95007, IlliGAL Report, University of Illinois at Urbana-Champaign, Illinois Genetic Algorithms Laboratory, Urbana, (1995). 2. G. CarrY. An implicit multigrid method by agglomeration applied to turbulent flows. Computers ~ Fluids, (26):299-320, (1997). 3. D.E. Goldberg. Genetic algorithms in search, optimization and machine learning. Addison-Wesley Company Inc., (1989). 4. H. Kuiper, A.J. Van der Wees, C.F. Hendriks, and T.E. Labrujere. Application of genetic algorithms to the design of airfoil pressure distribution. NLR Technical publication TP95342L for the ECARP European Project. 5. M.-H. Lallemand, H. Steve, and A. Dervieux. Unstructured multigridding by volume agglomeration : current status. Computers ~ Fluids, (21):397-433, (1992). 6. S. Lanteri. Parallel solutions of compressible flows using overlapping and nonoverlapping mesh partitioning strategies. Parallel Comput., 22:943-968, (1996). 7. B. Mantel, J. P~riaux, M. Sefrioui, B. Stoufflet, J.A. D~sid~ri, S. Lanteri, and N. Marco. Evolutionary computational methods for complex design in aerodynamics.
AIAA 98-0222. 8. S. Obayashi and A. Oyama. Three-dimensional aerodynamic optimization with genetic algorithm. In J.-A. D~sid~ri et al., editor, Computational Fluid Dynamics '96, pages 420-424. J. Wiley & Sons, (1996). 9. J. P~riaux, M. Sefrioui, B. Stouifiet, B. Mantel, and E. Laporte. Robust genetic algorithms for optimization problems in aerodynamic design. In G. Winter et. al., editor, Genetic algorithms in engineering and computer science, pages 371-396. John Wiley & Sons, (1995). 10. D. Quagliarella. Genetic algorithms applications in computational fluid dynamics.
15
In G. Winter et. al., editor, Genetic algorithms in engineering and computer science, pages 417-442. John Wiley & Sons, (1995). 11. M. Sefrioui. Algorithmes Evolutionnaires pour le calcul scientifique. Application l'~lectromagn~tisme e t ~ la m~canique des fluides num~riques. Th~se de doctorat de l'Universitd de Paris 6 (Spdcialitd : Informatique), 1998.
This Page Intentionally Left Blank
Parallel ComputationalFluid Dynamics Towards Teraflops, Optimizationand Novel Formulations D. Keyes,A. Ecer, J. Periaux,N. Satofukaand P. Fox (Editors) 2000 Elsevier ScienceB.V. All rightsreserved.
High-Performance
17
Spectral Element Algorithms and Implementations*
Paul F. Fischer a and Henry M. Tufo b ~Mathematics and Computer Science Division, Argonne National Laboratory, Argonne IL, 60439 USA f i s c h e r ~ m c s , a n l . gov (http://www.mcs.anl.gov/~fischer) bDepartment of Computer Science, The University of Chicago, 1100 East 58th Street, Ryerson 152, Chicago, IL 60637 USA hmt 9 u c h i c a g o , edu (http://www.mcs.anl.gov/~tufo) We describe the development and implementation of a spectral element code for multimillion gridpoint simulations of incompressible flows in general two- and three-dimensional domains. Parallel performance is presented on up to 2048 nodes of the Intel ASCI-Red machine at Sandia National Laboratories. 1. I N T R O D U C T I O N We consider numerical solution of the unsteady incompressible Navier-Stokes equations, 0u
0t
+ u. Vu - -Vp
1
2
+ - a - V u, /le
-V-u
= 0,
coupled with appropriate boundary conditions on the velocity, u. We are developing a spectral element code to solve these equations on modern large-scale parallel platforms featuring cache-based nodes. As illustrated in Fig. 1, the code is being used with a number of outside collaborators to address challenging problems in fluid mechanics and heat transfer, including the generation of hairpin vortices resulting from the interaction of a flat-plate boundary layer with a hemispherical roughness element; modeling the geophysical fluid flow cell space laboratory experiment of buoyant convection in a rotating hemispherical shell; Rayleigh-Taylor instabilities; flow in a carotid artery; and forced convective heat transfer in grooved-flat channels. This paper discusses some of the critical algorithmic and implementation features of our numerical approach that have led to efficient simulation of these problems on modern parallel architectures. Section 2 gives a brief overview of the spectral element discretization. Section 3 discusses components of the time advancement procedure, including a projection method and parallel coarse-grid solver, which are applicable to other problem *This work was supported by the Mathematical, Information, and Computational Sciences Division subprogram of the Office of Advanced Scientific Computing Research, U.S. Department of Energy, under Contract W-31-109-Eng-38; by the Department of Energy under Grant No. B341495 to the Center on Astrophysical Thermonuclear Flashes at University of Chicago; and by the University of Chicago.
18
Figure 1. Recent spectral element simulations. To the right, from the top: hairpin vortex generation in wake of hemispherical roughness element (Re~ = 850); spherical convection simulation of the geophysical fluid flow cell at Ra = 1.1 x 105, Ta = 1.4 x 108; twodimensional Rayleigh-Taylor instability; flow in a carotid artery; and temporal-spatial evolution of convective instability in heat-transfer augmentation simulations.
classes and discretizations. Section 4 presents performance results, and Section 5 gives a brief conclusion. 2. S P E C T R A L
ELEMENT
DISCRETIZATION
The spectral element method is a high-order weighted residual technique developed by Patera and coworkers in the '80s that couples the tensor product efficiency of global spectral methods with the geometric flexibility of finite elements [9,11]. Locally, the mesh is structured, with the solution, data, and geometry expressed as sums of Nth-order tensor product Lagrange polynomials based on the Gauss or Gauss-Lobatto (GL) quadrature points. Globally, the mesh is an unstructured array of 14 deformed hexahedral elements and can include geometrically nonconforming elements. The discretization is illustrated in Fig. 2, which shows a mesh in IR2 for the case ( K , N ) = (3,4). Also shown is the reference (r, s) coordinate system used for all function evaluations. The use of the GL basis for the interpolants leads to efficient quadrature for the weighted residual schemes and greatly simplifies operator evaluation for deformed elements.
19
r
9T c~.
Figure 2. Spectral element discretization in 1R2 showing GL nodal lines for (K, N) = (3, 4).
For problems having smooth solutions, such as the incompressible Navier-Stokes equations, exponential convergence is obtained with increasing N, despite the fact that only C o continuity is enforced across elemental interfaces. This is demonstrated in Table 1, which shows the computed growth rates when a small-amplitude Tollmien-Schlichting wave is superimposed on plane Poiseuille channel flow at Re = 7500, following [6]. The amplitude of the perturbation is 10 -s, implying that the nonlinear Navier-Stokes results can be compared with linear theory to about five significant digits. Three error measures are computed: errorl and error2 are the relative amplitude errors at the end of the first and second periods, respectively, and erro% is the error in the growth rate at a convective time of 50. From Table 1, it is clear that doubling the number of points in each spatial direction yields several orders of magnitude reduction in error, implying that only a small increase in resolution is required for very good accuracy. This is particularly significant because, in three dimensions, the effect on the number of gridpoints scales as the cube of the relative savings in resolution.
Table 1 Spatial convergence, Orr-Summerfeld problem: K = 15, At = .003125 N 7 9 11 13 15
E(tl) 1.11498657 1.11519192 1.11910382 1.11896714 1.11895646
errOrl 0.003963 0.003758 0.000153 0.000016 0.000006
E(t2) 1.21465285 1.24838788 1.25303597 1.25205855 1.25206398
error2
errorg
0.037396 0.003661 0.000986 0.000009 0.000014
0.313602 0.001820 0.004407 0.000097 0.000041
20 The computational efficiency of spectral element methods derives from their use of tensor-product forms. Functions in the mapped coordinates are expressed as N
u(xk( r, s))lak -
N
E E u~jhN(r)h~(s),
(1)
i=0 j = 0
where u~j is the nodal basis coefficient; h N is the Lagrange polynomial of degree N based on the GL quadrature points, {~y}N=0; and xk(r, s) is the coordinate mapping from the reference domain, Ft "- [-1, 1]2, to f~k. With this basis, the stiffness matrix for an undeformed element k in IR2 can be written as a tensor-product sum of one-dimensional operators,
Ak -
By @~--~+
(2)
/iy |
where A. and/3, are the one-dimensional stiffness and mass matrices associated with the respective spatial dimensions. If _uk _ uijk is the matrix of nodal values on element k, then a typical matrix-vector product required of an iterative solver takes the form N
(Akuk)im -
-_
N
~ E
^
^
k +
mj
(3)
i=O j = O __
~ _ "'T --~ukBy
+
"
_
A~
.
Similar forms result for other operators and for complex geometries. The latter form illustrates how the tensor-product basis leads to matrix-vector products (Au__)being recast as matrix-matrix products, a feature central to the efficiency of spectral element methods. These typically account for roughly 90% of the work and are usually implemented with calls to D G E M M , unless hand-unrolled F77 loops prove faster on a given platform. Global matrix products, Au__,also require a gather-scatter step to assemble the elemental contributions. Since all data is stored on an element-by-element basis, this amounts to summing nodal values shared by adjacent elements and redistributing the sums to the nodes. Our parallel implementation follows the standard message-passing-based SPMD model in which contiguous groups of elements are distributed to processors and data on shared interfaces is exchanged and summed. A stand-alone MPI-based utility has been developed for these operations. It has an easy-to-use interface requiring only two calls: handle=gs_init(global_node_numbers, n)
and
i e r r = g s - o p ( u , op, handle),
where global-node-numbers 0 associates the n local values contained in the vector u 0 with their global counterparts, and op denotes the reduction operation performed on shared elements of u() [14]. The utility supports a general set of commutative/associative operations as well as a vector mode for problems having multiple degrees of freedom per vertex. Communication overhead is further reduced through the use of a recursive spectral bisection based element partitioning scheme to minimize the number of vertices shared among processors [12].
21
3.
TIME
ADVANCEMENT
AND
SOLVERS
The Navier-Stokes time-stepping is based on the second-order operator splitting methods developed in [1,10]. The convective term is expressed as a material derivative, and the resultant form is discretized using a stable second-order backward difference formula fin-2 _ 4fi_n-1 _~_3u n = S(u~),
2At where S(u n) is the linear symmetric Stokes problem to be solved implicitly, and fi___~-qis the velocity field at time step n - q computed as the explicit solution to a pure convection problem. The subintegration of the convection term permits values of At corresponding to convective CFL numbers of 2-5, thus significantly reducing the number of (expensive) Stokes solves. The Stokes problem is of the form H
-D
T
un
_
-D
(
_
0-)
and is also treated by second-order splitting, resulting in subproblems of the form
H~j - s
EF
- g/,
for the velocity components, u~, (i - 1 , . . . , 3 ) , and pressure, pn. Here, H is a diagonally dominant Helmholtz operator representing the parabolic component of the momentum equations and is readily treated by Jacobi-preconditioned conjugate gradients; E := D B - 1 D T is the Stokes Schur complement governing the pressure; and B is the (diagonal) velocity mass matrix. E is a consistent Poisson operator and is effectively preconditioned by using the overlapping additive Schwarz procedure of Dryja and Widlund [2,6,7]. In addition, a high-quality initial guess is generated at each step by projecting the solution onto the space of previous solutions. The projection procedure is summarized in the following steps: l
(~)
--
p_ -
X;
~,~_,, ~, -
~r
n
f_, g_ 9
i=1
(ii)
Solve 9 E A p -
g~ - E~ l
(~)
to tolerance ~.
(4)
l
g+, - (zXp_- E 9&)/ll/V_- E 9&ll~, ~ - ~_TEzXp_. i=1
i=1
The first step computes an initial guess, ~, as a projection of pn in the E-norm (IIP__I[E"= (p__TEp_)89 onto an existing basis, (~1"'" ,~)" The second computes the remaining (orthogonal) perturbation, Ap_, to a specified absolute tolerance, e. The third augments the approximation space with the most recent (orthonormalized) solution. The approximation space is restarted once 1 > L by setting ~1 "-- P~/IIP~IIE" The projection scheme (steps (i) and (iii)) requires two matrix-vector products per timestep, one in step (ii) and one in step (iii). (Note that it's not possible to use gff - Ep__in place of E A p in (iii) because (ii) is satisfied only to within e.)
22
Spherical
Bouyant Convection.
Spher'|ca| B o u y a n t
n=1658880
60 l , i , , l i , , i l i , i , l i , , , l i , J , l i i l , l , , , , l i i i , l i i i i l i i , , l , , , i l , , , , l i , , , l , , ,
Cenvectlon,
n=165888~
_,,,llVlV,l,,l,l,,,,l,llllll,,l,,,,ll,,,ll,l,l,l|,l,,,,
I , ,,11,,,I,,I
L=0 35 30
10-
25
L - 26
L - 26 0
5
10
15
20
25
30
35
S t e p Number
40 -
45
50
55
60
~
65
70
4~I-'
0
llllllllllllllllllllllll,ll,ll,llll,llll,lll,lll|ll,
5
10
15
20
25
30
35
S t e p Number
40 -
45
50
lllll,lll'llll,
55
60
65
70
m
Figure 3. Iteration count (left) and residual history (right) with and without projection for the 1,658,880 degree-of-freedom pressure system associated with the spherical convection problem of Fig. 1.
As shown in [4], the projection procedure can be extended to any parameter-dependent problem and has many desirable properties. It can be coupled with any iterative solver, which is treated as a black box (4ii). It gives the best fit in the space of prior solutions and is therefore superior to extrapolation. It converges rapidly, with the magnitude of the perturbation scaling as O(At ~)+ O(e). The classical Gram-Schmidt procedure is observed to be stable and has low communication requirements because the inner products for the basis coefficients can be computed in concert. Under normal production tolerances, the projection technique yields a two- to fourfold reduction in work. This is illustrated in Fig. 3, which shows the reduction in residual and iteration count for the buoyancydriven spherical convection problem of Fig. 1, computed with K = 7680 elements of order N = 7 (1,658,880 pressure degrees of freedom). The iteration count is reduced by a factor of 2.5 to 5 over the unprojected (L = 0) case, and the initial residual is reduced by two-and-one-half orders of magnitude. The perturbed problem (4ii) is solved using conjugate gradients, preconditioned by an additive overlapping Schwarz method [2] developed in [6,7]. The preconditioner, K M-1 "- RTAo 1Ro + E RT-4k-IRk, k=l
requires a local solve (~;1) for each (overlapping)subdomain, plus a global solve (Ao 1) for a coarse-grid problem based on the mesh of spectral element vertices. The operators R k and R T are simply Boolean restriction and prolongation matrices that map data between the global and local representations, while R 0 and RoT map between the fine and coarse grids. The method is naturally parallel because the subdomain problems can be solved
23
independently. Parallelization of the coarse-grid component is less trivial and is discussed below. The local subdomain solves exploit the tensor product basis of the spectral element method. Elements are extended by a single gridpoint in each of the directions normal to their boundaries. Bilinear finite element Laplacians, Ak, and lumped mass matrices,/)k, are constructed on each extended element, hk, in a form similar to (2). The tensor-product construction allows the inverse of ~-1 to be expressed as
~1
__ (Sy @ Sx)[I @ A:c "1I" A u @ I]-I(sT @ sT),
where S, is the matrix of eigenvectors, and A, the diagonal matrix of eigenvalues, solving the generalized eigenvalue problem A,z__ = A/),z_ associated with each respective spatial direction. The complexity of the local solves is consequently of the same order as the matrix-vector product evaluation (O(KN 3) storage and O ( K N 4) work in IR3) and can be implemented as in (3) using fast matrix-matrix product routines. While the tensor product form (2) is not strictly applicable to deformed elements, it suffices for preconditioning purposes to build Ak on a rectilinear domain of roughly the same dimensions as ~k [7]. The coarse-grid problem, z_ = Aol_b, is central to the efficiency of the overlapping Schwarz procedure, resulting in an eightfold decrease in iteration count in model problems considered in [6,7]. It is also a well-known source of difficulty on large distributedmemory architectures because the solution and data are distributed vectors, while Ao 1 is completely full, implying a need for all-to-all communication [3,8]. Moreover, because there is very little work on the coarse grid (typ. O(1) d.o.f, per processor), the problem is communication intensive. We have recently developed a fast coarse-grid solution algorithm that readily extends to thousands of processors [5,13]. If A 0 E IRnxn is symmetric positive definite, and X := (2_1,..., 2__~) is a matrix of A0-orthonormal vectors satisfying 2_/TA0~j -- 5ij, then the coarse-grid solution is computed as 9-
XX
b_,
(5)
i=1
Since ~_ is the best fit in Tg(X) - IRn, we have 2 _ - z_ and X X T - Ao 1. The projection procedure (5) is similar to (4/), save that the basis vectors {~i} are chosen to be sparse. Such sparse sets can be readily found by recognizing that, for any gridpoint i exterior to the stencil of j, there exist a pair of A0-conjugate unit vectors, ~i and _~j. For example, for a regular n-point mesh in IR2 discretized with a standard five-point stencil, one can immediately identify half of the unit vectors in 1Rn (e.g., those associated with the "red" squares) as unnormalized elements of X. The remainder of X can be created by applying Gram-Schmidt orthogonalization to the remainder of IR~. In [5,13], it is shown that nested dissection provides a systematic approach to identifying a sparse basis and yields a factorization of Ao I with O(na~ ~-) nonzeros for n-point grid problems in IRd, d _> 2. Moreover, the required communication volume on a P-processor machine is bounded by 3 n @ log 2 P, a clear gain over the O(n) or O(n log 2 P) costs incurred by other commonly employed approaches. The performance of the X X y scheme on ASCI-Red is illustrated in Fig. 4 for a (63 x 63) and (127 • 127) point Poisson problem (n - 3069 and n - 16129, respectively) discretized by a standard five-point stencil. Also shown are the times for the commonly used approaches of redundant banded-LU solves and row-distributed Ao 1. The latency,21og P
24 I
i
i
i
2-
2
le-O1
i Red.
LU -
le-O1 -
5
5
2 2 le-02
xx T le-02
5
5
2 le-03 i 2 le-04
'I
le-03
latency
5latency * 2log(P)
_
I
l
/
5
2
J
le-04 2
_
le-05
_
5
_
,
.
latency
5
latency
,
* 2log(P)
.
.
P
.
.
.
.
.
,
P
Figure 4. ASCI-Red solve times for a 3969 (left) and 16129 (right) d.o.f, coarse grid problem.
curve represents a lower bound on the solution time, assuming that the required all-to-M1 communication uses a contention free fan-in/fan-out binary tree routing. We see that the X X T solution time decreases until the number of processors is roughly 16 for the n = 3969 case, and 256 for the n = 16129 case. Above this, it starts to track the latency curve, offset by a finite amount corresponding to the bandwidth cost. We note that X X T approach is superior to the distributed Ao 1 approach from a work and communication standpoint, as witnessed by the substantially lower solution times in each of the workand communication-dominated regimes.
4. P E R F O R M A N C E RESULTS We have run our spectral element code on a number of distributed-memory platforms, including the Paragon at Caltech, T3E-600 at NASA Goddard, Origin2000 and SP at Argonne, ASCI-Blue at Los Alamos, and ASCI-Red at Sandia. We present recent timing results obtained using up to 2048 nodes of ASCI-Red. Each node on ASCI-Red consist of two Zeon 333 MHz Pentium II processors which can be run in single- and dual-processor mode. The dual mode is exploited for the matrix-vector products associated with H, E, and z~lk1 by partitioning the element lists on each node into two parts and looping through these independently on each of the processors. The timing results presented are for the time-stepping portion of the runs only. During production runs, usually 14 to 24 hours in length, our setup and I/O costs are typically in the range of 2-5%. The test problem is the transitional boundary layer/hemisphere calculation of Fig. 1 at Re~ = 1600, using a Blasius profile of thickness ~ = 1.2R as an initial condition. The mesh is an oct-refinement of the production mesh with (K, N) = (8168, 15) corresponding to 27,799,110 grid points for velocity and 22,412,992 for pressure.
25 I
I
I
I
I
I
I
400
3.0f/ -
-~ .0
150 100
f . . . . . ., . . . . . . . . . .,. . . . . . . . . .,. ........................... ..... , .
~;
1;
1;
Step
210
25
5
10
15
.
20
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
,_ 25
Step
Figure 5. P - 2048 ASCI-Red-333 dual-processor mode results for the first 26 time steps for (K, N) - (8168, 15)" solution time per step (left) and number of pressure and (z-component) Helmholtz iterations per step (right).
Figure 5 shows the time per step (left) and the iteration counts for the pressure and (xcomponent) Helmholtz solves (right) over the first 26 timesteps. The significant reduction in pressure iteration count is due to the difficulty of computing the initial transients and the benefits gained from the pressure projection procedure. Table 2 presents the total time and sustained performance for the 26 timesteps using a combination of unrolled f77 loops and assembly-coded D G E M M routines. Two versions of D G E M M were considered: the standard version (csmath), and a specially tuned version (perf) written by Greg Henry at Intel. We note that the average time per step for the last five steps of the 361 GF run is 15.7 seconds. Finally, the coarse grid for this problem has 10,142 distributed degrees of freedom and accounts for 4.0% of the total solution time in the worst-case scenario of 2048 nodes in dual-processor mode.
Table 2 ASCI-Red-333" total time and GFLOPS, K - 8168, N - 15. Single (csmath) Dual (csmath) Single (perf) Dual (perf) P Time(s) GFLOPS Time(s) GFLOPS Time(s) GFLOPS Time (s) G F L O P S 512 6361 47 4410 67 4537 65 3131 94 1024 3163 93 2183 135 2242 132 1545 191 2048 1617 183 1106 267 1148 257 819 361
26 5. C O N C L U S I O N We have developed a highly accurate spectral element code based on scalable solver technology that exhibits excellent parallel efficiency and sustains high MFLOPS. It attains exponential convergence, allows a convective CFL of 2-5, and has efficient multilevel elliptic solvers including a coarse-grid solver with low communication requirements. REFERENCES
1. J. BLAIR PEROT, "An analysis of the fractional step method", J. Comput. Phys., 108, pp. 51-58 (1993). 2. M. DRYJA AND O. B. WIDLUND, "An additive variant of the Schwarz alternating method for the case of many subregions", Tech. Rep. 339, Dept. Comp. Sci., Courant Inst., NYU (1987). 3. C. FARHAT AND P. S. CHEN, "Tailoring domain decomposition methods for efficient parallel coarse grid solution and for systems with many right hand sides", Contemporary Math., 180, pp. 401-406 (1994). 4. P . F . FISCHER, "Projection techniques for iterative solution of Ax - __bwith successive right-hand sides", Comp. Meth. in Appl. Mech., 163 pp. 193-204 (1998). 5. P. F. FISCHER, "Parallel multi-level solvers for spectral element methods", in Proc. Intl. Conf. on Spectral and High-Order Methods '95, Houston, TX, A. V. Ilin and L. R. Scott, eds., Houston J. Math., pp. 595-604 (1996). 6. P . F . FISCHER, "An overlapping Schwarz method for spectral element solution of the incompressible Navier-Stokes equations", J. of Comp. Phys., 133, pp. 84-101 (1997). 7. P . F . FISCHER, N. I. MILLER, AND H. ~/i. TUFO, "An overlapping Schwarz method for spectral element simulation of three-dimensional incompressible flows," in Parallel Solution of Partial Differential Equations, P. Bjrstad and M. Luskin, eds., SpringerVerlag, pp. 159-181 (2000). 8. W. D. GRoPP,"Parallel Computing and Domain Decomposition", in Fifth Conf. on Domain Decomposition Methods for Partial Differential Equations, T. F. Chan, D. E. Keyes, G. A. Meurant, J. S. Scroggs, and R. G. Voigt, eds., SIAM, Philadelphia, pp. 349-361 (1992). 9. Y. ~/IADAY AND A. T. PATERA, "Spectral element methods for the Navier-Stokes equations", in State of the Art Surveys in Computational Mechanics, A. K. Noor, ed., ASME, New York, pp. 71-143 (1989). 10. Y. 1VIADAY,A. T. PATERA, AND E. 1VI. RONQUIST, "An operator-integration-factor splitting method for time-dependent problems" application to incompressible fluid flow", J. Sci. Comput., 5(4), pp. 263-292 (1990). 11. A. T. PATERA, "A spectral element method for fluid dynamics: Laminar flow in a channel expansion", J. Comput. Phys., 54, pp. 468-488 (1984). 12. A. POTHEN, H. D. SIMON, AND K. P. LIOU, "Partitioning sparse matrices with eigenvectors of graphs", SIAM J. Matrix Anal. Appl., 11 (3) pp. 430-452 (1990). 13. H. M. TUFO AND P. F. FISCHER, "Fast parallel direct solvers for coarse-grid problems", J. Dist. Par. Comp. (to appear). 14. H. M. TUFO, "Algorithms for large-scale parallel simulation of unsteady incompressible flows in three-dimensional complex geometries", Ph.D. Thesis, Brown University (1998).
Parallel Computational Fluid Dynamics Towards Teraflops, Optimizationand Novel Formulations D. Keyes,A. Ecer, J. Periaux,N. Satofukaand P. Fox (Editors) 2000 Elsevier Science B.V. All rights r e s e r v e d .
27
O p e r a t o r S p l i t t i n g and D o m a i n D e c o m p o s i t i o n for Multiclusters * M. Garbey, D. Tromeur-Dervout CDCSP - University Lyon 1, Bat ISTIL, Bd Latarjet, 69622 Villeurbanne France
{garbey, dtromeur}@cdcsp,univ-lyonl, fr http://cdcsp.univ-lyonl.fr
We discuss the design of parallel algorithms for multiclusters. Multiclusters can be considerate as two-level architecture machines, since communication between clusters is usually much slower than communication or access to memory within each of the clusters. We introduce special algorithms that use two levels of parallelism and match the multicluster architecture. Efficient parallel algorithms that rely on fast communication have been extensively developed in the past: we intend to use them for parallel computation within the clusters. On top of these local parallel algorithms, new robust and parallel algorithms are needed that can work with a few clusters linked by a slow communication network. We present two families of two-level parallel algorithms designed for multicluster architecture: (1) new time-dependent schemes for operator splitting that are exemplified in the context of combustion, and (2) a new family of domain decomposition algorithms that can be applied, for example, to a pressure solve in Navier-Stoke's projection algorithm. Our implementation of these two-level parallel algorithms relies on a portable inter program communication library developped by Guy Edjlali et al [Parallel CFD 97]. 1. I N T R O D U C T I O N We discuss the design of two-level parallel algorithms for multicluster CFD computation. Section 2 gives the numerical scheme for adaptive operator splitting on multiclusters and discusses an application of this algorithm on two geographically separated parallel computers linked by a slow network. Section 3 presents a new family of domain decomposition algorithms designed for the robust and parallel computation of elliptic problems on multiclusters. Section 4 presents our conclusions. 2. A D A P T I V E
COUPLING
ALGORITHM
FOR MULTICLUSTERS
We consider a system of two coupled differential equations -
F(X, Y),
-
a(X, Y),
*This work was supported by the R4gion RhOne Alpes.
28 where the dot represents the time derivative. We consider second-order schemes of the form 3Xn+l _ 4X - + X n-1
=
F(Xn+l, y,,n+l)
(1)
G ( x *,n+l Y"+l).
(2)
2At 3yn+l _ 4 y n + y n - 1
=
2At Our goal is to compute (1) and (2) in parallel and, consequently, to use weak coupling in time marching; we therefore introduce a prediction of X "+1 (resp. y , + l ) in (2) (resp. (1)). Let p be an integer; we suppose that (1) is computed on machine I and (2) is computed on machine II. Let TI be the elapsed time needed to compute X "+1 when X n, X n-l, y,,n+~ are available in the memory of machine I. We make a similar hypothesis for machine II and assume further, for simplicity, that z - 7-1 - TII. We suppose that the speed of the network that links these two machines is such that the elapsed time needed to send X *'"+1 (resp. y , , , + l ) from machine I (resp. II) to machine II (resp. I) is bounded by pT. In an ideal world p should be at most 1, but realistically we anticipate p to be large. We use a second or third-order extrapolation formula to predict X *'"+~ or y,,n+l We denote such a scheme C(p, 1, j) with j - 2 or 3 as the order of extrapolation. A difficulty with this scheme from the hardware point of view is that machine I and machine II have to exchange two messages every time step. The network will consequently be very busy, and the buffering of the messages may affect the speed of communication. To relax this constraint, we restrict ourselves to communication of messages every q time steps. The same information X "-'+1 and X n-p then is used to predict X *'"+k for q consecutive time steps. The second-order extrapolation formula used on machine II is given by
X *'n+k = (p + k ) X n - p + l - (p + k -
1)X n-p, k = 1..q.
(3)
Accuracy constraints may lead us to use third-order extrapolation as follows:
X,,n+k
_--
(p + k)(P + k2 - 1 + 1 ) X " - p + I +
(p+ k-
1) 2 + ( p + k 2
-
((p + k - 1) 2 + 2 ( p + k -
1 ) ) X n-p
1)xn-p-i.
We denote such a scheme C(p, q, j) with j = 2 or 3 as the order of extrapolation. It is straightforward to show that the truncation error of the scheme is of order two. The explicit dependence on previous time steps supposed by the predictors X *'n+l and y , , , + l imposes a stability constraint on the time step. As shown in [8], this stability constraint is acceptable in the case of weak coupling of the two ODEs. Further, it is important to notice that the scheme should be adaptive in time, in particular, when the solution of the system of ODE goes through oscillations relaxations. Many techniques have been developed to control the error in time for ODE solvers; see [2] and its references. The adaptivity criterion limits the number of time steps q that the same information can be reused, although p + q should be such that the accuracy of the approximation as well as the stability of the time marching is satisfied. A more flexible and efficient way of using the network between the two machines is to use asynchronous communication [1], that is,
29 to let the delay p evolve in time marching in such a way that, as soon as the information arrives, it is used. This is a first step in adaptive control of communication processes. Let us consider now systems of PDEs. For example, we take
OU = AU + b VII, Ot OV =AV+cVU. Ot This system in Fourier space is given by Uk,m = =
( - k 2 - m2)Uk,m + b i (k + m) Vk,~n,
(4)
(-k
(5)
-
+ c i (k +
5 ,m,
where k (resp. m) is the wave number in x direction (resp. y direction). It is clear that these systems of ODEs are weakly coupled for large wave numbers k or m. One can show that the time delay in the C(p,p,j) scheme does not bring any stability constraint on the time step as long as the wave number m or k is large enough. This analysis leads us to introduce a second type of adaptivity in our time-marching scheme, based on the fact that the lower frequencies of the spectrum of the coupling terms need to be communicated less often than the high frequencies. A practical way of implementing this adaptivity in Fourier space is the following. Let ~ m = - M . . M Um be the Fourier expansion of U. We compute the evolution of U (resp. V) on machine I (resp. II), and we want to minimize the constraint on communication of U (resp. V) to machine II (resp. I). Let ~.,n+l be the prediction used in the C(p,p, j) scheme for the Fourier mode ,,n+l
/Jm, and let U be the prediction used in the C(2p, p, j) scheme. Let a be the filter of order 8 given in [9], Section 3, p. 654. We use the prediction u * ' n + l --
E m=-M..M
?gt
5n+l
+
~
m
: n+l
(1 - c r ( a [ ~ l ) ) U m
m=-M..M
with ~ > 2. The scheme using this prediction is denoted cr~C(p/2p, p,j). This way of splitting the signal guarantees second-order consistency in time and smoothness in space. For lower-order methods in space, one can simply cut off the high modes [3]. For high accuracy, however, we must keep the delayed high-frequency correction. This algorithm has been implemented successfully in our combustion model, which couples Navier-Stokes equations (NS) written in Boussinesq approximation to a reaction diffusion system of equations describing the chemical process (CP) [6]. In our implementation NS's code (I) and CP's code (II) use a domain decomposition method with Fourier discretisation that achieves a high scalable parallel efficiency mainly because it involves only local communication between neighboring subdomains for large Fourier wave number and/or small value of the time step [7]. The two codes exchange the temperature from (II) to (I), and the stream function from (I) to (II). The nonblocking communications that manage the interaction terms of the two physical models on each code are performed by the Portable Inter Codes Communication Library developed in [4]. Our validation of the numerical scheme has included comparison of the accuracy of the method with a classical second-order time-dependent
30 scheme and sensitivity of the computation with the C(p, q, j) scheme with respect to bifurcation parameters [8]. We report here only on the parallel efficiency of our scheme. For our experiment, we used two parallel computers about 500 km apart: (II) runs on a tru cluster 4100 from DEC with 400 MHz alpha chips located in Lyon, (I) runs on a similar parallel computer DEC Alpha 8100 with 440 MHz alpha chips located in Paris. Each parallel computer is a cluster of alpha servers linked by a FDDI local network at 100 Mb/s. Thanks to the SAFIR project, France Telecom has provided a full-duplex 10 Mb/s link between these two parallel computers through an ATM Fore interface at 155 Mb/s. The internal speed of the network in each parallel computer is about 80 times faster with the memory channel and 10 times faster when one uses the FDDI ring. ATM was used to guarantee the quality of service of the long-distance 10 Mb/s connection. To achieve good load balancing between the two codes, we used different data distributions for the chemical process code and the Navier-Stokes code. We fixed the number of processors for code (II) running in Lyon to be 2 or 3 and we used between 2 to 9 processors for Navier-Stokes code (I) in Paris. The data grid was tested at different sizes 2 x Nz x 2 x Nx, where N z (resp. N x ) represents the number of modes in the direction of propagation of the flame (resp. the other direction). The number of iterations was set to 200, and the computations were run several times, since the performance of the SAFIR network may vary depending on the load from other users. Table 1 summarizes for each grid size the best elapsed time obtained for the scheme o,,C(6/12, 6, 2) using the SAFIR network between the different data distribution configurations tested. We note the following: 9 A load balancing of at least 50% between the two codes (73.82% for Nx=180) has been achieved. The efficiency of the coupling with FDDI is between 78% and 94%, while the coupling with SAFIR goes from 64% to 80%. Nevertheless, we note that the efficiency of the coupling codes may deteriorate when the size of the problem increases. 3. D O M A I N D E C O M P O S I T I O N
FOR MULTICLUSTER
Let us consider a linear problem (6)
L[U] = f in ~t, Uioa = O.
We split the domain f~ into two subdomains f~ = ~-~1 U ~~2 and suppose that we can solve each subproblem on f~i, i = 1 (resp. i=2) with an efficient parallel solver on each cluster I (resp. II) using domain decomposition and/or a multilevel method and possibly different codes for each cluster. As before, we suppose that the network between cluster I and II is much slower than access to the memory inside each cluster. Our goal is to design a robust and efficient parallel algorithm that couples the computation on both subdomains. For the sake of simplicity, we restrict ourselves to two subdomains and start with the additive Schwarz algorithm: L[u? +11 - f in
al,
n+l n "ttliF1 - - lt21F1 , n+l
n
L[u~ +1] - f in f~2, a21r~ - ullr~-
(7) (8)
31 We recall that this additive Schwarz algorithm is very slow for the Laplace operator, and therefore is a poor parallel algorithm as a pressure solver, for example, in projection schemes for Navier-Stokes. Usually, coarse-grid operators are used to speed up the computation. Our new idea is to speed up the robust and easy-to-implement parallel additive Schwarz algorithm with a posteriori Aitken acceleration [11]. It will be seen later that our methodology applies to other iterative procedures and to more than two subdomains. We observe that the operator T, n i -- Uri --+ ailr ~ + 2/ - Uv~ uilr
(9)
is linear. Let us consider first the one-dimensional case f~ - (0, 1)- the sequence u~l~,~ is now a n+2 sequence of real numbers. Note that as long as the operator T is linear, the sequence uilr~ n+2 has pure linear convergence (or divergence); that is, it satisfies the identity o.ailr~ -Uir ~= (~(uinlr~ - Uir~), where 5 is the amplification factor of the sequence. The Aitken acceleration procedure therefore gives the exact limit of the sequence on the interface Fi based on three successive Schwarz iterates u~tr~, j - 1, 2, 3, and the initial condition u/~ namely,
ur~ = u~,r ~ _ u~lr ' _ uljr ' + u/Or '
(10)
An additional solve of each subproblem (7,8) with boundary conditions ur~ gives the solution of (6). The Aitken acceleration thus transforms the Schwarz additive procedure into an exact solver regardless of the speed of convergence of the original Schwarz method. It is interesting that the same idea applies to other well-known iterative procedures such as the Funaro-Quarteroni algorithm [5], regardless of its relaxation parameter, and that the Aitken acceleration procedure can solve the artificial interface problem whether the original iterative procedure converges or diverges, as long as the sequence of solutions at the interface behaves linearly! Next, let us consider the multidimensional case with the discretized version of the problem (6): (11)
Lh[U] = f in ~, Uioa = O.
Let us use E h to denote some finite vector space of the space of solutions restricted to the artificial interface Fi. Let g , j = 1..N be a set of basis functions for this vector space and P the corresponding matrix of the linear operator T. We denote by u e.~,~,j - 1,.., N the components of u~]r~, and we have then ~+2 -- U j l F i ) j = I , . . , N ('{ti,j
n -_- P(?.l,i,j
VjlFi)j=l
(12)
.... N .
We introduce a generalized Aitken acceleration with the following formula: P-
/
2(j+l)
(uk,i
-
.2j~-I
~k,i]i=l,..,N,j=O,..,g-l(Uk,i
/
2(j+l)
2j -- ?-tk,i)i=l .... g , j = l , . . , g ,
(13)
and finally we get u ~ k,i, i - 1 ~ "'~ N the solution of the linear system
(Id -
c~
P)(Uk,i)i=
1 .... U =
~' 2 N + 2
(~k,i
)i=I,..,N
-- P
2N
(Uk,i)i=1
.... N -
(14)
32 I d denotes the matrix of the identity operator. We observe that the generalized Aitken procedure works a priori independently of the spectral radius of P, that is, the convergence of the interface iterative procedure is not needed. In conclusion, 2N + 2 Schwarz iterates produce a priori enough data to compute via this generalized Aitken accelera/ 2(j+1) tion the interface value Uir ~, k - 1, .., 2. However, we observe that the matrix (uk,~ 2j )~=1....N,j=0....N-~ is ill-conditioned and that the computed value of P can be very sensiltk,i tive to the data. This is not to say that the generalized Aitken acceleration is necessarily a bad numerical procedure. But the numerical stability of the method and the numerical approximation of the operator P should be carefully investigated depending on the discretization of the operator. We are currently working on this analysis for several types of discretization (finite differences, spectral and finite elements). Here, we show that this algorithm gives very interesting results with second-order finite differences. Let us consider first the Poisson problem u ~ + uyy = f in the square (0, 1) 2 with Dirichlet boundary conditions. We partition the domain into two overlapping strips (0, a) • (0, 1)U(b, 1) • (0, 1) with b > a. We introduce the regular discretization in the y direction yi = ( i - 1)h, h - N ~ , and central second-order finite differences in the y direction. Let us denote by 5i (resp. j~) the coefficient of the sine expansion of u (resp. f). The semi-discretized equation for each sinus wave is then
4/h
h
(15)
and therefore the matrix P for the set of basis functions bi - sin(i~) is diagonal. The algorithm for this specific case is the following. First, we compute three iterates with the additive Schwarz algorithm. Second, we compute the sinus wave expansion of the trace of the iterates on the interface Fi with fast transforms. Third, we compute the limit of the wave coefficients sequence via Aitken acceleration as in the one-dimensional case. We then derive the new numerical value of the solution at the interface Fi in physical space. A last solve in each subdomain with the new computed boundary solution gives the final solution. We have implemented with Matlab this algorithm for the Poisson problem discretized with finite differences, a five-point scheme, and a random rhs f. Figure 1 compares the convergence of the new method with the basic Schwarz additive procedure. Each subproblem is solved with an iterative solver until the residual is of order 10 -1~ The overlap between subdomains in the x direction is just one mesh point, a = b + h . The result in our experiment is robust with respect to the size of the discretized problem. Note however that this elementary methods fails if the grid has a nonconstant space step in the y direction or if the operator has coefficients depending on the y variable, that is L = (el(x, y)ux)x + (a2(x, y)uy)y, because P is no longer diagonal. For such cases P becomes a dense matrix, and we need formally 2N + 2 Schwarz iterates to build up the limit. Figure 2 gives a numerical illustration of the method when one of the coefficients a2 of the second-order finite difference operator is stiff in the y direction. We checked numerically that even if P is very sensitive to the data, the limit of the interface is correctly computed. In fact we need only to accelerate accurately the lower modes of the solution, since the highest modes are quickly damped by the iterative Schwarz procedure itself. Further, we can use an approximation of P that neglects the coupling between
33 the sinus waves, and apply iteratively the previous algorithm. This method gives good results when the coefficients of the operator are smooth, that is the sine expansions in the y variable of the coefficients of the operator converge quickly. In our last example, we consider a nonlinear problem that is a simplified model of a semiconductor device [10]. In one space dimension the model writes
Au f
=
-
e -u
+ f, in(O, d),
x 1 tanh(20(-~-~)),
-
u(0)
e~
-
(16) (17)
x e (0, d),
a s inh t--~-) "f(O) + Uo, u ( d ) -
asinh(
)
(18)
The problem is discretized by means of second-order central finite differences. We apply consecutively several times the straightforward Aitken acceleration procedure corresponding to the Laplace operator case. Figure 3 shows the numerical results for 80 grid points and one mesh point overlap. Notice that the closer the iterate gets to the final solution, the better the result of the Aitken acceleration. This can be easily explained in the following way: the closer the iterate gets to the solution, the better the linear approximation of the operator. Similar results have been obtained in the multidimensional case for this specific problem. So far we have restricted ourselves to domain decomposition with two subdomains. The generalized Aitken acceleration technique however can be applied to an arbitrary number of subdomains with strip domain decomposition. 2
,
,
,
Convergence ,
,
0 --1 -
2 --3--4--5--6--7~
-81
1'.5
3
3.5
4
s'.s
6
Figure 1. Solid line (resp. -o- line) gives the loglo (error in maximum norm) on the discrete solution with Schwarz additive procedure (resp. new method)
34
convergence
U
0 ....... i .......
:
.... i ......
-2 -4 -6 -80
-
io
2o
3o
4o
30
5o
a2
al 1 ....... i
i
.
3 ....... i
' ........ .
i
. 9
2 ....... ! ........ :.....
0 ........ i. . . . . . .
20
0"~~
.....
~ 10
1 ....... i ...... i
".....
i ........ : ...... '
i ......
30
20
Figure 2. Application of the m e t h o d to the nonconstant coefficient case convergence
i
i 0
-20
~4-
0
-6' -8
-1( ) N=80
0.5
-0...= . . . . . . . . . . . .
..,=,=,.1 9 ...........
I. . . . . . . . . . . .
I . . . . . . . . . .
1
2
~., .....
I
3
Figure 3. application of the m e t h o d to a non linear problem Let us consider, for the sake of simplicity, the one-dimensional case with q > 2 subdomains. P is then a p e n t a d i a g o n a l m a t r i x of size 2 ( q - 1). Three Schwarz iterates provide enough
35 information to construct P and compute the limit of the interfaces. However, we have some global coupling between the interface's limits solution of the linear system associated to matrix I d - P. Since P is needed only as an acceleration process, local approximation of P can be used at the expense of more Schwarz iterations. The hardware configuration of the multicluster and its network should dictate the best approximation of P to be used. 4. C O N C L U S I O N We have developed two sets of new ideas to design two level parallel algorithms appropriate for multicluster architecture. Our ongoing work generalizes this work to an arbitrary number of clusters. REFERENCES
1. D . P . Bertsekas and J. N. Tsitsiklis, Parallel and distributed computation, numerical methods, Prentice Hall, Englewood Cliffs, New Jersey, 1989. 2. M. Crouzeix and A.L. Mignot, Analyse Numdrique des dquations diffdrentielles, 2nd ed. Masson, 1992. 3. A. Ecer, N. Gopalaswamy, H. U. Akay and Y. P. Chien, Digital filtering techniques for parallel computation of explicit schemes, AIAA 98-0616, Reno, Jan. 12-15, 1998. 4. G. Edjlali, M. Garbey, and D. Tromeur-Dervout, Interoperability parallel programs approach to simulate 3D frontal polymerization process, J. of Parallel Computing, 25 pp. 1161-1191, 1999. 5. D. Funaro, A. Quarteroni and P. Zanolli, An iterative procedure with interface relaxation for domain Decomposition Methods, SIAM J. Numer. Anal. 25(6) pp. 1213-1236, 1988. 6. M. Garbey and D. Tromeur-Dervout, Massively parallel computation of stiff propagating combustion front, IOP J. Comb. Theory Modelling 3 (1): pp. 271-294, 1997. 7. M. Garbey and D. Tromeur-Dervout, Domain decomposition with local fourier bases applied to frontal polymerization problems, Proc. Int. Conf. DD11, Ch.-L. Lai & al Editors, pp. 242-250, 1998 8. M. Garbey and D. Tromeur-Dervout, A Parallel Adaptive coupling Algorithm for Systems of Differential Equations, preprint CDCSP99-01, 1999. 9. D. Gottlieb and C. W. Shub, On the Gibbs phenomenon and its resolution, SIAM Review, 39 (4), pp. 644-668, 1998. 10. S. Selberherr, Analysis and simulation of semiconductor devices, Springer Verlag, Wien, New York, 1984. 11. J. Stoer and R. Burlish, Introduction to numerical analysis, TAM 12 Springer, 1980.
36
Nz=64, Nx=120 Max(s)[Min(s)]Average(s) 221.04 2 1 2 . 8 9 216.19 PNS=4 210.82 2 0 0 . 1 3 204.56 Pcp=2 208.24 1 9 5 . 6 8 202.05 Pys=4 192.87 180.23 187.68 Pcp-2 208.45 2 0 5 . 1 3 206.98 PNS=4 109.43 108.41 108.90 Pcp-2
200 Iterations SAFIR Coupled FFDI Coupled Non Coupled
Processors
200 Iterations SAFIR Coupled FFDI Coupled Non Coupled
Processors
200 Iterations SAFIR Coupled FFDI Coupled Non Coupled
Processors
200 Iterations SAFIR Coupled FFDI Coupled Non Coupled
Processors
PNs=6 Pcp=3 PNS=6 Pcp-3 PNs=6 Pcp=3
Nz=64, Nx=180 Max(s) Min(s) Average(s) 2 7 2 . 2 0 2 5 3 . 0 9 260.50 259.25 2 4 0 . 0 2 247.86 2 0 9 . 3 3 1 9 9 . 4 6 204.76 194.66 1 8 5 . 2 8 189.69 2 1 4 . 3 8 2 0 5 . 9 7 210.82 158.92 1 4 9 . 2 4 155.64
Nz=128, Nx=120 Max(s) Min(s) Average(s) 247.79 Pys=4 2 5 3 . 9 7 241.22 235.38 Pcp-2 2 4 1 . 5 0 228.89 205.08 PNS=4 2 0 8 . 0 8 199.38 190.78 Pcp-2 1 9 7 . 1 7 184.73 208.26 PNS=4 208.95 206.89 104.84 Pcp=2 105.32 104.49 Nz=128, Nx=180 Max(s) Min(s) Average(s) PNS=6 3 0 6 . 7 6 2 8 5 . 0 9 298.23 Pcp=3 2 9 4 . 2 6 271.71 284.95 PNS=6 2 3 8 . 6 0 2 1 5 . 0 0 224.62 Pcp=3 2 2 2 . 7 3 199.75 209.35 PNS=6 2 4 1 . 9 4 2 2 8 . 5 6 236.57 Pcp-3 107.97 1 0 6 . 5 8 107.26
Efficiency
80.60% 91.24% lOO.O %
Efficiency 73.87% 93.98% 100.0%
Efficiency 71.33% 86.76%
lOO.O%
Efficiency 65.85% 88.13%
100% Table 1 Elapsed time for 20 runs of the coupling codes with different hardware configuration
Parallel ComputationalFluid Dynamics Towards Teraflops, Optimizationand Novel Formulations D. Keyes,A. Ecer, J. Periaux,N. Satofukaand P. Fox (Editors) 92000 Elsevier Science B.V. All rights reserved.
37
D e v e l o p m e n t of t h e " E a r t h S i m u l a t o r " Shinichi Kawai a, Mitsuo Yokokawa b, Hiroyuki Ito a, Satoru Shingu b, Keiji Tani b and Kazuo Yoshida ~ ~Earth Simulator Research and Development Center, National Space Development Agency of Japan, Sumitomo Hamamatsu-cho bldg. 10, 1-18-16, Hamamatsu-cho, Minato-ku, Tokyo, 105-0013, Japan bEarth Simulator Research and Development Center, Japan Atomic Energy Research Institute, Sumitomo Hamamatsu-cho bldg. 10, 1-18-16, Hamamatsu-cho, Minato-ku, Tokyo, 105-0013, Japan "Earth Simulator" is a high speed vector processor based parallel computer system for computational earth science. The goal of the "Earth Simulator" is to achieve at least 5 Tflop/s sustained performance, which should be about 1,000 times faster than the commonly used supercomputers, in atmospheric general circulation model (AGCM) program with the resolution of 5-10 km grid on the equator. This computer system consists of 640 processor nodes connected by a fast single-stage crossbar network. Each processor node has 8 arithmetic processors sharing 16 Gbytes main memory. Total main memory capacity and peak performance are 10 TBytes and 40 Tflop/s, respectively. Application software on the "Earth Simulator" and software simulator to evaluate the performance of the program on the "Earth simulator" are also described. 1. I N T R O D U C T I O N "Earth Simulator" is a high speed parallel computer system for computational earth science, which is a part of earth science field for understanding the Earth and its surroundings such as atmosphere, ocean and solid earth through computer simulation. Computational earth science is useful for weather forecast, prediction of global change such as global warming and E1 Nifio event, and some earthquake related phenomena such as mechanisms of earthquake or disaster prediction. The Science and Technology Agency of Japan promotes global change prediction research through process study, earth observation, and computer simulation. Development of the "Earth Simulator" is a core project to achieve this objective. "Earth Simulator" is being developed by Earth Simulator Research and Development Center, or ESRDC, which is a joint team of NASDA, National Space Development Agency of Japan, JAERI, Japan Atomic Energy Research Institute and JAMSTEC, Japan Marine
38
Science and Technology Center. The goal of the "Earth Simulator" is to achieve at least 5 Tflop/s sustained performance in atmospheric general circulation model (AGCM) program with the resolution of 5-10 km grid on the equator, in other words, about 4000 grid points for the longitudinal or eastwest direction, 2000 grid points for the latitudinal or north-south direction, and i00 grid points for the vertical or altitudinal direction. The total number of grid points is about 800 million for a i0 km grid on the equator. For the most commonly used supercomputers in computational earth science, sustained performance is 4-6 Gflop/s with a resolution of 50-100 km grid and 20-30 layers. Sustained performance of the "Earth Simulator" should be about 1,000 times faster than this. There are two types of parallel computer systems, vector processor based parallel and micro processor based massively parallel system. The execution performance of one of the major AGCM codes with various computer systems shows that efficiency of the vector processor based system is about 30~ Efficiency of the micro processor based systems is much less than 10% (Hack, et al. [2]). If we assume that the efficiency of the "Earth Simulator" is 10-15%, 30-50 Tflop/s peak performance is needed to achieve 5 Tflop/s sustained performance for an AGCM program. We think this is possible by using a vector processor based parallel system with distributed memory and a fast communication network. If we tried to achieve 5 Tflop/s sustained performance by a massively parallel system, more than 100 Tflop/s peak performance would be needed. We think this is unlikely by early 2002. So we adopt the vector processor based parallel system. The total amount of main memory is 10 TBytes from the requirement of computational earth science. 2. BASIC DESIGN OF THE "EARTH SIMULATOR"
i
I
IIIIIIIIIIIIIIIIIIIII.IIIIIIII~IlIII//LI/ilI//IiiiiZ/I .........IIIIIII:IiiiLIL.I. :i.~[;//iiii LII//.II:II................................................................IIILIIIIII/://///.I/.I/.I;I/.I/I]I:IS:///I:///I/.I//// ....
ii.iii iiiiiiiii iiii!".! iiiiiiii iii,!iii
ii:.iii ii. ii iii iiii
ii~
~iL~ ...............................................i!~
Processor Node #1
Processor Node #639
~ii Processor Node #0
iii~
Figure i. Configuration of the "Earth Simulator"
The basic design of the "Earth Simulator" is as follows* (Yokokawa, et al. [6], [7]). *This basic design might be changed as development proceeds.
39 The "Earth Simulator" is a distributed memory parallel system that consists of 640 processor nodes connected by a fast single-stage crossbar network. Each processor node has 8 arithmetic processors sharing 16 Gbytes main memory (Figure 1). The total number of arithmetic processors is 5,120 and total main memory capacity is 10 TBytes. As the peak performance of one vector processor is 8 Gflop/s, the peak performance of one processor node is 64 Gflop/s and total peak performance is 40 Tflop/s. Power consumption will be very high, so air cooling technology and power saving technology such as CMOS technology will be used.
AP RCU lOP MMU
Arithmetic 9 Processor Remote 9 A c c e s s Control Unit 1/O 9 Processor Main 9 M e m o r y Unit
From/To Crossbar Network
i
LAN=
/l i
User/Work Disks
Shared Main M e m o r y (MM) 16GB
Figure 2. Processor Node (PN)
Each processor node consists of 8 arithmetic processors, one remote access control unit (RCU), one I/O processor (IOP) and 32 main memory units (MMU). Each processor or unit is connected to all 32 MMUs by cables. The RCU is also connected to the crossbar network to transfer data from/to the other processor nodes, and the IOP is connected to the LAN and/or disks (Figure 2). Each arithmetic processor consists of one scalar unit, 8 vector pipeline units, and a main memory access control unit. These units are packaged on one chip of about 2 cm x 2 cm containing more than 40 million transistors. The clock frequency of the arithmetic processor is 500 MHz. The vector unit consists of the 8 vector pipeline units, vector registers and some mask registers. Each vector pipeline unit has 6 types of functional units: add/shift, multiply, divide, logical, mask, and load/store. Sixteen floating point operations are possible in one clock cycle (2 ns), so the peak performance of one arithmetic processor is 8 Gflop/s. The scalar unit contains an instruction cache, a data cache, scalar registers and scalar pipelines (Figure 3).

Figure 3. Arithmetic Processor (AP)

We adopt a newly developed DRAM-based 128 Mbit fast RAM, which has 2,048 banks, for the main memory. The access time of SSRAM is very fast, but it is very expensive; the access time of the fast RAM is between those of SSRAM and SDRAM, and its cost is much less than that of SSRAM. The memory bandwidth is 32 GBytes per second for each arithmetic processor and 256 GBytes per second for one processor node, so the ratio between operation capability and memory throughput is 2:1. The interconnection network unit consists of a control unit called XCT and a 640x640 crossbar switch called XSW. Each XCT and XSW is connected to all 640 processor nodes through cables; more than 80,000 cables are used to connect the interconnection network unit to the 640 processor nodes. The XCT is connected to the RCU in each processor node. The data transfer rate is 16 GBytes per second for both input and output paths (Figure 4). For any two processor nodes, the distance for data transfer from one processor node to another is always the same, and any two different data paths do not interfere with each other, so we do not have to care which node the data is placed in. For each processor node, the input port and output port are independent, so it is possible to input and output data at the same time (Figure 5). We are planning to treat 16 processor nodes as one cluster to operate the "Earth Simulator". There are thus 40 clusters, and each cluster has a Cluster Control Station (CCS). We assign one cluster as a front-end for processing interactive jobs and small-scale batch jobs; the other 39 clusters are treated as back-end clusters for processing large-scale batch jobs (Figure 6). System software is under development. Basically, we will use a commonly used operating system and languages with some extensions for the "Earth Simulator". For the parallelization environment, MPI, HPF and OpenMP will be supported.
Figure 4. Interconnection Network (IN)

Figure 5. Conceptual image of the 640x640 crossbar switch
3. APPLICATION SOFTWARE ON THE "EARTH SIMULATOR"
In the "Earth Simulator" project, we will develop some parallel application software for simulating the global change of the atmosphere, ocean and solid earth. For atmosphere and ocean, we are developing a parallel program system, called N JR [5]. For solid earth, we are developing a program called Geo FEM [3], which is a parallel finite element analysis system for multi-physics/multi-scale problems. We will only explain the N JR program system. N JR is the large-scale parallel climate model developed by NASDA, National Space Development Agency of Japan, JAMSTEC, Japan Marine Science and Technology Center,
Figure 6. Cluster system for operation
The NJR system includes an Atmospheric General Circulation Model (AGCM), an Ocean General Circulation Model (OGCM), an AGCM-OGCM coupler and a pre-/post-processor. Two kinds of numerical methods are applied for the AGCM: one is a spectral method, called NJR-SAGCM, and the other is a grid point method, called NJR-GAGCM. NJR-SAGCM is developed with reference to the CCSR/NIES AGCM (Numaguti, et al. [4]) developed by CCSR, the Center for Climate System Research at the University of Tokyo, and NIES, the National Institute for Environmental Studies. There are also two kinds of numerical methods in the OGCM, a grid point method and a finite element method. Since the AGCM and OGCM are developed separately, a connection program between them, called the AGCM-OGCM coupler, is needed when simulating the influence of ocean currents on the climate. A pre-processing system is used to input data such as topographic data and climate data like temperature, wind velocity and surface pressure at each grid point. A post-processing system is also used to analyze and display the output data. We will deal with NJR-SAGCM in detail. An atmospheric general circulation program is used for everything from short-term weather forecasting to long-term climate change prediction such as global warming. In general, an atmospheric general circulation model is separated into a dynamics part and a physics part. In the dynamics part, we numerically solve the primitive equations which describe the atmospheric circulation, for example, the equation of motion, thermodynamics, vorticity, divergence and so on. The hydrostatic approximation is assumed at the present resolution of the global model. In the physics part, we parameterize the physical phenomena smaller than the grid size, for example, cumulus convection, radiation and so on. In NJR-SAGCM, the computational method is as follows. In the dynamics part, a spectral method with spherical harmonic functions is applied in the horizontal direction and a finite difference method is applied in the vertical direction. In the physics part, since the scale of the physics parameters is much smaller than the grid size, we compute the physics parameters at each vertical one-dimensional column independently. A semi-implicit method
and the leap-frog method are applied for the time integration. A domain decomposition method is applied for parallelization of the data used in NJR-SAGCM.
Figure 7. Parallelization technique in the spectrum transform (4000 x 2000 x 100 grid points in longitude x latitude x altitude; grid space, Fourier space and spectrum space, connected by the 1D FFT, data transpose and Legendre transform, with about 1333 retained wave numbers)
For the dynamics part, a spectral method is used in the horizontal direction. Figure 7 shows the parallelization technique in the spectrum transform. First, the grid space is decomposed equally along the latitude direction into as many pieces as there are processor nodes, and the data are distributed to the processor nodes. The grid space is then transformed to Fourier space by a one-dimensional (1D) FFT whose size is the number of grid points in the longitude direction. Since this 1D FFT is independent of the latitude and altitude directions, all 1D FFTs are calculated independently and no data transpose is needed within a processor node. In Fourier space, only about 1/3 of the wave numbers are needed and the high frequency components are ignored; this is to avoid aliasing when calculating second-order terms such as the convolution of two functions. Before going to the Legendre transform, a domain decomposition of Fourier space along the longitude direction is needed, so a data transpose occurs across all processor nodes. Since triangular truncation is used for the latitude direction in the one-dimensional Legendre transform, various wave number components are mixed in each processor node when transposing the data, in order to avoid load imbalance among the processor nodes. The inverse transforms are the same as the forward transforms. In NJR-SAGCM, the data transpose occurs four times while the forward and inverse transforms are executed in one time step. For the physics part, the parallel technique is simpler. First, as in the dynamics part, the domain is decomposed into as many pieces as there are processor nodes and the data are distributed to
each processor node. Since each set of one-dimensional data in the vertical direction is mutually independent, these sets can be processed in parallel and no data transpose occurs among the processor nodes. Within each processor node, the data is divided into eight parts for parallel processing with microtasking by the eight arithmetic processors, and within each arithmetic processor, vectorization is used. NJR-SAGCM is being optimized for the "Earth Simulator", which has three levels of parallelization: vector processing within one arithmetic processor, shared-memory parallel processing within one processor node and distributed-memory parallel processing among processor nodes using the interconnection network, toward the achievement of 5 Tflop/s sustained performance on the "Earth Simulator".
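The decomposition and transpose pattern used in the dynamics part can be illustrated with a small serial sketch. The grid sizes, the roughly 1/3 spectral truncation and the latitude-band decomposition follow the description above, but the Python/numpy code below (written for this summary) only mimics the data movement on a single node; the real NJR-SAGCM implementation is parallel and includes the Legendre transform, which is omitted here.

    import numpy as np

    nlon, nlat, nlev, nodes = 4000, 2000, 100, 640   # grid and machine sizes from the text
    ntrunc = nlon // 3                               # ~1333 retained wave numbers (anti-aliasing)

    # Dynamics part: each node holds a band of latitudes of the full grid
    # (integer division here; the real code also distributes the remainder).
    lat_per_node = nlat // nodes
    field = np.random.rand(lat_per_node, nlev, nlon)  # one node's share of grid space

    # 1D FFT along longitude: independent for every (latitude, level) pair,
    # so no inter-node communication is needed at this stage.
    fourier = np.fft.rfft(field, axis=-1)[..., :ntrunc]

    # Before the Legendre transform the data must be redistributed so that each
    # node owns a subset of wave numbers but all latitudes: a global transpose.
    # Serially this is an axis permutation; in parallel it is an all-to-all exchange.
    by_wavenumber = np.transpose(fourier, (2, 1, 0))  # (wave number, level, latitude)

    print(field.shape, fourier.shape, by_wavenumber.shape)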
4. PERFORMANCE ESTIMATION ON THE "EARTH SIMULATOR"
We have developed a software simulator, which we call GSSS, to estimate the sustained performance of programs on the "Earth Simulator" (Yokokawa, et al. [6]). This software simulator traces the behavior of the principal parts of the "Earth Simulator" and of computers with a similar architecture such as the NEC SX-4. We can estimate the accuracy of GSSS by executing a program both on an existing computer system and on GSSS and comparing the results. Once the accuracy of GSSS is confirmed, we can estimate the sustained performance of programs on the "Earth Simulator" by changing the hardware parameters of GSSS to those of the "Earth Simulator". The GSSS system consists of three parts: GSSS_AP, the timing simulator of the arithmetic processor; GSSS_MS, the timing simulator of memory access from the arithmetic processors; and GSSS_IN, the timing simulator of asynchronous data transfer via the crossbar network (Figure 8). We execute a target program on a reference machine, in our case an NEC SX-4, and obtain the instruction trace data, then feed this instruction trace data file into the GSSS system. We also supply a data file with the hardware parameters, such as the number of vector pipelines, latency and so on. These hardware parameters must be taken from machines with the same architecture as the reference one. If we want to trace the behavior of the SX-4, we supply the hardware parameters for the SX-4; if we want to trace the behavior of the "Earth Simulator", we supply the hardware parameters for the "Earth Simulator". The output from the GSSS system is the estimated performance of the target program on the SX-4 or the "Earth Simulator", depending on the hardware parameter file. The sustained performance of a program is usually measured as the flop count of the program divided by the total processing time; in the case of performance estimation by GSSS, the total processing time is estimated using GSSS. We have prepared three kinds of benchmark programs. The first is the set of kernel loops from the present major AGCM and OGCM codes, to check the performance of the arithmetic processor. These kernel loops are divided into three groups: Group A includes simple loops, Group B includes loops with IF branches and intrinsic functions, and Group C includes loops with indirect array accesses. The second kind of benchmark program is the FT and BT codes of the NAS Parallel Benchmarks (Bailey, et al. [1]), which involve transpose operations on arrays when executed in parallel, to check the performance of the crossbar network. The third is NJR-SAGCM, to check the performance of the application software.
Figure 8. GSSS system (the target program is executed on a reference machine to produce an instruction trace data file, which is fed into the GSSS system together with a data file of hardware parameters such as the number of vector pipelines and latency; the output is the estimated performance of the target program)
The average absolute relative error of the processing time estimated by GSSS with respect to that measured on the NEC SX-4 for the Group A kernel loops is about 1%, so the estimation accuracy of the processing time by GSSS is quite good. The estimated sustained performance for Group A on one arithmetic processor of the "Earth Simulator" is, on average, almost half of the peak performance, even though Group A includes various types of kernel loops.
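The two quantities used in this evaluation, sustained performance and the relative error of the estimated processing time, reduce to simple ratios. The fragment below shows the bookkeeping with made-up numbers; it is only an illustration of the definitions, not actual GSSS output.

    def sustained_gflops(flop_count, seconds):
        # Sustained performance = floating point operations / processing time.
        return flop_count / seconds / 1.0e9

    def relative_error(t_estimated, t_measured):
        # Absolute relative error of a simulated processing time vs. a measured one.
        return abs(t_estimated - t_measured) / t_measured

    # Hypothetical kernel-loop figures, used only to show the arithmetic:
    flops, t_sx4, t_gsss = 4.0e9, 2.10, 2.08
    print(sustained_gflops(flops, t_sx4))         # measured sustained Gflop/s
    print(100.0 * relative_error(t_gsss, t_sx4))  # relative error in percent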
5. FUTURE WORK
The design of the "Earth Simulator" is evaluated with some benchmark programs by the software simulator, the GSSS system, which was described in the fourth section. We feed the evaluation results back into the design in order to achieve 5 Tflop/s sustained performance when executing the NJR-SAGCM program described in the third section. Figure 9 shows the development schedule of the "Earth Simulator". The conceptual design, the basic design described in the second section, and the research and development for manufacturing are finished. We are going to carry out the detailed design this year and will then begin manufacture and installation next year. The building for the "Earth Simulator" will be located in Yokohama, and this computer system will begin operation in early 2002.
Figure 9. Development schedule (1997-2002; hardware system: conceptual design, basic design, R&D for manufacturing, detailed design, manufacture and installation; software: software design, software development and test, operation supporting software; installation of peripheral devices; operation from early 2002)

REFERENCES
1. D. Bailey, et al., "The NAS Parallel Benchmarks", RNR Technical Report RNR-94-007, March (1994).
2. J. J. Hack, J. M. Rosinski, D. L. Williamson, B. A. Boville and J. E. Truesdale, "Computational design of the NCAR community climate model", Parallel Computing, Vol. 21, No. 10, pp. 1545-1569 (1995).
3. M. Iizuka, H. Nakamura, K. Garatani, K. Nakajima, H. Okuda and G. Yagawa, "GeoFEM: High-Performance Parallel FEM for Geophysical Applications", in C. Polychronopoulos, K. Joe, A. Fukuda and S. Tomita (eds.), High Performance Computing: Second International Symposium, Proceedings/ISHPC'99, Kyoto, Japan, May 26-28, 1999, 292-303, Springer (1999).
4. A. Numaguti, M. Takahashi, T. Nakajima and A. Sumi, "Description of CCSR/NIES Atmospheric General Circulation Model", CGER's Supercomputer Monograph Report, Center for Global Environmental Research, National Institute for Environmental Studies, No. 3, 1-48 (1997).
5. Y. Tanaka, N. Goto, M. Kakei, T. Inoue, Y. Yamagishi, M. Kanazawa and H. Nakamura, "Parallel Computational Design of NJR Global Climate Models", in C. Polychronopoulos, K. Joe, A. Fukuda and S. Tomita (eds.), High Performance Computing: Second International Symposium, Proceedings/ISHPC'99, Kyoto, Japan, May 26-28, 1999, 281-291, Springer (1999).
6. M. Yokokawa, S. Shingu, S. Kawai, K. Tani and H. Miyoshi, "Performance Estimation of the Earth Simulator", Proceedings of the ECMWF Workshop, November (1998).
7. M. Yokokawa, S. Habata, S. Kawai, H. Ito, K. Tani and H. Miyoshi, "Basic Design of the Earth Simulator", in C. Polychronopoulos, K. Joe, A. Fukuda and S. Tomita (eds.), High Performance Computing: Second International Symposium, Proceedings/ISHPC'99, Kyoto, Japan, May 26-28, 1999, 269-280, Springer (1999).
Virtual Manufacturing and Design in the Real World - Implementation and Scalability on HPPC Systems
K McManus, M Cross, C Walshaw, S Johnson, C Bailey, K Pericleous, A Slone and P Chow†
Centre for Numerical Modelling and Process Analysis, University of Greenwich, London, UK. †FECIT, Uxbridge, UK
Virtual manufacturing and design assessment increasingly involve the simulation of interacting phenomena, i.e. multi-physics, an activity which is very computationally intensive. This paper describes one attempt to address the parallel issues associated with a multi-physics simulation approach based upon a range of compatible procedures operating on one mesh using a single database - the distinct physics solvers can operate separately or coupled on sub-domains of the whole geometric space. Moreover, the finite volume unstructured mesh solvers use different discretisation schemes (and, particularly, different 'nodal' locations and control volumes). A two-level approach to the parallelisation of this simulation software is described: the code is restructured into parallel form on the basis of the mesh partitioning alone, i.e. without regard to the physics. However, at run time, the mesh is partitioned to achieve a load balance by considering the load per node/element across the whole domain. The latter, of course, is determined by the problem-specific physics at a particular location.
1. INTRODUCTION
As industry moves inexorably towards a simulation-based approach to manufacturing and design assessment, the tools required must be able to represent all the active phenomena together with their interactions (increasingly referred to as multi-physics). Conventionally, most commercial tools focus upon one main phenomenon (typically 'fluids' or 'structures') with others supported in a secondary fashion - if at all. However, the demand for multi-physics has brought an emerging response from the CAE sector - the 'structures' tools ANSYS [1] and ADINA [2] have both recently introduced flow modules into their environments. However, these 'flow' modules are not readily compatible with their 'structures' modules with regard to the numerical technology employed. Thus, although such enhancements facilitate simulations involving loosely coupled interactions amongst fluids and structures, closely coupled situations remain a challenge. A few tools are now emerging into the
community that have been specifically configured for closely coupled multi-physics simulation, see for example SPECTRUM [3], PHYSICA [4] and TELLURIDE [5]. These tools have been designed to address multi-physics problems from the outset, rather than as a subsequent bolt-on. Obviously, multi-physics simulation involving 'complex physics', such as CFD, structural response, thermal effects, electromagnetics and acoustics (not necessarily all simultaneously), is extremely computationally intensive and is a natural candidate to exploit high performance parallel computing systems. This paper highlights the issues that need to be addressed when parallelising multi-physics codes and provides an overview description of one approach to the problem.
2. THE CHALLENGES
Three examples serve to illustrate the challenges in multi-physics parallelisation.
2.1. Dynamic fluid-structure interaction (DFSI)
DFSI finds its key application in aeroelasticity and involves the strong coupling between a dynamically deforming structure (e.g. the wing) and the fluid flow past it. Separately, these problems are no mean computational challenge - however, coupled they involve the dynamic adaption of the flow mesh. Typically, only a part of the flow mesh is adapted; this may well be done by using the structures solver acting on a sub-domain with negligible mass and stiffness. Such a procedure is then three-phase [6,7]:
- The flow is solved for and the pressure conditions are loaded onto the structure.
- The structure responds dynamically.
- Part of the flow mesh (near to the structure) is adapted via the 'structures' solver, and both the adapted mesh and the mesh element velocities are passed to the flow solver.
Here we are dealing with three separate sub-domains (the structure mesh, the deformed flow mesh and the whole flow mesh) with three separate 'physics' procedures (dynamic structural response, static structural analysis and dynamic flow analysis). Note that the deformed flow mesh is a subset of the whole flow mesh.
2.2. Flow in a cooling device
The scenario here is reasonably straightforward: hot fluid passes through a vessel and loses heat through the walls. As this happens, the walls develop a thermally based stress distribution. Of course, the vessel walls may also deform and affect the geometry (i.e. the mesh) of the flow domain. In the simplest case, part of the domain is subject to flow and heat transfer, whilst the other is subject to conjugate heat transfer and stress development. If the structural deformation is large, then the mesh of the flow domain will have to be adapted.
2.3. Metals casting
In metals casting, hot liquid metal fills a mould, then cools and solidifies. The liquid metal may also be stirred by electromagnetic fields to control the metallic structure. This problem is complex from a load balancing perspective:
- The flow domain is initially full of air, and as the metal enters this is expelled; the air-metal free surface calculation is more computationally demanding than the rest of the flow field evaluation in either the air or metal sub-domains.
- The metal loses heat from the moment it enters the mould and eventually begins to solidify. The mould is being dynamically thermally loaded and the structure responds to this.
- The electromagnetic field, if present, is active over the whole domain.
Although the above examples are complex, they illustrate some of the key issues that must be addressed in any parallelisation strategy that sets out to achieve an effective load balance:
- In 2.1 the calculation has 3 phases (see Figure 1a); each phase is characterised by one set of physics operating on one sub-mesh of the whole domain, and one of these sub-meshes is contained within another.
- In the simpler case in 2.2 the thermal behaviour affects the whole domain, whilst the flow and structural aspects affect distinct sub-meshes (see Figure 1b).
- In the more complex case in 2.2 the additional problem of adapting the flow sub-mesh (or part of it) re-emerges.
- In the casting problem in 2.3 the problem has three sub-domains which have their own 'set of physics' (see Figure 1c) - however, one 'set of physics', the flow field, has a dynamically varying load throughout its sub-domain, and two of the sub-domains vary dynamically. Moreover, if the solidified metal is elasto-visco-plastic, then its behaviour is non-homogeneous too.
Figure 1 - Examples of differing mesh structures for various multi-physics solution procedures: a) DFSI, b) thermally driven structural loading and c) solidification
The approach to the solution of the multi-physics problems followed here has used segregated procedures in the context of iterative loops. It is attractive to take the approach of formulating the numerical strategy so that the whole set of equations can be structured into one large non-linear matrix. However, at this exploratory stage of multi-physics algorithm development, a more cautious strategy has been followed, building upon tried and tested single discipline strategies (for flow, structures, etc.) and representing the coupling through source terms, loads, etc. [8]. An added complication here is that separate physics procedures may use differing discretisation schemes; for example, the flow procedure may be cell centred, whilst the structure procedure will be vertex centred.
3. A PRACTICAL STRATEGY
Despite all the above complexity, the multi-physics solution strategy may be perceived, in a generic fashion, as:

      Do 10 i = 1 to number-of-subdomains
        Select subdomain (i)
        Grab data required from database for other subdomain
          solutions which share mesh space with subdomain (i)
        Solve for specified 'cocktail' of physics
   10 Store solution to database

The above 'multi-physics' solution strategy does at least mean that, although a domain may consist of a number of solution sub-domains (which may overlap or be subsets), each specified sub-domain has a prescribed 'cocktail' of physics. However, the sub-domains may be dynamic - they may grow or diminish, depending upon the model behaviour, as for example in the case of shape casting, where the flow domain diminishes as the structural domain grows. Hence, for any parallelisation strategy for closely coupled multi-physics simulation to be effective, it must be able to cope with multiple sub-domains which a) have a specified 'cocktail' of physics that may be transiently non-homogeneous, and b) can change size dynamically during the calculation. Within a single domain, of course, characteristic a) is not uncommon for CFD codes which have developing physics - a shock wave or combustion front, for example. It is the multi-phase nature of the problem which is the specific challenge. The parallelisation strategy proposed is then essentially a two-stage process: a) the multi-physics application code is parallelised on the basis of the single mesh, using primary and secondary partitions for the differing discretisation schemes; and b) the responsibility for determining a mesh partition that gives a high quality load balance devolves to the load balancing tool. This strategy simplifies the code conversion task, but it requires a mesh partitioning tool that has the following capabilities:
- produces a load balanced partition for a single (possibly discontinuous) sub-domain with a non-homogeneous workload per node or element,
- structures the sub-domain partitions so that they minimise inter-processor communication (e.g. the partitions respect the geography of the problem),
- dynamically re-load balances as the physics changes, and
- also tries to structure the partitions so that the usual inter-processor communication costs are minimised.
Of course, this problem is very challenging, as Hendrickson and Devine [9] point out in their review of mesh partitioning/dynamic load balancing techniques from a computational mechanics perspective. From this review, it would currently appear that the only tools with the potential capability to address this conflicting set of challenges are ParMETIS [10,11] and JOSTLE [12,13]. These tools follow a broadly similar strategy - they are implemented in parallel, are multi-level (in that they operate on contractions of the full graph) and employ a range of optimisation heuristics to produce well balanced partitions of weighted graphs against a number of constraints. The tool exploited in this work is JOSTLE [12].
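The two competing objectives that such a tool must optimise, balance of the weighted vertex workload and the weighted edge cut, can be stated precisely. The Python sketch below computes both measures for a candidate partition of a small weighted graph; it is an illustration of the metrics only and does not reflect the internal algorithms or interfaces of JOSTLE or ParMETIS.

    from collections import defaultdict

    def partition_quality(vertex_weights, edge_weights, part):
        # vertex_weights : {vertex: workload per node/element}
        # edge_weights   : {(u, v): communication cost if u and v are separated}
        # part           : {vertex: processor id}
        loads = defaultdict(float)
        for v, w in vertex_weights.items():
            loads[part[v]] += w
        nproc = len(set(part.values()))
        imbalance = max(loads.values()) / (sum(loads.values()) / nproc)
        cut = sum(w for (u, v), w in edge_weights.items() if part[u] != part[v])
        return imbalance, cut

    # Tiny illustrative graph: four weighted vertices on two processors.
    vw = {0: 1.0, 1: 2.0, 2: 1.0, 3: 2.0}
    ew = {(0, 1): 1.0, (1, 2): 1.0, (2, 3): 1.0, (0, 3): 1.0}
    print(partition_quality(vw, ew, {0: 0, 1: 0, 2: 1, 3: 1}))  # -> (1.0, 2.0)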
4. PARALLEL PHYSICA - THE TARGET APPLICATION
For the last decade, a group at Greenwich has focused upon the development of a consistent family of numerical procedures to facilitate multi-physics modelling [4,8]. These procedures are based upon a finite volume approach, but use an unstructured mesh. They now form the basis of a single software framework for closely coupled multi-physics simulation and enable the following key characteristics to be implemented:
- one consistent mesh for all phenomena,
- a measure of compatibility in the solution approaches to each of the phenomena,
- a single database and memory map, so that no data transfer is involved and efficient memory use is facilitated between specific 'physics' modules,
- accurate exchange of boundary data and volume sources amongst phenomena.
This has been achieved in the PHYSICA framework, which:
- uses a finite volume discretisation approach on a 3D unstructured mesh for tetrahedral, wedge and hexahedral elements,
- has fluid flow and heat transfer solvers based on an extension of the SIMPLE strategy, but using a cell centred discretisation with Rhie-Chow interpolation,
- uses melting/solidification solvers based on the approach of Voller and co-workers,
- exploits a solid mechanics solution procedure based upon a vertex centred approximation with an iterative solution procedure and non-linear functionality.
This software has been used to model a wide range of materials and metals processing operations. The parallelisation approach described in section 3 above has been incorporated within the PHYSICA software. This software has ~90,000 lines of FORTRAN code and so the task of converting the code is immense. However, this has been considerably simplified by employing the strategy exemplified above.
A key issue has been the use of primary and secondary partitions to cope with distinct discretisation techniques - flow, etc. are cell centred and the solid mechanics is vertex centred. Using these distinct partitions streamlines the task of code transformation, but leaves a debt for the partitioner/load balancing tool to ensure that the partitions are structured to maximise the co-location of neighbouring nodes and cells on the same processor. Considerable attention was given to these issues at the last PCFD conference [16] and this functionality has now been implemented in JOSTLE [12] - see the example in Figure 2. The approach is straightforward:
- a mesh entity type (e.g. cell) associated with the greatest workloads is selected,
- primary and secondary partitions are generated which both load balance and minimise penetration depth into neighbouring partitions, though it should be noted that
- an in-built overhead is associated with the additional information required for secondary partitions.
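A crude way to see the co-location issue is to derive a vertex (secondary) partition from a given cell (primary) partition by a majority vote over the cells touching each vertex, and then to measure how many cell-vertex incidences end up on the same processor. The sketch below does exactly that; it deliberately ignores load balance, which is what a tool such as JOSTLE must additionally enforce, so it should be read as an illustration of the quality measure rather than of the actual partitioning strategy.

    from collections import Counter

    def secondary_partition(cell_part, cell_vertices):
        # Assign each vertex to the processor owning most of its adjacent cells.
        votes = {}
        for cell, verts in cell_vertices.items():
            for v in verts:
                votes.setdefault(v, Counter())[cell_part[cell]] += 1
        return {v: c.most_common(1)[0][0] for v, c in votes.items()}

    def colocation_fraction(cell_part, vertex_part, cell_vertices):
        # Fraction of cell-vertex incidences placed on the same processor.
        pairs = [(cell_part[c], vertex_part[v])
                 for c, verts in cell_vertices.items() for v in verts]
        return sum(p == q for p, q in pairs) / len(pairs)

    # Two quadrilateral cells sharing an edge, split across two processors.
    conn = {0: [0, 1, 4, 3], 1: [1, 2, 5, 4]}
    primary = {0: 0, 1: 1}
    secondary = secondary_partition(primary, conn)
    print(secondary, colocation_fraction(primary, secondary, conn))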
,s" ' , , /
.
:
Poor secondary partition
/,iii!~~
Good partition from JOSTLE
~ i:~iii~i~
. 9il("'" ~
.2
Figure 2 - Poor and good secondary partitions for two phase calculations; the good secondary partitions are generated by JOSTLE
The upper partition in Figure 2 is poor because, although it is load balanced, it causes rather more communication volume than is required in the lower partition. The impact of a
good secondary partitioning which is both load balanced and minimises penetration depth is illustrated in Figure 3, using the PHYSICA code on a Cray T3E for a modest multi-physics problem with an ~8,000 node mesh.
Figure 3 - Speed-ups for a multi-physics calculation (8,192 element heat and stress problem) on a Cray T3E, showing the impact of balanced versus imbalanced secondary partitions
A final challenge to address here is the issue of dynamic load balancing - there are two key factors:
- the first occurs when the phenomena to be solved change - as in the case of solidification, where metal changes from the fluid to the solid state and activates different solvers, and
- the second occurs when the phenomena become more complex (and so more computationally demanding) as, for example, where a solid mechanics analysis moves from elastic into visco-plastic mode for part of its sub-domain.
In the first case, this is addressed through a redefinition of the physics-specific sub-domains, which are then re-load balanced. In the second case, the overall problem is re-load balanced because of the non-homogeneous, dynamically varying workload in one sub-domain. A strategy to address these issues has been proposed and evaluated by Arulananthan et al. [16]. Figure 4 shows the effect of this dynamic load balancing strategy on the PHYSICA code.
Figure 4 - Sample results for the dynamic load balancing tool running on parallel PHYSICA: a) solidification of a cooling bar, b) time per time-step without DLB, c) time per time-step with DLB, d) overall run time in parallel
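The decision of when to invoke re-load balancing can be reduced to monitoring the measured work per processor against a tolerated imbalance. The fragment below is a minimal, hypothetical sketch of such a trigger; the names and the threshold are illustrative and do not correspond to the actual PHYSICA/JOSTLE interface.

    def needs_rebalance(work_per_proc, tolerance=1.05):
        # True if the most loaded processor exceeds the mean workload by more than `tolerance`.
        mean = sum(work_per_proc) / len(work_per_proc)
        return max(work_per_proc) > tolerance * mean

    # Measured seconds per time-step on four processors; one sub-domain has just
    # switched into a more expensive (e.g. visco-plastic) mode.
    timings = [10.2, 10.4, 10.1, 13.9]
    if needs_rebalance(timings):
        # Re-weight the graph with the new per-element costs and call the partitioner.
        pass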
5. CONCLUSIONS
The move to multi-physics simulation brings its own challenges in the parallelisation of the associated software tools. In this paper we have highlighted the problems with regard to algorithmic structure and computational load balancing. Essentially, the multi-physics problem on a large domain mesh may be perceived as a series of sub-domains (and their meshes), each of which is associated with a specific 'set of physics'. The sub-domains may be distinct, overlap, or even be subsets of each other. What this means is that the code parallelisation tool can proceed on the basis of the large single mesh, with the partitioning/load balancing problem entirely devolved to the mesh partitioning and load
balancing tools. However, these tools need a range of optimisation techniques that can address non-homogeneous sub-domains which may vary in size dynamically. They must be able to produce high quality secondary as well as primary partitions and run in parallel, as well as deliver the usual array of performance characteristics (such as minimising data movement amongst processors). In the work described here, the implementation focus has involved the multi-physics simulation tool, PHYSICA, and the mesh partitioning/dynamic load balancing tool, JOSTLE. Through these tools a range of challenges have been addressed that should lead to scalable performance for strongly inhomogeneous mesh based multi-physics applications, of which CFD is but one component.
6. REFERENCES
1. ANSYS, see http://www.ansys.com
2. ADINA, see http://www.adina.com
3. SPECTRUM, see http://www.centric.com
4. PHYSICA, see http://physica.gre.ac.uk
5. TELLURIDE, see http://lune.mst.lanl.gov/telluride
6. C Farhat, M Lesoinne and N Maman, Mixed explicit/implicit time integration of coupled aeroelastic problems: Three field formulation, geometric conservation and distributed solution. International Journal for Numerical Methods in Fluids, Vol. 21, 807-835 (1995).
7. A K Slone et al., Dynamic fluid-structure interactions using finite volume unstructured mesh procedures, in CEAS Proceedings, Vol. 2, 417-424 (1997).
8. M Cross, Computational issues in the modelling of materials based manufacturing processes. Journal of Comp. Aided Mats Des., 3, 100-116 (1996).
9. B Hendrickson and K Devine, Dynamic load balancing in computational mechanics. Comp. Meth. Appl. Mech. Engg. (in press).
10. G Karypis and V Kumar, Multi-level algorithms for multi-constraint graph partitioning, University of Minnesota, Dept of Computer Science, Tech. Report 98-019 (1998).
11. ParMETIS, see http://www.cs.umn.edu/~metis
12. JOSTLE, see http://www.gre.ac.uk/jostle
13. C Walshaw and M Cross, Parallel optimisation algorithms for multi-level mesh partitioning, Parallel Computing (in press).
14. K McManus et al., Partition alignment in three dimensional unstructured mesh multi-physics modelling, Parallel Computational Fluid Dynamics - Development and Applications of Parallel Technology (Eds C A Lin et al.), Elsevier, 459-466 (1999).
15. C Walshaw, M Cross and K McManus, Multi-phase mesh partitioning (in preparation).
16. A Arulananthan et al., A generic strategy for dynamic load balancing of distributed memory parallel computational mechanics using unstructured meshes. Parallel Computational Fluid Dynamics - Development and Application of Parallel Technology (Eds C A Lin et al.), Elsevier, 43-50 (1999).
Large-Scale Parallel Viscous Flow Computations Using an Unstructured Multigrid Algorithm

Dimitri J. Mavriplis
Institute for Computer Applications in Science and Engineering (ICASE), Mail Stop 403, NASA Langley Research Center, Hampton, VA 23681-2199, U.S.A.

The development and testing of a parallel unstructured agglomeration multigrid algorithm for steady-state aerodynamic flows is discussed. The agglomeration multigrid strategy uses a graph algorithm to construct the coarse multigrid levels from the given fine grid, similar to an algebraic multigrid approach, but operates directly on the non-linear system using the FAS approach. The scalability and convergence rate of the multigrid algorithm are examined on the SGI ORIGIN 2000 and the Cray T3E. An argument is given which indicates that the asymptotic scalability of the multigrid algorithm should be similar to that of its underlying single grid smoothing scheme. For medium size problems involving several million grid points, near perfect scalability is obtained for the single grid algorithm, while only a slight drop-off in parallel efficiency is observed for the multigrid V- and W-cycles, using up to 128 processors on the SGI Origin 2000, and up to 512 processors on the Cray T3E. For a large problem using 25 million grid points, good scalability is observed for the multigrid algorithm using up to 1450 processors on a Cray T3E, even when the coarsest grid level contains fewer points than the total number of processors.
1. INTRODUCTION
Reynolds averaged Navier-Stokes computations using several million grid points have become commonplace today. While many practical problems can be solved to acceptable accuracy with such methods at these resolutions, the drive to more complex problems and higher accuracy is requiring the solution of ever larger problems. For example, the flow over aircraft configurations in off-design conditions, such as high-lift, has been computed with up to 25 million grid points, and cases involving up to 10^8 grid points can be anticipated in the near future [1]. The elusive goal of developing a universally valid turbulence model has also spurred a new interest in large eddy simulation (LES) models, which may ultimately require in excess of 10^9 or even 10^10 grid points for adequate resolution of the relevant eddy sizes, even excluding the thin boundary layer regions. At the same time, the drive towards more complex configurations and faster gridding turnaround time has emphasized the use of unstructured grid methods. While unstructured grid techniques simplify the task of discretizing complex geometries, and
offer great potential for the use of adaptive meshing techniques, they incur additional cpu and memory overheads as compared to block-structured or overset grid methods. On the other hand, unstructured grid methods are well suited for large scale parallelization, since the basic data structures are homogeneous in nature, which enables near perfect load-balancing on very large numbers of processors. Any large-scale solution procedure requires the use of an efficient solver, regardless of the amount of available computer resources. While a simple explicit scheme may achieve the best parallel efficiency on large numbers of processors, its numerical efficiency (i.e. convergence rate) degrades rapidly as the number of grid points is increased. Multigrid algorithms have been shown to provide optimal asymptotic complexity, with the number of operations required for convergence scaling linearly with the number of grid points or unknowns. While other implicit solution techniques can provide fast convergence rates for large problems, FAS multigrid methods avoid the explicit linearization of the non-linear problem, resulting in an algorithm which requires considerably less storage, particularly for unstructured mesh discretizations. Furthermore, because multigrid methods are based on explicit or locally-implicit single grid smoothers, they are also highly memory-latency tolerant, an important consideration in light of current architectural trends, which have seen dramatic increases in the relative memory latency over the last several decades [2,3]. Finally, as will be shown in this paper, multigrid methods can be expected to scale favorably for reasonable size problems on very large numbers of processors. The use of cache-based microprocessor parallel computer architectures is rapidly becoming the dominant approach for large-scale CFD calculations. While parallel vector machines such as the Cray T90 and the NEC SX-4 offer outstanding performance as measured by computational rates, their required use of fast but costly memory has limited the amount of memory available on such machines. This is particularly important for unstructured grid computations, which have traditionally been memory limited. The use of low cost hierarchical (cache-based), high-latency memory systems is somewhat at odds with processors which rely on very long (global) vector lengths for sustaining high computational rates. Hence, the dramatically lower cost of commodity memory has made large scale parallel systems of cache-based microprocessors the most effective architecture for large unstructured grid computations. Other enabling developments for parallel unstructured grid computations include the availability of efficient and robust grid partitioners [4,5], and in particular the appearance of standardized software libraries for inter-processor communication such as the Message Passing Interface (MPI) library [6], which enable code portability and simplify maintenance. In the following three sections, an unstructured mesh multigrid solver designed for turbulent external flow aerodynamic analysis is described. The convergence rate and scalability of this approach are illustrated by two examples in section 5, and a multigrid scalability argument is given in section 6. In section 7 the performance of calculations using up to 2048 processors and 25 million grid points is described, while section 8 discusses the outlook for even larger future calculations.
2. BASE SOLVER
The Reynolds averaged Navier-Stokes equations are discretized by a finite-volume technique on meshes of mixed element types which may include tetrahedra, pyramids, prisms, and hexahedra. In general, prismatic elements are used in the boundary layer and wake regions, while tetrahedra are used in the regions of inviscid flow. All elements of the grid are handled by a single unifying edge-based data-structure in the flow solver [7]. The governing equations are discretized using a central difference finite-volume technique with added matrix-based artificial dissipation. The matrix dissipation approximates a Roe Riemann-solver based upwind scheme [8], but relies on a biharmonic operator to achieve second-order accuracy, rather than on a gradient-based extrapolation strategy [9]. The thin-layer form of the Navier-Stokes equations is employed in all cases, and the viscous terms are discretized to second-order accuracy by finite-difference approximation. For multigrid calculations, a first-order discretization is employed for the convective terms on the coarse grid levels. The basic time-stepping scheme is a three-stage explicit multistage scheme with stage coefficients optimized for high frequency damping properties [10], and a CFL number of 1.8. Convergence is accelerated by a local block Jacobi preconditioner, which involves inverting a 5 x 5 matrix for each vertex at each stage [11,12]. A low-Mach number preconditioner [13-15] is also implemented in order to relieve the stiffness associated with the disparity in acoustic and convective eigenvalues in regions where the Mach number is very small and the flow behaves incompressibly. The low-Mach number preconditioner is implemented by modifying the dissipation terms in the residual as described in [9], and then taking the corresponding linearization of these modified terms into account in the Jacobi preconditioner, a process sometimes referred to as "preconditioning 2" [9,16]. The single equation turbulence model of Spalart and Allmaras [17] is utilized to account for turbulence effects. This equation is discretized and solved in a manner completely analogous to the flow equations, with the exception that the convective terms are only discretized to first-order accuracy.
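For reference, the preconditioned multistage update described above has the generic form w(k) = w(0) - alpha_k dt P^-1 R(w(k-1)) at every vertex. The sketch below shows that structure for the five mean-flow unknowns; the stage coefficients are illustrative placeholders (the actual damping-optimized values are given in [10]), and the residual routine and the inverted block-Jacobi matrices are left as inputs, since the real discretization is far more involved than this schematic.

    import numpy as np

    ALPHAS = (0.6, 0.6, 1.0)   # placeholder stage coefficients; damping-optimized values are in [10]

    def multistage_step(w, residual, block_jacobi_inv, dt_local):
        # w                : (npts, 5) conserved variables per vertex
        # residual         : function returning an (npts, 5) residual array R(w)
        # block_jacobi_inv : (npts, 5, 5) inverted per-vertex preconditioner matrices
        # dt_local         : (npts,) local (CFL-limited) time steps
        w0 = w.copy()
        for alpha in ALPHAS:
            r = residual(w)
            # Apply the 5x5 block-Jacobi preconditioner at every vertex.
            dw = np.einsum('nij,nj->ni', block_jacobi_inv, r)
            # Sign convention: R(w) is the residual of dw/dt + R(w) = 0.
            w = w0 - alpha * dt_local[:, None] * dw
        return w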
3. DIRECTIONAL-IMPLICIT MULTIGRID ALGORITHM
An agglomeration multigrid algorithm [7,18] is used to further enhance convergence to steady-state. In this approach, coarse levels are constructed by fusing together neighboring fine grid control volumes to form a smaller number of larger and more complex control volumes on the coarse grid. While agglomeration multigrid delivers very fast convergence rates for inviscid flow problems, the convergence obtained for viscous flow problems remains much slower, even when employing preconditioning techniques as described in the previous section. This slowdown is mainly due to the large degree of grid anisotropy in the viscous regions. Directional smoothing and coarsening techniques [9,19] can be used to overcome this aspect-ratio induced stiffness. Directional smoothing is achieved by constructing lines in the unstructured mesh along the direction of strong coupling (i.e. normal to the boundary layer) and solving the implicit system along these lines using a tridiagonal line solver. A weighted graph algorithm is used to construct the lines on each grid level, using edge weights based on the stencil
coefficients for a scalar convection equation. This algorithm produces lines of variable length. In regions where the mesh becomes isotropic, the length of the lines reduces to zero (one vertex, zero edges), and the preconditioned explicit scheme described in the previous section is recovered. An example of the set of lines constructed from the two-dimensional unstructured grid in Figure 1 is depicted in Figure 2.
Figure 1. Unstructured Grid for a Three-Element Airfoil; Number of Points = 61,104, Wall Resolution = 10^-6 chords
Figure 2. Directional Implicit Lines Constructed on Grid of Figure 1 by Weighted Graph Algorithm
Figure 3. First Agglomerated Multigrid Level for Two-Dimensional Unstructured Grid Illustrating 4:1 Directional Coarsening in Boundary Layer Region
In addition to using a directional smoother, the agglomeration multigrid algorithm must be modified to take into account the effect of mesh stretching. The unweighted agglomeration algorithm which groups together all neighboring control volumes for a given fine grid vertex [7] is replaced with a weighted coarsening algorithm which only agglomerates the neighboring control volumes which are the most strongly connected to the current fine grid control volume, as determined by the same edge weights used in the line construction algorithm. This effectively results in semi-coarsening type behavior in regions of large mesh stretching, and regular coarsening in regions of isotropic mesh cells. In order to maintain favorable coarse grid complexity, an aggressive coarsening strategy is used in anisotropic regions, where for every retained coarse grid point, three fine grid control volumes are agglomerated, resulting in an overall complexity reduction of 4:1 for the coarser levels in these regions, rather than the 2:1 reduction typically observed for semi-coarsening techniques. In inviscid flow regions, the algorithm reverts to the isotropic agglomeration procedure and an 8:1 coarsening ratio is obtained. However, since most of the mesh points reside in the boundary layer regions, the overall coarsening ratios achieved between grid levels are only slightly higher than 4:1. An example of the first directionally agglomerated level on a two-dimensional mesh is depicted in Figure 3, where the aggressive agglomeration normal to the boundary layer is observed.
4. PARALLEL IMPLEMENTATION
Distributed-memory explicit message-passing parallel implementations of unstructured mesh solvers have been discussed extensively in the literature [20-22]. In this section we focus on the non-standard aspects of the present implementation which are particular to the directional-implicit agglomeration multigrid algorithm. In the multigrid algorithm, the vertices on each grid level must be partitioned across the processors of the machine. Since the mesh levels of the agglomeration multigrid algorithm are fully nested, a partition of the fine grid could be used to infer a partition of all coarser grid levels. While this would minimize the communication in the inter-grid transfer routines, it affords little control over the quality of the coarse grid partitions. Since the amount of intra-grid computation on each level is much more important than the inter-grid computation between each level, it is essential to optimize the partitions on each grid level rather than between grid levels. Therefore, each grid level is partitioned independently. This results in unrelated coarse and fine grid partitions. In order to minimize inter-grid communication, the coarse level partitions are renumbered such that they are assigned to the same processor as the fine grid partition with which they share the most overlap. For each partitioned level, the edges of the mesh which straddle two adjacent processors are assigned to one of the processors, and a "ghost vertex" is constructed in this processor, which corresponds to the vertex originally accessed by the edge in the adjacent processor (c.f. Figure 4 ). During a residual evaluation, the fluxes are computed along edges and accumulated to the vertices. The flux contributions accumulated at the ghost vertices must then be added to the flux contributions at their corresponding physical vertex locations in order to obtain the complete residual at these points. This phase incurs interprocessor communication. In an explicit (or point implicit) scheme, the
updates at all points can then be computed without any interprocessor communication once the residuals at all points have been calculated. The newly updated values are then communicated to the ghost points, and the process is repeated.
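The ghost-vertex bookkeeping therefore amounts to an accumulate-then-broadcast pair wrapped around each residual evaluation. The fragment below mimics this in serial Python with explicit ghost maps; in the actual distributed-memory code each of these loops corresponds to a message-passing exchange, and the data structures shown here are purely illustrative.

    import numpy as np

    def accumulate_ghost_residuals(res, ghost_map):
        # res       : {proc: (npts, nvar) residual array, ghost rows included}
        # ghost_map : {proc: [(ghost_row, owner_proc, owner_row), ...]}
        # In the parallel code this step is an inter-processor communication.
        for proc, ghosts in ghost_map.items():
            for ghost_row, owner, owner_row in ghosts:
                res[owner][owner_row] += res[proc][ghost_row]

    def scatter_updates(q, ghost_map):
        # Copy freshly updated owner values back out to their ghost copies.
        for proc, ghosts in ghost_map.items():
            for ghost_row, owner, owner_row in ghosts:
                q[proc][ghost_row] = q[owner][owner_row]

    # Two 'processors', one shared vertex: row 2 of proc 0 is a ghost of row 0 on proc 1.
    res = {0: np.ones((3, 5)), 1: np.ones((2, 5))}
    ghosts = {0: [(2, 1, 0)]}
    accumulate_ghost_residuals(res, ghosts)   # res[1][0] now holds both contributions
    scatter_updates(res, ghosts)              # ghost copy refreshed with the summed value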
Figure 4. Illustration of Creation of Internal Edges and Ghost Points at Interprocessor Boundaries
Figure 5. Illustration of Line Edge Contraction and Creation of Weighted Graph for Mesh Partitioning; V and E Values Denote Vertex and Edge Weights Respectively
The use of line-solvers can lead to additional complications for distributed-memory parallel implementations. Since the classical tridiagonal line-solve is an inherently sequential operation, any line which is split between multiple processors will result in processors remaining idle while the off-processor portion of their line is computed on a neighboring processor. However, the particular topology of the line sets in the unstructured grid permits partitioning the mesh in such a manner that lines are completely contained within an individual processor, with minimal penalty (in terms of processor imbalance or additional numbers of cut edges). This can be achieved by using a weighted-graph-based mesh partitioner such as the CHACO [4] or METIS [23] partitioners. Weighted graph partitioning strategies attempt to generate balanced partitions of sets of weighted vertices, and to minimize the sum of weighted edges which are intersected by the partition boundaries. In order to avoid partitioning across implicit lines, the original unweighted graph (set of vertices and edges) which defines the unstructured mesh is contracted along the implicit lines to produce a weighted graph. Unity weights are assigned to the original graph, and any two vertices which are joined by an edge which is part of an implicit line are then merged together to form a new vertex. Merging vertices also produces merged edges, as shown in Figure 5, and the weights associated with the merged vertices and edges are taken as the sum of the weights of the constituent vertices or edges. The contracted weighted graph is then partitioned using the partitioners described in [23,24], and the resulting partitioned graph is then de-contracted, i.e. all constituent vertices of a merged vertex are assigned the partition number of that vertex. Since the implicit lines reduce to a single point in the contracted graph, they can never be broken by the partitioning process. The weighting assigned to the contracted graph ensures load balancing and
communication optimization of the final uncontracted graph in the partitioning process. As an example, the two-dimensional mesh in Figure 1, which contains the implicit lines depicted in Figure 2, has been partitioned both in its original unweighted uncontracted form, and by the graph contraction method described above. Figure 6 depicts the results of both approaches for a 32-way partition. The unweighted partition contains 4760 cut edges (2.6% of total), of which 1041 are line edges (also 2.6% of total), while the weighted partition contains no intersected line edges and a total of 5883 cut edges (3.2% of total), i.e. a 23% increase over the total number of cut edges in the non-weighted partition.
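The contraction step itself is simple bookkeeping: every vertex on an implicit line maps to that line's single merged vertex, unit vertex weights are summed, line-internal edges disappear, and parallel edges between the same pair of merged vertices have their weights added. The sketch below illustrates that bookkeeping (it is not the CHACO or METIS interface); the resulting weighted graph is what would then be handed to the partitioner.

    from collections import defaultdict

    def contract_along_lines(n_vertices, edges, lines):
        # n_vertices : number of mesh vertices (unit vertex weights)
        # edges      : iterable of (u, v) mesh edges (unit edge weights)
        # lines      : list of vertex lists, one per implicit line
        v2m = {}
        for m, line in enumerate(lines):
            for v in line:
                v2m[v] = m
        next_id = len(lines)
        for v in range(n_vertices):          # vertices not on a line stay as themselves
            if v not in v2m:
                v2m[v] = next_id
                next_id += 1
        vweights = defaultdict(int)
        for v in range(n_vertices):
            vweights[v2m[v]] += 1
        eweights = defaultdict(int)
        for u, v in edges:
            mu, mv = v2m[u], v2m[v]
            if mu != mv:                     # line-internal edges vanish in the contraction
                eweights[tuple(sorted((mu, mv)))] += 1
        return dict(vweights), dict(eweights), v2m

    # A 2x3 patch: two vertical lines {0,1,2} and {3,4,5} joined by three cross edges.
    vw, ew, _ = contract_along_lines(6, [(0, 1), (1, 2), (3, 4), (4, 5), (0, 3), (1, 4), (2, 5)],
                                     [[0, 1, 2], [3, 4, 5]])
    print(vw, ew)   # -> {0: 3, 1: 3} and {(0, 1): 3}: the lines can no longer be split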
Figure 6. Comparison of Unweighted (left) and Weighted (right) 32-Way Partition of Two-Dimensional Mesh
5. SCALABILITY RESULTS
Two test cases are employed to examine the convergence behavior and scalability of the directional implicit parallel unstructured multigrid solver. The first case consists of a relatively coarse 177,837 point grid over a swept and twisted wing, constructed by extruding a two-dimensional grid over an RAE 2822 airfoil in the spanwise direction. Figure 7 illustrates the grid for this case along with the implicit lines used by the solution algorithm on the finest level. The grid contains hexahedra in the boundary layer and (spanwise) prismatic elements in regions of inviscid flow, and exhibits a normal spacing at the wing surface of 10^-6 chords. Approximately 67% of the fine grid points are contained within an implicit line, and no implicit lines on any grid levels were intersected in the partitioning process for all cases. This case was run at a freestream Mach number of 0.1, an incidence of 2.31 degrees, and a Reynolds number of 6.5 million. The convergence of the directional implicit multigrid algorithm is compared with that achieved by the explicit isotropic multigrid algorithm [7] on the equivalent two-dimensional problem in Figure 8. The directional implicit multigrid algorithm is seen to be much more effective than the isotropic algorithm, reducing the residuals by twelve orders of magnitude over 600 multigrid W-cycles. The second test case involves a finer grid of 1.98 million points over an ONERA M6
wing. The grid was generated using the VGRID unstructured tetrahedral mesh generation program [25]. A post-processing operation was employed to merge the tetrahedral elements in the boundary layer region into prisms [7,1]. The final grid contains 2.4 million prismatic elements and 4.6 million tetrahedral elements, and exhibits a normal spacing at the wall of 10^-6 chords. Approximately 62% of the fine grid points are contained within an implicit line, and no implicit lines on any grid levels were intersected in the partitioning process for all cases. The freestream Mach number is 0.1, the incidence is 2.0 degrees, and the Reynolds number is 3 million. The residuals are reduced by seven orders of magnitude over 600 multigrid W-cycles in this case. This convergence rate is somewhat slower than that achieved on the previous problem, and slower than the rates obtained on two-dimensional problems using the same algorithm [19]. This is typical of the convergence rates obtained by the current algorithm on genuinely three-dimensional problems.
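The quoted residual histories translate into an average per-cycle convergence factor via rho = (residual reduction)^(1/n_cycles). The two-line check below gives roughly 0.955 per W-cycle for the RAE wing case and 0.974 for the ONERA M6 case; it is simply a restatement of the numbers above, not additional data.

    # Average residual reduction factor per multigrid W-cycle.
    rae_wing = (1.0e-12) ** (1.0 / 600)   # 12 orders of magnitude in 600 cycles -> ~0.955
    onera_m6 = (1.0e-7) ** (1.0 / 600)    # 7 orders of magnitude in 600 cycles  -> ~0.974
    print(rae_wing, onera_m6)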
Figure 7. Unstructured Grid and Implicit Lines Employed for Computing Flow over Three-Dimensional Swept and Twisted RAE Wing
Figure 8. Comparison of Convergence Rate Achieved by Directional Implicit Agglomeration Multigrid versus Explicit Agglomeration Multigrid for Flow over Wing
The scalability of the solver for these two cases is examined on an SGI Origin 2000 and a Cray T3E-600 machine. The SGI Origin 2000 machine contains 128 MIPS R10000 195 MHz processors with 286 Mbytes of memory per processor, for an aggregate memory capacity of 36.6 Gbytes. The Cray T3E contains 512 DEC Alpha 300 MHz processors with 128 Mbytes of memory per processor, for an aggregate memory capacity of 65 Gbytes. All the cases reported in this section were run in dedicated mode. Figures 9 and 10 show the relative speedups achieved on the two hardware platforms for the RAE wing case, while Figures 11 and 12 depict the corresponding results for the ONERA M6 wing case. For the purposes of these figures, perfect speedups were assumed
Figure 9. Observed Speedups for RAE Wing Case (177,837 grid points) on SGI Origin 2000
Figure 10. Observed Speedups for RAE Wing Case (177,837 grid points) on Cray T3E-600
Figure 11. Observed Speedups for ONERA M6 Wing Case (1.98 million grid points) on SGI Origin 2000
Figure 12. Observed Speedups for ONERA M6 Wing Case (1.98 million grid points) on Cray T3E-600
on the lowest number of processors for which each case was run, and all other speedups are computed relative to this value. In all cases, timings were measured for the single grid (non-multigrid) algorithm, the multigrid algorithm using a V-cycle, and the multigrid
algorithm using a W-cycle. Note that the best numerical convergence rates are achieved using the W-cycle multigrid algorithm. For the coarse RAE wing case, the results show good scalability up to moderate numbers of processors, while the finer ONERA M6 wing case shows good scalability up to the maximum number of processors on each machine, with only a slight drop-off at the higher numbers of processors. This is to be expected, since the relative ratio of computation to communication is higher for finer grids. This effect is also demonstrated by the superior scalability of the single grid algorithm versus the multigrid algorithms, and of the V-cycle multigrid algorithm over the W-cycle multigrid algorithm (i.e. the W-cycle multigrid algorithm performs additional coarse grid sweeps compared to the V-cycle algorithm). Note that for the RAE wing test case on 512 processors of the T3E, the fine grid contained only 348 vertices per processor, while the coarsest level contained a mere 13 points per processor. While the W-cycle algorithm suffers somewhat in computational performance for coarser grids on high processor counts, the parallel performance of the W-cycle improves substantially for finer grids. Numerically, the most robust and efficient convergence rates are achieved using this cycle. While these results reveal faster single processor computational rates for the Origin 2000, the Cray T3E-600 demonstrates higher scalability. In all cases, the fastest overall computational rates are achieved on the 512 processor configuration of the T3E-600.

6. PARALLEL SCALABILITY OF MULTIGRID

The previous results demonstrate that multigrid algorithms provide good scalability on large numbers of processors for reasonably sized problems. For a single-grid explicit scheme, the ratio of computation to communication remains relatively constant as the number of processors is increased, provided the problem size is increased proportionately. While this is also true for the finest grid levels of a multigrid algorithm, the coarsest grid of a multigrid algorithm must retain a fixed size as the fine grid problem size is increased (and extra grid levels are added). Thus, the parallel efficiency of the coarsest grid of the multigrid algorithm will deteriorate continuously as the number of processors is increased, ultimately reaching a point where there are fewer coarse grid points than processors. This has led to speculation in the past that multigrid methods should scale unfavorably for very large numbers of processors [26]. However, as the problem size is increased, the work on the coarsest grid becomes a smaller fraction of the overall work, and a simple argument can be made which suggests that a well-formulated multigrid algorithm will scale asymptotically to within a constant of its underlying fine grid smoothing algorithm. If N denotes the number of fine grid points for a problem to be solved, and P denotes the number of processors, then the ratio N/P, i.e. the number of grid points per processor, is a measure of the computational work to be performed on the fine grid by each processor.
Similarly, (N/P)^{2/3} represents the surface area of this partition (in three dimensions), which is a measure of the communication to be performed by each processor on the fine grid. The ratio of computation to communication for the fine grid is therefore

(N/P) / (N/P)^{2/3} = (N/P)^{1/3}    (1)
which is constant if N and P are increased proportionately to each other, as explained above. For a multigrid V-cycle which performs one smoothing on each grid level, and where the coarsening factor between fine and coarse grids is 8, the total work per multigrid cycle for each processor is thus:

(N/P) [1 + 1/8 + 1/64 + ...] = (8/7)(N/P)    (2)

whereas the total communication per multigrid cycle is given by:

(N/P)^{2/3} [1 + 1/4 + 1/16 + ...] = (4/3)(N/P)^{2/3}    (3)

so that the ratio of computation to communication for the entire multigrid cycle is given by:

(6/7)(N/P)^{1/3}    (4)
which is similar to that observed for the single grid algorithm to within a multiplicative constant. Therefore, in spite of the poor scalability of the fixed coarse grid problem size, the entire multigrid algorithm can be expected to scale similarly to the fine grid algorithm for increasing problem size on large numbers of processors. Although a W-multigrid cycle operating on 8:1 coarsened grid levels can also be shown to scale favorably, lower coarsening ratios (such as 4:1) will ultimately lead to worse asymptotic scalability and should be avoided. This current argument neglects the inter-grid communication, which is small in the current implementation, and non-existent when a fully nested fine-coarse grid partitioning strategy is employed.
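The estimates in equations (1)-(4) are easy to check numerically. The short sketch below (illustrative only, not from the paper) sums the per-level work and surface terms for a V-cycle with an 8:1 volume coarsening ratio and compares the resulting computation-to-communication ratio with the single-grid value; the 6/7 prefactor of equation (4) emerges directly.

```python
def comp_comm_ratio(n_per_proc, levels, vol_ratio=8.0):
    """Computation/communication ratio for a V-cycle with one smoothing pass per level.

    n_per_proc : fine-grid points per processor (N/P)
    vol_ratio  : volume coarsening factor between consecutive levels (8 -> surface factor 4)
    """
    surf_ratio = vol_ratio ** (2.0 / 3.0)
    work = sum(n_per_proc / vol_ratio**k for k in range(levels))
    comm = sum((n_per_proc ** (2.0 / 3.0)) / surf_ratio**k for k in range(levels))
    return work / comm

n_over_p = 1.0e5
single_grid = n_over_p / n_over_p ** (2.0 / 3.0)          # equation (1): (N/P)^(1/3)
multigrid = comp_comm_ratio(n_over_p, levels=6)
print(multigrid / single_grid)   # approaches 6/7 ~ 0.857 as the number of levels grows
```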
7. LARGE TEST CASE RESULTS

The next test case is intended to demonstrate the capability of running very large cases on large numbers of processors. The configuration involves the external flow over an aircraft with deployed flaps. The freestream Mach number is 0.2, the Reynolds number is 1.6 million, and the experimental flow incidence varies over a range of -4 degrees up to 24 degrees. The computations are all performed at zero yaw angle, and therefore only include one half of the symmetric aircraft geometry, delimited by a symmetry plane. An initial grid of 3.1 million points was generated for this configuration using the VGRID unstructured tetrahedral grid generation package [25,1]. A finer grid containing 24.7 million vertices was then obtained through h-refinement of the initial grid, i.e. by subdividing each cell of the initial grid into eight smaller self-similar cells. The refinement operation was performed sequentially on a single processor of an SGI ORIGIN 2000, and required approximately 10 Gbytes of memory and 30 minutes of CPU time. Figure 13 depicts the surface grid for the initial 3.1 million point mesh in the vicinity of the flap system.
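A quick way to see why one level of uniform h-refinement multiplies the vertex count by roughly a factor of eight is to count the entities it creates: every edge contributes one new mid-edge vertex and every tetrahedron is split into eight children. The following sketch (an illustration under the stated assumptions, not the code used by the author) performs that bookkeeping for counts of the same order as the meshes quoted above.

```python
def h_refine_counts(n_vertices, n_edges, n_tets):
    """Entity counts after one level of uniform 1:8 tetrahedral subdivision.

    New vertices appear only at edge midpoints; each parent tetrahedron
    yields 8 children (4 corner tets plus 4 from the interior octahedron).
    """
    return {
        "vertices": n_vertices + n_edges,
        "tets": 8 * n_tets,
    }

# Rough numbers of the same order as the 3.1 million point aircraft mesh
# (the edge and cell counts here are assumed for illustration, not taken from the paper).
print(h_refine_counts(n_vertices=3_100_000, n_edges=21_000_000, n_tets=18_000_000))
```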
Figure 13. Illustration of Surface Grid for Initial 3.1 million Point Grid for ThreeDimensional High-Lift Configuration
Figure 14. Convergence Rate for Coarse (3.1 million pt) Grid using 5 Multigrid Levels and Low Mach Number Preconditioning and Fine (24.7 million pt) Grid using 6 Multigrid Levels and no Preconditioning at 0.2 Mach Number and 10 degrees Incidence
The convergence history obtained by the multigrid algorithm for the initial and refined grids at an incidence of 10 degrees is shown in Figure 14. In both cases, the line implicit algorithm is employed as a smoother, but the directional agglomeration strategy has been abandoned in favor of the simpler isotropic agglomeration strategy. Memory savings are realized from the faster (8:1) coarsening rates achieved by the isotropic agglomeration algorithm (as opposed to 4:1 for the directional algorithm), and, from the preceding argument, the overall multigrid algorithm can be expected to scale asymptotically to within a constant of the single grid algorithm. On the other hand, as a result of the isotropic agglomeration procedure, the convergence rates for these cases are slower than those observed in the previous two cases. However, the multigrid algorithm still delivers convergence rates which are relatively insensitive to the overall grid resolution, as demonstrated by the results of Figure 14. The 3.1 million point grid case has been run on a variety of machines. The scalability of this case on the Cray T3E and SGI Origin 2000 is similar to that illustrated in Figures
11 and 12. This case requires a total of 7 Gbytes of memory and 80 minutes on 128 processors (250 MHz) of the ORIGIN 2000, or 62 minutes on 256 processors of the Cray T3E-600, for a 500 multigrid cycle run.
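The memory and work overhead of adding coarse levels is governed by the geometric series 1 + 1/r + 1/r^2 + ..., where r is the coarsening ratio; the sketch below (illustrative only) compares the asymptotic overhead factors for the 8:1 isotropic and 4:1 directional agglomeration rates mentioned above.

```python
def multigrid_overhead(coarsening_ratio, levels=None):
    """Total storage/work relative to the fine grid alone.

    With an unlimited number of levels the factor tends to r / (r - 1).
    """
    r = float(coarsening_ratio)
    if levels is None:
        return r / (r - 1.0)
    return sum(r**-k for k in range(levels))

print(multigrid_overhead(8))   # ~1.14 for 8:1 isotropic agglomeration
print(multigrid_overhead(4))   # ~1.33 for 4:1 directional agglomeration
```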
The scalability of the single grid and multigrid algorithms for this case on the IBM-based ASCI Blue Pacific machine and the Intel-based ASCI Red machine is depicted in Figures 15 and 16. The ASCI Blue Pacific machine, located at Lawrence Livermore National Laboratory in California, consists of 320 nodes, with each node containing 4 IBM 332 MHz 604e shared memory processors. The results in Figure 15 employed all four processors at each requested node, which was found to incur a 20% timing penalty over an approach which employed only a single processor per node, using four times as many nodes. This penalty is likely due to the fact that the available node bandwidth must be shared between the four processors in this node, but this approach is necessary for accessing large numbers of processors. In all cases, a purely MPI-based implementation has been used. Good scalability is obtained up to approximately 256 processors, after which the parallel efficiency begins to drop off. This is partly due to the relatively small size of the problem for this number of processors. The ASCI Red machine, located at Sandia National Laboratory in New Mexico, contains up to 4500 dual-CPU nodes. The individual CPUs consist of 333 MHz Intel Pentium Pro processors. The results in Figure 16 only made use of a single CPU per node, since there is no way to access both processors on a node with a purely MPI-based code. On this machine, good scalability is observed up to 2048 processors for the single grid case, and up to 1024 processors for the multigrid case. The multigrid case would likely scale well at 2048 processors, although such a run has not been performed to date.
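The "purely MPI-based implementation" referred to above relies on point-to-point exchange of partition-boundary data. The fragment below is a generic, hedged sketch of such a halo (ghost-vertex) exchange written with mpi4py for brevity; the neighbor lists, buffer layout, and function names are assumptions made for the example and do not reproduce the solver's actual data structures.

```python
import numpy as np
from mpi4py import MPI

def exchange_halo(comm, field, send_idx, recv_idx):
    """Exchange ghost-vertex values with all neighboring partitions.

    field    : local solution array (n_local_vertices, n_unknowns)
    send_idx : {neighbor_rank: indices of owned vertices to send}
    recv_idx : {neighbor_rank: indices of ghost vertices to overwrite}
    """
    requests, recv_bufs, send_bufs = [], {}, {}
    for nbr, idx in recv_idx.items():
        recv_bufs[nbr] = np.empty((len(idx), field.shape[1]), dtype=field.dtype)
        requests.append(comm.Irecv(recv_bufs[nbr], source=nbr, tag=0))
    for nbr, idx in send_idx.items():
        # keep a reference to the contiguous send buffer until the sends complete
        send_bufs[nbr] = np.ascontiguousarray(field[idx])
        requests.append(comm.Isend(send_bufs[nbr], dest=nbr, tag=0))
    MPI.Request.Waitall(requests)
    for nbr, idx in recv_idx.items():
        field[idx] = recv_bufs[nbr]
```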
Figure 15. Observed Speedups for 3.1 million point grid aircraft case on ASCI Blue Pacific Machine
Figure 16. Observed Speedups for 3.1 million point grid aircraft case on ASCI Red Machine
The 24.7 million point case was run on the Cray T3E-600 machine using 512 processors. This case requires 52 Gbytes of memory, and 4.5 hours for 500 multigrid cycles, which includes 30 minutes of I/O time to read the grid file (9 Gbytes) and write the solution file (2 Gbytes). The fine grid case was also benchmarked on a larger Cray T3E-1200E machine. The Cray T3E-1200E contains 600 MHz DEC Alpha processors as well as an upgraded communication chip, as compared to the previously mentioned T3E-600 (300 MHz processors). This particular machine contained 1520 processors, each with a minimum of 256 Mbytes per processor. Figure 17 depicts the speedups obtained by the single grid, and the five level and six level multigrid runs on the 24.7 million point grid running on 256, 512, 1024 and 1450 processors. The single grid computations achieve almost perfect scalability up to 1450 processors, while the speedups achieved by the multigrid runs are only slightly below the ideal values. The six level multigrid case could not be run on the maximum number of processors, since the partitioning of the coarsest level resulted in empty processors with no grid points. While this does not represent a fundamental problem, the software was not designed for such situations. In any case, the five level multigrid runs are the most efficient overall, since there is little observed difference in the convergence rate between the five and six level multigrid runs. The single grid results are included simply for comparison with the multigrid algorithm, and are not used for actual computations since convergence is extremely slow. The computation times are depicted in Table 1. On 512 processors of the Cray T3E-1200E, the 5 level multigrid case requires 19.7 seconds per cycle, as compared to 28.1 seconds per cycle on the 512 processor Cray T3E-600, which corresponds to an increase in speed of over 40% simply due to the faster individual processors. On 1450 processors, the same case required 7.54 seconds per cycle, or 63 minutes of computation for a 500 multigrid cycle run. A complete run required 92 minutes, which includes 29 minutes of I/O time, although no attempt at optimizing I/O was made.
Figure 17. Observed Speedups for 24.7 million point Grid Case on 1520 Processor Cray T3E-1200E

Table 1
Timings and Estimated Computational Rates for 24.7 million point Grid Case on Various Cray T3E Configurations; Computational Rates are obtained by linear scaling according to wall clock time with smaller problems run on the Cray C90 using the hardware performance monitor for Mflop ratings

24.7 Million Pt Case (5 Multigrid Levels)
Platform      Procs   Time/Cyc (s)   Gflop/s
T3E-600         512       28.1         22.0
T3E-1200E       256       38.3         16.1
T3E-1200E       512       19.7         31.4
T3E-1200E      1024       10.1         61.0
T3E-1200E      1450        7.54        82.0
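Using the T3E-1200E rows of Table 1, the parallel efficiency relative to the 256-processor run can be recovered directly from the time-per-cycle figures; the small sketch below does exactly that (the helper name and layout are just for illustration).

```python
def relative_efficiency(runs):
    """runs: list of (processors, seconds_per_cycle), smallest run first."""
    p0, t0 = runs[0]
    return {p: (t0 * p0) / (t * p) for p, t in runs}

# T3E-1200E timings from Table 1
t3e_1200e = [(256, 38.3), (512, 19.7), (1024, 10.1), (1450, 7.54)]
print(relative_efficiency(t3e_1200e))
# -> roughly {256: 1.0, 512: 0.97, 1024: 0.95, 1450: 0.90}
```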
8. CONCLUSIONS

While the calculations described in this paper have demonstrated good overall performance for large problems on thousands of processors, these calculations are limited by the pre-processing operations, such as grid partitioning and coarse multigrid level construction, which are currently performed sequentially on a single processor using shared memory machines. The parallelization of these operations for distributed memory computer architectures is required before much larger calculations can be attempted. I/O issues, including bandwidth, file size, and file transfer between machines, are also a serious concern for calculations of this size, and must be addressed in future work. While a distributed memory approach has been adopted, there is a growing trend to employ clusters of mid-sized shared memory machines. This is perhaps due to the fact that the most marketable scientific computer architectures are mid-sized shared memory machines. Clustering such machines together is seen as a cost-effective approach to building customized very high performance systems. An effective programming model for such architectures may involve the use of mixed shared memory (using OpenMP) and distributed memory (using MPI) libraries. An extension of the current solver to a mixed shared-distributed memory model is currently under way.

9. ACKNOWLEDGEMENTS
Special thanks are due to David Whitaker and Cray Research for dedicated computer time, and to S. Pirzadeh for his grid generation expertise. This work was partly funded by an Accelerated Strategic Computing Initiative (ASCI) Level 2 grant from the US Department of Energy.

REFERENCES
1. D. J. Mavriplis and S. Pirzadeh. Large-scale parallel unstructured mesh computations for 3D high-lift analysis. AIAA paper 99-0537, presented at the 37th AIAA Aerospace Sciences Meeting, Reno, NV, January 1999.
2. D. E. Keyes, D. K. Kaushik, and B. F. Smith. Prospects for CFD on petaflops systems. In CFD Review, M. Hafez et al., eds., Wiley, New York, 1997.
3. D. J. Mavriplis. On convergence acceleration techniques for unstructured meshes. AIAA paper 98-2966, presented at the 29th AIAA Fluid Dynamics Conference, Albuquerque, NM, June 1998.
4. B. Hendrickson and R. Leland. The Chaco user's guide: Version 2.0. Tech. Rep. SAND94-2692, Sandia National Laboratories, Albuquerque, NM, July 1995.
5. G. Karypis and V. Kumar. A fast and high quality multilevel scheme for partitioning irregular graphs, to appear in SIAM J. Sci. Comput., 1998.
6. W. Gropp, E. Lusk, and A. Skjellum. Using MPI: Portable Parallel Programming with the Message Passing Interface. MIT Press, Cambridge, MA, 1994.
7. D. J. Mavriplis and V. Venkatakrishnan. A unified multigrid solver for the Navier-Stokes equations on mixed element meshes. International Journal for Computational Fluid Dynamics, 8:247-263, 1997.
8. P. L. Roe. Approximate Riemann solvers, parameter vectors and difference schemes. J. Comp. Phys., 43(2):357-372, 1981.
9. D. J. Mavriplis. Multigrid strategies for viscous flow solvers on anisotropic unstructured meshes. In Proceedings of the 13th AIAA CFD Conference, Snowmass, CO, pages 659-675, June 1997. AIAA Paper 97-1952-CP.
10. B. van Leer, C. H. Tai, and K. G. Powell. Design of optimally-smoothing multi-stage schemes for the Euler equations. AIAA Paper 89-1933, June 1989.
11. E. Morano and A. Dervieux. Looking for O(N) Navier-Stokes solutions on non-structured meshes. In 6th Copper Mountain Conf. on Multigrid Methods, pages 449-464, 1993. NASA Conference Publication 3224.
12. N. Pierce and M. Giles. Preconditioning on stretched meshes. AIAA paper 96-0889, January 1996.
13. J. M. Weiss and W. A. Smith. Preconditioning applied to variable and constant density time-accurate flows on unstructured meshes. AIAA Paper 94-2209, June 1994.
14. E. Turkel. Preconditioning methods for solving the incompressible and low speed compressible equations. J. Comp. Phys., 72:277-298, 1987.
15. B. van Leer, E. Turkel, C. H. Tai, and L. Mesaros. Local preconditioning in a stagnation point. In Proceedings of the 12th AIAA CFD Conference, San Diego, CA, pages 88-101, June 1995. AIAA Paper 95-1654-CP.
16. E. Turkel. Preconditioning-squared methods for multidimensional aerodynamics. In Proceedings of the 13th AIAA CFD Conference, Snowmass, CO, pages 856-866, June 1997. AIAA Paper 97-2025-CP.
17. P. R. Spalart and S. R. Allmaras. A one-equation turbulence model for aerodynamic flows. La Recherche Aérospatiale, 1:5-21, 1994.
18. M. Lallemand, H. Steve, and A. Dervieux. Unstructured multigridding by volume agglomeration: Current status. Computers and Fluids, 21(3):397-433, 1992.
19. D. J. Mavriplis. Directional agglomeration multigrid techniques for high-Reynolds number viscous flows. AIAA paper 98-0612, January 1998.
20. D. J. Mavriplis, R. Das, J. Saltz, and R. E. Vermeland. Implementation of a parallel unstructured Euler solver on shared and distributed memory machines. The J. of Supercomputing, 8(4):329-344, 1995.
21. Special course on parallel computing in CFD. AGARD Report 807, May 1995.
22. V. Venkatakrishnan. Implicit schemes and parallel computing in unstructured grid CFD. In VKI Lecture Series VKI-LS 1995-02, March 1995.
23. G. Karypis and V. Kumar. A fast and high quality multilevel scheme for partitioning irregular graphs. Technical Report 95-035, University of Minnesota, 1995. A short version appears in Intl. Conf. on Parallel Processing 1995.
24. B. Hendrickson and R. Leland. A multilevel algorithm for partitioning graphs. In Proceedings of Supercomputing '95, ACM, December 1995.
25. S. Pirzadeh. Viscous unstructured three-dimensional grids by the advancing-layers method. AIAA paper 94-0417, January 1994.
26. P. O. Fredrickson and O. A. McBryan. Parallel superconvergent multigrid. In S. McCormick, editor, Multigrid Methods, Theory, Applications, and Supercomputing, volume 110, pages 195-210. Marcel Dekker, New York, 1988.
CONTRIBUTED PAPERS
Efficient Parallel Implementation of a Compact Higher-Order M a x w e l l Solver Using Spatial Filtering Ramesh K. Agarwal National Institute for Aviation Research, Wichita State University, Wichita, KS 67260-0093, USA This paper describes the parallel implementation of a Maxwell equations solver on a cluster of heterogeneous workstations. The Maxwell solver, designated ANTHEM, computes the electromagnetic scattering by two-dimensional perfectly-conducting and dielectric bodies by solving Maxwell equations in frequency domain. The governing equations are derived by writing Maxwell equations in conservation-law form for scattered field quantities and then assuming a single-frequency incident wave. A pseudo-time variable is introduced, and the entire set of equations is driven to convergence by an explicit/pointimplicit four-stage Runge-Kutta time-marching finite-volume scheme. Higher-order implicit spatial filtering is used in conjunction with a higher-order compact scheme for spatial discretization, to filter undesirable oscillations. Far-field boundary conditions are computed using the method of Giles and Shu. Results are compared with known analytic solutions and the method-of-moments. The results indicate that, although compact schemes satisfy the stringent requirements of an accurate electromagnetic computation, spatial filtering is important as it provides an excellent alternative to grid refinement at a fraction of the computational cost. Parallelization on heterogeneous workstations is accomplished by using PVM software for process control and communication. Excellent parallel efficiency is obtained, although it is somewhat dependent on the domain decomposition strategy. 1. INTRODUCTION Recently, there has been much interest in developing numerical methods for accurate determination of electromagnetic signatures to assist in the design and analysis of low observables. The method-of-moments (MoM) has long been used for calculation of surface currents for predicting radar-cross-sections (RCS) of scattering bodies. Although MoM has been established as the most direct and accurate means of predicting RCS of twodimensional bodies and bodies of revolution, in general, MoM is difficult to apply to arbitrary three-dimensional bodies and geometrically complex dielectrics, and loses accuracy for high-frequency incident waves. In recent years, alternatives to MoM have been developed. Many of these new techniques are based on the landmark contribution of Yee [1]. Finite-difference (FD) techniques have been developed by Goorjian [2], Taflove [3], Shankar et al. [4], and Britt [5], among others. Finite-element methods (FEM) have been applied by McCartin et al. [6] and Bahrmasel and Whitaker [7], among others. Hybrid
76 FD/MoM and FEM/MoM techniques have also been investigated by Taflove and Umashankar [8], and Wang et al. [9] respectively. All of these alternatives have the potential to overcome one or more of the limitations of MoM techniques. Recently, a new computational electromagnetics (CEM) method has been presented for computing solutions to Maxwell Equations [ 10]. In this method, Maxwell equations are cast in the hyperbolic conservation law form in curvilinear coordinates. The equations are solved in the time domain using the method of lines, which decouples the temporal terms from the spatial terms. For frequency domain calculations, time dependence is taken out of the conservation form of the equations by assuming a single frequency incident wave, and a pseudo-time variable is introduced; this formulation then allows the same algorithm to be applicable for frequency domain calculations as employed for time domain calculations. An explicit node based finite volume algorithm is developed wherein the spatial terms are discretized using a four-stage explicit/point-implicit Runge-Kutta time stepping scheme. A sixth-order explicit compact dissipation operator is added to stabilize the algorithm and damp the unwanted oscillations. A novel analytic treatment is developed for both the dielectric and far-field radiation boundary conditions. The CEM solver based on this algorithm is capable of computing scattering by geometrically complex two-dimensional dielectric bodies in both the frequency domain and time domain. The CEM solver, designated ANTHEM, has been validated by performing computations in frequency domain for perfectly conducting and dielectric scattering bodies such as cylinders, airfoils, and ogives. The CEM solver ANTHEM was recently parallelized on a cluster of heterogeneous workstations [ 11] by using PVM sottware for process control and communication [12]. The influence of domain decomposition strategies on parallel efficiency, especially for dielectric bodies which include material interfaces, was discussed. In this paper, the convergence characteristics of ANTHEM are significantly enhanced by implementing an implicit sixth-order compact filter suggested by Visbal and Gaitonde [ 13]. This type of filter is very effective in eliminating the unwanted high-frequency waves, the socalled q-waves [14]. The improved ANTHEM is again parallelized on a cluster of heterogeneous workstations using PVM. Spatial filtering results in 30% fewer iterations needed for convergence to a desired accuracy. 2. GOVERNING EQUATIONS In the absence of electric sources, the two-dimensional Maxwell equations can be written in conservation law form [ 10] as OQ OF ~-I- ~ +
(1)
= O.
For transverse electric (TE) polarization,
Q-
Dx , F -
D,
,andG-
(,B~/
--
/
.
(2)
77 For transverse magnetic (TM) polarization,
Q-
Bx ,F-
0
By
,andG-
D
(3)
.
Dz/E
In Equations (2) and (3), D = (Dx, Dr, Dz) is the electric field displacement and B = (Bx, By, Bz) is the magnetic field induction, which are related to the electric and magnetic field intensities through the permitivity ~ and the permeability ~t as follows D = EE and B = gH.
(4)
Since the governing equations are linear, an assumption that the incident field is harmonic in time with a frequency ω results in a time-dependent total field that is also harmonic with frequency ω. Thus the Maxwell equations can be recast as a set of steady-state complex equations by the use of the single-frequency assumption [10]:

E = Re(Ẽ(x,y) e^{-iωt}),   H = Re(H̃(x,y) e^{-iωt}),   D = Re(D̃(x,y) e^{-iωt}),   and   B = Re(B̃(x,y) e^{-iωt}).    (5)
In Equation (5), tilde denotes a complex quantity and 9~ denotes the real part. The governing Equation (1) becomes 0F
- ico Q + ~
&
c3G
+ --
0y
- o.
(6)
It should be noted that no generality has been lost due to the single frequency assumption. Since the governing equations are linear, the solution for a general case with an arbitrary incident wave can be obtained by performing a Fourier decomposition of the incident wave and computing the solution for each relevant frequency in the series. Then, the solution for an arbitrary incident wave can be obtained by a superposition of single frequency solutions by applying the Fourier theorem. Solutions to the Maxwell Equations (6) can now be computed in the frequency domain through one of a multitude of numerical integration schemes. In practice, however, it is more efficient to first recast the equations in the "scattered" form. Near the far-field, the amplitude of scattered waves is expected to be small and the amplitude of the incident waves is relatively large. In solving for the total field, errors result from computing the incident field itself as well as from computing the scattered field. By recasting the equations in the scattered form, these incident field errors are eliminated and thus grid resolution may be decreased near the far-field. The total-field value is obtained as a sum of the known incident value and the scattered value, D-D~
+D,
B-B
i +B,
E-E~
+E,
and
H-H
i +H~,
(7)
78
where the subscripts i and s denote the incident and scattered contributions. The incident field is defined to be the field that would exist if no scattering objects were present. The governing Equation (6) now becomes:
- ic0 Q~ +
c3F, c3G i + : S, 0x 0y
where
--~ S - ic0 Q i
c~Fi 0x
cqG i 0y
(8)
Since S contains the incident wave diffraction terms, S = 0 when the local dielectric constants ~ and ~t are equal to the free-space dielectric constants e 0 and la0 . Finally, a pseudo-time variable is introduced in Equation (8) and the final form of the governing equation in the frequency domain is obtained as 0Q + c3Fs + c3G s -ir c3t* c3z ay
- s.
(9)
The use of pseudo-time t* makes it possible to adopt the time dependent algorithms and convergence acceleration methods developed in Computational Fluid Dynamics (CFD) such as local time-stepping and multigrid techniques.
3. BOUNDARY CONDITIONS 3.1. Surface Boundary Conditions For a perfectly conducting surface, the boundary conditions are h x E s - - h x E i (electric wall) and h x H s - - h x Hi (magnetic wall)
(10)
where h is the unit vector normal to the surface. At a material interface, the boundary conditions are (11) The algorithm must take into account for any variation in e and ~t at the interface. For resistive sheets,
have a jump across the sheet. For an impedance layer, the boundary condition is
hxhxE--qhxH.
(12)
79 3.2. Far-Field Boundary Conditions A correct mathematical model would extend the physical domain to infinity and would require all scattered quantities to vanish. This is impractical numerically; a set of well-posed boundary conditions must be placed on a finite domain. The far-field conditions become more accurate as the boundary is placed farther away from the source; however, the computational efficiency is necessarily decreased as the numerical domain is increased. Therefore, it is desirable to implement a set of conditions which would allow for the smallest possible computational domain without compromising accuracy or stability. Many researchers have developed effective far-field boundary conditions for hyperbolic systems [10, 15-20]. The current methodology employs the characteristics (similarity) form of the radiation boundary conditions as discussed by Giles [17] and Shu [10]. The far-field boundary condition proposed by Giles [17] and Shu [10] yields accurate solutions on a substantially smaller domain. For a circular far-field boundary, Shu [10] has derived these conditions using the similarity form of the Maxwell Equations (9) and has shown these conditions to be equivalent to the exact integral form of the Sommerfeld radiation condition. These boundary conditions have been implemented in the Maxwell solver ANTHEM. The details of the methodology are given in Shu [10].
4. NUMERICAL M E T H O D 4.1. Spatial Discretization The set of weakly conservative governing Equations (9) is solved numerically using a finite volume node-based scheme described in [10] and [19]. The physical domain (x, y) is mapped to the computational domain (~,,q) to allow for arbitrarily-shaped bodies. The semidiscrete form of the method can be written as 2
~u ,
2
1+ ~;/6 , Fi,j
----~Gl+sn 2 i,j -i~
i,j
- JS i,j .
(13)
A~ and Arl are defined to be 1 and hence have been omitted. The vectors F' and G' are the curvilinear flux vectors and are defined as F' = Fy n - Gxn
and
G' = -Fy~ - Gx~.
(14)
Hence,
d - x ~ y n - x n y ~,
iS~U+ 89
j
and
~t~U+ 89
1
(15)
The spatial discretization in the~, and q directions is based on the classical fourth-order Pad6 scheme. The Pad6 scheme is based on an extrapolation of second-order accurate differences and requires inversion of a tri-diagonal matrix. The accuracy of the Pade scheme has been shown to be significantly higher than that of the standard central differencing [ 10, 19]. The resolution of second-order central differencing with 20 points per wavelength is matched by compact differencing (Pade scheme) with roughly 8 points per wavelength.
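As a concrete illustration of the compact differencing idea, the classical fourth-order Padé first-derivative scheme on a uniform mesh couples neighboring derivative values through a tridiagonal system, (1/4) f'_{i-1} + f'_i + (1/4) f'_{i+1} = 3 (f_{i+1} - f_{i-1}) / (4h). The sketch below is illustrative only (the interior scheme on a periodic mesh, not the solver's actual boundary closures) and solves the small coupled system with a dense linear solve for simplicity.

```python
import numpy as np

def pade_first_derivative_periodic(f, h):
    """Fourth-order compact (Pade) first derivative on a uniform periodic mesh.

    Interior scheme: (1/4) f'_{i-1} + f'_i + (1/4) f'_{i+1}
                     = 3 (f_{i+1} - f_{i-1}) / (4 h)
    """
    n = len(f)
    A = np.eye(n)
    rhs = np.empty(n)
    for i in range(n):
        im, ip = (i - 1) % n, (i + 1) % n
        A[i, im] += 0.25
        A[i, ip] += 0.25
        rhs[i] = 3.0 * (f[ip] - f[im]) / (4.0 * h)
    return np.linalg.solve(A, rhs)

# quick check against an exact derivative
x = np.linspace(0.0, 2.0 * np.pi, 64, endpoint=False)
err = np.max(np.abs(pade_first_derivative_periodic(np.sin(x), x[1] - x[0]) - np.cos(x)))
print(err)   # small; halving h reduces the error by roughly 16x (fourth order)
```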
4.2. Filtering The hyperbolic nature of the governing equations admits only those solutions that are purely oscillatory. The current spatial discretization technique, however, Suffers from dispersive errors. The compact finite difference Equation (13) admits oscillatory solutions which are nonphysical. These nonphysical wave modes are generated in the regions of insufficient grid resolution or at the far-field due to reflections. It is unlikely that there will be sufficient grid resolution for all relevant length scales, especially where there are sharp gradients in body geometry; it is also unlikely that the far-field boundary condition would admit absolutely no reflections. Some dissipation, therefore, must be present in the discretization to dampen the unwanted wave modes. The goal of the filtering scheme is thus to efficiently annihilate those modes which are not realizable by the discretization technique, while leaving the resolved modes unaffected. Discretization methods such as Lax-Wendroff or upwinding techniques have natural dissipation present in the scheme and no explicit filtering is required. The naturally dissipative methods, however, do not allow any control of the amount of dissipation other than by refining the grid; hence the methods may be overly dissipative in the regions where the grid is coarse. The current finite volume scheme is augmented with a sixth-order filter or dissipation in both the { and n directions. The discrete Equation (13) then is modified by adding a dissipative term Di.j, such that 2 d jQ~,j + ~t~l.t~8~ , l .2t ~ , ~ ~ ~ G i , dt 1+5~2/6 F~,j + 1+15n2/6
, j - if.oJQi
'j
-
JSi, j +
D~j. '
(16)
Explicit Compact Filter: The sixth-order explicit compact filter is given by [ 10]
Di, j
-
J 5 J 5 V6~~( - ~ 8 ~ ) Q - v6~rl ( - ~ 6 ~ ) Q ,
(17)
where A i is the time step when the CFL number is 1. The dissipation term in the direction scales as Ax 5, and the dissipation term in the rl direction scales as Ay 5, thus the fourth-order accuracy in both ~ and q directions is formally preserved. The scheme, nevertheless, may still be overly dissipative in coarse grid regions, and hence it is important to minimize the dissipation coefficient v 6 to preserve the accuracy, v 6 is generally between 0.0008 and 0.001. Implicit Compact Filter: Following the work of Visbal and Gaitonde [ 13 ] and Koutsavdis et al. [ 14], a sixth-order compact implicit filter is employed to enhance the efficiency of the calculations by filtering high-frequency oscillations, called q-waves. The implicit filter is given by the expressions ^
--U/Za" 9
n:0
Z
+Q
)
(18a)
81 and N/2 an Z - -2( Q i , j + .=0
" ^ " O { ' f O i , j - i - + - O i 'j + C ~ ' f O i , j + l -
n - t - Q i , .i-n)'
(18b)
,x
where O~,j and Q,,j denote the filtered functions at grid point Off) and N is the order of the filter. The coefficients of the right-hand side of Equations (18a) and (18b) are derived in terms of a single parameter C~y. The transfer function of the above filter is given by [ 13 ]: ~..~N/2
COS(n(J))
o:0
(19)
1 + 2off cos(o) ' where co is the normalized wave number co = 2nk/N. By eliminating the odd-even mode and by matching the Taylor series coefficients, Visbal and Gaitonde [13] determined the coefficients an for a given N. The value of parameter o~f is chosen between 0.3 and 0.5, c~j = 0.5 being the upper limit for filtering high wave numbers. For the implicit sixth-order compact filter (N = 6), Dij in Equation (16) is then taken as ^
D,,j - O;,a. + Q i , j .
4.3. Time
(20)
Integration
A four-stage Runge-Kutta scheme is used to integrate the governing equations in the pseudo-time plane. The integration method is point implicit for the source term -ico Q, which alleviates the stiffness due to large values of co [21 ]. The time integration is computed as follows: n Q Oj _ Q;,j
o
&
- Q,,, ) -
1 At, j o ' (Ri, ~ - o,,, J
o -c~ 2 Ati,j 0 $2 (Q 2j. _ Q i,j ) J (Rlia. - D i,j ) 3
0
S3(Qi, j - Q o .' ) - - a
0 $4 (Q,~j- Q,,j) -
n+l
Qi,:.
4
- Qi,j ,
3
--0{' 4
At;
j ,J
o
(R,,2j- Di,j)
Ati j , : (R;,j3
_O
i,j 0 )
(21)
82
and cz~ -_1 (~2 _ 1 cz3 _ _1 or4 = 1 , and S t - 1 + c~~c0At, 4' 3' 2' 2
R,,j
0 ' ~ ~6 ~
2
'
= 2/-----~Fi, j +
1+6~
~t ~ ~,~ 6________~ , _ i co J Q
1+ 62/6 ,~ G~j
;j
- J S ~,j ,
and D;.j is the dissipation term given by Equation (17) or (20). A single evaluation of the dissipative terms is required and the scheme is stable provided that the CFL number does not exceed 2~/2. Incorporating the formulation and algorithm described in Sections 2-4, a computational code, designated as ANTHEM, has been created; it has been validated by performing computations for electromagnetic scattering from perfectly conducting and dielectric 2-D bodies such as cylinders, airfoils, and ogives [ 10].
5. P A R A L L E L IMPLEMENTATION The CEM solver ANTHEM has been parallelized on a cluster of heterogeneous workstations such as the HP750, SGI Indigo and IBM RS/6000 by using the Parallel Virtual Machine (PVM) software for process control and communication [12]. In parallel implementation, we employ a master/slave relationship with explicit message massing. In a zonal computation, zones are distributed to different workstations and are solved in parallel. To prevent paging, each processor allows only one zone to be in the solution process at a time. An initial attempt at load balancing is done by estimating the relative work required to solve a zone and then assigning the most intensive zone to the fastest processor and proceeding with the less intensive zones to slower processors. If there are more zones than processors, then the first to finish gets the next zone. The entire grid and the field variables are read by the slave only when first initiated. Boundary data is exchanged in the solution process. The field variables are periodically written back to the master processor to allow for restarting in case of failure. For achieving an efficient implementation, several issues dealing with processor scheduling, load balancing, inter-zone communication, slave task I/O, and fault tolerance were identified and addressed.
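The scheduling policy described above ("most intensive zone to the fastest processor, first to finish gets the next zone") is a simple greedy list-scheduling heuristic. The fragment below is a hedged sketch of that policy in plain Python, independent of PVM; the zone costs, relative processor speeds, and all names are assumptions introduced for the illustration.

```python
import heapq

def greedy_zone_schedule(zone_costs, proc_speeds):
    """Largest zones first; each zone goes to the processor that is free earliest
    (ties broken in favor of the fastest processor), mimicking the policy above."""
    ready = [(0.0, -s, p) for p, s in enumerate(proc_speeds)]
    heapq.heapify(ready)
    assignment = {}
    for zone, cost in sorted(enumerate(zone_costs), key=lambda zc: -zc[1]):
        free_at, neg_speed, p = heapq.heappop(ready)
        assignment[zone] = p
        heapq.heappush(ready, (free_at + cost / (-neg_speed), neg_speed, p))
    makespan = max(t for t, _, _ in ready)
    return assignment, makespan

# e.g. four zones of unequal work on a fast and a slower workstation
print(greedy_zone_schedule([5.0, 3.0, 3.0, 1.0], proc_speeds=[1.0, 0.6]))
```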
6. RESULTS As mentioned before, CEM code ANTHEM has been extensively validated by computing the TE and TM scattering from cylinders, airfoils and ogives by comparing the results with analytical solutions and MoM calculations. For the sake of brevity, here we present the results of parallelization of one case only, but similar performance has been achieved in other calculations. We consider TE scattering from a dielectric NACA0012 airfoil. The computational domain is shown in Figure 1. The hatching on the airfoil identifies it as a PEC NACA0012 profile. Each of the two zones wraps around the PEC airfoil forming a periodic continuous boundary at the x axis. The chordlength of the PEC airfoil is 4~,0. The airfoil's
83 thickness to length ratio is 0.12. A lossless dielectric coating of thickness tcoat = 0.1)~0 surrounds the airfoil. The layer terminates as a blunt trailing edge having a thickness of 2tcoot and extending tco~t beyond the trailing edge of the PEC airfoil. The radiation boundary has a diameter of 6)~0 . The inner zone represents the lossless dielectric layer between the PEC and dielectric interface. The lossless media is characterized by ~c= 7.4 and Pc = 1.4. The outer zone represents freespace. The two-zone grid is shown in Figure 1(b). Due to the density of the grid, every other grid line is plotted. For the chosen dielectric constants, the speed of light is less than a third of that of freespace. Due to the reduced local incident wavelength and interference patterns, increased grid density is required in the dielectric zone. Computations are performed for TE scattering from the dielectric airfoil due to an incident wave at 45 ~ angle to the chord of the airfoil. Figure 2 shows the comparison of bistatic RCS computed with the CEM code ANTHEM and the MoM code. Figure 3 shows the scattered field Dz phase contours obtained with the CEM code. These calculations have been performed on a single HP750 workstation with both the explicit and implicit spatial filtering as shown in Table I. The implicit spatial filtering reduces the computational time by 20%. Tables I - Ill show the result of parallelization on one, two, and three workstations. Several domain decompostition strategies are implemented and evaluated on multiple workstations. In Table II, three strategies are implemented on two workstations. In strategy 1, two zones are divided at the dielectric/freespace interface; as a result there is load imbalance. In strategy 2, two zones are equally divided but the split is in the q-direction. In strategy 3, two zones are equally divided with a split in ~- direction. Strategy 2 yields the best performance. On a cluster of three workstations, as shown in Table III, a similar conclusion is obtained. Again, the implicit spatial filtering reduces the computational time by 20 to 30%.
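The parallel efficiencies quoted in Tables II and III presumably follow the usual clock-time definition, single-workstation time divided by (number of workstations × parallel wall-clock time). A one-line helper, purely illustrative and with hypothetical timings:

```python
def parallel_efficiency(serial_time, n_procs, parallel_time):
    """Classical definition: speedup / processor count."""
    return serial_time / (n_procs * parallel_time)

# hypothetical example: a 4.57 h serial run finished in 2.45 h on two workstations
print(parallel_efficiency(4.57, 2, 2.45))   # ~0.93
```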
7. CONCLUSIONS A CEM solver ANTHEM has been parallelized on a cluster of heterogeneous workstations by using PVM software for process control and communication. Excellent parallel efficiency is obtained. The implicit spatial filtering provides an excellent alternative to grid refinement at a fraction of the computational cost.
ACKNOWLEDGMENT The author wishes to acknowledge the contributions of his graduate students Kevin Huh and Mark Shu in the development of the methodology described in this paper.
84 REFERENCES
1. Yee, K. S., Numerical Solution of Initial Boundary Value Problems Involving Maxwell's Equations in Isotropic Media, IEEE Trans. Antennas Propagat., Vol. AP-14, May 1966. 2. Goorjian, P. M., Algorithm Development for Maxwell's Equations for Computational Electromagnetism, AIAA Paper 90-0251, 1990. 3. Taflove, A., Application of Finite-Difference Time-Domain Method to Sinusoidal SteadyState Electromagnetic-Penetration Problems, IEEE Trans. Electromagn. Compact., Vol. EMC-22, Aug. 1980. 4. Shankar, V., Mohammadian, A. H., Hall, W. F., Erickson, R., CFD SpinoffComputational Electromagnetics for Radar Cross Section (RCS) Studies, AIAA Paper 90-3055, 1990. 5. Britt, C. L., Solution of Electromagnetic Scattering Problems Using Time-Domain Techniques, IEEE Trans. Antennas Propagat., Vol. AP-37, September 1989. 6. McCartin, B. J., Bahramsel, L. J., Meltz, G., App#cation of the Control Region Approximation to Two Dimensional Electromagnetic Scattering, PIERS 2 - Finite Element and Finite Difference Methods in Electromagnetic Scattering, Morgan, M. (ed.), Elsevier, New York, 1990. 7. Bahrmasel, L., Whitaker, R., Convergence of the Finite Element Method as Applied to Electromagnetic Scattering Problems in the Presence of Inhomogenous Media, 4th Biennial IEEE Conference on Electromagnetics Field Computation, Toronto, Ontario, October 1990. 8. Taflove, A., Umashankar, K., A Hybrid Moment Method~Finite-Difference Time-Domain Approach to Electromagnetic Coupling and Aperture Penetration into Complex Geometries, IEEE Trans. Antennas Propagat., Vol. AP-30, July 1982. 9. Wang, D. S., Whitaker, R. A., Bahrmasel, L. J., Efficient Coupling of Finite Methods and Method of Moments in Electromagnetics Mode#ng, presented at the North American Radio Sciences Meeting, London, Ontario, June 1991. 10. Shu, M., A CFD Based Compact Maxwell Equation Solver for Electromagnetic Scattering, Sc. D. Thesis, Washington University, St. Louis, 1995. 11. Agarwal, R. K., Parallel Implementation of a Compact Higher-Order Maxwell Solver, Parallel Computational Fluid Dynamics, Lin, C. A. et al. (eds.), pp. 327-336, Elsevier, New York, 1998. 12. P VM User's Guide and Reference Manual, Oakridge National Lab. Report ORNL/TM12187, 1997. 13. Visbal, M. R., Gaitonde, D. V., High Order Accurate Methods for Unsteady Vertical Flows on Curvilinear Meshes, AIAA Paper 98-131, 1998. 14. Koutsavdis, E. K., Blaisdell, G. A., Lyrintzis, A. S., On the Use of Compact Schemes with Spatial Filtering in Computational Aeroacoustics, M A A Paper 99-0360, 1999. 15. Bayliss, A., Turkel, E., Radiation Boundary Conditions for Wave-Like Equations, Communications on Pure and Applied Mathematics, Vol. XXXIII, pp. 707-725, 1980. 16. Engquist, B., Majda, A., Radiation Boundary Conditions for Acoustic and Elastic Wave Calculations, Communications on Pure and Applied Mathematics, Vol. XXXII, pp. 313357, 1979. 17. Giles, M., Non-Reflecting Boundary Conditions for the Euler Equations, CFDL-TR-88-1, MIT Department of Aeronautics and Astronautics, 1988.
85 18. Thompson, K. W., Time Dependent Boundary Conditions for HyperboBc Systems, Journal of Computational Physics, Vol. 68, pp. 1-24, 1987. 19. Huh, K. S., Agarwal, R. K., Widnall, S. E., Numerical Simulation of Acoustic Diffraction of Two-Dimensional Rigid Bodies in Arbitrary Flows, AIAA Paper 90-3 920, 1990. 20. Berenger, J. P., A Perfectly Matched Layer for the Absorption of Electromagnetic Waves, Journal of Computational Physics, Vol. 114, pp. 185-200, 1994. 21. Bussing, T. R. A. and Murman, E. M., A Finite Volume Method for the Calculation of Compressible Chemically Reacting Flows, AIAA Paper 85-0331, 1985.
Table I: P e r f o r m a n c e on 1 w o r k s t a t i o n - H P 7 5 0
Single Zone = 513 x 48 grid (513 x 13 in dielectric region) (a) Explicit Filter (b) Implicit Filter No. of Iterations*
9540
6580
Total cpu 4.568 h 3.510 h Cpu/iteration/grid point: (a) 7 x 10.5 secs, (b) 7.8 x 10.5 secs. *Three-order of magnitude reduction in residuals
Table II: P e r f o r m a n c e on 2 w o r k s t a t i o n s - 2 H P 7 5 0
Domain Decomposition Strategy
Zone Dimensions
Number of Iterations
Zone 1
Zone 2
Explicit Filter
Implicit Filter
Explicit Filter
Implicit Filter
1
513x14
513x36
9880
6840
91.9%
95.4%
2
513x25
513x25
9760
6750
93.2%
96.7%
3
258x48
258x48
10360
6930
87.6%
91.8%
Parallel Efficiency
86 Table III: Performance on 3 w o r k s t a t i o n s - HP750, SGI Indigo, IBM RS/6000
Domain Decomposition Strategy
Zone 1
Zone 2
Zone 3
Explicit Filter
Implicit Filter
Explicit Filter
Implicit Filter
1
513x17
513x17
513x17
9910
6740
87.2%
91.3%
2
172x48
172x48
172x48
10540
6870
82.0%
86.9%
Note:
Zone Dimensions
Number of Iterations
Parallel Efficiency (based on clock time)
-grid line at r I =1 represents the PEC NACA airfoil -grid line at r I = 13 represents the dielectric interface with freespace -grid line at r I =48 represems the outer computational boundary.
87
3
I
2 1
S
0
0
-1
-I -2
--3
I -3
-2
-I
0
~(x)
1
2
3
-3
-3
-2
-1
I
0
~(X)
(~)
3
2
(b)
Figure 1. Computational domain" (a) Zones, (b) Grid, every second line shown, zone 1"13 • zone 2:513 • 35.
.....
":"%.. .:...
r
....
.:.
,oo.
g
~ 0=0"
"~
~,
"
--g
-rr
0
gr2
O~
Figure 2. Bistatic RCS: solid line, MoM solution, dashed Hne, ATHEM solution.
Figure 3. Scattered Dz field phase
Parallel Two-Level Methods Compressible
89
for T h r e e - D i m e n s i o n a l
Flow Simulations on Unstructured
Transonic
Meshes*
R. Aitbayev ~, X.-C. CaP, and M. Paraschivoiu u Department of Computer Science, University of Colorado, Boulder, CO 80309, U.S.A., {rakhim, c a i } 0 c s , c o l o r a d o , e d u u Department of Mechanical and Industrial Engineering, University of Toronto, Toronto, Canada, M5S 3G8, marius0mie, u t o r o n t o , ca We discuss our preliminary experiences with several parallel two-level additive Schwarz type domain decomposition methods tbr the simulation of three-dimensional transonic compressible flows. The focus is on the implementation of the parallel coarse mesh solver, which is used to reduce computational cost and speed up convergence of the linear algebraic solvers. Results of a local two-level and a global two-level algorithm on a multiprocessor computer are presented for computing steady flouts around a NACA0012 airfoil using the Euler equations discretized on unstructured meshes. 1. I N T R O D U C T I O N \ ~ are interested in the numerical simulation of three-dimensional inviscid steady-state compressible flows using two-level Schwarz type domain decomposition algorithms. The class of overlapping Schwarz methods has been studied extensively in the literature [11], especially, the single level version of the method [6,9]. It is well-known, at least in theory, that the coarse space plays a very important role in the fast and scalable convergence of the algorithms. Direct methods are often used to solve the coarse mesh problem either redundantly on all processors or on a subset of processors [3]. This presents a major difficulty in a fully parallel implementation for 3D problems, especially when the number of processors is large. In this paper, we propose several techniques for solving the coarse mesh problem in parallel, together with the local fine mesh problems, using two nested layers of preconditioned iterative methods. The construction of the coarse mesh is an interesting issue by itself. We take a different approach than what is commonly used in the algebraic multigrid methods in which the coarse mesh is obtained from the given fine m e s h - not the given geometry. In the twolevel methods presented in this paper, we construct both the coarse and the fine mesh from the given geometry. To better fit the boundary geometry, the fine mesh nodes may not be on the faces of the coarse mesh tetrahedrons. In other words, the coarse space and the fine space are not nested. This does not present a problem as long as the proper interpolation is defined [2]. *The work was supported in part by the NSF grants ASC-9457534, ECS-9527169, and ECS-9725504.
90 As a test case, we consider a symmetric nonlifting flow over a NACA0012 airfoil in a three-dimensional setting. We construct the fine mesh by refining the existing coarse mesh and updating the nodes of the fine mesh according to the boundary geometry of the given physical domain. Such an approach is easy to implement since the same computer code can be used on both the fine and the coarse level, and only minimal additional programming is required to construct the restriction and prolongation operators. Moreover, it gives a natural partition of the fine mesh from the partition of the coarse mesh. In the tests, the system of Euler equations is discretized using the backward difference approximation in the pseudo-temporal variable and a finite volume method in the spatial variables. The resulting system of nonlinear algebraic equations is linearized using the Defect Correction (DEC) scheme. At each pseudo-temporal level, the linear system is solved by a restricted additive Schwarz preconditioned FGMRES method [10], and the coarse mesh problem is solved with an inner level of restricted additive Schwarz preconditioned FGMRES method.
2. G O V E R N I N G Let ~t C R 3 be wall boundary P~ velocity vector, e if, e and p as the
EQUATIONS AND BOUNDARY
a bounded flow domain with the boundary consisting of two parts" a and an infinity boundary P~. Let p be the density, g - (u, v, w) T the the total energy per unit volume, and p the pressure. We consider p, unknowns at point (x, y, z), and the pseudo-temporal variable t. Set
v , described by the Euler equations
Ut + V " F -
CONDITIONS
O,
~ oy, o N ox,
. An inviscid compressible flow in f~ is (1)
where /7 - (F, G, H) T is the flux vector with the Cartesian components defined as on page 87 of [7]. Equation (1) is closed by the equation of state for a perfect gas p - (7-1) - pll ll /2), 7 is the ratio of specific heats and I1 is the 2-norm in R a. We specify the initial condition U]t=0 - g0, where U0 is an initial approximation to a steady-state solution, and the following boundary conditions. On the wall boundary F~, we impose a no-slip condition for the velocity g . g - 0, where g is the outward normal vector to the wall boundary. On the infinity boundary F~, we impose uniform free stream conditions p - p~, ~ - g~, and p ~ - 1 / ( 7 M 2 ) , where M ~ is the free stream Mach number. We seek a steady-state solution, that is, the limits of p, ~, e and p as t ~ ee.
3. D I S C R E T I Z A T I O N In this section we present an outline of the discretization of the Euler equations; for more details, see [5]. Let f~h be a tetrahedral mesh in f~, and N be the number of mesh points. For the pseudo-temporal discretization, we use a first-order backward difference scheme. For the spatial discretization of (1), we use a finite volume scheme in which control volumes are centered at the vertices of the mesh. For upwinding, we use Roe's approximate Riemann solver which has the first order spatial accuracy. Second order accuracy is achieved by the MUSCL technique [13] which uses piecewise linear interpolation at the interface between control volumes. For i = 1, 2, . . . , N and n = 0, 1 . . . , let U/~ denote the value of the discrete solution at point (xi, Yi, zi) and at the pseudo-temporal level n and set U~ - (U~, U ~ , . . . , U~) T.
91
Let U~ - Uo(xi, yi, zi) and q/h (Uh) -- (q21 ( U h ) , . . . , gIN (Uh)) T, where q/i (Uh) denotes the described second order approximation of convective fluxes V . F at point (xi, yi, zi). We define the local time step size by --,
zxt ?
=
h l(C
+
_.+
I1 ]11 ),
where CcFL > 0 is a preselected number, Ti is a control volume centered at node i, hi is its characteristic size, Ci is the sound speed and ~/ is the velocity vector at node i. Then, the proposed scheme has a general form
(U~ + l - U i ~ ) / A t ~ + ~ i ( U ~ +1) = 0,
i=1,2,...,N,
n=0,
1, . . . .
(2)
We note, the finite volume scheme (2) has the first order approximation in the pseudotemporal variable and the second order approximation in the spatial variable. On Fw, no-slip boundary condition is entbrced. On F~, a non-reflective version of the flux splitting of Steger and Warming [12] is used. We apply a DeC-Krylov-Schwarz type method to solve (2); that is, we use the DeC scheme as a nonlinear solver, the restarted FGMRES algorithm as a linear solver, and the restricted additive Schwarz algorithm as the preconditioner. At each pseudo-temporal level n, equation (2) represents a system of nonlinear equations for the unknown variable Uh~+1 This nonlinear system is linearized by the DeC scheme [1] formulated as follows. Let. ~h(Uh) be the first-order approximation of convective fluxes V . _f obtained in a way similar to that of ~h(Uh), and let O(~h(Uh) denote its T?n+l,0 n Jacobian. Suppose that for fixed n, an initial guess "h is given (say Uh+~'~ _ U~). For s - 0, 1, . . . , solve for U~+l's+l the following linear system
(D -1-0q)h (V;+l'~
n+l's+l
Uhn+l's)
,s),
where D~ = diag (1/At~, . . . , 1~Ate)is a diagonal matrix. The DeC scheme (3)preserves the second-order approximation in the spatial variable of (2). In our implementation, we carry out only one DeC iteration at each pseudo-temporal iteration, that is, we use the scheme
4. L I N E A R S O L V E R A N D P R E C O N D I T I O N I N G
Let the nonlinear iteration n be fixed and denote'
(4) Matrix A is nonsymmetric and indefinite in general. To solve (4), we use two nested levels of restarted FGMRES methods [10], one at the fine mesh level and one at the coarse mesh level inside the additive Schwarz preconditioner (AS) to be discussed below.
92 4.1. O n e - l e v e l AS preconditioner To accelerate the convergence of the FGMRES algorithm, we use an additive Schwarz preconditioner. The method splits the original linear system into a collection of independent smaller linear systems which can be solved in parallel. Let f~h be subdivided into k non-overlapping subregions f~h,1, f~h,2, f~h,k Let f~' h,2, ... , f~h,k be overlapping extensions of Q h , 1 , Q h , 2 , . - - , Qh,k, respectively, and be also subsets of f~h. The size of the overlap is assumed to be small, usually one mesh layer. The node ordering in f~h determines the node orderings in the extended subregions. For i = 1, 2 , . . . , k, let Ri be a global-to-local restriction matrix that corresponds to the extended subregion f~'h,i, and let Ai be a "part" of matrix A that corresponds to f~'h,i. The AS preconditioner is defined by 9
9
9
~
9
h,1
k
- E R A; i=1
For certain matrices arising from the discretizations of elliptic partial differential operators, an AS preconditioner is spectrally equivalent to the matrix of a linear system with the equivalence constants independent of the mesh step size h, although, the lower spectral equivalence constant has a factor 1/H, where H is the subdomain size. For some problems, adding a coarse space to the AS preconditioner removes the dependency on 1/H, hence, the number of subdomains [11].
4.2. One-level RAS preconditioner It is easy to see that, in a distributed memory implementation, multiplications by matrices R T and Ri involve communication overheads between neighboring subregions. It was recently observed [4] that a slight modification of R/T allows to save half of such communications. Moreover, the resulting preconditioner, called the restricted AS (RAS) preconditioner, provides faster convergence than the original AS preconditioner for some problems. The RAS preconditioner has the form k _
.iI~/T
9--1
i=1
where R~T corresponds to the extrapolation from ~h,i. Since it is too costly to solve linear systems with matrices Ai, we use the following modification of the RAS preconditioner: k
M11 - ~ R~~B;-~R~,
(5)
i=1
where Bi corresponds to the ILU(O) decomposition of Ai. We call MI the one-level RAS
preconditioner (ILU(O) modified). 4.3. Two-level RAS preconditioners Let f~H be a coarse mesh in f~, and let R0 be a fine-to-coarse restriction matrix. Let A0 be a coarse mesh version of matrix A defined by (4). Adding a scaled coarse mesh component to (5), we obtain k
M21 -- (1 -- ct) ~ R~T]~-I ]~i + ct _RoTAo 1R0, i=1
(6)
93 where c, c [0, 1] is a scaling parameter. We call M2 the global two-level RAS preconditioner (ILU(O) modified). Preconditioning by M2 requires solving a linear system with matrix A0, which is still computationally costly if the linear system is solved directly and redundantly. In fact, the approximation to the coarse mesh solution could be sufficient for better preconditioning. Therefore, we solve the coarse mesh problem in parallel using again a restarted FGMRES algorithm, which we call the coarse mesh FGMRES, with a modified RAS preconditioner. Let ~1H be divided into k subregions f~H,1, ~H,2, - . . , ~2H,k with the extented counterparts fYH,1, fYH,2, . . . , ~H,~. To solve the coarse mesh problem, we use F G M R E S with the onelevel ILU(O) modified RAS preconditioner k
M-1 0,1 -- E ( ] ~ 0 , i, ) rB-1R0,i, 0,i
(7)
i=1
where, tbr i - 1, 2, . . . , N, R0,i is a global-to-local coarse mesh restriction matrix, (R'o,i)~ is a matrix that corresponds to the extrapolation from f~H,i, and Bo,i is the ILU(O) decomposition of matrix Ao,i, a part of A0 that corresponds to the subregion fYH , i " After r coarse mesh FGMRES iterations, Ao 1 in (6) is approximated by d o I - polyt(MglA0) with some 1 _< r, where polyl(x ) is a polynomial of degree l, and its explicit form is often not known. We note, l maybe different at different fine mesh FGMRES iterations, depending on a stopping condition. Therefore, FGMRES is more appropriate than the regular GMRES. Thus, the actual preconditioner for A has the form k
JlT/~-1 -
(1 - c~) X~ RitT ~/-1/r~i ~- Ct/!~oT./~O1/~ O.
(8)
i=1
For the fine mesh linear system, we also use a preconditioner obtained by replacing Ao 1 in (6) with _/VL -1 defined by (7) 0,1 k
]~/31
-- E ((1 -- OZ) J[~iITj~I . /r~i qt._ O~/r~,0 T i=1
! "~Tj~-l/~o,i/~O) ([~O,i] O,i
.
(9)
We call Ma a local two-level RAS preconditioner (ILU(O) modified) since the coarse mesh problems are solved locally, and there is no global information exchange among the subregions. We expect that Ma works better than M1 and that M2 does better than Ma. Since no theoretical results are available at present, we test the described preconditioners M1,-/;7/2, and Ma numerically. 5. N U M E R I C A L
EXPERIMENTS
We computed a compressible flow over a NACA0012 airfoil on the computational domain with the nonnested coarse and fine meshes. First, we constructed an unstructured coarse mesh f~n; then, the fine mesh f~h was obtained by refining the coarse mesh twice. At each refinement step, each coarse mesh tetrahedron was subdivided into 8 tetrahedrons. After each refinement, the boundary nodes of the fine mesh were adjusted to the geometry of the domain. Sizes of the coarse and fine meshes are given in Table 1.
94 Table 1 Coarse and fine mesh sizes Coarse Fine Nodes 2,976 117,525 Tetrahedrons 9,300 595,200
...........
~
Fine/coarse ratio 39.5 64
1-level RAS local 2-level RAS global 2-level RAS
...........
I[i [i
1-level RAS local 2-level RAS global 2-level RAS
i! , Li ,~i H~iii
~.\ "
~25
~20 i:i
. ,~"~",~....~
-
13,f , I 3
r, '.J,! li
1000
Total number
2000
of l i n e a r i t e r a t i o n s
3000
.
.
.
.
.
.
.
. ;.,, ,,, ,: ,. n!~ir, ~,l T'='.-,-i-;-,-r7i
1 O0
200 300 400 Nonlinear iteration
500
Figure 1. Comparison of the one-level, local two-level, and global two-level RAS preconditioners in terms of the total numbers of linear iterations(left picture) and nonlinear iterations (right picture). The mesh has 32 subregions.
For parallel processing, the coarse mesh was partitioned, using METIS [8], into 16 or 32 submeshes each having nearly the same number of tetrahedrons. The fine mesh partition was obtained directly from the corresponding coarse mesh partition. The size of overlap both in the coarse and the fine mesh partition was set to one, that is, two neighboring extended subregions share a single layer of tetrahedrons. In (8) and (9), R0r was set to a matrix of a piecewise linear interpolation. Multiplications by R~ and R0, solving linear systems with M1, M2, and Ma, and both the fine and the coarse FGMRES algorithm were implemented in parallel. The experiments were carried out on an IBM SP2. We tested convergence properties of the preconditioners defined in (5), (8), and (9) with c~ - N ~ / N I , where N~ and N/ are the numbers of nodes in the coarse and fine meshes, respectively. We studied a transonic case with Moo set to 0.8. Some of the computational results are presented in Figures 1 and 2. The left picture in Figure 1 shows residual reduction in terms of total numbers of linear iterations. VV'esee that the algorithms with two-level RAS preconditioners give significant improvements compared to the algorithm with the one-level RAS preconditioner. The improvement in using the global two-level RAS preconditioner compared to the local twolevel RAS preconditioner is not very much. Recall, that in the former case the inner FGMRES is used which could increase the CPU time. In Table 2, we present a summary from the figure. We see that the reduction percentages in the numbers of linear iterations drop with the decrease of the nonlinear residual (or with the increase of the nonlinear iteration number). This is seen even more clearly in the right picture in Figure 1. After
95 Table 2 Total numbers of linear iterations and the reduction percentages compared to the algorithm with the one-level RAS preconditioner (32 subregions). One-level RAS Local two-level RAS Global two-level RAS Residual Iterations Iterations Reduction Iterations Reduction 10 -2 859 513 40% 371 57% 10 - 4 1,205 700 42~ 503 58~ 10 -6 1,953 1,397 28% 1,245 36% 10 -s 2,452 1,887 23% 1,758 28%
- - ~ - -
lO~ ~1o-4~
- -
.%
.:
9- -
l - l e v e l R A S / 16 subregions l - l e v e l RAS / 32 subregions local 2-level R A S / 16 subregions local 2-level R A S / 32 subregions
k
~,,_
10-1 f ~ r a . F . ~
10- f
10- f ~-~10-4~"
.o,o-I
Z o71 10 -s
- - ~ - -
r
6
lO-f
500 1000 1500 2000 " 2500 Total number of linear iterations
10 -8
l - l e v e l R A S / 16 subregions
-- :__ lgl'lotaVil2RiAevS/132~/~6gs~
10~ i
\
\ ~
.
global 2-level R A S / 32 subregions
\ ~',~._
\
'"
,.
-%
;-
500 1000 1500 2000 2500 Total number of linear iterations
Figure 2. Comparison of the one-level RAS preconditioner with the local two-level RAS (left picture) and the global two-level RAS preconditioner (right picture) on the meshes with 16 and 32 subregions.
approximately 80 nonlinear iterations, the three algorithms give basically the same number of linear iterations at each nonlinear iteration. This suggests that the coarse mesh may not be needed after some number of initial nonlinear iterations. In Figure 2, we compare the algorithms on the meshes with different numbers of subregions, 16 and 32. The left picture shows that the algorithms with the one-level and local two-level RAS preconditioners initially increase the total numbers of linear iterations as the number of subregions increases from 16 to 32. On the other hand, we see in the right picture in Figure 2 that the increase in the number of subregions had little effect on the convergence of the algorithm with the global two-level RAS preconditioner. These results suggest that the algorithm with the global two-level RAS preconditioner is well scalable to the number of subregions (processors) while the other two are not. In both pictures we observe the decrease in the total number of linear..iterations to the end of computations. This is due to the fact that only 4 or 5 linear iterations were carried out at each nonlinear iteration in both cases, with 16 and 32 subregions (see the right picture in Figure 1), with linear systems in the case of 32 subregions solved just one iteration faster than the linear systems in the case of 16 subregions.
96 6. C O N C L U S I O N S When both the fine and the coarse mesh are constructed from the domain geometry, it is fairly easy to incorporate a coarse mesh component into a one-level RAS preconditioner. The applications of the two-level RAS preconditioners give a significant reduction in total numbers of linear iterations. For our test cases, the coarse mesh component seems not needed after some initial number of nonliner iterations. The algorithm with the global two-level RAS preconditioner is scalable to the number of subregions (processors). Sizes of fine and coarse meshes should be well balanced, that is, if a coarse mesh is not coarse enough, the application of a coarse mesh component could result in the CPU time increase. REFERENCES 1. 2. 3.
K. BOHMER, P. HEMKER, AND put. Suppl, 5 (1985), pp. 1-32. X.-C. CAI, The use of pointwise non-nested meshes, SIAM J. Sci. X.-C. CAI, W. D. GROPP, D.
H. STETTER, The defect correction approach, Corn-
interpolation in domain decomposition methods with Comput., 16 (1995), pp. 250-256. E. KEYES, R. G. MELVIN, AND D. P. YOUNG,
Parallel Newton-Krylov-Schwarz algorithms for the transonic full potential equation, 4. 5.
SIAM J. Sci. Comput., 19 (1998), pp. 246-265. X.-C. CAI AND M. SARKIS, A restricted additive Schwarz preconditioner for general sparse linear systems, SIAM J. Sci. Comput., 21 (1999), pp. 792-797. C. FARHAT AND S. LANTERI, Simulation of compressible viscous flows on a variety of
MPPs: computational algorithms for unstructured dynamic meshes and performance results, Comput. Methods Appl. Mech. Engrg., 119 (1994), pp. 35-60. 6. W. D. GROPP, D. E. KEYES, L. C. MCINNES, AND M. D. WIDRIRI, Globalized Newton-Krylov-Schwarz algorithms and software for parallel implicit CFD, Int. J. High Performance Computing Applications, (1999). Submitted. C. HIR$CH, Numerical Computation of Internal and External Flows, John Wiley and Sons, New York, 1990. 8. G. KARYPIS AND V. KUMAR, A fast and high quality multilevel scheme for partitioning irregular graphs, SIAM J. Sci. Comput., 20 (1998), pp. 359-392. 9. D . K . KAUSHIK, D. E. KEYES, AND B. F. SMITH, Newton-Krylov-Schwarz methods 7.
for aerodynamics problems: Compressible and incompressible flows on unstructured grids, in Proc. of the Eleventh Intl. Conference on Domain Decomposition Methods in Scientific and Engineering Computing, 1999. 10. Y. SAAD, A flexible inner-outer preconditioned GMRES algorithm, SIAM J. Sci. Stat. Comput., 14 (1993), pp. 461-469. 11. B. F. SMITH, P. E. BJORSTAD, AND W. D. GROPP, Domain Decomposition: Parallel Multilevel Methods for Elliptic Partial Differential Equations, Cambridge University Press, 1996. 12. J. STEGER AND R. F. WARMING, Flux vector splitting for the inviscid gas dynamic with applications to finite-difference methods, J. Comp. Phys., 40 (1981), pp. 263-293. 13. B. VAN LEER, Towards the ultimate conservative difference scheme V: a second order sequel to Godunov's method, J. Comp. Phys., 32 (1979), pp. 361-370.
Parallel ComputationalFluidDynamics Towards Teraflops, Optimizationand NovelFormulations D. Keyes,A. Ecer,J. Periaux,N. Satofukaand P. Fox (Editors) 92000 Elsevier ScienceB.V.All rightsreserved.
97
Parallel Calculation of Helicopter B VI Noise by M o v i n g O v e r l a p p e d Grid Method Takashi Aoyama*, Akio Ochi#, Shigeru Saito*, and Eiji Shimat *National Aerospace Laboratory (NAL) 7-44-1, Jindaijihigashi-machi, Chofu, Tokyo 182-8522, Japan tAdvanced Technology Institute of Commuter-helicopter, Ltd. (ATIC) 2 Kawasaki-cho, Kakamigahara, Gifu 504-0971, Japan The progress of a prediction method of helicopter blade-vortex interaction (BVI) noise developed under the cooperative research between National Aerospace Laboratory (NAL) and Advanced Technology Institute of Commuter-helicopter, Ltd. (ATIC) is summarized. This method consists of an unsteady Euler code using a moving overlapped grid method and an aeroacoustic code based on the Ffowcs Williams and Hawking (FW-H) formulation. The present large-scale calculations are performed on a vector parallel super computer, Numerical Wind Tunnel (NWT), in NAL. [] Therefore, a new algorithm of search and interpolation suitable for vector parallel computations is developed for the efficient exchange of flow solution between grids. The calculated aerodynamic and aeroacoustic results are in good agreement with the experimental data obtained by ATIC model rotor test at German Dutch Windtunnel (DNW). The distinct spikes in the waveform of BVI noise are successfully predicted by the present method. 1. INTRODUCTION Helicopters have the great capability of hovering and vertical takeoff and landing (VTOL). The importance of this capability has been realized again especially in Japan after the great earthquake in Kobe where it was shown that helicopters were effective as a means of disaster relief. It is worthy of mention that an international meeting of American Helicopter Society (AHS) on advanced rotorcraft technology and disaster relief was held in Japan in 1998. However, it cannot be said that helicopters are widely used as a mean of civil transportation. Although their capability is effective in the civil transportation, noise is a major problem. Helicopters produce many kinds of noise, such as blade-vortex interaction (BVI) noise, high-speed impulsive (HSI) noise, engine noise, transmission noise, tail rotor noise, bladewake interaction (BWI) noise, main-rotor/tail-rotor interaction noise, and so on. BVI noise is most severe for the civil helicopters which are used in densely populated areas because it is mainly generated in descending flight conditions to heliports and radiates mostly below the helicopter's tip-path plane in the direction of forward flight. What makes it even worse is that its acoustic signal is generally in the frequency range of most sensitive to human subjective response (500 to 5000Hz).
98 Many researchers have been devoting themselves to developing prediction methods for BVI noise. Tadghighi et al. developed a procedure for BVI noise prediction [ 1]. It is based on a coupling method of a comprehensive trim code of helicopter, a three-dimensional unsteady full potential code, and an acoustic code using the Farassat's 1A formulation of the Ffowcs Williams and Hawking (FW-H) equation. National Aerospace Laboratory (NAL) and Advanced Technology Institute of Commuter-helicopter, Ltd. (ATIC) also developed a combined prediction method [2] of a comprehensive trim code (CAMRAD II), a threedimensional unsteady Euler code, and an acoustic code based on the FW-H formulation. The method was effectively used in the design of a new blade [3] in ATIC. However, one of the disadvantages of the method is that users must specify indefinite modeling parameters such as the core size of tip vortex. The recent progress of computer technology prompts us to directly analyze the complicated phenomenon of B VI by CFD techniques. The great advantage of the direct calculations by Euler or Navier-Stokes codes is that they capture the tip vortex generated from blades without using indefinite parameters. Ahmad et al. [4] predicted the impulsive noise of OLS model rotor using an overset grid Navier-Stokes Kirchhoff-surface method. Although the calculated wave forms of high-speed impulsive (HSI) noise were in reasonable agreement with experimental data, the distinct spikes in the acoustic waveform of blade-vortex interaction noise could not be successfully captured. This is because the intermediate and background grids used in their method are too coarse to maintain the strength of tip vortex. In order to solve the problem, NAL and ATIC developed a new prediction method [5] of BVI noise. The method combines an unsteady Euler code using a moving overlapped grid method and an aeroacoustic code based on the FW-H formulation. After making some efforts on the refinement of grid topology and numerical accuracy [6-8], we have successfully predicted the distinct spikes in the waveform of BVI noise. We validated our method by comparing numerical results with experimental data [9-11 ] obtained by ATIC. Our calculations are conducted using a vector parallel super computer in NAL. A new algorithm of search and interpolation suitable for vector parallel computations was developed for the efficient exchange of flow solution between grids. The purpose of this paper is to summarize the progress of the prediction method developed under the cooperative research between NAL and ATIC. 2. CALCULATION METHOD 2.1 Grid System Two types of CFD grids were used to solve the Euler equations in the first stage of our moving overlapped grid method [6]. The blade grid wrapped each rotor blade using boundary fitted coordinates (BFC). The Cartesian background grid covered the whole computation region including the entire rotor. In the grid system presently used in our method, the background grid consists of inner and outer background grids, as shown in figure 1, which increases the grid density only near the rotor. 2.2 Numerical Method in Blade Grid Calculation The numerical method used in solving the Euler equations in the blade grid is an implicit finite-difference scheme [12]. The Euler equations are discretized in the delta form using Euler backward time differencing. A diagonalized approximate factorization method, which
99 utilizes an upwind flux-split technique, is used for the implicit left-hand-side for spatial differencing. In addition, an upwind scheme based on TVD by Chakravarthy and Osher is applied for the explicit right-hand-side terms. Each operator is decomposed into the product of lower and upper bi-diagonal matrices by using diagonally dominant factorization. In unsteady forward flight conditions, the Newton iterative method is added in order to reduce the residual in each time-step. The number of Newton iteration is six. The typical dividing number along the azimuthal direction is about 9000 per revolution. This corresponds to the azimuth angle about 0.04 ~ per step. The unsteady calculation is impulsively started from a non-disturbed condition at the azimuth angle of 0 ~
2.3 Numerical Method in Background Grid Calculation A higher accuracy explicit scheme is utilized in the background Cartesian grid. The compact TVD scheme [ 13] is employed for spatial discretization. MUSCL cell interface value is modified to achieve 4th-order accuracy. Simple High-resolution Upwind Scheme (SHUS) [ 14] is employed to obtain numerical flux. SHUS is one of the Advection Upstream Splitting Method (AUSM) type approximate Riemann solvers and has small numerical diffusion. The time integration is carried out by an explicit method. The four stage Runge-Kutta method is used in the present calculation. The free stream condition is applied to the outer boundary of the outer background grid.
2.4 Search and Interpolation The flow solution obtained by a CFD code is exchanged between grids in the moving overlapped grid approach. The search and interpolation to exchange the flow solution between the blade grid and the inner background grid and between the inner background grid and the outer background grid are executed in each time step. The computational time spent for search and interpolation is one of the disadvantages of the moving overlapped grid approach. In our computation, this problem is severe because a vector and parallel computer is used. Therefore, a new algorithm suitable for vector parallel computations was developed. Only the detailed procedure of the solution transfer from the blade grid to the inner background grid is described in this paper because the other transfers are easily understood. The procedure flow of the new search and interpolation algorithm is shown in figure 2. In the first step, the grid indexes (i, j, k) of the inner background grid points that might be inside of blade grid cells are listed. In the second step, the listed indexes are checked to determine whether they are located inside of the grid cell. The position of the point is expressed by three scalar parameters s, t, and u because the tri-linear interpolation is utilized in the present algorithm. In this step, the values of s, t, and u of each index are calculated. When all s, t, and u are between zero and one, the point is judged to be located inside of the grid cell. Then the grid points outside of the cell are removed from the list and the flow solution is interpolated into temporal arrays. Each processing element (PE) of NWT performs these procedures in parallel. Finally, the interpolated values are exchanged between the processing elements.
100 I-"-------------------------------] [
List grid points index
i.,,. ...............
4, .................. I:[ check whether inside or outside ,:. ..................... . r ...................
!
]
|
.. i [" | .!,
i
l I 9
I
remove outside grid points from index
.... ................
|'|il ~ "
__
4" ....................
interpolate values
]
I
,
|
[!~
_
"-" ""--" "-'" "-" ""--" "4" "-" ""--" "-'""" "-" "-'""" "'-exchange interpolated values between PEs ] 9o ~ 9 9 vector computation
Inner background
Outer background grid
. . -. =. parallel computation
Figure 1. Blade grids, inner and outer Figure 2. Procedure flow of search and background grids, interpolation. 2.5 Aeroacoustic Code The time history of the acoustic pressure generated by the blade-vortex interaction is calculated by an aeroacoustic code [ 15] based on the Ffowcs Williams and Hawking (FW-H) formulation. The pressure and its gradient on the blade surface obtained by the CFD calculation are used as input data. The FW-H formulation used here doesn't include the quadrupole term because strong shock waves are not generated in the flight condition considered here. 2.6 Numerical Wind Tunnel in NAL The large computation presented here was performed on the Numerical Wind Tunnel (NWT) in NAL. NWT is a vector parallel super computer which consists of 166 processing elements (PEs). The performance of an individual PE is equivalent to that of a super computer, 1.7 GFLOPS. Each PE has a main memory of 256 MB. High-speed cross-bar network connects 166 PEs. The total peak performance of the NWT is 280 GFLOPS and the total capacity of the main memory is as much as 45 GB. The CPU time per revolution is about 20 hours using 30 processing elements. Periodic solutions are obtained after about three revolutions. NAL takes a strategy of continuously updating its high performance computers in order to promote the research on numerical simulation as the common basis of the aeronautical and astronautical research field. The replacement of the present NWT by a TFLOPS machine is now under consideration. The research on helicopter aerodynamics and aeroacoustics will be more and more stimulated by the new machine. On the other hand, the challenging problem of the aerodynamics and the aeroacoustics of helicopters, which includes rotational flow, unsteady aerodynamics, vortex generation and convection, noise generation and propagation, aeroelasticity, and so on, will promote the development of high performance parallel super computer. In addition, the aspect of multi-disciplinary makes the problem more challenging. 3. RESULTS AND DISCUSSION 3.1. Aerodynamic Results The calculated aerodynamic results are compared with experimental data. Figure 3 shows the comparisons between measured and calculated pressure distributions on the blade surface
101
in a forward flight condition. The experimental data was obtained by ATIC model rotor tests [9-11] at the German Dutch Windtunnel (DNW). The comparisons are performed at 12 azimuth-wise positions at r/R=0.95. The quantities r and R are span-wise station and rotor radius, respectively. The agreement is good in every azimuth position. The tip vortices are visualized by the iso-surface of vorticity magnitude in the inner background grid in figure 4. The tip vortices are distinctively captured and the interactions between blade and vortex are clearly observed. Figure 5 shows the visualized wake by particle trace. The formation of tip vortex and the roll-up of rotor wake are observed. The appearance of the rotor wake is similar to that of the fixed-wing wake. This result gives us a good reason to macroscopically regard a rotor as an actuator disk. In figure 6, the tip-vortex locations calculated by the present method are compared with the experimental data measured by the laser light sheet (LLS) technique and the calculated result by CAMRAD II. Figure 6 a) and b) show the horizontal view and the vertical view (at y/R=0.57), respectively. The origin of coordinate in figure 6 b) is the leading edge of the blade. All the results are in good agreement. 3.2. Aeroacoustic Results
The time history of the calculated sound pressure level (SPL) is compared with the experimental data of ATIC in figure 7. The observer position is at #1 in figure 8, which in on the horizontal plane 2.3[m] below the center of rotor rotation. A lot of distinct spikes generated by B VI phenomena arc shown in the measured SPL. This type of impulsiveness is typically observed in the waveform of BVI noise. Although the amplitude of the waveform is over-predicted by the present method because it doesn't include the effect of aeroclasticity, the distinct spikes in the wavcform are reasonably predicted. This result may be the first case in the world in which the phenomenon of BVI is clearly captured by a CFD technique directly. Figure 9 shows the comparison between predicted and measured carpet noise contours on the horizontal plane in figure 8. In this figure, the open circle represents a rotor disk. Two BVI lobes are observed in the measured result. The stronger one is caused by the advancing side BVI and the other one is the result of the retreating side BVI. Although the calculation underpredicts the advancing side lobe and over-predicts the retreating side lobe, it successfully predicts the existence of two types of BVI lobes.
~ ,0 2
0.0
-3
I
I
I
I
!
i
i
l.O i
I
c~~, , , ,X o.o
X/C
t,o
O.O -3
I
I
I
I
I
I
I
I
I
~:F,.~F~,, o.o
X/C
2
l.O
O.O
-3
I
~ , .... I
I
I
I
!
!
~'F,, t.o
o.o
~ ! . . . . iI
0
l.O i
I
,-,, X/C
2
0.0
-3
I
,
I
I
I
I
I
1,0 I
I
.'F, ,,-,. 1.o
o.o
X/C
2
O,O
I
I
I
.... l.O
-3
I
I
I
l.O
-3
~:f', , , t.o
O.O
o.o
X/C
1.o
.~
, , _
o,o
X/C
Calculation--- Experiment o 9 Figure 3. Comparison between measured and calculated pressure distributions on blade surface (r/R=0.95).
1.o
.....:*~:S:[[i: .~:d~,:~i<+if?
Figure 5. Visualized wake by particle trace.
Figure 4. Visualized tip vortices by iso-surface of vorticity magnitude. ~----: Vortex1 (LLS) - - : Vortex2(LLS) ~---: Vortex3(LLS) Vortex4(LLS) ~ : CAMRAD
TEST. CONDITION f ~' = 0 . 1 6
"f ~ = 7"5~
f
#
CT
=
"
50 ~
= 0.0064
~'i:
Present results
100
Flow direction 20
5O
.=~
""
"~ P~~
~. 0
/ "
J
"
i~
Bla&
Cut plane F ig.6 b)
g
r
o
N
-2080
-50
60
40
20
0
-2O
• -100
100
50
0
-50
-100
x/RB%0j a) Horizontal plane
b) Vertical plane
Figure 6. Comparison of tip-vortex locations. 4O ca) :~ o) co k_
f.....,
Measured
L-L O)
20
03 03 (b s = .::3 0L~
m 9 <
-20
6
'
26 0
Blade Azimuth [deg.]
'
46 0
4O
Calculated
20 0 -20
",,.%"Ii] \~vq
0'
. . . . . 200 Blade Azimuth [deg.]
Figure 7. Comparison between measured and calculated waveforms of BVI noise.
46o
103
13 traverse microphones to measure carpet noise contours Acoustic data" mean value during about 43 revolutions
2.3m (1.15[
Microphone positions #1" x=-4.0m, y= 0m, z=-2.3m #2: x=-4.0m,y=+l.8m, z=-2.3m
8.0 m #2
Z4 steps per traverse 5.4 m
Flow Figure 8. Microphone settings for noise measurement in ATIC model rotor test. -4.0 ~.
-4.
~_
....... ,............
m 3,
m2.0
"1 1
m
1.0
105 - 106
~
0.0
1.0
1.0
11o - 111 lO9 - 11o
~ N - 11087~ 106
-1.
0.0
l ml
t8~- t8~
102 101 I00 99 9897-
- 103 - 102 - 101 - 100 99 98
94- 9 5
!
93I ""~"t
--'~
2.0- -9..0-1.0 0.0
1'.0
2[0
Experiment
2
12o
118 - 119 117 - 118
114 - 115 113 114
~......
--2,
ff~-
-20 -10 fl0 Prediction
10
2{'1
m
94
-
91
Over all SPL [dB]
Figure 9. Comparison of measured and calculated carpet noise contours. 4. C O N C L U S I O N S NAL and ATIC have developed a new prediction method of BVI noise by combining an unsteady Euler code using a moving overlapped grid method and an aeroacoustic code based on the FW-H formulation. The aerodynamic and aeroacoustic results calculated using a vector parallel super computer, NWT, are in close agreement with the experimental data obtained by ATIC. The distinct spikes in the waveforms of BVI noise are successfully predicted by the present method. 5. F U T U R E P R O B L E M S One of the future problems is to combine CFD solvers for rotors and fuselages. The combined tool will be useful for the design of advanced helicopters.
104 REFERENCES
111 Tadghighi, H., Hassan, A. A., and Charles, B., Prediction of Blade-Vortex Interaction Noise Using Ariloads Generated by a Finite-Difference Technique, AHS 46th Annual Forum, Washington DC, May 1990. 121 Aoyama,T., Kondo,N., Aoki,M., Nakamura, H., and Saito,S., Calculation of Rotor BladeVortex Interaction Noise using Parallel Super Computer, 22nd European Rotorcraft Forum, No.8, 1996. [31 Kondo,N., Nishimura,H., Nakamura, H., Aoki,M., Tsujiuchi,T., Yamakawa,E., Aoyama,T., and Saito,S., Preliminary Study of a Low Noise Rotor, 23rd European Rotorcraft Forum, No.22, Dresden, Germany, 1997. [4] Ahmad, J., Duque, E. P. N., and Strawn, R. C., Computations of Rotorcraft aeroacoustics with a Navier-Stokes/Kirchhoff Method, 22nd European Rotorcraft Forum, Brighton, UK, Sep 1996. [51 Ochi, A., Shima, E., Aoyama, T., and Saito, S., A Numerical Simulation of Flow around Rotor Blades Using Overlapped Grid, Proceedings of the 15th NAL Symposium on Aircraft Computational Aerodynamics (NAL SP-37), pp.211-216, Tokyo, Japan, Jul 1997. [61 Ochi, A., Shima, E., Aoyama, T., and Saito, S., Parallel Numerical Computation of Helicopter Rotor by Moving Overlapped Grid Method, Proceedings of AHS International Meetings on Advanced Rotorcraft Technology and Disaster Relief, Paper No. T1-6, Gifu, Japan, April 1998. 171 Ochi, A., Shima, E., Yamakawa, E., Aoyama, T., and Saito, S., Aerodynamic and Aeroacoustic Analysis of BVI by Moving Overlapped Grid Method, 24th European Rotorcraft Forum, Paper No. AC04, Marseilles, France, Sep 1998. [81 Ochi, A., Aoyama, T., Saito, S., Shima, E., and Yamakawa, E., BVI Noise Predictions by Moving Overlapped Grid Method, AHS 55th Annual Forum, Montreal, Canada, May 1999. I91 Murashige, A., Tsuchihashi, A., Tsujiuchi, T., and Yamakawa, E., Blade-Tip Vortex Measurement by PIV, 23rd European Rotorcraft Forum, Paper No. 36, Dresden, Germany, Sep 1997. I10] Murashige A., Tsuchihashi, A., Tsujiuchi, T., and Yamakawa, E., Experimental Study of Blade-tip Vortex, AHS International Meeting on Advanced Rotorcraft Technology and Disaster Relief, Paper No. T3-6, Gifu, Japan, Apr 1998. [111 Murashige, A., N. Kobiki, Tsuchihashi, A., Nakamura, H., Inagaki, K., and Yamakawa, E., ATIC Aeroelastic Model Rotor Test at DNW, 24th European Rotorcraft Forum, Paper No. AC02, Marseilles, France, Sep 1998. I121 Aoyama, T., Kawachi, K., and Saito, S., Unsteady Calculation for Flowfield of Helicopter Rotor with Various Tip Shapes, 18th European Rotorcraft Forum, Paper No. B03, Avignon, France, Sep 1992. [131Yamamoto, S. and Daiguji, H., Journal of Computers & Fluids, 22, pp.259-270, 1993. [141 Shima, E. and Jounouchi, T., Role of CFD in Aeronautical Engineering (No.14) - AUSM type Upwind Schemes -, Proceedings of the 14th NAL Symposium on Aircraft Computational Aerodynamics (NAL SP-34), pp.7-12, Tokyo, Japan, Jul 1996. 1151 Nakamura, Y. and Azuma, A., Rotational Noise of Helicopter Rotors, Veriica, Vol. 3, pp. 293-316, 1979.
Parallel Computational Fluid Dynamics Towards Teraflops, Optimizationand Novel Formulations D. Keyes, A. Ecer, J. Periaux,N. Satofukaand P. Fox (Editors) 92000 Elsevier Science B.V. All rights reserved.
105
Domain decomposition implementations for parallel solutions of 3-D NavierStokes equations A.R.Aslan a, U.Gulcat a, A.Mlslrhoglu a and F.O.Edis a aFaculty of Aeronautics and Astronautics, Istanbul Technical University, 80626, Maslak, Istanbul, Turkey
Two different domain decomposition methods are implemented for the parallel solution of the Poisson's equation arising in the numerical analysis of the incompressible flow problems. The Poisson's equation is written both in terms of pressure and auxiliary potential. As a test case, the cubic cavity problem is analysed with both methods and formulations. In light of cavity results, flow about a complex geometry, wing-winglet configuration, is analysed using the more efficient domain decomposition method. Pressure and auxiliary potential formulation solutions are compared in terms of accuracy, computation time and parallel efficiency.
1. I N T R O D U C T I O N The Domain Decomposition Method (DDM) is an efficient way of handling large scale flow problems. Since its introduction[l] to the CFD community, various types and numerous implementations have appeared in the literature [2,3]. The most efficient type is the nonoverlapping DDM, which enables one to handle the interface conditions smoothly. Hence, using non-overlapping DDM, super-linear speed-up is possible even for solutions obtained on clusters of workstations. Such speed-ups are presented in Reference [4], where a second-order accurate fractional step time discretization scheme is implemented along with a nonoverlapping iterative domain decomposition method for the finite element solution of incompressible flows. In the DDM method[4], a unit problem is set up from the actual problem of solving the Poisson's equation on a subdivided domain. The Poisson's equation arises for an auxiliary potential defined within the time discretization of N-S equations. The force vector is constructed from the divergence of the half-step velocity field, which is being corrected to satisfy continuity using the solution of the auxiliary potential field. Different convergence rates can be achieved by solving the Poisson's equation for pressure rather than an auxiliary potential. The known pressure field from the previous time-step can be used as an initial guess to accelerate the convergence of the iterative solution for sequential computations. Furthermore, it is anticipated that the DDM can also be accelerated by using the pressure field from the previous time step. In the present work, four different cases of two DDM implementations are analysed. The DDM methods implemented are similar to those of references [1] and [2]. The first method This research has been supported by TUB1TAK (Turkish Scientific and Technical Research Council), under the title COST-F1.
106 (method 1) uses a sequence of initialization-iteration for a unit problem-finalization, while the second one (method 2) is called as the one shot formulation which does not require a finalization. The first method is employed for both auxiliary potential and pressure formulations, whereas the one shot formulation is employed directly to pressure formulation for the sake of convenience. Another option for accelerating the convergence of the iterative domain decomposition procedure is to precondition the interface problem[2]. For high latency parallel systems this is very important since the communication overhead for the solution of the Poisson's equation influences the efficiency which is adversely affected by the number of interface iterations. Differences in the efficiency obtained with these implementations are compared in this study for a test case, which is the lid driven cubic cavity flow of a Re= 1000. The preliminary results indicate that method 1 employed for the auxiliary potential formulation yields the desired solution in the least CPU time. Method 1 employed for pressure formulation yielded the second fastest solution in terms of CPU time. Method 2 currently requires considerably more CPU time, three times as much, for the same test case. The fastest of these methods, method 1, is then used for the solution of a flow problem around a wing-winglet configuration. The Reynolds number based on the chord length is 1000. A fourth order artificial diffusion is employed to avoid spurious oscillation in the velocity field arising from the convective terms [4]. The use of a fourth-order diffusion preserves the second-order accuracy of the scheme. The results obtained, in terms of efficiency and speed-up, with method 1 for the wingwinglet configuration are presented. The computations are carried on a cluster of 8 DEC Alpha XL 266 workstations interconnected through a 100 Mbps TCP/IP network. The communication library used is PVM 3.3.
2. FORMULATION 2.1 Navier-Stokes equations The equations governing the flow of an unsteady, incompressible, viscous fluid are the continuity equation V.u -0 (1) and the momentum (Navier-Stokes) equation D___uu= -Vp + 1 V 2u (2) Dt Re The equations are written in vector form (here on, boldface type symbols denote vector or matrix quantities). The variables are non-dimensionalized using a reference velocity and a characteristic length, as usual. Re is the Reynolds number, Re=U//v where U is the reference velocity, I is the characteristic length and v is the kinematic viscosity of the fluid. Velocity vector, pressure and time are denoted with u, p and t, respectively. The proper boundary and initial conditions are imposed on pressure and velocity values for the solution of the Equations 1 and 2 depending on the flow being internal or external[5].
2.2 FEM formulation. This study utilized the second-order accurate, in both time and space, scheme which was previously developed and implemented for solution of the three-dimensional incompressible
107
Navier-Stokes equations[2]. In the present work, the Poisson's equation is also formulated in terms of pressure. In this method, the solution is advanced over a full time step At (from n to n+ 1) according to M
nI
u~n+l/2 - Mu~ + B~ + PeC~ -
Mu~ - M u ;
+
PeC~+
B~-
[_~e~] nat +D ~ 2 +D
1AO--E~u;/At 2 Pip n+l - 2E,~u~ / A t - Ap n§
~-
Mu~-1E
(pn+l _
At
(4) (5) (5a)
M u~n+l= Mu~. - - ~1E , ~ A t Mu7
~
(3)
(6)
pn)At
(6a)
2
pn+l_ pn--~e
(7)
where subscript e stands for element, ct indicates the Cartesian coordinate components x,y and z; ~ is the auxiliary potential function, M is the lumped element mass matrix, D is the advection matrix, A is the stiffness matrix, C is the coefficient matrix for pressure, B is the vector due to boundary conditions and E is the matrix which arises due to incompressibility. The following steps are taken to advance the solution one-time step: i) Eqn(3) is solved to find the velocity field at time level n+l/2, ii) Eqn(4) is solved to obtain the intermediate velocity field u*, iii) Knowing the intermediate velocity field, in case of auxiliary potential formulation Eqn(5) or in case of pressure formulation Eqn(5a) is solved with domain decomposition, iv) New time step velocity values are obtained from Eqn(6) in case of auxiliary potential formulation or Eqn(6a) for pressure formulation. v) In case of ,p formulation Eqn(7) is used to calculate the new value of pressure at element level. In all computations the lumped form of the mass matrix is used.
3. D O M A I N D E C O M P O S I T I O N The two domain decomposition techniques [1,2,6] are applied for the efficient parallel solution of the Poisson's Equation for the auxiliary potential function, Eqn(5), or for the pressure equation, Eqn(5a).
Domain Decomposition-1 The first method is employed for both auxiliary potential and pressure formulations. This method uses a sequence of initialization-iteration for a unit problem-finalization procedure.
108
Initialization: Eqn(5) or (5a) is solved in each domain ~"~i with boundary of ~-~i and an interface Sj, with vanishing Neumann boundary condition on the domain interfaces. -- AYi
"-"
f i
in
~~i
Yi
--
gi
on
~f~i
~Yi ~ni
=
0
on
Sj
/t o
9 arbitrarily chosen
g0 = /t0_(y2_Yl)sj w0 = gO
where yi stands either for ~ or p. Unit Problem: In this DDM a unit problem is set up from the actual problem of solving the Poisson's equation on a subdivided domain as -Ax" = 0 in [2 i i
n
X.l
=
0
on
~i
"--
(--1) i-1 W"
on
S
n
b Xi ~n i
Steepest Descent a w n ""
-- X2 s j
g O = (yO_ 1
Y")s 2
j
g n+l ._ g . - fl" aw"
S n --
J
ill n+l = IU n -- ~ n w n
wn+l : gn+l + s n w n J Sj
~_. [
(aw")w"ds
J sj
convergence check: I/t TM -/tk]< e Finalization: Having obtained the correct Neumann boundary condition for each interface, the original problem is solved for each domain. -Ayi
=
fi
in
Yi
"-
gi
o n O~~i
~i
~Yi
-- (--1) i-l~t/n+l o n
Sj
3ni
In this section, subscript i and j indicate the domain and the interface respectively, superscript n denotes iteration level.
Domain Decomposition-2: one-shot formulation The second method is called the one shot formulation and does not require a finalization._The one shot formulation is applied directly to pressure formulation.
109
Initialization V2p ~ - f i n
Pi - 0
on on
r (' = P2 - Pl
~'~i
O~i
with
g ( ' = B -~r ~ Wo
~176 = 0
on
Sj
o
=g
On
One Shot ~72--n
on
Pi - 0
-p?--O
on
~"~i ~'~i
~P; _(_l)i-lwn On
fl" = on
Sj
for
n>O
'g"dz Y" w ' d z
Ir n
gn+l _ . g n _ _ & ~ n mn
P7+ ~ = p i - f l . p i
F n+l _. F n ~ ~
--n
nF
wn+l _. g n+l .~. S n W n
~-. _ B-1Fn convergence check: rn+~ ~ 0 then p"+~is the solution. B -~ is the preconditioner.
4. P A R A L L E L I M P L E M E N T A T I O N During parallel implementation, in order to advance the solution a single time step, the momentum equation is solved explicitly twice, Eqns.(3) and (4). At each solution interface values are exchanged between the processors working with domains having common boundaries. Solving Eqn.(4) gives the intermediate velocity field, which is used at the right hand sides of Poisson's Equation (5) or (5a), in obtaining the auxiliary potential or pressure, respectively. The solution of the Poisson equation is obtained with domain decomposition where an iterative solution is also necessary at the interface. Therefore, the computations involving an inner iterative cycle and outer time step advancements have to be performed in a parallel manner on each processor communicating with the neighboring one. The master-slave processes technique is adopted. Slaves solve the N-S equations over the designated domain while master handles mainly the domain decomposition iterations. All interfaces are handled together.
5. RESULTS AND DISCUSSION The two DDM methods are used to solve the NS equations with FEM. First, as a test case, an 1 lxl lxl 1 cubic cavity problem is selected. The Reynolds Number based on the lid length and the lid velocity is 1000. The solution is advanced 1000 time steps up to the dimensionless time level of 30, where the steady state is reached. Four cases investigated are: 1. Potential based solution with initial ~=0, 2. Pressure based solution with initial p=0, 3. Pressure based solution with P=Pold for initialization and finalization, 4. Pressure based solution with one shot.
110 For the above test case the first method gave the desired result in the least CPU time. Therefore, a more complex geometry, namely the Re=1000 laminar flow over a wing-winglet configuration is then analysed with this method. Four and six domain solutions are obtained. The number of elements and grid points in each domain are given in Table 1, while the grid and domain partition for the 6 domain case is shown in Fig. 1. Solutions using pressure and auxiliary potential based Poisson's equation formulation are obtained.The solution is advanced 200 time steps up to the dimensionless time level of 1. The tolerance for convergence of DDM iterations is selected as 5x10 -5 while for EBE/PCG convergence the tolerance is 1x 10-6.
Table 1.
Domain Domain Domain Domain Domain Domain
1 2 3 4 5 6
4 Domain case Number of grid Number of Elements points 7872 9594 7872 9594 7872 9466 7872 9450 -
6 Domain case Number of Number of grid Elements points 7462 5904 6396 4920 6396 4920 6316 4920 6300 4920 7350 5904
Pressure based formulation gave much smoother results, particularly in spanwise direction compared to poteantial based formulation. The pressure isolines around the wing-winglet configuration is shown in Fig.2. The effect of the winglet is to weaken the effect of tip vortices on the wing. This effect on cross flow about the winglet is observed and is seen in Fig.3, at mid spanwise plane of the winglet. Table 2 shows the CPU time and the related speed-up values for 4 and 6 domain solutions of the wing-winglet problem.
Table 2. Process Master Slave-I Slave-II Slave-III Slave-IV Slave-V Slave-VI
based solution (CPU seconds) Speed-up 6 domain 4 domain 0.33 1430 466 50943 57689 0.89 56773 50594 56115 52826 56402 53224 51792 63527 -
p based solution (CPU seconds) 4 domain 6 domain Speed-up 899 1302 0.69 95619 56242 1.76 100443 58817 103802 50756 97030 51396 47074 58817
The speed-up value is 0.89 for 6 domain potential based solution and 1.76 for 6 domain pressure based solution where normalization is done with respect to the CPU time of 4domain solution. The potential based solution presents a speed-down while pressure based solution gives a super-linear speed-up. In the EBE/PCG iterative technique used here, the number of operations is proportional to the square of the number of unknowns of the problem, whereas, in domain decomposition the size of the problem is reduced linearly by number of
111
Figure 1. External view of the grid about the wing-winglet configuration and the 6 domain partition.
Figure 2. Pressure isolines about the wing-winglet configuration. Re=1000 and Time=l. domains. Hence, the overall iterations are reduced linearly as the number of domains increases. Therefore, more than 100% efficiencies are attained as the whole domain is divided
112
I
i
!:
!
7
/ p,
~
1
.......
7 .7:
~
\
i ..
...."
N.
....
Figure 3. Wing-winglet spanwise mid plane cross flow distribution. into a larger number of subdomains. Pressure and potential based solutions are compared in Table 3 in terms of iteration counts.
6. CONCLUSION A second order accurate FEM together with two matching nonoverlapping domain decomposition techniques is implemented on a cluster of WS having very low cost hardware configuration. Poisson's equation is formulated both in terms of velocity potential and pressure itself. Flow in a cubic cavity and over a wing-winglet configuration are analysed. Using method 1, parallel super-linear speed-up is achieved with domain decomposition technique applied to pressure equation. Method 2 (one shot DDM) requires a good preconditioner to achieve super-linear speed-ups. For future work, three-dimensional computations will continue with direct pressure formulations and better load balancing for optimized parallel efficiencies. A variety of preconditioners will be empoyed to speed up the one shot DDM.
REFERENCES [ 1] R. Glowinski and J. Periaux, "Domain Decomposition Methods for Nonlinear Problems in Fluid Dynamics", Research Report 147, INRIA, France, 1982.
113
Table 3. 4 Domain solution Formulation Process 1 Process 2 Process 3 Process 4 6 Domain solution Formulation Process 1 Process 2 Process 3 Process 4 Process 5 Process 6
~, 4 process P, 4process ~, 6 process P, 6process
Total number of Pressure Iterations
Minimun number of Pressure Iterations
r based p based 1164143 2099256 1217973 2202983 1225096 2217662 1129044 2051901 Total number of Pressure Iterations
~ based p based 3577 7838 3729 8288 3752 8325 3467 7671 Minimun number of Pressure Iterations
r based p based ~ based p based 2079923 1767713 5741 6138 2145158 1813694 5871 6293 2269113 1910254 6205 6649 2397005 2007897 6558 6979 2070968 1744059 5642 6064 1998353 1696447 5497 5898 Total number of Minimun number Domain Decom. of Domain Decom. Iterations Iterations 4175 12 7871 29 8017 21 6739 23
Maximum number of Pressure Iterations r based p based 13006 16278 13626 17003 13749 17216 12707 15946 Maximum number of Pressure Iterations r based p based 15290 13432 15812 13752 16704 14471 17691 15177 15231 13261 14732 12886 Maximum number of Domain Decom. Iterations 49 62 60 52
Average number of Pressure Iterations r based 5821 6090 6125 5645
p based 10496 11015 11088 10259
Average number of Pressure Iterations r based 10400 10726 11346 11985 10355 9992
p based 8839 9068 9551 10039 8720 8482
Average number of Domain Decom. Iterations 21 39 40 34
[2] R. Glowinski, T.W. Pan and J. Periaux, A one shot domain decomposition /fictitious domain method for the solution of elliptic equations, Parallel Computational Fluid Dynamics, New Trends and Advances, A. Ecer at.al.(Editors), 1995 [3] A. Suzuki, Implementation of Domain Decomposition Methods on Parallel Computer ADENART, Parallel Computational Fluid Dynamics, New Algorithms and Applications, N. Satofuka, J. Periaux and A. Ecer (Editors), 1995. [4] A.R. Asian, F.O. Edis, U. Grill:at, 'Accurate incompressible N-S solution on cluster of work stations', Parallel CFD '98 May 11-14, 1998, Hsinchu, Taiwan [5] U. Gulcat, A.R. Asian, International Journal for Numerical Methods in Fluids, 25, 9851001,1997. [6] Q.V. Dinh, A. Ecer, U. Gulcat, R. Glowinski, and J. Periaux, "Concurrent Solutions of Elliptic Problems via Domain Decomposition, Applications to Fluid Dynamics", Parallel CFD 92, May 18-20, Rutgers University, 1992.
This Page Intentionally Left Blank
Parallel ComputationalFluid Dynamics Towards Teraflops, Optimizationand Novel Formulations D. Keyes,A. Ecer, J. Periaux,N. Satofukaand P. Fox (Editors) 92000 Elsevier Science B.V. All rights reserved.
Parallel Implementation
115
of t h e D i s c o n t i n u o u s G a l e r k i n M e t h o d *
Abdelkader Baggag ~, Harold Atkins b and David Keyes c ~Department of Computer Sciences, Purdue University, 1398 Computer Science Building, West-Lafayette, IN 47907-1398 bComputational Modeling and Simulation Branch, NASA Langley Research Center, Hampton, VA 23681-2199 ~Department of Mathematics & Statistics, Old Dominion University, Norfolk, VA 23529-0162, ISCR, Lawrence Livermore National Laboratory, Livermore, CA 94551-9989, and ICASE, NASA Langley Research Center, Hampton, VA 23681-2199 This paper describes a parallel implementation of the discontinuous Galerkin method. The discontinuous Galerkin is a spatially compact method that retains its accuracy and robustness on non-smooth unstructured grids and is well suited for time dependent simulations. Several parallelization approaches are studied and evaluated. The most natural and symmetric of the approaches has been implemented in an object-oriented code used to simulate aeroacoustic scattering. The parallel implementation is MPI-based and has been tested on various parallel platforms such as the SGI Origin, IBM SP2, and clusters of SGI and Sun workstations. The scalability results presented for the SGI Origin show slightly superlinear speedup on a fixed-size problem due to cache effects. 1. M o t i v a t i o n The discontinuous Galerkin (DG) method is a robust and compact finite element projection method that provides a practical framework for the development of high-order accurate methods using unstructured grids. The method is well suited for large-scale time-dependent computations in which high accuracy is required. An important distinction between the DG method and the usual finite-element method is that in the DG method the resulting equations are local to the generating element. The solution within each element is not reconstructed by looking to neighboring elements. Thus, each element may be thought of as a separate entity that merely needs to obtain some boundary data from its neighbors. The compact form of the DG method makes it well suited for parallel computer platforms. This compactness also allows a heterogeneous treatment of problems. That is, the element topology, the degree of approximation and even the choice *This research was supported by the National Aeronautics and Space Administration under NASA contract No. NAS1-97046 while Baggag and Keyes were in residence at the Institute for Computer Applications in Science and Engineering (ICASE), NASA Langley Research Center, Hampton, VA 236812199.
116 of governing equations can vary from element to element and in time over the course of a calculation without loss of rigor in the method. Many of the method's accuracy and stability properties have been rigorously proven [15] for arbitrary element shapes, any number of spatial dimensions, and even for nonlinear problems, which lead to a very robust method. The DG method has been shown in mesh refinement studies [6] to be insensitive to the smoothness of the mesh. Its compact formulation can be applied near boundaries without special treatment, which greatly increases the robustness and accuracy of any boundary condition implementation. These features are crucial for the robust treatment of complex geometries. In semi-discrete form, the DG method can be combined with explicit time-marching methods, such as Runge-Kutta. One of the disadvantages of the method is its high storage and high computational requirements; however, a recently developed quadrature-free implementation [6] has greatly ameliorated these concerns. Parallel implementations of the DG method have been performed by other investigators. Biswas, Devine, and Flaherty [7] applied a third-order quadrature-based DG method to a scalar wave equation on a NCUBE/2 hypercube platform and reported a 97.57% parallel efficiency on 256 processors. Bey et al. [8] implemented a parallel hp-adaptive DG method for hyperbolic conservation laws on structured grids. They obtained nearly optimal speedups when the ratio of interior elements to subdomain interface elements is sufficiently large. In both works, the grids were of a Cartesian type with cell sub-division in the latter case. The quadrature-free form of the DG method has been previously implemented and validated [6,9,10] in an object-oriented code for the prediction of aeroacoustic scattering from complex configurations. The code solves the unsteady linear Euler equations on a general unstructured mesh of mixed elements (squares and triangles) in two dimensions. The DG code developed by Atkins has been ported [11] to several parallel platforms using MPI. A detailed description of the numerical algorithm can be found in reference [6]; and the description of the code structure, parallelization routines and model objects can be found in reference [11]. In this work, three different parallelization approaches are described and efficiency results for the selected approach are reported. The next section provides a brief description of the numerical method and is followed by a discussion of parallelization strategies, a citation of our standard test case, and performance results of the code on the Origin2000 and several other computing platforms. 2. D i s c o n t i n u o u s G a l e r k i n M e t h o d The DG method is readily applied to any equation of the form
OU - - - t - V . F(U) = O. Ot
(1)
on a domain that has been divided into arbitrarily shaped nonoverlapping elements f~i that cover the domain. The DG method is defined by choosing a set of local basis functions B = {b~, 1 _< l _< N(p, d)} for each element, where N is a function of the local polynomial order p and the number of space dimensions d, and approximating the solution in the
117 element in terms of the basis set N(p,d)
Ua, ~ Vi -
~
vi,l bl.
(2)
l=l
The governing equation is projected onto each member of the basis set and cast in a weak form to give
OVi
bkFn(~, Vjj) gijds
O,
(3)
where Vi is the approximate solution in element ~i, Vj denotes the approximate solution in a neighboring element ~j, 0ghj is the segment of the element boundary that is common to the neighboring element gtj, gij is the unit outward-normal vector on O~ij, and V~ and Vj denote the trace of the solutions on O~ij. The coefficients of the approximate solution vi,z are the new unknowns, and the local integral projection generates a set of equations governing these unknowns. The trace quantities are expressed in terms of a lower dimensional basis set bz associated with O~ij. fir denotes a numerical flux which is usually an approximate Riemann flux of the Lax-Friedrichs type. Because each element has a distinct local approximate solution, the solution on each interior edge is double valued and discontinuous. The approximate Riemann flux ffR(Vi, Vj) resolves the discontinuity and provides the only mechanism by which adjacent elements communicate. The fact that this communication occurs in an edge integral means the solution in a given element V~ depends only on the edge trace of the neighboring solution Vj, not on the whole of the neighboring solution Vj. Also, because the approximate solution within each element is stored as a function, the edge trace of the solution is obtained without additional approximations. The DG method is efficiently implemented on general unstructured grids to any order of accuracy using the quadrature-free formulation. In the quadrature-free formulation, developed by Atkins and Shu in [6], the flux vector ff is approximated in terms of the basis set bz, and the approximate Riemann flux/~R is approximated in terms of the lower basis set bz: N(p,d)
fi(V~) ~
E
f,,
N(p,d-1)
b,,
fiR(V~, Vj). ~ -
/=1
E
fR,' 6,.
(4)
/=1
With these approximations, the volume and boundary integrals can be evaluated analytically, instead of by quadrature, leading to a simple sequence of matrix-vector operations
(5)
O[vi,l] = (M_IA) [fi,l] - E (M-1BiJ)[fiR,l], Ot (j} where
M-
bk bl d~ ,
A =
Vbk bl dfl ,
Bij =
bk bz ds .
(6)
118
The residual of equation (5) is evaluated by the following sequence of operations:
[~ij,,] = Ti.~[vi,t] } = .F(Vi).n-'ij
Vai,
[fij,,]
v
R
V Of~i~,
O[vi,l] Ot
= =
(M-1A)[fi,t]-
E
/
-R (M-1BiJ)[fij,t]
V f2~
where T~j is the trace operator, and [()~j,z] denotes a vector containing the coefficients of an edge quantity on edge j. 3. Parallel Computation In this section, three different possible parallelization strategies for the DG method are described. The first approach is symmetric and easy to implement but results in redundant flux calculations. The second and third approaches eliminate the redundant flux calculations; however, the communication occurs in two stages making it more difficult to overlap with computation, and increasing the complexity of the implementation. The following notation will be used to describe the parallel implementation. Let f~ denote any element, instead of f2~, and let 0 f2p denote any edge on the partition boundary, and 0 f2/ denote any other edge. The first approach is symmetric and is easily implemented in the serial code in reference [11]. It can be summarized as follows: 1. Compute [~j,l] and If j,1]
V f2
2. Send [vj,t] and [/j,l] on 0f~p to neighboring partitions --R
3. Compute [fl] and (M-1A)[fl] Vf~, and [f~,t] V0f~/ 4. Receive [v~,l] and --R
[fj,z] on
0f2p from neighboring partitions --R
5. Compute [fj,z] V0f2p and (M-lB,)[f3,t]
Vf2
In this approach, nearly all of the computation is scheduled to occur between the --R nonblocking send and receive; however, the edge flux [fj,l] is doubly computed on all 0f2p. It is observed in actual computations that redundant calculation is not a significant --R factor. The calculation of [fiLl] on all Ofhj represents only 2% to 3% of the total CPU time. The redundant computation is performed on only a fraction of the edges given by
0a
/(0a, u 0a ).
The above sequence reflects the actual implementation [11] used to generate the results to be shown later; however, this approach offers the potential for further improvement. By collecting the elements into groups according to whether or not they are adjacent
119
to a partition boundary, some of the work associated with the edge integral can also be performed between the send and receive. Let Ftp denote any element adjacent to a partition boundary and FtI denote any other element. The following sequence provides maximal overlap of communication and computation. 1. Compute [Vj:] and
If j:]
yap
w
2. Send [Vj,z] and [fj,z] on 0Ftp to neighboring partitions --
3. Compute [vj:] and [fj,z] Vai, and 4. Compute [fd and (M-1A)[fl]
[fj,~] VOFti --R
--R
and (M-1Bj)[fjj]
Vai
VFt
5. Receive [vj:] and [fj,d on 0ap from neighboring partitions --R
--R
6. Compute [fj,l] VOFtp and (M-lB,)[fj,l]
VFtp
3.1. O t h e r P a r a l l e l i z a t i o n S t r a t e g i e s
Two variations of an alternative parallelization strategy that eliminates the redundant --R flux calculations are described. In these approaches, the computation of the edge flux [fj,z] on a partition boundary is performed by one processor and the result is communicated to the neighboring processor. The processor that performs the flux calculation is said to "own" the edge. In the first variation, all edges shared by two processors are owned by only one of the two processors. In the second variation, ownership of the edges shared by two processors is divided equally between the two. Let OFt(a) denote any edge owned by processor A, Oft(pb) denote any edge owned by adjacent processor B, and 0Ft(ab) denote any edge shared by processors A and B. For the purpose of illustration, let ownership of all Oa (ab) in the first variation be given to processor A. Thus { 0Ft(a) } V1{ Off(b) } = { ~ } in both variations, and { O Ft(p~b) } gl{ O f~(pb) } = { 0 } in the first variation. Both computation can be summarized in the following steps: Process A 1. 2. 3. 4. 5. 6. 7. 8. 9. 10. 11. 12.
Compute [vj,l] and [fj,L] V Ftp Send [Vj,t] and [f-j:] V 0Ftp \ 0Ft(p~) Compute [vj,,] and [fj,z] V a I Compute [fz] V Ft Receive [Vj,z] and [fj,l] V OFt~a) --R Ft(pa) Compute [fy,l] VO --R ~-~(pC) Send [fj:] V0 --R Compute [fj,l] V 0 f~I Compute (M-1A)[fd V Ft --R Compute (M-1Bj) [fj,t] VQI Receive [f- nj,,] V 0 t2p \ 0 f~(a) C o m p u t e ( M - 1 B j ) [ j,z] V~p
Process B Compute [Vj,l] and [fj,l] V Ftp Send [~j,l] and [f~:] V 0 Ft, \ 0 Ft(pb) Compute [vj,z] and [fj,l] V~']I Compute [ft] V Ft Receive [vy,z] and [f-~,,] V O fl(pb) --R V0 Ft(pb) Compute [fj,l] --R
~(b)
Send [f.~:] V O --R Compute [fj,l] V O t2i Compute (M-1A)[fl] V ft Compute (M- 1By) [L,,] --R va, ~ Receive [fy,,] V 0Ftp \ 0 gt(ps) Compute (M-1Bj) [fj:] --R
V ~p
120
It is clear that under these strategies, there are no redundant flux calculations. In both variations, the total amount of data sent is actually less than that of the symmetric approach presented earlier. However, because the sends are performed in two stages, it is more difficult to overlap the communication with useful computation. Also, the unsymmetric form of edge ownership, in the first variation, introduces a difficulty in balancing the work associated with the edge flux calculation. In the second variation, the number of sends is twice that of the symmetric approach; however, because { 0 gt(ab) } r { 0 Ft(b) } = { ~ } in the first variation, the total number of sends in this approach is the same as in the symmetric approach presented earlier. 4. P h y s i c a l P r o b l e m The parallel code is used to solve problems from the Second Benchmark Problems in Computational Aeroacoustics Workshop [12] held at Flordia State University in 1997. The physical problem is the scattering of acoustic waves and is well represented by the linearized Euler equations written in the form of equation (1). Details of the problem can be found in reference [12]. Figure (1.a) shows a typical partitioned mesh.
0
(a)
1
2
3 4 5 Number of Processors)
6
7
(b)
Figure 1. Partitioned mesh (a), Performance on SP2 and workstations dusters (b)
5. R e s u l t s and Discussion Performance tests have been conducted on the SGI Origin2000 and IBM SP2 platforms, and on clusters of workstations. The first test case applied a third-order method on a coarse mesh of only 800 elements. Near linear speedup is obtained on all machines until the partition size becomes too small (on more than 8 processors) and boundary effects dominate. The SP2 and clusters are profiled in Figure (1.b). This small problem was also run on two clusters of, resp., SGI and Sun workstations in an FDDI network. The two clusters consisted of similar but not identical hardware. The network was not dedicated to the cluster but carried other traffic. As the domain is divided over more processors, the
121
140 120
! i ..
! ! ! # Elements = ~1.0,392 -~.§ # Elements = 98,910 --+~-;~
.................................................. #.E!ements.-~,.~54;000...;~:,<:
............... i........... i~Jl ...... :....i .........
100
i
I I / ~ ,
300o
# Elements = 40,.392 # Elements = 98,910 .....
i
~i............................. i............................... i............ ,--~io~-o~--=-.,-~ooo ...............
2800
26OO
8o
.
3200
00
.............................
40
.............
- .........
2400
.....................................................
:: --~<:
................................
20
.............
i ..............................
........................................................................
0 20
40
60 #processors
80
100
120
2200
2000
1800
2.5
3
~"
3.5
'~
4
Log(# Elements)
4.5
5
5.5
(b) Figure 2. Speedup t:or large domain on Origin 2000 (a), Computational Rate (b)
number of elements assigned to each processor becomes too small and the communication overhead becomes apparent. The performance figures are far below ideal; however, four to eight distributed workstations still provide a valuable performance improvement. Three larger problems were used to evaluate the code on the Origin2000. For these cases a fifth-order method was used and the problem size was controlled by varying the element size and by varying the location of the outer boundary. In the small and medium size problems, a slight superlinear speedup is obtained for some range of processors. This result is due to the improvement in cache performance that occurs when a fixed problem is divided into smaller parts as the number of processors is increased, and workingsets become cache-resident. The larger problem does not fit in cache in 128 processors and no superlinear speedup is observed; however, it is expected that performance will improve as the number of processors is increased. This is supported by Figure (2.b) which shows the computational rate as a function of the number of elements on each processor. The computational rate is defined as the maximum wall clock time of any processor divided by the number of degrees of freedom per processor. As the number of processors is increased, the domain is divided into smaller parts such that, at some point, all of the data fits in cache. For the cases that were run, this point corresponds to, approximately, a load of 1500 elements per processor. The computational rates for the three large problems are similar indicating good scalability. Thus, computational rate may be a better indicator of scalability than the usual "speedup" measure. 6. C o n c l u s i o n s Three parallelization strategies for the DG method have been described. A symmetric parallelization approach was implemented in an object-oriented computational aeroacoustics code that was ported to several distributed memory parallel platforms using MPI, and performance results are presented. The DG method provides a significant amount of computational work that is local to each element, which can be effectively used to hide the communication overhead. The compact character of the method is exploited in the
122 implementation. The parallel implementation gives nearly superlinear speedup for large problems. This is attributed to the compact form of the DG method which allows useful work to be overlapped with communication as well as cache accelerations that occur when workingsets are cache resident. REFERENCES
1. C. Johnson, J. Pitkgrata, An Analysis of the Discontinuous Galerkin Method for a Scalar Hyperbolic Equation, Mathematics of Computation, 46, (1986), pp. 1-26. 2. B. Cockburn, C. W. Shu, TVB Runge-Kutta Local Projection Discontinuous Galerkin Finite Element Method for Conservation Laws II: General Framework, Mathematics of Computation, Vol. 52, No. 186, 1989, pp. 411-435. 3. B. Cockburn, S. Y. Lin, and C. W. Shu, TVB Runge-Kutta Local Projection Discontinuous Galerkin Finite Element Method for Conservation Laws III: One Dimensional Systems, Journal of Computational Physics, Vol. 84, No. 1, 1989, pp. 90-113. 4. B. Cockburn, S. Hou, C. W. Shu, The Runge-Kutta Local Projection Discontinuous Galerkin Finite Element Method for Conservation Laws IV: The Multidimensional Case Mathematics of Computation, Vol. 54, No. 190, 1990, pp. 545-581. 5. G. Jiang, C. W. Shu, On Cell Entropy Inequality for Discontinuous Galerkin Methods, Mathematics of Computation, Vol. 62, No. 206, 1994, pp. 531-538. 6. H.L. Atkins, C. W. Shu, Quadrature-Free Implementation of Discontinuous Galerkin Method for Hyperbolic Equations, AIAA Journal, 36 (1998), pp. 775-782. 7. R. Biswas, K. D. Devine, and J. Flaherty, Parallel, Adaptive Finite Element Methods for Conservation Laws, Applied Numerical Mathematics, Vol. 14, No. 1-3, 1994, pp. 225-283. 8. K.S. Bey, I. T. Oden, A. Patra, A parallel hp-adaptive discontinuous Galerkin method for hyperbolic conservation laws, Applied Numerical Mathematics, 20, 1996, pp. 321336. 9. H.L. Atkins, Continued Development of the Discontinuous Galerkin Method for Computational Aeroacoustic Applications, AIAA Paper 97-1581, May 1997. 10. H. L. Atkins, Local Analysis of Shock Capturing Using Discontinuous Galerkin Methodology, AIAA Paper 97-2032, June 1997. 11. A. Baggag, H. Atkins, C. Ozturan, and D. Keyes, Parallelization of an Object-oriented Unstructured Aeroacoustics Solver, ICASE Report NO. 99-11, Ninth SIAM Conference on Parallel Processing for Scientific Computing, San Antonio, TX, March 22-24, 1999. 12. NASA Conference Publication 3352, Second Computational Aeroacoustics Workshop on Benchmark Problems, Proceedings of a workshop sponsored by NASA, June 1997. ..
Parallel ComputationalFluidDynamics TowardsTeraflops,Optimizationand NovelFormulations D. Keyes,A. Ecer,J. Periaux,N. Satofukaand P. Fox (Editors) 92000 ElsevierScienceB.V.All rightsreserved.
123
N u m e r i c a l S i m u l a t i o n s of C o m p l e x F l o w s w i t h Lattice-Boltzmann-Automata
on P a r a l l e l C o m p u t e r s
J. Bernsdorf, G. Brenner*, F. Durst ~ and M. Baum b ~Institute of Fluid Mechanics, University of Erlangen, Cauerstr. 4, D-91058 Erlangen bNEC C & C Research Laboratories, Rathausallee 10, D-53757 Sankt Augustin The Lattice-Boltzmann-Automata is used for the simulation of low-Reynoldsnumber flow through highly irregular porous media. The approach makes use of 3D computer tomography for the generation of computational meshes used for the simulation. This permits discretization of the detailed structure of the porous medium and thus permits investigation of the properties of the fluid in regions which cannot be acessed by measurements. Special attention is payed to the evaluation of performance aspects of the parallel version of the method, which employs message passing (MPI) and High Performance Fortran (HPF). 1. I N T R O D U C T I O N Classical methods for simulating flow fields (e.g. Finite Volume or Finite Element approaches for RANS, LES and DNS computations) face serious problems when addressing flows in configurations of complex geometry. The production of suitable grids for these solution methods is often extremely time consuming, requires profound expertise (e.g. flows around cars or airplanes) and is in some cases simply not feasible (e.g. in irregular porous media). Lattice Gas Automata (LGA) and Lattice Boltzmann Automata (LBA) overcome these difficulties by simulating simplified molecular gas dynamics derived from the microscopic description of the fluid (instead of solving macroscopic governing equations as in classical methods). Universal no-slip wall boundary conditions are applied that apply in full generality, including edges and corners. Thus a uniform lattice filling the computational domain replaces complex grids, while universal wall boundary conditions enable arbitrarily complex shaped domains. These properties make LBA approaches appealing for the simulation of low Mach-number flows. However, the explicit time advancing that underlays the LBA approach requires considerable computational resources and the efficient implementation of such methods turns out to be a crucial task in CFD. We present a new and efficient implementation of an LBA for parallel computing platforms and examine the performance achieved on different computer architectures. Two parallelization approaches are considered: (1) a domain decomposition technique with data communication performed with MPI and (2) automatic parallelization with HPF compiling systems. The performance of both approaches is demonstrated for distributed *corresponding author (
[email protected])
124
memory computers (Cenju-4, Cenju-3, SP-2) and shared memory architectures (SX-4). As an application of the method, we present the simulation of a low-Reynolds number flow through the complex geometry of a porous SiC structure, digitized by computer tomography. 2. T H E L B A A P P R O A C H
The LBA approach presented in this paper is based on evaluation of the discrete Boltzmann equation, the so called lattice Boltzmann equation [4,6,7]" N~(~', + ~, t, + 1) - N~(~',, t,) + A~(N).
(1)
The particle density distribution function Ni(~',, t,) represents the space (~',), time (t,) and velocity (~) discrete ensemble average of the distribution of (imaginary) fluid particles. The collision operator Ai(N) redistributes the particle densities. In the numerical evaluation of the lattice Boltzmann equation the particle distributions Ni are propagated on a lattice {~',} in discrete time steps t, to their next neighbour nodes in direction ~. Following [6,5] an equivalent operation to the particle collision is performed by evaluating the collision operator in terms of a single step relaxation process on the particle density distribution function Ni. The collision operator is thus given by: Ai(N) - w (N: q - Ni),
(2)
where the relaxation parameter w is related to the fluid viscosity 1(2____ 1) Y--6
(3)
5d
with 0 __ w _ 2. N~q is an equilibrium density distribution function, that is computed locally at every iteration as:
N~q=tpQ
{
1+
ci~u~ u~u~(ci~ci~ c~ + 2c~ c~
5 ~ ) }.
(4)
The density Q and velocity g are computed from the density distribution Ni as: n
E i~=oNi -- Q, E ~ Ni/Q - ~,
(5)
i--0
with weighting factor tp, projection ci~ of ~ onto the axis a, component u~ of fluid velocity ~7in direction a, speed of sound Cs (= 1/v f3). With these relations, all quantities of the equivalent macroscopic equations are obtained from the automata fluid density distributions. Furthermore this approach converges to the time dependent incompressible Navier-Stokes equations in the limits of low Mach and Reynolds numbers. 3. I M P L E M E N T A T I O N OF A N L B A F O R P A R A L L E L C O M P U T E R S Here, we present an implementation of the LBA approach with the BGK model [6,5] that results in the simulation code BEST [1]. A marker and cell technique is used to specify the contours and the volume of solids inside the rectangular computational domain
125 that defines the lattice. Marked cells represent obstacles, while unmarked cells represent gaseous or liquid fluid. The explicit time stepping in BEST is split into two major steps: (1)particle density propagation and (2) particle collision. While the second step is performed locally on each node of the lattice, the first step implies the transfer of data from and to the neighbouring points. However, only next neighbours in the lattice have to be considered in this step. The combination of an explicit time marching method and the restriction to next-neighbour dependencies enables a fast and efficient development of code for execution on parallel machines. However, the very small number of integer and floating point operations per node (about 150 flops per node per iteration) is likely to cause situations where the execution time is determined by the data communication with the processes on which the neighbouring nodes are mapped. Therefore special emphasis has to be put on an efficient parallel execution of BEST. Two different techniques were considered to accomplish this task: (1) a domain decomposition approach with MPI for data transfer between sub-domains and (2) semi-automatic parallelization by HPFcompilers. In the domain decomposition method the whole lattice is subdivided into sub-domains surrounded by ghost-cells that hold the values of the particle densities and the status of the nodes (marked/unmarked) from adjacent sub-domains. In order to minimize the number of ghost-cell layers the operators for the propagation step are written in a form such that a sweep over all nodes of the lattice is restricted to the interior nodes (without ghost-cells layers). Additionally, the procedure is cast in a form that requires only one update of the ghost-cells per time-step. This further decreases the cost of data transfer. A semi-automatic parallelization using HPF-compilers may be justified by the regular structure of the lattices considered by BEST. The parallelization is performed automatically by the HPF-compilers by setting the appropriate directives for the data distribution in the source code. A block-wise data distribution has been chosen to take advantage of the data locality in the particle collision step. Therefore, it was expected that data communication similar to the domain partitioning approach would be produced by the HPF-compilers. Two different HPF-compilers were considered here, the leading edge HPF-compiler ADAPTOR from GMD Forschungszentrum Informationstechnik GmbH [2] and the commercially available HPF-compiler from Portland Group Inc. (pghpf). Figure 1 shows the execution times for the ADAPTOR HPF-compiler in comparison with the MPI based approach for a small test problem, chosen to fit on a node of an IBM SP-2 parallel computer. Despite the small size of the sub-domains when split over 16 processors, a significant speed-up is observed up to 16 processors for both, the domain decomposition and the MPI version of BEST. However, an overhead of about 20% is observed for ADAPTOR, even for single processor execution. With the Portland Group HPF version, a higher overhead was found. This trend was confirmed on other computers such as the NEC Cenju-3 and Cenju-4 systems. On vector processors only the MPI version was evaluated. On a NEC SX-4 a parallel efficiency of 87% was achived on 16 nodes for a problem with 4.2 million gridpoints indicating very good scalability of the code.
126
Performance of BEST on SP2
10,0 ,=..,
c 0 ID X Ill
0,0 1
2
4
8
16
Number of processors
Figure 1. Execution time (microseconds per step per node) of BEST on the SP-2.
4. R E S U L T S To illustrate the capabilities of the Lattice Boltzmann Automata method for simulating flows in highly complex geometries, we present results of a low-Reynolds number flow simulation with BEST through a porous SiC 'sponge-like' structure (see Fig. 2). The geometry was digitized using 3-D computer tomography (3D CT). After a preprocessing step to convert the CT data into a voxel information, the geometry information is integrated in the LBA-software without any time-consuming- and for the present geometries nearly impossible- procedure of grid generation. A cylindrical probe with a height of 30 mm and a diameter of 82 mm was scanned with 3D CT with an average resolution of 0.5 mm. This leads to a discretization of I x x l y x l z = 44 x 147 x 147 voxels. This resolution, although still rather coarse, leads to a qualitatively good reproduction of the basic features of the original porous geometry, as can be seen from the visualization in Fig. 2. The complex geometry data were centered inside a channel to mimic an experimental setup, that is presently used at the LSTM. A flow with a Reynolds number of R e - 1 was simulated using velocity inlet and pressure outlet boundary conditions and periodic boundary conditions elsewere. The simulation was performed on a vector-processor. About 10000 iterations were necessary for this set-up, which took about 5760 CPU seconds, and 800 Mb of computer memory were necessary for the storage of approximately 2.2 x 106 voxels. As a result of the LBA simulation, the pressure and the three components of the flow velocity are known. Local and averaged information from this dataset can be easily computed. Figures 3 and 4 illustrate the complexity of the three-dimensional flow field by displaying an isosurface of the overall flow velocity and some stream ribbons, indicating the tortuosity of the flow. Figure 5 presents the pressure drop in a randomnly packed bed of rectangles, modelling a porous media in terms of the nondimensional friction coefficient
127
lOmm
Figure 2. Section of the digitized SiC structure for the LBA flow simulation.
for Reynolds numbers from 10 -3 up to 102. For a flow in a porous media, it can be shown, both theoretically and experimentally, that the friction coefficient increases nonlinearly for higher Reynoldsnumbers which is confirmed quantitatively in the numerical simulations.
5. C O N C L U S I O N A new and efficient implementation of an LBA approach has been developed. Two alternative parallelization techniques were employed and the performance of codes generated by either the application of HPF-compilers to sequential code or by programming with calls to MPI was compared. While still producing slightly slower code, modern HPF compilers are improving and execution times and speed-up were found close to each other. Furthermore, it was demonstrated that the Lattice Boltzmann Automata are an efficient approach for simulating flows in highly complex geometries. The application of this method was illustrated by the use of three-dimensional computer tomographic data for porous media as a geometry input for LBA simulations.
6. A C K N O W L E D G M E N T S The authors thank O.Giinnewig (HAPEG GmbH, Hattingen) for preparing the 3D computer tomographic data. This work has been partially funded by the Deutsche Forschungsgemeinschaft (DFG) under contract Br 1864/1.
128
Figure 3. Isosurface of the flow velocity in a porous media.
Figure 4. Stream ribbons through the porous media indicating the flow tortuosity.
129
260 <
.
.
.
.
.
.
.
.
i
.
.
.
.
.
.
.
.
,
.
.
.
.
.
.
.
.
i
.
.
.
.
.
.
.
.
.
.
.
.
.
.
,
.
.
.
.
.
.
.
.
i
.
.
.
.
.
.
.
9- - - ~ L B A - Simulation
240
Exp.: 182+1.75 Re
c (J
m R
r
I l l
220
O
O
c O
200
i n
o
i n
I n
U.
180
160 le-03
.
.
.
.
.
.
.
.
i
.
I e-02
.
.
.
.
.
.
.
i
I e-01
.
.
.
.
.
.
.
.
i
I e+00
I e+01
.
I e+02
R e y n o l d s Number Re
Figure 5. Friction coefficient A as a function of Reynolds number Re. The numerical results of the LBA simulations (dashed line) are compared with the experimental data from Durst et al. [3] (straight line)
REFERENCES 1. J. Bernsdorf, F. Durst, and M. Sch~ifer. Cellular automata and finite volume techniques. International Journal for Numerical Methods in Fluids, Paper No.1196, 1998. 2. T. Brandes and F. Zimmermann. A D A P T O R - A transformation tool for HPF programs. In Programming Environments for Massively Parallel Distributed Systems, 91-96. Birkhs Verlag, 1994. 3. Durst F., Haas R. and Interthal W. Journal of Non-Newtonian Fluid Mechanics 22: 169-189, 1987. 4. U. Frisch, D. d'Humi~res, B. Hasslacher, P. Lallemand, Y. Pomeau, and J.P. Rivet. Lattice- Gas Hydrodynamics in two and three dimensions. In Complex Systems 1, 649-707. 1986. 5. P.Bhatnager, E.P.Gross, and M.K.Krook. A model for collision processes in gases I: Small amplitude processes in charged and neutral one-component systems. Phys. Rev., 94:511, 1954. 6. Y. H. Qian. Lattice BKG models for Navier- Stokes Equation. Europhys. Lett., 17:479-484, 1992. 7. Y. H. Qian, S. Succi, and S. A. Orszag. In D. Stauffer, editor, Annual Reviews of Computational Physics III, 195-242. 1995.
This Page Intentionally Left Blank
Parallel ComputationalFluid Dynamics Towards Teraflops, Optimizationand Novel Formulations D. Keyes,A. Ecer, J. Periaux,N. Satofukaand P. Fox (Editors) 92000 Elsevier Science B.V. All rights reserved.
131
Parallel Preconditioners for KKT Systems Arising in Optimal Control of Viscous Incompressible Flows* G. Biros a and O. Ghattas a aComputational Mechanics Laboratory Carnegie Mellon University, Pittsburgh PA, 15213, USA biros 9 cmu. edu, oghattas 9 cmu. edu
1. I N T R O D U C T I O N Recently, interest has increased in model-based optimal flow control of viscous fluids, that is, the determination of optimal values of parameters for systems governed by the fluid dynamics equations. For example, the objective could be minimizing drag on a solid body, and the controls might consist of velocities or tractions on some part of the boundary or of the shape of the boundary itself. Such problems are among the most computationally challenging optimization problems. The complexity stems from their being constrained by numerical approximations of the fluid equations, commonly the Navier-Stokes or the Euler equations. These constraints are highly nonlinear and can number in the millions for typical systems of industrial interest. The current state-of-the-art for solving such flow-constrained optimization problems is reduced sequential quadratic programming (RSQP) methods. General mathematical analysis of these methods [3,9] as well as CFD-related research [5,7] have appeared. In addition, parallel implementations of RSQP methods exhibiting high parallel efficiency and good scalability have been developed [6,10]. These methods essentially project the optimization problem onto the space of control variables (thereby eliminating the flow variables), and then solve the resulting reduced system using a quasi-Newton method. The advantage of such an approach is that only two linearized flow problems need to be solved at each iteration. However, the convergence of quasi-Newton based RSQP methods (QN-RSQP) deteriorates as the number of control variables increases, rendering large-scale problems intractable. The convergence can often be made independent of the number of control variables m by using a N e w t o n I a s opposed to quasi-NewtonIRSQP method. However, N-RSQP requires m linearized forward solves per iteration. The m linear systems share the same coefficient matrix; their right-hand sides are derivatives of the state equations with respect to each control variable. LU factorization of the linear system would be ideal here, but is *This work is a part of the Terascale Algorithms for Optimization of Simulations (TAOS) project at CMU, with support from NASA grant NAG-I-2090, NSF grant ECS-9732301 (under the NSF/Sandia Life Cycle Engineering Program), and the Pennsylvania Infrastructure Technology Alliance. Computing services were provided under grant number BCS-960001P from the Pittsburgh Supercomputing Center, which is supported by several federal agencies, the Commonwealth of Pennsylvania and private industry.
132
not viable for the large, sparse, three-dimensional, multicomponent forward problems we target. Instead, iterative solvers must be used, and N-RSQP's need for m forward solves per optimization iteration is unacceptable for large m. The need for forward solutions results from the decomposition into state and control spaces (range and null spaces of the state equations), and this can be avoided by remaining in the full space of combined state and control variables. This leaves of course the question of how to solve the resulting "Karush-Kuhn-Tucker" (KKT) full space system. For the large, sparse problems contemplated, there is no choice but a Krylov method appropriate for symmetric indefinite systems. How to best precondition the KKT matrix within the Krylov solver remains an important challenge, and is crucial for the viability of large-scale full space optimization methods. In this paper we propose a preconditioner for the KKT system based on a reduced space quasi-Newton algorithm. Battermann and Heinkenschloss [2] have also suggested a preconditioner that is motivated by reduced methods; the one we present can be thought of as a generalization of their method. As in reduced quasi-Newton algorithms, the new preconditioner requires just two linearized flow solves per iteration, but permits the fast convergence associated with full Newton methods. Furthermore, the two flow solves can be approximate, for example using any appropriate flow preconditioner. Finally, the resulting full space SQP parallelizes and scales as well as the flow solver itself. Our method is inspired by the domain-decomposed Schur complement algorithms. In these techniques, reduction onto the interface space requires exact subdomain solves, so one often prefers to iterate within the full space while using a preconditioner based on approximate subdomain solution [8]. Here, decomposition is performed into states and controls, as opposed to subdomain and interface spaces. Below we describe reduced and full space SQP methods and the proposed reduced space-based KKT preconditioner. We also give some performance results on a Cray T3E for a model Stokes flow problem. Our implementation is based on the PETSc library for PDE solution [1], and makes use of PETSc domain-decomposition preconditioners for the approximate flow solves. 2. R E D U C E D
SQP METHODS
We begin with a typical discretized constrained optimization problem, min x
f(x)
subject to
c(x) =
0,
(1)
where x are the optimization variables, f is the objective function and c are the constraints, which in our context are discretized flow equations. Using the Lagrangian s one can derive first and higher order optimality conditions. The Lagrangian is defined by A) :=
(2)
+
and the first order optimality conditions are 1
{
} ={
+
}=
(3)
1All vectors and matrices depend on the optimization variables x or the Lagrange multipliers X or both. For clarity, we suppress this dependence.
133 This expression represents a system of nonlinear equations. Sequential quadratic programming can be viewed as Newton's method for the first order optimality conditions. Customarily, the Jacobian of this system is called the Karush-Kuhn-Tucker (KKT) matrix of the optimization problem. To simplify the notation further, let us define: A := Oxc W := Oxxf + ~ i )tiO~ci g : - Oxf
Jacobian of the constraints, Hessian of the Lagrangian, Gradient of the objective function.
(4)
A Newton step on the optimality conditions (3) is given by A
0
px
- -
or
c
A
0
P~
)~+
- -
g
c
'
(5)
where p~ and p~ are the updates in x and ~k from current to next iterations and )~+ is the updated Lagrange multiplier. To exploit structure of the flow constraints, it is useful to induce a partition of the optimization variables into state xs and control (or decision) variables Xd. The above KKT system can be partitioned logically as follows:
As
Ad
A+
c
The current practice is to avoid solution of the full KKT matrix by a reduction to a lower dimension problem corresponding to the control variables. Such so-called reduced space methods eliminate the linearized state constraints and variables, and then solve an unconstrained optimization problem in the resulting control space. RSQP can be derived by a block elimination on the KKT system: Givenp4, solve the last block of equations for Ps, then solve the first to find )%, and finally solve the middle one for Pd. For convenience let us define
Wz := A T A s T W s s A s l A d -
ATAsTWsd-
WdsAslAd-f- Wdd ,
(7)
the reduced Hessian of the Lagrangian; Bz, its quasi-Newton approximation; and gz := g d - A T A s T g s ,
(8)
the reduced gradient of the objective function. The resulting algorithms for Newton (N-RSQP) and quasi-Newton (QN-RSQP) variants of RSQP are: 1. Initialize: Choose xs, Xd
2. C o n t r o l step: solve for p~ from Wzpd - - g z + ( W d s - A T A - j T W s , ) A - j l c BzPd = --gz
N-RSQP QN-RSQP
(9)
3. S t a t e step: solve for Ps from AsPs = -A,~p~ - c
(10)
134
4. A d j o i n t step: solve for )~+ from AT s A+ = - W s , p , A~T A+ = - g s
-
WsdPd
--
N-RSQP QN-RSQP
gs
(11)
5. U p d a t e : x~ = x ~ + p ,
(12)
X d --" X d + Pd"
The quasi-Newton method defined here is a variant in which second order terms are dropped from the right-hand sides of the control and adjoint steps, at the expense of a reduction from one-step to two-step superlinear convergence [3]. An important advantage of this quasi-Newton method is that only two linearized flow problems need to be solved at each iteration, as opposed to the m needed by Newton's method in constructing A ~ I A d as can been seen in (7). Furthermore, no second derivatives are necessary, since a quasiNewton approximation is made to the reduced Hessian. Finally, it can be shown that this method parallelizes very efficiently [10]. Unfortunately, the number of iterations required for a quasi-Newton method to converge increases with the number of control variables, rendering large-scale problems intractable. Additional processors will not help since the bottleneck is in the iteration dimension. In Newton's method, the convergence is often independent of the number of control variables m. However, the necessary m flow solves per iteration preclude its use, particularly on a parallel machine, for which iterative methods will be used for the forward solves. These solves can be avoided by remaining in the full space of flow and control variables, since it is the reduction onto the control space that necessitates the flow solves. Nevertheless, this also presents difficulties: exploitation of structure is much more difficult with the KKT matrix, which is over twice the size, highly indefinite, and contains scattered blocks, each having mesh-based sparsity structure. 3. F U L L S P A C E S Q P W I T H R E D U C E D
SPACE PRECONDITIONER
In this section we present a new preconditioner for KKT systems arising in full space SQP, based on reduced space quasi-Newton algorithms. The preconditioner recovers the small (constant) number of linearized flow solves per iteration of the quasi-Newton method, but permits the fast convergence associated with Newton methods. The preconditioner retains the structure-exploiting properties of RSQP, and parallelizes as well. To motivate the derivation, let us return to the reduced Newton method. As stated before, N-RSQP is equivalent to solving with a permuted block-LU factorization of the KKT matrix; this factorization can be also viewed as a Schur-complement reduction for the control space step Pd. The unpermuted form is given by
A,]
Wds Wdd AT 0
~--
o , ][AsAd o]
Wd,A-~~ I I
0
A TA-jT
0
0
0
W~
0
,
(13)
135 where Wy := W s d - W s s A s l A d . This block factorization suggests its use as a preconditioner by replacing W~ with B~. The resulting preconditioned KKT matrix, I 0 0
0 0] It,r~B~-1 0 0 I
(14)
would be the identity if B~ were equal to Wz. There are two main differences between this preconditioner and the one suggested in [2]. The first is that this preconditioner is based on an exact factorization of the KKT matrix, i.e. it is indefinite. The second is that it incorporates BFGS as a sub-preconditioner to deflate the reduced Hessian. The preconditioned KKT matrix is positive definite, with 2n unit eigenvalues and the remaining m determined by the effectiveness of BFGS. However, we still require four forward solves per iteration. One way to restore the two solves per iteration of QN-RSQP is to drop second order information from the preconditioner, exactly as one often does when going from N-RSQP to QN-RSQP. A further simplification of the preconditioner is to replace the exact forward operator As by an approximation As, which could be for example any appropriate flow preconditioner. With these changes, no forward solves need to be performed at each KKT iteration. Thus, the work per KKT iteration becomes linear in the state variable dimension (e.g. when A8 is a constant-fill domain decomposition approximation). Furthermore, when Bz is based on a limited-memory quasi-Newton update (as in our implementation), the work per KKT iteration is also linear in the control variable dimension. With an "optimal" forward preconditioner and the assumption that the B~ approximation is a good one, one would expect the number of KKT (inner) iterations to be insensitive to the problem size. Mesh-independence of SQP (outer) iterations would then lead to scalability with respect to both state and control variables. To examine the effects of discarding the Hessian terms and approximating the forward solver, we define two different preconditioners: 9 Preconditioner I: Wz = Bz, all of the Hessian terms in (13) discarded 2 (linearized) solves/iteration
[oo / lEa
ol
Preconditioner
0 I
I 0
ATA-j T 0
0 0
Bz 0
[ /
0 o]
Preconditioned KKT matrix
0 ATs
,
(15)
W T A - j 1 W~B-~ 1 0 W~sA-j 1 WyB-~ 1 I
9 Preconditioner II: W~ = B z , As = A,, Hessian terms retained in (13) no forward solves, Es := (A~-1 - As --1 ), Is := A s A : 1 Preconditioner ~s.A; 1 I I
0
A dA s 0
0
Preconditioned KKT matrix B,
OWyA
0 s
,
O(E~) O(E )
O(E~) + I'V~B-21 O(E~) O(E )
Zy
.(16)
136
4. R E S U L T S O N A N O P T I M A L F L O W C O N T R O L P R O B L E M The preconditioner was tested on a model quadratic programming problem (QP), that of a 3D interior Stokes flow boundary control problem. The objective is to minimize the L 2 norm of the velocity error given a prescribed velocity field, and the constraints are the Stokes equations: minimize ~1
(Uexact- u)2 d~
subject to: in gt in u = Uexact, on F/Fcontro1 u = ud, on Fcontro1. -vAu+Vp=f,
(17)
V.u=O,
Here, Uexact is taken as a Poiseuille flow solution in a pipe, and the control variables correspond to boundary velocities on the circumferential surface of the pipe. We discretize by the Galerkin finite element method, using tetrahedral Taylor-Hood elements. The minimum residual method (MINRES) is used for solving the resulting linear flow equations whenever A~ 1 is needed. To precondition the flow system, we apply a two-block diagonal matrix where the two blocks are domain decomposition approximations of the discrete Laplacian and discrete mass matrices, respectively. Our code is built on top of the PETSc library [1] and the domain decomposition approximations we use are PETSc's block-Jacobi preconditioners with local ILU(0). Both reduced and full space algorithms for the control problem have been implemented. Since the preconditioner we propose is indefinite, we use a quasi minimum residual (QMR) method that supports indefinite preconditioners [4]. The reduced Hessian is approximated by a limited-memory BFGS formula 2 in which we update the inverse of Bz. In this way, only a limited number of vectors need to be retained and both the update and the application of B~-1 involve only vector inner products. Numerical experiments that include scalability and performance assessment on a Cray T3E and comparisons with RSQP have yielded very encouraging results. A comparison between the different methods is presented in Table 1. A simple scalability analysis, depicted in Table 2, shows very good efficiency with respect to the optimization algorithm and the parallel implementation. On the other hand, the algorithmic efficiency of the forward preconditioner is not very good; we intend to remedy this with a more sophisticated preconditioner for As. We are extending the implementation to encompass Navier-Stokes flows. Since the problem is no longer a QP, issues of robustness and global convergence become crucial. We believe that a hybrid SQP method that combines QN-RSQP far from the minimum and full space N-FSQP (preconditioned by RSQP) close to the minimum will prove to be robust and powerful. The order-of-magnitude improvement in execution time and high parallel efficiency observed for the Stokes flow control problem encourage further development and application of the new KKT preconditioner. 2This is done only in QN-RSQP. For a QP problem, full Newton takes one iteration to converge so N-FSQP cannot create curvature information for B z. We have set B z = I in the KKT preconditioner.
137
Table 1 Performance of implementations of reduced and full-space SQP methods for a viscous flow optimal boundary control problem, as a function of increasing number of state and control variables and number of processors. Here, precond is the KKT preconditioner; N or QN iter is the number of optimization iterations; ]1 gz ]I is the Euclidean norm of the reduced gradient; and time is wall-clock hours on the Pittsburgh Supercomputing Center's Cray T3E-900. To prevent long execution times for QN-RSQP, the algorithm was terminated for the first four cases at 200 iterations, and for the last at 100 iterations. Similarly, N-FSQP was terminated at 500,000 KKT iterations for the two largest problems. In contrast, both preconditioned N-FSQP methods were allowed to completely converge to a reduced gradient norm of 10-6 in all cases. For this reason, the true performance of preconditioned N-FSQP is better than depicted in the table. Even with the more stringent tolerance, the new preconditioner improves wall-clock time by a factor of 10 over QN-RSQP. states controls method precond N or QN iter K K T iter II g z II time 21,000 3900 QN-RSQP 200 1 x 10 -4 3.6 (4 PEs) N-FSQP none 1 114,000 9 x 10 -6 2.5 N-FSQP I 1 25 9 • 10 -6 1.2 N-FSQP II 1 3,200 9 x 10 -6 0.3 QN-RSQP 200 4 x 10 -4 8.1 43,000 4800 (8 PEs) N-FSQP none 1 198,000 9 x 10 -6 4.4 N-FSQP I 1 29 9 • 10 -6 1.7 N-FSQP II 1 5,500 9 x 10 -6 0.7 QN-RSQP 200 2 x 10-3 12.8 86,000 6400 (16 PEs) N-FSQP none 1 376,000 9 x 10-6 9.0 N-FSQP I 1 28 9 x 10-6 2.4 N-FSQP II 1 8,200 9 x 10-6 1.0 167,000 12,700 QN-RSQP 200 3 x 10 -3 18.4 (32 PEs) N-FSQP none 1 500,000 8 x 10 -5 12.3 N-FSQP I 1 27 9 x 10 -6 2.7 N-FSQP II 1 11,100 9 • 10 -6 1.3 332,000 23,500 QN-RSQP 100 9 x 10 -3 11.0 (64 PEs) N-FSQP none 1 500,000 4 z 10 -4 13.0 N-FSQP I 1 28 9 x 10 -6 3.1 N-FSQP II 1 14,900 9 x 10 -6 1.7
Table 2 Isogranular scalability results for N-FSQP with Preconditioner I. Per processor Mflop rates are average (across PEs) sustained Mflop/s. Implementation efficiency (impl eft) is based on Mflop rate; optimization algorithmic efficiency (opt eft) is based on number of optimization iterations; forward solver algorithmic efficiency (]orw eft) is deduced; overall efficiency (overall eft) is based on execution time, and is product of all three.
PEs
Mflop//s/PE
Mflop//s
impl eft
4 8 16 32 64
41.5 39.7 38.8 37.2 36.8
163 308 603 1130 2212
1.00 0.95 0.92 0.87 0.85
opt eft forw eft 1.00 0.86 0.89 0.93 0.89
1.00 0.84 0.62 0.55 0.52
overall eft 1.00 0.68 0.51 0.45 0.39
138 ACKNOWLEDGMENTS We thank the authors of PETSc, Satish Balay, Bill Gropp, Lois McInnes, and Barry Smith of Argonne National Lab. We also thank David Keyes of ICASE/Old Dominion University, David Young of Boeing, and the other members of the TAOS project--Roscoe Bartlett, Larry Biegler, Greg Itle and Ivan MalSevid--for their useful comments. REFERENCES
1. S. Balay, W. D. Gropp, L. C. McInnes, and B. F. Smith. PETSc home page. http://www.mcs .anl .gov/petsc, 1999. 2. A. Battermann and M. Heinkenschloss. Preconditioners for Karush-Kuhn-Tucker matrices arising in the optimal control of distributed systems. In W. Desch, F. Kappel, and K. Kunisch, editors, Optimal control of partial differential equations, volume 126 of International Series of Numerical Mathematics, pages 15-32. Birkh~iuser Verlag, 1998. 3. L.T. Biegler, J. Nocedal, and C. Schmid. A reduced Hessian method for large-scale constrained optimization. SIAM Journal on Optimization, 5:314-347, 1995. 4. R.W. Freund and N. M. Nachtigal. An implementation of the QMR method based on coupled two-term recurrences. SIAM Journal of Scientific Computing, 15(2):313-337, March 1994. 5. O. Ghattas and J.-H. Bark. Optimal control of two- and three-dimensional incompressible Navier-Stokes flows. Journal of Computational Physics, 136:231-244, 1997. 6. O. Ghattas and C. E. Orozco. A parallel reduced Hessian SQP method for shape optimization. In N. Alexandrov and M. Hussaini, editors, Multidisciplinary Design Optimization: State-of-the-Art, pages 133-152. SIAM, 1997. 7. M. Heinkenschloss. Formulation and analysis of a sequential quadratic programming method for the optimal Dirichlet boundary control of Navier-Stokes flow. In W. W. Hager and P. M. Pardalos, editors, Optimal Control: Theory, Algorithms, and Applications, pages 178-203. Kluwer Academic Publishers B.V., 1998. 8. D.E. Keyes and W. D. Gropp. A comparison of domain decomposition techniques for elliptic partial differential equations and their parallel implementation. SIAM Journal on Scientific and Statistical Computing, 8(2):S166-$202, March 1987. 9. K. Kunisch and E. W. Sachs. Reduced SQP methods for parameter identification problems. SIAM Journal on Numerical Analysis, 29(6):1793-1820, December 1992. 10. I. Mal~evid. Large-scale unstructured mesh shape optimization on parallel computers. Master's thesis, Carnegie Mellon University, 1997.
Parallel ComputationalFluid Dynamics Towards Teraflops,Optimizationand NovelFormulations D. Keyes,A. Ecer, J. Periaux,N. Satofukaand P. Fox (Editors) 2000 ElsevierScienceB.V. All rightsreserved.
139
Parallel polynomial preconditioners for the analysis of chaotic flows in Rayleigh- Benard convection Edoardo Bucchignani and Alfonso Matrone a, Fulvio Stella b aC.I.R.A. Via Maiorise, Capua (CE), Italy bDip. di Meccanica e Aeronautica, Univ. di Roma "La Sapienza", Italy In this work a parallel fully implicit flow solver for incompressible Navier-Stokes equations, written in terms of vorticity and velocity, has been used for the study of Rayleigh - Benard convection in chaotic regime. The Bi-CGSTAB algorithm has been adopted for the resolution of the linear systems arising from discretization. Parallel polynomial preconditioners have been considered; they have been compared with the classical ILU preconditioner and with its parallel implementation (B-ILU). 1. I N T R O D U C T I O N In the last years, the development of robust fully implicit solvers allowed the study of complex unsteady flows and the analysis of the transition to chaotic and fully developed turbulent regime by means of direct numerical simulations. However, it is well known that implicit methods require high computational resources, in terms of memory and CPU time, especially when a chaotic flow is considered; in this case, in fact, a very fine computational grid must be used, in order to capture all the significant scales of the flow, from the largest (related to the dimensions of the geometry) to the smallest (determined by the viscosity). Therefore, the use of a multiprocessor machine is strongly recommended. The computational kernel of an implicit flow solver is the solution of linear systems of equations. As is well known, the use of a parallel implementation of an iterative method belonging to the Krylov subspace class (e.g. the Bi-CGSTAB algorithm [1]) has been shown to be very cost effective for the solution of systems arising from the incompressible Navier- Stokes equations. Even if often iterative methods can be used without preconditioning the linear systems of equations, the use of a preconditioning technique is, in all practical applications, essential to fulfil the stability and convergence requirements of the iterative procedure itself. The aim of a preconditioner is to convert the original linear system to an equivalent but better-conditioned system. Performances, in terms of parallel speed-up, are strongly affected by the choice of an effective parallel preconditioner. As already shown by many authors (e.g. [2]), polynomial preconditioners are attractive for parallel architectures. They are explicit preconditioners, as they directly approximate the inverse of the coefficient matrix. The aim of this work is to verify the efficiency of parallel polynomial preconditioners
140 when they are applied to the resolution of the incompressible Navier- Stokes equations. The physical problem considered is unsteady Rayleigh- Benard convection [3]. Parallel performances are shown and a comparison with the classical ILU preconditioner is discussed. Several temporal regimes and different routes to chaos have been recognized and a wavelet analysis has been performed, in order to evaluate the frequency content of each signal during the time history of the physical phenomenon. 2. P O L Y N O M I A L P R E C O N D I T I O N E R S
The aim of preconditioning is to transform the original system of equations
Ax = b
(1)
in an equivalent better conditioned system:
M-lAx = M-lb
(2)
where the matrix M, named a Preconditioner, is chosen so that
#(M-~A) << #(A)
(3)
where # is the condition number of the system. It is obvious that M has to be a good approximation of A and easy to invert in order to avoid increasing in the computational cost. Preconditioners can be classified either as implicit, if they approximate the matrix A, or as explicit, if they approximate the inverse of A. Some of the most used implicit preconditioning techniques are based on an incomplete factorization of the original matrix. The reason for their success is the excellent performance and the relatively easy implementation. Among these, the most famous are ILU, MILU, RILU, ILU(p) [4]. A limit of these algorithms is that they cannot reach good parallel performance due to their recursive nature. Moreover, the parallel approach requires some modification to the original algorithm, such as multicolor or blocked procedures [4,5] which increase the number of iterations to reach convergence with respect to the sequential version. Among the explicit preconditioners, those based on a polynomial approximation are the most employed. Although they date back to 1959 [6], only in recent years they have been successfully applied. This is motivated by their implicit parallelism, which allows to reach good parallel performance without revising the sequential version, keeping constant the number of iterations independently on the number of processors. Before describing the two polynomial preconditioners used in our numerical experiments, the following splitting of the matrix A must be considered: A = M-
N
(4)
with M non singular and p(M-1N) < 1, where p is the spectral radius. Therefore, the inverse matrix can be expressed as: A -1 = (I - M -1N) - 1 M -1
(5)
141 A polynomial preconditioner is defined on the basis of which polynomial approximation of (I - M - 1N) - 1 is used. Neumann Polynomial Preconditioner: truncated Neumann Series:
In this case, (5) is approximated by a
A-I'~(j~(M-1N)J) M - l = o
(6)
where n is the degree of the approximation. A good choice for M is the main diagonal of A. The matrix-vector products of (6) are calculated by means of the Hornet algorithm in order to avoid the computation of M. C h e b y s h e v P o l y n o m i a l P r e c o n d i t i o n e r : It is based on the approximation of the quantity ( I - M-1N) -1 by the polynomial:
rn(A) = ( 1 - Tn+I((A-1)/P))
(7)
where Tn+I(z) is the ChebyshevPolynomial of degree n + I and/5 satisfies the following inequality:
p(M-1N) <_fi < 1 Therefore, an approximation of A -1 is given by:
A -1 =rnM -1
(8)
By examining (6) and (8) it is clear that the impact of a Polynomial Preconditioner in a parallel version of a Krylov subspace method is limited to the matrix-vector products. It means that it is very easy to parallelize and it does not require modifications to the sequential algorithm. 3. T H E P H Y S I C A L OGY
PROBLEM
AND
COMPUTATIONAL
METHODOL-
The Rayleigh- Benard convection is a three-dimensional flow in a rectangular box, bounded by rigid impermeable walls and heated from below. Vertical walls are adiabatic, while horizontal surfaces are assumed to be isothermal and held at different temperatures. The study is conducted numerically by means of a parallel fully implicit solver for viscous flows, based on the vorticity velocity formulation [7] of the Navier-Stokes equations, assuming the Boussinesq approximation to be valid. Discretization conducted via a second order accurate finite difference technique on a regular Cartesian mesh [7,8] gives rise to a large sparse linear system of equations Ax = b for each time step, which is solved by means of a parallel implementation of the Bi-CGSTAB algorithm [1], associated with a polynomial preconditioner. In this work a domain 3.5.1.2.1 has been considered, filled with water at 730 C (Prandtl number Pr equal to 2.5). The other governing parameter (the Rayleigh number Ra) has been varied from 20000 to 50000. Discretization has been conducted on a grid 51.31.31, leading to systems of 372736 equations.
142
4. N U M E R I C A L
RESULTS
Numerical simulations have been executed on a SGI Origin 2000 supercomputer, installed at Mountain-View, California. It is a shared memory supercomputer with 128 R10000 processors and 8 GBytes of physical memory. Each RISC processor operates at 250 MHz and has a peak performance of 500 Mflops. The bandwidth is 1600 MByte/s. The numerical code has been written using MPI as the message passing programming paradigm. The sequential version of the polynomial preconditioners has been compared with the classical ILU, which has widely used for its good characteristics of robustness. The degree of Neumann and Chebyshev polynomials has been varied from 3 to 9, considering only the odd values, because as shown in [9], even values provide worse performance. Results are shown in Table 1. It can be observed that polynomial preconditioners allow a reduction in the number of iterations, but the computational cost of each iteration grows with the degree of polynomials. From a global point of view, the efficiency of polynomial preconditioners seems comparable with ILU. Parallel performances are shown in Table 2. The best polynomial preconditioners have been considered and compared with a Block ILU factorization [5]. The most important feature of polynomial preconditioners is that the number of iterations does not increase with the number of processors and therefore the parallel efficiency is considerably improved with respect to B-ILU, having (on 8 processors) about 67% with Neumann and Chebyshev, and only 26% with B-ILU. Table 1 Comparison among Preconditioners: Domain 3.5 x 1 x 2.1, Grid 51 x 31 x 31 Preconditioner ILU Neumann 3rd degree Chebyshev 3rd degree Neumann 5 th degree Chebyshev 5 th degree Neumann 7 th degree Chebyshev 7 th degree Neumann 9 th degree Chebyshev 9 th degree
N. of iter. 50 29 27 20 20 18 17 15 15
Global Time (sec.) 213 200 189 226 228 262 250 282 283
Time per iter.(sec.) 4.3 6.9 7 11.3 11.4 14.5 14.7 18.8 18.9
5. P H Y S I C A L R E S U L T S A N D W A V E L E T T R A N S F O R M
The simulation executed at P r - 2.5 and R a - 20000 starting from rest, gives rise to a steady configuration, shown in Figure 1, named a s o f t - r o l l . Increasing R a causes the flow to become unsteady; for example, at R a - 50000, a complex temporal behaviour was observed (Figure 2). This signal is unsteady in the sense that the frequency spectrum varies in time; the FFT is not able to represent this signal, because the frequency content is not constant. Therefore, a wavelet analysis has been used. Wavelets [10] are mathematical
143 Table 2 Comparison of Parallel Performance using various Preconditioners Proc. 1 2 4 8 16
n. of it. 50 73 79 91 97
B-ILU time (sec.) 213 197 110 102 88
Neumann 3rd deg. n. of it. time (sec.) 29 200 29 113 29 59 29 37 29 30
Chebyshev 3rd deg. n. of it. time (sec.) 27 189 27 105 27 57 27 35 27 30
functions that cut up data into different frequency components; each component can be studied with a resolution matched to its scale. Given a main wavelet function r (the mother wavelet), it is possible to obtain a base for representation simply by scaling and translating r The wavelet transform of the above signal is shown in Figure 3. It can be observed that the chaotic regime (more than three frequencies) (samples between 0 and 4000 - between 10000 and 16000) is replaced by a biperiodic regime for a limited time interval (samples between 4000 and 10000). The biperiodic regime causes a modification of the soft-roll structure, which is replaced by a configuration made up by a unique large roll. As a consequence, the values and scaling laws for heat exchange are strongly modified. The Nusselt number (Nu) provides a measurement of the heat flux from a hot to a cold wall. Nu is a function of Ra and it strongly depends on the flow configuration. Figure 4 shows Nu as a function of Ra for the test case considered. A sudden variation of Nu at Ra = 50000, related to the mentioned change in the configuration, is highlighted.
6. C O N C L U S I O N S Polynomial preconditioners are very attractive because, in the test case conducted, their efficiency is comparable with ILU, and because they are easily parallelizable. In fact, the number of iterations does not increase with the number of processors employed. The time required for the study of the Rayleigh-Benard convection has been significantly reduced. Finally, the wavelet transform represents a powerful tool for the analysis of signals whose frequency content varies in time. It allowed us to better understand the transition mechanism to chaos.
ACKN OWLED GEMENT S The authors would like to thank: - Dr. Fabrizio Magugliani of SGI ITALIA (Milan) for the help supplied in the execution of numerical tests. - Dr. Emma Trencia of University of Naples for contributions to the development of the present work.
144
I
2.5
I
I
I
I
.~
I
1.5 N
1 0.5 _~,
-0.5
l
~ -
I
0
0.5
I
I
1
1.5
I
X
I
2
2.5
I
3
/ -I
3.5
Figure 1. P r - 2.5, R a - - 20000: Isolines of the vertical component of velocity in the horizontal middle plane. A fully three-dimensional configuration ( s o f t - r o l l ) made up of two horizontal rolls whose axes are not parallel is observed
60
!
!
i
40 30 20 D
lO o
-lO -20 -30
3
5
7
9
11
t
13
15
17
19
21
Figure 2. P r - - 2.5, R a - - 50000: Time history of the component of velocity along x at the point (0.7,0.7,0.7). Three clearly distinct zones are observed: 0. < t < 9. (non periodic flow), 9. < t < 15. (quasi periodic flow with two incommensurate frequencies and t > 15. (non periodic flow)
REFERENCES 1. H. van der Vorst, Bi-CGSTAB, a fast and smoothly converging variant of Bi-CG for
145
2. 3. 4. 5.
6. 7. 8. 9.
10.
the solution of nonsymmetric linear systems, SIAM J. Sci. Stat. Computing 13(2), 631-644, (1992). Y. Sand, Practical use of polynomial preconditionings for the conjugate gradient method, SIAM J. Sci. Stat, Comput. 6(4), 865-881, (1985). F. Stella, E. Bucchignani, Rayleigh- B~nard convection in limited domains: Part 1 Oscillatory flow, Numerical Heat Tr. Part A 36, 1-16, (1999). A.M. Bruaset, A survey of preconditioned iterative methods, Longman Scientific & Technical, (1995). F. Stella, M. Marrone, E. Bucchignani, A Parallel Preconditioned CG Type method for Incompressible Navier-Stokes Equations, in Parallel Computational Fluid Dynamics: New Trends and Advances, Ecer A. et al. (Ed.), 325-332, Elsevier, Amsterdam, (1993). H. Rutishauser, Theory of gradient methods, in: Ref. iter. meth. comput, solution and eigenv, of self-adjoint boundary value problems, 24-49, Birkhauser Verlag, (1959). G. Guj, F. Stella, A Vorticity-Velocity Method for the Numerical Solution of 3D Incompressible Flows, J. Comput. Phys. 106, 286-298, (1993). F. Stella, E. Bucchignani, True transient vorticity-velocity method using preconditioned Bi-CGSTAB, Numerical heat transfer Part B 30, 315-339, (1996). C. Schelthoff, A. Basermann, Polynomial Preconditioning for the conjugate gradient method on massively parallel systems. Forschungszentrum Julich Gmbh, Internal report KFA-ZAM-IB-9423, (1994). A. Graps, An introduction to Wavelets, IEEE Computational Science and Eng. 2(2), (1995).
146
9
.
:
:
i
i
i
:
9
,,
i
i
i
)
:
:
:
:
}
!
:
:
i
.
.
:
:
~ :
.
!
!
!
!
: ............
" ............
~ ............
z ............
:
i
i
i
.: i
i
i
i
i
~-
"
:
:
:
:
:
.
.
.
~ ' ~'"
i
i
Pr
!
" . ~. L x i~-..
!
!
!
i
!
!
!
"
"
"
9
!
!
!
!
i
"
!
i
i :
: !
: :
! :
i
i
i
i
i
,
,
,
i
i
i
i
i
5000
Figure 3.
!
:
:
,
i
0
i
:
: ...--L..-b-.
i
i
.......:i........... i . . . . ~ 9
.,,,H~,,,~-,:~-~-~v
..........
.
"
= 2.5,
Ra
1oooo
8amplea
= 50000:
15000
i
Wavelet transform of the previous signal
3.6
I
1
...................................................................... i.................i....] ~~ .................................. i......................
~~
i~. ....
'i"................. ~,..... :........... !................. i................. i....
................i.................i.................i.................i.................ii 25000
Figure 4.
30000
35000
Nusselt number vs.
Transition from
40000 Ra
45000
50000
Rayleigh number for a domain 3.5- 1 - 2 . 1 ,
soft-roll to 1L configuration
Pr
-
2.5"
Parallel ComputationalFluid Dynamics Towards Teraflops, Optimizationand Novel Formulations D. Keyes, A. Ecer, J. Periaux, N. Satofukaand P. Fox (Editors) 2000 Elsevier Science B.V. All rights reserved.
147
An Object-Oriented Software Framework for Building Parallel Navier-Stokes Solvers Xing Cai, a* Hans Petter Langtangen bt and
Otto Munthe b*
aDepartment of Informatics, University of Oslo, P.O. Box 1080, Blindern, N-0316 Oslo, Norway bDepartment of Mathematics, University of Oslo, P.O. Box 1053, Blindern, N-0316 Oslo, Norway We present an object-oriented framework in Diffpack for implementing different NavierStokes solvers. The framework strongly promotes flexible and extensible coding. Moreover, parallelization becomes systematic and easy using the object-oriented programming techniques. A concrete fast Navier-Stokes solver based on finite element methods is presented together with some numerical experiments. 1. I N T R O D U C T I O N The Navier-Stokes equations are fundamental when modeling viscous fluid flow. Naturally, it is essential to devise stable, accurate and etticient methods for the numerical solution of these equations. As a result of intensive research effort in this field, a wealth of different solution methods has already been proposed by scientists and engineers. Most of these methods fall into one of the following main categories: fully implicit strategies, various types of artificial compressibility, discrete and continuous operator splitting, solenoidal approaches and pressure filtering techniques; see e.g. references in [9]. It is well known that the suitability of a particular numerical method for a given flow problem depends strongly on issues such as treatment of boundary conditions, accuracy requirement, stability, efficiency etc, so it is desirable for an engineer to have access to as many solution methods as possible when investigating a new problem. It is also of importance to be able to easily switch between different solution methods due to different considerations. And more importantly, the engineer should be allowed to modify and extend the existing methods to handle special features of a flow problem. To satisfy all the above requirements, we have developed a unified, flexible and extensible framework for implementing numerical Navier-Stokes codes. The fundamental software engineering principle is object orientation, which is realized by using the programming language C + + . With this object-oriented approach, we have been able to gain advantages over conven*These two authors are supported by the Research Council of Norway under Grant 110673/420 (Numerical Computations in Applied Mathematics).
tThis author has received partial support provided by the Strategic University Programme General Analysis of Realistic Ocean Waves funded by the Research Council of Norway.
148
tional procedural programming languages with respect to code design, implementation and maintenance. 2. M A T H E M A T I C A L M O D E L The incompressible Navier-Stokes equations can be written as follows. U i , t -~- U j U i , j
j,j
--
=
1 --P,i P o.
n t- l ] U i , j j n t-
bi ,
(1) (2)
Here, u and p are the primary unknowns representing the fluid velocity and pressure distribution. In addition, p, u and bi denote the density, kinematic viscosity and external body forces, respectively. Indices after the comma separator denote partial differentiation. The system (1-2) is to be supplemented with suitable boundary and initial conditions; see
[9]. 3. A N O B J E C T - O R I E N T E D
FRAMEWORK
For most numerical methods solving the incompressible Navier-Stokes equations, the solution process can be decomposed into sub-steps, where each sub-step solves one or several partial differential equations (PDEs). Very often some PDEs arising in different sub-steps are of the same type. If there exists an object-oriented simulator for solving one particular type of PDE, then it can be easily modified to be used in those sub-steps. This strongly promotes code re-use. To take full advantage of object-oriented programming, we propose a unified framework for implementing flexible and extensible Navier-Stokes solvers. Figure I gives a structural overview of this framework whose three inter-connected components are: I. NsPrms that contains common parameters shared by all Navier-Stokes solvers; 2. NsBase that is a base class defining a generic interface of numerical algorithms; 3. NsCase that is a base class giving a generic description of problem-specific data, e.g. boundary conditions. We note that this structure separates numerics from problem specific data, so that a concrete Navier-Stokes solver, say SomeSolver, can be devised as a subclass of NsBase, without specific knowledge of a particular Navier-Stokes problem. Likewise, different concrete Navier-Stokes problems can be defined easily as subclasses of NsCase, independent of numerical solution algorithms. Diffpack [6,8] is an object-oriented scientific computing environment with emphasis on numerical solutions of PDEs. Many prototype modules for solving common types of PDEs are already available in Diffpack. It is therefore natural to introduce the above-mentioned framework into Diffpack. For achieving a unified, flexible and extensible framework, we have paid special attention to the following issues: 1. Unified implementation
of different base PDE
2. Unified organization of common
solvers;
data representations;
3. Separate administration of problem-dependent data; 4. Separate modules for flow conditions.
149
NsCase SomeCase
----4 NsBase ~-
NsPrms
JsomeSolverj
Figure 1. An object-oriented implementation framework for Navier-Stokes solvers. The solid arrows mean "is-a" relationship, while the dashed arrows mean "has-a" relationship.
Inside this object-oriented framework over ten different Navier-Stokes solvers have already been implemented in a unified way, we refer to [9] for a detailed description. 4. P A R A L L E L I Z A T I O N
The CPU-intensive parts of any standard Diffpack simulation are the discretization, which gives rise to linear systems of equations, and the solution of those linear systems. Parallelizing the discretization is straightforward, provided that there exists a balanced partition that decomposes the global computational grid into a collection of subgrids. Very often, construction of the sub-systems, which make up the virtual global linear system, can be clone entirely independently on individual processors. Parallelizing the linear system solution process is however more demanding, because different processors need to work cooperatively and exchange nodal values between neighboring subgrids. Iterative solution methods, which are used exclusively in solving large scale linear systems, consist mainly of three types of operations: vector addition, inner-product between two vectors and matrixvector product. While vector addition is intrinsically parallel, parallelization of the other two operations can be realized by first invoking local sequential operations restricted to the subgrids and then adding up or updating the sub-results. Therefore, parallelization at the linear algebra level (LAL) can be achieved by adding communication routines necessary for parallel inner-product and matrix-vector product. Taking advantages of object-oriented programming, we have been able to organize the parallelization-related codes into add-on libraries. When parallel linear algebra operations are desired during the solution of a linear system, necessary communication routines will be invoked automatically after the sequential operations are carried out restricted to subgrids. This is achieved by attaching an object of the main interface class GridParthdm to the standard sequential Diffpack linear algebra libraries. Class GridParthdm offers diverse inter-processor communication routines and has its simplified definition as follows. class GridPartAdm : public SubdCommAdm
{
//... public: GridPartAdm ();
150 virtual virtual virtual virtual virtual virtual
};
~GridPartAdm (); void prepareSubgrids (const GridFE& global_grid); void prepareCommunication (const DegFreeFEa dof); void updateGlobalValues (LinEqVectora lvec); void updateInteriorBoundaryNodes (LinEqVectora lvec); void matvec (const LinEqMatrix~ Amat, const LinEqVectora c, LinEqVectora d); virtual real innerProd (LinEqVectora x, LinEqVectora y); virtual real norm (LinEqVectora lvec, Norm_type lp=12);
In above, we note that SubdCommAdm is a dummy base class. Its purpose is to give generic declarations of the communication-related routines, whose definitions are left empty. In this way, it is possible to maintain separately the huge sequential Diffpack linear algebra libraries and the small add-on parallelization-related libraries. When Diffpack users want to incorporate LAL parallelism, the add-on libraries can be connected seamlessly; see [2]. Class GridPartAdm also contains useful functions for partitioning a global computation grid (prepareSubgrids) and building internal data structures for later inter-processor communications (prepareCommunication). We mention that all the communication-related functions of GridPartAdm have a high-level user interface, thus allowing the user to concentrate on the numerics, instead of low-level message passing statements in MPI. Using GridParthdm, parallelization of a particular sequential NavierStokes solver becomes a simple and efficient process. In reality, it suffices to add only a few lines of code, which call the necessary communication functions, to the original sequential solver. This extra amount of code is typically 1-2% of the total simulation code (whose size is of the order several hundred lines, including I/O and user interfaces, when utilizing Diffpack). The resulting parallel solver has good parallel efficiency and great flexibility. For example, the user is able to choose, at run-time, an arbitrary number of processors to solve a problem on an unstructured grid. It should also be mentioned that parallelization can be done at a higher level than LAL. We proposed in [1] a so-called simulator-parallel (SP) parallelization approach for this purpose. Roughly speaking, overlapping Schwarz methods (see e.g. [5,11]) allow re-using an existing sequential simulator as a whole in a globally coordinated solution process; see [1,4]. Although the SP approach has not yet been used in parallelizing Navier-Stokes solvers, our earlier experience in using the SP approach in other CFD applications (see [1,3]) indicates that a structured parallelization following the SP approach can be done. 5. A N E X P L I C I T
FINITE ELEMENT
NAVIER-STOKES
SOLVER
As an example of utilizing our object-oriented framework, we consider the implementation of a fast finite element (FE) Navier-Stokes solver. Mathematically, the numerical scheme uses the technique of operator splitting and consists of the following sub-steps at each time level (more details can be found in [8,10]): =
~n _~__ Un + k~1),
(3)
(4)
151
k}
=
ui
,
=
+ 1
~72p n+l
-
P-P--U*
U n-I-1
-
(5)
-
+ k~2) ),
zxt ~'~' At --- U~ ~__ (p,n+l
(6) (7) (8)
pbi)"
In above, we notice that Equations (3,5,8) are of the same explicit updating type. Assume this type of computations can be handled by class E x p l i c i t . Then it is convenient to derive a new class P r e d i c t o r as subclass of E x p l i c i t for treating Equations (3) and (5). Similarly, another subclass with name C o r r e c t o r can be derived for Equation (8). Implementation is very simple in both P r e d i c t o r and C o r r e c t o r , because one only needs to incorporate the different right-hand sides. For Equation (7), FE discretization will be used. Diffpack already has an existing base class FEM for this, so it suffices to derive a new subclass P r e s s u r e from FEM. Compared to the standard Diffpack FE solvers [8], class P r e s s u r e is optimized to avoid re-assembly of the linear system associated with the pressure equation at each time level [8, ch. 6.5]. The overall solution process is administrated by F a s t S o l v e r , which is a subclass of gsBase. Note that the control from F a s t S o l v e r to P r e d i c t o r , C o r r e c t o r and P r e s s u r e is represented by the dashed arrows in Figure 2.
NsBase I FastSolver / Pressure I V FEM
Predictor /
[ Corrector /
Explicit
Figure 2. The structure of a fast FE Navier-Stokes solver.
6. A T E S T C A S E In this section we consider a so-called "vortex-shedding" test case for the fast FE Navier-Stokes solver. The 2D solution domain is depicted in Figure 3a, where the hole in
152
the domain represents a cylinder standing vertically in a stream coming from left with a constant velocity g* = (u~, 0). The computational grid is unstructured and has relatively high grid resolution in the region around the cylinder. Figure 3b shows an example of grid partition where the original domain is divided into 6 subdomains. The partition is done by using the METIS [7] package. Here we try to distribute the elements evenly among the subdomains. The reason for different area sizes covered by different subdomains is the uneven grid resolution of the original grid. We run the simulation between t = 0 and t = 60. The initial conditions are p = 0 and ff = if* throughout the domain. For the boundary conditions, we have ff = g* on the upper, lower and left boundaries, while u~,n = 0 applies on the right boundary. Due to the existence of vortices in the stream, the cylinder receives oscillatory forces as Figure 3d gives a snapshot of solution p. The Reynolds number is 175. Figure 3c is a snapshot of solution p before vortex arises. Interested readers may like to view two mpeg films showing the animation of the pressure and velocity fields. The W W W addresses are http://www, if i. uio. no/-xingca/DR/NS_pressure .mpg and http ://www. if i. uio. no/~xingca/DR/NS_velocity, mpg, respectively.
(a)
(b) ............... !:-.:i...::i ~ .
.iii.i.i.iiiiiiiiii iiii.iiiiiiiiiiiiii.i.i.iiiiiiiiiiiiii_iiiil , il (c)
(d)
Figure 3. A vortex-shedding example; (a) the solution grid, (b) partition of the solution domain into 6 subdomains, (c) contour plot of p at t = 4.4, (d) contour plot of p at t = 60.
In Table 1 we list some CPU-measurements of the fast FE Navier-Stokes simulator when different number of processors P are in use. The measurements are obtained on an SGI Cray Origin 2000 machine with 195MHz R10000 processors. We note that the relatively poor speedup results are due to the relatively small number of grid points. As
153
P increases, the overhead of communication and synchronization becomes visible.
Table 1 Some CPU-measurements of the fast FE Navier-Stokes simulator. I pl cpg Speedup Efficiency 1 1418.67 N/A N/A 2 709.79 2.O0 1.00 3 503.50 2.82 0.94 4 373.54 3.80 0.95 6 268.38 5.29 0.88 8 216.73 6.55 0.82
7. C O N C L U D I N G R E M A R K S We have presented an object-oriented framework for implementing flexible and extensible Navier-Stokes solvers. The framework promotes structured coding and results in extensibility and flexibility. It also makes development and debugging significantly simpler. Incorporation of parallelism into linear algebra operations of such Navier-Stokes solvers becomes a simple process, using object-oriented programming techniques. More specifically, the user needs to develop a sequential Navier-Stokes solver in the standard way of Diffpack. Then it suffices to add only a few lines of code invoking the parallelizationrelated communication. The communication routines are offered by small add-on libraries. The sequential and parallel versions of the simulator share most of the code and files, but the parallel version can be physically separated from the original code in form of subclasses, when desired. As for future work, we will also incorporate overlapping Schwarz methods into the parallelization. This requires the use of the SP parallelization approach explained in [1,2]. In addition, effort is needed to reduce diverse overheads in the add-on parallelization libraries, so that highest possible parallel efficiency can be achieved for the resulting parallel Navier-Stokes solvers. ACKNOWLEDGMENT We acknowledge support from the Research Council of Norway through a grant of computing time (Programme for Supercomputing). REFERENCES
A. M. Bruaset, X. Cai, H. P. Langtangen, A. Tveito, Numerical solution of PDEs on parallel computers utilizing sequential simulators. In Y. Ishikawa et al. (eds): Scientific Computing in Object-Oriented Parallel Environment, Springer-Verlag Lecture Notes in Computer Science 1343, 1997, pp. 161-168.
154 2. X. Cai, Two object-oriented approaches to the parallelization of Diffpack. Proceedings of HiPer '99 Conference, pp. 405-418. 3. X. Cai, Numerical Simulation of 31) Fully Nonlinear Water Waves on Parallel Computers. In B. Ks et al. (eds): Applied Parallel Computing, PARA'98, SpringerVerlag Lecture Notes in Computer Science 1541, 1998, pp. 48-55. 4. X. Cai, K. Samuelsson, Parallel multilevel methods with adaptivity on unstructured grids. Preprint 1998-5, Department of Informatics, University of Oslo. 5. T.F. Chan, T. P. Mathew, Domain decomposition algorithms. Acta Numerica (1994), pp. 61-143. 6. Diffpack World Wide Web home page, http:////www.nobjects.com. 7. G. Karypis, V. Kumar, METIS: Unstructured graph partitioning and sparse matrix ordering system. Department of Computer Science, University of Minnesota, Minneapolis/St. Paul, MN, 1995. 8. H.P. Langtangen, Computational Partial Differential Equations- Numerical Methods and Diffpack Programming. Springer-Verlag, 1999. 9. O. Munthe, H. P. Langtangen, Finite elements and object-oriented implementation techniques in computational fluid dynamics. URL: h t t p : / / ~ r c . m a t h . u L o . n o / ottom/ns_paper.ps. Submitted, 1999. 10. G. Ren, T. Utnes, A finite element solution of the time-dependent incompressible Navier-Stokes equations using a modified velocity correction method. Int. J. Num. Meth. Fluids 17, pp. 349-364, 1993. 11. B. F. Smith, P. E. Bjorstad, W. D. Gropp, Domain Decomposition, Parallel Multilevel Methods for Elliptic Partial Differential Equations. Cambridge University Press, 1996.
Parallel Computational Fluid Dynamics Towards Teraflops, Optimization and Novel Formulations D. Keyes, A. Ecer, J. Periaux, N. Satofuka and P. Fox (Editors) 9 2000 Elsevier Science B.V. All rights reserved.
155
P a r a l l e l N A S 3 D : A n Efficient A l g o r i t h m for E n g i n e e r i n g S p r a y Simulations using LES D. Caraeni, C. Bergstrom and L. Fuchs a aLund Institute of Technology, Fluid Mechanics Department, Ole Romers Vag 1, Box 118, SE 22100, Lund, Sweden,
[email protected] A new parallel flow-solver, ParNAS3D, has been developed at Lund Institute of Technology for simulations of compressible flows in aeronautical and complex industrial applications. The code employs an efficient algorithm, based on the Residual Distribution Scheme approach, well suited for parallelization. The ParNAS3D solver has been coupled with an independent spray-module for LES of spray in compressible flows.The combined LESspray code has been used to investigate the injection, mixing and evaporation processes of a fuel spray in a gas turbine engine. This is the first report on the implementation of the ParNAS3D parallel algorithm. We present also the numerical algorithm employed in our code. 1. I N T R O D U C T I O N Parallel computing has been a trend in the aerospace industry already for more than a decade. At the Division of Fluid Mechanics at LTH, we recently developed a flowsolver for numerical aeronautical simulations, which proved to have very good parallel performance. The code uses PVM and is intended to run on both heterogeneous and homogeneous distributed memory architectures. It is written in the C language and has a modular structure which permits one to plug-in independent modules, such as the spray module for instance. A brief description of the numerical algorithm is given, additional information can be found in [6]. Here we mainly report on the parallel algorithm employed in our code. 2. F L O W S O L V E R A L G O R I T H M
Residual Distribution (RD) or Fluctuation Splitting (FS) schemes date back to the late 80's [1], when Roe developed schemes for solving scalar advection-diffusion problems on unstructured meshes. More recently (1996), van der Weide and Deconinck [2][3] proposed a matrix generalization of the scalar schemes which can be applied to the solution of non diagonalizable hyperbolic systems. Wood showed in a 1998 paper [4] that RD schemes give a better accuracy of the solution, for advection diffusion problems, if we compare with the classical finite-volume formulations. Though most applications of RD schemes have been limited so far to steady state computations, we extended their application to accurately simulate time dependent flows. ParNAS3D uses a second order in time,
156
implicit scheme of Jameson-type [7]. This is a dual-time step scheme, with sub-iterations to converge the solution at every new real-time step. Multigrid iterations are performed to accelerate the convergence of the solution in these sub-iterations. The code works on an originally developed unstructured mesh, cell-vertex data structure. A hierarchical oct-tree organization of the data was employed with a single tetrahedron division rule. This was shown to allow significant reductions in memory usage [5]. The discretization of the convective, diffusive and source terms for the Navier-Stokes equations is based on the Residual System-Distribution schemes approach [2]. Notations: U = (p, puj, pE) is the conservative variables vector, j = 1..d, is the index of Cartesian coordinate, where d is the number of space dimensions, AT is the pseudotime step in Jameson's scheme, which is computed as maximum local time step given by the CFL condition. T is one of the cells (tetrahedral-cells in 3D) sharing the node i, i = 1..(d + 1) is the node index and Vi is the volume of the dual cell, ~ , surrounding the i node,
V~ = ~
1
d +------~VT
(1)
TEfii
n - is the real time step index. Two different update schemes have been implemented so far in ParNAS3D, for time accurate simulations: U ? +1'k+1 =
U?+l'k- AU? +l'k
(2)
Scheme A:
AUn+I, k
1 VT. (~U (Tn~i)_t-- ~ d+ 1 [--~-1
AT
T
[~i,LW'
TEfii
T
+
1
(Ffnj#)] }
(3)
Scheme B: A U~__+l,k ~ = AT
+
0U
) + 1
(4)
TE~ti T Here, U~+l'k is the approximation of V~+1, at the pseudo-time step k, /3~,(...) is the modified matrix-distribution coefficient for the convective term, r T is the advectiveresidual over the cell T,
(5)
Here F ; - F;(U '~+l'k) is the advective flux vector and F ; - F;(U n+l'k) is the viscous flux vector, both computed for U ~+l'k 9 ~T i,(...)r T is the contribution of the advective part of the Navier-Stokes equations to the nodal residual. A central-Galerkin scheme has been used to discretize the viscous part and the source terms (i.e. Coriolis forces, centrifugal forces, mass and heat sources, forces coming from the spray, etc.). 1(F~ nj,i ) is the contribution of the viscous part of the Navier-Stokes equations to the nodal residual, where nj# are the cell-face inward normal vectors. [~tu](Tna~) and [_~](T)0Uare the volume averages of the unstationary term, on the (T N 9t~) and (T) respectively.
157
For the spatial discretization, the scheme A uses a central finite volume discretization while the scheme B uses the same upwind residual distribution scheme, i.e. the Low Diffusion A (LDA) scheme, as for the convective term. Both formulations are second order accurate for smooth flow situations, i.e. no shocks are present. The residual distribution coefficients given by the second order Lax-Wendroff (LW) scheme: 1
d
~i'TLw' = ( d + 1 + -~jvceuKi(~-~. IKj[) -1)
(6)
jET
Here Ycdl is the cell-CFL number, taken to be g4,which recovers the upwind finite volume formulation of Giles. Note that the distribution coefficients for the LW scheme satisfies the invariance condition and gives better accuracy, as it introduces a matrix time step instead of a scalar time step. When doing LES we reduced the value of v~dt to a small value (typically 0.01 of the original value), to reduce the artificial dissipation of the scheme. For the upwind LDA scheme, the distribution coefficients are: T
--(E K?)-1 K?
(7)
jET
No modifications of the original distribution coefficients seems to be needed in order to assure the second order accuracy and a reduced artificial dissipation of the scheme, in this case. For monotone shock capturing, both schemes (A and B) employs a dynamic correction of the artificial dissipation of the scheme based on the local flow characteristics. In the shock wave region the artificial dissipation of the scheme is enhanced, using a continuous blend between the second order scheme(s) and a positive first order scheme (i.e. the Narrow scheme. Monotone shock capturing is then possible, while still having second order accuracy in the smooth regions. The modified matrix-distribution coefficients, ~T i,mod are given by: /3i,mod T
-
(1
~ OLblend)" i,2nd -Jr- OLblend "
i,lst
where ~i,2nd T are either the ~Ti , L W or order, Narrow scheme:
(s)
T ~i,nDA coefficients. Formally, for the positive first
-(E K;) -1 E K;Uj) ~T i,lst --
J
J T
(9)
r
The Jacobians =
or;
Ki, are computed as: (10)
We define Ri as the right eigenvectors matrix of the matrix K~, A~ as a diagonal matrix of eigenvalues of the same matrix, and Li = Ri -1. Since the Euler system of equations is hyperbolic, the matrices Ri, Li, Ai will be real. Then, the matrices K +, Ki- are defined as~
1R~A+L~
(11)
158 1
K~-= -~R~A:~L~
(12)
A+ contains the positive and A~- contains the negative eigenvalues of A~,
2
(13)
The value of OLblend is corrected dynamically. It takes a small value in regions far from flow field discontinuities. 3. S P R A Y M O D E L L I N G In our LES simulations of sprays, the two-phase flow problem is studied using an Eulerian-Lagrangian approach. Thus, parcels containing droplets with the same fluid dynamical properties (dimension, velocity, temperature, etc.) are tracked as in the usual Lagrangian approach, and the mass, momentum and heat transfer are explicitly computed, on a per-parcel basis. Droplets below a pre-specified size are assumed to vaporize instantaneously and to become fully mixed in the gas phase. To model the atomization process and the droplets break-up, a combination of the Reitz and Taylor Analogy Breakup (TAB) models have been used. Sub-models for droplet collision and evaporation have also been included. 4.
SGS
MODELS
In the Large Eddy Simulation technique, the largest scales of the turbulence are resolved and the effect the smallest scales have on the resolved scales, is modeled. This is done by introducing a Sub-Grid Scale (SGS) model. In the present code, the Smagorinsky model, the LDM (Lilly's Dynamic Model) and the Dynamic Divergence Model (DDM)[8] can be used. 5. PARALLEL ALGORITHM
Large Eddy Simulations for engineering applications are typically extremely expensive, requiring huge resources in terms of processor power and memory, and long computational time. When using large parallel computations, LES becomes accessible for many industrial applications. Four design goals have been chosen for the parallelization of the flow solver, NAS3D: - The code should run as effectively on large distributed memory machines (Cray T3E, IBM SP2...), on the SGI Origin 2000 servers or on heterogenous collection of platforms. - The code must be able to communicate, by using message passing, with a third party library, written in a different language, i.e. FORTRAN. These goals have made the PVM standard, a natural choice for ParNAS3D. The PVM 3.4 version has been used in the current implementation. - Good load balancing. ParNAS3D takes advantage of its unstructured grid for realizing an almost perfect load balancing by assigning a number of volumes to compute to each processor, proportional with the processors speed.
159 Good parallel speed-up, even when running with some additional (exterior) modules. This basically means a low ratio for communication time versus computational time and no processor time-out. The parallelization of the code has been done using the SPMD paradigm, i.e. the same code is executed in all processors. The global data structure, i.e. grid, field variables, etc. is decomposed into a number of subdomains, equal with the number of running tasks. While any algorithm for domain decomposition can be employed, so far only one dimensional domain decomposition has been used. This is done in a separate preprocessing step and does not affect the solvers flexibility. Each task runs with its own data structure, which is a disjunct part of the global data structure. Because the LES simulation of the gas-phase is computationally very expensive, a large number of tasks have been used for running the flow solution algorithm, as described in the previous paragraph. At the same time, only one task is used for running the spray algorithm and for postprocessing. Figures 1 to 7 describe schematically the parallel algorithm implemented in the combined LES-spray code. One of the biggest advantages of the fluctuation splitting schemes, for Euler and NavierStokes equations, is that it is well suited for parallelization, when compared with the classical finite-volume schemes. This is because the stencil for the update of a vertex-node is extremely compact, i.e. it only involves information from vertices that are one edgedistance away in the grid structure, while still having a second order accuracy. No overlap region is needed at the domain interface, as for the second-order finite volume codes. Since the fluctuation splitting algorithm only needs to access the first order neighbors to update the solution at a vertex, the communication is greatly simplified. Schematically, the parallel algorithm of the flow-solver applies five main steps, A...E: A. Perform an implicit time integration, with sub-iterations for converging the solution at each real-time step. In each sub-iteration, the solver: - C o m p u t e the residual (i.e. the integral of the convective, diffusive, source term and unstationary part of the Navier-Stokes equation, over one tetrahedral cell) looping through all cells, splits the residual according with the distribution scheme in fluctuations, and sends fluctuations to the nodes, where these are collected in some special fields. - C o m m u n i c a t e using PVM, with neighboring parallel processes, the fluctuations accumulated in nodes and some other useful informations, for nodes laying on the domains interface. - I m p o s e boundary conditions, - L o o p through all nodes, to update the nodal values using the fluctuations stored in those special fields. As it can be seen, the algorithm needs no communication while computing and distributing the fluctuations (this step in the solver algorithm requires the longest computational time). It requires only one communication back and forward each sub-iteration, before applying the boundary conditions and updating the nodal values. The volume size of this communication is reduced, since only nodes on the domains interface exchange information. All data are packed together and sent in one sending operation, thus reducing to a minimum the communication. The tasks, are implicitly synchronized by using blocked data exchange. -
160
B. Perform PVM communication: with the supplementary task, which is running the spray algorithm in parallel. The solver receives data from the spray module, containing P momentum F~i,senergy Qp and species parcels position, velocity, and exchange in mass ms, between the liquid and the gas phase. C. Perform particle tracking in parallel. D. Compute volume averages of the source terms, i.e. mass ms, momentum Fi,senergy Q s and species Ss E. Perform PVM communication: send to the supplementary task appropriate data about the gas-phase flow, i.e. temperature, pressure, velocity, mass fractions for the gas-phase components, etc. While the tasks running the LES simulation algorithm are performing the steps just described, the supplementary task is running the spray algorithm and is performing postprocessing. Its (parallel) algorithm, has the following steps: -Integrate the equation of motion for the parcel (implicitly), - U p d a t e the parcel position, - C o m p u t e droplet stripping, break-up and coalescence, for each parcel, " p ~), - C o m p u t e droplet evaporation, together with mass and heat source terms (?~tps, Qs, for each parcel, - C o m p u t e the forces acting on parcels (F~), - P V M communication with the running tasks, -Postprocessing. Doing particle tracking on an unstructured grid is extremely time consuming. When doing particle tracking in parallel, this time was significantly reduced. The supplementary task synchronizes itself with the running tasks once per real-time step.
6. R E S U L T S The spray injection in a combustion chamber premixer has been studied using Large Eddy Simulation. Figure 8 shows a cross section through the premixer (mixed burner). The fuel is introduced at a position close downstream of the swirl generator, by using pressure atomizers. The injection, mixing and evaporation processes have been investigated and some results are presented in [6]. Figures 9 and 10 show a plot of the instantaneous distribution of the spray parcels and the velocity distribution together with stream traces in the flow field, respectively.
7. P A R A L L E L R E S U L T S The ParNAS3D code has been tested on up to eight processors on an SGI Origin 2000 platform, with and without the spray module. It proved to have excellent scalability (more than 98% from the theoretical speedup). Tests done on a cluster of two PC's running WinNT and connected by an usual Ethernet connection were not as successful (only 75% from the theoretical speed-up), due to the poor communication capability between the two PC's. The results, addressing the parallel speed-up are presented graphically in Fig. 11.
161 8. C O N C L U S I O N S Large Eddy Simulations of spray injected in cross-flow inside a swirl generator have been carried out using the DDM model. In the near future, computations using some other dynamic SGS models may be considered, to estimate the effect these have on the results from an engineering point of view (mean values, shedding frequencies, rms of fluctuations, etc.). These simulations give us a good insight into this complex flow problem and it is hoped that they will be beneficial to the design of the future premixers.
9. A C K N O W L E D G M E N T S Computer resources from LUNARC center, at Lund University, are gratefully acknowledged.
REFERENCES
1. P.L. Roe, Fluctuations and signals. A framework for numerical evolution problems. In K.W. Morton and M.J. Baines, editors, Numerical Methods for Fluid Dynamics, Academic Press, 1982. 2. H.Paillere, H.Deconinck, E. van der Weide,Upwind Residual Distribution methods for compressible flows : An alternative for Finite Volume and Finite Element methods, VKI 28th CFD Lecture Series, March, 1997. 3. H.Deconinck, R. Struijs, G. Bourgois, P.L. Roe, High resolution shock capturing cell vertex advection schemes on unstructured grids, VKI Lecture Series, March 21-25, 1994. 4. William A.Wood, William L. Kleb, Diffusion Characteristics of Upwind Schemes On Unstructured Triangulations, AIAA 98-2443, Albuquerque, 1998. 5. S.Mitran, D.Caraeni, D.Livescu, Large Eddy Simulation of Rotor Stator interaction in Centrifugal Impeller, JPC, Seattle, July 1997. 6. D. Caraeni,C. Bergstrom, L. Fuchs, S. Hoffmann, W. Krebs, K. Meisl, LES of spray in compressible flows on Unstructured Grids, AIAA 99-3762, Norfolk, 1999. 7. A. Jameson., Time dependent computations using multigrid, with applications to unsteady flows past airfoils and wings AIAA 91-1596, 1991. 8. J. Held, L. Fuchs, Large Eddy Simulation of Separated Transonic Flows around a Wing Section, AIAA 98-0405, Reno, 1998.
162
-.~ The Master + Running Slave tasks are running the LESalgorithm, in parallel.
T
Rmming in parallel
Implicit time integration (step I of 3): Loops flrough the cells and, 9 Computes the cell-residual, 9 Splits the r e s i ~ according with the distribution scheme, 9 Sends these f l ~ o n s to the nodes.
( in parallel, no c o - - c a t i o n
1
i r,p.
===~
nee6~)
The Supplementary task is homing the spray
is
The most time-consun~g step of the LES-algorithm ( and of the overall algorithm ) in the ParNAS3D code.
post processing.
Figure 1.
Figure 2.
--~ Implicit time integration (step 2 of 3): PVM c o - - c a t i o n s : 9Cormnt~cates using PVM, with neighboring parallel processes, fluctuations cwmd~ed in nodes and some other useful i n f o ~ o n 's. 9Specifics: non blocking sends blocking receives - ']~vmDataRaw" data encoding. -
-
In.licit time integration (step 3 of 3): Loops through nodes and, 9Imposes boundary conditions
9Updates the n c ~ values using the aanulated fluctuations.
(in parallel, no commtmication reea~)
I.F_B-runningprocesses are synchronized at the end of this
step.
Figure 3.
Figure 4.
163
Inthistin~..
The Supplementary task:
I~
~ ~
9Integrates the equation of motion forparcels, 9Updates the parcel position, 9Lm~;eisd; ~ break-up, coalescence, - evaporation, 9Computesforces acting on parcels.
The Master + Runm'ng Slave tasks: 9PVM conmamication: Receives spray datafrom the Supplementary task. 9Particle tracking, inparallel. 9Computes averagesfor source terms ( ~ s , momentum and energy ) comingfrom the spray. 9PVM conmamication: Sends appropriate data about the gas phase to the Supplementary task.
The Supplementary task: 9sends to LES-running tasks data about the spray 9receives data about the gas phase
Figure 5.
Figure 6.
t'1~
The Master + Running Slave tasks: PVM c o - - c a t i o n : 9Sends to Supplementary task
appropriate data about the gas phase, for post processing.
L~.
"-----'--a~J IllIIIUI'LL~ - ~~ I /~ ~ued
~u~
The Supplementary task:
-~ ~ ~
9PVM comnamication: Receives data about the gas phase, for post processing. . Post processing.
--- A[ial
,~~. Figure 7.
Figure 8.
164
z
5.32
15.98 28.•0
3 7 2 . 4 47.8~ 58.52 85.15
T3.801
F i g u r e 10.
F i g u r e 9.
ParNAS3D parallel speed-up 9 876-
~4-
pray .~~. "
3-
~
SpeedWSpray
.......... T h e o r e t i c a l
210 0
i
i
i
t
2
4
6
8
# Processors F i g u r e 11.
lO
Parallel ComputationalFluid Dynamics Towards Teraflops, Optimizationand Novel Formulations D. Keyes,A. Ecer, J. Periaux,N. Satofukaand P. Fox (Editors) o 2000 Elsevier Science B.V. All rightsreserved.
165
Dynamic Load Balancing for Parallel CFD on NT Networks Y.P. Chien, J.D. Chen, A. Ecer, and H.U. Akay Computational Fluid Dynamics Laboratory, Purdue School of Engineering and Technology, IUPUI, Indianapolis, Indiana 46202, USA.
This paper presents development of a dynamic load balancing technique on NT operating systems. Previously developed technique of periodically measuring the computation and communication loads on distributed UNIX systems is extended to NT. Differences between the two systems are taken into account in the implementation and the experiences gained are summarized. Preliminary results indicate the applicability of the method in NT environments. 1. INTRODUCTION Since the price of personal computers reduces quickly and the computation power of PCs increases rapidly, parallel CFD that was usually performed on workstations and supercomputers can now be performed on networked PCs. Many companies start to replace networked workstations by networked PCs. Since there is a large amount of idle CPU power in networked PCs, it is very attractive to use networked PCs for parallel CFD. Two approaches are commonly used for parallel computation on networked PCs. One is to run PCs with Linux operating system. The other is to run PCs with Windows NT operating system. The advantage of running PCs in Linux environment is that many tools and software developed in UNIX environment can be easily adapted to Linux environment. The disadvantage of using Linux environment is that most networked PCs do not run on Linux. Since Windows NT is widely used, we have been studying how to run parallel CFD jobs and load balancing in Windows NT environment. This paper describes the experiences and issues of running parallel CFD and load balancing in Windows NT environment. 2. BACKGROUND Our current research efforts are aimed at developing domain decomposition methods that can be efficiently executed on parallel networked heterogeneous computers. Our basic approach is to divide the flow domain into many sub-domains called blocks, and solve the governing equations over these blocks. The blocks are connected to each other through the inter-block boundaries called interfaces. The equations for each block are solved in block solvers, while the exchange of boundary conditions between neighboring blocks are performed via interface solvers. The data structures for managing the block and interface information are implemented by a domain decomposition database called GPAR [ 1]. GPAR uses PVM [2] or PVM-like machine portable parallel library [3] for message passing. Each data block is assigned to a computer process. One or more computer processes may reside in each processor.
166 In environments with computers of different architecture, different CPU speed, different memory size, different load, and different network speed, balancing the loads and managing the communication between processors become crucial. Since the domain decomposition approach requires the parallel processes to exchange information periodically, slow computation of any process will cause other parallel processes waiting for information. Waiting for information causes computer to become idle, thus reducing the efficiency of the system. We envision that future advanced parallel computation applications will require the parallel execution of different parallel application programs. Each parallel program may incorporate parallel execution of different solvers. We also envision that future advanced parallel computation environment will consist of networks of different types, computers of different speeds, and computers using different operating systems. Load balancing software tool for mutually dependent parallel processes will definitely be needed to utilize efficiently advanced computation environment and algorithms. Dynamic load balancing is needed for utilizing and sharing the available computing resources for advanced parallel computing. In a massively parallel computation environment, the availability of computer resources changes constantly. Besides, advanced parallel CFD solvers enable the independent selection of time steps. Since the selection of the time steps in all blocks and interfaces are data dependent, a balanced load distribution at one time may not be balanced after certain period of program execution. When the time step size changes, the communication volume between processes changes too. Since the communication speed is a function of the network load for some type of networks, the communication speed change also affects the load balancing. Therefore, dynamically balancing the computer load is essential even when the parallel computers are used in a single user mode. Dynamic load balancing relies on the accurate information about the computer speed and the network speed. In a massively parallel heterogeneous and dynamically changing computation environment, it is almost impossible for a user to track accurately the relative computation speed among all available computers and the communication speed of all available networks for load balancing. Besides, the relative computation speeds among computers depend on the type of mathematical manipulations used in the application programs. Therefore, we believe that the speed information should be measured periodically during the execution of the parallel application programs. We have developed a dynamic load balancing software package in UNIX environment with TCP/IP network [4]. The software can (1) track the load of every processor used in parallel computation, (2) measure the elapsed and CPU computation time for all parallel processes, and (3) measure the elapsed communication time used between every pair of parallel processes in UNIX environment. The package has been used in network of single processor IBM RS6000 workstations, networked Sun workstations, and IBM SP supercomputers. We also developed the parallel CFD computation and communication cost functions for UNIX environment [5] based on the measured computer and network information. These cost functions can be used to predict the elapsed execution time for different load distributions among parallel computers. 
Based on these cost functions, load balancing has been successfully achieved. The objective function for our current load balancing algorithms is to minimize the execution time of the slowest parallel process. The dynamic load balancing schemes periodically check the computer loads, block computation costs, and inter-processor communication costs. When the load is not balanced, the loads are redistributed among parallel processors. This approach of periodically measuring the actual
167 status of systems and gaining knowledge about the system performance during executions is superior to most other load balancing techniques which rely on idealized specifications of processor and network speeds. The dynamic load balancing tools we developed automate the load distribution of mutually dependent parallel processes and hence greatly increase the efficiency of parallel CFD algorithms. 3. DYNAMIC LOAD BALANCING IN WINDOWS NT ENVIRONMENT 3.1 Problems and Solutions
Although a dynamic load balancing technique has been successfully implemented in UNIX environment previously [4], implementation in Windows NT environment created several new problems. Problem 1: Shell Scripts Shell scripts are commonly used in large size software. However, many UNIX shell commands cannot be used directly in Windows NT. In order to port the load-balancing software from UNIX to Windows NT, the first approach we tried is to create a Korn shell environment in Windows NT. Although this approach may solve our compatibility problem, we have to install the Korn shell program on each NT computer. This can be costly since we intend to use all available PCs in the organization for parallel processing. The second approach is to reduce the shell program to a minimum level so that the shell program can be used in both UNIX and Windows NT. A new parallel program, DLB monitor, written in C language, is developed for gathering the load measurement information from the computers in parallel processing and preprocessing the information for load balancing. We will discuss the details in Section 3.2. Problem 2: Estimation of Communication Cost It has been observed that there are three main differences in terms of for estimating communication costs. These are:
1) In order to measure the communication time used in parallel CFD, the time differences between the system clocks of computers need to be known. In UNIX workstations, the difference between two computer clocks drifts very slowly and can be considered as a constant. Therefore, one way communication time can be measured. However, in NT machines, the clock difference drifts largely around a constant value in terms of minutes. This drift causes huge errors in the communication time measurement. The only solution to this program is not to use the measurement of one way communication. New communication time measurement methods is derive the one way communication time from the measurement of the round trip communication time (divide it by 2). 2) In Unix, the communication cost depends on the computation load (number of running processes). However, in NT, the communication cost is independent of computation load. This phenomenon can be verified by the following experiments. Two machines A and B are used for the experiment. One packet of message is sent from machine A to B. Upon receiving the message, machine B sends the message back to A immediately. The round trip elapsed communication time (Trouna ) is measured. The same experiment is repeated by adding 1, 2, 3, and 4 additional running processes (extraneous load) to machine B. The communication cost (Trou,,d/2 ) between these
168
two machines is shown in Table 1. Table 1. Communication cost via extraneous loads in NT. Extraneous Load Communication Cost (Cab) (Seconds)
0
1
2
3
4
0.016
0.018
0.023
0.020
0.020
3) In UNIX, the communication cost between processes on the same computer is much less than that between processes on different computers, while in NT the communication cost between processes on the same computer is approximately the same as that between different computers.

Based on the above three differences, the communication cost function for NT is defined as
$$C_{ab} = K_{ab}\, C_0,$$
where $C_{ab}$ is the communication cost between machines a and b, $K_{ab}$ is the number of packets in the message, and $C_0$ is the communication cost for passing one packet of a message between these two machines.

Problem 3: Processor Load Measurement
In order to measure the effective computation speed of each computer used for parallel CFD, the number of running processes on the computer needs to be counted. Utilities for counting the number of running processes are available in UNIX. For Windows NT, we have developed a program that periodically reads the NT registry to find the running processes on each machine.

3.2 DLB Tools for Windows NT
Several tools have been developed for load balancing in the Windows NT environment.

- Stamp Library: a collection of functions which can be called by C or FORTRAN programs, be embedded into CFD programs, and gather the information related to both the CFD program and the computer network (block and interface sizes, interface topology, average computation and communication elapsed and CPU time for each block, etc.).
- CTrack: in order to estimate the elapsed execution time for the CFD blocks, information about the communication speed between all computers is needed. CTrack, the communication tracker program, supports communication speed measurement.
- PTrack: the process tracker program, which finds the average number of non-CFD processes (extraneous load).
- Balance: this program performs load balancing based on greedy/genetic algorithms. It predicts the computation and communication costs of any given load distribution and finds an optimal load distribution for the next CFD execution cycle.
- DLB Monitor: the following tasks are performed by this program: initializing the computation and communication cost model; running the CFD program simultaneously with PTrack and CTrack; gathering the time stamp results and PTrack/CTrack results from the parallel computers; and load balancing.
- RCopy and RSpawn:
RCopy copies files from and to remote NT-based computers using the message passing package; it is necessary for gathering the data used in load balancing. RSpawn executes system commands (or applications) on remote NT-based computers. Figure 1 shows the organization and function of the above DLB tools.
Figure 1. Organization and function of the DLB tools.
4. CASE STUDY

A parallel CFD application program is used in this experiment. The program solves for the temperature distribution in a rectangular 3-D domain using an explicit finite difference method. The details of this program can be found at the web site (http://www.engr.iupui.edu/cfdlab). A 3D structured domain with 900,000 nodes (x=30, y=1000, z=30) is used in this case study. The domain is divided into 30 blocks, and five NT workstations are used. The execution of the CFD code went through five load-balancing cycles; the dynamic load-balancing program suggested a better distribution after every 100 iterations of the CFD program execution. The results are summarized in Table 2.

Table 2. The load distribution on the five computers and the extraneous loads for each cycle (block numbers, with the measured extraneous load in parentheses; times are given as elapsed/CPU).
Cycle 1, time 454/403, initial distribution:
  Computer 1: 6, 8, 10, 12, 19, 29 (0.19); Computer 2: 1, 2, 3, 4, 9, 13, 15 (0.18); Computer 3: 21, 22, 23, 24, 26 (2.36); Computer 4: 5, 11, 17, 18, 25 (0.95); Computer 5: 7, 14, 16, 20, 27, 28, 30 (0.06).
Cycle 2, time 409/381, two extraneous loads removed from computer 3 to make the distribution unbalanced:
  Computer 1: 6, 8, 10, 12, 19, 29 (0.19); Computer 2: 1, 2, 3, 4, 9, 13, 15 (0.12); Computer 3: 21, 22, 23, 24, 26 (0.27); Computer 4: 5, 11, 17, 18, 25 (1.20); Computer 5: 7, 14, 16, 20, 27, 28, 30 (0.14).
Cycle 3, time 351/316, new balanced distribution:
  Computer 1: 6, 8, 10, 12, 19, 29 (0.21); Computer 2: 2, 3, 4, 9, 13, 15 (0.13); Computer 3: 7, 21, 22, 23, 24, 26 (0.32); Computer 4: 1, 5, 11, 17, 18, 25 (1.12); Computer 5: 14, 16, 20, 27, 28, 30 (0.12).
Cycle 4, time 508/457, two extraneous loads added back to computer 3 to make the distribution unbalanced:
  Computer 1: 6, 8, 10, 12, 19, 29 (0.25); Computer 2: 2, 3, 4, 9, 13, 15 (0.07); Computer 3: 7, 21, 22, 23, 24, 26 (2.52); Computer 4: 1, 5, 11, 17, 18, 25 (1.09); Computer 5: 14, 16, 20, 27, 28, 30 (0.07).
Cycle 5, time 361/332, new balanced distribution:
  Computer 1: 6, 8, 10, 12, 19, 29 (0.21); Computer 2: 2, 3, 4, 9, 13, 15, 21 (0.15); Computer 3: 22, 23, 24, 26 (2.33); Computer 4: 5, 7, 11, 17, 18, 25 (0.98); Computer 5: 14, 16, 20, 27, 28, 30 (0.11).
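The rebalancing illustrated by Table 2 can be thought of as repeatedly assigning blocks to whichever computer is predicted to finish earliest, given each computer's effective speed after its extraneous load is accounted for. The following is only a minimal greedy sketch of that idea; the cost model, the block costs and the speed estimates below are hypothetical stand-ins and not the actual Greedy/Genetic Balance program.

```python
# Hypothetical greedy block redistribution (illustrative only, not the Balance program).
def greedy_balance(block_costs, effective_speeds):
    """Assign each block to the computer with the smallest predicted finish time.

    block_costs      : dict block_id -> estimated work per cycle (arbitrary units)
    effective_speeds : dict computer -> usable speed after extraneous load
    """
    finish_time = {c: 0.0 for c in effective_speeds}
    assignment = {c: [] for c in effective_speeds}

    # Place the most expensive blocks first so they anchor the schedule.
    for block, cost in sorted(block_costs.items(), key=lambda kv: -kv[1]):
        target = min(finish_time, key=lambda c: finish_time[c] + cost / effective_speeds[c])
        assignment[target].append(block)
        finish_time[target] += cost / effective_speeds[target]

    return assignment, max(finish_time.values())


if __name__ == "__main__":
    blocks = {i: 1.0 for i in range(1, 31)}            # 30 equally sized blocks (assumed)
    speeds = {1: 1.0, 2: 1.0, 3: 0.3, 4: 0.6, 5: 1.0}  # computer 3 slowed by extraneous load
    dist, t = greedy_balance(blocks, speeds)
    for comp, blks in dist.items():
        print(f"computer {comp}: {sorted(blks)}")
    print(f"predicted cycle time: {t:.2f}")
```

In this toy setting the heavily loaded computer 3 receives fewer blocks, which is qualitatively the behaviour seen in cycles 3 and 5 of Table 2.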
5. CONCLUSION

The differences between UNIX and NT based systems that affect the communication cost function have been described, and these differences are taken into account in the dynamic load balancing for parallel CFD. This paper described the problems encountered, and their solutions, in implementing dynamic load balancing for parallel CFD on a network of NT workstations. The software tools for dynamic load balancing were also described. These tools have been used successfully in the test cases.
ACKNOWLEDGEMENT
Financial support provided for this research by the NASA Glenn Research Center, Cleveland, Ohio is gratefully acknowledged. REFERENCES
[1]
D. Ercoskun, H. Wang, and N. Gopalaswamy, "GPAR-A Grid Based Database System for Parallel Computing," Internal Report, CFD Laboratory, Department of Mechanical Engineering, Indiana University Purdue University Indianapolis, 1994.
[2]
Oak Ridge National Laboratory, "PVM, Parallel Virtual Machine," World Wide Web at http://www.epm.ornl.gov/pvm/.
[3]
A. Quealy, G.L. Cole, and R.A. Blech, "Portable Programming on Parallel/Networked Computers Using the Application Parallel Library APPL," NASA Technical Memorandum 106238, Lewis Research Center, Cleveland, Ohio, USA, 1993.
[4]
Y.P. Chien, A. Ecer, H.U. Akay, and F. Carpenter, "Dynamic Load Balancing on Network of Workstations for Solving Computational Fluid Dynamics Problems," Computer Methods in Applied Mechanics and Engineering, Vol. 119, 1994, pp. 17-33.
[5]
Y.P. Chien, S. Secer, A. Ecer and H.U. Akay, "Communication Cost Function for Parallel CFD Using Variable Time Stepping Algorithms," Parallel Computational Fluid Dynamics - Recent Development and Advances Using Parallel Computers, Elsevier Science, New York, 1998, pp. 51-56.
Parallel Computer Simulation of a Chip Bonding to a Printed Circuit Board: In the Analysis Phase of a Design and Optimization Process for Electronic Packaging P. Chow a, C. Bailey b, K. McManus b, D. Wheeler b, H. Lu b, M. Cross b, C. Addison a aFujitsu European Centre for Information Technology Limited, 2 Longwalk Road, Stockley Park, Middlesex UB11 1AB, United Kingdom bCentre for Numerical Modelling and Process Analysis, University of Greenwich, London SE18 6PF, United Kingdom
This paper describes an ongoing study to consider large-scale non-linear analysis on parallel computing systems for electronic packaging applications. The motivation for this research is to investigate detailed 3D chip interconnects to printed circuit boards (PCB), such as the reflow process that requires models to characterise physics governing heat transport, solder solidification and residual stress, within a design and optimization process for the development of electronic components and devices.
1. INTRODUCTION

Solder materials are used to bond chip components and printed circuit boards together during board assembly. In the case of the reflow process, the board assembly passes through a reflow furnace; the solder, initially a solder paste, melts, reflows, then solidifies, and finally deforms between the chip and the board. A number of defects may occur during this process, such as flux entrapment, void formation, and cracking of the joint, chip or board. These concerns have been accentuated by the increasing demand for component miniaturization and smaller pitch sizes. Computer simulations, together with some experiments, provide an effective design and optimization route for reducing these defects and for assessing solder and board integrity and reliability. Although computer chips have a large number of connectors (solder bumps) attached to circuit boards, general modelling practice is to simulate a single connector or to assume each connector behaves like a beam in the finite element analysis. This is because a detailed 3D model requires a large mesh and a long computing time (solving the non-linear equations of the thermal and mechanical systems), thereby constraining the number of cycles through a design and optimization process. For complex multiple-chip cases it can easily lead to models having millions of elements. Parallel computing technology opens up the possibility of undertaking such large-scale analyses and delivering the solution in a practical timeframe. In application areas such as automobiles and aeronautics, parallel computing has
significantly reduced the period required for analysis and increased the size of models that can be simulated. Figure 1 shows a design and optimization process in electronic packaging with four primary components. 1) CAE/CAD, to create the geometry model. 2) Experiment design, which uses techniques such as orthogonal tables for planning the number of trials (analyses) and process control to manage the sets of analyses to be performed. 3) Analysis, meshing of the geometry model and applying boundary conditions before going to the solvers for analysis and obtaining the result. 4) Optimization, to optimize the design and analysis phase by using techniques such as objective functions and mathematical programming. The process cycle then repeats, depending on the design objectives: shape optimization would return to step 1, while others, such as materials, go to step 2.
Figure 1. A design and optimization process.

In this paper, we address the issues of using parallel computers to perform the simulation in the analysis phase (SOLVERS) of the design and optimization process for electronic packaging: the challenges and the reality of simulating detailed multiple-chip components on printed circuit boards within the design timeframe. Finally, some preliminary parallel results for 3D cases involving cooling, solidification and residual stresses at the solder joints and throughout the component during its assembly are presented; the largest model has over 1 million elements. The parallel simulations are performed on a Fujitsu AP3000 machine using up to 12 processing elements.
2. ANALYSIS SOFTWARE AND PARALLEL MODEL

The PHYSICA software [1, 2] from the University of Greenwich is used for this study. It is an open, single-code, component-based software framework [3] for multiphysics applications such as metals casting [4] and materials-based manufacturing processes [5]. The three-dimensional unstructured mesh code has analysis models for fluid flow, heat transfer, solidification, elastic/visco-plasticity, combustion and radiosity. PHYSICA's parallel model [6] is based on the Single Program Multiple Data (SPMD) paradigm, where each processing element runs the same program on a sub-portion of the model domain. The partitioning of the
model domain into sub-domains is done with an overlapping domain decomposition procedure based on mesh partitioning. In PHYSICA the graph partitioning code JOSTLE [7] is used to perform the mesh partitioning, and the message passing between processing elements to exchange information, predominantly for the overlapped regions, uses the portable communication library CAPlib from CAPTools [8]. For parallel codes to scale well, the non-scalable portions need to be eliminated; otherwise they become critical points on the solution procedure's critical path. Common examples of non-scalable parts are reading and writing of files (parallel input and output are currently system dependent, where available) and the global summation operations commonly found in popular linear solvers. In the version of PHYSICA used for the present study, JOSTLE is scalar and hence a critical point for overall scalability. Also, because the whole mesh needs to be read before partitioning, the size of model that can be handled is limited by memory and system configuration. A parallel version of JOSTLE is under way to address this.
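The overlapping decomposition described above amounts to each processing element exchanging the values in its overlap (halo) cells with its neighbours after every update. As a rough illustration only, the sketch below shows such an exchange for a 1D array using mpi4py; the variable names and the use of mpi4py (rather than CAPlib) are assumptions made for this note, not part of PHYSICA.

```python
# Minimal 1D halo-exchange sketch (illustrative only; not the PHYSICA/CAPlib code).
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

n_local = 10                       # interior cells owned by this PE (assumed)
u = np.zeros(n_local + 2)          # one halo (overlap) cell on each side
u[1:-1] = rank                     # dummy interior data

left = rank - 1 if rank > 0 else MPI.PROC_NULL
right = rank + 1 if rank < size - 1 else MPI.PROC_NULL

# Send the first/last interior values and receive the neighbours' values
# into the halo cells; Sendrecv avoids deadlock for the paired exchange.
comm.Sendrecv(sendbuf=u[1:2], dest=left, recvbuf=u[-1:], source=right)
comm.Sendrecv(sendbuf=u[-2:-1], dest=right, recvbuf=u[0:1], source=left)

# u[0] and u[-1] now hold the overlap values needed for the next update.
```

Run under mpirun with several processes, each rank ends up holding its neighbours' edge values in its halo cells, which is the information a solver needs before its next sweep over the overlapped mesh.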
3. NUMERICAL EXPERIMENTS & RESULTS
Figure 2 shows results for a Guide-pin example: solidification (FSN), elastic/visco-plastic stress (EVPSTR) and temperature (TN). This is a model of two materials, solder (cap top) and alloy, with 21,413 vertices, 57,577 faces and 18,150 elements. Table 1 gives the computing times for 1, 4 and 8 processing elements (PEs) in CPU minutes. For this model the total time on 8 PEs corresponds to a speedup of 3.19 over a single PE, and the solution speedup (excluding initial setup, mesh partitioning, and reading and writing of files) is 3.57. This means the non-solution portion takes about 11% and 13% of the total CPU time for 4 and 8 PEs, respectively, compared to about 3% for a single PE.
Figure 2. Result of Guide-pin example
Table 1. Parallel performance for the Guide-pin example (CPU time in minutes).

PE   Solution time   Total time   Speedup
1        17.40          17.92        -
4         6.00           6.75       2.65
8         4.87           5.62       3.19
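The speedups and non-solution fractions quoted in the text follow directly from Table 1. As a quick check of that arithmetic, using nothing beyond the table's own numbers:

```python
# Quick check of the speedup and non-solution fractions quoted for Table 1.
runs = {1: (17.40, 17.92), 4: (6.00, 6.75), 8: (4.87, 5.62)}  # PE: (solution, total) minutes

sol_1, tot_1 = runs[1]
for pe, (sol, tot) in runs.items():
    speedup_total = tot_1 / tot
    speedup_sol = sol_1 / sol
    non_solution = (tot - sol) / tot
    print(f"{pe:2d} PEs: total speedup {speedup_total:.2f}, "
          f"solution speedup {speedup_sol:.2f}, non-solution {non_solution:.1%}")
```

For 8 PEs this reproduces the 3.19 total speedup, the 3.57 solution speedup and a non-solution share of roughly 13%.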
Figure 3 shows one quarter of a chip bonding to a PCB example being modelled during the reflow process. Figure 4 shows an enlarged view of the solder bumps with two different attachment materials at the top and bottom. The model consists of 273,504 vertices, 1,133,207 faces and 425,890 elements. Figure 5 shows the solidification fronts of the solder bumps during the cooling phase of the reflow process. The corner solder bump is solidifying at a faster rate than its neighbours, as indicated by the front in dark colours. Figure 6 shows the magnitude of the visco-plastic strain and the deformation throughout the solder bumps at the end of reflow, when all the solder bumps are solid. Again the corner solder bump has a higher strain than its neighbours. The deformation, shown by the inclining solder bump, is due to the board contracting more than the chip because of the different thermal coefficients in the material properties.
Figure 3. Chip bonding to PCB example.
Figure 4. Solder bumps.
Figure 5. Solidifying solder bumps.
Figure 6. Solder bumps deformation.
Table 2 gives the computing times from 2 up to 12 PEs in CPU hours. The model is too big for a single PE on the AP3000, which reports out of memory. The CPU runtime for 2 PEs is under 7 hours, and for 12 PEs it is under 2 hours. This gives a speedup factor of about 8 for 12 PEs, representing a saving of about 5 hours in analysis time, or an extra 1 to 2 cycles in the
design and optimization process. For lower PE counts, the speedup factor moves nearer to the linear scaling mark.

Table 2. Parallel performance for the chip bonding to a PCB example (CPU time in hours).

PE   Solution time   Total time   Speedup
 2       6.281          6.748        -
 3       4.233          4.621       2.921
 4       3.261          3.648       3.699
 5       2.703          3.070       4.397
 6       2.350          2.701       4.997
 7       2.082          2.488       5.425
 8       1.816          2.153       6.268
 9       1.684          2.023       6.671
10       1.530          1.848       7.305
11       1.460          1.762       7.658
12       1.379          1.698       7.951
Figure 7. Total time performance
Figure 8. Solution time performance
To get an idea of a single-PE runtime, the same model was run on a Sun Enterprise 10000 (E10000) with 2 GB of memory in scalar mode. With UltraSPARC processors inside both the AP3000 and E10000 systems (U170 and U250, respectively), a total CPU time of 15.34 hours was reported on the E10000, with a solution time of 15.24 hours. Comparing the E10000 result with the 12-PE AP3000 run, this represents a saving of over 13 hours in analysis time, or an extra 5 to 6 cycles in the design and optimization process. In terms of speedups, it represents a factor of about 9 (compared to 8) for the analysis time and 11 (compared to 9) for the solution period. Figures 7 and 8, respectively, show graphs of parallel performance for total and solution times; the triangle markers give an idea of the true speedup had a 1-PE time been available. These estimates are obtained by substituting the E10000 single-PE result into the speedup calculation.
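The "Estimate" curves just described are obtained simply by substituting the E10000 single-processor times as the baseline. A small sketch of that substitution, using only values quoted in the text and in Table 2 (the selection of PE counts is arbitrary), might look like this:

```python
# Estimated speedups using the E10000 single-processor run as the baseline,
# as described in the text (15.34 h total time, 15.24 h solution time).
ap3000 = {   # PE: (solution hours, total hours), from Table 2
    2: (6.281, 6.748), 4: (3.261, 3.648), 8: (1.816, 2.153), 12: (1.379, 1.698),
}
baseline_total, baseline_solution = 15.34, 15.24

for pe, (sol, tot) in ap3000.items():
    print(f"{pe:2d} PEs: est. total speedup {baseline_total / tot:4.1f}, "
          f"est. solution speedup {baseline_solution / sol:4.1f}")
```

For 12 PEs this gives roughly 9 for the total time and 11 for the solution time, matching the factors quoted above.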
From the performance graphs, it is encouraging to see that the curve for total time shows potential gains for this model by adding more PEs (12 PEs is the highest we have access to at present). A downside is that the non-solution portion of the analysis time is also increasing with the number of PEs, some 19 percent (or 20 minutes) for the 12-PE case. To see how larger models may fare, a similar problem with a model size of 1,205,997 vertices, 3,504,048 faces and 1,149,312 elements was run on 12 PEs. The initial parallel performance shows an analysis time of 6.01 hours with a solution period of 4.94 hours, representing some 18 percent, or 1 hour, for non-solution activities. This is encouraging, as the percentage has not altered significantly.
4. CONCLUDING REMARKS
The work and results presented here are still ongoing and as yet no firm conclusion is drawn. Indications so far are that parallel computing and associated technologies can significantly reduce the analysis period in a design and optimization process for electronic packaging applications. As larger models are anticipated, as with multiple chips on a single board, they will exceed the capacity (most commonly the memory) of today's uniprocessors, such as workstations, to deliver solutions within a designer's timeframe. The numerical experiments conducted so far indicate that some 20% of the analysis time on 12 PEs is spent in non-solution activities, such as retrieving and saving data to files and the setup period for the parallel computation. Memory usage is a lot higher in the mechanical analysis than in the thermal analysis; a ratio of 2 to 1 has been observed. This is most likely due to the segregated method used in the thermal analysis as compared to the full-system approach in the mechanical analysis.
REFERENCES
1. PHYSICA, University of Greenwich, London, U.K. (http://physica.gre.ac.uk/)
2. M. Cross, P. Chow, C. Bailey, N. Croft, J. Ewer, P. Leggett, K. McManus, K.A. Pericleous and M.K. Patel, PHYSICA - A Software Environment for the Modelling of Multiphysics Phenomena, ZAMM, v76, pp. 101-104, (1996)
3. P. Chow, C. Bailey, K. McManus, C. Addison and M. Cross, A Single-Code Model for Multiphysics Analysis-Engine on Parallel and Distributed Computers with the PHYSICA Toolkit, Proceedings of the 11th International Conference on Domain Decomposition Methods, Eds. C-H Lai, P. Bjorstod, M. Cross and O. Widlund, (1998)
4. C. Bailey, P. Chow, Y. Fryer, M. Cross and K. Pericleous, Multiphysics Modelling of the Metals Casting Processes, Proc. R. Soc. London, v452, pp. 459-486, (1996)
5. M. Cross, Computational Issues in the Modelling of Material-Based Manufacturing Processes, Journal of Computer-Aided Material Design, v3, pp. 100-116, (1996)
6. K. McManus, M. Cross and S. Johnson, Issues and Strategies in the Parallelisation of Unstructured Multiphysics Codes, Proceedings of Parallel & Distributed Computing for Computational Mechanics, (1997)
7. JOSTLE, University of Greenwich, London, U.K. (http://www.gre.ac.uk/~wc06/jostle/)
8. CAPTools, University of Greenwich, London, U.K. (http://www.gre.ac.uk/~captools/)
Implicit-Explicit Hybrid Schemes for Radiation Hydrodynamics Suitable for Distributed Computer Systems Wenlong Dai and Paul R. Woodward University of Minnesota, 116 Church Street S.E., Minneapolis, MN 55455 A finite difference scheme is proposed for radiation hydrodynamical equations in the transport limit, in which flow signals are treated explicitly while radiation signals are treated implicitly.
Flow and radiation fields are updated simultaneously. An iterative approach is proposed to solve the set of nonlinear algebraic equations arising from the implicitness of the scheme. No matter how many grid cells radiation signals propagate through in one time step, only an extremely small number of iterations is needed, and communication between computer processor elements (PEs) happens only a few times. The number of iterations needed is independent of the number of PEs used, and the iterations carried out in each PE are almost independent of the iterations of the other PEs. Each iteration costs only about 0.8 percent of the CPU time needed for one time step of a second-order accurate, fully explicit scheme. Two-dimensional problems are treated through a dimensionally split technique; therefore, the iterations for solving the set of algebraic equations are carried out only within each one-dimensional sweep.
1. INTRODUCTION

In the transport limit, radiation hydrodynamical equations may be written as a hyperbolic system of conservation laws plus emission and absorption of radiation. Two distinctive features of the system are the shock waves involved and the extremely fast wave speeds of radiation signals. An implicit treatment of radiation signals results in a large set of (nonlinear) algebraic equations at each time step. This creates difficulties for numerical simulations on distributed computer systems. Typically, each computer processor element (PE) has to communicate with its neighboring PEs after each iteration. Therefore, because of the large number of iterations needed, a large amount of message traffic has to be communicated between PEs at each time step. Typically, communication between PEs is forced to take place only after a certain number of iterations, i.e., information
provided by other PEs is temporarily frozen during a certain number of iterations. This approach does not work well: the freezing reduces the convergence rate, and therefore more iterations are needed for a given overall tolerance. In this paper, we develop a numerical scheme for radiation hydrodynamical equations in the transport limit, in which the set of time-averaged fluxes needed in the scheme is calculated through approximately solved Riemann problems. The Riemann solver is based on characteristic formulations. Flow signals are treated explicitly, while radiation signals are treated implicitly. The scheme is second-order accurate in both space and time for the flow motion, and the influence of radiation signals on the flow motion is correctly calculated. An iterative approach is developed for the set of nonlinear algebraic equations arising from the implicitness. The treatment of the nonlinearity is completely nonlinear, and only a single level of iteration is involved, which handles both the implicitness and the nonlinearity. No matter how many grid cells radiation signals propagate through in one time step, only an extremely small number of iterations is needed, and this number is independent of the number of PEs used. The iterations carried out in each PE are almost independent of the iterations carried out in the other PEs.
2. BASIC EQUATIONS

Since we will employ a dimensionally split technique for two-dimensional problems, we write the one-dimensional projection of the radiation hydrodynamical equations in the transport limit [1] as

$$\partial_t U + \partial_x F(U) = S(U). \qquad (1)$$

Here

$$U = \begin{pmatrix} \rho \\ \rho u_x + f_x/c \\ \rho u_y \\ e_r \\ E \end{pmatrix}, \qquad
F(U) = \begin{pmatrix} \rho u_x \\ F_m \\ \rho u_x u_y + u_x f_y/c \\ u_x e_r + c f_x \\ u_x (E + p) + c f_x \end{pmatrix}, \qquad
S(U) = \begin{pmatrix} 0 \\ \rho\,(g_x + \chi f_x) \\ 0 \\ s_r \\ -\,u_x\,\dfrac{\partial p_r}{\partial x} + \chi \rho u_x f_x \end{pmatrix},$$

where $F_m$ and $s_r$ are defined as

$$F_m = \rho u_x^2 + p + p_r + \frac{1}{c}\, u_x f_x, \qquad
s_r = -\,p_r\,\frac{\partial u_x}{\partial x} - \rho\,\bigl(a_e e_r - a_r \kappa_p T^4\bigr).$$
Here, $\rho$, $p$, $u$, $\varepsilon$ and $T$ are the mass density, gas pressure, flow velocity, specific internal energy and temperature of the flow; $e_r$, $f$ and $p_r$ are the radiation energy density, radiation flux and radiation pressure; $E$ is the total energy density,
$$E = \rho\left(\varepsilon + \tfrac{1}{2}u^2\right) + e_r;$$
$g_x$ and $a_r$ are the gravitational constant and the radiation constant; and $\chi$, $a_e$ and $\kappa_p$ are the radiation flux coefficient, the radiation energy absorption coefficient and the radiation energy emissivity. The set of Eq. (1) is complete if two equations of state are given, one for the flow fields and the other for the radiation fields. For the purpose of our test problems, we assume the $\gamma$-law for the flow fields, $p = (\gamma - 1)\rho\varepsilon$, and the Eddington factor for the radiation fields, $p_r = f_E\, e_r$, where $f_E$ is an Eddington factor. The term $u_x\,\partial p_r/\partial x$ on the right-hand side of Eq. (1) represents the rate of work done by the fluid against the radiation pressure gradient, and the term $f_x\,\partial u_x/\partial x$ arises because the radiation energy flux has inertia [1].
3. NUMERICAL SCHEMES AND RESULTS

Considering a grid $\{x_i\}$ and integrating Eq. (1) over a grid cell $x_i < x < x_{i+1}$ and over a time step $0 < t < \Delta t$, we have

$$U_i^n = U_i + \frac{\Delta t}{\Delta x_i}\left[ F(\bar U_i) - F(\bar U_{i+1}) \right] + \bar S_i\,\Delta t. \qquad (2)$$

Here $\Delta x_i$ is the width of the cell, $\Delta t$ is the time step, $U_i$ and $U_i^n$ are the cell-averaged values of $U$ at the initial time and at the new time $t = \Delta t$, and $\bar U_i$ is the time-averaged value of $U$ over the time step at the grid point $x = x_i$. $U_i^n$ and $\bar U_i$ are defined as

$$U_i^n = \frac{1}{\Delta x_i} \int_{x_i}^{x_{i+1}} U(\Delta t, x)\, dx, \qquad
\bar U_i = \frac{1}{\Delta t} \int_0^{\Delta t} U(t, x_i)\, dt. \qquad (3)$$
$\bar S_i$ in Eq. (2) is the source term $S(U)$ evaluated at $\bar U_i$ [$= \tfrac{1}{2}(U_i + U_i^n)$]. To obtain Eq. (2) we have approximately used the product of cell- (or time-) averaged values as the cell (or time) average of a product. Therefore, one of the key points in the scheme is the approximate calculation of the time-averaged values $\bar U_i$ and the evaluation of $\bar S_i$. The details of the derivation of the set of algebraic equations arising from the implicitness of the scheme may be found in [4]; we briefly describe the procedure here.
The time-averaged values $\bar U_i$ needed in Eq. (2) are calculated through approximately solved Riemann problems. The Riemann solver we use is based on characteristic formulations for the system described by Eq. (1). In order to resolve shocks, we treat flow signals explicitly, and we restrict the size of the time step so that sound waves propagate through no more than one grid cell. For a given pair of left and right states, we explicitly calculate the time-averaged values of $\rho$, $u_x$ and $p$ through the differentials of the Riemann invariants associated with the wave speeds $u_x$ and $u_x \pm c_s$, where $c_s$ is the speed of sound. For radiation signals, the time-averaged values of $f_x$ and $e_r$ at grid interfaces may be approximately calculated by tracing the two radiation characteristic curves that pass through the point $(x_i, \Delta t)$ back to the centers of the two neighboring cells. If we insert these time-averaged fluxes into Eq. (2), we obtain a set of nonlinear algebraic equations for the cell-averaged values $\rho_i^n$, $u_{xi}^n$, $e_{ri}^n$ and $f_{xi}^n$. Since we have used the backward Euler formulation for radiation signals, the numerical errors in the radiation signals undergo a quick damping on the time scale of the flow motion. Following this procedure, we may write the set of algebraic equations in the form [4]

$$Q_i\,\delta V_i^n = P_i\,\delta V_{i+1}^n + M_i\,\delta V_{i-1}^n + C_i. \qquad (4)$$
Here $V_i^n$ is the vector of unknowns, $V_i^n = (\rho_i^n,\, u_{xi}^n,\, e_{ri}^n,\, f_{xi}^n)^T$, and $\delta V_i^n = V_i^n - V_i$. $Q_i$, $P_i$ and $M_i$ are three matrices which depend on the unknowns $V_i^n$, and
$C_i$ is a vector that is independent of the unknowns $V_i^n$. The set of Eq. (4) is what we want to solve, and it is nonlinear. For the nonlinearity we have not introduced any approximation, such as a linearizing procedure, in Eq. (4); our treatment of the nonlinearity is therefore completely nonlinear. The set of Eq. (4) may be solved iteratively, for example through a red-black approach or the Gauss-Seidel method. The convergence of both the Gauss-Seidel method and the red-black approach is slow, but if a sweeping mechanism [3] is added to the Gauss-Seidel method, the resulting sweeping method converges much more quickly. From numerical experiments, we noticed that the sweeping method converges extremely quickly for problems with known boundary values $V_0^n$ and $V_{N+1}^n$, compared to other problems. From this observation, we have developed an iterative approach for general boundary conditions, which converges much more quickly than the sweeping method. Suppose we have a problem with boundary conditions written in the general form

$$H_j(V_0^n, V_{N+1}^n, V_1^n, V_N^n) = 0, \qquad j = 1, 2, \ldots, 8. \qquad (5)$$

Here $V_0^n$ and $V_{N+1}^n$ are the values on the two fake cells at the two boundaries. Our iterative approach is:

1) Obtain the values $V_0$ and $V_{N+1}$ through the boundary conditions from the initial values $V_i$, $i = 1, 2, \ldots, N$.
2) Guess the values of $V_0^n$ and $V_{N+1}^n$ as $V_0$ and $V_{N+1}$, respectively.
3) Iteratively solve Eq. (4) for $V_i^n$ ($i = 1, 2, \ldots, N$) to a required accuracy, keeping $V_0^n$ and $V_{N+1}^n$ fixed during the iteration.
4) From Eq. (4), find the Jacobi coefficients $\partial V_1^n/\partial V_0^n$, $\partial V_1^n/\partial V_{N+1}^n$, $\partial V_N^n/\partial V_0^n$ and $\partial V_N^n/\partial V_{N+1}^n$.
5) From the boundary conditions and these Jacobi coefficients, find corrections $\Delta V_0^n$ and $\Delta V_{N+1}^n$ to the initial guess.
6) Set $V_0^n \leftarrow V_0^n + \Delta V_0^n$ and $V_{N+1}^n \leftarrow V_{N+1}^n + \Delta V_{N+1}^n$, and go back to step 3).

A toy scalar illustration of steps 3) to 6) is sketched below.
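To make the role of the Jacobi coefficients concrete, the following toy applies the same accelerate-by-boundary-correction idea to a scalar backward-Euler diffusion step with periodic boundary conditions. It is a simplified analogue written for this note, with arbitrary parameter values, not the authors' scheme (which works on the full vector system (4)); since the model is linear, the sensitivities are obtained simply by re-solving with unit ghost data.

```python
import numpy as np

def interior_sweeps(u, g0, g1, b, lam, tol=1e-12, max_sweeps=2000):
    """Gauss-Seidel sweeps for (1 + 2*lam)*u_i - lam*(u_{i-1} + u_{i+1}) = b_i,
    with the ghost values g0 (left) and g1 (right) held fixed (step 3 above)."""
    for _ in range(max_sweeps):
        change = 0.0
        for i in list(range(len(u))) + list(range(len(u) - 1, -1, -1)):  # symmetric sweep
            left = g0 if i == 0 else u[i - 1]
            right = g1 if i == len(u) - 1 else u[i + 1]
            new = (b[i] + lam * (left + right)) / (1.0 + 2.0 * lam)
            change = max(change, abs(new - u[i]))
            u[i] = new
        if change < tol:
            break
    return u

N, lam = 32, 50.0                                   # large lam: "signal" crosses many cells per step
b = np.sin(2 * np.pi * (np.arange(N) + 0.5) / N)    # right-hand side (old time level)
u = np.zeros(N)
g0, g1 = b[-1], b[0]                                # steps 1-2: guess the ghost values

for correction in range(5):                         # outer corrections (steps 3-6)
    u = interior_sweeps(u, g0, g1, b, lam)
    # Step 4: sensitivities du/dg0 and du/dg1 (linear problem: solve with unit ghost data).
    s0 = interior_sweeps(np.zeros(N), 1.0, 0.0, np.zeros(N), lam)
    s1 = interior_sweeps(np.zeros(N), 0.0, 1.0, np.zeros(N), lam)
    # Step 5: pick corrections so the periodic conditions g0 = u_N and g1 = u_1 hold.
    A = np.array([[1.0 - s0[-1], -s1[-1]], [-s0[0], 1.0 - s1[0]]])
    r = np.array([u[-1] - g0, u[0] - g1])
    d0, d1 = np.linalg.solve(A, r)
    if max(abs(d0), abs(d1)) < 1e-10:
        break
    g0, g1 = g0 + d0, g1 + d1                       # step 6, then re-solve the interior
```

In this linear setting the boundary values converge after essentially one correction, which mirrors the observation in the text that only a few outer corrections are ever needed.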
We explain the procedure in more detail here. During step 3), the values of $V_0^n$ and $V_{N+1}^n$ are fixed, and we need only a few iterations if we use the sweeping method. After this step the solutions for $V_i^n$ ($i = 0, 1, 2, \ldots, N+1$) do not necessarily satisfy the boundary conditions. If the boundary conditions are satisfied, we have the solution; otherwise, we have to go to the next step. As a result of step 3), the values of $V_1^n$ and $V_N^n$ depend on $V_0^n$ and $V_{N+1}^n$. We may adjust the values of $V_0^n$ and $V_{N+1}^n$ so that the required boundary conditions hold. Therefore, we may use the set of boundary conditions to find corrections $\Delta V_0^n$ and $\Delta V_{N+1}^n$ to the initial guess $V_0^n$ and $V_{N+1}^n$. To find the dependence of $V_1^n$ and
'
I ' ' ' '
I ' ' ' '
I ' ' ' '
i,,
I
1''
'
'
I'
''
-5
g
,
,,,
I,
,,,
500
,,
, , , ,
1000 1500 the number of iteration
I
2000
....
I ,J,7 2500
Figure 1" The convergence obtained from the accelerated red-black (dashed line) and accelerated sweeping method (solid line) when
AtCr/AX is
about 17920. The horizontal
coordinate is the total number of iteration used for both Eq.(4) and Eq.(6) for all eight Jacobi coefficients. V~v on the values V~ and V~v+l , we have to solve the following set of equations [4]" ^
^
q,v,
-
where
+ M,gL
(6)
184
9 ~ - ovp Here v ~ is any one of elements in V~ and V~+ 1. Qi, r)i and l~i depend on the ^
unknowns V~ as well as V~. Exactly like nq.(4), nq.(6) may be iteratively solved. When we iteratively solve Eq.(6), V~ are kept constant which are the solutions obtained in step ^
^
3). The boundary values, V~ and V~v+l, in Eq.(6) are fixed. For example, if v~ is p~, the boundary condition used for Eq.(6) is
V ~ - (1, O, O, O)T ,
9~r+1--0.
Therefore, only a few iterations are needed for Eq.(6) if the sweeping method is used. Since there are eight elements in both V~ and V~v+l, we have to solve Eq.(6) eight times. After finding the Jacobi coefficients, we may obtain the corrections AV~ and AV~v+I through boundary conditions Eq.(5). I
i
i
i
I
L!
i
i
i
I
I
I
I
IR
i
,
,
I
,
,
,
I
Figure 2" The one-dimensional simulation domain is divided into four subdomains. For typical boundary conditions, such as periodic, reflection, flow-in and flow-out boundary conditions, the forms of boundary conditions Eq.(5) are very simple. For example, periodic boundary conditions are V~
-
V~v+l
,
V~r
-
V~,
and the corrections, AV~ and
AV~r+l , are
obtained through solving the following set
of eight equations: (~V
n
~
ovn
vk + E. ~ ~
~ n
- v~ + ~ v ~ .
Together with the sweeping method, we call the approach the accelerated approach. Figure 1 shows the convergence obtained from the accelerated red-black (dashed line) and accelerated sweeping method (solid line) when A t c r / A X is about 17920. Here cr is the speed of radiation signals. When we implement the scheme in a computer system with multi-PEs, we may simply divide a simulation domain into a few subdomains with a PE responsible for each. The accelerated approach may be very naturally extended to the case with multi-PEs. For example, four PEs are used for four subdomains, as shown in Fig.2. Consider the k th PE
185 taking care of one of interior subdomains. The boundary conditions for this PE becomes
Von,k __ vn~k- 1,
VN+ln,k__
v;,k-t-1.
Here the second index k [or (k + 1)] in superscript is used for the k th [or (k • 1) st] PE. This set of boundary conditions is only a special case of the general boundary conditions, Eq.(5).
n,k During iteration, the values V~ 'k and VN+ 1 in Eq.(4) do not change until a
correction for the initial guess are needed. From the numerical experiments, it is found that the correction happens only a few times in one time step no matter how many cells radiation signals propagate through in one time step. For the same problem as used in Fig.l, Figure 3 shows the convergence when different numbers of PEs are used. . . . .
I ' ' ' 1 ' ' ' 1 ' ' ' 1 -
'
'
I
'
'
'
I
'
'
'
I
'
'
'
I
-2
o,I
<3
v
<3
Cr~ 0
0
~ '
v 0
40
'
60
I
'
'
'
I
'
'
'
,
100
I
'
'
'
I
40
,,
I L ,,
60
80
I:
100
' 1 ' ' ' 1 ' ' ' 1 ' ' ' -2
-
! I
(.o
,<3
v
0
,
40
. . . .
80
I 60
L__
80
I
100
|
|= 60
1 ' ' ' 1 ' ' ' 1 ' ' '
It, ~
1~
80
100
- ' ' ' 1 ' ' ' 1 ' ' ' 1 ' '
(,0
<:3 C~ 0
0
I 40
.
.
.
. 60
.
I
,
80
,
,
I-I 100
the number of iterations
, I , , , I J , , I , ~ 40
60
the number of iterations
Figure 3: The convergence when the different number of PEs are used for the same problem. /ktCr//kx is about 17920. For two-dimensional problems, we currently apply the dimensionally split technique originally proposed by Strang [1] for explicit schemes to our implicit-explicit hybrid calculation. The method for two-dimensional equations is the symmetric product of onedimensional operators:
186
1 x Y Y x L A t - ~(LAtLAt + LAtLAt).
(7)
Here L~,t is a one-dimensional operator with time step At for one-dimensional Eq.(1). For a linear hyperbolic model problem, it may be shown that the operator LAt is secondorder accurate if each one-dimensional operator is second-order. To test the two-dimensional schemes, Initially, we set up a wave propagating at the direction which is at 30 ~ with the x-axis.
Figure 4 shows the numerical solution (solid lines) against the "exact" solu-
tion (dotted lines); the exact solution is obtained from the fully explicit one-dimensional scheme. There are no differences in flow motion between the two sets of solution, although they are obviously different in radiation fields.
4. C O N C L U S I O N S We have proposed a numerical scheme for two-dimensional radiation hydrodynamical equations in the transport limit.
Flow signals are explicitly treated, while radiation
signals are implicitly treated. Flow and radiation fields are updated simultaneously. The accelerated approach for a single PE converges extremely quickly, and the accelerated approach for multi-PEs has the same convergence rate as the case with a single PE. The accelerated approach proposed here may be directly applied to other hyperbolic systems. Communication between PEs occurs only a few times in one time step.
5. A C K N O W L E D G M E N T The work presented here has been supported by the Department of Energy through grants DE-FG02-87ER25035 and DE-FG02-94ER25207, by the National Science Foundation through grant ASC-9309829, by NASA through grant USRA/5555-23/NASA, and by the University of Minnesota through its Minnesota Supercomputer Institute.
REFERENCES
1. D. Mihalas and B.W. Mihalas, Foundations of Radiation Hydrodynamics, Oxford University Press, New York, 1984. 2. G. Strang, SIAM J. Numer. Anal., 5 (1968) 506. 3. W. Dai and P.R. Woodward, J. of Comput. Phys., 128 (1996) 181. 4. W. Dai and P.R. Woodward, J. of Comput. Phys., 157 (2000) 199.
187 I~' ' ' 1 ' ~
.615 f:Z.
1 . . .' . '
' ' ' 11' ' ' 1' ' ' 1~' ' '
1.002
.61
Q.
0
.2
.4
.6
.8
1
0
0
j , , I a ~ ~ I , , , I , , , ,Pr~,,
.2
.4
b! ' ' ' I ' ' ' 1 ' "" "'" /"
10.2
/
, , , I , , , I , ,
.2
" ' " / -/ C ",,
1
1.2
' ' I ' ' ' I ''
\
" \\\
," """"
--
\ \ """
.6
/
.9 ".
.2
.4
//
.6
.8
1
1.2
0
H ~
.2
,,,I,,,I,,,~,1,,,4 .4 .6
.8
1
1.2
1
1.2
.92
~
.91.9
LI
1.2
,, .
i;
.6
\
,,
.8
,.
/
,-,-'2
1
-I
0
12oo lOOO
\
X
/ 11
1
,, \
.4
'~
,/I,,,
.8
'-"'"
0
.4
"'"
"'"
.4
"'~FI
I , , ,t
.8
' ' I'
\
" """ ,
.7
.3
.6
""
9.8 9.610I
0
.2
,1
-.002 FI
1
1.2
OOl o
10.4
I /' ' ~ / ' ~ ~ " ~ \ ~ I ' ' ' I ' ' ' I ' ' , ' 1~
1.004
.2
7
.4
400
-I
.8
//
'.
1.2
1"
.6
'iii ,ii'''
/
,,,
"1"
0
,,,
.2
.4
\
".
.6
.. 9
\
.8
,-;;
/
1
,,
/
1 1.2
X
Figure 4: A comparison in numerical solution between our two-dimensional implicitexplicit hybrid scheme (solid lines) and a fully explicit and one-dimensional scheme (dotted lines completely hidden behind the solid lines). The dashed lines are the initial profiles and solid lines the solution after the sound wave propagates ten wavelengths.
This Page Intentionally Left Blank
Parallel ComputationalFluid Dynamics Towards Teraflops,Optimizationand Novel Formulations D. Keyes,A. Ecer, J. Periaux,N. Satofukaand P. Fox (Editors) 2000 Elsevier Science B.V. All rights reserved.
189
Computations of Three-Dimensional Compressible Rayleigh-Taylor Instability on SGI/Cray T3E Anil Deane 1 Institute for Physical Science and Technology, University of Maryland, College Park, MD 20742, USA. d ean e @m o r p h y. u m d. ed u
Application results for a fully compressible Euler hydrodynamics code parallelized to run on the SGI/Cray T3E are described. The particular flow that is solved for is that of three-dimensional compressible Rayleigh-Taylor instability, for which results for a single spike/bubble configuration as well as for a multi-mode perturbation of the interface are presented. Issues related to the algorithm, parallelization and optimization on the T3E are reported. With compile level optimization, the code obtains 83MF/node (42GF on 512 nodes) on a T3E-600. Even higher performance, 126MF/node, has been obtained on a T3E-1200. The full paper will include scaling results.
Introduction This paper describes the parallelization and optimization and computational results of a code for solving the Euler (inviscid) equations of compressible hydrodynamics. While the code is completely general in terms of solving a variety of problems within compressible flows, the particular choice of boundary and initial conditions chosen correspond to the problem of compressible Rayleigh-Taylor (RT) instability. The RT instability is one of the handful of basic fluid instabilities and has been studied frequently. In addition to its fundamental importance, the instability is an important mixing mechanism in supernovae [1] in which the mixing rates drive the dynamics. One feature that distinguishes the physics of the present calculations from previous ones is the focus on three-dimensional effects. Since mixing rates affect dynamical evolution, 3D effects could potentially significantly alter the picture obtained from previous 2D calculations. Another feature of these calculations is that they are based on direct numerical simulation of formally inviscid hydrodynamics. Although the equations being simulated are truly inviscid, their numerical counterparts do in fact have numerical or artificial dissipation (or both). A consequence of this is that grid resolution plays the part of an effective Reynolds number. Finer grids correspond to higher Reynolds numbers because of accordant decreases in grid based dissipation. Thus higher and higher resolution corresponds more closely to the "true" situation where dissipation in supernovae is, for example, land High Performance Computing, NASA Goddard Space Flight Center, Greenbelt, MD 20771, USA
190 minute. We have performed our calculations on the NASA Goddard Space Flight Center's 1024 node SGI/Cray T3E which is an HPCCP (US High Performance Computing and Communications Program) testbed system and presently ranks as being the world's 5th largest computational facility, in terms of processing speed. The code is written with MPI which allows its portability onto different platforms. Some results on other machines have been reported previously [2, 3]. In this paper, we mainly report results of compile level optimization for performance. This does not alter the code in any significant way, allowing for both portability and ease of reading. We consider these to be important features for code maintenance.
2
Flux Corrected Transport Algorithm
The Flux Corrected Transport (FCT) method [4, 5] has been used extensively as a shock capturing method for compressible hydrodynamics. Shock capturing methods become necessary because the Euler equations admit discontinuous solutions (shocks and contact discontinuities). Straightforward differencing leads to either spurious oscillations at the discontinuity, or an excessive smoothing of the discontinuity. FCT is a finite volume method and proceeds from the notion of the domain being discretized into numerous cells, which exchange fluxes at their boundaries with neighboring cells and with the domain boundaries. The essential steps of the FCT approach to shock capturing are as follows: First a low-order flux (from which a solution, if obtained directly, would be excessively diffusive) and a high-order flux (from which a solution, if obtained directly, would contain spurious oscillation) are formed. The high-order flux is "limited" based on the high- and low-order fluxes and the low-order solution. This limiting, or correction, is the essence of the method because the limited flux, when used for the final high-order update, has the property that the resulting solution is free from oscillations. In fact, shocks and contact discontinuities are resolved to within a few grid points. For the calculation presented here, the high-order scheme is 4th order, the low-order scheme is 2nd order, with some small dissipation (which is considerably smaller than if a non-shock capturing method were used). The time marching is 2nd order. We use the flux limiter of Zalesak [5].
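To make the flux-correction step concrete, the sketch below is a minimal 1D FCT step for linear advection. It is only an illustration: it uses donor-cell and Lax-Wendroff fluxes with the simpler Boris-Book limiter rather than the 4th-order fluxes and Zalesak limiter used in the actual code, and the grid size and Courant number are arbitrary.

```python
import numpy as np

def fct_step_1d(u, c):
    """One FCT step for u_t + a u_x = 0 on a periodic grid, c = a*dt/dx (0 < c < 1).
    Low-order transport: donor cell; high-order: Lax-Wendroff; limiter: Boris-Book
    (a simpler 1D stand-in for the Zalesak limiter used in the paper)."""
    up1 = np.roll(u, -1)

    # Low- and high-order fluxes through interface i+1/2 (already multiplied by dt/dx).
    f_low = c * u                                           # upwind for a > 0
    f_high = c * 0.5 * (u + up1) - 0.5 * c**2 * (up1 - u)   # Lax-Wendroff

    # Transported-diffused (low-order) solution.
    utd = u - (f_low - np.roll(f_low, 1))

    # Antidiffusive fluxes, limited against neighbouring gradients of utd.
    A = f_high - f_low
    s = np.sign(A)
    d_right = np.roll(utd, -2) - np.roll(utd, -1)           # utd_{i+2} - utd_{i+1}
    d_left = utd - np.roll(utd, 1)                          # utd_i     - utd_{i-1}
    Ac = s * np.maximum(0.0, np.minimum.reduce([np.abs(A), s * d_right, s * d_left]))

    return utd - (Ac - np.roll(Ac, 1))

# Advect a square wave: the profile stays sharp and free of over/undershoots.
N = 200
x = (np.arange(N) + 0.5) / N
u = np.where((x > 0.3) & (x < 0.5), 1.0, 0.0)
for _ in range(300):
    u = fct_step_1d(u, c=0.4)
```

The limiting step is the essential point: the corrected flux never introduces new extrema into the low-order solution, which is what keeps discontinuities resolved within a few cells without spurious oscillation.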
3
Rayleigh-Taylor Instability
In its simplest form, the RT instability arises from the unstable equilibrium configuration of heavy fluid laying on top of a lighter fluid [1, 6, 7, 8, 9, 10, 11, 12, 13, 14]. Then, a small perturbation will cause them to try to exchange positions. This basic mechanism is ubiquitous arising in, for example, supernovae [1] The fully compressible and inviscid situation is relevant because of high Mach numbers and negligible viscosity.
191
Specifically, we consider the 3D domain 0 _< x, y _< lx, ly and 0 _< z < 1; g - {0, 0, - 1 } . 1 so that the Two layers are considered with p - P2 - 2 for z > ~1 and p - /91 - 1 f o r z <_ 3, Atwood number ~pl ~ - 2 The initial configuration is hydrostatic with ~ - pg. Two types of cases were computed. The first corresponds to a single mode perturbation of the layers of the form V~ - Vo sin(kx/lx)sin(ky/ly)c~, where c~ - ~ is the sound speed, and the second correspond to multi-mode perturbation of the form form V~ - Voac~, where a is a 5 normally distributed random variable. We take an ideal gas equation of state with ~ - 5.
4
Parallelization
The domain decomposition follows [2, 3]. The vertical direction is decomposed in planes and distributed across processors. Each plane is further subdivided in one of the remaining directions. This decomposition has the advantage of being simple and straightforward to implement, but is advantageous only for large problems where a large number of processors are available. This is the present case. Message passing is via SGI/Cray's MPI implementation on the T3E. The MPI library is layered over SGI/Cray's shmem library. Although a direct implementation on the shmem is probably a little faster, we use MPI for portability. While HPF and the Cray CRAFT work-sharing features are available, these are severely performance limited. The NASA/GSFC T3E consists of 1024 compute processing elements (PE), each with 16 MW (128 MB) of memory for an aggregate of 128 GB of memory. The system has 960 GB of GigaRing attached fibre channel disks. The giga-ring is a high-speed I/O link with 1GB/sec peak bandwidth per channel. Each PE has a DEC Alpha microprocessor (DEC chip 21164; DEC Alpha EV5) with peak performance of 600 MFlops/sec. The speed is achieved by performing multiple instructions per clock cycle (at 300MHz), is cache-based, has pipelined functional units and supports IEEE standard 32-bit and 64-bit floating point arithmetic. All PEs are connected by a high-bandwidth, low-latency bidirectional 3-D torus system interconnect network. This topology ensures short connection paths and high bisectional bandwidth. Adaptive routing allows messages on the interconnect network to be rerouted around temporary hot spots. Interprocessor data payload communication rates are 480 Mbytes per second in every direction through the torus; in a 512-PE T3E system, bisection bandwidth exceeds 122 Gbytes per second. The NASA/GSFC T3E is based on the 300MHz chips, although we have also obtained timing results for the very recently 600Mz version (T3E-1200). The T3E has two unique features: Streams maximize local memory bandwidth, allowing the microprocessor to run at full speed for vector-like data references. Also, ERegisters provide gather/scatter operations for local and remote memory references and utilize the full bandwidth of the interconnect for single-word remote reads and writes. Latency hiding and synchronization are integrated in 512 off-chip memory-mapped E-registers. The latency of remote memory access can be hidden by using E-registers as a
192 vector memory port, allowing a string of memory requests to flow to memory in a pipelined fashion, with results coming back while the PE works on other data. A range of synchronization mechanisms accommodates both Single Instruction Multiple Data (SIMD) and Multiple Instruction Multiple Data (MIMD) programming styles, although we follow the MIMD approach.
5
Performance
We have systematically added compiler flags to f90 to increase performance. These flags are: 03 Highest level of non-vectorizing general optimization. aggress Raises the limits for internal tables, which increases opportunities for optimization. u n r o l l 2 Highest level of loop unrolling. p i p e l i n e 3 Highest level of pipelining
split2 Highest level of loop splitting. Although these correspond to fairly general optimization steps, particular codes may or may not obtain advantage from them. In Table 1 we summarize our findings where each flag is shown with its corresponding performance measured as CPU time/step. As item 4 indicates unroll2, the aggressive unrolling of loops has a deleterious effect on performance. The biggest performance gain is obtained for switching on streams. Since the T3E is a 64 bit word machine, another big factor of improvement is obtained by using 32 bit arithmetic. This was effected in the code by replacing all REAL*4 as REAL. This was the only performance enhancer that causes a change in the code. This is, however, quite small, requiring only trivial changes by a text editor. All told, the code obtains 83MF/node on the NASA/GSFC T3E-600. When timed on a T3E-1200, the code obtains 126MF/node. We stress that these are very much in the range of the highest performances on these machines without resort to very major restructuring of the code, which are unwelcome from a maintenance viewpoint. All of these increase the single node performance of the code, this being the greatest opportunity for optimization since the code, because of our application, is computationally intensive. Figure 1 shows the performance characteristics of portions of the code. The highorder flux calculation takes the greatest proportion of the time, followed by the limiting
193
flag 03,aggress 03,aggress,unroll2 03,aggress,pipeline3 03,aggress,pipeline3,unroll2 03,aggress,pipeline3; STREAMS 03,aggress,pipeline3,split2; STREAMS 03,aggress,pipeline3,split2; STREAMS; 32 bit 03,aggress,pipeline3,split2; STREAMS; 32 bit, T3E-1200
CPU/step 0.66 0.64 0.61 0.64 0.42 0.40 0.30 0.20
Mflops
83 126
Table 1: Compiler flags for optimization and attendant single node performance. STREAMS indicates that streams have been set on. 32 bit indicates the 32 bit version of the code was used. T3E-1200 indicates the performance on SGl/Cray T3E-1200 procedure. Communication total the next highest, although a significant amount of that is completion of incomplete sends and receives. In the code, these have been overlapped with some computation, although not significantly. The low-order flux calculations, next time level update and equation of state calculations take about the same time. Boundary condition applications are small. Finally, barriers for synchronization at each half-step account for a significant portion. An interesting feature observed is that without such synchronizations, performance is actually degraded across processors. This is likely a consequence of a cascade processes of asynchronous messaging delays across PEs. Its consequence is that barriers actually increase performance. This issue has not been explored fully as yet.
6
Results
For the single mode case, we have chosen lx - ly _ 1 with wavenumber k - 27r disturbance. We choose nx = ny = 32, nz = 128 and computed on 128 processors. In Figure 3 we show contours of the density at time t = 2.43. The central portion of the figure is the spike which is the propagation of the heavy fluid into the light fluid while at the edges is the bubble which is corresponding motion of the lighter fluid penetrating into the heavy fluid. A spike is nearly cylindrically symmetrical about its axis and this is reflected in the nearly identical structure of the two panels. However, in the middle of region of the spike/bubble configuration, this cylindrical symmetry is broken and in fact, the symmetry is that of the initial (square) perturbations. Figure 4 shows the growth rate of the spike for this mode and shows that the growth rate for this wavenumber and resolution are very similar to that obtained previously [3] for different wavenumbers and resolutions. By this we conclude that within this range of wavenumbers the nonlinear growth is roughly identical and independent of the (effective)
194
FCT
on Cray
T3E
20
16
0
12
'II 0
,
,
,
,
-
.~_
.E --
~
"~ "--
ll,i i
o.
"
,
9
I
,
,
~ E
o E
0 0
0 0
E
"E
E
Figure 1: Performance breakdown of FCT code into components, hi and low refer to the high- and low-order flux calculations respectively, l i m i t rain and l i m i t max refer to the limiting with respect to minima and maxima parts of the FCT flux limiting procedure. eos is the equation of state calculation. BC is the application of boundary conditions. This latter is the computational/memory access part; this is the chief source of communications in the code. comm wait and comm t o t a l refer to communication wait and total times. b a r r i e r refers to global synchronization barriers calls.
Reynolds number. As in [3] the growth is consistent with a growth rate ~ t 2. For the multi-mode case, Ix = ly = Iz = 1. In Figure 5 we show the contours of the density at time t = 3.2 of a calculation with n= = ny = nz = 128 on 128 PE of the T3E. A large number of spike and bubble arrangements are evident. The growth rates of these features are shown in Figure 6, where we have identified the growth with the largest amplitude feature in each region. Thus the values associated with "bubble" do not necessarily refer to an individual bubble, but rather to the maximal penetration of bubble features (likewise for spikes). We see that the growth rate changes between early and late times. At late times, the growth is ~ t, i.e. linear. At early times when the amplitudes are small, bubbles and spikes do not feel the presence of their neighbors and grow like single mode features with growth c( t 2. Then, for late times, as the amplitudes reach significance, merging and competition of neighbors slows the growth rate to ~ t. We conclude by noting that we have obtained satisfactory performance and useful results from a general purpose fully compressible 3D hydrodynamics code on the SGI/Cray
195
100 80
J
(D "0
o c
60
c'~
o
LL.
40
20 0
,
,
,
i
,
,
,
i
,
,
,
i
,
,
i
2-104 4.104 6-104 8-104 data page size (wds)/n0de
Figure 2: Variation of node performance as a function of the per node resident data size. T3E. The performance of the code is consistent with that obtained for codes that are memory access time bound, particularly with view to the 67% speedups in switching on streams to increase memory bandwidth and from going from a T3E-600 to a T3E-1200 where the difference is the 2X clock speed. A c k n o w l e d g m e n t s Thanks are due to B. Fryxell for numerous conversations regarding the problem and to T. Clune & S. Swift regarding SGI/Cray T3E optimization and I/O. This work is supported by NASA Grant NAG5-6152 to the University of Maryland and the NASA Goddard Space Flight Center's Earth and Space Science HPCC Project.
References [1] Miiller, E., Fryxell, B., and Arnett, D., Astron. and Astrophys. 251, 505 (1991). [2] Deane, A., Zalesak, S., and Spicer, D., 3D compressible hydrodynamics using Flux Corrected Transport on message passing parallel computers, in High Performance Computing, 1995, edited by Tentner, A., pages 128-133, 1995. [3] Deane, A., Computers Math. Applic. 35, 111 (1998). [4] Boris, J. P. and Book, D. L., J. Comput. Phys. 11, 38 (1973). [5] Zalesak, S., J. Comput. Phys. 31,335 (1979). [6] Taylor, G. I., Proc. Roy. Soc. London Ser. A 201, 192 (1950). [7] Baker, G. R., Meiron, D. I., and Orszag, S. A., Phys. Fluids 23, 28 (1980). [8] Read, Z. I., Physica D 12, 45 (1984).
196
::!:
Figure 3: Contours of the density for a single mode perturbation k - 27r, with lx - ly -1 ~, lz - 1. The data is at t - 2.43 and shows the x - z and y - z planes. [9] Youngs, D., Physica D 37, 270 (1989). [10] Tryggvason, G. and Unverdi, S. O., Phys. Fluids A 2, 656 (1990). [11] Yabe, T., Hoshino, H., and Wsuchiya, T., Phys. Rev. A 44, 2756 (1991). [12] Ofer, D., Shvarts, D., Zinamon, Z., and Orszag, S., Phys. Fluids B 4, 3549 (1992). [13] Li, X., Phys. Fluids A 5, 1904 (1993). [14] Hecht, J., Alon, U., and Shvarts, D., Phys Fluids 6, 4019 (1994).
197 0.50 0.40 O
"o 0.30 .,..,,, (3..
E 0.20 0.10
0.00 0.0
0.5
1.0 1.5 Time
2.0
2.5
Figure 4: Growth of the amplitude of the spike of the single mode p e r t u r b a t i o n of Figure 3. The solid line is the present calculation. The dotted and dashed lines are from the previous calculation of Deane (1998) and are at k = 3~, at twice and four times the resolutions, respectively.
Figure 5" Contours of the density for the multi-mode p e r t u r b a t i o n case (see text), with lx - ly - lz - 1. The d a t a is at t - 3.2 and shows only a section of the extent in z.
198 0.6
. . . . . . . . .
|
. . . . . . . . .
i
. . . . . . . . .
i
0.4
nD
Q.
. . . . . . . . .
ble
0.2 0.0
E '~ -0.2 -0.4 -0.6 ....................................... 0
1
2 Time
3
4
Figure 6" Growth rates of the bubble and spike regions of the multi-mode perturbation case of Figure 5. Note that at late times the growth changes.
Parallel ComputationalFluidDynamics Towards Teraflops,Optimizationand Novel Formulations D. Keyes,A. Ecer,J. Periaux,N. Satofukaand P. Fox (Editors) 92000 ElsevierScienceB.V. All rightsreserved.
199
An Algorithm for Reducing Communication Cost in Parallel Computing A. Ecer, E. Lemoine and I. Tarkan Purdue University School of Engineering and Technology, IUPUI Indianapolis, Indiana 46202
1. INTRODUCTION
The efficiency of parallel algorithms is closely related to the cost of communication. In order to improve the parallel efficiency of large problems on numerous processors, one has to control and minimize the communication cost. In the past, we have studied the problem of estimating the communication cost [1] and developed algorithms for reducing it [2,3]. In the present paper, a method is proposed for reducing the communication cost of a block-structured solution scheme as well as the computations. First, the basic Schwarz algorithm for the solution of Laplace's equation is considered. In this case, the problem is divided into a series of blocks with a one-point overlap between neighboring blocks. At each iteration step, the Laplace equation is solved exactly for each block. Afterwards, the interface boundary conditions are updated between the neighboring blocks, which involves communication. This procedure is computationally intensive, since the number of iterations required for global convergence is dominated by the slow convergence rate of the interface boundary conditions; thus, many block solutions need to be performed. To improve the rate of convergence, at the end of each iteration step one can solve a system of equations for all interfaces of all the blocks, as in the case of the Schur complement and variants of Schwarz schemes [4]. The convergence of the interface boundary conditions is accelerated; however, these schemes involve communication between all the interfaces, which requires global communication and synchronization of all the blocks. In all of the above formulations, the problem is defined as iterations between solutions of several elliptic problems. In the present paper, we cast the same problem as a time-dependent parabolic problem. The equations are integrated in time for each block, where the blocks are overlapped by one point. After each integration step, if the neighboring blocks communicate with each other, the blocked solution replicates the single-block solution. In that case, the scheme would be communication bound, since the computation within a block is minimal, especially if one employs an explicit integration scheme for each block. The convergence is governed by the convergence rate of both the solution inside each block and the boundary conditions for each block. The basic objectives of the proposed approach are: reducing computations by obtaining only an approximate solution for each block through an explicit time integration scheme; reducing communications by not updating the boundary conditions during the time integration for each block; and evaluating and correcting the error caused by the delay in updating the boundary conditions. The methodology described here can be applied to problems other than the solution of Laplace's equation.
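Before turning to the details, the delayed-exchange idea itself can be illustrated in a few lines of code on the one-dimensional model problem introduced in the next section. The sketch below is only an illustration written for this note: the grid sizes, the exchange interval and the treatment of the one-point overlap are simplified, arbitrary choices, not the authors' implementation.

```python
import numpy as np

def ftcs_block(u, left_bc, right_bc, d, steps):
    """Advance one block with the explicit FTCS scheme, holding its interface
    (boundary) values fixed for 'steps' time steps, as in the delayed-update strategy."""
    for _ in range(steps):
        full = np.concatenate(([left_bc], u, [right_bc]))
        u = u + d * (full[:-2] - 2.0 * full[1:-1] + full[2:])
    return u

# Two blocks covering (0, 1) with a one-point overlap near x = 0.5 (illustrative setup).
m, d, n_exchange = 50, 0.25, 20
uL = np.zeros(m)                 # interior of block I
uR = np.zeros(m)                 # interior of block II
uA = uB = 1.0                    # interface values u(x_A), u(x_B) from the initial spike

for cycle in range(100):
    uL = ftcs_block(uL, 0.0, uA, d, n_exchange)   # block I: fixed wall at x = 0, interface at x_A
    uR = ftcs_block(uR, uB, 0.0, d, n_exchange)   # block II: interface at x_B, fixed wall at x = 1
    # After n steps, exchange: each block's point nearest the interface supplies
    # the updated boundary value for its neighbour.
    uA, uB = uR[0], uL[-1]
```

Communication occurs only once every n_exchange steps, and the exchanged values are the near-boundary points, which (as argued in Section 2) settle toward their steady values much faster than the block interiors.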
200
2. METHOD OF SOLUTION
First, the case of the following one-dimensional problem is considered: tgu
tgt
c32u " - 0 ~ ~
tgEx
with boundary conditions of u(O,O = o ,
u(1,0 - o
and initial conditions of u (x, O) = O,
O< x < 1
and
u (0.5, O) = 1.
In this case, an explicit forward time-centered space scheme is employed as followsUz n+ 9 l
= ui"
where
+ d (ui_l n - 2 ui n- u i §
d =a
The problem is divided into two blocks. conditions are defined as follows" u ( x a , O) = u ~ = 1,
At ~,X2 "
At the block interface, the initial boundary
and
u (x s , 0) = u n~ = l .
where xa and x s are t h e coordinates of the block interfaces. An exact duplication of the Schwarz algorithm would require the integration of the equations inside each block until a steady state solution is reached without communicating. This approach would require excessive computations. As mentioned above, another approach for the solution is to communicate after each iteration step, which requires excessive communications. If one does not communicate for n time steps, and hold the boundary conditions constant, it is observed that the effect of the boundary conditions diffuse inward in each block as shown in figure 1~ The present approach is based on the observation that the information needed to be communicated between the neighboring blocks at points xA and x s can be estimated and corrected after n time steps without communicating every time step. As can be seen from figure 2, after n iteration steps, the effect of the boundary condition felt immediately at the point next to the boundary. Also, for parabolic systems, the convergence to a steady state is much faster at this point compared to an interior point. Thus, after n iterations, the solution at the grid point neighboring the boundary provide a good estimate of the solution at that point if the time integration was continued until a steady state is reached. The Fourier components of the solution of the difference equation can be written in the following form:
201
uT' = Z A "
exp(Ii
where
0),
An
-
I = ~--1,
l_4dsm
0 < 0 < at,
and,
2
It is also observed that for d < 0.25, the high frequency w a v e s c o n v e r g e m u c h faster than the low frequency ones. I 0.9 0.8 0.7 0.6 0.5 time
0.4 0.:3 0.2 0.1
0 0
0.05
0.1
0.1~5
0.2
0.25
0.3
0.3~5
0.4
0.45
0.~5
0
0
0_I:~5
0.1
0. I~5
0.2
0.25
0.3
0.:3~5
O.a
Figure 1. Diffusion of the boundary condition inside the blocks ( t = 0 to 500 At with 50 At increments). x=0
x=h~
0.9 0.8
x = 4ZXx
0.7 0.6 0.5 0.4
x = 14LXx
0.:3 0.2 0.1 0
0
50
1130
160
200
250
300
350
400
450
500
Figure 2. Convergence of the solution to a steady state after 500 time steps at x = 0, &x, 2kx, 4Ax and 14Ax.
0.45
0.5
202
At the interface of the two blocks, at the initial time, the solution can be written as follows: u~ u,~ -
U, + eA u.
+ eB
for block I at point xA on the interface boundary, for block II at point xB on the interface boundary.
Here, Uz and UH are the steady-state solutions at XA and xB, while SA, 6B are the deviations from the steady state at the same points at the initial time. After n iteration steps, the effect of the imposed boundary conditions are calculated at the interface points inside the neighboring block. A new set of interface boundary conditions is determined after the exchanges between the neighboring blocks, u ~ produces the solution to be utilized as uiF at xB of block II while Ul~produces uF at xA of block I, as shown in figure 3. Here, it is
Block I
Block H UI
XB
0
XA
After n steDs
UI
rt .
_.,,,-
After 2n stevs 2a
1,li
UI
2t~
111I
Figure 3. Schematic description of the integration scheme with communication after each n time steps.
203
assumed that the solution after n iterations, the pertttrbations to the boundary conditions have diffused inward and can provide a good estimate of the steady state values of 6a and 6s. At this time, one can write, u] ' = Uz + k e s
and
un"
= Un + k e a .
where k is an unknown factor and 0 < k < I due to the diffusive nature of the parabolic ue~zuauon. If the time integration is advanced for another n steps, this time, u ~ produces in block I and uz" produces u ~ " in block II. In this case, one can write, u2"=
Uz +
k e sA
uu+
and
E
eB
Even if, U1 and Uzz are unknowns, one can calculate k as followsn
k =
n
2n
2n
urn- u t - uH +ut n n 0 0 Utt - - Ut - - Utt + Ut
and compute an estimate of Uz and Uz/as follows"
u. =ud" + p( ud" - u . ~
u, = u/" + .a ( u / " - u ? ) . where;
/5'
-
k 2
l_k ~
"
Therefore, after 2n time steps, the boundary conditions at the interfaces are corrected based on the above estimates. In figure 4, the convergence rate of two schemes, communicating every time step vs, -0.5
=
i
i
9
u
i
.
1
,
-1.5 -2 ~ . -2.5 "~
-3
o -3.5 -4
-4.5 * - c o m m u n i c a t i o n each 50 t i m e step - o - c o m m each 50 t i m e step with correction -
-5 -5.5
i
0
500
!
!
i
I
!
I
!
!
1 0 0 0 1500 2000 2500 3000 3500 4000 4500 5000 time iteration Figure 4. Variation of residuals with different communication rates and with correction for the onedimensional diffusion equation.
204
communicating every f t ~ steps are compared for a grid of 100 points with 2 blocks for d=-0.25 for the one-dimensional problem defined above. As can be seen from this figure, the rate of convergence is slower for the case with reduced communication than the case of communicating each time step, although, the elapsed time may be reduced considerably as it will be discussed later. The procedure to increase the rate of convergence can be summarized as follows: 9 Equations are integrated in time for each block for n steps, starting with u ~ a n d u n ~ as boundary conditions. 9 The boundary conditions uF and u~ as calculated from the neighboring blocks are exchanged. 9 Equations are integrated in time for another n steps, with u[' and u ~ , as t h e new boundary conditions for each block. 9 It is observed that after 2n t i m e steps, the original boundary conditions u ~ a n d u n ~ return to their original blocks where the error is diffused by a factor k2. At this time, Uz and U u are calculated as shown above and the cycle of 2n t i m e steps is repeated. By following the above procedure for the one-dimensional problem, the rate of convergence was improved as shown in figure 4. The results suggest that the rate of convergence was similar to the case of communicating every time step. During the computations, an upper limit of 0.975 was also set to the value of kto prevent divergence. The same procedure was implemented, by using the forward-time, backward-space to the convection diffusion equation as follows: u n+l = u n - c ( u n - ui_l ~)
+
d (ui_l" - 2 u['- ui+l")
where, c = V t / x and V is a constant. The pemtrbations to the boundary condition convect and diffuse inward starting from the boundaries. In this case, there is a different k value in each direction since the operator is not symmetric. The values of kl, k2, ui a n d u u are calculated as follows:
u?= Ui + cA
u:
uP= U~+ k2c8
u:
o
2n n nil - - n i l
Ul --UI
ug -uzt
nil --UlI
2n
uI-O.
Ulo -- O,
= UD+ k~CA.
rO2n
n
=/
II
n n]
luiu+ - u t u ~
I
0
2n
LUlIUI
n
k= = u'/-O, nil~
,B=
n
--UlUH
u , = u/~ + p ( u2 ~ - u? ) ,
where;
cB.
u,2~ = Uu + k~k2 CB.
uT~= U~ +k~k2zA
V
=U.+
/qkz 1-klk 2
205 Again kl and k2 values are limited to be <0.95. The convergence rates are compared for three cases with: communication every time step, communication every 50 time steps, and finally communication every 50 time steps with correction of the boundary conditions. The results obtained by using the above procedure are shown in figure 5 for a grid with 140 nodes and 2 blocks, with c = 0.12, d = 0.67, andre = VL/a = 100. A three-dimensional example of the developed scheme was chosen to be the solution of Laplace's equation over a 100x 100x 100 grid [5]. The grid is divided into 10 blocks in one direction. The parabolic equation is written as, --s
V2U
~t
with initial conditions u(x,y,z,0) = 0, and boundary conditions, u(0,y,z,t) =1, u(1,y,z,t) =0, u(x,0,z,t) = l-x, u(x,l,z,t)= l-x, u(x,y,0,t)=l-x, u(x,y,l,t)=l-x. A forward time, centered space scheme is employed, this time in three-dimensions, with d = 0.167. Figure 6 shows the results again for the three cases listed above. In this case convergence rate is improved by the correction of the boundary conditions. For this problem, the elapsed time is also plotted in Table 1, when a 10Mbps hub or a 100Mbps switch are utilized. Ten 400 MHz Pentium processors are employed in this case.
ACKNOWLEDGEMENTS
This research was supported by NASA Glenn Research Center.
REFERENCES
1. Chien, Y.P., Secer, S., Ecer, A., Akay, H.U., "Communication Cost Function for Parallel CFD Using Variable Time Stepping Algorithms," Proceedings of Parallel CFD '97 Conference, May 1997, Manchester, England. 2. Gopalaswamy, N., Ecer, A., Akay, H.U., and Chien, Y.P., "Efficient Parallel Communication Schemes for Explicit Solvers of NPARC Codes," AMA Journal, Vol. 36, No. 6, June 1998, pp. 961-967. 3. Gopalaswamy, N., Akay, H.U., Ecer, A., and Chien, Y.P., Parallelization and Dynamic Load Balancing of NPARC Codes," A/AA Journal, Vol. 35, No, 12, December 1997. 4. Dinh Q.V., Glowinski R., Periaux J. and Termsson G., On the Coupling of Viscous and Inviscid Models for Incompressible Fluid Flows via Domain Decomposition. First International Symposium on Domain Decomposition Methocls for PDE, SIAM, 1988. 5. Parallel CFD Test Case, http://www.parcfd.org/
206
- 0 . 5
0
9
"~_ I~lr~
-1
'
.
.
.
.
.
.
.
- +- communication each time step - *- comm each 50 time step
-1.5
" ~ -;2.5
-3
-3.5 -4
0
. 500
.
1000
.
.
150C}"
.
.
2000 2500 3000 time iteration
i
3500
I
4000
4500
5000
Figure 5. Variation of residuals with different communication rates and with correction for the onedimensional convection-diffusion equation. -0.3 -0.35
-0.4
t
-0.45
~L -0.55
- +- Communication - ~- Communication
-0.6 -0.65
0
1 O0
:200
300
each time step each 20 time step with correction
400 500 600 time iterations
700
000
L
L.__
900
1000
Figure 6. Variation of residuals with communication every time step and communication every 50 time steps with correction for the three-dimensional diffusion equation.
Hub Switch
Comnu~c~on each Urne */~ */~ Total 1 99 578 7 9(3 124
ConmmmicalJoneach 50 time s l ~ */~,omp * / ~ Total 41 59 19 60 40 15
RaUoof eapseO ~.n~ 30.4 8.3
Table 1. Comparison of communication cost reduction using the present scheme for the case of a hub and a switch for the three-dimensional test case.
Parallel ComputationalFluid Dynamics Towards Teraflops, Optimizationand Novel Formulations D. Keyes, A. Ecer, J. Periaux, N. Satofukaand P. Fox (Editors) 92000 Elsevier Science B.V. All rightsreserved.
207
Domain decomposition solution of incompressible flows using unstructured grids with pP2P1 elements F.O.Edis a, U.Gulcat a and A.R.Aslan a aFaculty of Aeronautics and Astronautics, Istanbul Technical University, 80626, Maslak, Istanbul, Turkey
Parallel implementation of a fractional step time discretization Finite Element scheme using pseudo-quadratic velocity/linear pressure interpolation tetrahedral elements is presented. An iterative non-overlapping domain decomposition technique is utilized for the distributed solution of the Poisson's equation for the pressure, pP2P1 elements are especially attractive within this scheme as the number of pressure elements are far lower than the number of velocity elements, yielding a better efficiency for the parallelization of the Poisson's solver. The superlinear speed up is achieved between the 2 and 4 domain lid driven cavity solutions.
1. I N T R O D U C T I O N Incompressible flow modelling with unstructured grids is an important area of general industrial interest. Unstructured solvers allow engineers to analyse flows involving complex shapes with little effort in terms of mesh generation. However, due to the increased number of elements, increased computational resources are required for such analyses. Utilization of parallel computers and cost-efficient clusters of workstations for incompressible flow problems can reduce the analysis time. Most solution schemes for unsteady incompressible viscous flows require an implicit solution of the Poisson's equation for pressure, which constitutes the major difficulty in terms of parallel implementation. Iterative domain decomposition techniques are widely used to overcome this difficulty. Efficiency of such a method is presented in [1] for an implementation on a workstation cluster. 1 In Finite Element computation of incompressible flows, utilization of first-order interpolation element both for velocity and pressure is preferred over second order velocity interpolation, due to computational efficiency. However, equal order interpolation elements do not satisfy the so-called Ladyzhenskaya-Babuska-Brezzi condition and often produce spurious oscillations in the pressure field. Elements combining first-order velocity interpolation with first-order pressure interpolation and still satisfying the LBB condition are proposed in [2]. These elements, which can be considered as pseudo-second-order velocity This research has been supported by TUB ITAK (Turkish Scientific and Technical Research Council), under the title COST-F1.
208 interpolation elements, satisfy the LBB condition while offering lower computational requirements compared to equal-order interpolation elements. Pseudo-second-order interpolation is achieved by subdividing a parent pressure element into sub-elements and defining first-order velocity interpolations on these sub-elements. Detailed comparison of results, obtained using quadrilateral/hexahedral (pQ2Q1) and triangular/tetrahedral (pP2P1) shaped pseudo-second-order velocity interpolation elements with those obtained using corresponding equal-order interpolation elements can be found in [3-5]. In the present work, parallel implementation of a fractional step time discretization Finite Element scheme using pseudo-quadratic velocity/linear pressure interpolation tetrahedral elements is presented. An iterative non-overlapping domain decomposition technique [1,6] is utilized for the distributed solution of the Poisson's equation for the pressure, pP2P1 elements are especially attractive within this scheme as the number of pressure elements are far lower than the number of velocity elements, yielding a better efficiency for the parallelization of the Poisson solver. A cluster of DEC Alpha XL266 workstations running Linux as the operating system, interconnected with a 100 Mbps TCP/IP network, is used for computations. A public version of the Parallel Virtual Machine, PVM 3.3, is used as the communication library. Lid-driven flow in a cubic cavity with a Reynolds number of 1000 is selected as a test case to demonstrate the efficiency and accuracy of the method used. Timings and efficiency studies show superlinear speed up between the 2 and 4 domain solutions.
2. FORMULATION
2.1 Navier-Stokes equations The equations governing the flow of an unsteady, incompressible, viscous fluid are the continuity equation V.u =o
(1)
and the momentum (Navier-Stokes) equation D___.u_u= -Vp + 1 V2u Dt Re
(2)
The equations are written in vector form (from here on, boldface type symbols denote vector or matrix quantities). The variables are non-dimensionalized using a reference velocity and a characteristic length, as usual. Re is the Reynolds number, Re=U//v where U is the reference velocity, I is the characteristic length and v is the kinematic viscosity of the fluid. Velocity vector, pressure and time are denoted with u, p and t, respectively. The proper boundary and initial conditions are imposed on the pressure and velocity values for the solution of the Equations 1 and 2 depending on the flow being internal or external [5].
209
2.2 FEM
formulation
In the present work, a fractional step method [3-5] is employed for the temporal discretization of the governing equations. In this method, the solution is advanced over a full time step At (from n to n+l) according to
,
U i -- U? "[-
~)2 p n+l ~Xj~Xj
{1
u Unl
Ui
Re ~Xj~xj
n
(3)
~-~jJ
1 ~U i = ~ ~ At ~x i
(4)
u n+l * ~ p n+l i --U i - A t ~ ~X i
(5)
In Eqn. (2), the pressure gradient is excluded from the half step velocity calculation. The semi-discrete equations given above are spatialy discretized using the Galerkin Finite Element Method [3-5]. The following steps are taken to advance the solution one-time step, Eqn. (3) is solved explicitly to obtain the intermediate velocity field u*, Knowing the intermediate velocity field, Eqn. (4) is solved iteratively with the element by element-conjugate gradient method and a domain decomposition technique for the parallel solution. New time step velocity values are obtained from Eqn. (5).
i.
ii.
iii.
In all computations, the lumped form of the mass matrix is used.
3. DOMAIN DECOMPOSITION A domain decomposition technique [6] is applied for the efficient parallel solution of the Poisson's Equation for the pressure, Eqn. (4). This method uses a sequence of initializationiteration for a unit problem-finalization procedure. Initialization" Eqn. (4) is solved in each domain ~'~i with a boundary of ~"~i and an interface Sj, with vanishing Neumann boundary condition on the domain interfaces. - AYi
=
fi
in
~z i
I t~
Yi
=
gi
on
~9~
gO
=
It~
~Yi ~n i
=
0
on
Sj
wo
=
g~
9 arbitrarily
chosen
210
Unit Problem: a unit problem is set up from the actual problem of solving the Poisson's equation on a subdivided domain as - A xi
n
=
0
in
~"~i
X.t
=
0
on
b~~ i
0 Xi
=
(--1) i-1W"
on
Sj
I'1 n
Oni Steepest Descent:
aw" : (X~ -X2")sj
g"+' = g"-fl"aw"
(y:- y:)s, /,/n+l
:~l
s.
n __j~nwn
2
~n
--
I1 "+ 11
:11 .11 = J
J Sj
J s~
Convergence criterion:
max(p; +' -/.t~') max(/.t~TM )
<s
k - 1.... number of interface nodes
Finalization: Having obtained the correct Neumann boundary condition for each interface, the original problem is solved for each domain. -AYi = Yi
=
fi
in
gi
on ~"~i
~Y._..2. = (_1)i-111 n+l on
~'~i
Sj
bni In this section, subscripts i and j indicate the domain and the interface respectively and superscript n denotes iteration level. The final solution Yi is the desired pressure.
4. PARALLEL IMPLEMENTATION During parallel implementation, in order to advance the solution a single time step, the momentum equation is solved explicitly, Eqn. (3). At each solution, interface values are exchanged between the processors working with domains having common boundaries. Solving Eqn. (3) gives the intermediate velocity field which is used at the right hand side of Poisson's Equation (4), in obtaining the pressure. The solution of the Poisson equation is obtained with domain decomposition where an iterative solution is also necessary at the interface. Therefore, the computations involving an inner iterative cycle and outer time step advancements have to be performed in a parallel manner on each processor communicating with the neighboring one. The parent-child processes technique is adopted. Child solves the N-S equations over the designated domain while parent handles mainly the domain decomposition iterations. All interfaces are handled together.
211
5. P S E U D O
SECOND
ORDER
VELOCITY
INTERPOLATION
ELEMENTS
Pseudo second order velocity/first order pressure interpolation elements are constructed by dividing the pressure element, over which the pressure is interpolated linearly or bilinearly, into sub-elements, over which the velocity also is interpolated linearly or bilinearly. Combination of these first order velocity interpolations can be considered as pseudo second order interpolation over the original pressure element. Overall, the solution can also be considered to be obtained on two different grids, one for the velocity solution and the other for the pressure solution. In the present work, pseudo-quadratic velocity/linear pressure interpolation elements (pP2P1) are considered. Such an element is obtained by inserting mid nodes to the parent pressure element and interconnecting these to give 8 sub-elements over which the velocity is interpolated linearly. These elements and node numbering are shown in Fig. 1. Derivation of the integral matrix equations is similar to the equal order P1P1 element since the interpolation functions are similar [3-5]. For the half step velocity equation, Eqn. (3), and the full step equation, Eqn. (5), the integrals are evaluated over the velocity elements. The pressure gradient used in the full step equation is taken from the parent element of the sub-element since it is constant over the parent element. However, the first term on the right hand side of the pressure equation, Eqn. (4), denotes the integration of the divergence of the velocity field weighted with the interpolation function for pressure. As the velocity interpolation is defined over the subelement of the pressure element, it is possible to evaluate the integral over the sub elements and to assemble the element vector, to yield the integral value over the pressure element. The integrals, thus, can be evaluated analytically using one point integration. The number of pressure elements for which the stiffness matrix for the FEM is constructed is eight times smaller compared to that of the one constructed using the P1P1 element pair with the same number of velocity elements. As the pressure equation forms the implicit part of the time-stepping algorithm, any reduction on the size of this equation yields significant savings in the number of calculations and the storage requirements for the
4
Vd~ty
::~e~nt: :Nodes
9
I0: ::
:":
.i"
:~!3 7~
9
6
N~(~,r/, ~') = La = 1 - ~ - r / - (
:1:
1 5 7:10: 5: ~: 6 : s .
3
6 3:7
4: 5:
8 9. 10.,4: 7 8 10 ~5
6 71
7 8 9~iO 6 7 8, 9.
Pressure element" ~, r/, ( ~ f~ep
8
5 6i:7:8
Velocity element :~, r/, ( ~ ~
9
N2 (~, ~7,( ) = L2 = ~:
N~(~.,.() =/. =~ N4 (~,~7, ~') = L4 =~"
m e m ~ ~ ~i: 2.~3 4:
2
Figure 1. Linear pressure/pseudo-quadratic velocity interpolation elements (pP2P1).
212
solution of it. Similar savings also apply to the interface computations required for the domain decomposition method. However, since the interface is 2-dimensional, the reduction is four fold.
6. RESULTS AND DISCUSSION
Lid-driven flow in a cubic cavity with a Reynolds number of 1000 is selected as a test case to demonstrate the efficiency and accuracy of the method described above. Shown in Fig. 2 is the 2 and 4 domain partioning of the overall grid in the cavity. Velocity and pressure space are shown separately. The pressure grid consists of 1 lxl lx6 points while the velocity grid has 2 lx 1 lx 11 points. Solutions obtained after 7000 time steps at dimensionless time 30 are shown as pressure iso-surfaces in Fig. 3.
Pressure grid
: ~......... ..i ~x~
i)
Velocity grid
Figure 2. Computational mesh used for the two and four subdomain parallel solution of the lid-driven flow in a cubic cavity (726 node pressure, 4851 node velocity mesh).
:Z
213
Figure 3. Pressure iso-surfaces for the parallel solution of the lid-driven cubic cavity flow at Re= 1000 (t=30) with tetrahedral grid using pP2P 1 elements. The time step used for the computations is At=0.1. Different tolerance values are adopted for domain decomposition and pressure iterations. A value of 10-5 is used for subdomain pressure iterations while a value of 10-2 was sufficient for the domain decomposition solution of interface values. Performace comparisons of 2 and 4 domain solutions are shown in Table 1. It is wothwhile to note that when the number of subdomains is increased to 4, the number of interface iterations necessary to reach the same convergence value does not increase, as is the case in most domain decomposition applications. In addition, a very high value of speed-up (2.68) is achieved, compared to the theoretical value of 2.00.
7. C O N C L U S I O N A finite element method with a first-order accurate time scheme is employed for the parallel numerical analysis of incompressible flow problems on unstructured grids. A semiimplicit formulation which involves the solution of a Poisson's equation for pressure is used. A domain decomposition technique on non-overlapping matching sub-domaing meshing is utilized for the parallel solution. Employement of pseudo-second-order elements for pressure results in a smaller implicit problem, leading to very good speed up due to the reduced size for the pressure solution.
214
For future work, comparison of performance with equal-order interpolation elements, parallel implementation for hexahedral (pQ2Q1) elements, and further study of acceleration techniques for iterative domain decomposition method are in progress. Table 1. Performace comparisons of 2 and 4 domain cubic cavity solutions.
2 Dom. M a s t e r CPU Dom. 1 C P U Dom. 2 C P U Dora. 3 C P U Dora. 4 C P U Dom. 1 Press. Dora. 2 Press. Dora. 3 Press. Dom. 4 Press. Domain Dec.
(min) (min) (min) (rain) (rain) lter. lter. lter. lter. Iter.
!ili~~~ii
iii
.
.
.
.
17 249 224 4636094 4666446 46716 .
4 Dom. 12 91 85 91 86 2624251 2628003 2597881 2593266 31025
......
ii!iilii!i iriiilUiiiiii!~iiii!iii~i~iii iiiiiiliiiiii i i i i ilii!iiii ili18i8 l ill
iii ii iiiiiiiiiiiiiiiiiiiiiiiii iiiiiiiiiiiiiiiiiiiiiii iiiiiiiiiiiiiiiiiiiiiiiiiiii iii i
~~N~~ ii i??ii iiiiiiiii! iiiiiiiiiiiii!iiiii!iiii!iilii iiiiiiiiiii ! i i iiiii~~~~iiiN~ii iiiiiiiiiiiiii! ~i iiiiiii~iii!iiiiiiii }iiiiiiiiii~ii~i ii!iiiiililii}ilili:iiiiiliii~ili)i iN~:iiii~N~N~~ iiiiiii;iii i!ili!ii!!ii !i!iiiil ! ili!iii!ii!iiiiili!il;i iii!ii!f!?!ii!!!iiiiiii !i!iiif!!iiiiii!iiii~!!!i !!ii!li!ii!i!iiiii !!ii!!!i!i REFERENCES
[ 1] A.R. Asian, F.O. Edis and, Ll.Gtilqat,, 'Accurate incompressible N-S solution on cluster of work stations', Parallel CFD '98 May 11-14, 1998, Hsinchu, Taiwan. [2] M. Bercovier and O. Pironneau, 'Error estimates for Finite Element Method solution of the Stokes problem in primitive variables', Numer. Math. Vol. 33, pp 211-224, (1979). [3] F.O. Edis and A.R. Asian, 'Efficient incompressible flow calculations using pQ2Q 1 element', Communication in Numerical Methods in Engineering, Vol 14, pp 161-178, (1998). [4] F.O. Edis and A.R.Aslan, 'Efficient incompressible flow calculations using pseudosecond-order finite elements', In Numerical Methods in Laminar and Turbulent Flow: Proceedings of the 10 th International Conference (Swansea, 1997), Pineridge Press [5] F.O. Edis, 'Efficient finite element computation of incompressible viscous flows using pseudo-second order velocity interpolation', PhD Thesis, Istanbul Technical University, (1998). [6] R. Glowinski and J. Periaux, 'Domain decomposition methods for nonlinear problems in fluid dynamics', Research Report 147, INRIA, France (1982).
Parallel Computational Fluid Dynamics Towards Teraflops, Optimization and Novel Formulations D. Keyes, A. Ecer, J. Periaux, N. Satofuka and P. Fox (Editors) 2000 Elsevier Science B.V. All rights reserved.
215
P a r a l l e l A g g l o m e r a t i o n S t r a t e g i e s for I n d u s t r i a l U n s t r u c t u r e d Solvers D. R. Emerson, Y. F. Hu and M. Ashworth CLRC Daresbury Laboratory, Daresbury, Warrington WA4 4AD, UK An industrial aerospace code has been parallelised for execution on shared memory and distributed memory architectures. The flow solver uses an explicit time integration approach and utilises an agglomeration multigrid strategy to accelerate the convergence. A novel method has been used that efficiently parallelises the multigrid solver and also minimises any intrusion into the flow solver. The parallel code uses standard message passing routines and is demonstrated to be portable. 1. I N T R O D U C T I O N
The use of fluid dynamics codes with unstructured grids continues to grow in popularity. This is particularly true for fluid applications that do not involve strong separation or turbulence or where the viscous/inviscid interaction is weak. In these cases the Navier-Stokes equations can be reduced to the Euler equations. For many practical applications, the flow of air over an aircraft can be modelled using the Euler equations because the Reynolds number is high and the viscous effects are primarily restricted to the boundary layer at the aircraft's surface. The computational grid required to model a complete aircraft can be very complex. Unstructured meshes offer the benefit of being able to generate very complex meshes quickly and efficiently. From an industrial perspective, this is a very important requirement. However, whilst efficient solvers can be developed for unstructured mesh applications, the computational time and storage requirements for solving flows over a complete aircraft remain substantial. It is in this context that parallel processing can play a vital role in enabling much bigger and, increasingly, more complex problems to be tackled. This paper describes the parallelisation of an industrial aerospace code used by British Aerospace to solve inviscid flows around highly complex geometries. An important feature of the flow solver is the use of an agglomeration multigrid strategy to enhance the convergence properties of the explicit Runga-Kutta time integration scheme. The parallelisation of the multigrid scheme required careful consideration of the approach used by British Aerospace and a novel parallelisation strategy has been developed. Portability was a high priority and the code has been run successfully on both shared memory and distributed memory machines.
216
2. G O V E R N I N G E Q U A T I O N S
The equations governing a compressible inviscid fluid are given by
~u
OFi =0
~+ ()t
i=1,2,3
~)xi
(1)
where U is given by -
P pu~
U-
pu:
(2)
pu: _pE _
and the inviscid flux vector F is defined by Pui pu~ui + pS~i Fi = pu:ui + P52i
(3)
Pu3ui + p53i
(pE + p )u, In the foregoing notation, u~ is the fluid velocity in the x i direction, 5~j is t h e Kronecker delta, and p, p and E are the pressure, density and total energy of the fluid, respectively. The equation of state for a perfect gas is then given by p =(7"-l)p(E-O.5uiui)
(4)
The solution is advanced in time by a 5-stage Runga-Kutta explicit timestepping algorithm. An advanced multigrid procedure is employed to accelerate the convergence of the explicit algorithm.
3. A G G L O M E R A T I O N MULTIGRID
The difficulty of obtaining a converged solution for explicit schemes is well known, and results from good damping of high frequency errors, but poor damping of low frequency errors. The initial residual can be reduced quite efficiently up to a certain point, beyond which more and more iterations are required to reduce the residual by a certain amount until a converged solution is obtained. One way to overcome this difficulty is through the use of multigrid. In a multigrid approach, a series of successively coarser meshes is employed. Convergence is improved because the low frequency errors on the finer meshes will appear as high frequency errors on the coarser meshes and can readily be damped by the integration scheme. Multigrid has been demonstrated to be an
217
101 O No agglomeration [] 1 agglomeration level O 2 agglomeration levels 3 agglomeration levels 4 agglomeration levels
100
1 0 "1 n:J -el
(D
10 .2
o,--I
10 .3 r.D 1 0 .4
10 .5 ~
10-~
I
--,--
,
0
,
200
I ,
,
,
,
400 600 Number of Cycles
,
,
800
,
J
1000
Figure 1: Normalised RMS density residual convergence for a wing-body problem effective solution strategy and, for the code under consideration, can reduce the time to solution by an order of magnitude or more. Figure 1 illustrates this very well and shows the convergence history of the normalised RMS density residual for a wing-body configuration using a range of multigrid levels. The use of multigrid on structured meshes is relatively straightforward because the topology is preserved and the coarse mesh data can readily be derived from the fine mesh. A major issue with the use of multigrid on unstructured meshes is the representation of the topology on the coarse meshes and particularly on the coarsest mesh employed. For complex geometries, such as complete aircraft, it is important to capture geometrical features t h a t will affect the final solution. A number of ways have been proposed t h a t attempt to overcome this difficulty. One approach is to generate the coarse mesh first and use this as a basis to generate the finer meshes. This has previously been used with success [1] but the finest mesh will clearly depend upon the initial coarse mesh and the problem of ensuring that important geometric detail is captured remains an issue. Another approach that has been used with some success [2] is the use of non-nested meshes where each mesh level is generated independently. This method clearly has some attractions but introduces the additional problem of each mesh generally having no points in common with the other meshes.
218
Figure 2: A typical unstructured grid (left) and a possible agglomeration of the cell volumes (right). The problem of ensuring the coarse mesh captures any important geometric features is still an issue. Clearly, a number of other variations are possible. The approach taken by British Aerospace has been to use an agglomeration multigrid method. In this approach, the finest mesh is generated first in order accurately to represent the topology under consideration. The cell volumes are then merged or agglomerated together. This approach was first proposed by Lallemand [3] and has been successfully demonstrated by Mavriplis [4]. A simple schematic diagram illustrating the principle behind agglomeration is shown in figure 2. In this example, the fine mesh consists of 14 cell volumes and the coarse mesh contains three cell volumes formed by fusing together volumes on the fine mesh.
4. P A R A L L E L I M P L E M E N T A T I O N Recent progress in the partitioning of unstructured meshes has made it a relatively routine problem. A number of very good packages are now available to perform this task; the package selected for this work was METIS [5]. An important feature of METIS for future problems is its ability to perform the partitioning in parallel. For large meshes this could be very useful. However, for the problems under consideration, the sequential version of METIS was used. A number of factors had to be taken into account during the parallelisation of the suite of routines known as FLITE3D. In reality, there are 4 key stages to the program suite. The first two stages, involving the surface mesh generation and volume mesh generation had no impact upon the final strategy. The section that received the most attention was the pre-processing stage. The final stage involving the flow solver required the introduction of appropriate calls to MPI or PVM. By this means we were able to meet one of British Aerospace's requirements; that of minimal intrusion into the flow solver. A further requirement was for results from the sequential and parallel versions to agree. This seems like an obvious statement but it does affect the method chosen for parallelising the multigrid algorithm. An additional requirement was for the restart file to be a single file and in a format suitable for interrogation by their
219
graphical software. This required a global gather operation that can impact upon the parallel efficiency of the flow solver. In reality, very few authors quote speedup and efficiency figures that incorporate any substantial I/O. As previously stated, the largest amount of work was concerned with the preprocessing stage. This stage performs many important operations, such as reordering grid nodes for better cache use, node colouring for vector architectures, calculation of finite element weights, and the generation of the coarse mesh levels for the agglomeration multigrid. The introduction of the parallel strategy within this framework was straightforward. Moreover, any additional burden on the user was minimised and the only requirement is the specification of the number of processors as a command line argument. If no argument is given, the original sequential code is recovered. For the call to METIS to partition the mesh, it was necessary to write a subroutine that would prepare the mesh data. The fine grid mesh was then partitioned with a call to METIS. From a system perspective, this has implications on the memory utilisation. Figure 3 shows the memory requirements for the sequential code, METIS and the routine used to partition the m u l t i g r i d - which will be described next. From figure 3, it is evident that the sequential code requires only a modest amount of memory but the memory increases by a factor of approximately three for METIS. However, once the memory has been allocated, it remains constant regardless of the number of processors. Once METIS has completed its grid partitioning, the memory is released. A number of alternative strategies are available to parallelise the coarse grid meshes. In principle, a simple approach would be to partition the fine grid with METIS and then generate the agglomerated meshes based upon the fine grid partition. This would certainly help to minimise any inter-processor communications but the behaviour of the parallel version would differ from the sequential version and this approach was not followed. Another approach would be to generate the coarse meshes based upon the fine grid and then partition each mesh independently. This approach has been followed, with some success, by Mavriplis [6]. However, the problem of matching the partitioned domains to minimise the overlap, and thereby the communications, is quite complex and requires very careful design of the algorithm. The approach taken by the present authors is to generate the agglomerated meshes as with the sequential code and partition only the fine mesh. Then, for each processor, the fine grid that resides on that particular processor is used to perform a search for the coarse grid descendants. This method helps to minimise the communications by ensuring that only agglomerated cells that are directly affected by the partition are involved in the inter-processor communication. In situations where coarse nodes are shared between processors, the shared node is replicated to minimise interprocessor communication and on one of the processors an active node is declared to avoid duplicating that node's contribution to the restriction phase. A more detailed description of this approach can be found in [7]. One potential disadvantage of the current approach is that the strategy does not allow any control of the quality of the coarse mesh partition. As the coarse level meshes are
220
70
60
50
9 Sequential code [] METIS x Partition routine
o
40
30
0
,
0
I
,
I
,
100 200 N u m b e r of Processors
300
Figure 3: Memory requirements to partition a wing-body problem derived from the agglomeration of cells on the finest mesh, this was not considered to be a critical aspect. Another important feature of the current approach is the fact t h a t the pre-processor memory is only allocated when needed. As an approximation, the number of grid points allocated per processor by METIS is roughly the same and the memory requirement for the parallelisation therefore remains constant for each partition. As shown in figure 3, the memory required by METIS is independent of the n u m b e r of processors and this trend is also followed by the partitioning routine. It should be noted, however, that there is some difficulty in obtaining an accurate assessment of the memory utilisation within a UNIX environment because, within a given process, any memory released dynamically will not necessarily be de-allocated from the process itself. A more recent analysis of the partitioning routine has indicated t h a t there is considerable scope for reducing the a m o u n t of m e m o r y needed, particularly as the number of processors is increased. However, this would not remove the core memory requirement for METIS to operate. For the problem indicated in figure 3, which contains only 51,737 nodes, the memory requirements are clearly modest but for large grid sizes involving more t h a n 10M grid nodes this becomes a non-trivial issue.
5. R E S U L T S
The parallel version of FLITE3D has been tested thoroughly by British Aerospace on workstation clusters and shared memory platforms.
221
40
|
|
O ~ No agglomeration O-O 3 agglomeration levels Ideal 30
20 00
10
O
~
0
I
10
~
I
~
I
20 30 Number of Processors
40
Figure 4" Comparison of speed-up with and without agglomeration We present results for the wing-body case on the Cray T3E/1200E. This is a small test case and it is not expected to scale to a large number of processors, but it does provide an insight into a number of issues. This problem involves 51737 nodes and 302079 tetrahedra. Even this problem, however, is too large to fit on to a single processor and the results given in figure 4 assume an ideal speed-up of 2 on 2 processors. The results clearly show the impact of the additional communications involved in the multigrid algorithm. The results given for 3 levels of agglomeration are fully converged after just 228 iterations and those without agglomeration were stopped after 228 iterations. The reason for this was to match the I/O requirements, where a full restart file was created after 100 iterations and the final output file was created when the program completed. If the I/O was not included, the speed-up would be improved but this would be an unrealistic use of the code. It is felt that the performance on this small problem is acceptable and the strategy adopted in the parallelisation has been shown to work well. In a private communication, British Aerospace reported a speed-up of 100 on their 128 processor SGI Origin. 6. C O N C L U D I N G R E M A R K S
An industrial aerospace code that is used extensively by British Aerospace for solving inviscid flows over complex geometries has been successfully parallelised. The convergence of the explicit scheme was accelerated by an agglomeration multigrid technique. A novel parallelisation strategy has been proposed for the
222
multigrid algorithm and has been demonstrated to be efficient and robust. The strategy proposed took careful consideration of the overall approach employed by British Aerospace. The final implementation is transparent to the user and requires only one command line argument to specify the number of processors. The parallel version uses standard message passing libraries, such as PVM and MPI, and has been shown to be portable across a wide range of computing platforms, including clusters of workstations, shared memory, and distributed memory architectures. The work highlighted a number of areas that need careful consideration, particularly for large-scale applications. Parallel I/O is has been addressed in MPI-2 but this feature was not available to the authors. The graphical analysis of large data sets is also a major problem. The majority of the available graphical codes are sequential and this problem requires considerable attention if applications involving 10M+ nodes are going to be tackled. The implications for partitioning such problems are also obvious but parallel versions of grid partitioning packages are available. However, the problem of generating the initial mesh remains a considerable problem and the ability to generate a mesh in parallel is very much in its early stages. Finally, the storage and management of data being produced by these types of simulation is becoming an increasingly critical issue.
REFERENCES
1. Connell, S. D. and Holmes, D. G., "Three-Dimensional Unstructured Adaptive Multigrid Scheme for the Euler Equations," A/AA Journal, Vol. 32, No. 8, 1994, pp. 1626-1632. 2. Peraire, J., Peiro, J. and Morgan, K., "A 3D Finite Element Multigrid Solver for the Euler Equations," International Journal for Numerical Methods in Engineering, Vol. 36, 1993, pp. 1029-1044. 3. M. H. Lallemand, H. Steve, and A. Dervieux, "Unstructured Multigridding by Volume Agglomeration: Current Status", Computers Fluids, 21(3), 397-433, 1992. 4. Mavriplis, D. J., "Three-Dimensional Unstructured Multigrid for the Euler Equations," A/AA Journal, Vol. 30, No. 7, 1992, pp. 1752-1761. 5. G. Karypis and V. Kumar, "A Fast and High Quality Multilevel Scheme for Partitioning Irregular Graphs", Tech. Rep. 95-035, Dept. Computer Science, University of Minnesota, Minneapolis, MN 55455, 1995. 6. Mavriplis, D. J., "Three-Dimensional High-Lift Analysis Using a Parallel Unstructured Multigrid Solver," NASA/CR-1998-207682, ICASE Report 98-20. 7. Hu, Y. F., Emerson, D. R., Ashworth, M., Maguire, K. C. F., Blake, R. J. and Woods, P., "Parallelising FLITE3D - A Multigrid Finite Element Euler Solver", submitted to Int. J. Num. Meth. Fluids.
Parallel Computational Fluid Dynamics Towards Teraflops, Optimization and Novel Formulations D. Keyes, A. Ecer, J. Periaux, N. Satofuka and P. Fox (Editors) 2000 Elsevier Science B.V. All rights reserved.
223
Parallel Computation of Turbulent Flows around Complex Geometries on Hybrid Grids with the DLR-TAU Code M. Galle ~, T. Gerhold b and J. Evans b ~NEC European HPC Technology Center, Hefibriihlstr. 21 B, 70565 Stuttgart, Germany bDeutsches Zentrum fiir Luff- und Raumfahrt, Institut fiir Str6mungsmechanik, Bunsenstr. 10, 37073 G6ttingen, Germany
A computational scheme for the simulation of viscous flows around complex geometries on hybrid grids is described. The scheme employs the dual grid method. Computational efficiency is increased by the utilization of a multigrid algorithm and a memory optimized data structure adapted to the architecture of modern vector computers, the flow calculation are parallelized on the basis of domain decomposition. In order to assess the accuracy of solutions obtained on hybrid grids, comparisons between the hybrid code and a validated structured scheme are presented.
1. I N T R O D U C T I O N The use of hybrid grids in Computational Fluid Dynamics for the simulation of viscous flows around complex geometries has become increasingly important in recent years. Though the advantages of this approach appear obvious, there are still a few aspects in which hybrid schemes seem to have drawbacks compared to structured codes. The first point in this respect is related to the accuracy that can be achieved in resolving viscous dominated flow regions such as boundary layers. A second point is the computational efficiency in terms of memory requirements, convergence and floating point operations per time unit of hybrid unstructured schemes. In the present paper these points are assessed by comparing recent results with results obtained with a structured code and by detailed examinations of the computational performance.
2. C O M P U T A T I O N A L 2.1. Hybrid
METHOD
Grid Approach
The basic idea behind the Hybrid Grid Approach [11 is to combine the capability of regular grids to resolve viscous dominated flows in a very efficient and accurate way with
224
Figure 1: Hybrid Grid Approach
the flexibility of tetrahedral grids. As depicted in Figure 1, regular grids are used in the vicinity of solid walls, where viscous effects can be expected. The rest of the flow field can be discretized with tetrahedral cells. The regular part of the grid typically consists of prismatic or hexahedral cells, depending on the surface discretiz~tion. For the results presented herein, the surfaces of the considered geometries are triangulated. In this case, the extension of the surface grid in normal direction results in prismatic cells in the regular part [2].
2.2 D u a l M e s h M e t h o d In the presented scheme the Dual Mesh Method [3] is employed. The primary (hybrid) grid is described by physical node coordinates and their respective connections to tetrahedral, prismatic, pyramidal or hexahedral cells. Before the flow calculation starts, a secondary grid consisting of auxiliary cells or control volumes has to be derived from the primary grid. One control volume is assigned to each node of the primary grid. The surfaces of these control volumes are composed of faces. For each edge of the primary grid there is one related face of the secondary grid that is shared by the control volumes surrounding the two end nodes of the respective edge. As shown in Figure 2, the shape of the face is determined by the metrics of the primary cells adjacent to the edge. The faces of all edges attached to one specific node form the surface of the control volume for this node. The control volumes cover the entire computational domain without gaps and overlaps. Hence, conservation is ensured in the whole flow field, even at interfaces between two cell types or between refined and unrefined regions. The governing equations are integrated on this secondary mesh. The dual mesh technique enables the use of an edge based data structure. The metrics of the secondary grid are described by normal vectors, that represent the size and orientation Figure 2" Face of the dual mesh related of the faces, the geometric coordinates of the to an edge connecting P1 and P2 grid nodes and the volumes of the auxiliary cells. The connectivities of the grid are given by the two nodes on both sides of each face. No relations between nodes and primary cells are needed after the control volumes have been determined. The special attraction of this approach lies in the fact that it works with any kind dual mesh, regardless of the shape of the primary grid cells. Both inviscid and viscous fluxes are evaluated edgewise. Hence, in order to update the flow condition for one node, the fluxes along the edges associated with that node have to be summed up. The sum of the particular fluxes represent the entire flux over the control volume surface, from which the change in the
225
flow condition can be computed. The inviscid fluxes along the edges are computed by employing either an accurate upwind or a more robust central scheme. Different one- and two-equation turbulence models can be applied; the results presented here are obtained using the Spalart-Allmaras one-equation model [4]. The integration in time employs an explicit Runge-Kutta time stepping algorithm, which is easy to implement and very efficient in terms of memory requirements. Therefore, it is a commonly used technique for stationary flows. Furthermore, as a part of a dual time stepping approach, it can be applied also to transient flows.
2.3. C o n v e r g e n c e A c c e l e r a t i o n For the computation of high Reynolds number flows, acceleration of the solution convergence is essential. There are several approaches to accelerate the convergence to steady state for an explicit time stepping scheme. One strategy implemented in the scheme is the smoothing of residuals; this can be done either implicitly or explicitly. Another, very powerful acceleration technique is the multigrid method. In literature diverse strategies for generating coarser grids can found. For finite volume schemes that employ the dual grid technique, a multigrid formulation based on the agglomeration of control volumes [5] is a very elegant way. This approach, which is illustrated in Fig~........... ....................... ::::::::::::::::::::::::::::/2, : : :...]...,.~x.....~ j ii....................... ii!~............. ..... ure 3, is carried out in the hybrid code. The control volumes for the coarser grids are obtained by fusing together the fine grid control volumes. In this way, Figure 3: Agglomeration of Control a set of coarser grids is achieved. Each coarse grid Volumes can be described as the fine grid by the face normal vectors, the geometric coordinates of the grid nodes and the face to node connectivity. Therefore, the same approach as on the finest grid can be used to compute the coarse grid solution. The transfer operators needed for the communication between the different grids are obtained directly during the agglomeration process. As the secondary grid instead of the primary mesh is the basis for the coarsening, there are no difficulties caused by the use of different cell types. ii~:::~:'~:::~:::~::":'
9
' x ~ '/ 84" " " " 9" 9" 849 849 9149.... " 9 "i......... " "" 9149 "i
j;S~Y~
/
"\...S
~,
- .. : ...
:.
2.4. P a r a l l e l i z a t i o n / ~1/ 7~.. i \ ............ ' .... ........-~!'~.....-',~5.................................. :8:...... The parallelization of the solution algorithm works on a domain decomposition of the computational grid. Due to the explicit time stepping scheme, each domain can be treated as a complete grid. During the flux integration, data has to be exchanged between the different domains several times. In order to enable the correct communication, there Figure 4: Domain decomposition is one layer of ghost nodes located at each interface between two neighboring domains. The situation is depicted in Figure 4. The solid drawn nodes are regular nodes of the left domain as they are located on the left side of the cut. Each node of the right domain that has .....\
.~iii~:
/..
,%
::i~i:
x
'~,~
O
\
:::
... . . .
226 a connection to one of the regular nodes of the left domain is a ghost node of the left domain (hollow drawn). The cut edges, which are depicted altering black and grey, are part of both domains. An exchange of updated node data is necessary only before an operation working on the edges. The same approach can also be used on the coarser grids. The data exchange employs MPI. Since the relation of ghost nodes to regular nodes is very small if large grids are decomposed to a reasonable number of domains, the overhead introduced by parallelization remains small. Use of massive parallel systems for small grids worsens the situation and additional effects, such as short vector lengths, may additionally decrease the performance. However, for small grids the aspect of parallel acceleration is less important.
2.5 Adaptation The adaptation module is based on the bisection of edges [6] in the primary grid. New nodes can only be placed in the middle of two existing nodes. As the preprocessing requires a confirming primary grid (without hanging nodes), all cells adjacent to the bisected edge have to be divided. In order to indicate edges to be refined, second differences of flow conditions are computed for the grid edges. The adaptation is controlled by the definition of the number of grid nodes to be generated by the adaptation step. The indicator threshold is adapted to this requirement. After each adaptation step, the preprocessing and the domain decomposition have to be executed again.
3. RESULTS
The first test case considered here is the viscous flow over a RAg 2822 airfoil with case 9 flow conditions. In Figure 5 the computational grid is depicted. The initial hybrid grid is
Figure 5: Hybrid Grid around RAE 2822 Airfoil composed of a prismatic sublayer and tetrahedral cells in the outer region. The surface is discretized with 4074 points and 7516 triangles. The volume grid consist of 24 layers of prismatic cells and about 148,000 points in total. In Figure 6 the computed cp-distribution is compared to experimental data [7] and to results obtained with the
227 structured FLOWer [8] code. In this code the SST version of the key-turbulence was used. In Figure 7 a comparison of the skin friction distribution is shown. the agreement of the results between the two numeric schemes is fairly good, the location differs slightly from the experiment. Additional comparisons can be found
model While shock in [9].
-1.5 -
C313 Cp
~
[
]
[] O. 01
-0.5
Cf ~
_ _
0.005
....... 1~
f
LOW r
[] Experiment r
,
0
B
1
I
i
0.25
l
,
i
I
~
i
i
0.5
I
I
i
0.75
Figure 6: cp-Distribution
i
t
i
I
l
() 0
0.25
0.5
0.75
Figure 7: ci-Distribution
The turbulent flow around the DLR-F4 configuration is the next case to be considered. Figure 8 illustrates the % distribution on the surface of the geometry and the computational grid on the symmetry plane. The prismatic region contains 25 layers of prismatic cells, and the surface triangulation consists of 40,000 points and 80,000 triangles. The total number of grid points is 1,400,000. In Figure 9 the %- and the c/-distribution in diverse chord wise cuts is compared to experimental data and results obtained with the structured code. In the simulations the flow field was assumed to be fully turbulent. '. 2 ~',,
J, i J',., f::x ' ',,
I~AK
:~.,.:,51i~i~ili i i~ .-;.".)!Ni!'~;ii',iii!i~,iiii~i':i::::'i,i'G., -:. i:~ Ni#~!iiii~i'~ii!!:',i',!i~i,,:ii',~:.:: 9v~"~i~,i~,ii!!!!i~i',i',!',i'~i:i'ii:.:
I/-4/4 "
Figure 8 Hybrid Grid and cp distribution for turbulent flow around DLR-F4 configuration
228
-1.0
.....
•0
/, ~'~ 0.0
0.0 I/~ t
1.Oo. 1
0.1
0844
v,
!i ~ 0.01
FLOWer
TAU 0.3
0.5 X/C
0.7
0.9
1.0 1
01
0
,!1! ~Oo~o:, o:~ OiSco.~o:o
7
2 0.008
0.008 1
0.004
0.004
i
0.008
-~'-- ~
f ~i ~0004
0.000 I -i.1'
0.012
0.000
0:1
0:3' 0:5' & 7 ' x/c
0:9'
I!i
-0.1' 011 ' 013
015 x/c
017 ' 019 ' 1
Figure 9" Distribution of pressure and skin friction coefficient along chord wise cuts The last test case to be presented is the turbulent flow around the DLR-F6 configuration, consisting of wing, body, pylon and through flow nacelle. This first grid contains about
Figure 10- cp-distribution and initial Grid for DLR-F6 Configuration 2 Million grid nodes. This grid is adapted twice, with the number of grid points increasing by 30 to 40 percent with each adaptation step, such that the final grid consists of
229
3.6 Million grid nodes. In Figure 10 the cv-distribution on the surface after a computation on the finest grid and the grid on the symmetry plane are presented. The initial grid was l0~
Figure 11: Convergence of residual and lift/drag coefficients
Figure 12: cp- and cf-distribution at η = 0.238 (FLOWer, TAU initial, TAU adapted)
The initial grid was made available by CentaurSoft (Centaur™, copyright by CentaurSoft). Figure 11 depicts the overall convergence, including the two adaptation steps, and the development of the lift and drag coefficients. Figure 12 shows the cp- and cf-distribution at 23.8% of the span width in comparison to experimental data and results obtained with the structured code. While the agreement in the cp distribution is fairly good, some differences in the skin friction can be observed. These are caused mainly by a different transition management and the use of different turbulence models. Figure 13 shows the scaling of the flow calculation on one, two, four, eight and twelve processors for 100 multigrid iterations on the initial grid, including parallel input. As can be seen from the figure, the acceleration factor for the turn-around time is more than 11 for the 12 CPU case.

Figure 13: Performance for 100 multigrid iterations
4. CONCLUSIONS

As shown in this paper, hybrid schemes have become comparable to structured methods in terms of accuracy and computational efficiency. Due to the ease of grid generation, hybrid schemes can be considered a good alternative for the computation of viscous flows around geometries of increased complexity.
ACKNOWLEDGMENTS
The first author contributed to this work while employed at the DLR Institute of Design Aerodynamics. Major parts of the work were done within the framework of the MEGAFLOW project [10].
REFERENCES
[1] M. Galle. Unstructured viscous flow solution using adaptive hybrid grids. In Workshop on Adaptive Grid Methods. ICASE/LaRC, November 7-9, 1994.
[2] T. Gerhold, O. Friedrich, J. Evans, and M. Galle. Calculation of complex three-dimensional configurations employing the DLR-T-code. AIAA-97-0167, 1997.
[3] T. J. Barth and D. C. Jesperson. The design and application of upwind schemes on unstructured meshes. AIAA-89-0366, 1989.
[4] P. R. Spalart and S. R. Allmaras. A one-equation turbulence model for aerodynamic flows. AIAA-92-0439, 1992.
[5] M. H. Lallemand, H. Steve, and A. Dervieux. Unstructured multigridding by volume agglomeration: current status. Computers and Fluids, 21:397-433, 1992.
[6] T. Gerhold and J. Evans. Calculation of complex three-dimensional configurations employing the DLR TAU-code. DGLR Fach-Symposium, STAB, November 1998.
[7] P. H. Cook, M. A. McDonald, and M. C. P. Firmin. Aerofoil RAE 2822 - pressure distributions and boundary layer and wake measurements. In Experimental Data Base for Computer Program Assessment. AGARD AR-138, 1979.
[8] O. Brodersen, E. Monsen, A. Ronzheimer, R. Rudnik, and C.-C. Rossow. Numerische Berechnung aerodynamischer Beiwerte für die DLR-F6 Konfiguration mit MEGAFLOW. DGLR Fach-Symposium, STAB, November 1998.
[9] D. Schwamborn, T. Gerhold, and V. Hannemann. On the validation of the DLR TAU-code. DGLR Fach-Symposium, STAB, November 10-12, 1998.
[10] N. Kroll, C.-C. Rossow, K. Becker, and F. Thiele. MEGAFLOW - A numerical flow simulation system. Int. Council of the Aeronautical Sciences, Melbourne, 1998.
Parallel Implementation of a Commonly Used Internal Combustion Engine Code

Aytekin Gel and Ismail Çelik
Mechanical & Aerospace Engineering Dept., West Virginia University, Morgantown, WV 26506
Abstract
KIVA-3, a widely used computational fluid dynamics code for internal combustion engine simulations, was re-structured for parallel execution via one-dimensional domain decomposition. Communication between processors at the subdomain boundaries is carried out with the MPI message-passing library, which enabled significant flexibility in porting the distributed-memory implementation of KIVA-3 to various architectures ranging from the SGI Origin 2000 to the Cray T3E and, recently, a Beowulf cluster. Due to the complex features incorporated into KIVA-3, the parallelization was done in stages. This paper presents results from the first stage, where the speedup and parallel efficiency of the parallel version of KIVA-3 are elucidated.

1 Introduction
In spite of the vast developments in computer technology within the last decade, many grand challenge problems of interest are still beyond the numerical simulation capability of the most state-of-the-art hardware. Fluid dynamics problems involving the prediction of turbulent flows represent one such example. The capability to resolve the wide spectrum of turbulence scales can play a crucial role in the numerical simulation of complex three-dimensional turbulent flows, particularly in internal combustion engines. Limitations on the grid resolution, the accuracy of the numerical scheme, and the turbulence model used are important factors in acquiring this capability. Large-eddy simulation (LES) is a viable alternative to direct numerical simulation (DNS) of turbulent flows because a certain portion of the detailed small-scale information resolved in DNS is suppressed in the LES technique in order to make the computational problem more tractable. However, even with LES, today's challenging problems require accurate schemes with adequately fine grid resolution. In addition, a faster turn-around time to approximate the instantaneous flow behavior is still important. As part of a research effort towards the large-eddy simulation of diesel combustion engines at West Virginia University, the KIVA-3 code originally developed
by Los Alamos National Laboratory [1] was re-structured for parallel execution via message-passing on parallel computers, in particular on distributed-memory platforms. The one-dimensional domain decomposition approach initially proposed by Yaşar [3] was successfully implemented in the KIVA-3 code with substantial improvements (Gel [4]).

KIVA-3 is a general purpose computational fluid dynamics (CFD) code based on the Arbitrary Lagrangian-Eulerian (ALE) method with enhanced features for internal combustion engine applications. The method is typically implemented in three phases. The first phase is an explicit Lagrangian update of the equations of motion. The second phase is an optional implicit phase that allows sound waves to move many computational cells per time step if the material velocities are smaller than the fluid sound speed. The third and final phase is the remap phase, where the solution from the end of phase two is mapped back onto an Eulerian grid if the Eulerian approach is selected. Time advancement is similar to many codes that utilize the LES methodology, where the convective terms are advanced explicitly and the diffusion terms can be advanced explicitly, implicitly, or semi-implicitly. The degree to which the diffusion terms are implicitly discretized is based on a combination of stability and efficiency considerations. The global timestep is divided into sub-time steps (referred to as sub-cycles) based on a Courant condition to advance the convective terms, as sketched below.
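The sub-cycling of the convective terms mentioned above can be pictured as follows; this is a generic sketch of a Courant-based sub-cycle count, with illustrative variable names rather than KIVA-3's actual data structures.

#include <math.h>

/* Split the global timestep into convective sub-cycles so that each
 * sub-step respects a prescribed Courant (CFL) limit.                     */
int num_subcycles(double dt_global, double cfl_max,
                  const double *vel_mag, const double *cell_size, int ncells)
{
    double max_rate = 0.0;                   /* max |u|/dx over all cells  */
    for (int i = 0; i < ncells; i++) {
        double r = vel_mag[i] / cell_size[i];
        if (r > max_rate) max_rate = r;
    }
    int nsub = (int)ceil(dt_global * max_rate / cfl_max);
    return nsub > 0 ? nsub : 1;              /* at least one sub-cycle     */
}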
2 Distributed-Memory Implementation of KIVA-3
The KIVA family of codes has been primarily developed on Cray vector supercomputers, with explicit directives to take advantage of the vector processing units without stalling the pipelines. However, the gains from vector supercomputers are diminishing due to the significant improvements in processor chip design and the introduction of efficient parallel processing platforms. As a result of these developments, the definition of "high performance computing" has evolved and changed significantly. These changes have had a significant impact on the problem complexity and size that can be tackled. Since the release of its first version in 1985, KIVA has also been evolving and improving in parallel with all these new developments in computational resources and computational fluid dynamics techniques. For example, the block-structured grid capability in KIVA-3 provides significant advantages over KIVA-II, especially for complex internal combustion engine geometries. Another example is the indirect addressing introduced in KIVA-3, which eliminates the obligation to go through the storage array elements in a particular sequence. In spite of these improvements in the methodology, there are certain bottlenecks which are quite difficult to overcome. For example, the ghost cells surrounding the grid still impose severe restrictions on the maximum grid resolution that can be achieved on single processor systems. Considering today's challenging problems, which require adequately fine grid resolution, the most economical way to achieve progress towards this goal with KIVA-3 is to acquire the capability to run the code on parallel systems, preferably on distributed-memory architectures.
A shared-memory parallel implementation of KIVA-3, particularly optimized for Silicon Graphics symmetric multiprocessor systems, has been available for some time. However, the speedup characteristics of this version are not favorable (Gel [4]). Due to the importance of scalability and parallel efficiency in parallel processing, a domain decomposition approach with explicit message-passing was investigated by Yaşar [2]. In the present study, a one-dimensional domain decomposition based on the idea of overlapping the left and right faces of adjacent subdomains along the x-direction was implemented in KIVA-3. This approach was initially proposed by Yaşar [3], who also developed a preliminary KIVA-3 implementation on the Intel Paragon platform with the NX message-passing libraries. Although the fundamental ideas of the current implementation are based on the previous work by Yaşar [3], the current work essentially originated from scratch by building on top of the original source code of KIVA-3. Furthermore, the present work uses the MPI message-passing library to provide the flexibility to port the code to any distributed-memory or shared-memory platform. In fact, the parallel version of KIVA-3 was ported to numerous platforms ranging from the SGI Origin 2000 to the Cray T3E and, recently, to a DEC Alpha processor based Beowulf cluster that was built locally and runs under Linux. Substantial improvements were gained by replacing the modified sort operation in the study by Yaşar with a pointer array that keeps track of the cells and vertices to be communicated in each subdomain, which makes the current version unique in this respect.

The current approach is based on the idea of dividing the whole computational domain into "subdomains". Each processor can then be assigned a subdomain to work with. Processors of adjacent subdomains communicate with each other only for the data near the subdomain boundaries, through packets of messages sent via the MPI library (a minimal sketch is given below). For this purpose, the memory locations for the ghost cells and boundary vertices are used to store the incoming information from the adjacent subdomain along the x- (or radial) direction. Note that, in the original version of KIVA-3, the memory allocated for ghost cells and boundary vertices is not used unless an inflow or outflow boundary condition is to be imposed (Gel [4]). Due to the complex physics modeling features incorporated into the KIVA-3 code, the overall parallelization task was subdivided into three stages. In the first stage, all of the essential features of KIVA-3 excluding piston movement, chemical reactions and spray dynamics were parallelized. Results of the first stage are presented in this study, with calculated speedup for fixed grid benchmark problems executed on SGI Origin 2000 systems.
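The one-dimensional exchange at the overlapping faces can be sketched as follows, assuming a slab of ghost planes on either side of the interior planes in the x-direction. The storage layout and names are illustrative and do not reproduce KIVA-3's indirect addressing.

#include <mpi.h>

/* Exchange the first and last interior x-planes with the left and right
 * neighbours; their planes are received into the local ghost planes.
 * `left`/`right` may be MPI_PROC_NULL at the ends of the processor line.  */
void exchange_x_planes(double *f, int nx, int plane_size,
                       int left, int right, MPI_Comm comm)
{
    /* f holds nx+2 planes of plane_size values: [ghost|interior...|ghost] */
    double *first  = f + plane_size;               /* first interior plane */
    double *last   = f + nx * plane_size;          /* last interior plane  */
    double *gleft  = f;                            /* left ghost plane     */
    double *gright = f + (nx + 1) * plane_size;    /* right ghost plane    */

    MPI_Sendrecv(first, plane_size, MPI_DOUBLE, left,  0,
                 gright, plane_size, MPI_DOUBLE, right, 0,
                 comm, MPI_STATUS_IGNORE);
    MPI_Sendrecv(last,  plane_size, MPI_DOUBLE, right, 1,
                 gleft,  plane_size, MPI_DOUBLE, left,  1,
                 comm, MPI_STATUS_IGNORE);
}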
3 Benchmark Results for Speedup and Parallel Efficiency
Several problems were selected to perform the benchmark runs for speedup and parallel efficiency after the parallel version results were validated with the
original KIVA-3 results for the same problems. The runs performed for benchmarking purposes were conducted for a pre-determined number of timesteps, and the dependency of the speedup trends on the number of timesteps was also investigated. It was determined that beyond 100 timesteps no significant difference in the speedup trends was observed. Due to the significant overhead of the setup phase in single processor runs, speedup and parallel efficiency figures were calculated both including and excluding the time elapsed in the setup phase. Such a distinction was necessary due to the superlinear speedup behavior observed when the setup phase time was included.

The first problem is the three-dimensional laminar channel flow at Re = 1,000, which was tested on up to 48 processors with several grid resolutions ranging from 250,000 vertices to 1,500,000 vertices, referred to as the 250K and 1500K grids, respectively. This problem was particularly selected for debugging purposes during the development of the parallel version and utilizes nearly 80% of the subroutines, which was expected to give an idea about the overall performance. The results show that under favorable conditions, with a simple problem like laminar channel flow, a speedup factor of 38 could be achieved with 48 processors for the 1500K grid case (see Figure 1), which corresponds to a parallel efficiency of 0.79. Note that the ratio of computational volume to communicated subdomain face area is an important factor in determining the optimal number of processors for a given grid size.

The second problem was the turbulent round jet problem, where a jet with a velocity of 1500 cm/s and a diameter of 10 cm enters a three-dimensional rectangular domain through the left wall. This problem is a simplified version of the free swirling annular jet case (i.e., the annular jet replaced with a round jet without any swirl) studied by Smith et al. [5] on a single processor DEC Alpha machine with KIVA-3 with up to 500,000 vertices. Benchmark tests on up to 32 processors with a maximum grid resolution of 4,370,000 vertices were performed with KIVA-3/MPI (the acronym for the present parallel version of KIVA-3). A Smagorinsky type subgrid-scale (SGS) model was employed during these simulations. Also, a substantially improved convection scheme combination was used by introducing a third option into KIVA-3, which selects a convection scheme based on central differencing for the convective terms in the momentum equations and Quasi-Second-order Upwind (QSOU) for the density and energy equations (Gel et al. [6]). Up to 3% grid stretching was employed in the radial plane (y-z) to capture the core flow and maintain second order spatial accuracy as closely as possible. Table 1 illustrates the speedup and efficiency figures for the grid configuration of 1,024,000 cells. Note that the "complete run" columns indicate that the time elapsed in the setup phase was included.

The third problem was the actual free swirling annular jet case (i.e., the case studied by Smith et al. [5] without the simplifications mentioned above).
Table 1: Speedup and efficiency for the turbulent round jet problem with 160 x 80 x 80 cells

        Wall clock (sec)            Speedup                  Efficiency
# PEs   Complete    Excl. setup     Complete   Excl. setup   Complete   Excl. setup
          run         phase           run        setup         run        setup
  1      97,063       68,675          1.00        1.00          1.00       1.00
  2      31,616       31,466          3.07        2.18          1.54       1.09
  8       8,766        8,729         11.07        7.87          1.38       0.98
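The speedup and efficiency entries in Tables 1 and 2 follow directly from the wall-clock times. The short check below assumes the three rows of Table 1 correspond to 1, 2 and 8 processors, which is consistent with the reported ratios, and reproduces the superlinear "complete run" efficiencies caused by the setup-phase overhead.

#include <stdio.h>

int main(void)
{
    /* "Complete run" wall-clock times of Table 1 (assumed PE counts). */
    int    pes[]  = {1, 2, 8};
    double time[] = {97063.0, 31616.0, 8766.0};

    for (int i = 0; i < 3; i++) {
        double speedup    = time[0] / time[i];
        double efficiency = speedup / pes[i];
        printf("%2d PEs: speedup %.2f, efficiency %.2f\n",
               pes[i], speedup, efficiency);
    }
    return 0;
}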
An annular jet with an inner and outer radius of 2.7 cm and 5.3 cm, respectively, enters a three-dimensional rectangular domain from the left wall with an initial swirl velocity profile. Initial conditions were based on the measurements of Holzapfel [7]. Table 2 illustrates the speedup and parallel efficiency achieved for the 80 x 80 x 80 grid configuration on up to 16 processors. Note that the numbers reported here are based on timings that exclude the setup phase.

Table 2: Speedup and efficiency for the 80 x 80 x 80 (500K) grid after 500 timesteps

# PEs   Speedup   Efficiency
  1       1.0        1.00
  2       1.7        0.86
  8       6.6        0.82
 16      11.4        0.71
Several preliminary large-scale runs with over one million vertices were performed for the second and third cases at different grid resolutions. Table 3 shows the grid layout, the total number of vertices, the total number of processors (PEs) employed, and the average CPU time required per timestep of the simulation for the second benchmark problem (i.e., the turbulent round jet).

Table 3: Preliminary production simulation details for the turbulent round jet problem

Case   Grid layout       Tot. vertices   # PEs   avg. CPU sec/timestep
A1     160 x 80 x 80     1,089,288        16        32
A2     160 x 80 x 80     1,089,288        32        17
       208 x 100 x 100   2,184,840        16        62
       288 x 122 x 122   4,370,000        32       130
Single processor KIVA-3 runs at these grid resolutions, which would be required for the calculation of the speedup and parallel efficiency for these cases, were not performed due to very long turn-around times. However, as seen from Table 3 for Cases A1 and A2, when the number of processors employed was doubled for the same size problem, the average CPU time per timestep was nearly halved, which demonstrates a speedup close to linear. KIVA-3/MPI was also used to simulate the free swirling annular jet problem on up to 20 processors with a maximum grid resolution of 120 x 100 x 100 (1,200,000 cells). An average of 70 sec of CPU time per timestep was observed during the simulation. Although this simulation is not yet completed, the preliminary results indicate legitimate solutions. In particular, the velocity contour plots from the preliminary results (at t = 0.2 sec) indicate the breakup of the initial symmetry of the jet (Figures 2 & 3).
4 Conclusions

The KIVA-3 code has been parallelized using domain decomposition and MPI libraries. Due to the complexities involved in the original code, the parallelization has been subdivided into three stages. The first-stage results, which exclude moving boundaries, chemical reactions and fuel spray, are presented. The results indicate that for the selected test cases a speedup close to linear can be achieved on up to 48 processors, even with grids larger than one million vertices. Also, a parallel efficiency of 70-80% is maintained. Although the benchmark runs were performed only on the Origin 2000 and not all of the features of KIVA-3 are exercised, these current speedup results are promising. Furthermore, the grid resolutions that can be achieved with KIVA-3 have been significantly improved. The implementation of the features excluded in the first phase is in progress, and benchmark runs that incorporate these features will be conducted in the near future. Also, a similar set of benchmarks for speedup and parallel efficiency on other hardware platforms is in progress (in particular on a locally built, 11-node, DEC Alpha processor based Linux cluster).

Acknowledgement: This research has been conducted under the sponsorship of the U.S. Department of Defense, Army Research Office, through the EPSCoR Program (Grant No: DAAH04-96-1-0196). Time on the Origin 2000 and Cray T3E was provided by the DoD Major Shared Resource Centers at the U.S. Army Corps of Engineers Waterways Experiment Station (CEWES MSRC) and the Naval Oceanographic Office (NAVOCEANO MSRC). Time on the Pittsburgh Supercomputing Center's Cray T3E during the initial development phase of this research is also acknowledged.
References

[1] Amsden, A. A., KIVA-3: A KIVA Program with Block-Structured Mesh for Complex Geometries, Los Alamos National Laboratory, Technical Report LA-12503-MS, (Los Alamos, NM, 1993).
[2] Yaşar, O., Zacharia, T., Amsden, A. A., Baumgardner, J. R. and Aggarwal, R., Implementation of KIVA-3 on Distributed-Memory MIMD Computers, in: A. Tentner, ed., High Performance Computing '95, The Society for Computer Simulation (1995) 70-75.
[3] Yaşar, O., A Scalable Model for Complex Flows, Int. J. Computers & Mathematics with Applications, 35 (1998) 117-128.
[4] Gel, A., Distributed-Memory Implementation of KIVA-3 with Refinements for Large Eddy Simulation Applications, Ph.D. Dissertation, Mechanical & Aerospace Engineering Dept., West Virginia University, (1999).
[5] Smith, J., Çelik, I. and Yavuz, I., Investigation of the LES Capabilities of An Arbitrary Lagrangian-Eulerian (ALE) Method, AIAA-99-O~Z1 (1999) 1-11.
[6] Gel, A., Smith, J. and Çelik, I., Assessment of Spatial Accuracy and Computational Performance of KIVA-3, Proceedings of the Fall Technical Conference of the ASME ICE Div., FEDSM98-ICE-136, (Irwin, PA, 1998) 75-82.
[7] Holzapfel, F., Zur Turbulenzstruktur freier und eingeschlossener Drehströmungen, Ph.D. Dissertation, University of Karlsruhe, Germany (1996).
Fig. 1. Speedup plot for the 1500K case based on excluding setup phase time (speedup vs. number of processors).
Fig. 2. U-velocity contours for the free swirling annular jet case with 1,200,000 cells (instantaneous U-velocity contours at the y = 0 cm plane; t = 0.2055 sec after 2250 timesteps; 20 processors; global domain 120 x 100 x 100; 1,269,288 vertices).
Fig. 3. Enlarged view of the U-velocity contours and vectors.
Towards Realistic Performance Bounds for Implicit CFD Codes

W. D. Gropp a*, D. K. Kaushik b†, D. E. Keyes c‡, and B. F. Smith d§

aMathematics and Computer Science Division, Argonne National Laboratory, Argonne, IL 60439, [email protected].
bMathematics and Computer Science Division, Argonne National Laboratory, Argonne, IL 60439 and Computer Science Department, Old Dominion University, Norfolk, VA 23529, [email protected].
cMathematics & Statistics Department, Old Dominion University, Norfolk, VA 23529, ISCR, Lawrence Livermore National Laboratory, Livermore, CA 94551, and ICASE, NASA Langley Research Center, Hampton, VA 23681, [email protected].
dMathematics and Computer Science Division, Argonne National Laboratory, Argonne, IL 60439, [email protected].

The performance of scientific computing applications often achieves a small fraction of peak performance [7,17]. In this paper, we discuss two causes of performance problems, insufficient memory bandwidth and a suboptimal instruction mix, in the context of a complete, parallel, unstructured mesh implicit CFD code. These results show that the performance of our code and of similar implicit codes is limited by the memory bandwidth of RISC-based processor nodes to as little as 10% of peak performance for some critical computational kernels. Limits on the number of basic operations that can be performed in a single clock cycle also limit the performance of "cache-friendly" parts of the code.

1. INTRODUCTION AND MOTIVATION
Traditionally, numerical analysts have evaluated the performance of algorithms by counting the number of floating-point operations. It is well-known that this is not a good estimate of performance on modern computers; for example, the performance advantage of the level-2 and level-3 BLAS over the level-one BLAS for operations that involve the *This work was supported in part by the Mathematical, Information, and Computational Sciences Division subprogram of the Office of Advanced Scientific Computing Research, U.S. Department of Energy, under Contract W-31-109-Eng-38. *Supported by Argonne National Laboratory under contract 983572401. *Supported in part by the NSF under grant ECS-9527169, by NASA under contracts NAS1-19480 and NAS1-97046, by Argonne National Laboratory under contract 982232402, and by Lawrence Livermore National Laboratory under subcontract B347882. ~This work was supported in part by the Mathematical, Information, and Computational Sciences Division subprogram of the Office of Advanced Scientific Computing Research, U.S. Department of Energy, under Contract W-31-109-Eng-38.
242
same number of floating-point operations is due to better use of memory, particularly the reuse of fast memory [5,6]. Paradoxically, however, the success of the level-2 and level-3 BLAS at reaching near-peak levels of performance has obscured the difficulties faced by many other numerical algorithms. On the algorithmic side, tremendous strides have been made; many algorithms now require only a few floating-point operations per mesh point. However, on the hardware side, memory system performance is improving at a rate that is much slower than that of processor performance [8,11]. The result is a mismatch in capabilities: algorithm design has minimized the work per data item, but hardware design depends on executing an increasing large number of operations per data item. The importance of memory bandwidth to the overall performance is suggested by the performance results shown in Figure i. These show the single-processor performance for our code, PETSc-FUN3D [2,9], which was originally written by W.K. Anderson of NASA Langley [i]. The performance of PETSc-FUN3D is compared to the peak performance and the result of the STREAM benchmark [II] which measures achievable performance for memory bandwidth-limited computations. These show that the STREAM results are much better indicator of performance than the peak numbers. We illustrate the performance limitations caused by insufficient available memory bandwidth with a discussion of sparse matrix-vector multiply, a critical operation in many iterative methods used in implicit CFD codes, in Section 2. Even for computations that are not memory intensive, computational rates often fall far short of peak performance. This is true for the flux computation in our code, even when the code has been well tuned for cache-based architectures [I0]. We show in Section 3 that instruction scheduling is a major source of the performance shortfall in the flux computation step. This paper focuses on the per-processor performance of compute nodes used in parallel computers. Our experiments have shown that PETSc-FUN3D has good scalability [2]. In fact, since good per-processor performance reduces the fraction of time spent computing as opposed to communication, achieving the best per-processor performance is a critical prerequisite to demonstrating uninflated parallel performance [3]. 2. P E R F O R M A N C E UCT
A N A L Y S I S OF T H E S P A R S E M A T R I X - V E C T O R
PROD-
The sparse matrix-vector product is an important part of many iterative solvers used in scientific computing. While a detailed performance modeling of this operation can be complex, particularly when data reference patterns are included [14-16], a simplified analysis can still yield upper bounds on the achievable performance of this operation. In order to illustrate the effect of memory system performance, we consider a generalized sparse matrix-vector multiply that multiplies a matrix by N vectors. This code, along with operation counts, is shown in Figure 2. 2.1. E s t i m a t i n g the M e m o r y B a n d w i d t h B o u n d To estimate the memory bandwidth required by this code, we make some simplifying assumptions. We assume t h a t there are no conflict misses, meaning t h a t each m a t r i x and vector element is loaded into cache only once. We also assume t h a t the processor never waits on a m e m o r y reference; t h a t is, that any number of loads and stores are satisfied in
243 BPeak Mflops/s
mStream Triad Mflops/s
D Observed Mflops/s
9oo~
/l
~~176 7~176
m m
~00 ~
m
m
400 m ,00 m
m m
m m
100
SP
Origin
T3E
Figure 1. Sequential performance of PETSc-FUN3D for a small grid of 22,677 vertices (with 4 unknowns per vertex) run on IBM SP (120 MHz, 256 MB per processor), SGI Origin 2000 (250 MHz, 512 MB per node), and T3E (450 MHz, 256 MB per processor).
For the algorithm presented in Figure 2, the matrix is stored in compressed row storage format (similar to PETSc's AIJ format [4]). For each iteration of the inner loop in Figure 2, we need to transfer one integer (from the ja array) and N + 1 doubles (one matrix element and N vector elements), and we do N floating-point multiply-add (fmadd) operations, or 2N flops. Finally, we store the N output vector elements. This leads to the following estimate of the data volume:
\[
\text{Total bytes transferred} = m \cdot \text{sizeof\_int} + 2mN \cdot \text{sizeof\_double} + nz \cdot (\text{sizeof\_int} + \text{sizeof\_double}) = 4(m + nz) + 8(2mN + nz).
\]
This gives us an estimate of the bandwidth required in order for the processor to do 2·nz·N flops at the peak speed:
\[
\text{Bytes transferred/fmadd} = \left(16 + \frac{4}{N}\right)\frac{m}{nz} + \frac{12}{N}.
\]
Alternatively, given a memory performance, we can predict the maximum achievable performance. This results in
\[
M_{BW} = \frac{2}{\left(16 + \frac{4}{N}\right)\dfrac{m}{nz} + \dfrac{12}{N}} \times BW, \tag{1}
\]
where M_BW is measured in Mflop/s and BW stands for the available memory bandwidth in Mbytes/s, as measured by the STREAM benchmark [11]. (The raw bandwidth based on memory bus frequency and width is not a suitable choice since it cannot be sustained in any application; at the same time, it is possible for some applications to achieve higher bandwidth than that measured by STREAM.)
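As a quick check of Equation (1), the following small program evaluates the bytes-per-fmadd estimate and the resulting bandwidth-limited bound for the matrix and STREAM bandwidth used in Table 1 below; it reproduces, to within rounding, the 12.36 and 3.31 bytes/fmadd and the ideal rates of roughly 58 and 216 Mflop/s for the AIJ format.

#include <stdio.h>

int main(void)
{
    /* Jacobian used in Table 1 and the STREAM bandwidth on the Origin 2000. */
    double m = 90708.0, nz = 5047120.0, bw = 358.0;     /* bw in MB/s */

    for (int N = 1; N <= 4; N *= 4) {                   /* 1 and 4 vectors  */
        double bytes_per_fmadd = (16.0 + 4.0 / N) * (m / nz) + 12.0 / N;
        double mflops_bound    = 2.0 * bw / bytes_per_fmadd;
        printf("N = %d: %.2f bytes/fmadd, bandwidth-limited bound %.1f Mflop/s\n",
               N, bytes_per_fmadd, mflops_bound);
    }
    return 0;
}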
for (i = 0; i < m; i++) {
    jrow = ia(i)                                     // 1 Of, AT, Ld
    ncol = ia(i+1) - ia(i)                           // 1 Iop
    Initialize sum1, ..., sumN                       // N Ld
    for (j = 0; j < ncol; j++) {                     // 1 Ld
        fetch ja(jrow), a(jrow), x1(ja(jrow)), ..., xN(ja(jrow))   // 1 Of, N+2 AT, N+2 Ld
        do N fmadd (floating multiply add)           // 2N Fop
        jrow++
    }                                                // 1 Iop, 1 Br
    Store sum1, ..., sumN in y1(i), ..., yN(i)       // 1 Of, N AT, N St
}                                                    // 1 Iop, 1 Br

Figure 2. General form of the sparse matrix-vector product algorithm: the storage format is AIJ (compressed row storage); the matrix has m rows and nz non-zero elements and is multiplied with N vectors. The comments at the end of each line show the assembly-level instructions the current statement generates, where AT is address translation, Br is branch, Iop is integer operation, Fop is floating-point operation, Of is offset calculation, Ld is load, and St is store.
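A plain-C rendering of the kernel of Figure 2 is given below for concreteness, using 0-based compressed-row storage. It is an illustration of the operation being analysed, not PETSc's implementation; the variable-length array for the partial sums assumes a C99 compiler and a modest N.

/* Multi-vector sparse matrix-vector product, y_k = A x_k for k = 1..N,
 * with the matrix stored in 0-based compressed-row (AIJ-like) format.     */
void spmv_multi(int m, int N,
                const int *ia, const int *ja, const double *a,
                const double *const *x,   /* N input vectors                */
                double *const *y)         /* N output vectors of length m   */
{
    for (int i = 0; i < m; i++) {
        double sum[N];                    /* one partial sum per vector */
        for (int k = 0; k < N; k++) sum[k] = 0.0;

        for (int j = ia[i]; j < ia[i + 1]; j++) {
            double aij = a[j];
            int    col = ja[j];
            for (int k = 0; k < N; k++)   /* the N fmadds of Figure 2 */
                sum[k] += aij * x[k][col];
        }
        for (int k = 0; k < N; k++)
            y[k][i] = sum[k];
    }
}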
In Table 1, we show the memory bandwidth required for peak performance and the achievable performance for a matrix in AIJ format with 90,708 rows and 5,047,120 nonzero entries on an SGI Origin 2000 (unless otherwise mentioned, this matrix is used in all subsequent computations). The matrix is a typical Jacobian from a PETSc-FUN3D application (incompressible version) with four unknowns per vertex. The same table also shows the memory bandwidth requirement for the block storage format (BAIJ) [4] for this matrix with a block size of four; in this format, the ja array is smaller by a factor of the block size. We observe that blocking helps significantly by cutting down on the memory bandwidth requirement. Having more than one vector also requires less memory bandwidth and boosts the performance: we can multiply four vectors in about 1.5 times the time needed to multiply one vector.
Table 1
Effect of memory bandwidth on the performance of the sparse matrix-vector product on an SGI Origin 2000 (250 MHz R10000 processor). The STREAM benchmark memory bandwidth [11] is 358 MB/s; this value of memory bandwidth is used to calculate the ideal Mflop/s. The achieved values of memory bandwidth and Mflop/s are measured using hardware counters on this machine. Our experiments show that we can multiply four vectors in 1.5 times the time needed to multiply one vector.

Format   Number of   Bytes/fmadd   Bandwidth (MB/s)         Mflop/s
         vectors                   Required   Achieved      Ideal   Achieved
AIJ         1          12.36         3090        276          58       45
AIJ         4           3.31          827        221         216      120
BAIJ        1           9.31         2327        280          84       55
BAIJ        4           2.54          635        229         305      175

2.2. Estimating the Operation Issue Limitation

To analyze this performance bound, we assume that all the data items are in primary cache (which is equivalent to assuming infinite memory bandwidth). Referring to the sparse matrix-vector algorithm in Figure 2, we get the following composition of the workload for each iteration of the inner loop:
- N + 5 integer operations
- 2N floating-point operations (N fmadd instructions)
- N + 2 loads and stores

Most contemporary processors can issue only one load or store in one cycle. Since the number of floating-point instructions is less than the number of memory references, the code is bound to take at least as many cycles as the number of loads and stores. This leads to the following expression for this performance bound (denoted by M_IS and measured in Mflop/s):
2nzN x Clock Frequency. nz(N+2)+m
(2)
2.3. Performance Comparison In Figure 3, we compare three performance bounds: the peak performance based on the clock frequency and the maximum number of floating-point operations per cycle, the performance predicted from the memory bandwidth limitation in Equation 1, and the performance based on operation issue limitation in Equation 2. For the sparse matrixvector multiply, it is clear that the memory-bandwidth limit on performance is a good approximation. The greatest differences between the performance observed and predicted by memory bandwidth are on the systems with the smallest caches (IBM SP and T3E), where our assumption that there are no conflict misses is likely to be invalid. 3. F L U X C A L C U L A T I O N A complete CFD application has many computational kernels. Some of these, like the sparse matrix-vector product analyzed in Section 2, are limited in performance by the available memory bandwidth. Other parts of the code may not be limited by memory bandwidth, but still perform significantly below peak performance. In this section, we consider such a step in the PETSc-FUN3D code.
246
[] Theoretical Peak [~ Oper. Issue Peak
| | |
900 800 700 600 500 400 300
[ ] M e m B W Peak D Observed
g l| |[ |
i
|
200 100 0
I
SP
Origin
T3E
Pentium
Ultra II
Figure 3. Three performance bounds for the sparse matrix-vector product; the bounds based on memory bandwidth and instruction scheduling are much closer to the observed performance than the theoretical peak of the processor. One vector (N = 1), matrix size m = 90,708, nonzero entries nz = 5,047,120. The processors are: 120 MHz IBM SP (P2SC "thin", 128 KB L1), 250 MHz Origin 2000 (R10000, 32 KB L1, and 4 MB L2), 450 MHz T3E (DEC Alpha 21164, 8 KB L1, 96 KB unified L2), 400 MHz Pentium II (running Windows NT 4.0, 16 KB L1, and 512 KB L2), and 360 MHz Sun Ultra II (4 MB external cache). Memory bandwidth values are taken from the STREAM benchmark web site.

The flux calculation is a major part of our unstructured mesh solver and accounts for over 50% of the overall execution time. Since PETSc-FUN3D is a vertex-centered code, the flow variables are stored at the nodes. While making a pass over an edge, the flow variables from the node-based arrays are read, many floating-point operations are done, and the residual values at each node of the edge are updated. An analysis of the code suggests that, because of the large number of floating-point operations, memory bandwidth is not a limiting factor. Measurements on our Origin 2000 support this; only 57 MB/s are needed. However, the measured floating-point performance is 209 Mflop/s out of a peak of 500 Mflop/s. While this is good, it is substantially under the performance that can be achieved with dense matrix-matrix operations. To understand where the limit on the performance of this part of the code comes from, we take a close look at the assembly code for the flux calculation function. This examination yields the following mix of the workload for each iteration of the loop over edges:
- 519 total instructions
- 111 integer operations
- 250 floating-point instructions, of which 55 are fmadd instructions, for 305 flops
- 155 memory references

If all operations could be scheduled optimally (say, one floating-point instruction, one integer instruction, and one memory reference per cycle), this code would take 250 cycles and achieve 305 Mflop/s. However, there are dependencies between these instructions, as well as complexities in scheduling the instructions [12,13], making it very difficult for the programmer to determine the number of cycles that this code would take to execute. Fortunately, many compilers provide this information as comments in the assembly code. For example, on the Origin 2000, when the code is compiled with cache optimizations turned off (consistent with our assumption that data items are in primary cache for the purpose of estimating this bound), the compiler estimates that the above work can be completed in about 325 cycles. This leads to a performance bound of 235 Mflop/s (47% of the peak on a 250 MHz processor). We actually measure 209 Mflop/s using hardware counters. This shows that the performance in this phase of the computation is restricted by the instruction scheduling limitation. We are working on an analytical model for this phase of the computation.

4. CONCLUSIONS

We have shown that a relatively simple analysis can identify bounds on the performance of critical components in an implicit CFD code. Because of the widening gap between CPU and memory performance, those parts of the application whose performance is bounded by the available memory bandwidth are doomed to achieve a declining fraction of peak performance. Because these are bounds on the performance, improvements in compilers cannot help. For these parts of the code, we are investigating alternative algorithms, data structures, and implementation strategies. One possibility, suggested by the analysis in Section 2, is to use algorithms that can make use of multiple vectors instead of a single vector with each sparse-matrix multiply. For another part of our code, the limitation is less fundamental and is related to the mix of floating-point and non-floating-point instructions. Analyzing this code is more difficult; we relied on information provided by the compiler to discover the instruction mix and estimates of the number of cycles that are required for each edge of the unstructured mesh. Improving the performance of this part of the code may require new data structures (to reduce non-floating-point work) and algorithms (to change the balance of floating-point to other instructions).

5. ACKNOWLEDGMENTS
The authors thank Kyle Anderson of the NASA Langley Research Center for providing FUN3D. Satish Balay and Lois McInnes of Argonne National Laboratory co-developed the PETSc software employed in this paper. Computer time was supplied by the DOE.
REFERENCES

1. W. K. Anderson and D. L. Bonhaus. An implicit upwind algorithm for computing turbulent flows on unstructured grids. Computers and Fluids, 23:1-21, 1994.
2. W. K. Anderson, W. D. Gropp, D. K. Kaushik, D. E. Keyes, and B. F. Smith. Achieving high sustained performance in an unstructured mesh CFD application. Technical report, MCS Division, Argonne National Laboratory, http://www.mcs.anl.gov/petsc-fun3d/papers.html, August 1999.
3. D. F. Bailey. How to fool the masses when reporting results on parallel computers. Supercomputing Review, pages 54-55, 1991.
4. S. Balay, W. D. Gropp, L. C. McInnes, and B. F. Smith. The Portable, Extensible Toolkit for Scientific Computing (PETSc) ver. 22. http://www.mcs.anl.gov/petsc/petsc.html, 1998.
5. J. J. Dongarra, J. Du Croz, S. Hammarling, and R. J. Hanson. An extended set of Fortran basic linear algebra subprograms: Model implementation and test programs. ACM Transactions on Mathematical Software, 14:18-32, 1988.
6. J. J. Dongarra, J. Du Croz, S. Hammarling, and R. J. Hanson. A set of level 3 basic linear algebra subprograms. ACM Transactions on Mathematical Software, 16:1-28, 1988.
7. W. D. Gropp. Performance driven programming models. In Massively Parallel Programming Models (MPPM-97), pages 61-67. IEEE Computer Society Press, 1997. Third working conference, November 12-14, 1997, London.
8. J. L. Hennessy and D. A. Patterson. Computer Architecture: A Quantitative Approach. Morgan Kaufmann Publishers, 1996.
9. D. K. Kaushik, D. E. Keyes, and B. F. Smith. On the interaction of architecture and algorithm in the domain-based parallelization of an unstructured grid incompressible flow code. In J. Mandel et al., editors, Proceedings of the 10th International Conference on Domain Decomposition Methods, pages 311-319. Wiley, 1997.
10. D. K. Kaushik, D. E. Keyes, and B. F. Smith. Newton-Krylov-Schwarz methods for aerodynamic problems: Compressible and incompressible flows on unstructured grids. In C.-H. Lai et al., editors, Proceedings of the 11th International Conference on Domain Decomposition Methods. Domain Decomposition Press, Bergen, 1999.
11. J. D. McCalpin. STREAM: Sustainable memory bandwidth in high performance computers. Technical report, University of Virginia, 1995. http://www.cs.virginia.edu/stream.
12. MIPS Technologies, Inc. MIPS R10000 Microprocessor User's Manual, January 1997. http://techpubs.sgi.com/library/manuals/2000/007-2490-001/pdf/007-2490-001.pdf.
13. Silicon Graphics, Inc. Origin 2000 and Onyx2 Performance and Tuning Optimization Guide, 1998. Document Number 007-3430-002. http://techpubs.sgi.com/library/manuals/3000/007-3430-002/pdf/007-3430-002.pdf.
14. O. Temam and W. Jalby. Characterizing the behavior of sparse algorithms on caches. In Proceedings of Supercomputing '92, pages 578-587. IEEE Computer Society Press, 1992.
15. S. Toledo. Improving the memory-system performance of sparse-matrix vector multiplication. IBM J. Res. and Dev., 41:711-725, 1997.
16. J. White and P. Sadayappan. On improving the performance of sparse matrix-vector multiplication. In Proceedings of the 4th International Conference on High Performance Computing (HiPC '97), pages 578-587. IEEE Computer Society, 1997.
17. W. A. Wulf and A. A. McKee. Hitting the wall: Implications of the obvious. Technical Report CS-94-48, University of Virginia, Dept. of Computer Science, December 1994.
A Parallel Computing Framework for Dynamic Power Balancing in Adaptive Mesh Refinement Applications

Weicheng Huang* and Danesh Tafti†
National Center for Supercomputing Applications, University of Illinois at Urbana-Champaign, Urbana, IL 61801, U.S.A.
*[email protected]  †[email protected]

The paper describes a new computing paradigm for irregular applications, in which the computational load varies dynamically and neither the nature nor the location of this variation is known a priori. In such instances, load imbalance between distributed processes becomes a serious impediment to parallel performance, and different schemes have to be devised to balance the load. In this paper, we describe preliminary work which uses "computational power balancing". In this model, instead of balancing the load on distributed processes, the processes ask for help by recruiting other processors. We illustrate the feasibility and practicality of this paradigm in an adaptive mesh application for solving non-linear dynamical systems and in the solution of large linear systems using an Additive Schwarz Preconditioned Conjugate Gradient method. The paradigm is illustrated on an SGI-Cray Origin 2000 system using MPI for distributed programming and OpenMP for shared memory programming.

1. INTRODUCTION

Parallel computers, since their inception in the mid 60's to early 70's, have been designed in a variety of configurations that differ in their organization of address space, interconnect network, and control mechanism for instructions. The most common configurations have been shared address space or shared memory (e.g. Cray Y-MP, Cray C90, SGI Power Challenge), and distributed address space or distributed memory architectures (e.g. Intel iPSC/860, CM-2, CM-5, IBM SP2). Recently, the distinct demarcation between shared memory and distributed memory architectures has blurred with the idea of "virtual shared memory" or "Distributed Shared Memory (DSM)" on cache-coherent physically distributed memory architectures. The early DSM machines were the HP-Convex Exemplar series and the Cray-SGI Origin 2000. However, today one does not have to do a lot of searching to realize the dominance of this architecture (or some variant of it) in current and future machines. Programming paradigms for these architectures can follow one of two paths. One could use either a shared memory paradigm with OpenMP or distributed memory programming with explicit communication calls to message passing libraries like MPI. Although MPI has a higher cost associated with its use than does shared memory programming, MPI imposes
explicit data locality, which can have a significant positive impact on scalability. An optimal programming paradigm is to use MPI across DSMs and shared memory OpenMP inside each shared memory unit [1]. In addition to the obvious use of hybrid MPI-OpenMP across clusters of SMPs or DSMs, one can use this paradigm within a single DSM unit like the SGI-Cray Origin 2000. A recent paper [1] showed the natural mapping of the hierarchical data structure of domain-decomposed algorithms to a hierarchy of parallelism, or embedded parallelism, which integrates distributed memory programming with the shared memory programming paradigm. In this approach, fine grain parallelism using shared memory directives is embedded under the coarse grain parallelism of distributed memory programming using MPI. Embedded parallelism has several advantages over either distributed or shared memory parallelism. It also complements the concept of hierarchical domain decomposition for cache friendly algorithms [2] [3] [4] [5].

In this paper we focus on new programming techniques for "irregular" applications. By "irregular" we mean an application where the computational load is not constant and is not known a priori. Central to this technique is the use of shared memory threads via OpenMP to equate the "computational power" to the increase or decrease in "computational load". The term "computational power balancing" has different connotations depending on the context in which it is used. In a distributed memory application, with the use of embedded parallelism (MPI-OpenMP), it means the ability to adjust the computational power between distributed MPI processes such that load imbalances are minimized. In a pure shared memory programming environment, because of task based parallelism, load imbalance is of less consequence. In this computing environment, "computational power balancing" means the ability of the application to make the most efficient use of computational resources by asking for more processors when needed and giving up processors when not needed. A minimal sketch of the idea is given at the end of this section.

There are a number of applications in the sciences, engineering, and commerce where "irregularity" is the norm rather than the exception. For CFD practitioners, the "irregularity" often manifests itself in adaptive h- and p-type refinements, and often different physics in parts of the domain. Load balancing has been a major concern in the parallel implementation of AMR type applications. Usually, when load imbalance occurs, either repartitioning of the mesh or mesh migration [6] [7] has to be employed in order to improve performance. The repartitioning procedure is often very time consuming and costly, while mesh migration might degrade the partition quality in addition to inducing an expensive migration penalty [8]. Instead of going through the intensive process of re-balancing the load, an alternative, advocated here, is to balance the computing power across parallel regions. In this paper we demonstrate the application of computational power balancing in a shared memory application of Adaptive Mesh Refinement (AMR), which is used to solve the state variables in non-linear dynamical systems. The concept is then illustrated in a static load imbalance problem utilizing embedded parallelism with MPI-OpenMP in an Additive Schwarz Preconditioned Conjugate Gradient (AS-PCG) linear solver.
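The following sketch illustrates the basic mechanism assumed throughout this paper: before entering a parallel region, the process requests a number of OpenMP threads tied to its current load (here simply the local cell count). The sizing rule and names are illustrative only; in the embedded MPI-OpenMP case this fragment would sit inside each MPI process.

#include <omp.h>

/* "Computational power balancing": request threads in proportion to the
 * current load of this parallel region, within a minimum and maximum.     */
void process_cells(double *cell_data, int ncells,
                   int cells_per_thread, int max_threads)
{
    int nthreads = ncells / cells_per_thread;
    if (nthreads < 1) nthreads = 1;
    if (nthreads > max_threads) nthreads = max_threads;

    #pragma omp parallel for num_threads(nthreads)
    for (int i = 0; i < ncells; i++)
        cell_data[i] *= 2.0;              /* placeholder for the real work */
}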
2. NONLINEAR DYNAMICAL SYSTEM

Nonlinear systems appear in various scientific disciplines. Numerical methods for solving such systems usually try to locate the possible asymptotically stable equilibrium states or periodic motions. Then, the stability of these solutions is studied, followed by an investigation of the effect of system parameters on the evolving behavior of these equilibrium states and periodic motions. The final task is to find the global domain of attraction for these equilibrium states and periodic motions. In this work we use the cell-to-cell mapping technique developed by Hsu and his colleagues [9].

2.1. Cell-to-Cell Mapping

The basic goal of cell-to-cell mapping is to partition the state space into a collection of state cells. Each cell is considered as a state entity. Two types of mappings have been investigated, simple cell mapping and generalized cell mapping. The focus of the application presented here is to improve the performance of the simple cell mapping technique via the implementation of the power balancing technique; a sketch of the basic simple cell mapping construction is given at the end of this subsection. An iterative refinement process is performed on a relatively coarse cell state space via an h-type unstructured mesh adaptation. A parallel algorithm based on the shared memory standard, OpenMP, is employed to allow for the flexibility of dynamically allocating additional shared memory threads to compensate for the increase of the computing load.
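For reference, the basic construction of a simple cell mapping on a uniform cell state space can be sketched as follows; the point mapping (the integration of the system over one mapping period) is passed in as a callback, and everything here is an illustrative outline rather than the code of [9].

#include <math.h>

/* Build a simple cell mapping on an nx-by-ny grid over [xmin,xmax] x [ymin,ymax]:
 * each cell is mapped to the cell containing the image of its centre point.
 * image[c] is the index of the image cell, or -1 for the sink cell.        */
void build_simple_cell_map(int nx, int ny,
                           double xmin, double xmax, double ymin, double ymax,
                           void (*point_map)(double *x, double *y),
                           int *image)
{
    double hx = (xmax - xmin) / nx, hy = (ymax - ymin) / ny;

    for (int j = 0; j < ny; j++)
        for (int i = 0; i < nx; i++) {
            double x = xmin + (i + 0.5) * hx;      /* cell centre          */
            double y = ymin + (j + 0.5) * hy;
            point_map(&x, &y);                     /* one mapping period   */

            int ii = (int)floor((x - xmin) / hx);
            int jj = (int)floor((y - ymin) / hy);
            image[j * nx + i] = (ii < 0 || ii >= nx || jj < 0 || jj >= ny)
                                ? -1 : jj * nx + ii;
        }
}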
2.2. Adaptive Mesh Refinement Implementation

In our application, the computation starts on a very coarse mesh and the mesh is adapted locally along with an iterative process, which requires several spatial adaptations to track the solution features. The h-refinement procedure employed in this paper involves edge-based bisection of simplices, similar to Rivara's refinement algorithm [10]. The iterative procedure consists of: 1) initial simplex mesh generation; 2) edge refinement based on point mapping; 3) interpolation mapping; 4) refinement based on simple cell mapping; 5) cell-to-cell mapping and lumping. Details of the implementation can be found in [11].
252
EkV~Eh,,V~m~m~m~ ~ : i : : ! : : : : - : : : : ~ : :'~;~; ~ , ~ ~ ~,"
i~,,;::~-::= :
(a)
:..:..::
.
';~,"i-2,~2,.~';::~'" ":~:-"~ - . . ~ - t ~
"':::::"::;,':~e
(c)
(b)
" ~ r ~
. . . . . .
.,: ,.b--= ~ , ~
~'~
".~. ....
(d)
,. ,,v'.qO~ ~
,.A
~lk,:.
(o)
Figure 1. Evolution of mesh after a sequence
of refinements.
Cells are divided into several sub-divisions. The grouping process is performed separately on each sub-division; the sub-divisions are then merged together with a minimum amount of cross-referencing, thus reducing the enormous penalty of cross-referencing.

2.4. Dynamic Power Balancing

In order to demonstrate the current technology, a twin-well Duffing type oscillator is solved. The governing ODE of the system is
(1)
and is transformed into -
y
-
x-
(2) x 3 -
e(1 -
1.Ix2):/:
(3)
This example, with e - 1.0, possesses twin-well limit cycles circling (x, y) -- (1, 0) and (x, y) - ( - 1 , 0), a saddle point located at (x, y) = (0, 0). The behavior of the system is illustrated focusing on the range (x, y) e [-2.5, 2.5] x [-2.5, 2.5]. A series of figures, Figure 1, show the evolution of the mesh after consecutive levels of refinement. The initial mesh consists of 162 simplex cells. In Figure l(b), the algorithm has located the domain with intensive activity. The mesh concentrates on the major feature starting Figure l(d) which captures the boundaries of different basins. In Figure l(e), the left limit cycle is well defined by the mesh enrichment on the limit cycle. Figure l(f) shows the final grid used after all the refinement stages with the mesh of all the basin boundaries and limit cycles enhanced for better resolution.
253
Figure 2. (a) Increase of grid size with refinement level. (b) Comparison of wall clock time when using a fixed number of processors for the whole calculation with calculations where the number of processors varies with computational load. (c) Computational load and shared memory threads assigned at different stages of the computation.
Figure 2(a) illustrates the increase in mesh size as the refinement level increases. From levels 1 to 4 of refinement, the mesh is refined based on the point mapping criteria. Levels 5 to 7 are obtained based on simple cell mapping criteria, while level 8 is employed to deliver a better resolution over the boundaries between different basins. As illustrated in the figure, the size of the mesh, starting from an initial grid of 162 cells, grows over several orders of magnitude after 8 levels of refinement.

To demonstrate the feasibility and flexibility of the concept, we compare the performance of a "fixed number of processors" with "processors on demand" based on different criteria. The performance for a "fixed number of processors" is obtained by using a fixed number of processors throughout the whole calculation, as in a conventional parallel calculation. As shown in Figure 2(b), the scalability of the calculations from 1 processor to 64 processors is almost linear. The results for "processors on demand" are obtained by assigning different numbers of processors to different parallel regions. The assignment of the number of processors is linked to the computational load in a given parallel region. For instance, in a given parallel region, the number of processors used may vary dynamically with the number of cells as the calculation progresses. By using equation (4) below, we can obtain the "effective processors", which quantify the average number of processors used during the calculation:
\text{effective processors} = \frac{\sum_{\text{parallel regions}} (\text{number of processors} \times \text{wall clock time})}{\text{total wall clock time}}, \qquad (4)
we can obtain the "effective processors", which quantifies the average number of processors used during the calculation. Wall clock time in this paper means the program execution time under a dedicated environment. In the current application, the mesh size increases as the refinement level increases. However, due to the different refinement criteria, the load does not always increase with the size of the mesh. As illustrated in Figure 2(c), the computational load increases as the
mesh size grows from level 1 to level 4. However, the load decreases from level 4 to level 5 and increases again after level 5. The number of processors assigned to each level of calculation is changed according to these load estimations, with a minimum as low as 1 processor and a maximum as high as 32 processors. For the case shown in Figure 2(c), the "effective number of processors" is 18. By arbitrarily changing the relationship between the number of processors used and the computational load in parallel regions, a series of calculations with different effective processors is obtained. These results are superimposed on Figure 2(b) with the "fixed number of processors" calculations. We find the wall clock times nearly identical to those for the "fixed number of processors" calculations. These results indicate that the overheads associated with dynamically changing the number of threads or processors are minimal in the current context. If the overheads associated with this procedure were high, then for the same "effective processor" count the wall clock time would be much higher. It is clearly possible for the wall clock time with dynamic power balancing to be less than that of its "fixed number of processors" counterpart. This is possible in applications which execute on more than the optimal number of processors with poor parallel efficiencies. In such cases, dynamic power balancing has the potential of utilizing the optimal number of processors based on computational load requirements and making much better use of available resources. This brings up an interesting question as to how one estimates the optimal number of processors for a given parallel region. For general applicability to any application code this can be quite a challenging task, and needs further research in dynamic run-time performance analysis and prediction. This is a potentially interesting research area for computer scientists. Having shown the feasibility of dynamic power balancing in a shared memory environment, it is not difficult to see its extension to a distributed environment, in which dynamic shared memory threads are now used under each MPI process. In the next section, we extend the concept to such an environment.

3. ADDITIVE SCHWARZ PRECONDITIONED CG METHOD
Conjugate gradient (CG) methods are used widely for solving large sparse linear systems Au = f. Usually, an equivalent preconditioned system M^{-1}Au = M^{-1}f, which exhibits better spectral properties, is solved. M is the preconditioning matrix or the preconditioner, such as a polynomial preconditioner [12] or an incomplete LU factorization [13]. In recent years, domain decomposition type preconditioners [14] have received much attention. The original domain or system is partitioned into smaller sub-domains which are solved independently in an iterative manner. To enhance the effectiveness of the preconditioner for a large number of sub-domains, multilevel domain decomposition is used. It has been shown that the multilevel Additive Schwarz preconditioner not only enhances convergence but also enhances performance on cache based microprocessor architectures [5].

3.1. Embedded Parallelism

Computational experiments performed by the authors have shown that the performance of ASPCG using either a distributed memory programming paradigm (MPI), a shared memory programming paradigm (OpenMP), or embedded parallelism (MPI-OpenMP) on the Origin 2000 architecture results in very similar performance characteristics.
It was established that all three paradigms can be used interchangeably without much loss in performance. Related information can be found on our web page, URL: http://mithril.ncsa.uiuc.edu/SCD/straka/Perf/Analysis/Comps/pcg2.html.

3.2. Static Load Imbalance

We now extend the use of embedded parallelism towards the concept of dynamic power balancing. In this case, however, we illustrate its application in a static load imbalance problem using the ASPCG solver. The problem size is 1024 × 1024 cells with a sub-domain size of 32 × 32 cells, resulting in 32 × 32 sub-domains. Table 1 shows the results of wall clock time for three cases. Case 1 has a perfectly balanced load on 4 MPI processes. Cases 2 and 3 use 3 MPI processes, with one process having twice the load of the other two. In Case 2, no steps are taken to balance the computational power, and hence the wall clock time is about twice as large as in Case 1. However, in Case 3, the overloaded MPI process seeks help from an additional shared memory thread to balance the computational power and makes much more efficient use of the available resources. Although this example deals with a static load imbalance problem, one can easily imagine its extension to dynamic load imbalances, where the computational load of each MPI process could vary dynamically. To deal with the varying load per MPI process, shared memory threads could be redistributed within MPI processes (fixed total number of processors). Alternatively, more or fewer shared memory threads could be used for each MPI process (varying total number of processors).

Table 1: Static load imbalance study of ASPCG kernel

          MPI processes   Shared threads   Wall clock time (sec.)
Case 1          4                0                 21.54
Case 2          3                0                 46.25
Case 3          3                1                 27.72
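A minimal sketch of the idea behind Case 3, in C with MPI and OpenMP, is given below. The load model and the proportional-thread rule are illustrative assumptions and do not reproduce the actual ASPCG code; the overloaded rank simply requests an extra shared-memory thread before entering its parallel region.

#include <mpi.h>
#include <omp.h>
#include <stdio.h>

/* Sketch: each MPI process chooses its shared-memory thread count in
   proportion to its local load before entering a parallel region. */
int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Illustrative static imbalance: rank 0 owns twice as many cells. */
    long local_cells = (rank == 0) ? 2048 * 1024 : 1024 * 1024;
    long max_cells;
    MPI_Allreduce(&local_cells, &max_cells, 1, MPI_LONG, MPI_MAX, MPI_COMM_WORLD);

    /* Heavily loaded processes "seek help" from extra shared-memory threads. */
    int nthreads = (int)((2.0 * local_cells) / max_cells);
    if (nthreads < 1) nthreads = 1;
    omp_set_num_threads(nthreads);

    double local_sum = 0.0;
#pragma omp parallel for reduction(+ : local_sum)
    for (long i = 0; i < local_cells; i++)
        local_sum += 1.0;          /* stand-in for the per-cell solver work */

    printf("rank %d: %d thread(s), %ld cells\n", rank, nthreads, local_cells);
    MPI_Finalize();
    return 0;
}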
4. CONCLUSION

In this paper we introduced the concept of dynamic power balancing by utilizing both distributed and shared memory programming. We first illustrated this concept in a shared memory application of AMR in the solution of non-linear dynamical systems, in which the number of processors used by the calculation was dynamically adjusted based on the computational load. This concept was then extended to a static load imbalance problem in which shared memory threads were used to balance the computational power on heavily loaded distributed processes. We illustrated these concepts on the SGI-Cray Origin 2000 DSM architecture. Currently, this work is being extended to a distributed dynamic load imbalance problem. A number of issues need to be studied and refined to enable the efficient and widespread use of this model:
1. Better control over data placement in the MPI-OpenMP environment when using shared memory threads under MPI processes, i.e., maintaining data locality.
2. The ability to dynamically predict the optimal number of shared memory threads for efficient use of available resources.
REFERENCES
1. D. K. Tafti and G. Wang, "Application of Embedded Parallelism to Large Scale Computations of Complex Industrial Flows," Proceedings of the ASME Fluids Engineering Division, FED-Vol. 247, pp. 123-130, 1998 ASME-IMECE, Anaheim, CA, Nov. 1998.
2. G. Wang and D. K. Tafti, "A parallel programming model of industrial CFD applications on microprocessor based systems," ASME Fluids Engineering Division, FED-Vol. 244, pp. 493-500, 1997, ASME International Mechanical Engineering Congress and Exposition.
3. G. Wang and D. K. Tafti, "Uniprocessor performance enhancement by additive Schwarz preconditioners on Origin 2000," Advances in Engineering Software, Vol. 29, No. 3-6, pp. 425-431, 1998a.
4. G. Wang and D. K. Tafti, "Parallel performance of additive Schwarz preconditioners on Origin 2000," Advances in Engineering Software, Vol. 29, No. 3-6, pp. 433-439, 1998b.
5. G. Wang and D. K. Tafti, "Performance Enhancement on Microprocessors with Hierarchical Memory Systems for Solving Large Sparse Linear Systems," Int. J. of Supercomputing Applications and High Performance Computing, Vol. 13, No. 1, pp. 63-79, Spring 1999.
6. R. D. Williams, "Performance of dynamic load balancing algorithms for unstructured mesh calculations," Concurrency: Practice and Experience, 3(5), 457-481, 1991.
7. C. Walshaw and M. Berzins, "Dynamic load balancing for PDE solvers and adaptive unstructured meshes," University of Leeds, School of Computer Studies, Report 92-32, 1992.
8. C. L. Bottasso, J. E. Flaherty, C. Özturan, M. S. Shephard, B. K. Szymanski, J. D. Teresco, and L. H. Ziantz, "The quality of partitions produced by an iterative load balancer," Scientific Computation Research Center and Department of Computer Science, Rensselaer Polytechnic Institute, Report 95-12, 1995.
9. C. S. Hsu, Cell-to-Cell Mapping: A Method of Global Analysis for Nonlinear Systems, Springer-Verlag, 1987.
10. M. C. Rivara, "Mesh Refinement Processes Based on the Generalized Bisection of Simplices," SIAM J. Numerical Analysis, 21(3), 604-613, 1984.
11. W. Huang, "Revisit of Cell-to-Cell Mapping Method for Nonlinear Systems," National Science Council Project NSC-87-2119-M-001-001, Taiwan, 1998.
12. Y. Saad, "Practical use of polynomial preconditionings for the conjugate gradient method," SIAM J. Sci. Stat. Comput., 6, p. 865, 1985.
13. G. Radicati di Brozolo and Y. Robert, "Parallel conjugate gradient-like algorithms for solving sparse nonsymmetric linear systems on a vector multiprocessor," Parallel Comput., 11, p. 223, 1989.
14. P. E. Bjorstad and M. Skogen, "Domain decomposition algorithms of Schwarz type, designed for massively parallel computers," in 5th Int. Symp. on Domain Decomposition Methods for Partial Differential Equations, SIAM, Philadelphia, p. 362, 1992.
A Two-Level Aggregation-Based Newton-Krylov-Schwarz Method for Hydrology*

E. W. Jenkins a, R. C. Berger b, J. P. Hallberg b, S. E. Howington b, C. T. Kelley a, J. H. Schmidt b, A. K. Stagg b, and M. D. Tocci c

aNorth Carolina State University, Center for Research in Scientific Computation and Department of Mathematics, Box 8205, Raleigh, N.C. 27695-8205, USA (ewjenkin@eos.ncsu.edu, Tim_Kelley@ncsu.edu)

bUS Army Waterways Experiment Station, Coastal and Hydraulics Laboratory and Information Technology Laboratory, 3909 Halls Ferry Road, Vicksburg, Mississippi 39180, USA (berger@hl.wes.army.mil, pettway@juanita.wes.army.mil, stacy@hwy61.wes.army.mil, schmidt@href.wes.army.mil, stagg@rusty.wes.hpc.mil)

cDepartment of Mathematical Sciences, Worcester Polytechnic Institute, Worcester, MA 01609, USA (mdtocci@wpi.edu)

*This research was supported in part by Army Research Office contract DAAD19-99-1-0186, US Army contract DACA39-95-K-0098, National Science Foundation grant DMS-9700569, a Cray Research Corporation Fellowship, and a Department of Education GAANN fellowship. Computing activity was partially supported by an allocation from the North Carolina Supercomputing Center.

Newton-Krylov-Schwarz methods solve nonlinear equations by using Newton's method with a Schwarz domain decomposition preconditioned Krylov method to approximate the Newton step. In this paper, we discuss the design and implementation of Newton-Krylov-Schwarz solvers in the context of the implicit temporal integration on an unstructured three-dimensional mesh of the Navier-Stokes equations for modeling flow in a river bend.

1. INTRODUCTION

In this paper we discuss the design and implementation of a Newton-Krylov-Schwarz solver for the implicit temporal integration on an unstructured three-dimensional spatial mesh of time-dependent partial differential equations. The novel feature of this approach is the formation of a coarse mesh problem using aggregation methods from algebraic multigrid [14]. The solver was tested within the Adaptive Hydrology (ADH) Model,
a finite element code being developed by the U. S. Army Corps of Engineers Engineer Research and Development Center (ERDC) that is designed to solve a variety of hydrology problems including surface water flow. In [10] we will apply this approach to the solution of Richards' equation for groundwater flow in the unsaturated zone, and in [7] we discuss an application to surface water-groundwater interaction. In this paper we report on the performance of the method in a Navier-Stokes simulation. The Navier-Stokes equations in terms of velocity u(x, t) and pressure p(x, t) can be written as

\rho \left( \frac{\partial u}{\partial t} + u \cdot \nabla u \right) = \nabla \cdot \sigma, \qquad \nabla \cdot u = 0, \qquad (1)

where

\sigma = -pI + \tau, \qquad \tau = 2\mu D, \qquad D_{ij} = \frac{1}{2}\left( \frac{\partial u_i}{\partial x_j} + \frac{\partial u_j}{\partial x_i} \right), \qquad \mu = \text{kinematic viscosity}. \qquad (2)

These equations are applicable within the domain \Omega \subset \mathbb{R}^3 with the following boundary conditions:

u = g \ \text{on} \ \Gamma_g, \qquad \sigma \cdot n = h \ \text{on} \ \Gamma_h. \qquad (3)

The boundary of \Omega is denoted by \Gamma, and n is the outward normal. \Gamma_g and \Gamma_h represent non-overlapping subregions of \Gamma such that \Gamma = \Gamma_g \cup \Gamma_h.
In Section 3 we give numerical results for a test case. These results show that our preconditioners have good scalability and that our coarse grid formulation is performing well.

2. NEWTON-KRYLOV-SCHWARZ
The weak formulation of the Navier-Stokes equations as given in [1] leads to implicit temporal integration. The discretization of the weak formulation leads to a system of nonlinear equations that must be solved at each time step. These equations are solved via Newton-Krylov-Schwarz (NKS) methods, which are described below.
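Before the components are described in detail, the overall nonlinear solve performed at each time step can be sketched as the inexact Newton iteration below. The assembly and Krylov-solve routines named here are placeholders supplied by the caller, not the ADH Model's actual interfaces.

#include <stdlib.h>
#include <math.h>

/* Sketch of the nonlinear solve performed at each implicit time step.
   assemble_residual, assemble_jacobian, and krylov_solve are assumed to
   be supplied elsewhere; their names are illustrative only. */
typedef struct { int n; double *coef; } Mat;            /* opaque Jacobian */

extern void assemble_residual(const double *u, double *F, int n);
extern void assemble_jacobian(const double *u, Mat *J, int n);
extern int  krylov_solve(const Mat *J, const double *rhs, double *s, int n);

static double norm_inf(const double *v, int n)
{
    double m = 0.0;
    for (int i = 0; i < n; i++)
        if (fabs(v[i]) > m) m = fabs(v[i]);
    return m;
}

/* Newton iteration for F(u) = 0, starting from the previous time level,
   with a (Schwarz-preconditioned) Krylov solver for each Newton step. */
int newton_solve(double *u, Mat *J, int n, double tol, int maxit)
{
    double *F = malloc(n * sizeof *F);
    double *s = malloc(n * sizeof *s);
    int it;
    for (it = 0; it < maxit; it++) {
        assemble_residual(u, F, n);
        if (norm_inf(F, n) < tol) break;
        assemble_jacobian(u, J, n);
        for (int i = 0; i < n; i++) F[i] = -F[i];   /* right-hand side -F(u) */
        krylov_solve(J, F, s, n);                   /* Newton step: F'(u) s = -F(u) */
        for (int i = 0; i < n; i++) u[i] += s[i];
    }
    free(F); free(s);
    return it;
}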
NKS methods [11] use a Krylov subspace method to determine the Newton step s in

F'(u_c)\, s = -F(u_c),

where F'(u_c) is the Jacobian at the current iteration. A Schwarz type preconditioner is used to accelerate the performance of the Krylov solver. The Krylov method used in the ADH Model is the Bi-CGStab method [13]. Both one-level additive Schwarz [3] and two-level additive Schwarz preconditioners [3] [4] are currently implemented in the ADH Model. Both of these preconditioners are domain decomposition type preconditioners, which means that the original domain is split into several subdomains, and the solutions of the original problem restricted to the subdomains are combined to form the preconditioner for the original system. If we define a matrix R_i to be the restriction matrix for subdomain i, so that R_i = [0 \; I \; 0], where I is an n_i \times n_i identity matrix and n_i is the size of subdomain i, then the one-level additive Schwarz preconditioner can be written as

\sum_{i=1}^{p} R_i^T (R_i A R_i^T)^{-1} R_i,

where p is the number of subdomains. The coarse mesh component of the preconditioner is formed by defining one aggregate element per subdomain. The resulting coarse mesh basis function is constant except in the elements shared between subdomains. The contribution of the subdomain to the coarse matrix is computed locally and then communicated to all of the processors. Thus, every processor solves the coarse mesh problem. The subdomain solves are performed using a profile solver [5] and the coarse grid problem is solved using a dense LU factorization. The two-level additive Schwarz preconditioner is formed by adding the coarse mesh problem to the one-level preconditioner, so that it takes the form

R_0^T (R_0 A R_0^T)^{-1} R_0 + \sum_{i=1}^{p} R_i^T (R_i A R_i^T)^{-1} R_i,

where R_0 and R_0^T are the restriction and interpolation operators from the fine to coarse meshes, respectively.
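A schematic of how the one-level preconditioner above is applied to a residual vector is given below. The subdomain index sets and the local solver are represented by placeholder routines standing in for the mesh partition and the profile solver; the coarse-grid term of the two-level variant would be accumulated into z in the same way.

#include <stdlib.h>

/* Sketch of applying z = sum_i R_i^T (R_i A R_i^T)^{-1} R_i r. */
typedef struct {
    int  size;      /* n_i, number of unknowns in subdomain i */
    int *index;     /* global indices selected by R_i         */
} Subdomain;

extern void subdomain_solve(int i, const double *r_local, double *z_local);

void apply_one_level_schwarz(const Subdomain *sub, int p,
                             const double *r, double *z, int n)
{
    for (int k = 0; k < n; k++) z[k] = 0.0;

    for (int i = 0; i < p; i++) {
        int ni = sub[i].size;
        double *rl = malloc(ni * sizeof *rl);
        double *zl = malloc(ni * sizeof *zl);

        for (int k = 0; k < ni; k++)            /* restriction R_i r */
            rl[k] = r[sub[i].index[k]];

        subdomain_solve(i, rl, zl);             /* local solve (R_i A R_i^T)^{-1} */

        for (int k = 0; k < ni; k++)            /* prolongation R_i^T, summed */
            z[sub[i].index[k]] += zl[k];

        free(rl); free(zl);
    }
}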
3. NUMERICAL EXPERIMENTS
In this section we report on a Navier-Stokes simulation of flow through a river bend. The purpose of this test was to investigate the performance of the preconditioners near steady state. Riprap refers to a foundation or wall of stones or similar material used on riverbanks to prevent erosion. The riprap problems were designed to aid in the placement and size determination of riprap material along natural rivers and channels. The models only incorporate a short section of the test facility; this short section does contain a river bend.
Figure 1. Riprap Mesh
The riprap model has a trapezoidal cross-section, and riprap material is placed along the sidewalls. The flow in the cross-sectional area of the bend is in the form of a spiral, and the net flow from side to side in the cross-section is to the outside of the bend at the surface of the water and to the inside of the bend close to the bottom. There is more sediment at the bottom of the river bend than at the top, so normally the outside bend is eroded while a bar is developed on the inside of the bend. These riprap model results are being used to evaluate three-dimensional models of the river bend. The equations were discretized on unstructured tetrahedral meshes in three space dimensions. We used the piecewise constant in time and piecewise linear in space finite element discretizations from [1]. These discretizations are implicit in time and therefore a discretized nonlinear elliptic problem must be solved at each time step. The Galerkin least squares methods of [6], [8], [9], and [12] were used to stabilize the discretization. The meshes were generated using GMS [2]. Initially, a straight channel with a trapezoidal cross-section was generated using the GMS tool. Once a mesh is generated, GMS creates an element connectivity table and a node location table. These tables were used to relocate the appropriate nodes (and elements) and construct the bend. The mesh is shown in Figure 1. After the grid was generated, the nodes were renumbered. The numbering of the nodes plays an important role in the performance of the preconditioner because of the way node allocation is currently done. At the present time, the nodes are divided among processors sequentially; i.e., the first n nodes go to processor 0, the next n to processor 1, and so forth. The number of subdomains per processor is an input parameter to the code. Once the nodes have a processor assignment, they receive a subdomain assignment. This assignment occurs in the same sequential fashion as the processor assignment. The coarse grid elements are defined on each processor, so the nodes that are assigned to that processor should be as physically close to one another as possible. The renumbering algorithm numbers the nodes in the vertical direction in the innermost loop, in the lateral direction
in the first outer loop, and longitudinally in the final outer loop. This ensures that the numbering occurs across an entire cross-section before moving to the next cross-section, and it also ensures that the numbering is done with respect to the aspect ratios present in the mesh. The nonlinear solver terminates when

\|F(u_c)\|_\infty < \mathrm{tol}_N, \qquad (4)

and the criterion for termination of the linear solver is

\|F'(u_c)\, s + F(u_c)\|_\infty < \mathrm{tol}_L, \qquad (5)

where both tolerances are fixed negative powers of ten.
The initial conditions were near steady state. Four time steps were taken for the smaller of the two meshes and sixteen time steps were taken for the larger mesh. The smaller mesh had 5881 nodes and 30720 elements and the larger mesh had 43889 nodes and 245760 elements. In this way the mesh width was roughly halved. For a regular grid, the condition number of the linearized problem is 1 + O(δt/h²), where δt is the time step and h the spatial mesh width, and the accuracy is O(δt + h²). Motivated by these estimates, we reduced the time step by a factor of four for the larger problem, keeping the condition number roughly constant and increasing the temporal accuracy along with the spatial accuracy. In Table 1, we report iteration counts for the small problem on 8 processors with 8 subdomains per processor, and the larger problem on 64 processors and 4 subdomains per processor, with the time step reduced by a factor of 4. The reduction in the number of subdomains per processor for the larger problem was necessary because the communication costs of assembling the coarse mesh problem with 8 subdomains per processor were too high. For the coarse mesh problem, using 8 subdomains per processor gave the lowest execution time, since the subdomain factorization cost dominated the cost for communication of coarse mesh data. The iteration counts provided in Table 1 are the total number of linear iterations used during the simulation. The timings are given in seconds and are the timings for the entire simulation, including the initialization, calculation of the Jacobian, construction and application of the preconditioners, and nonlinear solves. We have measured performance in this manner because we are not solely interested in the performance of the preconditioner, but its impact on the entire application. We compare Jacobi and two-level Schwarz iterations. One-level Schwarz did not perform as well as Jacobi in our experiments. For the runs labeled Schwarz, the coarse grid was computed, factored, and stored once for every 10 nonlinear iterations. This means the coarse grid was computed and factored only once for the smaller problem and twice for the larger problem. This lag factor appears to be much more important for the larger problems. We are in the process of determining when lag factors are necessary and what the optimal lag factor is. All of the numerical results were calculated on an IBM SP2 located at the WES Major Shared Resource Center. The SP2 has 255 processors, with an aggregate memory size of 256 Gbytes. The operating system is AIX, Version 4.1.5.x, and the IBM
Table 1. Riprap problem iterations

                                Small                 Large
                            Jacobi   Schwarz     Jacobi   Schwarz
Linear iterations            12398      1467      16279      1397
Time (seconds)                1633       953       4807      2663
Linear iterations/time step   3099       367       1017        87
parallel operating environment (poe), Version 2.1.0.24, is used for batch processing. The compiler is C for AIX with message passing, Version 3.1.4.
4. CONCLUSIONS

While the preconditioners were originally designed for use on groundwater problems, they perform well for Navier-Stokes simulations. Our use of a coarse mesh problem reduces the number of iterations, while lagging the coarse problem maintains the reduction in iteration counts while simultaneously reducing the execution time for the simulation. Our formulation of the coarse mesh with aggregate elements led to an easy construction of the coarse-mesh problem and trivial intergrid transfers. The performance will be improved in our future work by lagging the fine-mesh subdomain factorizations, using more efficient subdomain solvers or incomplete factorizations, and updating preconditioning information adaptively based on the behavior of the solution. We have seen that the number and size of subdomains on a processor have a significant effect on the performance of the preconditioner, both in terms of linear iterations required for solution and computational cost. We would therefore like to determine how to optimally choose the number of subdomains to use for preconditioning as a function of the problem size and the computer architecture.

5. ACKNOWLEDGEMENTS

The authors wish to thank David Keyes, Jun Zou, Carol Woodward, Van Henson, and Jim Jones for many helpful discussions.

REFERENCES
1. M. Behr and T. E. Tezduyar. Finite element solution strategies for large-scale flow simulations. Computer Methods in Applied Mechanics and Engineering, 112:3-24, 1994.
2. Brigham Young University. GMS - The Department of Defense Groundwater Modeling System, Reference Manual. Brigham Young University Engineering Computer Graphics Laboratory, Provo, Utah, 1994.
3. M. Dryja and O. B. Widlund. An additive variant of the Schwarz alternating method for the case of many subregions. Technical Report 339, Courant Institute, 1987. Also Ultracomputer Note 131.
4. M. Dryja and O. B. Widlund. Some domain decomposition algorithms for elliptic problems. In D. R. Kincaid and L. J. Hayes, editors, Iterative Methods for Large Linear Systems. Academic Press, San Diego, 1990.
5. I. S. Duff, A. M. Erisman, and J. K. Reid. Direct Methods for Sparse Matrices. Oxford University Press, New York, NY, 1986.
6. L. P. Franca and T. J. R. Hughes. Two classes of mixed finite element methods. Computer Methods in Applied Mechanics and Engineering, 69:88-129, 1988.
7. S. E. Howington, R. C. Berger, J. P. Hallberg, J. F. Peters, A. K. Stagg, E. W. Jenkins, and C. T. Kelley. A model to simulate the interaction between groundwater and surface water, 1999. Proceedings of the High Performance Computing Users' Group Meeting, Monterey, CA, June 7-10.
8. T. J. R. Hughes, L. P. Franca, and M. Balestra. A new finite element formulation for computational fluid dynamics: V. Circumventing the Babuška-Brezzi condition: A stable Petrov-Galerkin formulation of the Stokes problem accommodating equal-order interpolations. Computer Methods in Applied Mechanics and Engineering, 59:85-99, 1986.
9. T. J. R. Hughes, L. P. Franca, and G. M. Hulbert. A new finite element formulation for computational fluid dynamics: VIII. The Galerkin/least-squares method for advective-diffusive equations. Computer Methods in Applied Mechanics and Engineering, 73:173-189, 1989.
10. E. W. Jenkins, Joseph H. Schmidt, Alan Stagg, Stacy E. Howington, R. C. Berger, J. P. Hallberg, C. T. Kelley, and M. D. Tocci. Newton-Krylov-Schwarz methods for Richards' equation. In preparation.
11. D. E. Keyes. Aerodynamic applications of Newton-Krylov-Schwarz solvers. To appear in Proc. of the 14th International Conference on Num. Meths. in Fluid Dynamics (R. Narasimha et al., eds.), Springer, NY, 1995.
12. T. E. Tezduyar, S. Mittal, S. E. Ray, and R. Shih. Incompressible flow computations with stabilized bilinear and linear equal-order-interpolation velocity-pressure elements. Computer Methods in Applied Mechanics and Engineering, 95:221-242, 1992.
13. H. A. van der Vorst. Bi-CGSTAB: A fast and smoothly converging variant of Bi-CG for the solution of nonsymmetric linear systems. SIAM J. Sci. Statist. Comput., 13:631-644, 1992.
14. P. Vaněk, J. Mandel, and M. Brezina. Algebraic multigrid by smoothed aggregation for second and fourth order elliptic problems. Computing, 56:179-196, 1996.
The Next Four Orders of Magnitude in Performance for Parallel CFD

D. E. Keyes a*

aMathematics & Statistics Department, Old Dominion University, Norfolk, VA 23529, ISCR, Lawrence Livermore National Laboratory, Livermore, CA 94551, and ICASE, NASA Langley Research Center, Hampton, VA 23681, keyes@icase.edu.

*Supported in part by the NSF under grant ECS-9527169, by NASA under contracts NAS1-19480 and NAS1-97046, by Argonne National Laboratory under contract 982232402, and by Lawrence Livermore National Laboratory under subcontract B347882.

While some simulations whose computational work requirements are superlinear in memory requirements have executed at 1 Teraflop/s, simulations of PDE-based systems remain "mired" in the hundreds of Gigaflop/s on the same machines. We briefly review the algorithmic structure of typical PDE-based CFD codes that is responsible for this situation and consider possible architectural and algorithmic sources for performance improvement towards the achievement of the remaining four orders of magnitude required to reach 1 Petaflop/s.

1. INTRODUCTION

A 1 Teraflop/s computer today could be built with as few as 1,000 processors of 1 Gflop/s (peak) each, but due to inefficiencies within the processors, is more practically, but still optimistically, characterized as about 4,000 processors of 250 Mflop/s (delivered) each. To get to 1 Petaflop/s, we could go wider (1,000,000 processors of 1 Gflop/s each) or mainly deeper (10,000 processors of 100 Gflop/s each), with similar or larger safety factors to accommodate for processor inefficiency. From the point of view of PDE simulations on Eulerian grids, either extreme of a 1 Petaflop/s machine should be interesting. Many of the "Grand Challenges" of HPCC, ASCI, and SSI are formulated as PDEs (possibly among alternative formulations); however, PDE simulations have struggled to hold their own among recent Bell Prize submissions, as they require a balance among architectural components that is not necessarily met in a machine designed to "max out" on the standard LINPACK benchmark. Until recently, CFD has successfully competed against applications with more intensive data reuse only on special-purpose machines (vector or SIMD) in statically discretized, explicit formulations.

PDEs come in many varieties and complexities, but though their mathematical properties differ greatly, their computational implementations are surprisingly similar, whether of evolution (e.g., time hyperbolic, time parabolic) or equilibrium (e.g., elliptic, spatially hyperbolic or parabolic) type. Memory estimates, in words, M ~ Nx · (Nc + Na + Nc² · Ns), and work estimates, in flops, W ~ Nx · Nt · (Nc + Na + Nc² · (Nc + Ns)), hold for many types of grid-based CFD simulations, where Nx is the number of spatial grid points, Nt is the number of temporal or iterative steps, Nc is components per point, Na represents auxiliary storage per point, and Ns is the number of grid points in the discretization stencil. The last terms in the expressions for M and W are for implicit methods, for the storage and application of a preconditioner. Semi-implicit operator-split and explicit methods allow some reductions. Whether for explicit or implicit methods, resource scaling for PDEs is similar: for 3D problems, Work ∝ (Memory)^{4/3}. In equilibrium problems, work scales with problem size times the number of iteration steps. For reasonably tolerable implicit methods the latter is proportional to the resolution in a single spatial dimension. For evolutionary problems, work scales with problem size times the number of time steps, and CFL-type arguments place the latter on the order of the resolution in a single spatial dimension. While the exponent 4/3 is difficult to bring down for general CFD problems, the proportionality constant can be adjusted over a very wide range by both discretization (high-order implies more work per point and per memory transfer) and by algorithmic tuning.

Four tasks are typical in grid-based PDE solvers: (1) vertex-based loops (resp., cell-based, for cell-centered storage) for state vector and auxiliary vector updates; (2) edge-based "stencil op" loops (resp., dual edge-based) for residual evaluation, approximate Jacobian evaluation, and Jacobian-vector products; (3) sparse, narrow-band recurrences, as in approximate factorization and back substitution; and (4) vector inner products and norms for orthogonalization/conjugation and convergence and stability checks. Let N = Nx · Nc be the number of unknowns and P the number of processors. Then, for explicit solvers of the generic form
u^l = u^{l-1} - \Delta t^l \, f(u^{l-1}),

concurrency is pointwise, O(N); the communication-to-computation ratio is surface-to-volume, O((N/P)^{-1/3}); the communication range is nearest-neighbor, except for time-step computation; and the synchronization frequency is once per time-step, O((N/P)^{-1}). For domain-decomposed implicit solvers of the form
\frac{u^l - u^{l-1}}{\Delta t^l} + f(u^l) = 0, \qquad \Delta t^l \to \infty,

concurrency is either pointwise, O(N), or subdomainwise, O(P); the comm.-to-comp. ratio is still mainly surface-to-volume, O((N/P)^{-1/3}); and the communication is still mainly nearest-neighbor. However, convergence checking, orthogonalization/conjugation steps, and hierarchically coarsened problems add nonlocal communication. The synchronization frequency is higher, often more than once per grid-sweep, up to the Krylov subspace dimension, O(K (N/P)^{-1}). Storage per point is also higher, by a factor of O(K). Load balance issues are the same as for the explicit case if the grid is static, and highly challenging in the dynamic adaptive case. In this chapter, we assume grids are quasi-static.

2. SOURCE #1: EXPANDED NUMBER OF PROCESSORS
Figure 1. Flop rates (left) and execution time reduction (right) for an Euler flow problem on three machines. Dashed lines show the ideal for each performance curve, based on a left endpoint of 128 processors.

A simple bulk-synchronous scaling argument suggests that continued expansion of the number of processors is feasible as long as the architecture provides a global reduction
operation whose time-complexity is sublinear in the number of processors. However, the cost-effectiveness of this brute-force approach towards petaflop/s is highly sensitive to the frequency and latency of global reduction operations, and to modest departures from perfect load balance. As popularized with the 1986 Karp Prize entry of Benner, Gustafson & Montry, Amdahl's law can be defeated if serial (or bounded concurrency) sections make up a decreasing fraction of total work as problem size and processor count scale, as is true for most explicit or iterative implicit PDE solvers. A simple, back-of-envelope parallel complexity analysis [4] shows that processors can be increased as fast, or almost as fast, as problem size, assuming load is perfectly balanced. An important caveat relative to Beowulf-type systems is that the processor network must also be scalable (in protocols as well as in hardware). Therefore, all remaining four orders of magnitude could be met by hardware expansion. However, this does not mean that fixed-size applications of today would run 10^4 times faster; these Gustafson-type analyses are for problems that are correspondingly larger. As encouraging evidence that even fixed-size CFD problems scale well into thousands of processors, we reproduce from [1], in Fig. 1, flop/s scaling and execution time curves for Euler flow over an ONERA M6 wing, on a tetrahedral grid of 2.8 million vertices, run on up to 1024 processors of a 600 MHz T3E, 768 processors of IBM's ASCI Blue Pacific, and 3072 dual-processor nodes of Intel's ASCI Red. (The execution rate scales better than the execution time for this fixed-size problem, since as the subdomains get smaller with finer parallel granularity more of the work is redundant, on ghost regions.)

3. SOURCE #2: MORE EFFICIENT USE OF FASTER PROCESSORS
Looking internal to a processor, we argue that there are only two intermediate levels of the memory hierarchy that are essential to a typical domain-decomposed PDE simulation, and therefore that most of the system cost and performance cost for maintaining a deep multilevel memory hierarchy could be better invested in improving access to the relevant workingsets, associated with individual local stencils (matrix rows) and entire subdomains. Improvement of local memory bandwidth and multithreading, together with intelligent prefetching, perhaps through processors in memory to exploit it, could contribute approximately an order of magnitude of performance within a processor relative to present architectures. Sparse problems will never have the locality advantages of dense problems, but it is only necessary to stream data at the rate at which the processor can consume it, and what sparse problems lack in locality, they can make up for by scheduling. With statically discretized PDEs, the schedule is periodic and predictable. The usual ramping up of processor clock rates and the width or multiplicity of instructions issued are other obvious avenues for per-processor computational rate improvement, but only if memory bandwidth is raised proportionally. Improvement of the low efficiencies of most current sparse codes through regularity of reference is an active area of research that yields strong dividends for PDEs. PDEs have a simple, periodic workingset structure that permits effective use of prefetch/dispatch directives, and they have a luxurious amount of "slackness" (potential process concurrency in excess of hardware concurrency). Combined with intelligent processors-in-memory (PIM) features to do gather/scatter cache transfers and multithreading for latency that cannot be amortized by sufficiently large block transfers, PDEs can approach full utilization of processor cycles. An important architectural caveat is that high bandwidth is critical to support these other advanced features, since PDE algorithms do only O(N) work for O(N) gridpoints worth of loads and stores. One to two orders of magnitude can be gained by catching up to the clock, through such advanced features, and by following the clock into the few-GHz range. Even without PIM, multithreading, and bandwidth (in words per second) equal to the processor clock rate times the superscalarity, one can see the advantage in blocking in the comparisons in Table 1. For the same Euler flow system considered above, the problem was run incompressibly (with 4 × 4 blocks at each point) and compressibly (with 5 × 5 blocks at each point). On three different architectures, this modest improvement in reuse of cached data leads to a corresponding improvement in efficiency.

Table 1. Flop/s rate and percent utilization, as a function of dense point-block size, which varies in "Incomp." and "Comp." formulations.

Processor         T3E-900 (Alpha 21164)   Origin 2000 (R10000)   SP (P2SC, 4-card)
Application        Incomp.     Comp.       Incomp.     Comp.      Incomp.     Comp.
Actual Mflop/s        75          82          126         137        117         124
Pct. of Peak          8.3         9.1         25.2        27.4       24.4        25.8

Figure 2. Idealized model of cache traffic for fixed computation as cache size increases, showing two extreme knees and one gradual "knee."

We briefly consider the workingsets that are relevant to PDE solvers. The smallest consists of the unknowns, geometry data, and coefficients at a single multicomponent stencil, of size Ns · (Nc² + Nc + Na). The largest consists of the unknowns, geometry
data, and coefficients in an entire subdomain, of size (Nx/P). (N2c + Nc + Na). Most practical caches will be sized in between these two. The critical workingset to consider in relation to cache size is the intermediate one of the data in neighborhood collection of gridpoints/cells that is reused when the group of corresponding neighboring stencils is updated. As successive workingsets "drop" into a level of memory, capacity (and with effort conflict) misses disappear, leaving only compulsory misses, as sketched in the illustration of memory traffic generated from a fixed computation versus varying cache size in Fig. 2. There is no performance value in memory levels larger than subdomain, and little performance value in memory levels smaller than subdomain but larger than required to permit full reuse of most data within each subdomain subtraversal (middle knee, Fig. 2). The natural strategy based on this simple workingset structure is therefore, after providing an L1 cache large enough for smallest workingset (and multiple independent copies up to desired level of multithreading, if applicable), all additional resources should be invested in large L2. Furthermore, L2 should be of write-back type and its population should be under user-assist with prefetch/dispatch directives. Tables describing grid connectivity should be built (within each quasi-static grid phase) and stored in PIM used to pack/unpack dense-use cache lines during subdomain traversal. The costs of this greater per-processor efficiency are the programming complexity of managing the subdomain traversal, the space to store the gather/scatter tables in PIM, the time to (re)build the gather/scatter tables, and the memory bandwidth commensurate with peak rates of all processors. Unfortunately, current shared-memory machines have disappointing memory bandwidth for PDEs; the extra processors beyond the first sharing a memory port are often not useful.
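The data-layout and ordering ideas referred to here (and compared quantitatively in Table 2 below) can be sketched as follows. The struct names, block size, and flux routine are illustrative assumptions, not taken from the codes measured in this chapter.

/* Sketch of interlacing: store all fields of a gridpoint contiguously so
   that one cache line delivers the whole point-block, instead of keeping
   each field in a separate array. */

#define NC 5                      /* components per point (e.g., compressible flow) */

/* Separate ("original") layout: poor spatial locality per stencil visit. */
struct SeparateLayout {
    double *rho, *rhou, *rhov, *rhow, *rhoE;   /* five arrays of length Nx */
};

/* Interlaced layout: unknowns (and, similarly, coefficients) blocked by point. */
struct InterlacedLayout {
    double (*state)[NC];          /* state[v][c]: component c of vertex v */
};

/* Edge-based stencil update over an edge list that has been reordered for
   cache reuse; the flux routine is a placeholder. */
extern void edge_flux(const double *uL, const double *uR, double *f);

void residual_edges(const struct InterlacedLayout *g, const int (*edge)[2],
                    int nedges, double (*res)[NC])
{
    double f[NC];
    for (int e = 0; e < nedges; e++) {
        int vL = edge[e][0], vR = edge[e][1];
        edge_flux(g->state[vL], g->state[vR], f);
        for (int c = 0; c < NC; c++) {
            res[vL][c] += f[c];
            res[vR][c] -= f[c];
        }
    }
}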
Table 2. Experiments in grid edge reordering and data structure interlacing on various uniprocessors for the Euler flow problem. In Mflop/s. Final column shows relative speedups.

Processor            Clock (MHz)   Interlacing, Edge Reord.   Interlacing (only)   Original   Speedup
P2SC (2-card)            120                  97                      43               13        7.5
R10000                   250                 126                      74               26        4.8
604e                     332                  66                      42               15        4.4
Ultra II                 300                  75                      34               18        4.2
Alpha 21164              600                  91                      44               33        2.8
Pentium II (Linux)       400                  84                      48               32        2.6
4. SOURCE #3: MORE ARCHITECTURE-FRIENDLY ALGORITHMS
Besides the two just considered classes of architectural improvements (more, and better-suited, processor/memory elements), we consider two classes of algorithmic improvements: some that improve the raw flop rate and some that increase the scientific value of what can be squeezed out of the average flop. In this section, we mention higher-order discretization schemes, especially of discontinuous or mortar type, orderings that improve data locality, and iterative methods that are less synchronous than today's. Algorithmic practice needs to catch up to architectural demands, and several "one-time" gains remain to be contributed that could improve data locality or reduce synchronization frequency, while maintaining required concurrency and slackness. "One-time" refers to improvements by small constant factors, nothing that scales in N or P. Complexities are already near information-theoretic lower bounds for some CFD solvers, and we reject increases in flop rates that derive from less efficient algorithms, as defined by parallel execution time. A caveat here is that the remaining algorithmic performance improvements may cost extra space or may bank on stability shortcuts that occasionally backfire, making performance modeling less predictable. Perhaps an order of magnitude of performance remains here. Raw performance improvements from algorithms include: (1) spatial reorderings that improve locality, such as interlacing of all related grid-based data structures and ordering gridpoints and grid edges for L1/L2 reuse; (2) discretizations that improve locality, such as higher-order methods (which lead to larger denser blocks at each point than lower-order methods) and vertex-centering (which, for the same tetrahedral grid, leads to denser blockrows than cell-centering); (3) temporal reorderings that improve locality, such as block vector algorithms (these reuse cached matrix blocks; vectors in a block are independent), and multi-step vector algorithms (these reuse cached vector blocks; vectors have sequential dependence); (4) temporal reorderings that reduce synchronization penalty, such as less stable algorithmic choices that reduce synchronization frequency (deferred orthogonalization, speculative step selection) and less global methods that reduce synchronization range by replacing a tightly coupled global process (e.g., Newton) with loosely coupled sets of tightly coupled local processes (e.g., Schwarz); and (5) precision
271 reductions that make memory bandwidth seem larger, such as lower precision representation of preconditioner matrix coefficients or poorly known coefficients (arithmetic is still performed on full precision extensions). Table 2 (from data in [1]) shows some experimental improvements from spatial reordering on the same unstructured-grid Euler flow problem described earlier. 5. SOURCE #4: ALGORITHMS PACKING MORE "SCIENCE PER FLOP"
It can be argued that this last category of algorithmic improvements does not belong in a discussion focused on computational rates, at all. However, since the ultimate purpose of computing is insight, not petaflop/s, it must be mentioned as part of a balanced program, especially since it is not conveniently orthogonal to the other approaches. We therefore include a brief pitch for revolutionary improvements in the practical use of problem-driven algorithmic adaptivity in PDE solvers not just better system software support for well understood discretization-error driven adaptivity, but true polyalgorithmic and multiplemodel adaptivity. To plan for a "bee-line" port of existing PDE solvers to petaflop/s architectures and to ignore the demands of the next generation of solvers will lead to petaflop/s platforms whose effectiveness in scientific and engineering computing might be equivalent to less powerful but more versatile platforms. The danger of such a pyrrhic victory is real. Some algorithmic improvements do not improve flop rate, but lead to the same scientific end in the same time at lower hardware cost (less memory, lower operation complexity). A caveat here is that such adaptive programs are more complicated and less thread-uniform than those they improve upon in quality/cost ratio. They are not daunting, conceptually, but they put an enormous premium on dynamic load balancing. An order of magnitude or more can be gained here for many problems. Some examples of adaptive opportunities are: (1) spatial discretization-based adaptivity, in which discretization type and order are varied to attain required approximation to the continuum everywhere without over-resolving in smooth, easily approximated regions; (2) fidelity-based adaptivity, in which the continuous formulation is varied to accommodate physical complexity without enriching physically simple regions; and (3) "stiffness"-based adaptivity, in which the solution algorithm is changed to provide more powerful, robust techniques in regions of space-time where discrete problem is linearly or nonlinearly stiff, without extra work in nonstiff, locally well-conditioned regions. What are the status and prospects for such advanced adaptivity? Appropriate metrics to govern the adaptivity and procedures to exploit them are already well developed for some discretization techniques, including method-of-lines ODE solutions to stiff IBVPs and DAEs, and FEA for elliptic BVPs. This field is fairly wide open for other types of numerical analyses. Fidelity-based multi-model methods have been used in ad hoc ways in numerous commercially important engineering codes, e.g., Boeing TRANAIR [5]. Polyalgorithmic solvers have been demonstrated in principle e.g., [3], but rarely in the "hostile" environment of high-performance multiprocessing. These advanced adaptive approaches demand sophisticated software approaches, such as object-oriented programming. Management of hierarchical levels of synchronization (within a region and between regions) is also required. User-specification of hierarchical priorities of different threads would also
be desirable, so that critical-path computations can be given priority, while subordinate computations fill up unpredictable idle cycles with other subsequently useful work. An experimental example of new opportunities for localized algorithmic adaptivity is described surrounding Figs. 5 and 6 in [2]. For transonic full potential flow over a NACA airfoil, solved with Newton's method, excellent progress in residual reduction is made for the first few steps and the last few steps. In between, a shock develops and creeps down the wing until it "locks" into its final location, while the rest of the flow field is "held hostage" to this slowly converging local feature, whose stabilization completely dominates execution time. Resources should be allocated differently before and after the shock location stabilizes.

6. SUMMARY
To recap in reverse order, the performance improvement possibilities that suggest that petaflop/s is within reach for PDEs are: (1) algorithms that deliver more "science per flop", a possibly large problem-dependent factor, through adaptivity (though we won't count this towards rate improvement); (2) algorithmic variants that are more architecture-friendly, which we expect to contribute half an order of magnitude, through improved locality and relaxed synchronization; (3) more efficient use of processor cycles, and faster processor/memory, from which we expect one-and-a-half orders of magnitude, through memory-assist language features, PIM, and multithreading; and (4) an expanded number of processors, to which we look for the remaining two orders of magnitude. The latter will depend upon more research in dynamic balancing and extreme care in implementation.

7. ACKNOWLEDGMENTS

The author would like to thank his direct collaborators on computational examples reproduced in this chapter from earlier published work: Kyle Anderson, Satish Balay, Xiao-Chuan Cai, Bill Gropp, Dinesh Kaushik, Lois McInnes, and Barry Smith. Computer resources were provided by DOE (Argonne, LLNL, NERSC, Sandia), and SGI-Cray.

REFERENCES
1. W. K. Anderson, W. D. Gropp, D. K. Kaushik, D. E. Keyes, and B. F. Smith. Achieving high sustained performance in an unstructured mesh CFD application. In Proceedings of SC'99 (CDROM), November 1999.
2. X.-C. Cai, W. D. Gropp, D. E. Keyes, R. G. Melvin, and D. P. Young. Parallel Newton-Krylov-Schwarz algorithms for the transonic full potential equation. SIAM J. Sci. Comput., 19:246-265, 1998.
3. A. Ern, V. Giovangigli, D. E. Keyes, and M. D. Smooke. Towards polyalgorithmic linear system solvers for nonlinear elliptic systems. SIAM J. Sci. Comput., 15:681-703, 1994.
4. D. E. Keyes. How scalable is domain decomposition in practice? In C.-H. Lai et al., editor, Proceedings of the 11th International Conference on Domain Decomposition Methods, pages 286-297. Domain Decomposition Press, Bergen, 1999.
5. D. P. Young, R. G. Melvin, M. B. Bieterman, F. T. Johnson, S. S. Samant, and J. E. Bussoletti. A locally refined rectangular grid finite element method: Application to computational fluid dynamics and computational physics. J. Comp. Phys., 92:1-66, 1991.
Design of Large-Scale Parallel Simulations

Matthew G. Knepley, Ahmed H. Sameh, Vivek Sarin
Computer Science Department, Purdue University, West Lafayette, IN 47906-1398
{knepley, sameh, sarin}@cs.purdue.edu

We present an overview of the design of software packages called Particle Movers that have been developed to simulate the motion of particles in two and three dimensional domains. These simulations require the solution of nonlinear Navier-Stokes equations for fluids coupled with Newton's equations for particle dynamics. Furthermore, realistic simulations are extremely computationally intensive, and are feasible only with algorithms that can exploit parallelism effectively. We describe the computational structure of the simulation as well as the data objects required in these packages. We present a brief description of a particle mover code in this framework, concentrating on the following features: design modularity, portability, extensibility, and parallelism. Simulations on the SGI Origin2000 demonstrate very good speedup on a large number of processors.

1. OVERVIEW
The goal of our KDI effort is to develop high-performance, state-of-the-art software packages called Particle Movers that are capable of simulating the motion of thousands of particles in two dimensions and hundreds in three dimensions. Such large scale simulations will then be used to elucidate the fundamental dynamics of particulate flows and solve problems of engineering interest. The development methodology must encompass all aspects of the problem, from computational modeling for simulations in Newtonian fluids that are governed by the Navier-Stokes equations (as well as in several popular models of viscoelastic fluids), to incorporation of novel preconditioners and solvers for the nonlinear algebraic equations which ultimately result. The code must, on the one hand, be highlevel, modular, and portable, while at the same time highly efficient and optimized for a target architecture. We present a design model for large scale parallel CFD simulation, as well as the PM code developed for our Grand Challenge project that adheres to this model. It is a true distributed memory implementation of the prototype set forth, and demonstrates very good scalability and speedup on the Origin2000 to a maximum test problem size of over half a million unknowns. Its modular design is based upon the GVec package [14] for PETSc, which is also discussed, as it forms the basis for all abstractions in the code. PM has been ported to the Sun SPARC and Intel Pentium running Solaris, IBM SP2 running AIX, Origin2000 running IRIX, and Cray T3E running Unicos with no explicit
code modification. The code has proved to be easily extensible, for example in the rapid development of a fluidized bed experiment from the original sedimentation code. The code for this project was written in Mathematica and C. The C development used the PETSc framework developed at Argonne National Laboratory, and the GNU toolset. The goal of our interface development is mainly to support interoperability of scientific software. "Interoperability" is usually taken to mean portability across different architectures or between different languages. However, we propose to extend this term to cover what we call algorithmic portability, or the ability to express a mathematical algorithm in several different software paradigms. The interface must also respect the computational demands of the application to enable high performance, and provide the flexibility necessary to extend its capabilities beyond the original design. We first discuss existing abstractions present in most popular scientific software frameworks. New abstractions motivated by computational experience with PDEs arising from fluid dynamics applications are then introduced. Finally, the benefits of this new framework are illustrated using a novel preconditioner developed for the Navier-Stokes equations. Performance results on the Origin2000 are available elsewhere [11].
2. EXISTING ABSTRACTIONS
The existing abstractions in scientific software frameworks are centered around the basic linear algebra operations necessary to support the solution of systems of linear equations. The concrete examples from this section are taken from the PETSc framework [18], developed at Argonne National Laboratory, but are fairly generic across related packages [10]. These abstractions differ from the viewpoint adopted for the BLAS libraries [4] in the accessibility of the underlying data structure. BLAS begins by assuming a data format and then formulates a set of useful operations which can be applied to the data itself. Moreover, the data structure itself is typically exposed in the calling sequence of the relevant function. In the data structure neutral model adopted by PETSc, the object is presented to the programmer as an interface to a certain set of operations. Data may be read in or out, but the internal data structure is hidden. This data structure independence allows many different implementations of the same mathematical operation. The ability to express a set of mathematical operations in terms of generic interfaces, rather than specific operations on data, is what we refer to as algorithmic portability. This capability is crucial for code reuse and interoperability [5]. The operations of basic linear algebra have been successfully abstracted into the Vector and Matrix interfaces present in PETSc. A user need not know anything about the internal representation of a vector to obtain its norm. Krylov solvers for linear systems are almost entirely based upon these same operations, and therefore have also been successfully abstracted from the underlying data structures. Even in the nonlinear regime, the Newton-Krylov iteration adds only further straightforward vector operations to the linear solve. These abstractions fail when we discretize a continuous problem, or utilize information inherent in that process to aid in solving the system. A richer set of abstractions is discussed in the next section.

3. HIGHER LEVEL ABSTRACTIONS
To obtain a generic expression of more modern algorithms, we must supplement our set of basic operations and raise the level of abstraction at which we specify algorithmic components [3,2]. For instance, in most multilevel algorithms there is some concept of a hierarchy of spaces, and in finite element based code this hierarchy must be generated by a corresponding hierarchy of meshes [7,1,17,19]. Thus the formation of this hierarchy should in some sense be a basic operation. The detailed nature of the meshes affects the performance of an algorithm but not its conceptual basis; however, it might be necessary to specify nested meshes or similar restrictions. Raising the level of abstraction, by making coarsening a basic operation, allows the specification of these algorithms independently of the detailed operations on the computational mesh. We propose a set of higher level abstractions to encapsulate solution algorithms for mesh-based PDEs that respects the current PETSc-like interface for objects such as vectors and matrices. A Mesh abstraction should allow the programmer to interrogate the structure of the discrete coordinate system without exposing the details of the storage or manipulation. Thus it should allow low level queries about the coordinates of nodes, elements containing a given node or generic point, and also support higher level operations such as iteration over a given boundary, automatic partitioning and automatic coarsening or refinement. In the GVec libraries, a Partition interface is also employed for encapsulating the layout of a given mesh over multiple domains. A mesh combined with a discretization prescription on each cell provides a description of the discrete space which we encapsulate in the Grid interface. The Grid provides access to both the Mesh and a Discretization. The Discretization for each field which understands how to form the weak form of a function or operator on a given cell of the mesh. In GVec, Operator abstractions are used to encapsulate the weak form of continuous operators on the mesh, and provide an interface for registration of user-defined operators. This capability is absolutely essential as the default set of operators could never incorporate enough for all users. Finally, the Grid must maintain a total ordering of the variables for any given discretized operator or set of equations. In GVec, the VarOrdering interface encapsulates the global ordering and ties it to the nodes in the mesh, while the Local VarOrdering interface describes the ordering of variables on any given node. The Grid must also encapsulate a given mathematical problem defined over the mesh. Thus it should possess a database of the fields defined on the mesh, including the number of components and discretization of each. Furthermore, each problem should include some set of operators and functions defining it which should be available to the programmer through structured queries. This allows runtime management of the problem which can be very useful. For example, in the Backward-Euler integrator written for GVec, the identity operator is added automatically with the correct scaling to a steady-state problem to create the correct time-dependence. The only information necessary from the programmer is a specification of the the fields which have explicit time derivatives in the equation. This permits more complicated time-stepping schemes to be used without any change to the original code. 
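One way to picture the kind of interface described above is a C object whose storage is hidden behind a table of operations, as in the hypothetical sketch below. These names are invented for illustration and do not reproduce the actual GVec or PETSc interfaces.

/* Hypothetical Mesh interface: queries and operations only; storage hidden. */
typedef struct Mesh Mesh;

typedef struct {
    int   (*num_nodes)(const Mesh *m);
    void  (*node_coords)(const Mesh *m, int node, double xyz[3]);
    int   (*element_containing)(const Mesh *m, const double xyz[3]);
    int   (*refine)(Mesh *m);                 /* coarsening/refinement as a basic operation */
    int   (*partition)(Mesh *m, int nparts);  /* automatic partitioning */
} MeshOps;

struct Mesh {
    MeshOps ops;     /* implementation-specific operations */
    void   *data;    /* hidden storage: structured, unstructured, ... */
};

/* Algorithms are written against the interface only, so the same code runs
   over any mesh implementation ("algorithmic portability"). */
static int mesh_num_nodes(const Mesh *m) { return m->ops.num_nodes(m); }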
The Grid should also maintain information about the boundary conditions and constraints. In GVec, these are specified as functions defined over a subset of mesh nodes identified by a boundary marker, which is given to each node during
276 mesh generation. This allows automatic implementation of these constraints with the programmer only specifying the original continuous representation of the boundary values. 3.1. N o d e Classes In order to facilitate more complex distributions of variables over the mesh, GVec employs a simple scheme requiring one extra integer per mesh node. We define a class as a subset of fields defined on some part of the mesh. Each node is assigned a class, meaning that the fields contained in that class are defined on that node. For example, in a P2/P1 discretization of the Stokes equation on a triangular mesh, the vertices might have class 0, which includes velocity and pressure, whereas the midnodes on edges would have class 1, including only velocity. Thus this scheme easily handles mixed discretizations. However, it is much more flexible. Boundary conditions can be implemented merely by creating a new class for the affected nodes that excludes the constrained fields, and the approach for constraints is analogous. This information also permits these constraints to be implemented at the element level so that explicit elimination of constrained variables is possible, as well as construction of operators over only constrained variables. This method has proven to be extremely flexible while incurring minimal overhead or processing cost. 4. M U L T I L E V E L P R E C O N D I T I O N I N G
FOR NAVIER-STOKES
As an example of the utility of higher level abstractions in CFD code, we present a particulate flow problem [9,11] that incorporates a novel multilevel preconditioner [17] for the Navier-Stokes equations augmented by constraints at the surface of particles [13]. The fluid obeys the Navier-Stokes equations and the particles each obey Newton's equations. However, the interior forces need not be explicitly calculated due to the special nature of our finite element spaces [8,9,15]. These equations are coupled through a no-slip boundary condition at the surface of each particle. The multilevel preconditioner is employed to accelerate the solution of the discrete nonlinear system generated by the discretization procedure. 4.1. P r e c o n d i t i o n i n g The nonlinear system may be formulated as a saddle-point problem, where the upper left block A is a nonlinear operator, but the constraint matrix B is still linear: ( A
B
u)=
(f).
(1)
Thus we may approach the problem exactly as with Stokes, by constructing a divergenceless basis using a factorization of B. The ML algorithm constructs the factorization
pT B V
=
(D
vTv
=
I,
'
(2)
(3)
where V is unitary, D is diagonal, but P is merely full rank. If P were also unitary we would have the SVD, however this is prohibitively expensive to construct. Using this factorization, we may project the problem onto the space of divergenceless functions spanned by P2. Thus we need only solve the reduced nonlinear problem (4) in the projected space
P~APj~2 = pT (f _ APe D - T v T9)
(4)
277 The efficiency issues for this scheme are the costs of computing, storing, and applying the factors, as well as the conditioning of the resulting basis. The ML algorithm can store the factors in O(N) space and apply them in O(N) time. The conditioning of the basis can be guaranteed for structured Stokes problems, and good performance is seen for two and three dimensional unstructured meshes. The basis for the range and null space of B are both formed as a product of a logarithmic number of sparse matrices. The algorithm proceeds recursively, coarsening the mesh, computing the gradient at the coarse level B, and forming one factor at each level. The algorithm terminates when the mesh is coarsened to a single node, or at some level when an explicit QR decomposition of B can be accomplished. In a parallel setting, the processor domains are coarsened to a single node and then a QR decomposition is carried out along the interface. N
4.1.1. Software Issues The ML algorithm must decompose the initial mesh into subdomains, and in each domain form the local gradient operator. Thus if ML is to be algorithmically portable, we must have basic operations expressing these actions. The ability of the Mesh interface to automatically generate this hierarchy allows the programmer to specify the algorithm independently of a particular implementation, such as a structured or unstructured mesh. In fact, the convergence of ML is insensitive to the particular partition of the domain so that a user may select among mesh coarsening algorithms to maximum other factors in the performance of the code. Using the Grid interface to form local gradients on each subdomain frees the programmer from the details of handling complex boundary conditions or constraints, such as those arising in the particulate flow problem. For example, the gradient in the particulate flow problem actually appears as
B-(
Bx
)'
(5)
where B I denotes the gradient on the interior and outer boundary of the domain, BF is the gradient operator at the surface of the particles, and P is the projector from particle unknowns to fluid velocities at the particle surface implementing the no-slip condition. This new gradient operator has more connectivity and more complicated analytic properties. However, the code required no change in order to run this problem since the abstractions used were powerful enough to accommodate it. 4.1.2. Single Level R e d u c t i o n We begin by grouping adjacent nodes into partitions and dividing the edges into two groups: interior edges connecting nodes in one partition, and boundary edges connecting
partitions.
The gradient matrix may be reordered to represent this division, ( BI ) BF " The upper matrix Be is block diagonal, with one block for each partition. Each block represents the restriction of the gradient operator to that cluster. Furthermore, we may
factoreachblockindependentlyusingtheSVD, sothatifUiBiViT-( Di 0 )
we may factor
each domain independently. We now use these diagonal matrices to reduce the columns in by block row reduction. A more complete exposition may be found in [13].
BFVr
278 4.1.3. R e c u r s i v e F r a m e w o r k
We may now recursively apply this decomposition to each domain instead of performing an exact SVD. The factorization process may be terminated at any level with an explicit QR decomposition, or be continued until the coarsest mesh consists of only a single node. The basis P is better conditioned with earlier termination, but this must be weighed against the relatively high cost of QR factorization. Thus, we have the high level algorithm 1. Until numNodes < threshold do: m
(a) Partition mesh (b) Factor local operator (c) Block reduce interface (d) Coarsen mesh 2. QR factor remaining global operator 5. C O N C L U S I O N S The slow adoption of modern solution methods in large CFD codes highlights the close ties between interoperability and abstraction. If sufficiently powerful abstractions are not present in a software environment, algorithms making use of these abstractions are effectively not portable to that system. Implementations are, of course, possible using lower level operations, but these are prone to error, inflexible, and very time-consuming to construct. The rapid integration of the ML preconditioner into an existing particulate flow code[16] demonstrates the advantages of these more powerful abstractions in a practical piece of software. Furthermore, in the development and especially in the implementation of these higher level abstractions, the architecture must be taken into account. It has become increasingly clear that some popular computational kernels, such as the sparse matrix-vector product, may be unsuitable for modern RISC cache-based architectures[17]. Algorithms such as ML which possess kernels that perform much more work on data before it is ejected from the cache should be explored as an alternative or supplement to current solvers and preconditioners. REFERENCES
1. Achi Brandt, Multi-Level Adaptive Solutions to Boundary-Value Problems, Mathematics of Computation 31 (1977) 333--390. 2. David L. Brown, William D. Henshaw, and Daniel J. Quinlan, Overture: An Object Oriented Framework for Solving Partial Differential Equations, in: Scientific Computing in Object-Oriented Parallel Environments, Lecture Notes in Computer Science 1343 (Springer, 1997). Overture is located at http://www, l l n l . g o v / c a s c / 0 v e r t u r e . 3. H.P. Langtangen, Computational Partial Differential Equations--Numerical Methods and Diffpack Programming (Springer-Verlag, 1999). Diffpack is located at http://www, nobj e c t s . corn/Product s/Dif fpack. 4. Jack Dongarra, J. DuCroz, S. Hammarling, and R. Hanson, A proposal for an extended set of Fortran basic linear algebra subprograms, Technical Memo 41, Mathematics and Computer Science Division, Argonne National Laboratory, December, 1984.
279 The ESI Forum is located at h t t p : / / z , ca. sandia.gov/esi. 6. William D. Gropp, Dinesh K. Kaushik, David E. Keyes, and Barry F. Smith, Cache Optimization in Multicomponent Unstructured-Grid Implicit CFD Codes, in: Proceedings of the Parallel Computational Fluid Dynamics Conference (Elsevier, 1999). Ami Harten, Multiresolution representation of data: General framework, SIAM Journal on Numerical Analysis 33 (1996) 1205-1256. Todd Hesla, A Combined Fluid-Particle Formulation, Presented at a Grand Challenge Group Meeting (1995). Howard Hu, Direct simulation of flows of solid-liquid mixtures, International Journal of Multiphase Flow 22 (1996). 10. Scott A. Hutchinson, John N. Shadid, and Ray S. Tuminaro, Aztec User's Guide Version 1.0, Sandia National Laboratories, TR Sand95-1559, (1995). 11. Matthew G. Knepley, Vivek Sarin, and Ahmed H. Sameh, Parallel Simulation of Particulate Flows, in: Solving Irregularly Structured Problems in Parallel, Lecture Notes in Computer Science 1457 (Springer, 1998). 12. Denis Vanderstraeten and Matthew G. Knepley, Paralell building blocks for finite element simulations: Application to solid-liquid mixture flows, in: Proceedings of the Parallel Computational Fluid Dynamics Conference (Manchester, England, 1997). 13. Matthew G. Knepley and Vivek Sarin, Algorithm Development for Large Scale Computing: A Case Study, in: Object-Oriented Methods for Interoperable Scientific and Engineering Computing (Springer, 1999). 14. Matthew G. Knepley, GVec Beta Release Documentation, available at .
http ://www. cs. purdue, edu/home s/knep i ey/comp_ f luid/gvec, nb. ps.
15. Matthew G. Knepley, Masters Thesis, University of Minnesota, available at http: //www. cs. purdue, edu/home s/knep 1e y / i t er_meth. The Mathematica software and notebook version of this paper may be obtained at h t t p ://www. cs. purdue, edu/homes/knepley/iter_meth. 16. Vivek Sarin, An efficient iterative method for Saddle Point problems, PhD thesis, University of Illinois, 1997. 17. Vivek Sarin and Ahmed H. Sameh, An efficient iterative method for the generalized Stokes problem, SIAM Journal on Scientific Computing, 19 (1998) 206-226. 18. Barry F. Smith, William D. Gropp, Lois Curman McInnes, and Satish Balay, PETSc 2.0 Users Manual, Argonne National Laboratory, TR ANL-95/11, 1995, available via ftp://www .mcs. anl/pub/pet sc/manual, ps. 19. Shang-Hua Teng, Coarsening, Sampling, and Smoothing: Elements of the Multilevel Method, Unpublished, 1999.
This Page Intentionally Left Blank
Parallel ComputationalFluid Dynamics Towards Teraflops, Optimizationand Novel Formulations D. Keyes,A. Ecer, J. Periaux,N. Satofukaand P. Fox (Editors) 92000 Elsevier Science B.V. All rights reserved.
281
A n Efficient Storage T e c h n i q u e for Parallel Schur C o m p l e m e n t M e t h o d and Applications on Different Platforms S. Kocak* and H.U. Akay Computational Fluid Dynamics Laboratory, Department of Mechanical Engineering, Purdue School of Engineering and Technology, IUPUI, Indianapolis, IN 46202, USA.
A parallel Schur complement method is implemented for solution of elliptic equations. After decomposition of the total domain into subdomains, interface equations are expressed in a coupled form with the subdomain equations. The solution of subdomains is performed after solving the interface equations. In this paper, we present an efficient formation of interface and subdomain equations in parallel environments using direct solvers which take into account sparse and banded structures of the subdomain coefficient matrices. With such an approach, we can solve larger system of equations than in our earlier work. Test cases are presented on different platforms to illustrate the performance of the algorithm. 1. INTRODUCTION Parallel solution of large-scale problems attracted attention of many researchers more than a decade for which efficient solution techniques are being sought continuously. Due to advances in computer hardware, either new techniques are being proposed or many of the existing techniques are modified to exploit the advantages of new computer architectures. One of the existing methods used to solve large-scale problems is a substructuring technique known as the Schur complement method in the literature. Although the substructuring method dates back to 1963 on limited memory and sequential computers [ 1], the adaptation of this technique to parallel processors appears to be first introduced by Farhat and Wilson [2]. In spite of previous developments, speedup and efficient storage implementation of substructuring technique poses some major challenges. In this technique, after decomposition of the total domain into subdomains, the interface equations are formed in a Schur complement form. The solution of subdomains is performed after solving the interface equations. The size of interface equations increases with the number of subdomains used to represent the domain. While the subdomain matrices are sparse and banded, the interface matrix is dense. As a result, storage and solution of the interface equations require excessive memory and CPU time, respectively. Hence, special care must be given to efficient use of storage and CPU time. In this study, we extend our previously developed parallel algorithm [3] for solution of larger size systems by forming Schur complement equations more efficiently and test it on different platforms. More specifically, we present results on UNIX based workstations and Pentium PCs using WINDOWS/NT and LINUX operating systems.
*Visiting from Civil Engineering Department, Cukurova University, Adana, Turkey.
282
2. P A R A L L E L SCHUR C O M P L E M E N T A L G O R I T H M In substructuring method, the whole structure (domain) is divided into substructures (subdomains) and the global solution is achieved via coupled solution of these substructures. Here, for sake of simplicity, we will assume that each substructure is assigned to a single processor, referred to as subprocessor. After finite element discretizations, we can express system of equations of a domain with N subdomains in the following familiar form: -
-Kll
0
0
Klr -Pl
K22
0
K2F
-fl
P2
f2
(1)
0 _Krl
0 Kr2
KNN KNr PN] fN 9 9 KrN
Krr
prJ
fr
where K ii, (i = 1..... N), K r r , and KiF = KT i denote, respectively, coefficient matrices of the subdomains ~i, the interface F, and the coupling between subdomains and the interface. The same is true for subdomain vectors P i and fi, and the interface vectors PF and fF. The interface equation of the system can be written in the form: Krr=9
KFi K~ 1KiF P F = f F - - ~ K F i K~ 1 fi /=1
(2)
The equation system given in Eq. (2) is known as the Schur complement equation. Here, it must be noted that the coefficient matrix of the interface is a large and dense matrix. The dimension of this matrix depends on the number of unknowns on the interface, where the number of interface unknowns increases with the number of subdomains. In Figure 1, various subdomains of a square domain are shown for different number of subdivisions. As can be seen from the figure, the interface of the subdomains is not regular. With a greedy-based divider algorithm we have used here [2], the interface sizes of subdomains are not balanced even though nearly equal number of grid points is obtained in subdomains. Depending upon the finite element mesh used, the number of unknowns of interface equations may become very large and unbalanced. For the solution of Schur complement matrix equations, direct or iterative solvers may be applied. Here, we concentrate on assembling the Schur complement equation, which requires solution of a series of equations involving coupling between subdomains and interfaces. We can further express Eq. (2) in compact form as (K r r - GdT)Pr = f r - g r
(3)
where N Grr=~Kr/K~IKir i=1
N
,
gr=~KriKi-ilfi
(4)
i=1
Contributions to G r r to g r are computed by each subdomain processor without any message
283 passing, and then the interface coefficient matrix K r r = (K rr -C, rr) and the source vector f-r" = ( f r - g r ) are assembled via message passing. The solution of interface matrix may be implemented either using direct or iterative parallel solvers. For both of these techniques, the interface matrix is separated into blocks and distributed to processors, and the solution is achieved concurrently.
Figure 1. Domain divided into two, four, five, and ten subdomains. The two terms of Eqs. (4) can be written as
Air'=K~lKir,
A i = K ~ 1 fi
(5)
and to eliminate inversion of subdomain matrices, Eqs. (5) are expressed as
K ii A iF =K iF ,
K ii A i-- fi
(6)
Here, this system of equations is solved for every column of Air and column of A i . The number of columns of Air is governed by the number of unknowns on interface of that subdomain. In this case, we encounter a repeated right hand side system, which can be solved by using a direct solver very efficiently without any message passing. For the parallel solution of interface equations it is possible to use direct or iterative solvers. Both direct and iterative solvers are proposed in the literature, e.g., [2, 4-6]. Here, following the work of Farhat and Wilson [2], we use a direct solver approach, in which each row of the interface matrix is assigned to a processor as schematically illustrated in Figure 2, where we assume that there are three processors. In implementation of the interface solver, the Schur complement matrix has to be formed and distributed to related processors. As can be seen from Eqs. (3) and (4), dimensions of the Schur complement matrix may become very large in large-scale systems. Since contributions to G r r and g r are computed in each subprocessor to assemble the interface matrices of subprocessors, message passing between subprocessors is required. Once the subdomain and coupling terms given in Eqs. (4) are assembled by each subprocessor using Eq. (6), contributions to G r r and g r are computed using a direct solver. Here, we will present the algorithm for assembling the interface matrices without assembling the whole Schur complement matrix. The algorithm given in Table 1 is identical for all subprocessors.
284
-X
X
Processor 1 Processor 2
X
~d
Processor 3
X
X
X
X
X
X
X X
Processor 1
X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
x
X
Processor 2 Processor 3
X
~d
X
Figure 2. Distribution of interface matrix to subprocessors for a direct solver.
Table 1. Algorithm for assembling Schur complement matrix for each subprocessor.
do k=l, np (np= number of subprocessors) do I=1, neqs (neqs = number of equations on interface) if contribute to k th subprocessor then form R i which is k th column of Kit, solve KiiUi = R i for U i compute Pi = KFi U i which is contribution to G r r pack Pi to buffer to be send to k th subprocessor endif enddo form R i which is equal to fi solve KiiUi = R i for U i compute Pi = K r'i U i which is contribution to g r pack Pi to buffer to be send to k th subprocessor enddo m
In our earlier study [3], we assembled the entire Schur complement matrix K r r o n a single processor by receiving contributions to G r-r, and g r from subprocessors and then distributed it to subprocessors. Since K r r is a dense matrix, a lot of memory is needed for its storage. This limited the size of the problems which could be solved compared to the ones this new storage algorithm allows. Direct linear solvers are used here for the solution of both subdomain and interface equations of Poisson pressure equations arising in incompressible viscous flows. For the sake
285 of savings in storage, every subdomain has a locally numbered interface. Moreover, contribution to a subprocessor's interface terms is determined by the corresponding global equation number. Therefore, through the use of the algorithm given in Table 1 for the solution of interface equations, one does not need to assemble the Schur complement matrix K r r which might consume a lot of memory space. Banded nature of the subdomain equations are accounted for by using a skyline direct solution technique [7]. More details for the complete substructuring algorithm used in this study are given in [3]. For parallelization, PVM [8] (Parallel Virtual Machine) is used as the message passing interface. 3. E X A M P L E P R O B L E M S AND RESULTS
The computer algorithm developed here is implemented on MIMD distributed memory architectures with UNIX, WlNDOWS/NT, and LINUX operating systems. Timings at different steps of the program are taken to assess the algorithm and the platform. The platforms used have different CPU speeds and message passing interfaces. Since distributed memory architectures are used, message passing interfaces (switches) play an important role in the efficiency studies. The computer program is developed using a master-slave approach where one processor is reserved for master which divides the domain into subdomains and assign a processor (subprocessor) for each subdomain. Before discussing the results, we list below some of the important features of the platforms we used in our tests: IBM RS6K: UNIX/OS, 16 processors with 160 MHz CPU speed and two different switches with (a) 100 Mb/sec and (b) 10 Mb/sec bandwidths. In the remainder, we will identify these switches as fast and slow, corresponding to 100 and 10 MHz bandwidths, respectively. This system is located at IBM Research Center in Kingston, New York. Pentium II: LINUX/OS, 32 dual processor machines with 400 MHz CPU speed. In networking, 4 machines are connected to routers via 100 Mb/sec switch and total 8 routers are connected via 1000 Mb/sec switches. This system is located at NASA Glenn Research Center, Cleveland, Ohio. Pentium II: WINDOWS/NT, 13 processors with 400 MHz speed and 100 Mb/sec switch. This system is located in our CFD laboratory at IUPUI, Indianapolis. For the examples presented here, each subdomain is assigned to a different subprocessor, i.e., each subprocessor deals with only one subdomain so that, for message passing, switches are used instead of CPU. Therefore, the number of equations of subdomains (blocks) and interface changes with the number of blocks or subprocessors used in the analyses. In Figure 3, the elapsed speedup diagram of the program tested on an IBM RS6K UNIX workstation with fast ethernet is depicted for three example problems having 201,000 (Case 1), 120,600 (Case 2), and 30,300 (Case 3) equations, in a square computational domain as shown in Figure 1. It is seen from the diagram that the speedup of the algorithm is good for all the cases. It also can be seen from the diagram that for larger problems the speedup is better, which can be explained with the problem becoming more computation bound, since for the platform used in this case, CPU speed is faster than that of message passing. The speedups, greater than 100% efficiency are attributed to the efficient memory allocation and efficient direct solver operations for reduced size matrices (i.e., increased locality). The
286 fluctuations in speedups are explained by the varying computational loads of processors as well as the variations in bandwidth sizes. The greedy algorithm-based divider we have used is able to balance the number of elements in a domain but not the bandwidth and interface sizes of the subdomains. Shown in Figure 4 are the variations of the total communication and computation times with respect to the number of processors for Case 3 (30,300 equations) using the fast-switch RS6K. It is noted here that the delays which are caused by unbalanced work load of processors are added into the communication time since during the waiting times no CPU is used. The ratio of communication to computation time reaches to 72% for 15 processors. This ratio is however, smaller for larger systems (23% for Case 2 and 15% for Case 1). Figure 5 shows speedup diagrams of LINUX system for the same three cases. In this diagram, it is seen that the speedup is also better for larger systems, though not as good as in the fast-switch RS6K system. The speedups for Cases 2 and 3 decrease after ten processors, which may be attributed to network properties and the operating system used. Figure 6 shows the speedup diagrams for elapsed times on three different platforms with UNIX, LINUX and WlNDOWS/NT operating systems, for Case 3 (30,300 equations). The purpose of this diagram is to illustrate the effect of platform on speedup. Case 3 is chosen, since it is the most communication-bound problem among the three cases considered in this study. It is seen that, for the UNIX platform with the fast switch, the speedup is better than the others. The speedup curves of the remaining 3 platforms decrease for the 15 processor case which means the message passing interfaces of those platforms slow down the program due to their low speed message passings. The difference between that of the UNIX system with the fast and the slow switches illustrates the effect of message passing to speedup. As can be seen in the figure, the speedup value of the UNIX system for the 15 processors drops drastically when a slow switch is used. The difference between the elapsed and CPU speedup is attributed to message passing and delays between the processors. Since message passing is directly related to the size of the interface used, we expect that balancing the work load among subprocessors will minimize the time delays substantially. Here, it should be noted that besides the message passing interface, operating system and hardware properties of different platforms also affect the speed of the programs being tested. In Table 2, subdomain and interface data indicating matrix storage requirements as well as equation and bandwidth sizes are summarized with varying subdomains for Case 2. For the interface coefficient matrices, global and local terms are used to refer assembling the whole interface matrix versus the distributed interface matrices, respectively. Since the interface matrix is distributed to different processors without a global assembly, considerable amount of memory space is saved, which makes the solution of large size problems possible. The values given in the same table correspond to the memory requirements of the skyline solution technique [7] which is used here. The variations in maximum and minimum bandwidth sizes explain the fluctuations in speedups and efficiencies in different cases. Fluctuations would have been less if banded nature of the equations were not accounted for. 
However, that would have resulted with much longer elapsed times. 4. CONCLUSIONS The Schur complement algorithm presented here provides both memory and speedup efficiencies due to reduction in matrix sizes and operation counts. Using this algorithm,
287
large-scale problems, which could not be solved in our earlier study, can now be solved. The proposed algorithm gives better efficiencies for larger systems. The choice of an algorithm for dividing a domain into subdomains is crucial. It directly affects the number of unknowns on the interface and computational loading of subdomains, and hence the overall efficiency. The platform properties strongly affect the efficiency of the algorithms, therefore to make a general statement about the efficiency of an algorithm, by testing it only on a single platform might lead to unrealistic results. To avoid this, the properties of the platform and algorithm must be studied carefully.
2000
- - = Case 3 --Case 2 Case 1 ............. T h e o r e t i c a l
- -
--
~ f 8
l
computation
12 ,x:l o
- - c o m m u n i c a t i o n + delays
1600
~
"~ -
1200
. . . . .
~. ~oo
/
400 ..,.._,, . - , , - - - .,.= .-,. 0
,
0
,
2
,
4
6
,
|
8
10
,
12
,
14
|
16
,
,
,
|
,
2
4
6
8
10
12
,
i
14
16
18 N u m b e r o f Processors
Number
of Processors
Figure 3. Elapsed time speedups with the FastSwitch RS6K system.
Figure 4. Communication and computation times for Case 3. 20-
--
3
--Case
16-
1
Case
"-- -- LINUX R S 6 K FS - - " " R S 6 K SS = = = W I N D O W S N T FS ............ T h e o r e t i c a l
= = = Case 2
~.1~
~
12 ~9 8
,,..~
8"
0
,
,
,
,
,
,
,
,
2
4
6
8
10
12
14
16
18
Number of Processors
0
|
i
|
i
|
|
|
i
2
4
6
8
10
12
14
16
Number
1
of Processors
Figure 5. Elapsed time speedups with the LINUX
Figure 6. Elapsed speedups on various
system.
platforms (Case 3).
ACKNOWLEDGEMENTS The authors would like to gratefully acknowledge the computer accesses provided by the NASA Glenn Research Center in Cleveland, Ohio and the IBM Research Center in Kingston, New York. The financial support provided to the first author by the Scientific and Technical Research Council of Turkey (TUBITAK) under a NATO Scientific Fellowship program is also gratefully acknowledged.
288
Table 2. Subdomain and interface data for Case 2 (120,600 equations).
Subdomains
Interface
Number of Processors
Coefficient Matrix Size
Number of Equations
Max/Min Bandwidth
Global Matrix Size (Previous)
Local Matrix Size (Present)
Number of Equations
3.25E6
60,000
427/427
9.10F.A
4.56E4
1,698
1.31E6
24,000
333/228
5.46E5
1.09E5
4,284
10
4.86E5
12,000
476/173
1.16E6
1.17E5
6,606
15
2.51E5
8,000
283/143
1.48E6
9.90E4
8,841
REFERENCES
1. J.S. Przemieniecki, "Matrix Structural Analysis of Substructures," AIAA Journal, 1963, Vol. 1,138-147. 2. C. Farhat and E. Wilson, "A New Finite Element Concurrent Computer Program Architecture," International Journal for Numerical Methods in Engineering, Vol. 24, 1771-1792, 1987. 3. S. Kocak, H.U. Akay, and A. Ecer, "Parallel Implicit Treatment of Interface Conditions in Domain Decomposition Algorithms," Proceedings of Parallel CFD '98, Edited by C.A. Lin, et al., Elsevier Science, Amsterdam, 1999, (in print). 4. Nour-Omid, A. Raefsky, and G. Lyzenga, "Solving Finite Element Equations on Concurrent Computers," Parallel Computations and Their Impact on Mechanics, ASME, New York, AMD-VOL 86, 209-227, 1986. 5. J. Favensi, A. Daniel, J. Tomnbello, and J. Watson, "Distributed Finite Element Analysis Using a Transputer Network," Computing Systems in Engineering, Vol. 1, 171-182, 1990. 6. A.I. Khan and B.H.V. Topping, "Parallel Finite Element Analysis Using JacobiConditioned Conjugate Gradient Algorithm," Advances in Engineering Software, Vol. 25, 309-319, 1996. 7. E.L. Wilson and H.H. Dovey, "Solution of Reduction of Equilibrium Equations for Large Structural Systems, "Adv. Engng. Sofiw., 1978, Vol. 1, 19-25. 8. A. Geist, A. Beguelin, J. Dongarra, W. Jiang, R. Manchek, and V. Sundream, PVM: A Users' Guide and Tutorial for Networked Parallel Machines, The MIT Press, 1994.
Parallel Computational Fluid Dynamics Towards Terafiops, Optimization and Novel Formulations D. Keyes, A. Ecer, J. Periaux, N. Satofuka and P. Fox (Editors) 92000 Elsevier Science B.V. All rights reserved.
289
Applications of the Smoothed Particle Hydrodynamics method" The Need for Supercomputing Stefan Kunze 1, Erik Schnetter 2 and Roland Speith 3 Institut fiir Theoretische Astrophysik, Universit~it Tiibingen Auf der Morgenstelle 10C, 72076 Tiibingen, Germany http://www, t at. physik, uni-t uebingen, de/- sp eith/particle.ht ml
We shortly describe the numerical method Smoothed Particle Hydrodynamics (SPH) and report on our parallel implementation of the code. One major application of our code is the simulation of astrophysical problems. We present some recent results of simulations of accretion disks in close symbiotic binary stars.
1. S m o o t h e d Particle H y d r o d y n a m i c s Smoothed Particle Hydrodynamics (SPH; [11], [5], [2], [13]; [7], [15]) is a meshless Lagrangian particle method for solving a system of hydrodynamic equations for compressible fluids. SPH is especially suited for problems with free boundaries, a commonplace situation in astrophysics. Rather than solving the equations on a grid, the equations are solved at the positions of the so-called particles, each representing a mass packet with a certain density, velocity, temperature etc. and moving with the flow. The principle of the SPH method is to transform a system of coupled partial differential equations into a system of coupled ordinary differential equations which can be solved by a standard integration scheme. This is achieved by a convolution of all variables with an appropriate smoothing kernel W and an approximation of the integral by a sum over particle quantities: f(x)
'~ f d3x' f ( x ' ) W ( x - x')
~ y~ V i f i W ( x - xi) i
Then all spatial derivatives can be computed as derivatives of the analytically known kernel function. Thus only the derivatives in time are left in the equations. The main advantage of SPH is that it is a Lagrangian formulation where no advection terms are present. Furthermore conservation of mass comes for free, and the particles can l k u n z e ~ t a t . p h y s i k , u n i - t u e b i n g e n , de 2 s c h n e t t e r @ t a t , p h y s i k , u n i - t u e b i n g e n , de 3 spe i t h @ t a t . phys i k . u n i - t u e b i n g e n , de
290
be almost arbitrarily distributed which removes the need for a computational grid. By varying the kernel function W in space or time one can adapt the resolution, if necessary. The Euler equation, for example, in its SPH form reads dvi _ Pj + Pi dt - - ~ m~ V W (xi - x~)
which has been derived as outlined above and then antisymmetrized. In order to efficiently evaluate this equation it is important to use a kernel function W that has compact support, thus reducing the number of non-zero contributions to the sum which runs over all particles. Finding interacting particles is an important part of every SPH implementation; this is done using well-known grid- or tree structures [6]. In contrast to many other flavors of SPH used in astrophysics, in our approach the viscous stress tensor is not a rather arbitrary artificial viscosity. Instead it is implemented according to the Navier-Stokes equation to describe the physical viscosity correctly [4].
2. The parallel implementation The usual approach of using high level languages (such as High Performance Fortran, HPF) for a parallelization of the code proved not feasible. The irregular particle distributions create irregular data structures, and nowaday's compilers unfortunately cannot create efficient code in this situation. We instead decided to use the low level MPI library as it is available for all common platforms (compare to [3]). 2.1. Straightforward D o m a i n Decomposition
The main principle we settled for was using a domain decomposition where a modified serial version of the code runs on every node. The communication across domain boundaries is taken care of by a special kind of boundary condition, akin to periodic boundaries. This way the communication routines are separated from the routines implementing the physics. We hope that this will make future additions to the physics easier, because people adding new physical features will need only a basic knowledge of the way communication is handled. This inter-domain boundary condition takes care of (almost) all necessary communication and sets up ghost particles for the SPH routines. The same approach had already successfully been implemented for periodic boundaries, only that now the ghost particles come from other nodes. Of course particle interactions that cross domain boundaries are calculated on only one node. The disadvantage of this method is that a low number of particles cannot efficiently be distributed onto many nodes. The ghost particle domain of each node has the size of the interaction range, and for increasing numbers of nodes the ghost particles eventually outnumber the real particles. Although the numerical workload stays the same, managing
291
the particles becomes more expensive. The common remedy is to increase the number of particles proportionally to the number of nodes.
2.2. Not Wasting Memory An SPH code needs at least three passes over all particles, computing the density, the viscous stress tensor, and the acceleration, respectively. If those passes are run one after the other, then the interaction information for all particles has to be kept in memory. Given that there are about 100 interactions per particle this information requires by far (a factor of 10) the largest amount of memory of the overall simulation. This severely limits the total number of particles that fit into a given computer system. In order to save on memory we run these passes in parallel, where each particle begins the next pass as soon as it and all its neighbours have finished the previous one. This can be realized with only negligible overhead by calculating the interactions by sweeping through the simulation domain combining all three passes. The interactions of a particle are determined on the fly when the particle is first encountered and are dropped from the interaction list as soon as the particles have finished the third pass. This sweeping happens (almost) independently on all nodes. In between the passes information about the particles may have to be exchanged between nodes, which is taken care of by the boundary condition module.
2.3. Simple Load Balancing
0.25 fltlillilili
0.2
-----t--
livUlilluiiuil
0.15
'
0.1
......
0.05
0.05
0
0
-0.05
..0.05
,.
,
.... i!i i!
-0.1 .0.15
-0.15 -0.2 -0.25
~,.'
0.15
0.1
-0.1
'~'~..' : l ~
0.2
-0.2 | -0.2
J -0.15
J -0.1
| -0.05
i 0
i 0.05
i 0.1
i 0.15
i 0.2
0.25
-0.25
I
|
I
I
I
I
|
I
I
I
-0.2
-0.15
-0.1
-0.05
0
0.05
0.1
0.15
0.2
0.25
Fig. 1. Two examples of domain decompositions. On the left for 15 nodes, on the right for 32 nodes.
292
Load balancing, although of vital importance, has only been implemented in an ad hoc fashion. The domains are cuboid and of different sizes. They do not form a grid but rather a tree structure. The initial domains are chosen by distributing the particles evenly; this distribution is refined during the simulation by monitoring computing time and resizing the domains after every time step. In order to keep the domain shapes as cubic as possible, nodes may be transferred to a different subtree, thus reorganizing the overall domain structure. Two examples of domain decompositions are shown in Figure 1. Thanks to MPI our code runs on many different platforms. It has been tested on our workstation cluster, a Beowulf cluster, the IBM SP/2, and the Cray T3E. It performs reasonably well on all those platforms; the load balancing takes only a negligible amount of time. The typical overall time spent waiting is less than about 12 %. Typical runs have 300 000 particles on about 50 nodes, where one evaluation of the right hand side takes about one second. 3. A c c r e t i o n Disks in S y m b i o t i c B i n a r y S t a r s
Fig. 2. Pole-on and edge-on view of a simulation of the accretion disk of the Dwarf Nova OY Car. Mass of the White Dwarf: 0.696 Mo, mass of the donor star: 0.069 Mo. The scales indicate 0.1 solar radii. The donor star is on the left. Greycoded is the dissipated energy.
Symbiotic binary stars are so close to each other that their evolution and appearance changes dramatically compared to single stars. Dwarf Novae are a class of variable symbiotic binary stars where mass transfer from one star to the other occurs. The donor is a light main sequence star, the accretor a more massive, but much smaller White Dwarf (WD). Due to its intrinsic angular momentum the overflowing gas cannot be accreted by the WD right away, instead a thin gaseous disk around the WD forms and the subsequent
293
accretion is governed by viscous processes in the disk, [18]. The physics of these accretion disks is far from being well understood. Existing models of long term outburst behaviour are essentially 1D and neglect the tidal influence of the donor star [12]. Observationally, the disks show variability on timescales from minutes to decades, occasionally increasing in brightness up to 5 magnitudes. Numerical simulations, especially in 3D, require enormous amounts of grid points - - or particles in our case - - to achieve the necessary resolution. Since the problem size is so large, and the integration time so long, parallel programs on supercomputers are the only possible way to go [14]. 3.1. 3D-SPH Simulation of the Stream-Disk Interaction in a Dwarf Nova One aspect of Dwarf Nova disks is the impact of the overflowing gas stream onto the rim of the accretion disk. Both flows are highly supersonic and two shock regions form [10]. The shocked gas becomes very hot, a bright spot develops, which sometimes can be brighter than the rest of the disk. The relative heights of the stream and the rim of the disk are unclear. If the stream is thicker than the disk, a substantial portion of the infaUing gas could stream over and under the disk and impact at much smaller radii [1], [8]. Figure 2 shows a snapshot of the simulation of the accretion disk of the Dwarf Nova OY Carinae. Grey-coded is the energy release due to viscous dissipation. One can clearly see the bright spot where the stream hits the disk rim. Furthermore, on the far side of the donor star, a secondary bright spot is visible where overflowing stream material finally impacts onto the disk. In this simulation, about 10 to 20 % of the stream material can flow over and under the disk. 3.2. Superhumps in A M CVn AM Canem Venaticorum stars are thought to be the helium counterparts to dwarf novae. AM CVn stars are believed to consist of two helium white dwarfs, a rather massive primary and a very light, Roche-lobe filling secondary. Roche-lobe overflow feeds an accretion disk around the primary. Tsugawa & Osaki [16] showed that such helium disks undergo thermal instabilities similar to the hydrogen disks in Dwarf Novae. In three AM CVn stars, Dwarf Nova-like outbursts indeed have been observed. In order to investigate whether AM CVn exhibit superhumps we performed 3D-SPH simulations of the accretion disk. Initially, there was no disk around the primary. Particles were inserted at the inner Lagrangian point according to the mass transfer rate. Already after about 30 orbital periods the disk grew to a point where it was subject to the 3:1 inner Lindblad resonance [19]. Subsequently, the disk became more and more tidally distorted and started to precess rapidly in the frame of reference corotating with the stars (see Figure 3), which translates to a slow prograde precession in the observers' frame. Every time the bulk of the disk passes the secondary, the tidal stresses and hence the viscous heating are strongest, giving rise to modulations in the photometric lightcurve,
294
.,,,~:'
..... , w r ~ j : .
o, ~ _ . . , : , . w ' . . . . , ~ . . r , " ' ? , ~.
o.,,~
,."~" . . . . . :
~
.........
::LL,.?-
.J
Fig. 3. A series of snapshots of the disk of Am CVn, 0.2 orbital phases apart. Upper panel: density distribution; lower panel: dissipated energy. The parameters used are: M1 = 1 Mo, M2 -- 0.15 Mo, mass transfer rate 10-1~ Mo/yr. A ploytropic equation of state with .y = 1.01 was used. One can see how the precession of the tidally distorted disk leads periodically to higher dissipation, resulting in superhumps in the lightcurve; see Figure 4.
the superhumps. A Fourier transform of the obtained lightcurve reveals a superhump period excess of 4.4 %. This is in good agreement with the periods given by Warner [17], which differ by 3.8 %. A former study of the superhump phenomenon by Kunze et al. [9] showed that the period excess is a function of the mass transfer rate, the mass ratio of the stars, and the kinematic viscosity of the disk. These parameters are not well known for AM CVn. 4. C o n c l u s i o n s The SPH method is very well suited for solving astrophysical problems with compressible flow and free boundaries. An efficient parallel implementation requires some effort but allows three-dimensional long-term simulations. This is especially helpful for exploring and validating theoretical models where the underlying parameters are not well known. Global properties of the system can be reproduced quite accurately. References [1] P. J. Armitage and M. Livio, Astrophys. J., 470 (1996) 1024 [2] W. Benz, in: Numerical Modelling of Stellar Puslations: Problems and prospects, J. R. Buchler (ed.), Kluwer Academic Press, Dordrecht, 1990 [3] T. Bubeck, M. Hipp, S. Hiittemann, S. Kunze, M. Ritt, W. Rosenstiel, H.Ruder and R.
295
Superhumps of AM CVn i
i
i
i
i
I
!
l
i
i
E~ (D f: (D 13
Q.
.m (D (D 13
J
~
5
10
15 20 25 orbital periods
I
[
30
35
40
Fig. 4. Shown is the total dissipated energy of the disk over a time span of 40 orbital periods. A Fourier transform of the dissipated energy reveals a superhump period excess of 4.4 %.
Speith, in: High Performance Computing in Science and Engineering '98, E. Krause and W. J~iger (eds.), Springer, Berlin, 1999 [4] O. Flebbe, S. Miinzel, H. Herold, H. Pdffert and H. Ruder, Astrophys. J., 431 (1994) 754 [5] R. A. Gingold and J. J. Monaghan, Mon. Not. R. Astr. Soc., 181 (1977) 375 [6] L. Greengard, V. Rokhlin, J. Comp. Phys., 73 (1987) 325 [7] L. Hernquist, Astrophys. J., 404 (1993) 717 [8] F. V. Hessman, Astrophys. J., 510 (1999) 867 [9] S. Kunze, R. Speith and H. mffert, Mon. Not. R. Astr. Soc., 289 (1997) 889 [10] S. H. Lubow and F.H. Shu, Astrophys. J., 198 (1975) 383 [11] L. S. Lucy, Astron. J., 82 (1977) 1013 [12] F. Meyer and E. Meyer-nofmeister, Astron. Astrophys., 132 (1983) 143 [13] J. J. Monaghan, Ann. Rev. Astron. Astrophys., 30 (1992) 543 [14] H. Riffert, H. Herold, O. Flebbe, H. Ruder, in: CPC Topical Issue: Numerical Methods in Astrophysics, W. J. Duschl and W. M. Wscharnuter (eds.), 89 (1995) 1 [15] M. Steinmetz, E. Mfiller, Astron. Astrophsy., 268 (1993) 391 [16] M. Wsugawa, Y. Osaki, Publ. Astron. Soc. Japan, 49 (1995) 75 [17] B. Warner, Astron. & Space Sci., 255 (1995) 249 [18] B. Warner, Cataclysmic Variable Stars, Cambridge University Press, Cambridge, 1995 [19] R. Whitehurst, Mon. Not. R. Astr. Soc., 232 (1988) 35
This Page Intentionally Left Blank
Parallel ComputationalFluidDynamics Towards Teraflops, Optimizationand Novel Formulations D. Keyes,A. Ecer, J. Periaux,N. Satofukaand P. Fox (Editors) 92000 ElsevierScienceB.V. All rights reserved.
297
P a r a l l e l m u l t i g r i d s o l v e r s w i t h b l o c k - w i s e s m o o t h e r s for m u l t i b l o c k g r i d s * Ignacio M. Llorente ~ , Boris Diskin b and N. Duane Melson c ~Departamento de Arquitectura de Computadores y AutomAtica Universidad Complutense, 28040 Madrid, Spain UInstitute for Computer Applications in-Science and Engineering Mail Stop 403, NASA Langley Research Center, Hampton, VA 23681-2199 ~Computational Modeling and Simulation Branch Mail Stop 128, NASA Langley Research Center, Hampton, VA 23681-2199 One of the most efficient approaches to yield robust methods for problems with anisotropie discrete operators is the combination of standard coarsening with alternating direction plane relaxation. However, this approach may be difficult to implement in codes with multiblock structured grids because there may be no natural definition of global lines or planes. This paper studies the behavior of blockwise plane smoothers in order to provide guidance to engineers who use block-structured grids. 1. I N T R O D U C T I O N It is known that standard multigrid smoothers are not well suited for solving problems involving anisotropic discrete operators. Several methods have been proposed in the multigrid literature to deal with anisotropic operators. Alternating-direction plane smoothers in combination with full coarsening have been found to be highly efficient and robust on single-block grids because of their optimal work per cycle and low convergence factor[I]. Single-block structured grids with stretching are widely used in many areas of computational physics. However, multiblock grids are needed to deal with complex geometries and/or to facilitate parallel processing. The range of a plane-implicit smoother on blocked grids is naturally limited to only a portion (one block) of the computational domain since no global lines or planes are assumed. Thus, the plane smoother becomes a block-wise plane smoother. The purpose of the current work was to study whether the optimal properties of planeimplicit smoothers deteriorate for general multiblock grids. Notice that the multigrid method with a block-wise smoother is a priori more efficient than a domain decomposition method because the multigrid algorithm is applied over the whole domain. Its efficiency *This research was supported by the National Aeronautics and Space Administration under NASA Contract No. NAS1-19480 while the first authors were in residence at the Institute for Computer Applications in Science and Engineering (ICASE), NASA Langley Research Center, Hampton, VA 236812199
298
and an excellent parallelization potential were already demonstrated for isotropic elliptic problems [2] where point-wise red-black smoothers were used. We have analysed two sources of anisotropy. The first case is anisotropic coefficients in the equation discretized on uniform grids (Section 3) and the second case is the isotropic coefficient equation discretized on stretched grids (Section 4). Some conclusions about the convergence behavior of block-wise smoothers and their relation with parallel methods for multigrid are outlined in Section 5. 2. D E S C R I P T I O N
OF T H E P R O B L E M
A parallel 3-D code to study the behavior of block-wise plane-smoothers has been developed. The code solves the nonlineaa diffusion equation by using a full multigrid approach and the full approximation scheme (FAS) with V-cycles. The equation is solved on a multiblock grid using a cell-centered formulation. The grid can be stretched. Each block can overlap with the neighboring blocks. For simplicity, the analysis and test problems are done for rectangular grids although the results can be extrapolated to more general multiblock grids. The code is implemented in Fortran77 and has been parallelized using the standard OpenMP directives for shared-memory parallel computing. The behavior of block-wise smoothers when the number of subdomains spanned by a strong anisotropy is low (up to 4) was studied. It is unlikely that regions of very strong anisotropies will traverse too many blocks, especially for stretched grids, in computational fluid dynamics (CFD) simulations. 3. A N I S O T R O P I C
E Q U A T I O N S D I S C R E T I Z E D ON U N I F O R M G R I D S
The smoothing properties of block-wise plane smoothers are analyzed for a model equation discretized on a rectangular uniform grid. e-'t'(Ui.. - 1,iu,iz -- 2Ui= iu,i= nt- Uia:+l,iy,iz)"~ h~ + h2 ( ?-tiz ,iy ,iz -1
,iy ,iz + ?-tix ,iy ,iz + l ) -
(1) f/'
where i~ = 1, ..., n~, iy = 1, ..., ny, iz = 1, ..., nz; el and e2 are the anisotropy coefficients; f ' is a known discrete function; and h:, hy, and hz are the meshsizes in the x, y and z directions, respectively. This is a finite volume cell-centered discretization. We assume a global grid with N 3 partitioned into m 3 blocks (cubic partitioning). Each subgrid overlaps with neighboring blocks. (See Figure 1.) Our aim is to obtain the dependence of the convergence rate p(m, n, 5, e) on the following parameters: the number of blocks per side of the domain (m), the number of cells per block side (n - N) the m ~ overlap parameter (5), and the anisotropy coefficients (el and e2). For simplicity, we assume q >_ e2 >_ 1. The (x,y)-plane smoother is applied inside each block and the blocks are updated in lexicographic order. We use a volume-weighted summation for the restriction operators. Trilinear interpolation in the computational space is applied as the prolongation operator. The 2-D problems defined in each plane are approximately solved by one 2-D V(1,1)-cycle with x-line smoothing.
299
1
[ !,f~J!"I!.l'.!';ti,:l
,~"~1 ~"22 ~'23
~24
o
epp g
i
Artlfleiel boundary Inner
aetl=
True
boundary
cell8
Figure 1. Data structure of a subgrid with overlap: n is the number of cells per side in every block, m is the number of blocks per side in the split and ~7is the semi width of the overlap.
Two sets of calculations were performed to determine the experimental asymptotic convergence rate as a function of the anisotropy coefficients. 9 Both q and e2 are varied for the single-block case (m = 1) and two multiblock cases (m = 2 and m = 4). Different overlaps between blocks (5 = 0, 2 and 4) are examined. The upper graphs in Figure 2 show the results for a 128 a (N = 128) grid. 9 Only el is varied for the single-block case (m = 1), and two multiblock cases (m = 2 and m = 4). Different overlaps between blocks (~ = 0, 2 and 4) are examined. The lower graphs in Figure 2 show these results for a 128 a (N = 128) grid. All the graphs exhibit a similar behavior with respect to ~ and e. We can distinguish three different cases: 9 In the single-block case {m - 1), the convergence rate decreases quickly for an anisotropy larger than 100, tending to (nearly) zero for very strong anisotropies. In fact, the convergence rate per V(1,1)-cycle decreases quadratically with increasing anisotropy strength, as was predicted in [1]. 9 If the domain is blocked with the minimal overlapping (5 = 0), the convergence rate for small anisotropies (1 < el _< 100) is similar to that obtained in a singleblock solver with a point-wise smoother on the whole domain (i.e., about 0.552 per cycle). It increases (gets worse) for larger anisotropies and is bounded above by the convergence rate of the corresponding domain decomposition algorithm. The convergence rate for strong anisotropies approaches one as the grid is refined. 9 If the domain is blocked with larger overlapping (~ > 2), the convergence rate for small anisotropies is similar to that obtained for a single-block grid spanning the whole domain (i.e., about 0.362 per cycle) and increases to the domain decomposition rate for very strong anisotropies. The asymptotic value for strong anisotropies gets closer to one for smaller overlaps and finer grids.
300
0,8
0.8
p 0,6
.o 0 , 6
0,4
0,4
0,2 0
0,2 . lO
.
. ioo
.
I ooo
~'l ----.6 2
1oooo
: -IOOOOO lOOOOOO
m=2
I
=
0,8
"
~
o
lO
1oo
lOOO 10000 61 --~ 6 2
100000
lOOOOO,
m----.,l
0.8
o,6 p,, o,4 o,2
0 1
0,4 _
~
9
io
0,2
~
,
I oo
IOOO
ioooo
~
~
IOOOOO IOOOOOO
0
I
IO
IOO
,~, (~'z = i ) single
I
I ooo
l OOOO iooooo
i oooooo
~', ( s z = I ) block
4-overlap
~
O-overlap
+
8-overlap
~
2-overlap
'
Figure 2. Experimental asymptotic convergence factors, p~, of one 3-D V(1,1)-cycle with block-wise plane-implicit smoother versus anisotropy strength on a 1283 grid for different values of overlap (5) and number of blocks per side (m).
Numerical results show that for strong anisotropies, the convergence rates are poor on grids with minimal overlap (5 = 0), but improve rapidly with larger than the minimal overlap (~ > 2). For multiblock applications where a strong anisotropy crosses a block interface, the deterioration of convergence rates can be prevented by an overlap which is proportional to the strength of the anisotropy. Even when an extremely strong anisotropy traverses four consecutive blocks, good convergence rates are obtained using a moderate overlap. 3.1. Analysis The full-space Fourier analysis is known as a simple and very efficient tool to predict convergence rates of multigrid cycles in elliptic problems. However, this analysis is inherently incapable of accounting for boundary conditions. In isotropic elliptic problems where boundary conditions affect just a small neighborhood near the boundary, this shortcoming does not seem to be very serious. The convergence rate of a multigrid cycle is mostly defined by the relaxation smoothing factor in the interior of the domain, and therefore, predictions of the Fourier analysis prove to be in amazing agreement with the results of numerical calculations. However, in strongly anisotropic problems on decomposed domains, boundaries have a large effect on the solution in the interior (e.g., on the relaxation stage), making the full-space analysis inadequate. In the extreme case of a very large anisotropy, the behavior of a plane smoother is similar to that exhibited by a one-level overlapping Schwartz method [3]. Below we propose an extended Fourier analysis that is applicable to strongly anisotropic problems on decomposed domains. In this analysis, the discrete equation (1) is considered on a layer (ix, iy, iz) : 0 <_ ix <_ N , - c o < iy, iz < oc. This domain is decomposed by overlapped subdomains in an xline partitioning. The boundaries of all the subdomains are given by the set of planes
orthogonal to the x-coordinate axis. We assume that the solution u_{ix,iy,iz} and the source function f_{ix,iy,iz} have the following form

u_{ix,iy,iz} = U(ix) e^{i(θy iy + θz iz)},   f_{ix,iy,iz} = F(ix) e^{i(θy iy + θz iz)},   ix = 0, ..., N,   (2)
where U and F are (N+1)-dimensional complex-valued vectors. U(ix) and F(ix) represent the corresponding amplitudes of a 2-D Fourier component e^{i(θy iy + θz iz)} (|θy| ≤ π, |θz| ≤ π) at ix. In this way, the original 3-D problem is translated into a 1-D problem in which the frequencies of the Fourier component are considered parameters. When estimating the smoothing factor for the 3-D alternating-direction plane smoother, we analyze only the high-frequency Fourier components (max(|θy|, |θz|) ≥ π/2). We are given values of N, m, n, δ, ε1 and ε2 (at the beginning of Section 3), and we let the integers ix = 0, ..., N enumerate the cells. We also label the blocks with numbers from 1 to m. There are two processes affecting the amplitude of high-frequency error components. The first is the smoothing in the interior of the blocks. This process is well described by the smoothing factor SM derived from the usual full-space Fourier mode analysis (see [1,4,5]). If the problem is essentially isotropic (ε1 = O(1)), then the smoothing factor of the alternating-direction plane relaxation is very small (SM = 1/√125 ≈ 0.089 in the pure isotropic problem), but the overall convergence rate in a V-cycle is worse because it is dictated by the coarse-grid correction. To predict the asymptotic convergence rate in a V-cycle, the two-level analysis should be performed. In problems with a moderate anisotropy in the x direction (ε1 = O(h⁻¹)), the high-frequency error is reduced mainly in the sweep solving z-planes and the smoothing factor approaches the 1-D factor SM = 1/√5 (see [1]). In strongly anisotropic problems (ε1 = O(h⁻²)), this sweep reduces the smooth error components as well, actually solving the problem rather than just smoothing the error.
The second process influencing the high-frequency error is the error propagation from incorrectly specified values at the block boundaries. The distance which this high-frequency boundary error penetrates inside the blocks strongly depends on the anisotropy. This penetration distance can be estimated by considering a semi-infinite homogeneous problem associated with the z-plane sweep. The left-infinite problem stated for the k-th block assumes a zero right-hand side, omits the left boundary condition, and prescribes the interface value:

ε1 U(ix − 1) − c(θy, θz) U(ix) + ε1 U(ix + 1) = 0,   −∞ < ix < rk,   U(rk) = B,   (3)

where c(θy, θz) denotes the θ-dependent diagonal coefficient of the discrete operator (1). The solution of this problem is U(ix) = B λl(θy, θz)^{ix − rk}, ix ≤ rk, where λl(θy, θz) satisfies the characteristic equation

ε1 λ⁻¹ − c(θy, θz) + ε1 λ = 0,   |λ| > 1.   (4)

For high frequencies, under the assumption ε1 ≥ ε2 ≥ 1, |λl(θy, θz)| ≥ Λl = |λl(0, π/2)|. The right-infinite problem is solved similarly, providing |λr(θy, θz)| ≤ Λr = |λr(0, π/2)|, where λr(θy, θz) is the root of equation (4) satisfying |λ| ≤ 1. If the anisotropy is strong (ε1 = O(h⁻²)), both boundaries (far and nearby) affect the error amplitude reduction factor. If the number of blocks m is not too large, then
Table 1. Experimental (ρe) and analytical (ρa) convergence factors of a single 3-D V(1,1)-cycle with block-wise (x,y)-plane Gauss-Seidel smoother (2 × 2 × 2 partition) versus anisotropy strength, width of the overlap and the block size (n = 64 and n = 128).
the corresponding problem includes m coupled homogeneous problems like (3). This multiblock problem can be solved directly. For the two-block partition it results in
_ A~-~-t 2 RF=(A~-6-~ A ~ - A ~ + ~ )"
(5)
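To illustrate how the penetration of this boundary error grows with the anisotropy, the short Python sketch below solves the characteristic equation numerically and prints the decay factor Λ together with the damping accumulated across an overlap of δ = 4 cells. The diagonal coefficient c(θy, θz) = 2ε1 + 2ε2(1 − cos θy) + 2(1 − cos θz) is an assumed stand-in for the symbol of the discrete operator (1), so only the qualitative trend should be read from the output.

```python
import numpy as np

# Hedged sketch: decay of a high-frequency boundary error in the x-line subproblem.
# The coefficient c below is an assumed form of the diagonal of the discrete operator.
def decay_factor(eps1, eps2, ty=0.0, tz=np.pi / 2):
    c = 2*eps1 + 2*eps2*(1 - np.cos(ty)) + 2*(1 - np.cos(tz))
    roots = np.roots([eps1, -c, eps1])          # eps1*lam**2 - c*lam + eps1 = 0
    return np.abs(roots).max()                  # the root with |lam| > 1 (Lambda)

for eps1 in (1.0, 1e2, 1e4, 1e6):
    Lam = decay_factor(eps1, 1.0)
    # rough estimate of the error left after crossing an overlap of delta = 4 cells
    print(f"eps1={eps1:8.0e}  Lambda={Lam:8.4f}  damping over overlap={Lam**-5:9.2e}")
```

As ε1 grows, Λ approaches one, so the boundary error decays more slowly per cell, which is consistent with the observation above that the required overlap grows with the anisotropy strength.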
3.2. Comparison with numerical tests
For simplicity, we consider the V(1,1)-cycle with only z-planes used in the smoothing step. The assumption that ε1 ≥ ε2 ≥ 1 validates this simplification. In numerical calculations for isotropic problems on single-block domains, the asymptotic convergence rate was 0.14 per V(1,1)-cycle, which is very close to the value predicted by the two-level Fourier analysis (0.134). In the case of the domain decomposed into two blocks in each direction, the reduction factor RF can be predicted by means of expression (5). Finally, the formula for the asymptotic convergence rate ρa is

ρa = max(RF, 0.14).   (6)
Table 1 exhibits a representative sample of experiments for the case ε1 >> ε2 = 1. In this table, ρe corresponds to the asymptotic convergence rates observed in the numerical experiments, while ρa is calculated by means of formula (6). The results demonstrate nearly perfect agreement.
4. ISOTROPIC EQUATION DISCRETIZED ON STRETCHED GRIDS
We have also analyzed the case of an isotropic equation discretized on stretched grids. This case is even more favorable, as the convergence rate obtained for the single-block case is maintained on multiblock grids with a very small overlap (δ = 2). Numerical simulations were performed to obtain the experimental convergence rate with respect to the stretching ratio, α. The single-block and multiblock grids (m = 1, 2, and 4) with different overlaps (δ = 0, 2, and 4) were tested. Figure 3 shows the results for a 128³ grid. The results can be summarized in the following two observations:
• With a 2³ partitioning, even the 0-overlap (δ = 0) is sufficient for good convergence rates. The results for a multiblock grid with an overlap of δ = 2 match the results obtained for the single-block anisotropic case; that is, the convergence rate tends towards zero as the anisotropy increases.
• With a 4³ partitioning, results are slightly worse. With the minimal overlap (δ = 0), the convergence rate degrades for finer grids. However, with a larger overlap (δ = 2), the convergence rate again tends towards the convergence rate demonstrated in the single-block and anisotropic cases.
Figure 3. Experimental asymptotic convergence factors, ρe, of one 3-D V(1,1)-cycle with block-wise plane-implicit smoother on a 128³ grid with respect to the stretching ratio (α) for different values of the overlap (δ) and the number of blocks per side (m).
5. BLOCK SMOOTHERS TO FACILITATE PARALLEL PROCESSING
Block-wise plane-implicit relaxation schemes are found to be robust smoothers. They present much better convergence rates than domain decomposition methods. In fact, their convergence rates are bounded above by the convergence rate of a corresponding domain decomposition solver. In common multiblock computational fluid dynamics simulations, where the number of subdomains spanned by a strong anisotropy is low (up to four), textbook multigrid convergence rates can be obtained with a small overlap of cells between neighboring blocks.
Block-wise plane smoothers may also be used to facilitate the parallel implementation of a multigrid method on a single-block (or logically rectangular) grid. In this case there are global lines and planes, and block-wise smoothers are used only for purposes of parallel computing. To get a parallel implementation of a multigrid method, one can adopt one of the following strategies (see, e.g., [6]).
• Domain decomposition: The domain is decomposed into blocks which are independently solved using a multigrid method.
• Grid partitioning: A multigrid method is used to solve the problem on the whole grid, but the operations required by this solve are carried out on the partitioned grid.
Domain decomposition is easier to implement and implies fewer communications (better parallel properties), but it has a negative impact on the convergence rate. On the other hand, grid partitioning implies more communication but it retains the convergence rate of the sequential algorithm (better numerical properties). Therefore, the use of block-wise smoothers is justified to facilitate parallel processing when the problem does not possess a strong anisotropy spanning the whole domain. In such a case, the expected convergence rate (using moderate overlaps at the block interfaces crossed by the strong anisotropy) is similar to the rate achieved with grid partitioning, but the number of communications is considerably lower. Block-wise plane smoothers are somewhere between domain decomposition and grid partitioning and appear to be a good tradeoff between architectural and numerical properties. For the isotropic case, the convergence rate is equal to that obtained with grid partitioning, and it approaches the convergence rate of a domain decomposition method as the anisotropy becomes stronger. Although higher than in domain decomposition, the number of communications is lower than in grid-partitioning algorithms. However, it should be noted that due to the lack of definition of global planes and lines, grid partitioning is not viable in general multiblock grids. REFERENCES
1. I.M. Llorente and N. D. Melson, Robust Multigrid Smoothers for Three Dimensional Elliptic Equations with Strong Anisotropies, ICASE Report 98-37, 1998.
2. A. Brandt and B. Diskin, Multigrid Solvers on Decomposed Domains, in: Domain Decomposition Methods in Science and Engineering, A. Quarteroni, J. Periaux, Yu. A. Kuznetsov and O. Widlund (eds.), Contemp. Math., Amer. Math. Soc., 135-155, 1994.
3. B. Smith, P. Bjorstad and W. Gropp, Domain Decomposition: Parallel Multilevel Methods for Elliptic Partial Differential Equations, Cambridge University Press, 1996.
4. A. Brandt, Multigrid Techniques: 1984 Guide with Applications to Fluid Dynamics, GMD-Studien 85, 1984.
5. P. Wesseling, An Introduction to Multigrid Methods, John Wiley & Sons, New York, 1992.
6. I.M. Llorente and F. Tirado, Relationships between efficiency and execution time of full multigrid methods on parallel computers, IEEE Trans. on Parallel and Distributed Systems, 8, 562-573, 1997.
An Artificial Compressibility Solver for Parallel Simulation of Incompressible Two-Phase Flows
K. Morinishi and N. Satofuka
Department of Mechanical and System Engineering, Kyoto Institute of Technology, Matsugasaki, Sakyo-ku, Kyoto 606-8585, Japan
This paper describes an artificial compressibility method for incompressible laminar two-phase flows and its implementation on a parallel computer. The full two-fluid model equations without phase change are simultaneously solved using a pseudotime variable that is added to the continuity equation of incompressible two-phase flows. Numerical results, obtained for a test problem of solution consistency, agree well with analytic solutions. Applicability of the method is demonstrated for a three dimensional two-phase flow simulation about a U-bend. Efficiency is examined on the Hitachi SR2201 parallel computer using domain decomposition and up to 16 processors.
1. I N T R O D U C T I O N The artificial compressibility method was first introduced by Chorin [1] for obtaining steady-state solutions of the incompressible Navier-Stokes equations. In the method, a pseudotime derivative of pressure is added to the continuity equation of incompressible flows, so that the pressure and velocity fields are directly coupled in a hyperbolic system of equations. The equations are advanced in pseudotime until a divergent-free velocity field is obtained. The method has been successfully used by many authors [2,3] for unsteady incompressible flow simulations as well as steady-state flow simulations. The purposes of this study are to develop an artificial compressibility type method for obtaining steady-state solutions of incompressible laminar two-phase flows and to examine its efficiency and reliability on a parallel computer. The pseudotime derivative of pressure is added to the continuity equation derived from the two mass conservation equations of incompressible two-phase flows. The full two-fluid model equations are simultaneously solved with the pseudotime variable. Several numerical experiments are carried out in order to examine the efficiency and reliability of the method. The consistency of the numerical solution is examined for a simple test problem for which numerical results can be compared with analytic solutions. Applicability of the method is demonstrated for a three dimensional flow simulation in a rectangular cross-section U-bend. Efficiency is examined on the Hitachi SR2201 parallel computer using domain decomposition and up to 16 processors.
2. BASIC EQUATIONS
In this study, we restrict our attention to numerical solutions for the incompressible two-fluid model equations without phase change. The equations can be written as: conservation of mass,

∂(α1)/∂t + ∇·(α1 u1) = 0   (1)

and

∂(α2)/∂t + ∇·(α2 u2) = 0;   (2)

conservation of momentum,

∂(α1 u1)/∂t + ∇·(α1 u1 u1) + (α1/ρ1) ∇p = (1/ρ1) M1 + α1 g + (1/ρ1) ∇·(α1 s1)   (3)

and

∂(α2 u2)/∂t + ∇·(α2 u2 u2) + (α2/ρ2) ∇p = (1/ρ2) M2 + α2 g + (1/ρ2) ∇·(α2 s2);   (4)

where αk denotes the volume fraction of phase k, uk the velocity, p the pressure, ρk the density, sk the viscous shear tensor, Mk the interfacial momentum transfer, and g the acceleration due to gravity. The volume fractions of the two phases must satisfy

α1 + α2 = 1.   (5)
For the interfacial momentum transfer, simple drag forces of bubbly flows are assumed. The drag forces are computed using drag coefficients CD derived from the assumption of spherical particles with uniform radii [4]. These are obtained for the dispersed phase 2 as:

M2 = (3 ρ1 α1 α2 CD / (4 db)) |u1 − u2| (u1 − u2)   (6)

and

CD = (24/Reb) (1 + 0.15 Reb^0.687),   (7)

where db is the bubble diameter and Reb the bubble Reynolds number. The interfacial momentum transfer for the continuous phase 1 is obtained with:

M1 = −M2.   (8)

For laminar two-phase flows of Newtonian fluids, the average stress tensors may be expressed as those for laminar single-phase flows:

sk,ij = μk ( ∂uk,i/∂xj + ∂uk,j/∂xi − (2/3) δij ∇·uk ),   (9)

where μk is the molecular viscosity of phase k and δij is the Kronecker delta.
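A small Python sketch of the drag closure, as reconstructed in Eqs. (6)-(8) above, is given below. The numerical inputs are arbitrary nondimensional values, and in the actual solver Reb would be evaluated from the local relative velocity and bubble diameter rather than passed in directly.

```python
import numpy as np

# Minimal sketch of Eqs. (6)-(8) as reconstructed above (inputs are illustrative only).
def drag_coefficient(Re_b):
    return 24.0 / Re_b * (1.0 + 0.15 * Re_b**0.687)            # Eq. (7)

def interfacial_transfer(rho1, a1, a2, u1, u2, d_b, Re_b):
    C_D = drag_coefficient(Re_b)
    rel = u1 - u2
    M2 = 3.0 * rho1 * a1 * a2 * C_D / (4.0 * d_b) * np.linalg.norm(rel) * rel   # Eq. (6)
    return -M2, M2                                              # (M1, M2), Eq. (8)

M1, M2 = interfacial_transfer(rho1=1.0, a1=0.8, a2=0.2,
                              u1=np.array([1.0, 0.0]), u2=np.array([0.5, 0.0]),
                              d_b=0.05, Re_b=50.0)
print(M1, M2)
```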
3. ARTIFICIAL COMPRESSIBILITY
By combining the mass conservation equations (1) and (2) with the volume fraction constraint of Eq. (5), the continuity equation for the incompressible two-phase flows is obtained as:

∇·(α1 u1 + α2 u2) = 0.   (10)

The artificial compressibility relation is introduced by adding a pseudotime derivative of pressure to the continuity equation so that the pressure and velocity fields are directly coupled in a hyperbolic system of equations:

∂p/∂τ + β ∇·(α1 u1 + α2 u2) = 0,   (11)

where τ denotes the pseudotime variable and β an artificial compressibility parameter. As the solution converges to a steady state, the pseudotime derivative of pressure approaches zero, so that the original continuity equation of the incompressible two-phase flows is recovered. Since the transitional solution of the artificial compressibility method does not satisfy the continuity equation, the volume fraction constraint may be violated in the progress of pseudotime marching. Therefore a further numerical constraint is introduced in the mass conservation equations as:

∂(α1)/∂τ + ∇·(α1 u1) = α1 ∇·(α1 u1 + α2 u2)   (12)

and

∂(α2)/∂τ + ∇·(α2 u2) = α2 ∇·(α1 u1 + α2 u2).   (13)

If the constraint of volume fraction is initially satisfied, the pseudotime derivatives of α1 and α2 satisfy the following relation:

∂(α1)/∂τ + ∂(α2)/∂τ = 0.   (14)

Thus the volume fraction constraint is always satisfied in the progress of pseudotime marching. Once the artificial compressibility solution converges to a steady-state solution, Eq. (10) is effective, so that Eqs. (12) and (13) reduce to their original equations (1) and (2), respectively. Equations (11)-(13) and (3)-(4) are simultaneously solved using cell-vertex non-staggered meshes. The convection terms are approximated using second-order upwind differences with minmod limiters. The diffusion terms are approximated using second-order centered differences. The resultant system of equations is advanced in the pseudotime variable until the normalized residuals of all the equations drop by four orders of magnitude. An explicit 2-stage rational Runge-Kutta scheme [5] is used for the pseudotime advancement. Local time stepping and residual averaging are adopted for accelerating the pseudotime convergence process to the steady-state solution.
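A minimal illustration of one pseudotime update of Eqs. (11)-(13) is sketched below in Python. It freezes the velocities, uses forward Euler with central differences on a collocated periodic grid, and is only meant to show how the added terms preserve the volume-fraction constraint; the actual solver uses a 2-stage rational Runge-Kutta scheme, staggered storage and second-order upwinding, so the grid, test fields and parameters here are assumptions.

```python
import numpy as np

# Hedged 2-D periodic sketch of one pseudotime step of Eqs. (11)-(13), velocities frozen.
n, h, dtau, beta = 64, 1.0 / 64, 1.0e-3, 5.0
x = np.arange(n) * h
u1x = np.sin(2*np.pi*x)[:, None] * np.ones(n)        # phase-1 velocity (test data)
u1y = np.cos(2*np.pi*x)[None, :] * np.ones((n, 1))
u2x, u2y = 0.5*u1x, 0.5*u1y                          # phase-2 velocity (arbitrary)
a1 = 0.5 + 0.1*np.sin(2*np.pi*x)[:, None]*np.cos(2*np.pi*x)[None, :]
a2 = 1.0 - a1                                        # constraint (5) holds initially
p = np.zeros((n, n))

def div(vx, vy):                                     # periodic central-difference divergence
    return ((np.roll(vx, -1, 0) - np.roll(vx, 1, 0)) +
            (np.roll(vy, -1, 1) - np.roll(vy, 1, 1))) / (2*h)

dmix = div(a1*u1x + a2*u2x, a1*u1y + a2*u2y)         # div(alpha1 u1 + alpha2 u2)
p  -= dtau * beta * dmix                             # Eq. (11)
a1 += dtau * (-div(a1*u1x, a1*u1y) + a1*dmix)        # Eq. (12)
a2 += dtau * (-div(a2*u2x, a2*u2y) + a2*dmix)        # Eq. (13)
print(abs(a1 + a2 - 1.0).max())                      # Eq. (14): preserved to round-off
```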
4. PARALLEL IMPLEMENTATION
Numerical experiments of the artificial compressibility method for the incompressible two-phase flows were carried out on the Hitachi SR2201 parallel computer of Kyoto Institute of Technology. The system has 16 processors which are connected by a crossbar network. Each processor consists of a 150MHz PA-RISC chip, a 256MB memory, 512KB data cache, and 512KB instruction cache. The processor achieves peak floating point operations of 300 MFLOPS with its pseudo-vector processing system. For implementation of the artificial compressibility method on the parallel computer, the domain decomposition approach is adopted. In the approach, the whole computational domain is divided into a number of subdomains. Each subdomain should be nearly equal size for load balancing. Since the second order upwind differences are used for the convection terms of the basic equations, two-line overlaps are made at the interface boundaries of subdomains. The message passing is handled with express Parallelware.
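The two-line overlap exchange described above can be pictured with the hypothetical MPI-based sketch below (the paper itself uses Express Parallelware rather than MPI, and the subdomain sizes, tags and 1-D decomposition are illustrative assumptions).

```python
from mpi4py import MPI
import numpy as np

# Hedged sketch of a two-ghost-line exchange for a 1-D decomposition of one variable.
comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()
left = rank - 1 if rank > 0 else MPI.PROC_NULL
right = rank + 1 if rank < size - 1 else MPI.PROC_NULL

ni, nj = 32, 33                        # interior mesh lines per subdomain (assumed)
q = np.zeros((ni + 4, nj))             # two ghost lines on each side for the upwind stencil
q[2:-2, :] = float(rank)               # fill the interior with some data

# send the two interior lines next to each interface and receive into the ghost lines;
# Sendrecv pairs the messages so the exchange cannot deadlock
comm.Sendrecv(q[-4:-2, :].copy(), dest=right, sendtag=0,
              recvbuf=q[0:2, :], source=left, recvtag=0)
comm.Sendrecv(q[2:4, :].copy(), dest=left, sendtag=1,
              recvbuf=q[-2:, :], source=right, recvtag=1)
```

Run with, e.g., `mpiexec -n 4 python halo.py`; subdomains at the ends of the decomposition skip the missing exchange automatically through MPI.PROC_NULL.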
Figure 1. Consistency test case.
5. NUMERICAL RESULTS
5.1. Consistency Test of Solution
A simple test case proposed in [6] is considered to check the consistency of the artificial compressibility method for incompressible two-phase flows. A square box, initially filled with 50% gas and 50% liquid evenly distributed within the total volume, is suddenly placed in a gravitational field as shown in Fig. 1. The flow field is modeled using a 21 × 21 uniform mesh. Free-slip boundary conditions are used at the walls. For simplicity, the following nondimensional values of density and viscosity are used for the two fluids:

ρ1 = 1.0, μ1 = 1.0;   ρ2 = 0.001, μ2 = 0.01.

The solution is advanced until the residuals of all the equations drop by four orders of magnitude. Steady-state distributions obtained for the volume fraction α1 and normalized
Figure 2. Volume fraction compared with analytical solution.
Figure 3. Normalized pressure data compared with analytical solution.
pressure are plotted in Figs. 2 and 3, respectively. Analytic solutions are also plotted in the figures for comparison. The numerical results agree well with the analytic solutions.
5.2. 2-D Plane U-duct Flow
The solution method is applied to a two-phase flow through a plane U-duct. Figure 4 shows the model geometry of the plane U-duct. The ratio of the radius of curvature of the duct centerline to the duct width is 2.5. The Reynolds number based on the duct width is 200. The flow field is modeled without gravitational effects using a 129 × 33 H-mesh. At the inlet, fully developed velocity profiles with a constant pressure gradient are assigned for a mixed two-fluid flow with the following nondimensional values:

α1 = 0.8, ρ1 = 1.0, μ1 = 1.0;   α2 = 0.2, ρ2 = 0.001, μ2 = 0.01.
These conditions are also used for the initial conditions of the pseudotime marching. The solution is advanced until the residuals of all the equations drop by four orders of magnitude. Flow rates obtained for both phases are plotted in Fig. 5. The conservation of the flow rate is quite good throughout the flow field. Figure 6 shows the volume fraction contours of the heavy fluid al. Within the bend, phase separation is observed due to centrifugal forces that tend to concentrate the heavy fluid toward the outside of the bend. Parallel performance on the SR2201 is shown in Fig. 7. The domain decomposition approach in streamwise direction is adopted for the parallel computing. About 7 and 9 times speedups are attained with 8 and 16 processors, respectively. The performance with 16 processors is rather poor because the number of streamwise mesh points is not enough to attain the high performance. ( About 13 times speedup is attained with 16 processors using a 257 x 33 H-mesh. )
5.3. 3-D U-duct Flow
The numerical experiment is finally extended to a three-dimensional flow through the rectangular cross-section U-duct. The flow conditions are similar to those of the two-dimensional flow case. The flow field is modeled without gravitational effects using a 129 × 33 × 33 H-mesh. Figure 8 shows the volume fraction contours of the heavy fluid α1. The volume fraction contours at the 45° and 90° cross sections of the bend are shown in Figs. 9 and 10, respectively. Within the bend, secondary flow effects in addition to the phase separation due to centrifugal forces are observed. The parallel performance of this three-dimensional case is shown in Fig. 11. The speedup ratios with 8 and 16 processors are 6.7 and 9.2, respectively. Again, the performance with 16 processors is rather poor because the number of streamwise mesh points is not sufficient to attain high performance.
6. CONCLUSIONS
The artificial compressibility method was developed for the numerical simulation of incompressible two-phase flows. The method can predict phase separation of two-phase flows. The numerical results obtained for the consistency test agree well with the analytic solutions. The implementation of the method on the SR2201 parallel computer was carried out using the domain decomposition approach. About 9 times speedup was attained with 16 processors. It was found that the artificial compressibility solver is effective and efficient for the parallel computation of incompressible two-phase flows.
7. ACKNOWLEDGMENTS
This study was supported in part by the Research for the Future Program (97P01101) from Japan Society for the Promotion of Science and a Grant-in-Aid for Scientific Research (09305016) from the Ministry of Education, Science, Sports and Culture of the Japanese Government. REFERENCES
1. Chorin, A.J., A Numerical Method for Solving Incompressible Viscous Flow Problems, Journal of Computational Physics, Vol. 2, pp. 12-26 (1967). 2. Kwak, D., Chang, J.L.C., Shanks, S.P., and Chakravarthy, S.R., Three-Dimensional Incompressible Navier-Stokes Flow Solver Using Primitive Variables, AIAA Journal, Vol. 24, pp. 390-396 (1986). 3. Rogers, S.E. and Kwak, D., Upwind Differencing Scheme for the Time-Accurate Incompressible Navier-Stokes Equations, AIAA Journal, Vol. 28, pp. 253-262 (1990). 4. Issa, R.I. and Oliveira, P.J., Numerical Prediction of Phase Separation in Two-Phase Flow through T-Junctions, Computers and Fluids, Vol. 23, pp. 347-372 (1994). 5. Morinishi, K. and Satofuka, N., Convergence Acceleration of the Rational RungeKutta Scheme for the Euler and Navier-Stokes Equations, Computers and Fluids, Vol. 19, pp. 305-313 (1991). 6. Moe, R. and Bendiksen, K.H., Transient Simulation of 2D and 3D Stratified and Intermittent Two-Phase Flows. Part I: Theory, International Journal for Numerical Methods in Fluids, Vol. 16, pp. 461-487 (1993).
Figure 4. Geometry of the U-duct.
Figure 5. Flow rate distributions.
Figure 6. Volume fraction contours.
Figure 7. Speedup ratio of 2-D case.
Figure 8. Volume fraction contours.
Figure 9. Volume fraction contours at 45° cross section.
Figure 10. Volume fraction contours at 90° cross section.
Figure 11. Speedup ratio of 3-D case.
Implementation of a Parallel Framework for Aerodynamic Design Optimization on Unstructured Meshes
E.J. Nielsen a*, W.K. Anderson a†, and D.K. Kaushik b‡
aNASA Langley Research Center, MS 128, Hampton, VA 23681-2199
bComputer Science Department, Old Dominion University, Norfolk, VA 23529-0162
1. ABSTRACT
A parallel framework for performing aerodynamic design optimizations on unstructured meshes is described. The approach utilizes a discrete adjoint formulation which has previously been implemented in a sequential environment and is based on the three-dimensional Reynoldsaveraged Navier-Stokes equations coupled with a one-equation turbulence model. Here, only the inviscid terms are treated in order to develop a basic foundation for a multiprocessor design methodology. A parallel version of the adjoint solver is developed using a library of MPI-based linear and nonlinear solvers known as PETSc, while a shared-memory approach is taken for the mesh movement and gradient evaluation codes. Parallel efficiencies are demonstrated and the linearization of the residual is shown to remain valid. 2. I N T R O D U C T I O N As computational fluid dynamics codes have steadily evolved into everyday analysis tools, a large focus has recently been placed on integrating them into a design optimization environment. It is hoped that this effort will bring about an ability to rapidly improve existing configurations as well as aid the designer in developing new concepts. Much of the recent work done in the area of CFD-based design optimization has focused on adjoint methods. This approach has been found to be efficient in aerodynamic problems for cases where the number of design variables is typically large and the number of cost functions and/or flow field constraints is usually small. The adjoint formulation allows rapid computation of sensitivity information using the solution of a linear system of equations, whose size is independent of the number of design variables. Recent examples of this approach can be found in [1-9]. In [8] and [9], a methodology for efficiently computing accurate aerodynamic sensitivity information on unstructured grids is described. A discrete adjoint formulation has been *Resident Research Associate, National Research Council. tSenior Research Scientist, ComputationalModeling and Simulation Branch, Aerodynamics, Aerothermodynamics, and Acoustics Competency. 1:Graduate Student. Also, Mathematics and Computer Science Division, Argonne National Laboratory.
314 employed and the exact linearization of the residual has been explicitly demonstrated. Although the differentiation of the flow solvers has been shown to be highly accurate, a major deficiency uncovered by the work is the excessive CPU time required to determine adjoint solutions as well as other associated tasks such as mesh movement. This drawback has hindered computations on realistically-sized meshes to this point. In an effort to mitigate this expense, the present work is aimed at the parallelization of the various steps of the design process. A previously-developed multiprocessor version of the flow solver is employed, so that the current focus includes modification of the adjoint solver, as well as appropriate treatment of the mesh movement and gradient evaluation codes. The goal of the study is to demonstrate acceptable scalability as the number of processors is increased while achieving results identical to that of the sequential codes. A discussion of the domain decomposition procedure is presented. Speedup figures are established and consistent derivatives are shown. 3. GOVERNING EQUATIONS
3.1. Flow Equations The governing flow equations are the compressible Reynolds-averaged Navier-Stokes equations 1~ coupled with the one-equation turbulence model of Spalart and Allmaras. 11 The present flow solver implementation is known as FUN3D and is available in both compressible and incompressible formulations. 12'13 The solvers utilize an implicit upwind scheme on unstructured meshes. The solution is advanced in time using a backward-Euler time-stepping scheme, where the linear system formed at each time step is solved using a point-iterative algorithm, with options also available for using preconditioned GMRES. 14 The turbulence model is integrated all the way to the wall without the use of wall functions. This solver has been chosen for its accuracy and robustness in computing turbulent flows over complex configurations. 9'15 Although originally a sequential code, a parallel version has recently been constructed for inviscid flow using MPI and PETSc 16 as described in [17]. This implementation utilizes a matrixfree, preconditioned GMRES algorithm to solve the linear system.
3.2. Adjoint and Gradient Equations The discrete adjoint equation is a linear system similar in form to that of the flow equations. A pseudo-time term is added which allows a solution to be obtained in a time-marching fashion using GMRES, much like that used in solving the flow equations. In the current work, all linearizations are performed by hand, and details of the solution procedure can be found in [9]. Once the solution for the costate variables has been determined, the vector of sensitivity derivatives can be evaluated as a single matrix-vector product. 4. DOMAIN DECOMPOSITION METHODOLOGY In the current work, the mesh partitioner MeTiS 18 is used to divide the original mesh into subdomains suitable for a parallel environment. Given the connectivities associated with each node in the mesh and the number of partitions desired, MeTiS returns an array that designates a partition number for each node in the mesh. The user is then responsible for extracting the data structures required by the specific application. Due to the gradient terms used in the reconstruction procedure, achieving second-order accu-
Figure 1. Information required beyond partition boundaries. racy in the flow solver requires information from the neighbors of each mesh point as well as their neighbors. In the present implementation, the gradients of the dependent variables are computed on each mesh partition, then the results are scattered onto neighboring partitions. This approach dictates that a single level of "ghost" nodes be stored on each processor. These ghost nodes that are connected to mesh points on the current partition are referred to as "level-l" nodes. Similarly, the neighbors of level-1 nodes that do not lie on the current partition are designated "level-2" nodes. This terminology is illustrated graphically in Figure 1. The adjoint solver requires similar information; however, unlike the flow solver, residual contributions must be written into off-processor memory locations associated with level-2 mesh points. This implies that a second level of ghost information must be retained along partition boundaries. The gather and scatter operations associated with these off-processor computations for the flow and adjoint solvers are handled seamlessly using the PETSc toolkit described in a subsequent section. Software has been developed to extract the required information from a pre-existing mesh based on the partitioning array provided by MeTiS. This domain decomposition operation is done prior to performing any computations. The user is also able to read in existing subdomains and their corresponding solution files and repartition as necessary. This capability is useful in the event that additional processors become available or processors currently being employed must be surrendered to other users. In addition, software has been developed that reassembles partition information into global files and aids in post-processing the solutions. 5. P A R A L L E L I Z A T I O N Adapting the adjoint solver to the parallel environment has been performed using the MPI message passing standard. In order to expedite code development, the Portable, Extensible Toolkit for Scientific Computation (PETSc)16 has been employed, using an approach similar to that taken in [ 17]. PETSc is a library of MPI-based routines that enables the user to develop parallel tools without an extensive background in the field. The software includes a number of built-in linear and nonlinear solvers as well as a wide range of preconditioning options. To parallelize the mesh movement and gradient evaluation codes, a shared-memory approach
316
has been taken, since the primary hardware to be utilized is a Silicon Graphics Origin 2000 system. In this approach, ghost information is exchanged across partition boundaries by loading data into global shared arrays which are accessible from each processor. Simple compiler directives specific to the Origin 2000 system are used to spawn child processes for each partition in the mesh.
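The level-1/level-2 classification of Section 4 can be pictured with a few lines of Python. The sketch below is illustrative only (the function name and data structures are assumptions, not the authors' software): given the MeTiS node-to-partition array and a node adjacency list, it collects the two ghost levels of one partition; the flow solver needs only the level-1 set, while the adjoint solver also writes into the level-2 set.

```python
# Illustrative sketch: extract level-1 and level-2 ghost nodes of partition p.
def ghost_levels(part, adj, p):
    local = {i for i, pi in enumerate(part) if pi == p}
    level1 = {j for i in local for j in adj[i]} - local
    level2 = {k for j in level1 for k in adj[j]} - local - level1
    return level1, level2

# tiny example: a chain of six nodes split into two partitions [0, 0, 0, 1, 1, 1]
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}
print(ghost_levels([0, 0, 0, 1, 1, 1], adj, 0))   # -> ({3}, {4})
```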
6. SPEEDUP RESULTS 6.1. Adjoint Solver For this preliminary work, the speedup obtained by parallelizing the adjoint solver is demonstrated using an SGI Origin 2000 system. Here, an inviscid test case is run on an ONERA M6 wing. The mesh for this test consists of 357,900 nodes. The surface mesh contains 39,588 nodes, and is shown in Figure 2. The freestream Mach number is 0.5 and the angle of attack is 15
Figure 2. Surface mesh for ONERA M6 wing.
Figure 3. Parallel speedup obtained for the adjoint solver.
2°. The flow solver is converged to machine accuracy prior to solving the adjoint system. The adjoint solution consists of seven outer iterations, each composed of a GMRES cycle utilizing 50 search directions and no restarts. Figure 3 shows the speedup obtained using an increasing number of processors, and it can be seen that the solver demonstrates a nearly linear speedup for this test case.
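For readers unfamiliar with this cycle structure, the stand-in sketch below runs restarted GMRES with 50 search directions for a handful of outer cycles. It uses SciPy and an arbitrary sparse test matrix rather than the PETSc-based adjoint system described here, so it illustrates only the iteration pattern, not the actual solver.

```python
import numpy as np
from scipy.sparse import eye, random as sprand
from scipy.sparse.linalg import gmres

# Assumed test system standing in for the adjoint linear system.
n = 2000
A = eye(n) + 0.1 * sprand(n, n, density=5.0 / n, random_state=0)
b = np.ones(n)

# Seven outer cycles of GMRES(50), mirroring the cycle structure described above.
x, info = gmres(A, b, restart=50, maxiter=7)
print(info, np.linalg.norm(A @ x - b))
```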
6.2. Mesh Movement As the design progresses, the volume mesh must be adapted to conform to the evolving surface geometry. Currently, this is done using a spring analogy as outlined in [8] and [9]. This procedure is also used to generate mesh sensitivity terms required for evaluating the gradient of the cost function. The implementation of the spring approach requires a number of sweeps through the mesh in order to modify the node coordinates throughout the entire field. Furthermore, in the case of evaluating mesh sensitivities, this process must be repeated for each design variable. For
Figure 4. Parallel speedup obtained for the mesh movement code.
Figure 5. Parallel speedup obtained for the gradient evaluation code.
large meshes, this process can be a costly operation. Therefore, the method has been extended to run across multiple processors using a shared-memory approach as outlined earlier. Figure 4 shows the speedup obtained by running the mesh movement procedure on the 357,900-node ONERA M6 mesh using a varying number of processors. It can be seen from the figure that the code exhibits a superlinear behavior. This is believed to be due to improved cache efficiency as the size of the subdomains is reduced.
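As a rough illustration of the spring-analogy sweeps mentioned above, the Python sketch below relaxes interior nodes toward the stiffness-weighted average of their neighbours after the surface nodes have been displaced. The stiffness choice k_ij = 1/|x_i − x_j| and the Jacobi-style sweep are assumptions made for this sketch and are not necessarily the exact scheme of [8] and [9].

```python
import numpy as np

# Hedged sketch of spring-analogy mesh movement on an unstructured node set.
def spring_sweeps(x, adj, prescribed, n_sweeps=100):
    """x: (n, 3) coordinates; adj: dict node -> neighbours; prescribed: boundary mask."""
    for _ in range(n_sweeps):
        x_new = x.copy()
        for i, nbrs in adj.items():
            if prescribed[i]:
                continue                                  # surface nodes keep their new position
            k = np.array([1.0 / np.linalg.norm(x[i] - x[j]) for j in nbrs])
            x_new[i] = (k[:, None] * x[nbrs]).sum(axis=0) / k.sum()   # spring equilibrium
        x = x_new
    return x
```

The same sweep structure, differentiated with respect to the surface displacements, is what produces the mesh sensitivity terms referred to above.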
6.3. GradientEvaluation Once the flow and adjoint solutions have been computed, the desired vector of design sensitivities can be evaluated as a single matrix-vector product. For shape optimization, this procedure requires the linearization of the residual with respect to the mesh coordinates at every point in the field. Again, for large meshes, this computation is quite expensive. For this reason, a sharedmemory approach has been used to evaluate these terms in a parallel fashion. The previously described ONERA M6 mesh is used to demonstrate the speedup of this implementation, and results obtained for the computation of a single sensitivity derivative are shown in Figure 5. It can be seen that the procedure is generally 90-95% scalable for the case examined. Due to the large amount of memory required for the gradient computation, superlinear speedup such as that shown in Figure 4 is not obtained for this case. 7. CONSISTENCY OF LINEARIZATION The accuracy of the linearizations used in the sequential adjoint solver has previously been demonstrated in [8] and [9]. In these references, sensitivity derivatives obtained using the adjoint solver were shown to be in excellent agreement with results computed using finite differences. To confirm that these linearizations remain consistent through the port to the parallel environment, sensitivity derivatives are shown in Table 1 for several design variables depicted in Figure 6, where the geometric parameterization scheme has been described in [8] and [19].
Figure 6. Location of design variables for ONERA M6 wing.
Table 1. Sensitivity derivatives computed using the sequential and parallel versions of the adjoint solver.
Design Variable    Sequential      Parallel (8 CPUs)
Camber #7          -0.241691       -0.241691
Thickness #5       -0.0204348      -0.0204348
Twist #2            0.0129824       0.0129824
Shear #4            0.0223495       0.0223495
Here, the cost function is a linear combination of lift and drag. Results are shown for both the sequential and multiprocessor versions of the codes, using the flow conditions stated in the previous discussion. For the parallel results, eight processors are utilized. It can be seen that the derivatives are in excellent agreement. 8. S U M M A R Y
A methodology for performing inviscid aerodynamic design optimizations on unstructured meshes has been described. The approach utilizes the PETSc toolkit for the flow and adjoint solvers, in addition to a shared-memory approach for the mesh movement and gradient evaluation codes. Speedup results have been demonstrated for a large test case, and the linearizations have been shown to remain valid.
9. ACKNOWLEDGMENTS
The authors would like to thank David Keyes for his valuable help and suggestions on the PETSc implementations of the flow and adjoint solvers. 10. REFERENCES
1. 2. 3. 4.
5. 6. 7. 8. 9.
10. 11. 12. 13.
14.
15.
16. 17.
Anderson, W.K., and Bonhaus, D.L., "Aerodynamic Design on Unstructured Grids for Turbulent Flows," NASA TM 112867, June 1997. Anderson, W.K., and Venkatakrishnan, V., "Aerodynamic Design Optimization on Unstructured Grids with a Continuous Adjoint Formulation," AIAA Paper 97-0643, January 1997. Jameson, A., Pierce, N.A., and Martinelli, L., "Optimum Aerodynamic Design Using the Navier-Stokes Equations," AIAA Paper 97-0101, January 1997. Reuther, J., Alonso, J.J., Martins, J.R.R.A., and Smith, S.C., "A Coupled Aero-Structural Optimization Method for Complete Aircraft Configurations," AIAA Paper 99-0187, January 1999. Elliott, J., and Peraire, J., "Aerodynamic Optimization on Unstructured Meshes with Viscous Effects," AIAA Paper 97-1849, June 1997. Soemarwoto, B., "Multipoint Aerodynamic Design by Optimization," Ph.D. Thesis, Delft University of Technology, 1996. Mohammadi, B., "Optimal Shape Design, Reverse Mode of Automatic Differentiation and Turbulence," AIAA Paper 97-0099, January 1997. Nielsen, E.J., and Anderson, W.K., "Aerodynamic Design Optimization on Unstructured Meshes Using the Navier-Stokes Equations," AIAA Paper 98-4809, September 1998. Nielsen, E.J., "Aerodynamic Design Sensitivities on an Unstructured Mesh Using the Navier-Stokes Equations and a Discrete Adjoint Formulation," Ph.D. Thesis, Virginia Polytechnic Institute and State University, 1998. White, EM., Viscous Fluid Flow, McGraw-Hill, New York, 1974. Spalart, ER., and Allmaras, S.R., "A One-Equation Turbulence Model for Aerodynamic Flows," AIAA Paper 92-0439, January 1992. Anderson, W.K., and Bonhaus, D.L., "An Implicit Upwind Algorithm for Computing Turbulent Flows on Unstructured Grids," Computers and Fluids, Vol. 23, No. 1, 1994, pp. 1-21. Anderson, W.K., Rausch, R.D., and Bonhaus, D.L., "Implicit/Multigrid Algorithms for Incompressible Turbulent Flows on Unstructured Grids," Journal of Computational Physics, Vol. 128, 1996, pp. 391-408. Saad, Y., and Schultz, M.H., "GMRES: A Generalized Minimal Residual Algorithm for Solving Nonsymmetric Linear Systems," SIAM Journal of Scientific and Statistical Computing, Vol. 7, July 1986, pp. 856-869. Anderson, W.K., Bonhaus, D.L., McGhee, R., and Walker, B., "Navier-Stokes Computations and Experimental Comparisons for Multielement Airfoil Configurations," AIAA Journal of Aircraft, Vol. 32, No. 6, 1995, pp. 1246-1253. Balay, S., Gropp, W.D., McInnes, L.C., and Smith, B.F. The Portable, Extensible Toolkit for Scientific Computing, Version 2.0.22, h t t p : //www. mcs. a n l . g o v / p e t s c , 1998. Kaushik, D.K., Keyes, D.E., and Smith, B.F., "On the Interaction of Architecture and Algorithm in the Domain-Based Parallelization of an Unstructured Grid Incompressible Flow Code," Proceedings of the 10th International Conference on Domain Decomposition Meth-
320 ods, American Mathematical Society, August 1997, pp. 311-319. 18. Karypis, G., and Kumar, V., "A Fast and High Quality Multilevel Scheme for Partitioning Irregular Graphs," SIAM Journal of Scientific Computing, Vol. 20, No. 1, 1998, pp. 359392. 19. Samareh, J., "Geometry Modeling and Grid Generation for Design and Optimization," ICASE/LaRC/NSF/ARO Workshop on Computational Aerosciences in the 21st Century, April 22-24, 1998.
Parallel Computational Fluid Dynamics Towards Teraflops, Optimizationand Novel Formulations D. Keyes, A. Ecer, J. Periaux,N. Satofukaand P. Fox (Editors) 2000 Elsevier Science B.V. All rights reserved.
321
Validation of a Parallel Version of the Explicit Projection Method for Turbulent Viscous Flows Stefan Nilsson ~* ~Department of Naval Architecture and Ocean Engineering, Chalmers University of Technology, Chalmers tvgrgata 8, 412 96 G6teborg, Sweden
The projection algorithm has, since it was proposed by Chorin in the late '60s, been applied to a great variety of viscous, incompressible flow situations. Here we will describe the numerical validation of a version of it implemented in Fortran90, and executing in parallel using calls to the MPI library. The parallel projection solver is part of a numerical model used for simulation of active suspension feeders in turbulent boundary layers.
1. I N T R O D U C T I O N In a joint project with marine biologists, the use of computational fluid dynamics in modeling of transport processes in coastal waters is investigated. The present work concerns a numerical model describing the flow field above bivalve beds [1]. Large, densely populated beds of bivalves exist on the sea bottoms along the coasts of Europe, consisting mainly of blue mussels, oysters, cockles, and sand clams. Being so-called active suspension feeders, they feed on particulate matter suspended in the sea water which they collect by pumping. The nutritious particles are collected at the gills of the bivalves and the remainder is pumped out again. By their filtering of the sea water, they are believed to have a large impact on the transport of particulate matter in shallow coastal systems. The fluid dynamic part of the model consists of a turbulent boundary layer along the sea bottom with in- and outflow boundary conditions to model the bivalves, which so far are assumed to be buried in the sea bottom. This is then coupled to transport equations for particles suspended in the sea water. The latter are not further considered in this paper. Different algorithms are used to model time-dependent and steady state flows. However, in both cases the flow is assumed to be described by the Reynolds averaged Navier-Stokes equations. In this paper we consider the validation of the implementation of the algorithms used to solve the time-dependent problem on parallel computers. *This work has been sponsored by the MASTEC (MArine Science and TEChnology) project.
322
2. U S E D A L G O R I T H M S We first give the algorithms used. We start with the basic time stepping algorithm, and then briefly describe the turbulence model. All programming has been done using Fortran90. 2.1. Basic algorithm To advance the time-dependent solution in time, an explicit version of the so-called projection algorithm is used [2], [3]. We define a projection operator 7) whose effect of on an arbitrary vector field U is to extract the divergence free part U D, i.e. 7)U = U D (see [7]). Applying this to the unsteady, Reynolds averaged, incompressible Navier-Stokes equations and noting that the pressure gradient is "orthogonal" to the velocity vector for incompressible flows, we get OU = 7 ) ( - V 9( U U ) + u A U + V - T ) , Ot where U - (U, V, W) T is the velocity vector and T is the Reynolds stress tensor. Semidiscretizing in time we can advance the solution temporally in two separate steps 1)
US un+l
2)
__
V
7t
5t
= ( - V . ( U U ) + , A U + V . W),
__ U * (~t
--- - - v c n + l '
(1)
where Ar n+l = V . U*/St. The second step is equivalent to the application of 7) to U*. In our implementation, these steps are used repeatedly to form the classical fourth order explicit Runge-Kutta method, chosen due to its good stability properties. Spatial discretization of the right hand-side of (1) and of the Poisson equation for r is done using central finite differences on staggered grids for the velocity components and the scalar variables. The modeling of T is described below. The reasons for using an explicit formulation are two-fold (at least). Firstly, because of its simplicity, both in implementation and computation, and secondly, because the correct boundary conditions for U* cannot be satisfied, introducing a numerical boundary layer at no-slip boundaries for implicit time-stepping methods (see [5]). The latter problem can be by-passed at the cost of some extra complexity when implementing boundary conditions for implicit methods [9]. 2.2. Turbulence modeling To model T, a non-linear k - e model with low-Re modifications, developed by Craft et al. [6], is used. In this model there is a cubic relation between the Reynolds tensor T, the mean strain-rate tensor S and the mean rotation tensor ft. The turbulent energy k and the "isotropic" dissipation of it ~ are described by two additional transport equations solved simultaneously with the momentum equations in the Runge-Kutta time stepping algorithm. These as well as the expression for T have been designed to be solvable through the entire boundary layer region, and to handle a range
323 of complex flows. For completeness, the k and 2: equations are given here. Expressions for the various coefficients and source terms can be found in [6]. Ok
+ v.
(uk) -
+ v.
[(, +
Ot 02: --
Ot
-t- ~ " (g~')
2: --
eel-s
g:2 -- ce2--ff + E + Yc + V . [(u + ut/e~e)Vg]
2.3. P a r a l l e l i z a t i o n
We are interested in obtaining U as well as the turbulent quantities k and 2:in a logically cubic domain in three dimensions. The numerical solution is to be obtained using parallel computers. The parallelization has been done using domain decomposition. The domain is three-dimensionally decomposed into pl • P2 x P3 sub-domains and these are mapped onto a virtual cartesian processor topology. Each sub-domain is also padded with extra memory at the boundaries to store values sent from neighbouring domains. Message passing and other issues related to the parallel implementation of the algorithms has been done using calls to MPI functions. The solution of the linear equation system arising from the discretization of the Poisson equation for r is done using a Preconditioned Conjugate Gradient (PCG) method obtained from the Aztec library of parallel, Krylov sub-space methods from Sandia National Laboratory [4], which also is built upon MPI. 3. V E R I F I C A T I O N
OF I M P L E M E N T A T I O N
Before the Navier-Stokes solver was used together with the rest of the numerical model, it had to be verified that it had been implemented correctly. The Reynolds averaged Navier-Stokes equations together with the k - e turbulence model gives a system of the form 0S = QS, Ot
(2)
where S = (U, V, W, k, g)r, and Q is a non-linear operator which includes the effect of the projection operator 7). Spatial discretization of the right-hand side of (2) has been done using second-order central finite differences. Boundary conditions of both Dirichlet and Neumann type are used, and due to the staggered grids used, some variables need to be extrapolated from the boundary. To verify that second-order accuracy is indeed obtained throughout the computational domain, including boundaries, comparisons have been made with known analytical solutions of a modified version of (2). To get a system with known analytical solutions, a forcing function F is added to (2). F is obtained as F=
0E Ot
QE,
where E is a known field. The modified system 0S = QS+F, Ot
324
Table 1 Discretization errors on different grids for the h-1/10 h-1/20 U 0.0085 0.0022 V 0.0057 0.0016 W 0.012 0.0032 k 0.0049 0.0013 0.0067 0.0016
whole system h-l/n0 h-1/80 0.00054 0.00013 0.00039 0.000097 0.00077 0.00019 0.00031 0.000082 0.00044 0.00012
then has the solution S = E. Thus, knowing the expected values, we can compare it to the numerical solution in order to verify that the errors decrease with finer discretization in space and time, according to the difference scheme used. We did this, first for a timeindependent field to isolate eventual errors in the implementation of spatial differences and extrapolations. The field chosen was U
=
sin(x)sin(y)sin(z),
Y
=
- cos(x)cos(y)sin(z),
W
=
2 cos(x)sin(y)cos(z),
k
=
sin2(x)sin2(y)sin2(z) + 1,
=
cos 2(x) cos 2(y) cos 2(z) + 1,
and the extent of the domain was varied to test different boundary conditions. Note that E fulfils incompressibility for the Reynolds averaged velocities, as well as positiveness for the turbulent quantities, the latter being a necessary condition for the k and g equations to be well behaved. The resulting numerical solution was computed with the grid spacing h equal to 1/10, 1/20, 1/40, and 1/80 in all directions. As the magnitude of the different terms may vary a great deal, so that errors in the implementation of one term may be hidden by the others, the different terms have been tested separately as well as together. The resulting maximum absolute errors for U, V, W, k, and g are displayed in Table 1. As can be expected, the errors decrease quadratically with h. When performing these tests, some simplifications of the k - e equations have been done, mostly in the low-Re modifications. 4. P A R A L L E L
SCALABILITY
As the computational domain is logically cubic and structured grids are used, good load-balance is trivial to achieve when using domain decomposition. When testing the parallel efficiency of the implemented model, we have chosen to test the computational part solely, omitting the I/O, preconditioner setup, etc., as the ratio of time taken up by these will vary with the length of time we wish to simulate. Parallel scalability of the code was tested by running it on a varying number of processors on a Cray T3E (DEC Alpha 300MHz) machine. The number of grid points/processor was kept constant and the domain as well as the cartesian processor topology only grew
325 70
I
60 II I
t
I
I
I
I
I
I
I
I
L-
50 40 MFlops/s 30 20 10 0
0
I
I
I
I
20
40
60
80
....
I
I
100
120
P
Figure 1. Parallel scalability results with p from 1 to 128 for: ~ - whole solver, + - second step omitted
in the x-direction, i.e. the total number of grid points was p . 60 x 60 x 60 and the processor topology was p x 1 x 1. As a scalability measure the number of MFlops/s per processor was used. As the computation time is very much dominated by the solution of the discretized Poisson equation, this is essentially a measure of the scalability of the PCG-solver in Aztec. To obtain figures relevant for the rest of the code, it was run with the second step of the projection algorithm omitted on the same problem as above. Testing this fixed size per processor speedup, and assuming that the time to send a message of size m from one processor to another is T + C/m, where T is the communication startup time and/3 the inverse bandwidth, the second test case should scale linearly with the number of processors. The scalability of the first case is limited by the scalar product which is part of the conjugate gradient algorithm, and which necessitates an "all-gather" operation to complete (see [8]) but we did not expect this to show for this rather small p to problem-size ratio. As can be seen in Figure 1 this proved to be (approximately) true. As a second test of the parallel efficiency, we also ran the same tests on a test case where the problem size was independent of the number of processors. These tests were run on a Cray Origin2000 (R10000 195MHz), and the number of grid points was 80 x 80 x 80. A three dimensional decomposition of the domain was used. For this kind of test case, the parallel efficiency will be limited by Amdahl's law, and as can be seen in Figure 2, where the speedup is compared with the naive estimate of p; this was true in our case too. As we are using an explicit time-stepping method, there will be stability restrictions on the length of the time step we can take. To compute the global maximum allowable time step, we need to perform another "all-gather" operation. This puts further restrictions
326 35
I
I
[
5
10
15
[
I
I
20
25
30
30 25 20 Speedup 15 10 5 0
35
P
Figure 2. Speedup obtained with p from 1 to 32 for: ~ - whole solver, + - second step omitted, -p
on the obtainable parallel efficiency. 5. C O N C L U S I O N S The tests with regard to numerical accuracy and parallel efficiency shows that the code performs according to theory. Further work will focus on optimizing single processor performance through better utilization of memory hierarchies, and on finding a good preconditioner for the Poisson system, as the solution of this is by far the most time consuming part of the projection solver. To verify the assumptions made when the numerical model was constructed, the results from the solver will also be compared with experimental data.
REFERENCES 1.
S. Nilsson: Modeling of Suspension Feeders in Benthic Boundary Layers using Computational Fluid Dynamics. Tech. Rep. CHA/NAV/R-99/0064, Dep. Nav. Arch. Ocean Eng., Chalmers U. of Tech. (1999) 2. A.J. Chorin: Numerical Solution of the Navier-Stokes Equations. Math. Comp. (1968) 22 3. A. J. Chorin: On the Convergence of Discrete Approximations to the Navier-Stokes Equations. Math. Comp. (1969) 23 4. R. S. Tuminaro, M. Heroux, S. A. Hutchinson, J. N. Shadid: Official Aztec User's Guide Version 2.1. Tech. Rep. Sand99-8801J, Sandia National Laboratories, Albu-
327
5. 6. 7. 8.
9.
querque, NM 87185 (1998) W. E, J.-G. Liu: Projection Method I: Convergence and Numerical Boundary Layers. SIAM J. Numer. Anal. (1995) 32 T. J. Craft, B. E. Launder, K. Suga: Development and application of a cubic eddyviscosity model of turbulence. Int. J. Heat and Fluid Flow. (1996) 17 P. Colella, E. G. Puckett: Modern Numerical Methods for Fluid Flow (draft of November 8, 1994). f t p : / / b a r k l e y . berkeley, edu/e266/ A. Gupta, V. Kumar, A. Sameh: Performance and Scalability of Preconditioned Conjugate Gradient Methods on Parallel Computers. IEEE Transactions on Parallel and Distributed Systems. (1995) 6 J. Kim, P. Moin: Application of a Fractional-Step Method to Incompressible NavierStokes Equations. Journal of Computational Physics. (1985) 59
Parallel Solution of Multibody Store Separation Problems by a Fictitious Domain Method T.-W. Pan ~, V. Sarin b, R. Glowinski a, J. P~riaux c and A. Sameh b aDepartment of Mathematics, University of Houston, Houston, TX 77204, USA bDepartment of Computer Science, Purdue University, West Lafayette, IN 47907, USA CDassault Aviation, 92214 Saint-Cloud, France The numerical simulation of interaction between fluid and complex geometries, e.g., multibody store separation, is computationally expensive and parallelism often appears as the only way toward large scale of simulations, even if we have a fast Navier-Stokes solver. The method we advocate here is a combination of a distributed Lagrange multiplier based fictitious domain method and operator splitting schemes. This method allows the use of a fized structured finite element grid on a simple shape auxiliary domain containing the actual one for the entire fluid flow simulation. It can be easily parallelized and there is no need to generate a new mesh at each time step right after finding the new position of the rigid bodies. Numerical results of multibody store separation in an incompressible viscous fluid on an SGI Origin 2000 are presented.
1. FORMULATION In this article, we consider the numerical simulation of multibody store separation in an incompressible viscous fluid by a distributed Lagrange multiplier/fictitious domain method (see refs. [1, 2]). The motion of the rigid body, such as the NACA0012 airfoil, is not known a priori and is due to the hydrodynamical forces and gravity. In the simulation we do not need to compute the hydrodynamical forces explicitly, since the interaction between fluid and rigid bodies is implicitly modeled by the global variational formulation at the foundation of the methodology employed here. This method offers an alternative to the ALE methods investigated in [3], [4], and [5]. Let us first describe the variational formulation of a distributed Lagrange multiplier based fictitious domain method. Let ft C IRd(d = 2, 3; see Figure 1 for a particular case where d = 2) be a space region; for simplicity we suppose that f~ is filled with a Newtonian viscous incompressible fluid (of density pf and viscosity ul) and contains a moving rigid body B of density p~; the incompressible viscous flow is modeled by the Navier-Stokes equations and the motion of the ball is described by the Euler's equations (an almost
330
Figure 1" An example of two-dimensional flow region with one rigid body. direct consequence of Newton's laws of motion). With the following functional spaces Wgo( t ) - {v[v E Hl(ft) a, v -
go(t) on F},
L2(f2) - {qlq 6 L2(ft), J~ qdx - 0},
A(t) - Hl(B(t)) a,
the fictitious domain formulation with distributed Lagrange multipliers for flows around freely moving rigid bodies (see [2] for detail) is as follows
For a.e. t > 0, find {u(t),p(t),VG(t), G(t),w(t),)~(t)} such that u(t) 6 Wgo(t ), p(t) 6 L~)(f2), VG(t) 6 IRd, G(t) 6 IRa, to(t) 6 IR3, A(t) e A(t)
and pf
-0-~-vdx+p;
(u. V ) u - v d x -
+eu/fD ( u ) " D(v)dx
-
Jr2
+(1 - P-Z)[M dVG. Y +
p~
dt
< )~, v - Y
pV-vdx -
0 x Gx~ >A(t)
dto + w x Iw). 0] (I -d-t-
-
(1 - PA)Mg .ps Y + p~~g. vdx, Vv e/-/~ (a)", VY e Ia ~, VO e Ia ~,
f
q V - u ( t ) d x - 0, Vq 6 L2(f2),
dG
dt
= v~,
(1)
(2) (3)
< tt, u(t) - VG(t) -- w(t) • G(t)x~ >A(,)-- 0, Vtt 6 A(t),
(4)
v~(o)
(5)
- v ~ ~(o)=
~o
G(o) - G~
u(x, 0) -- no(x), Vx 6 f2\B(0) and u(x, 0 ) - V ~ + w ~ x G~ ~, Vx 6 B(0).
(6)
331
In (1)-(6) , u ( = {ui} d i=1) and p denote velocity and pressure respectively, ,k is a Lagrange multiplier, D(v) = ( V v + Vvt)/2, g is the gravity, V ~ is the translation velocity of the mass center of the rigid body B, w is the angular velocity of B, M is the mass of the rigid body, I is the inertia tensor of the rigid body at G, G is the center of mass of B; w ( t ) - {cdi(t) }ia=l and 0 - {0i }i=1 a if d - 3, while co(t)- {0, 0,w(t)} and 0 - {0, 0,0} P
if d - 2. From the rigid body motion of B, go has to s a t i s f y / g o " ndF - 0, where
n
Jl
denotes the unit vector of the outward normal at F (we suppose the no-slip condition on OB). We also use, if necessary, the notation r for the function x --+ g)(x, t).
Remark 1. The hydrodynamics forces and torque imposed on the rigid body by the fluid are built in (1)-(6) implicitly (see [2] for detail), hence we do not need to compute them explicitly in the simulation. Since in (1)-(6) the flow field is defined on the entire domain f~, it can be computed with a simple structured grid. Then by (4), we can enforce the rigid body motion in the region occupied by the rigid bodies via Lagrange multipliers. Remark 2. In the case of Dirichlet boundary conditions on F, and taking the incompressibility condition V - U = 0 into account, we can easily show that
D(v)dx-
Vvdx, Vv
w0,
(7)
which, from a computational point of view, leads to a substantial simplification in (1)-(6).
2. A P P R O X I M A T I O N Concerning the space approximation of the problem (1)-(6) by finite element methods, we use PlisoP2 and P1 finite elements for the velocity field and pressure, respectively (see [6] for details). Then for discretization in time we apply an operator-splitting technique la Marchuk-Yanenko [7] to decouple the various computational difficulties associated with the simulation. In the resulting discretized problem, there are three major subproblems: (i) a divergence-free projection subproblem, (ii) a linear advection-diffusion subproblem, and (iii) a rigid body motion projection subproblem. Each of these subproblems can be solved by conjugate gradient methods (for further details, see ref. [2]).
3. PARALLELIZATION For the divergence-free projection subproblems, we apply a conjugate gradient algorithm preconditioned by the discrete equivalent o f - A for the homogeneous Neumann boundary condition; such an algorithm is described in [8]. In this article, the numerical solution of the Neumann problems occurring in the treatment of the divergence-free condition is achieved by a parallel multilevel Poisson solver developed by Sarin and Sameh [9]. The advection-diffusion subproblems are solved by a least-squares/conjugate-gradient algorithm [10] with two or three iterations at most in the simulation. The arising linear
332 systems associated with the discrete elliptic problems have been solved by the Jacobi iterative method, which is easy to parallelize. Finally, the subproblems associated with rigid body motion projection can also be solved by an Uzawa/conjugate gradient algorithm (in which there is no need to solve any elliptic problems); such an algorithm is described in [1] and [2]. Due to the fact that the distributed Lagrange multiplier method uses uniform meshes on a rectangular domain and relies on matrix-free operations on the velocity and pressure unknowns, this approach simplifies the distribution of data on parallel architectures and ensures very good load balancing. The basic computational kernels comprising of vector operations such as additions and dot products, and matrix-free matrix-vector products yield nice scalability on distributed shared memory computers such as the SGI Origin 2000.
4. NUMERICAL RESULTS In this article, the parallelized code of algorithm (1)-(6) has been applied to simulate multibody store separation in a 2D channel with non-spherical rigid bodies. There are three NACA0012 airfoils in the channel. The characteristic length of the fixed NACA0012 airfoil is 1.25 and those of the two moving ones are 1. The xl and x2 dimensions of the channel are 16.047 and 4 respectively. The density of the fluid is pf = 1.0 and the density of the particles is Ps = 1.1. The viscosity of the fluid is vf = 0.001. The initial condition for the fluid flow is u = 0. The boundary condition on 0~ of velocity field is
u/xlx2 l~(
0 (1.0 - e-5~
) - x~/4)
if x 2 - - 2 ,
or, 2,
if x l - - 4 ,
or, 16.047
for t >_ 0. Hence the Reynolds number is 1000 with respect to the characteristic length of the two smaller airfoils and the maximal in-flow speed. The initial mass centers of the three NACA0012 airfoils are located at (0.5, 1.5), (1, 1.25), and (-0.25, 1.25). Initial velocities and angular velocities of the airfoils are zeroes. The time step is A t - 0.0005. The mesh size for the velocity field is hv - 2/255. The mesh size for pressure is hp - 2hr. An example of a part of the mesh for the velocity field and an example of mesh points for enforcing the rigid body motion in NACA0012 airfoils are shown in Figure 2. All three NACA0012 airfoils are fixed up to t - 1. After t - 1, we allow the two smaller airfoils to move freely. These two smaller NACA0012 airfoils keep their stable orientations when they are moving downward in the simulation. Flow field visualizations and density plots of the vorticity obtained from numerical simulations (done on 4 processors) are shown in Figures 3 and 4. In Table 1, we have observed overall algorithmic speed-up of 15.08 on 32 processors, compared with the elapsed time on one processor. In addition, we also obtain an impressive about thirteen-fold increase in speed over the serial implementation on a workstation, a DEC alpha-500au, with 0.5 GB RAM and 500MHz clock speed.
333 ~\
\\ \~\ "~
~xh,\ \ . % \
\
~,,N \ ' , N ~
\
%rx\ x~\\
j" \.:.
\\\'\x~\ \ \ \
\ ,.r,,
\r,, \ ,.r,, ,~, ,N
\ \\\ \ \\ .\ \\ \
\ 9
\\ \
\\
.,x. \ \ \-\
\\ 9\ \ \ \~\
~
\\
\- \ \'x\
\ \ \
.K ,r,. cr,, 4-, \ . a xr\ -~
\\\ \ \ \
\ \ \
x" +
\
N" \ "~2 ,i-x \ . ~
\
\
,.r,, \r-, \ , N @',@
\N.\\
\5,(\
\
\ \\\\
9
~" ~,'
"x]\\ \rx\
\.,.
\
\ \
~\x x
\
\ ~\\
\ ,,>,>,, \\,
\
%\\x~,
\~
~\\
\~x\
\ \ \\\
"4_', \ -,~.:
\ \-,\\
\\ \~>,\\ \ \\\\
\\ \ ,~\, \ \~ . \ \ \
,~ \
, ,',~,~,' \ , ,' \ \ ,
~>, ~,'~','
~ ~ ~,'~,~
x
~ , , x,,\
\
N
\b,\
\
\
\
\
\\ 9
\
~,~ N" \ N" "J2 "~."
~\xx
N" \ N \ N\
:.\
Figure 2. Part of the velocity mesh and example of mesh points for enforcing the rigid body motion in the NACA0012 airfoils with hv - 3/64.
5. CONCLUSION We have presented in this article a distributed Lagrange multiplier based fictitious domain method for the simulation of flow with moving boundaries. Some preliminary experiments of parallelized code have shown the potential of this method for the direct simulation of complicated flow. In the future, our goal is to develop portable 3D code with the ability to simulate large scale problems on a wide variety of architectures. Table 1" Elapsed time/time step and algorithmic speed-up on a SGI Origin 2000 Elapsed Time
Algorithmic speed-up
1 processor*
146.32 sec.
1
2 processors
97.58 sec.
1.50
4 processors
50.74 sec.
2.88
8 processors
27.25 sec.
5.37
16 processors
15.82 sec.
9.25
32 processors
9.70 sec.
15.08
* The sequential code took about 125.26 sec./time step on a DEC alpha-500au.
334
..............................
!iiii!ii!iii!iiiiii!{!{![ !ii !i
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
iiiiiiiiiiiiiii iiiiiiiiiiiiiiiiiiiiii
!
.
.
.
.
:ZZZZ=ZZZZ=ZZZZZZZ=2Z=ZZ=IIZI/=:: .............. ............................................... ...........................................
|
l
,
i
!
.
-I
- - 2
F
-3
.
.
.
.
i
-2
,
i
,
i
i
-i
i
i
i
I
i
0
,
,
i
,
i
1
i
,
,
i
i
2
i,
i
,,
i
i
I
3
Figure 3. Flow field visualization (top) and density plot of the vorticity (bottom) around the NACA0012 airfoils at t - 1 . 5 .
335
iiii;iiiiiiiiii]
-i
-2 -3
-2
-i
0
1
2
3
Figure 4. Flow field visualization (top) and density plot of the vorticity (bottom) around the NACA0012 airfoils at t =2.
336
6. A C K N O W L E D G M E N T S We acknowledge the helpful comments and suggestions of E. J. Dean, V. Girault, J. He, Y. Kuznetsov, B. Maury, and G. Rodin, and also the support of the department of Computer Science at the Purdue University concerning the use of an SGI Origin 2000. We acknowledge also the support of the NSF (grants CTS-9873236 and ECS-9527123) and Dassault Aviation.
REFERENCES [1] R. Glowinski, T.-W. Pan, T. Hesla, D.D. Joseph, J. P~riaux, A fictitious domain method with distributed Lagrange multipliers for the numerical simulation of particulate flows, in J. Mandel, C. Farhat, and X.-C. Cai (eds.), Domain Decomposition Methods 10, AMS, Providence, RI, 1998, 121-137. [2] R. Glowinski, T.W. Pan, T.I. Hesla, and D.D. Joseph, A distributed Lagrange multiplier/fictitious domain method for particulate flows, Internat. J. of Multiphase Flow, 25 (1999), 755-794. [3] H.H. Hu, Direct simulation of flows of solid-liquid mixtures, Internat. J. Multiphase Flow, 22 (1996), 335-352. [4] A. Johnson, T. Tezduyar, 3D Simulation of fluid-particle interactions with the number of particles reaching 100, Comp. Meth. Appl. Mech. Eng., 145 (1997), 301-321. [5] B. Maury, R. Glowinski, Fluid particle flow: a symmetric formulation, C.R. Acad. Sci., S~rie I, Paris, t. 324 (1997), 1079-1084. [6] M.O. Bristeau, R. Glowinski, J. P~riaux, Numerical methods for the Navier-Stokes equations. Applications to the simulation of compressible and incompressible viscous flow, Computer Physics Reports, 6 (1987), 73-187. [7] G.I. Marchuk, Splitting and alternating direction methods, in P.G. Ciarlet and J.L. Lions (eds.), Handbook of Numerical Analysis, Vol. I, North-Holland, Amsterdam, 1990, 197-462. [8] R. Glowinski, Finite element methods for the numerical simulation of incompressible viscous flow. Introduction to the control of the Navier-Stokes equations, in C.R. Anderson et al. (eds), Vortex Dynamics and Vortex Methods, Lectures in Applied Mathematics, AMS, Providence, R.I., 28 (1991), 219-301. [9] V. Sarin, A. Sameh, An efficient iterative method for the generalized Stokes problem, SIAM J. Sci. Comput., 19 (1998), 206-226. 2, 335-352. [10] R. Glowinski, Numerical methods for nonlinear variational problems, SpringerVerlag, New York, 1984.
Parallel ComputationalFluidDynamics Towards Teraflops,Optimizationand Novel Formulations D. Keyes,A. Ecer, J. Periaux,N. Satofukaand P. Fox (Editors) 2000 ElsevierScienceB.V. All rightsreserved.
337
E f f i c i e n t p a r a l l e l - b y - l i n e m e t h o d s in C F D A. Povitsky ~ ~Staff Scientist, ICASE, NASA Langley Research Center, Hampton, VA 23681-2199, e-mail:
[email protected]. We propose a novel methodology for efficient parallelization of implicit structured and block-structured codes. This method creates a parallel code driven by communication and computation schedule instead of usual "creative programming" approach. Using this schedule, processor idle and communication latency times are considerably reduced. 1. I n t r o d u c t i o n Today's trend in Computational Science is characterized by quick shift from the yon Neumann computer architecture representing a sequential machine executed scalar data to MIMD computers where multiple instruction streams render multiple data. For implicit numerical methods, computations at each grid point are coupled to other grid points belonging to the same grid line in a fixed direction but can be done independently of the computations at grid points on all other lines in that direction. The coupling arises from solution of linear banded systems by Gaussian Elimination and leads to insertion of communications inside the computational sub-routines, idle stage of processors and large communication latency time. Traditional ADI methods, high-order compact schemes and methods of lines in multigrid solvers fall into the category of implicit numerical methods. A natural way to avoid far-field data-dependency is to introduce artificial boundary conditions (ABC) at inter-domain interfaces. Nordstrom and Carpenter [1] have shown that multiple interface ABC lead to decrease of a stability range and accuracy for highorder compact schemes. Povitsky and Wolfshtein [2] came to similar conclusions about ADI schemes. Additionally, the theoretical stability analysis is restricted to linear PDEs. Therefore, we do not use ABC for parallelization of a serial code unless ABC arise due to a multi-zone approach (see below). Other methods to solve banded linear systems on MIMD computers include transposed algorithms, concurrent solvers and the pipelined Thomas algorithms (PTA). Matrix transpose algorithms solve the across-processor systems by transposing the data to be node local when solving the banded linear systems. Povitsky (1999) [3] compared estimated parallelization penalty time for transposed algorithms with measurements for PTAs on MIMD computers and concluded that PTAs are superior unless the number of grid nodes per processor is small. Hofhaus and van de Velde [4] investigated parallel performance of several concurrent solvers (CS) for banded linear systems and concluded that the floating-point count is
338 2 - 2.5 greater than that for the PTA. Additionally, implementation of CS in CFD codes requires coding of computational algorithms different from the Thomas algorithm. Thus, we confine ourselves with the Thomas algorithm that has a lowest computational count and widely used in CFD community. However, a parallel emciency for the PTA degrades due to communication latency time. The size of packet of lines solved per message is small due to trade-off between idle and latency times [5], [6]. To reduce this difficulty, we drive processors by schedule and not by waiting of information from neighbors. This schedule allows to use processors for other computations while they are idle from the Thomas algorithm computations in a current spatial direction. 2. M e t h o d o l o g y The proposed methodology of parallelization include the following stages 1. Partition indexes in all spatial directions 2. Compute optimal number of lines solved per message 3. Create processor schedule 4. Run computations by this schedule For Step 1 automatized parallelization tools [11] may be recommended. Otherwise, the annoying partitioning by hand should be done. The optimal number of lines (Step 2) is computed from the theoretical model of parallelization efficiency and represents the trade-off between latency time and parallelization penalty time. This models have been developed for the standard pipelined Thomas algorithms [6]. For the standard PTA, the latency and the processor idle time tradeoff for sets of linear banded systems leads to the following expression [6]: K1--
V/
p ( / ~~--1)'
K2-
V/ P - I '
(1)
where K1 and K2 are the numbers of lines solved per message on the forward and backward steps of the Thomas algorithm, respectively, 7 = bo/g2 is the ratio of the communication latency and the backward step computational time per grid node, and p = gl/g2 is the ratio of the forward and the backward step computational times. For the novel version of PTA, denoted as the Immediate Backward PTA (IB-PTA), the optimal size of packet is given by: N2 KI = 2 ( N d _ I ) , K2 - pKI.
(2)
More details about derivation of the theoretical mode] of parallelization, refinements with regards to KI and /('2 and estimation of para]]e]ization penalty time in terms of O(N) are presented in our another study [8]. The forward and backward steps of the Thomas algorithm include recurrences that span over processors. The main disadvantage of its para]lelization is that during the pipelined process processors will be idle waiting for completion of either the forward or the backward step computations by other processors in row.
339 The standard PTA does not complete lines before processors are idle. Moreover, even data-independent computations cannot be executed using the standard PTA as processors are governed by communications and are not scheduled for other activities while they are idle. The main difference between the IB-PTA and the PTA is that the backward step computations for each packet of lines start immediately after the completion of the forward step on the last processor for these lines. After reformulation of the order of treatment of lines by the forward and backward steps of the Thomas algorithm, some lines are completed before processors stay idle. We use these completed data and idle processors for other computational tasks; therefore, processors compute their tasks in a time-staggered way without global synchronization. A unit, that our schedule addresses, is defined as the treatment of a packet of lines by either forward or backward step computations in any spatial direction. The number of time units was defined on Step 1. At each unit a processor either performs forward or backward step computations or local computations. To set up this schedule, let us define the "partial schedules" corresponding to sweeps in a spatial direction as follows: +1 0 -1
J ( p , i , dir) -
forward step computations processor is idle backward step computations,
(3)
where dir = 1,2, 3 denotes a spatial direction, i is a unit number, p is a number of processor in a processor row in the dir direction. To make the IB-PTA feasible, the recursive algorithm for the assignment of the processor computation and communication schedule is derived, implemented and tested [3]. Partial directional schedules must be combined to form a final schedule. For example, for compact high-order schemes processors should be scheduled to execute the forward step computations in the y direction while they are idle between the forward and the backward step computations in the x direction. The final schedule for a compact parallel algorithm is set by binding schedules in all three spatial directions. Example of such schedule for the first outermost processor (1, 1, 1) is shown in Table 1. In this binded
Table 1 Schedule of processor communication and computations, where i is the number of time unit, T denotes type of computations, (2, 1, 1), (1, 2, 1) and (1, 1, 2) denote communication with corresponding neighbors i
T (2,1,1) (1,2,1) (1,1,2)
...
6
7
8
1 1 0 0
1 0 0 0
-1 3 0 0
9 10 2-1 0 3 0 0 0 0
11 2 0 1 0
12 -1 2 0 0
...
43
44
45
46
47
4 0 0 0
-3 0 0 2
4 0 0 0
-3 0 0 2
4 0 0 0
schedule, computations in the next spatial direction are performed while processors are idle from the Thomas algorithm computations in a current spatial direction, and the
340 for i=l,...,I
( for dir=l,3
( if (Com(p,i, right[dir]) = 1) send FS coefficients to right processor; if (Corn(p, i, right[dir]) = 3) send FS coefficients to right processor and receive BS solution from right processor; if (Corn(p, i, left[dir])= 1) send BS solution to left processor; if (Corn(p, i, left[dir]) = 3) send BS solution to left processor and receive FS coefficients from left processor; if (Corn(p, i, right[dir]) = 2) receive BS solution from right processor; if (Corn(p, i, left[dir]) = 2) receive FS coefficients from left processor;
} for dir=l,3
( if (T(p, i) = dir) do FS computations if (T(p, i) = - d i r ) do BS computations if (T(p, i) = local(dir)) do local directional computations
} if (T(p, i ) = local) do local computations
} Figure 1. Schedule-governed banded linear solver, where right = p + 1 and l e f t = p - 1 denote left and right neighbors, dir - 1, 2, 3 corresponds to x, y, and z spatial directions, T governs computations, Corn controls communication with neighboring processors, p is the processor number, and i is the number of group of lines (number of time unit).
Runge-Kutta computations are executed while processors are idle from the Thomas algorithm computations in the last spatial direction. After assignment of the processor schedule on all processors, the computational part of the method (Step 4) runs on all processors by an algorithm presented in Figure 1. 3. P a r a l l e l c o m p u t a t i o n s The CRAY T3E MIMD computer is used for parallel computations. First, we test a building block that includes PTAs solution of the set of independent banded systems (grid lines) in a spatial direction and then either local computations or forward step computation of the set of banded systems in the next spatial direction. We consider standard PTA, IB-TPA, "burn from two ends" Gaussian elimination (TW-PTA) and the combination of last two algorithms (IBTW-PTA). The parallelization penalty and the size of packet versus the number of grid nodes per processor per direction are presented in Figure 2. For the pipeline of eight processors (Figure 2a) the most efficient IBTW-PTA has less than 100% parallelization penalty if the total number of grid nodes per processor is more than 143 . For the pipeline of sixteen processors (Figure2b) the critical number of grid nodes per processor is 163. For these numbers of nodes, the parallelization penalty for the standard PTA is about 170% and
341
250% for eight and sixteen processors, respectively.
17o I~" 160 ~.150 ~.140 ~-
120 110 100 90 _
i,oor
[ 7o
~9 80 .N
~
70~-
"6 6o
""..3
;~ ~oI o
4""
.~ 8o
N so 4o
"-._
.s
30
S
2 3
1
20 10
15
20
25
Number of grid nodes
0
30
i::I
'~'~
....
~o . . . .
~', . . . .
Number of grid nodes
~'o'
120 ~ 110 ~'100 ~'9OI=80
1ool
~
4"J
50
40 30
s SS
2
s
3
. . . .
20
'~I
10
is
Num~)~r of grid noc~s
30
0
15
20
25
Number of grid nodes
30
Figure 2. Parallelization penalty and size of packet of lines solved per message for pipelined Thomas algorithms; (a) 8 processors; (b) 16 processors; 1-PTA, 2-IB-PTA, 3-TW-PTA, 4-IBTW-PTA
Next example is the 3-D aeroacoustic problem of initial pulse that includes solution of linearized Euler equations in time-space domain. Parallel computer runs on 4 • 4 • 4 = 64 processors with 103- 203 grid nodes per processor show that the parallel speed-up increases 1 . 5 - 2 times over the standard PTA (Figure 3a). The novel algorithm and the standard one are used with corresponding optimal numbers of lines solved per message. The size of packet is substantially larger for the proposed algorithm than that for the standard PTA (Figure 3b). 4. C o n c l u s i o n s We described our proposed method of parallelization of implicit structured numerical schemes and demonstrated its advantages over standard method by computer runs on an MIMD computer. The method is based on processor schedule that control inter-processor
342
~40r 50
50 45 e~40 "6 ~35
64processor~
~30 s_.
o-20
~30
C
25
2,10n
12 14 16 18 Number of Nodes
20
10
12 14 16 18 Number of Nodes
20
Figure 3. Parallelization efficiency for compact 3-D aeroacoustic computations, (a) Speedup; (b) Size of packet. Here 1 denotes the standard PTA, and 2 denotes the binded schedule
communication and order of computations. The schedule is generated only once before CFD computations. The proposed style of code design fully separates computational routines from communication procedures. Processor idle time, resulting from the pipelined nature of Gaussian Elimination, is used for other computations. In turn, the optimal number of messages is considerably smaller than that for the standard method. Therefore, the parallelization efficiency gains also in communication latency time. In CFD flow-field simulations are often performed by so-called multi-zone or multi-block method where governing partial differential equations are discretized on sets of numerical grids connecting at interfacial boundaries by ABC. A fine-grain parallelization approach for multi-zone methods was implemented for implicit solvers by Hatay et al. [5]. This approach adopts a three-dimensional partitioning scheme where the computational domain is sliced along the planes normal to grid lines in a current direction. The number of processors at each zone is arbitrary and can be determined to be proportional to the size of zone. For example, a cubic zone is perfectly (i.e, in load-balanced way) covered by cubic sub-domains only in a case that the corresponding number of processors is cube of an integer number. Otherwise, a domain partitioning degrades to two- or one- directional partitioning with poor surface-to-volume ratio. However, for multi-zone computations with dozens of grids of very different sizes a number of processors associated with a zone may not secure this perfect partitioning. Hatay et al. [5] recommended to choose subdomains with the minimum surface-to-volume ratio, i.e., this shape should be as close to a cube as possible. Algorithms that hold this feature are not available yet and any new configuration requires ad hos partitioning and organization of communication between processors.
343 The binding procedure, that combines schedules corresponding to pipelines in different spatial directions, is used in this study. This approach will be useful for parallelization of multi-zone tasks, so as a processor can handle subsets of different grids to ensure load balance. REFERENCES
1. J. Nordstrom and M. Carpenter, Boundary and Interface Conditions for High Order Finite Difference Methods Applied to the Euler and Navier-Stokes Equations, ICASE Report No 98-19, 1998. 2. A. Povitsky and M. Wolfshtein, Multi-domain Implicit Numerical Scheme, International Journal for Numerical Methods in Fluids, vol. 25, pp. 547-566, 1997. 3. A. Povitsky, Parallelization of Pipelined Thomas Algorithms for Sets of Linear Banded Systems, ICASE Report No 98-48, 1998 (Expanded version will appear in, Journal of Parallel and Distributed Computing, Sept. 1999). 4. J. Hofhaus and E. F. Van De Velde, Alternating-direction Line-relaxation Methods on Multicomputers, SIAM Journal of Scientific Computing, Vol. 17, No 2, 1996,pp. 454-478. 5. F. Hatay, D.C. Jespersen, G. P. Guruswamy, Y. M. Rizk, C. Byun, K. Gee, A multilevel parallelization concept for high-fidelity multi-block solvers, Paper presented in SC97: High Performance Networking and Computing, San Jose, California, November 1997. 6. N.H. Naik, V. K. Naik, and M. Nicoules, Parallelization of a Class of Implicit Finite Difference Schemes in Computational Fluid Dynamics, International Journal of High Speed Computing, 5, 1993, pp. 1-50. 7. C.-T. Ho and L. Johnsson, Optimizing Tridiagonal Solvers for Alternating Direction Methods on Boolean Cube Multiprocessors, SIAM Journal of Scientific and Statistical Computing, Vol. 11, No. 3, 1990, pp. 563-592. 8. A. Povitsky, Parallel Directionally Split Solver Based on Reformulation of Pipelined Thomas Algorithm,ICASE Report No. 98-45, http://www, icase, edu/library/reports/rdp/1998, html#98-45 9. A. Povitsky and M.Visbal, Parallelization of ADI Solver FDL3DI Based on New Formulation of Thomas Algorithm, in Proceedings of HPCCP/CAS Workshop 98 at NASA Ames Research Center, pp. 35-40, 1998. 10. A. Povitsky and P. Morris, Parallel compact multi-dimensional numerical algorithm with application to aeroacoustics, AIAA Paper No 99-3272, 14th AIAA CFD Conference, Norfolk, VA, 1999. II. http://www.gre.ac.uk/-captool 12. T. H. Pulliam and D. S. Chaussee, A Diagonal Form of an Implicit Approximate Factorization Algorithm, Journal of Computational Physics, vol. 39, 1981, pp. 347363. 13. S. K. Lele, Compact Finite Difference Schemes with Spectral-Like Resolution, Journal of Computational Physics, vol. 103, 1992, pp. 16-42.
This Page Intentionally Left Blank
Parallel Computational Fluid Dynamics Towards Teraflops, Optimization and Novel Formulations D. Keyes, A. Ecer, J. Periaux, N. Satofuka and P. Fox (Editors) 92000 Elsevier Science B.V. All rights reserved.
345
A n A u t o m a t a b l e G e n e r i c S t r a t e g y for D y n a m i c L o a d B a l a n c i n g in P a r a l l e l S t r u c t u r e d M e s h C F D Code J. N. Rodrigues, S. P. Johnson, C. Walshaw, and M. Cross Parallel Processing Research Group, University of Greenwich, London, UK.
In order to improve the parallel efficiency of an imbalanced structured mesh CFD code, a new dynamic load balancing (DLB) strategy has been developed in which the processor partition range limits of just one of the partitioned dimensions uses non-coincidental limits, as opposed to coincidental limits. The 'local' partition limit change allows greater flexibility in obtaining a balanced load distribution, as the workload increase, or decrease, on a processor is no longer restricted by the 'global' (coincidental) limit change. The automatic implementation of this generic DLB strategy within an existing parallel code is presented in this paper, along with some preliminary results.
1. I N T R O D U C T I O N
Parallel computing is now widely used in numerical simulation, particularly for application codes based on finite difference and finite element methods. A popular and successful technique employed to parallelise such codes onto large distributed memory systems is to partition the mesh into sub-domains that are then allocated to processors. The code then executes in parallel, using the SPMD methodology, with message passing for inter-processor interactions. These interactions predominantly entail the updating of overlapping areas of the subdomains. The complex nature of the chemical and physical processes involved in many of these codes, along with differing processor performance in some parallel systems (such as workstation clusters), often leads to load imbalance and significant processor idle time. DLB can be used to overcome this by migrating the computational load away from heavily loaded processors, greatly improving the performance of the application code on the parallel system. For unstructured mesh application codes (including finite element codes), the flexibility of the partition in terms of the allocation of individual cells to processors allows a fairly straightforward technique to be used [1]. Graph partitioning tools such as JOSTLE [2] can be used to take an existing partition and re-partition it in parallel by moving cells when necessary to reduce the load imbalance. For structured mesh application codes (such as finite difference codes), the nature of
346 the coding techniques typically used forces rectangular partitions to be constructed if efficient parallel execution is to be achieved (since this allows loop limit alteration in the co-ordinate directions to be sufficient). This work investigates the DLB problem for structured mesh application codes and devises a generic strategy, along with library utilities, to simplify the implementation of DLB in such codes, leading to the automatic implementation of the technique within the Computer Aided Parallelisation Tools (CAPTools) environment [3,4]. The aims of this research are threefold: 9 Develop a generic, minimally intrusive, effective load balancing strategy for structured mesh codes 9 Develop utilities (library calls) to simplify the implementation of such a strategy 9 Automate this strategy in the CAPTools environment to transform previously parallelised message passing codes
2. P O S S I B L E LOAD BAI,ANCING S T R A T E G I E S
Three different load balancing strategies can be used in the context of structured mesh codes, each trying to achieve a good load balance without incurring high communication latencies or major alterations to the source code. In Figure 1, Case 1 forces all partition limits to coincide with those on neighbouring processors, greatly restricting the load balance possible. Case 2, although allowing good load balance, suffers from complicated communications and difficulties in constructing the partition. Case 3, where partition limits are forced to coincide on all but one dimension, allows for good load balance, fairly simple and neat communication patterns, and is relatively straightforward to construct and is therefore selected for the generic strategy. Original .
.
.
.
.
.
Case 1
Case 2
Case3
1 Change globally No change Moderate
2 Change locally Complex Good
3 Mix Relatively simple Good
.
Case: Limits: Communication: Balance:
[[
Figure 1: Alternative DLB partitioning strategies and their implications on parallel performance.
347
3. C O M M U N I C A T I O N S W I T H N O N - C O I N C I D E N T P A R T I T I O N L I M I T S To avoid the parallel code from becoming unwieldy w h e n using this DLB strategy, the details of communications involving d a t a from several processors (due to the staggering of the partition limits) is dealt with w i t h i n library routines. Figure 2 shows a 3D mesh decomposed onto 27 processors with staggered partition limits used in one dimension to allow DLB. To u p d a t e the overlap a r e a shown for processor 6 requires d a t a from processors 2, 5 and 8, w h e r e the details of this communication can be hidden within the new communication call. In the original communication call, a variable of a p a r t i c u l a r length a n d type is received from a specified direction [5]. To h a n d l e the staggered limits in this DLB strategy, the original communication call n a m e is changed (as shown in Figure 2), and two extra p a r a m e t e r s are added on, which d e t e r m i n e the a m o u n t to communicate and whom to communicate with. Internally, c a p _ d l b _ r e c e i v e performs three communications as determined by examining the partition limits on the neighbouring processors. Matching communications (i.e. c a p _ d l b _ s e n d ' s ) on the neighbour processors use the same algorithm to determine which d a t a items are to be sent to their neighbours. C h a n g i n g all of the communications in the application code t h a t are orthogonal to the staggered partition limits into dlb communications, will correctly handle inter-processor communication whilst only slightly altering the application code.
/ / / /
Original communication call: call cap_receive(var,length,type,dir) .....
,D ~m~
/ ""~ / i /
/
/ New dlb communication call:
call cap_dlb_receive(var,length,type,dir, first,stag_stride)
Figure 2: Update of overlap on processor 6 with contributions from processors 2, 5 and 8. Also, the original and new dlb communication calls, the latter of which receives a variable of a particular length and type from a specified direction.
348 4. LOAD MIGRATION TO SET-UP THE NEW PARTITION Another key section of a DLB algorithm is efficient load migration, particularly since this will typically involve migrating data relating to a very large n u m b e r of arrays t h a t represent geometric, physical and chemical properties. The algorithm devised here performs partition limit alteration and data migration one dimension at a time, as shown in Figure 3. Figure 3 shows the data migration process for a 2D problem in which the Up/Down limits appear staggered (i.e. use local processor partition range limits). The load is first migrated in the non-staggered dimension using the old processor limits. This involves calculating the average idle time on each column of processors in Figure 3 and then re-allocating (communicating) the columns of the structured mesh to neighbouring processors to reduce this idle time. Note t h a t the timing is for a whole column of processors r a t h e r t h a n any individual processor in the column. In subsequent re-balances, this will involve communications orthogonal to the staggered partition limits and therefore the previously described dlb communication calls are used. Once all other dimensions have undergone migration, the staggered dimension is adjusted using the new processor limits for all other dimensions, ensuring a minimal movement of data. The new partition limits are calculated by adjusting the partition limits locally within each column of processors to reduce the idle time of those processors. Note t h a t now the individual processor timings are used when calculating the staggered limits. Obviously, the timings on each processor are adjusted during this process to account for migration of data in previous dimensions. The actual data movement is performed again using utility routines to minimise the impact on the appearance of the application code, and has dlb and non-dlb versions in the same way as the communications mentioned in the previous section.
tu
t~,,
t~,,
tl~
t~
t~
t~
t~
t~
TI,~
Tla
T~
T~,
~l ~
~.-~'.:~,
J
TT
Data migration Left/Right (Dim=l)
Data migration Up/Down (Dim=2)
Figure 3: Data migration for a 2D processor topology, with global Left/Right processor partition range limits, and staggered Up/Down processor partition range limits.
349
5. D Y N A M I C L O A D B A I ~ A N C I N G A L G O R I T H M
The previous sections have discussed the utility routines developed to support DLB; this section discusses the DLB approach that is to be used, based on those utilities. The new code that is to be inserted into the original parallel code should be understandable and unobtrusive to the user, allowing the user to m a i n t a i n the code without needing to know the underlying operations of the inserted code in detail. Since m a n y large arrays may be moved, the calculation and migration of the new load should be very fast, meaning that there should only be a m i n i m u m amount of data movement whilst preserving the previous partition. In addition, communications should be cheap, as the cost of moving one array may be heavy. The algorithm used to dynamically load balance a parallel code is: 9 Change existing communication calls, in the non-staggered dimensions, into the new dlb communication calls 9 Insert dynamic load balancing code: 9 Start/Stop timer in selected DLB loop (e.g. time step loop) 9 For each dimension of the processor topology 9 Find the new processor limits 9 Add migration calls for every relevant array 9 Assign new partition range limits 9 Duplicate overlapping communications to ensure valid overlaps exist before continuing
6. R E S U L T S
When solving a linear system of equations using a 200x300 Jacobi Iterative Solver on a heterogeneous system of processors, each processor has the same workload but their speeds (or number of users) differ. In this instance, the load imbalance is associated with the variation between processors, which shall be referred to as 'processor imbalance'. This means that when a fast processor gains cells from a slow processor, then these shall be processed at the processor's own weight, since the weight of the cells are not transferred. Figure 4a shows the timings for a cluster of nine workstations used to solve the above problem, where the load was forcibly balanced after every 1000 th iteration (this was used to test the load balancing algorithm rather than when to rebalance the load). It can be seen t h a t the maximum iteration time has been dramatically reduced, and the timings appear more 'balanced' after each rebalance, improving the performance of the code. Figure 4b shows that the load on the middle processor (the slowest) is being reduced each time, indicating that the algorithm is behaving as expected. When running on a homogeneous system of processors, the timings should be adjusted differently before balancing subsequent dimensions, as the load imbalance is no longer associated with the variation between the processors since their speeds are the same.
350
P r o c e s s o r T i m i n g s b e f o r e Load B a l a n c i n g
123
130
113
1111
131
100
110
119
60
45'
74
91
$1'
99
93
241
1171 127I
113
98,~
200
E 150
100
I-
," 100 o
m 50 0
E
o 0
85
1
2
3
4 5 6 7 Processor Number
8
9
Figure 4a. The processor times used to solve 3000 iterations of the Jacobi Iterative Solver on a 3x3 workstation cluster, in which the load is redistributed after every 1000th iteration.
76
39
138 7<
87
49
1011 7<
97 1
>
64
~
85
157 ~<
53
100 7<
62
Tim~1000I~rllflons = 164~ secs
Time/1000I~raflons
Tim~1000Itera~ons = 34.9secs
Iter= 1000
Iter=2000
Iter=3000
= 37Asecs
)
Figure 4b. The workloads on the 9 processors mentioned in Figure 4a after having balanced the load after every 1000 th iteration, in which the middle processor is the slowest (IPX) and the others are a mixture of Ultras and $20' s.
The S o u t h a m p t o n - E a s t Anglia Model (SEA Code) [6] uses an oceanography model to s i m u l a t e the fluid flow across the globe. Its computational load is n a t u r a l l y imbalanced. Typically, parallelising this type of code m e a n s t h a t a discretised model of the E a r t h is partitioned onto a n u m b e r of processors, each of which m a y own a n u m b e r of land cells and a n u m b e r of sea cells, as shown in Figure 5. Parallel inefficiency arises in the oceanography code, for example, w h e n t r y i n g to model the flow of the ocean in the fluid flow solver, w here few calculations are performed on processors owning a high proportion of land cells. This m e a n s t h a t some processors will r e m a i n idle whilst w ai t i ng for other processors to complete their calculations, exhibiting n a t u r a l imbalance, as the a m o u n t of computation depends on the varying depth of sea. W h e n r u n n i n g this code on the CRAY T3E in which the imbalance is solely due to the physical characteristics, the weight of the cell being moved should be t r a n s f e r r e d across onto the new owner. This type of problem shall be referred to as h a v i n g 'physical imbalance', where the processor speeds are homogenous, and the c o mp u ta tio n al load associated with each cell rem ai ns the same after the cell h a s been t r a n s f e r r e d onto a neighbouring processor. T3E - S E A Code Iteration 16 (4x3) .-. 1.4 -~
~--,
~.~j:~
...................... !:!:!:;:i:!:!:i:!:!~iiii.~ii:~i~i:
"~
1
_
!]
__.
.E e 0.8 I-
i Figure 5: Example showing the Earth partitioned onto 9 processors, each represented by different colourings, where each processor owns a varying depth of sea upon which to compute on.
oUnbal [ ] Global I
0.6
[]Proc
I
.=o 0.4 0.2
IIPhys
Jl
~
o
.
. 1
. 2
.
.
.
3
4
5
.
. 6
. 7
. 8
. 9
.
.
10
11
12
Processor Number
Figure 6: Shows the processor timings at iteration 16 for various types of load balancing on a 4x3 processor topology (note that Processor 9 contained Europe and Russia).
351
T3E - SEA Code Total Times
T3E - SEA Code Iteration 16 (4x3) 8000
1.4
,12
,~0.81 0.6 ~ 0.4 0.2 0
i
1
Unbal
9
i
t
[] Average Maximum Minimum
[] Unbal 9Global [] Proc 9Phys
8 6000 m 4000 c ~ 2000
t
Global
Proc
Phys
Type of balancing performed
Figure 7: Statistical measurements for the processor timings for iteration 16, for various types of load balancing on a 4x3 processor topology.
2x2
3x2
3x3
4x3
4x4
Processor Topology
Figure 8: The execution times (CPU + Rebalancing time) using different types of load balancing for various processor topologies, for 2000 iterations.
Therefore it may be necessary to treat the different types of problem differently when load balancing, as the weight is related to the cell itself rather than the processor speed (as with processor imbalance). Figure 6 shows the processor timings for a single iteration using the different balancing techniques, from which the following points can be ascertained. The processor timings appear unbalanced when no load balancing is undertaken, which suggests that there is a fair amount of idle time present. It is also noticeable t h a t there is very little work being done by the processor who owns Europe and Russia (processor 9 in this case, in which a 4x3 topology is being used). The m a x i m u m processor time can be reduced simply by balancing the workload on each processor using the given methods; however, the processor timings are only 'balanced' when staggering the limits assuming physical imbalance (where processor 9 is then given a sufficient amount of work). A more general overview can be seen in Figure 7, in which statistical m e a s u r e m e n t s are given for each of the different balancing techniques. The aim is to reduce the m a x i m u m iteration time down towards the average time, as giving more work to the fast/light processor shall increase the m i n i m u m time. It is apparent that this is achieved only when balancing the problem assuming the correct problem type. A similar trend can be seen for various processor topologies in Figure 8, in which it is clear to say that any form of balancing is better than none, and t h a t staggering the limits is better than changing them globally. These results suggest t h a t the type of problem being balanced is an important factor and needs to be considered when performing any form of DLB.
7. A U T O M A T I O N Implementing any dynamic load balancing strategy manually within a code can be a time consuming and tedious process, which is why it is desirable to be able to automate the whole strategy, preferably as part of the parallelisation. Having already generated a parallel version of their serial code, the user is able to add dynamic load balancing capabilities painlessly to their code, whereby the
352
user selects a routine and loop to balance. CAPTools detects and changes all of the necessary existing communication calls into dlb communications, and inserts the relevant DLB code into the user selected position in the code. CAPTools also determines what data is to be migrated and generates the migration calls, as well as duplicating all of the necessary overlap communications. This allows the user to concentrate on the actual results of the DLB strategy rather than its implementation, saving a lot of time and mundane effort. Note that a similar approach can be used in the implementation of DLB within unstructured mesh codes. Automating this DLB strategy gives the user more control over the attainable balance, as it is now possible for the user to implement the strategy at several levels of granularity (e.g. at the time-step, iteration, or solver loop). This is because there may be several places in which load imbalance occurs, which may occur at different stages during execution. The user can select from a number of suitable positions in which to rebalance the load after which CAPTools inserts all of the necessary code to perform this task, something which would have been infeasible when implementing DLB manually.
8. C O N C L U S I O N The performance of a parallel code can be affected by load imbalance (caused by various factors) which is why DLB is used to improve the parallel efficiency, particularly where the load imbalance is changing continuously throughout execution. Using non-coincidental processor partition range limits in just one of the partitioned dimensions offers greater flexibility in the balance attainable t h a n having used solely coincidental limits, and it is not very complex to code compared to using all non-coincidental limits in each partitioned dimension. Having been embedded into CAPTools, this generic DLB strategy can be used to generate parallel DLB code for a wide range of application codes, where only minor changes are made to the user's code (maintainability and optimisation are still possible). The overhead of using the dlb communications over non-dlb communications when no load imbalance exists is negligible, allowing DLB to be used whenever imbalance is suspected. This DLB strategy has been tested on a number of problems exhibiting good parallel efficiency, and it has been seen that the type of problem being balanced is a contributing factor to the performance of the DLB algorithm.
REFERENCES
[1]
A. Amlananthan, S. Johnson, K. McManus, C. Walshaw, and M. Cross, "A genetic strategy for dynamic load balancing of distributed memory parallel computational mechanics using unstructured meshes", In D. Emerson et al, editor, Parallel Computational Fluid Dynamics: Recent Developments and Advances Using Parallel
353
[21 [3] [4]
[:5] [6]
Computers. Elsevier, Amsterdam, 1997. (Proceedings of Parallel CFD'97, Manchester, 1997). C. Walshaw and M. Cross, "Parallel Optimisation Algorithms for Multilevel Mesh Partitioning", Tech. Rep. 99/IM/44, Univ. Greenwich, London SE 18 6PF, UK, February 1999. C. S. Ierotheou, S. P. Johnson, M. Cross, and P. F. Leggett, "Computer aided parallelisation tools (CAPTools)- conceptual overview and performance on the parallelisation of structured mesh codes", Parallel Computing, 22:163-195, March 1996. E. W. Evans, S. P. Johnson, P. F. Leggett and M. Cross, "Automatic Generation of Multi-Dimensionally Partitioned Parallel CFD Code in a Parallelisation Tool", In D. Emerson et al, editor, Parallel Computational Fluid Dynamics: Recent Developments and Advances Using Parallel Computers. Elsevier, Amsterdam, 1997. (Proceedings of Parallel CFD'97, Manchester, 1997). P. F. Leggett, "The CAPTools Communication Library (CAPLib)", Technical Report, University of Greenwich, published by CMS Press, ISBN 1 89991 45 X, 1998. Southampton-East Anglia (SEA) M o d e l - Parallel Ocean Circulation Model, http://www.mth.uea.ac.uk/ocean/SEA
This Page Intentionally Left Blank
Parallel Computational Fluid Dynamics Towards Teraflops, Optimizationand Novel Formulations D. Keyes, A. Ecer, J. Periaux, N. Satofuka and P. Fox (Editors) 2000 Elsevier Science B.V. All rights reserved.
355
P a r a l l e l P e r f o r m a n c e o f a Z o n a l N a v i e r - S t o k e s C o d e on a M i s s i l e F l o w f i e l d Jubaraj Sahu, Karen R. Heavey, and Daniel M. Pressel U.S. Army Research Laboratory Aberdeen Proving Ground, MD 21005-5066
While in theory parallel computing can produce very high levels of performance, it remains clear that for several reasons the track record for parallel computing has been inconsistent. Some of these reasons are: (1) parallel systems are difficult to program, (2) performance gains are failing to live up to expectations, (3) many of the current generation of RISC-based parallel computers have delivered low levels of performance. This paper discusses the highly innovative techniques that were used to overcome all of these obstacles. A time-marching, Navier-Stokes code, successfully used over a decade for projectile aerodynamics, was chosen as a test case and optimized to run on modern RISC-based parallel computers. As part of the validation process, the parallelized version of the code has been used to compute the three-dimensional (3-D) turbulent supersonic flow over a generic, ogive-cylinder, missile configuration. These results were compared to those obtained with the original version of the code on a Cray C-90. Considerable performance gain was achieved by the optimization of the serial code on a single processor. Parallelization of the optimized serial code, which uses loop-level parallelism, led to additional gains in performance. Recent runs on a 128-processor Origin 2000 have produced speedups in the range of 10-26 over that achieved when using a single processor on a Cray C-90. The parallelized code was then used to compute the turbulent flow over various projectile configurations at transonic and supersonic speeds. 1. I N T R O D U C T I O N The advancement of computational fluid dynamics (CFD) is having a major impact on projectile design and development. Advancements in computer technology and state-of-the-art numerical procedures enables one to find solutions to complex, time-dependent problems associated with projectile aerodynamics; store separation from fighter planes; and other multibody systems. Application of CFD to multibody configurations has proven to be a valuable tool in evaluating potential new designs. 1'2 Recent application of this technology has been directed at the evaluation of a submunition separation process. 2 Although the computational results obtained are encouraging and valuable, the computer central processing unit (CPU) time required for each time-dependent calculation is immense even for axisymmetric flows, with three-dimensional (3-D) calculations being worse. This problem becomes even more extreme when one looks at the turnaround time. These times must be reduced at least an order of magnitude before this technology can be used routinely for the design of multibody projectile systems. This is also true for numerical simulation of single, projectile-missile configurations, which are, at times, quite complex and require large computing resources. The primary technical challenge is to effectively utilize new advances in computer technology, in order to significantly reduce run time and to achieve the
356 desired improvements in the turnaround time. The U.S. Department of Defense (DOD) is actively upgrading its high-performance computing (HPC) resources through the DOD High-Performance Computing Modernization Program (HPCMP). The goal of this program is to provide the research scientists and engineers with the best computational resources (networking, mass storage, and scientific visualization) for improved design of weapon systems. The program is designed to procure state-of-the-art computer systems and support environments. One of the initiatives of the DOD HPCMP is the Common High-Performance Computing Software Support Initiative (CHSSI) aimed at developing application software for use with systems being installed. A major portion of this effort has to do with developing software to run on the new scalable systems, since much of the existing code was developed for vector systems. One of the codes that was selected for this effort is the F3D code, 3which was originally developed at NASA Ames Research Center with subsequent modifications at the U.S. Army Research Laboratory (ARL). This code is a NavierStokes solver capable of performing implicit and explicit calculations. It has been extensively validated and calibrated for many applications in the area of projectile aerodynamics. As such, there was a strong interest in porting this code to the new environments. A key reason for funding this work through the CHSSI project is F3D's proven ability to compute the flow field for projectile configurations using Navier-Stokes computational techniques. ~'2'4'5 The end goal of writing a parallel program is in some way to allow the user to get their job done more efficiently. Traditionally this was measured using metrics such as speed and accuracy, but it can also mean more cost effectively. Historically, parallel computers were based on hundreds, or even thousands, of relatively slow processors (relative to the performance of state-of-the-art vector processors). This meant that, in order to achieve the goal, one had to use techniques that exhibited a high degree of parallelism. In some cases, software developers went to algorithms that exhibited an inherently high level of parallelism, but were most definitely not the most efficient serial/vector algorithms currently in use (e.g., explicit CFD algorithms). In other cases efficient serial/vector algorithms were used as the starting point, with changes being made to increase the available parallelism. In most, if not all, cases, these changes had the effect of substantially decreasing the efficiency of the algorithm by either decreasing the rate of convergence and/or increasing the amount of work per time step (sometimes, by more than an order of magnitude). Therefore, using traditional methods for parallelizing computationally intensive problems was unlikely to deliver the expected speedups in terms of time to completion. The difficulty of writing efficient message-passing code and making efficient use of the processors are the reasons why it has been so difficult to make parallel processing live up to its potential. Figure 1 shows these and other commonly discussed effects. 6 The key breakthrough was the realization that many of the new systems seemed to lend themselves to the use of loop-level parallelism. This strategy offered the promise of allowing the code to be parallelized with absolutely no changes to the algorithm. Initial effort in this parallelization of the code has been successful. 
7 This paper describes the continued developmental effort of the code, computational validation of the numerical results, and its application to various other projectile configurations. All computations were performed on the Cray C-90 supercomputer and a variety of scalable systems from Silicone Graphics Incorporated (SGI) and Convex.
357 2. G O V E R N I N G EQUATIONS The complete set of time-dependent, Reynolds-averaged, thin-layer, Navier-Stokes equations is solved numerically to obtain a solution to this problem. The numerical technique used is an implicit, finite-difference scheme. The complete set of 3-D, time-dependent, generalized geometry, Reynoldsaveraged, thin-layer, Navier-Stokes equations for general spatial coordinates ~, rl, and ~ can be written as follows: 0 ~ 1 + 0~F + 0riG + 0CI2I
=
(1)
Re-10cS,
In equation (1), ~1contains the dependent variables and 1~, ~;, and I:I are flux vectors. Here, the thin-layer approximation is used, and the viscous terms involving velocity gradients in both the longitudinal and circumferential directions are neglected. The viscous terms are retained in the normal direction, ~, and are collected into the vector ~. These viscous terms are used everywhere. However, in the wake or the base region, similar viscous terms are also added in the streamwise direction. 3. P A R A L L E L I Z A T I O N M E T H O D O L O G Y Many modern parallel computers are now based on high-performance RISC processors. There are two important conclusions that one can reach from this observation: (1) in theory, there are many cases in which it will no longer be necessary to use over 100 processors in order to meet the user' s needs and, (2) if the theory is to be met, one must achieve a reasonable percentage of the peak processing speed of the processors being used. Additionally, the first conclusion allows for the use of alternative architectures and parallelization techniques which might support only a limited degree of parallelism (e.g., 10-100 processors). Based on this reevaluation, some important conclusions were reached. ..--
............. ....... ......... ............
(1) In using traditional parallel algorithms and ~ techniques, using significantly fewer processors can: ~ .~ 8
IDEAL LINEAR SPEEDUP AMDAHL'S LAW
/
/ 1
COSTS AMDAHL'S LAW § COMM. COSTS + LESS EFFICIENT ALGORITHM
TYPICALHIGHPERFORMANCEVECTORPROCESSORS
..i
."
/,/"
,..'/
./-"
/
i/" .. z
(a) decrease the system cost, ~ (b) increase the reliability of the system, ~ (c) decrease the extent to which the ~ ~ efficiency of the algorithm is degraded, (d) decrease the percentage of the run time spent passing messages, and (e) decrease the effect of Amdahl's Law. Figure 1. Predicted speedup from the parallelization of a problem with a fixed (2) Possibly of even greater significance was the problem size. observation that, with loop-level parallelism, it is possible to avoid many of the problems associated with parallel programming altogether. This is not a new observation, but only now is it starting to be a useful observation. The key things that changed are that: ..i.
./i
..//
./"
....
..... .....
.. . . . . . . . . . .
NUMBER
.. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
...
OF PROCESSORS
USED
(a) loop-level parallelism is frequently restricted to using modest numbers of processors and
358
the processors therefore have to be fast enough to achieve an acceptable level of performance; (b) loop-level parallelism will, in general, try and use the same sources of parallelism used to produce a vectorizable code (this makes it difficult to efficiently use this type of parallelism on a machine equipped with vector processors); and (c) it is difficult to make efficient use of loop-level parallelism on anything but a shared-memory architecture and, only recently, have vendors started to ship shared-memory architectures based on RISC processors with aggregate peak speeds in excess of a few GFLOPS. By combining aggressive serial optimizations with loop-level parallelization of vectorizable loops (some loop interchanging was also required), all of the design goals were met. 4. RESULTS A generic missile configuration was used for many of the tests on the parallelized code. In these tests, a one-million-point grid (see Fig. 2) was used to check the accuracy of the results. The computed results obtained with the parallelized code were compared with those obtained using the vectorized code on a Cray C-90. These computed results were :!'i~:~~i~:i!~~i!~li,I!'I compared with the experimental data obtained at the Defense Research Agency (DRA), UK, for the same configuration and test conditions. Typically, computation on the C-90 used 18 MW (148 MB) of memory and 7.5 hr of CPU time. Once the accuracy of the computed results was verified, performance studies were carried out for grid sizes ranging from 1 to 59 million grid points. Figure 3 shows the Figure 2. Computational grid. circumferential surface pressure distributions of the missile at a selected longitudinal station. Computed results from both vectorized (C-90) as well as the parallelized versions of the code are shown to lie on top of one another and are thus in excellent agreement.
i!
In order to properly appreciate the results, it is helpful to keep the following thoughts in mind. (1) Many prominent researchers have pointed out that one cannot get good parallel speedup from loop-level parallelism. One went as far as to point out that in one example the performance leveled off after just five processors were used. Others have gone so far as to report parallel slowdown when more than one processor was used. In almost all cases, these reports were based on the use of automatic parallelizing compilers. 8'9 (2) A common complaint with RISC-based systems (or, for that matter, any system with a hierarchical memory system) is that they can not achieve the same level of performance achieved by a well-designed vector processor (e.g., a Cray C-90). Even if the same level of performance is achieved, the complaint is that the performance degrades badly when larger problem sizes are tried.
359 (3) Another source of confusion is that, while linear speedup is desired, it is generally impossible to obtain it using loop-level parallelism. Instead, the best that can be achieved is a curve with an obvious stair step effect. The source of this effect is the limited parallelism co (especially when working with 3-D codes) associated with loop-level parallelism. Table 1 demonstrates this effect.
X / D =,, 3 . 5
0.15 0.10 O. 0 5 O. O0 -0.05 - 0 . 10 - 0 . 15 -0.20
I
i
I
i
I
.
;
(4) These results were obtained using a o o 3o.o Bo.o 90.0 12o.o 15o.o l a o . o Phi (degrees) highly efficient serial algorithm as the starting Figure 3. Surface pressure comparison. point, and taking great care not to make any changes to the algorithm. Initial efforts to run the vector optimized version of this code on one processor of an SGI Power Challenge (75-MHZ R8000 processor) proved to be extremely disappointing. After aggressively tuning the code for a low-cache miss rate and good pipeline efficiency, a factor of 10 improvement in the serial performance of this code was achieved. At this point, the percentage of peak performance from the RISC-tuned code using one processor on the SGI Power Challenge was the same as the vector tuned code on one processor of a Cray C-90. A key enabling factor was the observation that processors with a large external cache (e.g., 1--4 MB in size) could enable the use of optimization strategies that simply were not possible on machines like the Cray T3D and Intel Paragon which only have 16 KB of cache per processor. This relates to the ability to size scratch arrays so that they will fit entirely in the large external cache. This can reduce the rate of cache misses associated with these arrays, which miss all the way back to main memory, to less than 0.1% (the comparable cache miss rates for machines like the Cray T3D and Intel Paragon could easily be as high as 25%). Table 1. Predicted Speedup for a Loop With While the effort to tune the code was nontrivial, 15 Units of Parallelism. the initial effort to parallelize the code was already showing good speedup on 12 processors (the maximum number of processors available in one Power Challenge Predicted Max. Units of No. of at that time). Additional efforts extended this work to Speedup Parallelism Processors Assigned to a larger numbers of processors on a variety of systems (see Single table 2). Most recently, work has been performed on 64 Processor and 128 processor SGI Origin 2000s (the later is an experimental system located at the Navel Research 1.000 15 Laboratory in Washington D.C.). This work has 1.875 extended the range of problem sizes up to 53 million grid points spread between just three zones, and up to 3.000 115 processors on the larger machine (due to the stair 3.750 stepping effect, the problem sizes run on this machine were not expected to get any additional benefit from 5.000 5-7 using 116-128 processors). 8-14 15
7.500 15.000
Figures 4-6 show the performance results for three of the data sets. All results have been adjusted to
360 remove start up and termination costs. The latest results show a factor of 900-1,000 speedup from the original runs made using one processor of the Power Challenge with the vector-optimized code (the corresponding increase in processing power was less than a factor of 160). Additionally, speedups as Table 2. Systems That Have Run the RISC-Optimized Code. No.
Vendor
System Type
Processor Type
SGI
Challenge/ Power Challenge
R4400R 10000
SGI
Origin 2000
R 10000
SGI
Origin 200
R 10000
SGI
Indigo 2 Workstation
R4400
Exemplar SPP-1000
HP-PA 7100
32
Exemplar SPP- 1600
HP-PA 7200
32
I
6OO
Convex
16-36
500
9 ...........- .........
Cray C90 SGI Origin 2000 (32 processor system)
....
SGI Origin 200 (4 processor system)
- -
.............
o:
..................
SGI Origin 2000 (64 processor system) P r e d i c t e d performance
..........
SGI Origin 2000 (128 processor system)
,~, .........
400
32, 64, 128
/
:.
i ~' .': ...............
(preproduction hardware and operating system)
"
../ ..../
3OO
//
loo
0
" 10
20
30
40
50
60
70
80
NUMBER OF PROCESSORS USED
Figure 4. Performance results for the one million grid point data set.
high as 26.1 relative to one processor of a C-90 have been achieved. Since the numerical algorithm was unchanged, this represents a factor of 26.1 increase in the speed at which floating-point operations were performed and consequently, the wall clock time required for a converged solution decreased by the same factor. Table 3 summarizes the results shown in figures 4-6. In a production environment, such as is found at the four Major Shared Resource Centers set up by the DOD HPCMP, these results represent an opportunity to significantly improve the job throughput. Unfortunately, at this time, the limited availability of highly tuned code is proving to be a major stumbling block. These results clearly demonstrate that using the techniques described herein, it is possible to achieve high levels of performance with good scalability on at least some RISC-based, shared-memory, symmetric multiprocessors. It is also interesting to note that these results were obtained without the use of any assembly code or system-specific libraries and with relatively little help from the vendors.
5. Concluding Remarks A time-marching, Navier-Stokes code, successfully used over a decade for projectile aerodynamics, has been implemented and optimized to run on modern RISe-based parallel computers (SGI PeA, Origin 2000s). The parallelized version of the code has been used to compute the 3-D, turbulent, supersonic, flow over a generic, ogive-cylinder missile configuration for validation and comparison of the results obtained with the original version of the flow solver on a Cray C-90. Both
90
361
=
processor system) processor system) Predicted pedormance SGI Origin 2O0O (128 processor system) ..:...........................
.................. 300
9
Cray C90
....... ....... ~ . . . . . .
Cray C90
SGI Origin 2000 (32
. . . .
SGI Origin 2000 (64
................. ............
"1-
(preproduction hardware and
operating system)
150
processor system) performance processor system) (preproduction hardware and operating system) SGI Origin 2000 (64
;;~
Predicted
SGI Origin 2OO0 (128
; ...........................
~ ........ ' .... S
u~
............. ~:
200
, ~,4,,d;
.....
_z .-
lOO
o
20
o
~.:~,.~;~
.......i ~
u.l m
,,
40
~
60
~,,,~5~
z
80
00
so
20
120
40
~~ .....
60
80
100
120
N U M B E R OF P R O C E S S O R S U S E D
N U M B E R OF P R O C E S S O R S U S E D
Figure 5. Performance results for the 24 million grid point data set Table 3. Performance Results Summary for Figures 4-6. Speedup Relative to One Processor of a Cray C-90
Grid Size (millions of elements)
No. of Processors
SGI Origin 2000
1
30
16.7
9.5
1
45
22.5
12.8
Figure 6. Performance results for the 53 million grid point data set. versions of the code produced the same qualitative and quantitative results. Considerable performance gain was achieved by the optimization of the serial code on a single processor. Parallelization of the optimized serial code, which uses loop-level parallelism, has led to additional gains in performance on parallel computers.
=
1
t
90
28.1
15.9
24
!
30
24.4
10.4
24
i
60
40.0
17.0
9O
53.5
22.8
114
57.9
24.7
53
30
30.0
10.4
53
60
51.3
17.8
90
61.9
21.5
114
75.0
26.1
;
24 24
i
53 ,
53
.
This work represents a major increase in capability for determining the aerodynamics of single- and multiple-body configurations quickly and efficiently. This developed capability is a direct result of the efficient utilization of new advances in both computer hardware and CFD. Although more remains to be done, this capability has great potential for providing CFD results on 3-D complex projectile configurations routinely and can thus have significant impact on design of weapon systems.
362 REFERENCES
1. Sahu, J., and Nietubicz, C. J., "Application of Chimera Technique to Projectiles in Relative Motion," AIAA Journal of Spacecrafts and Rockets, Vol. 32, No. 5, September-October 1995. 2. Sahu, J., Heavey, K. R., and Nietubicz, C. J., "Computational Modeling of SADARM Submunition Separation," Journal of Computer Modeling and Simulation in Engineering, July 1997. 3. Steger, J. L., Ying, S. X., and Schiff, L. B., "A Partially Flux-Split Algorithm for Numerical Simulation of Compressible Inviscid and Viscous Flow," Proceedings of the Workshop on Computational Fluid Dynamics, Institute of Nonlinear Sciences, University of California, Davis, CA, 1986. 4. Sahu, J., and Steger, J. L., "Numerical Simulation of Transonic Flows," International Journal for Numerical Methods in Fluids. Vol. 10, No. 8, pp. 855-873, 1990. 5. Sahu, J., "Numerical Computations of Supersonic Base Flow with Special Emphasis on Turbulence Modeling," AIAA Journal, Vol. 32, No. 7, July 1994. 6. Almasi, G.S. and Gottilieb, A., "Highly Parallel Computing" Second Edition, The Benjamin/Cummings Publishing Company, Inc., Redwood City, CA, 1994. 7. Sahu, J., Pressel, D. M., Heavey, K. R., and Nietubicz, C. J., "Parallel Application of a NavierStokes Solver for Projectile Aerodynamics," Parallel CFD, Recent Developments and Advances using Parallel Computers, Emerson et al.Editors, Elsevier 1998. 8. Hockney, R.W., Jessohope, C.R., "Parallel Computers 2 Architecture, Programming and Algorithms," Philadelphia: Adam Hilger, 1988. 9. Bailey, D.H., "RISC Microprocessors and Scientific Computing," Proceedings for Supercomputer 93, IEEE Computer Society Press, Los Alamitos, CA, 1993.
Parallel ComputationalFluid Dynamics Towards Teraflops, Optimizationand NovelFormulations D. Keyes,A. Ecer, J. Periaux,N. Satofukaand P. Fox (Editors) e 2000 ElsevierScienceB.V. All rightsreserved.
363
Parallel Computation of Three-dimensional Two-phase Flows by the Lattice Boltzmann Method Nobuyuki Satofuka ~ and Tsuneaki Sakai ~Kyoto Institute of Technology, Matsugasaki, Sakyo-ku, Kyoto 606-8585, JAPAN
The multicomponent nonideal gas lattice Boltzmann model proposed by Shan and Chen is applied to simulate the physics of multi-phase flows. A fifteen velocity cubic lattice model is used for the simulations. The motion of a spherical droplet under gravitational force is chosen as a test problem. It is found that the dynamics of two-phase flows can be simulated by using the lattice Boltzgnann method with the S-C model. The lattice Boltzmann method is parallelized by using domain decomposition and implemented on a distributed memory parallel computer, the HITACHI SR2201. It is found that a larger speedup is obtained than that of the single phase lattice Boltzmann code.
1. I N T R O D U C T I O N
There are two approaches in Computational Fluid Dynamics (CFD). The first approach is to solve the Navier-Stokes equations based on the continuum assumption. In the case of incompressible flow computation, one has to solve the momentum equations together with the Poisson equation to satisfy the divergence free condition. The second approach starts from the Bolzmann equation using Lattice Gas Automata (LGA) [1]. Based on LGA, the lattice Boltzmann (LB) method has recently emerged as a promising new method for the simulation of incompressible fluid flows [2-6]. The main feature of the LB method is to replace the particle occupation variables ni (Boolean variables) in LGA by the single particle distribution functions (real variables) fi --(hi), where ( ) denotes a local ensemble average, in the evolution equation, i.e., the lattice Boltzmann equation. A Bhatnager-Gross-Krook (BGK) collision model [7] was usually adopted in the LB equation in place of complicated collision term. In this lattice Boltzmann BGK (LBGK) model, the equilibrium distribution is chosen a posleriori by matching the coefficients in a small velocity expansion, so that the Navier-Stokes equations can be derived using the Chapman-Enskog method. An interesting and important application of the LBGK method is the simulation of fluid flows with interfaces between multiple phases. There are numerous complex flow problems in both industrial systems and natural sciences that involve propagation and interaction of interfaces. Such problems pose considerable difficulties to conventional CFD methods, especially when the interfaces can undergo topological changes. Several LGA/LBGK models for multi-phase flows have been developed since the first introduc-
364
tion of the LGA model for simulation of two immiscible fluids. The first LGA model for immiscible two-phase flow was proposed by Rothmann and Keller [8]. A LB equa,tion version was formulated later [9,10]. The LB model proposed by Shan and Chen [11] introduces a non-local interaction between particles at neighboring lattice sites. An arbitrary number of components with different molecular masses can be simulated in this model. Their model is labeled here as the S-C model. In the present paper, the S-C model is applied to simulate the motion of a spherical droplet under gravitational force on a distributed memory parallel computer, the HITACHI SR2201. The speedup and CPU time are measured to investigate the parallel efficiency of the LBGK approach to two-phase flows.
2. L A T T I C E B O L T Z M A N N
METHOD
FOR TWO-PHASE
FLOW
2.1. C u b i c lattice m o d e l A cubic lattice with unit spacing is used, where each node has fourteen nearest neighbors connected by fourteen links. Particles can only reside on nodes, and move to their nearest neighbors along these links in unit time, as shown in Figure 1. Hence, there are two types of moving particles. Particles of type 1 move along the axes with speed [ei[ = 1; particles of type 2 move along the links to the corners with speed [ei I = x/~. Rest particles, with speed zero, are also allowed at each node. In order to represent two different fluids, red and blue particle distribution functions are introduced. The lattice Boltzmann equation is written for each phase
fki (X -F el, t -t- 1) -- fki (X, t)
_ ___1 {fki (X, t) -- r(r rk
(X, t)}
(1)
where fki(x, t), is the particle distributions in the i velocity directions for the k-th fluid (red or blue) at position x and time t, Jk~r(r t) is the equilibrium distributions at x,t and rk is the single relaxation time for the k-th fluid which controls the rate of approach to equilibrium. The density and the macroscopic moment of k-th fluid are defined in terms of the particle distribution functions as
p~ (x, t ) -
~; h~ (x, t ) ,
,~ (x, t). v~ - ~ h~ (x, t). e~.
i
(2)
i
The total density p(x, t) and the local velocity v is defined as
p(x,t)-
y;p~ ( x , t ) , p(x,t). ~ - F ~ A ~ ( x , t ) . k
k
~.
i
The equilibrium distribution function of the species k as function of x,t is given as
(a)
365
tZ (i- 1,j- 1,k+ 1)
~
(i- 1d+ 1,k+ l)
;-,
~.,
"
q.~
..
(i ~('J,k~
..-:
i
('~J'i'l'l~)
i ,j+l,k-1)
~
~
9~ .... a
tl+ld-l,K-l)
(i,j;~-l) ~ ...-'""
........ t.~V 9 . ...................... ... 0+ld+l,k-1)
Figure 1. Cubic lattice model
mk
f~0q)(x,/)- p k 7 + ,nk
fk(eq)(x, t) - Pk { 1 i 7+ink r(~q)(x,t) - ~_k{
1
1 ivl2 )
3
~(~. v) + 1 (~. v)~ 1 2} + 1(~.v)+
1
(4)
v) ~ - 1 Iv 12}
where mk is a parameter representing the ratio of rest and type 1 moving particles of the k-th fluid. The viscosity is given by uk = (2rk - 1)/6.
2.2. S-C m o d e l The multicomponent multi-phase LB equation model proposed by Shan and Chen [11] uses the BGK collision term. The equilibrium distribution functions are the same as given in Eq.(4), except the velocity v is replaced by the equilibrium velocity v k " given by the relation pkv~~q) - pkv' + ~-kFk ,
(5)
where v' is a common velocity, on top of which an extra component-specific velocity due to interparticle interaction is added for each component. Fk is the total interparticle force acting on the k-th component. Thus, in this model, interaction of particles is through the equilibrium distribution. To conserve momentum at each collision in the absence of interaction, v' has to satisfy the relation
366
The total interaction force on the k-th component at site x is
F~(x) - - ~(x) ~ ~ a~(x, x') ~(x')(x' - x),
(7)
X!
where Gk~(x,x') is a Green's function and satisfies Gk~(x,x') = Gk~(x',x). !Pk(x) is a function of x through its dependence on pk. If only homogeneous isotropic interactions between the nearest neighbors are considered, Gk~(x,x ~) can be reduced to the following symmetric matrix with constant elements:
a~(x,x')-
o ,Ix'-xl > A x 0 ~ , Ix'-xl ___~Xx
(s)
Hence 0k~ is the strength of the interaction potential between component k and/c. The pressure in the 3-D 15-velocity model is given by 3
p - ~ 4p~ + 7 ~ o~. ~ . ~ . k
(o)
k,fr
The effective mass Ok(x) is taken as Ok(x) =
exp{/3kpk(x)}- 1.0,
(10)
where /3k is determined from the second virial coefficient in the virial expansion of the equation of state with Lennard-Jones potential.
3. A P P L I C A T I O N
TO TWO-PHASE
FLOW
3.1. Test for L a p l a c e ' s law The first test problem selected is a 2-D static circular bubble of radius R of high density fluid (red) immersed in another low density fluid (blue). The density ratio is chosen to 3.0. The center of the circle is at the center of the 4R • 4R computational domain. Periodic boundary conditions are applied in all coordinate directions. Initially each fluid has a constant density and pressure with zero velocities. Since the LBGK method is a dynamical procedure, the state of a static bubble is achieved through a time evolution, and the size of the final bubble and final densities are unknown at initial time. In the final stage of the simulation, pressure and the bubble radius R can be determined. The surface tension, ~r, is calculated according to Laplace's law (7
Ap- ~.
(11)
To test Laplace's law in Eq.(11), simulations with bubbles of different initial radii, R 8, 12, 16 and 20 are performed, and the final radius R and the pressure differences are
367 |
0.015 .o. r
|
|
CASE1: (analytical): CASE2: (analytical): CASE3: (analytical):
0.02
|
~ ....... ........ ............... ........ .......
.!!/
i
/"
/
/"
/
.... ..........
......... ,.,.,~
.... "" ~,,~" , / I " ......... ..,;.,/ //
///
0.01
|
11.-/ .,.," ylr ~'"
,..-~
/./~,-
0.005 ./~/--/
0
0.02
0.04
0.06
0.08
0.1
0.12
0.14
1/Radius
Figure 2. Laplace's law
recorded for three different values of surface tension. These results are plotted in Figure 2, where computed pressure difference is plotted as a function of the inverse of the bubble radius. It is seen that Laplace's law is satisfied with the accuracy of our computation. Comparison between theoretical and computed values of surface tension is listed in Table 1.
3.2. M o t i o n of a spherical droplet As a second test problem for the LBGK method with interparticle interaction (S-C model), motion of a spherical droplet under gravitational force is simulated. Initially a spherical droplet of radius/~ is put at the center of a 4R • 4R x 4/~ computational domain. The top and bottom boundaries are no-slip walls, while periodic boundary conditions are applied in x and y coordinate directions. The number of lattice sites in each direction is 65. The density ratio is again chosen to be 3.0. However, in this case the density of the droplet is lower than that of the fluid outside. Figure 3 shows the iso-surface of density for a rising droplet in a gravitational field at various time steps. Strong deformation of the interface between the two fluids is clearly captured.
Table 1. Comparison of surface tension
O"1
(theoretical)
or2 (computed)
CASE1 0.16941
CASE2 0.15059
CASE3 0.13176
0.16871
0.15050
0.13106
368
Time step = 100
Time step =3000
Time step = 10000 Figure 3. Iso-surface of density
369
4. P A R A L L E L I Z A T I O N
4.1. C o m p u t e r resource The parallel machine that we used is a distributed memory computer, the Hitachi SR2201. Sixteen processing units (PU) and eight I/O units (IOU) are connected through a two-dimensional crossbar network, of which bandwidth is 300 M bytes per second. These IOU can be used as PU in the computation. Each PU has 256 M bytes of memory. The program is written in Fortran90 and Express is used as the message passing library. 4.2. D o m a i n d e c o m p o s i t i o n for t h r e e - d i m e n s i o n Two-dimensional simulation [12] shows that the longer the size of subdomain in the horizontal direction, the shorter the CPU time. In other words, the longer the dimension of innermost loop, the shorter the CPU time. Since the top and bottom boundaries are noslip walls, we decompose the domain as shown in Figure 4 for better load balancing. For each component (red or blue), fifteen distribution functions are placed on each lattice for the cubic lattice model. Only five of them have to be sent across each domain boundary.
4.0
!iiiiiiii!i !ii i i i i i i i i i i -=
9...... -'. i
3.0
S = i~ N = [og2C
Z /..... D
,
i ......
2.0 1.0
aiififiififiiiJii
"9
[ ......
0.0 0.0
~X Figure 4. Domain decomposition
1.0
2.0 N
3.0
Figure 5. Speedup
Table 2. CPU Time Number of CPUs C 1 2 4 8 16
CPU Time (sec) 1121 561 282 160 100
Speedup T 1.0 2.0 4.0 7.0 11.2
4.0
370
4.3. S p e e d u p and C P U t i m e
The speedup and CPU time measured in the simulation with lattice nodes are shown in Figure 5 and Table 2. With the same number of lattice nodes, higher speedup is obtained in the present multicomponent code than that of the one component code [13].
5. C O N C L U S I O N Parallel computation of three-dimensionM two-phase flows is carried out using the LBGK method with the S-C model for interparticle interaction. The present approach is able to reproduce complicated dynamics of two-phase flows and could be an alternative for solving the Navier-Stokes equations. Reasonable speedup can be obtained for a relatively small lattice size. Further investigation is needed for larger scale problems and extensions to complex geometries.
6. A C K N O W L E D G E M E N T
This study was supported in part by the Research for Future Program (97P01101) from Japan Society for the Promotion of Science.
REFERENCES
1. 2. 3. 4. 5. 6. 7. 8. 9. 10. 11. 12. 13.
U.Frish, B.Hasslacher and Y.Pomeau, Phys. Rev. Lett. 56, (1982) 1505-1508. G.McNamara and G.Zanetti, Phys. Rev. Lett. 61, (1988) 2332-2335. F.Heiguere and J.Jimenez, Europhys Lett. 9, (1989) 663-668. F.Heiguere and S.Succi, Europhys Lett. 8, 517-521. H.Chen, S.Chen and W.H.Matthaeus, Phys. Rev. A 45, (1992) 5539-5342. Y.H.Oian and D.D'humieres,Europhys. Lett. 17(6), (1992) 479-484. P.L.Bhatnagar and E.P.Gross, M.Krook, Phys. Rev. 94 , (1954) 511-525. D.H.Rothman and J.M.Keller, J. Stat. Phys. 52, 1119 (1998). A.K.Gunstensen, D.H.Rothman, S.Zaleski, and G.Zanetti, Phys. Rev. A 43, 4320, (1991). D.Grunau, S.Chen, and K.Eggert, Phys. Fluids A 5, 2557 (1993). X.Shan and H.Chen, Phys. Rev. E 47, 1815 (1993). D.R.Emerson, et al., Parallel Computation of Lattice Boltzmann equations for Iracompressible Flows., Parallel Computational F]uid Dynamics, North-Holland, 1998. C.A.Lin, et al., Parallelization of Three-Dimensional Lattice Boltzmann Method for Incompressible Turbulent Flows., Parallel Computational Fluid Dynamics, NorthHolland, 1999.
Parallel Computational Fluid Dynamics Towards Teraflops, Optimization and Novel Formulations D. Keyes, A. Ecer, J. Periaux, N. Satofuka and P. Fox (Editors) e 2000 Elsevier Science B.V. All rights reserved.
371
Parallelization and Optimization of a Large Eddy Simulation Code using OpenMP for SGI Origin2000 Performance Punyam Satya-narayana ~, Ravikanth Avancha b, Philip Mucci c and Richard Pletcher d aRaytheon Systems Company, ARL MSRC, 939-I Beards Hill Rd, Suite #191, Aberdeen MD 21001 bDepartment of Mechanical Engineering, Iowa State University, Ames IA 50011 CComputer Science Department, University of Tennessee, Knoxville TN 37996 dDepartment of Mechanical Engineering, Iowa State University, Ames IA 50011 A multi-block, colocated-grid, finite volume code, developed by researchers at Iowa State University for the large eddy simulation (LES) of complex turbulent flows, is the focus of attention. This code, written in FORTRAN, has been well optimized for use on vector processor machines such as the CRAY C90/T90. Increasing popularity and availability of relatively cost-effective machines using the RISC based NUMA architecture such as the Origin2000 are making it necessary to migrate codes from the CRAY C90/T90. It is well known that CFD codes are among the toughest class of problems to port and optimize for RISC based NUMA architectures [1]. Strategies adopted towards the creation of a shared memory version of the code and its optimization are discussed in detail. Large eddy simulations of the turbulent flow in a channel are then performed on an Origin2000 system, and the corresponding results compared with those from simulations on a CRAY T90, to check for their accuracy. Scaling studies from the parallel version are also presented, demonstrating the importance of cache optimization on NUMA machines. 1. L A R G E
EDDY
SIMULATION
CODE
Large eddy simulation (LES) is currently one of the popular approaches for the numerical simulation of turbulent flows. Turbulent flows are inherently time dependent and three-dimensional in nature. They also have a continuous spectrum of relevant length scales that need to be accurately captured unlike other multiple scale problems which have a finite number of relevant scales. For any flow, the large significant length scales are related to the domain size, and the small scales are related to the dissipative eddies, where the viscous effects become predominant. In LES, the motion of the large-scale structures is computed and the nonlinear interactions with the smaller scales are modeled. LES relies on the assumption that small scales of motion are nearly isotropic and
372 independent of the geometry, and hence can be approximated by a simple universal model. Although the small scales are modeled in LES, significant computer resources are required to accurately capture the large, energy-carrying length scales. The LES of a fully developed turbulent channel flow is performed in order to provide "realistic" turbulent inflow conditions for the simulation of the flow past a backwardfacing step. It is the objective of this paper to parallelize the LES code dealing with the turbulent channel flow. Upon successful parallelization of this code, the methodology will be extended to perform the LES of the flow past a backward facing step, a more complex flow that is characteristic of practical application areas. Within the framework of a fairly simple geometry, the turbulent flow past a backward facing step consists of distinctly different flow regimes: boundary layers, a mixing layer, flow separation, reattachment, and recovery, in the presence of a strong adverse pressure gradient. A coupled finite volume procedure is used to solve the compressible Favre filtered Navier-Stokes equations. The procedure is fully implicit, and second-order accurate in time. The advective terms can be discretized in one of three ways: second-order central differences, fourth-order central differences or QUICK-type upwind differences. The viscous terms are discretized using fourth-order central differences. Time derivative preconditioning is incorporated to alleviate the stiffness and related convergence problems that occur at low Mach numbers [2]. The aforementioned system of algebraic equations is solved using Stone's strongly implicit procedure (SIP). The details of the numerical procedure are described in [3]. Large eddy simulation is essentially an unsteady, three-dimensional computation; supercomputing resources become necessary in order to carry out these computations. Efficient use of these expensive computing resources motivates parallelization and optimization studies such as that carried out in this work. Over the past few years, the high performance computing community has had to expend significant effort to keep in phase with the evolution of high performance computers. This is especially so when market pressures, technological breakthroughs, and the dual demands of the scientific community (ever increasing problem size and the need to minimize wall-clock time) evolve the machine architecture and compiler technology all too often. In the United States, a case in point is the dwindling vendor interest in the once popular vector machines, following successful performance by the new shared/distributed memory machines with sophisticated memory hierarchies. This new direction in high performance computers, combining elements of both shared and distributed memory, is requiring scientists to devote a significant amount of time porting their codes and more fundamentally, necessitating change in their programming style. Often scientists are working with legacy codes that require major efforts to port, in order to fully utilize the performance potential of the HPC machines [4]. In this paper, we attempt to address two questions most often asked by the scientific community when faced with a legacy code in hand (in this case, the LES code), a new HPC machine on the horizon (NUMA Origin2000), and the eventual de-commissioning of a popular workhorse machine (CRAY C90/T90): (1) How long does it take to port and optimize the code? 
and (2) How does one utilize the tools available to parallelize the code for good performance on the new machines? We identify three different approaches in porting the LES code to utilize the compu-
373 tational potential of the new machines {5]: (a) reliance on vendor-supplied (automatic) parallelizing compilers [6], (b) use of the industry standard OpenMP parallel programming interface [7,8] to produce a Single Processor Multiple Data (SPMD) style parallel code, and (c) use of an emerging tool, called CAPTools, to produce a Message Passing Interface (MPI) style message passing parallel code. 2. O P T I M I Z A T I O N
AND
PARALLELIZATION
STRATEGY
The vector LES code optimized for a Cray C90/T90 is ported to the Origin2000 following the SGI guidelines by first going through Single Processor Optimization and Tuning (SPOT) and then through Multiprocessor Tuning (MUT). SPOT and MUT involved several steps. Starting with the existing tuned code, SPOT meant 1. getting the right answer (comparing with benchmark physics results at every step, Figure 1) 2. finding out where to tune (using SGI's Speedshop, prof, and p i x i e tools) 3. letting the compiler do the work (using various levels of optimization, -02, -03, and
-Ofast)
4. tuning for cache performance (using Speedshop experiments, loop fusion, do loop index ordering for FORTRAN arrays) In the next step, MUT involved 1. parallelization of the code (using PFA, the loop level automatic parallelization compiler, and OpenMP style SPMD paradigm) 2. bottleneck identification (using prof and ssrun) 3. fixing of false sharing (with the help of the perfex tool) 4. tuning for data placement (enclosing initialization in a parallel region or using data distribution directives) 5. performance analysis at every step We used a relatively small, 32 • 32 • 32 grid for initial work and debugging purposes, for faster turn around time. Larger scale problems with 64 and 81 dimensions were used for scaling studies, which will be discussed in Section 2.2. In the following sections we describe the performance of the vector code on the CRAY C90/T90 and its subsequent optimization for cache reuse on the Origin2000. 2.1. C R A Y T90 LES C o d e Extensive reference was made to CRAY Research publications [9,10] to develop an optimized LES code for the CRAY T90. A vectorized version of the coupled strongly implicit procedure (CSIP) solver was developed and was the key to obtaining good performance on the CRAY T90. Figures 1 and 2 show good agreement between the simulations and experiments/simulations from available literature. Table 1 contains timings, and MFLOPS for different grids with either a single processor or multitasking enabled. The numbers in the "%CPU obtained" column indicate the actual CPU allocation for the job, where 100% is equal to a single, dedicated processor. It can be seen from Table 1 that the use of multiple processors (as in the case of 4 and 12 for the 81 • 81 • 81 case, and 4 for the 45 • 45 • 45 case) results in significantly lower wall-clock time to carry out the same number of iterations. Scaling the wall-clock time
374
Table 1 Code Performance on the CRAY T90 Flag Grid Processors Wallclock % CPU User CPU Vector secs Obtained secs Length A 21 x 21 x 21 1 10.824 94.5 10.195 34 A 45 x 45 x 45 1 86.683 69.0 60.362 66 B 45 x 45 x 45 4 41.505 162.0 67.053 65 A 81 x 81 x 81 1 583.364 51.0 297.699 97 B 81 x 81 x 81 4 162.730 193.5 314.085 95 B 81 x 81 x 81 6 342.663 94.5 322.419 95 B 81 x 81 x 81 8 348.514 93.0 323.630 95 B 81 x 81 x 81 12 108.733 295.5 320.870 95 B 81 x 81 x 81 14 340.704 96.0 324.625 95 Notes: FORTRAN compiler: CF90; Number of iterations: 60 Flag A: -dp - 0 v e c t o r 3 , s c a l a r 3 , i n l i n e 3 - 0 a g g r e s s Flag B: -dp - 0 v e c t o r 3 , s c a l a r 3 , i n l i n e 3 , t a s k 3 , t a s k i n n e r - 0 a g g r e s s
MFLOPS 273 455 410 540 511 499 497 501 495
with actual CPU allocation, we find that turn around time is inversely proportional to the number of processors available. By adopting an effective parallelization strategy, the transition from the CRAY T90 to the Origin2000 is expected to be highly beneficial. Since each Origin2000 processor is less powerful than a CRAY T90 processor, it is extremely important to obtain parallel efficiency in order to achieve good overall performance. The challenges in porting and optimizing CFD codes to cache based RISC systems such as the Origin2000 are clearly outlined in [1].
2.2. Single CPU optimization SpeedShop (SGI) tools [12] s s r u n and p r o f were used to instrument pc sampling and idealtime experiments. Continuous monitoring of CPU time in conjunction with p r o f (-heavy -1 options) analysis identified the bottlenecks and cache problems. We made sure that high CPU utilization was achieved during the entire modification process. The 20.0
3.5 .
15.0
.
.
.
.
.
.
.
.
.
3.0
/~/2
~
A
+23 IO.O V
U+ = y-+
r'-"
.
/ " "72 ~
-f4 f -
~ 2.02"5 I u+=zS,n(y+l+s.s im et al (1987
J""
. . . . . . .
--4-
/_b f
,of
5.0 0.0 1
.
-- -~Avancha. u~wind 65 s Qrid "1
~~.: i'o
0i3
. . . . . . .
~P~i-la ]
1'oo
.1.5 0.5 o.o
y+ Figure 1.
Law of the wall plot
Y Figure 2.
Streamwise rms statistics
375 Table 2 Single CPU performance on the Origin2000 Flag Grid Processors CPU L2 cache misses TLB misses sec sec sec A 21 x 21 x 21 1 209 23.2 27.00 B 21 • 21 x 21 1 110 22.5 22.50 C 21 • 21 x 21 1 28 9.3 0.73 Notes: FORTRAN Compiler: f77 MIPSpro Version 7.2; Number of iterations: 60 Flag A: Vector Code, no optimization Flag B: Vector Code, - 0 f a s t Flag C: Optimized Single CPU code, - 0 f a s t
MFLOPS 8.40 24.30 57.24
results are given in Table 2. With efficient collaboration between the authors over a seven day period [11]: the most expensive routines were identified; in order to make the code "cache-friendly", array dimensions in the entire code (containing about 45 subroutines) were rearranged, i.e, array_name(i, j , k , n x , n y ) --~ array_name (nx , n y , i , j ,k); do loop indices were rearranged to attain stride one access for FORTRAN arrays; and routines were rewritten to minimize cache-misses and maximize cache-line reuse. More specifically, within the most expensive subroutine c s i p 5 v . f , an in-house LU decomposition solver : 1. array dimensions were rearranged for efficient cache use 2. indirect address arrays were merged for data locality, i.e, indi(ip,isurf), indj ( i p , i s u r f ) , i n d k ( i p , i s u r f ) -+ i n d ( i , i p , i s u r f ) 3. do loops were unrolled and array references were kept to proximity Single processor optimization performance is given in Table 2 for different compiler options. Single CPU optimization shows that cache misses went down by a factor of 3 with a proportionate reduction in TLB misses. One observation is that the optimized code for a single Origin2000 processor is still slower than the optimized version on the CRAY T90, by a factor of 3. This is indicative of the respective performance capabilities of the two different processors. 3. P A R A L L E L I Z A T I O N
The parallel environment on the Origin2000 allows one to follow various paradigms to achieve parallelization, as follows: (a) automatic parallelization using the MIPSpro FORTRAN compiler; (b) loop-level parallelism; (c) single program multiple data (SPMD) style of parallelism; (d) use of low level parallel threads; (e) message passing using parallel virtual machine (PVM) or MPI; (f) use of the Portland Group's High Performance FORTRAN compiler (HPF); and (g) use of commercial tools that parallelize the code, supposedly with minimal code modifications by the user. In this paper we select the automatic parallelization option of the MIPSpro compiler (Power FORTRAN Analyzer: PFA), the OpenMP interface, and MPI using CAPTools (a computer aided parallelization toolkit that translates sequential FORTRAN programs into message passing SPMD codes). They are discussed in brief in the following sections.
376 400,
o
\.
-*--94 -
350 ~~ '\
\
\ \\
'\
\ \
E
\
200 13.
O
OMP OMPV
\
300 ~ 250
'PFA
"\
~-
~~'-~...
\~
150 '~~~~.....
.
~
100
. ~~
~
, ~~
1
2
i
i
I
i
I
3
4
5
6
7
Number of processors
8
Figure 3. Scalability chart for 323 grid showing wall-clock time for PFA, OpenMP (optimized) and OpenMP (pure vector code). Comparison of OMP and OMPV shows the need for cache optimization; for 8 processors the CPU time drops by a factor of 2 if optimized code is used. First impressions of the performance of LES code on the Origin2000 NUMA machines are reported and compared with the performance of the serial code on the T90. We find that CAPTools can significantly reduce the programmer's burden. However, the use of CAPTools can make the code unreadable and might demand that the user master its communication interface in addition to MPI. We defer CAPTools results to a future article due to limited scope of the paper and space. 3.1. P o w e r F O R T R A N
Analyzer This option is available to the users of the compiler and can be invoked using compiler flags - the code need not be modified. Although no analysis of the code is required by the user, the PFA option usually parallelizes the code at the individual loop level. The compiler automatically places directives in front of each parallelizable loop and does not parallelize the loops with data dependencies. This results in some part of the code, often a significant portion, being serial. The serial component of the code leads to poor scaling beyond 8 processors. Results for a 32 • 32 x 32 grid are shown in Figure 3. The need for cache optimization can be inferred from Figure 3. We note that P FA was compiled with aggressive optimization flag -03, whereas the OMP version was compiled with -02 option. Aggressive optimization flag -03, did not complete due to compiler runtime errors and hence the difference in speed between PFA and OMP versions. 3.2. O p e n M P
The LES code is parallelized using the SPMD style of OpenMP parallel programming (as opposed to loop level parallelism) that relies heavily on domain decomposition. While
377
x
Channel
Geometry
4 blocks---~ 4 processors
Z
_..__>.
26
Block 4 /
.................
I : / .....................
Block 3 /
/_ ........................ 2z6 Boundary Conditions: Streamwise, x : periodic Spanwise, z : periodic Wall normal,y : no slip
V
y
2/
V
Block 1 / Blocking direction Values at block interfaces are obtained from a global array
Figure 4. Channel geometry showing partitioned blocks tasked to processors domain decomposition can of decomposition from the posed, the same sequential multiple sub-domains. The
result in good scalability, it does transfer the responsibility computer to the user. Once the problem domain is decomalgorithm is followed, but the program is modified to handle data that is local to a sub-domain is specified as PRIVATE or THREADPRIVATE. THREADPRIVATE is used for sub-domain data that need file scope or are used in common blocks. The THREADPRIVATE blocks are shared among the subroutines but private to the thread itself. This type of programming is similar in spirit to message passing in that it relies on domain decomposition. Message passing is replaced by shared data that can be read by all the threads thus avoiding communication overhead. Synchronization of writes to shared data is required. For a Cartesian grid, the domain decomposition and geometry are shown in Figure 4. Data initialization is parallelized using one parallel region for better data locality among active processors. This method overcomes some of the drawbacks of first-touch policy adopted by the compiler. If the data is not distributed properly, the first-touch policy may distribute the data to a remote node, incurring a remote memory access penalty. The" main computational kernel is embedded in the time advancing loops. The time loops are treated sequentially due to obvious data dependency, and the kernel itself is embedded in a second parallel region. Within this parallel region, the computational domain is divided into blocks in the z-direction, as shown in Figure 4, which allows each block to be tasked to a different processor. Several grid sizes were considered; the scalability chart is shown in Figure 5 and a typical load balance chart (using MELOPS as an indicator) is shown in Figure 6. We see performance degradation near 8 processors for the 32 • 32 • 32 grid and 16 processors for the 81 • 81 x 81 grid. Less than perfect load balancing is seen due to the remnant serial component in two subroutines. We observe linear speedup up to 4 processors across all the grid sizes and the large memory case levels off at 8 processors. The SPMD style of parallelization shows an encouraging trend in scalability. A detailed analysis with fully
378 3000
,
D
-e- 32/OMP
I-+
2500
2000 r
E
'~' 28 I I I I
13..
27
26
II
D1500 13. 0 1000
64/OMP I
25
I
\
\
24 if) ~) 23
\
'~
J
22
500 '~ ...... .............................. i
2
i
4
i
6
Y
8
i
10
l
12
i
14
Number of processors
Figure 5. Scalability chart for 323, 643 and 81 a grids
16
20 0
2
4
6 8 10 12 Processornumber
14
16
Figure 6. SGI MFLOPS across all processors for OpenMP LES code for 813 grid
cache optimized and parallelized code will be presented elsewhere. 4. S U M M A R Y
In summary, the CRAY C90/T90 vector code is optimized and parallelized for Origin2000 performance. A significant portion of our time is spent in optimizing c s i p 5 v . f , an in-house LU decomposition solver, which happens to be the most expensive subroutine. The FORTRAN subroutine is modified by changing the order of nested do loops so that the innermost index is the fastest changing index. Several arrays in c s i p 5 v . f are redefined for data locality, and computations are rearranged to optimize cache reuse. Automatic parallelization, PFA, scales comparably to SPMD style OpenMP parallelism, but performs poorly for larger scale sizes and when more than 8 processors are used. SPMD style OpenMP parallelization scales well for the 813 grid, but shows degradation due to the serial component in still unoptimized subroutines. These subroutines contain data dependencies and will be addressed in a future publication. Finally, we report an important observation, for the 32 x 32 x 32 grid presented here, that cache optimization is crucial for achieving parallel efficiency on the SGI Origin2000 machine. 5. A C K N O W L E D G M E N T
The current research was partially supported by the Air Force Office of Scientific Research under Grant F49620-94-1-0168 and by the National Science Foundation under grant CTS-9414052. The use of computer resources provided by the U.S. Army Research Laboratory Major Shared Resource Center and the National Partnership for Advanced
379 Computational Infrastructure at the San Diego Supercomputing Center is gratefully acknowledged. REFERENCES
1. James Taft, Initial SGI Origin2000 tests show promise for CFD codes, NAS News, Volume 2, Number 25,July-August 1997. 2. Pletcher, R. H. and Chen, K.-H., On solving the compressible Navier-Stokes equations for unsteady flows at very low Mach numbers, AIAA Paper 93-3368, 1993. 3. Wang, W.-P., Coupled compressible and incompressible finite volume formulations of the large eddy simulation of turbulent flows with and without heat transfer, Ph.D. thesis, Iowa State University, 1995. 4. Jin, H., M. Haribar, and Jerry Yah, Parallelization of ARC3d with Computer-Aided Tools, NAS Technical Reports, Number NAS-98-005, 1998. 5. Frumkin, M., M. Haribar, H. Jin, A. Waheed, and J. Yah. A comparison Automatic Parallelization Tools/Compilers on the SGI Origin2OOO,NAS Technical Reports. 6. KAP/Pro Toolset for OpenMP, http://www.k~i.com 7. OpenMP Specification. http://www.openmp.org, 1999. 8. Ramesh Menon, OpenMP Tutorial. SuperComputing, 1999. 9. Optimizing Code on Cray PVP Systems, Publication SG-2912, Cray Research Online Software Publications Library. 10. Guide to Parallel Vector Appllcations, Publication SG-2182, Cray Research Online Software Publications Library. 11. Satya-narayana, Punyam, Philip Mucci, Ravikanth Avancha, Optimization and Parallelization of a CRAY C90 code for ORIGIN performance: What we accomplished in 7 days. Cray Users Group Meeting, Denver, USA 1998. 12. Origin 2000(TM) and Onyx2(TM) Performance Tuning and Optimization Guide. Document number 007-3430-002. SGI Technical Publications.
This Page Intentionally Left Blank
Parallel Computational Fluid Dynamics Towards Teraflops, Optimization and Novel Formulations D. Keyes, A. Ecer, J. Periaux, N. Satofuka and P. Fox (Editors) 92000 Elsevier Science B.V. All rights reserved.
381
C a l c u l a t i o n of U n s t e a d y I n c o m p r e s s i b l e Flows on a M a s s i v e l y P a r a l l e l C o m p u t e r U s i n g t h e B.F.C. C o u p l e d M e t h o d K.Shimano a, Y. Hamajima b, and C. Arakawa b
a
b
Department of Mechanical Systems Engineering, Musashi Institute of Technology, 1-28-1 Tamazutsumi, Setagaya-ku, Tokyo 158-8557, JAPAN Department of Mechanical Engineering, University of Tokyo, 7-3-1 Hongo, Bunkyo-ku, Tokyo 113-8656, JAPAN
The situation called "fine granularity" is considered, which means that the number of grid points allocated to one processor is small. Parallel efficiency usually deteriorates in fine granularity parallel computations therefore it is crucial to develop a parallel Navier-Stokes solver of high parallel efficiency. In this regard, the authors have focused on the coupled method. In this study the coupled method is extended to the boundary fitted coordinate and is applied to the unsteady flow around a circular cylinder. High parallel efficiency of the B.F.C coupled method is demonstrated. 1.
INTRODUCTION
Fluid flow simulation of much larger problems will be necessary without doubt. However, a considerable jump-up of single-processor performance cannot be expected while the computational time will become longer as the number of grid points increases. To overcome this problem, the simplest solution is to use as many processors as possible while still keeping parallel efficiency high. In other words, the number of grid points per processor should be decreased. This is called massively parallel computing of very fine granularity. In this study, the authors deal with this situation. Above everything else, it is crucial to develop a parallel Navier-Stokes solver of high efficiency even for fine granularity. The authors[l-2] have focused on the high parallel efficiency of the coupled method originally proposed by Vanka[3] and showed that the coupled method achieved relatively high parallel efficiencies in Cartesian coordinates. Its simple numerical procedures are another advantage.
382
The existing coupled method is characterized by use of the staggered grid in Cartesian coordinates. However, in terms of practical application, it should be applicable to problems in the Boundary Fitted Coordinate (B.F.C.) system. In this paper, extension of the coupled method to B.F.C. is presented and computational results of the unsteady flow around a circular cylinder are shown. High parallel efficiency of the B.F.C. coupled method is demonstrated: for example, even when only 32 control volumes are allocated to one processor in the 3-D calculation, the parallel efficiency reaches 31%. At the end of the paper, accuracy of the B.F.C. coupled method is also discussed.
2.
COUPLED METHOD IN THE CARTESIAN
Vanka[3] proposed the original coupled method as the SCGS method. It is characterized by use of the staggered grid in the Cartesian coordinates. In the 2dimensional case, four velocity components and one pressure component corresponding to one staggered cell are implicitly solved. In practice, the 5 • 5 matrix is inverted for each cell. In the 3-dimensional case, the size of the matrix is 7 • 7. In the iteration process, the momentum equations and the continuity equation are simultaneously taken into account so that mass conservation is always locally satisfied. See references [1-3] for details. In Figure 1, the coupled method is compared with the SIMPLE method in terms of parallel efficiency. The results were obtained by calculation of the lid-driven square cavity flow at Re=100 on a Hitachi SR2201. The number of control volumes (CVs) is 128 • 128. Obviously the efficiency of the coupled method is higher than that of the SIMPLE method. There are several reasons that the coupled method is suitable for fine granularity 120 100 >,
o
E O ._ O
UJ
i!i!iiii
80 60
i SIMPLE 1 D Coupled i
-ii,ir
40 k.
a.
20
2x2
4x4
8x8
Number of Processors
Figure 1. Parallel Efficiency of SIMPLE and Coupled methods
383
parallel computing: 1. Velocity components and pressure values are simultaneously sent and received by message-passing, therefore the overhead time for calling the message-passing library is reduced. 2. Numerical operations to invert the local matrix hold a large part in the code and can be almost perfectly parallelized. 3. Numerical procedures are quite simple.
3.
EXTENSION
OF COUPLED
METHOD
T O B.F.C.
The boundary fitted coordinate (B.F.C.) is often used for engineering applications. A B.F.C. version of the coupled method developed by the authors is presented in this section. In the B.F.C. version, unknowns and control volumes (CVs) are located as shown in Figure 2. This a r r a n g e m e n t is only for 2-D calculation. For 3-D flows, pressure is defined at the center of the rectangular-parallelepiped. In the iteration process of the 2-D calculation, eight velocity components and one pressure component constituting one CV for the continuity equation are simultaneously updated; that is, the following 9 • 9 matrix is inverted for each cell. 9
9
9
9
9
Ui+l,j+ ,-I9
-Xi + l , j + l
I
U/,j+I
.u. 1
Ui,j Ui+l,j Vi+l,j+
l
--
fi:,
:
Vi+l,J _
(1)
i,j+l
Vi,j+I
Vi,J
j+,
i,j fi:l,j
_
o
Black circles in Eq.(1) represent non-zero elements. Since velocity components u and v are defined at the same points, contribution of the advection-diffusion terms is the same for both components. Therefore, the computational cost to invert the 9 • 9 matrix is reduced to order 5 • 5. For 3-D flows, the cost is only order 9 • 9 for the same reason, though the size of the matrix to be inverted is 25 • 25. The matrix inversion mentioned above is applied to all the cells according to odd~even ordering. For example, in the 2-D calculation, cells are grouped into fours and Eq.(1) is solved in one group after another. The number of groups is eight in the 3-D calculation. If this grouping is not adopted, relaxation will be less effective.
384 W h e n the a r r a n g e m e n t in Figure 2 is used, the checkerboard-like pressure oscillation sometimes a p p e a r s . However, pressure differences such as Pc-Pa in Figure 2 are always correctly evaluated and the pressure oscillation does not spoil stability or accuracy. In order to eliminate the oscillation, pressure P" is newly defined as shown in Figure 2 and is calculated according to the equation:
I(pA +PB +Pc
(2)
+PD)
r
,,
0
r
,,
PA
0
~V for C , -th~ Continuity
' o ,,
0
,,
~--~ CV for Pc ~, @ " ~ the Momentum
Figure 2.
A r r a n g e m e n t of Variables and Control Volumes for 2-D B.F.C. 9
9 9 9 9
9 9
//~It era~tion~'~X
/ f i t er!ti on-"~N
9 9 9 9
f / I t era t i on~'~
c~11 Groupmj
~ 1 0~ou~
c~11 0rou~
f~esssage~ %2a~s~i~~
f~terat ion~ ~I O~ou~
~terat ion~ C~11 C~ou~ ~t
f~Iterat i o ~
~11 G~ou~ ~....~ ....
C~eIltle%!rtolOpn~
~e~a~o~
~e~a~o~ ~
k~11 Group~j)
er~ation~-~X
c~11 G~oup~j
Ck~ll Group~j)
~e~atio~ k~11 Group~ l)
9
9
9
9 9
9 9
9 9
9
9
Most Frequent Figure 3.
2-D B. F. C
3-D B. F. C
Communication P a t t e r n s
385
4.
PARALLELIZATION
STRATEGY
The B.F.C. coupled method is implemented on a parallel computer by domain decomposition techniques. The communication pattern is an important factor that determines the efficiency of parallel computation. Three communication patterns are compared in Figure 3. Communication between processors via message-passing is expressed by the gray oval. If the most frequent communication is chosen, data are transferred each time after relaxation on one cell group. This pattern was adopted in the authors' Cartesian Coupled method [1-2]. It has the merit that the convergence property is not considerably changed by domain decomposition. However, if this pattern is adopted for the B.F.C. coupled method, the message-passing library is called many times in one inneriteration; at least 4 times in the 2-D calculation and 8 times in the 3-D calculation. This leads to increased over-head time. To avoid that, the patterns shown in the center and the right of Figure 3 are adopted in this study. In the 2-D calculation, processors communicate after relaxation on every two groups. In the 3-D, data are exchanged only once after relaxation on all eight groups is completed. A less frequent communication pattern is chosen because the 3-D communication load is much heavier. However, in using this communication pattern, the convergence property of the 3-D calculation is expected to become worse. In compensation for that, cells located along internal boundaries are overlapped as shown in Figure 4.
Ordinary Decomposition
Overlapping
Figure 4. Overlapping of Cells
386 Gray cells in Figure 4 belong to both subdomains and are calculated on two processors at the same time. Using this strategy, solutions converge quickly even though the parallel efficiency of one outer-iteration is reduced.
5.
RESULTS
The B.F.C. coupled method was implemented on a Hitachi SR2201 to calculate the flow around a circular cylinder. The Hitachi SR2201 has 1024 processing elements (PEs) and each has performance of 300Mflops. PEs are connected by a 3-D crossbar switch and the data transfer speed between processors is 300Mbytes/sec. In this study, PARALLELWARE was adopted as the software. Widely used software such as PVM is also available. Computational conditions and obtained results are shown in Table 1. Calculations were performed 2-dimensionally and 3-dimensionally. The Reynolds number was 100 in the 2-D calculation and 1000 in the 3-D calculation. The same number of CVs, 32768, was adopted to compare 2-D and 3-D cases on the same granularity level. The item "CVs per PE" in Table 1 represents the number of grid points that one processor handles; namely, it indicates granularity of the calculation. The number of CVs per PE is very small; the smallest is only 32 CVs per processor. One may think that too many processors were used to solve a too small problem, but there were no other options. As mentioned in Chapter 1, quite fine granularity was investigated in this study. If such fine granularity had been realized for much larger problems, a bigger parallel computer with many more processors would have been required. However, such a big parallel computer is n o t available today. Fine granularity is a situation t h a t one will see in the future. The authors had to deal with this future problem using today's computers and therefore made this compromise. Time dependent terms were discretized by the first order backward difference and Table 1. Computational Conditions and Results Total CV PE Subdomain CVs per PE E" ope
2-D 256 x 128=32768 256 512 1024 16x16 32x16 32x32 128 64 32 95.0% 76.9% 51.3%
3-D 64 x 32 x 16=32768 256 512 1024 8x8x4 16x8x4 16x8x8 128 64 32 50.1% 35.6% 23.8%
z"/re,-
97.0%
95.1%
93.6%
126.3%
129.5%
130.9%
~"
92.1% 236
73.2% 375
48.0% 491
63.3% 162
46.1% 236
31.1% 319
Speed-Up
387
implicit time marching was adopted. Performance of the parallel computation is shown in the last four rows in Table 1. There are three kinds of efficiency. ~ expresses how efficiently operation in one outer iteration is parallelized. The second efficiency ~;~ represents the convergence property. If its value is 100%, the number of outer iterations is the same as that in the single processor calculation. The third efficiency e is the total efficiency expressed as product of,~~ and , ~ (see [4]). Judging from ,~ ~ the outer iteration is not well-parallelized in the 3-D calculation. These values are roughly half of those of the 2-D calculation because of the heavy load of 3-D communication and the overlapped relaxation explained in w 4. In the 2-D calculation, the convergence property does not greatly deteriorate because the efficiency c ;~ is kept at more than 93%. This means that the communication pattern for the 2-D calculation is appropriate enough to keep ~,~t~ high. In the 3-D calculation, the efficiency ,~'~ exceeds 100%. This means that the number of outer iterations is smaller than the single processor calculation. This is caused by the overlapped relaxation. If the most frequent communication pattern had been used instead, the iteration efficiency would have stayed around 100% at best. Overlapped relaxation is one good option to obtain fast convergence. Using 1024 processors, the total efficiency ~ reaches 48% in the 2-D calculation and 31% in the 3-D calculation. Considering the number of CVs per PE, these values of efficiency are very high. It can be concluded that the B.F.C. coupled method is suitable for massively parallel computing of quite fine granularity. Further work is necessary to get more speed-up. One idea is a combination of the coupled method and an acceleration technique. A combined acceleration technique must be also suitable for massively parallel computing. Shimano and Arakawa [1] adopted the extrapolation method as an acceleration technique for fine granularity parallel computing. Finally, accuracy of the B.F.C. coupled method is discussed. In the 3-D calculation, the length of the computational domain in the z-direction was 4 times longer than the diameter. The grid of 64 • 32 • 16 CVs was too coarse to capture the exact 3-D structure of the flow. The authors tried 3-D calculations using eight times as many CVs, namely 128 • 64 x 32 (=262144) CVs. In the computational results on the fine grid, the appearance of four cyclic pairs of vortices in the z-direction was successfully simulated (see Figure 5), which had been reported in the experimental study by Williamson [5]. Estimated drag coefficient and Strouhal number are 1.08 and 0.206 respectively, which are in accordance with experimental data 1.0 and 0.21. From these facts, high accuracy of the B.F.C. coupled method is confirmed.
388
Figure 5.
6.
Flow around a 3-D Circular Cylinder, Isosurfaces of Vorticity (X-Axis) Re=1000, 128 • 64 • 32(=262,144) CVs, 64PE
CONCLUSIONS
The B.F.C. coupled method was proposed and applied to 2-D and 3-D calculations of flow around a circular cylinder. The overlapped relaxation adopted in the 3-D calculation was effective for fast convergence. Obtained parallel efficiency was high even when granularity was quite fine. Suitability of the coupled method to massively parallel computing was demonstrated. For example, even when only 32 control volumes were allocated to one processor, the total parallel efficiency reached 48% in the 2-D calculation and 31% in the 3-D calculation. The computational results by the B.F.C. coupled method were compared with experimental data and its high accuracy was ascertained.
REFERENCES 1. K.Shimano, and C.Arakawa, Comp. Mech., 23, (1999) 172. 2. K.Shimano, et al., Parallel Computational Fluid Dynamics, (1998) 481, Elsevier. 3. S.P.Vanka, J. Comp. Phys., 65, (1986) 138. 4. K.Shimano, and C.Arakawa, Parallel Computational Fluid Dynamics, 189 (1995), Elsevier. 5. C.H.K. Williamson, J. Fluid Mech., 243, (1992), 393.
Parallel Computational Fluid Dynamics Towards Teraflops, Optimization and Novel Formulations D. Keyes, A. Ecer, J. Periaux, N. Satofuka and P. Fox (Editors) o 2000 Elsevier Science B.V. All rights reserved.
389
A parallel algorithm for the detailed numerical simulation of reactive flows M. Soria J. Cadafalch, R. Consul, K. Claramunt and A. Oliva a * aLaboratori de Termotecnia i Energetica Dept. de Maquines i Motors Termics Universitat Politecnica de Catalunya Colom 9, E-08222 Terrassa, Barcelona (Spain) manel @labtie, mmt. upc. es
The aim of this work is to use parallel computers to advance in the simulation of laminar flames using finite rate kinetics. To this end, a parallel version of an existing sequential, subdomain-based CFD code for reactive flows has been developed. In this paper, the physical model is described. The advantages and disadvantages of different strategies to parallelize the code are discussed. A domain decomposition approach with communications only after outer iterations is implemented. It is shown, for our specific problem, that this approach provides good numerical efficiencies on different computing platforms, including clusters of PC. Illustrative results are presented.
1. I N T R O D U C T I O N The aim of this work is to use parallel computers to advance in the simulation of laminar flames using finite rate kinetics. A parallel version of an existing sequential, subdomainbased CFD code for reactive flows has been developed[I,6,7]. Although industrial combustors work under turbulent conditions, studies allowing laminar flow conditions are a common issue in their design. Furthermore, a good understanding of laminar flames and their properties constitute a basic ingredient for the modelling of more complex flows[4]. The numerical integration of PDE systems describing combustion involves exceedingly long CPU times, especially if complex finite rate kinetics are used to describe the chemical processes. A typical example is the full mechanism proposed by Warnatz[21], with 35 species and 158 reactions. In addition of the momentum, continuity and energy equations, a transport convection-diffusion has to be solved for each of the species, as well as the kinetics. The objective of these detailed simulations is to improve the understanding of combustion phenomena, in order to be able to model it using less expensive models. *This work has been financially supported by the Comision Interministerial de Ciencia y Tecnologia, Spain, project TIC724-96. The authors acknowledge the help provided by David Emerson and Kevin Maguire from Daresbury Laboratory (UK).
390
2. P H Y S I C A L
MODEL
The governing equations for a reactive gas (continuity, momentum, energy, species and state equation) can be written as follows:
Op 0---{+ V. (pv) = 0 OV
pg/+
(1)
(pv.V)~ = v.~,j - Vp + pg
0 (ph) = V (kVT) - V. 0t
o (pY~) Ot
"
p E hiYi ( v i - v) i=1
+ V. (pY~vi)
(2)
)
= wi
(3)
(4)
pM P = RT
(5)
where t is time; p mass density; u average velocity of the mixture; vii stress tensor; p pressure; g gravity; N total number of chemical species; h specific enthalpy of the mixture; hi specific enthalpy of specie i; T temperature; k thermal conductivity of the mixture; M molecular weight of the mixture; R gas universal constant. The diffusion velocities are evaluated considering both mass diffusion and thermal diffusion effects: T
vi = v - DimVYi - Dim v (In T) P
(6)
where Dim and Di~ are respectively diffusivity and thermal diffusivity of the species into the mixture. The evaluation of the net rate of production of each specie, due to the J reactions, is obtained by summing up the individual contribution of each reaction:
(7) j=l
i=l
i=1
Here, [mi] are the molar concentration and Mi the molecular weights of the species, uij, uij the stoichiometric coefficients appearing as a reactant and as a product respectively for the i specie in the reaction j, and kf,j, kb,j the forward and backward rate constants. The transport and thermophysic properties have been evaluated using CHEMKINS's database. More information of the model can be found at [6,7]. tt
3. P A R A L L E L
ASPECTS
3.1. P a r a l l e l i z a t i o n s t r a t e g y Different parallel computing approaches, or combinations of them, can be used to obtain the numerical solution of a set of PDEs solved using implicit techniques: Functional decomposition[13] is based on assigning different tasks to different processors. In our case,
391
the tasks would be the different PDEs. In the case of reactive flows, this method is more attractive due to the higher number of scalar equations to be solved. Domain decomposition in time[18] is based in the simultaneous solution of the discrete nonlinear equations for different time steps in different processors. Both approaches require the transference of all the unknowns after each outer iteration. As our code is targeted for clusters of workstations, they have been discarded in favour of the domain decomposition[17,14] approach. In domain decomposition method, the spatial domain to be solved is divided into a number of blocks or subdomains which can be assigned to different CPUs. As the PDEs express explicitly only spatially local couplings, domain decomposition is perhaps the most natural strategy for this situation. However, it has to be kept in mind that the PDEs to be solved in CFD are, in general, elliptic: local couplings are propagated to the entire domain. Thus, a global coupling involving all the discrete unknowns is to be expected, and a domain decomposition approach has to be able to deal effectively with this problem if it is to be used for general purpose CFD solvers. Different variants of the domain decomposition method can be found in the literature. In a first approach, each subdomain is treated as an independent continuous problem with its own boundary conditions, that use information generated by the other subdomains where necessary. The boundary conditions are updated between iterations until convergence is reached. Another approach is to consider only a single continuous domain and to use each processor to generate the discrete equations that are related with its part of the domain. The solution of the linear systems is done using a parallel algorithm, typically a Krylov subspace method. The first approach has been used here, because: (i) As linear equations are not solved in parallel, it requires less communication between the processors, and only after each outer iteration and (ii) it allows to reuse almost all the sequential code without changes. The first advantage is specially relevant in our case, as our code is to be used mainly in clusters of PCs. It is important to point out that the iterative update of the boundary conditions without any other strategy to enforce global coupling of the unknowns behaves essentially as a Jacobi algorithm with as many unknowns as subdomains. Thus, the method does not scale well with the number of processors, unless the special circumstances of the flow help the convergence process. This is our case, as the flows of our main interest (flames) have a quasi-parabolic behaviour. Domain decomposition method is known to behave well with parabolic flows: for each subdomain, as the guessed values at the downstream region have no effect over the domain, the information generated at the upstream region is quickly propagated from the first to the last subdomain[15]. 3.2. P r o g r a m m i n g m o d e l a n d s o f t w a r e e n g i n e e r i n g a s p e c t s The parallel implementation of the code had two goals: allow maximum portability between different computing platforms and keep the code as similar as possible to the sequential version. To achieve the first, message passing paradigm has been used and the code has been implemented using MPI libraries. To achieve the second, all the calls to low-level message passing functions have been grouped on a program module and a set of input-output functions has been implemented. 
The code for the solution of a singledomain problem remains virtually identical to the previous sequential code. In fact, it
392 can still be compiled without MPI library and invoked as a sequential code. The parallel implementation of the code had two goals: allow maximum portability between different computing platforms and keep the code as similar as possible to the sequential version. To achieve them, message passing paradigm has been used (MPI) and all the calls to message passing functions have been grouped on a program module. A set of input-output functions has been implemented. The code for the solution of a singledomain problem remains virtually identical to the previous sequential code. In fact, it can still be compiled without MPI and invoked as a sequential code.
:~i~iii:ii~i~i~ii]~iiiii~i~i!i~i!i~iiiii~iiiii~i!i~i~iiiii~iii~iiiiiii~ii~i~ii~iii!~iiii!iiii~ii!iiiii~i!iii~i~i!i!i~iiii~ii~iii~iiiii~i!ii
2.1 OE3
~i~i~i~i:i:~i~i~if~ii:iii!~i~i~i!i~!i!i~iii~iii~iii!i!i~iii!iii!i!ii~!ii!~i~iii~i!i~i!iii~i~i~i~i~iiiii~iii~!ii~iiiiiii~i~iiiii~iiii i;;&i~ii~i~i!~i~ii~i~iii~i~:~ii~i~i!i!i!iii!i~i~iii~i!i!i~i~i~i~ii~!i!i!i~!!!~!i!i!ii~!~!~!~ii!i!i!i~i~i~i!i~!i!i~i!i~i~i~i~i!~ii .................................................... : ............................................. ..................... 9 w:::::.:. .............
~5~!~!~!~!~!~!~!~~ i !:~~ i~ i~ i~ i~ i !i~~ i~ i~ i~ i~ i~ i~ i! ~ !i ~ i~ i~ i~ i !i~!i~~ i~ i~ i~ i~ i !i~~ i~ i~~ i~ i~ i~ i~ ii ~~ i~ i ;~ i :~ i~ i~ i~ ;i ~ i~ i !~~ i~ i~ i~ i~ i~ i~ i~ i~ i~ i !~!~!~ i~ i~ i~ i~ i !~!~!~!~ i~ i !i~!~!~~ i !~ i~ i~ i~ i~ i !~!~!~~ i~ i~ ii ~i~i~i~;i~i~i~i~i~i~i~i~i!F~ii~i~!~i~i~i~i~i~i~i~i~i~i~i~i!~i~i~i~!i!ii~i~i!~i!i~i~i! 9ww,:..v:wv:~v~w:w:w:.:::. v..~v:.:::v:-:v:w::~::::w:~:.:.~u
......................................................... . ........................... . ......................................................................... : ....................
Y6
T
~:i!ii:ii!iiii~i~iiii!iii:i!~i!i!i~i!i!iii!i!i!i~i!i!i!i~i!i!i!i!!ii!ii!!i~i!~!iii!i!iiiiiii!iiiiiiiii!ii!!i!!!!iii!i!i!ii~iiiii!i!i!i!i!i!i~i!i~
2.C OE3
4.7E-2
iiii~iiiii~iii!i~iiii~!iiiiiii~iiiiii~ii~iii~!i~ii~i~i~i~iii!i~iii~!i~iii~i~i!~ii!i~ii!~i~ii~ii~i~;iii~ii!ii~ii!ii~iii~i 4.5E-2 4.2E-2
1.fi OE3 1.EOE3 1.7 OE3 1.EOE3 1.EOE3 1.4 OE3 1.E OE3
3.9E-2
...i. . .i.... .i.....i..................!............... ...................
3.6E-2 3.4E-2
Yll 1.97E-3 1.85E-3 1.74E-3 1.62E-3 1.51 E-3
i!iiii~ii~iiiiiii~iiii~iiii~iiiiiiiii~ii~iiiiiiiiiiii~iii~iiiiii!iiii!iiiiiii~iiii~iiiiiiiii!ii~iii~i~iiii!iii!~i:iiiiiiiiii~iiiiiii~ i!i•!iii!•ii!i•i•ii!•i•i!i!•!i••ii•iii!ii••ii•!••iiiii•i•iiiii••••iiiii••iiiiii!ii•ii!3.1 •iii!•E-2 ii•i
1.39E-3
iiiiiiiii iiiii i i i li i iil
1.16E-3
2.8E-2 ~iii~i~ii~i~i~iiii~!i~iiiiii~iii~i~iii~i!~!~i~i~i~i~!ii;~iiii~i~iii~i~i~ii~i~iii~!~ii~ii~iii~i~i~iiiiiii~i~i
1..~OE3 1.1 OE3 1s OE3 9s OE2 8s OE2 7s OE2 6s OE2 5.COE2
2.5E-2 2.2E-2 2.0E-2 1.7E-2 1.4E-2 1.1 E-2 8.4E-3 5.6E-3 2.8E-3
1.27E-3 1.04E-3 9.26E-4 8.10E-4 6.95E-4 5.79E-4 4.63E-4 3.47E-4 2.32E-4 1.16E-4
Figure 1. Illustrative results. From left to right: Temperature distribution (K), OH mass fraction and CH3 mass fraction.
4. N U M E R I C A L
ASPECTS
In our implementation of the domain decomposition method, the meshes of the individual subdomains are overlapped and non-coincident. The second feature allows more geometrical flexibility that is useful to refine the mesh in the sharp gradient areas at the edges of the flames but the information to be transferred between the subdomains has to be interpolated. This has to be done accomplishing the weIl-posedness conditions; i.e. the adopted interpolation scheme and the disposition of the subdomains should not affect the result of the differential equations. For instance, methods that would be correct for one second order PDE[5] are not valid for the full Navier-Stokes set. If the well-posedness condition is not satisfied, more outer iterations are needed and slightly wrong solutions
393 can be obtained. Here, conservative interpolation schemes that preserve local fluxes of the physical quantities between the subdomains are used[2,3]. The governing equations are spatially discretized using the finite control volume method. An implicit scheme is used for time marching. A two-dimensional structured and staggered Cartesian or cylindrical (axial-symmetric) mesh has been used for each domain. High order SMART scheme[10] and central difference are used to approximate the convective and diffusive terms at the control volume faces. It is implemented in terms of a deferred correction approach[8], so the computational molecule for each point involves only five neighbours. Solution of the kinetics and the transport terms is segregated. Using this approach, kinetic terms are an ODE for each control volume, that is solved using a modified Newton's method with different techniques to improve the robustness[19]. To solve the continuity-momentum coupling, two methods can be used: (i) Coupled Additive Correction Multigrid[16], in which the coupled discrete momentum and continuity equations are solved using SCGS algorithm; (ii) SIMPLEC algorithm[9] with an Additive Correction Multigrid solver for pressure correction equation [l l] . In both cases, correction equations are obtained from the discrete equations. 5. I L L U S T R A T I V E
RESULTS
The premixed methane/air laminar fiat flame studied by Sommers[20] is considered as an illustrative example and as a benchmark problem. A stoichiometric methane-air homogeneous mixture flows through a drilled burner plate to an open domain. The mixture is ignited above the burner surface. The boundary conditions at the inlet are parabolic velocity profile with a maximum value of 0.78 m/s, T = 298.2 K and concentrations of N2,O2 and CH4 0.72, 0.22, 0.0551 respectively. At the sides, o _ 0 for all the unknowns except vx = 0. The dimensions of the domain, are 0.75 x 4 mm. In Fig. 1, the results obtained with skeletal mechanism by Keyes and Smoke[19] with 42 reactions and 15 species are presented. For the benchmarks, the four-step mechanism global reaction mechanism by Jones and Lindstedt [12] has been used. 6. P A R A L L E L
PERFORMANCE
Before starting the parallel implementation, the sequential version of the code was used to evaluate its numerical efficiency for this problem, from one to ten processors. A typical result obtained with our benchmark problem can be seen in Fig. 2. The number of outer iterations remains roughly constant from one to ten subdomains. However, the total CPU time increases due to the extra cost of the interpolations and the time to solve the overlapping areas. Even prescribing the same number of control volumes for each of the processors, there is a load imbalance, as can be seen in Fig. 3 for a situation with 10 subdomains. It is due to three reasons: (i) the time to solve the kinetics in each of the control volumes differs; (ii) there are solid areas where there is almost no computational effort; (iii) the inner subdomains have two overlapping areas while the outer subdomains have only one. The following systems have used for benchmarking the code: (i) 02000: SGI Origin 2000, shared memory system with R10000 processors; (ii) IBM SP2, distributed memory
394 3500 II. O
=
3000
eo
=.. ,-
3 .o u
,~,_ 0 L
E Z
2500,
=
9
- ~
.iterations time
m
2000
1500 1
,
,
,
,
3
5
7
9
11
Number of subdomains
Figure 2. Time to solve each of the subdomains for a ten subdomains problem.
C 0 0
2
3
4
5
13
7
8
10
Subdomain
Figure 3. Number of iterations and CPU time as a function of the number of subdomains.
system with thin160 nodes; (iii) Beowulf: Cluster of Pentium II (266 MHz) 2. For the benchmark, each subdomain has one processor (the code also allows each processor to solve a set of subdomains). The speed-ups obtained in the different systems (evaluated in relation to the respective times for one processor) are in Fig. 4. They are very similar in the different platforms. This is because the algorithm requires little communication. For instance, for the most unfavourable situation (02000 with 10 processors) 2The system at Daresbury Laboratory (UK) was used. http: / / www. dl. ac.uk / TCSC / disco/Beowulf/config, ht ml
All the information can be found at
395
only a single packet of about 11.25 Kbytes is exchanged between neighbouring processors approximately every 0.15 seconds. So, the decrease on the efficiency is mainly due to the load imbalance and to the extra work done (overlapping areas and interpolations).
Speed-up
7.5 9
/
2.5
i
!
r'~
9 SP2 /~
Beowulf Ideal
1
2
3
4
5
6
7
8
9
10
Number of processors
Figure 4. Speed-ups obtained in the different systems.
7. C O N C L U S I O N S A domain decomposition approach has been used to develop a parallel code for the simulation of reactive flows. The method implemented requires little communication between the processors, and it allowed the authors to reuse almost all the sequential code (even using message passing programming model). Speed-ups obtained in different shared and distributed memory systems up to 10 processors are reasonably good. It is remarkable that the speed-up obtained with this code on a cluster of PC computers with Linux is as good as the speed-up obtained on two commercial parallel computers. To be able to use a higher number of processors while maintaining a reasonable efficiency, load balance has to be improved and the interpolation process has to be optimised. REFERENCES
1. J.Cadafalch et al., Domain Decomposition as a Method for the Parallel Computing of Laminar Incompressible Flows, Third ECCOMAS Computational Fluid Dynamics Conference, ed. J.A.Desideri et al, pp.845-851, John Wiley and Sons, Chichester, 1996. 2. J. Cadafalch et al., Fully Conservative Multiblock Method for the Resolution of Turbulent Incompressible Flows, Proceedings of the Fourth European Computational Fluid Dynamics Conference, Vol. I, Part. 2, pp 1234-1239. John Wiley and Sons, Athens, Grece, Octubre 1998.
396 J. Cadafalch, et al., Comparative study of conservative and nonconservative interpolation schemes for the domain decomposition method on laminar incommpressible flows, Numerical Heat Transfer, Part. B, vol 35, pp. 65-84, 1999. S. Candel et al., Problems and Perspectives in Numerical Combustion. Comp. Meth. in App. Sci. '96, J. Wiley and Sons Ltd, 1996. G.Chesshire and W.D.Henshaw, A Scheme for Conservative Interpolation on Overlapping Grids, SIAM J. Sci. Comput., vol 15, no 4, pp 819-845, 1994. R. Consul et al., Numerical Studies on Laminar Premixed and Diffusion Flames, 10th Conf. on Num. Meth. in Ther. Prob., pp. 198-209, Swansea., 1997. R. Consul et al., A. Oliva, Numerial Analysis of Laminar Flames Using the Domain Descomposition Method, The Fourth ECCOMAS Computational Fluid Dynamics Conference, Vol. 1, Part 2, pp. 996-1001, John Wiley and Sons, Athens, Grece, Octubre, 1998. M.S.Darwish and F.Moukalled, Normalized Variable and Space Formulation Methodology for High-Resolution Schemes, NHTB, v26,pp 79-96, 1994 J.P.Van Doormal and G.D.Raithby, Enhancements of the SIMPLE method for predicting incompressible fluid flows, NHT, v 7, pp 147-163, 1984 10. P.H.Gaskell and K.C.Lau, Curvature-Compensated Convective Transport: SMART, New Boundedness-Preserving Transport Algorithm, Int. J. Numer. Meth. Fluids, vol 8, pp. 617-641, 1988. 11. B.R.Hutchinson and G.D.Raithby, A Multigrid Method Based on the Additive Correction Strategy, NHT, v 9, pp 511-537, 1986 12. W.P. Jones, and R.P. Lindstedt, Global Reaction Schemes for Hydrocarbon Combustion, Comb. and Flame, v 73, pp 233-249, 1988. 13. A.J.Lewis and A.D.Brent, A Comparison of Coarse and Fine Grain Parallelization Strategies for the Simple Pressure Correction Algorithm, IJNMF, v 16, pp 891-914, 1993. 14. M.Peric E.Schreck, Analysis of Efficiency of Implicit CFD Methods on MIMD Computers, Parallel Computational Fluid Dynamics: Algorithms and Results Using Advanced Computers, pp 145-152, 1995 15. N.R.Reyes et al., Subdomain Method in Both Natural and Forced Convection, Application to Irregular Geometries, in Numerical Methods in Laminar and Turbulent Flow: Proceedings of the 8th international conference, Ed. C.Taylor, 1993. 16. P.S.Sathyamurthy, Development and Evaluation of Efficient Solution Procedures for Fluid Flow and Heat Transfer Problems in Complex Geometries, Ph.D.Thesis, 1991. 17. M.Schafer, Parallel Algorithms for the Numerical Simulation of Three-Dimensional Natural Convection, Applied Numerical Mathematics, v 7, pp 347-365, 1991 18. V.Seidl et al, Space -and Time- Parallel Navier-Stokes Ssolver for 3D Block-Adaptive Cartesian Grids, Parallel Computational Fluid Dynamics: Algorithms and Results Using Advanced Computers, pp 577-584, 1995 19. M.D. Smooke et al., Numerical Solution of Two-Dimensional Axisymmetric Laminar Diffusion Flames, Comb. Sci. and Tech., 67: 85-122, 1989. 20. L.M.T.Sommers phD Thesis, Technical University of Eindhoven, 1994 21. J. Warnatz et al., Combustion, Springer-Verlag, Heidelberg, 1996.
Parallel ComputationalFluidDynamics Towards Teraflops, Optimizationand NovelFormulations D. Keyes,A. Ecer, J. Periaux,N. Satofukaand P. Fox (Editors) 2000 Elsevier ScienceB.V. All rightsreserved.
397
P a r a l l e l i z a t i o n of t h e E d g e B a s e d S t a b i l i z e d F i n i t e E l e m e n t M e t h o d A. Soulaimani a* , A. Rebaineband Y. Saad c Department of Mechanical Engineering, Ecole de technologie sup(~rieure, 1100 Notre-Dame Ouest, Montreal, PQ, H3C 1K3, CANADA b Department of Mechanical Engineering, Ecole de technologie sup~rieure, 1100 Notre-Dame Ouest, Montreal, PQ, H3C 1K3, CANADA CComputer Science Department, University of Minnesota, 4-192 EE/CSci Building, 200 Union Street S.E., Minneapolis, MN 55455 This paper presents a finite element formulation for solving multidimensional compressible flows. This method has been inspired by our experience with the SUPG, the Finite Volume and the discontinous-Galerkin methods. Our objective is to obtain a stable and accurate finite element formulation for multidimensional hyperbolic-parabolic problems with particular emphasis on compressible flows. In the proposed formulation, the upwinding effect is introduced by considering the flow characteristics along the normal vectors to the element interfaces. Numerical tests performed so far are encouraging. It is expected that further numerical experiments and a theoretical analysis will lead to more insight into this promising formulation. The computational performance issue is addressed through a parallel implementation of the finite element data structure and the iterative solver. PSPARLIB and MPI libraries are used for this purpose. The implicit parallel solver developed is based on the nonlinear version of GMRES and Additive Schwarz algorithms. Fairly good parallel performance is obtained.
1. I N T R O D U C T I O N This work discusses the numerical solution of the compressible multidimensional NavierStokes and Euler equations using the finite element metholology. The standard Galerkin variational formulation is known to generate numerical instabilities for convection dominated flows. Many stabilization approaches have been proposed in the literature during the last two decades, each introducing in a different way an additional dissipation to the original centered scheme. The Streamline Upwinding Petrov-Galerkin method of Hughes (SUPG) is commonly used in finite element based formulations [1-4] while Roe-Muscl schemes are used for finite volume methods [5]. A new stabilized finite element formulation, refered to as Edge Based Stabilized finite element methd (EBS), has been recently introduced by Soulaimani et al. [6,7] which lies between SUPG and finite volume formula*This work has been financed by grants from NSERC and Bombardier.
398 tions. This formulation seems to embodies good properties of both of the above methods: high order accuracy and stability in solving high speed flows. Preliminary numerical results in 2D were encouraging here we present further developments and more numerical experiments in 3D. In the following, SUPG and EBS methods are briefly reviewed, then the Edge Based Stabilizing method is described. The parallel data structure and the solution algorithms are discussed. Finally, some numerical simulations are presented along with parallel efficiency results.
2. G O V E R N I N G
EQUATIONS
Let ~ be a bounded domain of R nd (with nd = 2 or n d = 3) and F = O~ be its boundary. The outward unit vector normal to F is denoted by n. The nondimensional Navier-Stokes equations written in terms of the conservation variables V = (p, U, E) t are written in a vector form as
V,t + F ia,div( V ) - FdiiI I ( V ) + .7z
(1)
where U is the vector of conservative variables, F adv and F diff a r e respectively the convective and diffusive fluxes in the ith-space direction, and .9" is the source vector. Lower commas denote partial differentiation and repeated indices indicate summation. The diffusive fluxes can be written in the form F dify ~ K i j Y , j while the convective fluxes can be represented by diagonalizable Jacobian matrices Ai - FadVi,v. Any linear combination of these matrices has real eigenvalues and a complete set of eigenvectors.
3. S E M I D I S C R E T E
SUPG
FORMULATION
Throughout this paper, we consider a partition of the domain ~ into elements ~e where piecewise continous approximations for the conservative variables are adopted. It is well known that the standard Galerkin finite element formulation often leads to numerical instabilities for convective dominated flows. In the SUPG method, the Galerkin variational formulation is modified to include an intergral form depending on the local residual 7~(V) of equation (1), i.e. 7~(V) - V,t + Fia,d v ( y ) - FdiiH(V) - ~ , which is identically zero for the exact solution. The SUPG formulation reads: find V such that for all weighting function W ,
E f , [ w . (v, + r,a - J:)+ W,V:'ff]
+Z e
-
w.r:
ff,
dr
(2)
(A~W,~)- v - 7~(V) dQ - 0
e
In this formulation, the matrix r is refered to as the matrix of time scales. The SUPG formulation is built as a combination of the standard Galerkin integral form and a perturbation-like integral form depending on the local residual vector. The objective is to reinforce the stability inside the elements. The SUPG formulation involves two
399
important ingredients: (a) it is a residual method in the sense that the exact continuous regular solution of the original physical problem is still a solution of the variational problem (2). This condition highligths its importance to get high order accuracy; and (b) it contains the following elleptic term: E ( f w ( A t i . w , i ) ' r ( A j U , j ) d ~ ) which enhances the stability provided that the matrix r is appropriately designed. For multidimensional systems, it is difficult to choose r in such a way to introduce the additional stability in the characteristic directions, simply because the convection matrices are not simultaneously diagonalizable. For instance, Hughes-Mallet [8] proposed r = ( B i B i ) -1/2 where Bi - ~a xA ~ a and o_~ are the components of the element Jacobian matrix. j Oxj
4.
E D G E B A S E D S T A B I L I Z A T I O N M E T H O D (EBS)
Let us first take another look at the SUPG formulation. Using integration by parts, the integral f a ~ ( A ~ W , i ) . ' r . "R.(V) d~ can be transformed into fr~ W . ( A n r . "R.(V)) dF fae W . (Ai r . 9~.(V)),i d~ where, F e is the boundary of the element f~e, n ~ is the outward unit normal vector to F ~, and An - neiAi. If one can neglect the second integral, then
e
The above equation suggests that r could be defined, explicitly, only at the element boundary. Hence, a natural choice for r is given by r - ~htAnr -1 Since the characteristics lines are well defined on F r for the given direction n ~, then the above definition of r is not completely arbitrary. Furthermore, the stabilizating contour integral term becomes
Ee
h
?w .
. n(v))
er.
For a one dimensional hyperbolic system, one can recognize the upwinding effect introduced by the EBS formulation. Here, we would like to show how more upwinding can naturally be introduced in the framework of EBS formulation. Consider the eigendecomposition of An, An - SnAnSn -1. Let Pei - )~ih/2z~ the local Peclet number for the eigenvalue hi, h a measure of the element size on the element boundary, ~ the physical viscosity and/3~ - m i n ( P e i / 3 , 1.0). We define the matrix Bn by B n - SnLSn -1 where L is a diagonal matrix given by Li - (1 +/3~) if A~ > 0; L~ - - ( 1 -/3~) if A~ < 0 and Li - 0 if Ai - 0. The proposed EBS formulation can be summarized as follows: find V such that for all weighting functions W ,
W
9 7" ne '
9
n ( V ) dr
-
o
h with -r~e the matrix of intrinsic length scales given by r ned = 7" Bn.
We would like to point out the following important remarks:
400 - As for SUPG, the EBS formulation is a residual method, in the sense that if the exact solution is sufficiently regular then it is also a solution of (3). Thus, one may expect a highorder accuracy. Note that the only assumption made on the finite element approximations for the trial and weighting functions is that they are piecewise continuous. Equal-order or mixed approximations are possible in principle. Further theoretical analysis is required to give a clearer answer. - Stabilization effect is introduced by computing on the element interfaces the difference between the residuals, while considering the direction of the characteristcs. The parameter /~i is introduced to give more weight to the element situated in the upwind characteristic direction. It is also possible to make t3 dependent on the local residual norm, in order to add more stability in high gradient regions. - For a purely hyperbolic scalar problem, one can see some analogy between the proposed formulation and the discontinuous-Galerkin method of Lesaint-Raviart and the Finite Volume formulation. - The formula given above for the parameter/~ is introduced in order to make it rapidly vanishing in regions dominated by the physical diffusion, such as the boundary layers. - The length scale h introduced above is computed in practice as being the distance between the centroides of the element and the edge (or the face in 3D) respectively. - The EBS formulation plays the role of adding an amount of artificial viscosity in the characteristic directions. In the presence of high gradient zones as for shocks, more dissipation is needed to avoid non-desirable oscillations. A shock-capturing artificial viscosity depending of the discrete residual 7~(V) is used as suggested in [9].
5. P A R A L L E L I M P L E M E N T A T I O N
ISSUES
Domain decomposition has emerged as a quite general and convenient paradigm for solving partial differential equations on parallel computers. Typically, a domain is partitioned into several sub-domains and some technique is used to recover the global solution by a succession of solutions of independent subproblems associated with the entire domain. Each processor handles one or several subdomains in the partition and then the partial solutions are combined, typically over several iterations, to deliver an approximation to the global system. All domain decomposition methods (d.d.m.) rely on the fact that each processor can do a big part of the work independently. In this work, a decomposition-based approach is employed using an Additive Schwarz algorithm with one layer of overlaping elements. The general solution algorithm used is based on a time marching procedure combined with the quasi-Newton and the matrix-free version of GMRES algorithms. The MPI library is used for communication among processors and PSPARSLIB is used for preprocessing the parallel data structures. 5.1. D a t a s t r u c t u r e for A d d i t i v e Schwarz d . d . m , w i t h o v e r l a p p i n g In order to implement a domain decomposition approach, we need a number of numerical and non-numerical tools for performing the preprocessing tasks required to decompose a domain and map it into processors. We need also to set up the various data structures,
401 and solve the resulting distributed linear system. PSPARSLIB [10], a portable library of parallel sparse iterative solvers, is used for this purpose. The first task is to partition the domain using a partitioner such as METIS. PSPARSLIB assumes a vertex-based partitioning (a given row and the corresponding unknowns are assigned to a certain domain). However, it is more natural and convenient for FEM codes to partition according to elements. The conversion is easy to do by setting up a dual graph which shows the coupling between elements. Assume that each subdomain is assigned to a different processor. We then need to set up a local data structure in each processor that makes it possible to perform the basic operations such as computing local matrices and vectors, the assembly of interface coemcients, and preconditioning operations. The first step in setting up the local data-structure mentioned above is to have each processor determine the set of all other processors with which it must exchange information when performing matrix-vector products, computing global residual vector, or assembly of matrix components related to interface nodes. When performing a matrix-by-vector product or computing a residual global vector (as actually done in the present FEM code), neighboring processors must exchange values of their adjacent interface nodes. In order to perform this exchange operation efficiently, it is important to determine the list of nodes that are coupled with nodes in other processors. These local interface nodes are grouped processor by processor and are listed at the end of the local nodes list. Once the boundary exchange information is determined, the local representations of the distributed linear system must be built in each processor. If it is needed to compute the global residual vector or the global preconditioning matrix, we need to compute first their local representation to a given processor and move the interface components from remote processors for the operation to complete. The assembly of interface components for the preconditioning matrix is a non-trivial task. A special data structure for the interface local matrix is built to facilitate the assembly opration, in particular when using the Additive Schwarz algorithm with geometrically non-overlapping subdomains. The boundary exchange information contains the following items: 1. n p r o c - The number of all adjacent processors. 2. proc(l:nproc) - List of the nproc adjacent processors. 3. i x - List of local interface nodes, i.e. nodes whose values must be exchanged with neighboring processors. The list is organized processor by processor using a pointer-list data structure. 4. Vasend - The trace of the preconditioning matrix at the local interface, computed using local elements. This matrix is organized in a CSR format, each element of which can be retrieved using arrays iasend and jasend. Rows of matrix Vasend are sent to the adjacent subdomains using arrays proc and ix. 5. jasend and iasend - The Compressed-Sparse-Row arrays for the local interface matrix Vasend , i.e. j a s e n d is an integer array to store the column positions in global numbering of the elements in the interface matrix Vasend and iasend a pointer array, the i-th entry of which points to the beginning of the i-th row in j a s e n d and Vasend. 6. var~cv - The assembled interface matrix, i.e. 
each subdomains assembles in Varecv interface matrix elements received from adjacent subdomains, varify is stored also in a CSR format using two arrays jarecv and iarecv.
402 5.2. A l g o r i t h m i c a s p e c t s The general solution algorithm employes a time marching procedure with local timesteping for steady state solutions. At each time step, a nonlinear system is solved using a quasi-Newton method and the matrix-free GMRES algorithm. The preconditioner used is the block-Jacobian matrix computed, and factorized using ILUT algorithm, at each 10 time steps. Interface coefficients of the preconditoner are computed by assembling contributions from all adjacent elements and subdomains, i.e. the Varecv matrix is assembled with the local Jacobian matrix. Another aspect worth mentioning is the fact that the FEM formulation requires a continuous state vector V in order to compute a consistent residual vector. However, when applying the preconditoner (i.e. multiplication of the factorized preconditioner by a vector) or at the end of Krylov-iterations, a discontinuous solution at the subdomains interfaces is obtained. To circumvent this inconsistency, a simple averaging operation is applied to the solution interface coefficients.
6.
NUMERICAL
RESULTS
The EBS formulation has been implemented in 2D and 3D, and tested for computing viscous and inviscid compressible flows. Only 3D results will be shown here. Also, EBS results are compared with those obtained using SUPG formulation (the definition of the stabilization matrix employed is given by ~- = (~]i [Bil) -1) [3] and with some results obtained using a Finite Volume code developed in INRIA (France). Linear finite element approximations over tetrahedra are used for 3D calculations. A second order time-marching procedure is used, with nodal time steps for steady solutions. Three-dimensional tests have been carried out for computing viscous flows over a flat plate and inviscid as well as turbulent flows (using the one-equation Sparlat-Allmaras turbulence model) around the ONERA-M6 wing. All computations are performed using a SUN Enterprise 6000 parallel machine with 165 MHz processors. The objective of the numerical tests is to assess the stability and accuracy of EBS formulation as compared to SUPG and FV methods and to evaluate the computational efficiency of the parallel code. For the flat plate, the computation conditions are Re=400 and Mach=l.9. Figures 1-3 show the Mach number contours (at a vertical plane). For the ONERA-M6 wing, a Euler solution is computed for Mach= 0.8447 and an angle of attack of 5.06 degrees. The mesh used has 15460 nodes and 80424 elements. Figures 4-7 show the Mach number contours on the wing. It is clearly shown that EBS method is stable and less diffusive than SUPG method and the first Order Finite Volume method (Roe scheme). Figure 8 shows the convergence history for different numbers of processors. Interesting speedups have been obtained using the parallelization procedure described previously, since the iteration counts have not been significantly increased as the number of processors increased (Table 1). Under the same conditions (Mach number, angle of attack and mesh), a turbulent flow is computed for a Reynolds number of R~ = 11.7106 and for a distance 5 = 10 .4 using the same coarse mesh. Again, the numerical results show clearly that SUPG and first-order FV codes give a smeared shock (the figures are not included because of space, see [7]). It is fairly well captured by the EBS method. However, the use of the second-order FV method results in a much stronger shock.
403 All numerical tests performed so far show clearly that EBS formulation is stable, accurate and less diffusive than SUPG and the finite volume methods. Table 1 shows speed-up results obtained in the case of Euler computations around the Onera-M6 wing for EBS formulation using the parallel implementation of the code. The additive Schwarz algorithm, with only one layer of overlapping elements, along with ILUT factorization and the GMRES algorithm seem well suited as numerical tools for parallel solutions of compressible flows. Further numerical experiments in 3D using finer meshes are in progress along with more analysis of the parallel performance.
REFERENCES
1. A.N. Brooks and T.J.R. Hughes. Computer Methods in Applied Mechanics and Engineering, 32, 199-259 (1982). 2. T.J.R. Hughes, L.P. Franca and Hulbert. Computer Methods in Applied Mechanics and Engineering, 73, 173-189 (1989). 3. A. Soulaimani and M. Fortin. Computer Methods in Applied Mechanics and Engineering, 118, 319-350 (1994). 4. N.E. E1Kadri, A. Soulaimani and C. Deschenes. to appear in Computer Methods in Applied Mechanics and Engineering. 5. A. Dervieux. Von Karman lecture note series 1884-04. March 1985. 6. A. Soulaimani and C. Farhat. Proceedings of the ICES-98 Conference: Modeling and Simulation Based Engineering. Atluri and O'Donoghue editors, 923-928, October (1998). 7. A. Soulaimani and A. Rebaine. Technical paper AIAA-99-3269, June 1999. 8. T.J.R. Hughes and Mallet. Computer Methods in Applied Mechanics and Engineering, 58, 305-328 (1986). 9. G.J. Le Beau, S.E. Ray, S.K. Aliabadi and T.E. Tezduyar. Computer Methods in Applied Mechanics and Engineering, 104, 397-422, (1993). 10. Y. Saad. Iterative Methods For Sparse Linear Systems. PWS PUBLISHING COMPANY (1996).
Table 1 Performance of Parallel computations with the number of processors, Euler flow around Onera-M6 wing SUPG EBS Speedup Etficieny Speedup Efficiency 2 1.91 0.95 1.86 0.93 4 3.64 0.91 3.73 0.93 6 5.61 0.94 5.55 0.93 8 7.19 0.90 7.30 0.91 10 9.02 0.90 8.79 0.88 12 10.34 0.86 10.55 0.88
404
. . . . . . .
-
Figure 1. 3D viscous flow at Re = 400 and M = 1.9- Mach contours with EBS.
Figure 2. 3D viscous flow at Re = 400 and M = 1 . 9 - Mach contours with S U P G .
Figure 3. 3D viscous flow at Re = 400 and M = 1.9- M a c h contours with FV.
Figure 4. Euler flow around Onera-M6 wing- Mach contours with SUPG.
405
Figure 5. Euler flow around Onera-M6 wing- Mach contours with EBS
Figure 6. Euler flow around Onera-M6 wing- Mach contours with 1st order F.V.
Normalized residual norm vs time steps
o
12 4 6 1
-1
-2
-3
1
-4 -5
-7 ' 0
'
'
50
100
J
'
150
200
I 250
Time steps
Figure 7. Euler flow around Onera-M6 wing- Mach contours with 2nd order F.V.
Figure 8. Euler flow around Onera-M6 wing. Convergence history with EBS.
proc. proc. proc. proc.
-e---~---n---x....
This Page Intentionally Left Blank
Parallel Computational Fluid Dynamics Towards Teraflops, Optimization and Novel Formulations D. Keyes, A. Ecer, J. Periaux, N. Satofuka and P. Fox (Editors) o 2000 Elsevier Science B.V. All rights reserved.
407
The Need for Multi-grid for Large Computations A.Twerda, R.L.Verweij, T.W.J. Peeters, A.F. Bakker Department of Applied Physics, Delft University of Technology, Lorentzweg 1, 2628 CJ Delft, The Netherlands. In this paper an implementation of a multi-grid algorithm for simulation of turbulent reacting flows is presented. Results are shown for a turbulent backward facing step and a glass-melting furnace simulation. The multi-grid method proves to be a good acceleration technique for these kinds of computations on parallel platforms.
1
INTRODUCTION
In the last decade considerable efforts have been spent in porting Computational Fluid Dynamics codes to parallel platforms. This enables the use of larger grids with finer resolution and reduces down the vast amount of computer time needed for numerical computations of complex processes. In this paper, heat transfer, combustion, and fluid flow in industrial glass melting furnaces are considered. Simulations of these processes require advanced models of turbulence, combustion, and radiation, in conjunction with sufficiently fine numerical grids to obtain accurate predictions of furnace efficiency and concentrations of various toxic gases. The need for very fine grids in turbulent combustion simulations stems from the huge range in time and length scales occurring in the furnace. For example, the predictions for the concentrations O2 and NOx still vary 10 to 12 % when the grid is refined from a 32 • 48 • 40 to a 32 x 72 • 60 grid [1]. Furthermore, most solvers do not scale with the number of grid-points. To keep turnaround times for the large systems involved moderate, efficient techniques have to be used. The aim of this paper is to show that multi-grid provides a powerful tool in conjunction with domain-decomposition and parallelisation. Parallel computing yields the resources for accurate prediction of turbulent reacting flows in large scale furnaces.
2
MATHEMATICAL
AND PHYSICAL MODEL
Figure (1) shows a typical furnace geometry (0.9 x 3.8 x 1 m 3) , where the pre-heated air (T = 1400 K, v = 9 m/s) and the gas (T = 300 K, v = 125 m/s) enter the furnace
408
Flame
Gas inlet
Air inlet
""
.
Flue gases
f ,
Figure 1: Artist's impression of a furnace geometry with flame separately. The turbulence, mainly occurring because of the high gas-inlet velocity, improves the mixing that is essential for combustion of the initially non-premixed fuel and oxidiser into products. The products exit the furnace at the opposite side. The maximum time-averaged temperatures encountered in the furnace are typically 2200 K. At this temperature, most of the heat transfer to the walls occurs through radiation. The conservation equations of mass, momentum, energy and species are applied to describe the turbulent reacting flow. As often used in turbulent reacting flow simulation, Favre averaging is applied to these equations. The quantities are averaged with respect to the density. The Favre averaged incompressible (variable density) Navier-Stokes equations are solved for conservation of momentum and mass. The standard high-Reynolds number k - 6 with wall functions are applied to account for the turbulence [10]. The transport equation for the enthalpy is solved for the conservation of energy. The conservation of species is modelled with a conserved scalar approach. The concentrations of all species are directly coupled to the mean mixture fraction. To close the set equations, the ideal gas law is used. For radiative heat transfer the Discreet Transfer Model is applied. The chemistry is modelled with an constraint equilibrium model. A f~- probability density function is used for computing the mean values of the thermo-chemical quantities. Additional details can be found in [9] and [1].
3
NUMERICAL
MODEL
The set of equations described in the previous section are discretised using the Finite Volume Method (FVM). The computational domain is divided in a finite number of control volumes (CV). A cell-centered colocated Cartesian grid arrangement is applied [3]. Diffusive fluxes are approximated with the central difference scheme. For laminar flows, the convective fluxes are approximated with the central difference scheme. For turbulent (reacting) flows, upwind scheme with the 'van Leer flux limiter' is used. For pressure velocity coupling the SIMPLE scheme is applied [7, 3]. The porosity method is used to match the Cartesian grid to the geometry of the furnace [9]. GMRES or SIP are used for solving the set of linear equations [15, 11].
409
3.1
PARALLEL
IMPLEMENTATION
To increase the number of grid points and keep simulation times reasonable, the code is parallelized. Domain Decomposition (DD) with minimal overlap is used as the parallelisation technique. This technique has proven to be very efficient for creating a parallel algorithm [13]. A grid embedding technique, with static load balancing, is applied to divide the global domain into sub-domains [2]. Every processor is assigned to one subdomain. Message passing libraries MPI, o r SHEM_GET//PUT on the Cray T3E, are used to deal with communication between processors [4].
3.2
MULTI-GRID
METHOD
One of the drawbacks of DD is that, when implicit solvers are used to solve the discretised set of equations, the convergence decreases as the number of domains increases [8]. To overcome this problem, a multi-grid (MG) method is applied over the set of equations. That is, when a grid-level is visited during a MG-cycle several iterations of all transport equations are solved. Cell-centred coarsening is used to construct the sequence of grids. This means that, in 3D, one coarse grid CV compromises eight fine grid CVs. The V-cycle with a full approximation scheme is implemented using tri-linear interpolation for restriction and prolongation operators. Since the multi-grid tends to be slow in the beginning of the iteration, it is efficient to start with a good approximation. For this approximation a converged solution of a coarser grid is used, and interpolated onto the fine grid. This leads to the Full Multi-Grid method (FMG) [14]. Special care is taken to obtain values on coarse grids of variables which are not solved with a transport equation. These variables (e.g. density and turbulent viscosity) are only calculated on the finest grid and interpolated onto the coarser grids.
4 4.1
RESULTS TURBULENT
BACK
~_
h
I
STEP
FLOW
inlet
.......
I
lOh Figure 2: Schematic picture of the backward facing step The code has been validated on this standard CFD test problem. The geometry of this test case is simple, but the flow is complex. The recirculation zone after the step is especially difficult to predict. See Figures (2) and (3). Le and Moin provided some direct numerical simulation data and Jakirli5 calculated this test-case with the k - c turbulence
410
model [6, 5]. Our results agree well with these results as shown by Verweij [12]. A different configuration was used to perform the MG calculations. The Reynolds number based on the inlet-velocity and the step height is R e - 105. A constant inlet velocity, Vinlet -- l m / s was used. The step height was h - 0.02m. The coarsest grid-level consisted of 6 • 24 uniform CVs, which were needed to catch the step-height in one CV along a vertical line. In Figure (4) the convergence histories for the turbulent kinetic energy (k)
Figure 3: Recirculation zone of the backward facing step equation for two grids are plotted. Respectively three and four levels where used during the computation resulting in 24 • 96 and 48 • 192 number of grid points. The convergence of the FMG is much better than for the single grid case, and is almost independent of the number of CVs. These results are in agreement with results presented by Ferizger and Peri~ [3]. i
I
i
i
i
10-6 10 -6 10
~.,, '~-_- - .
FMG 24x96 FMG 48xl 92 SG 24x96 SG 48xl 92
----
-10
----=
\
:3 10-12
~ rr
10-14 10
-16
30
-18
10
-20
I I
\ \
i
200
,
i
400
,
i
600
,
i
800
,
t
1000
,
L
1200
# iterations Figure 4: Convergence history of k equation of the backward-facing step
4.2
PARALLEL
EFFICIENCY
The parallel efficiency was also analysed, using 1 to 16 processors on a CRAY-T3E. The two grids were used: a coarse grid consisting of 6 x 24 points and a fine grid with 48 • 196
411
points. On the coarse grid no MG was applied. Four grid levels were used in the MG algorithm on the fine grid. The results are shown in Table 1. The left number shows the number of MG-cycles which were needed for convergence (residual less then 10-16). The right number is the time used per MG-cycle. For the coarse grid the number of Table 1: Number of iterations and time per iteration in seconds coarse grid fine grid # processors # cycles time (s) # cycles time (s) 1
80
1.30
-
-
2
80 > 500 > 500 -
1.32 1.34 1.42 -
150 180 170 350
12.95 10.05 8.51 6.39
4 8 16
cycles increases dramatically when the number of processors increases. This is due to the implicit solver which was employed. This is not the case for the MG algorithm on the fine grid, and the is main reason for using the algorithm. The CPU-time consumed per iteration is nearly constant for the coarse grid and decreases for the fine grid. The small problem size is the main reason for the bad speed-up for the coarse grid. Also the M G algorithm on the finer grid scales poorly with the number of processors. This is explained by the fact that some iterations are still needed on the coarse grid. These iterations do not scale with the number of processors, as already mentioned. 4.3
FULL
e+07
FURNACE
SIMULATION
e+06 e+06
~ ~ ~ ! ~ ~ ~ ! ~ ! ~ ! ~ ! ~ ! ~ i l l ~ ~ i ~{IIiiiiii!!iliiiliiliii i!;!iilii!iii!i!ilii~i!!!i!i!ili~ii~i i!!iii:lili~i!!!ii!iiii:,~ii!i!!ii!!(!ii!i!:i!l~ii!i!iiiiii! !! i~ii:!i:.ii!:=!i!i!i:!i~i~!iili !=iii!iliiiiiiiiili!?iliiiiilii ! ~~i~iiiii!iiiiiii{iiiiiiiiiii!iiiiiii!:.!i!iili::i:i!iiii!iliiii!iiiiii!illii!iiiili'.ill:.El!ii!?!!ii:!!i:!:!iiiiii!!ill i!i!i::!ili!ii!!i!;:iiliii~:ililiililiilili~i =~'
i!~!1!!i!~;; i'!: :: ~!; ii ;:
i:: : :::: i: ii:ii ii:1:)[i:i!i ;!11iii!!!::i i i!iiiiiii!iii:!iii!]!!ii!iiii;i!!
Figure 5: Contour plot of the enthalpy in the center plane of the flame Finally, we present some results of a full furnace simulation. The furnace simulated is the I F R F glass-melting furnace at IJmuiden. More detailed description of the furnace can been found in [1]. The computations are done on 2 grids: a medium (34 x 50 x 42)
412
and fine (66 x 98 x 82) grid on 16 processors on a CRAY-T3E. On both grids two levels of MG are used. A typical contour plot of the enthalpy is shown in Figure (5). The flame with the peak in the enthalpy is clearly visible near the gas inlet. In Figure (6) the fuel balance, or fuel inlet-outlet ratio, of the computations of the medium and fine grid. This value should of course go to zero, and is therefore a useful parameter for checking the convergence. Usually, the computation is considered converged if the value is below 10 - 1 % . Other convergence criteria must also be met. The number of iterations on the x-as are on the finest grid. As these simulations are in 3D, the number of grids points on the coarser grid are 8 times less. So this is also good measure for the computing times consumed by the program. Also for this type of simulation the MG algorithm converges 10
2
,_..,
10 ~
(9 0 ttl:l
10 0
o~
~- - - -~ M e d i u m grid S G Fine grid S G -: M e d i u m grid M G Fine grid M G
rn
~) L L 10-1
,
0
i
5000
,
i
10000
# iterations
,
15000
Figure 6: Fuel balance for furnace simulations using a medium and a fine grid faster than the single grid counterpart. The computations on the fine grid have not fully converged yet, but it is clear that the MG algorithm performs better. The convergences rate is not independent of the number of CVs applied, because on the fine grid physics will be resolved that are not captured by the coarser grids.
5
CONCLUSIONS
The multi-grid method, as described in this paper, has proven to be a good acceleration technique for turbulent reacting flow simulations. The method also improves the convergence behaviour when domain decomposition is applied, and the number of iterations does not change significantly when more processors are used. Using the MG allows the use of finer grids, which will increase the accuracy of turbulent combustion simulations, and yet keep simulation times acceptable.
413
ACKN OWL ED G EMENT
S
The authors would like to thank the HPc~C-centre for the computing facilities.
REFERENCES [1]
G.P. Boerstoel. Modelling of gas-fired furnaces. Technology, 1997.
PhD thesis, Delft University of
[2] P.
Coelho, J.C.F. Pereira, and M.G. Carvalho. Calculation of laminar recirculating flows using a local non-staggered grid refinement system. Int. J. Num. Meth. Fl., 12:535-557, 1991.
[8] J.H.
Ferziger and M. Perid. Computational Methods for Fluid Dynamics. SpringerVerlag, New York, 1995.
[4]
William D. Gropp and Ewing Lusk. User's Guide for mpich, a Portable Implementation of MPI. Mathematics and Computer Science Division, Argonne National Laboratory, 1996. ANL-96/6.
[5]
S. Jakirli(~. Reynolds-Spannungs-modeUierung komplexer turbulenter StrSmungen. PhD thesis, Universit~it Erlangen-Niirnberg, 1997.
[6] H. Le, P. Moin, and J. Kim. Direct numerical simulation of turbulent flow over a backward-facing step. In Ninth symposium on turbulent shear flows, pages 1321-1325, 1993. [7] S.V. Patankar. Numerical heat transfer and fluid flow. McGraw-Hill, London, 1980. [8] M. Perid, M. Sch~ifer, and E. Schreck. Computation of fluid flow with a parallel multigrid solver. Parallel Computational Fluid Dynamics, pages 297-312, 1992. [9] L. Post. Modelling of flow and combustion in a glass melting furnace. PhD thesis, Delft University of Technology, 1988. [10] W. Rodi. Turbulence models and their application in hydraulics- a state of the art review. IAHR, Karlsruhe, 1984. [11] H. L. Stone. Iterative solution of implicit approximations of multi-dimensional partial differential equations. SIAM J. Numer. Anal., 5:530-558, 1968. [12] R.L. Verweij. Parallel computing for furnace simulations using domain decomposition. PhD thesis, Delft University of Technology, 1999. [13] RL. Verweij, A. Twerda, and T.W.I. Peeters. Parallel computing for reacting flows using adaptive grid refinement. In Tenth International Conference on Domain Decomposition Methods, Boulder, Colorado, USA, 1998.
414 [14] P. Wesseling. An introduction to multigrid methods. John Wiley & Sons, Chichester, 1992. [15] Y.Saad and M.H. Schultz. Gmres" A generalized minimal residual algorithm for solving nonsymmetric linear systems. SIAM, 7:856-869, 1986.
Parallel ComputationalFluid Dynamics Towards Teraflops, Optimizationand Novel Formulations D. Keyes,A. Ecer, J. Periaux,N. Satofukaand P. Fox (Editors) 92000 Elsevier Science B.V. All rights reserved.
415
Parallel C o m p u t a t i o n s of Unsteady Euler Equations on D y n a m i c a l l y D e f o r m i n g Unstructured Grids A. Uzun, H.U. Akay, and C.E. Bronnenberg Computational Fluid Dynamics Laboratory, Department of Mechanical Engineering, Purdue School of Engineering and Technology, IUPUI, Indianapolis, IN 46202, USA. A parallel algorithm for the solution of unsteady Euler equations on unstructured deforming grids is developed by modifying USM3D, a flow solver originally developed at the NASA Langley Research Center. The spatial discretization is based on the flux vector splitting method of Van Leer. The temporal discretization involves an implicit timeintegration scheme that utilizes a Gauss-Seidel relaxation procedure. Movement of the computational mesh is accomplished by means of a dynamic grid algorithm. Detailed descriptions of the parallel solution algorithm are given and computations for airflow around a full aircraft configuration are presented to demonstrate the applications and efficiency of the current parallel solution algorithm. 1. INTRODUCTION Problems associated with unsteady flows and moving boundaries have been of long interest to aerodynamicists. Examples of such problems include external flows around oscillating airfoils and other three-dimensional configurations in motion. The current research is based on developing a parallel algorithm for the solution of unsteady aerodynamics and moving boundary problems using unstructured dynamic grids and the Arbitrary Lagrangian-Eulerian (ALE) method [1]. The sequential version of flow solver, USM3D, originally developed at the NASA Langley Research Center, has been modified here to solve unsteady aerodynamics and moving boundary problems. The version of the solver USM3D modified here was primarily developed to solve steady-state Euler equations on unstructured stationary grids [2]. The algorithm is an implicit, cell-centered scheme in which the fluxes on cell faces are obtained using the Van Leer flux vector splitting method [3]. The dynamic grid algorithm that is coupled with the flow solver moves the computational mesh to conform to the instantaneous position of the moving boundary. The solution at each time step is updated with an implicit algorithm that uses the linearized backward-Euler time-differencing scheme. 2. FLUID FLOW EQUATIONS The Arbitrary Lagrangian-Eulerian formulation of the three-dimensional time-dependent inviscid fluid-flow equations are expressed in the following form [4]:
416 (1)
f2
~)f2
where Q = Lo, pu, pv, pw, e] r is the vector of conserved flow variables, F. fi = F = ((fi-'~). fi)Lo
,ov pw
pu
is the convective flux vector, fi = [nx
ny
e + p] T+p[0
nx
ny
nz
a t ]T
(2)
nz] T is the normal vector on the boundary Of~,
fi = [u, v, w] T and ~ = [x t, Yt, Zt IT are the fluid and grid velocity vectors, respectively, and
= ~r fi is the contravariant face speed. The pressure p is given by the equation of state for a perfect gas p = ( 7 - 1 ) [ e1- - ~ p ( u 2 +2 vw 2+] ) . a t = xtn x + Ytny + ztn z
The above equations have been nondimensionalized by the freestream density p~ and the freestream speed of sound a~. The domain of interest is divided into a finite number of tetrahedral cells and Eq. (1) is applied to each cell. 3. DYNAMIC MESH ALGORITHM
The current work models the unsteady aerodynamic response that is caused by oscillations of a configuration. Hence, the mesh movement is known beforehand. The dynamic mesh algorithm moves the computational mesh to conform to the instantaneous position of the moving boundary at each time step. Following the work of Batina [5], the algorithm treats the computational mesh as a system of interconnected springs. This system is constructed by representing each edge of each triangle by a tension spring. The spring stiffness for a given edge i - j is taken as inversely proportional to the length of the edge as: k m =(~/(xj-xi)
2 +(yj-yi)
2 + ( Z j --Zi) 2
(3)
Grid points on the outer boundary of the mesh are held fixed while instantaneous location of the points on the inner boundary (i.e., moving body) is given by the body motion. At each time step, the static equilibrium equations in the x, y, and z directions that result from a summation of forces are solved iteratively at each interior node i of the mesh for the displacements Ax i , Ay i and Azi, respectively. After the new locations of the nodes are found, the new metrics (i.e., new cell volumes, cell face areas, face normal vectors, etc.) are computed. The nodal displacements are divided by the time increment to determine the velocity of the nodes. It is assumed that the velocity of a node is constant in magnitude and direction during a time step. Once the nodal velocities are computed, the velocity of a triangular cell face is found by taking the arithmetic average of the velocities of the three nodes that constitute the face. The face velocities are used in the flux computations in the solution algorithm.
417 4. T I M E I N T E G R A T I O N In the cell-centered finite volume approach used here, the flow variables are volumeaveraged values, hence the governing equations are rewritten in the following form: V n+l
AQ n . . . . At where
Wn
IIF(Qn+I) 9r i d S _ Q n . AV___~ n At
is the cell volume at time step
n,
V n+l
(4)
is the cell volume at time step n+l,
AQn _Qn+l _ Q n , A V n = V n+l - V n is the change in cell volume from time step n to n+l, and At is the time increment Since an implicit time-integration scheme is employed, fluxes are evaluated at time step n+l. The flux vector is linearized according to R n+l
:
Rn
~R n -I-
AQ n
(5)
0Q Hence the following system of linear equations is solved at each time step: A-AQ n
= R n
_Qn. Avn At
where
A =
V n-I-1
(6)
~R n and R n = - I I F(Q n ). II dS .
I - ~ At ~Q
~ta
The flow variables are stored at the centroid of each tetrahedron. Flux quantities across cell faces are computed using Van Leer's flux vector splitting method [3]. The implicit time integration used in this study has been originally developed by Anderson [6] in an effort to solve steady-state Euler equations on stationary grids. The system of simultaneous equations that results from the application of Eq. (6) for all the cells in the mesh can be obtained by a direct inversion of a large matrix with large bandwidth. However, the direct inversion technique demands a huge amount of memory and extensive computer time to perform the matrix inversions for three-dimensional problems. Therefore, the direct inversion approach is computationally very expensive for practical threedimensional calculations. Instead, a Gauss-Seidel relaxation procedure has been used in this study for the solution of the system of equations. In the current relaxation scheme, the solution is obtained through a sequence of iterations in which an approximation of AQ n is continually refined until acceptable convergence. 5. P A R A L L E L I Z A T I O N 5.1. Mesh Partitioning A domain decomposition approach has been implemented to achieve parallelization of both the dynamic mesh algorithm and the flow solution scheme. The computational domain
418 that is discretized with an unstructured mesh is partitioned into subdomains or blocks using a program named General Divider (GD). General Divider is a general unstructured meshpartitioning code developed at the CFD Laboratory at IUPUI [7]. It is currently capable of partitioning both structured (hexahedral) and unstructured (tetrahedral) meshes in threedimensional geometries. The interfaces between partitioned blocks are of matching and overlapping type [8]. They exchange data between the blocks.
5.2. Parallelization of the Dynamic Mesh Algorithm During a parallel run of the dynamic mesh algorithm, the code executes on each of the partitioned blocks to solve the relevant equations for that particular subdomain. The dynamic mesh algorithm is an iterative type solver. After every iteration step, the blocks communicate via interfaces to exchange the node displacements on the block boundaries. The nodes on the interfaces that are flagged as sending nodes send their data to the corresponding nodes in the neighboring block. Similarly, the nodes on the interfaces that are flagged as receiving nodes receive data from the corresponding nodes in the neighboring block. In this research, the communication between blocks is achieved by means of the message-passing library PVM (Parallel Virtual Machine) [9].
5.3. Parallelization of the Flow Solver The implicit time integration used in this study is also an iterative solver in which an approximation of AQ for each cell is continually refined until acceptable convergence. In a given block, there are three types of cells as follows: 9 Type I (Interior). Interior cells: These are cells that do not have a face on any kind of boundary. All four faces of such cells are interior faces. 9 Type PB (Physical Boundary). Boundary cells that are not adjacent to an outer interface boundary: These are cells that have a face on a physical boundary such as an inviscid wall, far-field boundary, etc., and not on an outer interface boundary. 9 Type IB (Interface Boundary). Boundary cells that are adjacent to an outer interface boundary: These are cells that have a face on an outer interface boundary. During the course of parallel computations, the time integration scheme runs on each of the blocks. However, the time integration scheme computes the relevant AQ values only for Type I and Type PB cells and not for Type IB cells in a given block. The matching and overlapping interface type used in this study ensures that the Type IB cells in a given block will be of either Type I or Type PB in the corresponding neighboring block. Hence, AQ values in Type IB cells in a given block are taken from the corresponding neighboring block as part of the data exchange during the course of parallel computations. The AQ values in Type IB cells in a given block are needed while computing the AQ values in Type I and Type PB cells that share a common face with the Type IB cells. This approach is very effective for the implicit time integration scheme used in this study. The main advantage of this approach is the fact that it eliminates the need for applying boundary conditions explicitly on outer interface boundaries. Furthermore, the continuity of the solution across the interfaces is guaranteed.
419
5.4. Parallel Computing Environment In this research, a Linux cluster located at the NASA Glenn Research has been used as the parallel computing environment. The cluster has 32 Intel Pentium II 400MHz dual processor PCs connected via a Fast Ethernet network that has a transmission rate of 100Mbps. Each PC has 512MB memory and runs on the Linux operating system. Message passing among the machines was handled by means of the PVM software. The main differences between the steady and the unsteady flow solution schemes are as follows: 9 The dynamic grid algorithm is only used in unsteady flow problems. It is not needed in steady state computations since the computational grid remains stationary. 9 Local time stepping strategy accelerates the convergence to steady state, hence this strategy is used while solving steady state problems. 9 For unsteady problems, a global time increment has to be defined since the unsteady flow solution has to be time accurate. All cells in the computational grid use the same time increment during the course of unsteady flow computations. 9 For both steady and unsteady problems, the flow solver performs 20 nonlinear iterations in each time step to solve the system of equations. Since time accuracy is not desired in steady state problems, the flow solver first performs the 20 nonlinear iterations and then communicates once in each time cycle while solving steady state problems. This approach reduces the communication time requirements of the flow solver while solving steady state problems. 9 In unsteady problems, the flow solver needs to communicate once after every nonlinear iteration. In other words, the flow solver has to communicate 20 times in each time step in order to maintain time accuracy. If this is not done, errors are introduced. 6. TEST CASES The program was first tested for accuracy by using well-known oscillating and plunging cases for the NACA0012 airfoil case. The accuracy of the results was verified by comparing with experiments [ 10] and other numerical solutions [ 11]. To demonstrate the applicability of the current method to complex problems, an aircraft configuration has been considered as the next test case. Due to space limitations, we will present the results of a generic aircraft case for testing parallel efficiency of the code. Both steady and unsteady computations were done on the aircraft configuration. The steady state computations were performed for a freestream Mach number of Moo = 0.8 and zero degree angle of attack. For unsteady computations, the aircraft was oscillated sinusoidally with an amplitude of 5 degrees. The reduced frequency was chosen as k = 0.50 for the unsteady calculations. The computations were performed on a coarse and a fine grid. The coarse grid has 319,540 cells and 57,582 nodes whereas the fine grid has 999,135 cells and 178,201 nodes. Figure 1 shows the surface triangulation for the coarse grid and Figure 2 shows the partitioned surface grid for the 16-block case. The steady-state computations were started at a CFL number of 5 on both grids. The CFL number was then linearly increased to 150 over the first 20 time steps. Local time stepping strategy was used while solving for the steady state solution. 20 subiterations were needed in
420
each time step and the blocks were communicated once at the end of 20 subiterations in each time step. The steady-state solution on the coarse grid is reached in approximately 250 time steps whereas the steady-solution on the fine grid needs about 300 time steps to converge.
Figurel. Surface triangulation for the coarse aircraft grid.
Figure 2. Partitioned surface triangulation for the 16 block coarse grid partition.
Figure 3a shows the speedup curves of the steady-state solutions for the different cases on the coarse and the fine grids. The speedups on the coarse and the fine grid seem to be identical. These speedups are quite close to the ideal speedup curve. This shows that the current parallel steady state flow solution scheme remains quite efficient even for large number of blocks. In the unsteady test case, the aircraft configuration is pitching sinusoidally about the mid chord with an amplitude of 5 degrees. For this test case, the reduced frequency, freestream Mach number, mean angle of pitching and the chord length are k = 0.50, M ~ =0.80, O~m
=
0 ~ ,
and c = 8.76, respectively, where the chord length corresponds to the full length of
the aircraft. The initial condition for this case is the previously obtained steady state solution. The results were obtained using 1500 steps per cycle of motion. 20 subiterations were done in each time step and the blocks were communicated once every subiteration in order to maintain the time accuracy. Figure 3b showss the speedup curves of the unsteady steady solutions, for the coarse and the fine grids and compares them with the ideal speedup curve. The speedup curve for the fine grid appears to be slightly better than the speedup curve for the coarse grid. The parallel efficiency of the unsteady algorithm decreases rapidly as the number of blocks is increased since the communication time requirements become more significant with increasing number of blocks. This behavior can be attributed to two factors. The first factor is the dynamic grid algorithm that is embedded in the unsteady flow solution scheme. The dynamic grid algorithm is an iterative solver, hence an error is computed after every iteration of the dynamic grid algorithm. This error has to be less than a predefined tolerance, usually 10 -5 , in order for the dynamic grid algorithm to stop the iterations. The number of iterations necessary for convergence depends on the total number of nodes in the grid and how much the moving boundaries are displaced. For example, for the unsteady flow solution on the coarse aircraft grid using 1500 time steps per cycle of motion, the number of iterations to deform the mesh at each time step ranges from about 50 to about 100. The second factor
421 responsible for the behavior of the total communication time is the fact that the unsteady flow solution scheme communicates 20 times in each time step of the nonlinear iterations in order to maintain the time accuracy.
36 .................................................................................................................................. 32
32
28
28
~. 24
-o- Coarse Grid
20
Fine Grid - t - Ideal S p e e d u p
r~ 12
I~ 24 20
r~ 12
8
8
4
4
0
0 0
4
8
12
16
20
24
28
32
N u m b e r of Blocks
(a) Steady
0
4
8
12
16
20
24
28
32
N u m b e r of B l o c k s
(b) Unsteady
Figure 3. Speedup curves for different cases on the coarse and the fine grid. The time accuracy of the unsteady computations was established by comparing multiblock solutions with the single block solution as well as by varying the time step At. Figures 4 and 5 show the deformed fine grid at the maximum and minimum angle of attack positions, respectively, of the unsteady computations.
Figure 4. Deformed coarse grid at maximum angle of attack position.
Figure 5. Deformed coarse grid at minimum angle of attack position.
7. C O N C L U S I O N S
In this research, a sequential program, USM3D, has been modified and parallelized for the solution of steady and unsteady Euler equations on unstructured grids. The solution algorithm was based on a finite volume method with an implicit time-integration scheme. Parallelization was based on domain decomposition and the message passing between the parallel processes was achieved using the Parallel Virtual Machine (PVM) library. Steady
422 and unsteady problems were analyzed to demonstrate the possible applications of the current solution method. Based on these test cases, the following conclusions can be made: 9 The parallel steady-state solution scheme showed good efficiency for all multi-block cases. The speedup of the steady-state solution scheme was quite close to the ideal speedup curve. 9 The parallel unsteady flow solution scheme showed less efficiency as the number of blocks was increased. The reason for the inefficiency for cases involving large number of blocks is the increased communication time requirements of the iterative dynamic grid algorithm. 9 Reasonable efficiencies are achieved for up to 32 processors while solving steady flows (90%) and 16 processors while solving unsteady flows (70%). ACKNOWLEDGEMENTS The permission provided by the NASA Langley Research Center for use of the flow code USM3D and its grid generator VGRID is gratefully acknowledged. The access to a Linux computer cluster provided by the NASA Glenn Research Center for parallel computations is also gratefully acknowledged. REFERENCES
1. Trepanier, J. Y., Reggio, M., Zhang, H., Camarero, R., "A Finite Volume Method for the Euler Equations on Arbitrary Lagrangian-Eulerian Grids," Computers and Fluids, Vol. 20, No. 4, pp. 399-409, 1991. 2. Frink, N. T., Parikh, P., Pirzadeh, S., "A Fast Upwind Solver for the Euler Equations on Three-Dimensional Unstructured Meshes," AIAA Paper 91-0102, 1991. 3. Van Leer, B., "Flux Vector Splitting for the Euler Equations," Lecture Notes in Physics, Vol. 170 (Springer-Verlag, New YorlJBerlin 1982), pp. 507-512. 4. Singh, K. P., Newman, J. C., Baysal, O., "Dynamic Unstructured Method for Flows Past Multiple Objects in Relative Motion," AIAA Journal, Vol. 33, No. 4, 1995. 5. Batina, J. T., "Unsteady Euler Algorithm with Unstructured Dynamic Mesh for ComplexAircraft Aerodynamic Analysis," AIAA Journal, Vol. 29, No. 3, 1991. 6. Anderson, W. K., "Grid Generation and Flow Solution Method for Euler Equations on Unstructured Grids," Journal of Computational Physics, Vol. 110, pp. 23-38, 1994. 7. Bronnenberg, C. E., "GD: A General Divider User's Manual - An Unstructured Grid Partitioning Program," CFD Laboratory, IUPUI, 1999. 8. Akay, H. U., Blech, R., Ecer, A., Ercoskun, D., Kemle, B., Quealy, A., Williams, A., "A Database Management System for Parallel Processing of CFD Algorithms," Parallel CFD '92, Edited by R. B. Pelz, et. al., Elsevier, Amsterdam, pp. 9-23, 1993. 9. Geist, G. A., Beguelin, A. L., Dongarra, J. J., Jiang, W., Manchek, R., Sunderam, V., "PVM 3 User's Guide and Reference Manual," Oak Ridge National Laboratory ORNL/TM-12187, 1993. 10. Landon, R. H., "NACA 0012. Oscillating and Transient Pitching," Compendium of Unsteady Aerodynamic Measurements, Data Set 3, AGARD-R-702, Aug. 1982. 11. Kandil, O. A. and Chuang, H. A., "Computation of Steady and Unsteady VortexDominated Flows with Shock Waves," AIAA Journal, Vol. 26, pp. 524-531, 1988.
Parallel ComputationalFluidDynamics Towards Teraflops, Optimizationand Novel Formulations D. Keyes, A. Ecer, J. Periaux, N. Satofukaand P. Fox (Editors) 92000 Elsevier Science B.V. All rightsreserved.
423
P a r a l l e l i z a t i o n and M P I P e r f o r m a n c e o f T h e r m a l Lattice B o l t z m a n n C o d e s for Fluid Turbulence George Vahala a, Jonathan Carter b, Darren Wah a, Linda Vahala c and Pavol Pavlo d aDepartment of Physics, William & Mary, Williamsburg, VA 23187 bNERSC, Lawrence Berkeley Laboratory, Berkeley, CA CDepartment of Electrical & Computer Engineering, Old Dominion University, Norfolk, VA 23529 dInstitute of Plasma Physics, Czech Academy of Science, Praha 8, Czech Republic
The Thermal Lattice Boltzmann Model (TLBM) is presented for the solution of complex twofluid systems of interest in plasma divertor physics. TLBM is a mesoscopic formulation to solve nonlinear conservation macroscopic equations in kinetic phase space, but with the minimal amount of discrete phase space velocity information. Some simulations are presented for the two-fluid interaction of perpendicular double vortex shear layers. It is seen that TLBM has almost perfect scaling as the number of PE's increases. Certain MPI optimizations are described and their effects tabulated on the speed-up in the various computational kernels.
1. INTRODUCTION One of the goals of divertor physics [1] is to model the interaction between neutrals and the plasma by a coupled UEDGE/Navier-Stokes system of equations [2]. An inverse statistical mechanics approach is to replace these highly nonlinear two-fluid macroscopic equations by two coupled linear lattice BGK kinetic equations, which, in the Chapman-Enskog limit, will permit recovery of the original two-fluid system. While the complexity of phase space has been increased from (x, t) to the kinetic phase space (x, ~, t), the TLBM approach now seeks to minimize (and discretize) the required degrees of freedom that must be preserved in ~space. As a result, instead of solving for the fluid species density ns(x, t), mean velocity Vs(X, t), and temperature 0s(x, t), we solve for the species distribution function Ns,pi(X, t) - where p is the number of speeds and i is the directional index for the chosen velocity lattice. For example, on a hexagonal (2D)lattice, one can recover the nonlinear conservation equations of mass, momentum and energy by choosing p - 2, i - 6 ; i.e., a 13-bit model (for each species)
424 can recover the 4-variable n, Vx, Vy, 0 -macroscopic system [3-10].
Now the storage
requirements for the linear kinetic representation are increased over that needed for conventional CFD by not more than a factor of 2 because of the auxiliary storage arrays needed in the CFD approach. However, there are substantial computational gains achieved by this imbedding: (i) Lagrangian kinetic codes with local operations, ideal for scalability with PE's, and (ii) the avoidance of the nonlinear Riemann problem, in CFD. There is also a sound physics reason for pursuing this kinetic imbedding. In the tokamak divertor, one encounters (time varying) regimes in which the neutral collisionalities range from the highly collisional (well treated by fluid equations) to the weakly collisional (well treated by Monte Carlo methods). The coupling of fluid-kinetic codes is a numerically stiff problem. However, by replacing the fluid representation by a TLBM, we will be coupling two kinetic codes with non-disparate time and length scales. In Sec. 2, we will present a two-fluid system of equations that we will represent by two linear kinetic BGK equations. Simulations will be presented for the case of the turbulent interaction between two perpendicular double-shear velocity layers. In Sec. 3, we will discuss the parallelization and MPI optimization of our TLBM code.
7.. T W O - F L U I D T R A N S P O R T E Q U A T I O N S
We will determine a kinetic representation for the following coupled set of nonlinear conservation equations (species mass ms, density ns, mean velocity vs, mean temperature 0s)
at (msns)+~)o~ (msnsVs,o~)=0 t)t (msnsVs,et) + ~(1-Io~13 +msnsVs, o~Vs,[5) =- msns (Vs,ct - Vss' ~x ) 7, SS'
Ot (3nsOs-
msns Vs2) + t)ct (2qs, a + Vs,oc [3nsOs + msns Vs2 ] + 2Vs, f~ I-ls,off~ ) 1 m
(1) [3 n s (t3s - 0 s, ) + msn s (Vs2 -- Vs, 2)]
7, SS'
where I-Is is the stress tensor (with viscosity Its =7,ssns Os)
ns, ,,
- ,,sOs a , ,
-
s
Us,
+
V
-
2
8~ 1 + ms ns
7,ss X 7, SS'
X
oss, - 0 ms ssct~+(Vss,,Ot--Vs,o~Vss,~--Vs,~)
l
- - '80t~ ~(Vss'2--Vs
with %s the s-species relaxation time and % s' the cross-species relaxation rate. The heat flux vector (with conductivity ~q = 5 ~f12) is
57, q o~ =-~: Oo~Os + - ss ns(O-Os ~s)(Vs'~176 27,s ~
ms2ns (Vs -Vss'
)2 (Vs, o~-
Vss' ~x )
425
2.1 Thermal Lattice Boltzmann Model of Eq. (1). In the Chapman-Enskog limit, one can recover the above nonlinear macroscopic moment equations from the following linearized BGK kinetic equations [11,12]
~tfs +~)a(~a f s ) -
(2)
- fs - gs.........~_fs -gss' T, ss
T, ss'
for species s interacting with species s'. gss is the s-species relaxation Maxwellian distribution function, while gs~,is the cross-species Maxwellian distribution
(/3 /2 E m s
gss' = ns 2~ Oss,
exp -
ms(~-Vss 20~s'
,21
(3)
The first term on the r.h.s, of (2) corresponds to the self-species collisional relaxation at rate
"Css ns ~. Os for some constant r (dependent on the interactions); while the second term corresponds to the cross-species collisional relaxation at rate 1/~
Xss, = ~E
ms+ ms'
,
where
~E =
-2
(ms + ms
nsns'
+
ms'
and 7 is another constant dependent on the type of collisional interaction. The cross-species parameters vss' and 0ss. are so chosen that the equilibration rate of each species mean velocity and temperature to each other will proceed at the same rate for these linear BGK collision operators as for the full nonlinear Boltzmann collision integrals [11,12]. The only restriction on the parameter ~ is: ~ > -1. In TLBM [3-10], one performs a phase velocity discretization, keeping a minimal representation that will still recover Eq. (1) in the Chapman-Enskog limit. For a hexagonal (2D) lattice, we require at each spatial node 13 bits of information in ~-space [3-10]
fs(X,~,t)
r
Ns,pi(X,t ) , i=1 ....6, p=0,1,2
Performing a Lagrangian discretization of the linearized BGK equations,
Us, pi (x, t )- U'se~i(x et~ Ns, pi(X +epit + 1)-Ns, p i ( X , t ) = -
T, ss
Ns, pi (x,
t) -
Nseq pi ( x, t)
T,ss'
(4)
426 where the N eq a r e appropriate Taylor-expanded form of the corresponding Maxwellians [c.f. Eq. (2)]. r are the lattice velocities.
2.2 Simulation for 2-Species Turbulence : double vortex layers We now present some simulations for a 2-species system in 2D, with double vortex layers perpendicular to each other: fluid #1 has a double vortex layer in the x-direction, while fluid #2 has a double vortex layer in the y-direction. The two fluids are coupled through a weak inter-species collisionality, with Zss' >> %, ":s'. In Fig.1 we plot the evolution of the mass density weighted resultant vorticity of the 2-fluid system, in which mz - ml, n2 - 2 nl. The first frame is after 1000 TLBE time steps. The vortex layers for fluid #1 (in the x-direction) and for fluid #2 (in the y-direction) are still distinct as it is too early in the evolution for much interaction between the two species. The vortex layers themselves are beginning to break up, as expected for 2D turbulence with its inverse cascade of energy and the formation of large spacing-filling vortices. After 5K time steps (which corresponds to about 6 eddy turnover times), larger structures are beginning to form as the vortex layers break up into structures exhibiting many length scales. After 9K time steps, the dominant large vortex and counter-rotating vortex are becoming evident. The smaller structures are less and less evident, as seen after 13K time steps. Finally, by 17K time steps one sees nearly complete inter-species equilibration. More details can be found in Ref. 13.
3. P A R A L L E L I Z A T I O N OF T L B M CODES
3.1 Algorithm For simplicity, we shall restrict our discussion of the parallelization of TLBM code to a single species system [i.e., in Eq. (4), let "Css'---) r The numerical algorithm to advance from time t --)t + 1 is: (a) at each lattice site x, free-stream Npi along its phase velocity Cpi :
(5)
Npi(X) ~ Npi(X + Cpi) (b) recalculate the macroscopic variables n, v, 0 so as to update N eq = Neq(n, v, 0) (c) perform collisional relaxation at each spatial lattice node:
Npi(x)N p i ( X ) --
(x)
--~ Npi (X), at time t + 1
T,
It should be noted that this algorithm is not only the most efficient one can achieve but also has a (kinetic) CFL - 1, so that neither numerical dissipation or diffusion is introduced into the simulation.
427 3.2 P e r f o r m a n c e on the C R A Y C90 Vector S u p e r c o m p u t e r On a dedicated C90 with 16 PE's at 960 MFlops/sec and a vector length of 128, the TLBM code is almost ideally vectorized and parallelized giving the following statistics on a 42minute wallclock run 9
Table 1 Timing for TLBM code on a dedicated C90 Floating Ops/sec avg. conflict/ref CPU/wallclock time ratio Floating Ops/wal! sec Average Vector Length for all operations Data Transferred
603.95 M 0.15 15.52 9374.54 M 127.87 54.6958 MWords
3.3 Optimization of the M P I Code on T 3 E With MPI, one first subdivides the spatial lattice using simple domain decomposition. The TLBM code consists of 2 computational kernels, which only act on local data: 9"integrate" - in which one computes the mean macroscopic variables, (b)" and 9 "collision" - in which one recomputes N eq using the updated mean moments from "integrate", and then performs the collisional relaxation, (c). 9 The "stream" operation, (a), is a simple shift operation that requires message passing only to distribute boundary data between PE's. With straightforward MPI, the execution time for a particular run was 5830 sec - see Table 2 - of which 80% was spent in the "collision" kernel. For single PE optimization, we tried to access arrays with unit stride as much as possible (although much data was still not accessed in this fashion), to reuse data once in cache, and to try to precompute expensive operations (e.g., divide) which were used more than once. This tuning significantly affected "collision", increasing the Flop-rate for this kernel from 24 MFlops to 76 MFlops. The "integrate" kernel stayed at 50 MFlops. As the "stream" operation is simply a sequence of data movements, both on and off PE, it seemed more appropriate to measure it in terms of throughput. Initially, "stream" propagated information at an aggregate rate of 2.4 MB/sec, using mainly stride-one access. We next investigated the effect of various compiler flags on performance. Using fg0 flags: -Ounroll2, pipeline3,scalar3,vector3,aggress-lmfastv further increased the performance of the "collision" kernel to 86 MFlops and "stream" to 3.4 MB/sec. The overall effect can be seen in column 2 of Table 2. We then looked at optimizing the use of MPI. By using user-defined datatypes, defined through MPI_TYPE_VECTOR, data that are to be passed between PE's which are separated by a constant stride can be sent in one MPI call. This eliminates the use of MPI_SEND and MPI_RECV calls within do-loops and reduces the total wait time/PE from 42 secs. (33% of total time) to just 5 secs. (5% of total time). The overall effects of this optimization are seen in column 3 of Table 2. It should be noted that the computational kernels, "collision" and "integrate", and the data propagation routine "stream" access Npi in an incompatible manner. We investigated whether
428
further speed-up could be achieved by interchanging array indices so as to obtain optimal unit stride in "collision" and "integrate" - such a change resulted in non stride-one access in "stream". To try and mitigate this access pattern, we made use of the "cache_bypass" feature of the Cray T3E, where data moves from memory via E-registers to the CPU, rather than through cache. Such an approach is more efficient for non stride-one access. This flipped index approach resulted in a speed-up in "collision" from 86 to 96 MFlops, and in "integrate" from 52 to 70 MFlops, while "stream" degraded in performance from 3.4 to 0.9 MB/sec. The total effect was slower than our previous best, so this strategy was abandoned.
Table 2 Performance of Various Routines under Optimization. The total execution time was reduced by a factor of 4. 42, while the "collision" routine was optimized by nearly a factor of 6. M P I optimization increased the performance of "stream" by a factor of 5.8, while the time spent in the SEND, R E C V calls was reduced by a factor of over 10. Single PE optim. + Single PE Optim. Initial Code MPI optim. 1320 sec 2050 sec. 5830 sec COLLISION 4670 sec 80.1% 893 sec 43.6% 788 sec 59.7% INTEGRATE 359 sec 6.2% 360 sec 17.6% 369 sec 28.0% STREAM 791 sec 13.6% 780 sec 38.1% 134 sec 10.2% MPI SEND 396 sec 6.8% 400 sec 19.2% MPI RECV 294 sec 5.0% 278 sec 13.6% MPI SENDREC 63 sec 5.0% In Table 3, we present the nearly optimal scaling of the MPI code with the number of PE's, from 16 to 512 processors. In this comparison, we increased the grid dimensionality in proportion to the increase in the number of PE's so that each PE (in the domain decomposition) worked on the same dimensional arrays. Table 3 Scaling with PE's on the T3E # PE's GRID 4 x4 (16) 1024 x 1024
CPU/PE (sec) 999.3
8 x4
(32)
2048 x 1024
1003.6
8x8
(64)
2048 x 2048
1002.8
16 x 16
(256)
4096 x 4096
1002.2
32 x 16
(512)
8192 x 4096
1007.2
Some benchmarked runs were also performed on various platforms, and the results are presented in Table 4. In these runs, the dimensionality was fixed at 512 x 512 while the
429
1 y 32~
~ 48 ~ i i
/~ 32x
17 48~~ii
17
32
x
x
y 8
48~.,~ / i
I7
17
~
~
1
17
Figure 1. Evolution of the Composite Mass-Density Vorticity for a 2-fluid system with perpendicular double vortex. layers. The frames are separated by 4000 TLBE time steps (with an eddy turn-over time corresponding to 900 TLBE time steps). The approach to the time asymptotic 2D space-filling state of one large vortex and one large counterrotating vortex is clearly seen by 20 eddy turnover times (last frame)
430 number of PE's was increased. The scalings are presented in brackets, showing very similar scalings between the T3E and the SP604e PowerPC.
Table 4 Benchmarked Timings (512 x 512 grid) MACHINE CRAY T3E SP POWER2 SUPER SP 604e POWERPC
16 PE's 1360 1358 1487
32 PE's 715 834 800
[ 0.526 ] [ 0.614 ] [ 0.538 ]
64 PE's 376 463 424
[ 0.276 ] [ 0.341 ] [ 0.285 ]
REFERENCES 1. J. A. Wesson, Tokamaks, Oxford Science Publ., 2nd Ed., 1997. 2. D. A. Knoll, P. R. McHugh, S. I. Krasheninnikov and D. J. Sigmar, Phys. Plasmas 3 (998) 422. 3. F.J. Alexander, S. Chen and J. D. Sterling, Phys. Rev. E47 (1993) 2249. 4. G. R. McNamara, A. L. Garcia and B. J. Alder, J. Stat. Phys. 81 (1995) 395. 5. P. Pavlo, G. Vahala, L. Vahala and M. Soe, J. Computat. Phys. 139 (1998) 79. 6. P. Pavlo, G. Vahala and L. Vahala, Phys. Rev. Lett. 80 (1998) 3960. 7. M. Soe, G. Vahala, P. Pavlo, L. Vahala and H. Chen, Phy. Rev. E57 (1998) 4227. 8. G. Vahala, P. Pavlo, L. Vahala and N. S. Martys, Intl. J. Modern Phys. C9 (1998) 1274. 9. D. Wah, G. Vahala, P. Pavlo, L. Vahala and J. Carter, Czech J. Phys. 48($2) (1998) 369. 10. Y. Chert, H. Obashi and H. Akiyama, Phys. Rev.ES0 (1994) 2776. 11. T. F. Morse, Phys. Fluids 6 (1964) 2012. 12. J. M. Greene, Phys. Fluids 16 (1973) 2022. 13. D. Wah, Ph. D Thesis, William & Mary (June, 1999)
Parallel ComputationalFluidDynamics Towards Teraflops, Optimizationand NovelFormulations D. Keyes,A. Ecer, J. Periaux,N. Satofukaand P. Fox (Editors) 92000 Elsevier Science B.V. All rightsreserved.
431
Parallel computation of two-phase flows using the immiscible lattice gas Tadashi Watanabe and Ken-ichi Ebihara Research and Development Group for Numerical Experiments, Center for Promotion of Computational Science and Engineering, Japan Atomic Energy Research Institute, Tokai-mura, Naka-gun, Ibaraki-ken, 319-11, Japan
The two-dimensional two-phase flow simulation code based on the immiscible lattice gas, which is one of the discrete methods using particles to simulate two-phase flows, is developed and parallelized using the MPI library. Parallel computations are performed on a workstation cluster to study the rising bubble in a static fluid, the phase separation in a Couette flow, and the mixing of two phases in a cavity flow. The interfacial area concentration is evaluated numerically, and its dependencies on the average density, wall speed, and region size are discussed. 1. I N T R O D U C T I O N Two-phase flow phenomena are complicated and difficult to simulate numerically since two-phase flows have interfaces of phases. Numerical techniques for simulating two-phase flows with interfaces have recently progressed significantly. Two types of numerical methods have been developed and applied: a continuous fluid approach, in which partial differential equations describing fluid motion are solved, and a discrete particle approach, in which motions of fluid particles or molecules are calculated. It is necessary in a continuous fluid approach to model the coalescence and disruption of the interface. On the other hand, in the discrete particle approach, the interface is automatically calculated as the boundary of the particle region. Among particle simulations methods, the lattice gas automata (LGA) is one of the simple techniques for simulation of phase separation and free-surface phenomena, as well as macroscopic flow fields. The LGA introduced by Fi'isch, Hasslacher and Pomeau (FHP) asymptotically simulates the incompressible Navier-Stokes equations[i]. In the FHP model, space and time are discrete, and identical particles of equal mass populate a triangular lattice. The particles travel to neighboring sites at each time step, and obey simple collision rules that conserve mass and momentum. Macroscopic flow fields are obtained by coarse-grain averaging in space and time. Since the algorithm and programming are simple and complex boundary geometries are easy to represent, the LGA has been applied to numerical simulations of hydrodynamic flows [2]. In the LGA, only seven bits of information are needed in order to specify the state of a site, and only the information of neighboring sites is used for
432 updating the state of a site. The LGA is thus efficient in memory usage and appropriate for parallel computations [3]. The immiscible lattice gas (ILG) is a two-species variant of the FHP model [4]. Red and blue particles are introduced, and the collision rules are changed to encourage phase segregation while conserving momentum and the number of red and blue particles. In this study, the two-dimensional two-phase flow simulation code based on the ILG is developed and parallelized using the MPI library. Parallel computations are performed on a workstation cluster to simulate typical two-phase flow phenomena. The interracial area concentration, which is one of the important paremeters in two-phase flow analyses, is evaluated numerically, and its dependency on the density, wall speed, and region size are discussed. 2. T H E I M M I S C I B L E
LATTICE GAS MODEL
In the ILG, a color field is introduced to the LGA [4]. Red and blue particles are introduced, and a red field is defined as the set of seven Boolean variables"
r(x) -- {ri(x) e {0, 1}, i -- 0, 1, ..., 6},
(1)
where ri(x) indicates the presence or absence of a red particle with velocity ci at lattice site x: Co - 0 and Cl through c6 are unit vectors connecting neighboring sites on the triangular lattice. The blue field is defined in a similar manner. Red and blue particles may simultaneously occupy the same site but not with the same velocity. The phase segregation is generated by allowing particles from nearest neighbors of site x to influence the output of a collision at x. Specifically, a local color flux q[r(x), b(x)] is defined as the difference between the net red momentum and the net blue momentum at site x" q[r(x), b(x)] - ~ ci[ri(x) - bi(x)].
(2)
i
The local color field f ( z ) is defined to be the direction-weighted sum of the differences between the number of red particles and the number of blue particles at neighboring sites: f(x)
+
-
i
- bj(
+ c )l.
(3)
j
The work performed by the color flux against the color field is W(r, b) - - f - q(r, b).
(4)
The result of a collision at site z, r ~ r t and b ~ bt, is then with equal probability any of the outcomes which minimize the work,
W(r', b') - m i n W ( r , b),
(5)
subject to the constraints of colored mass conservation,
i
- Z
Z
i
i
-
(6) i
433
and colorblind momentum conservation,
c,(,-.; + b;) - Z c;(fi + b',). i
(7)
i
3. P A R A L L E L C O M P U T A T I O N
ON A WORKSTATION
CLUSTER
The ILG code is parallelized, and parallel computations are performed on a workstation cluster. The domain decomposition method is applied using the MPI library, and the two-dimensional simulation region is divided into four small domains in the vertical direction. The workstation cluster consists of four COMPAQ au600 (Alpha 21164A, 600MHz) workstations connected through 100BaseTx (100 Mb/s) network and switching hub. As an example of parallel computations, the rising bubble in a static fluid is shown in Figs. 1 and 2. Figure 1 is obtained by a single-processor calculation, while Fig. 2 is by the parallel calculation using four workstations. The simulation region of the sample problem is 256 x 513 lattices, and 32 x 64 sampling cells are used to obtain flow variables. The number of particles is about 0.67 million. A circular red bubble with a radius of 51 lattice units is placed in a blue field in the initial condition. Periodic conditions are applied at four boundaries of the simulation region. The external force field is applied to the particles by changing the direction of the particle motion. The average amount of momentum added to the particles is 1/800 in the upward direction for red and 1/10000 for blue in the downward direction. A flow field is obtained by averaging 100 time steps in each sampling cell. The 20th flow field averaged from 1901 to 2000 time steps is shown in Figs. 1 and 2.
256 X
cyclic
ratio
:~!,?~:i~!~!i)i~i?ii!iii:~i!Ui84i ! 84184184 il, ~ ii~ii!iiillil ~:ii i~,~!ii i~~!i i i l,!ili i i!i !i ;i i!;~ii~!i !i i i!!ii~j Zii~!iiiiiii i~i ~i i~i!! i;i:~? 84
5t3
ioo
dec
sam pting Cells
c~i!ie: Figure 1: Rising bubble obtained by single-processor calculation.
.....
i
.....
......
Figure 2: Rising bubble obtained by fourprocessor calculation.
434 The shape of the rising bubble is shown in Figs. 1 and 2 along with the stream lines in the flow field. It is shown in these figures that the rising bubble is deformed slightly due to the vortex around the bubble. The pressure in the blue field near the top of the bubble becomes higher due to the flow stagnation. The downward flow is established near the side of the bubble and the pressure in the blue fluid becomes lower than that in the bubble. The bubble is thus deformed. The downstream side of the bubble is then concaved due to the upward flow in the wake. The deformation of the bubble and the vortex in the surrounding flow field are simulated well in both calculations. In the lattice-gas simulation, a random number is used in collision process of particles in case of two or more possible outcomes. The random number is generated in each domain in the four-processor calculations, and the collision process is slightly different between single-processor and four-processor calculations. The flow fields shown in Figs. 1 and 2 are thus slightly affected. The speedup of parallel computations is shown in Fig. 3. The speedup is simply defined as the ratio of the calculation time: T1/Tn, where T1 and T~ are the calculation time using single processor and n processors, respectively. In Fig. 3, the speedup on the parallel server which consists of four U14.0 traSparc (300MHz) processors cono--o COMPAQ;100Mb/s / ~ nected by a 200MB/s high-speed network is also shown. The calculation speed of the COMPAQ 3.0 600au workstation is about two "O times faster than that of the U1traSparc processor. The network c~ (/) 2.0 speed is, however, 16 times faster for the UltraSparc server. Although the network speed is much slower, the speedup of parallel computation is 1.0 better for the workstation cluster. 2 3 4 The difference in parallel processNumber of processors ing between the workstation cluster and the parallel server may affect the Figure 3: Speedup of parallel computations. speedup. I
4. I N T E R F A C I A L
I
AREA CONCENTRATION
The interracial area concentration has been extensively measured and several empirical correlations and models for numerical analyses were proposed [5]. Various flow parameters and fluid properties are involved in the correlations according to the experimental conditions. The exponents of parameters are different even though the same parameters are used, since the interracial phenomena are complicated and an accurate measurement is difficult. The interfacial area concentration was recently shown to be measured numerically by using the ILG and correlated with the characteristic variables in the flow field [6]. The interfacial area was, however, overestimated because of the definition of the
435
interfacial region on the triangular lattice. The calculated results in the small simulation region (128 x 128) and the average density from 0.46 to 0.60 were discussed in Ref [6]. In this study, the effects of the characteristic velocity and the average density on the interracial area concentration are discussed by performing the parallel computation. The definition of the interfacial region is slightly modified. The effect of the region size is also discussed. In order to evaluate the interfacial area concentration, red and blue regions are defined as the lattice sites occupied by red and blue particles, respectively. The lattice sites with some red and blue particles are assumed to be the interfacial lattice sites. In contrast to Ref. [6], the edge of the colored region is not included in the interfacial region. The interfacial area concentration is thus obtained as the number of interfacial lattice sites divided by the number of all the lattice sites. The interfacial area concentration is measured for two cases: the phase separation in a Couette flow and the mixing of two phases in a cavity flow. The simulation region is 128 x 128 or 256 x 256, and the average density is varied from 0.46 to 0.73. The no-slip and the sliding wall conditions are applied at the bottom and top boundaries, respectively. At the side boundaries, the periodic boundary condition is applied in the Couette flow problem, while the no-slip condition is used in the cavity flow problem. The initial configuration is a random mixture of red and blue particles, with equal probabilities for each, for the Couette flow and a stratified flow of red and blue phases with equal height for the cavity flow. The interfacial area concentration, Ai, in the steady state of the Couette flow is shown in Fig. 4 as a function of the wall speed, Uw. 1.25 i c~ 9d=O.46(128x128) d=O.46(256x256) ' The error bar indicates the standard = d=O.53(128x128) deviation of the fluctuation in the --r1.20 ': - ~ d=O.53(256x256) steady state. The phase separation --A d=O.60(128x128) -7J, --V in the steady state obtained by the 1.15 1 --A d=0.60(256x256) - . ~ _ ~ ' ~ ~'r/ ~_ _ parallel computation in this study .% are almost the same as shown in Ref < 1.10 [6]. The interface is fluctuating due to the diffusivity of one color into the 1.05 other, and the interfacial region has a thickness even tbr Uw = 0. In order to see the effect of the wall speed 0.95 clearly, the interfacial area concen0.0 0.1 0.2 0.3 tration for the case with Uw = 0 is Wall used for normalization in Fig. 4. It is shown that Ai*(Ai/Ai(u~=o)) Figure 4: Effect of the wall speed on Ai ~ in the increases with an increase in the wall steady state Couette flow. speed. This is because the detbrmation of phasic region is large due to the increase in phasic momentum. = =
speed
436
It was, however, shown for the Couette flow that Ai* did not increase largely with an increase in the wall speed [6]. The interfacial region was estimated to be large in Ref. [6] due to the definition of the interface and the phasic region. The effect of the wall speed was thus not shown clearly in Ref. [6]. The interracial areal concentration was shown to be correlated with the wall speed and increased with U w 1/2 [6]. The same tendency is indicated in Fig. 4. The increase in Ai* is seen to be small for d = 0.46 in comparison with other cases. The surface tension is relatively small for low densities and the interfacial region for U w = 0 is large in this case. The effect of the wall speed is thus small for d = 0.46. The effect of the wall speed is large for higher densities in case of the small region size (128 x 128). The increase in A i ~ is larger for d=0.60 than for d=0.53. For the large region size (256 x 256), however, the increase in A i ~ is slightly larger for d = 0.53 than for d = 0.60. This is due to the difference in the momentum added by the wall motion. The interracial area is determined by the surface tension and the momentum added by the wall motion. The average momentum given to the particle is small for the case with the large region size when the wall speed is the same, and the increase in Ai* is small for the region size of 256 x 256 as shown in Fig. 4. The effect of the wall speed is thus slightly different for the different region size. The interfacial area concentration in the steady state of the cavity flow is shown in Fig. 5. The steady state flow fields obtained by the parallel computation are almost the same as shown in Ref [6]. Although the effect of the wall speed for higher densities in case of the small region size is different from that in Fig. 4, the dependency of d=O.46(128x128) t 1.25 =o---o 9d=O.46(256x256) Ai* on the wall speed is almost the ' =- - -- d=O.53(128x128) same as for the Couette flow. 1.20 ~- - ~ d=O.53(256x256) It is shown in Figs. 4 and 5 that the increase in Ai* is not so different 1.15 quantitatively between the Couette flow and the cavity flow. The dif< 1.10 ference in the flow- condition is the 1.05 boundary condition at the side wall: periodic in the Couette flow and no1 00~ slip in the cavity flow. The effect of the side wall is, thus, found to be 0.95 . . . . . small in these figures. 0.0 0.1 0.2 0.3 The effect of the wall speed was Wall discussed for the densities from 0.46 to 0.60 in Figs. 4 and 5, since this Figure 5: Effect of the wall speed on Ai* in the density range was studied in Ref. [6]. steady state cavity flow. The dependency of Ai* on the wall speed in the cavity flow is shown in Fig. 6 for higher density conditions.
I
--
.~
====
speed
437
1.25
0.42
e, e, d=O.65(128x128) o ~ o d=O.65(256x256) u - - i d=O.73(128x128) ~--- ~ d=O.73(256x256)
1.20 1.15
0.38
1.10
9Uw=O.O5(128x128) o ~ - o Uw=O.O5(256x256) t-~ Uw=O.15(128x128) c- -4::] 1Uw=O. 5 ( 2 5 6 x 2 . _ 56)
~ ,
0.34
\
1.05 0.30
1.00-~
=,==
0.95 0.00
0.10
0.20
Wall speed
0.30
Figure 6: Effect of the wall speed on A i ~ in the steady state cavity flow for higher densities.
0.26
0.4
.
. . 0.5
.
. 0.6
0.7
0.8
Density Figure 7: Dependency of interfacial area concentration on the density.
It is shown that the increase in A i ~ becomes negative as the wall speed increases. In the steady state cavity flow, large regions and small fragments of two phases are seen in the flow field [6]. The decrease in A i ~ indicates that the large phasic region increases and the small fragments of one phase decreases. The mean free path is becomes shorter and the collisions of particles increases for higher density conditions. The coalescence of phasic fragments may occur frequently as the wall speed increases. The dependency of A i on the average density, d, in the cavity flow is shown in Fig. 7. The interfacial area concentration is not normalized in this figure. It is seen, when the wall speed is small ( U w = 0.05), that A i becomes minimum at around d = 0.6. This density almost corresponds to the maximum surface tension [7]. The interracial area is thus found to be smaller as the surface tension becomes larger, since the deformation of phasic region is smaller. The value of the density which gives the minimum A i is slightly shifted to higher values as the wall speed increases as shown in Fig. 7. The effect of the coalescence of small fragments may be dominant for higher density conditions as shown in Fig. 6. The interracial area for Uw = 0.15, thus, becomes small as the density increases. 5. S U M M A R Y In this study, the ILG has been applied to simulate the rising bubble in a static fluid, the phase separation in a Couette flow and the mixing of two phases in a cavity flow. The ILG code was parallelized using the MPI library and the parallel computation were performed on the workstation cluster. The interface was defined as the intert'acial lattice sites between two phases and the interfacial area concentration was evaluated numerically.
438 It was shown in the steady state that the interfacial area concentration increased with the increase in the wall speed for relatively lower density conditions (d = 0 . 4 6 - 0.60). In the higher density conditions, however, the increase in the interracial area concentration was negative as the wall speed increased. It was shown that the coalescence of phasic fragments might be important for higher density conditions. Phase separation and mixing with a change of interface are complicated and difficult to simulate by conventional numerical methods. It was shown in Ref. [6] and this study that the interracial area concentration was evaluated systematically using the ILG, which is one of the discrete methods using particles to simulate multi-phase flows. The colored particles used in the ILG have, however, the same properties, and the two-phase system with large density ratio cannot be simulated. Although several models based on the LGA have been proposed to simulate various two-phase flows, the original ILG was used in this study to test the applicability of the LGA. Our results demonstrate that interracial phenomena in multi-phase flows can be studied numerically using the particle simulation methods. REFERENCES
[1] U. Frisch, B. Hasslacher, and Y. Pomeau, Phys. Rev. Lett. 56, 1505(1986). [2] D. H. Rothman and S. Zaleski, Ref. Mod. Phys. 66, 1417(1994). [3] T. Shimomura, G. D. Doolen, B. Hasslacher, and C. Fu, In: Doolen, G. D. (Ed.). Lattice gas methods for partial differential equations. Addison-Wesley, California, 3(1990). [4] D. H. Rothman and J. M. Keller, J. Stat. Phys. 52, 1119(1988). [5] G. Kocamustafaogullari, W. D. Huang, and J. Razi, Nucl. Eng. Des. 148, 437(1994). [6] T. Watanabe and K. Ebihara, Nucl. Eng. Des. 188, 111(1999). [7] C. Adler, D. d'Humieres and D. Rothman, J. Phys. I France 4, 29(1994).
Parallel ComputationalFluid Dynamics Towards Teraflops, Optimizationand Novel Formulations D. Keyes,A. Ecer, J. Periaux, N. Satofukaand P. Fox (Editors) 2000 Elsevier Science B.V. All rights reserved.
439
Parallel performance modeling of an implicit advection-diffusion solver P. Wilders ~* ~Departement Information Technology and Systems, Delft University, P.O. Box 5031, 2600 GA Delft, The Netherlands,p.wilders~its.tudelft.nl.
A parallel performance model for a 2D implicit multidomain unstructured finite volume solver is presented. The solver is based upon the advection-diffusion equation and has been developed for modeling tracer transport. The parallel performance model is evaluated quantitatively. Moreover, scalability in the isoefficiency metric is discussed briefly. 1. I N T R O D U C T I O N We focus on an implicit advection-diffusion solver with domain decomposition as the basic tool for parallelism. Experiments on the SP2 show acceptable efficiencies up to 25 processors [1]. The analysis of communication patterns and sequential overhead did not explain the observed efficiencies completely [2]. Therefore, we decided to develop a full parallel performance model and to evaluate this model in some detail. Generally speaking, our interests are towards large scale modeling. For this reason the emphasis is on linearly scaled problems and, more generally, on scalability in the isoefficiency metric. Related studies on parallel performance analysis of Krylov-Schwarz domain decomposition methods have been undertaken in [3], [4], [5] and [6] for fixed size problems. Load balancing effects are absent or assumed to be absent and the authors introduce analytical expressions relating the iterative properties to the dimensions of the problem. The consequence is that the emphasis in these studies is on qualitative issues. Instead, our study includes quantitative aspects and some extra care is needed at these points. The present advection-diffusion solver has been developed for modeling the transport of passive scalars (tracers) in surface and subsurface environmental engineering [7], [8]. In this paper we present parallel results for a basic problem in this field, i.e. an injection/production problem in a porous medium (quarter of five spots). 2. S O M E B A S I C I S S U E S
We consider a scalar conservation law of the form
Oc p~--~+V.[vf(c)-DVc]-0
,
xEftE~2
,
t>0
.
(1)
*This work has been supported by the Dutch Ministry of Economic Affairs as part of the HPCN-TASC project.
440 In this paper, only tracer flows are considered. The coefficients 9~, v and D are assumed to be time-independent and f(c) = c. Our interests are on the advection-dominated case, i.e. the diffusion tensor D depends on small parameters. For the spatial discretization we employ cell-centered triangular finite volumes, based upon a 10-point molecule [9]. This results in the semi-discrete system dc
L?-/-
,
(2)
for the centroid values of the concentration. L is a diagonal matrix, containing the cell values of the coefficient q~ multiplied with the area of the cell. Moreover, F is a nonlinear differentiable function of c. We set
J-
OF Oc
"
(3)
The Jacobian J represents a sparse matrix with a maximum of 10 nonzero elements in each row. The linearly implicit trapezoidal rule is used for the time integration:
( ~-~L 2Jn)cn+l __ ( ~--~L ~1 j n ) c " + F,~ .
(4)
Here, Tn denotes the time step. The scheme is second-order accurate in time. The linear system (4) is solved iteratively by means of a one-level Krylov-Schwarz domain decomposition m e t h o d - in our case a straightforward nonoverlapping additive Schwarz preconditioner with GMRES as the Krylov subspace method. ILU-preconditioned BiCGSTAB is used for the approximate inversion of the subdomain problems. Experimental studies have been conducted for tracer flows in reservoirs. The velocity v is obtained from the pressure equation, using strongly heterogeneous permeability data provided by Agip S.p.A (Italian oil company). In this paper we consider the quarter of five spots, a well-known test problem for porous medium applications. The total number of unknows is N. As soon as N varies, either in the analysis or in experiments, it is important to take application parameters into account [10]. From the theory of hyperbolic difference schemes it is known that the Courant number is the vital similarity parameter. Therefore, we fix the Courant number. This means that both the spatial grid size h (N = O(1/h2)) and the time step ~- vary with v/h constant. It is possible to reduce the final system to be solved in the domain decomposition iteration to a small system formulated in terms of interface variables [11]. This makes it possible to include the GMRES routine in the sequential task, which is executed by the master. As a consequence, all communication is directed to and coming from the master (see Figure 1). 3. T I M I N G
AND
PERFORMANCE
MEASURES
Our concern is the computional heart of the code, the time stepping loop. For ease of presentation we consider a single time step. Let p denote the number of processes (each process is executed on a separate processor). There are three phases in the program: computation as a part of the sequential task 0, computation as a part of the distributed tasks
441 gather: interface v a r i a b l e ~ , h ~ scatter: halo variables gat / ~ ~ @ - ~ / sc:tter
~
S broadcast
process 1
Figure 1. Communication patterns.
1, .., p and communication. Inversion of subdomain problems is part of the distributed tasks. For the sake of analysis we consider the fully synchronized case only. The different processes are synchronized explicitly each time the program switches between two of the three phases. The elapsed time in a synchronized p processor run is denoted with Tp. It follows that (5)
with T (s) the computational time spent in the sequential task 0, Tp(d) the maximal time spent in the distributed tasks and T(~c) the communication time. Of course, T[ c) - O. Let us define T , ( p ) - r(s) + pTJ d)
(6)
.
F o r p = 1, it holds that TI(p) = T1. For p > 1, rl(p) is the elapsed time of a single processor shadow run, carrying out all tasks in serial, while forcing the distributed tasks to consume the same amount of time. It is clear that TI(p) - T1 presents a measure of idleness in the parallel run, due to load balancing effects. It easily follows that
T,(p)
-
T1
pT~") ~)
-
-
(7)
.
The (relative) efficiency Ep, the (relative) costs Cp and the (relative) overhead @ are defined by
T1
EP = pTp
,
1 pTp Cp - Ep = T1
,
Op - Cp - 1
(8)
.
Note that in (8) we compare the parallel execution of the algorithm (solving a problem with Q subdomains) with the serial execution of the same algorithm (relative). We introduce the following factorization of the costs Cp:
Cp-C(')C(S)
,
Cp(')- 1 + O q)
,
C(p)- 1 + O (p)
,
(9)
442 where o(l) = T1 (p) - T1
TI
O(p) _ pTp - T1 (p)
'
(10)
TI(p)
It follows that Op - O q) + O (p) + O q)O (p)
(11)
.
The term 0(/) is associated with load balancing effects. Of) is referred to as the parallel overhead. O(pp) combines sequential overhead and communication costs. Let us introduce 0p - 0 (/) + 0 (p) + 0 ( 0 0 (v)
,
(12)
with
O(pl) _ pT(a;-)T[a)
O(v) _ p- 1 T (~) ,
-
P
T(a)
T(c) }
T(a)
(13)
.
Using (5), (6) and (10)it can be seen that the relative differences between Oq), O(p) and, respectively, 0q), 0(v) can be expressed in terms of the fraction of work done in the sequential task; this fraction is small (< 1% in our experiments). Therefore, we expect Op, 0(') and ()(P)to present good approximations of, respectively, Op, 0 q) and 0 (v). 4. T I M I N G
AND
PERFORMANCE
MODEL
FOR A SQUARE
DOMAIN
A fully balanced blockwise decomposition of the square domain into Q subdomains is employed; the total number of unknowns is N, N/Q per subdomain. The total number of edges on internal boundaries is B, on the average B/Q per subdomain. B is given by B - v/Sv/~(V/~ - 1 )
(14)
.
Both the total number of interface variables as well as the total number of halo variables depend on linearly on B. We assume that either p - 1, i.e. a serial execution of the Q subdomains, or p - Q, i.e. a parallel execution with one subdomain per processor. The proposed timing model reads" N B N PT(~)Q - C~1 ~ ~ -+M(a2 Q +a3 Ip)
,
(15)
with M
1 m~ 'p__
Q ~1 E I(m,q)
=1 q=l M 1 m~ m a x / ( m =1 q=l .... ,Q
, p-1 q)
,
p-
(16) Q
443
T (~) - / 3 M 2 B
,
(17)
B T(p~) - M ( T l f ( p ) + 7 2 ( p - 1 ) ~ )
,
f(p) - m i n ( [ x / ~ , F21ogp])
.
(18)
Here, m = 1,..., M counts the number of matrix-vector multiplications in the domain decomposition iteration (outer iteration) and I ( m , q) denotes the number of matvec calls associated with the inner iteration in subdomain q for the m-th outer matvec call. The first term on the rhs of (15) corresponds with building processes (subdomain matrix, etc.). The third term on the rhs of (15) reflects the inner iteration and the second term comes from correcting the subdomain rhs. Relation (17) is mainly due to the GramSchmidt orthogonalization as a part of the GMRES outer iteration. Finally, (18) models the communication pattern depicted in Figure 1 for a SP2. The first term on the rhs models the broadcast operations and the second term the gather/scatter. Here, [ stands for taking the smallest integer above. The term [2log Pl is associated with a 'treelike' implementation of communication patterns and the term [x/~ with a 'mesh-like' implementation [12]. It is now straightforward to compute analytical approximations 0p, 0p(1), 0p(;) of the overhead functions using (12), (13) and (15), (17), (18). The explanatory variables in the resulting performance model are N, Q, P, M and Ip. These variables are not completely independent; in theory it is possible to express the latter two variables as a function of the first variables. Such expressions are either difficult to obtain or, if available, too crude for quantitative purposes. Therefore, we will rely upon measured values of M and Ip. Note that these values are already available after a serial run. 5.
EXPERIMENTS
Experiments have been carried out on a SP2 (160 Mhz) with the communication switch in user mode. The parallel runs are characterized by N = NoQ, Q = 4, 9, 16, p = Q with No = 3200 (linearly scaled) and N = 12800, 28800, 51200, Q = 4, p = Q (number of processors fixed); in total, five parallel runs. Of course, every parallel run is accompanied by a serial run (see (8)). Physical details on the experiments can be found in [1]. First, it is necessary to obtain approximations of the unknown coefficients in (15), (17), (18). The number of available measurements are ten for (15), five for (17) and five for (18) and a least-squares estimation leads to 0~ 1
3 -
1 6 , 1 0 .4
-.
.68
9 10 -7
71-.62"10-3
,
ct2 - . 81 9 10 -5
,
,
c~3-.86.10
-6
,
(19)
(20)
,
72-.44"10
-6
9
(21)
Let O denote one of the original overhead functions (see (8), (10)) and d its analytical approximation obtained in section 4. In order to evaluate the performance model we introduce Dev-
1
_ x/" r/ j = l
Oj - Oj
Oj
9 lOO% ,
(22)
444
Table 1 Dev, deviations between overhead functions and their analytical approximations.
Dev:
8.7
1
11.5
2.0
,
,
--'Op
,
0.5 --:0
. *
....
90 (1)p
_..
0 (p)
P
0.4
0.8
(I)
. - - - - -
.... : 0 P
t~
~0.3
oc-0 . 6 O > O
4(
/
-
(P) P
~0.2
..
/
-.'0
0 > 0
.•
/
~0.4
0.2
,, * "
• ..""
...........
o
0.1 ----._._.
, -~o ..........................
5
Figure 2. scaled.
10 P
15
Overhead functions, linearly
N
-o
x 10 4
Figure 3. Overhead functions, fixed p.
with n denoting the number of measurements (n=5). Table 1 presents the values of Dev for the different overhead functions and it can be seen that the analytical predictions are quite accurate. ~'inally, Figure 2 and Figure 3 present the (total) overhead Op and its components Oq), O(pp). The analytical approximations have not been plotted because, within plotting accuracy, they coincide more or less with the true values. Figure 2 presents the overhead functions as a function of p for the linearly scaled experiments. Figure 3 plots the overhead functions as a function of N for the experiments with a fixed p (p = 4). Rather unexpectedly, it can be seen that the load imbalance leads to a significant amount of overhead. A close inspection of the formulas presented in section 4 shows that 0(/) depends linearly upon ( I p - I1), which means that the load imbalance is due to variations in the number of BICGSTAB iterations over the subdomains. 6. I S O E F F I C I E N C Y In the isoefficiency metric, a parallel algorithm is scalable if it is possible to find curves in the (p, N) plane on which the efficiency Ep and the overhead Op are constant [13]. A sufficient condition for the existence of isoefficiency curves is lira @ < o o N--+ co
p fixed
.
(23)
445
Table 2 Iterative properties for 4 subdomains (Q = 4). N M Ii,p=l Ip,p=4 3200 12800 51200
11.5 11.5 11.7
5.0 5.0 5.0
5.8 5.8 5.7
This condition is equivalent with (see (11)) lim O q ) < o o
N--+ c~ p fixed
,
lira O (p) < ~
N-~c~ p fixed
.
(24)
In order to be able to discuss these limits, some knowledge on M and Ip is needed. Table 2 presents some measured values for the experiments leading to Figure 3. It seems to be reasonable to assume constant (or slowly varying) iterative properties in the limiting process. This enables some preliminary conclusions on the basis of the analytical performance model for a square domain. From (13), (14), (15), (17) and (18)it follows that lim 0~ p ) - 0
N--~ oo p fixed
.
(25)
The parallel overhead approaches zero with a speed determined by the ratio B / N O(~/r From (13), (14), (15) and (16)it follows that O(t) ~
M ( Ip - 1 1 ) (OZl/O~3)-71- i f
X -+ oo I
,
p fixed
.
=
(26)
'
This means that the load balancing effect approaches a constant. Table 2 and (19) show that this constant is close to 0.1 for p = O = 4. From the analysis it follows that (23) is satisfied. In fact, the total overhead approaches a constant value in the limit. Isoefficiency curves with an efficiency lower than 0.9 exist and efficiencies close to 1.0 cannot be reached.
7. C O N C L U D I N G
REMARKS
We have presented a parallel performance model of an implicit advection-diffusion solver based upon domain decomposition. The model was evaluated quantitatively and a good agreement between the model and the results of actual runs was established. As such the model presents a good starting point for a more qualitative study. As a first step we have used the model to investigate the scalability in the isoefficiency metric. Although the solver is scalable, it is clear from the experimentally determined efficiencies that the solver is only attractive for machines with sufficient core-memory up to approximately 25-50 processors. To keep the overhead reasonable, it is necessary to have
446 subdomains of a moderate/large size. The size No = 3200 taken in the linearly scaled experiments turns out to be somewhat small. Beyond 50 processors the load balancing effects will become too severe. Because the latter are due to variable iterative properties over the subdomains, it is not straightforward to find improvements based upon automatic procedures. REFERENCES
1. C. Vittoli, P. Wilders, M. Manzini, and G. Potia. Distributed parallel computation of 2D miscible transport with multi-domain implicit time integration. Or. Simulation Practice and Theory, 6:71-88, 1998. 2. P. Wilders. Parallel performance of domain decomposition based transport. In D.R. Emerson, A. Ecer, J. Periaux, T. Satofuka, and P. Fox, editors, Parallel Computational Fluid Dynamics '97, pages 447-454, Amsterdam, 1998. Elsevier. 3. W.D. Gropp and D.E. Keyes. Domain decompositions on parallel computers. Impact of Computing in Science and Eng., 1:421-439, 1989. 4. E. Brakkee, A. Segal, and C.G.M. Kassels. A parallel domain decomposition algorithm for the incompressible Navier-Stokes equations. Simulation Practice and Theory, 3:185-205, 1995. 5. K.H. Hoffmann and J. Zou. Parallel efficiency of domain decomposition methods. Parallel Computing, 19:137/5-1391, 1993. 6. T.F. Chan and J.P. Shao. Parallel complexity of domain decomposition methods and optimal coarse grid size. Parallel Computing, 21:1033-1049, 1995. 7. G. Fotia and A. Quarteroni. Modelling and simulation of fluid flow in complex porous media. In K. Kirchgassner, O. Mahrenholtz, and R. Mennicken, editors, Proc. IUIAM'95. Akademic Verlag, 1996. 8. P. Wilders. Transport in coastal regions with implicit time stepping. In K.P. Holz, W. Bechteler, S.S.Y. Wang, and M. Kawahara, editors, Advances in hydro-sciences and engineering, Volume III, MS, USA, 1998. Univ. Mississippi. Cyber proceedings on CD-ROM. 9. P. Wilders and G. Fotia. Implicit time stepping with unstructured finite volumes for 2D transport. J. Gomp. AppI. Math., 82:433-446, 1997. 10. J.P. Singh, J.L. Hennessy, and A. Gupta. Scaling parallel programs for multiprocessors: methodology and examples. IF,RE Computer, July:42-50, 1993. 11. P. Wilders and E. Brakkee. Schwarz and Schur: an algebraical note on equivalence properties. SIAM J. Sci. Uomput., 20, 1999. to be published. 12. Y. Saad and M.N. Schultz. Data communication in parallel architectures. Parallel Computing, 11:131-150, 1989. 13. A. Gupta and V. Kumar. Performance properties of large scale parallel systems. Or. Par. Distr. Comp., 19:234-244, 1993.
Parallel Computational Fluid Dynamics Towards Teraflops, Optimization and Novel Formulations D. Keyes,A. Ecer, J. Periaux,N. Satofukaand P. Fox (Editors) e 2000 Elsevier Science B.V. All rights reserved.
447
A Parallel 3D Fully Implicit Unsteady Multiblock CFD Code Implemented on a Beowulf Cluster M. A. Woodgate, K. J. Badcock, B.E. Richards ~ ~Department of Aerospace Engineering, University of Glasgow, Glasgow, G12 8QQ, United Kingdom
We report on the development of an efficient parallel multi-block three dimensional Navier-Stokes solver. For robustness the convective terms are discretised using an upwind TVD scheme. The linear system arising from each implicit time step is solved using a preconditioned Krylov subspace method. Results are shown for the ONERA M6 wing, NLR-F5 wing with launcher and missile as well as a detailed look into parallel efficiency.
1. I N T R O D U C T I O N Navier-Stokes solvers for complete aircraft configurations have been used for a number of years. Geometric complexities are tackled through the use of either unstructured or block structured meshes. For multi-disciplinary work such as shape optimisations, where repetitive calculations on the same or similar mesh are required, the efficiency of the flow solver is more important than the cost of generating the mesh. This makes the use of block structured grids viable. The current paper describes the development of an implicit parallel method for solving the three dimensional unsteady Navier-Stokes equations. The work builds on developments in two dimensions. The features of the method are the iterative solution [1] of an unfactored linear system for the flow variables, approximate Jacobian matrices [2], an effective preconditioning strategy for good parallel performance [3], and the use of a Beowulf cluster to illustrate possible performance on commodity machines [4]. The method has been used on a wide range of applications which include unsteady aerofoil flows and multielement aerofoils [5], an unsteady Delta wing common exercise [6] and the unsteady NLR F-5 CFD validation exercise [7]. The following work will focus on how a fast, low memory and communication bandwidth method has been developed for solving unsteady problems. This method will be used to demonstrate that a few commodity machines can be used to obtain unsteady solutions over geometrically complex configurations in a matter of hours.
448
2. G O V E R N I N G
EQUATIONS
The three-dimensional Cartesian Navier-Stokes equations can be written in conservative form as 0W Ot
0(F i + F ~) t
Ox
O(G i + G ~) +
Oy
+
O(H i + H ~) Oz
=0,
(1)
where W = (p, pu, pv, pw, p E ) T denotes the vector of conservative variables. The inviscid flux vectors F i, G i, H i and viscous flux vectors F v, G v, H v are the direct 3D extension to the fluxes given in [5]. The laminar viscosity is evaluated using Sutherland's law whilst the turbulent eddy viscosity is given by the Baldwin-Lomax turbulent model [8]. Finally, the various flow quantities are related to each other by the perfect gas relations. 3. N U M E R I C A L
METHOD
The unsteady Navier-Stokes equations are discretised on a curvilinear multi-block body conforming mesh using a cell-centred finite volume method that converts the partial differential equation into a set of ordinary differential equations, which can be written as d (lfn+lT~y~Tn+lh D /~lrn+l d---t~*i,j,k "i,j,k J + ,~i,j,kt ,, i,j,k) - 0.
(2)
The convective terms are discretised using either Oshers's [9] or Roe's [10] upwind methods. MUSCL variable extrapolation is used to provide second-order accuracy with the Van Albada limiter to prevent spurious oscillations around shock waves. The discretisation of the viscous terms requires the values of flow variables and derivatives at the edge of cells. Cell-face values are approximated by the straight average of the two adjacent cell values, while the derivatives are obtained by applying Green's formula to an auxiliary cell. The choice of auxiliary cell is guided by the need to avoid odd-even decoupling and to minimise the amount of numerical dissipation introduced into the scheme. Following Jameson [11], the time derivative is approximated by a second-order backward difference and equation (2) becomes . a
( ~ i T n + 1 ~ __ i,j,k \ 9 9 i,j,k ]
ovi,j,kql/'n+llTtTn+l'' i,j,k _
_~_ l / - n - 1 I : ~ T n - 1 vi,j,k ' ' i,j,k -Jr- a i
vin n 4--'j'kWi'j'k 2/~ t
j k ~{ w 9n +i l, j , k ) - - O. ' '
(3)
xrn+l To get an expression in terms of known quantities the term D,~* i,j,kt[ ~vv i,j,k) is linearized w.r.t, the pseudo-time variable t*. Hence one implicit time step takes the form
[( v 3v) XF + G-/
0a
( W r e + l - W m)
-
-R*(W
TM)
(4)
where the superscript m denotes a time level m a t * in pseudo-time. Equation (4) is solved in the following primitive form V + 3V
0 W _~ OR
_ pro) _ _ R , ( W m)
(5)
449
4. I M P L E M E N T A T I O N To obtain a parallel method that converges quickly to the required answer two components are required. First is a serial algorithm which is efficient in terms of CPU time and second is the efficient parallelisation of this method. With a limited number of computers no amount of parallel efficiency is going to overcome a poor serial algorithm. A description of the Beowulf cluster used in performing the computations can be found in [4]. 4.1. Serial P e r f o r m a n c e In the present work, the left hand side of equation (5) is approximated with a first order Jacobian. The Osher Jacobian approximation and viscous approximation is as in [2], however, instead of using the Roe approximation of [12] all terms of O(Wi+l,j,k-- Wi,j,k) are neglected from the term 0oR ow. this forms a much more simplified matrix reducing W 0P the computational cost of forming the Roe Jacobian by some 60%. This type of first order approximation reduces the number of terms in the matrix from 825 to 175 which is essential as 3D problems can easily have over a million cells. Memory usage is reduced further by storing most variables as floats instead of doubles; this reduces the memory bandwidth requirements, and results in the present method requiring 1 MByte of memory per 550 cells. The right hand side of equation (5) is not changed in order to maintain second order spatial accuracy. A preconditioned Krylov subspace algorithm is used to solve the linear system of equations using a Block Incomplete Lower-Upper factorisation of zeroth order. Table 1 shows the performance of some of the key operations in the linear solver. The matrix vector and scalar inner product are approximately 25% of peak while the saxpy only obtains about half this. This is because saxpy needs two loads per addition and multiply while the others only need one. The drop off in performance is even more pronounced in the Pentium Pro 200 case. This may arise from the 66MHz bus speed, which is 33% slower than a PII 450.
Table 1 Megaflop performance of key linear solver subroutines Processor Matrix x vector Pre x vect PII 450 115 96 PPro 200 51 44
IIx.xll 108 44
saxpy 58 18
4.2. Parallel P e r f o r m a n c e The use of the approximate Jacobian also reduces the parallel communication since only one row of halo cells is needed by the neighbouring process in the linear solver instead of two. This coupled with only storing the matrix as floats reduces the communication size by 75%.To minimize the parallel communication between processes further, the BILU(0) factorisation is decoupled between blocks; this improves parallel performance at the expense of not forming the best BILU(0) possible. This approach does not seem to have a major impact on the effectiveness in 2D [3], however more testing is required in 3D.
450 Since the optimal answer depends on communication costs, different approaches may be required for different machine architectures.
5. R E S U L T S 5.1. O N E R A M 6 W i n g The first test case considered is flow around the ONERA M6 wing with freestream Mach number of 0.84 at angle of attack of 3.06 ~ and a Reynolds number of 11.7 million [13]. An Euler C-O grid was generated containing 257 x 65 x 97 points and a viscous C-O grid of 129 x 65 x 65 points. Table 2 shows the parallel and computational efficiency of the Euler method. It can be seen that the coarse grid is not sufficient to resolve the flow features but the difference between the medium and fine grids are small and are limited to the shockwaves. Both the medium and fine grids use multi level startups to accelerate the convergence giving nearly O(N) in time dependence on the number of grid points when comparing the same number of processors. The second to last column contains the wall clock time and shows that the 1.6 million cell problem converged 6 orders of magnitude in just under 30 minutes on 16 Pentium Pro machines. Taking the medium size grid time of under 4 minutes, which includes all the file IO, the use of commodity computers is shown to be very viable for unsteady CFD calculations. The last column contains the parallel efficiencies for the code, these remain high until the runtimes are only a few minutes. This is due to the totally sequential nature of the file IO which includes the grid being read in once per processor. The times also show that the new Roe approximate Jacobians produce slightly faster runtimes at the expense of some parallel efficiency. This is because although it is faster to form the Roe Jacobian the linear system takes about 10% longer to solve so increasing the communication costs.
Table 2 Parallel performance and a grid density study for Grid CL Co Procs Total 0.0114 16 257 x 65 x 97 0.2885 0.0125 2 129 x 33 x 49 0.2858 4 8 16 0.0186 1 65 x 17 x 25 0.2726 2 (Osher) 4 8 0.0189 1 65 x 17 x 25 0.2751 2 (Roe) 4 8
the euler ONERA M6 wing CPU time Wall time Efficiency 432 29.1 N/A 46.4 24.0 100 47.0 12.5 96 48.4 6.6 90 50.9 3.7 81 5.15 5.25 100 5.27 2.80 94 5.43 1.52 87 5.79 0.92 72 4.87 4.97 100 5.00 2.63 94 5.19 1.45 86 5.54 0.88 70
451
Figure 1. Fine grid pressure contours for the ONERA M6 Wing
The wing upper surface seen in Figure 1 clearly shows the lambda shock wave captured. The predicted pressure coefficient distributions around the wing agree well with experimential data for both the Euler and Navier-Stokes calculations (see Figure 2 for the last 3 stations). In particular, the lower surface is accurately predicted as well as the suction levels even right up to the tip. However, both calculations fail to capture the small separation region at the 99% chord. 5.2. N L R F-5 W i n g 4- T i p Missile The final test cases are on the NLR F-5 wing with launcher and tip missile [7]. The computational grid is split into 290 blocks and contains only 169448 cells. The steady case is run 320, (angle of attack 0 ~ and a Mach number of 0.897) while the unsteady case is run 352 (mean angle of attack -0.002 ~, the amplitude of the pitching oscillation 0.115 ~, the reduced frequency 0.069 and Mach number 0.897) . This is a challenging problem for a multiblock method due to the geometrical complexity of the problem. This gives rise to a large number of blocks required as well as a large variation in block size (the largest contains 5520 cells while the smallest has only 32 cells). The run time of the code was 40 minutes on 8 processors to reduce the residual by 5 orders. Figure 3 shows the upper surface where the shock wave is evident on the wing from root to tip due to the absence of viscous effects. The unsteady calculation was again performed on 8 processors taking 40 minutes per cycle for 10 real timesteps per cycle and 60 minutes per cycle for 20 real timesteps per
452
Figure 1. Fine grid pressure contours for the ONERA M6 Wing
The wing upper surface seen in Figure 1 clearly shows the lambda shock wave captured. The predicted pressure coefficient distributions around the wing agree well with experimential data for both the Euler and Navier-Stokes calculations (see Figure 2 for the last 3 stations). In particular, the lower surface is accurately predicted as well as the suction levels even right up to the tip. However, both calculations fail to capture the small separation region at the 99% chord.
5.2. N L R F-5 Wing + Tip Missile The final test cases are on the NLR F-5 wing with launcher and tip missile [7]. The computational grid is split into 290 blocks and contains only 169448 cells. The steady case is run 320, (angle of attack 0~ and a Mach number of 0.897) while the unsteady case is run 352 (mean angle of attack -0.002 ~ the amplitude of the pitching oscillation 0.115 ~ the reduced frequency 0.069 and Mach number 0.897) . This is a challenging problem for a multiblock method due to the geometrical complexity of the problem. This gives rise to a large number of blocks required as well as a large variation in block size (the largest contains 5520 cells while the smallest has only 32 cells). The run time of the code was 40 minutes on 8 processors to reduce the residual by 5 orders. Figure 3 shows the upper surface where the shock wave is evident on the wing from root to tip due to the absence of viscous effects. The unsteady calculation was again performed on 8 processors taking 40 minutes per cycle for 10 real timesteps per cycle and 60 minutes per cycle for 20 real timesteps per
453
O N E R A M6 Wing - Run 2308 - Eta = 0.90 1.5 I 1
! . ~ ....................................... i
0.5
i ..... i.........
!
i
i ...........
i. . . . . . . . . . .
-0.5
O N E R A M6 Wing - Run 2308 - Eta = 0.95 1.5 I
1
:) ,,. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
O"I
........
~1- - T u r b u l e n t calculation -0.5 {~1- - Inviscid calculation
......
o o Experimental data
-10
1.5
0.2
0.4
WC
0.6
0.8
1
0
0.2
0.4
X/C
0.6
0.8
1
O N E R A M6 Wing - Run 2308 - Eta = 0.99
,O
1 0.5 .
.
.
.
.
.
.
~ - Turbulent calculation -0.5 ][1- - Inviscid calculation II o o E x p e r i m e n t a l d a t a 0
0.2
0.4
X/c
0.6
0.8
1
Figure 2. Pressure distribution for Navier-Stokes &c Euler equations ONERA M6 Wing, Moo - 0 . 8 4 , c ~ - 3.06 ~ Re=ll.7106
cycle. As can be seen from Figure 4, the effect of doubling the real number of timesteps per cycle is minimal.
6. C O N C L U S I O N S An unfactored implicit time-marching method for solving the three dimensional unsteady Navier-Stokes equations in parallel has been presented. The ONERA M6 wing has been used to evaluate the characteristics of the method as well as the parallel performance. The simplistic parallel implementation leads to only good parallel efficiency on small problem sizes, but this is offset by a highly efficient serial algorithm. Future work includes an improved parallel implementation of the method for small data sets and a detailed study of the performance of the preconditioner with the advent of fast ethernet communication speeds.
454
Figure 3. Pressure contours for F5 Wing with launcher and missile Upper Surface, M~ = 0.897, a - 0~ Run 320
REFERENCES
1. Badcock, K.J., Xu, X., Dubuc, L. and Richards, B.E., "Preconditioners for high speed flows in aerospace engineering", Numerical Methods for Fluid Dynamics, V. Institute for Computational Fluid Dynamics, Oxford, pp 287-294, 1996 2. Cantariti, F., Dubuc, L., Gribben, B., Woodgate, M., Badcock, K. and Richards, B., "Approximate Jacobians for the Solution of the Euler and Navier-Stokes Equations", Department of Aerospace Engineering, Technical Report 97-05, 1997. 3. Badcock, K.J., McMillan, W.S., Woodgate, M.A., Gribben, B., Porter, S., and Richards, B.E., "Integration of an impilicit multiblock code into a workstation cluster environment", in Parallel CFD 96, pp 408. Capri, Italy, 1996 4. McMillan, W.,Woodgate, M., Richards, B., Gribben, B., Badcock, K., Masson, C. and Cantariti, F., "Demonstration of Cluster Computing for Three-dimensional CFD Simulations", Univeristy of Glasgow, Aero Report 9911, 1999 5. Dubuc, L., Cantariti, F., Woodgate, M., Gribben, B., Badcock, K. and Richards, B.E., "Solution of the Euler Equations Using an Implicit Dual-Time Method", AIAA Journal, Vol. 36, No. 8, pp 1417-1424, 1998. 6. Ceresola, N., "WEAG-TA15 Common Exercise IV- Time accurate Euler calculations of vortical flow on a delta wing in pitch motion", Alenia Report 65/RT/TR302/98182, 1998 7. Henshaw, M., Bennet, R., Guillemot, S., Geurts, E., Pagano, A., Ruiz-Calavera, L. and Woodgate, M., "CFD Calculations for the NLR F-5 Data- Validation of CFD Technologies for Aeroelastic Applications Using One AVT WG-003 Data Set", presented at CEAS/AIAA/ICASE/NASA Langley International forum on aeroelasticty and structural dynamics, Williamsburg, Virginia USA, June 22-25, 1999
455 0.001084
!
1() Timesteps/cycle 20 Timesteps/cycle
0.001082 0.00108 E 0.001078
(D .m
0 .m
o 0.001076
0 C~
a 0.001074 0.001072 0.00107 0.001068 -0.15
-
0'.1
I
I
-0.05 0 0.05 Incidence in degrees
0.1
0.15
Figure 4. Effect of Steps/Cycle on unscaled Drag Mo~ - 0.897, am - -0.002 ~ a0 =0.115 k - 0.069. Run 352
10. 11. 12.
13.
Baldwin, B. and Lomax H., "Thin-Layer Approximation and Algebraic Model for Separated Turbulent Flows", AIAA Paper 78-257,1978. Osher, S. and Chakravarthy, S., "Upwind Schemes and Boundary Conditions with Applications to Euler Equations in General Geometries", Journal of Computational Physics, Vol. 50, pp 447-481, 1983. Roe, P.L., "Approximate Riemann Solvers, Parameter Vectors and Difference Schemes", Journal of Computational Physics, vol. 43, 1981. Jameson, A. "Time dependent calculations using multigrid, with applications to unsteady flows past airfoils and wings", AIAA Paper 91-1596, 1991 Feszty, D., Badcock, B.J. and Richards, B.E., "Numerically Simulated Unsteady Flows over Spiked Blunt Bodies at Supersonic and Hypersonic Speeds" Proceedings of the 22nd International Symposium on Shock Waves, 18-23 July 1999, Imperial College, London, UK, Schmitt, V. and Charpin, F., "Pressure Distributions on the ONERA-M6-Wing at Transonic Mach Numbers", in "Experimental Data Base for Computer Program Assessment", AGARD-AR-138, 1979.
This Page Intentionally Left Blank