DOMAIN-BASED PARALLELISM AND PROBLEM DECOMPOSITION METHODS IN COMPUTATIONAL SCIENCE AND ENGINEERING
DOMAIN-BASED PARALLELISM AND PROBLEM DECOMPOSITION METHODS IN COMPUTATIONAL SCIENCE AND ENGINEERING
Edited by David E. Keyes
Old Dominion University and ICASE, NASA Langley Research Center
Youcef Saad
University of Minnesota
Donald G. Truhlar
Minnesota Supercomputer Institute
Society for Industrial and Applied Mathematics
Philadelphia
The royalties from the sales of this book are being placed in a fund to help students attend SIAM meetings and other SIAM-related activities. This fund is administered by SIAM and qualified individuals are encouraged to write directly to SIAM for guidelines. Society for Industrial and Applied Mathematics, 3600 University City Science Center, Philadelphia, PA 19104-2688.

Copyright © 1995 by the Society for Industrial and Applied Mathematics.

10 9 8 7 6 5 4 3 2 1

All rights reserved. Printed in the United States of America. No part of this book may be reproduced, stored, or transmitted in any manner without the written permission of the publisher. For information, write the Society for Industrial and Applied Mathematics, 3600 University City Science Center, Philadelphia, PA 19104-2688.
Library of Congress Cataloging-in-Publication Data

Domain-based parallelism and problem decomposition methods in computational science and engineering / edited by David E. Keyes, Youcef Saad, Donald G. Truhlar.
p. cm.
Includes bibliographical references.
ISBN 0-89871-348-X
1. Parallel processing (Electronic computers) I. Keyes, David E. II. Saad, Y. III. Truhlar, Donald G., 1944-
QA76.58.D66 1995
519.4'0285'52-dc20
95-7318
SIAM is a registered trademark.
Preface
This monograph arises from the recognition that physical scientists, engineers, and applied mathematicians are developing, in parallel, solutions to problems of parallelization. The new cross-disciplinary field of scientific computation is bringing about better communication between heterogeneous computational groups as they face this common challenge. However, as with a parallel computer itself, the scientific computing community benefits from a better balance between individual computations and communication. This volume is one attempt to provide such cross-disciplinary communication.

The subject addressed is problem decomposition and the use of domain-based parallelism in computational science and engineering. The authors met to exchange views on this subject at a workshop held at the University of Minnesota Supercomputer Institute in April 1994, and this fostered some appreciation for the relationships between the problems addressed and for several independently developed approaches to solving these problems.

The editors commend the contributing authors for their efforts to write for an interdisciplinary audience and to concentrate on transferable algorithmic techniques, rather than on scientific results themselves. Cross-disciplinary editing was employed to identify jargon that needed further explanation and to ensure that each chapter provides a brief scientific background at a tutorial level, so that the physical significance of the variables is clear and correspondences between fields are visible.

The editors have greatly enjoyed discovering links between the solution techniques arising in the various disciplines represented in this volume, though we would be the first to admit that some of them are philosophical only and do not lead to immediately transferable solutions. We believe that each individual chapter well represents one or more algorithmically progressive developments in its respective field of application, and we commend them to the reader on that basis alone. We look forward to more cross-reading and algorithm-mining of one another's disciplines and hope that many readers will be encouraged to do the same.

Hampton, VA
Minneapolis, MN
September 1994
Think Globally, Act Locally: An Introduction to Domain-based Parallelism and Problem Decomposition Methods

David E. Keyes
Yousef Saad
Donald G. Truhlar
"Think globally; act locally." This bumper sticker maxim has a lot to say to practitioners of contemporary high performance computing. It is increasingly incumbent on computational scientists to respect the data access hierarchies that accompany the large memories required by applications programs. These hierarchies are imposed, ultimately, by the finite size of data storage media and the finite speed of light, but their presence is asserted more immediately by the hardware and software overheads of system protocols for the delivery of data. From the frame of reference of any given processing element, an approximate cost function can be constructed for the minimum time required to access a memory element that is any given logical or physical distance away. Such cost functions typically consist of plateaus separated by sharp discontinuities that correspond to software latencies where some boundary of the hierarchy, such as a cache size or a local memory size, is crossed. The ratio of times required to access remote and local data varies from 10 to 105 in typical architectures, the latter being characteristic of network cluster computing. An underlying motivation for the development of problem decomposition algorithms is that these discontinuities should explicitly be respected by user applications. If users cannot afford to treat memory as "flat" in large problems, then neither can they afford to treat all nonzero data dependencies on an equal footing. Consequently, algorithms must adapt to architecture, guided by knowledge of the relative strengths of different couplings from the underlying physics. Ironically, such forced adaptation sometimes results not in compromise, but in the discovery of intrinsically better methods for flat memory environments, as well. Steady-state natural and human-engineered systems are often zero-sum Department of Computer Science, Old Dominion University, Norfolk. VA 23529-0162 and Institute for Computer Applications in Science and Engineering, NASA Langley Research Center. Hampton, VA 23681-0001. email: keyesQicase.edu. Department of Computer Science, University of Minnesota, Minneapolis, MN 554550154. email; saadacs.umn.edu. Minnesota Supercomputer Institute, 1200 Washington Ave. S., Minneapolis, MN 55415. email: mfISlOlQsc.msc.edu. vii
viii
Introduction
networks in which the overall distribution of a quantity to be determined is conserved. The conservation principle holds over any size control volume, from the smallest scales requiring resolution up to the global domain. Somewhere between these extremes are the scales at which the latencies of the memory hierarchy are asserted. This suggests a multilevel discretization of the conservation laws, with coarse-grained interactions between "basins" of fast memory (thinking globally, but on a small problem) and with fine-grained interactions within them (acting locally, on the scales of the resolution required). Algorithms exploiting multilevel discretization have evolved naturally and somewhat independently in a variety of applications, both continuous (e.g., conservation of energy in a conducting body) and discrete (e.g., conservation of current in a network of electronic components). It is an objective of this volume to promote cross-fertilization of such applications by identifying analogous features between them.

It may be assumed without loss of generality that the challenges of writing algorithms for large-scale problems on hierarchical memory systems occur for physical systems that are irreducible in the matrix theoretic or group theoretic sense of the term. Each degree of freedom depends upon all of the others; no degrees of freedom may be removed and solved for exactly in isolation. For irreducibly coupled physical systems with arbitrary interactions between the components, there is not necessarily any benefit to a decomposition of the unknowns of the problem into sets that are proximate (in space) or strongly coupled (by dynamics) and a mapping into the global memory in a way that preserves their proximity or strong coupling. However, the interactions in the systems studied herein decay with an appropriate "distance" (in physical or basis function index space) sufficiently rapidly that remote interactions may be lumped or even ignored in certain phases of the solution process.

There is a history of applying both direct and iterative methods to such problems. Direct methods involve the construction by explicit condensation of lower-dimensional systems for degrees of freedom that act as separators. In the literature of differential equations, this is the Poincaré-Steklov operator; in linear algebra, it is the Schur complement; in physics, it is the optical potential. The simplest iterative methods involve cycling between the subdomains whose unknown boundary data are updated by neighbors and may generically be called Schwarz methods. Many modern approaches combine direct and iterative aspects in the form of preconditioned Krylov methods.

The trade-offs involved in deciding what couplings may be lumped or ignored, with what consequences in terms of convergence rate or accuracy, and with what benefits in terms of mapping the computation to the memory hierarchy, constitute one of the main themes of this volume. A key concept in this regard is the selection of a reduced basis in which to represent the solution of a large-dimensional problem. This is an explicit choice in some
cases (as in a wave expansion), automated but still explicitly identifiable in some others (as in a Krylov method), and implicit in yet others (as in a multilevel or multipole method). In several chapters of this volume, the authors have brought out the benefits that accrue from selecting a good basis. These benefits range from getting any handle on the problem at all, to making a quantifiable asymptotic complexity reduction relative to a full-dimensional method, to identifying "reusable" bases for recurring computational tasks. A "good" basis is usually physically motivated (or problem-fitted), hierarchical, or orthogonal, and such good bases permit the solution process to be separated into distinct parts. A physically motivated or problem-fitted basis separates components of the result into dominant parts that may be suggested by some physical approximation and subdominant parts to patch in for more accuracy. A hierarchical basis separates components of the solution by their scales of variation. Expansion in an orthogonal basis provides another way to separate the components of the solution. Of course, these three attributes of a good basis are not mutually exclusive.

A signature of the choice of basis visible in some of the chapters is an expression of a key resolvent operator, or an approximation thereto, by a sum containing triple products of operators consisting of the inverse of a different-dimensional operator in the middle, with "rectangular" operators on either side that map between spaces of different dimensions. For instance, a Schur complement contains such triple products in which the middle term may be of higher dimension than the terms of the sum itself. A Schwarz preconditioner contains such triple products in which the middle term is of lower dimension. The "rectangular" operators can even be infinite dimensional in the long direction. In the chapters describing quantum chemistry applications, these triple products are sometimes expressed in bra and ket notation, while in the chapters originating from a problem in the continuum, linear algebraic expressions may be found.

Several other themes arise that transcend disciplinary barriers and are common within subsets of the chapters. These include:

1. opportunities to bring a physical understanding of the continuous problem into the discretization or the decomposition, particularly in the selection of partitions in problems in which the decay metric is anisotropic;

2. multiple discretizations of the same problem (e.g., on different scales, or to different orders of accuracy);

3. trade-offs in linear and nonlinear convergence rates that are mediated by a time-like parameter that stabilizes the nonlinear iteration while accelerating the linear iteration (by steepening the algebraic decay rate of the interactions at the same implicit time level), at the price of requiring many such time steps;

4. opportunities for reuse of computational results from one iteration on related problems in subsequent iterations;

5. opportunities for and experience with parallel implementations.
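The triple-product structure mentioned above can be written generically as follows; the symbols are illustrative rather than drawn from any particular chapter, and are only intended to show where the "middle" inverse lives relative to the "rectangular" factors.

```latex
% Schwarz-type preconditioner: R_i restricts to a lower-dimensional subdomain
% or coarse space, R_i^T prolongs back, and the middle inverse is local
M^{-1} \;=\; \sum_{i=0}^{N} R_i^{T}\, A_i^{-1}\, R_i , \qquad A_i = R_i A R_i^{T} .

% Schur complement: condensation onto separator unknowns (index set \Gamma);
% the middle inverse acts on the (typically larger) interior block A_{II}
S \;=\; A_{\Gamma\Gamma} \;-\; A_{\Gamma I}\, A_{II}^{-1}\, A_{I\Gamma} .
```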
In the rest of this introductory chapter we discuss several examples of problem decomposition methods more specifically, each of which is the subject of one of the following chapters.

Xiao-Chuan Cai presents the classical Schwarz domain decomposition approach for the solution of elliptic and parabolic problems with operators that are dominated by the self-adjoint second-order terms, but need not be either self-adjoint or even definite. With a fixed geometric overlap between neighboring subdomains, and with a single coarse-grid problem involving approximately one degree of freedom per subdomain as part of the preconditioner at each Krylov iteration, an iteration count bound that is asymptotically independent of both the resolution of the problem and the number of subdomains can be achieved. The coarse-grid solution being critical, recent work examines how to obtain the coarse-grid operator in the context of irregular grids and decompositions.

Alfio Quarteroni describes domain decomposition methods for hyperbolic problems, in which characteristics play an essential role in selecting partitions and imposing interfacial boundary conditions. Scalar convection problems and systems of conservation laws are addressed, with applications from acoustics and elasticity. The author considers three examples of wave equations describing convective, acoustic, and elastic waves. He illustrates how these problems can be reformulated in the framework of a decomposition of the spatial domain and devises algorithms based on subdomain iterations. Finally, he addresses the interaction of time-differencing and space decomposition.

Petter Bjørstad and Terje Karstad's contribution on two-phase immiscible, incompressible flow in oil reservoir simulation spans the subject matter of both of the first two chapters with an operator splitting that separately exploits the hyperbolic and elliptic features of the governing system of PDEs. The hyperbolic part of the problem is solved by a modified method of characteristics. Of particular interest is the resulting conflict between the optimal parallel mappings of the two split subproblems. In spite of the compromise, this chapter makes a strong case for the practicality of high-granularity parallel solutions to problems of real-world complexity. In particular, the resulting computational problems involve up to 16,384 subdomains (with one-element-wide overlap at their boundaries) and a coarse space. The solution is achieved via a data parallel implementation with one subdomain per processor, approximate subdomain solvers, and a multigrid approach on the coarse grid.
V. Venkatakrishnan presents parallel solution techniques for the highly nonsymmetric Jacobian systems that arise when the convectively dominated Navier-Stokes equations are discretized on unstructured grids and solved by Newton's method. For these multicomponent problems, a coarse-grid operator leading to an optimal convergence rate is not known; nevertheless, a coarse system derived from agglomeration proves effective. The equations are solved by a preconditioned iterative method with a block diagonal preconditioner corresponding to a fixed sparsity pattern and involving a factorization within each processor subject to homogeneous Dirichlet boundary conditions. Such boundary conditions become more and more accurate as the outer Newton iteration progresses. Partitioning, node ordering, and the accuracy with which subdomain problems should be solved for most efficient solution of the overall steady-state problem are addressed. An implicit scheme for unstructured grids is demonstrated that requires fewer iterations for a given nonlinear residual reduction than the best single-grid method.

Dana Knoll and co-authors extend Krylov-Schwarz domain decomposition methods without a coarse-grid operator to nonlinear problems. The edge plasma fluid equations are a highly nonlinear system of two-dimensional convection-diffusion-reaction equations that describe the boundary layer in a Tokamak fusion reactor. There are six or more components with complicated interactions through composition-dependent transport coefficients and source/sink terms. A matrix-free version of Newton's method exploits the Krylov nature of the solver (in which the action of the Jacobian is probed only through matrix-vector products) to avoid forming the actual Jacobian of the nonlinear system, except for diagonal blocks used only in preconditioning and updated infrequently. Matrix-free methods depend critically upon numerical scaling since they approximate matrix-vector products through a truncated Taylor series. The implications for the robustness of various Krylov solvers are explored.

William Gropp and Barry Smith present an implementation philosophy and a publicly available implementation in portable parallel software of a variety of preconditioned Krylov algorithms for domain decomposition, in which the notion of subdomain is generalized to the block partitioning of a sparse matrix. The emphasis is on the performance of such solvers on a variety of distributed memory architectures in the limit of large problem size, and the resulting trade-offs in convergence rate and parallel efficiency.

Andrew Lumsdaine and Mark Reichelt discuss the spatio-temporal simulation of semiconductor devices via accelerated versions of the waveform relaxation method, a classical method for systems of temporally varying ordinary differential equations. In contrast to conventional parabolic treatments, in which space parallelism only is sought at each time level, the entire space-time cylinder is partitioned for parallel processing purposes. Time, being causal in the initial value problems under consideration here,
invites a special windowing treatment.

Graham Horton applies two-level and multilevel discretizations beyond the realm of PDEs to steady-state Markov chains, which arise, for instance, in queuing theory and in the performance analysis of networks. Of particular interest is the derivation of a coarse-grid correction scheme that never violates the feasibility range of bounded variables, in this case probabilities. The resulting scheme is equivalent to a conventional multigrid method but with nonlinear (solution-dependent) intergrid transfer operators. Simple queuing networks with highly anisotropic coefficients, for which the novel multilevel method is particularly effective, are seen to have the same algebraic structure as convectively dominated transport equations.

Charbel Farhat also focuses on the coarse level of a multilevel preconditioner, from a parallel efficiency point of view and in the context of multicomponent problems of structural mechanics. The practically important problem of multiple right-hand sides in engineering analyses, and how to amortize their cost in the context of iterative methods, is also addressed. Of particular interest are the extensions of domain decomposition methods to "nearby" systems that arise in design problems, time-dependent problems, and eigenvalue problems. Scalable results are demonstrated for structural mechanics problems.

François-Xavier Roux presents the dual Schur complement method of domain decomposition with application to nonlinear elasticity problems, and shows the dual to be preferable from a spectral convergence theory point of view. Along with Farhat, he addresses reuse of previous right-hand-side work in reconjugation and extends it to nonlinear cases in which the matrix also changes. Parallel implementation on distributed-memory parallel machines is discussed.

Roland Glowinski and co-authors show how domain decomposition and domain embedding techniques, seemingly complementary techniques for making irregular geometry amenable to acceleration by fast solvers, may be merged in the solution of both elliptic and time-dependent problems. This approach is based on using an auxiliary domain with a simple shape that contains the actual domain with a more complicated shape.

Jacob White and co-authors exploit the fast multipole and fast Fourier transform methods in the context of a boundary element discretization of electrostatic potential problems. Boundary element formulations lead to dense matrix operators of sufficient diagonal dominance and superior conditioning that rapid convergence of Krylov methods can be obtained without complex preconditioners; however, the matrix-vector multiply is dense, and hence expensive. The fast multipole method applies the action of the underlying operator without forming it explicitly, resulting in order-of-magnitude reductions in asymptotic complexity while guaranteeing an arbitrary given accuracy in the result. The techniques are applicable to a wide variety of engineering applications based on 1/r² interactions.
The remaining chapters illustrate how problems and solutions analogous to those in the mechanics applications of the preceding chapters also arise in quantum mechanics. In modern quantum mechanics, one works in basis function space rather than physical space, but the space is still structured into subsets that are strongly coupled within and weakly coupled between. Although the various quantum mechanical problems discussed have significant differences, there are recurring themes such as basis set contraction, which occurs in one way or another in all of these chapters.

The chapters of Ellen Stechel and Hans-Joachim Werner are concerned with large-scale electronic structure problems, which involve elliptic eigenvalue problems of very large dimension. Contraction occurs at several levels in electronic structure problems. Stechel includes an overview of recent attempts to reach the ultimate scaling limit whereby the computational effort scales linearly in the number of particles or dimensions. Some of the techniques employed are very similar to the work described by White. Werner reviews modern numerical methods for the treatment of electron correlation effects, including the internally contracted configuration interaction method, in which sets of physically related many-body basis functions are treated as a single degree of freedom to reduce the size of the variational space. He also discusses the vectorization and parallelization strategies that are required to make the resulting algorithms efficient, including techniques for iterative solution of large matrix eigenproblems, solution of nonlinear equations in multiconfiguration self-consistent-field and coupled-cluster approaches, and the use of direct inversion on an iterative subspace. Problems of vectorization, parallelism, input/output bottlenecks, and limited memory are addressed, and the I/O bottleneck is addressed by disk striping. This provides an example of parallelism in communication that seems less widely discussed than parallelism associated with multiple processors.

Zlatko Bacic, and Georges Jolicard and John Killingbeck, discuss the vibrational eigenvalue problem in quantum mechanics. Bacic introduces the discrete variable representation (DVR), in which the analogies between function spaces and physical spaces are very clear, and he presents DVR-based divide-and-conquer computational strategies for reducing the dimensionality of the Hamiltonian matrix. Jolicard and Killingbeck discuss wave operator theory as a tool to define active spaces and simplified dynamics in large quantum spaces. They present a partitioning integration method for solving the Schroedinger equation based on projections in reduced active spaces. For the Floquet treatment of photodissociation experiments, the choice of the relevant subspaces and construction of the effective Hamiltonians are carried out using the Bloch wave operator techniques. Recursive methods for the solution of the basic equations associated with these operators, based on Jacobi, Gauss-Seidel, and variational schemes, are given.

David Schwenke and Donald Truhlar discuss large-scale problems in quantum mechanical scattering theory. In quantum mechanical scattering
theory, the basis functions may be delocalized, and they are typically grouped in sets associated with channels. At the highest level, associated with distortion potential blocks, Schwenke and Truhlar explicitly couple those channels which physical arguments indicate are the most strongly interacting. At the intermediate level, they can perform a sequence of calculations increasing in complexity, optimizing the (contracted) basis functions at each step. At the lowest level, they discuss replacing a class of weakly coupled channels with a phenomenological optical potential. The optical potential idea can also be introduced using a different kind of motivation for the partitioning, as a way to reduce the computational effort by partitioning the energy-independent parts of the problem from the energy-dependent parts. The resulting "folded" formulation has interesting computational analogies to domain decomposition, although it is accomplished in basis function space rather than physical space. Finally, the partitioning based on strength of coupling can be re-exploited by solving the coupled equations iteratively with preconditioners blocked by the same physical considerations as were employed to block the distortion potentials.

The work summarized above underscores the importance in large problems of informing the solution process directly with the physics being modeled and with the architecture for which the computation is destined, and portrays the tension between concentrating operations locally and taking strategic account of remote information that dominates parallel algorithm development today and for the foreseeable future.
Contents
Chapter 1    1
A Family of Overlapping Schwarz Algorithms for Nonsymmetric and Indefinite Elliptic Problems
Xiao-Chuan Cai

Chapter 2    21
Domain Decomposition Methods for Wave Propagation Problems
Alfio Quarteroni

Chapter 3    39
Domain Decomposition, Parallel Computing and Petroleum Engineering
Petter E. Bjørstad and Terje Karstad

Chapter 4    57
Parallel Implicit Methods for Aerodynamic Applications on Unstructured Grids
V. Venkatakrishnan

Chapter 5    75
Newton-Krylov-Schwarz Methods Applied to the Tokamak Edge Plasma Fluid Equations
D. A. Knoll, P. R. McHugh, and V. A. Mousseau

Chapter 6    97
Parallel Domain Decomposition Software
William Gropp and Barry Smith

Chapter 7    107
Decomposition of Space-Time Domains: Accelerated Waveform Methods, with Application to Semiconductor Device Simulation
Andrew Lumsdaine and Mark W. Reichelt

Chapter 8    125
A Parallel Multi-Level Solution Method for Large Markov Chains
Graham Horton

Chapter 9    141
Optimizing Substructuring Methods for Repeated Right Hand Sides, Scalable Parallel Coarse Solvers, and Global/Local Analysis
Charbel Farhat

Chapter 10    161
Parallel Implementation of a Domain Decomposition Method for Non-Linear Elasticity Problems
François-Xavier Roux

Chapter 11    177
Fictitious Domain/Domain Decomposition Methods for Partial Differential Equations
Roland Glowinski, Tsorng-Whay Pan, and Jacques Periaux

Chapter 12    193
Multipole and Precorrected-FFT Accelerated Iterative Methods for Solving Surface Integral Formulations of Three-dimensional Laplace Problems
K. Nabors, J. Phillips, F. T. Korsmeyer, and J. White

Chapter 13    217
Linear Scaling Algorithms for Large Scale Electronic Structure Calculations
E. B. Stechel

Chapter 14    239
Problem Decomposition in Quantum Chemistry
Hans-Joachim Werner

Chapter 15    263
Bound States of Strongly Coupled Multidimensional Molecular Hamiltonians by the Discrete Variable Representation Approach
Zlatko Bacic

Chapter 16    279
Wave Operators and Active Subspaces: Tools for the Simplified Dynamical Description of Quantum Processes Involving Many-Dimensional State Spaces
Georges Jolicard and John P. Killingbeck

Chapter 17    303
Problem Decomposition Techniques in Quantum Mechanical Reactive Scattering
David W. Schwenke and Donald G. Truhlar
Chapter 1

A Family of Overlapping Schwarz Algorithms for Nonsymmetric and Indefinite Elliptic Problems

Xiao-Chuan Cai

Department of Computer Science, University of Colorado at Boulder, Boulder, CO 80309; cai@cs.colorado.edu. The work is supported in part by the NSF grant ASC9457534, and in part by the NSF Grand Challenges Applications Group grant ASC9217394 and by the NASA HPCC Group grant NAG5-2218.

Abstract. The classical Schwarz alternating method has recently been generalized in several directions. This effort has resulted in a number of new powerful domain decomposition methods for solving general elliptic problems, including the nonsymmetric and indefinite cases. In this paper, we present several overlapping Schwarz preconditioned Krylov space iterative methods for solving elliptic boundary value problems with operators that are dominated by the self-adjoint, second-order terms, but need not be either self-adjoint or definite. All algorithms discussed in this paper involve two levels of preconditioning, and one of the critical components is a global coarse grid problem. We show that, under certain assumptions, the algorithms are optimal in the sense that the convergence rates of the preconditioned Krylov iterative methods are independent of the number of unknowns of the linear system and also of the number of subdomains. The optimal convergence theory holds for problems in both two- and three-dimensional spaces, and for both structured and unstructured grids. Some numerical results are presented also.
1 Introduction
In this paper, we present a family of overlapping domain decomposition methods for the solution of large, sparse, nonsymmetric and/or indefinite linear systems of equations obtained by discretizing elliptic partial differential equations. This family of methods originates from the classical Schwarz alternating algorithm, introduced in 1870 by H. A. Schwarz [37] in an existence proof for elliptic boundary value problems defined in certain irregular regions. This method has attracted much attention as a convenient computational method for the solution of a large class of elliptic or parabolic equations, see, e.g., [14, 38], especially on parallel machines [22]. There are
essentially two ways to use the algorithm as a computational tool. The first approach is to use it directly on the continuous partial differential equation defined on a physical domain. The mesh partitioning and the PDE discretization are then carried out subdomain by subdomain, which may sometimes result in nonmatching grids between overlapping subdomains. The second approach is to use it on the already discretized PDE, i.e., a linear system of algebraic equations. In this approach, a global grid is assumed to have been introduced before the domain, or mesh, is partitioned into subdomains. We shall consider only the second approach. Some of the material presented in this paper can also be found in the references [8, 10, 11, 12].

This family of overlapping Schwarz algorithms has been shown to be efficient and robust for solving differential equations of many different types under a wide range of circumstances. In this paper, we shall focus only on the class of nonsymmetric and/or indefinite second-order elliptic finite element, or finite difference, equations. The solution of such problems is an important computational kernel in implicit methods, for example, the Jacobian problems that need to be solved in any Newton-like method used in the solution of nonlinear partial differential equations such as in computational fluid dynamics [9].

An efficient iterative algorithm for solving general elliptic equations requires three basic steps, namely (a) a discretization scheme, (b) a basic iterative method, and (c) a preconditioning strategy. There is a significant difference between symmetric and nonsymmetric problems, the latter being considerably harder to deal with both theoretically and algorithmically. The main reasons are the lack of a generally applicable discretization technique for the general nonsymmetric elliptic operator, the lack of "good" algebraic iterative methods (such as CG for symmetric, positive definite problems), and the incompleteness of the mathematical theory for the performance of the algebraic iterative methods that do exist, such as GMRES [35, 36]. By a "good" method, we mean a method that is provably convergent within memory requirements proportional to a small multiple of the number of degrees of freedom in the system, independent of the operator. One must assume that the symmetric part is positive definite and be able to afford amounts of memory roughly in proportion to the number of iterations, in order to obtain rapid convergence with GMRES. The task of finding a good preconditioner for nonsymmetric or indefinite problems is more important than for symmetric, positive definite problems, since, first, the preconditioner can force the symmetric part of the preconditioned system to be positive definite, and second, a better-conditioned system implies both more rapid convergence and smaller memory requirements. The focus of this paper is on the construction of efficient, parallel and scalable preconditioners by using domain decomposition methods.

Domain decomposition methods are commonly classified according to a few criteria. "Overlapping" and "nonoverlapping" methods are differentiated by the decomposition into territories on which the elemental subproblems are defined. We shall not discuss any nonoverlapping algorithms in this paper; interested readers should consult the paper [13] for recent progress. For a comparison of some of the overlapping and nonoverlapping algorithms, we refer to the paper [8]. Overlapping methods generally permit simple (Dirichlet) updating of the boundary data of the subregions at the expense of having to solve some larger linear systems, defined on subregions, per iteration, owing to the redundant degrees of freedom. An advantage of the overlapping methods over nonoverlapping substructuring-type methods is that the solution of the so-called interface problems (see [8, 13]) can always be avoided. We remark here that a general purpose, robust interface solver that guarantees optimal convergence for the class of general variable coefficient, nonsymmetric and indefinite elliptic problems is yet to be introduced.

We shall restrict our attention to the so-called optimal algorithms, i.e., algorithms whose convergence rates are independent of the number of unknowns as well as the number of subregions. All the algorithms under consideration can be used in either two- or three-dimensional spaces, with either structured or unstructured meshes. A coarse space, which is used in all the algorithms, plays an extremely important role in obtaining the optimality. It essentially reduces the original nonsymmetric and/or indefinite elliptic problem to a positive definite problem [11, 12], which may not be symmetric. Most of the theory concerning the convergence rate of domain decomposition methods is in the framework of the Galerkin finite element method. In some cases the Galerkin results transfer immediately to finite difference discretizations, though this is less true for nonsymmetric problems than for symmetric ones. We shall describe the algorithms by using a matrix language which is independent of the underlying discretization schemes; however, we shall switch to the finite element language when discussing the convergence theory. We remark that algorithms based on preconditioned iterative solution of the normal equations can also be used to solve nonsymmetric and/or indefinite linear systems, but they are beyond the scope of this paper. Interested readers should consult, for example, [3, 28, 32].

The paper is organized as follows. In the rest of this section, we shall define our model elliptic problem and its discretization. Section 2 is devoted to the description of an overlapping partitioning of the mesh, as well as algorithms for subdomain coloring. Both nested and nonnested coarse meshes are discussed in Section 2. The main algorithms of this paper are introduced in Section 3. This section includes the discussion of a number of optimal overlapping Schwarz algorithms, including the additive Schwarz algorithm, the multiplicative Schwarz algorithm and some polynomial Schwarz algorithms. Several techniques for inexact subdomain solves, and an algebraic extension of the Schwarz algorithms for
general sparse linear systems are also discussed in Section 3. A brief overview of the available theory for the optimality of the Schwarz algorithms is given in Section 4. The paper ends with Section 5, which contains some numerical results.

We confine ourselves to the following model problem. Let Ω be a polygonal region in R^d (d = 2, 3), with boundary ∂Ω, and let

(1)    Lu = f  in Ω,    u = 0  on ∂Ω,

where L is a second-order linear elliptic operator with a homogeneous Dirichlet boundary condition. Here

(2)    Lu = − Σ_{i,j=1}^d ∂/∂x_i ( a_ij(x) ∂u/∂x_j ) + Σ_{i=1}^d b_i(x) ∂u/∂x_i + c(x) u.

We assume that the matrix {a_ij(x)} is symmetric and uniformly positive definite for any x ∈ Ω and that the right-hand side f ∈ L²(Ω). Only Dirichlet boundary conditions are considered here; however, the algorithms can be used to solve problems with other boundary conditions as well, such as Neumann or mixed boundary conditions. We also assume that a finite element mesh, structured or unstructured, has been introduced on Ω. A finite element, or finite difference, discretization of the elliptic problem (1) on the given mesh in Ω gives us a linear system of algebraic equations

(3)    Bu = f,

where B is an n × n sparse matrix and n is the total number of interior nodes in Ω. Here and in the rest of the paper u* denotes the exact solution of the linear system (3). We shall use h, even in the unstructured case, to characterize the mesh interval of the grid, which will be referred to as the h-level or fine grid. The nodal points in the fine grid will be referred to as the h-level nodes. We shall use the n × n matrix A to denote the discretization of the symmetric, positive definite part of the operator L. Let (·, ·) denote the Euclidean inner product with the corresponding norm ‖·‖. We denote the energy norm associated with the matrix A as

    ‖v‖_A = (Av, v)^{1/2}.

In practice, there are many discretization schemes that can be used to obtain the linear system (3), such as the artificial diffusion and streamline diffusion methods [23] and the methods in [1]. Multiple discretizations can also be combined in the same iterative process; see, e.g., [24]. The preconditioning techniques to be discussed in the next few sections can easily be used together with these discretization schemes.
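As a concrete illustration, not taken from this paper, of how a nonsymmetric sparse system of the form (3) might arise, the following sketch assembles an upwind five-point finite difference discretization of a constant-coefficient convection-diffusion operator with SciPy. The grid size and velocity coefficients are arbitrary choices; the matrix A returned alongside B plays the role of the discretized symmetric, positive definite part used in the energy norm above.

```python
import numpy as np
import scipy.sparse as sp

def convection_diffusion_matrix(m, b1=10.0, b2=20.0):
    """Upwind 5-point FD matrix for -Laplace(u) + b.grad(u) on the unit square,
    homogeneous Dirichlet boundary conditions, m interior nodes per direction."""
    h = 1.0 / (m + 1)
    I = sp.identity(m, format="csr")
    D2 = sp.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(m, m)) / h**2   # 1D diffusion
    D1 = sp.diags([-1.0, 1.0], [-1, 0], shape=(m, m)) / h               # 1D backward difference
    A = sp.kron(I, D2) + sp.kron(D2, I)                 # symmetric positive definite part
    B = A + b1 * sp.kron(I, D1) + b2 * sp.kron(D1, I)   # nonsymmetric operator B
    return B.tocsr(), A.tocsr()

B, A = convection_diffusion_matrix(31)   # n = 31 * 31 unknowns
f = np.ones(B.shape[0])                  # right-hand side corresponding to f = 1
```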
2 Overlapping Partitioning, Subdomain Coloring and Coarse Spaces
In this section, we discuss a number of issues, mostly non-numerical, related to the partitioning of the finite element mesh and the definition of a coarse mesh space, which is an important component of the algorithms of this paper. We begin with the overlapping partitioning of the mesh. Let {Ω_i, i = 1, ..., N} be nonoverlapping subregions of Ω, such that

    ∪_{i=1}^N Ω̄_i = Ω̄,

where Ω̄ means the closure of Ω. Some earlier theory on Schwarz algorithms [11, 12, 17, 18] required that the partitioning {Ω_i} form a regular finite element subdivision of Ω, but recent developments show that this requirement is not necessary [7]. These subdomains can be of any shape. In the case of unstructured meshes, this partitioning is often achieved by using certain graph partitioning techniques; namely, we first define an adjacency graph for the finite element mesh, then partition the graph into a number of disjoint subgraphs. We refer to [10, 21, 26, 33] for issues of graph partitioning. We assume that the vertices of any Ω_i, not on ∂Ω, coincide with the h-level nodes. Following [11, 18], we can obtain an overlapping decomposition of Ω, denoted by

    {Ω_i', i = 1, ..., N}.

Here Ω_i' is obtained by extending each Ω_i to a larger region, which is cut off at the physical boundary of Ω. We assume that

    Distance(∂Ω_i' ∩ Ω, ∂Ω_i ∩ Ω) ≥ δ
for a constant δ > 0. Here 'Distance' is in the usual Euclidean sense. In the uniform mesh case, δ is usually equal to an integer multiple of the mesh size h. δ is an important parameter in these overlapping algorithms. Usually, using a larger overlap can result in a reduced total number of iterations; however, the per-iteration arithmetic operations and the local memory requirement may increase.

Let n_i be the total number of h-level interior nodes in Ω_i', and B_i the n_i × n_i stiffness matrix corresponding to the discretization of L on the fine grid in Ω_i', with a zero Dirichlet boundary condition on ∂Ω_i'. Since the matrices B_i are used only in the preconditioner part of the algorithms, they need not be calculated exactly. A detailed discussion on the use of inexact subdomain solvers can be found in Section 3.4. The size of the matrix B_i depends not only on the size of the substructure Ω_i but also on the degree of overlap. The cost for solving the linear systems corresponding to the matrix B_i is determined not only by the size of the matrix but also by the type of solver. We note that a less accurate solver, such as an ILU [30] or ILUT [34] with a small number of fill-ins and a relatively large drop tolerance, can keep the overall cost down, even if the overlap is not too small.

When using some of the multiplicative algorithms (Section 3.1), the subdomains are usually colored with the purpose of reducing the number of sequential steps and speeding up the overall convergence. The coloring is realized as follows. Associated with the decomposition {Ω_i'}, we define an undirected graph in which nodes represent the extended subregions and edges represent intersections of the extended subregions. This graph can be colored by using colors 1, ..., J, such that no connected nodes have the same color. Obviously, colorings are not unique. Simple greedy heuristic subgraph coloring algorithms have been discussed in the literature; see, for example, [10]. Numerical experiments support the expectation that minimizing the number of colors enhances convergence. An optimal five-color strategy (J = 4) is shown for the decomposition in Figure 1, in which the total number of subregions (including the coarse grid on the global region) is N + 1 = 17.

FIG. 1. The coloring pattern of 16 fine grid overlapped subregions and a coarse grid region. Color "0" is for the global coarse grid. The extended subregions of the other colors are indicated by the dotted boundaries.

Let R_i be an n_i × n matrix representing the algebraic restriction of an n-vector on Ω to an n_i-vector on Ω_i'. Thus, if v is a vector corresponding to all the h-level interior nodes in Ω, then R_i v is a vector corresponding to the h-level interior nodes in Ω_i'. The transpose (R_i)^T is an extension-by-zero matrix, which extends a length-n_i vector to a length-n vector by padding with zeros. All the algorithms discussed in the next section involve a coarse level
discretization. Let us define it here. Suppose that there is another mesh defined on Ω, which contains n_0 nodes and is coarser than the fine mesh. Let B_0 be the discretization of L on this coarse mesh. Let R_0^T be an extension operator, which maps any coarse mesh vector to the corresponding fine mesh vector. There is a variety of ways that one can define such an operator. Here we discuss only one example in the finite element context. Let φ_j(x) be the basis function defined at the jth coarse node. Let {x_i ∈ Ω, i = 1, ..., n} be the fine mesh nodes. Then the n × n_0 matrix R_0^T = {r_ij} can be defined by r_ij = φ_j(x_i). R_0 is the transpose of R_0^T, and is used as a restriction operator that maps a fine mesh vector to a coarse mesh vector. In practice, the coarse mesh space need not be a subspace of the fine mesh space; we refer to [7] for a detailed discussion. We shall use Ω_0' to denote the coarse grid, which is always assumed to have color 0.

We conclude this section by introducing several frequently used notations. For each subdomain Ω_i', i = 0, ..., N, we define two n × n matrices

    M_i^{-1} = (R_i)^T B_i^{-1} R_i    and    P_i = M_i^{-1} B.

For j = 0, 1, ..., J, we denote by Q_j the sum of all P_i, and by N_j^{-1} the sum of all M_i^{-1}, that correspond to subregions of the jth color. These matrices will serve as the basic building blocks of the overlapping Schwarz algorithms to be discussed.
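The following sketch, with hypothetical sizes, a one-dimensional model matrix standing in for B, and the coarse space omitted, shows one way the restriction matrices R_i, the extended subdomains, and the building blocks M_i^{-1} and P_i introduced above might be realized algebraically.

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

n, N, overlap = 64, 4, 2    # fine-grid unknowns, subdomains, overlap width (illustrative)
B = sp.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(n, n), format="csr")  # model matrix

def restriction(idx, n):
    """R_i: 0/1 matrix picking out the unknowns listed in idx."""
    m = len(idx)
    return sp.csr_matrix((np.ones(m), (np.arange(m), idx)), shape=(m, n))

pieces = np.array_split(np.arange(n), N)          # nonoverlapping index sets
extended = [np.arange(max(p[0] - overlap, 0), min(p[-1] + overlap, n - 1) + 1)
            for p in pieces]                      # extended (overlapping) subdomains

R = [restriction(idx, n) for idx in extended]
Bsub = [(Ri @ B @ Ri.T).tocsc() for Ri in R]      # B_i: subdomain matrices
Minv = [lambda r, Ri=Ri, Bi=Bi: Ri.T @ spla.spsolve(Bi, Ri @ r)
        for Ri, Bi in zip(R, Bsub)]               # action of M_i^{-1} = R_i^T B_i^{-1} R_i

def P(i, v):
    """Action of the building block P_i = M_i^{-1} B on a vector v."""
    return Minv[i](B @ v)
```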
3 Some Schwarz Algorithms
In this section, we describe several overlapping Schwarz type algorithms constructed by using the P_i as the basic building blocks. We shall begin with the so-called multiplicative Schwarz algorithm, which is a direct extension of the classical Schwarz alternating algorithm. Then, we discuss a much simpler, additive version of the multiplicative Schwarz algorithm, which will be referred to as the additive Schwarz algorithm. In the third subsection, we introduce a family of Schwarz algorithms constructed by using a multivariable matrix-valued polynomial with the P_i as the variables. In fact, both additive and multiplicative Schwarz algorithms are special cases of this family of polynomial Schwarz algorithms. Finally, in the last subsection, we briefly discuss an algebraic extension of the overlapping Schwarz algorithms, introduced recently in [10], for solving general sparse linear systems.
3.1 Multiplicative Schwarz Method (MSM)
Unlike other preconditioners, such as the additive Schwarz preconditioner, the MSM algorithm can be employed either as an iterative algorithm by itself or used as a preconditioner. As an iterative algorithm, or equivalently as a preconditioner accelerated by a simple Richardson method, MSM is rather sensitive to some
of the problem parameters, such as the size of the first-order terms in the partial differential operator, and sometimes loses its convergence; see, for example, Table 1. However, it is an excellent and robust preconditioner, especially when accelerated by a Krylov space iterative method, such as GMRES. Along with the other algorithms to be described below, we shall normally employ it as a preconditioner for GMRES, but because of its historical importance, and to illustrate certain robustness advantages of acceleration, we also include the Richardson version in our discussion. In this paper, we shall use the abbreviation MSM for the multiplicative Schwarz-preconditioned GMRES method, and MSR for the simple Richardson process that corresponds to the classical Schwarz alternating algorithm with an extra coarse grid solver.

To obtain parallelism, one needs a good subdomain coloring strategy so that a set of independent subproblems can be introduced within each sequential step and the total number of sequential steps can be minimized. A detailed description of the coloring algorithm and its theoretical aspects can be found in [4, 12, 25].

We now describe the MSR algorithm in terms of a subspace correction process. Let u^k be the current approximate solution. Then u^{k+1} is computed as follows. For j = 0, 1, ..., J:

(i) Compute the residual in subregions with the jth color:

    r_j = f − B u^{k+j/(J+1)}.

(ii) Solve for the subspace correction in all Ω_i' that share the jth color:

    e_j = N_j^{-1} r_j.

(iii) Update the approximate solution in all Ω_i' that share the jth color:

    u^{k+(j+1)/(J+1)} = u^{k+j/(J+1)} + e_j.

At each iteration, every subproblem is solved once. For j ≠ 0, applications of the operators R_i and (R_i)^T do not involve any arithmetic operations. For j ≠ 0, within each series of steps (i)-(iii), the operations in subregions sharing the same color can be done in parallel. MSR can also be written in the following more compact form: for a given initial approximate solution u^0, and k = 0, 1, ...,

    u^{k+1} = E_{J+1} u^k + g,

where the error propagation operator E_{J+1} is defined as E_{J+1} = (I − Q_J) ··· (I − Q_0) and g = g_J is computed at a pre-iteration step by the following J + 1 sequential steps:

    g_{-1} = 0,    g_j = g_{j-1} + N_j^{-1} (f − B g_{j-1}),    j = 0, 1, ..., J.
Next, we shall discuss an accelerated version of MSR. We begin with the observation that if the matrix I − E_{J+1} is invertible, then the exact solution of equation (3) also satisfies

(4)    (I − E_{J+1}) u = g,

which is sometimes referred to as the transformed, or preconditioned, system corresponding to (3). We next observe that for a given vector v ∈ R^n, the matrix-vector product (I − E_{J+1}) v, denoted as v_J, can be computed in a manner similar to that of g, namely,

    v_{-1} = 0,    v_j = v_{j-1} + N_j^{-1} B (v − v_{j-1}),    j = 0, 1, ..., J.

Now, the multiplicative Schwarz preconditioned GMRES method (MSM) can be described as follows: find the solution of equation (3) by solving the equation (4) with the GMRES method for a given initial guess and inner product. Even in the case that the matrix B is symmetric positive definite, the iteration matrix I − E_{J+1} is not symmetric. An obvious symmetrization exists, upon which a conjugate gradient method can be used as the acceleration method; however, we shall not emphasize the case of a symmetric B in this paper.
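Under the same simplifying assumptions as the sketch above (a one-dimensional model matrix, exact subdomain solves, no coarse grid, and one subdomain per color), the product of the transformed operator I − E_{J+1} with a vector might be applied color by color as follows.

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

def multiplicative_schwarz_apply(B, groups, v):
    """Apply I - E_{J+1}, with E_{J+1} = (I - Q_J)...(I - Q_0), to the vector v.
    groups[j] lists (R_i, subdomain solve) pairs for the subregions of color j."""
    vj = np.zeros_like(v)
    for group in groups:                        # colors j = 0, 1, ..., J in turn
        r = B @ (v - vj)                        # common residual for this color
        for Ri, solve in group:
            vj = vj + Ri.T @ solve(Ri @ r)      # v_j = v_{j-1} + N_j^{-1} B (v - v_{j-1})
    return vj

# illustrative setup: model matrix, two overlapping subdomains, one per color
n = 32
B = sp.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(n, n), format="csr")
groups = []
for ix in (np.arange(0, 20), np.arange(12, 32)):
    Ri = sp.csr_matrix((np.ones(len(ix)), (np.arange(len(ix)), ix)), shape=(len(ix), n))
    groups.append([(Ri, spla.splu((Ri @ B @ Ri.T).tocsc()).solve)])

Tv = multiplicative_schwarz_apply(B, groups, np.random.rand(n))
```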
3.2 Additive Schwarz Algorithm (ASM)
An additive variant of the Schwarz alternating method was originally proposed in [15, 17, 31] for self-adjoint elliptic problems and extended to non-self-adjoint elliptic cases in [5, 11]. The idea is simply to give up the data dependency between the subproblems defined on subregions with different colors, as in going from Gauss-Seidel to Jacobi. Instead of iterating with the multiplicative sweep of MSR, one uses

(6)    u^{k+1} = ( I − Σ_{i=0}^N P_i ) u^k + g.

Of course, similar changes have to be made to the right-hand side vector g. Coloring does not play a role at all in (6). Because of the lack of data dependency, the method is usually not to be recommended as a simple Richardson process (it may not converge), but as a preconditioner for some algebraic iterative methods of CG type. We denote by M_ASM^{-1} the preconditioning part of (6). Following [11] and using the notation of the previous subsection, we can define the inverse of the matrix M_ASM, referred to as the additive Schwarz preconditioner, as

(7)    M_ASM^{-1} = Σ_{i=0}^N M_i^{-1} = Σ_{i=0}^N (R_i)^T B_i^{-1} R_i.
The key ingredients for the success of the ASM are the use of overlapping subregions and the incorporation of a coarse grid solver. At each iteration, all subproblems are solved once. It is obvious that all subproblems are independent of each other and can therefore be solved in parallel. The ASM discussed in this subsection can be used recursively for solving the subdomain problems. The result is the multilevel ASM, as developed in [2, 12, 19, 40, 41].
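A sketch of the additive Schwarz preconditioner (7) used inside GMRES through SciPy's LinearOperator interface: the coarse-grid term i = 0 is omitted, so this corresponds to a one-level variant rather than the two-level algorithm analyzed in this paper, and the model problem and sizes are illustrative.

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

# nonsymmetric 1D convection-diffusion model problem
n = 200
h = 1.0 / (n + 1)
B = (sp.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(n, n)) / h**2
     + 30.0 * sp.diags([-1.0, 1.0], [-1, 0], shape=(n, n)) / h).tocsr()
f = np.ones(n)

# overlapping subdomains, each with an exact (sparse LU) local solver
N, overlap = 8, 4
subdomains = []
for p in np.array_split(np.arange(n), N):
    ix = np.arange(max(p[0] - overlap, 0), min(p[-1] + overlap, n - 1) + 1)
    Ri = sp.csr_matrix((np.ones(len(ix)), (np.arange(len(ix)), ix)), shape=(len(ix), n))
    subdomains.append((Ri, spla.splu((Ri @ B @ Ri.T).tocsc())))

def asm_apply(r):
    """One-level additive Schwarz: sum_i R_i^T B_i^{-1} R_i r (no coarse term)."""
    z = np.zeros_like(r)
    for Ri, lu in subdomains:
        z = z + Ri.T @ lu.solve(Ri @ r)
    return z

M = spla.LinearOperator((n, n), matvec=asm_apply)
u, info = spla.gmres(B, f, M=M)   # ASM-preconditioned GMRES
```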
3.3 Some Polynomial Schwarz Algorithms
In this subsection, we discuss a family of Schwarz algorithms constructed by using some matrix-valued, multivariable polynomials. The previously discussed ASM and MSM algorithms can be viewed as two extreme cases of polynomial Schwarz algorithms, namely polynomials with the lowest and highest possible degrees. We remark here that the degree of the polynomial usually equals the number of sequential steps of the algorithm. Related subjects can be found in the papers [3, 6, 16, 27, 39], and references therein.

Let us define T = poly(P_0, ..., P_N) as a matrix-valued polynomial in the variables P_i, and we assume the polynomial satisfies poly(0, ..., 0) = 0, which simply means that the constant term in the polynomial is zero. It is not difficult to see that Tu* can be computed without knowing u* itself. This is because P_i u*, i = 0, ..., N, can be computed directly from the right-hand side function f. By denoting g = Tu*, we can define a new linear system

(8)    Tu = g,

which will be referred to as the transformed system of (3). It can be proved that if the matrix T is invertible, then the equation (8) has the same solution as the equation (3). To obtain the matrix T explicitly is usually not possible; however, for any v ∈ R^n, the matrix-vector multiply Tv can be computed easily. This makes the linear system (8) a good candidate for Krylov space iterative methods.
We next look at some special examples. The first and simplest, in which the degree of poly(···) is one, is the additive Schwarz method, in which the operator has the form

    T = Σ_{i=0}^N P_i.

The second example is the so-called multiplicative Schwarz operator

    T = I − E_{J+1},

where I is the identity matrix and E_{J+1} = (I − Q_0)(I − Q_1) ··· (I − Q_J). The degree of this polynomial depends on the number of colors, and the exact form of the polynomial depends on how the subregions are colored. The third example, which is a hybrid additive and multiplicative Schwarz algorithm (AMSM), was introduced in [6], and

(9)    T = ω P_0 + ( I − E_J ),

where ω is a balancing parameter and E_J = (I − Q_1) ··· (I − Q_J), without containing the coarse operator term. Numerical experiments suggest that ω = 1 is usually a good choice, although the corresponding theory is yet to be established. The algorithm can be viewed as a combination of the additive and multiplicative Schwarz methods. There are two major advantages. First, it converges faster than the additive Schwarz algorithm because of the extra local data dependency. Secondly, it is more parallelizable than the multiplicative Schwarz algorithm, since the global coarse problem can now be solved simultaneously with the rest of the local problems. It is important to note that even if the original equation (3) is not well-conditioned, the transformed systems can be uniformly well-conditioned and, more importantly, the transformed system can be so arranged that a highly parallelizable algorithm can be developed for solving it.
3.4 Using Inexact Subdomain Solvers
Using an inexact solver for the interior subproblems, or an exact solver for approximate interior subproblems, can significantly reduce the overall computational complexity. This is, in fact, one of the major advantages of domain decomposition methods, in that they allow the use of fast solvers designed for special differential operators on regions of special shape. A somewhat disappointing experimental observation is that inexact solutions seem not to work well for the coarse grid solver. In fact, the existing theory for MSM [12], as well as the theory for ASM [11], requires an exact solve on the coarse grid. There are essentially two ways to introduce an inexact subproblem solver. The first method involves an approximation to the differential operator L. In each subdomain Ω_k', k ≠ 0, L is replaced by a certain
spectrally equivalent differential operator L_k, which is usually chosen to have constant coefficients, or to have other special properties, so that a fast solver, such as an FFT-based method, can be used to solve the corresponding discretized problem. As an example, we mention that if L is a general operator, defined as in (2), then L_k can be defined as

    L_k u = − c_k Δu,

where c_k is an averaged eigenvalue of the matrix {a_ij(x_k)} and x_k is a fixed point in Ω_k'. In this case, the subdomain matrix B_k, used in any of the Schwarz algorithms discussed in the previous subsections, can be replaced by a discretization of L_k with a zero Dirichlet boundary condition on ∂Ω_k'.

The second class of inexact subproblem solvers can be defined at the algebraic level. We assume that the matrices B_k have already been obtained by the discretization of certain differential equations on Ω_k'. In this case, using an inexact solver is understood as solving the subdomain linear system

    B_k x = b

inexactly. Here x, b ∈ R^{n_k}. For example, the above linear system can be "solved" by (1) a few multigrid cycles [29]; or (2) a few Gauss-Seidel (SOR, SSOR, Jacobi) iterations; or (3) replacing B_k with its ILU [30] or ILUT [34] factorization, etc.
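For the third option, an incomplete LU factorization of a (model) subdomain matrix B_k can serve as the inexact solver, for example via SciPy's spilu; the drop tolerance and fill factor below are illustrative values rather than recommendations from this paper.

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

nk = 500
Bk = sp.diags([-1.0, 2.1, -1.0], [-1, 0, 1], shape=(nk, nk), format="csc")  # model B_k

# inexact subdomain solver: ILU with a drop tolerance and limited fill-in
ilu = spla.spilu(Bk, drop_tol=1e-3, fill_factor=5)

b = np.random.rand(nk)
x_approx = ilu.solve(b)   # approximate action of B_k^{-1}, used inside the preconditioner
```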
3.5 Algebraic Schwarz Algorithms
According to [10], the previously studied Schwarz framework can also be extended to solving general sparse linear systems. The fundamental principle underlying this extension is to replace the domain of definition of the problem by the adjacency graph of the sparse matrix, i.e., the graph that represents its nonzero pattern. We note that by switching from a domain to a graph, the concept of Euclidean distance, which plays an important role in the optimality analysis of these domain decomposition methods, is lost. It was shown in [10], mostly by means of numerical experiments, that the efficiency of the overlapping methods can be preserved to some extent with certain well-balanced overlapping graph decompositions.

Suppose B = {b_ij} is an n × n sparse matrix. To describe a model algebraic Schwarz algorithm, let us define the graph G = (W, E), where the set of vertices W = {1, ..., n} represents the n unknowns and the edge set E = {(i, j) : b_ij ≠ 0} represents the pairs of vertices that are coupled by a nonzero element in B. Let us assume that the nonzero pattern is symmetric, and therefore the adjacency graph G is undirected. For the remaining discussion, we assume that graph partitioning has been applied and has resulted in a number N of subsets W_i whose union is W,

    ∪_{i=1}^N W_i = W.
We will denote by Ni the vector space spanned by the set W, in Rn and by m^ its dimension. For each subspace vV, we define a corresponding submatrix. In matrix terms, this is defined by the sub-identity matrix /j of size n x n whose diagonal elements are set to one if the corresponding node belongs to Wi and to zero otherwise. With this we define the matrix,
which is an extension to the whole subspace, of the restriction of B to A^. This is sometimes termed the section of B on N{. Its action on a vector is to project it on Ni, then apply B to the result and finally project the result back onto TVj. Note that although Bi is not invertible, we can invert its restriction to the subspace spanned by Wi. and define
With this definition of B^~l, the Schwarz algorithms can be defined the same as in the previous subsections. The only missing piece is the coarse preconditioner. As indicated in [10], without further geometric information of the problem, to define a coarse preconditioner is generally very difficult. 4
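A minimal algebraic realization of these overlapping blocks might look as follows (Python/SciPy sketch; the overlap-growing routine and the class name are illustrative, and the choice of graph partitioner that produces the initial subsets, e.g. node separators [26] or spectral methods [33], is left open):

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

def extend_overlap(B, subset, layers=1):
    # Enlarge an index set by following edges of the adjacency graph of B.
    B = B.tocsr()
    marked = np.zeros(B.shape[0], dtype=bool)
    marked[np.asarray(subset)] = True
    for _ in range(layers):
        rows = np.where(marked)[0]
        marked[B[rows, :].indices] = True   # neighbors of all marked vertices
    return np.where(marked)[0]

class AlgebraicASM:
    """One-level additive Schwarz preconditioner built from overlapping sets W_i."""
    def __init__(self, B, subsets):
        B = B.tocsr()
        self.n = B.shape[0]
        self.subsets = [np.asarray(w) for w in subsets]
        # Factor the restriction of B to span{W_i} (the "section" of B on N_i).
        self.local_lu = [spla.splu(B[w, :][:, w].tocsc()) for w in self.subsets]

    def apply(self, r):
        # sum_i  I_i B_i^{-1} I_i r  (zero extension outside each W_i)
        z = np.zeros(self.n)
        for w, lu in zip(self.subsets, self.local_lu):
            z[w] += lu.solve(r[w])
        return z
```

Growing each W_i by a few layers of the adjacency graph plays the role of the geometric overlap, which is exactly how the Euclidean notion of overlap is replaced in the algebraic setting.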
4 Convergence Theory

We now discuss very briefly a theory that provides some understanding of these Schwarz preconditioners. All the discussion is based on the assumption that there is an underlying finite element space. For simplicity, we consider only the piecewise linear finite element case. Let b(u, v) be the bilinear form associated with the Dirichlet problem (1). The convergence of MSR has been proved in [12], under certain assumptions. The rate of convergence is
where C_MSR > 0 is a constant independent of h, H and J. The estimate holds in both two- and three-dimensional spaces. Here H is the diameter of the subdomains. The assumptions include: (1) the overlap is uniform and must be O(H); (2) H must be sufficiently small; and (3) the number of colors, J, must be independent of the size H of the subregions. The same estimate, with a different constant, holds for MSR with either exact or spectrally equivalent inexact solvers. For the accelerated version MSM, under the same assumptions, there exist two constants C_MSM > 0 and c_MSM > 0, independent of both h and H, such that the transformed system is uniformly bounded:
and the symmetric part of the transformed system is positive definite in the inner product (A·, ·):
For the additive Schwarz algorithm, it was shown in [5, 11] that, in the piecewise linear finite element case, the preconditioner M satisfies analogous bounds under the same first two assumptions made for MSM, in the sense that there exist two constants C_ASM > 0 and c_ASM > 0, which may be different for exact and inexact subdomain solvers and are independent of both h and H, such that the preconditioned linear system is uniformly bounded:
and the symmetric part of the preconditioned linear system is positive definite in the inner product (A·, ·):
Similar boundedness results also hold for the operator (9) with a certain well-chosen parameter ω; see [6]. Extensions of the above results to the unstructured grid case can be found in [7]. In the case B = A, which means that the original elliptic operator is symmetric positive definite, the left-preconditioned system is symmetric positive definite in the (A·, ·) inner product; thus one can use a CG method. In the nonsymmetric case, the preconditioned system is nonsymmetric regardless of the inner product. Therefore, instead of the A-inner product, we usually use the Euclidean inner product in practical implementations. By giving up the symmetry requirement on the preconditioned system, we could also use ASM as a right preconditioner. Neither of the pair of estimates (12) and (13) has been proved in the L2 norm, but in the numerical experiments section, variability in ASM convergence rates measured (as is customary) with respect to L2 residuals clearly diminishes as mesh and subdomain parameters are both refined, leading us to conjecture that analogous results hold. We remark that the bounds (10), (11), (12) and (13) can be used to estimate, theoretically, the number of iterations for some of the Krylov space iteration methods, such as GMRES. As is well known, the GMRES method, introduced in [36], is mathematically equivalent to the generalized conjugate residual (GCR) method [20], and can be used to solve the linear system of algebraic equations Px = b, where P is a nonsingular matrix, which may be nonsymmetric or indefinite, and b is a given vector in R^n. In this paper, P is one of the transformed systems T = poly(P_0, ..., P_N). According to the theory of [20, 5], the rate of convergence of the GMRES method can be estimated by the ratio of the minimal eigenvalue of the symmetric part of the operator to the norm of the operator. These two quantities are defined by c_P = inf_{x≠0} (Px, x)_A / (x, x)_A and C_P = sup_{x≠0} ||Px||_A / ||x||_A, where (·, ·)_A is our A-inner product, which induces the norm ||·||_A. Following [20], the rate of convergence can be characterized, not necessarily tightly, as follows: if c_P > 0, which means that the symmetric part of the operator P is positive definite with respect to this inner product, then the GMRES method converges and at the m-th iteration the residual is bounded as ||r_m||_A ≤ (1 − c_P^2/C_P^2)^{m/2} ||r_0||_A, where r_m = b − P x_m. The algorithm is parameter-free and quite robust. Its main disadvantage is that its memory requirement grows linearly with m. To fit the available memory, one is sometimes forced to use the k-step restarted GMRES method [36].
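As an illustration of how such preconditioned Krylov iterations are typically driven, the following hedged sketch wraps an additive Schwarz preconditioner (such as the illustrative `AlgebraicASM.apply` of Section 3.5) into SciPy's restarted GMRES; it is not the solver used for the experiments below:

```python
import scipy.sparse.linalg as spla

def schwarz_gmres(B, b, apply_prec, restart=30):
    # apply_prec(r): action of the Schwarz preconditioner M^{-1} on a residual r,
    # for example the AlgebraicASM.apply method sketched earlier.
    M = spla.LinearOperator(B.shape, matvec=apply_prec)
    # Stopping-tolerance keywords (tol/rtol/atol) are omitted here because their
    # names vary between SciPy versions; set them as appropriate for your install.
    x, info = spla.gmres(B, b, M=M, restart=restart)
    return x, info
```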
5 Numerical Experiments

We present a few numerical experiments in this section to illustrate the convergence behavior of some of the Schwarz algorithms. For comparison, we also include some results obtained by using ILU(k) as the preconditioner. Two test problems will be considered. A more complete comparison of overlapping Schwarz algorithms with other domain decomposition algorithms can be found in [8]. Some three-dimensional experience with overlapping Schwarz algorithms can be found in [22].

Example 1. Lu = -Δu + δ u_x + δ u_y.

Example 2. Lu = -((1 + (1/2) sin(50πx)) u_x)_x - ((1 + (1/2) sin(50πx) sin(50πy)) u_y)_y + 20 sin(10πx) cos(10πy) u_x - 20 cos(10πx) sin(10πy) u_y - 70u.

In all the tests, Ω = [0,1] × [0,1], and homogeneous Dirichlet boundary conditions are prescribed on ∂Ω. GMRES is used as the Krylov iterative method, and the iteration is stopped when the initial residual is reduced by a factor of 10^-5. A one-point-per-subdomain coarse grid solver is used in all the tests. All the subdomain problems are solved exactly. For Example 1, a uniform 128 × 128 grid is used on Ω, the number of subdomains is 64 = 8 × 8, and the overlap is 4h (h = 1/128). The first order terms are discretized by two schemes (central and upwind differences), as indicated in Table 1.
TABLE 1 Iteration count for solving Example 1.
8 × 8 = 64 subdomains. Methods compared: MS-Richardson (MSR), MS-GMRES, AMS-GMRES, AS-GMRES, and ILU(0)-, ILU(1)-, ILU(2)-GMRES, for a central-difference discretization with δ = 1, 10, 50, 100 and an upwind-difference discretization with δ = 10, 50, 100, 500, 1000, 10000.
TABLE 2 Iteration count for solving Example 2.
                       H = 1/8
h =              1/32     1/64     1/128
MS-Richardson     ∞        ∞        ∞
MS-GMRES         16       15       15
AS-GMRES         29       26       25
ILU(0)-GMRES     44       78      312
ILU(1)-GMRES     28       44       99
ILU(2)-GMRES     22       36       76
For Example 2, we test a few different fine mesh sizes, as given in Table 2, and the overlap is always set to 25% of the size of the unextended subdomain in both the x and y directions.
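For readers who want to reproduce the flavor of the two discretizations of the first-order terms, a one-dimensional sketch is given below (Python/SciPy; the grid size n and coefficient delta are placeholders, and the actual experiments use a two-dimensional 128 × 128 grid):

```python
import scipy.sparse as sp

def convection_matrices(n, delta):
    # 1D model of the first-order term  delta * u_x  on a uniform interior grid.
    h = 1.0 / (n + 1)
    # Central differences: (u_{i+1} - u_{i-1}) / (2h)  -- second order, may oscillate.
    central = delta / (2 * h) * sp.diags([-1.0, 1.0], [-1, 1], shape=(n, n))
    # Upwind differences (for delta > 0): (u_i - u_{i-1}) / h  -- first order, adds dissipation.
    upwind = delta / h * sp.diags([-1.0, 1.0], [-1, 0], shape=(n, n))
    return central.tocsr(), upwind.tocsr()
```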
6 Conclusions

In this chapter, we discussed a family of parallel overlapping Schwarz type domain decomposition algorithms in the framework of preconditioned Krylov space iterative methods. The preconditioners, which are optimal in the sense of possessing mesh and subdomain parameter independent convergence rates, are constructed by using multiple discretizations of the partial differential equation, in local subdomains and also on a coarser grid. When the subdomains are properly colored, the algorithms are fully parallel, since subproblems defined on different subdomains can be mapped onto different processors and solved independently at each Krylov iteration.
References
[1] R. E. Bank, J. F. Bürgler, W. Fichtner, and R. K. Smith, Some upwinding techniques for the finite element approximation of convection-diffusion equations, Numer. Math., 58 (1990), pp. 185-202.
[2] P. E. Bjørstad and M. D. Skogen, Domain decomposition algorithms of Schwarz type, designed for massively parallel computers, in Fifth International Symposium on Domain Decomposition Methods for Partial Differential Equations, T. F. Chan, D. E. Keyes, G. A. Meurant, J. S. Scroggs, and R. G. Voigt, eds., Philadelphia, PA, 1992, SIAM.
[3] J. H. Bramble, Z. Leyk, and J. E. Pasciak, Iterative schemes for non-symmetric and indefinite elliptic boundary value problems, Math. Comp., 60 (1993), pp. 1-22.
[4] J. H. Bramble, J. E. Pasciak, J. Wang, and J. Xu, Convergence estimates for product iterative methods with applications to domain decomposition, Math. Comp., 57 (1991), pp. 1-21.
[5] X.-C. Cai, Some Domain Decomposition Algorithms for Nonselfadjoint Elliptic and Parabolic Partial Differential Equations, PhD thesis, Courant Institute of Mathematical Sciences, September 1989. Tech. Rep. 461, Department of Computer Science, Courant Institute.
[6] ———, An optimal two-level overlapping domain decomposition method for elliptic problems in two and three dimensions, SIAM J. Sci. Comput., 14 (1993), pp. 239-247.
[7] ———, The use of pointwise interpolation in domain decomposition methods with non-nested meshes, SIAM J. Sci. Comput., 16 (1995). To appear.
[8] X.-C. Cai, W. D. Gropp, and D. E. Keyes, A comparison of some domain decomposition and ILU preconditioned iterative methods for nonsymmetric elliptic problems, Numer. Lin. Alg. Applics. (1994). To appear.
[9] X.-C. Cai, W. D. Gropp, D. E. Keyes, and M. D. Tidriri, Newton-Krylov-Schwarz methods in CFD, in Proceedings of the International Workshop on the Navier-Stokes Equations, Notes in Numerical Fluid Mechanics, R. Rannacher, ed., Vieweg Verlag, Braunschweig, 1994. To appear.
[10] X.-C. Cai and Y. Saad, Overlapping domain decomposition algorithms for general sparse matrices, Tech. Report Preprint 93-027, Army High Performance Computing Research Center, University of Minnesota, 1993.
[11] X.-C. Cai and O. Widlund, Domain decomposition algorithms for indefinite elliptic problems, SIAM J. Sci. Statist. Comput., 13 (1992), pp. 243-258.
[12] ———, Multiplicative Schwarz algorithms for nonsymmetric and indefinite elliptic problems, SIAM J. Numer. Anal., 30 (1993), pp. 936-952.
[13] T. F. Chan and T. P. Mathew, Domain decomposition preconditioners for convection diffusion problems, in Sixth Conference on Domain Decomposition Methods for Partial Differential Equations, A. Quarteroni, J. Periaux, Y. A. Kuznetsov, and O. Widlund, eds., Providence, RI, 1993, AMS.
[14] T. F. Chan and T. P. Mathew, Domain decomposition algorithms, Acta Numerica, 1994, pp. 61-143.
[15] M. Dryja, An additive Schwarz algorithm for two- and three-dimensional finite element elliptic problems, in Domain Decomposition Methods, T. Chan, R. Glowinski, J. Periaux, and O. Widlund, eds., Philadelphia, PA, 1989, SIAM.
[16] M. Dryja, B. F. Smith, and O. B. Widlund, Schwarz analysis of iterative substructuring algorithms for problems in three dimensions, SIAM J. Numer. Anal., 31 (1994). To appear.
[17] M. Dryja and O. B. Widlund, An additive variant of the Schwarz alternating method for the case of many subregions, Tech. Report 339, also Ultracomputer Note 131, Department of Computer Science, Courant Institute, 1987.
[18] ———, Towards a unified theory of domain decomposition algorithms for elliptic problems, in Third International Symposium on Domain Decomposition Methods for Partial Differential Equations, held in Houston, Texas, March 20-22, 1989, T. Chan, R. Glowinski, J. Periaux, and O. Widlund, eds., SIAM, Philadelphia, PA, 1990.
[19] ———, Multilevel additive methods for elliptic finite element problems, in Parallel Algorithms for Partial Differential Equations, Proceedings of the Sixth GAMM-Seminar, Kiel, January 19-21, 1990, W. Hackbusch, ed., Braunschweig, Germany, 1991, Vieweg & Son.
[20] S. C. Eisenstat, H. C. Elman, and M. H. Schultz, Variational iterative methods for nonsymmetric systems of linear equations, SIAM J. Numer. Anal., 20 (1983), pp. 345-357.
[21] C. Farhat and M. Lesoinne, Automatic partitioning of unstructured meshes for the parallel solution of problems in computational mechanics, Int. J. Numer. Meth. Engrg., 36 (1993), pp. 745-764.
[22] W. D. Gropp and B. F. Smith, Experiences with domain decomposition in three dimensions: Overlapping Schwarz methods, in Sixth Conference on Domain Decomposition Methods for Partial Differential Equations, A. Quarteroni, J. Periaux, Y. A. Kuznetsov, and O. Widlund, eds., Providence, RI, 1993, AMS.
[23] C. Johnson, U. Nävert, and J. Pitkäranta, Finite element methods for linear hyperbolic problems, Comp. Meth. Appl. Mech. Eng., 45 (1984), pp. 285-312.
[24] D. E. Keyes and W. D. Gropp, Domain-decomposable preconditioners for second-order upwind discretizations of multicomponent systems, in Fourth International Symposium on Domain Decomposition Methods for Partial Differential Equations, R. Glowinski, Y. A. Kuznetsov, G. Meurant, J. Periaux, and O. Widlund, eds., Philadelphia, PA, 1991, SIAM.
[25] P. L. Lions, On the Schwarz alternating method. I., in First International Symposium on Domain Decomposition Methods for Partial Differential Equations, R. Glowinski, G. H. Golub, G. A. Meurant, and J. Periaux, eds., Philadelphia, PA, 1988, SIAM.
[26] J. W. H. Liu, A graph partitioning algorithm by node separators, ACM Transactions on Mathematical Software, 15 (1989), pp. 198-219.
[27] J. Mandel, Hybrid domain decomposition with unstructured subdomains, in Sixth Conference on Domain Decomposition Methods for Partial Differential Equations, A. Quarteroni, J. Periaux, Y. A. Kuznetsov, and O. Widlund, eds., Providence, RI, 1993, AMS.
[28] T. A. Manteuffel and S. V. Parter, Preconditioning and boundary conditions, SIAM J. Numer. Anal., 27 (1990), pp. 656-694.
[29] S. F. McCormick, Multigrid Methods, SIAM, 1987.
[30] J. A. Meijerink and H. A. van der Vorst, Guidelines for the usage of incomplete decompositions in solving sets of linear equations as they occur in practical problems, J. Comp. Phys., 44 (1981), pp. 134-155.
[31] S. V. Nepomnyashchikh, Domain Decomposition and Schwarz Methods in a Subspace for the Approximate Solution of Elliptic Boundary Value Problems, PhD thesis, Computing Center of the Siberian Branch of the USSR Academy of Sciences, Novosibirsk, USSR, 1986.
[32] S. V. Parter and S.-P. Wong, Preconditioning second order elliptic operators: Condition numbers and the distribution of the singular values, J. Sci. Comput., 6 (1991), pp. 129-157.
[33] A. Pothen, H. D. Simon, and K.-P. Liou, Partitioning sparse matrices with eigenvectors of graphs, SIAM J. Matrix Anal. Appl., 11 (1990), pp. 430-452.
[34] Y. Saad, ILUT: a dual threshold incomplete ILU factorization, Tech. Report 9238, Minnesota Supercomputer Institute, University of Minnesota, Minneapolis, 1992.
[35] Y. Saad, A flexible inner-outer preconditioned GMRES algorithm, SIAM J. Sci. Stat. Comput., 14 (1993), pp. 461-469.
[36] Y. Saad and M. H. Schultz, GMRES: A generalized minimum residual algorithm for solving nonsymmetric linear systems, SIAM J. Sci. Stat. Comput., 7 (1986), pp. 856-869.
[37] H. A. Schwarz, Gesammelte Mathematische Abhandlungen, vol. 2, Springer, Berlin, 1890, pp. 133-143. First published in Vierteljahrsschrift der Naturforschenden Gesellschaft in Zürich, volume 15, 1870, pp. 272-286.
[38] B. F. Smith, P. E. Bjørstad, and W. D. Gropp, Domain Decomposition and Multilevel Methods for Elliptic PDEs: Algorithms, Implementations and Theory, Cambridge University Press, 1995. To appear.
[39] J. Xu, A new class of iterative methods for nonselfadjoint or indefinite problems, SIAM J. Numer. Anal., 29 (1992), pp. 303-319.
[40] X. Zhang, Studies in domain decomposition: multilevel methods and the biharmonic Dirichlet problem, PhD thesis, Courant Institute, New York University, New York City, 1991.
[41] ———, Multilevel Schwarz methods, Numer. Math., 63 (1992), pp. 521-539.
Chapter 2 Domain Decomposition Methods for Wave Propagation Problems Alfio Quarteroni
Abstract Wave propagation problems arise in many areas of application, notably gas dynamics, geophysics, structural dynamics, and electromagnetism. The goal of this paper is to present some domain decomposition methods that are applicable to this type of problem, with the aim of addressing material inhomogeneities, reducing the numerical complexity, and exploiting parallelism. We will consider three simple hyperbolic initial-boundary-value problems: a scalar convection equation, a second-order scalar equation for acoustic wave propagation, and a second-order system for elastic wave propagation. For each of these problems we derive in Section 1 the multidomain formulation; then we introduce domain decomposition algorithms based on iteration-by-subdomains (Section 2) and other algorithms based on the reformulation in terms of the so-called Poincaré-Steklov interface equation (Section 3). In Section 4 we comment on other domain decomposition algorithms that can be devised when the temporal discretization is accomplished by explicit finite difference schemes.

1 Problem Specification

Wave propagation problems are encountered in many different areas of application. Notable examples are provided by gas dynamics, geophysics, electromagnetism, and the dynamics of elastic structures. A mathematical feature common to such problems (which are hyperbolic in nature) is that information propagates along characteristic curves. Partitioning the spatial domain into subdomains entails two principal benefits. On one hand, numerical complexity is reduced, as the original problem is split into subproblems of smaller size (which can also be addressed in a parallel fashion). On the other hand, the zonal approach induced by the domain partition is especially well suited for dealing with heterogeneous
media, as the latter yield sudden changes of the wave speeds when crossing the interfaces between different materials. We illustrate the basic mathematical concepts behind domain decomposition for three model linear examples on a bounded two-dimensional spatial domain Ω ⊂ R². The first is a simple convection equation, the second refers to acoustic waves, the last to linear elastic waves. For each we will describe the associated initial-boundary-value problem, as well as its multidomain formulation. In the next section we will discuss how domain decomposition algorithms can be derived after an implicit time-differencing has been applied for the time marching. In Section 4 we will do the same for explicit time-differencing, after presenting in Section 3 an alternative viewpoint that leads to a new family of algorithms. Finally, in Section 5 we will present and discuss several numerical results. Throughout, we will denote by Ω a bounded two-dimensional domain and by Ω_1 and Ω_2 two disjoint subdomains forming a partition of it. Moreover, we will denote by ν the unit normal vector on ∂Ω directed outward, and by n the one on Γ = ∂Ω_1 ∩ ∂Ω_2 directed from Ω_1 to Ω_2. See Fig. 1 for an example.

FIG. 1. The domain Ω and its subdomain partition.

Work partially supported by "Fondi M.U.R.S.T. 40%" and Sardinian Regional Authorities. Department of Mathematics, Technical University of Milan, and CRS4, Cagliari, Italy.
1.1 Convective Waves

The initial-boundary-value problem we consider is a simple scalar conservation law
where T > 0 is an upper time-level, f, u_0,
is the inflow boundary, and bold-face denotes a vector-valued function. Defining the characteristic curves ξ(t) in the (x, t)-space as the solution
to
it follows that du/dt = f − (div b)u on (t, ξ(t)). Therefore the solution u along the characteristic curves can be obtained by solving two sets of ordinary differential equations. In the case in which f = 0 and b is divergence-free, the solution u is constant along the characteristics and is simply the wave travelling with local speed b. The unknown u can represent different physical variables, such as, e.g., the concentration of a certain quantity. A multidomain formulation of problem (1.1) with respect to the partition of Fig. 1 reads as follows. For any function v defined on Ω, let v_k denote its restriction to Ω_k, k = 1, 2, i.e.
In particular, the restriction of u to Ω_k, u_k, satisfies:
for k = 1, 2, as well as the following matching condition at the interface Γ:
The latter implies that u_1 = u_2 at all points of Γ except those where the flow field is tangential to the interface. For a precise statement of the multidomain problem (1.2)-(1.3) based on a weak formulation we refer to Gastaldi and Gastaldi [15] and Quarteroni and Valli [31]. As will be illustrated below, the multidomain form above is the basis of the domain decomposition algorithms.
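Written out, the interface matching condition (1.3) takes the following form (a sketch consistent with the flux condition restated in Section 3; b_k denotes the restriction of the flow field b to Ω_k, as introduced above):

```latex
% Matching condition on the interface \Gamma (cf. (3.4) below):
\[
  (\mathbf{b}_1\cdot\mathbf{n})\,u_1 \;=\; (\mathbf{b}_2\cdot\mathbf{n})\,u_2
  \qquad \text{on } \Gamma ,
\]
% so that u_1 = u_2 wherever \mathbf{b}\cdot\mathbf{n}\neq 0, while no constraint
% is imposed where the flow field is tangential to \Gamma.
```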
1.2 Acoustic Waves

In a bounded inhomogeneous medium Ω the acoustic dilatation is represented by the solution u(x, t) (the pressure field) of the governing equation [35]
supplemented by the initial conditions
and by suitable boundary conditions on the whole boundary of Ω, which, for simplicity of exposition, we will assume to be of Dirichlet type, i.e.
In (1.4), ρ(x) is the density, c(x) the wave velocity, and f(x, t) the source forcing term. Finally, u_0(x), v_0(x) and
for k = 1, 2, supplemented by the following transmission conditions at the subdomain interface:
This time, besides the continuity of the solution (see (1.8)), we have required continuity of its normal flux as well (see (1.9)).
1.3 Elastic (Shear-Horizontal and Pressure-Shear Vertical) Waves

We consider a plane elastic structure with constant thickness in the regime of small strain and small displacement. Denoting by u = (u_1, u_2) the in-plane displacement field, ε(u) = (ε_ij(u)), i, j = 1, 2, the strain field and
known distributions of external actions, the equilibrium equations read (see Raviart and Thomas [33]):
where
The constitutive law provides the stress components as
where
and λ and μ are the Lamé constants. The differential equations (1.10) need to be supplemented by initial conditions, say
and by boundary conditions, which for simplicity we will assume to be of Dirichlet type
One of the most common solution strategies leads to the so-called displacement formulation of the stress analysis problem (1.10), passing through the constitutive equations (1.11) and (1.12). The resulting system reads
with
Denoting by u_k the restriction of u to Ω_k, k = 1, 2, problem (1.11), (1.12), (1.13) can be written in the following equivalent multidomain form:
for k = 1,2, with the interface transmission conditions
which enforce the continuity of displacements as well as that of normal stresses at the subdomain interface.
2 Domain Decomposition Algorithms

In this section the equations introduced earlier are advanced in time by implicit finite difference schemes. Then at each time-level we solve the resulting boundary-value problem by an iterative procedure among subdomains. With this aim, let us introduce a time-step Δt > 0 and the time levels t^m = mΔt, with m = 0, ..., M and t^M = T. If we advance (1.2) from t^{m-1} to t^m (m ≥ 1) by an implicit finite-difference scheme, and denote the function u_k(t^m) by u_k (for simplicity of notation), then (1.2) becomes
where α > 0 is proportional to the inverse of Δt, and F depends on f and u_k at previous time-levels. The coupled problem (2.1) can be solved iteratively by alternating a boundary-value problem in Ω_1 with one in Ω_2. Perhaps the simplest way is to construct two sequences of functions {u_1^n} and {u_2^n}, n ≥ 1, that satisfy:
for k = 1, 2, where [k] = 1 if k = 2, [k] = 2 if k = 1, and Γ_{k,in} is the portion of Γ on which b_k points into Ω_k, i.e., Γ_{k,in} = {x ∈ Γ : b_k · n_k < 0}, having denoted by n_k the normal unit vector on ∂Ω_k directed outward.
For each n we therefore have two independent inflow-outflow boundary-value problems to be solved, one in Ω_1 and the other in Ω_2. The issue of space discretization will not be addressed here. However, it is understood that in order for the domain decomposition algorithms to be effective, each individual subproblem needs to be faced by a suitable space discretization method. In particular, if a Galerkin method is used (with either finite element or polynomial shape functions) (see [32], Chap. 5), the derivation of the domain decomposition algorithms needs to be carried out in the framework of a weak (or variational) formulation of the given problem (2.1). If Ω were decomposed into K > 2 subdomains, the conclusion would be the same, namely K (rather than two) independent inflow-outflow subproblems have to be solved at each iteration step. The convergence of the sequence {u_k^n} to u_k, k = 1, 2, can be proven by a fixed-point argument, after reformulating for each k the inflow-outflow problem (2.2) in a weak form. For the proof we refer to Gastaldi and Gastaldi [15]. The same authors have proven in [16] that when problems (2.2) are discretized in space by a Streamline Upwind Petrov-Galerkin (SUPG) finite element method, the iterative procedure among subdomains converges, and the convergence rate is independent of the finite element grid size h. A similar approach was pursued in Quarteroni [27] for a linear hyperbolic system of advection equations. In that work the space discretization is based on a spectral collocation method. Again, the iterative procedure is proven to be convergent with a rate independent of the number of collocation nodes used in each subdomain. The same paper also discusses how to devise subdomain iterations for nonlinear hyperbolic systems of conservation laws. Subdomain iterations for nonlinear hyperbolic systems have also been investigated by Kopriva [21], Hanley [20] and Lie [19]. Let us turn now to the acoustic wave problem (1.7)-(1.9). The second order temporal derivative can be discretized by implicit finite differences. Suitable schemes are, e.g., the two-step, second-order backward differences (Gear [17]) or the family of one-step Newmark schemes (e.g., Raviart and Thomas [33]), which include both first- and second-order methods. In all cases, after advancing from t = t^{m-1} to t = t^m, if we keep denoting by u_k the updated function u_k(t^m), we are left with the new problem:
where α is a multiple of 1/Δt². At each time-level we therefore have a second order elliptic boundary-value problem to be solved in Ω_1 ∪ Ω_2. Several domain decomposition algorithms involving iterations among subdomains are available for elliptic boundary-value problems (see, e.g., Bramble, Pasciak and Schatz [3], [4], [5], [6], Dryja and Widlund [12], Quarteroni [28] and the references therein). Here we report the so-called Neumann-Neumann method (Lebedev and Agoshkov [22], and Bourgat, Glowinski, Le Tallec and Vidrascu [2]). Setting an initial interface value λ^0, we construct a sequence {u_k^n}, n ≥ 1, by solving (for k = 1, 2)
and then
where
Here θ > 0 is a parameter that is chosen in order to accelerate the convergence of λ^n to the common value of u_1 and u_2 on Γ, while β_1 and β_2 are positive constants such that β_1 + β_2 = 1. The parameter θ can also be determined automatically by an optimal conjugate-gradient strategy, owing to the fact that L is self-adjoint. An alternative viewpoint consists in the restatement of the original initial-boundary-value problem (1.4)-(1.6) as a first-order hyperbolic system. Introducing the new set of unknowns w = (∂u/∂t, ∂u/∂x_1, ∂u/∂x_2)^t, setting F = (f, 0, 0)^t, and assuming for the sake of simplicity ρ = c = 1, we obtain
where A and B are 3 × 3 matrices with the entries a_12 = a_21 = −1, a_ij = 0 otherwise, and b_13 = b_31 = −1, b_ij = 0 otherwise. Initial and boundary conditions are derived accordingly.
The multidomain version of (2.4) can be easily obtained by generalizing (1.2)-(1.3) for the scalar advection equation. With this aim, for any point of Γ let n = (n_1, n_2)^t denote the normal unit vector there, directed from Ω_1 into Ω_2, and define the characteristic matrix C = C(n) = n_1 A + n_2 B. Since (2.4) is a hyperbolic system, C can be diagonalized as Λ = T^{-1} C T, with Λ = diag{λ_i, i = 1, 2, 3} and λ_i ∈ R. In turn, T is the matrix of left eigenvectors of C. With the usual notational convention, the restrictions w_k of w to Ω_k, k = 1, 2, satisfy
When we iterate between the two subdomains, the matching conditions (2.9) at the interface ought to be split into incoming and outgoing characteristics. For this, we introduce the characteristic variables z_k = T^{-1} w_k, k = 1, 2, and distinguish between non-negative and negative eigenvalues. Assume that, e.g., λ_i ≥ 0 for i ≤ p and λ_j < 0 for j > p, for a suitable p ≤ 3. Then (2.9) can be written equivalently as
where z_{k,i} denotes the i-th component of z_k, for k = 1, 2 and i = 1, 2, 3. If (2.8) is advanced in time from t^{m-1} to t^m by an implicit finite difference scheme (e.g., by the backward Euler method), the resulting boundary-value problem at the time-level t^m can be solved by the following subdomain iteration method (n ≥ 1 is the iteration counter, while, as usual, the superindex indicating the time-level is dropped):
where α = 1/Δt and G_k = F_k + α w_k^{m-1}, k = 1, 2. Note that for both problems (2.10) and (2.11) we are providing the values of the incoming characteristics on Γ. These conditions, together with
the boundary conditions prescribed on ∂Ω, make both (2.10) and (2.11) well posed. The convergence of the sequence {w_k^n} to w_k as n → ∞, for k = 1, 2, can be proven by analyzing the behaviour of the corresponding characteristic variables z_k^n = T^{-1} w_k^n [27]. Turning now to the elastic waves problem addressed in Section 1.3, it is clear that all the methods discussed for the problem of acoustic waves apply to it as well. The change of notation is obvious and the conclusions are quite similar (see, e.g., [14]). When solving scattering problems in acoustics, the Helmholtz equation becomes an important numerical ingredient. Solution techniques based on the domain decomposition approach have been proposed by Bristeau, Glowinski, and Periaux [7], and by Ernst and Golub [13].
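Computationally, the splitting into incoming and outgoing characteristics used in (2.10)-(2.11) amounts to a small eigenvalue problem at each interface point. The following sketch (Python/NumPy; the function names are illustrative, and A, B, n are the matrices and interface normal defined above) shows the basic operations:

```python
import numpy as np

def characteristic_split(A, B, n):
    # Characteristic matrix C(n) = n1*A + n2*B; for a hyperbolic system its
    # eigenvalues are real, so only the real parts returned by eig are kept.
    C = n[0] * A + n[1] * B
    lam, T = np.linalg.eig(C)     # columns of T diagonalize C: Lambda = T^{-1} C T
    lam = lam.real
    # With n directed from Omega_1 into Omega_2, positive speeds propagate
    # from Omega_1 to Omega_2 (outgoing for Omega_1, incoming for Omega_2).
    outgoing_1 = lam >= 0
    incoming_1 = lam < 0
    return lam, T, incoming_1, outgoing_1

def characteristic_variables(T, w):
    # z = T^{-1} w: the characteristic variables of the state w at an interface point.
    return np.linalg.solve(T, w)
```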
3 The Poincaré-Steklov Problem at the Interface

When facing a boundary-value problem in a multidomain fashion, the mechanism of exchange of information across the subdomain interfaces is in fact driven by an additional equation, which takes the name of Poincaré-Steklov problem. The latter is set solely at the subdomain interfaces; its solution provides the restriction of the unknown global solution to the interfaces. Once such a restriction is available, it can be used as boundary data to reconstruct the desired solution within every subdomain. Let us show how this can be worked out for the multidomain advection problem (2.1). We will assume that there exists β_0 > 0 such that α + div b_k/2 ≥ β_0 for k = 1, 2, so that (2.1) admits a unique solution. For each k = 1, 2 let us introduce the function j_k, the solution to the advection problem
Similarly, for each function λ defined on Γ we define its hyperbolic extension J_k λ as the solution to the advection problem (for k = 1, 2):
In view of (2.1) it is easy to see that the equalities
hold iff (b_1 · n) u_1 = (b_2 · n) u_2 on Γ, which in turn is true iff λ satisfies

(3.4)   Sλ := (b_2 · n) J_2 λ − (b_1 · n) J_1 λ = (b_1 · n) j_1 − (b_2 · n) j_2 =: χ.

The operator S can be proven to be bounded, non-negative and bijective from L²(Γ) onto its dual. Therefore (3.4) can be solved by suitable iterative methods, such as, e.g., gradient-like methods for nonsymmetric operators or methods of projection onto Krylov subspaces (Saad and Schultz [34]). As a matter of fact, it is not difficult to show that applying the Richardson iterative procedure (e.g., Quarteroni and Valli [32], Chap. 2) to (3.4) with an acceleration constant equal to one and with a suitable preconditioner would reproduce the subdomain iterative process (2.2). For the details the reader can refer to Gastaldi and Gastaldi [16] and to Quarteroni [27]. A similar characterization can be provided for the second order elliptic problem (2.3) arising from the time-discretization of the acoustic wave equation (1.4). This time the functions j_k and J_k λ are defined as follows for k = 1, 2:
Defining now
we conclude that u_k is the solution of (2.3) iff λ satisfies
This is the new Poincaré-Steklov equation. The interface Poincaré-Steklov operator S now operates between X = H^{1/2}_{00}(Γ) and its dual, where H^{1/2}_{00}(Γ) denotes the space of traces on Γ of functions belonging to H^1_0(Ω), and the latter is the space of measurable functions that are square-integrable in Ω together with their first derivatives, and vanish on ∂Ω (see, e.g., Lions and Magenes [23], Quarteroni and Valli [32]). Moreover, S is symmetric, continuous and coercive, i.e., there exist two positive constants γ, δ such that
for all λ, μ ∈ X, where the symbol ⟨·, ·⟩ denotes the duality pairing between X and its dual X′, while ‖·‖ denotes the norm of the space X. Owing to these properties, the Poincaré-Steklov problem (3.7) can be addressed by effective iterative approaches (such as conjugate-gradient iterations). The crucial issue is how to find suitable preconditioners, as the finite dimensional counterpart of S is ill-conditioned. The methods of iteration-by-subdomains (2.4)-(2.6) presented in the previous section can be regarded as a special instance of this approach (see Quarteroni [29] for this interpretation). A different approach consists in approximating (3.7) directly by a Galerkin projection method on a subspace X_N of X. The challenging point here is how to find a well-conditioned basis for X_N, so that the finite dimensional Galerkin problem may be solved by an optimal iterative method whose rate of convergence is independent of the dimension of X_N. Methods based on this approach have been given the name of Projection Decomposition Methods (PDM) and have been extensively investigated especially by Agoshkov and Ovtchinnikov (see [1], [26], [18]).
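In practice S is never assembled: each application of S costs one Dirichlet solve per subdomain, and the interface equation is attacked with an iterative method that needs only this action. A hedged sketch of this structure follows (Python/SciPy; `extend_k` and `flux_k` are placeholder subdomain solvers, not routines defined in this chapter):

```python
import scipy.sparse.linalg as spla

def make_apply_S(extend_1, extend_2, flux_1, flux_2):
    # extend_k(lam): subdomain solve in Omega_k with Dirichlet trace lam on Gamma
    #                and homogeneous data elsewhere (the extension operator).
    # flux_k(u_k):   normal flux of u_k on Gamma, with a common sign convention,
    #                so that apply_S returns the flux jump produced by lam.
    def apply_S(lam):
        return flux_1(extend_1(lam)) + flux_2(extend_2(lam))
    return apply_S

def steklov_poincare_solve(apply_S, chi, n_gamma):
    # Conjugate gradients on the interface equation S(lambda) = chi; each
    # matrix-vector product costs one Dirichlet solve in every subdomain.
    S = spla.LinearOperator((n_gamma, n_gamma), matvec=apply_S)
    # Tolerance and preconditioner arguments are omitted; a good interface
    # preconditioner is the crucial ingredient discussed in the text.
    lam, info = spla.cg(S, chi)
    return lam, info
```

Replacing the plain conjugate-gradient call by a preconditioned one is exactly where the interface preconditioners mentioned above enter.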
4 Time-discretization by Explicit Finite Differences

The multidomain problems considered in Section 1, namely the convection, acoustic and elastic problems, introduced respectively by (1.2)-(1.3), (1.7)-(1.9), and (1.15)-(1.19), can also be advanced in time by a fully explicit finite difference scheme, or by a semi-explicit one, as an alternative to the fully implicit schemes considered in Section 2. We will illustrate these strategies on a specific example, the elastic wave problem (1.15)-(1.19), which was the most neglected in the preceding Sections 2 and 3. Clearly, the same kind of considerations is applicable to the other cases as well. In order to state our problem correctly, we start by reformulating (1.15)-(1.19) in a weak form. For that, let us introduce the
following definitions:
for k = 1, 2. Notice that a_k(·, ·) is the bilinear form associated with the operator L in Ω_k. If we assume for the sake of simplicity that
where for each ψ ∈ X and each k = 1, 2, ψ_k denotes any possible continuous prolongation of ψ to Ω_k. More precisely,
for a positive constant C independent of ψ. Finally, (4.3)-(4.5) have to be closed by the initial conditions (1.16) at t = 0. For each k = 1, 2 the equation (4.3) is equivalent to (1.15), while (4.5) is the counterpart of the stress continuity equations (1.19). For the derivation of these equivalence relationships for elliptic equations see, e.g., Marini and Quarteroni [25]. Assume now that problem (4.3)-(4.5) is discretized in space by a Galerkin finite element method. Formally speaking, this can easily be accomplished if V_k, V_k^0 and X are replaced by suitable finite element subspaces, say V_{k,h}, V_{k,h}^0 and X_h. The easiest way to generate these spaces is to start from a master space V_h, which is a classical finite element subspace of V, and then to proceed as done in (4.1) for defining the subspaces V_k, V_k^0 and X. Then u_k(t), v_k and ψ are respectively replaced by u_{k,h}(t), v_{k,h} and ψ_h. With
the aim of simplifying the notation, we will drop the subindex h everywhere, and therefore we will refer to (4.3)-(4.5) as the finite element problem. Concerning the discretization of the temporal derivative, we will consider as a reference example of an explicit scheme the second order Leap-Frog method, which consists of approximating the second order equation y″(t) = Φ(t, y(t)) at t = t^m by the difference equation
A fully explicit treatment of the finite element problem (4.3)-(4.5) yields at each time level t^m the problem:
An alternative approach consists in coupling (4.6), (4.7) with the new interface equation
where
now denotes the second-order backward implicit discretization of y″ at t = t^{m+1}. Since the values u_k^{m+1} are available at all internal finite element nodes after applying (4.6), the new "implicit" equations (4.8)′ yield, for each finite-element node p on Γ, a simple algebraic equation that provides the common value u_1^{m+1}(p) = u_2^{m+1}(p). Therefore (4.6), (4.7) and (4.8)′ provide a "semi-explicit" method which has about the same computational complexity as the fully explicit method (4.6), (4.7), (4.8). Another "semi-explicit" scheme (but with a higher complexity) is the one that first advances the interface equations by the explicit Leap-Frog method, providing the updated values u_1^{m+1} = u_2^{m+1} at each interface node. Then for each k = 1, 2 the internal equations are advanced
TABLE 1 Speed-up values for the fully-explicit domain decomposition solver. Different values for m (number of subdomains) and N (number of grid-points used in each subdomain) have been taken into account. Values in brackets refer to the ratio between communication time and global CPU-time.
              m = 2                m = 4                m = 6                m = 8                m = 9                m = 10
N = 100   1.869 (3.4 x 10^-2)  3.410 (7.4 x 10^-2)  4.765 (8.5 x 10^-2)  6.246 (9.4 x 10^-2)  6.817 (1.2 x 10^-1)  7.612 (1.2 x 10^-1)
N = 400   1.900 (2.1 x 10^-2)  3.608 (4.0 x 10^-2)  5.706 (2.8 x 10^-2)  7.520 (3.3 x 10^-2)  8.508 (3.7 x 10^-2)  9.007 (4.5 x 10^-2)
N = 900   1.997 (6.1 x 10^-3)  3.882 (6.9 x 10^-3)  5.938 (9.9 x 10^-3)  7.851 (1.2 x 10^-2)  8.698 (1.3 x 10^-2)  9.677 (1.4 x 10^-2)
N = 1600  1.997 (1.4 x 10^-3)  3.993 (3.1 x 10^-4)  5.964 (4.5 x 10^-3)  7.880 (5.3 x 10^-3)  8.749 (6.2 x 10^-3)  9.684 (2.6 x 10^-2)
by the second-order backward implicit scheme. Similarly to what is proposed by Dawson and Dupont [11], the new scheme reads as follows. Solve first (4.8). For each finite element node on Γ these equations, together with (4.7), provide the common value of the updated function u^{m+1} (= u_1^{m+1} = u_2^{m+1}). Now solve for each k = 1, 2:
Note that (4.6)' yields an elliptic boundary-value problem with Dirichlet boundary data on each subdomain.
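Structurally, the fully explicit scheme (4.6)-(4.8) is a leap-frog update of every subdomain followed by one exchange of the duplicated interface values. The skeleton below (Python/NumPy, with illustrative callback names such as `apply_local_operator` and `exchange_interface`) is meant only to show this structure, not the actual solver of Section 5:

```python
def leapfrog_step(u_prev, u_curr, dt, apply_local_operator, exchange_interface):
    """One explicit leap-frog step on a list of per-subdomain solution vectors.

    u_prev, u_curr : lists of per-subdomain arrays at t_{m-1} and t_m
    apply_local_operator(k, u) : action of the (mass-lumped) spatial operator on subdomain k
    exchange_interface(u_new)  : overwrite duplicated interface values with the agreed
                                 common value, as in (4.8) or (4.8)'
    """
    u_new = []
    for k, (um1, um) in enumerate(zip(u_prev, u_curr)):
        # u^{m+1} = 2 u^m - u^{m-1} + dt^2 * Phi(t_m, u^m)   (leap-frog update)
        u_new.append(2.0 * um - um1 + dt**2 * apply_local_operator(k, um))
    exchange_interface(u_new)   # independent subdomain updates, then one exchange
    return u_new
```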
5 Numerical Experiments and Conclusions

We have carried out some numerical tests on an IBM SP-1 machine with 32 nodes, using both the fully-explicit approach (4.6)-(4.8) and the semi-explicit one (4.6)′, (4.7), (4.8), with polynomial subspaces in the spatial directions. The speed-up factors that have been obtained are reported in Tables 1 and 2. Clearly, the larger N (the number of grid-points in each subdomain) is, the higher the speed-up. Other comparisons among fully-explicit, semi-explicit and implicit schemes for domain partitions can be found in [24]. The domain decomposition approach to problems of wave propagation is also well-suited to dealing with material inhomogeneity. Many illustrative examples for both acoustic and elastic waves are provided in [14], [24] and [35].
TABLE 2 Speed-up values for the semi-explicit domain decomposition solver. Different values for m (number of subdomains) and N (number of grid-points used in each subdomain) have been taken into account. Values in brackets refer to the ratio between communication time and global CPU-time. In a few cases computations have not been performed as the matrix size exceeded memory capability.
              m = 2                m = 4                m = 6                m = 8                m = 9                m = 10
N = 100   1.807 (6.0 x 10^-2)  3.538 (8.1 x 10^-2)  4.050 (3.5 x 10^-1)  6.224 (3.2 x 10^-1)  5.041 (5.0 x 10^-1)  8.458 (3.2 x 10^-1)
N = 400   1.981 (8.3 x 10^-3)  3.907 (2.2 x 10^-2)  5.120 (1.8 x 10^-1)  7.000 (1.9 x 10^-1)  6.238 (3.6 x 10^-1)  9.021 (1.8 x 10^-1)
N = 900   1.982 (1.3 x 10^-3)  3.995 (4.8 x 10^-3)  5.566 (1.0 x 10^-1)  7.710 (1.1 x 10^-1)  7.440 (2.0 x 10^-1)  9.516 (1.0 x 10^-1)
N = 1600  1.997 (1.2 x 10^-3)  3.998 (3.3 x 10^-4)        -                    -                    -                    -
References
[1] V. I. Agoshkov and E. Ovtchinnikov, Projection Decomposition Method, Technical Report, CRS4, Cagliari, January 1993.
[2] J. F. Bourgat, R. Glowinski, P. Le Tallec and M. Vidrascu, Variational formulation and algorithm for trace operator in domain decomposition calculations, 3-16 in [8].
[3] J. H. Bramble, J. E. Pasciak and A. H. Schatz, The construction of preconditioners for elliptic problems by substructuring, I, Math. Comp., 47, 1986, 103-134.
[4] J. H. Bramble, J. E. Pasciak and A. H. Schatz, The construction of preconditioners for elliptic problems by substructuring, II, Math. Comp., 49, 1987, 1-16.
[5] J. H. Bramble, J. E. Pasciak and A. H. Schatz, The construction of preconditioners for elliptic problems by substructuring, III, Math. Comp., 51, 1988, 415-430.
[6] J. H. Bramble, J. E. Pasciak and A. H. Schatz, The construction of preconditioners for elliptic problems by substructuring, IV, Math. Comp., 53, 1989, 1-24.
[7] M. O. Bristeau, R. Glowinski and J. Periaux, On the numerical solution of the Helmholtz equation at large wave numbers using exact controllability methods. Application to scattering, 399-419 in [30].
[8] T. F. Chan, R. Glowinski, J. Periaux and O. B. Widlund (eds.), Domain decomposition methods for partial differential equations, vol. 2, SIAM, Philadelphia, 1989.
[9] T. F. Chan, R. Glowinski, J. Periaux and O. B. Widlund (eds.), Domain decomposition methods for partial differential equations, vol. 3, SIAM, Philadelphia, 1990.
[10] F. Collino, High order absorbing boundary conditions for wave propagation models: straight line boundary and corner case, in R. Kleinman et al., eds., Mathematical and Numerical Aspects of Wave Propagation, SIAM, Philadelphia, 1993, 161-171.
[11] C. N. Dawson and T. F. Dupont, Noniterative domain decomposition for second order hyperbolic problems, 45-52 in [30].
[12] M. Dryja and O. B. Widlund, Towards a unified theory of domain decomposition algorithms for elliptic problems, 3-21 in [9].
[13] O. Ernst and G. H. Golub, A domain decomposition approach to solving the Helmholtz equation with a radiation boundary condition, 177-192 in [30].
[14] E. Faccioli, A. Quarteroni and A. Tagliani, Spectral domain decomposition methods for the solution of elastic waves equations, submitted to Geophysics, 1993.
[15] F. Gastaldi and L. Gastaldi, Convergence of subdomain iterations for the transport equation, Boll. U.M.I., 1994, to appear.
[16] F. Gastaldi and L. Gastaldi, On a domain decomposition for the transport equation: theory and finite element approximation, IMA J. Num. An., 14, 1994, 111-136.
[17] C. W. Gear, Numerical Initial Value Problems in Ordinary Differential Equations, Prentice-Hall, Englewood Cliffs, 1971.
[18] P. Gervasio, E. Ovtchinnikov and A. Quarteroni, The Spectral Projection Decomposition Method for Elliptic Equations, submitted to SIAM J. Num. Anal., 1994.
[19] I. Lie, Interface conditions for the heterogeneous domain decomposition, 469-476 in [30].
[20] P. Hanley, A Strategy for the Efficient Simulation of Viscous Compressible Flows Using a Multi-domain Pseudospectral Method, J. Comput. Phys. 108, 1993, 153-158.
[21] D. Kopriva, ICASE Report 86-28, NASA Langley Research Center, Hampton, VA, 1986.
[22] V. I. Lebedev and V. I. Agoshkov, Generalized Schwarz algorithms with variable parameters, Dept. Num. Math., USSR Academy of Sciences, Moscow, Report n. 19, 1981 (in Russian).
[23] J. L. Lions and E. Magenes, Nonhomogeneous Boundary Value Problems and Applications, Vol. 1, Springer-Verlag, Berlin-Heidelberg-New York, 1972.
[24] F. Maggio, Domain decomposition based parallelism for acoustic wave simulation by spectral methods, in preparation, 1994.
[25] D. Marini and A. Quarteroni, A relaxation procedure for domain decomposition methods using finite elements, Num. Math. 55, 1989, 575-598.
[26] E. Ovtchinnikov, Projection Decomposition Method with Polynomial Basis Functions, Technical Report, CRS4, Cagliari, 1993. Submitted to CALCOLO.
[27] A. Quarteroni, Domain decomposition methods for systems of conservation laws: spectral collocation approximations, SIAM J. Sci. Stat. Comput., 11, 1990, pp. 1029-1052.
[28] A. Quarteroni, Domain decomposition and parallel processing for the numerical solution of partial differential equations, Surv. Math. Ind. 1, 1991, 75-118.
[29] A. Quarteroni, Mathematical aspects of domain decomposition methods, in Proc. of the 1st European Congress of Mathematics, Birkhäuser, Boston, 1994 (in press).
[30] A. Quarteroni, J. Periaux, Yu. A. Kuznetsov and O. B. Widlund, eds., Domain Decomposition Methods in Science and Engineering, Contemporary Mathematics 157, A.M.S., Providence, R.I., 1994.
[31] A. Quarteroni and A. Valli, Theory and application of Steklov-Poincaré operators for boundary-value problems: the heterogeneous operator case, in R. Glowinski et al., eds., Domain Decomposition Methods for Partial Differential Equations, IV, SIAM, Philadelphia, 1991, 58-81.
[32] A. Quarteroni and A. Valli, Numerical Approximation of Partial Differential Equations, Springer-Verlag, Heidelberg, 1994.
[33] P. A. Raviart and J. M. Thomas, Introduction à l'Analyse Numérique des Équations aux Dérivées Partielles, Masson, Paris, 1983.
[34] Y. Saad and M. H. Schultz, GMRES: A generalized minimal residual algorithm for solving non-symmetric linear systems, SIAM J. Sci. Statist. Comput. 7, 1986, 856-869.
[35] G. Seriani and E. Priolo, Spectral element method for acoustic wave simulation in heterogeneous media, Computer Methods in Applied Mechanics and Engineering, 1994, to appear.
Chapter 3 Domain Decomposition, Parallel Computing and Petroleum Engineering Petter E. Bjørstad
Terje Karstad
Abstract A prototype black oil simulator is described. The simulator has a domain-based data structure whereby the reservoir is represented by a possibly large number of smaller reservoirs, each having a complete local data structure. This design is essential for effective use of preconditioning techniques based on domain decomposition. The chapter describes a splitting technique for the solution of the nonlinear system and an effective implementation of the algorithm on massively parallel computer systems. Most communication is localized and long range communication is kept to a minimum. Results from an implementation of the method are reported for a 16384 processor MasPar MP-2.

1 Introduction

This chapter describes a numerical algorithm for the simulation of flow in porous media. The problem carries substantial economic interest in the petroleum industry. Such simulation is widely used for planning purposes, for reservoir management and for prediction of reservoir performance [3]. Another field of application is the study of ground water flow, in particular the study of ground contamination by way of pollution simulation models [1]. The main emphasis of our approach is to exploit data locality by using a data structure in which locality in the physical model implies locality in the computer representation. The domain decomposition approach that we follow effectively separates short and long range communication into two distinct phases in each preconditioning step. This makes it easy to analyze the performance of the method and to adjust the notion of locality with respect to the target computer at hand. Our goal is an algorithm that will be flexible, portable and efficient on a distributed memory system.

Institutt for Informatikk, Høyteknologisenteret, University of Bergen, N-5020 Bergen, Norway. This author's work was supported by NFR under grant no 27625.
Statoil, P.O. Box 300, N-4001 Stavanger, Norway. This author's work was supported by Statoil.
A simulator based on the algorithms in this chapter should adapt well to hierarchical memory, cache-based single-processor systems, as well as distributed-memory parallel computers with a wide range of performance characteristics. Even a simple mathematical model describing the flow in porous media is a nonlinear system consisting of an elliptic equation and a formally parabolic convection-diffusion equation. Our algorithms and techniques apply to a much larger class of applications. Application of similar ideas in the semiconductor context using a simple drift-diffusion model is reported in [5]. Here one encounters a system consisting of a Poisson equation coupled to two, formally parabolic, advection-diffusion equations [32].
2 The Governing Equations
A petroleum reservoir consists of hydrocarbons and other chemicals trapped in tiny pores in the rock. If the rock permits and if the fluid is sufficiently forced, it will flow from its current position in the rock to some other points in the reservoir. By injection of additional fluids and the release of pressure through the production of fluids at wells, the petroleum engineer can adjust the flow rate and modify the mixture of chemicals produced. In the associated reservoir flow problem, we are given the reservoir conditions (pressure and saturation) and the well flow conditions. The main problem is to model the fluid flow, especially to predict the fluid flow into production wells. The fluid flow problem can be expressed in terms of approximate mathematical equations. The physical laws that govern the flow can be derived from volume balance, phase equilibrium and conservation of mass, plus Darcy's Law stating that the fluid flow is proportional to pressure gradients and gravitational potential differences [1]. The constants of proportionality depend on permeability, viscosity and the density of the phases involved. The equations constituting a mathematical model for a reservoir are almost always too complex to be solved by analytical methods, even after many idealizations. In order to focus on the main ideas of the algorithm we will here consider a simplified prototype black oil simulator in two space dimensions. We assume incompressible flow and neglect gravity. One can then reduce the model [16] to an equation for the pressure (here for the phase pressure p of water) and a saturation equation (here for the relative saturation s of water)
The parameter ε is a scaling parameter introduced when the equations are converted to dimensionless form. It reflects the reciprocal of the dimensions of the reservoir and is therefore most often quite small. The source terms q_i depend on the rate of oil production and water injection in the reservoir. The reservoir properties are the permeability tensor
the fractional flow function of water,
and the function
The total velocity v = v_w + v_o is given by Darcy's law as (6)
The oil and water parameters giving the ratio of the relative and absolute permeability tensors are defined by
where k_i(s) is the relative permeability and μ_i is the viscosity of oil and water (i = o, w). Finally, K(x) is the absolute permeability tensor, φ is the porosity of the rock, and p_c(s) is the capillary pressure. These quantities are known properties. K(x) is a measured quantity describing how fluid can move through the rock. The capillary pressure is assumed to be a known function of the computed saturation s; the oil phase pressure is then determined from the relation p_o = p − p_c. The pressure equation is modeled with K(x) = 0 near the boundary in order to enforce a no flow condition. Together with an initially known value for the pressure along the boundary, this effectively imposes a Dirichlet boundary condition. The saturation equation is modeled with a normal no flow condition (Neumann condition), but the injection wells contribute isolated Dirichlet conditions.
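To give a feeling for the nonlinearities hidden in the mobilities and in the fractional flow function of water, the following sketch evaluates Corey-type quadratic relative permeabilities and the resulting fractional flow; the quadratic form and the viscosity values are textbook assumptions, not data from the simulator described here:

```python
import numpy as np

def mobilities(s, mu_w=1.0e-3, mu_o=4.0e-3):
    # Corey-type (quadratic) relative permeabilities; s is the water saturation.
    lam_w = s**2 / mu_w            # water mobility  k_rw(s) / mu_w
    lam_o = (1.0 - s)**2 / mu_o    # oil mobility    k_ro(s) / mu_o
    return lam_w, lam_o

def fractional_flow(s):
    # f(s) = lam_w / (lam_w + lam_o): fraction of the total flux carried by water.
    lam_w, lam_o = mobilities(np.asarray(s, dtype=float))
    return lam_w / (lam_w + lam_o)
```

The resulting S-shaped f(s) (cf. Fig. 1 below) is precisely what motivates the splitting f(s) = a(s) + s b(s) introduced in the next section.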
3 The Solution Strategy
There are many different approaches for solving the equations. One can treat the complete nonlinear system using Newton's method [3] or decouple the system and iterate between the pressure equation and the saturation equation. The pressure equation is solved with the saturation fixed, and then the saturation equation is solved with the velocity (pressure) fixed. This splitting is not supported by a rigorous mathematical theory; rather it is heuristically motivated by the different behavior of the pressure equation and the saturation equation [1].
FIG. 1. The plot shows a typical fractional flow function f(s), and also its splitting f(s) = a(s) + s b(s), where a(s) is the convex envelope of f(s).
The remainder of this section derives the appropriate equations that are numerically solved at each time step. This derivation is a bit technical and may be skipped by the reader whose primary interest is the application of domain decomposition techniques to partial differential equations. Note that the two resulting equations (1) and (12) have a general form similar to what is encountered in many other applications. Our method decouples the equations and is based on the work of Espedal and Ewing [27]. Both equations should be solved implicitly in order to allow the use of large time steps. However, in order to handle both the first and second order terms in (2), we begin by splitting off a purely hyperbolic part
For a convex fractional flow function a(s), the solution to the Riemann problem (7) (given v) is either a continuous solution, called a rarefaction, or a discontinuous solution, called a shock. A common procedure for solving such problems is to use time-stepping along the characteristic curves [20, 29, 37]. Figure 1 shows the shape of a typical fractional flow function f(s) for an oil-water problem. We see that f(s) is not convex. Since the procedure just described depends on convexity, we must introduce a splitting of the function. Following [27] (see also [18, 38]), we split the fractional flow function as f(s) = a(s) + s b(s), where the convex part a(s) appears in (7). The characteristic system associated with Equation (7) is given by
The source term q% in (2) models injection wells. These can be described as point sources and adequately taken into account by specifying that s = 1 at the corresponding coordinates. We can now use the first equation in (8) and rewrite Equation (2) as
From the last equation in (8) we can find a first approximation for the saturation at a new time r"+1 by tracing the characteristic curve back (upstream) to time level rn. In the following, let ,s(x) denote the saturation at the new time level r n+1 while s(x) is evaluated at the previous time r". In light of (7), the saturation s(x) at time level rn+l should equal s(x) at the time rn provided that x and x lies on the same characteristic. That is, given our current coordinate x we determine x from the single nonlinear ecmation
This step can be carried out independently for all grid points in the model. The operations are purely local, but in this process one must be able to access the data related to other grid points that are on the same characteristic in the upstream direction. Using the second relation in (8) we approximate the term ^ in (9) with
Inserting (11) in (9) and avoiding the nonlinearities in b ( s , x ) and D(s,x) by using the approximate value s(x) from the previous time step, we obtain a linear equation for the saturation s(x) at the new time level
To summarize, our solution procedure involves three main computational tasks at every time step: first the solution of the pressure equation (1), next the solution of (10) in order to determine s̄(x), and finally the solution of (12) to find the saturation s(x) at the new time level.
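The structure of one time step can be summarized in the following outline. This is a sketch of the splitting only, under the assumptions stated in the text; every function name below is a hypothetical placeholder rather than part of the authors' code.

    def advance_one_time_step(state, dt):
        # Step 1: solve the pressure equation (1) with the saturation frozen,
        # and derive the total velocity field from the computed pressure.
        pressure = solve_pressure_equation(state.saturation)
        velocity = compute_total_velocity(pressure)

        # Step 2: hyperbolic predictor -- trace characteristics upstream (Eq. (10))
        # to obtain the first approximation s_bar at the new time level.
        s_bar = trace_characteristics(state.saturation, velocity, dt)

        # Step 3: parabolic corrector -- solve the linear saturation equation (12)
        # with the nonlinear coefficients frozen at s_bar.
        state.saturation = solve_saturation_equation(s_bar, velocity, dt)
        state.pressure = pressure
        return state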
4
Distributed Data Structure
Our choice of data structure for the computer realization of the algorithm is guided by the following considerations: Data locality. Independent of any particular computer architecture or algorithm, this concept has been established as a primary guideline for efficient, high-performance algorithms.
FIG. 2. The domain on the left, the distributed data structure on the right. Each local data structure contains a complete subdomain with its boundary. In addition, there is one grid point from a coarser discretization in the local structure.
Domain decomposition. To be described in the next section, our iterative algorithms depend on an effective preconditioner based on domain decomposition. This representation of the equations restricted to subdomains specifies precisely what data is needed in the local data structures. Communication requirements. The explicit distributed memory model defines the necessary communication precisely. In contrast, the data movements are most often implicit in a (virtual) shared memory or cache-based computer. The data structure and the algorithm will together define both the volume and structure of data communication. Keeping this in mind, we will try to separate short and long range communication, minimize the total volume, and balance this with the required volume of computation. Based on this, it is very natural to consider a (distributed) data structure where the domain (reservoir) is partitioned into many subdomains. Each subdomain carries the entire data structure of a reservoir. In fact, each subdomain is equipped with enough data, including material properties and boundary data, to enable a processor to carry out a complete reservoir simulation. The boundary data is therefore duplicated in subdomains that are neighbors. The stiffness matrices for equations (1) and (12) are assembled and stored as part of the local data structure for each subdomain.
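A minimal sketch of such a per-subdomain record is given below. The field names and the use of Python/SciPy containers are assumptions for illustration; the actual implementation described in [34] may organize the data differently.

    from dataclasses import dataclass
    import numpy as np
    import scipy.sparse as sp

    @dataclass
    class Subdomain:
        owned: np.ndarray            # local indices of grid points owned by this subdomain
        halo: np.ndarray             # duplicated boundary points shared with neighbours
        neighbours: list             # ids of neighbouring subdomains (for data exchange)
        coarse_point: int            # index of the associated coarse-grid point
        permeability: np.ndarray     # local material properties
        A_pressure: sp.csr_matrix    # local stiffness matrix for the pressure equation (1)
        A_saturation: sp.csr_matrix  # local stiffness matrix for the saturation equation (12)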
Our algorithm as outlined in the previous section has three basic steps. Since we use preconditioned Krylov space methods for steps 1 and 3, the data structure must be able to support multiplication of vectors by the stiffness matrices as well as efficient application of subdomain-based preconditioners. The matrix-vector product needs attention: because the local data structures store the interior boundary information redundantly, the global stiffness matrix is not simply the sum of the local matrices. In addition, each subdomain data structure will store at least one grid point belonging to a coarser, global grid. This is an important part of the overall algorithm [Cai, this volume], and the local data structure must therefore also support restriction of its local data to this coarse grid point and interpolation of a value from the coarse data point to all appropriate local data points. Figure 2 gives a schematic picture of the local data structure that is naturally associated with a (possibly virtual) processor.
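A hedged sketch of the resulting distributed matrix-vector product is shown below; it reuses the Subdomain record sketched earlier, assumes that each locally stored matrix carries complete rows for the points the subdomain owns, and treats exchange_halo as a hypothetical communication routine.

    def distributed_matvec(subdomains, x_local, exchange_halo):
        # exchange_halo(sd, x) overwrites the duplicated boundary (halo) entries of x
        # with the values held by the neighbouring subdomains that own them.
        y_local = []
        for k, sd in enumerate(subdomains):
            x_local[k] = exchange_halo(sd, x_local[k])
            y = sd.A_pressure @ x_local[k]
            # Keep only the rows owned by this subdomain, so that the redundantly
            # stored overlap rows are not counted twice in the global product.
            y_local.append(y[sd.owned])
        return y_local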
5
The Numerical Algorithm
With a suitable data structure in place, we can now describe the numerical algorithm. Our focus will be on the main steps, with emphasis on data locality and, in particular, on what non-local data movements are necessary. The reader should keep in mind that the procedure can be carried out on one processor (in a loop over local subdomains) or on a parallel set of processors, in the limiting case with as many processors as there are subdomains. Since we use domain decomposition algorithms extensively, we now introduce a particular version, the Additive Schwarz Method (ASM). An extensive literature on this subject exists; see for example [6, 11, 12, 13, 14, 21, 23, 24] and the references in these articles. Domain decomposition algorithms are usually used as sophisticated preconditioners. In fact, the additive method that we describe below may not even converge if used as a stand-alone iteration, and is therefore always used together with an outer iterative Krylov space method. The ASM method was originally motivated by the famous alternating iteration of H. A. Schwarz [39], but it can also be viewed as a generalization of the block Jacobi iteration [26]. Define a restriction matrix R_i by
where u_i is a grid function defined in the local data structure i. Here u is the global reservoir grid function and R_i consists of zeros and ones in such a way that it selects only the components that belong to subdomain i. The locally stored part of the stiffness matrix is similarly given by A_i = R_i A R_i^T. Our preconditioner will be built from the inverses of these local matrices.
Let B_i = R_i^T A_i^{-1} R_i. The first part of the preconditioner is then given by
The solution of elliptic (diffusion dominated) problems depends on both local and long range effects, as can be easily seen from the structure of the relevant Green's function for such problems. An effective preconditioner must therefore incorporate a mechanism for global exchange of information between all subdomains at every iteration. If not, the number of iterations will necessarily be bounded from below by O(1/H), roughly the number of subdomains spanning the nondimensionalized domain, where we assume that a typical subdomain has diameter H. If the subdomains communicate only with their nearest neighbors (via the shared overlap), then this bound is the time it takes before the influence of one subdomain has propagated across the full domain. This may be acceptable with relatively few subdomains. We are interested in having thousands of subdomains, for two reasons: the work per subdomain decreases when the local problems are smaller, and the degree of parallelism increases. We therefore introduce a coarse discretization A_c of the domain, with elements the size of the previously defined subdomains. Consistent with previous notation, let R_0^T be the interpolation from the coarse grid to the original (fine) grid and let its transpose R_0 define the corresponding restriction. With
the complete two-level preconditioner is
This class of domain decomposition preconditioners can be shown to have optimal convergence properties in the sense that the number of iterations is bounded independently of the discretization parameters h (of the fine grid), H (of the coarse grid) and of the ratio H/h, as long as the overlap between subdomains is fixed independently of h. More information on the theoretical properties of the method can be found in the references listed above. In our implementation, we violate the assumption on the overlap size by keeping only one grid line of overlap, independent of h. This does lead to a modest increase in (outer) iterations, but extensive experience [7, 8, 40] and theoretical results [25] indicate that this may still be advantageous when the amount of data stored and the duplicated computation in the overlap area are taken into account. That is, a small overlap tends to minimize the overall computational time. We next discuss the calculations that must be carried out at each time level of the simulation. Our focus is on the data locality issue; a more
detailed description of the algorithm and its implementation is given in [34].
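For concreteness, the following dense-matrix sketch applies a two-level additive Schwarz preconditioner of the kind described above, B = R_0^T A_0^{-1} R_0 + sum_i R_i^T A_i^{-1} R_i. It is illustrative only: the index sets and the coarse restriction R_0 are assumed inputs, and a practical code would of course use the distributed data structures and inexact local solves discussed in the text.

    import numpy as np

    def asm_two_level_apply(A, subdomain_indices, R0, r):
        # Apply the two-level ASM preconditioner to a residual vector r.
        z = np.zeros_like(r)
        # Local (one-level) part: independent overlapping subdomain solves.
        for idx in subdomain_indices:
            Ai = A[np.ix_(idx, idx)]                # Ai = Ri A Ri^T
            z[idx] += np.linalg.solve(Ai, r[idx])   # Ri^T Ai^{-1} Ri r
        # Coarse part (14): the global exchange of information.
        A0 = R0 @ A @ R0.T
        z += R0.T @ np.linalg.solve(A0, R0 @ r)
        return z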
FIG. 3. Elapsed time for one iteration of the algorithm broken down into the three main steps. The problem has a million grid points and runs on the full MP-2. In the left plot the pressure equation takes about ten seconds, the saturation equation about 20 seconds, and the hyperbolic equation about 30 seconds; the top curve shows the total time. A direct method is used for the subdomain solution. The same information appears in the rightmost plot, but an iterative method is used. The time step is fixed at 0.01 in both cases.
5.1
Solution of the Pressure Equation
The pressure equation is discretized using linear (P1) finite elements. The velocity is computed using (6). A detailed description of this procedure can be found in [38]. An alternative technique is a mixed finite element method where both the pressure and the velocity are directly approximated to second order [28]. The first computational task is the calculation of the permeability tensor (3), which is a completely local computation. In the reservoir problem this coefficient is often characterized by large variations and jump discontinuities. Next, we compute the stiffness matrix for (1) and the corresponding right hand side vector; this is again carried out independently in each local subdomain. This calculation is repeated for the coarse (subdomain size) triangulation of the domain, again within each local data structure. The resulting linear system is symmetric positive definite, and the appropriate Krylov space method is the conjugate gradient method [30]. The matrix-vector product is a completely local calculation. We precondition the system using the ASM method described above. The local problems B_i, i = 1, 2, ..., are solved directly or approximately using a fixed, small
number of SSOR iterations (Symmetric Successive Over-Relaxation [30]). In a case where the number of grid points in a single subdomain exceeds, say, a hundred, one should implement a simple multigrid-based iteration in order to keep the complexity of this step as low as possible. The application of (13) to a vector requires communication between the nearest neighbor data structures when the local boundary values are updated. The coarse grid matrix (14) requires exchange of data between all subdomains. We use a fixed number of multigrid V-cycles [10, 31] for this purpose. This gives us a very structured long range communication pattern. Previous work [8, 40] on this type of equation has shown that the volume of long range data exchange can be balanced with the local data communication between neighbor domains by adjusting the H/h ratio. Perhaps equally important, the ASM preconditioning is very robust with respect to large variations in the permeability [8]. Under certain circumstances, it is possible to prove that ASM preconditioners give a rate of convergence that is independent of jumps in the coefficients; see [22] and the references therein. Similarly, the dot products in the conjugate gradient iteration require non-local interaction of data. The relative importance of the three kinds of data movement that we encounter in this algorithm depends on the computer architecture at hand. Specific models of this cost are normally easy to establish once the computer is known. The last part of this computational step determines the total velocity v from the computed pressure. We note that this is a local calculation. A careful procedure based on control volumes and flux calculations is described in [38].
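A minimal sketch of such an inexact local solve, approximating the subdomain problem A_i z_i = r_i by a fixed, small number of SSOR sweeps, is given below (dense storage is assumed purely for brevity).

    import numpy as np

    def ssor_solve(Ai, ri, omega=1.2, sweeps=3):
        n = Ai.shape[0]
        z = np.zeros(n)
        for _ in range(sweeps):
            # Forward Gauss-Seidel sweep with over-relaxation.
            for k in range(n):
                z[k] += omega * (ri[k] - Ai[k, :] @ z) / Ai[k, k]
            # Backward sweep, which makes the iteration symmetric (SSOR).
            for k in range(n - 1, -1, -1):
                z[k] += omega * (ri[k] - Ai[k, :] @ z) / Ai[k, k]
        return z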
5.2
Solution of the Hyperbolic Equation
The numerical solution of (10) is based on the assumption of a smooth velocity field that can be approximated locally by a constant. In order to satisfy this, the time interval [τ^n, τ^{n+1}] is broken into several subintervals. The procedure then evaluates (10) starting with the maximal value of the slope of a(s) (see Figure 1). If the final value of s(x) at time τ^n is consistent with our upstream integration along a characteristic (i.e., the value is unchanged), then we stop; otherwise the correct saturation is higher and we repeat the procedure, doing bisection using smaller values of a'(s). This algorithm is carried out for all grid points within each local subdomain, but the calculations may, depending on the subdomain size and the length of the time step, access data along the characteristic across a sequence of nearby subdomains. Furthermore, this path may be quite different in different parts of the reservoir. The resulting data movement becomes unstructured.
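Because Eqs. (8)-(10) are not reproduced here, the following one-dimensional sketch only indicates the flavour of this step: it searches, by bisection, for a saturation value that is consistent with tracing the characteristic upstream, assuming a characteristic speed of the form v a'(s) and a monotone bracketing function. All names and the precise form of the characteristic are assumptions.

    def trace_characteristic(x, v, dt, s_old, a_prime, s_lo=0.0, s_hi=1.0, tol=1e-8):
        # Solve the scalar consistency condition s = s_old(x - v * a_prime(s) * dt)
        # by bisection on [s_lo, s_hi].
        for _ in range(100):
            s_mid = 0.5 * (s_lo + s_hi)
            x_bar = x - v * a_prime(s_mid) * dt   # foot of the characteristic (upstream)
            s_traced = s_old(x_bar)               # old-time saturation at that point
            if abs(s_traced - s_mid) < tol:
                break
            if s_traced > s_mid:
                s_lo = s_mid                      # the correct saturation is higher
            else:
                s_hi = s_mid
        return s_traced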
5.3
Solution of the Saturation Equation
Our last computational step is to solve (12). This equation is also treated using an ASM preconditioner, and several computational steps are similar to those described for the pressure equation. The main difference (and difficulty) is the presence of the first-order convective term Δτ v · ∇(b(x)s(x)). The equation is discretized with finite elements, using rectangular elements and bilinear basis functions. A Petrov-Galerkin formulation is used to better represent the convective term. More details on this can be found in [4, 15, 19, 33, 35]. The equations are nonsymmetric, and the Krylov space method BiCGstab [41] was found to perform satisfactorily. For relatively small subdomains a direct subdomain solver is competitive; such a solver provides a more robust preconditioner and leads to a smaller number of outer iterations than an inexact, iterative solver. We have observed cases (with large time steps) where this choice was necessary for convergence; the inexact subdomain solver would only succeed with a smaller time step. Our preferred inexact solver for the nonsymmetric case is based on the nested factorization described in [2], a popular method in the reservoir engineering community. The nearest neighbor communication and the need for dot products are similar to the situation discussed for the pressure equation. The main difference is the role of the long range component (14) of the preconditioner. The previous hyperbolic step should (if our splitting is successful) essentially move the oil-water front to its new location. The saturation equation that we solve here serves as a correction in order to fully represent the fractional flow function (4) and account for the diffusive term. If ε in (12) is small, then we find very satisfactory convergence without using the coarse space, thus eliminating the need for long range data movements in this step. On the other hand, if ε is relatively large, the situation is similar to the pressure equation and a global exchange of data in the reservoir is needed. The situation is complicated by several factors. The two most important are the splitting of the fractional flow function, which changes the relative size of the convective and the diffusive term as a function of the saturation, and the variable distance the (oil-water) front moves in one time step, relative to the size of the subdomains. We have found it difficult to design a coarse space component (14) of the preconditioner that works satisfactorily in all cases. A more refined implementation that provides non-local coupling only in the areas of the reservoir that have significant diffusion may be needed.
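A minimal usage sketch (not the authors' code) of this step is shown below: the nonsymmetric system for the saturation correction is handed to BiCGstab with the additive Schwarz application from the earlier sketch as preconditioner; dense NumPy matrices are assumed for simplicity, and the coarse part can be dropped when ε is small.

    import numpy as np
    import scipy.sparse.linalg as spla

    def solve_saturation_system(A, rhs, subdomain_indices, R0):
        n = A.shape[0]
        M = spla.LinearOperator(
            (n, n), matvec=lambda r: asm_two_level_apply(A, subdomain_indices, R0, r))
        s_new, info = spla.bicgstab(A, rhs, M=M)
        if info != 0:
            raise RuntimeError("BiCGstab did not converge")
        return s_new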
6
Parallel Computing Aspects
The design of our data structure and the structure of the solution algorithm make a parallel implementation feasible. The algorithm has been described as a sequence of computational steps operating on a (possibly) large number
FIG. 4. The figure shows how the total simulation time (40 time steps) scales. To the left we change the size of the subdomain grid from 2 x 2 to 8 x 8, corresponding to an increase in the number of grid points by a factor of 16; the size of the machine is 16384 processors. In the rightmost plot we keep the subdomain grid fixed at 7 x 7 per processor, but scale the size of the machine (and therefore also the size of the problem) by a factor of 16.
of local data structures. Since the access of non-local data is identified and isolated in the description, we can easily analyze the communication overhead for a given parallel computer. An important consideration in domain decomposition algorithms is the size (and the number) of the individual subdomains. We prefer to discuss this somewhat independently of the number of processors on a target parallel machine. One can often think in terms of having a one-to-one mapping between subdomains and virtual processors. The assignment of virtual processors to actual processors in a parallel computer may then be considered as a separate problem. Still, the number of subdomains should normally be at least equal to the number of processors available. There is a volume-to-surface relation between the computation and the communication when applying the first level of our preconditioner (13). If the algorithm used inside a subdomain has a computational complexity growing faster than linearly in the number of unknowns, small subdomains are strongly favored. Aside from the increased volume of local communication, this must also be balanced against the increase in size of the coarse grid problem (14). On massively parallel machines the coarse grid problem is a significant computation. In this case, it is often advantageous to solve the coarse grid problem separately from the subdomain problems, by applying a parallel algorithm for its solution. One may also use a hybrid multiplicative/additive version of the preconditioner, since the two steps will be sequential anyway [Cai, this volume]. The unstructured communication encountered when solving (10) poses a very different problem, in particular on computers that have a highly regular interconnection network. First, observe that it also depends on the size of
the subdomains discussed above. A larger subdomain size will reduce or at least localize the data access in this task, since the oil-water front will tend to stay within the same, or at most move to a neighboring, subdomain during one time step. We also note the influence of the step size in time; decreasing the time step will have the same effect as making the subdomains larger. The computer simulations reported here used a time step that sometimes moved the oil-water front as far as five subdomains in one step. The algorithm we described for step (10) is therefore not very suitable for a highly structured machine like the MasPar MP-2. These observations indicate a way to partly circumvent the problem. We can try to subdivide our time step. Instead of searching upstream over the full interval [τ^n, τ^{n+1}], we take a shorter step starting from, say, τ^{n+1/k}. As soon as the approximate saturation s̄(x) has been found by following the upstream direction, we can update this intermediate value of s at time τ^{n+1/k} and repeat the process (starting from τ^{n+i/k}, i = 2, ..., k) until we obtain the approximate value s̄(x) at time level τ^{n+1}.
To what extent is the overall algorithm discussed in this chapter scalable? The definition of a scalable algorithm for a complex simulation like this is far from easy. Intuitively, one would like to ask: for a fixed number of subdomains per processor, each of a fixed size, what happens if one doubles the size of the problem by doubling the size of the machine? An unchanged elapsed computing time would be one (ideal) definition of scalability. (That is, for a fixed memory size per processor, check whether the computing time stays constant as the number of processors grows.) However, because of the hyperbolic step, an increase in the spatial resolution will unavoidably increase non-local communication unless the time step is reduced as well. This effect will tend to increase the elapsed time for the simulation. On the other hand, even with implicit schemes that may be able to maintain long time steps, if the spatial resolution increases it is often natural to also increase the resolution in time. This will increase the elapsed time for a given simulation, violating our scalability criterion, since the time-stepping does not reflect increased memory usage. A more restricted measure would be to consider the elapsed computing time for a single simulation time step, with suitable assumptions on the relative resolution in space and time. In this situation the entire algorithm should behave similarly to the pressure solver. It can be seen from Figure 4 that our pressure solver scales with the number of processors up to at least 16384 on our MP-2. Thus, such a scalability result seems within reach, but it may not be what the reservoir engineer asks for: constant time to solution.
7
Results and Discussion
We have tested an implementation of this prototype simulator on a 16384-processor MasPar MP-2 computer having one gigabyte of distributed memory. The machine is a SIMD computer with a peak speed of approximately 2.3 Gflops in
FIG. 5. The saturation profile after 15 time steps (with Δτ = 0.018) when 32 percent of the oil is produced. The left plot shows a simulation using 1024 processors with ε = 0.01 and h = 1/(6 · 32). The right plot shows the same simulation using 16384 processors with ε = 0.0001 and h = 1/(6 · 128).
64-bit floating point arithmetic. It has a high-capacity mesh interconnect, called the XNET, providing communication to a processor's 8 nearest neighbors, and a general 3-stage interconnect, called the ROUTER, for arbitrary (unstructured) communication. A more detailed description of the machine can be found in [9, 17, 36]. The simulation reported here is a standard industry benchmark with parameters given in [34]. There is one injection well and one production well, and the oil-water front sweeps across the domain. We assign one subdomain to each processor and consider the performance of the algorithm. First, in Figure 3 we measure the elapsed time per time step of the three main computational steps. We use a grid with more than one million grid points, corresponding to a little less than one hundred grid points per subdomain. This is close to the memory limitation of the machine, given all the general data that must be stored and that scales with the number of grid points in the simulation. Observe the difference in time between the pressure and saturation solves when a direct solver is used for the subdomains, due to the more complex saturation equation, while the times are very similar when the inexact iterative method is used. The figure is for ε = 0.001, and there is no coarse grid component in the preconditioner for the saturation equation. We notice that more than half of the time is spent in our second computational step, the integration along characteristics. This is directly related to communication time, as we use the slow, but very general, ROUTER hardware to handle the unstructured communication. We have not yet tested the substepping alternative discussed in the previous section. Presumably, such a change in the algorithm would enable us to use the much faster XNET communication. The overall elapsed time for one time step is between 50 and 60 seconds, giving a total simulation time of about 30 minutes.
In Figure 4 we show how the simulation time scales with the number of grid points and also how a change in the size of the machine affects the elapsed time. The simulation is performed with a fixed time step of 0.01, independent of the resolution in space; this should be kept in mind when interpreting the results. The first plot shows the increase in simulation time as we increase the size of the problem (by increasing the subdomain size). We observe about a factor-of-four increase in simulation time as the problem size is scaled up by a factor of sixteen. This shows the increased computational efficiency as the relative time spent on communication (and computational overhead) decreases. In the other half of the figure we scale the problem and the machine, keeping the amount of data per processor fixed. As discussed earlier, this implies that each processor gets a relatively smaller and smaller piece of the reservoir to compute. Notice the perfect scaling of the pressure solver at about 300 seconds. We have ε = 0.001 and do not use the coarse grid in the saturation solver; this shows up as an increase in the time for this part of the simulation. Overall, the simulation time increases by a factor of two when we scale the problem and machine by a factor of sixteen.
In summary, the design and testing of this algorithm show that a domain-based decomposition of an oil reservoir simulator is a promising approach for the development of such codes for massively parallel computers. The paradigm has the flexibility to balance the computation and communication time on a wide range of computer architectures. Many issues remain. The design of a more adaptive coarse grid operator for the saturation equation is needed. We believe that our step 2, following streamlines in parallel, can be substantially improved. Further generalizations of the method to handle the compressible case, three dimensions, and three phases should be investigated. We believe that some of the guiding principles followed when developing this algorithm can be useful also in other and more general situations. Practical development of parallel, portable code, based on concepts like the ones presented here, remains a formidable challenge.
References
[1] M. B. Allen, G. A. Behie, and J. A. Trangenstein, Multiphase Flow in Porous Media, Lecture Notes in Engineering, Springer-Verlag, 1988.
[2] J. R. Appleyard and I. M. Cheshire, Nested factorization, presented at the Seventh Reservoir Simulation Symposium, San Francisco, November 1983, SPE 12264.
[3] K. Aziz and A. Settari, Petroleum Reservoir Simulation, Elsevier Applied Science Publishers, 1979.
[4] J. W. Barrett and K. W. Morton, Approximate symmetrization and Petrov-Galerkin methods for diffusion-convection problems, Comp. Meth. in Appl. Mech. and Eng., 45 (1984).
[5] P. Bjørstad, J. W. M. Coughran, and E. Grosse, Parallel domain decomposition applied to coupled transport equations, in Proceedings from the Seventh International Conference on Domain Decomposition Methods in Scientific and Engineering Computing (Penn State University, October 1993), D. E. Keyes and J. Xu, eds., AMS, Providence, 1994.
[6] P. E. Bjørstad, Multiplicative and additive Schwarz methods: Convergence in the 2 domain case, in Domain Decomposition Methods, T. Chan, R. Glowinski, J. Periaux, and O. Widlund, eds., SIAM, Philadelphia, 1989.
[7] P. E. Bjørstad, R. Moe, and M. Skogen, Parallel domain decomposition and iterative refinement algorithms, in Parallel Algorithms for PDEs, Proceedings of the 6th GAMM-Seminar held in Kiel, Germany, January 19-21, 1990, W. Hackbusch, ed., Vieweg-Verlag, Braunschweig, Wiesbaden, 1990.
[8] P. E. Bjørstad and M. Skogen, Domain decomposition algorithms of Schwarz type, designed for massively parallel computers, in Fifth International Symposium on Domain Decomposition Methods for Partial Differential Equations, T. F. Chan, D. E. Keyes, G. A. Meurant, J. S. Scroggs, and R. G. Voigt, eds., SIAM, Philadelphia, 1992.
[9] T. Blank, ed., The MasPar MP-1 architecture, in Proceedings of IEEE Compcon Spring 1990, IEEE, February 1990.
[10] W. Briggs, A Multigrid Tutorial, SIAM, Philadelphia, 1987.
[11] X.-C. Cai, An additive Schwarz algorithm for nonselfadjoint elliptic equations, in Third International Symposium on Domain Decomposition Methods for Partial Differential Equations, T. Chan, R. Glowinski, J. Periaux, and O. Widlund, eds., SIAM, Philadelphia, 1990.
[12] X.-C. Cai, Additive Schwarz algorithms for parabolic convection-diffusion equations, Numer. Math., 60 (1991), pp. 41-61.
[13] X.-C. Cai and O. Widlund, Domain decomposition algorithms for indefinite elliptic problems, SIAM J. Sci. Statist. Comput., 13 (1992), pp. 243-258.
[14] X.-C. Cai and O. Widlund, Multiplicative Schwarz algorithms for some nonsymmetric and indefinite problems, SIAM J. Numer. Anal., 30 (1993), pp. 936-952.
[15] M. A. Celia, I. Herrera, and S. Kindered, A new numerical approach for the advective-diffusive transport equation, Tech. Rep. 7, Comunicaciones Tecnicas del Instituto de Geofisica (UNAM), Mexico, 1987.
[16] G. Chavent, G. Cohen, and J. Jaffré, Discontinuous upwinding and mixed finite elements for two-phase flows in reservoir simulation, Comp. Meth. in Appl. Mech. and Eng., 47 (1984), pp. 93-118.
[17] P. Christy, ed., Software to support massively parallel computing on the MasPar MP-1, in Proceedings of IEEE Compcon Spring 1990, IEEE, February 1990.
[18] H. K. Dahle, Adaptive Characteristic Operator Splitting Techniques for Convection-Dominated Diffusion in One and Two Space Dimensions, PhD thesis, Dep. of Appl. Math., University of Bergen, Norway, 1988.
[19] L. Demkowicz and J. T. Oden, An adaptive characteristic Petrov-Galerkin finite element method for convection-dominated linear and non-linear parabolic problems in one space variable, Comp. Meth. in Appl. Mech. and Eng., 55 (1986).
[20] J. Douglas and T. F. Russell, Numerical methods for convection-dominated diffusion problems based on combining the method of characteristics with finite element or finite difference procedures, SIAM J. Numer. Anal., 19 (1982), pp. 871-885.
[21] M. Dryja, An additive Schwarz algorithm for two- and three-dimensional finite element elliptic problems, in Domain Decomposition Methods, T. Chan, R. Glowinski, J. Periaux, and O. Widlund, eds., SIAM, Philadelphia, 1989.
[22] M. Dryja, M. Sarkis, and O. B. Widlund, Multilevel Schwarz methods for elliptic problems with discontinuous coefficients in three dimensions, Tech. Rep. 66, Courant Institute, New York University, March 1994.
[23] M. Dryja and O. B. Widlund, An additive variant of the Schwarz alternating method for the case of many subregions, Tech. Rep. 339 (also Ultracomputer Note 131), Department of Computer Science, Courant Institute, 1987.
[24] M. Dryja and O. B. Widlund, Additive Schwarz methods for elliptic finite element problems in three dimensions, in Fifth International Symposium on Domain Decomposition Methods for Partial Differential Equations, T. F. Chan, D. E. Keyes, G. A. Meurant, J. S. Scroggs, and R. G. Voigt, eds., SIAM, Philadelphia, 1992.
[25] M. Dryja and O. B. Widlund, Domain decomposition algorithms with small overlap, Tech. Rep. 606, Department of Computer Science, Courant Institute, May 1992. To appear in SIAM J. Sci. Stat. Comput.
[26] M. Dryja and O. B. Widlund, Some recent results on Schwarz type domain decomposition algorithms, Contemporary Mathematics, 157 (1994), pp. 53-61. In proceedings from the Sixth International Symposium on Domain Decomposition.
[27] M. S. Espedal and R. E. Ewing, Characteristic Petrov-Galerkin subdomain method for two-phase immiscible flow, Comp. Meth. in Appl. Mech. and Eng., 64 (1987).
[28] M. S. Espedal, R. E. Ewing, and T. F. Russell, eds., Mixed methods, operator splitting, and local refinement techniques for simulation on irregular grids, in Proceedings of the 2nd European Conference on the Mathematics of Oil Recovery, Arles, September 11-14, 1990.
[29] R. E. Ewing, T. F. Russell, and M. F. Wheeler, Convergence analysis of an approximation of miscible displacement in porous media by mixed finite elements and a modified method of characteristics, Comput. Meth. Appl. Mech. Engrg., 47 (1984).
[30] G. H. Golub and C. F. Van Loan, Matrix Computations, second edition, Johns Hopkins Univ. Press, 1989.
[31] W. Hackbusch, Multigrid Methods and Applications, Springer, Berlin, 1985.
[32] K. Hess, J. P. Leburton, and U. Ravaioli, Computational Electronics: Semiconductor Transport and Device Simulation, Kluwer, 1991.
[33] C. Johnson, Numerical Solution of Partial Differential Equations by the Finite Element Method, Cambridge University Press, 1987.
[34] T. Karstad, Massively parallel algorithms in reservoir simulation, PhD thesis, Department of Informatics, University of Bergen, September 1993.
[35] K. W. Morton, Finite Element Methods for Non-Self-Adjoint Problems, Lecture Notes in Mathematics 965, Springer-Verlag, 1982.
[36] J. Nickolls, ed., The design of the MasPar MP-1, a cost-effective massively parallel computer, in Proceedings of IEEE Compcon Spring 1990, IEEE, 1990.
[37] T. F. Russell, ed., Finite elements with characteristics for two-component incompressible miscible displacement, in Proceedings of the Sixth SPE Symposium on Reservoir Simulation, New Orleans, 1982, SPE 10500.
[38] O. Saevareid, On Local Grid Refinement Techniques for Reservoir Flow Problems, PhD thesis, Dep. of Appl. Math., University of Bergen, Norway, 1990.
[39] H. A. Schwarz, Gesammelte Mathematische Abhandlungen, vol. 2, Springer, Berlin, 1890, pp. 133-143. First published in Vierteljahrsschrift der Naturforschenden Gesellschaft in Zurich, volume 15, 1870, pp. 272-286.
[40] M. D. Skogen, Parallel Schwarz Methods, PhD thesis, Dept. of Informatics, University of Bergen, Høyteknologisenteret, N-5020 Bergen, Norway, 1992.
[41] H. A. van der Vorst, Bi-CGSTAB: A fast and smoothly converging variant of Bi-CG for the solution of nonsymmetric linear systems, SIAM J. Sci. Statist. Comput., 13 (1992).
Chapter 4
Parallel Implicit Methods for Aerodynamic Applications on Unstructured Grids
V. Venkatakrishnan*
* This research was supported under NASA contract No. NASA1-19840 while the author was in residence at ICASE, NASA Langley Research Center, Hampton, VA. Institute for Computer Applications in Science and Engineering, NASA Langley Research Center, Hampton, VA 23681-0001.
Abstract
A finite-volume scheme for solving the Euler equations on triangular unstructured meshes is parallelized in MIMD (multiple instruction/multiple data stream) fashion. An explicit four-stage Runge-Kutta scheme is used to solve two-dimensional flow problems. A family of implicit schemes is also developed to solve these problems, where the linear system that arises at each time step is solved by a preconditioned GMRES algorithm. The choice of the preconditioner in a distributed memory setting is discussed. All the methods are compared both in terms of elapsed times and convergence rates. It is shown that the implicit schemes offer adequate parallelism at the expense of minimal sequential overhead. The use of a global coarse grid to further minimize this overhead is also investigated. The schemes are implemented on a distributed memory parallel computer, the Intel iPSC/860.
1
Introduction
Triangular meshes have become popular in computational fluid dynamics. They are ideal for handling complex geometries and for adapting to flow features, such as shocks and boundary layers. Distributed memory parallel computers seem to offer an avenue for solving large problems rapidly due to their scalability. For the goal of sustained high performance on these machines to be realized, however, many fundamental issues need to be addressed. Among these are scalable algorithms and software. Explicit schemes used in computational fluid dynamics possess almost complete parallelism. They require only a simple update procedure that involves local dependencies. On a parallel computer, such schemes typically require communication only to nearest neighbors. Implicit schemes, on the other hand, require the solution of coupled equations, which involves
global dependencies. Extracting parallelism in implicit schemes is a challenging task. When steady state solutions are sought, explicit schemes typically require thousands of time steps and exhibit very slow convergence rates. However, there are some applications, such as certain classes of unsteady flows, where explicit schemes are useful. Implicit schemes allow larger time steps to be taken and usually result in better convergence rates for steady state problems. Implicit schemes have to be designed carefully, since the work involved at each time step can be substantial. References [24, 25, 36] have used the Generalized Minimal Residual technique (GMRES) [22] with diagonal preconditioning to solve the linear system of equations arising from an implicit discretization of the compressible Navier-Stokes equations on unstructured grids. In [34, 35] point-implicit schemes have been used to solve inviscid and turbulent flows in two and three dimensions. In reference [27] a family of implicit schemes was tested on the Cray Y-MP for solving the two-dimensional compressible Navier-Stokes equations on unstructured meshes. It was concluded that GMRES with incomplete LU preconditioning (GMRES/ILU) was superior to other implicit schemes over a whole range of flow conditions and was as efficient as the unstructured multigrid strategy of [16]. On distributed-memory parallel computers, the design of implicit schemes is more difficult, since parallelism and load balance during the implicit phase are additional considerations. In reference [28], the schemes of reference [27] were investigated for parallelism on the Intel iPSC/860. In reference [12] an implicit iterative solution strategy based on the diagonal-preconditioned matrix-free GMRES algorithm [4] was implemented on the Connection Machine CM-2. In [21] an implicit incompressible flow solver that uses a "linelet-based" preconditioner was developed and tested on an Intel iPSC/860. Reference [1] investigates an implicit parallel algorithm for the Navier-Stokes equations using structured grids on the Intel iPSC/860. References [29, 7] have shown that it is possible to obtain supercomputer performance when solving explicit unstructured grid problems on the Intel iPSC/860. By paying careful attention to the partitioning of the mesh, the communication schedule, and the data structures, they have been able to show that 2-3 times the speed of a Cray Y-MP/1 could be obtained with 128 processors of the iPSC/860. The effects of various strategies for partitioning the unstructured grids on communication costs have been examined as well. More recently, unsteady and steady viscous flows have been computed in [10] using an explicit scheme on unstructured grids. This paper first summarizes implicit methods used to solve aerodynamic problems. The full details may be found in [28]. The implicit algorithms developed in [27] are explored as candidate schemes on a distributed-memory parallel computer. The issues in implementing the GMRES algorithm and the preconditioners in parallel are addressed. Results for a typical flow around a multi-element airfoil are presented and the performances of the
explicit and implicit schemes on the Intel iPSC/860 are compared. Finally, the use of a global coarse grid to improve convergence is investigated with the best implicit scheme.
2
Governing Equations and Spatial Discretization
The Euler equations in integral form for a volume Ω with boundary ∂Ω read
Here W is the solution vector comprising the conservative variables: density, the two components of momentum, and total energy. The vector F(W, n) represents the inviscid flux vector for a surface with normal vector n. The solution vector is stored at the vertices of a triangular mesh. Equation (1) holds for any volume, and in particular for a specific volume surrounding a grid point, termed the control volume. The control volumes are nonoverlapping polygons which surround the vertices of the mesh; they form the dual of the mesh. Eqn. (1) then states that the time rate of change of the variables inside the control volume is the negative of the net flux of the variables through the boundaries of the control volume. This net flux through the control volume boundary is termed the residual. We are interested only in computing the steady state solution, when this residual vanishes over all control volumes. Starting from an initial guess, typically freestream conditions, Eqn. (1) is marched in time until the solution W does not change. The contour integrals in Eqn. (1) are replaced by discrete path integrals over these dual edges. The physical flux F(W, n) is replaced by a numerical flux function. This is usually an approximation to the physical flux function that depends on the variable values only in a small neighborhood. More details on the spatial discretization may be found in [3]. The spatial discretization results in a stencil-based operation, where a stencil is defined as the neighborhood of the grid point affecting the solution at that point. The residuals are computed by looping over the edges of the mesh (a sketch of this edge-based loop is given at the end of this section), and vectorization is achieved by coloring the edges. The coloring algorithm groups the edges into "colors", so that edges belonging to the same color can be processed simultaneously. In this paper, it is assumed that the triangulation of the computational domain has been performed separately at the outset. The triangular mesh is partitioned across multiple processors. Each vertex of the triangulation is assigned uniquely to a partition, and the interpartition boundaries consist of the edges of the control volumes. Figure 1 shows a triangular mesh under a 2-way partitioning. The numbering of vertices that are members of both partitions is indicated as tuples, where the first index refers to the numbering for processor 0 and the second index refers to the numbering for processor 1. Two communication phases are
FIG. 1. 2-way partitioning for a simple triangulation.
required at each stage of the four-stage Runge-Kutta time integration. The processors exchange data at the two rows of vertices that are incident to the interpartition boundary edges. Each processor can thus compute the entire integrals for all the vertices it owns. Duplication of the flux calculations occurs at the interpartition boundary edges, but it is not a crucial issue on medium-grained parallel computers. As discussed in [29], this duplication can be avoided on fine-grained parallel computers at the expense of more communication. Reference [26] considers three different recursive partitioning strategies for partitioning unstructured meshes. In the context of an explicit two-dimensional Euler solver, it was shown in [29] that the spectral bisection strategy [20] was superior to the coordinate and graph bisection strategies in terms of communication costs. The spectral bisection technique produces uniform, mostly connected subdomains with short boundaries. Therefore, in this work, recursive spectral bisection is used for partitioning. After partitioning, global values of the data structures required to define the unstructured mesh are given local values within each partition. We thus dispense with any references to global indices. In [7] primitives are developed that allow access to a global data address space, so that the transformation from global to local data structures is unnecessary. Access to global data structures is highly desirable for adaptive grid applications, but is not necessary when dealing with static meshes. In the present implementation, each local data set also contains the information that a partition requires for communication at its interpartition boundaries. The information required for communication at the interpartition boundaries is precomputed using sparse matrix data structures. Each subgrid is assigned to one processor. It was shown in [29] that the mapping of the subgrids to processors is not a
crucial issue on the Intel iPSC/860. Therefore, a naive mapping that assigns subgrid 0 to processor 0, subgrid 1 to processor 1, and so on, is employed. Partitioning, conversion from global to local addresses, and generation of the data structures required for communication at the interpartition boundaries are all done on a workstation as a preprocessing step. This is justified when the same geometric case will be run for a variety of analyses, varying freestream Mach number, angle of attack, etc. In adaptive grid situations, where the grid evolves with the flow solution, such an approach requires constant repartitioning and is clearly not viable. Therefore, procedures such as those outlined in [13, 32] need to be adopted.
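A hedged sketch of the edge-based residual accumulation mentioned in this section is given below; the data layout and the two-point numerical flux are assumptions, and in the parallel setting the halo values would be exchanged before this loop, as described above.

    import numpy as np

    def compute_residuals(edges, normals, W, numerical_flux):
        # edges: (n_edges, 2) vertex pairs; normals: (n_edges, 2) dual-face normals;
        # W: (n_vertices, 4) conservative variables; numerical_flux: a two-point flux.
        R = np.zeros_like(W)
        for (i, j), n in zip(edges, normals):
            flux = numerical_flux(W[i], W[j], n)   # e.g. an approximate Riemann flux
            R[i] += flux                           # flux leaves control volume i ...
            R[j] -= flux                           # ... and enters control volume j
        return R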
3
Implicit Scheme
After discretizing Eqn. (1) in space, the following system of coupled ordinary differential equations is obtained:
Here W is the vector of unknowns over all the mesh points. M is the mass matrix, which represents the relationship between the average value in a control volume and the values at the vertices (the vertex representing the control volume and its nearest neighbors). It is a function only of the mesh and hence a constant matrix for a static mesh. Since a steady state solution (R(W) = 0) is sought, time-accuracy is not an issue and M can be replaced by the identity matrix. This approximation is called "mass lumping" in the finite-element literature. The following system of ordinary differential equations for the vector of unknowns W then results:
If the time derivative is replaced by:
an explicit scheme is obtained by evaluating R(W) at time level n. An implicit scheme is obtained by evaluating R(W) at time level n + 1. In the latter case, linearizing R about time level n, we obtain
Eqn. (5) represents a large nonsymmetric linear system of equations for the updates of the vector of unknowns, ΔW, and needs to be solved at each
time step. As Δt tends to infinity, the method reduces to the standard Newton's method. The term ∂R/∂W symbolically represents the implicit side upon linearization and involves the Jacobian matrices of the flux vectors with respect to the conservative variables. Due to storage considerations and computational complexity, only a lower order representation of the operator is employed. A consequence of this approximation is that Eqn. (5) can never approach Newton's method (with its associated quadratic convergence property), due to the mismatch of the right- and left-hand side operators. However, the matrix-free approaches of [12, 17, 6] allow consistent approximation of the Jacobian and may help realize the benefits of Newton's method. Based on the work of [19], the time step in Eqn. (5) is allowed to vary inversely proportional to the L2 norm of the residual. Since there is a mismatch of operators in Eqn. (5), it is however necessary to limit the maximum time step. There is a host of methods in the linear algebra literature for solving nonsymmetric systems of linear equations, but in this work, as in [27], only the GMRES technique [22] is considered. The GMRES technique is quite efficient for solving sparse nonsymmetric linear systems and is outlined below. Let x_0 be an approximate solution of the system
where A is an invertible matrix. The solution is advanced from x_0 to x_k as
GMRES(k) finds the best possible solution for y_k over the Krylov subspace span{v_1, A v_1, A^2 v_1, ..., A^{k-1} v_1} by solving the minimization problem
The GMRES procedure forms an orthogonal basis {v_1, v_2, ..., v_k} (termed search directions) spanning the Krylov subspace by a modified Gram-Schmidt method. These search directions need to be stored. As k increases, the storage increases linearly and the number of operations quadratically. Good solutions can, however, often be found in small subspaces (small k).
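A small usage sketch of restarted GMRES with a limited number of search directions (via SciPy, not the code used in the paper) is shown below; the matrix and right-hand side are placeholders for the linearized system of Eqn. (5).

    import numpy as np
    import scipy.sparse as sp
    import scipy.sparse.linalg as spla

    n = 1000
    A = sp.diags([-1.0, 2.5, -1.2], offsets=[-1, 0, 1], shape=(n, n), format="csr")
    b = np.ones(n)

    # restart=15 corresponds to limiting the number of stored search directions to 15.
    x, info = spla.gmres(A, b, restart=15, maxiter=200)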
4
Preconditioning and Parallelism Issues
Instead of Eqn. (7) the preconditioned iterative methods solve the following system:
The system of linear equations in Eqn. (11) is referred to as the right-preconditioned system and Q as the right preconditioner. The role of the preconditioner is to cluster the eigenvalues around unity. The ideal preconditioner is A^{-1}, yielding an identity matrix for AQ, but it is not practical. Thus the preconditioner should improve the conditioning of the matrix while being inexpensive to form and invert. The choice of preconditioners and the issues in implementing the preconditioned GMRES on a distributed-memory parallel computer are addressed below. On a distributed-memory parallel computer, the same least squares problem of Eqn. (9) is solved by each of the processors. While this results in some duplication of work, the main nonlocal kernels of GMRES are distributed across multiple processors. These kernels include sparse matrix-vector multiplication, dot products, and L2 norm evaluations. On a Cray Y-MP, vectorization of the sparse matrix-vector product was achieved by using an edge-oriented data structure for the matrix and coloring the edges of the graph [27]. Coloring of edges destroys locality while allowing for vectorization, and is attractive on a computer such as the Cray Y-MP because of the fast gather/scatter functions it possesses. However, this is not the optimal way to compute the matrix-vector product on a parallel computer, where locality is of utmost importance. The usual row-oriented sparse matrix data structure, which affords more locality, is used instead. We have found that even on a single node this approach outperforms the one that uses the edge-based data structure by a factor of two, because of the memory hierarchy. The rows are uniquely assigned to processors. Akin to the explicit scheme, each processor computes its share of the matrix-vector multiplication. The communication step consists of an exchange of the vector components at the two rows of vertices incident to the interpartition boundary edges. This is followed by the matrix-vector multiplication at each processor. More details on the implementations of the matrix-vector product on vector-parallel and distributed memory computers may be found in [30]. In most problems of interest, the choice of the preconditioner is very important, but the effort involved in applying the preconditioner should not be prohibitive. The implicit scheme without preconditioning possesses complete parallelism, except for the duplication of some work when solving the least squares problem in GMRES. On a parallel computer, the parallelism in the preconditioning phase is an important additional consideration. A simple choice is a block diagonal preconditioner that computes the inverse of the 4 x 4 diagonal block associated with a mesh point. The LU decomposition of the 4 x 4 blocks and the forward and back solves are local and, hence, are inherently parallel. A family of preconditioners has been developed in [18], wherein a sparsity pattern is specified and the entries outside this pattern that occur during the factorization are ignored. In [27] an ILU(0) preconditioner was
also considered. ILU(0) refers to an incomplete lower-upper factorization with no fill-in allowed beyond the original non-zero pattern. By using level scheduling [2] (also known as wavefront ordering) it is possible to obtain parallelism with this preconditioner. Under this permutation of the matrix, unknowns within a wavefront can be eliminated simultaneously. However, since the degree of parallelism varies with the wavefront, it cannot be easily exploited on a distributed-memory parallel computer. A fixed partitioning strategy for the mesh incurs substantial load imbalance, while a dynamic partitioning strategy entails substantial data movement and, hence, increased communication costs. It was found in [11] that using a fixed partitioning strategy when solving triangular systems of equations on a regular grid results in low upper bounds on efficiency even in the absence of communication. A higher degree of parallelism in ILU(0) can be achieved by using a different ordering of unknowns, but typically such an ordering adversely affects the convergence of the underlying iterative method. Therefore, for general sparse matrices, the ILU(0) preconditioner is ill-suited for implementation on a distributed-memory parallel computer. Instead, we settle on an ILU preconditioner that is processor-implicit, i.e., ILU(0) is carried out for all the vertices internal to a processor. Thus, at a macro level, the overall preconditioner can be viewed as an approximate block Jacobi iteration, wherein each block is assigned to a processor and an approximate LU factorization, viz. ILU(0), is carried out. A block here refers to a subdomain consisting of all the unknowns assigned to a processor. In the preconditioning phase, the ILU factorization is carried out for each processor by zeroing out the matrix entries whose column numbers lie outside the processor domain. This is equivalent to solving the problem within each processor with zero Dirichlet boundary conditions during the preconditioning. This approximation is consistent with the steady state solution ΔW = 0 everywhere. The overall preconditioner is weaker than the global ILU(0), and degenerates to a block diagonal preconditioner in the limit of one grid point per processor. Thus, as the number of processors increases, degradation in convergence is to be expected. This degradation should be moderate, since the iPSC/860 is a coarse-grained parallel computer. In order to minimize the sequential overhead, we appeal to techniques developed in domain decomposition. For an overview of domain decomposition techniques and their suitability to parallel computers, see [14]. One of the most successful methods in use in domain decomposition is the Schwarz alternating procedure [23] for overlapping subdomains, which can also be implemented as a preconditioner. Two variants of this procedure have been developed in the literature, the additive and the multiplicative algorithms; see [9]. The term additive denotes that the preconditioning can be carried out independently for each subdomain. The scheme outlined above is an example of an additive Schwarz preconditioner. In contrast, the multiplicative Schwarz method requires that the preconditioner be applied in a sequential
way by cycling through the subdomains in some order, as in Gauss-Seidel relaxation. It is possible to extract some coarse-grained parallelism by coloring the subdomains in an additive/multiplicative hybrid, but the potential is limited. Therefore, in a parallel context, the additive Schwarz method is preferred. A powerful idea for elliptic problems advocated in [9] is the use of a coarse grid in order to bring some global influence to bear on the problem, similar in spirit to a two-level multigrid algorithm. The coarse grid operator is applied multiplicatively in our context, i.e., the coarse grid problem is solved first. The solution from the coarse grid problem is subsequently used by the processors during the additive (parallel) phase as Dirichlet data at the subdomain boundaries. Applying the coarse grid in this manner does impose a penalty in a parallel setting; it becomes a sequential bottleneck. Additive coarse grid operators are also common [5]. In this reference, the multiplicative and additive Schwarz algorithms are applied to the solution of nonsymmetric elliptic problems. An almost h-independent convergence, where h is the fine grid size, is observed provided the coarse grid is fine enough. In that work the coarse grid operator was formed by discretizing the partial differential equation on a coarse grid. However, in our application, this would require a triangulation followed by a discretization on this coarse grid. Generation of interpolation operators to transfer information between the coarse and the fine grids would also be necessary. We avoid all these complexities by appealing to an alternative way of obtaining a coarse grid operator, described in [33]. A coarse grid Galerkin operator is easily derived from a given fine grid operator by specifying the restriction and prolongation operators. We choose the restriction operator to be a simple summation of fine grid values, and the prolongation operator to be injection. Under this choice, the coarse grid discretization is similar to the one used in an agglomeration multigrid strategy; see [15, 31]. It amounts to identifying all the vertices that belong to a subdomain with one coarse grid vertex, and summing the equations and the right-hand sides associated with them. Thus the coarse grid system has as many vertices as the number of subdomains. At each time step, a coarse grid system is formed and solved by using a direct solver. The data obtained from the coarse grid is used on the boundaries as Dirichlet data for each subdomain. We have found that in practice a direct solver is seldom needed to solve the coarse grid system; an iteration of incomplete LU decomposition seems to suffice. The implementation of the coarse grid solver is discussed next. Each processor first forms parts of the coarse grid matrix and the right-hand side at every time step. A global concatenation is performed so that each processor has the entire coarse grid system. This system is solved redundantly by each processor by forming approximate L and U factors. During the preconditioning phase, each processor forms a portion of the right-hand side. After the global concatenation, each processor carries out
forward and backward solves and deduces the appropriate Dirichlet data. We have found that at least one cycle of implicit smoothing, similar to that employed in a multigrid context [16], is needed to mitigate the adverse effects of injection of the solution from the coarse to the fine grid. Therefore, on the fine grid, after injection, given the old vector, u^old, the following system of implicit equations is solved for the new solution vector, u^new:
where ε is taken to be 0.5, d is the degree of vertex i, and the summation is over the neighbors of each vertex. We have found one Jacobi iteration applied to Eqn. (12) to be sufficient. This smoothing step involves communication at the boundaries. We have also developed a weaker smoother that dispenses with the communication associated with the Jacobi smoothing, but yields comparable convergence. This technique, termed modified Jacobi smoothing, smooths the neighboring coarse grid data (to be used as Dirichlet data) with the data that the processor holds. This step is given by the following relation:
where U_D is the old Dirichlet data, U_D^new is the new Dirichlet data, and U_loc is the value of the coarse grid vertex assigned to the processor.
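As a concrete illustration of the agglomeration strategy described above, the following sketch (in C) forms the coarse grid system by summing the fine grid equations and right-hand sides over each subdomain: the restriction is a simple summation and the prolongation is injection. The CSR storage layout and all identifiers are illustrative assumptions, not the data structures of the actual code.

/* Sketch: Galerkin coarse operator by agglomeration.  Restriction is a
   summation of the fine grid equations over each subdomain; prolongation is
   injection.  The CSR layout and all names are illustrative assumptions. */
#include <string.h>

void form_coarse_system(int n,                  /* number of fine grid vertices   */
                        const int *row_ptr,     /* CSR row pointers               */
                        const int *col_idx,     /* CSR column indices             */
                        const double *val,      /* CSR matrix entries             */
                        const double *rhs,      /* fine grid right-hand side      */
                        const int *sub,         /* sub[v] = subdomain of vertex v */
                        int p,                  /* number of subdomains           */
                        double *Ac,             /* p*p coarse matrix, row-major   */
                        double *bc)             /* coarse right-hand side         */
{
    memset(Ac, 0, (size_t)p * (size_t)p * sizeof(double));
    memset(bc, 0, (size_t)p * sizeof(double));
    for (int v = 0; v < n; v++) {
        int I = sub[v];                         /* coarse row: subdomain of v     */
        bc[I] += rhs[v];                        /* sum the right-hand sides       */
        for (int k = row_ptr[v]; k < row_ptr[v + 1]; k++) {
            int J = sub[col_idx[k]];            /* coarse column: neighbor's subdomain */
            Ac[I * p + J] += val[k];            /* sum the equations              */
        }
    }
}

In the parallel setting described above, each processor would contribute only the coarse-grid entries associated with its own vertices; the global concatenation then gives every processor the complete coarse system, which has one unknown per subdomain.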
5
Performance on the Intel iPSC/860
The Intel iPSC/860 is a multiple instruction/multiple data stream (MIMD) parallel computer. The machine used has 128 processor nodes. Each node comprises a 40 MHz Intel i860 microprocessor, 8 MBytes of memory, and a Direct Connect Module (DCM) which handles communication in the hypercube communication network. Each node has a peak performance of 60 Mflops in 64-bit arithmetic. The bi-directional hypercube interconnect facilitates communication across the nodes. Flow past a four-element airfoil in a landing configuration at a freestream Mach number M∞ = 0.2 and an angle of attack of 5° is considered as a test case. Performance results are presented for two problem sizes that are representative of two-dimensional inviscid flows. The coarse mesh has 6019 vertices, 17,473 edges, 11,451 triangles, 4 bodies, and 593 boundary edges. The fine mesh has 15,606 vertices, 45,878 edges, 30,269 triangles, 4 bodies, and 949 boundary edges. Figure 2 shows the coarse grid about the four-element airfoil. In the Cray implementation of the explicit code, vectorization is achieved by coloring the edges of the mesh (more details may be found in [3]). The Cray implementation is highly optimized and the performance of the code is comparable to that of existing structured
grid codes. The implicit code was not optimized for the Cray Y-MP, since it was developed on the Intel iPSC/860. The result is that it runs in an almost scalar fashion on the Cray, except for the right-hand side computation. However, a similar implicit unstructured mesh Navier-Stokes code was implemented earlier on the Cray Y-MP and optimized [27] to run at approximately 110-120 megaflops. All the megaflop numbers in this section are based on operation counts using the Cray hardware performance monitor.
FIG. 2. Coarse grid about a four-element airfoil.
The explicit scheme is a four-stage Runge-Kutta scheme and uses a Courant-Friedrichs-Lewy (CFL) number of 1.4. The CFL number is the ratio of the time step to the time required for the fastest signal in the hyperbolic PDE system to cross a grid cell. Thus the CFL number can be thought of as a non-dimensionalized time step. With the GMRES/DIAG scheme, the start-up CFL number is 3 and the CFL number is allowed to vary in inverse proportion to the L2 norm of the residual up to a maximum of 30. With GMRES/ILU, the start-up CFL number is 20 and the CFL number is allowed to vary in inverse proportion to the L2 norm of the residual up to a maximum of 200,000. With both implicit schemes, the number of GMRES search directions is limited to 15. Hence, we use a fixed-storage inexact Newton method [8]. The performances of the explicit and the implicit schemes are compared on the Intel iPSC/860. Tables 1 and 2 show the times per iteration in
seconds and the convergence rates. The convergence rate is defined as (R_n/R_1)^{1/(n-1)}, where R_n is the L2 norm of the residual of the density equation at the end of the nth time step and R_1 is the residual at the end of the first time step. Figure 3 shows the convergence histories for the fine mesh as a function of the number of iterations. It may be observed that the explicit scheme is barely converging while the implicit schemes converge much faster. The GMRES/ILU processor-implicit preconditioning exhibits degradation in convergence as the number of processors increases, but the degradation is moderate. It is also seen that the convergence histories with GMRES/ILU gravitate towards that of GMRES/DIAG as the number of processors increases. In the limit of 1 grid point per processor the two will be identical. Since the problem does not fit on one processor of the Intel iPSC/860, the uni-processor runs were carried out on the Cray Y-MP. Even with 128 processors, the GMRES/ILU scheme requires only about 20% more iterations than the ideal 1 processor scheme to obtain the same level of convergence (5 orders of reduction in the residual norm). Since the time to completion is of ultimate interest, Figure 4 shows the convergence histories as a function of the elapsed times with the number of processors fixed at 64. It clearly shows the superiority of the GMRES/ILU processor-implicit technique over the explicit and the GMRES/DIAG schemes.
TABLE 1 Performance of the implicit scheme on the Intel iPSC/860 - 6019 vertices.
Scheme        Measure            No. of processors
                                 1       4       8       16      32      64
RK4           Time/iter (sec)    --      1.07    0.59    0.32    0.20    0.13
              Conv. rate         0.973   0.973   0.973   0.973   0.973   0.973
GMRES/DIAG    Time/iter (sec)    --      3.06    1.66    0.95    0.59    0.42
              Conv. rate         0.874   0.874   0.874   0.874   0.874   0.874
GMRES/ILU     Time/iter (sec)    --      4.42    2.36    1.32    0.77    0.52
              Conv. rate         0.791   0.795   0.796   0.797   0.797   0.797
Finally, we examine the effects of using a coarse grid as discussed in Section 4 to improve convergence for the 15,606-vertex mesh. Figure 5 shows the convergence histories as a function of iterations for the uniprocessor, 32-processor and 128-processor cases, with and without the use of a coarse grid. A cycle of modified Jacobi smoothing is employed as part of the preconditioner in order to stabilize the procedure with the coarse grid system, and the coarse grid system is solved redundantly by all processors. The convergence has improved dramatically, illustrating the power of the
FIG. 3. Convergence histories with GMRES/ILU on the fine mesh.
TABLE 2 Performance of the implicit scheme on the Intel iPSC/860 - 15606 vertices.

Scheme        Measure            No. of processors
                                 1       16      32      64      128
RK4           Time/iter (sec)    --      0.78    0.43    0.25    0.15
              Conv. rate         0.997   0.997   0.997   0.997   0.997
GMRES/DIAG    Time/iter (sec)    --      2.19    1.24    0.75    0.51
              Conv. rate         0.968   0.968   0.968   0.968   0.968
GMRES/ILU     Time/iter (sec)    --      3.07    1.73    1.07    0.65
              Conv. rate         0.870   0.878   0.878   0.880   0.891
coarse grid; the convergence with 128 processors is even better than that obtained with the uni-processor scheme. Unfortunately, this improved convergence does not translate into a reduction in the time required to solve the problem. This is illustrated in Figure 6, which shows the convergence histories as a function of elapsed times on 32 and 128 processors with and without the coarse grid. In both the 32- and the 128-processor cases, it may be observed that the times required to solve the problem are nearly the same with and without the use of the coarse grid. On a per iteration basis, the elapsed times for the 32-processor case are 1.73 and 1.87 seconds respectively, without and with the coarse grid. For the 128-processor case.
FIG. 4. Convergence histories as a function of elapsed times on the fine mesh with 64 processors.
these times are 0.64 and 1.02 seconds. This points to a major drawback of using the coarse grid system to improve convergence on a parallel computer. With too small a coarse grid system, the effort required to solve the system is minimal, but so is the improvement in convergence. With a larger coarse grid system, the gain in convergence is substantial but comes at a greater cost. The coarse grid operator, being sequential in nature, predominates as the number of processors increases. When invoking the coarse grid operator, a large fraction of the time is spent in the global concatenation. Thus, if the parallel computer were to have better communication rates, the technique would be more competitive in terms of elapsed times as well. In order to get an idea of the relative performances of the codes on the Intel iPSC/860 and the Cray Y-MP/1, performance data from the Cray implementation are given. The elapsed times on the Cray Y-MP/1 are, respectively, 0.15 and 0.39 seconds per time step for the coarse and fine meshes with the explicit scheme. The explicit code runs at 150 megaflops on the Cray Y-MP/1. The megaflop ratings on the Cray are obtained using the hardware performance monitor. By simple scaling it may be verified that the explicit code runs at nearly 400 megaflops on 128 processors of the Intel iPSC/860 with the larger problem. Timings for the implicit scheme on the Cray Y-MP/1 are not provided since the codes have not been optimized.
FIG. 5. Convergence histories for the 15606-vertex case as a function of iterations with and without the use of a coarse grid.
6   Conclusions
It has been shown that implicit schemes can be carefully designed to yield good performance when solving unstructured grid problems on parallel computers. Explicit schemes are not competitive despite their parallelism. The GMRES technique with ILU preconditioning within each processor performs better than the GMRES procedure with block diagonal preconditioning. In the case where cells are partitioned, a block diagonal preconditioning for the interface points is shown to be effective. The implicit schemes presented in this paper offer almost complete parallelism at the expense of minimal sequential overhead, thus resulting in efficient parallel algorithms. Finally, it is demonstrated that the use of a global coarse grid, while improving convergence, does not result in a reduction in the time to solve the problem on the Intel iPSC/860.
7
Acknowledgements
The author thanks the Numerical Aerodynamics Simulation facility at NASA Ames Research Center for the use of the Intel iPSC/860. The author also thanks Tim Barth of NASA Ames Research Center for providing the explicit Cray code, and David Keyes and Moulay Tidriri of ICASE for the
many helpful discussions on domain decomposition methods.
FIG. 6. Convergence histories for the 15606-vertex case as a function of elapsed times with and without the use of a coarse grid.
References [1] K. Ajmani, M. Liou and, R. W. Dyson, Preconditioned implicit solvers for the Navier-Stokes equations on distributed-memory machines, AIAA Paper 940408, presented at the 32nd Aerospace Sciences meeting, Reno, NV, Jan. 1994. [2] E. Anderson and Y. Saad, Solving sparse triangular systems on parallel computers, International Journal of High Speed Computing, Vol. 1, No. 1 (1989), pp. 73-96. [3] T. J. Barth and D. C. Jespersen, The design and application of upwind schemes on unstructured meshes, AIAA Paper 89-0366, presented at the 27nd Aerospace Sciences meeting, Reno, NV, Jan. 1989. [4] P. N. Brown and Y. Saad, Hybrid Krylov methods for nonlinear systems of equations, SIAM Journal of Scientific and Statistical Computing, Vol. 11 (1990), pp. 450-481. [5] X. Cai, W. D. Gropp, and D. E. Keyes, A comparison of some domain decomposition and ILU preconditioned algorithms for nonsymmetric elliptic problems, to appear in Journal of Numerical Linear Algebra with Applications, 1994. [6] X. Cai, W. D. Gropp, D. E. Keyes, and M. D. Tidriri, Newton-KrylovSchwarz methods in CFD, Proceedings of the International Workshop on Numerical Methods for the Navier-Stokes Equations (F. Hebeker and R. Rannacher, eds.), Notes in Numerical Fluid Mechanics, Vol. 47, Vieweg Verlag,
Braunschweig/Wiesbaden, Germany (1994), pp. 17-30. [7] R. Das, D. J. Mavriplis, J. Saltz, S. Gupta, and R. Ponnusamy, The design and implementation of a parallel unstructured Euler solver using software primitives, AIAA Journal, Vol. 32, No. 3 (1994), pp. 489-496. [8] R. S. Dembo, S. C. Eisenstat, and T. Steihaug, Inexact Newton Methods, SIAM Journal of Numerical Analysis, Vol. 19 (1982), pp. 400-408 [9] M. Dryja and O. B. Widlund, Towards a unified theory of domain decomposition algorithms for elliptic problems, Third International Symposium on Domain Decomposition Methods for Partial Differential Equations, T. Chan, R, Glowinski, J. Periaux, and 0. B. Widlund, eds., SIAM, Philadelphia (1990), pp. 3-21. [10] C. Farhat and S. Lanteri, Simulation of compressible viscous flows on a variety of MPPs: computational algorithms for unstructured dynamic meshes and performance results, Tech. Report 2154, Institut National de Recherche en Informatique et en Automatique (INRIA), Sophia-Antipolis, France. 1994. [11] A. Greenbaum, Solving sparse triangular linear systems using FORTRAN with parallel extensions on the NYU Ultracomputer prototype, Technical Report Ultracomputer Note 99, April 1986. [12] Z. Johan, T. J. R. Hughes, K. K. Mathur, and S. L. Johnsson, A data parallel finite element method for computational fluid dynamics on the. Connection Machine system, Computer Methods in Applied Mechanics and Engineering, Vol 99 (1992), pp. 113-124. [13] Y. Kallinderis and A. Vidwans, Generic parallel adaptive-grid Navier-Stokes algorithm, AIAA Journal, Vol. 32, No. 1 (1994), pp. 54 61. [14] D. E. Keyes, Domain decomposition: a bridge between nature and parallel computers, in Adaptive, multilevel and hierarchical computational strategies, (A. K. Noor ed.), ASME, New York, pp. 293-334. [15] M. H. Lallemand, H. Steve, and A. Dervieux, Unstructured multigridding by volume agglomeration: current status, Computers and Fluids, Vol. 21, No. 3 1992, pp. 397-433. [16] D. J. Mavriplis and A. Jameson, Multigrid solution of the two-dimensional Euler equations on unstructured, triangular grids, AIAA Journal, Vol. 26, No. 7 (1988), pp. 824-831. [17 P. R. McHugh and D. A. Knoll, Inexact Newton's method solutions to the incompressible Navier-Stokes and energy equations using standard and matrixfree implementations, AIAA Paper 93-3332CP, Proceedings of the llth AIAA Computational Fluid Dynamics Conference, Orlando, FL (1993). pp. 385-393. [18] J. A. Meijerink and H. A. van der Vorst, Guidelines for the usage of incomplete decompositions in solving sets of linear equations as they Occur in practical problems, Journal of Computational Physics, Vol. 44, No. 1 (1981), pp. 134155. [19] W. A. Mulder and B. van Leer, Implicit upwind methods for the Euler equations, AIAA Paper 83-1930, presented at the 6th AIAA CFD conference, Danvers. MA, July 1983. [20] A. Potheri, H. D. Simon, and K. P. Lion, Partitioning sparse matrices with eigenvectors of graphs, SIAM Journal of Matrix Analysis and Applications, Vol. 11 (1990), pp. 430-452. [21] R. Ramamurthi, R. Lolmer, and W. Sandberg, Evaluation of a scalable 3-D finite element incompressible flow solver, AIAA Paper 94-0756, presented at the 32nd Aerospace Sciences meeting, R.eno. NV. Jan. 1994.
[22] Y. Saad and M. H. Schultz, GMRES: A generalized minimal residual algorithm for solving nonsymmetric linear systems, SI AM Journal of Scientific and Statistical Computing, Vol. 7, No. 3 (1986), pp. 856-869. [23] H. Schwarz, Gesammelte mathematische abhangdlungen, Berlin, Springer, vol. 2 (1890), pp. 133-134. [24] F. Shakib, T. J. R. Hughes, and Z. Johan, A multi-element group preconditioned GMRES algorithm for nonsymmetric problems arising in finite element analysis, Computer Methods in Applied Mechanics and Engineering, Vol. 87 (1989), pp. 415-456. [25] D. C. Slack. D. L. Whitaker, and R. W. Walters, Time integration algorithms for the two-dimensional Euler equations on unstructured meshes, AIAA Journal, Vol. 32, No. 6 (1994), pp. 1158-1166. [26] H. D. Simon, Partitioning of unstructured problems for parallel processing, Computing Systems in Engineering, Vol. 2, No. 2/3 (1991), pp. 135-148. [27] V. Venkatakrishnan and D. J. Mavriplis, Implicit solvers for unstructured meshes, Journal of Computational Physics, Vol. 105 (1993), pp. 83-91, [28] V. Venkatakrishnan, Parallel implicit unstructured grid Euler solvers, AIAA Journal, Vol. 32, No. 10 (1994), pp. 1985-1991. [29] V. Venkatakrishnan, H. D. Simon, and T. J. Barth, A MIMD implementation of a parallel Euler solver for unstructured grids, The Journal of Supercomputing, Vol. 6 (1992), pp. 117-137. [30] V. Venkatakrishnan, Parallel computation of Ax and ATx, International Journal of High Speed Computing, Vol. 6, No.2 (1994) pp. 325-342. [31] V. Venkatakrishnan and D. J. Mavriplis, Agglomeration multigrid for the threedimensional Euler equations, AIAA Paper 94-0069, January 1994; to appear in AIAA Journal. [32] A. Vidwans, Y. Kallinderis and V, Venkatakrishnan. Parallel dynamic loadbalancing algorithm for three-dimensional adaptive unstructured grids, AIAA Journal, Vol. 32, No. 3 (1994), pp. 497-506. [33] P. Wesseling, An introduction to multigrid methods, John Wiley & Sons, New York, 1992. [34] W. K. Anderson, A grid generation and flow solution method for the Euler equations on unstructured grids, Journal of Computational Physics, Vol. 110, No. 1 (1994), pp. 23-38. [35] W. K. Anderson and D. L. Bonhaus, An Implicit upwind algorithm for computing turbulent flows on unstructured grids, Computers and Fluids, Vol. 23, No. 1 (1994), pp. 1-21. [36] D. L. Whitaker, Three-dimensional unstructured grid Euler computations using a fully-implicit, upwind method, AIAA Paper 93-3337-CP, Proceedings of the llth AIAA Computational Fluid Dynamics Conference, Orlando, FL (1993), pp. 448-461.
Chapter 5 Newton-Krylov-Schwarz Methods Applied to the Tokamak Edge Plasma Fluid Equations D. A. Knoll
P. R. McHugh
V. A. Mousseau
Abstract. The tokamak edge plasma fluid equations are a highly nonlinear system of two-dimensional convection-diffusion-reaction partial differential equations that describe the boundary layer of a tokamak fusion reactor. These equations are characterized by multiple time and spatial scales. We use Newton's method to linearize the nonlinear system of equations resulting from the implicit, finite volume discretization of the governing partial differential equations. The resulting linear systems are neither symmetric nor positive definite, and are poorly conditioned. Preconditioned Krylov iterative techniques are employed to solve these linear systems. We investigate both standard and matrix-free implementations. Additive and multiplicative Schwarz preconditioners are investigated and compared with incomplete lower-upper factorization (ILU) preconditioning. While this system of equations describes a specific application, the general algorithm should be applicable to other convection-diffusion-reaction systems of equations. We include solutions to five-equation systems on 256 x 64 grids (80K unknowns), as well as solutions to eleven-equation systems on 128 x 32 grids (44K unknowns).
1   Introduction
The tokamak edge plasma fluid equations are a highly nonlinear set of convection-diffusion-reaction equations that describe the boundary layer plasma of a tokamak fusion reactor. They contain widely varying time and spatial scales, and the transport coefficients and reaction rates are strong functions of density and temperature. These equations describe the flow of plasma particles and energy from the edge of the reactor core into what is called the divertor region, which serves as the exhaust system.
Work supported under the auspices of the U.S. Department of Energy Office of Scientific Computing, Applied Mathematics Program; Office of Fusion Energy, Fusion Theory and Computation Program; and the EG&G Idaho Long Term Research Initiative in Computational Mechanics under DOE Idaho Field Office Contract DE-AC07-76ID01570.
†Idaho National Engineering Laboratory, Idaho Falls, ID 83415-3895.
Designing a divertor system that can handle the large power flows inherent in reactor-relevant edge plasmas is one of the most challenging issues facing fusion scientists today. Thus, the ability to solve the edge plasma equations robustly and efficiently is of great interest to the international fusion community. The multiple time scales contained in these equations motivate the use of a highly implicit algorithm; and the strong nonlinear coupling between the equations suggests the application of Newton's method. This type of algorithm was initially applied to the edge plasma equations on a simply connected geometry using an analytical Jacobian and banded Gaussian elimination [13, 16]. This work demonstrated the viability of a Newton's method solution to the edge plasma equations. The need for solutions on multiply-connected geometries with larger and more complex systems of equations motivated the use of Newton-Krylov methods [3, 4], incomplete lower upper (ILU) preconditioning [26], and numerically evaluated Jacobians [14, 15]. As the complexity and coupling of our system of equations has increased, to more accurately model the physics, two features of the solution process have changed notably: first, we must use a finite time step in a pseudo-transient relaxation; and second, the evaluation of the numerical Jacobian has become a co-dominant part of our computation. These changes have motivated the use of matrix-free Newton-Krylov methods [3, 6, 17]. We use "matrix-free" in a more general sense than that of computing actual Jacobian elements on the fly. Here Jacobian elements are never computed, except possibly as required in the formation of various preconditioners. Rather, finite differences of residuals are used to approximate the Jacobian-vector products required in the Krylov iteration algorithm. This approximation may enable one to significantly reduce the number of times the Jacobian must be formed. The need for finer grid solutions and increased numbers of equations has also motivated the study of Schwarz methods (domain decomposition) [7] for improved preconditioning. We have seen that as the number of equations is increased and/or the grid is refined, global ILU preconditioning becomes less effective than domainbased preconditioning. The use of domain-based preconditioning also allows a natural extension to parallel implementations of the algorithm. Our use of domain-based preconditioning for Newton-Krylov methods has been influenced by previous research on divide and conquer techniques in computational fluid dynamics, and as applied to convection-diffusionreaction systems specifically. In the early work of Vanka [24] domain decomposition was used as an outer iteration with Newton solves on the subdomains. This work solved the two-dimensional incompressible NavierStokes equations, and a sparse Gaussian elimination method was used as the linear solver. Overlapping Dirichlet boundary conditions connected the subdomains explicitly. Keyes [12] studied Newton's method as the outer iteration with domain decomposition based preconditioning in solving
a two-dimensional, three-equation system representing a laminar diffusion flame. Various iterative techniques were investigated for the solution of the linear system arising at a variety of points in a pseudo-transient Newton's method. He considered both no communication between subdomains, and communication between subdomains via a "Modified Schur Complement (MSC)-coupled" method. Gropp and Keyes [10] studied a Newton-Krylov algorithm for the stream function vorticity model of two-dimensional, incompressible Navier-Stokes using the "Tile" algorithm to include coupling between subdomains. Other recent applications of domain decomposition with Newton-Krylov type solvers include Cai et. al. [6], Jacobs et. al. [18], Dutto et. al. [8], and Ajmani et al [1]. In Section 2 we describe the physics and geometry of the edge plasma, discuss the governing equations, and present results from solutions to two model problems. Our standard Newton-Krylov algorithm is briefly discussed in Section 3, with the majority of the section devoted to motivating, describing, and showing performance of matrix-free Newton-Krylov methods. Section 4 describes the Schwarz preconditioning algorithms and presents results using multiplicative and additive Schwarz methods. Section 5 contains our conclusions and observations.
2
Geometry and Physical Model
The tokamak edge plasma, also referred to as the scrape-off layer, is that plasma which lies between the last closed magnetic flux surface, called the separatrix, and the vessel wall. This plasma is made up of hydrogen ions, electrons, impurity ions from the vessel structure such as carbon, as well as hydrogen and impurity neutral atoms and molecules. A schematic of toroidal coordinates, the poloidal symmetric cross-section of a single-null divertor tokamak, and the simplified Cartesian representation of this cross section are shown in Figure 1. The dashed line in Figure Ib is a branch cut, in the mapping to Figure Ic. The components of toroidal coordinates are major radius, R, minor radius, r, toroidal angle,
FIG. 1. Toroidal coordinates, poloidal cross-section, and Cartesian representation.
closed magnetic field lines and the magnetic separatrix. Once in the edge region, these flows are rapidly transported along open magnetic field lines towards the divertor plates. The plasma energy is deposited on the divertor plates and the plasma ions are neutralized and diffuse back into the plasma as atoms or molecules. These neutrals are ionized by the plasma electrons and charge exchange with plasma ions. The ions, which are accelerated into the divertor plates by the plasma sheath [22], may physically sputter material off the divertor plates introducing impurities into the plasma. A typical impurity in existing experiments is carbon. When carbon impurities are introduced, they become a sink of energy to the electrons through atomic line radiation. Since the transport of the carbon and the energy loss due to carbon are both functions of the ionization state distribution, we follow each ionization state of carbon as a separate fluid. This is analogous to following separate continuity equations for each reacting species in a system of equations describing combustion. The fluid equations of the edge plasma are a simplification of the Braginskh equations [2]. The set of equations we present here should not be considered comprehensive, but representative. We have not yet included the diamagnetic and E x B drift terms, although they have been included
by colleagues in a similar Newton-Krylov algorithm [19]. The neutral atom transport model used here is a simple diffusion model, while others have used a full Monte-Carlo kinetic simulation [11]. The assumptions are: toroidal symmetry, which reduces our problem to two dimensions, poloidal and radial; no current, i.e. ambipolar flow; and purely diffusive radial transport. It is assumed that the neutrals, which are generated at the divertor plates, can be modeled with a single energy diffusion equation [25]; and the effects of viscous heating have been ignored. All ions are assumed to have the same temperature, but electrons have a separate temperature. The plasma is assumed transparent to all radiation. Momentum transfer between ions and neutrals has been ignored. Our problem is then modeled by the following set of convection-diffusionreaction equations with x the poloidal direction and y the radial direction : Plasma continuity,
Neutral continuity,
Impurity charge state z continuity,
Ion parallel momentum,
Impurity parallel momentum,
Ion internal energy,
Electron internal energy,
and Parallel Ohm's law,
Here, n_i is the ion number density, n_e is the electron number density, n_0 is the neutral number density, n_z is the number density of impurity charge state z, whose charge is Ze, u_∥ is the velocity along the total magnetic field (B), u is the poloidal velocity, and v is the radial velocity, m
Poloidal velocity,
Electron velocity (no current) and density (quasi-neutrality),
In this model the poloidal transport coefficients including thermal conductivity (K), viscosity (rj), and the thermal friction coefficients (a and /5) are classical Braginskii [2], while the radial transport coefficients are anomalous, i.e., empirical. The source and sink terms are due to three atomic collision processes: electron impact ionization, electron ion recombination, and charge exchange between neutral hydrogen and impurity ions. The respective Maxwellian averaged cross-sections are < av >-ie. < av >rec, and < w >CXrec, all of which are functions of temperature and density. L0 and Lz represent the energy lost due to atomic line radiation from the neutral hydrogen and impurities, respectively. Modeling the true geometry of the edge plasma requires a non-orthogonal curvilinear, multiply-connected geometry. Although we have this capability, for this work we use a Cartesian representation of a single null geometry as shown in Figure 1. We model the full plate to plate geometry of a symmetric problem. This enables us to simulate the same problem using both a simply-connected geometry and a multiply-connected geometry, each with the same solution. In the simply-connected geometry zero flux boundary conditions are imposed on the cuts in Figure Ic. This reduces the number of non-zero diagonals significantly as compared to the multiply-connected geometry. The plate to plate distance is 8 meters with the null point being 0.7 meters from each plate. The radial distance from the separatrix to the wall boundary is 0.1 meters. We will look at the performance of the algorithms in solving two different problems on this geometry, one with the base five-equation system, and one with an eleven-equation system which includes the effects of a carbon impurity. Half of a 64 x 16 grid, the plasma density, and the electron temperature for the solution of the five-equation system on a 256 x 64 grid can be seen in Figure 2. In the x direction the grid is clustered near the divertor plate and in the y direction it is clustered near the separatrix. The density contour plot shows the density build up in front of the plate due to neutral recycling. The electron temperature shows a sharp radial gradient across the separatrix and a sharp poloidal gradient near the plate due to the energy loss which is required to ionize the recycling neutral atoms. The eleven-equation impurity model problem, solved on a 128 x 32 grid, has a prescribed source of carbon in the vicinity of the divertor plate. This source of carbon produces a significant energy loss in the electrons due to atomic line radiation. In this problem, the plasma is cooled to the point where recombination begins and the neutral atom density becomes larger than that of the plasma near the plate. Figure 3 shows the line plots of electron and ion temperature, ion and neutral density, and impurity
FIG. 2. Half mesh, Plasma Density, and Electron Temperature from the five-equation problem
densities, from the null point to the divertor plate along the separatrix. We see the very rapid adjustment of the carbon density profiles as they move through the sharp temperature jump in front of the plate. From Figure 3 one can see that the carbon densities vary over several orders of magnitude.
FIG. 3. Temperatures and Densities along separatrix, between null point and divertor plate, for eleven-equation problem
3   Matrix-Free Newton-Krylov Method
We use the finite volume method to discretize our equations on a staggered grid with velocities at cell faces and thermodynamic variables at cell centers. This produces a nonlinear system of algebraic equations that we linearize using Newton's method. We employ a numerical Jacobian to simplify implementation, while mesh sequencing, adaptive damping, and pseudo-transient continuation are used to enhance performance and robustness. Preconditioned Krylov methods are used to solve the linear system. To date we have relied mainly on ILU(0) preconditioning. A variety of Krylov algorithms have been used, including the Arnoldi-based GMRES(k) algorithm [20] and the Lanczos-based CGS [21], TFQMR [9], and BiCGSTAB [23] algorithms. For more details see [14]. The Newton-Raphson method is a robust technique for solving systems of nonlinear equations of the form F(x) = [f_1(x), f_2(x), ..., f_n(x)]^T, where x, our state vector, is expressed as x = [x_1, x_2, ..., x_n]^T. Application of the Newton-Raphson method requires the solution of the linear system
    J(x^k) δx^k = -F(x^k),
where the (i, j) element of the Jacobian matrix, J, is given by J_ij = ∂f_i/∂x_j, and the new solution approximation is obtained from
    x^{k+1} = x^k + s δx^k,
where s is a damping scalar, which is less than or equal to one. This iteration is continued until the norm of F(x^k) and/or δx^k falls below some tolerance level. The Krylov projection methods used in this study require the action of the Jacobian only in the form of matrix-vector products; these products may be approximated by [4]
    J v ≈ [F(x + εv) - F(x)] / ε,            (14)
where J is the Jacobian matrix, v is an arbitrary vector, x is the state vector, ε is a perturbation, and F represents the system of discrete nonlinear equations. Equation (14) indicates that the accuracy of the matrix-free approximation is dependent not only upon ε, but also upon the vector v. Since this vector changes within the Krylov iteration, the accuracy of this approximation is subject to some uncertainty. The matrix-free approximation enables the action of the Jacobian without explicitly forming or storing the matrix. This property can be extremely advantageous in problems where forming the Jacobian represents a significant fraction of the total CPU time, and/or storing the Jacobian matrix is prohibitive. In many instances, however, the Jacobian or parts
thereof are still needed to generate an effective preconditioner. In this situation the primary advantage of the matrix-free implementation may lie in reducing the total number of required function evaluations by amortizing the cost of forming the preconditioner over several Newton steps. TABLE 1 Steady state performance data comparing the standard and matrix-free implementations on a 128 x 32, multiply-connected geometry using the five-equation model.
Krylov        Implementation   Newton   Ave. Krylov     Krylov Iter.
Algorithm                      Iter.    Iter./Newton    Limit Hits
CGS           Standard         6        29              0
TFQMR         Standard         5        40              0
Bi-CGSTAB     Standard         6        25              0
GMRES(40)     Standard         6        36              0
CGS*          Matrix-Free      15       150             15
TFQMR         Matrix-Free      11       150             11
Bi-CGSTAB*    Matrix-Free      15       132             13
GMRES(40)     Matrix-Free      6        36              0
* Calculation failed to converge within 15 Newton iterations.
We wish to illustrate two main points in this section. The first is the difference in behavior sometimes observed between the Arnoldi-based and the Lanczos-based Krylov algorithms when applied in the matrix-free mode. The second point is the potential CPU advantage of the matrix-free implementation. In order to illustrate the first point, the five-equation model problem is solved on a 128 x 32 multiply-connected geometry directly (i.e., no pseudo-transient) using the different Krylov algorithms. In all cases the calculations are initialized using an interpolated 64 x 16 solution. Table 1 presents the performance data for these solutions. The Jacobian and preconditioner were formed on every Newton iteration in order to isolate the effect of the matrix-free approximation. The inner iteration performance of the Lanczos-based algorithms is adversely affected by the matrix-free implementation as indicated by the larger average inner iteration counts and the higher number of times the inner iteration failed to converge within the allowed 150 iterations. The GMRES data appears to be unaffected. Among the Lanczos algorithms, only the Newton-TFQMR calculation converged within the allowed 15 Newton iterations using the matrix-free implementation, but its performance was poor compared to the Newton-GMRES(40) calculation. In order to gain some insight into the poor performance of the Lanczos-based Krylov algorithms, the inner iteration convergence behavior is plotted in Figures 4 and 5 for both the
standard and matrix-free implementations. This data corresponds to the first Newton step. Figure 5 clearly contrasts the performance differences exhibited by the Lanczos-based and Arnoldi-based algorithms. Presumably, these performance differences are because the Lanczos-based algorithms are more susceptible to a loss of accuracy when the matrix-free approximation is employed. The Krylov vectors v that appear in Eq. (14) have unit two-norm in the GMRES algorithm, but their norms can vary wildly within the other algorithms. Based upon these observations we choose to focus our attention upon the use of the GMRES algorithm when using the matrix-free implementation.
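The finite-difference Jacobian-vector product of Eq. (14) that underlies the matrix-free implementation can be sketched as follows; the residual routine, the scratch vector, and in particular the choice of the perturbation ε are illustrative assumptions only.

/* Sketch of the finite-difference Jacobian-vector product of Eq. (14),
   J v ~ [F(x + eps v) - F(x)] / eps.  The residual routine F and the
   heuristic for eps are illustrative assumptions. */
#include <math.h>

typedef void (*residual_fn)(int n, const double *x, double *f);

void jacvec_matrix_free(residual_fn F, int n,
                        const double *x,    /* current state               */
                        const double *f0,   /* F(x), already available     */
                        const double *v,    /* arbitrary Krylov vector     */
                        double *Jv,         /* output: approximate J*v     */
                        double *work)       /* scratch vector of length n  */
{
    double normx = 0.0, normv = 0.0;
    for (int i = 0; i < n; i++) { normx += x[i] * x[i]; normv += v[i] * v[i]; }
    normv = sqrt(normv);
    if (normv == 0.0) {                     /* J * 0 = 0                   */
        for (int i = 0; i < n; i++) Jv[i] = 0.0;
        return;
    }
    /* One common heuristic; practical codes choose eps to balance
       truncation error against round-off error. */
    double eps = 1.0e-7 * (sqrt(normx) + 1.0) / normv;

    for (int i = 0; i < n; i++) work[i] = x[i] + eps * v[i];
    F(n, work, Jv);                         /* Jv <- F(x + eps v)          */
    for (int i = 0; i < n; i++) Jv[i] = (Jv[i] - f0[i]) / eps;
}

Only one additional residual evaluation is needed per product, since F(x) is already available from the Newton iteration, and the Jacobian is never formed or stored.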
FIG. 4. Inner iteration convergence behavior for standard implementation on first Newton step.
In order to illustrate the second point mentioned above, we investigate the solution of the eleven-equation model problem on a 64 x 16 simplyconnected geometry. In this model problem we move from one steady-state to another by increasing the carbon source by a factor of three. This problem exhibits very strong nonlinear coupling between equations, which prohibited a solution with the steady-state equations directly. Additionally, the elevenequation system resulted in very costly Jacobian evaluations. Thus, this problem is ideally suited to the use of the matrix-free approximation in
FIG. 5. Inner iteration convergence behavior for matrix-free implementation on first Newton step.
conjunction with pseudo-transient relaxation, i.e.,
    (I/Δt + J(x^k)) δx^k = -F(x^k).
It is important to note that in this pseudo-transient relaxation the residual, F(x^k), is the steady-state residual. Table 2 compares the standard algorithm implementation, which requires evaluation of the Jacobian on each Newton step, with the matrix-free implementation, which requires evaluation of the Jacobian only to generate a new preconditioner. The Jacobian evaluation frequency is varied in conjunction with the matrix-free implementation. Additionally, different ILU preconditioners are considered in an attempt to further reduce the required number of inner iterations. All calculations use Δt = 1 x 10^-4 and each required 280 Newton steps for convergence. Note that all CPU times in this table are normalized with respect to the standard implementation which required 19 hours on an HP 735 workstation. A factor of three speed-up (with respect to the standard algorithm) is obtained using the matrix-free implementation with ILU(2) preconditioning and evaluating the Jacobian every 10 Newton steps. The Newton convergence behavior for these two calculations versus CPU time is shown in Figure 6. Note that no attempt was made at optimizing the calculation by allowing Δt to vary (increase) within the iteration. The data
in Table 2 shows that controlling the inner iteration count is the key to the success of the matrix-free implementation. Improved preconditioning serves to lower the inner iteration counts and so improves the matrix-free implementation performance. In contrast, improved preconditioning for the standard implementation is not warranted because the evaluation of the Jacobian clearly dominates the total CPU time. It is apparent from these results that a low Jacobian evaluation frequency is desirable up to a point. Beyond this point, however, the inner iteration counts will tend to increase because of the lagged preconditioner, as seen in Table 2 when the Jacobian is evaluated once every twenty Newton steps. When this occurs the benefit of fewer Jacobian evaluations is countered by more inner iterations and the overall CPU efficiency diminishes.
TABLE 2 Pseudo-transient performance data comparing the standard and matrix-free implementations on a 64 x 16, simply-connected grid using the eleven-equation model.

Calculation Description                  Ave. Krylov   Tot. CPU   Jac. Eval.   Linear Solve
(Mode - Jacobian Eval. Freq. - Prec.)    Iter.         Time       %            %
St - Every Newton Step - ILU(0)          14            1.00       93           6
MF - Every 2 Newton Steps - ILU(0)       14            0.80       59           40
MF - Every 10 Newton Steps - ILU(0)      15            0.45       25           74
MF - Every 10 Newton Steps - ILU(2)      6             0.33       34           65
MF - Every 20 Newton Steps - ILU(0)      19            0.48       13           86
St = Standard implementation. MF = Matrix-free implementation.

4   Newton-Krylov-Schwarz Method
The viability of Newton-Krylov techniques to solve nonlinear equations often depends on the effectiveness of the preconditioner used in the linear equation solve. The two main reasons why we investigate domain-based preconditioners are improved scaling/robustness and parallelism. We have noted that ILU(O) preconditioning loses its effectiveness as the linear system increases in size. Figure 7 demonstrates this point with a 32 x 8, 64 x 16, 128 x 32, and 256 x 64 grid sequence. ILU(2) scales better than ILU(O), but it is much more expensive both in terms of memory and CPU cost. This trend was also noted in [5]. In contrast, it has been shown that for elliptic problems the Schwarz algorithms, with enough overlap and a coarse grid solve, scale independent of the problem size [5]. Additionally, as our problem size continues to grow, parallel implementation becomes a more
FIG. 6. Newton update versus CPU time (sec) for two pseudo-transient calculations.
important consideration. Traditional global preconditioners like ILU are very difficult to implement in a parallel fashion, but domain decomposition-based algorithms, like additive Schwarz, parallelize almost trivially. The two algorithms investigated here are the additive and multiplicative Schwarz [5] methods. The algorithms are: solve M^{-1} J δx = M^{-1}(-F(x)), where M^{-1} is defined by
    Additive Schwarz:        M^{-1} = J_1^{-1} + J_2^{-1} + ... + J_n^{-1},
    Multiplicative Schwarz:  M^{-1} J = I - (I - J_n^{-1} J) ··· (I - J_1^{-1} J),
where w = M^{-1} v can be computed by the sweep w ← J_1^{-1} v, followed by w ← w + J_i^{-1}(v - J w) for i = 2, ..., n.
Here J_1, ..., J_n are a partition of the full Jacobian matrix J. The choice of a blocking strategy is an extremely important
FIG. 7. Effectiveness of ILU preconditioning versus level of fill-in and Jacobian dimension.
consideration in the successful implementation of the Schwarz methods [18]. By blocking strategy we mean the technique used to partition the whole computational domain into subdomain blocks. Referring to the mesh in Figure 2, dividing the domain into subdomains with equal numbers of cells results in sub-blocks with widely varying physical aspect ratios. Blocks in the lower-left-hand corner are very long and thin while blocks in the upper-right-hand corner are very short and wide. In general, transport along the magnetic field, in the x direction, is much faster than the transport across the field; and for the most part Δx is much larger than Δy. Near the plate, however, Δx can approach Δy. The neutral atom diffusion equation has isotropic transport, i.e., the neutrals do not see the magnetic field. For these reasons determining an effective blocking strategy for the edge plasma equations is not straightforward, and may require significant numerical experimentation. For all of the results to date we have used banded Gaussian elimination as our subdomain solve. We have found that the stripwise blocking which gives the most efficient subdomain solves (narrow bandwidth) produces the least effective preconditioner. Note also that the CPU times for solutions to this model problem were dominated by the evaluation of the Jacobian. Consequently, the overall solution CPU timings are relatively flat and do not allow meaningful comparisons. Thus, we chose to compare preconditioner effectiveness based on inner iteration counts.
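As an illustration of the blocking notation used below, the following hypothetical helper computes the index range of one sub-block in an NBx x NBy partition of an nx x ny grid, optionally extended by a fixed cell overlap in each direction and clipped at the domain boundary. The structure and names are assumptions for the sketch, not the partitioning code actually used.

/* Sketch: NBx-by-NBy blocking of an nx-by-ny grid, with an optional cell
   overlap in each direction, clipped at the domain boundary.  All names
   and the struct are illustrative assumptions. */
typedef struct { int ilo, ihi, jlo, jhi; } Block;     /* inclusive index ranges */

static int imax(int a, int b) { return a > b ? a : b; }
static int imin(int a, int b) { return a < b ? a : b; }

Block make_block(int nx, int ny, int NBx, int NBy,
                 int bi, int bj,                      /* block indices, 0-based    */
                 int overlap)                         /* cells of overlap per side */
{
    int nbx = nx / NBx, nby = ny / NBy;               /* assumes NBx | nx, NBy | ny */
    Block b;
    b.ilo = imax(0,      bi * nbx - overlap);
    b.ihi = imin(nx - 1, (bi + 1) * nbx - 1 + overlap);
    b.jlo = imax(0,      bj * nby - overlap);
    b.jhi = imin(ny - 1, (bj + 1) * nby - 1 + overlap);
    return b;
}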
Table 3 shows the number of inner TFQMR iterations for three Newton steps on a 128 x 32 grid with the eleven-equation model problem. This table has results for four different blocking strategies. These strategies are indicated by NBx x NBy, where NBx is the number of blocks in the x direction and NBy is the number of blocks in the y direction. The size of the sub-blocks is nbx x nby, where nbx = nx/NBx and nby = ny/NBy (for Table 3, nx = 128 and ny = 32). The columns of the table represent additive Schwarz with and without a two-cell overlap and multiplicative Schwarz with and without a two-cell overlap. The two-cell overlap is in both the x and the y directions. Table 4 shows similar results for the 64 x 16 simply-connected grid.
TABLE 3 TFQMR Iterations Per Three Newton Steps, 128 x 32 (ILU(0) = 205)

                Additive                      Multiplicative
Blocking   No Overlap   2-Cell Overlap   No Overlap   2-Cell Overlap
2x2        126          61               140          39
8x2        209          NEM              115          NEM
4x4        377          275              204          89
2x8        440          482              243          120

† exceeded maximum inner iterations (200)
NEM = Not Enough Memory
TABLE 4 TFQMR Iterations Per Three Newton Steps, 64 x 16 (ILU(0) = 65)

                Additive                      Multiplicative
Blocking   No Overlap   2-Cell Overlap   No Overlap   2-Cell Overlap
2x2        91           94               48           23
4x1        51           40               23           16
1x4        180          91               76           39
As indicated in these tables, several of the blocking strategies reduced the inner iteration count significantly relative to ILU(O) preconditioning. For equal numbers of blocks, the large NBX small NBy blocking strategy outperforms the small NBX large NBy blocking strategy. In all but one case, overlapping reduced inner iteration count, and multiplicative Schwarz always outperformed additive Schwarz, sometimes by more than the factor of two that one expects from the theory of model elliptic problems [6]. Figure 8 isolates the 4 x 1 blocking on the 64 x 16, grid and varies the overlap with additive Schwarz. We can see that no additional benefit
is gained by using more than a three-cell overlap. Figure 9 compares overlaps of 0 and 2 on 2 x 1, 4 x 1, 8 x 1, 16 x 1, and 32 x 1 blockings with additive Schwarz preconditioning. We see that the performance, with respect to the number of blocks, is superior with overlap.
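A sketch of one multiplicative Schwarz application, w = M^{-1} v, sweeping the subdomain blocks in order as defined at the start of this section, is given below. The helpers for gathering a block of the residual, performing the banded subdomain solve, and scattering the correction back are assumed to exist; all names and the scratch-array handling are illustrative.

/* Sketch: one multiplicative Schwarz application w = M^{-1} v, sweeping the
   subdomain blocks in order.  matvec(), restrict_to(), local_solve(), and
   add_from() are assumed helpers (global product J*x, gather of a block,
   banded solve on the block, scatter-add of the correction). */
void mult_schwarz_apply(int nblocks, int n,
                        const double *v, double *w,
                        double *r, double *rb, double *wb,      /* scratch */
                        void (*matvec)(const double *x, double *y),
                        void (*restrict_to)(int b, const double *x, double *xb),
                        void (*local_solve)(int b, const double *xb, double *yb),
                        void (*add_from)(int b, const double *yb, double *y))
{
    for (int i = 0; i < n; i++) w[i] = 0.0;
    for (int b = 0; b < nblocks; b++) {
        matvec(w, r);                                    /* r  <- J w          */
        for (int i = 0; i < n; i++) r[i] = v[i] - r[i];  /* r  <- v - J w      */
        restrict_to(b, r, rb);                           /* rb <- R_b r        */
        local_solve(b, rb, wb);                          /* wb <- J_b^{-1} rb  */
        add_from(b, wb, w);                              /* w  <- w + R_b^T wb */
    }
}

In contrast with the additive variant, a fresh global residual v - Jw is needed before each block solve, which is what makes the method sequential across blocks.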
FIG. 8. TFQMR Iterations per three Newton Iterations on a 64 x 16 grid with eleven-equation problem
5
Conclusions
Newton-Krylov algorithms using both ILU and domain-based preconditioners were used to solve the highly nonlinear set of convection-diffusion-reaction equations that describe the boundary layer plasma of a tokamak fusion reactor. Fine grid calculations using both the base five-equation system (256 x 64 grid) and an eleven-equation system that included carbon impurities (128 x 32 grid) demonstrated the capabilities of the standard solution algorithm. Additionally, a matrix-free implementation of this algorithm was investigated. This implementation replaced Jacobian-vector products within the Krylov iteration with finite difference projections. Results using this approximation indicated that the Arnoldi-based algorithm, GMRES(40), performed better than the Lanczos-based algorithms considered (CGS, TFQMR, and Bi-CGSTAB). Compared with the standard algorithm solution to the eleven-equation system on a 64 x 16 grid, the matrix-
FIG. 9. TFQMR Iterations per three Newton Iterations on a 64 x 16 grid with eleven-equation problem
free implementation enabled a threefold speedup by amortizing the cost of each ILU preconditioner formation over ten pseudo-transient Newton steps. The use of domain-based Schwarz methods was studied using the standard Newton-Krylov algorithm in order to obtain preconditioners that exhibit better scaling and inherent parallelism. Additive and multiplicative Schwarz algorithms were employed with and without overlap. Although preconditioner effectiveness was sensitive to blocking strategy and amount of overlap, significant reductions (factors of 4 to 5) in Krylov iterations, compared with ILU(0) preconditioning, were possible. This improvement, on the difficult eleven-equation model problem, together with the parallel nature of these domain-based techniques makes them attractive preconditioning options. The results and observations of this study suggest additional areas in need of further investigation. With regard to domain-based preconditioning these include: parallel/distributed implementation of the Newton-Krylov-Schwarz algorithm, use of a coarse grid operator to improve scaling, determination of optimal blocking strategies, and consideration of alternatives to direct solves on subdomains. Additionally, pseudo-transient Newton-Krylov solutions of the edge plasma equations could be made more efficient by coupling the matrix-free implementation with domain-based preconditioning,
adaptive time step control, and parallel implementation. References [1] K. Ajmani, M.-S. Liou, and R. W. Dyson, Preconditioned implicit solvers for the navier-stokes equations on distributed-memory machines, Tech. Rep. NASA Technical Memorandum 106449, NASA, National Aeronautics and Space Administration, Lewis Research Center, Cleveland,OH, 44135-3191, January 1994. AIAA-94-0408 : ICOMP-93-49. [2] S. Braginskii, Transport Processes in a Plasma, in Reviews of Plasma Physics, Consultants Bureau. New York, 1965. [3] P. Brown and A. Hindmarsh, Matrix-free methods for stiff systems of ode's, SIAM Journal of Numerical Analysis, 23 (1986), pp. 610-638. [4] P. N. Brown and Y. Saad, Hybrid krylov methods for nonlinear systems of equations, Siam Journal of Scientific and Statistical Computing, 11 (1990), pp. 450-481. [5] X.-C. Cai, W. Gropp, and D. Keyes, A comparison of some domain decomposition and ILU preconditioned iterative methods for nonsymmetric elliptic problems, J. Numer. Lin. Alg. Appli., (May 1994). (to appear). [6] X.-C. Cai, W. D. Gropp, D. E. Keyes, and M. D. Tidriri, Newton-KrylovSchwarz methods in CFD, in Proceedings of the International Workshop on Numerical Methods for the Navier-Stokes Equations, F. Hebeker and R. Rannacher, eds., Braunschwieg, 1994, Vieweg Verlag. [7] X.-C. Cai and Y. Saad, Overlapping domain decomposition algorithms for general sparse matrices, Tech. Rep. Preprint 93-027, Army High Performance Computing Research Center, University of Minnesota, 1993. SIAM J. Sci. Comp. (submitted). [8] L. C. Dutto and W. G. Habashi, A parallel strategy for the solution of the fullycoupled compressible navier-stokes equations, in Advances in Finite Element Analysis in Fluid Dynamics, M. N. Dhaubhadel, M. S. Engelman, and W. G. Habash, eds., American Society of Mechanical Engineers, 1993. FED-Vol 171. [9] R. Freund, A transpose-free quasi-minimal residual algorithm for non-hermitian linear systems, SIAM Journal Scientific Computing, 14 (1993), pp. 470-482. [10] W. D. Gropp and D. E. Keyes, Domain decomposition methods in computational fluid dynamics. International Journal for Numerical Methods in Fluids, 14 (1992), pp. 147-165. [11] D. Heifetz, D. Post, M. Petravic. J. Weisheit, and G. Bateman, A Monte-Carlo Model of Neutral-Particle Transport in Diverted Plasmas, J. Comp. Phys., 46 (1982), pp. 309-327. [12] D. E. Keyes, Domain decomposition methods for the parallel computation of reacting flows, Computer Physics Comm., 53 (1989), pp. 181-200. [13] D. Knoll, Development and Application of a Direct Newton Solver for the TwoDirnensional Tokarnak Edge Plasma Fluid Equations, PhD thesis, University of New Mexico, 1991. [14] D. Knoll and P. McHugh, An Inexact Newton Algorithm for Solving the Tokamak Edge Plasma Fluid Equations on a multiply connected domain, J. Comp. Phys. accepted pending revision. [15] , NEWEDGE: A 2-D Fully Implicit Edge Plasma Fluid Code, for Advanced Physics and Complex Geometries, J. Nuc. Mat., 196-198 (1992), pp. 352-356. [16] D. Knoll, A. Prinja. and R. Campbell, A Direct Newton Solver for the Two-
Dimensional Tokamak Edge Plasma Fluid Equations, J. Comp. Phys., 104 (1993), pp. 418-426.
[17] P. McHugh and D. Knoll, Inexact Newton's method solution to the incompressible Navier-Stokes and energy equations using standard and matrix-free implementations, AIAA J., in press.
[18] P. G. Jacobs, V. Mousseau, P. McHugh, and D. Knoll, Newton-Krylov-Schwarz techniques applied to the two-dimensional incompressible Navier-Stokes and energy equations, in Seventh International Conference on Domain Decomposition Methods in Scientific Computing, D. E. Keyes and J. Xu, eds., University Park, PA, October 27-30, 1993, AMS.
[19] T. Rognlien, J. Milovich, M. Rensink, and G. Porter, A fully implicit, time-dependent 2-d fluid code for modeling tokamak edge plasmas, J. Nuc. Mat., 196-198 (1992), pp. 347-351.
[20] Y. Saad and M. Schultz, GMRES: A generalized minimal residual algorithm for solving nonsymmetric linear systems, SIAM Journal of Scientific and Statistical Computing, 7 (1986), p. 856.
[21] P. Sonneveld, CGS, a fast Lanczos-type solver for nonsymmetric linear systems, SIAM Journal of Scientific and Statistical Computing, 10 (1989), pp. 36-52.
[22] P. Stangeby, The Plasma Sheath, in Physics of Plasma Wall Interactions in Controlled Fusion, Post and Behrisch, eds., Plenum, 1984.
[23] H. A. van der Vorst, Bi-CGSTAB: a fast and smoothly converging variant of Bi-CG for the solution of nonsymmetric linear systems, SIAM Journal of Scientific and Statistical Computing, 13 (1992), pp. 631-644.
[24] S. Vanka, Block-implicit calculation of steady turbulent recirculating flows, Int. J. Heat Mass Transfer, 28 (1985), pp. 2093-2103.
[25] E. Vold, Transport in the Tokamak Edge Plasma, PhD thesis, University of California, Los Angeles, 1989.
[26] J. W. Watts, A conjugate gradient-truncated direct method for the iterative solution of the reservoir simulation pressure equation, Society of Petroleum Engineers Journal, (June 1981), pp. 345-353.
Chapter 6
Parallel Domain Decomposition Software
William Gropp*    Barry Smith†
*Mathematics and Computer Science Division, Argonne National Laboratory, Argonne, IL 60439-4844 (gropp@mcs.anl.gov).
†Mathematics and Computer Science Division, Argonne National Laboratory, Argonne, IL 60439-4844 (bsmith@mcs.anl.gov).
1   Introduction
Domain decomposition is natural for parallel computing; the domains are natural ways to decompose the problem across multiple processors. In practice, domain decomposition has often proven difficult to implement efficiently, particularly when the full domain involves an unstructured or semi-structured grid. This is particularly true of multilevel domain-decomposition methods, which, for a wide range of problems, have a substantial theory that demonstrates their optimality. Much of this is because effective domain decomposition methods require some kind of coupling or interaction between the domains, and this part of the computation, while involving only a small amount of the data, is often rife with special cases that complicate the implementation. In this paper, we discuss both some of the software that is available for domain decomposition, as well as software techniques that may be used to overcome problems in the implementation of domain decomposition methods. To help focus our discussion, we will consider one particular type of domain decomposition: the overlapping Schwarz methods for preconditioning the solution of linear systems of equations representing a discretization of a partial differential equation (PDE). There are many other kinds of domain decomposition discussed in this volume; for example, methods motivated by an asymptotic analysis of the PDE or a partitioning of the domain across parallel processors for explicit discretizations. Many of the techniques that we discuss here may be applied to those areas as well. We do not believe that it is possible or even desirable to provide completely general domain-decomposition software, since, by its very nature, domain decomposition methods are adapting to special situations. This does not preclude software for specific families of problems and methods, as we
will demonstrate. For the purposes of this volume, however, the more general techniques are appropriate. Perhaps more important, while completely general, "turnkey" software is not possible, there are techniques that can be used to maximize the reuse of existing code and eliminate accidental dependencies between different parts of an implementation. The technique that we recommend borrows ideas from object-oriented programming. We use encapsulation to write an operation and its data structures in such a way that only the operation is visible to the user; the internal data is hidden. We also use polymorphism; the same operation can be applied to different objects. A simple example of polymorphism is the sum operator in FORTRAN. Both the expressions 1 + 2 and 1.0d0 + 2.0d0 are meaningful even though the type of the data (and the particular hardware to carry out the operation) is different in each case. A difference in our approach is an emphasis on providing multiple data structures for representing a single object. For example, in representing a sparse matrix, many different formats can be used, including both ones that are fully general and ones that are optimized for special cases, such as diagonal and block diagonal formats. Our approach allows the application programmer to use any of these or to introduce their own, application-oriented sparse-matrix format, by defining the necessary operations on the sparse matrix. In particular, this allows the application programmer to use data structures and implementations that are highly optimized for a particular problem and computer; for example, long-vector code for a vector supercomputer. More details on this approach may be found in [9], in which we discuss how this approach can be efficiently implemented.
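As a minimal sketch of these ideas in C (the names below are illustrative and are not the KSP or PETSc interface), a vector object can carry its operations as function pointers, so that the same solver source works unchanged over different storage formats:

    /* A data-structure-neutral vector: the solver sees only the operations,
       never the layout of the data.  (Hypothetical names, for illustration.) */
    typedef struct Vec {
        void   *data;                                    /* hidden representation */
        double (*dot)  (struct Vec *x, struct Vec *y);   /* inner product         */
        void   (*axpy) (double a, struct Vec *x, struct Vec *y); /* y <- y + a*x  */
    } Vec;

    /* One possible implementation: a contiguous array of doubles. */
    typedef struct { int n; double *v; } ArrayData;

    static double array_dot(Vec *x, Vec *y) {
        ArrayData *xd = (ArrayData *)x->data, *yd = (ArrayData *)y->data;
        double s = 0.0;
        for (int i = 0; i < xd->n; i++) s += xd->v[i] * yd->v[i];
        return s;
    }

    static void array_axpy(double a, Vec *x, Vec *y) {
        ArrayData *xd = (ArrayData *)x->data, *yd = (ArrayData *)y->data;
        for (int i = 0; i < xd->n; i++) yd->v[i] += a * xd->v[i];
    }

A Krylov routine written against Vec works for a ghost-point layout, a strided layout, or a distributed layout simply by filling in different function pointers; the calling code never changes.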
2 Overlapping Schwarz Methods
Overlapping Schwarz methods are a powerful technique for preconditioning a linear system when using an iterative method. Let the linear system be Ax = b.
Partition the domain into subdomains as shown in Figure 1. Define operators R_i that restrict the original domain to the i-th subdomain. In addition, (optionally) define an operator R_0 that restricts the original domain to a coarse grid that, in some sense, covers the entire domain. Define A_i = R_i A R_i^T (and similarly A_0 = R_0 A R_0^T for the coarse grid).
Using these, we can define many different preconditioners B for A. Two common cases are the additive preconditioner,

    B = sum_i R_i^T A_i^{-1} R_i

(the sum running over all subdomains and, if defined, the coarse grid term i = 0), and the multiplicative preconditioner, in which the same corrections R_i^T A_i^{-1} R_i are applied to the current residual one subdomain at a time rather than simultaneously.

FIG. 1. Example of the decomposition of a domain.

See [2] in this volume and references therein for theoretical motivation. We are concerned herein with implementation aspects, particularly for parallel computers. There are several aspects to parallelizing a domain decomposition algorithm that can be treated modularly. The iterative method itself must be parallelized; in the next section we will discuss how a particularly powerful class of iterative methods called Krylov methods can be easily parallelized. Next, we will discuss how the matrix-vector product (Ax) used in the Krylov methods can be parallelized. This will give us the techniques that we will use to parallelize the preconditioner itself. Finally, we will discuss some special issues that come up in the parallelization of the "coarse grid" part of the preconditioner.
3 Parallelizing the Krylov methods
Krylov methods are a class of iterative methods that construct an approximate solution to a system of linear equations by forming the Krylov space (x_0, A x_0, A^2 x_0, ...). An approximate solution x is formed by choosing a linear combination of the elements of the Krylov space. Different Krylov methods make different choices of linear combination. One method, GMRES, picks x so that the residual ||Ax - b|| is minimized over the Krylov space. Other methods, such as BiCGStab, require less work and memory per iteration, but do not necessarily give the best (in the sense of minimizing the residual) approximation x.
Figure 2 shows executable code for the BiCGStab method [11]. This code is taken from the authors' Krylov Space Package (KSP) [10]. Note how the operations are written, particularly the vector operations such as DAYPX and DDOT. By using the object-oriented techniques described in the introduction, we can (and do) use the same code on anything from PCs to massively parallel processors (MPPs). All that is required is that the vector operations (and the matrix operations that we discuss in the next sections)
    for (i=0; i<maxit; i++) {
        DOT(R,RP,&rho);              /* rho <- rp' r            */
        beta = (rho/rhoold) * (alpha/omegaold);
        DAXPY(-omegaold,V,P);        /* p <- p - w v            */
        DAYPX(beta,R,P);             /* p <- r + p beta         */
        MATOP(P,V,T);                /* v <- K p                */
        DOT(RP,V,&d1);
        alpha = rho / d1;            /* a <- rho / (rp' v)      */
        DWAXPY(-alpha,V,R,S);        /* s <- r - a v            */
        MATOP(S,T,R);                /* t <- K s                */
        DOT(S,T,&d1);
        DOT(T,T,&d2);
        omega = d1 / d2;             /* w <- (s't) / (t't)      */
        DAXPY(alpha,P,X);            /* x <- x + a p            */
        DAXPY(omega,S,X);            /* x <- x + w s            */
        DWAXPY(-omega,T,S,R);        /* r <- s - w t            */
        NORM(R,&dp);
        rhoold = rho;  omegaold = omega;
        if (CONVERGED(dp)) break;
    }

FIG. 2. Executable code for Bi-CGStab.
are parallelized. For most of these operations, this is simple, because operations like DAYPX and DAXPY require only local information; that is, they require only vector elements stored on the same processor. The one exception is the inner product DDOT; this requires a collective operation to accumulate the contributions from each processor. The collective operation DDOT illustrates one of the most important quantitative differences between parallel processors, particularly MPPs, and more conventional computers. A major class of MPPs uses message passing to send data from one processor to another. This operation is relatively expensive and can often be modeled as taking time s + rn to send n bytes, where s is the latency and r is the inverse of the transfer rate. Typical values of s and r are such that the cost of a message is large compared with the time f required for a single floating-point operation.
For example, it takes about 2040 times as long to send a single double precision quantity as it does to operate on it. The dominant part of this is due to the latency term; this suggests sending more data at any one time. It is important to realize that this high cost for accessing distant or remote memory is not a unique feature of so-called distributed-memory MPPs. Almost all computers make some memory more costly to access than other memory. For example, most so-called shared-memory computers provide faster access to memory that is physically closer to the processor. Even uniprocessors will provide faster access to memory that is in cache than to memory that is not. One of the potential advantages of domain decomposition methods is that they provide a natural way to restrict the memory accesses to ones that are in some sense local or faster. Returning to DDOT, since the formation of a sum of contributions from all processors requires log p steps for p processors, just the part of DDOT that accumulates the contributions from the individual processors can take a huge amount of time. With p = 64 and the values of s, r, and f above, this accumulation takes about 12,000 times as long as a floating point operation. This cost should be considered when choosing or developing algorithms.
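To make the arithmetic concrete, the fragment below evaluates the model using illustrative values of s, r, and the floating-point time f (chosen only to be roughly consistent with the ratios quoted in the text, not taken from the original table):

    #include <math.h>
    #include <stdio.h>

    int main(void) {
        /* Illustrative values only (seconds), not measured numbers. */
        double s = 100e-6;    /* message latency                   */
        double r = 0.1e-6;    /* time per byte transferred         */
        double f = 0.05e-6;   /* time per floating-point operation */
        int    p = 64;        /* number of processors              */
        int    n = 8;         /* bytes in one double               */

        double t_msg    = s + r * n;                /* one 8-byte message     */
        double t_reduce = log2((double)p) * t_msg;  /* log p combining steps  */

        printf("message/flop ratio  : %.0f\n", t_msg / f);
        printf("reduction/flop ratio: %.0f\n", t_reduce / f);
        return 0;
    }

With these placeholder values the message/flop ratio is about 2000 and the reduction/flop ratio about 12,000, of the same order as the figures quoted above.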
4 Parallelizing the matrix-vector products
A key step of any Krylov method is the computation of the matrix-vector product. In using a domain-decomposition method, we have (usually) already defined how the vector is distributed across the processors. For simplicity, we will assume that there is one domain per processor; the generalizations for multiple domains per processor are straightforward. If we further assume that the matrix A is sparse, then to compute the part
FIG. 3. Example decomposition for 2-d PDE and 5-point stencil. Ghost points are shown in gray.
of Ax for the elements of x in the local domain, we need only reference the elements of x that are adjacent to the domain. To make this precise, consider the regular mesh in Figure 3. If the matrix A was generated by using the usual five-point discretization, then to compute the value of Ax on any subdomain, the elements of x on the adjacent domains must be accessed. One simple way to do this is to copy these values into ghost points. These ghost points are not part of the domain but are used to simplify the evaluation of Ax; this is a common technique, and software is available to help the application programmer write code for this approach [6]. There is a problem with this approach, however. Say that the local domain is described in FORTRAN by the declaration real x(20:40,10:50). When we add the ghost points, we need real x(19:41,9:51). The problem comes when we want to use this array as the representation of the local vector in the Krylov method. Because the ghost points are not part of the vector, they must not be used when performing the vector operations (particularly the inner product). This means that we cannot use, in our Krylov method, implementations of vector operations that assume that the vector elements are stored in consecutive locations in memory. There are several ways to solve this problem. One is to copy data from a vector stored in consecutive elements into a temporary area that includes the ghost points. Another is to split the computation of Ax into the part that involves only the local elements and another part that involves only the ghost points. Yet another approach, which exploits our data-structure-neutral approach, is to define a new set of vector operations for vectors that are stored with ghost points. The best approach depends on many factors; no one of these will always be best on all machines.
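For illustration, a sketch of the purely local part of the five-point-stencil product on a subdomain stored with a one-cell ghost layer is given below (in C, with illustrative names; the ghost entries are assumed to have already been filled by communication with the neighboring domains):

    /* Local part of y = A x for a 5-point stencil on an nx-by-ny subdomain.
       x is stored as an (nx+2)-by-(ny+2) array including one ghost layer;
       y holds only the nx-by-ny interior values.                            */
    #define X(i,j)  x[(j)*(nx+2) + (i)]
    #define Y(i,j)  y[((j)-1)*nx + ((i)-1)]

    void local_matvec(int nx, int ny, const double *x, double *y)
    {
        for (int j = 1; j <= ny; j++)
            for (int i = 1; i <= nx; i++)
                Y(i,j) = 4.0 * X(i,j)
                       - X(i-1,j) - X(i+1,j)
                       - X(i,j-1) - X(i,j+1);
    }

    #undef X
    #undef Y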
For more general meshes and matrices, it can become difficult to describe, when the program is written, exactly what data is needed by each domain. Fortunately, we can program the computer to compute this for us as part of the initialization of the method. The object that we use to do this is an index set. An index set is simply a collection of indices (integers) that correspond to vector elements. For example, if A is an n x n matrix, then an index set is a subset of the integers between 1 and n. All the usual set operations apply to index sets (e.g., union and intersection). To determine which vector elements are needed to compute Ax on a particular subdomain, let D be the index set of the elements of x in that subdomain. Let A_D be the rectangular matrix formed by taking only those rows of A numbered in D. That is, if D = {1, 10, 37}, then A_D contains the first, tenth, and thirty-seventh rows of A. Form a new index set S_D such that j is in S_D only if the (i,j)-th matrix element of A_D is nonzero for some i. We call S_D the support of A_D. Finally, the intersection of S_D with the complement of D contains the indices of the elements of x that are needed to compute A_D x but that are not in D. This approach is completely general and can be applied to arbitrary unstructured meshes. This approach may also be used to generate the overlapping domains in Figure 1 by starting with a non-overlapping partitioning D^0 and defining D^{l+1} = S_{D^l}; this gives a purely algebraic method for generating overlapping domains for an arbitrary sparse matrix. The use of index sets allows a program to compute automatically and dynamically many different data and operation patterns. For example, this approach may be used to find the submatrix of A_D whose rows involve only elements in D; this allows a parallel program (or an out-of-core solver) to begin computing with that submatrix before all the data in S_D has arrived. Another important point is that the internal representation of an index set in the computer program, and the code for implementing the set operations, is easily encapsulated and may even be specialized to particular structures. For example, on a regular mesh, a representation of index sets by collections of ranges (so that, for example, the set {12, 13, 14, 15, 16, 17, 20} would be represented as 12-17, 20) may be appropriate. In other words, using the index set abstraction does not require any loss of performance.
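As an illustration (the routine below is a sketch, not taken from any particular package), the support of A_D can be computed directly from a compressed sparse row (CSR) representation of A; repeated application, with D replaced by its support, generates the algebraic overlap described above:

    /* Mark the support S_D of the rows of A (CSR arrays ia, ja) listed in
       the index set D.  in_support[j] is set to 1 for every column j that
       is referenced by some row in D.  Entries with in_support[j] == 1 and
       j not in D are exactly the ghost indices needed to form A_D x.       */
    void support_of_rows(int n, const int *ia, const int *ja,
                         int nD, const int *D,
                         int *in_support)
    {
        for (int j = 0; j < n; j++) in_support[j] = 0;
        for (int k = 0; k < nD; k++) {
            int row = D[k];
            for (int idx = ia[row]; idx < ia[row + 1]; idx++)
                in_support[ja[idx]] = 1;
        }
    }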
5 Parallelizing the preconditioner
Parallelizing the preconditioner turns out to be very similar to parallelizing the matrix-vector product. The operations required for applying the preconditioner to a domain involve applying R_i and R_i^T; both of these are nothing more than using an index set in a vector scatter or gather operation (which will usually involve communication with neighboring processors).
The subdomain solve with A_i^{-1} is entirely local (it could, of course, be distributed across several processors, but we will ignore that case for simplicity). Certain optimizations, such as overlapping the computation of A_D x with the communication necessary before beginning the application of B_i, can be determined by using index sets. The only tricky issue is B_0, the coarse grid problem.
6 Parallelizing the coarse grid solver
In parallelizing the coarse grid solution phase of the algorithm, we are faced with a relatively small problem on a relatively large number of processors. This sort of problem is usually not well suited to a parallel processor (it is too small), and most of the work in the literature on the parallel solution of linear systems assumes that the number of unknowns per processor is large. In the case of small systems, the time to communicate data between the processors can be larger than the time needed to solve the system of equations [4]. There are four obvious methods that can be used in solving the coarse grid problem:
1. Solve in parallel on all processors
2. Solve in parallel on clusters of processors
3. Solve on one processor and distribute the solution
4. Solve redundantly on all processors
We can use the communication model developed above to help evaluate these approaches. Except for the first approach, these all require that the right-hand-side data be collected to the processors that are computing the solution. A variety of algorithms exist for this; they all take time that is roughly log p (s + rn), where there are p processors and n unknowns in the coarse grid. Parallel solution algorithms that involve direct factorization will often take p communication steps; this makes them slow relative to the cost of gathering the data to a single processor. Most iterative methods will involve some number of steps that involve inner products; these take log p communication steps, making these methods take at best as long as gathering the data to a single processor. Thus, if the coarse grid is small enough that the solution on a single processor takes less time than log p (s + rn), it can be better to solve the coarse grid problem redundantly on all processors. Note that a large coarse grid, such as in [1] in this volume, will need to be solved in parallel. It is only for relatively small coarse grids that the considerations in this section come into play. An alternative approach, described in [7] and used by Fischer [3], is to compute the inverse of the coarse grid matrix; solving the coarse-grid problem is then reduced to a matrix-vector product.
Additional discussion of these issues may be found in [5].
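As a rough sketch of the trade-off discussed above, the decision between redundant and gathered solution of the coarse problem can be expressed with the same communication model (the routine and its arguments are illustrative; s, r, and the serial solve time would be measured on the target machine):

    #include <math.h>

    /* Return 1 if the coarse problem should be solved redundantly on every
       processor, 0 if it is worth gathering it and solving it elsewhere.
       t_serial : measured time to solve the coarse problem on one processor
       s, r     : message latency and per-byte transfer time
       n        : number of coarse-grid unknowns (8-byte values)
       p        : number of processors                                        */
    int solve_redundantly(double t_serial, double s, double r, int n, int p)
    {
        double t_gather = log2((double)p) * (s + r * 8.0 * n);
        return t_serial < t_gather;
    }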
7 Conclusion
When developing domain decomposition algorithms for parallel computers, it is important to keep a clean separation between the implementation and the mathematics. The data-structure-neutral approach provides a natural framework for expressing the mathematics in a program without sacrificing performance. When choosing the details of an implementation, try to keep the operations at as high a level as possible to provide the maximum flexibility in the implementation. (This flexibility is often critical in maintaining and extending a code.) One of the strengths of domain decomposition methods, that you can adapt the approach to address specific behavior, requires that the program be flexible enough to be adapted to different situations. A number of libraries are available to provide some of the software pieces for implementing parallel domain decomposition algorithms. The authors' PETSc package was developed in large part to aid in research in parallel domain decomposition methods. This package is available by anonymous ftp from info.mcs.anl.gov in directory pub/pdetools. Information on the package is also available on the World Wide Web; the URL http://www.mcs.anl.gov/home/gropp/petsc.html is a good starting point. At the lower level of providing message-passing operations, the authors' BlockComm package, also available in pub/pdetools, provides a portable way of exchanging data between processors without using explicit message passing. For message passing itself, every programmer should use MPI (Message Passing Interface); this is an ad hoc standard (much like High Performance Fortran) that is efficient, has a portable implementation (available from info.mcs.anl.gov in pub/mpi), and provides special features directed at library writers and at programs that use subsets of processors (for example, for the coarse grid solve). Information on MPI is available on the World Wide Web through the URL http://www.mcs.anl.gov/mpi/index.html. The authors' Chameleon parallel programming system [8] provides some of these features and allows an application to interoperate with code written using older message-passing systems; it is also available from pub/pdetools.

References
[1] Petter E. Bjørstad. Domain decomposition and parallel computing with applications to oil reservoir simulation. In David Keyes, Youcef Saad, and Donald Truhlar, editors, this volume. SIAM, 1994.
[2] Xiao-Chuan Cai. Domain decomposition methods for nonsymmetric and indefinite systems of equations. In David Keyes, Youcef Saad, and Donald Truhlar, editors, this volume. SIAM, 1994.
[3] Paul F. Fischer. Parallel domain decomposition for incompressible fluid dynamics. In Alfio Quarteroni, Jacques Periaux, Yuri Kuznetsov, and Olof B. Widlund, editors, Domain Decomposition Methods in Science and Engineering, pages 313-322, Providence, RI, 1994. AMS.
[4] William D. Gropp. Solving PDEs on loosely-coupled parallel processors. Par. Comput., 5:165-173, 1987.
[5] William D. Gropp. Parallel computing and domain decomposition. In David Keyes, Tony F. Chan, Gerard Meurant, Jeffrey Scroggs, and Robert G. Voigt, editors, Fifth International Symposium on Domain Decomposition Methods for Partial Differential Equations, pages 349-361, Philadelphia, 1992. SIAM.
[6] William D. Gropp. BlockComm for Fortran. Technical Report ANL-93/00 (to appear), Argonne National Laboratory, May 1993.
[7] William D. Gropp and David E. Keyes. Domain decomposition with local mesh refinement. SIAM J. Sci. Stat. Comput., 13(4):967-993, July 1992.
[8] William D. Gropp and Barry Smith. Chameleon parallel programming tools users manual. Technical Report ANL-93/23, Argonne National Laboratory, March 1993.
[9] William D. Gropp and Barry Smith. The design of data-structure-neutral libraries for the iterative solution of sparse linear systems. Technical Report MCS-P356-0393, Argonne National Laboratory, May 1993.
[10] William D. Gropp and Barry Smith. Users manual for KSP: Data-structure-neutral codes implementing Krylov space methods. Technical Report ANL-93/30, Argonne National Laboratory, August 1993.
[11] H. van der Vorst. Bi-CGSTAB: A fast and smoothly converging variant of Bi-CG for the solution of nonsymmetric linear systems. SIAM J. Sci. Stat. Comput., 13:631-644, 1992.
Chapter 7
Decomposition of Space-Time Domains: Accelerated Waveform Methods, with Application to Semiconductor Device Simulation

Andrew Lumsdaine
Department of Computer Science and Engineering, University of Notre Dame, Notre Dame, IN 46556 ([email protected])

Mark W. Reichelt
The MathWorks, Inc., 24 Prime Park Way, Natick, MA 01760 ([email protected])

This work was supported in part by NSF grant CCR92-09815. Intel parallel computer resources were provided by the San Diego Supercomputer Center.

Abstract
In this paper, we present accelerated waveform methods for solving time-dependent problems from the viewpoint of domain-based parallelism. We show that the waveform approach is a methodology for decomposing the space-time computational domain of time-dependent problems and circumventing the serial time-stepping bottleneck that arises when parallelizing standard sequential methods. Experimental results using waveform methods to solve the time-dependent semiconductor drift-diffusion equations are presented and demonstrate that waveform techniques are superior to standard techniques in different parallel environments.
1 Introduction
Standard solution methods for numerically solving time-dependent problems typically begin by discretizing the problem on a uniform time grid and then sequentially solving for successive time points. The initial time discretization imposes a serialization on the solution process and limits parallel speedup to the speedup available from parallelizing the problem at any given time point. This bottleneck can be circumvented by the use of waveform methods, in which multiple time points of the different components of the solution are computed independently. With the waveform approach, a problem is first spatially decomposed and distributed among the processors of a parallel machine. Each processor then solves its own time-dependent subsystem over the entire interval of interest, using previous iterates from other processors as inputs. Synchronization and communication between processors take place infrequently, and
FIG. 1. The computational domain of a time-dependent problem is a space-time domain.
FIG. 2. The standard approach for solving time-dependent problems.
communication consists of large packets of information — discretized functions of time (i.e., waveforms). One can describe waveform methods from the viewpoint of domain-based parallelism by realizing that the computational domain of time-dependent problems encompasses both space and time. Fig. 1 shows the space-time computational domain of the physical problem described in Section 3.2. The standard approach for solving time-dependent problems essentially takes slices of the space-time computational domain normal to the time axis. A space-time interpretation of this process is shown in Fig. 2. The waveform approach, on the other hand, first decomposes the space-time computational domain into subdomains, as shown in Fig. 3. The standard approach is still used to solve the sub-problem within each subdomain. This space-time domain-based decomposition is readily parallelized — each space-time subdomain can be assigned to a separate processor for parallel solution.
FIG. 3. The waveform approach for solving time-dependent problems.
In this paper, we describe the classical waveform relaxation algorithm and present two recently developed acceleration schemes for it, namely Krylov-subspace and convolution successive overrelaxation (CSOR) techniques. The accelerated waveform methods are applied to solving the time-dependent semiconductor drift-diffusion equations. Experimental results in two different parallel environments — a workstation cluster and the Intel Paragon — illustrate the general applicability of waveform methods and demonstrate that waveform methods are superior to standard methods in these parallel environments. More traditional domain-decomposition approaches for time-dependent problems can be found in [5], [11], and [42]. A generalization of waveform relaxation to partial differential equation (PDE) problems through the use of domain decomposition is described in [4]. A domain-decomposition approach for static semiconductor simulation is described in [14].
2 Waveform Methods
A thorough (and mathematical) description of waveform methods can be developed easily through the use of a model problem. For the model problem, we seek to compute the transient (temporal) solution to a system of N simultaneous ordinary differential equations (ODEs), subject to an initial condition. This type of problem, usually called an initial value problem (IVP), is expressed as

    dx/dt(t) + A(t) x(t) = f(t),    x(0) = x_0,    (1)
where x(t) and f(t) are N-dimensional real vectors and A is an N x N matrix. The vector f(t) is understood to be the input and x(t) is understood to be the unknown, which is to be computed over a time interval from t = 0 to t = T (i.e., for t in [0, T]). We can describe these quantities mathematically by writing A(t) in R^{N x N}, f(t) in R^N, and x(t) in R^N. The traditional approach for numerically solving the IVP begins by discretizing (1) in time with an implicit integration rule (since large dynamical systems are typically stiff) and then solving the resulting matrix problem at each time step [13, 16]. This pointwise approach can be disadvantageous for a parallel implementation, especially for distributed-memory parallel computers having a high communication latency, since the processors will have to synchronize repeatedly at each timestep. A more suitable approach to solving the IVP with a parallel computer is to decompose the problem at the differential equation level. That is, the large system is decomposed into smaller subsystems, each of which is assigned to a single processor. The IVP is solved iteratively by solving the smaller IVPs for each subsystem, using fixed values from previous iterations for the variables from other subsystems. This dynamic iteration process is
variously known as waveform relaxation (WR), dynamic iteration, or the Picard-Lindelöf iteration [28, 52].
Example. Consider an equation-by-equation decomposition of (1). A waveform relaxation algorithm for this process is described by:
ALGORITHM 2.1. (JACOBI WAVEFORM RELAXATION)
1. Start: Select an initial value x^0(t) for t in [0, T].
2. Iterate: For each waveform iteration k = 1, 2, . . . , until satisfied do:
   For each equation i = 1, 2, . . . , N solve the IVP:
      dx_i^k/dt(t) + a_ii x_i^k(t) = - sum_{j != i} a_ij x_j^{k-1}(t) + f_i(t),    x_i^k(0) = x_{0,i}.
Here, at every waveform iteration k, each equation in the system is solved for its corresponding component of x, using previous values of the other components as input. Note that the left-hand side uses only the diagonal element of A (i.e., a_ii for the computation of x_i(t)). This computational structure is analogous to that of Gauss-Jacobi relaxation for solving linear systems of equations [48]; hence, this particular decomposition and solution process is usually called Jacobi waveform relaxation. Alternative decompositions can be constructed (e.g., Gauss-Seidel), based on alternative splittings of the matrix A. Since the WR algorithm was first introduced as an efficient technique for solving the large sparsely-coupled differential equation systems generated by simulation of integrated circuits [22], its properties have been under substantial theoretical and practical investigation. The precise nature of the loose coupling in integrated circuits, which was responsible for the rapid convergence of WR for those examples, was first made clear in [32]. The more formal theory for WR applied to linear time-invariant systems in normal form is described in [28], and theoretical aspects which arise when WR is applied to the more general form (C d/dt + A)x(t) = f(t) are examined in [31], and for the differential-algebraic case in particular in [27]. Since the WR method decomposes the problem before time discretization, it has been used as a tool for examining the stability properties of multirate integration methods [50]. Though the major practical success of WR has been in accelerating the simulation of integrated circuits [12, 26, 33, 51], it has been examined for other applications. For example, the method has been successfully applied to semiconductor device simulation [38] and to chemical engineering problems [45]. As the above body of work makes clear, for WR to be a computational competitor to pointwise methods, its convergence must be accelerated. Approaches to accelerating the convergence of WR include multigrid [9, 47], SOR [28], convolution SOR [37], Krylov-subspace methods [24], adaptive window size selection [20, 21], and the use of shifted iterations [44].
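As a concrete, deliberately simplified illustration of Algorithm 2.1 (constant coefficients, a fixed backward-Euler time grid, and illustrative names; this is not the implementation used for the experiments below), one Jacobi WR sweep over all components can be written as:

    /* One Jacobi waveform relaxation sweep for dx/dt + A x = f(t), x(0) = x0,
       on nt uniform backward-Euler steps of size dt.  A is n x n (row-major);
       f, x_old, x_new are (nt+1) x n arrays of waveform values; x_old[0,:] is
       assumed to hold the initial condition.  The caller repeats sweeps,
       swapping x_old and x_new, until the waveforms stop changing.            */
    void jacobi_wr_sweep(int n, int nt, double dt,
                         const double *A, const double *f,
                         const double *x_old, double *x_new)
    {
        for (int i = 0; i < n; i++) {
            x_new[i] = x_old[i];                    /* copy initial condition  */
            for (int m = 1; m <= nt; m++) {
                /* right-hand side uses the previous iterate for j != i        */
                double rhs = f[m*n + i];
                for (int j = 0; j < n; j++)
                    if (j != i) rhs -= A[i*n + j] * x_old[m*n + j];
                /* backward Euler: (1 + dt*a_ii) x_i[m] = x_i[m-1] + dt*rhs     */
                x_new[m*n + i] = (x_new[(m-1)*n + i] + dt * rhs)
                                 / (1.0 + dt * A[i*n + i]);
            }
        }
    }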
In the following sections, we describe the Krylov-subspace and convolution SOR acceleration techniques, concentrating primarily on their practical aspects.
2.1 Operator Equation Formulation
Since we wish to use techniques taken from linear algebra to accelerate WR, it is useful to put WR into a form that is analogous to the canonical Ax = b formulation of linear algebra problems. In (1), let A = M - N be a splitting of A. The waveform relaxation algorithm based on this splitting is expressed in matrix form as
ALGORITHM 2.2. (WAVEFORM RELAXATION FOR LINEAR SYSTEMS)
1. Initialize: Pick x^0.
2. Iterate: For waveform iteration k = 0, 1, . . . ,
      dx^{k+1}/dt(t) + M x^{k+1}(t) = N x^k(t) + f(t),    x^{k+1}(0) = x_0.
Solve for x^{k+1}(t) on [0, T]. We can solve for x^{k+1}(t) explicitly [18]; that is,
      x^{k+1}(t) = e^{-Mt} x_0 + int_0^t e^{-M(t-s)} ( N x^k(s) + f(s) ) ds.    (2)
Instead of using this formulation, it is useful to abstract (2) and consider x as an element of a function space (of N-dimensional functions) and the integral as an operator on N-dimensional functions. Using operator notation, we can write (2) as
      x^{k+1} = K x^k + psi.    (3)
Here the variables are defined on the space of N-dimensional square-integrable functions, which we will denote as H = L^2([0, T], R^N). The operator K (mapping from H to H) is defined by
      (K x)(t) = int_0^t e^{-M(t-s)} N x(s) ds,
and psi in H is given by
      psi(t) = e^{-Mt} x_0 + int_0^t e^{-M(t-s)} f(s) ds.
Roughly, application of the operator K means: "take one step of waveform relaxation." Now, we also know (based on the splitting) that the solution x to (1) will satisfy
      dx/dt(t) + M x(t) = N x(t) + f(t),    x(0) = x_0.
Or, using operator notation, we can see that x will satisfy
      (I - K) x = psi,    (4)
where the operator I is the identity operator. It is useful to describe characteristics of the operator K that are similar to familiar properties of matrices. The following are standard results (see, e.g., [10, 19]) which will be used in subsequent discussions of (4). Note that the adjoint of an operator is analogous to the transpose of a matrix.
LEMMA 2.1. If M and N are piecewise continuous with respect to t, then K : H -> H is compact, has a spectral radius of zero, and K*, the adjoint operator for K, is given by
      (K* y)(t) = int_t^T N^T e^{-M^T (s-t)} y(s) ds,
where superscript T denotes algebraic transposition.
2.2 Waveform Relaxation
It is interesting to note that the standard waveform relaxation algorithm (3) can be obtained simply by applying the Richardson iteration [48] directly to the operator equation (4):
      x^{k+1} = x^k + ( psi - (I - K) x^k ) = K x^k + psi.
In the integral equation literature, this approach is also known as the method of successive approximations [1, 19, 23].
Example. Let M(t) = 0. Then e^{-M(t-s)} = I, so that (3) becomes
      x^{k+1}(t) = x_0 + int_0^t ( f(s) - A(s) x^k(s) ) ds,
which is the familiar Picard iteration (typically used to show existence and uniqueness of solutions to ODEs) [17, 36].
Example. Let M(t) be the diagonal part of A(t). Then Algorithm 2.2 becomes the Jacobi WR algorithm (Algorithm 2.1).
As K has zero spectral radius, a straightforward convergence result can be stated.
THEOREM 2.1. Under the assumptions of Lemma 2.1, the method of successive approximations, defined in (3), converges.
A more detailed analysis of convergence can be derived by considering cases for which K is defined as T -> infinity, in which case K has nonzero spectral radius [28].
2.3 Krylov Subspace Algorithms
One straightforward method for developing Krylov-subspace acceleration techniques for WR is to consider those methods which are Galerkin methods, since a large body of literature regarding Galerkin methods for solving Fredholm and Volterra integral equations already exists [1, 19, 25]. The use of a Galerkin method over a Krylov space generated by (I - K) is discussed in [29] and [34], where the approach is called the method of moments (see also [49]). It should be apparent from Lemma 2.1 that, in general, K is not self-adjoint, i.e., K != K*. We therefore restrict our attention to those Krylov-subspace methods which are appropriate for non-self-adjoint operators, such as the generalized minimum residual algorithm (GMRES) [39]. The waveform GMRES algorithm (WGMRES) — the extension of GMRES to the space H — is given below. Convergence results for WGMRES are given in [24].
ALGORITHM 2.3. (WAVEFORM GMRES)
1. Start: Set r^0 = psi - (I - K)x^0, v^1 = r^0 / ||r^0||, beta = ||r^0||.
2. Iterate: For k = 1, 2, . . . , until satisfied do:
3. Form approximate solution:
The two fundamental operations in Algorithm 2.3 are the operator-function product, (I - K)p, and the inner product, <.,.>. When solving (4) in the space H, these operations are as follows:
Operator-Function Product: To calculate w = (I - K)p:
1. Solve the IVP
      dy/dt(t) + M y(t) = N p(t),    y(0) = 0,
   for y(t), t in [0, T]; this gives us y = Kp.
2. Set w = p - y.
Inner Product: The inner product <x, y> is given by
      <x, y> = int_0^T x(t)^T y(t) dt.
Recall that the application of K is equivalent to the application of one step of WR. The first part of Step 1 of the operator-function product is therefore equivalent to one step of the standard WR iteration (with zero input); hence WGMRES can be considered as a scheme for accelerating the convergence of WR. This also implies that computing the operator-function product in the Krylov-subspace based methods is as amenable to parallel implementation as WR. Moreover, the inner products required by the WGMRES algorithm can be computed by N separate integrations of the pointwise product x_i(t) y_i(t), which can be performed in parallel, followed by a global sum of the results. Finally, it should be noted that although the initial residual is given by r^0 = psi - (I - K)x^0, it is computed in practice according to
      r^0 = ( K x^0 + psi ) - x^0.
The latter formulation avoids explicit computation of psi, since the expression in parentheses is merely the result obtained by performing a single step of WR.
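A sketch of how the inner product just described might be computed in parallel is given below: trapezoidal-rule integration of the locally owned components followed by a global sum. MPI and the uniform time grid are assumptions made here purely for illustration.

    #include <mpi.h>

    /* <x,y> = integral over [0,T] of x(t)^T y(t) dt, approximated by the
       trapezoidal rule on nt+1 uniformly spaced time points of spacing dt.
       Each processor owns n_local components; x and y are stored as
       (nt+1) x n_local arrays.  One global reduction combines the results. */
    double waveform_dot(int n_local, int nt, double dt,
                        const double *x, const double *y, MPI_Comm comm)
    {
        double local = 0.0, global;
        for (int m = 0; m <= nt; m++) {
            double w = (m == 0 || m == nt) ? 0.5 * dt : dt;  /* trapezoid weight */
            for (int i = 0; i < n_local; i++)
                local += w * x[m*n_local + i] * y[m*n_local + i];
        }
        MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, comm);
        return global;
    }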
2.4 Hybrid Krylov Methods for Nonlinear Systems
Many interesting applications are nonlinear and cannot be described as linear time invariant systems (like our model problem). We will use the following as a model nonlinear problem:
In order to use the previously developed methods, which only apply to linear systems, we must first linearize (5). To linearize (5), we apply Newton's method directly to the nonlinear ODE system (in a process sometimes referred to as the waveform Newton method (WN) [40]) to obtain the following iteration:
Here, J_F is the Jacobian of F. We note that (6) is a linear time-varying IVP to be solved for x^{m+1}, which can be accomplished with a waveform Krylov-subspace method. Note that the previous development of waveform Krylov-subspace methods extends trivially to the linear time-varying case [24]. The resulting operator-Newton/Krylov-subspace algorithm, a member of the class of hybrid Krylov methods [7], is shown below.
ALGORITHM 2.4. (WAVEFORM NEWTON/WGMRES)
1. Initialize: Pick x^0.
2. Iterate: For m = 0, 1, . . . until converged:
      Linearize (5) to form (6)
      Solve (6) with WGMRES
      Update x^{m+1}
For the WGMRES algorithm applied to solving (6), the required operator-function product can be computed using the formulas in Section 2.3, with a splitting of the linearized time-varying system into operators M(t) and N(t) as before.
It is also possible to use a "Jacobian-free" approach [8], but the nature of the linearization in the operator-Newton algorithm makes that approach somewhat unreliable [24]. Within each nonlinear (operator-Newton) iteration, the initial residual for the WGMRES algorithm must be computed. Denote the initial guess for x^{m+1} in the WGMRES part of the hybrid algorithm as x^{m+1,0} and the initial residual by r^{m+1,0}. If x^{m+1,0} = x^m, then the initial residual for the WGMRES algorithm can be computed using a two-step approach as follows:
1. Solve the IVP
for y(t), t in [0, T].
2. Set r^{m+1,0} = y - x^m.
This approach is similar to that used by WGMRES for linear systems. Methods for approximating r^{m+1,0} so that M(t)x^m(t) does not need to be explicitly calculated can be found in [24].
2.5 Convolution SOR
Successive overrelaxation (SOR) type acceleration of WR was studied in great detail in [28], with somewhat discouraging results. However, convolution SOR (CSOR), a novel type of SOR acceleration, was developed in [37] and circumvents the limitations of waveform SOR described in [28]. To abbreviate the description of the CSOR algorithm, we will consider the problem of numerically solving the linear initial-value problem (1). A waveform relaxation algorithm using CSOR for solving (1) is shown in Algorithm 2.5. We take an ordinary Gauss-Seidel WR step to obtain a value for the intermediate variable x̂_i^{k+1}. The iterate x_i^{k+1} is obtained by moving x̂_i^{k+1} slightly farther in the iteration direction by convolution with a CSOR parameter function ω(t). With the convolution, the CSOR method correctly accounts for the temporal frequency-dependence
of the spectrum of the Gauss-Jacobi WR operator (e.g., Gauss-Jacobi WR smoothes high-frequency components of the error waveform more rapidly than low-frequency components) by, in effect, using a different SOR parameter for each frequency.
ALGORITHM 2.5. (GAUSS-SEIDEL WR WITH CSOR ACCELERATION)
1. Initialize: Pick a vector waveform x^0(t) in (R^n, [0, T]) with x^0(0) = x_0.
2. Iterate: For k = 0, 1, . . . until converged:
   Solve for the scalar waveform x̂_i^{k+1}(t) in (R, [0, T]) with x̂_i^{k+1}(0) = x_{0i},
      dx̂_i^{k+1}/dt(t) + a_ii x̂_i^{k+1}(t) = - sum_{j<i} a_ij x_j^{k+1}(t) - sum_{j>i} a_ij x_j^k(t) + f_i(t),
   Overrelax to generate x_i^{k+1}(t) in (R, [0, T]),
      x_i^{k+1}(t) = x_i^k(t) + int_0^t ω(t-s) ( x̂_i^{k+1}(s) - x_i^k(s) ) ds.    (7)
In a practical implementation, the CSOR method is used to solve a problem that has been discretized in time with a multistep integration method. The overrelaxation convolution integral (7) is replaced with a convolution sum,
      x_i^{k+1}[m] = x_i^k[m] + sum_{l=0}^{m} ω[l] ( x̂_i^{k+1}[m-l] - x_i^k[m-l] ).
Here, m denotes the timestep, k denotes the discretized waveform iteration, and i denotes the component of x. Like the standard algebraic SOR method [48, 53], the practical difficulty is in determining an appropriate overrelaxation parameter, in this case the sequence ω[m]. One successful approach for estimating the optimal SOR parameter has been to consider the spectrum of the SOR operator as a function of frequency and to use a power method to estimate an optimal ω_opt[m] [37]. There are a variety of alternative approaches to extending the CSOR algorithm to problems with nonlinearities. We used a waveform extension of relaxation-Newton methods (WRN) for solving nonlinear algebraic problems [35, 40]. For the nonlinear problem of the form of (5), the iteration update for the i-th component of x in a CSOR-Newton algorithm is obtained by taking a Gauss-Seidel waveform relaxation-Newton step for that component, followed by the same convolution overrelaxation as above.
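The discrete overrelaxation step can be written as a short routine; the sketch below (names and array layout are illustrative) applies the convolution sum above to one component of x, given precomputed coefficients ω[0..nt]:

    /* Convolution-SOR update for one component of x:
       x_new[m] = x_old[m] + sum_{l=0}^{m} omega[l] * (xhat[m-l] - x_old[m-l]),
       for timesteps m = 0..nt.  xhat is the Gauss-Seidel WR iterate for this
       component, x_old the previous CSOR iterate; all arrays have nt+1 entries. */
    void csor_update(int nt, const double *omega,
                     const double *xhat, const double *x_old, double *x_new)
    {
        for (int m = 0; m <= nt; m++) {
            double corr = 0.0;
            for (int l = 0; l <= m; l++)
                corr += omega[l] * (xhat[m - l] - x_old[m - l]);
            x_new[m] = x_old[m] + corr;
        }
    }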
3 Semiconductor Device Simulation

3.1 The Drift-Diffusion Equations
Charge transport within a semiconductor device is assumed to be governed by the Poisson equation and by the electron and hole continuity equations:
Here, u is the normalized electrostatic potential in thermal volts, n and p are the electron and hole concentrations, J_n and J_p are the electron and hole current densities, N_D and N_A are the donor and acceptor concentrations, R is the net generation and recombination rate, q is the magnitude of electronic charge, k is Boltzmann's constant, T is temperature, and ε is the spatially-dependent dielectric permittivity [2, 43]. The current densities J_n and J_p are given by the drift-diffusion approximations:
where μ_n and μ_p are the electron and hole mobilities, and D_n and D_p are the diffusion coefficients. The mobilities μ_n and μ_p may be computed as nonlinear functions of the electric field E, i.e.,
where v_sat and β are constants and μ_0 is a doping-dependent mobility [30]. The diffusion constants D_n and D_p are related to the mobilities by kT/q (the thermal voltage) in a pair of equations known as the Einstein relations [15]:
      D_n = (kT/q) μ_n,    D_p = (kT/q) μ_p.
The drift-diffusion approximations (11) and (12) are typically used to eliminate the current densities Jn and Jp from the continuity equations (9) and (10), leaving a differential-algebraic system of three equations in three unknowns, u, n, and p.
3.2 MOSFET Simulation
A key component of modern VLSI circuits is a semiconductor device known as a MOSFET (Metal-Oxide Semiconductor Field Effect Transistor). Although a MOSFET is a three-dimensional structure consisting of several different regions of silicon, oxide and metal, a MOSFET may be modeled by a two-dimensional slice of the device, as shown in Fig. 4. In the figure, thick lines represent metal contacts to the drain, source, substrate and gate oxide regions, to which external voltage boundary conditions are applied.
FIG. 4. A two-dimensional slice of a MOSFET device.
Given a rectangular mesh covering a two-dimensional slice of a MOSFET, a common approach to spatially discretizing the device equation system is to use a finite-difference formula to discretize the Poisson equation, and an exponentially-fit finite-difference formula to discretize the continuity equations (this process is known as the Scharfetter-Gummel method [41]). On an N-node mesh, this spatial discretization yields a sparsely-coupled differential-algebraic initial value problem (IVP) consisting of 3N equations in 3N unknowns, denoted by
where t in [0, T], and u(t), n(t), p(t) in R^N are vectors of normalized potential, electron concentration, and hole concentration. The initial conditions are assumed to be consistent [6]. Here, F_1, F_2, F_3 : R^{3N} -> R^N are specified component-wise as
The summations are taken over the silicon nodes j adjacent to node i. As shown in Fig. 5, for each node j adjacent to node i, L_ij is the distance
from node i to node j, d_ij is the length of the side of the Voronoi box that encloses node i and bisects the edge between nodes i and j, and A_i is the area of the Voronoi box. Similarly, the quantities ε_ij, μ_nij, and μ_pij are the dielectric permittivity, electron mobility, and hole mobility, respectively, on the edge between nodes i and j. The Bernoulli function, B(x) = x/(e^x - 1), is used to exponentially fit potential variation to electron and hole concentration variations, and effectively upwinds the current equations.
FIG. 5. Illustration of a mesh node i, the area A_i of its Voronoi box, and the lengths d_ij and L_ij.
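A direct evaluation of B(x) = x/(e^x - 1) loses accuracy near x = 0 because of cancellation; a common guarded implementation (a sketch, with an illustrative threshold) is:

    #include <math.h>

    /* Bernoulli function B(x) = x / (exp(x) - 1), used by the
       Scharfetter-Gummel discretization.  A short series is used near
       x = 0 to avoid cancellation; the threshold is illustrative.      */
    double bernoulli(double x)
    {
        if (fabs(x) < 1.0e-4)
            return 1.0 - 0.5 * x + x * x / 12.0;  /* B(x) ~ 1 - x/2 + x^2/12 */
        return x / expm1(x);                      /* expm1(x) = exp(x) - 1    */
    }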
4 Experimental Results
Fig. 6 illustrates the MOSFET simulation used for the numerical experiments in this section. The figure shows a two-dimensional slice of the silicon device along with the external boundary conditions on the potentials u at the terminals. Here, the potentials at the source and substrate terminals are held at 0, the potential u at the gate terminal is held at 5V, and there is a short pulse at the drain terminal. The concentrations n and p at each terminal are held constant at an equilibrium value determined by the background doping concentration at that terminal.
FIG. 6. Illustration of the example problem and the Dirichlet boundary conditions on the terminal potentials u.
The experiments compared parallelized pointwise Newton/GMRES, WRN [40], WN/WGMRES, and WRN with CSOR acceleration. To obtain the CSOR results, the "optimal" CSOR parameter was determined by linearizing the device problem about the solution at time t = 0, and fitting
TABLE 1. Execution times (in wall clock seconds) for transient simulation of the example problem on a PVM workstation cluster. A * indicates that the experiment could not run because of memory limitations.

    Method               # Procs      Time
    Pointwise (Direct)       1      2462.48
    Pointwise (GMRES)        1      1221.98
    Pointwise (GMRES)        2      6931.86
    WRN                      1      8230.23
    WRN                      2      4469.91
    WRN                      4      2712.58
    WRN                      8      1571.92
    WN/WGMRES                1         *
    WN/WGMRES                2         *
    WN/WGMRES                4       925.60
    WN/WGMRES                8       504.50
    WRN with CSOR            1      1665.58
    WRN with CSOR            2       884.64
    WRN with CSOR            4       541.76
    WRN with CSOR            8       316.08
ω_opt(z) (as a function of frequency) with a rational function as described in [37]. Also, to diminish the effect of the nonlinearity, the overrelaxation convolution was applied only to the potential variables u. The backward Euler method with 256 fixed timesteps was used for all experiments, on a simulation interval of 512 picoseconds. Although the use of global uniform timesteps precludes multirate integration (one of the primary computational advantages of WR on a sequential machine), it also simplifies the problem of load balancing. The convergence criterion for all experiments was that the maximum relative error of any terminal current over the simulation interval be less than 10^{-4}. The initial guess for WRN and for the accelerated waveform methods was produced by performing 8 WR iterations beginning with flat waveforms extended from the initial conditions. Table 1 shows a comparison of the execution times (in wall clock seconds) required to complete a transient simulation of the example problem using WRN, WN/WGMRES, and CSOR on an Ethernet-connected PVM [3] workstation cluster consisting of eight IBM RS/6000 workstations as compute nodes and one Sun SparcStation 2 as host. Despite differences in compute node processing power, the mesh was divided as evenly as possible among the nodes — no load balancing was attempted. Note that the execution time for pointwise Newton-GMRES increased by a factor of 5.67 when
TABLE 2. Execution times (in wall clock seconds) for transient simulation of the example problem on the Intel Paragon. Times marked "> 3600" did not finish because of queue run-time limitations.

    Method               # Procs      Time
    Pointwise (Direct)       1      2540.49
    Pointwise (GMRES)        1      1871.06
    Pointwise (GMRES)        2      1251.63
    Pointwise (GMRES)        4       993.98
    Pointwise (GMRES)        8       822.77
    Pointwise (GMRES)       16       558.93
    WRN                    1,2      > 3600
    WRN                      4      3156.29
    WRN                      8      1641.28
    WRN                     16       863.62
    WRN with CSOR            1      2309.04
    WRN with CSOR            2      1173.19
    WRN with CSOR            4       662.12
    WRN with CSOR            8       346.08
    WRN with CSOR           16       192.63
parallelized on two processors. Table 2 shows a comparison of the execution times (in wall clock seconds) required to complete a transient simulation of the example problem using WRN and CSOR on the Intel Paragon. Here, pointwise Newton-GMRES exhibits some speedup all the way to 16 processors, but the speedup is still inferior to that obtained by the waveform methods.
5 Conclusion
From the viewpoint of domain-based parallelism, the waveform approach is a methodology for decomposing the space-time computational domain of time-dependent problems, thereby circumventing the serial time-stepping bottleneck that arises when parallelizing standard sequential methods. The experimental results showed waveform methods to be relatively insensitive to the underlying communication network of the parallel environment where they are running. As a result, waveform methods are perfectly suited to environments having very high communication latency, such as workstation clusters. As MIMD computers continue to become more popular, and as workstation clusters continue to become a legitimate parallel computing resource, waveform methods will grow in importance. There are many open questions in this area. As shown in this paper,
many techniques for solving static (e.g., matrix) problems can be extended in a straightforward way to solving dynamic (e.g., operator) problems. At the same time, we can modify these generalized algorithms (as in CSOR) to gain computational advantage, because of the structure of the particular problems being studied (in this case, initial-value problems). Thus, it would be interesting to study the waveform extensions of such approaches as additive and multiplicative Schwarz algorithms (as suggested in [46]).
Acknowledgments The authors would like to thank Jacob White and Ken Jackson for many helpful discussions. Jeff Squyres conducted many of the numerical experiments and programmed much of the message passing module of the simulation program.
References
[1] K. E. Atkinson, A Survey of Numerical Methods for the Solution of Fredholm Integral Equations of the Second Kind, SIAM, Philadelphia, 1976.
[2] R. Bank, W. Coughran, Jr., W. Fichtner, E. Grosse, D. Rose, and R. Smith, Transient simulation of silicon devices and circuits, IEEE Trans. CAD, 4 (1985), pp. 436-451.
[3] A. Beguelin et al., A users' guide to PVM parallel virtual machine, ORNL/TM 11826, Oak Ridge National Laboratories, Oak Ridge, TN, 1992.
[4] M. Bjørhus, Dynamic iteration for time-dependent partial differential equations: A first approach, Technical Report Numerics no. 1, Norwegian Institute of Technology, Trondheim, Norway, 1992.
[5] H. Blum, S. Lisky, and R. Rannacher, A domain splitting algorithm for parabolic problems, Computing, 49 (1992), pp. 11-23.
[6] K. E. Brenan, S. L. Campbell, and L. R. Petzold, Numerical Solution of Initial-Value Problems in Differential-Algebraic Equations, North Holland, New York, 1989.
[7] P. Brown and Y. Saad, Hybrid Krylov methods for nonlinear systems of equations, SIAM J. Sci. Statist. Comput., 11 (1990), pp. 450-481.
[8] P. N. Brown and A. C. Hindmarsh, Matrix-free methods for stiff systems of ODE's, SIAM J. Numer. Anal., 23 (1986), pp. 610-638.
[9] C. Lubich and A. Ostermann, Multigrid dynamic iteration for parabolic problems, BIT, 27 (1987), pp. 216-234.
[10] J. B. Conway, A Course in Functional Analysis, Second Edition, Springer-Verlag, New York, 1990.
[11] C. Dawson, Q. Du, and T. Dupont, A finite difference domain decomposition algorithm for numerical solution of the heat equation, Technical Report TR9024, Rice University, Houston, TX, 1990.
[12] D. Dumlugol, The Segmented Waveform Relaxation Method for Mixed-Mode Simulation of Digital MOS Circuits, PhD thesis, Katholieke Universiteit Leuven, October 1986.
[13] C. W. Gear, Numerical Initial Value Problems in Ordinary Differential Equations, Automatic Computation, Prentice-Hall, Englewood Cliffs, New Jersey, 1971.
[14] L. Giraud and R. S. Tuminaro, Domain decomposition algorithms for the drift-diffusion equations, in Sixth SIAM Conference on Parallel Processing for Scientific Computing, R. F. Sincovec et al., eds., Norfolk, VA, 1993, pp. 719-726.
[15] P. E. Gray and C. L. Searle, Electronic Principles: Physics, models and circuits, Wiley, New York, 1969.
[16] E. Hairer, S. P. Nørsett, and G. Wanner, Solving Ordinary Differential Equations, vol. 1 and 2, Springer-Verlag, New York, 1987.
[17] M. W. Hirsch and S. Smale, Differential Equations, Dynamical Systems, and Linear Algebra, Academic Press, New York, 1974.
[18] T. Kailath, Linear Systems, Prentice-Hall, Englewood Cliffs, 1980.
[19] R. Kress, Linear Integral Equations, Springer-Verlag, New York, 1989.
[20] B. Leimkuhler, Estimating waveform relaxation convergence, SIAM J. Sci. Comput., 14 (1993), pp. 872-889.
[21] B. Leimkuhler and A. Ruehli, Rapid convergence of waveform relaxation, Applied Numerical Mathematics, 11 (1993), pp. 221-224.
[22] E. Lelarasmee, A. E. Ruehli, and A. L. Sangiovanni-Vincentelli, The waveform relaxation method for time domain analysis of large scale integrated circuits, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 1 (1982), pp. 131-145.
[23] P. Linz, Analytical and Numerical Methods for Volterra Equations, SIAM, Philadelphia, 1985.
[24] A. Lumsdaine, Theoretical and Practical Aspects of Parallel Numerical Algorithms for Initial Value Problems, with Applications, PhD thesis, Massachusetts Institute of Technology, Cambridge, MA, 1992.
[25] R. C. MacCamy and P. Weiss, Numerical solution of Volterra integral equations, Nonlinear Anal., 3 (1979), pp. 677-695.
[26] S. Mattison, CONCISE: A concurrent circuit simulation program, PhD thesis, Lund Institute of Technology, Lund, Sweden, 1986.
[27] U. Miekkala, Dynamic iteration methods applied to linear DAE systems, J. Comput. Appl. Math., 25 (1989), pp. 133-151.
[28] U. Miekkala and O. Nevanlinna, Convergence of dynamic iteration methods for initial value problems, SIAM J. Sci. Stat. Comp., 8 (1987), pp. 459-467.
[29] G. Miel, Iterative refinement of the method of moments, Numer. Funct. Anal. and Optimiz., 9(11-12) (1987-1988), pp. 1193-1200.
[30] R. S. Muller and T. I. Kamins, Device Electronics for Integrated Circuits, John Wiley and Sons, New York, 1986.
[31] O. Nevanlinna and F. Odeh, Remarks on the convergence of the waveform relaxation method, Numerical Functional Anal. Optimization, 9 (1987), pp. 435-445.
[32] F. Odeh, A. Ruehli, and C. Carlin, Robustness aspects of an adaptive waveform relaxation scheme, in Proceedings of the IEEE Int. Conf. on Circuits and Comp. Design, Rye, N.Y., October 83, pp. 396-440.
[33] F. Odeh, A. Ruehli, and P. Debefve, Waveform techniques, in Circuit Analysis, Simulation and Design, Part 2, A. Ruehli, ed., North-Holland, 1987, pp. 41-127.
[34] P. Omari, On the fast convergence of a Galerkin-like method for equations of the second kind, Math. Z., 201 (1989), pp. 529-539.
[35] J. M. Ortega and W. C. Rheinboldt, Iterative Solution of Nonlinear Equations in Several Variables, Computer Science and Applied Mathematics, Academic Press, New York, 1970.
[36] L. S. Pontriagin, Ordinary Differential Equations, Addison-Wesley, Reading, MA, 1962.
[37] M. Reichelt, Accelerated Waveform Relaxation Techniques for the Parallel Transient Simulation of Semiconductor Devices, PhD thesis, Massachusetts Institute of Technology, Cambridge, MA, 1993.
[38] M. Reichelt, J. White, J. Allen, and F. Odeh, Waveform relaxation applied to transient device simulation, in Proceedings of the IEEE Int. Conf. on Circuits and Systems, Espoo, Finland, October 83, pp. 396-440.
[39] Y. Saad and M. Schultz, GMRES: A generalized minimum residual algorithm for solving nonsymmetric linear systems, SIAM J. Sci. Statist. Comput., 7 (1986), pp. 856-869.
[40] R. Saleh and J. White, Accelerating relaxation algorithms for circuit simulation using waveform-Newton and step-size refinement, IEEE Trans. CAD, 9 (1990), pp. 951-958.
[41] D. Scharfetter and H. Gummel, Large-signal analysis of a silicon Read diode oscillator, IEEE Transactions on Electron Devices, ED-16 (1969), pp. 64-77.
[42] J. Scroggs, Parallel processing of a domain decomposition method, in Third SIAM Conference on Parallel Processing for Scientific Computing, G. Rodrigue, ed., Los Angeles, 1987, pp. 164-168.
[43] S. Selberherr, Analysis and Simulation of Semiconductor Devices, Springer-Verlag, New York, 1984.
[44] R. D. Skeel, Waveform iteration and the shifted Picard splitting, SIAM J. Sci. Statist. Comput., 10 (1989), pp. 756-776.
[45] A. Skjellum, Concurrent dynamic simulation: Multicomputer algorithms research applied to ordinary differential-algebraic process systems in chemical engineering, PhD thesis, California Institute of Technology, May 1990.
[46] S. Vandewalle, Parallel Multigrid Waveform Relaxation for Parabolic Problems, Teubner-Skripten zur Numerik, B. G. Teubner, Stuttgart, Germany, 1993.
[47] S. Vandewalle and R. Piessens, Efficient parallel algorithms for solving initial-boundary value and time-periodic parabolic partial differential equations, SIAM J. Sci. Statist. Comput., 13 (1992), pp. 1330-1346.
[48] R. S. Varga, Matrix Iterative Analysis, Automatic Computation Series, Prentice-Hall Inc, Englewood Cliffs, New Jersey, 1962.
[49] Y. V. Vorobyev, Method of Moments in Applied Mathematics, Gordon and Breach, New York, 1965.
[50] J. White and F. Odeh, A connection between the convergence properties of waveform relaxation and the A-stability of multirate integration methods, in Proceedings of the NASECODE VII Conference, Copper Mountain, Colorado, 1991.
[51] J. White, F. Odeh, A. Vincentelli, and A. Ruehli, Waveform relaxation: Theory and practice, Trans. of the Society for Computer Simulation, 2 (1985), pp. 95-133.
[52] J. K. White and A. Sangiovanni-Vincentelli, Relaxation Techniques for the Simulation of VLSI Circuits, Engineering and Computer Science Series, Kluwer Academic Publishers, Norwell, Massachusetts, 1986.
[53] D. M. Young, Iterative Solution of Large Linear Systems, Academic Press, Orlando, FL, 1971.
Chapter 8
A Parallel Multi-Level Solution Method for Large Markov Chains

Graham Horton
Computer Science Department, University of Erlangen-Nürnberg, Germany. Email: [email protected]. This work was carried out in part while the author was a guest at ICASE, NASA Langley Research Center.

Abstract
A new iterative algorithm for the numerical solution of steady-state Markov chains is presented. The method utilizes a set of recursively coarsened representations of the original system to achieve accelerated convergence. Results of numerical experiments are reported, showing significant reductions in computation time, often an order of magnitude or more, relative to the Gauss-Seidel algorithm for a well-known test problem. The potential for parallelization of the method is discussed.
1 Introduction
Computer-aided modelling tools are widely used for the assessment of existing computing systems and the design of future ones. Among those tools used for performance and reliability modelling are queueing networks and generalized stochastic Petri nets (GSPNs). Queueing theory predicts statistics of waiting times and queue lengths for systems with multiple customers whose arrival times and service times satisfy known probability distributions. Petri nets (devised by C. A. Petri in the 1960's) are graphical models of concurrent systems that can be used to explore certain properties of those systems. System states are described by the placement of "tokens" at the nodes of the Petri net, and allowable transitions between states are described by the motions of the tokens along the edges of the net. A simple example is explored in the next section. Under certain conditions queueing networks and GSPNs are equivalent to Markov chains, or sequences of discrete variables such that each member of the sequence is probabilistically dependent only on its predecessor. The steady-state analysis of a Markov chain may be represented by a system of linear equations whose governing matrix is typically sparse and (as a representation of a conservation law) shares important properties with matrices arising in
the study of PDEs. The solution of these equations allows a user to derive useful information about the model, such as the average job queue length or the probability of the system being down. Unfortunately, the number of states of the Markov chain (and thus the dimension of the linear system) grows extremely quickly as the complexity of the model is increased. There is one unknown for each state that the model may be in - a number that is subject to a combinatorial explosion. Thus the Markov chains that have to be solved even for relatively coarse models may have tens or hundreds of thousands of states. The resulting large systems of equations must be solved numerically using an iterative scheme. Typical iterative methods in use in the modelling community are the Power Method, Gauss-Seidel (GS) iteration, and SOR iteration. Each of these methods has the drawback that it may require many iterations to reach a solution, particularly if the system is large or if a high degree of accuracy is required. This can lead to unacceptably long computation times. We demonstrate in this chapter the application of ideas that may be said to have originated in the study of PDEs, namely multigrid and domain decomposition, to the solution of Markov chains. The multilevel (ML) algorithm for Markov chains was introduced in [8]. There, experimental results were given for the Gauss-Seidel, SOR, and multilevel algorithms when applied to Markov chains generated from birth-death processes, tandem queueing networks, and a stochastic Petri-net model. These showed that the ML method is capable of achieving significant savings in computation time over the standard schemes. The method is similar in structure to well-known multigrid algorithms [4, 6], in that it makes use of successively coarsened representations of the original Markov chain and interprets values computed on a coarse level as corrections to the values at the next finest level. An added feature is that the unknowns constitute a vector of probabilities, with upper and lower bounds on each component, and an additive constraint. This has motivated the idea of a multiplicative correction process, whose utility may ultimately feed back into the design of multigrid algorithms for PDE systems with bounded variables, such as composition fractions. In section 2 we give an example of a small GSPN and explain how the Markov chain is derived from it. In section 3 we describe the aggregation equations for Markov chains and in section 4 the multi-level method which is based on them. In section 5 we describe the possibility of parallelizing the ML scheme. In section 6 we describe the relationship between the ML method and an established class of solution methods for Markov chains. In section 7 results of two experiments are presented. The first compares the sequential performance of the ML method to GS using a well-known GSPN test problem, while the second shows speedup results for the ML method on a virtual shared memory multiprocessor. In the final section we summarize the chapter.
FIG. 1. Simple GSPN model with associated Markov chains.
2 A Simple GSPN Example
Generalized stochastic Petri-nets (GSPNs) were introduced by Ajmone Marsan et al. [3] as an extension of the original stochastic Petri-nets of Molloy [11]. A GSPN is a bipartite, directed graph, whose nodes consist of places and transitions. Each place may contain a number of tokens, which may move to neighbouring places via the transitions according to well-defined rules. A transition may be of type direct or timed. Timed transitions are associated with an exponentially distributed firing rate, characterized by its mean value. Direct transitions are associated with an instantaneous firing probability. Thus tokens are delayed in places for an exponentially distributed time when they are to leave via a timed transition, but depart immediately when a direct transition is available. Conceptually, a GSPN is built in such a way that the movement of tokens corresponds to the changes of state of the system being modelled. A Markov chain is a directed graph, whose nodes correspond to the states in which the GSPN can be found and whose edge weights correspond to the mean firing rates of the transitions. Conceptually a token is considered to start out in the node corresponding to the initial state (initial marking) of the GSPN and move from node to node. Solving the system of equations which the Markov chain represents is interpreted as computing the probability of finding the token at any given node when the system is in a steady state. For a comprehensive introduction to GSPNs and their relationship with Markov chains, see [1]. For an introduction to queueing networks and Markov chains, see [9]. Figure 1 shows a very simple GSPN model of a hypothetical computer which can be in one of four states: (ok, failed, rebooting, being repaired), each of which is represented by a place (circle). Timed transitions (represented by a thick line) are associated with the changes of state associated with repair, failure and reboot. The numerical values at the timed transitions are rates. (In the example, a computer fails, on average, twice per month, and can be rebooted at a rate of 360 times per month (two
hours), or repaired at a rate of 60 times per month (twelve hours)). Direct transitions (represented by thin lines) model the probabilities that failure recovery can be achieved by a simple reboot of the system, or by some repair process. The numerical values at the direct transitions are probabilities. The associated Markov chain, shown in the centre, has only three states. Note how the direct transitions have no counterpart in the Markov chain, as they have no effect on the temporal behaviour of the model. Their values are incorporated into the edge weights of the Markov chain. Thus, the edge with weight 1.6 represents the path in the GSPN from "ok", through the timed transition to failure and through the 80% probability direct transition that the failure can be remedied by a simple reboot. The obvious relationship between places in the GSPN and states of the Markov chain in this small example is, however, deceptive. In general, no such mapping exists. Consider, for example, the same model, where two computers are modelled by inserting two tokens into place "ok". In this case, the Markov chain has six states, one for each of the possible states of the GSPN (figure 1, right). Using the notation (state of one processor, state of the other processor) these are (ok, ok), (ok, do_reboot), (do_reboot, do_reboot), (do_repair, do_reboot), (do_repair, do_repair), (ok, do_repair). It is clear that the size of the Markov chain will grow rapidly as a function of the number of tokens and places of the GSPN. The systems of equations generated are of the form Pp = 0.
For the system matrix P ∈ ℝ^(n×n) of any Markov chain, the off-diagonal entries are nonnegative transition rates and each column sums to zero. Furthermore we note that P is almost always sparse and has one linearly dependent row. The solution vector p is a probability vector, i.e. 0 ≤ p_i ≤ 1 for all i and Σ_i p_i = 1.
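As a concrete illustration of these equations, the following small sketch (in Python) solves the three-state chain of Figure 1 numerically. It is not taken from the chapter: the generator-style convention (columns of P summing to zero, so that Pp = 0) and the use of a dense solver are assumptions made for the example.

import numpy as np

# States: 0 = ok, 1 = do_reboot, 2 = do_repair. Rates per month, taken from
# the example of Figure 1: failure rate 2 (80% handled by a reboot, 20% by a
# repair), reboot rate 360, repair rate 60.
rates = {(0, 1): 1.6,    # ok -> do_reboot  (2 * 0.8)
         (0, 2): 0.4,    # ok -> do_repair  (2 * 0.2)
         (1, 0): 360.0,  # do_reboot -> ok
         (2, 0): 60.0}   # do_repair -> ok

n = 3
P = np.zeros((n, n))
for (i, j), r in rates.items():
    P[j, i] += r   # inflow into state j from state i
    P[i, i] -= r   # outflow from state i, so each column sums to zero

# Solve P p = 0 together with sum(p) = 1 by replacing the (linearly
# dependent) last row with the normalization condition.
A = P.copy()
A[-1, :] = 1.0
b = np.zeros(n)
b[-1] = 1.0
p = np.linalg.solve(A, b)
print(p)   # steady-state probabilities of (ok, do_reboot, do_repair)

Replacing the linearly dependent last row by the normalization condition is one standard way of imposing the probability constraint in a direct solve.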
FIG. 2. Aggregation of Markov Chains.

3 Aggregation of Markov Chains
Consider a Markov chain consisting of n states s_1, ..., s_n. Denote the unknown vector by p, where p_i is the probability of being in state s_i. We then have to solve the system of equations
with the additional condition
Equations (1) and (2) form a sparse linear system which is typically solved numerically using the GS or SOR algorithm. These schemes suffer the drawback of often needing a large number of iterations when n is large or when a high degree of accuracy is required. A coarser representation of the Markov chain described by matrix P may be obtained by aggregation. This means creating a new Markov chain described by a matrix Q with the vector of state probabilities q, each of whose N states S_1, ..., S_N is derived from a small number of states of the original system. Figure 2 illustrates the situation for an eight-state Markov chain P, where states are aggregated in pairs to form a four-state coarser level system Q. In the following we will use the terms fine level and coarse level to refer to Markov chains where the latter is obtained by aggregation from the former. The relation s_k ∈ S_i signifies that the fine level state s_k is mapped by the aggregation operation to the coarse level state S_i. The matrix Q of the aggregated system is chosen as follows:
This is the classical aggregation equation. Note that the matrix Q is a function not only of the fine level matrix P, but also of the fine level solution vector p. This yields the aggregated equations in the unknown q:
It can then be shown that
i.e. the solution q of the aggregated system truly represents a coarser version of the solution p of the original problem. The probability q_i of being in the coarse state S_i is the sum of the probabilities of being in any of its constituent fine-level states.
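Because the aggregation equation (3) is reproduced above only in words, the following sketch shows one common way of forming Q from P and the current fine-level vector p. The restriction-times-operator-times-prolongation form is an assumption consistent with the Galerkin interpretation given in section 4, not a copy of Eq. (3).

import numpy as np

def aggregate_matrix(P, p, groups):
    # groups[k] is the (integer) index of the coarse state S_i that fine state
    # s_k is mapped to. The prolongation distributes a coarse value over its
    # fine states proportionally to p, so that summation (restriction) and
    # this prolongation reproduce p exactly at convergence.
    n = len(p)
    N = int(groups.max()) + 1
    R = np.zeros((N, n))
    R[groups, np.arange(n)] = 1.0            # restriction: summation over each group
    w = np.empty(n)
    for i in range(N):
        mask = groups == i
        w[mask] = p[mask] / p[mask].sum()    # prolongation weights within group i
    I_p = R.T * w[:, None]                   # prolongation operator (n x N)
    return R @ P @ I_p                       # coarse matrix Q (N x N)

# Example call: Q = aggregate_matrix(P, p, np.arange(len(p)) // 2)

With q defined by summation (q_i equal to the sum of p_k over s_k ∈ S_i), a converged p with Pp = 0 indeed satisfies Qq = 0, which is the consistency property (4).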
4 Multi-Level Solution Algorithm
We adopt the following abbreviations for componentwise operations on vectors a, b, c ∈ ℝ^m:
As is customary in multigrid, the ML algorithm is defined recursively from a two-level version of the algorithm. We enter the (i+1)th iteration of the ML algorithm with the current approximation to the solution p^(i) obtained as a result of the i-th iteration, whereby p^(0) denotes the initial guess, and begin by performing one or more sweeps of the GS algorithm, obtaining the vector p̄: We assume throughout that application of GS includes a subsequent normalization step to enforce (2). The vector p̄ will not, in general, be the solution p of (1), but we may write
where p* is the elementwise multiplicative correction necessary to p̄. Knowledge of p* would immediately enable us to compute the solution p. We may write (1) as
Since p̄ has been smoothed by application of the GS algorithm, we assume that it no longer contains any high frequency error components,
and thus that p* is smooth. Therefore we may compute an approximation to p* on a coarsened system, since the dimension of the latter is smaller, and thus the computation will be cheaper. We write a coarsened version of this equation as
where Q̄ is the matrix of the aggregated system and q* and q̄ represent aggregated representations of p* and p̄, respectively. The coarse system matrix Q̄ is chosen to be an approximation to the matrix Q from (3), replacing p by p̄, since p is not available until the algorithm has converged:
In the case of a converged solution, we will, however, have p̄ = p and therefore the correct coarse matrix Q̄ = Q. In order to obtain q̄ in (8) we require an operator that maps a fine level vector to the coarser, aggregated vector. This operator will be denoted by R (from the multigrid restriction operation), and we choose summation for R, in order to fulfill (4) at convergence:
We proceed by defining
thus obtaining the coarse level equations to be solved:
Solving (12) for q will therefore enable us to compute q*, the coarse approximation to the correction, via (11): q* = q/q̄. We compute the fine-level correction from its coarse approximation using the operator I (interpolation):
The multiplicative correction is equivalent to a scaling: a coarse level node value q_i represents the probability of being in any state s_k ∈ S_i and therefore q*_i represents the scaling factor necessary to achieve this for the values p̄_k, s_k ∈ S_i. The new iterate p^(i+1) is computed using
If the algorithm converges, we hope that p^(i+1) will be a better approximation to p than p^(i). Meanwhile, however, we have p^(i+1) ≠ p, since the correction p* was computed only approximately on a coarse grid, using an incorrect matrix Q̄ ≠ Q. The reader with a background in multigrid will recognize the major elements of smoothing, intergrid transfers, and coarse equation derivation. With its use of a multiplicative correction the ML algorithm differs from conventional multigrid, where an additive correction is performed. However, we may rewrite the ML correction step in an additive way, obtaining for any fine-level state s_i
Thus the additive correction is solution-dependent: the correction depends on the value to be corrected. After the iteration has converged, we have
where 1 denotes the vector (1, 1, ..., 1)^T, i.e. no further correction takes place. We then also have Q̄ = Q and therefore q = q̄. It is interesting to note that the ML choice of coarse level equation satisfies the Galerkin property, i.e. that the coarse operator be defined as the concatenation of the restriction, fine-level and prolongation operators. This can be seen by rewriting (9) as that sequence of operations applied to the coarse-level vector (0, ..., 0, 1, 0, ..., 0) with the 1 in the i-th position and using temporary variables u, v: (Restriction) (Fine-level operator) (Prolongation)
Note that the question of assigning fine nodes to their coarse counterparts is still open. This we will call the coarsening strategy or aggregation strategy. The aggregation strategy can have a significant effect on the performance of the ML algorithm. In cases where we have some knowledge of the structure of the Markov chain, for example when queueing networks are to be modelled, then we may utilize this information in the construction of the aggregated system. In other cases, mapping strongly coupled fine states to the same aggregated state seems to be an efficient strategy.
procedure mss(l)
  if (l = 0) solve P_l p_l = 0
  else
    p_l = GS^ν(p_l)
    p̄_{l-1} = R_{l-1,l}(p_l)
    mss(l-1)
    p*_{l-1} = p_{l-1} / p̄_{l-1}
    p*_l = I_{l,l-1}(p*_{l-1})
    p_l = C(p_l, p*_l)
  return

FIG. 3. Multi-Level Algorithm.

The two-level version of the ML iteration is given by the sequence of steps (5), (10), (9), (12), (13), (14). The multi-level algorithm is obtained by recursive application of the two-level algorithm to obtain a solution to the aggregated equation (12) and is described in algorithmic form in figure 3. We use the subscript l to denote the level of representation (l = l_max for the finest level, l = 0 for the coarsest level). The coarse level l-1 and fine level l between which the operators I and R map are identified by appropriate indices. Note that, because of the recursive nature of the algorithm, the unknowns q*, q and q̄ are represented by the variables p*_{l-1}, p_{l-1} and p̄_{l-1}, respectively. We allow in general the possibility of applying GS ν times at each level with ν ≥ 1, denoted by GS^ν. Although the original problem (1) is linear, the multi-level algorithm is non-linear, as the coarse matrix coefficients are a function of the current fine level solution vector.
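To make the structure of one cycle concrete, the following self-contained sketch implements a two-level version of the iteration for a small dense test matrix. It is an illustration only: the pointwise GS smoother, the aggregation formula and the direct coarse solve are simplified assumptions (using the columns-sum-to-zero convention), not the chapter's implementation.

import numpy as np

def gs_sweep(P, p, sweeps=1):
    # Pointwise Gauss-Seidel on P p = 0, followed by normalization (Eq. (2)).
    p = p.copy()
    n = len(p)
    for _ in range(sweeps):
        for i in range(n):
            off = P[i, :] @ p - P[i, i] * p[i]
            p[i] = -off / P[i, i]
        p /= p.sum()
    return p

def solve_small(Q):
    # Direct steady-state solve of a small coarse system Q q = 0, sum(q) = 1.
    N = Q.shape[0]
    A = Q.copy()
    A[-1, :] = 1.0
    b = np.zeros(N)
    b[-1] = 1.0
    return np.linalg.solve(A, b)

def two_level_cycle(P, p, groups, sweeps=1):
    n = len(p)
    N = int(groups.max()) + 1
    p_bar = gs_sweep(P, p, sweeps)                # smoothing, giving p-bar
    R = np.zeros((N, n))
    R[groups, np.arange(n)] = 1.0
    q_bar = R @ p_bar                             # restriction by summation
    w = p_bar / q_bar[groups]                     # prolongation weights within each group
    Q = R @ P @ (R.T * w[:, None])                # coarse matrix built from p-bar
    q = solve_small(Q)                            # coarse-level solve
    q_star = q / q_bar                            # coarse multiplicative correction
    p_new = p_bar * q_star[groups]                # interpolate and correct
    return p_new / p_new.sum()

# Tiny test: an 8-state chain aggregated in pairs, as in Figure 2.
rng = np.random.default_rng(0)
n = 8
P = rng.random((n, n))
np.fill_diagonal(P, 0.0)
P -= np.diag(P.sum(axis=0))                       # make columns sum to zero
groups = np.arange(n) // 2
p = np.full(n, 1.0 / n)
for it in range(20):
    p = two_level_cycle(P, p, groups)
print(p, np.linalg.norm(P @ p))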
5 Parallelization
The ML method allows an easy parallelization via data parallelism. The approach is completely analogous to the domain-based parallelism in multigrid methods for PDEs. Essentially, we observe that the processing of levels of the hierarchy of chains must be done sequentially, but that all operations on data within each level or between levels may be performed in parallel. We make the following assumptions about the data partitioning: 1. The states of the Markov chain are distributed equally among all processors. 2. Aggregation is performed locally within each processor, i.e. no two states assigned to different processors are mapped to a common coarser level state.
3. Due efforts are made to minimize communication overhead.
If points (1) and (2) above are met, we observe immediately from equations (10) and (14) that the restriction and correction operators can be performed in parallel with 100% efficiency, since no communication is required. This is an interesting difference from the equivalent operations in a classical multigrid context, where the restriction and prolongation operators are described by stencils which may necessitate intra-grid, i.e. horizontal, communication. Although point (2) above is not strictly necessary, i.e. the parallel algorithm could be implemented without it, we obviously would prefer the increased efficiency and ease of programming that it brings. This means that the partitioning algorithm influences the aggregation strategy and therefore the convergence behaviour of the algorithm. This implies that due care must be exercised during the partitioning phase to ensure that a numerically appropriate, parallel aggregation is possible. In order to achieve parallelism in the relaxation step, we must change the ordering of the states. This can always be done in such a way that all processors simultaneously relax the states, using "old" values from the neighbouring partitions. Such a scheme may be regarded as a domain decomposition relaxation scheme, in which each block is treated by pointwise Gauss-Seidel and the domains are treated in a Block-Jacobi manner with respect to each other; a small sketch of this hybrid sweep is given below. Since the Markov chain may have any topology, we are forced to treat it as a general directed graph. This makes partitioning it a non-trivial problem. Much research has recently gone into the problem of producing good partitions of general graphs, as the problem arises in the unstructured grid solution of PDEs. We use the Chaco package [7] in a preprocessing step which takes the Markov chain as input and delivers a high quality partitioning as output. The quality of the partitioning is determined by the degree to which points (1) and (3) above are satisfied.
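A sequential emulation of this hybrid relaxation might look as follows; the partition array, the Pp = 0 convention and the sweep structure are illustrative assumptions rather than the chapter's parallel code.

import numpy as np

def hybrid_sweep(P, p, part):
    # part[k] = index of the partition (processor) owning state k. States are
    # relaxed Gauss-Seidel fashion inside each partition, while values owned
    # by other partitions are frozen at their old values (block-Jacobi
    # coupling between partitions).
    p_old = p.copy()
    p_new = p.copy()
    n = len(p)
    for proc in np.unique(part):
        for i in np.where(part == proc)[0]:
            off = 0.0
            for j in range(n):
                if j == i:
                    continue
                off += P[i, j] * (p_new[j] if part[j] == proc else p_old[j])
            p_new[i] = -off / P[i, i]
    p_new /= p_new.sum()
    return p_new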
6 Relationship to IAD methods
Iterative Aggregation/Disaggregation (IAD) methods are a well-known class of algorithms for the steady state solution of Markov chains which bear a close relationship to the ML method presented here. IAD methods are reviewed in [12]. The generic IAD method is that of Koury, McAllister and Stewart (KMS) [10]. Using the notation of section 3, the KMS method is defined as follows:
1. Construct the coarse level matrix Q using (9).
2. Solve the coarse system (12).
3. Perform the correction (13).
FIG. 4. Birth/Death Markov chain.
4. For each coarse level state S_i solve the set of equations connecting all fine level states s_k aggregated to it (s_k ∈ S_i).
5. If not converged, go to 1.
Step (4) is a Block Gauss-Seidel step on the finer level, where the blocks are defined by the aggregated system. IAD methods such as this suffer the drawback that they are only applicable to so-called nearly completely decomposable Markov chains, i.e. those which may be represented by a matrix which is almost block diagonal in structure. The aggregation is then defined by assigning one coarse unknown to each block. The number of fine unknowns that are aggregated to a single coarse state is generally quite large. The KMS method allows for an obvious parallelization, since each of the equation solves in step (4) above can be performed on a different processor, if a Block Jacobi approach is used. The same style of parallelism is applicable to steps (1) and (3). The solution of the coarse system could be done using a standard data-parallel method. This approach has, however, the drawback that the degree of parallelism is determined by the structure of the problem and that the different blocks may vary greatly in size, resulting in a possibly serious load imbalance. We now see that the KMS algorithm is a special case of ML, obtained by the following choices:
1. Use of only two levels of representation of the system, rather than multiply coarsened problems.
2. N
7 Experimental Results
We first consider a birth-death Markov chain with a birth rate of λ = 1 and a death rate of μ = 2, shown in figure 4, and vary the number of states. Birth-death chains (with more general coefficients) provide the
fundamental building block for queueing systems and biological population models, where the solution value at each state represents the probability of a queue or population having a certain size and the coefficients represent arrival/birth and departure/death rates. Birth-death chains are treated in detail in [9]. For this very simple example the matrix P is given by the tridiagonal matrix

FIG. 5. Birth-Death chain. Left: Number of Iterations; Right: Computation time. From [6].
The results are presented in figure 5. The number of iterations increases linearly with the number of states for the GS and SOR algorithms, whereas it remains fixed at 21 iterations for the ML algorithm. We use the optimal relaxation parameter ω for SOR, obtained by using a binary search between 1.0 and 2.0. This results in over-optimistic results for the SOR algorithm since in practice ω is not known, and can at best be approached by dynamic tuning, thus resulting in additional iterations. In all experiments we set the initial iterate to the vector (1/n, ..., 1/n)^T and the systems are solved with all methods to an accuracy of ||Pp^(i)||_2 < 10^-9. The ML algorithm recursively coarsens the set of equations until the coarsest system has only two states. The computation time increases quadratically with the number of states for SOR and GS, whereas it increases only linearly for ML. Thus ML is an optimal method for this particular problem. The GS (SOR) algorithm requires 257 (128) times more processing time than ML for
a birth-death chain of 10,000 states. The ratios increase with system size. Even for small birth-death chains, such as 1,000 states, the ML algorithm is more than an order of magnitude faster than GS and SOR. Note, however, that this experiment serves only to demonstrate the typical behaviour of the three methods, as this problem has a well-known and simple analytic solution. Results of further experiments are reported in [8].

FIG. 6. Ajmone Marsan/Balbo/Conte Model.

FIG. 7. Computational work (millions of floating point operations) for GS and ML to solve the Ajmone Marsan/Balbo/Conte model.

Figure 6 shows a multiprocessor system in which the n processors p_1 ... p_n compete for access to two memory units CM1, CM2 via a common bus. Ajmone Marsan, Balbo and Conte [2] give a GSPN model of this multiprocessor which allows for the possibility of failure and repair of the processors, the bus and the memory units. Figure 7 shows the computational work of the ML and GS methods applied to this problem as a function of problem size measured as the number of processors in the model. The number of unknowns varied from 91 (2 processors) to 3883 (10 processors). Although these are problems of very small size, the saving in computational effort of ML over GS is quite dramatic (a factor of 39 for the smallest and of 77 for the largest problems considered). It is also clear that the gap widens as the problem size is increased.
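For reference, the tridiagonal birth-death matrix used in the first experiment can be assembled as follows. The boundary treatment and the columns-sum-to-zero convention are assumptions for the sketch, since the explicit matrix is not legible in this copy.

import numpy as np

def birth_death_matrix(n, birth=1.0, death=2.0):
    # Tridiagonal generator-style matrix for a birth-death chain with n
    # states: state k moves to k+1 at the birth rate and to k-1 at the
    # death rate.
    P = np.zeros((n, n))
    for k in range(n - 1):
        P[k + 1, k] += birth   # birth: k -> k+1
        P[k, k] -= birth
        P[k, k + 1] += death   # death: k+1 -> k
        P[k + 1, k + 1] -= death
    return P

P = birth_death_matrix(1000)   # lambda = 1, mu = 2 as in the experiment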
FIG. 8. Speedup S of the ML algorithm on the KSR 1 applied to Molloy's Petri-net using 50, 60 and 70 tokens.
Figure 8 shows the speedup obtained by a parallel implementation of the ML method on the KSR 1 using 1, 2, 4, and 8 processors. The KSR 1 is a virtual shared memory multiprocessor which allows the user to make use of the shared variables programming model. This greatly simplifies the parallel implementation of the ML method, as the data structures are irregular and make heavy use of pointers. The problem solved was the stochastic Petri net in the original paper of Molloy [11]. The number of tokens used was 50, 60 and 70, giving Markov chains with 45526, 77531 and 121836 states and computation times per ML iteration on one processor of approximately 23, 44 and 65 seconds, respectively. The Markov chain was partitioned with the Chaco [7] software package. The speedup is seen to be close to optimal for all three problems for two and four processors, but somewhat less good for eight processors. The current implementation achieves its parallelism from the sequential program essentially by retaining the original code, computing upper and lower bounds within the solution vector for the assigned work of each processor and inserting global synchronizations at appropriate points. Bearing this in mind, we consider the speedup result to be very good. It is widely agreed that realizing the data partitioning in the data structures themselves and mapping these to the local memories of the appropriate processor achieves superior results due to the higher degree of data locality.
8 Summary and Outlook
The ML algorithm described in this chapter has been shown to require significantly less computation time than the standard scheme for a number of test problems. Many other numerical experiments not reported here have shown this to be true for a wide range of cases. Moreover, the advantage of ML increases with the number of states of the Markov chain. The ML method can be parallelized using a standard approach which
is well established in the context of multigrid algorithms for PDEs. Initial experimental results on a shared memory multiprocessor show that good speedups are attainable on a small number of processors on problems of medium size. More work is needed (and is already in progress) to understand its parallel behaviour and to optimize the implementation. Work in progress also includes a parallel implementation on the basis of a standard message-passing package. This will enable the code to run on a wide variety of parallel machines, including workstation clusters.
Acknowledgements
The implementation of the parallel multi-level method was carried out by K. Buchacker. The Markov chains were generated using the SPNP software package of G. Ciardo [5].

References
[1] M. Ajmone Marsan, G. Balbo, G. Chiola, G. Conte, Generalized Stochastic Petri Nets Revisited: Random Switches and Priorities, in Proceedings of the International Workshop on Petri Nets and Performance Models, Madison, WI, USA, August 1987, IEEE Computer Society Press, 1987.
[2] M. Ajmone Marsan, G. Balbo, G. Conte, Performance Models of Multiprocessor Systems, MIT Press, Cambridge, 1988.
[3] M. Ajmone Marsan, G. Balbo, G. Conte, A class of Generalized Stochastic Petri Nets for the performance evaluation of multiprocessor systems, ACM Trans. Comput. Systems, 2 (2) (1984), pp. 93-122.
[4] W. L. Briggs, A Multigrid Tutorial, SIAM, Philadelphia, 1987.
[5] G. Ciardo, K. Trivedi, J. Muppala, SPNP: stochastic Petri net package, in Proceedings of the Third Int. Workshop on Petri Nets and Performance Models (PNPM89), Kyoto, Japan, Dec. 1989, IEEE Computer Society Press, pp. 142-151.
[6] W. Hackbusch, Multigrid Methods and Applications, Springer Verlag, Heidelberg, 1985.
[7] B. Hendrickson, R. Leland, The Chaco User's Guide, Version 1.0, Report SAND93-2339, Sandia National Laboratories, Albuquerque, NM, 1993.
[8] G. Horton, S. Leutenegger, A Multilevel Solution Algorithm for Steady-State Markov Chains, ICASE Report 93-81, NASA CR-191558, NASA Langley Research Center, September 1993; also in Proceedings of SIGMETRICS '94, Nashville, May 16-20, 1994.
[9] L. Kleinrock, Queueing Systems, Vol. 1: Theory, Wiley & Sons, New York, 1975.
[10] R. Koury, D. McAllister, W. Stewart, Methods for computing stationary distributions of nearly completely decomposable Markov chains, SIAM J. Alg. Disc. Math., 5 (2) (1984), pp. 164-186.
[11] M. Molloy, Performance analysis using stochastic Petri nets, IEEE Trans. Comp., 31 (9) (1982), pp. 913-917.
[12] P. Schweitzer, A survey of aggregation-disaggregation in large Markov chains, in W. Stewart (ed.), Numerical Solution of Markov Chains, Marcel Dekker, 1991.
Chapter 9
Optimizing Substructuring Methods for Repeated Right Hand Sides, Scalable Parallel Coarse Solvers, and Global/Local Analysis
Charbel Farhat
Department of Aerospace Engineering Sciences, and Center for Aerospace Structures, University of Colorado at Boulder, Boulder, CO 80309-0429, U. S. A. This work was supported in part by the National Science Foundation under Grant ASC9217394, by CMB at NASA Langley Research Center under Grant NAG 1536427, and by RNR NAS at NASA Ames Research Center under Grant NAG 2-827.
Abstract
Direct solvers currently dominate commercial finite element structural software, but do not scale well in the fine granularity regime targeted by emerging parallel processors. Substructure-based iterative solvers — also called domain decomposition (DD) algorithms — lend themselves better to parallel processing, but must overcome several obstacles before earning their place in general purpose structural analysis programs. One such obstacle is the solution of systems with many or repeated right hand sides. Such systems arise, for example, in multiple load static analyses, in implicit linear dynamics computations, in the solution of nonlinear problems via a quasi-Newton scheme, and in various structural eigenvalue problems. Direct solvers are well-suited for these problems because after the system matrix has been factored, the multiple or repeated solutions can be obtained through relatively inexpensive forward and backward substitutions. On the other hand, iterative solvers are in general ill-suited for these problems because they must often restart the iterations from scratch for every different right hand side. Another obstacle for DD iterative methods is the parallel non-scalability of the underlying factorization-based coarse solvers, which often ruins the sought after overall parallel speed-up. In this chapter, we present a methodology for extending the range of applications of substructuring methods to problems with multiple or repeated right hand sides, and for solving efficiently their coarse grid problems on massively parallel processors. We illustrate the proposed methodology with the solution of static and dynamic structural problems, and highlight its potential to outperform direct solvers on massively parallel computers. We also describe an extension of this methodology to global/local analyses which are common in structural mechanics and involve multiple right hand sides as well as nearby left hand matrices.
1 Why Domain Decomposition Based Iterative Solvers?
Direct solvers currently dominate commercial finite element structural software, essentially because: (a) they are robust and reliable, (b) they are versatile, (c) they work well with secondary storage, and (d) for structural problems, they usually outperform the class of iterative algorithms that were popular when these codes were originally developed. However, with the advent of parallel processing, alternatives to direct solvers must be researched because parallel factorization algorithms applied to sparse systems arising from the finite element formulation of structural problems are in general not scalable — that is, their performance does not necessarily increase with the number of processors. This lack of parallel scalability is illustrated in the following analysis. While true sparse solvers [1, 2] are increasingly penetrating finite element structural software, skyline solvers [1, 3] are still widely used for solving the symmetric systems of equations generated by research and production structural codes. Most massively parallel skyline solvers that have been recently reported in the literature are closely related to the parallel active column symmetric solver presented in [4]. In general, clusters of columns are distributed across the processors following a one-dimensional static block-wrap mapping scheme (FIG. 1). At each step k of the factorization process, column k is broadcast to all processors and the entries of row k are updated using parallel dot products (FIG. 1). Of course, depending on the target parallel architecture, several improvements can be added to this basic parallel skyline solver. Some of these are described and discussed in detail in [5]. Usually, after the mesh nodes of a finite element model are renumbered for optimal storage, the skyline structure of the symmetric stiffness matrix becomes close to that of a constant-bandwidth matrix. Hence, for simplicity, we shall assume that the system matrix is banded (FIG. 2). Let b, Sa, Si and Np denote respectively the semi-bandwidth of the given system, the 64-bit floating-point arithmetic peak performance of a single processor of the target parallel machine, the peak interconnect speed of that machine measured in bytes per second, and its number of processors. We assume that all real data are stored in 8-byte words. At each step of the factorization process, the computational and communication parallel time
of the factorization algorithm can be evaluated as follows (see FIG. 2):
FIG. 1. Parallel skyline symmetric solver.
FIG. 2. Banded computational model (upper part of the symmetric matrix).
Note that in the second of Eqs. (1), we have neglected for simplicity the effect of latency and the number of processors Np on t_transmit. For example, in an optimal broadcast, the above communication time should be increased by a factor log2(Np). However, our simplified expression for t_transmit is reasonably accurate since computation and communication can be overlapped during the factorization process [5]. Obviously, the best parallel performance is obtained when t_transmit does not exceed t_compute.
From Eqs. (1), it follows that the above saturation condition holds if and only if:
TABLE 1 reports the values of the system half-bandwidth that meet the saturation condition for today's emerging parallel processors. Clearly, b = 21,942 and b = 60,262 are unrealistic values of the matrix half-bandwidth. For example, most finite element models related to aerospace shell structures have a half-bandwidth that varies between 300 and 2,000. Moreover, problems with a half-bandwidth as large as 21,942 or 60,262 entail memory, CPU, and I/O requirements that overwhelm even the largest of the currently available supercomputing resources.

TABLE 1
Skyline solver - Saturation condition - half-bandwidth values

Parallel processor    Np     Sa           Si                 b
iPSC-860              128    60 Mflops    2.80 Mbytes/s.     21,942
KSR-1                 256    40 Mflops    8.50 Mbytes/s.     9,637
CM-5                  512    128 Mflops   8.70 Mbytes/s.     60,262
If the saturation condition (2) is enforced for a fixed problem characterized by its half-bandwidth b, the number of "useful" processors — that is, the number of processors beyond which communication time becomes dominant — can be computed from Eqs. (1-2) as follows:
TABLE 2 reports for current massively parallel systems the number of useful processors for b = 1,500. This value of the half-bandwidth is representative of today's large-scale problems in aerospace structures.

TABLE 2
Skyline solver - Number of useful processors - b = 1,500

Parallel processor    Np     Sa           Si                 N'p    N'p/Np
iPSC-860              128    60 Mflops    2.80 Mbytes/sec    9      7.00 %
KSR-1                 256    40 Mflops    8.50 Mbytes/sec    40     15.60 %
CM-5                  512    128 Mflops   8.70 Mbytes/sec    13     2.50 %
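The entries of Tables 1 and 2 can be reproduced with a few lines of arithmetic. Since Eqs. (1)-(3) are not legible in this copy, the formulas below are an assumption chosen to be consistent with the tabulated values (one column of b 8-byte words broadcast per factorization step, weighed against the parallel floating-point work per step); they are not a quotation of the chapter's expressions.

machines = {               # name: (Np, Sa [flop/s], Si [bytes/s])
    "iPSC-860": (128,  60e6, 2.80e6),
    "KSR-1":    (256,  40e6, 8.50e6),
    "CM-5":     (512, 128e6, 8.70e6),
}

b_fixed = 1500.0           # representative half-bandwidth (Table 2)
for name, (Np, Sa, Si) in machines.items():
    b_sat = 8.0 * Np * Sa / Si             # half-bandwidth at saturation (Table 1)
    Np_useful = b_fixed * Si / (8.0 * Sa)  # processors beyond which communication dominates (Table 2)
    print(name, round(b_sat), round(Np_useful, 1))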
Clearly, the relative number of processors that can be effectively used by parallel skyline solvers on today's massively parallel processors is small. A similar scalability analysis can be performed for parallel skyline forward and backward substitutions. However, it should be clear from FIG. 2 that the result of such an analysis can only be more pessimistic. In the above scalability analysis, we have used peak performances for both floating-point arithmetic and interprocessor communication speeds, because these are the only numbers that researchers in this area will agree on. However, this leads to a worst case scenario because in practice a larger percentage of the peak performance can be attained for communication than for computation. In summary, the scalability analysis presented here shows that parallelism in direct skyline solvers is limited by the problem half-bandwidth and not by the problem size, and that these solvers are not scalable when applied to structural problems. Whether future improvements in interprocessor communication hardware will improve the potential of massively parallel skyline solvers remains unknown at this moment. On the other hand, substructure based iterative algorithms — also known as domain decomposition (DD) methods ([6] and the proceedings of subsequent meetings) — are known to have excellent parallel scalability properties [7]: for many of these algorithms and for sufficiently large problems, the cost of each iteration can be decreased almost linearly with the number of available processors. However, because increasing the number of subdomains is the simplest means for increasing the degree of parallelism of a DD method, the only interesting DD algorithms for massively parallel processing are those where the number of iterations does not significantly grow with the number of subdomains. This property defines a numerically scalable DD method — that is, a DD method with an interface problem characterized by a condition number that does not grow, or "grows weakly", asymptotically with the number of subdomains [7-9].
REMARK 3. It should be noted that some DD methods outperform direct solvers even on sequential machines, and even when solving ill-conditioned structural problems discretized with shell and beam elements [10, 11].

2 Practical Obstacles
However, DD methods must overcome several obstacles before they can be adopted by structural analysis program developers. One such obstacle is the solution of systems with many or repeated right hand sides. Such systems arise, for example, in multiple load static analyses, in implicit linear dynamics, in the solution of nonlinear problems via a quasi-Newton scheme, in eigenvalue problems, and in many other structural computations. Direct solvers are well-suited for these problems because after the system matrix has been factored, the multiple or repeated solutions can be obtained through
relatively inexpensive forward and backward substitutions. On the other hand, iterative solvers are in general ill-suited for these problems because they must often restart iterations from scratch for every different right hand side. Another obstacle for numerically scalable DD methods relates to their efficient implementation on massively parallel processors. Indeed, numerical scalability is usually achieved via the introduction in a DD method of a coarse problem (or coarse grid, by analogy with multigrid methods) that relates to the original problem and that must be solved at each global iteration (usually a conjugate gradient (CG) iteration for symmetric problems). Direct methods are often chosen for solving the coarse problem despite the fact that they are difficult to implement and inefficient on a massively parallel processor. Therefore in many cases, a numerically scalable DD method loses its appeal because of its lack of parallel scalability. Obviously, one way to restore parallel scalability is to solve iteratively the coarse problem, for example, using a CG scheme. However, this approach raises again the first obstacle — that is, how to solve iteratively and efficiently a system with a constant matrix and repeated right hand sides.
3 Objectives
The iterative solution of symmetric systems with multiple right hand sides has been previously addressed in [12,13], and more recently in [14,15]. In this chapter, we revisit this problem in the context of substructure based iterative methods and a priori as well as a posteriori multiple right hand sides. The basic idea exposed herein is partially related to that proposed in [12] and analyzed in [13]. However, the specific algorithm presented in this chapter is different, simpler, and easier to parallelize than that described in [12], and faster but more memory consuming than both schemes discussed in [15]. Essentially, we formulate the overall problem as a series of consecutive minimization problems over stiffness-orthogonal and supplementary subspaces, and tailor the preconditioned conjugate gradient (PCG) algorithm to solve them efficiently. For each new right hand side, we first compute an optimal startup solution via the projection of the new interface problem onto an agglomerated Krylov space associated with previous right hand sides. Next, we improve this solution with an accelerated PCG algorithm where all search directions are orthogonalized with respect to the Krylov subspaces generated by previous right hand sides. The resulting solution algorithm is scalable, whereas forward and backward substitution algorithms are not. Its convergence rate improves with the number of search directions that are stored, and therefore its performance depends on the amount of available memory. However for most structural problems, this new solution algorithm
entails only a fraction of the storage requirements of direct solvers. We apply the same iterative methodology for solving the coarse problem of a numerically scalable DD method and report impressive performance results on an iPSC-860 computer with 128 processors, for several static and dynamic structural problems. We also extend the proposed solution methodology to global/local analyses [16, 17] which are common in structural mechanics and involve a mixture of multiple right hand sides and nearby stiffness matrices. Finally, we note that even though the specific algorithms described herein are for symmetric positive semi-definite systems, their extension to unsymmetric problems is straightforward. For example, we point the reader to reference [18] for an extension to unsymmetric systems of the algorithm proposed in [12], and for its application in boundary integral equations.
4 Problem Formulation
For the sake of clarity, we first discuss the problem and the proposed solution methodology in the absence of any substructuring technique. The implications of domain decomposition and coarse problems are highlighted in Section 6. Here, we are interested in solving iteratively the following problems:

K u_i = f_i,    i = 1, ..., N_rhs    (5)
where K, {f_i}, i = 1, ..., N_rhs, and {u_i}, i = 1, ..., N_rhs, denote respectively the stiffness matrix of a given structure, a set of N_rhs generalized force vectors, and the corresponding set of N_rhs generalized displacement vectors. Such problems arise, for example, when multiple load patterns are applied to a structure, when linear time dependent problems are solved, and in general when a computational algorithm requires repeated solutions of a system of linear equations with the same matrix but different right hand sides. Problems (5) above can be transformed into the following minimization problems:

min over u ∈ ℝ^{N_K} of Φ_i(u) = u^T K u / 2 - u^T f_i,    i = 1, ..., N_rhs    (6)
where N_K is the dimension of the stiffness matrix, ℝ is the set of real numbers, and T is the transpose superscript. If each minimization problem in (6) is solved with a PCG algorithm, the following Krylov subspaces are generated:
where s_i^(k) and r_i ≤ N_K denote respectively the search direction vector at iteration k, and the number of iterations for convergence of the PCG algorithm applied to the minimization of Φ_i(u). Additionally, we introduce the following agglomerated subspaces:
Let S_i denote the rectangular matrix whose column-vectors are the search directions s_i^(k). From the orthogonality properties of the conjugate gradient method, it follows that:
However, note that in general S_i^T K S_i is not a diagonal matrix. Finally, we define S̄_i as the matrix whose column vectors also span the subspace S_i, but are orthogonalized with respect to the stiffness matrix K. Hence, we have:
5 Splitting, Uncoupling, Projecting, and Orthogonalizing
Suppose that the first problem Ku_1 = f_1 has been solved in n PCG iterations, and that the N_K x n matrix S̄_1 associated with the Krylov subspace S_1 is readily available. Solving the second problem Ku_2 = f_2 is equivalent to solving:
If ℝ^{N_K} is decomposed as follows:
then the solution of problem (11) can be written as:
From Eqs. (11-13), it follows that problem (11) decouples into two minimization subproblems, and that u_2^0 is the solution of:
and v_2 is the solution of:
First, we consider the solution of problem (14). Since u_2^0 ∈ S_1, there exists a y_2^0 ∈ ℝ^{r_1} such that:
Substituting Eq. (16) into Eq. (14) leads to the following minimization problem:
whose solution y_2^0 is given by:
From Eq. (9), it follows that the system of equations (18) is diagonal. Hence, the components [y_2^0]_j of y_2^0 can be simply computed as follows:
Next, we turn to the solution of problem (15) via a PCG algorithm. Since the decomposition (13) requires v_2 to be K-orthogonal to u_2^0, at each iteration k, the search directions s_2^(k) must be explicitly K-orthogonalized to S̄_1. This entails the computation of modified search directions s̄_2^(k) as follows:
Except for the above modifications, the original PCG algorithm is unchanged. However, convergence is expected to be much faster for the second and subsequent problems than for the first one, because the supplementary space of S_1 and the subsequent supplementary spaces have smaller dimensions than ℝ^{N_K}, and a significant number of the solution components are included in the startup solutions of the form of u_2^0. In summary, once the first problem Ku_1 = f_1 has been solved in n PCG iterations and the N_K x n matrix S̄_1 associated with the Krylov subspace S_1 has been stored, the second problem Ku_2 = f_2 is solved in two steps as follows:
Step 1. K is projected onto S_1 and the resulting diagonal problem S̄_1^T K S̄_1 y_2^0 = S̄_1^T f_2 is trivially solved in N_K floating-point operations. Next, the partial solution u_2^0 = S̄_1 y_2^0 is formed. This partial solution u_2^0 is an optimal startup value for u_2 because: (a) it minimizes u^T K u / 2 - u^T f_2 over S_1 ⊂ ℝ^{N_K}, and (b) it is inexpensive to compute. Note that the r_1 non-zero entries of the diagonal matrix S̄_1^T K S̄_1 are automatically computed during the PCG solution of the first problem Ku_1 = f_1. Therefore, these entries can be stored and need not be recomputed.
Step 2. The basic PCG algorithm is applied to the solution of Ku_2 = f_2 after it is modified to: (a) accept u_2^0 as a startup solution, and (b) orthogonalize the search directions s_2^(k) and S̄_1 with respect to K.
The generalization to the case of N_rhs right hand sides of the two-step solution procedure described above goes as follows. Suppose that i* - 1 < N_rhs consecutive problems Ku_i = f_i, i = 1, ..., i* - 1 have been solved with a PCG algorithm modified as described above, and that the N_K x n_{i*-1} matrix S̄_{i*-1} associated with the K-orthogonalized agglomerated Krylov subspace S_{i*-1} has been stored. The i*-th problem Ku_{i*} = f_{i*} is solved in two steps. First, K is projected onto S_{i*-1}, and an optimal startup solution u_{i*}^0 = S̄_{i*-1} y_{i*}^0 is computed via the trivial solution of the diagonal
system S̄_{i*-1}^T K S̄_{i*-1} y_{i*}^0 = S̄_{i*-1}^T f_{i*}. Next, the PCG algorithm is applied to the solution of Ku_{i*} = f_{i*} with u_{i*}^0 as a startup solution, and all search directions s_{i*}^(k) are K-orthogonalized to S̄_{i*-1}. Clearly, the performance of the solution method proposed here depends on the performance of the preconditioner used in the conjugate gradient algorithm. However, it should be noted that from the decomposition (12) it follows that if after the PCG solution of some problem indexed by 1 < i* < N_rhs the dimension r_{i*} of the K-orthogonalized Krylov subspace S_{i*} becomes equal to the size of the problem N_K, the proposed solution method will converge in theory to a direct solver. In that case, each remaining problem Ku_i = f_i, i = i* + 1, ..., N_rhs will be solved directly (zero iterations) and economically as follows (see Eq. (10)):
For time dependent problems, the right hand sides are never random in practice. Therefore the superconvergence behavior outlined above will be reached well before the point where r_{i*} = N_K, and each of the PCG solutions of the remaining N_rhs - i* problems will converge in two or three iterations.
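The two-step procedure lends itself to a compact implementation. The sketch below is not the chapter's code: the K-orthonormalization of the stored directions (which makes the projected system the identity rather than merely diagonal), the Jacobi preconditioner and all function names are assumptions introduced for illustration.

import numpy as np

def pcg_with_recycling(K, f, W=None, tol=1e-10, max_iter=200):
    # W holds previously generated search directions, K-orthonormalized so
    # that W.T @ K @ W = I. Step 1: project onto span(W) to get the startup
    # solution. Step 2: PCG with every new search direction K-orthogonalized
    # against W; the new (K-normalized) directions are returned for reuse.
    n = len(f)
    d = np.diag(K).copy()                      # Jacobi preconditioner
    have_W = W is not None and W.shape[1] > 0
    u = W @ (W.T @ f) if have_W else np.zeros(n)   # Step 1: startup solution
    r = f - K @ u
    z = r / d
    p = z - W @ (W.T @ (K @ z)) if have_W else z.copy()
    rz = r @ z
    new_dirs = []
    for _ in range(max_iter):                  # Step 2: modified PCG
        if np.linalg.norm(r) <= tol * np.linalg.norm(f):
            break
        Kp = K @ p
        alpha = rz / (p @ Kp)
        u += alpha * p
        r -= alpha * Kp
        new_dirs.append(p / np.sqrt(p @ Kp))   # store K-normalized direction
        z = r / d
        rz_new = r @ z
        p = z + (rz_new / rz) * p
        rz = rz_new
        if have_W:
            p -= W @ (W.T @ (K @ p))           # keep p K-orthogonal to W
    S_new = np.column_stack(new_dirs) if new_dirs else np.zeros((n, 0))
    return u, S_new

# Usage for a sequence of right hand sides with the same K:
#   u1, S1 = pcg_with_recycling(K, f1)
#   u2, S2 = pcg_with_recycling(K, f2, W=S1)
#   u3, _  = pcg_with_recycling(K, f3, W=np.hstack([S1, S2]))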
REMARK 4. From Eqs. (11-20), it follows that if two right hand sides f_i and f_{i+1} are proportional, the solution of the problem with right hand side f_{i+1} is u_{i+1} = u_{i+1}^0 = S̄_i y_{i+1}^0, and therefore, this solution will be found in zero iterations.
6 Interface and Coarse Problems
Despite its elegance and simplicity, the methodology described in Section 5 can be impractical when applied to the global solution of the problems Ku_i = f_i, i = 1, ..., N_rhs. Indeed, during the PCG solution of the first few problems — that is, before superconvergence can be reached — the cost of the orthogonalizations implied by Eq. (20) can offset the benefits of convergence acceleration via the optimal startup solution and the modified search directions s̄_i^(k). Moreover, storing every search direction s_i^(k) and the corresponding matrix-vector product K s_i^(k) can significantly increase the memory requirements of the basic PCG algorithm. However, the proposed methodology is computationally feasible in a domain decomposition context because it is applied only to the coarse grid and/or the interface problem and therefore:
- the additional memory requirements entailed by the storage of the search directions s_i^(k) and the matrix-vector products K s_i^(k) are proportional to the reduced size of the coarse grid and/or the interface problem only.
- the cost of the orthogonalizations implied by Eq. (20) is negligible compared to the cost of the forward and backward substitutions that are required at each iteration k for the evaluation of the subdomain solutions.
REMARK 5. In practice, the number of search directions that are stored for orthogonalization is determined by the memory space that is left available after all other storage requirements of the structural analysis have been satisfied. When only a few directions can be stored, a partial orthogonalization procedure is performed. Finally, it should be noted that the solution methodology proposed herein involves essentially dot products and matrix-vector multiplications; therefore, it scales well in the fine granularity regime targeted by emerging parallel processors, whereas direct forward and backward substitutions do not.

7 Extension to Global/Local Analysis
When analyzing a complex structure, the component of main interest is often not known prior to a stress analysis. In such cases, a "global" analysis
is first performed, and the hot spot(s) of the structure is (are) identified. Next, a series of refined local models are constructed for the hot spots and an iterative global/local analysis is performed to accurately predict the behavior of the component of interest in the structure [17]. The first two iterations of this process are graphically depicted in FIG. 3.
FIG. 3. Global/local analysis.
Because of space limitations, we describe only the first two levels of a global/local analysis. The generalization to an arbitrary number of levels is straightforward. Let K_11, f_1, and u_1 denote respectively the stiffness matrix of the initial finite element global model of a given structure, the corresponding force vector, and the resulting displacement solution vector. Clearly, these quantities satisfy:

K_11 u_1 = f_1    (22)

If the finite element global model is hierarchically refined in a few locations and the subscript 2 is used to designate the resulting additional degrees of freedom (d.o.f.), the new refined displacement vector u_r can be written in partitioned form as:
where Δu is the displacement "enrichment" induced by the local refinements. The new equations of equilibrium can be written in matrix form as:
where K_22 is the stiffness matrix resulting from the local hierarchical refinement, Δf is the corresponding force correction, and K_12 is the coupling
stiffness matrix between the global and local models. Eqs. (22-24) can be combined to obtain:
Clearly, if the size of K_22 is small compared to that of K_11, as is usually the case in global/local analysis, it is reasonable to view Eqs. (22, 25) as two systems with two different right hand sides but two nearby — or similar — left hand sides, and to design a solution methodology for problem (25) that follows the ideas described in Section 5. Using the nomenclature introduced in Section 4 and the partitioning defined in Eqs. (23-25), the extension of the methodology presented in Section 5 for solving the coupled global/local problem (25) goes as follows. Suppose that the global problem K_11 u_1 = f_1 has been solved in r_1 PCG iterations, and that the matrix S̄_1 associated with the corresponding prolongated-by-zero Krylov subspace S_1 = { [s_1^(1)T 0]^T, ..., [s_1^(k)T 0]^T, ..., [s_1^(r_1)T 0]^T } is readily available. Solving the coupled global/local problem (25) is equivalent to solving:
If ℝ^{N_{K_r}} is decomposed as follows:
then the solution of problem (25) can be written as:
From Eqs. (26-28) it follows that problem (26) decouples, and that Δu^0 is the solution of:
and Δv is the solution of:
Since Δu^0 ∈ S_1, there exists a y^0 ∈ ℝ^{r_1} such that:
Substituting Eq. (31) into Eq. (29) leads to the following minimization problem:
which in view of Eq. (9) admits for solution:
The solution of problem (30) can be obtained via the modified PCG algorithm discussed in Section 5. Since the decomposition (28) requires Δv to be K_r-orthogonal to Δu^0, at each iteration k, the search directions s_2^(k) must be explicitly K_r-orthogonalized to S̄_1. This entails the computation of modified search directions s̄_2^(k) as follows:
Here again, K_11 s_1^(k) and the associated diagonal entries are computed during the solution of the first (global) problem. They can be stored and need not be recomputed during the second (global/local) problem. As in the case of systems with multiple right hand sides, one can reasonably expect that the solution of the refined problem will converge much faster than that of the global problem because S_1 has a smaller dimension than ℝ^{N_{K_r}}, and a significant amount of the physical behavior is captured by the optimal startup solution Δu^0.
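A compact way to set up this reuse is sketched below. The block assembly of the refined operator and the helper name are assumptions; W stands for the stored matrix of global search directions.

import numpy as np

def global_local_startup(K11, K12, K22, W, rhs_r):
    # Assemble the refined operator of the combined problem and reuse the
    # global Krylov space prolongated by zero, as described in Section 7.
    n2 = K22.shape[0]
    K_r = np.block([[K11, K12], [K12.T, K22]])         # refined stiffness matrix
    W_r = np.vstack([W, np.zeros((n2, W.shape[1]))])   # prolongation by zero
    # Startup solution: minimize the refined energy over span(W_r). Note that
    # W_r.T @ K_r @ W_r = W.T @ K11 @ W, so the projected system is the same
    # one already available from the global solve.
    y0 = np.linalg.solve(W_r.T @ K_r @ W_r, W_r.T @ rhs_r)
    return K_r, W_r, W_r @ y0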
8 Applications and Results
First, we consider the static solution of a two-dimensional fixed-pulled plane stress elasticity problem on a rectangular domain using the Finite Element Tearing and Interconnecting (FETI) domain decomposition method [11, 19] on an iPSC-860 parallel processor. The FETI method is numerically scalable. For elasticity problems, the two-norm condition number of its preconditioned interface problem grows asymptotically as κ2 = O(1 + log^2(H/h)), where H and h denote respectively the subdomain and mesh sizes [9]. Hence, if the problem size is kept constant and the number of subdomains is increased, the numerical performance of the FETI method does not degrade. On the other hand, if the problem size is increased proportionally to the number of processors (H/h is kept constant), the FETI method is theoretically capable of solving larger problems at constant CPU time. At each k-th FETI global PCG iteration, the following coarse problem [19] must be solved:
where G = [-Bi-Ri ... BSRS ... B w f R N f ] , Bs is a boolean matrix that extracts from a subdomain quantity its interface component, .Rs spans the null space of the stiffness matrix associated with subdomain s, Nf denotes the number of floating subdomains — that is, subdomains with Neumann boundary conditions only — and bk is related to the interface residual at the k-th global PCG iteration. The parallel solution of these repeated problems via the modified CG scheme presented in Section 5 uses the existing FETI parallel data structures and requires communication only between neighboring subdomains. Using 4-node plane stress elements, four uniform finite element models corresponding to 4, 16, 64, and 128 subdomains are constructed. Each model has a different size but the ratio Hjh is kept constant across all four of them. Since we wish to compare the performance results of the FETI method with that of a direct skyline solver, the constant subdomain size is designed such that the optimally renumbered skyline of the global model can be stored on the iPSC-860 computer. The performance results obtained for these problems are summarized in TABLE 3, where Ns, NEQ, NEQC, Nftr, Tc, ^tr KT/ ' TFETI, and TSKY denote, respectively, the number of subdomains, the number of equations generated by the finite element model, the size of the FETI coarse problem, the total number of CG iterations for solving all FETI coarse problems, the total CPU time spent in the coarse solver, the total number of FETI global PCG iterations, the total CPU time of the FETI solver, and the CPU time of a highly optimized parallel skyline solver [5] (CPU time for the factorization phase only). The number of processors Np is set equal to Ns. The average column height b of each skyline matrix is indicated between parenthesis under TSKY. TABLE 3 Two-dimensional elasticity problem on an iPSC-860/128 Iterative FETI solver (\\Ku - f\\2 = 10-~ 3 ||/|J2) v.s. direct skyline solver
N_p = N_s   NEQ       NEQc   N_itr^c   T_c       N_itr^FETI   T_FETI    T_SKY
4           3,200     6      6         0.14 s.   10           3.41 s.   5.36 s.     (b = 106)
16          12,800    36     36        0.63 s.   16           4.79 s.   29.58 s.    (b = 212)
64          51,200    168    168       2.38 s.   17           6.92 s.   209.82 s.   (b = 425)
128         102,400   360    360       4.80 s.   18           9.29 s.   > 600.00 s. (b = 376)
Clearly, the performance results reported in TABLE 3 demonstrate the efficiency and superiority of the FETI method, highlight its combined numerical/parallel scalability, and confirm the parallel non-scalability of the skyline solver predicted in Section 1. Because the successive right hand sides
of the coarse problems are random vectors and, more importantly, full precision is required for the solution of the coarse problem, the modified CG solver converges in NEQc iterations during the solution of the first coarse problem, and in zero iterations during the solution of each subsequent one. The reader should note that the solution of the coarse problems via a direct solver requires the explicit evaluation of the matrix G^T G (see Eq. (35)). Even though the size of this matrix is small, its evaluation is computationally expensive — and cumbersome to implement. For example, for the fourth finite element problem (NEQ = 102,400) and 128 processors, the parallel evaluation of G^T G consumes 7.74 s., while its factorization consumes only 1.02 s. For the same problem and the same number of processors, the set-up of G and the CG solution of all coarse problems consume only 4.80 s., which demonstrates the efficiency of the proposed parallel coarse grid solver. Note also that the explicit evaluation of G^T G and its factorization destroy the subdomain-by-subdomain nature of the parallel computations and are quite cumbersome to implement. Moreover, for complex problems such as plate and shell structures, G^T G can reach an unacceptable size.

Next, we apply the methodology described in this paper to the solution of repeated systems arising from the linear transient analysis, using an implicit time-integration scheme, of the three-dimensional stiffened wing of a High Speed Civil Transport (HSCT) aircraft (FIG. 4). The structure is modeled with 6,204 triangular shell elements and 456 beam elements, and includes 18,900 d.o.f. The finite element mesh is partitioned into 32 subdomains with excellent aspect ratios using TOP/DOMDEC [20]. The size of the interface problem is 3,888 — that is, 20.57% of the size of the global problem. The transient analysis is carried out on a 32-processor iPSC-860 system. After all of the usual finite element storage requirements are allocated, there is enough memory left to store a total number of 360 search directions. This number corresponds to 9.25% of the size of the interface problem. Using a transient version of the FETI method without a coarse grid [20], the system of equations arising at the first time step is solved in 30 iterations and 7.75 seconds CPU. After 5 time steps, 89 search directions are accumulated and only 10 iterations are needed for solving the fifth linear system of equations (FIG. 5). After 45 time steps, the total number of accumulated search directions is only 302 — that is, only 7.76% of the size of the interface problem, and superconvergence is triggered: all subsequent time steps are solved in 2 or 3 iterations (FIG. 5) and in less than 0.78 second CPU (FIG. 6). When the parallel skyline solver is applied to the above problem, the factorization phase consumes 60.5 seconds CPU, and at each time step the pair of forward/backward substitutions requires 10.65 seconds on the same 32-processor iPSC-860. Therefore, the proposed solution methodology is clearly an excellent alternative to repeated forward/backward substitutions
on distributed memory parallel processors.
FIG. 4. HSCT stiffened wing.
9 Closure

In this chapter, we have presented a methodology for extending the range of applications of domain decomposition based iterative methods to problems with multiple or repeated right hand sides. Such problems arise, for example, in multiple load static analyses, in implicit linear dynamics, in the solution of nonlinear problems via a quasi-Newton scheme, in eigenvalue problems, in global/local analysis, and in many other structural computations. We have formulated the global problem as a series of minimization problems over K-orthogonal and supplementary subspaces, and have tailored the preconditioned conjugate gradient algorithm to solve them efficiently. The resulting solution method is scalable in the fine granularity regime targeted by emerging parallel processors, whereas direct factorization schemes and forward and backward substitution algorithms are not. We have illustrated the proposed methodology with the solution of static and dynamic structural problems, and have highlighted its potential to outperform forward and backward substitutions on parallel computers. The proposed methodology
enhances the versatility of domain decomposition based iterative algorithms.
FIG. 5. Convergence rate history.
FIG. 6. CPU history.
References
[1] A. George and J. W. H. Liu, Computer solution of large sparse positive definite systems, Prentice-Hall, New Jersey, 1981.
[2] I. S. Duff, A. M. Erisman and J. K. Reid, Direct Methods for Sparse Matrices, Clarendon Press, Oxford, 1986.
[3] E. Wilson and H. Dovey, Solution or reduction of equilibrium equations for large complex structural systems, Adv. Engrg. Soft., 1 (1978), pp. 19-25.
[4] C. Farhat and E. Wilson, A parallel active column equation solver, Comput. & Struc., 28 (1988), pp. 289-304.
[5] C. Farhat and F. X. Roux, Implicit parallel processing in structural mechanics, Computational Mechanics Advances, 2 (1994), pp. 1-124.
[6] R. Glowinski, G. H. Golub, G. A. Meurant and J. Periaux (eds.), First international symposium on domain decomposition methods for partial differential equations, SIAM, Philadelphia, 1988.
[7] D. E. Keyes, Domain decomposition: a bridge between nature and parallel computers, in A. K. Noor, ed., Adaptive, Multilevel and Hierarchical Computational Strategies, ASME, AMD-Vol. 157 (1992), pp. 293-334.
[8] J. Mandel, Balancing domain decomposition, Comm. Appl. Num. Meth., 9 (1993), pp. 233-241.
[9] C. Farhat, J. Mandel and F. X. Roux, Optimal convergence properties of the FETI domain decomposition method, Comput. Meths. Appl. Mech. Engrg., 115 (1994), pp. 367-388.
[10] C. Farhat, A Lagrange multiplier based divide and conquer finite element algorithm, J. Comput. Syst. Engrg., 2 (1991), pp. 149-156.
[11] C. Farhat and F. X. Roux, A method of finite element tearing and interconnecting and its parallel solution algorithm, Internat. J. Numer. Meths. Engrg., 32 (1991), pp. 1205-1227.
[12] B. N. Parlett, A new look at the Lanczos algorithm for solving symmetric systems of linear equations, Lin. Alg. Appl., 20 (1980), pp. 323-346.
[13] Y. Saad, On the Lanczos method for solving symmetric linear systems with several right-hand sides, Math. Comp., 48 (1987), pp. 651-662.
[14] C. Farhat, L. Crivelli and F. X. Roux, Extending substructure based iterative solvers to multiple load and repeated analyses, Comput. Meths. Appl. Mech. Engrg., 117 (1994).
[15] P. Fischer, Projection techniques for iterative solution of Ax=b with successive right-hand sides, ICASE Rep. No. 93-90, NASA CR-191571.
[16] C. C. Jara-Almonte and C. E. Knight, The specified boundary stiffness/force SBSF method for finite element subregion analysis, Internat. J. Numer. Meths. Engrg., 26 (1988), pp. 1567-1578.
[17] J. D. Whitcomb, Iterative global/local finite element analysis, Comput. & Struc., 40 (1991), pp. 1027-1031.
[18] K. Guru Prasad, D. E. Keyes and J. H. Kane, GMRES for sequentially multiple nearby systems, submitted to SIAM J. Sci. Comp.
[19] C. Farhat, A saddle-point principle domain decomposition method for the solution of solid mechanics problems, in D. E. Keyes, T. F. Chan, G. A. Meurant, J. S. Scroggs and R. G. Voigt, eds., Proc. Fifth SIAM Conference on Domain Decomposition Methods for Partial Differential Equations, SIAM (1991), pp. 271-292.
[20] C. Farhat, S. Lanteri and H. Simon, TOP/DOMDEC, a software tool for mesh partitioning and parallel processing, J. Comput. Sys. Engrg., in press.
[21] L. Crivelli and C. Farhat, Implicit transient finite element structural computations on MIMD systems: FETI vs. direct solvers, AIAA Paper 93-1310, AIAA 34th Structural Dynamics Meeting, La Jolla, California, April 19-21, 1993.
Chapter 10
Parallel Implementation of a Domain Decomposition Method for Non-Linear Elasticity Problems
François-Xavier Roux
ONERA, Division Calcul Parallèle, Châtillon, France
Abstract
This paper reports experiments with the parallel implementation, on a distributed memory parallel system, of a domain decomposition method for solving nonlinear elasticity problems. The Newton algorithm is used for the nonlinearity. At each iteration of the Newton algorithm, a linearized problem has to be solved. The solution of the linearized problems is computed via the dual Schur complement method. The dual Schur complement method is accelerated via a preconditioning procedure using the direction vectors computed for the solution of the previous linearized problems.
1 Introduction
The nonlinear elasticity equations are most commonly solved through Newton iterative procedures that require the solution of a linearized problem at each iteration. As the tangent matrix of the linearized problem needs to be updated at regular intervals, each tangent matrix is used for only a small number of right hand sides. In such a situation, using domain decomposition based iterative solvers has proven to be reliable and very efficient, from both the numerical and the parallel implementation points of view [4]. Furthermore, domain decomposition techniques allow the storage of the successive direction vectors built for solving each linearized problem, which can be used for designing efficient preconditioners for the linear problems of subsequent steps. Hence, the efficiency of the domain decomposition approach is improved. This paper is organized as follows: in Section 2, the formulation and discretization of the nonlinear elasticity equations are briefly recalled. Section 3 is devoted to the presentation of the dual Schur complement method and addresses parallel implementation issues. In Section 4, an acceleration procedure based upon a reconjugation technique is introduced for the
solution of successive linearized problems with the same tangent matrix. This procedure is extended in section 5 to the case of modified tangent matrices and implementation results are given.
2 Solution of nonlinear elasticity equations
2.1 Governing equations
The equilibrium equations for a body Ω undergoing large deformations can be written in the weak form:
where Tr is the trace, T(x) is the first Piola-Kirchhoff tensor, v is any admissible displacement field, f(x) is the density of body forces, and g(x) is the density of surface tractions. For compressible materials, the constitutive law takes the following form:
where W(F) is the specific internal elastic energy. An example is the Saint-Venant-Kirchhoff law:
where F(x) = Id + ∇u(x) is the deformation gradient, u(x) being the displacement field and Id the identity operator. λ and μ correspond to the usual Lamé constants of linear elasticity, which characterize the mechanical properties of the material. See, for instance, [1] for a thorough presentation of these formulations. This problem can be interpreted as a nonlinear fixed point problem:
This problem is usually discretized through a finite element method.
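Recall, to fix notation, that the Saint-Venant-Kirchhoff energy mentioned above is commonly written as W(F) = (λ/2)(Tr E)^2 + μ Tr(E^2), where E = (F^T F - Id)/2 is the Green-Lagrange strain tensor; this is the standard statement of the law and involves only the quantities already introduced.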
2.2 Solution of the nonlinear equation by the Newton method
A classical solution method for the discretized nonlinear problem (4) is the Newton method, which consists of iteratively computing the solution of the fixed point problem (4) according to the following procedure:
In a matrix form, the Newton algorithm can be written:
where the tangent matrix K_T is given by:
There are many variants of Newton-like methods. If the problem is not too stiff, a quasi-Newton method, which consists of updating the tangent matrix only every q iterations, can be used. On the other hand, if the problem is stiff, it can be necessary to perform an incremental loading to avoid the breakdown of the method. Incremental loading can be improved with the arc-length continuation method, which adjusts the increment according to the previous Newton iteration. An even more robust method is the so-called bordering algorithm [6]. All these methods have in common the solution of successive linearized problems with a tangent matrix that changes for each new right hand side or every q steps, q remaining a small number in any case. So, as the tangent matrix is often updated, using robust iterative solvers such as domain decomposition methods for the solution of the linearized problems, instead of global direct solvers, is a suitable strategy.
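Schematically, a quasi-Newton loop with the tangent matrix refreshed only every q iterations can be organized as follows (a minimal NumPy sketch; the routines residual and tangent_matrix are problem-dependent placeholders, and a direct solve stands in for the dual Schur complement method used in the sequel):

    import numpy as np

    def quasi_newton(residual, tangent_matrix, u0, q=3, tol=1e-6, max_it=50):
        # Rebuild the tangent matrix only every q iterations; in between,
        # the same matrix is reused for the successive right hand sides.
        u = u0.copy()
        K_T = None
        for it in range(max_it):
            r = residual(u)
            if np.linalg.norm(r) <= tol:
                return u, it
            if it % q == 0:
                K_T = tangent_matrix(u)
            u = u + np.linalg.solve(K_T, -r)
        return u, max_it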
3 Solution of the linearized problem with a domain decomposition method
3.1 Principle of the dual Schur complement method
For simplicity of notation, let us introduce the dual Schur complement method for the Poisson equation. If the computational domain Ω is split into two subdomains Ω_1 and Ω_2 with interface Γ_3, then u is the solution of the Poisson equation:
if and only if the restriction of u in each subdomain is the solution of the Poisson equation:
and satisfies the continuity relations:
Most domain decomposition methods consist in performing fixed point iterations on the matching conditions (10). The local equations (9) are well posed if either u or ∂u/∂n is fixed on the interface. The primal Schur complement method consists in finding, via an iterative scheme, the value of the trace u_3 of the solution along the interface. Given an approximate value of this trace, u_3, the local equations are well posed:
The interface residual is equal to the disequilibrium of the fluxes:
The trace of the solution on the interface, u_3, is then updated in order to decrease the interface residual. At convergence, this disequilibrium vanishes, and so the solutions of the local problems (11) are the restrictions of the solution of the global problem (8). Therefore, the primal Schur complement method consists in finding, through an iterative procedure, the value u_3 of the trace of the field along the interface for which the solutions of the local problems (11) with Dirichlet boundary conditions on the interface Γ_3 have matching normal derivatives. The dual Schur complement method consists in finding, through an iterative procedure, the value λ of the normal derivative of the fields along the interface for which the solutions of the local problems with Neumann boundary conditions:
satisfy the matching condition:
The ± sign indicates that the outer normal derivatives of Ω_1 and Ω_2 are opposite to each other. Given λ_p, an approximate value of λ, the u_i are the solutions of the local Neumann problems (13), and the residual of the condensed interface dual Schur complement operator is:
The fixed point iterations consist in updating λ_p in order to decrease the jump along the interface, g_p.
3.2 Discretization
If K_1 and K_2 are the stiffness matrices obtained for a finite element discretization of the linear elasticity equations on Ω_1 and Ω_2 with Neumann boundary conditions on the interface Γ_3 (prescribed density of surface tractions ±λ), and if B_1 and B_2 are the trace operators over Γ_3 of displacement fields defined upon Ω_1 and Ω_2, the dual Schur complement method derives from the hybrid formulation of the problem:
By substitution of u_1 and u_2 given by the first two equations into the third one, the condensed interface problem for λ is:
For a given approximation λ_p of the interaction forces between subdomains along their interface, the residual of the condensed interface problem:
is equal to the jump of the solutions u_i of the local Neumann problems:
The condensed interface problem can be solved by the conjugate gradient algorithm. At each iteration, the local Neumann problems are solved by a direct method, and then the jump of the local solutions along the interfaces is computed. See [4] for a more detailed presentation of the method.
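The structure of one iteration can be sketched as follows for two subdomains (dense NumPy placeholders K1, K2, B1, B2 for the local stiffness matrices and interface trace operators; the local matrices are assumed invertible, and the production implementation of [4] is organized differently):

    import numpy as np

    def make_dual_operator(K1, K2, B1, B2):
        # One application of the condensed operator: two local Neumann
        # solves, then the jump of the local solutions along the interface.
        def apply_D(lam):
            u1 = np.linalg.solve(K1, B1.T @ lam)
            u2 = np.linalg.solve(K2, -(B2.T @ lam))
            return B1 @ u1 - B2 @ u2
        return apply_D

    def interface_cg(apply_D, g0, tol=1e-8, max_it=500):
        # Plain conjugate gradient on the condensed interface problem.
        lam = np.zeros_like(g0)
        r, w = g0.copy(), g0.copy()
        for _ in range(max_it):
            Dw = apply_D(w)
            rho = (r @ r) / (w @ Dw)
            lam += rho * w
            r_new = r - rho * Dw
            if np.linalg.norm(r_new) <= tol * np.linalg.norm(g0):
                break
            w = r_new + ((r_new @ r_new) / (r @ r)) * w
            r = r_new
        return lam

Only apply_D touches the subdomain data; the iteration itself works exclusively with interface-sized vectors.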
3.3 Parallel implementation for linear elasticity problems
This kind of method is very well suited for parallel implementation on distributed memory MIMD machines with a message passing programming environment. Given a splitting of the computational domain, each subdomain Ω_i is allocated to one processor. The description of the subdomain is the classical description of a single computational domain for any standard finite element code. The only nonstandard data are related to the description of the interface. For each interface between Ω_i and another subdomain Ω_j, the list of the nodes of Ω_i that belong to this interface must be known. In fact, such a list can be described in the same way as a classical boundary area associated with Dirichlet or Neumann boundary conditions, except that it corresponds to a new "interface" type of boundary. Each processor can assemble and factorize the local stiffness matrix associated with its subdomain, as well as the local right-hand side arising from the prescribed displacements or external forces. The only non-local operations are related to the computation of the jump of the local fields along the interface. The practical implementation consists in gathering in each processor the values of the local field v_i on each interface Γ_ij = ∂Ω_i ∩ ∂Ω_j, sending them to the processor treating subdomain Ω_j, then receiving the data arriving from the neighbouring subdomains and computing for each interface the jump between "inner" (local) and "outer" (received) values. Thus, just a purely local description of the interfaces is required. The values of the jumps are computed at the same time for all subdomains, with opposite signs for the jumps computed within two neighbouring subdomains Ω_i and Ω_j along their interface Γ_ij. This means that each processor computes the jump according to the outer normal derivative of its assigned subdomain, and thus that it computes the interaction forces λ on its interfaces as external forces. So, there is no need to define a global orientation of the normal vectors on the interfaces: each subdomain considers its natural outer normal.

The solution of the global problem through the dual Schur complement method goes as follows. Initialization, computation of the local initial field:
Computation of the initial residual, equal to the jump of the local solution fields along the interfaces:
Computation of the starting direction vector:
Given the values of the fields λ_p, g_p and w_p, iteration p + 1 consists of the following steps. Computation of the local field satisfying the Neumann problem:
Computation of the product of w by the dual Schur complement matrix D:

Updating of λ, g and w:
Updating of the local displacement fields:
Most of these computations are purely local and therefore parallelizable. The computation of the jumps along the interface (24) requires data transfers, according to the scheme presented above: each processor sends its contributions on its interfaces to the processors in charge of its neighbouring subdomains, then receives their contributions, and computes the jump, equal to the inner contribution minus the outer contribution on each interface. These data transfers are made according to the topology of the splitting of the domain. They just require gather operations and node-to-node transfers. The second kind of data transfers are related to the computation of the descent and reconjugation coefficients ρ and γ in (25). Each processor computes the associated dot products for its own interfaces, and then the local contributions are globally assembled and broadcast. These data transfers involve all the processors, independently of the topology of the domain splitting. They can be realized by calling the global data transfer functions available in any message passing library, which are optimized for the network topology of the target machine. So, the programming effort for the parallel implementation of this method with a message passing programming environment is very small. In fact, it is
even more natural than in a shared memory environment. With a distributed memory environment, each processor runs the same code, which is essentially a classical sequential mono-domain code. The only nonstandard routines are the computation of the jumps along the interfaces, and the global assembly of the local contributions to the dot products. The real implementation problem lies in the preprocessing phase: splitting an irregular mesh into balanced subdomains with a minimum number of interfaces is not easy [3].
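The two nonstandard communication kernels just mentioned can be sketched as follows (here with mpi4py, which is only a convenient stand-in for the message passing library of the target machine; the layout of the interface arrays is hypothetical):

    import numpy as np
    from mpi4py import MPI

    def interface_jumps(local_traces, comm):
        # local_traces: dict mapping a neighbour rank to the array of local
        # ("inner") interface values shared with that neighbour.
        outer = {nbr: np.empty_like(v) for nbr, v in local_traces.items()}
        reqs = []
        for nbr, inner in local_traces.items():
            reqs.append(comm.Isend(inner, dest=nbr))
            reqs.append(comm.Irecv(outer[nbr], source=nbr))
        MPI.Request.Waitall(reqs)
        # jump = inner values minus received (outer) values, per interface
        return {nbr: local_traces[nbr] - outer[nbr] for nbr in local_traces}

    def global_dot(x_local, y_local, comm):
        # Local contribution to a dot product, then a global sum over all
        # processors (used for the descent and reconjugation coefficients).
        return comm.allreduce(float(x_local @ y_local), op=MPI.SUM)

The first kernel involves only neighbour-to-neighbour transfers dictated by the domain splitting; the second is a single global reduction, independent of that topology.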
4 Acceleration of the linear iteration via reconjugation
4.1 Principle of the restarting procedure
For solving the linear system of equations Ax = b, the conjugate gradient algorithm consists in computing a set of direction vectors that are conjugate with respect to the dot product associated with the matrix A, and that generate the Krylov space Span{g^0, Ag^0, ..., A^{p-1}g^0}, where g^0 is the initial residual Ax^0 - b. The approximate solution x^p at iteration p minimizes the A-norm of the error (Ax^p - b, x^p - x) over the space x^0 + Span{g^0, Ag^0, ..., A^{p-1}g^0}.
Then, the iterations of the conjugate gradient algorithm can start from the optimal starting point x^0_start. But the application of the standard conjugate gradient algorithm does not ensure that the new direction vectors are conjugate to the vectors w^i. To enforce these additional conjugacy relations, the new direction vector d^j at iteration number j must be reconjugated to the vectors w^i through the following procedure:
where g^j is the gradient vector at iteration j, g^j = Ax^j - b. This is equivalent to performing the new iterations of the conjugate gradient algorithm in the subspace orthogonal, with respect to the dot product associated with A, to Span{w^1, w^2, ..., w^p}. Then the algorithm is optimal in the sense that the actual dimension of the problem to be solved by the conjugate gradient algorithm is now equal to the dimension of A minus p. In practice, the set of conjugate vectors (w^i) is built by accumulating the direction vectors computed for the solution of all previous problems. In order to make the method more robust by enforcing all the vectors to be
actually conjugate, a complete reconjugation procedure is performed:
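The restarting and complete reconjugation procedure can be summarized by the following sketch (plain NumPy, the matrix A standing for the condensed interface operator; the list of stored directions returned by one call is meant to be passed back in for the next right hand side):

    import numpy as np

    def restarted_cg(A, b, dirs, tol=1e-8, max_it=200):
        # dirs: list of triples (w, A @ w, (w, A w)) accumulated from the
        # solution of previous systems with the same matrix (may be empty).
        x = np.zeros(b.size)
        r = b.copy()
        for w, Aw, d in dirs:            # optimal starting point over span of the w's
            x += ((w @ r) / d) * w
        r = b - A @ x
        for _ in range(max_it):
            if np.linalg.norm(r) <= tol * np.linalg.norm(b):
                break
            s = r.copy()
            for w, Aw, d in dirs:        # complete reconjugation against every
                s -= ((Aw @ s) / d) * w  # previously computed direction
            As = A @ s
            rho = (s @ r) / (s @ As)
            x += rho * s
            r -= rho * As
            dirs.append((s, As, s @ As))
        return x, dirs

For a subsequent right hand side with the same matrix, the returned list dirs is simply passed back in, so that the earlier directions are reused both for the starting point and for the reconjugation.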
4.2 Application to domain decomposition methods
The reconjugation technique presented in the previous section is applicable in principle to any conjugate gradient like method. But the standard application of these methods is to large sparse linear systems. For such problems, the cost, in both computing time and memory requirement, of the reconjugation procedure makes it impracticable. Applying this procedure requires the storage of the direction vectors (w^i) and of their products by the matrix (Aw^i). The number of arithmetic operations required by the reconjugation step (29) is 4np, where n is the dimension of the matrix and p the number of stored direction vectors, because p dot products (g^j, Aw^i) and p sums of a scalar times a vector must be computed. For sparse linear systems arising from the finite element discretization of linear elasticity problems, the number of nonzero entries per row of the matrix is generally a few tens. This means that if the number of stored directions for the reconjugation is more than a few tens, the reconjugation technique will cost more, in memory requirement and in number of operations, than all the other steps of the conjugate gradient algorithm. The situation when applying this technique to a domain decomposition method is quite different. In this case, the dimension of the problem to which the conjugate gradient iterations apply is just the dimension of the interface. Furthermore, each iteration of the domain decomposition method requires not only a sparse matrix-vector product, but the solution, through a forward-backward substitution, of all the local problems (24). So the ratio of the number of matrix components stored and the number of arithmetic operations per iteration over the dimension is much larger for the condensed interface problem arising from a domain decomposition method than for a classical sparse global solver. Hence, the relative cost of reconjugation remains low when applied to a domain decomposition method [7], [2]. Nevertheless, in the case of a parallel implementation on a distributed memory machine, with large numbers of processors and of subdomains, the global number of interface nodes tends to be non-negligible. The implementation introduced in Section 3.3 must be adapted in order to avoid redundant storage and computation. For the computation of the descent and reconjugation coefficients ρ and γ in (25), the simplest method consists in computing the associated dot products in each processor, for all the interface nodes of the subdomain, and then assembling globally all the local contributions. Once these coefficients are known, the updating of λ, g or w can be made independently in each processor.
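The order of magnitude of the trade-off described above can be checked with a back-of-the-envelope count (illustrative numbers only): one sparse matrix-vector product costs about 2zn operations for z nonzero entries per row, against 4np for the reconjugation step, so reconjugation starts to dominate once p exceeds about z/2 stored directions.

    n, z = 100_000, 50                # unknowns and nonzeros per row (illustrative)
    matvec_flops = 2 * z * n
    for p in (10, 25, 100):           # number of stored direction vectors
        print(p, (4 * n * p) / matvec_flops)   # ratio exceeds 1 once p > z/2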
The dot product procedure described above requires the contribution of each interface node to be computed and stored twice, once for each subdomain connected through the corresponding interface. But it also reduces the number of communications: no interface vectors have to be transferred. As communicating is inevitably more expensive than computing on a distributed memory parallel machine, it is more efficient in practice to perform an operation twice than to transfer the result. However, for the reconjugation procedure, the amount of redundant arithmetic operations and storage becomes very large if the same procedure is applied. A simple way of avoiding this excess cost consists in splitting each interface Γ_ij in half. The processor allocated to subdomain Ω_i computes and stores the direction vectors for the first half of the interface nodes, and the processor allocated to subdomain Ω_j assumes the second half. Once each processor has computed the reconjugated gradient (29) for its allocated half of each interface, the two halves are exchanged between the neighbouring processors.
4.3 Application to the quasi-Newton algorithm
This technique has been applied to quasi-Newton iterations for solving the nonlinear elasticity equations for a cylinder made of a slightly compressible elastomer with a clamped bottom and a prescribed compression on the top. The problem is discretized using quadratic three-dimensional finite elements, with a global number of degrees of freedom equal to 4821. The mesh is split into 4 subdomains. The number of iterations for the dual Schur complement method at each iteration of the quasi-Newton algorithm with the restarting procedure by reconjugation is indicated in Table 1. The stopping criterion for the conjugate gradient method at each nonlinear iteration is 10^-3, and the global stopping criterion for the nonlinear iteration is 10^-6. This difference between the stopping criteria is allowed by the fact that the Newton method has quadratic asymptotic convergence, and that the solution of the linearized problem is just an increment of the approximate solution.

Table 1: Application to the quasi-Newton method

quasi-Newton iteration   Number of c.g. iterations
1                        32
2                        25
3                        9
4                        1
5                        1
6                        1
7                        1
Without the restarting procedure, the number of required iterations is the same, 32, for each solution. So the method is quite efficient. Its main limitation lies in the fact that the quasi-Newton algorithm is slower
and less robust than the plain Newton algorithm, with the tangent matrix being updated at each step. For the present test case, for instance, the number of iterations for the Newton algorithm is only 4, instead of 7.
5 Extension to the case of an updated tangent matrix at each Newton iteration
5.1 Interpretation of the reconjugation procedure as a preconditioner
In the case where the tangent matrix is modified, the reconjugation procedure cannot be applied. Nevertheless, the method can be generalized as a preconditioning technique. Consider the linear system of equations Ax = b, with a preconditioner M, and a given set of conjugate directions (w^i), 1 ≤ i ≤ p. A preconditioner is an approximate inverse of the matrix A. Preconditioning the conjugate gradient algorithm consists in minimizing the residual over the space x^0 + Span{Mg^0, MA(Mg^0), ..., (MA)^{p-1}(Mg^0)} [5]. Then the convergence of the method is driven by the condition number of the matrix MA, and is accelerated when this matrix is almost equal to the identity matrix. Practically, using the preconditioner M consists in building the new direction vector d^j at iteration j by conjugation of the preconditioned gradient Mg^j:
Now, the set of conjugate directions (w^i) can be used to improve the preconditioner M by building a new preconditioned vector:
such that M_new g^j is a better approximation of A^{-1}g^j than Mg^j. As there are only p parameters (γ_i), the best such approximation satisfies:

Equations (31) and (32) give the following formula:
In this case, the preconditioning technique leads to nothing other than the reconjugation procedure (29) applied to the conjugate gradient algorithm with preconditioner M, because the dot products (g^j, w^i) are equal to zero, as:
and the vectors g^0 and (d^k) are conjugate to the vectors (w^i) if the restarting and reconjugation procedures are applied. In fact, this orthogonality relation between the gradient vector and the previous direction vectors is a standard property of the conjugate gradient algorithm.
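In code, the improved preconditioner can be sketched as follows (a schematic NumPy version; apply_M, the stored directions W and their products AW are placeholders, and the coefficients are chosen so that the residual A(M_new g) - g is orthogonal to every stored direction, in the spirit of the construction above):

    import numpy as np

    def improved_preconditioner(apply_M, W, AW):
        # W:  columns are the stored conjugate directions w^i
        # AW: the products A @ W (or B @ W in the modified-matrix case of 5.2)
        d = np.einsum('ij,ij->j', W, AW)     # (w^i, A w^i)
        def apply_Mnew(g):
            z = apply_M(g)
            gamma = (W.T @ g - AW.T @ z) / d
            return z + W @ gamma
        return apply_Mnew

When the dot products (g, w^i) vanish, as they do for the gradients produced by the reconjugated algorithm, the formula reduces to the reconjugation of the preconditioned gradient, which is the equivalence stated above.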
5.2 Extension to the case where the matrix changes
The main interest of this interpretation of the reconjugation procedure as a preconditioning technique lies in the fact that it can be extended to the case where the direction vectors (w^i) are conjugate for a matrix B that is different from the matrix A. In this case, the improved preconditioner can then be computed as in equation (31), with coefficients computed using the matrix B as follows:
This relation implies that M_new g^j is a better approximation of B^{-1}g^j than Mg^j. Of course, this relation will not necessarily improve the convergence if the matrices B and A are completely different. But in the case where these matrices are tangent matrices computed at successive iterations of a Newton algorithm applied to nonlinear elasticity problems, they are likely to be similar. This is certainly the case asymptotically. It can also be due to the fact that the nonlinear behaviour of the structure can be limited to some regions, or does not necessarily modify the whole spectrum of the stiffness matrix. This is particularly useful if it does not modify the lower end of the spectrum, as the reconjugation technique applied to the dual Schur complement method is very efficient at capturing the low frequency phenomena [8]. Of course, a starting procedure may be defined in the same way as for the case where the matrix does not change. The optimal starting point is given by the formula:
These starting and preconditioning techniques have been tested for the solution of the linear problems associated with the updated tangent matrices arising from the application of the Newton algorithm to the nonlinear problems introduced in Section 4.3. The iteration counts for solving the successive linear problems with and without applying them are presented in Table 2.
Table 2: Extension to the Newton method

Newton iteration   c.g. iterations   preconditioned c.g. iterations
1                  32                32
2                  47                35
3                  47                12
4                  48                1

With a refined mesh, the figures are as in Table 3.

Table 3: Results with a refined mesh

Newton iteration   c.g. iterations   preconditioned c.g. iterations
1                  43                43
2                  58                45
3                  66                26
4                  66                3
The reduction of the number of iterations is not as sharp as in the case where the matrix does not change, but the results are still very satisfactory.
5.3 Some remarks
The preconditioning operator built via the reconjugation procedure presented here is not guaranteed to be symmetric when the matrix changes. In theory, such a preconditioner cannot be employed for the conjugate gradient algorithm. In practice, it can be employed, provided that the standard complete reconjugation procedure is performed for the direction vectors computed for each new matrix and right hand side. In any case, this complete reconjugation is mandatory for avoiding a loss of orthogonality that can cause the breakdown of the method, whether the matrix changes or not. In the case of successive different matrices, each subset of direction vectors forms a conjugate set of vectors for its associated tangent matrix. The preconditioning procedure cannot be applied with the whole set at once, but recursively, subset by subset. Each subset of direction vectors is associated with its own matrix, allowing an improvement of the preconditioner built with the previous subsets. In the same way, this technique can be used in conjunction with any other preconditioner, such as the "lumped" interface preconditioner presented in [4]. Last but not least, the application of this preconditioning technique in a distributed memory environment is as easy as the application of the reconjugation technique in the case where the matrix does not change. The only slight difference lies in the recursive procedure, which requires retention of the dimensions of the various subsets of direction vectors.
6 Conclusions
Domain decomposition methods are special iterative methods, because the relative dimension of the condensed interface problem is small compared to the global memory requirements associated with the factorization of the local stiffness matrices. This makes it possible to retain a large amount of information concerning the convergence history of the iterations. With conjugate gradient-like methods, all the information lies in the direction vectors built for minimizing the residuals. These direction vectors allow a very fast computation of the inverse of the operator projected onto the subspace generated by these vectors. This approximate inversion can also be used as a preconditioner in the case of Newton iterations with an updated tangent matrix at each iteration. In this case, the use of a direct global solver would require a new factorization at each step. So the domain decomposition approach for solving the linearized problems at each step of a Newton procedure presents the best features of direct and iterative solvers. Furthermore, it is very easy to parallelize with a message passing programming environment. Nevertheless, the reconjugation technique performs less well for large numbers of subdomains [8]. This advocates the use of several levels of parallelism: at the subdomain level first, with small numbers of subdomains, and within one subdomain, at a cluster level.
References
[1] P. G. Ciarlet, Mathematical Elasticity, North-Holland, Amsterdam, New York, 1988.
[2] C. Farhat, Optimizing substructuring methods for repeated right hand sides, scalable parallel coarse solvers, and global/local analysis, in these proceedings.
[3] C. Farhat and M. Lesoinne, Automatic partitioning of unstructured meshes for the parallel solution of problems in computational mechanics, Internat. J. Numer. Meths. Engrg., 36 (1993), pp. 745-764.
[4] C. Farhat and F.-X. Roux, An unconventional domain decomposition method for an efficient parallel solution of large-scale finite element systems, SIAM J. Sci. Stat. Comput., 13 (1992), no. 1, pp. 379-396.
[5] G. H. Golub and C. F. Van Loan, Matrix Computations, Johns Hopkins, Baltimore, Maryland, 1983.
[6] H. B. Keller, The bordering algorithm and path following near singular points of higher nullity, SIAM J. Sci. Stat. Comput., 4 (1983), no. 1, pp. 573-582.
[7] F.-X. Roux, Acceleration of the outer conjugate gradient by reorthogonalization for a domain decomposition method for structural analysis problems, in T. Chan, R. Glowinski, J. Periaux and O. B. Widlund, eds., Proceedings of the Third International Symposium on Domain Decomposition Methods, SIAM, Philadelphia (1990), pp. 314-319.
[8] F.-X. Roux, Dual and spectral properties of Schur and saddle point domain decomposition methods, in T. Chan, D. Keyes, G. Meurant, J. Scroggs and R. Voigt, eds., Proceedings of the 5th International Symposium on Domain Decomposition Methods, SIAM, Philadelphia (1992), pp. 73-90.
Chapter 11
Fictitious Domain/Domain Decomposition Method for Partial Differential Equations
Roland Glowinski (Department of Mathematics, University of Houston, Houston, Texas 77204 USA; Université P. et M. Curie, Paris; and CERFACS, Toulouse, France)
Tsorng-Whay Pan (Department of Mathematics, University of Houston, Houston, Texas 77204 USA)
Jacques Periaux (Dassault Aviation, 92214 Saint-Cloud, France)
Abstract
Motivated by computation on parallel MIMD machines, we address the numerical solution of some classes of elliptic problems by a combination of domain decomposition and fictitious domain methods. We take advantage of the fact that the Steklov-Poincare operators associated to the subdomain interfaces and to the fictitious domain treatment of internal boundaries have very similar properties. We use these properties to derive fast solution methods of the conjugate gradient type with good parallelization properties which simultaneously force the matching at subdomain interfaces and the actual boundary conditions. The above methods have been applied to solve elliptic problems in multiply connected domains and in three dimensional domains. Preliminary results obtained on a KSR machine are presented.
1 Introduction
Fictitious domain methods for partial differential equations are showing very interesting possibilities for solving complicated problems from science and engineering (see, for example, [4, 17] for some impressive illustrations). The main reason for this popularity of fictitious domain methods (sometimes called domain imbedding methods; cf. [5]) is that they allow the use of fairly structured meshes on a simple shape auxiliary domain containing the actual one, therefore allowing the use of fast solvers. In [9] and [10], we have used Lagrange multiplier and finite element methods combined with fictitious domain techniques to compute the numerical solutions of elliptic problems with Dirichlet boundary conditions and applied these methods for solving some nonlinear time dependent problems, namely
the flow of a viscous-plastic medium in a cylindrical pipe and external incompressible viscous flow modelled by the Navier-Stokes equations. In [11], we address the numerical solution of some classes of elliptic problems by a combination of domain decomposition and fictitious domain methods. We take advantage of the fact that the Steklov-Poincare operators associated to the subdomain interfaces and to the fictitious domain treatment of internal boundaries have very similar properties. We use these properties to derive fast solution methods of conjugate gradient type with good parallelization properties which simultaneously force the matching at subdomain interfaces and the actual boundary conditions. In this article, we further explore the above methodology by applying it to two and three dimensional elliptic problems with more subdomains on a parallel MIMD computer and obtain good speedup. This methodology seems to provide an efficient alternative to conventional solution methods for the solution of the Poisson equation on parallel MIMD computers. In Section 2, we give the formulation of a family of Dirichlet problems and a brief review of the fictitious domain method. In Section 3 we give an equivalent formulation which is the basis of the domain decomposition and fictitious domain methods. In Section 4 we discuss iterative solution of the equations by the one shot method. In Section 5 the above methods are applied to solve elliptic problems in multiply connected domains and in three dimensional domains. Preliminary results obtained on a KSR machine are presented.

2 Brief review of fictitious domain method
Figure 2.1
Let Ω be a "box" domain in R^d (d ≥ 1), and ω a bounded domain in R^d (d ≥ 1) such that the closure of ω is strictly contained in Ω (e.g., see Figure 2.1); let Γ (resp., γ) be the boundary ∂Ω (resp., ∂ω). We suppose that γ is smooth. Let Ω_0 = Ω \ ω̄ (e.g., in Figure 2.1 it is the region between Γ and γ). We consider the following elliptic problem:
where α > 0, and f, g_0, and g_1 are given functions defined over Ω_0, γ and Γ, respectively. If f, g_0, and g_1 are smooth enough, problem (2.1) has a unique solution. In the rest of this section and in Sections 3 and 4, we may assume that f = 0 and g_1 = 0, since, by solving the following problem in Ω:
where
the solution u of problem (2.1) equals u_0 + ũ, where ũ is the solution of problem (2.3).
A fictitious domain method was proposed for problem (2.1) in [9]. The main idea of the fictitious domain method is to solve problem (2.1) through an equivalent formulation defined in a simple shape auxiliary domain Ω. To derive the fictitious domain formulation, let us consider the following problems:
and
where g is given. Combining problems (2.4) and (2.5), we have
The solution of problem (2.6) is
Due to this combination, there is a jump of the normal derivative of u_g at γ for each given g, which is

where n is the outward normal unit vector (see Figure 2.1). This is a one-to-one and onto relation between g and the jump of the normal derivative (e.g., see [9]). But we are more interested in the inverse relation, i.e., to a given jump of the normal derivative, μ, we associate u_μ|_γ (the restriction of u_μ to γ), where u_μ is the solution of the following problem:
where [u_μ]_γ is the difference between u_μ|_{Ω\ω̄} and u_μ|_ω on γ. If we are able to find a jump of the normal derivative λ such that
where u_λ is the solution of problem (2.7), then u_λ|_{Ω_0} is the solution of problem (2.1). So, for each given μ, if we define an operator A by
then we must solve the equation
After obtaining λ from problem (2.9), we solve problem (2.7) with the jump λ to obtain the solution u. Combining problems (2.7) and (2.9), we have the following fictitious domain formulation for problem (2.1): Find u and λ such that
Remark 2.1. In problem (2.10), λ is a Lagrange multiplier associated with the boundary condition u = g_0 on γ for the saddle point approach [7, 9].

In [9], the operator A is shown to be self-adjoint and positive definite. So we apply the general conjugate gradient method described in, e.g., [8, Chapter 3] to problem (2.9) and obtain the following conjugate gradient algorithm:
solve
and set
For n ≥ 0, assuming that λ_n, r_n, w_n are known, compute λ_{n+1}, r_{n+1}, w_{n+1} as follows: solve
compute
set
and set
If (∫_γ r_{n+1}^2 ds / ∫_γ r_0^2 ds)^{1/2} < ε, take λ = λ_{n+1}, u = u_{n+1}; if not, compute
and set

Do n = n + 1 and go to (2.15).

Remark 2.2. (1) In the conjugate gradient algorithm (2.11)-(2.21), the elliptic problems (2.12) and (2.15) can be solved in a simple shape auxiliary domain Ω. So we can use fairly structured meshes and then a fast solver to solve the related discrete problems. (2) When ω, Ω ⊂ R^2, with a smooth boundary γ, we have obtained a quasi-optimal preconditioner for the conjugate gradient algorithm (2.11)-(2.21) by Fourier analysis (see [9]).

3 Domain decomposition/fictitious domain approach

For simplicity we consider the case where ω is the union of two disjoint bounded domains, ω_1 and ω_2, and a two subdomain decomposition like the one in Figure 3.1, where Ω = Ω_1 ∪ Ω_2; we denote by γ_0 the interface between Ω_1 and Ω_2, by γ_1 (resp. γ_2) the boundary of ω_1 (resp. ω_2), and let Ω_i* = (Ω_i \ ω̄_i) ∪ ω_i and Γ_i = Γ ∩ ∂Ω_i for i = 1, 2.
Then, applying the domain decomposition method (e.g., see [6, 13]) to problem (2.10), we have: Find u_1, u_2, λ_1, λ_2, and λ_0 such that
We have an equivalence in the sense that if relations (3.1), (3.2) hold, then u_i = u|_{Ω_i} for i = 1, 2, where u is the solution of (2.10), and conversely.

Remark 3.1. In problem (3.1), (3.2), λ_i is the Lagrange multiplier associated to the boundary condition u = g_0 on γ_i for i = 1, 2, and λ_0 is the Lagrange multiplier associated to the interface boundary condition u_1 = u_2 on γ_0 for the saddle point approach [7, 9].
Figure 3.1
4 Iterative solutions

Due to the combination of the two methods, there are two Lagrange multipliers, associated to the boundary conditions and to the matching of solutions at the subdomain interfaces respectively. Thus we can solve the saddle-point system (3.1), (3.2) by the following three different conjugate gradient algorithms:
(1) The algorithm BD: a nested loop in which the outer loop is a conjugate gradient algorithm driven by the multiplier associated to the boundary conditions and the inner one is driven by the multiplier associated to the matching at the subdomain interface.
(2) The algorithm DB: a nested loop in which the outer loop is a conjugate gradient algorithm driven by the multiplier associated to the matching at the subdomain interface and the inner one is driven by the multiplier associated to the boundary conditions.
(3) The one shot method: it combines the outer loop and the inner loop of either one of the above two algorithms into one loop, which is a conjugate gradient algorithm driven by the two multipliers at the same time.
These three methods have different parallelization properties and can be parallelized on MIMD machines. The algorithm BD was discussed in [11]; due to its nested loop feature, its speed is much slower than that of the one shot method for the two subdomain decomposition shown in Figure 3.1. The one shot method is the following:
solve: Find u_1 and u_2 such that
and set
For n ≥ 0, assuming that λ_n, r_n, w_n are known, compute λ_{n+1}, r_{n+1}, w_{n+1} as follows: solve
Find
and
such that
and set
We then compute
and set
and set
Do n = n + 1 and go to (4.5).

In the one shot algorithm (4.1)-(4.12), for r^n = (r_0^n, r_1^n, r_2^n) its norm is defined as follows:
and [u]_{γ_i} is the difference between u|_{Ω_i\ω̄_i} and u|_{ω_i} on γ_i for i = 1, 2.
Remark 4.1. When ω, Ω ⊂ R^2, with a smooth boundary γ, a preconditioned one shot method has been discussed in [11].
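Whatever the precise operators in (4.1)-(4.12), the computational kernel of the one shot method is one application of the condensed operator per iteration, which amounts to independent subdomain solves followed by the assembly of the boundary and interface mismatches. A schematic version of this kernel (the callables subdomain_solves and assemble are placeholders for the discrete subdomain problems and the residual assembly; the concurrency below merely mimics the simultaneous subdomain solves of the actual implementation) is:

    import numpy as np
    from concurrent.futures import ThreadPoolExecutor

    def apply_condensed_operator(lam, subdomain_solves, assemble):
        # lam: the stacked multipliers (boundary and interface components).
        # Each subdomain problem depends only on lam and can be solved
        # independently; 'assemble' collects the interface jumps and the
        # boundary mismatches into one vector of the same size as lam.
        with ThreadPoolExecutor() as pool:
            local = list(pool.map(lambda solve: solve(lam), subdomain_solves))
        return assemble(local)

This callable is what a conjugate gradient driver needs at each iteration; the eight subdomain solves of the experiments below are independent in the same way.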
5 Performance on a KSR1 machine
5.1 Two dimensional case

We consider problem (2.1) with α = 1 as the test problem and let u(x_1, x_2) = x_1^2 + x_2^2 be its solution. Then f(x_1, x_2) = x_1^2 + x_2^2 - 4. Let ω = ω_1 ∪ ω_2, where ω_i is an elliptical domain centered at (1.0, c_i) for i = 1, 2, with c_1 = 1.1875 and c_2 = 0.8125; take Ω = (0, 2) × (0, 2) (see Figure 5.1).
Figure 5.1.

In the numerical experiments, we consider the case where Ω is the union of eight subdomains Ω_1, ..., Ω_8 (see Figure 5.1) and use the finite element method (e.g., see [14]) to obtain the discrete analogue of problem (2.1) and the one shot algorithm. The finite element spaces for the discrete solution of u in Ω_i for i = 1, ..., 8 consist of continuous piecewise linear polynomials. Part of an example of the triangulation used to define the finite element spaces in Ω_i for i = 1, ..., 8 is shown in Figure 5.2. For i = 0, 1, 2, the finite element spaces Λ_h^i for the multipliers are defined as follows: Λ_h^i = {μ_h | μ_h ∈ L^∞(γ_i), μ_h is constant on the segment joining 2 consecutive mesh points on γ_i}. The choice of mesh points on γ_1 and γ_2 is shown in Figure 5.2. For stability reasons (the so-called LBB inf-sup condition [2]), the length of each segment on γ_1 and γ_2 has to be chosen greater than the mesh size h. The obvious
choice for the mesh points on γ_0 is the midpoints of the edges located on γ_0 (see Figure 5.2).
Figure 5.2. Mesh points marked by "o" on γ_1 and γ_2 and part of the mesh points marked by "*" on γ_0, with h = 1/64.

In the one shot method, the elliptic problems have been solved on each subdomain by a Fast Elliptic Solver from the package FISHPAK [1] based on cyclic reduction [3, 15, 16]. Concerning the implementation of the one shot method on the KSR1 machine, the eight discrete elliptic problems can be solved simultaneously. For meshsizes h = 1/32, 1/64, 1/128, and 1/256, the number of iterations of the one shot method is 78, 91, 116, and 151 respectively, and the number of iterations for the preconditioned one shot method is 68, 75, 88, and 92 respectively. Thus the preconditioner for two dimensional problems works very well. The CPU time per iteration of the one shot method with or without preconditioner is about the same. In Tables 5.1 and 5.2 we show the CPU time and speedup per iteration of the discrete analogues of the preconditioned one shot method for different meshsizes; Np is the number of processors used in the computation. The speedup per iteration in Table 5.2 is better as the size of the problem gets bigger.
Table 5.1. CPU per iteration on a KSR1

Np    h=1/32       h=1/64       h=1/128      h=1/256
1     0.291 sec.   1.130 sec.   5.251 sec.   22.608 sec.
2     0.189 sec.   0.649 sec.   2.828 sec.   11.756 sec.
4     0.125 sec.   0.394 sec.   1.543 sec.   6.177 sec.
8     0.099 sec.   0.231 sec.   0.883 sec.   3.334 sec.

Table 5.2. Speedup per iteration on a KSR1

Np    h=1/32   h=1/64   h=1/128   h=1/256
1     1.00     1.00     1.00      1.00
2     1.54     1.74     1.86      1.92
4     2.33     2.87     3.40      3.66
8     2.94     4.89     5.95      6.78
5.2 Three dimensional case

We consider the following three-dimensional test problem. Let ω be an axisymmetric airship whose body shape is defined by the equation
where r^2 = x_2^2 + x_3^2 and b = 0.02 (a cross section of the airship is shown in Figure 5.3); take Ω = (-2, 2) × (-1, 1) × (-1, 1), and let u(x_1, x_2, x_3) = 10(x_1^3 + x_2^3 - x_3^3) be the solution of the elliptic problem (2.1) with α = 1. Then f(x_1, x_2, x_3) = 10(x_1^3 + x_2^3 - x_3^3) - 60(x_1 + x_2 - x_3).
Figure 5.3.
In the numerical experiments, we consider the case where Ω is the union of eight slices {Ω_i}_{i=1}^{8}, where Ω_i = (-2 + (i-1)/2, -2 + i/2) × (-1, 1) × (-1, 1) for i = 1, ..., 8. Let S_i be the interface between Ω_i and Ω_{i+1} for i = 1, ..., 7. The finite element spaces for the discrete solution of u in Ω_i for i = 1, ..., 8 consist of continuous piecewise linear polynomials. An example of the triangulation used to define the finite element spaces in Ω_i for i = 1, ..., 8 is shown in Figure 5.4. For the multiplier associated to the boundary condition on γ, the finite element space Λ_h is defined as follows:
where T_h^γ is a triangulation of γ; an edge of any T ∈ T_h^γ does not need to be an edge of another T' ∈ T_h^γ. For the multipliers associated to the matching of the solution at the subdomain interfaces {S_i}_{i=1}^{7}, the finite element spaces Λ_h^i are defined as follows:
where T_h^i is a triangulation of S_i for i = 1, ..., 7.
Figure 5.4.

In the implementation of the one shot method, these three dimensional elliptic problems have also been solved on each subdomain by a Fast Elliptic Solver from the package FISHPAK [1] based on cyclic reduction [3, 15, 16], and the eight discrete elliptic problems can be solved simultaneously. For meshsizes h = 1/20, 1/40, and 1/50, the number of iterations of the one shot method is 79, 98, and 105 respectively. In Tables 5.3 and 5.4 we show the CPU time and speedup per iteration of the discrete analogues of the one shot method for different meshsizes; Np is the number of processors used in the computation. As in Table 5.2, the speedup per iteration in Table 5.4 is better as the size of the problem gets bigger. In Table 5.5 we
show the comparison of performance between the fictitious domain method (2.11)-(2.21) and the one shot method applied to the test problem considered here on a KSR1 machine. The one shot method seems to provide an alternative to conventional solution methods for the solution of the Poisson equation on a parallel MIMD computer.

Table 5.3. CPU per iteration on a KSR1

Np    h=1/20       h=1/40        h=1/50
1     5.049 sec.   41.420 sec.   83.117 sec.
2     3.275 sec.   22.704 sec.   45.898 sec.
4     1.819 sec.   13.355 sec.   24.972 sec.
8     1.465 sec.   7.156 sec.    13.214 sec.

Table 5.4. Speedup per iteration on a KSR1

Np    h=1/20   h=1/40   h=1/50
1     1.00     1.00     1.00
2     1.54     1.74     1.81
4     2.78     2.87     3.33
8     3.45     4.89     6.29
Table 5.5. Comparison

                     h=1/20        h=1/40         h=1/50
FD Method            164.82 sec.   2300.96 sec.   7280.26 sec.
One shot, Np = 1     398.88 sec.   4059.19 sec.   8727.24 sec.
One shot, Np = 2     258.70 sec.   2225.01 sec.   4819.29 sec.
One shot, Np = 4     143.67 sec.   1308.77 sec.   2622.08 sec.
One shot, Np = 8     115.76 sec.   701.21 sec.    1387.54 sec.
6 Conclusion
Domain decomposition methods combined with fictitious domain methods seem to provide an alternative to conventional solution methods for the solution of the Poisson equation on parallel MIMD computers. This new methodology looks promising for solving elliptic problems in multiply connected domains and in three dimensional domains. However, further experiments are needed for very large problems to explore the parallelization properties of the one shot algorithm for 3-D flows.
Acknowledgements

We would like to acknowledge the helpful comments and suggestions of referees, M. Luskin, B. Cockburn, E. J. Dean, M. Druguet, J. Singer and S.
Datta. The support of the following corporations and institutions is also acknowledged: AWARE, Dassault Aviation, CERFACS, Texas Center for Advanced Molecular Computation, University of Houston, Université P. et M. Curie. We also benefited from the support of DARPA (Contracts AFOSR F49620-89-C-0125 and AFOSR-90-0334), DRET (Grant 89424), NSF (Grants INT 8612680, DMS 8822522 and DMS 9112847) and the Texas Board of Higher Education (Grants 003652156ARP and 003652146ATP).
REFERENCES
1. J. Adams, P. Swarztrauber and R. Sweet, FISHPAK: a package of FORTRAN subprograms for the solution of separable elliptic partial differential equations, NCAR, Boulder, Colorado, 1980.
2. F. Brezzi, On the existence, uniqueness and approximation of saddle point problems arising from Lagrangian multipliers, RAIRO Num. Anal. 8 (1974), pp. 129-151.
3. O. Buneman, A compact non-iterative Poisson solver, Report 294, Stanford University Institute for Plasma Research, Stanford, Cal., 1969.
4. J. E. Bussoletti, P. T. Johnson, S. S. Samanth, D. P. Young, R. H. Burkhart, EM-TRANAIR: Steps toward solution of general 3D Maxwell's equations, in Computer Methods in Applied Sciences and Engineering, R. Glowinski, ed., Nova Science, Commack, NY, 1991, pp. 49-72.
5. B. L. Buzbee, F. W. Dorr, J. A. George, G. H. Golub, The direct solution of the discrete Poisson equation on irregular regions, SIAM J. Num. Anal. 8 (1971), pp. 722-736.
6. L. C. Cowsar, E. J. Dean, R. Glowinski, P. Le Tallec, C. H. Li, J. Periaux, M. F. Wheeler, Decomposition principles and their applications in scientific computing, in Parallel Processing for Scientific Computing, J. Dongarra, K. Kennedy, P. Messina, D. C. Sorensen, R. G. Voigt, eds. (1992), SIAM, Philadelphia, pp. 213-237.
7. V. Girault, P. Raviart, Finite Element Methods for Navier-Stokes Equations, Theory and Algorithms, Springer Verlag, Berlin, 1986.
8. R. Glowinski, P. Le Tallec, Augmented Lagrangian and Operator Splitting Methods in Nonlinear Mechanics, SIAM, Philadelphia, 1989.
9. R. Glowinski, T. W. Pan, J. Periaux, A fictitious domain method for Dirichlet problem and applications, Comp. Meth. Appl. Mech. Eng. 111 (1994), pp. 283-303.
10. R. Glowinski, T. W. Pan, J. Periaux, A fictitious domain method for external incompressible viscous flow modeled by Navier-Stokes equations, Comp. Meth. Appl. Mech. Eng. 112 (1994), pp. 133-148.
11. R. Glowinski, T. W. Pan, J. Periaux, A one shot domain decomposition/fictitious domain method for the solution of elliptic equations, in the proceedings of Parallel CFD'93, Paris, France, May 1993 (to appear).
12. R. Glowinski, O. Pironneau, Numerical methods for the first biharmonic equation and for the two-dimensional Stokes problem, SIAM Rev. 21 (1979), pp. 167-212.
13. R. Glowinski, M. F. Wheeler, Domain decomposition and mixed finite element methods for elliptic problems, in Domain Decomposition Methods for Partial Differential Equations, R. Glowinski, G. H. Golub, G. Meurant, J. Periaux, eds. (1988), SIAM, Philadelphia, pp. 144-172.
14. C. Johnson, Numerical Solutions of Partial Differential Equations by the Finite Element Method, Cambridge University Press, Cambridge, 1987.
15. R. A. Sweet, A generalized cyclic reduction algorithm, SIAM J. Num. Anal. 11 (1974), pp. 506-520.
16. R. A. Sweet, A cyclic reduction algorithm for solving block tridiagonal systems of arbitrary dimension, SIAM J. Num. Anal. 14 (1977), pp. 706-720.
17. D. P. Young, R. G. Melvin, M. B. Bieterman, F. T. Johnson, S. S. Samanth, J. E. Bussoletti, A locally refined rectangular grid finite element method: application to computational physics, J. Comp. Physics 92 (1991), pp. 1-66.
Chapter 12
Multipole and Precorrected-FFT Accelerated Iterative Methods for Solving Surface Integral Formulations of Three-Dimensional Laplace Problems
K. Nabors
J. Phillips
F. T. Korsmeyer
J. White
Abstract It is common practice to use Krylov-subspace iterative methods, like GMRES, to solve the dense linear systems generated by discretized surface integral formulations of three-dimensional Laplace problems. In this overview, we describe some aspects of using fast-multipole and precorrected-FFT algorithms to reduce the memory and computational cost of such methods. A variety of examples are presented to give a clear indication of the practical merits of these schemes.
1 Introduction
Finding computationally efficient numerical techniques for the simulation of three-dimensional structures has been an important research topic in almost every engineering domain. Surprisingly, one of the most numerically intractable problems across these various disciplines can be reduced to the problem of solving a three-dimensional linear partial differential equation whose Green's function is given by G(x, x′) = 1/(4π‖x − x′‖). Such problems are often referred to as potential problems, and application examples include electrostatic analysis of sensors and actuators [18]; electro- and magneto-quasi-static analysis of
*This work was supported by Advanced Research Projects Agency contract N00014-91-J-1698, Office of Naval Research contract N00014-90-J-1085, National Science Foundation contracts MIP-8858764 A02 and ECS-9301189, F.B.I. contract J-FBI-88-067, and grants from Digital Equipment Corporation and I.B.M.
†Dept. of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, MA 02139.
‡Dept. of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, MA 02139 (ksn@rle-vlsi.mit.edu).
§Dept. of Ocean Engineering, Massachusetts Institute of Technology, Cambridge, MA 02139 (xmeyer@chf.mit.edu).
¶Dept. of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, MA 02139 (white@rle-vlsi.mit.edu).
integrated circuit interconnect and packaging [16]; and potential flow based analysis of ocean structures [8]. In the early 1970's, boundary- or volume-element techniques were developed for solving integral formulations of potential problems, but the techniques generate dense matrices and therefore can only be used for relatively simple geometries. In the 1980's, finite-element techniques applied directly to the partial differential equation formulation supplanted approaches based on the integral formulation, because finite-element techniques generate sparse matrices which can be solved rapidly by preconditioned conjugate-direction algorithms or multigrid approaches. However, the recent development of extremely fast multipole-accelerated iterative algorithms for solving potential problems has renewed interest in integral equation formulations [15]. In such methods, the fast-multipole or adaptive-multipole algorithms [6] are used to implicitly construct a sparse representation of the dense matrix associated with a discretized integral equation formulation, and this implicit representation is used to quickly compute the matrix-vector products required in a Krylov-subspace iterative method like GMRES [17]. That the multipole-accelerated approach leads to practical engineering analysis tools has been demonstrated for electrostatic analysis of integrated circuit interconnect and packaging problems [11]. The present success of multipole-accelerated iterative algorithms for discretized integral equations suggests that other fast summation techniques be reexamined in the context of solving integral equations. In this overview, we begin such an examination by comparing multipole and precorrected-FFT (Fast Fourier Transform) acceleration. What emerges from this comparison is the importance of inhomogeneity in determining algorithm performance. In the section that follows, we review Galerkin and collocation algorithms applied to a first-kind potential integral equation, as well as Krylov-subspace iterative algorithms for solving the generated dense matrix problem. Application of the fast-multipole algorithm to this problem is described in Section 3. The precorrected-FFT method is described in Section 4, and a variety of experimental results are presented in Section 5. Finally, conclusions and acknowledgments are given in Section 6.
2 Formulation
To solve the exterior Laplace problem, one must find a twice continuously differentiable function ψ : R³ → R which both satisfies given boundary conditions and for which

∇²ψ(x) = 0   (1)

for all x ∈ D ⊂ R³. Here, D is the exterior of a possibly multiply-connected two-dimensional surface S. Typical boundary conditions for this exterior problem are constraints on ψ or its normal derivative at all points of S, and
the condition that

ψ(x) → 0  as  ‖x‖ → ∞.   (2)
Green's theorem applied to the exterior Laplace problem can be used to generate several equivalent integral formulations which can be used to compute ψ. For example, if ψ is known on S then the problem is referred to as an exterior Dirichlet problem, and can be formulated as a first-kind integral equation (for a second-kind formulation of the exterior Dirichlet problem on multiply-connected domains see [5]). Using what is referred to as an indirect formulation [1] leads to a first-kind integral equation for ψ in terms of a single-layer surface density, hereafter referred to as a charge density. Specifically, in the indirect formulation, the charge density σ satisfies the integral equation

ψ(x) = ∫_S σ(x′) / (4π‖x − x′‖) da′,   x ∈ S,   (3)
where S is a two-dimensional surface in R³, ψ(x) is a given surface potential, da′ is the incremental surface area, x, x′ ∈ R³, and ‖x‖ is the usual Euclidean length of x given by √(x₁² + x₂² + x₃²). More compactly, we denote (3) by

ψ = Pσ.   (4)
The charge density σ can be related to electrostatic capacitances and forces, or fluid velocities for the case of potential flow. Electrostatic capacitances are useful figures of merit for designers of electronic packaging [16]; microelectromechanical system designers are interested in electrostatic forces [18]; and ocean vehicle designers are interested in potential flow [14]. For these engineering problems, 0.1% to 1% accuracy is typically sufficient, and therefore low-order schemes are in common use. Note, the problem in (3) is not ill-posed, even though it is a first-kind Fredholm integral equation, because of the singular kernel [10]. Therefore, the problem can be solved without regularization, and to compute an approximation to σ, in general one can consider an expansion of the form

σ(x) ≈ Σ_{i=1}^{N} q_i θ_i(x),   (5)
where θ_1(x), ..., θ_N(x) : R³ → R are a set of (not necessarily orthogonal) expansion functions, and q_1, ..., q_N are the unknown expansion coefficients. The expansion coefficients are then determined by requiring that they satisfy a Galerkin or collocation condition of the form

Pq = p,   (6)
where P ∈ R^(N×N) and p, q ∈ R^N. In the case of a Galerkin condition,

P_ij = ⟨θ_i, Pθ_j⟩   (7)
and

p_i = ⟨θ_i, ψ⟩,   (8)
where ⟨f, g⟩ ≡ ∫_S f(x′) g(x′) da′. For the collocation condition,

P_ij = ⟨δ(x_i), Pθ_j⟩ = ∫_S θ_j(x′) / (4π‖x_i − x′‖) da′   (9)
and

p_i = ⟨δ(x_i), ψ⟩ = ψ(x_i),   (10)
where ⟨δ(x), u⟩ = u(x), and x_1, ..., x_N are the collocation points [10]. For example, the approach used in many engineering applications is to approximate the surface S with N planar quadrilateral and/or triangular panels over which σ is assumed uniform. The expansion functions are then

θ_i(x) = 1/a_i  if x is on panel i,  and 0 otherwise,   (11)
where a_i is the area of panel i. The Galerkin scheme for determining the expansion coefficients yields

P_ij = (1/(a_i a_j)) ∫_{panel i} ∫_{panel j} da′ da / (4π‖x − x′‖),   (12)
and for the collocation method

P_ij = (1/a_j) ∫_{panel j} da′ / (4π‖x_i − x′‖).   (13)
There are closed-form expressions for the integral in (13) [14], but closed-form expressions for the integral in (12) are known only in special cases [16]. The dense linear system of (6) can be solved to compute expansion coefficients from a right-hand side derived from (8) or (10). If Gaussian elimination is used to solve (6), the number of operations is order N³. Clearly, this approach becomes computationally intractable if the number of expansion functions exceeds several hundred. Instead, consider solving the linear system (6) using a Krylov-subspace method like GMRES [17]. Such methods have the general form:

ALGORITHM 1. General Krylov-subspace algorithm for solving (6).
Set q⁰ to some initial guess at the solution.
Compute the initial residual and Krylov vector, r⁰ = p⁰ = p − Pq⁰.
Set k = 0.
repeat {
  Determine the value of some cost function (e.g. ‖r^k‖).
  if (cost < tolerance), return q^k as the solution.
  else choose the α's and β in the updates of q^(k+1), r^(k+1), and the next Krylov vector p^(k+1) (each formed from Pp^k and linear combinations of the previous vectors) so as to minimize the cost function.
  Set k = k + 1.
}

The dominant costs of Algorithm 1 are in calculating the N² entries of P using (7) or (9) before the iterations begin, and performing N² operations to compute Pp^k on each iteration. Described below are fast-multipole and precorrected-FFT algorithms which avoid forming most of P and reduce the cost of forming Pp^k to order N or order N log N operations. Though not a subject of this paper, it should be noted that effective preconditioners for accelerating GMRES convergence have been developed for discretized integral equation problems [19, 12].
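The following sketch (not the authors' FASTCAP implementation) illustrates the structure of Algorithm 1 in Python: the discretized equation Pq = p is solved with GMRES, and the product Pq is supplied through a scipy LinearOperator, which is exactly the slot where a fast multipole or FFT routine would be substituted. The geometry (a flat square plate), the one-point quadrature for the off-diagonal entries, and the approximate diagonal self-term are simplifying assumptions made only to keep the example short.

import numpy as np
from scipy.sparse.linalg import LinearOperator, gmres

n = 16                                   # a unit square plate cut into n x n panels
h = 1.0 / n
xs = (np.arange(n) + 0.5) * h
X, Y = np.meshgrid(xs, xs, indexing="ij")
centroids = np.column_stack([X.ravel(), Y.ravel(), np.zeros(n * n)])
areas = np.full(n * n, h * h)
N = n * n

# Centroid-collocation matrix: potential at centroid i due to unit charge
# density on panel j.  Off-diagonal entries use a crude point-charge
# approximation; the diagonal uses the potential at the centre of a uniformly
# charged square panel of side h for the 1/(4*pi*r) kernel.
d = np.linalg.norm(centroids[:, None, :] - centroids[None, :, :], axis=-1)
with np.errstate(divide="ignore"):
    P = areas / (4.0 * np.pi * d)
np.fill_diagonal(P, (4.0 * np.log(1.0 + np.sqrt(2.0)) / (4.0 * np.pi)) * h)

def matvec(q):
    # In an accelerated code this O(N^2) product is replaced by an O(N) or
    # O(N log N) approximation; GMRES only ever needs the action of P.
    return P @ q

P_op = LinearOperator((N, N), matvec=matvec)
psi = np.ones(N)                          # prescribed unit potential on the plate
q, info = gmres(P_op, psi)
print("GMRES info =", info, " total charge =", float(q @ areas))

Because GMRES touches P only through the matrix-vector product, replacing the dense product by an order-N approximation leaves the rest of the solver untouched.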
3 Multipole Acceleration
In the case of collocation, computing the dense matrix-vector product Pq is equivalent to evaluating the potential at N collocation points, {x_1, ..., x_N}, due to a charge density described by Σ_{i=1}^{N} q_i θ_i(x). It is possible to avoid forming P, and to substantially reduce the cost of computing Pq, using the fast-multipole algorithm [7, 6]. The fast-multipole algorithm uses a hierarchical partitioning of the problem domain and careful application of multipole and local expansions to accurately compute potentials at N points due to N charged particles in order N operations. To understand how the fast-multipole algorithm achieves this efficiency, consider again the case where the expansion functions represent uniform distributions over flat panels and centroid collocation is used to determine the coefficients of the expansion. Then evaluating the potential due to the charge distribution on d panels at d centroid collocation points, or evaluation points, requires d² operations. If the panels are in a cluster, then the cost of this calculation can be reduced if some approximation is allowed. The potential due to the cluster of panels can be represented by a truncated multipole expansion, and this expansion can be used to compute the potential at d evaluation points, as shown in Figure 1. Multipole expansions have the general form

ψ(r, θ, φ) ≈ Σ_{n=0}^{l} Σ_{m=−n}^{n} (M_n^m / r^(n+1)) Y_n^m(θ, φ),   (14)
where l is the expansion order, r, θ, and φ are the spherical coordinates with respect to the multipole expansion's origin (usually the center of the charge cluster), the Y_n^m(θ, φ) are the surface spherical harmonics, and the M_n^m are the multipole coefficients [9].
FIG. 1. Approximately computing the potentials at d evaluation points due to a cluster of d charged panels in order d operations using a multipole expansion.
Multipole expansions can be used to efficiently evaluate the potential due to a cluster of charges at any point where the distance between the evaluation point and the cluster's center is significantly larger than the radius of the cluster. A dual optimization is possible using local expansions, as shown in Figure 2. That is, for a cluster of evaluation points, the potential due to charges whose distances from the cluster's center are significantly larger than the radius of the cluster can be combined into a local expansion at the cluster's center. Then, this local expansion can be used to efficiently compute potentials at evaluation points in the cluster. A local expansion has the form

ψ(r, θ, φ) ≈ Σ_{n=0}^{l} Σ_{m=−n}^{n} L_n^m r^n Y_n^m(θ, φ),   (15)
where l is the order of the expansion, r, θ, and φ are the spherical coordinates of the evaluation location with respect to the expansion's center, and the L_n^m are the local expansion coefficients. The above examples make clear that multipole and local expansions can be used to improve computational efficiency, at the cost of some accuracy, when a set of evaluation points is well-separated from a set of panel charges, and one of the sets is clustered. The loss in accuracy can be made arbitrarily small by increasing the order of the multipole expansions. For a general distribution of panels, nearby interactions should therefore be computed directly. This result can then be summed with the contributions to the potential due to distant panels which, of course, can be represented using multipole or local expansions. Direct evaluations may also be used for small groups of distant panels, as multipole and local expansions are inefficient unless they represent the effect of a larger number of panel charges than there are multipole coefficients.
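As a purely illustrative aside, the sketch below makes the accuracy trade-off concrete using a Cartesian monopole-plus-dipole expansion rather than the spherical-harmonic form of (14); the cluster, the charges, and the evaluation points are arbitrary stand-ins.

import numpy as np

rng = np.random.default_rng(1)
src = 0.05 * rng.standard_normal((50, 3))          # charge cluster of radius ~0.05
q = rng.standard_normal(50)                        # panel/point charge strengths
evalpts = np.array([[1.0, 0.2, -0.3], [0.8, -0.9, 1.1], [2.0, 0.0, 0.5]])

def direct(pts):
    r = np.linalg.norm(pts[:, None, :] - src[None, :, :], axis=-1)
    return (q / (4.0 * np.pi * r)).sum(axis=1)

# Moments about the cluster center (the origin): monopole and dipole.
Q0 = q.sum()
D1 = (q[:, None] * src).sum(axis=0)

r = np.linalg.norm(evalpts, axis=1)
monopole = Q0 / (4.0 * np.pi * r)
dipole = monopole + (evalpts @ D1) / (4.0 * np.pi * r**3)

exact = direct(evalpts)
print("relative error, monopole only:", np.abs(monopole - exact) / np.abs(exact))
print("relative error, with dipole  :", np.abs(dipole - exact) / np.abs(exact))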
FIG. 2. Approximately computing the potentials at a cluster of d evaluation points due to d charged panels in order d operations using a local expansion.
3.1 Using Fast-Multipole Algorithms for Integral Equations
As described above, an iterative algorithm would use the fast-multipole algorithm many times to compute matrix-vector products. This implies that the algorithm is used many times without a change in geometry, and this fact can be exploited to improve algorithm efficiency. As is made clearer below, the fast-multipole algorithm involves constructing multipole expansions from charge density expansions; shifting and combining multipole expansions; converting multipole expansions and charge density expansions into local expansions; shifting local expansions; and evaluating the potential due to charge density expansions, multipole expansions, or local expansions. However, each of the construction, translation, and evaluation operations is a linear function of the expansion coefficients, and therefore can be represented as a matrix whose elements are functions of geometry alone [13]. The above implies that the fast-multipole algorithm produces an approximation to Pq, denoted P̃q, which is a linear function of q. That is, the fast-multipole algorithm is directly analogous to a sparse representation of the dense matrix P. This has several important consequences. When the fast-multipole algorithm is used to compute matrix-vector products in an iterative method, the iterative method's convergence is controlled by the properties of P̃, not by how well P̃ approximates P. Also, the translation matrices need be computed only once, then used repeatedly until the iteration converges.
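A small sketch of this observation: because moment construction and far-field evaluation are linear in the charge vector, they can be stored once as small geometry-dependent matrices and reapplied to every new charge vector in the iteration. Cartesian monopole and dipole moments stand in here for the spherical-harmonic operators of the actual algorithm; the source and evaluation points are arbitrary.

import numpy as np

rng = np.random.default_rng(2)
src = 0.05 * rng.standard_normal((40, 3))          # source (panel) locations in one cube
ev = np.array([[1.5, 0.1, 0.0], [1.2, -0.8, 0.9]]) # well-separated evaluation points

# "Q2M"-like matrix: maps 40 charges to 4 moments (total charge + dipole vector).
Q2M = np.vstack([np.ones(40), src.T])              # shape (4, 40)

# Evaluation matrix: maps the 4 moments to potentials at the evaluation points.
r = np.linalg.norm(ev, axis=1)
M2P = np.column_stack([1.0 / r, ev / r[:, None]**3]) / (4.0 * np.pi)   # shape (2, 4)

P_far_approx = M2P @ Q2M                           # precomputed once, reused every iteration

for _ in range(3):                                 # e.g., successive GMRES iterates
    q = rng.standard_normal(40)
    exact = (q / (4.0 * np.pi * np.linalg.norm(ev[:, None] - src[None], axis=-1))).sum(axis=1)
    print(np.abs(P_far_approx @ q - exact) / np.abs(exact))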
3.2 The Fast-Multipole Algorithm
Below we give a multipole algorithm which follows closely the development in [7]. The algorithm exploits the linearity of the multipole and local approximations and is presented exclusively in terms of translation matrices.
To begin, we consider the hierarchical domain partitioning. Let the root cube be the smallest cube containing the problem domain. More precisely, the root cube is the smallest cube which contains all the collocation points and for which x outside the cube implies θ_i(x) = 0 for all i. The hierarchy is then just a recursive eight-way sectioning of this root cube.
DEFINITION 3.1. Cube hierarchy: The cube containing the problem domain is referred to as the level-0, or root, cube. Then, the volume of the cube is subdivided into eight equally sized child cubes, referred to as level-1 cubes, and each has the level-0 cube as its parent. The collocation points are distributed among the child cubes by associating a collocation point with a cube if the point is contained in the cube. Each of the level-1 cubes is then subdivided into eight level-2 child cubes and the collocation points are again distributed. The result is a collection of 64 level-2 cubes and a 64-way partition of the collocation points. This process is repeated to produce D levels of cubes, and D distributions of collocation points, starting with an 8-way partition and ending with an 8^D-way partition. The depth, D, is chosen so that the maximum number of θ_i's whose support intersects any finest-level cube is less than a selected constant.
The terms below are used to concisely describe the fast-multipole algorithm.
DEFINITION 3.2. Evaluation points of a cube: The collocation points within the cube.
DEFINITION 3.3. Nearest neighbors of a cube: Those cubes which have a corner in common with the given cube.
DEFINITION 3.4. Second nearest neighbors of a cube: Those cubes which are not nearest neighbors but have a corner in common with a nearest neighbor of the given cube. Note that there are at most 124 nearest and second nearest neighbors of a cube, excluding the cube itself.
DEFINITION 3.5. Interaction set of a cube: The set of cubes which are either the second nearest neighbors of the given cube's parent, or are children of the given cube's parent's nearest neighbors, excluding nearest or second nearest neighbors of the given cube. There are at most 189 cubes in an interaction set; roughly half are from a level one coarser than the level of the given cube, and the rest are on the same level.
For the jth cube on level d: its parent's index on level d − 1 is denoted F(d, j); its set of (d + 1)-level children is denoted C(d, j); its set of interaction cubes is denoted I(d, j); the set of cube j and cube j's nearest and second-nearest neighbors is denoted N(d, j); the vector of multipole expansion coefficients representing the charge density in the cube is denoted M_{d,j}; the vector of local expansion coefficients for the cube is denoted L_{d,j}; and the vectors of the cube's charge density expansion coefficients and collocation point potentials are denoted q_{d,j} and p_{d,j}, respectively.
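A minimal sketch of the cube hierarchy of Definition 3.1, assuming only that points are binned by integer coordinates at each level; the neighbor, second-nearest-neighbor, and interaction sets of Definitions 3.3-3.5 are omitted for brevity.

import numpy as np
from collections import defaultdict

def build_hierarchy(points, depth):
    lo = points.min(axis=0)
    size = (points.max(axis=0) - lo).max() * (1 + 1e-12)   # root cube edge length
    levels = []
    for d in range(depth, -1, -1):
        n = 2 ** d
        idx = np.clip(((points - lo) / size * n).astype(int), 0, n - 1)
        cubes = defaultdict(list)
        for p, key in enumerate(map(tuple, idx)):
            cubes[key].append(p)                 # points associated with that level-d cube
        levels.append((d, cubes))
    return levels[::-1]                          # level 0 (the root cube) first

def parent(cube_index):
    i, j, k = cube_index
    return (i // 2, j // 2, k // 2)              # F(d, j) in the notation above

pts = np.random.default_rng(3).uniform(size=(1000, 3))
levels = build_hierarchy(pts, depth=3)
d, finest = levels[-1]
print("level", d, "has", len(finest), "non-empty cubes;",
      "max points per cube =", max(map(len, finest.values())))
c = next(iter(finest))
print("cube", c, "at level", d, "has parent", parent(c), "at level", d - 1)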
The matrix that represents the conversion of the charge in a level-d cube i to a multipole expansion at the center of a level-d cube j is denoted Q2M(d, j, d, i). The matrix for the translation of a multipole expansion in a cube to a multipole expansion in another cube, M2M, the matrix for the conversion of a multipole expansion to a local expansion, M2L, the matrix for the translation of a local expansion to a local expansion, L2L, the matrix for the computation of a contribution to the potential from a local expansion, L2P, and the matrix for the computation of a contribution to the potential from the charge in a cube, Q2P, are all similarly specified.

ALGORITHM 2. The Fast-Multipole Algorithm.
THE CONSTRUCTION PHASE: Computes multipole expansions at the finest level.
For each level-D cube j = 1 to 8^D
  M_{D,j} = Q2M(D, j, D, j) q_{D,j}
THE UPWARD PASS: Computes multipole expansions.
For each level d = D − 1 to 2
  For each level-d cube j = 1 to 8^d
    M_{d,j} = Σ_{i ∈ C(d,j)} M2M(d, j, d + 1, i) M_{d+1,i}
THE INTERACTION PHASE: Converts multipole expansions to local expansions.
For each level d = 2 to D
  For each level-d cube j = 1 to 8^d
    L_{d,j} = Σ_{i ∈ I(d,j)} M2L(d, j, d, i) M_{d,i}
THE DOWNWARD PASS: Transfers and sums local expansions.
For each level d = 3 to D
  For each level-d cube j = 1 to 8^d
    L_{d,j} = L_{d,j} + L2L(d, j, d − 1, F(d, j)) L_{d−1,F(d,j)}
THE EVALUATION PHASE: Evaluates the potential.
For each level-D cube j = 1 to 8^D
  p_{D,j} = L2P(D, j) L_{D,j} + Σ_{i ∈ N(D,j)} Q2P(D, j, D, i) q_{D,i}

The above algorithm, which is summarized in Figure 3, is not adaptive, and would be inefficient for inhomogeneous distributions such as those associated with discretized surfaces in three space dimensions. A reasonably efficient adaptive-multipole algorithm can be derived simply by ignoring empty cubes, but more sophisticated approaches can be found in [4, 12].
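The sketch below is not the multilevel algorithm above; it is a single-level, monopole-only analogue that shows the near/far splitting the phases are built from: charges in each cube are lumped into one moment about the cube's centre (construction), non-neighboring cube pairs interact only through those moments (interaction/evaluation), and nearest neighbors are evaluated directly (the Q2P term). The 4 x 4 x 4 grid of cubes and the problem sizes are arbitrary choices for the example.

import numpy as np

rng = np.random.default_rng(4)
pts = rng.uniform(size=(1500, 3))
q = rng.standard_normal(1500)
n = 4                                            # a single 4 x 4 x 4 level of cubes
idx = np.minimum((pts * n).astype(int), n - 1)
cubes = sorted({tuple(c) for c in idx})
members = {c: np.flatnonzero(np.all(idx == c, axis=1)) for c in cubes}
moment = {c: q[members[c]].sum() for c in cubes}      # total charge of each cube
centre = {c: pts[members[c]].mean(axis=0) for c in cubes}

def neighbours(a, b):                            # cubes sharing at least a corner
    return max(abs(a[i] - b[i]) for i in range(3)) <= 1

phi = np.zeros(len(pts))
for a in cubes:
    ia = members[a]
    for b in cubes:
        ib = members[b]
        if neighbours(a, b):                     # near field: direct evaluation
            r = np.linalg.norm(pts[ia][:, None] - pts[ib][None], axis=-1)
            if a == b:
                np.fill_diagonal(r, np.inf)      # exclude self-interaction
            phi[ia] += (q[ib] / (4 * np.pi * r)).sum(axis=1)
        else:                                    # far field: one moment per cube
            r = np.linalg.norm(pts[ia] - centre[b], axis=1)
            phi[ia] += moment[b] / (4 * np.pi * r)

r = np.linalg.norm(pts[:, None] - pts[None], axis=-1)
np.fill_diagonal(r, np.inf)
phi_direct = (q / (4 * np.pi * r)).sum(axis=1)   # O(N^2) reference
print("relative 2-norm error:",
      np.linalg.norm(phi - phi_direct) / np.linalg.norm(phi_direct))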
4 Precorrected-FFT Acceleration
Another approach to computing distant interactions is to exploit the fact that the potential at evaluation points distant from a cube can be accurately computed by representing the given cube's charge distribution using a small number of weighted point charges. Moreover, if the point charges all lie on a
FIG. 3. Graphical summary of the fast-multipole algorithm steps.
uniform grid, then the FFT can be used to compute the potential at these grid points due to the grid charges. Specifically, one method for approximating Pq in order N log N operations has four steps: directly compute nearby interactions, project the panel charges onto a uniform grid of point charges, compute the grid potentials due to the grid charges using an FFT, and interpolate the grid potentials onto the panels. This process is summarized in Figure 4. The difficulty with the above four steps, as will be made clearer below, is that the calculations using the FFT on the grid do not accurately approximate the nearby interactions. This poor approximation must be subtracted from the result before an accurate direct calculation of nearby interactions can be substituted. This subtraction can be performed at almost no cost by modifying the way the nearby interactions are computed, a step we refer to as precorrection, which is described below.
4.1 Projecting onto a Grid
For panel charges contained within a given cube, the potentials at evaluation points distant from the given cube can be accurately computed by representing the given cube's charge distribution with a small number of appropriately weighted point charges on a uniform grid throughout the given cube's volume. For example, consider the cube embedded in the center of a 3 x 3 x 3 array of cubes, and assume that the potential will be evaluated at points exterior to the 27 cube array. Then, since the potential satisfies
FIG. 4. 2-D pictorial representation of the four steps of the precorrected-FFT algorithm. Interactions with nearby panels (in the grey area) are computed directly; interactions between distant panels are computed using the grid.
Laplace's equation, the error in the point charge approximation over the entire exterior can be minimized by minimizing the potential error on the surface of the cube array. The above observation suggests a scheme for computing the grid charges used to represent the charge in a given cube a. First, test points are selected on the surface of the 3 × 3 × 3 cube array which has cube a as its center. Then, potentials due to the grid charges are forced to match the potential due to the cube's actual charge distribution at the test points. Since such collocation equations are linear in the charge distribution, this projection operation, which generates a subset of the grid charges, denoted q_a^g, can be represented as a matrix, W_a, operating on a vector representing the panel charges in cube a, q_a. In particular, if there are G grid charges and A panels, then
where P_gt ∈ R^((G+1)×G) is the mapping between grid charges and test-point potentials and is given by
P_pt ∈ R^((G+1)×A) is the mapping between panel charges and test-point
FIG. 5. 2-D pictorial representation of the grid projection scheme. The black points represent the point charges being used to represent the triangular panel's charge density. The white points are the points where the potential due to the black point charges and the potential due to the triangular panel's charge density are forced to match.
potentials and is given by
Here x_i^t and x_j^g are the positions of the i-th test point and the j-th grid point. The rows of ones in (16) ensure that the sum of the grid charges is equal to the net charge in the cube. This is equivalent to matching the order-zero multipole expansion coefficient, which ensures that the error in the calculated potential decays to zero far away from the grid charge cell. The grid projection scheme is summarized in Figure 5. For an alternative approach, based more generally on matching multipole expansion coefficients, see [2].
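A small numerical sketch of the projection just described, with a handful of interior point charges standing in for a panel's charge distribution: grid charges at the eight cube vertices are fitted, by least squares over test points on the surrounding 3 x 3 x 3 shell plus one net-charge equation, so that they reproduce the panel's far field. The resulting weights play the role of the matrix W_a; all geometry and charge values are arbitrary.

import numpy as np

rng = np.random.default_rng(5)
grid = np.array([[i, j, k] for i in (0, 1) for j in (0, 1) for k in (0, 1)], float)  # cube vertices
panel_pts = 0.5 + 0.15 * rng.standard_normal((5, 3))     # stand-in for a panel's charge
panel_q = rng.standard_normal(5)

# Test points on the surface of the surrounding 3 x 3 x 3 cube array.
t = np.linspace(-1.0, 2.0, 5)
shell = np.array([[x, y, z] for x in t for y in t for z in t
                  if max(abs(x - 0.5), abs(y - 0.5), abs(z - 0.5)) > 1.49])

def pot(targets, sources, qs):
    r = np.linalg.norm(targets[:, None] - sources[None], axis=-1)
    return (qs / (4 * np.pi * r)).sum(axis=1)

A = 1.0 / (4 * np.pi * np.linalg.norm(shell[:, None] - grid[None], axis=-1))  # grid -> test points
b = pot(shell, panel_pts, panel_q)                                            # panel -> test points

# Append a row of ones so the grid charges also carry the correct net charge.
A_aug = np.vstack([A, np.ones(8)])
b_aug = np.append(b, panel_q.sum())
qg, *_ = np.linalg.lstsq(A_aug, b_aug, rcond=None)

far = np.array([[4.0, 4.0, 4.0]])
print("far-field potential, panel vs grid:",
      pot(far, panel_pts, panel_q), pot(far, grid, qg))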
4.2 Using the FFT
Once again, consider subdividing the cube containing the entire three-dimensional problem domain into a k × l × m array of small cubes. Then, the collocation approach above can be used to generate point charge approximations for the charge distributions in every cube, effectively projecting the charge density onto a three-dimensional grid. For example, if the representative point charges are placed at the cube vertices, then the resulting charge distribution will be projected to a (k + 1) × (l + 1) × (m + 1)
uniform grid. The fast-multipole algorithm also effectively creates a uniform grid by constructing multipole expansions at the center of each cube, but due to sharing, the point charge approach can be more efficient. For example, a point charge at a cube vertex is used to represent charge in the eight cubes which share that vertex. Once the charge has been projected to the grid, computing the potentials at the grid points due to the grid charges is a three-dimensional convolution. We denote this as

ψ_g(i, j, k) = Σ_{i′,j′,k′} h(i − i′, j − j′, k − k′) q_g(i′, j′, k′),   (19)
where i, j, k and i′, j′, k′ are triplets specifying the grid points, ψ_g is the vector of grid point potentials, q_g is the vector of grid point charges, and h(i − i′, j − j′, k − k′) is the inverse distance between grid points i, j, k and i′, j′, k′. As will be made clear below, h(0, 0, 0) can be arbitrarily defined, and is set to zero. The above convolution can be rapidly computed by using the FFT. Once the grid potentials have been computed, they can be interpolated to the panels in each cube using the transpose of W_a [3]. Therefore, projection, followed by convolution, followed by interpolation, can be represented as

ψ_FFT = W^T H W q,   (20)

where q is the vector of panel charges, ψ_FFT is an approximation to the panel potentials, W is the concatenation of the W_a's for each cube, and H is the matrix representing the convolution in (19).
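A sketch of the grid convolution in (19), using zero-padded FFTs so that the convolution is linear rather than circular; h(0, 0, 0) is set to zero as in the text, and the grid size and charges are arbitrary.

import numpy as np

def grid_potentials(qg, spacing=1.0):
    n = np.array(qg.shape)
    # Kernel h(i-i', j-j', k-k') sampled on all offsets -(n-1) .. (n-1).
    off = [np.arange(-(m - 1), m) for m in n]
    I, J, K = np.meshgrid(*off, indexing="ij")
    r = spacing * np.sqrt(I**2 + J**2 + K**2)
    h = np.zeros_like(r)
    h[r > 0] = 1.0 / (4 * np.pi * r[r > 0])          # h(0,0,0) = 0

    shape = 2 * n - 1                                 # zero-padded transform size
    H = np.fft.rfftn(np.fft.ifftshift(h), s=shape)
    Q = np.fft.rfftn(qg, s=shape)
    full = np.fft.irfftn(H * Q, s=shape)
    return full[: n[0], : n[1], : n[2]]               # the linear-convolution block

qg = np.random.default_rng(6).standard_normal((8, 8, 8))
psi_fft = grid_potentials(qg)

# Direct check at one grid point.
i0 = (3, 4, 2)
idx = np.indices(qg.shape).reshape(3, -1).T
r = np.linalg.norm(idx - np.array(i0), axis=1)
direct = np.sum(np.where(r > 0, qg.ravel() / (4 * np.pi * np.maximum(r, 1)), 0.0))
print(psi_fft[i0], direct)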
4.3 Precorrecting
In the ψ_FFT of (20), the portions of Pq associated with neighboring cube interactions have already been computed, though this close interaction has been poorly approximated in the projection/interpolation. Before computing a better approximation, it is necessary to remove the contribution of the inaccurate approximation. In particular, denote as P_ab the portion of P associated with the interaction between neighboring cubes a and b, denote the potential at grid points in cube a due to grid charges in cube b as H_ab, and denote ψ_a and q_b as the panel potentials and charges in cubes a and b respectively. Then

ψ_a ≈ ψ_a^FFT + (P_ab − W_a^T H_ab W_b) q_b   (21)
will be a much better approximation to ψ_a. We then define

P̃_ab = P_ab − W_a^T H_ab W_b   (22)
to be the precorrected direct interaction operator. When used in conjunction with the grid charge representation, P̃_ab results in exact calculation of the interactions of panels which are close. Assuming that the Pq product will be computed many times in the inner loop of an iterative algorithm, P̃_ab will be more expensive to compute initially, but will cost no more to apply subsequently than P_ab.
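The algebra of the precorrection can be shown with small random stand-ins for P_ab, W_a, W_b and H_ab (none of the values are meaningful; only the identity being checked is): once P̃_ab has been formed, adding the grid contribution to P̃_ab q_b reproduces the exact near-field product P_ab q_b.

import numpy as np

rng = np.random.default_rng(7)
n_panels, n_grid = 6, 8
P_ab = rng.standard_normal((n_panels, n_panels))   # exact near interaction (cube a <- cube b)
W_a = rng.standard_normal((n_grid, n_panels))      # projection of cube-a panels onto its grid
W_b = rng.standard_normal((n_grid, n_panels))      # projection of cube-b panels onto its grid
H_ab = rng.standard_normal((n_grid, n_grid))       # grid-to-grid (inaccurate) near interaction

P_tilde = P_ab - W_a.T @ H_ab @ W_b                # precorrected operator, formed once

q_b = rng.standard_normal(n_panels)
grid_part = W_a.T @ (H_ab @ (W_b @ q_b))           # what the global grid/FFT pass contributes
print(np.allclose(grid_part + P_tilde @ q_b, P_ab @ q_b))   # True: the near field is exact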
4.4 Combining Steps
Combining the above steps leads to the precorrected-FFT algorithm, which rapidly computes the Pq dense matrix-vector product. Using the above notation, the algorithm can be described as two steps. The first step is
using the FFT to sparsify H. Then, the second step is to compute, for each cube a,

ψ_a = ψ_a^FFT + Σ_{b neighboring a} P̃_ab q_b.   (24)
5 Experimental Results
In this section, results from computational experiments in solving (3) are presented. These experiments were conducted using FASTCAP [13], a three-dimensional electrostatic analysis program which uses an implementation of a preconditioned GMRES algorithm with both adaptive-multipole acceleration and precorrected-FFT acceleration. The program uses the piecewise-constant panel expansion functions given in (11), and a panel centroid collocation scheme. First, idealized examples of potential flow problems are examined to allow a controlled investigation of aspects of the multipole-accelerated algorithm. Then simple plate examples are used to indicate the performance of the precorrected-FFT algorithm compared to the adaptive-multipole algorithm. Finally, we compare the two methods on several realistic engineering examples. Note, for the comparisons between the adaptive-multipole and precorrected-FFT methods, the adaptive-multipole algorithm was used with second-order multipole expansions, and the FFT method used 27 grid points and precorrection for nearest-neighbor cubes only. This yields similar accuracies from the two programs.
5.1 Spheres in Potential Flow
The potential due to a unit sphere in an infinite fluid translating at unit velocity along the x-axis is given by
FIG. 6. The single sphere discretized by 2592 panels translating in an infinite fluid. The shading corresponds to the density strength σ(x). The dark shading at the left pole of the sphere is a plotting artifact.
The charge density σ satisfying (3) for the potential in (25) can be determined analytically, and is given by
Figure 6 presents a shaded sphere, where the shading corresponds to the charge density given in (26). Having the solution for the translating sphere problem in closed form allows us to demonstrate the convergence of the complete algorithm. Figure 7 shows that the convergence rate will approach 1/N if the tolerance on the convergence of the iterative method is small enough, and the order to which terms are retained in the spherical harmonic expansions is high enough. The definition of the integrated error is the following summation over the N panels:
where a_i is the area of panel i, σ̂_i is the computed solution for panel i, and σ(x_i) is (26) evaluated at the centroid of panel i. The results in the figure using lower-order expansions and a larger tolerance for convergence of the iterative method show how these parameters limit the accuracy of the computed solution given a particular spatial discretization.
FIG. 7. Convergence study for the single sphere translating in an infinite fluid. "Tolerance" refers to the convergence criterion for the iterative method, and "order" refers to the highest term retained in the spherical harmonic expansions.
FIG. 8. The two-sphere case, each discretized by 2592 panels. Here, a fictitious potential is applied. The sphere centers are three radii apart.
FIG. 9. Operation counts for computing the iterates for the fictitious Dirichlet problem of two spheres, where the discretization is refined. "Direct" refers to the standard (order N²) application of the matrix to a vector; "Order l" refers to the adaptive multipole algorithm with expansions to order l.
FIG. 10. Stacked plate example.
To investigate the advantages of using the multipole acceleration, a more complicated case of two spheres, shown in Figure 8, is considered. In this case, we do not have a closed form solution, so a Dirichlet problem is contrived by applying the known potential of the single sphere case to each of the two spheres. The difference in the cost of computing Pq directly and with the multipole algorithm is shown in Figure 9. Here the operation count is the number of multiply-add operations required to compute Pq once, assuming that the entries in P and the multipole translation matrices have been precomputed. As the graph in Figure 9 clearly demonstrates, the cost of computing Pq with the adaptive multipole algorithm increases linearly with the number of panels, and for the case where second-order expansions are used, is faster than direct computation with as few as 500 panels.
5.2 Plate Examples
The adaptive-multipole algorithm has order N cost regardless of the distribution of panels on the discretized surfaces [12], but this is not true of the precorrected-FFT method. The use of the FFT implies that the algorithm has order N log N cost only for fairly homogeneous distributions of panels. An example where this is certain to be the case is one which contains stacked plates, as in Figure 10.
TABLE 1. Performance figures for the stacked plate example. † CPU time in seconds on a DEC 3000AXP/500.

Method      Matrix-Vector†   Memory (Mb)
Multipole        3.37           130.5
FFT              1.26            38.5
FIG. 11. Parallel plate example.
To show that the precorrected-FFT method has an advantage over the fast-multipole algorithm for the homogeneous case, consider the results in Table 1. For this example, each plate in a stack of twenty was discretized into a 20 x 20 array of panels, for a total of 8000 panels. As Table 1 shows, the precorrected-FFT method uses one-third the CPU time and one-third the memory of the fast-multipole algorithm. A simple approach to generating an example which is inhomogeneous is to separate a pair of parallel plates, as in Figure 11. The relative CPU time for adaptive-multipole and precorrected-FFT algorithms as a function of plate separation is given in Table 2. As is clear from the table, for very inhomogeneous problems, the adaptive-multipole algorithm is much faster.
TABLE 2. Comparison of adaptive-multipole and precorrected-FFT algorithms for two parallel plates, 31 × 31 panels. † DEC 3000AXP/500 seconds.

CPU time† for plates separated by N panel widths
N             1     15     31     64    128
Multipole   12.6    9.7   11.6    8.0    8.6
FFT          8.2    7.7   11.0   37     73
5.3 Comparison Examples
In this section we present results comparing adaptive-multipole to precorrected-FFT acceleration for example problems in integrated circuit interconnect and packaging analysis. These examples are not particularly inhomogeneous, and as might be expected, the results in Figure 12 indicate that the precorrected-FFT approach can be as much as 40% faster and can use as little as one-fourth the memory of the fast-multipole algorithm. We have also presented a figure of merit defined as the product of the memory and speed advantages of the precorrected-FFT method. It is important to consider this figure because it is often possible to trade memory for speed by changing the size of the grid used in the precorrected-FFT method (this is true of the via structure, for example). In terms of the speed-memory product, the precorrected-FFT method is superior to the adaptive-multipole algorithm for all the examples shown, in some cases by more than a factor of six. Note that for the homogeneous stacked-plate example, the precorrected-FFT advantage in this metric is nearly an order of magnitude.
6 Conclusions and Acknowledgments
In this overview, we briefly described the fast-multipole, adaptive-multipole, and precorrected-FFT algorithms for the accelerated solution of surface integral formulations of the three-dimensional Laplace equation. Experimental results were presented to show that these methods are effective for realistic problems; they provide substantial computation time and memory reductions compared to standard order N² approaches. Comparisons between the two acceleration methods were given, not to demonstrate the superiority of one over the other, but to show the essential role of problem inhomogeneity in algorithm performance. These issues deserve additional study. The authors wish to thank Len Herman (IBM Yorktown Research Center) and Leslie Greengard (Courant Institute) for their many valuable suggestions on this subject.
References
[1] K. Atkinson, A survey of boundary integral equation methods for the numerical solution of Laplace's equation in three dimensions, in Numerical Solution of Integral Equations, M. Golberg, ed., Plenum Publishing, New York, 1990, pp. 1-34.
[2] L. Herman, Grid-multipole calculations, Tech. Rep. RC 19068(83210), IBM Research Report, 1993.
[3] A. Brandt, Multilevel computations of integral transforms and particle interactions with oscillatory kernels, Computer Physics Communications, (1991), pp. 24-38.
[4] J. Carrier, L. Greengard, and V. Rokhlin, A fast adaptive multipole algorithm for particle simulations, SIAM Journal on Scientific and Statistical Computing, 9 (1988), pp. 669-686.
[5] A. Greenbaum, L. Greengard, and G. B. McFadden, Laplace's equation and the Dirichlet-Neumann map in multiply connected domains, Tech. Rep., Courant Institute, New York, 1991.
[6] L. Greengard, The Rapid Evaluation of Potential Fields in Particle Systems, MIT Press, Cambridge, Massachusetts, 1988.
[7] L. Greengard and V. Rokhlin, A fast algorithm for particle simulations, Journal of Computational Physics, 73 (1987), pp. 325-348.
[8] J. L. Hess and A. M. O. Smith, Calculation of potential flow about arbitrary bodies, Progress in Aeronautical Science, 8 (1966), pp. 1-138.
[9] E. W. Hobson, The Theory of Spherical and Ellipsoidal Harmonics, Chelsea, New York, 1955.
[10] R. Kress, Linear Integral Equations, Springer-Verlag, Berlin, 1989.
[11] K. Nabors, S. Kim, and J. White, Fast capacitance extraction of general three-dimensional structures, IEEE Transactions on Microwave Theory and Techniques, 40 (1992), pp. 1496-1507.
[12] K. Nabors, F. T. Korsmeyer, F. T. Leighton, and J. White, Preconditioned, adaptive, multipole-accelerated iterative methods for three-dimensional first-kind integral equations of potential theory, SIAM Journal on Scientific Computing, 15 (1994), pp. 713-735.
[13] K. Nabors and J. White, FASTCAP: A multipole accelerated 3-D capacitance extraction program, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 10 (1991), pp. 1447-1459.
[14] J. N. Newman, Distributions of sources and normal dipoles over a quadrilateral panel, Journal of Engineering Mathematics, (1986), pp. 113-126.
[15] V. Rokhlin, Rapid solution of integral equations of classical potential theory, Journal of Computational Physics, 60 (1985), pp. 187-207.
[16] A. E. Ruehli and P. A. Brennan, Efficient capacitance calculations for three-dimensional multiconductor systems, IEEE Transactions on Microwave Theory and Techniques, 21 (1973), pp. 76-82.
[17] Y. Saad and M. H. Schultz, GMRES: A generalized minimal residual algorithm for solving nonsymmetric linear systems, SIAM Journal on Scientific and Statistical Computing, 7 (1986), pp. 856-869.
[18] S. D. Senturia, R. M. Harris, B. P. Johnson, S. Kim, K. Nabors, M. A. Shulman, and J. K. White, A computer-aided design system for microelectromechanical systems (MEMCAD), IEEE Journal of Microelectromechanical Systems, 1 (1992), pp. 3-13.
[19] S. A. Vavasis, Preconditioning for boundary integral equations, SIAM Journal on Matrix Analysis and Applications, 13 (1992), pp. 905-925.
FIG. 12. Comparison of computational cost for the precorrected-FFT and adaptive-multipole analyses of various examples. Actual discretizations are finer than shown in the figures. † Ratio of FFT CPU time to multipole CPU time. ‡ Ratio of FFT memory use to multipole memory use. Numbers less than one imply an advantage for the FFT method.

Example             Panels   CPU†   Mem.‡   Product
micromotor            2346   0.71    0.81      0.58
49x49 panel cube     14406   0.35    1.14      0.31
5x5 woven bus         9360   0.34    0.82      0.42
3x3 bus crossing     11520   0.26    0.59      0.15
via structure         6126   2.26    0.37      0.84
RAM cell              4881   0.88    0.73      0.64
Chapter 13
Linear Scaling Algorithms for Large Scale Electronic Structure Calculations
E. B. Stechel†
Abstract Traditional local density functional calculations were considered to have good scaling behavior in that the computational effort and memory requirements scaled only as the cube and the square of the size of the system, respectively. While this scaling, for first principles electronic structure calculations, is relatively good, it is now apparent that it is possible to develop algorithms that exhibit the ultimate scaling limit, whereby both the computational effort and the required memory scale linearly with the size of the system. In this chapter we give a brief overview and our perspective of recent attempts. We focus on localized orbital approaches and how to "re-use" the solutions as "building blocks" for large scale calculations. We give some details and results from methods that solve for non-orthogonal, localized, occupied orbitals.
1 Introduction
Meaningful simulation of the microscopic behavior of condensed matter and molecular systems requires a reliable description of interatomic forces, by which we mean solutions to the problem of electrons moving in the field of fixed charged nuclei. Each set of nuclear positions generates a new fixed external potential within which the electrons move. The function of the electronic distribution is largely to provide a potential field in which the nuclei will execute vibrations in stable molecules and condensed matter systems or will evolve in time as in a molecular dynamics simulation. Density-functional theory (DFT) [1] provides a rigorous theoretical foundation for determining the ground-state properties of a many-electron system as a functional of the total electron density of the system in a fixed external potential (i.e., the positions and charges of the nuclei). The local-density approximation (LDA) [2] approximates the non-local exchange-correlation
*Work supported under DOE Contract No. DE-AC04-76-DP00789.
†Sandia National Laboratories, Advanced Materials Physics, Albuquerque, NM 87185.
functional (which is typically unknown) as a local function of the total electron density. This approximation is rigorous for a system with a homogeneous electron density such as jellium, the quintessential metal. In contrast, Hartree-Fock (HF) theory (see for example Ref. [3]) includes no effects of correlation, but treats exchange exactly within the independent electron approximation. Both LDA and HF share the property that the ground state is expressed in terms of uncorrelated electrons. The proper treatment of exchange, however, makes HF calculations more computationally expensive than LDA. However, because of the absence of correlation, accurate calculations typically use HF as a starting point for even more difficult calculations that do include some correlation. While much of what is in this chapter is directly applicable to HF calculations, we focus on LDA. This is because, despite the apparent simplicity of LDA, more than 20 years of experience indicates that it can accurately predict structure and properties of many classes of materials, including molecules, crystals, interatomic compounds, and surfaces. For example, lattice constants are typically within 1% of experiment and bulk moduli are within 10%. Furthermore, gradient corrections [4-6] that introduce some (but not an inordinate amount of) additional work apparently provide for a more accurate treatment of non-homogeneous electron densities. Hence, while LDA is not the most rigorous or currently the most accurate method for the solution of electronic structure problems, it is the most promising for large systems. Its solution is that of independent electrons, with all the many-body physics buried in the definition of the potentials that depend only on the total electron charge density (and sometimes gradients). Consequently, LDA includes exchange and correlation (at least approximately), but does not contain two-electron integrals in the formulation. Thus LDA represents a general framework for the study of important material-science issues, e.g., crystal growth, defect properties, catalytic chemistry, band-gap tailoring, drug design, materials degradation and failure, etc. Since the seminal paper of Car and Parrinello [7], much effort (see for example Refs. [8,9]) has been devoted to making LDA calculations computationally more efficient. However, until very recently, the computational effort of all LDA algorithms scaled asymptotically with the cube (N³) of the number of electrons in the system, and the memory requirements with the square (N²). For periodic systems in which the calculations require a number of k-points, N_k, in the Brillouin zone, the scaling is N_k N³, where N is the number of electrons in the unit cell. However, as the size of the system increases, the number of k-points needed decreases roughly inversely with N. Hence, until the number of k-points is one, the scaling is roughly N² and only approaches N³ asymptotically. Similarly, the scaling of the memory is roughly linear for small N and approaches quadratic only as the number of necessary k-points becomes one. Invariably the scaling derives from the solution of a Schroedinger-like equation for the lowest N/2 stationary states
that form a set of delocalized orthogonal functions. Each stationary state requires order N parameters to describe it (hence the N² memory scaling). The computation requires an additional power of N because a matrix of size order N is diagonalized (an N³ operation) or, in iterative solutions, because orthogonality between different solutions must be maintained. However, inasmuch as electronic structure is embodied in the electronic charge density, which is a local property of a system [10], solutions in terms of global stationary states are at odds with the physics of the problem. Using the locality principle, papers by Yang [11] and by Baroni and Giannozzi [12] showed that linear scaling (N) is, in principle, attainable in LDA. Following these papers a number of new approaches to linear scaling and related algorithms have been suggested [13-21]. As a result of these recent attempts and proofs of principle [11-21], it appears certain that LDA algorithms for ground-state electronic structure calculations can scale optimally, i.e., linearly or almost linearly (N log N) with the number of electrons. However, to achieve this, it is necessary to give up solutions in terms of delocalized, orthogonal stationary states.
2 Density Functional Theory: Mathematical Elements
In density functional theory (DFT) the basic independent element is the density operator, ρ, which is a sum over the outer products of the occupied functions {ψ_j}:

ρ = Σ_{j=1}^{N/2} |ψ_j⟩⟨ψ_j|.   (1)

The diagonal element of ρ in a spatial representation relates to the electronic charge density, ρ(r) = 2⟨r|ρ|r⟩. For a paramagnetic ground state with N electrons, the occupied functions are the N/2 lowest energy eigenfunctions (stationary states) of the one-electron, self-consistent Kohn-Sham equation [2]:

[−(1/2)∇² + V_Coulomb(r) + V_xc(r) + V_NL] ψ_j = ε_j ψ_j,   (2)
where V_Coulomb is the electrostatic potential from the fixed positively charged nuclei and the self-consistent potential from the electrons with charge density ρ(r); V_Coulomb is the solution of Poisson's equation. V_xc is the approximated exchange-correlation potential (a local function of ρ(r) in LDA). V_NL, if used, is the non-local (but short-range) pseudo-potential that eliminates core electrons from the calculation. As mentioned above, the solution of this Schroedinger-like equation for the {ψ_j} and {ε_j}, whether achieved by direct or iterative diagonalization, is the source of the "N³" scaling. In the iterative methods this scaling arises from orthogonality constraints [8,9]. Since the potentials V_Coulomb and V_xc depend on the solutions and the solutions depend on the potentials, this is a problem in
self-consistency. Hence, to solve, one typically 1) guesses an input potential, 2) solves for the {ψ_j}, 3) solves for the output potential, and 4) continues until the input and output potentials are the same. If the {ψ_j} are also solved for iteratively, the two iterative cycles can be combined [7-9]. A key principle in DFT is that the total energy of the ground state of a many-body electron system is a functional of the charge density. The ground-state charge density is a global minimum of that functional. LDA approximates the V_xc part of the functional and solves for the global minimum of that, i.e., for the global minimum of
Note that this is not twice the sum of the lowest N/2 eigenvalues because of the extra factor of 1/2 in front of the self-consistent potentials to account for double counting of electrons.
Twice the sum of the eigenvalues is commonly referred to as the band-structure energy. The only eigenvalue that has any physical meaning is the highest occupied (the corresponding stationary state is sometimes referred to as the HOMO, the highest occupied molecular orbital). In DFT there is no analog of Koopmans' theorem as in HF [3] except for the highest eigenvalue. Solving directly for the charge density that minimizes (3b) is complicated by the constraint that the density operator is a projection operator. As can be seen in (1), the density operator projects onto the occupied (lowest N/2) subspace of H_KS. Consequently it must be idempotent, i.e., ρ² = ρ. Possibly the most important property of the density operator is its known locality: ⟨r|ρ|r′⟩ → 0 as |r − r′| → ∞, exponentially fast in insulators [22,23]. In metals, at zero temperature the falloff is only algebraic (very slow). However, it is not hard to show that at finite temperature, even for metals, the falloff goes exponentially with the inverse of the temperature. If not for this locality property, linear scaling might not be feasible. This property states that the density operator (the complete solution) contains only a linear amount of information. However, when defined in terms of the stationary states, it is defined in terms of a quadratic amount of information. Some of that information is thus redundant. We emphasize that no physical quantities, in the density functional formalism, depend on individual occupied functions but only on a trace over the occupied subspace. The trace operation is invariant to all similarity transformations. Hence, any set of {φ_l} such that

|φ_l⟩ = Σ_j T_{lj} |ψ_j⟩   (4)
can equally well represent the occupied subspace, provided the inverse of T exists, i.e., the {φ_l} are linearly independent. The {φ_l} are generally non-orthogonal, with an overlap matrix S = TT†. Thus only a unitary T implies an orthogonal set of {φ_l}. Orthogonal or not, the density operator can always be written as

ρ = Σ_l |φ_l⟩⟨φ̃_l| = Σ_{l,m} |φ_l⟩ D_{lm} ⟨φ_m|,   (5)
where the |φ̃_l⟩ = Σ_m |φ_m⟩ D_{ml} are the bi-orthogonal complement basis, ⟨φ̃_l|φ_m⟩ = δ_{lm}. Formally, the matrix D is the inverse of the overlap matrix S, and it is the matrix representation of the density operator in the {φ_l} basis. We will return to this below. With this mathematical foundation of DFT we now give a brief overview of linear scaling algorithms in the literature. Each solves for a representation of the density operator without calculating the stationary states of (2).
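A small numerical check of (4)-(5), using a random symmetric matrix as a stand-in Hamiltonian: any invertible mixing T of the occupied stationary states yields non-orthogonal orbitals whose density operator, built with D = S⁻¹, is the same idempotent projector onto the occupied subspace.

import numpy as np

rng = np.random.default_rng(8)
nbasis, nocc = 12, 4
H = rng.standard_normal((nbasis, nbasis))
H = 0.5 * (H + H.T)                         # stand-in symmetric Hamiltonian matrix
eps, psi = np.linalg.eigh(H)
psi_occ = psi[:, :nocc]                     # the occupied stationary states

T = rng.standard_normal((nocc, nocc))       # any invertible mixing, cf. (4)
phi = psi_occ @ T                           # non-orthogonal occupied orbitals
S = phi.T @ phi                             # their overlap matrix
D = np.linalg.inv(S)
rho = phi @ D @ phi.T                       # density operator in the {phi} representation

rho_ref = psi_occ @ psi_occ.T               # projector built from the stationary states
print(np.allclose(rho, rho_ref),            # the same operator
      np.allclose(rho @ rho, rho),          # idempotent
      np.isclose(np.trace(rho), nocc))      # traces to the number of occupied orbitals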
3 Overview of Linear Scaling
We distinguish between algorithms, proposed in the 90's, which have the focused goal of achieving linear scaling in accurate first principles calculations, and methods proposed in the 60's and 70's that are in many ways very similar [24-28]. The goals of the early methods were, however, more to justify semi-empirical calculations and to provide fast but approximate methods. For a brief review of chemical pseudopotentials, see Ref. [28]. Yang's [11] algorithm relies on the principle of "divide and conquer" and would fit well into the domain decomposition paradigm. Yang solves the Kohn-Sham equations (2) in overlapping small regions of the system using a basis set of atomic orbitals (LCAO representation), thereby solving for local rather than global stationary functions. If each region has M < N atoms and there are αN/M such regions (α > 1 accounts for the overlap and scales roughly as the surface area of the region, or M^(2/3) in three dimensions), the computational effort then scales as αNM² or NM^(8/3) (the number of regions times a diagonalization of order M³ for each region). The memory requirements are down from the CPU by a factor of M. Accuracy is improved by increasing both M and α. However, the variational principle is not rigorously preserved; i.e., the total energy is not necessarily an upper bound. Global communication in the system occurs only through the definition of a common Fermi energy (the energy that separates the occupied from the unoccupied states). Yang [11] has some very impressive results on a large number of molecules. Baroni and Giannozzi's [12] method represents the Hamiltonian on a real-space grid and solves directly for the charge density, using a Lanczos algorithm [29] to determine ⟨r|G(z)|r⟩, where G(z) is the Green's function
for (2). This would also fall into the paradigm of domain decomposition, since the solution at any point is independent of the solution at any other point given a fixed potential. Contour integration on z determines the charge density and band-structure energy. The Hamiltonian is very sparse in this representation (provided a local approximation to the kinetic energy is used). The sparseness of the representation and the Lanczos algorithm allow convergence of each with work that is independent of the size of the system. The Lanczos algorithm [29], or recursion [10], is a natural way to achieve linear scaling, in that it involves evaluating the repeated action of a sparse matrix on a localized vector. In a Lanczos algorithm linear scaling results naturally only with an orthogonal representation, making a real-space grid convenient. We note, however, that some proposals utilize non-orthogonal tight-binding representations [30] without destroying the favorable scaling. In both the "divide and conquer" and recursion methods explicit knowledge of the occupied and unoccupied subspaces is obtained. In particular, in Baroni and Giannozzi's approach [12], the orthogonal representation is a real-space grid, and because there are so many more grid points than occupied states, most of the information acquired concerns the unoccupied states. Since all information about the total energy and the force is embodied in the occupied electronic states, it seems likely that excess computational effort is being expended. A third type of approach has been proposed by Li, et al. [15] and by Daw [16], based on solving directly for the density matrix, ρ, in a sparse, localized, orthogonal, fixed representation such as in tight-binding (TB) calculations. (TB is a terminology for a fixed (non-self-consistent) matrix representation of the electronic structure problem.) The Hamiltonian matrix elements are either determined semi-empirically or fit to accurate self-consistent calculations. The representation may or may not be orthogonal. If it is non-orthogonal, the matrix elements of the overlap matrix become additional fitting parameters. The fits usually include first and second neighbor and occasionally third neighbor interactions. In the density matrix approach the idempotency constraint, ρ² = ρ, is replaced by a transformation that constrains the eigenvalues of the density matrix to fall between 0 and 1. The consequence is that the variational principle is preserved, i.e., the total energy is a rigorous upper bound, provided the zero of the energy is adjusted so all (non-)occupied states have (positive) negative energy. It can be shown that the minimum of the energy occurs only when the eigenvalues of the density matrix are 0 (for unoccupied) or 1 (for occupied) and the eigenfunctions corresponding to the non-zero eigenvalues are simultaneously eigenfunctions of the Hamiltonian with negative eigenvalues (up to a unitary transformation). Again explicit knowledge of both the occupied and unoccupied subspaces is obtained, implying that the method's prefactor is likely to be unnecessarily large. This is an iterative global algorithm which
is to say that all the parameters (density matrix elements) are updated simultaneously. Linear scaling is obtained because the representation of the density matrix is sparse (hence order N non-zero elements) and the algorithm involves matrix multiplications of only sparse matrices, and thus scales as NM², where M is independent of the size of the system and is equal to the number of non-zero elements in a row or column of the density matrix. Galli and Parrinello [13] introduced a localized orbital formulation of the Kohn-Sham equations that, in principle, solves for orbitals that are fully occupied. No information about unoccupied states is determined. This formulation is closely related to the chemical pseudo-potentials and chemically invariant orbitals that were proposed by Adams [24], Gilbert [25] and Anderson [26,28] and extensively used by Bullett [27]. This point was not recognized in the Galli and Parrinello [13] paper. The density operator was written as in (5) above and the matrix D was approximated as a Taylor series expansion of S⁻¹. Recent variational implementations of the localized orbital formulation by Mauri, Galli and Car [17] (in TB) and by Ordejon, et al. [18] (in a non-orthogonal minimal atomic orbital basis and no self-consistency) achieve minimization of the total energy only for orbitals that are orthogonal or, in other words, for generalized Wannier orbitals [22,23,31]. Constraining the occupied orbitals to be localized makes this class of algorithms scale linearly, roughly as NMm, where M is the number of basis functions per orbital and m is the number of orbitals that physically interact. References [17,18] achieve orthogonality and energy minimization simultaneously, without imposing additional constraints. These iterative localized orbital algorithms, like the density matrix algorithms, are global; i.e., all orbitals are updated at once, and thus these methods do not naturally fall into the paradigm of domain decomposition. Kohn [19] has also given a prescription for determining generalized Wannier orbitals with linear scaling. Another approach, related to the generalized Wannier orbitals and to the chemical pseudo-potentials, explicitly solves for non-orthogonal localized occupied orbitals. No explicit knowledge of the unoccupied states is required or obtained. It can be solved either in a state-by-state algorithm [20] (domain decomposition) or in a global algorithm [21], much like that for the generalized Wannier orbitals of Mauri, Galli and Car [17] and Ordejon, et al. [18]. The localized orbital approaches (orthogonal or not) could be considered algorithms in search of the ultimate minimal basis set (a concept introduced by Frost [32] and, in a different context, by Freed and coworkers [33]): N orbitals to represent N states. The focus of this current chapter is on this last approach, solving for N localized, non-orthogonal occupied orbitals.
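As a sketch of the kind of transformation referred to above for the density-matrix approach, the McWeeny purification ρ → 3ρ² − 2ρ³ (the map underlying schemes of the Li et al. type, shown here as a simple iteration rather than the variational minimization those papers describe) drives eigenvalues lying between 0 and 1 toward 0 or 1 using only matrix products, which remain sparse-times-sparse products in an actual linear-scaling code. The trial matrix below is an arbitrary perturbed projector.

import numpy as np

rng = np.random.default_rng(9)
n, nocc = 12, 4
C = np.linalg.qr(rng.standard_normal((n, nocc)))[0]     # an exact rank-nocc projector ...
B = rng.standard_normal((n, n))
rho = C @ C.T + 0.02 * (B + B.T)                        # ... plus a symmetric perturbation

for it in range(6):
    rho = 3 * rho @ rho - 2 * rho @ rho @ rho           # only matrix products are needed
    print(f"iteration {it}: ||rho^2 - rho|| = {np.linalg.norm(rho @ rho - rho):.1e},",
          f"trace = {np.trace(rho):.3f}")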
4 Localized versus Delocalized Solutions
4.1 Qualitative Discussion
Stationary states will transform as irreducible representations of the total symmetry group. In contrast, in real space methods, such as localized orbitals, the solutions transform as reducible representations, in which a symmetry operation of the group, whether it be translation, reflection, rotation, etc., will transform one real-space orbital into another. As an example consider the methane molecule, CH4. Carbon has six electrons and each hydrogen has one. Thus methane has 10 electrons and hence five occupied states. Each state can be doubly occupied, with a spin up and a spin down electron. One of the occupied levels is the 1s core level of carbon. For the purposes here it is sufficient to describe the electronic structure of this molecule as the occupied core atomic state and four occupied valence states, described as linear combinations of 2s and 2p orbitals on carbon and a 1s orbital on each hydrogen. These atomic orbitals are not necessarily identical to those of the free atom, but in the most accurate representation would be chemically modified by the presence of other atoms. An even more accurate treatment would include p orbitals on hydrogen and d orbitals on carbon. By symmetry (all four hydrogens are equivalent) the four stationary valence states can be written as follows:
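(The displayed combinations are not reproduced in this copy; the following is a standard reconstruction consistent with the surrounding description, with h_1, ..., h_4 denoting the hydrogen 1s orbitals and the parameter names being placeholders.)

    \psi_A = \mu_A\,|2s\rangle + \tfrac{\sigma_A}{2}\big(|h_1\rangle + |h_2\rangle + |h_3\rangle + |h_4\rangle\big),
    \qquad
    \psi_{T,x} = \mu_T\,|2p_x\rangle + \tfrac{\sigma_T}{2}\big(|h_1\rangle + |h_2\rangle - |h_3\rangle - |h_4\rangle\big),

with the \psi_{T,y} and \psi_{T,z} combinations obtained from the corresponding sign patterns of the hydrogen positions.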
where the positions of C, H1, H2, H3 and H4 are at (0,0,0), (1,1,1), (1,−1,−1), (−1,1,−1) and (−1,−1,1), respectively. These positions are in units of r0/√3, where r0 is the CH bond length. Two normalization conditions determine two of the four parameters, and energy minimization determines the other two. Chemically, however, one thinks of this molecule as four equivalent CH bonds. No one bond has the symmetry of the molecule; the symmetry operations transform one bond into another. But where is the bond orbital in the above enumerated stationary states? The bond orbitals are defined to be the linear combinations of the four occupied stationary states (and therefore are not, themselves, stationary states) that most localize along one bond. For this system we can completely eliminate any weight on the three hydrogens not participating in the bond. That is, we can write four equivalent, localized occupied orbitals as follows:
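(The displayed orbitals are not reproduced in this copy; a standard reconstruction consistent with the description that follows is)

    \phi_b = \alpha_L\,|2s\rangle + \beta_L\,|2p_b\rangle + \gamma_L\,|1s_{H_b}\rangle, \qquad b = 1, \dots, 4,

where 2p_b denotes the carbon p orbital directed along bond b.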
The subscript on the p orbitals indicates the direction. (It points directly at the hydrogen atom in the corresponding bond.) A set of three parameters {α_L, β_L, γ_L} determines these orbitals. Normalization arbitrarily determines one parameter; energy minimization determines the others. We should point out that these orbitals are necessarily non-orthogonal, S_ij = ⟨φ_i|φ_j⟩ ≠ δ_ij, and non-stationary, H|φ_j⟩ ≠ ε_j|φ_j⟩. However, because the set of four bond orbitals is linearly independent, the four "bond" orbitals span the same Hilbert space as the occupied stationary states. The band-structure minimum energy is Tr{S⁻¹H}. The relationship between the two sets of parameters is:
Consequently, the stationary state and the bond orbital representations are equivalent, are equally valid and carry the same information content. For this simple system, the same number of independent parameters describes both sets. For this molecule, because it is as small as it is, there is no obvious advantage to either solution. We presented it here to illustrate the difference in the types of solution. Figure 1 shows the hydrocarbon molecule decane. Figure 2 shows calculated bond orbitals for decane and for a smaller hydrocarbon, heptane [21]. This is a case in which the "bond" orbitals have an obvious advantage over the stationary states. The "bond" orbitals localize over a relatively small fraction of the total system. For example, the total information content in decane (C10H22, with 62 valence electrons, 62 atomic orbitals in the basis and 31 occupied states) requires approximately 930 parameters to describe the electronic system in terms of delocalized eigenstates. This is after taking account of symmetry and normalization. In contrast, the bond orbital description has three chemically inequivalent CC bonds and four chemically inequivalent CH bonds. (Within Fig. 2 are contour plots of the five symmetry-inequivalent CC bond orbitals.) Each bond orbital is described by 8-15 parameters, depending on the orbital. The total parameter set has fewer than 105 parameters. More importantly, this same set of parameters can describe CnH2n+2 for all n > 7. Figure 2 also shows a comparison of converged orbitals for heptane, n = 7, and those for decane, n = 10. Quantitatively, the same set of "bond" orbitals can describe heptane and decane with a precision error of 10⁻² meV and an accuracy (within LDA and the chosen basis) of 2 meV. Section 4.3 below describes these calculations. The chosen range for the "bond" orbitals determines the intrinsic accuracy limit. (In this calculation the range includes the nearest neighbors of the bond being formed.)
FIG. 1. The decane molecule is a hydrocarbon chain with 10 carbon atoms and 22 hydrogen atoms, C10H22. The big balls are carbon, the small ones are hydrogen. All the carbon atoms lie in a plane.
Thus, when chemically invariant subunits exist on length scales smaller than the system size (or the unit cell for periodic systems), a formulation of the electronic structure in real space localized terms can be very advantageous. Naturally, this approach relates well to chemical thinking. The benefit of a real space approach to condensed matter systems has been emphasized in the works of Haydock, Heine and Kelly beginning in the 1970's [10]. The emphasis in those works, however, is on obtaining local densities of states of atomic valence orbitals within non-self-consistent tight-binding representations. In contrast, the emphasis here is on chemically invariant subunits that are smaller than the size of the system and on solving for bond orbitals that depend only on the local chemical environment, not on the global system, and on doing this within an accurate, self-consistent formulation. The two advantages in changing the representation of the electronic structure of a system from delocalized to localized are: 1) by determining localized orbitals directly (i.e., not first determining delocalized eigenfunctions) we compute the electronic structure with an algorithm that scales linearly with the size of the system (Sections 4.2 and 4.3 below), and 2) by determining localized orbitals we can get very accurate first guesses for the electronic structure of large systems by breaking them into chemically invariant subsystems for which a smaller calculation can be performed. As the above example illustrated, the electronic structure of any CnH2n+2 (n > 7) can be accurately guessed by doing one calculation on a smaller (C7H16) system. In essence this latter point is sublinear scaling. We should mention that localized orbitals in the framework of HF theory have been obtained, typically as a unitary transformation of the already computed delocalized eigenfunctions (or molecular orbitals). See, for example, Refs. [34,35]. The method of Edmiston and Ruedenberg [34] is the one most often used. Having given up the well-established linear equations for the stationary
FIG. 2. Converged carbon-carbon (CC) bonds for heptane (C7H16) and decane. The first panel corresponds to heptane. The left two correspond to decane. The amplitudes are for normalized orbitals. The plot area is a 4 Å × 4 Å square within the plane of the carbon backbone. The bonds count from the end until the symmetry plane in the middle of the molecule. Heptane has three unique bonds and decane has five. The contour spacing is 0.2, with solid thin contour lines indicating positive contours, broken lines negative contours and the thick solid line marking the zero contour. The positions of the atoms are indicated. This figure is reproduced from Ref. [21].
states, we need to determine what non-linear equations the localized orbitals satisfy. The basic equation for the {φ_i} is
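(The displayed equation (8) is not reproduced in this copy; from the discussion that follows it has the form)

    (1 - P)\,H\,|\phi_i\rangle = 0, \qquad i = 1, \dots, N/2,
    \qquad P = \sum_{m,m'} |\phi_m\rangle \, (S^{-1})_{mm'} \, \langle \phi_{m'}|,

with S_{mm'} = ⟨φ_m|φ_{m'}⟩, and with the band-structure energy Tr{PH} (Eq. 9) to be minimized.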
To cast (8) in a projection operator language, we refer to the occupied and unoccupied subspaces as P and Q, respectively, where Q = 1 − P. The basic equation for the |φ_i⟩ is then QHP = 0. In other words, it is only necessary for the Hamiltonian to be block diagonalized into two blocks (occupied and unoccupied), subject to the criterion that the band-structure energy, Tr{PH}, is an absolute minimum (for a fixed one-electron potential).
This essentially defines the problem. The ε_j are the lowest N/2 eigenvalues of (2). It is typically both necessary and sufficient that (8) is satisfied and (9) is an absolute minimum for a fixed potential. This follows because of the invariance of the trace. The only way to minimize the sum of the one-electron energies is to be a representation of the lowest N eigenfunctions of (2). (There are one or two exceptions for which the total energy is a minimum, but the band-structure energy is not an absolute minimum for the given potential. We do not further address these cases here.) We emphasize, however, that (8), while minimizing (9), does not have a unique solution. Indeed our aim, and the goal in all local orbital formulations, is to use the freedom corresponding to this non-uniqueness to try to find a set of {φ_i} that not only solves (8) but that is also optimally "localized," or, equivalently, that makes the matrix representations of H and S optimally sparse. Further requirements sometimes involve orthogonality [17-19] and/or chemical invariance, as discussed above and in Refs. [24-28] at great length. Cast in this form, localized orbital theories can be thought of as rigorously bridging the gap between the intuitive chemical "building block" picture of a molecule, surface or solid (well established chemical concepts) and the delocalized nature of the quantum mechanical solution of the Schrödinger-like equation in the independent-electron approximation. While stationary states are sensitive to perturbations anywhere in the system, localized orbitals, in contrast, are sensitive to perturbations only within their range of localization and insensitive to perturbations far away. In localized orbital formulations, we seek a solution in which analogous chemical environments have similar bond orbitals (e.g., if two different molecules contain parts that for chemical or physical reasons are considered similar in character, then the quantities defined should also look similar). Figure 2 above showed an example. Hydrocarbon chains of different lengths should have terminal carbon-carbon bonds that look alike and internal carbon-carbon bonds that also look alike. As can be seen, the third (from the end) carbon-carbon bond
looks the same in heptane (seven carbons) as it does in decane (ten carbons). Furthermore, bonds 4 and 5 (the innermost) in decane are indistinguishable from bond 3. Similarly, the terminal bond 1 in heptane and in decane is indistinguishable. The same is true for bond 2. These results are exactly what we look for in a localized orbital formulation. They are invariant in going between chemical environments that are similar even if not rigorously related by symmetry. In essence, having solved for the electronic structure of heptane, we can accurately "guess" the electronic structure of decane, as is consistent with chemical intuition but inconsistent with traditional delocalized molecular orbital solutions. While not used extensively in large scale computations, the concepts of the Wannier [22,23,31] and generalized Wannier orbitals are fairly well known. These are linear combinations of different k-space eigenfunctions or Bloch waves (in a translationally invariant system) that are localized in real space. They have the property that a Wannier function localized in one unit cell is orthogonal to its translational replica in another unit cell. For our purposes we define generalized Wannier functions as real space, localized, orthogonal, occupied (i.e., linear combinations of the occupied stationary states of a system) functions. Thus they are related to the occupied stationary state representation by a unitary transformation. The asymptotic behavior of the Wannier orbitals at large distances r is determined by exp(−K_b|r|), where K_b is a branch point in complex k space and is, in magnitude, of the order of the smallest gap involving the given band [22,23]. Because the range relates to the gap, it is clearly physically significant. However, in real systems this range can be rather large (semi-global), precluding chemical invariance for all practical purposes. As pointed out by Anderson [26], it is trivial to show that the Wannier functions are not the most localized linear combination of the Bloch waves. However, since the range is physical, it cannot be removed by making a qualitative change. We contend that this semi-global behavior should not be ignored. It is, nevertheless, a property of fermions and Pauli exclusion and thus does not necessarily enter the definition of chemically invariant localized orbitals that are non-orthogonal. All the relatively long-range information, that is, the information sensitive to fairly distant perturbations in the system, can be embodied in the density matrix of the bond-orbital representation. In an orthogonal occupied representation the density matrix is unity. Thus all physical content must be contained in the orbitals. For non-orthogonal occupied representations, however, the density matrix D is formally the inverse of the overlap matrix and thus can carry relatively long range information that is sensitive to semi-global aspects of the system, while the individual bond orbitals remain insensitive. At this point, this is largely conjecture. Nevertheless, the significance of this point seems to have been missed in the early literature.
4.2 State by State Algorithm
We describe here a state by state iterative solution for a set of non-orthogonal "occupied" orbitals |φ_i⟩. As the iterations (n) proceed, the quantity [1 − P^(n)]H|φ_i^(n)⟩ approaches zero for i = 1, ..., N/2. Thus at convergence, n → ∞, Q^(n)HP^(n) → 0, where Q^(n) = [1 − P^(n)]. P^(n) is the projection operator onto the subspace spanned by the |φ_i^(n)⟩ and converges to the density operator. In each "φ-update" exactly one |φ_i⟩ changes. (More generally, in a periodic system, "exactly one" can mean one per unit cell. The algebra in Ref. [20] can be readily generalized, to either a finite number or to one per unit cell.) However, all |φ_m⟩ within a range such that D_mi ≠ 0 also change. Reference [20] showed that it is sufficient to constrain D to have sparsity comparable to that defined by the overlap metric, S. With "φ-update" defined this way, we have a non-standard iterative scheme in which the number of φ-updates is linear in N, but the work per φ-update is N-independent. Reference below to the number of "iterations" means the number of φ-updates divided by N/2. The algorithm is a sequence of φ-updates. First we describe the "in principle" algorithm, by which we mean no approximations. We try to indicate where the in-principle algorithm will differ from the in-practice one by explicitly including the phrase "in principle". An orbital is updated by optimally mixing in the residual, |χ_i⟩, which is defined as
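(The displayed definition is not reproduced in this copy; consistent with the surrounding text it has the form)

    |\chi_i^{(n)}\rangle = \frac{1}{\eta_i}\,\big[\,1 - P^{(n-1)}\,\big]\,H\,|\phi_i^{(n-1)}\rangle,
    \qquad
    \eta_i = \big\| \big[\,1 - P^{(n-1)}\,\big]\,H\,|\phi_i^{(n-1)}\rangle \big\| .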
The mixing is a "generalized Jacobi rotation" (where "generalized" will be denned momentarily); i.e.,
The η_i is chosen so that |χ_i^(n)⟩ is normalized to unity. At convergence the η_i vanish, and the |χ_i⟩ are undefined. Prior to convergence, the residual is orthogonal (at least in principle) to all the |φ_m^(n−1)⟩; i.e., ⟨χ_i^(n)|φ_m^(n−1)⟩ = 0, for m = 1, 2, ..., N/2. In other words, it is an element in Q^(n−1) space. Furthermore, it can be shown that any other direction in Q^(n−1) that is orthogonal to the residual has no Hamiltonian matrix element with the orbital being updated. Thus, if it isn't the absolute optimal direction for updating, it is very nearly so. By optimal direction we mean the one direction that would allow for the largest decrease in the energy. The mixing angle θ_i is chosen to minimize the total band-structure energy E^(n)(θ).
We have left the (n − 1) superscript label on the |φ_m⟩ (m ≠ i) to emphasize that these do not change during the update. Generalizing the energy
minimization to that of the total rather than the band-structure energy, i.e., eliminating the double counting of the Coulomb energy and including the exchange-correlation energy, can be accomplished in a manner similar to that described in Refs. [8, 9]. At first glance, approximations to minimize E^(n)(θ) may appear necessary, because with an update to |φ_i⟩, many of the complement functions must also be updated. Nevertheless, algebra shows that no approximations are necessary [20]. Determining the optimal mixing angle involves evaluating a 2×2 Hamiltonian matrix with the functions φ̃_i (note this is the complement function) and χ_i. However, θ_i is not the rotation angle for diagonalizing that 2×2 matrix, nor is it the rotation angle for diagonalizing the Hamiltonian matrix for the two directions that are being rotated in (11). Because it is not the diagonalizing rotation angle, we refer to (11) as a "generalized Jacobi rotation". Nevertheless, despite the change in many of the complement functions, and hence in many individual state energies, the change in the total band-structure energy depends only on matrix elements involving the residual and the current complement function. After determining θ_i, the corresponding row and column of S and H are trivially updated. Updating the orbital representation of the Hamiltonian involves determining the inner product of H|χ_i⟩ with a number of |φ_m⟩. A key point is: this number is finite, independent of the size of the system, and depends on the physical range of interactions. In principle the metric would update as follows: without loss of generality, we move the current function to the top of the basis and define the remaining (N/2 − 1) × (N/2 − 1) block of states as q-space; then
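(The displayed update is not reproduced in this copy; consistent with the discussion that follows, the updated metric has the form)

    S^{(n)} = \begin{pmatrix} 1 & \cos\theta_i \, s^{T} \\ \cos\theta_i \, s & S_{qq} \end{pmatrix},

where s is the column of overlaps of the current orbital with the q-space orbitals before the update and S_qq is the unchanged q-space block.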
The important implication of this result is that matrix elements of S^(n−1) which are zero stay zero, and any non-zero off-diagonal elements can only decrease in magnitude, because |cos(θ)| < 1, which implies that the condition of the matrix improves. Furthermore, the metric is updated trivially and involves only m multiplications, where m is the number of non-zero elements in the updated row of S. In addition, due to this form of the update of S, the update of D is also relatively simple. More importantly, the complement basis can be updated directly. Consequently, the matrix multiplication DH can be done algebraically rather than numerically. Matrix multiplication would be order m³; the algebraic update is order m². We emphasize that 1) the "inversion" of the metric is not necessary at every φ-update (in principle it is necessary only at the first step), and 2) the condition of the metric improves with each φ-update. Actually it is not very significant that the condition improves, but it is meaningful that
the condition cannot get worse, thus assuring stability of the algorithm. Note that, while the metric approaches unity (orthogonality), it does not, in general, become unity. Hence, our algorithm does not iterate to orthogonal or generalized Wannier orbitals as in the methods of Refs. [17, 18]. Additionally, we emphasize that the algebra of the updates relies only on the residual vector being orthogonal to the current set of "occupied" orbitals. After updating |φ_i⟩ and the relevant matrices, a new orbital is chosen. In practice, after each iteration (or every few iterations), we recompute all the matrices from the current set of orbitals {|φ_i⟩}.
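The following Python sketch shows the overall structure of one sweep of φ-updates. It is schematic only: it uses dense linear algebra and a brute-force search over the mixing angle instead of the sparse, algebraic updates of Ref. [20] that give the linear scaling, it omits the localization constraints, and all function and variable names are illustrative. An orthonormal fixed basis is assumed, so the fixed-basis overlap is the identity.

import numpy as np

def band_energy(C, H):
    # Band-structure energy Tr{S^-1 H_orb} for non-orthogonal orbitals (columns of C).
    S = C.T @ C
    Horb = C.T @ H @ C
    return np.trace(np.linalg.solve(S, Horb))

def sweep(C, H):
    # One sweep of phi-updates over all occupied orbitals (schematic, dense).
    n = H.shape[0]
    for i in range(C.shape[1]):
        S = C.T @ C
        P = C @ np.linalg.solve(S, C.T)          # projector onto the occupied space
        r = (np.eye(n) - P) @ (H @ C[:, i])      # residual for orbital i
        nrm = np.linalg.norm(r)
        if nrm < 1e-10:
            continue
        chi = r / nrm
        phi_old = C[:, i].copy()
        best = None
        # Brute-force line search over the mixing angle; the paper instead
        # determines the optimal angle algebraically from a 2x2 matrix.
        for t in np.linspace(-np.pi / 2, np.pi / 2, 181):
            C[:, i] = np.cos(t) * phi_old + np.sin(t) * chi
            e = band_energy(C, H)
            if best is None or e < best[0]:
                best = (e, t)
        C[:, i] = np.cos(best[1]) * phi_old + np.sin(best[1]) * chi
    return C

Repeated calls to sweep drive the residuals toward zero; the occupied-subspace energy decreases monotonically with each φ-update, as in the text.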
FIG. 3. Convergence of the average energy per occupied orbital for a supercell (96 atoms) of crystalline silicon, randomized silicon (10% randomization from crystalline positions) and crystalline silicon with one vacancy. The final point is the exact answer (determined by direct diagonalization).
One final point: as noted, D as computed is not identically S⁻¹; consequently Tr{DH} (where Tr is the sum of the diagonal elements) is not necessarily a variational upper bound to the band-structure energy. In principle, this is a problem, because without the variational principle the sought-after solution is not necessarily the minimum of the function that is being minimized. Nevertheless, E = 2Tr{(2 − DS)DH} can be shown to be an upper bound, provided the zero of energy is taken so that H is negative
definite, which is to say that all occupied orbitals have negative energy. (E is the energy plotted in Fig. 3.) With this formulation, the effective density matrix is 2D − DSD; thus, the particle number is 2Tr{(2 − DS)DS}, which can be shown to be less than or equal to N. The transformation to the effective density matrix guarantees that its eigenvalues are all nearly, but less than or equal to, one. Equality arises if D is identically S⁻¹. The correction term to the band-structure energy, 2Tr{(1 − DS)DH}, is similar to that used in Refs. [15-18] and the same as in Refs. [20,21].
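A small numerical check of these statements can be useful. The Python sketch below builds a negative-definite model Hamiltonian, a positive-definite orbital overlap matrix and an approximate (not exactly inverse) D, and then evaluates the corrected energy 2Tr{(2 − DS)DH}, the effective density matrix 2D − DSD and the particle number. The matrices are random model data, not taken from any calculation in the text.

import numpy as np

rng = np.random.default_rng(1)
n = 6
A = rng.standard_normal((n, n))
H = -(A @ A.T + np.eye(n))                 # negative-definite model "Hamiltonian"
B = rng.standard_normal((n, n))
S = B @ B.T + n * np.eye(n)                # positive-definite orbital overlap matrix

D_exact = np.linalg.inv(S)
D = D_exact + 0.05 * rng.standard_normal((n, n))
D = 0.5 * (D + D.T)                        # approximate symmetric D, not exactly S^-1

E_corr  = 2.0 * np.trace((2.0 * np.eye(n) - D @ S) @ D @ H)   # corrected energy
E_exact = 2.0 * np.trace(D_exact @ H)                         # exact 2 Tr{S^-1 H}
print("corrected energy:", E_corr, ">= exact band-structure energy:", E_exact)

Deff = 2.0 * D - D @ S @ D                 # effective density matrix 2D - DSD
occ = np.linalg.eigvals(S @ Deff).real     # generalized occupation numbers, all <= 1
print("max occupation:", occ.max(),
      " particle number:", 2.0 * np.trace((2.0 * np.eye(n) - D @ S) @ D @ S))

Because the occupation numbers of the effective density matrix are 1 − (1 − x)² ≤ 1 for any eigenvalue x of DS, the corrected energy cannot fall below the exact band-structure energy when H is negative definite, which is exactly the bound claimed above.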
4.3 Global Algorithm
We describe here a global iterative solution for a set of non-orthogonal "occupied" orbitals |φ_i⟩ by functional minimization [21]. Following the last section, the total (band-structure) energy is
where H^(n) and S^(n) are the Hamiltonian and overlap matrices in the {φ_i^(n)} representation. The energy zero is adjusted so that the highest eigenvalue of H^(n)|ψ⟩ = εS^(n)|ψ⟩ is negative. Expanding the |φ_i^(n)⟩ in a fixed basis, we have |φ_i^(n)⟩ = Σ_μ c_μi^(n)|ξ_μ⟩. Hence, S^(n) = [c^(n)]^T 𝒮 c^(n) and H^(n) = [c^(n)]^T ℋ c^(n), where the script matrices 𝒮 and ℋ are the corresponding matrices in the fixed basis representation, {ξ_μ}. The variational parameters are then the matrix elements of D^(n) and c^(n). By imposing locality, the total number of variational parameters is order N. The gradient of E^(n) with respect to the variational parameters is formed (in a manner very similar to that of Refs. [17, 18]). The parameter set is updated by conjugate-gradient line minimizations [37]. The difference from Refs. [17, 18] is in including the variational matrix D^(n). This allows the orbitals to be non-orthogonal without loss of accuracy [21]. As a model problem we again use the non-orthogonal tight-binding Hamiltonian and overlap parameters for silicon [36] and compute localized bonding orbitals for a 64 atom supercell of crystalline silicon. Figure 4 (reproduced from Ref. [21]) shows the matrix representation of the computed density operator (represented by localized functions) in the basis of the exact stationary states of the Hamiltonian (as obtained by direct diagonalization). The filled dots represent the diagonal elements of this matrix as a function of the corresponding energy eigenvalues; the desired result is unity (zero) for eigenvalues below (above) zero. The unfilled dots are the averages, with the error bars denoting maximum and minimum, of the off-diagonal elements in the columns corresponding to the respective eigenvalue; these should vanish. The bonding orbitals are constrained to be localized along each Si-Si bond and can spread only to the first shell of neighboring bonds. The upper panel is for the non-orthogonal algorithm with D computed variationally and the lower panel for D constrained to
be unity (as in Ref. [17] with their parameter η = 1, and as in Ref. [18]). The difference in the quality of the solution is readily apparent.
FIG. 4. Comparison of the spectral decompositions of converged density matrices for non-orthogonal (upper panel) and generalized Wannier (lower panel) algorithms. See text for details.
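The following Python sketch conveys the structure of the global minimization: the functional 2Tr{(2 − DS)DH} is minimized simultaneously with respect to the orbital coefficients c and the matrix D. It uses a generic conjugate-gradient routine from SciPy with numerical gradients instead of the analytic gradients of Refs. [17, 18, 21], omits the locality constraints and the self-consistency loop, and assumes an orthonormal fixed basis; all names and sizes are illustrative.

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)
nb, no = 8, 3                                   # fixed-basis size, number of occupied orbitals
A = rng.standard_normal((nb, nb))
Hfix = 0.5 * (A + A.T) - 3.0 * np.eye(nb)       # fixed-basis Hamiltonian, shifted negative
Sfix = np.eye(nb)                               # orthonormal fixed basis assumed

def unpack(x):
    c = x[: nb * no].reshape(nb, no)
    D = x[nb * no:].reshape(no, no)
    return c, 0.5 * (D + D.T)

def energy(x):
    # E = 2 Tr{(2 - DS) D H} in the orbital representation
    c, D = unpack(x)
    S = c.T @ Sfix @ c
    H = c.T @ Hfix @ c
    return 2.0 * np.trace((2.0 * np.eye(no) - D @ S) @ D @ H)

x0 = np.concatenate([np.eye(nb)[:, :no].ravel() + 0.1 * rng.standard_normal(nb * no),
                     np.eye(no).ravel()])
res = minimize(energy, x0, method="CG")
print("minimized energy:", res.fun)
print("reference 2*sum of lowest eigenvalues:",
      2.0 * np.linalg.eigvalsh(Hfix)[:no].sum())

With H shifted negative definite, the minimum of this functional approaches twice the sum of the lowest occupied eigenvalues, which is the property exploited in the text.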
The results [21], shown earlier in Fig. 2, were calculated by applying this algorithm to a real system, as opposed to a tight-binding model. For the hydrocarbons, the fixed basis was a minimal basis (contracted Gaussian atomic orbitals: 2s and 2p on carbon and s on hydrogen). The potential
(given the charge density) and the Hamiltonian and overlap matrices in the fixed basis were calculated on a parallel computer (nCUBE), using the code QUEST, developed at the MPCRL, Sandia National Laboratories [38] and based on methods developed by Feibelman [39]. QUEST scales nearly linearly for every aspect of the calculation except for diagonalization. The linear scaling algorithms replace the diagonalization. The orbitals for the fixed potential and the corresponding density matrix (c^(n)D^(n)[c^(n)]^T) in the fixed basis representation were determined on a workstation and fed back to QUEST for a new iteration (determination of the new potential and the corresponding matrix elements), until self-consistency was obtained. As mentioned, the orbitals determined for heptane can be "re-used" to build the first guess for the orbitals in decane. It is only necessary to recalculate the corresponding charge density for decane; conjugate gradient minimization of the energy with c fixed determines a new D. Within the variational freedom offered, the energy is minimized without further iteration of either the orbitals or the potential. This demonstrates unequivocally the principle and potential power of transferability, and how to divide and conquer the electronic structure of a large system by breaking it into chemically invariant units. In general, the transferability may not be as complete as for heptane to decane. Nevertheless, the implication is that it will always provide very good first guesses, thereby requiring very few iterations to converge.
5 Conclusions
In this chapter we have given a brief overview of linear-scaling algorithms for large scale electronic structure calculations. We have outlined algorithms that iteratively solve for non-orthogonal localized ("bond") orbitals that form a representation of the occupied subspace of the Kohn-Sham Hamiltonian. The linear scaling arises from keeping all matrices sparse. More importantly, we have attempted to show that the solutions are sensitive only to the immediate chemical environment and insensitive to changes that occur in remote regions of the system. The domain decomposition paradigm comes into play in two ways: 1) in terms of the concept of a good basis to represent the solution, and 2) in solving for "reusable" solutions from calculations on related systems. The basis here is the solution, in that we solve for the ultimate minimal basis, N/2 functions to represent N/2 solutions. We demonstrated the "re-use" of prior calculations with a simple homologous series, hydrocarbon chains. Future work will concentrate on more complex organic molecules and semiconductor systems and on making the necessary generalizations in order to treat metals. As mentioned above, for metals it is necessary to go to a finite temperature formulation (to recover the locality principle). The consequence will be the inclusion of some partially occupied states.
6 Acknowledgments
Much of this work was done in collaboration with A.R. Williams, P.J. Feibelman and W. Hierse. I take great pleasure in acknowledging their expertise and numerous contributions. I would also like to thank Prof. J. Kübler for sharing his excellent Ph.D. student, W. Hierse. I am also very grateful to A.F. Wright and J.S. Nelson for many enormously helpful and stimulating conversations. Lastly, I greatly appreciate the help of M.P. Sears and P.A. Schultz in guiding W. Hierse in the use of their QUEST code.
References
[1] P. Hohenberg and W. Kohn, Inhomogeneous Electron Gas, Phys. Rev. B 136 (1964) pp. 864-871.
[2] W. Kohn and L. J. Sham, Self-Consistent Equations Including Exchange and Correlation Effects, Phys. Rev. A 140 (1965) pp. 1133-1138.
[3] J.C. Slater, Quantum Theory of Molecules and Solids, Vol. 1 (McGraw-Hill Book Co., N.Y., 1963).
[4] A.D. Becke, Correlation Energy of an Inhomogeneous Electron Gas: A Coordinate Space Model, J. Chem. Phys. 88 (1988) pp. 1053-1058.
[5] J.P. Perdew et al., Atoms, Molecules, Solids and Surfaces: Applications of the Generalized Gradient Approximation for Exchange and Correlation, Phys. Rev. B 46 (1992) pp. 6671-6675.
[6] J.A. Pople, P.M.W. Gill and B.G. Johnson, Kohn-Sham Density-Functional Theory within a Finite Basis Set, Chem. Phys. Letts. 199 (1992) pp. 557-560.
[7] R. Car and M. Parrinello, Unified Approach for Molecular Dynamics and Density-Functional Theory, Phys. Rev. Lett. 55 (1985) pp. 2471-2474.
[8] I. Stich, R. Car, M. Parrinello and S. Baroni, Conjugate Gradient Minimization of the Energy Functional: A New Method for Electronic Structure Calculation, Phys. Rev. B 39 (1989) pp. 4997-5012.
[9] M.P. Teter, M.C. Payne and D.C. Allan, Solution of Schrodinger's Equation for Large Systems, Phys. Rev. B 40 (1989) pp. 12255-12263; M.C. Payne, M.P. Teter, D.C. Allan, T.A. Arias and J.D. Joannopoulos, Iterative Minimization Techniques for ab-initio Total Energy Calculations: Molecular Dynamics and Conjugate Gradients, Rev. of Mod. Phys. 64 (1992) pp. 1045-1097.
[10] Haydock, Heine and Kelly, in Solid State Physics 35 (1980) and references therein.
[11] W. Yang, Direct Calculation of Electron Density in Density-Functional Theory, Phys. Rev. Lett. 66 (1991) pp. 1438-1443; Implementation for Benzene and a Tetrapeptide, Phys. Rev. A 44 (1991) pp. 7823-7828; C. Lee and W. Yang, The Divide-and-Conquer Density Functional Approach: Molecular Internal Rotation and Density of States, J. Chem. Phys. 96 (1992) pp. 2408-2413.
[12] S. Baroni and P. Giannozzi, Towards Very Large-Scale Electronic Structure Calculations, Europhys. Lett. 17 (1992) pp. 547-552.
[13] G. Galli and M. Parrinello, Large Scale Electronic Structure Calculations, Phys. Rev. Letts 69 (1992) pp. 3547-3550.
[14] L.-W. Wang and M.P. Teter, Simple Quantum-Mechanical Model of Covalent Bonding using a Tight-Binding Basis, Phys. Rev. B 46 (1992) pp. 12798-12802.
[15] X.-P. Li, R.W. Nunes and D. Vanderbilt, Density-Matrix Electronic-Structure
Method with Linear System-Size Scaling, Phys. Rev. B 47 (1993) pp. 10891-10894.
[16] M.S. Daw, Model for Energetics of Solids Based on the Density Matrix, Phys. Rev. B 47 (1993) pp. 10899-10902.
[17] F. Mauri, G. Galli and R. Car, Orbital Formulation for Electronic Structure Calculations with Linear System Size Scaling, Phys. Rev. B 47 (1993) pp. 9973-9976.
[18] P. Ordejon, D.A. Drabold, M.P. Grumbach and R.M. Martin, Unconstrained Minimization Approach for Electronic Computations that Scales Linearly with System Size, Phys. Rev. B 48 (1993) pp. 14646-14649.
[19] W. Kohn, Density Functional/Wannier Function Theory for Systems of Very Many Atoms, Chem. Phys. Lett. 208 (1993) pp. 167-172.
[20] E.B. Stechel, A.R. Williams and P.J. Feibelman, N-Scaling Algorithm for Density-Functional Calculations of Metals and Insulators, Phys. Rev. B 49 (1994) pp. 10088-10101.
[21] W. Hierse and E.B. Stechel, Order N Methods in Self-Consistent Density Functional Calculations, Phys. Rev. B, submitted; Linearly Scaling First-Principles Electronic Structure with Optimal Transferability between Related Systems, presented by W. Hierse at the APS meeting, 21-24 March, 1994, Pittsburgh, PA.
[22] W. Kohn, Analytic Properties of Bloch Waves and Wannier Functions, Phys. Rev. 115 (1959) pp. 809-821.
[23] E.I. Blount, in Advances in Solid State Physics, ed. F. Seitz and D. Turnbull, Academic Press, N.Y., Vol. 13 (1962).
[24] W.H. Adams, On the Solution of the Hartree-Fock Equation in Terms of Localized Orbitals, J. of Chem. Phys. 34 (1961) pp. 89-102; Orbital Theories of Electronic Structure, J. of Chem. Phys. 37 (1962) pp. 2009-2018; Least Distorted Localized Orbital Self-Consistent Field Equations, Chem. Phys. Letts 11 (1971) pp. 71-74; W.H. Adams, Distortion of Interacting Atoms and Ions, Chem. Phys. Letts 12 (1971) pp. 295-298; On the Solution of the Schrodinger Equation in Terms of Wavefunctions Least Distorted from Products of Atomic Wavefunctions, Chem. Phys. Letts 11 (1971) pp. 441-444; Localized Wavefunctions and the Interaction Potential between Electronic Groups, Phys. Rev. Letts 13 (1974) pp. 1093-1095.
[25] T.L. Gilbert, Self-Consistent Equations for Localized Orbitals in Polyatomic Systems, in Molecular Orbitals in Chemistry, Physics and Biology, eds. P.-O. Lowdin and B. Pullman (Academic Press, N.Y., 1964), pp. 405-420.
[26] P.W. Anderson, Self-Consistent Pseudopotentials and Ultralocalized Functions for Energy Bands, Phys. Rev. Letts 21 (1968) pp. 13-16; Localized Orbitals for Molecular Quantum Theory I. The Huckel Theory, Phys. Rev. 181 (1969) pp. 25-32; J.D. Weeks, P.W. Anderson and A.G.H. Davidson, Non-Hermitian Representations in Localized Orbital Theories, J. of Chem. Phys. 58 (1973) pp. 1388-1395.
[27] D.W. Bullett, Chemical Pseudopotential Approach to Covalent Bonding: I & II, J. Phys. C: Solid State Phys. 8 (1975) pp. 2695-2706 and pp. 2707-2712.
[28] P.W. Anderson, Chemical Pseudopotentials, Phys. Reps. 110 (1984) pp. 311-320.
[29] C. Lanczos, J. Res. Nat. Bur. of Standards, Sect. B 45 (1950) pp. 255-260; J.K. Cullum and R.A. Willoughby, Lanczos Algorithms for Large Symmetric Eigenvalue Computations, Progress in Scientific Computing Vol. 3, Birkhauser
(1985).
[30] R. Riedinger et al., Electronic Structure of Finite or Infinite Systems in the Tight-Binding Model with Overlap, Phys. Rev. B 39 (1989) pp. 13175-13185 and references therein.
[31] G.H. Wannier, Dynamics of Band Electrons in Electric and Magnetic Fields, Rev. of Mod. Phys. 34 (1962) pp. 645-655.
[32] A.A. Frost, The Floating Spherical Gaussian Orbital Method, in Methods of Electronic Structure Theory, ed. H.F. Schaefer III (Plenum Press, N.Y. and London, 1977), pp. 29-49 and references therein.
[33] H. Sun, K.F. Freed, M.F. Herman and D.L. Yeager, Ab initio Effective Valence Shell Hamiltonian for the Neutral and Ionic Valence States of N, O, F, Si, P and S, J. Chem. Phys. 72 (1980) pp. 4158-4173 and references therein.
[34] K. Ruedenberg, The Physical Nature of the Chemical Bond, Rev. of Mod. Phys. 34 (1962) pp. 326-376; C. Edmiston and K. Ruedenberg, Localized Atomic and Molecular Orbitals, Rev. of Mod. Phys. 35 (1963) pp. 457-465.
[35] S.F. Boys, Construction of Some Molecular Orbitals to be Approximately Invariant for Changes from one Molecule to Another, Rev. of Mod. Phys. 32 (1960) pp. 296-299; J.M. Foster and S.F. Boys, Canonical Configuration Interaction Procedure, Rev. of Mod. Phys. 32 (1960) pp. 300-302.
[36] L.F. Mattheiss and J.R. Patel, Phys. Rev. B 23 (1981) pp. 5384-5390.
[37] W.H. Press, B.P. Flannery, S.A. Teukolsky and W.T. Vetterling, Numerical Recipes, Cambridge University Press (1986).
[38] M.P. Sears and P.A. Schultz, QUEST, Quantum Electronic Structure, unpublished.
[39] P.J. Feibelman, Efficient Solution of Poisson's Equation in Linear Combination of Atomic Orbitals Calculations of Crystal Electronic Structure, Phys. Rev. B 33 (1986) pp. 719-725; Pulay-type Formula for Surface Stress in a Local-Density Functional, Linear Combination of Atomic Orbitals, Electronic-Structure Calculation, Phys. Rev. B 44 (1991) pp. 3916-3925.
Chapter 14 Problem Decomposition in Quantum Chemistry Hans-Joachim Werner
Abstract An overview of methods for calculating correlated electronic wavefunctions is given. Formally, this requires solution of the electronic Schrodinger equation, which is a second-order partial differential eigenvalue equation with typically 50-150 degrees of freedom. Since this problem cannot be solved exactly, finite basis sets are used, which must be carefully chosen in order to obtain accurate results. The problem is decomposed into the choice and optimization of three-dimensional one-electron functions, from which the 3N-dimensional N-electron functions are constructed. Various contraction schemes are discussed, which can be used to reduce the size of the basis sets. Some aspects of vectorization and parallelization are also addressed.
1 Introduction
The ultimate goal of computational quantum chemistry is to solve the time-independent molecular Schrodinger equation
where H = T_N + H_el is the molecular Hamiltonian. The operator T_N represents the kinetic energy of the M nuclei, and the electronic Hamiltonian H_el describes the movement of the N electrons for fixed nuclei
"Institut fur Theoretische Chemie, Universitat Stuttgart, D-70569 Stuttgart, Germany 'Throughout this paper we use atomic units, in which h — 1 and the electron mass me = 1.
where M_K and Z_K are the mass and charge of nucleus K, respectively. The one-electron operators h(i) act on the individual electronic coordinates x_i, while the g(i, j) are two-electron operators that describe the Coulomb repulsion between pairs of electrons. E_nuc is the constant nuclear repulsion energy. The molecular wavefunction
is a function of the 3M nuclear coordinates R_K = (X_K, Y_K, Z_K), the 3N spatial electronic coordinates r_i = (x_i, y_i, z_i), and the N spin coordinates s_i of the electrons, which are restricted to the values ±1/2. The observables to be determined are the energy eigenvalues E_{n,k} for states denoted by the quantum numbers n, k. Other observables can be computed as expectation values from the molecular wavefunction. The Laplace operators ∇²_K and ∇²_i
act on the coordinates of one particle at a time. According to the Pauli principle the wavefunction Ψ must be antisymmetric with respect to permutations of particles with half-integer spin (Fermions) and symmetric with respect to those with integer spin (Bosons). In the above formulation relativistic effects and time-dependent perturbations have been neglected. Equation (1) is a second-order partial differential equation with 3(N + M) degrees of freedom. Even for molecules of modest size the number of coordinates is very large; for instance, for benzene, C6H6, we have M = 12 and N = 42, which yields a total of 162 spatial coordinates. Since chemical energies are very small relative to the total energies, the solutions must be very accurate, typically to one part in 10⁸. At present, it is impossible to solve this problem exactly. Therefore, various approximations are introduced in order to find approximate solutions. The first step is usually to separate the fast motions of the electrons from the slow motions of the much heavier nuclei. This leads to the Born-Oppenheimer approximation [1], in which the electronic problem is first solved for fixed nuclear coordinates R:
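(The displayed equation (7) is not reproduced in this copy; its standard form, consistent with the surrounding text, is)

    H_{el}\,\Psi^{el}_n(\mathbf x;\mathbf R) = V_n(\mathbf R)\,\Psi^{el}_n(\mathbf x;\mathbf R).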
Here and in the following, x = (r, s) represents all electronic space-spin coordinates. Since the electrons are Fermions, the electronic wavefunction must be antisymmetric with respect to permutations of the electronic coordinates x_i. The eigenfunctions Ψ_n^el and the energy eigenvalues V_n depend in a parametric way on the nuclear positions R. Solution of the electronic Schrodinger equation (7) for many different sets of nuclear coordinates yields the potential energy surfaces V_n(R), which are used in a second step as effective potentials in the Schrodinger equation of nuclear motion
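(The displayed equation (8) is not reproduced in this copy; its standard form is)

    \big[\,T_N + V_n(\mathbf R)\,\big]\,\chi^{(n,k)}(\mathbf R) = E_{n,k}\,\chi^{(n,k)}(\mathbf R).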
Solution of this equation yields the desired eigenvalues E_{n,k}; the total molecular wavefunction Ψ^{(n,k)} is simply the product of the electronic and nuclear wavefunctions:
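(The displayed equation (9) is not reproduced in this copy; consistent with the sentence above it reads)

    \Psi^{(n,k)}(\mathbf R,\mathbf x) = \Psi^{el}_n(\mathbf x;\mathbf R)\,\chi^{(n,k)}(\mathbf R).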
In this two-step approach certain coupling terms are neglected. For more details see, e.g., Refs. [2, 3]. In the following we will restrict our discussion to the approximate solution of the electronic Schrodinger equation (7). Other articles in the present volume deal with the problem of nuclear motion. The electronic wavefunction Ψ_n^el describes the distribution of N electrons in the electronic state n and depends on 3N spatial and N spin coordinates. Usually, it is represented as a linear combination of antisymmetrized products (Slater determinants) of N orthonormal one-electron functions, which are called spin-orbitals. The spin-orbitals are approximated as linear combinations of contracted Gaussian basis functions. In the simplest case, which is the Hartree-Fock approximation, only one Slater determinant is considered, and the spin-orbitals are optimized according to the variational principle. In the method of configuration interaction, a linear combination of many Slater determinants is formed, and the linear coefficients are variationally optimized. In general, this leads to a matrix eigenvalue equation of large dimension, and therefore further decomposition of the wavefunction into excitation classes is useful. Furthermore, contraction schemes can be used to reduce the size of the basis set. Selection of certain configuration classes is usually based on perturbation theory. In section 2 we briefly discuss the one-electron and N-electron basis sets. As will be seen, the convergence of the N-electron basis depends crucially on the choice of the one-electron functions (orbitals). Therefore, the first step of most electronic structure calculations is an orbital optimization. The method of configuration interaction and the solution of the large eigenvalue equations are explained in section 4. Finally, some aspects of vectorization and parallelization are discussed in section 5.
For a basic introduction to electronic structure theory the reader is referred to a lucid article by J. Almlof [4]. A more complete treatment can be found, e.g., in the book of McWeeny [5].
2 One-electron and N-electron Basis Sets
In the limit of completely independent electrons, i.e. in the absence of the Coulomb interactions, the Hamiltonian is simply a sum of one-electron operators
and the N-electron Schrodinger equation can be separated into a set of one-electron eigenvalue equations
The one-electron functions ψ_k(x_i) are called spin-orbitals, and the corresponding eigenvalues ε_k orbital energies. The spin-orbitals are assumed to be orthonormalized
It can easily be shown that the normalized eigenfunctions of the N-electron operator H⁽⁰⁾ are then products of N spin-orbitals
while the total energies are sums of the corresponding orbital energies
Different N-electron eigensolutions are obtained by different choices of spin-orbitals in the product (13). A particular such choice, which can be represented by a composite index I = {kl...p}, is called an electron configuration. In the independent particle model, as described above, each configuration corresponds to a different electronic eigenstate. As already noted, the electronic wavefunction must be antisymmetric with respect to permutation of the electronic coordinates. Since the electronic Hamiltonian H_el is invariant with respect to such permutations, an antisymmetrized product of spin-orbitals (Slater determinant)
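(The displayed determinant, Eq. 15, is not reproduced in this copy; its standard form is)

    \Phi_I(\mathbf x_1,\dots,\mathbf x_N) = \frac{1}{\sqrt{N!}} \sum_{i} (-1)^{p_i}\, P_i
    \big[\psi_k(\mathbf x_1)\,\psi_l(\mathbf x_2)\cdots\psi_p(\mathbf x_N)\big],

where p_i is the parity of the permutation P_i.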
is still an eigenfunction of H⁽⁰⁾ with eigenvalues given by Eq. (14). Here the P_i are all possible permutation operators acting on the electronic coordinates. In the presence of the electron interactions the Slater determinants Φ_I are no longer eigenfunctions of H_el. However, the (infinite) set of all Slater determinants forms a complete N-electron basis in which the electronic wavefunctions Ψ_n^el can be expanded
We will assume here that completeness of the basis can in principle be achieved, without further discussion. In practice, one has to make two distinct approximations: (i) The number of orbitals used is finite, and therefore the one-electron basis is necessarily incomplete. (ii) In most cases only a subset of all Slater determinants which can be constructed from a given set of orbitals is used to approximate the N-electron wavefunction. The second approximation is necessary since the number of Slater determinants grows very rapidly with the number of orbitals and electrons. In the following we will assume that each spin-orbital ψ_k(x) is a product of a spatial orbital φ_k(r) and a spin function σ(s) which is nonzero only for s = ±1/2. The spin functions for s = 1/2 and s = −1/2 are denoted α and β, respectively. According to the Pauli principle a maximum of two electrons can occupy an orbital φ_k. Doubly occupied orbitals form closed shells, singly occupied orbitals form open shells, and the remaining orbitals are unoccupied. A valid N-electron wavefunction should be an eigenfunction not only of H_el, but also of the N-electron spin operators S² and S_z
Since in the absence of spin-orbit effects the electronic Hamiltonian commutes with S² and S_z, the eigenfunctions of H_el are automatically eigenfunctions of the spin operators, provided all possible spin couplings for each orbital configuration are included. Spin symmetry imposes certain restrictions among the coefficients C_I. It is often convenient to introduce these restrictions explicitly and use from the beginning N-electron basis functions which are spin eigenfunctions themselves. Such spin-adapted functions are called configuration state functions (CSFs) and are in general linear combinations of Slater determinants with the same orbital occupancies but different spin assignments to the open-shell electrons. The total number of CSFs for
m orbitals and N electrons is given for S = M_S by the Weyl formula [6]
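(The displayed formula is not reproduced in this copy; the Weyl dimension formula referred to here is usually written as)

    N_{CSF}(m, N, S) = \frac{2S+1}{m+1}\,\binom{m+1}{\tfrac{N}{2}-S}\,\binom{m+1}{\tfrac{N}{2}+S+1},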
where the quantities in parentheses are binomial coefficients. Due to the factorials in these coefficients the number of CSFs becomes astronomical even for rather small systems. For a finite (incomplete) basis the coefficients C_I^(n) can be optimized according to the variational principle. This leads to the Schrodinger equation in matrix form (Eq. 20), with Hamiltonian matrix elements H_IJ = ⟨Φ_I|H_el|Φ_J⟩ (Eq. 21). The Hylleraas-Undheim [7]-MacDonald [8] theorem states that the eigenvalues V_n are upper bounds to the true energies of the corresponding states.
This method of determining the linear coefficients C_I is known as configuration interaction (CI). A given finite basis of orthonormal molecular orbitals {φ_k} can be transformed into an infinite number of different orthonormal sets {φ̃_k} by unitary transformations
For a full configuration expansion (FCI), which includes all possible CSFs for a given orbital basis, the wavefunctions (Eq. 16) and energies obtained by solving Eq. (20) are invariant to such unitary orbital transformations (of course the matrix elements H_IJ and vectors c^(n) are not). However, this invariance property is lost if the configuration expansion is truncated by neglecting part of the CSFs. In fact, for such truncated configuration expansions the convergence of the energies V_n towards the FCI energies with respect to the number of CSFs depends crucially on the choice of the orbitals. It is therefore of utmost importance to optimize the orbitals in some way before solving the CI problem. We will come back to the CI method in section 4.
3 Orbital Optimization
In the Hartree-Fock approximation the ground state wavefunction is represented by one particular Slater determinant Φ_HF, and the orbitals are optimized according to the variational principle by minimizing the energy expectation value
subject to the orthonormality conditions. This leads to the eigenvalue equation [5]
where f(i) is a one-electron operator that describes the movement of electron i in an average potential g(i) of the remaining N − 1 electrons:
The average electron interaction operator g(i) depends on the form of all orbitals which occur in the Slater determinant Φ_HF (the occupied orbitals), which are themselves eigensolutions of Eq. (25). Therefore, an iterative method must be used to obtain the optimum orbitals. In each iteration, Eq. (25) is solved for a fixed operator f, and the solutions are then used to construct an improved operator f. This procedure is repeated until self-consistency is achieved (Hartree-Fock Self-Consistent Field (SCF) method). In practice, the molecular orbitals (MOs) {φ_k} are usually approximated as linear combinations of atomic basis functions (AOs) {χ_μ}
These basis functions are in general non-orthogonal with a metric
Representing the Hartree-Fock equation (25) in the AO basis yields a generalized matrix eigenvalue equation
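(The displayed equation is not reproduced in this copy; in standard Roothaan form it reads)

    \mathbf F\,\mathbf C = \mathbf S\,\mathbf C\,\boldsymbol\varepsilon ,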
where ε is a diagonal matrix with elements ε_kl = ε_k δ_kl. For the simplest case of a closed-shell wavefunction, in which each spatial orbital is occupied by two electrons with opposite spin, the Fock matrix F takes the explicit form
where
is the one-electron density matrix, and
are one- and two-electron integrals in the AO basis, respectively. All integrals over the electronic Hamiltonian H_el can be expressed in terms of these integrals. The bottleneck of the SCF method is the computation and storage of the two-electron integrals (μν|ρσ). Usually, Gaussian functions
with l_μ, m_μ, n_μ = 0, 1, 2, ... are used as a basis, since then the integrals can be factorized into separate integrals for the three cartesian coordinates x, y, z and can be computed very efficiently [9]. A disadvantage of Gaussian basis functions is that they do not have the correct radial behaviour, and therefore quite large basis sets must be used. The number of integrals to be stored can be reduced by contraction of the primitive Gaussian functions {χ̃_ν} with fixed coefficients T_νμ to a smaller set {χ_μ}:
The exponents α_p and contraction coefficients T_νμ can be optimized in Hartree-Fock calculations for individual atoms and then be used in molecular calculations. Large libraries of such contracted basis sets are nowadays available. It is important to note that the Hartree-Fock wavefunction Φ_HF is not an eigenfunction of the electronic Hamiltonian H_el, nor is the sum of the Hartree-Fock orbital energies ε_k equal to the energy expectation value E_HF. The energy expectation value for closed-shell SCF wavefunctions is
where h_kk = ⟨φ_k|h|φ_k⟩ and f_kk = ⟨φ_k|f|φ_k⟩ are matrix elements of the one-electron operators in the MO basis. The Hartree-Fock approximation should only be applied if the electronic wavefunction can be well approximated by a single Slater determinant. This is mostly the case for electronic ground states of molecules in the vicinity of their equilibrium structures. For excited states, or in order to describe the dissociation of molecular bonds, often several Slater determinants are needed for a qualitatively correct representation of the wavefunction. In such cases it is possible to optimize the orbitals for a multiconfiguration wavefunction
simultaneously with the linear coefficients {c_I^(n)} (Multiconfiguration Self-Consistent Field (MCSCF) method) [10]. Typically, such expansions include
only configurations {Φ_I} which are constructed from a relatively small set of molecular orbitals. In the Complete Active Space SCF (CASSCF) method [11] all possible configurations for a given set of active orbitals are included. In calculations for excited states it is often useful to optimize the wavefunctions for several states simultaneously by minimizing the energy average of the states of interest (state-averaged MCSCF). The optimization of MCSCF wavefunctions is considerably more difficult than in the single configuration SCF case. This is due to the large number of variational parameters, which are often strongly coupled. Therefore, efficient second-order optimization methods, which take into account all first and second derivatives of the energy expectation value with respect to all variational parameters, have been developed. The description of such methods is beyond the scope of the present article, but several excellent reviews are available [10].
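The closed-shell SCF cycle described earlier in this section can be summarized in a short Python sketch. It assumes the AO integral arrays are already available (they would come from an integral program; no integral evaluation is shown), and it builds the standard closed-shell Fock matrix, solves the generalized eigenvalue problem and iterates to self-consistency. All names are illustrative.

import numpy as np
from scipy.linalg import eigh

def scf(h, eri, S, n_occ, max_iter=50, tol=1e-8):
    # h     : (nb, nb) one-electron AO integrals
    # eri   : (nb, nb, nb, nb) two-electron AO integrals (mu nu | rho sigma)
    # S     : (nb, nb) AO overlap matrix
    # n_occ : number of doubly occupied orbitals
    nb = h.shape[0]
    D = np.zeros((nb, nb))                       # one-electron density matrix
    E_old = 0.0
    for it in range(max_iter):
        J = np.einsum("mnrs,rs->mn", eri, D)     # Coulomb contribution
        K = np.einsum("mrns,rs->mn", eri, D)     # exchange contribution
        F = h + J - 0.5 * K                      # closed-shell Fock matrix
        eps, C = eigh(F, S)                      # generalized eigenvalue problem F C = S C eps
        Cocc = C[:, :n_occ]
        D = 2.0 * Cocc @ Cocc.T                  # D = 2 sum_k C_k C_k^T
        E = 0.5 * np.sum(D * (h + F))            # electronic SCF energy
        if abs(E - E_old) < tol:
            break
        E_old = E
    return E, eps, C

Given the integral arrays, scf(h, eri, S, n_occ) returns the converged electronic energy, the orbital energies and the MO coefficients used as the starting point of the correlation treatments discussed next.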
4 Configuration Interaction
Even though the overlap of the Hartree-Fock wavefunction with the exact wavefunction is often very close to one (⟨Φ_HF|Ψ_exact⟩ ≈ 0.95) and the Hartree-Fock energy accounts for typically 99% of the exact electronic energy, the accuracy of this approximation is frequently not sufficient for quantitative calculations. The difference between the Hartree-Fock energy and the exact energy is called the correlation energy. The correlation energy is due to dynamical electron correlation effects, which are not accounted for in the average potential of the Hartree-Fock approximation. The correlation energy often changes quite strongly as a function of geometry. Therefore, the inclusion of electron correlation effects is particularly important if energy differences like dissociation or reaction energies are considered. For small molecules such chemical energies are typically 10³-10⁴ times smaller than the total energies and of the same order of magnitude as the correlation energies. In order to compute the correlation energy, one can expand the wavefunction into a basis of CSFs (Eq. 16) and determine the linear coefficients by solving the matrix eigenvalue equation (20). The molecular orbitals (MOs) are taken from a preceding SCF or MCSCF calculation and kept fixed. The matrix elements over the Hamiltonian can be expressed in the form
where
are one- and two-electron integrals in the MO basis, respectively. These can be obtained by transforming the AO integrals (Eqs. 32, 33) into the MO basis,
where C is the matrix of MO coefficients [cf. Eq. (27)]. The coupling coefficients γ^IJ_rs and Γ^IJ_rs,tu depend only on the formal structure of the CSFs Φ_I and Φ_J. Even though the method of configuration interaction appears very straightforward, in practice one has to face the following problems:
(i) The convergence of the results with respect to the number of CSFs is slow. Therefore, the number of required CSFs and the dimension of the matrix H can become very large. Even though H is sparse, it may not be possible to store it in memory, or even on disk.
(ii) The number of two-electron integrals (rs|tu) can be extremely large.
(iii) The coupling coefficients Γ^IJ_rs,tu are difficult to compute. Similarly to H, these quantities are sparse, but their number may be too large to be stored on disk.
On the other hand, there are some facts which facilitate the problem:
(i) Only one or a few eigenvectors corresponding to the lowest eigenvalues are needed.
(ii) The matrix H usually satisfies the property |H_IJ| < |H_II − H_JJ| for most off-diagonal elements H_IJ. Therefore iterative schemes for finding the lowest eigenvalues and eigenvectors converge rather quickly.
(iii) Many coupling coefficients Γ^IJ_rs,tu are identical or related by simple factors, and this should be exploited by the algorithms.
Conventional CI calculations, in which the matrix H is explicitly constructed and stored on disk, are restricted to about 3·10⁴ CSFs. Since H is sparse, indirect addressing must be used, which makes vectorization difficult and may cause some overhead. Therefore, efficient direct CI techniques [12,13,14,15] have been developed, in which the matrix H is not constructed explicitly and never stored on disk. Instead, iterative schemes are used in which in each iteration a product g = H · c is computed. We will first consider the problem in general and then address some special cases. The iterative schemes used to find the lowest eigenvalues and eigenvectors are based on perturbation theory. The desired eigenvectors c^(n) are
expanded in a set of trial vectors Δc^(i,m)
Here, the indices n, m run over the number of desired eigenstates, and i is the number of the present iteration. The optimum expansion coefficients a_km are found by solving a small eigenvalue problem
with
From the vectors c^(i,n) as obtained by Eq. (43) in iteration i, a residual vector v^(i,n) is computed for each desired state n
The residual vectors are used to obtain a new set of trial vectors Δc^(i+1,n) according to first-order perturbation theory
The process is repeated until the norms of the residual vectors v^(i,n) are smaller than a certain threshold. The method as described above requires storing all expansion vectors Δc^(i,m) and Δg^(i,m). A general discussion of various variants of this so-called Davidson method can be found in Ref. [16]. The basic step in each iteration is the computation of the vectors Δg^(i,n) = H · Δc^(i,n). For simplicity, we will omit the indices (i, n) in the following. Using Eq. (38) the elements of the vector Δg can be written explicitly in the form
Comparison with Eq. (38) shows that in the computation of Δg the summations can be rearranged in such a way that the matrix elements H_IJ are never constructed. Such rearrangements are very important for an optimally efficient algorithm, since they may lead to a dramatic reduction of
the number of floating point operations. In principle, the formation of the product Δg = H·Δc would require N_CSF^2 multiplications, not counting additional operations for the construction of H. As will be shown below, for certain cases rearrangement of the summations and explicit use of the sparsity of the coupling coefficients Γ^IJ_rs,tu may lead to an almost linear rather than quadratic dependence of the computational effort on the number of CSFs. The optimum algorithm depends on the type of the approximation used. We will consider a few important cases in the following sections.
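The subspace scheme sketched above is compact enough to write down explicitly. The following is a minimal sketch of such a Davidson-type iteration for a symmetric matrix, assuming a user-supplied function matvec that returns the product H·c without constructing H, and the diagonal hdiag of H for the perturbative update; all names are illustrative, and refinements used in practice (subspace collapse, disk storage of the expansion vectors, locking of converged roots) are omitted:

import numpy as np

def davidson(matvec, hdiag, n, nroots=1, tol=1e-8, max_iter=50):
    """Lowest eigenpairs of a large symmetric matrix H, given only the product
    v -> H v (matvec) and the diagonal of H (hdiag)."""
    # start from unit vectors on the smallest diagonal elements
    V = np.zeros((n, nroots))
    for k, idx in enumerate(np.argsort(hdiag)[:nroots]):
        V[idx, k] = 1.0
    for _ in range(max_iter):
        V, _r = np.linalg.qr(V)             # orthonormalize the trial vectors
        # for clarity every H*v product is recomputed; real codes store them
        W = np.column_stack([matvec(V[:, k]) for k in range(V.shape[1])])
        Hsub = V.T @ W                      # small subspace eigenvalue problem
        eps, a = np.linalg.eigh(Hsub)
        eps, a = eps[:nroots], a[:, :nroots]
        X = V @ a                           # current eigenvector approximations
        R = W @ a - X * eps                 # residual vectors
        if np.all(np.linalg.norm(R, axis=0) < tol):
            return eps, X
        # first-order (perturbative) update for the new trial vectors
        den = eps - hdiag[:, None]
        den[np.abs(den) < 1e-12] = 1e-12
        V = np.column_stack([V, R / den])
    return eps, X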
4.1
Full CI (FCI)
As first shown by Knowles and Handy [17], full CI calculations can be efficiently performed in a basis of Slater determinants. Using the method of second quantization [5], the coupling coefficients can be expressed as
where E_rs are spin-coupled excitation operators
The operators η†_rσ and η_sσ are spin-orbital creation and annihilation operators, respectively. Using Eqs. (50, 51), Eq. (49) can be written as
with
The coupling coefficients in Eq. (53) can be further decomposed by inserting the resolution of the identity Σ_K |Φ_K⟩⟨Φ_K| = 1:
In the context of FCI optimization, this decomposition was first proposed by Siegbahn [18]. If a basis of Slater determinants is used, the coupling coefficients ⟨Φ_K|E_tu|Φ_J⟩ take only the values 0, ±1 and can efficiently be computed whenever needed. Thus, no coupling coefficients
need to be stored on disk. Eq. (55) can then be evaluated in the following steps:
All steps involve matrix multiplications and can be efficiently vectorized. Calculations with more than 10^7 Slater determinants have been performed using this method [19, 20, 21].
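The steps can be made concrete in a few dozen lines. The following is a minimal sketch of the determinant-driven σ = H·c construction for a spinless-fermion toy model, chosen only to keep the string algebra short: production FCI programs work with separate alpha and beta strings and heavily vectorized gathers, and all function and array names here (apply_Etu, sigma, h and g for the one- and two-electron MO integrals) are assumptions of the sketch.

import itertools
import numpy as np

def _count_below(det, p):
    # number of occupied orbitals below position p in the bitstring det
    return bin(det & ((1 << p) - 1)).count("1")

def apply_Etu(det, t, u):
    """Apply E_tu = a+_t a_u to a determinant stored as an occupation bitstring.
    Returns (phase, new_det), or None if the result vanishes."""
    if not (det >> u) & 1:
        return None
    tmp = det & ~(1 << u)
    if (tmp >> t) & 1:
        return None
    phase = (-1) ** _count_below(det, u) * (-1) ** _count_below(tmp, t)
    return phase, tmp | (1 << t)

def sigma(c, dets, h, g):
    """sigma = H c without ever forming H, using the resolution-of-the-identity
    decomposition of the two-electron coupling coefficients.
    h[t,u] and g[t,u,v,w] = (tu|vw) are the MO integrals."""
    n = h.shape[0]
    index = {d: i for i, d in enumerate(dets)}
    nd = len(dets)
    h1 = h - 0.5 * np.einsum("tvvw->tw", g)     # effective one-electron integrals
    # step 1:  D[K,v,w] = sum_J <Phi_K|E_vw|Phi_J> c_J
    D = np.zeros((nd, n, n))
    for J, dJ in enumerate(dets):
        for v in range(n):
            for w in range(n):
                r = apply_Etu(dJ, v, w)
                if r:
                    ph, dK = r
                    D[index[dK], v, w] += ph * c[J]
    # step 2:  G[K,t,u] = sum_vw (tu|vw) D[K,v,w]  (a pure matrix multiplication)
    G = np.einsum("tuvw,Kvw->Ktu", g, D)
    # step 3:  sigma_I = sum_tu h1_tu D[I,t,u] + 1/2 sum_K,tu <Phi_I|E_tu|Phi_K> G[K,t,u]
    sig = np.einsum("tu,Itu->I", h1, D)
    for K, dK in enumerate(dets):
        for t in range(n):
            for u in range(n):
                r = apply_Etu(dK, t, u)
                if r:
                    ph, dI = r
                    sig[index[dI]] += 0.5 * ph * G[K, t, u]
    return sig

# all determinants of n_elec particles in n_orb orbitals, as occupation bitstrings
n_orb, n_elec = 6, 3
dets = [sum(1 << p for p in occ) for occ in itertools.combinations(range(n_orb), n_elec)]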
4.2
Single-reference Correlation Treatments
According to first-order perturbation theory, the first-order correction to the Hartree-Fock wavefunction Φ_HF involves only configurations {Φ_I} that have a non-vanishing matrix element ⟨Φ_I|H|Φ_HF⟩. This first-order interacting space is spanned by all configurations which differ by at most two spin-orbitals from the SCF determinant, and can be generated by applying one- and two-particle excitation operators E_ai and E_ai E_bj, respectively, to the reference function Φ_HF. In this section, we will only consider closed-shell reference functions; a generalization to general MCSCF reference functions will be discussed in the next section. The singly and doubly excited CSFs are defined as
Here and in the following, the indices (i, j, k, l) refer to orbitals that are occupied in Φ_HF, and (a, b) refer to unoccupied (virtual or external) orbitals. The CISD wavefunction takes the form
A minimal set of equations which is sufficient to determine the unknown amplitudes c_a^i and C_ab^ij = C_ba^ji can be obtained by projecting the Schrödinger equation onto the space of functions Φ_a^i and Φ_ab^ij. Since the configurations Φ_ab^ij are not orthonormal, it is more convenient to project the Schrödinger equation onto the equivalent contravariant space [22, 23], defined such that its members satisfy
This space is spanned by the functions
Explicit forms of the residuals
can be expressed in terms of matrix multiplications involving the coefficient vectors c^i and matrices C^ij, as well as the integral matrices F_ab = ⟨φ_a|F|φ_b⟩, [J^kl]_ab = (ab|kl), and [K^kl]_ab = (ak|lb) [13, 22, 23]. For simplicity, we give the explicit expressions only for the doubles residuals V^ij and neglect the contributions of single excitations:
with
The contributions of integrals with four external indices are accounted for by the exchange operators
It is possible to compute these operators directly from the two-electron integrals in the AO basis [13, 23]. Thus, a full integral transformation can be avoided, a prerequisite for integral-direct implementations. All coupling coefficients are evaluated explicitly in this case, and there are only simple factors like ±1 and ±1/2 left. The structure of the equations depends only on the labels (i, j, k) of orbitals which are occupied in the SCF reference function. Thus, all logic involving the large virtual space (a, b) has been removed and brought into the form of matrix operations. In order to achieve this simplicity one has to use the unnormalized and non-orthogonal CSFs as defined in Eqs. (59) and (60). Of particular importance is that the sets of amplitudes and integrals have been ordered into matrices and vectors. As compared to early implementations of the direct CI method [12, 14], which were integral driven and involved some logic for each individual integral, here the largest possible blocks of integrals and amplitudes are treated together with the same logic.
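As an illustration of this matrix-driven organization, the following sketch evaluates the contribution of the external exchange operators, K(C^ij)_ab = Σ_cd (ac|bd) C^ij_cd, for all pairs at once; the array names (g_vvvv for the integrals with four external indices, C for the doubles amplitudes) are assumptions, and a production code would generate this contribution directly from AO integrals rather than storing g_vvvv, as noted above.

import numpy as np

def external_exchange(g_vvvv, C):
    """K(C^ij)_ab = sum_cd (ac|bd) C^ij_cd for all pairs (i,j) at once.
    g_vvvv[a,c,b,d] = (ac|bd);  C[i,j,a,b] = C^ij_ab."""
    return np.einsum("acbd,ijcd->ijab", g_vvvv, C)

For large basis sets this single contraction dominates the operation counts quoted below.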
An optimum implementation of the above equations requires about (1/4) m^2 N_ext^4 + 2 m^3 N_ext^3 multiplications, where m is the number of correlated orbitals (i, j) and N_ext the number of external orbitals (a, b). The number of CSFs is about N_CSF ≈ (1/2) m^2 N_ext^2. Thus, the computational effort is much smaller than N_CSF^2, which would be expected for the case that the product H·c is computed explicitly. In practice, an almost linear dependence of the computation time with respect to the number of CSFs is observed in most cases. The coupled electron pair approximation (CEPA) [24] and the coupled pair functional (CPF) [25] approach are variants of the CISD method which require the same computational effort. These approximations account in an approximate way for higher order excitations, which are not explicitly included in the wavefunction. Also the singles-doubles coupled cluster (CCSD) [26], quadratic configuration interaction (QCISD) [27], and Brueckner coupled-cluster (BCCD) [28] methods can be formulated in a very similar way. In the latter three cases the computational effort scales with about (1/2) m^2 N_ext^4 + 4 m^3 N_ext^3 per iteration, and is therefore not very much higher than for the simple CISD case. Møller-Plesset perturbation theory [29] is also closely related: in order to compute the second-order energy (MP2), one only needs V^ij = K^ij; the third-order energy (MP3) can be computed from the CID residual as given above, while for MP4(SDQ) one needs the CID residual and in addition the CCSD G^ij terms; in the latter case the computational effort scales with (1/2) m^2 N_ext^4 + 6 m^3 N_ext^3. In contrast to the CI and CCSD methods, Møller-Plesset perturbation theory is non-iterative and therefore the residuals have to be computed only once. The accuracy of the CCSD and MP4 results can be considerably improved by adding perturbative corrections for triple excitations [CCSD(T) and MP4(SDTQ)] [30]. These corrections require computational effort proportional to m^3 N_ext^4 and often take most of the time. Compact explicit expressions for the CISD, CCSD, QCISD, and BCCD methods can be found in Ref. [23]. The formulation has also been extended to open-shell reference functions [31]. All methods mentioned above have been implemented in the MOLPRO package of ab initio programs [32].
4.3
Multireference Configuration Interaction (MRCI)
All methods described in the previous section use a single determinant Hartree-Fock wavefunction as the zeroth-order approximation. As already mentioned, such methods usually cannot be applied for excited state calculations or if global potential energy surfaces are needed. In such cases MCSCF or CASSCF wavefunctions are required as the zeroth-order approximation. Dynamical correlation effects can then be included by adding further configurations to the wavefunction. In most cases, only single and double excitations, which span the first-order interacting space relative
to the MCSCF reference function, are considered. The coefficients of all CSFs (including the reference CSFs) are optimized without changing the orbitals by solving the corresponding matrix eigenvalue equation (20). This method is denoted singles-doubles multireference configuration interaction (MRCISD). The MRCI wavefunction [33, 34, 35, 36] can in general be written as
where all configurations which differ by at most two orbitals from any of the reference configurations are included in the expansion. The state index n has been omitted for clarity. The configuration space is conveniently decomposed into three subspaces: the internal CSFs {Φ_I}, which are constructed from the same orbital subspace as the reference CSFs and include these as a subset; the singly external CSFs {Φ_S^a}, which contain one electron in the complementary external orbital space; and finally the doubly external CSFs {Φ_P^ab}, in which two electrons have been excited to external orbitals. The indices S and P denote all possible internal N−1 and N−2 electron hole states, respectively, which can be obtained by annihilating one or two electrons from any of the reference CSFs. The disadvantage of this method is that the number of the S and P hole states, and hence the length of the configuration expansion and the computational cost, increases very rapidly with the number of reference CSFs. Therefore, the number of reference CSFs is typically limited to about 30, except for calculations with very small basis sets. This makes it necessary to apply configuration selection schemes to the reference space, but such reference selection is often not straightforward and its effect on the accuracy of the final results is difficult to predict. The length of the configuration expansion can also be reduced by applying selection and extrapolation schemes for the excited configurations {Φ_S^a} and {Φ_P^ab} [33]. In order to eliminate these problems, two different contraction schemes have been proposed, which reduce the number of variational parameters in the calculation. In the external contraction proposed by Siegbahn [37, 38], all configurations with the same internal states S or P, but different external orbitals, are contracted with fixed coefficients
The contraction coefficients c_a^S and C_ab^P are determined by first-order perturbation theory. The contracted configurations are then used as a basis in a relatively small variational optimization of the wavefunction
The computational effort for this approximation corresponds to that of 1-2 iterations of a fully uncontracted MRCI calculation, and thus the computation time is typically reduced by a factor of 5. However, the main disadvantage of the uncontracted MRCI, namely the strong scaling of the effort with the number of reference CSFs, is still present. The loss of correlation energy by this contraction amounts to about 3%. In the internally contracted MRCI [39, 40, 41, 42, 43, 44, 45], as originally proposed by Meyer [39] and first worked out in detail and implemented by Werner and Reinsch [41], certain groups of the internal S and P states are contracted. In this case, the contracted configurations are generated by applying two-particle excitation operators to the reference function as a whole:
If the reference function is a single Slater determinant, these definitions are equivalent to Eqs. (59, 60). With a multiconfiguration reference function
one obtains
This shows that configurations obtained by applying the same excitation operators to different reference CSFs are contracted with fixed coefficients α_R. These coefficients are computed once by performing a CI calculation in the space of reference CSFs (these need not necessarily be the same as in a preceding MCSCF or CASSCF calculation). The internally contracted MRCI wavefunction then has the form
Again, the reference configurations are a subset of the internal configurations {Φ_I}, and their coefficients are fully reoptimized in the MRCI. However, the contracted singly and doubly external configurations introduce some new complications. Firstly,
the structure of the contracted configurations is very complex and the coupling coefficients are more difficult to compute than in the uncontracted case. Secondly, the contracted configuration subsets are neither normalized nor orthogonal. They may not even be linearly independent. Therefore, redundant configurations must be eliminated from the summations [41, 42], and for the perturbative update in each iteration a transformation of the residual vector to an orthogonalized basis is necessary. The transformation depends only on the labels of the internal orbitals (i, j, k); nevertheless, the overlap and transformation matrices for the singly external configurations can become very large (m^6 elements). Therefore, in a more recent implementation of Werner and Knowles [42, 43, 44, 45], only the doubly external CSFs are internally contracted, while uncontracted internal CSFs {Φ_I} and singly external CSFs {Φ_S^a} are used otherwise. Very efficient techniques for computing the required coupling coefficients have been developed [43], which allow them to be computed directly whenever needed. Using these techniques, MRCI calculations with up to about 10^4 reference CSFs and large basis sets have been performed. Some of these calculations were equivalent to uncontracted MRCI calculations with about 3·10^8 CSFs, but "only" about 2·10^6 variational parameters had to be optimized. Various comparisons of internally contracted and uncontracted MRCI calculations have demonstrated that the accuracy of the results is virtually the same in most cases; the loss of correlation energy due to the internal contraction amounts to only 0.3-0.5% [43, 44, 46]. The explicit equations for the residual vectors have a matrix structure similar to that of Eq. (66). For instance, the pair-pair contribution to the matrix G^ij takes the form
where the coupling coefficients ⟨0|E_ik,jl|0⟩ and ⟨0|E_ik,jl,mn|0⟩ are the second- and third-order reduced density matrices of the reference function, respectively. The coupling coefficients can be decomposed into simpler quantities using techniques similar to those described in Section 4.1. As in the single reference case, the problem has been structured in an optimum way by decomposing the large sets of two-electron integrals and CI coefficients into vectors and matrices. All coupling coefficients depend only on the labels of internal orbitals, and thus a single coefficient is used for whole blocks of integrals and coefficients.
5
Vectorization and Parallelization
A few remarks about vectorization and parallelization are made in this section. Firstly, the matrix structure of the methods discussed in the previous sections implies that vectorization is mostly over the external orbital space. Thus, typical vector lengths are equal to the number of orbitals in one symmetry, which is usually less than 150. On most machines, this is too short for obtaining optimum speed in simple vector operations. However, matrix multiplications, which strongly dominate all algorithms, run at maximum speed for such dimensions. This is not only true for vector machines, but also for modern RISC workstations, since blocking and unrolling techniques allow the fast cache to be used in a most efficient manner. Parallelization is less straightforward. Single vector or matrix operations are too small to be effectively parallelized on most machines. However, a coarse grain parallel implementation of CI and CCSD methods appears to be promising: for instance, the computation of the matrices G^ij can be distributed to different processors, and parallelization is then over different pairs (ij). In large scale CI or CCSD calculations there may be 10^2 to 10^3 such pairs, which is sufficiently large to obtain a good load balance on machines with a moderate number of processors. However, each processor has to access many J^kl, K^kl, and C^kl matrices. Therefore, on distributed memory machines or clusters of workstations, it may be necessary to store all these quantities on all nodes in order to minimize data communication overhead. This would require a large memory on each node and involve a lot of redundancy. On the other hand, shared memory machines appear to be better suited for these methods, since in this case all processors can access the same set of matrices and redundancy is minimized. A shared memory parallel implementation of CCSD and MRCI methods along these lines is presently being developed in our laboratory.
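A minimal sketch of such a coarse-grained distribution over pairs (ij) is given below, using the external-exchange contraction of Section 4.2 as a stand-in for the full construction of G^ij; the routine names and toy dimensions are assumptions, and passing the large shared arrays to every worker, as done here for simplicity, corresponds to the replication that would be required on distributed-memory machines.

from multiprocessing import Pool
import numpy as np

def _pair_task(args):
    # one coarse-grained task: build one pair block (here only its
    # external-exchange contribution K(C^ij)_ab = sum_cd (ac|bd) C^ij_cd)
    i, j, g_vvvv, C = args
    return (i, j), np.einsum("acbd,cd->ab", g_vvvv, C[i, j])

def parallel_pair_blocks(g_vvvv, C, nproc=4):
    m = C.shape[0]
    tasks = [(i, j, g_vvvv, C) for i in range(m) for j in range(i + 1)]
    with Pool(nproc) as pool:
        return dict(pool.map(_pair_task, tasks))

if __name__ == "__main__":
    m, nv = 4, 8                       # toy numbers of correlated and external orbitals
    rng = np.random.default_rng(1)
    g_vvvv = rng.standard_normal((nv, nv, nv, nv))
    C = rng.standard_normal((m, m, nv, nv))
    blocks = parallel_pair_blocks(g_vvvv, C)
    print(len(blocks), "pair blocks computed")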
6
Conclusions
An overview of the various stages of ab initio methods for computing correlated electronic wavefunctions and energy eigenvalues has been given. Due to the extreme complexity of the task, which requires the solution of a second-order partial differential equation with a very large number of coordinates, the problem is decomposed into several steps. Firstly, a contracted basis set of non-orthogonal one-electron functions is chosen, and the required integrals over this basis are computed. Secondly, an orthonormal set of optimized one-electron functions (orbitals) is computed using the Hartree-Fock self-consistent field or multiconfiguration self-consistent field methods. The orbitals are represented by linear combinations of the contracted basis functions. The N-electron wavefunction is then represented as a linear
combination of antisymmetrized products of the optimized orbitals, and the linear coefficients are optimized by solving a very large eigenvalue problem. Various special types of such configuration expansions have been considered, and efficient ways to solve the eigenvalue problem iteratively have been discussed. Decomposition of the large integral and coefficient sets into vectors and matrices not only leads to a minimization of the logic involved, but also allows the algorithms to be formulated in terms of matrix multiplications. This results in very effective vectorization, and also suggests the possibility of coarse grain parallel implementations. Due to the enormous amount of data needed simultaneously, shared memory machines appear to be best suited for effective parallelization.
References
[1] M. Born and J. R. Oppenheimer, Zur Quantentheorie der Molekeln, Ann. Phys. (Leipzig), 84 (1927), pp 457-484.
[2] M. Baer, Quantum-mechanical treatment for charge-transfer processes in ion-molecule collisions, in Adv. Chemical Phys., Vol. LXXXII (1992), pp 187-241, and references therein.
[3] C. A. Mead and D. G. Truhlar, Conditions for the definition of a strictly diabatic electronic basis for molecular systems, J. Chem. Phys. 77 (1982), pp 6090-6098.
[4] J. Almlof, Electronic Structure Theory, in Mathematical Frontiers in Computational Chemical Physics, edited by D. G. Truhlar (Springer Verlag, New York, 1988), pp 18-39.
[5] R. McWeeny, Methods of Molecular Quantum Mechanics, Second Edition, Academic Press, London (1989).
[6] H. Weyl, The Theory of Groups and Quantum Mechanics, Dover, New York (1956).
[7] E. A. Hylleraas and B. Undheim, Numerische Berechnung der 2S Terme von Ortho- und Para-Helium, Z. Phys. 65 (1930), pp 759-772.
[8] J. K. L. MacDonald, Successive approximations by the Rayleigh-Ritz variation method, Phys. Rev. 43 (1933), 830.
[9] For a recent review see, e.g.: J. Cioslowski, Ab initio calculations on large molecules: methodology and applications, in Reviews in Computational Chemistry, Vol. 4, pp 1-33, ed. K. B. Lipkowitz and D. B. Boyd, VCH Publishers, New York, 1993.
[10] For reviews see, e.g.: H.-J. Werner, Matrix-formulated direct multiconfiguration self-consistent field and multiconfiguration reference configuration-interaction methods, Adv. Chem. Phys. 49 (1987), pp 1-62; R. Shepard, The multiconfiguration self-consistent field method, Adv. Chem. Phys. 49 (1987), pp 63-200.
[11] K. Ruedenberg, L. M. Cheung, and S. T. Elbert, MCSCF optimization through combined use of natural orbitals and the Brillouin-Levy-Berthier theorem, Int. J. Quantum Chem. 16 (1979), pp 1069-1101.
[12] B. O. Roos, A new method for large-scale CI calculations, Chem. Phys. Lett. 15 (1972), pp 153-159.
[13] W. Meyer, Theory of self-consistent electron pairs. An iterative method for correlated many-electron wavefunctions, J. Chem. Phys. 64 (1976), pp 2901-2907.
[14] B. O. Roos and P. E. M. Siegbahn, The direct configuration interaction method from molecular integrals, in Modern Theoretical Chemistry, ed. H. F. Schaefer III (Plenum, New York, 1977).
[15] For a review, see I. Shavitt, The treatment of electron correlation: where do we go from here?, in Advanced Theories and Computational Approaches to the Electronic Structure of Molecules, ed. C. E. Dykstra (Reidel, Dordrecht, 1983).
[16] E. R. Davidson, Super-matrix methods, Computer Phys. Comm. 53 (1989), pp 49-60.
[17] P. J. Knowles and N. C. Handy, A new determinant-based full configuration interaction method, Chem. Phys. Lett. 111 (1984), pp 315-321.
[18] P. E. M. Siegbahn, A new direct CI method for large CI expansions in a small orbital space, Chem. Phys. Lett. 109 (1984), pp 417-423.
[19] P. J. Knowles, Very large full configuration interaction calculations, Chem. Phys. Lett. 155 (1989), pp 513-517.
[20] J. Olsen, P. Jørgensen, and J. Simons, Passing the one-billion limit in full configuration interaction (FCI) calculations, Chem. Phys. Lett. 169 (1990), pp 463-472.
[21] R. J. Harrison, Approximating full configuration interaction with selected configuration interaction and perturbation theory, J. Chem. Phys. 94 (1991), pp 5021-5031.
[22] P. Pulay, S. Saebø, and W. Meyer, An efficient reformulation of the closed-shell self-consistent electron pair theory, J. Chem. Phys. 81 (1984), pp 1901-1905.
[23] C. Hampel, K. A. Peterson, and H.-J. Werner, A comparison of the efficiency and accuracy of the quadratic configuration interaction (QCISD), coupled cluster (CCSD), and Brueckner coupled cluster (BCCD) methods, Chem. Phys. Lett. 190 (1992), pp 1-12.
[24] W. Meyer, PNO-CI studies of electron correlation effects. I. Configuration expansion by means of non-orthogonal orbitals, and application to the ground state and ionized states of methane, J. Chem. Phys. 58 (1973), 1017; W. Kutzelnigg, Pair correlation theories, in Methods of Electronic Structure Theory, ed. H. F. Schaefer III (Plenum, New York, 1977).
[25] R. Ahlrichs and P. Scharf, The coupled-pair approximation, Adv. Chemical Physics LXVII (1987), pp 501-537.
[26] J. Cizek, On the correlation problem in atomic and molecular systems. Calculation of wavefunction components in Ursell-type expansion, using quantum-field theoretical methods, J. Chem. Phys. 45 (1966), pp 4256-4266; J. Cizek and J. Paldus, Correlation problems in atomic and molecular systems. III. Rederivation of the coupled-pair many-electron theory using the traditional quantum chemical methods, Int. J. Quantum Chem. 5 (1971), pp 359-379; G. D. Purvis and R. J. Bartlett, A full coupled-cluster singles and doubles model: The inclusion of disconnected triples, J. Chem. Phys. 76 (1982), pp 1910-1918; J. Noga and R. J. Bartlett, The full CCSDT model for molecular electronic structure, J. Chem. Phys. 86 (1987), pp 7041-7050; G. E. Scuseria and H. F. Schaefer III, A new implementation of the full CCSDT model for molecular electronic structure, Chem. Phys. Lett. 152 (1988), pp 382-386.
[27] M. Head-Gordon, J. A. Pople, and K. Raghavachari, Quadratic configuration interaction. A general technique for determining electron correlation energies, J. Chem. Phys. 87 (1987), pp 5968-5975.
[28] R. J. Bartlett, C. E. Dykstra, and J. Paldus, Coupled-cluster methods for molecular calculations, NATO ASI, Vol. 133 (Reidel, Dordrecht, 1987), pp 127-160; N. C. Handy, J. A. Pople, M. Head-Gordon, K. Raghavachari, and G. W. Trucks, Size-consistent Brueckner theory limited to double substitutions, Chem. Phys. Lett. 164 (1989), pp 185-192.
[29] J. A. Pople, J. Binkley, and R. Seeger, Theoretical models incorporating electron correlation, Int. J. Quant. Chem. Symp. 10 (1976), pp 1-19; J. A. Pople, R. Krishnan, H. B. Schlegel, and J. S. Binkley, Derivative studies in Hartree-Fock and Møller-Plesset theories, Int. J. Quant. Chem. Symp. 13 (1979), pp 225-241.
[30] K. Raghavachari, G. W. Trucks, J. A. Pople, and M. Head-Gordon, A fifth-order perturbation comparison of electron correlation theories, Chem. Phys. Lett. 157 (1989), pp 479-483.
[31] P. J. Knowles, C. Hampel, and H.-J. Werner, Coupled cluster theory for high spin, open shell reference wave functions, J. Chem. Phys. 99 (1993), pp 5219-5227.
[32] MOLPRO is a package of ab initio programs written by H.-J. Werner and P. J. Knowles, with contributions from J. Almlof, R. D. Amos, M. J. O. Deegan, S. T. Elbert, C. Hampel, W. Meyer, K. Peterson, R. Pitzer, A. J. Stone, and P. R. Taylor.
[33] R. J. Buenker, S. D. Peyerimhoff, and W. Butscher, Applicability of the multi-reference double-excitation CI (MRD-CI) method to the calculation of electronic wavefunctions, and comparison with related techniques, Mol. Phys. 35 (1978), pp 771-791.
[34] P. E. M. Siegbahn, Generalizations of the direct CI method based on the graphical unitary group approach. I. Single replacements from a complete CI root function of any spin, first order wavefunctions, J. Chem. Phys. 70 (1979), pp 5391-5397.
[35] P. E. M. Siegbahn, Generalizations of the direct CI method based on the graphical unitary group approach. II. Single and double replacements from any set of reference configurations, J. Chem. Phys. 72 (1980), pp 1647-1656.
[36] V. R. Saunders and J. H. van Lenthe, The direct CI method. A detailed analysis, Mol. Phys. 48 (1983), pp 923-954.
[37] P. E. M. Siegbahn, The direct configuration interaction method with a contracted configuration expansion, Chem. Phys. 25 (1977), pp 197-205.
[38] P. E. M. Siegbahn, The externally contracted CI method applied to N2, Int. J. Quantum Chem. 23 (1983), pp 1869-1889.
[39] W. Meyer, Configuration expansion by means of pseudonatural orbitals, in Modern Theoretical Chemistry, ed. H. F. Schaefer III (Plenum, New York, 1977).
[40] P. E. M. Siegbahn, Direct configuration interaction with a reference state composed of many reference configurations, Int. J. Quant. Chem. 18 (1980), pp 1229-1242.
[41] H.-J. Werner and E. A. Reinsch, The self-consistent electron pairs method for multiconfiguration reference state functions, J. Chem. Phys. 76 (1982), pp 3144-3156.
[42] H.-J. Werner and P. J. Knowles, An efficient internally contracted multiconfiguration-reference configuration interaction method, J. Chem. Phys. 89 (1988), pp 5803-5814.
[43] P. J. Knowles and H.-J. Werner, An efficient method for the evaluation of coupling coefficients in configuration interaction calculations, Chem. Phys. Lett. 145 (1988), pp 514-522.
[44] H.-J. Werner and P. J. Knowles, A comparison of variational and non-variational internally contracted multiconfiguration-reference configuration interaction calculations, Theor. Chim. Acta 78 (1992), pp 175-187.
[45] P. J. Knowles and H.-J. Werner, Internally contracted multiconfiguration-reference configuration interaction calculations for excited states, Theor. Chim. Acta 84 (1992), pp 95-103.
[46] K. A. Peterson, R. A. Kendall, and Th. Dunning, Benchmark calculations with correlated molecular wavefunctions. III. Configuration interaction calculations on first row homonuclear diatomics, J. Chem. Phys. 99 (1993), pp 9790-9805.
Chapter 15
Bound States of Strongly Coupled Multidimensional Molecular Hamiltonians by the Discrete Variable Representation Approach
Zlatko Bacic
Abstract
Discrete variable representation (DVR), rigorously defined on the Gaussian quadrature points of classical orthogonal polynomials, is described. The DVR matrix of multidimensional Hamiltonians is easy to form, and is very sparse. DVR matrices of all coordinate functions, including the potential, are diagonal, avoiding the need for numerical integration in several dimensions. Sparseness of the Hamiltonian is exploited to formulate a highly effective method for reducing by a large factor (10-20) the size of the final Hamiltonian matrix, without lowering the accuracy of the calculated eigenstates. The method involves a physically motivated decomposition of the full Hamiltonian into a hierarchy of lower dimensional Hamiltonians, whose truncated sets of eigenvectors provide a compact basis for the problems in higher dimension.
1
Introduction
The past two decades have witnessed extraordinary advances in the experimental techniques of molecular physics such as laser spectroscopy and supersonic molecular beams. This has enabled direct probing of molecules in very highly excited vibrational states, previously beyond reach. High-lying vibrational states, often close to the dissociation limit of a molecule, are involved in various processes of fundamental and practical importance, such as multiphoton excitation, laser-induced unimolecular decomposition, and collisional and intramolecular vibrational energy transfer. Quantitative understanding of these and related processes is not possible without accurate computation of vibrationally highly excited molecular eigenstates.
This work was supported in part by a grant from the National Science Foundation (Grant CHE-9312312). Department of Chemistry, New York University, 4 Washington Place, New York, NY 10003.
The wave functions of highly excited vibrational states are extensively delocalized over large portions of the potential energy surface, which may include multiple high-energy local minima (isomers) [1]. In addition, at such high excitation energies, the coupling between the internal degrees of freedom of the molecule is strong. These distinct features make the task of calculating the energies and wave functions of high-lying vibrational states very difficult. It became imperative to develop novel theoretical approaches [1, 2, 3], which are different, both conceptually and computationally, from the traditional methods applicable to low-energy vibrations [4, 5]. A breakthrough in this area has been achieved by the emergence of bound state methods based on localized representations [1, 2, 3], in particular the discrete variable representation [1], which is the subject of this review. These pointwise bases, which include distributed Gaussians [6, 7], provide a much more compact description of the high-lying delocalized eigenstates. They also allow formulation of general and powerful computational procedures for drastic reduction of the size of the Hamiltonian matrix [1], permitting calculations for molecular systems and energy regimes intractable by other methods.
2
General Remarks about Calculation of Molecular Eigenstates
The Hamiltonian for a molecule with l nuclei has 3l coordinates. However, describing the relative positions of these l nuclei requires only 3l − 6 internal coordinates (the other six coordinates are associated with the motion of the center of mass, which is of no interest for our purposes) [1, 4]. This number of N = 3l − 6 internal, purely vibrational degrees of freedom is usually referred to as the (physical) dimensionality of the problem (and of its Hamiltonian). Notice that we focus solely on the nuclei, with no mention of the electrons which are, of course, present. This is because the theoretical treatment of the vibrational eigenstates involves the Born-Oppenheimer (BO) approximation [8, 9]. The BO approximation, which is based on the large difference between the mass of an electron and the masses of nuclei, permits separate treatment of electronic and nuclear motions. In this approximation, one of the cornerstones of modern chemistry, the electrons generate a potential energy surface (PES) V(Q) [8], which is a function of only 3l − 6 nuclear coordinates Q. The PES V(Q) governs the motion of the nuclei. Given a set of internal coordinates Q of the nuclei, one can derive (sometimes with great difficulty) the vibrational Hamiltonian H(Q) [4, 10, 11]. Calculation of the vibrational eigenstates requires solving the Schrödinger equation H(Q) Ψ_n(Q) = E_n Ψ_n(Q) (1). Ψ_n(Q) and E_n are the eigenfunctions and eigenvalues of H and represent
the vibrational wave functions and energy levels, respectively. To represent a bound state solution, Ψ_n(Q) has to tend to zero when any of the internal coordinates Q_i goes to infinity [12]. Already for small polyatomic molecules (l > 3), accurate solution of the multidimensional, nonseparable Schrödinger equation in (1) is a most challenging large-scale computational problem [1, 4, 5].
3
Finite Basis Representation
In order to introduce the notion of the finite basis representation (FBR), necessary for later discussion, it is useful to review the variational method for calculating the eigenstates of a Hamiltonian [4, 9]. For simplicity, we consider the problem of finding the bound states of a particle with mass m in a 1D potential V(x). The Schrödinger equation which needs to be solved is H Ψ_n(x) = E_n Ψ_n(x) (2), where Ψ_n and E_n are the desired eigenfunctions and eigenvalues, respectively, and H is the Hamiltonian H = −(ħ²/2m) d²/dx² + V(x) (3). Ψ_n(x) is subject to the boundary condition that it must vanish for x → ±∞ [12]. In the variational method, the eigenfunctions Ψ_n in (2) are expanded in terms of orthonormal basis functions {φ_i(x) | i = 1, 2, ..., M} [4, 9].
From now on, it is important to distinguish the dimension (size) M of the basis set (and the Hamiltonian matrix), equal to the number of basis functions, from the dimensionality N of the vibrational problem, which is equal to the number of internal coordinates (3l − 6 for l nuclei). We assume, for convenience, that {φ_i(x)} are classical orthogonal polynomials [13] satisfying the orthogonality condition on the interval [a, b]:
In (4), w(x) is the appropriate weight function [13]. The expansion of Ψ_n in the basis consisting of functions {φ_i(x)}
will be referred to as the finite basis representation (FBR) [1]. When a basis set is used to represent Ψ_n, the Schrödinger equation in (2) becomes a matrix eigenvalue problem [1, 4, 5]:
In (6), H^FBR is the M × M matrix of the Hamiltonian in (3), with elements H^FBR_ij = ⟨φ_i|H|φ_j⟩ (7). V^FBR is also an M × M matrix, whose columns f_1^FBR, f_2^FBR, ..., f_M^FBR are the eigenvectors of H^FBR, associated with the eigenvalues E_1, E_2, ..., E_M constituting the diagonal matrix E. Both are obtained by the diagonalization of H^FBR.
4
Discrete Variable Representation in One Dimension
An alternative, pointwise representation has been introduced by Light, Bacic, and collaborators [1, 2, 3] for the purpose of solving bound state problems. In this representation, denoted as the discrete variable representation (DVR) [1, 14], the solutions of the Schrödinger equation are represented by their amplitudes at the coordinate (DVR) points. We use the 1D bound state problem of the preceding section to define the DVR. In the FBR basis {φ_i(x)}, the coordinate matrix X^FBR of x, where
is not diagonal (neither is the potential V(x)). However, X^FBR can be diagonalized by means of a transformation matrix T [1, 3, 14]:
X^DVR in (9) is diagonal. Its M diagonal elements {x_α} (eigenvalues of X^FBR) define the points of the DVR [1, 3, 14]. T consists of the eigenvectors of X^FBR, and is unitary:
T is also the matrix which transforms between the FBR and the DVR, in a way described below. In the special but very important case when {φ_i(x)} are classical orthogonal polynomials [13] obeying the orthonormality condition of (4), it has been shown [15] that the elements of the FBR-DVR transformation matrix T are T_iα = √ω_α φ_i(x_α) (11). In (11), ω_α and x_α are the weights and points, respectively, of the M-point Gaussian quadrature associated with {φ_i(x)}. In this case, the quadrature points {x_α} constitute the DVR. The FBR of H, H^FBR in (6), can be transformed to the DVR using the FBR-DVR transformation matrix T [1, 3, 14]:
where
The unitary transformation in (12) and (13) is exact and does not introduce any approximation in H^DVR. In addition, since T is unitary, H^DVR remains Hermitian, just like its FBR counterpart H^FBR. We now make an approximation which is the key to the simplicity and power of the DVR, that the matrix of the potential V(x) (and of any other coordinate function) is diagonal in the DVR [1, 3, 14],
In view of the central role of this approximation, it is important to assess its accuracy. With the help of (11), (13) and (14), it is easy to show that
Thus, the DVR approximation in (14) is equivalent to evaluating the potential matrix element in the FBR by an M-point Gaussian quadrature [13]. It therefore becomes highly accurate as the dimension (M) of the DVR increases. The fact that DVR matrices of the potential and other coordinate functions can be treated as diagonal is one of the important advantages that the DVR has over the FBR. Calculation of the potential matrix elements in the DVR involves no numerical integration; only the values of the potential at the DVR points are needed. The resulting reduction of the computational effort can be dramatic, especially for the multidimensional calculations discussed later, involving complicated potential functions. The eigenstates of H^DVR are obtained by diagonalization.
A caveat should be made at this point. Due to the approximation in (14), the DVR is, strictly speaking, not variational. It is evident from (15) that the size (M) of the DVR basis also controls the Gaussian quadrature accuracy of evaluating the potential matrix elements. Too few DVR points will lead to errors in the quadrature; the eigenvalues of such an H^DVR can be below the exact ones. This has never been a problem in practice, though, where at least 10-15 DVR points per coordinate are used, assuring an accurate calculation of the potential matrix elements. When the FBR basis {φ_i(x)} consists of polynomials in x, neither the FBR nor the resulting DVR points {x_α} reflect in any way the properties
of the PES. To improve this aspect of the DVR, a so-called potential-optimized DVR (PO-DVR) has been proposed [16], where the eigenstates of a 1D reference Hamiltonian serve as the FBR basis {φ_i(x)}, in which the coordinate matrix X^FBR in (8) is formed. In that case, the DVR points obtained by diagonalizing X^FBR are to some extent adapted to the PES. A slight disadvantage of this PO-DVR is that the diagonal DVR approximation in (14) is not of Gaussian accuracy [16]. We point out that the 1D DVR and the FBR-DVR transformation matrices appeared first in the work of Harris et al. [17]. However, they used the DVR merely as a numerical technique for calculating potential matrix elements, retaining the FBR as the primary representation in which the eigenvalue calculations and analysis were carried out. The work of Light, Bacic, and coworkers established the DVR as a genuine representation in its own right, isomorphic with the FBR, but superior to it in ways that will become apparent soon.
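The whole 1D construction is compact enough to show in a few lines. The following is a minimal sketch using a harmonic-oscillator (Hermite) FBR; the specific basis, the anharmonic test potential, and the parameter values are illustrative assumptions, and units with ħ = m = ω = 1 are used throughout.

import numpy as np

M = 30                                        # size of the FBR (and DVR) basis
i = np.arange(M)

# Coordinate matrix X^FBR in the harmonic-oscillator (Hermite) basis:
# <i|x|i+1> = sqrt((i+1)/2).
X = np.zeros((M, M))
off = np.sqrt((i[:-1] + 1) / 2.0)
X[i[:-1], i[:-1] + 1] = off
X[i[:-1] + 1, i[:-1]] = off

# Diagonalizing X^FBR gives the DVR points (its eigenvalues) and the
# FBR-DVR transformation matrix T (its eigenvectors), cf. Eqs. (9)-(11).
x_dvr, T = np.linalg.eigh(X)

# Matrix of p^2 in the same basis:  <i|p^2|i> = (2i+1)/2,
# <i|p^2|i+2> = <i+2|p^2|i> = -sqrt((i+1)(i+2))/2.
P2 = np.diag((2 * i + 1) / 2.0)
off2 = -np.sqrt((i[:-2] + 1) * (i[:-2] + 2)) / 2.0
P2[i[:-2], i[:-2] + 2] = off2
P2[i[:-2] + 2, i[:-2]] = off2

# Kinetic energy p^2/2m transformed exactly to the DVR, Eqs. (12)-(13).
T_kin = T.T @ (0.5 * P2) @ T

# DVR approximation (14): the potential matrix is diagonal at the DVR points.
def V(x):
    return 0.5 * x**2 + 0.1 * x**4            # an illustrative anharmonic potential

H_dvr = T_kin + np.diag(V(x_dvr))
print(np.linalg.eigvalsh(H_dvr)[:5])          # lowest bound-state energies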
4.1
The Wave Functions in Discrete Variable Representation
We showed that the FBR and DVR matrices of the Hamiltonian H are related through the unitary transformation in (12) and (13). But what is the connection between the wave functions, i.e., the eigenvectors, in the two representations? To show the connection, we multiply both sides of (6) from the left with T†, and insert T T† = 1 between H^FBR and V^FBR in (6), which gives
Because of (12), (16) can be written as
where
Remember that V^FBR is comprised of the column-wise eigenvectors f_1^FBR, f_2^FBR, ..., f_M^FBR. The same is true for V^DVR. Consequently, the FBR and DVR eigenvectors of an eigenstate Ψ_n, f_n^FBR and f_n^DVR, respectively, are related by
The connection between the eigenvectors in the FBR and the DVR is shown in (19). We said before that the elements of f_n^FBR are the expansion coefficients of the eigenfunction Ψ_n(x) in the basis {φ_i(x)}. What about the elements f_nk^DVR of f_n^DVR? It is easy to show that
i.e., the k-th element of the n-th DVR eigenvector f_n^DVR is equal to the amplitude of the eigenfunction Ψ_n(x) at the DVR point x_k, times the square root of the weight ω_k. In actual calculations, it is seldom necessary to transform the eigenvectors from one representation to another using (19). The calculation of eigenstates and expectation values of various observables can be performed entirely in either the FBR or the DVR (although, in practice, many of the demanding calculations done in the DVR would not have been feasible in the FBR). When making wave function plots, though, it is desirable to transform the DVR eigenvectors to the FBR via (19), since the DVR points are usually too far apart to allow straightforward interpolation.
5
Multidimensional Discrete Variable Representation
Extending the DVR to more than one dimension is rather straightforward. A multidimensional DVR is constructed simply by forming a direct product of 1D DVRs defined for each internal coordinate [2, 3, 18, 19, 20, 21, 22, 23]. Let us consider a 3D system described by coordinates x, y, z. It should be understood that x, y, z stand for any orthogonal coordinates, not just Cartesian. When applied to coordinates, the term "orthogonal" means that the corresponding kinetic energy operator has no mixed second-order differential operators. Let {x_α}, {y_β}, {z_γ} be the 1D DVRs in x, y, z, respectively. They are generated by the procedure described above, i.e., by diagonalizing the coordinate matrices X, Y, and Z, formed in the 1D FBR bases {φ_i^x}, {φ_j^y}, and {φ_k^z}, respectively. The 3D DVR is then comprised of triplets of points {x_α, y_β, z_γ; α = 1, ..., M_x; β = 1, ..., M_y; γ = 1, ..., M_z} [2, 3, 18, 19, 20, 21, 22]. The size of this 3D DVR is M_x × M_y × M_z. The 3D FBR-DVR transformation matrix T is just a direct product of the 1D transformation matrices for the internal coordinates x, y, z [2, 3, 18, 19, 20, 21, 22, 24]:
T^x, T^y, and T^z in (21) diagonalize X, Y, and Z [see (9)], to yield {x_α}, {y_β}, {z_γ}, respectively. When the direct product 3D FBR {φ_i^x φ_j^y φ_k^z} consists of classical orthogonal polynomials, the matrix elements of T^λ (λ = x, y, z) are given by (11).
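In code, the direct-product construction amounts to a Kronecker product of the 1D quantities produced by the procedure of Section 4. The following sketch assumes the 1D DVR points (x1d, y1d, z1d) and transformation matrices (Tx, Ty, Tz) are already available; all names are assumptions of the sketch.

import numpy as np

def product_dvr(x1d, y1d, z1d, Tx, Ty, Tz):
    """Direct-product 3D DVR from three 1D DVRs (cf. Eq. 21): the grid is the
    set of all triplets (x_a, y_b, z_c), and the 3D FBR-DVR transformation
    matrix is the direct (Kronecker) product of the 1D matrices."""
    X, Y, Z = np.meshgrid(x1d, y1d, z1d, indexing="ij")
    T3 = np.kron(np.kron(Tx, Ty), Tz)
    return (X, Y, Z), T3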
5.1
The Hamiltonian in the 3D DVR
We demonstrate the attractive features of multidimensional DVR by considering a very general 3D Hamiltonian
The coordinates x, y, z are orthogonal and could represent, for example, Jacobi, Radau, hyperspherical (and Cartesian) coordinates, widely used in bound state and scattering quantum calculations [1]. In (22), h_λ (λ = x, y, z) contains the differential operators in coordinate λ, i.e., it is the kinetic energy operator in λ, while f_λ1(λ2, λ3) is either a constant or a function of the other coordinates λ2, λ3 ≠ λ1. It is computationally optimal, and in most cases possible, to choose as the 1D FBR in coordinate λ (λ = x, y, z) orthonormal polynomials {φ^λ} which are eigenfunctions of h_λ (λ = x, y, z):
Clearly, the matrix of h_λ is diagonal in such a 1D FBR:
The matrix elements H^FBR_{n,n'} [n = (i, j, k)] of the Hamiltonian in (22), in the direct product 3D FBR {φ_i^x φ_j^y φ_k^z}, are given by
In (25), ε^λ (λ = x, y, z) are diagonal matrices, their non-zero elements being the eigenvalues of h_λ in (23). H^FBR in (25) is now transformed to the direct product 3D DVR described in the preceding section. This is done by applying the FBR-DVR transformation in (12) and (13), where T is now given by (21). The elements of the 3D DVR Hamiltonian matrix H^DVR are readily obtained as [3, 18]
with
Note that in the 3D DVR the matrix of the potential function V(x, y, z), and the matrices of the coordinate functions f_x, f_y, f_z, are set to be diagonal. As discussed earlier, this is equivalent to the evaluation of the FBR matrix elements by the appropriate M-point Gaussian quadrature in each coordinate. All that is needed to calculate the potential matrix elements are the values of the potential at the grid points of the 3D DVR. Elimination of multidimensional numerical integration in the 3D DVR greatly simplifies and accelerates the quantum mechanical calculations.
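To make the structure of Eq. (26) concrete, the following is a minimal sketch of the action of such a 3D DVR Hamiltonian on a vector, for the simplified case in which the prefactors f_x, f_y, f_z are unity; tx, ty, tz denote the 1D kinetic-energy DVR matrices and Vgrid the potential on the direct-product grid (names assumed for this sketch).

import numpy as np

def hpsi(psi, tx, ty, tz, Vgrid):
    """Action of the 3D DVR Hamiltonian of Eq. (26) on a vector stored as an
    (Mx, My, Mz) array, without ever forming the full matrix."""
    out = np.einsum("ap,pbc->abc", tx, psi)       # kinetic coupling in x only
    out += np.einsum("bq,aqc->abc", ty, psi)      # kinetic coupling in y only
    out += np.einsum("cr,abr->abc", tz, psi)      # kinetic coupling in z only
    out += Vgrid * psi                            # diagonal potential
    return out

A matrix-vector product of this form is all that iterative eigensolvers such as the Lanczos algorithm mentioned below require, so the full matrix never needs to be stored.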
The abundance of Kronecker delta symbols in (26) shows that the Hamiltonian matrix in the 3D DVR is extremely sparse, most elements being identically zero. The only off-diagonal matrix elements in H^DVR come from the differential (kinetic energy) operators h_λ (λ = x, y, z). But these operators couple only one dimension at a time [there are two Kronecker deltas next to each ε^λ in (26)], keeping the number of nonzero off-diagonal elements very small. This particular, sparse structure of the Hamiltonian in the multidimensional DVR allows implementation of a very effective method for reducing the size of the Hamiltonian matrix [1], discussed below, which is not possible in the FBR. The sparseness of the DVR Hamiltonian was also utilized in combination with the Lanczos algorithm [25], to calculate highly excited vibrational levels of three- and four-atom molecules [26]. Another important advantage that the DVR has over the FBR is that the former can be tailored to complicated multidimensional PESs, by retaining only those DVR points which lie in the relevant, energetically accessible regions of the PES [1]. In 3D, the DVR grid points for which
are discarded, thereby reducing significantly the size of the DVR basis. E_MAX is a potential energy cutoff parameter, defined by the maximum energy of interest for a particular problem.
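A sketch of this truncation, under the same assumptions as the previous sketch (Vgrid holds the potential on the grid, and emax stands for E_MAX):

import numpy as np

def truncate_grid(Vgrid, emax):
    """Retain only those direct-product DVR points whose potential energy lies
    below the cutoff emax (cf. Eq. 28)."""
    keep = Vgrid <= emax                  # boolean mask over the 3D grid
    return keep, np.argwhere(keep)        # mask and the retained (a, b, c) triples

One simple way to use the retained set is to zero the discarded amplitudes before and after each application of the Hamiltonian-vector product above, which restricts the problem to the energetically accessible points.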
5.2
Successive Diagonalization and Truncation
The successive diagonalization and truncation method of Bacic and Light [1, 2, 27, 28] is one of the major strengths of the DVR. The method [2, 3, 18, 19, 20, 21, 27, 28] involves partitioning of the DVR matrix H^DVR of the full Hamiltonian having N internal degrees of freedom (N ranges from 3 to 6) into a set of (smaller) matrices of Hamiltonians with N − 1 internal coordinates. These, in turn, can be decomposed further into Hamiltonians with N − 2 internal coordinates; the process can be continued if necessary. A subset of eigenvectors of the lower dimensional (e.g., N − 2) Hamiltonians, usually chosen by an energy criterion, serves as a basis for the Hamiltonians in the next higher dimension (e.g., N − 1). In the last step of the procedure, the truncated matrices of the N − 1 dimensional Hamiltonians are recoupled exactly, to form the final matrix of the full N-dimensional Hamiltonian. Diagonalization of this matrix, whose size is much smaller than that in the original, direct product DVR basis, gives the desired molecular energy levels and wave functions. The truncation of the intermediate eigenvector bases in each cycle of the procedure is possible because the eigenvectors of the lower dimensional problems are well adapted to the features of the PES and, therefore, already contain a significant portion of the full solution. We illustrate the sequential diagonalization and truncation method on the 3D Hamiltonian in (22) and its 3D DVR matrix H^DVR in (26). In the
first cycle, 1D Hamiltonians (defined below) in a coordinate λ1 (λ1 = x, y, z) of our choice are diagonalized for all DVR points in the other two coordinates λ2 and λ3. Truncated sets of 1D eigenvectors provide a contracted basis in λ1 for a series of 2D eigenvalue problems in the coordinates (λ1, λ2), solved at each DVR point of λ3. In the final step, the full 3D Hamiltonian is transformed to the truncated basis of 2D (λ1, λ2) eigenvectors; this compact representation is diagonalized to yield the eigenstates of the system. Evidently, there is considerable freedom in selecting the coordinates which define the intermediate 1D and 2D problems. We may select that particular sequence of 1D and 2D diagonalizations and truncations which is optimal for the PES of the system under study. Here, in the first step, we choose to form a one-dimensional Hamiltonian for each DVR point (x_α, y_β). This 1D Hamiltonian includes all matrix elements of H^DVR in (26) which are diagonal in the indices α, β [3, 18, 19, 21]:
MxMyID eigenvalue problems are solved:
producing eigenvalues lDeaP and eigenvectors lDQaP. Note that each problem in (30) is solved for a differentID cut in z, defined bya,yp), through the 3D PES V(x,y,z). Consequently, the ID eigenvectors 1DCa^ are a locally optimal basis for the particular z-cut through the potential. A truncated ID eigenvector basis, denoted by 1-DCQ^, is formed at every DVR point (xa, y@), by keeping only those ID eigenvectors with eigenvalues such that where lDEcuT is a ID energy cutoff parameter. Our expectation, confirmed by computational experience, is that the number of ID eigenvectors retained at each (xa, yp) can be much smaller than the number of DVR points in z, Mz (for the same level of accuracy of the final results). The 3D DVR Hamiltonian matrix HDVR in (26) is transformed to H in the truncated ID eigenvector basis [3, 18, 19, 21]:
T in (32) denotes a transpose of a matrix. H in (32) is in a mixed representation, which combines the DVRs for x and y coordinates (indices a/3, a'/3') with a ID eigenvector basis for z (indices n,n').
The second cycle of diagonalization and truncation begins with the formation of a two-dimensional Hamiltonian in (y, z) at every DVR point x_α [3, 18, 19, 21]; this 2D Hamiltonian includes all terms of the transformed matrix in (32) which are diagonal in α. Rather lengthy expressions for it have been published for several important cases [18, 19, 21], and will not be repeated here. Diagonalization of the M_x 2D Hamiltonians results in a set of 2D eigenvalues and eigenvectors. As in the 1D case above, the 2D eigenvector basis is truncated by retaining those eigenvectors with
E_CUT^2D being the 2D energy cutoff parameter. This truncated set provides a compact, adiabatic basis for the (y, z) motions, which is locally optimized for each 2D cut, defined by x_α, through the 3D PES. In the last step, the Hamiltonian matrix in (32) is transformed to the truncated basis of the 2D eigenvectors [3, 18, 19, 21]. Each α, α' block is transformed to the basis of 2D eigenvectors as
The diagonal elements of the final Hamiltonian matrix in (34) are the 2D eigenvalues. Diagonalization of this matrix, which contains the complete molecular rotation-vibration Hamiltonian, gives the eigenvalues and eigenfunctions of the full 3D problem. Let us remind ourselves why we go through these transformations. The dimension of the initial Hamiltonian matrix H^DVR in (26), in the primitive, uncontracted 3D DVR, is M_x × M_y × M_z, which can easily be 10^4 - 10^5 for realistic polyatomic systems. There is ample evidence from a variety of demanding applications that each cycle of diagonalization and truncation can reduce the size of the DVR basis by a factor of 3-5, without any loss of accuracy. Therefore, the dimension of the final matrix in the (truncated) basis of 2D eigenvectors, which is equal to M_x times the number of 2D eigenvectors retained at each x_α, is typically 10-20 times smaller than the dimension of H^DVR. Since the computational effort of diagonalizing a matrix scales as M^3, the extra work involved in the transformation to 1D and 2D eigenvector bases is richly rewarded with the savings resulting from the dramatic reduction of the size of the final matrix.
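The complete three-cycle procedure is summarized in the following sketch, which builds on the simplified Hamiltonian of the matrix-vector example above (prefactors f_x, f_y, f_z equal to unity) and, purely for brevity, keeps a fixed number of eigenvectors per grid point instead of applying the energy cutoffs of Eqs. (31) and (33); all array and parameter names are assumptions of the sketch.

import numpy as np

def sdt_3d(tx, ty, tz, Vgrid, n1, n2):
    """Successive diagonalization and truncation for the simplified 3D DVR
    Hamiltonian H = t_x + t_y + t_z + V(x,y,z) of the earlier sketches.
    n1 eigenvectors are kept per point (x_a, y_b), n2 per point x_a."""
    Mx, My, Mz = Vgrid.shape
    # cycle 1: 1D problems in z at every DVR point (x_a, y_b)
    e1 = np.zeros((Mx, My, n1))
    C1 = np.zeros((Mx, My, Mz, n1))
    for a in range(Mx):
        for b in range(My):
            w, v = np.linalg.eigh(tz + np.diag(Vgrid[a, b]))
            e1[a, b], C1[a, b] = w[:n1], v[:, :n1]
    # cycle 2: 2D problems in (y, z) at every x_a, in the truncated 1D basis
    e2 = np.zeros((Mx, n2))
    C2 = np.zeros((Mx, My, n1, n2))
    for a in range(Mx):
        h2 = np.zeros((My, n1, My, n1))
        for b in range(My):
            h2[b, :, b, :] += np.diag(e1[a, b])
            for bp in range(My):
                S = C1[a, b].T @ C1[a, bp]          # overlap of 1D eigenvector sets
                h2[b, :, bp, :] += ty[b, bp] * S
        w, v = np.linalg.eigh(h2.reshape(My * n1, My * n1))
        e2[a], C2[a] = w[:n2], v[:, :n2].reshape(My, n1, n2)
    # final step: recouple the x motion in the truncated 2D eigenvector basis
    H = np.zeros((Mx, n2, Mx, n2))
    for a in range(Mx):
        H[a, :, a, :] += np.diag(e2[a])
        for ap in range(Mx):
            blk = np.zeros((n2, n2))
            for b in range(My):
                S = C1[a, b].T @ C1[ap, b]
                blk += C2[a, b].T @ S @ C2[ap, b]
            H[a, :, ap, :] += tx[a, ap] * blk
    return np.linalg.eigvalsh(H.reshape(Mx * n2, Mx * n2))

With n1 = M_z and n2 = M_y·n1 no truncation is applied and the result coincides with a full diagonalization, which makes it easy to check the effect of a given truncation on a small grid.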
5.2.1
Advantages of Successive Diagonalization and Truncation
The diagonalization and truncation method presented above has a number of unique and desirable features. One of them, exceptional effectiveness in reducing the size of the Hamiltonian matrix, has already been discussed. Another one is that in, say, a 3D problem, the adiabatic 2D eigenvectors used
as the basis for the final diagonalization, are extremely well adapted to the full problem. Consequently, a large fraction (~25%) of the 3D eigenvectors are usually very accurate, to about five significant figures. Moreover, the intermediate 2D eigenstates are physically meaningful in their own right, and most helpful in the assignment and interpretation of the accurate 3D results [28, 29, 30]. The diagonalization and truncation procedure is very flexible. It has been successfully implemented in systems where DVR was used for only one or two internal degrees of freedom, while different bases, such as distributed Gaussians [6, 7] or spherical functions, were employed for the remaining coordinates [27, 28, 31, 32, 33, 34]. Such situations arise for molecules with more than three atoms, where it is often not possible to form a direct product DVR for all internal degrees of freedom. Finally, it is worth emphasizing that the diagonalization and truncation method represents an outstanding example of a divide-and-conquer computational strategy. Brute force diagonalization of a Hamiltonian in a primitive direct product basis is broken down into a series of diagonalizations of a rigorous and physically motivated hierarchy of lower dimensional Hamiltonians. Each lower dimensional problem deals with only a small portion of the PES in the neighborhood of the particular DVR point for which it is defined. Its eigenvectors serve as an optimal basis for the coordinates involved, in that restricted portion of the coordinate space. The lower dimensional Hamiltonians can be diagonalized simultaneously and independently of one another. This makes the diagonalization and truncation method, afforded by multidimensional DVR, particularly suitable for parallel computing, since the calculations can be divided among many processors and require little communication.
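To illustrate this point, the first cycle of the procedure sketched above can be distributed over worker processes with essentially no inter-process communication; the names and the fixed-count truncation follow the previous sketch and are assumptions, not part of the original presentation.

from concurrent.futures import ProcessPoolExecutor
import numpy as np

def _solve_point(args):
    # one independent 1D problem of the first cycle, at the point (x_a, y_b)
    a, b, tz, vz, n1 = args
    w, v = np.linalg.eigh(tz + np.diag(vz))
    return (a, b), (w[:n1], v[:, :n1])

def parallel_first_cycle(tz, Vgrid, n1, nworkers=4):
    Mx, My, _ = Vgrid.shape
    tasks = [(a, b, tz, Vgrid[a, b], n1) for a in range(Mx) for b in range(My)]
    with ProcessPoolExecutor(max_workers=nworkers) as ex:
        return dict(ex.map(_solve_point, tasks))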
6
Comparison of DVR and the Collocation Method
DVR in one or more dimensions is related to, but distinct from, the collocation method for solving differential equations [35, 36, 37, 38], which has been applied to several 2D and 3D bound state problems as well [39, 40, 41, 42]. In the collocation method, the wave function Ψ is expanded in a finite M-dimensional basis: Ψ(x) = Σ_n c_n φ_n(x). Ψ is then required to satisfy the Schrödinger equation at M coordinate points x_i. This results in a general eigenvalue problem for a non-symmetric matrix H_in = ⟨x_i|H|φ_n⟩. The differences between the DVR and the collocation method are numerous and significant. In the DVR, the Hamiltonian matrix is Hermitian (so that its eigenvalues are guaranteed to be real) and related to the FBR by a rigorous, unitary transformation. That is not true in the collocation method, where the matrix of H is non-symmetric (its eigenvalues can be complex) and in a mixed representation of both points and basis functions. The DVR has the Gaussian quadrature accuracy of evaluating matrix
elements, which is not assured in the collocation method. Most importantly, in the collocation method, the structure of the Hamiltonian matrix does not allow implementation of the sequential diagonalization and truncation procedure. Therefore, collocation would not be the method of choice for dealing with highly excited eigenstates of strongly coupled polyatomic molecules. The advantages of the collocation method are that it is very easy to program, and that it permits greater freedom in choosing the points x_i than the DVR (at the cost of losing the Gaussian quadrature accuracy). In particular, it can be used with non-direct product basis sets (e.g., spherical harmonics), important in some systems, for which the DVR cannot be defined.
7
Conclusions
We have reviewed the discrete variable representation (DVR) approach for accurate calculation of highly excited vibrational states of small polyatomic molecules. The DVR, as a rigorous pointwise representation, possesses a number of attractive features which have made it an indispensable tool for dealing with previously intractable bound state problems. Multidimensional DVR can be readily constructed for most coordinate systems and associated Hamiltonians commonly employed in quantum bound state and scattering calculations. The DVR of the Hamiltonian, which is easy to form, is very sparse. Matrices of the potential and other coordinate functions are diagonal in the DVR, eliminating multidimensional numerical integration. The sparseness of the Hamiltonian is exploited in the sequential diagonalization and truncation procedure, which has proved very effective in achieving a drastic reduction (by a factor of 10-20) of the size of the final Hamiltonian matrix, with no loss of accuracy. This has enabled rigorous quantum treatment of larger molecular systems, at higher excitation energies, than has previously been possible. Although the emphasis of this paper was on the bound state applications, the DVR has recently been used with considerable success in quantum reactive scattering [43, 44, 45, 46, 47, 48, 49]. It is evident that the DVR has become the approach of choice for accurate quantum dynamics of difficult polyatomic systems.
References
[1] Z. Bacic and J. C. Light, Ann. Rev. Phys. Chem., 40 (1989), p. 469. [2] Z. Bacic, R. M. Whitnell, D. Brown, and J. C. Light, Comput. Phys. Commun., 51 (1988), p. 35. [3] J. C. Light, R. M. Whitnell, T. J. Park, and S. E. Choi, in Supercomputer Algorithms for Reactivity, Dynamics and Kinetics of Small Molecules, edited by A. Lagana, NATO ASI Ser. C, Vol. 277, Kluwer, Dordrecht, 1989. [4] S. Carter and N. C. Handy, Comput. Phys. Rep., 5 (1986), p. 115. [5] J. Tennyson, Comput. Phys. Rep., 4 (1986), p. 1.
[6] Z. Bacic and J. Simons, J. Phys. Chem., 86 (1982), p. 1192. [7] I. P. Hamilton and J. C. Light, J. Chem. Phys., 84 (1986), p. 306. [8] C. E. Dykstra, Quantum Chemistry and Molecular Spectroscopy, Prentice Hall, Englewood Cliffs, NJ, 1991. [9] A. Messiah, Quantum Mechanics, Vol. II, Wiley, New York. [10] B. T. Sutcliffe, in Methods of Computational Chemistry, edited by S. Wilson, Plenum Press, New York, 1992. [11] M. J. Bramley, W. H. Green, and N. C. Handy, Mol. Phys., 73 (1991), p.1183. [12] H. Kroemer, Quantum Mechanics for Engineering, Materials Science, and Applied Physics, Prentice Hall, Englewood Cliffs, NJ, 1994. [13] G. Arfken, Mathematical Methods for Physicists, Academic Press, Inc., New York, 1985. [14] J. C. Light, I. P. Hamilton, and J. V. Lill, J. Chem. Phys., 82 (1985), p. 1400. [15] A. S. Dickinson and P. R. Certain, J. Chem. Phys., 49 (1968), p.4209. [16] J. Echave and D. C. Clary, Chem. Phys. Lett., 190 (1992), p. 225. [17] D. O. Harris, G. G. Engerholm, and W. D. Gwinn, J. Chem. Phys., 43 (1965), p. 1515. [18] R. M. Whitnell and J. C. Light, J. Chem. Phys., 90 (1989), p. 1774. [19] S. E. Choi and J. C. Light, J. Chem. Phys., 92 (1990), p. 2129. [20] S. E. Choi and J. C. Light, J. Chem. Phys., 97 (1992), p. 7031. [21] M. Mandziuk and Z. Bacic, J. Chem. Phys., 98 (1993), p. 7165. [22] J. Tennyson, J. Chem. Phys., 98 (1993), p. 9658. [23] C. Leforestier, J. Chem. Phys., 94 (1991), p. 6388. [24] J. R. Henderson, J. Tennyson, and B. T. Sutcliffe, J. Chem. Phys., 98 (1993), p. 7191. [25] J. K. Cullum and R. A. Willoughby, Lanczos Algorithms for Large Symmetric Eigenvalue Computations, Birkhauser, Boston, 1985. [26] M. J. Bramley and T. Carrington, Jr., J. Chem. Phys., 99 (1993), p. 8519. [27] Z. Bacic and J. C. Light, J. Chem. Phys., 85 (1986), p. 4594. [28] Z. Bacic and J. C. Light, J. Chem. Phys., 86 (1987), p. 3065. [29] J. C. Light and Z. Bacic, J. Chem. Phys., 87 (1987), p. 4008. [30] Z. Bacic, J. Chem. Phys., 95 (1991), p. 3456. [31] M. Mladenovic and Z. Bacic, J. Chem. Phys., 93 (1990), p. 3039. [32] M. Mladenovic and Z. Bacic, J. Chem. Phys., 94 (1991), p. 4988. [33] J. A. Bentley, R. E. Wyatt, M. Menou, and C. Leforestier, J. Chem. Phys., 97 (1992), p. 4255. [34] J. A. Bentley, C.-M. Huang, and R. E. Wyatt, J. Chem. Phys., 98 (1993) p. 5207. [35] B. A. Finlayson, The Method of Weighted Residuals and Variational Principles, Academic Press, New York, 1972. [36] L. Collatz, The Numerical Treatment of DifferentialEquations, Springer, Berlin, 1960. [37] S. A. Orszag, Studies Appl. Math., 51 (1972), p. 253. [38] D. Gottlieb and S. A. Orszag, Numerical Analysis of Spectral Methods: Theory and Applications, SIAM, Philadelphia, 1977. [39] W. Yang and A. C. Peet, Chem. Phys. Lett., 153 (1988), p. 98. [40] A. C. Peet and W. Yang, J. Chem. Phys., 90 (1989), p. 1746. [41] R. C. Cohen and R. J. Saykally, J. Phys. Chem., 94 (1990), p. 7991. [42] R. C. Cohen and R. J. Saykally, J. Chem. Phys., 98 (1993), p. 6007. [43] Z. Bacic, J. D. Kress, G. A. Parker, and R. T. Pack, J. Chem. Phys., 92 (1990),
p. 2344. [44] T. J. Park and J. C. Light, J. Chem. Phys., 91 (1989), p. 974. [45] T. J. Park and J. C. Light, J. Chem. Phys., 94 (1991), p. 2946. [46] D. Brown and J. C. Light, J. Chem. Phys., 97 (1992), p. 5465. [47] D. T. Colbert and W. H. Miller, J. Chem. Phys., 96 (1992), p. 1982. [48] T. Seideman and W. H. Miller, J. Chem. Phys., 97 (1992), p. 2499. [49] G. C. Groenenboom and D. T. Colbert, J. Chem. Phys., 99 (1993), p. 9681.
Chapter 16 Wave Operators and Active Subspaces: Tools for the Simplified Dynamical Description of Quantum Processes Involving Many-Dimensional State Spaces Georges Jolicard
John P. Killingbeck
Abstract Time-dependent and stationary wave operators are presented as tools to define active spaces and simplified dynamics for the integration of the time-dependent Schrodinger equation in large quantum spaces. Within this framework a new light is thrown on the duality between time-dependent and time-independent approaches and a generalized version of the adiabatic theorem is given. For the Floquet treatment of photodissociation processes, the choice of the relevant subspaces and the construction of the effective Hamiltonians are carried out using the Bloch wave operator techniques. Iterative solutions of the basic equations associated with these wave operators are given, based on Jacobi, Gauss-Seidel and variational schemes.
1 Molecular Collisions: Their Computational Representation and Analysis
1.1 Statement of the Problem
A basic numerical problem for chemistry and molecular physics is the study of collisions. A general collision process leads from some entrance (initial) continuum to some exit (final) continuum. In the case of a full collision, energy exchanges appear between molecular bound states and molecular continua, leading to products characterized by new energy distributions, or by new chemical structures in the case of reactions and dissociations. Half-collision processes are those in which one continuum (entrance or exit) is a photon continuum and the other one is an ionization or "Laboratoire de Physique Moleculaire (U.R.A. 772 du C.N.R.S.). Faculte des Sciences et des Tecnniques - La Bouloie 25030 Besancon Cedex, France. New permanent address: Observatoire de Besancpn. Universite de Franche Comte, 41, Av. de I'Observatoire, BP 1615, 25010 Besanyon Cedex. France. * Department of Applied Mathematics, The University, Cottingham Road HULL, HU6 7RX, United Kingdom.
dissociation continuum. Typical half-collision processes are photoionization and photodissociation and their reverse processes, radiative recombination and radiative association. In most cases these half-collision processes can be described by a time-dependent Schrodinger equation [1],
(1) iħ ∂Ψ(q, r; t)/∂t = H Ψ(q, r; t),
where r represents one or many dissociative coordinates, and q a group of bound coordinates. The Hamiltonian H is in part a differential operator with respect to q and r. It depends explicitly on the time when some of the molecular coordinates or the electromagnetic field are considered as classical variables [2, 3]. Many approaches involving the theory of wave packet dynamics have been proposed during the last ten years to treat equation (1) [4, 5, 6]. These treatments discretize the continua by working with a bounded range of radial coordinates in conjunction with finite basis sets of square integrable functions. They are well suited for numerical integration methods; the construction of both the partial propagator U(t + dt, t) and the global propagator U(t, t0) involves the repeated formation of the matrix-vector product HΨ as the principal operation in the description of the time evolution process. Traditional methods of calculating the partial propagator include the second order differencing scheme, the split operator method and the short iterative Lanczos method [7, 8]; the global propagator can be calculated using a scheme based on Chebyshev polynomial expansions [9]. These methods encounter two typical difficulties. The first arises from the discretization of the continua, which effectively implies the addition of absorbing boundary conditions. The Hamiltonian operator then becomes non-Hermitian, which often consequently reduces the efficiency of the propagators [10]. The second difficulty is the large CPU time which characterizes wave packet propagation computations. This often becomes prohibitively long even when Fast Fourier Transforms (FFT) are used to transform between the Discrete Variable Representation (DVR) and a Finite Basis Representation (FBR). The difficulty can be partially overcome by deriving a time-independent treatment from the time-dependent one. For a time-dependent Hamiltonian this is unfortunately a difficult task, since the analytical form of U(t, t0) is unknown. A solution has been proposed by Shirley within the framework of the Floquet theory for time periodic Hamiltonians [11]. Recently a promising approach, the (t, t') method, has been proposed by Peskin and Moiseyev [12] for time non-periodic Hamiltonians. The study of collisions poses a second challenge to data storage and CPU time limitations with the calculation of resonance states (either
purely molecular resonance states or resonances induced by the field-matter interaction) [13]. These states are eigenvectors of the stationary Hamiltonian having particular asymptotic features: they are purely outgoing and exponentially diverging functions and correspond to poles of the resolvent operator, lying in the complex lower half-plane [14]. Thus they can be characterized by the equations
where the asymptotic fragment wavefunctions $v(
1.2
The Boundary Conditions
The full study of molecular collisions in a finite basis representation is usually carried out in three steps: The first step defines the finite L2 basis set representation and asymptotic absorbing boundaries. In this same volume Bacic introduces the DVR representation and presents associated strategies for reducing the dimensionality of the Hamiltonian matrix. The second step integrates the matrix differential equations obtained by projection of the relevant equations of motion onto the selected basis set.
The third step makes an asymptotic analysis of the fragments resulting from the interaction. These three steps are in fact closely correlated, and the merits or defects of each one have direct consequences for the others. The boundary conditions are, for example, of primary importance in the resonance state eigenproblem. This is because these states are nonstationary states that transfer quantum flux outside the region of interaction. The adjoint eigenvectors φ+, constituting with the φ a biorthogonal basis set, are the complex conjugates φ* of the resonance eigenvectors, so that the incoming flux associated with φ+ is equal in amplitude to the outgoing flux associated with the corresponding φ states, and H is thus non-Hermitian with respect to these vectors [13]. This property is inconsistent with the use of a finite L2 basis set, which produces a Hermitian H matrix. One can overcome this difficulty by using a mathematical artifact, namely by introducing near the upper boundary an additional absorbing imaginary potential term (the optical potential) which includes a set of adjustable parameters λ [17],
Many studies have shown that this model is appropriate for the resonance eigenproblem and that precise resonance state eigenvalues can be obtained through a stabilization procedure with respect to the parameters λ [18, 19]. A similar choice of boundary conditions can be made in the case of wave packet propagation, but the shape and the position of the optical potential are correlated with the asymptotic analysis of the fragments. The most efficient methods of analysis give energy-resolved results over the full spectrum of the wave packet by analysing, at a fixed asymptotic position, the outgoing wave function or the quantum flux [20, 21]. However the asymptotic absorption of the wave packet has non-local effects, so that the precision of the analysis essentially depends on the radial extent of both the optical potential and the outgoing wave packet [21]. This analysis cannot be considered for resonance states having an infinite extent, but fortunately these states are characterized by an identical exponential time dependence of the flow in each open channel. Thus the ratios of the integrated flows are equal to the ratios of the instantaneous flows at a given arbitrary time. Further, the branching ratios are simply expressed in terms of the asymptotic amplitudes of the resonance states in the various open channels [22, 23]. The preceding comments reveal that any new theoretical treatment such as the wave operator approach must possess some specific features in order to be applied successfully in collision theory. For example, it should be able to work with large non-Hermitian matrices and also be adaptable to permit variational stabilisation procedures, without repeating the full calculational
procedure. In solving the eigenproblem it should supply the eigenvectors, which directly participate in the quantum flow calculations.
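As an illustration of the optical-potential device discussed above, the sketch below (model potential, grid, absorbing ramp, and energy window are all assumed for illustration, not taken from the chapter) adds an imaginary term -iλW(x) near the outer grid boundary to a one-dimensional finite-difference Hamiltonian and scans the strength λ; eigenvalues whose positions are insensitive to λ are the resonance candidates, in the spirit of the stabilization procedure of [18, 19].

```python
# Optical (absorbing imaginary) potential added to a 1D model Hamiltonian.
import numpy as np

n, L = 400, 30.0
x = np.linspace(0.05, L, n)
dx = x[1] - x[0]
T = -0.5 * (np.diag(np.ones(n - 1), -1) - 2.0 * np.eye(n)
            + np.diag(np.ones(n - 1), 1)) / dx**2
V = 0.5 * x**2 * np.exp(-0.05 * x**2)       # well behind a barrier: supports quasi-bound states

ramp = np.clip((x - 0.8 * L) / (0.2 * L), 0.0, None)
W = ramp**2                                  # absorbing ramp over the last 20% of the grid

for lam in (0.5, 1.0, 2.0):                  # stabilization scan over the strength parameter
    Hmat = T + np.diag(V - 1j * lam * W)     # non-Hermitian Hamiltonian with optical potential
    ev = np.linalg.eigvals(Hmat)
    cand = ev[(ev.real > 0.0) & (ev.real < 3.5) & (np.abs(ev.imag) < 0.2)]
    # eigenvalues that stay put as lam varies are the resonance candidates
    print("lam =", lam, ":", np.sort_complex(cand))
```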
2 The Wave Operator Theory
2.1 The Time-Dependent Wave Operator Concept
Let us consider again the time-dependent evolution equation (1) and let us assume that there exists some active subspace S0 with a dimension small compared to that of its orthogonal complement S0⊥, usually called the complementary space. The projector of the active space is then denoted by P0 and that of the complementary space by Q0. By assuming that an orthogonal basis |i> ⊕ |a> spans the whole of the vector space required to describe the system we can set
The solution of the equation of motion (1) is given formally in equation (2) by introducing the propagator U(t, t0), sometimes written more explicitly as U(t, t0; H), since various different effective Hamiltonians are introduced in approximate treatments. The active subspace component of U(t, t0; H) is P0 U(t, t0; H) P0. In general, using an obvious abbreviated notation, U will be the sum of four components
where we can regard each operator as being represented in the full space by a partitioned square matrix with three zero blocks. If, for example, the full space is of dimension N and the subspace S0 is of dimension M, then P0UP0 will be represented by a matrix with a non-zero block of M x M type and three zero blocks which complete the full N x N matrix. The M x M non-zero part of P0UP0 will have an M x M inverse, which, when augmented by three zero blocks to form an N x N matrix, is denoted (in a slight abuse of the usual conventions) by (P0UP0)^-1. From the matrix multiplication properties of partitioned matrices, represented by the projection operator properties P0P0 = P0, P0Q0 = 0, etc., it follows that we then obtain the result on remembering that in the full space P0 is represented by a unit M x M matrix plus three appropriate zero blocks. Reverting to the full notation, we denote the operator on the left of (8) by Ω(t, t0) and call it the time-dependent wave operator [27], while the second operator on the right is called the reduced wave operator X(t, t0); equation (8) then becomes
with
X(t, t0) is an off-diagonal part of Ω(t, t0); in the N x N matrix representation it has non-zero elements only in the lower left (N − M) x M block. In numerical computation, of course, it is only the relatively small number of non-zero elements of X(t, t0) and Ω(t, t0) which need to be calculated. In the time-independent case, the formalism above shows that, if we can find a transformation of the basis vectors in the N dimensional space which will render X (i.e. Q0UP0) zero, then the operator U will not couple the space S0 and the complementary space, and we shall have produced an isolated M x M matrix block P0UP0 which gives a more tractable problem of reduced dimension. A matrix form of this approach has been applied directly to matrix eigenvalue problems by some workers [24, 25], although the projection operator formalism is that predominantly used by workers in wave operator theory. In the time-dependent case, of course, the quantities (including the space S0) are changing with time, and a single fixed transformation will not serve to decouple the two spaces permanently by rendering X = Q0UP0(P0UP0)^-1 zero at all times. It turns out that in principle one can construct a sequence of transformations, involving an effective Hamiltonian Heff(t), which will keep on maintaining this decoupling. As one would expect, what is obtained is actually an equation of motion which must be satisfied to maintain the decoupling, since the formalism is essentially a recasting of the basic time-dependent Schrodinger equation. The starting point of the method is to combine equations (9) and (10) to give
which may be seen to be valid by recalling the partitioned matrix representation described above. Interpreting that representation in dynamical terms, P0UP0 describes a first evolution step within the active space S0, while the X term (involving Q0UP0) describes a transition from S0 to the complementary space without any reverse path (a reverse path would involve P0UQ0). The equations presented so far are purely formal insofar as the operators Ω and X are expressed via the evolution operator U. The term P0UP0 in (10) is the S0 space projection of U, and the aim of the theory developed here is to regard it as the exact time development operator associated with an effective Hamiltonian Heff via the standard Schrodinger equation of motion. Accordingly, we study a small time interval tn → tn+1 and set
with
The guiding idea behind the formalism is that at t = tn we start with X(t) = 0, i.e. no coupling between S0 and its complementary space. As
the time-development begins to produce this coupling, we arrange the transformations of the basis vectors to cancel this coupling. The formal mathematics behind the calculation is based on the identity
Because of the particular block form of X, described earlier, it can be shown that the operator iħ dX/dt, regarded as a kind of fictitious Hamiltonian, has a very simple development operator,
The block form of the matrices also shows that X^2 = 0, so that (1 − X)(1 + X) = 1 and 1 − X is the inverse of 1 + X. At this stage we make appeal to the theory of intermediate representations [26], according to which the development operator U_A+B for an operator sum A + B can be written as
For the sum in equation (14) this gives the result
with
If we now specify X more precisely by demanding that it should be chosen to make Q0 Heff P0 zero, we obtain an equation of motion for X;
The quantity of principal interest now is the projection of Heff within the active space S0. The P0 projection operator on the left of equation (12) separates out the two left hand blocks. We do not need to specify more closely the other two parts of Heff in the block matrix structure. Remembering the location of the various blocks in the matrix structure, we can write the active space effective Hamiltonian as
The three equations (17), (19) and (20) are the three basic equations of the time-dependent wave operator theory, and are handled numerically by using small discrete time steps tn → tn+1, as noted in connection with equation (12). Equation (19), which defines the reduced wave operator, can be written differently after some algebraic manipulations [28], in particular:
Alternatively, by combining eqns. (18) and (21) we obtain
The three different forms (19), (21) and (22) will be used in the next chapters and related to other theoretical treatments and concepts.
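A small numerical check may make the partitioned-matrix structure concrete. The sketch below (random Hermitian model Hamiltonian with ħ = 1; our illustration, not the authors' code) builds the propagator U = exp(-iHt), extracts the reduced wave operator X = Q0 U P0 (P0 U P0)^-1, and verifies that X occupies only the lower-left (N − M) x M block and that U P0 factorizes as (P0 + X)(P0 U P0).

```python
# Numerical check of the block structure of the time-dependent wave operator.
import numpy as np

rng = np.random.default_rng(0)
N, M, t = 6, 2, 0.7
A = rng.normal(size=(N, N)) + 1j * rng.normal(size=(N, N))
H = (A + A.conj().T) / 2                              # Hermitian model Hamiltonian

w, V = np.linalg.eigh(H)
U = V @ np.diag(np.exp(-1j * w * t)) @ V.conj().T     # propagator exp(-iHt), hbar = 1

P0 = np.zeros((N, N)); P0[:M, :M] = np.eye(M)         # active-space projector
Q0 = np.eye(N) - P0

PUP = P0 @ U @ P0
PUP_inv = np.zeros((N, N), dtype=complex)
PUP_inv[:M, :M] = np.linalg.inv(PUP[:M, :M])          # (P0 U P0)^-1 on the M x M block

X = Q0 @ U @ P0 @ PUP_inv                             # reduced wave operator
Omega = P0 + X                                        # time-dependent wave operator

print("X confined to lower-left block:", np.allclose(X[:M, :], 0), np.allclose(X[:, M:], 0))
print("U P0 = Omega (P0 U P0):", np.allclose(U @ P0, Omega @ PUP))
```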
2.2
The Stationary Wave Operator
Wave operator theory was largely developed within quantum chemistry to obtain new effective Hamiltonians which can provide accurate molecular energies and easily interpretable wavefunctions [29]. The basic idea is to define an effective Hamiltonian Heff acting in a reduced model space S0 such that a subset of the exact energies of the Hamiltonian H coincides with those of Heff. The choice made by Bloch [30, 31] corresponds to setting:
where P0 is, as previously, the projection operator of the model space S0 and P the projection operator of the corresponding target space. The eigenvectors |λ0> of Heff are in this case the projections into the model space of the corresponding eigenvectors |λ> of H, and Ω appears in this context as the operator which transforms these eigenvectors projected into S0 back to the exact eigenvectors:
The theory of Bloch and its extensions [32], as developed in numerous quantum chemical studies, are purely stationary state methods; however, they can be derived from the general time-dependent wave operator theory presented in Section 2.1. Let us assume that the operator H is generated from the unperturbed operator H0 by adiabatically introducing the perturbation V (= H − H0):
then the derivative dX/dt tends to zero at any time, and eq. (19) at t = 0 leads to:
This equation, which is illustrated in Fig. 1, is the basic equation of Bloch's wave operator theory. Most of the existing literature presents this basic
FIG. 1. Schematic representation of the effective Hamiltonian calculation by the wave operator transformation.
equation in the alternative form [29]:
However, (26) has a pedagogical advantage, since it shows that the reduced wave operator X is simply the non-diagonal part of a non-unitary and non-singular transformation (T = 1 + X) which possesses a trivial inverse (T^-1 = 1 − X) [33]. This transformation isolates the S0 space by cancelling out the couplings which connect S0 to S0⊥, so that the eigenvalues originating within the space S0 are simply obtained by diagonalising Heff = P0 H Ω.
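The Bloch prescription can be verified on a small matrix. The sketch below (random symmetric H; our illustration) builds the target-space projector P from exact eigenvectors and uses the closed form Ω = P P0 (P0 P P0)^-1, which satisfies the intermediate normalization P0 Ω = P0; it then checks that the eigenvalues of Heff = P0 H Ω reproduce the selected exact eigenvalues and that Ω maps projected eigenvectors back to the exact ones. (In practice Ω is obtained iteratively, as in Section 3.2, rather than from the exact eigenvectors.)

```python
# Numerical check of the Bloch wave operator and effective Hamiltonian.
import numpy as np

rng = np.random.default_rng(1)
N, M = 8, 3
A = rng.normal(size=(N, N)); H = (A + A.T) / 2

E, C = np.linalg.eigh(H)
# target space: the M exact eigenvectors with the largest weight in the model space
weights = np.sum(C[:M, :]**2, axis=0)
sel = np.argsort(weights)[-M:]
P = C[:, sel] @ C[:, sel].T                     # target-space projector
P0 = np.zeros((N, N)); P0[:M, :M] = np.eye(M)   # model-space projector

core = P0 @ P @ P0
core_inv = np.zeros((N, N)); core_inv[:M, :M] = np.linalg.inv(core[:M, :M])
Omega = P @ P0 @ core_inv                       # Bloch wave operator (P0 Omega = P0)

Heff = (P0 @ H @ Omega)[:M, :M]                 # effective Hamiltonian in the model space
print("Heff eigenvalues :", np.sort(np.linalg.eigvals(Heff).real))
print("exact eigenvalues:", np.sort(E[sel]))
# Omega turns a projected eigenvector back into the exact one
v = P0 @ C[:, sel[0]]
print("Omega recovers the exact eigenvector:", np.allclose(Omega @ v, C[:, sel[0]]))
```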
3 Solution of the Wave Operator Equations
3.1 Definition of the Active Space
Equation (17) provides an exact description of the quantum evolution issuing from the space S0. This equation is complicated when compared with the standard time-dependent Schrodinger equation, since Ω (or X) are the solutions of the non-linear equation (19). Thus its application must be restricted to processes which are confined within target spaces induced from relatively small active spaces. Fortunately, wave operator theory is an efficient tool for defining and constructing suitable active spaces, while permitting simultaneous generalization of the adiabatic theorem. A process will be said to be adiabatically evolving with respect to the active space S0 if the corresponding reduced wave operator at any time obeys the restriction that it has a suitably small time derivative,
i.e., if the time-dependent wave operator essentially converges to the instantaneous stationary one. When S0 at t0 consists of a single basis vector
|λ0>, which is itself one nondegenerate eigenvector of H(t0), the adiabatic evolution process leads to:
The wave operator Ω is here the instantaneous eigenvector of H(t) at time t and E(t) is the corresponding eigenvalue. As Ω(t) satisfies an intermediate normalisation condition (<λ0|Ω(t)> = 1), E(t) possesses an imaginary component which ensures norm conservation when H(t) is Hermitian. The wave operator formulation can also resolve more intricate situations in which the evolution, while appearing adiabatic with respect to the transitions between S0 and S0⊥, exhibits strong nonadiabatic or even sudden behaviour inside the active space. A direct definition of this active space can then be obtained by first defining the initial space spanned by the initial vector |λ0> (from a practical point of view the basis used should be chosen so that |λ0> has projections on only one or a few basis vectors). Then the wave operator associated with this initial space is derived in an iterative treatment (see section 3.2) and the vectors of the whole space are sorted in the order of decreasing absolute values of the components of this wave operator. Finally the active space, the dimension N0 of which is imposed, is built by taking the N0 first sorted vectors [34]. Accordingly the very concept of a reduced active space disappears when the number of the non-negligible components becomes too large. This procedure of selection of an active space has been successfully applied by Wyatt et al. [35] to calculations of the overtone relaxation of benzene, where it appears simpler than the well known artificial intelligence methods [36], since search algorithms and evaluation functions are not required. The same procedure has also been used in the theory of photorelaxation and photodissociation processes [34, 37]. Figure 2 illustrates this for the dissociation of a Morse oscillator produced by a monochromatic field.
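The selection rule just described is easy to state in code. The sketch below (illustrative model Hamiltonian, not the benzene application of [35]) uses a one-dimensional initial space |0>; for simplicity the wave-operator column is obtained here by exact diagonalization rather than by the iterative scheme of Section 3.2, the components are sorted by decreasing magnitude, and the N0 largest define the active space.

```python
# Active-space selection by sorting the components of the wave-operator column.
import numpy as np

rng = np.random.default_rng(2)
N, N0 = 40, 6
H = np.diag(np.arange(N, dtype=float)) + 0.3 * rng.normal(size=(N, N))
H = (H + H.T) / 2                               # model Hamiltonian, |0> weakly mixed

E, C = np.linalg.eigh(H)
k = np.argmax(np.abs(C[0, :]))                  # eigenvector dominated by |0>
omega_col = C[:, k] / C[0, k]                   # wave-operator column (intermediate normalization)

order = np.argsort(-np.abs(omega_col))          # sort by decreasing |component|
active = np.sort(order[:N0])                    # indices of the basis vectors spanning S0
print("active space indices:", active)
```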
3.2
Integration of the Stationary Bloch Equation
The stationary wave operator Ω (or X) plays a central role. It is used to define the active space concept (section 3.1). It directly participates in the construction of Heff and in the relationships between the eigenvectors of Heff and the exact eigenvectors of (24). It also participates in the reduced dynamical equations inside the target space (see section 4.1). Thus several linear and quadratic iterative methods have been proposed to solve the nonlinear equation (26) [38]. (Some of them are intractable or approximate, e.g. they assume a diagonal structure of H in the complementary space S0⊥.) The recursive distorted wave approximation (RDWA) [39] leads to an iterative process described by the equations
FIG. 2. Schematic representation in the dressed picture of the potential surfaces involved in the dissociation process of the HF molecule. The field intensity is I = 1 TW/cm2 and the frequency (ω = 0.02914 a.u.) is tuned to the frequency of the transition (v = 0 → v = 2). The surfaces n = −1 and n = 1 are the dressed surfaces associated with the absorption and emission of one photon. The figure shows the bound and the continuum states retained by the wave operator selection algorithm and constituting the model space.
with
At each step, most of the required computational CPU time is devoted to forming the matrix-vector product HX. The trial value X^(N=0) = 0 is often used, but in fact any arbitrary value can be selected; it is obvious that the choice of a trial operator X^(N=0) near to the converged solution increases the speed of convergence. Divergence of the series (30) appears in the strong coupling regime, particularly when intruder states associated with avoided crossings are involved. To partially overcome these difficulties a treatment called the single cycle method (SCM) has recently been proposed in conjunction with the Gauss-Seidel method [25]. The SCM modifies the RDWA algorithm by using the fact that the family of special transformations 1 + X used is such that the simple product rule (1 + X)(1 + Y) = 1 + X + Y holds; the result (1 − X) = (1 + X)^-1 is a special case of this property. The SCM uses the decomposition,
and then expands each transformation (1 + ΔX^N) as a series of elementary transformations:
where i is the column index of the wave operator in the model space and α the row index in the outer space S0⊥. The improvement comes from the fact that after each partial transformation, the H elements are modified before being used for the next transformation. This produces a larger radius of convergence, but at the cost of a more complicated structure of the iteration. If a generalized iteration index (L = (N, α, i)), incremented after each partial transformation, is introduced instead of N, the basic equation (30) remains identical but (31) is replaced by
It thus involves a step-by-step relationship (H^L = f(H^(L-1))) instead of a direct relationship (H^N = f(H)). Fortunately the iteration remains tractable, since only a small part of the H matrix, namely the diagonal part and the band Q0 H^L P0, is modified at each step and needs to be stored. Details of this iteration are given elsewhere [37].
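The sketch below shows a Jacobi-type fixed-point iteration for the stationary Bloch equation, in the spirit of the RDWA; the precise bookkeeping of eqs. (30)-(31) and of the SCM refinement is not reproduced, and the model Hamiltonian (a weakly coupled M x M model block inside a larger, nearly diagonal matrix) is assumed for illustration. As noted in the text, the dominant cost per sweep is the matrix product involving H and X.

```python
# Jacobi-style iterative solution of the Bloch equation  Hqp + Hqq X = X (Hpp + Hpq X).
import numpy as np

rng = np.random.default_rng(3)
N, M = 30, 3
H = np.diag(np.linspace(0.0, 10.0, N)) + 0.05 * rng.normal(size=(N, N))
H = (H + H.T) / 2

Hpp, Hpq = H[:M, :M], H[:M, M:]
Hqp, Hqq = H[M:, :M], H[M:, M:]
dq = np.diag(Hqq)
X = np.zeros((N - M, M))                      # reduced wave operator block (rows in S0-perp)

for it in range(200):
    Heff = Hpp + Hpq @ X                      # effective Hamiltonian P0 H (P0 + X)
    R = Hqp + Hqq @ X - X @ Heff              # residual of the Bloch equation
    if np.max(np.abs(R)) < 1e-10:
        break
    denom = np.diag(Heff)[None, :] - dq[:, None]   # unperturbed energy denominators
    X = X + R / denom                         # Jacobi-style update of every element

print("iterations:", it, " max residual:", np.max(np.abs(R)))
print("Heff eigenvalues       :", np.sort(np.linalg.eigvals(Heff).real))
print("lowest exact eigenvalues:", np.linalg.eigvalsh(H)[:M])
```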
3.3
Integration of the Time-Dependent Wave Operator Equation
The time-dependent wave operator is the main concept involved in our study insofar as it gives rise to the stationary wave operator. It appears in equation (19) as the solution of a non-linear differential equation. An iterative procedure, closely related to those used for the stationary wave operator, was proposed [40] to solve this equation. Unfortunately the slow convergence of this solution makes it not well adapted to the study of evolution over long time periods. A better approach is obtained by using second order differencing to approximate the time derivative in (19) and to preserve time reversal symmetry. With a variable time step, one gets :
with
The parameter ε connected with the time step variation is confined in a small interval centered on 1. Its value is defined at each step with regard to the amplitude of the derivative dX/dt. From a practical point of view it must be recalled that the major part of the CPU time used to integrate (35) is devoted to forming the matrix-vector product HX appearing in the expression for dX/dt.
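The following sketch illustrates second-order (leapfrog) differencing for the reduced wave operator, written here in a Riccati-like block form, iħ dX/dt = Hqp + Hqq X − X Hpp − X Hpq X, which reduces to the stationary Bloch equation when dX/dt = 0; this form, the constant time step, ħ = 1, and the model Hamiltonian with a slowly switched-on perturbation are our assumptions, and the variable-step refinement of eq. (35) is not reproduced.

```python
# Leapfrog (second-order differencing) propagation of the reduced wave operator X(t).
import numpy as np

rng = np.random.default_rng(4)
N, M, dt, nsteps = 20, 2, 1e-3, 4000
H0 = np.diag(np.linspace(0.0, 5.0, N))
V = 0.05 * rng.normal(size=(N, N)); V = (V + V.T) / 2

def rhs(X, t):
    H = H0 + min(t / 2.0, 1.0) * V                 # perturbation switched on over t in [0, 2]
    Hpp, Hpq = H[:M, :M], H[:M, M:]
    Hqp, Hqq = H[M:, :M], H[M:, M:]
    return -1j * (Hqp + Hqq @ X - X @ Hpp - X @ Hpq @ X)

X_prev = np.zeros((N - M, M), dtype=complex)       # X(t0) = 0: no initial coupling
X = X_prev + dt * rhs(X_prev, 0.0)                 # one Euler step to start the scheme
for n in range(1, nsteps):
    t = n * dt
    X_prev, X = X, X_prev + 2 * dt * rhs(X, t)     # X(t+dt) = X(t-dt) + 2 dt dX/dt

print("||X|| after switching on the coupling:", np.linalg.norm(X))
```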
3.4
From the Stationary to the Time-Dependent Problem
A basic feature of the wave operator theory is its capacity to establish relationships between the time-dependent and the time-independent formalisms. The fact that the Bloch wave operator has been derived from the time-dependent one, in the adiabatic limit, can for instance be used in stationary problems to overcome convergence difficulties of the iterative scheme in the strong coupling regime. As noted previously, the initial trial operator X^(N=0) introduced in these iterations is arbitrary; that derived from a time-dependent procedure by establishing the couplings nearly adiabatically can often be expected to be very close to the desired exact one. The equation (35) is well suited for the time integration of a slowly varying evolution process because the features leading to large time derivatives at the adiabatic limit have been suppressed. Equation (29) shows for example that the quickly varying phase terms have been extracted and separated from the wave operator. Moreover, the large couplings which could appear near the avoided crossings are suppressed by this formulation if the states participating in these avoided crossings are included in the active space S0. A simple illustration is presented in Figs. 3 and 4. This corresponds
FIG. 3. Lower part of the spectrum of two harmonic oscillators with linear and quadratic perturbations versus the perturbation parameter λ. The case studied by the wave operator formulation and presented in Fig. 4 corresponds to the value λ = 0.4 and to the 4th group of quasidegenerate eigenvectors.
to two harmonic oscillators in near resonance which are linearly perturbed and quadratically coupled. Fig. 3 presents the lower part of the energy spectrum of this system. The perturbation amplitude is established as a linearly increasing function of time on the interval [t = 0, t = N/ν], where ν is the smaller of the two vibrational frequencies. In the present case this amplitude varies from 0 up to 0.40. The wave operator corresponding to the 4th group of quasidegenerate eigenvectors in the energy scale (see Fig. 3) has been calculated and the four corresponding approximate eigenvectors derived from eq. (24). Figure 4 presents the factors
for i = 1 to 4 versus the index j. These factors, which estimate the deviation from a purely stationary case, are unexpectedly small for N = 3 (a time interval with only three vibrational oscillations).
3.5 From the Time-Dependent Problem to the Stationary Problem
Equations (21) and (22) provide a means to transform the time-dependent dynamical problem into a stationary problem [41]. Equation (21) is the
FIG. 4. The non-adiabaticity factors for the four states (i = 1 to 4) (o, +, x, *) of the 4th group of quasidegenerate states. The factors are shown versus the index j, which denotes the harmonic basis vectors sorted in the order of increasing energy values.
equivalent of the stationary equation (26) in which the Floquet-type operator H − iħ ∂/∂t is substituted for the molecular Hamiltonian H. In the case of Hamiltonians which are periodic in time (H(t) = H(t + τ)), the Floquet theorem establishes the existence of stationary solutions, themselves periodic in time, i.e. [42]:
The quasienergy ε_k is unique up to a multiple of 2π/τ, and in accord with the Floquet theorem the quasienergy state can be written in the form
The two equations (38) and (39) can be identified with eqs. (22) and (17), respectively, for the case of a one-dimensional active space. In this case the effective Hamiltonian is the time-independent scalar ε_k and the reduced wave operator itself becomes a quasivector obeying the periodic conditions:
However, the time-dependent wave operator theory adopts a more general point of view by generalizing the quasivector concept to degenerate active
space situations. This generalization, which transforms the quasi-vector into a wave operator and the quasi-energy into a quasi-effective Hamiltonian operator, improves the calculational scheme for the quasi-vectors when many quasi-vectors in near resonance are collected into the space S0.
3.6
The Concept of an Evolutive Nonorthogonal Basis Set
The wave operator formalism is confronted by difficulties when the initial state projects onto a large initial space. This is the case with wavepacket propagation calculations when DVR basis sets are used. The wave packet dilates during the propagation; it quickly spreads out in space and takes nonzero values on the whole grid. As soon as the dimension of the subspace S0 becomes non-negligible with respect to the dimension of S0⊥, the treatment presented in subsection 3.3 is no longer useful. Nevertheless this dilemma can be overcome by defining, at the initial time of each propagation step, a new nonorthogonal basis set. If the primary orthonormal basis set is denoted by (|φ_j>), a new basis set (|φ'_j>) can be built up as follows:
where Ψ(t0) is the exact wave function at the new initial instant t0 (cf. eq. (1)) and p is the index associated with the largest component <φ_j|Ψ(t0)>. When the wave function is expanded in the new basis set:
the resulting matrix equation is :
The difficulty generated by the large initial space has disappeared in this new scheme since A'(t0) has only one nonzero component:
Moreover the matrix B which defines the new Hamiltonian matrix has a simple form. It equals the identity matrix except for the pth column, for which B_jp = <φ_j|Ψ(t0)>. The matrix B^-1 has the same structure but with:
Thus going from H to B^-1 H B only requires a matrix-vector product.
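The one-column structure of B and B^-1 is easily checked numerically. The sketch below (illustrative dimensions and wave function) builds B from the components of Ψ(t0), writes B^-1 in the same one-column form, and verifies that the new coefficient vector B^-1 c(t0) has a single nonzero entry.

```python
# Evolutive non-orthogonal basis: B differs from the identity only in column p.
import numpy as np

rng = np.random.default_rng(5)
N = 12
c = rng.normal(size=N) + 1j * rng.normal(size=N)
c /= np.linalg.norm(c)                          # components of Psi(t0) in the primary basis
p = int(np.argmax(np.abs(c)))                   # index of the largest component

B = np.eye(N, dtype=complex)
B[:, p] = c                                     # B equals the identity except for column p

Binv = np.eye(N, dtype=complex)
Binv[:, p] = -c / c[p]                          # same one-column structure for the inverse
Binv[p, p] = 1.0 / c[p]

print("B Binv = I:", np.allclose(B @ Binv, np.eye(N)))
A_new = Binv @ c                                # coefficients of Psi(t0) in the new basis
print("single nonzero component at p:", np.allclose(A_new, np.eye(N)[:, p]))

H = rng.normal(size=(N, N)); H = (H + H.T) / 2  # model Hamiltonian in the primary basis
H_new = Binv @ H @ B                            # Hamiltonian governing the new coefficients
print("transformed Hamiltonian shape:", H_new.shape)
```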
4
Illustrative Numerical Examples: The Photodissociation of Molecules by an Intense Laser Field
Molecular photodissociation is a process which, while apparently complicated, can often be described theoretically using a small target space formed by only a few quasi-vectors. The wave operator formalism, adapted in accord with Floquet theory, is then well suited for its description.
4.1
Projected Dynamical Equation in the Quasienergy Formulation
Consider a molecular system driven by a periodic external field (i.e. a continuous field) with frequency ω and period τ (= 2π/ω). Let us introduce the Floquet state nomenclature |α, n>, where |α> designates the elements of a complete orthonormal set of eigenfunctions of the unperturbed molecular Hamiltonian and the |n> are the Fourier vectors such that <t|n> = exp(inωt). One can then express the transition probability (α → β), averaged over the initial phases of the field seen by the molecular system, as
where ℋ (= H − iħ ∂/∂t) is the Floquet matrix. This exact expression for the evolution can be simplified by first defining a reduced active space S0, as explained in section 3.1; Figure 2 illustrates this selection of S0 for a model system. Then the wave operators Ω_P and Ω_Q associated with the two complementary spaces S0 and S0⊥ are introduced and the non-unitary transformation is applied to the evolution operator exp(−iℋΔt/ħ)P0, as shown in Fig. 5. The simplified relaxation scheme is obtained by neglecting the relaxation path through the space S0⊥. A detailed demonstration in reference [34] leads to the final equation:
in which the overlap matrices and the quasi eigenvalues have simple expressions in terms of the wave operator Ω_P. Eq. (48) reveals that the wave operator formulation simplifies the dynamical description by confining it within the space spanned by the quasi-vectors derived from space S0. An interesting situation, corresponding to the model presented in Fig. 2, is shown in Fig. 6. The conditions leading to appreciable dissociation probabilities from the ground state (v = 0) correspond to the use of doubly degenerate target spaces in which the quasi-vector state Φ(n = 0, v = 0) mixes with a second quasi-vector Φ(n = −1, v) in near resonance [37].
FIG. 5. Schematic representation of the time relaxation before and after the wave operator transformation.
4.2
Extension to Photodissociation Processes Using Short Laser Pulses
The wave operator-Floquet formalism can be generalized to non-periodic Hamiltonians to describe dissociations induced by short laser pulses of form
In a discretized model the range of variation [0, Fmax] is divided into N elementary intervals of the same width ΔF, and a corresponding partitioning of the time interval is produced (t_i), with F(t_i) = iΔF for i ≤ N and F(t_i) = (2N − i)ΔF for N ≤ i ≤ 2N. The continuous function is then approximated by the discontinuous function which takes the constant value F(t_i) when t is between the two adjacent points t_i and t_i+1. An active space S0 is defined as for the case of a continuous laser, and for each discretized value of the field magnitude a target space is associated with the active space. By assuming that at the discontinuity between two adjacent values of the electric field, namely j and j + 1, the wavefunction located in the subspace S^(j) is completely projected into the subspace S^(j+1), the wave function obtained at any time t is [43]:
FIG. 6. Photodissociation of the HF molecule in the frequency range [0.015 a.u., 0.048 a.u.] calculated at t = 109 a.u. for a continuous wave laser of intensity I = 1 TW/cm2. Upper panel: the non-negligible overlaps between the initial state |n = 0, v = 0> and the eigenvectors |Φ_n,v>. Middle panel: imaginary part of the eigenvalues of the states |Φ_n,v>. Lower panel: the dissociation probabilities.
FIG. 7. Photodissociation of the H2+ molecule: complex energies of the central Floquet block (n = 0) without field-matter interaction. A group of (2 x 150) functions has been used to span the two electronic states (1sσg)²Σg+ and (2pσu)²Σu+ which participate. (+ denotes the bound states of the g potential surface; o and * denote discretized continuum states of the potential surfaces g and u, respectively.)
where |0, α> denotes the initial molecular state, and the summations are limited to the target space S^(j) previously constructed. A recent study has analysed the photodissociation of H2+ within the framework of this theory [43] and confirmed the results of previous studies [44]. Two main points have been revealed. First, the dissociation process is correctly described by collecting in the evolutive target spaces the resonance quasistates characterized by Gamow boundary conditions, thus neglecting the diffusion states which possess small lifetimes (see Fig. 7). Second, it is possible to repeat fast calculations of dissociation probabilities for varying pulse shapes if some data have been stored during the initial construction of the target spaces: in particular, the eigenvalues of the vectors constituting the basis set of S^(j), the overlap matrices of the basis sets of consecutive target spaces, and the asymptotic amplitudes of the basis vectors. This allows a detailed analysis of the influence
of the pulse shape on the dissociation probabilities and on the kinetic energy spectrum of the fragments [28].
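The staircase discretization of the field amplitude described above can be written down directly; in the sketch below, Fmax and N are illustrative values.

```python
# Piecewise-constant discretization of the pulse amplitude: rise to Fmax in N
# equal steps, then descend back to zero; each plateau gets its own target space.
import numpy as np

Fmax, N = 0.05, 10
dF = Fmax / N
i = np.arange(2 * N + 1)
F_i = np.where(i <= N, i * dF, (2 * N - i) * dF)   # F(t_i) = i*dF up, (2N - i)*dF down
print(F_i)
```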
5 Conclusion
Quantum dynamical problems involving coupled molecular and photon continua require the use of discrete representations built from sets of square integrable functions of large dimension. Various strategies have been used to reduce the dimension of these basis sets (cf. the article by Bacic in this volume). In most cases, this dimension reduction is only a first step and must be accompanied by other simplification procedures at the level of the dynamics. The Bloch wave operator theory presented in this chapter proposes a procedure based on the selection of an active space within which the dynamics is confined, together with a projection of the dynamical equations into this space. In recent years several formulations of dynamical theory involving active subspaces have been given; the selection of such a space is often made in the context of artificial intelligence methods. However, the theory of Bloch wave operators and its numerical implementation offers several specific advantages for handling problems of reactive dynamics:
The selection of the active space and the construction of the projected dynamical equations are both based on the same central concept, that of the wave operator.
The theory is equally applicable to treatments of the Hamiltonian operator which lead to either real or complex matrices; this flexibility is essential in the treatment of molecular resonances using discrete finite bases.
The eigenvectors which are associated with the active space, and which contain all the relevant dynamical information, can be obtained from the wave operator calculation.
The fundamental non-linear equations which constitute the heart of the method can be solved by several iterative treatments in which the basic computational operation, requiring the largest portion of CPU time at each iteration, is the formation of the matrix-column product HΨ, where H is the fixed Hamiltonian matrix and Ψ is a column which changes at each iteration. This feature of the calculations is obviously suitable for parallel computations.
The iterative nature of the calculations permits the introduction of adjustable parameters into the Hamiltonian and the study of the parameter dependence of results without the need to repeat ab initio the whole of the calculation. This feature has been exploited in the stabilization approach to molecular resonances, where the variable
parameter is the amplitude of the optical potential term, and in the study of laser-induced photodissociation, where the variable parameter is the electric field amplitude.
From a didactic point of view, the wave operator approach provides a deeper understanding of reactive phenomena; effects which at first sight involve several continua often reduce to the study of a few discrete resonance states. The second essential point about our formulation is the use of the general concept of a time-dependent wave operator, which has the stationary wave operator of Bloch, widely utilized in quantum chemistry, as its adiabatic limit. Applications of the fully time-dependent wave operator are as yet little developed, but we have high hopes of its utility. Two formal features of the operator are already noteworthy. The concept allows a new interpretation of the adiabatic theorem. In the presentation of Messiah [26] the theorem concerns only the strict adiabatic limit for spaces which are purely nondegenerate. The wave operator equations permit the study of more general evolution processes characterized by two time scales, one relating to a rapid internal evolution of the active space, the other relating to quasi-adiabatic exchanges between the active space and its complementary space. Equations (21) and (22), which define the time-dependent wave operator, transform the time-dependent dynamical equations into stationary equations in which the original Hamiltonian H(t) is transformed into a Hamiltonian H(t) − iħ ∂/∂t of Floquet type. This approach gives a changed formal status to the time variable and suggests some new routes for the treatment of the Schrodinger equation.
References
[1] R. Kosloff, Time dependent quantum mechanical methods for molecular dynamics, J. Phys. Chem., 92 (1988), pp. 2087-2100. [2] G.D. Billing, Classical path approach to inelastic and reactive scattering, in Supercomputer Algorithms for Reactivity, Dynamics and Kinetics of Small Molecules, Reidel Publ. Co., North-Holland, 1989. [3] G.D. Billing, The dynamics of molecule-surface interaction, Comp. Phys. Rep., 6 (1990), pp. 383-450. [4] R. Heather, An asymptotic wavefunction splitting procedure for propagating spatially extended wavefunctions: application to intense field photodissociation of H2+, Comp. Phys. Comm., 63 (1991), pp. 446-459. [5] D. Neuhauser, D.M. Baer, R.S. Judson and D.J. Kouri, The application of time-dependent wavepacket methods to reactive scattering, Comp. Phys. Comm., 63 (1991), pp. 460-481. [6] D.J. Kouri and R.C. Mowrey, Close coupling wave packet formalism for gas phase non reactive atom-diatom collisions, J. Chem. Phys., 86 (1987), pp. 2087-2094. [7] C. Leforestier, R.H. Bisseling, C. Cerjan, M.D. Feit, R. Friesner, A. Guldberg, A. Hammerich, G. Jolicard, W. Karrlein, H.D. Meyer, N. Lipkin, O. Roncero
and R. Kosloff, A comparison of different propagation schemes for the time dependent Schrodinger equation, J. Comp. Phys., 94 (1991), pp. 59-80. [8] T.J. Park and J.C. Light, Unitary quantum time evolution by iterative Lanczos reduction, J. Chem. Phys., 85 (1986), pp. 5870-5876. [9] Y. Sun, D.J. Kouri and D.W. Schwenke, Time-dependent wave packet algorithm for inelastic molecule-molecule scattering, Computer Phys. Commun., 63 (1991), pp. 51-62. [10] R. Kosloff and D. Kosloff, Absorbing boundaries of wave propagation problems, J. Comp. Phys., 63 (1986), pp. 363-376. [11] J.H. Shirley, Solution of the Schrodinger equation with a Hamiltonian periodic in time, Phys. Rev., B138 (1965), pp. 979-987. [12] U. Peskin and N. Moiseyev, The solution of the time dependent Schrodinger equation by the (t, t') method. Theory, computational algorithm and applications, J. Chem. Phys., 99 (1993), pp. 4590-4596. [13] B.R. Junker, Recent computational developments in the use of complex scaling in resonance phenomena, Advan. At. Mol. Phys., 18 (1982), pp. 207-263. [14] R. Lefebvre, Siegert quantization, complex rotation and molecular resonances, J. of Phys. Chem., 88 (1984), pp. 4839-4844. [15] W. Reinhardt, Complex coordinates in the theory of atomic and molecular structure and dynamics, Ann. Rev. Phys. Chem., 33 (1982), pp. 223-255. [16] G. Jolicard and J. Perie, The theory of stationary and time-dependent wave operators and their applications in photochemistry, Advan. Multiphoton Process and Spect., Vol. 8 (1994), pp. 1-126. [17] G. Jolicard and E.J. Austin, Optical potential method of calculating resonance energies and widths, Chem. Phys., 103 (1986), pp. 295-302. [18] M. Garcia Sucre and R. Lefebvre, Localization of the wave functions of resonance states through the use of an optical potential, J. Chem. Phys., 85 (1986), pp. 4753-4754. [19] N. Rom, N. Lipkin and N. Moiseyev, Optical potential by the complex coordinate method, Chem. Phys., 151 (1991), pp. 199-204. [20] G.G. Balint Kurti, R.N. Dixon and C.C. Marston, Time-dependent quantum dynamics of molecular photofragmentation processes, J. Chem. Soc. Faraday Trans., 86 (1990), pp. 1741-1749. [21] J. Perie and G. Jolicard, A spectral analysis method using the optical potential as a wave absorber, J. Phys. B., 26 (1993), pp. 4491-4502. [22] U. Peskin, N. Moiseyev and R. Lefebvre, Partial widths by asymptotic analysis of the complex scaled resonance wave function, J. Chem. Phys., 92 (1990), pp. 2902-2909. [23] J. Perie, G. Jolicard and J.P. Killingbeck, An iterative Bloch approach to the resonance state problem, J. Chem. Phys., 98 (1993), pp. 6344-6351. [24] J.A.R. Coope, Partitioning without implicit eigenvalues, Mol. Phys., 18 (1970), pp. 571-574. [25] J. Killingbeck and G. Jolicard, Perturbation methods for the matrix eigenproblem, J. Phys. A: Math. Gen., 25 (1992), pp. 6455-6459. [26] A. Messiah, Mecanique Quantique Part 2, Dunod, Paris, 1964. [27] G. Jolicard, A recursive solution to the time dependent Schrodinger equation using a generalized quasidegenerate Bloch formalism, J. Chem. Phys., 90 (1989), pp. 2320-2327. [28] G. Jolicard, The solution of the time dependent Schrodinger equation by the time-dependent wave operator method, Phys. Rev. A, to be published.
[29] Ph. Durand, Direct determination of effective Hamiltonians by wave operator methods. I. General formalism, Phys. Rev., A28 (1983), pp. 3184-3192. [30] C. Bloch, Sur la theorie des perturbations des etats lies, Nucl. Phys., 6 (1958), pp. 329-349. [31] J. Des Cloizeaux, Extension d'une formule de Lagrange a des problemes de valeurs propres, Nucl. Phys., 20 (1960), pp. 321-346. [32] J.P. Malrieu, Ph. Durand and J.P. Daudey, Intermediate Hamiltonians as a new class of effective Hamiltonians, J. Phys. A: Math. Gen., 18 (1985), pp. 809-826. [33] J.P. Killingbeck, Microcomputer Algorithms: Action from Algebra, Adam Hilger, Bristol, 1991. [34] G. Jolicard and A. Grosjean, On the validity of partitioning techniques for molecular infrared-multiphoton excitation, J. Chem. Phys., 95 (1991), pp. 1920-1927. [35] R.E. Wyatt, C. Iung and C. Leforestier, Quantum dynamics of overtone relaxation in benzene I. 5 and 9 modes for relaxation from CH(v=3), J. Chem. Phys., 97 (1992), pp. 3458-3476. [36] J. Chang and R.E. Wyatt, Preselecting paths for multiphoton dynamics using artificial intelligence, J. Chem. Phys., 85 (1986), pp. 1826-1839. [37] G. Jolicard, J.P. Killingbeck, Ph. Durand and J.L. Heully, A wave operator description of molecular photodissociation processes using the Floquet formalism, J. Chem. Phys., 100 (1994), pp. 325-333. [38] J.C. Barthelat and Ph. Durand, Theory of effective Hamiltonians and applications, in Computational Chemistry: Structures, Interactions and Reactivity, Part A, Elsevier, Amsterdam, 1992. [39] G. Jolicard, Effective Hamiltonian theory: an intermediate representation method for the wave operator calculation, Chem. Phys., 115 (1987), pp. 57-68. [40] G. Jolicard and E.J. Austin, On the recursive integration of stationary and time-dependent Bloch wave operator equations, Chem. Phys. Letters, 180 (1991), pp. 503-508. [41] T. Tung Nguyen Dang, Adiabatic time evolution of atoms and molecules in intense radiation fields, J. Chem. Phys., 90 (1989), pp. 2657-2665. [42] S.I. Chu, Generalized Floquet theoretical approaches to intense-field multiphoton and nonlinear optical processes, Advan. Chem. Phys., 73 (1989), pp. 739-798. [43] G. Jolicard and G.D. Billing, The extension of wave operator-Floquet formalism to molecular photodissociation processes with short laser pulses, J. Chem. Phys. (submitted for publication). [44] G. Giusti-Suzor, X. He, O. Atabek and F.H. Mies, Above threshold dissociation of H2+ in intense laser fields, Phys. Rev. Lett., 64 (1990), pp. 515-518; G. Jolicard and O. Atabek, Above threshold dissociation dynamics of H2+ with short intense laser pulses, Phys. Rev., A46 (1992), pp. 5845-5855.
Chapter 17 Problem Decomposition Techniques in Quantum Mechanical Reactive Scattering David W. Schwenke
Donald G. Truhlar
Abstract
In this chapter we discuss strategies for efficiently solving quantum mechanical reactive and inelastic scattering problems using algebraic variational methods. First we review the outgoing wave variational principle. Then we review three aspects of its implementation where problem decomposition techniques are used to make the calculations efficient. The first of these involves partitioning the Hamiltonian into a distortion part that is solved numerically and a coupling part that is treated by expanding the difference of the full outgoing wave and the distortion-potential-induced part of the outgoing wave in a basis. The second involves problem decomposition in channel space or physical space in order to obtain efficient basis functions for the fully coupled problem. In this section we also propose a new pre-diagonalization technique that may be used as the basis of a divide-and-conquer approach. Finally, we consider schemes for partitioning basis functions into Hilbert subspaces as direct analogs of domain decomposition in physical subspaces.
1
Introduction
In this chapter we consider the problem of predicting the rate of a chemical reaction where A, B, and C are atoms. The motion and interactions of these particles are assumed to be governed by the laws of quantum mechanics. We restrict
This work was supported in part by the National Science Foundation.
NASA Ames Research Center, MS 230-3, Moffett Field, CA 94035-1000.
Department of Chemistry, Chemical Physics Program, Supercomputer Institute, and High Performance Computing Research Center, University of Minnesota, Minneapolis, Minnesota 55455.
ourselves to the gas phase where the density is low enough so that only bimolecular collisions are important and wall interactions are negligible. In this case, we can separate out the motion of the center of mass of ABC and hence reduce the number of coordinates by three. We further assume in the present work that the Born-Oppenheimer [1,2] approximation has been invoked to decouple the motion of the electrons from the nuclei. This makes the quantum mechanical description of the reaction equivalent to that for the motion of two point masses on a potential energy hypersurface (PES). The PES can be determined using methods described in other chapters in this book [3,4]. In comparing the structure of the multi-particle Schrodinger problem to partial differential equation problems that arise in other chapters of this volume, it is useful to consider some general characteristics of the problem. We consider only the time-independent formulation, which leads to a six-dimensional, linear, elliptic partial differential equation. The relationship of the solution to the physical observables is contained in the total energy, which is a parameter in the equations, and in the complex traveling-wave boundary conditions where one particle is far from the other two; at the energies considered here there is not enough energy to simultaneously break all the bonds, so only the atom-diatom limit need be considered. The solution of the quantum mechanical equation of motion (the Schrodinger equation) for this problem yields a scalar function of the six internal degrees of freedom. This function is called the wave function, and all observable attributes of the collision can be determined from it. When A and BC are widely separated, the wave function can be written in the form
where the coordinates are (r1, θ1, φ1), the spherical polar coordinates of the vector from B to C, and (R1, Θ1, Φ1), the spherical polar coordinates of the vector from the center of mass of BC to A. The subscript 1 indicates the arrangement A + BC rather than B + AC (arrangement index α = 2) or C + AB (arrangement index α = 3). The function φ_αvj is an easily determined real square integrable (L2) function that describes the vibration of BC, and is labeled by its number of nodes (v, the vibrational quantum number) and by the equation it solves (which in turn is labeled by j and α, and here α = 1). For given jα, the φ_αvj form a complete orthonormal set. The function Y_jl^JM is a linear combination of the product of two spherical harmonics and describes the rotations of the system. The quantum number j specifies the rotational angular momentum of BC, l the orbital angular momentum of A with respect to BC, J the angular momentum of the system as a whole, and M the orientation of the total angular momentum vector. Since
space is isotropic, the equations of motion contain no explicit M dependence, and thus the only part of the wave function depending on M is Y_jl^JM. There is no coupling between wave functions with different values of J and M. The final function in (1), f_lvj,n0, describes the radial relative motion of A and BC, and it is not an L2 function. It is labeled by the previously introduced quantum numbers as well as the boundary condition index n0. The radial function is regular (i.e., zero) at the origin and satisfies the large-R1 boundary condition
(2) lim_{R1→∞} f_n,n0 = δ_nn0 k_n^(-1/2) exp[−i(k_n R1 − l_n π/2)] − S_nn0 k_n^(-1/2) exp[i(k_n R1 − l_n π/2)],
where n denotes a particular set of $\alpha vj\ell$ (each such set is called a channel), $\delta_{nn_0}$ is the Kronecker delta function (which is one if $n = n_0$ and zero otherwise), and $k_n$ is the wave vector defined by
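Assuming the conventional definition in terms of the quantities introduced in the next sentence, the wave vector is
$$
k_n \;=\; \bigl[\,2\mu\,(E-\varepsilon_n)\,\bigr]^{1/2}\big/\hbar ,
$$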
where $\mu$ is the reduced mass of the system (we mass scale all coordinates to a single reduced mass [5]), E is the total energy of the system, $\varepsilon_n$ is the internal energy for the choice of $\alpha vj$ specified by n, $\hbar$ is Planck's constant divided by $2\pi$, and $S_{nn_0}$ is a complex coefficient. The matrix with elements $S_{nn_0}$ is called the scattering matrix and is a dense, complex, symmetric, unitary matrix. The scattering matrix depends parametrically on the total energy E, and from it we can compute all measurable quantities of the collision using simple formulas [6,7,8]. Thus the focus of the remainder of this work is on the calculation of the scattering matrix.

The inclusion of $\mathcal{Y}^{JM}_{j\ell}$ in (1) deserves special emphasis. Use of this kind of basis function, which is intrinsically delocalized, allows us to take account of conservation of J and M, which is absolutely essential for efficiency as it greatly cuts down on the number of basis functions that must be considered at one time for convergence. The necessity to use basis functions corresponding to definite J and M is a principal reason for emphasizing delocalized basis functions in quantum mechanics.

For nonreactive problems, the most straightforward approach to determining the scattering matrix is to expand the wave function as in (1), substitute this into the Schrödinger equation, and project in turn on each of the known products $\phi_{v'j'}\mathcal{Y}^{JM}_{j'\ell'}$. This yields a set of coupled second order ordinary differential equations (ODEs) for the unknown radial functions $f_{nn_0}$. A linearly independent set of solutions is then numerically integrated outwards from the small-$R_1$ region, where the $f_{nn_0}$ are negligible, to the large-$R_1$ asymptotic region, where the numerical solutions can be matched to (2) to give the scattering matrix. For reactive problems, this straightforward approach cannot be used because basis functions defined in terms
of a single set of coordinates cannot efficiently describe all three possible reactants or products. Thus we turn to an alternate approach: we use basis functions to describe all degrees of freedom, defining some of the basis functions in terms of $r_\alpha$, $\theta_\alpha$, $\phi_\alpha$, $R_\alpha$, $\Theta_\alpha$, $\Phi_\alpha$ with $\alpha = 1$ and others in terms of such coordinates with $\alpha = 2$ and 3, and we use a variational principle equivalent to the Schrödinger equation to determine the scattering matrix. This reduces the problem to performing quadratures and linear algebra. Both steps can be performed efficiently on modern computers, and this approach provides considerable scope for introducing the ideas of problem decomposition.

Another approach to the coordinate problem of reactive scattering is to use hyperspherical coordinates and wave function matching [9,10,11,12,13]. This allows one to return to the coupled-ODE description, but with new complications. The hyperspherical approach will not be discussed further in this article.

Although the main motivation for the algebraic approach is the difficulty of treating reactive scattering with the coupled-ODE approach, it turns out that all of the details of our formalism that are important for the present discussion also arise in nonreactive scattering. Thus to simplify the following discussion of our formalism, we will only consider nonreactive scattering, and hence we drop the label $\alpha$. We also drop the quantum numbers J and M, which do not play an important role in the following discussion. Extension to reactive scattering essentially only involves adding back the extra quantum number (the arrangement index $\alpha$) and carrying out a new class of integrals. The explicit form of the integrals is given elsewhere [14].

In Section 2 we summarize the algebraic variational principle that we use, and in Sections 3-5 we discuss three ways in which problem decomposition is invoked to make the calculations efficient. Since our emphasis is on problem decomposition techniques, we shall not discuss specific applications in detail, but the reader is referred to a review [15] and a typical application [16] for such background.
2
Variational Principle
We first sketch out the derivation of the variational principle we use. We start with the Schrödinger equation, which takes the form
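In the time-independent form used throughout this chapter, this is simply
$$
(H - E)\,\Psi \;=\; 0 ,
$$
with H and E as described next.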
where H is the Hamiltonian operator, which is the sum of the six-dimensional Laplacian times $-\hbar^2/2\mu$ (the kinetic energy) and a scalar function (the PES). We partition the Hamiltonian as
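Consistent with the definitions that follow, the partition may be written as
$$
H \;=\; H^D \;+\; V^C ,
$$
so that $V^C$ carries whatever coupling is not included in the distortion Hamiltonian $H^D$.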
where $V^C$ is defined by
and $H^D$ takes the form
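In terms of the operators defined in the next sentence this is
$$
H^D \;=\; T \;+\; V^{\mathrm{vib}} \;+\; V^D .
$$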
where T is the kinetic energy operator, $V^{\mathrm{vib}}$ is the potential energy of the isolated diatomic, and $V^D$ is a distortion potential, the choice of which is discussed in the next section. All potentials are assumed to be represented as analytic functions of the internal coordinates, possibly combined with nonlocal projection operators (which will be called projectors). We assume that we have numerically obtained the Green's function
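Schematically, and up to convention-dependent constant factors, this is the resolvent of $H^D$ with outgoing-wave boundary conditions,
$$
G^{D(+)}(E) \;=\; \lim_{\epsilon\to 0^{+}}\bigl(E + i\epsilon - H^D\bigr)^{-1},
$$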
and the distorted waves $\psi^{(+)n}$ which solve
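that is, in the homogeneous form implied by the surrounding text,
$$
\bigl(H^D - E\bigr)\,\psi^{(\pm)n} \;=\; 0 ,
$$
with the boundary conditions described in the next sentence.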
The function $\psi^{(+)n}$ is expanded as in (1), and its radial part is regular at the origin and is subject to the boundary conditions of (2); $\psi^{(-)n}$ is its time-reversed counterpart. In practice the $G^{D(+)}$ and $\psi^{(\pm)n}$ functions are obtained by the coupled-ODE approach that is discussed in the introduction. In fact, in most cases we do not actually construct $G^{D(+)}$ but rather we directly solve the coupled ODEs for the required integrals over $G^{D(+)}$ [17]; that detail of implementation need not concern us here, but the efficiency of this technique is critical to the issues discussed in Section 4. Then a formal solution to the full problem is given by [18,19]
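In Lippmann-Schwinger form (with U proportional to the coupling $V^C$; the constant of proportionality is a convention we do not reconstruct here), this reads schematically
$$
\Psi^{(+)n_0} \;=\; \psi^{(+)n_0} \;+\; G^{D(+)}\,U\,\Psi^{(+)n_0} .
$$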
where
Referring to (2) and constructing wave packets from wave functions with such boundary conditions, $\Psi^{(+)n}$ is used to construct wave packets with an unscattered wave and outgoing scattered waves, whereas $\Psi^{(-)n}$ is used to construct wave packets with an unscattered wave and incoming scattered
waves, which is the time-reversed description [20]. It is convenient to define the quantity
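Consistent with the parenthetical remark that follows, this quantity is
$$
\Psi_{\mathrm{Ow}}^{(+)n_0} \;=\; \Psi^{(+)n_0} \;-\; \psi^{(+)n_0} ,
$$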
which may be called the outgoing wave. (Note that $\Psi_{\mathrm{Ow}}$ is actually the difference between the full outgoing wave and the part induced by the distortion potential, but we always call it just the outgoing wave.) Thus
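Combining this definition with the schematic formal solution written above gives
$$
\Psi_{\mathrm{Ow}}^{(+)n_0} \;=\; G^{D(+)}\,U\,\Psi^{(+)n_0} ,
$$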
and it can be shown by taking the large-$R$ limit of (10) that the scattering matrix is given by
where $^0S_{nn_0}$ is the scattering matrix due to the distortion potentials, and
In (15) and the equations that follow, $(a|b|c)$ means the integral
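That is, with $d\tau$ denoting the volume element for the six internal coordinates,
$$
(a|b|c) \;=\; \int a^{*}\, b\, c \; d\tau ,
$$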
where * denotes complex conjugation. In practice three of the six integrations in (16) can be carried out analytically. The remaining three-dimensional integration is carried out by direct products of Gaussian quadratures [14]. Equation (15) could be used with some trial $\Psi^{(+)n_0}$ to compute an approximation to the scattering matrix, but the results would be very sensitive to the choice of the trial function, since if the trial function differs from the exact one by the amount $\delta\Psi^{(+)n_0}$, the computed scattering matrix elements differ from the exact ones by $(\psi^{(-)n}|U|\delta\Psi^{(+)n_0})$. We seek instead a stationary expression for the scattering matrix which gives rise to errors involving integrals containing the product $\delta\Psi^{(-)n*}\,\delta\Psi^{(+)n_0}$, which should be much smaller. This can be done in several ways. In the method we use, we write
where we use three different expressions for $S_{nn_0}$ on the right hand side. The first is obtained by substituting (13) into (15):
In the second we substitute $\psi^{(-)n}$ obtained from inverting (10) into (15) to obtain
In the third, we start with (19), substitute (13) into the second term, and simplify using (12) to get
Putting (18)-(20) into (17) and using (8) to simplify the result, we have
The error in the scattering matrix obtained using this expression with a trial function is, up to a constant factor, $(\delta\Psi^{(-)n}|H-E|\delta\Psi^{(+)n_0})$. An additional feature of this expression is that it gives a symmetric scattering matrix for all choices of $\Psi^{(+)n_0}$ and $\Psi^{(-)n}$. Equation (21) gives us a prescription to compute the scattering matrix given a trial function, but it does not indicate how to choose parameters contained in the trial function. We reinforce the stationary nature of (21) by using the following procedure. Represent the trial function as a linear combination of known basis functions:
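Schematically, and using the notation of the matrices introduced below (the coefficients $A_{\beta n_0}$ are the elements of the rectangular matrix $\underline{\underline{A}}$),
$$
\Psi_{\mathrm{Ow}}^{(+)n_0} \;\approx\; \sum_{\beta} A_{\beta n_0}\, F_\beta .
$$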
The basis functions are regular at the origin and must either be $\mathcal{L}^2$ or have the limit
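Shown here for a single open channel (a sum over channels, each multiplied by its internal functions, may be implied in the original), this limit is of the purely outgoing form
$$
\lim_{R\to\infty} F_\beta \;=\; p_{\beta vj\ell}\,\exp\!\bigl[i\bigl(k_{vj} R - \ell\pi/2\bigr)\bigr],
$$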
where $p_{\beta vj\ell}$ is some coefficient. Other choices either cause the matrix element integrals to diverge or introduce incorrect boundary conditions. Then (21) can be written as
where double underline denotes a matrix,
where the superscript T means transpose and $\mathcal{T}$ denotes the time-reversal operator [8]. Note that $\underline{\underline{C}}$ is square and very large, $\underline{\underline{S}}$ is square and its row and column dimensions are about an order of magnitude smaller, and $\underline{\underline{A}}$ and $\underline{\underline{B}}$ are rectangular. We then require that
for all n, $n_0$, and $\beta$, where $\Re$ means real part and $\Im$ means imaginary part. This results in the linear equations
and thus
This is our final result. We call this the outgoing wave variational principle (OWVP), and we attribute the original derivation of a variational principle with this flexibility to Schlessinger [21]. In Schlessinger's work and our original derivation [22], we used boundary conditions based on the T matrix instead of the S matrix, where the T matrix is defined by
This version of the variational principle is called the scattered wave variational principle (SWVP); since (31) is linear in the two matrices, the OWVP and SWVP yield identical results. For other kinds of variational principles, see, for example, Refs. [23,24].

The appearance of a time-reversal operator in (26) and (27) may at first be surprising, but it becomes clear by elaborating on the discussion below (11). The natural formulation of a transition amplitude in quantum mechanics is $\langle\Psi_f|H'|\Psi_i\rangle$, where $\Psi_i$ is the initial state, $H'$ is the operator causing the transition, and $\Psi_f$ is the final state. For scattering, the final state is the time reversal of the initial state because the initial state evolves from a pure state in the infinite past into a mixture of states in the infinite future [25], whereas the final state evolves from a mixture of states in the infinite past to a pure state at the detector [20]. Thus the natural description
of scattering phenomena involves the time-reversal operator in the bra [26]. In practical work, when spin-orbit coupling may be neglected, one typically chooses a phase convention such that application of $\mathcal{T}$ reduces to complex conjugation of the radial components of the wave function or trial function but no conjugation of the angular parts [8]. In our work $\phi_{vj}(r)$ is real, and the trial function consists of sums of terms, each of which is a function of R times a function independent of R. Thus $\mathcal{T}$ reduces to complex conjugation of the functions of R.

If desired, one can solve for a single column of $\underline{\underline{A}}$, which yields a single column of $\underline{\underline{S}}$, which is sufficient to calculate all observables for collisions involving a particular initial state. This is useful when one employs iterative methods [27,28]. In practice, we have found it efficient to expand $\Psi_{\mathrm{Ow}}$ using a mixture of $\mathcal{L}^2$ functions and functions with the boundary conditions of (23). Since our choice of $\mathcal{L}^2$ functions leads to matrix elements of $\underline{\underline{C}}$ which are real when two $\mathcal{L}^2$ functions are involved, it is valuable to solve the linear equations in (30) by blocks, with the real block eliminated before considering the complex blocks [see (42)-(45) below]. The largest calculations we have carried out to date involved on the order of $10^4$ basis functions, about half of which were real. The dimension of the scattering matrix was about a factor of ten smaller. The cost of the calculation was about evenly split between the construction of the matrices $^0\underline{\underline{S}}$, $\underline{\underline{B}}$, and $\underline{\underline{C}}$ and the determination of $\underline{\underline{S}}$ by (30).

Although we have focussed on a trial function with complex boundary conditions, this is not essential. Similar ideas can be used to solve for wave functions with real, standing-wave boundary conditions, from which the S or T matrix can be constructed by a transformation, and we have solved some large-scale problems this way [29]. However, we have found that the direct variational solution for wave functions with the complex, traveling-wave boundary conditions leads to more stable results [29], and it will be the focus of the rest of this chapter.

3
Distortion Potentials

In this section we discuss the simplest aspect of problem decomposition in the OWVP, namely the choice of the distortion potential $V^D$ in (7). In order to conveniently discuss the distortion potentials, we introduce the scalar product $(n|a|n')$ defined by
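That is, an integration over the five internal coordinates other than R, leaving a function of R; schematically,
$$
(n|a|n')(R) \;=\; \int \bigl[\phi_{vj}\,\mathcal{Y}^{JM}_{j\ell}\bigr]^{*}\, a\, \bigl[\phi_{v'j'}\,\mathcal{Y}^{JM}_{j'\ell'}\bigr]\; d\tau' ,
$$
where $d\tau'$ is the volume element for the coordinates other than R, $n = vj\ell$, and $n' = v'j'\ell'$.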
where a is some operator. In practice it is possible to transform the scalar
products so that three of the five integrations can be carried out analytically. The remaining two-dimensional integration is not expensive. Then admissible distortion potentials are dependent on the form of the matrix elements $(n|H|n')$ in the limit $R \to \infty$. In particular, it is necessary that
and
where $H^{\infty}_{nn'}$ is the result of the limits in (33). If these two relations are not satisfied, then the correct boundary conditions cannot be obtained. It is advantageous to consider distortion potentials which exploit some or all of the decoupling permissible by (33) and (34). The basis functions specified by $n = vj\ell$ are eigenfunctions of the operator $T + V^{\mathrm{vib}}$, and $\lim_{R\to\infty} R^2 V = 0$, so any distortion potential that also satisfies $\lim_{R\to\infty} R^2 V^D = 0$ is possible. In general, we will want to use basis functions which are linear combinations of terms corresponding to different $vj\ell$. In this case, off-diagonal coupling can come from $T + V^{\mathrm{vib}}$. This can restrict the choices of the coupling in the distortion Hamiltonian.

The simplest possible choice is thus $V^D = 0$. In this case the radial parts of the distorted waves are Bessel functions, which are easily calculated. However, this choice is not practical except for very large $\ell$, because of the nature of the potential energy V. Atoms have hard cores, so that in the limit $R \to 0$ the potential energy is very repulsive, and since we are concerned with relatively low energies, there will be a finite distance $R_0$ below which all radial functions will be negligible. Unless $\ell$ is large, the regular Bessel functions will not be negligible for R smaller than $R_0$, and so the matrices in (24) would have elements orders of magnitude larger than those of $\underline{\underline{S}}$. Although in principle the remaining terms of (24) would cancel out the spurious large contributions if the basis were large enough, in practice this would be almost impossible for real problems. [Note that the two terms on the right side of (30) are not separately unitary, although the left side becomes unitary as the solution converges.] One possible fix is to multiply the Bessel functions by some arbitrary cutoff function. This would implicitly define a new $V^D$, but it is an unnatural way to proceed. Next we discuss distortion potentials which do not suffer from these drawbacks. In our work we use distortion potentials which take the general form
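Consistent with the description in the next sentence, this general form can be written schematically as
$$
V^D \;=\; \sum_{n,n'} A_{nn'}\; |n\rangle\,(n|V|n')\,\langle n'| .
$$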
where $A_{nn'}$ is a coefficient which specifies a particular distortion potential scheme, V is the potential energy of the interaction of the diatomic with the atom, and $|n\rangle\langle n|$ and $|n'\rangle\langle n'|$ are projectors. The projectors are defined so that
which is a function of R. The coefficient $A_{nn'}$ is defined by partitioning the set of n into disjoint sets. These sets are called distortion potential blocks. Then if n and n' belong to the same distortion potential block, $A_{nn'} = 1$. Otherwise it is zero.

The use of distortion potentials effects a problem decomposition. Part of the coupling is in the distortion problem, for which we solve (8) and (9), and the rest, defined by (6) and (11), is treated by the algebraic variational principle. The optimal distortion Hamiltonian coupling is a compromise that avoids making either of the two sides of the problem decomposition too large. As the distortion blocks become larger, less coupling needs to be taken care of by the variational principle, so fewer basis functions are required to expand $\Psi_{\mathrm{Ow}}$, and less work is required in evaluating the right hand side of (30). On the other hand, as the distortion blocks become larger, the amount of work to determine the distorted waves of (9) and the distorted-wave Green's functions of (8) increases, the amount of memory or disk needed to store these functions increases, and the work in evaluating the matrix elements for the right hand side of (30) increases. In our work, we include the basis functions most strongly coupled to the initial ones of interest in a single distortion block while decoupling as much as possible the other, less important functions. Further discussion of the tradeoffs is provided elsewhere [30].

4
Basis Function Contraction

In this section we are concerned with the choice of basis functions used to expand the outgoing wave, i.e., the functions $F_\beta$ in (22), and how the ideas of problem decomposition can be used to improve efficiency. In our work we have used two types of basis functions to expand $\Psi_{\mathrm{Ow}}$. The first are $\mathcal{L}^2$ functions which take the form
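Schematically (radial normalization conventions suppressed, and with one common Gaussian parameterization assumed for $t_m$),
$$
F_\beta \;=\; t_m(R)\,\phi_{vj}(r)\,\mathcal{Y}^{JM}_{j\ell}(\hat r,\hat R), \qquad t_m(R) \;\propto\; \exp\!\bigl[-(R-R_m)^2/(2w_m^2)\bigr],
$$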
where $t_m(R)$ is a Gaussian centered at $R_m$ and having width parameter $w_m$. Here $\beta$ indicates a particular choice of v, j, $\ell$, and m. These functions are not orthogonal because different Gaussians will overlap; however, orthogonality is not required in our formalism. In fact, when treating reactive processes,
we expand $\Psi_{\mathrm{Ow}}$ in terms of three such sets of basis functions, with each set using the coordinates most natural for the particular partitioning of the three atoms into atom and diatom. Thus each set uses a different set of coordinates, and hence each basis function of one set has nonzero overlap with all basis functions of the other sets. It also has nonzero overlap with functions in the same set that have the same v, j, and $\ell$. The second type of basis functions that we use are the continuum functions, and these are generated by applying a Green's function to an $\mathcal{L}^2$ basis function of the form of (37):
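Schematically,
$$
F_{\beta'} \;=\; G^{D(+)}(E)\,F_\beta ,
$$
where $F_\beta$ is an $\mathcal{L}^2$ function of the form of (37); whether a factor of the coupling U also appears inside the Green's function operation is a convention we do not reconstruct here.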
These functions can be computed using techniques similar to those used to determine the distorted waves [14,17]. In particular, these functions have the explicit form
where the functions $g^{\beta}_{vj\ell}(R)$ solve an inhomogeneous second order differential equation subject to the boundary conditions that they be regular at R = 0 and that (23) is satisfied. We call these radial functions half-integrated Green's functions (HIGFs). As we have presented them, these HIGFs are obtained using the Green's function for the distortion Hamiltonian used to generate the distorted wave. However, it is clear from (21) and (22) that this is not necessary. It is not clear what advantage there would be to using a different distortion Hamiltonian to generate the HIGFs. The motivation for (38) arises when one considers the variational principle [17,23] for the amplitude density, which is equal to $U\Psi^{(+)n}$. The result is equivalent to (21) with the restriction that only basis functions of the type given by (38) are used. The present formulation is much more general.

It is generally desirable to use the $\mathcal{L}^2$ functions of (37) rather than those of (38), because they are cheaper to deal with, but some HIGFs are required to ensure that $\Psi_{\mathrm{Ow}}$ has the correct boundary conditions at large R. However, we typically find we need fewer HIGFs than $\mathcal{L}^2$ functions to obtain a given level of convergence, so it can be advantageous to use more than the minimum number of HIGFs.

Although we have had good success using Gaussian basis functions for the R coordinate, we note that the function being expanded is far from Gaussian. Our formalism is not limited to Gaussians, and thus it should be possible to improve the efficiency of the calculations by introducing a different set of basis functions carefully tailored to the problem. One strategy
is to form these as linear combinations of the Gaussian basis functions, and in such a case they are termed contracted basis functions. This contraction can be done in several ways. For the $\mathcal{L}^2$ functions, it is possible to energy adapt [17,31] the basis functions by diagonalizing a small matrix and then using the eigenvectors whose local wavelength is approximately correct as contracted basis functions. (This may be considered to be a problem decomposition in that one transforms to an energy representation to take advantage of the natural decoupling of states of widely differing average energy.) Another option is to solve a one-dimensional scattering problem [24,32]. For the HIGFs, we have considered in previous work [33] the contraction of the R part of the $\mathcal{L}^2$ functions on the right hand side of (38). In this case the contraction coefficients were based on full scattering calculations with restricted $\alpha vj\ell$ basis sets. The results were quite encouraging. These calculations were carried out using the variational principle for the amplitude density, and with the more general framework presented here, it is clear that possibilities also exist for contracting after applying the Green's function.

The techniques used in some of the previous work to obtain contracted basis functions correspond to problem decomposition in channel space. In particular, the scattering problem is fully solved for one channel [32] or a subset of channels [33], and the resulting solutions are used as basis functions for treating the fully coupled set of all the channels.

So far we have just considered contracting functions for the coordinate R. Since the Gaussian functions are highly localized, we can consider contracting the $\phi_{vj}\mathcal{Y}^{JM}_{j\ell}$ of (37) by diagonalizing the full Hamiltonian averaged over R weighted by a single Gaussian. This would yield functions approximately adiabatic with respect to the coordinate R. The use of adiabatic functions has proven its value many times in calculations not using a variational principle [34]. Zhang and Miller [35] have shown that this technique is also valuable in algebraic variational calculations. Another valuable option, although not really basis function contraction, is to use more general vibrational functions in (37) [14]. Both this option and the others discussed in the previous paragraph are problem decomposition techniques in that one first treats the coupling in coordinates other than R, for a fixed R or averaged over a narrow range of R, and one then uses the results as basis functions for the full problem spanning all R.

It is also possible to form contracted functions coupling all degrees of freedom. For example, one could diagonalize the purely $\mathcal{L}^2$ part of $\underline{\underline{C}}$ required in (30), but use only a subset of the eigenvectors (selected on the basis of their energy eigenvalues or the character of the eigenvector) to form the inverse in (30). This could be advantageous since for this part of $\underline{\underline{C}}$ the basis functions are independent of E, so the diagonalization would have to be performed only once. This idea could be implemented in a divide-
and-conquer way by partitioning the basis functions into several groups, each diagonalized separately. The partitioning might be based on channel indices, or — for localized basis functions — on physical space. A characteristic of the techniques discussed in this section is that a part of the problem is decoupled from the other parts to get good basis functions for treating the fully coupled problem. Then the basis functions are combined variationally for the full problem. The variational character of the final step makes up to some extent for the deficiencies of the basis due to the fact that it was obtained in decoupled steps.
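As a purely illustrative sketch of this divide-and-conquer contraction idea (this is not the authors' code; the function names, the selection criterion, and the dimensions are hypothetical), one might diagonalize the real, energy-independent $\mathcal{L}^2$ block of the matrix once, retain a subset of eigenvectors, and use them to compress the full complex problem at each energy:

```python
import numpy as np

def contract_basis(C_L2, n_keep):
    """Diagonalize the real symmetric L^2-L^2 block once and keep a subset of
    eigenvectors as contracted basis functions (illustrative sketch only)."""
    evals, evecs = np.linalg.eigh(C_L2)   # ascending eigenvalues
    return evecs[:, :n_keep]              # keep the lowest-eigenvalue subset (one possible criterion)

def solve_in_contracted_basis(C_full, B_full, T):
    """Project the full complex linear system onto the contracted basis,
    solve the smaller problem, and expand the solution back (schematic)."""
    C_small = T.T @ C_full @ T            # T is real, so no conjugation is needed
    B_small = T.T @ B_full
    A_small = np.linalg.solve(C_small, B_small)
    return T @ A_small

# Hypothetical dimensions, for illustration only
rng = np.random.default_rng(0)
n, n_rhs = 200, 3
M = rng.standard_normal((n, n))
C_L2 = M + M.T                                        # stand-in for the real L^2 block
C_full = C_L2 + 0.05j * rng.standard_normal((n, n))   # stand-in for the full complex matrix
B_full = rng.standard_normal((n, n_rhs)) + 0j
T = contract_basis(C_L2, n_keep=60)
A_approx = solve_in_contracted_basis(C_full, B_full, T)
```

Because the $\mathcal{L}^2$ block is independent of E, the diagonalization in such a scheme would be performed once and reused at every energy.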
5
Optical Potentials and Related Approaches
In previous sections we considered methods for improving the efficiency of our calculations that did not lead obviously to simplifying approximations. In this section we manipulate our equations to reveal the possibilities of introducing a phenomenological function, called an optical potential, which hopefully allows results of useful accuracy to be obtained at reduced cost. We start by partitioning the basis functions $F_\beta$ into two groups called P and Q so that we can write
and
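One common way of writing the resulting block structure (the detailed definitions belong to the original equations (40) and (41), which we do not reproduce) is
$$
\underline{\underline{C}} \;=\;
\begin{pmatrix}
\underline{\underline{C}}^{PP} & \underline{\underline{C}}^{PQ}\\
\underline{\underline{C}}^{QP} & \underline{\underline{C}}^{QQ}
\end{pmatrix},
\qquad
\underline{\underline{B}} \;=\;
\begin{pmatrix}
\underline{\underline{B}}^{P}\\
\underline{\underline{B}}^{Q}
\end{pmatrix},
\qquad
\underline{\underline{A}} \;=\;
\begin{pmatrix}
\underline{\underline{A}}^{P}\\
\underline{\underline{A}}^{Q}
\end{pmatrix}.
$$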
Next we solve (30) by blocks, obtaining
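Assuming the linear equations take the generic form $\underline{\underline{C}}\,\underline{\underline{A}} = \underline{\underline{B}}$ (constant factors and precise definitions as in the original equations), the block solution has the familiar Schur-complement structure
$$
\Bigl[\underline{\underline{C}}^{PP} - \underline{\underline{C}}^{PQ}\bigl(\underline{\underline{C}}^{QQ}\bigr)^{-1}\underline{\underline{C}}^{QP}\Bigr]\underline{\underline{A}}^{P}
\;=\; \underline{\underline{B}}^{P} - \underline{\underline{C}}^{PQ}\bigl(\underline{\underline{C}}^{QQ}\bigr)^{-1}\underline{\underline{B}}^{Q},
\qquad
\underline{\underline{A}}^{Q} \;=\; \bigl(\underline{\underline{C}}^{QQ}\bigr)^{-1}\bigl(\underline{\underline{B}}^{Q} - \underline{\underline{C}}^{QP}\underline{\underline{A}}^{P}\bigr).
$$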
where
and
If the Q functions are all real, then (42)-(45) form an attractive solution to (30), since $\underline{\underline{C}}^{QQ}$ will be real. However, in this section we want to consider
the more general case. First though, we make a connection to work in other fields. If our basis functions were localized, P and Q could be associated with different domains of the physical space. Then (43)-(45) would provide a generalization of traditional domain decomposition techniques [36,37]. We may, for example, consider (45) to represent the Schur complement of function space P connected to function space Q. Consider (43) in more detail. It can be rewritten as
Now provided
which will be true if the internal part of $F_\beta$ is orthogonal to the internal part of $\psi^{(+)n_0}$ or if $\lim_{R\to\infty} F_\beta = 0$, where R is the continuum coordinate for $\psi^{(+)n_0}$. We can write
where
We define
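Schematically, in terms of the blocks introduced above (the sign and any constant factors are conventions we do not reconstruct), this quantity is the Schur-complement correction
$$
\underline{\underline{V}}^{\mathrm{opt}} \;=\; -\,\underline{\underline{C}}^{PQ}\bigl(\underline{\underline{C}}^{QQ}\bigr)^{-1}\underline{\underline{C}}^{QP},
$$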
and we call this the optical potential. This kind of partitioning was first introduced into quantum mechanical scattering theory by Feshbach [38] and was introduced into algebraic variational calculations by Nesbet [39]. The usefulness of the optical potential is that if we can evaluate it simply, it is possible to carry out a small calculation including only the P functions
and adding the optical potential to the interaction potential, and obtain the same result as the larger calculation including both the P and Q functions. The problem is that the exact optical potential of (51) is no easier to determine than solving the problem explicitly including both the P and Q functions. However, it may be possible to obtain an approximate optical potential which gives results of useful accuracy.

Let us consider some of the properties of the optical potential. It is energy dependent, both explicitly as indicated by (51) and implicitly through $\underline{\underline{C}}^{QQ}$, which contains the energy [see (27) and (41)]. It is nonlocal, so that more work may be required to evaluate its matrix elements as compared to U. If all of the Q functions are $\mathcal{L}^2$ and hence real, the optical potential will be real. Otherwise the optical potential will have nonzero real and imaginary parts.

Perhaps the most troublesome property of the optical potential is its nonlocality. There is extensive literature on using local phenomenological optical potentials in electron scattering [40], and these are mostly based on physical arguments relating to the role of electronically excited states. Local approximations to the optical potential are also widely used in nuclear physics, in which case they are typically justified by energy averaging but determined in practice by empirical considerations [38]. Neither the electron scattering nor the nuclear reaction literature is particularly helpful in the present context. Local approximations more suitable for scattering processes involving molecular vibrational and rotational motions have also been advanced [41], but they are less well developed. It is possible to obtain local potentials that are fully equivalent to the nonlocal optical potential, but these show strong energy-dependent structure as a function of scattering energy; useful approximations can be obtained by smoothing these potentials [42]. Other approaches include treating the optical potential by perturbation theory [43] or attempting to collapse its effects into a smaller number of "effective" channels [44]. Considerable progress should also be possible employing nonlocal optical potentials, though, by choosing the Q space to make the calculations convenient.

For the present application to reactive scattering, one could consider several choices for the Q functions. One choice, considered by Baer and coworkers [45] and Seideman and Miller [46], is to have Q include all continuum functions. In that work the phenomenological optical potentials were taken to be purely imaginary, with negative imaginary parts, nonzero only at the boundaries where the interaction potential is small. The assumed form of the phenomenological optical potentials was quite simple; nevertheless quite encouraging results were obtained. Another partitioning would be to assign functions with $\varepsilon^{\mathrm{as}}_{v_\beta j_\beta} > E$ to Q and other functions to P. In this case all Q functions are $\mathcal{L}^2$ (the HIGFs
are $\mathcal{L}^2$ in this case since $k^2_{v_\beta j_\beta} < 0$), and the optical potential would be real. Perhaps the most promising avenue of approach is based on returning to the domain decomposition analog. For example, suppose we choose our basis functions such that
Let the sizes of the partitions be $M_P$, $M_Q$, and $M_R$ such that the order of $\underline{\underline{C}}$ is
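That is,
$$
M \;=\; M_P + M_Q + M_R .
$$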
Then, instead of solving the $M \times M$ complex equations for $\underline{\underline{A}}$, we may solve the following much simpler set [36]:
Extension of this three-partition approach to four or more partitions is clearly possible. Any number of basis set partitioning schemes may be imagined to make the smaller problems real, independent of energy, and/or particularly convenient for solution by iterative or parallel methods. The new ideas presented in this section are topics of current research.
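As a purely illustrative numerical sketch of the kind of block elimination contemplated in this section (not the authors' algorithm; the generic form $\underline{\underline{C}}\,\underline{\underline{A}} = \underline{\underline{B}}$, the two-block partition, and all names are assumptions), a real block can be factored once and applied to the real and imaginary parts of the right-hand sides, so that complex arithmetic is confined to the smaller Schur-complement system:

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve

def solve_by_blocks(C_PP, C_PQ, C_QP, C_QQ_real, B_P, B_Q):
    """Solve the partitioned system
        [C_PP  C_PQ] [A_P]   [B_P]
        [C_QP  C_QQ] [A_Q] = [B_Q]
    by eliminating the real Q block first, so that only the smaller
    P block is handled in complex arithmetic (illustrative sketch only)."""
    lu_piv = lu_factor(C_QQ_real)                    # factor the real block once

    def real_solve(rhs):
        # Apply the real factorization to the real and imaginary parts separately.
        return lu_solve(lu_piv, rhs.real) + 1j * lu_solve(lu_piv, rhs.imag)

    X = real_solve(C_QP)                             # (C_QQ)^{-1} C_QP
    y = real_solve(B_Q)                              # (C_QQ)^{-1} B_Q
    S = C_PP - C_PQ @ X                              # small complex Schur complement
    A_P = np.linalg.solve(S, B_P - C_PQ @ y)
    A_Q = y - X @ A_P                                # back-substitution for the Q block
    return A_P, A_Q

# Hypothetical sizes, for illustration only
rng = np.random.default_rng(1)
mP, mQ, n_rhs = 30, 300, 4
C_PP = rng.standard_normal((mP, mP)) + 1j * rng.standard_normal((mP, mP))
C_PQ = rng.standard_normal((mP, mQ)) + 1j * rng.standard_normal((mP, mQ))
C_QP = rng.standard_normal((mQ, mP)) + 1j * rng.standard_normal((mQ, mP))
C_QQ = rng.standard_normal((mQ, mQ))                 # the real block
B_P = rng.standard_normal((mP, n_rhs)) + 0j
B_Q = rng.standard_normal((mQ, n_rhs)) + 0j
A_P, A_Q = solve_by_blocks(C_PP, C_PQ, C_QP, C_QQ, B_P, B_Q)
```

In a multi-partition scheme of the kind described in the text, the same idea would be applied block by block, eliminating each convenient (real or energy-independent) block in turn.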
6
Concluding Remarks
We have seen that techniques for solving quantum mechanical scattering problems with linear variational principles and delocalized basis functions allow for a number of divide-and-conquer strategies. Some of these are
analogs in spectral space of techniques used in other fields in physical space, whereas other problem decomposition avenues are specialized approaches based on the specific nature of the scattering problem.
References

[1] See, e.g., M. A. Morrison, T. L. Estle, and N. F. Lane, Quantum States of Atoms, Molecules, and Solids, Prentice-Hall, Englewood Cliffs, 1976.
[2] C. A. Mead, The Born-Oppenheimer approximation in molecular quantum mechanics, in Mathematical Frontiers of Computational Chemical Physics, ed. by D. G. Truhlar, Springer-Verlag, New York, 1988, pp. 1-17.
[3] H. J. Werner, chapter in this volume.
[4] E. B. Stechel, chapter in this volume.
[5] J. Z. H. Zhang, D. J. Kouri, K. Haug, D. W. Schwenke, Y. Shima, and D. G. Truhlar, $\mathcal{L}^2$ amplitude density methods for multichannel inelastic and rearrangement collisions, J. Chem. Phys., 88 (1988) pp. 2492-2512.
[6] M. S. Child, Molecular Collision Theory, Academic Press, London, 1974.
[7] J. M. Blatt and L. C. Biedenharn, The angular distribution of scattering and reaction cross sections, Rev. Mod. Phys., 24 (1952) pp. 258-272.
[8] D. G. Truhlar, C. A. Mead, and M. A. Brandt, Time-reversal invariance, representations for scattering wave functions, symmetry of the scattering matrix, and differential cross sections, Adv. Chem. Phys., 33 (1975) pp. 295-344.
[9] D. W. Schwenke, D. G. Truhlar, and D. J. Kouri, Propagation method for the solution of the arrangement-channel coupling equations for reactive scattering in three dimensions, J. Chem. Phys., 86 (1987) pp. 2772-2786.
[10] R. T Pack and G. A. Parker, Quantum reactive scattering in three dimensions using hyperspherical (APH) coordinates. Theory, J. Chem. Phys., 87 (1987) pp. 3888-3921.
[11] J. M. Launay and B. LePetit, Three dimensional study of the reaction H+FH(vj) → HF(v'j')+H in hyperspherical coordinates, Chem. Phys. Lett., 144 (1988) pp. 346-352.
[12] G. C. Schatz, Quantum reactive scattering using hyperspherical coordinates: results for H+H2 and Cl+HCl, Chem. Phys. Lett., 150 (1988) pp. 92-98.
[13] S. A. Cuccaro, P. G. Hipes, and A. Kuppermann, Hyperspherical coordinate reactive scattering using variational surface functions, Chem. Phys. Lett., 154 (1989) pp. 155-164.
[14] G. J. Tawa, S. L. Mielke, D. G. Truhlar, and D. W. Schwenke, Linear algebraic formulation of reactive scattering with general basis functions, in Advances in Molecular Vibrations and Collision Dynamics, ed. by J. M. Bowman, JAI Press, Greenwich, 1994, Vol. 2B, pp. 45-116.
[15] D. G. Truhlar, D. W. Schwenke, and D. J. Kouri, Quantum dynamics of chemical reactions by converged algebraic variational calculations, J. Phys. Chem., 94 (1990) pp. 7346-7352.
[16] S. L. Mielke, G. C. Lynch, D. G. Truhlar, and D. W. Schwenke, A more accurate potential energy surface and quantum mechanical cross section calculations for the F+H2 reaction, Chem. Phys. Lett., 213 (1993) pp. 10-16.
[17] D. W. Schwenke, K. Haug, M. Zhao, D. G. Truhlar, Y. Sun, J. Z. H. Zhang, and D. J. Kouri, Quantum mechanical algebraic variational methods for inelastic and reactive molecular collisions, J. Phys. Chem., 92 (1988) pp. 3202-3216.
[18] B. A. Lippmann and J. S. Schwinger, Variational principles for scattering processes, I, Phys. Rev., 79 (1950) pp. 469-480.
[19] D. J. Kouri, Y. Sun, R. C. Mowrey, J. Z. H. Zhang, D. G. Truhlar, K. Haug, and D. W. Schwenke, New time-dependent and time-independent computational methods for molecular collisions, in Mathematical Frontiers of Computational Chemical Physics, ed. by D. G. Truhlar, Springer-Verlag, New York, 1988, pp. 207-243.
[20] M. L. Goldberger and K. M. Watson, Collision Theory, Wiley, New York, 1964.
[21] L. Schlessinger, Use of analyticity in the calculation of nonrelativistic scattering amplitudes, Phys. Rev., 167 (1968) pp. 1411-1423; S. C. Pieper, J. Wright, and L. Schlessinger, Calculations of three-particle scattering amplitudes using local Yukawa potentials, Phys. Rev. D, 3 (1971) pp. 2419-2424.
[22] Y. Sun, D. J. Kouri, D. G. Truhlar, and D. W. Schwenke, Dynamical basis sets for algebraic variational calculations in quantum-mechanical scattering theory, Phys. Rev. A, 41 (1990) pp. 4857-4862.
[23] G. Staszewska and D. G. Truhlar, Convergence of $\mathcal{L}^2$ methods for scattering problems, J. Chem. Phys., 86 (1987) pp. 2793-2804.
[24] D. E. Manolopoulos, M. D'Mello, and R. E. Wyatt, Translational basis set contraction in variational reactive scattering, J. Chem. Phys., 93 (1990) pp. 403-411.
[25] A. Bohm, Quantum Mechanics: Foundations and Applications, 3rd edition, Springer-Verlag, New York, 1993, pp. 54-58.
[26] D. G. Truhlar and C. A. Mead, Stationary principle for quantum mechanical resonance states, Phys. Rev. A, 42 (1990) pp. 2593-2602.
[27] D. C. Chatfield, M. S. Reeves, D. G. Truhlar, C. Duneczky, and D. W. Schwenke, Complex generalized minimal residual algorithm for iterative solution of quantum mechanical reactive scattering equations, J. Chem. Phys., 97 (1992) pp. 8322-8333.
[28] M. S. Reeves, D. C. Chatfield, and D. G. Truhlar, Preconditioned complex generalized minimal residual algorithm for dense algebraic variational equations in quantum reactive scattering, J. Chem. Phys., 99 (1993) pp. 2739-2751.
[29] S. L. Mielke, R. S. Friedman, D. G. Truhlar, and D. W. Schwenke, High-energy state-to-state quantum dynamics for D+H2 (v=j=1) → HD(v'=1,j')+H, Chem. Phys. Lett., 188 (1992) pp. 359-367.
[30] S. L. Mielke, D. G. Truhlar, and D. W. Schwenke, Improved techniques for outgoing wave variational principle calculations of converged state-to-state transition probabilities for chemical reactions, J. Chem. Phys., 95 (1991) pp. 5930-5939.
[31] G. Staszewska and D. G. Truhlar, Energy-adapted basis sets for quantal scattering calculations, J. Chem. Phys., 86 (1987) pp. 1646-1648.
[32] J. Abdallah, Jr., and D. G. Truhlar, The use of contracted basis functions in algebraic variational scattering calculations, J. Chem. Phys., 61 (1974) pp. 30-36.
[33] M. Zhao, D. G. Truhlar, D. W. Schwenke, C-h. Yu, and D. J. Kouri, Contracted basis functions for variational solutions of quantum mechanical reactive scattering problems, J. Phys. Chem., 94 (1990) pp. 7062-7069.
[34] See, e.g., J. C. Light and R. B. Walker, An R matrix approach to the solution of coupled equations for atom-molecule reactive scattering, J. Chem. Phys., 65 (1976) pp. 4272-4282; A. Kuppermann, G. C. Schatz, and M. Baer, Quantum mechanical reactive scattering for planar atom plus diatom systems. I. Theory, J. Chem. Phys., 65 (1976) pp. 4596-4623; N. A. Mullaney and D. G. Truhlar, The use of rotationally and orbitally adiabatic basis functions to calculate rotational excitation cross sections for atom-molecule collisions, Chem. Phys., 39 (1979) pp. 91-104.
[35] J. Z. H. Zhang and W. H. Miller, Quasi-adiabatic basis functions for the S-matrix Kohn variational approach to quantum reactive scattering, J. Phys. Chem., 94 (1990) pp. 7785-7789.
[36] See, e.g., O. B. Widlund, Iterative substructuring methods: Algorithms and theory for elliptic problems in the plane, in First International Symposium on Domain Decomposition Methods for Partial Differential Equations, ed. by R. Glowinski, G. H. Golub, G. A. Meurant, and J. Periaux, SIAM, Philadelphia, 1988, pp. 113-128; T. F. Chan and D. C. Resasco, A framework for the analysis and construction of domain decomposition preconditioners, ibid., pp. 217-230; and other chapters in this volume.
[37] For a recent review, see P. LeTallec, Domain decomposition methods in computational mechanics, Computational Mechanics Advances, 1 (1994) pp. 121-220.
[38] For a review, see H. Feshbach, Theoretical Nuclear Physics: Nuclear Reactions, John Wiley and Sons, New York, 1992.
[39] R. K. Nesbet, Theory of low-energy electron scattering by complex atoms, Adv. Quant. Chem., 9 (1975) pp. 215-297.
[40] See, e.g., D. G. Truhlar, Effective potentials for intermediate-energy electron scattering: Testing theoretical models, in Chemical Applications of Atomic and Molecular Electrostatic Potentials, ed. by P. Politzer and D. G. Truhlar, Plenum, New York, 1981, pp. 123-172; D. Thirumalai, G. Staszewska, and D. G. Truhlar, Dispersion relation techniques for approximating the optical model potential for electron scattering, Comments At. Mol. Phys., 20 (1987) pp. 217-243.
[41] See, e.g., D. A. Micha, Optical models in molecular collision theory, in Dynamics of Molecular Collisions, Part A, ed. by W. H. Miller, Plenum, New York, 1976, pp. 81-129; D. Farelly and R. D. Levine, Optical model of dissociative chemisorption: HI on the (111), (110), and (100) faces of copper, J. Chem. Phys., 97 (1992) pp. 2139-2148.
[42] D. W. Schwenke, D. Thirumalai, and D. G. Truhlar, Accurate, smooth, local, energy-dependent optical potentials for electron scattering, Phys. Rev. A, 28 (1983) pp. 3258-3267.
[43] See, e.g., G. Y. Csanak and H. S. Taylor, Adiabatic limit and nonadiabatic effects in the second order transition potential: semi-empirical analysis, J. Phys. B, 6 (1973) pp. 2055-2071; B. H. Bransden, T. Scott, R. Shingal, and R. K. Roychoudhury, Elastic and inelastic scattering of electrons by atomic hydrogen at intermediate energies in a coupled-channel second-order potential model, J. Phys. B, 15 (1982) pp. 4605-4616.
[44] G. Staszewska, D. W. Schwenke, and D. G. Truhlar, Collapsed close-coupling method: A systematic alternative to the multiparticle optical potential for solutions of the Schrödinger equation in a truncated subspace, Phys. Rev. A, 31 (1985) pp. 1348-1353.
[45] I. Last, D. Neuhauser, and M. Baer, The application of negative imaginary arrangement decoupling potentials to reactive scattering: conversion of a reactive scattering problem into a bound-type problem, J. Chem. Phys., 96 (1992) pp. 2017-2024.
[46] T. Seideman and W. H. Miller, Quantum mechanical reaction probabilities via a discrete variable representation-absorbing boundary condition Green's function, J. Chem. Phys., 97 (1992) pp. 2499-2514.