Scientific Computing with Automatic Result Verification
This is volume 189 in MATHEMATICS IN SCIENCE AND ENGINEERING, edited by William F. Ames, Georgia Institute of Technology. A list of recent titles in this series appears at the end of this volume.
SCIENTIFIC COMPUTING WITH AUTOMATIC RESULT VERIFICATION
Edited by U. Kulisch, Institut für Angewandte Mathematik, Universität Karlsruhe, Karlsruhe, Germany
ACADEMIC PRESS, INC. Harcourt Brace Jovanovich, Publishers
Boston San Diego New York London Sydney Tokyo Toronto
This book is printed on acid-free paper. Copyright © 1993 by Academic Press, Inc. All rights reserved. No part of this publication may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopy, recording, or any information storage and retrieval system, without permission in writing from the publisher. ACADEMIC PRESS, INC., 1250 Sixth Avenue, San Diego, CA 92101-4311. United Kingdom Edition published by ACADEMIC PRESS LIMITED, 24-28 Oval Road, London NW1 7DX. ISBN 0-12-044210-8. Printed in the United States of America. 92 93 94 95 EB 9 8 7 6 5 4 3 2 1
Contents

Contributors vii
Preface ix
Acknowledgements x

E. Adams, U. Kulisch: Introduction 1

I. Language and Programming Support for Verified Scientific Computation

R. Hammer, M. Neaga, D. Ratz: PASCAL-XSC, New Concepts for Scientific Computation and Numerical Data Processing 15
Wolfgang V. Walter: ACRITH-XSC, A Fortran-like Language for Verified Scientific Computing 45
Christian Lawo: C-XSC, A Programming Environment for Verified Scientific Computing and Numerical Data Processing 71
G. Bohlender, D. Cordes, A. Knöfel, U. Kulisch, R. Lohner, W. V. Walter: Proposal for Accurate Floating-point Vector Arithmetic 87

II. Enclosure Methods and Algorithms with Automatic Result Verification

Hans-Christoph Fischer: Automatic Differentiation and Applications 105
Rainer Kelch: Numerical Quadrature by Extrapolation with Automatic Result Verification 143
Ulrike Storck: Numerical Integration in Two Dimensions with Automatic Result Verification 187
Hans-Jürgen Dobner: Verified Solution of Integral Equations with Applications 225
Wolfram Klein: Enclosure Methods for Linear and Nonlinear Systems of Fredholm Integral Equations of the Second Kind 255
W. Rufeger and E. Adams: A Step Size Control for Lohner's Enclosure Algorithm for Ordinary Differential Equations with Initial Conditions 283
Rudolf J. Lohner: Interval Arithmetic in Staggered Correction Format 301

III. Applications in the Engineering Sciences

Walter Krämer: Multiple Precision Computations with Result Verification 325
Beate Gross: Verification of Asymptotic Stability for Interval Matrices and Applications in Control Theory 357
Wera U. Klein: Numerical Reliability of MHD Flow Calculations 397
Ernst Adams: The Reliability Question for Discretizations of Evolution Problems. Part I: Theoretical Considerations on Failures 423. Part II: Practical Failures 465
R. Schütz, W. Winter, G. Ehret: KKR Bandstructure Calculations, A Challenge to Numerical Accuracy 527
Andreas Knöfel: A Hardware Kernel for Scientific/Engineering Computations 549
Gerd Bohlender: Bibliography on Enclosure Methods and Related Topics 571
Contributors

E. Adams, IAM
G. Bohlender, IAM
D. Cordes, Landstr. 104A, D-6905 Schriesheim, Germany
H. J. Dobner, Math. Inst. II, Univ. Karlsruhe, D-7500 Karlsruhe 1, Germany
G. Ehret, Kernforschungszentrum, Postfach 3640, D-7500 Karlsruhe 1, Germany
H. C. Fischer, Tauberstr. 1, D-7500 Karlsruhe 1, Germany
B. Gross, Math. Inst. I, Univ. Karlsruhe, D-7500 Karlsruhe 1, Germany
R. Hammer, IAM
R. Kelch, Tilsiter Weg 2, D-6830 Schwetzingen, Germany
W. Klein, Kleinstr. 45, D-8000 München 70, Germany
W. U. Klein, Kleinstr. 45, D-8000 München 70, Germany
A. Knöfel, IAM
W. Krämer, IAM
U. Kulisch, IAM
Ch. Lawo, IAM
R. Lohner, IAM
M. Neaga, Numerik Software GmbH, Postf. 2232, D-7570 Baden-Baden, Germany
D. Ratz, IAM
W. Rufeger, School of Math., Georgia Inst. of Tech., Atlanta GA 30332, USA
R. Schütz, Kernforschungszentrum, Postfach 3640, D-7500 Karlsruhe 1, Germany
U. Storck, IAM
W. V. Walter, IAM
H. Winter, Kernforschungszentrum, Postfach 3640, D-7500 Karlsruhe 1, Germany
IAM is an abbreviation for: Institut f. Angewandte Mathematik, Univ. Karlsruhe, Kaiserstr. 12, D-7500 Karlsruhe 1, Germany
Preface This book presents a collection of papers on recent progress in the development and applications of numerical algorithms with automatic result verification. We also speak of Enclosure Methods. An enclosure consists of upper and lower bounds of the solution. Their computability implies the existence of the unknown true solution which is being enclosed. The papers in this book address mainly the following areas:
I. the development of computer languages and programming environments supporting the totally error-controlled computational determination of enclosures;
II. corresponding software, predominantly for problems involving the differentiation or the integration of functions or for differential equations or integral equations;
III. in the context of scientific computing, the mathematical simulation of selected major real world problems, in conjunction with parallel numerical treatments by means of software with or without total error control.
Concerning III, the practical importance of techniques with automatic result verification or Enclosure Methods is stressed by the surprisingly large and even qualitative differences of the corresponding results as compared with those of traditional techniques. These examples do not rest on "suitably chosen" ill-conditioned problems; rather, they have arisen naturally in research work in several areas of engineering or physics. The "surprise character" of these examples is due to the fact that computed enclosures guarantee the existence of the enclosed true solution, whereas there is no such implication in the case that a numerical method has been executed without total error control. The bulk of the papers collected in this book represents selected material taken from doctoral or diploma theses which were written at the Institute for Applied Mathematics at the University of Karlsruhe. Concerning Enclosure Methods, the level of development addressed here rests on extensive research work. The essentially completed developmental stages comprise in particular large classes of finite-dimensional problems and ordinary differential equations with initial or boundary conditions. We refer to the list of literature at the end of the book. Every paper in this book contains a list of references, particularly with respect to hardware and software supporting Enclosure Methods. This book should be of interest to persons engaged in research and/or development work in the following domains:
(A) the reliable mathematical simulation of real world problems, (B) Computer Science, and (C) mathematical proofs involving, e.g., a quantitative verification of the mapping of a set into itself. The diagnostic power of numerical algorithms with automatic result verification or Enclosure Methods is of particular importance concerning (A). In fact, their total reliability removes the possibility of numerical errors as a cause of discrepancies between applications of a mathematical model and physical experiments. For a diagnostic application of this kind, it is irrelevant that the cost of Enclosure Methods occasionally exceeds that of corresponding numerical methods without a total error control. For the papers in this book, the background prerequisites are essentially: three years of Calculus, including Numerical Analysis, leading to the equivalent of a B.S. degree, and the corresponding level of knowledge and experience in the employment of computer systems.
Acknowledgements It is a pleasure to acknowledge with gratitude the support received for parts of the presented research from IBM Corporation and Deutsche Forschungsgemeinschaft (DFG). We appreciate the enthusiastic encouragement for this compilation of papers which we have received from Academic Press and Professor W. F. Ames, the editor of the book series. We are grateful to our present and former students and coworkers for their contributions to this collection. Finally, we wish to thank our colleague Walter Krämer for taking over the responsibility for the final layout of the book.
Dedication This book is dedicated to Professor Dr. J. Heinhold and Professor Dr. J. Weissinger on the occasion of their 80th birthdays. We are particularly grateful for many years of a fruitful and enjoyable collaboration.
E. Adams and U. Kulisch
Introduction E. Adams and U. Kulisch
1 On Scientific Computing with Automatic Result Verification
As stated in the Preface, the totally error-controlled computational determination of an enclosure or, synonymously, an inclusion or automatic result verification rests on contributions of the following four kinds: (A) suitable programming languages, software, and hardware for a reliable, fast, and skillful execution of all arithmetic operations;
(B) for problems in Euclidean spaces, algorithms translating into a set of machine numbers and, therefore, into (A);
(C) mathematical methods and corresponding algorithms relating problems in function spaces to suitable problems in Euclidean spaces;
(D) links relating (A), (B), and (C) such that the enclosure property and the guaranteed existence are valid for the unknown enclosed true solution.
The computer-basis (A) of numerical algorithms with automatic result verification or enclosure methods is addressed in Part I of this book and in Sections 2, 3, and 6 of this Introduction. As stated in the Preface, the development of algorithms concerning (B) has essentially been completed. Again we refer to the list of literature at the end of the book. Part II of this book is mainly concerned with the domains (C) and (D), for problems involving differentiations, or integrations, or differential equations. In a large number of case studies for problems in the domains (B) or (C), it has been shown that traditional "high-precision" numerical methods or computer systems may deliver quantitatively and even qualitatively incorrect results, unless there is a total error control. Since such failures are particularly important in the mathematical simulation of real world problems, Part III of this book presents case studies of this kind.
2 The Hardware-Basis for Automatic Result Verification
The speed of digital computers is ever increasing; consequently, less and less can be said about the accuracy of the computed results. In each of the preceding two or three decades, computational speed increased by a factor of roughly 100. For the present decade, another growth factor of approximately 1000 is expected. The transfer from electro-mechanical to electronic digital computers also involved a factor of approximately 1000. Since then, the speed of the most advanced computational systems has grown by a factor of 10⁸. But it is not just the speed of computers that has been tremendously increased; in quantity, too, the available computing power has grown to an extraordinarily large scale. In fact, millions of Personal Computers and Workstations have been sold during the last years. In view of this huge automatic computational basis, the question of the reliability of the delivered results arises. Already 30 years ago, important computations were carried out by means of a floating-point arithmetic of about 17 decimal mantissa digits. At present, the situation is not much different. With speeds of up to 10⁹ arithmetic operations per second, major computational projects are now carried out which need hours of CPU time. Obviously, a traditional estimate of the accumulated computational error is no longer possible. For this reason, the computational error is often totally ignored and not even mentioned in the bulk of the contemporary computer-oriented literature. Some users attempt a heuristic confirmation or try to justify the computed results by plausibility arguments. Other users are hardly aware of the fact that the computed results may be grossly incorrect. At present, the arithmetic of a digital computer is in general of "maximal accuracy". This means that the computed result of an individual arithmetic operation differs from the true one by at most one or one half unit of the last mantissa digit.
For sequences of such operations, however, the quality of the delivered result may deteriorate fairly rapidly. As an example, consider the following simple algorithm: z := a + b − a. If a and b are chosen stochastically from among the standard data formats single, double, or quadruple precision, the computed result for z is completely incorrect in 40% to 50% of all data choices. This means that the result of a computation consisting of only two arithmetic operations is incorrect with respect to all mantissa digits, and perhaps even the sign and the exponent. This kind of failure occurs for all usual arithmetic data formats; the usual arithmetic standards are not excluded. For the corresponding more general algorithm z := a + b + c, a completely incorrect result of the kind observed before is still obtained in an essential percentage of all cases. If this algorithm is executed for vectors a, b, c ∈ Rⁿ, then the percentage of incorrect results approaches 100 as n increases. Of course, more involved examples
may be chosen where the reason for an incorrect result is not as obvious. As an example, the reader may try to compute the scalar product of the following vectors:
a1 = 2.718281828      b1 = 1486.2497
a2 = −3.141592654     b2 = 878366.9879
a3 = 1.414213562      b3 = −22.37492
a4 = 0.5772156649     b4 = 4773714.647
a5 = 0.3010299957     b5 = 0.000185049
The correct value of the scalar product is −1.00657107 × 10⁻¹¹. It is a matter of fact that errors of this kind could easily be avoided if accumulations were executed in fixed-point instead of floating-point arithmetic. Computers have been invented to take over complicated jobs from man. The obvious discrepancy between the computational performance and the mastering of the computational error suggests transferring the job of error estimation and error propagation to the computer. By now, this has been achieved for practically all standard problems in Numerical Analysis and numerous classical applications. To achieve this, it is primarily necessary to make the computer arithmetically more powerful as compared with the usual floating-point set-up. In addition to the four basic arithmetic operations, the following generalized arithmetic operations should be carried out with maximal accuracy: the operations for real and complex numbers, vectors, and matrices as well as for the corresponding real and complex intervals, interval vectors, and interval matrices. In this context, the scalar product of two vectors is of basic importance. It is necessary to execute this operation with maximum accuracy; i.e., the computer must be able to determine sums of products of floating-point numbers such that the computed result differs from the correct result by at most one or one half unit of the last mantissa digit. This is then called the optimal scalar or dot product. For this particular task, fast circuits have recently been developed for all kinds of computers: Personal Computers, Workstations, Mainframes, and supercomputers. It is a surprising fringe benefit that these circuits generally need less time for the evaluation of the optimal scalar product than traditional circuits, which may deliver a totally incorrect result. It is an interesting fact that already 100 years ago computers were on the market which were able to deliver the correct result for scalar products, i.e., for sums of products of numbers. The technology then used may be considered a direct predecessor of the presently employed electronic circuits. However, since electronic realizations of the optimal dot product were not available in computers on the market during the last 40 years, algorithms exploiting this quality have not been systematically developed during this period of time. At present, this causes certain difficulties concerning the acceptance of this new tool. In fact, computers govern
man and his reasoning: what is not known cannot be used with sophistication or even with virtuosity. It is hoped that this book provides a bridge which contributes to removing these difficulties. Numerous examples in this book demonstrate that the new and more powerful computer arithmetic outlined in this Introduction allows the computational assessment of correct results in cases of failures of traditional floating-point arithmetic. However, the new and more powerful computer arithmetic has been developed for standard applications, not just for extreme problems. Traditionally, there is frequently a need for numerous test runs employing variations of the working precision or of the input data in order to estimate the reliability of the computed result. By means of the new technique, there is usually no such need. In fact, a single computer run in general yields a totally reliable computed result.
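Both failure modes discussed in this section are easy to reproduce. The following C++ sketch is our own illustration, not part of any XSC product: it shows the z := a + b − a cancellation, and it uses Neumaier's compensated summation as a pure-software stand-in for the optimal scalar product (a genuine XSC implementation accumulates in a long fixed-point register instead):

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// z := a + b - a: for a = 1e30, b = 1, the sum a + b rounds back to a,
// so the result is 0 instead of 1 -- wrong in every mantissa digit.
double cancellation_demo() {
    double a = 1e30, b = 1.0;
    return a + b - a;
}

// Naive floating-point dot product: rounds after every operation.
double dot_naive(const std::vector<double>& a, const std::vector<double>& b) {
    double s = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
    return s;
}

// Neumaier's compensated summation of the products: the variable c
// recovers the low-order bits that each rounded addition discards.
double dot_compensated(const std::vector<double>& a, const std::vector<double>& b) {
    double s = 0.0, c = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        double x = a[i] * b[i];
        double t = s + x;
        if (std::fabs(s) >= std::fabs(x))
            c += (s - t) + x;   // low-order part of x lost in the addition
        else
            c += (x - t) + s;   // low-order part of s lost in the addition
        s = t;
    }
    return s + c;
}
```

For the exactly representable data u = (10¹⁶, 1, −10¹⁶) and v = (1, 1, 1), dot_naive returns 0 while dot_compensated returns the exact value 1. For arbitrary data, compensated summation is still not exact — unlike the fixed-point accumulator described above — but it fails far more rarely.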
3 Connection with Programming Languages
When the new approach to computer arithmetic is translated into practice, serious difficulties arise immediately. Only the four basic arithmetic operations are made available by the usual programming languages such as ALGOL, FORTRAN, BASIC, PASCAL, MODULA, or C. Using these languages, it is practically impossible to address a maximally accurate scalar product or a maximally accurate matrix product directly! The usual simulation of these operations by floating-point arithmetic is responsible for the kind of errors indicated by the examples above. Suitable extensions of these languages provide the only meaningful way out of these difficulties. These extensions should provide all operations in the commonly used real and complex vector spaces and their interval correspondents by the usual mathematical operator symbols. Corresponding data types should be predefined in these languages. As an example, let the variables a, b, and c be real square matrices. Then a matrix product with maximal accuracy is simply addressed by c := a*b. An operator notation for practically all arithmetic operations simplifies programming in these language extensions significantly. Programs are much easier to read. They are much easier to debug and thus become much more reliable. But in particular, all these operations are maximally accurate. Already in the 1970s, a corresponding extension of the programming language PASCAL was developed and implemented in a cooperative project of the Universities of Karlsruhe and Kaiserslautern (U. Kulisch and H. W. Wippermann). The applicability of the new language, PASCAL-SC, however, was severely restricted by the fact that there were only two code-generating compilers available, for the Z-80 and MOTOROLA 68000 processors.
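The operator notation just described can be imitated in standard C++ by overloading. The toy Matrix type below is our own sketch, not XSC code: it accumulates each entry of c := a * b in a long double before a single final rounding, which is only a crude approximation of the maximally accurate product (that would require a fixed-point accumulator):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// A toy square-matrix type supporting the operator notation c := a * b.
struct Matrix {
    std::size_t n;              // dimension
    std::vector<double> e;      // row-major entries
    explicit Matrix(std::size_t dim) : n(dim), e(dim * dim, 0.0) {}
    double& operator()(std::size_t i, std::size_t j) { return e[i * n + j]; }
    double operator()(std::size_t i, std::size_t j) const { return e[i * n + j]; }
};

// Each entry is accumulated in extended precision and rounded only once
// at the end -- a crude software stand-in for the maximally accurate
// matrix product provided by the XSC languages.
Matrix operator*(const Matrix& a, const Matrix& b) {
    Matrix c(a.n);
    for (std::size_t i = 0; i < a.n; ++i)
        for (std::size_t j = 0; j < a.n; ++j) {
            long double acc = 0.0L;
            for (std::size_t k = 0; k < a.n; ++k)
                acc += static_cast<long double>(a(i, k)) * b(k, j);
            c(i, j) = static_cast<double>(acc);
        }
    return c;
}

// Small self-check: entry (i, j) of [[1,2],[3,4]] * [[5,6],[7,8]].
double product_entry(std::size_t i, std::size_t j) {
    Matrix a(2), b(2);
    a(0, 0) = 1; a(0, 1) = 2; a(1, 0) = 3; a(1, 1) = 4;
    b(0, 0) = 5; b(0, 1) = 6; b(1, 0) = 7; b(1, 1) = 8;
    Matrix c = a * b;
    return c(i, j);
}
```

With the overloaded operator, the statement c = a * b reads exactly like the mathematical formula, which is the point made above about readability and debuggability.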
In another cooperation, between IBM and the Institute for Applied Mathematics at the University of Karlsruhe, a corresponding extension of the programming language FORTRAN was developed and implemented in the 1980s. The result is now available as an IBM program product for systems of the /370 architecture. It is called ACRITH-XSC. The programming language ACRITH-XSC is upwardly compatible with FORTRAN 77; concerning numerous details, ACRITH-XSC is comparable with the new language FORTRAN 90. With respect to its arithmetic features, however, ACRITH-XSC exceeds the new FORTRAN by far. Parallel to the development of ACRITH-XSC, the programming language PASCAL-SC has been further developed at the Institute for Applied Mathematics at the University of Karlsruhe. Analogously to the IBM program product ACRITH-XSC, the new language is called PASCAL-XSC. For this language, there is a compiler available which translates into the programming language C; additionally, the extended run-time system for all arithmetic routines of PASCAL-XSC has been written in C. Consequently, PASCAL-XSC may be used on practically all computers possessing a C compiler. A Language Reference of PASCAL-XSC with numerous examples has been published by Springer-Verlag in German and in English. A translation into Russian is in preparation. Compilers for PASCAL-XSC are now available on the market. They can be purchased from Numerik Software GmbH, P. O. Box 2232, D-7570 Baden-Baden, Germany. The new computer language C++ possesses several features such as an operator concept, overloading of operators, generic names for functions, etc., which are also available in the programming languages PASCAL-SC, ACRITH-XSC, and PASCAL-XSC. With these features of C++, it is possible to provide arithmetic operations with maximal accuracy, standard functions for various types of arguments, etc. in C++ without developing a new compiler. Using these features of C++, a C++ module for numerical applications, as an extension of the programming language C, has been developed at the Institute for Applied Mathematics at the University of Karlsruhe. This programming environment is called C-XSC. In order to employ it, it is sufficient for a user to be familiar with the programming language C and this arithmetic-numerical module extension. A knowledge of C++ is not required. The module, however, can also be used in conjunction with C++ programs. A C-XSC program has to be translated by a C++ compiler. An identical run-time system is used by C-XSC and PASCAL-XSC. Therefore, identical results are obtained by corresponding programs written in these languages. In the present book, the contributions to Part I provide brief introductions to the programming languages PASCAL-XSC and ACRITH-XSC as well as to the programming environment C-XSC. The run-time system of PASCAL-XSC and C-XSC has been written in C. It provides the optimal real and complex arithmetic for vectors, matrices, and corresponding intervals. The execution of these routines would need significantly smaller computation
times if there were suitable support by the computer hardware. The fourth contribution in Part I of this book therefore addresses the computer manufacturers. It contains a proposal of what has to be done on the side of computer hardware in order to support a real and complex vector, matrix, and interval arithmetic! We would like to stress that a realization of this proposal would result in significantly faster computer systems. This is a consequence of the fact that a hardware realization of the optimal scalar product allows a significantly simpler and faster execution of this operation as compared with the traditional, inaccurate execution of the dot product by means of the given floating-point arithmetic. Through the use of the programming languages PASCAL-XSC and ACRITH-XSC in the last few years, and recently also of C-XSC, programming packages have been developed for practically all standard problems of Numerical Analysis where the computer verifies automatically that the computed results are correct. Thus, today there are problem-solving routines with automatic result verification available for systems of linear algebraic equations and the inversion of matrices with coefficients of the types real, complex, interval, and complex interval. Corresponding routines are also available for the computation of eigenvalues and eigenvectors of matrices, for the evaluation of polynomials, for roots of polynomials, for the accurate evaluation of arithmetic expressions, for the solution of nonlinear systems of algebraic equations, for numerical quadrature, for the solution of large classes of systems of linear or nonlinear ordinary differential equations with initial or boundary conditions, or corresponding eigenvalue problems, for certain systems of linear or nonlinear integral equations, etc. In most cases, the code verifies automatically the existence and the local uniqueness of the enclosed solution.
If no solution is found, a message is given to the user that the employed algorithm could not solve the problem. In the case of differential or integral equations, the routines deliver continuous upper and lower bounds for the solution of the problem. Solutions obtained by means of these methods have the quality of a theorem in the sense of pure mathematics. The computer proves these theorems or statements, by means of the clean arithmetic, in a mathematically correct and reproducible manner, often in very many tiny steps. In order to obtain these results, numerous well-known and powerful methods of Numerical Analysis are employed. Additionally, entirely new constructive approaches and tools had to be developed. Examples of techniques of this kind are presented in Part II of this book.
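The interval operations on which such verification routines rest can be sketched in a few lines of C++. The type below is our own miniature, not C-XSC: instead of true directed rounding (which hardware or the XSC run-time systems provide), every bound is simply widened by one unit in the last place with std::nextafter — wasteful, but sufficient to guarantee enclosure:

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// A minimal interval type [lo, hi].
struct Interval {
    double lo, hi;
};

// Outward rounding by one ulp in each direction: cruder than directed
// rounding, but the true result is guaranteed to stay enclosed.
inline Interval widen(double lo, double hi) {
    return { std::nextafter(lo, -std::numeric_limits<double>::infinity()),
             std::nextafter(hi,  std::numeric_limits<double>::infinity()) };
}

inline Interval operator+(Interval a, Interval b) {
    return widen(a.lo + b.lo, a.hi + b.hi);
}

inline Interval operator*(Interval a, Interval b) {
    // The product of two intervals is spanned by the four endpoint products.
    double p[4] = { a.lo * b.lo, a.lo * b.hi, a.hi * b.lo, a.hi * b.hi };
    double lo = p[0], hi = p[0];
    for (int i = 1; i < 4; ++i) {
        if (p[i] < lo) lo = p[i];
        if (p[i] > hi) hi = p[i];
    }
    return widen(lo, hi);
}

inline bool contains(Interval a, double x) { return a.lo <= x && x <= a.hi; }
```

For example, widen(0.1, 0.1) encloses the real number 0.1, which is not exactly representable in binary; the sum of three such intervals then provably encloses 0.3, even though the floating-point sum 0.1 + 0.1 + 0.1 differs from 0.3.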
4 Enclosure Methods, Predominantly for Problems in Function Spaces
For problems in Numerical Algebra or Optimization, residual correction or iterative refinement is the essential tool. Usually, these techniques are not applied to the
computed approximation but, rather, to its error. Because of cancellation, this well-known classical tool generally fails when a usual floating-point arithmetic is employed. In fact, these tools become practically useful in conjunction with an optimal scalar product and interval methods. An interval arithmetic with multiple precision is another very useful tool. The paper by R. Lohner in Part II of this book is concerned with a PASCAL-XSC routine for a version of this method that has been called Staggered Arithmetic by H. J. Stetter. By means of the optimal scalar product and PASCAL-XSC, this arithmetic can be implemented very elegantly by easily readable programs. Automatic differentiation is an important tool for the development of methods possessing a built-in verification property, mainly for problems relating to function spaces. The first paper in Part II of the book, by H. C. Fischer, presents an introduction to these techniques. For sufficiently smooth functions, derivatives and Taylor coefficients may hereby be determined very efficiently. They can be enclosed automatically by means of interval methods. In the numerical integration of functions by quadrature formulas, the remainder terms are usually neglected in the execution of traditional numerical methods. These terms are fully taken into account by all methods with automatic result verification. In the remainder term, the unknown argument is replaced by an interval containing this argument. For this purpose, interval bounds for the derivatives occurring in the remainder term are computed by interval techniques. Owing to the usual factors 1/n! and a high power of the step size, the remainder term, as the procedural error, can usually be made smaller than a required accuracy bound. By means of the remainder term and rounded interval arithmetic, both the procedural and the rounding errors are fully taken into account; this yields a mathematically guaranteed result.
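The idea behind automatic differentiation can be conveyed by a small example of forward-mode ("dual number") arithmetic in C++ — our own sketch, far simpler than the techniques in Fischer's paper: every value carries its derivative, and each overloaded operation propagates both by the chain rule.

```cpp
#include <cassert>
#include <cmath>

// Forward-mode automatic differentiation: v is the value, d the
// derivative with respect to the independent variable.
struct Dual {
    double v, d;
};

inline Dual variable(double x) { return {x, 1.0}; }  // d/dx x = 1
inline Dual constant(double c) { return {c, 0.0}; }  // d/dx c = 0

inline Dual operator+(Dual a, Dual b) { return {a.v + b.v, a.d + b.d}; }
inline Dual operator*(Dual a, Dual b) {              // product rule
    return {a.v * b.v, a.d * b.v + a.v * b.d};
}
inline Dual sin(Dual a) {                            // chain rule
    return {std::sin(a.v), std::cos(a.v) * a.d};
}

// f(x) = x * sin(x); evaluating f on a Dual yields f(x) and
// f'(x) = sin(x) + x * cos(x) simultaneously, with no symbolic
// manipulation and no divided differences.
inline Dual f(Dual x) { return x * sin(x); }
```

In the verified setting, the double components are replaced by intervals, so that the propagated derivatives and Taylor coefficients become guaranteed enclosures — exactly the combination described above.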
The second paper in Part II of the book, by R. Kelch, treats the problem of numerical quadrature in the spirit of extrapolation methods. For every element of the usual extrapolation table, a representation of the remainder term is determined by means of the Euler-Maclaurin summation formula. The integration begins with the evaluation of this remainder term for a particular element of the tableau. If the result is less than a prescribed error bound, the value of the integral is then determined by means of an interval scalar product of the vector collecting the values of the function at the nodes and the vector of the coefficients of the quadrature formula, which are stored in the memory of the computer. The third paper in Part II, by U. Storck, presents an outline of this method for the case of multidimensional integrals. Integral equations are treated in the fourth and the fifth paper in Part II of the book. The methods presented by W. Klein in the fifth paper have been applied successfully in the case of large systems of nonlinear integral equations. The kernel is replaced by a two-dimensional finite Taylor expansion with remainder term, yielding the sum of a degenerate kernel and a contracting remainder kernel. Both parts can then be treated by standard techniques, and enclosures are obtained by
interval methods. The fourth paper of Part II, by H. J. Dobner, shows among other things that suitable problems in partial differential equations can be reduced to integral equations and then treated with Enclosure Methods. Suitable problems with integral equations can also be represented equivalently by differential equations and then treated with Enclosure Methods and automatic verification of the result. Lohner's enclosure algorithms for systems of ordinary differential equations are now the most widely used tool for determining verified enclosures of the solution of initial or boundary value problems as well as of corresponding eigenvalue problems. For the case of initial value problems (IVPs), Lohner has developed a program package called AWA, which is available in the computer languages PASCAL-XSC, ACRITH-XSC, and C-XSC. In the case of IVPs, a suitable a priori choice of the step size h is difficult. As a supplement to AWA, the sixth paper in Part II of the book, by W. Rufeger and E. Adams, presents an automatic control of the step size h and its application to the Restricted Three Body Problem in Celestial Mechanics.
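Lohner's algorithm and AWA are far beyond the scope of a short example, but the basic enclosure idea can be shown on the toy problem y' = −y, y(0) = 1. One exact Euler step is y(t+h) = y(t) − h·y(t) + (h²/2)·y(ξ) with ξ between t and t+h; since this solution is positive and decreasing, the unknown y(ξ) lies in (0, y(t)], which bounds the remainder. The sketch below is our own, with floating-point rounding deliberately ignored, so only the truncation error is enclosed:

```cpp
#include <cassert>
#include <cmath>

// Lower and upper bounds enclosing y at the current time.
struct Bounds {
    double lo, hi;
};

// Interval Euler method for y' = -y, y(0) = 1: from y(t) in [lo, hi],
// the Taylor step y(t+h) = y(t)(1 - h) + (h*h/2) * y(xi) with
// 0 < y(xi) <= y(t) yields the next enclosure. Rounding errors are
// neglected here; a real enclosure code (e.g. AWA) bounds those too.
Bounds euler_enclosure(double h, int steps) {
    Bounds b = {1.0, 1.0};
    for (int i = 0; i < steps; ++i) {
        b.lo = b.lo * (1.0 - h);                // drop the positive remainder
        b.hi = b.hi * (1.0 - h + 0.5 * h * h);  // add its upper bound
    }
    return b;
}
```

With h = 0.01 and 100 steps, the resulting interval encloses the true value y(1) = e⁻¹ with a width below 10⁻²; halving h tightens the enclosure, which is the effect an automatic step size control such as that of Rufeger and Adams exploits systematically.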
5 On Applications, Predominantly in Mathematical Simulation
For at least 5000 years, computational tasks have arisen from practical needs. Being concerned with the computational aspects of mathematical simulations of real world problems, "Scientific Computing" is now considered the third major domain in the Sciences, in addition to "Theory" and "Experiment". In the absence of a total error control, "Scientific Computing" may deliver quantitatively and even qualitatively incorrect results. "Automatic Result Verification", the topic of this monograph, implies reliability of the numerical computation. In the case of errors of the hardware, the operating software, or the algorithm, the execution of a verification step usually fails. The development of methods, algorithms, etc. possessing this verifying property is irrelevant unless they are applicable to non-artificial mathematical simulations, i.e., models chosen not only by mathematicians but, rather, by physicists, engineers, etc. This applicability is demonstrated by the contributions in Part III of this book and by the extensive existing literature as outlined in an appendix. By means of the computer languages supporting the determination of verified enclosures, numerous mathematical models taken from outside of mathematics have been treated. In the majority of these case studies, failures of traditional numerical methods, which are not totally error-controlled, were observed. A large portion of these publications appeared in the proceedings of the annual conferences which, since 1980, have been jointly conducted by GAMM (Gesellschaft für Angewandte Mathematik und Mechanik) and IMACS (International Association for Mathematics and Computers in Simulation). These conferences were devoted to the areas of "Computer Arithmetic, Scientific Computation, and Automatic
Result Verification". Concerning mathematical simulations, the papers presented in the proceedings of these conferences belong to different domains. We mention a few of them. Mechanical Engineering: turbines at high numbers of revolutions, vibrations of gear drives or rotors, robotics, geometrical methods of CAD systems, geometrical modelling; Civil Engineering: nonlinear embedding of pillars, the centrally compressed buckling beam; Electrical Engineering: analysis of filters, optimization of VLSI circuits, simulation of semiconducting diodes; Fluid Mechanics: plasma flows in channels, infiltration of pollutants into groundwater, dimensioning of wastewater channels, analysis of magneto-hydrodynamic flows at large Hartmann numbers; Chemistry: the periodic solution of the Oregonator problem, numerical integration in chemical engineering; Physics: high temperature superconduction, optical properties of liquid crystals, expansions of solutions of the Schrödinger equation with respect to wave functions, rejection of certain computed approximations in Celestial Mechanics or concerning the Lorenz equations. Persons engaged in the mathematical simulation of a real world problem usually are not computer specialists. Consequently, they expect software not requiring special knowledge or experience. The first paper in Part III, by W. Krämer, demonstrates the power and elegance of the programming tools under discussion in this book. In this paper, PASCAL-XSC codes are presented for validating computations in cases where single precision is not sufficient or appropriate. The programs use available PASCAL-XSC modules for a long real or a long interval arithmetic. Because of the employed operator notation, the codes can be read just like a technical report. Each of the examples in this paper demonstrates the power of the available programming environment, particularly by means of a comparison with a corresponding PASCAL code.
Covering many pages, a code of this kind would be lengthy and almost unreadable; it therefore would be almost outside a user's control. The second paper in Part III, by B. Gross, addresses the classical control theory on the basis of systems of linear ordinary differential equations (ODEs), y' = Ay, where A represents a constant matrix. In a realistic mathematical simulation, intervals must be admitted for the values of at least some of the elements of A. Concerning applications, it is then desirable to obtain verified results on the asymptotic stability of y' = Ay and the corresponding degree of stability. For this purpose, four constructive methods are developed such that the interval matrix admitted for A is directly addressed. Consequently, there is no need to employ the characteristic polynomial and its roots, i.e., quantities which would have to be computed prior to a stability analysis. As a major example, the automated electro-mechanical track-control of a city bus is presented. In the third paper in Part III, by W. U. Klein, a discretization of a parameter-dependent boundary value problem with a nonlinear partial differential equation is investigated. For many years, solutions of the system of difference equations could not be approximated reliably for high values of this parameter when making use of standard numerical methods. As shown in the paper, an
employment of Enclosure Methods allows a reliable determination of the difference solutions, even in the case of high values of the parameter. The fourth and the fifth paper in Part III, by E. Adams, address mathematical simulations by means of ordinary differential equations (ODEs) and partial differential equations (PDEs). In particular, the following problem areas are discussed: (a) in conjunction with an automatic existence proof, a verified determination of true periodic solutions

- of systems of nonlinear autonomous ODEs, particularly the Lorenz equations in Dynamical Chaos, and
- of systems of linear ODEs with periodic coefficients, particularly the ones arising in the analysis of vibrations of gear drives;
(b) for discretizations of nonlinear ODEs, the existence and the actual occurrence of spurious difference solutions; while exactly satisfying the system of difference equations, they do not approximate any true solution of the ODEs, not even qualitatively; (c) for the Lorenz equations and the ones of the Restricted Three Body Problem in Celestial Mechanics, the occurrence of diverting difference approximations; in the course of time, these computed approximations of difference solutions are close to grossly different true solutions of the ODEs. In Adams' chapters in Part III, the overruling topic is the unreliability of difference methods as has been shown by means of Enclosure Methods. The severity of this problem area can be characterized by the title of a paper which will appear in the Journal of Computational Physics: "Computational Chaos May be Due to a Single Local Error". The chapters by Adams highlight the need for further development

- of hardware and software supporting Enclosure Methods and
- of efficient mathematical methods for the determination of enclosures of true solutions of PDEs that cannot be enclosed through the available techniques or through a preliminary "approximation" of the PDEs by systems of ODEs, e.g., by means of finite elements.
The sixth paper in Part III, by Ehret, Schütz, and Winter, addresses a problem involving quantitative work concerning the Schrödinger equations. Just as in the third paper in this part of the book, applications of traditional numerical methods have failed conspicuously; however, reliable results were determined by means of verifying Enclosure Methods.
6 Concluding Remarks Concerning Computer Arithmetic Supporting Automatic Result Verification
In the opinion of the authors, the properties of computer arithmetic should be defined within the programming language. When addressing an arithmetic operation, the user should be fully aware of the computer's and the compiler's response. Only thus does the computer become a reliable mathematical tool. When the first computer languages were created in the 1950s, there were no sufficiently simple means available for the definition of a computer arithmetic. Consequently, this issue was ignored and the implementation of a computer arithmetic was left to the manufacturer. A jungle of realizations was the consequence. In fact, there appeared specialists and even schools of people ridiculing arithmetic shortcomings of individual computer systems. Basically, a search of this kind is irrelevant and idle. Rather, it should be attempted to find better and correct approaches to this problem area. This has been done by now, and computer arithmetic can be defined as follows: if the data format and a data type are given, then every arithmetic operation which is provided by the computer hardware or addressable by the programming language must fulfil the properties of a semimorphism. All arithmetic operations thus defined then possess maximum accuracy and other desirable properties; i.e., the computed result differs from the correct one by at most one rounding. Usually, in case of floating-point operations, there is only a marginal difference between a traditional implementation of the arithmetic and one governed by the rigorous mathematical principles of a semimorphism. Consequently, the implementation of the properties of a semimorphism is not much more complicated. Rather, if vector and matrix operations are already appropriately taken into account during the design of the hardware arithmetic unit, the computer becomes considerably faster. Basically, the arithmetic standards 754 and 854 of IEEE, ANSI, and ISO are a step in the desired direction.
The arithmetic operations thus provided realize a semimorphic floating-point and interval arithmetic. It is regrettable, however, that no language support has been made available allowing an easy use of interval arithmetic. This is all the more so since prototypes concerning the arithmetic hardware as well as the language support were already available at Karlsruhe 25 years ago. In this context, essential progress is made available by the programming languages PASCAL-XSC, ACRITH-XSC, and C-XSC, which have been mentioned in Section 3 of this Introduction and which will be further characterized in Part I of the book. They provide a universal computer arithmetic. Additionally, they allow a simple handling of semimorph operations by means of the usual operator symbols that are well known in mathematics. This is still true in product spaces like intervals and complex numbers, and for vectors and matrices of the types real, complex, interval, and complex interval. Regrettably, the IEEE standards 754 and
854 referred to before do not support the operators in the product spaces which have just been addressed. Therefore, it is very hard to convince manufacturers that more hardware support for arithmetic is needed than just the IEEE floating-point arithmetic. In particular, a vector processor should, of course, provide semimorphic vector and matrix operations of highest accuracy. A software simulation of semimorphic vector and matrix operations comes at the expense of speed, whereas a gain in computing speed is to be expected in the case of support by hardware. This problem is aggravated by the fact that processors implementing the IEEE arithmetic standard 754 do not deliver products of double length; consequently, these have to be simulated by means of software, with a resulting considerable loss of speed. Products of double length are indispensable and essential for the semimorph determination of products of vectors and matrices. The fourth contribution in Part I of the book presents a proposal for a supplement of existing computer arithmetics or arithmetic standards supporting semimorph computer arithmetic for vectors and matrices. Detailed investigations reveal that the additional costs are small, provided there is a homogeneous design of the overall arithmetic unit. The paper by A. Knöfel in Part III of the book studies more closely the hardware realization of such an arithmetic unit.
A hardware support of the optimal dot product allows a simple realization of a long real arithmetic for various basic types, as is shown by the seventh paper in Part II of the book. Numerous codes concerning Computer Algebra could be considerably accelerated provided a hardware support of the optimal scalar product were available. The first paper in Part III demonstrates the then possible straightforward coding and execution, even of very complicated algorithms. Consequently, we strongly recommend a hardware support of the arithmetic proposal in the fourth paper in Part I as well as a revision or an extension of the existing standards. The authors hope that this book will help to convince users and manufacturers of the importance of progress needed in the domain of computer arithmetic, concerning both hardware and standards. Basically, it is not acceptable that maximum accuracy is required only in the case of the basic four arithmetic floating-point operations for operands of type real while this is not so with respect to complex numbers or vectors or matrices. In the case of a homogeneous design of the arithmetic unit, additional hardware costs are small. But it makes an essential difference whether a correct result of an operation is always delivered or only frequently. In the latter case, a user has to think about and to study the achieved accuracy for every individual operation, and this perhaps a million times every second! In the first case, this is wholly unnecessary. Karlsruhe, May 1992
E. Adams and U. Kulisch
I. Language and Programming Support for Verified Scientific Computation
PASCAL-XSC: New Concepts for Scientific Computation and Numerical Data Processing

R. Hammer, M. Neaga, and D. Ratz
The new programming language PASCAL-XSC is presented, with an emphasis on its new concepts for scientific computation and numerical data processing. PASCAL-XSC is a universal PASCAL extension with extensive standard modules for scientific computation. It is available for personal computers, workstations, mainframes, and supercomputers by means of an implementation in C. By using the mathematical modules of PASCAL-XSC, numerical algorithms which deliver highly accurate and automatically verified results can be programmed easily. PASCAL-XSC simplifies the design of programs in engineering and scientific computation by modular program structure, user-defined operators, overloading of functions, procedures, and operators, functions and operators with arbitrary result type, dynamic arrays, arithmetic standard modules for additional numerical data types with operators of highest accuracy, standard functions of high accuracy, and exact evaluation of expressions. The most important advantage of the new language is that programs written in
PASCAL-XSC are easily readable. This is due to the fact that all operations, even
those in the higher mathematical spaces, have been realized as operators and can be used in conventional mathematical notation.
In addition to PASCAL-XSC a large number of numerical problem-solving routines with automatic result verification are available. The language supports the development of such routines.
1 Introduction
These days, the elementary arithmetic operations on electronic computers are usually approximated by floating-point operations of highest accuracy. In particular, for any choice of operands this means that the computed result coincides with the rounded exact result of the operation. See the IEEE Arithmetic Standard [3] as an example. This arithmetic standard also requires the four basic arithmetic operations +, -, *, and / with directed roundings. A large number of processors already on the market provide these operations. So far, however, no common programming language allows access to them.
On the other hand, there has been a noticeable shift in scientific computation from general purpose computers to vector and parallel computers. These so-called
super-computers provide additional arithmetic operations such as "multiply and add" and "accumulate" or "multiply and accumulate" (see [10]). These hardware operations should always deliver a result of highest accuracy, but as yet, no processor which fulfills this requirement is available. In some cases, the results of numerical algorithms computed on vector computers are totally different from the results computed on a scalar processor (see [13],[31]). Continuous efforts have been made to enhance the power of programming languages. New powerful languages such as ADA have been designed, and enhancement of existing languages such as FORTRAN is in constant progress. However, since these languages still lack a precise definition of their arithmetic, the same program may produce different results on different processors. PASCAL-XSC is the result of a long-term venture by a team of scientists to produce a powerful tool for solving scientific problems. The mathematical definition of the arithmetic is an intrinsic part of the language, including optimal arithmetic operations with directed roundings which are directly accessible in the language. Further arithmetic operations for intervals and complex numbers and even vector/matrix operations provided by precompiled arithmetic modules are defined with maximum accuracy according to the rules of semimorphism (see [25]).
2 The Language PASCAL-XSC
PASCAL-XSC is an extension of the programming language PASCAL for Scientific Computation. A first approach to such an extension (PASCAL-SC) has been available since 1980. The specification of the extensions has been continuously improved in recent years by means of essential language concepts, and the new language PASCAL-XSC [20],[21] was developed. It is now available for personal computers, workstations, mainframes, and supercomputers by means of an implementation in C. PASCAL-XSC contains the following features:

- Standard PASCAL
- Universal operator concept (user-defined operators)
- Functions and operators with arbitrary result type
- Overloading of procedures, functions, and operators
- Module concept
- Dynamic arrays
- Access to subarrays
- String concept
- Controlled rounding
- Optimal (exact) scalar product
- Standard type dotprecision (a fixed-point format to cover the whole range of floating-point products)
- Additional arithmetic standard types such as complex, interval, rvector, rmatrix, etc.
- Highly accurate arithmetic for all standard types
- Highly accurate standard functions
- Exact evaluation of expressions (#-expressions)
The new language features, developed as an extension of PASCAL, will be discussed in the following sections.
2.1 Standard Data Types, Predefined Operators, and Functions
In addition to the data types of standard PASCAL, the following numerical data types are available in PASCAL-XSC:

  complex    interval    cinterval
  rvector    cvector     ivector     civector
  rmatrix    cmatrix     imatrix     cimatrix

where the prefix letters r, i, and c are abbreviations for real, interval, and complex. So cinterval means complex interval and, for example, cimatrix denotes complex interval matrices, whereas rvector specifies real vectors. The vector and matrix types are defined as dynamic arrays and can be used with arbitrary index ranges.
A large number of operators are predefined for these types in the arithmetic modules of PASCAL-XSC (see section 2.8). All of these operators deliver results with maximum accuracy. In Table 1 the 29 predefined standard operators of PASCAL-XSC are listed according to priority.

  Type            Operators                                  Priority
  monadic         monadic +, monadic -, not                  3 (highest)
  multiplicative  *, *<, *>, /, /<, />, **, and, div, mod    2
  additive        +, +<, +>, -, -<, ->, +*, or               1
  relational      =, <>, <=, <, >=, >, ><, in                0 (lowest)

Table 1: Precedence of the Built-in Operators
Compared to standard PASCAL, there are 11 new operator symbols. These are the operators o< and o>, o ∈ {+, -, *, /}, for operations with downwardly and upwardly directed rounding, and the operators **, +*, and >< needed in interval computations for the intersection, the convex hull, and the disconnectivity test. Tables 2 and 3 show all predefined arithmetic and relational operators in connection with the possible combinations of operand types.
Table 2 combines each left operand type (integer/real/complex, interval/cinterval, rvector/cvector, ivector/civector, rmatrix/cmatrix, imatrix/cimatrix) with each right operand type and lists the operators defined for that combination: the monadic operators + and - (with no left operand), the basic operations o and their directed-rounding variants o<, o> for o ∈ {+, -, *, /}, the interval hull +*, and the interval intersection **. Here * denotes the scalar or matrix product when vector or matrix operands are involved.

  +* : interval hull
  ** : interval intersection

Table 2: Predefined Arithmetical Operators
Compared with standard PASCAL, PASCAL-XSC provides an extended set of mathematical standard functions (see Table 4). These functions are available for the types real, complex, interval, and cinterval with a generic name and deliver a result of maximum accuracy. The functions for the types complex, interval, and cinterval are provided in the arithmetic modules of PASCAL-XSC.
Table 3 combines the operand types in the same way and lists the relational operators defined for each combination: =, <>, <=, <, >=, and > for the numerical types, and in and >< for combinations involving intervals. For interval operands, <= and < denote the "subset" relations, and >= and > denote the "superset" relations.

  >< : test on disconnectivity for intervals
  in : test on membership of a point in an interval, or test on strict
       enclosure of an interval in the interior of an interval

Table 3: Predefined Relational Operators
Table 4 lists the mathematical standard functions together with their generic names and argument types; among them are, for example, abs (absolute value), arccos (arc cosine), arccot (arc cotangent), arcosh (inverse hyperbolic cosine), and exp2 (power function, base 2).

Table 4: Mathematical Standard Functions (* includes the types integer, real, complex, interval, and cinterval)

Besides the mathematical standard functions, PASCAL-XSC provides the necessary type transfer functions intval, inf, sup, compl, re, and im for conversion between the numerical data types (for scalar and array types).
2.2 The General Operator Concept
The advantages of a general operator concept can be demonstrated by a simple example of interval addition. In the absence of user-defined operators, there are two ways to implement the addition of two variables of type interval declared by

type interval = record inf, sup: real end;

One can use a procedure declaration
procedure intadd (a,b: interval; var c: interval);
begin
  c.inf := a.inf +< b.inf;   { addition with downwardly directed rounding }
  c.sup := a.sup +> b.sup    { addition with upwardly directed rounding }
end;
  mathematical notation     corresponding program statements
  z = a + b + c + d         intadd(a,b,z); intadd(z,c,z); intadd(z,d,z);
or a function declaration (only possible in PASCAL-XSC, not in standard PASCAL)

function intadd (a,b: interval): interval;
begin
  intadd.inf := a.inf +< b.inf;
  intadd.sup := a.sup +> b.sup
end;
  mathematical notation     corresponding program statement
  z = a + b + c + d         z := intadd(intadd(intadd(a,b),c),d);
In both cases the description of the mathematical formulas looks rather complicated. By comparison, if one implements an operator in PASCAL-XSC
operator + (a,b: interval) intadd: interval;
begin
  intadd.inf := a.inf +< b.inf;
  intadd.sup := a.sup +> b.sup
end;
  mathematical notation     corresponding program statement
  z = a + b + c + d         z := a + b + c + d;
then a multiple addition of intervals is described in the traditional mathematical notation. Besides the possibility of overloading operator symbols, one is allowed to use named operators. Such operators must be preceded by a priority declaration. There exist four different levels of priority, each represented by its own symbol:

  monadic        : ↑    level 3 (highest priority)
  multiplicative : *    level 2
  additive       : +    level 1
  relational     : =    level 0
For example, an operator choose for the calculation of the binomial coefficient C(n,k) can be defined in the following manner:

priority choose = *;    {priority declaration}

operator choose (n,k: integer) binomial: integer;
var i,r : integer;
begin
  if k > n div 2 then k := n-k;
  r := 1;
  for i := 1 to k do
    r := r * (n - i + 1) div i;
  binomial := r;
end;
  mathematical notation     corresponding program statement
  c = C(n,k)                c := n choose k;
The operator concept realized in PASCAL-XSC offers the possibilities of

- defining an arbitrary number of operators,
- overloading operator symbols or operator names arbitrarily many times, and
- implementing recursively defined operators.

The identification of the suitable operator depends on both the number and the type of the operands according to the following weighting rule: if the actual list of parameters matches the formal list of parameters of two different operators, then the one which is chosen has the first "better matching" parameter. "Better matching" means that the types of the operands must be consistent and not only conforming.
PASCAL-XSC - New Concepts for Scientific Computation
23
Example:

operator +* (a: integer; b: real) irres: real;
operator +* (a: real; b: integer) rires: real;

var x    : integer;
    y, z : real;

z := x +* y;   { 1. operator }
z := y +* x;   { 2. operator }
z := x +* x;   { 1. operator }
z := y +* y;   { impossible ! }
Also, PASCAL-XSC offers the possibility to overload the assignment operator :=. Due to this, the mathematical notation may also be used for assignments:
Example:

var c : complex;
    r : real;

operator := (var c: complex; r: real);
begin
  c.re := r;
  c.im := 0;
end;
...
r := 1.5;
c := r;   { complex number with real part 1.5 and imaginary part 0 }
2.3 Overloading of Subroutines
Standard PASCAL provides the mathematical standard functions

  sin, cos, arctan, exp, ln, sqr, and sqrt
for numbers of type real only. In order to implement the sine function for interval arguments, a new function name like isin(...) must be used, because the overloading of the standard function name sin is not allowed in standard PASCAL. By contrast, PASCAL-XSC allows overloading of function and procedure names, whereby a generic symbol concept is introduced into the language. So the symbols sin, cos, arctan, exp, ln, sqr, and sqrt can be used not only for numbers of type real, but also for intervals, complex numbers, and other mathematical spaces. To distinguish between overloaded functions or procedures with the same name, the number, type, and weighting of their arguments are used, similar to the method for operators. The type of the result, however, is not used.
Example:

procedure rotate (var a,b: real);
procedure rotate (var a,b,c: complex);
procedure rotate (var a,b,c: interval);

The overloading concept also applies to the standard procedures read and write in a slightly modified way. The first parameter of a newly declared input/output procedure must be a var-parameter of file type, and the second parameter represents the quantity that is to be input or output. All following parameters are interpreted as format specifications.
Example:

procedure write (var f: text; c: complex; w: integer);
begin
  write (f, '(', c.re : w, ',', c.im : w, ')');
end;

When calling an overloaded input/output procedure, the file parameter may be omitted, corresponding to a call with the standard files input or output. The format parameters must be introduced and separated by colons. Moreover, several input or output statements can be combined into a single statement just as in standard PASCAL.
Example:

var r : real;
    c : complex;
...
write (r : 10, c : 5, r/5);
2.4 The Module Concept
Standard PASCAL basically assumes that a program consists of a single program text which must be prepared completely before it can be compiled and executed. In many cases, it is more convenient to prepare a program in several parts, called modules, which can then be developed and compiled independently of each other. Moreover, several other programs may use the components of a module without their being copied into the source code and recompiled. For this purpose, a module concept has been introduced in PASCAL-XSC. This new concept offers the possibilities of

- modular programming,
- syntax check and semantic analysis beyond the bounds of modules, and
- implementation of arithmetic packages as standard modules.

Three new keywords have been added to the language:

  module : starts a new module
  global : indicates items to be passed to the outside
  use    : indicates imported modules
A module is introduced by the keyword module followed by a name and a semicolon. The body is built up quite similarly to that of a normal program with the exception that the word symbol global can be used directly in front of the keywords const, type, var, procedure, function, and operator and directly after use and the equality sign in type declarations. Thus it is possible to declare private types as well as non-private types. The structure of a private type is not known outside the declaration module and can only be influenced by subroutine calls. If, for example, the internal structure as well as the name of a type is to be made global, then the word symbol global must be repeated after the equality sign. By means of the declaration
global type complex = global record re, im : real end;

the type complex and its internal structure as a record with components re and im is made global.
A private type complex could be declared by

global type complex = record re, im : real end;

The user who has imported a module with this private definition cannot refer to the record components, because the structure of the type is hidden inside the module.
A module is built up according to the following pattern:

module m1;
use < other modules >;
< global and local declarations >
begin
  < initialization of the module >
end.

For importing modules with use or use global, the following transitivity rules hold:

  M1 use M2  and  M2 use global M3   =>   M1 use M3,

but

  M1 use M2  and  M2 use M3         =/=>  M1 use M3.
Example: Let a module hierarchy be built up as follows: the main program uses the modules A, B, and C; module A uses the modules X and Y; and modules B and C use the module STANDARDS. All global objects of the modules A, B, and C are visible in the main program unit, but there is no access to the global objects of X, Y, and STANDARDS. There are two possibilities to make them visible in the main program, too:

1. to write
     use X, Y, STANDARDS
   in the main program, or

2. to write
     use global X, Y
   in module A and
     use global STANDARDS
   in module B or C.
2.5 Dynamic Arrays
In standard PASCAL there is no way to declare dynamic types or variables. For instance, program packages with vector and matrix operations can be implemented with only fixed (maximum) dimension. For this reason, only a part of the allocated memory is used if the user wants to solve problems of lower dimension only. The concept of dynamic arrays removes this limitation. In particular, the new concept can be described by the following characteristics:

- Dynamics within procedures and functions
- Automatic allocation and deallocation of local dynamic variables
- Economical employment of storage space
- Row access and column access to dynamic arrays
- Compatibility of static and dynamic arrays
Dynamic arrays must be marked with the word symbol dynamic. The great disadvantage of the conformant array schemes available in standard PASCAL is that they can only be used for parameters and not for variables or function results. So, this standard feature is not fully dynamic. In PASCAL-XSC, dynamic and static arrays can be used in the same manner. At the moment, dynamic arrays may not be components of other data structures. The syntactical meaning of this is that the word symbol dynamic may only be used directly following the equality sign in a type definition or directly following the colon in a variable declaration. For instance, dynamic arrays may not be record components.
A two-dimensional array type can be declared in the following manner:

type matrix = dynamic array[*,*] of real;

It is also possible to define different dynamic types with corresponding syntactical structures. For example, it might be useful in some situations to identify the coefficients of a polynomial with the components of a vector or vice versa. Since PASCAL is strictly a type-oriented language, such structurally equivalent arrays may only be combined if their types have been previously adapted. The following example shows the definition of a polynomial and of a vector type (note that the type adaptation functions polynomial(...) and vector(...) are defined implicitly):

type vector     = dynamic array[*] of real;
type polynomial = dynamic array[*] of real;

operator + (a,b: vector) res: vector[lbound(a)..ubound(a)];

var v : vector[1..n];
    p : polynomial[0..n-1];

v := vector(p);
p := polynomial(v);
v := v + v;
v := vector(p) + v;   { but not v := p + v; }
Access to the lower and upper index limits is made possible by the new standard functions lbound(...) and ubound(...), which are available with an optional argument for the index field of the designated dynamic variable. Employing these functions, the operator mentioned above can be written as

operator + (a,b: vector) res: vector[lbound(a)..ubound(a)];
var i : integer;
begin
  for i := lbound(a) to ubound(a) do
    res[i] := a[i] + b[lbound(b) + i - lbound(a)]
end;

Introduction of dynamic types requires an extension of the compatibility prerequisites. Just as in standard PASCAL, two array types are not compatible unless they are of the same type. Consequently, a dynamic array type is not compatible with a static type. In PASCAL-XSC, value assignments are always possible in the cases listed in Table 5.
  Type of Left Side    Type of Right Side     Assignment Permitted
  anonymous dynamic    arbitrary array type   if structurally equivalent
  known dynamic        known dynamic          if types are the same
  anonymous static     arbitrary array type   if structurally equivalent
  known static         known static           if types are the same

Table 5: Assignment Compatibilities

In the remaining cases, an assignment is possible only for an equivalent qualification of the right side (see [20] or [21] for details). In addition to access to each component variable, PASCAL-XSC offers the possibility of access to entire subarrays. If a component variable contains an * instead of an index expression, it refers to the subarray with the entire index range in the corresponding dimension; e.g., via m[*,j] the j-th column of a two-dimensional array m is accessed. The following example demonstrates access to rows or columns of dynamic arrays:
type vector = dynamic array[*] of real;
type matrix = dynamic array[*] of vector;

var v : vector[1..n];
    m : matrix[1..n,1..n];

v := m[i];
m[i] := vector(m[*,j]);

In the first assignment it is not necessary to use a type adaptation function, since both the left and the right side are of known dynamic type. A different case is demonstrated in the second assignment. The left-hand side is of known dynamic type, but the right-hand side is of anonymous dynamic type, so it is necessary to use the intrinsic adaptation function vector(...).
A PASCAL-XSC program which uses dynamic arrays should be built up according to the following scheme:
    program dynprog (input,output);
    type vector = dynamic array[*] of real;
    < different dynamic declarations >
    var n : integer;

    procedure main (dim: integer);
    var a,b,c : vector[1..dim];
    begin
      < I/O depending on the value of dim >
      c := a + b;
      ...
    end;

    begin {main program}
      ...
      main(n);
    end.
Only the original main program has to be framed by a procedure (here: main), which is called with the dimension of the dynamic arrays as a transfer parameter.
2.6 Accurate Expressions
The implementation of enclosure algorithms with automatic result verification or validation (see [17], [24], [28], [33]) makes extensive use of the accurate evaluation of dot products with the property (see [25]) that the exact sum a1*b1 + a2*b2 + ... + an*bn is accumulated without error and subjected to a single rounding only.
To evaluate this kind of expression the new data type dotprecision was introduced. This data type accommodates the full floating-point range with double exponents (see [25], [24]). Based upon this type, so-called accurate expressions (#-expressions) can be formulated by an accurate symbol (#, #*, #<, #>, or ##) followed by an exact expression enclosed in parentheses. The exact expression must have the form of a dot product expression and is evaluated without any rounding error. The following standard operations are available for dotprecision:

- conversion of real and integer values to dotprecision (#)
- rounding of dotprecision values to real; in particular: downwardly directed rounding (#<), upwardly directed rounding (#>), and rounding to the nearest (#*)
- rounding of a dotprecision expression to the smallest enclosing interval (##)
- addition of a real number or the product of two real numbers to a variable of type dotprecision
- addition of a dot product to a variable of type dotprecision
- addition and subtraction of dotprecision numbers
- monadic minus of a dotprecision number
- the standard function sign, which returns -1, 0, or +1, depending on the sign of the dotprecision number
To obtain the unrounded or correctly rounded result of a dot product expression, the user needs to parenthesize the expression and precede it by the symbol #, which may optionally be followed by a symbol for the rounding mode. Table 6 shows the possible rounding modes with respect to the dot product expression form (see the appendix for details).
    Symbol | Expression Form           | Rounding Mode
    -------+---------------------------+----------------------------
    #*     | scalar, vector, or matrix | nearest
    #<     | scalar, vector, or matrix | downwards
    #>     | scalar, vector, or matrix | upwards
    ##     | scalar, vector, or matrix | smallest enclosing interval
    #      | scalar only               | exact, no rounding
Table 6: Rounding Modes for Accurate Expressions

In practice, dot product expressions may contain a large number of terms, making an explicit notation very cumbersome. To alleviate this difficulty in mathematics, the symbol Σ is used. If, for instance, A and B are n-dimensional matrices, then the evaluation of

    Σ (k=1..n) A[i,k] * B[k,j]

represents a dot product expression. PASCAL-XSC provides the equivalent shorthand notation sum for this purpose. The corresponding PASCAL-XSC statement for this expression is

    D := #(for k:=1 to n sum (A[i,k]*B[k,j]))

where D is a dotprecision variable. Dot product expressions or accurate expressions are used mainly in computing a defect (or residual). In the case of a linear system Ax = b, with A an n×n real matrix and x, b real n-vectors, Ay ≈ b is considered as an example. Then an enclosure of the defect is given by ◊(b - Ay), where ◊ denotes rounding to the smallest enclosing interval. In PASCAL-XSC this can be realized by means of

    ##(b - A*y);

then there is only one interval rounding operation per component. To get verified enclosures for linear systems of equations it is necessary to evaluate the defect expression ◊(E - RA), where R ≈ A^(-1) and E is the identity matrix. In PASCAL-XSC this expression can be programmed as

    ##(id(A) - R*A);

where an interval matrix is computed with only one rounding operation per component. The function id(...) is part of the module for real matrix/vector arithmetic, generating an identity matrix of appropriate dimension according to the shape of A (see Section 2.8).
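The effect of ##(b - A*y) — each component of the residual is accumulated exactly and rounded outward only once — can be imitated in an ordinary language by using exact rational arithmetic in place of dotprecision. A hedged Python sketch (fractions.Fraction stands in for dotprecision; the function name is mine, and this illustrates the single-rounding idea, not the actual XSC runtime):

```python
from fractions import Fraction
import math

def residual_enclosure(A, y, b):
    """Enclose b - A*y componentwise.  Each component is accumulated
    exactly in rational arithmetic (standing in for dotprecision) and
    then rounded outward exactly once, mirroring the single interval
    rounding per component of ##(b - A*y)."""
    n = len(b)
    enc = []
    for i in range(n):
        # exact accumulation: no rounding error occurs inside the sum
        s = Fraction(b[i]) - sum(Fraction(A[i][k]) * Fraction(y[k])
                                 for k in range(n))
        lo = hi = float(s)            # nearest double to the exact value
        if Fraction(lo) > s:          # push the bounds outward if needed
            lo = math.nextafter(lo, -math.inf)
        if Fraction(hi) < s:
            hi = math.nextafter(hi, math.inf)
        enc.append((lo, hi))
    return enc

# each resulting interval has width of at most one ulp and is
# guaranteed to contain the exact residual component
enc = residual_enclosure([[1.0, 0.1], [0.1, 1.0]], [0.3, 0.7], [0.4, 0.8])
assert all(lo <= hi for lo, hi in enc)
```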
2.7 The String Concept

The tools provided for handling strings in standard PASCAL do not enable convenient text processing. For this reason, a string concept was integrated into the language definition of PASCAL-XSC which permits comfortable handling of textual information and even symbolic computation. With this new data type string, the user can work with strings of up to 255 characters. In the declaration part the user can specify a maximum string length less than 255. Thus a string s declared by

    var s: string[40];

can be up to 40 characters long. The following standard operations are available:

- concatenation:                operator + (a,b: string) conc: string;
- actual length:                function length(s: string): integer;
- conversion string -> real:    function rval(s: string): real;
- conversion string -> integer: function ival(s: string): integer;
- conversion real -> string:    function image(r: real; width,fracs,round: integer): string;
- conversion integer -> string: function image(i,len: integer): string;
- extraction of substrings:     function substring(s: string; i,j: integer): string;
- position of first appearance: function pos(sub,s: string): integer;
- relational operators:         <=, <, >=, >, <>, =, and in
2.8 Standard Modules
The following standard modules are available:

- interval arithmetic (I_ARI)
- complex arithmetic (C_ARI)
- complex interval arithmetic (CI_ARI)
- real matrix/vector arithmetic (MV_ARI)
- interval matrix/vector arithmetic (MVI_ARI)
- complex matrix/vector arithmetic (MVC_ARI)
- complex interval matrix/vector arithmetic (MVCI_ARI)
These modules may be incorporated via the use-statement described in section 2.4. As an example, Table 7 exhibits the operators provided by the module for interval matrix/vector arithmetic.
Table 7: Predefined Arithmetical and Relational Operators of the Module MVI_ARI

In addition to these operators, the module MVI_ARI provides the following generically named standard operators, functions, and procedures: intval, inf, sup, diam, mid, blow, transp, null, id, read, and write. The function intval is used to generate interval vectors and matrices, whereas inf and sup are selection functions for the infimum and supremum of an interval object. The diameter and the midpoint of interval vectors and matrices can be computed by diam and mid, blow yields an interval inflation, and transp delivers the transpose of a matrix.
Zero vectors and matrices are generated by the function null, while id returns an identity matrix of appropriate shape. Finally, there are the generic input/output procedures read and write, which may be used in connection with all matrix/vector data types defined in the modules mentioned above.
2.9 Problem-Solving Routines
PASCAL-XSC routines for solving common numerical problems have been implemented. The applied methods compute a highly accurate enclosure of the true solution of the problem and, at the same time, prove the existence and the uniqueness of the solution in the given interval. The advantages of these new routines are the following:

- The solution is computed with maximum or high, but always controlled, accuracy, even in many ill-conditioned cases.
- The correctness of the result is automatically verified, i.e. an enclosing set is computed which guarantees existence and uniqueness of the exact solution contained in this set.
- If no solution exists or if the problem is extremely ill-conditioned, an error message is indicated.
In particular, PASCAL-XSC routines cover the following subjects:

- linear systems of equations
  - full systems (real, complex, interval, cinterval)
  - matrix inversion (real, complex, interval, cinterval)
  - least squares problems (real, complex, interval, cinterval)
  - computation of pseudo inverses (real, complex, interval, cinterval)
  - band matrices (real)
  - sparse matrices (real)
- polynomial evaluation
  - in one variable (real, complex, interval, cinterval)
  - in several variables (real)
- zeros of polynomials (real, complex, interval, cinterval)
- eigenvalues and eigenvectors
  - symmetric matrices (real)
  - arbitrary matrices (real, complex, interval, cinterval)
- initial and boundary value problems of ordinary differential equations
  - linear
  - nonlinear
- evaluation of arithmetic expressions
- nonlinear systems of equations
- numerical quadrature
- integral equations
- automatic differentiation
- optimization

3 The Implementation of PASCAL-XSC
Since 1976, a PASCAL extension for scientific computation has been defined and developed at the Institute for Applied Mathematics at the University of Karlsruhe. The PASCAL-SC compiler has been implemented on several computers (Z80, 8088, and 68000 processors) under various operating systems. This compiler has already been on the market for the IBM PC/AT and the ATARI ST (see [22], [23]). The new PASCAL-XSC compiler is now available for personal computers, workstations, mainframes, and supercomputers by means of an implementation in C. Via a PASCAL-XSC-to-C precompiler and a runtime system implemented in C, the language PASCAL-XSC may be used, among other systems, on all UNIX systems in an almost identical way. Thus, the user may develop programs, for example, on a personal computer and afterwards run them on a mainframe via the same compiler.
A complete description of the language PASCAL-XSC and the arithmetic modules as well as a collection of sample programs is given in [20] and [21].
4 PASCAL-XSC Sample Program
In the following, a complete PASCAL-XSC program is listed which demonstrates the use of some of the arithmetic modules. Employing the module LIN_SOLV, the solution of a system of linear equations is enclosed in an interval vector by successive interval iterations. The procedure main, which is called in the body of lin_sys, is only used for reading the dimension of the system and for allocating the dynamic variables. The numerical method itself is started by the call of procedure linear_system_solver defined in module LIN_SOLV. This procedure may be called with arbitrary dimension of the employed arrays.
For detailed information on iteration methods with automatic result verification see [17], [24], [28], or [32], for example.
Main Program

    program lin_sys (input,output);
    { Program for verified solution of a linear system of equations. The  }
    { matrix A and the right-hand side b of the system are to be read in. }
    { The program delivers either a verified solution or a corresponding  }
    { failure message.                                                    }

    use
      lin_solv,   { lin_solv : linear system solver              }
      mv_ari,     { mv_ari   : matrix/vector arithmetic          }
      mvi_ari;    { mvi_ari  : matrix/vector interval arithmetic }

    var
      n : integer;

    procedure main (n : integer);
    { The matrix A and the vectors b, x are allocated dynamically with   }
    { this subroutine being called. The matrix A and the right-hand side }
    { b are read in and linear_system_solver is called.                  }
    var
      ok : boolean;
      b  : rvector[1..n];
      x  : ivector[1..n];
      A  : rmatrix[1..n,1..n];
    begin
      writeln('Please enter the matrix A:');
      read(A);
      writeln('Please enter the right-hand side b:');
      read(b);
      linear_system_solver(A,b,x,ok);
      if ok then
      begin
        writeln('The given matrix A is non-singular and the solution ');
        writeln('of the linear system is contained in:');
        write(x);
      end
      else
        writeln('No solution found !');
    end; {procedure main}

    begin
      write('Please enter the dimension n of the linear system: ');
      read(n);
      main(n);
    end. {program lin_sys}
Module LIN_SOLV

    module lin_solv;
    { Verified solution of the linear system of equations Ax = b. }

    use
      i_ari,      { i_ari   : interval arithmetic                }
      mv_ari,     { mv_ari  : matrix/vector arithmetic           }
      mvi_ari;    { mvi_ari : matrix/vector interval arithmetic  }

    priority
      inflated = *;   { priority level 2 }

    operator inflated (a : ivector; eps : real) infl : ivector[1..ubound(a)];
    { Computes the so-called epsilon inflation of an interval vector. }
    var
      i : integer;
      x : interval;
    begin
      for i := 1 to ubound(a) do
      begin
        x := a[i];
        if (diam(x) <> 0) then
          a[i] := (1+eps)*x - eps*x
        else
          a[i] := intval( pred(inf(x)), succ(sup(x)) );
      end; {for}
      infl := a;
    end; {operator inflated}
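For intervals represented as (inf, sup) pairs, the operator above can be mimicked in Python; math.nextafter plays the role of pred and succ. This is an illustrative sketch (the directed roundings of the scalar products (1+eps)*x and eps*x are simplified to an exact widening by eps times the diameter):

```python
import math

def inflate(iv, eps):
    """Epsilon inflation of an interval iv = (lo, hi), as in the
    'inflated' operator: (1+eps)*x - eps*x evaluated in interval
    arithmetic widens x by eps*diam(x) on each side (the directed
    roundings of the scalar products are omitted in this sketch)."""
    lo, hi = iv
    if hi - lo != 0:                      # diam(x) <> 0
        return (lo - eps*(hi - lo), hi + eps*(hi - lo))
    # point interval: widen by one ulp per side, like pred/succ
    return (math.nextafter(lo, -math.inf), math.nextafter(hi, math.inf))
```

For the interval [1, 2] and eps = 0.25 this yields [0.75, 2.25]; a point interval is widened by one ulp in each direction.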
    function approximate_inverse (A : rmatrix) : rmatrix[1..ubound(A),1..ubound(A)];
    { Computation of an approximate inverse of the (n,n)-matrix A }
    { by application of the Gaussian elimination method.          }
    var
      i, j, k, n : integer;
      factor     : real;
      R, Inv, E  : rmatrix[1..ubound(A),1..ubound(A)];
    begin
      n := ubound(A);   { dimension of A  }
      E := id(E);       { identity matrix }
      R := A;

      { Gaussian elimination step with unit vectors as   }
      { right-hand sides. Division by R[i,i]=0 indicates }
      { a probably singular matrix A.                    }

      for i := 1 to n do
        for j := (i+1) to n do
        begin
          factor := R[j,i]/R[i,i];
          for k := i to n do
            R[j,k] := #*(R[j,k] - factor*R[i,k]);
          E[j] := E[j] - factor*E[i];
        end; {for j := ...}

      { Backward substitution delivers the rows of the inverse of A. }

      for i := n downto 1 do
        Inv[i] := #*(E[i] - for k := (i+1) to n sum(R[i,k]*Inv[k]))/R[i,i];

      approximate_inverse := Inv;
    end; {function approximate_inverse}
    global procedure linear_system_solver (A : rmatrix; b : rvector;
                                           var x : ivector; var ok : boolean);
    { Computation of a verified enclosure vector for the solution of the }
    { linear system of equations. If an enclosure is not achieved after  }
    { a certain number of iteration steps the algorithm is stopped and   }
    { the parameter ok is set to false.                                  }
    const
      epsilon   = 0.25;  { Constant for the epsilon inflation }
      max_steps = 10;    { Maximum number of iteration steps  }
    var
      i    : integer;
      y, z : ivector[1..ubound(A)];
      R    : rmatrix[1..ubound(A),1..ubound(A)];
      C    : imatrix[1..ubound(A),1..ubound(A)];
    begin
      R := approximate_inverse(A);

      { R*b is an approximate solution of the linear system and z is an enclosure }
      { of this vector. However, it does not usually enclose the true solution.   }
      z := ##(R*b);

      { An enclosure of I - R*A is computed with maximum accuracy.         }
      { The (n,n) identity matrix is generated by the function call id(A). }
      C := ##(id(A) - R*A);

      x := z;
      i := 0;
      repeat
        i := i + 1;
        y := x inflated epsilon;  { To obtain a true enclosure, the interval }
                                  { vector x is slightly enlarged.           }
        x := z + C*y;             { The new iterate is computed.             }
        ok := x in y;             { Is x contained in the interior of y?     }
      until ok or (i = max_steps);
    end; {procedure linear_system_solver}

    end. {module lin_solv}
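The logic of linear_system_solver — enclose R*b and I - R*A, then iterate x := z + C*y with epsilon inflation until x lies in the interior of y — can be sketched in Python with a toy interval arithmetic. Note the hedges: outward rounding is done with math.nextafter, and z and C are merely widened by one ulp around floating-point values instead of being evaluated with a single exact rounding, so this shows the structure of the algorithm rather than delivering a rigorous enclosure. All names are mine and the example is fixed to a 2x2 system:

```python
import math

def down(x): return math.nextafter(x, -math.inf)
def up(x):   return math.nextafter(x,  math.inf)

def iadd(a, b):                       # outward-rounded interval addition
    return (down(a[0] + b[0]), up(a[1] + b[1]))

def imul(a, b):                       # outward-rounded interval product
    ps = (a[0]*b[0], a[0]*b[1], a[1]*b[0], a[1]*b[1])
    return (down(min(ps)), up(max(ps)))

def inside(a, b):                     # a contained in the interior of b?
    return b[0] < a[0] and a[1] < b[1]

def inflate(a, eps):                  # epsilon inflation, cf. 'inflated'
    w = a[1] - a[0]
    if w != 0:
        return (a[0] - eps*w, a[1] + eps*w)
    return (down(a[0]), up(a[1]))

def solve2x2(A, b, eps=0.25, max_steps=10):
    """Iteration of linear_system_solver for a fixed 2x2 system.
    z and C are only widened by one ulp around floating-point values
    (NOT a single exact rounding), so this illustrates the algorithm's
    structure, not a rigorous enclosure."""
    det = A[0][0]*A[1][1] - A[0][1]*A[1][0]
    R = [[ A[1][1]/det, -A[0][1]/det],        # approximate inverse
         [-A[1][0]/det,  A[0][0]/det]]
    widen = lambda v: (down(v), up(v))
    z = [widen(sum(R[i][k]*b[k] for k in range(2))) for i in range(2)]
    C = [[widen((1.0 if i == j else 0.0)
                - sum(R[i][k]*A[k][j] for k in range(2)))
          for j in range(2)] for i in range(2)]
    x, ok, i = z[:], False, 0
    while not ok and i < max_steps:
        i += 1
        y = [inflate(xi, eps) for xi in x]    # y := x inflated epsilon
        x = []
        for r in range(2):                    # x := z + C*y
            acc = z[r]
            for c in range(2):
                acc = iadd(acc, imul(C[r][c], y[c]))
            x.append(acc)
        ok = all(inside(x[r], y[r]) for r in range(2))
    return x, ok

x, ok = solve2x2([[2.0, 1.0], [1.0, 3.0]], [3.0, 4.0])
# the true solution is (1, 1); each x[i] is a tight interval around it
```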
Appendix

Review of Real and Complex #-Expressions

Syntax:  #-Symbol ( Exact Expression )

    #-Symbol | Result Type  | Summands Permitted in the Exact Expression
    ---------+--------------+----------------------------------------------------
    #        | dotprecision | variables, constants, and special function calls of
             |              | type integer, real, or dotprecision; products of
             |              | type integer or real; scalar products of type real
    #* #< #> | real         | variables, constants, and special function calls of
             |              | type integer, real, or dotprecision; products of
             |              | type integer or real; scalar products of type real
             | complex      | variables, constants, and special function calls of
             |              | type integer, real, complex, or dotprecision;
             |              | products of type integer, real, or complex; scalar
             |              | products of type real or complex
             | rvector      | variables and special function calls of type
             |              | rvector; products of type rvector (e.g.
             |              | rmatrix * rvector, real * rvector, etc.)
             | cvector      | variables and special function calls of type rvector
             |              | or cvector; products of type rvector or cvector
             |              | (e.g. cmatrix * rvector, real * cvector, etc.)
             | rmatrix      | variables and special function calls of type
             |              | rmatrix; products of type rmatrix
             | cmatrix      | variables and special function calls of type rmatrix
             |              | or cmatrix; products of type rmatrix or cmatrix
Review of Real and Complex Interval #-Expressions

Syntax:  ## ( Exact Expression )

    #-Symbol | Result Type | Summands Permitted in the Exact Expression
    ---------+-------------+----------------------------------------------------
    ##       | interval    | variables, constants, and special function calls of
             |             | type integer, real, interval, or dotprecision;
             |             | products of type integer, real, or interval; scalar
             |             | products of type real or interval
             | cinterval   | variables, constants, and special function calls of
             |             | type integer, real, complex, interval, cinterval, or
             |             | dotprecision; products of type integer, real,
             |             | complex, interval, or cinterval; scalar products of
             |             | type real, complex, interval, or cinterval
             | ivector     | variables and special function calls of type rvector
             |             | or ivector; products of type rvector or ivector
             | civector    | variables and special function calls of type
             |             | rvector, cvector, ivector, or civector; products of
             |             | type rvector, cvector, ivector, or civector
             | imatrix     | variables and special function calls of type rmatrix
             |             | or imatrix; products of type rmatrix or imatrix
             | cimatrix    | variables and special function calls of type
             |             | rmatrix, cmatrix, imatrix, or cimatrix; products of
             |             | type rmatrix, cmatrix, imatrix, or cimatrix
References

[1] Allendörfer, U., Shiriaev, D.: PASCAL-XSC to C - A Portable PASCAL-XSC Compiler. In: [18], 91-104, 1991.
[2] Allendörfer, U., Shiriaev, D.: PASCAL-XSC - A portable development system. In: [9], 1992.
[3] American National Standards Institute / Institute of Electrical and Electronics Engineers: A Standard for Binary Floating-Point Arithmetic. ANSI/IEEE Std. 754-1985, New York, 1985.
[4] Bleher, J. H., Rump, S. M., Kulisch, U., Metzger, M., Ullrich, Ch., and Walter, W.: FORTRAN-SC: A Study of a FORTRAN Extension for Engineering/Scientific Computation with Access to ACRITH. Computing 39, 93-110, 1987.
[5] Bohlender, G., Grüner, K., Kaucher, E., Klatte, R., Krämer, W., Kulisch, U., Rump, S., Ullrich, Ch., Wolff von Gudenberg, J., and Miranker, W.: PASCAL-SC: A PASCAL for Contemporary Scientific Computation. Research Report RC 9009, IBM Thomas J. Watson Research Center, Yorktown Heights, New York, 1981.
[6] Bohlender, G., Grüner, K., Kaucher, E., Klatte, R., Kulisch, U., Neaga, M., Ullrich, Ch., and Wolff von Gudenberg, J.: PASCAL-SC Language Definition. Internal Report of the Institute for Applied Mathematics, University of Karlsruhe, 1985.
[7] Bohlender, G., Rall, L., Ullrich, Ch., and Wolff von Gudenberg, J.: PASCAL-SC: A Computer Language for Scientific Computation. Academic Press, New York, 1987.
[8] Bohlender, G., Rall, L., Ullrich, Ch. und Wolff von Gudenberg, J.: PASCAL-SC: Wirkungsvoll programmieren, kontrolliert rechnen. Bibliographisches Institut, Mannheim, 1986.
[9] Brezinski, C. and Kulisch, U. (Eds.): Computational and Applied Mathematics - Algorithms and Theory. Proceedings of the 13th IMACS World Congress, Dublin, Ireland. Elsevier Science Publishers B.V. To be published in 1992.
[10] Buchholz, W.: The IBM System/370 Vector Architecture. IBM Systems Journal 25/1, 1986.
[11] Cordes, D.: Runtime System for a PASCAL-XSC Compiler. In: [18], 151-160, 1991.
[12] Däßler, K. und Sommer, M.: PASCAL, Einführung in die Sprache. Norm Entwurf DIN 66256, Erläuterungen. Springer, Berlin, 1983.
[13] Hammer, R.: How Reliable is the Arithmetic of Vector Computers? In: [33], 1990.
[14] Hammer, R., Neaga, M., Ratz, D., Shiriaev, D.: PASCAL-XSC - A new language for scientific computing. (In Russian). Interval Computations 2, St. Petersburg, 1991.
[15] IBM High-Accuracy Arithmetic Subroutine Library (ACRITH). General Information Manual, GC 33-6163-02, 3rd Edition, 1986.
[16] IBM High-Accuracy Arithmetic Subroutine Library (ACRITH). Program Description and User's Guide, SC 33-6164-02, 3rd Edition, 1986.
[17] Kaucher, E., Kulisch, U., and Ullrich, Ch. (Eds.): Computer Arithmetic - Scientific Computation and Programming Languages. Teubner, Stuttgart, 1987.
[18] Kaucher, E., Markov, S. M., Mayer, G. (Eds.): Computer Arithmetic, Scientific Computation and Mathematical Modelling. IMACS Annals on Computing and Applied Mathematics 12, J. C. Baltzer, Basel, 1991.
[19] Kirchner, R. and Kulisch, U.: Accurate Arithmetic for Vector Processors. Journal of Parallel and Distributed Computing 5, 250-270, 1988.
[20] Klatte, R., Kulisch, U., Neaga, M., Ratz, D. und Ullrich, Ch.: PASCAL-XSC - Sprachbeschreibung mit Beispielen. Springer, Heidelberg, 1991.
[21] Klatte, R., Kulisch, U., Neaga, M., Ratz, D. und Ullrich, Ch.: PASCAL-XSC - Language Reference with Examples. Springer, Heidelberg, 1992.
[22] Kulisch, U. (Ed.): PASCAL-SC: A PASCAL Extension for Scientific Computation, Information Manual and Floppy Disks, Version ATARI ST. Teubner, Stuttgart, 1987.
[23] Kulisch, U. (Ed.): PASCAL-SC: A PASCAL Extension for Scientific Computation, Information Manual and Floppy Disks, Version IBM PC/AT (DOS). Teubner, Stuttgart, 1987.
[24] Kulisch, U. (Hrsg.): Wissenschaftliches Rechnen mit Ergebnisverifikation - Eine Einführung. Akademie Verlag, Ost-Berlin, Vieweg, Wiesbaden, 1989.
[25] Kulisch, U. and Miranker, W. L.: Computer Arithmetic in Theory and Practice. Academic Press, New York, 1981.
[26] Kulisch, U. and Miranker, W. L.: The Arithmetic of the Digital Computer: A New Approach. SIAM Review, Vol. 28, No. 1, 1986.
[27] Kulisch, U. and Miranker, W. L. (Eds.): A New Approach to Scientific Computation. Academic Press, New York, 1983.
[28] Kulisch, U. and Stetter, H. J. (Eds.): Scientific Computation with Automatic Result Verification. Computing Suppl. 6, Springer, Wien, 1988.
[29] Neaga, M.: Erweiterungen von Programmiersprachen für wissenschaftliches Rechnen und Erörterung einer Implementierung. Dissertation, Universität Kaiserslautern, 1984.
[30] Neaga, M.: PASCAL-SC - Eine PASCAL-Erweiterung für wissenschaftliches Rechnen. In: [24], 1989.
[31] Ratz, D.: The Effects of the Arithmetic of Vector Computers on Basic Numerical Methods. In: [33], 1990.
[32] Rump, S. M.: Solving Algebraic Problems with High Accuracy. In: [27], 1983.
[33] Ullrich, Ch. (Ed.): Contributions to Computer Arithmetic and Self-validating Numerical Methods. J. C. Baltzer AG, Scientific Publishing Co., IMACS, 1990.
[34] Wolff von Gudenberg, J.: Einbettung allgemeiner Rechnerarithmetik in PASCAL mittels eines Operatorkonzeptes und Implementierung der Standardfunktionen mit optimaler Genauigkeit. Dissertation, Universität Karlsruhe, 1980.
ACRITH-XSC: A Fortran-like Language for Verified Scientific Computing

Wolfgang V. Walter

ACRITH-XSC is a Fortran-like programming language designed for the development of self-validating numerical algorithms. Such algorithms deliver results of high accuracy which are verified to be correct by the computer. Thus there is no need to perform an error analysis by hand for these calculations. For example, self-validating numerical techniques have been successfully applied to a variety of engineering problems in soil mechanics, optics of liquid crystals, ground-water modelling, and vibrational mechanics where conventional floating-point methods have failed. With few exceptions, ACRITH-XSC is an extension of FORTRAN 77 [1]. Various language concepts which are available in ACRITH-XSC can also be found in a more or less similar form in Fortran 90 [13]. Other ACRITH-XSC features have been specifically designed for numerical purposes: numeric constant and data conversion and arithmetic operators with rounding control, interval and complex interval arithmetic, accurate vector/matrix arithmetic, an enlarged set of mathematical standard functions for point and interval arguments, and more. For a restricted class of expressions called "dot product expressions", ACRITH-XSC provides a special notation which guarantees that expressions of this type are evaluated with least-bit accuracy, i.e. there is no machine number between the computed result and the exact solution. The exact dot product is essential in many algorithms to attain high accuracy. The main language features and numerical tools of ACRITH-XSC are presented and illustrated by some typical examples. Differences from Fortran 90 are noted where appropriate. A complete sample program for computing continuous bounds on the solution of an initial value problem is given at the end.
1 Development of ACRITH-XSC
The expressive and functional power of algorithmic programming languages has been continually enhanced since the 1950's. New powerful languages such as Ada, C++, and Fortran 90 have evolved over the past decade or so. The common programming languages attempt to satisfy the needs of many diverse fields. While trying to cater to a large user community, these languages fail to provide specialized tools for specific areas of application. Thus the user is often left with ill-suited means to accomplish a task. This has become quite apparent in numerical programming and scientific computing. Even though programming has become more convenient through the use of more modern language concepts, numerical programs have not necessarily become more reliable.

Scientific Computing with Automatic Result Verification. Copyright © 1993 by Academic Press, Inc. All rights of reproduction in any form reserved. ISBN 0-12-044210-8
The development of programming languages suited for the particular needs of numerical programming has been a long-term commitment of the Institute of Applied Mathematics at the University of Karlsruhe. With languages and tools such as PASCAL-XSC, C-XSC (see articles in this volume), ACRITH-XSC, and ACRITH, the emphasis is on accuracy and reliability in general and on automatic result verification in particular. The first language extension for "Scientific Computation", PASCAL-SC [7, 19], was designed and implemented in the late 70's and has been under continuous development since. The most recent implementation, called PASCAL-XSC, has been available for a wide variety of computers ranging from micros to mainframes since 1991 [15]. In order to reach a broader public, reports and proposals on how to incorporate similar concepts into Fortran 8x were published in the early 80's [5, 6]. In the meantime, several of the proposed features have found their way into the Fortran 90 standard [13, 29, 30, 31, 32]. However, a rigorous mathematical definition of computer arithmetic and roundings (e.g. as defined by Kulisch and Miranker in [17, 18]) is still lacking in Fortran 90. The standard does not contain any accuracy requirements for arithmetic operators and mathematical intrinsic functions. The programming language FORTRAN-SC [4, 22, 27, 28, 23] was designed as a Fortran-like language featuring specialized tools for reliable scientific computing. It was defined and implemented at the University of Karlsruhe in a joint project with IBM Germany and has been in use at a number of international universities and research institutions since 1988. The equivalent IBM program product High Accuracy Arithmetic - Extended Scientific Computation, called ACRITH-XSC for short, was released for world-wide distribution in 1990 [11]. Numerically, it is based on IBM's High-Accuracy Arithmetic Subroutine Library (ACRITH) [9, 10], a FORTRAN 77 library which was first released in 1984.
The use of ACRITH in FORTRAN 77 programs triggered the demand for a more convenient programming environment and resulted in the development of ACRITH-XSC. With the aid of these tools, numerical programming takes a major step from an approximative, often empirical science towards a true mathematical discipline.
2 Brief Comparison with Fortran 90
The new Fortran standard, developed under the name Fortran 8x and now known as Fortran 90 [13], was finally adopted and published as an international (ISO) standard in the summer of 1991. The new Fortran language offers a multitude of new features which the Fortran user community has been awaiting impatiently. Among the most prominent are extensive array handling facilities, a general type and operator concept, pointers, and modules. Also, many of the newly added intrinsic functions, especially the array functions, numeric inquiry functions, floating-point manipulation functions, and kind functions (for selecting one of the representation methods of a data type), can be quite useful for numerical purposes. Through their judicious use, the portability of Fortran 90 programs can be enhanced.
Unfortunately, however, portability of numerical results is still extremely difficult to achieve since the mathematical properties of the arithmetic operators and mathematical standard functions, in particular strict accuracy requirements, remain unspecified in the Fortran 90 standard. Thus, computational results still cannot be expected to be compatible when using different computer systems with different floating-point units, compilers, and compiler options. ACRITH-XSC contains a number of Fortran 90-like features such as array functions and expressions, user-defined operators and operator overloading, dynamic arrays and subarrays, and others. However, an attempt was also made to keep the language reasonably small compared with Fortran 90. Thus other features present in Fortran 90 were not included (e. g. pointers and modules). On the other hand, the ACRITH-XSC language provides a number of specialized numerical tools which cannot be found in Fortran 90, such as complete rounding control, interval arithmetic, accurate dot products, and the exact evaluation of dot product expressions. These make ACRITH-XSC well-suited for the development of numerical algorithms which deliver highly accurate and automatically verified results. In contrast, the Fortran 90 standard does not specify any minimal accuracy requirements for intrinsic functions and operators and for the conversion of numerical data. For the user, their rounding behavior may vary from one machine to another and cannot be influenced by any portable or standard means.
3 Rounding Control and Interval Arithmetic
By controlling the rounding error at each step of a calculation, it is possible to compute guaranteed bounds on a solution and thus verify the computational results on the computer. Enclosures of a whole set or family of solutions can be computed using interval arithmetic, for example to treat problems involving imprecise data or other data with tolerances, or to study the influence of certain parameters. Interval analysis is particularly valuable for stability and sensitivity analysis. It provides one of the essential foundations for reliable numerical computation.
ACRITH-XSC provides complete rounding control for numeric constants, input and output of numeric data, and the arithmetic operators +, -, *, / (for real and
complex numbers, vectors, and matrices). This ensures that the user knows exactly what data enters the computational process and what data is produced as a result. Besides the default rounding, the monotonic downwardly and upwardly directed roundings, symbolized by < and >, respectively, are available to compute guaranteed lower and upper bounds on a solution. All three rounding modes deliver results of 1 ulp (one unit in the last place) accuracy in all cases. A special notation is available for rounded constants. It may be used anywhere a numeric constant is permitted in a program. The conversion from the decimal representation of a constant to the internal format always produces one of the two neighboring floating-point numbers. Rounding downwards produces the largest
floating-point number not greater than the given constant, rounding upwards produces the smallest floating-point number not less than the constant. If no rounding is specified, the constant is converted to the nearest floating-point number:

    (<2.7182818284590)    e rounded downwards
     2.7182818284590      e rounded to nearest
    (>2.7182818284591)    e rounded upwards
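The directed conversions (<c) and (>c) always deliver one of the two doubles neighboring the decimal constant. This behavior can be reproduced in Python by converting to the nearest double and then correcting by one ulp after an exact rational comparison (a sketch; the function names are mine):

```python
from fractions import Fraction
import math

def round_down(decimal_str):
    """Largest double not greater than the decimal constant, like (<c)."""
    exact = Fraction(decimal_str)              # exact rational value
    f = float(exact)                           # nearest double
    return math.nextafter(f, -math.inf) if Fraction(f) > exact else f

def round_up(decimal_str):
    """Smallest double not less than the decimal constant, like (>c)."""
    exact = Fraction(decimal_str)
    f = float(exact)
    return math.nextafter(f, math.inf) if Fraction(f) < exact else f

lo, hi = round_down("2.718281828459045"), round_up("2.718281828459045")
assert Fraction(lo) <= Fraction("2.718281828459045") <= Fraction(hi)
assert hi == lo or hi == math.nextafter(lo, math.inf)   # neighboring doubles
```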
The direction of rounding can also be prescribed in the I/O-list of a READ or WRITE statement. In the following example, a guaranteed lower bound for the sum of two numbers (given in decimal) is produced (again in decimal notation):
    READ  (*,*) x:'<', y:'<'
    WRITE (*,*) x +< y :'<'

The conversion accuracy during I/O is the same as for constants in the program. The default rounding mode is always to the nearest representable number for point data and outwards to the smallest enclosing interval for interval data. During input, as many digits are taken into account as necessary to be able to perform the desired rounding. During output, the best decimal representation with the given number of digits and respecting the prescribed rounding is produced. The arithmetic operators with explicit rounding control (downwards and upwards)
    +<   +>   -<   ->   *<   *>   /<   />
are predefined in ACRITH-XSC. Additionally, a complete interval arithmetic is available. It encompasses the data types INTERVAL, DOUBLE INTERVAL, COMPLEX INTERVAL, and DOUBLE COMPLEX INTERVAL, a notation for interval constants, interval input/output, arithmetic and relational operators, mathematical standard functions for interval arguments, and all the necessary type conversion functions. The result of every arithmetic operation is accurate to 1 ulp. The accuracy of every predefined mathematical function is at least 2 ulp. An interval is represented by a pair of (real or complex) numbers, its infimum (lower bound) and its supremum (upper bound). For the infimum, the direction of rounding is always downwards, for the supremum, upwards, so that the inclusion property is never violated. By adhering to this principle, the computed result interval always contains the true solution set.
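The inclusion principle can be sketched in a few lines of Python: keep a pair (lo, hi) and round every computed bound one ulp outward with `math.nextafter`, so the enclosure survives the rounding of the underlying float operations. The class name `Interval` is illustrative; this is not the ACRITH-XSC INTERVAL type, and the full-ulp widening is cruder than ACRITH-XSC's correctly rounded endpoints:

```python
import math
from fractions import Fraction

class Interval:
    """Toy interval type with outwardly rounded endpoints (a sketch)."""
    def __init__(self, lo: float, hi: float):
        assert lo <= hi
        self.lo, self.hi = lo, hi

    def __add__(self, other):
        # infimum rounded down, supremum rounded up: one extra ulp each
        return Interval(math.nextafter(self.lo + other.lo, -math.inf),
                        math.nextafter(self.hi + other.hi, math.inf))

    def __mul__(self, other):
        p = [self.lo * other.lo, self.lo * other.hi,
             self.hi * other.lo, self.hi * other.hi]
        return Interval(math.nextafter(min(p), -math.inf),
                        math.nextafter(max(p), math.inf))

# the exact sum of the float endpoints is always contained:
s = Interval(0.1, 0.1) + Interval(0.2, 0.2)
assert Fraction(s.lo) <= Fraction(0.1) + Fraction(0.2) <= Fraction(s.hi)
```

The assertion checks the inclusion property directly: the exact rational sum of the endpoints lies inside the computed interval even though `0.1 + 0.2` itself is rounded.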
The elementary interval operators +, -, *, / as well as the binary operators .IS. (intersection) and .CH. (convex hull) are available for all four interval types. The relational operators for intervals are: .EQ. (equal), .NE. (not equal), .SB. (subset), .SP. (superset), .DJ. (disjoint), and .IN. (point contained in interval).
ACRITH-XSC also provides a special notation for interval constants. The conversion from the decimal representation of a constant to the internal format always produces the smallest possible enclosure of the given number or interval. Again, the accuracy is 1 ulp (for each bound):
ACRITH-XSC: A Fortran-like Language for Verified Scientific Computing

    (<-2.00001, -1.99999>)     single precision real interval
    (<-3.1415926535898>)       single precision enclosure of -pi
    (< (3,1D0), (3.001,1) >)   double complex interval
The following 24 mathematical functions are available for real and complex numbers and intervals in the single and the double precision format (i. e. 8 different versions):
    SQR     SQRT    EXP     LOG     LOG10
    SIN     COS     TAN     COT
    ASIN    ACOS    ATAN    ACOT    ATAN2
    SINH    COSH    TANH    COTH
    ARSINH  ARCOSH  ARTANH  ARCOTH
    ABS     ARG
All predefined mathematical functions deliver a result of 1 ulp accuracy for point data and of 2 ulp accuracy for interval data. In order to be able to access interval bounds, to compose intervals and to perform various other data type changes, numerous type conversion functions such as INF, SUP, IVAL, UP, DOWN, etc. are available. There are many compound functions which perform a combination of elementary type conversion tasks. All type conversion functions can also be applied to arrays. Fortran 90, as opposed to ACRITH-XSC, does not provide any means for automatic error control or for deliberate rounding control. In particular, the arithmetic operators with directed roundings +<, +>, -<, ->, *<, *>, /<, /> are not available in Fortran 90. Thus, regrettably, Fortran 90 does not provide access to the rounded floating-point operations defined by the IEEE Standards for Floating-point Arithmetic 754 and 854 [2, 3]. In view of the steadily increasing number of processors conforming to these standards, this is most unfortunate for the whole of numerics.
4 Vector/Matrix Arithmetic
In traditional programming languages such as FORTRAN 77, Pascal, or Modula-2, each vector/matrix operation requires an explicit loop construct or a call to an appropriate subroutine. Unnecessary loops, long sequences of subroutine calls, and explicit management of loop variables, index bounds and intermediate result variables complicate programming enormously and render programs virtually incomprehensible. In ACRITH-XSC, vector/matrix arithmetic is predefined according to the rules
of linear algebra. All operations are accessible through their usual operator sym-
bol. This allows the same expressional notation for vector/matrix expressions as in mathematics. Arithmetic and relational operators for vectors and matrices with
real, complex, interval, and complex interval components are predefined. The operators + and - for numerical vectors and matrices of the same shape as well as the operators .IS. (intersection) and .CH. (convex hull) for interval vectors and matrices of the same shape are defined as element-by-element operations. Multiplication and division of an array by a scalar are also defined componentwise. The vector/matrix products V*V, M*V, and M*M (where V stands for any vector and M for any matrix) are defined as usual in linear algebra (not componentwise as in Fortran 90). They are implemented using the accurate dot product and produce results which are accurate to 1 ulp in every component. Again, the direction of rounding can be specified (e. g. *<, *>). If no rounding is specified, the best possible floating-point result is produced (with 1/2 ulp accuracy). The usual relational operators for real and complex numbers are also predefined for vectors and matrices. Similarly, the operators .EQ., .NE., .SB., .SP., .DJ., and .IN. are predefined for interval vectors and interval matrices in ACRITH-XSC. All of these operators produce a single LOGICAL result.
In contrast, all array operators are defined as element-by-element operations in Fortran 90. This definition has the advantage of being uniform, but the disadvantage that highly common operations such as vector and matrix products (inner products) are not easily accessible. The Fortran 90 standard does not provide an operator notation for these operations, and it prohibits the redefinition of an intrinsic operator (e. g. *) for an intrinsically defined usage. Instead, the dot product is only accessible through the intrinsic function call DOTPRODUCT(V,V), the other vector/matrix products through the intrinsic function calls MATMUL(V,M), MATMUL(M,V), and MATMUL(M,M). Clearly, function references are far less readable and less intuitive than operator symbols, especially in complicated expressions. If one wants to reference the intrinsic functions DOTPRODUCT and MATMUL via an operator notation, there are only two choices: either one defines a new operator symbol, say .MUL., for all possible type combinations that can occur in vector/matrix multiplication, or one defines new data types, e. g. RVECTOR, DRVECTOR, CVECTOR, DCVECTOR, RMATRIX, ..., and then overloads the operator symbol * for all possible type combinations of these new types. Both of these methods are quite cumbersome and seem to contradict one of the major goals of the Fortran 90 standard, namely to cater to the needs of the numerical programmer, in particular by providing extensive and easy-to-use array facilities. Note that both of these methods require a minimum of 64 operator definitions to cover all of the intrinsic cases. If more than two REAL and two COMPLEX types (single and double precision) are provided by an implementation, this number becomes even larger. The most serious drawback of the functions DOTPRODUCT and MATMUL, however, is the fact that they are not generally reliable numerically. This is due to the total lack of any accuracy requirements in the Fortran 90 standard.
Now that these functions have been "standardized", the potential danger becomes even more evident. Thus, unless an implementation gives explicit error bounds for these intrinsic functions, every Fortran 90 programmer should think twice before using them, especially if the possibility of leading digit cancellation cannot be excluded.
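The danger is easy to reproduce in any floating-point environment. In the Python illustration below (values chosen purely for the demonstration), the exact dot product is 1, but naive accumulation in working precision returns 0 because the partial sum 10^16 + 1 rounds back to 10^16:

```python
from fractions import Fraction

a = [1e16, 2.0, -1e16]
b = [1.0,  0.5,  1.0]          # exact dot product: 10**16 + 1 - 10**16 = 1

naive = 0.0
for ai, bi in zip(a, b):
    naive += ai * bi           # every partial sum rounded to double

exact = sum((Fraction(ai) * Fraction(bi) for ai, bi in zip(a, b)),
            Fraction(0))       # error-free rational accumulation
assert exact == 1
assert naive == 0.0            # all leading digits cancelled
```

An accurate dot product in the sense of this chapter would return the correctly rounded value 1.0 here; the naive loop loses every significant digit.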
5 Dynamic Arrays and Subarrays
For dynamic arrays, storage space may be allocated and freed as necessary during runtime. Thus the same program may be used for arrays of varying size without recompilation. Furthermore, storage space can be used economically since only the arrays currently needed have to be kept in storage. In ACRITH-XSC, the DYNAMIC statement is used to declare dynamic arrays as well as named array types. An array type is characterized by the (scalar) data type of its array elements and the number of dimensions of the array type. The size and shape of an array have no influence on its type.
    DYNAMIC / COMPLEX INTERVAL (:,:) / DYNMAT,  &
            / VECTOR = REAL (:) /
    DYNAMIC / POLYNOMIAL = REAL (:) / POLY,     &
            / VECTOR / X, Y, Z
These statements declare two named array types, VECTOR and POLYNOMIAL, and five dynamic arrays: DYNMAT is a two-dimensional dynamic array with elements of type complex interval; POLY, X, Y, and Z are real one-dimensional dynamic arrays. Note that X, Y and Z are of type VECTOR whereas POLY is of type POLYNOMIAL. The ALLOCATE statement is used to obtain storage space for a dynamic array. For each dimension of a dynamic array, an index range must be specified:
    ALLOCATE DYNMAT (5, -5:5), POLY (0:10)
    ALLOCATE X, Y (ZPOLY)
These statements allocate DYNMAT as a 5 x 11 matrix, POLY as a polynomial of 10th degree, and X and Y as vectors with the same index range as POLY. The storage space of dynamic arrays can be freed (deallocated) by a FREE statement:
    FREE X, Y

During assignment to a dynamic array, allocation of the storage space needed to hold the result of the right-hand side expression is automatically performed. An existing (allocated) dynamic array may be reallocated explicitly by an ALLOCATE statement or implicitly by array assignment without prior execution of a FREE statement. The storage space previously occupied by the array is automatically freed in these cases. This "object-oriented" approach makes the use of dynamic arrays very convenient for the programmer. The size of an array can also be changed by a RESIZE statement, which preserves the contents of those array elements that have the same index (combination) before and after the resize operation. Furthermore, the index bound inquiry functions LB for lower index bounds and UB for upper index bounds are often useful and necessary.
ACRITH-XSC also provides a convenient notation to access subarrays (array sections) of static and dynamic arrays. A subarray is defined as a contiguous "rectangular" part of another array (e. g. a row, column, or submatrix of a matrix). In the executable part of a program, subarrays can be used wherever regular arrays may be used. Array sections in Fortran 90 are somewhat more general than subarrays in ACRITH-XSC, but their notation and functionality are the same. The following examples show the syntax and semantics of the subarray notation in ACRITH-XSC. It is assumed that A is a 6 x 9 matrix: REAL A(6,9). Note that an unspecified index bound defaults to the corresponding bound of the parent array:

    A(3,:)         3rd row of A
    A(:,9)         last column of A
    A(:,9:)        6 x 1 submatrix of A consisting of last column of A
    A(:3,:)        first 3 rows of A
    A(3:,:)        last 4 rows of A
    A(2:4,2:4)     3 x 3 submatrix of A
At first sight, the dynamic array concepts of ACRITH-XSC and Fortran 90 look fairly similar. Storage space for a dynamic array may be allocated and freed as desired. The size and shape of a dynamic array may thus change during runtime. There are, however, some major differences. For example, memory management is automatically performed for DYNAMIC arrays in ACRITH-XSC, which is not generally true for ALLOCATABLE arrays in Fortran 90. In ACRITH-XSC, allocation of the left-hand side dynamic array is performed implicitly during assignment, and automatic deallocation of intermediate result arrays and local variables takes place when these are no longer accessible. Also, dummy arguments and function results may be declared to be dynamic arrays in ACRITH-XSC. The Fortran 90 standard, on the other hand, requires the array on the left-hand side of an assignment to be allocated with the proper shape before the assignment is encountered. This requirement severely limits the usefulness of ALLOCATABLE arrays and makes life needlessly complicated for the programmer. Also, neither dummy arguments nor function results nor structure components can be ALLOCATABLE arrays. In particular, if the size of the result of a function cannot be determined at the time the function is entered, its result must be declared as a POINTER and not as an array. Together, these restrictions cripple this concept to the point where it is virtually useless.
To illustrate some of the problems with ALLOCATABLE arrays, consider the multiplication of two non-square matrices A and B where the final product matrix is supposed to redefine the matrix variable B. In ACRITH-XSC, due to its "object-oriented" approach, the notation for this is as simple as in mathematics:

    DYNAMIC / REAL(:,:) / A, B
    ...
    B = A * B
In Fortran 90, on the other hand, one has to write something like this:
    REAL, DIMENSION(:,:), ALLOCATABLE :: A, B, TEMP
    ...
    ALLOCATE ( TEMP(SIZE(A,1), SIZE(B,2)) )
    TEMP = MATMUL(A, B)
    DEALLOCATE (B)
    ALLOCATE ( B(SIZE(TEMP,1), SIZE(TEMP,2)) )
    B = TEMP
    DEALLOCATE (TEMP)

The only alternative to ALLOCATABLE arrays is the array pointer. In Fortran 90, a POINTER can be used in all of the situations mentioned above. However, pointers are unsafe in the same sense as in Pascal. Also, as for ALLOCATABLE arrays, no garbage collection is required of an implementation. Frequently in array applications written in Fortran 90, one has to resort to pointers in place of ALLOCATABLE arrays even though the problem itself does not require pointers at all. Since DYNAMIC arrays in ACRITH-XSC do not have the same kinds of problems, pointers are not provided in ACRITH-XSC. In Fortran 90, pointers are strongly typed, just as in Pascal. A pointer is declared by the POINTER attribute. Unfortunately, the initial status of a pointer is undefined, that is, the Fortran 90 standard does not require an implementation to preset the association status of a pointer to disassociated. Any reference to an uninitialized pointer may therefore result in totally unpredictable results. In particular, the intrinsic function ASSOCIATED cannot deliver any reliable information in this case. The conscientious user is thus forced to initialize all pointers by explicitly disassociating them using the NULLIFY statement. Regrettably, static initialization of pointers (e. g. in a DATA statement) is also prohibited; a constant NIL or NULL does not even exist in Fortran 90.
A typical situation where one is involuntarily forced to use pointers is when trying to write a function that returns an array result whose size and shape cannot be determined at the time of invocation of the function. For example, a function that reads and returns a vector of unknown length can only be defined as a function returning a pointer:

    FUNCTION read_vector ()
      REAL, DIMENSION(:), POINTER :: read_vector
      INTEGER dim
      READ (*,*) dim
      ALLOCATE ( read_vector(dim) )
      READ (*,*) read_vector
    END FUNCTION read_vector

Since it is impossible to predict the size of the result, it is also impossible to correctly allocate a variable to hold the result, and thus one cannot legitimately assign the
function result to a variable. The only way out of this dilemma is to use pointer assignment (indicated by the pointer assignment symbol => ):
    REAL, DIMENSION(:), POINTER :: my_vector
    ...
    my_vector => read_vector()
In contrast, all of this can be done quite naturally using DYNAMIC arrays in ACRITH-XSC. In particular, the result of a function can be specified to be a dynamic array by declaring the function's result type in a DYNAMIC statement. The size of the result array of such a function does not have to be known to the calling procedure at the time the function is called. In ACRITH-XSC, it is always the function itself that decides when and how to allocate its result. In summary, ALLOCATABLE arrays in Fortran 90 seem to be much less useful than DYNAMIC arrays in ACRITH-XSC. The problems, inconveniences, and dangers which have been associated with Pascal-like pointers for the past 20 years persist with the POINTER concept of Fortran 90.
6 User-Defined Operators
In many applications, it is more convenient to use operators instead of function calls. An expressional notation using operator symbols is generally much easier to read and write. In ACRITH-XSC, any external (user-defined) function with one or two arguments can be called via a unary or binary operator, respectively. An operator symbol or name can be associated with such a function in an OPERATOR statement. Any predefined operator symbol (e. g. +, **, -<) or name (e. g. .NOT., .EQ., .IN.) may be employed, or the user may choose to invent new operator names with up to 31 characters enclosed in periods (e. g. .vectornorm.):
    OPERATOR /  = DIVDWN (INTEGER, INTEGER) INTEGER,  &
             // = MODULO (INTEGER, INTEGER) INTEGER
    OPERATOR .MUL.        = DYPROD (REAL(:), REAL(:)) DOUBLE REAL(:,:)
    OPERATOR .vectornorm. = ABSSUM (REAL(:)) DOT PRECISION
The data type of (the left and) the right operand is indicated in parentheses, the result type is given after these parentheses. The combination of the operator symbol/name and the operand type(s) must be unique in order to be able to reference the correct function. In the example above, the first operator declaration redefines standard integer division, the second overloads the concatenation operator // for integer operands, the third defines a new binary operator .MUL. for real vectors, and the fourth defines a new unary operator .vectornorm. for a real vector operand. The result types are integer, integer, double precision real matrix, and dot precision, respectively. The four functions associated with these operators could be implemented as follows:
    INTEGER FUNCTION DIVDWN (i, j)
**  Integer division with rounding downwards (to minus infinity)
    INTEGER i, j
    IF ( j .NE. 0 ) THEN
      DIVDWN = IDOWN( DBLE(i) /< DBLE(j) )
    ELSE
      ERROR ( 'Division by 0' )
    END IF
    END

    INTEGER FUNCTION MODULO (i, j)
**  Remainder of integer division with rounding downwards
    INTEGER i, j
    OPERATOR / = DIVDWN (INTEGER, INTEGER) INTEGER
    MODULO = i - (i/j) * j
    END

    FUNCTION DYPROD (a, b)
**  This function computes the double precision dyadic product of two
**  single precision real vectors without any rounding errors.
    DYNAMIC / REAL(:) / a, b, / DOUBLE REAL(:,:) / DYPROD
    INTEGER i, j
    ALLOCATE DYPROD( LB(a):UB(a), LB(b):UB(b) )
    DO 10, i = LB(a), UB(a)
      DO 10, j = LB(b), UB(b)
        DYPROD(i,j) = DPROD(a(i), b(j))
 10 CONTINUE
    END

    DOT PRECISION FUNCTION ABSSUM (v)
**  This function computes the exact sum of the absolute values of the
**  components of the vector v without error. The result is an unrounded
**  dot precision value (see next section).
    DYNAMIC / REAL(:) / v
    INTEGER i
    ABSSUM = # ( SUM( ABS(v(i)), i = LB(v), UB(v) ) )
    END
Note that ACRITH-XSC allows functions with an array result. The result of the function DYPROD is declared as a dynamic array and allocated with the appropriate shape within the function.
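The role of the DOT PRECISION accumulator used by ABSSUM above can be imitated with rational arithmetic: accumulate exactly, round once at the end. The Python stand-in below (the name `abssum` and the Fraction-based accumulator are this sketch's, not ACRITH-XSC's) shows the difference a single final rounding makes:

```python
from fractions import Fraction

def abssum(v):
    """Exact sum of absolute values, then a single final rounding."""
    acc = sum((abs(Fraction(x)) for x in v), Fraction(0))  # error-free
    return float(acc)                                      # one rounding

v = [0.1] * 10
assert abssum(v) == 1.0       # single rounding of the exact sum
assert sum(v) != 1.0          # ten intermediate roundings drift away
```

Ten copies of the double nearest to 0.1 sum exactly to a value within half an ulp of 1.0, so the exactly accumulated result rounds to 1.0, while the naive left-to-right sum does not.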
All predefined operators, whether symbolic or named, may be redefined (for existing operand types) and overloaded (for new operand types) any number of times as long as the operation (i. e. implementing function) can be uniquely determined.
This does not change their priority. A redefinition of a predefined operator for an existing operand type (combination) masks the intrinsic definition. Operators with user-defined (non-predefined) names may be overloaded as many times as desired. These always have lowest priority if they are binary and highest priority if they are unary. The operator priorities in ACRITH-XSC are exactly as in Fortran 90. In Fortran 90, operators can be defined through INTERFACE blocks. Redefinition of an intrinsic usage of an operator (e. g. integer division) is not allowed in Fortran 90. Otherwise, the concept is very much like in ACRITH-XSC, except that the notation is much more verbose:
    INTERFACE OPERATOR (.MUL.)
      FUNCTION DYPROD (a, b)
        REAL, DIMENSION(:), INTENT(IN) :: a, b
        DOUBLE PRECISION, DIMENSION(:,:) :: DYPROD
      END FUNCTION DYPROD
    END INTERFACE

Besides operator overloading, Fortran 90 also allows overloading of functions, subroutines, and assignment.
7 Accurate Dot Products and Dot Product Expressions

In ACRITH-XSC, the fundamental tool for achieving high accuracy in a computation is the accurate dot product, which is capable of calculating arbitrary products of two vectors (inner products) with just one final rounding. All vector/matrix products are implemented using the accurate dot product, so their results are always accurate to 1 ulp (in every component) regardless of the input data. Furthermore, ACRITH-XSC provides a unique tool for the exact evaluation of arbitrary sums of numbers, vectors, matrices, and simple products thereof: dot product expressions. Such expressions are evaluated without error. The final rounding to be applied to the exact result can be chosen by the user. This important numerical tool has no analogue in Fortran 90, and it seems extremely difficult to even simulate it. Syntactically, a dot product expression is defined as a finite sum of simple expressions. A simple expression may be a constant, a variable, or a single product of these. A variable may be a scalar, a vector, or a matrix of type real or complex. Depending on the result type of the expression, it is called a scalar, vector, or matrix dot product expression. The exact evaluation of a dot product expression is requested by prefixing the parenthesized expression by the sharp-sign #. The # sign may optionally be followed by a rounding symbol. Omitting the rounding symbol is only allowed in the scalar case and delivers the exact (unrounded) result to full accuracy in a variable
of type DOT PRECISION or DOT PRECISION COMPLEX. Only assignment, addition, subtraction, and the standard comparisons are allowed for dot precision values. Dot precision variables may also appear as summands in any scalar dot product expression. The possible rounding modes of dot product expressions are:

    #     unrounded, exact (DOT PRECISION, DOT PRECISION COMPLEX)
    #*    rounding to nearest (1/2 ulp)
    #>    rounding upwards (1 ulp)
    #<    rounding downwards (1 ulp)
    ##    rounding to the smallest enclosing interval (1 ulp)
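For a scalar sum, the #-modes can be mimicked in Python: evaluate exactly with rationals, then apply the requested final rounding. The function name `sharp_sum` and its mode strings are this sketch's invention, not ACRITH-XSC syntax:

```python
import math
from fractions import Fraction

def sharp_sum(terms, mode="*"):
    """Exact sum of float terms, then one final rounding:
    '*' to nearest, '<' downwards, '>' upwards (a sketch of #*, #<, #>)."""
    exact = sum(map(Fraction, terms), Fraction(0))
    x = float(exact)                          # round to nearest
    if mode == "<" and Fraction(x) > exact:
        x = math.nextafter(x, -math.inf)
    elif mode == ">" and Fraction(x) < exact:
        x = math.nextafter(x, math.inf)
    return x

terms = [1e16, 1.0]                 # exact value 10**16 + 1, not a double
lo, hi = sharp_sum(terms, "<"), sharp_sum(terms, ">")
assert Fraction(lo) < 10**16 + 1 < Fraction(hi)   # a 1-ulp enclosure, like ##
```

The pair (lo, hi) is exactly the smallest enclosing interval that the ## mode would deliver for this sum.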
In practice, dot product expressions may contain a large number of terms, making an explicit notation very cumbersome. In mathematics, the summation symbol Σ is used for short. ACRITH-XSC provides the equivalent short-hand notation SUM. For instance,

    ## ( SUM( A(:,:,i) * B(:,:,i), i = 1, n ) )
will produce a sharp (1 ulp) interval enclosure of a sum of n matrix products. Note that the subarray notation is available in dot product expressions. It is quite likely that the implementors of Fortran 90 compilers and the manufacturers of floating-point units have not yet fully realized the impact of the new intrinsic functions DOTPRODUCT, MATMUL, and SUM in the Fortran 90 standard. These functions require careful implementation if they are to deliver mathematically meaningful results. For the Fortran 90 programmer, these functions appear to be very welcome since they seem to provide a portable way of specifying these highly common operations, especially as they are inherently difficult to implement. At the same time, however, it is very dangerous to employ these functions if they are not correctly implemented. The user may obtain anything from accurate to irrelevant results due to the ill effects of rounding and leading digit cancellation. Furthermore, the user has no knowledge or control of the order in which accumulation operations are performed. This makes any kind of realistic error analysis virtually impossible. The inevitable consequence of this situation is that these three new intrinsic functions are unusable for all practical purposes, at least if one wishes to write portable Fortran 90 programs which deliver reliable results. Tests on large vector computers show that simple rearrangement of the components of a vector or a matrix can result in vastly different results [8, 25]. Different compilers with different optimization and vectorization strategies and different computational modes (e. g. scalar mode or vector mode with a varying number of vector pipes) are often responsible for incompatible and unreliable results.
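The order-dependence is easy to demonstrate in any IEEE 754 environment; Python doubles behave like any other:

```python
import math

data = [1.0, 1e100, -1e100]
assert sum(data) == 0.0            # left to right: the 1.0 is swallowed
assert sum(reversed(data)) == 1.0  # other order: exact cancellation first
# an exactly accumulated sum (math.fsum) is order-independent:
assert math.fsum(data) == math.fsum(reversed(data)) == 1.0
```

The same three numbers, summed in two different orders, give two different floating-point results; exact accumulation with a single final rounding (the principle behind the accurate dot product) removes the order-dependence entirely.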
As an example, consider the computation of the trace of the n x n product matrix C of an n x k matrix A and a k x n matrix B, which is defined by

    trace(C) = trace(A*B) = sum_{i=1}^{n} C_ii = sum_{i=1}^{n} sum_{j=1}^{k} A_ij * B_ji .
In ACRITH-XSC, this double sum can be calculated accurately and effectively by the following simple expression:
    TRACE = #* ( SUM( A(i,:) * B(:,i), i = 1, n ) )

The notation is simple and effective and the computed result is guaranteed to be accurate to 1/2 ulp in every case. In contrast, the corresponding FORTRAN 77 program looks something like this:
    TRACE = 0.0
    DO 10 I = 1, N
      DO 20 J = 1, K
        TRACE = TRACE + A(I,J) * B(J,I)
 20   CONTINUE
 10 CONTINUE
This program has two disadvantages: it is hard to read, and its numerical results are unreliable. The computational process involves 2nk rounding operations. Cancellation can, and often will, occur in the accumulation process. This leads to results of unknown accuracy at best, or to completely wrong and meaningless results if many leading digits cancel. When using Fortran 90, the notation becomes somewhat simpler, but still fairly cumbersome since there is no operator for dot products:
    TRACE = 0.0
    DO I = 1, N
      TRACE = TRACE + DOTPRODUCT(A(I,:), B(:,I))
    END DO

Furthermore, the accuracy problems persist because in the computation of dot products, the products are typically rounded before they are added and the accumulation is typically performed in the same floating-point format in which the elements of A and B are given. Since the Fortran 90 standard does not impose any accuracy requirements on intrinsic functions such as SUM, DOTPRODUCT, or MATMUL, there are no simple remedies.
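The difference between the two accumulation styles can be made concrete in Python (a stand-in only: Fraction plays the accurate dot product's long accumulator, and the values are chosen so the naive sum visibly loses the small terms):

```python
from fractions import Fraction

def trace_of_product(A, B):
    """trace(A*B) = sum_i sum_j A[i][j] * B[j][i], accumulated exactly
    and rounded once at the end (a sketch, not ACRITH-XSC)."""
    acc = Fraction(0)
    for i in range(len(A)):
        for j in range(len(A[0])):
            acc += Fraction(A[i][j]) * Fraction(B[j][i])  # error-free
    return float(acc)      # single rounding: 1/2 ulp

A = [[1e16, 2.0, 2.0]]            # 1 x 3
B = [[1.0], [0.5], [0.5]]         # 3 x 1
assert trace_of_product(A, B) == 1e16 + 2        # exact: 10**16 + 2
naive = sum(A[0][j] * B[j][0] for j in range(3))
assert naive == 1e16              # both 1.0 contributions were lost
```

With exact accumulation and one final rounding, the result matches the mathematically exact value; the naive loop rounds 10^16 + 1 back to 10^16 twice and silently drops both unit contributions.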
8 Overview of ACRITH-XSC System
The ACRITH-XSC system consists of the compiler, the runtime system, the ACRITH library, and the ACRITH online training component (OTC). The ACRITH-XSC compiler performs a complete lexical, syntactic, and semantic analysis of the source code and produces detailed error messages with a precise position indication (line and character position). It generates VS FORTRAN code (IBM's standard FORTRAN 77 extension). The ACRITH-XSC runtime system consists of libraries containing the arithmetic and relational operations for all predefined numerical types (everything from single precision real arithmetic with rounding control to double precision complex interval matrix arithmetic), the type conversion functions necessary to convert between these types, a set of 24 mathematical functions, each for 8 different argument types, and many other support operations. All array operators support static and dynamic arrays and subarrays and automatically perform memory management, subarray addressing, etc. Furthermore, ACRITH-XSC provides simplified interfaces to ACRITH's problem solving routines by taking advantage of interval types and dynamic arrays and by eliminating the error parameter (return code). Error handling is fully integrated into the runtime system. ACRITH-XSC is available for IBM/370 systems running under VM. Besides providing the arithmetic foundations for ACRITH-XSC, the ACRITH library offers problem-solving routines for many standard problems of numerical mathematics such as solving systems of linear and nonlinear equations, computing eigenvalues and eigenvectors, finding zeros of polynomials, evaluating polynomials and expressions accurately, and more. Any result delivered by ACRITH is verified to be correct by the computer. Even in many ill-conditioned cases, ACRITH computes tight bounds on the true solution. If there exists no solution or if the problem is so ill-conditioned that verification fails, an error message is issued.
The online training component (OTC) provides an ideal way to become acquainted with the basic tools and problem-solving routines offered by ACRITH. The OTC can be used to solve small problems interactively.
9 Concluding Remarks
Numerically, the Fortran 90 standard is still deficient in many ways. The mathematical properties, in particular the accuracy of the arithmetic operators and the mathematical standard functions, remain unspecified. In ACRITH-XSC, these are an integral part of the language definition. In the presence of a growing number of mathematical processors conforming to standards such as the IEEE Standard 754 for Binary Floating-point Arithmetic [2] and the IEEE Standard 854 for Radix-Independent Floating-Point Arithmetic [3],
the reluctance of programming language design committees to provide easy access to the elementary arithmetic operations with directed roundings by incorporating special operator symbols such as +<, +>, -<, ->, *<, *>, /<, /> is incomprehensible. Since none of the major programming languages provides any simple means to access these fundamental operations, it is not astonishing that they are seldom used. Interval arithmetic is one way of making these operations accessible and more widely accepted. The IMACS-GAMM Resolution on Computer Arithmetic [12] requires that all arithmetic operations, in particular the compound operations of vector computers such as "multiply and add", "accumulate", and "multiply and accumulate", be implemented in such a way that guaranteed bounds are delivered for the deviation of the computed floating-point result from the exact result. A result that differs from the mathematically exact result by at most 1 ulp (i. e. by just one rounding) is highly desirable and always obtainable, as demonstrated by ACRITH-XSC. The "Proposal for Accurate Floating-point Vector Arithmetic" (see article in this volume) essentially requires the same mathematical properties for vector operations as are required for the elementary arithmetic operations by the IEEE Standards. Hopefully, such user requests will influence the hardware design of computing machinery, especially of supercomputers [14, 16], in the near future.
A Sample Program: Initial Value Problem
The following sample program is intended to illustrate the usage and power of the programming tool ACRITH-XSC. As such, it should be viewed as a prototype implementation of a differential equation solver which can be improved and refined in many ways. However, even this simple program serves its purpose quite well, namely to compute enclosures of the solution of an initial value problem over an interval (not only at discrete points). The program is quite flexible: in order to treat another differential equation, only one line in the function subprogram F needs to be changed. It was tested on various linear and nonlinear first-order differential equations, giving guaranteed answers of satisfactory accuracy. However, because of its simplistic approach, it cannot be expected to compute tight enclosures forever. Eventually, overestimation will take over and the intervals will blow up. A much more sophisticated differential equation solver for a large class of systems of ordinary differential equations which does not suffer from this problem was written by Lohner [20, 21]. Its diverse areas of application include IVPs, BVPs, EVPs, and periodicity and bifurcation problems. It uses Taylor expansions, fast automatic differentiation, and parallelepipeds to avoid the "wrapping effect" (interval overestimation). Existence and uniqueness of the solution within the computed continuous bounds are proved by the computer, making it a reliable and unique tool for engineers and numerical analysts.
ACRITH-XSC: A Fortran-like Language for Verified Scientific Computing
Mathematical Problem: Find an enclosure of the true solution function of the initial value problem

   y' = f(x, y),   y(a) = η.

It is assumed that f : D → R is continuous on the domain D ⊂ R² and satisfies a local Lipschitz condition in y. This guarantees existence and uniqueness of the solution y* for any initial point (a, η) ∈ D as long as (x, y*(x)) ∈ D. The method of solution is the well-known Picard-Lindelöf iteration. It is performed on successive intervals [x₀, x₁], [x₁, x₂], .... For simplicity, the points a = x₀ < x₁ < x₂ < ... are assumed to be equidistant with x_k = a + k·h, where h is a fixed step size. The iteration on the interval [x_k, x_{k+1}] is defined by

   y_{i+1}(x) := (T y_i)(x) := y(x_k) + ∫_{x_k}^{x} f(t, y_i(t)) dt,   i = 0, 1, 2, ....

The initial approximation of the solution on the interval [x_k, x_{k+1}] is the constant function y_0(x) := y(x_k). For k = 0, this is the initial value η. For k > 0, this is the approximate function y evaluated at the right endpoint of the previous interval. The mapping T is a contraction, so convergence is guaranteed.
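The Picard-Lindelöf iteration itself is easy to follow in a small floating-point sketch (Python; plain polynomial coefficients instead of interval polynomials, and the equation y' = y with y(0) = 1 instead of the example treated below, so the iterates can be checked against exp):

```python
import math

def integrate(coeffs, const):
    """Antiderivative of the polynomial, with value `const` at 0."""
    return [const] + [c / (i + 1) for i, c in enumerate(coeffs)]

def picard(f_apply, eta, degree, iterations):
    """Picard iteration y_{i+1}(x) = eta + integral_0^x f(t, y_i(t)) dt,
    truncated to a fixed maximum polynomial degree."""
    y = [eta] + [0.0] * degree          # constant initial approximation
    for _ in range(iterations):
        y = integrate(f_apply(y), eta)[: degree + 1]
    return y

# y' = f(x, y) = y, y(0) = 1: the iterates converge to the Taylor series of exp
coeffs = picard(lambda y: y, 1.0, degree=10, iterations=12)
value = sum(c * 0.5 ** i for i, c in enumerate(coeffs))
print(abs(value - math.exp(0.5)) < 1e-9)
```

The ACRITH-XSC program below performs the same iteration, but with interval polynomials, so that the fixed point is enclosed rather than merely approximated.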
Mathematical Model: In order to be able to model functions on the computer, one has to choose a convenient base, that is, a class of functions that permits fairly simple implementation on the computer. Polynomials as approximations of the Taylor series are a good choice. In the proposed program, polynomials with a fixed maximum degree k are used. In fact, since a continuous enclosure of the solution function is sought, interval polynomials with a fixed maximum degree k are employed throughout the computation. A variable polynomial degree could have been used with little extra effort. An interval polynomial is a polynomial with interval coefficients. The graph of an interval polynomial is bounded by an upper function and a lower function, both continuous. For x ≥ 0, the upper function is the polynomial with coefficients equal to the suprema of the interval coefficients of the interval polynomial, the lower function is the polynomial with the infima as coefficients. For x ≤ 0, the upper function is the polynomial with coefficients equal to the suprema for even powers of x and equal to the infima for odd powers of x. For the lower function, the opposite extrema are taken as the polynomial coefficients. In general, the band of functions defined by the graph of an interval polynomial grows in width as the distance from the y-axis increases. To keep things simple,
one should only work in one of the two half planes (the right or the left) at any given time. There, the upper and the lower function of an interval polynomial are polynomials. It is important to realize that interval polynomials can be employed to compute enclosures of functions which are not themselves polynomials. It is often sufficient to show that a function lies within the graph of an interval polynomial, i. e. between the lower and the upper function. However, care has to be taken when substituting interval polynomial arithmetic for the arithmetic in a more general function space. For instance, differentiation of an interval polynomial cannot possibly deliver an enclosure of the derivatives of all continuously differentiable functions lying within the graph of the interval polynomial. Integration, on the other hand, does deliver such an enclosure. The sample program below consists of three parts: a function F defining the right-hand side of the differential equation, a collection of functions providing elementary interval polynomial arithmetic and other fundamental operations on interval polynomials, and a main program IVP performing the Picard-Lindelöf iteration on successive subintervals of the interval of integration, thus computing a continuous enclosure of the solution of the initial value problem. The upper and the lower function which are computed are thus continuous, piecewise polynomial functions which are guaranteed to bound the true solution. The following ACRITH-XSC function defines the right-hand side of the differential equation y' = x² − y²:
      FUNCTION F ( X, Y )
**    Import interval polynomial arithmetic:
      %INCLUDE IPOLYOPS INCLUDE
      DYNAMIC / IPOLY / X, Y, F
**    Sample differential equation:  y' = x*x - y*y
      F = X * X - Y * Y
      END
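Before turning to the arithmetic itself, the upper- and lower-function description given above can be made concrete in a small sketch (Python, hypothetical; coefficients are (infimum, supremum) pairs and x is restricted to the right half plane):

```python
def bounds_at(ipoly, x):
    """Evaluate an interval polynomial (list of (inf, sup) coefficients) at x >= 0.

    For x >= 0 every power of x is nonnegative, so the lower function uses
    the infima and the upper function the suprema of the coefficients."""
    assert x >= 0
    lower = sum(lo * x ** i for i, (lo, hi) in enumerate(ipoly))
    upper = sum(hi * x ** i for i, (lo, hi) in enumerate(ipoly))
    return lower, upper

# Any polynomial whose coefficients lie inside the coefficient intervals
# stays between the lower and upper function:
ipoly = [(0.9, 1.1), (-2.05, -1.95), (0.5, 0.6)]
p = lambda x: 1.0 - 2.0 * x + 0.55 * x ** 2       # one such polynomial
lo, hi = bounds_at(ipoly, 1.5)
print(lo <= p(1.5) <= hi)
```

In the real program the coefficient bounds are machine intervals and every evaluation is itself performed with outward rounding; the sketch only shows the geometric idea.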
Source code that is needed in different places in an ACRITH-XSC program can be imported from an INCLUDE file. The following listing represents the contents of the file IPOLYOPS INCLUDE, where the data type and the operators for interval polynomial arithmetic are defined. Note that for each binary operator, several versions for different type combinations of the operands are provided for convenience.
**    IPOLY is the data type for interval polynomials:
      DYNAMIC / IPOLY = DOUBLE INTERVAL (:) /
**    These are the unary operators for IPOLY:
      OPERATOR + = PPLUS  ( IPOLY ) IPOLY
      OPERATOR - = PMINUS ( IPOLY ) IPOLY
**    These are the binary operators for IPOLY:
      OPERATOR + = PPADD ( IPOLY, IPOLY ) IPOLY
      OPERATOR + = PIADD ( IPOLY, DOUBLE INTERVAL ) IPOLY
      OPERATOR + = IPADD ( DOUBLE INTERVAL, IPOLY ) IPOLY
      OPERATOR - = PPSUB ( IPOLY, IPOLY ) IPOLY
      OPERATOR - = PISUB ( IPOLY, DOUBLE INTERVAL ) IPOLY
      OPERATOR - = IPSUB ( DOUBLE INTERVAL, IPOLY ) IPOLY
      OPERATOR * = PPMUL ( IPOLY, IPOLY ) IPOLY
      OPERATOR * = PIMUL ( IPOLY, DOUBLE INTERVAL ) IPOLY
      OPERATOR * = IPMUL ( DOUBLE INTERVAL, IPOLY ) IPOLY
      OPERATOR / = PPDIV ( IPOLY, IPOLY ) IPOLY
      OPERATOR / = PIDIV ( IPOLY, DOUBLE INTERVAL ) IPOLY
      OPERATOR / = IPDIV ( DOUBLE INTERVAL, IPOLY ) IPOLY
The following is the main program for the computation of a continuous enclosure of the solution (or family of solutions) of an initial value problem:
      PROGRAM IVP
**    Import interval polynomial arithmetic:
      %INCLUDE IPOLYOPS INCLUDE
      OPERATOR .Integral. = INTGRL (IPOLY) IPOLY ,
     &         .at. = HORNER (IPOLY, DOUBLE INTERVAL) DOUBLE INTERVAL
      EXTERNAL F, MONOMI
      DYNAMIC / IPOLY / X, Yold, Ynew, F, MONOMI
      DOUBLE INTERVAL eta, b, h, hival
      INTEGER degree
**    Global variables hival and degree must be initialized below:
      COMMON / IRANGE / hival, / MAXDEG / degree
      WRITE(*,*) 'init. value, endpoint, step size, max. degree : '
      READ (*,*) eta, b, h, degree
**    Starting point assumed to be at x=0:  Y(0) = eta
**    Interval of integration is [0,b] ( or [b,0] if b < 0 ):
      hival = 0 .CH. h
**    Monomial of first degree:
      X = MONOMI(1)
**    Do the following for each subinterval:
      REPEAT
**       Initial value is the constant polynomial Y = eta:
         Ynew = eta * MONOMI(0)
**       Picard-Lindeloef iteration:
         REPEAT
            Yold = Ynew
            Ynew = eta + .Integral. F( X, Yold )
         UNTIL ( Ynew .SB. Yold )
**       Repeat until inclusion achieved ( i. e. Ynew subset of Yold )
**       Starting point of next subinterval:
         X(0) = X(0) + h
**       Polynomial Y evaluated at h ( using Horner's scheme ):
         eta = Ynew .at. h
         WRITE(*,*) 'The solution at ', SUP(X(0)),
     &              ' lies in the interval ', eta
      UNTIL ( SUP(ABS(X(0))) .GE. SUP(ABS(b)) )
**    Repeat until endpoint b reached
      END
The remaining functions provide the necessary support functions and operator implementations for interval polynomial arithmetic:

      FUNCTION INTGRL ( P )
**    Integral of the interval polynomial P over the interval hival
      DYNAMIC / IPOLY = DOUBLE INTERVAL (:) / P, INTGRL
      DOUBLE INTERVAL hival
      INTEGER i, degree
      COMMON / IRANGE / hival
      degree = UB(P)
      ALLOCATE INTGRL (=P)
      INTGRL(0) = 0
      DO 10, i = 1, degree-1
         INTGRL(i) = P(i-1)/i
   10 CONTINUE
      INTGRL(degree) = P(degree-1)/degree + hival*P(degree)/(degree+1)
      END

      FUNCTION HORNER ( P, X )
**    Evaluation of the interval polynomial P on the interval X
      DYNAMIC / IPOLY = DOUBLE INTERVAL (:) / P
      DOUBLE INTERVAL HORNER, X
      INTEGER i, degree
      degree = UB(P)
      HORNER = P(degree)
      DO 10, i = degree-1, 0, -1
         HORNER = HORNER * X + P(i)
   10 CONTINUE
      END

      FUNCTION MONOMI ( i )
**    Returns monomial of degree i with coefficient 1
      DYNAMIC / IPOLY = DOUBLE INTERVAL (:) / MONOMI
      INTEGER i, degree
      COMMON / MAXDEG / degree
      ALLOCATE MONOMI (0:degree)
      MONOMI = 0
      MONOMI(i) = 1
      END
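The last statement of INTGRL is the only delicate one: exact integration would raise the polynomial degree by one, so the leading term p·x^(degree+1)/(degree+1) is absorbed into the degree coefficient as hival·p/(degree+1), which is valid because x ∈ hival on the current subinterval. A small Python check of this containment (a sketch under the simplifying assumptions 0 ≤ x ≤ h and p ≥ 0):

```python
def enclose_top_term(p_top, degree, h, x):
    """Compare the exact integral term p_top * x**(degree+1) / (degree+1)
    with its degree-preserving interval enclosure
    [0, h] * p_top / (degree + 1) * x**degree
    (assumes 0 <= x <= h and p_top >= 0 for simplicity)."""
    exact = p_top * x ** (degree + 1) / (degree + 1)
    lo = 0.0
    hi = h * p_top / (degree + 1) * x ** degree
    return lo <= exact <= hi

# Containment holds for every x in [0, h]:
print(all(enclose_top_term(2.5, 4, 0.25, 0.25 * k / 10) for k in range(11)))
```

In the general interval case, hival·P(degree) is an interval product, so the same argument covers coefficients of either sign.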
      FUNCTION PPLUS ( A )
**    unary + for interval polynomials
      DYNAMIC / IPOLY = DOUBLE INTERVAL (:) / A, PPLUS
      PPLUS = A
      END

      FUNCTION PMINUS ( A )
**    unary - for interval polynomials
      DYNAMIC / IPOLY = DOUBLE INTERVAL (:) / A, PMINUS
      PMINUS = IPOLY(-A)
      END

      FUNCTION PPADD ( A, B )
**    interval polynomial + interval polynomial
      DYNAMIC / IPOLY = DOUBLE INTERVAL (:) / A, B, PPADD
      PPADD = IPOLY(A + B)
      END

      FUNCTION PPSUB ( A, B )
**    interval polynomial - interval polynomial
      DYNAMIC / IPOLY = DOUBLE INTERVAL (:) / A, B, PPSUB
      PPSUB = IPOLY(A - B)
      END
      FUNCTION PPMUL ( A, B )
**    interval polynomial * interval polynomial
      DYNAMIC / IPOLY = DOUBLE INTERVAL (:) / A, B, Bback, C, PPMUL
      DOUBLE INTERVAL hival, help
      INTEGER i, j, degree
      COMMON / IRANGE / hival
      degree = UB(A)
      ALLOCATE PPMUL, Bback, C (=A)
      DO 10, i = 0, degree
         Bback(degree-i) = B(i)
   10 CONTINUE
      DO 20, i = 0, degree-1
         PPMUL(i) = A(0:i) * Bback(degree-i:degree)
         C(degree-i) = A(degree-i:degree) * Bback(0:i)
   20 CONTINUE
      C(0) = A * Bback
      help = C(degree)
      DO 30, i = degree-1, 0, -1
         help = C(i) + hival*help
   30 CONTINUE
      PPMUL(degree) = help
      END

      FUNCTION PIADD ( A, B )
**    interval polynomial + interval
      DYNAMIC / IPOLY = DOUBLE INTERVAL (:) / A, PIADD
      DOUBLE INTERVAL B
      PIADD = A
      PIADD(0) = A(0) + B
      END

      FUNCTION PISUB ( A, B )
**    interval polynomial - interval
      DYNAMIC / IPOLY = DOUBLE INTERVAL (:) / A, PISUB
      DOUBLE INTERVAL B
      PISUB = A
      PISUB(0) = A(0) - B
      END

      FUNCTION PIMUL ( A, B )
**    interval polynomial * interval
      DYNAMIC / IPOLY = DOUBLE INTERVAL (:) / A, PIMUL
      DOUBLE INTERVAL B
      PIMUL = IPOLY(A * B)
      END
      FUNCTION PIDIV ( A, B )
**    interval polynomial / interval
      DYNAMIC / IPOLY = DOUBLE INTERVAL (:) / A, PIDIV
      DOUBLE INTERVAL B
      PIDIV = IPOLY(A / B)
      END

      FUNCTION IPADD ( A, B )
**    interval + interval polynomial
      DYNAMIC / IPOLY = DOUBLE INTERVAL (:) / B, IPADD
      DOUBLE INTERVAL A
      IPADD = B
      IPADD(0) = A + B(0)
      END

The remaining functions are analogous to those listed. One may use this program to determine the critical initial value η_crit that separates the solutions tending to +∞ from those tending to −∞. The program was used to show that initial values η ≥ −0.67597824006728 generate a solution that tends to +∞ as x → ∞ and that initial values η ≤ −0.67597824006729 generate a solution that tends to −∞ as x → ∞, that is, η_crit must lie in the interval

   (< -0.67597824006729, -0.67597824006728 >).
Interval polynomials with maximum degree k = 20 and a step size of h = 1/32 were used to compute the following enclosures:

   x       enclosure of y(x)
   1       (<-0.13451294981468805D+01,-0.13451294981468731D+01>)
   2       (<-0.22171067575491852D+01,-0.22171067575488672D+01>)
   3       (<-0.31551202908832364D+01,-0.31551202908148874D+01>)
   4       (<-0.41197590532724216D+01,-0.41197589548903563D+01>)
   5       (<-0.50941624065188669D+01,-0.50931742228995910D+01>)
   5.125   (<-0.52088910466262548D+01,-0.52053087180506980D+01>)
   5.25    (<-0.53014650866408702D+01,-0.52881363203272424D+01>)
   5.375   (<-0.53078378288300164D+01,-0.52575566580706822D+01>)
   5.5     (<-0.49873468944161677D+01,-0.48013292936427300D+01>)
   5.625   (<-0.36159007364410454D+01,-0.30237178501827985D+01>)
   5.75    (<-0.34544323099867103D+00, 0.77662057330902746D+00>)
   5.875   (< 0.27367307402849199D+01, 0.46705472098948469D+01>)
   6       (< 0.19690738660628261D+01, 0.82381052127498358D+01>)

   x       enclosure of y(x)
   0       (<-0.67597824006729001D+00,-0.67597824006728998D+00>)
   1       (<-0.13451294981469509D+01,-0.13451294981469437D+01>)
   2       (<-0.22171067575516328D+01,-0.22171067575513081D+01>)
   3       (<-0.31551202914069217D+01,-0.31551202913371416D+01>)
   4       (<-0.41197598070578250D+01,-0.41197597066161069D+01>)
   5       (<-0.51017400250580170D+01,-0.51007296445308598D+01>)
   5.5     (<-0.66237760651246689D+01,-0.63762471962750131D+01>)
   5.6875  (<-0.32347098618453590D+02,-0.19340414625323496D+02>)
These enclosures show that this branch of the solution remains negative. Towards the end of the computation, the accuracy deteriorates very rapidly, that is, the intervals become wide quickly. This effect can be avoided if a different technique involving Taylor expansions is used, as has been noted above.
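For comparison, the critical value can also be hunted by ordinary, unverified floating-point bisection. The following Python sketch classifies an initial value by integrating y' = x² − y² with a fixed-step RK4 method and testing the sign near x = 6; the step size, endpoint, and blow-up guard are ad hoc choices, and nothing here is guaranteed, in contrast to the enclosures above:

```python
def tends_up(eta, h=0.005, x_end=6.0):
    """Crude classifier: True if the solution of y' = x*x - y*y, y(0) = eta
    has turned positive by x_end (plain floating point, no enclosure)."""
    f = lambda x, y: x * x - y * y
    x, y = 0.0, eta
    while x < x_end:
        if abs(y) > 1e6:                     # blown up before x_end
            return y > 0
        k1 = f(x, y)
        k2 = f(x + h / 2, y + h / 2 * k1)
        k3 = f(x + h / 2, y + h / 2 * k2)
        k4 = f(x + h, y + h * k3)
        y += h / 6 * (k1 + 2 * k2 + 2 * k3 + k4)
        x += h
    return y > 0

lo, hi = -1.0, 0.0            # tends_up(lo) is False, tends_up(hi) is True
for _ in range(50):
    mid = (lo + hi) / 2
    lo, hi = (mid, hi) if not tends_up(mid) else (lo, mid)
print(lo, hi)   # the text's verified value is -0.675978240067...
```

Such a computation gives a plausible number but no statement about its correctness; the point of the interval program above is precisely that it does.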
References

[1] American National Standards Institute: American National Standard Programming Language FORTRAN. ANSI X3.9-1978, 1978.

[2] American National Standards Institute / Institute of Electrical and Electronics Engineers: IEEE Standard for Binary Floating-point Arithmetic. ANSI/IEEE Std 754-1985, New York, 1985.

[3] American National Standards Institute / Institute of Electrical and Electronics Engineers: IEEE Standard for Radix-Independent Floating-point Arithmetic. ANSI/IEEE Std 854-1987, New York, 1987.

[4] Bleher, J. H.; Rump, S. M.; Kulisch, U.; Metzger, M.; Ullrich, Ch.; Walter, W. (V.): FORTRAN-SC: A Study of a FORTRAN Extension for Engineering/Scientific Computation with Access to ACRITH. Computing 39, 93-110, Springer, 1987.

[5] Bohlender, G.; Kaucher, E.; Klatte, R.; Kulisch, U.; Miranker, W. L.; Ullrich, Ch.; Wolff von Gudenberg, J.: FORTRAN for Contemporary Numerical Computation. IBM Research Report RC 8348, 1980; Computing 26, 277-314, Springer, 1981.

[6] Bohlender, G.; Böhm, H.; Grüner, K.; Kaucher, E.; Klatte, R.; Krämer, W.; Kulisch, U.; Miranker, W. L.; Rump, S. M.; Ullrich, Ch.; Wolff von Gudenberg, J.: Proposal for Arithmetic Specification in FORTRAN 8X. Proc. of Int. Conf. on: Tools, Methods and Languages for Scientific and Engineering Computation, Paris 1983, North-Holland, 1984.

[7] Bohlender, G.; Rall, L. B.; Ullrich, Ch.; Wolff von Gudenberg, J.: PASCAL-SC: Wirkungsvoll programmieren, kontrolliert rechnen. Bibl. Inst., Mannheim, 1986.
    ...: PASCAL-SC: A Computer Language for Scientific Computation. Perspectives in Computing 17, Academic Press, Orlando, 1987.
[8] Hammer, R.: How Reliable is the Arithmetic of Vector Computers? In [26], 467-482, 1990.
[9] IBM System/370 RPQ, High-Accuracy Arithmetic. SA22-7093-0, IBM Corp., 1984.

[10] IBM High-Accuracy Arithmetic Subroutine Library (ACRITH), General Information Manual. 3rd ed., GC33-6163-02, IBM Corp., 1986.
     ..., Program Description and User's Guide. 3rd ed., SC33-6164-02, IBM Corp., 1986.

[11] IBM High Accuracy Arithmetic - Extended Scientific Computation (ACRITH-XSC), General Information. GC33-6461-01, IBM Corp., 1990.
     ..., Reference. SC33-6462-00, IBM Corp., 1990.
     ..., Sample Programs. SC33-6463-00, IBM Corp., 1990.
     ..., How to Use. SC33-6464-00, IBM Corp., 1990.
     ..., Syntax Diagrams. SC33-6466-00, IBM Corp., 1990.

[12] IMACS, GAMM: Resolution on Computer Arithmetic. In Mathematics and Computers in Simulation 31, 297-298, 1989; in Zeitschrift für Angewandte Mathematik und Mechanik 70, no. 4, p. T5, 1990; in Ch. Ullrich (ed.): Computer Arithmetic and Self-Validating Numerical Methods, 301-302, Academic Press, San Diego, 1990; in [26], 523-524, 1990; in E. Kaucher, S. M. Markov, G. Mayer (eds.): Computer Arithmetic, Scientific Computation and Mathematical Modelling, IMACS Annals on Computing and Appl. Math. 12, 477-478, J.C. Baltzer, Basel, 1991.

[13] International Standards Organization: Standard Programming Language Fortran. ISO/IEC 1539:1991, 1991.

[14] Kirchner, R.; Kulisch, U.: Arithmetic for Vector Processors. Proc. of 8th IEEE Symp. on Computer Arithmetic (ARITH 8) in Como, 256-269, IEEE Computer Society, 1987.

[15] Klatte, R.; Kulisch, U.; Neaga, M.; Ratz, D.; Ullrich, Ch.: PASCAL-XSC Sprachbeschreibung mit Beispielen. Springer, Berlin, Heidelberg, 1991.
     ...: PASCAL-XSC Language Reference with Examples. Springer, Berlin, Heidelberg, 1992.

[16] Knöfel, A.: Fast Hardware Units for the Computation of Accurate Dot Products. Proc. of 10th IEEE Symp. on Computer Arithmetic (ARITH 10) in Grenoble, 70-74, IEEE Computer Society, 1991.

[17] Kulisch, U.; Miranker, W. L.: Computer Arithmetic in Theory and Practice. Academic Press, New York, 1981.
     Kulisch, U.: Grundlagen des numerischen Rechnens: Mathematische Begründung der Rechnerarithmetik. Reihe Informatik 19, Bibl. Inst., Mannheim, 1976.

[18] Kulisch, U.; Miranker, W. L. (eds.): A New Approach to Scientific Computation. Notes and Reports in Comp. Sci. and Appl. Math., Academic Press, Orlando, 1983.

[19] Kulisch, U. (ed.): PASCAL-SC: A PASCAL Extension for Scientific Computation; Information Manual and Floppy Disks; Version IBM PC/AT, Operating System DOS. Wiley-Teubner Series in Comp. Sci., B. G. Teubner, J. Wiley & Sons, 1987; Version ATARI ST. B. G. Teubner, Stuttgart, 1987.
[20] Lohner, R. J.: Enclosing the Solutions of Ordinary Initial and Boundary Value Problems. In E. Kaucher; U. Kulisch; Ch. Ullrich (eds.): Computer Arithmetic: Scientific Computation and Programming Languages, 255-286, B. G. Teubner, Stuttgart, 1987.

[21] Lohner, R.: Einschließung der Lösung gewöhnlicher Anfangs- und Randwertaufgaben und Anwendungen. Ph. D. thesis, Univ. Karlsruhe, 1988.

[22] Metzger, M.: FORTRAN-SC: A FORTRAN Extension for Engineering/Scientific Computation with Access to ACRITH, Demonstration of the Compiler and Sample Programs. In [24], 63-79, 1988.

[23] Metzger, M.; Walter, W. (V.): FORTRAN-SC: A Programming Language for Engineering/Scientific Computation. In [26], 427-441, 1990.

[24] Moore, R. E. (ed.): Reliability in Computing, The Role of Interval Methods in Scientific Computing. Perspectives in Computing 19, Academic Press, San Diego, 1988.

[25] Ratz, D.: The Effects of the Arithmetic of Vector Computers on Basic Numerical Methods. In [26], 495-514, 1990.

[26] Ullrich, C. (ed.): Contributions to Computer Arithmetic and Self-Validating Numerical Methods. IMACS Annals on Computing and Appl. Math. 7, J.C. Baltzer, Basel, 1990.

[27] Walter, W. (V.): FORTRAN-SC: A FORTRAN Extension for Engineering/Scientific Computation with Access to ACRITH, Language Description with Examples. In [24], 43-62, 1988.

[28] Walter, W. (V.): Einführung in die wissenschaftlich-technische Programmiersprache FORTRAN-SC. ZAMM 69, 4, T52-T54, 1989.

[29] Walter, W. (V.): FORTRAN 66, 77, 88, -SC ... - Ein Vergleich der numerischen Eigenschaften von FORTRAN 88 und FORTRAN-SC. ZAMM 70, 6, T584-T587, 1990.

[30] Walter, W. V.: Flexible Precision Control and Dynamic Data Structures for Programming Mathematical and Numerical Algorithms. Ph. D. thesis, Univ. Karlsruhe, 1990.

[31] Walter, W. V.: Fortran 90: Was bringt der neue Fortran-Standard für das numerische Programmieren? Jahrbuch Überblicke Mathematik 1991, 151-175, Vieweg, Braunschweig, 1991.

[32] Walter, W. V.: A Comparison of the Numerical Facilities of FORTRAN-SC and Fortran 90. Proc. of 13th IMACS World Congress on Computation and Appl. Math. (IMACS '91) in Dublin, Vol. 1, 30-31, IMACS, 1991.
C-XSC

A Programming Environment for Verified Scientific Computing and Numerical Data Processing

Christian Lawo
C-XSC is a tool for the development of numerical algorithms delivering highly accurate and automatically verified results. It provides a large number of predefined numerical data types and operators. These types are implemented as C++ classes. Thus, C-XSC allows high-level programming of numerical applications in C and C++. The C-XSC package is available for all computers with a C++ compiler supporting the AT&T language standard 2.0.
1 Introduction
The programming language C has many weak points causing difficulties in applications to the programming of numerical algorithms. C does not provide basic numerical data structures such as vectors and matrices and does not perform index range checking for arrays. This results in unpredictable errors which are difficult to locate within numerical algorithms. Additionally, pointer handling and the lack of overloadable operators in C reduce the readability of programs and make program development more difficult. Furthermore, the possibility of controlling the accuracy and rounding direction of arithmetic operations does not exist in C. The same is true for the I/O routines in the C standard libraries, where nothing is said about conversion error and rounding direction. The programming language C++, an object-oriented C extension, has become more and more popular over the past few years. It does not by itself provide better facilities for the given problems, but its concept of abstract data structures (classes) and its concept of overloaded operators and functions provide the possibility to create a programming tool which eliminates the disadvantages of C mentioned above. Such a tool provides the C and C++ programmer with the means to write numerical algorithms producing reliable results in a comfortable programming environment, without having to give up the intrinsic language with its special qualities. The object-oriented aspects of C++ provide additional powerful language features that reduce the programming effort and enhance the readability and reliability of programs.
With its abstract data structures, predefined operators and functions, the C-XSC programming environment provides an interface between scientific computing and the programming languages C and C++. Besides, C-XSC supports the programming of algorithms which automatically enclose the solution of a given mathematical problem in verified bounds. Such algorithms deliver a precise mathematical statement about the true solution. The most important features of C-XSC are:

- real, complex, interval, and complex interval arithmetic with mathematically defined properties
- dynamic vectors and matrices
- subarrays of vectors and matrices
- dot precision data types
- predefined arithmetic operators with highest accuracy
- standard functions of high accuracy
- dynamic multiple-precision arithmetic and standard functions
- rounding control for I/O data
- error handling
- library of problem solving routines

2 Standard Data Types, Predefined Operators, and Functions
C-XSC provides the basic numerical data types real, interval, complex, and cinterval (complex interval) with their corresponding arithmetic operators, relational operators, and mathematical standard functions. All predefined arithmetic operators deliver results of at least 1 ulp (unit in the last place) accuracy. By using the types interval and cinterval, the rounding in all arithmetic operations can be controlled. Type conversion routines exist for all reasonable type combinations. Special routines are provided for constant conversion. Similarly to the predefined operators, all standard functions are available by their generic names and return results of guaranteed high accuracy for arbitrary legal arguments. Additionally, the standard functions for the types interval and cinterval enclose the range of values in tight bounds.
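The containment property behind these operators can be illustrated in miniature (a Python sketch, not the C-XSC implementation, which rounds each endpoint directly in the correct direction):

```python
import math
from itertools import product

def imul(a, b):
    """Interval product [a] * [b]: extrema of the endpoint products,
    widened outward by one ulp to absorb the rounding of each product."""
    ps = [x * y for x, y in product(a, b)]
    return (math.nextafter(min(ps), -math.inf),
            math.nextafter(max(ps), math.inf))

a, b = (0.1, 0.2), (-3.0, 0.5)
lo, hi = imul(a, b)
# every pointwise product of numbers from the two intervals is contained:
print(all(lo <= x * y <= hi for x in a for y in b))
```

The C-XSC operators achieve the same containment with tight, 1-ulp endpoints by using directed roundings instead of after-the-fact widening.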
[Table 1: Predefined standard functions. The table is only fragmentarily legible in this scan; among the listed functions are arcsine, arctangent, cosine (cos), cotangent (cot), hyperbolic cosine (cosh), square (sqr), square root (sqrt), tangent (tan), and hyperbolic tangent (tanh), each available for the scalar numerical types.]
For the above numerical scalar types, C-XSC provides the corresponding dynamic vector and matrix types: rvector, cvector, ivector, civector, rmatrix, cmatrix, imatrix, cimatrix. Dynamic arrays enable the user to allocate or free storage space for an array during execution of a program. Thus, the same program may be used for arrays of any size without recompilation. Furthermore, storage space can be employed economically, since only the arrays currently needed have to be kept in storage and since they always use exactly the space required in the current problem. Type compatibility
and full storage access security are also ensured for the predefined dynamic vector and matrix classes. The most important advantages of dynamic arrays are:

- storage space used only as needed,
- array size may change during execution,
- no recompilation for arrays of different sizes,
- complete type and index checking,
- no module space for dynamic array storage.
Example: Allocation and resizing of dynamic matrices:

   cout << "Enter the dimension n: ";
   cin >> n;
   imatrix B, C, A(n,n);           /* A[1][1]  ... A[n][n]     */
   Resize( B, -1, n-2, 2, n+1 );   /* B[-1][2] ... B[n-2][n+1] */
   ....
   C = A * B;
The declaration of a vector or matrix without index bounds defines a vector of length 1 or a 1x1-matrix, respectively. When the object is needed, it may be allocated with the appropriate size by specifying the desired index bounds as parameters of the Resize statement. Alternatively, the index bounds may be specified directly in the declaration of a vector or matrix. Furthermore, allocation of a dynamic vector or matrix occurs automatically when assigning the value of an array expression to an array if its size does not match that of the array expression (e. g. in the statement C = A * B in the example above). The storage of a dynamic array that is local to a subprogram is automatically released before control returns to the calling routine. Array inquiry functions facilitate the use of dynamic arrays. In particular, the functions Lb and Ub provide access to the lower and upper index bounds of an array.
[Table 2: Predefined arithmetic operators. The table is only fragmentarily legible in this scan; besides the arithmetic operators +, -, *, /, it lists the convex hull operator (|) and the intersection operator (&) for interval types.]
3 Subarrays of Vectors and Matrices

C-XSC provides a special notation for manipulating subarrays of vectors and matrices. Subarrays are arbitrary rectangular parts of arrays. Note that all predefined operators are also available for subarrays. Access to a subarray of a matrix or vector is gained using the ( )-operator or the [ ]-operator. The ( )-operator specifies a subarray of an object, where this subarray is of the same type as the original object. For example, if A is a real nxn-matrix, then A(i,i) is a real ixi-submatrix. Note that parentheses in the declaration of a dynamic vector or matrix do not specify a subarray, but define the index ranges of the object being allocated. The [ ]-operator generates a subarray of a "lower" type. For example, if A is a real nxn-matrix, then A[i] is the i-th row of A of type rvector, and A[i][j] is the (i,j)-th element of A of type real. Both types of subarray access can also be combined, for example: A[k](i,j) is a subvector from index i to index j of the k-th row vector of the matrix A. The capability of subarrays is illustrated in the following example describing the LU-factorization of an nxn-matrix A:
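The LU-factorization listing did not survive legibly in this scan. Its flavor can be suggested by a hypothetical Python sketch in which the slice A[i][k+1:] plays the role of the C-XSC subvector A[i](k+1,n):

```python
def lu_inplace(A):
    """Doolittle LU factorization, overwriting A with the multipliers of L
    (below the diagonal) and with U (on and above the diagonal);
    no pivoting, so nonzero leading minors are assumed."""
    n = len(A)
    for k in range(n):
        for i in range(k + 1, n):
            A[i][k] /= A[k][k]                      # multiplier l_ik
            # row subvector update, A[i](k+1,n) in C-XSC notation:
            A[i][k + 1:] = [aij - A[i][k] * akj
                            for aij, akj in zip(A[i][k + 1:], A[k][k + 1:])]
    return A

LU = lu_inplace([[2.0, 1.0, 1.0], [4.0, 3.0, 3.0], [8.0, 7.0, 9.0]])
print(LU)   # → [[2.0, 1.0, 1.0], [2.0, 1.0, 1.0], [4.0, 3.0, 2.0]]
```

In C-XSC the inner update would be a single accurate subvector operation rather than an element-by-element loop.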
This example shows how the subarray notation allows more efficient programming and reduces program complexity.
[Table 3: Predefined relational operators. The table is only fragmentarily legible in this scan; it lists, for each combination of left and right operand types, the available subset of the comparison operators ==, !=, <=, <, >=, >.]
4 Evaluation of Expressions with High Accuracy
For many numerical algorithms, the accuracy of the evaluation of arithmetic expressions is crucial for the quality of the final result. Although all predefined numerical operators and functions are highly accurate, expressions composed of several such elements do not necessarily yield results of high accuracy. However, techniques have been developed to evaluate numerical expressions with high and guaranteed accuracy.
A special class of such expressions are the so-called dot product expressions. Dot product expressions play a key role in numerical analysis. Defect correction and iterative refinement methods for linear and nonlinear problems usually lead to dot product expressions. Exact evaluation of these expressions eliminates the ill effects of cancellation. A dot product is defined as

   s = Σ_{i=1}^{n} X_i · Y_i ,

where the X_i and Y_i may be variables of type real, interval, complex or cinterval. To obtain an evaluation with 1 ulp accuracy, C-XSC provides the dot precision data types: dotprecision, cdotprecision, idotprecision, cidotprecision. Intermediate results of a dot product expression can be computed and stored in a dot precision variable without any rounding error. The following example computes an optimal inclusion of the defect b − Ax of a linear system Ax = b:

   ivector Defect ( rvector b, rmatrix A, rvector x )
   {
      idotprecision accu;
      ivector INCL (Lb(x), Ub(x));

      for (int i = Lb(x); i <= Ub(x); i++)
      {
         accu = b[i];
         accumulate (accu, -A[i], x);
         INCL[i] = rnd (accu);
      }
      return INCL;
   }
In the example above, the function accumulate computes the sum Σ_{j=1}^{n} (−A[i][j] · x[j]) and adds the result to the accumulator accu without rounding error. The idotprecision variable accu is initially assigned b[i]. Finally, the accumulator is rounded to the optimal standard interval INCL[i]. Thus, the bounds of INCL[i] will either be the same or two adjacent floating-point numbers.
For all dot precision data types there exists a reduced set of predefined operators incurring no rounding errors. The overloaded dot product routine accumulate and the rounding function rnd are available for all reasonable type combinations.
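The effect of a dot precision accumulator can be imitated in Python with exact rational arithmetic: accumulate every product without rounding and round once at the end (a sketch of the principle only; C-XSC uses a long fixed-point accumulator, not rationals):

```python
from fractions import Fraction

def exact_dot(xs, ys):
    """Dot product accumulated exactly, rounded only once to a float."""
    acc = sum(Fraction(x) * Fraction(y) for x, y in zip(xs, ys))
    return float(acc)       # single rounding at the very end

x = [1e16, 1.0, -1e16]
y = [1.0,  1.0,  1.0]
naive = (x[0] * y[0] + x[1] * y[1]) + x[2] * y[2]   # cancellation loses the 1.0
print(naive, exact_dot(x, y))   # → 0.0 1.0
```

The naive left-to-right sum returns 0.0 because 1e16 + 1.0 already rounds back to 1e16; the exactly accumulated dot product returns the true value 1.0 with a single final rounding.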
[Table 4: Predefined dot precision operators. The table is only fragmentarily legible in this scan; for each dot precision type it lists the reduced, error-free operator set (essentially +, -, assignment and compound assignment, and comparisons) available in combination with real, interval, complex, cinterval, and the dot precision types.]

Remark: All binary operators {+, -, !=, ==, &, |} are also available for the symmetric case (i. e. if a + b exists, then b + a is also available).
5 Dynamic Multiple-Precision Arithmetic
Besides the classes real and interval, the dynamic classes long real and long interval as well as the corresponding dynamic vectors and matrices are implemented including all arithmetic and relational operators and multiple-precision standard functions. The computing precision may be controlled by the user during runtime. By replacing the real and interval declarations by long real and long interval, the user’s
application program turns into a multiple-precision program. This concept provides the user with a powerful and easy-to-use tool for error analysis. Furthermore, it is possible to write programs delivering numerical results with a user-specified accuracy by internally modifying the computing precision during runtime in response to the error bounds for intermediate results within the algorithm. All predefined operators for real and interval types are also available for long real and long interval. Additionally, all possible operator combinations between single and multiple-precision types are included. The following example shows a single precision program and its multiple-precision analogue:

   main()
   {  interval a, b;            /* Standard intervals               */
      a = 1.0;                  /* a   = [1.0, 1.0]                 */
      b = 3.0;                  /* b   = [3.0, 3.0]                 */
      cout << "a/b = " << a/b;  /* a/b = [0.333, 0.334]             */
   }

   main()
   {  l_interval a, b;          /* l_interval is the class name for */
                                /* the long interval data type      */
      a = 1.0;
      b = 3.0;
      stagprec = 2;             /* global integer variable          */
      cout << "a/b = " << a/b;  /* a/b = [0.333333, 0.3333334]      */
   }
During runtime, the predefined global integer variable stagprec (staggered precision) controls the arithmetic computing precision of the underlying multiple-precision real and interval arithmetic in steps of 64-bit words. The precision level of a multiple-precision object is defined as the number of double words used to store the long number's value. An object of type long real or long interval can change its precision level during runtime. Components of a vector or a matrix may have different precision levels. All multiple-precision arithmetic routines and standard functions compute a numerical result of the precision level currently specified by stagprec. Allocation, resizing, and subarray access of multiple-precision vectors and matrices are similar to the corresponding single precision data types.
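The stagprec mechanism is loosely analogous to the context precision of Python's decimal module (an analogy only: stagprec counts 64-bit words, not decimal digits):

```python
from decimal import Decimal, getcontext

def third(digits):
    getcontext().prec = digits     # runtime-adjustable working precision
    return Decimal(1) / Decimal(3)

print(third(8))    # → 0.33333333
print(third(30))   # → 0.333333333333333333333333333333
```

As with stagprec, changing the context precision turns the same source code into a higher-precision computation without any other modification.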
6 I/O Handling in C-XSC
Using the stream concept and the overloadable operators "<<" and ">>" of C++, the C-XSC system provides rounding and formatting control during I/O (input/output) for all new data types, even for the dot precision and multiple-precision types. I/O parameters such as rounding direction, field width, etc. also use the overloaded I/O operators to manipulate I/O data. If a new set of I/O parameters is to be used, the old parameter settings can be saved on an internal stack.
New parameter values can then be defined. After having used the new settings, the old ones can be restored from the stack. The following example illustrates the use of the C-XSC I/O:

  main()
  { real a, b;
    interval c;

    cout << "Please enter real a, b: ";
    cout << RndDown;
    cin  >> a;                    /* reading a rounded downwards         */
    cout << RndUp;
    cin  >> b;                    /* reading b rounded upwards           */
    "[0.11, 0.22]" >> c;          /* string to interval conversion       */
    cout << SaveOpt;              /* save old I/O parameters on stack    */
    cout << SetPrecision(20,15);  /* set field width and digits          */
    cout << Hex;                  /* hexadecimal output format required  */
    cout << c << RestoreOpt;      /* reloading old parameters from stack */
  }
7 Error Handling in C-XSC

Besides the C++ function prototyping, type checking, safe linking and other C++ security mechanisms, the C-XSC system offers index range checks for vectors and matrices and checks for numerical errors such as overflow, underflow, loss of accuracy, illegal arguments, etc. C-XSC provides the user with various possibilities to modify the reactions of the error handler.
8 Library of Problem Solving Routines
The C-XSC problem solving library is a collection of routines for standard problems of numerical analysis producing guaranteed results of high accuracy. The following areas are covered:

• evaluation of arithmetic expressions
• matrix inversion, linear systems
• eigenvalues, eigenvectors
• systems of nonlinear equations
• evaluation and zeros of polynomials
• numerical quadrature
• initial and boundary value problems in ordinary differential equations
• integral equations
9 Conclusions
In contrast to C and C++, all predefined arithmetic operators, especially the vector and matrix operations, deliver a result of at least 1 ulp accuracy in C-XSC. There is no need to learn a lot of new C++ features in order to be able to use the C-XSC programming environment for numerical applications. In most cases, knowledge of the C language is sufficient to work with C-XSC. The advanced user can extend the C-XSC system by using object-oriented C++ programming features. Programs written in C-XSC can be combined with any other C++ software. If some elementary programming rules are respected, C-XSC programs always deliver compatible numerical results even on different computers with different C++ compilers. This means that C-XSC provides tools to achieve full numerical result compatibility in the sense of interval mathematics.
References

[1] Alefeld, G.; Herzberger, J.: Introduction to Interval Analysis. Academic Press, New York, 1983.
[2] Kaucher, E.; Kulisch, U.; Ullrich, Ch. (eds.): Computer Arithmetic: Scientific Computation and Programming Languages. B. G. Teubner, Stuttgart, 1987.
[3] Kernighan, B. W.; Ritchie, D. M.: Programmieren in C. Hanser Verlag, 1983.
[4] Kulisch, U.: Grundlagen des Numerischen Rechnens - Mathematische Begründung der Rechnerarithmetik. Reihe Informatik, Band 19, Bibliographisches Institut, Mannheim, 1976.
[5] Kulisch, U.; Miranker, W. L.: Computer Arithmetic in Theory and Practice. Academic Press, New York, 1983.
[6] Moore, R. E. (ed.): Reliability in Computing: The Role of Interval Methods in Scientific Computing. Perspectives in Computing 19, Academic Press, 1988.
[7] Stroustrup, B.: Die C++ Programmiersprache. Addison-Wesley, 1987.
[8] Ellis, M. A.; Stroustrup, B.: The Annotated C++ Reference Manual. Addison-Wesley, 1990.
A List of C-XSC Sample Programs
The examples demonstrate various concepts of C-XSC:

• Interval Newton Method
  - data type interval
  - interval operators
  - interval standard functions

• Runge-Kutta Method
  - dynamic arrays
  - array operators
  - overloading of operators
  - mathematical notation

• Trace of a Product Matrix
  - dynamic arrays
  - subarrays
  - dot product expressions

Well-known algorithms were intentionally chosen so that a brief explanation of the mathematical background is sufficient. Since the programs are largely self-explanatory, comments are kept to a minimum.
A.1 Interval Newton Method
An inclusion of a zero of a real-valued function f(x) is computed. It is assumed that f' is a continuous function on the interval [a,b], where

  0 ∉ { f'(x) : x ∈ [a,b] }   and   f(a) · f(b) < 0.

If an inclusion X_n for the zero of such a function f is already known, a better inclusion X_{n+1} may usually be computed by the iteration formula:

  X_{n+1} = ( m(X_n) - f(m(X_n)) / f'(X_n) ) ∩ X_n,

where m(X) is some point in the interval X (for example, the midpoint). For this example, the function f(x) = √x + (x + 1) · cos(x) is used. In C-XSC, interval expressions are written in mathematical notation. Generic function names are used for the interval square root and the interval sine and cosine functions. For details on the mathematical theory, see [1].
  #include "interval.h"   /* include interval arithmetic package */
  #include "imath.h"      /* include interval standard functions */

  interval F (real& x)
  { return sqrt(x) + (x+1) * cos(x); }

  interval Deriv (interval& x)
  { return (1 / (2 * sqrt(x)) + cos(x) - (x+1) * sin(x)); }

  int Criter (interval& x)          /* computing F(a) * F(b) < 0      */
  { interval Fa, Fb;                /* using point intervals          */
    Fa = F(Inf(x));                 /* operator <= is the relational  */
    Fb = F(Sup(x));                 /* operator 'element of'          */
    return (Sup(Fa*Fb) < 0.0 && !(0 <= Deriv(x)));
  }

  main()
  { interval y, y_old;
    real mid (interval&);           /* prototype of the midpoint function */

    cout << "Please enter starting interval: ";
    cin >> y;
    while (Inf(y) != Sup(y))
    { if (Criter(y))
      { do
        { y_old = y;
          cout << "y = " << y << "\n";
          y = (mid(y) - F(mid(y))/Deriv(y)) & y;  /* The iteration formula */
        } while (y != y_old);                     /* & is the intersection */
      }
      else
        cout << "Criterion not satisfied!\n";
      cout << "Please enter starting interval: ";
      cin >> y;
    }
  }
With the starting interval [2,3], the computed inclusions are:

  [ 2.0E+00,            3.0E+00            ]
  [ 2.0E+00,            2.3E+00            ]
  [ 2.05E+00,           2.07E+00           ]
  [ 2.05903E+00,        2.05906E+00        ]
  [ 2.059045253413E+00, 2.059045253417E+00 ]
  [ 2.059045253415E+00, 2.059045253416E+00 ]
A.2 Runge-Kutta Method
The initial-value problem for a system of differential equations is to be solved. The Runge-Kutta method to solve one differential equation may be written in C in an almost mathematical notation. In C-XSC, it is possible to use the same notation for a system of differential equations. The concept of dynamic arrays is used to make the program independent of the size of the system. Only as much storage as needed is occupied during runtime. The following system of first-order differential equations

  Y' = F(x, Y)   with initial condition   Y(x0) = Y0

is considered. If the solution Y is known at a point x, then the approximation Y(x+h) is computed by:

  K1 = h * F(x, Y)
  K2 = h * F(x + h / 2, Y + K1 / 2)
  K3 = h * F(x + h / 2, Y + K2 / 2)
  K4 = h * F(x + h, Y + K3)
  Y(x+h) = Y + (K1 + 2 * K2 + 2 * K3 + K4) / 6

Starting at x0, an approximate solution may be computed at the points x0 + i * h.
  #include "rvector.h"    /* rvector is the predefined class name */
                          /* for dynamic real vectors             */

  rvector F (real x, rvector Y)     /* Function definition           */
  { rvector Z(3);                   /* Constructor call              */
    ...                             /* right-hand side of the system */
    return Z;
  }

  void Init (real& x, real& h, rvector& Y)
  { Resize (Y,3);                   /* Initialisation       */
    x = 0; h = 0.1;                 /* Resize dynamic array */
    Y[1] = 0; Y[2] = 1; Y[3] = 1;
  }

  main()
  { real x, h;                                 /* Declarations and dynamic */
    rvector Y(3), K1(3), K2(3), K3(3), K4(3);  /* memory allocation        */

    Init (x, h, Y);
    for (int i = 1; i <= 3; i++)               /* Runge-Kutta step  */
    { K1 = h * F(x, Y);                        /* with array result */
      K2 = h * F(x + h / 2, Y + K1 / 2);
      K3 = h * F(x + h / 2, Y + K2 / 2);
      K4 = h * F(x + h, Y + K3);
      Y = Y + (K1 + 2*K2 + 2*K3 + K4) / 6;
      x += h;
      cout << SetPrecision(18,16) << Dec;      /* I/O modification */
      cout << "Step: " << i << "\n";
      cout << "x = " << x << "\n";
      cout << "Y = " << Y << "\n";
    }
  }
A.3 Trace of a Product Matrix
Dot product expressions are sums of real, complex, interval or complex interval constants, variables, vectors, matrices, as well as products of pairs of these. Dot precision variables are used to store intermediate results of the dot product expression without rounding errors. The contents of a dot precision variable can be rounded to a floating-point number with a user-specified rounding direction. The following C-XSC program demonstrates the use of this tool. The trace of a product matrix A·B is computed without evaluating the product matrix itself. The result will be of maximum accuracy, i.e. it is the best possible floating-point approximation of the exact solution. The trace of the product matrix is given by:

  trace(A·B) = Σ_{i=1}^{n} Σ_{j=1}^{n} a_ij · b_ji
  #include "cmatrix.h"    /* Use the complex matrix package */

  main()
  { int i, n;
    cout << "Please enter the matrix dimension n: ";
    cin >> n;
    cmatrix A(n,n), B(n,n);              /* Storage allocation for A, B */
    complex result;                      /* using the constructor       */
    cdotprecision accu;

    cout << "Please enter the matrix A: ";
    cin >> A;
    cout << "Please enter the matrix B: ";
    cin >> B;

    accu = 0.0;                             /* Clear accu                */
    for (i = 1; i <= n; i++)                /* A[i] and B[Col(i)] are    */
      accumulate (accu, A[i], B[Col(i)]);   /* subarrays of type cvector */
    result = rnd (accu, RND_NEXT);          /* Rounding the exact result */
                                            /* to nearest complex number */
    cout << "The trace of the product matrix is: " << result;
  }
Proposal for Accurate Floating-Point Vector Arithmetic

G. Bohlender, D. Cordes, A. Knöfel, U. Kulisch, R. Lohner, W. V. Walter

Many computers today provide accurate and reliable scalar arithmetic for floating-point numbers. An accurate definition of the four elementary floating-point operations +, -, *, / is given in the IEEE standards for floating-point arithmetic and was well established long before. An increasing number of computers (especially PCs and workstations) feature IEEE arithmetic; many others provide at least faithful (1 ulp) scalar arithmetic. In many numerical algorithms, however, compound operations such as the summation of a sequence of numbers or the dot product of two vectors are highly common. Some of the fastest computers currently available provide these operations in hardware. It is a well-known fact that a simulation of these compound operations by means of elementary floating-point operations leads to accumulation of rounding errors and may suffer from catastrophic cancellation of leading digits. Problems with cancellation are inherent in iterative refinement and defect correction methods, when determining zeros of functions, and in vector/matrix calculations. In many other applications, cancellation is a continual threat. Existing standards for floating-point arithmetic do not improve this situation. The goal of this proposal is to define vector operations in a manner consistent with the elementary scalar arithmetic operations. The rounding modes and accuracy requirements as well as the data formats of the operands and results of the vector operations described in this proposal are chosen to be fully consistent with the existing scalar floating-point arithmetic.
1 Introduction
When something goes wrong in a numerical calculation, the culprit is often an accumulation process. "Naive accumulation" using traditional floating-point addition is highly sensitive to the order of summation. Modern optimizing compilers for parallel computers and vector processors, and sometimes the hardware itself, routinely reorder and regroup the operands of compound operations. The user has little or no influence on this process even though it has potentially disastrous effects [14, 35].¹ Even for just ten numbers, there are millions of ways of adding them up, and each

¹ Acknowledgement: The authors would like to thank all those who, through their suggestions and criticism, have contributed to the maturation of this proposal, in particular D. W. Matula, for his relentless support and for chairing the IMACS-GAMM working group on "Enhanced Computer Arithmetic". Many ideas and arguments were brought forth at working group meetings and open sessions held at several conferences in 1991 and have been incorporated into this paper.
rearrangement may potentially produce a different result if traditional floating-point arithmetic is used. If the summation is noncritical, many of these potential results are fairly close together and relatively close to the true solution. However, if global cancellation occurs (so the final sum is of smaller magnitude than some of the summands), these potential results may vary widely, almost like random numbers. Usually, in such cases, the probability of the computed result being close to the true solution is minute. Unfortunately, the quality of the scalar floating-point arithmetic that is being used has very little influence on this situation. Even an arithmetic that conforms to one of the IEEE floating-point arithmetic standards [1, 2], which were released in 1985 and 1987, does not help. Since the early 1970's, pipelining has become one of the principal ways of speeding up computing. Meanwhile, modern chip technology allows pipelined instructions even on microcomputers. In addition to the four elementary arithmetic operations, pipelined computers usually provide a number of fast compound operations such as "accumulate" and "multiply and accumulate". It is an inherent property of all traditional implementations of these compound operations that the sequence of summands entering the process is scrambled during accumulation. So even if addition within a pipelined accumulation operation is performed exactly as required by the IEEE standards, the computed result generally differs from that obtained on a sequential computer. Furthermore, both results may be totally wrong. The best way to solve these problems is to perform the accumulation of floating-point numbers or products with only one final rounding. This should (and can) be done by any kind of computer, whether pipelined or not. It would guarantee that the computed result is independent of the order of summation, and that it cannot differ from the correct result by more than 1 ulp (unit in the last place).
The goal of this proposal is to define vector operations in a manner consistent with the elementary scalar operations by supporting the same rounding modes, data formats, and exceptions. Note that requiring the same rounding modes implies requiring the same global accuracy for a vector operation as for an elementary scalar operation. Some vector/matrix operations such as addition and subtraction can be easily obtained by performing the corresponding scalar operations on the vector/matrix components. Their accuracy is the same as for scalar operands. For vector/matrix multiplications, however, the situation is completely different because the reduction of the dot product to a sequence of scalar floating-point operations does not, in general, produce accurate (or even meaningful) results. The typical problems that occur in vector/matrix calculations and in accumulation processes in general can be eliminated by one additional operation: the accurate dot product. More and more designs and implementations of a dot product producing results of maximum accuracy (with respect to the chosen rounding mode and destination format) are emerging. In order to keep these different implementations compatible, the dot product operation shall be defined in a mathematically rigorous way. Implementations of vector/matrix arithmetic should be modified to conform to this proposal.
2 Motivation
There are currently several standards and quasi-standards concerned with floating-point arithmetic, and their importance is growing due to the increasing availability of conforming hardware and software. At the same time, the users of numerical hardware and software are becoming more aware of accuracy problems and more concerned with the reliability of their computations. Also, some manufacturers are beginning to look for more guidance from users and mathematicians in order to be able to satisfy the needs of contemporary numerical computing. The IEEE standards 754 and 854 for floating-point arithmetic [1, 2] define data formats and elementary arithmetic for scalar real operands. Their goal is to improve the accuracy and reliability as well as the portability and compatibility of floating-point computations. The IEEE standards prescribe well-defined rounding modes and thus uniquely defined floating-point results for the elementary arithmetic operations +, -, *, /.
Another standard that is still under development at the international (ISO) level is the Language Compatible Arithmetic Standard (LCAS) [19]. It intends to specify the minimal mathematical properties that any floating-point arithmetic should satisfy. However, in contrast to the IEEE standards, it does not specify particular data formats, nor does it require specific rounding modes, so numerical results may once again depend on the hardware. On the other hand, it is applicable to a wider range of computers and more easily accommodated by existing programming languages. The requirement to specify the mathematical properties of the arithmetic within the definition of programming languages dates from the early 1970's [27, 28]. Unfortunately, even today none of the existing programming language standards satisfy this requirement (the Ada standard contains at least some minimal accuracy requirements). Only a few prototype languages such as PASCAL-XSC [4, 23, 25, 9] and ACRITH-XSC (formerly called FORTRAN-SC) [15, 3, 17] have included a rigorous specification of the arithmetic in the language definition. Besides standards for scalar arithmetic, there is one notable industry standard for vector/matrix arithmetic: the Basic Linear Algebra Subprograms (BLAS) [29, 11, 12, 33]. The BLAS essentially prescribe the (FORTRAN 77) interface and the mathematical functionality of a carefully chosen set of subprograms for vector/matrix calculations. The means of implementation are not specified (even though FORTRAN 77 code is available), allowing for highly optimized versions on particular machines, especially on vector processors and parallel computers. Unfortunately, there are no accuracy requirements for the BLAS, so the numerical results are highly machine- and implementation-dependent and by no means compatible or portable.
Currently, much of the discussion about how to achieve higher accuracy in floating-point computations circles around the idea of a new data format with more precision. It seems to lead in the direction of an extended (e.g. quadruple precision) floating-point format "EXTENDED" with at least twice the number of digits of the usual (e.g. double precision) floating-point format "USUAL". Note that the term accuracy refers to the quality of a computational result, whereas the term precision refers to the length of the mantissa (number of digits) of a floating-point number. The main options that are being considered are:

1. A full EXTENDED arithmetic with the operations +, -, *, /.

2. A limited EXTENDED arithmetic: only the exact multiplication of USUAL numbers into the EXTENDED format and the addition/subtraction of EXTENDED numbers are available.

3. An exact USUAL arithmetic as defined in [8]: in all USUAL operations, the low order part of the exact result (in +, -, *) or the remainder (in /) is accessible as a second USUAL result.

4. A "multiply and add" instruction for USUAL numbers (such as in the IBM RISC System/6000 MAF unit): the USUAL result is the exact solution of a+b*c rounded only once.

5. An accurate dot product for vectors of USUAL numbers: the USUAL result differs from the exact dot product by at most one rounding.
A full EXTENDED arithmetic (1) provides easy access to higher precision, but the extra costs for the floating-point unit, wider data paths and additional data storage space are quite significant. Additionally, using the same technology, such an EXTENDED arithmetic is slower than the USUAL arithmetic. This option only defers the accuracy problem, but does not solve it in principle. The additional arithmetic features in (2) to (4) provide building blocks for the software emulation of a complete EXTENDED arithmetic. However, there is a considerable time penalty. Also, numerical algorithms must be adapted to take advantage of the emulated higher precision. Furthermore, it remains difficult to assess the quality of computed results because the main reasons for a loss of accuracy, namely cancellation and rounding errors, persist. In particular, even though option (4) reduces the number of roundings incurred in the computation of a dot product, it hardly improves its numerical stability. Note that (2) is much more useful than (4) because it allows the direct computation of "short dot products" a·b ± c·d with just one final rounding. The short dot product is a frequent operation in many numerical applications. For example, complex multiplication consists of one short dot product for the real and one for the imaginary part. Option (3) proposed in [8] (see also [20, 31, 10, 30]) yields exact results by producing two USUAL floating-point results: a high-order and a low-order part whose sum is the exact result for +, -, *, or an approximate quotient and the corresponding exact remainder for /. Despite the fact that no information is lost when employing such an arithmetic, its usefulness is limited because the number of intermediate floating-point results may increase dramatically in the course of a calculation if the sum of these is to represent the exact solution. Also, (3) necessitates a lot of extra software, and it is yet much slower than (2) in most applications.

Compared with the other ways to achieve higher accuracy, the accurate dot product (5) has several advantages. It provides an effective way of mapping numerically sensitive parts of an algorithm to computer instructions. Furthermore, it eliminates intermediate rounding errors and the ill effects of cancellation in summations and dot products. High performance can be achieved, and the hardware costs for a dot product computation unit are quite low when compared with a full EXTENDED arithmetic. So the accurate dot product combines high accuracy with high performance and application-oriented functionality. As with options (2) to (4), numerical algorithms must be adapted to take advantage of the accurate dot product. Triple, quadruple and higher precision arithmetic can be easily implemented using the accurate dot product. If the accurate dot product is implemented in hardware, the computation of dot products of USUAL vectors is generally faster than any simulation using traditional floating-point operations. One goal of this proposal is to show ways of making vector/matrix operations (for example, the BLAS) more reliable than traditional implementations. In fact, simply by adding an accurate dot product to the set of elementary arithmetic operations, one can implement all of the basic operations that occur in linear algebra (except complex division) accurately and efficiently, for real and complex numbers, vectors and matrices. This is achieved by giving a clear and rigorous definition of an accurate dot product which can be implemented very efficiently in hardware [24, 32, 26, 7]. If additionally one has an arithmetic that offers downwardly and upwardly directed roundings (e.g. IEEE arithmetic), then all of the basic operations of linear algebra can be provided for real and complex intervals as well.
In programming languages such as PASCAL-XSC and ACRITH-XSC, all of these arithmetic operations are available and easily accessible. Their results are required to be accurate to 1 ulp. Furthermore, these operations provide the basis for the solution of numerical problems with automatic result verification. Since naive interval arithmetic always performs a worst-case analysis, it may lead to verified results of poor accuracy. Therefore, a tool to improve the accuracy is essential to obtain guaranteed results of high accuracy: the accurate dot product. Typical problems where self-validating numerical methods have been successfully applied include linear and nonlinear systems of equations, eigenvalue problems, polynomial and expression evaluation, systems of differential and integral equations, quadrature, and more.
3 Basic Requirements

For the purposes of this proposal, it is assumed that a floating-point format (data format) is defined by its radix r (base), its precision p (mantissa length), and its exponent range emin..emax (minimal and maximal exponent). These integer parameters define a fixed floating-point format F(r, p, emin, emax) and thus a finite
set of real values representable in that format, the floating-point numbers. If a machine offers several data formats, only one of these is considered at a time.
It is also assumed that the elementary floating-point operations +, -, *, / for the given data format are "properly defined" on the computer. In particular, their results must always be accurate to at least 1 ulp (unit in the last place), that is, the computed result must be one of the two floating-point numbers neighboring the exact result. This is sometimes called "faithful" arithmetic [10]. Furthermore, it is assumed that there is a set of one or more rounding modes that map the exact result of an arithmetic operation to the set of representable (floating-point) numbers. The same rounding modes should be available for all four elementary arithmetic operations (+, -, *, /). Mathematically, a rounding is a monotonic nondecreasing projection from the real numbers onto the set of floating-point numbers. This ensures that the floating-point numbers are invariant under a rounding and that the elementary arithmetic operations are accurate to at least 1 ulp. Note that this is required by the LCAS [19] and by both IEEE standards for floating-point arithmetic [1, 2]. For more detail, refer to [27, 28] where the concept of semimorphism is introduced.
4 Accurate Dot Product
A traditional computation of the dot product (scalar product, inner product) of two vectors with n components each (in ordinary floating-point arithmetic with rounded multiplications and additions) involves (2n - 1) roundings and may lead to catastrophic cancellation of significant digits. This may happen even if an extended precision data format is used for the accumulation. Besides the loss of accuracy, a considerable amount of processing time may be required to perform unnecessary intermediate steps such as composition, decomposition, normalization, and rounding of intermediate floating-point values. Furthermore, unnecessary load and store operations may have to be performed. In the computation of an accurate dot product, on the other hand, most of these steps can be avoided, and the result can be guaranteed to be highly accurate. The naive use of longer and longer (extended) floating-point formats to achieve higher accuracy decreases performance significantly and leads both manufacturers and users into a "precision race". Moreover, higher precision does not solve the fundamental accuracy problem in general. In contrast, an accurate dot product offers the best possible result with respect to a given floating-point format and a given rounding mode while requiring less hardware and improving performance. For these reasons, several algorithms and a number of software, firmware and hardware implementations for accurate dot products have been developed in recent years (for a survey, see for example [5]). Also, various efforts have been made to augment IEEE arithmetic in this respect [13, 23, 37, 21, 6, 34]. At the same time, various parts of the numerical user community have been demanding highly accurate
implementations of the "elementary compound operations" of vector processors, such as "accumulate" and "multiply and accumulate". For example, such demands are stated in the "Resolution on Computer Arithmetic" [18], which was officially adopted by GAMM in 1987 and by IMACS in 1988.
For the purposes of this proposal, the dot product operation is defined as follows:

Given two vectors x and y with n floating-point components each, and a prescribed rounding mode ○, the floating-point result s of the dot product operation (applied to x and y) is defined by

  s := ○( x_1·y_1 + x_2·y_2 + ... + x_n·y_n ),

where all arithmetic operations are mathematically exact. Thus s shall be computed as if an intermediate result correct to infinite precision and with unbounded exponent range were first produced and then rounded to the desired floating-point destination format according to the selected rounding mode ○. This definition guarantees highest possible accuracy (for the given rounding mode and floating-point destination format) and agrees with the definition of computer arithmetic by "semimorphism" in [27, 28]. It is also analogous to the definition of floating-point arithmetic given in the IEEE standards 754 and 854 [1, 2]. The above definition also shows how to extend the given scalar arithmetic to vectors in a consistent manner. Vector operations must behave as if they obeyed the following rules:
General Behavior

For a given floating-point format A, assume that there exists a higher precision floating-point format B such that all numbers of the given floating-point format A can be represented exactly in B, and that all intermediate operations necessary to compute the result of the chosen vector operation can be performed without error in B. Compute the exact result in B, then round it to the destination format. Any "special values" (such as infinities, signed zeros, or non-arithmetic values) of the given floating-point format A must have corresponding representations in B, and the same results, exceptions and invalid operations must be defined for these special values as in the given floating-point format A. The abstract computation in such a virtual format B guarantees the accuracy of the result since intermediate roundings and intermediate overflow/underflow cannot occur. Furthermore, this guarantees consistent behavior in special cases such as operations with infinity, signed zeros, and "special values" (such as NaNs in IEEE arithmetic).
G. Bohlender, D. Cordes, A. Knofel, U. Kulisch, R. Lohner, W. V. Walter
Following these general rules, the dot product operation can be defined as follows:
Dot Product Operation
An implementation shall provide the dot product operation Σ_{i=1}^{n} x_i * y_i for any number n of pairs of operands x_i, y_i, i = 1, ..., n, of the same data format. The result shall be rounded to the destination format in the same manner as for scalar floating-point operations. The value of the natural number n shall be representable in a supported integer format. Note that the order in which the elementary operations are performed when determining the dot product is not specified, allowing parallel and pipelined processing. This has no influence on the computed result, but only on the behavior in case of an exception. Since the dot product is a compound operation, an invalid operation exception shall be signaled only if one of the following cases occurs:

1. invalid floating-point operands (e. g. NaNs)
2. invalid multiplication (e. g. ∞ * 0) or addition (e. g. (+∞) + (-∞))
The overflow, underflow, and inexact exceptions shall not occur until the final rounding is applied. The overflow or underflow exception shall be signaled if the accurate result of the dot product operation does not fit into the exponent range of the destination format. The inexact exception shall be signaled if the accurate result of the dot product operation does not fit into the restricted floating-point mantissa of the destination format.
5 Fundamental Operations
Since the dot product of two vectors is a compound operation which may require many machine cycles, the computational process should be divisible into smaller steps. The dot product operation can be accomplished by calculating full double-length products and by accumulating these products in a special object (called "accumulator object" in the sequel) without rounding error. Therefore, an accumulator object is provided, along with a set of operations described below, to keep the intermediate state of the computation. The realization of dot products by means of such a data object usually leads to an efficient implementation and at the same time to increased flexibility in applications. Therefore, many implementations are based on this concept [4, 25, 15, 16, 17]. However, different implementations are possible. For our purpose, an "accumulator object" is only an abstract concept and not a general all-purpose data format. The representation of an "accumulator object" is not specified in order to allow various implementations. Note, however, that the accuracy of the result is required to be independent of the implementation method.
Accumulator Object
For any number n of pairs of operands x_i, y_i of the same data format, an "accumulator object" shall be capable of holding the unrounded result of any dot product Σ_{i=1}^{n} x_i * y_i as computed with infinite precision and unbounded exponent range. The value of the natural number n shall be representable in a supported integer format. The following section describes a minimal set of "accumulator operations" for the accumulator object.
Fundamental Accumulator Operations

The fundamental operations which are needed for an accurate dot product are listed in Table 1. The operands x and y and the result z are numbers in one of the supported floating-point formats; ACC is a suitable accumulator object. Apart from the explicit final rounding operation ○, all operations are performed without rounding error and with unbounded exponent range.

  operation                explanation
  1) ACC := x * y          initialize accumulator with product of x and y (exact, without rounding error)
  2) ACC := ACC + x * y    add product of x and y to ACC (exact, without rounding error)
  3) z := ○(ACC)           round ACC to the floating-point number z (according to the rounding mode ○)

Table 1: Fundamental Accumulator Operations

The following exceptions can occur in accumulator operations:
Exceptions

In the rounding z := ○(ACC) of an accumulator object to the destination format, the exceptions overflow, underflow, and inexact can occur. In all other accumulator operations, only an invalid operation exception may be signaled, indicating an invalid elementary operation, an operation on invalid input data, or that the accumulator object is insufficient to hold the intermediate exact result. The last case shall never occur if the total number n of accumulation operations 2) performed on the same accumulator object is representable in a supported integer format. Note that an invalid operation exception due to an insufficient accumulator object is extremely unlikely, since it requires the accumulation of at least (maxint + 1) simple products of the largest floating-point number with itself. Depending on the processing power of the machine, a small number of extra digits is sufficient to avoid the occurrence of this exception during the lifetime of the computer.
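The accumulator interface of Table 1 can be modeled in a few lines of Python. This is a sketch only: the proposal deliberately leaves the representation unspecified, and a hardware realization would typically use a long fixed-point accumulator rather than the rational arithmetic used here; the class and method names are illustrative.

```python
from fractions import Fraction

class Accumulator:
    """Model of the abstract accumulator object of Table 1.
    Rational arithmetic stands in for exact, unbounded-exponent accumulation."""

    def __init__(self, x=0.0, y=1.0):
        # Operation 1) ACC := x * y  (exact, without rounding error)
        self.value = Fraction(x) * Fraction(y)

    def accumulate(self, x, y):
        # Operation 2) ACC := ACC + x * y  (exact, without rounding error)
        self.value += Fraction(x) * Fraction(y)
        return self

    def round(self):
        # Operation 3) z := o(ACC): the only rounding in the whole process
        return float(self.value)

acc = Accumulator(1e16, 1.0)
acc.accumulate(1.0, 1.0).accumulate(-1e16, 1.0)
print(acc.round())   # 1.0: no intermediate rounding, one final rounding
```

Only operation 3) can raise overflow, underflow, or inexact; operations 1) and 2) are exact by construction, matching the exception rules above.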
6 Recommended Additional Operations
In addition to the fundamental operations listed above, several other related operations are often useful. They are listed in Table 2 and agree with the accumulator operations provided in PASCAL-XSC, ACRITH and ACRITH-XSC [4, 25, 15, 16, 17].
Additional Accumulator Operations

  operation                  explanation
  4)  ACC := x               initialize an accumulator object with a floating-point value
  5)  ACC := ACC + x         add a floating-point value to an accumulator object
  6)  ACC := ACC - x         subtract a floating-point value from an accumulator object
  7)  ACC := ACC - x * y     subtract a product of floating-point operands from an accumulator object
  8)  ACC := -ACC            invert the sign of an accumulator object
  9)  ACC1 < > = ACC2        compare two accumulator objects
  10) ACC1 := ACC1 + ACC2    add two accumulator objects
  11) ACC1 := ACC1 - ACC2    subtract two accumulator objects
  12) ACC1 := ACC2           copy the contents of an accumulator object

Table 2: Additional Accumulator Operations
7 Conclusion
The operations described in this proposal allow a natural extension of the ideas of existing standards for scalar floating-point arithmetic to vectors and matrices. The accurate dot product as defined in this proposal provides a flexible and application-oriented basis for the implementation of sophisticated, efficient, and reliable numerical software. Thus it becomes possible to program algorithms in such a way that they carry their own error control and produce results of high accuracy.
A new branch of numerical mathematics has evolved around the idea of automatically verifying computational results on the computer. Besides interval arithmetic, which is essential to compute guaranteed bounds on a solution, a general tool to improve the accuracy of numerical calculations is essential to obtain reliable results of high accuracy. The accurate dot product is well suited for this purpose.
Various engineering problems in soil mechanics, optics of liquid crystals, groundwater modelling, magnetohydrodynamics and other fields have been successfully solved using automatic result verification techniques. In all of these problems, the known traditional floating-point methods had previously failed. Thus, the proposed vector arithmetic extension goes far beyond ordinary floating-point arithmetic, yet with reasonable extra effort.
A Case Study for IEEE Arithmetic
The terms and concepts defined in the IEEE Standard 754-1985 for Binary Floating-Point Arithmetic [1] and in the IEEE Standard 854-1987 for Radix-Independent Floating-Point Arithmetic [2] are consistent with the basic requirements for the natural extension of the scalar arithmetic operations to the dot product operation. The IEEE data formats, the definition of arithmetic operations, the rounding modes, and the exception handling are well-defined in the IEEE standards and can be applied without changes to this proposal for accurate vector arithmetic.
A.1 Dot Product Operation
This section prescribes the structure of a possible new standard for an accurate dot product operation as a basis for a well-defined vector arithmetic. The terms and wording of the IEEE standards are used whenever appropriate. This section also shows that an extension of the IEEE concepts to an accurate dot product operation can be realized without any restriction of the basic concepts. Using the diction of the IEEE standards, the computation of the dot product of floating-point vectors must be considered as if performed in an extended data format, where all the required operations can be performed without error, and then rounded back to the destination's precision. This guarantees both the accuracy of the result and conforming results in case of special input operands such as infinities, signed zeros and NaNs. Also, the same exception and trap handling occurs when the rounding is applied. The contents of sections 1.-4. (Scope, Definitions, Precisions, Rounding) of the IEEE standards can be adopted without changes. Only one definition, extracted from section 5 (Operations) of the IEEE Standard 854-1987, is quoted here to outline the main goal behind the new vector operations:

    Except for conversion between internal floating-point representations and decimal strings, each of the operations shall be performed as if it first produced an intermediate result correct to infinite precision and with unbounded exponent range, and then coerced this intermediate result to fit in the destination's precision.
Following the structure of the IEEE standards, the dot product operation is defined:
Dot Product Operation

An implementation shall provide the dot product operation Σ_{i=1}^{n} x_i * y_i for any number n of pairs of operands x_i, y_i, i = 1, ..., n, of the same basic format. The result shall be rounded to the destination's precision as specified in section 4 of the existing IEEE standards. The value of the natural number n shall be representable in a supported integer format.

The definitions in 6. (Infinity, NaNs, and Signed Zero) can be adopted, too.
For the exceptional behavior of the dot product operation, the definitions in 7. (Exceptions) of the IEEE standards are sufficient and can be easily interpreted. An invalid operation shall be signaled if one of the following cases occurs:

1. Any operation on a signaling NaN, magnitude subtraction of infinities (such as (+∞) - (+∞)), or invalid multiplication (0 * ∞).
2. The number of vector elements exceeds the limit imposed by the supported integer format.

The exceptions overflow, underflow, and inexact can only occur when the final rounding is applied. The trap handling in section 8. (Traps) can also be adopted.
A.2 Accumulator Operations
This section does not have any corresponding section in the IEEE standards. Essentially, the abstract definition of an accumulator object and of the related operations is given.
Accumulator Object
For any number n of pairs of operands x_i, y_i of the same basic format, an "accumulator object" shall be capable of holding the unrounded result of any dot product Σ_{i=1}^{n} x_i * y_i as computed with infinite precision and unbounded exponent range. The value of the natural number n shall be representable in a supported integer format.
Note that an accumulator object must be capable of holding any value representable in the data format of the operands, i. e. there must be representations for NaNs, infinity and signed zero. This is a consequence of the requirements imposed by the specification of the general behavior of vector operations.
The following section describes a minimal set of "accumulator operations" for the accumulator object.
The fundamental operations which are needed for an accurate dot product are listed in Table 3. The operands x and y and the result z are numbers in one of the basic formats; ACC is a suitable accumulator object. Apart from the explicit final rounding operation ○, all operations are performed without rounding error and with unbounded exponent range.

  operation                explanation
  1) ACC := x * y          initialize accumulator with product of x and y (exact, without rounding error)
  2) ACC := ACC + x * y    add product of x and y to ACC (exact, without rounding error)
  3) z := ○(ACC)           round ACC to the destination's precision (according to the rounding mode ○)

Table 3: Fundamental Operations

The following exceptions can occur in accumulator operations:
Exceptions

In the rounding z := ○(ACC) of an accumulator object to the destination's precision, the exceptions overflow, underflow, and inexact can occur. In all other accumulator operations, only an invalid operation exception may be signaled, indicating an operation on a signaling NaN, a magnitude subtraction of infinities (∞ - ∞), an invalid multiplication (0 * ∞), or that the accumulator object is insufficient to hold the intermediate exact result. The last case shall never occur if the total number n of accumulation operations 2) performed on the same accumulator object is representable in a supported integer format.

In addition to the fundamental operations listed above, several other related operations are often useful. These additional operations are listed in Table 4.
Additional Operations

  operation                  explanation
  4)  ACC := x               initialize an accumulator object with a floating-point value
  5)  ACC := ACC + x         add a floating-point value to an accumulator object
  6)  ACC := ACC - x         subtract a floating-point value from an accumulator object
  7)  ACC := ACC - x * y     subtract a product of floating-point operands from an accumulator object
  8)  ACC := -ACC            invert the sign of an accumulator object
  9)  ACC1 < > = ACC2        compare two accumulator objects
  10) ACC1 := ACC1 + ACC2    add two accumulator objects
  11) ACC1 := ACC1 - ACC2    subtract two accumulator objects
  12) ACC1 := ACC2           copy the contents of an accumulator object

Table 4: Additional Operations
References

[1] American National Standards Institute / Institute of Electrical and Electronics Engineers: IEEE Standard for Binary Floating-Point Arithmetic. ANSI/IEEE Std 754-1985, New York, 1985.
[2] American National Standards Institute / Institute of Electrical and Electronics Engineers: IEEE Standard for Radix-Independent Floating-Point Arithmetic. ANSI/IEEE Std 854-1987, New York, 1987.
[3] Bleher, J. H.; Rump, S. M.; Kulisch, U.; Metzger, M.; Ullrich, Ch.; Walter, W.: FORTRAN-SC: A Study of a FORTRAN Extension for Engineering/Scientific Computation with Access to ACRITH. Computing 39, 93-110, Springer-Verlag, 1987.
[4] Bohlender, G.; Rall, L. B.; Ullrich, Ch.; Wolff von Gudenberg, J.: PASCAL-SC: Wirkungsvoll programmieren, kontrolliert rechnen. Bibl. Inst., Mannheim, 1986; ...: PASCAL-SC: A Computer Language for Scientific Computation. Perspectives in Computing 17, Academic Press, Orlando, 1987.
[5] Bohlender, G.: What Do We Need Beyond IEEE Arithmetic? In Ullrich, Ch. (ed.): Computer Arithmetic and Self-Validating Numerical Methods, Academic Press, 1990.
[6] Bohlender, G.: A Vector Extension of the IEEE Standard for Floating-Point Arithmetic. In [22], 3-12, 1991.
[7] Bohlender, G.; Knöfel, A.: A Survey of Pipelined Hardware Support for Accurate Scalar Products. In [22], 29-43, 1991.
[8] Bohlender, G.; Kornerup, P.; Matula, D. W.; Walter, W. V.: Semantics for Exact Floating Point Operations. Proc. of 10th IEEE Symp. on Computer Arithmetic (ARITH 10) in Grenoble, 22-26, IEEE Comp. Soc., 1991.
[9] Cordes, D.: Runtime System for a PASCAL-XSC Compiler. In [22], 151-160, 1991.
[10] Dekker, T. J.: A Floating-Point Technique for Extending the Available Precision. Numerische Mathematik 18, 224-242, 1971.
[11] Dongarra, J. J.; Du Croz, J.; Hammarling, S.; Hanson, R.: An Extended Set of Fortran Basic Linear Algebra Subprograms. ACM Trans. on Math. Software 14, no. 1, 1988.
[12] Dongarra, J. J.; Du Croz, J.; Duff, I.; Hammarling, S.: A Set of Level 3 Basic Linear Algebra Subprograms. ACM Trans. on Math. Software 16, no. 1, 1990.
[13] Hahn, W.; Mohr, K.: APL/PCXA, Erweiterung der IEEE-Arithmetik für technisch-wissenschaftliches Rechnen. Hanser Verlag, München, 1989.
[14] Hammer, R.: How Reliable is the Arithmetic of Vector Computers? In [36], 467-482, 1990.
[15] IBM: System/370 RPQ, High-Accuracy Arithmetic. SA22-7093-0, IBM Corp., 1984.
[16] IBM: High-Accuracy Arithmetic Subroutine Library (ACRITH), General Information Manual. 3rd ed., GC33-6163-02, IBM Corp., 1986.
[17] IBM: High Accuracy Arithmetic - Extended Scientific Computation (ACRITH-XSC), General Information. GC33-6461-01, IBM Corp., 1990.
[18] IMACS, GAMM: Resolution on Computer Arithmetic. In Mathematics and Computers in Simulation 31, 297-298, 1989; in Zeitschrift für Angewandte Mathematik und Mechanik 70, no. 4, p. T5, 1990; in Ch. Ullrich (ed.): Computer Arithmetic and Self-Validating Numerical Methods, 301-302, Academic Press, San Diego, 1990; in [36], 523-524, 1990; in [22], 477-478, 1991.
[19] ISO: Language Compatible Arithmetic Standard (LCAS). Committee Draft (Version 3.1), ISO/IEC 10967, 1991.
[20] Kahan, W.: Further Remarks on Reducing Truncation Errors. Comm. ACM 8, no. 1, 40, 1965.
[21] Kahan, W.: Doubled Precision IEEE Standard 754 Floating-Point Arithmetic. Conf. on Computers and Mathematics, Mini-Course on "The Regrettable Failure of Automated Error Analysis", MIT, June 13, 1989.
[22] Kaucher, E.; Markov, S. M.; Mayer, G. (eds.): Computer Arithmetic, Scientific Computation and Mathematical Modelling. IMACS Annals on Computing and Applied Mathematics 12, J. C. Baltzer, Basel, 1991.
[23] Kießling, I.; Lowes, M.; Paulik, A.: Genaue Rechnerarithmetik, Intervallrechnung und Programmieren mit PASCAL-SC. Teubner Verlag, Stuttgart, 1988.
[24] Kirchner, R.; Kulisch, U.: Arithmetic for Vector Processors. Proc. of 8th IEEE Symp. on Computer Arithmetic (ARITH 8) in Como, IEEE Comp. Soc., 1987.
[25] Klatte, R.; Kulisch, U.; Neaga, M.; Ratz, D.; Ullrich, Ch.: PASCAL-XSC: Sprachbeschreibung mit Beispielen. Springer-Verlag, Berlin, Heidelberg, 1991; ...: PASCAL-XSC: Language Reference with Examples. Springer-Verlag, Berlin, Heidelberg, 1992.
[26] Knöfel, A.: Fast Hardware Units for the Computation of Accurate Dot Products. Proc. of 10th IEEE Symp. on Computer Arithmetic (ARITH 10) in Grenoble, 70-74, IEEE Comp. Soc., 1991.
[27] Kulisch, U.: Grundlagen des numerischen Rechnens: Mathematische Begründung der Rechnerarithmetik. Reihe Informatik 19, Bibl. Inst., Mannheim, 1976.
[28] Kulisch, U.; Miranker, W. L.: Computer Arithmetic in Theory and Practice. Academic Press, New York, 1981.
[29] Lawson, C.; Hanson, R.; Kincaid, D.; Krogh, F.: Basic Linear Algebra Subprograms for Fortran Usage. ACM Trans. on Math. Software 5, 1979.
[30] Linnainmaa, S.: Analysis of Some Known Methods of Improving the Accuracy of Floating-Point Sums. BIT 14, 167-202, 1974.
[31] Møller, O.: Quasi Double Precision in Floating-Point Addition. BIT 5, 37-50, 1965.
[32] Müller, M.; Rüb, Ch.; Rülling, W.: Exact Addition of Floating Point Numbers. Sonderforschungsbereich 124, FB 14 Informatik, Univ. des Saarlandes, Saarbrücken, 1990.
[33] NAG: Basic Linear Algebra Subprograms (BLAS). The Numerical Algorithms Group Ltd., Oxford, 1990.
[34] Priest, D. M.: Algorithms for Arbitrary Precision Floating Point Arithmetic. Proc. of 10th IEEE Symp. on Computer Arithmetic (ARITH 10) in Grenoble, IEEE Comp. Soc., 1991.
[35] Ratz, D.: The Effects of the Arithmetic of Vector Computers on Basic Numerical Methods. In [36], 499-514, 1990.
[36] Ullrich, Ch. (ed.): Contributions to Computer Arithmetic and Self-Validating Numerical Methods. IMACS Annals on Computing and Applied Mathematics 7, J. C. Baltzer, Basel, 1990.
[37] Weeks, D.: Vectorizing a Robust Inner Product Algorithm. Proc. of Third Int. Conf. on Supercomputing, ACM, 1989.
II. Enclosure Methods and Algorithms with Automatic Result Verification
Automatic Differentiation and Applications¹

Hans-Christoph Fischer
1 Introduction
The computation of derivatives is a problem which arises in many fields of Applied Mathematics: for a function one is interested not only in the function value but also in the values of certain derivatives. E. g., the Newton-Raphson method

    x_{ν+1} = x_ν - (f′(x_ν))⁻¹ * f(x_ν)    (ν = 0, 1, 2, ...)

for the solution of a nonlinear equation f(x) = 0 uses the value of the first derivative f′(x_ν) (or, in the case of a system of nonlinear equations, the Jacobian matrix). The computation of derivatives can be done in an efficient way by means of so-called automatic differentiation: formulas do not have to be manipulated symbolically, and derivatives are determined explicitly. Furthermore, with the help of interval arithmetic it is possible to compute guaranteed bounds for the values of derivatives. This is the basis for many problem-solving algorithms with automatic result verification, e. g., numerical quadrature, the solution of ordinary differential equations (initial and boundary value problems), or systems of integral equations. Before the method of automatic differentiation is explained in more detail, some words about symbolic and numerical differentiation are in place. In numerical differentiation the values of derivatives are approximated by corresponding difference quotients. E. g., the approximation of the first derivative of a real function f(x) requires two evaluations of the function, one subtraction, and one division:

    f′(x) ≈ (f(x + h) - f(x)) / h.
The stepsize h has to be chosen such that the approximation error becomes small and the rounding errors are not amplified in an unacceptable way (e. g., due to cancellation in the numerator). Because of these difficulties, this method frequently leads to unsatisfactory results. In contrast to numerical differentiation, the methods of symbolic and automatic differentiation are not approximative. Symbolic differentiation 'by hand' or by use of a computer-algebra system (e. g. MACSYMA [26], MAPLE [27] or REDUCE [33]) starts with a functional expression

¹The author wants to express his appreciation to Prof. U. Kulisch and Prof. E. Kaucher, both of the University of Karlsruhe, for their constant encouragement when the thesis [8] was written, which is the basis of the present paper.
for f = f(x) and then applies, step by step, the well-known rules of differentiation to get an expression for the desired derivative. For a fixed value of x, the evaluation of this expression then results in the numerical value of the derivative. Even for simple functions the automatically generated expressions may be large and unstructured, and their generation may involve considerable requirements concerning memory and CPU time. Therefore, in the case that one is not interested in expressions for the derivatives but rather in values only, this method may not be efficient. Automatic differentiation avoids the problems of the symbolic method: the application of the rules of differentiation and the evaluation are carried out in parallel, i. e., during the whole process only numbers have to be manipulated. Therefore, the method of automatic differentiation can be easily coded in programming languages such as FORTRAN, PASCAL, etc. Provided the programming language supports overloading of operators, an expression may be entered in the usual mathematical notation: the evaluation of derivatives is executed by means of a suitable definition of the operators. In general, the method requires only little extra storage. Furthermore, the employment of fast floating-point hardware is possible, i. e., the running times are short. The paper is organized as follows: in Section 2 we develop the basic principles of the (forward mode of) automatic differentiation: the computation of the first derivative, the computation of Taylor coefficients for functions f : R → R, and the evaluation of the gradient ∇f = (∂f/∂x_1, ..., ∂f/∂x_n) for functions f : R^n → R.
In Section 3 the method of reverse computation is presented. This method allows a considerable acceleration of many algorithms of automatic differentiation; this leads to so-called fast automatic differentiation or reverse automatic differentiation. Especially for the computation of gradients, the following surprising estimate can be shown: A(f, ∇f) ≤ C * A(f), where A(f, ∇f) denotes the number of operations for the computation of the gradient (including the function evaluation) and A(f) the number of operations for the function evaluation, respectively. Section 4 introduces interval slopes. They can be evaluated efficiently by the method of reverse computation. With interval slopes, it is possible to define centered forms which are the basis for the efficient calculation of the range of values. In Section 5 we apply reverse computation to the problem of evaluating functions in floating-point arithmetic. Guaranteed bounds for the rounding errors can be computed. In the Appendix a program in PASCAL-XSC demonstrates an implementation of automatic differentiation. For the function f(x) = 25(x - 1)/(x² + 1), the second derivative f″(2) is computed.
2 Automatic Differentiation (Forward Mode)
First, we briefly summarize the basic results of automatic differentiation since they will be essential for the following sections. Some ideas concerning this method can be found already in the late 1950s (Beda et al. [5]). In the books of Moore [29] and Rall [30], automatic differentiation (forward mode) is used to solve various problems of Applied Mathematics. These authors have also examined the possibilities of implementing the method.
2.1 The Computation of First Derivatives
The principles of automatic differentiation may be explained best with the help of the evaluation of first derivatives. As functions we examine arithmetic expressions which can be formulated in typical programming languages (FORTRAN, PASCAL, ...), i. e. expressions consisting of a finite composition of constants, variables, the basic operations +, -, *, /, and certain differentiable standard functions such as exp, sin, cos, ...
For a function f : R → R, the computation of the values of the function and its first derivative is carried out by means of a differentiation arithmetic. This is an arithmetic of ordered pairs, like complex arithmetic or interval arithmetic. The pairs look as follows: U := (u, u′) with u, u′ ∈ R, where the first component is the function value and the second component is the value of the first derivative.² The rules for addition, subtraction, etc. of these pairs are simple. In the first component, u, the function value is computed by addition, subtraction, etc.; in the second component, u′, the resulting value of the derivative is computed by the well-known rules for differentiation:

    U + V = (u, u′) + (v, v′) = (u + v, u′ + v′),
    U - V = (u, u′) - (v, v′) = (u - v, u′ - v′),
    U * V = (u, u′) * (v, v′) = (u * v, u′ * v + u * v′),
    U / V = (u, u′) / (v, v′) = (u/v, (u′ - (u/v) * v′)/v),  v ≠ 0.

Pairs involving standard functions are computed by use of the chain rule. E. g., there hold:

    exp(U) = exp(u, u′) = (exp(u), exp(u) * u′),
    sin(U) = sin(u, u′) = (sin(u), cos(u) * u′),
    cos(U) = cos(u, u′) = (cos(u), -sin(u) * u′).

For the independent variable x we use the pair X := (x₀, 1); a constant c is represented by C := (c, 0). Thus, for all admitted expressions the function value f(x₀) and the value of the first derivative f′(x₀) can be computed step by step, making use of the above rules.

²In the following the existence of the derivatives is always assumed. For functions from the class of arithmetic expressions this assumption can be checked easily.
+
Example 1 For the function f(x) = 25(x - 1)/(x² + 1), the function value and the value of the first derivative are to be computed at x₀ = 2. The employment of the differentiation arithmetic³ gives:

    X = (2, 1),
    X - 1 = (1, 1),
    25 * (X - 1) = (25, 25),
    X * X + 1 = (5, 4),
    F(X) = (25, 25) / (5, 4) = (25/5, (25 - (25/5) * 4)/5) = (5, 1),

i. e. f(2) = 5 and f′(2) = 1.
In the Appendix it is shown that the differentiation arithmetic can be implemented in a well-structured and efficient way, provided the programming language allows operator overloading. This is possible in the language PASCAL-XSC [19]. In particular, its standard datatypes are designed for numerical applications. An analogous statement holds for Ada, C++ and the FORTRAN-based language ACRITH-XSC [14].
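The chapter's reference implementation is in PASCAL-XSC; the same differentiation arithmetic can be sketched in any language with operator overloading. The following Python rendering (the class name `Dual` and the helper names are ours, not from the paper) reproduces Example 1:

```python
class Dual:
    """Ordered pair (u, u') of function value and first derivative,
    with the arithmetic rules of Section 2.1. Illustrative sketch only."""
    def __init__(self, u, du=0.0):
        self.u, self.du = u, du

    def __add__(self, v): return Dual(self.u + v.u, self.du + v.du)
    def __sub__(self, v): return Dual(self.u - v.u, self.du - v.du)
    def __mul__(self, v): return Dual(self.u * v.u, self.du * v.u + self.u * v.du)
    def __truediv__(self, v):
        q = self.u / v.u
        return Dual(q, (self.du - q * v.du) / v.u)

def var(x0):  return Dual(x0, 1.0)   # independent variable X := (x0, 1)
def const(c): return Dual(c, 0.0)    # constant C := (c, 0)

# Example 1: f(x) = 25(x - 1)/(x^2 + 1) at x0 = 2
X = var(2.0)
F = const(25.0) * (X - const(1.0)) / (X * X + const(1.0))
print(F.u, F.du)   # 5.0 1.0, i. e. f(2) = 5 and f'(2) = 1
```

Standard functions (exp, sin, cos) would be added as free functions applying the chain rule, exactly as in the pair formulas above.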
2.2 The Computation of Taylor Coefficients
The employment of automatic differentiation is not limited to the computation of first derivatives. By applying suitable recurrence relations, it is possible to compute derivatives of higher order. Furthermore, no symbolic manipulations or numerical approximations are necessary. Just as in the previous subsection, the admitted functions are arithmetic expressions. To simplify the presentation, we use Taylor coefficients instead of derivatives. For the Taylor coefficients of f : R → R at x = x₀, we use the following notation:

    (f)_k := f^(k)(x₀) / k!    (k = 0, 1, 2, ...),

i. e., f(x₀) = (f)₀, f′(x₀) = (f)₁, and f^(k)(x₀) = k! * (f)_k. The special functions f(x) = x and f(x) = c (for constants c ∈ R) lead to

    (x)₀ = x₀,  (x)₁ = 1,  (x)_k = 0 for k ≥ 2

and

    (c)₀ = c,  (c)_k = 0 for k ≥ 1.

³Of course, the usual priority rules are also applied concerning the operations of the differentiation arithmetic.
The rules for the differentiation of sums, products, and quotients lead (for k ≥ 0) to the following formulas for the operators +, -, *, and / (see e. g. [29], [30]):

    (f ± g)_k = (f)_k ± (g)_k,
    (f * g)_k = Σ_{j=0}^{k} (f)_j * (g)_{k-j},
    (f / g)_k = (1/(g)₀) * [ (f)_k - Σ_{j=0}^{k-1} (f/g)_j * (g)_{k-j} ],  (g)₀ ≠ 0.

In particular, if the functions f or g are the variable x or a constant, then the above formulas may be simplified. E. g., there hold:

    (x * g)_k = x₀ * (g)_k + (g)_{k-1},
    (c * g)_k = c * (g)_k.

For efficiency reasons, these cases have to be taken into consideration in the implementation.

With respect to the Taylor coefficients of h = f′ and the definition of (h)_k, the relation

    (h)_k = (f′)_k = (k + 1) * (f)_{k+1}    (2)

and the chain rule of differentiation make it possible to compute Taylor coefficients for standard functions.

For the exponential function w = exp(f), there holds w′ = exp(f) * f′, i. e. by use of (2),

    (w)₀ = exp((f)₀),
    (w)_k = (1/k) * Σ_{j=1}^{k} j * (f)_j * (w)_{k-j}    (k ≥ 1).    (3)

Analogously, for the functions ws = sin(f) and wc = cos(f), one gets (see e. g. [29], [30]):

    (ws)₀ = sin((f)₀),
    (ws)_k = (1/k) * Σ_{j=1}^{k} j * (f)_j * (wc)_{k-j}    (k ≥ 1)    (4)

and

    (wc)₀ = cos((f)₀),
    (wc)_k = -(1/k) * Σ_{j=1}^{k} j * (f)_j * (ws)_{k-j}    (k ≥ 1).    (5)
Thus, the computation of the Taylor coefficients of orders 0 (function value) to p ≥ 1 of a function can be realized by use of a Taylor arithmetic, i. e. an arithmetic of (p + 1)-vectors. These vectors are defined as follows: U := (u₀, ..., u_p) with u_k = (u)_k (k = 0, ..., p). In particular, the independent variable x is represented by the (p + 1)-vector X := (x₀, 1, 0, ..., 0), a constant c by the (p + 1)-vector C := (c, 0, ..., 0). E. g., a multiplication can be written in the form:

    U * V = (u₀, ..., u_p) * (v₀, ..., v_p) = (w₀, ..., w_p)
    with w_k = Σ_{j=0}^{k} u_j * v_{k-j}    (k = 0, ..., p).

Additionally, from this formula, an estimate of the number of operations can be derived: computing the coefficients of the product U * V from the coefficients of U and V requires ½ p(p + 1) additions and ½ (p + 1)(p + 2) multiplications, i. e. (p + 1)² operations are necessary. For the ratio of the cost A(T_p f) of the computation of all Taylor coefficients of f (including order p) and the cost A(f) for the computation of the function value, the following estimate is valid⁴ (e. g. [29]):

    A(T_p f) ≤ (p + 1)² * A(f).
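A minimal sketch of such a Taylor arithmetic in Python (the function names are illustrative; the implementations in [29], [30] also cover the remaining operations and standard functions):

```python
from math import exp

def taylor_mul(U, V):
    """Product of two Taylor-coefficient vectors of equal length p + 1:
    w_k = sum_{j=0}^{k} u_j * v_{k-j}."""
    p = len(U) - 1
    return [sum(U[j] * V[k - j] for j in range(k + 1)) for k in range(p + 1)]

def taylor_exp(F):
    """Recurrence (3): (w)_0 = exp((f)_0),
    (w)_k = (1/k) * sum_{j=1}^{k} j * (f)_j * (w)_{k-j}."""
    p = len(F) - 1
    W = [exp(F[0])] + [0.0] * p
    for k in range(1, p + 1):
        W[k] = sum(j * F[j] * W[k - j] for j in range(1, k + 1)) / k
    return W

# Independent variable x at x0 = 0, up to order p = 4: X = (0, 1, 0, 0, 0)
X = [0.0, 1.0, 0.0, 0.0, 0.0]
W = taylor_exp(X)   # Taylor coefficients of exp(x) at 0, namely 1/k!
print(W)            # 1.0, 1.0, 0.5, 0.1666..., 0.04166...
```

Note that only the order-0 coefficient requires a call of the standard function itself; all higher coefficients follow from the recurrence, which is the point of footnote 4 below.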
For p = 1, the Taylor arithmetic becomes the differentiation arithmetic of the preceding subsection. However, only the combination with interval arithmetic ([2], [29]) allows the maximum utilization of the automatic generation of Taylor coefficients. Numerous algorithms with automatic result verification (see e. g. [18], [20]) require estimations of remainder terms.

⁴Note that for the computation of the coefficients in (3), (4), and (5) only the Taylor coefficients of order 0 require a call of a standard function.
Automatic Differentiation
Interval arithmetic is a useful tool for this purpose. For two real intervals [a,b] and [c,d], the (interval) operations ∘ ∈ {+, -, ·, /} are defined by

[a,b] ∘ [c,d] := { x ∘ y | x ∈ [a,b], y ∈ [c,d] }.

The result is always an interval (in the case of a division it is assumed that 0 ∉ [c,d]), and the bounds of the result can be computed by use of suitable operations on the bounds of the operands.
E.g., for an addition [a,b] + [c,d] = [a+c, b+d], and for a subtraction, there holds [a,b] - [c,d] = [a-d, b-c]. For a (continuous) standard function sf ∈ {exp, sin, cos, ...}, the interval evaluation is defined by use of sf([a,b]) := { sf(x) | x ∈ [a,b] }.⁵ The following property (see e.g. [29]) is essential for the interval-arithmetic evaluation of an arithmetic expression f(x): provided the value of x is replaced by the interval X and all operations and standard functions are replaced by the corresponding interval operations and interval standard functions (and there are no inadmissible operations), then one gets the so-called interval extension F of f; additionally, the following important inclusion is valid:

f(X) := { f(x) | x ∈ X } ⊆ F(X).

Thus, the recurrence relations for the computation of Taylor coefficients can be evaluated in interval arithmetic: replacing x_0 by the interval X_0 and replacing operations by interval operations, the resulting intervals (F)_k will enclose the values of the corresponding Taylor coefficients.
By means of a suitable machine interval-arithmetic ([23],[24]), all interval computations can be carried out on a computer without losing the enclosure property. In particular, all possible rounding errors are automatically taken into consideration; of course, this is also true for the case where X_0 is a point interval, i.e. X_0 = [x_0, x_0].
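A much-simplified sketch of such a machine interval arithmetic is shown below. It is not the arithmetic of [23],[24]: instead of directed roundings, each computed bound is moved outward by one unit in the last place with math.nextafter, which is coarser than directed rounding but also preserves the enclosure property.

```python
import math

# Minimal machine interval arithmetic sketch: every computed bound is
# rounded outward with nextafter, so the enclosure property survives
# floating-point evaluation. Illustrative only.

def _down(x): return math.nextafter(x, -math.inf)
def _up(x):   return math.nextafter(x, math.inf)

def iadd(a, b):
    return (_down(a[0] + b[0]), _up(a[1] + b[1]))

def isub(a, b):
    return (_down(a[0] - b[1]), _up(a[1] - b[0]))

def imul(a, b):
    ps = [a[0] * b[0], a[0] * b[1], a[1] * b[0], a[1] * b[1]]
    return (_down(min(ps)), _up(max(ps)))

x = (0.1, 0.1)                 # point interval around the datum 0.1
s = iadd(iadd(x, x), x)        # encloses the exact sum of three copies
print(s[0] <= 0.3 <= s[1])     # True: the enclosure is not lost
```

A real implementation (as in PASCAL-XSC) would use the machine's directed rounding modes instead of the one-ulp widening used here.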
2.3 The Computation of Gradients
In this subsection we consider the problem of computing the gradient ∇f = (∂f/∂x_1, ..., ∂f/∂x_n) of a function f(x), x = (x_1, ..., x_n), making use of automatic differentiation.
The procedure is analogous to that of the differentiation arithmetic for the computation of first derivatives: instead of computing by means of pairs of real numbers, we use vectors, just as in the case of the Taylor arithmetic. The (n+1)-vectors are composed of the function value f(x_0) (x_0 = (x_1^0, ..., x_n^0)) and the n values of the partial derivatives with respect to the variables x_1, ..., x_n. The rules for these vectors are the well-known rules for the computation of gradients, e.g. in the case of multiplication: ∇(f·g) = f·∇g + g·∇f.
⁵E.g., for the (monotonic) exponential function, this is: exp([a,b]) = [exp(a), exp(b)].
Thus, the combination of two (n+1)-vectors U = (u_0, ..., u_n) and V = (v_0, ..., v_n) obeys the following laws of a gradient arithmetic (see e.g. [31]):

U ± V = (w_0, ..., w_n)   with   w_0 = u_0 ± v_0,   w_i = u_i ± v_i,
U · V = (w_0, ..., w_n)   with   w_0 = u_0 · v_0,   w_i = u_0 · v_i + v_0 · u_i,
U / V = (w_0, ..., w_n)   with   w_0 = u_0 / v_0,   w_i = (u_i - w_0 · v_i) / v_0   (v_0 ≠ 0),
exp(U) = (w_0, ..., w_n)   with   w_0 = exp(u_0),   w_i = w_0 · u_i,
sin(U) = (w_0, ..., w_n)   with   w_0 = sin(u_0),   w_i = cos(u_0) · u_i,
cos(U) = (w_0, ..., w_n)   with   w_0 = cos(u_0),   w_i = -sin(u_0) · u_i   (i = 1...n).

The independent variables x_1, ..., x_n are represented by the (n+1)-vectors X_1 := (x_1^0, 1, 0, ..., 0), ..., X_n := (x_n^0, 0, ..., 0, 1), and a constant c by C := (c, 0, ..., 0). Thus, the function value f(x_0) and the value of the gradient ∇f(x_0) can be computed for all arithmetic expressions in a step-by-step process.⁶

Example 2 For the function f(x_1, x_2, x_3, x_4) = x_1 · x_2 · x_3 · x_4, the function value and the gradient will be evaluated for x_1^0 = 1, x_2^0 = 2, x_3^0 = -1, x_4^0 = -2. The computation is carried out in the following steps:

X_1 = (1, 1, 0, 0, 0),
X_2 = (2, 0, 1, 0, 0),
X_3 = (-1, 0, 0, 1, 0),
X_4 = (-2, 0, 0, 0, 1),
F_5 = X_1 · X_2 = (2, 2, 1, 0, 0),
F_6 = F_5 · X_3 = (-2, -2, -1, 2, 0),
F_7 = F_6 · X_4 = (4, 4, 2, -4, -2),

i.e., f(1, 2, -1, -2) = 4, ∂f/∂x_1 = 4, ∂f/∂x_2 = 2, ∂f/∂x_3 = -4, and ∂f/∂x_4 = -2.
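The gradient arithmetic can be sketched in a few lines of Python; the fragment below is illustrative (all names are chosen here, not taken from the text) and reproduces the multiplication rule together with the steps of Example 2.

```python
# Gradient arithmetic on (n+1)-vectors (f, df/dx1, ..., df/dxn):
# a hedged sketch of the rules above.

def gvar(x0, i, n):
    """Independent variable x_i: (x_i^0, 0, ..., 1, ..., 0)."""
    v = [x0] + [0.0] * n
    v[i] = 1.0
    return v

def gmul(U, V):
    """Multiplication rule: w_0 = u_0*v_0, w_i = u_0*v_i + v_0*u_i."""
    n = len(U) - 1
    return [U[0] * V[0]] + [U[0] * V[i] + V[0] * U[i] for i in range(1, n + 1)]

# Example 2: f(x) = x1*x2*x3*x4 at (1, 2, -1, -2)
X1, X2, X3, X4 = (gvar(v, i + 1, 4) for i, v in enumerate((1.0, 2.0, -1.0, -2.0)))
F5 = gmul(X1, X2)      # (2, 2, 1, 0, 0)
F6 = gmul(F5, X3)      # (-2, -2, -1, 2, 0)
F7 = gmul(F6, X4)      # value 4, gradient (4, 2, -4, -2)
print(F7)
```

The final vector carries the function value in its first component and the four partial derivatives in the remaining ones, exactly as in the step-by-step computation above.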
The number of operations for the evaluation of the function and its gradient grows as a linear function of n: in every step, the n+1 components of the vectors have to be computed⁷; i.e., for A(f, ∇f) (the cost for computing the function and its gradient) and A(f) (the cost for computing the function), there holds (see e.g. [30]):

A(f, ∇f) / A(f) = O(n).   (6)

⁶In a proper implementation, the sparse structure of the vectors X and C can be used to reduce the number of operations.
⁷Of course, the number of operations for the approximation of gradients by divided differences grows analogously.
The method requires only a small amount of extra storage since the intermediate results can be overwritten immediately after their processing. Additionally, by use of the method of automatic differentiation, other problems can be treated in an efficient way: the computation of Hessian and Jacobian matrices as discussed in [30], and the calculation of Taylor coefficients of multivariate functions in [8]. Algorithms for the evaluation of the product of a gradient and a given vector or the product of a Hessian matrix and a given vector can be found in [8]. Here, only the problem of the product p of a gradient and a given vector v = (v_1, ..., v_n) ∈ R^n will be treated in more detail. By a modification of the algorithm for gradients, the product can be computed without an explicit evaluation of the gradient. For the ratio A(f, ∇f·v)/A(f), it can be shown that it is bounded by a constant, i.e., independent of n. We show the procedure for the case of the multiplication h = f × g:

p_h := ∇h · v = (f·∇g + g·∇f) · v = f·(∇g · v) + g·(∇f · v) = f·p_g + g·p_f,

where p_f and p_g denote the already computed products ∇f·v and ∇g·v. For the other operations, analogous formulas hold. The process starts with p_f := v_k for f = x_k, k = 1...n, and p_f := 0 for f = c ∈ R.
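The procedure can be sketched as follows: each intermediate result carries the pair (value, p), with p the accumulated product of its gradient and v. The fragment is a hedged illustration (names chosen here), not code from the text.

```python
# Gradient-vector product p = grad(f)·v without forming the gradient:
# the pair rule mirrors p_h = f*p_g + g*p_f for h = f*g.

def pvar(x0, vk):
    """Start value for f = x_k: carry (x_k, v_k)."""
    return (x0, vk)

def pmul(a, b):
    f, pf = a
    g, pg = b
    return (f * g, f * pg + g * pf)

# f(x1, x2) = x1*x2*x1 at x = (3, 5), direction v = (1, 0):
# grad f = (2*x1*x2, x1^2), so grad(f)·v = 2*3*5 = 30.
x1, x2, v = 3.0, 5.0, (1.0, 0.0)
a = pvar(x1, v[0])
b = pvar(x2, v[1])
val, p = pmul(pmul(a, b), a)
print(val, p)   # 45.0 30.0
```

Only a single scalar accompanies every intermediate value, which is why the cost stays bounded by a constant multiple of A(f), independently of n.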
3 Fast Automatic Differentiation
In this section, algorithms for fast automatic differentiation are presented. With these algorithms, it is possible to solve some of the problems of the preceding section more efficiently. The new method for the computation of gradients and Hessian matrices reduces the number of operations by one order of magnitude. In the first subsection, the basic algorithm of the fast method will be presented, i.e., the reverse computation. This reverse mode is also essential for the methods of Sections 4 and 5.
3.1 The Basic Algorithm of Reverse Computation
The following simple proposition for the solution of a linear system (with a triangular matrix) is the basis of the reverse method.
Proposition 1 Let Γ, α ∈ R^{s×s} (s ∈ N), where Γ = diag(γ_i^{-1}) with γ_i ≠ 0 (1 ≤ i ≤ s), α is the triangular matrix (α_ij) with α_ij = 0 for i ≤ j (1 ≤ i, j ≤ s), and let z = (z_i) ∈ R^s (1 ≤ i ≤ s). Then, for the solution h = (h_i) (1 ≤ i ≤ s) of (Γ - α)h = z, there holds:

h_i = γ_i ( Σ_{j=1}^{i-1} α_ij h_j + z_i )   (1 ≤ i ≤ s)   (7)

and

h_i = Σ_{j=1}^{i} d^{ij} z_j   (8)

with d^{ij} = 0 for j > i, d^{ii} = γ_i, and

d^{ij} = γ_j Σ_{k=j+1}^{i} d^{ik} α_kj   for j = i-1, i-2, ..., 1.   (9)
Proof From (Γ - α)h = z, there follows γ_i^{-1} h_i - Σ_{j=1}^{i-1} α_ij h_j = z_i (1 ≤ i ≤ s), i.e., h_i = γ_i ( Σ_{j=1}^{i-1} α_ij h_j + z_i ) (1 ≤ i ≤ s); thus, (7) has been proved.
Let L be given by L := Γ - α. Then, L = (l_ij) (1 ≤ i, j ≤ s) with l_ii = γ_i^{-1}, l_ij = 0 for i < j, i.e., det(L) = det(Γ), which is ≠ 0 by assumption. Let D be given by D = (d^{ij}) := L^{-1}; then h = Dz and h_i = Σ_{j=1}^{s} d^{ij} z_j (1 ≤ i ≤ s). With Γ^{-1} = diag(γ_i) and the identity matrix I, there holds:

D = (Γ - α)^{-1}
D(Γ - α) = I
DΓ = I + Dα
D = (I + Dα) Γ^{-1},

i.e., for 1 ≤ i ≤ s:

d^{ij} = ( δ_ij + Σ_{k=j+1}^{s} d^{ik} α_kj ) γ_j   for j = s, s-1, ..., 1.

Thus, with d^{ij} = 0 (j > i) and d^{ii} = γ_i, equations (8) and (9) have been proved. □
Remark 1 The computation of h by means of (7) is called the forward mode; that by use of (8) and (9), the reverse mode. The name of the latter is due to the computation of (9) in descending order. The formulas (8) and (9) are used in [7] to estimate the temporal complexity of gradients for rational functions f: R^n → R.
Remark 2 Here and subsequently, the following expressions are used synonymously: (i) the cost, the temporal complexity, and the number of operations (of an algorithm), and (ii) the storage cost, the spatial complexity, and the number of storage units.
Since in the following we are interested only in the component h_s and the values d^{ss}, ..., d^{s1} as following from the solution of (Γ - α)h = z, a corollary summarizes the basic algorithm.

Corollary 1 For s ∈ N, let γ_i, z_i ∈ R, γ_i ≠ 0 (1 ≤ i ≤ s), α_ij ∈ R (1 ≤ i, j ≤ s) with α_ij = 0 for i ≤ j, and the sequence h_1, ..., h_s with

h_1 = γ_1 z_1,
h_i = γ_i ( Σ_{j=1}^{i-1} α_ij h_j + z_i )   for 1 < i ≤ s.   (10)

Then, there holds

h_s = Σ_{j=1}^{s} d^j z_j   (d^j ∈ R, 1 ≤ j ≤ s)   (11)

with

d^j = γ_j ( δ_sj + Σ_{k=j+1}^{s} d^k α_kj )   for j = s, s-1, ..., 1.   (12)

The computation of the values d^s, d^{s-1}, ..., d^1 by use of (12) can be carried out by means of the following algorithm:

Algorithm 1 (Basic Algorithm of Reverse Mode)

1. {Initialisation} d^s := γ_s and d^j := 0 for j < s.

2. {Computation}
FOR k := s DOWNTO 2 DO
    d^j := d^j + γ_j α_kj d^k   for all j < k.   (13)
Proof From Proposition 1, there follows: h_s = Σ_{j=1}^{s} d^{sj} z_j with d^{sj} = γ_j ( δ_sj + Σ_{l=j+1}^{s} d^{sl} α_lj ) for j ≤ s and, thus, (12) with d^j := d^{sj}. Define now d^j_k := γ_j ( δ_sj + Σ_{l=k+1}^{s} d^l α_lj ) for 1 ≤ k ≤ s. Then, d^j_s = γ_j δ_sj is valid; this is step 1 of the algorithm. Furthermore, d^k = d^k_s = d^k_{s-1} = ... = d^k_k because of α_lk = 0 for l ≤ k. From d^j_{k-1} = γ_j ( δ_sj + Σ_{l=k}^{s} d^l α_lj ) = γ_j α_kj d^k + d^j_k for k = s, s-1, ..., 2 and j < k, step 2 of the algorithm can be derived. □
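Algorithm 1 can be checked numerically against the forward recurrence (10). The following Python sketch (with illustrative names and data chosen here) computes h by the forward mode and the d-values by the reverse mode and compares h_s with Σ_j d^j z_j.

```python
# Forward mode (10) versus reverse mode (12)/(13), 0-based indices.

def forward_h(gamma, alpha, z):
    """h_i = gamma_i * (sum_{j<i} alpha_ij h_j + z_i)."""
    s = len(z)
    h = [0.0] * s
    for i in range(s):
        h[i] = gamma[i] * (sum(alpha[i][j] * h[j] for j in range(i)) + z[i])
    return h

def reverse_d(gamma, alpha):
    """Algorithm 1: d^s := gamma_s, then d^j += gamma_j alpha_kj d^k."""
    s = len(gamma)
    d = [0.0] * s
    d[s - 1] = gamma[s - 1]
    for k in range(s - 1, 0, -1):      # k = s downto 2 (0-based)
        for j in range(k):
            d[j] += gamma[j] * alpha[k][j] * d[k]
    return d

gamma = [1.0, 2.0, 3.0]
alpha = [[0.0, 0.0, 0.0], [4.0, 0.0, 0.0], [5.0, 6.0, 0.0]]
z = [1.0, 1.0, 1.0]
h = forward_h(gamma, alpha, z)
d = reverse_d(gamma, alpha)
print(h[-1], sum(dj * zj for dj, zj in zip(d, z)))   # 198.0 198.0
```

Both modes produce the same h_s; the point of the reverse mode is that the d-values do not depend on z, so h_s can be re-evaluated for every right-hand side at the cost of an inner product.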
3.2 Fast Computation of Gradients
In this subsection we use the reverse mode of computation to get an algorithm for the fast evaluation of gradients. For an arithmetic expression f(x) of f: R^n → R, we define a decomposition f^1, ..., f^s of f as the sequence of intermediate results in the course of the evaluation of f. For the sake of simplicity, the sequence may start with the independent variables and the constants, i.e., f^1 = x_1, ..., f^n = x_n, and f^{n+1} = c_1, ..., f^m = c_{m-n}. For an expression, there may exist several decompositions.
Example 3 For the function f(x_1, x_2, x_3, x_4) = x_1 · x_2 · x_3 · x_4, the following sequence is a decomposition:

f^1 = x_1,
f^2 = x_2,
f^3 = x_3,
f^4 = x_4,
f^5 = f^1 · f^2,
f^6 = f^5 · f^3,
f^7 = f^6 · f^4.
For the function f(x) with decomposition f^1 ... f^s, the component g_ν (ν = 1...n) of the gradient ∇f = (g_1, ..., g_n) = ∇f^s = (g_1^s, ..., g_n^s) can be computed by use of the following steps (cf. forward mode):

g_ν^k = δ_νk   for f^k = x_k, k = 1...n,
g_ν^k = 0   for f^k = c_{k-n}, k = n+1...m,
g_ν^k = g_ν^l ± g_ν^r   for f^k = f^l ± f^r, k > m,
g_ν^k = f^l × g_ν^r + f^r × g_ν^l   for f^k = f^l × f^r, k > m,
g_ν^k = (g_ν^l - f^k × g_ν^r) / f^r   for f^k = f^l / f^r, k > m,
g_ν^k = sf'(f^l) × g_ν^l   for f^k = sf(f^l), k > m.
The common argument x = (x_1, ..., x_n) for f^k and g_ν^k has been omitted for reasons of simplicity.
Only the values g_ν^k, k = 1...n, depend explicitly on ν; i.e., for k = n+1...s, the expressions for all components of the gradient are the same. Thus, Corollary 1 can be applied. For k > m and l, r < k, we define α_kl := ∂f^k/∂f^l and α_kr := ∂f^k/∂f^r, i.e.:

α_kl = 1, α_kr = ±1   if f^k = f^l ± f^r,
α_kl = f^r, α_kr = f^l   if f^k = f^l × f^r,
α_kl = 1/f^r, α_kr = -f^k/f^r   if f^k = f^l / f^r,
α_kl = sf'(f^l)   if f^k = sf(f^l),

and otherwise α_kj = 0. By setting γ_k = 1 and z_k = δ_kν for fixed ν (ν ∈ {1...n}) and k = 1...s, we get the following expression for h_k = γ_k ( Σ_{j=1}^{k-1} α_kj h_j + z_k ):

h_k = δ_kν   for k = 1...m,

and, for k > m,

h_k = h_l ± h_r   if f^k = f^l ± f^r,
h_k = f^r × h_l + f^l × h_r   if f^k = f^l × f^r,
h_k = h_l / f^r - f^k / f^r × h_r   if f^k = f^l / f^r,
h_k = sf'(f^l) × h_l   if f^k = sf(f^l),

and thus h_s = g_ν^s. On the other hand, Corollary 1 shows that

h_s = Σ_{j=1}^{s} d^j z_j = d^ν,   since z_j = δ_jν,

i.e., ∇f = (d^1, ..., d^n), where the d-values are given by (12) or Algorithm 1. Thus, the computation of all components of the gradient of f is equivalent to one evaluation of the d-values (using the α-values defined above). This result can be summarized in the following algorithm:
Algorithm 2 (Evaluation of gradients (reverse mode))

{ in: f: R^n → R, x ∈ R^n, and a decomposition f^1 ... f^s of f;
  out: f = f(x), g = ∇f(x) = (∂f/∂x_i) (i = 1...n); }

1. {Forward step} Compute and store f^1(x) ... f^s(x).

2. {Initialisation} d^s := 1 and d^j := 0 for j < s.

3. {Reverse computation}
FOR k := s DOWNTO m+1 DO
    d^l := d^l + d^k, d^r := d^r ± d^k,   if f^k = f^l ± f^r,
    d^l := d^l + f^r × d^k, d^r := d^r + f^l × d^k,   if f^k = f^l × f^r,
    d^l := d^l + [d^k/f^r], d^r := d^r - f^k × [d^k/f^r],   if f^k = f^l / f^r
        {[d^k/f^r] has to be computed only once!},
    d^l := d^l + sf'(f^l) × d^k,   if f^k = sf(f^l).

4. {Output} f := f^s, g := (d^1, ..., d^n).
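A minimal Python sketch of Algorithm 2 for a "tape" consisting only of multiplications (which suffices for Example 4 below) may look as follows; the tape representation is an assumption made here, not taken from the text.

```python
# Reverse-mode gradient over a multiplication tape, 0-based indices.

def reverse_gradient(x, tape):
    """tape: list of (l, r) index pairs meaning f^k = f^l * f^r."""
    n = len(x)
    f = list(x)
    for l, r in tape:                     # 1. forward step: store all f^k
        f.append(f[l] * f[r])
    s = len(f)
    d = [0.0] * s                         # 2. initialisation
    d[s - 1] = 1.0
    for k in range(s - 1, n - 1, -1):     # 3. reverse computation
        l, r = tape[k - n]
        d[l] += f[r] * d[k]               # d^l := d^l + f^r * d^k
        d[r] += f[l] * d[k]               # d^r := d^r + f^l * d^k
    return f[-1], d[:n]                   # 4. output

# f = x1*x2*x3*x4 with f5 = f1*f2, f6 = f5*f3, f7 = f6*f4:
val, grad = reverse_gradient([1.0, 2.0, -1.0, -2.0], [(0, 1), (4, 2), (5, 3)])
print(val, grad)   # 4.0 [4.0, 2.0, -4.0, -2.0]
```

One reverse sweep yields all four partial derivatives, whereas the forward mode of Section 2 would carry n+1 components through every step.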
The number of operations for the computation of a gradient by means of the above algorithm can be estimated in the following way: depending on the elementary function f^k in step 3 of Algorithm 2, the number of necessary operations is listed in the table below. An addition or a subtraction is counted as 1A, a multiplication or a division as 1M, and a call of a standard function as 1SF. The ratio (A(f^k) + A(d-updates)) / A(f^k) is denoted by q_k. For the sake of simplicity, all operations are weighted with the same factor.

function f^k    A(f^k)    A(d-updates)    q_k
f^l ± f^r       1A        2A              3
f^l × f^r       1M        2A + 2M         5
f^l / f^r       1M        2A + 2M         5
sf(f^l)         1SF       1A + 1M + 1SF   4
Thus, in Algorithm 2, the ratio of the costs A(f, ∇f) and A(f) can be estimated to be

A(f, ∇f) / A(f) ≤ max_k q_k ≤ 5.   (14)

In particular, inequality (14) shows that the temporal complexity of the computation of the gradient (including the function value) is of the same order as the complexity of the computation of the function value. The name 'fast automatic differentiation' can be justified by a comparison with the estimate (6): the computation of gradients by use of the new algorithm has a complexity which is one order of magnitude less than that of the forward mode. Estimates analogous to (14) can be found in [11] and [15]. In these sources, the results are proved by use of graph theory. Just as in the previous section, interval arithmetic will be used to compute guaranteed enclosures of the results. If the bounds are not sufficiently tight, defect-correction methods for the evaluation of formulas may improve the results [9]. Algorithm 2 is now illustrated by an example.
Example 4 Choose f(x_1, x_2, x_3, x_4) = x_1 · x_2 · x_3 · x_4 and x_1^0 = 1, x_2^0 = 2, x_3^0 = -1, x_4^0 = -2. By use of the intermediate results f_1, ..., f_7, one gets:

f_1 = x_1 = 1          d^1 = 0 + f_2 · d^5 = 4
f_2 = x_2 = 2          d^2 = 0 + f_1 · d^5 = 2
f_3 = x_3 = -1         d^3 = 0 + f_5 · d^6 = -4
f_4 = x_4 = -2         d^4 = 0 + f_6 · d^7 = -2
f_5 = f_1 · f_2 = 2    d^5 = 0 + f_3 · d^6 = 2
f_6 = f_5 · f_3 = -2   d^6 = 0 + f_4 · d^7 = -2
f = f_7 = f_6 · f_4 = 4    d^7 = 1

i.e., f(1, 2, -1, -2) = 4, ∇f = (d^1, d^2, d^3, d^4) = (4, 2, -4, -2).
The next example shows that the fast computation of gradients enables one to attack problems of considerable size, too.
Example 5 Let f be the Helmholtz energy function (see e.g. [11]):

f(x) = Σ_{i=1}^{n} x_i ln( x_i / (1 - b^T x) ) - (x^T A x) / (√8 b^T x) · ln( (1 + (1 + √2) b^T x) / (1 + (1 - √2) b^T x) ),

0 ≤ x, b ∈ R^n, A = A^T ∈ R^{n×n}.
Algorithm 2 was coded in PASCAL-XSC. The function value f(x) and the gradient ∇f(x) were evaluated for different values of x. The time for the evaluation of the function value was measured by use of a 'straightforward' program (using FOR-loops for the sum and the vector products). The following table shows the ratio V = RT(f, ∇f) / RT(f) of running times for different values of n:

n          V
n = 10     2.35
n = 20     2.85
n = 50     3.59
n = 100    9.99
A possible disadvantage of the fast automatic differentiation in its basic form of Algorithm 2 is the necessity of storing all the intermediate results f^1 ... f^s; they are required in the course of the computation of the d-values in step 3 of Algorithm 2. Instead of storing all intermediate results, they can be recomputed when needed. In many cases this recomputation can be carried out efficiently: the original steps of the evaluation are inverted ([10],[28]). E.g., the intermediate results f_{n+1} := f_1 · f_2, f_{n+2} := f_{n+1} · f_3, ..., f_{2n-1} := f_{2n-2} · f_n, which occur in the course of the computation of f(x_1, ..., x_n) = Π_{i=1}^{n} x_i (x_i ≠ 0), can be recomputed from f = f_{2n-1} by use of the inverted program f_{n+i-2} := f_{n+i-1} / f_i, i = n, ..., 3. The reduction of the required storage will increase the number of operations: about one additional evaluation of f will hereby be necessary. Another approach for balancing the temporal and spatial complexity in reverse automatic differentiation is presented in [12].
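For the product example this inversion can be sketched as follows (illustrative Python, names chosen here): the running products are recovered from the final value by divisions, so they need not be stored for the reverse sweep.

```python
# Recomputing intermediate results by inverting the evaluation steps
# of f(x_1,...,x_n) = x_1 * ... * x_n (all x_i != 0).

def running_products(x):
    """The stored intermediates f_{n+1}, ..., f_{2n-1}."""
    out = [x[0] * x[1]]
    for xi in x[2:]:
        out.append(out[-1] * xi)
    return out

def recompute_from_final(x, final):
    """Inverted program: divide the factors back out of the final value."""
    out = [final]
    for xi in reversed(x[2:]):
        out.append(out[-1] / xi)
    return list(reversed(out))

x = [2.0, 4.0, 5.0, 0.5]
stored = running_products(x)                          # [8.0, 40.0, 20.0]
print(recompute_from_final(x, stored[-1]) == stored)  # True
```

Only the final value and the inputs are kept; the divisions amount to roughly one extra evaluation of f, as stated above.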
The method of fast automatic differentiation (just like the automatic differentiation in the forward mode) may help to solve some other problems efficiently: the computation of a Hessian matrix and the evaluation of the product of a Hessian matrix and a given vector can be found in [8]. In both cases, the new algorithms reduce the temporal complexity by one order of magnitude as compared to the results of Section 2. Additionally, the method of reverse evaluation may be used for 'symbolic' computations. In [8], a method for the generation of an explicit code for gradients is discussed: a source program for the evaluation of a function f is transformed into an extended program which evaluates f and its gradient. For the extended program, the complexity estimation (14) is valid.
4 Fast Computation of Interval Slopes
The concepts of (interval) slopes and derivatives are closely connected. In [21] and [22], interval slopes for rational functions (depending on one or several variables) and their corresponding centered forms are discussed. Additionally, a procedure for the computation of an interval slope F_I[·,·] for a rational function f is given there; the complexity of this procedure can be estimated by

A(f, F_I[·,·]) / A(f) = O(n).

In the following, an algorithm is presented which computes an interval slope F_II[·,·] for an arithmetic expression with complexity

A(f, F_II[·,·]) / A(f) = O(1),

i.e., the costs are reduced by one order of magnitude. With this interval slope, a quadratically convergent centered form can be defined. Thus, in combination with a subdivision strategy, it is possible to compute enclosures for ranges of values.
4.1 Slopes for Arithmetic Expressions
The following definition and proposition are based on results concerning rational functions in [22].
Definition 1 Let f: D → R, D ⊆ R^n be given by an arithmetic expression. Then, a continuous function f[·,·]: D × D → R^n with

f(x) - f(z) = f[x, z] · (x - z)   for all x, z ∈ D

is called a slope for f.
Remark 3 The vector f[x, z] ∈ R^n is a row vector, i.e., the product f[x, z] · (x - z) is the product of a (1 × n)- and an (n × 1)-matrix.
Remark 4 In general, there may be different slopes for a function; e.g.,

f[x, z] = (1, 1)   and   f[x, z] = (1 + x_2 - z_2, 1 - (x_1 - z_1))

are slopes for f(x_1, x_2) = x_1 + x_2.
The following proposition provides expressions for the computation of a slope for a function f (with decomposition f^1 ... f^s). In analogy with the evaluation of gradients, this method is called computation in forward mode.
Proposition 2 Let f: D → R, D ⊆ R^n be an arithmetic expression, f^1 ... f^s a decomposition of f, and x, z ∈ D. Furthermore, choose

f^k[x, z] := e_k {k-th unit vector}   for k = 1...n,
f^k[x, z] := 0 {zero vector}   for k = n+1...m,

and, for k > m (l, r < k),

f^k[x, z] := -f^l[x, z],   if f^k = -f^l,
f^k[x, z] := f^l[x, z] ± f^r[x, z],   if f^k = f^l ± f^r,
f^k[x, z] := f^l(x) f^r[x, z] + f^r(z) f^l[x, z],   if f^k = f^l × f^r,
f^k[x, z] := (f^l[x, z] - f^k(z) f^r[x, z]) / f^r(x),   if f^k = f^l / f^r,
f^k[x, z] := (f^l(x) + f^l(z)) f^l[x, z],   if f^k = (f^l)²,
f^k[x, z] := f^l[x, z] / (f^k(x) + f^k(z)),   if f^k = √(f^l) and (f^k(x) + f^k(z)) > 0,

and, in the case of a standard function f^k = sf(f^l), f^k[x, z] := sf[f^l(x), f^l(z)] × f^l[x, z] with

sf[a, b] := (sf(a) - sf(b)) / (a - b)   if a ≠ b,
sf[a, b] := sf'(a)   if a = b.

Then, f[x, z] := f^s[x, z] is a slope for f.
Proof We show by induction that f^1[x, z], ..., f^k[x, z], ..., f^s[x, z] are slopes for f^1, ..., f^k, ..., f^s. For k = 1...m, this is trivial. Now, for k > m, let f^l[x, z] and f^r[x, z] be slopes for f^l and f^r (l, r < k). If f^k = f^l ± f^r, then:

f^k(x) - f^k(z) = f^l(x) ± f^r(x) - (f^l(z) ± f^r(z))
               = f^l(x) - f^l(z) ± (f^r(x) - f^r(z))
               = f^l[x, z] · (x - z) ± f^r[x, z] · (x - z)
               = f^k[x, z] · (x - z).

Thus, f^k[x, z] is a slope for f^k; the other cases are treated analogously. For k = s, this completes the proof. □

Remark 5 The additional assumption (f^k(x) + f^k(z)) > 0 for f^k = √(f^l) is necessary, since the square root can be differentiated in (0, ∞) only; for x, z with f^k(x) = f^k(z) = 0, a slope is not defined. For standard functions such as exp, sin, and cos, the derivative exists on the entire domain.
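The forward slope recurrences can be sketched for a univariate expression as follows; each node carries the triple (f(x), f(z), slope), and the defining identity f(x) - f(z) = f[x,z]·(x - z) can then be checked directly. The fragment is an illustration with names chosen here, not code from the text.

```python
# Forward-mode slope arithmetic (univariate) for Proposition 2.

def svar(x, z):
    """Independent variable: slope 1."""
    return (x, z, 1.0)

def smul(a, b):
    fx, fz, fs = a
    gx, gz, gs = b
    # slope of f*g: f(x)*g[x,z] + g(z)*f[x,z]
    return (fx * gx, fz * gz, fx * gs + gz * fs)

def ssub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

# f(t) = (t - t*t) * t at x = 2, z = 0.5
x, z = 2.0, 0.5
t = svar(x, z)
f = smul(ssub(t, smul(t, t)), t)
print(f[0] - f[1], f[2] * (x - z))   # -4.125 -4.125
```

Both numbers agree (exactly, since all values here are dyadic), confirming the slope property of the recurrences.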
By setting x = z in the above proposition, one gets the recurrence relations for the computation of the gradient of f, i.e. ∇f(x) = f[x, x]. Therefore, the complexity of the computation of slopes by use of the above algorithm can be estimated by

A(f, f[·,·]) / A(f) = O(n).

Analogously to the case of the fast computation of gradients, it is now possible to formulate a fast algorithm for the computation of slopes: the application of the corollary in Section 3 reduces the complexity to

A(f, f[·,·]) / A(f) = O(1).
Algorithm 3 (Computation of slopes (reverse mode))

{ in: f: R^n → R, x, z ∈ R^n, and a decomposition f^1 ... f^s of f;
  out: f(x), f(z), and the slope f[x, z] by use of Proposition 2; }

1. {Forward step} Compute and store f^1(x) ... f^s(x) and f^1(z) ... f^s(z).

2. {Initialisation} d^s := 1 and d^j := 0 for j < s.

3. {Reverse computation}
FOR k := s DOWNTO m+1 DO
    d^l := d^l + d^k, d^r := d^r ± d^k,   if f^k = f^l ± f^r,
    d^l := d^l + f^r(z) × d^k, d^r := d^r + f^l(x) × d^k,   if f^k = f^l × f^r,
    d^l := d^l + d^k / f^r(x), d^r := d^r - f^k(z) × d^k / f^r(x),   if f^k = f^l / f^r,
    d^l := d^l + (f^l(x) + f^l(z)) × d^k,   if f^k = (f^l)²,
    d^l := d^l + d^k / (f^k(x) + f^k(z)),   if f^k = √(f^l),
    d^l := d^l + sf[f^l(x), f^l(z)] × d^k,   if f^k = sf(f^l).

4. {Output} f(x) := f^s(x), f(z) := f^s(z), f[x, z] := (d^1, ..., d^n).
Only the combination of interval arithmetic with the concept of slopes makes it possible to compute enclosures for ranges. Therefore, interval slopes are defined as follows (cf. [21]):
Definition 2 Let f: D → R, D ⊆ R^n be an arithmetic expression. Then, G ∈ IR^n is called an interval slope for f with respect to X ⊆ D (X ∈ IR^n) and z ∈ X provided

f(x) - f(z) ∈ G · (x - z)

is valid for all x ∈ X.
Remark 6 Interval slopes can also be defined for non-differentiable functions.
In the following, we discuss a different possibility to compute interval slopes: in the decomposition f^1 ... f^s of f, the value of x is replaced by the interval vector X, and the operations and functions are replaced by the corresponding interval operations and interval functions; in the absence of errors, this will lead to the interval evaluation F^1(X) ... F^s(X) = F(X) of f. The following example shows that the interval evaluation may not exist.
Example 6 The function f(x) = 1/(x · x + 1) is defined for all x ∈ R. However, for the decomposition

f^1 = x, f^2 = 1, f^3 = f^1 · f^1, f^4 = f^3 + f^2, f^5 = f^2 / f^4

and X = [-2, 2], the interval evaluation does not exist, since F^4(X) = [-3, 5] ∋ 0; i.e., the final division is not possible. For the decomposition

f^1 = x, f^2 = 1, f^3 = (f^1)², f^4 = f^3 + f^2, f^5 = f^2 / f^4,

one gets F^3 = [0, 4], F^4 = [1, 5], and F = [0.2, 1], provided the standard function for squaring is defined as X² := { x² | x ∈ X }; the property f^3 ≥ 0 is here still valid in interval arithmetic.

In the following, we assume that all relevant intervals exist. If the operations in Proposition 2 are carried out in interval arithmetic, then, by inclusion isotony, we get

f^k[x, z] ∈ F^k[X, z]   for all k = 1...s and all x ∈ X   (z fixed).
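The distinction between X·X and the squaring operation X² can be sketched as follows (illustrative Python; rounding is ignored here):

```python
# X*X treats the two factors as independent intervals; the squaring
# operation X^2 = {x^2 | x in X} keeps the dependence and stays >= 0.

def imul(a, b):
    ps = [a[0] * b[0], a[0] * b[1], a[1] * b[0], a[1] * b[1]]
    return (min(ps), max(ps))

def isqr(a):
    lo, hi = abs(a[0]), abs(a[1])
    if a[0] <= 0.0 <= a[1]:
        return (0.0, max(lo, hi) ** 2)
    return (min(lo, hi) ** 2, max(lo, hi) ** 2)

X = (-2.0, 2.0)
print(imul(X, X))   # (-4.0, 4.0): shifted by 1 gives [-3, 5], contains 0
print(isqr(X))      # (0.0, 4.0):  shifted by 1 gives [1, 5], division works
```

This is exactly why the second decomposition in Example 6 admits an interval evaluation while the first does not.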
In particular, F[X, z] = F^s[X, z] (z ∈ X, z fixed) is an interval slope for f in X, since for all x ∈ X:

f(x) - f(z) = f[x, z] · (x - z) ∈ F[X, z] · (X - z).

Furthermore, with f[z, z] = ∇f(z), one gets for x, z ∈ X:

∇f(x) ∈ ∇_I f(X) := F_I[X, X],   (16)

where the subscript I indicates that the slope or gradient was computed by use of Proposition 2 (i.e. in forward mode). The same convention will be used for F_I[X, z] and F_I[X, X], since the next section presents an algorithm for slopes which uses the reverse mode.
4.2 Computation of Interval Slopes in Reverse Mode
As in the preceding subsection, we assume the existence of all intervals in the expressions to be evaluated. Through a replacement of x by X, of sf[f^l(x), f^l(z)] by sf'(F^l(X)), of d^k by D^k, and of the operations of Algorithm 3 by the corresponding interval-arithmetic operations, there holds:

(d^1, ..., d^n) = f[x, z] ∈ (D^1, ..., D^n) =: F_II[X, z].

In particular, F_II[X, z] (z ∈ X, z fixed) is an interval slope for f in X; for x, z ∈ X, we get:

∇f(x) ∈ ∇_II f(X) := F_II[X, X].   (17)

For the complexity of the computation of an interval slope by use of methods I and II, estimates analogous to those for point slopes can be derived. In particular, the new method II is faster (by a factor of n) than the procedure for rational functions as described in [22]. The following example shows that the results F_I[X, z] and F_II[X, z] (∇_I f(X) and ∇_II f(X), respectively) can be different. A priori, it cannot be predicted which method computes tighter bounds. Therefore, in most cases method II will be preferable due to its lower cost. If both F_I (∇_I) and F_II (∇_II) are computed, then an intersection may be used to improve the enclosures.
Example 7 Choose f(x) = (x - x²) · x with decomposition f^1 = x, f^2 = x², f^3 = f^1 - f^2, f^4 = f^1 · f^3. Then, we get:

F_I[X, z] = (1 - (X + z)) · X + (z - z²),
∇_I f(X) = (1 - 2X) · X + (X - X²),
F_II[X, z] = (z - z²) + X - X(X + z),
∇_II f(X) = (X - X²) + X - X(2X).

In interval arithmetic, the law of subdistributivity is valid; i.e., F_I[X, z] ⊆ F_II[X, z] and ∇_I f(X) ⊆ ∇_II f(X). For X = [0, 1], z = 0, the enclosures are proper (i.e., they hold without the admission of the equality sign).

Choose f(x) = 2x · x² - x² with decomposition f^1 = x, f^2 = 2x, f^3 = (f^1)², f^4 = f^2 · f^3, f^5 = f^4 - f^3. For this function we get:

F_I[X, z] = 2z² + 2X(X + z) - (X + z),
∇_I f(X) = 2X² + (2X)(2X) - 2X,
F_II[X, z] = (2X - 1)(X + z) + 2z²,
∇_II f(X) = (2X - 1)(2X) + 2X²;

i.e., F_II[X, z] ⊆ F_I[X, z] and ∇_II f(X) ⊆ ∇_I f(X). Here, the choice X = [0, 1], z = 1 will lead to a proper enclosure.
4.3 Interval Slopes and Centered Forms
In this subsection we use the interval slopes of the previous subsection to define quadratically convergent Krawczyk forms. In the paper of Krawczyk and Neumaier ([22]), interval slopes were first used to improve centered forms of rational functions.
Definition 3 Let f be an arithmetic expression and F_I[X, z], F_II[X, z] (X ∈ IR^n) the corresponding interval slopes as computed by method I and method II, respectively. Then, the intervals

F²_I(X) := f(z) + F_I[X, z] · (X - z)   and   F²_II(X) := f(z) + F_II[X, z] · (X - z)

are called Krawczyk forms for f and z ∈ X.
Since F_I and F_II are interval slopes, the following is valid for all x, z ∈ X:

f(x) - f(z) ∈ F_{I/II}[X, z] · (x - z) ⊆ F_{I/II}[X, z] · (X - z),

i.e., f(x) ∈ F²_{I/II}(X) and thus f(X) ⊆ F²_{I/II}(X) for all z ∈ X, with f(X) denoting the range of values of f. Because of (16) and (17), normally the enclosures are tighter than those of the mean-value form [2,32]: F_M(X) = f(z) + ∇_{I/II} f(X) · (X - z). The new centered forms are quadratically convergent. This is shown by the next proposition.
Proposition 3 Let f: D → R be an arithmetic expression, X_0 ⊆ D ⊆ R^n, and f^1 ... f^s a decomposition of f. For X_0 ∈ IR^n, the existence of the interval slopes F_I[X_0, X_0] and F_II[X_0, X_0] is assumed. Then, for all z ∈ X ⊆ X_0, there exist constants δ_I := δ_I(f, X_0) and δ_II := δ_II(f, X_0) (independent of X) such that the following inequalities are valid for the corresponding centered forms (where w([a, b]) := b - a denotes the width of an interval):

w(F²_I(X)) - w(f(X)) ≤ δ_I ‖w(X)‖²_∞

and

w(F²_II(X)) - w(f(X)) ≤ δ_II ‖w(X)‖²_∞.
Proof See [8]. □

The Krawczyk forms can be used to compute lower bounds (with arbitrary accuracy) for the global minimum f* := min_{x ∈ X} { f(x) } of a function f: D → R, D ⊆ R^n, where X ⊆ D is an interval vector, i.e. X ∈ IR^n, and f is an arithmetic expression. For this purpose, X is subdivided step by step into smaller intervals. The minima for the ranges of the subregions can be bounded by use of the Krawczyk forms. The subdivision is based on the strategy of Skelboe [35], which guarantees that the number of regions does not grow exponentially as the subdivision is refined. Monotonicity tests making use of the values of partial derivatives improve the performance of the method. For a detailed outline of the algorithm, see [8]. Additionally, there is the following comparison with numerical results from [4].
Example 8 For the function

f(x_1, x_2, x_3, x_4, x_5) = f_1(x_1) · f_2(x_2) · f_3(x_3) · f_4(x_4) · f_5(x_5),

the range of values has to be computed. For

x_1 ∈ [8.7, 8.8],   x_2 ∈ [-9.4, -9.3],   x_3 ∈ [3.5, 3.6],   x_4 ∈ [-4.6, -4.5],   x_5 ∈ [-2.9, -2.8],

the algorithm computes

f(x_1, x_2, x_3, x_4, x_5) ∈ [24315.0, 24513.7] and [24345.9, 24481.5],

respectively, using either no or between one and five subdivisions. In [4], the computation of the enclosure [24054.5, 24774.2] already requires four subdivisions; a naive interval evaluation yields [22283.5, 26731.4].
5 Formula Evaluation and Rounding-Error Analysis
In this section we consider the problem of evaluating functions in floating-point arithmetic. The method of reverse computation makes it possible to perform the error analysis in an automated way; for reasons of simplicity, only absolute errors are considered here. The problem of the accurate evaluation of a segment of numerical code will now be discussed briefly. Here, accurate refers to the evaluation in the set R, rather than the set of machine numbers.
For the arithmetic expression f: R^n → R, the value of f(x) may be approximated by use of the floating-point result f̃(x̃). The following two problems will be treated:

1. What is the error |f̃(x̃) - f(x)| (or which bound can be given for it), provided the computation is carried out by use of a floating-point arithmetic with precision l? The input data may be given exactly by l digits or rounded to l digits. (It is assumed that the basic floating-point operations +, -, ×, and / compute the optimal l-digit result, see [24].)

2. Which maximal input errors can be allowed for the values of x_1 ... x_n, and which floating-point precision is necessary in the course of the computation of f in order to guarantee the error bound |f̃(x̃) - f(x)| ≤ ε (for fixed ε ∈ R)?
The first problem can be solved easily by use of interval arithmetic. In general, the error bounds of this computation are too pessimistic. However, the combination
of interval arithmetic and reverse computation makes it possible to estimate the rounding errors in a satisfactory way.
A first algorithm (recomputation algorithm) for the solution of the second problem (without use of the reverse computation) has been given in [34]. In the execution of the first step of the algorithm, interval enclosures of all intermediate results are computed. With these enclosures, it is possible to determine the precision of the input data and of the computation in order to guarantee the estimate |f̃(x̃) - f(x)| ≤ ε. Since the algorithm in [34] does not use methods of automatic differentiation (in [34], they are believed to be too inefficient), the errors of the individual intermediate results cannot be computed separately. Therefore, the algorithm must compute with precisions that in general are unnecessarily high. Here, a method is presented which makes it possible to express the total rounding error in terms of the input errors and the errors committed during the floating-point computation. The cost of the method is of the same order as the function evaluation itself.
5.1 Evaluation of Expressions and Reverse Computation
In this section, the following notations will be used: by x̃_k := rd_{p_k}(x_k) we denote the result of rounding x_k ∈ R to the set S_{p_k} of floating-point numbers of precision p_k; f̃^k = f̃^l ∘_{p_k} f̃^r and f̃^k = sf_{p_k}(f̃^l) denote the floating-point operations of precision p_k, where it is assumed that f̃^l, f̃^r ∈ S_{p_k}, i.e., no additional rounding of the inputs f̃^l and f̃^r is necessary. Similarly, F̃^k denotes the evaluation of F^k in machine interval-arithmetic.
The next proposition summarizes some well-known results for the propagation of absolute errors.
Proposition 4 Let x̃_i be an approximation of x_i and u_i = x̃_i - x_i (i = 1, 2). Then, for the propagated absolute errors u := (x̃_1 ∘ x̃_2) - (x_1 ∘ x_2) and u := sf(x̃_1) - sf(x_1), the following is valid:

u_± = u_1 ± u_2   for ∘ = ±,
u_× = x̃_2 u_1 + x_1 u_2   for ∘ = ×,
u_/ = (u_1 - (x̃_1/x̃_2) u_2) / x_2   for ∘ = /   (x_2, x̃_2 ≠ 0),
u_sf = sf'(ξ) u_1   for a ξ between x_1 and x̃_1,

with differentiable standard functions sf and ξ in the domain of sf'.
Proof E.g., division:

x̃_1/x̃_2 - x_1/x_2 = (u_1 x̃_2 - x̃_1 u_2) / (x̃_2 x_2) = (u_1 - (x̃_1/x̃_2) u_2) / x_2.

The other cases are proved analogously. □
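The propagation rules for × and / are exact algebraic identities and can be checked directly. In the following Python sketch, the data are chosen dyadic so that floating-point evaluation introduces no rounding (an illustration, not from the text):

```python
# Numerical check of the propagation rules of Proposition 4.
# All values are dyadic, so every operation below is exact in binary
# floating point and the identities hold without any rounding error.

x1, x2 = 1.0, 4.0
u1, u2 = 0.5, -2.0            # absolute errors of the approximations
x1t, x2t = x1 + u1, x2 + u2   # approximations x~_1 = 1.5, x~_2 = 2.0

# multiplication: u_x = x~_2*u_1 + x_1*u_2
mul_ok = (x1t * x2t - x1 * x2) == (x2t * u1 + x1 * u2)

# division: u_/ = (u_1 - (x~_1/x~_2)*u_2)/x_2
div_ok = (x1t / x2t - x1 / x2) == (u1 - (x1t / x2t) * u2) / x2

print(mul_ok, div_ok)   # True True
```

Both identities hold exactly for these data; with general data they hold in real arithmetic, and interval evaluation is what restores guarantees on a machine.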
For the arithmetic expression f(z) and its decomposition f'. ..f", the (local) rounding error zk is given by zk
:= z k - X k for k = 1.. .n, := ti&,, - Ck-,, for k = 72 t 1.. . m ,
zk
:=
%k
:= sf,,(f')
t k
f'o,,p-f'op
- sf (f')
for k > m a n d o € { + , - , x , / } , for a standard function.
Thus, using Proposition 4, the total absolute error Δ_k := f̃^k(x̃) − f^k(x) is given by:

Δ_k = z_k for k = 1 … m,
Δ_k = Δ_l ± Δ_r + z_k, if f^k = f^l ± f^r,
Δ_k = f̃^r Δ_l + f^l Δ_r + z_k, if f^k = f^l × f^r,
Δ_k = (Δ_l − f̃^k Δ_r) / f^r + z_k, if f^k = f^l / f^r,
Δ_k = sf′(ξ) Δ_l + z_k, if f^k = sf(f^l) (ξ between f^l and f̃^l).
Because of Corollary 1, the absolute error Δ = f̃(x̃) − f(x) can be written as Δ = Δ_s = Σ_{j=1}^s d_j z_j, where the values d_j can be computed analogously to the case of Algorithm 1.
However, some of the values are not known exactly, e.g., f^l in the case of f^k = f^l × f^r. If they are replaced by the corresponding approximations (e.g., f^l by f̃^l), then only approximations of the d_j can be calculated. By use of interval arithmetic, it is possible to compute guaranteed enclosures D^j of the values d_j. Choose enclosures X̃_i and C̃_i such that x_i, x̃_i ∈ X̃_i (i = 1 … n) and c_i, c̃_i ∈ C̃_i (i = 1 … m − n). The machine interval-arithmetic evaluations of F̃^1 … F̃^s are assumed to exist for the computation with precision P.
Then, with initializations D^s := 1, D^k := 0 (k < s), and

D^l := D^l + D^k, D^r := D^r ± D^k, if f^k = f^l ± f^r,
D^l := D^l + F̃^r × D^k, D^r := D^r + F̃^l × D^k, if f^k = f^l × f^r,
D^l := D^l + D^k / F̃^r, D^r := D^r − F̃^k × D^k / F̃^r, if f^k = f^l / f^r,
D^l := D^l + SF′(F̃^l) × D^k, if f^k = sf(f^l)

(for k := s, s−1, …, m+1), there holds:

Δ ∈ Σ_{j=1}^n D^j (x̃_j − x_j) + Σ_{j=n+1}^m D^j (c̃_{j−n} − c_{j−n}) + Σ_{j=m+1}^s D^j z_j .  (18)
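The reverse accumulation of the weights can be sketched for a small code list. The Python sketch below is an illustration only (exact rational numbers stand in for the interval enclosures D^j of the chapter); it verifies for f(x_1, x_2) = (x_1·x_2)/(x_1 + x_2), with injected local errors z_k, that the accumulated weights reproduce the total error Δ = Σ_j d_j z_j exactly:

```python
from fractions import Fraction as F

# Code list: f1 = x1, f2 = x2, f3 = f1*f2, f4 = f1 + f2, f5 = f3/f4
x = [F(2), F(3)]
z = [F(1, 100), F(-1, 200), F(1, 300), F(1, 400), F(-1, 500)]  # local errors z_1..z_5

# exact forward sweep
f = [x[0], x[1]]
f.append(f[0] * f[1])
f.append(f[0] + f[1])
f.append(f[2] / f[3])

# perturbed forward sweep: every step commits its local error z_k
ft = [x[0] + z[0], x[1] + z[1]]
ft.append(ft[0] * ft[1] + z[2])
ft.append(ft[0] + ft[1] + z[3])
q = ft[2] / ft[3]              # quotient before the local rounding z_5
ft.append(q + z[4])

# reverse sweep: weights d_j built with the mixed exact/approximate factors
# of the error recurrences (cf. Proposition 4)
d = [F(0)] * 4 + [F(1)]
d[2] += d[4] / f[3]
d[3] += -q * d[4] / f[3]       # f5 = f3/f4
d[0] += d[3]
d[1] += d[3]                   # f4 = f1+f2
d[0] += ft[1] * d[2]
d[1] += f[0] * d[2]            # f3 = f1*f2

total = sum(dj * zj for dj, zj in zip(d, z))
assert ft[4] - f[4] == total   # Delta = sum_j d_j z_j holds exactly
print("total error =", float(total))
```

In the chapter, the unknown exact factors are replaced by interval enclosures, which turns this exact decomposition into the guaranteed enclosure (18).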
The values D̃^j are computed by evaluating the preceding formulas making use of machine interval arithmetic with precision P. By means of (18), it is possible to estimate the contribution of the error propagation (of input errors) and that of the rounding errors. Provided all operations are carried out exactly, i.e., z_k = 0 for k > m, then the total error is caused only by the propagation of input errors; these may be a consequence of imprecise or rounded data.
For the propagated error Δ_x := Σ_{j=1}^m d_j z_j, the following is valid because of (18):

Δ_x ∈ Σ_{j=1}^n D^j (x̃_j − x_j) + Σ_{j=n+1}^m D^j (c̃_{j−n} − c_{j−n})

and thus

|Δ_x| ≤ Σ_{j=1}^n |D^j| · |x̃_j − x_j| + Σ_{j=n+1}^m |D^j| · |c̃_{j−n} − c_{j−n}| .  (19)
This formula can be modified easily such that a guaranteed bound can also be calculated by use of a computer (employing a suitable machine interval arithmetic). The expression in (19) corresponds to the well-known formula for error propagation (see e.g. [6])

Δ_x ≈ Σ_{j=1}^n (∂f/∂x_j)(x̃_j − x_j) + Σ_{j=n+1}^m (∂f/∂c_{j−n})(c̃_{j−n} − c_{j−n}) ,  (20)

which, however, represents only an approximation. In contrast to this, the estimation (19) is guaranteed, and it can be calculated by use of a computer with a cost that is proportional to the cost of an evaluation of f. Thus, the cost of a sensitivity analysis based on (19) or (20) is of the same order as the cost of the evaluation of the function itself. Additionally, the contribution of the rounding errors (and their propagation in the course of the computation) can be estimated by use of (18). Provided all input data
are exact (i.e., z_j = 0 for j = 1 … m), then the rounding error is Δ_e := Σ_{j=m+1}^s d_j z_j and, by use of (18):

Δ_e ∈ Σ_{j=m+1}^s D^j z_j .  (21)
Provided the operation in step j is exact (e.g., in the case of a subtraction in which the terms cancel rigorously), then z_j = 0 and D^j z_j does not influence the value of the sum in (21). In general, the exact value of z_j in (21) will not be known. Since we use an optimal arithmetic and z_j is the local rounding error in step f^j, the absolute value of z_j can be bounded by:

|z_j| ≤ |F̃^j| · B^{1−p_j} .  (22)
For the local precision p_j, the inequality p_j ≥ P must be valid; otherwise, it cannot be guaranteed that the value (computed with precision p_j) is in the corresponding interval F̃^j. Thus, the absolute value of the computational error Δ_e can be estimated by use of (21) and (22):

|Δ_e| ≤ Σ_{j=m+1}^s |D^j| · |F̃^j| · B^{1−p_j} .  (23)
Now using (19) and (23), the absolute value of the total error Δ = Δ_x + Δ_e is bounded by

|Δ| = |f̃(x̃) − f(x)| ≤ Σ_{j=1}^n |D^j| · |x̃_j − x_j| + Σ_{j=n+1}^m |D^j| · |c̃_{j−n} − c_{j−n}| + Σ_{j=m+1}^s |D^j| · |F̃^j| · B^{1−p_j} .  (24)
To get a guaranteed upper bound when (24) is calculated in floating-point arithmetic, the values D^j must be substituted by D̃^j, and the sums and products have to be computed by use of directed operations. The inequality (24) can be interpreted in the following way: the absolute value of D^j shows the maximum amplification of the error in step j; thus, the values D^j are the condition numbers for the total error with respect to the errors of all intermediate results.
H.-C. Fischer
134
If one is interested in these values only, or in an estimation of the form

|f̃(x̃) − f(x)| ≤ Σ_{j=1}^n d̂_j · |x̃_j − x_j| + Σ_{j=n+1}^m d̂_j · |c̃_{j−n} − c_{j−n}| + Σ_{j=m+1}^s d̂_j · |F̃^j| · B^{1−p_j} ,  (25)

then the values d̂_j may be computed directly by:

1. d̂_s := 1, d̂_k := 0 for k < s,

2. For k := s, s−1, …, m+1:

d̂_l := d̂_l + d̂_k, d̂_r := d̂_r + d̂_k, if f^k = f^l ± f^r,  (26)
d̂_l := d̂_l + |F̃^r| × d̂_k, d̂_r := d̂_r + |F̃^l| × d̂_k, if f^k = f^l × f^r,  (27)
d̂_l := d̂_l + d̂_k / |F̃^r|, d̂_r := d̂_r + |F̃^k| × d̂_k / |F̃^r|, if f^k = f^l / f^r,  (28)
d̂_l := d̂_l + |sf′(F̃^l)| × d̂_k, if f^k = sf(f^l).  (29)
Thus, the need for interval arithmetic is limited here to the computation of the values F̃^1 … F̃^s. The operations of steps (26) to (29) are carried out in a directed-rounding mode in order to guarantee the inequality (25).
5.2 Algorithms for Formula Evaluation
The above methods for rounding-error analysis are now used to develop algorithms for the evaluation of expressions in floating-point arithmetic. First, a procedure for the solution of Problem 1 (see the beginning of this section) will be given.
Algorithm 4 (Formula evaluation I (rounding-error estimation))
{ in: f : R^n → R, x̃ ∈ R^n and decomposition f^1 … f^s of f;
out: approximation f̃(x̃) (computed with floating-point arithmetic of precision l) and error bound e > 0 with |f̃(x̃) − f(x)| ≤ e. }

1. { Precomputation (in interval arithmetic); P ≤ l } Set F̃^k := ◊_P(x̃_k) for k = 1 … n, F̃^k := ◊_P(c_{k−n}) for k = n+1 … m, and compute F̃^{m+1} … F̃^s in interval arithmetic of precision P.
{ In particular, the values f̃(x̃) and f(x) are located in F̃^s, i.e., |f̃(x̃) − f(x)| ≤ w(F̃^s). }

2. { Reverse step } Let d̂_s := 1 and d̂_j := 0 (j < s), and compute the values d̂_{s−1} … d̂_1 according to (26) – (29).

3. { Floating-point evaluation } Evaluate f̃^1 … f̃^s in floating-point arithmetic of precision l.

4. { Error estimation } Compute the error bound e′ := Σ_{j=1}^n d̂_j · |x̃_j − x_j| + Σ_{j=n+1}^m d̂_j · |c̃_{j−n} − c_{j−n}| + Σ_{j=m+1}^s d̂_j · |F̃^j| · B^{1−l}.

5. { Output } f̃(x̃) := f̃^s, e := min(e′, w(F̃^s)).
Remark 7: If, in step 1 of the above algorithm, there is an inadmissible interval operation, then the step is repeated by use of a higher precision l ≥ P′ > P. If even the precision P′ = l is not sufficient, then the algorithm can be applied recursively with respect to the critical element: the interval F̃^crit = [f̃^crit − e, f̃^crit + e] is an enclosure for f^crit. If F̃^crit still causes problems, then the algorithm cannot evaluate the formula with precision l; it is then necessary to use a multiple-precision arithmetic.
The following example demonstrates the employment of the algorithm.
Example 9: The system of linear equations Ay = b is solved by use of an LU-decomposition (without pivoting) of A. The 'formula' f for the evaluation is the procedure for the first component y_1 of the result, which is computed by use of a program for the LU-decomposition supplemented by a forward-backward substitution; i.e., f = f(A, b) = y_1(A, b) is a function of n² + n variables.

Let A be the Boothroyd-Dekker matrix of dimension n = 7. The components of the right-hand side are all equated to 1. A program for Algorithm 4 (written in PASCAL-XSC and using a decimal arithmetic with 13 digits) computes f̃ = 1.000000003111 and the error bound e = 1.095322002366E−06; i.e., the interval [0.9999989077889, 1.000001098434] is a guaranteed enclosure for the result f = y_1 = 1. The naive interval computation in step 1 of the algorithm computes only the rough enclosure F̃^s = [0.9463683520819, 1.053631643420]. Of course, it is more favorable to compute all components of the result by means of a specialized algorithm for the verified solution of linear systems (see e.g. [25]).
Problem 2 (see the beginning of the section) can also be solved by use of the methods developed so far. The propagated error and the rounding error can be estimated by use of (19) and (23). In practice, the precision of the computation will be chosen in such a way that the accumulated rounding error is of the same order of magnitude as the propagated error caused by uncertainties of the input data. Therefore, |f̃(x̃) − f(x)| ≤ e is valid under the assumptions that (a) the precisions of the individual steps are chosen such that the rounding-error contributions in (25) satisfy Σ_{j=m+1}^s d̂_j · |F̃^j| · B^{1−p_j} ≤ e/2, and (b) the input errors satisfy Σ_{j=1}^n d̂_j · |x̃_j − x_j| + Σ_{j=n+1}^m d̂_j · |c̃_{j−n} − c_{j−n}| ≤ e/2. Then the propagated and the rounding error are approximately equal in magnitude. The procedure can be summarized by the following algorithm:
Algorithm 5 (Formula evaluation II (predefined error bound))
{ in: f : R^n → R, x̃ ∈ R^n and decomposition f^1 … f^s of f, 0 < e ∈ R;
out: precisions (p_k), k = 1 … s, and approximation f̃(x̃) (computed by floating-point arithmetic of precision p_k in step k) which satisfies the error bound |f̃(x̃) − f(x)| ≤ e. }

1. { Precomputation (in interval arithmetic) } Set F̃^k := ◊_P(x̃_k) for k = 1 … n, F̃^k := ◊_P(c_{k−n}) for k = n+1 … m, and compute F̃^{m+1} … F̃^s by use of machine interval arithmetic with precision P.
{ In principle, the choice of P > 0 is arbitrary; however, if there is an undefined interval operation, this step must be repeated with an increased precision P′ > P. }
If w(F̃^s) ≤ e is valid, then the procedure has been completed, and each computed value from F̃^s satisfies the required error bound.

2. { Reverse step } Set d̂_s := 1 and d̂_j := 0 (j < s), and compute the values d̂_{s−1} … d̂_1 according to (26) – (29).

3. { Determination of precisions } For k = 1 … m, the precisions p_k are chosen in such a way that the following is valid:

d̂_k · |F̃^k| · B^{1−p_k} ≤ e / (2m) .

For k = m+1 … s, the precisions p_k are chosen in such a way that the following is valid:

d̂_k · |F̃^k| · B^{1−p_k} ≤ e / (2(s − m)) .

Set p_k := max(p_k, P) for k = 1 … s. { The minimum precision P is necessary in order to validate the estimations as computed by use of machine interval arithmetic with precision P. }

4. { Floating-point computation } Compute (by use of the precisions p_k of the previous step) the floating-point approximations f̃^1 … f̃^s.

5. { Output } f̃(x̃) := f̃^s, a vector of precisions (p_k).
In step 3 of the preceding algorithm, all precisions are chosen in such a way that the error contributions (weighted by d̂_k · |F̃^k|) satisfy identical bounds for the input steps and for the operation steps, respectively. In cases where the weights d̂_k · |F̃^k| differ over a wide range, other error distribution strategies may lead to better results. In practice, however, the availability of different hardware or software arithmetics has a significant influence on the optimal choice, too. The following example shows Algorithm 5 at work.
Example 10: The function f(a, b) := 333.75b⁶ + a²(11a²b² − b⁶ − 121b⁴ − 2) + 5.5b⁸ + a/(2b) from [13] is to be evaluated for a = 77617.0 and b = 33096.0. Algorithm 5 was implemented in PASCAL-XSC. In the first two steps of the algorithm, the interval computations are carried out by use of a decimal arithmetic of 13 digits. The input data a, b can be represented without input errors.

If e is chosen in such a way that the relative error of the result is less than one percent, then the program computes precisions for the floating-point operations which are in a range from 13 to 42. Indeed, for precisions of less than 37 decimal digits (at critical places), not even the sign of the floating-point result is correct. For at least 37 digits, the accuracy increases suddenly, and the result differs from the solution −8.2739…E−1 only in the last place. Therefore, the result of the algorithm (i.e., a minimum precision of 42 instead of 37 digits) is a meaningful result, especially if one considers the condition of the problem: the absolute values of the partial derivatives ∂f/∂a and ∂f/∂b are both very large.
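The function of Example 10 is the well-known test example distributed with ACRITH [13] (often attributed to Rump). A Python sketch (the helper name `rump` is ours) makes the catastrophic cancellation visible: exact rational arithmetic yields the true value ≈ −0.8274, while a naive IEEE double-precision evaluation is wrong by many orders of magnitude:

```python
from fractions import Fraction as F

def rump(a, b):
    # f(a,b) = 333.75 b^6 + a^2(11 a^2 b^2 - b^6 - 121 b^4 - 2) + 5.5 b^8 + a/(2b)
    return (F(33375, 100)*b**6 + a*a*(11*a*a*b*b - b**6 - 121*b**4 - 2)
            + F(55, 10)*b**8 + F(a, 2*b))

exact = rump(77617, 33096)          # exact rational evaluation
print(float(exact))                 # approximately -0.82739606

a, b = 77617.0, 33096.0             # the same expression, naively, in IEEE double
naive = (333.75*b**6 + a*a*(11*a*a*b*b - b**6 - 121*b**4 - 2)
         + 5.5*b**8 + a/(2*b))
print(naive)                        # magnitude around 1e21: not a single digit correct
```

The huge polynomial terms cancel to exactly −2, leaving f = a/(2b) − 2 = −54767/66192; in fixed double precision the cancellation destroys all significant digits, which is precisely why the algorithm demands up to 42 decimal digits at the critical places.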
For the input a = 3.333333333333E−1 and b = 4.444444444444E−1 (i.e., f = 2.2348…), the problem is not difficult: |∂f/∂a| ≈ 3, |∂f/∂b| ≈ 30. For an ε-bound corresponding to a small relative error of the result, the computed precisions range from 13 to 15. For these values of a and b, the floating-point
evaluation by use of 13 digits already has a relative error below the required bound; even the bounds of the interval evaluation of f differ by less than 5 units in the last place. In this case too, the results of the algorithm (a minimum precision of 15 digits) are meaningful since, in an a priori fashion, it is not clear whether the problem is 'trivial' or not. The algorithm automatically chooses the precision in the appropriate way. The cost for this guarantee consists at most in the computation of some extra digits.
If, in the first step of the above algorithm, we set F̃^k := ◊_P(X_k), X_k ∈ IR for k = 1 … n, then the computed precisions (step 3) are even valid for all x_k ∈ X_k. The following example shows how this can be used to solve a 'collection' of problems.
Example 11: For the reduced argument x ∈ [0, 0.1], the exponential function is approximated by the Taylor polynomial T_N = Σ_{i=0}^N c_i x^i, c_i := 1/i!, of degree N = 3. What accuracy for x and the coefficients c_i is necessary, and which precision for the computation is required, in order to get a computational error of less than 5 · 10⁻⁶? Since the truncation error |e^x − T_N| is less than 5 · 10⁻⁶ for the interval under consideration, the following is valid for the total error: |e^x − T̃_N(x̃)| ≤ 5 · 10⁻⁶ + 5 · 10⁻⁶ = 10⁻⁵ for x, x̃ ∈ [0, 0.1]. The evaluation may be carried out as follows: f¹ = x, f² = c₀, f³ = c₁, f⁴ = c₂, f⁵ = c₃, f⁶ = f⁵ × f¹, f⁷ = f⁶ + f⁴, f⁸ = f⁷ × f¹, f⁹ = f⁸ + f³, f¹⁰ = f⁹ × f¹, f¹¹ = f¹⁰ + f².

In the above algorithm, we set F̃¹ := [0, 0.1], F̃² := F̃³ := [1, 1], F̃⁴ := [0.5, 0.5], and F̃⁵ := [0.166, 0.167]; this leads to the following values (using 3-digit interval computation): F̃⁶ = [0, 0.0167], F̃⁷ = [0.5, 0.517], F̃⁸ = [0, 0.0517], F̃⁹ = [1, 1.06], F̃¹⁰ = [0, 0.106], F̃¹¹ = [1, 1.11] and, thus, d̂₁₁ = d̂₁₀ = 1, d̂₉ = d̂₈ = 0.1, d̂₇ = d̂₆ = 0.01, d̂₅ = 0.001, d̂₄ = 0.01, d̂₃ = 0.1, d̂₂ = 1, d̂₁ = 1.13. From this, there follow the inequalities 2.7 · 10^{−p} ≤ 5 · 10⁻⁶ and p ≥ 6.5. Thus, it has been proved for c̃₀ = c̃₁ = 1, c̃₂ = 0.5, c̃₃ = 0.1666667 and a floating-point evaluation with 7 digits that the computation satisfies the required error bound for all arguments from [0, 0.1] (rounded to 7 digits).
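The reverse sweep of Example 11 can be replayed in Python. The interval magnitudes below are the upper bounds of the 3-digit enclosures quoted in the text; the resulting d̂ values agree with those listed above up to the directed 3-digit roundings applied there:

```python
# Code list (Horner evaluation of the degree-3 Taylor polynomial):
#   f1=x, f2=c0, f3=c1, f4=c2, f5=c3,
#   f6=f5*f1, f7=f6+f4, f8=f7*f1, f9=f8+f3, f10=f9*f1, f11=f10+f2
# Interval magnitudes |F~j| taken from the 3-digit enclosures in the text:
mag = {1: 0.1, 5: 0.167, 7: 0.517, 9: 1.06}

d = {k: 0.0 for k in range(1, 12)}
d[11] = 1.0
d[10] += d[11]; d[2] += d[11]                  # f11 = f10 + f2
d[9] += mag[1]*d[10]; d[1] += mag[9]*d[10]     # f10 = f9 * f1
d[8] += d[9]; d[3] += d[9]                     # f9  = f8 + f3
d[7] += mag[1]*d[8]; d[1] += mag[7]*d[8]       # f8  = f7 * f1
d[6] += d[7]; d[4] += d[7]                     # f7  = f6 + f4
d[5] += mag[1]*d[6]; d[1] += mag[5]*d[6]       # f6  = f5 * f1

print({k: round(v, 5) for k, v in d.items() if v})
```

The sweep touches each operation of the code list exactly once, which is why the cost of the error analysis stays proportional to that of one function evaluation.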
5.3 Accurate Evaluation of Segments of Code
The applicability of the preceding algorithms for the purpose of an evaluation of segments of code has already been shown by means of the example making use of an LU-decomposition. In that case, a program for the solution of a linear system was used as the 'formula' to be evaluated. The LU-decomposition was carried out without pivoting, since pivoting would require comparisons and conditional statements. The comparison of intermediate values as computed by use of floating-point arithmetic must lead to the same logical result as the comparison of the exact values. Therefore, interval enclosures for the operands of a comparison are computed in the course of the precomputation. If the logical result of the interval comparison is
unique, then the computation can be continued without problems. Otherwise, the enclosures of the operands have to be improved until a decision is possible. This improvement can be carried out either by means of an increase of the precision of the interval computation or by an application of Algorithm 5 for an evaluation of the operands. By means of this procedure, all comparisons which are meaningful for floating-point computation can be brought to a decision. E.g., the following statement (which seems to be trivial) is not meaningful in this sense: IF (1/3) × 3 = 1 THEN … The logical result (TRUE or FALSE) of the IF-condition cannot be computed (without further transformations) by use of a floating-point arithmetic, provided the machine base is even (e.g., 2, 10, or 16). Thus, the problem of evaluating segments of a code can be solved by use of Algorithm 5 (and the preceding extensions for the purpose of conditional statements). For algorithms with iterations, it may be difficult to store all intermediate results. Here, Algorithm 5 must be applied to the main part of the iteration only; alternatively, the possibilities for the reduction of the spatial complexity of the reverse computation (see Section 3) must be used.
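The undecidability of such a comparison can be made concrete with a toy outward-rounded evaluation in Python (directed rounding simulated via `math.nextafter`; this is an illustration, not the chapter's machine interval arithmetic): the enclosure of (1/3)·3 contains 1 strictly in its interior, so neither TRUE nor FALSE can be concluded from the enclosure alone:

```python
import math

def widen(lo, hi):
    # outward rounding simulated by stepping to the neighbouring floats
    return math.nextafter(lo, -math.inf), math.nextafter(hi, math.inf)

# enclosure of the real number 1/3 (the nearest double lies slightly below 1/3)
lo, hi = widen(1/3, 1/3)

# enclosure of (1/3) * 3 with outward rounding
lo3, hi3 = widen(lo * 3, hi * 3)

# 1 lies strictly inside the enclosure: the comparison "(1/3)*3 = 1"
# cannot be decided from this floating-point computation
assert lo3 < 1.0 < hi3
print(lo3, hi3)
```

Amusingly, the bare double-precision expression `(1/3)*3 == 1` happens to evaluate to True on binary machines, but only by accidental rounding; the guaranteed enclosure correctly reports that the question is undecided at this precision.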
A Implementation of a Differentiation Arithmetic
In the following, it is shown how a differentiation arithmetic can be implemented in the programming language PASCAL-XSC [19], a PASCAL extension for Scientific Computation. The computation is done in interval arithmetic; this makes it possible to compute bounds for derivatives. The type interval for intervals is a standard data type; the corresponding interval operations are defined in the module i_ari. The pairs of the differentiation arithmetic (for the first derivative of a function f : R → R) are represented by the type df_type. The function is f(x) := 25(x − 1)/(x² + 1) from Example 1 with the point interval X = [2, 2]. (References to FORTRAN implementations can be found, e.g., in [11] and [30].)

program ex_1 (input, output);
use i_ari;

type df_type = record f, df: interval; end;

operator + (u, v: df_type) res: df_type;
begin res.f := u.f + v.f; res.df := u.df + v.df; end;

operator - (u, v: df_type) res: df_type;
begin res.f := u.f - v.f; res.df := u.df - v.df; end;

operator * (u, v: df_type) res: df_type;
begin res.f := u.f * v.f; res.df := u.df * v.f + u.f * v.df; end;

operator / (u, v: df_type) res: df_type;
var h: interval;
begin h := u.f / v.f; res.f := h; res.df := (u.df - h * v.df) / v.f; end;

operator + (u: df_type; v: integer) res: df_type;
begin res.f := u.f + v; res.df := u.df; end;

operator - (u: df_type; v: integer) res: df_type;
begin res.f := u.f - v; res.df := u.df; end;

operator * (u: integer; v: df_type) res: df_type;
begin res.f := u * v.f; res.df := u * v.df; end;

(*** further operators for mixed types ***)

function sqr (u: df_type) : df_type;
begin sqr.f := sqr(u.f); sqr.df := 2.0 * u.f * u.df; end;

(*** further standard functions ***)

function df_var (h: interval) : df_type;
(* definition of the independent variable *)
begin df_var.f := h; df_var.df := 1.0; end;

var x, f: df_type;
    h: interval;

begin
  h := 2.0;
  x := df_var(h);
  f := 25*(x-1)/(sqr(x)+1);   (* Example 1 *)
  writeln('f, df: ', f.f, f.df);
end.
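For readers without access to PASCAL-XSC, the same (value, derivative)-pair arithmetic can be sketched in Python, with plain floats standing in for the intervals (so this yields numerical values rather than the verified bounds of the program above); for f(x) = 25(x − 1)/(x² + 1) at x = 2 it reproduces f(2) = 5 and f′(2) = 1:

```python
from dataclasses import dataclass

@dataclass
class DF:                         # (value, derivative) pair
    f: float
    df: float
    def __add__(self, v): return DF(self.f + v.f, self.df + v.df)
    def __sub__(self, v): return DF(self.f - v.f, self.df - v.df)
    def __mul__(self, v): return DF(self.f*v.f, self.df*v.f + self.f*v.df)
    def __truediv__(self, v):
        h = self.f / v.f
        return DF(h, (self.df - h*v.df) / v.f)

def const(c): return DF(c, 0.0)   # constants carry derivative 0
def var(x):   return DF(x, 1.0)   # independent variable carries derivative 1
def sqr(u):   return DF(u.f*u.f, 2.0*u.f*u.df)

x = var(2.0)
f = const(25.0)*(x - const(1.0)) / (sqr(x) + const(1.0))   # Example 1
print("f =", f.f, " f' =", f.df)   # f(2) = 5.0, f'(2) = 1.0
```

Replacing `float` by an interval type with directed rounding turns this sketch into the verified version of the PASCAL-XSC program.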
References

[1] Alefeld, G.: Bounding the Slope of Polynomial Operators and some Applications, Computing 26, pp. 227-237, 1981
[2] Alefeld, G., Herzberger, J.: Introduction to Interval Computations, Academic Press, New York, 1983
[3] ANSI/IEEE Standard 754-1985: Standard for Binary Floating-Point Arithmetic, New York, 1985
[4] Asaithambi, N.S., Shen, Z., Moore, R.E.: On Computing the Range of Values, Computing 28, pp. 225-237, 1982
[5] Beda, L.M. et al.: Programs for Automatic Differentiation for the Machine BESM, Inst. Precise Mechanics and Computation Techniques, Academy of Science, Moscow, 1959
[6] Bauer, F.L.: Computational Graphs and Rounding Error, SIAM J. Numer. Anal., Vol. 11, No. 1, pp. 87-96, 1974
[7] Baur, W., Strassen, V.: The Complexity of Partial Derivatives, Theoretical Computer Science 22, pp. 317-330, 1983
[8] Fischer, H.-C.: Schnelle automatische Differentiation, Einschließungsmethoden und Anwendungen, Dissertation, Universität Karlsruhe, 1990
[9] Fischer, H.-C., Haggenmüller, R., Schumacher, G.: Evaluation of Arithmetic Expressions with Guaranteed High Accuracy, Computing Suppl. 6, Springer, pp. 149-158, 1988
[10] Gries, D.: The Science of Programming (Chapter 21: Inverting Programs), Springer, 1981
[11] Griewank, A.: On Automatic Differentiation, in: Mathematical Programming: Recent Developments and Applications, eds. M. Iri and K. Tanabe, Kluwer Academic Publishers, pp. 83-103, 1989
[12] Griewank, A.: Achieving Logarithmic Growth of Temporal and Spatial Complexity in Reverse Automatic Differentiation, Preprint MCS-P228-0491, Argonne National Laboratory, 1991
[13] IBM High-Accuracy Arithmetic Subroutine Library (ACRITH), Program Description and User's Guide, SC 33-6164-02, 1986
[14] IBM High-Accuracy Arithmetic - Extended Scientific Computation, Reference, SC 33-6462-00, 1990
[15] Iri, M.: Simultaneous Computation of Functions, Partial Derivatives and Estimates of Rounding Errors - Complexity and Practicality, Japan J. Appl. Math. 1, pp. 223-252, 1984
[16] Kaucher, E., Miranker, W.L.: Self-Validating Numerics for Function Space Problems, Academic Press, New York, 1984
[17] Kedem, G.: Automatic Differentiation of Computer Programs, ACM TOMS, Vol. 6, No. 2, pp. 150-165, 1980
[18] Kelch, R.: Numerical Quadrature by Extrapolation with Automatic Result Verification, this volume
[19] Klatte, R. et al.: PASCAL-XSC, Sprachbeschreibung mit Beispielen, Springer, 1991
[20] Klein, W.: Enclosure Methods for Linear and Nonlinear Systems of Fredholm Integral Equations of the Second Kind, this volume
[21] Krawczyk, R.: Intervallsteigungen für rationale Funktionen und zugeordnete zentrische Formen, Freiburger Intervall-Berichte 83/2, pp. 1-30, 1983
[22] Krawczyk, R., Neumaier, A.: Interval Slopes for Rational Functions and Associated Centered Forms, SIAM J. Numer. Anal., Vol. 22, No. 3, pp. 604-616, 1985
[23] Kulisch, U.: Grundlagen des Numerischen Rechnens, Bibliographisches Institut, Mannheim, 1976
[24] Kulisch, U., Miranker, W.L.: Computer Arithmetic in Theory and Practice, Academic Press, New York, 1981
[25] Kulisch, U., Miranker, W.L. (eds.): A New Approach to Scientific Computation, Academic Press, New York, 1983
[26] MACSYMA, Reference Manual, Symbolics Inc., Cambridge (Massachusetts, USA)
[27] MAPLE, Reference Manual, Symbolic Computation Group, University of Waterloo (Ontario, Canada), 1988
[28] Matijasevich, Y.V.: A Posteriori Interval Analysis, Proceedings of EUROCAL 85, Springer Lecture Notes in Computer Science 204/2, pp. 328-334, 1985
[29] Moore, R.E.: Interval Analysis, Prentice-Hall, Englewood Cliffs, 1966
[30] Rall, L.B.: Automatic Differentiation, Springer Lecture Notes in Computer Science 120, 1981
[31] Rall, L.B.: Differentiation in PASCAL-SC: Type GRADIENT, ACM TOMS 10, pp. 161-184, 1984
[32] Ratschek, H., Rokne, J.: Computer Methods for the Range of Functions, Ellis Horwood, Chichester, 1984
[33] REDUCE, User's Manual V3.3, Rand Corporation, Santa Monica, 1987
[34] Richman, P.L.: Automatic Error Analysis for Determining Precision, Comm. of the ACM, Vol. 15, No. 9, pp. 813-817, 1972
[35] Skelboe, S.: Computation of Rational Interval Functions, BIT 14, pp. 87-95, 1974
[36] Wengert, R.E.: A Simple Automatic Derivative Evaluation Program, Comm. ACM 7, pp. 463-464, 1964
Numerical Quadrature by Extrapolation with Automatic Result Verification Rainer Kelch
In this paper, we will derive an adaptive algorithm for a verified computation of an enclosure of the values of definite integrals in tight bounds with automatic error control. The integral is considered as the sum of an approximation term and a remainder term. Starting from a Romberg extrapolation, the recursive computation of the T-table elements is replaced by one direct evaluation of an accurate scalar product. The remainder term is verified numerically via automatic differentiation algorithms. Concerning interval arithmetic, the disadvantages of the Bulirsch sequence are overcome by introducing a so-called decimal sequence. By choosing different stepsize sequences, we are able to generate a table of coefficients of remainder terms. Via this table and depending on the required accuracy, a fast search algorithm determines the method which involves the least computational effort. A local adaptive refinement makes it possible to reduce the global error efficiently to the required size, since an additional computation is carried out only where necessary. In comparison to alternative enclosure algorithms, theoretical considerations and numerical results demonstrate the advantages of the new method. This algorithm provides guaranteed intervals with tight bounds even at points where the approximation method delivers numbers with an incorrect sign. An outlook on the application of this method to multi-dimensional problems is given.
1 Introduction

1.1 Motivation
Integrals appear frequently in scientific and engineering problems. Often they provide the input parameters in a chain of complex algorithms, and therefore they have a decisive influence on the quality of the final result. With approximative quadrature procedures, it is not possible to come up with a verified statement about the error tolerance. Indeed, there exist asymptotic enclosure methods (see e.g. [8, 35, 37]), but they cannot lead to a guaranteed solution (see Section 6).
For that reason we turn to enclosure methods with absolute, exact, and safe error bounds. In [11, 12] methods for integral enclosures are derived employing Newton-Cotes formulae, a Gauss quadrature, and a Taylor-series ansatz. In [10, 42] the

Scientific Computing with Automatic Result Verification. Copyright © 1993 by Academic Press, Inc. All rights of reproduction in any form reserved. ISBN 0-12-044210-8
R. Kelch
144
problem of an optimal quadrature procedure is discussed. The narrower choice leaves the Gauss quadrature and the Romberg quadrature. Because of the necessary computation of the irrational nodes for the Gauss quadrature, this comparison indicates the Romberg integration as preferable and more elegant. The attempts to transfer it onto interval methods and the recursive computation of the elements of the T-table turn out to be detrimental: the observed growth of intervals can accumulate quickly and destroy the advantages of the accelerated convergence. Under this aspect, it seems to be preferable to employ the Gauss quadrature or some other direct method. Through the idea of transforming the recursive computation of the T-table elements into a direct computation of one scalar product (see Section 4), this disadvantage of the Romberg extrapolation can be neutralized. Already in [4], Bauer pointed out this possibility; however, he preferred the recursive one because of its higher effectiveness. This assertion is not valid any more in view of the Kulisch-Miranker arithmetic, which admits a controlled calculation of verified enclosures of solutions by use of a computer, an interval arithmetic, and an extension of a programming language (see [2, 5, 19, 20, 24, 25, 26]). The direct calculation via a scalar product is more accurate than the recursive computation of the T-table elements; this admits a more rapid calculation of a solution possessing the required accuracy. Indeed, Bauer specifies formulae for computing the weights, but they are not practicable. Direct algorithms for a determination of the weights are derived in this article. Analogously to the Romberg sequence, an enclosure algorithm may be realized for the Bulirsch sequence F_B. In Theorem 7 we present the proof of an important property for the application of the h-sequence by Bulirsch, which was missing until now.
By introducing a new h-sequence F_D (decimal sequence), the set of the possible procedures for the required accuracy can be enlarged. Of course, in any particular case, the search for the best (and therefore most rapid) method should not be spoiled by a high computational effort in the preparatory work: the storage of the remainder-term factors in a table for a selection of effective methods enables us to choose the optimum before the start of the computation-intensive part of the algorithm.
1.2 Foundations of Numerical Computation
In the execution of numerical algorithms and formulae by use of a computer, there are rounding errors because of the transition from the continuum to a floating-point screen. Additionally, there are procedural errors, whose practical determination is not possible even though they can be expressed explicitly by means of the integrand or its derivatives. In order to deliver mathematically verified results, a specific algorithm has to be able to control both rounding and procedural errors. These basic requirements call for a certain standard of computer arithmetic and programming languages. They are met by the Kulisch-Miranker arithmetic and the programming language PASCAL-SC, which have been developed for scientific
Quadrature by Extrapolation
145
computations. Under the name PASCAL-XSC (see [23]), this extension of PASCAL is available not only for PCs but also for workstations and host computers. By means of the implementation of the optimal scalar product, we avoid the accumulation of rounding errors and therefore obtain an increase in accuracy (see [5, 24]). The foundations, which are the tools of modern numerics, are explained in detail in [2, 20, 25]. Important terms like screen, rounding, the directed-rounding symbols for the operations ∘ ∈ {+, −, ·, /}, optimal scalar product, long accumulator, and ulp (one unit in the last place) are defined in these sources.
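The effect of an exactly rounded accumulation can be illustrated even in ordinary double precision. Python's `math.fsum` plays a role loosely analogous to a long accumulator for plain sums (an analogy only; the Kulisch-Miranker scalar product covers full dot products with one final rounding):

```python
import math

data = [1e16, 1.0, -1e16]   # the small summand is lost by naive accumulation

print(sum(data))            # 0.0: a rounding after every single addition
print(math.fsum(data))      # 1.0: the exactly rounded sum survives
```

With naive left-to-right summation, 1e16 + 1.0 already rounds back to 1e16, so the final result is 0.0; an accumulator that rounds only once returns the correct 1.0.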
1.3 Conventions

h-sequence: stepsize sequence F := {h_i} with h_i := h₀/n_i, where n_i + 1 is the number of nodes for the quadrature formula, i.e., we need n_i + 1 nodes in the i-th row of the T-table for the element in the 0-th column.
N, Z, R: sets of natural, integer, and real numbers, respectively.

IR: set of real intervals X := [a, b] := {y | y ∈ R ∧ a ≤ y ≤ b}, with a, b ∈ R.
##: in PASCAL-SC, this denotes the evaluation of the braced expression with maximum accuracy by use of rounding to the nearest enclosing interval possessing machine numbers as endpoints (see [5]).
a|b, a∤b: a divides b, a does not divide b.

ε_r / ρ_r: required absolute / relative error.

ε_e / ρ_e: estimated absolute / relative error.

ε_c / ρ_c: absolute / relative error as following from computed enclosures.
t_X: total execution time in seconds for the method X.

q_p: the quotient of the true (accurate) and the estimated error.
V(m, l, F): quadrature method with the T-table row m, the bisection step l, and the h-sequence F as parameters (see Section 5).

vquad: optimal quadrature algorithm with verified integral enclosure.
approx: quadrature algorithm via an approximating Romberg integration.

asympt: quadrature algorithm with an asymptotical enclosure via a Romberg extrapolation.
itayl : verified quadrature algorithm by use of a Taylor-series ansatz.
Formulae and equations are numbered for each section. If they are used in another section, they are cited with the section number in first place.
2 Romberg Integration
The Romberg integration is essentially an application of the Richardson extrapolation (see [14]) to the Euler-Maclaurin summation formula (see [37]). For an approximate computation of the integral

(1) J := ∫_a^b f(t) dt, a, b ∈ R, f ∈ C^k[a, b],

we can use the trapezoidal sum

(2) T(h) := h · Σ''_{j=0}^{n} f(a + jh), h := (b − a)/n, n ∈ N,

where the double prime indicates that the first and the last summand are halved. Now we discuss the asymptotic behavior of T(h) for h going to zero.
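A small Python check illustrates this behavior for a concrete integrand (∫₀¹ e^t dt = e − 1, chosen here only for illustration): halving h reduces the error of the trapezoidal sum (2) by a factor of about 4, i.e., the error is O(h²):

```python
import math

def trapezoid(f, a, b, n):
    # T(h) with h = (b-a)/n; first and last summands halved
    h = (b - a) / n
    s = 0.5 * (f(a) + f(b)) + sum(f(a + j*h) for j in range(1, n))
    return h * s

J = math.e - 1.0                            # exact value of the test integral
e1 = trapezoid(math.exp, 0.0, 1.0, 8) - J
e2 = trapezoid(math.exp, 0.0, 1.0, 16) - J
print(e1 / e2)                              # close to 4 = 2^2
```

It is exactly this h²-expansion of the error, made precise by the Euler-Maclaurin formula below, that the Romberg extrapolation exploits.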
2.1 The Euler-Maclaurin Summation Formula
In [27, 37] we find the proof of

Theorem 1: For f ∈ C^{2m+2}[a, b], the trapezoidal sum (2) satisfies

T(h) = J + Σ_{i=1}^{m} τ_i h^{2i} + α_{m+1}(h) · h^{2m+2} .
Remarks:

1. The τ_i are constants independent of h; the equation in Theorem 1 is said to be the Euler-Maclaurin summation formula.

2. In [1], the Bernoulli numbers B_k are stored for k ≤ 30.

3. The functions S_k : R → R defined by S_k(x) := B_k(x − ⌊x⌋), with B_k(x) the Bernoulli polynomials, have the period 1. This implies the representation of the remainder term given in Corollary 1.
Corollary 1: The remainder term in Theorem 1 can be expressed by use of

α_{m+1}(h) = (b − a) · B_{2m+2}/(2m+2)! · f^{(2m+2)}(ξ), ξ ∈ [a, b].

2.2 The Remainder Term
Let us consider the definite integral J defined in (1) for the case of a sufficiently large k. Let the h-sequence

(3) F := {h₀, h₁, h₂, …}, h₀ := b − a, h_i := h₀/n_i, n_i ∈ N,

be given. By means of the appropriate trapezoidal sums T(h_i), an approximation T̃(0) of J is computed via a Lagrange interpolation making use of the well-known recursion formulae

(4) T_{i0} := T(h_i), 0 ≤ i ≤ m;
T_{ik} := T_{i,k−1} + (T_{i,k−1} − T_{i−1,k−1}) / ((h_{i−k}/h_i)² − 1), 1 ≤ k ≤ i ≤ m;

the T_{ik} are the T-table elements, which may be grouped in the T-table

(5)
T₀₀
T₁₀ T₁₁
T₂₀ T₂₁ T₂₂
⋮
T_{m0} T_{m1} … T_{mm}
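The recursion (4) can be sketched in Python for the Romberg sequence h_i = (b−a)/2^i, for which (h_{i−k}/h_i)² = 4^k. This is an approximate illustration only; the chapter's point is precisely that a verified version requires interval arithmetic and the accurate scalar product. The test integral ∫₀¹ e^t dt is our choice:

```python
import math

def romberg(f, a, b, m):
    # T-table (5) for the Romberg sequence h_i = (b-a)/2^i via recursion (4)
    T = [[0.0]*(m + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        n = 2**i
        h = (b - a) / n
        T[i][0] = h*(0.5*(f(a) + f(b)) + sum(f(a + j*h) for j in range(1, n)))
        for k in range(1, i + 1):
            q = 4.0**k                     # (h_{i-k}/h_i)^2 for the Romberg sequence
            T[i][k] = T[i][k-1] + (T[i][k-1] - T[i-1][k-1]) / (q - 1.0)
    return T

T = romberg(math.exp, 0.0, 1.0, 4)
print(abs(T[4][4] - (math.e - 1.0)))       # far smaller than the error of T[4][0]
```

The extrapolated diagonal element is many orders of magnitude more accurate than the plain trapezoidal sum of the same row, which is the accelerated convergence the text refers to.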
The procedural error

(6) R_{ik} := T_{ik} − J

may be computed directly via the kernel representation

(7) R_{ik} = ∫_a^b K(x) · f^{(2k+2)}(x) dx ,

with f ∈ C^{2k+2}[a, b] and m arbitrarily large.
Concerning special h-sequences, we demonstrate in [22] that K(x) does not change its sign in [a, b]. From this we obtain

(9) R_{ik} = f^{(2k+2)}(ξ) · ∫_a^b K(x) dx, ξ ∈ [a, b].

By use of the Euler-Maclaurin summation formula and with respect to (6) or (9), we deduce
Theorem 2: For the Romberg sequence \mathcal{F}_R := \{h_0/2^i\}, K(x) in (7) does not change its sign in [a,b]; i.e., we obtain the following remainder term formula for the T-table elements for \mathcal{F}_R, with \xi \in [a,b]:
Proof: by complete induction (see [4, 27]).
Remarks:
1. For the sequence with the smallest increase of the number of the required nodes (relative to the 0-th column of the T-table), we obtain a remainder term formula with an analogously simple structure.
2. The statement "\mathcal{F}_R has the minimal increase of required nodes" is not necessarily valid for the diagonal elements (see Sections 4 and 5).
3. For arbitrary h-sequences, it is not possible to determine in an a priori fashion whether there is a change of sign of K(x). The possibility of the simplified representation (6) of the error depends on the choice of the h-sequence.
4. Subsequently, we often use the notation T_m := T_{mm}, R_m := R_{mm}.
2.3 Convergence

For i \to \infty, we can show that the error of T_{ik} in the k-th column goes to zero proportionally to \prod_{j=0}^{k} h_{i-j}^2. Additionally, the following theorem is valid:
Theorem 3: If h_{i+1}/h_i \le a < 1 for all i \in \mathbb{N}_0 and \lim_{m\to\infty} h_m = 0, then there holds

\lim_{m\to\infty} T_{m0} = \lim_{m\to\infty} T_{mm} = T(0) = J .
In [7] we find the proof by use of the Theorem of Toeplitz.
For all elements T_{ik}, the proof of the following theorem is given in [4]:
Theorem 4: Provided the 0-th column of the T-table converges, all diagonal-sequences converge.
2.4 The Classical Algorithm
The difference of two successive T-table elements serves as a truncation criterion (see [31]):
(10)   |T_{m,m-1} - T_{mm}| \le \varepsilon ,   J \approx T_{mm} .

Therefore it is necessary to compute all T-table elements positioned to the left of and above T_{mm}. In Section 6, we will use the algorithm approx for the purpose of comparisons.
3 Verified Computation of the Procedural Error
Concerning the determination of an enclosure of the remainder term, the only difficulty is the computation of a higher order derivative of f at an intermediate point \xi \in [a,b]. Because this \xi is unknown, we compute an enclosure of f^{(j)}(\xi) by replacing \xi with [a,b]. A numerical approximation of the j-th derivative is not suitable for a verified result, and symbolic differentiation requires too much effort. By means of recursive calculations, the method of "automatic differentiation" yields enclosures of Taylor coefficients, which immediately imply the required derivatives.
The algorithms are valid independently of the argument type; therefore, they are valid for real-valued as well as for interval-valued nodes. The recursion formulae for the calculation of some standard functions, which are referred to in [28, 30, 33], are supplemented in [22] by means of additional formulae. By use of two examples, we now explain the direct transfer of mathematical algorithms, which is possible in PASCAL-SC. Finally, we outline algorithms for the automatic differentiation of functions in two variables with respect to cubature problems.
3.1 Automatic Differentiation Algorithms
Let u, v, w be real-valued functions which are sufficiently smooth in a neighborhood of t_0. We define the Taylor coefficients (u)_k of a function u by means of

(1)   (u)_k := \frac{1}{k!} \cdot u^{(k)}(t_0) ,   k \ge 0 ,   (a)

or, for interval arguments,

      (u(T))_k := \frac{1}{k!} \cdot u^{(k)}(T) ,   T \in I\mathbb{R} ,   (u(T))_k \in I\mathbb{R} .   (b)
In this notation, the Taylor series of u around t_0 is

(2)   u(t) = \sum_{k=0}^{\infty} (u)_k \cdot (t - t_0)^k .
For functions that are compositions of other functions by use of arithmetic operations, we can immediately give the following rules, making use of the rules for computations with power series:

(3)   (u \pm v)_k = (u)_k \pm (v)_k ,   k \ge 0 ,   (a)
      (u \cdot v)_k = \sum_{j=0}^{k} (u)_j \cdot (v)_{k-j} ,   (b)
      (u / v)_k = \frac{1}{(v)_0} \Big\{ (u)_k - \sum_{j=1}^{k} (v)_j \cdot (u/v)_{k-j} \Big\} .   (c)
By use of the trivial relations

(4)   (c)_0 = c ,   (c)_k = 0 for k \ge 1 ,   c a constant,
      (t)_0 = t_0 ,   (t)_1 = 1 ,   (t)_k = 0 for k \ge 2 ,   for the independent variable t,

we are able to compute the Taylor coefficients of each order k for arbitrary rational functions, by first computing the coefficients for k = 0 for all partial expressions, then for k = 1, etc. We are able to derive calculation formulae for the Taylor coefficients for a larger
class of functions making use of the following relation, which can be derived immediately, with an additional application of the chain rule.
For the Taylor coefficients of the exponential function, we obtain:

(6)   (e^u)_k = \sum_{j=0}^{k-1} \Big(1 - \frac{j}{k}\Big) \cdot (e^u)_j \cdot (u)_{k-j} ,   k \ge 1 .
Analogously, recursion formulae are derived for all other standard functions. Composite function expressions do not cause any problems, since the argument of a standard function in the recursion formulae is not the independent variable t but, rather, a suitable function u of t. These expressions are valid for arbitrary k \ge 1; for k = 0 we take the function value. In [22], all these functions and proofs are listed in detail.
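The rules (3) and (6) translate directly into code operating on lists of Taylor coefficients. The following Python sketch mirrors the structure of the chapter's PASCAL-SC package, but with plain floats instead of intervals, so it carries no enclosure property:

```python
import math

def t_add(U, V):
    """(u + v)_k = (u)_k + (v)_k, rule (3a)."""
    return [u + v for u, v in zip(U, V)]

def t_mul(U, V):
    """(u * v)_k = sum_{j<=k} (u)_j (v)_{k-j}, rule (3b)."""
    K = len(U)
    return [sum(U[j] * V[k - j] for j in range(k + 1)) for k in range(K)]

def t_exp(U):
    """(e^u)_k = sum_{j<k} (1 - j/k) (e^u)_j (u)_{k-j}, rule (6)."""
    E = [math.exp(U[0])]
    for k in range(1, len(U)):
        E.append(sum((1 - j / k) * E[j] * U[k - j] for j in range(k)))
    return E
```

With the independent variable t at t_0 = 0, i.e. coefficients [0, 1, 0, 0] from rule (4), t_exp reproduces the coefficients 1/k! of e^t.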
3.2 Implementation in PASCAL-SC
All usual operators and functions are implemented as itaylor operators and functions; for linking, they are available in a package. As data type we choose a dynamic array whose components represent the Taylor coefficients:

type itaylor = dynamic array [*] of interval;

By means of two examples, we illustrate the handling and the structure of the differentiation package.
Example 1: Let f(x) := x^2 - 3x + 1 be given. Let 3 be the maximum required order of the Taylor coefficients. Therefore we declare the independent variable x by use of

var x, y : itaylor[0..3];

By use of (4), the variable x is represented by (x_0, 1, 0, 0) and the constants 3 and 1 by (3, 0, 0, 0) and (1, 0, 0, 0), respectively; (3b) as applied to x^2 (= x \cdot x) yields (x_0^2, 2x_0, 1, 0). By use of (3a), we obtain for y := f(x_0):

y = (x_0^2 - 3x_0 + 1, \; 2x_0 - 3, \; 1, \; 0) .

With f''(x_0) = 2! \cdot (f(x_0))_2, we obtain f''(x_0) = 2 \cdot 1 = 2.
By initializing, e.g., x with x_0 := 1 or [0,1], respectively, we obtain the corresponding components of y.

Example 2: By use of the following program, the third order derivative of the function f(x) = x \cdot e^{1-x^2} will be computed at nodes to be read in. It is observed that the coding is similar to the mathematical notation.
program example2(input, output);
use irari, itaylor;
var i, j : integer;
    h, x : itaylor[0..3];
    x0 : interval;

function f(x : itaylor) : itaylor[0..ubound(x)];
begin
  f := x * exp(1 - sqr(x));
end;

begin { main program }
  read(input, x0);
  expand(x, x0);   { Initialization of x with (x0,1,0,..,0). }
  h := f(x);
  write(output, h[3] * 6);   { f'''(x) = (f(x))_3 * 3! }
end.
The following multiplication operator and the exponential function from the numeric packages irari, itaylor are used:

global operator * (A, B : itaylor) res : itaylor[0..ubound(A)];
var k, j : integer;
begin
  for k := 0 to ubound(A) do
    res[k] := ## ( for j := 0 to k sum (A[j] * B[k-j]) );
end;
global function exp(x : itaylor) : itaylor[0..ubound(x)];
var k, j : integer;
    h : itaylor[0..ubound(x)];
begin
  h[0] := exp(x[0]);
  for k := 1 to ubound(x) do
  begin
    h[k] := zero;
    for j := 0 to k-1 do
      h[k] := h[k] + (k-j) * h[j] * x[k-j];
    h[k] := h[k]/k;
  end;
  exp := h;
end;
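As an illustrative cross-check of Example 2 in Python (not part of the PASCAL-SC package, and with plain floats rather than intervals), the Taylor coefficients of f(x) = x·e^{1-x^2} can be built from the recursions (3b) and (6); then 3!·(f)_3 reproduces the third derivative:

```python
import math

def taylor_f(x0, K=4):
    """Taylor coefficients (f)_0..(f)_{K-1} of f(x) = x * exp(1 - x^2) at x0."""
    x = [x0, 1.0] + [0.0] * (K - 2)                       # rule (4)
    u = [1 - x0 * x0, -2.0 * x0, -1.0] + [0.0] * (K - 3)  # coefficients of 1 - x^2
    e = [math.exp(u[0])]                                   # rule (6)
    for k in range(1, K):
        e.append(sum((1 - j / k) * e[j] * u[k - j] for j in range(k)))
    # rule (3b): f = x * e^u
    return [sum(x[j] * e[k - j] for j in range(k + 1)) for k in range(K)]

f3 = 6 * taylor_f(0.5)[3]   # f'''(x0) = 3! * (f)_3, as printed in Example 2
```

At x_0 = 0.5 this agrees with the analytic third derivative e^{1-x^2}(-8x^4 + 24x^2 - 6).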
3.3 Automatic Differentiation in Two Dimensions
Concerning a function f : \mathbb{R}^n \to \mathbb{R}, f \in C^k(G), G \subset \mathbb{R}^n, the automatic computation of Taylor coefficients is possible by use of analogous recursion algorithms, just as in the one-dimensional case. The foundation is the Theorem of Taylor for the n-dimensional case (see [18]). The rule for the n-dimensional case can be outlined as follows:
1. For the first variable, the rule for the one-dimensional case is applied in such a way that the n-1 other variables are treated as constants.
2. In the expression generated in this way, the rule for the one-dimensional case is applied with respect to the second variable while the n-2 other variables are treated as constants, etc.
Since the functions under consideration are in the space C^k(G), the sequence of the differentiations with respect to the variables is irrelevant. In the following, we consider functions of two variables, which are denoted by t_1 and t_2. Let u, v, and w be real-valued functions in C^{m_1+m_2}(G) with G \subset \mathbb{R}^2. With u^{(k_1,k_2)} we denote the function which is generated by differentiating u first k_1 times with respect to t_1 and then k_2 times with respect to t_2:
Analogously to the one-dimensional case, we define the Taylor coefficients of a function u by means of

(u)_{k_1,k_2} := \frac{1}{k_1!\,k_2!} \cdot u^{(k_1,k_2)}(t_1^0, t_2^0) .
In this notation, the Taylor series of u, making use of the point of expansion (t_1^0, t_2^0), is as follows:
For functions which are arithmetic compositions of other functions, we can immediately indicate calculation rules for the Taylor coefficients that are analogous to (3a-c). For this purpose, we use the following identity and abbreviating notation:

(10)   ((u)_{j,0})_{0,k} := (u)_{j,k} =: ((u)_{0,k})_{j,0} .

For the following rules it is assumed that k_1, k_2 \ge 1. For k_1 = 0 or k_2 = 0, respectively, the corresponding one-dimensional rule is valid. For the four basic operations, we obtain
Proof: see [22, 30]. Applying (10) with respect to the standard functions, and by means of the one-dimensional rules as derived in Section 3.1, we obtain the desired two-dimensional recursion formulae. As an example, we obtain for the exponential function:

Reversing the ordering of the successive applications of the one-dimensional rule, and because of (10), we obtain:
The expressions (14) and (15) yield identical values in the case of an exact computation since, for e^u \in C^k(G), the ordering is unimportant for computing the derivatives. When computing the derivatives numerically, however, we will obtain different results. Using interval arithmetic for the calculation, it becomes possible to obtain a tighter enclosure by generating the intersection of both results. This is
possible for all functions and operations. Analogously to the computation of the Taylor coefficients of the exponential function by use of (14), we are able to derive recursion formulae for other transcendental functions (see [22]). The implementation is executed analogously to the one-dimensional case.
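The two-dimensional analogue of the product rule (3b) is a double convolution over both index directions. A small Python sketch (plain floats, purely illustrative; U and V are coefficient matrices indexed by (k1, k2)):

```python
def t_mul2d(U, V):
    """(u*v)_{k1,k2} = sum_{i<=k1} sum_{j<=k2} (u)_{i,j} (v)_{k1-i,k2-j},
    the 2D analogue of rule (3b)."""
    K1, K2 = len(U), len(U[0])
    return [[sum(U[i][j] * V[k1 - i][k2 - j]
                 for i in range(k1 + 1) for j in range(k2 + 1))
             for k2 in range(K2)]
            for k1 in range(K1)]
```

For U the coefficients of t_1 and V those of t_2, the product correctly has a single nonzero coefficient at (1, 1).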
4 Modified Romberg Extrapolation
Provided we replace \xi in the expression for the remainder term by the interval of integration [a,b], then, just as in (2.9) and on the basis of inclusion isotony (see [24, 25]), we obtain the optimal approximating diagonal elements with

(1)   J = T_{mm} + R_{mm} \in T_{mm} + [R_{mm}] ,

where [R_{mm}] denotes the interval enclosure of the remainder term.
Let us now consider the operations in a floating-point screen instead of in the real numbers. The real operations \circ are now replaced by the computer interval operations \diamond with the properties explained in [24]. We then obtain

(2)   J \in \diamond J := \diamond T_{mm} \;\diamond\!+\; \diamond R_{mm} .
The outwardly directed roundings may accumulate to a considerable growth of the diameter of \diamond J. As shown subsequently and in Section 5, it is always possible to decrease R_{mm} as compared with T_{mm} such that the value of the relative diameter is insignificant with respect to the accuracy of \diamond J. Nevertheless, and because of the recursive definition of the T_{mm}, significant interval growth may occur by means of cancellations. This can be avoided by use of the new modified Romberg procedure. We replace the recursive calculation of all T_{ik} by one direct computation of an element T_{mm} by means of the call of one optimal scalar product. We thus can guarantee in (2) that the expression \diamond T_{mm} is a result of operations with roundings as small as possible. The calculation of the procedural error is possible without computing T_{mm} a priori. Thereby the error is minimized efficiently, without the need for the execution of unnecessary computations of T-table elements.
4.1 The Basic Enclosure Algorithm
From equations (2.6) and (2.9), a first method for the enclosure of values of integrals may be derived. The computation of a T-table element (in the following, we only consider diagonal elements) via the recursive formula (2.4) is replaced by a direct computation by means of the accurate scalar product (see [21, 22]). Thus, a verified computation via simple interval arithmetic is replaced by a faster and more accurate enclosure method. A special T-table element may be computed directly, without requiring explicit knowledge of other T-table elements!
4.1.1 Optimal Computation of the Approximating Term
Definition 1: The Romberg sequence \mathcal{F}_R := \{h_0/n_i\} is given by

(3)   n_i := 2^i ,   i \ge 0 .
For the Romberg sequence, there holds

Theorem 5: For the Romberg sequence \mathcal{F}_R with T_{ik} according to (2.4), each element T_{ik} can be represented by

T_{ik} = h_0 \cdot \sum_{j=0}^{n_i} w_{ikj} \cdot f(x_{ij}) ,   x_{ij} := a + j \cdot h_i .
For k = 0, the weights w_{i0j} may be computed via

(6)   w_{i0j} = \begin{cases} (2n_i)^{-1} , & j = 0 \text{ and } j = n_i ,\\ n_i^{-1} , & j = 1(1)\,n_i - 1 , \end{cases}

and, for k \ge 1, by means of the recursion rule (7).
Proof: by complete induction (see [21, 22]). The weights w_{ikj} may be represented precisely since they are rational. They may be computed a priori via a rational arithmetic and stored in a table, separately for the numerator and the denominator, for all relevant T-table elements. Instead of a recursive computation of the T_{ik}, requiring numerous evaluations of
functions, we only need to calculate rational numbers following from (6) and (7). Because the weights do not depend on f, we are able to generate fixed tables. A recursive calculation going beyond the fixed table is unnecessary in practice, because we will not extend the maximum level (here: m = 7). Therefore we define a scalar product of the vector of the weights
(8)   \tilde{w}_{ik} := (w_{ik0}, \ldots, w_{ikn_i})

and the vector of the function values

(9)   F_{ik} := (f(x_{i0}), \ldots, f(x_{in_i}))

with n_i := 2^i. Since T_{ik} is computed by use of

(10)   T_{ik} = \frac{h_0}{HN_{ik}} \cdot \sum_{j=0}^{n_i} \delta_{ikj} \cdot f(x_{ij}) ,

HN_{ik} is the main denominator of the weights w_{ikj}, and the \delta_{ikj} are the corresponding numerators.
Examples:   \tilde{w}_{00} = \tfrac{1}{2}\,(1, 1) ,   \tilde{w}_{10} = \tfrac{1}{4}\,(1, 2, 1) ,
\tilde{w}_{33} = \tfrac{1}{5670}\,(217, 1024, 352, 1024, 436, 1024, 352, 1024, 217) .
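Theorem 5 can be checked mechanically: running the recursion (2.4) with exact rational arithmetic yields the weights of each T-table element. A Python sketch using fractions.Fraction (illustrative; the chapter's tables are produced with a PASCAL-SC rational arithmetic):

```python
from fractions import Fraction

def romberg_weights(i, k):
    """Weights w_ikj with T_ik = h0 * sum_j w_ikj * f(a + j*h_i),
    for the Romberg sequence n_i = 2**i, obtained by running the
    recursion (2.4) on the weight vectors themselves."""
    n = 2 ** i
    def trap_weights(level):          # weights of T_{level,0}, cf. (6)
        m = 2 ** level
        step = n // m
        w = [Fraction(0)] * (n + 1)
        for j in range(m + 1):
            w[j * step] = Fraction(1, m) if 0 < j < m else Fraction(1, 2 * m)
        return w
    T = [trap_weights(l) for l in range(i + 1)]      # 0-th column
    for col in range(1, k + 1):                      # extrapolate columnwise
        T = [[Tl[j] + (Tl[j] - Tp[j]) / (4 ** col - 1) for j in range(n + 1)]
             for Tp, Tl in zip(T, T[1:])]
    return T[-1]
```

For (i, k) = (1, 1) this reproduces Simpson's weights (1, 4, 1)/6, and for (3, 3) the numerators 217, 1024, ... over the main denominator 5670 quoted above.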
The expression \diamond T_{mm} appearing in (2) is evaluated exactly by use of Theorem 5 (\delta_{mmj} \ge 0) and (10), making use of the optimal scalar product

(11)   \diamond T_{mm} = \Big[ \nabla \Big( \sum_{j=0}^{n_m} \delta_{mmj} \cdot \nabla f(x_{mj}) \Big) ,\; \Delta \Big( \sum_{j=0}^{n_m} \delta_{mmj} \cdot \Delta f(x_{mj}) \Big) \Big] \;\diamond\!\cdot\; h_0 \;\diamond\!/\; HN_{mm} .

The interval rounding symbols \nabla and \Delta in (11), in front of the summation signs, make it clear that there is only one rounding operation, instead of roundings in each summation and in each multiplication of \delta_{mmj} with f(x_{mj}). When running the program, only the function values have to be computed.

4.1.2 Optimal Computation of the Remainder Term
Instead of an error estimation by the difference of two T-table elements, we are now able to compute an enclosure of the remainder term R due to equation (2.9). For the Romberg sequence, the following holds true, provided \xi is replaced by the interval [a,b]:
In general, the coefficient C_m of the remainder term may be stored. The Taylor coefficient (f([a,b]))_{2m+2} is verified numerically via automatic differentiation algorithms [21, 22, 32, 33, 34].
For a verified enclosure, we get the equation

(13)   J \in \diamond T_{mm} \;\diamond\!+\; \diamond C_m \;\diamond\!\cdot\; \diamond (f([a,b]))_{2m+2} ,

with

(14)   \diamond C_m := \diamond B_{m+1} \;\diamond\!\cdot\; h_0^{2m+3} \;\diamond\!\cdot\; 2^{-m(m+1)} .

In this case, too, we are able to confine our attention to only a few interval rounding operations. The evaluation of \diamond C_m can be carried out in a non-recurrent fashion in advance, just as in the case of the weights w_{ikj}. Therefore, during runtime, we have only to compute the Taylor coefficients and to multiply them with \diamond C_m. With (11), (12) and (14), we change (2) to (15).
Therefore the essential computational effort is the evaluation of the function values and of the Taylor coefficients. In this case, \tilde{w}_{mm} is the vector of the weight numerators with respect to the main denominator.
4.1.3 The Enclosure Algorithm Basic
The algorithm Basic is a direct interval algorithm which realizes the new modified Romberg procedure for a verified enclosure of the value of an integral. The interval diameter of \diamond R_m serves as the truncation criterion

(16)   d(\diamond R_m) \le \varepsilon ,

where \varepsilon is the required absolute total error. Thus, the choice of an appropriate method via (16) does not require that any T-table element be known! If the real operations are replaced by the corresponding screen operations (see [25]), rounding errors in computer applications will not affect the result anymore.
Algorithm Basic:
1. input(f, a, b, eps)
2. i := -1;
3. repeat
      i := i + 1;
      \diamond R_i := \diamond C_i \;\diamond\!\cdot\; \diamond (f([a,b]))_{2i+2}
   until d(\diamond R_i) \le eps;
   \diamond J_i := \#\# \big( \diamond(\tilde{w}_i \cdot F_i) \cdot h_0 / HN_{ii} + \diamond R_i \big)
4. output(\diamond J_i, i)
Remarks:
1. The variables C_i, \tilde{w}_i are tabulated; F_i represents the vector of the function values at the points x_{i0} to x_{in_i}; this vector is computed only for the last i. The Taylor coefficients (f([a,b]))_{2i+2} are verified in a separate step.
2. As compared to the algorithm approx, and in addition to the quality of a guaranteed enclosure which has now been achieved, the essential improvement is the direct and therefore exact evaluation of the approximating term, which is executed subsequently to the determination of an optimal remainder term.
4.2 Extension to Arbitrary h-Sequences
It is possible to extend the algorithm Basic to arbitrary h-sequences \mathcal{F}. We state the following lemma for sequences \mathcal{F} for which K(x) in the remainder term formula (2.6) changes its sign in [a,b].
Lemma 1: Let \mathcal{F} := \{h_i\} be an h-sequence for which K(x) in (2.7) changes its sign in [a,b]. For the procedural error R in the Romberg extrapolation, there then holds |R| \le D_m, with

(18)   D_m = (b-a) \cdot B_{m+1} \cdot \sum_{i=0}^{m} |c_{m,i}| \cdot h_i^{2m+2} .
Proof: see [22]. If we compute the c_{m,i} to determine D_m by a single evaluation of (18), the algorithm Basic can be generalized to sequences \mathcal{F} which converge because of Theorem 3, even though there is a change of sign of K(x). In order to obtain the same optimal properties for an arbitrary h-sequence as in the algorithm Basic, we try to represent the T_{ik} as a scalar product. Starting from the h-sequence \mathcal{F} := \{h_0/n_i\}, we consider the set of nodes K_i relating to the stepsize h_i. Because the boundaries a and b belong to each node set, we simplify the representation as follows:

(19)   K_i := \{ a + j \cdot h_i \}_{j=1}^{n_i - 1} .
In the case of the usual and the new evaluation of T_{ii}, we employ the function values precisely in those nodes where the trapezoidal sums T_{i-k,0} to T_{i,0} are needed; i.e., by the choice of the optimally approximating T-table element T_{mm}, we use all function values in the trapezoidal sums T_{00} to T_{m0} which have already been computed. To obtain a recursion formula for the evaluation of the weights for the purpose of a direct representation of T_{ik} by means of a scalar product, it is necessary to know the number of preceding elements T_{i0} with respect to all required function values. For this we consider two sequences which, in a sense, are extreme in this respect.
A) Maximal Utilizing Sequence
Concerning the newly required function values, the ones that have already been computed appear in the row immediately above; \mathcal{F}_R is characterized by

(20)   n_i = 2 \cdot n_{i-1} ,   for all i \ge 1 ,

and

(21)   K_{i-1} \subset K_i ,   for all i \ge 1 ;

i.e., all function values evaluated up to the (i-1)-th row appear precisely in the (i-1)-th row, and all of them are needed in the i-th row.
B) Minimal Utilizing Sequence
The sequence \mathcal{F}_P := \{h_0/p_i\} has the property

p_i = i\text{-th prime number}, \; i \ge 1 , \; p_0 := 1 ,
K_i \cap K_j = \emptyset ,   for all i, j with i \ne j ;

i.e., in order to evaluate the function values needed in the i-th row, we cannot use any individual function value which has been evaluated up to this point. These considerations are important for the determination of the procedure with the smallest number of employed nodes.
4.2.1 The Bulirsch Sequence
Now we discuss the Bulirsch sequence \mathcal{F}_B.
Definition 2: The Bulirsch sequence \mathcal{F}_B := \{h_0/n_i\} is given by

(22)   n_i = \begin{cases} 1 , & i = 0 ,\\ 2^{(i+1)/2} , & i \text{ odd} ,\\ 3 \cdot 2^{i/2 - 1} , & i \text{ even}, \; i \ne 0 , \end{cases}

i.e., n_i = 1, 2, 3, 4, 6, 8, 12, 16, \ldots
For \mathcal{F}_B there holds:

(23)   n_i = 2 \cdot n_{i-2}   and   K_{i-2} \subset K_i ,   for all i \ge 3 .
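The Bulirsch denominators n_i = 1, 2, 3, 4, 6, 8, 12, 16, ... and the doubling property (23) can be generated and checked with a few lines of Python (an illustrative sketch of the sequence as reconstructed above):

```python
def bulirsch_n(count):
    """First `count` denominators n_i of the Bulirsch sequence:
    n_0 = 1; n_i = 2^((i+1)/2) for odd i; n_i = 3*2^(i/2-1) for even i >= 2."""
    return [1 if i == 0 else
            2 ** ((i + 1) // 2) if i % 2 == 1 else
            3 * 2 ** (i // 2 - 1)
            for i in range(count)]
```

Every n_i with i >= 3 is exactly twice n_{i-2}, so node sets are reused every other row.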
The recursion formula (2.4) may be generalized; we obtain (24), with the coefficients defined in (25).

Concerning \mathcal{F}_B, all nodes from T_{00} to T_{i-1,0} are included in T_{i-1,0} and T_{i,0} (see (23)); for the construction of a recursion rule for the weights, therefore, only the two last-computed rows of the T-table have to be added. Thus, we proceed to
Theorem 6:
For the Bulirsch sequence \mathcal{F}_B due to (22), with n_{-1} := 0, there holds:

For all i, k, we obtain the nodes

x_{ij} := a + j \cdot h_i ,   j = 0(1)\,n_i ,   0 \le k \le i .

For all i, k, there holds the scalar-product representation (28) of T_{ik}, analogous to Theorem 5. The weights w_{ikj} and v_{ikj} are rational and may be computed as follows:

1. For k = 0, 0 \le i:

(29)   w_{i0j} = \begin{cases} 1/n_i , & j = 1(1)\,n_i - 1 ,\\ 1/(2n_i) , & j = 0 \text{ and } n_i , \end{cases}
\qquad v_{i0j} = 0 ,   j = 0(1)\,n_{i-1} ,

with \alpha_{ik} and \beta_{ik} due to (25).
Proof: see [22]. Analogously to the Romberg sequence, an enclosure algorithm may be realized for the Bulirsch sequence, provided the weights have been stored in a T-table and
after it has been decided which one of the remainder term formulae is to be applied, i.e., whether or not K(x) in equation (2.7) changes its sign in [a,b]. This mathematical problem is discussed in [7]. For \mathcal{F}_B, at certain intermediate points of the integration interval (the estimation has been carried out in steps), Bulirsch emphasizes that the function K(x) possesses the same sign for all m with m \le 15; therefore he concludes that "with considerable certainty" K(x) does not change its sign for m \le 15 in the whole interval. This rather unsatisfactory statement may now be replaced by

Theorem 7:
T_{mm} - J = h_0^{2m+3} \cdot \tilde{\varepsilon}_m \cdot B_{m+1} \cdot (f(t))_{2m+2} ,   t \in [a,b] ,

with \tilde{\varepsilon}_m := \big( \prod_{i=0}^{m} n_i \big)^{-2} .

Proof: The proof of the absence of a change of sign of K(x) is not obvious; it may be looked up in [22]. It was executed by means of the computer algebra system REDUCE and by means of a program computing tight enclosures of the range of values of polynomials [15].

4.2.2 Introduction of the Decimal Sequence
Concerning the Bulirsch sequence, problems with respect to the implementation may occur since, in general, the nodes cannot be represented exactly. This may have negative consequences for the interval diameters and for the execution time in the case of the function values; as a remedy, a so-called decimal sequence \mathcal{F}_D may be introduced:

Definition 3: The decimal sequence \mathcal{F}_D := \{h_0/n_i\} is given by n_0 = 1, n_1 = 2, n_2 = 5, and n_i = 10 \cdot n_{i-3} for i \ge 3.

Now the nodes can always be represented precisely, provided a decimal arithmetic is
used. A clever implementation, however, allows \mathcal{F}_B to be used in many cases. Since \mathcal{F}_D shows a behavior similar to \mathcal{F}_B, the screen in the coefficient table becomes denser and thus the choice of an optimal method is improved. Theorems 6 and 7 are valid analogously for the decimal sequence (see [22]).
4.3 Comparison of Different h-Sequences
For this comparison, we consider the number of function values required for the evaluation of T_{mm}, and we establish their relationship to the corresponding procedural error R_m. Since we cannot make a priori statements about the growth rates of the Taylor coefficients for an arbitrary function f, we relate here only the remainder terms of the same column, i.e., with identical order of the derivative and the corresponding function values. Therefore, in (12) we consider only the expression C_m = \tilde{\varepsilon}_m \cdot h_0^{2m+3} \cdot B_{m+1}, with the remainder term factor
(33)   \tilde{\varepsilon}_m = \Big( \prod_{i=0}^{m} n_i \Big)^{-2} .
In the determination of T_{mm} by means of a scalar product, the growth of the intervals is generally not related to m or to the choice of a sequence \mathcal{F}. Therefore, in the determination of an optimal sequence \mathcal{F}, we may confine our attention to the remainder term. We compare the sequences \mathcal{F}_R, \mathcal{F}_B, \mathcal{F}_D, and \mathcal{F}_P; for all m \le 7, we determine the number of the required function values and the quantities \tilde{\varepsilon}_m. By use of the equations presented before, we conclude for m \ge 1 that
(34) holds; therefore, we obtain a table of remainder term factors (see Table 5, Section 4.4 in [22]). We will now compare the remainder term factors \tilde{\varepsilon}_m for identical m and identical numbers of required nodes. In order to achieve identical numbers of nodes with identical m, we occasionally have to carry out bisections or divisions into three parts. In such a case, the factor h_0^{2m+3} in C_m changes. Table 1 lists the optimal sequences for all m \le 7.
Table 1: Optimal h-sequences

For a high precision in conjunction with a minimal number of nodes and the smallest error term, the best possible choices are the Bulirsch sequence \mathcal{F}_B and, in the case of a decimal arithmetic, the then more meaningful sequence \mathcal{F}_D. Provided only a low accuracy is required, it is possible to go only to the third row of the T-table and to choose the optimal sequence there (see Table 1).
5 The Optimal Enclosure Algorithm vquad
We will now derive the algorithm vquad for the determination of the optimal remainder term on the basis of Section 4.3. For the purpose of a reduction of the error, we combine bisection methods and a continuation in the diagonal of the T-table. After a first basic partition, there follows an adaptive refinement, controlled by the remainder term. Only if the remainder term satisfies the required error bounds does it become necessary to evaluate the T-table element. Thereby it is possible to achieve the required accuracy with minimal total cost, because an increase of the computing effort by means of bisections or higher T-table elements is incurred only where it is necessary. Locally, where the function is sufficiently simple, we achieve high accuracy at low cost; vquad delivers an optimal result fulfilling the required accuracy without large over- or underestimations. In the case of periodic functions, the trapezoidal rule yields the best approximation, as has already been mentioned in [4]. Therefore a continuation in, e.g., the diagonal of the T-table will not cause an improvement but only a growth of the remainder term. Also in this case, vquad selects the best remainder terms in the corresponding partitions such that both the error and the effort are small. Table 8 in [22] lists the optimal remainder term factors for all m \le 7. The underlying assumption is a lemma which is derived in Section 5.1. In dependence on the Taylor coefficients, the finally decisive comparison of the remainder terms is executed in vquad, with minimal cost and by using the table.
5.1 Criterion for the Selection of the Optimal Method V(m,l,F)
The expression \tilde{\varepsilon}_m in the representation (1) of the remainder term can be computed and stored for each h-sequence and for all m \le m_{max}. Now we return to the bisections as discussed in Section 4.3, and we arrive at
Definition 4: V(m,l,F) is the basic method computing the remainder term with the remainder term factor \tilde{C}_{m,l}: \tilde{C}_{m,l} is the value of \tilde{\varepsilon}_m for the h-sequence \mathcal{F} after l bisections of I_i.
Remark: The values \tilde{\varepsilon}_m and B_{m+1} are stored in a table. We can also store h_0^{2m+3} in a table (see Section 5.3.1 in [22]); this involves a minimal computational effort. For the sub-domain I_i we require (see Section 5.3.1):
(3)   d(\diamond R_m) \le (\varepsilon_l)_i .
By use of ( l ) , we transform (3) into
=: K,
and we obtain
Lemma 2: Provided the inequality (5) is satisfied, making use of \tilde{C}_{m,l} and \kappa_m in (4), then the originally required accuracy (3) is also satisfied.

Remarks:
1. Lemma 2 is the central point of the algorithm for the search for the best method V(m,l,F). This is the primary control of the whole adaptive algorithm vquad.
2. Instead of computing the corresponding remainder term for each table element (i.e., n times) for the purpose of a comparison with \varepsilon_l according to (3), we have to evaluate \kappa_m only once to be able to execute the comparisons!
The table of the coefficients \tilde{C}_{m,l} (see Section 4) is now extended by the parameters \mathcal{F} (for the choice of the h-sequence) and l (bisection step). We give two examples concerning an interpretation.
a) Table Concerning \tilde{C}_{m,l} as Sorted in View of Values Around 10^{-12}:
From Table 2, we infer that \tilde{C}_{7,0} (for \mathcal{F}_D) yields the best relationship between computational effort and accuracy. Depending on the value of the diameter of the Taylor coefficient, the position of the optimal remainder term may change. Only by using the algorithm search described in Subsection 5.3.2 can we arrive at a final decision.
(m,l,F)           ...  (4,2,D)    (3,3,R)    (7,0,D)    (2,5,B)   ...
\tilde{C}_{m,l}   ...  2.3E-12    1.8E-12    9.5E-13    8.1E-13   ...
\#f_i             ...  32         64         20         96        ...

Table 2: Some \tilde{C}_{m,l} for the value 10^{-12}

For all m \le 7, Table 8 in [22] contains all parameters of the optimal method (in this case, a total of 61).
b) Table Concerning \tilde{C}_{m,l} as Sorted in View of the Taylor Coefficients:

For the case of m = 6, let us compare the different methods V(m,l,F); we thus obtain:

(m,l,F)           (6,0,R)    (6,2,B)    (6,1,B)
\tilde{C}_{m,l}   2.3E-13    4.9E-18    1.6E-13
\#f_i             64         64         32

Table 3: Some \tilde{C}_{m,l} for m = 6
For \mathcal{F}_B, the results imply a procedural error smaller by several orders of magnitude than the error for \mathcal{F}_R, provided the same computational effort is being made. If \mathcal{F}_B is bisected only once, the computational effort is only half as much as in the case of \mathcal{F}_R, and the error still decreases by 30 %. In this case, \mathcal{F}_B is to be preferred. The optimal method can be looked up in the \tilde{C}_{m,l}-table in an a priori fashion. Subsequent to the computation of the appropriate Taylor coefficients, the best method (i.e., the method requiring the least computational effort while approximately satisfying the required accuracy) may be chosen in an a posteriori fashion by applying equation (5) in Lemma 2.
5.2 Survey of the Principles of Generating vquad

By means of a comparison of the extended coefficient table \tilde{C}_{m,l} with equation (5), the optimal method may be chosen. Concerning the required accuracy and the increase or decrease of the derivatives of the integrands, in the combinations of the three different h-sequences \mathcal{F} with the T-table row and the bisection step, the
fastest method is realized via a search algorithm (see 5.3.2). The time spent on this search is irrelevant, since at most 10 values have to be compared in each row. Only afterwards are the function values computed which are necessary for the optimal method V(m,l,F); this is the main computational effort.
If the chosen method does not satisfy the required accuracy \varepsilon_r, we nevertheless obtain an integral enclosure with guaranteed bounds, even though the interval diameter is larger than required. For a further reduction, an adaptive refinement strategy is applied. To keep the errors small from the beginning, we start by initially segmenting the interval into subintervals allowing applications of the Bulirsch sequence. If no satisfactory method can be found for a certain m, then m will be increased or (in the case that the remainder term is augmented by the last increase) the interval bisected, and the same method will be applied recursively to both subintervals. Above all, bisections are preferable in the case of a more rapid increase of the higher order derivatives of the integrand as compared with the decrease of \tilde{C}_{m,l}. We observe this situation, e.g., in the cases of strongly oscillating functions or close to poles. By deciding whether the new remainder term is larger than the old one, we can automatically prevent a growth of the procedural error. In this case, we choose a bisection step instead of a new T-table element. This is repeated as long as it is useful, or until a maximal bisection or T-table step has been reached. The appropriate function values are computed after the optimal method has been selected for a certain subinterval; T_m and R_m are verified numerically by use of the above-mentioned theorems and stored in a linked list. Finally, these lists are scanned and the values are accumulated. Thus, we obtain an optimal method, since additional computation is carried out only where necessary. The algorithm vquad realizes this method.
5.3 vquad in Detail
In [22] we find a complete outline of all details of the following flow-chart. An individual computation of the total of 2 \cdot (m_{max} + 3) coefficients for each sub-interval I_i of the basic partition is carried out.
Algorithm vquad (flow-chart; see [22] for the complete details): input(f, a, b, \rho_r); initial segmentation into subintervals I_i, with computation of \hat{j} at the endpoints; computation of \varepsilon_r from \rho_r; for each I_i, the procedure block(I_i, R, m, l, \varepsilon_l) computes (\varepsilon_l)_i from \varepsilon_r and tests d(R^{opt}) < (\varepsilon_l)_i, refining adaptively where the test fails; finally, all T_m, R_m are accumulated.
R. Kelch
5.3.1 Global and Local Requirements of Accuracy
By way of the input of the required relative total error, a global condition is considered,

(6) $p_r = \dfrac{\epsilon_r}{|\tilde{j}_{ges}|}$

with

(7) $d(J_{ges}) = \sup(J_{ges}) - \inf(J_{ges})$.
This means that our method as applied to $[a, b]$ should yield an integral enclosure $\Box J$ with

(8) $\epsilon := d(\Box R) \overset{!}{\le} \epsilon_r$

or the corresponding relative requirement (9).
The reason for these choices is that only the procedural errors are taken into account for our automatic error control by use of an adaptive refinement strategy. We compute $\epsilon_r$ by use of $p_r$ and

(10) $\epsilon_r = p_r \cdot |\tilde{j}_{ges}|$.

The letter r in the index of $\epsilon$ and $p$ refers to a required accuracy; $\tilde{j}_{ges}$ is the notation for a good approximation of the exact value. We use (9) to allow a control of the adaptive refinement strategy. A small difference of $\tilde{j}_{ges}$ as compared with the exact value of $J_{ges}$ does not decisively affect the adaptive control (and has no influence on the verification, of course). Therefore, in most cases, a coarse approximation method is sufficient for the evaluation of $\tilde{j}_{ges}$ and therefore also for step 4 in vquad corresponding to (10). We now have to consider the choice of the local requirements of accuracy with respect to the sub-intervals in order to satisfy the global requirement (8). By use of the definition of the achieved local absolute error

(11) $\epsilon_i := d(R_m^{(i)})$,

and by use of the definition of the required absolute error $(\epsilon_r)_i$,
we immediately obtain the following lemma, which is proved in [22]:
Lemma 3:
5.3.2 The Algorithm search
By use of the table of the remainder term factors, the determination of the optimal remainder term $R^{opt}$ employs a search algorithm search, which is called by a procedure block (see the flow-chart of vquad) that calls itself recursively. This may require a bisection of a subinterval $I_i$ and the determination of an optimal remainder term by means of search for the thus generated partitioned subintervals; this is initiated by calling the procedure block for the new intervals. As the final condition, that $R^{opt}$ is chosen which has the smallest number of nodes in the set of all $R^{(i)}$ for $0 \le i \le m_{max}$. The search time is negligible since there are less than 10 candidates for applicable methods concerning any row of the T-table.

Remarks on the following algorithm search:
1. The $\epsilon$-growth in search may lead to an acceleration of the method.
2. The external repeat-loop in search may be omitted; it serves as a refinement of the method, provided we use a more dense net of the values of the premultiplication factors.
Algorithm search: flow-chart, see [22].
5.3.3
Computation, Storage, and Accumulation of $R_m$ and $T_m$ with High Accuracy
Subsequent to a determination of m via search (and therefore a determination of $R_m$), there follows the evaluation of $T_m$ by means of (4.10). The addition of $R_m$ and $T_m$ by use of (4.14) is carried out only at the end in order to avoid an unnecessary growth of the interval diameter. In this context, we use the identity
By use of a floating-point screen, there holds $J_{ges} \in \Box J$. The values $(R_m)_i$ are determined via the procedures search and block. By use of the h-sequences and the thus generated m- and l-values, we are able to read the suitable weights in the tables in order to compute the corresponding function values and in order to determine $\Box T_m$. The values $(T_m)_i$ and $(R_m)_i$, which are stored in linked lists, are accumulated successively (without roundings) in the long accumulator, i.e., without a loss of information. For this we use the data type dotprecision (see [5, 24]). For each $I_i$ in the subdivided basic interval, a linked list is generated. Only in one step at the end is there one rounding operation and, therefore, a minimal interval growth.
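The "accumulate without rounding, round only once at the end" step can be imitated with exact rational arithmetic. This is a sketch of the idea only; the actual implementation uses the dotprecision long accumulator of PASCAL-SC [5, 24], not Python rationals.

```python
from fractions import Fraction

def accumulate_exact(values):
    acc = Fraction(0)              # plays the role of the long accumulator
    for v in values:
        acc += Fraction(v)         # binary floats convert without error
    return float(acc)              # one single rounding at the very end

parts = [0.1] * 10
naive = sum(parts)                 # ten intermediate roundings: not 1.0
exact = accumulate_exact(parts)    # rounded once: exactly 1.0
```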
5.4 Remarks on the Implementation
By the adaptive refinement, the recursive call in algorithm block leads to a tree structure. In the case of a bisection, the nodes of the tree point to a left and to a right branch. Otherwise, the evaluation takes place successfully and the obtained results are stored in a linked list. Figure 1 shows a simple example of a generated tree. A node consists of a data part D and three pointers. Provided the step $l_{max}$ is reached, no further bisection is carried out, i.e., there is a maximum of $(l_{max} - 1)$ bisections.
$\Box J := \Diamond\Bigl(\sum_i (T_m)_i + \sum_i (R_m)_i\Bigr)$

Figure 1: Example of a tree for an adaptive bisection
6 Numerical Results and Comparative Computations
The algorithm vquad is now to be compared with the approximative algorithm approx (see (2.10)), the asymptotic method asympt (see (2) in the following section), and an enclosure method itayl based on the Taylor-series ansatz. To guarantee a "fair" comparison, the comparison algorithms were also equipped with a similar adaptive refinement strategy. All methods were run on an Atari Mega-ST; they were implemented in PASCAL-SC (see [5]). In general, the simple test examples provide close approximations or tight enclosures, even though occasionally there are considerable differences in the computational effort (see above). The asymptotic method generally requires the highest execution time. Since the ratio of these times is more than 100 as compared to vquad, this method soon becomes irrelevant, particularly as there is no verification and incorrect enclosures are occasionally obtained.
6.1 Methods for Comparison
In Section 6 in [22], there is a detailed discussion of the methods employed for the intended comparisons.
6.1.1 approx and asympt
As to the approximative method, it may happen that the true error is much higher than the estimated error; i.e., the program proclaims a significantly smaller error than it actually produces. Thus, for difficult integrals, the reliability of this method is doubtful. Additionally, in the case of a high requirement for accuracy, the computational effort is higher than that of algorithm vquad. An asymptotic enclosure is obtained by introducing a U-sequence by use of
Thus, due to [35] there also holds:
In practice, however, it is not possible to determine this m' precisely. Just as in the case of approximative methods, estimations may only yield numbers which have nothing to do with the solution.
6.1.2 A Taylor-Series Ansatz with Verified Enclosure
With $f \in C^{\infty}[a,b]$, we obtain for $J$ in (2)

(3) $J = Q + R$

with $Q$ given by (4) and $R$ given by (5).
We are able to evaluate (4) and (5) in different ways by the choice of different values for $x_0$. Assume that $x_0 = \frac{a+b}{2}$ or $x_0 = a$ (or, analogously, $x_0 = b$). In order to obtain high precision with a minimal cost of computing time, the diameter $d(R)$ is of decisive importance. There holds that $d(R)$ due to the first version ($x_0 = \frac{a+b}{2}$) is smaller by a factor of $2^n$ than $d(R)$ by means of the second version. Therefore Theorem 8 as proved in [22] is valid. Another improvement is obtained via intersections.
Theorem 8: As an enclosure of the integral $J := \int_a^b f(x)\,dx$, we obtain

(6) $J \in Q_n + R_n$

with the approximation term $Q_n$ given by (7) and the remainder enclosure $R_n$ given by (8), with weights $w_i$, $i = 0(1)n$.
6.2 Discussion of the Presented Methods
6.2.1 Theoretical Statements
As measures for the quality we choose, in the case of an approximation method, an error estimator (e.g. the difference of two succeeding approximations) and, in the case of verification methods, the diameter of the remainder term enclosure.
We test whether the "pseudo"-enclosures as generated by asympt are true enclosures. In a final comparison of all four procedures, we choose the total computing time for obtaining a required accuracy as a measure for the quality of a method. The reason is that in vquad and in itayl Taylor coefficients are computed which yield an essential contribution to the total effort. The measure of the over- or underestimation, respectively, of the true error is of special importance.

(1) Comparison of approx and asympt: Based on the numerical stability of the Romberg extrapolation (see [37]), we frequently get very good approximations. But we can demonstrate easily by use of examples that the rounding errors may quickly lead to an error estimator that is far away from the exact error. In the case of lower accuracy requirements $\epsilon_r$, approx needs less computation time than asympt; for asympt, the validation of (2) is more costly than that of the error estimation. For large $\epsilon_r$, the method asympt cannot work efficiently, either. Because of the identical approximation terms for small $\epsilon_r$, we obtain similarly good approximations. The "pseudo"-enclosure due to asympt may yield results that are as misleading as the ones due to approx. Therefore we expect that asympt will yield less acceptable results than approx.
(2) Comparison of vquad and itayl: Because of the convergence-accelerating effect of the extrapolation, it is to be expected that vquad terminates faster than itayl. This does not yet take into account the cost of evaluating the Taylor coefficients, both in the remainder term and in the approximation term.
a) We now compare the remainder term formulae for vquad ($R_V$) and for itayl ($R_I$). For the quality of the remainder term, the interval diameter is of decisive importance; it is caused by the true interval $I_i$ in $f^{(2m+2)}(I_i)$. Since this value is identical for both methods, the premultiplication factors govern the absolute magnitude of the procedural error. Inspection of Figure 2 reveals immediately that the new algorithm vquad is distinctly superior to itayl beginning with m = 3.

b) We now compare the effort of evaluating the corresponding approximation terms $T_m$ and $Q_{2m+2}$. The computational effort for $T_m$ rises proportionally to the number of the required function values. Therefore, a bisection step implies a doubling of the computation time for the evaluation of the function values. For the evaluation of the Taylor coefficients, we cannot derive an analogous direct dependence
between the approximation degree and the runtime. Numerical results show a behavior that differs occasionally. Using lower or medium accuracy or sufficiently simple functions, the Taylor method is faster and somewhat more precise than the extrapolation method. In difficult cases and/or with higher requirements for the accuracy, the algorithm vquad delivers distinctly better results (see Table 10 in [22]).
Figure 2: Comparison of the remainder terms of vquad and itayl

Figure 3 illustrates these effects distinctly for $f_1$, $f_2$ and $f_3$; e.g. for $f_2$, the absolute diameter increases while the absolute values of the Taylor coefficients decrease. For $f_3$ and $f_1$, the absolute diameters grow by one power of ten with every increase in the approximation degree m. The occurrence of this growth is remarkable in view of the fact that the argument $x_0$ is a number that can be represented precisely. As compared with itayl, there is another advantage of vquad in the case of bisections. Using vquad, all function values computed up to that point can be re-used (see Section 4). As a consequence, the additional effort due to the iterative bisection is significantly reduced. Using itayl, on the other hand, it is not possible to re-use a Taylor coefficient or a corresponding vector since, because of Theorem 8, the Taylor coefficients are computed only at the midpoint of the interval currently under consideration. After a bisection, the old midpoints become endpoints and therefore cannot be used! In this case, the effort for itayl increases enormously. In view of the cost and the precision of the remainder term, therefore, in the majority of the cases under consideration vquad is to be preferred.
The functions $f_i$ are evaluated at $x_0$; they are defined as follows:

$f_1 = e^{x^2} \cdot \sin(e^{x^2})$, $x_0 = 2.0$ (see $J_1$ in Section 6.2.2)

$f_2$ = the integrand of $J_2$, $x_0 = 1.0$ (see $J_2$ in Section 6.2.2)

$f_3$ = the integrand of $\int_0^{2\pi} \frac{v \cdot re + w \cdot im}{re^2 + im^2}\, dt$, $x_0 = 0.5$, with
$re = a_3 \cos 3t + a_2 \cos 2t + a_1 \cos t + a_0$,
$im = a_3 \sin 3t + a_2 \sin 2t + a_1 \sin t$,
$v = 3a_3 \cos 3t + 2a_2 \cos 2t + a_1 \cos t$,
$w = 3a_3 \sin 3t + 2a_2 \sin 2t + a_1 \sin t$,
and $a_3 = 1$, $a_2 = -5.5$, $a_1 = 8.5$, $a_0 = -3$.
(3) Comparison of vquad and asympt: As has been observed in (1), asympt is more costly and not more accurate than approx in spite of the declaration of an "enclosure". The extent to which approx still delivers useful error estimators for difficult integrals will be seen in Section 6.2.2.
Figure 3: Interval blow-up of the Taylor coefficients as a function of the approximation degree m.
6.2.2 Numerical Results
Table 4 and Figure 4 confirm the conjecture concerning a doubtful reliability of the classical error estimator (notice the logarithmic scale!). Just as in the case of
average requirements for accuracy, we frequently are able to find examples showing significant distances between the error estimator and the accurate value. Thus, this error estimation is worthless (see, e.g., Table 8).

required accuracy $\epsilon_r$:  1e-4    1e-6     1e-8     1e-10
error estimator:                  1e-5    4.5e-7   1.3e-9   9.6e-12
true error:                       5e-4    2.2e-5   1.6e-9   6.5e-12
$g_p$:                            50      50       1.3      0.7

Table 4: The doubtful reliability of the classical error estimator
Figure 4: Some $g_p$-values of the integral examples

The method itayl yields excellent results. In the case of simple functions with low requirements for accuracy, this method is faster than vquad. In difficult cases with high requirements for accuracy, however, vquad is five times faster than itayl.
Integral-Example No. 1:
$J_1 = \int_a^b e^{x^2} \cdot \sin(e^{x^2})\,dx$ (see [8])
The integrand is strongly oscillating. The magnitude of the derivatives grows rapidly as x increases, particularly in the case of large b. Thus, there are serious problems. For b = 2.1 we obtain the results listed in the subsequent Table 5. Computations marked by (*) in the column for approx in Table 5 yield approximations with a significant distance from the error estimator. Neither do the asymptotic enclosing intervals as obtained from asympt contain the integral value in the cases marked by (*). Figure 5 once more demonstrates distinctly the advantages of the verified algorithm vquad as compared with the comparison algorithms.
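The oscillatory character can be made concrete with a small side computation (an illustration, not part of the original comparison): the integrand vanishes wherever $e^{x^2}$ is an integer multiple of $\pi$, which happens 26 times in $(0, 2.1]$, with the local frequency growing like $2x\,e^{x^2}$.

```python
import math

def g(x):
    return math.exp(x * x) * math.sin(math.exp(x * x))

# Count sign changes of the integrand on a fine grid over [0, 2.1];
# the grid spacing is far below the minimal distance between zeros.
n = 100000
vals = [g(k * 2.1 / n) for k in range(n + 1)]
sign_changes = sum(1 for k in range(n) if vals[k] * vals[k + 1] < 0)
# One change per root of exp(x^2) = j*pi, j = 1..26, so sign_changes == 26.
```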
Figure 5: Computational effort in comparison with the achieved accuracy for $J_1$
$p_r$:    1e-6     1e-9    1e-11
vquad:    119      220     252
approx:   142(*)   282     465
itayl:    858      1092    1135
asympt:   1653     1656    1773

Table 5: Comparison of the computational effort with the achieved accuracy for $J_1$
Integral-Example No. 2:
The authors of [43] deal with this integral as a result of the integration of an equation of motion. If we analyse $v(t)$ at the nodes $t = 0(0.2)2.0$, our enclosure algorithm vquad yields excellent results in the case of a required accuracy of $p_r = 10^{-1}$; see Table 6, which also lists the values as given in [43]. It is remarkable that in most cases only 2 or 3 digits are correct, whereas in [43] 3 additional decimal digits are considered to be accurate.
Table 6: Comparative computation
Integral-Example No. 3:
$J_3 = \int_0^1 \dfrac{dx}{a^4 + (3x - 1)^4}$
For the given parameter values $a = 10^{-1}$ and $a = 10^{-2}$, Tables 7 and 8 illustrate the numerical results in the case of the absolute accuracy requirements $\epsilon_r = 1$ and $10^{-3}$. The accurate value of $J_3$ for $a = 10^{-1}$ is 740. If we apply approx, we obtain 86 as a first approximation, together with an absolute error estimator of 21. For $a = 10^{-2}$ there holds $J_3 \approx 740480$. Here, the approximation method provides an approximation differing even more from the solution. The error estimator is too small by a factor of $10^4$: whereas approx provides the number 30 as an error estimator with respect to the approximated value 335289, itayl aborts with an exponent overflow. Explanations to Tables 7 and 8: Marking vquad by (**) denotes that no standard partitioning was chosen. Table 7 demonstrates quite clearly the advantages of a more favorable computation of the Taylor coefficients. The integral values marked by (*) are outside of the domain of the verified enclosure; they thus demonstrate the insufficiency of this method. The last column shows the obtained absolute error d(J) or the indicator size $g_p$ for the quality of the error estimation. If $g_p > 1$, this implies a significant underestimation of the error (see below).
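Why $J_3$ defeats sampling-based error estimation can be seen numerically: the integrand is a spike of height $a^{-4}$ centered at $x = 1/3$ and of width of order a. The following illustration uses a plain composite trapezoidal rule, not one of the algorithms compared above; the names are ad hoc.

```python
def f3(x, a=1e-1):
    return 1.0 / (a ** 4 + (3.0 * x - 1.0) ** 4)

def trapezoid(f, a, b, n):
    h = (b - a) / n
    return h * ((f(a) + f(b)) / 2.0 + sum(f(a + k * h) for k in range(1, n)))

coarse = trapezoid(f3, 0.0, 1.0, 4)       # 4 subintervals: spike nearly missed
fine = trapezoid(f3, 0.0, 1.0, 200000)    # resolves the spike: about 740.4
```

The coarse value is an order of magnitude too small; an approximation method sampling on such grids reports values far from the solution, as with the value 86 returned by approx.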
Table 7: Comparative computation in the case of example $J_3$, with $a = 10^{-1}$

Table 8: Comparative computation in the case of example $J_3$, with $a = 10^{-2}$
Quadrature by Extrapolation
7
183
Conclusions and Outlook
As in the case of almost all numerical problems, it is shown that an approximation method may provide results close to the solution only with a certain probability. If, however, guaranteed results and bounds are required (even for ill-conditioned cases), then verified enclosure algorithms such as vquad in Section 5 should be used. Enclosure algorithms enable the user to compute pairs of bounds with small distances for the required integral; this can be achieved in a fast and uncomplicated manner via automatic differentiation algorithms and transformation of the Romberg extrapolation into a direct scalar product. Thus, a true error control has become possible via an adaptive step size control! In the case of ill-conditioned integrals, the classical error estimators may fail. Whether or not an integral is a critical case in this respect cannot be decided in an a priori fashion (see Example No. 2 in Section 6.2.2). Only subsequent to applications of enclosure methods are we able to ascertain whether an integral is ill-conditioned. Obviously, applications of inaccurate approximation methods are unnecessary since the method vquad yields guaranteed bounds which enclose the true values of the integral. An interesting possibility is the generalization of vquad for multi-dimensional integrals (see [13, 14, 29, 41]). In [38] we find an excellent continuation of the present work for the case of two-dimensional integrals.
References

[1] Abramowitz, M. and Stegun, I.A.: Handbook of Mathematical Functions, Dover Publications, New York, 1965
[2] Alefeld, G., Herzberger, J.: Introduction to Interval Analysis, Academic Press, New York, 1983
[3] Bauch, H., Jahn, K.-U., Oelschlägel, D., Süße, H., Wiebigke, V.: Intervallmathematik, Theorie und Anwendungen, BSB B.G. Teubner Verlagsgesellschaft, Leipzig, 1987
[4] Bauer, F.L., Rutishauser, H. and Stiefel, E.: New Aspects in Numerical Quadrature, Proc. of Symposia in Applied Mathematics, 15, AMS, 1963
[5] Bohlender, G., Rall, L.B., Ullrich, Ch., Wolff v. Gudenberg, J.: PASCAL-SC, Wirkungsvoll programmieren, kontrolliert rechnen, B.I., Mannheim, 1986
[6] Braune, K.: Hochgenaue Standardfunktionen für reelle und komplexe Punkte und Intervalle in beliebigen Gleitpunktrastern, Doctoral Dissertation, University of Karlsruhe, 1987
[7] Bulirsch, R.: Bemerkungen zur Romberg-Integration, Num. Math., 6, pp. 6-16, 1964
[8] Bulirsch, R. and Stoer, J.: Asymptotic Upper and Lower Bounds for Results of Extrapolation Methods, Num. Math., 8, pp. 93-104, 1966
[9] Bulirsch, R. and Stoer, J.: Numerical Quadrature by Extrapolation, Num. Math., 9, pp. 271-278, 1967
[10] Bulirsch, R., Rutishauser, H.: Interpolation und genäherte Quadratur, in Sauer, R., Szabó, I. (Eds.): Mathematische Hilfsmittel des Ingenieurs, Springer, Berlin, 1968
[11] Corliss, G.F.: Computing Narrow Inclusions for Definite Integrals, in [20]
[12] Corliss, G.F. and Rall, L.B.: Adaptive, Self-validating Numerical Quadrature, MRC Technical Summary Report #2815, University of Wisconsin, 1985
[13] Davis, Ph.J., Rabinowitz, Ph.: Methods of Numerical Integration, Academic Press, San Diego, 1984
[14] Engels, H.: Numerical Quadrature and Cubature, Academic Press, New York, 1980
[15] Fischer, H.C.: Bounds for an Interval Polynomial, ESPRIT-DIAMOND-Report, Doc. No. 03/2b-3/1/K02.f, 1988
[16] Fischer, H.C.: Schnelle automatische Differentiation, Einschließungsmethoden und Anwendungen, Doctoral Dissertation, University of Karlsruhe, 1990
[17] Hearn, A.C.: REDUCE User's Manual, The Rand Corporation, 1983
[18] Heuser, H.: Lehrbuch der Analysis, Teil 1 und 2, Teubner, Stuttgart, 1989
[19] Kaucher, E., Miranker, W.L.: Self-validating Numerics for Function Space Problems, Academic Press, New York, 1984
[20] Kaucher, E., Kulisch, U., Ullrich, Ch. (Eds.): Computerarithmetic, Scientific Computation and Programming Languages, B.G. Teubner, Stuttgart, 1987
[21] Kelch, R.: Quadrature, ESPRIT-DIAMOND-Report, Doc. No. 03/3-9/1/K1.f, 1988
[22] Kelch, R.: Ein adaptives Verfahren zur Numerischen Quadratur mit automatischer Ergebnisverifikation, Doctoral Dissertation, University of Karlsruhe, 1989
[23] Klatte, R., Kulisch, U., Neaga, M., Ratz, D., Ullrich, Ch.: PASCAL-XSC, Sprachbeschreibung mit Beispielen, Springer, Berlin, 1991
[24] Kulisch, U.W.: Grundlagen des Numerischen Rechnens, B.I., Mannheim, 1976
[25] Kulisch, U.W., Miranker, W.L.: Computer Arithmetic in Theory and Practice, Academic Press, New York, 1981
[26] Kulisch, U.W., Miranker, W.L. (Eds.): A New Approach to Scientific Computation, Academic Press, New York, 1983
[27] Locher, F.: Einführung in die numerische Mathematik, Wissensch. Buchges., Darmstadt, 1978
[28] Lohner, R.: Einschließung der Lösung gewöhnlicher Anfangs- und Randwertaufgaben und Anwendungen, Doctoral Dissertation, University of Karlsruhe, 1988
[29] Lyness, J.N. and McHugh, B.J.J.: On the Remainder Term in the N-Dimensional Euler Maclaurin Expansion, Num. Math., 15, pp. 333-344, 1970
[30] Moore, R.E.: Interval Analysis, Prentice Hall, Englewood Cliffs, New Jersey, 1966
[31] Neumann, H.: Über Fehlerabschätzungen zum Rombergverfahren, ZAMM, 46 (1966), pp. 152-153
[32] Rall, L.B.: Differentiation and Generation of Taylor Coefficients in PASCAL-SC, in [25]
[33] Rall, L.B.: Automatic Differentiation: Techniques and Applications, Lecture Notes in Computer Science, No. 120, Springer, Berlin, 1981
[34] Rall, L.B.: Optimal Implementation of Differentiation Arithmetic, in [20]
[35] Schmidt, J.W.: Asymptotische Einschließung bei konvergenzbeschleunigenden Verfahren, Num. Math., 8, pp. 105-113, 1966
[36] Stiefel, E.: Altes und Neues über numerische Quadratur, ZAMM, 41, 1961
[37] Stoer, J.: Einführung in die Numerische Mathematik I, Springer, Berlin, 1979
[38] Storck, U.: Verifizierte Kubatur durch Extrapolation, Diploma Thesis, Institut für Angewandte Mathematik, University of Karlsruhe, 1990
[39] Stroud, A.H.: Error Estimates for Romberg Quadrature, SIAM J. Numer. Anal., Vol. 2, No. 3, 1965
[40] Stroud, A.H.: Numerical Quadrature and Solution of Ordinary Differential Equations, Springer, New York, 1974
[41] Stroud, A.H.: Approximate Calculation of Multiple Integrals, Prentice Hall, New York, 1971
[42] Wilf, H.S.: Numerische Quadratur, in: Ralston, A., Wilf, H.S.: Mathematische Methoden für Digitalrechner, Bd. 2, Oldenbourg Verlag, München, 1969
[43] Wylie, C.R., Barrett, L.C.: Advanced Engineering Mathematics, McGraw-Hill, 1982, pp. 265-266
Numerical Integration in Two Dimensions with Automatic Result Verification Ulrike Storck
For calculating an enclosure of two-dimensional integrals, two different methods with automatic result verification have been developed. Both procedures are based on Romberg extrapolation. They determine an enclosure of an approximation of the integral and an enclosure of the corresponding remainder term using interval arithmetic. In both algorithms, the quality of the remainder term chiefly determines the error of the result, i.e. the width of the enclosure of the integral. We therefore examine in detail the representations of the remainder terms in dependence on the chosen step size sequences.
1 Introduction

1.1 Motivation
In scientific and engineering problems, the values of multi-dimensional integrals are frequently needed. There are many different methods for numerical integration, especially in one and two dimensions (see [5], [6], [15]). Particularly in the two-dimensional case, however, the remainder term, assuming it is taken into account, is not given in a form suitable for numerical computation. In addition, the round-off errors are rarely taken into account. Therefore, a reliable statement about the accuracy is not possible in general, and the numerical results are often doubtful. In order to obtain an error estimate for a numerical result, two methods for calculating integrals of the form

(1) $J = \int_a^b \int_c^d f(x_1, x_2)\, dx_2\, dx_1$

with automatic result verification are presented in this paper. We will call these procedures the single Romberg extrapolation and the double Romberg extrapolation, respectively.
1.2 Foundations of Numerical Computation
To introduce the two methods, we start with the representation of the integral J in (1) by

J = T + R

with T denoting the approximation and R the remainder term. In both procedures the remainder terms depend on unknown points $(\xi_1, \xi_2) \in [a,b] \times [c,d]$; consequently, a direct evaluation of R is impossible. Therefore, we replace $(\xi_1, \xi_2)$ by $[a,b] \times [c,d]$ and obtain an enclosure $\bar{R}$ of R, which yields $J \in T + \bar{R}$. However, if the calculation is executed in a floating-point system, the round-off errors must be taken into account. There follow now some important definitions (see [10]):
A floating-point system is defined by

$S = S(b, l, e_1, e_2) := \{0 := 0 \cdot b^{e_1}\} \cup \{x = *\,m \cdot b^e \mid * \in \{+,-\},\ b \in \mathbb{N},\ b \ge 2,\ e_1, e_2, e \in \mathbb{Z},\ e_1 \le e \le e_2,\ m = \sum_{i=1}^{l} x[i] \cdot b^{-i},\ x[i] \in \{0, 1, \dots, b-1\}$ for $i = 1(1)l$, $x[1] \neq 0\}$.
Here b is called the base, l the length of the mantissa, $e_1$ the minimal and $e_2$ the maximal exponent. We have $S(b, l, e_1, e_2) \subset \mathbb{R}$. For calculations by means of a computer, a rounding $\Box: \mathbb{R} \to S$ with

$\Box x = x$ for all $x \in S$

is required. The following roundings are very important in practice:

the monotone downwardly directed rounding $\nabla$ with

$\nabla x := \max\{y \in S \mid y \le x\}$, $x \in \mathbb{R}$,

the monotone upwardly directed rounding $\Delta$ with

$\Delta x := \min\{y \in S \mid y \ge x\}$, $x \in \mathbb{R}$,
the interval rounding $\Diamond$ (where $I\mathbb{R}$ denotes the set of intervals over $\mathbb{R}$) with

$\Diamond X := \bigl[\nabla(\min_{x \in X} x),\ \Delta(\max_{x \in X} x)\bigr]$, $X \in I\mathbb{R}$.
For an arithmetic operation $\circ \in \{+, -, \cdot, /\}$, the corresponding floating-point operation is defined by $x \boxed{\circ}\, y := \Box(x \circ y)$, and the corresponding interval operation is defined by $X \Diamond\!\circ Y := \Diamond(X \circ Y)$.
Furthermore, we assume that all floating-point operations $+, -, \cdot, /$ are of maximum accuracy and that we can use an exact scalar product (see [3], [10]) which, using a long accumulator, determines error-free values of expressions of the form

$A = \sum_{i=1}^{n} x_i \cdot y_i$ with $x_i, y_i \in S$.

The result of the exact scalar product is rounded by one of the roundings $\nabla, \Delta, \Diamond$, which is indicated by a rounding symbol $\Box \in \{\nabla, \Delta, \Diamond\}$ in front of the sum; the scalar product is then called an accurate scalar product. For the rounding $\Box$ we have

$\Box A = \Box\Bigl(\sum_{i=1}^{n} x_i \cdot y_i\Bigr)$,

i.e. the exact sum is rounded only once.
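The effect of the accurate scalar product can be imitated with exact rational arithmetic (an illustration only; the mechanism of [3], [10] is a fixed-point long accumulator, not rationals). Every binary floating-point number converts to a rational without error, so the sum below is formed exactly and rounded a single time.

```python
from fractions import Fraction

def dot_exact(xs, ys):
    """Exact sum of products, rounded a single time at the end."""
    acc = sum(Fraction(x) * Fraction(y) for x, y in zip(xs, ys))
    return float(acc)

def dot_naive(xs, ys):
    return sum(x * y for x, y in zip(xs, ys))

# Massive cancellation: the naive sum loses the small contribution entirely,
# while the exact scalar product returns 0.5.
xs = [1e16, 1.0, -1e16]
ys = [1.0, 0.5, 1.0]
```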
In both extrapolation methods, interval arithmetic and the accurate scalar product are used for calculating an enclosure of the approximations, denoted by $\Diamond T$, and for calculating an enclosure of the remainder terms, denoted by $\Diamond R$; we have

$J \in \Diamond J := \Diamond T + \Diamond R$.

1.3
Romberg Extrapolation
We will now discuss the principle of Romberg extrapolation. Assuming $f(\vec{x})$ is integrable in $[a,b] \times [c,d]$, we get an approximation of the integral J in (1) by applying the trapezoidal rule

(2) $T(h_1, h_2) = (b-a)(d-c) \cdot h_1 h_2 \cdot \sum_{k=0}^{n_1}{}'' \sum_{l=0}^{n_2}{}''\, f\bigl(a + k \cdot (b-a) h_1,\ c + l \cdot (d-c) h_2\bigr)$, $h_1 = \tfrac{1}{n_1}$, $h_2 = \tfrac{1}{n_2}$;

it follows that

$\lim_{h_1, h_2 \to 0} T(h_1, h_2) = J$.
Here, the double prime next to the summation symbol indicates that the first and the last summand are multiplied by the factor $\frac{1}{2}$. In order to obtain an optimal
approximation of J, we have to choose $h_1$ and $h_2$ as close to 0 as possible. However, the round-off errors increase with decreasing $h_1, h_2$; moreover, very small $h_1, h_2$ yield a large number of function evaluations costing a lot of computing time. Therefore we cannot use very small $h_1, h_2$ for our computation. Instead, we obtain an approximation of $T(0,0)$ by use of Romberg extrapolation. First of all, we introduce two arbitrary step size sequences
(3) $\mathcal{F}_1 = \{h_{10}, h_{11}, h_{12}, \dots\}$, $\mathcal{F}_2 = \{h_{20}, h_{21}, h_{22}, \dots\}$ with $h_{1i} = \tfrac{1}{n_{1i}}$, $h_{2j} = \tfrac{1}{n_{2j}}$, $n_{1i}, n_{2j} \in \mathbb{N}$;

this yields the corresponding trapezoidal sum

(4) $T(h_{1i}, h_{2j})$ according to (2).
Now the values T ( h ; )are used for the Neville-Aitken-algorithm (see [14]) in order to extrapolate the values of T ( h )for h = 0. This method is called the single Romberg extrapolation. The second method, denoted as the double Romberg extrapolation, is based on two different Romberg extrapolations for the z1- and 2 2 - direction. This means that the two extrapolations are independent of each other.
2 The Single Romberg Extrapolation
In accordance with the last section, we have
(6) $T(h_i) = (b-a) \cdot (d-c) \cdot h_i^2 \cdot \sum_{k=0}^{n_i}{}'' \sum_{l=0}^{n_i}{}''\, f\bigl(a + k \cdot (b-a) h_i,\; c + l \cdot (d-c) h_i\bigr)$

with $h_i = \frac{1}{n_i}$ and $n_i \in \mathbb{N}$. According to the Neville-Aitken algorithm, we obtain the following recursion:

(7) $T_{i0} := T(h_i)$, $0 \le i \le m$;
$T_{ik} := T_{i,k-1} + \dfrac{T_{i,k-1} - T_{i-1,k-1}}{(h_{i-k}/h_i)^2 - 1}$, $1 \le k \le i \le m$,

with arbitrary m, and we obtain the following (triangular) T-table:
The classical algorithm determines the values $T_{ik}$ by means of their recursive definition (7). For the computation of $T_{ik}$, the values $T_{i,k-1}$, $T_{i-1,k-1}$ are required; therefore it is necessary to determine all T-table elements to the left of and above $T_{ik}$. Now, we will consider the single Romberg extrapolation. Our first problem is the representation of the remainder term. For this we need the two-dimensional Euler-Maclaurin summation formula.
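The recursion (7) can be sketched in ordinary floating point for the step size sequence $h_i = 2^{-i}$ (the verified version of this chapter additionally encloses the trapezoidal sums and the remainder term with interval arithmetic and the accurate scalar product; the function names here are illustrative):

```python
def trap2d(f, a, b, c, d, n):
    """Two-dimensional trapezoidal sum with n subintervals per direction."""
    h1, h2 = (b - a) / n, (d - c) / n
    total = 0.0
    for k in range(n + 1):
        wk = 0.5 if k in (0, n) else 1.0      # the double-prime convention
        for l in range(n + 1):
            wl = 0.5 if l in (0, n) else 1.0
            total += wk * wl * f(a + k * h1, c + l * h2)
    return h1 * h2 * total

def romberg2d(f, a, b, c, d, m):
    """Neville-Aitken extrapolation of T(h_i), h_i = 2**-i, to h = 0."""
    h = [2.0 ** -i for i in range(m + 1)]
    T = [trap2d(f, a, b, c, d, 2 ** i) for i in range(m + 1)]
    for k in range(1, m + 1):                 # columns of the T-table
        for i in range(m, k - 1, -1):         # update in place, bottom-up
            q = (h[i - k] / h[i]) ** 2
            T[i] = T[i] + (T[i] - T[i - 1]) / (q - 1.0)
    return T[m]

# The trapezoidal rule is already exact for f(x, y) = x*y, so the T-table
# is constant and the extrapolated value equals 1/4.
J = romberg2d(lambda x, y: x * y, 0.0, 1.0, 0.0, 1.0, 3)
```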
2.1 The Two-dimensional Euler-Maclaurin Summation Formula
For the Euler-Maclaurin summation formula we need the definition of the Bernoulli polynomials.

Definition 2.1
1. $B_0(x) := 1$
2. $B_k'(x) := k \cdot B_{k-1}(x)$, $k \ge 1$
3. $\int_0^1 B_k(x)\, dx = 0$, $k \ge 1$
Hence it follows that

(9) $B_k(x) = B_k(0) + k \cdot \int_0^x B_{k-1}(t)\, dt$,

and we get the following properties of the Bernoulli polynomials:

(10) $B_k(x)$ is a polynomial of degree k;
$B_k(0) = B_k(1)$, $k \ge 2$;
$B_k\bigl(\tfrac{1}{2} + x\bigr) = (-1)^k \cdot B_k\bigl(\tfrac{1}{2} - x\bigr)$, $k \ge 0$;
$B_{2k+1}(0) = B_{2k+1}(1) = 0$, $k \ge 1$.
The values

(11) $B_k := (-1)^{k+1} \cdot B_{2k}(0)$, $k \ge 1$,

are called the Bernoulli numbers; they can be found in [1]. We note (see [8]):
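Definition 2.1 determines the Bernoulli polynomials uniquely, which can be verified with exact rational arithmetic (the helper names below are illustrative, not part of the chapter):

```python
from fractions import Fraction

def bernoulli_poly(k):
    """Coefficients of B_k, lowest degree first, from Definition 2.1."""
    coeffs = [Fraction(1)]                              # B_0(x) = 1
    for j in range(1, k + 1):
        # integrate j * B_{j-1} termwise ...
        body = [Fraction(0)] + [j * c / (i + 1) for i, c in enumerate(coeffs)]
        # ... and fix the constant term via integral_0^1 B_j(x) dx = 0
        body[0] = -sum(c / (i + 1) for i, c in enumerate(body))
        coeffs = body
    return coeffs

def bernoulli_number(k):
    """B_k := (-1)**(k+1) * B_{2k}(0) as in (11)."""
    return (-1) ** (k + 1) * bernoulli_poly(2 * k)[0]

# B_2(x) = x^2 - x + 1/6, and the first Bernoulli numbers of (11) are
# B_1 = 1/6, B_2 = 1/30, B_3 = 1/42.
```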
Moreover, we define the following functions with period 1 according to [7]:

(12) $S_k(x) := B_k(x - [x])$,

where $[x]$ is the largest integer less than or equal to x. We obtain the following equations by application of (10) and (12):

(13) $B_k(0) = S_k(0) = S_k(i)$, $k \ge 1$, $i \ge 1$;
$B_{2k+1}(0) = S_{2k+1}(0) = 0$, $k \ge 1$;
$\int_0^1 S_{2k}(t)\, dt = 0$, $k \ge 1$;
$S_{2k}(0) = (-1)^{k+1} \cdot B_k$, $k \ge 1$.
To reduce the complexity of some complicated expressions, the following operators are introduced:
Definition 2.2
We are now prepared for the Euler-Maclaurin summation formula:
Theorem 2.1: Let f be the integrand of J from (1) with $f \in C^{2m+2,\,2m+2}[a,b] \times [c,d]$; then the trapezoidal sum from (2) with $m \ge 1$ has the expansion

1. $T(h_1, h_2) = J + \sum_{s=1}^{m} a_s(h_1, h_2) + r_{m+1}(h_1, h_2)$

with

2. $a_s(h_1, h_2) = \int_a^b\!\!\int_c^d \sum_{j=0}^{s} \delta_1^{2j}\, \delta_2^{2s-2j}\, f(\vec{x})\, d\vec{x}$

3. $r_{m+1}(h_1, h_2) = \int_a^b\!\!\int_c^d \bigl\{ D^{(1)}_{m+1} + D^{(2)}_{m+1} - D^{(1,2)}_{m+1} \bigr\}\, f(\vec{x})\, d\vec{x}$ for m even,

where $\delta_i$ and $D^{(\cdot)}_{m+1}$ are the operators of Definition 2.2 (an analogous representation holds for m odd).
A detailed proof of this theorem is given in [15]. Furthermore, the two-dimensional Euler-Maclaurin summation formula is referred to in [4], [11], [13]. Now, we will have a closer look at the remainder term $r_{m+1}(h_1, h_2)$. Since the functions $S_j$, $j \in \mathbb{N}$, are continuous and periodic, we can conclude that there exists an upper bound C, which is independent of $h_1, h_2$, with

$\int_a^b\!\!\int_c^d \Bigl( S_j\Bigl(\tfrac{x_1 - a}{(b-a)h_1}\Bigr) - S_j(0) \Bigr)\, d\vec{x} \le C$,

and analogously for the $x_2$-direction.
Thus, for $r_{m+1}(h_1, h_2)$, we can find an upper bound which is independent of $h_1, h_2$, and we can observe that $r_{m+1}(h_1, h_2)$ is a bounded function of $h_1, h_2$ for all $h_1 = \frac{1}{n_1}$, $h_2 = \frac{1}{n_2}$.
In [7] it is proved that the expressions S_{2j+2}(x_i/h_i) − S_{2j+2}(0), j ∈ N, i = 1, 2, have no change of sign; hence, the extended mean value theorem can be applied with respect to part 3 of Theorem 2.1. Moreover, using (11) and (13) leads to

(14) ∫_a^b (S_{2j}(x₁/h₁) − S_{2j}(0)) dx₁ = (−1)^j · (b−a) · B_j,
     ∫_c^d (S_{2j}(x₂/h₂) − S_{2j}(0)) dx₂ = (−1)^j · (d−c) · B_j,  j ≥ 1.
Hence, we have:

Corollary 2.1 The remainder term of part 3 of Theorem 2.1 can be represented by

r_{m+1}(h₁,h₂) = (b−a)(d−c) · (−1)^{m+1} · Σ_{j=0}^{m+1} h₁^{2j} · h₂^{2m+2−2j} · B_j · B_{m+1−j} · f^{(2j,2m+2−2j)}(ξ_j)

with ξ_j ∈ [a,b] × [c,d], m ∈ N and j = 0(1)m+1.
2.2 The Remainder Term

According to (7) we have T_{i0} := T(h_i), 0 ≤ i ≤ m, with arbitrary m. If, by interpolation, we determine the polynomial F_mm(h) with

(16) F_mm(h) = Σ_{j=0}^{m} a_j · h^{2j} and F_mm(h_i) = T(h_i), i = 0(1)m,

then there follows

(17) F_mm(0) = T_mm.
For a closer look at F_mm(h), we apply the Lagrange interpolation formula and obtain

(18) F_mm(0) = Σ_{i=0}^{m} c_{mi} · T_{i0} with c_{mi} = Π_{k=0, k≠i}^{m} h_k² / (h_k² − h_i²).
In [7] it is proved that

(19) Σ_{i=0}^{m} c_{mi} · h_{1i}^{2j} · h_{2i}^{2k} =
     1 for j = k = 0,
     0 for j + k = 1(1)m,
     (b−a)^{2j} · (d−c)^{2k} · (−1)^m / Π_{i=0}^{m} n_i² for j + k = m + 1.
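Properties (19) of the coefficients c_{mi} are easy to verify in exact arithmetic. The following Python sketch (our naming, not the paper's) checks them for the Romberg sequence:

```python
from fractions import Fraction

def extrapolation_coeffs(h):
    """c_mi = prod_{k != i} h_k^2 / (h_k^2 - h_i^2), cf. (18)."""
    m = len(h) - 1
    c = []
    for i in range(m + 1):
        prod = Fraction(1)
        for k in range(m + 1):
            if k != i:
                prod *= h[k]**2 / (h[k]**2 - h[i]**2)
        c.append(prod)
    return c

# Romberg sequence n_i = 2^i; h_{1i} = (b-a)/n_i, h_{2i} = (d-c)/n_i
m = 4
n = [Fraction(2)**i for i in range(m + 1)]
ba, dc = Fraction(3), Fraction(2)              # sample values of b-a and d-c
c = extrapolation_coeffs([1 / ni for ni in n])

def moment(j, k):
    """Left-hand side of (19) for the index pair (j, k)."""
    return sum(ci * (ba / ni)**(2 * j) * (dc / ni)**(2 * k) for ci, ni in zip(c, n))
```

moment(0,0) returns 1, the mixed moments up to order m vanish, and for j + k = m + 1 the product formula of (19) appears; the alternating signs c_{mi} = (−1)^{m+i}·|c_{mi}| used later can be checked as well.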
Employment of the expansion for T(h_i) from Theorem 2.1 in (18) yields the following representation:

(20) T_mm = Σ_{i=0}^{m} c_{mi} · ( J + Σ_{s=1}^{m} a_s(h_{1i},h_{2i}) + r_{m+1}(h_{1i},h_{2i}) ).

Now, considering (19) and using the representation of a_s(h_{1i},h_{2i}) in part 2 of Theorem 2.1, we get

(21) Σ_{i=0}^{m} c_{mi} · a_s(h_{1i},h_{2i}) = 0, s = 1(1)m.

Moreover, with (19) we have

(22) Σ_{i=0}^{m} c_{mi} · J = J,

and, applying (21) and (22) to (20), there follows

(23) T_mm − J = Σ_{i=0}^{m} c_{mi} · r_{m+1}(h_{1i},h_{2i}).

Finally, by means of Corollary 2.1, we get

(24) T_mm − J = Σ_{i=0}^{m} c_{mi} · (b−a)(d−c) · (−1)^{m+1} · Σ_{j=0}^{m+1} h_{1i}^{2j} · h_{2i}^{2m+2−2j} · B_j · B_{m+1−j} · f^{(2j,2m+2−2j)}(ξ_j^i)

with m ≥ 1 and ξ_j^i ∈ [a,b] × [c,d].
Note that the values ξ_j^i with fixed j and variable i are, in general, not identical. Therefore, the derivatives cannot be written in front of the first summation symbol. In order to obtain an enclosure of the remainder term, the substitution of [a,b] × [c,d] for the ξ_j^i is necessary. Because of the subdistributivity of interval arithmetic, the derivatives have to remain under the first summation symbol. However, under certain assumptions, we can find a representation of T_mm which is better than the one by means of (23) and (24). For this purpose, the sum in (23) is written behind the two integration symbols contained in the expression for r_{m+1} of Theorem 2.1; the summation symbol originating from r_{m+1} is written in front of the two integration symbols. From (11) and Definition 2.2 we obtain a representation of the form

(25) T_mm − J = Σ_{(s,t)} ∫_a^b ∫_c^d K_{2s,2t}(x̃) · f^{(2s,2t)}(x̃) dx̃
with kernel functions K_{2s,2t} built from the c_{mi}, the step sizes, and the periodic functions S_j. If we can show that the expression K_{2s,2t}(x̃) does not change its sign in [a,b] × [c,d] (see Section 2.4), then, according to the extended mean value theorem, the corresponding derivative of f can be placed in front of the integration symbols, and we have the representation (26) with ξ ∈ [a,b] × [c,d]. Substitution of (19) and (14) in (26) leads to (27). Analogously, we can derive (28), with ξ_{m+1−j} ∈ [a,b] × [c,d], and the corresponding relation with ξ ∈ [a,b] × [c,d]. If none of the K_{2s,2t} in (25) change their sign, then (25) can be transformed accordingly, with ξ_j ∈ [a,b] × [c,d], j = 0(1)m+1, m ≥ 1.
We now assume that there exists an expression K_{2s,2t} for which we cannot prove that its sign is constant. Then the following transformations are possible: With (14) and the extended mean value theorem we get the representation

(31) with ξ_i ∈ [a,b] × [c,d], i = 0(1)m.

Substitution of ([a,b], [c,d]) for the ξ_i leads to (32). The signs of the values c_{mi} alternate, and therefore the derivative of f cannot be written in front of the sum. For K_{2j,0}(x̃) and K_{2j,2k}(x̃) we can show analogously the corresponding representations (33) and (34). Thus, it is possible to apply (32), (33) and (34) if K_{2s,2t}(x̃) changes its sign in [a,b] × [c,d]. Using (32), (33) and (34) for all K_{2s,2t} occurring in (25), we obtain the final enclosure representation of T_mm − J, valid for m ≥ 1.
2.3 Convergence of the Single Romberg Extrapolation

Analogously to the one-dimensional Romberg extrapolation (see [4], [7], [8]) we find the following theorem.

Theorem 2.2 If the inequalities (h_{i+1}/h_i)² ≤ α < 1 hold for all i ∈ N, then

lim_{m→∞} T_{m0} = lim_{m→∞} T_{mm} = T(0) = J.

Proof: In [4] it is proved that

1. lim_{m→∞} c_{mi} = 0 for fixed i,
2. Σ_{i=0}^{m} c_{mi} = 1 for all m,

and by application of the theorem of Toeplitz (see [9]), Theorem 2.2 follows. □
2.4 The Modified Romberg Extrapolation
In this subsection, we will examine more closely the approximations T_ik defined by (7), and we will derive a favorable method for calculating these approximations.
Since the diagonal elements T_mm are the best approximating elements of the T-table (8), we consider mainly T_mm and the corresponding remainder terms R_mm, defined by

R_mm := J − T_mm.

Here, the remainder term can be expressed by employing the results of Subsection 2.2. Employment of the interval operations for the operations in ℝ leads to

◊J = ◊T_mm + ◊R_mm.

In order to obtain a small total error of the result ◊J, i.e. a small width of ◊J, we should find suitable procedures for the determination of ◊T_mm and ◊R_mm. Therefore, we have to avoid procedures which may produce interval inflation. However, we need to perform many interval operations for the recursive evaluation of T_mm because of (15) and, in addition, some weights change their signs within a row of the T-table (an example is given in [5]). Thus, both factors can lead to an inflation of the result ◊T_mm. In order to avoid this, we calculate ◊T_mm directly by employing a scalar product which consists of a weighted sum of function values of the integrand f. For calculating this scalar product, the weights are required. We now deal with the determination of the weights for some step size sequences fulfilling the conditions of Theorem 2.2, which implies convergence of the approximations.
The Approximation

First of all, we establish by means of (6) and (7) that every T_ik is a linear combination of function values at the nodes. Furthermore, an instruction for easily calculating the weights can be given. The first step size sequence to be considered now is the Romberg sequence, defined by

Definition 2.3 The Romberg sequence F_R is given by

(35) n_i := 2^i, i ≥ 0.
We thus obtain
Theorem 2.3 For the Romberg sequence with T_ik defined by (7), the following properties are valid: Every T_ik can be represented by

(36) T_ik = h_{10} · h_{20} · Σ_{j=0}^{n_i} Σ_{l=0}^{n_i} w_{ikjl} · f(x_{1ij}, x_{2il}), 0 ≤ k ≤ i,

with

x_{1ij} = a + j · (b−a)/n_i, x_{2il} = c + l · (d−c)/n_i, j, l = 0(1)n_i, h_{10} = b − a, h_{20} = d − c.

For all i, k, we have:

(37) Σ_{j=0}^{n_i} Σ_{l=0}^{n_i} w_{ikjl} = 1.

The weights w_{ikjl} can be calculated for k = 0 by:

(38) w_{i0jl} =
     (2n_i)^{−2} for j, l = 0, n_i,
     2^{−1} · n_i^{−2} for (j = 0, n_i and l = 1(1)n_i − 1) or (l = 0, n_i and j = 1(1)n_i − 1),
     n_i^{−2} for j = 1(1)n_i − 1 and l = 1(1)n_i − 1,

and for k ≥ 1 by the recursion

w_{i,k,2j,2l} = (4^k · w_{i,k−1,2j,2l} − w_{i−1,k−1,j,l}) / (4^k − 1), j, l = 0(1)n_{i−1},
w_{i,k,2j,2l−1} = 4^k · w_{i,k−1,2j,2l−1} / (4^k − 1), j = 0(1)n_{i−1}, l = 1(1)n_{i−1},
w_{i,k,2j−1,2l} = 4^k · w_{i,k−1,2j−1,2l} / (4^k − 1), j = 1(1)n_{i−1}, l = 0(1)n_{i−1},
w_{i,k,2j−1,2l−1} = 4^k · w_{i,k−1,2j−1,2l−1} / (4^k − 1), j, l = 1(1)n_{i−1}.

The proof of this theorem is given in [15].

In order to obtain the weights of the next step size sequence, we first consider the set of all nodes belonging to a step size sequence. Let F = {h_i} be a step size sequence with h_i = (b−a)/n_i; then the node set with respect to T_i0 is defined as the grid M_i of all nodes (x_{1ij}, x_{2il}), j, l = 0(1)n_i.
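A sketch of the weight computation of Theorem 2.3 in Python (exact rationals; the identifiers are ours): the k = 0 weights are those of the product trapezoidal rule, and the recursion is T_ik = (4^k · T_{i,k−1} − T_{i−1,k−1}) / (4^k − 1) expressed node-wise, the level-(i−1) nodes being the even-indexed nodes of level i.

```python
from fractions import Fraction

def trap_weights(ni):
    """k = 0 weights (38): the 2-D product trapezoidal rule on an
    (ni+1) x (ni+1) grid, normalized so that the weights sum to 1."""
    w = [[Fraction(1, ni * ni) for _ in range(ni + 1)] for _ in range(ni + 1)]
    for j in range(ni + 1):
        for l in range(ni + 1):
            if j in (0, ni): w[j][l] /= 2
            if l in (0, ni): w[j][l] /= 2
    return w

def romberg_weights(i, k):
    """Weights of T_ik for the Romberg sequence n_i = 2^i via the recursion;
    the nodes of level i-1 are the even-indexed nodes of level i."""
    if k == 0:
        return trap_weights(2**i)
    ni = 2**i
    fine = romberg_weights(i, k - 1)
    coarse = romberg_weights(i - 1, k - 1)
    q = Fraction(4)**k
    w = [[q * fine[j][l] / (q - 1) for l in range(ni + 1)] for j in range(ni + 1)]
    for j in range(0, ni + 1, 2):
        for l in range(0, ni + 1, 2):
            w[j][l] -= coarse[j // 2][l // 2] / (q - 1)
    return w
```

On [0,1]² the weights of T_22 already integrate f(x₁,x₂) = x₁²·x₂² exactly, since the extrapolation removes the h² and h⁴ error terms.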
Now, we deal with the Bulirsch sequence:

Definition 2.4 The Bulirsch sequence F_B is given by

(41) n_i := 2 for i = 0, n_i := 2^{(i+2)/2} for even i ≥ 2, n_i := 3 · 2^{(i−1)/2} for odd i.

For F_B, we obtain immediately

n_0 = 2, n_1 = 3, n_i = 2 · n_{i−2}, i ≥ 2,

and

(42) M_i = M_{i−2} + N_i, i ≥ 2,

where N_i denotes the set of new nodes of level i. In (42), the symbol '+' indicates that the two sets are disjoint. Furthermore, (42) implies that all nodes needed for the determination of T_{0,0}, T_{1,0}, ..., T_{i−2,0} are also required for the calculation of T_{i−1,0} and T_{i,0}, and we have
Theorem 2.4 For the Bulirsch sequence F_B with n_{−1} := 0 the following properties are valid: Every T_ik can be represented by

(43) T_ik = h_{10} · h_{20} · ( Σ_{j=0}^{n_i} Σ_{l=0}^{n_i} w_{ikjl} · f-values on the grid with n_i subintervals + Σ_{j=0}^{n_{i−1}} Σ_{l=0}^{n_{i−1}} v_{ikjl} · f-values on the grid with n_{i−1} subintervals ),

and for all i, k, we have

(45) w_{ikjl} > 0, j, l = 0(1)n_i; v_{ikjl} ≤ 0, j, l = 0(1)n_{i−1}.

The weights w_{ikjl}, v_{ikjl} are rational and can be calculated as follows:

1. for k = 0, 0 ≤ i, by

(46) w_{i0jl} = (2n_i)^{−2} for j, l = 0, n_i; 2^{−1} · n_i^{−2} for (j = 0, n_i and l = 1(1)(n_i − 1)) or (j = 1(1)(n_i − 1) and l = 0, n_i); n_i^{−2} for j, l = 1(1)n_i − 1; and v_{i0jl} = 0, j, l = 0(1)n_{i−1};

2. for 1 ≤ i ≤ 2, 1 ≤ k ≤ i, with the coefficients α_{ik}, β_{ik} of the Neville-Aitken recursion, by

(47)
w_{i,k,j,l} = α_{ik} · w_{i,k−1,j,l}, j, l = 0(1)n_i,
v_{i,k,2j,2l} = α_{ik} · v_{i,k−1,2j,2l} + β_{ik} · (w_{i−1,k−1,2j,2l} + v_{i−1,k−1,j,l}), j, l = 0(1)[n_{i−1}/2],
v_{i,k,2j−1,2l} = α_{ik} · v_{i,k−1,2j−1,2l} + β_{ik} · w_{i−1,k−1,2j−1,2l}, j = 1(1)[n_{i−1}/2], l = 0(1)[n_{i−1}/2],
v_{i,k,2j,2l−1} = α_{ik} · v_{i,k−1,2j,2l−1} + β_{ik} · w_{i−1,k−1,2j,2l−1}, j = 0(1)[n_{i−1}/2], l = 1(1)[n_{i−1}/2],
v_{i,k,2j−1,2l−1} = α_{ik} · v_{i,k−1,2j−1,2l−1} + β_{ik} · w_{i−1,k−1,2j−1,2l−1}, j, l = 1(1)[n_{i−1}/2];

3. for i ≥ 3, 1 ≤ k ≤ i, by the analogous recursion (48).
The proof is given in [15]. Moreover, another step size sequence is discussed in [15], the decimal sequence:

Definition 2.5 The decimal sequence F_D is given by

(49) n_i := 1 for i = 0, n_i := 2 for i = 1, n_i := 2^{(i+2)/2} for even i ≥ 2, n_i := 5 · 2^{(i−3)/2} for odd i ≥ 3.
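The three sequences can be compared directly. Note that the decimal sequence below follows our reconstruction of (49) (n_i = 1, 2, 4, 5, 8, 10, 16, 20, ...) and should be checked against [15]; all three satisfy the convergence condition (h_{i+1}/h_i)² ≤ α < 1 of Theorem 2.2:

```python
from fractions import Fraction

def n_romberg(i):                 # Definition 2.3
    return 2**i

def n_bulirsch(i):                # Definition 2.4
    if i == 0:
        return 2
    return 3 * 2**((i - 1) // 2) if i % 2 else 2**((i + 2) // 2)

def n_decimal(i):                 # Definition 2.5, as reconstructed above
    if i == 0:
        return 1
    if i == 1:
        return 2
    return 5 * 2**((i - 3) // 2) if i % 2 else 2**((i + 2) // 2)

def max_ratio_sq(nseq, imax=20):
    """sup of (h_{i+1}/h_i)^2 = (n_i/n_{i+1})^2 over the first imax steps."""
    return max(Fraction(nseq(i), nseq(i + 1))**2 for i in range(imax))
```

For F_R the bound is α = 1/4, for F_B it is α = 9/16 (attained by the ratio 3/4), and the decimal sequence also stays strictly below 1.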
For the decimal sequence, we can find statements similar to the ones in Theorem 2.4. The last two theorems show that the weights can be calculated exactly by using rational arithmetic, and that they are independent of the integrand. Thus, the weights are calculated only once. Subsequently, for each step size sequence, we determine the common denominator H_{N_m} of all weights required for the determination of T_mm and calculate the corresponding numerators. The numerators are then stored in a matrix. For example, we get

W_mm := (G_{mmjl}),

with G_{mmjl} for 0 ≤ j, l ≤ n_m denoting the numerators. Analogously, we obtain V_mm. Moreover, the enclosures of the function values of the integrand at the nodes, which are needed for ◊T_mm, are calculated and stored in matrices:
F_m := [ ◊f(x_{1m0}, x_{2m0})     ◊f(x_{1m0}, x_{2m1})     ...  ◊f(x_{1m0}, x_{2mn_m})
         ...                                                    ...
         ◊f(x_{1mn_m}, x_{2m0})   ◊f(x_{1mn_m}, x_{2m1})   ...  ◊f(x_{1mn_m}, x_{2mn_m}) ].

Finally, ◊T_mm is determined for F_R by the scalar product (50), and for F_B (and F_D) there follows, by application of (45), the corresponding representation (51).

Since in practice we only use the approximations T_mm for m ≤ 5, the weight matrices W_mm, V_mm are calculated for m ≤ 5. For the evaluation of ◊T_mm, it is necessary to determine the matrices F_m, F_{m−1} and to compute the scalar products. Then ◊D_m is calculated and multiplied with the result of the scalar product.
The Remainder Term

Now, we start to consider the remainder term R_mm. The main problem for the determination of R_mm lies in the calculation of the derivatives f^{(2j,2k+2−2j)}([a,b], [c,d]). Since we want to obtain a verified result, an approximating procedure for the determination of derivatives is not suitable. Furthermore, symbolic differentiation involves too many operations. Therefore, we use automatic differentiation for calculating the derivatives. The advantages of automatic differentiation are, first, that interval vectors can be chosen as arguments and, second, that verified enclosures of the Taylor coefficients can be determined. Here the Taylor coefficient of a function f is defined by

(f)_{s,t} := (1/(s! · t!)) · ∂^{s+t} f / (∂x₁^s ∂x₂^t) (a, b)

with a, b ∈ ℝ or a, b ∈ Iℝ and s, t ≥ 0. The Taylor coefficients are calculated recursively, as shown in the following two-dimensional differentiation algorithms for the basic operations {+, −, ·, /}: Let u, v be functions of C^{m₁,m₂}(G), G ⊂ ℝ², k₁, k₂ ≥ 0; then

(u ± v)_{k₁,k₂} = (u)_{k₁,k₂} ± (v)_{k₁,k₂},
(u · v)_{k₁,k₂} = Σ_{i=0}^{k₁} Σ_{j=0}^{k₂} (u)_{i,j} · (v)_{k₁−i,k₂−j},
(u / v)_{k₁,k₂} = (1/(v)_{0,0}) · ( (u)_{k₁,k₂} − Σ_{(i,j)≠(0,0)} (v)_{i,j} · (u/v)_{k₁−i,k₂−j} ).

For the constants c₁, c₂, we have (c)_{0,0} = c and (c)_{k₁,k₂} = 0 otherwise; for the variables x₁, x₂, expanded at (a, b), we have (x₁)_{0,0} = a, (x₁)_{1,0} = 1, (x₂)_{0,0} = b, (x₂)_{0,1} = 1, and all other Taylor coefficients vanish.

Note that by applying the preceding formulas, the Taylor coefficients of arbitrary functions of C^{m₁,m₂}(G) can be determined, provided these functions are explicitly given. Further differentiation formulas are given in [7]. Since these formulas are recursive, however, it is necessary to calculate all (f)_{u,v} with 0 ≤ u ≤ s, 0 ≤ v ≤ t in order to calculate (f)_{s,t}. Upon arranging the Taylor coefficients in a matrix M = (m_{ij}) with coefficients m_{ij} = (f)_{i,j}, we have to consider the left upper triangular matrix consisting of the first 2(k+2) diagonals of M in order to determine all (f)_{2j,2k+2−2j} with j = 0(1)k+1; these coefficients are required for the remainder term. Finally, note that the Taylor coefficients (f([a,b], [c,d]))_{2j,2k+2−2j} are needed for the determination of the remainder term and, therefore, the calculation of the derivatives f^{(2j,2k+2−2j)}([a,b], [c,d]) is not necessary.
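The recursive Taylor-coefficient arithmetic can be sketched as follows (Python, with exact rationals standing in for the interval coefficients of the actual method; the function names are ours):

```python
from fractions import Fraction

def t_add(u, v):
    return [[a + b for a, b in zip(ru, rv)] for ru, rv in zip(u, v)]

def t_sub(u, v):
    return [[a - b for a, b in zip(ru, rv)] for ru, rv in zip(u, v)]

def t_mul(u, v):
    """Product rule: (u*v)_{k1,k2} = sum_{i<=k1, j<=k2} u_{i,j} * v_{k1-i,k2-j}."""
    K1, K2 = len(u), len(u[0])
    w = [[Fraction(0)] * K2 for _ in range(K1)]
    for k1 in range(K1):
        for k2 in range(K2):
            w[k1][k2] = sum(u[i][j] * v[k1 - i][k2 - j]
                            for i in range(k1 + 1) for j in range(k2 + 1))
    return w

def t_div(u, v):
    """(u/v)_{k1,k2} = (u_{k1,k2} - sum_{(i,j)!=(0,0)} v_{i,j} (u/v)_{k1-i,k2-j}) / v_{0,0}."""
    K1, K2 = len(u), len(u[0])
    w = [[Fraction(0)] * K2 for _ in range(K1)]
    for k1 in range(K1):
        for k2 in range(K2):
            s = sum(v[i][j] * w[k1 - i][k2 - j]
                    for i in range(k1 + 1) for j in range(k2 + 1) if (i, j) != (0, 0))
            w[k1][k2] = (u[k1][k2] - s) / v[0][0]
    return w
```

As a check, dividing the constant 1 by u = 1 + x₁ + x₂ (expanded at (0,0)) reproduces the coefficients (−1)^{k₁+k₂} · binom(k₁+k₂, k₁) of 1/(1+x₁+x₂).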
After having examined the approximations for F_R, F_B (and F_D) in the previous subsection, we will now deal with the remainder terms for these step size sequences:

Theorem 2.5 For F_R, F_B and F_D with 1 ≤ m ≤ 5, there holds the enclosure (52), with factors D_m, U_m, E_m depending on the step size sequence and with h_{10} = b − a, h_{20} = d − c.
Proof: The inclusion (52) is obtained by means of an application of (27) and (28), substitution of ([a,b], [c,d]) for all K_{0,2m+2−2j}, K_{2m+2−2j,0} occurring in (25), and an employment of (34) for K_{m+1,m+1} and K_{m+2,m}. However, if we wish to employ (27) and (28), we have to show that the following terms do not change their sign in [a,b] × [c,d]:

(53) the expressions K_{0,2m+2−2j}(x̃) and K_{2m+2−2j,0}(x̃) for the admissible j (whose range depends on the parity of m) and, for even m, additionally the mixed expressions such as K_{0,m+2}(x̃).

In [2] and [12] it is proved that the expressions K_{2m+2,0}(x̃) and K_{0,2m+2}(x̃) do not change their sign for F_R; in [7], this is proved for F_B and F_D for m ≤ 7. Thus, K_{2m+2,0}(x̃) and K_{0,2m+2}(x̃) have no change in sign for the three step size sequences for m ≤ 7. For the remaining K_{s,t}(x̃) occurring in (53) we develop an enclosure procedure for m ≤ 5 which determines the range of values and which can be applied to all step size sequences. For more details see [15]. □

In order to obtain an enclosure of the remainder term, we have to execute the calculations using interval operations. Provided we employ one of the three step size sequences examined before, we get an expression for ◊R_mm with C_m denoting a factor depending on the chosen step size sequence, D_{m,j} and G_m depending on the integration boundaries and the Bernoulli numbers, and with sums involving the products Π_{k=0, k≠2i} and Π_{k=0, k≠2i−1} over the step sizes.
Some of these factors need to be determined only once and can then be stored. In the execution of this computation, the Taylor coefficients are calculated, some factors are determined, and a scalar product is evaluated.
The Enclosure Algorithm

We now present the algorithm EGarant which computes a guaranteed enclosure of the integral J.

Algorithm EGarant
1. Input(f, a, b, c, d, eps, F)
2. i := 0
3. repeat
     i := i + 1;
     determine ◊R_ii for F
   until d(◊R_ii) < eps or i = 5;
   if d(◊R_ii) < eps then
     determine ◊J := ◊T_ii + ◊R_ii for F, with ◊T_ii according to (50) and (51), respectively
4. Output(◊J, i)

In addition to the function f and the bounds a, b, c, d of the integrals, the absolute error eps and the chosen step size sequence F are given by the user. It should be mentioned that the Taylor coefficients (f)_{2j,2i+2−2j} with j = 0(1)i+1, which are required for the determination of ◊R_ii, are calculated and stored in a matrix. Thus, for the calculation of ◊R_{i+1,i+1}, the stored Taylor coefficients are used for the determination of the next ones, i.e., only two new diagonals of the matrix are calculated. Furthermore, note that the remainder term which is consistent with the given error bound is determined first and, subsequently, the corresponding approximation is calculated.
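The control flow of EGarant can be sketched as follows. The two callbacks are hypothetical stand-ins for the verified routines of the paper, enclosures are modelled as (lo, hi) pairs, and a real implementation must of course use directed rounding (as PASCAL-SC does) rather than plain floats:

```python
def egarant(enclose_remainder, enclose_approximation, eps, imax=5):
    """Sketch of algorithm EGarant: increase the extrapolation order i until
    the width d(R_ii) of the remainder enclosure meets the error bound eps,
    then add the approximation enclosure T_ii.  Both callbacks are
    hypothetical stand-ins for the verified routines."""
    i = 0
    while True:
        i += 1
        r_lo, r_hi = enclose_remainder(i)
        if r_hi - r_lo < eps or i == imax:
            break
    if r_hi - r_lo >= eps:
        return None, i                       # error bound not reachable up to order imax
    t_lo, t_hi = enclose_approximation(i)
    return (t_lo + r_lo, t_hi + r_hi), i     # interval sum J = T_ii + R_ii
```

For dummy remainder enclosures of width 2 · 4^{−i}, the loop stops at the first order whose width falls below eps, exactly as in step 3 of EGarant.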
2.5 Comparison for Different Step Size Sequences

Before we compare the three step size sequences F_R, F_B, F_D, it should be pointed out that there exist various other step size sequences which fulfill the convergence conditions of Section 2.3. However, many of these step size sequences are disadvantageous for our algorithms, since they may give rise to one or more of the following problems:

- Most of the functions K_{s,t}(x̃) in (25) change their sign in [a,b] × [c,d], which leads to an inflation of the remainder term interval.
- The step size sequence causes the intersections of the node set belonging to T_i0 with the node sets belonging to T_j0 for j < i to contain only a few elements. This implies that the calculation of T_ii requires many function evaluations.
- The step size sequence converges too slowly or too quickly towards 0. In the first case, the remainder term factors C_m, sum1, sum2 converge slowly towards 0, which leads to a large number of remainder calculations if a small error eps is prescribed. In the second case, the quick convergence of the step size sequence may cause a large number of function evaluations which would not be necessary otherwise. Both cases may lead to an excessive computation time.
Now we compare the three step size sequences F_R, F_B, F_D; in accordance with [15] we obtain:

Table 1: Optimal step size sequences (for the orders m = 1–3 and m = 4, 5).

Among those presented in Table 1, the optimal step size sequence is the one which, primarily, guarantees that the global error of the result, or rather the remainder term, stays within the given error bounds and, secondarily, for which the number of function evaluations is minimal.
3 The Double Romberg Extrapolation

3.1 The Remainder Term
The double Romberg extrapolation is based on two extrapolations which are independent of each other. For two arbitrary step size sequences {h_{1i}}, {h_{2i}} we have the one-dimensional trapezoidal sums (56) T^[1](h₁, f) and T^[2](h₂, f) with

(57) h_{1i} = (b−a)/n_{1i}, h_{2i} = (d−c)/n_{2i}.

These trapezoidal summation formulas represent functions of x₂ and x₁, respectively. With

(58) J₁(x₂) := ∫_a^b f(x₁,x₂) dx₁ and J₂(x₁) := ∫_c^d f(x₁,x₂) dx₂,

the limits lim_{h₁→0} T^[1](h₁,f) = J₁(x₂) and lim_{h₂→0} T^[2](h₂,f) = J₂(x₁) exist; with J in (1) we get

J = ∫_c^d J₁(x₂) dx₂ = ∫_a^b J₂(x₁) dx₁.

Our intention is to obtain an approximation of J with a suitable remainder term. For this purpose, J₁(x₂), J₂(x₁) will be approximated and the corresponding remainder terms will be determined. Composition of these two extrapolations leads to an approximation of J and its remainder term. We now consider the expressions T^[1](h₁,f) and T^[2](h₂,f) as functions of h₁ and h₂, respectively, and extrapolate them using the Neville-Aitken algorithm for the values h₁ = 0, h₂ = 0. The corresponding recursive formulas (59) and (60)
are valid for arbitrary m, n. The terms T^[1]_{mi} and T^[2]_{ni} approximate J₁(x₂) and J₂(x₁), respectively, and thus represent functions of x₂ and x₁, respectively. Since we know that the diagonal element is the best approximating element of the T-table, we only consider T^[1]_{mm} and T^[2]_{nn}. For an arbitrary, but fixed x₂ ∈ [c,d] in (59) or x₁ ∈ [a,b] in (60), the formulas (59), (60) correspond to the recursive formula of the one-dimensional Romberg extrapolation, and, according to [7], [8], we obtain the remainder terms

(61) R^[1]_{mm}(f) := J₁ − T^[1]_{mm}(f) = ∫_a^b K(x₁) · f^{(2m+2,0)}(x̃) dx₁

with

K(x₁) = Σ_{i=0}^{m} c^[1]_{mi} · h_{1i}^{2m+2} · (S_{2m+2}(x₁/h_{1i}) − S_{2m+2}(0)) and c^[1]_{mi} = Π_{k=0, k≠i}^{m} h_{1k}² / (h_{1k}² − h_{1i}²),

and

R^[2]_{nn}(f) := J₂ − T^[2]_{nn}(f) = ∫_c^d K(x₂) · f^{(0,2n+2)}(x̃) dx₂

with

K(x₂) = Σ_{i=0}^{n} c^[2]_{ni} · h_{2i}^{2n+2} · (S_{2n+2}(x₂/h_{2i}) − S_{2n+2}(0)) and c^[2]_{ni} = Π_{k=0, k≠i}^{n} h_{2k}² / (h_{2k}² − h_{2i}²).
If K(x₁) and K(x₂) have no change of sign in [a,b] and [c,d], respectively, then by the extended mean value theorem it follows that

R^[1]_{mm}(f) = f^{(2m+2,0)}(ξ₁, x₂) · ∫_a^b K(x₁) dx₁, with ξ₁ ∈ [a,b],

and analogously for R^[2]_{nn}(f) with ξ₂ ∈ [c,d]; an employment of (14) yields

R^[1]_{mm}(f) = (−1)^{m+1} · (b−a) · B_{m+1} · Σ_{i=0}^{m} c^[1]_{mi} · h_{1i}^{2m+2} · f^{(2m+2,0)}(ξ₁, x₂),

and analogously for R^[2]_{nn}(f). In order to obtain an enclosure of R^[1]_{mm}, R^[2]_{nn}, we replace ξ₁ by [a,b] and ξ₂ by [c,d], giving the enclosures (64).
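For m = 0, the sign constancy on which (64) relies can be seen directly: S₂(t) = B₂(t − [t]) with B₂(x) = x² − x + 1/6, so S₂(t) − S₂(0) = u² − u ≤ 0 for u = t − [t]. A quick numerical check (our Python sketch):

```python
import math

def s2_shifted(t):
    """S_2(t) - S_2(0) with B_2(x) = x^2 - x + 1/6: equals u^2 - u for
    u = t - [t], which is <= 0 everywhere, so the m = 0 kernel has no
    change of sign."""
    u = t - math.floor(t)
    return u * u - u
```

The extremal value −1/4 is attained at the half-integers, corresponding to B₂(1/2) − B₂(0) = −1/4.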
In [2] and [12] it is proved for F_R that the expressions K(x₁), K(x₂) do not have a change of sign in [a,b] and [c,d], respectively. In [7], this is shown for F_B and F_D for m = 0(1)7. Therefore, if we choose two of these three step size sequences for our extrapolations, we can employ the representation (64). Now suppose that one of the functions K(x₁), K(x₂) changes its sign in [a,b] or [c,d], respectively; then we write the sums in (61) in front of the integrals. Since the expressions S_{2m+2}(x₁/h_{1i}) − S_{2m+2}(0) do not have a change of sign, we get, by use of the extended mean value theorem, the representation (65) with ξ_{1i} ∈ [a,b]; an application of (14) yields (66). If we replace the ξ_{1i} by [a,b] and the ξ_{2i} by [c,d] and use c_{mi} = (−1)^{m+i} · |c_{mi}|, it then follows that (67) holds.
Composition of the two extrapolations (59), (60) leads to an approximation of J and to the corresponding remainder term. For this it is important to know that the trapezoidal sums of (56) are linear with respect to f. The recursive formulas (59), (60) imply that every T^[p]_{mi} is a linear combination of trapezoidal sums T^[p]_{j,0}, p = 1, 2, and thus every T^[p]_{mi} is linear as well. Therefore, in accordance with (61), we obtain by use of R^[1]_{mm}, R^[2]_{nn}:

(68) R_mn := J − T^[1]_{mm}(T^[2]_{nn}(f))
         = ∫_a^b ( ∫_c^d f(x̃) dx₂ ) dx₁ − T^[1]_{mm}( ∫_c^d f(x̃) dx₂ − R^[2]_{nn}(f(x̃)) )
         = R^[1]_{mm}( ∫_c^d f(x̃) dx₂ ) + ∫_a^b R^[2]_{nn}(f(x̃)) dx₁ − R^[1]_{mm}( R^[2]_{nn}(f(x̃)) ).

An application of the extended mean value theorem leads to

(69) R_mn = (d−c) · R^[1]_{mm}(f(x₁, ξ_{1,2})) + (b−a) · R^[2]_{nn}(f(ξ_{2,1}, x₂)) − R^[1]_{mm}(R^[2]_{nn}(f(x̃))),

with ξ_{1,2} ∈ [c,d] and ξ_{2,1} ∈ [a,b]. In order to obtain an enclosure of R_mn, it is necessary to replace ξ_{1,2} by [c,d], ξ_{2,1} by [a,b] and, for R^[1]_{mm}, R^[2]_{nn}, to employ in (69) the results of (64) or (67), depending on the chosen step size sequences.
3.2 Convergence

Since the double Romberg extrapolation is based on two one-dimensional Romberg extrapolations, we first consider the convergence for the one-dimensional case and obtain, in accordance with [4]: Let {h_{1i}} be a step size sequence with (h_{1,i+1}/h_{1i})² ≤ α₁ < 1 for all i ∈ N; then, with J₁ of (58), it follows that

lim_{m→∞} T^[1]_{m0}(f) = lim_{m→∞} T^[1]_{mm}(f) = J₁(x₂).

The same holds for the step size sequence {h_{2i}}, and we obtain

Theorem 3.1 Let {h_{1i}} be a step size sequence with (h_{1,i+1}/h_{1i})² ≤ α₁ < 1 and {h_{2i}} be a step size sequence with (h_{2,i+1}/h_{2i})² ≤ α₂ < 1 for all i ∈ N₀; then it follows for J in (1) that

lim_{m,n→∞} T^[1]_{mm}(T^[2]_{nn}(f)) = J.
3.3 The Algorithm

Since for the determination of the approximations a simple recursive algorithm analogous to the single Romberg extrapolation cannot be given, we consider a method for the direct calculation of the approximations T_mn, which are defined by

(70) T_mn := T^[1]_{mm}(T^[2]_{nn}).

Then, we deal briefly with the remainder term R_mn = J − T_mn and, in analogy to the single Romberg extrapolation, an employment of interval operations leads to

◊J = ◊T_mn + ◊R_mn.

The step size sequences which we use are F_R, F_B, F_D; they fulfill the convergence criterion of the preceding subsection.
The Approximation

For the determination of the approximations T_mn we refer to the results in [7], [8] for the one-dimensional Romberg extrapolation. Let F₁ = {h_{1i}} be the step size sequence for the extrapolation in the x₁-direction and F₂ = {h_{2i}} the step size sequence for the extrapolation in the x₂-direction. Then, according to [7], we have for F_R, F_B, F_D:

(72) T^[1]_{mm}(T^[2]_{nn}(f)) = h_{10} · h_{20} · { Σ_{p=0}^{n_{1m}} w_{mmp} Σ_{q=0}^{n_{2n}} w_{nnq} · f(x_{1mp}, x_{2nq})
     + Σ_{p=0}^{n_{1m}} w_{mmp} Σ_{q=0}^{n_{2,n−1}} v_{nnq} · f(x_{1mp}, x_{2,n−1,q})
     + Σ_{p=0}^{n_{1,m−1}} v_{mmp} Σ_{q=0}^{n_{2n}} w_{nnq} · f(x_{1,m−1,p}, x_{2nq})
     + Σ_{p=0}^{n_{1,m−1}} v_{mmp} Σ_{q=0}^{n_{2,n−1}} v_{nnq} · f(x_{1,m−1,p}, x_{2,n−1,q}) }.

Here the weights w_{mmp}, v_{mmp} (and w_{nnq}, v_{nnq}) are calculated in dependency on the step size sequence for the extrapolation in the x₁-direction (and in the x₂-direction, respectively).
It should be pointed out that all weights can be calculated exactly and independently of the integrand by using rational arithmetic. Now we can calculate the products in (72) in such a way that we get one weight per function value; we can store these weights or, more precisely, the numerators and the common denominators, in analogy to the single Romberg extrapolation. Another possibility is to store the weights of the one-dimensional extrapolations and to determine the required weights during the run time, which requires the computation of one product for each weight. Subsequently, in both cases we calculate the approximant T_mn employing the accurate scalar product.
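The "one weight per function value" idea of (72) can be sketched for two Romberg sequences: compute the one-dimensional diagonal weights by the Richardson recursion (the 1-D analogue of Theorem 2.3) and form outer products (Python, exact rationals; identifiers are ours):

```python
from fractions import Fraction

def trap_w(n):
    """1-D trapezoidal weights on n subintervals, normalized so sum = 1."""
    w = [Fraction(1, n)] * (n + 1)
    w[0] /= 2
    w[-1] /= 2
    return w

def romberg_diag_w(m):
    """Weights of the 1-D diagonal Romberg element T_mm for n_i = 2^i,
    via T_ik = (4^k T_{i,k-1} - T_{i-1,k-1}) / (4^k - 1) expressed node-wise."""
    col = [trap_w(2**i) for i in range(m + 1)]      # column k = 0: the T_i0
    for k in range(1, m + 1):
        new = []
        for i in range(k, m + 1):
            q = Fraction(4)**k
            fine, coarse = col[i], col[i - 1]
            w = [q * x / (q - 1) for x in fine]
            for j, x in enumerate(coarse):
                w[2 * j] -= x / (q - 1)             # coarse nodes = even fine nodes
            new.append(w)
        col = [None] * k + new
    return col[m]

# double Romberg: one weight per node as an outer product, cf. (72)
wm, wn = romberg_diag_w(2), romberg_diag_w(3)
W = [[a * b for b in wn] for a in wm]
```

For m = 1 this reproduces Simpson's rule, and the diagonal weights of T_22 integrate x⁴ on [0,1] exactly.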
The Remainder Term

In Subsection 3.1 we have examined the representation of the remainder term for the step size sequences F_R, F_B, F_D for m ≤ 7. If we employ the values for h_{1i}, h_{2i} in (64) and, subsequently, employ (69), then by replacing ξ_{1,2} by [c,d] and ξ_{2,1} by [a,b] we get

R_mn ∈ −(b−a)^{2m+3} · (d−c) · C_m · B_{m+1} · (f([a,b], [c,d]))_{2m+2,0}
       − (b−a) · (d−c)^{2n+3} · C_n · B_{n+1} · (f([a,b], [c,d]))_{0,2n+2}
       − (b−a)^{2m+3} · (d−c)^{2n+3} · C_m · C_n · B_{m+1} · B_{n+1} · (f([a,b], [c,d]))_{2m+2,2n+2},

with constants C_m and C_n depending on the chosen step size sequences. Finally, we have to replace all real operations by interval operations and thus we obtain an enclosure of the remainder. This means that during the run time the Taylor coefficients have to be determined by use of automatic differentiation, some factors must be calculated, and the scalar product must be executed.
An Enclosure Algorithm

In order to obtain an enclosure algorithm, we have to develop a procedure for determining the indices m and n in the remainder term, which has to fulfill the prescribed error bound. Since an a priori statement about the Taylor coefficients cannot be made, we determine all remainders ◊R_mn with m + n = c = constant. Here the Taylor coefficients (f([a,b], [c,d]))_{2m+2,2n+2} with m + n = c are calculated once for the determination of all these remainder terms and, in analogy to the single Romberg extrapolation, they are stored in a triangular matrix, which implies for the subsequent step (c + 1) that only the next two diagonals of this matrix have to be determined. A subsequent comparison of the calculated enclosures of the remainder terms leads to the one with minimal width. If the enclosure of the remainder term is consistent with the given error bound, we have found the indices and the corresponding enclosure of the remainder; otherwise, the constant c is replaced by (c + 1) and the next remainders will be calculated. Applying this procedure, we obtain the following enclosure algorithm.
Algorithm ZGarant
1. Input(f, a, b, c, d, eps, F₁, F₂)
2. Determine ◊R, m, n
3. Determine ◊J = ◊T_mn + ◊R
4. Output(◊J)

Just like the algorithm EGarant, this algorithm calculates a verified result. It should be pointed out that the remainder term is determined first, followed by the computation of the corresponding approximant.
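The search for the indices m, n along the diagonals m + n = c can be sketched as follows; remainder_width is a hypothetical stand-in for the width d(◊R_mn) of the computed enclosure:

```python
def find_orders(remainder_width, eps, cmax=10):
    """Sketch of the (m, n) search in ZGarant: examine all remainders with
    m + n = c, keep the pair of minimal width, and increase c until that
    minimal width satisfies the error bound eps.  remainder_width(m, n)
    is a hypothetical stand-in for the verified remainder computation."""
    for c in range(cmax + 1):
        widths = {(m, c - m): remainder_width(m, c - m) for m in range(c + 1)}
        (m, n), w = min(widths.items(), key=lambda kv: kv[1])
        if w < eps:
            return m, n
    return None
```

With a model width that decays like the three terms of the R_mn enclosure, the search settles on a balanced pair m ≈ n, as one would expect.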
3.4 Comparison for Different Step Size Sequences

We refer to the results in [7], [8] and obtain:

m, n:             | 0, 1          | 2        | 3        | 4–7
optimal sequence  | F_R, F_B, F_D | F_R, F_D | F_R, F_B | F_B, F_D

4 Comparison of the Enclosure Algorithms

4.1 Theoretical Comparison
Both extrapolation methods begin with the determination of the enclosures of the remainder term, so that they are consistent with a given error bound. Then the corresponding approximants are calculated. Since the results are consistent with the given error bounds, the required amount of processing time can be chosen as the criterion for comparison. The amount of processing time depends mainly on two contributions. The first one is the effort required to calculate the Taylor coefficients needed for the determination of the remainder terms. However, an a priori statement about the behavior of the Taylor coefficients cannot be made. Moreover, both procedures use different Taylor coefficients in their remainder terms. The second contribution is the number of function evaluations required for calculating the approximants. In both algorithms this number depends on the remainder terms. Therefore, neither of these extrapolation methods can generally be considered superior to the other one.
4.2 Numerical Results

Now we present some numerical results obtained by use of the two algorithms EGarant and ZGarant. First, the calculations of the remainder terms and of the approximants were carried out for the whole integration domain. Then the integration ranges [a,b] and [c,d] were partitioned into parts of length 0.3, sometimes leaving a part with a length less than 0.3. For each subdomain the corresponding integrals were determined by use of both algorithms. Thus, the summation of the results of the algorithm EGarant leads to a new enclosure of the integral; this also holds for the algorithm ZGarant. If we denote the new integration bounds of an integral over a subregion by ã, b̃, c̃, d̃, then for the majority of the subdomains it follows that (b̃ − ã) = (d̃ − c̃) = constant. This implies that some terms needed for the determination of the remainder terms have to be calculated only once and can then be stored to be used in the subsequent execution of the computation. Besides, the classical recursive algorithm was implemented and its results were compared with those of the enclosure algorithms. Considering the numerical results, we come up with the following observations:

- If we compare the two extrapolation methods (with and without subdivision of the domain), we notice that the single Romberg extrapolation often needs less time than the double Romberg extrapolation. However, we can easily find counterexamples.
- A subdivision of the integration domains may result in a larger computation time; however, for critical problems a subdivision is unavoidable.
- The results of the recursive algorithm often are contained in the enclosures delivered by both enclosure algorithms. But there exist counterexamples for which the results of the classical algorithm are inconsistent with the prescribed error bound. It must be emphasized that we cannot draw any conclusions about the error of a result obtained by the recursive algorithm.
- In practice, for both extrapolation methods we observe that the processing time needed for the calculation of the remainder term is often larger than the time required for the determination of the approximants. Therefore, with increasing extrapolation order, i.e. with increasing m, or m and n, the processing time for calculating the remainder terms grows in such a way that a bisection of the integration ranges is more favorable, i.e. the adaptation described in the next section should be employed.
The computations were carried out by use of PASCAL-SC on an Atari Mega-ST4 using a 13-digit decimal floating-point arithmetic with accurate scalar product. We use the following notations:

J_L : exact enclosure of the integral J with the bounds differing only in one unit in the last place
◊J_E : verified enclosure of J obtained by the single Romberg extrapolation
◊J_Z : verified enclosure of J obtained by the double Romberg extrapolation
t_E : required computation time in seconds for the single Romberg extrapolation
t_Z : required computation time in seconds for the double Romberg extrapolation
ε_E : absolute error of the single Romberg extrapolation
ε_Z : absolute error of the double Romberg extrapolation
p : prescribed relative error
w. s. : with subdivision of the integration region

Example 1

The prescribed relative error was 1e−5; for all extrapolations the same step size sequence was chosen; we got the following results:
Table 3: Comparison of the results for Example 1 (parameter a = 1.0, 0.5, 0.2); rows: J_L = 0.169899036795..., 0.261624071882..., 0.396243207180...; below, the enclosures ◊J_E and ◊J_Z, each without and with subdivision (w. s.), and the approximations of the recursive algorithm: 0.1698990346121, 0.2616239269437, 0.3962430086000.

The enclosures marked with (*) are not consistent with the prescribed error bound; this is caused by the restricted extrapolation order. This fact demonstrates the necessity of subdivisions of the integration region. The other enclosures have an error which is less than the prescribed error. The approximations of the recursive algorithm are within the enclosures, and the underlined digits of the approximation agree with the exact solution of the integral.
The required times (in seconds) are:
I
a
11.0 10.5
I
0.2
I
Table 4: Comparison of the required time We notice that the double Romberg extrapolation requires more time than the single Romberg extrapolation. Examde 2
J = ∫₀¹ ∫₀¹ e^{x₁x₂} dx₂ dx₁
Here we use F_R for all extrapolations, and we get the following results, which agree qualitatively with the preceding observations:
[Table 5 lists, for prescribed relative errors p = 1e-03, 1e-05 and 1e-08, the enclosures ◊J_E and ◊J_Z of J ≈ 1.3179021514, together with the achieved absolute errors ε_E, ε_Z and the times t_E, t_Z; the numeric entries are not recoverable from this copy.]
Table 5: Comparison of the results for example 2
5 Further Aspects

5.1 Adaptation
Adaptation means the combination of a bisection of the integration ranges I₁ := [a, b] and I₂ := [c, d] with the possibility of increasing the extrapolation order. This is favorable in particular when the enclosures of the Taylor coefficients grow rapidly with increasing order. If the integration ranges are bisected once or several times, the approximants for each subdomain are determined; for the calculation of the remainder term we can choose between two possibilities. The first is to forgo a new calculation of the Taylor coefficients for each subdomain and, rather, to employ the previously calculated enclosures of the Taylor coefficients of the original domain. Then, for each subdomain, some factors in the remainder term grow at a smaller rate and, thus, the remainder term enclosures are tighter. The other possibility is to calculate the Taylor coefficients for each subdomain and, subsequently, to determine the corresponding enclosures of the remainder terms. This requires more processing time but also yields a tighter enclosure of the remainder term. We are interested here in the first case and illustrate it briefly for both extrapolation methods. If we bisect I₁ a total of k times and I₂ a total of l times, then the differences of the integration bounds (b − a) and (d − c) must be multiplied by the factors 2^{-k} and 2^{-l}, respectively, for calculating the remainder term of each subdomain. Summation of these remainder terms leads to an enclosure of the remainder term for the whole region and, since the factors in the remainder term contain large powers of (b − a) and (d − c), we get a tighter enclosure for the remainder than before.
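The gain from reusing the original-domain Taylor coefficients can be quantified. If the remainder term of one subdomain is proportional to the κ-th power of the interval length, where κ > 1 depends on the extrapolation order (the symbol κ is ours, not from the text), then summing over the 2^k subdomains of I₁ gives

```latex
\sum_{\text{subdomains}} C\,\bigl(2^{-k}(b-a)\bigr)^{\kappa}
  \;=\; 2^{k}\, C\, 2^{-k\kappa}\,(b-a)^{\kappa}
  \;=\; 2^{-k(\kappa-1)}\, C\,(b-a)^{\kappa},
```

so every bisection of I₁ shrinks this contribution by the factor 2^{-(κ−1)}; the same argument applies to I₂ with the factor 2^{-l(κ−1)}.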
5.2 Combination of the Single and the Double Romberg Extrapolation
The two extrapolation methods can be combined in the following way. First, in analogy to the single Romberg extrapolation, the Taylor coefficients of a triangular matrix are determined; subsequently, for both extrapolation methods, we calculate all remainder terms which can be determined using the previously computed Taylor coefficients. Then we look at the remainder terms which are consistent with the prescribed error bound and choose the one whose corresponding approximant requires a minimal number of function evaluations. If we cannot find a remainder term satisfying the prescribed error, we have to use an adaptation, i.e., we have to increase the extrapolation order or to bisect the integration ranges.
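As an illustration of such a composition, the double extrapolation can be sketched in ordinary floating-point arithmetic; this is a plain approximation without verified remainder enclosures, so it is not an E-Method, and the function names are ours, not from the paper:

```python
import math

def romberg(f, a, b, levels=8):
    # Classical Romberg scheme: trapezoidal values with 2^k subintervals,
    # followed by Richardson extrapolation.
    R = [[0.0] * (levels + 1) for _ in range(levels + 1)]
    h = b - a
    R[0][0] = 0.5 * h * (f(a) + f(b))
    for k in range(1, levels + 1):
        h *= 0.5
        # refine the trapezoidal sum using only the new midpoints
        R[k][0] = 0.5 * R[k - 1][0] + h * sum(
            f(a + (2 * i - 1) * h) for i in range(1, 2 ** (k - 1) + 1))
        for j in range(1, k + 1):
            R[k][j] = R[k][j - 1] + (R[k][j - 1] - R[k - 1][j - 1]) / (4 ** j - 1)
    return R[levels][levels]

def romberg2d(f, a, b, c, d, levels=6):
    # Composition of two one-dimensional extrapolations: the outer Romberg
    # integrates s -> (Romberg approximation of the inner integral over t).
    inner = lambda x: romberg(lambda y: f(x, y), c, d, levels)
    return romberg(inner, a, b, levels)

# Example 2 of the text: the double integral of e^{x1 x2} over the unit square.
J = romberg2d(lambda x, y: math.exp(x * y), 0.0, 1.0, 0.0, 1.0)
```

With smooth integrands such as e^{x₁x₂}, six extrapolation levels already reproduce the reference value 1.3179021514... to roughly ten digits.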
5.3 Numerical Integration in Higher Dimensions
In analogy to the two methods for numerical integration in two dimensions presented in this paper, we can develop procedures for the numerical treatment of integrals of the form

J = ∫_{a₁}^{b₁} · · · ∫_{aₙ}^{bₙ} f(x₁, . . . , xₙ) dxₙ · · · dx₁.
With increasing n, the number of possible methods increases. We briefly illustrate this situation for the integration in three dimensions. Here we may choose between:

• a single method in analogy to the single Romberg extrapolation,
• a 'triple' extrapolation, which is based on the composition of three one-dimensional integrations,
• three different procedures, each of them developed by means of the composition of a one-dimensional and a single two-dimensional extrapolation.
In order to determine a suitable remainder term for the three-dimensional single extrapolation, we need the three-dimensional Euler-Maclaurin summation formula. In the other cases the results of [7], [8], [15] and of the present paper may be used. Finally it should be pointed out that, with increasing n, the computation time increases considerably.
References

[1] Abramowitz, M. and Stegun, I. A.: Handbook of Mathematical Functions, Dover Publications, New York, 1965
[2] Bauer, F. L., Rutishauser, H. and Stiefel, E.: New Aspects in Numerical Quadrature, Proc. of Symposia in Applied Mathematics, 15, AMS, 1963
[3] Bohlender, G., Rall, L. B., Ullrich, Ch., Wolff v. Gudenberg, J.: PASCAL-SC, Wirkungsvoll programmieren, kontrolliert rechnen, B. I., Mannheim, 1986
[4] Bulirsch, R.: Bemerkungen zur Romberg-Integration, Num. Math., 6, pp. 6-16, 1964
[5] Davis, Ph. J., Rabinowitz, Ph.: Methods of Numerical Integration, Academic Press, San Diego, 1984
[6] Engels, H.: Numerical Quadrature and Cubature, Academic Press, New York, 1980
[7] Kelch, R.: Ein adaptives Verfahren zur Numerischen Quadratur mit automatischer Ergebnisverifikation, Dissertation, Universität Karlsruhe, 1989
[8] Kelch, R.: Numerical Quadrature by Extrapolation with Automatic Result Verification, this volume
[9] Knopp, K.: Theorie und Anwendung der unendlichen Reihen, Springer, Berlin, 1931
[10] Kulisch, U. W., Miranker, W. L. (Eds.): Computer Arithmetic in Theory and Practice, Academic Press, New York, 1981
[11] Laurent, P. J.: Formules de quadrature approchée sur domaines rectangulaires convergentes pour toute fonction intégrable Riemann, C.R. Acad. Sci. Paris, 258, 798-801, 1964
[12] Locher, F.: Einführung in die numerische Mathematik, Wissensch. Buchges., Darmstadt, 1978
[13] Lyness, J. N. and McHugh, B. J. J.: On the Remainder Term in the N-Dimensional Euler Maclaurin Expansion, ACM Trans. Math. Software, Vol. 1, No. 2, June 1975
[14] Stoer, J.: Einführung in die Numerische Mathematik I, Springer, Berlin, 1979
[15] Storck, U.: Verifizierte Kubatur durch Extrapolation, Diplomarbeit, Universität Karlsruhe, 1990
[16] Stroud, A. H.: Approximate Calculation of Multiple Integrals, Prentice-Hall, Englewood Cliffs, 1971
Verified Solution of Integral Equations with Applications

Hans-Jürgen Dobner
Verification methods are a new class of powerful numerical algorithms computing numerical solutions together with mathematically guaranteed error bounds of high quality. In this paper such methods are derived for Fredholm and Volterra integral equations of the second kind and for certain types of equations of the first kind. These ideas are applied to elliptic and hyperbolic differential equations. Finally some ideas about realizations are given.
1 Introduction

Algorithmic procedures with the following quality properties are considered:

• automatic verification of the existence of a (theoretical) solution,
• automatic computation of guaranteed error bounds;

because of the underscored letters, they are called E-Methods [11]. The development of E-Methods was motivated by the growing and complex employment of computers in all fields of engineering by non-mathematicians, which has made it necessary to postulate an increased reliability of the computed results. Since the early 1980's (cf. Kulisch / Miranker [17]), effective computational tools have become available for the determination of imprecisions arising from floating-point arithmetic, i.e., a precise formulation of computer arithmetic and its implementation on digital computers. Algebraic problems in finite dimensional spaces, e.g. systems of linear equations, were the first ones to be treated with E-Methods (Kaucher / Rump [12]). These techniques have been extended to functional problems in infinite dimensional spaces. The methodology of these techniques is described in Kaucher / Miranker [11]. In Section 2 the basic concepts of error controlling algorithms are outlined. E-Methods for linear Volterra and Fredholm integral equations of the second kind are derived in the following two sections, whereas weakly singular problems are discussed in Section 5. First kind equations are treated in Section 6. Partial differential equations of elliptic and hyperbolic type are considered in Section 7. Additional applications and methods for implementation are discussed in the final Sections 8 and 9.
Copyright © 1993 by Academic Press, Inc. All rights of reproduction in any form reserved. ISBN 0-12-044210-8
2 Some basic concepts
With IR we denote the set of all real intervals of the form [a] = [a̲, ā] = {x ∈ R | a̲ ≤ x ≤ ā}; the operations * ∈ {+, −, ·, /} are defined as usual:

(1) [a] * [b] := [ min{a̲*b̲, a̲*b̄, ā*b̲, ā*b̄} , max{a̲*b̲, a̲*b̄, ā*b̲, ā*b̄} ],

with 0 ∉ [b] in the case of division. Let M be the partially ordered Banach space M = C[α, β] with the usual maximum norm; Φ = {φᵢ}_{i∈N} = {φᵢ(s)}_{i∈N}, α ≤ s ≤ β, is a generating system of M; Ω := {+, −, ·, /, ∫} denotes the set of operations defined in M where, in the case of integration, either fixed or variable integration bounds are admitted.
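Definition (1) translates directly into code. The following minimal sketch (the class name is ours) implements the four interval operations over floating-point endpoints; it deliberately ignores the directed outward rounding of the endpoints that a real implementation such as PASCAL-SC would apply:

```python
from itertools import product

class Interval:
    """Closed real interval [lo, hi]; a minimal sketch of definition (1),
    without directed (outward) rounding of the endpoints."""

    def __init__(self, lo, hi=None):
        self.lo, self.hi = lo, (lo if hi is None else hi)

    def _combine(self, other, op):
        # min/max over the four endpoint combinations, as in (1)
        cands = [op(a, b) for a, b in product((self.lo, self.hi),
                                              (other.lo, other.hi))]
        return Interval(min(cands), max(cands))

    def __add__(self, o): return self._combine(o, lambda a, b: a + b)
    def __sub__(self, o): return self._combine(o, lambda a, b: a - b)
    def __mul__(self, o): return self._combine(o, lambda a, b: a * b)

    def __truediv__(self, o):
        # division requires 0 not to lie in the divisor
        assert not (o.lo <= 0.0 <= o.hi), "0 must not lie in [b]"
        return self._combine(o, lambda a, b: a / b)

    def __contains__(self, x): return self.lo <= x <= self.hi
    def __repr__(self): return f"[{self.lo}, {self.hi}]"
```

For example, Interval(1, 2) * Interval(-1, 3) yields [-2, 6], the exact range of the product over both operand intervals.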
Definition 2.1
The n-dimensional, n ∈ N, subspace of M spanned by φ₁, . . . , φₙ,

S_n := { f = Σ_{i=1}^{n} a_i φ_i | a_i ∈ R },

is called screen of M. A linear mapping T_n : M → S_n with

T_n φ_i = φ_i , i = 1, . . . , n,

is called (functional) rounding. Then the set of operations Ω is redefined in S_n by use of

f ⊡ g := T_n(f * g) , f, g ∈ S_n , * ∈ Ω.

The space (S_n, ⊡Ω) with the rounded operations ⊡Ω = {⊞, ⊟, ⊡, ⊘, ⊡∫} is called functoid and is denoted by M_n.

In the power set PM of M, the operations Ω are defined pointwise according to (1); the enclosure relation ⊆ is also understood in this sense.
Definition 2.2
The finite subspace of PM,

(2) IS_n := { F = Σ_{i=1}^{n} [a_i] φ_i | [a_i] ∈ IR },

is called an interval screen of PM. A mapping Π_n : PM → IS_n is called directed (functional) rounding if the following properties are valid:

Π_n φ_i = φ_i , i = 1, . . . , n, and φ_i ∈ Π_n φ_i , i > n.

In IS_n the arithmetic operations Ω are defined in the following way:

F ◊* G := Π_n(F * G) , F, G ∈ IS_n , * ∈ Ω.

The space (IS_n, ◊Ω) is called (interval) functoid, where ◊Ω = { ◊+, ◊−, ◊·, ◊/, ◊∫ }; it will be simply referred to as IM_n. Furthermore, the symbol ◊*, * ∈ Ω, will be replaced by *, * ∈ Ω, for the sake of clarity.

Remark 2.1
Every interval [a] ∈ IR represents simultaneously a subset [a]_M ⊆ M of M, that is

[a]_M = { f ∈ M | f(s) ∈ [a] , α ≤ s ≤ β }.

This ambiguity of the coefficients [a_i] in (2) as real intervals or as ranges of function sets makes it possible to deal with functional problems in IR. In this sense we shall identify [a_i] = [a_i]_M. An interval function F ∈ IS_n contains all functions g of M with

g(s) ∈ F(s) = Σ_{i=1}^{n} [f_i] φ_i(s) , [f_i] ∈ IR , α ≤ s ≤ β.
Lemma 2.1
The elements of IM_n are closed, bounded, convex sets.

Proof
F ∈ IM_n can be represented as an interval

[F]_M = { g ∈ M | F̲(s) ≤ g(s) ≤ F̄(s) , α ≤ s ≤ β }

in the partially ordered Banach space M; hence F is closed, bounded and convex. □

Extensive use will be made of the important test F ⊆ G. For practical purposes a sufficient condition, the coefficient enclosure ⊆_n, will be employed instead of ⊆.
Lemma 2.2
Let F, G ∈ IM_n with F = Σ_{i=1}^{n} [f_i] φ_i , G = Σ_{i=1}^{n} [g_i] φ_i, and let the coefficient enclosure ⊆_n be defined as follows:

(3) F ⊆_n G :⇔ [f_i] ⊆ [g_i] , i = 1, . . . , n.

Then the implication

F ⊆_n G ⇒ F(s) ⊆ G(s) , F, G ∈ IM_n , α ≤ s ≤ β,

is valid and (IM_n, ⊆_n) is a partially ordered set.

Proof
Follows by the rules of interval mathematics and (3). □
Definition 2.3
F ∈ (IS_n, ◊Ω) is called an enclosure of the real-valued function f ∈ M provided the following relation is valid:

(4) f(s) ∈ F(s) , α ≤ s ≤ β.

Enclosures for operators are defined analogously.

Convention: Throughout this article an enclosure of a real-valued quantity f will always be denoted by the corresponding capital letter F. Now, in view of computational purposes, we shall formulate a modification of Schauder's fixed point theorem.
Theorem 2.1
Let t : M → M be a compact operator and T : IM_n → IM_n an enclosure of t. If for an element X ∈ IM_n the condition

(5) TX ⊆ X

is satisfied, then the operator t has a fixed point x̂ and moreover

(6) x̂ ∈ X.

Proof
From (4) we can derive tX ⊆ X. According to Lemma 2.1, X is closed, bounded and convex, so that the existence of a fixed point x̂ ∈ X follows by Schauder's fixed point theorem. □

An analogous theorem holds in the case of an α-condensing operator t. Even uniqueness statements can be validated with computer-adequate theorems; since this exceeds the scope of this paper, we refer to Kaucher / Miranker [11].
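Theorem 2.1 can be illustrated with the scalar compact map t(x) = cos(x)/2 (our choice of example, not from the text): once an interval X with TX ⊆ X has been found, a fixed point of t is guaranteed to lie in X, and repeated application of T tightens the enclosure:

```python
import math

def T(lo, hi):
    # Interval extension of t(x) = cos(x)/2 on [0, pi/2]:
    # cos is decreasing there, so the image of [lo, hi] is [cos(hi)/2, cos(lo)/2].
    return math.cos(hi) / 2.0, math.cos(lo) / 2.0

lo, hi = 0.0, 1.5                    # candidate enclosure X
tlo, thi = T(lo, hi)
assert lo <= tlo and thi <= hi       # TX ⊆ X: a fixed point of t exists in X
for _ in range(60):                  # each application of T tightens the bounds
    lo, hi = T(lo, hi)
```

After the loop the interval has contracted to (floating-point) width zero around the fixed point x ≈ 0.45018 of cos(x)/2.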
3 Enclosure methods for linear Volterra integral equations of the second kind

The equation being considered first has the form

(7) x(s) = g(s) + ∫_α^s k(s,t) x(t) dt , α ≤ s ≤ β,
with g ∈ M and k continuous on [α, β] × [α, β]. In operator notation, (7) is written as

x = g + kx,

where we use the same symbol for the kernel and the corresponding operator. Two different approaches for enclosing the solution of (7) will be discussed. The first method is based on an iterative process. Let x̃ be an initial guess for the solution of (7) and X̃ the point interval extension of x̃. If I denotes the identity, we define

U := (I − K)X̃ − G

and iterate by use of

(8) X^{(ν+1)} := −Σ_{μ=0}^{m} K^μ U + K^{m+1} X^{(ν)} , ν = 0, 1, . . . .

Then the enclosure K^{μ+1} of the (μ+1)-fold iterated kernel k^{μ+1} satisfies the recurrence formula

K^{μ+1}(s,t) = ∫_t^s K(s,τ) K^μ(τ,t) dτ , μ = 1, 2, . . . , K¹ := K.
Theorem 3.1
If an iterate X^{(ν)} of the iteration process (8) satisfies the enclosure condition

(9) X^{(ν+1)} ⊆ X^{(ν)},

then the following statement has been proved automatically: there exists a solution x of (7), and x is enclosed within

x ∈ x̃ + X^{(ν+1)}.

Proof
Using property (4), we see that the compact operator t,

t z := −Σ_{μ=0}^{m} ∫_α^s k^μ(s,t) u(t) dt + ∫_α^s k^{m+1}(s,t) z(t) dt,

maps the nonempty, closed, bounded and convex set X^{(ν+1)} into itself, so that Theorem 2.1 guarantees the existence of a fixed point ẑ of t in the function interval
X^{(ν+1)}. With r an approximation of the inverse of the Fréchet derivative of (I − k), we rewrite (7) in the form

x = x − r((I − k)x − g).

An application of the mean value theorem then yields

(10) x − x̃ = −r((I − k)x̃ − g) + (I − r(I − k))(x − x̃).

Since the kernel is continuous, the Neumann series Σ_{ν=0}^{∞} k^ν converges and coincides with the inverse of (I − k); hence lim_{m→∞} ||k^m|| = 0. If r now is chosen as

r := Σ_{μ=0}^{m} k^μ,

then

I − r(I − k) = k^{m+1}.

From this it can easily be seen that x̃ must be added to a fixed point of (10) to obtain a solution of (7); this completes the proof. □
Remark 3.1
There always exists an integer m with ||k^{m+1}|| < 1, because the spectral radius of k is zero. In practical computations, (8) is performed at first with a chosen value m, based on some rough estimates for ||∫_α^s k^m(s,t) dt||. If the stopping criterion (9) cannot be satisfied after a prescribed number of iterations, the computations are repeated with an enlarged value of m.
Linear systems of Volterra equations of the second kind have the form

x_i(s) = g_i(s) + Σ_{j=1}^{N} ∫_α^s k_{ij}(s,t) x_j(t) dt , α ≤ s ≤ β , i = 1, . . . , N,

where all functions are assumed to be continuous. The E-iteration (8) is applied simultaneously and componentwise, thus leading to guaranteed error bounds in this case as well. In some cases, e.g. when the Neumann series converges slowly, the algorithm (8) will not work very well or may even fail. In this situation the solution of (7) can be enclosed by using degenerate kernels, thus leading to a system of differential equations.
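The plain floating-point counterpart of such an iteration is ordinary Picard iteration on a grid; the sketch below (all names ours, no interval enclosures) approximates x(s) = g(s) + ∫_α^s k(s,t) x(t) dt with the trapezoidal rule and, for g ≡ 1 and k ≡ 1, reproduces the exact solution x(s) = e^s:

```python
import math

def picard_volterra(g, k, a, b, n=200, sweeps=30):
    """Approximate x(s) = g(s) + int_a^s k(s,t) x(t) dt by Picard iteration
    on a uniform grid, using the trapezoidal rule for the integral."""
    h = (b - a) / n
    s = [a + i * h for i in range(n + 1)]
    x = [g(si) for si in s]                    # initial guess x^(0) = g
    for _ in range(sweeps):
        # one Picard sweep: evaluate the right-hand side at every grid point
        x = [g(si) + sum(0.5 * h * (k(si, s[j]) * x[j] + k(si, s[j + 1]) * x[j + 1])
                         for j in range(i))
             for i, si in enumerate(s)]
    return s, x

s, x = picard_volterra(lambda s: 1.0, lambda s, t: 1.0, 0.0, 1.0)
err = max(abs(xi - math.exp(si)) for si, xi in zip(s, x))
```

For Volterra operators the Picard iteration always converges (the spectral radius is zero, as noted in Remark 3.1); the remaining error here is just the O(h²) quadrature error.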
Theorem 3.2
Let K(s,t) be a degenerate enclosure of k(s,t), that is,

k(s,t) ∈ K(s,t) = Σ_{i=1}^{m} a_i(s) B_i(t) , α ≤ s, t ≤ β,

where a_i(s) and B_i(t), respectively, are continuous real-valued or set-valued functions. Let the functoid vector Y = (Y₁, . . . , Y_m) be a solution of the system of initial value problems

Y_i'(s) = B_i(s) ( G(s) + Σ_{j=1}^{m} a_j(s) Y_j(s) ) , α ≤ s ≤ β , 0 ∈ Y_i(α) , i = 1, . . . , m.

Then the following is true: the solution x(s) of (7) exists and is enclosed in the function tube X(s):

(11) x(s) ∈ X(s) = G(s) + Σ_{i=1}^{m} a_i(s) Y_i(s) , α ≤ s ≤ β.

Proof
We have G + KX ⊆ X; therefore (11) follows by use of Theorem 2.1. □
Enclosure methods for linear F’redholm integral equations of the second kind
Now we will be treating problems of the type
(12)
+ 1 k(s,t)z(t)dt P
z(s) = g ( s )
a
,
a Is I P
,
with g E M, k E M x M. The following short operator notation will be used z = g + k z .
We shall derive several concepts in order to validate the solution of the foregoing equation.
H.- J . Dobner
Decomposition of the kernel

We rewrite (12) in the form

x = g + k₁x + k₂x,

where k₂ is a degenerate kernel,

k₂(s,t) = Σ_{i=1}^{m} a_i(s) b_i(t) , α ≤ s, t ≤ β , m ∈ N;

k₂ can always be chosen so that the remainder k₁ = k − k₂ is a contractive operator. The function systems a_i and b_i are assumed to be linearly independent. We start with iterations in the functoid IM_n:

(13) V^{(μ+1)} = G + K₁ V^{(μ)} , μ = 0, 1, . . . ,
(14) W_i^{(μ+1)} = A_i + K₁ W_i^{(μ)} , i = 1, . . . , m , μ = 0, 1, . . . .

Then we may formulate
Theorem 4.1
For the iteration schemes (13) and (14), let the enclosure conditions

(15) V^{(μ+1)} ⊆ V^{(μ)}

and

(16) W_i^{(μ_i+1)} ⊆ W_i^{(μ_i)} , i = 1, . . . , m,

be satisfied by iterates V := V^{(μ+1)}, W_i := W_i^{(μ_i+1)}, i = 1, . . . , m. If, furthermore, U = (U₁, . . . , U_m) ∈ IR^m is an enclosure for the solution of the system of linear interval equations

(17) (E − Q) U = R

with the unit matrix E ∈ R^{m×m}, and

(18) Q := Q(B_j, W_i) := ( ∫_α^β B_j(t) W_i(t) dt )_{i,j = 1, . . . , m} ∈ IR^{m×m},

(19) R := R(B_j, V) := ( ∫_α^β B_j(t) V(t) dt )_{j = 1, . . . , m} ∈ IR^m,

then there holds the following existence and enclosure statement:

(20) x(s) ∈ X(s) = V(s) + Σ_{i=1}^{m} U_i W_i(s) , α ≤ s ≤ β.
Proof:
If r₁ denotes the resolvent kernel of k₁, then the solution of x = g + k₁x + k₂x satisfies

x(s) = g(s) + ∫_α^β r₁(s,t) g(t) dt + Σ_{i=1}^{m} ( a_i(s) + ∫_α^β r₁(s,t) a_i(t) dt ) ∫_α^β b_i(t) x(t) dt , α ≤ s ≤ β.

Hence the functions

v(s) := g(s) + ∫_α^β r₁(s,t) g(t) dt , w_i(s) := a_i(s) + ∫_α^β r₁(s,t) a_i(t) dt , α ≤ s ≤ β , i = 1, . . . , m,

can be determined as solutions of the Fredholm integral equations

(21) v = g + k₁v,
(22) w_i = a_i + k₁w_i , i = 1, . . . , m.

The solution x of (12) then is given by

x(s) = v(s) + Σ_{i=1}^{m} u_i w_i(s) , α ≤ s ≤ β,

where u = (u₁, . . . , u_m) ∈ R^m satisfies the real linear system (cf. (18), (19))

(E − Q(b_j, w_i)) u = R(b_j, v).

Using the enclosure property (4) we may conclude by means of (13), (14) that

v ∈ V , w_i ∈ W_i , i = 1, . . . , m;

therefore there holds

Q(b_j, w_i) ∈ Q(B_j, W_i) , R(b_j, v) ∈ R(B_j, V);

consequently there follows u = (u₁, . . . , u_m) ∈ (U₁, . . . , U_m) = U, i.e. (20). □
Remark 4.1
The existence of a solution of (12) is proved by means of the E-Algorithm. Provided the enclosure conditions (15), (16) are replaced by the stronger ε-enclosure conditions

U_ε(V^{(μ+1)}) ⊆ V^{(μ)} , U_ε(W_i^{(μ_i+1)}) ⊆ W_i^{(μ_i)} , i = 1, . . . , m,

which are defined pointwise, then additionally the unique solvability of (21), (22) has been validated (cf. Kaucher / Miranker [11]). The solution of (17) by use of E-Methods automatically delivers an existence and uniqueness statement for all matrices contained in Q(B_j, W_i); in this case, therefore, the uniqueness of x is validated, too.
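For a purely degenerate kernel (k₁ = 0, so V = G and W_i = A_i), the scheme (17)-(20) collapses to one small linear system. The following floating-point sketch (function names ours; midpoint quadrature instead of verified interval integrals) solves x(s) = s + ∫₀¹ s·t·x(t) dt, whose exact solution is x(s) = 3s/2:

```python
def solve_degenerate_fredholm(g, a_funcs, b_funcs, alpha, beta, n=2000):
    """Sketch of (17)-(20) for a degenerate kernel k(s,t) = sum a_i(s) b_i(t),
    using plain midpoint quadrature, not verified interval integrals."""
    m = len(a_funcs)
    h = (beta - alpha) / n
    pts = [alpha + (j + 0.5) * h for j in range(n)]
    quad = lambda f: h * sum(f(t) for t in pts)
    # Q_ij = int b_i(t) a_j(t) dt,  R_i = int b_i(t) g(t) dt   (cf. (18), (19))
    Q = [[quad(lambda t, i=i, j=j: b_funcs[i](t) * a_funcs[j](t))
          for j in range(m)] for i in range(m)]
    R = [quad(lambda t, i=i: b_funcs[i](t) * g(t)) for i in range(m)]
    # solve (E - Q) u = R by Gauss-Jordan elimination (m is tiny)
    A = [[(1.0 if i == j else 0.0) - Q[i][j] for j in range(m)] + [R[i]]
         for i in range(m)]
    for col in range(m):
        piv = max(range(col, m), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        for r in range(m):
            if r != col:
                f = A[r][col] / A[col][col]
                A[r] = [x - f * y for x, y in zip(A[r], A[col])]
    u = [A[i][m] / A[i][i] for i in range(m)]
    # x(s) = g(s) + sum u_i a_i(s)   (cf. (20) with V = G, W_i = A_i)
    return lambda s: g(s) + sum(u[i] * a_funcs[i](s) for i in range(m))

x = solve_degenerate_fredholm(lambda s: s, [lambda s: s], [lambda t: t], 0.0, 1.0)
```

Here m = 1 with a₁(s) = s, b₁(t) = t, giving u₁ = (1/3)/(1 − 1/3) = 1/2 and hence x(s) = 3s/2.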
If a nonnegative real constant c is explicitly known such that

|k₁(s,t)| ≤ c < 1 , α ≤ s, t ≤ β,

then the iterations (13), (14) can be avoided and the solution x(s) of (12) is guaranteed to exist in the function tube

(23) x(s) ∈ X(s) = G(s) + [G][C] + Σ_{i=1}^{m} U_i ( A_i(s) + [A_i][C] ),

where the interval quantities [G] and [A_i] are defined in (24) and (25) by integrating the absolute values of G and A_i, respectively, [C] is an interval constant determined by c, and U = (U₁, . . . , U_m) ∈ IR^m is a solution of

(26) (E − Q(B_j, A_i + [A_i][C])) U = R(B_j, G + [G][C]).
Remark 4.2
In (24) and (25), absolute values of functions have to be integrated. This situation can be avoided when using basis functions φ_i (see (2)) provided they possess a fixed sign on [α, β].

Remark 4.3
The accuracy of X in (23) is essentially governed by the diameter of the interval [C].
Remark 4.4
If k(s,t) is a degenerate kernel, i.e. k₁ = 0, then the constant [C] is zero; hence the computational evaluation of the integrals (24), (25) is not necessary and a higher accuracy can be achieved.

Enclosure of the kernel

Now we outline an enclosure method which is based on the enclosure of the given kernel k(s,t) within a degenerate functoid kernel K(s,t). This method does not make use of the decomposition of the kernel k(s,t) into a degenerate and a contractive one; instead, the contractivity property is replaced by some positivity conditions. Let K(s,t) be a degenerate enclosure for k(s,t) with the special form

(27) k(s,t) ∈ K(s,t) = Σ_{i=1}^{m} A_i(s) b_i(t) , α ≤ s, t ≤ β,

with continuous real-valued functions b_i and continuous interval-valued functoid functions A_i. In the simplest case K(s,t) reduces to an interval. The property
(28) b_i(s) · x(s) ≥ 0 , α ≤ s ≤ β , i = 1, . . . , m,

yields

(29) ∫_α^β A_i(s) b_i(t) x(t) dt = A_i(s) ∫_α^β b_i(t) x(t) dt , α ≤ s ≤ β , i = 1, . . . , m.

Therefore, the following theorem can be stated:
Theorem 4.2
Let (E − Q(b_j, A_i)) ∈ IR^{m×m} be a set of nonsingular matrices and let

(30) U = (E − Q(b_j, A_i))^{−1} (R(b_j, G)) , U = (U₁, . . . , U_m) ∈ IR^m.

Provided there exists a solution x(s) of (12), then

(31) x(s) ∈ X(s) = G(s) + Σ_{i=1}^{m} U_i A_i(s) , α ≤ s ≤ β.

Proof:
We assume that there exists a solution x(s) of (12). Because of (28) we have

(32) x(s) ∈ G(s) + Σ_{i=1}^{m} A_i(s) ∫_α^β b_i(t) x(t) dt , α ≤ s ≤ β.

We introduce the short notation u_j := ∫_α^β b_j(t) x(t) dt. Multiplication of (32) by b_j, j = 1, . . . , m, and integration of both sides from α to β leads to u_j ∈ U_j , j = 1, . . . , m. □
Remark 4.5
The statement of Theorem 4.2 above can also be interpreted as an exclusion statement, in the sense that there is no solution of (12) in the complement of X(s).

Remark 4.6
A kernel enclosure of the type

k(s,t) ∈ K(s,t) = Σ_{i=1}^{m} a_i(s) B_i(t) , α ≤ s, t ≤ β,

with a_i ∈ M, B_i ∈ IM_n, i = 1, . . . , m, leads to a contractivity condition for the matrix (E − Q(B_j, a_i)) of the resulting linear system. Next we show a way to disregard the property (28).
Lemma 4.1
By means of continuous functions a_i, b_i, i = 1, . . . , m, a continuous function k(s,t), α ≤ s, t ≤ β, can be approximated with arbitrary accuracy in the maximum norm by

Σ_{i=1}^{m} a_i(s) b_i(t) , α ≤ s, t ≤ β,

where additionally it is assumed that

(33) a_i(s) > 0 , α ≤ s ≤ β , i = 1, . . . , m.

Proof
This is a consequence of Weierstraß' approximation theorem. □
Since both the forcing term g and the kernel k are assumed to be continuous, we furthermore have:

Lemma 4.2
The solution x(s) of the integral equation (12) is bounded.

From Lemma 4.2 we can deduce that there exists a real constant l such that

l < x(s) , α ≤ s ≤ β.

The new unknown z(s) := x(s) − l, α ≤ s ≤ β, has only positive values; z(s) satisfies an integral equation with the kernel k(s,t) and the forcing term

g̃(s) := g(s) − l + l ∫_α^β k(s,t) dt.

Consequently, Lemmata 4.1 and 4.2 allow us to disregard the restriction (28).
5 Problems with unbounded kernels
The subject of this section are Fredholm integral equations (12) with the kernel

(34) k(s,t) = p(s,t) q(s − t) , α ≤ s, t ≤ β.

It is assumed that p(s,t) is bounded and q(s − t) is unbounded such that the following conditions are satisfied, making use of real constants C, δ > 0, 0 < δ < 1:

(35) ∫_α^β |q(s − t)| dt < ∞ , α ≤ s ≤ β,

(36) |q(s − t)| ≤ C |s − t|^{−δ} , α ≤ s, t ≤ β , s ≠ t.
Then the corresponding integral operator is a compact operator on M. Now several ideas will be discussed enabling one to treat this singular problem in such a manner that the E-Algorithms of the previous section can be applied.

Decomposition of the domain

By means of partitioning the domain, the integral equation (12) with a kernel of type (34) will be transformed into a system of integral equations. The interval [α, β] will be divided into N subintervals [α_i, β_i], i = 1, . . . , N, N ∈ N, with diameter Δ = (β − α)/N; this leads to the equivalent system

(37) x_i(s) = g_i(s) + Σ_{j=1}^{N} ∫_{α_j}^{β_j} k_{ij}(s,t) x_j(t) dt , α_i ≤ s ≤ β_i , i = 1, . . . , N,

with

x_i(s) := x(s) , g_i(s) := g(s) if α_i ≤ s ≤ β_i , i = 1, . . . , N,

and k_{ij}(s,t) := k(s,t) for α_i ≤ s ≤ β_i, α_j ≤ t ≤ β_j, i, j = 1, . . . , N.

Self-validating methods for problems such as (37) are outlined in detail by Klein [14]. Note that the kernels k_{ij}(s,t) = p_{ij}(s,t) q_{ij}(s − t) are bounded for |i − j| ≥ 2, so that they can be decomposed into the sum of contractive kernels k_{1,ij}(s,t) and degenerate kernels k_{2,ij}(s,t). In the cases i = j − 1, i = j and i = j + 1, i = 1, . . . , N, 1 < j ≤ N, the kernels are unbounded, whereas the corresponding integral operators have bounded norms. The decomposition of the domain and the decomposition of the kernels can be carried out such that max_{i=1,...,N} { Σ_{j=1}^{N} ||k_{1,ij}|| } < 1. Under these assumptions we formulate
Theorem 5.1
If X = (X₁, . . . , X_N) ∈ IM_n^N is an enclosure of the system (37), then there exists a solution x(s) of the Fredholm integral equation (12) with the singular kernel (34), which is enclosed as follows:

(38) x(s) ∈ X_i(s) , α_i ≤ s ≤ β_i , i = 1, . . . , N.

The proof is obvious. □
Decomposition of the kernel

In contrast to the way outlined before, the kernel now will be decomposed. With an arbitrary ε > 0 and a kernel of the form (34), the compact integral operator k can be written in the form

(39) k = k₁ + k₂,

where k₂ is an operator with a degenerate kernel and k₁ an operator with ||k₁|| < ε. The continuity of p(s,t) guarantees the existence of a nonnegative real constant C such that

|p(s,t)| ≤ C , α ≤ s, t ≤ β.

By use of a nonnegative real number η, the kernel is split into a part supported near the singularity and a bounded remainder; denoting the corresponding integral operators as usual by k₁, k₂, we see that η can be chosen such that ||k₁|| < ε/2. Thus, the remaining operator k₂ can be written as the sum of a degenerate operator k₂₂ (i.e. an integral operator with a degenerate kernel) and an operator k₂₁ having the property ||k₂₁|| < ε/2; therefore, we finally have a sum of the type (39), and hence the E-Method as outlined in Theorem 4.1 can be used to compute tight and guaranteed bounds for the solution x(s).

Iterated kernels

We make use of the fact that there exists an integer m such that the iterated kernels k^ν(s,t) of k(s,t) in (34) are bounded for all integers ν larger than or equal to m. First we need two lemmata, which will be formulated without proof. For convenience we restrict ourselves w.l.o.g. to special singularities q(s − t).
Lemma 5.1
If the singularity q(s − t) is of the form

(40) q(s − t) = |s − t|^{−γ} , 0 < γ < 1,

then the m-th iterated kernel k^m(s,t) of k(s,t) is bounded provided

(41) m(1 − γ) > 1.

Lemma 5.2
Let the integer parameter m in (41) be chosen such that e^{2πij/m}, j = 1, . . . , m − 1, is not an eigenvalue of k (this is always possible). Then each solution x(s) of (12) is a solution of

(42) y(s) = g(s) + Σ_{ν=1}^{m−1} ∫_α^β k^ν(s,t) g(t) dt + ∫_α^β k^m(s,t) y(t) dt , α ≤ s ≤ β,

and the converse is also true. Therefore equation (42) can be treated by use of one of the E-Methods as presented in Section 4; as a consequence this can be summarized in

Theorem 5.2
Let Y ∈ IM_n be a validated solution of (42) under the constraints

(43) none of the m-th roots of unity is an eigenvalue of k, and m(1 − γ) > 1.

Then the solution x(s) of (12), with the singular kernel (34), is guaranteed to exist within Y:

(44) x(s) ∈ Y(s) , α ≤ s ≤ β.

The proof is obvious. □
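The estimate behind Lemma 5.1 is the standard convolution bound for weakly singular kernels; with |k(s,t)| ≤ C|s − t|^{−γ}, one iteration step gives (the constants C₂, C_m are generic and our notation):

```latex
|k^{2}(s,t)| \;\le\; C^{2}\int_{\alpha}^{\beta} |s-\tau|^{-\gamma}\,|\tau-t|^{-\gamma}\,d\tau
\;\le\; C_{2}\,|s-t|^{\,1-2\gamma}, \qquad \tfrac12<\gamma<1,
```

and inductively |k^m(s,t)| ≤ C_m |s − t|^{m(1−γ)−1} as long as the exponent stays negative; the iterated kernel therefore becomes bounded exactly when m(1 − γ) − 1 > 0, i.e. under condition (41).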
Remark 5.1
If the singular Fredholm integral equation is treated by means of automatic verification schemes, then additionally the constraint (43) must be satisfied in order that (44) holds true. This leads to the new and interesting complex of validations under constraints.

Remark 5.2
The solution of Volterra problems with singularities such as (40) can be enclosed by use of the iterative E-Scheme (8) as applied to the problem of interest. In contrast to Fredholm equations, there is no need to take into account any supplementary condition.
6 Equations of the first kind
The E-Method (8) is also applicable in the case of linear Volterra integral equations of the first kind,

(45) ∫_α^s k(s,t) x(t) dt = g(s) , α ≤ s ≤ β,

provided (45) is treated under the assumptions as mentioned e.g. by De Hoog and Weiss [5]:

(46) g(s) and g'(s) are continuous, α ≤ s ≤ β,
(47) g(α) = 0,
(48) k(s,t) and ∂k(s,t)/∂s are continuous for α ≤ t ≤ s ≤ β,
(49) k(s,s) ≠ 0 , α ≤ s ≤ β.

Then (45) is equivalent to the linear equation of the second kind

(50) y(s) = (1/k(s,s)) ( g'(s) − ∫_α^s ∂k(s,t)/∂s · y(t) dt ) , α ≤ s ≤ β;

therefore equation (45) can always be solved by means of the validation schemes of Section 3. The derivatives can be determined making use of the technique of automatic differentiation (cf. Rall [23]).
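Under (46)-(49) this reduction can be mimicked numerically; the sketch below (names ours, plain floating point, no enclosures) differentiates (45) once and then iterates on the resulting second-kind equation. For the test kernel k(s,t) = 1 + s − t and forcing term g(s) = s²/2 + s³/6 the exact solution is x(s) = s:

```python
def volterra_first_kind(dg, k_diag, dk_ds, a, b, n=200, sweeps=30):
    """Solve the second-kind form (50) of a first-kind Volterra equation:
    y(s) = [g'(s) - int_a^s dk/ds(s,t) y(t) dt] / k(s,s),
    by Picard iteration with the trapezoidal rule on a uniform grid."""
    h = (b - a) / n
    s = [a + i * h for i in range(n + 1)]
    y = [dg(si) / k_diag(si) for si in s]      # initial guess: integral term = 0
    for _ in range(sweeps):
        new = []
        for i, si in enumerate(s):
            integ = sum(0.5 * h * (dk_ds(si, s[j]) * y[j]
                                   + dk_ds(si, s[j + 1]) * y[j + 1])
                        for j in range(i))
            new.append((dg(si) - integ) / k_diag(si))
        y = new
    return s, y

# kernel k(s,t) = 1 + s - t, g(s) = s^2/2 + s^3/6, exact solution x(s) = s
s, y = volterra_first_kind(lambda s: s + 0.5 * s * s,   # g'(s)
                           lambda s: 1.0,               # k(s,s)
                           lambda s, t: 1.0,            # dk/ds(s,t)
                           0.0, 1.0)
```

Note how (49) enters: k(s,s) appears in the denominator, so the reduction breaks down wherever the kernel vanishes on the diagonal.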
Theorem 6.1
Assume that the kernel and the forcing term of (45) satisfy (46) - (49). If Y ∈ IM_n is an enclosure for the solution y of (50), then the solution x(s) of (45) also exists and is enclosed in Y, too:

x(s) ∈ Y(s) , α ≤ s ≤ β.

The proof is trivial because of x = y, according to (46) - (49). □

Remark 6.1
It should be mentioned that the solution x of (45) and the solution y of (50) coincide only in the case that condition (47) is satisfied. This condition is not explicitly employed in the process of validating (50). Therefore the situation may arise that (50) possesses a unique solution y(s) which is not related to any solution of (45), unless condition (47) is satisfied. Therefore, (47) must be validated additionally to ensure an enclosure of (45) via (50), according to Theorem 6.1. This is another example of a validation with constraints. The situation for Fredholm integral equations of the first kind is much more complicated since they represent ill-posed problems; consequently, a treatment by means of E-Methods is not possible.
7 Automatic result verification for elliptic and hyperbolic problems
In this section we deal with hyperbolic problems of the form

(51) u_{st} = r(s, t, u, u_s, u_t) , (s,t) ∈ D ⊆ R²,

where the domain D of definition and a subset D̃ of the boundary ∂D are chosen according to different initial value problems (see below). Furthermore we consider boundary value problems of potential theory in the plane. For convenience, in the hyperbolic case we treat the following special problems (an extension to more general problems is obvious):

• Darboux problem: D is a triangle with the sides s = α, t = 0, s = t (α > 0); values of u are given on the two characteristics s = 0, t = 0.
• Goursat problem: values of u are given on s = 0 and on s = t.
• Cauchy problem: D is a triangle with the endpoints (0,0), (α,0), (0,β) (α, β > 0); values of u, u_s, u_t are given on the line through (α,0) and (0,β).
Instead of (51) we consider the equivalent integral equation

(52) u(s,t) = g(s,t) + ∫∫_{D(s,t)} r(σ, τ, u, u_σ, u_τ) dσ dτ,

where g(s,t) is a summand arising from the initial value problem (cf. Walter [26]) and D(s,t) denotes the corresponding domain of integration. We are looking for classical solutions u, that is, u is a continuous function on D with continuous derivatives u_s, u_t, u_{st}. We assume furthermore that g and r are continuously differentiable. Therefore, for this type of equations our solution space will be M := C^{1,1}(D). The concepts described in Section 2 carry over to this situation with only slight modifications.
Next we eliminate the partial derivatives in the integro-differential equation (52) by introducing the new unknowns u₁ := u, u₂ := u_s, u₃ := u_t; this now yields a system of three coupled Volterra integral equations:

(53) u_i(s,t) = g_i(s,t) + ∫_{D_i(s,t)} r(σ, τ, u₁, u₂, u₃) d(σ,τ) , i = 1, 2, 3,

where D₁(s,t) = D(s,t) ⊆ R² and D₂(s,t), D₃(s,t) ⊆ R depend on the special problem under consideration.

By use of the abbreviations u := (u₁, u₂, u₃) and t(u) for the right-hand side of (53), (53) is written in short form:

(54) u = t(u).

The fixed point equation (54) is appropriate for computing guaranteed error bounds.

Theorem 7.1
Let T be an enclosure of t and let U^{(ν)}, ν = 0, 1, . . . , be iterates of U^{(ν+1)} = T(U^{(ν)}) such that the enclosure condition

U^{(ν+1)} ⊆ U^{(ν)}

is satisfied componentwise; then there exists a solution u(s,t) of the hyperbolic initial value problem (51). Moreover this solution u is an element of the function set U₁^{(ν+1)}:

(55) u₁(s,t) ∈ U₁^{(ν+1)}(s,t) , (s,t) ∈ D.

Proof:
Since t : M³ → M³ is a compact operator and T an enclosure of it, the statements of this theorem follow immediately from Theorem 2.1. □
Remark 7.1
In addition to the enclosure (55) we have simultaneously generated an error estimate. Such an error bound is only useful in the case that the set U(”+’) has a small diameter. If this is not true we can improve the quality of these bounds by employing the EAlgorithm proposed in (8).
Example 7.1  We consider the Goursat problem
u_{st} = ut − stu,  0 ≤ s, t ≤ 1,
with the data u = −t and u = 0 on s = t.
The computations have been performed in the simple polynomial functoid IM_n × IM_n = { Σ_{i,j=0}^n [a_{ij}] s^i t^j | [a_{ij}] ∈ IR }, where the solution has been validated with an average number of 6 digits.
Now we turn to the basic problems of potential theory; that is, we are looking for a twice continuously differentiable function u(y,z) satisfying the Laplace equation
(56)  Δu = 0
in the interior D⁰ of a simply connected domain D ⊆ R². The curve Γ bounding D is supposed to be sufficiently smooth and to be parametrized in the counterclockwise sense. We consider
• The interior Dirichlet problem: The solution u of (56) is assumed to take prescribed values f(s) on Γ, where f is continuous:
(57)  u|_Γ = f.
• The interior Neumann problem: The harmonic function u is assumed to possess a normal derivative equal to a continuous function f:
(58)  ∂u/∂n |_Γ = f,
where n denotes the outer normal on ∂D.
The functions u solving the Dirichlet or the Neumann problem can be represented in the form of a potential or logarithmic potential, respectively, with density functions λ_D or λ_N. The unknown densities satisfy the Fredholm integral equations (59) and (60), respectively, where the kernel has the form
The integral equation arising from the Dirichlet problem is nonsingular; in contrast, (60) is a singular equation, which cannot be solved directly by use of validating methods. Therefore two different enclosure statements now follow.
Theorem 7.2  Let Λ_D ∈ IM_n be an enclosure of the solution λ_D of (59). Then the solution u of the Dirichlet problem (56), (57) exists and can be expressed as follows:
Theorem 7.3  Let F ∈ IM_n be a functoid enclosure of f with
(62)  0 ∈ ∫_α^β F(t) dt,
(63)  0 ∈ F(β) − F(α).
Let V ∈ IM_n be a functoid function with v ∈ V such that, with a real number c, v(s) satisfies the integral equation (64).
Then a solution u of the Neumann problem (56), (58) exists and is enclosed within (65).
Here V satisfies
(66)  V(s) − ∫_α^β ( K(s,t) − 1/(2π) ) V(t) dt = −(1/π) ( F(s) + c/2 ),
Proof: It is well known from potential theory (cf. Martensen [21]) that equation (60) has one and only one solution λ_N with a prescribed total distribution c ∈ R, see (67).
By use of Wielandt's [27] removal of an eigenvalue from the spectrum, we know that the Fredholm integral equation (64) is nonsingular, i.e., 1 is not an eigenvalue of (64). Moreover, if and only if f is periodic with period β − α and satisfies the compatibility condition (C2) below, then the unique solution v of (64) is a solution of (60) satisfying (67); the converse is also true. Now (65) follows by use of standard arguments from potential theory and interval analysis. □
Remark 7.2
The function f prescribed on the boundary Γ of D has to fulfill the two conditions
(C1)  f is periodic with period β − α,
(C2)  ∫_α^β f(t) dt = 0.
The second constraint follows necessarily from Gauß' theorem. If one of the two conditions is not satisfied, then the integral equation (64) still has one and only one solution v; however, v does not lead to a solution of the Neumann problem. In order to validate the solution of this problem, the constraints (C1) and (C2) must be validated too. On the one hand it may be known a priori that f satisfies (C1) and (C2); on the other hand, it is possible that there is no information on f. In the first case an enclosure for the solution of (60) already leads to an enclosure for (56), (58). In the second case a function tube F is determined containing f in such a way that (62) and (63) hold true. Then it is assured that there exists a pair (v̄, f̄) ∈ (V, F) such that v̄ satisfies (60) and (67) for a function f̄ fulfilling (C1) and (C2).
Example 7.2  We prescribe the boundary values f(s) = −2π, and we consider an elliptic domain with the semiaxes a > b > 0. We compute the verified results in a sine-cosine functoid, with n denoting the dimension of this functoid. The results for the densities λ_D, λ_N are displayed in the table; an E-Algorithm (23) was used.
8  Modifications and further applications
In this section an overview is given of a modification of the fundamental E-Algorithms of the previous Sections 3, 4 such that an extension to more general problems is possible.
Nonlinear problems
We consider nonlinear Volterra equations of the form
(68)  x(s) = g(s) + ∫_α^s k(s,t,x(t)) dt,  α ≤ s ≤ β.
We require that k(s,t,x) and ∂k/∂x (s,t,x) are continuous. If x̃(s) is an initial guess for the solution of (68) and ∪ denotes the convex union of two function sets, we may formulate
Theorem 8.1  Provided the enclosure condition
X^{(ν+1)} ⊆ X^{(ν)}
is satisfied for two nonempty iterates X^{(ν)}, X^{(ν+1)} ∈ IM_n of the iteration process
(69)  X^{(ν+1)}(s) = G(s) + ∫_α^s K(s,t,X̃(t)) dt − X̃(s) + ∫_α^s ∂K/∂x( s, t, ((X^{(ν)} + X̃) ∪ X̃)(t) ) X^{(ν)}(t) dt,  α ≤ s ≤ β,  ν = 0,1,…,
then by means of an algorithm the existence of a solution x of (68) has been validated, simultaneously with the error statement
(70)  x(s) ∈ X̃(s) + X^{(ν+1)}(s),  α ≤ s ≤ β.
Proof: Follows by use of Theorem 2.1. □
Remark 8.1
In each step of the iteration (69), the solution of a linear Volterra integral equation of the second kind has to be validated by use of one of the E-Methods as proposed in Section 3. In an analogous manner, guaranteed error bounds for nonlinear Fredholm integral equations may be computed; see [14]. An application is given by the verified computation of eigenvalues and eigenfunctions. The problem considered here is the determination of a real number λ and a nonvanishing function x ∈ M := C[α,β] such that
(71)  x = λ k x,
where k is the Fredholm integral operator defined in (12).
Theorem 8.2  Provided X is a nonempty functoid element with 0 ∉ X, and [λ] is a real interval such that the enclosure condition (72) holds, then in the interval [λ] there exists an eigenvalue λ of k and a corresponding eigenfunction x, which is guaranteed to be in X.
Proof: This is a consequence of Theorem 2.1. □
The eigenvalue problem is formulated as the following system of nonlinear equations:
x − λ k x = 0,
∫_α^β x(t) dt − c = 0,  c ∈ R.
An application of the Newton method yields condition (72).
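The Newton step behind this formulation can be imitated numerically. The following hedged sketch (plain floating point, no enclosures, so it yields an approximation rather than the verified condition (72)) discretizes the system x − λkx = 0, ∫x dt − c = 0 with a Nyström rule and applies Newton's method. The kernel k(s,t) = s·t, the normalization c = 0.5, the grid, and the starting guess are illustrative assumptions, not data from the paper; for this kernel the exact eigenpair is λ = 3, x(s) = s, since (kx)(s) = s ∫₀¹ t² dt = s/3.

```python
import numpy as np

n = 101
s = np.linspace(0.0, 1.0, n)
w = np.full(n, 1.0 / (n - 1)); w[0] = w[-1] = 0.5 / (n - 1)   # trapezoid weights
K = np.outer(s, s) * w                     # Nystroem matrix for k(s,t) = s*t
c = 0.5                                    # prescribed value of the integral of x

x, lam = s.copy(), 2.5                     # illustrative starting guess
for _ in range(50):
    # residual of (x - lam*K x ; w.x - c) and its Jacobian
    F = np.concatenate([x - lam * (K @ x), [w @ x - c]])
    J = np.zeros((n + 1, n + 1))
    J[:n, :n] = np.eye(n) - lam * K
    J[:n, n] = -(K @ x)
    J[n, :n] = w
    d = np.linalg.solve(J, -F)
    x += d[:n]; lam += d[n]

print(round(lam, 4))                       # close to the exact eigenvalue 3
```

The normalization row fixes the otherwise free scaling of the eigenfunction, which is what makes the Jacobian nonsingular at the solution.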
9  Comments about implementation and applications
All validation algorithms as outlined in this paper have been implemented in PASCAL-SC (cf. [3]). We start with some remarks about realizations.
How to determine enclosures?
For a given real-valued function f ∈ M, the problem is to find an interval-valued enclosure F ∈ IM_n with a small diameter. The tools to be applied are as follows:
• Automatic differentiation: F is obtained by use of a Taylor expansion of f, where the remainder is enclosed within interval bounds by means of automatic differentiation (cf. Rall [23]).
• Fourier expansion: An enclosure is generated by means of a truncated interval Fourier expansion, automatically enclosing the truncation error.
• Algebraic formulation: Transcendental functions are enclosed by solving the defining equation; e.g., the exponential function u(z) := e^z can be written as
u′(z) − u(z) = 0,  u(0) = 1.
Realization of the enclosure operations
To give an idea how to proceed, we treat the case of multiplication in the framework of the functoid IM₂ = { A₀ + A₁s | A₀, A₁ ∈ IR },  0 ≤ s ≤ 1. For U(s) = B₀ + B₁s, V(s) = C₀ + C₁s, the product U(s)V(s) is not an element of IM₂. Application of the directed rounding ◊₂ yields
U(s) ◊ V(s) = ◊₂( B₀C₀ + (B₁C₀ + B₀C₁)s + B₁C₁s² ) = B₀C₀ + ( B₁C₀ + B₀C₁ + B₁C₁[0,1] )s ∈ IM₂;
this belongs to the space IM₂. The other operations ⊕, ⊖, ⊘, and other types of functoids, are treated analogously. Note that it suffices to know the results φ_j ⊙ φ_l, j, l = 1,…,n, for the basis elements φ_j, φ_l of IM_n (cf. (2)).
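As an illustration of this multiplication rule, here is a hedged Python sketch (not the PASCAL-SC implementation). It ignores floating-point rounding, which a real functoid arithmetic would handle by outward rounding, and shows only the absorption of the quadratic term B₁C₁s² into the linear coefficient via s² ∈ [0,1]·s for 0 ≤ s ≤ 1. Intervals are plain (lo, hi) tuples.

```python
def iadd(a, b):
    # interval addition
    return (a[0] + b[0], a[1] + b[1])

def imul(a, b):
    # interval multiplication: min/max over the endpoint products
    ps = [a[0] * b[0], a[0] * b[1], a[1] * b[0], a[1] * b[1]]
    return (min(ps), max(ps))

def functoid_mul(U, V):
    """U = (B0, B1), V = (C0, C1) encode B0 + B1*s and C0 + C1*s.
    Enclose U(s)*V(s) in IM2 for s in [0,1] by replacing s**2 with [0,1]*s."""
    B0, B1 = U
    C0, C1 = V
    A0 = imul(B0, C0)
    A1 = iadd(iadd(imul(B1, C0), imul(B0, C1)),
              imul(imul(B1, C1), (0.0, 1.0)))
    return (A0, A1)

U = ((1.0, 1.0), (2.0, 2.0))   # U(s) = 1 + 2s
V = ((3.0, 3.0), (1.0, 1.0))   # V(s) = 3 + s
A0, A1 = functoid_mul(U, V)
print(A0, A1)   # exact product 3 + 7s + 2s^2 -> A0 = (3,3), A1 = (7,9)
```

At s = 0.5, for instance, the true product (1+2·0.5)(3+0.5) = 7 lies inside the enclosure [3 + 7·0.5, 3 + 9·0.5] = [6.5, 7.5], as the theory requires.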
Now we list some of the examples we have treated by use of our E-Methods. In contrast to discretization schemes, the global behavior of the approximated solutions is known for each argument s. Through X(s), a mathematically guaranteed enclosure for x(s) is given; X(s) accounts for all kinds of errors occurring during the computation.
Example 9.1
Problem: x(s) = e^{−s} − 1/2 + (1/2)e^{−(s+1)} + (1/2) ∫₀¹ (s+1) e^{−st} x(t) dt.
Basis functions: s^i, i = 0,1,…,10.
Number of verified correct digits: 8.
Remark: A trapezoid rule with 16 grid points (cf. Kreß [15]) gives a result with one (non-verified) digit.
Example 9.2
Problem: x(s) = s² sin s cos s + ∫₀¹ k(s,t) x(t) dt.
Basis functions: s^i, i = 0,1,…,40.
Number of verified correct digits: 6.
Example 9.3
Problem: x(s) = 1 − 0.5s + 1.35s² + ∫₀ˢ e^{−λ(s−t)} (λ(s−t))^{k−1}/(k−1)! x(t) dt,  0 ≤ s ≤ 1.
Basis functions: s^i, i = 0,1,…,20.
Number of verified correct digits: depends on the parameter λ; for λ = 2, 7–8 digits are guaranteed.
Remark: This problem arises in connection with life distributions of machine components (renewal theory).
Example 9.4
Problem: s(¼ + 5s⁴) = ∫₀ˢ (¼s⁴ − t⁴) x(t) dt,  0 ≤ s ≤ 4.4.
Basis functions: s^i, i = 0,1,…,40.
Number of verified correct digits: 12–13.
Remark: With comparable cost, a special discretization method suggested by De Hoog and Weiss [5] yields an accuracy of 6–8 (non-verified) digits.
Example 9.5
Problem: x(s) = s⁴ − 2s³ + s + 3.5531 ∫₀¹ k(s,t) x(t) dt,  0 ≤ s ≤ 1, with the kernel k(s,t).
Basis functions: cos is, sin is, i = 0,1,…,20.
Number of verified correct digits: 6.
Remark: The kernel k(s,t) is Green's function of a boundary value problem.
Example 9.6
Problem: x(s) = s² − (1/2) ∫₋₁¹ k(s,t) x(t) dt,  −1 ≤ s ≤ 1.
Basis function: 1, subdivision of [−1,1] into 120 subintervals.
Number of verified correct digits: 1.
Remark: This is a problem arising in the physics of polymers (cf. Schlitt [24]).
References
[1] E. Adams and R. Lohner, Error Bounds and Sensitivity Analysis, in: R. S. Stepleman (ed.), Scientific Computing, North-Holland Publishing Company, Amsterdam, 1983.
[2] G. Alefeld and J. Herzberger, Einführung in die Intervallrechnung, Bibliographisches Institut, Mannheim, 1974.
[3] G. Bohlender, L. B. Rall, C. Ullrich and J. Wolff v. Gudenberg, Pascal-SC, Bibliographisches Institut, Mannheim, 1986.
[4] H. Cornelius and R. Lohner, Computing the Range of Values of Real Functions with Accuracy Higher Than Second Order, Computing 33, 331-347, 1984.
[5] F. de Hoog and R. Weiss, High order methods for Volterra integral equations of the first kind, SIAM J. Numer. Anal., Vol. 10, No. 4, 647-664, 1973.
[6] H.-J. Dobner, Contributions to computational analysis, Bull. Austral. Math. Soc., Vol. 41, 231-235, 1990.
[7] H.-J. Dobner and E. Kaucher, Self-validating Computations of linear and nonlinear integral equations of the second kind, in: C. Ullrich (ed.), Contributions to Computer Arithmetic and Self-validating Numerical Methods, J. C. Baltzer AG, Scientific Publishing Co., 273-290, 1990.
[8] S. Fenyö and H. W. Stolle, Theorie und Praxis der linearen Integralgleichungen 4, Birkhäuser, Basel / Boston / Stuttgart, 1984.
[9] E. Kaucher, U. Kulisch and C. Ullrich (eds.), Computer Arithmetic: Scientific Computation and Programming Languages, Teubner, Stuttgart, 1987.
[10] E. Kaucher and C. Schulz-Rinne, Aspects of Self-validating Numerics in Banach Spaces, in: Computer Arithmetic and Self-validating Numerical Methods, Academic Press, 269-299, 1990.
[11] E. Kaucher and W. L. Miranker, Self-Validating Numerics for Function Space Problems, Academic Press, New York, 1984.
[12] E. Kaucher and S. M. Rump, E-Methods for Fixed Point Equations f(x) = x, Computing 28, 31-42, 1982.
[13] J. Kießling, M. Lowes, A. Paulik, Genaue Rechnerarithmetik, Teubner, Stuttgart, 1988.
[14] W. Klein, Inclusion Methods for Linear and Nonlinear Systems of Fredholm Integral Equations of the Second Kind, this volume.
[15] R. Kreß, Linear integral equations, Springer, Berlin / Heidelberg / New York, 1989.
[16] U. Kulisch, Grundlagen des numerischen Rechnens, Bibliographisches Institut, Mannheim, 1976.
[17] U. Kulisch and W. L. Miranker, Computer Arithmetic in Theory and Practice, Academic Press, New York, 1981.
[18] U. Kulisch (ed.), Pascal-SC: A Pascal Extension for Scientific Computation; Information Manual and Floppy Disk, Version ST, Teubner, Stuttgart, 1987.
[19] P. Linz, Precise Bounds for Inverses of Integral Equation Operators, Intern. J. Computer Math., Vol. 24, 73-81, 1988.
[20] R. Lohner, Einschließung der Lösung gewöhnlicher Anfangs- und Randwertaufgaben und Anwendungen, Dissertation, Karlsruhe, 1988.
[21] E. Martensen, Potentialtheorie, Teubner, Stuttgart, 1968.
[22] E. J. Nyström, Über die praktische Auflösung von linearen Integralgleichungen mit Anwendungen auf Randwertaufgaben der Potentialtheorie, Soc. Scient. Fenn. Comm. Phys.-Math. IV. 15, 1-52, 1928.
[23] L. Rall, Automatic Differentiation: Techniques and Applications, Springer Lecture Notes in Computer Science 120, Berlin / Heidelberg / New York, 1981.
[24] D. W. Schlitt, Numerical solution of a singular integral equation encountered in polymer physics, J. of Math. Phys., Vol. 9, No. 3, 436-439, 1968.
[25] C. Wagner, On the solution of Fredholm integral equations of the second kind by iteration, J. Math. and Phys. 30, 23-30, 1951.
[26] W. Walter, Differential and Integral Inequalities, Springer, 1970.
[27] H. Wielandt, Das Iterationsverfahren bei nicht selbstadjungierten linearen Eigenwertaufgaben, Math. Z. 50, 93-143, 1943.
Enclosure Methods for Linear and Nonlinear Systems of Fredholm Integral Equations of the Second Kind
Wolfram Klein
Based on interval analysis and modified fixed point theorems, a self-validating algorithm for solving linear and nonlinear systems of Fredholm integral equations of the second kind is presented. In contrast to other methods, this algorithm works with Taylor series and therefore determines a continuous enclosure of the solution rather than a discrete approximation.
1  Introduction
1.1  Linear Fredholm Integral Equations of the Second Kind
Throughout this paper, the following notations will be used for the determination of a solution of the linear Fredholm integral equation of the second kind: Let M := C(D) denote the space of continuous functions on a given interval D := [α, β], D² := D × D. Together with the maximum norm ‖f‖_∞ := max_{s∈D} |f(s)|, f ∈ M, M is a Banach space.
We will consider the linear Fredholm integral equation of the second kind
(2)  y(s) − ∫_α^β k(s,t) y(t) dt = g(s),  s ∈ D,
with the unknown function y(s). The continuous functions k(s,t) ∈ C(D²) and g(s) ∈ C(D) are called kernel and right-hand side of the integral equation, respectively. In operator notation, (2) leads to
(3)  y − Ky = g
with the integral operator
(Ky)(s) := ∫_α^β k(s,t) y(t) dt,  y ∈ M,  s ∈ D.
The purpose of this paper is to develop a method for the determination of an enclosure of the solution y = y(s) in (2) as an element of a function space; therefore, the following definitions and theorems are needed:
Definition 1  A kernel k(s,t) ∈ C(D²) of the form
(4)  k(s,t) = Σ_{m=0}^N a_m(s) b_m(t),
with linearly independent functions a_m(s), m = 0,1,…,N, as well as b_m(t), m = 0,1,…,N, is called a degenerate kernel of order N.
Let us now consider a specialized method to solve integral equations with degenerate kernels:
Theorem 1  Because of (1), all solutions of the linear Fredholm integral equation with a degenerate kernel (4) of the form
(5)  y(s) − Σ_{m=0}^N a_m(s) ∫_α^β b_m(t) y(t) dt = g(s),  s ∈ D,
are given by
(6)  y(s) = g(s) + Σ_{m=0}^N a_m(s) ξ_m,  s ∈ D.
y(s) in (6) is a solution of (5) iff the vector ξ is a solution of the following system of linear equations:
(7)  ξ_m − Σ_{n=0}^N ( ∫_α^β b_m(t) a_n(t) dt ) ξ_n = ∫_α^β b_m(t) g(t) dt,  m = 0,1,…,N.
The basic idea of the proof is to insert equation (6) into equation (5):
(8)
By use of the linear independence of the functions a_m(s), m = 0,1,…,N, as well as b_m(t), m = 0,1,…,N, (8) is equivalent to the system of linear equations in (7); therefore the proof is completed.
Definition 2  The n-th iterate of the integral operator K in (3) is defined in the following way:
(K⁰y)(s) := y(s),
(Kⁿy)(s) := (K(K^{n−1}y))(s),  n ≥ 1.
The corresponding kernels are defined as follows:
k¹(s,t) := k(s,t),  kⁿ(s,t) := ∫_α^β k(s,τ) k^{n−1}(τ,t) dτ,  n ≥ 2.
An application of this definition allows another specialized method for solving the linear Fredholm integral equation of the second kind:
Theorem 2  Let K : C(D) → C(D) be a continuous endomorphism with the additional property that ‖K‖ < 1. Then
y = Σ_{n=0}^∞ Kⁿ g
is the only solution of the linear Fredholm integral equation. This solution y can be determined by the following iteration scheme:
y⁰ := g;  yⁿ := g + K y^{n−1},  n ≥ 1.
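A hedged numerical sketch of this iteration: with ‖K‖ < 1 the Neumann series converges to the unique solution. The kernel, grid, and right-hand side below are illustrative choices, not taken from the paper; the quadrature is a plain trapezoid rule, so the result is an approximation, not a verified enclosure.

```python
import numpy as np

n = 201
s = np.linspace(0.0, 1.0, n)
w = np.full(n, 1.0 / (n - 1)); w[0] = w[-1] = 0.5 / (n - 1)  # trapezoid weights

k = 0.25 * (s[:, None] + s[None, :])   # kernel k(s,t) = (s+t)/4, so ||K|| <= 1/2
g = 1.0 - 0.25 * (s + 0.5)             # chosen so that y(s) = 1 solves (2) exactly

y = g.copy()
for _ in range(60):                    # y^n = g + K y^{n-1}
    y = g + k @ (w * y)

print(np.max(np.abs(y - 1.0)))         # error shrinks like ||K||^n
```

Because the contraction factor here is about 1/2, sixty iterations drive the discrete error down to rounding level; the theorem's hypothesis ‖K‖ < 1 is exactly what guarantees this geometric decay.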
The proof of this theorem can easily be shown by use of the fundamental theorem on the Neumann series ([2]). The third method to determine a solution of the linear Fredholm integral equation employs general kernels. Both specialized methods mentioned above will now be combined: For each continuous kernel k(s,t) ∈ C(D²), Weierstraß' approximation theorem guarantees a suitable decomposition
(9)  k = k_D + k_S
of k such that
• k_D is a degenerate kernel of the form (4), and
• the norm of the integral operator K_S of k_S is 'small', that means K_S satisfies ‖K_S‖ < 1.
An employment of this decomposition leads to the following modification of the integral equation in (3) (K_D is the integral operator with kernel k_D):
(10)  y − K_D y − K_S y = g;
together with the operators R := I − K_S (R⁻¹ exists because of ‖K_S‖ < 1) and S := R⁻¹ K_D, (10) leads to
(11)  y − S y = R⁻¹ g.
Let r⁻¹ denote the kernel of the operator R⁻¹. The kernel of the integral equation (11) is again a degenerate kernel of order N:
(12)  Σ_{m=0}^N ã_m(s) ∫_α^β b̃_m(t) y(t) dt,
with new linearly independent functions ã_m(s), m = 0,1,…,N, as well as b̃_m(t), m = 0,1,…,N. That means, even in the case of these general kernels, the fundamental theorem of Fredholm guarantees (theoretically) the existence of a solution of (2) iff the homogeneous integral equation has only the trivial solution y(s) = 0. In practice, this solution can be determined by the formula (12) mentioned above, i.e., with a suitable combination of the two specialized methods above.
1.2  Interval functoid arithmetic
In the Banach space M := C(D) of continuous functions on a given interval D := [α, β], inner and outer operations Ω := {+, −, ·, /, ∫} on continuous functions and with elements of K = R or C, respectively, are defined. Let Φ = {φ_i}_{i∈N} denote a generating system of M. A functoid arithmetic on a computer is introduced in the well known way ([3]): The finite subspace of M,
S_T := { Σ_{i=1}^T a_i φ_i | a_i ∈ K },
is called a (functional) screen of M. Rounding operations ☐_T : M → S_T with the property
☐_T(f) = f  ∀ f ∈ S_T
are introduced, as well as operations in S_T:
f ☐∘ g := ☐_T(f ∘ g)  ∀ f, g ∈ S_T,  ∘ ∈ Ω.
(S_T, ☐Ω) is called a functoid. In order to define an interval function screen, the usual notations of interval arithmetic are used: The real interval space IR is the space of nonempty, closed and bounded intervals
[a] := [a̲, ā] := { x ∈ R : −∞ < a̲ ≤ x ≤ ā < ∞ };
the standard arithmetic operations in IR are defined pointwise (see [1]). For the determination of verified enclosures of real functions, an interval function screen IS_T is defined in the following way:
IS_T is a subset of the powerset PM of M; F ∈ IS_T is called an interval extension of f ∈ M if there holds f(x) ∈ F(x), ∀ x ∈ D. Operations in PM are defined in a pointwise sense: F ∘ G := { f ∘ g : f ∈ F, g ∈ G },  F, G ∈ PM,  ∘ ∈ Ω. A suitable rounding i☐_T : PM → IS_T with the properties
i☐_T(F) = F  ∀ F ∈ IS_T,
f ∈ i☐_T(f)  ∀ f ∈ M,  F ⊆ i☐_T(F)  ∀ F ∈ PM,
is defined. This admits the following definition of operations in IS_T:
F ☐∘ G := i☐_T(F ∘ G)  ∀ F, G ∈ IS_T,  ∘ ∈ Ω.
Based on these tools, algorithms for the enclosure of the solution of (2) will be described.
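The essential point of the rounding i☐_T is that it must widen the pointwise result F ∘ G outward, so that containment is never lost to machine arithmetic. A minimal hedged sketch of such outward rounding for plain machine intervals (a stand-in for the functoid case, not the implementation used in the paper), simulating directed rounding with `math.nextafter`:

```python
import math

def outward(lo, hi):
    """Widen by one ulp in each direction, covering the rounding error."""
    return (math.nextafter(lo, -math.inf), math.nextafter(hi, math.inf))

def i_add(a, b):
    # outward-rounded interval addition
    return outward(a[0] + b[0], a[1] + b[1])

def i_mul(a, b):
    # outward-rounded interval multiplication
    ps = [a[0] * b[0], a[0] * b[1], a[1] * b[0], a[1] * b[1]]
    return outward(min(ps), max(ps))

x = (0.1, 0.1)              # 0.1 is not exact in binary, so widening matters
y = i_add(x, x)
print(y[0] < 0.2 < y[1])    # the true value 0.2 is contained
```

Without the widening step, the computed sum would be a single floating-point number that merely approximates 0.2; with it, the containment property f ∈ F ⟹ f ∘ g ∈ F ☐∘ G survives on the machine.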
1.3  Enclosures of degenerate kernels
One problem of the algorithms mentioned above is the (practical) determination of the decomposition (9) and, therefore, especially on a computer, the determination of the functions a_m(s), m = 0,1,…,N, as well as b_m(t), m = 0,1,…,N, of the degenerate kernel k_D in (4). A very efficient and easily realized execution is the employment of an automatic two-dimensional Taylor expansion, resulting in the following equations (a generalized version of this method is proposed in the paper by H. C. Fischer in this book): For each kernel k with the infinite two-dimensional Taylor expansion
k(s,t) = Σ_{m=0}^∞ Σ_{n=0}^∞ α_{mn} (s − s₀)^m (t − t₀)^n,
a decomposition into a finite part as well as the Lagrange remainder is used. The coefficients α_{mn} and ρ_n of the functions a_m(s), b_m(t), m = 0,1,…,T, of the finite Taylor expansion of order T, as well as of the Lagrange remainder r_{T+1} of order T+1, are defined as follows:
k(s,t) = Σ_{m=0}^T Σ_{n=0}^T α_{mn} (s − s₀)^m (t − t₀)^n + Σ_{n=0}^{T+1} ρ_n (s − s₀)^n (t − t₀)^{T+1−n},
a_m(s) := (s − s₀)^m,  b_m(t) := Σ_{n=0}^T α_{mn} (t − t₀)^n,  m = 0,1,…,T.
The coefficients can be determined automatically and very efficiently. The (just defined) enclosures A_m(s), B_m(t) and R_{T+1}(s,t) ∈ IS_T of the functions a_m(s), b_m(t) and r_{T+1}(s,t) are determined easily by use of interval arithmetic and automatic differentiation. Therefore, each kernel k(s,t) ∈ C^{T+1}(D²) can be enclosed in the following way:
k(s,t) ∈ Σ_{m=0}^T A_m(s) B_m(t) + R_{T+1}(s,t).
This corresponds literally to the decomposition (9).
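A hedged illustration of this degenerate-kernel decomposition for the concrete kernel k(s,t) = e^{st} on [0,1]² (an example kernel chosen here, not one from the paper). The expansion e^{st} = Σ_m (st)^m/m! is already separable, with a_m(s) = s^m/m! and b_m(t) = t^m; the tail for m > T is bounded on [0,1]² by e − Σ_{m≤T} 1/m!, playing the role of the remainder R_{T+1}.

```python
import math

T = 10
a = [lambda s, m=m: s**m / math.factorial(m) for m in range(T + 1)]
b = [lambda t, m=m: t**m for m in range(T + 1)]
# tail bound: worst case on [0,1]^2 is at s = t = 1
tail = math.e - sum(1.0 / math.factorial(m) for m in range(T + 1))

def k_degenerate(s, t):
    # finite degenerate part  sum_m a_m(s) * b_m(t)
    return sum(a[m](s) * b[m](t) for m in range(T + 1))

# check: the true kernel stays within the degenerate part plus the tail bound
worst = max(abs(math.exp(s * t) - k_degenerate(s, t))
            for s in [i / 10 for i in range(11)]
            for t in [j / 10 for j in range(11)])
print(worst, "<=", tail)
```

With T = 10 the tail bound is already below 10⁻⁷, which is the sense in which the non-degenerate rest of the kernel can be made 'small' as required for K_S.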
2  Enclosure Methods for Special Kernels
2.1  Direct Method
In the case of a Fredholm integral equation with a degenerate kernel of order T (see (4)), the solution of the degenerate integral equation
(14)  y(s) − Σ_{m=0}^T a_m(s) ∫_α^β b_m(t) y(t) dt = g(s),  s ∈ D,
is given by
(15)  y(s) = g(s) + Σ_{m=0}^T a_m(s) ξ_m,  s ∈ D,
iff the vector ξ is a solution of the following system of linear equations:
ξ_m − Σ_{n=0}^T ( ∫_α^β b_m(t) a_n(t) dt ) ξ_n = ∫_α^β b_m(t) g(t) dt,  m = 0,1,…,T.
This method will now be generalized in order to get a functional enclosure of the solution y(s) in (15): Let us suppose that A_m and B_m ∈ IS_T are interval extensions of the functions a_m(s) and b_m(s) of the degenerate kernel k(s,t) = Σ_{m=0}^T a_m(s) b_m(t), with the property
a_m(s) ∈ A_m(s),  b_m(s) ∈ B_m(s),  m = 0,1,…,T,  s ∈ D.
By means of the rules of interval arithmetic, the degenerate kernel is contained in
k(s,t) ∈ Σ_{m=0}^T A_m(s) B_m(t);
it can be shown (see [4]) that
(16)  Y(s) = g(s) + Σ_{m=0}^T A_m(s) X_m
is an enclosure y(s) ∈ Y(s) of the solution of (15) if the interval vector X contains all point solutions of the system of real linear equations contained in
(17)  X_m − Σ_{n=0}^T ( ∫_α^β B_m(t) A_n(t) dt ) X_n = ∫_α^β B_m(t) g(t) dt,  m = 0,1,…,T.
Therefore, this method guarantees
• the existence and uniqueness of the solution y of (15) (provided the interval vector X in (17) exists) and, furthermore,
• it determines a functional enclosure Y in equation (16) of this solution y.
An implementation of this algorithm was executed in PASCAL-SC on an ATARI MEGA ST4 (see Section 7).
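A hedged point-arithmetic sketch of the direct method's mechanics (interval arithmetic and therefore the verification step are omitted; only the reduction to a linear system is shown). The example kernel k(s,t) = s·t of order T = 0, with a₀(s) = s, b₀(t) = t, g(s) = 1 on D = [0,1], is an illustrative choice, not an example from the paper.

```python
from fractions import Fraction

# linear system for xi:  xi - (int_0^1 b0(t) a0(t) dt) xi = int_0^1 b0(t) g(t) dt
# here: xi - (int_0^1 t*t dt) xi = int_0^1 t dt   =>   xi - xi/3 = 1/2
A = Fraction(1) - Fraction(1, 3)
r = Fraction(1, 2)
xi = r / A
print(xi)            # 3/4, so y(s) = g(s) + a0(s)*xi = 1 + (3/4) s

# sanity check against the integral equation  y(s) - s * int_0^1 t y(t) dt = 1
def y(s):
    return 1 + xi * s

integral = Fraction(1, 2) + xi * Fraction(1, 3)   # int_0^1 t y(t) dt, exact
assert y(Fraction(1)) - 1 * integral == 1         # equation holds at s = 1
```

In the verified version of this computation, the rational coefficients above become intervals, and the interval solution X of the linear system carries the existence guarantee.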
2.2  Indirect Method
As mentioned above, the solution of the Fredholm integral equation of the second kind (3), satisfying a contraction condition ‖K‖ < 1, is given by
y = (I − K)⁻¹ g = Σ_{n=0}^∞ Kⁿ g.
The value of this infinite sum y and, therefore, the solution of the integral equation (3) can be approximated by the following iteration scheme:
(18)  y⁰ := g;  y^{n+1} := g + K yⁿ,  n ≥ 0.
For the determination of an enclosure of the solution of (3) itself, the interval extension T of the continuous and compact operator T, indirectly used in (18), is needed:
Ty := g + Ky.
If this interval extension T of T in (18) satisfies the condition
(19)  T Y ⊆ Y
for a certain interval function Y ∈ IS_T (non-empty, closed, bounded and convex), then the following statements are true because of Schauder's fixed point theorem:
• There exists a fixed point ỹ of T, i.e., ỹ is a solution of (3), and, furthermore,
• Y is an enclosure of this solution: ỹ ∈ Y.
Therefore, this method admits the determination of a (functional) enclosure Y of the solution y of (3) in the case of ‖K‖ < 1.
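A hedged sketch of the containment test behind (19): if an interval iterate maps into itself, Schauder's theorem yields existence plus an enclosure. Kernel and right-hand side are illustrative choices, and the 'functoid' here degenerates to a single constant interval Y = [lo, hi] valid for all s in [0,1], which keeps the interval evaluation of K exact by hand.

```python
def T_enclosure(Y):
    """Enclose (g + K y)(s) for all y in Y and all s in [0,1], with g = 1 and
    k(s,t) = (s+t)/4 >= 0.  Since int_0^1 k(s,t) dt = s/4 + 1/8 ranges over
    [1/8, 3/8], K maps the constant interval [lo, hi] into the hull below."""
    lo, hi = Y
    klo, khi = 1.0 / 8.0, 3.0 / 8.0
    Klo = min(klo * lo, khi * lo)
    Khi = max(klo * hi, khi * hi)
    return (1.0 + Klo, 1.0 + Khi)

Y = (0.0, 2.0)
TY = T_enclosure(Y)
inside = Y[0] <= TY[0] and TY[1] <= Y[1]
print(TY, "subset of", Y, ":", inside)
```

Since the image (1.0, 1.75) lies inside (0.0, 2.0), the test certifies that a solution of this toy equation exists and lies in [1.0, 1.75] for every s; iterating T_enclosure would tighten the bounds further.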
3  Enclosure Method for General Kernels
The decomposition of the kernel k(s,t) proposed in Section 1, equ. (9),
k = k_D + k_S,
leads to a new degenerate kernel appearing in (11) (see (10)-(12)). Therefore, the two specialized methods proposed above will be combined, which leads to the following method for the determination of a functional enclosure of the solution of (2): By use of this new degenerate kernel, the linear Fredholm integral equation has the following form:
(20)  y(s) − Σ_{m=0}^N ã_m(s) ∫_α^β b̃_m(t) y(t) dt = f(s),  s ∈ D.
In (20) there are newly defined degenerate functions ã_m(s), 0 ≤ m ≤ N,
(21)  ã_m := R⁻¹ a_m,
and a new right-hand side
(22)  f := R⁻¹ g.
Because of ‖K_S‖ < 1, an approximation of each of these functions ã_m(s), f(s) can be determined by means of the indirect method as proposed in Theorem 2; furthermore, a functional enclosure Ã_m of ã_m in (21) as well as an enclosure F(s) of f(s) in (22) can be computed by use of the indirect enclosure method mentioned in Subsection 2.2. Subsequently, inserting these enclosures into equation (20), an enclosure Y(s) of the solution y(s) of the new degenerate interval integral equation in (20) is determined by use of the direct method of Subsection 2.1. Even in the case of general kernels this means that a functional enclosure of the solution can be determined.
4  Systems of Linear Integral Equations
In the following chapter, the methods for the one-dimensional linear Fredholm integral equation will be generalized to the N-dimensional case: Let us consider the following system of linear Fredholm integral equations of the second kind
(23)  y^i(s) − Σ_{j=1}^N ∫_α^β k^{ij}(s,t) y^j(t) dt = g^i(s),  s ∈ D,  i ∈ I_N,
where N ∈ N, I_N := {1, 2, …, N}, D := [α, β].
The idea is to introduce a suitable matrix-vector notation in order to get formulas analogous to the ones in the one-dimensional case: Let
a) k(s,t) := (( k^{ij}(s,t) ))_{i,j∈I_N} denote the matrix of kernels k^{ij}(s,t), and
b) y(s) := ( y^i(s) )_{i∈I_N} and g(s) := ( g^i(s) )_{i∈I_N} the vectors of the solution y and of the right-hand side g, respectively.
For the integral, the componentwise definition ( ∫ y(t) dt )^i := ∫ y^i(t) dt is used.
These notations lead to the following equivalent formulation of the given system of linear integral equations (23):
(24)  y(s) − ∫_α^β k(s,t) · y(t) dt = g(s),  s ∈ D.
This indeed resembles formula (2) (note: k(s,t) · y(t) is a matrix-vector product). Based on the maximum norm ‖·‖_∞ for continuous functions in C(D), the vector norm
‖x‖ := max_{i∈I_N} ‖x^i‖_∞,  x ∈ C^N(D),
is used in C^N(D). For the matrix K(s,t) of the Fredholm integral operators (( K^{ij} ))_{i,j∈I_N}, this yields the following inequality:
(25)  ‖Kx‖ ≤ ( max_{i∈I_N} Σ_{j=1}^N ‖K^{ij}‖ ) · ‖x‖.
Additionally, in the case of Σ_{i=1}^N Σ_{j=1}^N ‖K^{ij}‖ ≤ 1, the following inequality holds: ‖Kx‖ ≤ ‖x‖. The purpose of the present paper is a generalization of the special methods as well as of the methods for general kernels which have been derived above for the one-dimensional case.
4.1  Degenerate systems of linear Fredholm Integral Equations
Let us assume that each kernel k^{ij}(s,t) in the system of integral equations (23) has the following form:
(26)  k^{ij}(s,t) = Σ_{m=0}^T a^i_m(s) b^{ij}_m(t),  ∀ i, j ∈ I_N,  ∀ (s,t) ∈ D²,
with a fixed T ∈ N and with linearly independent functions a^i_m(s), m = 0,1,…,T, ∀ i ∈ I_N, as well as b^{ij}_m(t), m = 0,1,…,T, ∀ i, j ∈ I_N. Notice that the following properties are not essential restrictions of the generalization: (i) the independence from i and j of the fixed index T of the order of degeneration, and (ii) the independence from j of the functions a^i_m(s); these restrictions are based on the practical employment of automatic differentiation with Taylor series in order to get the decomposition (26), cf. Subsection 1.3. A suitable matrix notation for these functions a^i_m(s), b^{ij}_m(t) is introduced in the following way: Define diagonal matrices with elements a^i_m(s):
a_m(s) := diag( a¹_m(s), a²_m(s), …, a^N_m(s) ),  m = 0,1,…,T,
as well as full matrices with elements b^{ij}_m(t):
b_m(t) := ( b^{ij}_m(t) )_{i,j∈I_N},  m = 0,1,…,T;
then the matrix kernel k(s,t) corresponding to (24) has the following form, employing the usual sum and product of matrices:
k(s,t) = Σ_{m=0}^T a_m(s) b_m(t),  ∀ (s,t) ∈ D².
This means that this matrix/vector notation does indeed lead to a notation similar to formula (4), and Theorem 1 can be applied in a generalized manner. Let us now assume that A^i_m(s) as well as B^{ij}_m(s) are interval functions ∈ IS_T with the property
a^i_m(s) ∈ A^i_m(s),  b^{ij}_m(s) ∈ B^{ij}_m(s),  ∀ i, j ∈ I_N,  ∀ s ∈ D.
Additionally, interval matrices A_m, B_m are defined analogously to the point matrices above:
A_m := diag( A¹_m, A²_m, …, A^N_m ),  m = 0,1,…,T,
and
B_m := ( B^{ij}_m )_{i,j∈I_N},  m = 0,1,…,T.
This leads to the following enclosure method for the solution vector y of the linear system (24): The solution vector y of the system of linear Fredholm integral equations (24) with degenerate kernels (26) is contained in
(27)  Y(s) = g(s) + Σ_{m=0}^T A_m(s) X_m,
provided the interval vectors X_m are enclosures of all point solutions of the systems of real linear equations contained in
X_m − Σ_{n=0}^T ( ∫_α^β B_m(t) A_n(t) dt ) X_n = ∫_α^β B_m(t) g(t) dt,  m = 0,1,…,T.
These statements again resemble those of the one-dimensional case in Subsection 2.1; the method itself guarantees the existence of a continuous enclosure Y in (27) of the solution of the degenerate system of linear integral equations.
4.2  Indirect method for systems of linear Fredholm Integral Equations
With respect to the inequality (25), it is assumed that the integral operators K^{ij} of the system (24) satisfy the condition
(28)  Σ_{i=1}^N Σ_{j=1}^N ‖K^{ij}‖ ≤ 1.
Let K̃^{ij} denote the interval extensions of the operators K^{ij} with
K^{ij} y ∈ K̃^{ij} Y  ∀ Y ∈ IS_T,  ∀ y ∈ Y,  ∀ i, j ∈ I_N,
and let K̃ := (( K̃^{ij} ))_{i,j∈I_N} denote the corresponding interval-valued matrix integral operator. The interval extension T̃Y := g + K̃Y of the corresponding operator T (employed indirectly in (24)) is introduced, and the subset relation '⊆' is used componentwise. With these definitions, the following algorithm for the determination of an enclosure of the solution vector of the system of linear integral equations can be formulated under the assumption (28): If one of the iterates Y^{i+1} ∈ IS_T^N in the iteration
Y⁰ := g;  Y^{i+1} := T̃ Y^i,  i ≥ 0,
satisfies the condition Y^{i+1} ⊆ Y^i, then the solution vector y of (24) is contained in Y^{i+1} (see [4]). This means that this iteration suggests an approach to the determination of a functional enclosure of the solution of (23).
4.3  Systems with general kernels
Analogously to the one-dimensional case, a decomposition (based on Weierstraß' approximation theorem)
k^{ij} = k^{ij}_D + k^{ij}_S,  ∀ i, j ∈ I_N,
is used for each kernel k^{ij} of the linear system of integral equations (24); in contrast to the one-dimensional case,
• each kernel k^{ij}_D is assumed to be a degenerate kernel of the form (4) of the same order T of degeneration ∀ i, j ∈ I_N (see (26)), and
• the norms of the integral operators K^{ij}_S of k^{ij}_S have to satisfy Σ_{i=1}^N Σ_{j=1}^N ‖K^{ij}_S‖ ≤ 1.
It can be shown [4] that the same (generalized) method as outlined in Section 3 may be used: Interval functions A^i_m as well as B^{ij}_m ∈ IS_T with the property
a^i_m(s) ∈ A^i_m(s),  b^{ij}_m(s) ∈ B^{ij}_m(s),  ∀ i, j ∈ I_N,  ∀ s ∈ D,
and the corresponding interval matrices A_m, B_m of Subsection 4.1 are introduced for the degenerate part of the algorithm; the interval matrix K̃_S := (( K̃^{ij}_S ))_{i,j∈I_N} containing the integral operators K^{ij}_S, as well as the enclosure G of the right-hand side vector g, are used for the indirect method: first, this indirect enclosure method of Subsection 4.2 is applied to the column vectors A^j_m of the matrices A_m as well as to the right-hand side vector G in the following two iteration schemes:
C^{j,0}_m := A^j_m;  C^{j,i+1}_m := A^j_m + K̃_S C^{j,i}_m,  i ≥ 0,
and
F⁰ := G;  F^{i+1} := G + K̃_S F^i,  i ≥ 0.
The following conditions are assumed to be satisfied:
(i) for each m, 0 ≤ m ≤ T, and each j, 1 ≤ j ≤ N:  C^{j,i+1}_m ⊆ C^{j,i}_m for a certain index i = i_{m,j};
(ii) for a certain index i:  F^{i+1} ⊆ F^i.
Then an interval system of equations may be constructed which consists of the resulting enclosures C^j_m := C^{j,i+1}_m and F := F^{i+1}:
(I − M) X = R,  with
M_{mn} := ∫_α^β B_m(t) C_n(t) dt,  0 ≤ m, n ≤ T,
and
R_m := ∫_α^β B_m(t) F(t) dt,  0 ≤ m ≤ T.
In a third step, the enclosure vector X of all point solutions of this interval system is used to determine an enclosure of the final solution vector Y of the integral system (24):
Y(s) = F(s) + Σ_{m=0}^T C_m(s) X_m.
5  The Nonlinear Case
5.1  The nonlinear Fredholm Integral Equation
In this chapter, the following nonlinear Fredholm integral equation of the second kind will be considered:
(29)  y(s) = g(s) + ∫_α^β f(s,t,y(t)) dt,  s ∈ D;
f has to be continuous and differentiable with respect to the (unknown, continuous) function y. This solution y(s) of (29) can be approximated by means of the following iteration scheme:
(30)  y^{n+1}(s) := g(s) + ∫_α^β f(s,t,yⁿ(t)) dt,  n ≥ 0;
furthermore, the existence and uniqueness of this solution can be shown by applying the mean value theorem to the iteration scheme (30) and to the iterates y^{n+1}(s) and yⁿ(s) defined above, respectively. Using the notation
(31)  y^{n+1}(s) − yⁿ(s) = ∫_α^β f_y(s,t,ξⁿ(t)) ( yⁿ(t) − y^{n−1}(t) ) dt
with a certain unknown function ξⁿ(t) depending on the iterates yⁿ(s) and y^{n−1}(s). Based on the additional assumption
equation (31) yields

and, for m < n, n, m ∈ N, respectively. This guarantees the existence and uniqueness of the solution of (29) as approximated by the iteration scheme (30) mentioned above: the sequence {y^n}_{n∈N} converges to the solution y(s) of (29) (for a contraction constant < 1) (see [4]). Another way to arrive at these results is the employment of an integral operator
In fact, it can be shown [4] that T is a continuous and compact operator. Therefore, in the case T : U → U (U ⊂ C(D), nonempty, convex, closed, and bounded), Schauder's fixed point theorem guarantees the existence of a fixed point ŷ ∈ U of T, which is a solution of (29). Some additional notations will be introduced to obtain an iteration scheme analogous to the one in (30), however employing the residual (defect) of the approximation:
Furthermore, with ỹ an approximate solution, a defect function h is defined as follows:

h(ỹ) := −ỹ + F(ỹ) + g.
An application of the mean value theorem and an application of a generalized Newton iteration scheme, respectively, yields
and, therefore, the following iteration scheme (y^0 := ỹ):

Δy^n(s) − ∫_D f_y(s, t, y^n(t)) Δy^n(t) dt = h(y^n(s)),
y^{n+1}(s) := y^n(s) + Δy^n(s),   n ≥ 0.    (33)
In contrast to the iteration scheme (30), this iteration employs the residual Δy(s) of the approximation and not the approximation itself. Additionally, (33) is again a linear integral equation with respect to Δy^n(s); therefore, it can be solved by means of the methods presented above.
5.2   Enclosure Method

The notations of Subsection 5.1 admit the following theorem:

Theorem 3  Suppose that Y ∈ IS(M) is an interval function satisfying

g + F(ỹ) + F_y(ỹ ∪ Y) · (Y − ỹ) ⊆ Y

with a certain continuous function ỹ and the interval hull ∪ of two functions. Then there exists a fixed point ŷ of the operator T, Ty := g + Fy, i.e., ŷ is a solution of (29) and, furthermore, ŷ ∈ Y.  □

The proof can easily be obtained by applying the mean value theorem and by use of the remarks concerning formula (32). The algorithm for the determination of an enclosure for the nonlinear Fredholm integral equation is a combination of Theorem 3 and of the iteration scheme (33): let ỹ be a certain continuous approximation of the solution y of (29), and let m(Y), Y ∈ IS(M), denote its midpoint function.
Algorithm

ΔY^0 := 0;  n := −1;
REPEAT
  n := n + 1;
  h_n := −[ỹ] + F([ỹ]) + [g];
  ΔY^{n+1} := h_n + F_y((ỹ + ΔY^n) ∪ ỹ) · ΔY^n;
  ỹ := ỹ + m(ΔY^{n+1});
UNTIL ΔY^{n+1} ⊆ ΔY^n.
Solution:  ŷ ∈ [ỹ] + ΔY^{n+1}.

Remarks:

• An implementation of this algorithm was carried out by use of PASCAL-SC ([6]) and with the techniques mentioned in Sections 2.2 and 2.3, i.e., with interval function screens and automatic differentiation.

• It can be shown ([4]) that the argument ỹ ∪ Y of the integrand f_y in Theorem 3, and its enclosure, respectively, can be determined by use of automatic Taylor differentiation.

• These methods may also be applied to suitably generalized nonlinear integral equations of the form
as well as to systems of nonlinear Fredholm integral equations of the second kind, with a suitable matrix/vector notation ([4]).
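The structure of this defect iteration can be illustrated with a toy scalar analogue. In the sketch below, the hypothetical fixed-point problem y = g + F(y) with F(y) = 0.25·sin(y) and g = 1 stands in for the integral equation (all data are illustrative assumptions, not taken from the text); the iteration with the hull ỹ ∪ (ỹ + ΔYⁿ) is run until the inclusion ΔY^{n+1} ⊆ ΔYⁿ is verified, using an epsilon inflation (a standard device, not spelled out here) and plain floating point instead of directed rounding:

```python
import math

# Toy scalar analogue of the enclosure Algorithm above.  The fixed-point
# problem y = g + F(y) with F(y) = 0.25*sin(y), g = 1 stands in for the
# integral equation.  The defect iteration
#     DY^{n+1} := h + F'(ytilde hull (ytilde + DY^n)) * DY^n,
#     ytilde   := ytilde + m(DY^{n+1}),
# is run until the inclusion DY^{n+1} in DY^n is verified.  The epsilon
# inflation is an assumed standard device; directed rounding is omitted.

def hull(a, b):
    return (min(a, b), max(a, b))

def deriv_enclosure(Y):
    """Enclosure of F'(y) = 0.25*cos(y) over Y (cos is decreasing on [0, pi])."""
    return (0.25 * math.cos(Y[1]), 0.25 * math.cos(Y[0]))

def imul(A, B):
    ps = [A[0] * B[0], A[0] * B[1], A[1] * B[0], A[1] * B[1]]
    return (min(ps), max(ps))

g, yt = 1.0, 1.2                  # right-hand side and starting approximation
DY = (0.0, 0.0)
for n in range(100):
    h = -yt + 0.25 * math.sin(yt) + g                   # defect of ytilde
    Y = hull(min(yt, yt + DY[0]), max(yt, yt + DY[1]))  # ytilde hull (ytilde+DY)
    d = imul(deriv_enclosure(Y), DY)
    DYnew = (h + d[0], h + d[1])
    if DY[0] <= DYnew[0] and DYnew[1] <= DY[1]:
        break                                  # verified inclusion
    yt += 0.5 * (DYnew[0] + DYnew[1])          # ytilde := ytilde + midpoint
    DY = (DYnew[0] - 1e-10, DYnew[1] + 1e-10)  # epsilon inflation
lo, hi = yt + DYnew[0], yt + DYnew[1]
print(lo, hi)    # a tight interval around the fixed point y*
```

The same pattern — defect, interval extension of the derivative over a hull, inclusion check — is what the PASCAL-SC implementation performs with interval function screens in place of scalar intervals.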
6
Numerical Results
All of the proposed algorithms have been implemented in the programming language PASCAL-SC, Version II, on an Atari Mega ST4 ([6]). Several modules for functoid arithmetic, interval functoid arithmetic, and automatic two-dimensional Taylor expansion admit a simple handling of the programs; in particular, the input data of the programs for the integral equations can be formulated in the usual mathematical notation.
6.1
Functoid Arithmetic, Interval Functoid Arithmetic
The following example gives an idea of how to work with functions on a computer. The standard Newton algorithm for the determination of a root of a function f is applied to a function f_a depending on a parameter a, in order to obtain the root, again in dependence on this parameter a. Let us consider the following function f_a, depending on a parameter a; we want to determine its root x̂ = x̂(a) in dependence on this parameter a. Starting with the following arbitrary polynomial in a,

x^0(a) := 5 + 0.2a + 0.25a² + 0.8a³,

the Newton algorithm is applied to functions in S_10(M) (see (13), Section 2.2), with the monomial base Φ_10 := {1, a, a², a³, …, a¹⁰}. After 10 iterations this algorithm leads to the following polynomial in a:

Iteration index 10:
x^{10}(a) = 1.000000000001 + 9.999999999857·10⁻¹ · a + …

(the coefficients of a², …, a¹⁰ alternate in sign and are of magnitude roughly 10⁻⁸ and below).
This is an approximation of the root x̂(a) of the function f_a in dependence on a (the exact solution: x̂(a) = 1 + a). This example establishes the determination of an approximation of the solution in a functional sense. The disadvantages of this example are clear: it is only possible to determine an approximation of the solution, because of (i) the error in the Newton algorithm, (ii) the finite degree of the polynomials, and (iii) the rounding errors.
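The flavor of this functoid Newton iteration can be sketched in ordinary floating-point arithmetic. Since the concrete function f_a of the text is not reproduced here, the sketch below uses the hypothetical stand-in f_a(x) = x² − (1 + a)², whose exact root is x̂(a) = 1 + a, and performs Newton steps directly on coefficient vectors truncated to degree 10, imitating the screen S_10(M):

```python
# Newton's method applied to polynomials in the parameter a, truncated to
# degree 10 (a plain-arithmetic imitation of the screen S_10(M)).  The
# stand-in problem is f_a(x) = x^2 - (1 + a)^2, an assumption since the
# text's concrete f_a is not available; its exact root is x(a) = 1 + a.

DEG = 10   # truncation degree

def mul(p, q):
    """Product of two coefficient lists, truncated to degree DEG."""
    r = [0.0] * (DEG + 1)
    for i, pi in enumerate(p):
        for j, qj in enumerate(q):
            if i + j <= DEG:
                r[i + j] += pi * qj
    return r

def add(p, q):
    return [x + y for x, y in zip(p, q)]

def scale(p, c):
    return [c * x for x in p]

def inv(p):
    """Truncated power-series inverse of p (p[0] != 0), by Newton iteration."""
    q = [1.0 / p[0]] + [0.0] * DEG
    for _ in range(6):   # quadratic convergence: 6 steps suffice for DEG = 10
        q = mul(q, add([2.0] + [0.0] * DEG, scale(mul(p, q), -1.0)))
    return q

# (1 + a)^2 as a truncated polynomial
target = mul([1.0, 1.0] + [0.0] * (DEG - 1), [1.0, 1.0] + [0.0] * (DEG - 1))

x = [5.0, 0.2, 0.25, 0.8] + [0.0] * (DEG - 3)    # starting polynomial x^0(a)
for _ in range(30):
    f = add(mul(x, x), scale(target, -1.0))      # f_a(x(a)) as a polynomial
    x = add(x, scale(mul(f, inv(scale(x, 2.0))), -1.0))   # Newton step
print(x[0], x[1])    # both converge to 1, i.e. x(a) ~ 1 + a
```

As in the text's example, the constant and linear coefficients approach 1 while the higher coefficients decay toward 0; only rounding and truncation errors remain.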
The subsequent example shows how to avoid these problems. We wish to determine the value S_a of the infinite series (34) in dependence on the parameter a, with a certain starting element

x_a^0 := (1/35) · (1225 − a²).

A rearrangement of (34) yields the following iteration scheme:

x_a^{n+1} := (1/35) · (1225 − a²) + (a/35) · x_a^n,   a ∈ [0, 1],  n ≥ 0.
This may be used for the determination of an approximation x_a^{n+1} of S_a. The application of an interval function screen IS_10(M) (see (14), Section 2.2) and of the interval extension of this iteration, respectively, as well as a generalization of Theorem 2, leads to the following interval polynomial:
Iteration index 16:

S(a) ∈ [34.99999999998, 35.00000000002]
     + [9.999999999994·10⁻¹, 1.000000000001] · a
     + …    (35)

(the coefficient intervals of a², …, a¹⁰ are symmetric about 0, with radii between about 10⁻¹² and 10⁻¹⁶).
In contrast to the previous example, the solution of (34) itself, i.e. the value of the infinite series (in this example the function S(a) = 35 + a), is contained in the interval polynomial (35). That means, especially in contrast to the disadvantages of the previous example mentioned above, that the value of the infinite series is contained in (35) in spite of (i) the fact that (34) is an infinite series, (ii) the approximation error of the finite interval function screen, and (iii) all rounding errors.
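The mechanism can be sketched with a toy interval-polynomial arithmetic. The code below carries the coefficients of x_aⁿ as intervals, applies the reconstructed recurrence x^{n+1} := (1225 − a²)/35 + (a/35)·xⁿ coefficientwise, and iterates until every new coefficient interval is contained in its predecessor. The epsilon inflation is an assumed standard device, and directed rounding as well as a rigorous bound for the truncated tail are omitted, so this is an illustration rather than a verified computation:

```python
# Toy interval-polynomial ("interval function screen") iteration for the
# series example above.  Coefficients are intervals; the iteration stops
# once every new coefficient interval is contained in its predecessor.

DEG, EPS = 10, 1e-12

# base polynomial (1225 - a^2)/35 = 35 - a^2/35 with interval coefficients
base = [(35.0, 35.0), (0.0, 0.0), (-1.0 / 35.0, -1.0 / 35.0)] \
       + [(0.0, 0.0)] * (DEG - 2)

def step(X):
    """base + (a/35)*X, truncated to degree DEG (tail ignored: sketch)."""
    Y = list(base)
    for k in range(1, DEG + 1):
        lo, hi = X[k - 1]
        Y[k] = (Y[k][0] + lo / 35.0, Y[k][1] + hi / 35.0)
    return Y

def included(Xn, Xo):
    return all(o[0] <= n[0] and n[1] <= o[1] for n, o in zip(Xn, Xo))

X = base
for _ in range(60):
    Xn = step(X)
    if included(Xn, X):
        break                                     # verified inclusion
    X = [(lo - EPS, hi + EPS) for lo, hi in Xn]   # epsilon inflation
print(Xn[0], Xn[1])   # constant and linear coefficient: near 35 and 1
```

The enclosed limit function is S(a) = 35 + a, in agreement with the interval polynomial (35) above.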
6.2
Fredholm Integral Equations
6.2.1
The linear one-dimensional Fredholm Integral Equation
The three algorithms proposed in Sections 3 and 4 were tested for different types of Fredholm integral equations: the degenerate case leads to highly satisfactory results (however, a transformation between different bases seems to be necessary: spline, trigonometric, … → monomial base). Analogously satisfactory results of this kind were obtained in the case of integral equations satisfying a contraction condition. A large number of tests have been carried out for integral equations with general kernels (and therefore for the combination of both specialized methods in Section 3) on different domains. The following table shows some numerical examples and results. For these results, the corresponding subsequent graphs illustrate the dependence of the number of correct decimal digits of the solution and of the computing time, respectively, on the degree T of the Taylor polynomials.
(Table: test kernels and right-hand sides — e.g. (s + t²)³ and sin(s) − t³ — on the domains [0, 1] and [1, 2], together with the resulting numbers of correct decimal digits, ranging from 2 to 12.)
The linear one-dimensional Fredholm Integral Equation of the Second Kind

(Display equation: y(s) − (1/3) ∫ … y(t) dt = …, with a kernel involving s · sin(s · …), for s ∈ [0, 1].)

(Figure: number of correct decimal digits in dependence on the Taylor degree.)

(Figure: computing time, between about 3 min and 30 min, in dependence on the Taylor degree, up to degree 20.)
6.2.2   Systems of Linear Fredholm Integral Equations
Two different types of systems of linear integral equations of the second kind were tested:

• the 'normal' linear system of order N, with N² different kernels and N different right-hand sides, and, furthermore,

• linear systems arising from a linear one-dimensional integral equation by subdividing the given domain into N² subdomains.

Because of the application of automatic Taylor differentiation, the (additional) subdivision of the given domain leads to better results, as compared with an increase of the degree of the Taylor polynomials, and this within a shorter computation time. Different tests were made with systems up to a dimension of (25 × 25) (i.e. with 625 kernels); two examples of dimension (2 × 2) will be presented, with the following notation for the (unknown) solution vector: Y(s) = (y⁽¹⁾(s), y⁽²⁾(s))ᵀ. The examples are given by
(Display equations: two (2 × 2) example systems, with kernels such as s · e^{s·t} and right-hand sides involving cos(s), e^s, and sin(s · …).)
6.2.3   Nonlinear Fredholm Integral Equations
Analogously to all iteration schemes of Newton type, the quality of the (enclosure of the) solution as well as the computing time depend strongly on the starting value. Furthermore, the iterative application of the linearized methods increases the cost considerably. The following equations exhibit several of the tested nonlinear Fredholm integral equations as well as some nonlinear equations of a generalized type. The first example is given by
Then, some nonlinear equations of a generalized type have been considered:

y(s) − (1/2) ∫₀¹ y(s) / (1 + y(t)) dt = s,   0 ≤ s ≤ 1.
References

[1] Alefeld, G., Herzberger, J.: An Introduction to Interval Computations. Academic Press, New York, 1983 (ISBN 0-12-049820-0).

[2] Heuser, H.: Funktionalanalysis. B.G. Teubner, Stuttgart, 1986.

[3] Kaucher, E., Miranker, W.L.: Self-Validating Numerics for Function Space Problems. Academic Press, 1984 (ISBN 0-12-402020-8).

[4] Klein, Wolfram: Zur Einschließung der Lösung von linearen und nichtlinearen Fredholmschen Integralgleichungssystemen zweiter Art. Dissertation, University of Karlsruhe, FRG, 1990.

[5] Kulisch, U.W., Miranker, W.L. (eds.): A New Approach to Scientific Computation. Academic Press, New York, 1983 (ISBN 0-12-428660-7).

[6] Kulisch, U.W. (ed.): PASCAL-SC: A PASCAL Extension for Scientific Computation; Information Manual and Floppy Disks; Version ATARI-ST. B.G. Teubner Verlag, Stuttgart, 1987 (ISBN 3-519-02108-0).
Acknowledgment

I am especially grateful to Prof. Kaucher for many helpful ideas and valuable discussions. Furthermore, I want to thank Prof. Kulisch for his support and for the possibility to carry out my work at the Institute directed by him.
A Step Size Control for Lohner’s Enclosure Algorithm for Ordinary Differential Equations with Initial Conditions W. Rufeger and E. Adams
Lohner’s Enclosure Algorithm for Ordinary Differential Equations with Initial Conditions is supplemented by an automatic control of the step size. In the present paper, the control has been developed mainly in view of the computability of the upper and the lower bounds of the enclosure in a close neighborhood of a pole in the Restricted Three Body Problem. Applications to other problems are being investigated.*
1
Introduction
For any system of equations, the practical determination of the value(s) of a true solution rests on the execution of a suitable algorithm. Generally, rounding errors are then unavoidable, and there are additional procedural errors if the algorithm is chosen as a truncation of an infinite sequence of arithmetic operations. In the presence of numerical errors of these kinds, an algorithm delivers only an approximation of the value(s) of the unknown true solution. With the exception of sufficiently simple problems, (i) the computability of an approximation does not imply the existence of the true solution (e.g. because of spurious difference solutions [7], [15], [24]) and (ii) an a priori error estimate does not yield the desired quantitatively tight error bounds.
These difficulties are particularly pronounced for systems of equations with differential or integral operators, which therefore refer to function spaces. In most cases, there is then the unknown discretization error concerning the practical finite-dimensional approximation in a Euclidean space. Consequently, it is desirable to verify a set of values (the "enclosure") which is guaranteed to contain the unknown values of the true solution(s). This set is bounded by one-sided approximations, namely an upper bound and a lower bound.

* A large part of the numerical work presented here was executed by use of a PC kws EB68/20 with a PASCAL-SC compiler made available by Prof. Dr.-Ing. Straub, University of the Armed Forces, Munich.

Scientific Computing with Automatic Result Verification. Copyright © 1993 by Academic Press, Inc. All rights of reproduction in any form reserved. ISBN 0-12-044210-8.
These bounds are to be computed such that all local procedural and/or rounding errors are fully taken into account, i.e., in the sense of increasing the upper bound and decreasing the lower bound. The enclosure property to be guaranteed rests on the tacit assumption that there are no errors in the employed software (i.e. the enclosure algorithms) and hardware. The enclosure algorithms developed in the Institute for Applied Mathematics of the University of Karlsruhe rest on the Kulisch computer arithmetic [19]. They can be executed by means of the computer languages PASCAL-SC [9], ACRITH-XSC [29], or PASCAL-XSC [18], or by means of the subroutine library ACRITH [28] supporting FORTRAN. The present paper is concerned with nonlinear ordinary differential equations (ODEs) whose side conditions generate either an initial value problem (IVP) or a boundary value problem (BVP). Discretizations are traditionally the most important practical method for the determination of difference solutions, serving as approximations of the unknown values of the true solution(s). Since difference methods ignore discretization errors, the present paper addresses the totally error-controlled construction of enclosures of true solutions of ODEs. The chapters [7] in this volume address

• the theoretical and practical unreliability of difference methods and

• the corresponding problem of "computational chaos"; this stresses

• the practical importance of enclosure methods for investigations concerning the reliability of a chosen discretization in the case of a given system of nonlinear ODEs; in this context see [23].
Remarks:
(1.) By definition, “chaos” implies a high sensitivity with respect to perturbations
of all kinds. When using numerical methods there is then a corresponding sensitivity concerning numerical errors. There arises the question whether or not (computational) chaos in a set of difference solutions implies chaos in the corresponding set of true solutions of a system of ODEs, [6], [7].
(2.) Obviously, and as will be discussed, the computational determination of an enclosure is costly and sensitive with respect to the choices of the artificial parameters in the enclosure algorithm.
Concerning a true solution of a system of evolution-type ODEs, an enclosure algorithm can be executed only for bounded intervals of the independent variable t, "time". This property holds naturally in the case of periodic true solutions solving the ODEs and a boundary condition of periodicity. For their verified computational assessment, an interval shooting method may be used, which generates auxiliary IVPs. With periodic solutions as a predominant final goal, as in [1] and in [3], the
present paper is concerned only with enclosure algorithms for IVPs on bounded intervals of t in conjunction with ODEs which (locally) are highly sensitive with respect to perturbations and numerical errors. This sensitivity can be practically compensated for by means of suitable automatic choices of the artificial parameters of the enclosure algorithm, a task to be addressed subsequently on the basis of the first author's diploma thesis [22]. Whereas all details are presented in [22], the present paper concentrates on the control strategy.
2
On Lohner's Enclosure Algorithm for IVPs
Systems of autonomous, explicit, nonlinear ODEs with initial conditions are considered:

y′ = f(y) for t ≥ t₀;  f : D → Rⁿ,  f ∈ C^p(D),  D ⊂ Rⁿ;
y(t₀) = y₀ ∈ D;  y = (y₁, …, y_n)ᵀ;  f = (f₁, …, f_n)ᵀ;    (1)

f is a composition of rational and standard functions. By extension, non-autonomous systems y′ = g(t, y) can be represented by means of (1). The true solutions of (1) are denoted by y* = y*(t, y₀). With [·] an interval and × the Cartesian product of sets, an enclosure [y] of y* consists of a pair of bounds, y̲ and ȳ, such that
"Lohner's enclosure algorithm" ([20], [2]), to be addressed subsequently concerning IVPs and (2), can be characterized as follows: (i) it rests on an explicit one step method based on a grid { t o , tl, - - - ,ti, - - .} and -
on componentwise Taylor polynomials of order p − 1, which are supplemented by the Taylor remainder terms of order p, i.e., the expressions for the local discretization errors;

(ii) it is executed by means of rounded interval arithmetic ([8], [21]);

(iii) provided the final enclosure has already been determined and verified for t ∈ [t₀, t_j], the continuation of the construction for t ∈ [t_j, t_{j+1}] is concerned with an IVP consisting of y′ = f(y) and the admission of all y(t_j) ∈ [y(t_j)]; this vectorial IVP is represented equivalently by a system of n scalar Volterra integral equations;
(iv) for each one of these integral equations and t ∈ [t_j, t_{j+1}], an "a priori set" [y⁰_{j+1}] ⊇ [y(t_j)] and h_{j+1} := t_{j+1} − t_j are chosen such that the conditions of the Banach fixed point theorem are satisfied, employing interval extensions of the integral equations [21]; this verifies that all solutions starting in [y(t_j)] exist and are contained in [y⁰_{j+1}] for t ∈ [t_j, t_{j+1}];

(v) subsequent to the determination of this "crude first" enclosure, this set is tightened by means of interval extensions of the Taylor polynomials and their remainder terms, making use of enclosures of the derivatives f′, f″, …, f⁽ᵖ⁾ of f [21]; if there is an automatic (local) control of h_{j+1}, the corresponding tests are carried out in the presently discussed step (v) of the algorithm;

(vi) the interval enclosure for t ∈ [t_j, t_{j+1}] may be supplemented by the determination of a suitable parallelepiped in Rⁿ if this set is a "better enclosure" of the set of solutions for t ∈ [t_j, t_{j+1}]; this refers to the "wrapping effect" ([2], [21]) of enclosing a set in Rⁿ by an "outer wrapping".
Remarks:

1.) The value of h_{j+1} in (iv) is either an input datum of the algorithm or (automatically) controlled in the execution of step (v), as will be discussed.

2.) The employed interval extensions of the Taylor polynomials and the remainder terms yield continuous and spline-like bounds of the enclosure for t ∈ [t₀, t_{j+1}].

3.) The local componentwise employment of Taylor polynomials and remainder terms corresponds to the one of (truncated) power series expansions of the components of the solutions; see Chang and Corliss [11] for estimates of the radii of convergence.

The execution of this enclosure algorithm possesses the following properties:

(α) For the continuation from t_j to t_{j+1}, there is the set of all initial vectors contained in the enclosure computed at t_j.

(β) For t ∈ (t_j, t_{j+1}], this enclosure contains the true set of solutions with the initial vectors referred to in (α); however, the enclosure exceeds this set because of the numerical errors accounted for. Additionally, there are overestimates because of the employed interval methods ([8], [21]) and the wrapping effect [21].

(γ) The mapping condition of the Banach fixed point theorem is the bottleneck [25] of the construction of the enclosure; see step (v) of the algorithm. In fact, if h₁, …, h_{j+1}, … are input data, their values must be chosen sufficiently small such that there is a "reasonable chance" for a fulfillment of the mapping condition in the sequence of time steps. A major purpose of an (automatic) step size control is a near-optimal choice of h_{j+1} in view of this condition in the step from t_j to t_{j+1}.
As a consequence of (β), an additional excess is acquired in each time step. Therefore, the execution of the enclosure algorithm possesses a built-in "pollution" or even "blow-up" tendency, which has to be sufficiently delayed by suitable (local) choices of the artificial parameters. Numerical experience indicates that a suitable local control of h_{j+1} is a more effective means of enforcing the continuation of the construction than a corresponding local control of the order p of the remainder term of the Taylor polynomials.
Remark: Because of numerical experience for true solutions of evolution problems, computed enclosures generally possess an exponential growth of their widths as time t increases. The widths can be kept sufficiently small for suitable intervals of time t provided the numerical method and its artificial parameters are chosen accordingly. Correspondingly, the step size control to be presented generally cannot prevent an exponential growth of the enclosure; rather, the rate of growth can be kept sufficiently small for suitable intervals of time. Additional numerical experience indicates the importance of sufficiently small values of the following quantities: the width of the enclosure,

d_ij := ȳ_i(t_j) − y̲_i(t_j)   for i = 1(1)n and j ∈ N fixed,    (3)

and its (componentwise) rate of growth D_ij.    (4)

The reasons for the desired smallness of d_ij and D_ij are as follows:

(i) the "pollution" of the enclosure by its excess is then correspondingly small;

(ii) if present, a locally expanding character of a set of solutions in phase space then possesses a correspondingly small influence;

(iii) this is also true for the "self-excitation" [3] of the enclosure as a consequence of "extraneous ODEs" which, different from y′ = f(y) in (1) and perhaps unstable, possess interval extensions within the computed enclosures.
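Both monitored quantities can be computed directly from the enclosure bounds. In the sketch below, d_ij is the componentwise width (3); the ratio of successive widths is used as a stand-in for the rate of growth (4), which is an assumption rather than necessarily the paper's exact definition:

```python
# Monitoring the enclosure width d_ij (3) and a growth-rate quantity for
# (4).  The ratio D_{i,j+1} = d_{i,j+1}/d_{i,j} below is a labeled
# assumption standing in for the paper's definition.

def widths(lower, upper):
    """d_ij = upper_i(t_j) - lower_i(t_j), i = 1(1)n."""
    return [u - l for l, u in zip(lower, upper)]

def growth_rates(d_prev, d_next):
    """Assumed rate of growth: D_{i,j+1} = d_{i,j+1} / d_{i,j}."""
    return [dn / dp for dp, dn in zip(d_prev, d_next)]

# illustrative enclosures [y(t_j)] and [y(t_{j+1})] of a 2-dimensional system
d0 = widths([0.99, -0.51], [1.01, -0.49])    # widths about 0.02
d1 = widths([0.97, -0.53], [1.03, -0.47])    # widths about 0.06
print(growth_rates(d0, d1))                  # both rates about 3
```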
Concerning (3) and (4), the (local) choice of the step size exerts contradictory influences on the various contributions to the enclosure, since h should be chosen as large as possible in order to keep the following quantities sufficiently small:

(A) the (numerical) cost of the execution of the enclosure algorithm and
(B) the growth of the enclosure as a consequence of rounded interval arithmetic and the wrapping effect.
On the other hand, in view of the machine precision, h should be chosen as small as possible in order to keep the local discretization errors sufficiently small. If h is not suitably selected (particularly if h is too large), numerical experience indicates that there may be a strong growth of ȳ(t_j) − y̲(t_j) as j increases. Almost always this leads to a termination of the execution of the algorithm, generally because of non-admissible arguments of real-valued functions such as the logarithm or the square root.
Remark:
In applications of Lohner's enclosure algorithm preceding the work [22] by the first author, a non-automatic local control of h was occasionally used, either (i) as part of the input data, e.g. [20], or (ii) by means of a continuous personal supervision of the execution of the enclosure algorithm as, e.g., in [1].
3
Goals of the (Automatic Local) Step Size Control for Lohner’s Enclosure Algorithm
As an introduction to this topic, step size controls of difference methods (for ODEs) are characterized as follows, in view of the absence of verified quantitative error controls:

(i) they rest on qualitatively adopted estimates of the discretization error, which is not (quantitatively) accessible on the level of these methods;
(ii) some controls employ ad hoc strategies whose usefulness has been shown in certain test examples, see [26];
(iii) these controls are heuristic; generally, an inappropriate choice of h cannot be recognized by means of properties of the computed difference solution.
The step size control [22] for Lohner’s enclosure algorithm [20] can be characterized as follows:
(α) it rests quantitatively on the enclosure of the global discretization error and of its rate of growth;
(β) this is supplemented by an estimate of the excess of the computed enclosure [y(t_{j+1})] as compared with the set of true solutions starting in [y(t_j)]; for this purpose, a few approximations of solutions y*(t, y_j) are used which start on the boundary of [y(t_j)];

(γ) for every detail of the control strategy, there is a theoretical justification;

(δ) an inappropriate local choice of h can be recognized by means of the (almost immediate) subsequent growth of the computed enclosure.
Applications of automatic or non-automatic step size controls for Lohner's enclosure algorithm are predominantly of interest for the following purposes:
(I) either to keep the numerical cost sufficiently small,

(II) or, for all t ∈ [t₀, t_m], to enforce the computability of an enclosure in the case of ODEs with high sensitivities concerning perturbations and numerical errors.
In view of (I), and for ODEs without such sensitivities, a control of h for Lohner's enclosure algorithm is being developed by H. J. Stetter and coworkers ([25], [17]). The control of h to be presented here addresses goal (II) for a sufficiently difficult sample problem: the determination of orbits of the Restricted Three Body Problem close to a pole of the ODEs. For other sample problems with high sensitivities, adaptations of the control of h are being developed by the authors.
Remarks:

(1.) The built-in blow-up tendency of the enclosure addressed in Section 2 is particularly severe for problems with high sensitivities. An enforcement of the computability of the enclosure is then the most important purpose for the construction of enclosures in "perturbation-sensitive neighborhoods" in phase space.
(2.) Particularly in the case of "chaotic dynamics" and their strange attractors [7], the set of true solutions might be globally and strongly expanding in a large subdomain of the phase space. The automatic control then yields correspondingly small step sizes h₁, …, h_{j+1}, …
Because of the preceding discussions, it is clear that an achievement of goal (II) stipulates the enforcement of suitably small values of the width of the enclosure,
and of its growth rate, where, according to numerical experience,

c_{i,j+1} := 10^{E(d_ij)}  for c_{i,j+1} in (4),    (7)

with E(·) the exponent in a decimal floating-point number representation.
The quantity D_{j+1} depends on

(i) the local rate of expansion of the set of true solutions (in phase space) starting in [y(t_j)] and,

(ii) at t = t_{j+1}, the overestimate of this set by means of [y(t_{j+1})].

In fact, since y(t₀) ∈ D is a point, d_{j+1} is a result of all the overestimates (excesses) in the time steps from t₀ to t_{j+1}. For t ∈ [t_j, t_{j+1}], the local excess of the enclosure is a gauge for this overestimate. This excess is the domain in Rⁿ possessing the outer boundary ∂[y(t)] and the inner boundary ∂S(t), where, for t ∈ [t_j, t_{j+1}], S(t) is the set of solutions starting in [y(t_j)]. Obviously, ∂S(t) is unknown, with the exception of ∂S(t_j) = ∂[y(t_j)].
4   Strategies for the Local Control of h

4.1   The Local Excess
It is assumed

• that an enclosure [y(t)] has already been determined for t ∈ [0, t_j] and

• that a first choice, h_{j+1} := h_{j+1,0}, of a step size has already been made.
For the control of h_{j+1}, an approximation of ∂S(t) and, therefore, of the excess will now be determined. Because of properties required in (1), the domain invariance theorem ([13], [16]) in topology asserts that ∂[y(t_j)] is mapped onto ∂S(t) for t ≠ t_j by the continuous and injective operator representing the solutions of (1) [13]. If y′ = f(y) is linear, ∂S(t) is a parallelepiped that is completely characterized by its 2ⁿ corners, which are the images of the 2ⁿ corners of ∂[y(t_j)]. For the nonlinear ODEs y′ = f(y) under consideration here, there is generally no such simple characterization of ∂S(t). In order to acquire some information on ∂S(t), a heuristic test of the excess is now introduced. For this purpose, a relatively small number, M, of starting points ỹ^{(ν)}(t_j) ∈ ∂[y(t_j)], ν = 1(1)M, is suitably chosen. For t = t_{j+1} := t_j + h_{j+1} and these starting points, a pointwise approximation ỹ^{(ν)}(t_{j+1}) of ∂S(t_{j+1}) is obtained by means of the Taylor polynomials which are employed in step (i) of the enclosure algorithm; see Section 2.
On the basis of numerical experience, the adopted test of the excess rests on a partitioning of [y(t_{j+1})] = ×_{i=1}^n [y̲_{i,j+1}, ȳ_{i,j+1}] ⊂ Rⁿ into an inner subinterval ×_{i=1}^n [y̌_{i,j+1}, ŷ_{i,j+1}] ⊂ Rⁿ and the remaining outer shell, with a fixed choice of a sufficiently small ε ∈ R₊. The test either

(A) accepts the value of h_{j+1} if, for i = 1(1)n, there are values ỹ_i^{(ν)}(t_{j+1}) ∈ [y̲_{i,j+1}, y̌_{i,j+1}] and ỹ_i^{(μ)}(t_{j+1}) ∈ [ŷ_{i,j+1}, ȳ_{i,j+1}] for at least one ν and one μ in the set {1, …, M}, or the test

(B) requires the choice of a new step size h_{j+1} := h_{j+1,1}, such that h_{j+1,1} < h_{j+1,0}, provided condition (A) is not satisfied. For a fixed j + 1 ∈ N, only one decrease h_{j+1,0} → h_{j+1,1} is allowed, in order to avoid a sequence of step sizes h_{j+1} approaching the machine precision.

Provided the condition in (A) is satisfied, there is a second test for h_{j+1}, implying the first choice h_{j+2,0} for the next time step.
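Test (A) can be sketched as a function of the sample points and the enclosure box. The shell width ε·(box width) used below is an assumed concrete choice for the inner/outer partitioning; the logic follows the text: the step size is accepted only if, in every coordinate, at least one boundary-started approximation lands in the lower part of the outer shell and at least one in the upper part:

```python
# Sketch of the acceptance test (A).  The shell width eps*(box width) is
# an assumed concrete partitioning choice.

def accept_step(samples, lower, upper, eps=0.05):
    """samples: approximations at t_{j+1}; lower/upper: the enclosure box."""
    for i in range(len(lower)):
        shell = eps * (upper[i] - lower[i])
        in_low = any(lower[i] <= s[i] <= lower[i] + shell for s in samples)
        in_high = any(upper[i] - shell <= s[i] <= upper[i] for s in samples)
        if not (in_low and in_high):
            return False     # excess too large: case (B), decrease h
    return True              # case (A): accept h_{j+1}

# tight enclosure: the samples nearly fill the box -> accepted
print(accept_step([[0.01, 1.99], [0.99, 1.02]], [0.0, 1.0], [1.0, 2.0]))
# large excess: the samples cluster in the middle -> rejected
print(accept_step([[0.5, 1.5], [0.52, 1.48]], [0.0, 1.0], [1.0, 2.0]))
```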
4.2   On the Choices of the Step Sizes h_{j+1,1} and h_{j+2,0}
Concerning the goal (II) as stated in Section 3, numerical experience indicates that the choices of h_{j+1,1} and subsequently h_{j+2,0} should additionally be subject to the inequalities

δ ≤ D_{j+1} ≤ γ,    (10)

where δ, γ ∈ R₊ are suitably chosen.
On the basis of the first author's diploma thesis [22], the step sizes h_{j+1,1} and h_{j+2,0} are chosen by means of the following rules:

If condition (A) in Section 4.1 is not satisfied, then
If condition (A) is satisfied but D_{j+1} ≤ γ is not true (i.e., if D_{j+1} > γ), then, according to the user's option, either

h_{j+1,1} := 0.1 · h_{j+1,0}   if h_{j+1,0} = 0.1…,

or

h_{j+1,1} := … + 0.08 · 10^{E(h_{j+1,0})}   if h_{j+1,0} = 0.a… with a ≠ 1,    (12)

with E(·) as defined in (7).

If condition (A) in Section 4.1 and additionally D_{j+1} ≤ γ in Section 4.2 are satisfied, then (h_{j+1})_f := h_{j+1,0}, with f for final. For the next time step, h_{j+2,0} is chosen as a function of h_{j+1,0}.    (13)
A choice of h_{j+2,0} bigger than h_{j+1,0} is not meaningful in case the solution to be enclosed is in a perturbation-sensitive neighborhood of the phase space, where D_{j+1} is relatively large. Suitable choices of ε in Section 4.1 and of γ, δ in Section 4.2 depend on the true solution being enclosed. These choices should be made individually for given systems of ODEs, by means of preliminary numerical experiments and experience.
Remark:
Without a loss of generality, (11) and (12) hold for the case of h_{j+1,0} < 1. As a side condition concerning h_{j+2,0}, the mapping condition in step (iv) of the enclosure algorithm must be satisfied, see Section 2. This has always been true in all applications of the control presented here. The conservative upper bound for h_{j+2,0} as a function of h_{j+1,0} in (13) was chosen because of numerical experience. In fact, unless the execution of the enclosure algorithm possesses an ability to "look ahead", the execution of the algorithm may "unexpectedly" enter a perturbation-sensitive neighborhood in phase space, with relatively large values of D_{j+1}. A less conservative upper bound of h_{j+2,0} as a function of h_{j+1,0} may then lead to a blow-up of the execution of the algorithm. The sensitive neighborhoods just addressed are related to singularities in the complex extension of the phase space, which is obtained through a replacement of time by a complex independent variable. For extensions of this kind, Chang and Corliss (e.g. [11]) have developed and applied estimates of the local radius of convergence of the infinite Taylor series which, by truncation, generates the Taylor polynomials employed as discretization methods. These estimates provide the ability to "look ahead".
Step Size Control
Remark: Subsequent to the completion of the first author's paper [22], one of the estimates by Corliss and Chang was incorporated in a correspondingly extended version of the control algorithm presented here. For the Restricted Three Body Problem in Section 5, preliminary numerical experience with this extension asserts the ability of this control to detect the approach of a (perhaps still far away) perturbation-sensitive neighborhood.
5 Problem Area Chosen for the Development of the Control
In view of Section 3, ODEs with a locally high perturbation-sensitivity are the major goal for the development of the step size control presented here. Sensitivities of this kind occur, e.g., in neighborhoods of poles of ODEs. An example is the Restricted Three Body Problem, with orbits (y₁, y₂, y₁′, y₂′)ᵀ in a four-dimensional phase space. For this problem, see Section 9.2 in [7] and [12]. In the usual representation in a rotating basis ([10], [14], [27]), the (autonomous and explicit) ODEs possess poles at the fixed positions of the "earth" and the "moon". In many papers (e.g. [10]), the unique orbit has been considered which contains the point
    ŷ_P(0) = (1.2, 0, 0, −1.04935750983)ᵀ.   (14)
Starting at ŷ_P(0), high precision difference methods yield an approximation, ŷ_P, which at t = T̂ = 6.192169331396 almost returns to ŷ_P(0), [20], [22]. Therefore, ŷ_P is believed to be an approximation of a hypothetical true periodic solution, y*_P, with a period T ≈ T̂. Figure 1 depicts the projection of ŷ_P into the y₁–y₂ plane, with the pole "earth" located at (−1/82.45, 0). By use of his enclosure algorithm with double precision, p = 22, and 13 prescribed changes of h, R. Lohner [20] has verified that the true orbit, y*, containing ŷ_P(0) is almost closed. In fact, the four computed components [y_i(T̂)] of the enclosure [y(T̂)] possess very small widths, and they contain the components of ŷ_P(0).
By means of Lohner's enclosure algorithm with single precision, p = 22, and the (automatic) control presented here, the true orbit y* was enclosed in [22], making use of the PC referred to in the footnote on the first page of this chapter. For this purpose, the following choices of the artificial parameters were made:

    γ = 1.4,  δ = 0.06,  ε = 0.1,  and  h_{1,0} = 0.03.   (15)
For t ∈ [0, T̂], the control generated 1182 time steps, t_k, with 14 changes of h_{j+1} at different t_k. The enclosures determined in [20] or [22] and ŷ_P coincide within the graphical accuracy of Figure 1.
Figure 1: Projection into the y₁–y₂ Plane of a High Precision Approximation ŷ_P of a (Hypothetical) Periodic Solution of the Restricted Three Body Problem
For another choice of the starting point,

    η₁ := (−3.709913156265, 2.911961898819, −6.434622630570, −2.729679082780)ᵀ · 10^{-*},   (16)

Lohner's enclosure algorithm, with the step size control presented here, yielded an enclosure of the true orbit y*₁. The projection of y*₁ into the y₁–y₂ plane is depicted in Figure 2. For the determination of the enclosure of y*₁, the values (15) of the artificial parameters were used, with the exception of h_{1,0} = 7 · 10^{-*}. The reason for the choice of η₁ will now be explained on the basis of the first author's [22] numerical experience with Runge-Kutta methods as applied to the problems under discussion. Starting at ŷ_P(0) and by means of a classical Runge-Kutta method with h = 5 · 10⁻³, an approximation ŷ_q of y*_P was determined in [22]. Figure 3 displays the projection of ŷ_q into the y₁–y₂ plane. The point η₁ belongs to ŷ_q. Since ||ŷ_q − y*_P||∞ attains relatively large values, ŷ_q is said to divert at η₁ from the true solution y*_P. According to the more detailed discussion of this phenomenon in [4], [5], [6] and [7], ŷ_q diverts every time this difference solution comes close to the pole at the "earth". By use of numerical experience for this example ([4], [6], [7]), diversions of difference solutions can be avoided when suitable step size controls of Runge-Kutta methods are used. In the case of the Lorenz ODEs ([5], [6], [7]), this is not so, even when
Figure 2: Projection into the y₁–y₂ Plane of a True Solution y*₁, Starting at the Point η₁ Defined in (16)
a Runge-Kutta-Fehlberg method of order eight has been used with its automatic control of h. Here, even difference methods with an "optimal control" of step size h and order p have been observed which divert at the stable manifold of the hyperbolic stationary point at the origin, [6], [7].
6 On Generalizations of the Presented Control
In [22], decreases or increases of h_{j+1} are governed by the fixed rules (10) - (13), respectively. An additional dependency of these rules on the solution being enclosed is desirable, particularly by means of information for t ∉ [t_j, t_{j+1}]. For the continuation of the enclosure for t > t_{j+1}, it is obviously important to be able to "look ahead". For this purpose, there are the following practical possibilities:

(i) either the employment of a sufficiently close approximation of the true solution to be enclosed for t ∈ [t₀, t_∞], as has been suggested by H. J. Stetter and coworkers [25], [17];

(ii) or, according to Chang and Corliss [11], an estimate of the local radii of convergence of the (infinite) Taylor series, corresponding to the Taylor polynomials with remainder terms for t ∈ [t_j, t_{j+1}].
A local control of the order p of the Taylor polynomials is desirable in addition to the presented control of h_{j+1}. In the case of Runge-Kutta-Fehlberg methods, a local
Figure 3: Projection into the y₁–y₂ Plane of an Approximation ŷ_q, Computed by Means of the Classical Runge-Kutta Method Starting at ŷ_P(0)
control of the step size and the order of consistency is executed by means of a local comparison of results for three different choices of this order [26]. A control of p is also possible here, provided a suitable supplement of Lohner's enclosure algorithm can be carried out efficiently. At present, p is a fixed input parameter of this algorithm.
Remarks:
(1.) For a successful incorporation of these generalizations of the control strategy, sufficiently many and diversified numerical experiments are required. For this purpose, the periodic solution of the highly stiff Oregonator Problem [1] is being used in addition to the problem addressed in Section 5. Preliminary experience indicated a need for minor modifications of the rules (10) - (13) of the employed control.
(2.) These generalizations should be studied in conjunction with a suitable replacement of the verification step (iv) of the enclosure algorithm in Section 2; at present this step rests on interval extensions for Volterra integral equations.
(3.) The desirable development of an "optimal control" requires a preliminary analysis of the concept of "optimality" in the present context.
7 Concluding Remarks
(1.) The presented control of the step size h_{j+1} rests on the following properties, which here are quantitatively accessible, as opposed to the case of discretizations such as Runge-Kutta(-Fehlberg) methods:
(a) for t ∈ [0, t_{j+1}], the continuous enclosure of the global discretization error and its rate of change, and
(b) a practically useful estimate of the ("polluting") excess of this enclosure as compared with the set of true solutions.

(2.) The parameters in (a) and (b) are the most important quantities for the computability of the enclosure.
(3.) Applications of Lohner's enclosure algorithm (for IVPs) are particularly important in the case of perturbation-sensitive neighborhoods in phase spaces. Controls of the kind presented here are then instrumental for the practical computability of enclosures.
(4.) Concerning Lohner's enclosure algorithm with and without a control, the numerical cost is obviously much larger than the one in the case of discretizations such as Runge-Kutta-Fehlberg methods. This may be irrelevant
(α) in view of the reliability of (error-controlled) enclosures and
(β) if a user's PC is employed rather than a mainframe to be shared with other users.
(5.) An important generalization is the ability to "look ahead", [11]. In fact, a perturbation-sensitive neighborhood in phase space is not necessarily known in advance. An example is the stable manifold of a saddle point, which may be the locus of diversions of difference solutions (even of the highest order of consistency that is compatible with the employed number format [6], [7]).
References

[1] E. Adams, A. Holzmüller, D. Straub, The Periodic Solutions of the Oregonator and Verification of Results, Comp. Suppl. 5, p. 111 - 112, 1988.
[2] E. Adams, Enclosure Methods and Scientific Computation, p. 3 - 31 in: Numerical and Applied Mathematics, ed.: W. F. Ames, J. C. Baltzer, Basel, 1989.
[3] E. Adams, Periodic Solutions: Enclosure, Verification, and Applications, p. 199 - 245 in: Computer Arithmetic and Self-validating Numerical Methods, ed.: Ch. Ullrich, Academic Press, Boston, 1990.
[4] E. Adams, W. Rufeger, Diverting Difference Solutions, Particularly in Celestial Mechanics, Proc. 13th IMACS World Congress, Dublin, 1991, Vol. 1, p. 355 - 356.
[5] E. Adams, W. Kühn, On Computational Chaos for the Lorenz ODEs, Proc. 13th IMACS World Congress, Dublin, 1991, Vol. 1, p. 353 - 354.
[6] E. Adams, W. F. Ames, W. Kühn, W. Rufeger, H. Spreuer, Computational Chaos May be Due to a Single Local Error, to appear in J. Comp. Physics.
[7] E. Adams, The Reliability Question for Discretizations of Evolution Problems, I: Theoretical Considerations on Failures, II: Practical Failures, this volume.
[8] G. Alefeld, J. Herzberger, Introduction to Interval Computations, Academic Press, New York, 1983.
[9] G. Bohlender, L. B. Rall, Ch. Ullrich, J. Wolff von Gudenberg, PASCAL-SC: Wirkungsvoll programmieren, kontrolliert rechnen, Bibliogr. Institut, Mannheim, 1986.
[10] R. Bulirsch, J. Stoer, Numerical Treatment of Ordinary Differential Equations by Extrapolation Methods, Num. Math. 8, p. 1 - 13, 1966.
[11] Y. F. Chang, G. F. Corliss, Solving Ordinary Differential Equations Using Taylor Series, TOMS 8, p. 114 - 144, 1982.
[12] J. W. Daniel, R. E. Moore, Computation and Theory in Ordinary Differential Equations, W. H. Freeman, San Francisco, 1970.
[13] K. Deimling, Nichtlineare Gleichungen und Abbildungsgrad, Springer-Verlag, Berlin, 1974.
[14] S. Filippi, Das Verfahren von Runge-Kutta-Fehlberg zur numerischen Lösung von Mehrkörperproblemen, p. 307 - 324 in: Mathematische Methoden der Himmelsmechanik und Astronautik, ed.: E. Stiefel, Bibliogr. Institut, Mannheim, 1966.
[15] A. Iserles, A. T. Peplow, A. M. Stuart, A Unified Approach to Spurious Solutions Introduced by Time Discretisation, Part I: Basic Theory, SIAM J. Num. Anal. 28, p. 1723 - 1751, 1991.
[16] E. Kasriel, Undergraduate Topology, W. B. Saunders, Philadelphia, 1971.
[17] M. Kerbl, Step Size Control Strategies for Inclusion Algorithms for ODEs, p. 437 - 452 in: Computer Arithmetic, Validated Computation and Mathematical Modelling, eds.: E. Kaucher, S. M. Markov, C. Mayer, J. C. Baltzer (IMACS), Basel, 1991.
Or: M. Kerbl, Effiziente globale Steuerung von Einschließungsalgorithmen zur Lösung gewöhnlicher Differentialgleichungen, Doctoral Dissertation, Wien, 1991.
[18] R. Klatte, U. Kulisch, M. Neaga, D. Ratz, Ch. Ullrich, PASCAL-XSC, Springer-Verlag, Heidelberg, 1991.
[19] U. Kulisch, W. L. Miranker, The Arithmetic of the Digital Computer: A New Approach, SIAM Review 28, p. 1 - 40, 1986.
[20] R. Lohner, Einschließung der Lösung gewöhnlicher Anfangs- und Randwertaufgaben und Anwendungen, Doctoral Dissertation, Karlsruhe, 1988.
[21] R. E. Moore, Methods and Applications of Interval Analysis, SIAM, Philadelphia, 1979.
[22] W. Rufeger, Numerische Ergebnisse der Himmelsmechanik und Entwicklung einer Schrittweitensteuerung des Lohnerschen Einschließungs-Algorithmus, Diploma Thesis, Karlsruhe, 1990.
[23] U. Schulte, Einschließungsverfahren zur Bewertung von Getriebeschwingungsmodellen, Doctoral Dissertation, Karlsruhe, 1991.
[24] H. Spreuer, E. Adams, On Extraneous Solutions with Uniformly Bounded Difference Quotients for a Discrete Analogy of a Nonlinear Ordinary Boundary Value Problem, J. Eng. Math. 19, p. 45 - 55, 1985.
[25] H. J. Stetter, Validated Solution of Initial Value Problems for ODEs, p. 171 - 187 in: Computer Arithmetic and Self-validating Numerical Methods, ed.: Ch. Ullrich, Academic Press, Boston, 1990.
[26] J. Stoer, R. Bulirsch, Numerische Mathematik, vol. 2, 3rd edition, Springer, Berlin, 1990.
[27] A. H. Stroud, Numerical Quadrature and Solution of Ordinary Differential Equations, Springer, New York, 1974.
[28] IBM High Accuracy Arithmetic Subroutine Library (ACRITH), General Information Manual, 3rd edition GC 33-6169-02, IBM Corporation, 1986. [29] IBM High Accuracy Arithmetic - Extended Scientific Computation (ACRITH-XSC), GC 33-6461-01, IBM Corporation, 1990.
Interval Arithmetic in Staggered Correction Format

Rudolf J. Lohner
Real intervals are usually realized on a computer in such a way that their bounds are chosen as machine numbers, i.e. as floating point numbers with a fixed precision. Then interval arithmetic is implemented by use of floating point operations with directed rounding and with results of the same precision. If the machine has floating point hardware with directed rounding, then this kind of interval arithmetic can run at full hardware speed. Alternatively, sometimes multiple precision real data types are used, again with fixed or with variable precision, especially if ill conditioned problems have to be solved or if there are very high requirements for accuracy of the results. In this case all arithmetic operations must be rewritten, since multiple precision arithmetic has to be simulated by integer arithmetic on the computer. Using fast floating point hardware is no longer possible on traditional hardware. In this paper we discuss an approach located between these two extremes, which can take full advantage of floating point hardware provided only one additional operation - the exact scalar product with a long accumulator - is added to the hardware. We represent an interval as a sum of n floating point numbers a_i, i = 1, ..., n (which are stored separately) plus one interval A with floating point bounds (a so called staggered correction format). Then all four basic operations +, -, ·, / as well as √ can be efficiently implemented using sums and scalar products of floating point numbers which are computed exactly in the long accumulator.

We sketch some aspects of the implementation of such a staggered interval arithmetic and give some applications.
1 Introduction
Scientific computations on computers are primarily executed by use of floating point arithmetic, since powerful floating point hardware has been developed performing these computations at extremely high speed. However, since pure floating point computations may be totally unreliable (e.g. [29], [10], [26]) it is often necessary to use interval arithmetic to get verified and safe results. It is quite natural (and also the most common implementation) to represent real intervals on a computer as machine intervals; they are intervals whose bounds are machine numbers, i.e., they are floating point numbers with a fixed precision. Then interval arithmetic can be implemented by use of floating point operations with directed rounding and with results of the same precision. Thus, interval arithmetic can take full advantage of existing floating point hardware if this hardware is able to perform directed roundings.

In cases where the accuracy of the results is insufficient or no results can be obtained at all due to poorly conditioned problems, it is desirable to have an interval arithmetic of higher precision. Such an arithmetic can easily be obtained if the underlying floating point system is replaced by a multiple precision real arithmetic as e.g. in [17]. But then the basic operations are generally no longer available in hardware, since multiple precision arithmetic has to be simulated by integer arithmetic on the computer. Therefore, execution times will increase drastically if a program uses such a high precision interval arithmetic.

In this paper we discuss a solution of this problem which can result in execution times closer to those of pure floating point hardware but still offering a highly increased precision for critical computations. For this purpose we need a reliable floating point hardware (e.g. according to the IEEE standard, [2]) and, in addition, only one additional hardware operation; this is the exact scalar product, as implemented by means of a long accumulator, see [19], [20], [18], [13]. This long accumulator can be built into hardware very efficiently, as has been shown in [13] and later work. But even if we simulate the long accumulator in software only, our approach to higher precision still results in similarly fast codes as in the case of usual high precision arithmetic, which is also performed in software only. Our high precision data type represents intervals as the sum of n floating point numbers a_i, i = 1, ..., n (which are stored separately) plus one interval A with floating point bounds (a so called staggered correction format).
Then all four basic operations +, -, ·, / and also √ can be efficiently implemented using sums and scalar products of floating point numbers, which can be computed exactly in the long accumulator. Also, all commonly used elementary functions can be implemented without difficulties for this data format; for simplicity, however, here we only discuss the arithmetic operations and √.

In Section 2 we show how these arithmetic operations can be defined for the staggered correction format. Then, Section 3 outlines some aspects of the implementation of such a staggered interval arithmetic in PASCAL-XSC, to be followed by some applications in Section 4. Another approach to compute results in a staggered correction format avoids a separate implementation of each basic arithmetic operation; rather, the algorithm is rearranged such that it produces its result already in a staggered correction format. Algorithms for linear problems can often be easily transformed in such a way (see e.g. [28]). For nonlinear problems a combination may be more convenient: e.g. in Newton-like methods the function evaluation may be computed with a staggered arithmetic on a single operation-wise basis, whereas the computation of the iteration step can be written without using staggered operations but still producing a result in staggered format.
The staggered correction format was introduced by Rump, [27], [28], and Böhm, [6], [7], such that elementary operations are not used explicitly. Stetter and Auzinger, [31], [3], proposed the use of this format in a more general, operation-wise way and also coined its name. Klotz, [16], discussed several matrix factorizations using staggered correction arithmetic; since then, staggered correction methods have also been applied to other problems, e.g. the evaluation of multivariate polynomials, [24], the computation of all eigenvalues of symmetric matrices, [25], and the computation of the matrix exponential function, [4], [5].
2 Interval Staggered Arithmetic
Let S be the set of all machine numbers on a computer. By IR we denote the set of all intervals over R and by IS we denote the set of all machine intervals, i.e., all intervals X = [x̲, x̄] with x̲, x̄ ∈ S. In this section, we define an interval in staggered correction format (or shorter: staggered format); additionally, we develop algorithms for the four basic operations as well as the square root for this staggered correction data type.
Definition 1 (Staggered Correction Format) Let x_i ∈ S, i = 1, ..., n, n ≥ 0, be machine numbers and X ∈ IS a machine interval. Then

    x = Σ_{i=1}^{n} x_i + X   (1)

is called an interval in staggered correction format of length n. The set of all intervals in staggered correction format of length n is denoted by IS_n.

Remarks
1. Sometimes we will refer to x_i as the i-th component and to X as the interval component of a given staggered interval. Of course, in this definition the representation of an interval in staggered correction format is in general not unique. In most cases we may even have staggered representations of different lengths for the same interval x ∈ IR as, e.g., any sum of zeros plus an interval X for all X ∈ IS. However, this will not present any difficulties; in fact, all calculations will be carried out by use of the long accumulator, which is nothing but a long fixed point register and therefore provides a means for the unique representation of intermediate results.
2. We explicitly allow n to be zero in Definition 1. Then the sum in (1) is an empty sum with value zero. In this case we have IS_0 = IS, and the staggered correction representation of an interval in IS_0 is unique.

3. It is desirable that |x_1| > ... > |x_n| > |X| and that the exponents of two successive summands x_i, x_{i+1} differ at least by l, where l is the mantissa length. We then say that the mantissas in the staggered correction form do not overlap. In this case the interval x is represented with an optimal precision of about (n + 1)l mantissa digits. However, this is not a necessary assumption concerning this format: we could even allow all components x_i as well as X to be of the same order of magnitude, such that no precision is gained by storing x as a staggered interval. We still could perform all operations and computations but, of course, we would then face a considerable loss of efficiency. Therefore, the non-overlapping property is very desirable, and the operations on IS_n should be written such that they produce results with this property.

4. One advantage of the staggered format is the fact that we can store a number x with high precision using only a few components if x is very close to a machine number; e.g., x = 1 + 10^{-m} can be stored with a precision of at least m decimal digits provided there is no underflow of 10^{-m}. For this purpose, only a staggered length of n = 1 is needed; i.e., one real and one interval are sufficient to enclose x.
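Remark 4 is easy to demonstrate in software. The following sketch (with invented names, not the chapter's code) uses an exact rational type to check that a single double loses the correction 10^{-30}, while a staggered pair of length n = 1, with the interval component widened outward by one ulp, encloses the exact value:

```python
from fractions import Fraction
import math

# x = 1 + 10^-30 is not representable as one double, but a staggered pair of
# length n = 1 -- one real component plus one machine interval -- encloses it.

x_exact = Fraction(1) + Fraction(1, 10**30)

assert 1.0 + 1e-30 == 1.0             # single double: the correction is lost

x1 = 1.0                              # component x_1
X = (math.nextafter(1e-30, 0.0),      # interval component, widened outward by
     math.nextafter(1e-30, math.inf)) # one ulp to guarantee containment

lower = Fraction(x1) + Fraction(X[0])
upper = Fraction(x1) + Fraction(X[1])
assert lower < x_exact < upper        # the staggered pair encloses 1 + 10^-30
```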
The exact results of the arithmetic operations on staggered intervals are defined as usual in interval analysis: if x and y are two staggered intervals, then we treat them as elements of IR and define the operations as they are defined there:

Definition 2 (Exact operations on staggered intervals) Let x and y be intervals in a staggered correction format. Then x ∘ y := {ζ | ζ = ξ ∘ η, ξ ∈ x, η ∈ y} in the case of ∘ ∈ {+, -, ·, /}, if 0 ∉ y for ∘ = /, and √x := {ζ | ζ = √ξ, ξ ∈ x} if inf(x) ≥ 0.
Of course, we cannot expect the arithmetic operations to produce exact results on a computer, since we are still limited by the machine precision, i.e., the length of the long accumulator in this case. However, for any operation z = x ∘ y or z = √x, we must require that the staggered interval z is a superset of the true mathematical result, as is also the case for ordinary machine interval arithmetic: x ∘ y ⊆ z and √x ⊆ z. Here we do not require the result z to be optimal in the sense that it is the smallest staggered interval of some prescribed length n containing the true mathematical value of x ∘ y or √x; rather, we are satisfied with a compromise between the tightness of the enclosure z and the ease and efficiency of the implementation. We desire the operations to accept staggered operands of any (perhaps differing) length and to produce their result in a staggered format whose length can be prescribed independently of that of the operands. This can be achieved conveniently if all operations follow the same pattern as far as possible: first we compute upper and lower bounds for the true result and store them in two long accumulators. Then, by use of proper rounding, these accumulators are converted to a staggered interval of the desired length. Division and square root will be the only operations differing from this pattern.
First we comment on the two unary operations + and -. If x is a staggered interval of length n and if n is also the prescribed result length of +x and -x, then + is the identity, where no operation is necessary, and - is the operation reversing the signs of all components of x. However, if the prescribed result length is m ≠ n, then there are two cases. The first case, where m > n, is similar to the case of m = n: all components of x must be copied (with or without a reversal of the signs) and only the additional m - n real components of the result have to be set to zero. The second case, however, where m < n, requires the execution of some nontrivial operations; we define two accumulators lo and up to store the bounds of x: lo := Σ_{i=1}^{n} x_i + X̲ and up := Σ_{i=1}^{n} x_i + X̄.
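For illustration, the sign-reversal operation on a short staggered interval might look as follows in software (a sketch; the tuple representation `(components, (Xlo, Xhi))` and the name `stag_neg` are inventions of this example):

```python
def stag_neg(x):
    """Unary minus: negate all real components and swap the interval bounds."""
    xs, (Xlo, Xhi) = x
    return [-v for v in xs], (-Xhi, -Xlo)

# -([1.0, 1e-20] + [-2e-40, 1e-40]) = [-1.0, -1e-20] + [-1e-40, 2e-40]
nx = stag_neg(([1.0, 1e-20], (-2e-40, 1e-40)))
```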
This conversion of two accumulators lo and up to a staggered interval x of length n is now outlined. Let lo and up be two long accumulators and n ≥ 0 a prescribed staggered length. Then lo and up are converted to x by use of the following algorithm Convert. Here, lo and up are long accumulators, xl, xu, x_i are reals, X is an interval, i, n are integers, and stop is a boolean variable.
Algorithm Convert :
Input: accumulators lo and up and result length n
Output: staggered interval x

stop := false
i := 0
repeat
    i := i + 1
    xl := □lo
    xu := □up
    if xl = xu then
        x_i := xl
        lo := lo - xl
        up := up - xu
    else
        x_i := 0
        stop := true
until stop or i = n
for i := i + 1 to n do x_i := 0
X := [∇lo, Δup]
Here, □, ∇, Δ represent the roundings of the accumulator to the nearest, the next lower, and the next upper machine number, respectively. This algorithm successively reads the real numbers xl and xu from the accumulators and, as long as they are equal and the staggered length has not yet been reached, assigns them to x_i and subtracts them from the accumulators. When xl and xu are different or when the staggered length n for the result x is reached, the remaining values in the accumulators are converted to the interval component X of x using proper rounding. In the case
that some 2,’s are not yet defined, they are set to zero. This algorithm has the property that the real components of z do not overlap in the sense described above. Since we will use this algorithm in the following operators, this property will also hold true for their results. Next, we consider the algorithms for addition and subtraction of staggered intervals. They simply add (subtract) the lower and upper bounds of two staggered operands z and y into two accumulators and convert them to the result z using the above algorithm Convert. Here, lo and u p are long accumulators, z i , y i are reals, X , Y are intervals, and z is the resulting staggered interval. The case of addition:
Algorithm Add :
Input: staggered intervals x and y of length n_x and n_y, respectively
Output: staggered interval z of length n_z, containing x + y

lo := Σ_{i=1}^{n_x} x_i + Σ_{j=1}^{n_y} y_j + X̲ + Y̲
up := Σ_{i=1}^{n_x} x_i + Σ_{j=1}^{n_y} y_j + X̄ + Ȳ
z := Convert( lo, up, n_z )
and analogously in the case of subtraction:
Algorithm Sub :
Input: staggered intervals x and y of length n_x and n_y, respectively
Output: staggered interval z of length n_z, containing x - y

lo := Σ_{i=1}^{n_x} x_i - Σ_{j=1}^{n_y} y_j + X̲ - Ȳ
up := Σ_{i=1}^{n_x} x_i - Σ_{j=1}^{n_y} y_j + X̄ - Y̲
z := Convert( lo, up, n_z )
Addition and subtraction are straightforward. Multiplication, however, can be implemented in various ways yielding different results because of the subdistributive law of interval arithmetic. Let x = Σ_{i=1}^{n_x} x_i + X and y = Σ_{j=1}^{n_y} y_j + Y be two staggered intervals; then, due to subdistributivity, we have:

    x · y = ( Σ_{i=1}^{n_x} x_i + X ) · ( Σ_{j=1}^{n_y} y_j + Y )
          ⊆ Σ_{i=1}^{n_x} Σ_{j=1}^{n_y} x_i y_j + X · Σ_{j=1}^{n_y} y_j + Y · Σ_{i=1}^{n_x} x_i + X · Y   (2)
          ⊆ Σ_{i=1}^{n_x} Σ_{j=1}^{n_y} x_i y_j + Σ_{j=1}^{n_y} X y_j + Σ_{i=1}^{n_x} Y x_i + X · Y
This seems to suggest that better results can be obtained from using the second line in (2) than from using the third line. However, in order to compute the products X · Σ_{j=1}^{n_y} y_j and Y · Σ_{i=1}^{n_x} x_i, we first have to round the sums Σ_{j=1}^{n_y} y_j and Σ_{i=1}^{n_x} x_i to machine intervals. As a consequence of these additional roundings, however, the second line in (2) now yields coarser enclosures than the third line.

Therefore, we choose line three in (2) as the basis of our multiplication algorithm. Here, again, lo and up are long accumulators, x_i, y_i are reals, X, Y are intervals, and z is the resulting staggered interval.
Algorithm Mult :
Input: staggered intervals x and y of length n_x and n_y, respectively
Output: staggered interval z of length n_z, containing x · y

lo := Σ_{i=1}^{n_x} Σ_{j=1}^{n_y} x_i y_j
up := lo
[lo, up] := [lo, up] + Σ_{j=1}^{n_y} X y_j + Σ_{i=1}^{n_x} Y x_i + X · Y
z := Convert( lo, up, n_z )
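A software reading of the accumulation in Mult (line three of (2)) might look as follows. This is a sketch under the simplifying assumption that all components and bounds are nonnegative, so the interval products can be formed endpoint-wise; Fractions again emulate the exact long accumulators, and all names are invented:

```python
from fractions import Fraction

def mult_bounds(x, y):
    """Accumulate lower/upper bounds of x*y exactly (nonnegative case only)."""
    xs, (Xlo, Xhi) = x
    ys, (Ylo, Yhi) = y
    # exact double sum over the point components
    core = sum(Fraction(a) * Fraction(b) for a in xs for b in ys)
    # interval terms sum_j X*y_j + sum_i Y*x_i + X*Y, endpoint-wise
    lo = (core
          + sum(Fraction(Xlo) * Fraction(b) for b in ys)
          + sum(Fraction(Ylo) * Fraction(a) for a in xs)
          + Fraction(Xlo) * Fraction(Ylo))
    up = (core
          + sum(Fraction(Xhi) * Fraction(b) for b in ys)
          + sum(Fraction(Yhi) * Fraction(a) for a in xs)
          + Fraction(Xhi) * Fraction(Yhi))
    return lo, up          # to be passed on to Convert(lo, up, n_z)

# usage: [1.5, 1.5 + 1e-20] * [2, 2 + 1e-20]
lo, up = mult_bounds(([1.5], (0.0, 1e-20)), ([2.0], (0.0, 1e-20)))
```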
Up to now, we were able to compute bounds for the results of our operations in two accumulators and then round these to an interval staggered format. This can no longer be carried out conveniently in the case of the division of two staggered correction intervals x and y. Rather, we will apply an iterative algorithm computing successively the n_z real components z_i of the quotient x/y.

In order to compute this approximation Σ_{i=1}^{n_z} z_i, we start with z_1 = □m(x)/□m(y); here m(a) represents a point selected in a, e.g. the midpoint, and □ is the rounding to S. Now, we proceed inductively: if we have an approximation Σ_{i=1}^{k} z_i, then we can compute a next summand z_{k+1} by use of

    z_{k+1} := □( Σ_{i=1}^{n_x} x_i − Σ_{i=1}^{n_y} Σ_{j=1}^{k} y_i z_j ) / □m(y),   (3)

where the numerator is computed exactly using a long accumulator and is rounded only once to S. The division is performed in ordinary floating point arithmetic.

As in the previous operations, this iteration guarantees that the z_i do not overlap, since the defect (i.e. the numerator in (3)) of each approximation Σ_{i=1}^{k} z_i is computed with only one rounding.
Now, the interval component Z of the result z may be computed as follows:

    Z = ◇( Σ_{i=1}^{n_x} x_i − Σ_{i=1}^{n_y} Σ_{j=1}^{n_z} y_i z_j + X − Σ_{j=1}^{n_z} z_j Y ) / ◇y,   (4)

where ◇ is the rounding to an interval in IS.
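The defect iteration (3) is easy to simulate with an exact rational type standing in for the long accumulator (a sketch; for simplicity x and y are point values here, and the helper name is invented):

```python
from fractions import Fraction

def staggered_div_components(x, y, n):
    """Approximation components z_1..z_n of x / y via the defect iteration (3)."""
    zs = []
    for _ in range(n):
        # numerator x - y * (z_1 + ... + z_k), computed exactly, rounded once
        defect = x - y * sum(map(Fraction, zs), Fraction(0))
        zs.append(float(defect) / float(y))   # one ordinary floating point division
    return zs

# usage: enclose 1/3 with three non-overlapping components
zs = staggered_div_components(Fraction(1), Fraction(3), 3)
```

Each new component picks up roughly one more machine word of accuracy, so after three components the residual 1/3 − Σ z_i is far below double precision.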
It is not difficult to see that z = Σ_{j=1}^{n_z} z_j + Z as computed from (3) and (4) is a superset of the exact range {ξ/η | ξ ∈ x, η ∈ y}; in fact, for all α ∈ X, β ∈ Y we have the identity:

    ( Σ_{i=1}^{n_x} x_i + α ) / ( Σ_{j=1}^{n_y} y_j + β ) = Σ_{j=1}^{n_z} z_j + ( Σ_{i=1}^{n_x} x_i − Σ_{i=1}^{n_y} Σ_{j=1}^{n_z} y_i z_j + α − β Σ_{j=1}^{n_z} z_j ) / ( Σ_{j=1}^{n_y} y_j + β ).

An interval evaluation of this expression for α ∈ X and β ∈ Y shows immediately that the exact range of x/y is contained in Σ_{j=1}^{n_z} z_j + Z, which is computed using (3) and (4).
Thus we may formulate the following algorithm for the division of two staggered intervals. Again, lo and up are long accumulators, x_i, y_i, z_i and y are reals, X, Y and Z are intervals, □ is a rounding to S, and ◇ is an interval rounding to IS.
Algorithm Div:
Input: staggered intervals x and y of length n_x and n_y, respectively
Output: staggered interval z of length n_z, containing x/y
    lo := Σ_{i=1}^{n_x} x_i
    for k := 1 to n_z do
        z_k := □( lo ) / m(y)
        lo := lo − Σ_{j=1}^{n_y} y_j·z_k
    up := lo
    Z := ◇( [lo,up] + X − Y·Σ_{j=1}^{n_z} z_j ) / ◇y
In this algorithm the double sum from (3), (4) is accumulated in the long accumulator lo while the z_k's are computed; the final value in lo is then used in the computation of the interval part Z. Thus, here the amount of work is reduced to a minimum. An algorithm for the square root can be obtained analogously as in the case of the division. We compute the z_i, i = 1, …, n_z, of the approximation part iteratively as follows:

    z_{k+1} := □( Σ_{i=1}^{n_x} x_i − Σ_{i,j=1}^{k} z_i z_j ) / ( 2·□ Σ_{i=1}^{k} z_i )        (5)
This guarantees again that the z_i do not overlap, since in the numerator of (5) the defect of the approximation is computed with one rounding only. Now, the interval part Z is computed by use of
Staggered Arithmetic
    Z = ◇( Σ_{i=1}^{n_x} x_i − Σ_{i,j=1}^{n_z} z_i z_j + X ) / ( √(◇x) + ◇ Σ_{i=1}^{n_z} z_i )        (6)
As in the case of the division, it is easy to see that z = Σ_{i=1}^{n_z} z_i + Z as computed from (5) and (6) is a superset of the exact range {√ξ | ξ ∈ x}; in fact, for all γ ∈ X we have the identity:

    √ξ = Σ_{i=1}^{n_z} z_i + ( Σ_{i=1}^{n_x} x_i − Σ_{i,j=1}^{n_z} z_i z_j + γ ) / ( √ξ + Σ_{i=1}^{n_z} z_i )

with ξ = Σ_{i=1}^{n_x} x_i + γ.
Now, we are able to write down an algorithm for the computation of the square root:

Algorithm Sqrt:
Input: staggered interval x of length n_x
Output: staggered interval z of length n_z, containing √x
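A small Python model (illustrative only, with Fraction again playing the long accumulator and a helper fl standing in for the rounding to S) shows the same mechanism for the square root: the defect x − (Σ z_i)² is formed exactly and rounded once per component:

```python
import math
from fractions import Fraction

def fl(q, p=53):
    """Round a Fraction to p significant bits (stand-in for rounding to S)."""
    if q == 0:
        return Fraction(0)
    s = -1 if q < 0 else 1
    q = abs(q)
    e = math.floor(math.log2(q))
    scale = Fraction(2) ** (p - 1 - e)
    return s * Fraction(round(q * scale)) / scale

def staggered_sqrt(x, n):
    """n staggered components z_1..z_n whose sum is close to sqrt(x);
    each Newton correction uses the exactly accumulated defect."""
    x = Fraction(x)
    zs = [fl(Fraction(math.sqrt(x)))]   # ordinary floating-point start value
    for _ in range(n - 1):
        s = sum(zs)
        defect = x - s * s              # exact, as in a long accumulator
        zs.append(fl(fl(defect) / (2 * fl(s))))
    return zs

zs = staggered_sqrt(2, 3)
s = sum(zs)
```

Here (Σ z_i)² differs from 2 only far beyond double precision, and again the components decrease fast enough not to overlap.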
Remarks 1. In the cases of addition, subtraction, and multiplication we have non-overlapping real components z_i since the algorithm Convert is used here. In the cases of division and the square root this property holds true since the defects of the approximations are computed with one rounding only. Nevertheless, it may still be possible that we obtain overlapping in the sense that some z_i will be zero; this may happen e.g. if we choose n_z too large, such that underflow occurs when a long accumulator is converted to a staggered interval. It may also happen that we obtain a non-overlapping approximation part Σ_{i=1}^{n_z} z_i which, however, overlaps with the interval part Z in the sense that the absolute value of Z exceeds those of some of the z_i's. This is due to
the fact that we allow an arbitrary result length n_z to be specified prior to the execution of the operation, such that n_z may be larger than the best possible accuracy of the result (which is limited, of course, by the accuracy of the operands).
2. The non-overlapping property is always satisfied in the operations defined in [16]; however, the algorithms in [16] are much more complicated and time-consuming compared to our algorithms.
3. In order to make the application of our algorithms easy and efficient, some additional operations from interval arithmetic should be supplied, such as the computation of an infimum, supremum, diameter or midpoint. Furthermore, conversion routines from and to other data formats as well as operations with mixed operand types are useful for the practical work in a programming language. Also elementary functions can be implemented for staggered intervals without difficulties. The algorithms for such implementations may already make use of the staggered operations presented here. As an example, exp(x) can be implemented by use of the Taylor series, which is computed in staggered interval arithmetic, while the remainder term is just an ordinary interval to be added to the interval part of the staggered result. For the inverse functions, it is often very easy to apply an interval Newton's method using staggered interval arithmetic. As an example, y = ln x can be obtained as the zero of the function f(y) = x − exp(y). Here, only exp(y) and staggered interval arithmetic are needed. A Newton method of this kind can even be implemented quite efficiently by making use of the quadratic convergence property: we start the iteration with a small staggered length n, doubling this length after each iteration step. The overall computational cost will then roughly be only twice the cost of one iteration employing the maximum staggered length that is used.
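The precision-doubling strategy of this remark can be tried out with Python's decimal module standing in for staggered arithmetic (an illustrative sketch; the function name and start value are ours):

```python
import math
from decimal import Decimal, getcontext

def ln_newton(x, digits):
    """Compute ln(x) as the zero of f(y) = exp(y) - x with Newton's method,
    doubling the working precision in each step, as suggested in remark 3."""
    x = Decimal(x)
    y = Decimal(math.log(float(x)))      # ordinary double-precision start
    prec = 17
    while prec < digits + 10:
        prec = min(2 * prec, digits + 10)
        getcontext().prec = prec
        y = y - 1 + x * (-y).exp()       # y <- y - (exp(y) - x)/exp(y)
    return +y

# ln(2) to about 60 digits; the expensive exp() at full precision is
# needed only in the very last step, so the total cost is roughly
# twice the cost of that final iteration.
r = ln_newton(2, 60)
```

Since each Newton step nearly doubles the number of correct digits, raising the precision in lockstep wastes essentially no work, which is precisely the cost argument made in the remark.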
3  Implementation in PASCAL-XSC
Here we present some details of an implementation of a staggered correction interval arithmetic in PASCAL-XSC (see [14], [15] and [11] for the language definition). PASCAL-XSC is very well suited for this purpose since it offers a standard data type DOTPRECISION. A variable of this type represents a long accumulator. PASCAL-XSC also supplies an easy way to carry out operations with DOTPRECISION variables making use of the so-called dot product expressions; they are expressions written in parentheses and prefixed by a #-symbol and an optional rounding symbol. These expressions are evaluated exactly, and the result is stored in a long accumulator (DOTPRECISION) if no rounding is specified. If a rounding is specified, however, the result is rounded exactly once and stored as a real (if the rounding symbol is < (down), > (up), or * (nearest)) or it is stored as an interval (if the rounding symbol is # ). In the presently available versions of PASCAL-XSC,
these long accumulators are simulated in software; when they become available in hardware (hopefully), the programs will also run on that hardware without any change of the source code. The global type ISTAGGERED representing a staggered correction interval is chosen to be a dynamic array of reals which contains the real values z_i, i = 1, …, SLEN in the components 1, …, SLEN and the lower and upper bound of the interval X in the components −1 and 0, respectively. The global variable SLEN, which is the prescribed result length n_z for all operations, can be altered as desired during the execution of the program. The functions and operators will then produce results of the new staggered length SLEN while accepting any length as input. SLEN must always be positive (zero is not allowed in this module in order to keep the code somewhat simpler). SLEN is initialized with the value one in the module's main body. The trivial function IVAL rounds an ISTAGGERED variable to an ordinary interval; it is listed here only since it is used in the division operator. The local function CONVERT is the implementation of Algorithm Convert. The local auxiliary procedure ADD_INT_TIMES_STAGG is used to add the product of an interval times a point-ISTAGGERED to two DOTPRECISION variables. The addition, multiplication, and division operators, which are listed completely below, serve as examples for the implementation of such staggered correction interval operations and functions. It should now be straightforward to write operators for mixed operands or to use these operators in algorithms for standard functions for the type ISTAGGERED. Conversion functions and overloaded assignment operators are omitted here; their implementation is very easy. The same holds true for input/output procedures in the case that the basic floating-point arithmetic is a decimal one. For binary or hexadecimal floating-point arithmetic, however, writing input/output procedures is a nontrivial task.
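The storage layout just described (interval bounds in components −1 and 0, the reals z_1, …, z_SLEN in components 1, …, SLEN) can be mirrored by a small Python model of IVAL; this only illustrates the layout and is not part of the PASCAL-XSC module, with Fraction as an exact stand-in for the long accumulator:

```python
from fractions import Fraction

def ival(st):
    """st = (zs, lo, hi): the z_i from components 1..SLEN plus the lower
    and upper bound of the interval part (components -1 and 0).
    Returns the enclosing ordinary interval (inf, sup)."""
    zs, lo, hi = st
    s = sum(map(Fraction, zs))      # exact summation, as in DOTPRECISION
    return (s + Fraction(lo), s + Fraction(hi))

# staggered value 1/2 + 1/8 + [-1/16, 1/16]  ->  [9/16, 11/16]
enclosure = ival(([Fraction(1, 2), Fraction(1, 8)],
                  Fraction(-1, 16), Fraction(1, 16)))
```

The real PASCAL-XSC IVAL below does the same with one exactly accumulated dot product expression and two directed roundings.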
MODULE istagger;
USE i_ari;    { Import ordinary interval arithmetic }
{---------------------------------------------------}
{ IVAL encloses an ISTAGGERED X in an interval      }
{---------------------------------------------------}
GLOBAL FUNCTION IVAL( VAR X : ISTAGGERED ) : INTERVAL;
VAR I : INTEGER;
    D : DOTPRECISION;
BEGIN
  D:= #( FOR I:= 0 TO UB(X) SUM( X[I] ) );
  IVAL:= INTVAL( #<( D - X[0] + X[-1] ), #>( D ) );
END;
{---------------------------------}
{ Space for more type conversions }
{---------------------------------}
{---------------------------------------------------}
{ CONVERT implements Algorithm Convert.             }
{ The accumulators LO and UP are converted to an    }
{ ISTAGGERED of length SLEN.                        }
{---------------------------------------------------}
FUNCTION CONVERT( VAR LO,UP : DOTPRECISION ) : ISTAGGERED[-1..SLEN];
VAR I     : INTEGER;
    XL,XU : REAL;
    STOP  : BOOLEAN;
BEGIN
  STOP:= FALSE;  I:= 0;
  REPEAT
    I:= I + 1;
    XL:= #*( LO );  XU:= #*( UP );
    IF XL=XU THEN
      BEGIN
        CONVERT[I]:= XL;
        LO:= #( LO - XL );  UP:= #( UP - XU );
      END
    ELSE
      BEGIN
        CONVERT[I]:= 0.0;  STOP:= TRUE;
      END;
  UNTIL STOP OR (I=SLEN);
  FOR I:= I+1 TO SLEN DO CONVERT[I]:= 0.0;
  CONVERT[-1]:= #<( LO );  CONVERT[0]:= #>( UP );
END {--- CONVERT ---};
{------------------------------------------}
{ Space for unary operators for ISTAGGERED }
{------------------------------------------}
{---------------------------------------------------}
{ Addition of two ISTAGGERED variables :            }
{---------------------------------------------------}
GLOBAL OPERATOR + ( VAR X,Y : ISTAGGERED ) ADD : ISTAGGERED[-1..SLEN];
VAR I     : INTEGER;
    LO,UP : DOTPRECISION;
BEGIN
  UP:= #( FOR I:= 0 TO UB(X) SUM( X[I] ) + FOR I:= 0 TO UB(Y) SUM( Y[I] ) );
  LO:= #( UP + X[-1] - X[0] + Y[-1] - Y[0] );
  ADD:= CONVERT( LO,UP );
END {--- Addition ---};
PROCEDURE ADD_INT_TIMES_STAGG( VAR A : INTERVAL; VAR X : ISTAGGERED;
                               VAR LO,UP : DOTPRECISION );
VAR I : INTEGER;
BEGIN
  FOR I:= 1 TO UB(X) DO
    IF X[I]>0.0 THEN
      BEGIN
        LO:= #( LO + A.INF*X[I] );  UP:= #( UP + A.SUP*X[I] );
      END
    ELSE
      BEGIN
        LO:= #( LO + A.SUP*X[I] );  UP:= #( UP + A.INF*X[I] );
      END;
END {--- ADD_INT_TIMES_STAGG ---};

{---------------------------------------------------}
{ Multiplication of two ISTAGGERED variables :      }
{---------------------------------------------------}
GLOBAL OPERATOR * ( VAR X,Y : ISTAGGERED ) MULT : ISTAGGERED[-1..SLEN];
VAR I,J   : INTEGER;
    LO,UP : DOTPRECISION;
    C     : INTERVAL;
BEGIN
  LO:= #( FOR I:= 1 TO UB(X) SUM( FOR J:= 1 TO UB(Y) SUM( X[I]*Y[J] ) ) );
  UP:= LO;
  ADD_INT_TIMES_STAGG( INTVAL(X[-1],X[0]), Y, LO,UP );
  ADD_INT_TIMES_STAGG( INTVAL(Y[-1],Y[0]), X, LO,UP );
  C:= INTVAL(X[-1],X[0]) * INTVAL(Y[-1],Y[0]);
  LO:= #( LO + C.INF );  UP:= #( UP + C.SUP );
  MULT:= CONVERT( LO,UP );
END {--- Multiplication ---};
{---------------------------------------------------}
{ Division of two ISTAGGERED variables :            }
{---------------------------------------------------}
GLOBAL OPERATOR / ( VAR X,Y : ISTAGGERED ) DIVI : ISTAGGERED[-1..SLEN];
VAR I,J   : INTEGER;
    LO,UP : DOTPRECISION;
    Z     : ISTAGGERED[-1..SLEN];
    YM,ZI : REAL;
    C     : INTERVAL;
4  Applications
As has already been mentioned in the introduction, there exist a number of applications of staggered correction interval arithmetic. Here, we will demonstrate in more detail the simple but powerful application to discrete dynamical systems; additionally, we will give some further references to other applications which have already been published. The computation of orbits of dynamical systems is known to be highly unstable if the system exhibits chaotic behavior. In this case, even for the very simplest systems, ordinary floating-point computations will eventually deliver results which are quantitatively completely wrong. Also ordinary interval arithmetic (i.e. intervals of floating-point numbers) will yield poor enclosures after a few iterations and, finally, in most cases the computation will break down because of overflow. By use of the interval staggered correction format, however, we can compute enclosures of orbits for a considerably longer time with high accuracy. Of course this can also be achieved by means of multi-precision arithmetic simulated in software; once the long accumulator is available in hardware, the staggered arithmetic will fully benefit from the speed of floating-point hardware. Consider the simple dynamical system as given by the logistic equation:

    x_{n+1} = a · x_n · (1 − x_n)        (7)
for some a ∈ [0,4] and x_0 ∈ (0,1). On the computer, we can compute this iteration with (i) ordinary floating-point arithmetic, (ii) ordinary interval arithmetic or with (iii) staggered interval arithmetic. However, for the cases (ii) and (iii) we should first rewrite the right hand side of (7) such that it is better suited for the application of interval arithmetic: For narrow intervals it is well known in interval analysis that a tighter interval enclosure can be obtained by using a mean value form instead of an interval evaluation of the originally given expression. The ordinary interval evaluation of a function f(x) over an interval X, denoted as f(X), is obtained by replacing all occurrences of x in f by the interval X and by replacing all operations by the corresponding interval operations. The mean value form is defined by f_m(X) := f(y) + f′(X)(X − y) with some fixed value y ∈ X, e.g. the midpoint. Thus, in the cases (ii) and (iii) we replace the right hand side of (7) by its mean value form, i.e. by
    X_{n+1} = a · ( y_n·(1 − y_n) + (1 − 2X_n)·(X_n − y_n) )  with  y_n ≈ mid(X_n) = midpoint of X_n,        (8)
where X_n is an interval in case (ii) and a staggered interval in case (iii). Rewriting (7) as (8) does not affect the quality of the pure floating-point computation, which is still executed using (7). The following PASCAL-XSC program uses the module istagger from the previous section to compute orbits for this equation. The print-out lists the approximations x_n obtained by pure floating-point evaluations of (7) and the enclosures X_n as obtained by ordinary and by staggered interval arithmetic using (8). It is assumed here that the module also contains a function MID with ISTAGGERED parameter and ISTAGGERED result as well as some additional operators and assignments for mixed data types.

PROGRAM ICHAOS ( INPUT,OUTPUT );
USE i_ari, istagger;  { Import interval and staggered interval arithmetic }

PROCEDURE MAIN;
VAR X,XM : ISTAGGERED[-1..SLEN];
    Y,YM : INTERVAL;
    A,X0 : REAL;
    N,M  : INTEGER;
BEGIN
  WRITELN('Computation of the logistic equation');
  WRITELN('   x(n+1) = a * x(n) * ( 1-x(n) )');
  WRITELN;
  WRITE('enter parameter a      = ');  READ(A);
  WRITE('enter initial value x0 = ');  READ(X0);
  WRITE('number of iterations m = ');  READ(M);
  Y:= X0;  { initial value for interval arithmetic }
  X:= X0;  { initial value for staggered interval arithmetic }
  FOR N:= 1 TO M DO
  BEGIN
    X0:= A*X0*(1-X0);  { Compute real approximation }
    {---------------------------------------------------------}
    { Compute with interval and staggered interval arithmetic }
    { of length SLEN. Use mean value form for x(n+1) :        }
    {---------------------------------------------------------}
    YM:= MID(Y);  Y:= A*( YM*(1-YM) + (1-2*Y)*(Y-YM) );
    XM:= MID(X);  X:= A*( XM*(1-XM) + (1-2*X)*(X-XM) );
    WRITELN(N, ' : ', X0, Y, IVAL(X));
  END;
END;

BEGIN
  WRITE('enter staggered length : ');  READ(SLEN);
  IF SLEN>0 THEN MAIN;
END.
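The gain from the mean value form (8) over the naive evaluation of (7) can also be checked independently of PASCAL-XSC with exact rational interval arithmetic in Python (an illustrative sketch; all names are ours, and exact rationals remove any rounding so only the enclosure effect is visible):

```python
from fractions import Fraction

def imul(A, B):
    """Interval product: [min, max] of the endpoint products."""
    ps = [a * b for a in A for b in B]
    return (min(ps), max(ps))

def iadd(A, B): return (A[0] + B[0], A[1] + B[1])
def isub(A, B): return (A[0] - B[1], A[1] - B[0])
def iscale(c, A): return imul((c, c), A)

a   = Fraction(15, 4)                      # a = 3.75
X   = (Fraction(3, 10), Fraction(3, 10) + Fraction(1, 10**6))
y   = (X[0] + X[1]) / 2                    # midpoint of X
one = (Fraction(1), Fraction(1))

# naive evaluation of (7):  a * X * (1 - X)
naive = iscale(a, imul(X, isub(one, X)))

# mean value form (8):  a * ( y(1-y) + (1 - 2X)(X - y) )
mv = iscale(a, iadd(iscale(y * (1 - y), one),
                    imul(isub(one, iscale(2, X)), isub(X, (y, y)))))

width = lambda A: A[1] - A[0]
```

For this narrow input interval the mean value form returns a strictly narrower enclosure than the naive form, which is why (8) survives many more iterations in Table 2 below.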
For the input data a = 3.75, x_0 = 0.5 and m = 500 iterations we get the results shown in Table 1 (here, an obvious short notation for intervals is used). For small n all three columns agree to full accuracy (e.g. x_1 = 0.9375000000000000 in each case); the ordinary interval enclosures degrade quickly and widen to [0,1] from about n = 90 on, whereas the staggered interval enclosures remain accurate up to n = 500.

[Table 1: iterates for n = 1, 10, 20, …, 500 computed with floating-point, ordinary interval, and staggered interval arithmetic; logistic equation for a = 3.75, x_0 = 0.5; last column: staggered length SLEN = 5]

In order to demonstrate the effect of the choice of (7) or (8), in the following Table 2 we list the maximum number n_max of iterations which can be performed with staggered interval arithmetic by use of (7) or (8) and with different choices of the staggered length SLEN. For larger values of n overflow occurs and the program is aborted (all computations were carried out on the basis of the IEEE double floating-point format).
SLEN | n_max with (7) | n_max with (8)
-----|----------------|---------------
  0  |       35       |       90
  1  |       69       |      169
  2  |       99       |      277
  4  |      152       |      464
  6  |      208       |      636
  8  |      262       |      846
 10  |      315       |      979
 14  |      424       |     1303
 18  |      532       |     1679

Table 2: maximum iteration count with staggered arithmetic

In all cases here, the accuracy obtained when rounding the results to S is very high (i.e. the maximum accuracy). The accuracy decreases only immediately before reaching n_max. Pure floating-point computations always yield totally incorrect results after more than about 100 iterations.

Another application is the evaluation of polynomials in one or more variables. The original algorithm for the one-dimensional case from Böhm, [6], [7], does not use staggered interval arithmetic on an operation-wise basis; rather, the algorithm is formulated in such a way that the intermediate and the final results are automatically obtained as staggered correction intervals. In the multi-dimensional case, [24], e.g. in two variables, a polynomial p(x,y) = Σ_{i,j} a_ij x^i y^j is rewritten as p(x,y) = Σ_{j=0}^{m} ( Σ_{i=0}^{n} a_ij x^i ) y^j = Σ_{j=0}^{m} b_j y^j with b_j(x) = Σ_{i=0}^{n} a_ij x^i. Now, the value of p(x,y) can be computed with sufficient accuracy if we compute all coefficients b_j accurately enough and then evaluate the remaining polynomial in y. These computations can very easily be carried out in staggered correction interval arithmetic; we only have to supply an algorithm for the evaluation of a one-dimensional polynomial with staggered interval coefficients. We then compute the coefficients b_j as staggered intervals and evaluate Σ_{j=0}^{m} b_j y^j in staggered interval arithmetic. Obviously, this method can be extended immediately to an arbitrary number of independent variables.
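The coefficient-wise scheme just described can be outlined in Python, with exact rationals standing in for staggered intervals (an illustrative sketch, names ours):

```python
from fractions import Fraction

def eval2d(a, x, y):
    """Evaluate p(x,y) = sum_{i,j} a[i][j] * x**i * y**j by first forming
    the coefficients b_j(x) = sum_i a[i][j] * x**i 'accurately enough'
    (here: exactly, via Fraction), then evaluating the remaining
    one-dimensional polynomial in y with a Horner scheme."""
    x, y = Fraction(x), Fraction(y)
    m = len(a[0])
    b = [sum(Fraction(a[i][j]) * x**i for i in range(len(a)))
         for j in range(m)]
    r = Fraction(0)
    for j in reversed(range(m)):     # Horner in y
        r = r * y + b[j]
    return r

# p(x,y) = 1 + 2y + 3x + 4xy  at  x = 1/3, y = 1/7
val = eval2d([[1, 2], [3, 4]], Fraction(1, 3), Fraction(1, 7))
```

In the staggered setting the b_j are staggered intervals rather than exact rationals, but the reduction from a bivariate to a univariate evaluation is the same.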
Presumably the first problem to which interval staggered methods were applied is the solution of systems of linear equations by Rump, [27], [28]. Here, again the staggered representation is built into the algorithm: if we have an approximate solution x_1 of the linear system Ax = b, then the exact solution is x = x_1 + ξ where ξ is a defect correction solving the system Aξ = b − Ax_1. Now, we compute an interval enclosure X of ξ and therefore we have enclosed the solution x in a staggered interval x_1 + X of length one. Rump also treats the case of approximating ξ by e.g. x_2 and enclosing the second defect in an interval Y; this finally results in an enclosure x ∈ x_1 + x_2 + Y, which is a staggered interval of length two. This method can trivially be extended to enclosures with higher staggered length by
continuing the defect correction. As other possible extensions we can also allow the matrix A and the right hand side b to contain staggered intervals.

The matrix exponential function is treated by Bochev and Markov, [4], [5]. Here, a diagonal Padé approximation with remainder term is used: D(A)^{−1} N(A) = exp(A) − D(A)^{−1} R(A). The remainder term R(A) is enclosed in an ordinary interval matrix S, whereas the numerator polynomial N(A) and the denominator polynomial D(A) are computed in staggered interval arithmetic. Then, an enclosure for exp(A) can be obtained as the solution of the linear matrix equation

    D(A) · exp(A) = N(A) + S,

which contains staggered intervals in the coefficient matrix D(A) and in the right hand side N(A) + S.
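Rump's defect-correction idea for Ax = b described above can be sketched in a few lines of Python: the residual b − Ax_1 is accumulated exactly (Fraction plays the long accumulator), and one correction solve yields a "length two" result x_1 + x_2 that is far more accurate than x_1 alone. All names are illustrative, and the naive solver is only a stand-in for a real one:

```python
from fractions import Fraction

def solve(A, b):
    """Naive Gaussian elimination with partial pivoting; works for
    both float and Fraction entries."""
    n = len(A)
    A = [row[:] for row in A]
    b = b[:]
    for k in range(n):
        p = max(range(k, n), key=lambda i: abs(A[i][k]))
        A[k], A[p] = A[p], A[k]
        b[k], b[p] = b[p], b[k]
        for i in range(k + 1, n):
            m = A[i][k] / A[k][k]
            for j in range(k, n):
                A[i][j] -= m * A[k][j]
            b[i] -= m * b[k]
    x = [0] * n
    for i in reversed(range(n)):
        x[i] = (b[i] - sum(A[i][j] * x[j] for j in range(i + 1, n))) / A[i][i]
    return x

n = 6
A = [[1.0 / (i + j + 1) for j in range(n)] for i in range(n)]  # ill-conditioned
b = [sum(row) for row in A]

x1 = solve(A, b)                             # ordinary floating-point solve
# residual b - A*x1, accumulated exactly as in a long accumulator:
r = [Fraction(b[i]) - sum(Fraction(A[i][j]) * Fraction(x1[j]) for j in range(n))
     for i in range(n)]
x2 = solve(A, [float(ri) for ri in r])       # correction solve
# x1 + x2 plays the role of a staggered result of length two
```

In Rump's method the second solve is replaced by a verified enclosure of the defect, so that x_1 + X (or x_1 + x_2 + Y) is a guaranteed staggered enclosure rather than just a better approximation.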
Another application of staggered methods is the computation of enclosures for the eigenvalues and eigenvectors of symmetric matrices. In [25] a generalization of Jacobi's method is described which computes tight bounds for all eigenvalues and eigenvectors of symmetric matrices; this is still true in very ill-conditioned cases, i.e. also in the presence of clusters of eigenvalues or in the simultaneous presence of very large and very small eigenvalues. The success of the method is due to the following modification of Jacobi's method: when several iterations of the classical Jacobi method have been carried out for the symmetric matrix A, we obtain a floating-point approximation X_1 of the eigensystem. In ill-conditioned cases a similarity transformation of A by use of X_1 yields a matrix deviating strongly from a symmetric matrix. Therefore, X_1 is re-orthogonalized, and this re-orthogonalization is executed in staggered arithmetic of length two (but without interval part in this case); this results in a new approximation X_1 + X_2. With this modified approximate eigensystem, a similarity transformation A_1 := (X_1 + X_2)^{−1} A (X_1 + X_2) is computed, resulting in a matrix A_1 which now has excellent symmetry properties again (usually up to machine precision). This matrix A_1 is approximated in staggered format (with interval part zero) for the diagonal entries and in ordinary floating-point format for the off-diagonal entries. Jacobi's method is now continued by use of A_1. Since the Givens rotations are now close to the identity matrix, however, they are no longer computed directly; rather, their difference T from the identity matrix I is determined. If I + T is the transformation matrix obtained from this second Jacobi 'sweep', then the total transformation matrix from the two Jacobi sweeps is (X_1 + X_2)(I + T) = X_1 + X_2 + X_1·T + X_2·T. This matrix can again be stored in a staggered format (with interval part zero).

Finally, by use of this transformation matrix, another similarity transformation is computed, now in staggered interval arithmetic, and an application of Gerschgorin's circle theorem yields very accurate enclosures of the eigenvalues as staggered intervals of length one. This whole algorithm can also be iterated more than once, yielding results as staggered intervals of higher length. This method has also been generalized to the complex case, i.e. for Hermitian matrices.
As a last application we mention ordinary initial value problems. The algorithm from [23], which computes guaranteed continuous bounds for the solutions of nonlinear initial value problems, gives its results in the form x(t) ∈ x̃(t) + R(t), where x̃(t) is an approximation of the solution and R(t) is an interval enclosing the global error. Here, it is possible to apply staggered arithmetic in the computation of x̃(t) but to retain ordinary interval arithmetic in the computation of R(t), which is the most expensive part of the algorithm. Thus, the accuracy of the results can be increased, mainly by increasing the accuracy of the approximation. The increase of the cost for the computation of the global error R(t) is almost negligible.
References

[1] Alefeld, G., Herzberger, J.: Introduction to Interval Computations. Academic Press, New York, 1983.
[2] American National Standards Institute / Institute of Electrical and Electronics Engineers: A Standard for Binary Floating-point Arithmetic. ANSI/IEEE Std. 754-1985, New York, 1985.
[3] Auzinger, W., Stetter, H.J.: Accurate Arithmetic Results for Decimal Data on Non-Decimal Computers. Computing 35, 1985.
[4] Bochev, P., Markov, S.: A Self-validating Method for the Matrix Exponential. Computing 43, 59 - 72, 1989.
[5] Bochev, P., Markov, S.: Simultaneous Self-Verified Computation of exp(A) and ∫ exp(As) ds. Computing 45, 183 - 191, 1990.
[6] Böhm, H.: Berechnung von Polynomnullstellen und Auswertung arithmetischer Ausdrücke mit garantierter, maximaler Genauigkeit. Dissertation, Universität Karlsruhe, 1983.
[7] Böhm, H.: Evaluation of Arithmetic Expressions with Maximum Accuracy. In: [20], 1983.
[8] IBM High-Accuracy Arithmetic Subroutine Library (ACRITH). General Information Manual, GC 33-6163-02, 3rd Edition, 1986.
[9] IBM High-Accuracy Arithmetic Subroutine Library (ACRITH). Program Description and User's Guide, SC 33-6164-02, 3rd Edition, 1986.
[10] Hammer, R.: How Reliable is the Arithmetic of Vector Computers? In: [30], 1990.
[11] Hammer, R., Neaga, M., Ratz, D.: PASCAL-XSC, New Concepts for Scientific Computation and Numerical Data Processing. This volume.
[12] Kaucher, E., Kulisch, U., and Ullrich, Ch. (Eds.): Computer Arithmetic - Scientific Computation and Programming Languages. Teubner, Stuttgart, 1987.
[13] Kirchner, R. and Kulisch, U.: Accurate Arithmetic for Vector Processors. Journal of Parallel and Distributed Computing 5, 250 - 270, 1988.
[14] Klatte, R., Kulisch, U., Neaga, M., Ratz, D. und Ullrich, Ch.: PASCAL-XSC Sprachbeschreibung mit Beispielen. Springer, Heidelberg, 1991.
[15] Klatte, R., Kulisch, U., Neaga, M., Ratz, D., and Ullrich, Ch.: PASCAL-XSC Language Reference with Examples. To be published by Springer, Heidelberg, 1992.
[16] Klotz, G.: Faktorisierung von Matrizen mit maximaler Genauigkeit. Dissertation, Universität Karlsruhe, 1987.
[17] Krämer, W.: Multiple-Precision Computations with Result Verification. This volume.
[18] Kulisch, U. (Hrsg.): Wissenschaftliches Rechnen mit Ergebnisverifikation - Eine Einführung. Akademie Verlag, Ost-Berlin, Vieweg, Wiesbaden, 1989.
[19] Kulisch, U. and Miranker, W. L.: Computer Arithmetic in Theory and Practice. Academic Press, New York, 1981.
[20] Kulisch, U. and Miranker, W. L. (Eds.): A New Approach to Scientific Computation. Academic Press, New York, 1983.
[21] Kulisch, U. and Miranker, W. L.: The Arithmetic of the Digital Computer: A New Approach. SIAM Review, Vol. 28, No. 1, 1986.
[22] Kulisch, U. and Stetter, H. J. (Eds.): Scientific Computation with Automatic Result Verification. Computing Suppl. 6, Springer, Wien, 1988.
[23] Lohner, R.: Einschließung der Lösung gewöhnlicher Anfangs- und Randwertaufgaben und Anwendungen. Dissertation, Universität Karlsruhe, 1988.
[24] Lohner, R.: Precise Evaluation of Polynomials in Several Variables. Computing Suppl. 6, 139 - 148, 1988.
[25] Lohner, R.: Enclosing all Eigenvalues of Symmetric Matrices. In: Ullrich, Ch. and Wolff von Gudenberg, J. (Eds.): Accurate Numerical Algorithms, Research Reports ESPRIT, Springer, Berlin, Heidelberg, New York, 1989.
[26] Ratz, D.: The Effects of the Arithmetic of Vector Computers on Basic Numerical Methods. In: [30], 1990.
[27] Rump, S. M.: Kleine Fehlerschranken bei Matrixproblemen. Dissertation, Universität Karlsruhe, 1980.
[28] Rump, S. M.: Solving Algebraic Problems with High Accuracy. In: [20], 1983.
[29] Rump, S. M.: Wie zuverlässig sind die Ergebnisse unserer Rechenanlagen? In: Jahrbuch Überblicke Mathematik 1983, 163 - 168, B.I., Mannheim, 1983.
[30] Ullrich, Ch. (Ed.): Contributions to Computer Arithmetic and Self-Validating Numerical Methods. J. C. Baltzer AG, Scientific Publishing Co., IMACS, 1990.
[31] Stetter, H.J.: Sequential Defect Correction for High Accuracy Floating-Point Algorithms. Lecture Notes in Mathematics, Vol. 1066, 186 - 202, 1984.
III. Applications in the Engineering Sciences
Multiple-Precision Computations with Result Verification

Walter Krämer

Multiple-precision real and interval modules for PASCAL-XSC have been developed. These modules are used to illustrate a variety of algorithms for the following purposes: multiple-precision evaluation of the square root function with maximum accuracy, the arithmetic-geometric mean iteration, different methods for the computation of a large number of digits of π, the computation of elliptic integrals, the computation of guaranteed bounds for the natural logarithm, and the computation of e using a representation of this value by an infinite product. In general, enclosures for the desired values are computed. Due to the concept of overloading of functions and the operator concept of PASCAL-XSC, the programs become clear and readable.
1  Multiple-Precision Arithmetic
In the runtime library of PASCAL-XSC [8], a multiple-precision arithmetic is available [9]. The base used for the representation of mantissa digits is B = 2^32, i.e., each mantissa digit occupies 32 bits and is implemented by an unsigned 32-bit integer value. Additionally there are an exponent field as well as some other flags which, for example, indicate that a multiple-precision value is (a) zero, (b) negative, (c) temporary, or (d) exact. The number of mantissa digits that may be used to represent one multiple-precision value is bounded only by the capacity of the memory of the employed machine. The required number of mantissa digits (with respect to base 2^32) of a resulting value may be set separately for each operation. All operations consider their arguments to full length, independently of the required length of the result mantissa. There are numerous elementary mathematical functions such as log, exp, sin, cos, arcsin, ... for arguments of the multiple-precision data type. The values of these functions are guaranteed to be accurate to at least two units in the last mantissa digit of the results. Again the number of required mantissa digits may be chosen arbitrarily. The multiple-precision data type is called mpreal. The standard procedure to set the precision is setprec(n). The integer value n specifies the number of required mantissa digits of the results of all subsequent operations. The integer standard function getprec returns the current precision of the multiple-precision arithmetic. Function getprec has no arguments. Other frequently used standard functions are
expo(mp), which gives the exponent of mp with respect to base 2, and mant(mp), which gives the mantissa of an mpreal number normalized to 0.5 ≤ mant(mp) < 1.0. The PASCAL-XSC module is named mp_ari. There is also an interval module for multiple-precision intervals (data type mpinterval). The precision setting is achieved using the same procedure setprec(n). The result of an elementary mathematical function for this data type is a multiple-precision interval. The resulting interval is a superset of the range of the function over the input interval. The bounds are accurate to at least one unit in the last place with respect to the actual precision setting. Again the precision setting may be changed at any point of the program.
The multiple-precision operations +, -, *, / as well as the elementary mathematical functions are called in the usual way. Here, the operator concept of PASCAL-XSC and the concept of overloading of function names is used extensively. This leads to clear and concise programs which are easy to read and debug. As a first example, a program for the determination of an enclosure of sqrt(t) with arbitrary accuracy is considered.
2  Square Root Function with Error Bound
For the computation of the square root function x = √t, the Newton iteration

    x_{i+1} := x_i − (x_i² − t) / (2x_i) = 0.5 · (x_i + t/x_i)

is used. This iteration for the root of the function f(x) = x² − t converges quadratically to √t. In each iteration step the accuracy of the approximation is nearly doubled.
If the argument t of the square root is zero, the result is also 0 and no iteration is performed. In other cases an initial approximation for √t is used which, for simplicity, is computed by use of the normal floating-point arithmetic. The iteration is started with a precision setting of the multiple-precision arithmetic corresponding to the normal floating-point format. In each iteration step the precision is doubled to find the next approximate. The last iteration step is carried out with two additional mantissa digits (with respect to base 2^32).
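Before turning to the PASCAL-XSC version below, the same doubling scheme can be prototyped with Python's decimal module (decimal digits instead of base-2^32 digits; the helper name and guard-digit count are ours):

```python
import math
from decimal import Decimal, getcontext

def mp_sqrt(t, digits):
    """Newton iteration for sqrt(t): start from the ordinary
    double-precision value and nearly double the working precision in
    every step, so each step roughly doubles the correct digits."""
    t = Decimal(t)
    if t == 0:
        return Decimal(0)
    y = Decimal(math.sqrt(float(t)))      # normal floating-point start
    prec = 17
    while prec != digits + 4:             # a few guard digits at the end
        prec = min(digits + 4, 2 * prec - 1)
        getcontext().prec = prec
        y = Decimal("0.5") * (y + t / y)  # Newton step x <- 0.5*(x + t/x)
    return y

r = mp_sqrt(2, 60)   # sqrt(2), correct to roughly 60 decimal digits
```

Because the early iterations run at low precision, the total cost is dominated by the final full-precision step, just as in the mpreal routine described next.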
Only for the last iterate a bound for the error is evaluated. In order to do this, the following representation of the relative error is used: Let x̃ := (1 + ε)√t denote the last approximate. Then there holds the a-posteriori error estimation

    |ε| ≤ |x̃² − t| / ( min(x̃², t) + t ).
This formula is used to find an upper bound for the relative error of the approximate. Notice that the numerator must be computed by means of a precision which
guarantees the exactness of the product x̃². Thus, the cancellation of leading digits in the difference does not cause a loss of accuracy.
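The estimation can be checked directly: for x̃ = (1+ε)√t the bound b := |x̃² − t| / (min(x̃², t) + t) satisfies |ε| ≤ b, so √t must lie between x̃/(1+b) and x̃/(1−b). This containment can be verified exactly with rational arithmetic in Python (an illustrative check; the function name is ours):

```python
from fractions import Fraction

def rel_err_bound(xa, t):
    """Upper bound b for the relative error |eps| of xa = (1+eps)*sqrt(t),
    following the a-posteriori estimation above; everything exact."""
    xa, t = Fraction(xa), Fraction(t)
    return abs(xa * xa - t) / (min(xa * xa, t) + t)

# if the bound is valid, then (xa/(1+b))**2 <= t <= (xa/(1-b))**2,
# which is checkable without ever computing sqrt(t):
for xa in (Fraction(141421, 100000), Fraction(14143, 10000), Fraction(3, 2)):
    b = rel_err_bound(xa, 2)
    assert (xa / (1 + b)) ** 2 <= 2 <= (xa / (1 - b)) ** 2
```

The cases cover approximations below and above √2 as well as a deliberately crude one; in every case the derived interval really contains √2.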
A PASCAL-XSC test version looks as follows:

use mp_ari;                      { multiple-precision arithmetic }

function sqrt(x: mpreal): mpreal;
{ Computation of sqrt(x) with multiple-precision argument x and  }
{ multiple-precision result. The error of the result is          }
{ guaranteed to be less than one unit in the last place with     }
{ respect to the accuracy setting given by the calling program.  }
var
  y, yy, np5, sqrtx: mpreal;
  errbound: mpreal;
  newprec, resprec: integer;
  arg: real;
  stop: boolean;
begin
  mpvlcp(x);                 { local copy of the value parameter x }
  mpinit(y); mpinit(np5);    { initialization of mpreal numbers }
  mpinit(yy); mpinit(errbound);
  np5:= 0.5;                 { create an mpreal value 0.5       }
                             { (overloaded assignment)          }
  resprec:= getprec;         { save precision setting of calling program }
  newprec:= 2;               { initialize precision setting     }
                             { for iterative refinement         }

  { Compute normal floating-point approximation for sqrt(x) }
  arg:= x;                   { mpreal to real conversion }
  y:= sqrt(arg);

  { Quadratic convergence of the Newton iteration leads to a      }
  { doubling of the number of correct bits in each step. So the   }
  { number of mantissa digits (to base 2**32) is nearly doubled   }
  { in each iteration step. The largest number of mantissa digits }
  { is equal to the correct mantissa digits required by the       }
  { calling program part plus two additional guard digits.        }
  while newprec <> resprec + 2 do
  begin
    newprec:= min( resprec + 2, 2*newprec - 1 );
    setprec( newprec );      { alter precision setting according to }
                             { the quadratic rate of convergence    }
    y:= np5*(y + x/y);       { Newton step }
  end;

  { The error bound of the computed result is checked. The result }
  { is ok if more than 32*resprec bits are correct. The following }
  { loop is processed exactly once. Only in cases of very bad     }
  { initial approximations an iteration is possible.              }
W. Krämer
  repeat
    setprec(2*getprec);
    yy:= y*y;                { exact multiplication! }
    setprec(2);              { low precision is enough }
    if not(yy - x = 0.0) then
    begin
      errbound:= succ(abs(yy -< x)) /> (min(yy, x) +< x);
      { -<, +< and /> denote directed rounded operations }
      stop:= (expo(errbound) < -32*resprec) or (errbound = 0.0);
    end
    else
      stop:= true;
    if not stop then
    begin
      writeln('***** SQRT routine: Error bound still too large! *****');
      y:= np5*(y + x/y);
    end;
  until stop;

  { Give back result up to the number of mantissa digits required }
  { by the calling program.                                       }
  setprec(resprec);
  sqrt:= true;               { initialize the result of the function }
  sqrt:= y;
  sqrt:= false;              { reset the temporary flag for the result }

  mpfree(y); mpfree(yy); mpfree(np5);
  mpfree(errbound); mpfree(sqrtx);   { deallocate local mpreal variables }
end;
For the resulting value y of the given routine there holds ∇√t ≤ y ≤ Δ√t. The symbols ∇ and Δ stand for the directed roundings with respect to the precision setting when entering the routine sqrt(). Using the program given above in combination with formula (1), an interval enclosure of √t with one ulp (unit in the last place of the mantissa) accuracy can be determined easily.
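The precision-doubling scheme and the a-posteriori error test can be imitated with Python's decimal module. This is only a rough, non-verified sketch: the function and variable names are invented here, and the final bound is evaluated with round-to-nearest arithmetic instead of the directed roundings of the PASCAL-XSC version.

```python
from decimal import Decimal, getcontext

def mp_sqrt(t, digits=50):
    """Newton iteration for sqrt(t): the working precision is roughly
    doubled in each step; finally the relative error bound
    |x*x - t| / (min(x*x, t) + t) is evaluated with x*x computed exactly."""
    t = Decimal(t)
    prec = 17
    getcontext().prec = prec
    x = Decimal(float(t) ** 0.5)      # ordinary floating-point start value
    while prec < digits + 5:
        prec = min(2 * prec, digits + 5)
        getcontext().prec = prec      # precision doubling
        x = (x + t / x) / 2           # Newton step
    getcontext().prec = 2 * prec      # enough digits: x*x is exact
    xx = x * x
    errbound = abs(xx - t) / (min(xx, t) + t)
    return x, errbound

x, err = mp_sqrt(2)
print(x)     # an approximation of sqrt(2) good to roughly 50 digits
print(err)   # a very small relative error bound
```

The last Newton step is performed with a few guard digits beyond the requested precision, mirroring the two extra base-2**32 digits used in the text.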
3  Arithmetic-Geometric Mean Iteration
In Section 5, the evaluation of elliptic integrals to high accuracy is described. This may be carried out using the so-called arithmetic-geometric mean (AGM) iteration. Let g0 and a0 be two positive numbers with 0 < g0 < a0. The geometric mean of these numbers is g1 := √(g0·a0), while their arithmetic mean is a1 := (g0 + a0)/2. From the properties of means it follows that g0 < g1 < a1 < a0. For the sequences {g_j}, {a_j} with

    g_{j+1} := √(g_j · a_j)   and   a_{j+1} := (g_j + a_j)/2 ,     j = 0, 1, 2, … ,     (2)

there holds the following monotonicity property:

    g0 < g1 < … < g_j < a_j < … < a1 < a0 .     (3)

The sequences {g_j} and {a_j} converge to their common limit

    lim_{j→∞} g_j = lim_{j→∞} a_j =: AGM(g0, a0) .

Moreover, the sequence of distances {d_j} with d_j := a_j − g_j shows the quadratic convergence rate of the AGM iteration. In general, relation (3) is not valid on a computer using rounded arithmetic. However, using directed rounded operations or interval arithmetic, a lower bound G_j for g_j and an upper bound A_j for a_j can be computed in each step of the iteration. Therefore,

    G_j ≤ g_j ≤ AGM(g0, a0) ≤ a_j ≤ A_j ,     j = 0, 1, 2, … .     (6)

The AGM iteration may be executed to arbitrary precision using only the arithmetical operations + and * and the square root function. Interval operations as well as evaluations of the interval square root function give verified bounds for the AGM. The method is not a self-correcting method like the Newton iteration. All iteration steps have to be performed with a full-length arithmetic. The following program is a straightforward implementation in order to get verified bounds for the arithmetic-geometric mean of two positive real numbers. Multiple-precision interval arithmetic is used.
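Before the PASCAL-XSC program, here is a small Python sketch of the same interval idea. The decimal module's contexts supply the directed roundings; all names below are illustrative stand-ins, not part of any XSC module.

```python
from decimal import Decimal, Context, ROUND_FLOOR, ROUND_CEILING

PREC = 50
DOWN = Context(prec=PREC, rounding=ROUND_FLOOR)    # results rounded toward -oo
UP   = Context(prec=PREC, rounding=ROUND_CEILING)  # results rounded toward +oo

def agm_enclosure(x, y, steps=10):
    """Enclosure of AGM(x, y) for exactly representable 0 < x < y: each
    iterate is kept as an interval [lo, hi] with outward rounding."""
    g = [Decimal(x), Decimal(x)]    # interval around the geometric-mean sequence
    a = [Decimal(y), Decimal(y)]    # interval around the arithmetic-mean sequence
    for _ in range(steps):
        # simultaneous update; sqrt and * are increasing in positive arguments,
        # so lower endpoints come from lower endpoints (rounded down) and
        # upper endpoints from upper endpoints (rounded up)
        g_new = [DOWN.sqrt(DOWN.multiply(g[0], a[0])),
                 UP.sqrt(UP.multiply(g[1], a[1]))]
        a_new = [DOWN.divide(DOWN.add(g[0], a[0]), Decimal(2)),
                 UP.divide(UP.add(g[1], a[1]), Decimal(2))]
        g, a = g_new, a_new
    # g_j <= AGM <= a_j, hence the hull of both intervals encloses the limit
    return g[0], a[1]

lo, hi = agm_enclosure(1, 2)
print(lo, hi)   # tight bracket around AGM(1, 2)
```

Because g_j ≤ AGM(x, y) ≤ a_j holds for every j, the hull [g.lo, a.hi] is a mathematically guaranteed enclosure even though each individual iterate is only known to interval accuracy.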
program agm;
use mp_ari, mpi_ari;    { multiple-precision real and interval module }

function agm(x, y: mpinterval): mpinterval;
var
  a, b, bnew: mpinterval;
begin
  mpvlcp(x); mpvlcp(y);   { value copy for mpinterval value parameters }
  mpinit(a); mpinit(b); mpinit(bnew);
                          { initialization of local mpinterval numbers }
  a:= x;                  { starting values for the AGM iteration }
  b:= y;
  while a >< b do         { iterate as long as a and b are disjoint }
  begin
    bnew:= sqrt(a*b);     { geometric mean }
    a:= (a+b)/2;          { arithmetic mean }
    b:= bnew;
    { in each iteration step the convex hull of a and b is an }
    { enclosure of AGM(x,y) = AGM(y,x)                        }
  end;
  agm:= true;             { initialize function result }
  agm:= a +* b;           { convex hull of a and b }
  agm:= false;            { result of function is only temporarily used }
  mpfree(a); mpfree(b); mpfree(bnew);   { free memory for local variables }
end;

var
  a, b, res: mpinterval;  { multiple-precision intervals }
  nn, relerr: integer;
begin
  mpinit(a); mpinit(b); mpinit(res);    { initialize mpinterval variables }
  writeln;
  writeln('*** Arithmetic-Geometric Mean Evaluation ***');
  writeln('***      to Arbitrary Verified Accuracy  ***');
  repeat
    writeln;
    write('Number of mantissa digits (base=2**32)? '); read(nn); writeln;
    setprec(nn);          { precision setting }
    write('a, b = ? '); read(a, b); writeln;
    res:= agm(a, b);
    if nn < 7 then writeln(res);
    setprec(2);
    relerr:= 1 + expo( (res.sup -> res.inf) /> res.inf );
    writeln('Rel. error of enclosure <= 2**(', relerr, ')');
  until false;            { infinite loop }
end.
A sample output of this program is shown below. An enclosure for the arithmetic-geometric mean AGM(1,2) is computed. The input 1 1 denotes the degenerate interval [1,1] (point interval). A lower and an upper bound for the arithmetic-geometric mean as well as an upper bound for the relative error of the enclosure are given.
*** Arithmetic-Geometric Mean Evaluation ***
***      to Arbitrary Verified Accuracy  ***

Number of mantissa digits (base=2**32)?  3
a, b = ?  1 1 2 2

1.45679103104690686905E+000
1.45679103104690686932E+000
Rel. error of enclosure <= 2**(-62)

Number of mantissa digits (base=2**32)?  6
a, b = ?  1 1 2 2

1.456791031046906869186432383265081974973863943219E+000
1.456791031046906869186432383265081974973863943223E+000
Rel. error of enclosure <= 2**(-157)

Number of mantissa digits (base=2**32)?  100
a, b = ?  1 1 2 2

Rel. error of enclosure <= 2**(-3165)
For example, using 100 mantissa digits for the multiple-precision interval arithmetic gives an enclosure of AGM(1,2) with 3165 correct bits.
4  Various Methods for the Computation of Pi
The computation of guaranteed bounds for elliptic integrals (see the next section) is based on the availability of guaranteed bounds for π with sufficient accuracy. In this section several methods are described to compute bounds for the value of π with high accuracy. The individual methods possess rates of convergence of order 1 through 4.
4.1  An Illustrative Example (Method of Archimedes)

By considering inscribed and circumscribed polygons of 96 sides, Archimedes (287-212 B.C.) gave the interval enclosure

    3 10/71 < π < 3 1/7 .
Of course, Archimedes' method is not restricted to polygons of 96 sides. In principle, the method can be used to provide any number of digits of π. Let x_n denote half of the length of one edge of a circumscribed n-sided regular polygon of the unit circle. Then x_{2n}, half of the side length of the 2n-gon, can be expressed by

    x_{2n} = ( √(x_n² + 1) − 1 ) / x_n .     (7)
This formula can be derived using the addition theorem

    tan(α) = tan(α/2 + α/2) = 2·tan(α/2) / ( 1 − tan²(α/2) )

and Figure 1. The expression tan(α/2) corresponds to x_{2n}, whereas tan(α) corresponds to x_n. Solving the quadratic equation for tan(α/2) yields formula (7).

Figure 1: Circumscribed polygons
The sequence {n·x_n} is monotone decreasing with lim_{n→∞} n·x_n = π. To obtain more and more accurate approximations of π, formula (7) may be applied repeatedly. Theoretically, i.e., using exact arithmetic operations, the generated sequence is monotone decreasing. So, the processing of the loop in the following implementation of the algorithm is terminated if the monotonicity is violated.

procedure Pi_approx;
var
  hsl, old, n: real;
begin
  n:= 3;                    { circumscribed triangle }
  hsl:= sqrt(3.0);          { hsl denotes half the length of one edge }
  repeat
    old:= n*hsl;            { old approximation value for pi }
    n:= n + n;
    hsl:= (sqrt(hsl*hsl + 1) - 1) / hsl;   { ==> new approx }
    writeln(n:17:1, '-gon  ', hsl*n);
  until hsl*n >= old;       { check monotonicity property }
end;
The output of procedure Pi_approx is as follows:

        6-gon   3.464101615138E+00
       12-gon   3.215390309168E+00
       24-gon   3.159659942095E+00
       48-gon   3.146086215078E+00
       96-gon   3.142714599259E+00
      192-gon   3.141873051704E+00
      384-gon   3.141662747762E+00
      768-gon   3.141610118163E+00
     1536-gon   3.141596848997E+00
     3072-gon   3.141593220972E+00
     6144-gon   3.141589339128E+00
    12288-gon   3.141581208162E+00
    24576-gon   3.141517244090E+00
    49152-gon   3.141485076488E+00
    98304-gon   3.142286281359E+00
Comparing the final approximation (3.142...) with the known correct value of π = 3.141592653589... shows that only 3 digits of the approximation are correct. Inspection of the intermediate approximations shows that the value produced by the 3072-gon is correct up to 6 decimal places. Apparently, the stopping criterion in combination with normal (round-to-nearest) floating-point arithmetic is not very reliable. The same computation is now carried out using interval arithmetic.

procedure Pi_upper_bound;
var
  hsl, old, n: interval;
begin
  n:= 3;                        { starting with a circumscribed triangle }
  hsl:= sqrt(intval(3.0));      { hsl >= half the length of an edge }
  repeat
    { in each iteration step an upper bound n*hsl of Pi }
    { is computed using a circumscribed n-gon           }
    hsl:= hsl.sup;              { use only an upper bound of the interval }
    old:= n*hsl;                { old approximation of Pi }
    n:= n + n;                  { number of sides is doubled }
    hsl:= (sqrt(hsl*hsl + 1) - 1) / hsl;   { ==> new approx. of Pi }
    writeln(n.inf:17:1, '-gon  Pi: ', hsl*n);
  until sup(hsl*n) >= old.sup;  { stop if monotonicity property }
                                { does not hold any longer      }
end;
This produces the following output:
      6-gon  Pi: [ 3.464101615137E+00, 3.464101615142E+00 ]
     12-gon  Pi: [ 3.21539030916E+00,  3.21539030919E+00 ]
     24-gon  Pi: [ 3.15965994207E+00,  3.15965994217E+00 ]
     48-gon  Pi: [ 3.1460862150E+00,   3.1460862154E+00 ]
     96-gon  Pi: [ 3.142714598E+00,    3.142714601E+00 ]
    192-gon  Pi: [ 3.14187304E+00,     3.14187306E+00 ]
    384-gon  Pi: [ 3.14166274E+00,     3.14166277E+00 ]
    768-gon  Pi: [ 3.1416100E+00,      3.1416103E+00 ]
   1536-gon  Pi: [ 3.1415966E+00,      3.1415975E+00 ]
   3072-gon  Pi: [ 3.141592E+00,       3.141596E+00 ]
   6144-gon  Pi: [ 3.14158E+00,        3.14160E+00 ]
Now the iteration stops at an optimal point. The best approximation is given using the regular 3072-gon. The first 6 digits are correct. Note that the given bounds are bounds for the upper bounds of the iterated formula. They are not bounds for the exact value of π. The interval computation also shows that the method is numerically not very well suited to get good approximations with respect to the number of mantissa digits (here 13 digits to base 10). A much better result can be obtained by transforming formula (7) in the following way:

    x_{2n} = x_n / ( √(x_n² + 1) + 1 ) .     (9)
This notation avoids the cancellation that occurs in the numerical evaluation of the original formula. Prior to presenting numerical results, a corresponding formula for inscribed n-gons is given. Let s_n denote the length of one side of an inscribed regular n-gon of the unit circle, and let h := √(1 − (s_n/2)²) denote the distance from the center to that side. Referring to Figure 2 one obtains

    s_{2n}² = (s_n/2)² + (1 − h)²     (Pythagoras' theorem),
    (s_n/2)² = (1 + h)·(1 − h)        (altitude theorem).
Combining these formulas yields

    s_{2n} = s_n / √( 2 + √(4 − s_n²) ) .     (10)

There holds

    lim_{n→∞} (n/2)·s_n = π .

Using relation (10) repeatedly yields a monotone increasing sequence converging to π. Combining formulas (9) and (10) leads to the following algorithm for the computation of interval enclosures for π:
Figure 2: Inscribed polygons

use li_ari;
{ PASCAL-XSC module for 21 digit decimal interval arithmetic }

procedure enclosure_of_pi;
var
  sl_in, old_in, hsl, old: linterval;
  pi_old, pi: linterval;
  n: longreal;
begin
  pi:= lintval(2, 5);            { interval ranging from 2 to 5 }
  n:= 3;                         { starting with triangles }
  sl_in:= sqrt( lintval(3.0) );
  { sl_in = length of one edge of the inscribed n-gon }
  hsl:= sqrt( lintval(3.0) );
  { hsl = half the length of an edge of the circumscribed n-gon }
  repeat
    old_in:= (sl_in*n)/2;        { old appr. using an inscribed n-gon }
    old:= n*hsl;                 { old appr. using a circumscribed n-gon }
    n:= n + n;                   { consider next 2n-gon }
    sl_in:= sl_in / sqrt( 2 + sqrt(4 - sl_in*sl_in) );
                                 { ==> new approx. of pi (inscribed) }
    hsl:= hsl / ( sqrt(hsl*hsl + 1) + 1 );
                                 { ==> new approx. using circumscribed polygon }
    pi_old:= pi;
    pi:= n*(sl_in/2 +* hsl);     { convex hull gives new enclosure of pi }
    writeln(short(n):14:1, '-gon  ', pi);
  until pi >= pi_old;            { check monotonicity }
end;
This program yields the following output:

          6-gon  [ 2.9E+00,                 3.5E+00 ]
         12-gon  [ 3.1E+00,                 3.3E+00 ]
         24-gon  [ 3.13E+00,                3.16E+00 ]
         48-gon  [ 3.139E+00,               3.147E+00 ]
         96-gon  [ 3.141E+00,               3.143E+00 ]
        192-gon  [ 3.1414E+00,              3.1419E+00 ]
        384-gon  [ 3.1415E+00,              3.1417E+00 ]
        768-gon  [ 3.14158E+00,             3.14162E+00 ]
       1536-gon  [ 3.141590E+00,            3.141598E+00 ]
       3072-gon  [ 3.141592E+00,            3.141594E+00 ]
       6144-gon  [ 3.1415925E+00,           3.1415930E+00 ]
      12288-gon  [ 3.1415926E+00,           3.1415928E+00 ]
      24576-gon  [ 3.14159264E+00,          3.14159268E+00 ]
      49152-gon  [ 3.141592651E+00,         3.141592658E+00 ]
      98304-gon  [ 3.141592653E+00,         3.141592655E+00 ]
     196608-gon  [ 3.1415926534E+00,        3.1415926539E+00 ]
     393216-gon  [ 3.1415926535E+00,        3.1415926537E+00 ]
     786432-gon  [ 3.14159265358E+00,       3.14159265361E+00 ]
    1572864-gon  [ 3.141592653587E+00,      3.141592653593E+00 ]
    3145728-gon  [ 3.141592653589E+00,      3.141592653590E+00 ]
    6291456-gon  [ 3.1415926535896E+00,     3.1415926535900E+00 ]
   12582912-gon  [ 3.14159265358976E+00,    3.14159265358985E+00 ]
   25165824-gon  [ 3.14159265358978E+00,    3.14159265358980E+00 ]
   50331648-gon  [ 3.141592653589791E+00,   3.141592653589797E+00 ]
  100663296-gon  [ 3.141592653589792E+00,   3.141592653589794E+00 ]
  201326592-gon  [ 3.1415926535897931E+00,  3.1415926535897934E+00 ]
  402653184-gon  [ 3.1415926535897932E+00,  3.1415926535897933E+00 ]
  805306368-gon  [ 3.14159265358979322E+00, 3.14159265358979325E+00 ]
 1610612736-gon  [ 3.14159265358979323E+00, 3.14159265358979324E+00 ]
Now, the final enclosure is very satisfactory with respect to the 21 digit decimal interval arithmetic used. The numerical results show that the rate of convergence is rather poor. The following subsections also describe methods with a higher rate of convergence.
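The effect of the cancellation in formula (7), and the stability of the algebraically equivalent formula (9), can be reproduced in ordinary IEEE double precision (this illustration is an addition to the text; the function names are made up):

```python
import math

def half_side_unstable(x):
    # formula (7): sqrt(x*x + 1) - 1 cancels badly for small x
    return (math.sqrt(x * x + 1) - 1) / x

def half_side_stable(x):
    # formula (9): the same quantity without cancellation
    return x / (math.sqrt(x * x + 1) + 1)

n = 3
x_u = x_s = math.sqrt(3.0)     # circumscribed triangle
for _ in range(25):            # 25 doublings of the number of sides
    n += n
    x_u = half_side_unstable(x_u)
    x_s = half_side_stable(x_s)

print(n * x_u)   # drifts visibly away from pi
print(n * x_s)   # stays close to pi
```

After 25 doublings the unstable variant has lost most of its accuracy, while the stable variant agrees with π essentially to full double precision, exactly as the interval output above predicts for the transformed formula.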
4.2  Pi with Order One
A linearly convergent method which is due to Ramanujan (see [3]) is considered. Here 1/π is given as an infinite sum

    1/π = (√8 / 9801) · Σ_{n=0}^{∞} (4n)!·(1103 + 26390·n) / ( (n!)⁴ · 396^{4n} )

with 0! being defined as 1 as usual. Each additional summand increases the number of correct decimal digits of the approximation roughly by 8 (i.e., 26 bits). The following program shows an implementation of the preceding formula.
program ramapi;
use mp_ari;   { module for multiple-precision real arithmetic }
var
  s, sn, app, pi: mpreal;
  nn, n, nmax: integer;
begin
  write('Number of digits (base=2**32)? '); read(nn); writeln;
  setprec(nn + 1);
  nmax:= nn - 2;      { in each step roughly eight decimal places are gained, }
                      { i.e. roughly one mantissa digit to base 2**32         }
  mpinit(app); mpinit(s); mpinit(sn); mpinit(pi);
  pi:= 4*arctan(mpreal(1));       { correct reference value for pi }
  s:= 0;
  for n:= 0 to nmax do
  begin
    setprec(nn - n);              { n-th summand only to lower precision }
    sn:= fac(4*n)*(1103 + mpreal(26390)*n)
         / ( fac(n)**4 * (mpreal(396)**(4*n)) );
    setprec(nn);                  { full precision for summation }
    s:= s + sn;
    app:= 9801 / ( sqrt(mpreal(8))*s );
    if n < 10 then writeln('approx: ', app);
    writeln('n: ', n, '   Error <= 2**(', 1 + expo(pi - app), ')');
  end;
  mpfree(app); mpfree(s); mpfree(sn); mpfree(pi);
end.
The function fac(n) computes the factorial n!. The output of the previous program using 22 mantissa digits with respect to base 2**32 is as follows:

Number of digits (base=2**32)?  22

approx: 3.141592 7...
n: 0    Error <= 2**(-23)
approx: 3.141592653589793 8...
n: 1    Error <= 2**(-50)
approx: 3.14159265358979323846264 3...
n: 2    Error <= 2**(-77)
approx: 3.1415926535897932384626433832795 5...
n: 3    Error <= 2**(-103)
approx: 3.1415926535897932384626433832795028841976...
n: 4    Error <= 2**(-130)
approx: 3.14159265358979323846264338327950288419716939937 5...
n: 5    Error <= 2**(-157)
approx: 3.14159265358979323846264338327950288419716939937510582 1...
n: 6    Error <= 2**(-183)
n: 7    Error <= 2**(-210)
n: 8    Error <= 2**(-237)
n: 9    Error <= 2**(-263)
n: 10   Error <= 2**(-290)
n: 11   Error <= 2**(-316)
n: 12   Error <= 2**(-343)
n: 13   Error <= 2**(-369)
n: 14   Error <= 2**(-396)
n: 15   Error <= 2**(-423)
n: 16   Error <= 2**(-449)
n: 17   Error <= 2**(-476)
n: 18   Error <= 2**(-502)
n: 19   Error <= 2**(-529)
n: 20   Error <= 2**(-555)
After 20 iteration steps an approximation with only 555 correct bits has been evaluated. As in method 1 (see Subsection 4.1) the rate of convergence is very poor.
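The gain of roughly 8 decimal digits per summand can be checked with a short script using Python's decimal arithmetic (a rough, non-verified stand-in for the mp_ari module):

```python
from decimal import Decimal, getcontext
from math import factorial

getcontext().prec = 60

def ramanujan_pi(terms):
    """Approximation of pi from the first `terms` summands of
    Ramanujan's series for 1/pi."""
    s = Decimal(0)
    for n in range(terms):
        num = Decimal(factorial(4 * n)) * (1103 + 26390 * n)
        den = Decimal(factorial(n)) ** 4 * Decimal(396) ** (4 * n)
        s += num / den
    return 9801 / (Decimal(8).sqrt() * s)

for k in (1, 2, 3):
    print(ramanujan_pi(k))   # each extra term adds about 8 correct digits
```

Each call agrees with π in roughly 8·terms leading decimal digits, matching the error exponents (-23, -50, -77, …) in the output above.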
4.3  Pi with Order Two

The method of order two discussed in this section is taken from [4]. Let

    x_1 := (2^{1/4} + 2^{-1/4}) / 2 ,     y_1 := 2^{1/4} ,     π_0 := 2 + √2 .

Then the sequences defined by

    x_{n+1} := ( √x_n + 1/√x_n ) / 2 ,
    y_{n+1} := ( y_n·√x_n + 1/√x_n ) / ( y_n + 1 ) ,
    π_n     := π_{n-1} · (x_n + 1) / (y_n + 1)

are formed. The sequence {π_n} decreases monotonically to π. For all n ≥ 2, the error of π_n is bounded by

    π_n − π < 10^{−2^{n+1}} .

The following program is an implementation of this algorithm in PASCAL-XSC using multiple-precision interval arithmetic. Notice that the algorithm is not a self-correcting algorithm. All steps have to be carried out using the same full-length arithmetic.
program borpi;
use mp_ari, mpi_ari;   { multiple-precision modules }
var
  sqrt2, xnew, ynew, x, y, pi, r, s, d, zp5, range: mpinterval;
  n, nmax: integer;
begin
  setprec(15);          { 15 mantissa digits with respect to base 2**32 }
  nmax:= 6;
  mpinit(sqrt2); mpinit(x); mpinit(y); mpinit(xnew); mpinit(ynew);
  mpinit(pi); mpinit(r); mpinit(s); mpinit(d); mpinit(zp5); mpinit(range);
  range.inf:= 1.5; range.sup:= 1.75;
  d:= 2;
  zp5:= 0.5;                  { 0.5 }
  sqrt2:= sqrt(d);            { sqrt(2) }
  pi:= 2 + sqrt2;             { pi0 }
  s:= sqrt(sqrt2);            { 2**0.25 }
  xnew:= zp5*(s + 1/s);       { x1 }
  ynew:= s;                   { y1 }
  d:= ynew - xnew;
  writeln('pi(i): '); writeln(pi - range*d);
  for n:= 1 to nmax do
  begin
    x:= xnew; y:= ynew;
    s:= sqrt(x); r:= 1/s;
    xnew:= zp5*(s + r);       { i=1 ==> xnew = x2 }
    ynew:= (y*s + r)/(1 + y); { i=1 ==> ynew = y2 }
    pi:= pi*(x + 1)/(y + 1);  { i=1 ==> pi = pi1 }
    d:= ynew - xnew;
    writeln('pi(i): '); writeln(pi - range*d);
    readln;
  end;
end.
The output of this program is as follows. Lower and upper bounds for π in each iteration step are given. A blank character has been inserted by hand to separate the correct digits computed so far (unrecoverable trailing digits are replaced by dots here).

3.1 09441700092714340576111041673344...
3.1 52900537561340156036907853464252...
3.141 4769528155...
3.141 6383529764...
3.14159265 ...
3.14159265 3922243389...
3.1415926535897932384 4175100076...
3.1415926535897932384 70...
3.141592653589793238462643383279502884197169399375105820974944592307816406286208998628034825342117067982148086513282306647093844609550 434677505883
3.141592653589793238462643383279502884197169399375105820974944592307816406286208998628034825342117067982148086513282306647093844609550 727717384599
4.4  Pi with Order Two (Second Method)
The next method is due to Brent [6]. Let g_0 := 1/√2 and a_0 := 1. Define

    t_n := 1/4 − Σ_{k=0}^{n} 2^k · c_k² ,   where   c_k := a_{k+1} − a_k ,

and g_n and a_n are computed by the AGM iteration AGM(g_0, a_0). Then

    (a_{n+1} + g_{n+1})² / (4·t_n)  <  π  <  a_{n+1}² / t_n ,     n = 0, 1, … .
program brentpi;
use mp_ari, mpi_ari;   { modules for multiple-precision arithmetic }
var
  a, b, x, y, t, np5: mpinterval;
  pi_lb, pi_ub: mpreal;
  nn, err, errold: integer;
begin
  mpinit(a); mpinit(b); mpinit(x); mpinit(y); mpinit(t); mpinit(np5);
  mpinit(pi_lb); mpinit(pi_ub);
  write('Mantissa digits (base=2**32)? '); read(nn); writeln;
  setprec(nn);       { nn mantissa digits to the base 2**32 are used }
  np5:= 0.5;
  a:= 1.0;
  b:= sqrt(np5);
  t:= 0.25;
  x:= 1.0;
  err:= 0;
  repeat
    errold:= err;
    y:= a;
    a:= (a + b)/2;   { arithmetic mean }
    b:= sqrt(y*b);   { geometric mean }
    y:= a - y;
    t:= t - x*y*y;   { tn }
    x:= x + x;
    pi_lb:= inf((a+b)*(a+b)/(4*t));
    pi_ub:= sup(a*a/t);
    { It is pi_lb < pi < pi_ub in each iteration step (see Brent) }
    err:= 1 + expo(pi_ub -> pi_lb);
    { upwardly directed rounded subtraction to get upper bound of }
    { exponent of the difference of upper and lower bound for pi  }
    writeln('Error < 2**(', err, ')');
  until err > 2*(errold - 1);   { no longer quadratic convergence }
end.
Output of this program:

Mantissa digits (base=2**32)?  76
Error < 2**(-4)
Error < 2**(-13)
Error < 2**(-31)
Error < 2**(-67)
Error < 2**(-140)
Error < 2**(-285)
Error < 2**(-576)
Error < 2**(-1156)
Error < 2**(-2316)
Error < 2**(-2363)

The 9-th iterate is an approximation of π with 2315 correct bits.
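The underlying Gauss-Legendre (Brent-Salamin) iteration is easy to sketch in Python's decimal arithmetic. This is a plain rounded computation without the verified interval bounds of the PASCAL-XSC program, and the function name is invented here:

```python
from decimal import Decimal, getcontext

def gauss_legendre_pi(digits=50, iterations=6):
    """Gauss-Legendre (Brent-Salamin) iteration; the number of correct
    digits roughly doubles in each step."""
    getcontext().prec = digits + 10          # guard digits
    a = Decimal(1)
    b = Decimal(1) / Decimal(2).sqrt()
    t = Decimal(1) / 4
    x = Decimal(1)                           # 2**k
    for _ in range(iterations):
        y = a
        a = (a + b) / 2                      # arithmetic mean
        b = (y * b).sqrt()                   # geometric mean
        t -= x * (a - y) ** 2                # t_n update
        x += x
    return (a + b) ** 2 / (4 * t)

print(gauss_legendre_pi())   # many correct digits of pi
```

Six iterations already exceed the 50 requested digits, in line with the error-exponent sequence of the program output above.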
4.5  Pi with Order Three
The method described here is a procedure with a cubic rate of convergence which is based on the cosine function. The root x = π/2 of the cosine can be computed by use of the following recursion:

    x_{n+1} := x_n + cos(x_n) ,     n = 0, 1, 2, … .     (12)

The iteration is started with a value x_0 ∈ (−π/2, 3π/2).
For starting values x_0 < π/2 the sequence {x_n} increases monotonically to π/2, and for a starting value x_0 > π/2 it decreases monotonically to the same limit. In both cases the rate of convergence is of order 3. The implementation of the recurrence relation (12) uses the cosine for multiple-precision numbers. The implementation of such a routine is described briefly. For |x| < ε ≤ 1 the cosine is computed using the n-th partial sum of the Taylor series of cos(x); the integer n has to be chosen appropriately:

    cos(x) = Σ_{k=0}^{n} (−1)^k · x^{2k} / (2k)!  +  R_n .

For |x| > ε the recursion

    cos(2x) = 2·cos²(x) − 1
may be used repeatedly. The numerical values of n and ε should be chosen in such a way that the computational cost is minimized (see [15]). If |R_n| < 2^{−m} is required, a suitable n can be determined from the remainder estimate.
Using the representation (n is assumed to be even)

    cos(x) = ( ( … ( ( x²/(2n·(2n−1)) − 1 ) · x²/((2n−2)·(2n−3)) + 1 ) · … ) · x²/(4·3) − 1 ) · x²/(2·1) + 1 ,

it is easy to see that n multiple-precision multiplications and additions as well as n divisions by small integers have to be performed. Of course, a corresponding representation is possible for odd values of n. The cos() routine which has been implemented for the PASCAL-XSC data type mpreal works in this way. Now this routine is used to implement the iterative process (12) for the computation of π/2. According to the rate of convergence, in each iteration step the working precision is increased by a factor of 3.

program lorpi;
use mp_ari, mpi_ari;   { multiple-precision modules }

function max( n, m: integer ): integer;
begin
  if n > m then max:= n else max:= m;
end;

var
  xn, xnp1, pi2_ref: mpreal;
  nn, err: integer;
begin
  mpinit(xn); mpinit(xnp1); mpinit(pi2_ref);
  write('Number of mantissa digits (base=2**32)? '); read(nn); writeln;
  setprec(nn+1);
  pi2_ref:= 2*atan(mpreal(1));   { accurate reference value for pi/2 }
  setprec(3);        { default setting of multiple-precision arithmetic }
  xn:= 1.5707963;    { pi/2 to normal floating-point accuracy }
                     { (initial approximation)                }
  setprec(1);
  repeat
    setprec( 3*getprec - 1 );    { actual precision setting }
    if getprec > nn then setprec(nn+2);
    writeln('Actual precision setting: ', getprec);
    xnp1:= xn + cos(xn);
    xn:= xnp1;
    err:= 1 + expo( abs(pi2_ref - xn) );
    writeln('  Error: 2**(', err, ')');
  until getprec >= nn;
end.
Running this program with 120 mantissa digits gives:

Number of mantissa digits (base=2**32)?  120

Actual precision setting:   2
  Error: 2**(-33)
Actual precision setting:   5
  Error: 2**(-104)
Actual precision setting:  14
  Error: 2**(-315)
Actual precision setting:  41
  Error: 2**(-949)
Actual precision setting: 122
  Error: 2**(-2852)
In each iteration step, the number of correct bits is almost tripled. The actual precision of the employed arithmetic is altered in each iteration step.
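The same scheme can be sketched in plain Python decimal arithmetic. The Taylor-partial-sum cosine below follows the approach described above; the names and the fixed term count are illustrative assumptions, not part of any XSC module:

```python
from decimal import Decimal, getcontext

getcontext().prec = 60

def dec_cos(x, terms=30):
    """Partial sum of the cosine Taylor series; 30 terms are ample for
    |x| < 2 at roughly 60 decimal digits."""
    s = Decimal(1)
    term = Decimal(1)
    x2 = x * x
    for k in range(1, terms + 1):
        term = -term * x2 / ((2 * k - 1) * (2 * k))
        s += term
    return s

# pi/2 reference value, supplied by hand to 60 digits
PI_HALF = Decimal("1.570796326794896619231321691639751442098584699687552910487472")

x = Decimal("1.5707963")        # pi/2 to ordinary float accuracy
for _ in range(4):
    x = x + dec_cos(x)          # the error is roughly cubed in each step
    print((x - PI_HALF).adjusted())   # decimal exponent of the error
```

The printed error exponents drop roughly by a factor of three per step until the fixed 60-digit working precision becomes the limit, mirroring the multiple-precision run above.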
4.6  Pi with Order Four
The rate of convergence of the method discussed in this section is four. Again, it is not the value of π which is approximated. Instead, a sequence of values a_n is computed which tends to the limit 1/π. The method used is based on a modular equation of order four (see [3]). Based on the starting values

    y_0 := √2 − 1 ,     a_0 := 6 − 4·√2 ,

the following iteration is performed:

    y_{k+1} := ( 1 − (1 − y_k⁴)^{1/4} ) / ( 1 + (1 − y_k⁴)^{1/4} ) ,
    a_{k+1} := a_k·(1 + y_{k+1})⁴ − 2^{2k+3}·y_{k+1}·(1 + y_{k+1} + y_{k+1}²) ,     k = 0, 1, 2, … .
The implementation in PASCAL-XSC uses some auxiliary variables for intermediate results. The fourth root is computed using the normal square root function twice. Of course, it would be faster to compute the fourth root by the Newton method directly, analogously to the processing of the normal square root. The method is not a self-correcting one. All operations and function evaluations have to be performed in full-length arithmetic. So the precision setting in the PASCAL-XSC program is executed only once at the beginning of the program part. In order to find an approximation for π, the reciprocal of the final value of the last iterate must be calculated.
program quadpi;
use mp_ari;   { module for multiple-precision real arithmetic }
var
  a, y, h, t, s: mpreal;
  one, two, four, six, ref: mpreal;
  i, imax, n, nn, err: integer;
begin
  write('Number of mantissa digits (base=2**32)? '); read(nn); writeln;
  mpinit(a); mpinit(y); mpinit(h); mpinit(t); mpinit(s);
  mpinit(one); mpinit(two); mpinit(four); mpinit(six); mpinit(ref);
  setprec(nn+1);
  one:= 1; two:= 2; four:= 4; six:= 6;
  ref:= mpreal(0.25)/atan(one);   { accurate reference value for 1/pi }
  setprec(nn);
  h:= sqrt(two);
  y:= h - one;
  a:= six - four*h;
  t:= y*y;
  n:= 8;          { 2**3 }
  imax:= 7;
  for i:= 1 to imax do
  begin
    h:= one - t*t;        { 1 - y**4 }
    t:= sqrt(h);          { sqrt( 1 - y**4 ) }
    h:= sqrt(t);          { sqrt4( 1 - y**4 ) }
    y:= (one - h)/(one + h);
    h:= one + y;
    s:= h*h;
    t:= y*y;
    a:= s*s*a - mpreal(n)*y*(h+t);
    err:= 1 + expo( ref - a );
    writeln('Number of correct bits is at least: ', abs(err));
    n:= 4*n;              { 2**(2k+3) }
  end;
end.
Number of mantissa digits with respect to the employed base 2**32:  3200
Number of correct bits for the first 7 iteration steps:

    >=     30
    >=    137
    >=    570
    >=   2308
    >=   9268
    >=  37113
    >= 102365
Only 7 iteration steps are necessary to compute an approximation with more than 100000 correct bits. For testing purposes the first 10000 significant bits of π as well as the bits 90001 up to 100000 are given in Appendix A. They have been computed using the fourth order method described here.
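The quartic iteration translates directly into Python decimal arithmetic. The sketch below is a plain rounded computation (no verified bounds) and the function name is made up; the fourth root is taken, as in the text, by applying the square root twice:

```python
from decimal import Decimal, getcontext

def quartic_pi(iterations=4, digits=200):
    """Quartically convergent iteration; a_k tends to 1/pi."""
    getcontext().prec = digits + 10
    sqrt2 = Decimal(2).sqrt()
    y = sqrt2 - 1                        # y0
    a = 6 - 4 * sqrt2                    # a0
    n = 8                                # 2**(2k+3) for k = 0
    for _ in range(iterations):
        h = (1 - y ** 4).sqrt().sqrt()   # fourth root of 1 - y**4
        y = (1 - h) / (1 + h)
        a = a * (1 + y) ** 4 - n * y * (1 + y + y * y)
        n *= 4
    return 1 / a                         # reciprocal of the last iterate

print(quartic_pi())   # pi to essentially the full working precision
```

With four iterations the method already delivers on the order of 2300 correct bits, far more than the 200-digit working precision used here, so the result is accurate to the precision floor.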
There are many other methods to compute numerical values for π. See for example [3] and [4].
5  Elliptic integrals
The computation of elliptic integrals is strongly related to the arithmetic-geometric mean iteration described in Section 3 as well as to the availability of sufficiently accurate values of π (see the previous section). To get verified bounds for the elliptic integral of the first kind

    F(k) := ∫₀^{π/2} dτ / √( 1 − k²·sin²τ ) ,

the relationship

    F(k) = π / ( 2 · AGM(√(1 − k²), 1) )

for moduli k ∈ (0, 1) is made use of. If {G_j} and {A_j} are the sequences of machine numbers associated with the process of forming arithmetic and geometric means in (2) with starting values g_0 := √(1 − k²) and a_0 := 1, then there holds

    ∇( π / (2·A_j) )  ≤  F(k)  ≤  Δ( π / (2·G_j) ) .     (14)

The symbols ∇ and Δ denote rounding of constants as well as directed rounded results of floating-point operations towards −∞ and +∞, respectively. The AGM iteration is monotone with respect to both arguments. Thus, if g_0 := √(1 − k²) is not a machine number, the {G_j} sequence has to be computed using a floating-point number less than g_0 and the sequence {A_j} using a floating-point number exceeding g_0. The iteration stops for some j_0 if the term (A_{j0} − G_{j0}) / G_{j0} is sufficiently small, and this then is true for the closely related bound of the relative error. With the same notation as above, the complete elliptic integral of the second kind
    E(k) := ∫₀^{π/2} √( 1 − k²·sin²τ ) dτ

may be represented by the expression (see [15])

    E(k) = [ π / (2·AGM(√(1 − k²), 1)) ] · ( 4 − 2·k² − Σ_{j=0}^{∞} 2^j·(a_j − g_j)² ) / 4 .     (15)
The term

    π / ( 2 · AGM(√(1 − k²), 1) )

can easily be enclosed in an interval as described above (formula (14)). In practice, only a limited number N of terms of the infinite series is used. The associated remainder term
    R_N := Σ_{j=N+1}^{∞} 2^j·(a_j − g_j)²

is positive and bounded by

    R_N ≤ 2^N·(a_N − g_N)²

once the iteration has reached its quadratically convergent regime.
Thus, an enclosure of E(k) is given by

    E(k) ∈ (π/8) · ( 4 − 2·k² − Σ_{j=0}^{N} 2^j·(a_j − g_j)² − [0, 2^N·(a_N − g_N)²] ) / [g_N, a_N] .     (16)

All the operations in (16) are set operations, and [… , …] stands for intervals with the given bounds. Again g_j and a_j can be replaced by their floating-point bounds G_j and A_j as given by (6). The following PASCAL-XSC function makes use of the relationship in (16) in order to compute guaranteed bounds for complete elliptic integrals of the second kind.

function cel2( modulus: real ): interval;
{ Complete elliptic integral of the second kind }
var
  g, gnew, a, k, s, h, pio8: interval;
  j, jmax, twoj: integer;
  err: real;
  stop: boolean;
begin
  k:= modulus;
  if k = 1 then
    cel2:= 1
  else
  begin
    pio8:= arctan(intval(1))*0.5;   { interval enclosure of pi/8 }
    a:= 1;
    g:= sqrt( (1 - k)*(1 + k) );    { interval evaluation of g0 }
    j:= 0; jmax:= 8; err:= 1e-14;
    s:= (a - g)*(a - g);
    twoj:= 1;
    repeat
      writeln(twoj);
      j:= j + 1;
      twoj:= 2*twoj;
      gnew:= sqrt(g*a);             { geometric mean }
      a:= 0.5*(g + a);              { arithmetic mean }
      g:= gnew;
      h:= twoj*(a - g)*(a - g);
      s:= s + h;
      { The range of the complete elliptic integral of the second kind    }
      { for arguments k in [0, 1] is [1, pi/2]. Thus, the following error }
      { criterion leads to a relative error bound.                        }
      stop:= (a.sup - g.inf < err) or (j > jmax);
    until stop;
    s:= 4 - 2*k*k - s - intval(0.0, sup(h));
    h:= s*pio8 / intval(g.inf, a.sup);
    if h.inf < 1 then h.inf:= 1;
    cel2:= h;
  end;
end;
The following table shows the results of each iteration step for the function cel2(k) with modulus k := predecessor(1) = 0.999999999999999999999... An interval arithmetic with 23 decimal mantissa digits has been used.
[ 1.0E+00,                   ... ]
[ 1.0E+00,                   ... ]
[ 1.0E+00,                   ... ]
[ 1.0E+00,                   ... ]
[ 1.000E+00,                 ... ]
[ 1.00000E+00,               ... ]
[ 1.00000000000E+00,         1.00000000006E+00 ]
[ 1.000000000000000000E+00,  1.000000000000000001E+00 ]
The subsequent output demonstrates the fast quadratic convergence rate for E(0.5). The computation has been executed using a multiple-precision interval arithmetic. Only the correct digits are displayed which have been determined by means of a comparison of the lower and the upper bounds of the enclosure that is produced in each iteration step.
Methods for the computation of interval bounds for incomplete elliptic integrals of the first and second kinds are discussed in [13].
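As a quick plausibility check of the relationship F(k) = π/(2·AGM(√(1 − k²), 1)) — without directed roundings, hence not verified — the integral of the first kind can be evaluated in plain Python decimal arithmetic; the 60-digit π constant below is supplied by hand:

```python
from decimal import Decimal, getcontext

getcontext().prec = 60
PI = Decimal("3.14159265358979323846264338327950288419716939937510582097494")

def agm(g, a):
    """Plain (rounded, non-verified) arithmetic-geometric mean."""
    while abs(a - g) > Decimal(10) ** -55:
        g, a = (g * a).sqrt(), (g + a) / 2
    return (g + a) / 2

def ellip_F(k):
    """Complete elliptic integral of the first kind, modulus k in (0, 1)."""
    k = Decimal(k)
    g0 = (1 - k * k).sqrt()
    return PI / (2 * agm(g0, Decimal(1)))

print(ellip_F("0.5"))
```

For a verified enclosure one would instead run two AGM sequences with downward and upward rounding, as formula (14) prescribes.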
6
Evaluation of the Natural Logarithm
The first method described here can be formulated easily; it is linearly convergent.
W. Kramer
348
6.1
Natural Logarithm (First Method)
The method described in [7] will be used. Two sequences x_n and y_n with starting values x_0 > 0 and y_0 > 0 are defined by

    x_{n+1} := sqrt( x_n (x_n + y_n)/2 ),    y_{n+1} := sqrt( y_n (x_n + y_n)/2 ).

Using the common limit of these sequences, log(x) can be expressed in the following way:

    log(x) = (x^2 - 1) / ( 2 (lim x_n)^2 ).

In order to compute log(x), set x_0 := x and y_0 := 1. In each iteration step the common limit lies in the interval

    lim x_n = lim y_n  in  [ min{x_n, y_n}, max{x_n, y_n} ].
The following program uses this observation to get bounds for log(x).

program logagm;
use i_ari;                     { interval module }
var
  arg, res: interval;

function m_ln( arg: interval ): interval;
{ computation of an enclosure of ln(arg) }
var
  x, y, xnew, ynew, m: interval;
begin
  x:= arg; y:= 1;
  repeat
    xnew:= sqrt( x*(x+y)/2 );
    ynew:= sqrt( y*(x+y)/2 );
    x:= xnew; y:= ynew;
    m:= x +* y;                { convex hull of x and y }
    m:= (arg - 1)*(arg + 1)/(2*m*m);
    writeln(m);                { enclosure of ln(arg) }
  until not( x >< y );         { the intersection of x and y is no longer empty }
  m_ln:= m;
end;

begin
  repeat
    write('Argument ? '); read(arg); writeln;
    res:= m_ln(arg);
    writeln('res      : ', res);
    writeln('intrinsic: ', ln(arg));
  until false;
end.
Enclosures computed for log(1.25) are given below.

Argument ? 1.25

[ 1.9E-001,             2.6E-001 ]
[ 2.1E-001,             2.4E-001 ]
[ 2.1E-001,             2.3E-001 ]
[ 2.2E-001,             2.3E-001 ]
[ 2.231435E-001,        2.231436E-001 ]
[ 2.2314354E-001,       2.2314356E-001 ]
  ...
[ 2.2314355131E-001,    2.2314355132E-001 ]
  ...
[ 2.2314355131420E-001, 2.2314355131422E-001 ]
res      : [ 2.2314355131420E-001,   2.2314355131422E-001 ]
intrinsic: [ 2.231435513142097E-001, 2.231435513142098E-001 ]
The enclosure indicated by intrinsic is the resulting interval for ln(arg) using the intrinsic natural logarithm of the PASCAL-XSC interval module i_ari. Obviously, the rate of convergence is very poor.
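The iteration above can be sketched in ordinary Python floats (no enclosures; the name borchardt_ln, the iteration cap, and the tolerance are illustrative choices):

```python
import math

def borchardt_ln(x, max_iter=200, tol=1e-15):
    """Iteration of Section 6.1 in plain floats:
    x_{n+1} = sqrt(x_n (x_n + y_n)/2), y_{n+1} = sqrt(y_n (x_n + y_n)/2)
    with x_0 = x > 0, y_0 = 1; the common limit L yields
    ln(x) = (x^2 - 1) / (2 L^2)."""
    xn, yn = float(x), 1.0
    for _ in range(max_iter):
        if abs(xn - yn) <= tol * max(xn, yn):
            break
        s = 0.5 * (xn + yn)
        xn, yn = math.sqrt(xn * s), math.sqrt(yn * s)
    return (x * x - 1.0) / (2.0 * xn * xn)
```

The roughly 50 iterations needed for full double precision mirror the linear convergence visible in the output listing.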
6.2
Natural Logarithm (Second Method)
The arithmetic-geometric mean iteration may also be used to compute guaranteed bounds for the natural logarithm. The following method is applicable for x in (0, 1);
it uses a precomputed enclosure of pi. There holds (see [4])

    | log(x) - pi/(2 AGM(1, 10^(-n))) + pi/(2 AGM(1, x*10^(-n))) |  <=  n / 10^(2(n-1)).
AGM(...) denotes the arithmetic-geometric mean iteration (2). A PASCAL-XSC program for the computation of logarithms may look as follows:

program logquad;
use mp_ari, mpi_ari;                 { multiple-precision modules }

function log(n: integer; x: mpinterval): mpinterval;
{ Enclosure for log(x) using 10**(-n) in the AGM }
var
  u, v, one, t, err: mpinterval;
  pi2: mpinterval;
  i: integer;
begin
  mpvlcp(x);
  mpinit(u); mpinit(v); mpinit(one);
  mpinit(t); mpinit(err); mpinit(pi2);
  err.inf:= -n;                      { err:= [-n, n] }
  err.sup:=  n;
  for i:= 1 to n-1 do err:= err/100;
  one:= 1;
  pi2:= 2*atan(one);                 { pi/2 }
  t:= one;
  for i:= 1 to n do t:= t/10;        { t:= 10**(-n) }
  u:= agm(t, one);
  v:= agm(x*t, one);
  log:= pi2*(1/u - 1/v) + err;       { enclosure of ln(x) }
  mpfree(u); mpfree(v); mpfree(one);
  mpfree(t); mpfree(err);
end;

var
  x, res: mpinterval;
  k, n, kmax, abserr: integer;
begin
  setprec(3);
  mpinit(x); mpinit(res);
  kmax:= 4; n:= 8;
  repeat
    writeln; write('x = ? '); read(x); writeln;
    for k:= 1 to kmax do begin
      writeln('Actual precision setting: ', getprec);
      res:= log(n, x);               { enclosure of ln(x) }
      if getprec <= 12 then writeln(res);
      abserr:= 1 + expo(res.sup - res.inf);
      writeln('Absolute error of enclosure <= 2**(', abserr, ')');
      setprec(2*getprec);            { doubling of the precision }
      n:= 2*n;                       { 10**(-n) is used in the AGM for log }
    end;
    writeln(ln(x));
  until false;
end.
Using this program for the computation of log(0.75) gives:

x = ? 0.75

Actual precision setting: 3
-2.8768207245186072543433114042E-001
-2.8768207245170072755322886661E-001
Absolute error of enclosure <= 2**(-42)
Actual precision setting: 6
-2.876820724517809274392190060097871183078392255057837146078E-001
-2.87682072451780927439219005977787118307839225559397525391E-001
Absolute error of enclosure <= 2**(-94)
Actual precision setting: 12
-2.8768207245178092743921900599382743150350971089776105650666600E-001
-2.8768207245178092743921900599382743150350971089776105650666536E-001
Absolute error of enclosure <= 2**(-199)
Actual precision setting: 24
Absolute error of enclosure <= 2**(-411)
For arguments near 1, the precision of the arithmetic should be increased for the computation of u, v, and 1/u - 1/v (cancellation!) in program logquad.
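The relationship can be sketched in plain double precision, where a fixed iteration count replaces the enclosure machinery (function names and the choice n = 10 are illustrative):

```python
import math

def agm(a, b, iters=40):
    """Arithmetic-geometric mean of a, b > 0 (fixed iteration count)."""
    for _ in range(iters):
        if a == b:
            break
        a, b = 0.5 * (a + b), math.sqrt(a * b)
    return a

def agm_ln(x, n=10):
    """ln(x) for x in (0, 1) from
    ln(x) ~ pi/(2 AGM(1, 10^-n)) - pi/(2 AGM(1, x 10^-n)),
    truncation error bounded by n * 10^(-2(n-1))."""
    t = 10.0 ** (-n)
    return 0.5 * math.pi * (1.0 / agm(1.0, t) - 1.0 / agm(1.0, x * t))
```

Both AGM values are about ln(4 * 10^n), so for x near 1 the subtraction cancels almost completely, which is exactly the effect the remark above warns about.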
7
Evaluation of e^pi

The value of e^pi can be expressed by means of a rapidly convergent infinite product. The factors used in this representation are given by an arithmetic-geometric mean iteration. With

    g_0 := 1/sqrt(2),    a_0 := 1,
the equation

    e^pi = 32 * prod_{j=0}^{inf} ( a_{j+1} / a_j )^( 2^(1-j) )

holds, where a_{j+1} := (a_j + g_j)/2 and g_{j+1} := sqrt(a_j g_j). This expression is used in the following program:

program epowpi;
use mp_ari;                  { module for multiple-precision real arithmetic }
var
  aj, ajp1, g: mpreal;
  exppi, f, p: mpreal;
  nn, i, j, jmax: integer;
  errexp, maxexp: integer;
begin
  mpinit(aj); mpinit(ajp1); mpinit(g);
  mpinit(p); mpinit(f); mpinit(exppi);
  write('Number of mantissa digits (base=2**32)? '); read(nn); writeln;
  setprec(nn);
  maxexp:= -32*(nn-2);               { each mantissa digit has 32 bits }
  jmax:= 10;
  exppi:= exp(4*piov4);              { use intrinsic interval functions to   }
                                     { get a correct reference value of e**pi }
  writeln('exp(pi): ', exppi);
  j:= 0;
  aj:= 1;
  g:= sqrt(mpreal(0.5));             { 1/sqrt(2) }
  ajp1:= 0.5*(aj + g);               { arithmetic mean of a0 and g0 }
  f:= ajp1/aj;
  p:= 32*f*f;
  repeat
    j:= j + 1;
    g:= sqrt(g*aj);                  { geometric mean }
    aj:= ajp1;
    ajp1:= 0.5*(aj + g);             { arithmetic mean }
    f:= ajp1/aj;
    for i:= 2 to j do f:= sqrt(f);
    p:= p*f;                         { f is the next factor of the product }
    writeln(p);
    errexp:= 1 + expo(exppi - p);    { check actual accuracy }
    writeln('Error <= 2**(', errexp, ')');
  until (errexp < maxexp) or (j > jmax);
end.
The first few approximations of e^pi as computed by the previous PASCAL-XSC program are:

Number of mantissa digits (base=2**32) ? 35

Error <= 2**(-12)
2.314069263E+001
Error <= 2**(-31)
2.31406926327792690057E+001
Error <= 2**(-68)
2.3140692632779269005729086367948547380266106E+001
Error <= 2**(-142)
Error <= 2**(-288)
Error <= 2**(-579)
Error <= 2**(-1082)
The number of correct bits is doubled in each iteration step. All iteration steps have to be computed using the full-length multiple precision arithmetic.
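The product formula can be sketched in plain Python floats; the accuracy is then limited to double precision rather than the full-length multiple-precision arithmetic used above (function name illustrative):

```python
import math

def agm_exp_pi(iterations=6):
    """e**pi = 32 * prod_{j>=0} (a_{j+1}/a_j)**(2**(1-j)) with a_0 = 1,
    g_0 = 1/sqrt(2), a_{j+1} = (a_j + g_j)/2, g_{j+1} = sqrt(a_j g_j)."""
    aj = 1.0
    g = math.sqrt(0.5)             # g_0
    ajp1 = 0.5 * (aj + g)          # a_1
    p = 32.0 * (ajp1 / aj) ** 2    # j = 0 factor
    for j in range(1, iterations + 1):
        g = math.sqrt(g * aj)      # g_j (geometric mean of level j-1)
        aj = ajp1                  # a_j
        ajp1 = 0.5 * (aj + g)      # a_{j+1} (arithmetic mean)
        p *= (ajp1 / aj) ** (2.0 ** (1 - j))
    return p
```

After about four iterations the result is already correct to double precision, consistent with the doubling of correct bits in the listing above.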
Appendix

In the following, the bits 1 to 10000 as well as the bits 90001 to 100000 of pi are given in hexadecimal notation. They have been calculated using the fourth order method of Section 5.5. Notice that the sequence of bits is identical for pi/2, 2*pi and, in general, for 2^k * pi, k any integer number. The multiplication by a power of two is compensated by an appropriate change of the exponent. The first eight significant bits of pi are

    1 1 0 0 1 0 0 1   binary
        C 9           hexadecimal

and the bits 99993 to 100000 are given by

    0 0 0 1 0 0 1 1   binary
        1 3           hexadecimal.

The given value is also the round-to-nearest approximation of pi with an accuracy of 100000 bits. The next four bits, i.e., the bits 100001 to 100004, are 0010.
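The first hexadecimal digits quoted above can be checked quickly in Python; double precision reproduces only the first 52 bits or so, far short of the 100000-bit computation (helper name illustrative):

```python
import math

def leading_pi_bits_hex(nbits):
    """The first nbits significant bits of pi as a hexadecimal string.
    pi = 11.00100100...b, so the leading bit has weight 2^1 and
    floor(pi * 2^(nbits - 2)) holds exactly nbits bits."""
    return format(int(math.pi * 2.0 ** (nbits - 2)), 'x')
```

leading_pi_bits_hex(8) gives 'c9', matching the binary pattern 11001001 above, and leading_pi_bits_hex(32) gives 'c90fdaa2', matching the start of the bit dump below.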
B i t s 1 to 10000: CQOFDAA2 2168C234 C4C6628B 80DClCDl 29024E08 8A67CC74 020BBEA6 3B139B22 514A0879 8E3404DD EF95lQB3 CD3A431B 302BOA6D F25F1437 4FEl356D 6DSlC24.5 E485B676 62SE7EC6 F44C42EQ A637ED6B OBFFSCBB F406B7ED EE386BFB SA899FAS AEQF24ll 7C4BlFE6 49286661 ECE4SB3D C2007CB8 Al63BFOS 98DA4836 lCSSD39A 69163FA8 FD24CFSF 83666D23 DCA3AD96 lC62F366 208562BB 9ED52907 7096966D 67OC364E 4ABC9804 F1746C08 CAl8217C 32905E46 2E36CE3B E39E772C 180E8603 9B2783A2 EC07A28F BSCSSDFO 8F4CS2CQ DE2BCBF6 95581718 3995497C EAQSBAES 16D2261B 9BFAOSlO 1572836A 8AAAC42D AD33170D 04607A33 A85521AB DFlCBA64 ECFB8604 68DBEFOA 8AEA7157 SD060C7D B3970F86 A6ElE4C7 ABFSAEIC DB0933D7 lE8C94EO 4A26619D CEE3D226 lAD2EE6B F12FFA06 D98A0864 D8760273 3EC86A64 521F2B18 177B2OOC BBE11757 7A61SD6C 770988CO BAD946E2 08E24FAO 74ESAB31 43DB6BFC EOFDlOIE 4B82D120 A9210801 1A723C12 A787E6D7 88719A10 BDBASB26 99632718 6AF4E23C 1A946834 B6lSOBDA 2683EQCA 2AD44CE8 DBBBC2DB 04DE8EF9 2E8EFC14 lFBECAA6 287C6947 4E6BCOSD 99B2964F AO9OC3A2 23381186 615BE7ED lF612970 CEE2D7AF B8lBDD76 2170481C DO069127 DSBOSAAQ 93B4EA98 8D8FDDCl 86FFB7DC 90A6C08F 4DF43SC9 34028492 36C3FAB4 D27C7026 ClD4DCB2 602646DE C9761E76 3DBA37BD FEW9406 ADQE53OE ECDB382F 413001AE B06A53ED 9027D831 179727B0 865A8918 DABEDBEB CFQB14ED 44CE6CBA CED4BBlB DB7F1447 E6CC254B 33205161 2BD7AF42 6FB8F401 378CD2BF 5983CAOl C64B92EC F032EAlS D1721D03 F482D7CE 6E74FEF6 DSSE702F 46980682 B5A84031 QOOBlCQE 69E7C97F BEC7E8F3 23A97A7E 36CC88BE OFlD4SB7 FFSESACS 4BD407B2 2B4154AA CC8F6D7E BF48ElD8 14CCSED2 OF8037EO A7971566 F29BE328 06AlDS8B B7CSDA76 FSSOAABD 8AlFBFFO EBlQCCBl A313DSSC DA66CQEC 2EF29632 387FE8D7 6E3C0468 043E8F66 3F4860EE 12BF2DSB OB7474D6 E694F91E 6DBEllS9 74A3926F 12FEEbE4 38777CB6 A932DF8C D8BEC4DO 73B931BA 3BC832B6 8DQDD300 74lFA7BF 8AFC47ED 2576F693 6BA42466 3AAB639C SAE4FS68 3423B474 2BFlC978 238Fl6CB E39D652D EBFDBEBE FC848ADQ 22222304 A4037C07 13EBS7A8 1A23FOC7 3473FC64 6CEA306B 4BCBC886 2F8386DD FAQD4B7F A2C087E8 79683303 EDSBDDBA 
062B3CF6 B3A278A6 6D2A13F8 3F44F82D DF310EEO 74AB6A36 45973899 AO2SSDCl 64F3lCCb 0846851D FQAB4819 6DED7EA1 BlDSlOBD 7EE74D73 FAF36BC3 lECFA268 359046F4 EB879F92 4009438B 481C6CD7 889A002E DSEE382B CQlQODA6 FC026E47 9558E447 6677E9AA 9E3050E2 766694DF C8lFS6E8 BOB96671 60C980DD 98A573EA 44720656 139CD290 6CDlCB72 9ECS2A52 86D44014 A694CA45 7583DSCF EF26FlB9 OAD8291D AO799DOO 022EQBED 5SC6FA47 FCACBBlA CAE37645 6D98D948 79EE7E6D BFCDOl4B B1615599 14ECOB67 6A67E3E8 422EQlE6 5BAl41DA 92DEQC3A 6D6CCA51 36DD424B 81064988 EBSBAQAC 1269F7DF 673B982E 23FB6C99 BB2AA31C 6A6686FF D699149B 3OAC67B8 464D80A9 5D42530A 681644DO 39060E8F 8FD52626 96DOA759 SAE3F935 A67DCFFS A874A701 FBFAOCBD 534B4E39 BC096770 53374821 AllC3ACQ 98EOBA71 8087B317 82SAlACF CFAEBBF2 4F25C605 lADAQC28 6AlFCD61 14A838A1 ADE714Cl 6A9401CD CF81El07 lFF7AB97 239F
...
Bits 90001 to 100000:
...
3022C 36902F34 6AD187A9 C2EC9993 ECD86C41 BFE230bF lFB2EF7C F2CESFD6 8D2F6301 8BlE42SF 9C771026 431F8AlD ODCOF8El 6AD3812A CC9134B8 lFEE327C EOAOESCF C86BEAFl 72A27E30 CEA896FE BA786E22 DDFA302E 222C2CBB A35FA648 82C86362 F9321511 2COQA93C BCA6A178 62069169 SCC41737 AF3D24E6 QAE48E8B FEAQB673 613AE2ED lE6361F7 C2980252 CFZDBEOF 00D9859E 687C8SFA 258FlSFC 1156EQDC D23BSQBC D92531EA 82C83ABF 617AE464 47224260 3BE8D303 6F6847A0 96D82576 CF864F6D F38F97C6 754A878D lDEC3320 92276COE OEODDB32 OF3BC82C 2F27El3A SC32C512 03818880 091889E9 29578818 DEBDA7AC OFD7AQDl CE7EAlD9 646SADFC E07B3BBF QSOEECQE C1040274 33CDCDCC 68AODAC9 3BB057A5 2D26248B 6EADF968 AE293F2A 51D4249A A2030Dl9 E209FClE 366B6B42 A3474215 1BCClSlC OB4F7856 7FFE4A8B 6847F66C lACO4A93 25802585 656FSOBS 3287FE19 EC6C6DCl A86D94Bl 181B0701 AOD85033 lAlCE4FA 55C7D662 DFECC627 40402100 C6SB71B8 C2774FE6 EBB89413 93E80CEB D95BB767 070A2636 873DCl6E 3AFQFBSS A8CD7E77 A4838115 8SDB98CB 56455C51 ODEEOODB EF79lF81 236DA709 66D60D47 SD78EB46 3E85710D 6E23E3E9 E79EB34B BD3C6B6C 8B1110C2 AQBDQQED DElF7954 SFD69C81 6AACCA65 10562526 SBFF6B75 CO19AEDB 3771635A SCCD676F 32268C73 DlOC7FS2 67A21D76 D25FSFE4 AQE17028 67849666 6F360A09 OlAA93CF FQC328A8 88224DD3 lC06B679 648CA864 BC4BEQBD 6190B56C 7ASElD47 6E60F519 9DlBFOl8 AE49ED33 75290058 F4C63BE9 CBS3A38B 01CE70AC 4D419BAl 4A2340D7 B1552ED6 7A4C8243 85652C47 666ABA4D OESDSD04 6D8DS7AF QCD4B435 71540DB4 OCF53386 B3D8B915 6D96DC3B 74D6CCB4 21E73B19 529FDD6A 479568E1 F6EQDF33 9AEEAlBl 696F5011 4CC4B30D 32lCF3D2 B4AC86F6 BEDSEBZE FFFEOAAl 95214CA6 079725FD
Multiple-Precision Cornputat ions 719A82FE 3A297AE8 9098F39D lFF94BFC CO2lDlEl 9666DF61 161C800A SA8DB122 B72CBOO7 9AEC0646 BIEBE96B 16766449 ECD4D09A 9DB26883 C44CF36C 24CBAADF FC8616BD BB9B923E 7307627C 9B297DDA EB4D4046 07E8E7AA CE3lD9BE 036CElD7 ACBSFBF4 CC41843B 26B6609A F38A6A84 48F38333 BA31746F BESABISD C6DB2BBD 6C33ABBC 60CD4BS8 C6F60836 llC7EBEl A727E9A6 EOD64CF3 D4A6C232 7899DA3C 9SBE8A16 B4419347 61248F3A 92230297 73BBC3ED A6B333FB 079702132 E9FI80FS IDE49DF4 DFDBDOZD F8038171 2CISDD2E CC49761E 3CBCDEC9 EF43C7CD 8102FA66 4DCSF9DO
FDFADB79 4BA29826 7EE72D43 07616AD8 446C372C BA813A67 CO99C32E 9CQC9C37 368991F7 2BlD66F3 22El4BDS 660DD446 lSFB2A22 36866981 F84BCBB2 9436A17F 89421907 C03BECF6 0420A192 203EICDE 1696C30E IlDSFDD3 OEBA6A71 CFOC663A 4DOSCFF6 A71748AE FC36691F 686lDDCE 84863130 4SD7642D OAZOEOBB CBAO826A 62646ClE 90ABB038 181126810 24227144 BBOCF086 2C8A6801 14740881) D6ODF369 80F7667C 699A2A6A 92636606 62237109 20B7BCF3 B7FAE3D3 27167250 7E9OFE06 SCSOOECO D40DFE36 SOIFE9C9 B6BAOOC8 EFISAIFB B9D9B644 D410E2AC 3466E76A EDAZDBCI 71646ElE 8360B2E6 8BF43B42 2D8C6961 AB9B9F66 A6808723 4DDDB8SB 61908EAF 601C7B02 ElF74032 8A6BBOC6 166992EA 9DC49ADC F36F98FA 3BA7347S 3C6E9A74 68C32C63 73BE7DBS BOSDC664 23633128 497C3F06 086D0102 D166OE96 C34376C8 CB81810A 92OD40F4 4479BlD6 638B4DEF B66A1998 3Bll9C9A 312E18Cl 2F2D81A6 6DBOBF96 B9031AB2 CF312013
References

[1] Alefeld, G.; Herzberger, J.: An Introduction to Interval Computations. Academic Press, New York, 1983.
[2] Borwein, J.M.; Borwein, P.B.: The Arithmetic-Geometric Mean and Fast Computation of Elementary Functions. SIAM Review, Vol. 26, No. 3, July 1984.
[3] Borwein, J.M.; Borwein, P.B.: Srinivasa Ramanujan und die Zahl Pi. Spektrum der Wissenschaft, April 1988.
[4] Borwein, J.M.; Borwein, P.B.: Pi and the AGM. John Wiley & Sons, 1987.
[5] Braune, K.; Krämer, W.: High-Accuracy Standard Functions for Intervals. In: M. Ruschitzka (editor), "Computer Systems: Performance and Simulation", Elsevier Science Publishers, 1985.
[6] Brent, R.P.: Fast Multiple-Precision Evaluation of Elementary Functions. Journal of the Association for Computing Machinery, Vol. 23, pp. 242-251, April 1976.
[7] Carlson, B.C.: Algorithms Involving Arithmetic and Geometric Means. Amer. Math. Monthly, 78, pp. 496-505, 1971.
[8] Cordes, D.: Runtime System for a PASCAL-XSC Compiler. In: E. Kaucher, S.M. Markov, G. Mayer (editors), Computer Arithmetic, Scientific Computation and Mathematical Modelling, J.C. Baltzer Scientific Publishing Co., IMACS, 1991.
[9] Cordes, D.; Krämer, W.: PASCAL-XSC Module for Multiple-Precision Operations and Functions. Universität Karlsruhe, 1991 (Draft).
[10] Hammer, R.; Neaga, M.; Ratz, D.: PASCAL-XSC, New Concepts for Scientific Computation and Numerical Data Processing. This volume.
[11] IBM: High-Accuracy Arithmetic Subroutine Library (ACRITH), General Information Manual. GC 33-6163-02, 3rd Edition, April 1986.
[12] Klatte, R.; Kulisch, U.; Neaga, M.; Ratz, D.; Ullrich, Ch.: PASCAL-XSC Language Reference with Examples. Springer-Verlag, Berlin/Heidelberg/New York, 1992.
[13] Krämer, W.: Mehrfachgenaue reelle und intervallmäßige Staggered-Correction Arithmetik mit zugehörigen Standardfunktionen. Bericht des Inst. f. Angew. Mathematik, Univ. Karlsruhe, pp. 1-80, 1988.
[14] Krämer, W.: Computation of Verified Bounds for Elliptic Integrals. Oldenburg, SCAN 91.
[15] Kulisch, U.; Miranker, W.L.: Computer Arithmetic in Theory and Practice. Academic Press, New York, 1981.
[16] Lohner, R.: Interval Arithmetic in Staggered Correction Format. This volume.
[17] Lortz, B.: Eine Langzahlarithmetik mit optimaler einseitiger Rundung. Dissertation, Universität Karlsruhe, 1971.
[18] Salamin, E.: Computation of pi Using Arithmetic-Geometric Mean. Mathematics of Computation, Vol. 30, No. 135, July 1976.
[19] Shanks, D.; Wrench, J.W.: Calculation of pi to 100,000 Decimals. Mathematics of Computation, 16, 1962.
[20] Spanier, J.; Oldham, K.B.: An Atlas of Functions. Hemisphere Publishing Corporation, 1987.
Verification of Asymptotic Stability for Interval Matrices and Applications in Control Theory

Beate Gross

This paper deals with the classical control theory on the basis of systems of linear ordinary differential equations y' = Ay, where A represents a constant matrix. In a realistic mathematical simulation, intervals must be admitted for the values of at least some of the constant elements of A. Concerning applications, it is desirable to obtain verified results on the asymptotic stability of y' = Ay and the corresponding degree of stability. For this purpose, four constructive methods are developed such that the interval matrix admitted for A is directly treated. Consequently, there is no need for an employment of the characteristic polynomial and its roots, i.e., of quantities which would have to be computed prior to a stability analysis. As a major example, the automatic electro-mechanical track control of a city bus is presented.
1
Introduction
Generally, mathematical simulations of real world problems are affected by uncertainties concerning input data. Examples are changes of operational parameters or imprecise measurements. The influence of these uncertainties may be simulated by means of the employment of interval-valued input data. This paper is concerned with mathematical models involving systems of linear ordinary differential equations (ODEs) with constant (interval-valued) coefficients and the requirement of the asymptotic stability of the homogeneous ODEs belonging to the collection of systems
As an example, the following problem in the theory of automatic control is presented and analyzed in this paper: a city bus with an electro-mechanical track-guidance. Because of the variety of operational states, the input parameters may take values in unusually wide intervals. The mathematical design of the automatic vehicle control yields a collection of systems (1.1). The desired stability properties of the control problem are equivalent with the asymptotic stability of the system of homogeneous ODEs. The goal of this paper is to develop a mathematical method allowing the verification of the property of asymptotic stability of the collection of systems (1.1).

Scientific Computing with Automatic Result Verification
Copyright © 1993 by Academic Press, Inc. All rights of reproduction in any form reserved. ISBN 0-12-044210-8
358
B. Gross
property is present in the case of negative real parts of all eigenvalues of all matrices A in [A]. Additionally, a superset is to be determined in the complex plane that is guaranteed to contain all eigenvalues. By means of the example of the track-guided city bus, the practical applicability of the method to be developed will be assessed. The mathematical problem just outlined concerns the property of the Hurwitz stability of an interval matrix [A]. By means of a bijective matrix transformation, this problem can be expressed equivalently through the dual problem of the Schur stability of a related interval matrix [B]: for all B in [B], it has to be verified that the spectral radius is smaller than one. For this purpose, the Cordes algorithm will be used, which has been developed in the doctoral dissertation [6] of D. Cordes. A successful completion of this algorithm is confined to interval matrices [B] possessing sufficiently small widths. Therefore, subsequently a partitioning method for [A] is presented allowing applications of the Cordes algorithm with respect to interval matrices generated by subdivision of [A]. As time increases to infinity, the decay properties of the solutions of (1.1) are governed by the locations of the eigenvalues of all A in [A] in the left half-plane. In this context, the degree of stability of (1.1) is defined as the minimal distance of the eigenvalues from the imaginary axis. In view of the tasks outlined above, four constructive methods are presented yielding the superset addressed before as well as lower bounds of the degree of stability. By means of Enclosure Methods, the four constructive approaches have been programmed making use of the computer language PASCAL-SC, a predecessor of the language PASCAL-XSC. Under the admission of interval matrices [A], this allows a fully automated execution of the employed Enclosure Method such that guaranteed results are obtained.
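The transformation itself is not reproduced in this introduction. One standard bijection with the required property is the Cayley transform B = (I + A)(I - A)^(-1), which maps each eigenvalue lambda of A to (1 + lambda)/(1 - lambda) and hence the open left half-plane onto the open unit disk. The following is a minimal Python sketch of this eigenvalue correspondence for 2x2 point matrices only (plain floats, no intervals; all names are hypothetical, and this is not the Cordes algorithm itself):

```python
import cmath

def eig2(a, b, c, d):
    """Eigenvalues of the 2x2 matrix [[a, b], [c, d]]."""
    tr, det = a + d, a * d - b * c
    s = cmath.sqrt(complex(tr * tr - 4.0 * det))
    return (tr + s) / 2.0, (tr - s) / 2.0

def hurwitz_stable_2x2(a, b, c, d):
    """All eigenvalues in the open left half-plane."""
    return all(l.real < 0.0 for l in eig2(a, b, c, d))

def schur_dual_2x2(a, b, c, d):
    """Equivalent test on the Cayley images (1 + l)/(1 - l):
    Hurwitz stability of A iff all images lie inside the unit circle."""
    return all(abs((1.0 + l) / (1.0 - l)) < 1.0 for l in eig2(a, b, c, d))
```

Both predicates agree, e.g., on [[0, 1], [-2, -3]] (eigenvalues -1, -2: stable) and on [[0, 1], [2, 1]] (eigenvalues 2, -1: unstable); the verified method must instead bound the spectral radius of the whole interval matrix [B] without computing individual eigenvalues.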
In all investigated cases, non-conservative results were determined for the superset and the degree of stability. This is in particular true in the case of multiple eigenvalues (or eigenvalues with relatively small distances). In several examples with large widths of [A], the property of asymptotic stability is shown by use of the partitioning method. In contrast to the presented constructive methods, the execution of the classical methods is strongly affected by the condition numbers of the individual eigenvalues. This is in particular true for the methods of Leonhard-Mikhailov, Routh, Hurwitz, or Nyquist; see the monograph by W. Hahn [9]. In many applications, these traditional methods are therefore not very useful for the goals presently pursued, particularly when all A in [A] are admitted. In the more recent literature, no constructive methods are known with a performance superior to the ones just mentioned. As a conclusion of this Introduction, the following important or desirable properties are listed for the constructive methods to be presented: (a) their execution avoids the unnecessary task of a computational determination of the individual eigenvalues of A;
Verification of Asymptotic Stability
359
(b) the practical execution would be significantly improved provided it could be carried out on a mainframe computer; (c) the execution is not as sensitive with respect to details of the given problem as, e.g., an execution by use of Computer Algebra Systems like MATHEMATICA; (d) the characteristic interval polynomial of [A] is not needed, whose computational determination may be costly and ill-conditioned.
Remark: This paper is an abbreviated version of the author’s Diploma Thesis [8].
2
A Sample Problem: A Bus with an Automatic Tracking System
As an introductory motivation for the subsequently treated general problem area, an example from control theory will now be presented, allowing an efficient application of methods for the verification of asymptotic stability.
2.1
Performance Description (Outline of the Simulation)
For a considerable time, investigations have been conducted into the automatic, electronically guided tracking system of a motor vehicle. Contributing problems have been steering and the proper tracking of the vehicle. This concept is of particular interest to suburban and inner-city traffic and was, therefore, developed for a standard line bus system. In 1985, in Fürth/Germany, a 700 m "test track" was built and incorporated into normal line use, with future plans for an expanded system. The advantages of an automatic guidance are obvious: the driver can devote his attention to the acceleration and deceleration of the vehicle and to the passengers. Additionally, precise lateral guidance of the vehicle facilitates the use of narrow traffic lanes and aids in a more accurate approach to bus stops. This is of particular importance in crowded and busy inner cities. The predetermined path of the bus is marked by an A.C. cable, which is embedded a few centimeters beneath the road surface. The course deviation is measured every 10 msec by a pair of induction coils which are located at the front end underneath the bus, and the analogue signal is then digitized. The required course correction is determined by means of a control algorithm which is stored in the program memory of the on-board microprocessor. The required signal is then transmitted to the front wheels by a hydraulic steering system, without the assistance of the driver.
Figure 2.1: Principle of the electronic tracking system (Omnibus O 305)
Figure 2.1 illustrates the simulation: With velocity v, the vehicle of mass m follows the cable beneath the road surface. The cable is placed in accordance with standard civil engineering practices, i.e., straight lines and curves with fixed radii R are connected by clothoid-shaped sections. Road-surface conditions are represented by a predetermined traction coefficient μ, addressing the contact of the tires to the road surface. The following parameters determine the automatic control system:

- the deviation d_v of the cable with reference to the inductive device,
- the corresponding deviation change rate ḋ_v,
- the turning angle β of the front wheels,
- the angle α between the vehicle's longitudinal axis and the local direction of motion of the vehicle's center of gravity, and
- the angular velocity of the vehicle about its center of gravity.
which is given as follows in detail
Q
-0 0 = 0 0 0
1 0 0 a23 0 a33 0 a43 0 0
0
0 --dv
a24
a25
a34
a35
a44
a45
0
0
Q
- .p
0
Verification of Asymptotic Stability
ol[
0 0 0
9
361
At first glance, the leading ODE in (2.2) appears to be meaningless: this equation will be employed, however, for the determination of the distance d_v, which is one of the components of the vector y_s. The input data for (2.2) are listed in Subsection 4.4.
The state vector x_s represents the set of state parameters of the control system. This system can be influenced externally by means of the input function u_s of the control plant, which here is represented by the steering angle β. The local curvature of the cable acts as a perturbation parameter z_s. The output is the distance d_v referred to before. The matrix A_s is said to be the systems matrix or the dynamics matrix. The vectors b_s, e_s, and c_s are called the input vector, the perturbation vector, and the output vector, respectively. The model has been derived under the assumption of small values of |α|. The components α and the angular velocity of the solution represent the vehicle's response with respect to the steering angle β.

This simulation is governed by strong changes of the physical parameters as the bus moves down the line. The predominant influence is exerted by the continuous change of the velocity v between 1 and 20 m/sec. At each bus stop, the mass m varies in accordance with the number of passengers entering or leaving the bus. The mass varies between 9950 kg and 16000 kg. The coefficient μ referred to before takes values between 0.5 and 1.0. Both the system matrix A_s and the perturbation vector e_s depend on these non-fixed parameters, i.e., A_s = A_s(v, m, μ) and e_s = e_s(v). The outlined strong parameter variations require a robust control guaranteeing an acceptable operation of the bus for any point (v, m, μ) in the parameter space. Because of engineering reasons, several restrictions for the parameter variations are prescribed:

- in particular because of mechanical reasons, the steering angle β is bounded by the condition |β| ≤ 45°;
- because of the computer-governed control mechanism, the corresponding angular velocity is bounded by the condition |β̇| ≤ 23°/sec;
- because of safety reasons, the maximum distance d_v is bounded by the condition |d_v| ≤ 0.15 m.

In response to an initial deviation d_v(0) or a local change of the curvature, the distance d_v = d_v(t) is to be reduced to zero by means of the automatic control.
2.2
The Design of the Control
The rather complex design of the control requires an extensive knowledge of the methods of automatic control. Consequently, this design can only be outlined briefly here. A detailed presentation is given in the monograph by G. Roppenecker [14]. The control to be designed has the following general layout:
Figure 2.2: Chosen design of the dynamical feedback controller of the automatic tracking system (blocks: process Omnibus O 305, dynamic control, differential filter)
The deviation d_v is the only measurable quantity. By use of the parameter space method [1], it can be shown that a feedback of the velocity ḋ_v is sufficient for a stabilization of the feedback control system. Since ḋ_v is not directly measurable, this variable is determined by means of a differential filter connected in series, making use of the filtered values d̂_v of the measured distance d_v. The thus generated estimate for ḋ_v is denoted by d̂̇_v. Both d̂_v and d̂̇_v are input signals for a dynamic second-order control. This generates the input variable for the feedback control. Consequently, a preset automatic control is to be designed such that there is a satisfactory operation of the bus for every point (v, m, μ) of the parameter space. A control possessing this property is said to be robust. It is hardly possible to assess quantitatively the operational comfort concerning the moving bus. As an accessible gauge, the acceleration may be employed, which should be suitably bounded such that there are no discontinuous changes of this quantity. For the design of a control with these properties, the following three representative cases have been chosen by Roppenecker [14]:
Operational Case 1: v = 3 m/sec, m = 9950 kg, μ = 1;
Operational Case 2: v = 10 m/sec, m = 9950 kg, μ = 1;
Operational Case 3: v = 20 m/sec, m = 16000 kg, μ = 1.

The stability for these three cases is an obvious prerequisite for the generally suitable operation of the bus in all other cases. For the three cases, the property of stability ensures that the bus, with sufficient operational comfort, follows the A.C. cable. Subsequent to numerous steps concerning the design and the optimization, a system of homogeneous linear ODEs of the total order seven was obtained which possesses a unique stationary point. So, a verification of the property of asymptotic stability of the homogeneous ODE system is sufficient. The system of the total order seven can be reduced to the following system of the total order six:
         [   0        1       0     0     0      0    ]
         [   0        0      a23   a24   a25     0    ]
    y' = [   0        0      a33   a34   a35     0    ] y .        (2.3)
         [   0        0      a43   a44   a45     0    ]
         [ -0.882    2.588    0     0     0      1    ]
         [ 72.68   -53.55     0     0     0   -29.52  ]
In addition to the system (2.2) of ODEs for the uncontrolled circuit, there is the ODE for a control function. Furthermore, two vanishing elements in the matrix in (2.2) are here replaced by non-zero values. For the practical realization there is an additional differential filter, supplying the variable ḋ_v. This increases the total order to eight. This additional filter is not taken into account in the subsequent analysis. For an automatic control, it is not sufficient just to design this system. Rather, numerous tests and simulations are required prior to a practical employment of this control. In the case of a failure, this would have catastrophic consequences, depending on the particular situation. Generally, it is not possible to test the automatic control physically and in advance for all possible operational cases. The execution of an advance verifying stability analysis is then particularly important. Time and cost involved in the development of the control can be partly saved provided extensive engineering simulations and tests may be omitted or reduced concerning the scope of the implied investigations.
The quantitative analysis of this sample problem is presented in Subsections 4.4 and 5.9.
3
Verification of Asymptotic Stability
This paper is devoted to an analysis of the asymptotic stability of systems of linear ordinary differential equations with constant coefficients. Additionally, the classical concepts in the theory of ordinary differential equations (ODEs) are generalized here to the case of interval-valued constant coefficients.

B. Gross
364
3.1
The Concept of Stability
At first, initial value problems (IVPs) of the following kind will be considered:

    y' = Ay + b,    y(t0) = y0                            (3.1)

with A ∈ R^{n×n}, y : R → R^n, y0 ∈ R^n, b ∈ R^n.
For the well known theory in the present subsection, see the monographs by W. Walter [15] and L. Cesari [5].
Theorem 3.1 The IVP (3.1) possesses the unique solution

    y(t) = Φ(t) y0 + ∫_{t0}^{t} Φ(t − s) b ds

with Φ(t) = e^{A(t − t0)} the fundamental matrix of the system of homogeneous ODEs.

All solutions of the system of homogeneous ODEs y' = Ay with initial condition at t = t0 then can be represented by y(t) = Φ(t) y0 = e^{A(t − t0)} y0, i.e., they depend only on the fundamental matrix Φ.
Theorem 3.2 The following three properties are equivalent:

1. The homogeneous ODEs y' = Ay are asymptotically stable.

2. All eigenvalues of A are confined to the left half-plane {z ∈ C : Re z < 0}.

3. All eigenvalues of Φ(t0 + 1) = e^A are confined to the interior of the unit circle {w ∈ C : |w| < 1}.
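The equivalence of criteria 2 and 3 can be illustrated numerically for a small sample matrix (ordinary floating-point arithmetic, not a verified computation; the helper routines are ad hoc and restricted to 2×2 matrices):

```python
# Illustration of Theorem 3.2 for a hypothetical 2x2 example: Re(lambda) < 0
# for all eigenvalues of A  <=>  the spectral radius of e^A is below one.
import cmath

def eig2(M):
    """Eigenvalues of a 2x2 matrix from lambda^2 - tr*lambda + det = 0."""
    (a, b), (c, d) = M
    tr, det = a + d, a * d - b * c
    disc = cmath.sqrt(tr * tr - 4 * det)
    return (tr + disc) / 2, (tr - disc) / 2

def expm2(A, terms=60):
    """Matrix exponential of a 2x2 matrix by its Taylor series (adequate
    for the small, well-scaled matrix used in this illustration)."""
    E = [[1.0, 0.0], [0.0, 1.0]]          # partial sum, starts at A^0
    P = [[1.0, 0.0], [0.0, 1.0]]          # current term A^k / k!
    for k in range(1, terms):
        P = [[sum(P[i][j] * A[j][m] for j in range(2)) / k for m in range(2)]
             for i in range(2)]
        E = [[E[i][m] + P[i][m] for m in range(2)] for i in range(2)]
    return E

A = [[0.0, 1.0], [-2.0, -3.0]]            # eigenvalues -1 and -2
stable_by_halfplane = all(l.real < 0 for l in eig2(A))
rho_expA = max(abs(l) for l in eig2(expm2(A)))
stable_by_unit_circle = rho_expA < 1      # rho(e^A) = e^(-1) < 1
```

Both criteria agree for this example, the spectral radius of e^A being e^{−1}.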
In the simulation of real world problems, frequently there are uncertainties such as variations of nominally constant parameters or imprecise measurements. In order to deal with problems of this kind, it is advantageous to employ interval matrices rather than point matrices, see the monograph [2] by G. Alefeld and J. Herzberger. Under the admission of an interval matrix [A] ⊆ R^{n×n} as a replacement of a point matrix A ∈ R^{n×n}, instead of (3.1) there is a collection of IVPs

    y' = Ay + b,    y(t0) = y0                            (3.2)

with A ∈ [A] ⊆ R^{n×n}, y : R → R^n, y0 ∈ R^n, b ∈ [b] ⊆ R^n. The corresponding system of homogeneous ODEs is given by

    y' = Ay,    y : R → R^n,  A ∈ [A] ⊆ R^{n×n}.          (3.3)
Provided the property of asymptotic stability has been verified for each matrix A ∈ [A], Theorem 3.2 can be extended as follows:
Corollary 3.3 The following three properties are equivalent:

1. The homogeneous ODEs y' = Ay are asymptotically stable for all A ∈ [A].

2. For all A ∈ [A], all eigenvalues are confined to the left half-plane.

3. For all A ∈ [A], all eigenvalues of Φ(t0 + 1) = e^A are confined to the interior of the unit circle.
A collection of systems of ODEs is said to be unstable provided there is at least one matrix A ∈ [A] that is unstable. Identical stability concepts are usually employed for matrices A ∈ R^{n×n} and for ODEs y' = Ay. For instance, a matrix A is called asymptotically stable provided the corresponding system y' = Ay possesses this property.
3.2
Möbius-Transformations
The verification of asymptotic stability to be presented rests on a bijective mapping of the left half-plane {z ∈ C : Re z < 0} onto the interior of the unit circle {w ∈ C : |w| < 1}. Allowing a simple computational execution, a mapping of this kind can be constructed by use of the theory of the Möbius-transformation, whose properties are presented in this section. For an outline of the theory of these transformations, e.g., K. Knopp's monograph [11] may be referred to.
Definition: Numbers a, b, c, d ∈ C are chosen such that ad − bc ≠ 0. The function

    w := S(z) := (az + b) / (cz + d)

then is said to be the Möbius-transformation of z.

Provided c ≠ 0, the denominator vanishes for z = −d/c, so that the function is not defined at this point. In order to avoid this case, the point ∞ is added to the set C, thus generating the set C̄ := C ∪ {∞}. In C̄, S(−d/c) := ∞ is defined, and

    S(∞) := ∞     if c = 0,
            a/c   otherwise.

Consequently, the Möbius-transformation then is well-defined for all z ∈ C̄.
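The definition, including the extension to C̄ = C ∪ {∞}, can be transcribed directly (an illustration; the name `mobius` and the use of a floating-point infinity for the point ∞ are ad hoc choices):

```python
# A Moebius transformation on the extended complex plane; INF models the
# added point "infinity".
INF = complex("inf")

def mobius(z, a, b, c, d):
    if a * d - b * c == 0:
        raise ValueError("ad - bc must be non-zero")
    if z == INF:                    # S(infinity) := a/c, or infinity if c = 0
        return INF if c == 0 else a / c
    if c != 0 and z == -d / c:      # S(-d/c) := infinity
        return INF
    return (a * z + b) / (c * z + d)

# The special transformation S(z) = (z + 1)/(z - 1) used later in this paper:
def S(z):
    return mobius(z, 1, 1, 1, -1)
```

For instance, this S maps 0 to −1, the point ∞ to 1, and purely imaginary points onto the unit circle.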
Some of the properties of the Möbius-transformations are now asserted in the following theorem, see [11]:
Theorem 3.4

1. Every Möbius-transformation can be represented as a composition of its three elementary versions: a translation, a rotation together with a dilatation, and an inversion.

2. Every Möbius-transformation S is a bijective mapping from C̄ onto C̄. The corresponding inverse mapping is given by

    z = S^{-1}(w) = (dw − b) / (−cw + a).

3. A Möbius-transformation maps generalized circles (i.e., circles and straight lines in C̄) into generalized circles.

4. A domain D is considered that is bounded by a generalized circle Γ. A Möbius-transformation S maps D bijectively onto one of the two domains which are bounded by the generalized circle S(Γ) := {S(z) : z ∈ Γ}.

5. Every Möbius-transformation S different from the identity possesses precisely one or two fixed points.

6. Every Möbius-transformation S is uniquely determined by the images of three different points.

7. A Möbius-transformation preserves the angle defined by the intersection of two generalized circles.
Remark: Beginning here, the short expression "rotation plus dilatation" will be used instead of "rotation together with a dilatation".
3.3
Applications Concerning Matrices
Just as in the case of complex numbers z ∈ C̄, Möbius-transformations can also be defined for square matrices. Unless explicitly stated, this paper is only concerned with matrices of this kind. The identity matrix will always be denoted by I ∈ C^{n×n}.

Definition: If A ∈ C^{n×n} and S(z) = (az + b)/(cz + d) represents a Möbius-transformation with a, b, c, d ∈ C, then

    S(A) := (aA + bI)(cA + dI)^{-1}

is said to be the Möbius-transformation of A, provided cA + dI is invertible, i.e., provided that −d/c is not an eigenvalue of A.
The action of a Möbius-transformation on the spectrum of A ∈ C^{n×n} will now be discussed.
Theorem 3.5 If λ is an eigenvalue of A, then S(λ) = (aλ + b)/(cλ + d) is an eigenvalue of S(A), assuming the existence of S(A).
Proof: Since every Möbius-transformation can be represented as a composition of its three elementary types (translation, rotation plus dilatation, and inversion), it is sufficient to prove the assertion only for each one of these three elementary types. Provided Az = λz, that is, z ∈ C^n is an eigenvector belonging to the eigenvalue λ, the elementary Möbius-transformations act as follows:
1. Translation: S_1(A)z := (A + bI)z = (λ + b)z.

2. Rotation plus dilatation: S_2(A)z := aAz = aλz.

3. Inversion: If A is invertible, i.e., if all eigenvalues are non-zero, then S_3(A)z := A^{-1}z = (1/λ)z.
It is now to be shown that S(A) can be represented as a composition of these three elementary types of transformations. The following two cases have to be distinguished: either c = 0 or c ≠ 0. If c = 0, S(A) can be represented by means of S(A) = (aA + bI)(dI)^{-1} = (a/d)A + (b/d)I. This is a composition of a rotation plus dilatation and a subsequent translation. If c ≠ 0, S(A) can be represented as follows:

    S(A) = (aA + bI)(cA + dI)^{-1}
         = ((a/c)(cA + dI) + (b − ad/c) I)(cA + dI)^{-1}
         = (a/c) I + (b − ad/c)(cA + dI)^{-1}.

Consequently, A is at first subjected to a rotation plus dilatation and then to a translation. Afterwards, the result is inverted; S(A) then is generated by a subsequent rotation plus dilatation and a translation. Any eigenvalue λ is correspondingly transformed, and this completes the proof. □
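Theorem 3.5 can be checked numerically for a small example (floating-point only; the 2×2 helper routines are ad hoc and not part of the verified methods of this paper):

```python
# Check of Theorem 3.5 on a hypothetical 2x2 example: the eigenvalues of
# S(A) = (aA + bI)(cA + dI)^(-1) are the scalar images S(lambda).
import cmath

def eig2(M):
    (a, b), (c, d) = M
    tr, det = a + d, a * d - b * c
    disc = cmath.sqrt(tr * tr - 4 * det)
    return sorted([(tr + disc) / 2, (tr - disc) / 2], key=lambda z: z.real)

def mat_mobius2(A, a, b, c, d):
    """(aA + bI)(cA + dI)^(-1) for a 2x2 matrix A."""
    num = [[a * A[0][0] + b, a * A[0][1]], [a * A[1][0], a * A[1][1] + b]]
    den = [[c * A[0][0] + d, c * A[0][1]], [c * A[1][0], c * A[1][1] + d]]
    det = den[0][0] * den[1][1] - den[0][1] * den[1][0]   # must be non-zero
    inv = [[den[1][1] / det, -den[0][1] / det],
           [-den[1][0] / det, den[0][0] / det]]
    return [[sum(num[i][k] * inv[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

A = [[0.0, 1.0], [-2.0, -3.0]]            # eigenvalues -2 and -1
lamA = eig2(A)
lamB = eig2(mat_mobius2(A, 1, 1, 1, -1))  # S(A) = (A + I)(A - I)^(-1)
# S(-2) = 1/3 and S(-1) = 0 reappear as the eigenvalues of S(A)
```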
3.4
Verification of Asymptotic Stability by Means of a Möbius-transformation
For a given matrix A ∈ R^{n×n} it is to be verified that all eigenvalues are confined to the left half-plane. For this purpose, a suitable Möbius-transformation S(A) is to be constructed, such that the left half of the z-plane is mapped onto the interior
of the unit circle in the w-plane. Subsequently, by Theorem 3.4.4, the imaginary axis (including ∞) of the z-plane has to be transformed into the unit circle in the w-plane.
Then A is asymptotically stable, provided the spectral radius ρ(S(A)) of S(A) is smaller than one. This condition can be tested and verified by means of the Cordes-Algorithm presented in Subsection 3.5.
Figure 3.1: Desired mapping properties of the Möbius-transformation w = S(z)
Theorem 3.6 The Möbius-transformation w = S(z) = (z + 1)/(z − 1) possesses the following properties:

1. The real axis is mapped onto the real axis.

2. The imaginary axis is mapped onto the unit circle.

3. The inverse transformation is given by z = S^{-1}(w) = (w + 1)/(w − 1).
Proof: 1. Since there are only real coefficients in the expression for S, real numbers are mapped into real numbers.

2. The points 0 and ∞ on the imaginary axis are considered. They are mapped into the points −1 and 1, respectively, on the real axis. Since these axes are orthogonal, the image of the imaginary axis is a circle or a straight line, with an orthogonal intersection of the real axis in either case, see Section 3.2. The image must then be the unit circle since the points −1 and +1 are on this circle.

3. This can be verified directly. □
Remark: There are infinitely many Möbius-transformations with the desired mapping properties, and S(z) = (z + 1)/(z − 1) is just one in this set. This special transformation possesses the advantage that there is no need for an investigation of the properties of the inverse S^{-1} because of the identical structures of S and S^{-1}.
As has been shown in Section 3.3, this Möbius-transformation is well defined for matrices A ∈ C^{n×n}:

    S(A) := (A + I)(A − I)^{-1},

provided λ = 1 is not an eigenvalue of A. This condition is not a loss of generality since A is to be investigated with respect to its asymptotic stability. In fact, if λ = 1 is an eigenvalue, A is not stable according to Theorem 3.2. Because of Theorem 3.5, there is the
Corollary 3.7 If λ is an eigenvalue of A, then S(λ) = (λ + 1)/(λ − 1) is an eigenvalue of the transformed matrix S(A) = (A + I)(A − I)^{-1}.
For the purpose of this paper it is of particular interest that, using this special Möbius-transformation, eigenvalues with negative real parts are mapped into the interior of the unit circle. In contrast, eigenvalues of A on the imaginary axis or in the right half-plane are mapped onto the unit circle or into its exterior, respectively, compare Theorem 3.4.4 and Figure 3.1. Therefore, the homogeneous ODE

    y' = Ay                                               (3.4)

is asymptotically stable if and only if all eigenvalues of B := S(A) = (A + I)(A − I)^{-1} are confined to the interior of the unit circle. Consequently, if this can be proved, the asymptotic stability of A is verified.
Remark: For the subsequent analysis by means of a computer, it is advantageous to represent S(A) = (A + I)(A − I)^{-1} equivalently by means of S(A) = I + 2(A − I)^{-1}. For a determination of the inverse of a matrix A − I, there are codes in the computer languages PASCAL-SC [4] or PASCAL-XSC [10]. These codes generate tight enclosures provided the eigenvalues of A are different from one and not too close to one. The ODE (3.4) is asymptotically stable if and only if all eigenvalues of A are confined to the left half-plane. Therefore the exclusion of the case λ = 1 or λ ≈ 1 is not a restriction of generality in the present paper since the ODE then is not stable. The codes referred to will generate enclosures [S(A)] := I + 2[(A − I)^{-1}] of S(A) possessing small overestimates as a consequence of the enclosure of A − I. Concerning [B] := [S(A)], now the Cordes-Algorithm may be employed. Provided this application verifies that the spectral radius of all matrices B ∈ [B] is less than one, the property of asymptotic stability has been shown for the ODE (3.4). This is a guaranteed verification since all numerical errors are fully taken into account in the determination of the enclosure.
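The verification step for a point matrix can be sketched as follows (an unverified floating-point illustration of the criterion ρ(S(A)) < 1 with S(A) = I + 2(A − I)^{-1}; the paper's method would instead work with enclosures):

```python
# Unverified floating-point sketch for a 2x2 point matrix A: A is
# asymptotically stable iff rho(S(A)) < 1 for S(A) = I + 2(A - I)^(-1).
def stability_test_2x2(A):
    m = [[A[0][0] - 1.0, A[0][1]], [A[1][0], A[1][1] - 1.0]]    # A - I
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    if det == 0.0:                  # lambda = 1 is an eigenvalue: not stable
        return False
    inv = [[m[1][1] / det, -m[0][1] / det], [-m[1][0] / det, m[0][0] / det]]
    B = [[(1.0 if i == j else 0.0) + 2.0 * inv[i][j] for j in range(2)]
         for i in range(2)]
    tr = B[0][0] + B[1][1]
    dt = B[0][0] * B[1][1] - B[0][1] * B[1][0]
    d = tr * tr - 4.0 * dt
    disc = d ** 0.5 if d >= 0 else complex(0.0, (-d) ** 0.5)
    rho = max(abs((tr + disc) / 2), abs((tr - disc) / 2))
    return rho < 1.0
```

For [[0, 1], [-2, -3]] (eigenvalues −1, −2) the test succeeds; for [[0, 1], [2, 1]] (eigenvalues 2, −1) it fails, since S(2) = 3 lies outside the unit circle.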
3.5
The Cordes-Algorithm
For systems of linear ODEs with periodic coefficients, D. Cordes in his doctoral dissertation [6] has developed a method for the verification of the property of asymptotic stability. This method, which is to be presented now, rests on the stability properties of the fundamental matrix Φ(T), with T the period of the coefficients. The system of ODEs is asymptotically stable if and only if the spectral radius of Φ(T) is smaller than one. The system of ODEs with constant coefficients, y' = Ay, under discussion here possesses any period T > 0. Consequently, it is sufficient to verify for an arbitrary but fixed t > t0 that ρ(Φ(t)) < 1 is true for the spectral radius of Φ(t). Therefore, the Cordes-Algorithm is applicable in this special case of systems of linear ODEs with constant coefficients. Generally, the fundamental matrix Φ(t) cannot be determined exactly for any fixed time t > t0. By means of the Lohner-Algorithm [12], an enclosure [C] ⊆ R^{n×n} can be determined for C := Φ(t). The property of asymptotic stability is then to be verified for all C ∈ [C]. The Cordes-Algorithm rests on the essential criteria as asserted in the following theorem, whose first statement is proved in [2]:
Theorem 3.8 (a) If ρ(|[C]|) < 1 is valid for the spectral radius of the absolute value matrix |[C]| := (|[c_ij]|) := (max{|c̲_ij|, |c̄_ij|}) of [C] ⊆ R^{n×n}, then ρ(C) < 1 holds for all C ∈ [C].

(b) If ‖ |[C]| ‖ < 1 is valid by use of any least upper bound norm ‖·‖ of |[C]|, then ρ(C) < 1 for all C ∈ [C].

Generally, it is computationally difficult to verify that ρ(|[C]|) < 1. Because of overestimates, it is frequently not possible to verify the property of asymptotic stability by means of a least upper bound norm, e.g., the row sum norm ‖·‖_∞ or the column sum norm ‖·‖_1.
For the spectral radius ρ(C) of a matrix C ∈ R^{n×n}, there holds (ρ(C))^m = ρ(C^m) for all m ∈ N. Provided ρ(C) < 1, then ρ(C^m) approaches zero as m → ∞. It is now assumed that [C^m] is an enclosure of the set {C^m : C ∈ [C]} and that ‖ |[C^m]| ‖ < 1 is valid for a number m ∈ N. Then, asymptotic stability of [C] is implied by this inequality. This allows the development of an algorithm which can be implemented for computer applications. For this purpose, the 2^i-th power of [C] is determined by means of the multiplication of [C^{2^{i−1}}] by itself. In the case of interval matrices, these multiplications generally cause an overestimate of the true set {C^m : C ∈ [C]}. This is a consequence of the following contributions:
(a) the multiplication [C] × [C] is executed by means of [A] × [B] with an independent treatment of A ∈ [A] := [C] and B ∈ [B] := [C];

(b) additionally, there is a growth of the sequence {[C^m]} because of the employment of rounded interval arithmetic;

(c) complex eigenvalues of C ∈ [C] generate rotations and therefore overestimates because of the wrapping effect, e.g. [13].

The influence of (c) can be diminished by means of a suitable local rotation of the employed basis, see [13] and the doctoral dissertations of Cordes [6] and Lohner [12].

The Cordes-Algorithm aborts in the following three cases:

• the verification of the property of asymptotic stability has been achieved for a matrix |[C^m]| with a certain m ∈ N;

• for a certain m ∈ N, there holds |trace([C^m])| ≥ n; then there exists a point matrix D ∈ [C^m] with at least one eigenvalue λ such that |λ| > 1;

• the number m of iterations exceeds a prescribed bound; then a decision concerning the asymptotic stability of [C] is not possible.

In the execution of the multiplications of interval matrices, there are unavoidable overestimates. This may cause an unstable point matrix to be enclosed in [C^m] whereas this is not so for [C^{m−1}]. In this case, the property of asymptotic stability cannot be verified for [C] irrespective of its truth.
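The squaring iteration just described can be sketched as follows (a simplified illustration: intervals are (lo, hi) pairs in ordinary floating point, whereas a rigorous implementation would round all bounds outward; the routine names are ad hoc):

```python
# Simplified Cordes-style squaring test on an interval matrix.
def imul(x, y):
    p = (x[0] * y[0], x[0] * y[1], x[1] * y[0], x[1] * y[1])
    return (min(p), max(p))

def iadd(x, y):
    return (x[0] + y[0], x[1] + y[1])

def imat_mul(X, Y):
    n = len(X)
    Z = []
    for i in range(n):
        row = []
        for j in range(n):
            acc = (0.0, 0.0)
            for k in range(n):
                acc = iadd(acc, imul(X[i][k], Y[k][j]))
            row.append(acc)
        Z.append(row)
    return Z

def abs_row_sum_norm(X):
    """|| |[C]| ||_inf with |[c_ij]| = max(|lower|, |upper|)."""
    return max(sum(max(abs(lo), abs(hi)) for (lo, hi) in row) for row in X)

def cordes_test(C, max_squarings=30):
    """Try to verify rho(C) < 1 for all C in [C] by repeated squaring."""
    for _ in range(max_squarings):
        if abs_row_sum_norm(C) < 1.0:
            return True
        C = imat_mul(C, C)        # encloses the squares of all members
    return False
```

For example, cordes_test([[(0.6, 0.7), (0.5, 0.6)], [(0.0, 0.1), (0.5, 0.6)]]) succeeds after three squarings although the initial norm exceeds one.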
4
Partition of the Input Interval

4.1
Introduction of the Problem
In the context of the problem of asymptotic stability, systems of linear ODEs are considered in this paper,

    y' = Ay,    y : R → R^n,  A ∈ R^{n×n}.

The following constructive sufficient criterion for this property is derived in Section 3:

• an interval matrix [(A − I)^{-1}] can be determined, i.e., the matrix A − I is invertible, and,

• by means of the Cordes-Algorithm, ρ(B) < 1 can be verified for all B ∈ [B] := I + 2[(A − I)^{-1}].
This algorithm is still applicable in the case that an interval matrix [A] is admitted, provided the width of [A] is sufficiently small. Frequently in applications, this condition is not satisfied because of relatively large variations of input data or uncertainties of measurements. Generally,

• either an enclosure [(A − I)^{-1}] cannot be determined under the simultaneous admission of all A ∈ [A], or

• the Cordes-Algorithm fails to verify that ρ(B) < 1 for all B ∈ [B].

As a remedy, a suitable sequence of partitions of the input interval [A] may be carried out. The presented constructive methods are then to be applied with respect to each generated subinterval of [A]. Provided the property of asymptotic stability has been verified for all subintervals of [A], this property is also true for the union of these intervals, i.e., for all A ∈ [A].
4.2
The Philosophy of Partitioning [A]
Given an interval matrix [A] ⊆ R^{n×n}, there are k ∈ {1, . . . , n²} elements which are represented by genuine intervals; the remaining n² − k elements are real numbers. The Cartesian product of the k interval-valued elements defines an interval in R^k, which is to be partitioned (subdivided). Generally, the k individual input intervals in R possess largely non-uniform widths. As an additional problem, a sequence of subdivisions yields a rapidly increasing number of subintervals. Consequently, in each step of this sequence it is advantageous to partition only one interval in R possessing maximal width. Each partition in R generates subintervals of equal widths. Therefore, in each step, two new interval matrices are generated in R^{n×n}. They are identical except for the one element in R which has been subdivided in this step, i.e., with the exception of their location in R^k. The total ordering of the set R allows the notation left (or right) partial interval matrix for the interval matrix in R^{n×n} that contains the left or right subinterval in R. See Figure 4.1 for a sequence of sample partitions of an interval in R^k with k = 2. Occasionally, in the sequence depicted in this figure, two neighbouring intervals in R are simultaneously subdivided in an individual step of this sequence.
Figure 4.1: Example of a partitioning strategy for a matrix containing two interval-valued elements
The corresponding algorithm was arranged recursively such that the left partial interval matrix is investigated first in each step. If asymptotic stability cannot be verified for this interval matrix, it will be subdivided again. Otherwise, the right partial interval matrix is treated next. Provided the property of asymptotic stability has been verified for both partial interval matrices belonging to one step of this recursion, this property is also valid for their union. In that case, the algorithm returns to the preceding step of the recursion. If this is its beginning, the property of asymptotic stability has been verified for all A ∈ [A]. Otherwise, the right partial interval matrix of the preceding step is treated. Consequently, this generates the structure of a tree, see Figure 4.2 for an example.
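The partitioning strategy and the recursive traversal described above can be sketched as follows (an illustration only: `verify` is a placeholder for the stability test applied to one interval matrix, and the depth bound plays the role of the abort criterion):

```python
# Sketch of the recursive partitioning: bisect the widest interval-valued
# element and treat the left partial matrix first.
def widest_element(M):
    best, width = None, -1.0
    for i, row in enumerate(M):
        for j, (lo, hi) in enumerate(row):
            if hi - lo > width:
                best, width = (i, j), hi - lo
    return best, width

def verify_by_partitioning(M, verify, depth=0, max_depth=12):
    if verify(M):
        return True
    if depth == max_depth:          # abort criterion: bounded recursion depth
        return False
    (i, j), width = widest_element(M)
    if width == 0.0:                # a point matrix failed the test
        return False
    lo, hi = M[i][j]
    mid = 0.5 * (lo + hi)
    try:
        for part in ((lo, mid), (mid, hi)):   # left, then right partial matrix
            M[i][j] = part
            if not verify_by_partitioning(M, verify, depth + 1, max_depth):
                return False
        return True
    finally:
        M[i][j] = (lo, hi)          # restore the subdivided element
```

A toy predicate that accepts any interval of width at most 0.25 illustrates the traversal on a 1×1 "matrix": the interval (−0.9, 0.9) is verified after three levels of subdivision.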
Figure 4.2: Example of a recursive tree for the partitioning algorithm
This recursion tree is treated by going from left to right. Concerning each individual path in this tree, the total number of steps is said to be the recursion depth of this path. Generally, this depth is non-uniform for different paths of this tree since it is determined as the path is treated recursively. Even though it cannot be fixed in an a priori fashion, the depths of the various paths must be bounded; i.e., the recursive treatment of a particular path will perhaps be aborted somewhere. If this occurs without a verification of asymptotic stability at the (artificial) end point of a path, then there are the following possible causes:

(i) The two subintervals as generated at the end of this path are still "too large".

(ii) Because of reasons as stated in Subsection 3.5, the Cordes-Algorithm does not yield a verification of the desired property.

(iii) At least one subinterval of [A] is not asymptotically stable. Consequently, then [A] does not have this property.

The subinterval addressed in (iii) can be utilized as an indicator in order to yield perhaps important information concerning the input data of the real world problem under discussion.
4.3
Examples Employing Subdivisions of the Input Interval Matrix
The examples presented in this subsection have been chosen and treated (by means of other methods) in the International Journal of Control. In papers in this journal, generally only interval matrices in R^{2×2} have been adopted. For these examples,
the property of asymptotic stability can then be tested immediately. The examples illustrate the employed methods very well. The first example to be treated here was chosen by S. Bialas [3]; it has also been investigated in numerous subsequent papers in the International Journal of Control. The example rests on the choice of
For all A ∈ [A], it was possible to verify the property of asymptotic stability by means of the partitioning method as presented in Subsection 4.2. For this purpose, it was necessary to verify this property individually for 102 partial interval matrices. It was analogously possible to verify the property of asymptotic stability for the following example by Xu Daoyi [7]:
    ( [-3, -2]   [ 3,  4] )
    ( [-6, -5]   [-3, -2] ).
In the execution of the partitioning method, here only 22 partial interval matrices had to be investigated. The interval matrix adopted in the third example,
    ( [-7, -3]   [ 0,  2] )
    ( [ 3,  5]   [-8, -4] ),
was originally chosen in the paper by R. K. Yedavalli [16]. This interval matrix contains the one treated in the first example. For a verification of asymptotic stability by means of the partitioning method, here a total of 195 partial interval matrices were treated. The total computation time was insignificant in each one of these three examples. In fact, the cost of the Cordes-Algorithm is almost negligible in the case of (interval) matrices in R^{2×2}.
4.4
Application Concerning The Bus with an Automatic Tracking System
At first, in this subsection, the data chosen for the bus with an automatic tracking system is presented. This is followed by tables exhibiting the computed results for the eigenvalues for the Operational Cases chosen in Subsection 2.2. The data and the tables have been taken from the book by G. Roppenecker [14]. The following system of linear ODEs of order five serves as a simulation of the operational characteristics of the bus with an automatic tracking system:

    x_a'(t) = A_a x_a(t) + b_a u_a(t) + e_a z(t),    y_a(t) = c_a^T x_a(t),

where
          | 0   1    0     0     0   |
          | 0   0   a23   a24   a25  |
    A_a = | 0   0   a33   a34   a35  |
          | 0   0   a43   a44   a45  |
          | 0   0    0     0     0   |

together with corresponding vectors x_a, b_a, e_a, and c_a.
The vanishing eigenvalue of the matrix A_a has multiplicity three. Since A_a is not stable, a control is required. According to G. Roppenecker [14], the following intervals have to be taken into account for the velocity v of the bus, its mass m, and the coefficient μ, respectively.
With reference to the metric units, nondimensional quantities are introduced. The elements of the system matrix A_a and the components of the perturbation vector e_a then can be determined as follows as functions of the input data:

    a23 = 6.12 a33 + v (a43 + 1),
    a24 = 6.12 a34 + v a44,
    a25 = 6.12 a35 + v a45,

    a33 = -(2μ/(Θv)) (3.67² δV + 1.93² δH),
    a34 = -(2μ/Θ) (3.67 δV - 1.93 δH),
    a35 = -(2μ/Θ) 3.67 δV,
    a43 = -(2μ/(mv²)) (3.67 δV - 1.93 δH) - 1,
    a44 = -(2μ/(mv)) (δV + δH),
    a45 = -(2μ/(mv)) δV,
    e2  = -v²,

where
    Θ  = Θ0 + 11.174 (m - m0) : moment of inertia of the bus (in kg m²),
    δV = δV0 + 8.430 (m - m0) : coefficient of lateral force of the front wheels (in N/rad),
    δH = δH0 + 17.074 (m - m0) : coefficient of lateral force of the twin rear wheels (in N/rad),

and, in the case of the empty bus (m = 9950 kg),
    Θ0  = 105700 : moment of inertia (in kg m²),
    m0  = 9950 : mass (in kg),
    δV0 = 195000 : coefficient of lateral force of the front wheels (in N/rad),
    δH0 = 378000 : coefficient of lateral force of the rear wheels (in N/rad).
The matrix A_a is given as follows for the three operational cases as chosen in Subsection 2.2:

Operational Case 1 (v = 3 m/sec, m = 9950 kg, μ = 1):

          | 0   1      0            0            0         |
          | 0   0   -154.79826  -113.56743  -122.06784     |
    A_a = | 0   0    -25.44590     0.26282   -13.54115     |
          | 0   0     -0.68978   -38.39196   -13.06533     |
          | 0   0      0            0            0         |

Operational Case 2 (v = 10 m/sec, m = 9950 kg, μ = 1):

          | 0   1      0            0            0         |
          | 0   0    -46.43948  -113.56743  -122.06784     |
    A_a = | 0   0     -7.63377     0.26282   -13.54115     |
          | 0   0     -0.97208   -11.51759    -3.91960     |
          | 0   0      0            0            0         |

Operational Case 3 (v = 20 m/sec, m = 16000 kg, μ = 1):

          | 0   1      0            0            0         |
          | 0   0    -17.86862   -89.06990   -94.51381     |
    A_a = | 0   0     -2.94635     0.30108   -10.41892     |
          | 0   0     -0.99185    -4.54563    -1.53750     |
          | 0   0      0            0            0         |
The design of the automatic control has been presented in Subsection 2.2. In the absence of the differential filter, this yields the following system of linear ODEs of order six:
         |   0        1        0     0     0      0    |
         |   0        0       a23   a24   a25     0    |
    y' = |   0        0       a33   a34   a35     0    | y          (4.1)
         |   0        0       a43   a44   a45     0    |
         | -0.882    2.588     0     0     0      1    |
         | 72.68   -53.55      0     0     0   -29.52  |
Now, Gerschgorin's Disk Theorem is employed to determine six disks whose union contains the set of eigenvalues of the matrix in (4.1). The disk belonging to the second column in (4.1) is given by |λ| < 57.138, where all possible cases are taken into account. This disk contains the other five disks. An inequality |λ| < r does not allow a conclusion whether or not all eigenvalues are confined to the left half-plane. Consequently, Gerschgorin's Disk Theorem is not useful for a verification of asymptotic stability for the present problem.
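As a plausibility check, the column disks can be evaluated for the matrix in (4.1) with the Operational Case 1 values of the a_ij inserted (a floating-point illustration; each disk has its center at the diagonal element and its radius equal to the sum of the absolute values of the remaining column entries):

```python
# Gerschgorin column disks for the matrix in (4.1), filled in with the
# Operational Case 1 values (v = 3 m/sec, m = 9950 kg, mu = 1).
def gerschgorin_column_disks(A):
    n = len(A)
    return [(A[j][j], sum(abs(A[i][j]) for i in range(n) if i != j))
            for j in range(n)]

A41 = [[0.0,      1.0,     0.0,        0.0,        0.0,       0.0],
       [0.0,      0.0,  -154.79826, -113.56743, -122.06784,   0.0],
       [0.0,      0.0,   -25.44590,    0.26282,  -13.54115,   0.0],
       [0.0,      0.0,    -0.68978,  -38.39196,  -13.06533,   0.0],
       [-0.882,   2.588,    0.0,        0.0,        0.0,      1.0],
       [72.68,  -53.55,     0.0,        0.0,        0.0,    -29.52]]

disks = gerschgorin_column_disks(A41)
center2, radius2 = disks[1]     # disk of the second column:
                                # center 0, radius 1 + 2.588 + 53.55 = 57.138
```

The second-column disk indeed has center 0 and radius 57.138, matching the bound |λ| < 57.138 quoted above; such a bound is symmetric about the origin and therefore cannot decide the half-plane question.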
By means of the partitioning method, it will now be tested whether or not the design of the automatic control for the sample problem in Subsection 2.2 guarantees a stable operational performance for all choices of the parameters v, m and μ of the problem. Under the admission of the total input interval for these parameters, the corresponding intervals of the elements of the input matrix [A] possess relatively large widths. This is in particular true for the interval [−959.47684, −8.93431] as the set of the admissible values of the element a23. In the interval matrix [A] ⊆ R^{6×6}, k = 9 denotes the number of elements taking values in genuine intervals, because of their dependencies on the mass m and the coefficient μ of the interaction of tires and road. The interval-valued elements a24, a34, a25 and a35 do not depend on the velocity v. The other interval-valued elements are functions of v, with an increasing rate of change as v decreases. The parameter variations of the extent just outlined prevented the execution of the partitioning method under the simultaneous admission of the total input interval for (m, v, μ)^T ∈ R³, at least when making use of the employed SAM 68000-computer made by kws. With a total computation time of two weeks, only the following edge of the input interval in R³ could be treated by means of the partitioning method: μ = 1, m = 9950 kg, and v/[m/sec] ∈ [1, 20]. This hardware problem is expected to be less serious in the case of an employment of more powerful computer systems supporting the execution of enclosure algorithms. For this purpose, the language PASCAL-XSC may be used in conjunction with any IBM-compatible PC.
5
Estimates for the Degree of Stability
Concerning the location in C of the eigenvalues of a matrix A ∈ R^{n×n}, four constructive methods will be presented in this section. For this purpose, the verification of asymptotic stability as outlined in Subsection 3.4 will be employed. Consequently, an estimate by any of these four constructive methods is guaranteed to be true.
5.1
The Degree of Stability
For certain classes of applications, it is not sufficient to verify the property of asymptotic stability for a system of ODEs,

    y' = Ay,    y : R → R^n,  A ∈ R^{n×n}.

In fact, frequently it is important to assess the time response of a dynamic process. Generally, it is desirable that all solutions of y' = Ay approach zero sufficiently rapidly. All solutions of nonhomogeneous systems y' = Ay + b will then correspondingly approach a steady-state provided there is one. If this approach is too slow, the simulated process is almost unstable. Consequently, problems are to be expected for the corresponding real world process. The time response of an asymptotically stable system is governed by the location of its eigenvalues in the left half-plane. This will now be investigated. Provided λ ∈ R is a negative eigenvalue of A, the approach to zero of e^{λt} slows down as |λ| decreases. In the case that A possesses a pair of conjugate complex eigenvalues λ, λ̄ with

    λ = δ + iω,    λ̄ = δ − iω,    and δ < 0,

then the corresponding solutions of y' = Ay are represented by

    r1 e^{δt} sin(ωt) + r2 e^{δt} cos(ωt) = r e^{δt} sin(ωt + φ)

with r1, r2, r ∈ R^n. The time response is bounded by ±r e^{δt}, i.e., the decay slows down as |δ| decreases. It is assumed that A either has a negative eigenvalue close to zero or an eigenvalue with negative real part close to zero. The corresponding process then possesses only a small rate of decay due to damping and a slow approach of any steady-state. Provided the real parts of all eigenvalues are negative, then generally the time response is governed by the eigenvalues that are closest to the imaginary axis. This induces the following definition:
Definition: (i) An asymptotically stable matrix A ∈ R^{n×n} possesses the degree of stability σ, with σ ≥ 0, if −σ represents the maximal real part in the set of eigenvalues of A. (ii) An asymptotically stable interval matrix [A] ⊆ R^{n×n} possesses the degree of stability σ, with σ ≥ 0, if every matrix A ∈ [A] has at least the degree of stability σ.
Figure 5.1: Definition of the degree of stability σ
This stability measure arises naturally in the representation of the sets of solutions of the system y' = Ay considered here. In fact, the fundamental matrix of a system of this kind is given by Φ(t) = e^{At}. In the more general case of systems y' = A(t)y with A(t + T) = A(t) for all t ∈ R and a fixed period T ∈ R⁺, Φ(t) = F(t) e^{Kt} because of the Floquet theory. Since F(t + T) = F(t) for t ∈ R is true for the matrix function F, the stability of y' = A(t)y is governed by the eigenvalues of K ∈ R^{n×n}.
The following subsections are devoted to a presentation of additional properties of the Möbius-transformation w = (z + 1)/(z − 1). This is the basis for the design of the four constructive methods, each of which yields a safe lower bound of the degree of stability σ. Here, too, the Möbius-transformation is chosen such that there are corresponding mappings S(A) of A ∈ R^{n×n} and S(z) for the points z ∈ C, compare Corollary 3.7.
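One simple way to obtain a lower bound for the degree of stability, sketched here as an illustration and not as one of the four methods referred to above: the eigenvalues of A + σI are those of A shifted to the right by σ, so A has degree of stability at least σ whenever A + σI still passes an asymptotic-stability test (here an unverified eigenvalue check for a 2×2 example):

```python
# Illustration: A has degree of stability at least sigma whenever
# A + sigma*I is still asymptotically stable, since the spectrum is
# merely shifted to the right by sigma.
import cmath

def eig2(M):
    (a, b), (c, d) = M
    tr, det = a + d, a * d - b * c
    disc = cmath.sqrt(tr * tr - 4 * det)
    return [(tr + disc) / 2, (tr - disc) / 2]

def is_stable2(M):
    return all(l.real < 0 for l in eig2(M))

def degree_lower_bound(A, step=1.0 / 64, limit=10.0):
    """Largest multiple of `step` below `limit` such that A + sigma*I is
    still stable (floating-point check for a 2x2 matrix)."""
    sigma = 0.0
    while sigma + step <= limit:
        s = sigma + step
        if not is_stable2([[A[0][0] + s, A[0][1]], [A[1][0], A[1][1] + s]]):
            break
        sigma = s
    return sigma

A = [[0.0, 1.0], [-2.0, -3.0]]    # eigenvalues -1 and -2: the degree is 1
```

Here degree_lower_bound(A) returns 63/64, one step short of the true degree σ = 1.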
5.2
Additional Mapping Properties of the Möbius-Transformation w = (z + 1)/(z − 1)

On the basis of Theorem 3.6, an approach for the verification of the property of asymptotic stability is outlined in Subsection 3.4. This method does not allow an estimate of the degree of stability. For this purpose, an additional consideration of the Möbius-transformation w = (z + 1)/(z − 1) is required; particularly, the images of straight lines parallel to the imaginary axis are of interest.
Theorem 5.1

1. The Möbius-transformation w = (z + 1)/(z − 1) maps straight lines parallel to the imaginary axis lying in the left half-plane onto circles with center on the real axis containing the point +1.

2. The half-plane to the left of any such straight line is mapped into the interior of the image circle, see Figure 5.2.
Figure 5.2: Mapping properties of the Mobius-transformation S(z) in the left half-plane
Verification of Asymptotic Stability
Proof:

1. Every straight line contains the point ∞, which is mapped into S(∞) = 1. Consequently, the images of all straight lines contain the point +1. The point t ∈ R on the real axis will denote the intersection of an arbitrary straight line with the real axis. The image w = S(t) = (t + 1)/(t − 1) of this point is real-valued. Since generalized circles are mapped onto generalized circles, the image of a straight line is either a straight line or a circle. Since angles are preserved by conformal mappings, the images of straight lines parallel to the imaginary axis lying in the left half-plane are circles intersecting the real axis orthogonally.

2. The Möbius-transformation w = (z + 1)/(z − 1) maps the real axis onto the real axis. Since ∞ is mapped into +1, the segment of the real axis to the left of the point t is mapped onto the segment of the real axis between the points w = S(t) = (t + 1)/(t − 1) and +1. This completes the proof. □
The constructive methods to be derived make additional use of the following theorem:
Theorem 5.2 The inverse Möbius-transformation z = (w + 1)/(w − 1) maps circles with the center at the origin and the radius r < 1 onto circles containing the points γ = (r + 1)/(r − 1) and 1/γ, with center M = ½(γ + 1/γ). Concerning these circles in the w-plane, their interior is mapped onto the interior of the circles in the z-plane.
Figure 5.3: Mapping properties of the inverse Möbius-transformation S⁻¹(w) in the unit circle
Proof: Every circle contained in the unit circle is mapped into the left half-plane. For the circles in the w-plane with their center at the origin, their intersections ±r with the real axis are considered. These points possess the following images:

S(r) = (r + 1)/(r − 1) =: γ ∈ R  and  S(−r) = (−r + 1)/(−r − 1) = (r − 1)/(r + 1) = 1/γ ∈ R.
Because of the orthogonal intersections of these circles in the w-plane with the real axis, this kind of intersection must also be true for their images in the z-plane. Consequently, these images are circles with the points γ and 1/γ on their circumference. The origin of the w-plane is in the interior of any circle under consideration in this plane. This origin is mapped into the point z = −1. Consequently, the interior of these circles in the w-plane is mapped onto the interior of their image-circles in the z-plane. This completes the proof. □
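Theorem 5.2 lends itself to a quick numerical spot-check. The following Python sketch (an illustration only; it is not part of the verified machinery developed in this paper) samples the circle |w| = r and confirms that its image under the inverse Möbius-transformation lies on the circle through γ and 1/γ:

```python
import numpy as np

# Spot-check of Theorem 5.2: points on the circle |w| = r < 1 are mapped by
# z = (w+1)/(w-1) onto the circle through gamma = (r+1)/(r-1) and 1/gamma,
# centered at M = (gamma + 1/gamma)/2 on the real axis.
r = 0.5
gamma = (r + 1.0) / (r - 1.0)            # -3
M = 0.5 * (gamma + 1.0 / gamma)          # center of the image circle, -5/3
radius = 0.5 * abs(gamma - 1.0 / gamma)  # half the distance of gamma and 1/gamma

w = r * np.exp(1j * np.linspace(0.0, 2.0 * np.pi, 400, endpoint=False))
z = (w + 1.0) / (w - 1.0)                # inverse Moebius-transformation
defect = np.max(np.abs(np.abs(z - M) - radius))  # numerically zero
```

The same check behaves identically for any other radius r < 1, since Möbius-transformations map generalized circles onto generalized circles.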
5.3
First Constructive Method for a Lower Bound of the Degree of Stability
For an asymptotically stable matrix A ∈ R^{n×n}, by definition, all eigenvalues are located on or to the left of a straight line in the left half-plane which is parallel to the imaginary axis. A constructive method is desired which yields a close lower bound of σ, making use of the Cordes-Algorithm as presented in Subsection 3.5. A parallel to the imaginary axis is considered which intersects the real axis at −α with α ∈ R⁺. By means of w = S(z) = (z + 1)/(z − 1) and because of Theorem 5.1, this line is mapped onto the circle containing the points 1 and (α − 1)/(α + 1), with its center at M = α/(α + 1) ∈ R. The half-plane to the left of this line is mapped into the interior of the circle.
Figure 5.4: Transformation of the half-plane {z ∈ C : Re z < −α} into the unit circle
This circle is now shifted parallel to the real axis such that it takes up a position with the origin as its center. This shift can be expressed by means of the Möbius-transformation

v = w − α/(α + 1)  and  S(z) = (z + 1)/(z − 1) − α/(α + 1).
In the new position, the circle intersects the real axis at the points ±1/(α + 1).
Figure 5.5: Centering of the disk by means of a shift parallel to the real axis
The corresponding Möbius-transformation for a matrix B is given by

C := B − (α/(α + 1)) I,

with I the identity matrix. When B is replaced by I + 2(A − I)⁻¹, there follows

S(A) := C = (1/(α + 1)) I + 2(A − I)⁻¹.
The following Theorem shows the relationship between the eigenvalues of A and C.
Theorem 5.3 Provided the spectral radius of the matrix S(A) = C = (1/(α + 1)) I + 2(A − I)⁻¹ is less than 1/(α + 1), then all eigenvalues of A ∈ R^{n×n} possess real parts smaller than −α.

Proof: An arbitrary but fixed eigenvalue of S(A) = C = (1/(α + 1)) I + 2(A − I)⁻¹ is denoted by μ. Provided ρ(C) < 1/(α + 1), then μ is confined to the interior of a circle in the v-plane with its center at the origin and its radius 1/(α + 1). Because of Theorem 3.5, λ = S⁻¹(μ) is to the left of a straight line in the z-plane parallel to the imaginary axis, which intersects the real axis at −α. Since this is true for all eigenvalues μ of C, this completes the proof. □

The condition ρ(C) < 1/(α + 1) may be replaced by the equivalent condition ρ((α + 1)C) < 1, which can be directly tested by means of the Cordes-Algorithm. An α close to optimal can be determined by use of a bisection method.
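As an illustration of how such a bisection can be organized, the following hypothetical Python sketch replaces the verified Cordes-Algorithm by an ordinary, non-verified eigenvalue routine (so its result is not a verified bound); the function names and the structure are illustrative only:

```python
import numpy as np

def rho(M):
    # Non-verified stand-in for the Cordes-Algorithm's test rho(.) < 1.
    return np.max(np.abs(np.linalg.eigvals(M)))

def stability_degree_lb_method1(A, tol=1e-9):
    # Sketch of the first constructive method: bisection on alpha, testing
    # rho((alpha+1) C) < 1 where C = I/(alpha+1) + 2(A - I)^{-1}, i.e.
    # (alpha+1) C = I + (alpha+1) * 2(A - I)^{-1}.
    n = A.shape[0]
    I = np.eye(n)
    R = 2.0 * np.linalg.inv(A - I)
    stable = lambda a: rho(I + (a + 1.0) * R) < 1.0
    if not stable(0.0):
        return None                  # asymptotic stability itself not verified
    lo, hi = 0.0, 1.0
    while stable(hi):
        lo, hi = hi, 2.0 * hi        # find an alpha that is too large
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if stable(mid) else (lo, mid)
    return lo                        # lower bound of the degree of stability
```

For A = diag(−1, −2, −10), whose degree of stability is σ = 1, the routine approaches α = 1 from below.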
5.4
Second Constructive Method for a Lower Bound of the Degree of Stability
A matrix A ∈ R^{n×n} is asymptotically stable if and only if all eigenvalues of A are confined to the left half-plane. A Möbius-transformation of A into S(A) = B = I + 2(A − I)⁻¹ is considered. Because of Corollary 3.7, the eigenvalues λ₁, …, λₙ of A then are transformed into the eigenvalues μᵢ = (λᵢ + 1)/(λᵢ − 1), i = 1, …, n, of B.

The set of these eigenvalues μᵢ is confined to the interior of the unit circle. It is assumed that the spectral radius of B is not only smaller than one but also smaller than 1/β with a β ∈ R⁺ such that β > 1. It is then possible to derive a set in the left half-plane which contains all eigenvalues of A. In conjunction with a trial and error approach, a bisection method will be used to determine a suitable value of β. For this purpose, the circle with the center at the origin of the w-plane and the radius 1/β is transformed into the z-plane. Because of Theorem 5.2, the image is a circle intersecting the real axis at γ := (1 + β)/(1 − β) and 1/γ, with its center at M := ½(γ + 1/γ) ∈ R in the left half-plane.
Figure 5.6: Transformation of the disk {w ∈ C : |w| < 1/β} into the left half-plane
This circle in the z-plane contains all eigenvalues of A, including the one with the maximal real part. Consequently, 1/γ, as the larger one of these points of intersection, yields a lower bound |1/γ| of the degree of stability. For this estimate to be fairly close, the eigenvalue of A with maximal real part must be real and close to the circle with the center at M.
Figure 5.7: Dependency of the disk size on the position of the eigenvalues
This bound of the degree of stability may be rather coarse in the case that γ is determined by eigenvalues other than the one with the maximal real part. As an example, an eigenvalue λ = −10 is considered. Then γ < −10 and 1/γ > −1/10, even though λ = −10 is irrelevant concerning the stability of A. As another example, a pair of conjugate complex eigenvalues is considered, λ₁/₂ = −1 ± 10i. This pair is transformed into

μ₁/₂ = (25 ∓ 5i)/26  with  |μ₁| = |μ₂| ≈ 0.9806.

Provided it can be verified that the spectral radius of B is smaller than 1/β = 0.99, the inverse transformation yields a circle in the z-plane with its center at M = −99.5025, which intersects the real axis at γ = −199 and 1/γ = −1/199 ≈ −0.005025. The pair of eigenvalues λ₁/₂ is irrelevant concerning the stability of A; in fact, it generates a rapidly decaying oscillation that is governed by e⁻ᵗ sin 10t. In some cases, it is undesirable that the point −1 is always in the interior of the circle intersecting the real axis at γ and 1/γ. In fact, the lower bound |1/γ| for the degree of stability is unfavorable in the case that the maximal real part of the eigenvalues is smaller than −1. Consequently, there are practical restrictions concerning the applicability of the second constructive method. Nevertheless, this method yields safe information with respect to the location of all eigenvalues in the left half-plane. This may be of interest for problems other than the one of the asymptotic stability of A.
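The numbers of this example can be retraced in a few lines (plain floating-point arithmetic, illustration only):

```python
import numpy as np

lam = -1.0 + 10.0j                     # one eigenvalue of the conjugate pair
mu = (lam + 1.0) / (lam - 1.0)         # transformed eigenvalue, (25 - 5i)/26
abs_mu = abs(mu)                       # close to 0.9806

beta = 1.0 / 0.99                      # verified: rho(B) < 1/beta = 0.99
gamma = (1.0 + beta) / (1.0 - beta)    # -199
center = 0.5 * (gamma + 1.0 / gamma)   # close to -99.5025
```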
5.5
Third Constructive Method for a Lower Bound of the Degree of Stability
Any suitable matrix norm provides a trivial upper bound of the spectral radius of a matrix A ∈ R^{n×n}, since ρ(A) ≤ ||A|| < ∞. Both the row sum norm ||A||∞ and the column sum norm ||A||₁ of A depend only on the elements of A. If α := min{||A||₁, ||A||∞}, then all eigenvalues of A are confined to a circle with the center at the origin and the radius α. Correspondingly, ρ(A) ≤ α.

If α > 1, A is, for the purpose of the subsequent analysis, multiplied by 1/α. If min{||A||₁, ||A||∞} ≤ 1, α is equated to one. As compared with the eigenvalues of the matrix A, the ones of Â := (1/α)A carry the factor 1/α ≤ 1, and they are confined to the unit circle. Consequently, ρ(Â) ≤ 1.
Figure 5.8: Multiplication with the factor 1/α
With ẑ := (1/α)z, the Möbius-transformation ŵ := (ẑ + 1)/(ẑ − 1) is now applied with respect to the unit circle. The image is the union of the left half-plane and the imaginary axis. For matrices, this corresponds to the Möbius-transformation B̂ := I + 2(Â − I)⁻¹.
The multiplication of a matrix A by 1/α ∈ (0, 1] does not affect the location of the eigenvalues either in the left half-plane or on the imaginary axis or in the right half-plane. If α > 1, the eigenvalues of Â := (1/α)A are closer to the imaginary axis than the ones of A. Consequently, a matrix A ∈ R^{n×n} is asymptotically stable if and only if the matrix Â := (1/α)A possesses this property. For this to be true, it is necessary and sufficient that the spectral radius of the correspondingly transformed matrix B̂ := I + 2(Â − I)⁻¹ is less than one. In the case that ρ(B̂) ≥ 1, there are eigenvalues of A either on the imaginary axis or in the right half-plane.
If ρ(B̂) < 1 then, analogously to the second constructive method, a β ≥ 1 as large as possible is determined such that ρ(B̂) < 1/β. Then all eigenvalues of B̂ are confined to the intersection of the following sets:

• the union of the left half-plane and the imaginary axis, and

• the circle in the ŵ-plane with the center at the origin and the radius 1/β.
With reference to the left part of Figure 5.9, this is the union of the left half-circle and the segment of the imaginary axis that is contained in this circle.
The circle in the ŵ-plane with the center at the origin and the radius 1/β is now transformed into the ẑ-plane. Just as in the case of the second constructive method, the image is a circle with the center at M̂ := ½(γ̂ + 1/γ̂) ∈ R, which intersects the real axis at γ̂ := (β + 1)/(1 − β) and 1/γ̂. The Möbius-transformation ẑ := (ŵ + 1)/(ŵ − 1) maps the union of the left half-plane and the imaginary axis onto the unit circle and its boundary. The intersection of these two images in the ẑ-plane is lens-shaped. Under the transformation ẑ := (ŵ + 1)/(ŵ − 1), the points ±i/β are mapped into the points of intersection of the bounding circles,

−(β ± i)²/(1 + β²).
Figure 5.9: Transformation of the half-circle onto a lens-shaped domain
The inverse transformation from the ẑ-plane into the z-plane corresponds to a dilatation by the factor α. Consequently, the eigenvalues of A ∈ R^{n×n} are confined to the intersection of the following sets:

• a circle with the center at the origin and the radius α, and

• a circle with the center at M := αM̂ which intersects the real axis at αγ̂ and α/γ̂.

Therefore, |α/γ̂| is a lower bound of the degree of stability. Here too, the condition ρ(βB̂) < 1 can be verified by means of the Cordes-Algorithm, and a close to optimal β can be determined by means of a bisection method.
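A non-verified sketch of this third method, with an ordinary eigenvalue routine standing in for the Cordes-Algorithm and all names purely illustrative, might look as follows:

```python
import numpy as np

def rho(M):
    # Non-verified stand-in for the Cordes-Algorithm's test rho(.) < 1.
    return np.max(np.abs(np.linalg.eigvals(M)))

def stability_degree_lb_method3(A, steps=60):
    # Sketch of the third constructive method: scale A by the norm bound
    # alpha, transform, and enlarge beta >= 1 by bisection while
    # rho(beta * Bhat) < 1; the bound -alpha/gamma_hat is returned.
    n = A.shape[0]
    I = np.eye(n)
    alpha = max(1.0, min(np.linalg.norm(A, 1), np.linalg.norm(A, np.inf)))
    Bhat = I + 2.0 * np.linalg.inv(A / alpha - I)
    if not rho(Bhat) < 1.0:
        return None                          # asymptotic stability not verified
    lo, hi = 1.0, 2.0
    while rho(hi * Bhat) < 1.0 and hi < 1e12:
        lo, hi = hi, 2.0 * hi
    for _ in range(steps):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if rho(mid * Bhat) < 1.0 else (lo, mid)
    if lo <= 1.0:
        return 0.0                           # no useful bound obtained
    gamma_hat = (lo + 1.0) / (1.0 - lo)      # real-axis intersection, < -1
    return -alpha / gamma_hat                # lower bound of the degree of stability
```

For A = diag(−1, −2, −10), the norm bound gives α = 10 and the bisection drives β toward 11/9, so that the returned bound approaches the exact degree of stability σ = 1.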
5.6
Fourth Constructive Method for a Lower Bound of the Degree of Stability
A matrix A ∈ R^{n×n} is asymptotically stable if and only if all eigenvalues of A are confined to the left half-plane. Then there exists a minimal distance of the eigenvalues from the imaginary axis. By definition in Subsection 5.1, this is the degree of stability σ of A. Since σ is unknown, a lower bound will now be determined. Consequently, a δ ∈ R⁺ is to be determined which is as large as possible but still possesses the property of a lower bound of σ. As applied to the eigenvalues of A, the transformation ẑ := z + δ yields a shift to the right of the spectrum of A. For A, the corresponding shift is given by

Â := A + δI.

If δ < σ, the eigenvalues λ̂ := λ + δ of Â are confined to the left half-plane. Otherwise, Â is not asymptotically stable and the degree of stability of A is not bounded by δ. The asymptotic stability of Â can be verified by means of the constructive method as represented in Subsection 3.4. For this purpose, the left half-plane is mapped into the interior of the unit circle, making use of the Möbius-transformation ŵ = (ẑ + 1)/(ẑ − 1). Subsequently, it is tested whether or not there holds |μ̂| < 1 for the transformed eigenvalues μ̂ := (λ̂ + 1)/(λ̂ − 1). The following theorem establishes a relationship between the eigenvalues of A and B̂:
Theorem 5.4 For a matrix A ∈ R^{n×n}, it is assumed that ρ(B̂) < 1 for the spectral radius of the transformed matrix B̂ := I + 2(A + (δ − 1)I)⁻¹. Then,

1. δ > 0 is a lower bound for the distance of the eigenvalues of A from the imaginary axis and

2. the degree of stability of A exceeds δ.

Proof: 1. If Â := A + δI, then B̂ := I + 2(Â − I)⁻¹. Provided ρ(B̂) < 1, then Â is asymptotically stable because of Subsection 3.4. Consequently, all eigenvalues of Â possess negative real parts. Additionally, δ > 0 is a lower bound of the distance of the eigenvalues of A from the imaginary axis.

2. The degree of stability has been defined by means of −σ := max_{i=1(1)n} Re λᵢ. Therefore, δ is a lower bound of σ since all eigenvalues of A are located to the left of a straight line parallel to the imaginary axis intersecting the real axis at −δ. □
Here too, ρ(B̂) < 1 can be verified by means of the Cordes-Algorithm. In many cases, a bisection method will yield an acceptable lower bound for the degree of stability. Just as in the case of the first constructive method, there is no further information on the location of the eigenvalues in the left half-plane.
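The fourth method admits the same kind of non-verified sketch (again with an ordinary eigenvalue routine in place of the Cordes-Algorithm; all names are illustrative):

```python
import numpy as np

def rho(M):
    # Non-verified stand-in for the Cordes-Algorithm's test rho(.) < 1.
    return np.max(np.abs(np.linalg.eigvals(M)))

def stability_degree_lb_method4(A, tol=1e-9):
    # Sketch of the fourth constructive method (Theorem 5.4): the largest
    # shift delta with rho(I + 2(A + (delta-1) I)^{-1}) < 1 is approached
    # from below by bisection.
    n = A.shape[0]
    I = np.eye(n)
    def stable(delta):
        try:
            B = I + 2.0 * np.linalg.inv(A + (delta - 1.0) * I)
        except np.linalg.LinAlgError:
            return False                     # singular shift: treat as failure
        return rho(B) < 1.0
    if not stable(0.0):
        return None                          # asymptotic stability not verified
    lo, hi = 0.0, 1.0
    while stable(hi):
        lo, hi = hi, 2.0 * hi
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if stable(mid) else (lo, mid)
    return lo                                # lower bound of the degree of stability
```

For A = diag(−1, −2, −10) the bisection approaches the exact degree of stability σ = 1 from below.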
5.7
Additional Applications of the Four Constructive Methods
The four constructive methods as presented in Subsections 5.3-5.6 have been developed in view of the degree of stability. As a supplement, the second and the third method yield sets in the left half-plane containing all eigenvalues. Additionally and with only slight modifications, in particular the fourth method admits further applications which may be of interest. This method will now be used for the determination of an upper bound of the maximal real part of the eigenvalues of an arbitrary matrix A ∈ R^{n×n}. For this purpose, the shift δ is replaced by −δ. Correspondingly, the spectrum of the matrix A ∈ R^{n×n} is moved such that all eigenvalues of A − δI are confined to the left half-plane. Then all eigenvalues of A possess real parts less than δ, as is shown in Figure 5.10.
Figure 5.10: Estimates of the maximal real part of the eigenvalues
Additionally and with only slight modifications, the fourth method admits the determination of a lower bound of the real parts of all eigenvalues of A. For this purpose, the matrix A ∈ R^{n×n} is replaced by the matrix −A. Correspondingly,

• the eigenvalues λᵢ of A are replaced by the eigenvalues −λᵢ of −A, and

• the eigenvalue of A with the minimal real part now becomes the one with the maximal real part of −A, which is bounded as has been outlined before.

In this way, a strip can be determined containing all eigenvalues of A and being bounded by straight lines parallel to the imaginary axis. In many cases, this strip represents an acceptable confinement for the eigenvalues of A. Figure 5.11 displays the parameters δ′ and δ to be determined by means of bisection methods.
Figure 5.11: Strip enclosing the eigenvalues, with the parameters δ′ and δ
For the construction of a strip of this kind, the other three methods may also be used. This only requires a shift of the spectra of A or −A, respectively, such that all eigenvalues are confined to the left half-plane. The corresponding shifted matrices are A − δI or −A − δ′I, respectively. Consequently, a strip with the properties outlined before may be determined by means of each one of the four methods. Since each one yields a set containing all eigenvalues of A, the intersection of these four sets represents a final enclosure with a correspondingly smaller overestimate. These four sets are (i) strips parallel to the imaginary axis, as following from the first and the fourth method, (ii) a circle due to the second method, and (iii) a lens-shaped domain due to the third method. The sets addressed in (i)-(iii) yield upper and lower bounds for the real parts of all eigenvalues; the ones addressed in (ii) and (iii) yield corresponding bounds for the imaginary parts.
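The −δ variant can be sketched in the same non-verified style (ordinary eigenvalue routine instead of the Cordes-Algorithm; names illustrative):

```python
import numpy as np

def rho(M):
    # Non-verified stand-in for the Cordes-Algorithm's test rho(.) < 1.
    return np.max(np.abs(np.linalg.eigvals(M)))

def max_real_part_ub(A, tol=1e-9):
    # Sketch of the -delta variant of the fourth method: the smallest shift
    # delta >= 0 for which A - delta*I is (non-verified) asymptotically
    # stable yields Re(lambda) < delta for all eigenvalues of A.
    n = A.shape[0]
    I = np.eye(n)
    def stable(delta):
        try:
            B = I + 2.0 * np.linalg.inv(A - delta * I - I)
        except np.linalg.LinAlgError:
            return False
        return rho(B) < 1.0
    lo, hi = 0.0, 1.0
    while not stable(hi):
        lo, hi = hi, 2.0 * hi        # enlarge the shift until stable
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        lo, hi = (lo, mid) if stable(mid) else (mid, hi)
    return hi                        # upper bound of max Re(lambda)
```

Applying the same routine to −A bounds the real parts from below, so that the two calls together determine the enclosing strip.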
5.8
Examples for Bounds of the Eigenvalues
In this subsection, examples for applications of the four constructive methods are presented for the case of point matrices. Since these examples serve as test problems, the eigenvalues of the investigated matrices were initially chosen.
The examples were constructed as follows:

• A real canonical form Ā ∈ R^{n×n} of a matrix was adopted by means of its chosen real or conjugate complex eigenvalues. These eigenvalues are simple or multiple ones; they are confined to the left half-plane. Furthermore, the numbers chosen for the real or imaginary parts of these eigenvalues are integers or they possess only one additional non-zero digit following the decimal point.

• By use of an invertible elementary transformation with a matrix T, a non-canonical matrix A was generated which is similar to Ā: A := T⁻¹ĀT.

• The elementary transformation by use of T induces a suitable linear combination of two rows of Ā; the transformation by means of T⁻¹ then induces a corresponding linear combination of two columns of TĀ. Consequently, the elements of A are represented by means of numbers with only a few non-zero digits in a normalized floating-point representation. A sequence of elementary transformations, T₁, T₂, …, was carried out. This sequence was truncated when the majority of all elements of the finally generated matrix A were non-zero. The sequence of these transformations can then be represented by means of one matrix T.
In the first example, all chosen eigenvalues are real: λ₁/₂ = −1 possesses the multiplicity two; λ₃ = −2 and λ₄ = −10 are simple. A sequence of elementary transformations generated the following matrix:

-5.5 4.5 -4.5 -4.5 4 -6 4 -9 9 -10 -9 8.5 -9.5 8.5 7.5

The following table lists the upper and the lower bounds which were determined for the real parts Re λᵢ and the imaginary parts Im λᵢ of the set of eigenvalues, making use of the four constructive methods.
Method | Bounds for the eigenvalues
  1    | -10.00003100932 < Re λ < -0.99999729704
  2    | -10.00103384201 < Re λ < -0.09998966264
  2    |  -4.95052208968 < Im λ <  4.95052208968
  3    | -29             < Re λ < -0.99993316106
  3    | -28.93112558652 < Im λ < 28.93112558652
  4    | -10.00003020952 < Re λ < -0.99999697019
The first, the third, and the fourth method yield almost sharp lower bounds for the degree of stability of A. The circular enclosure of all eigenvalues as due to the second method is determined by the eigenvalue λ₄ = −10. For this eigenvalue, an almost sharp lower bound was calculated by use of the first, the second, and the fourth method.

The construction of the matrix A in the second example started from the following choices of its eigenvalues: λ₁/₂ = −1, λ₃/₄ = −2, λ₅/₆ = −10, λ₇/₈ = −0.1 + 10i, and λ₉/₁₀ = −0.1 − 10i, all with multiplicity two. Any numerical approximation of the spectrum of a matrix with these eigenvalues is ill-conditioned. The real canonical matrix concerning λ₇/₈ = −0.1 + 10i and λ₉/₁₀ = −0.1 − 10i is represented by

[ -0.1  -10     1     0  ]
[  10   -0.1    0     1  ]
[   0     0   -0.1  -10  ]
[   0     0    10   -0.1 ]
A sequence of elementary transformations, applied to the corresponding 10×10 real canonical matrix, yielded the full 10×10 test matrix A.
The following table lists the upper and the lower bounds which were determined for the real and the imaginary parts by the four constructive methods:

Method | Bounds for the eigenvalues
  1    |   -10.0000011231 < Re λ <   -0.0999997922
  2    | -1010.1001809740 < Re λ <   -0.0000990001
  2    |  -505.0495955362 < Im λ <  505.0495955362
  3    |  -107.9          < Re λ <   -0.0991464083
  3    |  -107.8998178053 < Im λ <  107.8998178053
  4    |   -10.0000013608 < Re λ <   -0.0999995609
Just as in the case of the first example, here too, the first, the third, and the fourth methods yield almost sharp bounds for the maximal and the minimal real parts of the eigenvalues. The computed bounds for the imaginary parts are rather poor.
5.9
Application Concerning the Bus with an Automatic Tracking System
In this subsection, lower bounds of the degree of stability are presented which have been determined by means of the four constructive methods introduced in Section 5. Additionally, domains in C are presented containing all eigenvalues.
The following Table 5.1 lists the approximations for the eigenvalues of (4.1) in the three operational cases which have been determined in the monograph by G. Roppenecker [14]:

Operational Case 1        Operational Case 2        Operational Case 3
Real Part   Imag. Part    Real Part   Imag. Part    Real Part   Imag. Part
 -0.368        0
 -1.009        2.168
 -1.009       -2.168
-28.05        14.52
-28.05       -14.52
-34.87         0

Table 5.1
Operational Case 2
Operational Case 1 -34.886 -35.584 -17.778 -180.934 -180.933 -34.886
< c c < < <
ReX ReX ImX ReX ImX ReX
< c < < < <
-0.368 -17.118 -0.028 -23.434 17.778 -11.696 -0.368 -139.529 180.933 -139.506 -0.368 -17.119
Method
Operational Case 3
1
-20.009 < ReX c -2.623 -20.118 < Rex < -0.049 -10.034 < ImX < 10.034 -106.472 C R e X C -2.617 -106.344 < ImX < 106.344 -20.009 < ReX < -2.617
n
L
9
0
4
< < < c < <
ReX Rex ImX ReX ImX ReX
< -1.267
c
-0.042
c
-1.267
< 11.696 < 139.506 < -1.267
Table 5.2
In particular, the first and the fourth method have yielded bounds which are close to the corresponding approximations in Table 5.1. This exhibits the tight character of the computed bounds for the real parts of the eigenvalues of (4.1) and verifies the high quality of the approximations as given in Table 5.1.
6
Concluding Remarks
Constructive methods for the following purposes have been developed:
(a) for all matrices A ∈ [A] ⊆ R^{n×n}, the verification of the property of asymptotic stability, (b) for matrices A ∈ R^{n×n}, a tight upper bound for the maximal real part in the set of eigenvalues of A and, correspondingly, (c) a tight lower bound for the minimal real part.

The bound addressed in (b) is a verified lower bound for the degree of stability of A. This verification of the degree of stability is less costly than a corresponding computational determination of all eigenvalues of a matrix A ∈ R^{n×n}, followed by an identification of the one with maximal real part. For large matrices, the determination of all eigenvalues can be carried out only approximately and unreliably. Additionally, an explicit determination of all eigenvalues is not necessary in order to verify the property of asymptotic stability of a matrix. In the case of the admission of all A ∈ [A] ⊆ R^{n×n}, the implications of these considerations are compounded by the fact that computational determinations of sets of eigenvalues are correspondingly more difficult, costly, and unreliable. The verification of this quality rests on the employment of totally error-controlled numerical methods, i.e., the combination of the Kulisch computer arithmetic with the constructive methods developed in the present paper. Employing a complex interval arithmetic, it is correspondingly possible to determine tight upper and lower bounds for the imaginary parts of the eigenvalues of A ∈ R^{n×n}. An algorithm and a code for this purpose have recently been developed and applied in the context of the diploma thesis of Jutta Morlock.
References

[1] J. Ackermann: Abtastregelung, Band II: Entwurf robuster Systeme, Springer-Verlag, Berlin, 2nd edition, 1983.

[2] G. Alefeld, J. Herzberger: Introduction to Interval Computations, Academic Press, New York, 1983.

[3] S. Bialas: A Necessary and Sufficient Condition for the Stability of Interval Matrices, International Journal of Control, 38, 1983, p. 717-722.

[4] G. Bohlender, L. B. Rall, Ch. Ullrich, J. Wolff von Gudenberg: PASCAL-SC: Wirkungsvoll programmieren, kontrolliert rechnen, Bibliographisches Institut, Mannheim, 1986.

[5] L. Cesari: Asymptotic Behaviour and Stability Problems in Ordinary Differential Equations, Springer-Verlag, Berlin, 1963.

[6] D. Cordes: Verifizierter Stabilitätsnachweis für Lösungen periodischer Differentialgleichungen auf dem Rechner mit Anwendungen, Dissertation, Karlsruhe, 1987.

[7] Xu Daoyi: Simple Criteria for Stability of Interval Matrices, International Journal of Control, 38, 1985, p. 289-295.

[8] B. Groß: Verifizierter Stabilitätsnachweis für Intervallmatrizen mit Anwendungen aus der Regelungstechnik, Diplomarbeit, Karlsruhe, 1991.

[9] W. Hahn: Stability of Motion, Springer-Verlag, Berlin, 1967.

[10] R. Klatte, U. Kulisch, M. Neaga, D. Ratz, Ch. Ullrich: PASCAL-XSC, Springer-Verlag, Berlin, 1991.

[11] K. Knopp: Elemente der Funktionentheorie, de Gruyter Verlag, 9th edition, 1978.

[12] R. Lohner: Einschließung der Lösung gewöhnlicher Anfangs- und Randwertaufgaben und Anwendungen, Dissertation, Karlsruhe, 1988.

[13] R. Moore: Interval Analysis, Prentice-Hall, Englewood Cliffs, N.J., 1966.

[14] G. Roppenecker: Zeitbereichsentwurf linearer Regelungen, Oldenbourg Verlag, München, 1990.

[15] W. Walter: Gewöhnliche Differentialgleichungen, Springer-Verlag, Berlin, 3rd edition, 1986.

[16] R. K. Yedavalli: Stability Analysis of Interval Matrices: Another Sufficient Condition, International Journal of Control, 43, 1986, p. 767-772.
Numerical Reliability of MHD Flow Calculations Wera U. Klein
In this paper a new numerical investigation is presented for the two-dimensional magnetohydrodynamic flow (MHD flow) in a rectangular duct, and an error analysis of the traditional calculation of the solution is derived. Arbitrary values of the flow parameters are admitted; they are the Hartmann number and the wall conduction ratio. The singular perturbation problem is solved and analyzed by means of interval arithmetic and verified enclosure methods (E-methods), supported by the programming languages PASCAL-SC and FORTRAN-SC. Furthermore, the error analysis of the traditional calculation as applied to this MHD flow shows that there is a lack of reliability of published numerical results for this physical problem, and this even for Hartmann numbers M < 1000. These results indicate that the reliability of numerical results should at least be verified for any calculation, via employing a control of rounding errors by use of an accurate floating-point arithmetic and enclosure methods.
1 Introduction

The behavior of magnetohydrodynamic flows (MHD flows) at high Hartmann numbers is relevant for the design of the self-cooled, liquid-metal blankets for fusion reactors. There exist many numerical algorithms examining such flows; however, the problem of all these studies is a lack of convergence for Hartmann numbers M > 1000. The important range of the Hartmann numbers is 100 < M < 100000. The reason for this typical behavior will be shown and a new algorithm to accelerate convergence of the solution will be outlined.

Scientific Computing with Automatic Result Verification
Copyright © 1993 by Academic Press, Inc. All rights of reproduction in any form reserved. ISBN 0-12-044210-8
A finite difference study of a two-dimensional, liquid-metal MHD flow in square ducts is presented for the case of thin, electrically conducting walls and a uniform, transverse, applied magnetic field. The system of elliptic partial differential equations for the simulation of this physical problem will be approximated by use of finite differences. The implementation and computational results of this method will be compared by means of the employment of different machines and different programming languages. Particularly, machines with a traditional computer arithmetic were used, such as an IBM 3090 with VS-FORTRAN, an IBM 4381 with the new programming language FORTRAN-SC (Scientific Computation), and an ATARI ST4 with PASCAL-SC. FORTRAN-SC and PASCAL-SC are programming languages which are based on an accurate, mathematically defined computer arithmetic /Ku81/; this means that the floating-point arithmetic is based on a small number of strong axioms. These languages offer additional arithmetics like multiple precision or interval arithmetic /Ka84/ and a library of problem solving routines for standard problems of numerical analysis, generating tight verified bounds for the unknown exact solution. The methods of interval mathematics as implemented by use of these new programming languages show that it is impossible to compute solutions of the discrete equations of our problem by use of a traditional computer arithmetic, at least for Hartmann numbers equal to or larger than 1000. Using the accurate arithmetic, a relationship between the necessary accuracy for the coefficients of the discrete equations and the number of correct digits of the approximated solution could be calculated for different Hartmann numbers. The range of the necessary accuracy cannot be reached by means of a traditional computer arithmetic. The finite difference method was implemented by use of FORTRAN-SC or PASCAL-SC and methods of interval arithmetic.

This generated a verified enclosure of the exact solution of the discrete problem; it predicts an M-shaped axial velocity profile in agreement with the theoretical studies of Hunt /Hu64/ and Walker /Wa81/. The positions of the extrema of the velocity field are verified, and regions of negative velocity for Hartmann numbers larger than 100 are computed in dependency on the conductance of the walls.
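The origin of the convergence difficulty can be seen already in a much simpler one-dimensional Hartmann-flow analogue. The following sketch is only an illustration (it is not the finite-difference code of this paper); it shows the thin wall layers of width O(1/M) which any grid must resolve, and that even evaluating the exact solution requires care in floating-point arithmetic:

```python
import numpy as np

# 1D Hartmann-layer analogue: V'' - M^2 V = -M^2 on [-1, 1] with
# V(-1) = V(1) = 0 has the exact solution V(y) = 1 - cosh(M*y)/cosh(M),
# featuring wall layers of thickness O(1/M).
def hartmann_profile(y, M):
    # Stable evaluation of cosh(M*y)/cosh(M): the naive quotient already
    # overflows in IEEE double precision for M around 710.
    ratio = (np.exp(M * (y - 1.0)) + np.exp(-M * (y + 1.0))) / (1.0 + np.exp(-2.0 * M))
    return 1.0 - ratio

M = 1000.0
y = np.linspace(-1.0, 1.0, 9)
V = hartmann_profile(y, M)   # ~0 at the walls, ~1 in the core
```

Resolving these layers requires mesh widths well below 1/M, which indicates why uniform-grid calculations cease to converge for Hartmann numbers M > 1000.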
2
Physical model
The fully developed flow of an incompressible, electrically conducting liquid in an infinitely long, rectangular duct is investigated. The walls of the channel are assumed to be very thin and electrically conducting (t_w ≪ b, where t_w is the wall thickness and b the half length of the side walls).
MHD Flow Calculations
The surrounding medium is assumed to be electrically insulating. A homogeneous magnetic induction B is applied parallel to the side walls of the duct ( see Figure 2.1 ).
Figure 2.1: Infinitely long, rectangular duct ( cross section 2a x 2b ) and applied magnetic induction B

By means of the usual assumptions of magnetohydrodynamics, Hunt /Hu69/ has shown that a time-independent flow can be described in the following typical and unique form:
v = ( 0, 0, V( x,y ) ),
φ = Φ( x,y ),
p = p_0 - P_0·z,   i.e.   -∇p = ( 0, 0, P_0 ),

where v represents the velocity field, Φ the electrical potential, p the pressure, and ∇p the pressure gradient. The coordinates x and y span the duct profile [ -a,a ] x [ -b,b ], and z is the axial coordinate of the duct. The unknown velocity V and the electrical potential Φ are defined by the following equations:
ΔΦ( x,y ) = -∂_x V( x,y ),
ΔV( x,y ) - M²·V( x,y ) = M²·∂_x Φ( x,y ) - P_0,      ( 2.1a )

with the Hartmann number M := b·B_0·√( σ/η ).
W.U.Klein
Here, B_0 is the absolute value of the magnetic induction, η the viscosity of the fluid, and σ the electrical conductivity of the fluid. The boundary conditions of the velocity V and the electrical potential Φ are given by

V( x,y ) = 0,   ∂_n Φ( x,y ) = p·∂²_s Φ( x,y )   at the duct walls.      ( 2.1b )

Here, p denotes the wall conduction ratio, defined by p := ( σ_w/σ )·t_w; n represents the normal direction at the points of the wall and s the tangential direction; σ_w is the electrical conductivity of the wall; and t_w is the wall thickness /Wa81/. Important and typical properties of the equations ( 2.1a, 2.1b ) are the symmetries in the variables x and y:

V( x,y ) = V( x,-y ) = V( -x,y ),
Φ( x,y ) = Φ( x,-y ) = -Φ( -x,y ),   for ( x,y ) ∈ [ -a,a ] x [ -b,b ].      ( 2.2 )

Therefore,
( i ) the velocity V is symmetrical with respect to the x- and the y-axis, and
( ii ) the electrical potential Φ is symmetrical with respect to the x-axis and antisymmetrical with respect to the y-axis.
Now the calculation of the solution of the problem ( 2.1a, 2.1b ) can be reduced to the first quadrant of the duct profile [ 0,a ] x [ 0,b ]. The boundary conditions of the reduced system are:

V( a,y ) = V( x,b ) = 0,     ∂_x V( 0,y ) = ∂_y V( x,0 ) = 0,
∂_n Φ( a,y ) = p_s·∂²_s Φ( a,y ),     Φ( 0,y ) = 0,
∂_n Φ( x,b ) = p_t·∂²_x Φ( x,b ),     ∂_y Φ( x,0 ) = 0,      ( 2.1b' )

with x ∈ [ 0,a ], y ∈ [ 0,b ], and the wall conduction ratio of the top, p_t, and the one of the side walls, p_s. The reduced system ( 2.1a, 2.1b' ), as well as the original system ( 2.1a, 2.1b ), has a singular behavior for large Hartmann numbers M. Another problem arises because of the rather unusual boundary conditions ( neither Dirichlet nor Neumann
conditions in ( 2.1b ) ). Therefore, in contrast to the usual formulation of a Dirichlet problem or a system with Neumann conditions, the question of existence and uniqueness of the solution of problem ( 2.1a, 2.1b ) is still unresolved at present.
For an arbitrary wall conduction ratio p, the fully developed MHD flow in a rectangular duct has been analyzed by Walker /Wa81/ by means of the asymptotic method for problems of singular perturbations. One important result of this study is that the interior of the duct can be subdivided into the following subregions: the core of the duct with a nearly constant axial velocity, the Hartmann layers adjacent to the top and bottom of the duct profile, the inner and outer side layers adjacent to the side walls, and the corner regions. The thickness of the boundary layers can be estimated depending on the Hartmann number and the wall conduction ratio ( see Figure 2.2 ).
Figure 2.2: Subregions of the duct flow for large Hartmann numbers: (c) core, (h) Hartmann layers, (i) inner side layers, (o) outer side layers, (cr) regions of corners.
It is well known that the velocity reaches its global maximum in the regions of the inner side layers and becomes negative in the outer side layers. Therefore the velocity profile has a typical M-shape if plotted against the x-direction of the duct.
3 Numerical Method
Numerical results of the two-dimensional MHD flow in a rectangular duct are only known for Hartmann numbers M < 1000 /St89/. These results do not agree with the asymptotic results of Walker /Wa81/. For instance, the conjectured region of negative velocity could not be calculated for Hartmann numbers M < 1000. The position of the global minimum of the computed velocity field agrees very well with the asymptotic investigation; however, the values of the approximations do not exhibit a negative velocity for Hartmann numbers M < 1000. These results were determined by use of both the method of finite differences and the method of finite elements. The typical boundary layer structure of the flow field leads to variable and/or adaptive grid generations. Analogously to other numerical methods, the traditional computation of the finite difference method with variable grids yields consistent results only in the case of Hartmann numbers M < 1000. For M > 1000, the numerical results for different grids are inconsistent; this method diverges for Hartmann numbers M > 1000. The unusual boundary conditions of the electrical potential can be discretized by use of finite differences or finite elements. Using the method of finite elements, the boundary condition must be iterated by an additional step /St89/. Therefore, we prefer the finite difference method in this study. In the case of arbitrary Hartmann numbers, the new idea for solving this problem by use of the finite difference method is the employment of an accurate floating-point and interval arithmetic, which is available in the programming languages PASCAL-SC and FORTRAN-SC /Ku81/. The domain of the computation is a square duct and, because of the symmetries of the governing equations ( 2.1a ), we reduce the domain of computation to the first quadrant of the duct profile. Thus the discretization is based on an orthogonal grid in the domain [ 0,1 ] x [ 0,1 ].
The distances of the grid points are variable with respect to the small critical regions close to the channel walls. They are defined to be ( l_ν )_ν=0,..,n , n ∈ N, concerning the x-direction, and ( k_ν )_ν=0,..,m , m ∈ N, concerning the y-direction. Therefore the grid G is given as follows:

G := { ( x_i, y_j ) := ( Σ_ν=0..i l_ν , Σ_ν=0..j k_ν ) :  0 ≤ i ≤ n,  0 ≤ j ≤ m,  n, m ∈ N,
       ( l_ν )_ν=0,..,n ≥ 0,  ( k_ν )_ν=0,..,m ≥ 0,  l_0 = k_0 = 0,
       Σ_ν=0..n l_ν = Σ_ν=0..m k_ν = 1 }.
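A grid of this kind can be sketched in a few lines (an illustrative Python fragment, not the author's actual grid generator; the geometric stretching ratio is an assumption in the spirit of the boundary-layer structure described in Section 2):

```python
def graded_spacings(n, ratio=1.15):
    """Spacings l_1,..,l_n that sum to 1 and shrink geometrically
    toward the wall at x = 1 (l_0 = 0 is the degenerate first distance)."""
    raw = [ratio ** (n - i) for i in range(1, n + 1)]  # largest step first
    total = sum(raw)
    return [0.0] + [r / total for r in raw]

def grid_points(spacings):
    """Accumulate the distances into grid coordinates x_0 = 0, .., x_n = 1."""
    xs, x = [], 0.0
    for l in spacings:
        x += l
        xs.append(x)
    return xs

xs = grid_points(graded_spacings(40))
# The mesh is finest next to the side wall x = 1, where the side layers sit:
print(round(xs[-1], 12), xs[1] - xs[0] > xs[-1] - xs[-2])  # → 1.0 True
```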
In the subsequent outline, we use a typical notation for a difference scheme, which is shown in Figure 3.1: the distances of the grid point ( i,j ) to its neighbors are denoted by the symbols h_1 and h_2 in the x-direction and h_3 and h_4 in the y-direction, respectively. Additionally, the notation H_i := 1/h_i, i = 1,..,4, is used for the inverse distances.
Figure 3.1: Grid structure of the first quadrant of the square duct profile

The major difference operators for the approximation of the equations ( 2.1a, 2.1b ) are given by the following discretized formulas: in the case of the Laplacian operator
         [                         -H_3·H_4                           ]
Δ_G :=   [   -H_1²     H_1² + H_1·H_2 + H_3·H_4 + H_3²     -H_1·H_2   ]
         [                         -H_3²                             ],

in the case of the forward derivative with respect to the x coordinate

∂_G^f := [ 0   -H_2   H_2 ],

in the case of the backward derivative with respect to the x coordinate

∂_G^b := [ -H_1   H_1   0 ].
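The Laplacian stencil above arises as the composition of the backward and the forward difference; a one-dimensional sketch (the helper name is mine) shows that on a uniform grid it collapses to the classical 3-point formula:

```python
def laplacian_1d(u_left, u_c, u_right, h1, h2):
    """One-dimensional part of the difference Laplacian on a nonuniform
    grid, built as the backward difference of the forward difference
    (H1 = 1/h1 toward the left neighbor, H2 = 1/h2 toward the right one)."""
    H1, H2 = 1.0 / h1, 1.0 / h2
    return H1 * H1 * u_left - (H1 * H1 + H1 * H2) * u_c + H1 * H2 * u_right

# On a uniform grid the composition reproduces the second derivative
# of a quadratic exactly: (x**2)'' = 2.
u = lambda x: x * x
print(laplacian_1d(u(0.25), u(0.5), u(0.75), 0.25, 0.25))  # → 2.0
```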
Thus the equations ( 2.1a ) can be approximated by the following system of linear equations:

[ Δ_G - M²·I     -M²·∂_G ]   [ V* ]     [ -P_0 ]
[ ∂_G             Δ_G    ] · [ Φ* ]  =  [   0  ]      ( 3.1 )

Here the following notations are used: V* and Φ* are the vectors of the discrete function values of V and Φ at the grid points, M is the Hartmann number, and P_0 is the constant value of the z-component of the pressure gradient ( see Section 2 ). Thus the matrix of the system of equations ( 3.1 ) has a band structure ( Figure 3.2 ).
Figure 3.2: Band structure of the matrix of the discretized system ( 3.1 )

Prior to the determination of the solution of this discrete problem, the coefficients in the matrix and in the forcing vector of the system of linear
equations ( 3.1 ) have to be computed. Moreover, it is known that the accuracy of this coefficient calculation depends on the value of the Hartmann number and the wall conduction ratio /Wü90/. Therefore, it was intended to compute the solution of the discrete problem with a verified accuracy of the last unit of a 'DOUBLE REAL' floating-point format. At this moment, this range of accuracy is hypothetical; the reason for this situation will be seen in the next section. In order to realize this accuracy, a calculation with dynamic arrays of intervals is required: the necessary accuracy of the coefficients up to the last unit of the DOUBLE REAL format could not be reached by use of intervals of the length of the DOUBLE REAL format. The computation itself needs a larger representation of the intervals, which can be dynamically adapted corresponding to the desired accuracy of the coefficients. The programming languages PASCAL-SC and FORTRAN-SC support this interval arithmetic. Normally, the necessary range of the interval format was 16 - 18 decimal digits for the bounds of the intervals /Wü90/.
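The idea of interval bounds carried to an adjustable number of decimal digits can be imitated with directed rounding in Python's decimal module (a toy stand-in for the dynamic interval formats of PASCAL-SC/FORTRAN-SC; the function name is mine):

```python
from decimal import Decimal, ROUND_FLOOR, ROUND_CEILING, localcontext

def enclose_quotient(a, b, digits):
    """Lower and upper bound for a/b, each carried to `digits`
    significant decimal digits with directed (outward) rounding."""
    with localcontext() as ctx:
        ctx.prec = digits
        ctx.rounding = ROUND_FLOOR
        lo = Decimal(a) / Decimal(b)
        ctx.rounding = ROUND_CEILING
        hi = Decimal(a) / Decimal(b)
    return lo, hi

# Bounds of width one unit in the 18th digit, as used for the coefficients:
lo, hi = enclose_quotient(1, 3, 18)
print(lo, hi)  # → 0.333333333333333333 0.333333333333333334
```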
Figure 3.3: Accuracy ( number of exact decimal digits ) of the accurate interval computation and of the traditional computation, dependent on the Hartmann number M
Using these intervals and interval methods, this high but necessary accuracy for the coefficients can be verified independently of the value of the Hartmann number or the wall conduction ratio. In Figure 3.3, the difference between the usual computation of the coefficients and the verified interval calculation is illustrated. It is obvious that for high Hartmann numbers the reliability of the computed coefficients is rather poor. With these interval coefficients, the accurate discrete problem is expressed by use of a system of interval equations, which now has to be solved in a verified fashion by means of the E-Methods /Ka84/. 'E' is the capital letter of the German words for "Enclosure", "Existence", and "Uniqueness". This implies that the results of the methods are represented by interval vectors enclosing the exact solution. Furthermore, it is automatically proved that there exists a unique solution contained in the interval vectors. Thus the result of the accurate interval computation is a verified enclosure of the exact solution of the discrete system; consequently, the reliability of the numerical results has been proved. These methods are supported by the programming languages FORTRAN-SC and PASCAL-SC; they can be called, e.g., by the procedures DILIN or ILIN of the ACRITH library.
As a result, there are two important steps to guarantee the enclosure of the exact solution: the first is the computation of the coefficients with high accuracy, and the second the determination of a verified solution of the interval equation system.
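The enclosure step can be sketched with a Krawczyk-type operator, a generic verification technique in the spirit of the E-Methods (this is not the ACRITH routine DILIN/ILIN itself; the interval class, the 2x2 test system, and all names are illustrative assumptions):

```python
import math

class Iv:
    """Tiny closed-interval type [lo, hi] with outward rounding."""
    def __init__(self, lo, hi=None):
        self.lo = lo
        self.hi = lo if hi is None else hi
    def __add__(a, b):
        return Iv(math.nextafter(a.lo + b.lo, -math.inf),
                  math.nextafter(a.hi + b.hi, math.inf))
    def __sub__(a, b):
        return Iv(math.nextafter(a.lo - b.hi, -math.inf),
                  math.nextafter(a.hi - b.lo, math.inf))
    def __mul__(a, b):
        p = (a.lo * b.lo, a.lo * b.hi, a.hi * b.lo, a.hi * b.hi)
        return Iv(math.nextafter(min(p), -math.inf),
                  math.nextafter(max(p), math.inf))
    def contains(self, x):
        return self.lo <= x <= self.hi

def krawczyk(A, b, R, x0, rad):
    """One Krawczyk step K(X) = x0 + R(b - A x0) + (I - R A)(X - x0) for
    the box X = x0 +/- rad; if K(X) lies strictly inside X, the existence
    of a unique solution of A x = b inside K(X) is verified."""
    n = len(b)
    X = [Iv(x0[j] - rad, x0[j] + rad) for j in range(n)]
    K = []
    for i in range(n):
        acc = Iv(x0[i])
        for k in range(n):                       # defect term R (b - A x0)
            s = Iv(b[k])
            for j in range(n):
                s = s - Iv(A[k][j]) * Iv(x0[j])
            acc = acc + Iv(R[i][k]) * s
        for j in range(n):                       # contraction (I - R A)(X - x0)
            e = Iv(1.0 if i == j else 0.0)
            for k in range(n):
                e = e - Iv(R[i][k]) * Iv(A[k][j])
            acc = acc + e * (X[j] - Iv(x0[j]))
        K.append(acc)
    inside = all(X[i].lo < K[i].lo and K[i].hi < X[i].hi for i in range(n))
    return K, inside

A = [[2.0, 1.0], [1.0, 3.0]]                     # toy system, exact solution (1, 1)
b = [3.0, 4.0]
R = [[0.6, -0.2], [-0.2, 0.4]]                   # approximate inverse of A
K, verified = krawczyk(A, b, R, [1.0, 1.0], 1e-6)
print(verified, K[0].contains(1.0), K[1].contains(1.0))  # → True True True
```

The directed rounding that the paper requires of the floating-point arithmetic is imitated here with `math.nextafter` after every operation.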
4 Error analysis of traditional computation
Finally we wish to answer the questions why there is a lack of convergence of numerical calculations for Hartmann numbers M > 1000 in traditional computation and under which circumstances it is necessary and meaningful to apply an accurate computer arithmetic involving interval methods. Therefore, we now analyze the two steps of the finite difference method:

( I ) Transformation of the system of partial differential equations ( 2.1a, 2.1b ) to a system of linear algebraic equations ( 3.1 ).

( II ) Determination of the solution of the system of linear algebraic equations.
ad ( I ): It is now assumed that we compute the solution of the discretized system by use of an accurate floating-point arithmetic employing the programming languages PASCAL-SC or FORTRAN-SC /Ku81/ and interval methods /Ka84/. The coefficients of the discrete system ( 3.1 ) can then be calculated with high accuracy and enclosed by bounds which are independent of the value of the Hartmann number or the special grid generations. In Section 3 we have introduced the problem of the accuracy of the coefficient calculation in the length of the DOUBLE REAL format of floating-point numbers. If we compare these calculations to those by use of the traditional computations, we obtain an important relationship between the Hartmann number and the achievable accuracy of the traditional calculation of the coefficients; this is shown in Figure 3.3. However, only for Hartmann numbers M < 1000 is the traditional computational system computationally solvable. This solvability can be verified by use of an E-Method for solving systems of linear equations; this method is due to Rump /Ru83/ ( procedure DLIN of the ACRITH library ). The main issue of the analysis is the following definition:
Definition: For an arbitrary vector b, with b ∈ VR, the computational sensitivity ε of a regular matrix A ∈ MR is defined by:

( 1 + [ -δ,δ ] ) · A · x = b   is   computationally solvable if δ ≤ ε,
                                    computationally unsolvable if δ > ε,      δ, ε ∈ R⁺,

where [ -δ,δ ] describes the real interval from -δ to δ. The definition is concerned with perturbations of the matrix A ( for example as caused by rounding errors, input errors, etc. ) which are smaller than the sensitivity ε and do not lead to computationally unsolvable systems. This sensitivity has been determined for the case of the traditional computational system ( 3.1 ) and also for the case of the perturbed systems of the interval analysis. This investigation shows that the sensitivity of the system ( 3.1 ) does depend on the Hartmann number M. A dependency on the wall conduction ratio or on the choice of the grid generation could not be recognized. The relationship of the Hartmann number M and the sensitivity of system ( 3.1 ) is shown in Figure 4.1.
It is obvious that the sensitivity is almost equal for Hartmann numbers M < 1000. This means that the traditional computational system is solvable only for M < 1000 and, in particular, numerical results are known only for this domain. On the other hand, it is obvious that the accuracy of the length of the DOUBLE REAL format is necessary for computations for arbitrary Hartmann numbers M < 10⁶. This was the reason for the so-called hypothetical choice of accuracy in Section 3.
Figure 4.1: The accuracy in decimal digits of the accurate interval and the traditional computation, and the sensitivity of the discrete system ( 3.1 ), as a function of the Hartmann number M

ad ( II ): Figure 4.1 also indicates the accuracy of the coefficients that is necessary for the solvability of the computational discrete system. Subsequent to its determination, the quality of this computed solution must still be analyzed: the exact discretization of the partial differential equations ( 2.1a, 2.1b ) can be given as
A * x = b, where the matrix A and the vectors x and b correspond to the dimension n, which is also the number of grid points.
This exact discretized system is perturbed as follows by rounding errors on a real computer:
A·( I + F )·( x + h ) = b,   with   A·x = b,      ( 4.1 )

where
A·x = b  -  the exact discrete system,
I        -  the identity matrix,
F        -  the matrix of the rounding errors,
h        -  the relative error of the approximate solution.
The following useful assertion has been shown, e.g. by Wilkinson /Wi69/.
Corollary: Let A, B ∈ MR and A be a regular matrix. The sum A + B of the matrices is regular if | A⁻¹B | < 1.
Therefore, in system ( 4.1 ) we get: A·( I + F ) = A + A·F is regular if | A⁻¹·A·F | < 1.      ( 4.2 )
As a result of the error analysis represented in Figure 4.1, the computational error matrix F can be estimated by

| F | < 10^( 2·log(M) − t )   for 100 < M < 10⁶,      ( 4.3 )

where M is the Hartmann number and t the number of exact decimal digits of the calculation /Wü90/. With ( 4.2 ) and ( 4.3 ), the system ( 4.1 ) is solvable if

| A⁻¹ |·| A |·| F | < 1,

where |..| represents the Euclidean norm. Moreover, the computational error h of the system ( 4.1 ) is estimated by Wilkinson /Wi69/ when using a fixed format of floating-point numbers with t decimal digits. The employed Frobenius norm |..|_F is related to the Euclidean norm |..| by means of

| A |_F < n^(1/2) · | A |,   A ∈ MR,

where n is the dimension of the system. Now we get the final result for the estimation of the relative error of the solution:

| h | / | x |  ≤  ( 10^(−t)·n·| A⁻¹ |·| A | ) / ( 1 − 10^(−t)·n·| A⁻¹ |·| A | ),   t ∈ N.
Using this inequality, the quality of the approximated solution of the discretization can be estimated in the following way: provided

| h | / | x | ≤ 10^(−k)

is satisfied by use of a certain number k ∈ N, then the first k decimal digits of the approximate solution x + h agree with the first k decimal digits of the exact solution x.
An employment of ( 4.3 ) in this inequality yields the relationship between the quality t of the discretized computational system, the number of grid points n, the Hartmann number M, and the quality k of the approximated solution. A rearrangement of this inequality leads to:

k + log( n ) + 2·log( M ) ≤ t,      t, k ∈ N,      ( 4.4 )

with
t  -  the number of exact decimal digits of the calculation,
n  -  the dimension of the discrete system,
M  -  the Hartmann number,
k  -  the number of exact decimal digits of the approximated solution of the discrete system.
Using this inequality, it is now possible to find an optimal relationship between the number of necessary grid points n, the possible accuracy t of the calculation, and the quality k of the approximate solution, depending on the Hartmann number M of interest. Two examples:

For the Hartmann number M = 1000 we wish to determine the approximate solution with a guaranteed accuracy of the leading two digits ( k = 2 ). In the case of 1000 grid points ( n = 1000 ), inequality ( 4.4 ) requires for this reliability an accuracy of the discretized calculation of t ≥ 11 digits. Now we wish to calculate for the case of the Hartmann number M = 10⁴, using the same assumption of the desired accuracy ( k = 2 ). Then by use of the inequality ( 4.4 ), we obtain the relationship between the needed accuracy of the coefficient computations and the number of grid points; for instance with n = 10000, we need an accuracy of t ≥ 14 decimal digits at least. For further calculations concerning the dependency between the computational accuracy, the Hartmann number, the number of grid points, and the numerical reliability, see Figures 4.2 and 4.3.
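The two examples can be reproduced by evaluating inequality ( 4.4 ) directly (a small illustration; the function name is mine):

```python
import math

def required_digits(k, n, M):
    """Smallest number t of exact decimal digits of the coefficient
    computation that inequality (4.4), k + log(n) + 2*log(M) <= t,
    allows for k guaranteed digits of the approximate solution."""
    return math.ceil(k + math.log10(n) + 2 * math.log10(M))

print(required_digits(2, 1000, 1000))      # → 11
print(required_digits(2, 10000, 10**4))    # → 14
```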
Figure 4.2: Relationship of the Hartmann number M and the quality t ( number of digits ) as determined by use of inequality ( 4.4 ), for n = 1000 and k = 2

Figure 4.3: Relationship of the number of grid points n and the quality t ( number of digits ) as calculated by use of inequality ( 4.4 ), for M = 10⁴ and k = 2
Finally, the first step of the error analysis ( for the determination of the sensitivity of the system ( 3.1 ) ) has shown that it appears to be possible to get some numerical results with an accuracy of 6 digits in the case of the coefficient computation; the systems are solvable ( see Figure 4.1 ). However, we have now shown that, in order to get a two-digit reliability of the approximate solution for the same discrete problem, we need an accuracy of 11 digits in the coefficient computation. Therefore, the numerical results of the MHD flow are doubtful even for Hartmann numbers M < 1000 when traditional computations are used. Only an interval computation and an error analysis by use of methods of interval mathematics lead to reliable numerical results.
5 Numerical Results
In contrast to the traditional computation, and for discretizations with different variable grids as outlined in Section 3, the application of E-Methods of interval analysis leads to consistent numerical results for the two-dimensional MHD flow in the cases of arbitrary Hartmann numbers and an arbitrary wall conduction ratio. The following investigation is confined to the analysis of the velocity field. The numerical study of the electrical potential is carried out in /Wü90/.
As a first example of a successful application of interval methods, Figure 5.1 shows the velocity distribution in a square duct for the Hartmann number M = 10000.
The perfect qualitative agreement with the asymptotic results of Walker /Wa81/ and Hunt /Hu64/ is also shown, particularly in view of the following properties: the M-shaped velocity profile in the x-direction and the subregions at the walls, the Hartmann layers and the inner and outer side layers ( compare Figure 2.2 and the subregions in Figure 5.1 ). Using the new methods of Section 3, we can also analyze the relationship between the velocity and the Hartmann number M. There are two aspects of this dependency:
1. An increase of the Hartmann number M results in a decrease of the thickness of the side layers ( Figure 5.2 ). The inner side layer is characterized by the position of the global maximum. The outer side layers are characterized by the position of the global minimum and the zeros of the velocity field. Thus the relationship between the increase of M and the decrease of the thickness is obvious in Figures 5.3 and 5.4.
2. The absolute values of the global extrema, the maximum in the inner side layers, and the minimum in the outer side layers increase with the Hartmann number M. This relationship can be seen in Figures 5.3 and 5.5.
Figure 5.1: Velocity profile in a square duct ( parameters: M = 10⁴, p_s = p_t = 0.1 )

The relationship between the velocity and the wall conduction ratio p is especially analyzed for the following three combinations of cases:

( I ) The wall conduction ratio is constant and equal at the four walls.
( II ) The wall conduction ratio at the bottom and at the top of the duct is constant and fixed, and at the side walls it is variable.
( III ) The wall conduction ratio at the side walls is constant and fixed, and at the bottom and the top it is variable.
Figure 5.2: Velocity field in the first quadrant of the duct profile close to the right side wall ( x = 1 ), for M = 10⁴ and various values of the wall conductance p_t
ad ( I ): With a smaller wall conduction ratio, we also obtain smaller outer side layers ( see Figures 5.6 and 5.7 ). Furthermore, there is no influence of the wall conduction ratio on the other layers, the Hartmann or the inner side layers. The relationship between the wall conduction ratio and the values of the global extrema can be described in the following way: a decrease of the wall conduction ratio yields an increase of the absolute values of the global extrema. This relationship is presented in Figure 5.8.

ad ( II ): A decreasing side wall conduction ratio leads to an increase of the thickness of the inner side layers. This is obvious at the positions of the minimum and the zero points of the velocity at the x-axis ( see Figure 5.9 ). On the other hand, the absolute values of the global extrema increase.

ad ( III ): As a simple but essential result, it is observed that there is no influence of the wall conduction ratio at the top and at the bottom on the structure or the values of the velocity field.
Figure 5.3: Velocity profile in the middle of the duct ( y = 1 ) close to the right side wall in the case of a constant wall conductance ( p = 0.1 ) and for various values of the Hartmann number
Figure 5.4: The positions of the global maximum ( Max ), the global minimum ( Min ), and of the zero points ( Nd ) of the velocity in the middle of the duct as dependent on the Hartmann number M
Figure 5.5: The relationship of the Hartmann number M and the value of the global maximum ( V_max ) for the case of a constant wall conductance p
Figure 5.6: Velocity profile in the middle of the duct ( y = 1 ) close to the right side wall for the case of a constant Hartmann number ( M = 10⁴ ) and various values of the wall conductance
Figure 5.7: The positions of the global maximum ( Max ), the global minimum ( Min ), and of the zero points ( Nd ) of the velocity in the middle of the duct as depending on the wall conduction ratio
Figure 5.8: The relationship between the wall conduction ratio and the value of the global maximum ( V_max ) for the case of a constant Hartmann number M

Figure 5.9: Velocity profile in the middle of the duct ( y = 1 ) close to the right side wall for the case of a constant wall conductance at the top and at the bottom ( p_t = 0.1 ), for different values of the wall conductance at the side walls, and for the case of a constant Hartmann number ( M = 10⁴ )
Finally, with this first numerical analysis of MHD flows in a square duct for Hartmann numbers M larger than 1000 and variable electrically conducting walls, the influence of the Hartmann number on the structure and the values of the velocity field could be verified. The influences of the different wall conduction ratios on the velocity field could be computed and analyzed. The wall conduction ratios at the top and at the bottom of the duct profile are unimportant for the velocity field ( III ), but there is a considerable influence of the wall conduction ratio at the side walls on the structure of the velocity field ( II ). For a further physical discussion of these numerical results, see /Wü90/. The need for a sufficient numerical reliability, as observed here, may also arise in other highly sensitive physical models. Furthermore, if the numerical results are verified, a possible subsequent improvement of the physical model may ensure that we correct the physical model itself and not a system polluted by rounding errors.
6 Conclusion

For large Hartmann numbers, a lack of consistency of numerical results has been observed in the literature. In the present paper, uncontrolled rounding errors have been shown to be the cause. These errors cannot be eliminated in the execution of a traditional computer arithmetic, since it does not support the methods of interval mathematics; in particular, there is an absence of directed roundings in the set of floating-point numbers. A control of rounding errors is based on computational enclosure methods which, because of their reliability, may be used for a validation of the employed physical and mathematical models.
Acknowledgement

This work has been supported by the Deutsche Forschungsgemeinschaft ( DFG ) for a period of three years. Prof. Roesner ( TH Darmstadt, Germany ) has supervised my work in this period of time, and I am especially grateful for his support and for many discussions with him. Furthermore, I would like to thank Prof. Kaucher for his interest in my work and for providing me with many valuable ideas, and Prof. Kulisch for enabling me to make use of the computer facilities of the Institute directed by him.
List of symbols

Physical abbreviations:
M      Hartmann number
t_w    thickness of the channel wall
b      half length of the side walls
B      magnetic induction
B_0    absolute value of the magnetic field
v      velocity field
V      z-component of the velocity field
Φ      electrical potential
p      pressure field of the channel
P_0    constant z-component of the pressure gradient
η      viscosity of the fluid
σ      electrical conductivity of the fluid
σ_w    electrical conductivity of the wall
p      wall conduction ratio
p_t    wall conduction ratio at the top and at the bottom of the channel
p_s    wall conduction ratio of the channel side walls

Derivation operators:
∂_x, ∂_y      first derivative with respect to the positional coordinate x or y
∂²_x, ∂²_y    second derivative with respect to the positional coordinate x or y
∂_n           first derivative with respect to the normal direction
∂²_s          second derivative with respect to the tangential direction
∇             gradient of a vector field
Δ = ∂²_x + ∂²_y    two-dimensional Laplacian operator

Algebraic abbreviations:
MR        the set of real n x n matrices
A, B, F   matrices in MR
I         identity matrix in MR
|..|_F    Frobenius norm in MR
|..|      Euclidean norm in MR or VR
VR        the set of real, n-dimensional vectors
x, b, h   vectors in VR
References:

/Hu64/  Hunt J.C.R., 1964, 'Magnetohydrodynamic flow in rectangular ducts', J. Fluid Mech. 21(4), pp. 577-590

/Hu69/  Hunt J.C.R., 1969, 'A uniqueness theorem for magnetohydrodynamic duct flow', Proc. Camb. Phil. Soc. 65, pp. 319-327

/Ka84/  Kaucher E.W., Miranker W.L., 1984, 'Self-validating numerics for function space problems', Academic Press, New York

/Ku81/  Kulisch U., Miranker W.L., 1981, 'Computer arithmetic in theory and practice', Academic Press, New York

/Ru83/  Rump S.M., 1983, 'Solving algebraic problems with high accuracy', in: Kulisch U., Miranker W.L. (eds), 'A new approach to scientific computation', Academic Press

/St89/  Sterl A., 1989, 'Numerische Simulation magnetohydrodynamischer Flüssig-Metall-Strömungen im rechteckigen Rohr bei großer Hartmann-Zahl', Doctoral Dissertation, University of Karlsruhe

/Wa81/  Walker J.S., 1981, 'Magnetohydrodynamic flow in rectangular ducts with thin walls. Part I: Constant area and variable area ducts with strong uniform magnetic fields', Journal de Mécanique 20, pp. 79-112

/Wi69/  Wilkinson J.H., 1969, 'Rundungsfehler', Heidelberger Taschenbücher, Band 44, Springer Verlag, Berlin

/Wü90/  Würfel W.U., 1990, 'Numerische Berechnung der zweidimensionalen MHD-Strömung für beliebig große Hartmannzahlen M mit E-Methoden', Doctoral Dissertation, Technical University of Darmstadt
The Reliability Question for Discretizations of Evolution Problems Ernst Adams
I. THEORETICAL CONSIDERATIONS ON FAILURES

Evolution problems with ordinary or partial differential equations ( DEs ) are considered in conjunction with consistent and stable difference methods. In applications, true ( exact ) difference solutions or their computed difference approximations are not necessarily meaningful approximations of true solutions of the DEs under consideration. In fact, there are not only quantitative but even qualitative distinctions between all true solutions of the DEs and ( i ) "spurious difference solutions" or ( ii ) computed "diverting difference approximations". Generally, the failures ( i ) and ( ii ) of discretizations as approximation methods cannot be recognized on the level of a discretization. This is possible, however, by means of Enclosure Methods, provided a method of this kind is available for the DEs under consideration. In connection with ( i ) and ( ii ), Enclosure Methods are used for a verification of certain periodic solutions of the Lorenz equations. In the continuation [4] of the present paper, there are examples for the existence of ( i ) and the occurrence of ( ii ) in problems of major current interest.
1. INTRODUCTION

In the context of "Computability with ( Guaranteed ) Reliability", the present volume is devoted to the development and applications of efficient Enclosure Methods. The acceptance of these methods rests on the recognition of their practical and their theoretical importance, as compared with the cost of their applications. In view of this desired recognition, the present chapter and the next one [4] are devoted to the practical treatment of differential equations ( DEs ) ( with side conditions ), making use of difference methods or, synonymously, discretization methods; see Stetter's monograph [47] for these methods in the case of ordinary differential equations ( ODEs ). In addition to the predominantly practical treatment of the reliability question of discretizations, certain analytical supplements are given in the Appendices 5.7, 7.8, and 8.6. At the end of the second
423
Copyright 0 1993 by Academic Press, Inc. All rights of reproduction in any form reserved. ISBN 0-12-044210-8
Discretizations: Theory of Failures
part [4], there is a partial list of symbols. The presentation is subdivided as follows:
• in this chapter, there are theoretical considerations concerning failures of discretizations as approximation methods for true solutions of DEs and
• in the next chapter [4], there are corresponding applications to popular problems arising in the mathematical simulation of real world problems; in this context, periodic solutions are a particularly important area of applications because of
  • their practical significance and
  • their inherent sensitivity with respect to perturbations (e.g. numerical errors) of all kinds.
These discussions of practical failures of difference methods serve as responses to the "Reliability Question for Discretizations". Remark: This question is induced by the perturbing influences of numerical errors of all kinds, i.e., the local discretization errors, the local rounding errors, and (if present) other local procedural errors. Each one of these perturbations depends on the approximation being computed. The quantitative influences of these errors depend on the "perturbation-sensitivity" of the system of equations to be solved numerically.
Outside of mathematics, there is (still) a basic trust in the reliability of results due to discretizations of the original continuous DE-problem. Because of the following reasons, this trust is unfounded:
(a) the fact that the (very few) available true (exact) solutions y* of nonlinear DEs are relatively simple and, therefore, not very useful as sufficiently tough test cases for the desirable robustness of discretization methods and
(b) the fact that the true solutions y* of DEs are elements of function spaces; however, difference solutions y (of discretizations) are vectors in Euclidean spaces. Consequently, these solutions y* and y are not (directly) comparable. Therefore, discretizations of DEs generate an entirely different problem and not just a perturbation in the space of the original one.
Consequently, on the level of the discretization, there is no quantitative access to the local discretization error, which establishes the link between the original function space of the true solutions and the approximating Euclidean space. Remark: A solution is said to be "true" if it satisfies the equations of the given problem without a residual (defect). A true solution is said to be "exact" if it
E. Adams
possesses an explicit representation.
In the case of ordinary differential equations (ODEs), an indirect link of function spaces and Euclidean spaces is provided by Lohner's enclosure algorithms for solutions of initial value problems (IVPs) [31], [32] (see also [2], [38], [39]) and boundary value problems (BVPs) ([31], [32], [29]). In the execution of these algorithms, enclosures of the values of the true solution y* are determined simultaneously with enclosures of the values of the local discretization error of the employed one-step Taylor-method. This simultaneous enclosure is made possible by the fact that higher order derivatives of the true solution can be (a) expressed by means of recursive differentiations of the explicit ODEs under consideration, y' = f(y), and evaluated efficiently by use of Moore's algorithm [36]. Concerning the partial derivatives of the highest order, there is generally more than one derivative of this kind in a partial differential equation (PDE). The local discretization error is then not accessible corresponding to (a) and, therefore, design principles for Enclosure Methods are not available. This disregards PDEs which, together with their side conditions, are (i) inverse-monotone or (ii) compatible with the theorem of Müller, see Walter's treatise [54] on differential and integral inequalities. In fact, the PDEs investigated in [13] in this volume belong to the classes (i) or (ii) and they are of the form u_st = f(s,t,u,u_s,u_t) as treated by Walter [54]. Generally, evolution problems possess the property (i) only in the case that all true solutions are not oscillatory.
For deterministic DEs, the weak causality principle is valid [7], i.e., identical causes (the input) always yield identical effects (the output). For approximations of true solutions y* of DEs by difference solutions of their discretizations, the strong causality principle [7] very often is not valid. This principle asserts that similar causes yield similar effects.
In fact, this invalidity is the basis of the subsequent demonstrations of the unreliability of discretization methods for ODEs or PDEs. For this discussion, it will be distinguished between the following classes of "solutions":
(A) true ODE- or PDE-solutions, y*, with the side conditions of the DEs taken into account;
(B) true difference solutions, ȳ, of discretizations concerning (A), without a consideration of the local discretization errors; all other errors are assumed to be absent in the determination of ȳ;
(C) the character of ȳ as an approximation of y*;
(D) in the presence of rounding errors and (if there are any) other procedural errors, the computed numerical approximation ŷ of ȳ.
Obviously then, "similar causes" are generated by the replacement of a numerical method or a compiler by different ones. Often then, the "effects" are quite different. As an illustration of this situation, computed difference approximations are considered which belong to true solutions y* in the strange attractor of the Lorenz equations (see Section 5.4). Concerning this attractor, C. Sparrow observes in the Introduction of his monograph on this problem in the theory of Dynamical Chaos [43]: "The general form...does not depend at all on our choice of initial conditions...or on our choice of integration routine...The details...depend crucially on both the factors.... As a consequence of this, it is not possible to predict the details of how the trajectories will develop over anything other than a very short time interval." In Chaos Theory, authors generally do not distinguish between true solutions y* of DEs and their computed difference approximations of type (D). This distinction
• is the foundation of the present chapter and the next one and
• it respects C.A. Truesdell's [53] remark: "Approximations make no sense except in terms of a prior sense of exactness".
In this chapter and the next one [4], the discussions of practical failures of difference methods are predominantly concerned with discretizations of ODEs. These discussions are immediately applicable with respect to discretizations of evolution problems with partial differential equations (PDEs).
In fact, for practical purposes these problems are usually approximated by auxiliary systems of ODEs, making use of either (see Subsection 9.4)
• a semidiscretization by means of a longitudinal method of lines (e.g. [54]) or
• a method of finite elements with respect to the spatial independent variables
(e.g. [7], [58]) or
• a spectral method (e.g. [8]).
The major purpose of this chapter and the next one [4] is a comparison of the computer-based performances of
(a) Enclosure Methods and
(b) corresponding traditional numerical methods not possessing a total error control.
Difference methods are examples for case (b) since the local discretization errors cannot be controlled on the level of the discretization. If the consideration is restricted to this level, the practical execution of a difference method involves local rounding errors and, perhaps additionally, local procedural errors.
2. ON PATHOLOGICAL (ERRATIC) DIFFERENCE SOLUTIONS
In the applied literature there is a rapidly increasing number of papers casting doubt on the reliability of computed difference approximations and, therefore, on the employed difference methods. This distrust is caused by any one of the following "operational reasons" (a) or (b) and/or "structural reasons" (c) or (d):
(a) a significant deviation of ŷ, a computed difference approximation, from a true DE-solution, y*, has been safely determined since y* is either explicitly known or has been enclosed in computed bounds, y̲ and ȳ;
(b) a significant distance of a computed difference approximation ŷ from its starting point, y(t0), has been determined by means of at first going from y(t0) to a computed point ŷ(t1), followed by the "time-reversal" of replacing t by −t in the discretization for t > t1;
(c) "spurious" (or "extraneous" or "ghost") difference solutions ȳsp have been observed, which are true difference solutions that do not approximate (neither quantitatively nor qualitatively) any true DE-solution y*, see Sections 3 and 8;
(d) "diverting" difference approximations ŷ have been observed which, for t ∈ [t0,t1], approximate one true DE-solution, y1*, and, for t ∈ (t1,t2], a different true DE-solution, y2*, such that the distance of y1* and y2* becomes sufficiently large for at least a subinterval of (t1,t2], see Sections 3 and 9.
Concerning (a): The enclosures addressed in (a) possess the following properties:
• they consist componentwise of pairs of computed upper bounds, ȳ, and lower bounds, y̲; the bounds are represented by suitable spline-like functions;
• the algorithmic determination of ȳ and y̲ is totally error-controlled since all numerical errors are "unidirectionally" accounted for, i.e., such that ȳ increases and y̲ decreases;
• the distance ||ȳ − y̲||∞ is negligible for all practical purposes; typically, ||ȳ − y̲||∞ ≈ 10^(−q)·||ȳ||∞ with q between 4 and 15, provided ||ȳ||∞ exceeds a certain (very small) bound;
• because of the employment of the Brouwer and the Banach fixed point theorems, the computability of a pair of bounds, ȳ and y̲, implies
automatically the existence of the (unknown) true DE-solution y* inside the enclosure.
Concerning (b): Whereas the comparison addressed in (a) is strictly valid, the one in (b) is heuristic. Time-reversal cannot recognize a spurious difference solution, and it may fail to recognize a diverting difference approximation.
Concerning (c): The spurious nature of a difference solution ȳsp can be recognized in some cases, e.g.,
• if ȳsp is independent of t without being a root of the function f in the ODEs y' = f(y) or
• if ȳsp is kh-periodic with any fixed k ∈ IN, since the period of ȳsp then depends on the step size h of the difference method;
see Section 8.1 for additional heuristic tests. Generally, the spurious nature cannot be recognized in the case of an aperiodic difference solution. A spurious difference solution, ȳsp, may exert a catastrophic influence on the computational approximation of true DE-solutions y*, as has been discussed in [25], [50], [56], and [57]; see Section 8. Remark: In the mathematical literature, there is no formal definition of the property "spurious". In Computational Fluid Dynamics, "spurious pressure modes" have been observed [8].
Concerning (d): The large distance of y1* and y2* for t ∈ (t1,t2) may be caused by the interrelated influences of stable and unstable manifolds belonging to a true DE-solution which is either a stationary point or periodic, see Section 3.
In the literature, difference approximations failing because of any one of the reasons (a)-(d) are frequently said to represent "Computational Chaos" or "Computational Garbage", e.g., [22] or [34]. If such cases of failure were exceptional, it might be justified to denote them as "pathological". In fact, this implicit rejection was used by the "applied community" as the first cases of spurious difference solutions became known, e.g., in the papers [15], [MI], and [45], see Section 8.3. The present paper and its continuation [4] are mainly concerned with sets of
true DE-solutions y* or sets of difference approximations ŷ possessing high sensitivities with respect to perturbations of all kinds such that either
• a meaningful computational assessment is not possible by means of traditional numerical methods and computer systems or
• an assessment of this kind requires the employment of a sufficiently sophisticated analysis.
In a restricted sense, problems of this kind may be called "(computationally) chaotic". The present chapter and its continuation [4] are not concerned with the "chaotic" aspects of "Computational Chaos" or "Dynamical Chaos". Concerning "Chaos", the following topics are of particular interest:
(A) the well-established mathematical theory of (parameter-dependent) sequences of bifurcations of nonlinear finite-dimensional recursions, leading to a point of accumulation; a point of this kind then is said to be the "onset of chaos", see e.g. [a] and [35] concerning the Logistic (difference) Equation, which is addressed in Sections 4 and 8.2;
(B) the fractal structure of the boundaries of domains of attraction of certain solutions of the problems referred to in (A), see e.g. [56];
(C) statistical methods as applied to sets of solutions of problems addressed in (A) or (B), see e.g. [55];
(D) the existence of a horseshoe map (e.g. [16]) in a set of true ODE-solutions, which is an indication of the presence of structures of "chaos" in this set, see e.g. [17], [46]; in the mathematical literature, there is no formal definition of "chaos" for sets of true ODE-solutions;
(E) for nonlinear evolution-type DEs, a computational determination of chaotic properties of a set of true solutions is not possible since this task involves a semi-infinite interval of time.
True (or approximated) difference solutions of discretizations of nonlinear DEs belong to the class of finite-dimensional problems addressed in (A)-(C).
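The bifurcation sequence of topic (A) can be reproduced in a few lines. The sketch below is illustrative only (standard material on the logistic recursion x_{k+1} = r·x_k(1 − x_k), not code from this volume); it detects the period of the attracting cycle after a transient and thereby exhibits the period-doubling sequence 1, 2, 4, ... that precedes the point of accumulation.

```python
# Illustrative sketch (not from the chapter): period of the attracting cycle
# of the parameter-dependent logistic recursion x_{k+1} = r*x_k*(1 - x_k).
def attracting_period(r, x0=0.5, transient=2000, max_period=64, tol=1e-6):
    x = x0
    for _ in range(transient):                 # discard the transient
        x = r * x * (1.0 - x)
    orbit = [x]
    for _ in range(2 * max_period):
        orbit.append(r * orbit[-1] * (1.0 - orbit[-1]))
    for p in range(1, max_period + 1):         # smallest p with x_{k+p} = x_k
        if all(abs(orbit[k + p] - orbit[k]) < tol for k in range(max_period)):
            return p
    return None                                # no short period: "chaotic"
```

For r = 2.8 the function reports period 1, for r = 3.2 period 2, and for r = 3.5 period 4; beyond the accumulation point (r near 3.57) no short period is found. Note that these are properties of the finite-dimensional recursion itself, in accordance with the distinction drawn in the text.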
Consequently, sets of difference solutions (or approximations) of these kinds may possess genuine and well-established structures of "chaos". Obviously, this does not imply mathematically that the sets of true solutions y* of the underlying DEs also possess structures of "chaos". Concerning this property for solutions of ODEs, see [17], [46]. Remark: Because of its dependency on the artificial parameters, a set of difference
solutions has a richer topographical structure than the corresponding set of true solutions y*, e.g. [50].
3. ON SIMPLE (DETERMINISTIC) ODEs WITH TOPOGRAPHICALLY COMPLICATED SETS OF SOLUTIONS
The present section is concerned with initial value problems (IVPs) employing systems of nonlinear autonomous, explicit ODEs
(3.1) y' = f(y) for t ∈ IR; y: IR → IR^n; f: D → IR^n; D ⊂ IR^n; f is a sufficiently
smooth composition of rational and standard functions; y(0) ∈ D.
Nonautonomous ODEs y' = g(t,y), with y: IR → IR^(n-1) and y' = dy/dt, can be represented equivalently by means of (3.1), making use of yn := t. Since only real-valued solutions of (3.1) are of interest, they can be represented in a Euclidean space IR^n, serving as a phase space. The true solutions of (3.1) are denoted by y* = y*(t,y(0)). The ODEs (3.1) are said to be "simple" if, e.g.,
• the components f1,...,fn of f are represented by sums of terms which are linear or polynomials of low degree and
• the values of the coefficients are not much different.
An example of a system of "simple" ODEs is the system (5.8) of the Lorenz equations, which is the classical paradigm in "Chaos Theory". Now, a practically suitable, consistent, and stable discretization of (3.1) is chosen, together with its artificial parameters. This difference method is to be used for the determination of an approximation, ŷ, of a true solution, y*, with y(0) fixed and t confined to an interval [0,tm]. The reliability of the approximation depends
on the complexity of the topographical structures of the set of solutions in the phase space. Since these structures are considered in view of the computational determination of ŷ, they are not identical with the (invariant) topological structures of the set of solutions. Inverse-monotone IVPs (3.1) (e.g., [54]) possess simple topographical structures. In fact, by use of the natural partial ordering of the phase space
(e.g., [37]), there holds
(3.2) y^(1)(0) < y^(2)(0) ⟹ y*(t,y^(1)(0)) < y*(t,y^(2)(0))
for all t ∈ IR such that both solutions exist. As functions of t, the individual components of the true solutions y* then are "layered" in dependency on the "layering" of the initial vectors. If n = 1, the IVP (3.1) is generally inverse-monotone. A "layering" of the set of solutions is also a
property of inverse-monotone BVPs or inverse-monotone problems with PDEs. Because of the simplicity of the topographical structures of the set of solutions, there generally is no topographically induced reliability problem concerning applications of difference methods.
As compared with the sets of true DE-solutions y* of inverse-monotone problems, the sets to be discussed now are qualitatively different, particularly concerning their topographical complexities. A true ODE-solution y* and a difference approximation ŷ are considered which, at t = 0, start at the same point y(0) in phase space. At least for a certain interval of time, ŷ is not only an approximation of y* but also of true DE-solutions starting close to y*. At some time te > 0, the tangent directions of y* and of some of these neighboring solutions may begin to be strongly different. This then may be followed by a considerable growth of the distance of y* and some of the neighboring solutions just addressed. In a situation such as this one, for an interval of time beginning at te, ŷ may remain close to one of these neighboring solutions rather than close to y*. In [5], this performance of a difference approximation has been denoted by the expression "diversion" of ŷ, provided the distance of y* and ŷ becomes large for some time t > te. The cause of this phenomenon is present at any time t ∈ (0,te); however, the effect of ŷ moving away from y* takes place for t > te. Remark: For f = f(y) and locally at y, the Lyapunov exponents may be used as a measure of the rate of expansion of the local tangent directions of neighboring true ODE-solutions, e.g., [16, p.283].
A pole of the ODEs (3.1) is considered together with the set of all solutions of (3.1) which do not enter this pole. In a small neighborhood of the pole, this set possesses a complicated topographical structure since, generally,
• there are large values of the local curvature of orbits coming close to the pole and
• the local tangent directions of neighboring solutions are strongly different.
This is one of the previously outlined situations favoring a diversion of a difference approximation ŷ; see Subsection 9.2 for an example. The presence and the location of a pole of the ODEs (3.1) can be determined in an a priori fashion by means of an inspection of (3.1) or the calculation of a root. This kind of determination is also possible concerning the real stationary points of (3.1), which satisfy f(y) = 0. Since f in (3.1) is sufficiently smooth, locally attractive stationary points are the only ones which, as t → ∞, may belong to more than one true solution of (3.1).
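Returning briefly to the layering property (3.2): it can be observed directly in a computation. The following is a minimal sketch (assumption: the logistic right-hand side f(y) = y(1 − y), which also serves as the test equation (4.8) of Section 4, is taken as the inverse-monotone scalar example); since the forward-Euler map F(y) = y + h·y(1 − y) is increasing on [0,1] for h < 1, the discretization inherits the layering of the initial values, so no topographically induced reliability problem arises here.

```python
# Sketch (not from the chapter): for the scalar, inverse-monotone IVP
# y' = y*(1 - y), ordered initial values yield ordered true solutions (3.2).
# With a small step, forward Euler preserves this "layering" as well, since
# the Euler map F(y) = y + h*y*(1 - y) is increasing on [0, 1] for h < 1.
def euler_orbit(f, y0, h, n):
    ys = [y0]
    for _ in range(n):
        ys.append(ys[-1] + h * f(ys[-1]))
    return ys

f = lambda y: y * (1.0 - y)
low  = euler_orbit(f, 0.2, 0.01, 1000)   # y(0) = 0.2
high = euler_orbit(f, 0.3, 0.01, 1000)   # y(0) = 0.3
layered = all(a < b for a, b in zip(low, high))  # layering preserved stepwise
```

Both computed orbits approach the attractive stationary point y = 1 without ever exchanging their order.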
In addition to the influence of a pole or a stationary point, the topographical structures of a phase space are also affected by the existence and (if present) the location of
• attractors of any kind in the set of solutions or
• stable invariant manifolds, Ms, and unstable invariant manifolds, Mu, both belonging to a stationary point or a periodic solution (allowing a local linearization of the ODEs), see e.g. [16] and [7, p.594].
Subsequently, the shorter expressions stable (or unstable) manifold will be used. For n = 2 in (3.1), there are only two kinds of attractors,
• stationary points, with dimension zero, and
• limit cycles, with dimension one.
For n > 2, there may be additionally
• stationary points or periodic solutions, which are locally attractive tangent to Ms and locally repellent tangent to Mu, and
• strange attractors (e.g., [16, p.256]) which
  • cover a bounded subset in the phase space of solutions and
  • consist of a non-denumerable set of true ODE-solutions which remain in this set for all times after entering it,
such that the dimension of this attractor is not an integer number.
Remarks: 1) An attractor is said to be strange if it contains a transversal homoclinic orbit [16, p.256]. This is a property of the horseshoe map [16, p.319] referred to at the end of Section 2; see [17] and [46] for orbits of this kind. 2) An orbit going from one stationary point (or one periodic solution) to another is said to be heteroclinic. If the orbit returns to the same stationary point (or periodic solution), it is said to be homoclinic. In the phase plane of the pendulum equation φ'' + c·sin φ = 0, there are heteroclinic orbits starting on the manifold Mu of one saddle point and going over to the manifold Ms of another saddle point [20]; see Example 5.22 for this ODE. 3) A strange attractor is topographically characterized by the predominance of locally divergent tangent directions of the orbits.
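The practical consequence of these locally divergent tangent directions can be seen with any standard integrator. The sketch below is illustrative only (assumptions: the classical parameter values σ = 10, r = 28, b = 8/3 for the Lorenz system of Section 5.4, a fixed-step fourth-order Runge-Kutta method, and an initial perturbation of 10^(−8)); the two computed orbits agree closely at first and later separate to the diameter of the attractor, in accordance with Sparrow's observation quoted in Section 1.

```python
# Illustrative sketch (assumed classical parameters, not the chapter's code):
# sensitivity of computed Lorenz orbits to a 1e-8 perturbation of y(0).
def lorenz(v, sigma=10.0, r=28.0, b=8.0 / 3.0):
    x, y, z = v
    return (sigma * (y - x), r * x - y - x * z, x * y - b * z)

def axpy(v, k, c):                       # v + c*k, componentwise
    return tuple(vi + c * ki for vi, ki in zip(v, k))

def rk4_step(v, h):                      # one classical Runge-Kutta step
    s1 = lorenz(v)
    s2 = lorenz(axpy(v, s1, h / 2))
    s3 = lorenz(axpy(v, s2, h / 2))
    s4 = lorenz(axpy(v, s3, h))
    return tuple(vi + h / 6 * (a + 2 * p + 2 * q + d)
                 for vi, a, p, q, d in zip(v, s1, s2, s3, s4))

def dist(u, w):
    return max(abs(x - y) for x, y in zip(u, w))

u, w = (1.0, 1.0, 1.0), (1.0 + 1e-8, 1.0, 1.0)
h, early, maxsep = 0.01, 0.0, 0.0
for step in range(1, 3001):              # integrate to t = 30
    u, w = rk4_step(u, h), rk4_step(w, h)
    if step == 200:                      # at t = 2 the orbits still agree
        early = dist(u, w)
    if step > 2000:                      # t in (20, 30]: macroscopic distance
        maxsep = max(maxsep, dist(u, w))
```

Neither computed orbit is thereby shown to be wrong pointwise; the computation merely demonstrates that, inside a strange attractor, no conclusion about an individual true solution can be drawn from a single difference approximation.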
Inside a strange attractor, manifolds Ms and Mu (if there are any) possess complicated topographical shapes. Manifolds Ms or Mu, if they exist, either
• start and terminate at stationary points, yi*, or at periodic solutions, y*per, or
• Mu does not terminate as t increases to infinity and Ms has this property as t decreases to minus infinity.
Since Ms and Mu consist of true ODE-solutions, for finite t they cannot merge with another true solution. If Ms or Mu are (n−1)-dimensional hypersurfaces in the phase space IR^n, they separate adjacent sets of true ODE-solutions. For any true ODE-solution y* starting outside of Mu, now a difference approximation ŷ is considered in the case that Mu is an (n−1)-dimensional hypersurface. Because of numerical errors, ŷ may then penetrate Mu. Generally, this is practically irrelevant since all true ODE-solutions sufficiently close to Mu will subsequently remain in a small neighborhood of Mu, at least for a certain interval of time. Now, any true ODE-solution starting outside of Ms is considered in the case that Ms is an (n−1)-dimensional hypersurface. A difference approximation ŷ then may (i) penetrate Ms. Subsequently, ŷ will approach the attractor belonging to Ms, however, (ii) on the "wrong side" of Ms, as compared with y*; then, close to the attractor, ŷ will be deflected to follow Mu, however, (iii) in an incorrect direction on Mu with respect to Ms and y*. Here, the penetration (i) as the "cause" may take place considerably earlier than the "effect", the deflection of ŷ to follow Mu in a "wrong direction".
Concerning sets of true ODE-solutions, there is generally an increase of the complexities as the dimension n of the system (3.1) of ODEs increases, perhaps in the context of a mathematical simulation. Then, there is a correspondingly increased unreliability of computed difference approximations. Therefore, when a large number of degrees of freedom of a real world problem is to be simulated by means of a nonlinear mathematical model (3.1), this model is frequently of no practical significance. Remarks: 1) Generally, the complexities of the sets of solutions increase strongly as nonlinear ODEs are replaced by nonlinear PDEs. 2) Subsection 9.4 is devoted to an introductory discussion of diverting difference approximations in the case of evolution-type PDEs.
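A penetration of an invariant set by a computed approximation can be made visible with the pendulum equation of the preceding remarks. The sketch below is not the chapter's example; it assumes the undamped pendulum φ'' + sin φ = 0 (i.e., c = 1). The energy E = v²/2 − cos φ is constant along true solutions, and the level set E = 1 consists of the saddle points and their manifolds Ms and Mu (the separatrix). Explicit Euler systematically injects energy, so a computed approximation started inside the oscillatory region E < 1 eventually crosses this invariant curve, which no true solution can do in finite time.

```python
import math

# Sketch (standard observation, not from the chapter): explicit Euler applied
# to the pendulum phi'' + sin(phi) = 0.  E = v**2/2 - cos(phi) is invariant
# along true solutions; E = 1 is the separatrix formed by the saddle manifolds.
def euler_energies(phi, v, h, n):
    energies = [0.5 * v * v - math.cos(phi)]
    for _ in range(n):
        phi, v = phi + h * v, v - h * math.sin(phi)   # explicit Euler step
        energies.append(0.5 * v * v - math.cos(phi))
    return energies

E = euler_energies(2.5, 0.0, 0.05, 4000)   # start inside the separatrix
crossed = E[0] < 1.0 and max(E) > 1.0      # computed orbit penetrates E = 1
```

After the crossing, the computed pendulum rotates instead of oscillating, i.e., the approximation follows a qualitatively wrong class of solutions, exactly the cause-and-effect pattern described above.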
Up to this point in the present section, the discussion was concentrated on topographical structures of sets of true ODE-solutions and their difference approximations. For the remainder of this section, individual true ODE-solutions will be considered in view of their complexities. In the subsequent discussion of this problem, only a few cases are chosen:
(a) when the period T of a true T-periodic ODE-solution y*per is large, it is correspondingly difficult by means of traditional numerical methods to distinguish
between this true solution and a neighboring true aperiodic solution;
(b) now y' = f(y) in (3.1) is replaced by y' = g(t,y) where the dependency of g(t,y) on t is governed by input functions h1 and h2 such that h1 is T1-periodic and h2 is T2-periodic; it is assumed that T1/T2 = p/q with numbers p,q ∈ IN; then y' = g(t,y) may possess a T-periodic solution if T := qT1 = pT2; very small changes of T1 or T2 may render p and q large such that T becomes correspondingly large;
(c) provided (3.1) is nonlinear and possesses a T-periodic forcing term, the true solutions of (3.1) may consist of nT-periodic responses [26, p.179] which, for n ∈ IN, are said to be subharmonics;
(d) a linear ODE may be on the boundary of stability and, therefore, highly sensitive with respect to perturbations; an example is the dependency of the solutions of y'' + ky' + y = 0 on the choice of k = 0 or k = +c or k = −c with c ∈ IR+ arbitrarily small.
4. QUALITATIVE THEORY OF DIFFERENCE METHODS AND
SHADOWING THEOREMS
Without loss of generality, the discussions in the present section are mainly confined to scalar IVPs
(4.1) y' = f(t,y) for t ∈ [t0,tm]; y: IR → IR; f: IR×D → IR; D ⊂ IR; y(t0) = y0 ∈ D; f is sufficiently smooth,
and applications of explicit one-step methods ([47], [48]):
(4.2) y_{j+1} = F(y_j) := y_j + h·Φ(t_j,y_j,h,f); y_j := y(t_j), j = 0(1)n−1; h := (tm − t0)/n, n ∈ IN; y_0 = y0.
A family of true difference solutions of (4.2) will be denoted by
(4.3) y(h) := {y_j | j = 0(1)n} for h = h(n) = (tm − t0)/n with n ∈ IN.
The symbol ŷ(h) will be used in the case that there are numerical errors in the evaluation of F(ŷ_j) in (4.2). Provided the true solution y* of (4.1) exists, the local discretization error of (4.2) is defined by
(4.4) τ(t,y*,h) := (y*(t+h) − y*(t))/h − Φ(t,y*(t),h,f) for h > 0.
The asymptotic theory of difference methods rests on the following conditions:
(4.5) (i) f and Φ are sufficiently smooth; in particular, there exists a Lipschitz constant M0 ∈ IR+ of Φ with respect to its dependency on y_j;
(ii) (4.2) is consistent with order p ∈ IN; then there is an N0 ∈ IR+ such that |τ(t_j,y*(t_j),h)| ≤ N0·h^p for any fixed t_j ∈ [t0,tm] and all y0 ∈ D;
(iii) the solution y* of (4.1) to be approximated exists for t ∈ [t0,tm] such that y*(t) ∈ D.
The global discretization error is defined as follows:
(4.6) e(t,h) := y(t) − y*(t) for any fixed t ∈ [t0,tm].
Provided (i)-(iii) are satisfied and h is sufficiently small, then there holds [48]
(4.7) |e(t,h)| ≤ h^p·(N0/M0)·(e^{M0(t−t0)} − 1) for t ∈ (t0,tm].
This implies the (pointwise) convergence to zero of the h-dependent sequence (4.6) as h → 0. Generally, the estimate (4.7) is not practically useful for any fixed and finite h. Property (4.5)(iii) makes use of y*. Therefore, (4.7) is meaningless in the case that there is
• a family of spurious difference solutions, ȳsp(h) (which does not approach a true solution y* for h → 0, see Sections 2 and 8), or
• for a fixed h > 0, a diverting difference approximation, ŷ(h) (which is close to different true solutions for different intervals of time, see Section 3).
Remarks: 1) In the case that y' = f(y) represents a system of ODEs, discretizations such as (4.2) are applied individually to the components y'_i = f_i(y) of the system. The estimate (4.7) then is essentially still valid. 2) In the case that (4.2) is replaced by a multi-step discretization, an estimate such as (4.7) requires additionally the property of stability for the discretization. This is automatically satisfied for one-step discretizations. 3) The property of stability refers to a linearized stability analysis of the discretization with evaluation "at the point" of the true solution y* of the ODE. This linear analysis supplies a partial information concerning the true nonlinear stability behavior of the discretization. Only a global stability analysis provides a complete knowledge of the bifurcation points of the discretization, which depend on h. This bifurcation analysis is essential in the contemporary analysis of spurious
difference solutions ȳsp(h); see Section 8.
Frequently, Shadowing Theorems are believed to assert the practical reliability of discretizations of ODEs. A theorem of this kind states for all j = 0(1)N that a computed difference approximation ŷ(h) is within an ε-neighborhood of an unknown true difference solution ȳ(h), provided that the numerical errors of ŷ(h) are smaller than a number δ. The initial values of ŷ and ȳ are not necessarily identical. There holds ε = ε(δ,N) where N represents the (finite) number of time steps, each of length h. If there is an ε = ε(δ,j) for j = 1(1)N, ε(δ,j) can be evaluated quantitatively provided that δ is confined to the level of the discretization, i.e., δ then accounts for local rounding errors or errors in the evaluation of the function Φ in (4.2). The local discretization errors, however, are not quantitatively accessible on this level. The Shadowing Theorem to be presented will subsequently be applied to the (scalar) Logistic ODE
(4.8) y' = f(y) := y(1 − y) for y ∈ D := [0,1] with y0 = y(t0) ∈ D for t0 = 0.
The true solutions y* of (4.8) can be represented explicitly [23]:
(4.9) y*(t) = y0 / (y0 + (1 − y0)·e^{−t}).
The stationary points y = 1 (or y = 0) of (4.8) are stable (or unstable). In the applied literature, discretizations of (4.8) are usually the starting point for discussions of Dynamical Chaos. In fact, this is Computational Chaos since all difference solutions y(h) are spurious unless they resemble y* at least qualitatively. For a scalar ODE (4.1) with f ∈ C²(D), an explicit one-step method (4.2) yields
• a true orbit (difference solution) ȳ(h) satisfying
(4.10) ȳ_{j+1} = F(ȳ_j) := ȳ_j + h·Φ(t_j,ȳ_j,h,f) for j = 0(1)N−1; F: [0,1] → [0,1], F ∈ C²[0,1],
and, additionally,
• a "pseudo-orbit" (difference approximation) ŷ(h) satisfying
(4.11) |ŷ_{j+1} − F(ŷ_j)| < δ_F for j = 0(1)N−1 with a δ_F ∈ IR+,
where δ_F accounts for local numerical errors in the computational determination of ŷ(h). For (4.8), an application of the explicit one-step Euler discretization yields the following special case of (4.10):
(4.12) ȳ_{j+1} = F(ȳ_j) := ȳ_j + h·ȳ_j(1 − ȳ_j), with F'(ȳ_j) = 1 + h − 2h·ȳ_j.
Concerning (4.10), constants σ, T, and M are defined as follows [10], making
use of DF(y) := F'(y), and
(4.15) M := sup{ |D²F(y)| : y ∈ [0,1] }, with D²F(y) = d²F(y)/dy².
The following version of the Shadowing Theorem is stated and proved by Chow & Palmer [10].
Theorem 4.16: For F as specified in (4.10), it is assumed that M and δ_F are sufficiently small such that
(4.17) 2MσT ≤ 1.
Then there exists an orbit ȳ(h) such that
(4.18) (1 + √(1 + 2MσT))^{−1}·T ≤ sup_{j=0(1)N} |ŷ_j − ȳ_j| ≤ 2·(1 + √(1 − 2MσT))^{−1}·T =: δ_ŷ.
Concerning (4.12), ŷ_j for j = 0(1)N is now equated to the (locally attractive) stationary point y = 1 of the ODE in (4.8). Then, for h ≠ 1,
σ ≤ (N+1)/(1 − h)^{N+1}, T ≤ (N+1)·δ_F/(1 − h)^{N+1}, and M = 2h, so that
(4.17a) 2MσT ≤ 4(N+1)²·h·δ_F/(1 − h)^{2(N+1)} ≤ 1.
This condition is satisfied if δ_F = δ_F(h,N) is sufficiently small. An estimate of this kind is not meaningful in a neighborhood of the (locally repellent) stationary point y = 0 of (4.8). Remark: In an example in [10], |ŷ_j − ȳ_j| is evaluated quantitatively.
The stationary point y = 1 of the discretization (4.12) is locally attractive only for step sizes h below h_sp := 2. For h > h_sp, the existence of spurious difference solutions ȳsp(h) has been shown by May [35], see Section 8.2. In particular, then there are oscillating and even 2^k·h-periodic difference solutions ȳsp(h) for all k ∈ IN. The true solutions y* of (4.8) are non-oscillating. In analogy to (4.17a), the condition (4.17) can be satisfied for h > h_sp provided δ_F = δ_F(h,N) is sufficiently small. The Shadowing Theorem 4.16 then asserts the existence of a true difference solution ȳ(h) that is shadowed by a computed difference approximation ŷ(h). Since ŷ(h) and ȳ(h) are spurious, they are quantitatively and qualitatively different from all true solutions y* with
representation in (4.9). This relates to the observation made before that δ_F does not account for the local discretization errors. Generally, therefore, Theorem 4.16 is irrelevant with respect to the true solutions y* of ODEs underlying a discretization (4.10). For systems
(4.19) y' = f(y) with y: IR → IR², f: D → IR², D ⊂ IR², f ∈ C²(D),
Chow & Palmer [11] have proved a Shadowing Theorem concerning explicit one-step discretizations. Additionally in [11], they have presented a quantitative application with respect to the Hénon (difference) equations whose difference solutions ȳ(h) possess chaotic properties, e.g. [7], [16]. The results by Chow & Palmer [11] confirm a conjecture in the classical paper on shadowing by Hammel et al. ([19], see also [18]): "Conjecture: For a typical dissipative map F: IR² → IR² with a positive Lyapunov exponent and a small noise amplitude δ_F > 0, we expect to find a true orbit within a distance √δ_F of a (pseudo-) orbit of length N ≈ 1/√δ_F." Therefore, the shadowing property may be true for a surprisingly large number of steps in the recursion yielding the sequence {ŷ_j}. Remark: In the literature (e.g. [18], [19]), shadowing has been considered for interval mappings Y_{k+1} = F(Y_k) with Y_{k+1}, Y_k ⊂ IR^q intervals with q = 1 or 2. In this context, the validity of shadowing has been related to the property
(4.20) Y_0 := ∩_{k=0}^{N} F^{−k}(Y_k) ≠ ∅,
with ∅ denoting the empty set [19, p.467]. For the case of q = 1 and a rigorous application of interval methods, see [41].
Now, the actual computational determination of a pseudo-orbit inside a strange attractor is considered. Numerical results can be described by the reference to Sparrow's monograph [43], which is quoted in Section 1. For a further characterization of this set, a fixed choice of y0 inside a strange attractor is considered. According to [52], the employments of a Cray 1 and a Cyber 205 have yielded grossly different results for j < N with N as small as 50. Remarks: 1) Concerning the influence of the local discretization errors in the multi-dimensional case, the preceding discussions stress the fact that the validity of the shadowing property for a pair ȳ(h) and ŷ(h) is irrelevant with respect to the true solutions y* of the underlying ODEs.
E. Adams
(2) The shadowing property is valid for a finite number N of steps t_{j+1} − t_j = h. Therefore, it is not possible to investigate a sequence ỹ(h) of difference solutions as h = (t_N − t_0)/N → 0.

5. ON THE COMPUTATIONAL ASSESSMENT OF SOLUTIONS OF NONLINEAR ODES WITH BOUNDARY CONDITIONS (OF PERIODICITY)

5.1 SURVEY OF THE PROBLEM
In view of the title of this paper, it is observed that
(A) the search for periodic solutions y*_per of systems (3.1) leads to boundary value problems (BVPs) with (two-point) boundary conditions of periodicity:
(5.1) y' = f(y) for t ∈ R; f: D → R^n, D ⊂ R^n is open and convex, f is a sufficiently smooth composition of rational and standard functions; y(T) − y(0) = 0; the period T is either prescribed or to be determined;
(B) the subsequent discussions exhibit the serious potential unreliability of approximations of true solutions y*_per, which are determined by means of discretizations;
(C) this therefore stresses the theoretical and practical importance of applications of Enclosure Methods.
Some of the examples to be presented concerning (B) and (C) employ two-point boundary conditions other than conditions of periodicity:
(5.2) y' = f(y) for t ∈ [a,b]; f: D → R^n, D ⊂ R^n is open and convex; g(y(a), y(b)) = 0, a,b ∈ R, b ≠ a; f and g are sufficiently smooth compositions of rational and standard functions.
A true solution of the BVP (5.2) is denoted by y*. Without a loss of generality, the independent variable t can be chosen such that a = 0. Generally in the case of BVPs, the existence of a (classical) solution y* must be verified. Unless a (totally error-controlled) Enclosure Method is employed, a single local numerical error may cause the computed approximation ỹ either
(a) deceptively to satisfy all equations of the discretization even though a true solution y* does not exist, or
(b) to fail to satisfy all these equations since the automatic "guidance and control" is absent which is built into Enclosure Methods.
Examples for case (a) or case (b) are presented in Sections 5.5, 7, and 9. Spurious difference solutions y_sp provide a different class of examples concerning case (a),
see Subsection 8.3. Because of (a) and (b), a reliable computational approximation of periodic solutions y*_per is difficult; this is enhanced by the fact that their location in phase space is unknown; frequently, there is more than one solution y*_per; depending on their topographical (or dynamical) properties, perhaps only one of these solutions is of interest; a global search of the phase space may be necessary; in the present paper, this problem is reduced to a search for only one individual true solution y*_per; provided T is relatively large, it is generally difficult to distinguish between a T-periodic true solution y*_per and an aperiodic true solution y*; see Section 3. For some classes of BVPs, existence theorems for solutions y* are proved in the literature, e.g., [9]. Generally, then
• additional qualitative and/or quantitative mathematical work is to be executed in order to verify that all conditions of a theorem of this kind are satisfied;
• only in exceptional cases can it be verified by inspection whether this is so.
Numerous theorems in the literature assert qualitatively the existence of a solution y* = y*(c) of a perturbed BVP provided |c| is sufficiently small and the unperturbed BVP is known to possess a solution y* = y*(0). Theorems of this kind are, e.g.,
• concerned with nonlinear autonomous or T-periodic ODEs, [9, p.137, p.139], or
• the KAM-theory ([7], [16], [20]) concerning periodic solutions of nonlinear perturbed autonomous ODEs.
As compared with applications of existence theorems, an employment of Enclosure Methods is advantageous because of their merged properties
• of a verification of the existence and
• a totally reliable quantitative assessment.
The theoretical and practical importance of totally error-controlled methods is exhibited in Section 7 through discussions of systems of linear ODEs y' = A(t)y with a T-periodic matrix function A and boundary conditions of T-periodicity, y(T) = y(0).
There then arises the matrix I − Φ*(T), where I is the identity matrix, Φ* represents the fundamental matrix satisfying Φ' = A(t)Φ and Φ(0) = I, and Φ*(T) is the monodromy matrix. The matrix I − Φ*(T) must be [9]
• non-invertible if the homogeneous ODEs are to have T-periodic solutions or
• invertible provided the ODEs possess an additional forcing term and a unique T-periodic solution is to be determined.
Unless Φ* = Φ*(t) for t ∈ [0,T] is calculated with total error-control, a reliable distinction between the presence or the absence of the property of invertibility for I − Φ*(T) is not possible. Additional to existence, the following qualitative questions are relevant for a periodic solution y*_per:
(a) whether y*_per is isolated,
(b) whether y*_per is asymptotically stable (with respect to perturbations of the initial data) and, then, locally attractive,
(c) in the case that y*_per is T-periodic and unstable, whether y*_per possesses a stable manifold M_s and an unstable manifold M_u, see Section 3 and [16] or [30].
Properties (b) and (c)
• may be present in the case of dissipative ODEs,
• are absent in the case of conservative ODEs.
Property (c) is present in the case of the periodic solutions of the Lorenz equations which are discussed in Section 5.4. Concerning the success of applications of numerical methods,
• properties (a) and (b) are desirable and
• property (c) is undesirable, as is shown for the Lorenz equations in Subsections 5.4 and 9.3.
Properties (b) and (c) rest on the linear variational system of the BVP with respect to the solution y*_per under discussion. Consequently, a verification of (b) and (c) requires the reliable knowledge of a sufficiently close approximation of y*_per. By means of Enclosure Methods, property (b) can be verified simultaneously with the enclosure of y*_per.
Remark: A more general "structural stability" of a solution y* of a BVP may be defined with respect to selected perturbations of the ODEs; they may represent,
e.g.,
• a simulation of procedural or rounding errors or,
• the influence of celestial bodies not taken into account in the computational determination of an approximation of a (hypothetical) periodic orbit of a planet, or
• the rotational attachment of a gear drive discussed in Section 7, etc.
5.2 ON TRADITIONAL COMPUTATIONAL METHODS FOR THE DETERMINATION OF PERIODIC SOLUTIONS
In physics and engineering, there are numerous methods for the determination of approximations of periodic solutions of nonlinear ODEs; particularly,
• averaging methods (see [9] and [26]) and
• perturbation methods (see [9] and [26]).
Generally, these methods yield continuously differentiable functions such that
(a) their residual is quantitatively accessible, which however is not a useful gauge with respect to the oscillatory solutions under discussion;
(b) there are no meaningful quantitative error estimates comparable, e.g., to the ones for the local discretization error.
Concerning periodic solutions y*_per, discretizations of ODEs are frequently employed, supplemented by a suitable approximation ỹ(0) of the "initial" vector y*_per(0) of the desired solution. A determination of y*_per(0) may be carried out either
(α) by use of a shooting method (e.g. [48]) in conjunction with, e.g., a Newton iteration of ỹ(0) in order to satisfy the boundary condition of periodicity or,
(β) starting with a suitably estimated vector ỹ(0), a marching method is used, where it is hoped that the computed approximation ỹ = ỹ(t, ỹ(0)) will gradually approach a periodic state, see Section 7.6.
For the execution of either (α) or (β), ỹ(0) must be sufficiently close to the true vector y*_per(0) of the desired isolated periodic solution y*_per. Consequently, both (α) and (β) might have to be preceded by a search in phase space for almost closed orbits. This search is particularly difficult and costly in the case that either the period or the dimension of this space are large. Concerning (β), ỹ(0) must be inside the domain (basin) of attraction of y*_per, whose existence and size are not known. A domain of attraction does not exist in the case of conservative ODEs.
Provided the ODEs are dissipative, a desired marching approximation of y*_per is generally slow in the case of meaningful mathematical simulations of engineering problems; in fact, the dissipation of energy then is relatively small. Consequently, in the course of the correspondingly slow approach to y*_per, there may be a considerable accumulation of the influences of numerical errors.
Therefore, the traditional numerical methods outlined in the present subsection suffer from significant reliability problems. This suggests applications of Enclosure Methods, at least on a spot check basis.

5.3 ON THE VERIFIED COMPUTATIONAL ASSESSMENT OF SOLUTIONS OF BVPs OR OF PERIODIC SOLUTIONS
For the purpose of a verifying enclosure of a solution y* of the BVPs (5.1) or (5.2), a simple shooting method will be used. Consequently, an auxiliary IVP is now assigned to either one of these BVPs:
(5.3) y' = f(y) for t ∈ R; f: D → R^n, D ⊂ R^n is open and convex; f is sufficiently smooth; y(0) = s ∈ D.
The solutions of (5.3) are represented by means of an operator φ_f^t:
(5.4) y*(t,s) =: φ_f^t s provided y*(τ,s) ∈ D for τ ∈ [0,t].
The topic of this subsection will now be discussed for the case of the BVP (5.1). A vector (s,T) ∈ R^(n+1) is then to be determined such that
(5.5) y*_per(0,s) = y*_per(T,s).
Consequently, s is located on a T-periodic orbit y*_per. Suitable discretizations of (5.1) and (5.3) are considered which generate approximations ỹ of the true solutions y*_per = y*_per(t) and y* = y*(t,s) of (5.1) and (5.3), respectively; see Subsection 5.4 for examples. Because of (a) and (b) in Subsection 5.1, a totally error-controlled treatment of (5.1) and (5.3) is desirable. This is not possible on the level of a discretization. Concerning BVPs, the following Enclosure Methods are available:
(α) the one by Lohner ([31] or [32]) for general two-point BVPs (5.2) or the special BVP (5.1) of this kind and
(β) the one by Kühn [29] for BVPs (5.1) with the boundary condition of T-periodicity.
Remark: By means of an ad hoc Enclosure Method for the phase plane, the periodic solution of the Oregonator problem has been verified in [1].
Since (β) has not been published in the generally accessible literature, an outline of this method is presented in Subsection 5.7. Concerning the presently available hardware and software supporting the construction of enclosures, applications of (α) or (β) to nonlinear BVPs are confined to a sufficiently small total order n of the system of ODEs. For examples which so far have been completed successfully, there holds n ≤ 4.
Remark: For linear BVPs, examples with n ≤ 22 have been completed successfully, see Subsection 7.5.
The practical confinement of n to values ≤ 4 is mainly due to the fact that (α) and (β) employ the determination of enclosures for the following extended systems of ODEs:
(i) the given nonlinear system y' = f(y) with order n; the enclosure of the solution y* = y*(t,y(0)) of this system serves as the (interval-valued) input for
(ii) n auxiliary systems of linear ODEs, each of order n.
Generally, the correspondingly high computational cost prohibits the employment of (α) or (β) as interval iteration methods; rather, (α) or (β) should be used as verification methods for candidate enclosures (with very small widths)
• that are nearly centered with respect to a highly accurate approximation ỹ of the desired true solutions,
• whose computational determination requires the execution of a perhaps costly preceding search for ỹ in the phase space.
For a system of ODEs, y' = f(y), it is assumed that a periodic solution y*_per and its period T have been enclosed and verified by means of Kühn's enclosure method (β). In view of corresponding applications of discretizations, the stability properties of y*_per are of interest. For an investigation of this problem, it is at first assumed that a T-periodic solution y*_per = y*_per(t) and its period T are (explicitly) known. It is then possible to derive the linear variational system [9, p.10] of y' = f(y) with respect to y = y*_per:
(5.6) η' = A(t)η for t ∈ [0,T] with A(t) := f_y(y*_per(t))
and f_y the Jacobian matrix of f. That fundamental matrix Φ: R → L(R^n) of (5.6) is considered which satisfies Φ(0) = I with I the identity matrix. Then, there holds:
• the monodromy matrix Φ*(T) possesses an eigenvalue λ = 1 [9, p.98];
• therefore, (5.6) possesses a T-periodic solution η*_per = η*_per(t), [9, p.98].
According to [9, p.98-99], there holds:
Theorem 5.7: The system (5.6) is considered. If n−1 of the eigenvalues of Φ*(T) satisfy |λ_j| < 1, then y*_per possesses the property of asymptotic orbital stability.
By means of the Floquet theory [9, p.58], it follows immediately that y*_per is unstable in the case that there is an eigenvalue λ_k of Φ*(T) with the property |λ_k| > 1. Obviously, y*_per and its period T are (explicitly) known only in trivial
cases. By means of Kühn's Enclosure Method (β), (tight) enclosures can be determined, both for T and y*_per(t) for t ∈ [0,T]. Then
• there is an interval matrix [A(t)] in (5.6) for t ∈ [0,T], making use of the enclosures of y*_per(t) and of T ∈ [T], and
• an interval matrix [Φ(t)] has to be determined for t ∈ [0,T].
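In plain floating-point arithmetic — i.e., without the interval matrices [A(t)] and [Φ(t)] just described — the quantities entering Theorem 5.7 can be sketched as follows: integrate Φ' = A(t)Φ, Φ(0) = I, with the classical Runge-Kutta method and inspect the eigenvalues of the approximate monodromy matrix. The constant test matrix A below is a hypothetical illustration, not taken from the examples of this paper.

```python
import numpy as np

def monodromy(A, T, steps=2000):
    """Approximate the fundamental matrix Phi(T) of Phi' = A(t) Phi,
    Phi(0) = I, by the classical Runge-Kutta method (no error control)."""
    n = A(0.0).shape[0]
    Phi = np.eye(n)
    h = T / steps
    t = 0.0
    for _ in range(steps):
        k1 = A(t) @ Phi
        k2 = A(t + h/2) @ (Phi + h/2*k1)
        k3 = A(t + h/2) @ (Phi + h/2*k2)
        k4 = A(t + h) @ (Phi + h*k3)
        Phi = Phi + h/6*(k1 + 2*k2 + 2*k3 + k4)
        t += h
    return Phi

# Hypothetical constant test matrix: here Phi(T) = expm(A*T), and for the
# rotation generator with T = 2*pi the monodromy matrix is the identity,
# so both Floquet multipliers equal 1.
A = lambda t: np.array([[0.0, 1.0], [-1.0, 0.0]])
Phi_T = monodromy(A, 2.0*np.pi)
multipliers = np.linalg.eigvals(Phi_T)
```

Deciding |λ_j| < 1 versus |λ_j| ≥ 1 from such floating-point output is exactly the step that motivates the interval matrix [Φ(T)].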
5.4 ENCLOSURE AND VERIFICATION OF PERIODIC SOLUTIONS OF THE LORENZ EQUATIONS
Kühn [29] has enclosed and verified several periodic solutions of the Lorenz equations
(5.8) y' = f(y) := (σ(y_2 − y_1), r y_1 − y_2 − y_1 y_3, y_1 y_2 − b y_3)^T for t ≥ 0; y := (y_1, y_2, y_3)^T,
where b, r, σ ∈ R^+ are arbitrary but fixed. This system is a "simple" special case of (3.1). Starting from the PDEs of Fluid Mechanics, Lorenz [33] has derived (5.8) by means of a Fourier method, [8]. Concerning (5.8), the monograph [43] by Sparrow is devoted to discussions of true solutions y* and approximations ỹ. The ODEs (5.8) are dissipative [43]. Generally, they are considered to be the paradigm in the theory of Dynamical Chaos. There are three real-valued stationary points [43]:
(i) for all b, r, σ ∈ R^+, the origin (0,0,0)^T and
(ii) for all b, r−1, σ ∈ R^+, the points C_1 and C_2 with
(5.9) the positions (ξ, ξ, r−1)^T ∈ R^3 where ξ = ±√(b(r−1)).
For r > 1, the stationary point (0,0,0)^T is locally a saddle point. Attached to this point, there are a two-dimensional stable manifold M_s and a one-dimensional unstable manifold M_u, see [30], [43] and Section 3. For special choices of b, r, σ ∈ R^+,
(i) all solutions y* = y*(t,b,r,σ,y_0) can be represented in an essentially explicit fashion or
(ii) the set of solutions y* is known to possess simple ("non-chaotic") topographical properties.
For arbitrary b = 2σ ∈ R^+ and r ∈ R^+, W. F. Ames [5] has found the following equivalent representation of (5.8):
(5.10) y_1^2(t) − 2σ y_3(t) = (y_1(0)^2 − 2σ y_3(0)) e^(−2σt).
For arbitrary b = 2σ, r ∈ R^+ and in the limit as t → ∞, all true solutions y* of (5.8) are confined to a fixed hypersurface in the phase space. Since y_3* and y_2* can be represented explicitly as functions of y_1*, the set of solutions y* of (5.8) then cannot possess "chaotic properties". In fact, this set then is topographically simple since (5.8) is autonomous and satisfies a Lipschitz-condition. In particular, this excludes the existence of a strange attractor, see Section 3. The system (5.8) is invariant under the transformation (y_1, y_2, y_3)^T → (−y_1, −y_2, y_3)^T defined by
(5.11) f(Sy) = Sf(y) with S := diag(−1, −1, 1) and S^k = I with k = 2.
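The stationary points (5.9) and the symmetry (5.11) can be checked directly in floating-point arithmetic; the following sketch (not a verified computation) uses Kühn's parameter values b = 8/3, r = 28, σ = 6.

```python
import math

def f(y, b=8.0/3.0, r=28.0, sigma=6.0):
    """Right-hand side of the Lorenz equations (5.8)."""
    y1, y2, y3 = y
    return (sigma*(y2 - y1), r*y1 - y2 - y1*y3, y1*y2 - b*y3)

def S(y):
    """Symmetry transformation S = diag(-1, -1, 1) from (5.11)."""
    return (-y[0], -y[1], y[2])

b, r = 8.0/3.0, 28.0
xi = math.sqrt(b*(r - 1.0))          # (5.9): C_{1,2} = (±xi, ±xi, r-1)
C1 = (xi, xi, r - 1.0)
residual = f(C1)                      # should vanish at a stationary point

y = (1.0, 2.0, 3.0)                   # arbitrary test point for (5.11)
symmetry_defect = tuple(a - b_ for a, b_ in zip(f(S(y)), S(f(y))))
```

Both checks succeed to rounding accuracy; a verified statement would replace the floating-point evaluations by interval evaluations.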
Provided there is a T-periodic solution y*_per of (5.8) with the point y_0 on its orbit, then there is also a T-periodic solution ŷ*_per with the point Sy_0 on its orbit. This solution is defined by
(5.12) ŷ*_per(t) := S y*_per(t) for t ∈ [0,T].
Generally, the orbits of y*_per and ŷ*_per are different. It is assumed that there are a point y_0 and a time τ > 0 such that
(5.13) φ_f^τ y_0 = S y_0.
For systems y' = f(y) possessing the properties (5.11) and (5.13), with a k − 1 ∈ ℕ, Kühn's Theorem 5.55 (see Subsection 5.7) asserts the existence of a T-periodic solution y*_per with T = kτ, which is invariant under S. Provided a transformation S possessing these properties is known for a system y' = f(y), the boundary condition y(T) = y(0) may then be replaced by y(τ) = Sy(0); this is advantageous since any numerical method then is to be used only for t ∈ [0, T/k]. For arbitrary b, r, σ ∈ R^+, Kühn [29] has shown that all solutions y* = y*(t, y(0)) of (5.8) are contained in the following ellipsoid in the limit as t → ∞:
(5.14) V(y_1, y_2, y_3) := r y_1^2 + σ y_2^2 + σ(y_3 − 2r)^2 ≤ α with α = α(b,r,σ); e.g., α < 20071 for b = 8/3, σ = 6, and r = 28; see also [30, p.26].
Consequently, the strange attractor, the stationary points, and all periodic solutions of (5.8) (if there are any) are confined to V. Kühn [29] has organized and applied
(a) an algorithm for the execution of the Enclosure Method (as outlined in Subsection 5.7) for the purpose of the enclosure and verification of periodic solutions y*_per and their periods;
(b) a C-compiler enabling him to run this algorithm on an HP-VECTRA computer, making use of codes available in the language PASCAL-SC;
(c) for the verified periodic solutions y*_per, an approximation of the eigenvalues addressed in Theorem 5.7 and subsequently.
Remark: The language PASCAL-XSC [27] was not yet available in the winter of 1989/1990 when Kühn carried out the numerical work addressed in the list (a)-(c).
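A floating-point plausibility check of the confinement property (5.14) — not a proof, which would require enclosures of the kind listed in (a)-(c) — integrates (5.8) with the classical Runge-Kutta method and evaluates V along the computed trajectory; the starting point and step size below are arbitrary illustrative choices.

```python
def lorenz(y, b=8.0/3.0, r=28.0, sigma=6.0):
    y1, y2, y3 = y
    return (sigma*(y2 - y1), r*y1 - y2 - y1*y3, y1*y2 - b*y3)

def rk4_step(y, h):
    # One classical Runge-Kutta step (no step size control).
    k1 = lorenz(y)
    k2 = lorenz(tuple(a + h/2*k for a, k in zip(y, k1)))
    k3 = lorenz(tuple(a + h/2*k for a, k in zip(y, k2)))
    k4 = lorenz(tuple(a + h*k for a, k in zip(y, k3)))
    return tuple(a + h/6*(p + 2*q + 2*s + t)
                 for a, p, q, s, t in zip(y, k1, k2, k3, k4))

def V(y, r=28.0, sigma=6.0):
    # Lyapunov-type function of (5.14).
    return r*y[0]**2 + sigma*y[1]**2 + sigma*(y[2] - 2*r)**2

y = (1.0, 1.0, 1.0)          # arbitrary starting point inside V <= alpha
h = 0.002
for _ in range(10000):       # integrate to t = 20
    y = rk4_step(y, h)
V_final = V(y)
```

Along the computed trajectory, V remains below the bound α < 20071 stated for b = 8/3, σ = 6, r = 28.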
For b = 8/3, r = 28, and σ = 6, Kühn [29] has proved the following theorems, which can be stated making use of the notation a_b^c for an enclosure of a number a ∈ R, with b and c as the last mantissa digits of a floating-point representation of bounds of a.
Theorem 5.15: For (5.8), there exists an isolated T^(1)-periodic solution y*_per^(1) whose orbit intersects the interval (6.83111896933, 3.2213122112, 27)^T ⊂ R^3. The period T^(1) is in the interval 0.689918686827 ⊂ R. The fundamental matrix Φ*(T^(1)) of the linear variational system possesses the eigenvalues λ_1^(1) = 1, λ_2^(1) ≈ 1.05092, and λ_3^(1) ≈ 0.00120. There is a T^(1)-periodic orbit ŷ*_per^(1) ≠ y*_per^(1) as defined by (5.12).
Remark: Concerning λ = 1, see the reference to [9, p.98] in conjunction with Theorem 5.7.
Theorem 5.16: For (5.8), there exists an isolated T^(2)-periodic solution y*_per^(2) whose orbit intersects the interval (0.51219159, 2.24983738, 27)^T. The period T^(2) is in the interval 1.75168488. The fundamental matrix Φ*(T^(2)) of the linear variational system possesses the eigenvalues λ_1^(2) = 1, λ_2^(2) ≈ 4.69500, and λ_3^(2) ≈ 9.4324·10^-9. The orbit y*_per^(2) is invariant under the transformation S defined in (5.11).
Theorem 5.17: For (5.8), there exists an isolated T^(3)-periodic solution y*_per^(3) whose orbit intersects the interval (10.952283216, 21.71601368, 20)^T. The period T^(3) is in the interval 2.59427762798. The fundamental matrix Φ*(T^(3)) of the linear variational system possesses the eigenvalues λ_1^(3) = 1, λ_2^(3) ≈ 9.14122, and λ_3^(3) ≈ 1.4052·10^-13. There is a T^(3)-periodic orbit ŷ*_per^(3) ≠ y*_per^(3) as defined by (5.12).
Figure 5.18 exhibits the projections into the y_1-y_2-plane of the orbits of γ_1 := y*_per^(1) and γ_2 := y*_per^(2); Figure 5.19 shows this projection for the orbit of γ_3 := y*_per^(3).
Remarks: (1) Since T^(3) is relatively large, Kühn [29] determined the verifying enclosures of y*_per^(3) and T^(3) by means of a multiple shooting method [48], which is not represented in Section 5.7. In the case of y*_per^(2) and T^(2), Kühn employed a simple shooting method for t ∈ [0, T^(2)/2].
(2) Interval methods have been used in [12] and [42] for the purpose of a verification of the existence of periodic solutions y*_per of the Lorenz equations (5.8). Since the treatment of this problem in the papers just referred to is incomplete, Kühn's verification [29] of the existence of solutions y*_per for (5.8) is the first complete proof.
Figure 5.18: [29] Projections of the T^(1)-periodic solution γ_1 := y*_per^(1) and the T^(2)-periodic solution γ_2 := y*_per^(2) of (5.8)
Figure 5.19: [29] Projection of the T^(3)-periodic solution γ_3 := y*_per^(3) of (5.8)
An arbitrary starting point y^(i)(0) with i = 1 or 2 or 3 is chosen inside any one of the intervals I_i ⊂ R^3 as given in Theorems 5.15, 5.16, or 5.17. Since the orbits y*_per^(i) with i = 1 or 2 or 3 are unstable, the orbits ỹ^(i) := y*(t, y^(i)(0)) will generally move away from y*_per^(i); see Example 5.20 for an approximation ỹ of y*_per^(3). A corresponding growth of the widths of an enclosure of y*_per^(i) is to be expected as t increases when the total interval I_i ⊂ R^3 is admitted at time t = 0. For an investigation of this issue, H. Spreuer used the enclosure algorithm with Rufeger's step size control ([38] and [39]) and, as an additional feature, the estimate by Chang and Corliss of the local radius of convergence of the employed Taylor polynomials, see [39]. Spreuer obtained the following results:
(1) After 66 revolutions past y*_per^(1), the widths of the three computed scalar enclosing intervals are smaller than …
(2) At t = 20.8, the width of the three scalar intervals enclosing y*_per^(3) can be represented by the following vector: (0.102, 0.106, 0.297)^T. At this time, the 9th revolution past y*_per^(3) is almost complete.
By means of this application of [39] and its supplement, H. Spreuer additionally found out that there is a misprint in Kühn's original thesis [29] and in the representation of [29] in Theorem 5.17: T^(3) ∈ 2.62612378.
5.5 On Failures of Traditional Numerical Methods as Applied to BVPs
The present subsection is devoted to the presentation of examples for the unreliability of discretizations of nonlinear BVPs in the case that a total error-control is absent. The following two kinds of failures will be observed: the computational assessment of an approximation ỹ which either
(A) deceptively "determines" a non-existing true solution y* or
(B) fails to stay close to a true solution y*, even though the difference initially is zero or arbitrarily small.
Example 5.20: For the Lorenz equations (5.8), W. Espe [14] attempted the determination of an approximation ỹ of the T^(3)-periodic solution y*_per^(3) as asserted in Theorem 5.17. This solution is highly unstable because of the eigenvalue λ_2^(3). Espe
• employed a Runge-Kutta-Fehlberg method of order eight ([47], [48]) with
• its automatic control of the step size h and, for the starting vector ỹ(0) of a marching method, he chose the midpoint of the interval vector as stated in Theorem 5.17.
The distance between this choice of ỹ(0) and y*_per^(3)(0) is at most 10^-10. Figures 5.21a - 5.21d present Espe's results:
• in the execution of the first revolution for t ∈ [0, T^(3)], ỹ moves gradually away from y*_per^(3) such that their distance can already be detected within graphical accuracy at t ≈ T^(3): Figure 5.21a;
• in the execution of the second revolution for t ∈ [T^(3), 2T^(3)] and as ỹ and y*_per^(3) simultaneously come close to the stationary point (0,0,0)^T, they start to move apart rapidly such that their distance reaches relatively large values; in Sections 3 and 9, this is interpreted as a diversion of ỹ: Figures 5.21b and 5.21d;
• in the execution of the subsequent four revolutions, the aperiodic orbit shown in Figures 5.21c and 5.21e was generated; this orbit resembles those which, in the non-mathematical literature, are believed to indicate the presence of "chaos", e.g. [16, p.84] or [43, p.31, p.66].
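The failure mechanism of Example 5.20 — exponential amplification of an initially tiny difference — can be reproduced qualitatively with a few lines of non-verified floating-point code. The sketch below uses the classical fourth-order Runge-Kutta method and the standard chaotic parameter choice σ = 10, b = 8/3, r = 28; this parameter choice is an assumption made here for illustration and differs from Kühn's σ = 6.

```python
def lorenz(y, sigma=10.0, r=28.0, b=8.0/3.0):
    return (sigma*(y[1] - y[0]),
            r*y[0] - y[1] - y[0]*y[2],
            y[0]*y[1] - b*y[2])

def rk4(y, h, n):
    # n classical Runge-Kutta steps with constant step size h.
    for _ in range(n):
        k1 = lorenz(y)
        k2 = lorenz(tuple(a + h/2*k for a, k in zip(y, k1)))
        k3 = lorenz(tuple(a + h/2*k for a, k in zip(y, k2)))
        k4 = lorenz(tuple(a + h*k for a, k in zip(y, k3)))
        y = tuple(a + h/6*(p + 2*q + 2*s + t)
                  for a, p, q, s, t in zip(y, k1, k2, k3, k4))
    return y

y0 = (1.0, 1.0, 1.0)
y0_pert = (1.0 + 1e-8, 1.0, 1.0)   # initial difference far below plotting accuracy
ya = rk4(y0, 0.002, 12500)          # integrate both initial points to t = 25
yb = rk4(y0_pert, 0.002, 12500)
dist = max(abs(a - b) for a, b in zip(ya, yb))
```

After 25 time units the two computed orbits are macroscopically different, although the same method and step size were used for both — the analogue of ỹ diverting from y*_per^(3) above.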
Figure 5.21a: [14] Projection of the first revolution of the Runge-Kutta-Fehlberg approximation ỹ starting in the midpoint of the interval as stated in Theorem 5.17; ỹ is an approximation of the T^(3)-periodic solution y*_per^(3) of (5.8) exhibited in Figure 5.19
Figure 5.21b: [14] Projection of the first and the second revolution of the approximation ỹ; continuation of Figure 5.21a
Figure 5.21c: [14] Projection of the first to the fourth revolution of the approximation ỹ; continuation of Figure 5.21b
Figure 5.21d: [14] Component of the approximation ỹ as a function of t ∈ [0, 4.5], with T := T^(3) ≈ 2.59 demarcated; this relates to Figure 5.21b
Figure 5.21e: [14] Component of the approximation ỹ as a function of t ∈ [0, 15], with T := T^(3) ≈ 2.59 demarcated; this relates to Figure 5.21c
This completes Example 5.20 and the demonstration of a failure of type (B). The subsequent Example 5.22 for a failure of type (A)
• is primarily concerned with a problem in elastostatics,
• however, it can be recast into an example in dynamics with periodic solutions [20].
Example 5.22 [24]: The BVP of the nonlinear buckling of a beam is considered such that one end (x = 0) is clamped and the other (x = 1) is free [51, p.70]:
(5.23) y'' + λ sin y = 0 for x ∈ [0,1]; y(0) = 0, y'(1) = 0; λ := P/EI ∈ R^+,
with P the axial (non-follower) load and EI the flexural rigidity of the beam. Concerning every (classical) solution of (5.23), a continuation for x ∈ R yields a 4-periodic solution. Under consideration of the boundary conditions, the integration of y'' + λ sin y = 0 for x ∈ [0,1] yields
(5.24) y'(0) =: s ≤ 2√λ.
The true solutions y* = y*(x,λ,s*) of the ODE in (5.23) can be represented by means of elliptic integrals [51, p.71]. Therefore, and by use of the linearized ODE y'' + λy = 0 with y(0) = y'(1) = 0, it is known that
• there is a sequence of bifurcation points λ_n := (2n − 1)^2 π^2/4, for all n ∈ ℕ;
• for λ ∈ (λ_n, λ_{n+1}), there are 2n+1 real-valued solutions y* = y*(x, λ, ±s_i*) with s_0* = 0 and s_i* > 0 for i = 1(1)n allowing the boundary conditions to be satisfied.
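Before the verified treatment below, the shooting idea for (5.23) can be sketched without any error control: integrate the beam equation with the classical Runge-Kutta method and locate a nontrivial s = y'(0) with y'(1) = 0 by bisection. The choice λ = 3, lying between the first two bifurcation points, and the bracket [0.1, 3.0] are illustrative assumptions.

```python
import math

def shoot(s, lam, steps=400):
    """Integrate y'' = -lam*sin(y), y(0) = 0, y'(0) = s over [0,1] with
    classical Runge-Kutta; return y'(1), i.e. F(s) in the sense of the
    boundary condition y'(1) = 0 of (5.23)."""
    def rhs(u):
        return (u[1], -lam*math.sin(u[0]))
    u, h = (0.0, s), 1.0/steps
    for _ in range(steps):
        k1 = rhs(u)
        k2 = rhs((u[0] + h/2*k1[0], u[1] + h/2*k1[1]))
        k3 = rhs((u[0] + h/2*k2[0], u[1] + h/2*k2[1]))
        k4 = rhs((u[0] + h*k3[0], u[1] + h*k3[1]))
        u = (u[0] + h/6*(k1[0] + 2*k2[0] + 2*k3[0] + k4[0]),
             u[1] + h/6*(k1[1] + 2*k2[1] + 2*k3[1] + k4[1]))
    return u[1]

lam = 3.0                 # between lambda_1 = pi^2/4 and lambda_2 = 9*pi^2/4
a, b = 0.1, 3.0           # F(s) changes sign on this bracket (s <= 2*sqrt(lam))
for _ in range(60):       # bisection for a nontrivial root s* of F(s) = 0
    m = 0.5*(a + b)
    if shoot(a, lam)*shoot(m, lam) <= 0.0:
        b = m
    else:
        a = m
s_star = 0.5*(a + b)
```

Every numerical value here is uncontrolled — precisely the situation in which the spurious root of this example arises; the verified treatment replaces the integration and the root location by the enclosures (5.25)-(5.28).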
The following auxiliary IVP is assigned to the BVP (5.23), making use of y_1 := y and y_2 := y':
(5.25) y_1' = y_2, y_2' = −λ sin y_1 for x ∈ [0,1], with y_1(0) = 0, y_2(0) = s, and λ ∈ R^+ fixed.
The shooting parameter s = s(λ) is to be determined such that the boundary condition
(5.26) F(s) := y_2(1,λ,s) = 0
is satisfied by the solutions y* = y*(x,λ,s) of (5.25). If it exists, a root of (5.26) will be denoted by s* = s*(λ). An interval Newton method [6, p.75] will be used for an enclosure and a verification of a root s*. The correspondingly required derivative F' of F can be determined by means of enclosures of auxiliary functions η_i = η_i(x,λ,s) := ∂y_i*(x,λ,s)/∂s for i = 1 or 2. These functions satisfy the following linear auxiliary IVP, whose ODEs can be derived by means of a differentiation with respect to s of the equations in (5.25) with y_1, y_2 replaced by y_1*, y_2*:
(5.27) η_1' = η_2, η_2' = −(λ cos y_1*(x,λ,s)) η_1 for x ∈ [0,1], with η_1(0,λ,s) = 0, η_2(0,λ,s) = 1.
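The interval Newton method just mentioned can be illustrated on a scalar model problem with a hand-coded interval type — a simplified stand-in for the method of [6], with the model function x^2 − 2 an assumption for illustration only, not part of Example 5.22: if the Newton image N([x]) lies inside [x], a root is verified to exist in N([x]).

```python
def isub(x, y):                        # interval subtraction
    return (x[0] - y[1], x[1] - y[0])

def idiv(x, y):                        # interval division, 0 not in y
    assert y[0] > 0 or y[1] < 0
    q = [x[0]/y[0], x[0]/y[1], x[1]/y[0], x[1]/y[1]]
    return (min(q), max(q))

def newton_step(x):
    """One interval Newton step for f(x) = x^2 - 2 on an interval x > 0:
    N([x]) = m - f(m)/f'([x]) with f'([x]) = [2a, 2b]."""
    m = 0.5*(x[0] + x[1])
    fm = (m*m - 2.0, m*m - 2.0)        # point value as a degenerate interval
    fprime = (2.0*x[0], 2.0*x[1])
    return isub((m, m), idiv(fm, fprime))

x = (1.0, 2.0)
N = newton_step(x)
verified = x[0] <= N[0] and N[1] <= x[1]   # N([x]) inside [x] => a root exists
```

Here N([1,2]) = [1.375, 1.4375] ⊂ [1,2], which verifies the existence of the root √2 in N. A real implementation would additionally round the interval endpoints outward; that detail is omitted in this sketch.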
The interval Newton method then is given by
(5.28) [s_{k+1}] := m([s_k]) − [y_2(1, λ, m([s_k]))]/[η_2(1, λ, [s_k])] for k ∈ ℕ_0; [s_0] ⊂ R is chosen, λ ∈ R^+ is fixed, and m([s_k]) denotes the midpoint of [s_k].
The external brackets [y_2(...)] and [η_2(...)] refer to an employment of Lohner's Enclosure Method ([31], [32]) with respect to the semi-coupled IVPs (5.25), (5.27). For any fixed λ, the set of roots {s_i*(λ) | i = 1(1)n} ⊂ (0, 2√λ] of (5.26) can be characterized as follows: with s_0* = 0 and for i = 1(1)n,
(a) the distance s_i* − s_{i−1}* of the roots of F(s) = 0 decreases strongly as s ∈ (0, 2√λ] increases;
(b) this distance is very small close to 2√λ, particularly for large λ;
(c) as λ − λ_n changes its sign from negative to positive for any fixed n ∈ ℕ, a new root, s*_new, starts from s = 0;
(d) for all x ∈ [0,1], the solution y* = y*(x, λ, s*_new) possesses a magnitude that approaches zero as λ − λ_n → 0+;
(e) because of (b) and (d), a computational determination of a root s* is ill-conditioned close to either 2√λ or zero, respectively.
For λ = 417 > λ_7 ≈ 416.99, an approximation ŝ_new of s*_new will now be
discussed for the case of a replacement of Lohner's Enclosure Method by a classical Runge-Kutta method with step size h. For h = 1/50 or 1/100 or 1/200, the shooting method yielded approximations ỹ_2(1, 417, s) which were negative for s = 10^-12 and positive for s = 5·10^-11. If correct, these values imply the existence of a root s*_new in the interval [10^-12, 5·10^-11]. When this problem was reworked by means of Lohner's Enclosure Method, it turned out that there is no root s*_new in this interval [10^-12, 5·10^-11]. As λ − λ_7 goes from negative to positive values, the new root s*_new is in the interval [0.27, 0.28].
The subsequent Example 5.31 demonstrates another failure of type (A). The following nonlinear Mathieu equation with period T_0 = 1 is considered:
(5.29) y'' + 0.1 y' + (3 + cos(2πt)) y + 0.1 y^3 = −10.
If there are T-periodic solutions y*_per of (5.29), then T must satisfy T = kT_0 with any k ∈ ℕ. For a verification of this property, (5.29) is equivalently represented by means of the system y' = f(t,y) with y := (y_1, y_2)^T. The validity of y*(t+T) = y*(t) for its solutions y*_per and all t ∈ R implies that
(5.30) y*_per'(t) = f(t, y*_per(t)) and y*_per'(t+T) = f(t+T, y*_per(t+T)) = f(t+T, y*_per(t)).
Consequently, f must be T-periodic or, more generally, T/k-periodic with any k ∈ ℕ.
Example 5.31 [21]: Concerning (5.29), y(0) and y'(0) were systematically varied in order to search for candidates for periodic solutions, which then were to be verified by use of Lohner's Enclosure Methods ([31], [32]). For this purpose, a classical Runge-Kutta method was used with the choice of a step size h = 2·10^-4. A "candidate" ỹ was found for T = 0.3782 because of
(5.32) ỹ(0) = (−8.03, 2.3)^T and ỹ(T) = (−8.036518, 2.300605)^T.
The corresponding orbit is "closed" with more than graphical accuracy. Because of the discussions with respect to (5.30), T = 0.3782 is far away from the candidates T = kT_0 for periods of T-periodic solutions of (5.29). Additional examples for failures of discretizations are as follows:
• computed "approximations" ỹ of non-existing solutions y*_per of BVPs in Subsections 7.3 and 7.4 of the present paper;
• for failures of computed difference approximations ỹ as caused by diversions,
Discretizations: Theory of Failures see Subsection 9.2 in [4] and [3], [49], and the paper [39] by Rufeger & Adams in this volume.
5.6 CONCLUDING REMARKS
Generally, nonlinear ODEs with boundary conditions constitute a situation exhibiting unusual difficulties with respect to
(a) the verification of the existence of true solutions y* and
(b) the determination of reliable computational approximations of solutions y*.
Concerning both (a) and (b), these difficulties hold irrespective of the employment of traditional numerical methods or of Enclosure Methods. In view of the unreliability of discretizations, the presented Examples 5.22 and 5.31 are convincing since the investigated systems of ODEs
• are not "suitably chosen" for demonstration purposes, but rather have been taken from the physics literature;
• their complexities are rather low since their total orders are small and their nonlinearities are simple.
This completes Part I, which is mainly devoted to qualitative discussions with respect to the reliability of discretizations of DEs. Concerning this problem,
• there are two quantitative examples in Subsection 5.5 of this Part I;
• a more general discussion of practical failures of discretizations is presented in Part II [4].

5.7 APPENDIX: KÜHN'S ENCLOSURE METHOD FOR PERIODIC SOLUTIONS OF NONLINEAR ODES
The nonlinear BVP (5.1) is considered for the case that the unknown endpoint T is to be determined simultaneously with a T-periodic true solution y* of this BVP. Since the ODEs y' = f(y) are autonomous, any solution y* of (5.1) can immediately be continued to all intervals [lT, (l+1)T], for all l ∈ ℕ. For this purpose, y*(t) for t ∈ [0,T] is replaced by y*(t + lT). The goal of Kühn's construction [29] is
• the computational determination of an enclosure of a solution y* of (5.1),
• merged with the automatic verification of the existence of y*.
For these purposes, a hyperplane V (with a representation in R^n) is chosen such that
(5.33) (i) D ∩ V ≠ ∅; (ii) s* = y*_per(0) ∈ D ∩ V, provided y*_per exists; and (iii) in a neighborhood of s*, f(y) for y ∈ D ∩ V is not orthogonal to grad V(y);
here ∅ denotes the empty set and s represents the shooting vector introduced in (5.3). Obviously, s* = y*_per(0) can be satisfied only as the result of a convergent iteration for s as a root of (5.5). For y ∈ V, a new (n−1)-dimensional Cartesian basis is introduced by means of a bijective map ψ:
(5.34) ψ: V → ℝⁿ⁻¹ and w = ψ(y).
Without a loss of generality, this basis is now chosen such that
(5.35) V := {y ∈ ℝⁿ | y_n = c = const.}.
The following Poincaré map P is introduced:
(5.36) P: D ∩ V → D ∩ V and Py := Φ_f^{τ(y)} y for y ∈ D ∩ V,
where Φ_f^t denotes the solution operator (flow) of y' = f(y) and t = τ(y) is the first return time to D ∩ V of an orbit which, at t = 0, starts at y ∈ D ∩ V. Remark: This map involves unknown true solutions y* of y' = f(y). Consequently, a Poincaré map is not finite-dimensional since, implicitly, the function space of the solutions y* is involved. Provided there exists a periodic solution y*_per with y*_per(0) =: s* ∈ D ∩ V and y*_per(τ(y)) ∈ D ∩ V, then there is a smallest k ∈ ℕ such that
(5.37) Pᵏs* := (P ∘ P ∘ ... ∘ P)s* = s* ∈ D ∩ V with P⁰ := I and T := Σ_{j=0}^{k−1} τ(Pʲs*),

where T represents the total time between the departure from D ∩ V at s* and the return to this point. If (5.37) is true for s*, then there holds Pᵐᵏs* = Pᵏs* = s* for all m ∈ ℕ, i.e., y*_per is T-periodic. Since s* ∈ D ∩ V, there are n unknown components of the vector (w,T) to be determined. Since y ∈ D ∩ V, (5.37) is equivalent with
(5.38) G(w) := Φ_f^{τ(y)} ψ⁻¹(w) − ψ⁻¹(w) = 0 for the points y = ψ⁻¹(w) on D ∩ V.
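In practice, the first return time τ(y) in (5.36) is obtained by integrating y' = f(y) numerically and locating the re-crossing of the hyperplane (5.35). The following sketch (plain floating-point arithmetic, no enclosures; all names are illustrative) realizes such an approximate Poincaré map for the harmonic oscillator y' = (y₂, −y₁)ᵀ, whose orbits return to the section y₂ = 0 after the period 2π:

```python
import math

def f(y):
    # illustrative autonomous field y' = f(y): harmonic oscillator
    return [y[1], -y[0]]

def rk4_step(y, h):
    # one classical Runge-Kutta step
    def ax(v, s, w): return [vi + s*wi for vi, wi in zip(v, w)]
    k1 = f(y)
    k2 = f(ax(y, h/2, k1))
    k3 = f(ax(y, h/2, k2))
    k4 = f(ax(y, h, k3))
    return [yi + h/6*(a + 2*b + 2*c + d) for yi, a, b, c, d in zip(y, k1, k2, k3, k4)]

def poincare_return(y0, c=0.0, h=1e-3, tmax=100.0):
    """First return to the hyperplane V = {y : y_n = c}, crossed in the
    same direction as at t = 0 (cf. condition (iii) in (5.33))."""
    y, t = list(y0), 0.0
    sign0 = f(y0)[-1] > 0          # crossing direction at the start
    left = False                    # has the orbit left the section?
    while t < tmax:
        y_new = rk4_step(y, h)
        t += h
        if abs(y_new[-1] - c) > 1e-2:
            left = True
        if left and (y[-1] - c)*(y_new[-1] - c) <= 0 and (f(y_new)[-1] > 0) == sign0:
            # linear interpolation for the crossing time tau(y0)
            s = (c - y[-1]) / (y_new[-1] - y[-1])
            tau = t - h + s*h
            return tau, [a + s*(b - a) for a, b in zip(y, y_new)]
        y = y_new
    raise RuntimeError("no return found")

tau, Py = poincare_return([1.0, 0.0])
print(round(tau, 3))   # 6.283, i.e. approximately 2*pi
```

For this linear example Py₀ ≈ y₀ and τ ≈ 2π, i.e., (5.37) holds with k = 1; a verified variant would replace the point integration by an enclosure algorithm for IVPs, e.g. Lohner's [31], [32].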
Theorem 5.39: If and only if
(5.40) F(w,T) := Φ_f^T ψ⁻¹(w) − ψ⁻¹(w) = 0, with T := kτ(w) and an arbitrary but fixed k ∈ ℕ,
then G as defined in (5.38) possesses a root w*.
Proof: (i) If G(w) = 0 and T = τ(w), then F(w,T) = 0. (ii) If F(w,T) = 0, then T must be an integer multiple of τ(w). Consequently, G(w) = 0. For both (i) and (ii), the existence of a root w* follows from the existence of the solution y* with representation by means of the operator Φ_f^t. □
Since T is unknown, F as defined in (5.40) corresponds to the boundary condition of a free BVP concerning y' = f(y). This BVP will now be represented equivalently by means of a BVP with fixed endpoints 0 and 1, making use of a suitable transformation of the time scale. For this purpose, a function φ(t,y₀) := Φ_f^t y₀ is introduced, and the operator Φ_f^t is replaced by
(5.41) Φ_h^t y₀ := φ(Tt, y₀).
This implies that
(5.42) (d/dt) Φ_h^t y₀ = (d/dt) φ(Tt, y₀) = T·(∂φ/∂t)(Tt, y₀) = T f(φ(Tt, y₀)) =: h(Φ_h^t y₀).
By means of y' = f(y), therefore,
(5.43) h(y) := T f(y) for y ∈ D and Φ_h^1 y₀ = Φ_f^T y₀,
and
(5.44) (∂/∂T) Φ_f^T y₀ = f(φ(T, y₀)).
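The effect of the time transformation (5.41)-(5.43) can be checked on a scalar linear ODE: integrating the rescaled field h(y) = T·f(y) over the fixed unit interval must reproduce Φ_f^T y₀. A minimal sketch (classical Runge-Kutta; the concrete numbers are illustrative):

```python
import math

lam, T, y0 = -0.7, 2.5, 1.0

def f(y):            # original field, y' = f(y)
    return lam * y

def h(y):            # rescaled field (5.43): h(y) = T*f(y)
    return T * f(y)

def rk4(field, y, t_end, n=2000):
    # classical Runge-Kutta over [0, t_end] with n steps
    hstep = t_end / n
    for _ in range(n):
        k1 = field(y); k2 = field(y + hstep/2*k1)
        k3 = field(y + hstep/2*k2); k4 = field(y + hstep*k3)
        y += hstep/6*(k1 + 2*k2 + 2*k3 + k4)
    return y

a = rk4(f, y0, T)    # flow of f over [0, T]
b = rk4(h, y0, 1.0)  # flow of h over the fixed unit interval [0, 1]
print(abs(a - b) < 1e-10, abs(a - math.exp(lam*T)) < 1e-8)
```

Both integrations perform algebraically identical steps, so the endpoint values agree up to roundoff, and both match the exact flow e^{λT}y₀; this is the fixed-endpoint reformulation used in (5.45)-(5.46).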
This allows the following representation of the function F, which is defined in (5.40):
(5.45) F(w,T) = Φ_h^1 ψ⁻¹(w) − ψ⁻¹(w).
Consequently, a root ŝ* of
(5.46) F̂(ŝ) = 0 with ŝ := (w,T)ᵀ ∈ ℝⁿ
is to be determined, where T has been incorporated in the choice of the function h. The problem (5.46) is now equivalently represented through the classical mean value theorem, 0 = F̂(ŝ*) = F̂(ŝ₀) + F̂'(ŝ_im)·(ŝ* − ŝ₀), with ŝ₀ ∈ D ∩ V fixed and ŝ_im the intermediate argument. A rearrangement of terms yields the equivalent representation ŝ = Ĥ(ŝ) with
Ĥ(ŝ) := ŝ₀ − ΓF̂(ŝ₀) + (I − ΓF̂'(ŝ_im))(ŝ − ŝ₀),
where ŝ₀ ∈ D ∩ V is fixed and Γ ∈ L(ℝⁿ) is arbitrary but fixed and invertible. According to Rump [40], this gives rise to the following interval extension, where Γ now is not assumed to be invertible:
(5.47) Ĥ([s]) := ŝ₀ − ΓF̂(ŝ₀) + (I − ΓF̂'([s]))([s] − ŝ₀),
where both ŝ₀ ∈ [s] and Γ ∈ L(ℝⁿ) are arbitrary but fixed.
This induces an interval iteration
(5.48) [s_{k+1}] := Ĥ([s_k]) with [s₀] ⊂ ℝⁿ suitably chosen.
The following theorem has essentially been proved by Rump [40]:
Theorem 5.49: Provided the enclosure condition
(5.50) [s_{k+1}] ⊂ [s_k]
is satisfied, there hold: (i) Γ is invertible and (ii) there exists one and only one fixed point ŝ* ∈ [s_{k+1}] of Ĥ.
Remark: For (5.50) to be satisfied, ŝ* must be isolated.
For applications of Theorem 5.49, the Jacobian of F̂ is needed:
(5.51) ∂F̂_i/∂w_k with Y(t) := (Y_ik(t)) := (∂φ_i(Tt, ψ⁻¹(w))/∂w_k)
and, because of (5.40) − (5.43),
(5.52) ∂F̂/∂T = (1/T) h(Φ_h^1 ψ⁻¹(w)) = f(Φ_h^1 ψ⁻¹(w)).
Because of (5.35) and making use of the Kronecker symbol δ_ij,
(5.53) ∂F̂_i/∂w_j = Σ_k Y_ik(1) δ_kj − δ_ij = Y_ij(1) − δ_ij.
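A minimal one-dimensional sketch of the interval operator (5.47) and the enclosure test (5.50) follows. It assumes a toy interval class without directed rounding; a genuine implementation, e.g. in PASCAL-XSC, would round outward so that (5.50) rigorously implies the existence statement of Theorem 5.49. The scalar model problem F(s) = s² − 2 stands in for the n-dimensional F̂:

```python
class Interval:
    """Toy interval arithmetic (no directed rounding -- only a sketch)."""
    def __init__(self, lo, hi=None):
        self.lo, self.hi = lo, hi if hi is not None else lo
    def __add__(self, o):
        o = as_iv(o); return Interval(self.lo + o.lo, self.hi + o.hi)
    def __sub__(self, o):
        o = as_iv(o); return Interval(self.lo - o.hi, self.hi - o.lo)
    def __mul__(self, o):
        o = as_iv(o)
        p = [self.lo*o.lo, self.lo*o.hi, self.hi*o.lo, self.hi*o.hi]
        return Interval(min(p), max(p))
    def strictly_inside(self, o):
        return o.lo < self.lo and self.hi < o.hi

def as_iv(x):
    return x if isinstance(x, Interval) else Interval(x)

# scalar model problem: root of F(s) = s^2 - 2, with F'(s) = 2s
F  = lambda s: s*s - 2.0
dF = lambda s: Interval(2.0*s.lo, 2.0*s.hi)   # interval extension of F'

s0 = 1.4                            # approximate root
box = Interval(1.3, 1.5)            # [s_0] in (5.48)
Gamma = 1.0 / 2.8                   # approximate inverse of F'(s0)

def H(box):                         # interval operator (5.47)
    return as_iv(s0 - Gamma*F(s0)) + (Interval(1.0) - as_iv(Gamma)*dF(box))*(box - s0)

new = H(box)
print(new.strictly_inside(box))     # True: condition (5.50) is satisfied
```

One application of Ĥ already yields strict inner inclusion; with outwardly rounded arithmetic, Theorem 5.49 would then guarantee one and only one root (here √2 ≈ 1.41421) in the new, much narrower interval, and iterating (5.48) contracts the enclosure further.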
Concerning Kühn's investigation of the symmetry properties (5.11)−(5.13), the details will now be discussed. Provided y* is a solution of y' = f(y) and (5.11) is valid, then, trivially, Sy* is also a solution:
(5.54) (Sy*(t))' = Sy*'(t) = Sf(y*(t)) = f(Sy*(t)) ⟹ S Φ_f^t y₀ = Φ_f^t S y₀,
since this implication is true at t = 0 and Φ_f^t is the operator representing the solutions y*. Whereas generally an orbit γ is not invariant with respect to S with a matrix S such that Sf(y) = f(Sy), the conditions of the following theorem may then be satisfied, which has been proved by W. Kühn [29]:
Theorem 5.55: Provided there are (a) an invertible matrix S ∈ L(ℝⁿ) such that Sf(y) = f(Sy), (b) a k ∈ ℕ such that Sᵏ = I, and (c) quantities y₀, τ̃ with the properties as stated in (5.13), then there hold: (i) the orbit γ as determined by y₀ is invariant with respect to S and (ii) Φ_f^t y₀ is periodic with period T = kτ̃.
Proof: (i) Because of (5.54), there holds
(5.56) Sγ := {Sy | y ∈ γ} = {S Φ_f^t y₀ | t ∈ ℝ} = {Φ_f^t S y₀ | t ∈ ℝ} = γ
since Sy₀ ∈ γ. (ii) Because of (5.13), (5.56), and Sy₀ ∈ γ, there holds
(5.57) Φ_f^{kτ̃} y₀ = Φ_f^{(k−1)τ̃} S y₀ = ... = Sᵏ y₀ = I y₀ = y₀. □
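Hypotheses (a) and (b) of Theorem 5.55 are satisfied, e.g., by the Lorenz equations (the system treated in Kühn's thesis [29]; cf. [33]): the field commutes with S = diag(−1, −1, 1), and S² = I, i.e., k = 2. A quick numerical check of Sf(y) = f(Sy) at a sample point (an illustrative sketch, with the classical parameter values):

```python
sigma, r, b = 10.0, 28.0, 8.0/3.0   # classical Lorenz parameters

def f(y):
    # Lorenz field y' = f(y)
    x1, x2, x3 = y
    return [sigma*(x2 - x1), r*x1 - x2 - x1*x3, x1*x2 - b*x3]

def S(y):
    # symmetry matrix S = diag(-1, -1, 1) applied to y
    return [-y[0], -y[1], y[2]]

# check Sf(y) = f(Sy) on a sample point, and S^2 = I (k = 2 in Theorem 5.55(b))
y = [1.3, -0.7, 22.0]
print(S(f(y)) == f(S(y)), S(S(y)) == y)
```

Both checks print True; since negation is exact in floating-point arithmetic, the commutation even holds exactly here, not merely up to roundoff.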
LIST OF REFERENCES
[1] E. Adams, A. Holzmüller, D. Straub, The Periodic Solutions of the Oregonator and Verification of Results, p. 111-121 in: Scientific Computation with Automatic Result Verification, eds.: U. Kulisch, H. J. Stetter, Springer-Verlag, Wien, 1988.
[2] E. Adams, Enclosure Methods and Scientific Computation, p. 3-31 in: Numerical and Applied Mathematics, ed.: W. F. Ames, J. C. Baltzer, Basel, 1989.
[3] E. Adams, Periodic Solutions: Enclosure, Verification, and Applications, p. 199-245 in: Computer Arithmetic and Self-validating Numerical Methods, ed.: Ch. Ullrich, Academic Press, Boston, 1990.
[4] E. Adams, The Reliability Question for Discretizations of Evolution Problems, II: Practical Failures, in: Scientific Computing with Automatic Result Verification, eds.: E. Adams, U. Kulisch, Academic Press, Boston, will appear (this volume).
[5] E. Adams, W. F. Ames, W. Kühn, W. Rufeger, H. Spreuer, Computational Chaos May be Due to a Single Local Error, will appear in J. Comp. Physics.
[6] G. Alefeld, J. Herzberger, Introduction to Interval Computations, Academic Press, New York, 1983.
[7] J. Argyris, H.-P. Mlejnek, Die Methode der finiten Elemente, III: Einführung in die Dynamik, F. Vieweg & Sohn, Braunschweig, 1988.
[8] C. Canuto, M. Y. Hussaini, A. Quarteroni, T. A. Zang, Spectral Methods in Fluid Dynamics, Springer-Verlag, New York, 1988.
[9] L. Cesari, Asymptotic Behavior and Stability Problems in Ordinary Differential Equations, 2nd edition, Springer-Verlag, Berlin, 1963.
[10] S.-N. Chow, K. J. Palmer, On the Numerical Computation of Orbits of Dynamical Systems: the One-Dimensional Case, will appear in J. of Complexity.
[11] S.-N. Chow, K. J. Palmer, On the Numerical Computation of Orbits of Dynamical Systems: the Higher Dimensional Case, will appear.
[12] S. De Gregorio, The Study of Periodic Orbits of Dynamical Systems, The Use of a Computer, J. of Statistical Physics 38, p. 947-972, 1985.
[13] H.-J. Dobner, Verified Solution of Integral Equations with Applications, in: Scientific Computing with Automatic Result Verification, eds.: E. Adams, U. Kulisch, Academic Press, Boston, will appear (this volume).
[14] W. Espe, Überarbeitung von Programmen zur numerischen Integration gewöhnlicher Differentialgleichungen, Diploma Thesis, Karlsruhe, 1991.
[15] R. Gaines, Difference Equations Associated With Boundary Value Problems for Second Order Nonlinear Ordinary Differential Equations, SIAM J. Numer. Anal. 11, p. 411-433, 1974.
[16] J. Guckenheimer, P. Holmes, Nonlinear Oscillations, Dynamical Systems, and Bifurcations of Vector Fields, 2nd printing, Springer-Verlag, New York, 1983.
[17] J. K. Hale, N. Sternberg, Onset of Chaos in Differential Delay Equations, J. of Comp. Physics 77, p. 221-239, 1988.
[18] S. M. Hammel, J. A. Yorke, C. Grebogi, Do Numerical Orbits of Chaotic Dynamical Processes Represent True Orbits?, J. of Complexity 3, p. 136-145, 1987.
[19] S. M. Hammel, J. A. Yorke, C. Grebogi, Numerical Orbits of Chaotic Processes Represent True Orbits, Bull. (New Series) of the American Math. Soc. 19, p. 465-469, 1988.
[20] Hao Bai-Lin, Chaos, World Scientific Publ. Co., Singapore, 1984.
[21] R. Heck, Lineare und Nichtlineare Gewöhnliche Periodische Differentialgleichungen, Diploma Thesis, Karlsruhe, 1990.
[22] B. M. Herbst, M. J. Ablowitz, Numerically Induced Chaos in the Nonlinear Schrödinger Equation, Phys. Review Letters 62, p. 2065-2068, 1989.
[23] H. Heuser, Gewöhnliche Differentialgleichungen, B. G. Teubner, Stuttgart, 1989.
[24] A. Holzmüller, Einschließung der Lösung linearer oder nichtlinearer gewöhnlicher Randwertaufgaben, Diploma Thesis, Karlsruhe, 1984.
[25] A. Iserles, A. T. Peplow, A. M. Stuart, A Unified Approach to Spurious Solutions Introduced by Time Discretisation, Part I: Basic Theory, SIAM J. Numer. Anal. 28, p. 1723-1751, 1991.
[26] D. W. Jordan, P. Smith, Nonlinear Ordinary Differential Equations, Clarendon Press, Oxford, 1977.
[27] R. Klatte, U. Kulisch, M. Neaga, D. Ratz, Ch. Ullrich, PASCAL-XSC Language Reference with Examples, Springer-Verlag, Berlin, 1992.
[28] P. E. Kloeden, A. I. Mees, Chaotic Phenomena, Bull. of Math. Biology 47, p. 697-738, 1985.
[29] W. Kühn, Einschließung von periodischen Lösungen gewöhnlicher Differentialgleichungen und Anwendung auf das Lorenzsystem, Diploma Thesis, Karlsruhe, 1990.
[30] G. A. Leonov, V. Reitmann, Attraktoreingrenzung für nichtlineare Systeme, B. G. Teubner Verlagsgesellschaft, Leipzig, 1987.
[31] R. Lohner, Einschließung der Lösung gewöhnlicher Anfangs- und Randwertaufgaben und Anwendungen, Doctoral Dissertation, Karlsruhe, 1988.
[32] R. J. Lohner, Enclosing the Solutions of Ordinary Initial and Boundary Value Problems, p. 255-286 in: Computerarithmetic, eds.: E. Kaucher, U. Kulisch, Ch. Ullrich, B. G. Teubner, Stuttgart, 1987.
[33] E. N. Lorenz, Deterministic Nonperiodic Flow, J. of the Atmosph. Sc. 20, p. 130-141, 1963.
[34] E. N. Lorenz, Computational Chaos - A Prelude to Computational Instability, Physica D 35, p. 299-317, 1989.
[35] R. May, Simple Mathematical Models With Very Complicated Dynamics, Nature 261, p. 459-467, 1976.
[36] R. E. Moore, Methods and Applications of Interval Analysis, SIAM, Philadelphia, 1979.
[37] J. M. Ortega, W. C. Rheinboldt, Iterative Solution of Nonlinear Equations in Several Variables, Academic Press, New York, 1970.
[38] W. Rufeger, Numerische Ergebnisse der Himmelsmechanik und Entwicklung einer Schrittweitensteuerung des Lohnerschen Einschließungs-Algorithmus, Diploma Thesis, Karlsruhe, 1990.
[39] W. Rufeger, E. Adams, A Step Size Control for Lohner's Enclosure Algorithm for Ordinary Differential Equations with Initial Conditions, in: Scientific Computing with Automatic Result Verification, eds.: E. Adams, U. Kulisch, Academic Press, Boston, will appear (this volume).
[40] S. M. Rump, Solving Algebraic Problems with High Accuracy, p. 51-120 in: A New Approach to Scientific Calculation, eds.: U. Kulisch, W. L. Miranker, Academic Press, New York, 1983.
[41] P. Schramm, Einsatz von PASCAL-SC und Anwendungen genauer Arithmetik bei zeitdiskretisierten dynamischen Systemen, Diploma Thesis, Karlsruhe, 1990.
[42] Ya. G. Sinai, E. B. Vul, Discovery of Closed Orbits of Dynamical Systems with the Use of Computers, J. Statistical Physics 23, p. 27-47, 1980.
[43] C. Sparrow, The Lorenz Equations: Bifurcations, Chaos, and Strange Attractors, Springer-Verlag, New York, 1982.
[44] H. Spreuer, E. Adams, Pathologische Beispiele von Differenzenverfahren bei nichtlinearen gewöhnlichen Randwertaufgaben, ZAMM 57, p. T304-T305, 1977.
[45] H. Spreuer, E. Adams, On Extraneous Solutions With Uniformly Bounded Difference Quotients for a Discrete Analogy of a Nonlinear Ordinary Boundary Value Problem, J. Eng. Math. 19, p. 45-55, 1985.
[46] H. Steinlein, H.-O. Walther, Hyperbolic Sets, Transversal Homoclinic Trajectories, and Symbolic Dynamics for C¹-Maps in Banach Spaces, J. of Dynamics and Differential Equations 2, p. 325-365, 1990.
[47] H. J. Stetter, Analysis of Discretization Methods for Ordinary Differential Equations, Springer-Verlag, Berlin, 1973.
[48] J. Stoer, R. Bulirsch, Einführung in die Numerische Mathematik, II, Springer-Verlag, Berlin, 1973.
[49] D. Straub, Eine Geschichte des Glasperlenspiels - Irreversibilität in der Physik: Irritationen und Folgen, Birkhäuser-Verlag, Basel, 1990.
[50] A. Stuart, Nonlinear Instability in Dissipative Finite Difference Schemes, SIAM Review 31, p. 191-220, 1989.
[51] S. Timoshenko, Theory of Elastic Stability, McGraw-Hill Book Co., Inc., New York, 1936.
[52] G. Trageser, Beschattung beweist: Chaoten sind keine Computer-Chimären, Spektrum der Wissenschaft, p. 22-23, 1989.
[53] C. Truesdell, An Idiot's Fugitive Essays on Science - Methods, Criticism, Training, Circumstances, Springer-Verlag, New York, 1984.
[54] W. Walter, Differential and Integral Inequalities, Springer-Verlag, Berlin, 1970.
[55] W. Wedig, Vom Chaos zur Ordnung, GAMM Mitteilungen 1989, Heft 2, p. 3-31, 1989.
[56] H. C. Yee, P. K. Sweby, On Reliability of the Time-Dependent Approach to Obtaining Steady-State Numerical Solutions, Proc. of the 9th GAMM Conf. on Num. Methods in Fluid Mechanics, Lausanne, Sept. 1991.
[57] H. C. Yee, P. K. Sweby, D. F. Griffiths, Dynamical Approach Study of Spurious Steady-State Numerical Solutions for Nonlinear Differential Equations, J. of Comp. Physics 97, p. 249-310, 1991.
[58] A. Ženíšek, Nonlinear Elliptic and Evolution Problems and their Finite Element Approximations, Academic Press, London, 1990.
II. PRACTICAL FAILURES

Whereas failures of discretizations of differential equations are mainly discussed qualitatively in Part I, [5], this continuation presents examples concerning spurious difference solutions (or diverting difference approximations) and a quantitative treatment of selected evolution problems of major contemporary interest. Examples are mathematical models for gear drive vibrations, the Lorenz equations, the Restricted Three Body Problem, and the Burgers equation. In each one of these examples, Enclosure Methods were instrumental in the discovery of large deviations of computed difference solutions or difference approximations from true solutions of the differential equations.
6. INTRODUCTION

The numbering of the sections in the present chapter is in continuation of the one in Part I, [5]. A literature reference [I.n] points to reference [n] in Part I.
Discretizations of nonlinear DEs (ODEs or PDEs) represent the most important practical approach for the determination of approximations of unknown true solutions y*. An approximation ỹ is said to fail practically
• if a suitably defined distance of ỹ and y* takes sufficiently large values or
• if there is no true solution y* which is approximated by a difference solution ỹ.
As is shown in [5], classes of examples for such failures are provided by "spurious difference solutions" ỹ, "diverting difference approximations" ỹ, or, more generally, any kind of computational chaos. There is a rapidly rising number of publications on the corresponding unreliability of discretizations of DEs. A qualitative discussion of this subject area is offered in [5]. For the purpose of a quantitative substantiation, only a few cases of practical failures of discretizations are treated in Subsection 5.5 in [5]. The present continuation of [5] is concerned with the following areas:
• for a class of linear ODEs with periodic coefficients, in Section 7 a case history of systematic practical failures of discretizations;
• for nonlinear ODEs or PDEs, in Section 8 a discussion of spurious difference solutions ỹ on the basis of literature and examples;
• for nonlinear ODEs or PDEs, in Section 9 the observation of diverting difference approximations ỹ as caused by properties of the underlying DEs or the phase space of their true solutions y*.
The areas covered in Sections 7-9 address problems of major current interest in the mathematical simulation of the real world. Consequently, the failures of discretizations as observed in these sections possess grave practical implications. Even though the problems considered in these sections are mainly concerned with ODEs, all their implications extend to PDEs. In fact,
• there are qualitatively identical influences of numerical errors in the cases of discretizations of ODEs and PDEs;
• frequently in the applied literature, evolution-type PDEs are approximated by systems of ODEs, as is discussed in Section 1 and in Subsection 9.4.
Generally in the areas covered in Sections 7-9, the practical failures of discretizations of DEs cannot be recognized on the level of these difference methods. In the absence of explicitly known true solutions y*, Enclosure Methods are the major practical approach for a reliable quantitative assessment of unknown true solutions y*. In fact, these methods were instrumental in the recognition of the majority of the practical failures as discussed in Sections 7-9.

7. APPLICATIONS CONCERNING PERIODIC VIBRATIONS OF GEAR DRIVES

7.1 INTRODUCTION OF THE PROBLEM

The case history presented in this section demonstrates
• the diagnostic power of Enclosure Methods, which were instrumental in the discovery that
• local numerical errors smaller than 10⁻⁸ were large enough to allow the practical computability of difference approximations ỹ of non-existing true periodic solutions y*_per.
This case study relates to the simulation of vibrations of one-stage, spur-type industrial gear drives. Consequently, two mated gears are considered, each of which is mounted on a shaft that
• is carried by bearings and
• one end of which is attached to an external clutch.
The physical problem can be characterized by the following geometric or dynamic data of gears:
• a diameter of up to 5 m,
• a performance of up to 70 000 kW,
• a peripheral speed of up to 200 m/s,
• an unusually high geometric precision of the teeth, whose deviations must not exceed a few micrometers (µm), [16], [25], [34].
Vibrations of gear drives are mainly induced by
(A) the periodically varying number of mated pairs of teeth and
(B) geometric deviations of the manufactured teeth from their intended ideal configurations, e.g. pitch deviations.
Because of (A), the tooth stiffness function c_z of the mated pairs of teeth is a T-periodic function of time t, where T = 1/w; the tooth engagement frequency w takes values of up to 5000 1/s. In the subsequent analysis, only (A) is taken into account. The T-periodicity of c_z presupposes that small T-periodic vibrations are superimposed onto the constant (nominal) angular velocities ω₁ and ω₂ of the (two) shafts. Obviously, 2π·w = ω₁z₁ = ω₂z₂, with z₁ and z₂ the numbers of teeth of the mated gears. Concerning a stochastic dynamical treatment of pitch deviations of teeth, see the paper [3] by D. Cordes, H. Keppler, and the author. The quantitative results in [3] demonstrate that pitch deviations of more than the few micrometers (µm) admitted in the industrial quality standards ([16], [25], [34]) may cause unacceptably intense vibrations. Additionally to these physical sensitivities, there are mathematical and numerical difficulties, as will be shown. The case study to be presented concentrates on the closely interrelated aspects of
• a Physical Simulation of the real world,
• Mathematical Theory,
• Numerical Analysis, and
• Computer and Compiler performance.
The present section rests on
• the physical simulation in the doctoral dissertation [14] (1984) of H. Gerber,
• the mathematical and numerical analysis in the doctoral dissertations of D. Cordes [12] (1987) and U. Schulte [29] (1991), and
• H. Keppler's additional analysis [18], which was carried out after the completion of Schulte's dissertation.
Beginning in 1987, Schulte's and Keppler's work has been supported by DFG in order to enable
(a) a theoretical analysis of
(b) real gear drive vibrations, which have been and are being investigated by use of a test rig in the Institute for Machine Elements of the Technical University of Munich.
Concerning (b), the doctoral dissertations of H. Gerber [14] and R. Müller [23] are referred to. These authors have provided the experimental results to be used subsequently for a validation of the mathematical and physical simulation to be discussed. In fact, a simulation of a real world problem is meaningful only in the case that a satisfactory agreement with observed physical results can be reached. This then may be the basis for a confident employment of the model for purposes of predictions.

7.2 ON THE INVESTIGATED CLASSES OF MODELS

On the basis of Gerber's dissertation [14], a multibody approach has been used for the purpose of the following physical simulation:
(A) the gears are replaced by two solid bodies carrying elastic teeth to be treated by means of the tooth stiffness function c_z;
(B) each shaft is replaced by one or several solid bodies;
(C) the thus generated N solid bodies are coupled as follows:
(C1) in cross sections of their artificial separation, by springs and parallel dashpots, i.e., the T-periodic tooth spring with stiffness coefficient c_z and the tooth dashpot with a constant damping coefficient k_z;
(C2) springs and parallel dashpots correspondingly simulate the bearings, whose foundations are at rest;
(C3) it is optional whether or not the shafts are correspondingly coupled with the (heavy) external clutches, which are assumed to rotate uniformly, i.e., without vibrations. If present, a coupling of this kind will be called a "rotational attachment".
Figure 7.6 displays a simulation by means of N = 4 solid bodies. The dashpots and the optional rotational attachments are not shown in this figure. As a minimal interaction of the shafts and the clutches, there must be a transfer of the constant external torques, see Figure 7.6:
• M₁ to the shaft with the nominal angular velocity ω₁ and
• M₂ to the other shaft with ω₂.
In the absence of the rotational attachment (C3), M₁ and M₂ act as external loads
onto one end-face of each shaft, respectively. In the presence of (C3), M₁ and M₂ are still assumed to be constant since a feedback of the vibrations onto the heavy clutches has been excluded by assumption. Gerber's multibody approach [14] employs a simulation by means of (A), (B), (C1), and (C2). The omission of (C3) is an essential property of
• the "original class of models" to be defined subsequently,
• which in 1987 was the starting point of Schulte's [29] work.
The absence of (C3) caused mathematical and numerical difficulties to be discussed in Subsections 7.3 and 7.4. These problems did not arise when (C3) was additionally taken into account in the case of a special model with N = 2:
• in the doctoral dissertations of Küçükay [20] (1981) and Naab [24] (1981) and
• in the one of D. Cordes [12] (1987).
Additionally to (A), (B), and (C), a physical simulation consists in the choice of a total of n (≥ N) degrees of freedom to be assigned to the N solid bodies. Each one of these degrees corresponds to one of the following kinematic coordinates (a) or (b), compare Figure 7.6:
(a) translational coordinates x_i of the motions relative to the bearings and/or
(b) rotational coordinates which are either
(b1) the absolute coordinates φ_i describing the superposition of the vibrations and the nominal solid body rotation ω_j t of the shafts and the gears, with either the sign + or the sign − (depending on i and j), or
(b2) the relative coordinates φ_i ∓ ω_j t or
(b3) the relative coordinates Φ_i which are defined as differences of coordinates φ_i belonging to neighboring solid bodies in the simulation.
Obviously, there is no contribution of ±ω_j t to the coordinates φ_i ∓ ω_j t or Φ_i. These contributions preclude the existence of a periodic state for the absolute coordinates (b1). Possibly the existence of a periodic state can be proved for the relative coordinates (b2) or (b3). Even though these kinematic facts are obvious, they were obscured by numerous quantitative results that were determined without a total error control, see Subsection 7.3. For a fixed tooth engagement frequency w, the results of major engineering interest are
• the tooth deformation s, which is a linear combination of all coordinates belonging to the two solid bodies representing the mated gears, and
• the tooth force F.
Provided |s(t)| and |s'(t)| are sufficiently small for all t ∈ ℝ, F can be expressed as follows:
(7.1) F(t) := c_z(t)·s(t) + k_z·s'(t).
Mathematically, both s and F consist of
• static contributions, due to the external torques M₁ and M₂, and, additionally,
• vibrational contributions that may still be present in the case of M₁ = M₂ = 0, compare (7.30).
Provided the vibrations are bounded for all t ∈ ℝ, it is physically obvious that |s(t)| and |s'(t)| also have this property. Correspondingly, in the dependency of s on rotational coordinates,
• there is no contribution of the uniform rotation ±ω_j t, as is seen by inspection of (7.7b) and (7.7d), and
• no influence of a replacement of the coordinates φ_i by either φ_i ∓ ω_j t or by Φ_i, as is seen by inspection of (7.8a).
The translational and/or rotational coordinates to be employed are now interpreted as the components of a vector (function) y: ℝ → ℝⁿ. Concerning their physical simulation, the dynamical interaction of the n corresponding degrees of freedom can be expressed by a system of n linear ODEs, each of the second order:
(7.2) My'' + Dy' + (A + B(t))y = g for t ∈ ℝ with y: ℝ → ℝⁿ;
M := diag(m₁,...,mₙ) ∈ L(ℝⁿ) and m_i ∈ ℝ⁺ for i = 1(1)n;
B(t) := c_z(t)·C; A, C, D ∈ L(ℝⁿ); c_z(t+T) = c_z(t) for all t ∈ ℝ; c_z is (piecewise) continuous; T = 1/w; g ∈ ℝⁿ.
Subsequently, M will be equated to the identity matrix I because, without a loss of generality, the system of ODEs can be premultiplied by M⁻¹. Since s is a linear function of certain components of y, the tooth force F is immediately known when a solution y* = y*(t,w) of (7.2) has been determined. For engineering purposes, the most important final result is the function
(7.3) F_dyn = F_dyn(w) := max{F*_per(t) | t ∈ [0,T]},
pertaining to a T-periodic representation s*_per of the tooth deformation s. For engineering purposes, the resonance excitations of F_dyn are to be determined, i.e., the frequencies w belonging to the local maxima of F_dyn. A countably infinite set of candidates w for these excitations is furnished by the Floquet theory, e.g., [I.9, p. 75]. The goal of the engineering analysis under discussion is an
identification of those candidates actually yielding significant resonance excitations. For engineering purposes, additionally, the property of asymptotic stability of (7.2) is of interest for the case when g is equated to zero. If it exists, a periodic solution of the nonhomogeneous ODEs (7.2) then is globally attractive. A rotational attachment (C3) is physically realized by means of springs and parallel dashpots, with corresponding coefficients of stiffness and damping. It is convenient to express these coefficients as scalar multiples of corresponding coefficients appearing in the simulation. Denoting the scalar factor by α ∈ ℝ₀⁺, the matrices in (7.2) can be expressed as follows, according to [29]:
(7.4) A = A₀ + αA₁ and D = D₀ + αD₁,
since B = B(t) simulates the interaction of the gears. According to Diekhans [13] or Schulte [29], the influence of α on F_dyn is practically negligible for α ≤ 10⁻³. In the case of an employment of translational coordinates x_i and/or relative rotational coordinates Φ_i, Schulte [29] has shown the validity of the following decomposition of the system of ODEs:
(I) if α = 0, there are n−1 ODEs just like the ones in (7.2) (compare (7.8a)), and
(II) by use of a suitable linear combination u = u(y₁,...,yₙ), the additional ODE
(7.5a) u'' = 0.
The solutions of (7.5a),
(7.5b) u*(t) = a + βt with a, β ∈ ℝ,
(i) can be determined without a discretization error, making use of any consistent difference method;
(ii) they relate to the uniform rotation ±ω_j t; i.e., the solutions of the subsystem (I) represent the vibrational motions.
Because of the bijectivity of the transformation relating the coordinates φ_i and Φ_i, the composition of the subsystems (I) and (II) is equivalent to the original system (7.2).
Remark: For the case of n = 2 rotational degrees of freedom, Gerber [14] has derived a Mathieu-ODE, since then n−1 = 1. This is the "simplest" one in the set of models which have been derived and employed by Gerber [14] and Schulte [29].
The following types of mathematical simulations will be distinguished:
(i) the "original class of models" with an employment of absolute rotational coordinates φ_i and the choice of α = 0;
(ii) the "class of modified models" which coincides with (i), with the exception that α = 0 is replaced by α > 0; the corresponding rotational attachment to
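Statement (i) on (7.5a) can be observed directly: a consistent difference scheme applied to u'' = 0 reproduces the affine solutions (7.5b) exactly, since its local truncation error vanishes on polynomials of first degree. A sketch with the central-difference (Störmer) scheme; the data a, β, h are illustrative and chosen binary-exact so that even roundoff vanishes:

```python
# Stoermer / central-difference scheme u_{k+1} = 2 u_k - u_{k-1} + h^2 * 0
# applied to u'' = 0, cf. (7.5a); starting values taken from u*(t) = a + beta*t.
# a, beta, h are binary-exact, so the recurrence is computed without roundoff.
a, beta, h, n = 0.5, -1.75, 0.015625, 1000
u = [a, a + beta*h]                  # exact values at t = 0 and t = h
for k in range(1, n):
    u.append(2.0*u[k] - u[k-1])      # the right-hand side contributes h^2 * 0

t_end = n*h
print(u[n] == a + beta*t_end)        # True: no discretization error, cf. (7.5b)
```

The same cancellation of the discretization error holds for any consistent scheme, which is why the rigid-rotation subsystem (II) causes no numerical difficulty; the difficulties discussed subsequently stem from the vibrational subsystem (I).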
uniformly rotating external masses causes an automatic replacement of the coordinates φ_i by φ_i ∓ ω_j t;
(iii) the "class of relative models" which, for α = 0, employs the relative rotational coordinates Φ_i.
For the purpose of a vibrational analysis, the three classes of models are "physically equivalent" provided α ≤ 10⁻³. The three classes are mathematically equivalent provided α = 0 and the ODE u'' = 0 is added to the subsystem (I). As an example for (7.2), now the special case of N = 4 bodies and n = 6 degrees of freedom is considered in the context of the original class of models. Figure 7.6 displays
• the translational coordinates x₁ and x₂,
• the absolute rotational coordinates φ₁, φ₂, φ₃, and φ₄,
• the stiffness coefficients c₁, c₂, c₃, and c₄ of coupling springs, with parallel dashpots not indicated in the figure,
• the tooth spring with stiffness coefficient c_z(t), with the parallel tooth dashpot not indicated in the figure.
Figure 7.6 does not display the (optional) rotational attachments of the shafts.
Figure 7.6: (from [29]) Physical simulation of a gear drive.
Under an additional consideration of the rotational attachments of the shafts with α > 0, the mathematical model is represented by the following system of ODEs:
(7.7a) [system of six second-order ODEs for x₁, x₂, φ₁, φ₂, φ₃, φ₄; not legibly reproduced],
where
(7.7b) s := x₁ + x₂ + r₁φ₁ + r₂φ₂,
(7.7c) r₁ω₁ = r₂ω₂ because of the kinematic consistency, r₂M₁ = r₁M₂ because of the external equilibrium;
m₁ and m₂ are masses, and J₁, J₂, J₃, J₄ are moments of inertia. For α = 0 [29, p. 55], this is a model in the original class of models with φ₁, φ₂, φ₃, φ₄ then absolute rotational coordinates. In the vectorial representation (7.2) of (7.7), the matrices A, C, and D are not invertible for all fixed choices of t ∈ ℝ. For α > 0, this is a model in the modified class of models. The symbols φ₁, φ₂, φ₃, φ₄ in (7.7a) and (7.7b) then represent the following relative coordinates φ_j ∓ ω_j t:
φ_j − ω_j t for j = 1 or 3 and φ_k + ω_k t for k = 2 or 4.
In both cases α = 0 or α > 0, (7.7a) is satisfied by the uniform rotation ±ω_j t:
(7.7d) x₁ = x₂ = 0, φ_j = ω_j t for j = 1 or 3, φ_k = −ω_k t for k = 2 or 4 ⟹ s = 0.
For α = 0, this is obvious by inspection of (7.7a) and (7.7b). For α > 0, this follows from the meaning of φ₁,...,φ₄ just referred to. For all n > 1, the models in the original and in the modified classes can be characterized as follows:
(7.8) (i) the coordinates φ_i and their derivatives φ_i' occur in pairs, as sums or as differences with certain weights;
(ii) the tooth deformation s is expressed by means of (7.7b) with x₁, x₂, r₁φ₁, and r₂φ₂ the coordinates of the gears;
(iii) for α = 0 and all t ∈ ℝ, Schulte [29] has shown that the matrices A, C, and D in (7.2) are not invertible, see Appendix 7.8;
(iv) for α > 0 and all t ∈ ℝ, the matrices A, C, and D generally are invertible.
Schulte [29, p. 68-69] has derived the model in the class of relative models which, for α = 0, belongs to (7.7). This model consists of the subsystem (I) with five ODEs, each of the second order, for the coordinates
(7.8a) x₁, x₂, Φ₁ := φ₁ − φ₃, Φ₂ := φ₁ + (r₂/r₁)φ₂, and Φ₃ := φ₂ − φ₄,
and the ODE u'' = r₂M₁ − r₁M₂ = 0 with u := r₂J₁φ₁ + r₂J₃φ₃ − r₁J₂φ₂ − r₁J₄φ₄, where u' = const. represents the conservation of angular momentum.
In the engineering community, the original class of models (for α = 0) has been the standard mathematical approach to an analysis of vibrations of industrial gear drives. Because of the mathematical and numerical problems to be shown for all models in this class, the author and his coworkers have adopted the other two classes of models. Concerning the original class, a literature survey has shown that traditionally marching methods have been used for a computational approximation of the true T-periodic tooth deformation, s*_per. Prior to a discussion of this approximation in Subsection 7.6, a determination of s*_per by means of the corresponding BVP of periodicity will be carried out. For this purpose, the system of ODEs (7.2) is represented equivalently by means of the following system, making use of a redefinition of the symbol y:
(7.9) y' = Â(t)y + ĝ with
Â(t) := [[ 0 , I ],
         [ −M⁻¹(A + B(t)) , −M⁻¹D ]];
Â: ℝ → L(ℝ²ⁿ) is (piecewise) continuous with Â(t + T) = Â(t) for all t ∈ ℝ.
E. Adams
For ĝ = 0, the homogeneous system possesses the following representation of its general solution:
(7.10) y(t) = φ*(t) y(0).
The fundamental matrix φ*: ℝ → L(ℝ²ⁿ) solves the matrix IVP
(7.11) φ′ = Â(t)φ for t ≥ 0 with φ(0) = I.
All solutions of (7.9) can be represented by means of
(7.12) y(t) = φ*(t) y(0) + ypart(t).
The system of ODEs (7.9) is now supplemented by the boundary condition of T-periodicity,
(7.13) y(T) − y(0) = 0.
Remark: There is not necessarily a solution of the BVP (7.9), (7.13). If this solution exists, it can always be continued for all t ∈ ℝ to represent the desired T-periodic solution y*per. Since y*per does not exist when absolute rotational coordinates are employed, the BVP (7.9), (7.13) then cannot possess a solution.
Substitution of (7.12) into (7.13) yields the following system of linear algebraic equations for the initial vector y(0) of the desired T-periodic solution y*per of (7.9):
(7.14) (I − φ*(T)) y(0) = ypart(T).
Provided λ = 1 is not an eigenvalue of the monodromy matrix φ*(T), (7.14) possesses one and only one solution:
(7.15) y*(0) = (I − φ*(T))⁻¹ ypart(T).
The T-periodic solution y*per of (7.9) then can be represented as follows [12]:
(7.16) y*per(t) = φ*(t) (I − φ*(T))⁻¹ ypart(T) + ypart(t), provided λ = 1 is not an eigenvalue of φ*(T).
This solution is unique and globally attractive if the homogeneous ODEs are asymptotically stable. Because of a theorem in [I.9, p.58], this property holds true if and only if the spectral radius of φ*(T) satisfies the inequality
(7.17) ρ(φ*(T)) < 1.
This then yields
*
the unique T-periodic tooth deformation, sper, making use of (7.7b) and
*
those components of yper which represent XI, 0
*
x2,
rl(p1, and r 2 ~ ;
the unique T-periodic tooth force, Fper, whose maximum value for t E [O,T]
is the desired dynamic tooth force, F&, as defined in (7.3). Remarks: J l For the computational determination of yper for t E [O,T], it is
*
advantageous to use a marching method, starting with the true initial vector y (0) * of yper, compare Subsection 7.6. J . 2 The homogeneous ODEs in (7.9) possess T-periodic solutions if and only if X =1
is an eigenvalue of d * ( ~[1.9,p.591. ) 3 J In the case that X = l is an eigenvalue of $*(T), the system (7.14) possesses solutions if and only if an orthogonality condition is satisfied, see Appendix 7.8. Then, there are infinitely many T-periodic solutions of (7.9). These solutions are not attractive since the system of homogeneous ODEs in (7.9) then is not asymptotically stable.
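When λ = 1 is not an eigenvalue of φ*(T), the construction (7.11)-(7.16) can be carried out numerically in a few lines. The following sketch (in Python, for illustration only; the 2×2 system matrix and the forcing are hypothetical stand-ins, not the gear-drive model (7.2)) builds the monodromy matrix columnwise with a classical Runge-Kutta method and solves (7.14) for the periodic initial vector:

```python
import numpy as np

# Hypothetical stand-in system y' = A y + g(t) with T-periodic forcing;
# A is asymptotically stable, so rho(Phi(T)) < 1 as in (7.17).
T = 1.0
A = np.array([[0.0, 1.0], [-4.0, -0.4]])
def g(t):
    return np.array([0.0, np.cos(2.0 * np.pi * t)])

def rk4(f, y0, t0, t1, n):
    """Classical Runge-Kutta method with a fixed step size, as in the text."""
    h = (t1 - t0) / n
    y, t = np.array(y0, dtype=float), t0
    for _ in range(n):
        k1 = f(t, y)
        k2 = f(t + h / 2, y + h / 2 * k1)
        k3 = f(t + h / 2, y + h / 2 * k2)
        k4 = f(t + h, y + h * k3)
        y = y + h / 6 * (k1 + 2 * k2 + 2 * k3 + k4)
        t += h
    return y

f_hom = lambda t, y: A @ y          # homogeneous system, cf. (7.10)/(7.11)
f_inh = lambda t, y: A @ y + g(t)   # full system (7.9)

# Monodromy matrix Phi(T): integrate the matrix IVP (7.11) columnwise.
n = 2000
Phi_T = np.column_stack([rk4(f_hom, e, 0.0, T, n) for e in np.eye(2)])

# Particular solution ypart(T) with ypart(0) = 0, cf. (7.12).
y_part_T = rk4(f_inh, np.zeros(2), 0.0, T, n)

# (7.14)/(7.15): (I - Phi(T)) y(0) = ypart(T), uniquely solvable here.
y0 = np.linalg.solve(np.eye(2) - Phi_T, y_part_T)

# Check the T-periodicity condition (7.13): marching from y0 over one
# period returns to y0 up to rounding errors.
residual = np.linalg.norm(rk4(f_inh, y0, 0.0, T, n) - y0)
print(residual)
```

Since the stand-in system is asymptotically stable, I − φ̂(T) is well-conditioned and the computed orbit closes up to rounding errors; the pathologies discussed in the following subsections arise precisely when this is no longer the case.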
7.3 ON CONFUSING RESULTS FOR THE ORIGINAL CLASS OF MODELS
In his doctoral dissertation [12], D. Cordes applied Enclosure Methods for an investigation of a simulation with N = 2 bodies and n = 4 degrees of freedom. This simulation can be illustrated by means of Figure 7.6 provided rotational attachments replace the portions of the shafts which, in Figure 7.6, possess the coordinates φ3 and φ4. For a large set of values ω, Cordes
• enclosed the true dynamic force F*dyn = F*dyn(ω) and
• verified the condition (7.17) for asymptotic stability for almost all values of ω in the chosen set.
Cordes' quantitative results are practically identical with the ones in the doctoral dissertations of Küçükay [20] and Naab [24], where traditional numerical methods had been used earlier for this model with N = 2 and n = 4 as investigated by Cordes. For the author's research project with support by DFG (since 1987), it was planned
• to investigate models with n ∈ {2,4,6,8,10,12} in the original class of models;
• to enclose or approximate F*dyn(ω) by use of the representation (7.16) of y*per;
• to compare results of Enclosure Methods and traditional numerical methods
as applied to these models;
• to use the codes which had been prepared and applied by D. Cordes [12].
In this context, M. Kölle [19] (1988) employed
• a classical Runge-Kutta method [I.48] for the determination of an approximation of the fundamental matrix φ*, for t ∈ [0,T], and
• a Gauss-Jordan method for an approximation of (I − φ̂(T))⁻¹.
Kölle computed results for approximately 10⁴ different choices of the vector v of the input data, with ω one of its components. His results can be summarized as follows:
(A) for roughly 50% of the chosen vectors v, ρ(φ̂(T,v)) < 1;
(B) for almost all choices of v, the matrix (I − φ̂(T,v))⁻¹ was computable, resulting in a meaningful dependency on v;
(C) the computed approximations F̃dyn(v) agreed fairly well with the available physical observations for Fdyn(v), [14].
Simultaneously with Kölle's work and for all chosen input vectors v, U. Schulte's [29] corresponding attempts concerning the original class of models failed
(a) to verify that ρ(φ*(T,v)) < 1,
(b) to enclose (I − φ*(T,v))⁻¹.
Therefore, in 1989, U. Schulte [29] investigated the properties of the true solutions of the systems of ODEs (7.2) under consideration. She arrived at and proved Theorem 7.28 (in Appendix 7.8). Consequently, for all choices of v and all models in the original class, λ = 1 is an eigenvalue of φ*(T). Therefore,
ρ(φ*(T,v)) = 1 and (I − φ*(T,v))⁻¹ does not exist.
Concerning (A) versus (a): Since ρ(φ*(T,v)) = 1, an appeal to statistics suggests that the computed values of ρ(φ̂(T,v)) are expected to exceed one in approximately 50% of all cases and to be smaller than one in the remaining cases.
Concerning (B) versus (b): it is concluded
• for the approximately 10⁴ choices of the input vector v, that the classical Runge-Kutta method as used by M. Kölle almost always failed to recognize property (b) but, rather, yielded "approximations" (I − φ̂(T,v))⁻¹ of the non-existing true matrix (I − φ*(T,v))⁻¹;
• this is in contrast to the failure message which was given in every attempt to enclose (I − φ*(T,v))⁻¹.
The systematically incorrect nature of the computed results for
(I − φ̂(T,v))⁻¹ was not suspected initially because of the property (C). In fact, (C) may be sufficient for an engineer to accept computed results for (I − φ̂(T,v))⁻¹ and thus F̃dyn(v), without asking any further questions. It cannot be ruled out, though,
• that property (C) happens to be present in the case of the available physical observations but
• that this property may be absent for different real-world problems leading to models in the original class.
In view of the reliability question of discretizations, it is mandatory to determine the reason for the almost consistent failure of the classical Runge-Kutta method to recognize property (b). This issue has recently been investigated by H. Keppler [18]. In an example for N = n = 2 belonging to the original class of models, he employed Enclosure Methods for the determination of columnwise enclosures [φ(t)] of the true fundamental matrix φ*(t) for t ∈ [0,T]. This model is illustrated by Figure 7.6 provided the attention is confined to the bodies representing the gears and, additionally, only to their absolute rotational coordinates φ1 and φ2. Concerning the subsequent discussion, it is assumed that r1φ1 is the ν-th component of y: ℝ → ℝ⁴ and r2φ2 is the μ-th component of the vector y. Keppler recognized that the ν-th and the μ-th column vectors of the computed interval matrices I − [φ(T)] ⊂ L(ℝ⁴) differ by vectors comparable with the widths of the enclosures [φ(T)]. Now, (7.14) is replaced by the following system of linear algebraic equations for the determination of an approximation ỹper(0) of y*per(0):
(7.18) (I − φ̂(T,v)) y(0) = ŷpart(T).
Since I − φ̂(T) is almost singular, generally there are large errors in the computed approximation ỹ(0) =: ỹper(0) of a solution of (7.18). This is in particular true concerning the components r1φ̃1(0) and r2φ̃2(0) of ỹ(0). In Keppler's example, the individual errors cancelled each other almost totally in the sum r1φ̃1(0) + r2φ̃2(0), which is part of the computed approximation s̃(0) for the tooth deformation s*(0), with s given in (7.7b).
Starting with the computed vectors ỹper(0), Keppler employed a marching method for the determination of approximations ỹper(t) of y*per(t) for t ∈ [0,T]. The individual components of ỹper(t) then are as erroneous as the components of ỹ(0). Keppler recognized that the individual errors of the components of r1φ̃1(t) and r2φ̃2(t) cancel each other almost totally in the expression for their sum, which is part of the expression for s̃(t).
In additional examples for n > 2, Keppler reached identical conclusions. Consequently, Keppler's numerical experiments may serve as a heuristic explanation for the "success" (A), (B), and (C) of Kölle's numerical work [19] referred to before. This hereby completed subsection in the case history under discussion suggests the following conclusions:
• it highlights the unreliability of discretizations that are not totally error-controlled;
• the observed almost total cancellation of the influences of the individual errors in the terms contributing to the sum for s̃ should not be interpreted as an indication that this "saving grace" is always to be expected in a "meaningful" mathematical simulation of real-world problems;
• the unreliability of the employed discretization would have been revealed in Kölle's work [19] if the computed individual components of φ̂(t) had been inspected rather than only s̃(t).
Keppler's numerical experiments [18] referred to before have been carried out by use of the classical Runge-Kutta method with a fixed step size and by use of a given computer and compiler. Concerning the sign of 1 − ρ(φ̂(T)) and the computability of (I − φ̂(T))⁻¹, the influence of these "tools" has been investigated by means of numerical experiments to be discussed in the next subsection.
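The mechanism behind these contradictory observations can be reproduced with a toy computation (the matrix below is a hypothetical 2×2 illustration, not Kölle's or Schulte's data): if Φ carries the exact eigenvalue λ = 1 asserted by Theorem 7.28, then I − Φ is singular and a linear-system solver reports the failure, whereas a perturbation of the size of a discretization error makes the system "computable" without any failure message:

```python
import numpy as np

# Hypothetical monodromy matrix with the exact eigenvalue 1
# (algebraic multiplicity 2, geometric multiplicity 1, cf. Theorem 7.28).
Phi = np.array([[1.0, 1.0],
                [0.0, 1.0]])
b = np.array([1.0, 1.0])

try:
    np.linalg.solve(np.eye(2) - Phi, b)
    solved_exact = True
except np.linalg.LinAlgError:
    solved_exact = False   # the exact matrix I - Phi is recognized as singular

# A perturbation of size 1e-9 stands in for the discretization error of a
# Runge-Kutta approximation Phi_hat of Phi.
Phi_hat = Phi + 1e-9 * np.ones((2, 2))
y = np.linalg.solve(np.eye(2) - Phi_hat, b)   # no failure message is produced

print(solved_exact)        # False
print(np.linalg.norm(y))   # huge: a computable but meaningless "approximation"
```

In an enclosure computation, by contrast, the interval matrix I − [φ(T)] contains the singular true matrix, so every attempt to enclose the inverse necessarily ends with a failure message, which is exactly the discrepancy between Kölle's and Schulte's results.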
7.4 FURTHER NUMERICAL EXPERIMENTS ON PRACTICAL FAILURES OF DISCRETIZATIONS
Concerning the numerical experiments referred to at the end of Subsection 7.3, a first series has been conducted by U. Schulte [29]. In the original class of models, she chose an (unstable) simulation with N = 2 bodies and n = 2 degrees of freedom. The input data of the differential operators have been chosen such that the corresponding model in the relative class (a Mathieu ODE) belongs to the domain of asymptotic stability, with even a considerable distance from the boundary of stability. For the numerical experiments, Schulte [29] used a classical Runge-Kutta method with various choices of the step size h and executions by means of the following compilers, languages, and computers:
Computer                               Compiler/Language                              Short Notation
HP Vectra with an 80386/87 processor   PASCAL-SC without a utilization of interval    PASCAL
                                       arithmetic or the optimal scalar product
IBM 4381 (VM)                          WATFOR 87, Version 3, FORTRAN 77               WATFOR
IBM 4381 (VM)                          FORTVS, FORTRAN 77                             FORTVS

Table 7.19

By use of these three computer-based executions of the classical Runge-Kutta method, approximations φ̂ of the true fundamental matrix φ*: ℝ → L(ℝ⁴) were determined for t ∈ [0,T] with T scaled to one. The employed values of the step size h are listed in Table 7.20. The value of h = 0.00390625 was chosen since this is a machine number in the hexadecimal system; i.e., then there is no error in the representation of the period T = 1. Enclosure methods were used as follows:
(a) the Cordes-Algorithm for the determination of an enclosure [ρ(φ̂(T))];
(b) ACRITH [17] or PASCAL-SC [11] routines for the enclosure [(I − φ̂(T))⁻¹] of (I − φ̂(T))⁻¹ if this was computationally possible;
(c) for the purpose of comparisons, the Lohner-Algorithm for IVPs ([I.31], [I.32]) in conjunction with ACRITH [17] and an execution by means of the IBM 4381 computer addressed in Table 7.19.
For i, j = 1(1)4, the widths of the computed enclosures [φij(T)] = [φ̲ij(T), φ̄ij(T)] are smaller than approximately 10⁻⁹, with |φ̄ij(T)| exceeding one only insignificantly. In dependency on h and the executions addressed in Table 7.19, Table 7.20 lists the following computed results, which are represented by the symbols:
+ : if φ̂ij(T) ∈ [φij(T)] is true for the element φ̂ij of φ̂(T), with i, j = 1(1)4;
− : correspondingly, if φ̂ij(T) ∉ [φij(T)] is true for φ̂ij;
S : if [ρ(φ̂(T))] < 1 is true;
I : if [(I − φ̂(T))⁻¹] was computable.
PASCAL   h = 0.00390625
         ++++++++++++++++
         ++--++--++++++++
         ++--++--++++++++
         ++--++----------
WATFOR   h = 0.001, 0.00390625, 0.005, 0.01
         ++--++---+--+---
         ++--++----++--++
         ----------++--++   S
FORTVS   h = 0.001, 0.00390625, 0.005, 0.01

Table 7.20

For the true fundamental matrix φ*(T), there hold
(7.21) ρ(φ*(T)) = 1 and (I − φ*(T))⁻¹ does not exist.
Consequently, a "computational determination" of the properties I or S is caused by the errors of φ̂(T). These errors are smaller than approximately 10⁻⁹ provided there is a row of 16 plus-signs in Table 7.20. An inspection of this table indicates the following major results:
• in the case of "PASCAL", the accuracy of the approximations φ̂ij(T) increases as h decreases; this is not necessarily so in the case of "FORTVS";
• for all choices of h, the incorrect property I has been "shown" by use of "FORTVS"; the incorrect property S then was "shown" only in 50% of the four cases (compare a corresponding result in Kölle's work [19]);
• it is of particular interest that the incorrect property I was "shown" when "PASCAL" was used in conjunction with a relatively small step size.
Summarizing Schulte's numerical experiments [29], it is observed that |φ̂ij(T) − φ*ij(T)| < 10⁻⁹ for i, j = 1(1)4 was not small enough to avoid the computability of an enclosure [(I − φ̂(T))⁻¹] implying the existence of (I − φ̂(T))⁻¹, with the true matrix I − φ*(T) not being invertible.
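The way discretization errors fake the properties S and I can be isolated in a minimal example with the structure (7.21) (a textbook harmonic oscillator, not Schulte's model): for y″ = −y with period T = 2π, the true monodromy matrix is exactly I, so ρ(φ*(T)) = 1 and I − φ*(T) is not invertible; nevertheless, the classical Runge-Kutta method with a fixed step size delivers ρ(φ̂(T)) < 1 (the spurious property S) and a computable (I − φ̂(T))⁻¹ (the spurious property I):

```python
import numpy as np

# y' = A y with A = [[0,1],[-1,0]]: harmonic oscillator, period T = 2*pi.
# The true monodromy matrix is Phi*(T) = exp(A T) = I, hence rho = 1 and
# I - Phi*(T) is singular -- the situation (7.21).
A = np.array([[0.0, 1.0], [-1.0, 0.0]])
T = 2.0 * np.pi
n = 63                       # fixed step size h = T/63
h = T / n

# One step of the classical Runge-Kutta method for the linear system
# y' = A y is the linear map R = I + hA + (hA)^2/2 + (hA)^3/6 + (hA)^4/24.
hA = h * A
R = (np.eye(2) + hA + hA @ hA / 2
     + hA @ hA @ hA / 6 + hA @ hA @ hA @ hA / 24)
Phi_hat = np.linalg.matrix_power(R, n)   # approximation of Phi*(T) = I

rho_hat = max(abs(np.linalg.eigvals(Phi_hat)))
print(rho_hat < 1.0)         # True: the spurious property "S"

# The spurious property "I": I - Phi_hat is numerically invertible, and the
# large "solution" is returned without any failure message.
y = np.linalg.solve(np.eye(2) - Phi_hat, np.array([1.0, 0.0]))
print(np.linalg.norm(y) > 1e3)   # True: computable but meaningless
```

Here the defect of φ̂(T) from I is of size O(h⁴); for any fixed floating-point precision, the computed matrix I − φ̂(T) remains numerically invertible, so a pure floating-point computation can never exhibit the true singularity.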
Conclusion: The chance of a computational determination of the (invalid) property I increases as the accuracy of the computed approximations φ̂ij decreases.
Remark: Inaccurate computational tools may increase the chances of always producing "results"!
Subsequent to the completion of Schulte's dissertation [29], H. Keppler [18] carried out numerical experiments for
(α) an (unstable) model with N = 4 and n = 6 in the original class of models and
(β) the corresponding (asymptotically stable) model with N = 4 and n = 5 in the class of relative models.
In both cases, ω = 2400 was chosen. Keppler employed the following numerical methods:
(A) a classical Runge-Kutta method with step size h = T/100 and evaluations with double precision (REAL 8) for the determination of approximations (φ̂ij(t)) = φ̂(t) and (ŷpart,i(t)) = ŷpart(t) for t ∈ [0,T]; as compared with the corresponding enclosures, the errors of φ̂ij(T) and ŷpart,i(T) are very small;
(B1) an approximation of (I − φ̂(T))⁻¹ by use of a Gauss-Jordan method and an evaluation in REAL 8; this yields ỹ(0) = (I − φ̂(T))⁻¹ ŷpart(T);
(B2) an enclosure [y(0)] of the solution y(0) of the system (I − φ̂(T)) y(0) = ŷpart(T), making use of the subroutine library ACRITH [17]; the distance of the bounds for the components of [y(0)] is very small;
(C) starting from ỹ(0) or [y(0)], an approximation ỹper(t) of the true T-periodic solution y*per(t) was determined for t ∈ [0,2T], making use of a marching method with an execution just as in the case of (A);
(D) concerning ỹ(0) or [y(0)], the results of (C) were compared at times tj ∈ [0,T] and tk ∈ [T,2T] such that tk − tj = T. For the components ỹper,i(t) of ỹper(t), this yields differences
Gi(t) := ỹper,i(t + T) − ỹper,i(t), with t ∈ [0,T] and i = 1(1)6 (or i = 1(1)5) in the case of model (α) (or model (β)).
These components Gi define an error vector G = G(t) characterizing the deviation of ỹper from a true T-periodic state. An upper bound of the supremum norm ‖G(t)‖∞ of G for t ∈ [0,T] is presented in Table 7.22.
Concerning the last column, see Subsection 7.8.
For the determination of ỹper,   model   For t ∈ [0,T]:   Is the true result
employment of method                     ‖G(t)‖∞ <        for y*per unique?

(B1)                             (α)     10⁻⁴             no
                                 (β)     10⁻¹³            yes
(B2)                             (α)     10⁻¹⁶            no
                                 (β)     10⁻¹⁶            yes

Table 7.22

Conclusions from Table 7.22:
Case (B1): Even though I − φ*(T) is not invertible in the case of model (α), a "meaningful approximation" ỹper can be determined by use of (I − φ̂(T))⁻¹. In fact, ‖G‖∞ then is much smaller than the expected accuracy of an engineering approximation of a T-periodic state.
Case (B2): Even though the errors of φ̂ij(T) and ŷpart,i(T) are very small, enclosures [y(0)] were computable in the case of model (α). The corresponding results for ‖G(t)‖∞ are very small.
Tables 7.20 and 7.22 serve as additional explanations for Kölle's almost consistent "success" in computing "approximations" (I − φ̂(T))⁻¹ of non-existing matrices (I − φ*(T))⁻¹.

7.5 ON SATISFACTORY RESULTS FOR THE CLASSES OF THE MODIFIED AND THE RELATIVE MODELS
A satisfactory agreement of computed enclosures and approximations (by means of traditional numerical methods) has always been observed for asymptotically stable models either
(A) in the class of relative models (with α = 0) or
(B) in the class of modified models (with any α > 0).
In a comparison of results for corresponding models in the cases (A) and (B), a satisfactory coincidence of the results for the tooth deformation has always been observed provided α ∈ (α̲, ᾱ), where the bounds α̲ and ᾱ refer to an employment of REAL 4. The influence of the terms with coefficient α in the ODEs (e.g., (7.7a)) is
• absent if α << α̲ because of the employed floating-point number representation, or
• noticeable within graphical accuracy for α > ᾱ.
Concerning applications of traditional numerical methods, the discussions in Subsections 7.4 and 7.6 demonstrate the significant improvement of results when a model in the original class is replaced by a model in the modified or the relative class. A satisfactory and reliable agreement of computed and (physical) experimental results is a major goal of a simulation. In the present context, particularly results for the frequencies ω of the resonance excitations of the force ratio Fdyn(ω)/Fstat are to be compared, where
• Fdyn is defined in (7.3) and
• Fstat represents a static force as following from the system of ODEs when the derivatives are replaced by zero and c is replaced by its time-averaged value.
The simulation under discussion is not concerned with the design of a gear drive to be built. Rather, available (physical) experimental results for Fdyn(ω)/Fstat are to be approximated "optimally" within the available set of models in the relative class with n ∈ {5,7,9,11} degrees of freedom. The solid line in Figure 7.23 exhibits the experimental results which were provided by the Institute for Machine Elements of the Technical University of Munich ([14] and [23]). Concerning the test rig employed for the physical results, these authors are the sources for the input data as used in the set of models. The dashed line in Figure 7.23 represents the pointwise computed "optimal" result for the choices of N = 4 and n = 7. For almost all ω in the prescribed interval [500, 4000], this model is asymptotically stable. The dashed line was determined by use of
• a classical Runge-Kutta method with step size h = T/100 [18] or an Enclosure Method [29], with practically coincident results and both of them as applied to
• the representation (7.16) of the desired T-periodic solution y*per = y*per(t,ω) and the corresponding expression for the force ratio Fdyn(ω)/Fstat.
[Figure 7.23 appears here.]

Figure 7.23: Force ratio Fdyn/Fstat as a function of ω.
Solid line: physical measurements as provided by the Institute for Machine Elements of the Technical University of Munich ([14], [23]); dashed line: computed results for the choice of n = 7 degrees of freedom.

The computed maximum at ω ≈ 1200 corresponds to the separately measured maxima at ω ≈ 1000 and ω ≈ 1400. With the exception of ω ≈ 3740, there is a satisfactory agreement of the computed and the experimental frequencies of the local maxima of the curves in Figure 7.23, i.e., the resonance excitations. The excitation at ω ≈ 3470 is a consequence of pitch deviations of the teeth, whose influences are not accounted for here but which were additionally analyzed in [3].
Remark: The discussions at the beginning of the present subsection have demonstrated the reliability of the traditional numerical method that has been used for the determination of the dashed curve in Figure 7.23.

7.6 ON APPLICATIONS OF MARCHING METHODS
Traditionally in the engineering analysis of vibrations of gear drives, approximations F̃dyn(ω)/Fstat have been determined by means of marching methods as applied to models in the original class. Generally, values of this ratio of forces are needed for a large set of values of ω. Because of continuity, small increments of ω
cause correspondingly small changes of the desired starting vector ỹper(0,ω) and, therefore, they provide a close approximation for this vector for a new choice, ω̂, of ω such that ỹper = ỹper(t,ω̂) is almost T-periodic already for small t, analogously to the cases covered by Table 7.22. Consequently, the total cost of the computational assessment of the desired set of values F̃dyn(ω)/Fstat may be bearably small when marching methods are employed in the context under discussion.
Concerning an employment of the original class of models, serious doubts with respect to its reliability are motivated by the discussion in Subsection 7.4. In fact, the Conclusions at the end of Appendix 7.8 demonstrate the unreliability of marching methods in conjunction with the original class of models. This unreliability problem will now be investigated by means of the following example, which is due to H. Keppler [18]:
(A) an unstable model in the original class with N = 4 and n = 6 was chosen and
(B) the corresponding asymptotically stable model with N = 4 and n = 5 in the class of relative models;
(C) in both cases, the frequency ω = 2400 was selected.
The marching method was executed by means of
• a classical Runge-Kutta method with step size h = T/100 and
• evaluations on the basis of either a simple precision (REAL 4) or a double precision (REAL 8);
• in contrast to the situation in Table 7.22, the starting vector y(0) now is not close to the true T-periodic solution.
Approximations s̃ of the tooth deformation s were determined by use of approximations ỹ which had been computed for the cases of n = 6 and n = 5, respectively. The results for s̃ were compared at times tj ∈ [998T, 999T] and tk ∈ [999T, 1000T] such that tk − tj = T. This yielded differences
D(t) := s̃(999T + t) − s̃(998T + t) with t ∈ [0,T].
An upper bound of the supremum norm ‖D(t)‖∞ of D(t) for t ∈ [0,T] is presented in Table 7.24.
Employment of model                          Employment of precision
                                             REAL 4:         REAL 8:
                                             ‖D(t)‖∞ <       ‖D(t)‖∞ <
with n = 6 in the original class of models   10⁻⁴            10⁻¹³
with n = 5 in the relative class of models   10⁻⁷            10⁻¹³
Table 7.24

Even though the model in the original class is unstable, the correspondingly ill-conditioned nature of the employed marching method does not show up in the results for s̃ in this example. This is not necessarily so in other examples.
For the model with n = 5, Keppler [18] additionally compared approximations ỹ of the T-periodic solution y*per which were computed as follows by means of a classical Runge-Kutta method, either
(a) employing the representation (7.16) of y*per with numerical evaluations for t ∈ [0,T] in conjunction with a Gauss-Jordan method for the approximation of (I − φ̂(T))⁻¹, or
(b) in the context of a marching method with an execution for t ∈ [0, 1000T].
Keppler [18] obtained the following results, making use of E(t) := ỹ(t + T) − ỹ(t) for t ∈ [0,T] or D(t) as defined before:
(7.25) ‖E(T)‖∞ < 10⁻⁴ in case (a), and a substantially smaller bound for ‖D(T)‖∞ in case (b).
In the two cases being compared, the computed approximations ỹ(2T) and ỹ(1000T), respectively, agreed only within a modest precision with the corresponding enclosure [y*per(kT)] = [y̲*per(kT), ȳ*per(kT)], whose width ȳ*per(kT) − y̲*per(kT) is approximately 10⁻¹⁰, with k = 2 or 1000 in the two cases being compared.
Remark: The relatively poor performance in case (a) is a consequence of a loss of accuracy in the computational approximation of (I − φ̂(T))⁻¹. The marching method in case (b) here is not ill-conditioned because of the employment of a model (with n = 5) in the relative class of models. The conclusions drawn in [4] are partly superseded by the ones presented here on the basis of Keppler's recent numerical work [18].
7.7 CONCLUDING REMARKS
In the execution of the research reported here, initially the diagnostic power of Enclosure Methods was of decisive importance in order to discover the true nature of the deceivingly convincing but totally misleading set of Kölle's almost consistently successful computational results; see the discussions at the end of Subsection 7.3. These difficulties pertain to the original class of models.
(A) They were understood by means of
• Schulte's [29] Instability Theorem 7.28 and
• the qualitative analysis of the fundamental system of the ODEs, see (7.30) and the subsequent discussions in the Appendix 7.8.
(B) They were removed by means of the replacement of the original class of models by either the class of the modified or of the relative models.
The employment of a model in the original class causes the following difficulties:
(i) "approximations" of non-existing functions have to be determined when the boundary condition (7.13) of T-periodicity is used (Subsection 7.2), or
(ii) approximations of a solution of an unstable IVP have to be determined in the context of a marching method (Subsections 7.6 and 7.8).
The presented examples exhibit surprisingly accurate approximations, both by means of (i) or (ii). Nevertheless, models in the original class are obviously unreliable.
The research reported here is a case study illustrating that there are numerous and interrelated problems in the following contributing domains:
(a) the physical simulation of the real-world problem,
(b) the qualitative mathematical properties of the resulting simulation,
(c) the (potential) unreliability of traditional numerical methods,
(d) the unknown reliability problems of the employed hardware and software.
The significance of the areas (a) and (b) is characterized by the demonstrated comparability of the influences of
• numerical errors as small as 10⁻⁹ and
• "macroscopically large" uncertainties in the physical simulation.
The problem areas (c) and (d) can be removed by means of an employment of Enclosure Methods in conjunction with a supporting computer language. The rotational attachment (α > 0) of the models in the modified class is not only of interest for the purposes of stability of the homogeneous ODEs. Rather, this attachment simulates the always present mechanical coupling of the gear drive to
its environment, i.e., the adjacent clutches. Correspondingly, it is claimed that the instability and the ill-conditioned nature of models in the original class are consequences of the physically inadequate simulation of their couplings with the environment, which do not allow exchanges of bending moments or torques. Just as here in the case of vibrations of gear drives, there is frequently more than one candidate for a meaningful physical simulation. The comparison of available candidates may enable a user to find a special one that is well suited, both mathematically and numerically. In particular, a user should be sufficiently familiar with the area (a) in order to be able to recognize a situation where
• it is sufficient to confine one's attention to a special function such as s*per rather than
• to search for a more general function which is not really needed, such as y*per.
7.8 APPENDIX: MATHEMATICAL SUPPLEMENTS
For all models in the original class of models, Schulte [29] has shown that
(7.26) the matrices A, C, and D in (7.2) possess a vanishing eigenvalue with algebraic multiplicity one and an eigenvector that is independent of t, compare (7.8), and
(7.27) for all t ∈ ℝ, Â(t) in (7.9) possesses a vanishing eigenvalue with algebraic multiplicity two, geometric multiplicity one, and an eigenvector that is independent of t.
Schulte [29] has proved the following Instability Theorem:
Theorem 7.28: Under the conditions on Â = Â(t) in (7.27),
(i) λ = 1 is an eigenvalue of φ*(T) with an algebraic multiplicity of two and a geometric multiplicity of one;
(ii) the system of ODEs in (7.9) is unstable in the sense of Lyapunov.
Schulte [29] has carried out the proof of (i) by means of a similarity transformation of the ODEs (7.9) and a subsequent representation of the fundamental matrix φ* such that the properties as stated in (i) can be directly seen. The proof of (ii) then follows from a theorem in [I.9, p.58] asserting that
systems of linear homogeneous ODEs with T-periodic coefficients possess unbounded solutions as t → ∞ if there holds either
(7.29) (a) there is an eigenvalue λ of φ*(T) with |λ| > 1, or
(b) φ*(T) has an eigenvalue λ = 1 whose algebraic multiplicity exceeds its geometric multiplicity.
According to [I.9, p.7], boundedness as t → ∞ and stability in the sense of Lyapunov are equivalent properties for systems of linear homogeneous ODEs. Because of (7.29), the nonhomogeneous ODEs (7.9) cannot be stable in the sense of Lyapunov, i.e., they are unstable in this sense. In view of marching methods as applied to (7.2) or (7.9), the kind of growth as t → ∞ is of interest. The eigenvalue λ = 1 of φ*(T) induces the following true solutions of the homogeneous ODEs (7.9) [35, p.96]:
(7.30) y⁽¹⁾*(t) := q⁽¹⁾(t) with q⁽¹⁾(t + T) = q⁽¹⁾(t) for all t ∈ ℝ,
y⁽²⁾*(t) := t·q⁽¹⁾(t) + q⁽²⁾(t) with q⁽²⁾(t + T) = q⁽²⁾(t) for all t ∈ ℝ.
In view of vibrations of real gear drives, it is assumed for models in the original class that λ = 1 is the eigenvalue of φ*(T) yielding ρ(φ*(T)) = 1:
(7.31) |λ| < 1 is true for all eigenvalues of φ*(T) other than λ = 1.
The fundamental system of solutions for (7.9) then consists of (7.30) and 2n − 2 linearly independent solutions approaching zero as t → ∞.
In view of the simulation under discussion, the fundamental system and the fundamental matrix φ* have the following properties:
(i) the choice of q⁽¹⁾ ∈ ℝ²ⁿ allows a representation of a uniform rotation, ±ωjt;
(ii) q⁽²⁾ then may represent the non-constant and non-trivial T-periodic solution of the homogeneous ODEs, which exists if and only if λ = 1 is an eigenvalue of φ*(T) [I.9, p.59]; for all c ∈ ℝ, then cq⁽²⁾ is also a solution;
(iii) as t → ∞, φ*(t) therefore possesses a linear growth;
(iv) in agreement with this property and because of (7.31), it is observed that λ = 1 is on the boundary of stability separating the domain of instability (with an exponential growth of some solutions as t → ∞) from the domain of asymptotic stability (where all solutions approach zero asymptotically as t → ∞).
Because of Theorem 7.28, the nonhomogeneous ODEs (7.9) have the following properties for all models in the original class:
• the representation (7.16) of y*per does not exist;
mathematical theory (e. g., [35,p.102-109] asserts the existence of (infinitely many) solutions Y;~~(O) of (7.14), provided certain orthogonality conditions are satisfied; a verification of these conditions requires the (unavailable) exact knowledge of @*; in fact, because of the uniform rotation ujt, the employment of the absolute rotational coordinates precludes the existence of periodic solutions with any period. For engineering purposes, it is not y;er but, rather, a T-periodic
tooth deformation s ; ~that ~ is to be determined. As has been observed concerning (7.7), there is no contribution of wjt to the expression for s. Consequently, s ; ~can ~ be approximated by means of a marching method, even in the case of an employment of a model in the original class of models, see Subsection 7.6. Now, models in the classes of the modified or the relative models will be considered. Applications of Enclosure Methods have shown that, generally, X =1 is not an eigenvalue of @*(T)in the case of any model in these two classes. Again, in view of vibrations of real gear drives, it is now assumed that (7.32) P(@*(T))< 1. This condition can be verified by means of Cordes’ (Enclosure) Algorithm [12]. Then there exists one and only one T-periodic solution (7.33) y;er with its unique representation in (7.16), which yields
s*_per by use of the expression for s in (7.3). Prior to Figure 7.6, there is a discussion concerning the mathematical equivalence of the original class and the class of relative models. Consequently, (7.33ii) is still true in the case that a model in the original class is employed, and s*_per then may be determined by means of a marching method:
• this computational approximation of s*_per is ill-conditioned because of the instability asserted by Theorem 7.28;
• however, the growth of the influences of the computationally preceding numerical errors is (at most) linear in t; consequently, these influences may be small provided the employed numerical precision is sufficiently large in view of the interval of time employed in the execution of the marching method.
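The contrast between this (at most) linear error growth and the exponential growth of the genuinely unstable case can be illustrated by a small experiment (an illustration added here, not taken from the paper; the matrices are hypothetical stand-ins): a perturbation δ of the initial data is propagated by the fundamental matrix, which grows only linearly in t for a double eigenvalue with a Jordan block, but exponentially when an eigenvalue has positive real part.

```python
import numpy as np

# Hypothetical stand-ins (not from the paper): phi_jordan is the fundamental
# matrix of y' = [[0, 1], [0, 0]] y (Jordan block, linear growth in t);
# phi_unstable belongs to an ODE with an eigenvalue of positive real part.

def phi_jordan(t):
    return np.array([[1.0, t], [0.0, 1.0]])

def phi_unstable(t, mu=0.5):
    return np.array([[np.exp(mu * t), 0.0], [0.0, np.exp(-mu * t)]])

delta = np.array([1e-8, 1e-8])        # perturbation of the initial data

for t in (10.0, 100.0):
    print(t,
          np.linalg.norm(phi_jordan(t) @ delta),     # grows linearly in t
          np.linalg.norm(phi_unstable(t) @ delta))   # grows exponentially in t
```

With double precision, the linearly growing influence of an initial error of size 1e-8 is still only about 1e-6 at t = 100, whereas the exponentially unstable direction has amplified it far beyond 10^5; this is the quantitative content of the remark above.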
Discretizations: Practical Failures
Conclusions concerning all models in the original class:
• there is no T-periodic solution y*_per, as has been asserted in [I.3] and [4];
• there exists one and only one T-periodic tooth deformation s*_per, whose computational determination by means of marching methods is ill-conditioned;
• therefore, suitable computational precautions are called for.

8. ON SPURIOUS DIFFERENCE SOLUTIONS CONCERNING ODEs OR PDEs

8.1 INTRODUCTION OF THE SUBJECT
In the present section, the following classes of problems will be considered:
(I) DEs with side conditions and (if they exist) true solutions y*, and
(II) discretizations with respect to (I), depending on a step size h such that there are true difference solutions ỹ = ỹ(h); in the case of PDEs, there are several step sizes.
Because of their definition,
• difference solutions are error-free with respect to a system of equations in a Euclidean space; however,
• the local discretization error is ignored which relates (II) to (I).
Then, there are the following major questions:
(A) assuming the existence of y*, whether a sequence of h-dependent difference solutions ỹ(h) converges to y* in a pointwise sense;
(B) whether there is a quantitative estimate for the distance of y* and ỹ(h); compare Section 4 for the case of ODEs;
(C) whether there is a qualitative agreement of the solutions y* and ỹ(h) such that, e.g., both are periodic with periods that do not differ greatly.
According to their definition in Section 2, spurious difference solutions, ỹ_sp =
ỹ_sp(h),
• pertain to a pair of related problems (I) and (II);
• for various reasons, they do not possess the properties (A)-(C), as will be discussed.
A comparison of the solutions of problems (I) and (II) can be characterized by Stuart's observation [I.50, p.201]: "The dynamics of discretisations (which are coupled iterated maps) are generally far more complicated than the dynamics of their continuous counterparts (which are differential equations)." This is caused by
the presence of the artificial parameters in the discretization. In view of this situation, the following expression has been coined at an IMA conference in 1990: "The dynamics of numerics and the numerics of dynamics."
An h-dependent sequence of spurious difference solutions, ỹ_sp(h), is considered which is real-valued for all h ∈ (h̲, h̄), where either
• h̲ = 0 or
• h̲ is a fixed and positive quantity which, generally, is not known in an a priori fashion.
Provided a sufficiently close approximation ŷ = ŷ(h) of a difference solution ỹ = ỹ(h) has been determined computationally for a fixed h, then the possibility of its spurious character can be tested heuristically in suitable cases such as the following:
(α) if ỹ is constant, it can be immediately seen whether or not ỹ satisfies all equations of problem (I);
(β) a kh-periodic difference approximation ŷ(h) with any k − 1 ∈ ℕ is considered, possessing this property for all h in a certain interval; since the period of ŷ then depends continuously on the artificial parameter h, ŷ(h) cannot be an approximation of a true periodic solution y* of (I);
(γ) difference solutions ỹ varying on a scale comparable to the grid are frequently spurious, Stuart [I.50, p.205];
(δ) sequences of difference solutions ỹ(h) or approximations ŷ(h) are considered such that any consistent difference quotient of the first order is unbounded as h → 0; generally, a sequence of approximations ŷ(h) then cannot be related to any true classical solution y*;
(ε) it is assumed that there is a first integral of the ODEs which does not depend on any derivative of a true solution y*; it is then immediately possible to test whether or not a difference solution ỹ (or approximation ŷ) satisfies this integral (approximately).
A pair of problems (I) and (II) is considered such that
• an asymptotically stable constant or periodic true solution y* of (I) possesses a finite basin (domain) of attraction, D_y* (i.e., the set of initial vectors of the true solutions which asymptotically approach y*), and
• a discretization (II) of (I) yields an approximation ỹ of y* such that there is a finite domain of attraction, D_ỹ, of ỹ.
Stable spurious difference solutions, ỹ_sp, attract a certain subset of initial vectors. Therefore, the existence of spurious difference solutions ỹ_sp of (II) may cause D_ỹ to be considerably smaller than D_y*. The potentially catastrophic computational consequences are enhanced by the fact that the existence of ỹ_sp is generally not known in an a priori fashion. This situation may arise in the context of an employment of a marching method for the purpose of a (transient) approximation of a steady-state solution y* of ODEs or PDEs. This is a popular numerical method in Computational Fluid Dynamics (CFD); compare the papers [I.56] and [I.57], and [32] by Yee et al. The quantitative examples in these papers are confined to stationary points y* of simple ODEs such that the domain of attraction, D_y*, is known. For various discretizations with chosen values of h, Yee et al. have shown that the corresponding domain of attraction, D_ỹ, is significantly smaller than D_y*.
Remark: Favorable conditions for the existence of kh-periodic spurious difference solutions with any k ∈ ℕ are
• nonlinear source terms in the DEs or
• discretizations with a high order of consistency and correspondingly complicated expressions.
Since spurious difference solutions, ỹ_sp, are true solutions of systems of equations in Euclidean spaces, it is possible to develop theories of solutions of this kind. This has been demonstrated by Iserles, Stuart et al., e.g., [I.25], [I.50], and [31]. In particular, theorems have been proved in several of these papers asserting the non-existence or the non-exclusion of spurious 1h-periodic or 2h-periodic difference solutions ỹ_sp for certain classes of discretizations such as implicit or multistep or Runge-Kutta methods.
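The shrinkage of D_ỹ relative to D_y* can be reproduced with a toy computation (an illustration added here; the Logistic ODE y' = y(1 − y) is chosen for convenience and is not necessarily one of the cases treated by Yee et al.): for the ODE, every y(0) > 0 belongs to the basin of the stationary point y* = 1, but for the explicit Euler discretization with a fixed step size h some of these initial values escape to infinity, i.e., they lie outside D_ỹ.

```python
# Explicit Euler for the Logistic ODE y' = y(1 - y); for the ODE itself,
# every y(0) > 0 lies in the basin D_{y*} of the stationary point y* = 1.

def euler_orbit(y0, h, steps=200):
    """Return the final iterate, or None if the orbit blows up."""
    y = y0
    for _ in range(steps):
        y = y + h * y * (1.0 - y)
        if abs(y) > 1e6:           # blow-up: y0 lies outside D_y~ although y0 is in D_y*
            return None
    return y

h = 1.5                            # fixed, practically chosen step size
print(euler_orbit(0.5, h))         # stays near y* = 1: inside D_y~
print(euler_orbit(2.5, h))         # escapes to -infinity: outside D_y~
```

Both initial values lie in D_y* of the ODE; only the first lies in D_ỹ of the discretization.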
The present section is
• less concerned with a theoretical structure but, rather, with
• the presentation of examples of sequences of h-dependent spurious difference solutions and
• discussions of their computational and practical importance in view of the title of this paper and its predecessor [5].
8.2 ON SPURIOUS DIFFERENCE SOLUTIONS FOR DISCRETIZATIONS OF ODEs WITH INITIAL CONDITIONS
The following class of IVPs is considered:
(8.1) y' = f(y) for t ≥ 0; f: D → ℝⁿ, D ⊂ ℝⁿ; f is sufficiently smooth; y(0) ∈ D.
Consequently, there are continuously differentiable (i.e., classical) solutions y* = y*(t, y(0)) ∈ D for intervals [0, t̂(y(0))). Concerning (8.1), any explicit discretization is considered such that there are difference solutions ỹ_j = ỹ_j(h, y(0)) ∈ D for j ∈ ℕ. It is assumed that all conditions of a theorem are satisfied which, for a fixed y(0), asserts the pointwise convergence to y* = y*(t, y(0)) of a sequence ỹ = ỹ(h, y(0)) as h → 0; compare Section 4. For any y(0) ∈ D and as h → 0, a sequence of difference solutions ỹ = ỹ(h, y(0)) then must be non-spurious in this limit. This then is not necessarily true for all finite h > 0. In fact, there may be a bifurcation point h₁ > 0 such that
(i) a real-valued spurious difference solution ỹ_sp = ỹ_sp(h) bifurcates from a non-spurious solution ỹ = ỹ(h, y(0)) and
(ii) the limit as h → 0 of this sequence ỹ_sp(h) perhaps is not in D.
Property (i) will now be demonstrated by means of an example, followed by a more general discussion of bifurcating h-dependent sequences of difference solutions.
Concerning bifurcating sequences of spurious difference solutions, a classical example rests on the Logistic ODE (4.8) with its set of monotonic true solutions y* = y*(t, y(0)) in (4.9). An application of the explicit Euler one-step discretization generates the famous Logistic (difference) Equation (4.12) in population dynamics [I.28], which is one of the traditional starting points in the theory of Dynamical Chaos. The standard form of this equation is [I.35]:
(8.2) η̃_{j+1} = Aη̃_j(1 − η̃_j) for j + 1 ∈ ℕ, where η̃_j := (h/(1+h))ỹ_j and A(h) := 1 + h.
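The substitution stated below (8.2) can be checked numerically; the following sketch assumes that (4.8) has the standard form y' = y(1 − y), so that the explicit Euler step y_{j+1} = y_j + h y_j(1 − y_j) is conjugate to the Logistic Equation under η_j = (h/(1+h)) y_j with A = 1 + h. For h > 2, i.e., A > A₁ = 3, both iterations settle on a (spurious) 2h-periodic solution although every true solution of the ODE is monotonic.

```python
def euler_step(y, h):              # explicit Euler for y' = y(1 - y)
    return y + h * y * (1.0 - y)

def logistic_map(eta, A):          # the standard form (8.2)
    return A * eta * (1.0 - eta)

h = 2.2                            # h > 2, hence A = 3.2 > A_1 = 3
A = 1.0 + h
y, eta = 0.3, (h / (1.0 + h)) * 0.3

for _ in range(500):               # let the transients die out
    y = euler_step(y, h)
    eta = logistic_map(eta, A)

print(abs((h / (1.0 + h)) * y - eta))   # the two iterations stay conjugate
print(eta, logistic_map(eta, A))        # the two points of the 2h-periodic solution
```

The second iterate of the map returns to eta (period 2h) although eta is not a fixed point; repeating the experiment with A slightly above A₂ produces a 4h-periodic solution in the same way.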
The h-dependent difference solutions η̃_j = η̃_j(h, η̃(0)) possess a monotonically increasing sequence of bifurcation points A_k with A_k := A(h_k) and A₁ = 3 such that [I.35], omitting the subscript k of h:
• at A_k, a stable real-valued (spurious) 2^k h-periodic difference solution bifurcates from a 2^(k−1) h-periodic real-valued difference solution which is unstable for A ≥ A_k;
• there is a point of accumulation, A_∞ ≈ 3.5700..., of {A_k};
• there is a point Â ≈ 3.8284... such that there are (spurious) kh-periodic difference solutions for A > Â and all k ∈ ℕ; additionally, there then is an uncountable number of initial points giving totally aperiodic trajectories [I.35].
This point is said to be the onset of chaos [21].
Remarks: 1.) In its parameter range of stability, a 2^k h-periodic difference solution, η̃_per, is locally attractive. Concerning difference solutions η̃ asymptotically approaching η̃_per, their set of initial values η̃(0) shrinks as k increases. Consequently, it becomes more and more difficult to determine a 2^k h-periodic difference solution by means of a marching method (compare Subsection 7.6).
2.) Concerning the Logistic ODE (4.8) or some other scalar ODEs, applications of numerous discretizations have been investigated by Yee et al. (e.g., [I.56]). In all these cases, the authors have shown the existence (or occurrence) of spurious difference solutions ỹ_sp, either explicitly or by means of the computed difference approximations ŷ.
Since bifurcation is generic to the existence of real-valued spurious difference solutions, bifurcation theory will now be briefly reviewed, following Iserles et al. [I.25]. For this purpose, the special case of a stationary point y*_stat = y*_stat(0) of an IVP with ODEs and an unspecified initial vector y(0) is considered. Due to consistency, y*_stat is a root of the function F in the discretization
(8.3) y_{j+1} = F(y_j, h).
The difference solution ỹ_stat is assumed to be locally stable for h ∈ (0, h_c). Bifurcation from ỹ_stat occurs (subject to various non-degeneracy conditions) when an eigenvalue of the Jacobian of F(ỹ_stat, h) passes through the unit circle in the complex plane ℂ as h ≤ h_c is replaced by h > h_c [I.25]. At h_c, a spurious difference solution then bifurcates from y*. According to [I.25] and [I.16, p.145-147], there are the following three possibilities:
Figure 8.4a: Transcritical bifurcation; horizontal straight line: 1h-periodic difference solution y*_stat; curve: spurious difference solution ỹ_sp
Figure 8.4b: Pitchfork bifurcation; see Figure 8.4a for the horizontal straight line and the curve
(I) If an eigenvalue passes through +1 ∈ ℂ, then a spurious fixed point (period 1h) of (8.3) bifurcates from y*_stat; this typically occurs as a transcritical bifurcation, shown in Figure 8.4a;
(II) if an eigenvalue passes through −1 ∈ ℂ, then a spurious solution of (8.3) with period 2h bifurcates from y*_stat; this is a period-doubling pitchfork bifurcation, see Figure 8.4b;
(III) if a pair of complex conjugate eigenvalues passes through the unit circle, then a spurious closed invariant curve ỹ_per for (8.3) bifurcates from y*_stat by means of a Hopf bifurcation.
Remark: The sequence {A_k} with A_k = A(h_k) concerning (8.2) is an example of type (II) bifurcations.
In the case of a bifurcation of the types (I)-(III), a branch B bifurcating at a point h_c > 0 is of particular interest provided B is real-valued for h < h_c; the stationary point of the discretization, ỹ_stat = y*_stat(0), then is stable by assumption. The property of being real-valued may still be true for B as h → 0; B then must be spurious. In fact, and by assumption, the "genuine" (non-spurious) difference solution ỹ(h, y_stat(0)) = ỹ_stat coincides with the stationary point y*_stat of the ODE. Generally, for a non-constant branch B concerning discretizations of explicit ODEs, the property of being spurious manifests itself in the property that the limit as h → 0 of ỹ_sp(h) is not in D; see
• the subsequent Example 8.13 and
• Stuart's [I.50, p.205] observations for IBVPs with parabolic PDEs (which also apply to ODEs): "Typically, as the mesh is refined,... spurious solutions will either move off to infinity in the bifurcation diagram or...".
Up to this point, the discussion has been confined to explicit discretizations. For implicit difference methods, there are additional ways to generate spurious difference solutions.
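For the explicit Euler method, F(y, h) = y + hf(y), case (II) can be made concrete (a sketch added here, with the Logistic ODE as an assumed example): at a stationary point y*_stat the Jacobian of the scalar map is 1 + hf'(y*_stat), and for f(y) = y(1 − y), y*_stat = 1, f'(1) = −1, the eigenvalue 1 − h passes through −1 ∈ ℂ at h_c = 2, in agreement with A₁ = A(h₁) = 3 in (8.2).

```python
def f(y):                          # logistic right-hand side (assumed form of (4.8))
    return y * (1.0 - y)

def fprime(y):
    return 1.0 - 2.0 * y

def iterate(y, h, steps):          # explicit Euler map F(y, h) = y + h f(y)
    for _ in range(steps):
        y = y + h * f(y)
    return y

y_stat, h_c = 1.0, 2.0
print(1.0 + h_c * fprime(y_stat))          # eigenvalue of the Jacobian: exactly -1

y_below = iterate(0.9, 1.9, 2000)          # h < h_c: the fixed point attracts
y_above = iterate(0.9, 2.1, 2000)          # h > h_c: a spurious 2h-periodic solution
print(y_below, y_above, iterate(y_above, 2.1, 1))
```

Below h_c the orbit collapses onto y*_stat = 1; above h_c it alternates between the two points of the period-doubled branch, i.e., the pitchfork of Figure 8.4b.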
In fact, at any grid point t_j, the employed difference equations may possess more than one real-valued solution; consequently, each such point is a (timewise) bifurcation point; see Examples 8.5 and 8.10.
Another relevant observation is concerned with the distinction between true solutions ỹ of the discretization and their computed approximations ŷ. A (suitably defined) distance of ỹ and ŷ then may be large, particularly in the case that this discretization is unstable; see [15] for an example exhibiting the governing influence of the numerical precision. This situation gives rise to still another type of (pseudo-)spurious difference approximations.
8.3 ON SPURIOUS DIFFERENCE SOLUTIONS OF ODEs WITH BOUNDARY CONDITIONS (OF PERIODICITY)
The consideration of a BVP may be motivated as follows: either by the search for
• a T-periodic solution of an ODE, making use of a boundary condition of periodicity (see (5.5) or (7.13)), or
• a time-independent (steady-state) solution of an IBVP; see Subsection 8.1 for practical consequences for the approximation of a non-spurious steady-state solution of an IBVP.
The present subsection is concerned with BVPs consisting of nonlinear ODEs of the second order with either separated two-point Dirichlet boundary conditions (as in (5.2)) or boundary conditions of periodicity (as in (5.1)). The discretizations to be investigated employ
• equidistant grids,
• for the derivatives of the second order, the usual discretization possessing second order of consistency, and
• for derivatives of the first order, the usual forward, backward, or central difference quotients of first or second order of consistency.
Generally for BVPs, the verifications of the existence and the uniqueness of true solutions y* are major tasks. For the BVPs to be considered here, at least the existence of a (classical) true solution y* will be known. The problems to be investigated here are concerned with sequences of difference solutions ỹ = ỹ(h) which either
(a) as h → 0, serve as a pointwise approximation of a true solution y*, or
(b) for all h ∈ (0, h̄), represent a sequence of spurious difference solutions, ỹ_sp(h).
Difference solutions of type (b) were first reported in the literature in 1974 in the paper [I.15] by Gaines. He
• employed an unstable discretization of a nonlinear BVP of the type under consideration here, and he
• discussed a sequence of spurious difference solutions which, as h → 0, becomes more and more "pathological".
In 1977, Gaines' work [I.15] was followed by the one of Spreuer & Adams [I.44], which presents the subsequently treated three Examples 8.5, 8.10, and 8.13.
Example 8.5 ([I.45], see also [I.44]): The BVP
(8.6) −(y'')² + 12y' = 0, y(0) = 0, y(1) = 7
possesses only the following classical solutions:
(8.7) y*_(1)(x) := (x + 1)³ − 1
and
(8.8) y*_(2)(x) := (x − 2)³ + 8.
The following consistent discretization was chosen:
In [I.45] it has been shown that
• (8.9) possesses a total of 2^(n−1) difference solutions, where each one is determined by one of the 2^(n−1) sign patterns in the solutions of the n − 1 quadratic equations for y_{j+1} as following from (8.9); two of these solutions are non-spurious, and they approach the true solutions (8.7) and (8.8), respectively; 2^(n−1) − 2 of these are spurious difference solutions;
• the alternating sign pattern +,−,+,−,... yields a sequence approaching the limiting function 7x as h → 0; this function does not satisfy the ODE in (8.6);
• for a sign pattern that is fixed independently of n ∈ ℕ, limiting functions y^(L) and y^(R) of a sequence of spurious difference solutions have been constructed where
  o y^(L) is a polynomial which is valid for x ∈ [0, 1/2] and y^(R) is a polynomial valid for x ∈ [1/2, 1];
  o these polynomials and their one-sided derivatives of the first order possess coincident values at x = 1/2, respectively;
  o the one-sided derivatives of the second order possess a finite
discontinuity at x = 1/2; consequently, y^(L) and y^(R) define a "nonclassical" solution of (8.6);
• in [I.45], there is also another sign pattern, again yielding functions y^(L) and y^(R) with these properties.
Remark: There are no spurious difference solutions provided discretizations identical with the ones in (8.9) are employed in the following equivalent explicit representation of the implicit ODE in (8.6): y'' ∓ √(12y') = 0.
A sufficiently smooth system of ODEs y' = f(y) is considered in conjunction with a two-point boundary condition such that there is a solution y*. Additionally, a sufficiently smooth discretization is considered that is consistent and stable in a neighborhood of y*. For this situation, a theorem in [28] asserts the pointwise convergence as h → 0 of an h-dependent sequence of difference solutions. Therefore, the occurrence of 2^(n−1) spurious difference solutions of (8.9) may be due either to
(A) the possibility that the discretization (8.9) is not stable or
(B) in the case of the existence of a limiting function, that this is not a true solution of the ODE.
Example 8.10 [I.44]: The BVP
(8.11) (y'''')² − y''' = 0, y'''(0) = y'''(1) = 1, y(0) = y'(0) = 0
does not possess a (classical) true solution. A consistent discretization is chosen employing the usual (central) difference quotients of the lowest order. This discretization possesses a non-denumerable infinity of difference solutions ỹ = ỹ(h, c) which are defined by
(8.12) ỹ_{2j} := g(2jh), ỹ_{2j+1} := g((2j + 1)h) + h⁴/8 with g(x) := (x³ + cx² − h²x)/6 for all c ∈ ℝ.
As h → 0, the pointwise convergence of ỹ = ỹ(h, c) to the function y(x) := (x³ + cx²)/6 is obvious. Consequently, (8.12) represents infinitely many sequences of spurious difference solutions. Here too, the spurious character may be due to either one of the reasons (A) or (B).
So far, the following two kinds of spurious difference solutions have been discussed here with respect to (consistent) discretizations of ODEs:
(a) as h → 0, a sequence ỹ = ỹ(h) or a corresponding sequence concerning an employed difference quotient does not converge to a continuous limiting function such as
the ones discussed in the case of IVPs or the ones reported by Gaines [I.15] in the case of BVPs, or
(b) all the sequences referred to in (a) converge to continuous limiting functions and their respective continuous derivatives; however, these functions are not solutions of the ODEs.
Example 8.13 [I.44]: The BVP
(8.14) −2y'' + (y')³ + y = g(x); y(0) = y(1) = 1; g(x) := 6 − 5x + 11x² − 8x³
possesses the solution
(8.15) y*(x) := 1 + x(1 − x).
Since this BVP is inverse-monotone ([I.54] and [1]), y* is the only (classical) solution of (8.14). The following consistent discretization is chosen:
(8.16) −2[(y_{j+1} − 2y_j + y_{j−1})/h²] + [(y_{j+1} − y_j)/h]³ + y_j = g(jh) for
j = 1(1)n − 1; y₀ = 1, y_n = 1; h = 1/n; n ∈ ℕ is even.
This system is not inverse-monotone for any n ∈ ℕ. The following candidate for a difference solution ỹ is chosen as a function of free parameters β_j ∈ ℝ:
(8.17) y_j := 1 + ((−1)^j − 1)h^(1/2) + β_j h for j = 0(1)n; n ∈ ℕ is even; β₀ = β_n = 0; β_j ∈ ℝ for j = 1(1)n − 1 is suitably determined.
The boundedness of the parameters β_j(h) as h → 0 is shown in Appendix 8.6. Therefore, as h → 0, the sequence of difference solutions defined in (8.17) converges to the function y(x) := 1, which does not satisfy the ODE. The (employed) difference quotients are not bounded as h → 0.
Remarks: 1.) In the limit as h → 0, (8.17) for all j ∈ ℕ defines the stationary point y_j = 1 which satisfies (8.16) subsequent to its multiplication by h². Provided this version of (8.16) is employed, y_j = 1 almost satisfies this system of equations for any sufficiently small h > 0.
2.) Depending on n ∈ ℕ with n ≥ 20, H. Spreuer [30] has determined numerous additional spurious difference solutions of (8.16), each for a fixed value of h = 1/n. For this purpose, Spreuer employed a shooting method which, starting at y_n = 1, satisfies the condition y₀ = y₀(y_{n−1}) = 1 by means of an iterative determination of y_{n−1}.
Example 8.18: The function
(8.19) y*(x) := 1 + (−1)^k (x − k)(1 − x + k) for x ∈ [k, k + 1] and all k ∈ ℕ
is a non-classical 2-periodic solution of the ODE
(8.20) −2y'' + (y')³ + y = ĝ(x) for x ∈ ℝ, with ĝ(x) := −2y*''(x) + (y*'(x))³ + y*(x).
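The mechanism behind (8.17) can be checked numerically (a sketch added here; it assumes that (8.16) employs the central second-order quotient for y'' and the forward quotient for y', an assumption consistent with (8.31) in Appendix 8.6): with β_j = 0, the O(h^(−3/2)) contributions of the two nonlinear terms cancel, so the residual of (8.16) remains bounded as h → 0 although the difference quotients themselves are unbounded; this is exactly the signature of heuristic test (γ).

```python
import numpy as np

def g(x):                                  # right-hand side of (8.14)
    return 6.0 - 5.0 * x + 11.0 * x**2 - 8.0 * x**3

def residual(n):
    """Max residual of (8.16) for the candidate (8.17) with beta_j = 0."""
    h = 1.0 / n
    j = np.arange(0, n + 1)
    y = 1.0 + ((-1.0)**j - 1.0) * np.sqrt(h)          # (8.17), beta_j = 0
    d2 = (y[2:] - 2.0 * y[1:-1] + y[:-2]) / h**2      # second quotient, O(h^-3/2)
    d1 = (y[2:] - y[1:-1]) / h                        # forward quotient, O(h^-1/2)
    return np.max(np.abs(-2.0 * d2 + d1**3 + y[1:-1] - g(j[1:-1] * h)))

for n in (16, 64, 256, 1024):
    h = 1.0 / n
    # bounded residual, although the forward quotient has magnitude 2 h^(-1/2)
    print(n, residual(n), 2.0 / np.sqrt(h))
```

The residual stays of the order of max |1 − g(x)| for every n, while the first-order difference quotient grows like 2h^(−1/2); the bounded beta_j terms of the full candidate (8.17) absorb this remaining O(1) residual.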
A corresponding generalization of (8.17) generates a sequence of spurious difference
solutions. As h → 0, this sequence approaches the "steady-state difference solution" y(x) := 1 which does not satisfy the ODE. In both Example 8.13 and Example 8.18, the spurious nature of the difference solutions ỹ = ỹ(h) is suggested by means of the heuristic test (γ) referred to in Subsection 8.1.
Concerning (consistent) discretizations of nonlinear BVPs belonging to certain classes, the existence of spurious difference solutions can be excluded provided the sufficient conditions are satisfied in theorems which were proved in 1981 by Beyn & Doedel [9]. In 1981, Peitgen et al. [26] have investigated the BVPs
(8.21) y'' + μf(y) = 0, y(0) = y(π) = 0, μ ∈ ℝ,
making use of the symmetric discretization of y'' of the lowest order. Provided there are n ∈ ℕ equidistant grid points in (0, π) and f possesses k zeros in this interval, then one of the results in [26] asserts the existence of kⁿ difference solutions, almost all of which are "numerically irrelevant", i.e., spurious. In [26], they are "characterized in terms of singularities of certain embeddings of finite difference approximations with varying meshsize, where the meshsize is understood as a homotopy parameter."

8.4 ON SPURIOUS DIFFERENCE SOLUTIONS OF DISCRETIZATIONS OF IBVPs
The discussions of spurious difference solutions concerning ODEs carry over to the case of IBVPs, provided traditional numerical methods are used. In fact,
(a) IVPs then arise by means of any one of the traditional approximation methods listed at the end of Section 1;
(b) BVPs then arise either by means of a horizontal method of lines or as a vehicle for the determination of a steady-state solution, see Subsection 8.1.
Concerning (a), see Subsection 9.4.
Example 8.22 (for the case (b)): The BVP (8.14) is now generalized to become a nonlinear hyperbolic IBVP with non-specified initial functions at t = 0:
(8.23) y_tt + cy_t − 2y_xx + (y_x)³ + y = g(x) for (x,t) ∈ D := [0,1] × [0,T], with c ∈ ℝ⁺; y(0,t) = y(1,t) = 1; y(x,0) and y_t(x,0) are free with the exception of y(0,0) = y(1,0) = 1 and y_t(0,0) = y_t(1,0) = 0; g(x) as defined in (8.14).
An explicit discretization of (8.23) is chosen with step sizes h = 1/n and k ∈ ℝ⁺ which, concerning y_x and y_xx, is identical with the one in (8.16) and, concerning y_t and y_tt, employs difference quotients which are formally identical with the ones for y_x and y_xx, respectively. Obviously, (8.17) then represents an h-dependent steady-state sequence of spurious difference solutions, ỹ_sp, of this discretization. This sequence, {ỹ_sp}, is locally attractive provided c is sufficiently large. Correspondingly, there are initial sets {y(x_j,0) | j = 1(1)n − 1} and {y_t(x_j,0) | j = 1(1)n − 1} such that the difference solutions with these initial data approach the spurious difference solution ỹ_sp as defined in (8.17). Because of the increasingly "pathological" character of (8.17) as h → 0, the same must be true for the initial sets just referred to. This problem is of practical relevance in the case of fixed choices of h = 1/n and k. In fact, a time-dependent difference approximation then may approach the time-independent spurious difference solution ỹ_sp as defined in (8.17).
Remark: Concerning the "pathological" character, compare Remark 1.) subsequent to (8.2).
Stuart ([I.50] and [31]) has investigated discretizations of the following parabolic IBVP:
(8.24) y_t − y_xx − cf(y, y_x) = 0 for c and D as defined in (8.23), f(0,0) = 0; y(0,t) = y(1,t) = 0; y(x,0) = y₀(x) is given.
The equidistant grid and the discretization are chosen just as the ones with respect to (8.23). For the case of y(x,0) = 0, Stuart considers the trivial steady-state solution y*(x,t) = 0 on D. In [31, p.473], he asserts the validity of the following implication:
(8.25) provided the discretization is linearly unstable at y = 0 (for a choice of step sizes), then there exist spurious periodic difference solutions
possessing the following properties:
• they are real-valued even for arbitrarily small time steps k [31, p.483], and
• their norm tends to infinity as k → 0 [31, p.483].
In the conclusions of [31], Stuart raises the question: "What classes of initial data will be affected by the spurious periodic solutions... For general initial data, the question is open and indeed it is not well defined until a precise meaning is attached to the word 'affected'". According to [I.50, p.192], the performance of the discretizations under discussion is governed by "the nonlinear interaction of a high wave number mode, which is a product of the discretization, and a low wave number mode present in the governing differential equation", i.e., the PDE.
High-frequency oscillations of a computed approximation arise in numerous contexts, and they have to be removed by means of suitable smoothing operations. Examples are:
(i) spurious pressure modes in applications of spectral methods in Computational Fluid Dynamics [I.8], or
(ii) Gibbs oscillations due to the truncation of an Ansatz, or
(iii) multigrid methods for systems of linear algebraic equations, or
(iv) discretizations of nonlinear hyperbolic or parabolic IBVPs, etc.
Concerning (iv), the following nonlinear parabolic IBVP is discussed in [2]:
(8.26) y_t = (f(y)y_x)_x + g(y) + γ(x,t) on D as defined in (8.23); f: ℝ → ℝ⁺, g: ℝ → ℝ, γ: D → ℝ; f, g, y₀ are sufficiently smooth; boundary and initial conditions corresponding to the ones in (8.24).
In dependency on f and g, the functions y₀(x) = y(x,0) and γ = γ(x,t) are chosen such that
(8.27) y*(x,t) := e^(−ct) sin(πx), with any c ∈ ℝ⁺ and (x,t) ∈ D,
is a solution of (8.26). An equidistant grid is chosen, making use of
• h := (n + 1)⁻¹ and k := T/N for any n, N ∈ ℕ, T ∈ ℝ⁺, and
• grid points x_i = ih and t_j = jk for i = 0(1)n + 1, j = 0(1)N, respectively.
A consistent implicit discretization with two time levels t_j and t_{j−1} is chosen for the determination of an approximation ỹ_j := (ỹ_{1j},...,ỹ_{nj})ᵀ ∈ ℝⁿ of y*_j := (y*(x₁,t_j),...,y*(x_n,t_j))ᵀ:
(8.28) F(ỹ_j, ỹ_{j−1}) = f̂_j + z_j for j ∈ ℕ.
The vector f̂_j follows from the data. Local errors of all kinds are represented qualitatively by the vectors z_j. The solution ỹ_j of (8.28) depends on r := k/h², f, g, and z := (z₁,...,z_j)ᵀ ∈ ℝ^(jn). In [2], the function ỹ_j = ỹ_j(z) is approximated by means of a linear Taylor polynomial. The partial derivative Z_j^(μ,ν) := ∂ỹ_j/∂z_{μν} is the solution of a system of linear algebraic equations:
(8.29) P_j Z_j^(μ,ν) = Z_{j−1}^(μ,ν) + δ_{jν} e^(μ) with P_j ∈ L(ℝⁿ), δ the Kronecker symbol, e^(μ) := (δ_{iμ}) ∈ ℝⁿ; i, μ ∈ {1,...,n}; ν ∈ {1,...,j}; and P_j = P_j(r, f, g).
For "favorable choices" of f and g, P_j is an M-matrix [I.37]. This property is valid for all (meaningful) choices of h ∈ (0, 1/2] and k > 0 provided that f ∈ ℝ⁺ and g ∈ ℝ are constants; this is well known since (8.26) then is linear.
Remark: Since (8.26) employs a PDE, it is not possible to enclose Z_j simultaneously with the solution of this IBVP. This is the basis of Lohner's enclosure algorithm for ODEs ([I.31], [I.32]).
By use of matrix theory and numerical experiments concerning (8.28) and (8.29), the author of the present paper observed in [2] that
(i) ‖y*_j − ỹ_j‖_∞ is a slowly increasing linear function of j if P_j is an M-matrix for j = 1(1)ĵ;
(ii) ‖y*_j − ỹ_j‖_∞ is a strongly growing linear or nonlinear function of j subsequent to the (always irreversible) transition at ĵ from the presence to the absence of this property;
(iii) prior to this transition, "spurious oscillations" of the computed sequences of the vectors ỹ_j and Z_j were observed, with respect to both x_i and t_j; the amplitudes of these oscillations were small or even decreasing when P_j remained an M-matrix as j increased; this can always be enforced by a sufficiently small local choice of the time step k.
The spurious character of oscillating computed vectors ỹ_j and Z_j follows from the facts
• that the true solution y* to be approximated by the vectors ỹ_j is monotone and
• that this then is true for the corresponding vectors Z_j solving (8.29).
In [2], the empirically recognized importance of the property of an M-matrix for P_j has been theoretically substantiated by means of the theory of M-matrices. In [2], identical empirical conclusions concerning P_j were also drawn for a three-level implicit discretization of the hyperbolic IBVP simulating nonlinear vibrations of a string fixed at both ends [7, p.201].
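The M-matrix test can be sketched for the linear special case mentioned above (f and g constant; the concrete matrix below, I + r·tridiag(−1, 2, −1) after scaling f to 1 and dropping g, is an assumption for illustration, not the exact P_j of [2]): a matrix with positive diagonal and non-positive off-diagonal entries (a Z-matrix) is an M-matrix if and only if its inverse is entrywise non-negative.

```python
import numpy as np

def P(n, r):
    """Illustrative two-level implicit matrix I + r*tridiag(-1, 2, -1), r = k/h^2."""
    A = (1.0 + 2.0 * r) * np.eye(n)
    for i in range(n - 1):
        A[i, i + 1] = -r
        A[i + 1, i] = -r
    return A

def is_m_matrix(A, tol=1e-12):
    """Heuristic check: Z-pattern plus entrywise non-negative inverse."""
    off = A - np.diag(np.diag(A))
    if not (np.all(np.diag(A) > 0.0) and np.all(off <= tol)):
        return False
    return bool(np.all(np.linalg.inv(A) >= -tol))

# for this linear model problem the property holds for every r > 0
print(is_m_matrix(P(20, 0.5)), is_m_matrix(P(20, 50.0)))
```

In the nonlinear case, P_j varies with j, and the same check applied at each time level yields the (heuristic) transition index ĵ discussed above.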
8.5 CONCLUDING REMARKS
The existence of spurious difference solutions ỹ_sp has the following practical implications:
• they are true solutions of a chosen discretization; consequently, and on the level of the discretization, they cannot be distinguished from non-spurious difference solutions ỹ, not even when Enclosure Methods are employed on this level; in this context, see Subsection 7.4;
• for discretizations of ODEs with initial conditions and as h → 0, a sequence of spurious difference solutions ỹ_sp = ỹ_sp(h) approaches a limit outside of the domain D of f in the ODE y' = f(y); for a practically chosen h, the computed approximation ŷ_sp may still be in D; a spurious character perhaps then can be detected by means of the tests (α)-(γ) referred to in Subsection 8.1;
• for discretizations of ODEs with boundary conditions, there are the classes (a) and (b) of spurious difference solutions referred to in Subsection 8.3; class (b) cannot be detected on the level of the discretization, not even by means of the tests (α)-(γ) in Subsection 8.1;
• concerning spurious difference solutions in the case of IBVPs, the relatively small body of knowledge seems to indicate a situation comparable to the one in the case of ODEs; as a practical (heuristic) test in the case of implicit discretizations, the property of an M-matrix may be used, which is related to Enclosure Methods on the level of the discretization;
• the practically most important influence of spurious difference solutions is the one with respect to the size of the domain of attraction of a (difference) solution to be approximated;
• the situation just addressed calls for a (spot-check) application of Enclosure Methods with respect to solutions of DEs to be approximated; concerning PDEs, the status of these methods is characterized in Section 1.
8.6 APPENDIX: A SUPPLEMENT TO EXAMPLE 8.13
The practical relevance of the sequence of spurious difference solutions in (8.17) suggests a presentation of the complete verification of the employed analysis. The proof [30] of the subsequent Lemma 8.30 has not been published before.
Lemma 8.30: The sequence of parameters β_j in (8.17) is uniformly bounded for j = 1(1)n − 1 and all n ∈ ℕ.
E. Adams
Proof:
(I) The Boundedness of the Auxiliary Sequence {Δβ_j}: Making use of Δβ_j := β_j − β_{j−1} and β_j = Σ_{k=1}^{j} Δβ_k, (8.16) can be represented as follows:

(8.31)  10Δβ_{j+1} + 2Δβ_j + 6(−1)^{j+1} h^{1/2} (Δβ_{j+1})² + h(Δβ_{j+1})³ + h² Σ_{k=1}^{j} Δβ_k
        = h ĝ(jh) − ((−1)^{j+1} + 1) h^{3/2} =: h g(jh)  for j = 0(1)n−1,

with β_0 = 0 and β_n = Σ_{k=1}^{n} Δβ_k = 0. Consequently, the roots Δβ_{j+1} of the following equation have to be determined:

(8.32)  G_j(Δβ_{j+1}) := Δβ_{j+1} − A_{j+1}/B_{j+1} = 0  for j = 1(1)n−1, where
        A_{j+1} := h g(jh) − 2Δβ_j − h² Σ_{k=1}^{j} Δβ_k  and
        B_{j+1} := 10 + 6(−1)^{j+1} h^{1/2} Δβ_{j+1} + h(Δβ_{j+1})².

Making use of the existence of an M ∈ ℝ⁺ such that max_j |g(jh)| ≤ M, it is assumed that there is a K ∈ ℝ⁺ independent of h such that |Δβ_1| ≤ K; e.g., M = 5.2. Without a loss of generality, h will be confined to the interval (0, h_0], where h_0 := Min{K/(M+K), (10K)⁻²}. The following assumption of an induction is chosen:

(8.33)  |Δβ_j| ≤ K for j ≥ 2.

There follow

(8.34)  G_j(K/3) > 0 and G_j(−K/3) < 0 for j ≥ 1.

In the general step of the induction, the existence of a root Δβ_{j+1} of (8.31) is shown such that |Δβ_{j+1}| ≤ K/3 for j ≥ 1.
(II) The Boundedness of the Sequence {β_j}: The estimate (8.33) will now be sharpened. For this purpose, (8.31) and the estimates leading to (8.33) are employed to yield

(8.35)  |Δβ_{j+1}| ≤ h(M + K)/7 + (2/9)^j K.

There holds

(8.36)  β_n = Σ_{k=1}^{n} Δβ_k.

The estimate

(8.37)  |β_n| ≥ |Δβ_1| − Σ_{k=2}^{n} |Δβ_k| ≥ K/2
Discretizations: Practical Failures
is satisfied for K = M/5, which allows the choice of |Δβ_1| = 1. Consequently, there follow β_n > 0 (or < 0) for Δβ_1 = 1 (or −1). Therefore, |β_j| is bounded because of

(8.38)  |β_j| = |Σ_{k=1}^{j} Δβ_k| ≤ Σ_{k=1}^{j} |Δβ_k|  for j = 1(1)n,

and this sum is bounded since (8.37) has been satisfied by the choices Δβ_1 = 1 or −1.
(III) The Continuous Dependency of β_n on β_1: The sequences {Δβ̄_{j+1}} and {Δβ_{j+1}} are considered in their dependencies on β̄_1 = Δβ̄_1 and β_1 = Δβ_1, respectively. Making use of d_j := Δβ̄_j − Δβ_j, (8.31) implies that

(8.39)  d_{j+1} = C_{j+1}/D_{j+1}, where C_{j+1} := −h² Σ_{k=1}^{j} d_k − 2d_j and
        D_{j+1} := 10 + 6(−1)^{j+1} h^{1/2} (Δβ̄_{j+1} + Δβ_{j+1}) + h E_{j+1}, where
        E_{j+1} := (Δβ̄_{j+1})² + Δβ̄_{j+1} Δβ_{j+1} + (Δβ_{j+1})².

Since |Δβ̄_{j+1}|, |Δβ_{j+1}| ≤ K/3, there follows D_{j+1} > 9 for j ≥ 1. If |d_k| < δ for k = 1(1)j, then (8.39) implies that

(8.40)  |d_{j+1}| < (hδ + 2δ)/9 < δ.

Therefore, |d_j| < δ for j = 2(1)n if |d_1| = δ and, because of (8.39),

(8.41)  |d_{j+1}| < (hδ + 2|d_j|)/9.

Consequently, in analogy to the estimates leading to (8.35),

(8.42)  |d_{j+1}| ≤ hδ/7 + (2/9)^j δ

and

(8.43)  |β̄_n − β_n| ≤ (17/7)δ.

Therefore,
(8.44)  |β̄_n − β_n| ≤ (17/7) |β̄_1 − β_1|.

(IV) Existence of a β_1 such that β_n = 0: There exists a β_1 = Δβ_1 such that β_n = Σ_{k=1}^{n} Δβ_k = 0 because of (i) (8.38) and the implication as stated subsequent to (8.37) and (ii) the continuous dependency of β_n on β_1 as asserted by (8.44). This β_1, such that β_n = 0, can be determined by means of (i) a shooting method, starting with β_1, and, for j = 1(1)n−1, (ii) the determination of a root Δβ_{j+1} of (8.32). This completes the proof of the existence and the uniform boundedness of the sequence of parameters β_1, ..., β_{n−1} for all n ∈ ℕ, which are introduced in (8.17). ∎
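The constructive part (IV) of the proof can be imitated numerically: an outer bisection on Δβ_1 drives β_n to zero, while each Δβ_{j+1} is obtained as a root of the reconstructed form of (8.31). The sketch below is an illustration only; the function g is a hypothetical stand-in for the right-hand side defined via (8.16), and the brackets and tolerances are arbitrary.

```python
import math

def solve_dbeta(rhs, dbj, s, h, sign):
    # Solve G(x) = 10x + 2*dbj + 6*sign*sqrt(h)*x^2 + h*x^3 + h^2*s - rhs = 0
    # for x = Delta beta_{j+1} by bisection; the linear term 10x dominates,
    # so a root lies in [-1, 1].
    def G(x):
        return 10*x + 2*dbj + 6*sign*math.sqrt(h)*x**2 + h*x**3 + h*h*s - rhs
    a, b = -1.0, 1.0
    for _ in range(80):
        m = 0.5*(a + b)
        if G(a)*G(m) <= 0:
            b = m
        else:
            a = m
    return 0.5*(a + b)

def beta_n(db1, n, g):
    # Given Delta beta_1 = db1, march the recursion (8.31)/(8.32) and
    # return (beta_n, max |Delta beta_j|).
    h = 1.0/n
    db = [db1]
    s = db1                       # running sum of Delta beta_k
    for j in range(1, n):
        sign = 1.0 if (j + 1) % 2 == 0 else -1.0   # (-1)^(j+1)
        x = solve_dbeta(h*g(j*h), db[-1], s, h, sign)
        db.append(x)
        s += x
    return s, max(abs(v) for v in db)

def shoot(n, g):
    # Outer bisection on Delta beta_1 in [-1, 1] so that beta_n = 0,
    # mirroring the shooting method of part (IV).
    a, b = -1.0, 1.0
    for _ in range(60):
        m = 0.5*(a + b)
        fa = beta_n(a, n, g)[0]
        fm = beta_n(m, n, g)[0]
        if fa*fm <= 0:
            b = m
        else:
            a = m
    return 0.5*(a + b)
```

For a smooth stand-in g, the computed parameters Δβ_j remain uniformly bounded while β_n is driven to zero, which is the qualitative content of Lemma 8.30.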
9. ON DIVERTING DIFFERENCE APPROXIMATIONS CONCERNING ODEs OR PDEs

9.1 DIVERSIONS IN THE CASE OF DISCRETIZATIONS OF ODEs
Diversions of difference approximations ỹ have been defined as item (d) in the beginning of Section 2. If the condition starting with "such that ..." were absent in (d), the phenomenon would be irrelevant. In fact, as t_j increases or decreases, ỹ at t_j continuously coincides with locally different true solutions y* = y*(t_j). Diversions of difference approximations ỹ are caused by the interaction of (a) topographically suitable situations and (b) local discretization or local rounding errors acting as perturbations, triggering a deviation that becomes a diversion, either immediately or at some later time. As has been discussed in Section 3, the situations addressed in (a) are in particular (a1) neighborhoods of poles of the ODEs, where the cause and the effect are almost coincident, and (a2) the presence of an (n−1)-dimensional stable manifold Ms (in the n-dimensional phase space of the ODEs) which is penetrated due to one or several consecutive local numerical errors, followed by the subsequent manifestation of the diversion in a neighborhood of the unstable manifold Mu.
The triggering phenomenon (b) therefore may be due to a single or a few consecutive local errors, which are unavoidable in the execution of the computational determination of ỹ. This phenomenon can be characterized as follows:
• the strong causality principle addressed in Section 1 is violated;
• diversions are possible for arbitrarily small h > 0 and an arbitrarily high numerical precision, provided a total error control is absent;
• the triggering process is uncontrollable and therefore random, since it depends on the interplay between the choices of the numerical method and its artificial parameters with the employed computer and compiler;
• the randomness of the phenomenon indicates the futility of a search for a mathematical theory for the (non-)occurrence of diversions.
Remarks: (1) Concerning this randomness, see Subsections 7.3 and 7.4. (2) Spurious difference solutions ỹsp of the following kinds have found particular
attention in the literature ([I.50] or [I.57]): they are either stationary points or periodic solutions. It is then possible that there exist (n−1)-dimensional stable sets M̃s and unstable sets M̃u, both belonging to the difference solution ỹsp under consideration. Rounding errors then may trigger the diversion of a computed difference approximation ỹ.
9.2 EXAMPLES FOR THE OCCURRENCE OF DIVERTING DIFFERENCE APPROXIMATIONS IN THE CASE OF THE RESTRICTED THREE-BODY PROBLEM
The following idealization of Celestial Mechanics is considered [33, p. 5]:
(i) the orbits of the earth E and the moon M are confined to a plane in ℝ³;
(ii) in this plane, there is a suitably rotating Cartesian y₁-y₂-basis whose origin is attached to the center of gravity, C, of E and M; the points E, C, and M are on the y₁-axis; in Figure 9.4, C and E are almost coincident, since
(iii) the position of C relative to E and M is determined by the ratio μ = 1/82.45 of the masses of M and E; consequently, −μ is the location of E and λ := 1 − μ is the location of M on the y₁-axis;
(iv) in the y₁-y₂-plane, trajectories of a small satellite S are to be determined;
(v) for these trajectories, the phase space possesses the Cartesian coordinates y₁, y₂, y₃ := y′₁, and y₄ := y′₂.
For the Restricted Three-Body Problem defined by (i), (ii), (iv), and (v), the equations of motion are as follows [33, p. 5] in the employed rotating basis:

(9.1)  y′₁ = y₃,
       y′₂ = y₄,
       y′₃ = y₁ + 2y₄ − λ(y₁ + μ)/r₁³ − μ(y₁ − λ)/r₂³,
       y′₄ = y₂ − 2y₃ − λy₂/r₁³ − μy₂/r₂³,
       where r₁ := ((y₁ + μ)² + y₂²)^{1/2} and r₂ := ((y₁ − λ)² + y₂²)^{1/2}.

For any true solution y* = (y₁, y₂, y₃, y₄)ᵀ of (9.1), the Jacobi integral, J, takes a fixed value (9.2).
In agreement with numerous papers in the literature (e.g., [10]), the following
point is chosen:

(9.3)  yp(0) := (1.2, 0, 0, −1.04935750983)ᵀ.

Starting at yp(0) and for the ODEs (9.1), a classical Runge-Kutta method with step size h = 10⁻³ yielded an orbit ỹ̃p whose projection into the y₁-y₂-plane is displayed in Figure 9.4 (see Figure 1 of the paper [I.39] in this volume). It is noted that the presently used superscript double tilde corresponds to the superscript single tilde in [I.39]. With much more than graphical accuracy, applications of "high-precision" difference methods in the literature have yielded (almost) closed orbits with representation ỹ̃p in Figure 9.4. These orbits (almost) return to yp(0) at the time t = T̃ := 6.192169331396, [10]. Therefore, ỹ̃p is believed to be an approximation of a hypothetical T-periodic solution, y*per, of (9.1) with period T ≈ T̃.
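The computation just described can be imitated in a few lines: the classical fourth-order Runge-Kutta method is applied to (9.1) with the starting point (9.3), and the Jacobi integral is monitored along the grid. This is an illustrative sketch only; the formula used for J is the standard rotating-frame expression, whose normalization may differ from that of (9.2).

```python
import math

MU = 1/82.45          # mass ratio of (iii)
LAM = 1 - MU          # lambda := 1 - mu

def f(y):
    # Right-hand side of (9.1) in the rotating y1-y2 basis.
    y1, y2, y3, y4 = y
    r1 = math.hypot(y1 + MU, y2)
    r2 = math.hypot(y1 - LAM, y2)
    return (y3,
            y4,
            y1 + 2*y4 - LAM*(y1 + MU)/r1**3 - MU*(y1 - LAM)/r2**3,
            y2 - 2*y3 - LAM*y2/r1**3 - MU*y2/r2**3)

def jacobi(y):
    # Standard rotating-frame Jacobi integral (an assumed normalization).
    y1, y2, y3, y4 = y
    r1 = math.hypot(y1 + MU, y2)
    r2 = math.hypot(y1 - LAM, y2)
    return y1*y1 + y2*y2 + 2*LAM/r1 + 2*MU/r2 - y3*y3 - y4*y4

def rk4_step(y, h):
    # One step of the classical Runge-Kutta method.
    k1 = f(y)
    k2 = f(tuple(a + 0.5*h*b for a, b in zip(y, k1)))
    k3 = f(tuple(a + 0.5*h*b for a, b in zip(y, k2)))
    k4 = f(tuple(a + h*b for a, b in zip(y, k3)))
    return tuple(a + h/6*(b + 2*c + 2*d + e)
                 for a, b, c, d, e in zip(y, k1, k2, k3, k4))

def orbit(h, t_end):
    # Integrate from yp(0) of (9.3); return final state and max drift of J.
    y = (1.2, 0.0, 0.0, -1.04935750983)
    J0 = jacobi(y)
    drift = 0.0
    for _ in range(int(round(t_end/h))):
        y = rk4_step(y, h)
        drift = max(drift, abs(jacobi(y) - J0))
    return y, drift
```

A large jump of the monitored value between consecutive loops, as in Figure 9.5B, indicates a diversion near the pole at E; within a loop, and away from E, the drift remains negligibly small.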
Figure 9.4: Projection into the y₁-y₂-plane of the almost closed orbit ỹ̃p with starting point yp(0) in (9.3). Determination by means of the classical Runge-Kutta method with step size h = 10⁻³.

In a numerical experiment referred to in [I.39], the classical Runge-Kutta method with h = 5·10⁻³ was used to approximate the true solution, y*p, of (9.1) that starts at (9.3), by means of a computed difference approximation ỹ̃q displayed in Figure 9.5A; see Figure 3 in [I.39]. The sharp tip of one of the loops in Figure 9.5A is a consequence of the projection of the orbit from ℝ⁴ into ℝ². For an investigation of the gross deviation of ỹ̃q from ỹ̃p with yq(0) = yp(0), the values J̃ of the Jacobi integral, J, were computed at the grid points employed in the
determination of ỹ̃q. Figure 9.5B displays the results J̃ for the leading four loops depicted in Figure 9.5A.

Figure 9.5A: Projection into the y₁-y₂-plane of the orbit ỹ̃q with starting point yq(0) in (9.3). Determination by means of the classical Runge-Kutta method with step size h = 5·10⁻³.
Figure 9.5B: Leading four loops of ỹ̃q from Figure 9.5A, with the computed values J̃ of the Jacobi integral listed (among them J̃ = −2.114025... and J̃ = −1.788917...).

It is seen that the change ΔJ̃ of J̃ is
• negligibly small past any individual loop; however,
• there is a large decrease ΔJ̃ in the transfer from the (i−1)-th loop to the i-th loop; this transfer takes place in a small neighborhood of the pole E of the ODEs (9.1).
This suggests that ỹ̃q as displayed in Figures 9.5A and 9.5B has been generated by a sequence of diversions, each one taking place in a small neighborhood of E. Close to E, four initial vectors η(1), ..., η(4) were chosen as follows: they coincide with values computed for ỹ̃q at certain grid points such that, at η(i), ỹ̃q is beginning to traverse one of the leading loops shown in Figure 9.5B.
Figure 9.6: Projection into the y₁-y₂-plane of the enclosed true orbit y*(3), starting at the point η(3) of the orbit ỹ̃q displayed in Figure 9.5A.

The true orbit y*(i), starting at a fixed position η(i), was enclosed, making use of the step size control developed in [I.38]; see [I.39]. Figure 9.6 depicts the enclosure of y*(3). At η(3), ỹ̃q diverts from y*(3) to the continuation of ỹ̃q as shown in Figure 9.5A. Consequently, the occurrence of a diversion of the difference approximation ỹ̃q has been verified and its properties have been demonstrated.
Remarks: (1) Figure 2 in [I.39] presents the enclosure of the true orbit y*(2), starting at η(2), which is listed in (16) in [I.39]. (2) Concerning the (almost) closed orbit ỹ̃p in Figure 9.4, the loops in Figure 9.5A are spurious.
Swing-by maneuvers of space vehicles are executed in near neighborhoods of planets. Because of the proximity of a pole of the ODEs of Celestial Mechanics, only the employment of Enclosure Methods can reliably avoid diversions of computed orbits. For real space vehicles, in-flight trajectory corrections can be carried out by use of engines and a sufficient
supply of fuel. Consequently, and at the expense of the payload, in-flight corrections are possible for orbits which have been computed incorrectly in advance.
Remarks: (1) Concerning diversions of difference solutions for (9.1), there is a detailed discussion of the early stages of the investigations of this problem in [I.3]; additionally, [I.3] contains numerous graphs presenting sequences of loops as caused by diversions. (2) Figure 1 in [27, p. 2] displays a pattern of loops that is closely related to the one in Figure 9.5A. As an explanation of the loops, the legend of this Figure 1 refers to "the chaotic motion of a small planet around two suns...". This is an example for the interpretation of Computational Chaos as Dynamical Chaos in the set of true solutions of the ODEs of this problem.

9.3 EXAMPLES FOR THE OCCURRENCE OF DIVERTING DIFFERENCE APPROXIMATIONS IN THE CASE OF THE LORENZ EQUATIONS
In Subsection 5.4, there is a review of Kühn's work [I.29] on enclosed (and verified) periodic solutions of the Lorenz equations (5.8); see also [I.3]. Additionally, in [I.29] Kühn has studied certain aperiodic solutions of the Lorenz equations, making use of the choices b = 8/3, r = 28, σ = 6 and employing (a) Enclosure Methods for a determination of true (presumably aperiodic) solutions y* and, simultaneously, (b) Runge-Kutta methods for a determination of corresponding difference approximations ỹ. Both (a) and (b) were executed by use of a C-compiler running on an HP-Vectra. Figure 9.7 depicts the projection into the y₁-y₂-plane of an enclosed aperiodic true solution y* and its approximation ỹ, both starting at a point y₀ close to y*per in Figure 5.19. The width of the computed enclosure is smaller than the graphical resolution. In Figure 9.7, (A) the enclosed true solution y* is demarcated symbolically by boxes and (B) the solid line represents a difference approximation ỹ which was determined by use of a classical Runge-Kutta method with a step size h = h₀ := 1/128. As y* and ỹ have (almost) reached the stationary point (0, 0, 0)ᵀ, they begin to separate for the remainder of the interval [0, t∞] for which they have been determined. This can be explained by a diversion of ỹ at the stable manifold Ms of (0, 0, 0)ᵀ, taking place before ỹ comes close to this point. This diversion
occurs presumably as ỹ penetrates the stable manifold in one time step.

Figure 9.7: Projection into the y₁-y₂-plane of the enclosed true solution y* (demarcated by boxes) and of the approximation ỹ (solid line), determined by means of the classical Runge-Kutta method with step size h = h₀ = 1/128; y* and ỹ start at a point y₀ close to y*per as displayed in Figure 5.19.

It is remarkable that y* and ỹ coincide with much more than graphical accuracy for all t ∈ [0, t∞] in
the case of the choices h = 1/36 and h = 1/64. This non-monotonic dependency of ‖y*(t) − ỹ(t)‖ on h is unpredictable. For a starting vector y₀ close to the one in Figure 5.19, Kühn found cases where y* and ỹ are (α) practically coincident for t ∈ [0, t_d] ⊂ [0, t∞]; however, (β) their Euclidean distance d = d(t) oscillates for t > t_d, reaching values comparable with the Euclidean distance of the stationary points C₁ and C₂ introduced in (5.9). Property (β) can presumably be explained by a diversion of ỹ at t ≈ t_d, followed by (i) a certain winding pattern of y* about C₁ and (ii) a different corresponding pattern of ỹ about C₂. The time t_d depends unpredictably on the employed numerical method: in the four investigated cases, t_d increased from 13.5 to 20 as the step size h of a classical Runge-Kutta method was reduced or as the precision ε of a Runge-Kutta method with control of h was chosen closer to zero.
On the basis of a Taylor polynomial of (variable) order p, H. Spreuer ([30], [I.5]) has developed an explicit one-step method with a near-optimal control of h and p by means of the following conditions: the moduli of each one of the three last terms of the polynomial are required to be smaller than a prescribed ε₀, and the computational cost is to be as small as possible. The (uncontrolled) local rounding errors are characterized by the fixed double numerical precision, corresponding to 15 decimal mantissa digits, of the employed HP workstation. In applications concerning the Lorenz equations (5.8), (I) Spreuer chose y₀ on a Poincaré map defined by y₂ − y₁ = 0; then (II) he determined the first intersection, ỹint, of ỹ with y₂ − y₁ = 0; and then (III) he returned from ỹint through replacing t by −t. Spreuer observed cases where ‖ỹ − y₀‖ reached a negligibly small minimum for the returning difference approximation, and cases where this distance was large. The latter cases can presumably be explained by diversions of ỹ. Additionally, Spreuer ([30], [I.5]) employed values ε₀ << 1 in conjunction with a suitably extended number format. For the chosen y₀, diversions were then still observed, or they were absent for ε₀ sufficiently small. Therefore, here too, the presence or absence of diversions is unpredictable.
W. Espe [I.14] has used a Runge-Kutta-Fehlberg method of order eight with control of h to approximate the periodic orbit y*per depicted in Figure 5.19. Espe's results are presented in Example 5.20; see Figures 5.21a-5.21e. The details of the curve representing ỹ in these figures are as unpredictable as the strange attractor of the Lorenz equations (5.8); see the quotation from [I.43] in Section 1. The details of the aperiodic orbits computed by H. Spreuer or by W. Espe depend in an unpredictable, seemingly random ("pseudo-random") fashion on the employed computer, compiler, number format, numerical method, and its artificial parameters.
Remark: Concerning diversions of difference approximations for the Lorenz equations (5.8), in [I.3] there is a discussion of the early stages of the investigations of this problem.
9.4 SPURIOUS DIFFERENCE SOLUTIONS AND DIVERTING DIFFERENCE APPROXIMATIONS FOR DISCRETIZATIONS OF PDEs
In the present subsection, evolution problems will be considered consisting of
• linear or nonlinear PDEs for functions Y*: D × ℝ₀⁺ → ℝⁿ, with D ⊂ ℝ^q the domain of the spatially independent variables and t ∈ ℝ₀⁺ time, and, additionally,
• appropriate boundary conditions and unspecified initial data.
This defines an initial boundary value problem (IBVP). For q = 2 (or 3), the spatial variables will be denoted by ξ, η (and ζ). It is assumed that
• the spatial domain is compact and that
• boundary conditions are prescribed everywhere on the sufficiently smooth boundary ∂D.
Frequently in engineering or the sciences, an IBVP is replaced by an IVP for computational reasons. For this approximation in the context of a mathematical simulation, the following approaches are customary: (a) the dependency of Y* on the spatially independent variables is suitably approximated, while (b) the dependency of Y* on time t is retained from the PDEs. Concerning (a), there are in particular the three methods listed at the end of Section 1. Additionally, there is a Fourier expansion of the dependent variables with respect to the spatially independent variables, in conjunction with an integration over D and the employment of convolution sums (e.g., [I.8]), in the case that the nonlinearities in the PDEs are confined to products of (powers of) the dependent variables. Concerning a spectral method or a Fourier expansion, the following ansatz for an approximation is usually chosen:

(9.8)  Y_N(ξ, η, ζ, t) := Σ_{k=1}^{N} a_k(t) b_k(ξ, η, ζ) for (ξ, η, ζ, t) ∈ D × ℝ₀⁺, with N ∈ ℕ fixed,

such that the b_k individually satisfy the boundary conditions on ∂D.
Remarks: (1) Here b_1, ..., b_N are "new coordinates" with coefficients a_1, ..., a_N. (2) See [I.13] in the present volume for an employment of interval polynomials in the context of an enclosure and a verification of the existence of a solution Y* of a nonlinear IBVP.
The convergence of a sequence {Y_N} to Y* (in a suitable sense)
can be investigated by means of classical methods provided the PDEs are linear; however and generally, this is a very difficult problem in the case that the PDEs are nonlinear [I.8, pp. 247, 250].
The Lorenz equations (5.8) serve as an example for this replacement of Y* by Y_N as defined in (9.8). They can be derived by means of a Fourier expansion as applied to the well-known system of nonlinear (evolution-type) PDEs of fluid mechanics and heat transfer. For the case of N = 3, they have been derived in the classical paper by E. N. Lorenz [I.33]. For N = 39, a correspondingly generalized replacement has been derived by Monin [22]. According to [I.7, p. 660], the results for N > 3 are qualitatively identical with the ones for N = 3.
Under the assumption of the replacement of Y* by Y_N, each one of the methods listed at the end of Section 1, or a Fourier expansion, generates a system of (non)linear ODEs, whose initial conditions follow directly from (9.8) and the initial conditions for the original PDEs. The symbol y* will be used for the true solution of an IVP which has been generated by means of the preceding discussions. For practical reasons and traditionally, a difference method is applied to any such IVP. The treatment in the literature of the classical Lorenz equations (5.8) is an example, provided additional initial conditions are chosen. Consequently, the preceding discussions on diverting difference approximations apply literally with respect to the approximating IVPs. For nonlinear IBVPs, full discretizations are frequently generated as follows in the literature: at first, (i) a semi-discretization of the spatially independent variables is executed by means of any one of the methods listed at the end of Section 1 or a Fourier expansion; this is followed by (ii) a discretization of time in the resulting system of ODEs. In the context of the "reliability of discretizations", there arises the question whether or not (i) in conjunction with (ii) may lead to (A) the existence of spurious difference solutions or (B) the occurrence of diverting difference approximations, both for the full discretization. This question is of considerable importance in Computational Fluid Dynamics (CFD), as has been discussed by H. C. Yee et al.; see [I.56], [I.57], and [32]. The
question concerning (A) and (B) will be answered affirmatively by means of the subsequent Example 9.13, which rests on the Burgers equation

(9.9)  ∂Y/∂t + Y ∂Y/∂x = ε ∂²Y/∂x², with ε ∈ ℝ⁺, for (x, t) ∈ D := [0, 1] × ℝ₀⁺.

This equation [I.8, p. 183] (a) "is one of the few non-linear PDEs for which exact and complete solutions are known in terms of the initial values"; (b) these "solutions... exhibit a delicate balance between (non-linear) advection and diffusion"; therefore, these solutions have been studied "extensively as a mathematical model of turbulence". Because of (a), (9.9) is perhaps the most frequently employed test problem for the accuracy of approximations delivered by difference methods or spectral methods; see [I.8], [7], and [8]. In this context, boundary conditions of periodicity have been adopted frequently in the literature:

(9.10)  Y(0, t) = Y(1, t) and ∂Y(0, t)/∂x = ∂Y(1, t)/∂x for t ∈ ℝ₀⁺.
Additionally, an initial condition Y(x, 0) for x ∈ [0, 1] has to be chosen. In the subsequent treatment of (9.9) and (9.10), this function will not be specified. Because of (9.10), an integration of (9.9) with respect to x yields

(9.11)  (d/dt) ∫₀¹ Y(x, t) dx = 0, i.e., ∫₀¹ Y(x, t) dx = const.

The IBVP concerning (9.9) and (9.10) is inverse-monotone ([1] and [I.54]). For Y(x, 0) with x ∈ [0, 1], now all initial functions are admitted which
• satisfy the boundary conditions (9.10),
• possess continuous derivatives of the second order, and
• are bounded by constants Y̲, Ȳ ∈ ℝ such that Y̲ < Ȳ.
Because of the inverse-monotonicity, the corresponding set of true solutions Y* of the IBVP is enclosed by Y̲ and Ȳ:

(9.12)  Y*(x, t) ∈ [Y̲, Ȳ] for (x, t) ∈ D.

Example 9.13: For the IBVP concerning (9.9) and (9.10), a semi-discretization by means of a longitudinal method of lines has been introduced in [I.56] and [32], making use of the following spatial discretization:
(9.14)  Y_j := Y(x_j, t) for x_j := jh with h := 1/N and a fixed N ∈ ℕ; (1/2)(∂Y²(x, t)/∂x) is replaced by the usual central difference quotient of the first order; ∂²Y(x, t)/∂x² is replaced by the usual central difference quotient of the second order.

In [I.56] and [32], N = 3 was chosen. Because of the boundary conditions (9.10), an application of (9.14) with respect to (9.9) yields

(9.15)  Y′₁ + (3/4)(Y₂² − Y₃²) = ρ(Y₂ − 2Y₁ + Y₃),
        Y′₂ + (3/4)(Y₃² − Y₁²) = ρ(Y₃ − 2Y₂ + Y₁),    ρ := 9ε,
        Y′₃ + (3/4)(Y₁² − Y₂²) = ρ(Y₁ − 2Y₃ + Y₂).

As a simulation of the conservation law in (9.11), the authors of [I.56] and [32] have adopted the following condition:

(9.16)  Y′₁ + Y′₂ + Y′₃ = 0.

Consequently, (9.15) can be reduced to a system of ODEs in a phase plane. This system in ℝ² possesses four stationary points. Making use of vectors δᵢ = δᵢ(ρ) ∈ ℝ², these four stationary points are listed in Table 9.17 [32]:
Point            Type of stationary point
(1/3, 1/3)       stable spiral
(−1, 1) + δ₁     saddle point
(1, 1) + δ₂      saddle point
(1, −1) + δ₃     saddle point

Table 9.17

Concerning the IBVP of the Burgers equation (9.9), these stationary points are spurious. Each one of the saddle points possesses spurious manifolds Ms and Mu. The authors of [I.56] and [32] have discretized the system of ODEs in the phase plane by means of several standard difference methods. By use of a (parallel) Connection Machine, these authors found approximations of the following spurious difference solutions, provided the step sizes were not sufficiently small:
• spurious limit cycles and other spurious periodic difference solutions;
• the basin of attraction of the stationary point (1/3, 1/3) and its fractal boundary;
• approximations diverging from the interval under consideration in ℝ².
In analogy to the discussions in Subsection 9.3, it is observed that the manifolds Ms of the saddle points may cause diversions of computed difference approximations.
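For Example 9.13, the reduced system is small enough to be written out directly. The sketch below encodes (9.15), eliminates Y₃ via the conserved sum implied by (9.16) (normalized here to Y₁ + Y₂ + Y₃ = 1, so that the uniform state corresponds to the stationary point (1/3, 1/3) of Table 9.17), and applies the forward Euler method; ε = 0.1 and the step sizes are illustrative choices, not those of [I.56] or [32].

```python
def burgers3_rhs(y1, y2, eps=0.1, total=1.0):
    # Right-hand side of (9.15) with Y3 eliminated via Y1 + Y2 + Y3 = const,
    # a consequence of (9.16); 'total' = 1 and eps = 0.1 are assumptions.
    rho = 9.0 * eps               # rho := 9*eps, since h = 1/3
    y3 = total - y1 - y2
    f1 = -(3.0/4.0)*(y2*y2 - y3*y3) + rho*(y2 - 2.0*y1 + y3)
    f2 = -(3.0/4.0)*(y3*y3 - y1*y1) + rho*(y3 - 2.0*y2 + y1)
    return f1, f2

def euler(y, dt, steps):
    # Forward Euler; a large dt imitates the under-resolved runs reported
    # in [I.56] and [32].
    y1, y2 = y
    for _ in range(steps):
        f1, f2 = burgers3_rhs(y1, y2)
        y1, y2 = y1 + dt*f1, y2 + dt*f2
    return y1, y2
```

With a small step the iteration settles at the stationary point (1/3, 1/3); with a step of order one it does not, imitating the spurious asymptotic behavior described above (the fixed points of the Euler map coincide with the stationary points of the ODEs, but their stability depends on dt).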
Conclusion: Example 9.13 demonstrates that full discretizations of nonlinear IBVPs may possess spurious difference solutions or diverting difference approximations.

10. CONCLUDING REMARKS
The major purpose of this volume can be summarized as follows: "Computability with Reliability" through applications of Enclosure Methods. For discretizations of nonlinear differential equations, the discussions in [5] and in the present chapter have shown that practical applications of traditional difference methods are an "Industrial Art rather than a Mathematical Science". In fact, on the level of a discretization of ODEs or PDEs, it is generally impossible to determine whether a computed difference approximation ỹ is (a) an approximation of a true solution y* in any well-defined sense or (b) meaningless, such as a spurious difference solution or a diverting difference approximation, which is spurious in the more general meaning of this expression. Concerning (b), the existence or the actual occurrence has been shown by means of popular mathematical simulations taken from physics or engineering. As has been demonstrated in Subsections 9.2 and 9.3, poles of the ODEs or saddle points offer a favorable topographical environment for the occurrence of a diversion. In fact, in both cases solutions y* with initially small distance subsequently acquire a significant distance. Concerning saddle points, the cause of a diversion may be characterized by the title of [I.5]: "Computational Chaos May be Due to a Single Local Error". As has been shown in Subsections 7.4 and 9.3, a computed difference approximation ỹ is generally subject to the following influences, which are just as unpredictable and pseudo-random as the corresponding special cause of a diversion:
• the ones of the employed computer and compiler and
• the ones of the employed numerical method and its artificial parameters.
Applications of discretizations of ODEs or PDEs are of considerable practical importance in the design stage of advanced and costly engineering systems as, e.g., in the case of a major space technology project or high-speed flight. These applications involve obvious risks, provided new engineering systems with little or no advance technological experience are involved. Consequently,
applications of difference methods induce the following requirements for a user's experience and awareness:
(A) the real world problem underlying the simulation must be well understood;
(B) the qualitative structures of first a physical and then a mathematical simulation must be thoroughly investigated;
(C) the essential distinction of the set of true solutions y* of the DEs and the set of difference approximations ỹ must be realized.
The discussions in Section 7 imply the importance of (A) and (B) concerning the choice of a system of DEs from a set of possible candidates representing a simulation. Concerning (C), the discussions in Sections 8 and 9 are referred to. In view of the desired reliability of the computational analysis by means of discretizations of ODEs, either (I) sufficiently many and diversified numerical experiments should be carried out with respect to the choices of difference methods and their artificial parameters, and/or (II) perhaps only a few but sufficiently well chosen (verifying) enclosures of true solutions of ODEs should be determined, provided appropriate Enclosure Methods are available. Obviously, neither (I) nor (II) can ensure total reliability, since (I) is always incomplete and, therefore, yields only hints, and (II) depends (IIa) on the generality of the (few) chosen test cases and (IIb) on the fact that a failure in the execution of an Enclosure Algorithm may imply either the nonexistence of the true solution y* or (in exceptional cases) an accumulation of overestimates in such a way that the algorithm breaks down. Concerning a corresponding relative reliability of a mathematical model, the cost of (II) is expected to be of the same order of magnitude as the cost of (I).
Remarks: (1) The second case in (IIb) may be an indication of practical risks involved in a model with this property. (2) The discretizations of PDEs as discussed in Subsection 9.4 are affected by all the problems that have been observed for discretizations of ODEs. Consequently, there is a practical need for the development of Enclosure Methods for PDEs (that are not inverse-monotone).
The evaluation of the reliability of traditional (i.e., not totally error-controlled) difference methods raises the question:
"What are the consequences of a failure of a real world system because of computational errors in its design?"
The practical implications of this question justify the extent of the discussions concerning the reliability of the discretizations in the present chapter and its predecessor [5].

LIST OF REFERENCES
[I.X]  literature reference [X] in the first part, [5].
[1] E. Adams, H. Spreuer, Uniqueness and Stability for Boundary Value Problems with Weakly Coupled Systems of Nonlinear Integro-Differential Equations and Application to Chemical Reactions, J. Math. Anal. and Appl. 49, pp. 393-410, 1975.
[2] E. Adams, Sensitivity Analysis and Step Size Control for Discrete Analogies of Nonlinear Parabolic or Hyperbolic Differential Equations, pp. 3-14 in: Mathematical Methods in Fluid Mechanics, vol. 24, eds.: E. Meister, K. Nickel, J. Polasek; Verlag P. Lang, Frankfurt, 1982.
[3] E. Adams, D. Cordes, H. Keppler, Enclosure Methods as Applied to Linear Periodic ODEs and Matrices, ZAMM 70, pp. 565-578, 1990.
[4] E. Adams, Gear Drive Vibrations and Verified Periodic Solutions, pp. 395-416 in: Computer Arithmetic, eds.: E. Kaucher, S. M. Markov, G. Mayer; J. C. Baltzer, Basel, 1991.
[5] E. Adams, The Reliability Question for Discretizations of Evolution Problems, I: Theoretical Considerations on Failures, in: E. Adams, U. Kulisch (eds.), Academic Press, Boston (this volume).
[6] W. F. Ames, Nonlinear Partial Differential Equations in Engineering, Academic Press, 1965.
[7] W. F. Ames, Nonlinear Partial Differential Equations in Engineering, vol. II, Academic Press, 1972.
[8] W. F. Ames, Numerical Methods for Partial Differential Equations, 2nd ed., Academic Press, New York, 1977.
[9] W. J. Beyn, E. Doedel, Stability and Multiplicity of Solutions to Discretizations of Nonlinear Ordinary Differential Equations, SIAM J. Sci. Stat. Comp. 2, pp. 107-120, 1981.
[10] R. Bulirsch, J. Stoer, Numerical Treatment of Ordinary Differential Equations by Extrapolation Methods, Num. Math. 8, pp. 1-13, 1966.
[11] G. Bohlender, L. B. Rall, Ch. Ullrich, J. Wolff von Gudenberg,
PASCAL-SC: Wirkungsvoll programmieren, kontrolliert rechnen, Bibl. Inst., Mannheim, 1986.
[12] D. Cordes, Verifizierter Stabilitätsnachweis für Lösungen von Systemen periodischer Differentialgleichungen auf dem Rechner mit Anwendungen, Doctoral Dissertation, Karlsruhe, 1987.
[13] G. Diekhans, Numerische Simulation von parametererregten Getriebeschwingungen, Doctoral Dissertation, Aachen, 1981.
[14] H. Gerber, Innere dynamische Zusatzkräfte bei Stirnradgetrieben, Doctoral Dissertation, München, 1984.
[15] J. K. Hale, True Orbits and Numerical Orbits, Report Presented to Council on Engineering Research, January 17, 1991.
[16] M. Hirt, Anwendung von DIN 3990 in der Industrie, pp. 9-48 in: Sichere Auslegung von Zahnradgetrieben, VDI-Verlag, Düsseldorf, 1987.
[17] IBM High Accuracy Arithmetic Subroutine Library (ACRITH), Program Description and User's Guide, SC 33-6164-02, 3rd ed., April 1986.
[18] H. Keppler, unpublished private communication.
[19] M. Kölle, Modellierung und numerische Behandlung von Getriebeschwingungen, Diploma Thesis, Karlsruhe, 1988.
[20] F. Küçükay, Über das dynamische Verhalten von einstufigen Zahnradgetrieben, VDI-Fortschrittberichte, Series 11, VDI-Verlag, Düsseldorf, 1981.
[21] T. Y. Li, J. A. Yorke, Period Three Implies Chaos, American Math. Monthly 82, pp. 985-992, 1975.
[22] A. S. Monin, Hydrodynamical Instability, Usp. Fiz. Nauk 150, p. 61, 1986.
[23] R. Müller, Schwingungs- und Geräuschanregung bei Stirnradgetrieben, Doctoral Dissertation, München, 1991.
[24] K. Naab, Stabilitätsuntersuchungen an linearen Systemen mit periodisch zeitveränderlichen Parametern, VDI-Fortschrittberichte, Series 11, Number 41, VDI-Verlag, Düsseldorf, 1981.
[25] G. Niemann, H. Winter, Maschinenelemente, vol. II, 2nd ed., Springer-Verlag, Berlin, 1983.
[26] H.-O. Peitgen, D. Saupe, K. Schmitt, Nonlinear Elliptic Boundary Value Problems Versus Their Finite Difference Approximations: Numerically Irrelevant Solutions, J. für die reine und angew. Math. 322, pp. 74-117, 1981.
[27] H.-O. Peitgen, P. H. Richter, The Beauty of Fractals, Springer-Verlag, Berlin, 1986.
V. S. Rjabenki, A. F. Filippow, Über die Stabilität von Differenzengleichungen, VEB Deutscher Verlag der Wissenschaften, Berlin, 1960.
U. Schulte, Einschließungsverfahren zur Bewertung von Getriebeschwingungsmodellen, Doctoral Dissertation, Karlsruhe, 1991; VDI-Fortschrittberichte, Series 11: Schwingungstechnik, Number 146, VDI-Verlag, Düsseldorf, 1991.
H. Spreuer, unpublished private communication.
A. Stuart, Linear Instability Implies Spurious Periodic Solutions, IMA J. of Num. Anal. 3, p. 465-486, 1989.
P. K. Sweby, H. C. Yee, On Spurious Asymptotic Numerical Solutions of 2x2 Systems of ODEs, Report: University of Reading, 1991.
W. Walter, Gewöhnliche Differentialgleichungen, 4th ed., Springer-Verlag, Berlin, 1990.
H. Winter, Grundgedanken, Aufbau und Handhabung von DIN 3990 und ISO 6336, p. 1-9 in: Sichere Auslegung von Zahnradgetrieben, VDI-Verlag, Düsseldorf, 1987.
V. A. Yakubovich, V. M. Starzhinskii, Linear Differential Equations with Periodic Coefficients, Vol. 1, J. Wiley and Sons, New York, 1975.
Partial List of Symbols and Abbreviations

a := b: a is defined by b
C: set of complex numbers
N: set of natural numbers
Z: set of integer numbers
IR: set of real numbers
IR+: set of positive numbers
IR0+: set of positive numbers and zero
IR^n, C^n: corresponding vector spaces
L(IR^n): set of real n x n matrices
[a] := [a_lo, a_hi] := {a | a_lo <= a <= a_hi} in A, with A := IR or IR^n or L(IR^n): closed interval
I: identity matrix in L(IR^n)
diag(...): diagonal matrix in L(IR^n)
Phi: fundamental matrix of y' = Ay
MS: stable manifold
MU: unstable manifold
rho(Phi(T)): spectral radius of Phi(T)
lambda: eigenparameter
T: time period
[I-n]: literature reference number n in Part I
a^T: transpose of vector a
y_i: vector demarcated by index i
(y)_i: i-th component of vector y
A_ij: element of matrix A
y*: true solution of the DE
y*_per: periodic true solution of the DE
y*_part: particular true solution of y' = Ay + b
y~: true difference solution of a discretization
y^: difference approximation: approximation of y~
y~_sp: spurious true difference solution
ODE: ordinary differential equation
PDE: partial differential equation
DE: either ODE or PDE
BVP: boundary value problem with ODEs
IVP: initial value problem with ODEs
IBVP: initial boundary value problem with PDEs
KKR Bandstructure Calculations, A Challenge to Numerical Accuracy
R. Schütz, H. Winter, and G. Ehret

We discuss the formalism and the merits of the KKR bandstructure method, one of the traditional one-particle theories devised for describing the electrons in periodic solids. We argue that it is worthwhile to apply this mathematically well-founded scheme, in spite of its considerable computational expenditure, to the complicated substances of current interest, once the numerical problems have been sorted out and removed with the help of ACRITH.
1
Background
It is not possible to give an exact quantum mechanical description of the electrons in a solid. The expression for any physically observable quantity may be written as the sum of an infinite number of contributions due to elementary processes visualized by so-called Feynman diagrams. In general, there is no small parameter indicating which of those may safely be neglected. Physical intuition is therefore summoned to invent approximations that promise to keep track of part of the reality and whose predictions can be compared to experimental evidence. Such a conceptually simple, widely used approximation is the mean field approach: a particular electron moves in the thermodynamic ensemble-averaged field of the nuclei and all the other electrons. One is back to the one-particle problem and searches for bound solutions of a second order differential equation containing an effective one-particle potential, to be determined in a self-consistent manner specified later on. This is the famous Schrödinger wave equation. If the electrons are very fast, that is, if relativistic effects become important, this equation has to be replaced by a 4-component first order differential equation, the so-called Dirac equation. Since this generalization
does not affect the following reasoning, or the nature of the arising numerical problems, we restrict ourselves to the nonrelativistic case. The solutions of this wave equation are made unique by imposing boundary conditions appropriate to the physical situation: besides being continuously differentiable over the whole volume of the solid, they are required to reflect the periodicities of the system. If we concentrate on perfect crystals, these result from the fact that the whole solid may be generated by spatial repetition of one building block, the unit cell. The centers, R_j, of those unit cells form a lattice, the so-called Bravais lattice, determining the crystal class to which the given system belongs. This situation is expressed by the periodicity of the potential with respect to the translation vectors, R_j, of the Bravais lattice. According to Floquet's theorem, known in physics as Bloch's theorem, the desired solutions of Schrödinger's equation can be chosen to be simultaneously eigenfunctions of the discrete Bravais lattice translation operators; the eigenfunctions of these operators can be shown to be of the form exp{ikr}, with k some arbitrary vector parameter. The wave function is then given by the formula
psi_k(r) = exp{ikr} u_k(r),   (1.1)

where u_k(r) is a lattice-periodic function. Depending on the parameter k, this function obviously satisfies the following boundary condition:

psi_k(r + R_j) = exp{ikR_j} psi_k(r).   (1.2)
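The Bloch form (1.1) and the boundary condition (1.2) can be checked numerically in a one-dimensional toy setting; the lattice constant, the value of k, and the periodic function u below are arbitrary illustrative choices, not data from the chapter:

```python
import cmath
import math

a = 1.7   # assumed lattice constant for a 1-D illustration
k = 0.9   # an arbitrary pseudo-wave vector

def u(r):
    """Any lattice-periodic function: u(r + a) = u(r)."""
    return 2.0 + math.cos(2.0 * math.pi * r / a)

def psi(r):
    """Bloch wave in the spirit of (1.1): psi_k(r) = exp(i k r) u_k(r)."""
    return cmath.exp(1j * k * r) * u(r)

def bloch_residual(r, j):
    """|psi(r + R_j) - exp(i k R_j) psi(r)|, which vanishes by (1.2)."""
    R = j * a
    return abs(psi(r + R) - cmath.exp(1j * k * R) * psi(r))

max_residual = max(bloch_residual(r, j)
                   for j in range(-3, 4)
                   for r in (0.0, 0.33, 1.2))
```

Up to floating point rounding, the residual is zero for every lattice translation R_j, which is exactly the content of the boundary condition.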
The values of the pseudo-wave vector, k, leading to different wave functions are confined to a finite volume in a hypothetical three-dimensional k-space, the reciprocal space: starting with a particular k value, say k_0, in the reciprocal space we can construct a set of identical wave functions, belonging to the wave vectors k_n = k_0 + G_n with exp{iG_nR_j} = 1 for all R_j. As a consequence, our k-vector parameters, characterizing the whole set of wave functions, psi_k, are restricted to the first Brillouin zone of the reciprocal space spanned by the lattice vectors, G_n. Unfortunately, the realistic form of the one-particle
potentials does not allow for solving Schrödinger's equation analytically. The analytically solvable Kronig-Penney model, e.g., consisting of a periodic array of delta-distribution type potentials, gives only some qualitative insight into the gross physical nature of the psi_k's. Laborious numerical work is therefore necessary to achieve a quantitative physical understanding of periodic solids on a single-particle level and to account for the enormous variance of properties observed in different systems. This line of research, which started in the thirties, is called bandstructure work. Its development and success are tightly linked to progress made in the field of computers. It therefore does not come as a surprise that a host of different methods have been developed in previous decades, simplifying the bandstructure problem even more through the introduction of further approximations. A key point in this context is the validity of the Rayleigh-Ritz variational principle: the task of solving Schrödinger's equation may be transformed into a variational problem, since there exists a functional of psi whose stationary points are solutions of the wave equation. First order errors in the wave functions thus cause only second order errors in the eigenvalues of the bound solutions. This opens the possibility, instead of solving Schrödinger's equation subject to the proper boundary conditions, to work with test functions composed of a fixed basis set and to determine the expansion coefficients by a search for the stationary points of the functional in this restricted space. We mention in this context, for example, the APW (augmented plane waves) and the LCAO (linear combination of atomic orbitals) methods, named after the nature of the basis used. A great deal of sensible results are due to these and similar methods.
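The quadratic error behavior underlying the variational principle can be illustrated on a small matrix analogue; the 2x2 symmetric matrix below is a made-up stand-in for a Hamiltonian, not anything from the KKR formalism. A first-order error of size eps in the eigenvector shifts the Rayleigh quotient away from the exact eigenvalue only at order eps^2:

```python
def rayleigh_quotient(A, v):
    """R(v) = (v^T A v) / (v^T v) for a real symmetric matrix A."""
    Av = [sum(A[i][j] * v[j] for j in range(len(v))) for i in range(len(v))]
    return sum(vi * avi for vi, avi in zip(v, Av)) / sum(vi * vi for vi in v)

# toy symmetric "Hamiltonian" with exact eigenpair (3, (1, 1))
A = [[2.0, 1.0], [1.0, 2.0]]

eps = 1e-3                          # first-order error in the eigenvector
v_exact = [1.0, 1.0]
v_perturbed = [1.0 + eps, 1.0 - eps]

error = abs(rayleigh_quotient(A, v_perturbed) - 3.0)
# analytically, error = 2*eps**2 / (1 + eps**2): second order in eps
```

With eps = 1e-3 the eigenvalue error is about 2e-6, four orders of magnitude smaller than the eigenvector error, which is why variational schemes deliver good eigenvalues even from modest basis sets.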
A further simplification comes from the possibility to approximately linearize with respect to energy certain intermediary quantities depending mainly on energetically low lying states. The whole then boils down to a linear eigenvalue problem. The most prominent bandstructure methods in this context are O. K. Andersen's LMTO (linear combination of muffin-tin orbitals) and the L (linearized) APW. In view of the considerable success of these procedures and their widespread use, the question arose as to whether a rigorous solution of the bandstructure problem, proliferating eigensolutions, was still necessary.
So the venerable KKR method, named after its inventors Korringa, Kohn and Rostoker and based on scattering theory, was abandoned by many researchers, since it requires an extra high amount of numerical expenditure. The attractive feature of this method, however, is that it yields true solutions of the wave equation satisfying the exact boundary conditions, instead of being merely variational. With the advent of supercomputers working in the vector and multiprocessor mode, the computation of quantities has become possible that depend, among others, on highly excited one-particle states. It has been demonstrated that the state wave functions as delivered by the aforementioned approximate schemes are insufficient for these purposes, since they violate basic sum rules, e.g., the closure relation and the so-called f-sum rule. A revival of the KKR that meets these extended requirements is thus highly desirable. Apart from updating the partly antiquated codes, this requires an improvement of the numerical stability of the nonlinear KKR procedure, especially in the case of solids with many atoms in the unit cell, and we discuss how ACRITH has been exploited and can be further used to achieve this goal. At first, however, it is worthwhile to give a sketch of the KKR formalism.
2
The KKR Formalism
The Schrödinger equation reads (in atomic units):

(-Delta + V(r)) psi(r) = E psi(r).   (2.1)
Here, V(r) is the periodic crystal potential, psi(r) is the one-electron wave function, and E is the energy parameter. As we search for bound solutions, equation (2.1) has to be considered as an eigenvalue problem. For a given pseudo-wave vector, k, we expect a set of eigenvalues, eps_{k,lambda}, together with the related wave functions. They are distinguished by the integer valued 'bandindex' lambda. Considering a particular k-vector, the differential equation (2.1) can easily be shown to be equivalent to the following integral equation:
psi_k(r) = Integral_{Omega_unit} G_k(r, r'; E) V(r') psi_k(r') d^3r'.   (2.2)
Here, due to the periodicity of the system, the space integral is restricted to the volume, Omega_unit, of one unit cell; Green's function for the 'empty lattice', G_k, obeys Dyson's equation:

(Delta_r + E) G_k(r, r'; E) = delta(r - r').   (2.3)
A special solution of equation (2.3) is found by the following ansatz:
G_k(r, r'; E) = (1/Omega_unit) Sum_n exp{i(k + G_n)(r - r')} / ((k + G_n)^2 - E).   (2.4)
For the following it proves advantageous to switch from the global coordinates, r, r', to local ones. That is to say, we subdivide the unit cell into Wigner-Seitz (WS) cells, regions surrounding the individual atoms, kappa, and write for a space point r, lying in the WS cell of atom kappa in unit cell j:

r = R_j + r_kappa + rho.   (2.5)
R_j is the global coordinate of the center of cell j, r_kappa is the position of atom kappa relative to this center, and rho is the space coordinate with respect to the atom kappa. This situation is visualized in Figure 1. As can be seen straightforwardly by inserting formula (2.4) into equation (2.2), due to this ansatz, psi_k is guaranteed to fulfill the proper boundary condition (equation (1.2)). However, evaluating G_k is a delicate and tedious
Fig. 1
Coordinates in the unit cell
task: it implies performing an infinite sum over the lattice vectors, G_n, of the reciprocal space. Moreover, this sum over a three-dimensional manifold shows a poor convergence. It is only assured after all by the oscillatory behavior of the exponential. In addition, some of the denominators in formula (2.4) may approach the vicinities of poles for some eigenvalues E = eps_{k,lambda}. These difficulties have been overcome by [1], adapting Ewald's techniques [2] to the evaluation of relation (2.4): exploiting a mathematical identity, and concentrating on cell j = 0 with R_j = 0, formula (2.4) can be cast into the following form:
G_k = G_k^(1) + G_k^(2),   (2.6)

where G_k^(1) (equation (2.7)) is a reciprocal-space sum over the G_n whose terms carry the convergence factor exp{-((G_n + k)^2 - E)/q}, and G_k^(2) (equation (2.8)) is the corresponding real-space sum over the lattice vectors R_j.
Equations (2.6) to (2.8) split G_k into two contributions: the evaluation of G_k^(1) requires performing a summation over the reciprocal lattice vectors, G_n, whereas G_k^(2) contains a sum over the real space lattice vectors R_j. In contrast to the representation (2.4), however, these sums converge within finite clusters of lattice points, whose size depends on the choice of the Ewald parameter q. Typical values of q are between 0.5 and 2.0; cluster sizes of about 100 lattice points guarantee convergence of the procedure, particularly the independence of the result for G_k from the precise choice of q. The next step is to choose an appropriate basis for the one-particle states. In view of our subdivision of space and the fact that the crystal potential is not far from spherically symmetric within major parts of the WS cells, this choice is almost compelling: for r in the WS cell of site kappa, the KKR defines the functions forming this basis to be solutions of a wave equation containing a spherically symmetric potential, V_kappa(|rho|), that is as close as possible to the true crystal potential, V(r). V_kappa(|rho|) is constructed as follows: within a sphere of radius r_mt^(kappa) centered at site kappa, the so-called muffin-tin sphere, it is the spherically averaged potential V(r). In the region between this sphere and the boundary of the WS cell, the interstitial region of site kappa, it is the spatial average, V_0, of V(r) as averaged with respect to the interstitial regions of all atoms in the unit cell. Without loss of generality, V_0 is usually equated to zero. The solutions of this simplified wave equation are separable into the angular and the radial parts of rho, and thus our basis states are of the following form:
phi_{lm}^(kappa)(rho, E) = R_l^(kappa)(|rho|, E) Y_{lm}(rho/|rho|).   (2.9)
Here, lm are angular momentum quantum numbers, Y_{lm} are the real spherical harmonics, and R_l^(kappa) is that solution of the radial part of the muffin-tin wave equation which is regular at the origin. This equation reads:
[-(1/rho^2) d/drho (rho^2 d/drho) + l(l + 1)/rho^2 + V_kappa(rho) - E] R_l^(kappa)(rho, E) = 0.   (2.10)
In the interstitial region (IR), where V_kappa is zero, R_l^(kappa) can be written as a linear combination of the free-wave radial solutions j_l(sqrt(E) rho) and n_l(sqrt(E) rho), the spherical Bessel and Neumann functions, respectively. In the interstitial region R_l^(kappa) is usually written in terms of the 'phase shifts', delta_l^kappa(E):

R_l^(kappa)(rho, E) ~ cos(delta_l^kappa) j_l(sqrt(E) rho) - sin(delta_l^kappa) n_l(sqrt(E) rho),   (2.11)
with, e.g., 0 <= delta_l^kappa <= pi. Requiring the continuity of R_l^(kappa) at the muffin-tin radius then leads us to express the delta_l^kappa's through the numerically determined logarithmic derivatives of R_l^(kappa)(rho) at r_mt^(kappa). We obtain
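The logarithmic-derivative matching can be made concrete in the simplest case, l = 0 with a square-well muffin-tin potential of depth V0, for which the inside radial function is known in closed form; the unit convention (hbar = 2m = 1) and all parameter values below are illustrative assumptions, not data from the chapter:

```python
import math

def delta0(E, V0, r_mt):
    """s-wave (l = 0) phase shift for an attractive square well of depth V0
    and radius r_mt, in units hbar = 2m = 1.

    Inside:  u(r) = sin(k_in r),      k_in = sqrt(E + V0)
    Outside: u(r) = sin(k r + delta), k    = sqrt(E)
    Matching the logarithmic derivative u'/u at r_mt gives
    tan(k r_mt + delta) = (k / k_in) tan(k_in r_mt).
    Valid as written while k_in * r_mt < pi/2 (principal branch of atan).
    """
    k = math.sqrt(E)
    k_in = math.sqrt(E + V0)
    return math.atan((k / k_in) * math.tan(k_in * r_mt)) - k * r_mt

free = delta0(0.5, 0.0, 1.0)   # no well: the phase shift must vanish
well = delta0(0.5, 1.0, 1.0)   # attractive well: a nonzero phase shift
```

The vanishing phase shift for V0 = 0 mirrors the limit V -> 0 discussed next, in which the radial function reduces to the free spherical Bessel function.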
In the limit V -> 0, the radial function R_l^(kappa) is equal to j_l(sqrt(E) rho); one is thus in a position to represent Green's function (equations (2.6) to (2.8)) in the basis (2.9).
This yields
Here, the C's are the Gaunt numbers, which are defined by the following integral:
A considerable simplification for the following is achieved by neglecting the differences between the muffin-tin potentials V_kappa(rho) and the true potential V(r), that is, if one adopts the muffin-tin approximation. This is an excellent approximation for most of the bulk materials of interest. With some additional expenditure, however, one can also take account of the non muffin-tin corrections to the potential, either in a perturbative way or rigorously, by extending the muffin-tin KKR to the full potential KKR. Also in this respect the theory has reached a high degree of development. In the following, however, we restrict ourselves to the muffin-tin approximation. The significant simplification comes through the fact that the integral on the r.h.s. of equation (2.2) is now restricted to the muffin-tin spheres, since V_mt is zero in the interstitial regions. Within the muffin-tin spheres, it is possible to expand the wave function psi_{k,lambda}(rho_kappa) in terms of the single-site basis functions as defined by equation (2.9) and to work with the ansatz:

psi_{k,lambda}(rho_kappa) = Sum_{lm} c_{lm}^(kappa) R_l^(kappa)(|rho_kappa|, E) Y_{lm}(rho_kappa/|rho_kappa|).   (2.17)
A rapid convergence of the sum over lm can be expected, and indeed, experience tells that l_max = 3 to 5, depending on the substance in question, is quite sufficient. The c_{lm}^(kappa) are the state-vector coefficients. Inserting equations (2.13) to (2.17) into the integral equation (2.2) and making use of the wave equation, it is easy to cast the whole formalism into the form of the following nonlinear eigenvalue problem:
A(k, E) c = 0.   (2.18)
A can be written in terms of quantities defined above:
(2.19)
(2.20)
Equation (2.18) has only nontrivial solutions if special values, eps_{k,lambda}, of E satisfy the relation:

Det(A(k, E)) = 0.   (2.21)
The index lambda, the bandindex, stands for the multiplicity of the possible solutions. Physically, eps_{k,lambda} is the energy of an electron occupying the stationary state psi_{k,lambda} in the one-particle mean field approximation.
To bring the formalism into a self-contained shape, we close this chapter with some remarks concerning the construction of the potential. V(r) is the expectation value of the Coulomb potential, V_coul, as felt by a particular electron due to the presence of the nuclei and the other electrons, plus a term, V_xc, accounting approximately for many body effects in the one-particle scheme. It reads
V(r) = V_coul(r) + V_xc(r).   (2.22)
Therefore, in order to calculate V(r), we need to know the expectation values of the charge densities, n_nuc(r) and n(r), of the nuclei and the electrons, respectively. While n_nuc is approximated by point charges at the lattice points, we need to know the eps_{k,lambda} and psi_{k,lambda} to get n(r); this quantity is determined through the following relation:

n(r) = (2/Omega_BZ) Sum_lambda Integral_BZ f(eps_{k,lambda}) |psi_{k,lambda}(r)|^2 d^3k.   (2.23)
Here, f(E) is the Fermi-distribution function and Omega_BZ is the volume of the Brillouin zone (BZ). Equation (2.23) derives from the fact that, according to Pauli's principle, at zero temperature the possible one-particle states are filled with two electrons each, up to the Fermi energy eps_F, which in equation (2.23) is taken care of by the factor f(eps_{k,lambda}). Anyhow, equations (2.22) and (2.23) define a problem of self-consistency: it is not sufficient to solve equations (2.18) and (2.21) once and for all for a given potential, V; rather, one has to start with a clever guess for V, the input potential V_in^(1), compute eps_{k,lambda}^(1) and psi_{k,lambda}^(1) based on this potential, and evaluate, by means of relations (2.22) and (2.23), a new potential, the output potential V_out^(1). For the following cycle a linear combination of the former input and output potentials yields the new input potential
V_in^(i+1) = (1 - gamma) V_in^(i) + gamma V_out^(i).   (2.24)
The choice of gamma depends on the material under consideration. The self-consistency cycles can be stopped once an output potential, say V_out^(i), is reasonably close to the related input potential V_in^(i). With these remarks we hope to have convinced the reader that the KKR bandstructure method demands a highly efficient and numerically accurate code.
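The self-consistency cycle with linear mixing of input and output potentials can be sketched in a few lines; the scalar fixed-point map used below as `compute_output` is a made-up stand-in for the expensive step of solving (2.18)/(2.21) and rebuilding the potential, and all names and parameter values are illustrative:

```python
import math

def scf_cycle(compute_output, v_in, gamma=0.5, tol=1e-12, max_iter=500):
    """Self-consistency iteration with linear mixing, in the spirit of (2.24):
    V_in^(i+1) = (1 - gamma) * V_in^(i) + gamma * V_out^(i).
    Stops once the output potential is reasonably close to the input."""
    for _ in range(max_iter):
        v_out = compute_output(v_in)
        if abs(v_out - v_in) < tol:
            return v_out
        v_in = (1.0 - gamma) * v_in + gamma * v_out
    raise RuntimeError("self-consistency not reached")

# toy stand-in for the potential update (a contraction with a fixed point)
v_sc = scf_cycle(math.cos, v_in=1.0)
```

In a real code v_in would be a potential on a spatial grid and gamma a material-dependent mixing parameter; the structure of the loop is the same.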
3
Numerical Procedures and Implications
To find solutions of equations (2.18) and (2.21), we interpret them as a linear eigenvalue problem for given values of the parameter E by writing:
A(k, E) c(E) = a(E) c(E),   (3.1)
(3.2)
The eigenvalues a = 0 then define the solutions of the original problem, and we have to search for those special values, eps_{k,lambda}, of E for which equations (3.1) and (3.2) have zeros in their eigenvalue spectrum {a_i}. This task is tackled with the help of a bisection algorithm. It is simplified by the monotonic behavior of the a_i considered as functions of E: the special form of equation (2.18) guarantees the validity of the following relation:

sign(da_i(E)/dE) = 1 for each branch i.   (3.3)
The individual steps of the procedure are as follows. We start by solving the eigenvalue problem (equations (3.1) and (3.2)) at the lower and the upper boundary, eps_l and eps_u respectively, of the energy interval I: eps_l <= E <= eps_u of interest. Due to equation (3.3), the number, n_0, of zero eigenvalues a within I is then given by the relation:
n_0 = m_l^neg - m_u^neg.   (3.4)
Here, m_l^neg and m_u^neg are the numbers of negative eigenvalues, a, at eps_l and eps_u, respectively. Next we solve the eigenvalue problem at the intermediary point eps_b = (eps_l + eps_u)/2. Finite numbers for the quantities
indicate that there are solutions of the KKR problem within the corresponding interval. For the moment we investigate further the leftmost interval I^(l) containing a finite number of zero eigenvalues, whereas the other, if there is one, is put on a stack for later inspection. The procedure outlined above is now repeated for I^(l), and the whole bisection algorithm is continued until we have reached the leftmost interval, I^(lambda), of I with width equal to zeta, the tolerance, containing at least one zero eigenvalue a; the value of E at the center of I^(lambda) is then identified with a KKR eigenvalue, say eps_{k,lambda}. Subsequently, the leftmost interval stored on the stack is considered and treated in exactly the same way as described above. The job is finished when the positions of all the I^(lambda) have been spotted, that is to say, when the stack has been emptied. Whilst the bisection algorithm requires solving the eigenvalue problem at the boundaries of each investigated interval, the eigenvectors have to be computed at the end, when all the intervals I^(lambda) are known. The tolerance parameter, zeta, has to be defined at the beginning. On the one side, zeta should be chosen as small as possible, since it determines the uncertainty in the eps_{k,lambda}; on the other, it should be large enough to guarantee a proper performance of this algorithm in the presence of the 'numerical noise' of the computer. Moreover, the degree of accuracy in the steps setting up the KKR matrix and of the eigenvalue solver subroutines puts a lower limit on zeta. A second tolerance parameter comes into play whenever there arises the question as to whether a particular eigenvalue, a, is sufficiently close to zero to be considered as indicating a valid solution of the KKR problem. It turns out that it is reasonable to use the same zeta for this purpose as well.
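The stack-based bisection on eigenvalue counts can be sketched as follows; `neg_count` stands in for a full solve of the linear eigenvalue problem at energy E (here simulated with synthetic monotonically increasing branches a_i(E) = E - e_i), and all names and parameter values are illustrative:

```python
def find_roots(neg_count, e_lo, e_hi, zeta):
    """Locate all energies where an eigenvalue branch a_i(E) crosses zero.

    neg_count(E) returns the number of negative eigenvalues a_i(E); since
    the branches increase monotonically with E (cf. relation (3.3)), the
    number of zeros in [lo, hi] is neg_count(lo) - neg_count(hi).
    Intervals are bisected, the right half stacked, until width <= zeta."""
    found = []
    stack = [(e_lo, neg_count(e_lo), e_hi, neg_count(e_hi))]
    while stack:
        lo, n_lo, hi, n_hi = stack.pop()
        n_zeros = n_lo - n_hi
        if n_zeros == 0:
            continue
        if hi - lo <= zeta:
            # center of the interval, with multiplicity for near-degeneracies
            found.extend([(lo + hi) / 2.0] * n_zeros)
            continue
        mid = (lo + hi) / 2.0
        n_mid = neg_count(mid)
        stack.append((mid, n_mid, hi, n_hi))   # inspected later
        stack.append((lo, n_lo, mid, n_mid))   # leftmost interval first
    return sorted(found)

# synthetic spectrum: a_i(E) = E - e_i, so a_i(E) < 0 exactly when E < e_i
roots = [0.30, 0.50, 0.55, 0.90]
count = lambda E: sum(1 for e in roots if E < e)
approx = find_roots(count, 0.0, 1.0, 1e-9)
```

The clustered pair 0.50/0.55 shows why the counting version of bisection is preferred: each subinterval carries the number of enclosed zeros, so nearly degenerate eigenvalues are not silently merged.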
The numerical requirements become still more stringent if one keeps in mind that da(E)/dE may be very large, necessitating a reduction of the magnitude of zeta, and/or that a subset of eigenvalues, eps_{k,lambda}, may by symmetry become exactly or nearly
degenerate. In this context it is also important to remark that the KKR matrix is rather ill-conditioned in many cases: its matrix elements may differ by orders of magnitude. The same is true for the components of the state vectors, c, as well, whereby for certain important applications reliable numbers for all of them are needed. This situation may occur either if, for substances with one or a few atoms in the unit cell, we need high angular momentum components of c, related to small values of the corresponding phase shifts, delta_l, or if we are concerned with systems containing many atoms in the unit cell, which are partly strong and partly weak scatterers. In these cases the dimension of A may become considerable (up to 200). Using the traditional version of the KKR program in practical applications, containing the EISPACK-library eigenvalue solver subroutines, we perceived unpleasant features of part of the solutions: as a function of k, jumps in eps_{k,lambda} revealed the occurrence of false eigenvalues. Especially worrying was the massive violation of orthogonality between some of the KKR one-particle wave functions belonging to different energies. This important property of the KKR formalism can be theoretically proved; it reads:
provided the c_{k,lambda} are properly normalized. Such a failure spoils the fulfillment of the closure relation, reading:
(3.6)
and affects in an uncontrollable manner the data for the following matrix elements, which are crucial in many problems of solid-state physics:
Because of relation (3.6), these matrix elements should satisfy the following sum rule:
(3.8)
for any k, q and lambda'. Since the M's are positive definite, a violation of (3.8) can easily be proven, even if the sum over lambda is truncated. As a final consequence we point out the ensuing violation of the f-sum rule:
with n_e the total number of electrons in the unit cell. This drawback renders impossible, e.g., a reasonable first-principles assessment of the system's response to external perturbations coupling to the electronic density, properties that determine many equilibrium and transport quantities. It is hard to eliminate these false eigenvalues and states from a host of data produced on a dense k mesh within the BZ. Furthermore, such failures indicate some inherent, maybe intolerable, degree of inaccuracy in all those results as well that may not be recognized as grossly wrong on inspection. In those cases one gets a measure for the attained accuracy if one attempts to construct the wave functions, psi_{k,lambda}(rho_kappa), in the interstitial regions. For these purposes we do not employ formula (2.16), but we evaluate the following expression due to [1]:
(3.10)
psi_{k,lambda}^interst., which depends on the state vector coefficients, c, can easily be seen to fulfill the boundary conditions at the surface of the unit cell. If l_max is large enough and the numerical solution of the KKR equation is sufficiently accurate, psi_{k,lambda}^interst. should match psi_{k,lambda}, as determined by use of formula (2.16), differentiably on the surface of the muffin-tin sphere. In many cases we observed intolerably large discontinuities when l_max was put equal to 3. The increase of l_max to 6 rather made things worse instead of causing an improvement: a clear indication of an insufficient numerical performance of the codes. The KKR program is too complex to allow spotting the source of inaccuracy on inspection. Instead, we have to replace part of the original code with a reliable routine, either guaranteeing correct results or indicating numerical instabilities. This unique tool is ACRITH. So, as a first attempt, we replaced the library eigenvalue solver by the corresponding ACRITH subroutine package for Hermitian matrices, due to [3]. These routines indeed revealed that the eigenvalue solver is the crucial part of the KKR program. They avoided false eigenvalues, improved the results and led to a satisfactory fulfillment of the relations (3.5) and (3.8) for the tested k-points. The ACRITH results turned out to be enclosed between tight bounds. The cause for the significant improvement of the results seemed to
be the use of a better algorithm and the virtually exact execution of scalar products. The disadvantage of this kind of procedure, of course, is the dramatic increase in CPU time, limiting ACRITH in this case to test purposes. Anyhow, ACRITH solved the problem of finding the crucial parts of the program and provided us with accurate solutions at special k-points. This enabled us to systematically judge the performance of other library subroutines. The IMSL eigenvalue subroutine package turned out to do far better, and a substantial improvement has been achieved by implementing the KKR program with the latter, without slowing it down too much. In our final version an ACRITH-derived subroutine package due to [4] takes care of the eigenvalue solver task. This subroutine does not preserve the verification capability of ACRITH but maintains the high accuracy in evaluating scalar products. Its performance is excellent, and since the CPU time for this part of the program has increased by only a factor of 1.5 as compared to the original version, the present updated code is well suited for large scale and reliable production runs.
4
Some Results
To give some illustration of the considerations of the previous chapters, we now show KKR results, obtained for two metals using the updated code described above, as representatives of rather different classes of materials. Vanadium is a typical d-transition metal with a large, rapidly varying value for sin(delta_{l=2}) at E = eps_F, the Fermi energy separating the occupied from the empty part of the one-particle states. In Figures 2a, 2b, and 2c we show the twelve energetically lowest eigenvalues eps_{k,lambda} as a function of |k| for k in (1,0,0)-, (1,1,0)- and (1,1,1)-direction, respectively. The eigenvalues for forty k-points per direction can be interpolated to build continuous curves, the energy bands. The frequent occurrence of clustering of eigenvalues, mentioned in the preceding section and requiring special numerical care, is clearly visible. The ACRITH-controlled improvement of the code was mainly necessary in this case to improve the orthogonality between the states 1 to 3 on the one side and the excited states 7 to 12 on the other.
Fig. 2 The twelve lowest valence bands of Vanadium along three directions in the BZ

Fig. 3 The fifty lowest energy bands of tetragonal YBa2Cu3O7 above the semi-core states for k in (1,0,0) direction
A quite different example is the stoichiometric ceramic compound YBa2Cu3O7, a representative of the family of high superconducting transition temperature materials, with a T_c of 92 K. By including a muffin-tin sphere in the vicinity of an empty site, to make the muffin-tin spheres more space-filling, this substance has 14 atoms in the unit cell. If we choose l_max = 2 for the oxygens, l_max = 3 for the transition metal sites and l_max = 1 for the empty sphere, our KKR matrix has the dimension d_A = 163. The resulting eigenvalues are correspondingly dense and numerous, as Figure 3 shows, where the first 50 bands above the tightly-bound transition metal and oxygen semi-core states are drawn for k in (1,0,0) direction. Again the updated code leads to a substantial improvement over the old one.
5
Summary
We gave a detailed description of the KKR bandstructure method to highlight its significance for treating contemporary solid-state physics problems of considerable interest. We pointed out which critical numerical problems may be encountered and showed how they can be overcome by improving the code under the guidance of ACRITH, which proliferates verified results, guaranteed to be within small 'error bars'. Only after updating the codes did it make sense to use the KKR eigenvalues and one-particle states as input for the calculation of still more complex quantities needing true solutions of the Schrödinger wave equation instead of merely variational approximations as obtained from simpler bandstructure methods. The further development of ACRITH-XSC to the final stage of a standard programming language, in conjunction with adequate hardware assuring acceptable execution times, and being easily accessible to the application-minded user, is highly desirable.
REFERENCES

[1] Ham, F. S.; Segall, B.: Phys. Rev. 124, 1786, 1961.
[2] Ewald, P.: Ann. Phys. 64, 253, 1921.
[3] Laifer, R.: Diplomarbeit, Institut für Angewandte Mathematik, Universität Karlsruhe, 1991.
[4] Lohner, R.: private communication, 1991.
A Hardware Kernel for Scientific/Engineering Computations
Andreas Knöfel
Present floating point units or processors are characterized by architectures developed some decades ago, when hardware was an extremely precious resource and the arithmetic and numerical requirements were not yet investigated exhaustively. Unreliable and sometimes improper calculations are the consequence. Investigations of these effects lead to a mathematical definition of the required arithmetic. In addition to the enclosure of results by means of interval arithmetic, even the verification of the existence and uniqueness of computed results is possible. Considering first the mathematical demands for scientific/engineering computations and then the implementation constraints, the architecture of a new floating point unit was developed. Fortunately, the hardware increase is only in the order of percentages.
1 Introduction

The state of the art of the commonly used floating-point arithmetic is characterized by historical hardware limitations and unsophisticated implementations. Therefore, many features of the mathematical real numbers and operators are lost by improper mappings onto the floating-point arithmetic. The following examples outline some of these unavoidable and undetectable effects.

Example 1: scalar arithmetic
x ≠ 0 but x*1.0 = 0.

Example 2: sum of 13 numbers computed on a CRAY-2 [5]
Scalar mode: 1.0000000000...0
Vector mode: -1.0664062500...0
Example 3: inner product evaluation (using a conventional loop and scalar arithmetic)
For two vectors with components ranging in magnitude from about 10^18 to 10^26, the exact inner product is a small nonzero number, whereas the computer delivers 0.
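Effects of this kind can be reproduced in software. The following Python sketch uses illustrative data (not the original example); `math.fsum` plays the role of an exactly accumulating summation that is rounded only once:

```python
import math

# Summands whose exact sum is 17, but whose naive left-to-right
# floating-point sum collapses to 0: 1e24 + 17 rounds back to 1e24,
# because 17 is far below one ulp of 1e24.
terms = [1e24, 17.0, -1e24]

naive = 0.0
for t in terms:
    naive += t            # each partial sum is rounded to double precision

exact = math.fsum(terms)  # exact accumulation, rounded only once

print(naive)  # 0.0  -- all information about the 17 is lost
print(exact)  # 17.0 -- the correctly rounded result
```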
Example 4: standard function evaluation (Intel 80387 [15] result)¹
cos(x) = 6.123...

Example 5: extract of the computation of the eigenvalues of the Hilbert-18 matrix (MATLAB [21] results)¹
x1 = ..............
x4 = ..............
x7 = + 2.77..............
x18 = + 2.72231855534765
The majority of computer architectures and programming languages offers only a scalar arithmetic. But scientific/engineering applications deal mainly with data structures in higher numerical spaces, such as vectors or matrices of real or complex numbers. Implementing the required arithmetic by means of the improper scalar operations addressed above amplifies the demonstrated effects, causing ill-conditioned problems. Therefore, computer arithmetic must be defined on the basis of mathematics and in view of the requirements of hardware design. The main principle is the application of semimorphic operators [16,17,18]: the correct result of an operation in a numerical space is mapped by a monotone and antisymmetric rounding onto its image on a computer-representable screen. This provides an arithmetic with maximum accuracy for all operations, including several rounding modes such as the rounding ○ to the nearest number. Operations with the directed roundings ∆ (towards +∞) and ∇ (towards −∞) can be defined similarly.

¹ The dots "..." indicate incorrect digits.
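The effect of the directed roundings can be imitated in software. The following Python sketch is illustrative (the function name is mine, and the one-ulp widening via `math.nextafter`, available from Python 3.9, is a conservative substitute for true directed rounding, not the optimal bounds hardware would deliver):

```python
import math
from fractions import Fraction

def add_enclosure(a: float, b: float):
    """Enclose the exact sum a + b between two floating-point numbers.

    Hardware with the directed roundings would compute the optimal
    bounds directly; here we widen the rounded-to-nearest sum by one
    ulp in each direction, which yields a valid, slightly wider
    enclosure.
    """
    s = a + b
    return math.nextafter(s, -math.inf), math.nextafter(s, math.inf)

lo, hi = add_enclosure(0.1, 0.2)
exact = Fraction(0.1) + Fraction(0.2)   # the exact sum of the two doubles
assert Fraction(lo) <= exact <= Fraction(hi)
```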
In 1985, the issue of the ANSI/IEEE standard for binary floating-point arithmetic [7] brought considerable progress to scientific/engineering computations. The IEEE arithmetic became part of many computer architectures, from PCs up to mainframes. This standard defines data formats and applies the semimorphism to the corresponding scalar operations. Especially the DOUBLE format is now a state-of-the-art feature. Being limited to scalar operations, however, the standard is not applicable to higher numerical spaces. For the control or even the elimination of arithmetic errors, the semimorphism must also be applied to the operations on intervals, vectors, and matrices with real and complex components. In many cases, the enclosure of results in tight bounds and even the verification of the existence, uniqueness, and enclosure of a solution then become possible in an automated fashion by means of the computer [10,19]. The arithmetic for scalars, vectors, and matrices of real and complex numbers and intervals requires more than 600 operations. Additionally, higher precision operations are required for standard functions [9,14] and other purposes. The implementation of the entire set in hardware is neither possible nor needed. Considering the functional requirements as well as the technical limitations and computer architectures, an efficient arithmetic kernel is presented subsequently. By adapting the existing floating-point units, this kernel uses the hardware resources very efficiently, offering increased functionality and performance.
2 Requirements and Constraints
2.1 Numerical Requirements

a) Basic Arithmetic
The data structures in scientific/engineering computations are based on a given floating-point number system S(b, l, Emin, Emax). The system S is characterized by the base b, the number of mantissa digits l, and the exponent range between Emin and Emax. Only a finite set of real numbers is representable on this screen. Since in the majority of arithmetic operations the result is not exactly representable as a floating-point number, a rounding must map these results onto the screen. The computer data structures are composed from this data format S, using its numbers for the representation of the corresponding interval bounds, for the real and imaginary parts of complex numbers, and for vector and matrix components.
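For illustration, the IEEE double format is one concrete instance of such a system S(b, l, Emin, Emax); its parameters can be inspected directly in Python via `sys.float_info`:

```python
import sys

# IEEE double precision as an instance of S(b, l, Emin, Emax):
# base b = 2, l = 53 mantissa bits, and an exponent range such that
# the smallest normalized number is 2**-1022 and the largest number
# lies just below 2**1024.
fi = sys.float_info
print(fi.radix)     # 2    -> base b
print(fi.mant_dig)  # 53   -> mantissa length l
print(fi.min_exp)   # -1021
print(fi.max_exp)   # 1024
print(fi.epsilon)   # 2.220446049250313e-16, the relative screen spacing
```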
More than 600 operators are required for scalars, vectors, and matrices with real or complex point and interval components. The hardware implementation of the entire set is neither needed nor feasible. Fortunately, the required vector spaces have strong structural interrelations. The 600 operations can therefore be reduced to a set of about 30 instructions without destroying the canonical structures and without losing performance.
b) Standard Functions
The implementation of highly accurate standard functions with an error of less than one ULP (unit in the last place) cannot easily be achieved with the arithmetic in S [9,14]. Cancellation and rounding errors in the argument reduction step may cause severe errors. Even special compound operations [9] for the evaluation of the approximating polynomial cannot necessarily reduce the errors considerably. With an a priori error analysis of the algorithm, the required additional precision can be determined. Table 2.1 shows the number of additional binary digits [25] for some standard functions in the case of IEEE DOUBLE input values. Some or all of these precision requirements could be covered by a special format, called EXTENDED.
Table 2.1: Additional digits for the evaluation of standard functions in the case of the IEEE double format, for real and interval argument spaces
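The sensitivity of the argument reduction step can be demonstrated in a few lines of Python. The sketch below (argument and tolerances are illustrative) reduces a large sine argument modulo the 53-bit value of 2π and compares it with a reduction carried out with 2π to 50 decimal digits; the tiny error of the 53-bit 2π is multiplied by the number of periods, here about 1.6·10⁹:

```python
import math
from decimal import Decimal, getcontext, ROUND_FLOOR

x = 1.0e10   # a large but exactly representable argument

# Naive argument reduction with the 53-bit value of 2*pi.
# fmod itself is exact; the damage comes entirely from rounding 2*pi.
r_naive = math.fmod(x, 2.0 * math.pi)

# Reference reduction with 2*pi carried to 50 decimal digits
getcontext().prec = 50
PI = Decimal("3.14159265358979323846264338327950288419716939937510")
xd = Decimal(x)
n = (xd / (2 * PI)).to_integral_value(rounding=ROUND_FLOOR)
r_true = xd - n * 2 * PI

# The reduced arguments disagree already in roughly the 7th decimal
# place: the absolute error of the 53-bit 2*pi (~2.4e-16) is amplified
# by the period count n ~ 1.6e9.
err = abs(float(r_true) - r_naive)
print(err)   # roughly 4e-7
```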
c) Accurate Expressions and Procedures
The EXTENDED format is intended as a special-purpose format for standard functions. It is not sufficient to cover all the requirements for the accurate evaluation of general expressions. Examples are the enclosure of eigenvalues and eigenvectors, the functoid arithmetic for integral and differential equations, and the evaluation of polynomials. Since the required precision cannot be determined a priori in these cases, a dynamic multi-precision arithmetic must be used. On the other hand, the frequency of occurrence of such problems decreases as the precision is increased. A well-balanced compromise between precision and hardware effort must therefore be considered.
2.2 Scalar and Vector Operations
Notwithstanding the reduction from over 600 to about 30 operations, some of these are rather complicated and difficult to implement in hardware. A statistical analysis was applied to detect frequently used operations. To guarantee a real-life environment for the analysis, a large collection of scientific/engineering routines was investigated, both with critical and with non-critical input data sets. According to their predominant operations, the investigated routines can be subdivided into three major classes. For each class, a significant example is presented below, together with some considerations for the hardware implementation. The following figures show the relative distribution of the real inner product computations, the interval inner product computations, the sum of all other vector instructions, and the sum of all scalar instructions rounded either to the nearest number or with directed roundings.

Figure 2.1: relative distribution of instructions in class 1

Figure 2.2: relative distribution of instructions in class 2
The first class of algorithms involves large numbers of vector instructions, as shown in Figure 2.1. This class 1 contains mainly enclosure algorithms for algebraic problems such as linear systems and the evaluation of polynomials. The relative distribution was found to be nearly independent of the chosen problem and its dimension. An interesting aspect is the ratio of real to interval vector instructions of 4:1.

The next class 2 of algorithms is dominated by the employment of scalar instructions (Figure 2.2). This class contains the enclosure of eigenvalues and of zeros of non-linear systems.

The last class 3 is dominated by the employment of scalar interval instructions, which are composed of scalar real operations with directed roundings (Figure 2.3: relative distribution of instructions in class 3). This class contains mainly algorithms for the enclosure of the solution of integral and differential equations. The distribution of the vector instructions has no effect on the hardware considerations for this problem class. The main difficulty for this class of problems is the insufficient support of an interval arithmetic on many computers. In the majority of the architectures, the rounding mode is determined by a special register. In the case of interval arithmetic, the rounding direction and therefore the rounding mode is toggled for each interval operation. In modern high-performance systems, the update of the rounding mode takes up to six times as long as a floating-point addition, due to an inefficient user interface. A decreased performance is the consequence. Therefore, different hardware instructions must be accessible for the scalar operations with different roundings.
Summarizing the requirements for the three classes, the following operations are required: scalar operations +, −, *, and / with the roundings ○, ∆, and ∇, and vector operations for real and interval inner products.
2.3 Higher Precision Operations
For the accurate evaluation of entire expressions or even algorithms, a higher precision arithmetic must be applied. The required precision and exponent range are problem-dependent and must be adaptable. Such problems occur less frequently as the precision is increased. In the case of interval arithmetic, a theorem [1] establishes the relationship between the precision of the arithmetic and the evaluation error. A problem must therefore be highly unstable, or the accuracy demands must be very high, if the computation cannot be performed with a double or triple precision arithmetic. By means of staggered corrections [26] or accurately evaluated inner product expressions, a double or triple precision arithmetic is ten to fifty times slower than the usual scalar arithmetic. This leads to a slow evaluation of standard functions and penalizes the major part of the higher precision requirements, which need only a slightly increased precision. Therefore, some of the possible higher precision formats should be supported directly by the arithmetic unit, and all other demands should be satisfied by software.
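The staggered representation rests on error-free transformations of the basic operations. A Python sketch of Knuth's TwoSum (the classical software building block; the function name is mine) splits a floating-point sum into a rounded main part and an exact remainder:

```python
from fractions import Fraction

def two_sum(a: float, b: float):
    """Knuth's error-free transformation: returns (s, e) with
    s = fl(a + b) and a + b = s + e exactly."""
    s = a + b
    t = s - a
    e = (a - (s - t)) + (b - t)
    return s, e

# A staggered (two-term) number s + e carries roughly twice the
# precision of a single double without special hardware support.
s, e = two_sum(1.0, 1e-17)
assert s == 1.0      # the rounded main part
assert e == 1e-17    # the remainder, kept exactly
assert Fraction(s) + Fraction(e) == Fraction(1.0) + Fraction(1e-17)
```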
2.4 Constraints
Even now, when floating-point performance and functionality are important marketing aspects in the computer market and technological progress allows the hardware implementation of more and more complex circuits at decreasing cost, the following constraints must be considered: the power dissipation, the transistor count, and the chip area of a circuit. Therefore, a sophisticated floating-point unit (SFPU) satisfying all the above-mentioned requirements should be comparable with sophisticated off-the-shelf floating-point units (FPUs).

2.4.1 Computer Architectures and Interfaces
Another important constraint is the architecture of the entire processor or computer. The SFPU is not a stand-alone unit but highly dependent on the main processor, the memory structure, and the I/O features. The ability to adapt and modify the given concept depends on the type of processor or computer, respectively. For the following considerations, five computer types are introduced:
I   PC: narrow bus systems, one CISC processor² and an attached floating-point coprocessor with a complex communication protocol ⇒ low I/O bandwidth and floating-point performance; extensions to the instruction set are difficult.

II  Workstation: wide bus systems, one or a few RISC processors³ with floating-point capability ⇒ high I/O bandwidth and floating-point performance; extensions to the instruction set are difficult.

III Mainframe: wide bus systems and special I/O processors, one or a few CISC processors with access to the micro-code instruction set ⇒ high I/O bandwidth and floating-point performance; extensions to the instruction set are easy.

IV  Vector computer: multiple wide bus systems, special-purpose pipeline architecture for number crunching ⇒ very high I/O bandwidth and floating-point performance; extensions to the instruction set are easy, but the implementation is difficult due to the required sophisticated chip technology.

V   Parallel computer: up to several thousand coupled nodes of type I to IV.
The I/O bandwidth and the interface to the memory are important parameters for the design of an arithmetic unit. The following listing shows the instructions required for performing the operation REAL_A = REAL_B + REAL_C in an assembler-like syntax:

    LOAD ADDRESS of REAL_B
    LOAD REAL_B into FP_REG_A
    LOAD ADDRESS of REAL_C
    LOAD REAL_C into FP_REG_B
    FP_REG_C = FP_REG_A + FP_REG_B
    LOAD ADDRESS of REAL_A
    STORE FP_REG_C
² Complex Instruction Set Computer: these computers have up to several hundred micro-coded instructions and multiple addressing modes. Even simple instructions are executed in several cycles.
³ Reduced Instruction Set Computer: these computers have fewer than 100 hard-wired simple instructions and only a few addressing modes. An increasing number of RISC processors have a superscalar architecture with several concurrently operating execution units for integer, floating-point, and logical operations.
The number of cycles required for these seven instructions on several computers is listed in Table 2.2 for 64-bit floating-point numbers:

Table 2.2: Execution cycles on different computer architectures
A detailed analysis [6] shows that in some cases up to 80% of the execution time is required for the data and instruction transfer. Fast execution units are therefore recommendable only for systems where the necessary transfers are also performed rapidly.
3 Hardware Arithmetic Core
In this section, the hardware architecture of the arithmetic unit is presented and compared with a conventional scalar arithmetic unit. Even though the functionality of this unit is independent of the computer architecture, several design alternatives are presented to allow an adaptation to a specific machine.
3.1 The Scalar Arithmetic Circuits
The majority of general-purpose computers have a scalar arithmetic unit in hardware. The common floating-point formats like IEEE double [7] have almost the same mantissa length and therefore require almost the same amount of hardware. Main differences occur only on PCs, where the multiplication and division are executed by means of a small unit in several cycles due to the low I/O capability. Figure 3.1 shows a block diagram of the scalar unit, and Table 3.1 gives the number of transistors and other parameters of the main blocks.
Figure 3.1: Block diagram of the scalar arithmetic unit (registers, exponent path, mantissa path)
Table 3.1: Transistor count, area, and power consumption of the main units in an FPU
3.2 Integration of the Vector Arithmetic
For a semimorphic inner product, both the computation of the products and the summation must be performed exactly, with a result that is rounded only once at the end of the process. Since the exact computation of the product is already a necessity for the scalar multiplication, the existing multiplier can be re-used. Unfortunately, in existing FPUs the less significant half of the product mantissa is used solely for rounding purposes and is not accessible. Therefore, the already computed lower part must be routed out of the multiplier. As Table 3.1 shows, the multiplier makes a significant contribution to the amount of hardware but requires no significant changes.
For the exact summation of the exactly computed products, an additional effort is necessary. In software-like algorithms [2,8], the arithmetic operations are performed in a staggered fashion with a main part and a remainder. The main parts and the remainders are subsequently added to obtain the desired result. Since several passes through the summation loop are required, and since the number of remainders and therefore the required storage space is problem- and dimension-dependent, this procedure is not suitable for fast and efficient hardware implementations.

The summation process using a so-called Long Accumulator (LA) [16] avoids these disadvantages. The LA is a fixed-point register with a sufficient number of digits such that both the square of the smallest number and the square of the largest number of the system S(b, l, Emin, Emax) can be added or subtracted exactly, as shown in Figure 3.2. The minimal length L of the LA is therefore

L = k + 2·l + 2·|Emin| + 2·Emax,

where k is a performance-dependent number of additional digits to prevent an intermediate overflow.

Figure 3.2: Long Accumulator

Since for customary floating-point systems the length of an exact product is usually only 3-5% of the length of the LA, in many cases the accumulation has only a local impact. Therefore, the accumulator is subdivided into several accumulator words. The product is accumulated locally to the intersecting words, with consecutive carry propagation and carry resolution, as shown in Figure 3.3.
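The idea can be modelled in software. The Python sketch below (names and structure are mine) represents the LA as one big integer with the fixed scale 2^2148, so that the exact double-length product of any two IEEE doubles lands on an integer position, and the result is rounded exactly once at the end:

```python
from fractions import Fraction

# Every product of two IEEE doubles is an integer multiple of 2**-2148
# (the square of the smallest subnormal), so this fixed-point scale
# loses nothing -- it plays the role of the Long Accumulator's length L.
SCALE = 2 ** 2148

def exact_dot(xs, ys):
    acc = 0                               # the software "Long Accumulator"
    for x, y in zip(xs, ys):
        p = Fraction(x) * Fraction(y)     # exact double-length product
        acc += int(p * SCALE)             # exact local accumulation
    return float(Fraction(acc, SCALE))    # one single rounding at the end

xs = [1e24, 2.0, -1e24]
ys = [1.0, 1.0, 1.0]

naive = sum(x * y for x, y in zip(xs, ys))
print(naive)              # 0.0 -- the conventional loop loses the 2.0
print(exact_dot(xs, ys))  # 2.0 -- semimorphic inner product
```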
Figure 3.3: Accumulation to a subdivided LA

In the past, several circuit principles using a subdivided LA were proposed. An overview and a comparison are given in [3,12]. Compared with other proposals, the following two principles have significant advantages concerning performance and circuit size.
3.2.1 The Summation Matrix
This principle was developed by R. Kirchner and U. Kulisch [20] for the high performance requirements of vector supercomputers. To each LA word, an adder/subtractor and a transfer register are attached. The LA is then wrapped into a matrix-like structure, shown in Figure 3.4. For the accumulation, the product is first aligned by a fine shift to the word boundaries of the LA. Each part of the product is extended by an identification tag determining the intersecting LA word. In the subsequent middle shift, the extended words are rotated until each word is positioned in the column of the corresponding LA word. The joint operation of the fine and middle shift results in an extremely large shift width and therefore requires an extremely large shifter. In the final phase, the product words are propagated in each cycle through the rows of the summation matrix, in this way performing the coarse shift. With the identification tag, the transfer unit of each LA word determines whether the product word is accumulated or transferred to the next row. In parallel to the accumulation of product words, the positive or negative carries are also added. Since all the adders/subtractors are active in each cycle, the power consumption is very high.
Figure 3.4: The summation matrix
3.2.2 Fast Carry Resolution
Figure 3.3 shows the three accumulation phases: local accumulation, carry propagation over word boundaries, and carry resolution.
In the circuit displayed in Figure 3.4, the major part is required for the concurrent carry processing. The carry might propagate over long distances; however, it should not delay the pipeline in a vector supercomputer. The carry propagation can be accelerated using a technique developed by M. Müller, Ch. Rüb, and W. Rülling which avoids void additions [23]. Here, the carry propagation over word boundaries is not performed explicitly. Flags indicate whether all digits in an LA word are "(b−1)...(b−1)" or "0...0". Only in the case of a carry input and "(b−1)...(b−1)" does the carry propagate to the next word, such that this word contains "0...0" afterwards. Therefore, in these cases
only the flags of the words are toggled. In the first word where at least one digit is not "(b−1)", the carry is resolved, and here an explicit addition is performed. Figure 3.5 shows the entire process using flags.

Figure 3.5: Accumulation using flags — a) flag setting before the accumulation, b) flag setting after the accumulation
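The flag technique can be illustrated with a small software model. In the sketch below (the 16-bit word width, class, and method names are illustrative choices of mine; real LA words are wider and the logic is combinational), a carry entering an all-ones word merely clears it and moves on, and an explicit addition happens only in the first word that is not all ones:

```python
W = 16                      # bits per LA word (illustrative choice)
MASK = (1 << W) - 1

class FlaggedAccumulator:
    """Toy model of a subdivided Long Accumulator with all-ones flags.

    The model assumes the accumulator never overflows past its top word.
    """

    def __init__(self, nwords: int):
        self.words = [0] * nwords         # least significant word first
        self.all_ones = [False] * nwords  # flag: word == MASK

    def _resolve_carry(self, i: int):
        # Skip the maximal run of all-ones words: each flips to all zeros,
        # only its flag is toggled -- no explicit addition is performed.
        while self.all_ones[i]:
            self.words[i] = 0
            self.all_ones[i] = False
            i += 1
        # The first word with at least one digit != b-1 absorbs the carry.
        self.words[i] += 1                # cannot overflow: word != MASK
        self.all_ones[i] = (self.words[i] == MASK)

    def add(self, value: int, word_index: int):
        """Accumulate a non-negative integer, aligned at word_index."""
        i = word_index
        while value:
            s = self.words[i] + (value & MASK)   # local accumulation
            self.words[i] = s & MASK
            self.all_ones[i] = (self.words[i] == MASK)
            if s >> W:                           # carry over word boundary
                self._resolve_carry(i + 1)
            value >>= W
            i += 1

    def value(self) -> int:
        return sum(w << (W * i) for i, w in enumerate(self.words))

acc = FlaggedAccumulator(8)
acc.add(0xFFFF, 0)      # word 0 becomes all ones, its flag is set
acc.add(1, 0)           # the carry skips word 0 via the flag
assert acc.value() == 0x10000
```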
The same scheme can be applied for negative values and the borrow propagation. Since the logical equations for the carry propagation are similar to those of a full addition, well-known acceleration techniques for hardware adders can be applied. Even a carry propagation over the entire LA can therefore be performed within a very short time limit. Additionally, the flags can be used to determine the employed domain of the LA for the final rounding. Beginning at the most significant word, the LA is scanned in descending order. The first word whose flag is not equal to the sign contains the most significant digits of the result. Depending on the desired mantissa length of the result, this and the next one or few words must be read out into the rounding and normalization unit. Since the number of LA words required for the rounding of the result is constant, a hard-wired stride could be implemented to determine the trailing words for the "sticky" bit (Figure 3.6).

Figure 3.6: Rounding support using the flags
The explicit accumulations in Figure 3.5 can be performed sequentially for each LA word or in parallel [11], depending on the performance requirements. In both cases, the product exponent determines which LA words intersect with the product and therefore where the carry propagation will start. In the sequential scheme, the aligned product is accumulated sequentially to each intersecting LA word, starting with the least significant part. In parallel to the local accumulation, the LA word for the carry resolution is determined, and a copy of the flag set is toggled on-the-fly assuming a carry. The carry output of the local accumulation can be accumulated to the carry resolution word in any case, since a carry value of zero causes no change. If the carry output is one, the modified flag set becomes the actual one; in the case of a carry value of zero, the original flag set remains. With or without a carry propagation and resolution, the scheme is performed in the same number of cycles and can therefore be performed in an execution pipeline. The entire inner product can now be performed in a pipeline, since pipeline interlocks do not occur between the stages. Figure 3.7 shows the two nested pipelines for the accumulation and the entire inner product calculation. Since read and write accesses to the LA occur simultaneously in the scheme of Figure 3.7, a 2-port RAM for the LA is required. This could be avoided by stretching the scheme at the cost of performance.

Figure 3.7: Pipeline scheme for the sequential accumulation
The sequential scheme requires at least two additions for the local accumulation and one for the carry resolution, and therefore at least six cycles. In high-performance systems, this might not be fast enough to keep pace with the data fetch and the multiply stage. The parallel accumulation reduces the number of required cycles to two without the penalty of a significantly increased hardware amount. The LA words for the explicit accumulation are determined in advance by the product exponent and the flag set. These words can now be loaded in parallel into a sufficiently wide adder. The local accumulation and the carry resolution are then performed in one step, and the involved LA words are written back into the same locations. As in the sequential scheme, the correct set of flags is selected. The cost for the increased performance is a wide adder and a multi-port RAM, since all the involved LA words are transferred in parallel. Fortunately, one address decoder is sufficient here, initiating the transfer of the LA words for the local accumulation and the start of the carry resolution. The carry logic is responsible for the determination and transfer of the carry resolution word. Figure 3.8 shows a block diagram for the parallel accumulation with the sophisticated address decoder. Since the actual flag setting is required for subsequent accumulations, the accumulation itself cannot be pipelined. However, the accumulation can be embedded, analogously to the sequential scheme, into an overall inner product pipeline. Figure 3.9 shows the pipelined timing scheme in the case of a superscalar RISC processor, executing the address calculations, the control instructions, and the floating-point calculations in parallel.

Figure 3.8: Block diagram for the parallel accumulation

Figure 3.9: Inner product pipeline for a superscalar RISC processor
3.3 Interval Arithmetic
In particular for algorithms in class 1 as defined in Section 2.2, interval inner products are essential. The main problem here is to determine which of the four combinations of the products of the lower and upper bounds of the operands contributes to the lower or the upper bound in the accumulation step. A software-based pre-analysis would reduce the performance substantially, so hardware support should be provided. For hardware implementations, the comparison of possible solutions [12] showed that all four products should be computed and sorted in the FPU. The determined minimal and maximal products can then be accumulated in two Long Accumulators, one for the lower bound of the result and the other for the upper bound. If the implementation limits allow only one LA on-chip, the algorithm must be executed twice, once for the lower bound of the result and again for the upper bound. Figure 3.10 shows the entire FPU with the components for the scalar and the inner product operations as well as the sorters. One adder is sufficient, since the accumulation is performed in the same time as one multiplication, which is performed four times. As mentioned above, the sequential or the parallel accumulation scheme as well as the summation matrix could be used, depending on the I/O capability of the system and the performance requirements. Since two LAs are accessible, the computation of complex inner products can be performed directly, and the less frequently occurring complex interval products can be computed using the interval operations. In comparison with Figure 3.1, the major additions to the scalar FPU are the two Long Accumulators. Table 3.2 is an extension of Table 3.1, exhibiting the amount of hardware required for two accumulation units in the case of the summation matrix and the fast carry resolution.
Table 3.2: Transistor count, area, and power consumption of the accumulation circuits
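The endpoint selection performed by the sorters can be sketched in a few lines of Python (the function name is mine; for brevity the sketch omits the directed roundings that a real implementation would additionally apply, rounding the lower bound downward and the upper bound upward):

```python
def interval_mul(a, b):
    """Product of intervals a = [al, au] and b = [bl, bu].

    All four endpoint products are formed and the minimum and maximum
    are selected, exactly as the hardware sorters do before the two
    Long Accumulators take over.
    """
    al, au = a
    bl, bu = b
    products = (al * bl, al * bu, au * bl, au * bu)
    return min(products), max(products)

# Which combination contributes depends on the signs of the bounds:
print(interval_mul((1.0, 2.0), (-3.0, 4.0)))    # (-6.0, 8.0)
print(interval_mul((-2.0, -1.0), (-5.0, 3.0)))  # (-6.0, 10.0)
```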
Figure 3.10: Block diagram of the entire arithmetic unit
3.4 Modifications
If the requirements listed in Table 3.2 are too large, only a window of the LA might be implemented in hardware, covering a limited exponent range. This is justified since computations with very large or very small exponents occur only rarely. In the case of an underflow or overflow from this window, software routines might be applied, or the window might be shifted to another exponent range, to ensure maximum accuracy in all cases without jeopardizing the semimorphism principle for inner product calculations. Other compromises are possible; whatever they are, there always exist counter-examples causing complicated additional computation steps. It is up to the manufacturer to weigh the importance of these counter-examples against the hardware amount.
The progress in circuit technology now allows the integration of entire systems on one chip, including the scalar and floating-point processor as well as the cache, the memory management unit (MMU), and the system bus interface [24]. Therefore, the LA memory can be located in the data memory, requiring no dedicated additional circuits (Figure 3.11). The LA resides in one cache block, and the corresponding flags reside in special FPU registers. Additional data busses from the cache to the FPU allow simultaneous data access to the input operands and the LA. Since the LA and the input vectors reside at different memory addresses, no address conflict logic is necessary. The LA is thus well integrated into the memory concept. Multiple LAs could be present at the same time, and in the case of a task switch the LA is swapped like normal data memory locations, requiring no special treatment or even adaptations of the operating system.
Figure 3.11: Long Accumulator in the data cache memory
3.5 Higher Precision
The usual floating-point format is well adapted to the memory and bus structure and to the register size of the processor. Therefore, an increase by only a few digits requires the same number of memory locations, bus transfers, and registers as an increase to a full additional memory word. Since the word length of processors is growing towards the size of a floating-point number, this suggests a higher precision format with at least twice the number of mantissa digits. Such a format would cover most of the higher precision requirements.
Fortunately, the accumulation unit is already adapted to the wider data format, since the exactly calculated products possess double length. Addition and subtraction can therefore be performed with minor adaptations using the existing shifter and adder. Table 3.1 shows that especially the multiplier and the divider are elements of the FPU taking up many transistors and much area and power. Since an extension of the multiplier is not justified, the multiplication should be computed by parts: the two input mantissas are chopped into halves, and partial products are computed with the existing multiplier. These partial products can then be accumulated in the usual way, since the computation of a product via partial products is a short inner product. The same considerations apply to the division z = x/y. The existing divider is used to compute an approximation k ≈ 1/y; the quotient is then computed iteratively [22], using the LA for the exact remainder [12].
Inner products of higher precision numbers are not as easy to obtain. Since the mantissas are now at least twice as long, the LA must be extended at the least significant end to ensure maximum accuracy. The products can then be computed in the same way as outlined above for the multiplication. In the case of interval inner products, the sorters can be re-used; temporary registers are now necessary to store the partial products until the minimum and maximum have been determined.

For some applications, even more precision is required. This arithmetic should be implemented in software, using well-known techniques for long real operations [13] and a staggered representation [26] of numbers as a vector of floating-point numbers. Considering the algorithms for such an arithmetic, the main implementation problem is the accumulation of fractions of these numbers or of their partial products. The equations for the addition, multiplication, and division contain numerous inner product expressions that can now be computed efficiently. A shift of these algorithms from the software to the micro-code of the processor would not increase the performance but only enlarge the control logic and the instruction set.

Summarizing the required changes for these higher precision calculations, the presented FPU with scalar and inner product operations can be re-used with slight modifications. Details concerning minor aspects are described in [12]. Some higher precision formats extend not only the mantissa length but also the exponent range. If the LAs must be kept in a dedicated on-chip memory, such an extension is not recommendable. As described above, only an LA window might then be present in hardware to cover the major applications.
4 Conclusion

The presented FPU was developed in view of mathematical requirements. The addition of the capability to compute accurate inner products covers the urgently required and frequently used operations, and it offers a wide range of additional operations that can be performed with high efficiency. Such an FPU is required to bring hardware arithmetic up to the level of the numerical progress of the past decades. Some alternatives were given to decrease the amount of hardware, but even with a full implementation, the circuit size would increase only by a few percent.
References

[1] G. Alefeld, J. Herzberger: Einführung in die Intervallrechnung. Series Informatik 12, Bibliographisches Institut, Mannheim, 1974
[2] G. Bohlender: Genaue Berechnung mehrfacher Summen, Produkte und Wurzeln von Gleitkommazahlen und allgemeine Arithmetik in höheren Programmiersprachen. Doctoral Dissertation, University of Karlsruhe, 1978
[3] G. Bohlender: What Do We Need Beyond IEEE Arithmetic? In: Ch. Ullrich (ed.): Computer Arithmetic and Self-Validating Numerical Methods, Academic Press, 1990
[4] K. D. Braune: Hochgenaue Standardfunktionen für reelle und komplexe Punkte und Intervalle in beliebigen Gleitpunktrastern. Doctoral Dissertation, University of Karlsruhe, 1987
[5] R. Hammer: Arithmetic on Vector Computers. In: Ch. Ullrich (ed.): Contributions to Computer Arithmetic and Self-Validating Numerical Methods, J.C. Baltzer AG, Scientific Publishing, 1990
[6] P. M. Hansen: Coprocessor Architectures for VLSI. UCB/CSD Report 88/466, University of California, Berkeley, 1988
[7] ANSI, IEEE: IEEE Standard for Binary Floating-Point Arithmetic. ANSI-IEEE Standard 754-1985, 1985
[8] IBM: Verfahren und Schaltungsanordnung zur Addition von Gleitkommazahlen. European patent, 1986
[9] P. W. Markstein: Computation of Elementary Functions on the IBM RISC System/6000 Processor. IBM Journal of Research and Development, Vol. 34, No. 1, 1990
[10] E. W. Kaucher, W. L. Miranker: Self-Validating Numerics for Function Space Problems. Academic Press, 1984
[11] A. Knöfel: Advanced Circuits for the Computation of Accurate Scalar Products. In: E. Kaucher, S. Markov: Computer Arithmetic, Validated Computation and Mathematical Modelling, J.C. Baltzer AG, 1991
[12] A. Knöfel: Hardwareentwurf eines Rechenwerkes für semimorphe Skalar- und Vektoroperationen unter Berücksichtigung der Anforderungen verifizierender Algorithmen. Doctoral Dissertation, University of Karlsruhe, 1991
[13] D. E. Knuth: The Art of Computer Programming II. Addison-Wesley, 1981
[14] W. Krämer: Inverse Standardfunktionen für reelle und komplexe Intervallargumente mit a priori Fehlerabschätzungen für beliebige Datenformate. Doctoral Dissertation, University of Karlsruhe, 1987
[15] W. Krämer: Die Berechnung von Standardfunktionen in Rechenanlagen. Jahrbuch Überblicke Mathematik 1992, Vieweg, 1992
[16] U. Kulisch: Grundlagen des numerischen Rechnens: Mathematische Begründung der Rechnerarithmetik. Series Informatik 19, Bibliographisches Institut, 1976
[17] U. Kulisch, W. L. Miranker: Computer Arithmetic in Theory and Practice. Academic Press, 1981
[18] U. Kulisch, W. L. Miranker: The Arithmetic of the Digital Computer: A New Approach. SIAM Review, Vol. 28, No. 1, 1986
[19] U. Kulisch, H. J. Stetter (eds.): Scientific Computation with Automatic Result Verification. Springer, 1988
[20] R. Kirchner, U. Kulisch: Arithmetic for Vector Processors. Proc. of the 8th Symp. on Computer Arithmetic, IEEE Computer Society, 1987
[21] MATLAB™ Numerical Computation System. The MathWorks Inc., 1990
[22] D. W. Matula: A Highly Parallel Arithmetic Unit for Floating Point Multiply, Divide with Remainder and Square Root with Remainder. SCAN-89, Basel, October 1989
[23] M. Müller, Ch. Rüb, W. Rülling: Exact Addition of Floating Point Numbers. Sonderforschungsbereich 124, FB 14 Informatik, University of the Saarland, Saarbrücken, 1990
[24] Fuad Abu Nofal et al.: A Three Million Transistor Processor. 1992 IEEE International Solid State Circuit Conference
[25] H. Schoss: Intervall-Standardfunktionen für das binäre IEEE-Zahlenformat. Diploma Thesis, Institut für Angewandte Mathematik, University of Karlsruhe, 1990
[26] H. J. Stetter: Sequential Defect Correction for High-Accuracy Algorithms. Numerical Analysis, Lecture Notes in Mathematics, Springer, 1984
Bibliography on Enclosure Methods and Related Topics
Gerd Bohlender

This bibliography lists key publications on computer arithmetic and scientific computation by members of the Institute of Applied Mathematics at the University of Karlsruhe, as well as related work. It is, however, neither a complete literature list on these subjects nor a complete list of the publications of the Institute. The bibliography is sorted alphabetically by author; publications by manufacturers and institutions are listed at the end. Titles of books and proceedings are printed in bold face, titles of articles and other publications are printed emphasized. Information about special topics can be found by means of the keywords listed below.

- Introduction: For an introduction to verification methods see [Kul83, Kul86, Kul89]; in these publications, the arithmetic operations, programming tools, and interval methods are introduced which are required for the implementation of numerical methods with automatic result verification.
- Proceedings of SCAN meetings and similar conferences: Berlin 1979 [Ale80], Karlsruhe 1982-1988 [Kul82, Kau87, Kul88], Yorktown Heights 1982 [Kul83], Bad Neuenahr 1985 [Mir86], Ohio 1987 [Moo88], SCAN 89, Basel [Ull90a, Ull90b], SCAN 90, Albena [Kau91, And91], Leipzig 1991 [Jah91], SCAN 91, Oldenburg [Her92].
- Computer arithmetic for verification methods: foundations [Kul76, Kul77, Kul77a, Kul81, Kul83a, Kul86], software implementation [Wip68, Boh78, Gru79, Boh82, Boh82a, Boh83, Boe85, Rum85, Sue86, Pri91], library for FORTRAN 8x [Boh84, Die85, Ull85], library for Ada [Fis85, Kla85, Kla86, Kla86a, Ull87, Erl88], portable implementation in C [Cor91], extensions for scientific computation [Pic72, IBM84, Hah88, Hus88, Kie88], IMACS-GAMM resolution and vector extensions [IMACS89, IMACS91, Boh91a], decimal arithmetic [Coh83, Auz85, All86, Boh91b].
- Hardware arithmetic for verification methods: surveys [Boh90, Boh91c], bit-slice processor BAP-SC [Teu84, Teu86, Boh86a, Boh87], studies [Kul83b, Win85, Mar85, Cal86, Kir87, Kir88, Kir88a, Lic88, Kno88, Yil89, Haf90, Mue90, Mue91, Kno91, Kno91a, Kno91b], patents [Kul83c, IBM86a, Dep88, Kul89b].
- Traditional computer arithmetic: reliability [Rum83, Ham90, Rat90], experience [RRZN84, Gue87], error analysis [Wil64, Ste74, Knu81], IEEE standards [IEEE85, IEEE87, Coo80, Cod88], hardware implementation [Hwa79, Spa76, Spa81, ARITH], increasing precision [Kah65, Moe65, Dek71, Lin74, Lin81], multiprecision arithmetic [Lor71, Bre78, Bre80, Bre81, Kru86, Kra88b, Shi89, Kra92b].
- Standard functions with maximum accuracy: [Wol80, Bra85, Wol85, Bra86, Bra87, Bra87a, Bra88, Kra87, Kra88, Kra90, Kra92], complex polynomials [Kra91c] (see also: verification methods).
- Interval methods: introduction and proceedings of meetings [Moo66, Ale74, Ale83, Nic75, Nic80, Nic85, May89, Zju89, Neu90], extended interval arithmetic [Kau73, Dim92].
- Verification methods: introduction [Kul86a, Rum88, Kul89, Ada89a], linear systems [Rum80a, Rum82a, Stu86, Rum89a, Rum90, Rum91, Gru87, May88, Jan90, Sch89e, Sch89b, Rex91, Her91, Jan92], interval linear systems [Rum92], inverse matrix [Her90a, Her91, Bet91], eigenvalue problems [Ohs88, Beh88, Beh90, Goe90a, Klu90, Kra91d, Beh91, Roh92], sparse matrices [Sch86, Cor87a, Kra91d, Cor89], parameter-dependent systems [Neu88], nonlinear problems [Boe87, Ale88, Lei90, Rum90, Rum91, Sch89d], linear programming/optimization [Jan85, Jan88, Lei90, Jan91, Jan91b, Jan91c, Cse91], differential equations / function space problems [Kau83, Mir83, Kau84, Kau85, Kau85a, Kau87a, Kau89, Kru85, Ral85, Dob86, Dob87, Cor85, Cor87, Cor88, Loh84, Loh85, Loh87, Loh88, Loh89, Loh89a, Spr87, Ame86, Wei87, Wei88, Ada87, Ada87a, Ada90, Ada91a, Ada91b, Ada92, Goe90, Plu90, Plu91, Ker91, Kol91, Kol91a, Ste90, Dav76, Bau80, Man90, Ohs88], evaluation of expressions [Boe82, Boe83, Boe83a, Rum83c, Fis87, Fis88, Wol88a, Fis89, Kre88, Ale90b], evaluation of polynomials and program parts [Boe83, Gru87, Loh88a, Fis89, Geo90, Kra90a, Dim90, Kra91b, Kra91c, Kra91e, Pet91, Sch90a], quadrature/integration [Col82, Cor85, Cor87c, Kau89, Kel89, Kel90, Kle90, Kle91, Kra91a], automatic differentiation [Ral80, Ral81, Cor88, Cor91a, Fis90, Gri89, Gri91, Shi92b], other [Kra69, Mir86a, Boc89, Boc90, Rum91a, Mar91b, Cuy92].
- Parallel programming: PASCAL-XSC on transputers [Boh92a, Boh92b], basic arithmetic operations [Obe83, Obe84, Obe86, Van91, Dav91], matrix operations [Wec90, Boh91d, Wol91a, Boh92, Dav92], linear systems (approximate solutions) [Fro90], verification methods [Rei91, Kam91, Dav92, Boh92a].
- Supercomputers: reliability [Ham90, Rat90], arithmetic [Kir87, Kir88, Sch89f, Sch92b], verification methods [Sch90c, Sch90e].
- PASCAL-XSC: language definition [Boh86, Boh87, Nea89], reference manual and compiler handbook [Ham89, IAM89, Kla91, Kla92, NUM91], demonstration of numeric library [Rum82, Rum83b], portable compiler based on C [All91, All91a, Ham91], extensions for parallel programming [Dav92, Boh92a, Boh92b]. Former versions of PASCAL-XSC: [Boh78, Wol82, Wol82a, Wol83, Kul87, Kul87a, KWS85].
- Programming languages for scientific computation: PASCAL-XSC (see above), survey [Kla89, Ull90c], studies for a FORTRAN extension [Boh81a, Boh83b, Boh84, Ull82, Ull83, Ull84, Ull84a, Ull85], FORTRAN-SC (precursor of ACRITH-XSC) [Ble87, Ble88, Met88, Met88a, Met89a, Met90, Kra88b, Kra89, Wal88, Wal89a, Wal89b, Wal90, IAM88], Abacus/Calculus [Hus88, Rum89], TPX precompiler for Turbo Pascal [Hus89], Modula-SC [Fal90, Fal91], APL/PCXA [Hah88, Hah90], C++ [Jue91], C-XSC [Law91, Law92, Law92a], Modyna [Mar91a], libraries for Fortran 8x/90, Ada, and C see above.
- Problem solving libraries: ACRITH [IBM84, IBM86, Rum87], ACRITH-XSC (language extension and compiler handbook) [IBM90, Met88, Met88a, Wal88, Wal89a, Met90], ARITHMOS [SIE86], HIFICOMP [Mar90a].
- Applications: interval methods in industry [Cor87b, Cor88, Ada90, Cor90], dynamical systems [Lud90], CAD [Lut88], recursive filters [Sau86], engineering problems [Cor87, Ada87a, Ams88, Ada89b, Ada91, Sch89, Sch90b, Sch91], geometry [Ott87, Ott88], computational chaos [Ada92].
- Literature lists: interval methods [Gar85, Gar87, Nic88, Yak92a, Yak92b], automatic differentiation [Cor91a].
References

[Ada83] Adams, E.; Lohner, R.: Error Bounds and Sensitivity Analysis. In: Stepleman, R. S. (Ed.): Scientific Computing. North Holland, Amsterdam, 1983.

[Ada85] Adams, E.; Spreuer, H.; Holzmüller, A.: Probability Distributions over Output Intervals for (Non)Linear ODEs with Given Distributions over Input Intervals. In [Wah85, vol. 1, pp. 107-109], 1985.

[Ada87] Adams, E.; Ansorge, R.; Großmann, Ch.; Roos, H.-G.: Discretization in Differential Equations and Enclosures. Mathematische Forschung, Band 36, Akademie-Verlag, Berlin, 1987.

[Ada87a] Adams, E.; Cordes, D.; Lohner, R.: Enclosure of Solutions of Ordinary Initial Value Problems and Applications. In [Ada87, pp. 9-28], 1987.

[Ada88] Adams, E.; Holzmüller, A.; Straub, D.: The Periodic Solutions of the Oregonator and Verification of Results. In [Kul88, pp. 111-121], 1988.
[Ada89a] Adams, E.: Enclosure Methods and Scientific Computation. In [Ame89, pp. 3-31], 1989.

[Ada89b] Adams, B.; Adams, E.; Spreuer, H.: Practical Stability and Stochastic Point Processes. In [Ame89, pp. 81-89], 1989.

[Ada90] Adams, E.; Cordes, D.; Keppler, H.: Enclosure Methods as Applied to Linear Periodic ODEs and Matrices. ZAMM 70, pp. 565-578, 1990.

[Ada90a] Adams, E.: Periodic Solutions: Enclosure, Verification, and Applications. In [Ull90a, pp. 199-245], 1990.

[Ada91] Adams, E.: Gear Drive Vibrations and Verified Periodic Solutions. In [Kau91, pp. 395-416], 1991.

[Ada91a] Adams, E.; Kühn, W.: On Computational Chaos for the Lorenz ODEs. Proc. 13th IMACS World Congress on Computational and Applied Mathematics, Vol. 1, pp. 351-352, Dublin, Ireland, July 1991.

[Ada91b] Adams, E.; Rufeger, W.: Diverting Difference Solutions, Particularly in Celestial Mechanics. Proc. 13th IMACS World Congress on Computational and Applied Mathematics, Vol. 1, pp. 353-354, Dublin, Ireland, July 1991.

[Ada92] Adams, E.; Ames, W. F.; Kühn, W.; Rufeger, W.; Spreuer, H.: Computational Chaos May Be Due to a Single Local Error. To appear in J. of Computational Physics.
[Alb77] Albrecht, R.; Kulisch, U. (Eds.): Grundlagen der Computerarithmetik. Computing Supplementum 1, Springer-Verlag, Wien / New York, 1977.

[Alb77a] Albrecht, R.: Grundlagen einer Theorie gerundeter algebraischer Verknüpfungen in topologischen Vereinen. In [Alb77, pp. 1-14], 1977.

[Alb80] Albrecht, R.: Roundings and Approximations in Ordered Sets. In [Ale80], 1980.

[Ale68] Alefeld, G.: Intervallrechnung über den komplexen Zahlen und einige Anwendungen. Dissertation, Universität Karlsruhe, 1968.

[Ale74] Alefeld, G.; Herzberger, J.: Einführung in die Intervallrechnung. Bibliographisches Institut (Reihe Informatik, Nr. 12), Mannheim / Wien / Zürich, 1974 (ISBN 3-411-01466-0).

[Ale77] Alefeld, G.: Über die Durchführbarkeit des Gaußschen Algorithmus bei Gleichungen mit Intervallen als Koeffizienten. In [Alb77, pp. 15-19], 1977.

[Ale80] Alefeld, G.; Grigorieff, R. D. (Eds.): Fundamentals of Numerical Computation (Computer-Oriented Numerical Analysis). Computing Supplementum 2, Springer-Verlag, Wien / New York, 1980.

[Ale83] Alefeld, G.; Herzberger, J.: An Introduction to Interval Computations. Academic Press, New York, 1983 (ISBN 0-12-049820-0).
[Ale87] Alefeld, G.: Rigorous Error Bounds for Singular Values of a Matrix Using the Precise Scalar Product. In [Kau87, pp. 9-30], 1987.

[Ale88] Alefeld, G.: Errorbounds for Quadratic Systems of Nonlinear Equations Using the Precise Scalar Product. In [Kul88, pp. 59-67], 1988.

[Ale88a] Alefeld, G.: Existence of Solutions and Iterations for Nonlinear Equations. In [Moo88, pp. 207-227], 1988.

[Ale90] Alefeld, G.: Enclosure Methods. In [Ull90a, pp. 55-72], 1990.

[Ale90a] Alefeld, G.; Illg, B.; Potra, F.: On a Class of Enclosure Methods for Systems of Equations with Higher Order of Convergence. In [Ull90b, pp. 151-159], 1990.

[Ale90b] Alefeld, G.: On the Approximation of the Range of Values by Interval Expressions. Computing 44, pp. 273-278, 1990.

[All86] Allendörfer, U.: Dezimale Gleitpunktsysteme zur effizienten Implementierung der optimalen Rechnerarithmetik. Dissertation, Universität Kaiserslautern, 1986.

[All91] Allendörfer, U.; Shiriaev, D.: PASCAL-XSC to C - A Portable PASCAL-XSC Compiler. In [Kau91, pp. 91-104], 1991.

[All91a] Allendörfer, U.; Shiriaev, D.: PASCAL-XSC. A Portable Development System. Proceedings of the 13th World Congress on Computation and Applied Mathematics, IMACS '91, Dublin, 1991.

[Alt90] Alt, R.; Vignes, J.: Stochastic Round-off Error Analysis on Sequential and Parallel Computers. In [Ull90b, pp. 3-17], 1990.

[Alt91] Alt, R.: A Parallel Transputer Machine for Controlled Scientific Computation. In [Kau91, pp. 105-115], 1991.

[Ame86] Ames, W. F.; Nicklas, R. C.: Accurate Elliptic Differential Equation Solver. In [Mir86, pp. 70-85], 1986.

[Ame89] Ames, W. F. (Ed.): Numerical and Applied Mathematics. J. C. Baltzer Scientific Publishing, Basel, 1989.

[Ams88] Ams, A.; Klein, W.: VIB - Verified Inclusions of Critical Bending Vibrations. In [Kul88, pp. 91-98], 1988.

[And90] Andreev, A.; Kjurkchiev, N.: Two-sided Methods for Solving Equations. In [Ull90b, pp. 161-172], 1990.

[And91] Andreev, A. S.; Dimov, I. T.; Markov, S. M.; Ullrich, Ch. (Eds.): Mathematical Modelling and Scientific Computations. Bulgarian Academy of Sciences, Sofia, 1991.

[Arc91] Archer, M.; Linz, P.: On the Verification of Numerical Software. In [Kau91, pp. 117-131], 1991.

[Ash71] Ashenhurst, R. L.: Number Representation and Significance Monitoring. In [Ric71], 1971.
[Ata91] Atanassova, L.; Andreev, A.: On Some Higher Order Interval Methods. In [Kau91, pp. 265-279], 1991.

[Auz85] Auzinger, W.; Stetter, H. J.: Accurate Arithmetic Results for Decimal Data on Non-Decimal Computers. Computing 35, pp. 141-151, 1985.

[Bau80] Bauch, H.: Zur iterativen Lösungseinschließung bei Anfangswertproblemen mittels Intervallmethoden. ZAMM 60, pp. 137-145, 1980.

[Bau87] Bauch, H.; Jahn, K.-U.; Oelschlägel, D.; Süsse, H.; Wiebigke, V.: Intervallmathematik, Theorie und Anwendungen. BSB B. G. Teubner Verlagsgesellschaft, Leipzig, 1987 (ISBN 3-322-00384-1).

[Beh88] Behnke, H.: Inclusion of Eigenvalues of General Eigenvalue Problems of Matrices. In [Kul88, pp. 69-78], 1988.

[Beh90] Behnke, H.: The Determination of Guaranteed Bounds to Eigenvalues with the Use of Variational Methods II. (Part I see [Goe90a]). In [Ull90a, pp. 155-170], 1990.

[Beh91] Behnke, H.: The Calculation of Guaranteed Bounds for Eigenvalues Using Complementary Variational Principles. Computing 47, pp. 11-27, 1992.

[Bet91] Bethke, D.; Herzberger, J.: Über Eigenschaften von zwei Methoden zur Einschließung der Inversen einer Intervallmatrix. In [Jah91, pp. 409-413], 1991.

[Ble87] Bleher, J. H.; Kulisch, U.; Metzger, M.; Rump, S. M.; Ullrich, Ch.; Walter, W.: FORTRAN-SC: A Study of a FORTRAN Extension for Engineering/Scientific Computation with Access to ACRITH. Computing 39, pp. 93-110, Nov. 1987.

[Ble88] Bleher, J. H.; Rump, S. M.; Kulisch, U.; Metzger, M.; Ullrich, Ch.; Walter, W.: FORTRAN-SC: A Study of a FORTRAN Extension for Engineering/Scientific Computation with Access to ACRITH. In [Kul88, pp. 227-244], 1988.

[Boc89] Bochev, P.; Markov, Sv.: A Self-Validating Numerical Method for the Matrix Exponential. Computing 43, pp. 59-72, 1989.

[Boc90] Bochev, P. B.: Enclosure Methods for Set Valued Phase Flow. In [Ull90b, pp. 173-181], 1990.

[Boh77] Bohlender, G.: Genaue Summation von Gleitkommazahlen. In [Alb77, pp. 21-32], 1977.

[Boh77a] Bohlender, G.: Produkte und Wurzeln von Gleitkommazahlen. In [Alb77, pp. 33-46], 1977.

[Boh77b] Bohlender, G.: Floating-Point Computation of Functions with Maximum Accuracy. IEEE Transactions on Computers, Vol. C-26, No. 7, July 1977.
[Boh78] Bohlender, G.: Genaue Berechnung mehrfacher Summen, Produkte und Wurzeln von Gleitkommazahlen und allgemeine Arithmetik in höheren Programmiersprachen. Dissertation, Universität Karlsruhe, 1978.

[Boh81] Bohlender, G.; Grüner, K.; Kaucher, E.; Klatte, R.; Krämer, W.; Kulisch, U.; Miranker, W. L.; Rump, S. M.; Ullrich, Ch.; Wolff v. Gudenberg, J.: PASCAL-SC: A PASCAL for Contemporary Scientific Computation. IBM Research Report RC 9009 (#39456) 8/25/81, 79 pages, 1981.

[Boh81a] Bohlender, G.; Kaucher, E.; Klatte, R.; Kulisch, U.; Miranker, W. L.; Ullrich, Ch.; Wolff v. Gudenberg, J.: FORTRAN for Contemporary Numerical Computation. IBM Research Report RC 8348. Computing 26, pp. 277-314, 1981.

[Boh82] Bohlender, G.; Grüner, K.: Gesichtspunkte zur Implementierung einer optimalen Arithmetik. In [Kul82, pp. 95-115], 1982.

[Boh82a] Bohlender, G.; Grüner, K.; Wolff v. Gudenberg, J.: Realisierung einer optimalen Arithmetik. Elektronische Rechenanlagen 24, H. 2, pp. 68-72, 1982.

[Boh83] Bohlender, G.; Grüner, K.: Realization of an Optimal Arithmetic. In [Kul83, pp. 247-268], 1983.

[Boh83a] Bohlender, G.; Böhm, H.; Grüner, K.; Kaucher, E.; Klatte, R.; Krämer, W.; Kulisch, U.; Miranker, W. L.; Rump, S. M.; Ullrich, Ch.; Wolff v. Gudenberg, J.: MATRIX-PASCAL. IBM Research Report RC 9577 (#42297) 9/13/82, 89 pages. Published in [Kul83, pp. 311-384], 1983.

[Boh83b] Bohlender, G.; Böhm, H.; Braune, K.; Grüner, K.; Kaucher, E.; Klatte, R.; Krämer, W.; Kulisch, U.; Miranker, W. L.; Ullrich, Ch.; Wolff v. Gudenberg, J.: Application Module: Scientific Computation for FORTRAN 8x. pp. 1-34, March 1983.

[Boh84] Bohlender, G.; Böhm, H.; Grüner, K.; Kaucher, E.; Klatte, R.; Krämer, W.; Kulisch, U.; Miranker, W. L.; Rump, S. M.; Ullrich, Ch.; Wolff v. Gudenberg, J.: Proposal for Arithmetic Specification in FORTRAN 8x. pp. 1-34, Sept. 1982. Revised version appeared in [For84, pp. 213-243], 1984.

[Boh86] Bohlender, G.; Rall, L. B.; Ullrich, Ch.; Wolff v. Gudenberg, J.: PASCAL-SC: Wirkungsvoll programmieren, kontrolliert rechnen. Bibliographisches Institut, Mannheim / Wien / Zürich, 1986 (ISBN 3-411-03113-1).

[Boh86a] Bohlender, G.; Teufel, T.: Demonstration of the Bit-Slice Processor Unit BAP-SC in a 68000 Environment. In [Wah85, vol. 1, pp. 155-158]; [Rus86, pp. 331-336], 1986.

[Boh87] Bohlender, G.; Teufel, T.: BAP-SC: A Decimal Floating-Point Processor for Optimal Arithmetic. In [Kau87, pp. 31-58], 1987.
[Boh87a] Bohlender, G.; Rall, L. B.; Ullrich, Ch.; Wolff v. Gudenberg, J.: PASCAL-SC: A Computer Language for Scientific Computation. Perspectives in Computing, Vol. 17, Academic Press, Orlando, 1987 (ISBN 0-12-111155-5).

[Boh88] Bohlender, G.: Is Floating-Point Arithmetic Still Adequate? In [Syd88], 1988.

[Boh89] Bohlender, G.; Ullrich, Ch.; Wolff v. Gudenberg, J.: New Developments in PASCAL-SC. SIGPLAN Notices, Vol. 23, No. 8, pp. 83-92, 1989.

[Boh89a] Bohlender, G.; Ullrich, Ch.; Wolff v. Gudenberg, J.: The Module Concepts in PASCAL-SC, Ada, Modula-2 and FORTRAN 8x. Journal for Pascal, Ada and Modula-2, 1989.

[Boh89b] Bohlender, G.; Kornerup, P.; Matula, D. W.; Walter, W.: A Proposal for Exact Floating-Point Operations in FORTRAN 8x. Proposal submitted to ANSI FORTRAN Committee X3J3, November 1989.

[Boh90] Bohlender, G.: What Do We Need Beyond IEEE Arithmetic? In [Ull90a, pp. 1-32], 1990.

[Boh91] Bohlender, G.; Miranker, W. L.; Wolff v. Gudenberg, J.: Floating-Point Systems for Theorem Proving. IBM Report RC 15101 (#67356) 11/2/89, 14 pages. In: Meyer, R. K.; Schmidt, D. S.: Computer Aided Proofs in Analysis. Springer-Verlag (The IMA Volumes in Mathematics and Its Applications, Vol. 28), New York, 1991.

[Boh91a] Bohlender, G.: A Vector Extension of the IEEE Standard for Floating-Point Arithmetic. In [Kau91, pp. 3-12], 1991.

[Boh91b] Bohlender, G.: Decimal Floating-Point Arithmetic in Binary Representation. In [Kau91, pp. 12-27], 1991.

[Boh91c] Bohlender, G.; Knöfel, A.: A Survey of Pipelined Hardware Support for Accurate Scalar Products. In [Kau91, pp. 29-43], 1991.

[Boh91d] Bohlender, G.; Wolff v. Gudenberg, J.: Accurate Matrix Multiplication on the Array Computer AMT DAP. In [Kau91, pp. 133-150], 1991.

[Boh91e] Bohlender, G.; Kornerup, P.; Matula, D. W.; Walter, W. V.: Semantics for Exact Floating-Point Operations. In [ARITH, Vol. 10, pp. 22-26], 1991.

[Boh92] Bohlender, G.; Davidenkoff, A.: Accurate Vector and Matrix Arithmetic for Parallel Computers. Proceedings of the Third Workshop on Parallel and Distributed Processing, Sofia, Bulgaria, April 16-19, 1991, edited by K. Boyanov, Elsevier Science Publishers, to appear 1992.

[Boh92a] Bohlender, G.; Davidenkoff, A.: Accurate and Reliable Solution of Numerical Problems on Transputer Systems. ESPRIT PCA Project 4035, Final Report, January 1992.

[Boh92b] Bohlender, G.; Davidenkoff, A.: Support for Parallel Programs in PASCAL-XSC on a Transputer System. In [Her92], to appear 1992.
[Boe82] Böhm, H.: Auswertung arithmetischer Ausdrücke mit maximaler Genauigkeit. In [Kul82, pp. 175-184], 1982.

[Boe83] Böhm, H.: Berechnung von Polynomnullstellen und Auswertung arithmetischer Ausdrücke mit garantierter maximaler Genauigkeit. Dissertation, Universität Karlsruhe, 1983.

[Boe83a] Böhm, H.: Evaluation of Arithmetic Expressions with Maximum Accuracy. In [Kul83, pp. 121-137], 1983.

[Boe85] Böhm, H.: High Accuracy Arithmetic Facility for IBM System/370. In [Wah85, vol. 1, pp. 159-161], 1985.

[Boe87] Böhm, H.; Rump, S. M.; Schumacher, G.: E-Methods for Nonlinear Problems. In [Kau87, pp. 59-80], 1987.

[Bra85] Braune, K.; Krämer, W.: Standard Functions for Intervals with Maximum Accuracy. In [Wah85, vol. 1, pp. 167-170], 1985.

[Bra86] Braune, K.; Krämer, W.: High-Accuracy Standard Functions for Intervals. In [Rus86], 1986.

[Bra87] Braune, K.; Krämer, W.: High-Accuracy Standard Functions for Real and Complex Intervals. In [Kau87, pp. 81-114], 1987.

[Bra87a] Braune, K.: Hochgenaue Standardfunktionen für reelle und komplexe Punkte und Intervalle in beliebigen Gleitpunktrastern. Dissertation, Universität Karlsruhe, 1987.

[Bra88] Braune, K.: Standard Functions for Real and Complex Point and Interval Arguments with Dynamic Accuracy. In [Kul88, pp. 159-184], 1988.

[Bre78] Brent, R. P.: A FORTRAN Multiple Precision Arithmetic Package. ACM Trans. Math. Software 4, pp. 57-70, 1978.

[Bre80] Brent, R. P.; Hooper, J. A.; Yohe, J. M.: An Augment Interface for Brent's Multiple-Precision Arithmetic Package. ACM Trans. Math. Software 6, pp. 146-149, 1980.

[Bre81] Brent, R. P.: MP User's Guide. 4th ed.; Technical Report TR-CS-81-08, Department of Computer Science, Australian National University, Canberra, 1981.

[Bre92] Brezinski, C.; Kulisch, U. (Eds.): Computational and Applied Mathematics I - Algorithms and Theory. Proceedings of the 13th IMACS World Congress, Dublin, Ireland. Elsevier Science Publishers B.V., to be published in 1992.

[Bro77] Brown, W. S.: A Realistic Model of Floating-Point Computation. In [Ric77], 1977.

[Bro81] Brown, W. S.: A Simple But Realistic Model of Floating-Point Computation. ACM Trans. Math. Software 7, 4, pp. 445-480, 1981.
[Buc86] Buchberger, B.; Kutzler, B.; Feilmeier, M.; Kratz, M.; Kulisch, U.; Rump, S. M.: Rechnerorientierte Verfahren. B. G. Teubner Verlag, Stuttgart, 1986 (ISBN 3-519-02617-1).

[Cal86] Calaoagan, R.; Guellil, T.; Mertens, M.: Entwurf eines BCD-Gleitkomma-Rechenwerkes mit maximaler Genauigkeit. Diplomarbeit, Abteilung Informatik, Universität Dortmund, 1986.

[Cap88] Cappello, P. R.; Miranker, W. L.: Systolic Super Summation. IEEE Transactions on Computers 37 (6), pp. 657-677, June 1988.

[Cap88a] Cappello, P. R.; Miranker, W. L.: Systolic Super Summation with Reduced Hardware. IBM Research Report RC 14259 (#63831), IBM Research Division, Yorktown Heights, New York, Nov. 30, 1988.

[Cap75] Caprani, O.: Roundoff Errors in Floating-Point Summation. BIT 15, pp. 5-9, 1975.

[Che90] Chesneaux, J. M.: Study of the Computing Accuracy by Using Probabilistic Approach. In [Ull90b, pp. 19-30], 1990.

[Cla79] Claudio, D. M.: Beiträge zur Struktur der Rechnerarithmetik. Dissertation, Universität Karlsruhe, 1979.

[Cla80] Claudio, D. M.: Rounding Invariant Structures by Application of a Mapping of a Ringoid in an Ordered Set. In [Nic80], 1980.

[Cod88] Cody, W. J.: Floating-Point Standards - Theory and Practice. In [Moo88, pp. 99-107], 1988.

[Coh83] Cohen, M. S.; Hull, T. E.; Hamacher, V. C.: CADAC: A Controlled Precision Decimal Arithmetic Unit. IEEE Trans. Comp., Vol. C-32, No. 4, April 1983.

[Col82] Collatz, L.: Inclusion of Solutions of Certain Types of Integral Equations. In: Baker, T. H.; Miller, G. F.: Treatment of Integral Equations by Numerical Methods. Academic Press, London, 1982.

[Col90] Collatz, L.: Guaranteed Inclusions of Solutions of Some Types of Boundary Value Problems. In [Ull90a, pp. 189-198], 1990.

[Col77] Collins, G. E.: Infallible Calculation of Polynomial Zeros to Specified Precision. In [Ric77], 1977.

[Coo80] Coonen, J. T.: An Implementation Guide to a Proposed Standard for Floating-Point Arithmetic. Computer 13, No. 1, January 1980. (Errata sheet: Computer, March 1981)

[Cor85a] Cordellier, F.: On the Use of the Residue to Check the Quality of the Solution of a Linear System. In [Wah85, vol. 1, pp. 187-191], 1985.

[Cor85] Cordes, D.; Adams, E.: Test for Uniform Boundedness for Problems Permitting Parameter Resonance. In [Wah85, vol. 1, pp. 103-105], 1985.
[Cor87] Cordes, D.: Verifizierter Stabilitätsnachweis für Lösungen von Systemen periodischer Differentialgleichungen auf dem Rechner mit Anwendungen. Dissertation, Universität Karlsruhe, 1987.

[Cor87a] Cordes, D.; Kaucher, E.: Self-Validating Computation for Sparse Matrix Problems. In [Kau87, pp. 133-149], 1987.

[Cor87b] Cordes, D.; Adams, E.: Test for Uniform Boundedness for Dynamic Problems Admitting Parameter and Combination Resonance. In [Kau87, pp. 115-132], 1987.

[Cor88] Cordes, D.: Stability Test for Periodic Differential Equations on Digital Computers with Applications. In [Kul88, pp. 99-110], 1988.

[Cor89] Cordes, D.: Spärlich besetzte Matrizen. In [Kul89, pp. 129-136], 1989.

[Cor89a] Cordes, D.; Krämer, W.: Vom Problem zum Einschließungsalgorithmus. In [Kul89, pp. 167-181], 1989.

[Cor91] Cordes, D.: Runtime System for a PASCAL-XSC Compiler. In [Kau91, pp. 151-160], 1991.

[Cor85] Corliss, G. F.; Rall, L. B.: Adaptive, Self-Validating Numerical Quadrature. MRC Technical Summary Report #2815, University of Wisconsin, Madison, 1985.

[Cor87c] Corliss, G. F.: Computing Narrow Inclusions for Definite Integrals. In [Kau87, pp. 150-169], 1987.

[Cor88] Corliss, G. F.: Applications of Differentiation Arithmetic. In [Moo88, pp. 127-148], 1988.

[Cor90] Corliss, G. F.: Industrial Applications of Interval Techniques. In [Ull90a, pp. 91-113], 1990.

[Cor91a] Corliss, G. F.: Automatic Differentiation Bibliography. In [Gri91, pp. 331-353], 1991.

[Cor91b] Corliss, G. F.; Rall, L. B.: Computing the Range of Derivatives. In [Kau91, pp. 195-212], 1991.

[Cor91c] Corliss, G. F.: Validated Anti-Derivatives. In: Meyer, R. K.; Schmidt, D. S.: Computer Aided Proofs in Analysis. Springer-Verlag (The IMA Volumes in Mathematics and Its Applications, Vol. 28), New York, 1991.

[Cor84] Cornelius, H.; Lohner, R.: Computing the Range of Values of Real Functions with Accuracy Higher Than Second Order. Computing 33, pp. 331-347, 1984.

[Cse91] Csendes, T.: Test Results of Interval Methods for Global Optimization. In [Kau91, pp. 417-424], 1991.

[Cuy92] Cuyt, A.; Verdonk, B.: Multivariate Rational Data Fitting: General Data Structure, Maximal Accuracy and Object Orientation. Proceedings of International Congress on Extrapolation and Rational Approximation, Tenerife, January 1992; to appear in Numerical Algorithms, 1992.
[Dav76]
Davey, D. P.; Steward, N. F.: Guaranteed Error Bounds for the Initial Value Problem using Polytope Arithmetic. BIT 16,pp. 257-268, 1976.
[DavSl]
Davidenkoff, A.: High Accuracy Arithmetic on Transputers. In [Kaugl, pp. 45-61], 1991.
[Dav921
Davidenkoff, A.: Arithmetische Ausstattung von Parallelnxhnenz fur zuverlassiges numerisches Rechnen. Dissertation, Universitit Karlsruhe, to appear 1992.
[Dav92a] Davidenkoff, A.: A CRITH-XSC - A Programming Language for Scientific / Engineering Computation. ( A Survey). Talk at GAMM 91 Conference, Cracow, Poland, April 1-5, 1991. ZAMM 72 6, pp. T465-T467, 1992. Dekker, T. J.: A Floating-point Technique for Extending the Available Precision. Numer. Math. 18,pp. 224-242, 1971. Deppermann, M; Hafner, K.: Chips for High-Precision Arithmetic - an Architectural Study. ESPRIT project DIAMOND, Deliverable D4-5, and Patent (???), Siemens, Munchen, 1988. [Die851
Dietrich, Th.: Realisierung der optimalen Arithmetik in FORTRAN 8x. Diplomarbeit, Institut fur Angewandte Mathematik, Universitit Karlsruhe, 1985.
[Dim901 Dimitrova, N.; Markov, S. M.: Interval-Arithmetic Algorithms for Simultaneous Computation of All Polynomial Zeroes. In [U1190b, pp. 183-1951, 1990. [Dim911 Dimitrova, N.; Markov, S. M.: On the Interval-Arithmetic Presentation of the Range of a Class of Monotone Functions of Many Variables. In (Kau91, pp. 213-2281, 1991.
[Dim92] Dimitrova, N.; Markov, S. M.; Popova, E. D.: Extended Interval Arithmetics: New Results and Applications. In [Her92], to appear 1992.
[Dob86] Dobner, H.-J.: Einschließungsalgorithmen für hyperbolische Differentialgleichungen. Dissertation, Universität Karlsruhe, 1986.
[Dob87] Dobner, H.-J.; Kaucher, E.: Solving Characteristic Initial Value Problems with Guaranteed Errorbounds. In [Kau87, pp. 170-185], 1987.
[Dob90] Dobner, H.-J.; Kaucher, E.: Self-Validating Computations of Linear and Nonlinear Equations of the Second Kind. In [Ull90b, pp. 273-290], 1990.
[DuC90] Du Croz, J. J.; Pont, M. W.: Ada FPV - A Package to Test Floating-Point Arithmetic. In [Ull90b, pp. 359-369], 1990.
[Ein92]
Einarsson, B.; Fosdick, L. D.; Gaffney, P.; Houstis, E. (Eds.): Programming Environments for High Level Scientific Problem Solving. Proceedings of IFIP WG 2.5 conference, local organizers: U. Kulisch, A. Schreiner, Karlsruhe, September 23-27, 1991; Elsevier Science Publishers, to appear 1992.
Bibliography
[Erl88]
Erl, M.; Hodgson, G.; Kok, J.; Winter, D.; Zoellner, A.: Design and Implementation of Accurate Operators in Ada. ESPRIT project DIAMOND, Deliverable 1-2/4, 1988.
[Fal90]
Falcó Korn, C.; Gutzwiller, S. E.; König, S.: Modula-SC, a Precompiler to Modula-2. In [Ull90b, pp. 371-383], 1990.
[Fal91]
Falcó Korn, C.; Gutzwiller, S.; König, S.; Ullrich, Ch.: Modula-SC. Motivation, Language Definition and Implementation. In [Kau91, pp. 161-179], 1991.
[Fis85]
Fischer, H.-C.: Realisierung einer optimalen Arithmetik in ADA. Diplomarbeit, Institut für Angewandte Mathematik, Universität Karlsruhe, 1985.
[Fis87]
Fischer, H.-C.; Haggenmüller, R.; Schumacher, G.: Auswertung arithmetischer Ausdrücke mit garantierter hoher Genauigkeit. Siemens Forschungs- und Entwicklungs-Berichte 16, Nr. 5, pp. 171-177, 1987.
[Fis88]
Fischer, H.-C.; Schumacher, G.; Haggenmüller, R.: Evaluation of Arithmetic Expressions with Guaranteed High Accuracy. In [Kul88, pp. 149-157], 1988.
[Fis89]
Fischer, H.-C.: Genaue Auswertung von Polynomen und Ausdrücken. In [Kul89, pp. 155-165], 1989.
[Fis90]
Fischer, H.-C.: Schnelle automatische Differentiation, Einschließungsmethoden und Anwendungen. Dissertation, Universität Karlsruhe, 1990.
[Fis90a]
Fischer, H.-C.: Range Computation and Applications. In [Ull90b, pp. 197-211], 1990.
[Fis92]
Fischer, H.-C.: Effiziente Berechnung von Ableitungswerten, Gradienten und Taylorkoeffizienten. In Chatterji, S. D.; Fuchssteiner, B.; Kulisch, U.; Liedl, R.; Purkert, W. (Eds.): Jahrbuch Überblicke Mathematik 1992, Vieweg, Braunschweig, 1992.
[Fla87]
Flavigny, B.: PERCAL: Un outil logiciel pour la détermination de la précision des calculs. Micro-Bulletin 24, pp. 9-18, Centre National de la Recherche Scientifique, Paris, 1987.
[Fla90]
Flavigny, B.: Analytic and Probabilistic Study of Round-off Errors in Numerical Computations. In [Ull90b, pp. 31-45], 1990.
[For84]
Ford, B.; Rault, J. C.; Thomasset, F. (Eds.): Tools, Methods and Languages for Scientific and Engineering Computation. Elsevier (North Holland), 1984.
[Fro86]
Frommer, A.: Monotonie und Einschließung beim Brown-Verfahren. Dissertation, Universität Karlsruhe, 1986.
[Fro90]
Frommer, A.: Lösung linearer Gleichungssysteme auf Parallelrechnern. Vieweg Verlag, Braunschweig, 1990.
[Fro90a] Frommer, A.; Mayer, G.: Efficient Modifications of the Interval Newton Method. In [Ull90b, pp. 213-227], 1990.
[Gar85]
Garloff, J.: Interval mathematics. A bibliography. Freib. Int.-Ber. 6, pp. 1-222, 1985.
[Gar87]
Garloff, J.: Bibliography on interval mathematics. Continuation. Freib. Int.-Ber. 2, pp. 1-50, 1987.
[Gar91]
Garloff, J.: Stability Test of a Polynomial with Coefficients Depending Polynomially on Parameters. In [Jah91, pp. 415-419], 1991.
[Geo90] Geörg, S.: Two Methods for the Verified Inclusion of Zeros of Complex Polynomials. In [Ull90b, pp. 229-244], 1990.
[Geo91] Georgiev, K.; Margenov, S.: On Domain Decomposition Methods for Problems with Discontinuous Coefficients. In [Kau91, pp. 281-292], 1991.
[Goe90]
Goehlen, M.; Plum, M.; Schröder, J.: A Programmed Algorithm for Existence Proofs for Two-Point Boundary Value Problems. Computing 44, pp. 91-132, 1990.
[Goe90a] Goerisch, F.; He, Z.: The Determination of Guaranteed Bounds to Eigenvalues with the Use of Variational Methods I. (Part II see [Beh90]). In [Ull90a, pp. 137-153], 1990.
[Gol71] Goldstein, M.; Hoffberg, S.: The Estimation of Significance. In [Ric71], 1971.
[Gol90] Goldberg, D.: Computer Arithmetic. In Patterson, D.; Hennessy, J. L. (Eds.): Computer Architecture: A Quantitative Approach. Morgan Kaufmann, Los Altos, 1990.
[Gol91]
Goldberg, D.: What Every Computer Scientist Should Know About Floating-Point Arithmetic. ACM Computing Surveys 23, pp. 5-48, 1991.
[Gri89]
Griewank, A.: On Automatic Differentiation. In: Iri, M.; Tanabe, K. (Eds.): Mathematical Programming: Recent Developments and Applications, pp. 83-108, Kluwer Academic Publishers, 1989.
[Gri91]
Griewank, A.; Corliss, G. (Eds.): Automatic Differentiation of Algorithms: Theory, Implementation, and Applications. Proceedings of the Workshop on Automatic Differentiation at Breckenridge, SIAM, Philadelphia, 1991.
[Gro90]
Grozev, G. R.; Markov, S. M.: Numerical Methods with Result Verification for Boundary Value Problems. In [Ull90b, pp. 291-299], 1990.
[Gru77]
Grüner, K.: Fehlerschranken für lineare Gleichungssysteme. In [Alb77, pp. 47-55], 1977.
[Gru79]
Grüner, K.: Allgemeine Rechnerarithmetik und deren Implementierung mit optimaler Genauigkeit. Dissertation, Universität Karlsruhe, 1979.
[Gru87]
Grüner, K.: Solving Complex Problems for Polynomials and Linear Systems with Verified High Accuracy. In [Kau87, pp. 199-220], 1987.
[Gue87] Günther-Jürgens, G.; Endebrock, P.; Klatte, R.: RESI: Practical Experience in Computer-Arithmetic. In [Kau87, pp. 186-198], 1987.
[Haf90] Hafner, K.: Chips for High Precision Arithmetic. In [Ull90a, pp. 33-54], 1990.
[Hah88] Hahn, W.; Mohr, K.: APL/PCXA: Erweiterung der IEEE Arithmetik für technisch wissenschaftliches Rechnen. Hanser Verlag, München, 1988 (ISBN 3-446-15264-4).
[Hah90] Hahn, W.; Mohr, K.: APL/PCXA. In [Ull90b, pp. 385-396], 1990.
[Ham87] Hammel, S. M.; Yorke, J. A.; Grebogi, C.: Do Numerical Orbits of Chaotic Dynamical Processes Represent True Orbits? Journal of Complexity 3, pp. 136-145, Academic Press, 1987.
[Ham88] Hammel, S. M.; Yorke, J. A.; Grebogi, C.: Numerical Orbits of Chaotic Processes Represent True Orbits. Bulletin (New Series) of the AMS 19, No. 2, pp. 465-469, American Math. Soc., 1988.
[Ham89] Hammer, R.; Neaga, M.; Ratz, D.: PASCAL-SC, New Concepts for Scientific Computation and Numerical Data Processing. Institute of Applied Mathematics, University of Karlsruhe, P.O. Box 6980, D-7500 Karlsruhe, Germany, 1989.
[Ham90] Hammer, R.: How Reliable is the Arithmetic of Vector Computers? In [Ull90b, pp. 467-482], 1990.
[Ham91] Hammer, R.; Neaga, M.; Ratz, D.; Shiriaev, D.: PASCAL-XSC. A New Language for Scientific Computing. (In Russian), Interval Computations 2, St. Petersburg / Moscow, 1991.
[Han69] Hansen, E.: Topics in Interval Analysis. Clarendon Press, Oxford, 1969.
[Hei90] Heindl, G.: An Efficient Method for Improving Orthonormality of Nearly Orthonormal Sets of Vectors. In [Ull90b, pp. 83-91], 1990.
[Hei91]
Heindl, G.: Inclusions of the Range of Functions and their Derivatives. In [Kau91, pp. 229-238], 1991.
[Hen80]
Henrici, P.: A Model for the Propagation of Rounding Error in Floating-Point Arithmetic. In [Nic80], 1980.
[Her85]
Herod, J. V.; Adams, E.; Spreuer, H.: Bounds for Spatially Nonhomogeneous Model Boltzmann Energy Equations. In [Wah85, vol. 1, pp. 123-126], 1985.
[Her69]
Herzberger, J.: Metrische Eigenschaften von Mengensystemen und einige Anwendungen. Dissertation, Universität Karlsruhe, 1969.
[Her77]
Herzberger, J.: Zur Approximation des Wertebereiches reeller Funktionen durch Intervallausdrücke. In [Alb77, pp. 57-64], 1977.
[Her90]
Herzberger, J.: On Schulz's Method in Circular Complex Arithmetic. In [Ull90b, pp. 93-102], 1990.
[Her90a] Herzberger, J.; Petković, Lj.: Efficient Iterative Algorithms for Bounding the Inverse of a Matrix. Computing 44, pp. 237-244, 1990.
[Her91]
Herzberger, J.: Iterative Inclusion of the Inverse Matrix. In [And91, pp. 14-26], 1991.
[Her92]
Herzberger, J.; Atanassova, L. (Eds.): Proceedings of SCAN 91. Elsevier Science Publishers B. V., Amsterdam, to appear 1992.
[Hul84]
Hulshof, B. J. A.; van Hulzen, J. A.: Automatic Error Cumulation Control. In: EUROSAM 84 (Proc. of the Int. Symposium), Lecture Notes in Computer Science 174, 1984.
[Hul81]
Hull, T. E.: Precision Control, Exception Handling and a Choice of Numerical Algorithms. Lecture Notes Math. 912, pp. 169-178, 1981.
[Hul85]
Hull, T. E.; Abrham, A.; Cohen, M. S.; Curley, A. F. X.; Hall, C. B.; Penny, D. A.; Sawchuk, J. T. M.: Numerical Turing. ACM SIGNUM Newsletter 20, No. 3, pp. 26-34, July 1985.
[Hul87]
Hull, T. E.; Cohen, M. S.: Toward an Ideal Computer Arithmetic. In [ARITH, Vol. 8, pp. 131-138], 1987.
[Hus88]
Husung, D.: ABACUS: Programmierwerkzeug mit hochgenauer Arithmetik für Algorithmen mit verifizierten Ergebnissen. Diplomarbeit, Institut für Angewandte Mathematik, Universität Karlsruhe, 1988.
[Hus89]
Husung, D.: Precompiler for Scientific Computation (TPX). Reports of the Institute for Computer Science III, TU Hamburg-Harburg, 1989.
[Hwa79] Hwang, K.: Computer Arithmetic: Principles, Architecture, and Design. J. Wiley, New York, 1979.
[Jah91]
Jahn, K.-U. (Ed.): Computernumerik mit Ergebnisverifikation. Problemseminar, Technische Hochschule Leipzig, 13.-15. März 1991. Proceedings in Wissenschaftliche Zeitschrift der Technischen Hochschule Leipzig, Jahrgang 15, Heft 6, 1991.
[Jan85]
Jansson, C.: Zur Linearen Optimierung mit unscharfen Daten. Dissertation, Kaiserslautern, 1985.
[Jan88]
Jansson, C.: A Self-Validating Method for Solving Linear Programming Problems with Interval Input Data. In [Kul88, pp. 33-45], 1988.
[Jan90]
Jansson, C.: Guaranteed Error Bounds for the Solution of Linear Systems. In [Ull90b, pp. 103-110], 1990.
[Jan91]
Jansson, C.; Rump, S. M.: Rigorous Solution of Linear Programming Problems with Uncertain Data. ZOR-Methods and Models of Operations Research 35, pp. 87-111, 1991.
[Jan91a] Jansson, C.; Rump, S. M.: Rigorous Sensitivity Analysis for Real Symmetric Matrices with Uncertain Data. In [Kau91, pp. 293-316], 1991.
[Jan91b] Jansson, C.: Interval Linear Systems with Symmetric Matrices, Skew-Symmetric Matrices and Dependencies in the Right Hand Side. Computing 46, pp. 265-274, 1991.
[Jan91c] Jansson, C.: A Geometric Approach for Computing A Posteriori Error Bounds for the Solution of a Linear System. Computing 47, pp. 1-9, 1991.
[Jan92]
Jansson, C.: A Posteriori Error Bounds for the Solution of Linear Systems with Simplex Techniques. To appear 1992.
[Jen85]
Jensen, K.; Wirth, N.: Pascal User Manual and Report. ISO Pascal Standard, 3rd ed., Springer, New York, 1985.
[Jue87]
Jüllig, H.-P.: Untersuchung zur Implementierung von Skalarproduktausdrücken. Diplomarbeit, Institut für Angewandte Mathematik, Universität Karlsruhe, 1987.
[Jue91]
Jüllig, H.-P.: Algorithmen mit Ergebnisverifikation mit C++/2.0. In [Jah91, pp. 433-439], 1991.
[Kah65]
Kahan, W.: Further Remarks on Reducing Truncation Errors. Comm. ACM 8, p. 40, 1965.
[Kah85] Kahan, W.; Le Blanc, E.: Anomalies in the IBM ACRITH Package. In [ARITH, Vol. 7, pp. 322-331], 1985.
[Kah88]
Kahan, W.: Floating-point Lectures. Unpublished manuscript. Sun Microsystems, Mountain View, Calif., 1988.
[Kam91] Kammann, K.: Untersuchung der Parallelisierbarkeit des Rump-Verfahrens zur Einschließung der Lösung von linearen Gleichungssystemen. Diplomarbeit, Institut für Angewandte Mathematik, Universität Karlsruhe, 1991.
[Kan91]
Kan, P. T.; Miranker, W. L.: Entropy in Multilevel Arithmetic. In [Kau91, pp. 317-327], 1991.
[Kau73]
Kaucher, E.: Über metrische und algebraische Eigenschaften einiger beim numerischen Rechnen auftretender Räume. Dissertation, Universität Karlsruhe, 1973.
[Kau77]
Kaucher, E.: Algebraische Erweiterungen der Intervallrechnung unter Erhaltung der Ordnungs- und Verbandsstrukturen. In [Alb77, pp. 65-79], 1977.
[Kau77a] Kaucher, E.: Über Eigenschaften und Anwendungsmöglichkeiten der erweiterten Intervallrechnung und des hyperbolischen Fastkörpers über R. In [Alb77, pp. 81-94], 1977.
[Kau80] Kaucher, E.; Rump, S. M.: Generalized iteration methods for bounds of the solution of fixed point operator equations. In [Ale80], 1980.
[Kau81] Kaucher, E.; Klatte, R.; Ullrich, Ch.; Wolff v. Gudenberg, J.: Programmiersprachen im Griff - Band 2: PASCAL. Bibliographisches Institut, Mannheim, 1981.
[Kau82] Kaucher, E.; Rump, S. M.: E-Methods for Fixed Point Equations f(x) = x. Computing 28, pp. 31-42, 1982.
[Kau82a] Kaucher, E.: Lösung von Funktionalgleichungen mit garantierten und genauen Fehlerschranken. In [Kul82, pp. 185-205], 1982.
[Kau83]
Kaucher, E.: Solving Function Space Problems with Guaranteed Close Bounds. In [Kul83, pp. 139-164], 1983.
[Kau84] Kaucher, E.; Miranker, W. L.: Self-validating Numerics for Function Space Problems. Academic Press, New York, 1984 (ISBN 0-12-402020-8).
[Kau85] Kaucher, E.; Schumacher, G.: Self-Validating Methods for Special Differential-Equation Problems. In [Wah85, vol. 1, pp. 115-118], 1985.
[Kau85a] Kaucher, E.: Inclusion with High Accuracy Arithmetic for Function Space Problems. In [Wah85, vol. 1, pp. 183-186], 1985.
[Kau87] Kaucher, E.; Kulisch, U.; Ullrich, Ch. (Eds.): Computerarithmetic: Scientific Computation and Programming Languages. B. G. Teubner Verlag, Stuttgart, 1987 (ISBN 3-519-02448-9).
[Kau87a] Kaucher, E.: Self-Validating Computation of Ordinary and Partial Differential Equations. In [Kau87, pp. 221-254], 1987.
[Kau88] Kaucher, E.; Miranker, W. L.: Validating Computation in a Function Space. In [Moo88, pp. 403-425], 1988.
[Kau89] Kaucher, E.: Methoden zur Lösung von Integral- und Differentialgleichungen. In [Kul89, pp. 225-238], 1989.
[Kau90] Kaucher, E.; Schulz-Rinne, C.: Aspects of Self-Validating Numerics in Banach Spaces. In [Ull90a, pp. 269-299], 1990.
[Kau91] Kaucher, E.; Mayer, G.; Markov, S. M. (Eds.): Computer Arithmetic, Scientific Computation and Mathematical Modelling. Proceedings of SCAN-90. IMACS Annals on Computing and Applied Mathematics, Vol. 12 (1992), published Oct. 1991. J. C. Baltzer AG, Basel, 1991.
[Kau91a] Kaucher, E.: Area-Preserving Mappings, their Application for Parallel and Partially Validated Computations of Navier-Stokes-Problems. In [Kau91, pp. 425-436], 1991.
[Kel89]
Kelch, R.: Ein adaptives Verfahren zur numerischen Quadratur mit automatischer Ergebnisverifikation. Dissertation, Universität Karlsruhe, 1989.
[Kel90]
Kelch, R.: An Adaptive Procedure for Numerical Quadrature with Automatic Result Verification. In [Ull90b, pp. 301-317], 1990.
[Ker91]
Kerbl, M.: Stepsize Control Strategies for Inclusion Algorithms for ODEs. In [Kau91, pp. 437-452], 1991.
[Kie88]
Kießling, I.; Lowes, M.; Paulik, A.: Genaue Rechnerarithmetik - Intervallrechnung und Programmieren mit PASCAL-SC. B. G. Teubner Verlag, Stuttgart, 1988 (ISBN 3-519-00114-4).
[Kir78]
Kirchner, R.: Asynchrone Schaltwerke mit Flankensteuerung. Dissertation, Universität Karlsruhe, 1978.
[Kir79]
Kirchner, R.; Neaga, M.; Wippermann, H.-W.: Ein Drei-Pass-PASCAL-Compiler für Z80. In [Wip79], 1979.
[Kir82]
Kirchner, R.: Überblick über die vorliegende Implementierung der Pascal-Spracherweiterung. In [Kul82, pp. 117-146], 1982.
[Kir84]
Kirchner, R.: KL/P-Architektur-Handbuch. Interner Bericht Nr. 97/84, Fachbereich Informatik, Universität Kaiserslautern, 1984.
[Kir87]
Kirchner, R.; Kulisch, U.: Arithmetic for Vector Processors. In [ARITH, Vol. 8, pp. 256-269], 1987.
[Kir88]
Kirchner, R.; Kulisch, U.: Accurate Arithmetic for Vector Processing. Journal of Parallel and Distributed Computing 5, special issue on "High Speed Computer Arithmetic", pp. 250-270, 1988.
[Kir88a] Kirchner, R.; Kulisch, U.: Arithmetic for Vector Processors. In [Moo88, pp. 3-41], 1988.
[Kla75] Klatte, R.: Zyklisches Enden bei Iterationsverfahren. Dissertation, Universität Karlsruhe, 1975.
[Kla85] Klatte, R.; Ullrich, C.; Wolff v. Gudenberg, J.: Arithmetic Specification for Scientific Computation in ADA. IEEE Transactions on Computers, Vol. C-34, No. 11, 1985.
[Kla86]
Klatte, R.; Ullrich, C.; Wolff v. Gudenberg, J.: Optimal Arithmetic and ADA. In [Wah85, vol. 1, pp. 179-181]; [Rus86, vol. 2], 1986.
[Kla86a] Klatte, R.; Ullrich, C.; Wolff v. Gudenberg, J.: Implementation of Arithmetic for Scientific Computation in ADA. Working paper of the ESPRIT project DIAMOND, No. 03/1-2/2/Kl.f, Karlsruhe, 1986.
[Kla89]
Klatte, R.: Übersicht über neue Programmiersprachen. In [Kul89, pp. 29-43], 1989.
[Kla91]
Klatte, R.; Kulisch, U.; Neaga, M.; Ratz, D.; Ullrich, Ch.: PASCAL-XSC - Sprachbeschreibung mit Beispielen. Springer-Verlag, Berlin/Heidelberg/New York, 1991 (ISBN 3-540-53714-7, 0-387-53714-7).
[Kla92]
Klatte, R.; Kulisch, U.; Neaga, M.; Ratz, D.; Ullrich, Ch.: PASCAL-XSC - Language Reference with Examples. Springer-Verlag, Berlin/Heidelberg/New York, 1992.
[Kle90]
Klein, W.: Zur Einschließung der Lösung von linearen und nichtlinearen Fredholmschen Integralgleichungssystemen zweiter Art. Dissertation, Universität Karlsruhe, 1990.
[Kle91]
Klein, W.: Inclusion Methods for Linear and Nonlinear Systems of Fredholm Integral Equations of the Second Kind. In [Kau91, pp. 453-458], 1991.
[Klo87]
Klotz, G.: Faktorisierung von Matrizen mit maximaler Genauigkeit. Dissertation, Universität Karlsruhe, 1987.
[Klu90]
Klug, U.: Verified Inclusions for Eigenvalues and Eigenvectors of Real Symmetric Matrices. In [Ull90b, pp. 111-125], 1990.
[Kno88]
Knöfel, A.: Entwurf eines Arithmetikprozessors für das optimale Skalarprodukt mit Hilfe eines Silicon Compilers. Diplomarbeit, Institut für Angewandte Mathematik / Institut für Rechnerentwurf und Fehlertoleranz, Universität Karlsruhe, 1988.
[Kno91]
Knöfel, A.: Hardwareentwurf eines Rechenwerks für semimorphe Skalar- und Vektoroperationen unter Berücksichtigung der Anforderungen verifizierender Algorithmen. Dissertation, Universität Karlsruhe, 1991.
[Kno91a] Knöfel, A.: Advanced Circuits for the Computation of Accurate Scalar Products. In [Kau91, pp. 63-67], 1991.
[Kno91b] Knöfel, A.: Fast Hardware Units for the Computation of Accurate Dot Products. In [ARITH, Vol. 10, pp. 70-74], 1991.
[Knu73] Knuth, D. E.: The Art of Computer Programming. Vol. 1: Fundamental Algorithms. 2nd ed., Addison-Wesley, Reading, Massachusetts, 1973.
[Knu81] Knuth, D. E.: The Art of Computer Programming. Vol. 2: Seminumerical Algorithms. 2nd ed., Addison-Wesley, Reading, Massachusetts, 1981.
[Kol91] Kolev, L. V.; Nenov, I.: Approximate Interval Solution of an Initial Value Problem for Linear ODEs. In [Kau91, pp. 459-468], 1991.
[Kol91a] Kolev, L. V.; Mladenov, V. M.: An Interval Method for Determining All T-Periodical Solutions of Nonlinear ODE Systems. In [Kau91, pp. 469-473], 1991.
[Koe90]
König, S.: On the Inflation Parameter Used in Self-validating Methods. In [Ull90b, pp. 127-132], 1990.
[Kra87]
Krämer, W.: Inverse Standardfunktionen für reelle und komplexe Intervallargumente mit a priori Fehlerabschätzungen für beliebige Datenformate. Dissertation, Universität Karlsruhe, 1987.
[Kra88]
Krämer, W.: Inverse Standard Functions for Real and Complex Point and Interval Arguments with Dynamic Accuracy. In [Kul88, pp. 185-211], 1988.
[Kra88b] Krämer, W.: Mehrfachgenaue reelle und intervallmäßige Staggered-Correction Arithmetik mit zugehörigen Standardfunktionen. Bericht des Inst. f. Angew. Mathematik, Univ. Karlsruhe, pp. 1-80, 1988.
[Kra89]
Krämer, W.; Walter, W.: FORTRAN-SC: A FORTRAN Extension for Engineering/Scientific Computation with Access to ACRITH, General Information Notes and Sample Programs. IBM Deutschland GmbH, Stuttgart, 1989.
[Kra89a] Krämer, W.: Fehlerschranken für häufig auftretende Approximationsausdrücke. ZAMM 69, pp. 44-47, 1989.
[Kra90]
Krämer, W.: Berechnung der Gammafunktion für reelle Punkt- und Intervallargumente. ZAMM 70, pp. 581-584, 1990.
[Kra90a] Krämer, W.: Highly Accurate Evaluation of Program Parts with Applications. In [Ull90b, pp. 397-409], 1990.
[Kra91a] Krämer, W.: Computation of Verified Bounds for Elliptic Integrals. In [Her92], to appear 1992.
[Kra91b] Krämer, W.: Einschluß eines Paares konjugiert komplexer Nullstellen eines reellen Polynoms. ZAMM 71, pp. T820-T824, 1991.
[Kra91c] Krämer, W.: Genaue Berechnung von komplexen Polynomen in mehreren Variablen. In [Jah91, pp. 401-407], 1991.
[Kra91d] Krämer, W.: Verified Solution of Eigenvalue Problem with Sparse Matrices. Proceedings of 13th World Congress on Computation and Applied Mathematics, IMACS '91, Dublin, pp. 32-33, 1991.
[Kra91e] Krämer, W.: Evaluation of Polynomials in Several Variables with High Accuracy. In [Kau91, pp. 239-249], 1991.
[Kra92]
Krämer, W.: Die Berechnung von Standardfunktionen in Rechenanlagen. In Chatterji, S. D.; Fuchssteiner, B.; Kulisch, U.; Liedl, R.; Purkert, W. (Eds.): Jahrbuch Überblicke Mathematik 1992, Vieweg, Braunschweig, 1992.
[Kra92b] Krämer, W.: Eine portable Langzahl- und Langzahlintervallarithmetik mit Anwendungen. ZAMM 73, 1992.
[Kra69]
Krawczyk, R.: Newton-Algorithmen zur Bestimmung von Nullstellen mit Fehlerschranken. Computing 4, pp. 187-220, 1969.
[Kre88]
Kreutzenberger, R.: Formelauswerter mit E-Verifikation für allgemeine arithmetische Ausdrücke und Einbau in einen FORTRAN-SC Compiler. Diplomarbeit, Institut für Angewandte Mathematik, Universität Karlsruhe, 1988.
[Kru85]
Krückeberg, F.; Leisen, R.: Solving Initial Value Problems of Ordinary Differential Equations to Arbitrary Accuracy with Variable Precision Arithmetic. In [Wah85, vol. 1, pp. 111-114], 1985.
[Kru86]
Krückeberg, F.: Arbitrary Accuracy with Variable Precision Arithmetic. In [Nic85, pp. 95-101], 1985.
[Kul71]
Kulisch, U.: An axiomatic approach to rounded computations. TS Report No. 1020, Mathematics Research Center, University of Wisconsin, Madison, Wisconsin, 1969, and Numerische Mathematik 19, pp. 1-17, 1971.
[Kul76]
Kulisch, U.: Grundlagen des Numerischen Rechnens - Mathematische Begründung der Rechnerarithmetik. Reihe Informatik, Band 19, Bibliographisches Institut, Mannheim/Wien/Zürich, 1976 (ISBN 3-411-01517-9).
[Kul76a] Kulisch, U.; Bohlender, G.: Formalization and Implementation of Floating-Point Matrix Operations. Computing 16, pp. 239-261, 1976.
[Kul77]
Kulisch, U.: Ein Konzept für eine allgemeine Theorie der Rechnerarithmetik. In [Alb77, pp. 95-105], 1977.
[Kul77a] Kulisch, U.: Über die beim numerischen Rechnen mit Rechenanlagen auftretenden Räume. In [Alb77, pp. 107-119], 1977.
[Kul81]
Kulisch, U.; Miranker, W. L.: Computer Arithmetic in Theory and Practice. Academic Press, New York, 1981 (ISBN 0-12-428650-X).
[Kul82]
Kulisch, U.; Ullrich, Ch. (Eds.): Wissenschaftliches Rechnen und Programmiersprachen. Berichte des German Chapter of the ACM, Band 10, B. G. Teubner Verlag, Stuttgart, 1982 (ISBN 3-519-02429-2).
[Kul82a] Kulisch, U.: Eine neue Arithmetik für wissenschaftliches Rechnen. In [Kul82, pp. 9-28], 1982.
[Kul83]
Kulisch, U.; Miranker, W. L. (Eds.): A New Approach to Scientific Computation. Proceedings of Symposium held at IBM Research Center, Yorktown Heights, N. Y., 1982. Academic Press, New York, 1983 (ISBN 0-12-428660-7).
[Kul83a] Kulisch, U.: A New Arithmetic for Scientific Computation. In [Kul83, pp. 1-26], 1983.
[Kul83b] Kulisch, U.; Bohlender, G.: Features of a Hardware Implementation of an Optimal Arithmetic. In [Kul83, pp. 269-290], 1983.
[Kul83c] Kulisch, U.: Schaltungsanordnung und Verfahren zur Bildung von Skalarprodukten und Summen von Gleitkommazahlen mit maximaler Genauigkeit. Patentschrift DE 3144015 A1, 1983.
[Kul86]
Kulisch, U.; Miranker, W. L.: The Arithmetic of the Digital Computer: A New Approach. IBM Research Center RC 10580, pp. 1-62, 1984. SIAM Review, Vol. 28, No. 1, pp. 1-40, March 1986.
[Kul86a] Kulisch, U.; Rump, S. M.: Rechnerarithmetik und die Behandlung algebraischer Probleme. In [Buc86, pp. 213-281], 1986.
[Kul86b] Kulisch, U.: A New Arithmetic for Scientific Computing. In [Mir86, pp. 18-30], 1986.
[Kul87]
Kulisch, U. (Ed.): PASCAL-SC: A PASCAL extension for scientific computation, Information Manual and Floppy Disks, Version IBM PC/AT; Operating System DOS. B. G. Teubner Verlag (Wiley-Teubner series in computer science), Stuttgart, 1987 (ISBN 3-519-02106-4 / 0-471-91514-9).
[Kul87a] Kulisch, U. (Ed.): PASCAL-SC: A PASCAL extension for scientific computation, Information Manual and Floppy Disks, Version ATARI ST. B. G. Teubner Verlag, Stuttgart, 1987 (ISBN 3-519-02108-0).
[Kul88]
Kulisch, U.; Stetter, H. J. (Eds.): Scientific Computation with Automatic Result Verification. Computing Supplementum 6. Springer-Verlag, Wien / New York, 1988.
[Kul88a] Kulisch, U.; Stetter, H. J.: Automatic Result Verification. In [Kul88, pp. 1-6], 1988.
[Kul88b] Kulisch, U.: Fast and Accurate Arithmetic for Vector Processors. In [Syd88], 1988.
[Kul89]
Kulisch, U. (Ed.): Wissenschaftliches Rechnen mit Ergebnisverifikation - Eine Einführung. Ausgearbeitet von S. Geörg, R. Hammer und D. Ratz. Akademie Verlag, Berlin, und Vieweg Verlagsgesellschaft, Wiesbaden, 1989.
[Kul89a] Kulisch, U.: Zeitgemäße Rechnerarithmetik. In [Kul89, pp. 1-27], 1989.
[Kul89b] Kulisch, U.; Kirchner, R.: Schaltungsanordnung zur Bildung von Produktsummen in Gleitkommadarstellung, insbes. von Skalarprodukten. Patentschrift DE 3703440 C2, 1989.
[Lan85]
Lang, R.; Adams, E.: Mathematical Analysis of Dependency of Dynamic Perturbations on Constitutive Equations. In [Wah85, vol. 1, pp. 63-66], 1985.
[Law91] Lawo, Ch.: C-XSC - A Programming Environment for Verified Scientific Computing and Numerical Data Processing. Institute of Applied Mathematics, Prof. Dr. U. Kulisch, University of Karlsruhe, Postfach 6980, D-7500 Karlsruhe, Germany, 1991.
[Law92] Lawo, Ch.: C-XSC, eine objektorientierte Programmierumgebung für verifiziertes wissenschaftliches Rechnen. Dissertation, Universität Karlsruhe, 1992.
[Law92a] Lawo, Ch.: C-XSC, eine Programmierumgebung für verifiziertes Rechnen in C++. In [Her92], to appear 1992.
[Lei90]
Leisen, R.: Coupling Symbolic, Numerical and Graphical Computation to Get Verified Results for Nonlinear Problems. In [Ull90b, pp. 245-257], 1990.
[Leu82]
Leuprecht, H.; Oberaigner, W.: Parallel Algorithms for the Rounding Exact Summation of Floating Point Numbers. Computing 28, pp. 89-104, 1982.
[Lic88]
Lichter, P.: Realisierung eines VLSI-Chips für das Gleitkomma-Skalarprodukt der Kulisch-Arithmetik. Diplomarbeit, Fachbereich 10, Angewandte Mathematik und Informatik, Universität des Saarlandes, 1988.
[Lin74]
Linnainmaa, S.: Analysis of Some Known Methods of Improving the Accuracy of Floating-point Sums. BIT 14, pp. 167-202, 1974.
[Lin81]
Linnainmaa, S.: Software for Doubled-Precision Floating-point Computations. ACM Trans. Math. Soft. 7, pp. 272-283, 1981.
[Lin70]
Linz, P.: Accurate Floating-point Summation. Comm. ACM, Vol. 13, No. 6, June 1970.
[Lin90]
Linz, P.: A Posteriori Bounds for Validating Numerical Results. In [Ull90b, pp. 47-54], 1990.
[Loh84]
Lohner, R.; Adams, E.: Einschließung der Lösung gewöhnlicher Anfangs- und Randwertaufgaben. ZAMM 64, pp. T295-T297, 1984.
[Loh85]
Lohner, R.: Enclosing the Solutions of Ordinary Initial- and Boundary-Value Problems. In [Wah85, vol. 1, pp. 99-102], 1985.
[Loh86]
Lohner, R.; Wolff v. Gudenberg, J.: Complex Interval Division with Maximum Accuracy. In [ARITH, Vol. 7, pp. 332-336], 1985.
[Loh87]
Lohner, R.: Enclosing the Solutions of Ordinary Initial and Boundary Value Problems. In [Kau87, pp. 255-286], 1987.
[Loh88]
Lohner, R.: Einschließung der Lösung gewöhnlicher Anfangs- und Randwertaufgaben und Anwendungen. Dissertation, Universität Karlsruhe, 1988.
[Loh88a] Lohner, R.: Precise Evaluation of Polynomials in Several Variables. In [Kul88, pp. 139-148], 1988.
[Loh89]
Lohner, R.: Einschließungen bei Anfangs- und Randwertaufgaben gewöhnlicher Differentialgleichungen. In [Kul89, pp. 183-207], 1989.
[Loh89a] Lohner, R.: Praktikum "Einschließung bei Differentialgleichungen". In [Kul89, pp. 209-223], 1989.
[Lor71]
Lortz, B.: Eine Langzahlarithmetik mit optimaler einseitiger Rundung. Dissertation, Universität Karlsruhe, 1971.
[Lud90]
Ludyk, G.: CAE von dynamischen Systemen: Analyse, Simulation, Entwurf von Regelungssystemen. Springer-Verlag, Berlin, 1990 (ISBN 0-387-51676-X, 3-540-51676-X).
[Lut88]
Lutz, M.: Untersuchungen zur Genauigkeit geometrischer Methoden in CAD-Systemen. Dissertation, TU Berlin, 1987. Carl Hanser Verlag, München/Wien, 1988.
[Mal71]
Malcolm, M. A.: On Accurate Floating-Point Summation. Comm. ACM, Vol. 14, No. 11, Nov. 1971.
[Man90] Mannshardt, R.: Enclosing a Solution of an Ordinary Differential Equation by Sub- and Superfunctions. In [Ull90b, pp. 319-328], 1990.
[Mar89] Marggraff, H.: Objektorientierte Erweiterung von PASCAL-SC und Verfahren zur Implementierung. Diplomarbeit, Institut für Angewandte Mathematik, Universität Karlsruhe, 1989.
Markov, S. M.: Some Applications of Extended Interval Arithmetic to Interval Iterations. In [Ale80], 1980.
Markov, S. M.: Interval Differential Equations. In [Nic80], 1980.
Markov, S. M.: Least-square Approximations under Interval Input Data. In [Ull90b, pp. 133-147], 1990.
Markov, S. M.; Velitchkov, T.: HIFICOMP - a Subroutine Library for Highly Efficient Computations. In [Ull90b, pp. 411-425], 1990.
Markov, S. M.: Polynomial Interpolation of Vertical Segments in the Plane. In [Kau91, pp. 251-262], 1991.
Markov, S. M.; Krastanov, M.: MODYNA - A Program System for Mathematical Modelling with Result Verification. In [And91, pp. 130-135], 1991.
Markov, S. M.; Popova, E. D.: New Aspects of Mathematical Modelling: Curve Fitting. In [And91, pp. 49-63], 1991.
Marks, K. M.: Strukturen für Gleitkommaprozessoren zur digitalen Signalverarbeitung. Diplomarbeit, Institut für Informationstechnik, Ruhr-Universität Bochum, 1985.
Mayer, O.: Über die in der Intervallrechnung auftretenden Räume und einige Anwendungen. Dissertation, Universität Karlsruhe, 1968.
Mayer, G.: Linearisierte Theorie von Schiffswellen und Wellenwiderstand bei ungleichförmiger Anströmung. Dissertation, Universität Karlsruhe, 1982.
Mayer, G.: Reguläre Zerlegungen und der Satz von Stein und Rosenberg für Intervallmatrizen. Habilitationsschrift, Universität Karlsruhe, 1986.
Mayer, G.: Enclosing the Solutions of Linear Equations by Interval Iterative Processes. In [Kul88, pp. 47-58], 1988.
Mayer, G.: Grundbegriffe der Intervallrechnung. In [Kul89, pp. 101-117], 1989.
Mayer, G.; Frommer, A.: A Multisplitting Method for Verification and Enclosure on a Parallel Computer. In [Ull90b, pp. 483-497], 1990.
Mayer, G.: Old and New Aspects for the Interval Gaussian Algorithm. In [Kau91, pp. 329-349], 1991.
[May92] Mayer, G.: Some Remarks on Two Interval-Arithmetic Modifications of the Newton Method. Computing 48, pp. 125-128, 1992.
[Met88] Metzger, M.: Weiterentwicklung von Programmiersprachen für die Numerik unter besonderer Berücksichtigung der Rechengenauigkeit. Dissertation, Universität Karlsruhe, 1988.
[Met88a] Metzger, M.: FORTRAN-SC, A FORTRAN Extension for Engineering / Scientific Computation with Access to ACRITH: Demonstration of the Compiler and Sample Programs. In [Moo88, pp. 63-79], 1988.
[Met89]
Metzger, M.: Eine numerische Anwendung mit Ergebnisverifikation in FORTRAN-SC. ZAMM 69, pp. T49-T52, 1989.
[Met89a] Metzger, M.; Walter, W.: FORTRAN-SC: Eine FORTRAN-Erweiterung für wissenschaftliches Rechnen. In [Kul89, pp. 45-67], 1989.
[Met90]
Metzger, M.; Walter, W.: FORTRAN-SC: A Programming Language for Engineering / Scientific Computation. In [Ull90b, pp. 427-441], 1990.
[Mir83]
Miranker, W. L.: Ultra-Arithmetic: The Digital Computer Set in Function Space. In [Kul83, pp. 165-198], 1983.
[Mir86]
Miranker, W. L.; Toupin, R. A.: Accurate Scientific Computations. Symposium Bad Neuenahr, Germany, 1985. Lecture Notes in Computer Science, No. 235, Springer-Verlag, Berlin, 1986.
[Mir86a] Miranker, W. L.; Mascagni, M.; Rump, S. M.: Case Studies for Augmented Floating-Point Arithmetic in Accurate Scientific Computations. In [Mir86], 1986.
[Moe65] Møller, O.: Quasi Double Precision in Floating-point Addition. BIT 5, pp. 37-50, 1965.
[Moo66] Moore, R. E.: Interval Analysis. Prentice Hall Inc., Englewood Cliffs, N. J., 1966.
[Moo77] Moore, R. E.: A Test for Existence of Solutions for Nonlinear Systems. SIAM J. Numer. Anal. 4, 1977.
[Moo79] Moore, R. E.: Methods and Applications of Interval Analysis. SIAM, Philadelphia, Pennsylvania, 1979.
[Moo88] Moore, R. E. (Ed.): Reliability in Computing: The Role of Interval Methods in Scientific Computing. Proceedings of Conference at Columbus, Ohio, September 8-11, 1987; Perspectives in Computing 19, Academic Press, San Diego, 1988 (ISBN 0-12-505630-3).
[Moo91] Moore, R. E.: Interval Tools for Computer Aided Proofs in Analysis. In: Meyer, R. K.; Schmidt, D. S.: Computer Aided Proofs in Analysis. Springer-Verlag (The IMA Volumes in Mathematics and Its Applications, Vol. 28), New York, 1991.
Bibliography
[Mue90] Müller, M.; Rüb, Ch.; Rülling, W.: Exact Addition of Floating-Point Numbers. Report TR 05/1990, Angewandte Mathematik und Informatik, Universität des Saarlandes, 1990.
[Mue91] Müller, M.; Rüb, Ch.; Rülling, W.: Exact Accumulation of Floating-Point Numbers. In [ARITH, Vol. 10, pp. 64-69], 1991.
[Nak90] Nakao, M. T.: A Numerical Method for the Existence of Solutions for Nonlinear Boundary Value Problems. In [Ull90b, pp. 329-339], 1990.
[Nea84] Neaga, M.: Erweiterungen von Programmiersprachen für wissenschaftliches Rechnen und Erörterung einer Implementierung. Dissertation, Universität Kaiserslautern, 1984.
[Nea89] Neaga, M.: PASCAL-SC - Eine PASCAL-Erweiterung für das wissenschaftliche Rechnen. In [Kul89, pp. 69-84], 1989.
[Neu87] Neumaier, A.: Overestimation in Linear Interval Equations. SIAM J. Numer. Anal. 24(1), pp. 207-214, 1987.
[Neu88] Neumaier, A.: Rigorous Sensitivity Analysis for Parameter-Dependent Systems of Equations. To appear in J. Math. Anal. Appl., 1988.
[Neu88a] Neumaier, A.: The Enclosure of Solutions of Parameter-Dependent Systems of Equations. In [Moo88, pp. 269-286], 1988.
[Neu90] Neumaier, A.: Interval Methods for Systems of Equations. Cambridge University Press, Cambridge, 1990.
[Nic75] Nickel, K. (Ed.): Interval Mathematics. Proceedings of the International Symposium, Karlsruhe 1975, Springer-Verlag, Vienna, 1975.
[Nic80] Nickel, K. (Ed.): Interval Mathematics 1980. Proceedings of the International Symposium, Freiburg 1980, Academic Press, New York, 1980 (ISBN 0-12-518850-1).
[Nic85] Nickel, K. (Ed.): Interval Mathematics 1985. Proceedings of the International Symposium, Freiburg 1985, Springer-Verlag, Vienna, 1986.
Nickel, K. (Ed.): Freiburger Intervallberichte from 78/1 to 87/10. Institut für Angewandte Mathematik, Universität Freiburg, Hermann-Herder-Straße 10, D-7800 Freiburg, Germany, 1988.
[Obe83] Oberaigner, W.: Rundungsexakte Arithmetik. Bericht 83-02, Institut für Informatik, Univ. Innsbruck, 1983.
[Obe84] Oberaigner, W.: SIMD-Algorithms for the Rounding Exact Evaluation of Sums of Products. Proceedings WG PARS, 1984.
[Obe86] Oberaigner, W.: Parallele Verfahren für Rundungsexakte Berechnungen. Institut für Informatik, Universität Innsbruck, 1986.
[Ohs88] Ohsmann, M.: Verified Inclusion for Eigenvalues of Certain Difference and Differential Equations. In [Kul88, pp. 79-87], 1988.
[Oli91] de Oliveira, J. B. S.; Claudio, D. M.: A User Directed Approach to Finding Roots of Polynomials. In [Kau91, pp. 351-366], 1991.
[Ott87] Ottmann, Th.; Thiemt, G.; Ullrich, Ch.: Numerical Stability of Simple Geometric Algorithms in the Plane. In Börger, E. (Ed.): Computation Theory and Logic. Springer-Verlag, Berlin, 1987.
[Ott87a] Ottmann, Th.; Thiemt, G.; Ullrich, Ch.: Numerical Stability of Geometric Algorithms. Extended abstract, Proceedings of the 3rd Annual ACM Symposium on Computational Geometry, pp. 119-125, 1987.
[Ott88] Ottmann, Th.; Thiemt, G.; Ullrich, Ch.: On Arithmetical Problems of Geometric Algorithms in the Plane. In [Kul88, pp. 123-136], 1988.
[Pet91] Petković, M.; Herzberger, J.: Inclusion of Multiple Polynomial Roots in Complex Rectangular Arithmetic. In [Kau91, pp. 367-375], 1991.
[Pic72] Pichat, M.: Correction d'une somme en arithmétique à virgule flottante. Numerische Mathematik 19, pp. 400-406, 1972.
[Pic85] Pichat, M.: Reducing Abbreviation Errors in Iterative Resolution of Linear Systems. In [Wah85, vol. 1, pp. 193-195], 1985.
[Piu91] Piuri, V.; Stefanelli, R.: Use of the 3N Code for Concurrent Error Detection in Parallel Multipliers. In [Kau91, pp. 69-88], 1991.
[Plu90] Plum, M.: Verified Existence and Inclusion Results for Two-Point Boundary Value Problems. In [Ull90b, pp. 341-355], 1990.
[Plu91] Plum, M.: Computer-Assisted Existence Proofs for Two-Point Boundary Value Problems. Computing 46, pp. 19-34, 1991.
[Pop90] Popov, W.: On the Axiomatizations of Floating-point Arithmetic. In [Ull90b, pp. 55-66], 1990.
[Pri91] Priest, D. M.: Algorithms for Arbitrary Precision Floating Point Arithmetic. In [ARITH, Vol. 10, pp. 132-143], 1991.
[Ral65] Rall, L. B.: Error in Digital Computation. J. Wiley, New York, 1965.
[Ral80] Rall, L. B.: Applications of Software for Automatic Differentiation in Numerical Computation. In [Ale80, pp. 141-156], 1980.
[Ral81] Rall, L. B.: Automatic Differentiation: Techniques and Applications. Lecture Notes in Computer Science, No. 120, Springer-Verlag, Berlin, 1981.
[Ral83] Rall, L. B.: Differentiation and Generation of Taylor Coefficients in PASCAL-SC. In [Kul83, pp. 291-309], 1983.
[Ral84] Rall, L. B.: Differentiation in PASCAL-SC: Type GRADIENT. ACM TOMS 10, pp. 161-184, 1984.
[Ral85] Rall, L. B.: Computable Bounds for Solutions of Integral Equations. In [Wah85, vol. 1, pp. 119-121], 1985.
[Ral87] Rall, L. B.: Optimal Implementation of Differentiation Arithmetic. In [Kau87, pp. 287-295], 1987.
[Ral90] Rall, L. B.: Differentiation Arithmetics. In [Ull90a, pp. 73-90], 1990.
[Ral91] Rall, L. B.: Tools for Mathematical Computation. In: Meyer, R. K.; Schmidt, D. S.: Computer Aided Proofs in Analysis. Springer-Verlag (The IMA Volumes in Mathematics and Its Applications, Vol. 28), New York, 1991.
[Rat77] Ratschek, H.: Fehlererfassung mit partiellen Mengen. In [Alb77, pp. 121-128], 1977.
[Rat84] Ratschek, H.; Rokne, J.: Computer Methods for the Range of Functions. Ellis Horwood Limited, Chichester, 1984.
[Rat90] Ratz, D.: The Effects of the Arithmetic of Vector Computers on Basic Numerical Methods. In [Ull90b, pp. 499-514], 1990.
[Rei75] Reiser, J. F.; Knuth, D. E.: Evading the Drift in Floating-point Addition. Inf. Process. Lett. 3, 3, pp. 84-87, 1975.
[Rei91] Reith, R.: Verified Solution of Linear Systems on Transputer Networks. In [Kau91, pp. 377-392], 1991.
[Rex91] Rex, H.-G.: Zur Lösungseinschließung linearer Gleichungssysteme. In [Jah91, pp. 441-447], 1991.
[Ric71] Rice, J. R. (Ed.): Mathematical Software. Proceedings of Symposium held at Purdue University, April 1970, Academic Press, New York, 1971.
[Ric77] Rice, J. R. (Ed.): Mathematical Software III. Proceedings, Academic Press, New York, 1977.
[Roh92] Rohn, J.; Deif, A.: On the Range of Eigenvalues of an Interval Matrix. Computing 47, pp. 373-377, 1992.
[Rum80] Rump, S. M.: Kleine Fehlerschranken bei Matrixproblemen. Dissertation, Universität Karlsruhe, 1980.
[Rum80a] Rump, S. M.; Kaucher, E.: Small Bounds for the Solution of Linear Systems. In [Ale80], 1980.
[Rum82] Rump, S. M.: Rechnervorführung, Pakete für Standardprobleme der Numerik. In [Kul82, pp. 29-50], 1982.
[Rum82a] Rump, S. M.: Lösung linearer und nichtlinearer Gleichungssysteme mit maximaler Genauigkeit. In [Kul82, pp. 147-174], 1982.
[Rum83] Rump, S. M.: How Reliable are Results of Computers? / Wie zuverlässig sind die Ergebnisse unserer Rechenanlagen? In: Jahrbuch Überblicke Mathematik, Bibliographisches Institut, Mannheim, 1983.
[Rum83a] Rump, S. M.: Solving Algebraic Problems with High Accuracy. Habilitationsschrift. In [Kul83, pp. 51-120], 1983.
[Rum83b] Rump, S. M.: Computer Demonstration Packages for Standard Problems of Numerical Mathematics. In [Kul83, pp. 27-52], 1983.
[Rum83c] Rump, S. M.; Böhm, H.: Least Significant Bit Evaluation of Arithmetic Expressions in Single-Precision. Computing 30, pp. 189-199, 1983.
[Rum85] Rump, S. M.: Properties of a Higher Order Computer Arithmetic. In [Wah85, vol. 1, pp. 163-165], 1985.
[Rum86] Rump, S. M.: New Results on Verified Inclusions. In [Mir86, pp. 31-69], 1986.
[Rum87] Rump, S. M.: Introduction to ACRITH - Accurate Scientific Algorithms. In [Kau87, pp. 296-369], 1987.
[Rum88] Rump, S. M.: Algorithms for Verified Inclusion - Theory and Practice. In [Moo88, pp. 109-126], 1988.
[Rum89] Rump, S. M.: CALCULUS. In [Kul89, pp. 85-99], 1989.
[Rum89a] Rump, S. M.: Lineare Probleme. In [Kul89, pp. 119-127], 1989.
[Rum90] Rump, S. M.: Rigorous Sensitivity Analysis for Systems of Linear and Nonlinear Equations. Math. of Comp. 54(190), pp. 724-736, 1990.
[Rum91] Rump, S. M.: Estimation of the Sensitivity of Linear and Nonlinear Algebraic Problems. Linear Algebra and its Applications 153, pp. 1-34, 1991.
[Rum91a] Rump, S. M.: Convergence Properties of Iterations Using Sets. In [Jah91, pp. 427-431], 1991.
[Rum92] Rump, S. M.: On the Solution of Interval Linear Systems. Computing 47, pp. 337-353, 1992.
[Rus86] Ruschitzka, M. (Ed.): Computer Systems: Performance and Simulation. In collaboration with Robert Vichnevetsky. Proceedings of 11th IMACS World Congress on System Simulation and Scientific Computation, Aug. 5-9, 1985, Oslo. Preprints see [Wah85]. Elsevier Science Publishers B. V. (North-Holland), 1986.
[Sai86] Saier, R.: Entwurf eines übersetzenden Editors für PASCAL-SC. Diplomarbeit, Universität Karlsruhe, 1986.
[Sau86] Sauvagerd, U.: Simulation und Implementierung der von Kulisch vorgeschlagenen "Neuen Arithmetik". Diplomarbeit, Institut für Informationstechnik, Ruhr-Universität Bochum, 1986.
[Sch86] Schauer, U.; Toupin, R. A.: Solving Large Sparse Linear Systems with Guaranteed Accuracy. In [Mir86, pp. 142-167], 1986.
[Sch89f] Schmidt, L.: Verified Computations on Supercomputers - Part I: Arithmetic and Programming Languages. (Part II see [Sch90e]). Talk at SCAN 89 in Basel, 1989.
[Sch92b] Schmidt, L.: Semimorphe Arithmetik zur automatischen Ergebnisverifikation auf Vektorrechnern. Dissertation, Universität Karlsruhe, 1992.
[Sch88] Schroeder, J.: A Method for Producing Verified Results for Two-Point Boundary Value Problems. In [Kul88, pp. 9-22], 1988.
Bibliography [SchSO] [SchSOa] [Sch89] [SchSOb] [SchSl]
[Sch87] [Sch89a] [Sch89b] [Sch89c]
601
Schroeder, J.: Numerical Algorithmsfor Ezistence Proofs and Error Estimatesfor Two-Point Boundary Value Problems. In [U1190a, pp. 247-2681, 1990. Schroeder, J.: Verification of Polynomial Roots b y Closed Coupling of Computer Algebra and Self- Validating Numerical Methods. In [U1190b, pp. 259-2691, 1990. Schulte, U.; Kdle, M.; Adams, E.: Dynamische Zahnkrufle. ZAMM 69, pp. T350-T352, 1989. Schulte, U.: Modellierung von Getriebeschwingungen mit Hilfe von Einschliejlungsverfahren. ZAMM 70, pp. T168-T170, 1990. Schulte, U.: Einschliejlungsverfahren zur Bewertung von Getriebeschwingungsmodellen. Dissertation, Universitit Karlsruhe, 1991; Fortschrittsberichte VDI, Reihe 11, Schwingungstechnik, Nummer 146, 1991. Schumacher, G.: Computer Arithmetic and Ill-Conditioned Algebraic Problems. In [ARITH, Vol. 8, pp. 270-2761, 1987. Schumacher, G.: Losung nichtlinearer Gleichungen mat Verifikation des Ergebnisses. In [Ku189, p p . 137-1541, 1989. Schumacher, G.: Einschliepung der Losung von linearen Gleichungssystemen auf i'cktorrechnern. In [Ku189, pp. 239-2491, 1989. Schumacher, G.; WoiE von Gudenberg, J.: Highly Accurate Numerical Algorithms. In [U1189, p p . 1-58!, 1989.
[Sch89d] Schumacher, G.: Solving Nonlinear Equations with Verification of Results. In [Ull89, pp. 203-234], 1989.
[Sch89e] Schumacher, G.: Genauigkeitsfragen bei algebraisch-numerischen Algorithmen auf Skalar- und Vektorrechnern. Dissertation, Universität Karlsruhe, 1989.
[Sch90c] Schumacher, G.: Automated Error Control on Vector Computers. In: Rieder, U. (Ed.): Methods of Operations Research 62. Proceedings XIVth Symposium on Operations Research 1989, Anton Hain Verlag, pp. 475-479, 1990.
[Sch90d] Schumacher, G.; Wolff von Gudenberg, J.: E-Methods for Improving Accuracy. In: Wallis, P. J. L. (Ed.): Improving Floating-point Programming. John Wiley & Sons, pp. 169-176, 1990.
[Sch90e] Schumacher, G.: Verified Computations on Supercomputers - Part II: Error Control in Matrix Problems. (Part I see [Sch89f]). In [Ull90b, pp. 515-521], 1990.
[Sch91a] Schumacher, G.: Auswertung von Formeln und Nullstellenbestimmung auf Rechenanlagen. In Chatterji, S. D.; Kulisch, U.; Laugwitz, D.; Liedl, R.; Purkert, W. (Eds.): Jahrbuch Überblicke Mathematik 1991, Vieweg, Braunschweig, pp. 47-59, 1991.
[Sch92] Schumacher, G.; Siebert, S.: Highly Accurate Numerical Methods to Solve Simulation Problems in Electrical Engineering. Proceedings of the 6th Annual Conference of the European Consortium for Mathematics in Industry, Kluwer Academic Publishers, to appear 1992.
[Shi89] Shiriaev, D. V.: Implementation of Arbitrary Precision Interval Arithmetic in C. (In Russian) Proc. 1st Sov.-Bulg. Seminar on Numerical Processing, Oct. 19-24, 1987, Program Systems Institute of the USSR Academy of Sciences, Pereslavl-Zalessky, pp. 140-146 (1989); deposited in VINITI 21.04.89, 2634-B89, 1989.
Shiriaev, D. V.: On the Memory Efficient Differentiation. ZAMM 72, pp. 632-634, 1992.
Shiriaev, D. V.: Reduction of Spatial Complexity in Reverse Automatic Differentiation by Means of Inverted Code. In [Her92], to appear 1992.
Skelboe, S.: Computation of Rational Interval Functions. BIT 14, pp. 87-95, 1974.
Spaniol, O.: Arithmetik in Rechenanlagen. B. G. Teubner Verlag, Stuttgart, 1976.
Spaniol, O.: Computer Arithmetic: Logic and Design. J. Wiley, Chichester, 1981.
Spreuer, H.; Adams, E.; Holzmüller, A.: Enclosure Methods of Output Distributions for Ordinary Differential Equations with Stochastic Input. In [Kau87, pp. 370-390], 1987.
[Ste74] Sterbenz, P.: Floating-Point Computation. Prentice-Hall, Englewood Cliffs, New Jersey, 1974.
[Ste84] Stetter, H. J.: Sequential Defect Correction for High-Accuracy Floating-Point Algorithms. Lecture Notes in Mathematics, Vol. 1006, pp. 186-202, Springer-Verlag, 1984.
[Ste88] Stetter, H. J.: Inclusion Algorithms with Functions as Data. In [Kul88, pp. 213-224], 1988.
[Ste89] Stetter, H. J.: Staggered Correction Representation, a Feasible Approach to Dynamic Precision. In: Proceedings of the Symposium on Scientific Software, edited by Cai, Fosdick, Huang, China University of Science and Technology Press, Beijing, China, 1989.
[Ste90] Stetter, H. J.: Validated Solution of Initial Value Problems for ODE. In [Ull90a, pp. 171-187], 1990.
[Stu86] Stummel, F.: Strict Optimal Error and Residual Estimates for the Solution of Linear Algebraic Systems by Elimination Methods in High-Accuracy Arithmetic. In [Mir86, pp. 119-141], 1986.
[Sue86] Suess, W.: Das genaue Skalarprodukt im IEEE-Zahlformat. Diplomarbeit, Institut für Angewandte Mathematik, Universität Karlsruhe, 1986.
[Syd88] Sydow, A. (Ed.): Third International Symposium on Systems Analysis and Simulation. Berlin, Humboldt University, September 12-16, 1988.
[Teu84] Teufel, T.: Ein optimaler Gleitkommaprozessor. Dissertation, Universität Karlsruhe, 1984.
[Teu86] Teufel, T.; Bohlender, G.: A Bit-Slice Processor Unit for Optimal Arithmetic. In [Wah85, vol. 1, pp. 151-154]; [Rus86, pp. 325-329], 1986.
[Thi85] Thieler, P.: Technical Calculations by Means of Interval Mathematics. In [Nic85, pp. 197-208], 1985.
[Tol85] Tolla, P.: Optimal Termination Criterion and Accuracy Tests in Mathematical Programming. In [Wah85, vol. 1, pp. 197-199], 1985.
[Ull72] Ullrich, Ch.: Rundungsinvariante Strukturen mit äußeren Verknüpfungen. Dissertation, Universität Karlsruhe, 1972.
[Ull77] Ullrich, Ch.: Zum Begriff des Rasters und der minimalen Menge. In [Alb77, pp. 129-134], 1977.
[Ull77a] Ullrich, Ch.: Zur Konstruktion komplexer Kreisarithmetiken. In [Alb77, pp. 135-150], 1977.
[Ull82] Ullrich, Ch.: FORTRAN-Erweiterung für wissenschaftliches Rechnen. In [Kul82, pp. 51-70], 1982.
[Ull83] Ullrich, Ch.: A FORTRAN Extension for Scientific Computation. In [Kul83, pp. 199-223], 1983.
[Ull84] Ullrich, Ch.: Rechnerarithmetik und die Weiterentwicklung von FORTRAN. Elektronische Rechenanlagen 26, pp. 71-78, 1984.
[Ull84a] Ullrich, Ch.: Trends in FORTRAN. Überblicke Informationsverarbeitung, Bibliographisches Institut, Mannheim, pp. 321-349, 1984.
[Ull85] Ullrich, Ch.: A FORTRAN 8x Application Module for Optimal Arithmetic. In [Wah85, vol. 1, pp. 175-178], 1985.
[Ull87] Ullrich, Ch.: Computer Arithmetic in Higher Programming Languages. In [Kau87, pp. 391-424], 1987.
[Ull89] Ullrich, Ch.; Wolff v. Gudenberg, J. (Eds.): Accurate Numerical Algorithms, A Collection of DIAMOND Research Papers. Springer-Verlag, Berlin, 1989.
[Ull90a] Ullrich, Ch. (Ed.): Computer Arithmetic and Self-Validating Numerical Methods. (Proceedings of SCAN-89, invited papers). Academic Press, San Diego, 1990.
[Ull90b] Ullrich, Ch. (Ed.): Contributions to Computer Arithmetic and Self-Validating Numerical Methods. (Proceedings of SCAN-89, submitted papers). IMACS Annals on Computing and Applied Mathematics, Vol. 7, J. C. Baltzer AG, Basel, 1990.
[Ull90c] Ullrich, Ch.: Programming Languages for Enclosure Methods. In [Ull90a, pp. 115-136], 1990.
[Van91] van de Locht, A.: Untersuchung verschiedener Parallelisierungskonzepte beim exakten Skalarprodukt. Diplomarbeit, Institut für Angewandte Mathematik, Universität Karlsruhe, 1991.
[Vel90] Velitchkov, T.; Cohen, R.; Stoyanov, P.: HIFICOMP: Basic Computer Arithmetic Operations. In [Ull90b, pp. 443-457], 1990.
[Vig87] Vignes, J.: La méthode CESTAC: Contrôle et Estimation Stochastique des Arrondis de Calcul. Micro-Bulletin 23, pp. 8-22, Centre National de la Recherche Scientifique, Paris, 1987.
[Wah85] Wahlström, B.; Henriksen, R.; Sundby, N. P. (Eds.): Proceedings of 11th IMACS World Congress on System Simulation and Scientific Computation. 5 volumes, Aug. 5-9, 1985, Oslo. Proceedings published in [Rus86].
[Wal89] Wallis, P. J. L. (Ed.): Improving Floating-Point Programming. J. Wiley, Chichester, 1990 (ISBN 0 471 92437 7).
[Wal88] Walter, W.: FORTRAN-SC, A FORTRAN Extension for Engineering / Scientific Computation with Access to ACRITH: Language Description with Examples. In [Moo88, pp. 43-62], 1988.
[Wal89a] Walter, W.: Einführung in die wissenschaftlich-technische Programmiersprache FORTRAN-SC. ZAMM 69, 4, T52-T54, 1989.
[Wal89b] Walter, W.: FORTRAN-SC: A FORTRAN Extension for Engineering / Scientific Computation with Access to ACRITH, Language Reference and User's Guide. 2nd ed., pp. 1-396, IBM Deutschland GmbH, Stuttgart, Jan. 1989.
[Wal90] Walter, W.: FORTRAN 66, 77, 88, -SC ...: Ein Vergleich der numerischen Eigenschaften von FORTRAN 88 und FORTRAN-SC. ZAMM 70, 6, T584-T587, 1990.
[Wal90a] Walter, W. V.: Flexible Precision Control and Dynamic Data Structures for Programming Mathematical and Numerical Algorithms. Dissertation, Universität Karlsruhe, 1990.
[Wal91] Walter, W. V.: Fortran 90 - Was bringt der neue Fortran-Standard für das numerische Programmieren? In Chatterji, S. D.; Kulisch, U.; Laugwitz, D.; Liedl, R.; Purkert, W. (Eds.): Jahrbuch Überblicke Mathematik 1991, pp. 151-175, Vieweg, Braunschweig, 1991.
[Wei84] Weissinger, J.: Numerische Mathematik auf Personal Computern. Bibliographisches Institut, Mannheim, 1984.
[Wei87] Weissinger, J.: An Enclosure Method for Differential Equations. In [Kau87, pp. 425-453], 1987.
[Wei88] Weissinger, J.: A Kind of Difference Method for Enclosing Solutions of Ordinary Linear Boundary Value Problems. In [Kul88, pp. 23-31], 1988.
[Wei90] Weissinger, J.: Spärlich besetzte Gleichungssysteme. Eine Einführung mit Basic- und Pascal-Programmen. Bibliographisches Institut, Mannheim, 1990.
[Wec90] Weckherlin, A.: Untersuchung und Implementierung paralleler, genauer Grundroutinen der Linearen Algebra. Diplomarbeit, Institut für Angewandte Mathematik, Universität Karlsruhe, 1990.
[Wil64] Wilkinson, J.: Rounding Errors in Algebraic Processes. Prentice-Hall, Englewood Cliffs, New Jersey, 1964.
[Wil90] Williamson, R. C.: Interval Arithmetic and Probabilistic Arithmetic. In [Ull90b, pp. 67-80], 1990.
[Win85] Winter, Th.: Ein VLSI-Chip für Gleitkomma-Skalarprodukt mit maximaler Genauigkeit. Diplomarbeit, Fachbereich 10, Angewandte Mathematik und Informatik, Universität des Saarlandes, 1985.
[Win88] Winkler, S.: Auswertung von Programmteilen mit maximaler Genauigkeit. Diplomarbeit, Institut für Angewandte Mathematik, Universität Karlsruhe, 1988.
[Wip67] Wippermann, H.-W.: Realisierung einer Intervallarithmetik in einem ALGOL-60 System. Elektronische Rechenanlagen 9, pp. 224-233, 1967.
[Wip68] Wippermann, H.-W.: Implementierung eines ALGOL-60 Systems mit Schrankenzahlen. Elektronische Datenverarbeitung 10, pp. 189-194, 1968.
[Wip79] Wippermann, H.-W. (Ed.): PASCAL, 2. Tagung in Kaiserslautern. Berichte des German Chapter of the ACM, Nr. 1. B. G. Teubner Verlag, Stuttgart, 1979.
[Wol64] Wolfe, J. M.: Reducing Truncation Errors by Programming. Comm. ACM, Vol. 7, No. 6, June 1964.
[Wol80] Wolff v. Gudenberg, J.: Einbettung allgemeiner Rechnerarithmetik in Pascal mittels eines Operatorkonzeptes und Implementierung der Standardfunktionen mit optimaler Genauigkeit. Dissertation, Universität Karlsruhe, 1980.
[Wol82] Wolff v. Gudenberg, J.: PASCAL-Erweiterung für wissenschaftliches Rechnen. In [Kul82, pp. 71-94], 1982.
[Wol82a] Wolff v. Gudenberg, J.: Syntax und Semantik der vorliegenden Implementierung der PASCAL-Spracherweiterung. In [Kul82, pp. 207-231], 1982.
[Wol83] Wolff v. Gudenberg, J.: An Introduction to MATRIX PASCAL: A PASCAL Extension for Scientific Computation. In [Kul83, pp. 225-246], 1983.
[Wol85] Wolff v. Gudenberg, J.: Standard Functions with Maximum Accuracy Computed in Single-Precision Arithmetic. In [Wah85, vol. 1, pp. 171-173], 1985.
[Wol88] Wolff von Gudenberg, J.: Arithmetische und programmiersprachliche Werkzeuge für die Numerik. Habilitationsschrift, Universität Karlsruhe, 1988.
[Wol88a] Wolff v. Gudenberg, J.: Reliable Expression Evaluation in PASCAL-SC. In [Moo88, pp. 81-97], 1988.
[Wol89] Wolff v. Gudenberg, J.: Esprit-Projekt DIAMOND. In [Kul89, pp. 251-259], 1989.
[Wol90] Wolff v. Gudenberg, J.: A Symbolic Generic Expression Concept. In [Ull90b, pp. 459-464], 1990.
[Wol91] Wolff v. Gudenberg, J.: Object-Oriented Concepts for Scientific Computation. In [Kau91, pp. 181-192], 1991.
[Wol91a] Wolff v. Gudenberg, J.: Grundroutinen für sichere numerische Algorithmen auf Parallelrechnern. In [Jah91, pp. 421-426], 1991.
[Yak92a] Yakovlev, A. G.; Kearfott, R. B. (Eds.): Bibliography of Soviet Works on Interval Computations. E-mail: [email protected] from misc. Version of May 20, 1992.
[Yak92b] Yakovlev, A. G. (Ed.): Interval Computations on Electronic Computers and Related Subjects. Literature list, partly in Russian. Russia, 109004, Moscow, Nizhnyaya Radischevskaya 10, Moscow Institute of New Technologies in Education. June 22, 1992.
[Yil89] Yilmaz, T.; Theeuwen, J. F. M.; Tangelder, R. J. W. T.; Jess, J. A. G.: The Design of a Chip for Scientific Computation. Eindhoven University of Technology, 1989 and Euro Asic, Grenoble, Jan. 25-27, 1989.
[Zju89] Zjuzin, V. S.; Ermakov, O. B.; Zakharov, A. V. (Eds.): Conference on Interval Mathematics. In Russian. Saratov, May 23-25, 1989.
Publications by Institutes and Manufacturers
[ARITH] Institute of Electrical and Electronics Engineers: Proceedings of x-th Symposium on Computer Arithmetic ARITH. IEEE Computer Society Press. IEEE Service Center, 445 Hoes Lane, P.O. Box 1331, Piscataway, NJ 08855-1331, New York. Editors of proceedings; place of conference; date of conference.
1. Shively, R. R.; Minneapolis; June 16, 1969.
2. Garner, H. L.; Atkins, D. E.; Univ. Maryland, College Park; May 15-16, 1972.
3. Rao, T. R. N.; Matula, D. W.; SMU, Dallas; Nov. 19-20, 1975.
4. Avizienis, A.; Ercegovac, M.D.; UCLA, Los Angeles; Oct. 25-27, 1978.
5. Trivedi, K. S.; Atkins, D. E.; Univ. Michigan, Ann Arbor; May 18-19, 1981.
6. Rao, T. R. N.; Kornerup, P.; Univ. Aarhus, Denmark; June 20-22, 1983.
7. Hwang, K.; Univ. Illinois, Urbana; June 4-6, 1985.
8. Irwin, M. J.; Stefanelli, R.; Como, Italy; May 19-21, 1987.
9. Ercegovac, M.; Swartzlander, E.; Santa Monica; Sept. 6-8, 1989.
10. Kornerup, P.; Matula, D.; Grenoble, France; June 26-28, 1991.
[BSI82] British Standards Institute: Specification for Computer Programming Language PASCAL. See also [IEEE83, Jen85]. BS 6192:1982, UDC 681.3.06, PASCAL:519.682, London, 1982.
[IAM80] IAM: PASCAL-XR: PASCAL for extended Real arithmetic. Joint research project with Nixdorf Computer AG. Institute of Applied Mathematics, University of Karlsruhe, P.O. Box 6980, D-7500 Karlsruhe, Germany, 1980.
[IAM88] IAM: FORTRAN-SC: A FORTRAN Extension for Engineering / Scientific Computation with Access to ACRITH. Institute of Applied Mathematics, University of Karlsruhe, P.O. Box 6980, D-7500 Karlsruhe, Germany, Jan. 1989. 1. Language Reference and User's Guide, 2nd edition. 2. General Information Notes and Sample Programs.
[IAM89] IAM: PASCAL-SC: PASCAL for Scientific Computation. Institute of Applied Mathematics, University of Karlsruhe, P.O. Box 6980, D-7500 Karlsruhe, Germany, 1989.
[IBM84] IBM: IBM System/370 RPQ. High Accuracy Arithmetic. SA 22-7093-0, IBM Deutschland GmbH (Department 3282, Schönaicher Strasse 220, 7030 Böblingen), 1984.
[IBM86] IBM: IBM High-Accuracy Arithmetic Subroutine Library (ACRITH). IBM Deutschland GmbH (Department 3282, Schönaicher Strasse 220, 7030 Böblingen), 3rd edition, 1986. 1. General Information Manual. GC 33-6163-02. 2. Program Description and User's Guide. SC 33-6164-02. 3. Reference Summary. GX 33-9009-02.
[IBM86a] IBM: Verfahren und Schaltungsanordnung zur Addition von Gleitkommazahlen. Europäische Patentanmeldung, EP 0 265 555 A1, 1986.
[IBM90] IBM: ACRITH-XSC: IBM High Accuracy Arithmetic - Extended Scientific Computation. Version 1, Release 1. IBM Deutschland GmbH (Department 3282, Schönaicher Strasse 220, 7030 Böblingen), 1990. 1. General Information, GC33-6461-01.
2. Reference, SC33-6462-00. 3. Sample Programs, SC33-6463-00. 4. How To Use, SC33-6464-00. 5. Syntax Diagrams, SC33-6466-00.
[IEEE83] American National Standards Institute / Institute of Electrical and Electronics Engineers: IEEE Standard Pascal Computer Programming Language. ANSI/IEEE Std. 770 X3.97-1983, New York, 1983; J. Wiley & Sons Inc., 1983.
[IEEE85] American National Standards Institute / Institute of Electrical and Electronics Engineers: A Standard for Binary Floating-point Arithmetic. ANSI/IEEE Std. 754-1985, New York, 1985 (reprinted in SIGPLAN 22, 2, pp. 9-25, 1987).
[IEEE87] American National Standards Institute / Institute of Electrical and Electronics Engineers: A Standard for Radix-Independent Floating-point Arithmetic. ANSI/IEEE Std. 854-1987, New York, 1987.
[IMACS89] IMACS; GAMM: IMACS-GAMM Resolution on Computer Arithmetic. In Mathematics and Computers in Simulation 31, pp. 297-298, 1989. In [Ull90a, pp. 301-302], 1990. In [Ull90b, pp. 523-524], 1990. In [Kau91, pp. 477-478], 1991.
[IMACS91] IMACS-GAMM Working Group on Enhanced Computer Arithmetic: Proposal for Accurate Floating-point Vector Arithmetic. Institute of Applied Mathematics, University of Karlsruhe, P.O. Box 6980, D-7500 Karlsruhe, Germany, 1991; in this volume.
[KWS85] kws: PASCAL-SC: A PASCAL extension for scientific computation; information manual and floppy disks; version kws SAM 68000. kws Computersysteme GmbH, Rheinstrasse 104, D-7505 Ettlingen, 1985.
[NUM91] Numerik Software GmbH: PASCAL-XSC: A PASCAL Extension for Scientific Computation. User's Guide. Numerik Software GmbH, P.O. Box 2232, D-W-7570 Baden-Baden, Germany, 1991.
[RRZN84] RRZN: Beiträge zur Rechnerarithmetik. Bericht 38, Regionales Rechenzentrum für Niedersachsen / Universität Hannover, Wunstorfer Strasse 14, 3000 Hannover 91, März 1984.
[SIE86] SIEMENS: ARITHMOS (BS 2000) Unterprogrammbibliothek für Hochpräzisionsarithmetik. Kurzbeschreibung, Tabellenheft, Benutzerhandbuch. SIEMENS AG, Bereich Datentechnik, Postfach 83 09 51, D-8000 München 83. Bestellnummer U2900-J-Z87-1, Sept. 1986.
Index

Centered form 127
Classes 71
Computational accuracy 410; chaos 284, 429
Computer arithmetic 46, 60, 571
Condition number 133
Control theory 357
Convergence 336, 341, 343, 532
Accumulation 87
Accumulator object 95, 98
Accurate dot product 90, 92; expressions 30; matrix operations 96; vector operations 92, 96
ACRITH 527; -Library 406; -XSC 5
Applications 8
Arithmetic Geometric Mean 329
Arithmetic standard 2
Asymptotic stability 357
Automatic 2D-Taylor expansion 261
Automatic differentiation 105, 107; in one dimension 150; in two dimensions 153, 207
Decimal sequence 163, 205
Degenerate kernel 256; systems 266
Degree of stability 378
Difference approximation 426; solution 426
Differential equations 60
Directed rounding 227
Diverting difference approximation 288, 294, 431, 514, 521
Dotprecision 77, 310
Dot product 90, 92; expression 56, 310
Double Romberg extrapolation 211
Dynamic arrays 27, 51, 73; multiple-precision arithmetic 78
Dynamical chaos 284, 429
Band structure 527
Bifurcation 495
Binary digits of pi 353
Bisection algorithm 539
BLAS package 89
Bloch's theorem 528
Boundary conditions 400; value problem 443
Bounds for spectral radius 370; of eigenvalues 390
Bulirsch sequence 161, 203
Burgers equation 519
E-Method 404
Eigenvalue problem 536
Electronic tracking system 360
Elliptic integrals 345; problem 242, 244, 245
Enclosure 1, 228; conditions 228, 229, 232, 243, 247
C++ 5, 71
C-XSC 5
Cancellation 88
Carry propagation 559; resolution 559
F a s t automatic differentiation 113 Feedback controller 362 Finite difference method 402 First principles calculations 527 Floating-point arithmetic 49, 59, 60, 87 unit 549 Floquet theory 444, 470 FORTRAN 90 5 , 4 5 FORTRAN-SC 398 Forward mode 107 Fredholm integral equation 231 Function space 6 Functional arithmetic 259 screen 259
G e a r drive vibrations 466 Global discretization error 288, 435 Green’s function 531 Grid generation 402
H a r d w a r e 301 arithmetic 557, 571 kernel 549 Hartmann number 399,412 Hyperbolic problem 242, 243, 244 I E E E standards 11, 89,97, 754,854 Inclusion 1 Initial boundary value problem 502, 517 Initial value problem 285, 425 Inner product 553 Instability theorem 489
Interval arithmetic 47; extension 111; function screen 260, 276; functoid arithmetic 259; matrix 357; methods 572; partitioning 371; polynomial 61; screen 226; slope 120, 127
Inverse monotonicity 425, 430, 501, 519
KKR method 530
Lagrange remainder 261
Layers: Hartmann layer 401, 412; inner side layer 401, 412; outer side layer 401, 412
Linear Fredholm integral equation 256
Literature lists 573
Local discretization error 288, 434
Logistic (difference) equation 314, 436
Lohner's enclosure algorithm 285, 425
Long accumulator 302, 559
Lorenz equations 445
Matrix elements 541; operations 96
Mean value form 315
Metal 544
Method of Archimedes 331
MHD flow 399
Module concept 25
Möbius transformation 365
Muffin-tin approximation 536
Multiple-precision 302; arithmetic 325
Natural logarithm 347
Neumann series 258
Newton iteration 326
Nonlinear Fredholm integral equation 271, 281
Numerical compatibility 81; integration in higher dimensions 223, in one dimension 143, in two dimensions 187; library 80
Object-oriented 71
Operator concept 21; notation 4, 9
Operators, predefined 18
Optimal scalar product 3
Ordinary differential equation 285, 363, 430
Overestimate 286, 371
Overloading 23
Parallel programming 572
Partial differential equation 502, 517
PASCAL-SC 4, 398
PASCAL-XSC 5, 15, 302, 310, 325, 572
Periodic solution 284, 439, 447, 456, 511
Perturbation-sensitive neighborhood 289, 424
Pi 331
Pipeline 563
Problem solving libraries 573; routines 34
Programming language 45, 573
Range of values 127
Reliability 424
Restricted three body problem 293, 510
Reverse mode 111, 113
Romberg
   sequence 156, 201
   extrapolation 146, 189
   integration 146, 189
Rounding
   control 47
   error 408
      analysis 129
Sample program 35
Schrödinger equation 530
Self-consistency problem 538
Semimorphism 11
Sensitivity 407
Shadowing theorem 436
Shooting method 443
Single Romberg extrapolation 191
Spurious (ghost) difference solution 427, 492, 520
Square root 326
Stable manifold 432
Staggered
   arithmetic 7
   correction format 302
Standard functions 572
Standard modules 33
State vector coefficients 536
Step size control 283
Strange attractor 432
Streams 79
Subarrays 75
Supercomputers 530, 572
System
   of interval equations 404
   of integral equations 265, 280
      with general kernels 269
Taylor coefficient 150, 207
Taylor series ansatz 175
Traditional computer arithmetic 572
Trapezoidal
   rule 189
   sum 146
Unbounded kernel 237, 240
Unstable manifold 432
User-defined operator 54
Validating computations 9
Validation with constraints 240, 241, 246
Vector operations 92, 96
Verification methods 572
Verified result 546
Volterra integral equation 228, 241, 247
Wall conduction ratio 400, 412
Weierstraß' approximation theorem 258
Wrapping effect 286, 371
Mathematics in Science and Engineering
Edited by William F. Ames, Georgia Institute of Technology
Recent titles
Anthony V. Fiacco, Introduction to Sensitivity and Stability Analysis in Nonlinear Programming
Hans Blomberg and Raimo Ylinen, Algebraic Theory for Multivariable Linear Systems
T. A. Burton, Volterra Integral and Differential Equations
C. J. Harris and J. M. E. Valença, The Stability of Input-Output Dynamical Systems
George Adomian, Stochastic Systems
John O'Reilly, Observers for Linear Systems
Ram P. Kanwal, Generalized Functions: Theory and Technique
Marc Mangel, Decision and Control in Uncertain Resource Systems
K. L. Teo and Z. S. Wu, Computational Methods for Optimizing Distributed Systems
Yoshimasa Matsuno, Bilinear Transformation Method
John L. Casti, Nonlinear System Theory
Yoshikazu Sawaragi, Hirotaka Nakayama, and Tetsuzo Tanino, Theory of Multiobjective Optimization
Edward J. Haug, Kyung K. Choi, and Vadim Komkov, Design Sensitivity Analysis of Structural Systems
T. A. Burton, Stability and Periodic Solutions of Ordinary and Functional Differential Equations
Yaakov Bar-Shalom and Thomas E. Fortmann, Tracking and Data Association
V. B. Kolmanovskii and V. R. Nosov, Stability of Functional Differential Equations
V. Lakshmikantham and D. Trigiante, Theory of Difference Equations: Applications to Numerical Analysis
B. D. Vujanovic and S. E. Jones, Variational Methods in Nonconservative Phenomena
C. Rogers and W. F. Ames, Nonlinear Boundary Value Problems in Science and Engineering
Dragoslav D. Šiljak, Decentralized Control of Complex Systems
W. F. Ames and C. Rogers, Nonlinear Equations in the Applied Sciences
Christer Bennewitz, Differential Equations and Mathematical Physics
Josip E. Pečarić, Frank Proschan, and Y. L. Tong, Convex Functions, Partial Orderings, and Statistical Applications
E. N. Chukwu, Stability and Time-Optimal Control of Hereditary Systems
E. Adams and U. Kulisch, Scientific Computing with Automatic Result Verification
Viorel Barbu, Analysis and Control of Nonlinear Infinite Dimensional Systems