ELEMENTARY NUMERICAL METHODS AND C++

Melvin R. Corley, Ph.D., P.E.
Louisiana Tech University
Ruston, LA 71272
[email protected]

November 2007 Edition
© 2000-2007 by Melvin R. Corley. All Rights Reserved.
Table of Contents

Computers for Engineers and Scientists  1
   Introduction  1
   Computers in Science and Engineering  1
   The Purpose of Computing  2
   Quick Start: Your First C++ Program  3
   Glossary  5
   Chapter 1 Problems  5

C++ Fundamentals: Constants  6
   Representing Data  6
   Integers  6
   Floating Point Numbers  8
   Characters  10
   Boolean  11
   Glossary  11
   Chapter 2 Problems  12

C++ Fundamentals: Variables  13
   Variable Names  13
   Arrays  14
      One-Dimensional Arrays  14
      Multi-Dimensional Arrays  15
      Strings  16
   Glossary  17
   Chapter 3 Problems  17

Fundamentals of C++: Basic Input and Output  18
   Standard Streams  18
      The cout stream object  19
      The cin stream object  20
      The cerr stream object  21
      Standard stream redirection  21
   User-Defined Stream Objects  22
   Glossary  22
   Chapter 4 Problems  22

Fundamentals of C++: Programming Actions  24
   Programming Actions  24
   Sequence  24
      Expressions  25
   Selection  27
      When Additional Work Is To Be Done  27
      Two Mutually Exclusive, Collectively Exhaustive Alternatives  28
      Three Or More Mutually Exclusive Alternatives  28
      Three Or More Mutually Exclusive, Collectively Exhaustive Alternatives  28
      Special Case: Exact Matches Of Integral Options  29
   Repetition  30
      When The Number Of Iterations Is Known In Advance  30
      When The Number Of Iterations Is Not Known In Advance  31
         Pre-Test  32
         Post-Test  33
   Glossary  34
   Chapter 5 Problems  34

Fundamentals of C++: Functions  36
   Using Functions  36
      Eliminating Repetitive Code Sequences  37
      Algorithmic Encapsulation  38
   Passing Arguments  38
      By Value  38
      By Reference  39
   Glossary  39
   Chapter 6 Problems  39

Some Numerical Programming Fundamentals  45
   Introduction  45
   Errors in Numerical Computations  45
      Truncation Error  46
      Roundoff Error  47
      Calculating Machine Epsilon  48
   Glossary  49
   Chapter 7 Problems  50

Zeros of Functions  52
   Introduction  52
   Incremental Search  53
   Bisection  55
   Regula Falsi  57
   Improved Regula Falsi  59
   Newton-Raphson Iteration  62
   Glossary  65
   Chapter 8 Problems  65

Using C++ Classes in Numerical Methods  68
   Introduction  68
   “Thinking” Objects  69
   Member Data  69
   Member Functions  72
      Constructors  72
      Destructors  75
      Actions  75
      “Set”/“Get” Member Functions  76
      Overloaded Operators  78
   The ENM Classes  82
      Vector and Matrix classes  83
      Miscellaneous ENM classes  84
   Glossary  84
   Chapter 9 Problems  84

Simultaneous Algebraic Equations  86
   Introduction  86
   Linear Systems  86
      Diagonal Systems  87
      Triangular Systems  88
      Gauss Elimination  89
         Partial Pivoting  92
         Full Pivoting  92
      Tridiagonal Systems  98
      LU Decomposition  102
         Crout’s Method  103
         Doolittle’s Method  109
      Gauss-Seidel Iteration  109
   Nonlinear Systems  111
   Glossary  114
   Chapter 10 Problems  115

Curve Fitting  119
   Introduction  119
   Polynomials  119
      Polynomial Evaluation Using Horner’s Method  120
      Lagrange Interpolation  125
   Splines  128
   Least-Squares Approximation  130
   Glossary  137
   Chapter 11 Problems  137

Numerical Quadrature  139
   Introduction  139
   Newton-Cotes Formulas  140
      Trapezoidal Rule  140
      Simpson’s One-Third Rule  142
      Simpson’s Three-Eighths Rule  145
   Romberg Integration  149
   Glossary  152
   Chapter 12 Problems  152

Ordinary Differential Equations  154
   Introduction  154
   First Order ODEs  155
   Working With Higher Order ODEs  155
   Initial Condition Problems  158
      Taylor Series Expansion  158
      Euler’s Method  159
      Runge-Kutta Methods  160
         Second Order Runge-Kutta Methods  161
         Fourth Order Runge-Kutta Methods  164
         Sets of Simultaneous ODEs  164
         RKSUITE  166
   Boundary Value Problems  170
   Glossary  175
   Chapter 13 Problems  175

Bibliography  178
Table of C++ Operators  179
CHAPTER 1
Computers for Engineers and Scientists

CHAPTER OUTLINE
1.1 Introduction
1.2 Computers in Science and Engineering
1.3 The Purpose of Computing
1.4 Quick Start: Your First C++ Program
1.1  INTRODUCTION
Digital computers have reached near-commodity status in many societies around the world. While not yet viewed in exactly the same way as other common appliances such as the television and the telephone, computers have affected our lives in even more pervasive ways. For example, computers now run our automobiles and transportation systems, our microwave ovens and household appliances, our audio/video entertainment systems, virtually all of our medical instrumentation, as well as most factories. So, while computers may not be as visible as other appliances, they are there, making the systems around us work. These computers that surround us take on many shapes, sizes, and functions. And though they may be radically different in appearance and function, they all share some similar properties. For example, they all do their required functions by executing programs stored in some type of memory device. For the most part, these programs are conceived and developed by human programmers. The effectiveness of a computer being used in a particular application is strongly dependent on the quality of the program that it is executing. And the more we, as users of these miraculous devices, are aware of their internal operation and limitations, the more effectively we will be able to harness their largely untapped power to serve us in applications yet unseen.
1.2  COMPUTERS IN SCIENCE AND ENGINEERING
Practitioners in the fields of science and engineering were among the early adopters of computer
technology. For example, the pioneering work of Raimondi and Boyd[1] led to the solution of the differential equations for fluid flow in the hydrodynamic bearing. This remarkable work was conducted in the early days of computers using very primitive equipment and programming tools, yet it has remained the standard solution method used in bearing design to this day. Today’s scientists and engineers have a vast array of computer tools to select from. Hardware choices range from handheld devices, to desktop personal computers that are much more powerful than the largest mainframes of a generation ago, to scientific workstations that are more powerful still, to supercomputers whose computation speeds are mind-numbing. Similarly, a wide array of software tools is available. Many problems today are solved using application programs developed specifically for virtually any field of practice. Other, more general purpose electronic toolkits are available to handle general technical problem-solving needs. These programs allow the user to select from a large array of mathematical functions and to combine these in an unlimited number of ways to obtain numerical or, in some cases, symbolic solutions to complicated problems. Finally, there are the general purpose programming languages such as FORTRAN, C++, and Java. It is with these general purpose programming languages that the previous two categories of software tools are constructed. There is value in developing skill in the use of tools in each of the three classes identified above.
But there are several reasons that argue for those in technical fields to have some significant experience in the use of general purpose programming languages for solving technical problems:

• Modern general purpose programming languages cause programmers to give careful consideration to the various data types they employ in their programs and the inherent properties of each type.

• Integrated development environments provide convenient debugging environments offering the user the opportunity to develop the skills required to detect, isolate, and correct logic flaws in programs. These skills seem to be transferable to other areas of endeavor and will serve the learner well in future experiences with debugging complicated systems.

• The fundamental computer operations of sequence, selection, and repetition are very visible in general purpose programming languages. Learning these fundamental constructions early reduces the effort required to learn new computing languages and application commands in the future.

• Using a general purpose programming language reduces the “fog” that often accompanies using higher level application environments. It removes one level of obfuscation in that when using a general purpose programming language, we may ask “Why did the program do that?” and the answer will invariably be “Because you told it to do that!” instead of having to wonder how the underlying software in our application environment interpreted and implemented our command.
1.3  THE PURPOSE OF COMPUTING
In one of the early standard textbooks, R.W. Hamming offered this time-honored reminder to engineers and scientists who use the computer in their work: “The purpose of computing is insight, not numbers.”1 This quote can be the springboard for launching a phalanx of warnings and
1. R.W. Hamming, Numerical Methods for Scientists and Engineers, McGraw-Hill, 1962.
admonitions (usually learned by personal experience) regarding the dangers of getting so involved in the numbers we are computing that we may lose sight of the problem we are actually trying to solve. Anybody can generate a lot of numbers using a computer. It takes a skilled practitioner to realize solutions and insight using a computer. Never let the computer replace your brain. Always challenge the numbers your computer programs generate. Be skeptical. Only after careful analysis and judgement should you begin to really believe that the numbers your programs generate provide insight into the problems you are solving. Therefore, we prepare now to begin a study that seeks to provide the technical student with an introduction to the bare essentials of numerical problem solving in the context of the bare essentials of the C++ general purpose programming environment.
1.4  QUICK START: YOUR FIRST C++ PROGRAM
Traditionally, the first program that one writes in C++ is a simple program that prints the words Hello world on the computer console. The rationale is that creating, compiling, linking, and executing even a simple program requires a basic knowledge of steps that are essential to the successful creation of more substantial programs in the future. Gaining a “quick victory” in accomplishing this task will give the learner encouragement to proceed further at a quick pace. The “Hello world” program is shown in the following example. The steps required to create, compile, link, and execute the program vary extensively depending on which C++ compiler, integrated development environment, and operating system are being used.

EXAMPLE 1.1  “Hello World” C++ Program

“Hello World” is a very brief C++ program that simply displays the words Hello world on the computer console. The line numbers shown at the beginning of each line are not part of the program, but are used for reference in the discussion below the program listing.
      Column Numbers
               1         2         3
      123456789012345678901234567890
   1) #include <iostream>
   2) using namespace std;
   3) int main(void)
   4) {
   5)    cout << "Hello world\n";
   6) }
The six lines above constitute the entirety of the program. Column numbers are shown just as a typing guide. C++ itself is relatively insensitive to the actual spacing and formatting within the program. The formatting style shown above is typical of the predominant style that is in use today. Line 1) of the program is technically not a C++ statement. Rather, because it begins with the # symbol, it is a compiler directive that instructs the compiler to go to some pre-assigned folder on the computer system and read in the contents of the file named iostream and treat them just as if the user had entered the contents of the file from the keyboard. C++ is quite a compact language which makes extensive use of so-called header files to define capabilities that are not automatically assumed to be part of the language definition itself. For example, the iostream header file defines the cout object that is used in line 5) of the program. Rather amazingly, C++ does not natively support any capability to process input or output data! Instead, it relies on header files to define these capabilities
to suit the user’s wishes. However, all standard C++ compilers are required to provide an extensive set of header files that perform scores of useful functions in a standard way. This ensures portability of programs written in C++ from one compiler and computer environment to another. This portability has long been one of the hallmarks of the C and C++ programming language heritage. Line 2) is a relatively new feature of C++ that we make use of in later chapters. Some compilers may still compile this code correctly if this line is omitted, but those that adhere strictly to the standard will not. We will always include this line in our programs. Its purpose is to allow programmers to create and use identifiers (variable and function names, which will be covered later) that may conflict with predefined system identifiers. By carefully crafting such code, programmers can extend or replace the functionality of most system defined functions. But this is pretty serious programming stuff that is best left to the experts. For our purposes, just consider this line as boilerplate that must be present for the program to operate correctly. Line 3) introduces the main function. Every C++ program must have a main function. This line provides information relating to how the program will interact with the operating system. The word int at the beginning of the line is required. This keyword provides a mechanism for sending information back to the operating system at the conclusion of the program. A single integer number can be returned, usually to indicate whether the program ran successfully or not. This information can be used in command interpreter batch files or scripts to determine the next operation to be performed. UNIX-type systems make extensive use of the return value from programs. The word void in parentheses indicates that no information from the command line is to be passed to the program.
More sophisticated programs will allow optional arguments that are typed on the command line following the program name to be accessed by the program. Line 4) is paired with line 6) using the curly brace characters, {}, to form what is known as a block. All of the C++ code within a block is treated as a single unit. Generally speaking, if a block of code consists of a single statement, the braces are optional. However, most beginning programmers will be well served if they adopt the convention of always enclosing blocks of code in braces and indenting the code within the block three spaces. This causes the physical structure of the program in a printed listing to reflect the logical structure of the meaning of the code. The single line of code in the program that actually does anything is line 5). This line of code causes the string of characters enclosed in quotation marks to be sent to the cout object, which is preassigned to the computer console by the C++ startup code that executes prior to calling the main function. The << symbols form the C++ stream insertion operator that tells the cout object to accept the string of characters on the right hand side of the operator and perform its normal operation on the string, which, for the cout object, is to display them on the console. The special symbol \n at the end of the character string is the newline symbol, which is a nonprinting character that causes the cursor to go to the left margin and down one line after printing the preceding text. More such escape characters will be shown in the next chapter. The “Hello World” program is a very good way to get familiar with the steps necessary to produce C++ programs. The reader should be sure to run this program successfully before proceeding to more complicated programs. Also try to extend the program by inserting more lines similar to line 5) to gain an understanding of the \n escape character.
Glossary

console: the primary device for displaying computer program output, usually the computer display screen in current generation computers.

programs: sequences of computer instructions that operate on user-supplied data to yield useful results.

mainframe: a large enterprise computer most often used to process large databases or basic business functions such as accounting and payroll.

memory device: an electronic or electromagnetic device that stores computer programs to be run and/or data for the program.

programmers: humans who have the specialized knowledge required to develop computer programs.
Chapter 1 Problems

1.1  Look around the area where you are reading this. Count how many computers you can see. Did you include your automobile (most modern automobiles contain several special purpose computers)? How about your calculator or PDA? How about your telephone, or television, or even your wristwatch? (Now turn off the TV and study!)

1.2  To help prepare you for the tedium associated with computer programming, you might try one of the popular assignments given to aspiring technical writers, such as writing a paragraph explaining how to tie your shoelace or your necktie. Search the Internet for a professionally written example of a similar task and compare it to your paragraph.
CHAPTER 2
C++ Fundamentals: Constants

CHAPTER OUTLINE
2.1 Representing Data
2.2 Integers
2.3 Floating Point Numbers
2.4 Characters
2.5 Boolean
Computer programs are generally understood to be composed of two elements: data, and algorithms that operate on the data. Data without processing algorithms are meaningless collections of numbers. Algorithms without data are only potential solutions to real problems.
2.1  REPRESENTING DATA
Computers represent all types of data (numbers, characters, Boolean true/false values, or anything else you can think of) as binary numbers. This is because all modern computers are made up of bistable storage elements which have two and only two possible states: high or low, on or off, charged or not charged, etc. Thus any program you run, whether it is a highly interactive flight simulator, a word processor, an image editor, or an Internet chat program, is processing data that is at the deepest level a series of binary digits, 1's and 0's. The details of how these binary digits are processed vary among different brands of computers and among different programming languages. But it is essential to know the basics of how these binary digits are processed as integers, floating point numbers, characters, and Booleans in order to use them properly in programs. A scalar is a single number or entity, as opposed to non-scalar data, which aggregates more than one entity in some orderly fashion. Most computer programs use a combination of scalar and non-scalar data to accomplish the intended results. Let us briefly discuss the two most commonly used C++ scalar data types.
2.2 INTEGERS
Although it is not the simplest data type, we begin with a discussion of the integer data type. A C++ integer is a whole number that exactly mimics the behavior of the mathematical concept of an integer, except for the range of numbers it can represent. Written in the decimal (base 10) number system, an integer takes the form

    ±d_n d_(n-1) d_(n-2) … d_2 d_1 d_0

where the d_i's are digits from the set {0,1,2,3,4,5,6,7,8,9}. We understand standard positional notation to mean that digit d_0 is to be multiplied by 10^0, d_1 by 10^1, d_2 by 10^2, etc. Therefore, when we write the sequence of decimal digits 1234, we understand that the number represented by this digit sequence is 1×10^3 + 2×10^2 + 3×10^1 + 4×10^0, or one thousand two hundred thirty-four.

C++ integers are written exactly the same way. The integer may be preceded by an optional + sign if the number is positive or a mandatory - sign if it is negative, immediately followed by the decimal digits of the number with no embedded commas or spaces. There is also never a decimal point in an integer, and, of course, there can be no fractional part, since integers are whole numbers. But there is one major difference between C++ integers and the set of whole numbers. To understand this difference, let's briefly examine some of the details concerning the way integers are stored in a digital computer's memory.

As we stated earlier, all data processed by a digital computer consist of groups of binary digits, or bits. A bit is a storage element that can have only two states, usually represented as the numbers 0 and 1. Computers store integers as binary numbers using a sequence of bits according to the following scheme:

    ±b_n b_(n-1) b_(n-2) … b_2 b_1 b_0

where the b_i's are either 0 or 1, each occupying one bit of computer storage space. Likewise, the sign of the number can be represented in one bit of storage, since it also has only two possible values.
Universally, a 0 sign bit represents a positive number and a 1 sign bit means the number is negative. When interpreting binary numbers using positional notation, we must remember that the base of the number system is 2, not the 10 we are accustomed to. To remind us of this, we normally place a subscript 2 to the right of the number. So the number 010011010010₂ represents the decimal number

    1×2^10 + 0×2^9 + 0×2^8 + 1×2^7 + 1×2^6 + 0×2^5 + 1×2^4 + 0×2^3 + 0×2^2 + 1×2^1 + 0×2^0
    = 1024 + 0 + 0 + 128 + 64 + 0 + 16 + 0 + 0 + 2 + 0 = 1234

The leading zero tells us that the number is positive. The binary representation of negative numbers is slightly more complicated, but we are assured that the leftmost bit of a negative number will be a 1.

Computers are designed to access information in "chunks" at a time called bytes. A byte is eight consecutive bits. Current generation computers work most efficiently if they access two, four, or eight bytes at a time. Thus, a so-called 16-bit processor is optimized to access information two bytes at a time, while a 32-bit processor is optimized to access information four bytes at a time. Now, a computer has only a finite number of bytes to work with. (Granted, that number is increasing rapidly with every new generation of computers, but it is still finite!) Computer designers have continually sought to provide programmers with hardware that will minimize the effort required to solve an increasingly large range of problems. A common design that is still frequently used is to allocate 16 bits (2 bytes) of storage for integers. Using this representation, integers must lie in the range of -32768
to +32767 (-2^15 to +2^15-1). The most positive number has a magnitude one less than that of the most negative number because the number 0 (zero), which carries a 0 sign bit, uses up one of the non-negative bit patterns. Use of 16-bit integers is on the wane for most programs because we frequently encounter situations that exceed these rather restrictive range limits, causing hard-to-find program errors. Most programmers prefer to use 32-bit (4-byte) integers. This gives a greatly increased range of numbers to work with, thus reducing the chance of undesired overflow. Using 32-bit integers, the most negative number is -2147483648 (-2^31) and the most positive number is +2147483647 (+2^31-1). The C++ programming language places no requirements on the number of bits that integers occupy. This choice is left up to the compiler developer. The compiler is simply a program that converts your C++ programs into instructions that are compatible with your particular computer hardware/operating system combination.

Integer constants are simply whole numbers that lie within the acceptable range as described above. They are written as an unbroken sequence of decimal digits [2] containing no spaces, commas, or decimal points.

EXAMPLE 2.1 Valid Integer Constants
The following numbers are all valid integer constants:

    123         -456
    0           -999999
    64321234    -4738921

EXAMPLE 2.2 Invalid Integer Constants
The following numbers are all invalid integer constants:

    2,001          (No embedded commas are permitted)
    1 000 000      (No embedded spaces are permitted)
    3425.          (No decimal points allowed in integers)
    999999999999   (Out of range)
2.3 FLOATING POINT NUMBERS
Matters get somewhat more complicated when it comes to working with floating point numbers on a computer. These are numbers that contain fractional (non-integral) parts. First, let us note that, like the integers, the real numbers are infinite in number. The distinction between real numbers and floating point numbers is subtle, but very significant for numerical methods applications. Between any two real numbers there exists an infinite number of real numbers. The same cannot be said about floating point numbers, since they comprise a finite set. If the set of real numbers is visualized as a line extending from the number 0 at the origin infinitely in both the positive and negative directions, then the set of floating point numbers will plot as discrete points along this line. The points will be very close together near the origin and very far apart at great distances from the origin. At some point a great distance away from the origin, floating point numbers cease to exist while the real numbers
[2] It is also possible to use constants consisting of binary, octal, or hexadecimal digits. However, most scientific programmers will use only decimal constants. Since our objective is to cover only the rudimentary basics of C++, we will avoid mentioning many of the more esoteric aspects of the C++ language. Should the reader encounter these features when reading code written by others, he/she should know that there are many excellent books which cover the C++ programming language in greater depth than we purport to do here.
go on forever. Thus, floating point numbers are limited in both range and precision. This inability to represent numbers precisely is the root cause of roundoff error, which is a cause for concern in many numerical methods. Intuition and Murphy's Law both suggest that if we cannot represent the underlying data in a numerical calculation accurately, then when we start adding, subtracting, multiplying, and dividing these inaccurate numbers, they will combine in the worst possible way. This view is only a slight exaggeration of the truth. We will discuss this matter further later and determine ways to control roundoff error.

An extensive discussion of floating point number representation inside the computer is beyond the scope of this book, but the general principles can be presented briefly. Floating point numbers are represented internally in a manner similar to scientific notation. The number consists of four parts: the sign of the number, the significant digits of the number with all leading and trailing zeros removed, the exponent, and the sign of the exponent. Each of the sign components requires one bit to signify positive or negative. The remaining bits of the number are apportioned between the exponent and the significant digits. We may represent such a scheme using binary digits as follows:

    ± ±e e_m e_(m-1) … e_0 b_n b_(n-1) … b_2 b_1 b_0

where ±e is the sign of the exponent, the e's are the exponent bits, and the b's are the significant bits of the number. Just as with integers, computer designers have sought to provide a reasonable balance between the number of bits allocated to the e's and b's while providing rapid access to data. The C++ language standard supports at least two choices for representing floating point numbers: single precision and double precision. The details of how each manufacturer represents floating point numbers are normally not too important.
As a general rule, you can assume that single precision floating point numbers can have exponents that range from about 10^-38 to 10^38 and carry about seven significant decimal digits. Double precision numbers are much less standardized. Those computers using the IEEE standard (including personal computers using Intel processors with floating point coprocessors) can handle exponents from about 10^-308 to 10^308 and maintain about sixteen significant decimal digits.

There are two ways to write floating point constants. The conventional way is to write the number as a sequence of decimal digits, preceded by a minus sign if the number is negative, and including a decimal point somewhere in the number. This last point is very important. Including a decimal point in a floating point constant is a key to ensuring accurate calculations; omitting it can lead to disaster. You will see why in Section 5.2. Even if the constant you are writing has no fractional part, if it is being used in an expression that can potentially yield a value with a fractional part, be sure to write all constants in the expression as floating point constants, including decimal points. You should never begin or end a floating point constant with a decimal point; add a zero before or after the decimal point, as appropriate, to clearly set it off. Also, just like integer constants, floating point constants must not include embedded spaces or commas.

A second way floating point constants can be written is in exponential format. This format is intended for numbers that would otherwise be difficult to read because of the large number of leading or trailing zeros needed to place the decimal point properly.
To use the exponential format, write the significant digits of the number without the leading/trailing zeros, then suffix the number with the letter 'E' (or 'e') and a positive or negative integer indicating how many places right (for a positive exponent) or left (for a negative exponent) the decimal point is to be shifted from the position shown to obtain the correct value of the number. Do not put any spaces anywhere in the number, and do not put a decimal point in the exponent.

EXAMPLE 2.3 Valid Floating Point Constants
Here are some valid floating point constants. Each number in the second column is equivalent to the corresponding number in the first column:

    0.0             0.0e0
    -123.456        -0.123456e+3
    0.00023483      +23.483E-5
    -99234.0023     -9.92340023e4

EXAMPLE 2.4 Invalid Floating Point Constants
Here are some invalid floating point constants:

    1,234.56       (No embedded commas are permitted)
    1 000 000.0    (No embedded spaces are permitted)
    2.0e0.5        (No decimal points allowed in the exponent)
    6.02e 23       (No space between the 'e' and the exponent)
2.4 CHARACTERS
All “calculations” in the traditional sense in C++ programs are done using integers and floating point numbers. Characters are most often used as data to be manipulated. For example, a program that justifies text flush against the left and right margins must do some computations, but the words being processed consist of characters. Therefore, there is a need to represent characters using the internal binary storage capabilities of the digital computer. Just as with integers and floating point numbers, computer designers have chosen to represent characters as a group of bits lumped together and treated as a single entity. The choice of how many bits to allocate to each character has evolved over the years. In the 1960's, characters were generally allocated six bits each. This allowed for 2^6 = 64 different characters, which permitted only the upper case English alphabet, the 10 decimal numerals, and other miscellaneous symbols typically found on a typewriter keyboard. As the need for character processing increased (e.g., computer-printed form letters), seven bits became necessary. Since the early 1980's, the standard has been eight bits (one byte) of storage per character. Today, internationalization forces are generating a need for multi-byte characters.

Characters are stored internally according to some standard collating sequence that is compatible with the computer hardware and operating system. Two such collating sequences are in common use: ASCII and EBCDIC. As you might suspect, these two sequences are largely incompatible. The C++ programmer should generally avoid relying on the details of either collating sequence, as tying a program to a collating sequence makes it difficult to convert to a computer that uses a different collating sequence and thus reduces its portability.
The C++ standard requires only that the character collating sequence provide some simple assurances, such as ‘A’ appears before ‘B’, ‘a’ appears before ‘b’, ‘0’ appears before ‘1’, etc. Note that the numerals can be interpreted either as numbers or characters, depending on the context. We shall see in Section 3.1 that a similar problem exists for the letters. In order to eliminate confusion, C++ requires that when a number or letter is to be interpreted as a character, it must be surrounded by single quotes (e.g., 'A', 'B', '1', '2'). Note that each such character must be surrounded by its own set of single quotes.
There are several special escape sequences in C++ that have their own special notation. Most of these are nonprinting characters that are commonly used to format printed output. Several of the most common escape sequences are listed in Table 2.1.

Table 2.1 Standard C++ Escape Sequences

    Character                 Escape Sequence
    Newline                   \n
    Horizontal tab            \t
    Carriage return           \r
    Formfeed                  \f
    Alert                     \a
    Backslash                 \\
    Single quotation mark     \'
    Double quotation mark     \"
    Null character            \0
2.5 BOOLEAN
The C++ standard has at long last created a standard representation for the data type that is most compatible with the most basic underlying hardware data type: the Boolean. A Boolean (we capitalize the word because it is named after the founder of the study of binary algebra, George Boole) is a value that can be simply true or false. In fact, these are the only values a Boolean can have. We wonder why it took so long to formalize a concept that is intimately associated with programming logic, as discussed in Section 5.1. The C programming language, upon which C++ is based, defined the integer value of zero to be false and anything other than zero to be true. As a result of this convention, programmers long ago developed schemes to simulate the action of the simple true/false logic now defined in the Boolean data type. This can make reading or modifying old programs more complicated. Although the C convention is still respected in C++, we strongly encourage the consistent use of the Boolean data type in new programming projects.
Glossary

algorithm: a set of instructions to be applied to input data to achieve the desired result; a recipe for solving a given problem.
ASCII: American Standard Code for Information Interchange; the collating sequence used by most western-language computer systems.
bit: the amount of information necessary to differentiate between two equally likely alternatives; a single binary digit.
Boolean: a data element that can be only true or false.
byte: a group of binary digits taken together to represent a single displayable or nondisplayable character; typically 8 bits.
character: a displayable or nondisplayable symbol associated with a particular combination of bits.
collating sequence: the order in which characters are ranked for sorting purposes.
compiler: a computer program that converts programs written according to the rules of a programming language into the native binary machine language of a particular computer.
EBCDIC: Extended Binary Coded Decimal Interchange Code; the collating sequence often used by IBM mainframe computers.
escape sequence: a special combination of symbols used to represent nondisplayable characters.
floating point number: one of a finite subset of the real numbers that can be accurately represented by a computer.
integer: a whole number having no fractional part.
portability: the property of a computer program that allows it to be run on computers manufactured by different vendors with little or no modification.
roundoff: error that results from the inability of a computer to accurately represent all of the set of real numbers.
string: a sequence of characters that are understood to be grouped together as a single entity.
Chapter 2 Problems

2.1 TRUE/FALSE
a) ____ The escape sequence \n, when output with cout, causes the cursor to move to the beginning of the next line on the screen.
b) ____ The constants 3 and 3.000 are exactly the same.

2.2 Indicate which of the following C++ constants are integers (I), floating point (F), character (C), Boolean (B), or illegal (X). If the constant is illegal, state why.
a) ____ 3.141592653
b) ____ 6.02x10^23
c) ____ 'z'
d) ____ True
e) ____ !25
f) ____ 11:24:53
g) ____ 87.3e-4
h) ____ 16,384
i) ____ 2.3653e2.5
j) ____ 342618
CHAPTER 3
C++ Fundamentals: Variables

CHAPTER OUTLINE
3.1 Variable Names
3.2 Arrays
3.2.1 One-Dimensional Arrays
3.2.2 Multi-Dimensional Arrays
3.2.3 Strings
Digital computers would be little more than unsophisticated calculators if all they could do was plod through long sequences of numerical constants. To make things interesting, we must be able to reserve storage locations in the computer memory that will be filled with actual data only as the program begins execution. Using such variables allows the programmer to build a set of instructions that will perform the desired manipulations on a variable without knowing its actual numerical value in advance; that value will be provided by the program user who actually runs the program. One of the important tasks of the programmer is to make sure the program works properly regardless of the actual value the user specifies.
3.1 VARIABLE NAMES
C++ provides access to variables by allowing the programmer to name the storage locations that will eventually contain the variable data. The variable naming rules for C++ are quite general, and different variable naming conventions have developed as C and C++ have evolved over the years. The current style for naming variables follows these rules:

• Variable names should consist only of upper and lower case letters and the ten decimal digits. (NOTE: Even though the C++ standard allows several special symbols to be used in variable names, we discourage their use.)
• The first character of the variable name must not be a digit.
• Variable names should be descriptive of the variable's purpose. Do not use nonsense variable names.
• If the variable name is a single word, it should be written entirely in lower case letters.
• If the variable name requires more than one word, the first word should be all lower case and the first letter of each subsequent word should be capitalized.

Variable names in C++ are case sensitive. Be very careful when declaring variables that you follow the conventions listed above. Every time you reference a variable in a program, you must spell it exactly the same way.

In C++, all variables must be declared before they are used. To declare a variable, simply include a line in the program that states the variable type (int for integers, float for single precision floating point numbers, double for double precision numbers, char for characters, and bool for Booleans), then list on the same line the variable names of that type that are to be created. You can include as many declaration lines of each type as you wish, in any order.

EXAMPLE 3.1 Valid Variable Declarations
Here are some valid variable declarations of each type previously discussed. The variable names illustrate the naming rules and conventions described above.

    int i, jmax, k37;
    float x, conversionRate, heightToWidthRatio;
    double radius, desiredDepth;
    char yn, ans, userResponse;
    bool doMore, finished;
Depending on where in the program variables are declared, they may or may not have a known initial value. It is up to the programmer to make sure a variable is initialized before it is used. A variable is initialized by following the variable name with an = and the value to be used to initialize the variable. If you want to protect the initialized variable from being altered as the program executes, prefix the declaration with the const keyword.

EXAMPLE 3.2 Valid Variable Declarations with Initializers
Here are some valid variable declarations with initializers.

    int kmin = -3, incr = 2;
    float minPressure = 14.7;
    const double gpm2lpm = 3.785;
    char yes = 'y', no = 'n';
    bool option1 = true, option2 = false;
3.2 ARRAYS

3.2.1 One-Dimensional Arrays
In scientific problem solving, we frequently find the need to aggregate a collection of similar, but not identical, items. For example, we may have acquired experimental data from an instrument that recorded the temperature in a component heat sink after power was applied to the component. We know that the data acquisition system recorded the temperature at regular time intervals. It would be burdensome to have to assign a unique variable name to each one of these temperature values. Since we know that all of the numbers are temperature readings, it would be very natural to aggregate them all under a variable named temperature, if there were some way to easily access each individual temperature reading. Declaring temperature as a one-dimensional array satisfies this requirement. A one-dimensional array can best be visualized as a list of items (e.g., temperature readings) written on lined paper where each line is numbered. The line number is called the array
index or subscript. In C++, array subscripts must be integers, and they start at zero. (This fact presents some degree of difficulty when converting old FORTRAN programs into C++, since array subscripts in FORTRAN begin with the number one.) To carry forward our temperature data acquisition example, suppose we know that our instrument has acquired exactly 100 temperature measurements. Then to declare the C++ variable temperature to be a one-dimensional array capable of holding all the measured data, the following declaration could be used:

    double temperature[100];
So simply adding an integer constant in square brackets following a variable name in a declaration statement tells the C++ compiler to reserve the specified number of storage spaces for the variable. In the case above, we have decided to reserve 100 storage spaces, each of type double, to contain the temperature readings. The temperature data will be stored in subscript locations 0 through 99 of variable temperature. Note that we have chosen to declare the array to be of type double because the instrument measuring the temperatures has a resolution of much better than one degree, so we must provide for the fractional parts of a degree in each temperature measurement. In this particular case we could probably use the float type, since it is doubtful that the measurements are made with more than seven significant digits! Nonetheless, we will generally use double precision numbers to represent real numbers unless the storage requirements become prohibitive.

We also use the square bracket notation to reference individual elements of an array. For example, temperature[0] contains the first temperature reading taken when the data acquisition system was started, and temperature[99] contains the last temperature value read. Notice that although the data items themselves are of type double, the array subscript is always an integer. After all, it would seem pretty silly to try to obtain temperature number 2.5 from the array, wouldn't it? Later we will learn how to use integer variables as array subscripts to rapidly perform repetitive operations on each element of an array, regardless of how many elements are in the array.

An array can be initialized with a set of values as it is being declared, although the syntax is somewhat more complicated than for a scalar variable.
To initialize an array at declaration, follow the declaration with an = and the list of values to be placed in the array, enclosed in curly braces and separated by commas. If you supply fewer values in the initialization list than the size of the array, the remaining array elements will be set to zero. If you supply too many initializers, the compiler will report an error. If you include an initializer list, you may choose to omit the size of the array written between the square brackets; if you omit this value, the compiler will automatically create exactly as many elements in the array as there are values in the initialization list. We can form arrays of any data type, either those built into C++ or those added by the user as classes. In fact, later we will use classes to create objects that allow us to do common operations on lists of numbers as if they were vectors.

EXAMPLE 3.3 Valid One-Dimensional Array Declarations
Here are some valid one-dimensional array declarations:

    int picks[10], id[4] = {3, 18, 22, 47};
    float xval[3] = {-23.4, -2.78, 6.734};
    double sum[6];
    char validAnswers[] = {'Y', 'y', 'N', 'n', 'Q', 'q'};
    bool options[] = {false, true, true, false, true};
3.2.2 Multi-Dimensional Arrays
C++ allows you to declare variables to have any number of subscripts by simply adding additional subscripts in square brackets in the variable declaration. Thus, a variable declared as

    double y[6][4];

would be capable of containing six four-element one-dimensional arrays, for a total of 6×4 = 24 double precision numbers. This particular type of variable declaration, having two subscripts, suggests a row-column structure akin to a spreadsheet or a matrix. Using the matrix analogy, the first subscript refers to the row number and the second subscript refers to the column number. The individual elements of the two-dimensional array can be accessed by writing y[0][0], y[0][1], y[0][2], …, y[5][2], y[5][3]. This thinking can proceed to variables having more than two subscripts, although the geometric structure analogies break down when more than three subscripts are used. In practice, scientists and engineers use one- and two-dimensional arrays much more frequently than those having a greater number of subscripts. In fact, later we will develop special C++ classes for handling one- and two-dimensional arrays as vectors and matrices, respectively. Therefore, we will not pursue any further the use of native arrays as defined in C++.

3.2.3 Strings

C++ makes a special provision for the common case of one-dimensional arrays of characters used to represent textual information, such as results of calculations to be displayed on the screen or printer. If an array of characters is terminated with a special sentinel character, specifically the null character '\0', then it becomes a character string, which is understood by a rich set of string processing functions in the standard C++ library. The C++ language itself also supports special syntax for character string constants. A string constant consists of the characters written back to back, just as in normal writing, and surrounded with double quotes (e.g., "This is a string"). The compiler automatically adds the terminating null character to the string. To declare an array of characters as a string, simply add an = and the string constant initializer at the end of the declaration, as shown below:

    char heading[] = "Last Name      First Name\n";
Note that in this example we did not attempt to count the characters in the string constant initializer and place this number in the square brackets. If the contents of the string are not going to change during the execution of the program, we recommend that you follow this syntax and let the compiler determine the number of characters in the string. (It would also be wise to include the const keyword at the beginning of the declaration. [3]) If you do place an integer number inside the square brackets as the size of the string, don't forget to include the trailing null character in the count. Also, in the example above, we see the use of an escape sequence to represent the newline character that is to be included in the string.

It is important to note here that C++ strings of the type described here are not dynamic. That is, the length of the string cannot be extended during program execution. Furthermore, C++ does not by default check that references to individual characters in the string are actually within the bounds of the string. Storing characters beyond the end of the string is sure to corrupt the program and cause unpredictable results. Fortunately, C++ provides a newer string class that overcomes these limitations, but the standard C-style strings described here are still widely used by C++ programmers.
[3] Programmers trained in the C language are more apt to use the syntax

    char *str = "initializer";

which uses an explicit pointer to accomplish this purpose. We have chosen to avoid any explicit use of pointers in this brief introduction to C++.

Glossary
array: a collection of scalar entities of the same type that are referenced by a single variable name.
case sensitive: upper and lower case letters are not considered to be equivalent.
subscript: an integer constant or variable used to select an individual element from an array.
variable: a symbolic identifier that names some fixed location in the computer's storage space, the contents of which can be expected to change during execution of the program.
Chapter 3 Problems

3.1 TRUE/FALSE
a) ____ All variables must be declared before they are used.
b) ____ All variables must be given a type when they are declared.
c) ____ C++ considers the variables number and NuMbEr to be identical.
d) ____ Declarations can appear almost anywhere in the body of a C++ program.
3.2 Fill in the blanks.
a) A variable declared outside any function is called a _________ variable.
b) Lists and tables of values are stored in __________.
c) The elements of an array are related by the fact that they have the same ___________ and ___________.
d) The number used to refer to a particular element of an array is called its __________.
e) An array subscript should normally be of type _________.
f) The first element in a normal C++ array has the subscript value of __________.
g) The fourth element of array values would be accessed using the expression __________.
3.3 Label each of the following C++ variable names as being either correct (C) or incorrect (I). If the name is incorrect, state why.
a) ____ _under_bar_
b) ____ m928134
c) ____ t5
d) ____ j7
e) ____ herSales
f) ____ hisAccountTotal
g) ____ 67h2
h) ____ top-Dog
i) ____ Great!
j) ____ float
CHAPTER 4
Fundamentals of C++: Basic Input and Output

CHAPTER OUTLINE
4.1 Standard Streams
4.1.1 The cout stream object
4.1.2 The cin stream object
4.1.3 The cerr stream object
4.1.4 Standard stream redirection
4.2 User-Defined Stream Objects
If you have had any previous programming experience with a different programming language, it may seem strange that the C++ language itself makes no provision for inputting or outputting data. It would be an unusual program, indeed, that did not require some sort of input or generate some output! The omission of input and output facilities from the language definition of C++ was not an oversight; rather, it was a deliberate decision to keep the language itself simple and let programmers add input and output facilities through functions, just as everything else is done in C++. Functions will be covered in Chapter 6. Even though you can write your own C++ input and output functions, you don't have to. In fact, standard C++ supplies you with two completely different methods for handling input and output. The first method is the old C-style printf family of functions, which provides an easy way to migrate existing C programs to C++. The second method is the stream input/output (or streamio, or iostream) family of functions, which is entirely new to C++. In fact, the stream input/output capabilities are implemented using the powerful C++ object-oriented class feature. So while a detailed study of classes is beyond the scope of this text, it is ironic that we will be using classes extensively for all of our input/output operations.
4.1 STANDARD STREAMS
C++ programs come pre-equipped to process data in three different streams. (Other streams can be created and processed during program execution.) A stream is a sequence of data characters that is either read or written by the program. Streams consist of both printable and non-printable characters.
The non-printable characters, such as the escape characters presented in an earlier chapter, may visually format the stream when rendered on paper, but to the computer, all of the characters are processed in one long sequence, one after another.

The three pre-connected streams are the standard input stream (stdin), the standard output stream (stdout), and the standard error stream (stderr). In a simple text-oriented program running in console mode, stdin consists of the sequence of keystrokes entered at the keyboard. It makes no difference to the program whether the stdin stream is being created by the user "on the fly" as the program runs or is actually a data file created in advance by the user and "fed" to the program in lieu of typing the characters on the keyboard as the program runs.

The stdout stream for a typical console program is represented by the information that the program prints to the computer screen. You will come to see that every character that appears on the screen is explicitly directed there by the programs that you write. Since the characters appear on the screen in a prescribed order, they too are best considered as a stream, which can be directed to the computer screen, to the printer, or to a file on your computer's hard disk. When using stdout, your program is not particularly aware of the actual final destination of the characters it generates and sends to the stream. You have great control over the exact appearance of the characters that your programs generate; consequently, you will find yourself spending much time working on the appearance of your program output. C++ offers a bewildering array of formatting options for streams that you write to.

The stderr stream is very similar to the stdout stream in that it produces characters that are shipped to some output destination.
While the stdout stream is typically mapped to the console screen, the stderr stream may be mapped to the screen or to some other system device used to log errors. We normally use the stderr stream to report fatal errors that cause programs to abort. Normally, formatting of these error messages is not a prime concern.

C++ provides access to these standard streams (as well as other streams that you may create through your own program code) using objects which are automatically created by the program startup code before your main() function begins executing. This is done if you include the header file iostream using the following two lines of code at the top of any program unit that references any stream object:

#include <iostream>
using namespace std;
The first of these lines is a directive to the C++ preprocessor to include at this point all the definitions necessary to manipulate input/output streams using the iostream classes, including the objects that control the standard preconnected streams as described in the following sections. The second line accesses the relatively new namespace feature of C++. Including this line in your program effectively tells the compiler that you are not using any functions or objects of your own that conflict with those in the standard library. If you do not include this line, then you will have to prefix any function or object name from the standard library with the std:: scope resolution operator.

4.1.1 The cout stream object

The cout object is probably the most widely used of those provided by the iostream classes. It is associated with the stdout stream, and the object's name is a contraction of the phrase console output. For console applications (those that output to a text screen consisting of 25 lines by 80 columns), cout is the way to display most types of program output. The use of the cout object to display program output has been illustrated already in some of the example programs. Begin the line of code with the word cout followed by one or more sets of stream insertion operators (<<) and corresponding variables or constants that you wish to display on the screen. For example, consider the following code fragment.
cout << "Welcome to the world of C++ programming.";
cout << 1 << 2 << 3;
When this code fragment runs, it will cause the following text to be displayed on the console screen: Welcome to the world of C++ programming.123
First of all, notice that the cout object can print both string (text) and numerical values. In this case, they are all constants, but any one of these could have been a variable name whose current value was as indicated in the console display at the time the statements were executed. The cout object "knows" how to print the values of all standard C++ data types. Second, notice that all of the output appears on a single line, even though the code to generate the output appeared on two lines. In C++, unlike many programming languages, there is no relationship between the number of lines in the displayed output and the number of lines of code used to generate the output. Remember that C++ regards the output device as a simple stream of characters. It has no regard for line or page lengths or other formatting considerations. If you want to end the current line of output and start a new line, then it is your responsibility to insert the necessary formatting codes to do so, e.g., the new line escape sequence '\n'. If you want spaces to appear between items to be displayed, then you have to include them in the output list yourself. To illustrate these points, the previous code fragment could be rewritten to have the three integer constants appear on a separate line, separated by spaces, using a single C++ statement as follows.

cout << "Welcome to the world of C++ programming.\n"
     << 1 << " " << 2 << " " << 3 << '\n';
Unless otherwise modified using formatting modifiers and iomanipulators, output created by the cout object will have a default format that generally results in pleasing and unambiguous output, although it may not be well suited for tabular output aligned in rows and columns. The accompanying software includes a useful C++ class that simplifies text formatting for special appearance purposes. (See Section 9.5.2 on page 84.)

4.1.2 The cin stream object

The cin object is the standard input stream object. It is the logical converse of the cout object. The standard input stream for most personal computers is the keyboard, although this can be altered as the program is run. Information is transferred from the standard input stream to program variables using the stream extraction operator (>>). Using the cin object to "read" numbers into a program is fairly straightforward. C++ "knows" how to interpret characters from the input stream and convert them to the proper internal representation of all standard data types. For the commonly used numerical data types, the cin object begins scanning the input stream, ignoring any whitespace (space, tab, or new line character) until it finds the first printable character. If the character is a '+', '-', or a digit, then successive characters are scanned and the number is formed in binary form in the computer memory. When the first invalid character is encountered (including whitespace), the scanning stops and the value is stored in the designated variable. The stream pointer is left positioned so that the next character to be read will be the first character following the number. Thus, numbers that are to be read by a C++ program are most often typed one per line, with the carriage return keystroke being interpreted by the program as the separator between the input data items. The following code fragment reads the values of three double precision variables from the cin object.

double a, b, c;
cin >> a >> b >> c;
When the program is executed, any of the following sets of keystrokes would successfully store the numerical values of 1.0, 2.0, and 3.0 in variables a, b, and c, respectively. 1<Enter>2<Enter>3<Enter> 1.0<space>2.0<space>3.0e0<Enter>
1<space>2<space>3<Enter>
etc.

So almost anything that "looks" right to your eye will be accepted. Notice that the last entry must be the <Enter> key, as this keystroke actually passes the information to the program that is waiting for input. When reading numbers of type int, float, and double from the cin object, the <Enter> key marks the end of the last number on the line and resumes execution of the program. The <Enter> key remains in the input queue and will be the first character read for the next input operation. If the next item to be read is also a number, then the <Enter> key will be considered whitespace preceding the data item and will be ignored.

A common mistake is to enter numerical data separated by commas. The first number will be read properly and the trailing comma will terminate the entry of the first number. However, since a comma is not whitespace, the cin object will become "confused" when it tries to retrieve the second number, and the program will probably "hang" at this point or generate a fatal error message. The cin object is capable of accepting properly formatted input of all standard data types. In addition, all properly designed classes should be compatible with the cin object, as illustrated later in Section 9.4.5. A detailed study of input processing for all data types is beyond the scope of this book.

4.1.3 The cerr stream object

The cerr standard stream object behaves just like the cout object, but it is intended to be used to log error or other critical informational messages. On legacy computer systems, the output from the cerr object was often directed to a separate printer, but on most personal computer systems, the cerr output appears on the same output device as the cout output, usually the display window on the screen.

4.1.4 Standard stream redirection

On most operating systems, all programs will be pre-connected to the cin, cout, and cerr standard stream objects at startup.
You don't have to do anything extra in your C++ code to make them available. But it is often the case that it would be more convenient to redirect these standard stream objects to data files of your choice. For example, if the input data for a program is voluminous, it would be wise to enter the data into a data file using a text editor (not a word processor), and then have your program fetch the input data from that file rather than the keyboard. Likewise, if the output of the program is longer than a single screenful, it might be more convenient to redirect the program output to a data file of your own choosing and then review the contents of the data file after the program terminates. And finally, you may wish to divert all output from the cerr object to a logging data file rather than have it appear intermixed with the cout output. On most operating systems, this can be accomplished using input/output stream redirection. The arcane syntax required to accomplish this can be illustrated best by an example. Suppose you have a program named myprog for which you have prepared the input data and stored it in a file named indata.txt, and you wish the normal output from your program to be stored in a file named outdata.txt and the error messages that you write to the cerr object to be stored in a file named errdata.txt. To accomplish this, you would enter the following command to run the program:

myprog <indata.txt >outdata.txt 2>errdata.txt
Thus, the content of the file named following the < symbol is used for the program input, the output generated by all invocations of the cout object will be stored in the file named following the > sign, and the output generated by any cerr statements will be stored in the file named after the 2> symbol combination. If the files outdata.txt and errdata.txt do not exist when the program is run, they will be created. If they already exist, they will be erased and replaced by the new output generated by the program. If it is desired to append the output from this program run to files that
already exist, then replace the > symbol with >> in each case.
4.2 USER-DEFINED STREAM OBJECTS
The standard predefined stream objects cin, cout, and cerr are sufficient for most elementary scientific programming applications. When these standard stream objects don't meet the special needs of a particular application, C++ offers a wide range of tools for accessing data files using user-defined stream objects. User-defined stream objects may open external data files for input or output (or both) in either sequential or random access modes using either character-coded or binary data. Complete coverage of all the options for file access is well beyond the scope of this text. For a full treatment of this topic, consult a C++ programming textbook, such as reference [2]. The common cases of sequential character-coded file access for either reading or writing can be handled with the default options of the fstream classes. The following code fragment illustrates how to open input and output data files as fstream objects.

#include <fstream>
using namespace std;
...
ifstream fin("indata.dat");
ofstream fout("outdata.dat");
...
fin >> item1 >> item2;
fout << item1 << item2;

In this code fragment, object fin is created and associated with a local file named indata.dat. Thereafter, data items may be read from the local file into program variables using object fin just as if they were being read from the console using the predefined cin object. Likewise, once the object fout has been created and associated with the local file, all data written using fout will appear in the local file at the conclusion of the program. From a programmer's viewpoint, object fout behaves just like the predefined object cout, except that the output information is placed in the local data file instead of being sent to the program console.
Glossary

iomanipulators: special symbols sent to input and output stream objects to control the formatting of data.
record: a sequence of characters, either of fixed length or delimited by a unique sentinel character.
redirection: changing the location of the standard stream objects cin, cout, and/or cerr from their default locations to user-defined files.
stream: a sequence of characters provided by an external source, such as a data file, to a program.
whitespace: nonprintable characters such as space, tab, and new line that are used to separate data entry items for input.
Chapter 4 Problems

4.1
Write a C++ program that will place numbers into one of two “bins” based on their size. Use a text editor (e.g., Notepad) to create an input file named indata.txt that contains numbers in the range 0 to 1000. Your program should begin by prompting the user for a
threshold for "large" numbers. The user should type this number on the keyboard. The program should then open the indata.txt file for input and two output files called smallnum.txt and bignum.txt. The program should then loop through all of the numbers in indata.txt, comparing each one to the threshold number the user entered. If the number from indata.txt is less than or equal to the threshold number, it should be written out to file smallnum.txt. Otherwise, it should be written to bignum.txt.
CHAPTER 5
Fundamentals of C++: Programming Actions

CHAPTER OUTLINE
5.1 Programming Actions
5.2 Sequence
5.3 Selection
5.4 Repetition
Computer programs consist of two parts: data and algorithms. In the previous chapters we have briefly introduced some of the more frequently used types of data available in the C++ language. We now turn our attention to the programming actions that C++ programmers may employ to process the data to yield the desired results. When these actions are carefully assembled, they embody the underlying algorithm, or recipe, which is used to process the data.
5.1 PROGRAMMING ACTIONS
Just as each data type has a proper application, so do the various programming actions. It can be shown that any algorithm can be written using three program structure types: sequence, selection, and repetition. The C++ language implements these three structures very concisely and gracefully. In the following sections we present these implementations in a structure that dictates under what conditions each programming keyword should be used. Later we will see how C++ combines data and programming actions to form objects which can be used by programmers as “black box” components to facilitate the solution of scientific and engineering problems.
5.2 SEQUENCE
The sequence programming structure is the one most commonly used. It is applied either independently by default or in conjunction with the other two structures, for, as we will soon see, the purpose of the other structures is to determine which sequence structure is applied or how many times a sequence structure is to be executed.

The sequence structure causes statements to be executed in an orderly, methodical, serial fashion. Thus, the ordering of statements within a sequence structure is very important. If a statement requires that the results of a previous statement be available, then the statements must be arranged in that same linear order in the program listing. Clearly written programs always use the physical structure of the program to reinforce the logical structure of the program. In the sequence structure, this means that you should place only one statement per line.

It is also important to place comments in your program to help the reader understand what you are doing. This is especially important at the start of long sequences of code. Because well written C++ code is almost self-documenting, you need not litter your code with unnecessary comments, but place appropriate comments at strategic points to give the reader the "big picture" of what the following code sequence is all about. There are two forms of comments in C++. The older form, inherited from the C language, requires that you surround the comment as shown below.

/* All text in this section is a comment and is
   ignored by the compiler. Comments can span
   multiple lines using this style of comment. */
C++ also supports a simpler syntax for single line comments. Any text following two successive forward slashes (//) is also treated as a comment, as in the declaration statement below:

double dia;   // diameter of the sphere
The assignment statement is the most common C++ programming action. The assignment statement appears similar to an algebraic equation; on the left side of an = is a variable. This variable must have been previously declared. (Note that even in the variable declaration/initialization statements described above, the variable type has been declared physically prior to the initialization portion of the statement.) On the right side of the = is an expression of a type that is compatible with the type of the variable on the left side.

5.2.1 Expressions

Expressions consist of operands and operators. Operands may be constants, variables, or other expressions. Operators are special symbols or keywords that perform actions upon the operands. C++ supports a very rich set of operators that perform their magic on operands in hierarchical ways that are sometimes very cryptic and hard to follow. A complete hierarchical list of the C++ operators is given in Appendix A. For the purposes of this brief introduction, we will concentrate only on the four basic arithmetic operators of addition (+), subtraction (-), multiplication (*), and division (/). We will also show how to use parentheses to group subexpressions to avoid ambiguity and to improve readability. Other operators will be introduced as they are needed. Expressions involving these operators are evaluated according to the following rules:
• Evaluation proceeds from left to right.
• Multiplication and division operations are performed before addition and subtraction.
• Expressions inside parentheses are evaluated completely, with their results being carried to the expression outside the parentheses as a single value.

EXAMPLE 5.1 Valid C++ Integer Expressions

Here are some valid C++ expressions involving simple integer arithmetic operators and parentheses. In these expressions assume the variable a is an integer currently having the value 1, b has the value 2, c has the value 3, and d has the value 4.
Expression                           Value
a + b + c + d - 10                   0
d * c / b                            6
d * (c / b)                          4*
(a + b) * (c + d)                    21
d * (a + (b + d) / c)                12
(a + b) / c + d                      5
(a + c) / (b + d)                    0*

* Give special attention to these answers that involve integer division. Since the result of dividing one integer by another integer can contain no fractional part, the non-integer portion of the quotient is truncated. It is never rounded up.

EXAMPLE 5.2 Valid C++ Floating Point Expressions

Here are some valid C++ expressions involving simple floating point arithmetic operators and parentheses. In these expressions assume the variable a is a double currently having the value 1.0, b has the value 2.0, c has the value 3.0, and d has the value 4.0.
Expression                           Value
a + b + c + d - 10                   0
d * c / b                            6
d * (c / b)                          6
(a + b) * (c + d)                    21
d * (a + (b + d) / c)                12
(a + b) / c + d                      5
(a + c) / (b + d)                    0.6666…
EXAMPLE 5.3 Valid C++ Logical Expressions

Here are some valid C++ logical expressions. In these expressions assume the variable a is an integer currently having the value 1, b has the value 2, c has the value 3, and d has the value 4. The parentheses are not required in every case shown below, but it is wiser to write expressions with parentheses than to trust your memory of the operator hierarchy given in Appendix A.
Expression                                Value
a == b                                    false
(a + b) == c                              true
(a + b) < d                               true
(a < b) && (c < d)                        true
(a < b) || (c > d)                        true
(a > b) || (c > d)                        false
(d > 2*b) && ((a <= c) || (b >= d))       false*
* The portion of the expression following the && will not even be evaluated! C++ uses "short circuit" expression evaluation, which detects that since the first part of the comparison is false (d is exactly equal to 2*b, but it is not greater than 2*b), there is no way the entire expression can be true since the connecting operator is &&.
5.3 SELECTION
The selection program action is used to make decisions. A program must be written so as to respond properly to any possible outcome of the decisions it makes. Thankfully, these decisions are not the open-ended kind that humans have to make routinely. Most computer program selection actions are fairly well defined and provide the means for the programmer to generate meaningful results regardless of the actual results of the decisions.

The most frequently used C++ selection program action is the if statement, of which there are four varieties. All variations of the if statement require the use of one or more logical expressions. A logical expression consists of bool variables and constants, comparisons between other types of variables and constants, and various operators. The value of a logical expression must evaluate to a bool true or false. The logical operators available are included in the list given in Appendix A. But the vast majority of logical expressions are formed using the comparison operators— == (equal to), != (not equal to), > (greater than), >= (greater than or equal to), < (less than), and <= (less than or equal to)—and the logical combining operators && (and) and || (or).

One of the most common mistakes in C++ programs is to confuse the single equals sign (=) with the double equals sign (==). The former is used in the assignment statement to transfer the value of the right-hand side of the statement to the variable on the left-hand side, while the latter is used to compare the values of the expressions on either side and return a Boolean true or false value.

5.3.1 When Additional Work Is To Be Done

The simplest selection program structure is the simple if statement, which has the following structure:

if (logical_expression)
{
    statements...
}
In this structure, the logical_expression is any valid expression of the type described in the previous section that evaluates to a bool true or false. If the value of the logical expression turns out to be true, then the block of statements... between the curly braces is executed. If the logical_expression evaluates as false, then no action takes place, just as if the statement were not present at all. Notice that there is no semicolon after the closing curly brace, but statements inside the curly braces will have terminating semicolons as normal. If the block of statements... is, in fact,
a single statement, then the curly braces can be omitted. However, to encourage a consistent programming style, we will always use the curly braces. This type of if statement is most often used to catch data input errors or other unusual or abnormal conditions. For example,

if (eps < 0.0)
{
    cout << "In the future, please always set eps > 0.0\n";
    eps = fabs(eps);   // fabs() returns the absolute value (declared in <cmath>)
}
5.3.2 Two Mutually Exclusive, Collectively Exhaustive Alternatives

This is a very common situation in which two, and only two, possible options exist. Exactly one of the options must be chosen and the other one must be ignored entirely.

if (reynoldsNumber < 2300.0)
{
    fd = laminarFd(reynoldsNumber);
}
else
{
    fd = turbulentFd(reynoldsNumber);
}
5.3.3 Three Or More Mutually Exclusive Alternatives

Sometimes the current conditions in a program require that possible actions be filtered for the precise action set that should be undertaken depending on the desired effect. Often, this structure is used to determine if exceptional (unusual or error) conditions exist and to take appropriate action. If no exceptional condition exists, the statement does nothing.

if (color == blue && size > 3.0)
{
    errorCode = 1;
}
else if (color == red && size > 6.0)
{
    errorCode = 2;
}
else if (color == green && size > 10.0)
{
    errorCode = 3;
}

When using a multiple test if statement, the order of the clauses can affect the execution speed of the program. The test that is statistically most likely to occur should be placed first, followed by the second most likely occurrence, etc.

5.3.4 Three Or More Mutually Exclusive, Collectively Exhaustive Alternatives
The multiple test if statement as shown above can be made collectively exhaustive (ensuring that at least one of the listed options is always taken) by adding a final else clause. The else clause must be placed last since it will "catch all" cases that have not been found to be true in the preceding if-else clauses.

A good analog of this program structure is a set of sieves used to determine the particle size in soils. When a bucket of soil is poured on the top of the sieves, the coarsest rocks are trapped on the top sieve, which has a very coarse screen. The rest of the soil drops down through the remaining screens, with each screen trapping the particles that are too coarse to pass through to the next level. The final sieve consists of a very fine screen trapping all but the finest powder particles. These drop through the fine screen into a pan which retains everything that it receives, of course. The various screens are analogous to the if-else if clauses and the pan at the bottom is analogous to the final else clause.

if (hoursEarned < 32)
{
    rank = freshman;
}
else if (hoursEarned < 64)
{
    rank = sophomore;
}
else if (hoursEarned < 96)
{
    rank = junior;
}
else
{
    rank = senior;
}
Note that the order of the clauses in the example above is very important. The form of the logical expressions (so-called one-sided tests) requires this. For example, if we had placed the test for junior standing first, it would be the option selected for any student having less than 96 hours earned! (A first-term freshman might benefit greatly from such a programming error by informing his parents that he is doing so well that he is already ranked as a junior!) We could eliminate such an order dependence by making the logical expressions two-sided at the expense of slightly slower program execution speed.

5.3.5 Special Case: Exact Matches Of Integral Options

C++, like many programming languages, provides a special form of the selection programming action to handle the very common case of selecting from among a finite number of discrete alternatives. One common example of this action is to invoke a program option based on a user input.

int option;
...
switch (option)
{
    case 1:
        generateTreeSystem();
        break;
    case 2:
        generateGridSystem();
        break;
    case 3:
        generateCompoundSystem();
        break;
    case 4:
        calculateSystem();
    case 5:
        printReport();
        break;
    default:
        printErrorMsg();
}

The switch selection structure is one of the more complex C++ programming actions. Notice that the expression at the start of the switch control action is an integer expression, not a boolean expression. Thus its value, which must have been set by some previous program action, can have any discrete value over the valid range of integers. The body of the switch control action enumerates the specific constant values of the integer expression that are of interest. An equality test is performed comparing the given integer expression with each of the case clauses in the order given until one is determined to be an exact match. When a match is found, the statements following the colon of the case option are executed.

The break statements as shown in options 1, 2, 3, and 5 in the example above are necessary to exit the switch selection block and transfer program execution to the next statement following the closing right brace at the bottom of the control structure. If a break statement is not included, as for option 4 in the example above, the program falls through to the next executable statement, paying no attention to the next case keyword. This convention is different from some other programming languages, which automatically exit to the bottom of the control structure before the next test value is encountered. In the example above, omitting the break statement after the call to function calculateSystem() is reasonable since the next logical step would be to printReport() anyway.

The default case at the bottom of the switch program structure is optional. It serves as the "pan" at the bottom of the "sieve," similar to the final else clause of a multiple option if-else statement as described in the previous section. If the default option is included, it must be listed last. It will "catch" any value of the integer control expression that has not matched any of the previous case options. Frequently the default case is used to handle illegal entries by setting a bool variable to indicate that the operator should be informed that an illegal value was entered and the value should be re-entered.
5.4 REPETITION
It has been said that “computers can’t solve any problem, but they can beat a problem to death.” The most important program action that we have to help us “beat those problems to death” is the repetition action. In later chapters we will see again and again numerical methods apply the power of repetition to hammer away at a problem until we get “close enough” to a true solution. Each of the repetition actions forms a complete block inside a function which we will always enclose in curly braces, even though they may be omitted if the body of the block is a single statement. When the repetition action terminates, program control is transferred to the statement following the closing curly brace of the repetition block. Every repetition block has three phases: initialization, execution, and termination. The initialization phase is often performed prior to the opening of the repetition block. In the initialization phase, variables are set to values or states that will cause the repetition block to begin execution. For those loops which are counter-controlled, this means that the loop counter variable is set to its initial value. Following the initialization phase, the execution phase begins. It is important to realize that during the execution phase, the entire block will be repeatedly executed until some condition arises that causes the loop to terminate. Thus, it is implicit that some time during the execution phase of the repetition block, some calculation must be made that potentially alters the state of a boolean expression that determines whether or not another pass through the execution phase will be made. C++ offers a well-defined set of repetition actions that should be employed properly. 
While it is true that all repetition actions can be expressed in terms of selection actions and unconditional transfers of control (the infamous goto statement, of which this is the first and last mention in this book), programs are much more readable if the following repetition actions are applied in the situations for which they were specifically designed.

5.4.1 When The Number Of Iterations Is Known In Advance

In many situations, the number of times a set of operations must be performed is known in advance. Typically, this occurs when the program is operating on a list or array of items. The same operations
must be performed on each element in the array. Since we have to know how many items are in the array in order to create it, we will know exactly how many times we must repeat an operation in order to apply the operation to every element in the array. The C++ for keyword is used to implement such a repetition structure.

const int n = 100;
...
double a[n];
...
for (int i = 0; i < n; i++)
{
    a[i] = 2.0 * i;   // the counter i is used like any other variable
}

The for statement contains three clauses within its parentheses. Each clause is separated from the
others with a semicolon. The first clause is the initialization action. The statement(s) in this clause is(are) performed only once prior to the first pass through the execution block of statements which are enclosed in curly braces below the for statement. In C++ it is considered good form to declare the loop counter variable, if one is used, as illustrated in the example shown above. In the C programming language, the counter variable must be declared in the declaration portion at the start of the function or as an external variable. This will still work in C++, but the new format is preferred. Using the newer format makes the counter variable available within the block following the for statement, but unavailable following the closing curly brace. Thus, the same counter variable can be redeclared and used in a subsequent for statement. The second clause in the for statement is a bool expression that determines whether to continue the loop repetition or not. If the bool expression evaluates to true prior to the start of execution of the statements in the block in the curly braces, then all of the statements in the block will be executed. If the expression is false, control is transferred to the statement that immediately follows the block of statements enclosed in the curly braces. The order implied here is important. The for statement employs a pretest control expression. It is tested prior to executing the statements in curly braces even the first time. If the control expression is initially false, the for loop will not be executed at all! The third clause in the for statement specifies the action to be taken at the end of each execution of the block in curly braces, but before the bool expression is evaluated to determine whether another pass through the execution block is to be performed. 
Taken as a whole, the for statement in the example given above could be interpreted as follows: “Beginning with the value of i set to zero, repeatedly execute the block of statements enclosed in curly braces, incrementing the variable i by one at the end of each iteration, until i reaches the value of n. When i reaches the value n, do not execute the block, but transfer control to the next statement following the closing curly brace.” Since the loop counter variable is visible within the block following the for statement, it can be used just like any other variable, as illustrated in the example above. (You can also alter the loop counter variable, but this is considered to be a very bad practice.) Expert programmers use the for statement in some very exotic repetition situations because, with its three clauses, it is very versatile. We will avoid illustrating these examples since they tend to obfuscate the primary purpose of the for statement, which is well illustrated in its classical form in the example above.

5.4.2 When The Number Of Iterations Is Not Known In Advance

Often, especially in the implementation of many of the numerical methods that we will present later,
we do not know a priori exactly how many repetitions will be needed to accomplish a desired result. Just as the for repetition action is associated with stepping through the elements in arrays, the two forms of the while repetition action discussed below are associated with iterative loops that must repeat until some sort of convergence is achieved. In contrast to the for loops discussed above, where the various clauses of the for statement itself control loop repetition, the calculations necessary to control whether the loop is repeated or not are contained within the scope of the loop. The repetition of the loop is decided on the basis of the value of a bool expression. If no calculation is done within the loop that can potentially change the value of the controlling bool expression, then the loop will repeat infinitely. The two variants of the while loop presented below differ only in the point during loop execution at which the controlling bool expression is evaluated.

5.4.2.1 Pre-Test

The pre-test form of the while statement is similar to the for statement in that the test to determine whether to perform another pass through the loop body is performed prior to the start of each loop pass. Thus, if the controlling bool expression is false, the entire loop is skipped. This is true even prior to the initial execution of the loop. For this reason, this type of loop is also sometimes called a zero-trip loop. If the controlling bool expression is false when the loop is first reached, the loop body is never executed at all. You must keep this possibility in mind and be sure no calculations are made in the body of the loop that are necessary for proper execution of the remainder of the program. The post-test form of the while statement should probably be used in that situation. In Example 5.4, we use the elementary incremental search method to find where a function changes sign. A brief explanation of this example is in order.
Line 1 is a compiler directive which informs the compiler that this program will require support for some math operations that are not built into the core language (namely, the natural log function log(x)). Line 2 declares the name of the function. Every C++ program is required to have a function named main(). This is where the program will begin executing code. Lines 3 and 14 form a brace pair which delimits the block of code that constitutes function main(). At line 4, some of the needed variables are declared and initialized. Line 5 declares the variable fx and calculates its value based on the current value of x, which is 0.1. In line 6, variable x is incremented to 0.1625. Variable fxpdx (“f of x plus deltax”) is declared and calculated based on the newly revised value of x. Line 8 begins the while repetition structure. Note that the expression in parentheses is of type bool, which can have only a true or false value. If the two variables fx and fxpdx are currently both positive or both negative, then their product will also be positive, resulting in a true value of the bool expression. This will cause the loop enclosed in curly braces to be executed. As the loop is executed, the value of x is advanced by deltax, the current value of fxpdx is copied to variable fx, and the new value of fxpdx is calculated. (Later we will see how the use of functions would have eliminated the unnecessary retyping of the long expression which provides the values of fx and fxpdx.) At the end of the block in curly braces, program control returns to the while statement, where the bool expression is again calculated to determine whether another repetition is required or not. This sequence continues until the values of fx and fxpdx have opposite signs, thus finding a region in which the function is guaranteed to have a zero (cross the x-axis).
Clearly, we do not know in advance how many iterations will be required, thus suggesting the use of the while loop structure rather than the for loop structure.

O EXAMPLE 5.4 A while Loop

 1)  #include <cmath>
 2)  int main()
 3)  {
 4)      double x = 0.1, deltax = 0.0625;
 5)      double fx = 1.0/x + log(x+1) - 5.0;   // log() is the natural log
 6)      x += deltax;     // shorthand way of writing x = x + deltax
 7)      double fxpdx = 1.0/x + log(x+1) - 5.0;
 8)      while (fx * fxpdx > 0.0)
 9)      {
10)          fx = fxpdx;
11)          x += deltax;
12)          fxpdx = 1.0/x + log(x+1) - 5.0;
13)      }
14)  }
O

5.4.2.2 Post-Test

The post-test form of the while statement must be used when it is necessary (or more convenient) to calculate information within the loop so that it can be used to determine whether or not to repeat the loop again. The body of the loop is always executed at least once, hence this form of the while loop is sometimes called a one-trip loop. The previous example can be recast to use the do-while repetition action as shown in Example 5.5. From this example, it is easy to see why this type of loop is also called a do-while loop. Notice that this form of the incremental search algorithm is actually a little shorter than the previous version. This is because we do not have to calculate two values of the function at two successive values of x before the loop starts. Only one value is needed prior to starting the loop because the second value is calculated inside the loop body during the first pass. However, this loop structure does cause us to take the somewhat awkward step of calculating the value of the function at the value of x = 0.1 and assigning it to the variable fxpdx instead of fx, as the variable name would suggest. We often have to do this kind of initialization to properly handle the first pass through a loop. Notice that the first thing that happens inside the loop body is to copy the value of fxpdx to variable fx. Had we assigned the result of the expression on the right hand side in line 6 to variable fx instead of variable fxpdx, it would have immediately been overwritten with garbage from the uninitialized contents of variable fxpdx. The end result of this post-test loop is exactly the same as for the pre-test loop. In both cases, upon exiting the loop, the value of x will be 0.225, fx will be 1.30442, and fxpdx will be -0.352615. In most cases, the choice between the two forms of the while loop is a matter of programming style and personal preference.

O EXAMPLE 5.5 A do-while Loop

 1)  #include <cmath>
 2)  int main()
 3)  {
 4)      double x = 0.1, deltax = 0.0625;
 5)      double fx;
 6)      double fxpdx = 1.0/x + log(x+1) - 5.0;
 7)      do
 8)      {
 9)          fx = fxpdx;
10)          x += deltax;
11)          fxpdx = 1.0/x + log(x+1) - 5.0;
12)      } while (fx * fxpdx > 0.0);
13)  }
O
Glossary
comment  documentation provided in a computer program intended for human readers only; it is ignored by the C++ compiler.
counter  an integer variable used to determine when to terminate a definite repetition programming action.
definite repetition  a repetition programming action that will repeat a known number of times.
expression  a valid combination of operands and operators that produces a value that can be stored in a C++ variable of the appropriate type.
indefinite repetition  a repetition programming action that will repeat as many times as necessary in order to satisfy some Boolean expression.
operand  a constant, variable, or intermediate calculated value that is acted upon by an operator.
operator  a symbol that modifies or combines one or more operands as a part of an expression.
post-test loop  a repetition programming action in which the decision whether to repeat the loop statements or not is made at the end of the loop.
pre-test loop  a repetition programming action in which the decision whether to repeat the loop statements or not is made at the start of the loop.
repetition  a programming action in which operations are performed repetitively until some condition changes.
selection  a programming action in which operations are performed conditionally depending on the results of comparisons made among variables.
sequence  a programming action in which operations are performed sequentially in the order in which they appear in the program code.
Chapter 5 Problems

5.1  TRUE/FALSE
a) ____ The arithmetic operators *, /, %, +, and - all have the same level of precedence.
b) ____ Exponentiation in C++ is denoted by using two asterisks back-to-back, i.e., **.
5.2  Given the equation y = ax³ + 7, label each of the following C++ statements as being either a correct (C) or incorrect (I) representation of this equation.
a) ____ y = a * x * x * x + 7;
b) ____ y = a * x * x * (x + 7);
c) ____ y = (a * x) * x * (x + 7);
d) ____ y = (a * x) * x * x + 7;
e) ____ y = a * (x * x * x) + 7;
f) ____ y = a * x * (x * x + 7);
5.3  What, if anything, prints when each of the following C++ statements is performed? If nothing prints, then answer “nothing.” Assume x = 2 and y = 3.
a) ____ cout << x;
b) ____ cout << x + x;
c) ____ cout << "x=";
d) ____ cout << "x = " << x;
e) ____ cout << x + y << " = " << y + x;
f) ____ z = x + y;
g) ____ cin >> x >> y;
h) ____ // cout << "x + y = " << x + y;
i) ____ cout << "\n";
5.4  List the three types of control structures from which all programs can be constructed.

5.5  List the four forms of the if statement.

5.6  What is the difference between a pre-test and a post-test repetition structure?
5.7  Determine the values of each variable after the calculation is performed. Assume that when each statement begins executing all variables have the integer value 5.
a) _____ product *= x++;
b) _____ quotient /= ++x;
5.8  Write a C++ program that illustrates common mathematical operations involving int, float, and double data types. Create int variables ia = 1, ib = 2, and ic = 3; float variables fa = 1.0, fb = 2.0, and fc = 3.0; and double variables da = 1.0, db = 2.0, and dc = 3.0. Your program should print out the results of several different expressions using the addition, subtraction, multiplication, and division operators with these variables. Be sure to include some expressions that involve integer division.
5.9  We want you to apply engineering principles while decorating your Christmas tree this year. Since ornaments placed on a branch having a negative slope may slide off, fall to the floor, and break, we must avoid this situation. Therefore, before hanging each ornament, we need to measure its weight as well as the branch length, diameter, and the angle it makes with the tree trunk. Research the properties of spruce (or Douglas fir, etc.) in your engineering materials textbook. Assume the diameter of the branch is constant. Write a C++ program that will accept this information from the user and determine whether or not it is safe to place the ornament on the branch.
SAMPLE DATA:
Branch diameter: 10 mm
Branch length: 300 mm
Angle of branch with respect to tree trunk: 70 degrees
Mass of ornament: 20 grams
5.10  An air conditioner both cools and dehumidifies a factory. The system is designed to turn on (1) between 7 a.m. and 6 p.m. if the temperature exceeds 75 degrees F and the humidity exceeds 40%, or if the temperature exceeds 70 degrees F and the humidity exceeds 80%; or (2) between 6 p.m. and 7 a.m. if the temperature exceeds 80 degrees F and the humidity exceeds 80%, or if the temperature exceeds 85 degrees F, regardless of the humidity. Write a C++ program that inputs temperature, humidity, and time of day and displays a message specifying whether the air conditioner is on or off. Test the program with the following data:
(a) Time = 7:40 p.m., temperature = 81 degrees F, humidity = 68%
(b) Time = 1:30 p.m., temperature = 72 degrees F, humidity = 79%
(c) Time = 8:30 a.m., temperature = 77 degrees F, humidity = 50%
(d) Time = 2:45 a.m., temperature = 88 degrees F, humidity = 28%
CHAPTER 6

Fundamentals of C++: Functions

CHAPTER OUTLINE
6.1 Using Functions
    6.1.1 Eliminating Repetitive Code Sequences
    6.1.2 Algorithmic Encapsulation
    6.1.3 Passing Arguments
One key to effective computer programming is to organize the code in a logical fashion. There are several programming paradigms that can be used to encourage the programmer to organize code in a manner that is understandable by other programmers, thus facilitating future program maintenance. Examples of these programming paradigms include structured programming and modular programming. However, one programming tool that virtually all computer languages support is the use of functions.
6.1 USING FUNCTIONS
A function is a block of code that operates more or less independently, but in a subordinate capacity, and is usually written to accomplish a single purpose within the program. Functions are compiled and executed relatively independent of each other. This means that variables declared inside one function are local to that function, i.e., they are not visible from anywhere outside the function in which they are declared. This characteristic makes functions ideal tools for implementing the principles of structured and/or modular programming. Variables that are declared outside of any function are called global variables. They are visible to all functions that are defined following their declaration. When a function declares a local variable to have the same name as a global variable, all references to the variable name are interpreted to mean the local variable rather than the global variable. This behavior can be overridden by prefixing the variable name with the scope resolution operator (::). Thus, if a variable named sum is declared inside a function and there also exists a global variable having the name sum, any reference to the global variable named sum would have to be written as ::sum. In general, functions should be used for two purposes—to eliminate repetitive sequences of program code and to implement algorithmic encapsulation.
6.1.1 Eliminating Repetitive Code Sequences
In Example 5.4 we used code to compute numbers from an expression that involved the natural log function. In that brief example the identical expression was repeated three times. In a more realistic example we might find code fragments extending over several lines repeated many times in a program. Chances are good in these cases that a typing error will occur in one or more of these code repetitions. Such errors are difficult and time consuming to track down. It is even more likely that if revisions to the program are required, some of the changes will not be made in every one of the repeated code sequences. Using lines of repeated code in this manner gives programs an undesirable characteristic called poor locality. Locality refers to the ability to make a change at a single point and have its effect propagate throughout the program automatically. So one obvious way to improve a program's locality is to eliminate lines of repeated code, or even repeated expressions as in Example 5.4. We can do this quite nicely using functions. In Example 6.1, we give the same code as in Example 5.4, but this time we use a function to compute the value of the repeated expression. In this implementation, we introduce the use of a function to calculate the expression for which we are seeking a sign change. On the surface the program appears to be slightly more complicated because it is longer and involves more program blocks. But a closer examination will reveal that this implementation has much better locality than the previous version. Notice that the expression to be evaluated appears only once, on line 4. On lines 9, 11, and 16, where the expression previously appeared, we now find on the right hand side of the assignment statements a call to function f, with the current value of the variable x being passed to the function to be used in evaluating the expression. The improvement in locality is obvious.
If we wanted to change any part of the expression being searched, we would have to change it in only one place, on line 4. The calls to the function on lines 9, 11, and 16 will then automatically use the updated expression. In the previous example, we would have to retype the expression three times to properly effect any changes, thus tripling the chance of a typing error. Passing arguments to functions will be covered shortly. Suffice it to say at this point that in function main(), the variable fx is entirely different from a call to the function f(x), despite their similar appearance. Likewise, the variable x in function f() should not be confused with variable x in function main(). Since these two variables are declared in two different functions, they are absolutely unrelated.

O EXAMPLE 6.1 Using a Function to Eliminate a Repeated Expression

 1)  #include <cmath>
 2)  double f(double x)
 3)  {
 4)      return 1.0/x + log(x+1) - 5.0;
 5)  }
 6)  int main()
 7)  {
 8)      double x = 0.1, deltax = 0.0625;
 9)      double fx = f(x);    // call function f(x) to evaluate expression
10)      x += deltax;         // shorthand way of writing x = x + deltax
11)      double fxpdx = f(x);
12)      while (fx * fxpdx > 0.0)
13)      {
14)          fx = fxpdx;
15)          x += deltax;
16)          fxpdx = f(x);
17)      }
18)  }
O

6.1.2 Algorithmic Encapsulation
Another reason to use functions is to achieve algorithmic encapsulation. This term simply means that we should attempt to group the lines of code in programs in such a way that a given sequence of program lines achieves a single identifiable result. Having grouped code in this fashion, it is considered good programming practice to pull this code out and implement it as a function, even if it appears only one time in a program. Developing this habit early will encourage programmers to write reusable code that can eventually reside in program libraries and be used later in a “black box” fashion. Writing reusable code is a key to improving software development productivity. Writing code that makes extensive intelligent use of functions also makes maintaining and modifying the code much easier.

6.1.3 Passing Arguments

Functions are algorithms, or recipes, that generally manipulate input data and produce output data, or results. Input data are passed to the function through arguments. If a function calculates only a single number (a very common case), then the result should be returned as the value of the function. If the function calculates no result or more than a single value, then output arguments can be used.

6.1.3.1 By Value

Function input arguments are usually passed by value. Arguments passed by value cannot be changed within the function. This is the default method for passing arguments and affords the greatest degree of protection for the program data. Some other programming languages pass arguments by reference as the default, which can lead to data corruption. When an argument is passed by value, a copy is made of the argument and the copy is passed to the function. When a function has multiple arguments, some may be passed by value and some passed by reference, as is appropriate in each case. In Example 6.2, a function named sind() is defined to calculate the sine of an angle which is passed by value as the single input argument.
The function converts the argument to radians and calls the standard library sin() function to do the actual calculation. Inside the function, the argument angle appears on the left hand side of an assignment statement. But because the formal argument in the function definition is declared to be passed by value, the actual argument passed by the main function is not altered.

O EXAMPLE 6.2 Passing Function Arguments By Value

#include <cmath>
#include <iostream>
using namespace std;

double sind(double angle)
{
    angle *= M_PI / 180.0;   // M_PI is defined in <cmath> by most compilers
    return sin(angle);
}

int main()
{
    double x;
    for (int i = 0; i < 5; i++)
    {
        cout << "Enter an angle in degrees: ";
        cin >> x;
        cout << "The sine of the angle is " << sind(x) << endl;
    }
}
O

6.1.3.2 By Reference

Function arguments and return values can be passed by reference if they are prefixed with the & character. (The & character may be appended to the variable type specifier, prepended to the variable name, or set apart as a separate character. We will adopt the style of appending the & to the variable type specifier.) It is desirable to pass arguments by reference when their values should reflect changes made inside the body of the function, or when the arguments are large structures or arrays that would require an excessive amount of time or storage space to copy if passed by value. Note that the & symbol is used only in the function definition. No special notation is used when the function is called. The example program shown below illustrates a function that uses two arguments passed by value to supply the input values and two reference arguments to return the calculated values to the calling function.

O EXAMPLE 6.3 Passing Arguments By Reference

#include <cmath>
#include <iostream>
using namespace std;

void rtop(double x, double y, double& r, double& a)
{
    r = sqrt(x*x + y*y);
    a = atan2(y, x);
}

int main()
{
    double x = 3.0, y = 4.0, radius, angle;
    rtop(x, y, radius, angle);
    cout << "radius = " << radius << " angle = " << angle << " radians\n";
}
O
Glossary

argument  a value that is sent to a function as an operand.
function  a sequence of programming actions that are physically grouped together in a single program unit and receive one or more arguments from a calling program and return one or more results to the calling program.
modular programming  a programming technique in which procedures of a common functionality are grouped together into separate modules. A program therefore no longer consists of only one single part; it is divided into several smaller parts which interact through procedure calls and which form the whole program.
pass by reference  the addresses of function arguments are passed to the function, thus enabling both read and write access to the values of the arguments.
pass by value  values of function arguments are copied to temporary storage locations before being sent to the function for action.
side effect  changes to arguments (usually unintended) that are passed by reference to functions.
structured programming  a technique for organizing and coding computer programs in which a hierarchy of modules is used, each having a single entry and a single exit point, and in which control is passed downward through the structure without unconditional branches to higher levels of the structure.

Chapter 6 Problems

6.1
Fill in the blanks.
a) The _________ statement in a called function is used to pass the value of an expression back to the calling function.
b) The keyword ________ is used in a function header to indicate that a function does not return a value or to indicate that a function contains no parameters.
c) A variable declared outside any function is called a _________ variable.
d) C++ normally passes function arguments by __________.
e) What symbol is used to indicate that a function argument is to be passed by reference rather than by value? _____
6.2  Norton⁴ gives the following procedure for analytically determining the two possible locations of the coupler bar (points B and B′) in the figure shown below.

The coordinates of point A are found from

$$A_x = a\cos(\theta_2) \qquad A_y = a\sin(\theta_2)$$

The coordinates of point B are found using the equations of circles about A and O4:

$$b^2 = (B_x - A_x)^2 + (B_y - A_y)^2$$

$$c^2 = (B_x - d)^2 + B_y^2$$

which provide a pair of simultaneous equations in Bx and By. Subtracting the second equation from the first, we obtain the following expression for Bx in terms of By:

$$B_x = \frac{a^2 - b^2 + c^2 - d^2}{2(A_x - d)} - \frac{A_y B_y}{A_x - d}$$

Substituting this for Bx in the second equation gives the following quadratic expression for By:

$$B_y^2 + \left[\frac{a^2 - b^2 + c^2 - d^2}{2(A_x - d)} - \frac{A_y B_y}{A_x - d} - d\right]^2 - c^2 = 0$$

Applying the quadratic formula and making numerous substitutions yields the following solutions for Bx and By:

$$B_y = \frac{-Q \pm \sqrt{Q^2 - 4PR}}{2P} \qquad B_x = S - \frac{A_y B_y}{A_x - d}$$

where:

$$P = \frac{A_y^2}{(A_x - d)^2} + 1 \qquad Q = \frac{2A_y(d - S)}{A_x - d} \qquad R = (d - S)^2 - c^2 \qquad S = \frac{a^2 - b^2 + c^2 - d^2}{2(A_x - d)}$$

Note that the solutions for this problem can be real or imaginary. If the latter, it indicates that the links cannot connect at the given input angle or at all. Once the two sets of values for (Bx, By) are found, they can be used to find the link angles θ3 and θ4 for the two possible configurations using the following formulas:

$$\theta_3 = \tan^{-1}\!\left(\frac{B_y - A_y}{B_x - A_x}\right) \qquad \theta_4 = \tan^{-1}\!\left(\frac{B_y}{B_x - d}\right)$$

A two-argument arctangent function must be used in the equations above since the angles can be in any quadrant. Write a C++ program that will read the following input parameters from a data file named 4bar.inp and will compute the coordinates of point B and the angles θ3 and θ4 for both the open and crossed configurations in each case. Print the input data and results in formatted columns.

⁴ Robert L. Norton, Design of Machinery, 2nd Ed., McGraw-Hill, 1999, pp. 152-153.
Row   Link 1   Link 2   Link 3   Link 4    θ2
 a       6        2        7        9      30°
 b       7        9        3        8      85°
 c       3       10        6        8      45°
 d       8        5        7        6      25°
 e       8        5        8        6      75°
 f       5        8        8        9      15°
 g       6        8        8        9      25°
 h      20       10       10       10      50°
 i       4        5        2        5      80°
 j      20       10        5       10      33°
 k       4        6       10        7      88°
 l       9        7       10        7      60°
 m       9        7       11        8      50°
 n       9        7       11        6     120°

6.3
In many localities, sales tax rates are different for different types of merchandise. For example, suppose in a particular locality the sales tax rate for medicine is 0%, the rate for groceries is 5%, and the rate for everything else is 8.5%. As merchandise is scanned, the price of each item is coded according to the item's category and passed on to the cash register computer for processing. Write a C++ program to simulate this process. The program should read the item information from a data file named sales.dat. Each record (line) in the data file has the following format. The first character will be a letter identifying the product category. An upper or lower case M stands for medicine; an upper or lower case G stands for grocery; any other letter stands for general merchandise. The price of the item in standard two place decimal notation starts in column 2 of the record. The number of records in the file will be variable, depending on the number of items purchased. Your program should accumulate and print the total amount of purchases in each category as well as the final total. Round the final total to the nearest penny.
SAMPLE DATA:
x12.50
M14.27
g0.79
g1.56
x12.45
y18.20
z13.34
m23.00
6.4
The conventional four-bar mechanism shown below can be classified according to the rotatability of one or more of its links. If one or more of the links can make a full 360-degree revolution with respect to any other link, the mechanism is said to be Grashof. If no link can make a full 360-degree revolution, then the mechanism is classified as non-Grashof. A mechanism is Grashof if the following equation is satisfied:
S + L <= P + Q
where S is the length of the shortest link in the mechanism, L is the length of the longest link in the mechanism, and P and Q are the lengths of the other two links. The lengths of the four links can be determined from the coordinates of points A, B, C, and D in the figure below.
Write a C++ Boolean function that will accept the coordinates of points A through D as eight scalar double values and will return the value true if the resulting mechanism is Grashof and false otherwise. Write a C++ driver program to demonstrate the proper operation of the function using the following sample data.

                         NODE COORDINATES
CASE #    Ax     Ay     Bx     By     Cx     Cy     Dx     Dy
  1       0.0    0.0    5.0    2.0    5.0   12.0    6.0    0.0
  2       3.0    2.0   -1.0    8.0    7.0    8.0   13.0    0.0
  3      -4.0   -3.0    6.0    4.0    7.5    3.0   12.0   -2.0
  4       0.0    0.0    0.0    4.0    4.0    4.0    4.0    0.0

6.5
A multiple answer question used by some online testing systems is similar to a multiple choice question, but more than one answer may be checked. One major testing system grades such questions on an all or nothing basis, i.e., the student must mark exactly the same correct answers as the answer key in order to get any credit at all for the question. If the student marks an answer as correct and the solution key does not show it as a correct response, or if the student fails to mark any answer as correct that the solution key shows as correct, then no credit at all is given for the problem.
Professor “Easy-A” Corley feels that this is unfair and desires a more reasonable grading function for multiple answer questions that gives the student partial credit. If the student's response for each question answer matches the solution key, then credit should be given for that answer. Only if the solution key for an answer differs from the student response should points be taken off. For example, suppose a multiple answer question has four possible answers and the solution key indicates the first three should be marked. If a student marks only the first response as correct, then the grading function should return a score of 2, because the first and fourth responses matched the answer key. Your function header should look like this:

int gradeMultipleAnswer(int noa, unsigned int key, unsigned int studentAnswer);
The first argument is the total number of possible answers, from 1 to 32. The function return value will be an integer ranging from zero to noa, depending on how many of the student answers matched the key. Arguments key and studentAnswer are 32-bit integers, where each bit represents whether or not the respective answer was marked. The table below contains some sample data you can use to test your grading function. Some of the correct answers are given to help you debug your program.

noa   key            studentAnswer   function return value
 4    10 (0b1010)     3 (0b0011)     2
 4    10 (0b1010)     8 (0b1000)     3
 4    10 (0b1010)    10 (0b1010)     4
 4    10 (0b1010)     0 (0b0000)     2
 8    170             85             ?
 5    13               3             ?
CHAPTER
7
Some Numerical Programming Fundamentals CHAPTER OUTLINE 7.1 Introduction 7.2 Errors in Numerical Computations 7.2.1 Truncation Error 7.2.2 Roundoff Error 7.2.3 Calculating Machine Epsilon
7.1
INTRODUCTION
Although we make much of the Object-Oriented Programming (OOP) characteristics of C++, one of the distinctions of the C language upon which C++ is built is its heavy reliance on the use of functions. Even the main program is a function. This consistent focus on well-written, more or less self-contained units of computer code which do one thing and do it very well is a natural place to begin applying the C++ language to the solution of problems using numerical methods.
7.2
ERRORS IN NUMERICAL COMPUTATIONS
A key to success in using the digital computer to solve numerical problems in engineering and science is understanding the causes of errors in computed results and learning how to control them. An error should be thought of simply as the difference between a computed solution and the correct answer. An error is not the same thing as a mistake, although mistakes can, and usually do, cause errors. Even if there are no programming or logical mistakes in a computer program, there will almost certainly be error in the solutions computed by the program. These errors are inherent in the use of a digital computer for most practical problems. Put in simple terms, there will be errors in all of our computed solutions because we are using arithmetic to approximate higher mathematical operations such as differentiation and integration, and even the arithmetic we use is imprecise. Despite these severe restrictions that will be explored in only sketchy detail in the following sections, usable
solutions to real problems can be generated with a high degree of confidence if due caution is given to dealing with these inherent sources of error.

7.2.1 Truncation Error

The first source of error that affects most numerical calculations is truncation error, and it actually has nothing to do with digital computers. By truncation error, we mean the error that is incurred when a Taylor series is truncated after a finite number of terms. This is somewhat confusing to the new programmer, because the term “truncation” is also used to represent the loss of digits to the right of the decimal point in integer division, as illustrated in Example 5.1. Such confusion is usually inevitable. In the present context, truncation error is controlled to a large extent by a priori decisions made by the programmer in two key areas. First, when solving a problem that requires an algorithm based on a Taylor series, such as numerical integration or differentiation, the programmer will have a choice of what order method to use. The order of the method is directly related to how many terms of the Taylor series are used to approximate the mathematical operation being performed. Usually we expect that higher order methods, which use more terms of the Taylor series, will give better results than lower order methods. We will not pursue the derivation of the magnitude of truncation error for the numerical methods that we use in this text, but the reader should have the general understanding that a method designated as “second-order accurate” uses one more term of the Taylor series in its development than a “first-order” method. Similarly, methods that are “third-order” or “fourth-order” have correspondingly lower truncation error because more terms of the Taylor series were used in their formulation.
It is also generally true that the expressions for “higher order” methods tend to be more complex than those for “lower order” methods. However, it is not true that higher order methods take more time to execute than lower order methods. Quite the contrary is true. One major reason for using a higher order method for a particular problem is to reduce the program execution time. Of course, these are sweeping generalizations, and notable exceptions abound.

Once a particular order method is chosen, a second factor that the programmer can select can also dramatically affect the accuracy of the final result. That factor is the step size of the independent variable in the Taylor series approximation. Consider the example that is often used when introducing the concept of the integral in calculus. Typically, the instructor will choose a simple curve, say y = x^2, and divide the region between two given values of x into rectangular strips having equal widths. The area under the curve can be approximated by summing the areas of the strips. As the width of the strips is decreased, the accuracy of the approximation improves until the true value of the integral is obtained as the strip width approaches zero. In numerical calculations, we cannot let the strip width go to zero, for that would result in an infinite number of strips to be processed. So we choose a priori some strip width which produces an estimate of the true area that we trust (or hope) is “close enough.” Now, the process described above is exactly the approach that will be taken when computing integrals in Chapter 12. Choosing the order of the method to use for integration will be analogous to choosing the shape of the curve that we assume forms the top edge of each strip—horizontal line, straight line, parabola, etc. And choosing the width of the strip will complete our specification of the amount of truncation error we will accept in our computed value of the integral.
Generally, it is not possible to accurately quantify the amount of truncation error that will result (if we could, then we could combine that value with our integral estimate and compute the exact answer), but it is possible to make some generalizations:

• For the same step size, higher order methods have lower truncation error.
• For a given order method, truncation error decreases at a predictable rate and approaches zero as the step size decreases.
• For a specified accuracy, the step size required for a higher order method is larger than the step size required for a lower order method. Thus, to cover a fixed distance, fewer steps are required, usually resulting in faster program execution.

To summarize, truncation error in numerical calculations results from choosing to use fewer than all the terms in a Taylor series approximation. Truncation error is largely controlled by a priori decisions made by the programmer and is independent of the computer used to execute the program.

7.2.2 Roundoff Error

Roundoff error results from the inability of the computer to store an infinite number of digits. Computers use floating point numbers to represent numbers that are not integers. The basic properties of floating point numbers were presented in Section 2.3. It doesn't take an extremely complicated program to illustrate the limitations of floating point arithmetic, as shown in Example 7.1.

O EXAMPLE 7.1 Simple Illustration of Roundoff Error

The following simple program shows how even simple calculations can be difficult to perform correctly because of roundoff error.

#include <iostream>
using namespace std;

int main()
{
    float dx = 0.1, x = 0.0;
    for (int i = 1; i <= 1000; i++) {
        x += dx;
        cout << i << " " << x << endl;
    }
}
This program computes a running total in the variable x by successively adding the variable dx to the previous total. Obviously, the correct answer at any point should be the total number of iterations performed divided by 10. But because the binary representation of the decimal number 0.1 is an infinite repeating pattern of 0's and 1's, it cannot be represented exactly on any computer. Though the actual results produced by this program depend on several factors, here are sections of the 1000 lines of results produced when run using the gcc compiler on a personal computer:

1     0.1
2     0.2
3     0.3
4     0.4
5     0.5
6     0.6
7     0.7
8     0.8
9     0.9
10    1
...
400   40
401   40.1
402   40.2
403   40.3
404   40.4
405   40.5
406   40.6
407   40.7
408   40.7999
409   40.8999
410   40.9999
...
535   53.4998
536   53.5998
537   53.6998
538   53.7998
539   53.8997
540   53.9997
541   54.0997
542   54.1997
...
994   99.3991
995   99.4991
996   99.5991
997   99.6991
998   99.799
999   99.899
1000  99.999
For the first several iterations, things proceed as expected. Then, after 407 iterations the numerical value of the sum begins to consistently understate the correct answer by 0.0001. After about another 100 iterations, the error increases to 0.0003 and by the end of the sequence, the error has increased to 0.001. O
Example 7.1 illustrates the general behavior of roundoff error. Because one (or more) of the numbers in a calculation cannot be represented precisely, it corrupts the result whenever it is used. When this corrupted result is subsequently combined with another corrupted result from a different calculation, the result is slightly worse. After many hundreds or thousands of calculations involving these numbers, the results can be significantly in error because of roundoff. Several steps can be taken to reduce the effects of roundoff error, though in some cases accumulated roundoff error will limit the applicability of a particular numerical method. Several books discuss one or more of these techniques in more detail. They are presented here with minimum discussion.

• Use double precision arithmetic to increase the number of significant digits in all calculations. Roundoff error always occurs in the least significant digits (those farthest to the right), so adding more digits effectively “protects” the leftmost digits, which always have larger magnitude than those to the right. (The general idea is that if roundoff is going to happen, it should happen to digits that have the least possible effect on the accuracy of the calculation.)
• Avoid subtracting numbers that are nearly equal in magnitude. Subtracting numbers having approximately the same magnitude eliminates the “good” digits on the left-hand end of the numbers and keeps the “bad” digits that have been corrupted by roundoff.
• Add numbers in inverse order of magnitude. It's possible that the accumulated sum of the smaller numbers will be significant if they are combined first, whereas they will be ignored when addition is attempted after the sum is already very large. This effect is caused by the requirement that decimal (and binary) points must be aligned before numbers can be added.
If the smaller number has to be shifted so that there are several zeros between the point and the first digit, the entire number may be ignored and considered to be zero.
• Keep the magnitude of numbers as small as possible. This reduces the “spread” between adjacent floating point numbers as their exponents increase. Practically, this means numbers should be scaled whenever possible and divisors should be the largest possible number. An example of this technique is illustrated in Section 10.2.3.1 on page 92.

7.2.3 Calculating Machine Epsilon
The notion of machine epsilon is of both theoretical and practical importance. Machine epsilon is defined as the smallest floating point number that can be added to the number 1.0 (unity) which will produce a sum that is distinguishable from the number 1.0. To the beginner this sounds preposterous! How can it ever be true that the sum of two nonzero numbers is the same as the first number? In the world of real numbers, this can be true only if the second number is precisely 0.0. But in the world of floating point numbers, the requirement that decimal points be aligned before two numbers are added can result in the sum of the two numbers being the same as the first number. For example, if we are using a computer that can process floating point numbers having seven significant digits and wish to add the numbers 1.000000 and 1.0e-10, the floating point sum would be exactly 1.000000, because representing the true sum, 1.0000000001, would require more than seven significant digits. All digits to the right of the seventh significant digit would be truncated. Of course, in digital computers this process is done using the binary number system, so the value of machine epsilon is a power of 2 and not a power of 10 as used above. But the exponent portion of the value of machine epsilon should give us some indication of how many significant digits the computer is capable of processing when working with floating point numbers. We should certainly expect to find the values of machine epsilon to be different for single precision and double precision
floating point numbers. From a practical perspective, it is important that we know what our computer considers to be a “small” number. This is not the same as the smallest number that can be represented in floating point on the computer; typically machine epsilon is much larger than the smallest representable number. We often find it convenient to use some multiple of machine epsilon to determine when an iterative scheme has converged or when a divisor is “close” to 0.0 and likely to produce large roundoff error. Since the value of machine epsilon is machine dependent, it is important to be able to calculate its value correctly. In Example 7.2 we present the calculation of machine epsilon in a modular function. The algorithm presented in this example has given good results on a variety of different computers using different compilers and optimization levels.

O EXAMPLE 7.2 Calculating Machine Epsilon

#include <iostream>
using namespace std;

float machineEpsilon(void)
{
    float eps = 1.0, epsp1 = 1.0 + eps;
    while (epsp1 > 1.0) {
        eps *= 0.5;
        epsp1 = 1.0 + eps;
    }
    return 2.0 * eps;
}

int main()
{
    cout << "The value of machine epsilon for type float is "
         << machineEpsilon() << endl;
}
Program Output:
The value of machine epsilon for type float is 1.19209e-07
Comments: The reader may question the need for the variable epsp1 in the function machineEpsilon(). The reason for using this variable is that if it is not used and the while statement is written as

while (1.0 + eps > 1.0)
many compilers will optimize this code and leave the variable eps on the floating point math processor stack where all calculations are performed in extended double precision. This will result in a calculation of machine epsilon for extended double precision rather than single precision as is desired. Also note that the value returned by the function is 2.0 * eps. The while loop continues until the computer can’t see a difference between epsp1 and 1.0. Therefore, the smallest value of eps that yielded a value of epsp1 that was distinguishable from 1.0 was the value it had on the previous iteration, which is precisely 2.0 times its present value. O
Glossary

double precision: a floating point storage scheme that uses twice as much storage as single precision in exchange for greatly increased range and precision.

machine epsilon: the smallest floating point number such that when it is added to unity, the sum can be distinguished from unity.

order: a measure of relative truncation error. Higher order implies lower truncation error.

roundoff error: the error that results from the computer's inability to store an infinite number of digits.

truncation error: the error that results from omitting the higher order terms of a Taylor series.
Chapter 7 Problems

7.1
You are given the task of designing an industrial drying oven. Material that is 99% water is placed in the oven in 100 kg batches. How many kilograms of water must be removed for the moisture content to be reduced to 98%? Find the solution to this problem by setting up a repetition structure that successively subtracts a fixed amount of water until the desired moisture content is reached. Print a table of the moisture percentage as a function of the amount of water removed.
7.2
The Maclaurin series expansion for sine(x) is:
sine(x) = x − x^3/3! + x^5/5! − x^7/7! + …
Write a C++ program that uses functions to compute this series two different ways. The first method simply computes each term as it is written and adds it to the total. (Note: you will have to write a factorial function.) The second method should exploit the fact that each succeeding term can be calculated by multiplying the preceding term by
−x^2 / (n(n + 1)).

Test your program by computing the sine of 30, 60, 90, 450, 810, and 1170 degrees using 5, 10, 20, 40, and 60 terms of the series, respectively.

7.3
A Taylor series expansion of the exponential function leads to the following approximation:
e^x ≈ S(x) = 1 + x/1! + x^2/2! + … + x^n/n!
Write a program that calls a function named myexp(x) which evaluates the approximate value of e^x as given in the equation above, where x is the argument passed to the function. The value of n in the equation is not a constant, but rather is the value necessary to ensure that the last term included in the sum is less than machine epsilon. Careful attention must be given to the order in which the terms of the series are summed in order to limit the effects of roundoff error. For each of the sample values of x given below, your program should print the value of myexp(x) as well as the value returned by the standard library function exp(x). Sample values of x: 0.0, 1.0, -1.0, 8.0, -8.0, 9.0, -9.0, 10.0, -10.0.

7.4
For a positive number a,

√a = lim (n→∞) x_n

where

x_0 = a/2,    x_(n+1) = (1/2)(x_n + a/x_n).

Using this iteration, find the smallest fraction p/q, with p < 10^6, that approximates √2 with a truncation error less than 10^-10. Experiment with different numbers under the radical and different values of the truncation error and draw appropriate conclusions. NOTE: If you don't write this program properly, it can take a very long time to execute! Properly written, this program should return the results almost instantaneously.
CHAPTER
8
Zeros of Functions CHAPTER OUTLINE 8.1 Introduction 8.2 Incremental Search 8.3 Bisection 8.4 Regula Falsi 8.5 Improved Regula Falsi 8.6 Newton-Raphson Iteration
8.1
INTRODUCTION
If a relationship exists between one or more independent variables and a single dependent variable, it is relatively easy to write a C++ function that will accept values of the independent variables as arguments and calculate the return value of the function. Many common mathematical functions, such as trigonometric and exponential functions, are included in the C++ standard library. To access these mathematical functions, insert the compiler directive #include <cmath> at the top of the program source code. Application-specific functions must be written by the application programmer. Examples of application-specific functions include the following: determining the distance a projectile will travel given the initial launch angle and velocity, calculating the critical buckling load for a column given the column material properties and slenderness ratio, or calculating the enthalpy of superheated steam given its temperature and pressure.

Evaluating a function given the values of its arguments is sometimes called forward-solving. A more difficult problem is that of back-solving, where a desired function value is known and the required value of one independent variable that will yield the desired value of the function must be determined. For functions having more than one independent variable, all but one independent variable must generally be specified ahead of time, since there is typically a many-to-one relationship between the independent variables and the dependent variable. That is, there are many combinations of independent variables that will produce a given value of the dependent variable. Even when there is a single independent variable, there may be multiple solutions. For example, there are an infinite number of angles for which tan(θ) = 0.5, since tangent is a periodic function. It may be possible to invert the function so that the solution (or a family of solutions) can be found directly.
For the function given in the previous paragraph, the solution is obviously tan^-1(0.5).
Another example is the familiar quadratic formula, which yields the two solutions of the function ax^2 + bx + c = 0. However, it is more likely that the function is non-invertible and an explicit expression for the required value of the independent variable cannot be found. In this case one of the methods described in this chapter must be used to intelligently search for the required value of the independent variable. The class of methods presented in this chapter are called “zero finding” because they find the value of the independent variable which makes the dependent variable equal to zero. (When applied to polynomials and some other types of functions, the process we are describing here is usually called root-finding.) Some adjustment of the function may be necessary to cast the problem into this form. For example, if a student is trying to determine what grade he/she must make on the final exam in order to make a “B” in a course having a given grade weighting scale, the function may look like this:
f(hw1, hw2, …, hwn, ex1, ex2, finalExam) = 80

This relationship can be put in the proper form for processing using the methods described below simply by moving the constant to the left-hand side, thus producing a function that we can properly find the “zero” of, as shown below.
f(hw1, hw2, …, hwn, ex1, ex2, finalExam) − 80 = 0

Most zero-finding methods work best if you have a good initial starting guess or a region in which the zero is known to exist. Many times the physical or mathematical constraints on the function will suggest a starting range. As examples, we know that the final exam grade must lie between 0 and 100 in the previous example, if it is mathematically possible for the student to make a “B”. In the case of the column buckling load, we know that the load must be greater than 0 (in compression) and less than the compressive yield strength of the column material.
8.2
INCREMENTAL SEARCH
The incremental search method is little more than a formalized trial-and-error method. It is quite inefficient and does not guarantee convergence. It should be used only to narrow down the search interval by finding two values of the function that have opposite signs. Once these two points are found, either the Bisection or Regula-Falsi method (described in following sections) should be employed to find the root.
FIGURE 8.1 Incremental Search Method
To begin the incremental search method, choose some starting value of the independent variable, x0, and a step size, Δx, which may be either positive or negative. Evaluate the function values f(x0) and f(x0+Δx) and the product of these two numbers. If the product is positive (meaning the two values have the same sign), advance to x1 and repeat the process. Continue until the product of two successive function values is negative. At this point, abandoning the incremental search method and using the two values of x which yield function values having opposite signs as starting points for either of the methods presented in the next two sections will lead to results faster than continuing with the incremental search. However, for consistency with the objective of this chapter, the example shown employs incremental search in an iterative manner to find the zero of the function to the accuracy desired by the user. Note that there is no upper limit on the range of x-values that will be searched. If no sign change is found, the program will eventually fail with a floating point exception as the values of the independent variable grow without bound.

O EXAMPLE 8.1 Incremental Search Method

#include <iostream>
#include <cmath>
using namespace std;

double f(double x)
{
    return 1.0/x + log(x+1) - 5.0;
}

double incrementalSearch(double (*f)(double), double x0,
                         double deltax, double eps)
{
    double x = x0, fx = f(x);
    do {
        double fxpdx = f(x+deltax);
        if (fx * fxpdx >= 0.0) {
            x += deltax;
            fx = fxpdx;
        }
        else {
            deltax /= 10;
        }
    } while (fabs(deltax) >= eps);
    return x;
}

int main()
{
    double x = 0.1, deltax = 0.0625;
    cout << incrementalSearch(f,x,deltax,1.0e-6) << endl;
}
Program Output: 0.20785
Comments: The main function simply defines the parameters to pass to function incrementalSearch, which does all the work. Notice how the name of the function for which the zero is sought is passed as the first argument to the function. This requires the arcane syntax in the definition of the incrementalSearch function, but this is a good programming practice, for it allows the incrementalSearch function to find the zero of more than one function in the same program. This practice should be employed wherever possible. The choice of 10 as the divisor in the statement deltax /= 10; is rather arbitrary. Any positive integer greater than 1 will work, though the rate of convergence will vary. O

FIGURE 8.2 Bisection Method - first iteration

FIGURE 8.3 Bisection Method - second iteration
8.3
BISECTION
The bisection method (also known as the half-interval method, the Bolzano method, the mid-point method, and several other names) is the simplest member of the family of bracketing methods. These methods require that two values of the independent variable, xl and xr, be supplied for which the respective values of the dependent variable, fl and fr, have opposite signs (fl × fr < 0). The difference |xr − xl| is called the interval of uncertainty (IU). Thus, for continuous functions we are assured that there is at least one zero of the function between xl and xr. All bracketing methods work by successively reducing the size of the interval of uncertainty until it reaches some user-defined small value. The incremental search method of the previous section is often used to locate an initial interval of uncertainty.

The bisection method reduces IU by successively asking the question “Does a zero lie to the left or right of the midpoint of the current IU?” Choosing the midpoint of the current IU,

xc = (xl + xr) / 2,

is optimal because it maximizes the value of the answer to the question in an
information theory sense. Lacking any information about the shape of the function, it is equally likely that a zero will lie to the left or right side of the midpoint of the current IU. To determine on which side the zero lies, we multiply the value of the function at the midpoint, fc = f(xc), times the value of the function at one of the two endpoints. If the sign of the product is negative, then we know that a zero must exist between the center and this endpoint. Note that if the sign of the product is positive, we can not necessarily conclude that there is no zero between the midpoint and the endpoint. There may be any even number of zeros between the two. But we will always choose to ignore this possibility and concentrate on finding the one in the interval which produces the negative product. Once we have identified on which side of the midpoint the zero lies, we rename xc and fc as the opposite endpoint values, thus reducing IU by one-half. The two figures above illustrate two iterations of the bisection method. The original interval of
uncertainty, IU0, stretches from xl to xr and is bisected at point xc. Clearly, the zero lies to the right of the midpoint because fc and fr have opposite signs. Because we are certain the zero lies to the right of xc, the region between xl and xc is abandoned by renaming xc as xl and fc as fl. This effectively reduces the interval of uncertainty by 50%. Figure 8.3 shows that for the next iteration IU will be collapsed from the right side, since fc and fl now have opposite signs.

One desirable characteristic of the bisection method is that its convergence rate is predictable. Since we know that IU decreases by 50% each iteration, we can predict how many iterations will be required to reach a specified accuracy. This convergence rate is independent of the actual function being computed. For example, suppose that we are given the initial values xl = 1.0 and xr = 2.0 and we wish to find the zero of a function to within ±1.0e-6. Notice that before beginning the bisection method we already know the answer is xc = 1.5 ± 0.5. After the first iteration we will know the answer to within ±0.25 (IU0/2^2) and after the second iteration we will know the answer to within ±0.125 (IU0/2^3). Continuing in this fashion, we find that after 19 iterations the solution will be known to within ±9.537e-7 (IU0/2^20), which is better than the required accuracy. Generalizing this procedure, the number of iterations of the bisection method, n, that must be applied in order to find the zero of a function to an accuracy of ±ε, given an initial interval of uncertainty IU0, is given by
n = −log2( 2ε / IU0 ).    (8.1)
O EXAMPLE 8.2 Bisection Method

1)  #include <iostream>
2)  #include <cmath>
3)  using namespace std;
4)  double f(double x)
5)  {
6)      return 1.0/x + log(x+1) - 5.0;
7)  }
8)  double bisection(double (*f)(double), double xl, double xr, double eps)
9)  {
10)     double fl = f(xl), fr = f(xr);
11)     if (fl * fr > 0.0)
12)     {
13)         cerr << "fl * fr > 0.0 in bisection\n";
14)         exit(-1);
15)     }
16)     while (fabs(xr - xl)/2.0 > eps)
17)     {
18)         double xc = (xr + xl)/2.0, fc = f(xc);
19)         if (fl * fc <= 0.0)
20)         {
21)             xr = xc;
22)             fr = fc;
23)         }
24)         else
25)         {
26)             xl = xc;
27)             fl = fc;
28)         }
29)     }
30)     return (xr + xl)/2.0;
31) }
32) int main()
33) {
34)     double xl = 0.1, xr = 1.0;
35)     cout << bisection(f,xl,xr,1.0e-6) << endl;
36) }
Program Output: 0.20785
Comments: Notice that at the beginning of function bisection, a check is conducted to be sure that the two initial values of xl and xr yield values of fl and fr, respectively, that do, in fact, have opposite signs. If this is not the case, an error message is printed on the system error logging device (normally the operator’s screen) and the program exits with an error flag. It is up to the operating system or the user’s shell program to process this error return any further. O
8.4 REGULA FALSI
The Regula Falsi method (also known as the method of false position or the linear interpolation method) differs from the bisection method in only one respect—the method used to locate the point xc. Instead of always choosing the mid-point of the interval (xl, xr), regula falsi opts to use some limited information about the shape of the function itself to choose the point xc. Specifically, regula falsi assumes the points (xl, fl) and (xr, fr) are joined by a straight line. Since the two end-points are known to have opposite signs, the straight line connecting these two points must cross the x-axis, as illustrated in Figures 8.4 and 8.5.
FIGURE 8.4 Regula Falsi - first iteration
58 # Chapter 8 Zeros of Functions
FIGURE 8.5 Regula Falsi - second iteration
The equation for calculating the value of the point xc can be derived quite easily by equating the slope of the line segments between (xl, fl)-(xr, fr) and (xl, fl)-(xc, 0).
( fr − fl ) / ( xr − xl ) = ( 0 − fl ) / ( xc − xl )
After some rearranging, we find
xc = ( fr xl − fl xr ) / ( fr − fl )    (8.2)
O EXAMPLE 8.3 Regula Falsi Method

1)  #include <iostream>
2)  #include <cmath>
3)  using namespace std;
4)  double f(double x)
5)  {
6)      return 1.0/x + log(x+1) - 5.0;
7)  }
8)  double regulaFalsi(double (*f)(double), double xl, double xr, double eps)
9)  {
10)     double fl = f(xl), fr = f(xr);
11)     if (fl * fr > 0.0)
12)     {
13)         cerr << "fl * fr > 0.0 in regulaFalsi\n";
14)         exit(-1);
15)     }
16)     while (fabs(xr - xl)/2.0 > eps)
17)     {
18)         double xc = (fr*xl - fl*xr)/(fr - fl), fc = f(xc);
19)         if (fl * fc <= 0.0)
20)         {
21)             xr = xc;
22)             fr = fc;
23)         }
24)         else
25)         {
26)             xl = xc;
27)             fl = fc;
28)         }
29)     }
30)     return (xr + xl)/2.0;
31) }
32) int main()
33) {
34)     double xl = 0.1, xr = 1.0;
35)     cout << regulaFalsi(f,xl,xr,1.0e-6) << endl;
36) }
Program Output: NONE!

Comments: Curiously, this program fails to find the solution even though only one line of executable code changed from the bisection program shown in Example 8.2! In the next section we will examine the behavior of this program and develop an improved version of regula falsi that doesn’t suffer from this behavior. O
8.5 IMPROVED REGULA FALSI
By inserting the following line of code between lines 18 and 19 of the listing in Example 8.3, we can follow the progress of the regula falsi method.

cout << xl << ", " << xr << ", " << xc << ", "
     << fl << ", " << fr << ", " << fc << endl;
Here are the first several lines of output from the program:

0.1, 1, 0.645786, 5.09531, -3.30685, -2.95328
0.1, 0.645786, 0.44552, 5.09531, -2.95328, -2.38696
0.1, 0.44552, 0.335294, 5.09531, -2.38696, -1.72839
0.1, 0.335294, 0.275696, 5.09531, -1.72839, -1.12932
0.1, 0.275696, 0.24382, 5.09531, -1.12932, -0.680419
0.1, 0.24382, 0.226877, 5.09531, -0.680419, -0.387847
0.1, 0.226877, 0.217902, 5.09531, -0.387847, -0.213654
0.1, 0.217902, 0.213157, 5.09531, -0.213654, -0.115402
0.1, 0.213157, 0.210651, 5.09531, -0.115402, -0.0616571
0.1, 0.210651, 0.209328, 5.09531, -0.0616571, -0.0327482
0.1, 0.209328, 0.20863, 5.09531, -0.0327482, -0.0173389
0.1, 0.20863, 0.208262, 5.09531, -0.0173389, -0.00916486
0.1, 0.208262, 0.208067, 5.09531, -0.00916486, -0.00483998
0.1, 0.208067, 0.207965, 5.09531, -0.00483998, -0.00255481
0.1, 0.207965, 0.207911, 5.09531, -0.00255481, -0.00134823
0.1, 0.207911, 0.207882, 5.09531, -0.00134823, -0.000711399
It appears that the value of xl is not changing at all. Every time a new value of fc is computed, it has the same sign as fr, causing xr to be reset closer and closer to xc, but the difference between xl and xr, which is what we are using to test convergence on the zero, remains relatively unchanged. In fact, after about 60 iterations, the results quit changing entirely, resulting in an infinite loop. Because this behavior is often seen in regula falsi, some authors discourage its use. However, Example 8.4 shows how simple it is to correct this shortcoming and render regula falsi entirely satisfactory for general zero-finding use. This clever solution, advanced by Gerald and Wheatley[3], works by effectively altering the slope of the straight line connecting fl and fr. To accomplish this,
the variable fsave is used to save the value of fc from the previous iteration. (An additional bool variable, fsaveSet, is also needed so that the modified code takes effect only after the first iteration, once fsave has been set properly.) The improved regula falsi method decreases the slope of the line connecting fl and fr when it detects that fc has the same sign that it had in the previous iteration, thus indicating convergence from one side only. The method reduces the slope by dividing the value of the function on the opposite side of the zero from which convergence is being attained by some fixed positive number greater than 1.0, say, 10.0. This may happen more than once if convergence persists from one side only. Eventually, the function value on the opposite side of the zero will get close enough to the x-axis to allow the computed value of xc to move to the other side of the zero, thus allowing convergence from both sides. The code in Example 8.4 implements this improved regula falsi method.

O EXAMPLE 8.4 Improved Regula Falsi

1)  #include <iostream>
2)  #include <cmath>
3)  using namespace std;
4)  double f(double x)
5)  {
6)      return 1.0/x + log(x+1) - 5.0;
7)  }
8)  double improvedRegulaFalsi(double (*f)(double), double xl,
9)                             double xr, double eps)
10) {
11)     double fl = f(xl), fr = f(xr), fsave;
12)     bool fsaveSet = false;
13)     if (fl * fr > 0.0)
14)     {
15)         cerr << "fl * fr > 0.0 in improvedRegulaFalsi\n";
16)         exit(-1);
17)     }
18)     while (fabs(xr - xl)/2.0 > eps)
19)     {
20)         double xc = (fr*xl - fl*xr)/(fr - fl), fc = f(xc);
21)         if (fl * fc <= 0.0)
22)         {
23)             xr = xc;
24)             fr = fc;
25)             if (fsaveSet && fc * fsave > 0.0)
26)             {
27)                 fl /= 10.0;
28)             }
29)         }
30)         else
31)         {
32)             xl = xc;
33)             fl = fc;
34)             if (fsaveSet && fc * fsave > 0.0)
35)             {
36)                 fr /= 10.0;
37)             }
38)         }
39)         fsave = fc;
40)         fsaveSet = true;
41)     }
42)     return (xr + xl)/2.0;
43) }
44) int main()
45) {
46)     double xl = 0.1, xr = 1.0;
47)     cout << improvedRegulaFalsi(f,xl,xr,1.0e-6) << endl;
48) }
Program Output: 0.20785
Comments: Adding the variables fsave and fsaveSet and the code on lines 11, 12, 25-28, 34-37, and 39-40 allows the regula falsi method to converge on the answer from both sides. O

To see how well it works, we can insert the same debugging code used in the previous example between lines 20 and 21 of Example 8.4 to follow the convergence of the algorithm.

1) 0.1, 1, 0.645786, 5.09531, -3.30685, -2.95328
2) 0.1, 0.645786, 0.44552, 5.09531, -2.95328, -2.38696
3) 0.1, 0.44552, 0.160781, 0.509531, -2.38696, 1.36872
4) 0.160781, 0.44552, 0.264551, 1.36872, -2.38696, -0.985297
5) 0.160781, 0.264551, 0.221117, 1.36872, -0.985297, -0.277748
6) 0.160781, 0.221117, 0.180699, 0.136872, -0.277748, 0.700167
7) 0.180699, 0.221117, 0.209638, 0.700167, -0.277748, -0.0395455
8) 0.180699, 0.209638, 0.208091, 0.700167, -0.0395455, -0.00536115
9) 0.180699, 0.208091, 0.206142, 0.0700167, -0.00536115, 0.0384411
10) 0.206142, 0.208091, 0.207852, 0.0384411, -0.00536115, -4.55844e-05
11) 0.206142, 0.207852, 0.20785, 0.0384411, -4.55844e-05, -3.87123e-07
12) 0.206142, 0.20785, 0.20785, 0.00384411, -3.87123e-07, 3.45089e-06
First of all, we notice that the improved regula falsi method converges almost 40% faster than the bisection method. Remembering the order of the variables as they are printed on each line, we see that on line 3) the value of fl has been divided by 10 from its previous value because the algorithm detected convergence from the right side only in the previous two iterations. The desired effect was achieved, because the sign of fc changed from negative to positive on that iteration. This same logic was repeated on lines 6), 9), and 12), thus forcing convergence from both sides.
8.6 NEWTON-RAPHSON ITERATION
Newton-Raphson iteration, often called simply Newton’s method (we will use the two names interchangeably), is the most popular of the so-called “single-point” iteration schemes. Unlike the bracketing methods presented in the previous three sections, Newton’s method does not require that the location of a zero be initially bracketed by providing two points where the function has opposite signs. Rather, Newton’s method requires only a single starting point. But the relaxation of the starting requirements comes at a cost, for Newton’s method requires that the derivative of the function also be available. This requirement sometimes restricts the applicability of the method, as in the case of a function that relies on “black-box” code (perhaps provided by a vendor under an “execute-only” license) for which the underlying algorithms and data are unknown and the derivative is simply not
available.[5] Newton’s method can be derived easily from the Taylor series. Given a function f(x) with all the usual “nice” properties, the Taylor series can be written as
f(x0 + Δx) = f(x0) + Δx (df/dx)|x0 + (Δx^2/2)(d^2f/dx^2)|x0 + (Δx^3/6)(d^3f/dx^3)|x0 + …
When using the Taylor series in the zero-finding context, the starting values of x0 and f(x0) are known as well as the value of f(x0 + Δx). This may seem strange, but think about it. We know the value of the function at the current location (x0) is not zero, so we want to find how far we have to go (Δx) to find the place where the function is zero. Thus, this location will be x0 + Δx and the function value at that point (f(x0 + Δx)) will be zero. Truncating the Taylor series after the first derivative term (this is the first example we will see of numerical truncation error) and rearranging, we have
Δx ≈ ( 0 − f(x0) ) / (df/dx)|x0
If we designate the x-location of the zero of the function as x*, then Δx = x* - x0, and
x* ≈ x0 − f(x0) / (df/dx)|x0    (8.3)
In practice, truncating the Taylor series after the first derivative causes a single application of Newton’s method to yield a value of x* that is not very close to the zero. In many circumstances, the zero can be found by making Newton’s method an iterative scheme, i.e.,
xk+1 = xk − f(xk) / (df/dx)|xk    (8.4)
In many cases where the function is well-behaved and the starting guess is not too far from the location of the zero, Newton-Raphson iteration will converge in fewer than 10 iterations. A graphical representation of the desired behavior of Newton-Raphson iteration is shown in Figure 8.6. The iteration begins by picking a starting point x0 and evaluating the function at that point (f0). Next, a line is drawn tangent to the curve through f0 and extended until it intersects the horizontal axis. This intersection point is the next guess for the location of the zero of the function and the starting point for the next iteration.
[5] In these cases, a finite-difference approximation of the derivative may be used with Newton’s method, if it is applied carefully. Many numerical methods textbooks present formulas for finite-difference approximations of derivatives.
FIGURE 8.6 Proper Operation of Newton-Raphson Iteration
Unfortunately, things may not always go as well as planned. Figure 8.7 shows some possible outcomes when using Newton-Raphson iteration. With the starting point x0,1 things proceed as planned, with the zero at x1* being found in only a few iterations. Moving the starting point to the left to point x0,2 causes the zero at x1* to be skipped in favor of the one at x2*. Newton’s method does not always converge to the zero that is closest to the starting point. Shifting the starting point only slightly to the left to point x0,3 causes a different anomalous behavior. On the second iteration, the derivative of the function will be zero, causing the next guess for the location of the zero to be at infinity and probably causing a fatal division-by-zero exception in the program. At any stage of Newton-Raphson iteration, the function may have a zero derivative. Shifting the starting guess only very slightly to the left to x0,4 produces a situation that is difficult to detect. The inflection point at the right side of Figure 8.7 has no real zero, but Newton’s method will continue to bounce from one side of the inflection point to the other forever, as if it will eventually converge. Under certain circumstances, Newton’s method will exhibit similar behavior even in the presence of a real zero. At any stage of Newton-Raphson iteration, the algorithm may get “stuck” on an inflection point.

FIGURE 8.7 Possible Outcomes of Newton-Raphson Iteration

Sophisticated pattern-recognition methods can be used to detect these conditions, but perhaps the best thing to do in such situations is simply to quit with an error condition if a zero is not found in about twenty iterations.

O EXAMPLE 8.5 Newton-Raphson Iteration
#include <iostream>
#include <cmath>
using namespace std;
double f(double x)
{
    return 1.0/x + log(x+1) - 5.0;
}
double df(double x)
{
    return -1.0/(x*x) + 1.0/(x+1.0);
}
double newton(double (*f)(double), double (*df)(double), double x, double eps)
{
    for (int count = 0; count < 20; count++)
    {
        double derivative = df(x);
        if (fabs(derivative) < eps)
        {
            cerr << "Derivative near zero in newton\n";
            exit(-1);
        }
        double deltax = f(x)/derivative;
        x = x - deltax;
        if (fabs(deltax) < eps)
            return x;
    }
    cerr << "Failure to converge in newton\n";
    exit(-1);
}
int main()
{
    double x0 = 0.1;
    cout << newton(f,df,x0,1.0e-6) << endl;
}
Program Output: 0.20785
Comments: The program converges in six iterations using the starting value of 0.1. Because of the natural log function, values of x in newton must be greater than zero. Using different starting values reveals that the program will fail for x0 > 0.4. An examination of the plots of f(x) and df/dx reveals that although both are well behaved for x > 0, the derivative quickly approaches zero as x increases, causing the next value of x to be negative and thus producing a floating point exception. Such a narrow range of usable starting values is atypical of Newton-Raphson iteration. Many mathematical models of physical systems permit a very wide range of starting values, suffering only a slight penalty of one or two extra iterations for starting values far away from the true solution. O
Glossary
back solve: given the value of a function, find the value of the independent variable that yields the known function value.

convergence: the numerical solution to a problem approaches the analytical solution.

forward solve: determine the value of a function given a value of the independent variable.

interval of uncertainty (IU): the domain of the independent variable which is known to contain the zero of a function at a particular stage of the search process.

zero of a function: the value of the independent variable that makes the value of the function zero.
Chapter 8 Problems

8.1 A large thunderstorm is approaching and you wish to measure the amount of rainfall. The only container available is a galvanized pail having a base diameter of 8.5 inches, a top diameter of 11 inches, and a height of 10 inches. Since this pail is tapered, the depth of water accumulated will not accurately reflect the actual rainfall amount. Write a C++ program that will allow you to calibrate a dipstick to use with the given pail to accurately determine the true rainfall amount. The program should print out the distance from the bottom end of the dipstick where marks should be placed to indicate rainfall amounts in 0.5-inch increments.
8.2 Find the solution to the equation x − tan(x) = 0 using Newton’s method with a starting guess of x0 = 4.5. Then try starting guesses of x0 = 4.0 and x0 = 5.0. Explain the behavior in each case.
8.3 According to Reynolds,[6] the saturation pressure for Refrigerant 23 as a function of temperature is given by the equation

ln P = F1 + F2/T + F3 ln T + F4 T + F5 T^2 + F6 T^3

The units for this equation are pascals for pressure and kelvins for temperature. The equation is valid for temperatures from 150 K to 299.07 K. The constants in this equation are:

F1 =  6.81234858 × 10^2
F2 = −1.01732931 × 10^4
F3 = −1.44514230 × 10^2
F4 =  1.00348278
F5 = −1.58761756 × 10^-3
F6 =  1.26698956 × 10^-6

Write a C++ program that will calculate and print the saturation temperature (in K and °F) for the following pressures (in MPa):
[6] Reynolds, W.C., Thermodynamic Properties in SI, Department of Mechanical Engineering, Stanford University, ISBN 0-917606-05-1, 1979.
0.0044, 0.01, 0.03, 0.05, 0.08, 0.101325, 0.2, 0.3, 0.8, 1.0, 1.5, 2.0, 3.0, 4.8

8.4 The Regula Falsi method for zero-finding can converge very slowly under certain circumstances. Consider the following modification. At the start of each iteration, compute the value of xc using both the formula for the bisection method and the formula for the Regula Falsi method, then average them to determine a final value of xc. Would this procedure help or hurt the overall performance of Regula Falsi? You might want to use some sketches of functions to illustrate your arguments.
Apply this hybrid method to finding the zero of the function f(x) = 1/x + ln(x+1) − 5 in the range 0.1 ≤ x ≤ 1.0. Show in a table at least three iterations, giving the values of xl, xr, fl, fr, xc, and fc for each iteration.

8.5 The figure below illustrates the two-dimensional problem of trying to get a long object, such as a ladder, around a corner in a building. The length of the object is L = L1 + L2. Imposing the geometric constraint that the object must touch the outside wall of the two hallways as well as the inside corner leads to the following equation.
L = w2 / sin(π − A − C) + w1 / sin(C)

The longest object that can negotiate the turn is the minimum of this function with respect to the angle C. To find the minimum, differentiate the equation above and set it equal to zero, as shown below.

dL/dC = w2 cos(π − A − C) / sin^2(π − A − C) − w1 cos(C) / sin^2(C) = 0

Using the information given above, determine the longest ladder that can negotiate a 90° intersection of two corridors that are 8-ft. and 10-ft. wide, respectively.
8.6 A plastic bucket is placed under the drip line of the roof of a house to serve as a rain gauge. Because rainfall striking the roof is collected in the bucket, the amount of rainfall is “amplified,” with the amount of amplification depending on the placement of the bucket relative to the drip line of the roof. The amplification factor will be zero if the bucket is located completely under the eave of the roof and unity if it is located beyond the drip line. Assume a particular roof has a total horizontal span of 28 feet with the crest of the roof being at the center. Further assume that rain striking the roof flows directly to the drip line and then vertically down into the bucket. Write a C++ program that will determine the location of the edge of an 11-inch diameter bucket relative to the roof drip line that will yield an amplification factor of 5.0. Note that there will be two solutions: one for which only a small portion of the bucket is placed beyond the drip line and a second for which most of the bucket is placed beyond the drip line.
CHAPTER 9

Using C++ Classes in Numerical Methods

CHAPTER OUTLINE
9.1 Introduction
9.2 “Thinking” Objects
9.3 Member Data
9.4 Member Functions
    9.4.1 Constructors
    9.4.2 Destructors
    9.4.3 Actions
    9.4.4 “Set”/“Get” Member Functions
    9.4.5 Overloaded Operators
9.5 The ENM Classes
    9.5.1 Vector and Matrix classes
    9.5.2 Miscellaneous ENM classes
9.1 INTRODUCTION
To this point we have used what is commonly called the procedure-oriented programming paradigm to study computer programming and numerical methods. This approach causes the program developer to consider carefully the data that the program uses and the algorithms that will operate on the data. The use of C++ functions allows some degree of procedure abstraction in that we can regard properly written functions as “black boxes” which do their jobs properly if the proper data are passed as arguments. Various nuances of procedure-oriented programming dominated the field of computer science for the first thirty years of its existence, and the concept has served us well. Recently, the concept of Object-Oriented Programming (OOP) has experienced a groundswell of support. OOP is most commonly implemented through extensions to the C language now
commonly known as C++.[7] Today, virtually all application development uses programming languages that support some form of object-orientation. C++ is the language of choice for most major application program development. OOP in C++ has many touted features. Most notable of these is the notion of code reuse. Once a well-designed set of C++ classes has been developed, it can be reused by many programmers in different applications. And since a fundamental principle of OOP is that objects encapsulate both data and algorithms, enhanced versions of class libraries can be developed and implemented without requiring any changes to the older programs that were written for the previous version. The case for using OOP is strong, but not universal. There is some overhead associated both with program development and program execution when using objects. Simple problems like the zero-finding methods presented in Chapter 8 would benefit little from the advantages of OOP techniques. But as the underlying data become more complicated, as with vectors and matrices, the advantages of implementing objects become quite apparent. The study of OOP and its full implementation in C++ is very complex. The cursory treatment given in this chapter should be regarded only as the barest minimum of essentials required to get started. There is much more to OOP than is presented in this chapter.
9.2 “THINKING” OBJECTS
Programmers who were schooled in procedure-oriented programming techniques often have difficulty grasping the concept of objects. They were taught primarily to concentrate on algorithms, then find some way to get the data into and out of the algorithm through the function argument list. In C++ and other object-oriented programming languages, the programmer must be taught to think about the data and the algorithms jointly as objects. Probably the best way to begin this process is to write a detailed narrative description of what the program is to accomplish. From this carefully written functional description, identify all the nouns. The nouns are the first candidates for the names of the objects to be implemented in the program. Next, identify the attributes assigned to the nouns. An attribute is a property that does not change throughout the life of the object. Then identify all the states in which the object can exist. A state is a property that can change in value throughout the life of the object. Together, the attributes and states form the member data of the object. Often, the member data are referred to as the things that the object “knows.” Finally, from the written description, extract the list of verbs associated with the nouns. The verbs represent the things that the object can “do” and will be implemented as member functions. Writing an object-oriented program in C++ consists of developing the classes required to implement the various objects needed in the program, then tying these classes together by sending messages to objects instantiated from these classes by calling their member functions. (It sounds easier if you say it fast.)
9.3 MEMBER DATA
The member data of an object consists of its attributes and its states. It is possible, though not advisable, to create a class consisting only of its member data. The reason this is not advisable is that
[7] The designation C++ is taken from the C autoincrement operator, which suggests C++ is an “increment above” the C language. Since C++ is fully compatible with its non-object-oriented predecessor, it is often called a hybrid language. The use of the OOP extensions in C++ is purely optional.
objects consisting of member data only offer no protection for the data. One of the tenets of OOP is that objects are responsible for always ensuring the integrity of their member data. This is done by using member functions, which are covered in the next section. The example below shows how to implement classes in C++. We will insist on a fairly regimented style for using classes that is consistent with the guidelines of most C++ programming experts. Every class we create will require two files. The first will be a header file that contains the class definition. The header file is usually given a name similar to the class name followed by the file extension “.h”. The header file must not contain any executable code. This goes in the second file associated with each class. This file should have the same name as the class followed by the file extension “.cpp”. If you are using a compiler that includes an integrated development environment (IDE) that allows you to define projects, all the .cpp files should be included in the project while the header files should never be included in the project. They will be explicitly included using preprocessor directives in the .cpp files. To clarify these style guidelines, the contents of each file will be listed separately in the following examples. The examples in the remainder of the book will also use the newer form of the standard library header files, which do not have the “.h” extension. Using these new header files also requires that our program be aware of namespaces. This is a new feature of standard C++ that was added to avoid class and data name “collisions” when incorporating code written by multiple authors. Those objects and data that are a part of the standard C++ library are defined to be in namespace std.
Thus, when we use the standard input/output objects cin and cout that are defined in header file iostream, we must indicate that these are the objects we are referring to and not some other cin and cout objects that may be defined in some other header file. To accomplish this name clarification, we may either prefix every occurrence of cin and cout with the scope resolution std:: or, if that is too burdensome, we may simply include the line using namespace std;
at the top of the file, and all objects named cin and cout will be assumed to be prefixed with std::. If there happened to be other objects named cin and/or cout used in the program, they would have to be referenced using the syntax namespace::cin and namespace::cout, where namespace is the namespace assigned in the code where these objects were defined. This can all be very confusing until one has had experience writing programs for large projects involving code written by several different suppliers. For the beginner, just understand that all programs that #include <iostream> or other standard C++ library header files that do not have the .h file extension must also include the line using namespace std; just below the #include preprocessor lines at the top of the file.

O EXAMPLE 9.1 Using a Data-Only Class

This project illustrates an elementary class that consists only of data members. The class is named point and is intended to represent a point in a rectangular Cartesian coordinate system. File point.h contains:
#ifndef POINT_H
#define POINT_H
class point
{
  public:
    double x;
    double y;
};
#endif

File point.cpp is actually not needed in this example since there is no executable code associated with class point. However, for the sake of consistency we will include it anyway:

#include "point.h"
Notice the use of the double quote symbols to enclose the file name instead of the more common use of angle brackets. This is because the file point.h is located in the same folder as the program file and not in the system folder where the standard C++ header files are located. Some C++ compilers do not strictly enforce this distinction. The main function is defined in file ex41.cpp:
#include <iostream>
#include "point.h"
using namespace std;
int main()
{
    point a, b;
    a.x = 3.0;
    a.y = 4.0;
    b.x = 1.0;
    b.y = -1.0;
    cout << "Point a is (" << a.x << ", " << a.y << ")\n";
    cout << "Point b is (" << b.x << ", " << b.y << ")\n";
}
Program Output:
Point a is (3, 4)
Point b is (1, -1)
Comments: Clearly, the use of the class point is irrelevant to this program, since the main function has direct access to the data members of objects a and b using the member selection operator (the period between the object name and the object data member name). O

The header file for the class is always enclosed in preprocessor directives as illustrated in the example above. These directives combine to allow the contents of the file to be processed by the compiler only one time. This is not a language requirement, but it is a widely used programming convention. The structure of the directives always follows this template:

#ifndef CLASSNAME_H
#define CLASSNAME_H
...
#endif

where CLASSNAME_H is the name of the class being defined in the header file (in all upper case letters) followed by the underscore character and a capital H (_H). When the preprocessor reads the first line, it checks to see if the identifier CLASSNAME_H has been defined. (Do not confuse preprocessor identifiers — those used in lines beginning with the # symbol — with normal C++
identifiers. They are entirely different. Preprocessor directives are not part of the C++ program; they only control various options within the compilation process.) The first time the header file is processed, say when it is #included in the class .cpp file, the identifier CLASSNAME_H will be undefined, so the preprocessor will continue, because the first directive means “if the named identifier is not defined, continue processing this file.” The second line immediately defines the identifier (we don’t care exactly what value is assigned to the identifier; just that it is now defined in the eyes of the preprocessor). The remainder of the header file is passed on to the compiler, and the #endif on the last line marks the end of the conditional expression begun on the first line. When the class header file is included for the second and all subsequent times (as in file ex41.cpp in the example above), the preprocessor directive on the first line will detect that the identifier CLASSNAME_H is now defined, thus returning a false value for the conditional and causing it to skip all lines until the terminating #endif directive, effectively skipping the entire contents of the file. Inside the class header file the class is introduced with the class keyword. Curly braces
Chapter 9: Using C++ Classes in Numerical Methods

surround the block of statements declaring the class and its components. Notice the requirement that a semicolon follow the closing curly brace. Omitting this semicolon is a common syntax error that is difficult to track down, because many compilers report the error in the file that #included the header file rather than in the header file itself.

Inside the curly braces, class data members and member functions are listed. Each data member and member function has an associated access permission that determines what kinds of access external functions are allowed to have to the member. The three access specifiers are given in Table 9.1. The default access specifier for the class keyword is private. Normally, class data members are made private so that they can be changed only by member functions within the class and friends of the class. A friend of a class is another class or function that has been given explicit permission to read and write the private data members of the original class. Casual use of friend classes and functions is strongly discouraged, as it circumvents the data protection mechanisms which lie at the heart of OOP. In the previous example, the data members were made public in order to permit global access to them for the purpose of illustration.

Table 9.1 Access Specifiers

Access Specifier Keyword    Access Permitted
public:       The member can be accessed anywhere an object created from this class is being used.
private:      The member can be accessed only by member functions defined within the class and by classes or functions declared to be friends of this class.
protected:    The member can be accessed by member functions of this class, friends of this class, and other classes derived from this class.
9.4 MEMBER FUNCTIONS
Member functions are functions defined within the scope of a class; they act on the data members of objects constructed from the class. They empower the object to "do" the things it was intended to do while always preserving the integrity of the object's member data. It is sometimes helpful to think of the member functions as filters which control access to the object's data members: all access to the object must go through them. There are four types of member functions: constructors, destructors, action functions (sometimes called methods), and overloaded operators. Class designers can add member functions on demand to meet the needs of other programmers who use the class. Removing a member function from a class is not advisable, since an existing program may rely on that particular member function to accomplish its objective.

9.4.1 Constructors

Every class must have at least one constructor. If the programmer does not define one, the compiler will provide one. In many cases, the compiler-provided constructor may be adequate. But in most
cases it is better to provide a constructor to be sure the data members of the constructed object have a known state when the object is created. A class can have many different constructors, one for each possible way an object can be created. Each constructor must have a different argument list. User programs only rarely call a constructor directly; normally it is called as part of a variable declaration. A constructor is identified in the class definition as a member function that has no return type and the same name as the class.

Since no constructors for class point were defined in Example 9.1, when objects a and b were created (or instantiated, since the objects can be regarded as instances of the class), their data members were undefined, because the compiler-provided constructor does not initialize the data members. To remedy this, Example 9.2 shows a revised class definition header file which includes two constructors. The constructor having no arguments (called the default constructor) ensures that the data members are assigned known values. The second constructor, having two double arguments, assigns the two arguments to the x and y data members, respectively. A revised main function which exercises the two constructors is also shown in the example.

EXAMPLE 9.2 Using Constructors to Initialize Member Data

File point.h contains:

1)  #ifndef POINT_H
2)  #define POINT_H
3)  class point
4)  {
5)  public:
6)      double x;
7)      double y;
8)      point(void) : x(0.0), y(0.0) {};
9)      point(double xx, double yy) : x(xx), y(yy) {};
10) };
11) #endif

File ex42.cpp contains:

1)  #include <iostream>
2)  #include "point.h"
3)  using namespace std;
4)  int main()
5)  {
6)      point a, b(0.0, 0.0);
7)      cout << "Point a is (" << a.x << ", " << a.y << ")\n";
8)      cout << "Point b is (" << b.x << ", " << b.y << ")\n";
9)      a.x = 3.0;
10)     a.y = 4.0;
11)     b.x = 1.0;
12)     b.y = -1.0;
13)     cout << "Point a is (" << a.x << ", " << a.y << ")\n";
14)     cout << "Point b is (" << b.x << ", " << b.y << ")\n";
15) }

Program Output:

16) Point a is (0, 0)
17) Point b is (0, 0)
18) Point a is (3, 4)
19) Point b is (1, -1)
Comments: The default constructor shown on line 8) of point.h “constructs” data members x and y with initial values of 0.0. The body of the constructor between the curly braces is empty, since there is nothing else to do. The second constructor requires two double arguments which are passed as the initial values of the two object data members. Again, the body of the constructor is empty.
The main function instantiates object a using the default constructor and object b using the second constructor in line 6. The number 0.0 is passed as the initial value for both data members of object b. The output from lines 7 and 8 verifies the correct initial values of the data members of a and b. The remaining lines are duplicated from Example 9.1, illustrating the point that the data members can be altered externally because they were declared public.

For most simple classes that do not require dynamic memory allocation, constructors of the types given in Example 9.2 are sufficient. In fact, by using a feature of C++ called default arguments, the two constructors given previously can be combined into one. Default arguments can be used in the argument list of any C++ function definition, not just class constructors. To define default arguments for a function, simply put the assignment in the argument list just as if you were assigning a value to a variable in a declaration statement. If the function call does not include a value for that argument, the default value is used. There is one major restriction, however: once you start assigning default arguments, all arguments to the right must also have default arguments.

Another type of constructor that is often needed is the copy constructor. A copy constructor takes one argument: an object (or a reference or pointer to an object) of the same class as the one being constructed. The newly constructed object is a clone of the one passed as the argument. Actually, the compiler is capable of generating copy constructors for simple objects having scalar non-pointer data members, such as the point class. So if the class definition does not include a copy constructor, the compiler will create one that simply copies the data members into the newly created object.
Example 9.3 illustrates the point class with a default constructor that uses default arguments and a simple copy constructor.

EXAMPLE 9.3 Default Arguments and Copy Constructor

File point.h contains:

1)  #ifndef POINT_H
2)  #define POINT_H
3)  class point
4)  {
5)  public:
6)      double x;
7)      double y;
8)      point(double xx = 0.0, double yy = 0.0) : x(xx), y(yy) {};
9)      point(point &p);
10) };
11) #endif

File point.cpp contains:

1)  #include "point.h"
2)  point::point(point &p)
3)  {
4)      x = p.x;
5)      y = p.y;
6)  }

File ex43.cpp contains:

1)  #include <iostream>
2)  #include "point.h"
3)  using namespace std;
4)  int main()
5)  {
6)      point a, b(0.0, 0.0);
7)      cout << "Point a is (" << a.x << ", " << a.y << ")\n";
8)      cout << "Point b is (" << b.x << ", " << b.y << ")\n";
9)      a.x = 3.0;
10)     a.y = 4.0;
11)     b.x = 1.0;
12)     b.y = -1.0;
13)     cout << "Point a is (" << a.x << ", " << a.y << ")\n";
14)     cout << "Point b is (" << b.x << ", " << b.y << ")\n";
15)     point c(a);
16)     cout << "Point c is (" << c.x << ", " << c.y << ")\n";
17) }
Program Output:

18) Point a is (0, 0)
19) Point b is (0, 0)
20) Point a is (3, 4)
21) Point b is (1, -1)
22) Point c is (3, 4)
Comments: Since the two-argument constructor in line 8) of point.h has default values for both arguments, it is used to initialize both a and b in line 6) of ex43.cpp. The copy constructor is declared in line 9) of point.h, but it is defined in file point.cpp. Note the use of the scope resolution operator (::) in line 2) of point.cpp to inform the compiler that this is the definition of a member function of class point and not a global function.

9.4.2 Destructors

Destructors are special member functions that are responsible for properly disposing of object data members when they are no longer needed. About the only time destructors must be implemented by the class programmer is when the class makes use of dynamic memory allocation or other pointers to memory. Otherwise, the compiler-provided destructor will handle the job nicely. The user of a class need not be concerned with the destructor, since destructors are never called directly by the user; the compiler calls the destructor implicitly whenever an object is no longer needed. Chapter 10 will introduce classes that require destructors. Consult the source code of the software used there for examples of object destructors. Destructor functions are easily identified because they have the same name as the class preceded by the tilde (~) character.

9.4.3 Actions

Action member functions are the most general-purpose member functions. Their names, arguments, and behavior are completely open and may be designed and implemented at will by the class designer. The public member functions of a class constitute the interface to the class, meaning that all access to objects instantiated from the class must be made by calling the public member functions listed in the class header file. Action member functions must be called explicitly in order to do their work.
Since member functions are associated with specific objects of the class, the particular object that the function is to act upon must be identified. The C++ syntax for calling an object's member functions is similar to the syntax used for accessing the object's data members: the object's name is written, followed by a period (.) and the name of the member function being called, with the required function arguments in parentheses. For the point class being developed in this chapter, several meaningful action member functions could be written. Two are developed in Example 9.4: the first, named radius(), returns the distance from the origin to the point; the second, named angle(), returns the angle in radians that a line from the origin to the point makes with the positive x-axis.

EXAMPLE 9.4 Using Action Member Functions

File point.h contains:
1)  #ifndef POINT_H
2)  #define POINT_H
3)  class point
4)  {
5)  public:
6)      double x;
7)      double y;
8)      point(double xx = 0.0, double yy = 0.0) : x(xx), y(yy) {};
9)      point(point &p);
10)     double radius(void);
11)     double angle(void);
12) };
13) #endif

File point.cpp contains:

1)  #include "point.h"
2)  #include <cmath>
3)  point::point(point &p)
4)  {
5)      x = p.x;
6)      y = p.y;
7)  }
8)  double point::radius(void)
9)  {
10)     return sqrt(x*x + y*y);
11) }
12) double point::angle(void)
13) {
14)     return atan2(y, x);
15) }

File ex44.cpp contains:

1)  #include <iostream>
2)  #include "point.h"
3)  using namespace std;
4)  int main()
5)  {
6)      point a(3.0, 4.0), b(1.0, -1.0);
7)      cout << "Point a is (" << a.x << ", " << a.y << ")\n"
8)           << "Radius of a is " << a.radius()
9)           << "\nAngle of a is " << a.angle() << " radians\n";
10)     cout << "Point b is (" << b.x << ", " << b.y << ")\n"
11)          << "Radius of b is " << b.radius()
12)          << "\nAngle of b is " << b.angle() << " radians\n";
13) }
Program Output:

1) Point a is (3, 4)
2) Radius of a is 5
3) Angle of a is 0.927295 radians
4) Point b is (1, -1)
5) Radius of b is 1.41421
6) Angle of b is -0.785398 radians
9.4.4 "Set"/"Get" Member Functions
As stated previously, it is bad programming practice to make the data members of an object public, since there is no way to protect the data from unauthorized or unintended modification by code external to the class. The encapsulation requirements of OOP direct that only code within the class (member functions) should be allowed to modify the data members. It is presumed that the author of the class will be able to write code that filters requested data changes to prevent illegal or inconsistent data from being stored in the object. For this reason, it is almost universally true that class data members are declared private: so that they cannot be accessed by functions external to the class (except for friend functions). If the creator of the class wishes to allow any data members of objects created from the class to be modified after the object has been constructed, then he or she must provide so-called "set" member functions (declared public:) to validate and apply the changes requested by the user. "Set" member functions are typically defined for data members which represent the state of the object, but not for data members which represent the properties of the object.

Once the data members are declared private:, however, they become totally invisible to functions external to the class, for both reading and writing: accessing a private data member with the member selection operator (.) is not permitted on either the left or right side of the = sign. Thus there is a need for a complementary set of functions to obtain the values of the object data members. These are appropriately called "get" member functions. The class author can use "get" functions to control just how much of the object's data is visible to functions using the object, and in what format the data is made available.

For the simple point class being developed in this chapter, "set" and "get" functions are not really required, since there are no restrictions on the values of data members x and y. But it is easy to see how such functions could become important if, for example, the point class were restricted to points in the first quadrant only. In that case it would be very important for the "set" functions to ensure that the data members were not allowed to become negative. Nonetheless, Example 9.5 shows how "set" and "get" functions could be implemented.
Note that the constructor has been modified to call the "set" member functions. Presumably, the "set" member functions will contain the necessary filters to ensure that the data members contain valid data. These filters must also be invoked when the object is created, lest the object start out holding invalid data. Rather than duplicating the argument-checking code in the constructor, it is common practice to create the object data with known valid values and then pass the constructor arguments to the appropriate "set" member functions, which store them if they are valid.

EXAMPLE 9.5 Using "Set" and "Get" Member Functions

File point.h contains:

1)  #ifndef POINT_H
2)  #define POINT_H
3)  class point
4)  {
5)  private:
6)      double x;
7)      double y;
8)  public:
9)      point(double xx = 0.0, double yy = 0.0);
10)     point(point &p);
11)     void setX(double xx);
12)     void setY(double yy);
13)     double getX(void) { return x; }
14)     double getY(void) { return y; }
15)     double radius(void);
16)     double angle(void);
17) };
18) #endif

File point.cpp contains:

1)  #include "point.h"
2)  #include <cmath>
3)  point::point(double xx, double yy) : x(0.0), y(0.0)
4)  {
5)      setX(xx);
6)      setY(yy);
7)  }
8)  point::point(point &p)
9)  {
10)     x = p.x;
11)     y = p.y;
12) }
13) void point::setX(double xx)
14) {
15)     x = xx;
16) }
17) void point::setY(double yy)
18) {
19)     y = yy;
20) }
21) double point::radius(void)
22) {
23)     return sqrt(x*x + y*y);
24) }
25) double point::angle(void)
26) {
27)     return atan2(y, x);
28) }

File ex45.cpp contains:

1)  #include <iostream>
2)  #include "point.h"
3)  using namespace std;
4)  int main()
5)  {
6)      point a(3.0, 4.0), b(1.0, -1.0);
7)      cout << "Point a is (" << a.getX() << ", " << a.getY() << ")\n";
8)      cout << "Point b is (" << b.getX() << ", " << b.getY() << ")\n";
9)      a.setX(-3.0);
10)     b.setY(1.0);
11)     cout << "Point a is (" << a.getX() << ", " << a.getY() << ")\n";
12)     cout << "Point b is (" << b.getX() << ", " << b.getY() << ")\n";
13) }
Program Output:

1) Point a is (3, 4)
2) Point b is (1, -1)
3) Point a is (-3, 4)
4) Point b is (1, 1)
Comments: If you are using a compiler that allows interactive debugging by tracing program execution line by line, it is very instructive to step through this program and watch the "get" and "set" functions do their work, shielding the data from errant code.

9.4.5 Overloaded Operators

C++ allows most of the standard operators, such as assignment (=), addition (+), subtraction (-), multiplication (*), division (/), equality (==), inequality (!=), and many others, to be applied to user-defined objects through the mechanism of operator overloading. The creator of the class must decide which operators, if any, are meaningful for the class. For those that make sense, member or friend functions can be created to implement the respective operators, thus allowing standard C++ expression syntax to be applied to objects created from the class.
For the point class, some operators such as =, +, -, ==, !=, >, <, >=, and <= are easy to understand. Others may make sense only in special circumstances. For example, it is difficult to attach meaning to the product of two points, but the notion of a point multiplied by a scalar is easily understood. So generally, only a subset of the available operators is overloaded for a class. If the compiler finds an object used in an expression with an operator that has not been overloaded, the program will not compile.

Operator overloading is one of the more difficult and abstruse aspects of C++. It can be hard to ascertain in advance exactly which operators must be overloaded, and in what context. For example, the subtraction operator (-) can be both unary and binary: in the unary case it negates its single operand, and in the binary case it subtracts the second operand from the first. Matters can quickly become more complicated still, as with the multiplication operator for the point class. As suggested earlier, we should overload * for multiplication of a point by a scalar (in either order), but we should not permit * to multiply two points together. To further complicate things, many operators can be overloaded using either member functions or non-member (usually friend) functions, while others, such as the assignment operator, must be declared as class member functions. All this suggests that there is no unique way to define overloaded operators for a class, and an exhaustive treatment of the topic is well beyond the scope of this book.

Two operators that are often overloaded for user-defined classes are the stream-insertion and stream-extraction operators. These two are slightly different from those discussed above because they serve to extend the capability of the iostream objects such as cin and cout.
It is usually wise to make user-defined classes conformable to iostream objects so that they can be read and written just like the C++ built-in data types. Example 9.6 shows how some overloaded operators, including stream-insertion and stream-extraction, could be implemented for the point class.

EXAMPLE 9.6 Using Overloaded Operators

File point.h contains:

1)  #ifndef POINT_H
2)  #define POINT_H
3)  #include <iostream>
4)  using namespace std;
5)  class point
6)  {
7)  private:
8)      double x;
9)      double y;
10) public:
11)     point(double xx = 0.0, double yy = 0.0);
12)     point(const point &p);
13)     void setX(double xx);
14)     void setY(double yy);
15)     double getX(void) { return x; }
16)     double getY(void) { return y; }
17)     double radius(void);
18)     double angle(void);
19)     const point& operator=(const point& p);
20)     point& operator+(void);
21)     point& operator-(void);
22)     const point& operator+(const point& p);
23)     const point& operator-(const point& p);
24)     bool operator==(const point& p);
25)     bool operator!=(const point& p) { return !(*this == p); }
26)     friend const point& operator*(const point& p, double m);
27)     friend const point& operator*(double m, const point& p);
28)     friend const point& operator/(const point& p, double d);
29)     friend ostream& operator<<(ostream&, const point &);
30)     friend istream& operator>>(istream&, point &);
31) };
32) #endif

File point.cpp contains:

33) #include "point.h"
34) #include <cmath>
35) point::point(double xx, double yy) : x(0.0), y(0.0)
36) {
37)     setX(xx);
38)     setY(yy);
39) }
40) point::point(const point &p)
41) {
42)     x = p.x;
43)     y = p.y;
44) }
45) void point::setX(double xx)
46) {
47)     x = xx;
48) }
49) void point::setY(double yy)
50) {
51)     y = yy;
52) }
53) double point::radius(void)
54) {
55)     return sqrt(x*x + y*y);
56) }
57) double point::angle(void)
58) {
59)     return atan2(y, x);
60) }
61) ostream& operator<<(ostream& os, const point& p)
62) {
63)     os << "(" << p.x << ", " << p.y << ")";
64)     return os;
65) }
66) istream& operator>>(istream& is, point& p)
67) {
68)     double x, y;
69)     is.ignore();  // skip (
70)     is >> x;
71)     is.ignore();  // skip ,
72)     is >> y;
73)     is.ignore();  // skip )
74)     p.setX(x);
75)     p.setY(y);
76)     return is;
77) }
78) const point& point::operator=(const point& p)
79) {
80)     if (this != &p)
81)     {
82)         x = p.x;
83)         y = p.y;
84)     }
85)     return *this;
86) }
87) bool point::operator==(const point& p)
88) {
89)     return x == p.x && y == p.y;
90) }
91) point& point::operator+(void)
92) {
93)     return *this;
94) }
95) point& point::operator-(void)
96) {
97)     x = -x;
98)     y = -y;
99)     return *this;
100) }
101) const point& point::operator+(const point& p)
102) {
103)     point *ans = new point();
104)     ans->x = x + p.x;
105)     ans->y = y + p.y;
106)     return *ans;
107) }
108) const point& point::operator-(const point& p)
109) {
110)     point *ans = new point();
111)     ans->x = x - p.x;
112)     ans->y = y - p.y;
113)     return *ans;
114) }
115) const point& operator*(const point& p, double m)
116) {
117)     point *ans = new point();
118)     ans->x = p.x * m;
119)     ans->y = p.y * m;
120)     return *ans;
121) }
122) const point& operator*(double m, const point& p)
123) {
124)     point *ans = new point(p);
125)     ans->x *= m;
126)     ans->y *= m;
127)     return *ans;
128) }
129) const point& operator/(const point& p, double d)
130) {
131)     point *ans = new point(p);
132)     ans->x /= d;
133)     ans->y /= d;
134)     return *ans;
135) }

File ex46.cpp contains:

136) #include <iostream>
137) #include "point.h"
138) int main()
139) {
140)     point a(3.0, 4.0), b(1.0, -1.0);
141)     cout << "Point a is " << a << endl;
142)     cout << "Point b is " << b << endl;
143)     b = a;
144)     cout << "Point b is now " << b << endl;
145)     cout << "Are a and b equal? ";
146)     if (a==b) cout << "Yes" << endl;
147)     else cout << "No" << endl;
148)     point c = -b;
149)     cout << "Point c is -b = " << c << endl;
150)     c = a + b;
151)     cout << "Point c is now (a + b) = " << c << endl;
152)     c = a - b;
153)     cout << "Point c is now (a - b) = " << c << endl;
154)     c = b - a;
155)     cout << "Point c is now (b - a) = " << c << endl;
156)     c = a * 2.0;
157)     cout << "Point c is now (a * 2.0) = " << c << endl;
158)     c = 2.0 * a;
159)     cout << "Point c is now (2.0 * a) = " << c << endl;
160)     c = a / 2.0;
161)     cout << "Point c is now (a / 2.0) = " << c << endl;
162) }
Program Output:

Point a is (3, 4)
Point b is (1, -1)
Point b is now (3, 4)
Are a and b equal? Yes
Point c is -b = (-3, -4)
Point c is now (a + b) = (0, 0)
Point c is now (a - b) = (6, 8)
Point c is now (b - a) = (-6, -8)
Point c is now (a * 2.0) = (6, 8)
Point c is now (2.0 * a) = (6, 8)
Point c is now (a / 2.0) = (1.5, 2)

Comments: Several overloaded operators have been added to the class; their operations are illustrated in the program output. The implementations of the overloaded operators have been deliberately varied to illustrate the wide variety of conventions and styles that can be used to write overloaded operator functions. For example, binary addition and subtraction are declared as member functions on lines 22 and 23 of point.h and implemented on lines 101 through 114 of point.cpp, while pre- and post-multiplication of a point by a double are declared as friend functions on lines 26 and 27 of point.h and implemented on lines 115 through 128 of point.cpp. Note that two separate functions are required, since C++ does not assume the * operator is commutative. The division operator (/) is defined only for division of a point by a double; an expression in a user program requesting division of a double by a point would generate a syntax error. When developing overloaded operators for a class, a reasonable strategy is to adopt the philosophy of "building the sidewalks to follow the footpath traffic patterns": define those overloaded operator functions that are known to be needed a priori, then begin writing code that uses objects created from the class and let the compiler suggest the need for other overloaded operators.

9.5 THE ENM CLASSES
Several classes, collectively called the ENM classes, have been developed to facilitate the numerical methods covered in the following chapters. The ENM classes are based heavily on the powerful Template Numerical Toolkit[4] and include classes and algorithms for vectors, matrices, polynomials, and tridiagonal systems. Later chapters will include example programs which use the
ENM classes. Here we present some basic essentials of the Vector and Matrix classes, since they are foundational. The ENM classes are defined in the context of a C++ namespace that is also named ENM. Therefore, in order to use any of the ENM classes, not only must the appropriate #include directives be present at the top of the user program, but the statement

using namespace ENM;
must be present as well; otherwise the ENM classes will not be visible to the compiler.

9.5.1 Vector and Matrix Classes

The ENM Vector and Matrix classes are defined in header files enmvec.h and enmmat.h, respectively. The classes are defined as templates, which means that virtually any data type can be stored in Vector and Matrix objects. The subject of template classes is too advanced to cover in this book, but from a user's standpoint it means that any of the following declarations of Vector and Matrix objects are legal.

Vector<int> list(10);     // list is a 10-element vector of integers.
                          // Unless otherwise specified, all values will be set to zero.
Vector<float> grades;     // grades is an empty vector of floats
Vector<double> times(5, "8.2 12.4 2.5 9.7 5.3");
                          // times is a 5-element vector of doubles which are
                          // initialized to the 5 numbers in the string constant
Matrix<int> grid;         // grid is an empty matrix of ints
Matrix<float> field(6,3); // field is a matrix of floats having 6 rows and
                          // 3 columns, initialized to all zeros
Matrix<double> results(3, 4, "6.2  1.1 -2.4 9.0 "
                             "2.0  8.3  6.5 4.1 "
                             "4.0 -2.6  5.1 8.2");
                          // results is a matrix of doubles having 3 rows and
                          // 4 columns. The values in the string constant are
                          // the initial values. Note the spaces at the end of
                          // the first and second rows of initializers; these
                          // are required.

As a convenience, the following synonyms are defined in the header files:

IVector    Vector<int>
FVector    Vector<float>
DVector    Vector<double>
IMatrix    Matrix<int>
FMatrix    Matrix<float>
DMatrix    Matrix<double>
The storage space for all ENM classes is dynamically allocated. Thus, the number of rows and/or columns specified in an object declaration need not be constant. Furthermore, the number of rows and/or columns will expand or contract as necessary during program execution. This feature alone makes the ENM classes worth using, because standard C++ arrays must be declared using constants for the number of rows and/or columns.

Stream insertion and extraction operators for the iostream classes are provided so that ENM vectors and matrices can be used with the standard cin and cout objects and the standard >> and << operators. These operations are invertible, so that a matrix m written with cout << m; can be read back in using cin >> m;.

The ENM Vector and Matrix classes have the unusual capability of dual personalities. Once an ENM Vector or Matrix object has been instantiated, its elements may be accessed using either the [] or the () subscript notation. If the [] notation is used (e.g., a[3][4], b[j]), the normal C++ subscripting rules apply, in which the first element is stored at index 0 (zero). If the () notation is used (e.g., a(3,4), b(j)), the FORTRAN subscripting rule applies, in which the first element is stored at index 1 (one). Notice that the ENM Matrix class even supports the FORTRAN convention of separating the row and column subscripts of a two-dimensional array with a comma and enclosing both subscripts in a single set of parentheses. Having this dual personality makes
conversion of legacy FORTRAN code to C++ much simpler.

9.5.2 Miscellaneous ENM Classes

A very convenient format class[8] is provided in header file enmmisc.h to simplify the arcane process of formatting numerical output from objects of the iostream classes, such as cout. Objects of the format class are constructed using three arguments: the first is a concatenation of C++ iomanipulators to control the appearance of the number; the second is an integer specifying the precision of the number, if floating point; the third is the width of the field in which to print the number. A complete discussion of the use of iomanipulators is far beyond the scope of this book; consult any good C++ programming book for coverage of the topic. Once a format object has been defined, it can be sent to an output stream to control the appearance of subsequent output. Refer to Example 11.3 on page 131 and Example 12.1 on page 145 for illustrations of the format class.
Glossary

abstraction: the process of picking out common features of programming objects and consolidating them in such a way as to reduce program complexity.
action or method: a member function that operates on member data of an object.
attribute: an object data member that permanently characterizes the object, i.e., a data member that usually doesn't change as the program runs.
class: a C++ programming construct that integrates data and appropriate actions to perform on the data.
constructor: a special member function that has the same name as the class and is used to initialize the member data of the object as it is instantiated.
destructor: a special member function that is called automatically to release any dynamic memory when an object is no longer needed.
encapsulation: the process of combining elements to form a new entity.
information hiding: the programming feature that limits the exposure of variable data to modification.
inheritance: the programming feature that allows classes to be defined as specialized forms of more general classes that have been previously defined.
instantiate: to create a specific occurrence of an object based on a class definition.
member data: data items that are associated with a specific instantiation of an object of a particular class.
member function: a function that is associated with a specific instantiation of an object of a particular class.
messages: communications among objects achieved by calling member functions.
namespace: a C++ keyword used to partition identifiers so as to avoid duplication.
object: an instance of a C++ class, i.e., a specific variable in a C++ program that was constructed from a class definition.
object-oriented programming (OOP): a programming paradigm that emphasizes the use of objects.
overloaded function or operator: a function or operator that has been redefined to apply uniquely to objects of a specific class.
procedure abstraction: a program design methodology that relies heavily on the use of many carefully crafted functions working together to accomplish the program's purpose.
procedure-oriented programming: a programming paradigm that emphasizes the use of small, well-written functions to perform operations on external data.
project: the collection of all the program files, settings, options, etc., necessary to create an executable computer program.
state: an object data member that is expected to change in value as the program runs.
Chapter 9 Problems

9.1
Matrices are used extensively in computer graphics applications. For a brief introduction, see
⁸ Flowers, B.H., Introduction to Numerical Methods and C++, p. 131.
http://www.cc.gatech.edu/gvu/multimedia/nsfmmedia/graphics/elabor/2dtransf/2dtransf.html. Write a C++ program that uses the enmvec and enmmat
classes to construct a transformation matrix that will transform a 1 by 1 rectangle originally centered at position (6,0) so that its new center is at (-3,7). As the rectangle is moved, it grows in size by a factor of 2 and also rotates 15 degrees counterclockwise about its own center. Print out a sequence of transformation matrices that will accomplish this task and show the coordinates of the four corners of the rectangle after each step of the transformation. Construct a composite transformation matrix that does the entire sequence in one step. Print this matrix and apply it to the original data and show that the transformed rectangle is the same as the one obtained using the step-by-step approach.

9.2
Eigenvalues and eigenvectors of matrices are important to engineers in the study of vibrations and other areas. One of the simplest numerical methods for finding the largest eigenvalue of a square matrix A is the von Mises Power Method. To begin the Power Method, a vector, x, having the same number of elements as the order of the matrix is formed. The initial elements of x can be almost anything except all zero. We will use random values supplied by the C++ rand() function. Normalize the vector by dividing each element by the first element, x[0]. Multiply the matrix A times the vector x and replace vector x with this product. Normalize the vector x with respect to x[0] and repeat the cycle. When the normalization factor approaches a constant value, the process has converged and the normalization factor is the dominant eigenvalue of the matrix. After convergence, the vector x is the eigenvector corresponding to that eigenvalue. Use the Power Method to find the dominant eigenvalue of the matrix
⎡ 1 1 1 ⎤
⎢ 1 2 3 ⎥
⎣ 1 3 6 ⎦

Use the vector {1, 1, 1}ᵀ as the starting guess.
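The iteration described above can be sketched with plain std::vector in place of the ENM classes. powerMethod() is a hypothetical helper, and the 2×2 test matrix (dominant eigenvalue 3) is illustrative only, not the matrix of problem 9.2:

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// von Mises power method: repeatedly form y = A*x and normalize by
// the first element; the normalization factor converges to the dominant
// eigenvalue and x to the corresponding eigenvector. Hypothetical helper.
double powerMethod(const std::vector<std::vector<double> > &A,
                   std::vector<double> x, int maxIter = 100)
{
    const std::size_t n = A.size();
    double factor = 0.0;
    for (int iter = 0; iter < maxIter; ++iter)
    {
        std::vector<double> y(n, 0.0);          // y = A * x
        for (std::size_t i = 0; i < n; ++i)
            for (std::size_t j = 0; j < n; ++j)
                y[i] += A[i][j] * x[j];
        factor = y[0];                          // normalization factor
        for (std::size_t i = 0; i < n; ++i)
            x[i] = y[i] / factor;               // normalize with respect to x[0]
        // a full version would stop when factor changes negligibly
    }
    return factor;                              // dominant eigenvalue
}
```

For the matrix {{2,1},{1,2}} the iteration converges to the dominant eigenvalue 3 with eigenvector proportional to (1, 1).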
CHAPTER 10

Simultaneous Algebraic Equations

CHAPTER OUTLINE
10.1 Introduction
10.2 Linear Systems
  10.2.1 Diagonal Systems
  10.2.2 Triangular Systems
  10.2.3 Gauss Elimination
  10.2.4 Tridiagonal Systems
  10.2.5 LU Decomposition
  10.2.6 Gauss-Seidel Iteration
10.3 Nonlinear Systems
10.1 INTRODUCTION
The need to solve simultaneous algebraic equations arises frequently in engineering problems, both of its own accord and as a part of the solution of a more complex problem, such as in finite-element analysis. The task is to find numerical values for a set of independent variables such that several prescribed functional equality relationships exist among the variables. Generally speaking, the number of unique functional relationships specified must equal the number of independent variables for there to be a solution. If the form of each specified relationship is a linear combination of the independent variables, then a unique solution is possible. If the relationship is nonlinear, no general guarantee as to the existence of any solution or the uniqueness of the solution is available. In engineering, linear systems can be used to describe problems such as electrical resistor networks and truss problems in statics. Nonlinear problems can result from similar physical situations in which the constituent equations are nonlinear, say because of variable resistance as a function of current.
10.2 LINEAR SYSTEMS

Linear algebraic systems have the functional form
a11 x1 + a12 x2 + a13 x3 + … + a1n xn = b1
a21 x1 + a22 x2 + a23 x3 + … + a2n xn = b2
a31 x1 + a32 x2 + a33 x3 + … + a3n xn = b3
  ⋮
an1 x1 + an2 x2 + an3 x3 + … + ann xn = bn

which can be rewritten in expanded matrix form as
⎡ a11 a12 a13 … a1n ⎤ ⎧ x1 ⎫   ⎧ b1 ⎫
⎢ a21 a22 a23 … a2n ⎥ ⎪ x2 ⎪   ⎪ b2 ⎪
⎢ a31 a32 a33 … a3n ⎥ ⎨ x3 ⎬ = ⎨ b3 ⎬
⎢  ⋮    ⋮   ⋮   ⋱  ⋮ ⎥ ⎪ ⋮  ⎪   ⎪ ⋮  ⎪
⎣ an1 an2 an3 … ann ⎦ ⎩ xn ⎭   ⎩ bn ⎭

or, more compactly, as

[A]{x} = {b}.
Numerical values will be known for each of the aij's and the bi's. The task is to find numerical values for the xi's such that each of the n equations is satisfied. Solution methods for this problem fall into two general categories: direct methods and indirect, or iterative, methods. Direct methods, as the name suggests, find the solution values in a direct, i.e., non-iterative, manner. There is no guesswork or trial-and-error involved. The algorithm proceeds methodically to a solution. Assuming the equations have a solution (i.e., are nonsingular), all of the direct methods will arrive at the solution. But, we shall see that direct methods are limited by roundoff error, which tends to accumulate during the elimination process. So for some systems of equations, the answers obtained by direct methods may be unusable because of accumulated roundoff error. By contrast, iterative methods are not subject to accumulated roundoff error. Each iteration proceeds by correcting the given trial solution vector using information derived from the governing equations. But this process is time-consuming and not always convergent.

10.2.1 Diagonal Systems

Diagonal systems, those for which [A] has nonzero elements only on the principal diagonal, are almost trivial. If the principal diagonal elements are all unity, then it is intuitively obvious to the casual observer that the elements of {b} are the solutions for {x}, as in the following set of equations.
⎡ 1 0 0 ⎤ ⎧ x1 ⎫   ⎧  5 ⎫
⎢ 0 1 0 ⎥ ⎨ x2 ⎬ = ⎨  1 ⎬
⎣ 0 0 1 ⎦ ⎩ x3 ⎭   ⎩ −2 ⎭

Multiplying through on the left-hand side yields the set of equations

x1 = 5
x2 = 1
x3 = −2
which, of course, is the solution. If the diagonal elements are not unity, as shown in the linear system given below, then only a single division per equation is necessary to make the diagonal elements unity and yield the solution as above.
⎡ 5 0 0 ⎤ ⎧ x1 ⎫   ⎧ 25 ⎫
⎢ 0 3 0 ⎥ ⎨ x2 ⎬ = ⎨  3 ⎬
⎣ 0 0 4 ⎦ ⎩ x3 ⎭   ⎩ −8 ⎭
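In code, this is one division per equation. A minimal sketch using std::vector in place of the ENM classes (diagSolve() is a hypothetical name):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Solve a diagonal system: diag holds the principal-diagonal entries
// of [A], b the right-hand side. One division per equation.
std::vector<double> diagSolve(const std::vector<double> &diag,
                              const std::vector<double> &b)
{
    std::vector<double> x(b.size());
    for (std::size_t i = 0; i < b.size(); ++i)
        x[i] = b[i] / diag[i];   // assumes diag[i] != 0
    return x;
}
```

Applied to the system above, diagSolve({5, 3, 4}, {25, 3, -8}) returns {5, 1, -2}.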
10.2.2 Triangular Systems

Triangular systems are those for which [A] has nonzero elements only on the principal diagonal and other diagonals either above and to the right of the principal diagonal (called upper triangular) or below and to the left of the principal diagonal (called lower triangular). Upper triangular systems can be solved using a process called back substitution, and lower triangular systems can be solved using an analogous procedure called forward substitution. Here we will illustrate only back substitution. Given an upper triangular system
⎡ 1 4  1 ⎤ ⎧ x1 ⎫   ⎧  7 ⎫
⎢ 0 1 −1 ⎥ ⎨ x2 ⎬ = ⎨  3 ⎬
⎣ 0 0 −9 ⎦ ⎩ x3 ⎭   ⎩ 18 ⎭

we observe that the last equation has only one nonzero coefficient, so the value of x3 can be determined directly as

x3 = b3 / a33 = 18 / (−9) = −2.
Once the value of x3 has been obtained, row 2 can be rewritten in equation form as
1 × x2 − 1 × (−2) = 3

from which we obtain x2 = (3 − (−1)(−2))/1 = 1. In a more general form,

x2 = (b2 − a23 x3) / a22.

Now that the values of x2 and x3 are known, the first equation becomes solvable and yields x1 = (7 − 4(1) − 1(−2))/1 = 5. The general formula for back substitution can be written as
xi = ( bi − Σj=i+1..n aij xj ) / aii .
The procedure for forward substitution, which is used for lower triangular systems, follows this same logic, only it starts by solving for x1 and proceeds down the system, solving for each value xi using the known values of the x's above it. The general form of the solution for each element of {x} using forward substitution is given by

xi = ( bi − Σj=1..i−1 aij xj ) / aii .
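Both formulas translate directly into code. A sketch using 0-based C++ indexing and std::vector (the function names are hypothetical):

```cpp
#include <cassert>
#include <cmath>
#include <vector>

typedef std::vector<std::vector<double> > Mat;

// Back substitution for an upper triangular [A]:
// x[i] = (b[i] - sum of a[i][j]*x[j] for j > i) / a[i][i]
std::vector<double> backSubstitute(const Mat &a, const std::vector<double> &b)
{
    const int n = (int)b.size();
    std::vector<double> x(n);
    for (int i = n - 1; i >= 0; --i)    // bottom row first
    {
        double sum = b[i];
        for (int j = i + 1; j < n; ++j)
            sum -= a[i][j] * x[j];
        x[i] = sum / a[i][i];
    }
    return x;
}

// Forward substitution for a lower triangular [A]:
// x[i] = (b[i] - sum of a[i][j]*x[j] for j < i) / a[i][i]
std::vector<double> forwardSubstitute(const Mat &a, const std::vector<double> &b)
{
    const int n = (int)b.size();
    std::vector<double> x(n);
    for (int i = 0; i < n; ++i)         // top row first
    {
        double sum = b[i];
        for (int j = 0; j < i; ++j)
            sum -= a[i][j] * x[j];
        x[i] = sum / a[i][i];
    }
    return x;
}
```

On the upper triangular example above, backSubstitute returns x = (5, 1, −2).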
In the numerical example given, all the calculations resulted in integer values. Of course, this will not be true in general, and each calculation will generally introduce some roundoff error. It is instructive to qualitatively assess the impact of this process, as it is a determining factor in the application of direct methods for solving simultaneous equations. Using back substitution as the model process and assuming that all the numbers in [A] and {b} are known to be free of any errors, the computed value of xn will incur only a slight amount of roundoff error resulting from a single division operation. But this slightly tainted number is then used in the calculation of xn-1, which involves one multiplication, one subtraction, and one division. Because this calculation involves more operations and it uses a tainted value of xn, the computed value of xn-1 will contain significantly more roundoff error than the value of xn. In a similar manner, the "quality" of xn-2 will be lower than either xn-1 or xn because it involves yet more mathematical operations and uses numbers that are known to be suspect because of roundoff error. Because of the probabilistic nature of roundoff error, the amount of error present in any given back or forward substitution cannot be calculated precisely. But it will always be true that the accuracy of the computed solutions for the values of {x} degrades from bottom to top when using back substitution and from top to bottom when using forward substitution.

10.2.3 Gauss Elimination

Most linear systems are not initially presented in diagonal or triangular form. Gauss elimination (and related methods such as Gauss-Jordan) transforms general sets of linear equations into either triangular or diagonal form. The resulting triangular or diagonal system then can be solved using the methods of the previous section. Standard Gauss elimination reduces the system to triangular form by "eliminating" (setting to zero) coefficients of [A] in each column that are below the principal diagonal.
It does this by applying two of the three elementary row operations permitted on linear systems: multiplying a row by a constant and addition/subtraction of rows. These operations do not change the relationships among the independent variables implied by the original equations while changing the form of the equations into upper triangular form. Thus, the solution to the resulting upper triangular set is equivalent to the solution to the original equations. We will illustrate Gauss elimination using the following set of three simultaneous linear equations
2 x1 + 8 x2 + 2 x3 = 14
3 x1 + 18 x2 − 3 x3 = 39
2 x1 − x2 + 2 x3 = 5

which, when written in matrix form, becomes

⎡ 2  8  2 ⎤ ⎧ x1 ⎫   ⎧ 14 ⎫
⎢ 3 18 −3 ⎥ ⎨ x2 ⎬ = ⎨ 39 ⎬
⎣ 2 −1  2 ⎦ ⎩ x3 ⎭   ⎩  5 ⎭
Gauss elimination begins by designating the first row as the pivot row. The diagonal element on the pivot row, a11 in this case, is called the pivot element. The pivot row will be progressively moved from row 1 to row 2 and on down to row n-1. Each time the pivot row changes, all the coefficients below the pivot element in [A] will be eliminated (set to zero) using elementary row operations. The first step is to normalize the pivot row with respect to the pivot element by dividing each element in the pivot row by the pivot element. To preserve the equality, the first row of the right-hand side (the single element b1 in this case) must also be divided by the pivot element. The resulting system is
⎡ 1  4  1 ⎤ ⎧ x1 ⎫   ⎧  7 ⎫
⎢ 3 18 −3 ⎥ ⎨ x2 ⎬ = ⎨ 39 ⎬
⎣ 2 −1  2 ⎦ ⎩ x3 ⎭   ⎩  5 ⎭
The value 3 in matrix element a21 can be set to zero using elementary row operations by multiplying the first row by −3, adding it to row 2, and placing the result in row 2. The individual elements of the second row of [A] are computed as follows:
a21 = −3a11 + a21 = −3(1) + 3 = 0
a22 = −3a12 + a22 = −3(4) + 18 = 6
a23 = −3a13 + a23 = −3(1) − 3 = −6
b2 = −3b1 + b2 = −3(7) + 39 = 18

The system now appears as
⎡ 1  4  1 ⎤ ⎧ x1 ⎫   ⎧  7 ⎫
⎢ 0  6 −6 ⎥ ⎨ x2 ⎬ = ⎨ 18 ⎬
⎣ 2 −1  2 ⎦ ⎩ x3 ⎭   ⎩  5 ⎭
We repeat the process for row 3 to eliminate the 2 in position a31. To do this, row 1 is multiplied by the negative value of element a31 (-2) and added to row 3. The new elements of row 3 are calculated as follows:
a31 = −2a11 + a31 = −2(1) + 2 = 0
a32 = −2a12 + a32 = −2(4) − 1 = −9
a33 = −2a13 + a33 = −2(1) + 2 = 0
b3 = −2b1 + b3 = −2(7) + 5 = −9

The resulting system with the appropriate elements of the first column having been eliminated now appears as
⎡ 1  4  1 ⎤ ⎧ x1 ⎫   ⎧  7 ⎫
⎢ 0  6 −6 ⎥ ⎨ x2 ⎬ = ⎨ 18 ⎬
⎣ 0 −9  0 ⎦ ⎩ x3 ⎭   ⎩ −9 ⎭
The careful reader may note the equations are now solvable by computing the value of x2 from row 3, using it to compute x3 in row 2, and finally using the values of x2 and x3 to compute x1 in row 1. This is an artifact of this particular set of equations which we will ignore in order to proceed with Gauss elimination. Now that all the elements below the pivot element have been eliminated, the pivot row is advanced to row 2 and the elimination process is repeated to eliminate all coefficients in column 2 below row 2. Once again, we normalize the pivot row with respect to the pivot element.
⎡ 1  4  1 ⎤ ⎧ x1 ⎫   ⎧  7 ⎫
⎢ 0  1 −1 ⎥ ⎨ x2 ⎬ = ⎨  3 ⎬
⎣ 0 −9  0 ⎦ ⎩ x3 ⎭   ⎩ −9 ⎭
The -9 in position a32 can be eliminated by multiplying row 2 by -(-9) and adding rows 2 and 3, storing the results in row 3. Note that since all elements in the affected rows to the left of the pivot element in column 2 are zero, only elements in the column of the pivot element and to the right need be calculated. (Actually, the number in the column of the pivot element need not be calculated, since the choice of the multiplier of the pivot row was specifically chosen to force this element to be zero.) The necessary calculations are detailed below.
a32 = 9(a22) + a32 = 9(1) − 9 = 0
a33 = 9(a23) + a33 = 9(−1) + 0 = −9
b3 = 9(b2) + b3 = 9(3) − 9 = 18

The resulting system now appears as
⎡ 1  4  1 ⎤ ⎧ x1 ⎫   ⎧  7 ⎫
⎢ 0  1 −1 ⎥ ⎨ x2 ⎬ = ⎨  3 ⎬
⎣ 0  0 −9 ⎦ ⎩ x3 ⎭   ⎩ 18 ⎭
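The elimination carried out above (normalize the pivot row, then zero the column below it) can be sketched as a short function. gaussNaive() is a hypothetical helper with 0-based indexing and, deliberately, no pivoting; Example 10.1 later in the chapter shows a full implementation:

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Gauss elimination without pivoting, followed by back substitution.
// Because every pivot row is normalized, the back substitution needs
// no division. a and b are taken by value so the caller's copies
// survive. Assumes no zero pivot is encountered.
std::vector<double> gaussNaive(std::vector<std::vector<double> > a,
                               std::vector<double> b)
{
    const int n = (int)b.size();
    for (int i = 0; i < n; ++i)
    {
        double pivot = a[i][i];              // no pivoting -- sketch only
        for (int j = i; j < n; ++j)
            a[i][j] /= pivot;                // normalize the pivot row
        b[i] /= pivot;
        for (int k = i + 1; k < n; ++k)      // eliminate below the pivot
        {
            double mult = -a[k][i];
            for (int j = i; j < n; ++j)
                a[k][j] += a[i][j] * mult;
            b[k] += b[i] * mult;
        }
    }
    std::vector<double> x(n);                // back substitution
    for (int i = n - 1; i >= 0; --i)
    {
        double sum = b[i];
        for (int j = i + 1; j < n; ++j)
            sum -= a[i][j] * x[j];
        x[i] = sum;                          // rows already normalized
    }
    return x;
}
```

Applied to the example system, it reproduces x1 = 5, x2 = 1, x3 = −2.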
Since the original system used here contained only three equations, the Gauss elimination procedure is now complete and the equations are in upper triangular form. For a larger system, the elimination of all coefficients in column 2 below row 2 would have continued, followed by setting row 3 as the pivot row, and the process repeated until pivot row n-1 has been processed. The resulting upper triangular system then can be solved using back substitution, as described in the previous section. Notice that once the solution for xn has been obtained as xn = bn/ann, the solution for the remaining xi values does not require dividing by aii as indicated in the previous section, since the rows have all been normalized with respect to aii in the elimination step. The numerical results for this problem are, once again, x3 = −2, x2 = 1, and x1 = 5. A qualitative assessment of the roundoff error associated with Gauss elimination reveals a predictable, though not deterministic, behavior, just as it did for the back substitution process. If all the original numbers are assumed to be pure, the numbers in the first row of the triangularized system are of high "quality" since they experienced only the single division required to normalize the row. The numbers in row 2 are of slightly lower quality, each one having been subjected to one multiplication and one addition, followed by an additional division to normalize the row. The numbers in row 3 have been subjected to an additional multiplication and addition. For a larger system, this type of analysis would reveal that, in general, the "quality" of the numbers in the triangularized system degrades as one proceeds down the rows, since each row in the final matrix has been subjected to more arithmetic operations than the row immediately above it. The result is that the least accurate numbers in the final triangularized system are the two nonzero numbers in the bottom row.
And, as Murphy’s law would have it, these are the very numbers that are used to begin the back substitution process which further degrades the accuracy of the numbers, as shown in the previous section. Thus, we observe that roundoff error is encountered both “coming” and “going” in the solution of a set of linear simultaneous algebraic equations using Gauss elimination followed by back substitution. Unfortunately, this is a curse that is inflicted upon all direct methods for solving systems of
simultaneous algebraic equations, and there is no "silver bullet" that will effectively neutralize this effect in all cases. The next section will cover row pivoting (also called partial pivoting), which helps to mitigate the effects of roundoff error when using any of the various direct methods. Because all direct methods suffer from roundoff errors, we advise that all code for direct methods such as Gauss elimination be written in double precision to reduce the impact of roundoff error.

10.2.3.1 Partial Pivoting
Section 7.2.2 presented some of the basic concepts of roundoff error and its control in numerical calculations. We have already suggested that double precision arithmetic be used when solving simultaneous equations using direct methods. One other technique that can be applied universally to this class of problem is to choose to divide by large numbers in order to keep the magnitude of the numbers as small as possible. Recall that when selecting a new pivot row in Gauss elimination, the first step is to normalize the pivot row with respect to the pivot element by dividing the entire pivot row by the pivot element. What happens if the pivot element is zero? Or if not exactly zero, what if the pivot element is very small so that after dividing by the pivot element the other numbers in the pivot row are very large? It may be true that the set of equations being solved is (nearly) singular and no real solution exists. But more than likely it simply means that a different equation should serve as the pivot row. Swapping rows in a set of linear equations is a third elementary row operation that can be performed as needed without changing the solution values. This operation is called row pivoting, or partial pivoting. Row pivoting is relatively easy to perform and is always advisable. Once a new pivot row has been selected, the absolute value of the pivot element should be compared with the absolute values of all other elements in the same column below the pivot row. The row containing the largest absolute value is then swapped with the pivot row so that the pivot element is now the largest number of all those below it in the same column. After pivoting has been completed, the magnitude of the pivot element should be examined for a zero or near-zero value. If the pivot element after pivoting is zero or close to zero, the equations are probably singular or very ill-conditioned and the solution process should be abandoned with an error condition.
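The row-selection step just described can be sketched in isolation (pivotRow() is a hypothetical helper with 0-based indexing):

```cpp
#include <cassert>
#include <cmath>
#include <utility>
#include <vector>

// Row (partial) pivoting for column i: swap the pivot row with the
// row at or below it whose column-i entry has the largest magnitude.
// Returns false when even the best pivot is near zero, i.e., the
// system is probably singular or very ill-conditioned.
bool pivotRow(std::vector<std::vector<double> > &a,
              std::vector<double> &b, int i, double tol = 1.0e-6)
{
    const int n = (int)b.size();
    int best = i;
    for (int k = i + 1; k < n; ++k)
        if (std::fabs(a[k][i]) > std::fabs(a[best][i]))
            best = k;
    if (std::fabs(a[best][i]) < tol)
        return false;                 // abandon: (near) singular
    if (best != i)
    {
        std::swap(a[i], a[best]);     // swap the rows of [A] ...
        std::swap(b[i], b[best]);     // ... and of the right-hand side
    }
    return true;
}
```

A solver would call pivotRow(a, b, i) just before normalizing pivot row i.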
10.2.3.2 Full Pivoting
Another form of pivoting, called column, or full pivoting, also can be employed. As the name implies, this form of pivoting searches for the pivot element in all elements to the right of and below the diagonal element of the current row. This search strategy ensures that the element having the largest absolute value in the unprocessed portion of the coefficient matrix will become the pivot element for the current row. Generally speaking, after the largest element has been found, both rows and columns will have to be swapped in order to place the pivot element properly. Swapping rows is an elementary row operation which does not alter the value of the solution elements. However, when columns are swapped, the associated variables are also repositioned in the solution vector. So column pivoting requires that a record be kept of all the column swaps so that the solution vector can be reordered properly. For full coefficient matrices having few zero elements, full pivoting may or may not significantly improve the accuracy of the results. For banded and sparse matrices which contain a large number of zero elements, using full pivoting reduces the need for the user to arrange the equations in proper order so as to avoid encountering a zero element on the diagonal and giving a false impression that the coefficient matrix is singular.
EXAMPLE 10.1 Gauss Elimination With Full Pivoting

This example implements Gauss elimination with full pivoting. The file enmgauss.h contains two versions of function gaussSolve(). The first version accepts a coefficient matrix [A] and a single right-hand side vector {b}. The second version is identical, except that the right-hand side is a matrix, the columns of which are the multiple load vectors. In both cases, upon return from the function, the coefficient matrix is replaced by the upper triangular equivalent system and the {b} vector(s) contain the solution vector(s). File enmgauss.h contains:
#ifndef ENM_GAUSS_H
#define ENM_GAUSS_H
#include <enmvec.h>

namespace ENM
{

template <class MaTRiX, class VecToR>
void gaussSolve(MaTRiX &A, VecToR &b)
{
    Subscript n = b.dim();
    Subscript i,j,k,l,maxRow,maxCol;
    typename MaTRiX::element_type sum = 0.0;
    typename MaTRiX::element_type maxValue, temp;
    assert(b.dim()==A.numRows());
    assert(b.dim()==A.numCols());
    Vector<Subscript> swap(n);
    for (i=1; i<=n; i++)
    {
        swap(i)=i;
    }
    for (i=1; i<n; i++)
    {
        // full pivoting: search at and to the right of and below A(i,i)
        maxValue = A(i,i);
        maxRow = i;
        maxCol = i;
        for (k=i; k<=n; k++)
        {
            for (l=i; l<=n; l++)
            {
                if (abs(A(k,l))>abs(maxValue))
                {
                    maxValue = A(k,l);
                    maxRow = k;
                    maxCol = l;
                }
            }
        }
        if (maxRow != i)
        {
            for (j=i; j<=n; j++)
            {
                temp = A(i,j);
                A(i,j) = A(maxRow,j);
                A(maxRow,j) = temp;
            }
            temp = b(i);
            b(i) = b(maxRow);
            b(maxRow) = temp;
        }
        if (maxCol != i)
        {
            for (j=1; j<=n; j++)
            {
                temp = A(j,i);
                A(j,i) = A(j,maxCol);
                A(j,maxCol) = temp;
            }
            l = swap(i);
            swap(i) = swap(maxCol);
            swap(maxCol) = l;
        }
        temp = A(i,i);
        if (abs(temp) < 1.0e-6)
        {
            std::cerr << "Singular matrix in gaussSolve\n";
            exit(-1);
        }
        for (j=i; j<=n; j++)
        {
            A(i,j) /= temp;
        }
        b(i) /= temp;
        for (k=i+1; k<=n; k++)
        {
            temp = -A(k,i);
            for (j=i; j<=n; j++)
            {
                A(k,j) += A(i,j)*temp;
            }
            b(k) += b(i)*temp;
        }
    }
    // back substitution
    b(n) /= A(n,n);
    for (i=n-1; i>=1; i--)
    {
        sum = b(i);
        for (j=i+1; j<=n; j++)
        {
            sum -= A(i,j)*b(j);
        }
        b(i) = sum;
    }
    // rearrange solutions for column pivoting
    VecToR z(b);
    for (i=1; i<=n; i++)
    {
        if (i != swap(i))
        {
            b(swap(i)) = z(i);
        }
    }
}

template <class MaTRiX>
void gaussSolve(MaTRiX &A, MaTRiX &b)
{
    Subscript n = b.numRows();
    Subscript numRHS = b.numCols();
    Subscript i,j,k,l,maxRow,maxCol;
    typename MaTRiX::element_type sum = 0.0;
    typename MaTRiX::element_type maxValue, temp;
    assert(b.numRows()==A.numRows());
    assert(b.numRows()==A.numCols());
    Vector<Subscript> swap(n);
    for (i=1; i<=n; i++)
    {
        swap(i)=i;
    }
    for (i=1; i<n; i++)
    {
        maxValue = A(i,i);
        maxRow = i;
        maxCol = i;
        for (k=i; k<=n; k++)
        {
            for (l=i; l<=n; l++)
            {
                if (abs(A(k,l))>abs(maxValue))
                {
                    maxValue = A(k,l);
                    maxRow = k;
                    maxCol = l;
                }
            }
        }
        if (maxRow != i)
        {
            for (j=i; j<=n; j++)
            {
                temp = A(i,j);
                A(i,j) = A(maxRow,j);
                A(maxRow,j) = temp;
            }
            for (j=1; j<=numRHS; j++)
            {
                temp = b(i,j);
                b(i,j) = b(maxRow,j);
                b(maxRow,j) = temp;
            }
        }
        if (maxCol != i)
        {
            for (j=1; j<=n; j++)
            {
                temp = A(j,i);
                A(j,i) = A(j,maxCol);
                A(j,maxCol) = temp;
            }
            l = swap(i);
            swap(i) = swap(maxCol);
            swap(maxCol) = l;
        }
        temp = A(i,i);
        if (abs(temp) < 1.0e-6)
        {
            std::cerr << "Singular matrix in gaussSolve\n";
            exit(-1);
        }
        for (j=i; j<=n; j++)
        {
            A(i,j) /= temp;
        }
        for (j=1; j<=numRHS; j++)
        {
            b(i,j) /= temp;
        }
        for (k=i+1; k<=n; k++)
        {
            temp = -A(k,i);
            for (j=i; j<=n; j++)
            {
                A(k,j) += A(i,j)*temp;
            }
            for (j=1; j<=numRHS; j++)
            {
                b(k,j) += b(i,j)*temp;
            }
        }
    }
    // back substitution
    for (k=1; k<=numRHS; k++)
    {
        b(n,k) /= A(n,n);
        for (i=n-1; i>=1; i--)
        {
            sum = b(i,k);
            for (j=i+1; j<=n; j++)
            {
                sum -= A(i,j)*b(j,k);
            }
            b(i,k) = sum;
        }
    }
    // rearrange solutions for column pivoting
    MaTRiX z(b);
    for (i=1; i<=n; i++)
    {
        if (i != swap(i))
        {
            for (j=1; j<=numRHS; j++)
            {
                b(swap(i),j) = z(i,j);
            }
        }
    }
}

} /* namespace ENM */
#endif

The following main() function demonstrates the operation of gaussSolve() for a single right-hand side.
1)  #include <iostream>
2)  #include <enmcpp.h>
3)  #include <enmmat.h>
4)  #include <enmgauss.h>
5)  using namespace ENM;
6)  using namespace std;
7)  int main()
8)  {
9)      Matrix<double> A(4,4,"0.3 0.2 6.6 -1.1 "
10)                          "4.5 -1.8 -0.3 6.5 "
11)                          "-7.3 9.7 10.9 -4.1 "
12)                          "8.1 -2.7 8.7 8.9");
13)     Vector<double> b(4,"1 0.1 0.01 0.001");
14)     cout.flags(ios::scientific | ios::showpos | ios::right);
15)     cout.precision(4);
16)     cout << "Original Matrix A: " << A << endl;
17)     cout << "Original Load Vector b: " << b << endl;
18)     Matrix<double> T(A);
19)     Vector<double> x(b);
20)     gaussSolve(T,x);
21)     cout << "Triangularized matrix: " << T << endl;
22)     cout << "Solution vector: " << x << endl;
23)     cout << "Residual [A*x - b]: " << A*x - b << endl;
24) }
Program Output:

Original Matrix A: +4 +4
+3.0000e-01 +2.0000e-01 +6.6000e+00 -1.1000e+00
+4.5000e+00 -1.8000e+00 -3.0000e-01 +6.5000e+00
-7.3000e+00 +9.7000e+00 +1.0900e+01 -4.1000e+00
+8.1000e+00 -2.7000e+00 +8.7000e+00 +8.9000e+00

Original Load Vector b: +4
+1.0000e+00
+1.0000e-01
+1.0000e-02
+1.0000e-03

Triangularized matrix: +4 +4
+1.0000e+00 -6.6972e-01 -3.7615e-01 +8.8991e-01
+0.0000e+00 +1.0000e+00 +8.7404e-01 -7.4980e-01
+0.0000e+00 +0.0000e+00 +1.0000e+00 +7.7803e-01
+0.0000e+00 +0.0000e+00 +0.0000e+00 -3.5544e-01

Solution vector: +4
-3.9372e+00
-2.9753e+00
+7.4591e-01
+1.9516e+00

Residual [A*x - b]: +4
-3.5527e-15
+1.6376e-15
-5.0706e-15
-4.9674e-15
Comments: Lines 14 and 15 of main() set the formatting for all floating-point numbers that follow. The options specified are ios::scientific (print the numbers in scientific notation with trailing exponents), ios::showpos (show + signs for positive numbers), and ios::right (align the numbers to the right, leaving any padding spaces to the left of the numbers). The cout.precision(4) call at line 15 causes all floating-point numbers printed to object cout to have four digits to the right of the decimal point. This combination of options aligns the numbers for easier reading. The numbers at the ends of the header lines of the program output (e.g., the "+4 +4" after "Original Matrix A:") are artifacts of the way the ENM classes print matrices and vectors; they are the sizes of the respective vectors and matrices. The output of the program clearly shows the upper triangular form of the [A] matrix after returning from gaussSolve(). The residual vector at the end of the output contains the differences between the product of the original coefficient matrix times the solution vector and the original right-hand side. The magnitude of the numbers in this vector (~10⁻¹⁵) suggests
only a modest amount of roundoff error was incurred in the solution process, since this is near the order of machine epsilon for double precision arithmetic.

10.2.4 Tridiagonal Systems

Special forms of Gauss elimination and other direct equation solvers exist for coefficient matrices that have special forms (e.g., symmetric, banded, sparse, etc.). Using specialized algorithms for these types of equations can yield tremendous improvements in solution speed and accuracy over standard Gauss elimination. One special matrix form that is commonly encountered in the finite-difference solution of ordinary and partial differential equations is the tridiagonal system. The coefficient matrix of a tridiagonal system has nonzero elements only on the three central diagonals. Thus, there is only one number to eliminate below each pivot element and no pivoting need be done. In the back substitution phase, a similar situation occurs; there is only one nonzero term to the right of the diagonal element, and it will be in the column immediately to the right of the diagonal element. To illustrate the simplicity of Gauss elimination when applied to a tridiagonal system, consider a system containing five equations.
⎡ a11 a12  0   0   0  ⎤ ⎧ x1 ⎫   ⎧ b1 ⎫
⎢ a21 a22 a23  0   0  ⎥ ⎪ x2 ⎪   ⎪ b2 ⎪
⎢  0  a32 a33 a34  0  ⎥ ⎨ x3 ⎬ = ⎨ b3 ⎬
⎢  0   0  a43 a44 a45 ⎥ ⎪ x4 ⎪   ⎪ b4 ⎪
⎣  0   0   0  a54 a55 ⎦ ⎩ x5 ⎭   ⎩ b5 ⎭
During the elimination phase (called the forward pass for tridiagonal systems), beginning with the second row, the element to the left of the element on the principal diagonal is set to 0, the element on the diagonal (aii) and the right-hand side element (bi) are modified and the element to the right of the diagonal remains unchanged. More precisely, for each row proceeding from row 2 through row n,
aii = aii − ( ai,i−1 / ai−1,i−1 ) ai−1,i

bi = bi − ( ai,i−1 / ai−1,i−1 ) bi−1

ai,i−1 = 0

The back substitution phase (called the reverse or return pass) proceeds just as in normal Gauss elimination by setting
xn = bn / ann ,
followed by
xi = ( bi − ai,i+1 xi+1 ) / aii

for all rows i starting at row n-1 and proceeding up to row 1. As the number of equations increases, the percentage of coefficient matrix elements that are nonzero drops rapidly until it becomes very inefficient to store the full matrix. A C++ class can be
Elementary Numerical Methods and C++ # 99 used to fully encapsulate the behavior of tridiagonal systems while using a minimum of storage space. Example 10.2 illustrates the use of a tridiag class that is derived from a general purpose matrix class. The code is written to follow the C++ convention of array indices starting with 0 instead of 1 as was used in the preceding development. O EXAMPLE 10.2 A Class for Solving Tridiagonal Linear Systems The tridiag class is derived from the ENM Matrix class and is implemented as a template class. Thus, the entire class definition is included entirely in file enmtri.h, which is shown below. 1) 2) 3) 4) 5) 6) 7) 8) 9) 10) 11) 12) 13) 14) 15) 16) 17) 18) 19) 20) 21) 22) 23) 24) 25) 26) 27) 28) 29) 30) 31) 32) 33) 34) 35) 36) 37) 38) 39) 40) 41) 42) 43) 44) 45) 46) 47) 48) 49) 50)
#ifndef ENM_TRIDIAG_H
#define ENM_TRIDIAG_H
#include <sstream>
#include <cmath>
#include <string>
#include <enmcpp.h>
#include <enmmat.h>

namespace ENM {

enum {L=0, D, U, B};

template <class T>
class Tridiag : public Matrix<T>
{
  public:
    Tridiag(Subscript N) : Matrix<T>(N,4) {}
    Tridiag(Subscript N, std::string s) : Matrix<T>(N,4)
    {
        std::istringstream ins(s);
        Subscript i, j;
        for (j=1; j<4; j++)            // row 0 has no L element
        {
            ins >> row_[0][j];
        }
        for (i=1; i<N-1; i++)          // interior rows: L, D, U, B
        {
            for (j=0; j<4; j++)
            {
                ins >> row_[i][j];
            }
        }
        ins >> row_[N-1][L];           // last row has no U element
        ins >> row_[N-1][D];
        ins >> row_[N-1][B];
    }
    ~Tridiag() {}
    bool solve(void);
};

template <class T>
bool Tridiag<T>::solve(void)
{
    Subscript i;
    for (i=1; i<m_; i++)               // forward pass
    {
        T diag = row_[i-1][D];
        if (fabs(diag) < 1.0e-6) return false;
        T mult = row_[i][L]/diag;
        row_[i][L] = 0.0;
        row_[i][D] -= row_[i-1][U] * mult;
        row_[i][B] -= row_[i-1][B] * mult;
    }
    for (i=m_-1; i>=0; i--)            // return pass
    {
        T diag = row_[i][D];
        if (fabs(diag) < 1.0e-6) return false;
        if (i != m_-1)
            row_[i][B] -= row_[i][U] * row_[i+1][B];
        row_[i][B] /= diag;
    }
    return true;
}

template <class T>
std::ostream& operator<<(std::ostream &s, const Tridiag<T> &A)
{
    Subscript M=A.numRows();
    s << M << "\n";
    s << A[0][D] << " " << A[0][U] << " " << A[0][B] << "\n";
    for (Subscript i=1; i<M-1; i++)
    {
        for (Subscript j=0; j<4; j++)
        {
            s << A[i][j] << " ";
        }
        s << "\n";
    }
    s << A[M-1][L] << " " << A[M-1][D] << " " << A[M-1][B] << "\n";
    return s;
}

template <class T>
std::istream& operator>>(std::istream &s, Tridiag<T> &A)
{
    Subscript M;
    s >> M;
    Tridiag<T> tmp(M);
    s >> tmp[0][D] >> tmp[0][U] >> tmp[0][B];
    for (Subscript i=1; i<M-1; i++)
    {
        for (Subscript j=0; j<4; j++)
        {
            s >> tmp[i][j];
        }
    }
    s >> tmp[M-1][L] >> tmp[M-1][D] >> tmp[M-1][B];
    A = tmp;
    return s;
}

} /* namespace ENM */
#endif
To demonstrate the use of the Tridiag class, we will solve the following set of linear equations, which represent the steady-state temperature distribution in an insulated bar that is placed in contact with a heater at 100° on one end and a heater at 50° on the other end. We have chosen to find the temperatures at seven uniformly spaced internal points along the bar. The system of equations can be written as
| -2  1  0  0  0  0  0 | | T0 |   | -100 |
|  1 -2  1  0  0  0  0 | | T1 |   |   0  |
|  0  1 -2  1  0  0  0 | | T2 |   |   0  |
|  0  0  1 -2  1  0  0 | | T3 | = |   0  |
|  0  0  0  1 -2  1  0 | | T4 |   |   0  |
|  0  0  0  0  1 -2  1 | | T5 |   |   0  |
|  0  0  0  0  0  1 -2 | | T6 |   |  -50 |
When using the Tridiag class to represent a system like this, only the nonzero elements of the coefficient matrix need be stored. Each element in a row is given a symbolic name that should be used as its reference. The L element is the element on the Lower diagonal of each row. Note that there is no L element for row zero. The D element in each row is the Diagonal element (−2 for each row in this example). The U element is the Upper diagonal element in each row. The last row in the coefficient matrix has no U element. Finally, the B element in each row is the number in the right-hand side vector (the so-called B-vector) corresponding to that row of the coefficient matrix. The main program below shows how this system of equations can be constructed quite conveniently using a string constant initializer. The first row's string initializer contains only three numbers: values for T[0][D], T[0][U], and T[0][B]. In a similar fashion, the three numbers in the last string initializer represent T[6][L], T[6][D], and T[6][B], respectively.

#include <iostream>
#include <cassert>
#include "enmtri.h"
using namespace ENM;
using namespace std;
int main()
{
    Tridiag<double> T(7,"-2 1 -100 "
                        "1 -2 1 0 "
                        "1 -2 1 0 "
                        "1 -2 1 0 "
                        "1 -2 1 0 "
                        "1 -2 1 0 "
                        "1 -2 -50");
    assert(T.solve());
    cout << "After calling T.solve():\n" << T
         << "Solution is in column B:\n";
    for (Subscript i=0; i<7; i++)
    {
        cout << T[i][B] << "\n";
    }
}
The call to T.solve() is wrapped in an assert() so that the program halts if solve() returns false, which indicates a singular system. Next, the value of the entire tridiagonal system is printed, followed by just the values of the B column of the solved system, which is where the solution values for the seven internal temperatures along the bar are stored. Here is the actual output from the program:
After calling T.solve():
7
-2 1 93.75
0 -1.5 1 87.5
0 -1.33333 1 81.25
0 -1.25 1 75
0 -1.2 1 68.75
0 -1.16667 1 62.5
0 -1.14286 56.25
Solution is in column B:
93.75
87.5
81.25
75
68.75
62.5
56.25
10.2.5 LU Decomposition

The right-hand side of the simultaneous linear equation problem, {b}, is often called the B-vector or the load vector. The term "load vector" is used because in the classical formulation of structures problems, the coefficient matrix [A] represents the coupled stiffness of the structural elements, the column vector {x} represents the unknown deflections of the members, which are being solved for, and the right-hand side vector {b} contains the known or assumed loads that are being applied to the structure. Typically, a structure must be analyzed under many different load conditions, each one of which constitutes a different {b}. As a result, the different {b} vectors are placed as columns in a [B] matrix. We usually designate the variable n to represent the number of rows and columns of [A] and the variable m to represent the number of columns of [B]. Extending any of the various elimination methods to handle multiple right-hand sides is relatively simple. We simply place the operations involving elements of [B] in a repetition structure that causes the same operations to be repeated for each of the m columns of [B]. Unfortunately, unless some sophisticated programming is added to Gauss elimination to record a history of the elimination process, it is mandatory that all of the load vectors be specified in advance so they can all be solved for simultaneously. This is because the elimination process also affects the contents of the load vectors. In practice, this is rarely possible. It is often the case that the results of one analysis will suggest a different load configuration, which will require another problem solution and corresponding Gauss elimination process. While this is a trivial exercise if the system involves only a few equations, it can be time consuming and expensive for systems involving several hundred or even several thousand equations.
LU decomposition methods (also frequently called LU factorization methods) are a family of methods that offer direct solution of systems of linear simultaneous equations at speeds that are competitive with Gauss elimination, but which do not require that the load vector be specified in advance. That is, the processing of the [A] matrix is independent of the right-hand side. Consequently, it makes no difference whether the right-hand sides are specified at the same time as columns of a [B] matrix or individually as column vectors. LU decomposition works by "decomposing" the original coefficient matrix [A] into two square matrices [L] and [U] which are lower triangular and upper triangular, respectively. This decomposition is done in such a way as to ensure that

    [L][U] = [A]                                    (10.1)

Once these two matrices have been created, the solution of the equation
    [A]{x} = {b}                                    (10.2)

proceeds in two steps as follows. First, solve the linear system

    [L]{y} = {b}                                    (10.3)
for the intermediate solution vector {y} using forward substitution, since [L] is lower triangular. Having solved for the vector {y}, use this vector to solve

    [U]{x} = {y}                                    (10.4)

for the vector {x} using back substitution, since [U] is upper triangular. That the vector {x} obtained through the two-step process described above constitutes a solution to the original linear system problem given in Equation 10.2 can be shown by first premultiplying both sides of Equation 10.4 by [L], giving
    [L][U]{x} = [L]{y}
But substituting for [L][U] from Equation 10.1 and substituting for [L]{y} from Equation 10.3 returns us to Equation 10.2, the original problem to be solved. For LU decomposition to work, we obviously have to find [L] and [U] such that Equation 10.1 is satisfied. There are several popular algorithms for decomposing [A], the most popular being Crout's method and Doolittle's method. Be advised that on rare occasions you may encounter a matrix that is solvable by Gauss elimination but which fails to yield a satisfactory [L][U] pair using any particular decomposition algorithm.

10.2.5.1 Crout's Method
Crout’s method for LU decomposition can be derived directly from the rules of matrix multiplication. To facilitate this, a particular convention must be established regarding “ownership” of the principal diagonal elements since both [L] and [U] have elements on the principal diagonal. Crout’s method assumes [L] “owns” the principal diagonal elements and the principal diagonal elements of [U] are rather arbitrarily assigned the value of unity. We will develop Crout’s method by applying it to the general 4×4 system
| l11   0    0    0  | | 1  u12  u13  u14 |   | a11  a12  a13  a14 |
| l21  l22   0    0  | | 0   1   u23  u24 |   | a21  a22  a23  a24 |
| l31  l32  l33   0  | | 0   0    1   u34 | = | a31  a32  a33  a34 |
| l41  l42  l43  l44 | | 0   0    0    1  |   | a41  a42  a43  a44 |
Simultaneously, we will apply the general formulas we develop to the specific numerical case of
      |  0.3   0.2   6.6  -1.1 |
[A] = |  4.5  -1.8  -0.3   6.5 |
      | -7.3   9.7  10.9  -4.1 |
      |  8.1  -2.7   8.7   8.9 |

Using the rules of matrix multiplication, the value of a11 can be obtained by multiplying row 1 of [L] times column 1 of [U] to obtain
l11 × 1 + 0 × 0 + 0 × 0 + 0 × 0 = a11
Now in the case of LU decomposition, the numerical values of the elements of [A] are known and we are trying to find the numerical values of [L] and [U]. Because of the structure of [L] and [U] and our placement of 1's on the principal diagonal of [U], we find that l11 = a11 = 0.3. Now we choose to multiply the second row of [L] times the first column of [U] and we obtain
l21 × 1 + l22 × 0 + 0 × 0 + 0 × 0 = a21
from which we see that l21 = a21 = 4.5. In a similar manner, proceeding down the first column of [L],
we find

l31 × 1 + l32 × 0 + l33 × 0 + 0 × 0 = a31 ⇒ l31 = a31 = −7.3
l41 × 1 + l42 × 0 + l43 × 0 + l44 × 0 = a41 ⇒ l41 = a41 = 8.1

By induction, it is easy to see that the first column of [L] is numerically equal to the first column of [A] for any number of rows, because the first column of [U] contains 1 in row 1 and all zeros below row 1. At this point, the [L] and [U] matrices appear as follows.
         |  0.3   0    0    0  | | 1  u12  u13  u14 |
         |  4.5  l22   0    0  | | 0   1   u23  u24 |
[L][U] = | -7.3  l32  l33   0  | | 0   0    1   u34 |
         |  8.1  l42  l43  l44 | | 0   0    0    1  |
Now comes a critical step in Crout’s algorithm. Rather than proceeding to find the remaining elements of [L], we move to the first row of [U] and begin finding the numerical values of elements u12 … u14. To find the value of u12, we multiply the first row of [L] times the second column of [U] to obtain
l11 × u12 + 0 × 1 + 0 × 0 + 0 × 0 = a12 ⇒ u12 = a12 / l11 = 0.2 / 0.3 = 0.6667
Similarly,
l11 × u13 + 0 × u23 + 0 × 1 + 0 × 0 = a13 ⇒ u13 = a13 / l11 = 6.6 / 0.3 = 22
l11 × u14 + 0 × u24 + 0 × u34 + 0 × 1 = a14 ⇒ u14 = a14 / l11 = −1.1 / 0.3 = −3.6667
More generally stated, the first row of [U] is simply the first row of [A] normalized with respect to a11. Note that all of the terms on the right-hand side of each equation are either values of [A], which are given, or values of [L] or [U] that have been computed already. At this point,
         |  0.3   0    0    0  | | 1  0.6667   22   -3.6667 |
         |  4.5  l22   0    0  | | 0    1     u23     u24   |
[L][U] = | -7.3  l32  l33   0  | | 0    0      1      u34   |
         |  8.1  l42  l43  l44 | | 0    0      0       1    |
After completing the first row of [U], we go back to find the second column of [L] as follows:
l21 × u12 + l22 × 1 + 0 × 0 + 0 × 0 = a22 ⇒ l22 = a22 − l21 × u12 = −1.8 − 4.5 × 0.6667 = −4.8
l31 × u12 + l32 × 1 + l33 × 0 + 0 × 0 = a32 ⇒ l32 = a32 − l31 × u12 = 9.7 + 7.3 × 0.6667 = 14.5667
l41 × u12 + l42 × 1 + l43 × 0 + l44 × 0 = a42 ⇒ l42 = a42 − l41 × u12 = −2.7 − 8.1 × 0.6667 = −8.1

Again, note that everything appearing on the right-hand side of each equation is known. The values of [L][U] are now
         |  0.3      0      0    0  | | 1  0.6667   22   -3.6667 |
         |  4.5    -4.8     0    0  | | 0    1     u23     u24   |
[L][U] = | -7.3   14.5667  l33   0  | | 0    0      1      u34   |
         |  8.1    -8.1    l43  l44 | | 0    0      0       1    |
Next, we find the second row of [U] by again applying the rules of matrix multiplication to find

l21 × u13 + l22 × u23 + 0 × 1 + 0 × 0 = a23 ⇒ u23 = (a23 − l21 × u13) / l22 = (−0.3 − 4.5 × 22) / (−4.8) = 20.6875

l21 × u14 + l22 × u24 + 0 × u34 + 0 × 1 = a24 ⇒ u24 = (a24 − l21 × u14) / l22 = (6.5 + 4.5 × 3.6667) / (−4.8) = −4.7917
Although the equations are growing more complicated, we see again that everything that is needed on the right-hand side is either given or has been computed previously. The interim numerical values are now
         |  0.3      0      0    0  | | 1  0.6667    22    -3.6667 |
         |  4.5    -4.8     0    0  | | 0    1    20.6875  -4.7917 |
[L][U] = | -7.3   14.5667  l33   0  | | 0    0       1       u34   |
         |  8.1    -8.1    l43  l44 | | 0    0       0        1    |
Moving to the third column of [L], we obtain
l31 × u13 + l32 × u23 + l33 × 1 + 0 × 0 = a33 ⇒ l33 = a33 − l31 × u13 − l32 × u23
l33 = 10.9 + 7.3 × 22 − 14.5667 × 20.6875 = −129.8486

l41 × u13 + l42 × u23 + l43 × 1 + l44 × 0 = a43 ⇒ l43 = a43 − l41 × u13 − l42 × u23
l43 = 8.7 − 8.1 × 22 + 8.1 × 20.6875 = −1.9312
Placing these numerical values in their proper positions gives
0 0 0 ⎤ ⎡ 1 0.6667 22 − 3.6667 ⎤ ⎡ 0.3 ⎢ 4.5 1 20.6875 − 4.7917⎥ 0 0 ⎥ ⎢0 − 4.8 ⎢ ⎥ ⎢ ⎥ [ L][U ] = ⎢ − 7.3 14.5667 − 129.8486 0 ⎥ ⎢ 0 0 1 u34 ⎥ ⎢ ⎥⎢ ⎥ 0 0 1 ⎦ . − 81 . − 19312 . l44 ⎦ ⎣ 0 ⎣ 81 The third row of [U] has only one unknown element which can now be found. l31 × u14 + l32 × u24 + l33 × u34 + 0 × 1 = a 34 a − l × u − l × u24 − 4.1 − 7.3 * 3.6667 + 14.567 * 4.7917 u34 = 34 31 14 32 = = − 0.2998 l33 − 129.8486 Inserting this value in its proper position gives
0 0 0 ⎤ ⎡ 1 0.6667 ⎡ 0.3 − 3.6667 ⎤ 22 ⎥ ⎢ 4.5 ⎢ 0 0 0 − 4.8 1 20.6875 − 4.7917⎥ ⎥⎢ ⎥ [ L][U ] = ⎢ ⎢ − 7.3 14.5667 − 129.8486 0 ⎥ ⎢ 0 − 0.2998⎥ 0 1 ⎥ ⎢ ⎥ . − 81 . − 19312 . l ⎥ ⎢⎣ 0 0 0 1 ⎦ ⎢⎣ 81 44 ⎦ Finally, element l44 can be found by multiplying row four of [L] by column four of [U].
l41 × u14 + l42 × u24 + l43 × u34 + l44 × 1 = a 44 l44 = a 44 − l41 × u14 − l42 × u24 − l43 × u34 = 8.9 + 81 . * 3.6667 − 81 . * 4.7917 − 19312 . * 0.2998 = − 0.7915 Inserting this value in [L] completes the Crout decomposition.
         |  0.3      0         0        0     | | 1  0.6667    22    -3.6667 |
         |  4.5    -4.8        0        0     | | 0    1    20.6875  -4.7917 |
[L][U] = | -7.3   14.5667  -129.8486    0     | | 0    0       1     -0.2998 |
         |  8.1    -8.1      -1.9312  -0.7915 | | 0    0       0        1    |

Generalizing the Crout algorithm for row/column 2 and higher (recall that column 1 of [A] becomes column 1 of [L] with no further processing, and row 1 of [A] becomes row 1 of [U] after normalizing), apply the following formula for each element on and below the diagonal element in column j:

    lij = aij − Σ (k=1 to j−1) lik × ukj

Then apply the following formula to each element to the right of the principal diagonal in the corresponding row of [U]:

    uij = ( aij − Σ (k=1 to i−1) lik × ukj ) / lii
In practice, code for implementing Crout's method never actually creates separate [L] and [U] matrices, since the decomposition can be done in place. This means that as new values of [L] and [U] are computed, they overwrite the corresponding values of [A]. This is possible because once a value aij has been used in a calculation of either lij or uij, it is never needed again. This means that code implementing Crout's method (or any other LU decomposition scheme) will normally overwrite the original coefficient matrix with its LU decomposition. When using the LU decomposition to solve a set of simultaneous linear equations using forward and back substitution as described in the previous section, one has to remember the convention that was used to assign the diagonal elements of the coefficient matrix. For Crout's method, remember that the diagonal elements appearing in the decomposed matrix "belong" to [L], and whenever a value of uii is needed, the number 1.0 should be used. Crout decomposition also lends itself to pivoting just like Gauss elimination. However, since the right-hand side is not used in the LU decomposition, a record of the row swaps must be maintained so that when the decomposed matrix is used to solve a particular right-hand side, the elements of the load vector can be swapped appropriately before and after the forward and back substitution steps so that the results will be returned in proper order.
EXAMPLE 10.3 Crout's Method

This example implements Crout's method for LU decomposition exactly as illustrated above. Two functions are implemented. Function croutFactor(A) decomposes the matrix passed in the argument list into its LU components. The decomposition is done in place and the original matrix is overwritten. Function croutSolve(A,b) accepts a matrix A, which must have already been processed by croutFactor(), and a column vector b, which is the right-hand side load vector. The load vector is overwritten by the solution vector. File enmcrout.h contains:
#ifndef ENM_CROUT_H
#define ENM_CROUT_H
namespace ENM {

template <class MaTRiX>
bool croutFactor(MaTRiX &A)
{
    Subscript M = A.numRows();
    Subscript N = A.numCols();
    assert(M == N);
    if (M==0 || N==0) return false;
    Subscript i=0, j=0, k=0, sweep=0;
    typename MaTRiX::element_type t;
    t=A(1,1);
    for (j=2; j<=N; j++)        // normalize row 1 of [U]
    {
        A(1,j)/=t;
    }
    for (sweep=2; sweep<=N; sweep++)
    {
        j=sweep;
        for (i=j; i<=N; i++)    // column j of [L]
        {
            t=0.0;
            for (k=1; k<j; k++)
            {
                t += A(i,k)*A(k,j);
            }
            A(i,j) -= t;
        }
        i=sweep;
        if (A(i,i) == 0) return false;
        for (j=i+1; j<=N; j++)  // row i of [U]
        {
            t=0.0;
            for (k=1; k<i; k++)
            {
                t += A(i,k)*A(k,j);
            }
            A(i,j) = (A(i,j) - t)/A(i,i);
        }
    }
    return true;
}

template <class MaTRiX, class VecToR>
void croutSolve(MaTRiX& A, VecToR& b)
{
    Subscript n = b.dim();
    Subscript i,j;
    typename MaTRiX::element_type sum = 0.0;
    // Solve Ly=b using forward substitution
    b(1) /= A(1,1);
    for (i=2; i<=n; i++)
    {
        sum = b(i);
        for (j=1; j<i; j++)
        {
            sum -= A(i,j)*b(j);
        }
        b(i) = sum/A(i,i);
    }
    // Solve Ux=y using back substitution (uii = 1, so no division)
    for (i=n-1; i>=1; i--)
    {
        sum = b(i);
        for (j=i+1; j<=n; j++)
        {
            sum -= A(i,j)*b(j);
        }
        b(i) = sum;
    }
}

} /* namespace ENM */
#endif

The following main() function demonstrates the operation of croutFactor() and croutSolve().

#include <iostream>
#include <cstdlib>
#include <enmcpp.h>
#include <enmmat.h>
#include <enmcrout.h>
using namespace ENM;
using namespace std;
int main()
{
    Matrix<double> A(4,4,"0.3 0.2 6.6 -1.1 "
                         "4.5 -1.8 -0.3 6.5 "
                         "-7.3 9.7 10.9 -4.1 "
                         "8.1 -2.7 8.7 8.9");
    Vector<double> b(4,"1 0.1 0.01 0.001");
    cout.flags(ios::scientific | ios::showpos | ios::right);
    cout.precision(4);
    cout << "Original Matrix A: " << A << endl;
    Matrix<double> T(A);
    if (!croutFactor(T))
    {
        cerr << "Crout decomposition failed.\n";
        exit(-1);
    }
    cout << "Matrix A after Crout decomposition: " << T << endl;
    Vector<double> x(b);
    croutSolve(T,x);
    cout << "Solution for Ax=b, where b=" << b
         << "x: " << x;
    cout << "Residual [A*x - b]: " << A*x - b << endl;
}
Program Output:
Original Matrix A: +4 +4
+3.0000e-001 +2.0000e-001 +6.6000e+000 -1.1000e+000
+4.5000e+000 -1.8000e+000 -3.0000e-001 +6.5000e+000
-7.3000e+000 +9.7000e+000 +1.0900e+001 -4.1000e+000
+8.1000e+000 -2.7000e+000 +8.7000e+000 +8.9000e+000

Matrix A after Crout decomposition: +4 +4
+3.0000e-001 +6.6667e-001 +2.2000e+001 -3.6667e+000
+4.5000e+000 -4.8000e+000 +2.0687e+001 -4.7917e+000
-7.3000e+000 +1.4567e+001 -1.2985e+002 -2.9983e-001
+8.1000e+000 -8.1000e+000 -1.9312e+000 -7.9154e-001

Solution for Ax=b, where b=+4 +1.0000e+000 +1.0000e-001 +1.0000e-002 +1.0000e-003
x: +4 -3.9372e+000 -2.9753e+000 +7.4591e-001 +1.9516e+000
Residual [A*x - b]: +4 +8.8818e-016 +1.0297e-014 -1.6201e-014 +4.7748e-015
Comments: This is the same set of equations that was solved in Example 10.1 using Gauss elimination, and the results are virtually identical. Refer back to that example for an explanation of the formatting options (cout.flags() and cout.precision()) used in main().

10.2.5.2 Doolittle's Method
Doolittle's method differs from Crout's method only in the choice of where the 1's on the diagonal go. Doolittle's method places 1's on the principal diagonal of [L] instead of [U].

10.2.6 Gauss-Seidel Iteration

All direct methods for solving linear systems suffer from roundoff error limitations. Though techniques such as matrix partitioning and iterative refinement can largely mitigate the effects of roundoff error, sometimes it is simpler to use an iterative method such as Gauss-Seidel to solve large sets of linear equations. Iterative methods work by first assuming solution values, then "plugging" these values into the equations in some orderly fashion to obtain corrections for the assumed values. The process is then repeated using the updated values of the solution vector until the corrections approach zero. Since each iteration is independent (the "corrected" values could have been assumed to start with), the effect of roundoff error is minimal and is not cumulative as it is for direct methods. But this improved accuracy comes at the cost of lengthier calculation times. Furthermore, the number of iterations and resulting execution time that will be required for the solution to converge is not known a priori. And to further complicate the situation, iterative methods may not converge for many sets of equations even though a solution exists. The Gauss-Seidel method given below is guaranteed to converge when the coefficient matrix is strongly diagonally dominant. This means that the absolute value of the diagonal element of each row in the coefficient matrix is larger than the sum of the absolute values of the other coefficients in the row. In fact, Gauss-Seidel will almost always converge if the diagonal element is simply the largest number in the row, and it will usually converge even when the diagonal element in a few rows is not the largest element. So a word of caution is
in order: before attempting to use Gauss-Seidel, be sure the equations are arranged in the most favorable order with respect to diagonal dominance. The operation of Gauss-Seidel for a set of n simultaneous linear equations using zero-based arrays is easy to illustrate. Given a set of equations in the form

a00 x0 + a01 x1 + a02 x2 + … + a0,n−1 xn−1 = b0
a10 x0 + a11 x1 + a12 x2 + … + a1,n−1 xn−1 = b1
a20 x0 + a21 x1 + a22 x2 + … + a2,n−1 xn−1 = b2
⋮
an−1,0 x0 + an−1,1 x1 + an−1,2 x2 + … + an−1,n−1 xn−1 = bn−1
Assume a solution vector {x0, x1, x2, …, xn−1}T and solve the first equation for x0:

    x0 = (b0 − a01 x1 − a02 x2 − … − a0,n−1 xn−1) / a00

(Notice that the originally assumed value of x0 is never used in the calculations because it is immediately replaced by the value computed in this calculation.) Place the computed value of x0 in the solution vector and proceed to solve for x1 using

    x1 = (b1 − a10 x0 − a12 x2 − … − a1,n−1 xn−1) / a11
As each new value of xi is calculated, it is placed in the solution vector and used in subsequent calculations. Repeat this process in a circular fashion until the computed xi's quit changing.

EXAMPLE 10.4 Gauss-Seidel Method

In this example we will solve the same problem as given in Example 10.2. (Attempting to solve the equations of Example 10.3 fails. Why?) The Gauss-Seidel method is implemented as a function template in file enmseid.h:
#ifndef ENM_GAUSS_SEIDEL_H
#define ENM_GAUSS_SEIDEL_H
#include <cmath>
#include <cassert>
#include <enmcpp.h>
#include <enmmat.h>

namespace ENM {

template <class T>
void gaussSeidel(const Matrix<T>& a, const Vector<T>& b, Vector<T>& x)
{
    T err, sum, newval;
    assert(a.numRows() == a.numCols());
    Subscript N=a.numRows(), i, j;
    do
    {
        err = 0;
        for (i=0; i<N; i++)
        {
            sum = 0;
            for (j=0; j<N; j++)
            {
                if (j != i)
                {
                    sum += a[i][j]*x[j];
                }
            }
            newval = (b[i] - sum)/a[i][i];
            if (fabs(fabs(newval) - fabs(x[i])) > err)
            {
                err = fabs(fabs(newval) - fabs(x[i]));
            }
            x[i] = newval;
        }
    } while (err > 1.0e-6);
}

} /* namespace ENM */
#endif
The driver program ex53.cpp, illustrating how to call gaussSeidel(), contains:

#include <iostream>
#include <enmseid.h>
using namespace ENM;
using namespace std;
int main()
{
    Matrix<double> a(7,7,"-2 1 0 0 0 0 0 "
                         "1 -2 1 0 0 0 0 "
                         "0 1 -2 1 0 0 0 "
                         "0 0 1 -2 1 0 0 "
                         "0 0 0 1 -2 1 0 "
                         "0 0 0 0 1 -2 1 "
                         "0 0 0 0 0 1 -2");
    Vector<double> b(7,"-100 0 0 0 0 0 -50");
    Vector<double> x(7, 0.0);
    gaussSeidel(a, b, x);
    cout << "Solution x: " << x;
}
Program Output:

Solution x: 7 93.75 87.5 81.25 75 68.75 62.5 56.25
Comments: The error tolerance given in the do-while condition of enmseid.h can be changed to provide the accuracy desired by the user. The type of error condition being tested just before each update in enmseid.h is called a maximum absolute error condition. Other types of tests for convergence may be appropriate under certain conditions. Some commonly used ones are maximum relative error and sum of squares error.
10.3 NONLINEAR SYSTEMS

As may be expected, the solution of problems involving simultaneous nonlinear equations is quite a bit more difficult than for linear equations. This is true both analytically and numerically. There are few assurances offered, even regarding the existence of a solution, let alone the convergence of any particular method for a given problem. The situation is not completely hopeless, though. For many real-world problems the governing equations are well understood and accurately model physical phenomena over a wide range of conditions. We have every reason to believe that a solution exists for such problems, and we perhaps will have a good ballpark estimate of the correct solution values. We will present here only one method for solving simultaneous nonlinear algebraic equations: a generalized form of Newton's method, which is an expansion of the Newton-Raphson iteration for finding the zero of a single nonlinear equation (Section 8.6, page 62). We begin by assuming a set of n nonlinear functions of n independent variables that are (hopefully) simultaneously equal to zero at one or more combinations of the independent variables. We will let the asterisk superscript represent the value of the respective independent variables that satisfies the solution condition. That is,
f1(x1*, x2*, x3*, …, xn*) = 0
f2(x1*, x2*, x3*, …, xn*) = 0
f3(x1*, x2*, x3*, …, xn*) = 0
⋮
fn(x1*, x2*, x3*, …, xn*) = 0                       (10.5)

The function on the left-hand side of each equation can be expanded about the solution point using a Taylor series. For a function of several independent variables, this is a cumbersome process, so we will choose immediately to truncate each Taylor series after the first derivative term. Note the use of partial derivatives since the functions involve multiple independent variables.
f1(x1*, …, xn*) ≈ f1(x1, …, xn) + Δx1 (∂f1/∂x1) + Δx2 (∂f1/∂x2) + … + Δxn (∂f1/∂xn)
f2(x1*, …, xn*) ≈ f2(x1, …, xn) + Δx1 (∂f2/∂x1) + Δx2 (∂f2/∂x2) + … + Δxn (∂f2/∂xn)
f3(x1*, …, xn*) ≈ f3(x1, …, xn) + Δx1 (∂f3/∂x1) + Δx2 (∂f3/∂x2) + … + Δxn (∂f3/∂xn)
⋮
fn(x1*, …, xn*) ≈ fn(x1, …, xn) + Δx1 (∂fn/∂x1) + Δx2 (∂fn/∂x2) + … + Δxn (∂fn/∂xn)     (10.6)

where each partial derivative is evaluated at the trial point (x1, …, xn).
The values of the xi's on the right-hand side of the equations above represent a trial point in the n-dimensional solution space which is, hopefully, in the neighborhood of the solution point {x1*, x2*, x3*, …, xn*}T. Substituting Equation 10.5 for the left-hand sides of Equation 10.6, we find
| ∂f1/∂x1  ∂f1/∂x2  ∂f1/∂x3  …  ∂f1/∂xn | | Δx1 |   | −f1(x1, x2, x3, …, xn) |
| ∂f2/∂x1  ∂f2/∂x2  ∂f2/∂x3  …  ∂f2/∂xn | | Δx2 |   | −f2(x1, x2, x3, …, xn) |
| ∂f3/∂x1  ∂f3/∂x2  ∂f3/∂x3  …  ∂f3/∂xn | | Δx3 | = | −f3(x1, x2, x3, …, xn) |     (10.7)
|    ⋮        ⋮        ⋮            ⋮    | |  ⋮  |   |           ⋮             |
| ∂fn/∂x1  ∂fn/∂x2  ∂fn/∂x3  …  ∂fn/∂xn | | Δxn |   | −fn(x1, x2, x3, …, xn) |
Equation 10.7 is a multi-dimensional generalization of Equation 8.3. When applied in an iteration sequence, starting guesses for the respective x values are first chosen. Next, the values of the nonlinear functions are computed, and the negatives of these values are placed in the right-hand side column vector. Then the elements of the n×n matrix of partial derivatives, called the Jacobian matrix, are computed using the assumed values of the xi's. Although it may not appear obvious from the form of Equation 10.7, this is a set of simultaneous linear equations. While the general forms of the functions on the right-hand side are nonlinear (and the analytical expressions of the partial derivatives on the left-hand side are potentially nonlinear), once these expressions are evaluated at a particular set of xi's, they become simple numerical constants, and the problem reduces to one of solving a set of simultaneous linear equations using any of the methods covered earlier in this chapter. Thus, the values of the Δxi's become the unknowns whose values are obtained when solving the linear equations. Each of the computed Δxi's is added to its respective assumed value of xi to obtain an improved estimate of the correct solution vector. Using these updated values of the xi's, new values of the right-hand side and Jacobian matrix of Equation 10.7 are computed, and the whole process repeats. Under favorable conditions the process will converge to a solution vector that yields zeros for the given function values to within some appropriate error tolerance. The generalized Newton method is subject to the same problems as the one-dimensional case, as discussed in Section 8.6. Nonetheless, Newton's method has been shown to be a powerful and fast method for solving systems of nonlinear equations involving several hundred independent variables, often requiring fewer than ten iterations.
EXAMPLE 10.5 Simultaneous Nonlinear Equations

Consider the simple example of finding the intersection point(s) of the circle centered at the origin given by x² + y² = 6 (radius √6) and the straight line given by the equation y = 2 − x. Rewriting the two equations in the proper form gives,
f1(x, y) = x² + y² − 6 = 0
f2(x, y) = x + y − 2 = 0
Computing the appropriate partial derivatives for the Jacobian matrix,
∂f1/∂x = 2x,  ∂f1/∂y = 2y
∂f2/∂x = 1,   ∂f2/∂y = 1
Assume starting values x = 0.5, y = −0.3. The main program listed below calls the croutFactor() and croutSolve() functions developed in Example 10.3 to solve the linear system of equations in each iteration.
#include <iostream>
#include <cmath>
#include <cstdlib>
#include <enmcpp.h>
#include <enmmat.h>
#include <enmcrout.h>
using namespace std;
using namespace ENM;
int main()
{
    Matrix<double> A(2,2);
    Vector<double> b(2);
    double x = 0.5, y = -0.3, sumsq;
    int numiter = 0;
    do
    {
        b(1) = -(x*x + y*y - 6.0);
        b(2) = -(x + y - 2.0);
        A(1,1) = 2.0*x;
        A(1,2) = 2.0*y;
        A(2,1) = 1.0;
        A(2,2) = 1.0;
        if (!croutFactor(A))
        {
            cerr << "Singular matrix in croutFactor!\n";
            exit(-1);
        }
        croutSolve(A, b);
        sumsq = sqrt(b(1)*b(1) + b(2)*b(2));
        x += b(1);
        y += b(2);
        numiter++;
    } while (sumsq > 1.0e-6);
    cout << "x = " << x << "  y = " << y
         << " in " << numiter << " iterations\n";
}
Program Output: x = 2.41421
y = -0.414214 in 7 iterations
Comments: The intersection of a line and a circle may have zero, one, or two solutions. A quick sketch of this problem reveals that there are clearly two solutions. To find the second solution, just change the starting guesses for x and y on line 12 to (-0.5, 0.3) and the program output becomes

x = -0.41421
y = 2.414214 in 7 iterations
The convergence of the iteration is tested by computing the Euclidean norm of the solution vector returned by croutSolve() as shown on line 28. Since this is the vector that contains the Δx and Δy components, when its length approaches zero, it is safe to say that the Newton iteration has converged. O
Glossary
Elementary Numerical Methods and C++

B-vector (load vector): the column vector of constants that form the “right-hand side” of a set of simultaneous equations.
back substitution: a solution algorithm for upper triangular systems that begins by solving for the last unknown and proceeds to solve for the other unknowns in reverse order.
diagonal system: a set of simultaneous equations having a coefficient matrix with nonzero elements only on the principal diagonal.
elementary row operations: operations that when applied to one or more entire rows of a matrix do not alter the essential characteristics of the matrix.
forward pass: the forward substitution step for solving tridiagonal systems.
forward substitution: a solution algorithm for lower triangular systems that begins by solving for the first unknown and proceeds to solve for the other unknowns in ascending order.
full (column) pivoting: swapping both rows and columns of a set of equations so as to cause the pivot element to be the largest possible value in order to reduce roundoff error.
in-place decomposition: an LU decomposition method that requires no intermediate storage locations.
Jacobian matrix: a matrix in which the elements of the ith row are the numerical values of the partial derivatives of the ith dependent variable with respect to each independent variable.
lower triangular: a matrix having nonzero elements only on or below and to the left of the principal diagonal.
nonsingular: a coefficient matrix that has a nonzero determinant, thus ensuring a unique solution for a set of simultaneous equations.
pivot element: the diagonal element of the pivot row.
pivot row: the row of a matrix during the coefficient elimination process below which the coefficients are to be zeroed.
reverse (return) pass: the back substitution step for solving tridiagonal systems.
row (partial) pivoting: swapping rows of a set of equations so as to cause the pivot element to be the largest possible value in order to reduce roundoff error.
strongly diagonally dominant: a coefficient matrix in which the absolute value of the diagonal element is larger than the sum of the absolute values of all of the other coefficients in the same row.
tridiagonal system: a set of simultaneous equations having nonzero coefficients no more than one row or column away from the principal diagonal.
upper triangular: a matrix having nonzero elements only on or above and to the right of the principal diagonal.
Chapter 10 Problems

10.1 Solve the following set of simultaneous algebraic equations using Gauss Elimination.

2w − x = 8
4y − 3z = −2
3w + 6y = 15
4x − z = −10

10.2 The C++ program given below will build a simultaneous set of matSize linear algebraic equations, [A]{x}={b}, where matSize is a pseudorandom number in the range 5 < matSize < 30. The true solution vector {x} is formed first by choosing random integers in the range -20 < x[i] < 20. The coefficients of the [A] matrix are chosen similarly. The elements of the right-hand side vector {b} are computed so as to form a valid set of linear equations. The program concludes by printing out the [A] matrix and the {x} and {b} vectors. Modify and extend the program to solve the resulting set of equations using an appropriate method and compare the calculated solution with the known solution vector {x}. Investigate as many aspects of this problem as you can. For example, how does the accuracy of the solution vary as the value of matSize increases? What happens to the
solution accuracy when double precision vectors and matrices are replaced by float values?

#include <iostream>
#include <iomanip>
#include <cstdlib>
#include <ctime>
#include <values.h>
#include "enmvec.h"
#include "enmmat.h"
#include "enmcrout.h"
using namespace std;
using namespace ENM;
int myRand(int minVal, int maxVal)
{
   double range = maxVal-minVal+1;
   return int(range/MAXINT*random())+minVal;
}
int main(void)
{
   // Uncomment the next line to give truly random values
   // srandom(time(0));
   int matSize = myRand(5,30);
   DMatrix A(matSize, matSize);
   DVector x(matSize), b(matSize);
   for (int i=0; i<matSize; i++)
   {
      x[i] = myRand(-20, 20);
   }
   for (int i=0; i<matSize; i++)
   {
      double total = 0;
      for (int j=0; j<matSize; j++)
      {
         A[i][j] = myRand(-20, 20);
         total += A[i][j] * x[j];
      }
      b[i] = total;
   }
   cout << "A matrix is:" << A << endl;
   cout << "x vector is:" << x << endl;
   cout << "b vector is:" << b << endl;
}
10.3 The Gauss-Jordan method for solving a set of simultaneous linear equations differs from the naive Gauss Elimination method only in that coefficients both above and below the principal diagonal are eliminated for each pivot element. Thus, the resulting coefficient matrix is diagonal instead of triangular. Use the Gauss-Jordan method to solve the following set of equations. Show all work!
⎡ 4   2   1  −2⎤ ⎧x0⎫   ⎧ 28⎫
⎢ 2   8  −2  −1⎥ ⎪x1⎪ = ⎪−38⎪
⎢ 1  −2   4   2⎥ ⎨x2⎬   ⎨ 32⎬
⎣−1   4   2   8⎦ ⎩x3⎭   ⎩−40⎭

10.4 The forces in the members of two-dimensional trusses of the type shown below can be solved using simultaneous algebraic equations. Assume point A is on rollers so that there is no reaction in the vertical direction at A. There are reactions in both the vertical and
horizontal directions at G. Furthermore, note that there is a vertical structural member between points A and G. Assume all structural members are in tension and write equations for the vertical and horizontal components of the forces at every lettered node. Since the system is in static equilibrium, each of these equations must equal zero. Solve the resulting set of linear equations to find the forces in the structural members and the supports.
10.5 A company wishes to create a cash incentive program for its sales force using the following scheme. A fixed amount of money is set aside as the total pool of bonus money available. The management defines several levels of sales performance (brackets) into which the sales people will be categorized. Those sales people that do not reach the minimum performance level will not receive any money from the bonus pool. Those reaching only the minimum level of sales will receive a fixed amount that is predetermined by the management. The incremental amounts to be awarded to higher performers are to be uniformly spaced. Note that the dollar amount in each bracket can’t be determined until the end of the contest. For example, if the company set aside $10,000 and had five sales people all reach the top bracket while no other sales people even reached the bottom bracket, each of the five would receive $2,000. (Note that they would also receive $2,000 each if they had only reached the bottom bracket.) How should the company divide a total pool of $10,000 if the minimum award is $100 (bracket 1), there are 10 sales performance brackets, and at the end of the contest the numbers of sales people in the various brackets are as follows:

Bracket        1   2   3   4   5   6   7   8   9   10
# in bracket   0   1   0   1   3   0   0   2   1   2

10.6 In the four-bar mechanism shown below, we are typically given the lengths of each of the links as follows:
a = AB,  b = BC,  c = CD,  d = AD.

We will assume that the origin for the coordinate system is at point A and that points A and D lie on the positive x-axis. If we are given the angular position of the link AB with respect to the positive x-axis, then we can solve for the configuration of the mechanism by finding the x-y coordinates of point C. There are several different formulations of the equations that will solve for the geometry of the four-bar mechanism. For this problem, we choose to use the circle equation formulation. Given that link AB is fixed at its known angle, then if we detach the pin at point C, link BC is free to swing around the pin at point B, thus constraining point C to lie on a circle having radius b centered at point B. But link CD is also free to swing about the pin at point D, thus constraining point C to lie on a circle having radius c centered at point D. Solving these two equations simultaneously will give the x-y coordinates of point C that assembles the mechanism properly. These two equations can be written as
(x_c − x_b)^2 + (y_c − y_b)^2 = b^2
(x_c − x_d)^2 + (y_c − y_d)^2 = c^2.

These equations may have zero, one, or two solutions, depending on the lengths of the links and the given orientation of link AB. Write a C++ program that solves for the coordinates of point C for the following cases.

a    b    c    d    angle of AB (degrees)
2    7    9    6    30
9    3    8    7    85
10   6    8    3    45
5    7    6    8    25
CHAPTER 11

Curve Fitting

CHAPTER OUTLINE
11.1 Introduction
11.2 Polynomials
11.2.1 Polynomial Evaluation Using Horner’s Method
11.2.2 Lagrange Interpolation
11.3 Splines
11.4 Least-Squares Approximation

11.1 INTRODUCTION
The term “curve fitting” is very imprecise. To most technical minds, it means that some analytical function is to be found such that when the function is plotted over a range of values of an independent variable, it will more or less track the behavior of some discrete data points so that it “looks good” to the trained eye. Sometimes it is required that the function pass through each discrete data point exactly. This is called interpolation. Other times the function is required only to follow the trend of the discrete data and not necessarily pass through any individual data point. This is called approximation. When interpolating, there are many choices that must be made that may radically affect the appearance of the resulting curve. These include the choice of the interpolating function, and, if polynomials are used, the degree of the polynomial. Even more options are available when approximating data with a function. In addition to the choice of the function to be used, there is the question of what mathematical criterion should be used to determine what is a “good” approximation. This chapter presents three of the most commonly used methods for “curve fitting.” Polynomial interpolation using Lagrange’s method will serve to introduce a C++ polynomial class. Cubic spline interpolation and a corresponding C++ spline class will be given only brief treatment. Finally, polynomial least-squares approximation will be presented using the classical formulation.
11.2 POLYNOMIALS
Polynomials are widely used for interpolation and approximation for several good reasons. First,
polynomials are easy to work with, both analytically and computationally. We will develop a Polynomial class in C++ which implements most common polynomial operations. Second,
mathematical operations on polynomials produce other polynomials. Finally, and perhaps most important, the Weierstrass Approximation Theorem states that over any closed interval, any piecewise continuous function can be approximated to any desired degree of accuracy using some polynomial. Granted, the polynomial may be a very long one, but it is guaranteed to exist.

11.2.1 Polynomial Evaluation Using Horner’s Method

The standard mathematical notation for writing a polynomial of degree (or order) n is
p_n(x) = Σ_{i=0}^{n} a_i x^i        (11.1)
Expanding the summation, this becomes
p_n(x) = a_0 + a_1 x + a_2 x^2 + … + a_{n−1} x^{n−1} + a_n x^n        (11.2)

Thus, a zero-degree polynomial consists only of a_0 and is simply a constant, while a first-degree polynomial comprises the first two terms of Equation 11.2, which is the equation of a straight line, etc. How can polynomials of the form given in Equation 11.1 or 11.2 effectively be represented in C++? The answer, of course, is to create a Polynomial class which encapsulates the representation and behavior of polynomials of the form given above. While the details of constructing a class (especially a template class) are beyond the scope of these notes, some fundamental understanding is useful. First, notice that in order to represent a polynomial, only the coefficients need be known. The value of the independent variable is not needed for storing the polynomial, only for evaluating it. For example, the polynomial 3.0 + 2.5x − 6x^2 can be represented by the three-element vector {3.0, 2.5, −6.0}^T. When using this representation, placeholders containing zeros must be used for all terms not present, as in the case of the polynomial −4.0x + x^3, which would be represented by the vector {0.0, −4.0, 0.0, 1.0}^T. The Polynomial class used in file enmpoly.h in Example 11.1 is based on this style of polynomial representation. Having established the vector-based representation of polynomials in C++, it is relatively easy to see how a function could be written to evaluate a polynomial at a given value of the independent variable by directly applying Equation 11.2. This is the so-called term-by-term method for polynomial evaluation, but because of the exponentiation required for each term, it is not very efficient. Horner’s method is universally used to evaluate polynomials of the form discussed here. Horner’s method appears unusual upon first inspection because it uses an inside-out sort of logic.
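This inside-out logic is easiest to see in a few lines of code. The sketch below is a stand-alone illustration (the function name horner and the use of std::vector are this note’s own, not part of the Polynomial class developed later); the coefficients are stored lowest-order first, exactly as in the vector representation described above:

```cpp
#include <cassert>
#include <vector>

// Horner evaluation of a0 + a1*x + ... + an*x^n, where the
// coefficient vector c stores {a0, a1, ..., an}.
double horner(const std::vector<double>& c, double x)
{
    assert(!c.empty());
    double result = 0.0;
    // Start with the highest-order coefficient and repeatedly
    // multiply by x, then add the next lower coefficient; no
    // explicit exponentiation is ever performed.
    for (int i = static_cast<int>(c.size()) - 1; i >= 0; --i)
    {
        result = result * x + c[i];
    }
    return result;
}
```

For example, horner({3.0, 2.5, -6.0}, 2.0) evaluates 3.0 + 2.5x − 6x^2 at x = 2 and returns −16.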
Rewriting Equation 11.2 in a format compatible with Horner’s method gives

p_n(x) = ((…(a_n x + a_{n−1})x + a_{n−2})x + … + a_1)x + a_0        (11.3)

Applying Equation 11.3 to the two examples above, we see that 3.0 + 2.5x − 6x^2 can be written as ((−6x) + 2.5)x + 3.0 and −4.0x + x^3 can be expressed equivalently as (((1.0x) + 0)x − 4.0)x + 0.0. Notice that despite its ungainly appearance, the Horner representation of polynomials leads to their evaluation at any particular value of x using only multiplication and addition. It does not involve any exponentiation, that being accomplished by implicit successive multiplication. Horner’s method is implemented in the val() function of the Polynomial class. It is also overloaded as the operator() function, which means that if a program has created a given polynomial named, say, p1, then the value of polynomial p1 at x = 3.5 could be found by calling either p1.val(3.5) or more simply p1(3.5). The Polynomial class defined in file enmpoly.h also implements functions to permit most common fundamental operations such as adding or subtracting two polynomials, multiplying two polynomials, taking the derivative of a polynomial, and formatted output of polynomials (which is
the longest and most complicated function in the class). Careful study of the code in file enmpoly.h will reveal several C++ programming insights.

O EXAMPLE 11.1 Polynomial Evaluation Using Horner’s Method

A fairly complete Polynomial class is defined in file enmpoly.h which is shown below. The class is implemented as a template, which means it can be instantiated as either single or double precision. The class is derived from the Vector class and actually includes no additional data members. The order, or degree, of the polynomial is assumed to be one less than the number of elements in the coefficient vector.

#ifndef ENM_POLYNOMIAL_H
#define ENM_POLYNOMIAL_H
#include <iostream>
#include <cassert>
#include <enmvec.h>
#include <enmmath.h>
#include <enmmisc.h>
#include <enmfit.h>
namespace ENM
{
template <class T>
class Polynomial : public Vector<T>
{
protected:
public:
   Polynomial(Subscript N=0) : Vector<T>(N+1) {};
   Polynomial(Subscript N, char* s) : Vector<T>(N+1,s) {};
   Polynomial(const Vector<T>& xv, const Vector<T>& yv, int m) :
      Vector<T>(Lsqfit(xv, yv, m)) {};
   T val(T x) const
   {
      T retval = T(0);
      for (Subscript i=this->n_-1; i>=0; i--)
      {
         retval = retval * x + this->v_[i];
      }
      return retval;
   }
   T operator()(T x) const
   {
      return val(x);
   }
   Subscript order(void) const
   {
      return this->n_-1;
   }
   void divideByQuad(Polynomial& q, Polynomial& b, Polynomial& r)
   {
      assert(q.size()==3);
      assert(q[2]!=T(0));
      assert(this->n_>3);
      b.newsize(this->n_-2);
      r.newsize(2);
      Subscript n=this->n_-1;
      T fac = q[2];
      q[2] = T(1);
      q[1] /= fac;
      q[0] /= fac;
      b.v_[n-2]=this->v_[n];
      b.v_[n-3]=this->v_[n-1]-this->v_[n]*q[1];
      for (Subscript i=n-4; i>=0; i--)
      {
         b[i] = this->v_[i+2]-q[1]*b[i+1]-q[0]*b[i+2];
      }
      r[1] = this->v_[1]-b[0]*q[1]-b[1]*q[0];
      r[0] = this->v_[0]-r[1]*q[1]-b[0]*q[0];
      for (Subscript j=n-2; j>=0; j--)
      {
         b[j] *= fac;
      }
   }
   Polynomial Derivative(void)
   {
      assert(this->n_>0);
      Subscript dorder = this->n_-2;
      Polynomial tmp(dorder);
      for (Subscript i=0; i<=dorder; i++)
      {
         tmp[i] = this->v_[i+1]*(i+1);
      }
      return tmp;
   }
};
template <class T>
Polynomial<T> operator+(const Polynomial<T>& A, const Polynomial<T>& B)
{
   Subscript i=0;
   Subscript M=A.order();
   Subscript N=B.order();
   Polynomial<T> tmp(max(M, N));
   Subscript L=min(M, N);
   for (;i<=L;i++) tmp[i] = A[i] + B[i];
   for (;i<=M;i++) tmp[i] = A[i];
   for (;i<=N;i++) tmp[i] = B[i];
   return tmp;
}
template <class T>
Polynomial<T> operator-(const Polynomial<T>& A, const Polynomial<T>& B)
{
   Subscript i=0;
   Subscript M=A.order();
   Subscript N=B.order();
   Polynomial<T> tmp(max(M, N));
   Subscript L=min(M, N);
   for (;i<=L;i++) tmp[i] = A[i] - B[i];
   for (;i<=M;i++) tmp[i] = A[i];
   for (;i<=N;i++) tmp[i] = -B[i];
   return tmp;
}
template <class T>
Polynomial<T> operator*(const Polynomial<T>& A, const Polynomial<T>& B)
{
   Subscript norder = A.order() + B.order();
   Polynomial<T> tmp(norder);
   Subscript M = A.order();
   Subscript N = B.order();
   for (Subscript i=norder; i>=0; i--)
   {
      tmp[i] = T(0);
      for (Subscript j=min(i,M); j>=max(0,i-N); j--)
      {
         tmp[i] += A[j] * B[i-j];
      }
   }
   return tmp;
}
template <class T>
Polynomial<T> operator*(const Polynomial<T>& A, T value)
{
   Subscript N=A.order();
   Polynomial<T> tmp(N);
   for (Subscript i=0; i<=N; i++) tmp[i] = A[i]*value;
   return tmp;
}
template <class T>
Polynomial<T> operator*(T value, const Polynomial<T>& A)
{
   return A * value;
}
template <class T>
Polynomial<T> operator/(const Polynomial<T>& A, T value)
{
   Subscript N=A.order();
   Polynomial<T> tmp(N);
   for (Subscript i=0; i<=N; i++) tmp[i] = A[i]/value;
   return tmp;
}
template <class T>
std::ostream& operator<<(std::ostream& os, const Polynomial<T> A)
{
   bool printSign=false;
   format rst(os.flags(),os.precision(),os.width());
   format yesSign(std::ios::showpos | std::ios::left, 0, 0);
   format noSign(std::ios::left, 0, 0);
   Subscript N=A.size();
   if (A[0]!=T(0))
   {
      os << A[0];
      printSign=true;
   }
   for (Subscript i=1; i<N; i++)
   {
      if (A[i]!=T(0))
      {
         if (A[i]==T(1.0))
         {
            if (printSign)
            {
               os << "+x";
            }
            else
            {
               os << "x";
            }
         }
         else if (A[i]==T(-1.0))
         {
            os << "-x";
         }
         else
         {
            if (printSign)
            {
               os << yesSign << A[i] << rst << "*x";
            }
            else
            {
               os << noSign << A[i] << rst << "*x";
            }
         }
         if (i>1)
         {
            os << "^" << i;
         }
         printSign=true;
      }
   }
   if (!printSign) os << "0";
   return os;
}
typedef Polynomial<float> FPolynomial;
typedef Polynomial<double> DPolynomial;
} /* namespace ENM */
#endif
Here is a sample main() function that illustrates most of the capabilities of the Polynomial class. A more practical example will be given later in this chapter.

#include <iostream>
#include <enmcpp.h>
#include <enmpoly.h>
using namespace ENM;
using namespace std;
int main()
{
   Polynomial<double> A(2,"0 0 1");
   cout << "A.val(3.0) = " << A.val(3.0) << endl;
   cout << "A(3.0) = " << A(3.0) << endl;
   cout << "A = " << A << endl;
   DPolynomial B(5,"-6 5 -4 3 -2 1");
   cout << "B = " << B << endl;
   cout << "Derivative of B = " << B.Derivative() << endl;
   DPolynomial C = A*B;
   cout << "C = " << C << endl;
   DPolynomial D, E;
   C.divideByQuad(A,D,E);
   cout << "After C.divideByQuad(A,D,E): A = " << A << endl
        << "D = " << D << endl << "E = " << E << endl;
   cout << "A + B = " << A + B << endl;
   cout << "A - B = " << A - B << endl;
   cout << "2.0*B = " << 2.0*B << endl;
}
Program Output:

A.val(3.0) = 9
A(3.0) = 9
A = x^2
B = -6+5*x-4*x^2+3*x^3-2*x^4+x^5
Derivative of B = 5-8*x+9*x^2-8*x^3+5*x^4
C = 0-6*x^2+5*x^3-4*x^4+3*x^5-2*x^6+x^7
After C.divideByQuad(A,D,E): A = x^2
D = -6+5*x-4*x^2+3*x^3-2*x^4+x^5
E = 0
A + B = -6+5*x-3*x^2+3*x^3-2*x^4+x^5
A - B = 6-5*x+5*x^2-3*x^3+2*x^4-x^5
2.0*B = -12+10*x-8*x^2+6*x^3-4*x^4+2*x^5
Comments: The divideByQuad() function divides the polynomial calling the function (which must be of degree three or higher) by the quadratic polynomial passed as the first argument and returns the quotient polynomial as the second argument and the remainder polynomial as the third argument. Of course, the quotient polynomial will be of degree two lower than the original polynomial, which remains unaltered. The remainder polynomial will be either zero, a constant, or a linear polynomial. If the remainder is precisely zero, as in this example, then the first and second arguments constitute perfect factors of the original polynomial. This process is used by a number of deflation methods to obtain the roots of polynomials. However, these methods are not presented in this book. O

11.2.2 Lagrange Interpolation

The term “interpolation” has a rather specific meaning in the context of curve fitting. An interpolating function must pass through every supplied data point (usually called base points). There may be many interpolating functions that can be fit to a set of n data points, but each function is required to pass through every data point. If the function chosen for interpolation is a polynomial of the form given in Equation 11.1, then the polynomial will be of degree n−1 and it will be unique. That is, any method used to find the polynomial will return the same set of n coefficients. (Of course, if the given base points exactly fit a polynomial of degree less than n−1, say, perhaps degree n−3, then the coefficients a_{n−2} and a_{n−1} will turn out to be zero, neglecting any roundoff error introduced in the calculation procedure.) There are several different algorithms that can be used to determine the coefficients of the interpolating polynomial for a given set of base points. (Actually, it is rarely necessary to compute the actual coefficients of the polynomial.
Usually it is the value of the interpolating polynomial at a particular value of the independent variable that is needed.) The Lagrangian algorithm is one of the easiest methods to understand and implement and it is also quite computationally efficient. It begins by assuming an entirely different format from the classic polynomial given in Equation 11.1. The Lagrange nth degree polynomial consists of the sum of n+1 terms, each of which is an nth degree polynomial.
p_n(x) = a_0 (x − x_1)(x − x_2) … (x − x_{n−1})(x − x_n)
       + a_1 (x − x_0)(x − x_2) … (x − x_{n−1})(x − x_n)
       + a_2 (x − x_0)(x − x_1) … (x − x_{n−1})(x − x_n)
       + …
       + a_{n−1} (x − x_0)(x − x_1)(x − x_2) … (x − x_n)
       + a_n (x − x_0)(x − x_1)(x − x_2) … (x − x_{n−1})        (11.4)

(In the ith term, the single factor (x − x_i) is omitted.) Of course, this unusual structure is chosen for a reason. To understand why, notice what happens when Equation 11.4 is evaluated at the known base point (x_0, y_0).
y_0 = a_0 (x_0 − x_1)(x_0 − x_2) … (x_0 − x_{n−1})(x_0 − x_n)
    + a_1 (x_0 − x_0)(x_0 − x_2) … (x_0 − x_{n−1})(x_0 − x_n)
    + a_2 (x_0 − x_0)(x_0 − x_1) … (x_0 − x_{n−1})(x_0 − x_n)
    + …
    + a_{n−1} (x_0 − x_0)(x_0 − x_1)(x_0 − x_2) … (x_0 − x_n)
    + a_n (x_0 − x_0)(x_0 − x_1)(x_0 − x_2) … (x_0 − x_{n−1})        (11.5)

Since every term on the right-hand side except the first one contains the factor (x_0 − x_0), the unknown value of a_0 now can be directly computed as
a_0 = y_0 / [(x_0 − x_1)(x_0 − x_2)(x_0 − x_3) … (x_0 − x_{n−1})(x_0 − x_n)]        (11.6)
Similarly, by evaluating Equation 11.4 at the other n base points, all n+1 values of the a_i’s can be found to be
a_i = y_i / ∏_{j=0, j≠i}^{n} (x_i − x_j)        (11.7)
Substituting this expression into Equation 11.4, the general solution for the Lagrange interpolating polynomial can be expressed as
p_n(x) = Σ_{i=0}^{n} L_i(x) y_i,   where   L_i(x) = ∏_{j=0, j≠i}^{n} (x − x_j) / (x_i − x_j)        (11.8)
A demonstration of Lagrangian interpolation is given in Example 11.2 and C++ code
implementing the method is included in file enmfit.h which is included in Example 11.3 starting on page 131. Lagrange interpolation is straightforward and easy to apply. However, there are some caveats that must be considered to ensure its successful application.

• More is not always better. It is very tempting to assume that more base points will give a more accurate interpolation than fewer points. This is usually not true. Remember that the degree of the polynomial used for interpolation is equal to the number of base points used minus one. Furthermore, it is characteristic of higher degree (>6) interpolating polynomials to oscillate rather wildly between base points. So even though they still pass through all base points exactly, their usefulness for interpolating realistic results between base points is compromised. As a general rule, do not use more than seven base points for polynomial interpolation.

• Stay within the boundaries. Though it cannot be seen using the direct derivation of Lagrangian interpolation that was presented above, it can be shown mathematically (and is demonstrated in Example 11.3) that the accuracy of polynomial interpolation suffers rather dramatically when the x-variable of the point to be found lies outside the range of x-values of the supplied base points. This is commonly known as extrapolation. When interpolating in a table containing a large number of potential base points, be sure to search through the table first to find a relatively few base points that surround the x-value of the point to be found and use this subset of base points for interpolation.

O EXAMPLE 11.2 Demonstration of Lagrange Interpolation

Given the data

x      1     2     3     5      6
f(x)   4.75  4     5.25  19.75  36
Calculate f(4) using Lagrange interpolation of various orders. Since there are five base points, it is possible to compute f(4) using interpolating polynomials of order 1 through 4. For the first-order interpolation, we should use base points (x0,y0) = (3,5.25) and (x1,y1) = (5,19.75) since they “surround” the unknown point. For n = 1, Equation 11.8 becomes
L_0 = (x − x_1)/(x_0 − x_1) = (4 − 5)/(3 − 5) = 1/2,
L_1 = (x − x_0)/(x_1 − x_0) = (4 − 3)/(5 − 3) = 1/2

f(4) = L_0 y_0 + L_1 y_1 = (1/2)(5.25) + (1/2)(19.75) = 12.5
For the second-order interpolating polynomial, there are two reasonable choices of base point sets; (x0,y0) = (2,4), (x1,y1) = (3,5.25), (x2,y2) = (5,19.75) or (x0,y0) = (3,5.25), (x1,y1) = (5,19.75), (x2,y2) = (6,36). For the data points given in this problem, there is no reason to choose either one of these two sets of base points over the other since they both surround the desired x-value equally. We will choose the first set and leave the calculations for the second set as an exercise for the reader.
L_0 = (x − x_1)(x − x_2) / [(x_0 − x_1)(x_0 − x_2)] = ((4−3)(4−5)) / ((2−3)(2−5)) = −1/3
L_1 = (x − x_0)(x − x_2) / [(x_1 − x_0)(x_1 − x_2)] = ((4−2)(4−5)) / ((3−2)(3−5)) = 1
L_2 = (x − x_0)(x − x_1) / [(x_2 − x_0)(x_2 − x_1)] = ((4−2)(4−3)) / ((5−2)(5−3)) = 1/3

f(4) = L_0 y_0 + L_1 y_1 + L_2 y_2 = −(1/3)(4) + (1)(5.25) + (1/3)(19.75) = 10.5

The third-order interpolating polynomial requires four base points. Again, there are two choices, but one is clearly better than the other. Using the last four points in the given table puts the point to be found directly in the center of the interval. This should give a better answer than using the first four points, which would place the point to be found near the end of the interval. So we choose (x0,y0) = (2,4), (x1,y1) = (3,5.25), (x2,y2) = (5,19.75) and (x3,y3) = (6,36) as base points and proceed as before.
L_0 = (x − x_1)(x − x_2)(x − x_3) / [(x_0 − x_1)(x_0 − x_2)(x_0 − x_3)] = ((4−3)(4−5)(4−6)) / ((2−3)(2−5)(2−6)) = −1/6
L_1 = (x − x_0)(x − x_2)(x − x_3) / [(x_1 − x_0)(x_1 − x_2)(x_1 − x_3)] = ((4−2)(4−5)(4−6)) / ((3−2)(3−5)(3−6)) = 2/3
L_2 = (x − x_0)(x − x_1)(x − x_3) / [(x_2 − x_0)(x_2 − x_1)(x_2 − x_3)] = ((4−2)(4−3)(4−6)) / ((5−2)(5−3)(5−6)) = 2/3
L_3 = (x − x_0)(x − x_1)(x − x_2) / [(x_3 − x_0)(x_3 − x_1)(x_3 − x_2)] = ((4−2)(4−3)(4−5)) / ((6−2)(6−3)(6−5)) = −1/6

f(4) = L_0 y_0 + L_1 y_1 + L_2 y_2 + L_3 y_3 = −(1/6)(4) + (2/3)(5.25) + (2/3)(19.75) − (1/6)(36) = 10
O
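The hand calculations above can be cross-checked with a short routine that applies Equation 11.8 directly. This is a stand-alone sketch using std::vector; it is not the enmfit.h code referenced in Example 11.3, and the name lagrange is chosen here only for illustration:

```cpp
#include <cstddef>
#include <vector>

// Evaluate the Lagrange interpolating polynomial (Equation 11.8)
// through the base points (xv[i], yv[i]) at the abscissa x.
double lagrange(const std::vector<double>& xv,
                const std::vector<double>& yv, double x)
{
    double sum = 0.0;
    for (std::size_t i = 0; i < xv.size(); ++i)
    {
        double L = 1.0;   // the cardinal function L_i(x)
        for (std::size_t j = 0; j < xv.size(); ++j)
        {
            if (j != i)
            {
                L *= (x - xv[j]) / (xv[i] - xv[j]);
            }
        }
        sum += L * yv[i];
    }
    return sum;
}
```

Called with the base points (2,4), (3,5.25), (5,19.75), (6,36) and x = 4, it reproduces the third-order result f(4) = 10 to within roundoff.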
11.3 SPLINES
A spline is a thin piece of wood that draftsmen of yesteryear would use to draw a single smooth curve through several points. Heavy weighted guides would be placed on the drawing at the known points
and the spline would be “threaded” through the guides to form the smooth curve. The mathematical analog of this process is described by the cubic spline curve, which consists of a series of cubic polynomials whose coefficients are chosen to provide continuity of the function as well as its first and second derivatives at the base points. The resulting curve accurately models the behavior of a true wooden spline and mimics the “eyeball” curve that most people would draw through the base points. The development of the equations for cubic spline interpolation proceeds as follows. Given a set of n+1 base points (x_i, y_i), 0 ≤ i ≤ n, we assume the interpolating function between any two base points to be a cubic polynomial of the form
S_i(x) = a_i (x − x_i)^3 + b_i (x − x_i)^2 + c_i (x − x_i) + d_i,   0 ≤ i < n.        (11.9)

The four coefficients are subscripted because no single cubic polynomial will fit all the data points. A different cubic polynomial exists between each pair of base points. Thus, there are n values of a_i to be found as well as n values of b_i, n values of c_i, and n values of d_i, for a total of 4n values. Since the values of the interpolating function must match the given base points,

S_i(x_i) = y_i = a_i (x_i − x_i)^3 + b_i (x_i − x_i)^2 + c_i (x_i − x_i) + d_i,   0 ≤ i < n
∴ d_i = y_i,   0 ≤ i < n
This leaves the 3n values of the a_i’s, b_i’s, and c_i’s to be determined. We do this by imposing the conditions that the function and its first and second derivatives are continuous at the base points. The value of the cubic polynomial between base points x_{i−1} and x_i must equal the value of the next cubic polynomial which connects base points x_i and x_{i+1} when evaluated at point x_i. But this value is known to be y_i (which is the same as d_i from above). To simplify the notation, let us define h_{i−1} = x_i − x_{i−1} so that the function continuity requirement at each internal base point can be written as

d_i = a_{i−1} h_{i−1}^3 + b_{i−1} h_{i−1}^2 + c_{i−1} h_{i−1} + d_{i−1},   1 ≤ i < n.

Applying a similar requirement to the first and second derivatives of Equation 11.9 yields the equations

c_i = 3 a_{i−1} h_{i−1}^2 + 2 b_{i−1} h_{i−1} + c_{i−1}
b_i = 3 a_{i−1} h_{i−1} + b_{i−1},

which apply to all interior base points, 1 ≤ i < n. These equations can be combined to eliminate the a_i’s and c_i’s, yielding

b_{i−1} h_{i−1} + 2 b_i (h_i + h_{i−1}) + b_{i+1} h_i = 3(y_{i+1} − y_i)/h_i − 3(y_i − y_{i−1})/h_{i−1},   1 ≤ i < n.
Since all of the y_i's and h_i's are known, this constitutes a set of n−1 equations involving n+1 unknowns. This leaves two additional equations that may be imposed at the discretion of the user. Almost universally these two conditions are placed on the values of b_0 and b_n, the second derivatives of the spline as it enters and exits the interpolation region, respectively. If both of these values are set to zero, the resulting curve is called a natural spline because it faithfully represents the behavior of the old wooden spline. However, other conditions may be applied under special circumstances. Irrespective of the numerical values assigned to b_0 and b_n, the resulting set of equations is tridiagonal, which means it can be solved very quickly for the b_i's, from which the other coefficients can be found. This step need be performed only once for any given set of base points. The resulting solutions can be stored and used any time the value of the interpolating function is needed. Example 11.3 shows the implementation of the Spline class patterned after the code by Press, et al. [6]. The equations for the spline are solved in the constructor. Subsequent calls for spline evaluation use the resulting coefficients efficiently to minimize execution time.
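A tridiagonal system like the one above can be solved in O(n) operations with the Thomas algorithm, a specialization of Gauss elimination. The sketch below is illustrative only and is not part of the Spline class; the function name and the std::vector interface are choices made for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Thomas algorithm: solve a tridiagonal system in O(n).
// sub  = sub-diagonal   (sub[0] unused)
// diag = main diagonal
// sup  = super-diagonal (sup[n-1] unused)
// rhs  = right-hand side; overwritten with the solution
void SolveTridiagonal(std::vector<double>& sub, std::vector<double>& diag,
                      std::vector<double>& sup, std::vector<double>& rhs)
{
    std::size_t n = rhs.size();
    assert(n > 0 && diag[0] != 0.0);
    // forward elimination: zero out the sub-diagonal
    for (std::size_t i = 1; i < n; ++i) {
        double w = sub[i] / diag[i-1];
        diag[i] -= w * sup[i-1];
        rhs[i]  -= w * rhs[i-1];
    }
    // back substitution
    rhs[n-1] /= diag[n-1];
    for (std::size_t i = n-1; i-- > 0; )
        rhs[i] = (rhs[i] - sup[i] * rhs[i+1]) / diag[i];
}
```

For n unknowns this takes roughly 3n multiplications, versus O(n^3) for general Gauss elimination, which is why solving for the spline coefficients once in the constructor is so cheap.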
11.4	LEAST-SQUARES APPROXIMATION
When a single polynomial is needed to approximate (not interpolate) a function, the least-squares criterion is typically used. If the function to be approximated is sampled at n+1 base points, then the least-squares polynomial is the one for which

	E = Σ_{i=0}^{n} e_i^2 = Σ_{i=0}^{n} ( p_m(x_i) − y_i )^2

is a minimum. In this expression, p_m(x) is an m-th degree polynomial such that m ≤ n. Expanding the polynomial, this expression becomes

	E = Σ_{i=0}^{n} ( a_0 + a_1 x_i + a_2 x_i^2 + … + a_m x_i^m − y_i )^2
The minimum value of E can be found by setting the partial derivative of E with respect to each of the m+1 polynomial coefficients equal to zero and solving for the values of the ai’s. This will result in a set of simultaneous linear equations which can be solved using Gauss elimination or LU decomposition. To see how this set of equations is formed, consider the formulation of the first two equations in the set.
	∂E/∂a_0 = Σ_{i=0}^{n} 2 ( a_0 + a_1 x_i + a_2 x_i^2 + … + a_m x_i^m − y_i ) = 0

	(n+1) a_0 + a_1 Σ_{i=0}^{n} x_i + a_2 Σ_{i=0}^{n} x_i^2 + … + a_m Σ_{i=0}^{n} x_i^m = Σ_{i=0}^{n} y_i

	∂E/∂a_1 = Σ_{i=0}^{n} 2 ( a_0 + a_1 x_i + a_2 x_i^2 + … + a_m x_i^m − y_i ) x_i = 0

	a_0 Σ_{i=0}^{n} x_i + a_1 Σ_{i=0}^{n} x_i^2 + a_2 Σ_{i=0}^{n} x_i^3 + … + a_m Σ_{i=0}^{n} x_i^{m+1} = Σ_{i=0}^{n} x_i y_i
The resulting set of simultaneous linear equations has the general form

	| n+1      Σx        Σx^2      ...  Σx^m     | | a0 |   | Σy     |
	| Σx       Σx^2      Σx^3      ...  Σx^(m+1) | | a1 |   | Σxy    |
	| Σx^2     Σx^3      Σx^4      ...  Σx^(m+2) | | a2 | = | Σx^2 y |        (11.10)
	|  ...                              ...      | | .. |   |  ...   |
	| Σx^m     Σx^(m+1)  Σx^(m+2)  ...  Σx^(2m)  | | am |   | Σx^m y |
where each of the summations goes from 0 to n. Because of the exponents within the summations, this set of equations can be difficult to solve for high degree polynomials or for data sets having large x-values. There are alternate formulations of these equations that address these difficulties, but for elementary problems, the present formulation works well for low degree polynomials involving data points not too distant from the origin. Solving Equation 11.10 yields the coefficients of the required least-squares polynomial. Unlike Lagrange interpolation, for least-squares fitting we normally do return the actual polynomial coefficients rather than just the y-value corresponding to a supplied x-value. This is an efficiency consideration, because it would take far too long to compute the polynomial coefficients every time the base points were to be approximated at a single point.

O EXAMPLE 11.3 Interpolation and Curve Fitting

This example illustrates Lagrange and cubic spline interpolation, as well as least-squares approximation. Six base points are available: (0.5,2.0), (1.0,1.0), (2.0,0.5), (3.0,0.3333…), (4.0,0.25), and (5.0,0.2). These points are taken from the function f(x)=1.0/x. Since there are six base points, the Lagrange interpolating polynomial will be of degree five. We will choose arbitrarily to use a parabola (second degree polynomial) for the least-squares curve fit. All of the interpolation/curve fitting code is contained in file enmfit.h. As usual, the code has been implemented using templates so that it can be used for any appropriate base data type. The cubic spline functionality has been implemented as a class, while the Lagrange interpolation and least-squares approximation have been implemented as function templates.
#ifndef ENM_FIT_H
#define ENM_FIT_H
#include <cassert>
#include <enmvec.h>
#include <enmmat.h>
#include <enmpoly.h>
#include <enmcrout.h>
namespace ENM {

template <class T>
class Spline
{
protected:
   T *xpts_;
   T *ypts_;
   T *y2_;
   int n_;
   void SetUp(int npts)
   {
      n_ = npts;
      xpts_ = new T[npts];
      ypts_ = new T[npts];
      y2_ = new T[npts];
   }
   void CreateSpline(const T *x, const T *y, int n, T yp1, T ypn)
   {
      assert (n > 2);
      Subscript i;
      T sig, p, qn, un;
      T *u;
      SetUp(n);
      for (i=0; i<n; i++)
      {
         xpts_[i] = x[i];
         ypts_[i] = y[i];
      }
      u = new T[n-1];
      if (yp1 > 0.99e30)
      {
         y2_[0] = u[0] = 0.0;
      }
      else
      {
         y2_[0] = -0.5;
         u[0] = (3.0/(xpts_[1]-xpts_[0]))*((ypts_[1]-ypts_[0])
                /(xpts_[1]-xpts_[0])-yp1);
      }
      // tridiagonal decomposition sweep, as in Press et al. [6]
      for (i=1; i<n-1; i++)
      {
         sig = (xpts_[i]-xpts_[i-1])/(xpts_[i+1]-xpts_[i-1]);
         p = sig*y2_[i-1]+2.0;
         y2_[i] = (sig-1.0)/p;
         u[i] = (ypts_[i+1]-ypts_[i])/(xpts_[i+1]-xpts_[i])
                - (ypts_[i]-ypts_[i-1])/(xpts_[i]-xpts_[i-1]);
         u[i] = (6.0*u[i]/(xpts_[i+1]-xpts_[i-1])-sig*u[i-1])/p;
      }
      if (ypn > 0.99e30)
      {
         qn = un = 0.0;
      }
      else
      {
         qn = 0.5;
         un = (3.0/(xpts_[n-1]-xpts_[n-2]))*
              (ypn-(ypts_[n-1]-ypts_[n-2])
              /(xpts_[n-1]-xpts_[n-2]));
      }
      y2_[n-1] = (un-qn*u[n-2])/(qn*y2_[n-2]+1.0);
      for (Subscript k=n-2; k>=0; k--)
      {
         y2_[k]=y2_[k]*y2_[k+1]+u[k];
      }
      delete[] u;
   }
public:
   Spline() : xpts_(0), ypts_(0), y2_(0), n_(0) {}
   Spline(const T *x, const T *y, int n, T yp1=T(1.0e30), T ypn=T(1.0e30))
   {
      CreateSpline(x, y, n, yp1, ypn);
   }
   Spline(const Vector<T>& x, const Vector<T>& y,
          T yp1=(T)1.0e30, T ypn=(T)1.0e30)
   {
      Subscript N=x.size();
      assert (N == y.size());
      T* xnew = new T[N];
      T* ynew = new T[N];
      for (Subscript i=0; i<N; i++)
      {
         xnew[i] = x[i];
         ynew[i] = y[i];
      }
      CreateSpline(xnew, ynew, N, yp1, ypn);
      delete[] ynew;
      delete[] xnew;
   }
   Spline(const Spline& source)
   {
      SetUp(source.n_);
      memcpy(xpts_, source.xpts_, n_*sizeof(T));
      memcpy(ypts_, source.ypts_, n_*sizeof(T));
      memcpy(y2_, source.y2_, n_*sizeof(T));
   }
   ~Spline()
   {
      delete[] y2_;
      delete[] ypts_;
      delete[] xpts_;
   }
   T splint(T x) const
   {
      assert (n_ > 0);
      Subscript klo = 0, khi = n_-1, k;
      while (khi-klo > 1)
      {
         k = (khi+klo) >> 1;
         if (xpts_[k] > x) khi = k;
         else klo = k;
      }
      T h = xpts_[khi] - xpts_[klo];
      assert (h != T(0));
      T a = (xpts_[khi]-x)/h;
      T b = (x-xpts_[klo])/h;
      return a*ypts_[klo]+b*ypts_[khi]+((a*a*a-a)*y2_[klo]
             +(b*b*b-b)*y2_[khi])*(h*h)/6.0;
   }
   T splypp(T x) const
   {
      assert (n_ > 0);
      Subscript klo = 0, khi = n_-1, k;
      while (khi-klo > 1)
      {
         k = (khi+klo) >> 1;
         if (xpts_[k] > x) khi = k;
         else klo = k;
      }
      T h = xpts_[khi] - xpts_[klo];
      assert (h != T(0));
      T a = (xpts_[khi]-x)/h;
      T b = (x-xpts_[klo])/h;
      return a*y2_[klo]+b*y2_[khi];
   }
   T splyp(T x) const
   {
      assert (n_ > 0);
      Subscript klo = 0, khi = n_-1, k;
      while (khi-klo > 1)
      {
         k = (khi+klo) >> 1;
         if (xpts_[k] > x) khi = k;
         else klo = k;
      }
      T h = xpts_[khi] - xpts_[klo];
      assert (h != T(0));
      T a = (xpts_[khi]-x)/h;
      T b = (x-xpts_[klo])/h;
      return (ypts_[khi]-ypts_[klo])/h-(3.0*a*a-1.0)/6.0*h*y2_[klo]
             + (3.0*b*b-1.0)/6.0*h*y2_[khi];
   }
   T fp(T x) const {return splyp(x);}
   T fprime(T x) const {return splyp(x);}
   T f(T x) const {return splint(x);}
   T val(T x) const {return splint(x);}
   T evaluate(T x) const {return splint(x);}
   T operator()(T x) const {return splint(x);}
   Spline& operator=(const Spline& source)
   {
      if (&source != this)
      {
         delete[] xpts_;
         delete[] ypts_;
         delete[] y2_;
         SetUp(source.n_);
         memcpy(xpts_, source.xpts_, n_*sizeof(T));
         memcpy(ypts_, source.ypts_, n_*sizeof(T));
         memcpy(y2_, source.y2_, n_*sizeof(T));
      }
      return *this;
   }
};
typedef Spline<float> FSpline;
typedef Spline<double> DSpline;

template <class T>
T Lagrange(T x, const Vector<T>& xv, const Vector<T>& yv)
{
   T y = T(0);
   Subscript n = xv.size();
   assert (n==yv.size());
   for (Subscript i=0; i<n; i++)
   {
      T p = yv[i];                        // build the i-th Lagrange term
      for (Subscript j=0; j<n; j++)
      {
         if (j != i) p *= (x-xv[j])/(xv[i]-xv[j]);
      }
      y += p;
   }
   return y;
}

template <class T>
Vector<T> Lsqfit(const Vector<T>& xv, const Vector<T>& yv, Subscript m)
{
   Subscript s = m+1, n = xv.size(), i, j, k;
   assert (n==yv.size());
   Vector<T> V(s);
   Matrix<T> X(s,s);
   for (j=0; j<s; j++)
   {
      for (k=j; k<s; k++)
      {
         T Xjk = T(0);
         for (i=0; i<n; i++)
         {
            Xjk += pow(xv[i],j+k);
         }
         X[j][k] = Xjk;
         if (k>j) X[k][j] = Xjk;
      }
   }
   for (k=0; k<s; k++)
   {
      T bk = T(0);
      for (i=0; i<n; i++)
      {
         bk += yv[i]*pow(xv[i],k);
      }
      V[k] = bk;
   }
   // solve X·a = V for the coefficients (the solver call into enmcrout.h
   // was lost from this copy of the listing)
   return V;
}
} /* namespace ENM */
#endif

The main() function contains the following lines:

 1) #include <iostream>
 2) #include <enmmisc.h>
 3) #include <enmfit.h>
 4) #include          // (header name lost from this copy)
 5) using namespace ENM;
 6) using namespace std;
 7) int main()
 8) {
 9)    format rst(cout.flags(),cout.precision(),cout.width());
10)    format xfmt(ios::right | ios::fixed, 4,8);
11)    format yfmt(ios::right | ios::fixed, 8,12);
12)    DVector xval(6,"0.5 1.0 2.0 3.0 4.0 5.0");
13)    DVector yval(6,"2.0 1.0 0.5 0.33333 0.25 0.2");
14)    DSpline spl(xval, yval);
15)    DPolynomial fit(xval, yval, 2);
16)    cout << "Second order least-squares polynomial: "
17)         << fit << endl;
18)    cout << xfmt << "x" << yfmt << "1.0/x"
19)         << yfmt << "spline"
20)         << yfmt << "Lagrange"
21)         << yfmt << "lst sq\n";
22)    for (double x=0.125; x<6.0; x+=0.125)
23)    {
24)       cout << xfmt << x << yfmt << 1.0/x
25)            << yfmt << spl(x)
26)            << yfmt << Lagrange(x,xval,yval)
27)            << yfmt << fit(x) << rst << endl;
28)    }
29) }
Program Output:

Second order least-squares polynomial: 2.27557-1.12663*x+0.146475*x^2
      x       1.0/x      spline    Lagrange      lst sq
 0.1250  8.00000000  2.79251689  3.54484183  2.13702553
 0.2500  4.00000000  2.54859074  2.92846875  2.00306245
 0.3750  2.66666667  2.28036921  2.41834839  1.87367670
 0.5000  2.00000000  2.00000000  2.00000000  1.74886831
 0.6250  1.60000000  1.71963079  1.66027343  1.62863725
 0.7500  1.33333333  1.45140926  1.38728805  1.51298354
 0.8750  1.14285714  1.20748311  1.17037178  1.40190717
 1.0000  1.00000000  1.00000000  1.00000000  1.29540814
 1.1250  0.88888889  0.83796347  0.86773453  1.19348646
   ⋮
 1.7500  0.57142857  0.52153838  0.55102596  0.75253819
 1.8750  0.53333333  0.50998417  0.52334679  0.67808057
 2.0000  0.50000000  0.50000000  0.50000000  0.60820028
 2.1250  0.47058824  0.48616306  0.47904164  0.54289735
   ⋮
 2.8750  0.34782609  0.35381495  0.35457301  0.24720392
 3.0000  0.33333333  0.33333000  0.33333000  0.21394238
 3.1250  0.32000000  0.31584963  0.31313286  0.18525819
   ⋮
 3.8750  0.25806452  0.25622086  0.24843512  0.10927723
 4.0000  0.25000000  0.25000000  0.25000000  0.11263444
 4.1250  0.24242424  0.24381731  0.25439287  0.12056899
   ⋮
 4.8750  0.20512821  0.20629039  0.23930282  0.26430050
 5.0000  0.20000000  0.20000000  0.20000000  0.30427645
 5.1250  0.19512195  0.19370961  0.13718135  0.34882974
 5.2500  0.19047619  0.18742307  0.04401986  0.39796038
 5.3750  0.18604651  0.18114422 -0.08730027  0.45166836
 5.5000  0.18181818  0.17487691 -0.26564469  0.50995369
 5.6250  0.17777778  0.16862499 -0.50098988  0.57281636
 5.7500  0.17391304  0.16239230 -0.80448427  0.64025637
 5.8750  0.17021277  0.15618269 -1.18850920  0.71227372
Comments: Considering the amount of computing that is required to accomplish these results, the main program is remarkably brief, thus illustrating the power of C++ classes and templates. Notice that the cubic spline and least-squares polynomials are created outside the for loop in lines 14 and 15 of main() and only evaluated repeatedly inside the loop. This is important, because had the function calls creating these curves been placed inside the for loop, execution time would have increased tremendously and unnecessarily. Notice also the use of the formatting variables xfmt and yfmt to format and align both the column headings and the data within the columns. It is good practice to reset the formatting to default values at the end of data output, as is done using the variable rst in the main program. Otherwise, the next data item output may have an unexpected format. The format class is discussed in Section 9.5.2 on page 84.

The numerical data printed above and the graph of these data shown below combine to illustrate a number of interesting points regarding interpolation and approximation.

•	At the six base points both the cubic spline and Lagrange interpolating polynomial match the supplied data exactly, as expected. The least-squares polynomial does not pass through any of the base points.

•	Over the interior region defined by the base points (0.5 ≤ x ≤ 5.0), both the cubic spline and the Lagrange interpolating polynomial trace the base point function rather well, the most significant variances being between 0.5 and 1.5 where the base function has the greatest curvature.

•	The least-squares curve, because it was constrained to be a parabola having constant curvature, balances the areas where it is above and below the base curve. It clearly follows the general trend of the base curve, but most people would agree that it is not a very good approximation of the base curve, overall.
•	All three of the derived curves depart significantly from the base curve once outside the base points. On the lower end, all three at least head in the right direction, though not sharply enough. However, on the upper end, the Lagrange polynomial turns sharply downward once x > 5. This behavior clearly illustrates the danger of extrapolating curve fits beyond the original base points.
Glossary

approximation: a function that is reasonably "close" to a set of base points according to some criterion.
base point: a known x-y coordinate.
extrapolation: applying an approximating function to points that lie outside the range of the supplied base points.
interpolation: an approximation that passes precisely through all of the supplied base points.
natural spline: a spline curve that has second derivative values of zero at the first and last base points.
Chapter 11 Problems 11.1 The drag force acting on a body traveling through a viscous fluid is related to several parameters, one of which is the drag coefficient. The drag coefficient varies with the shape of the body and the Reynolds number. Using the graph of drag coefficient versus Reynolds number given by http://www.ma.iup.edu/projects/CalcDEMma/drag/drag7.html or some other authoritative source, find reasonable polynomial and cubic spline curve fits for either a sphere or a cylinder over the range 0.1 < Re < 5,000,000. 11.2 The figure below shows the measured performance of a “water squeezer” energy absorber that is used in an aircraft arresting gear system. Write a single C++ program that will use various techniques to approximate the measured performance of the energy absorber, i.e., given a value of the piston displacement, y3, return an appropriate value of the drag coefficient. You should manually digitize points from the graph and store them in the program. Your program should use these base points to produce sets of calculated values of
drag coefficient for values of piston displacement ranging from 0-350 ft in 10 ft increments. Your calculated values should be output in a format suitable for plotting using Excel. You should compare the results you obtain for polynomial interpolation of various orders, least squares curve fits of various orders, and cubic splines using various numbers of base points. Your objective is to find the "best" approximation for the energy absorber. Best is defined as the method that gives the closest approximation to the graph shown below while using the fewest base points. Your report for this project should include several graphs prepared in Microsoft Excel as part of your Analysis of Results.
CHAPTER 12
Numerical Quadrature

CHAPTER OUTLINE
12.1 Introduction
12.2 Newton-Cotes Formulas
     12.2.1 Trapezoidal Rule
     12.2.2 Simpson's One-Third Rule
     12.2.3 Simpson's Three-Eighths Rule
12.3 Romberg Integration

12.1	INTRODUCTION
We use the term quadrature to refer to the process of finding the area under a curve which is known either explicitly or implicitly through a given set of x-y values known as base points. Most readers would call this process integration, but this term is sometimes used to apply to the solution of ordinary differential equations, which is presented in the next chapter. The underlying algorithms for solving these two types of problems can be similar, but are usually quite different. So to avoid confusion, we will seek to avoid the term integration whenever possible unless we are referring to the mathematical definition of the term. Numerical quadrature is often used when processing data obtained from experiments. Such data share two common characteristics: they often are obtained using equipment that measures data at fixed regular time intervals and, as with all experimental data, they are contaminated with random (and possibly nonrandom) errors. When choosing what variables to measure using such equipment, it generally is better to measure variables that can be integrated to obtain desired results than to measure variables that must be differentiated to obtain the results. For example, if the dynamic performance of an accelerating vehicle were to be studied, it would be better to mount accelerometers in the vehicle and integrate the experimental data to obtain velocity and displacement than to measure the displacement versus time and differentiate to obtain velocity and acceleration. The qualitative reason for this decision is that differentiation tends to amplify high frequency noise (from radio frequency electromagnetic interference, etc.) and integration tends to amplify low frequency noise (instrument parameter drift, weather changes, etc.). Experimental data taken over relatively short time periods tend to have more high frequency noise. Integration (and numerical quadrature) tends to "smooth" out this high frequency noise.
12.2	NEWTON-COTES FORMULAS
The Newton-Cotes family of formulas is derived from interpolating polynomials passed through base points having uniform spacing h. Thus, they are well-suited to the situation described in the introduction. They also are easy to derive and implement. Technically, there are an infinite number of Newton-Cotes integration formulas. However, since these formulas are based on interpolating polynomials, the same caveats presented in Section 11.2.2 are applicable here. In particular, it is important to keep the degree of the interpolating polynomial low. The presence of even small amounts of random error in the raw data tends to cause interpolating polynomials of higher degree to depart wildly from the "eyeball" curve one would expect to approximate the data. When such polynomials are integrated, the resulting areas may not be remotely close to any reasonable answer. For this reason, it is always better to use the composite form of one of the Newton-Cotes formulas derived in the following sections or a least-squares function approximation when integrating over more than three or four base points.

12.2.1 Trapezoidal Rule

The trapezoidal rule is the simplest practical Newton-Cotes integration formula. Its name comes from the resulting shape of the area that it computes—a trapezoid. The formula for the trapezoidal rule can be derived easily by examining Figure 12.1. The curved line represents the shape of the actual function between points (x0,y0) and (x1,y1). The straight line between the same two points is the first-degree interpolating polynomial that will be integrated to estimate the area under the curve. By simple geometry it is easy to see that the area under the straight line is a trapezoid consisting of a rectangle having width h and height y0 and a right triangle having base h and height (y1 − y0). Thus, the area of the trapezoid consists of the sum of the areas of the rectangle and the triangle:
	A_t = h y_0 + (h/2)(y_1 − y_0) = (h/2)(y_0 + y_1)        (12.1)
Clearly, if the base curve is not a straight line, then the estimate of the area under the curve as calculated by the trapezoidal rule will be incorrect. If the curve is convex upward (as in Figure 12.1), then the calculated area will be too small because some of the area will be missed. Similarly, if the curve is convex downward, the calculated area will be too large. This is an example of truncation error in action! It is obvious that if the approximating curve were a higher degree polynomial (meaning more terms of the underlying Taylor series were included), the approximation would be much better. While no exact value of the truncation error can be provided for all cases, a more detailed analysis of this problem can be performed, although this development is omitted here. Suffice it to say, all the general rules regarding the control of truncation error given in Section 7.2.1 apply to numerical quadrature using the trapezoidal rule. For example, Figure 12.1 suggests that the base point spacing, h, being used for this curve is too large. If a smaller h were used, then the curve would more closely approach a straight line, reducing the amount of area unaccounted for using the straight line approximation. However, if h is made smaller, then more applications of the trapezoidal rule must be used to account for the total area. In some extreme cases, this can cause the roundoff error associated with repeated calculations to affect the answer. This phenomenon is demonstrated in Example 12.1 on page 145.
FIGURE 12.1 Trapezoidal Rule
Since numerical quadrature is normally applied over a series of base points, it is worthwhile developing a so-called composite trapezoidal rule that can be applied directly to an array of base points having uniform spacing as shown in Figure 12.2. There are n+1 base points (y_0 to y_n) connected by straight lines to form n trapezoids (A_0 to A_{n−1}). The area under the entire curve defined by the base points can be approximated by summing the areas of the n trapezoids as follows:

	A = Σ_{i=0}^{n−1} A_i

	A = (h/2)(y_0 + y_1) + (h/2)(y_1 + y_2) + … + (h/2)(y_{n−2} + y_{n−1}) + (h/2)(y_{n−1} + y_n)

	A = (h/2)(y_0 + 2y_1 + 2y_2 + … + 2y_{n−2} + 2y_{n−1} + y_n)

	A = (h/2)( y_0 + y_n + 2 Σ_{i=1}^{n−1} y_i )        (12.2)
FIGURE 12.2 Composite Trapezoidal Rule
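Equation 12.2 translates almost directly into code. The sketch below is illustrative only: it uses std::vector in place of the book's ENM::Vector class, and the function name trap follows the listing in Example 12.1.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Composite trapezoidal rule, Equation 12.2: end ordinates get weight 1/2,
// interior ordinates weight 1.  y holds the n+1 equally spaced ordinates
// and h is the strip width.
double trap(const std::vector<double>& y, double h)
{
    std::size_t npts = y.size();
    assert(npts > 1);
    double sum = 0.0;
    for (std::size_t i = 1; i + 1 < npts; ++i)
        sum += y[i];                 // interior ordinates
    return h * (0.5 * (y.front() + y.back()) + sum);
}
```

Integrating f(x) = x^2 on [0,1] with 100 strips gives 0.333350, illustrating the h^2-order truncation error relative to the exact value 1/3.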
12.2.2 Simpson's One-Third Rule

As stated in the previous section, the accuracy of the integration can be improved if the underlying functional approximation of the base curve is improved. Simpson's one-third rule is the next logical step in the family of Newton-Cotes integration formulas. Simpson's one-third rule (which is usually called simply Simpson's rule) passes a parabola through three adjacent base points and integrates this parabola to approximate the area under the base curve as shown in Figure 12.3. To find the area under the parabola, assume the axes are shifted so that the first base point has coordinates (0,y_0). The second and third base points will have coordinates (h,y_1) and (2h,y_2), respectively. If a single parabola is to pass through all three points, then the following boundary conditions apply:

	y_0 = a·0^2 + b·0 + c
	y_1 = a·h^2 + b·h + c
	y_2 = a·(2h)^2 + b·(2h) + c

The first boundary condition establishes that c = y_0. Substituting for c into the other two boundary conditions and solving for the other two unknowns using Cramer's rule yields
	a = [ (y_1 − y_0)(2h) − (y_2 − y_0)(h) ] / [ (h^2)(2h) − (4h^2)(h) ] = (1/(2h^2)) (y_0 − 2y_1 + y_2)
FIGURE 12.3 Simpson’s Rule
	b = [ (h^2)(y_2 − y_0) − (4h^2)(y_1 − y_0) ] / [ (h^2)(2h) − (4h^2)(h) ] = (1/(2h)) (−3y_0 + 4y_1 − y_2)
The area under the parabola can be found by integrating the function from zero to 2h.
A= =
∫
2h
0
(ax 2 + bx + c)dx
a ( 2h ) 3 b ( 2h) 2 + + c( 2 h ) 3 2
Substituting for the coefficients a, b, and c, the final form of Simpson’s one-third rule is found to be
A=
h ( y0 + 4 y1 + y2 ) 3
(12.3)
Equation 12.3 gives the approximate area under a curve passing through the three given base points. This area comprises two strips, each of width h. We will call the single application of a Newton-Cotes integration formula over the appropriate number of strips an integration panel. For Simpson's one-third rule, a panel consists of two strips, whereas for the trapezoidal rule, a panel consists of a single strip. Simpson's one-third rule can be applied in a composite form when more than three base points are given. Proceeding as was done for the trapezoidal rule, the area under the curve shown in Figure 12.4 can be found by summing the area under each panel. Since each panel consists of two strips, there are only n/2 panels and the number of strips must be even. This implies that Simpson's one-third rule can be applied only when an odd number of base points are given. Simpson's composite one-third rule can be developed by summing the areas of the n/2 individual panels as follows:
	A = Σ_{i=1}^{n/2} A_i

	A = (h/3)(y_0 + 4y_1 + y_2) + (h/3)(y_2 + 4y_3 + y_4) + (h/3)(y_4 + 4y_5 + y_6) + … + (h/3)(y_{n−2} + 4y_{n−1} + y_n)

	A = (h/3)( y_0 + y_n + 2 Σ_{i=2, i even}^{n−2} y_i + 4 Σ_{j=1, j odd}^{n−1} y_j )        (12.4)
Equation 12.4 is considerably more complicated than Equation 12.2, the summation being divided over even and odd points. This also makes for slightly longer execution times, but Example 12.1 will demonstrate the clear superiority of Simpson's one-third rule over the trapezoidal rule, which makes it the favorite for most engineering quadrature applications.
FIGURE 12.4 Composite Simpson’s One-Third Rule
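Equation 12.4 can be sketched as follows, again using std::vector rather than the book's ENM::Vector class as an illustrative simplification:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Composite Simpson's one-third rule, Equation 12.4.  y holds n+1 equally
// spaced ordinates; the number of strips n must be even (an odd number of
// base points).
double simp(const std::vector<double>& y, double h)
{
    std::size_t n = y.size() - 1;            // number of strips
    assert(n >= 2 && n % 2 == 0);
    double odd = 0.0, even = 0.0;
    for (std::size_t j = 1; j < n; j += 2)   // odd-index ordinates, weight 4
        odd += y[j];
    for (std::size_t i = 2; i < n; i += 2)   // interior even-index ordinates, weight 2
        even += y[i];
    return (h / 3.0) * (y[0] + y[n] + 2.0 * even + 4.0 * odd);
}
```

Because the underlying parabola integrates cubics exactly, a single two-strip panel already reproduces the integral of x^3 over [0,1] (1/4) with no truncation error.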
12.2.3 Simpson's Three-Eighths Rule

The next higher precision Newton-Cotes quadrature formula is Simpson's three-eighths rule. Following procedures analogous to those of the previous section (but involving considerably more algebra), Simpson's three-eighths rule can be derived from a cubic polynomial passing through four adjacent equally-spaced base points and has the form

	A = (3h/8)(y_0 + 3y_1 + 3y_2 + y_3)        (12.5)

A corresponding composite form also can be derived following the methods presented earlier. See the implementation of Simpson's three-eighths rule in Example 12.1. Somewhat surprisingly, the three-eighths form of Simpson's rule performs no better than the one-third rule. A more detailed derivation of each method reveals that they both have the same order of truncation error, O(h^4). For this reason, the simpler one-third rule is always used in preference to the three-eighths rule. The only practical reason to use Simpson's three-eighths rule is when an even number of base points are to be integrated. In this case, Simpson's one-third rule should be used for base points 0 ≤ i ≤ n−3, followed by a single application of Simpson's three-eighths rule for points n−3 through n.

O EXAMPLE 12.1 Newton-Cotes Quadrature Formulas

This example explores some of the characteristics of the composite forms of the trapezoidal rule, Simpson's 1/3 rule, and Simpson's 3/8 rule when applied to the task of computing the integral of a function expressed as an array of equally-spaced base points. For the purposes of this example, several different computations of

	∫_0^{2π} e^{−x} cos(x) dx ≈ 0.499066278634

will be made using the three named methods with different strip widths. The implementation of the composite quadrature formulas is given in file enmquad.h, which is shown below.
#ifndef ENM_QUAD_H
#define ENM_QUAD_H
#include <enmvec.h>
namespace ENM {

template <class T>
T trap(const Vector<T>& y, T h)
{
   Subscript n = y.size();
   assert (n>0);
   T sum = T(0);
   for (Subscript i=1; i<n-1; i++)
   {
      sum += y[i];                   // interior ordinates, weight 1
   }
   return h*(0.5*(y[0]+y[n-1])+sum); // end ordinates, weight 1/2
}

template <class T>
T simp(const Vector<T>& y, T h)
{
   Subscript n = y.size();
   assert (n%2 == 1);
   T sume = T(0), sumo = T(0);
   for (Subscript i=1; i<n-1; i++)
   {
      if (i%2) sumo += y[i];         // odd-index ordinates, weight 4
      else sume += y[i];             // interior even-index ordinates, weight 2
   }
   return h/3.0*(y[0]+y[n-1]+4.0*sumo+2.0*sume);
}

template <class T>
T simp38(const Vector<T>& y, T h)
{
   Subscript n = y.size();
   assert ((n-1)%3 == 0);
   T sum12 = T(0), sum3 = T(0);
   for (Subscript i=1; i<n-1; i++)
   {
      if (i%3 == 0) sum3 += y[i];    // ordinates shared by two panels, weight 2
      else sum12 += y[i];            // interior panel ordinates, weight 3
   }
   return 3.0*h/8.0*(y[0]+y[n-1]+3.0*sum12+2.0*sum3);
}

template <class T>
T Romberg(T f(T), T a, T b, T eps=1.0e-6)
{
   const int MAXSTEP = 20;   // maximum number of iterations
   const int K = 5;          // order of extrapolation
   assert( eps > 0 );
   T c[MAXSTEP];
   c[0] = 0.5*(b - a)*(f(a) + f(b));
   int m;
   T p = 1;
   for (m = 1; m < MAXSTEP; ++m)
   {
      T sum = 0.0;
      register int k;
      T dx = (b - a)/p;
      // trapezoid rule refinement
      T x = a + 0.5*dx;
      for (k = 0; k < p; ++k, x += dx)
      {
         sum += f(x);
      }
      c[m] = 0.5*(c[m-1] + dx*sum);
      // polynomial extrapolation (degree 5)
      T d = 1;
      for (k = m - 1; k >= 0 && k >= m + 1 - K; --k)
      {
         d *= 4;
         c[k] = (d*c[k+1] - c[k])/(d - 1);
      }
      if (m >= K - 1)
      {
         T updated = c[m+1-K];
         if (abs(updated - c[m+2-K]) <= eps*abs(updated))
            return updated;
      }
      p *= 2;
   }
   // couldn't reach desired precision
   return c[MAXSTEP-K];
}
} /* namespace enm */
#endif

The main() function contains the following lines:

#include <iostream>
#include <cmath>
#include <enmcpp.h>
#include <enmvec.h>
#include <enmmath.h>
#include <enmquad.h>
#include <enmmisc.h>
using namespace std;
using namespace ENM;
const double trueAns = 0.499066278634;
double fun(double x)
{
   return exp(-x)*cos(x);
}
int main()
{
   Vector<int> N(14,"6 12 18 24 60 120 240 480 720 "
                    "1440 4320 8640 25920 51840");
   format nfmt(ios::right, 6, 6),
          efmt(ios::right | ios::scientific, 3, 12),
          qfmt(ios::right | ios::showpos | ios::fixed, 12, 16),
          rst(cout.flags(),cout.precision(),cout.width());
   cout << nfmt << "n" << qfmt << "Trapezoidal"
        << efmt << "Error"
        << qfmt << "Simpson 1/3"
        << efmt << "Error"
        << qfmt << "Simpson 3/8"
        << efmt << "Error" << rst << endl;
   for (Subscript j=0; j<N.size(); j++)
   {
      int n = N[j];
      Vector<double> y(n+1);
      double xl = 0.0, xu = 2.0 * PI, dx = (xu - xl) / n;
      for (int i=0; i<=n; i++)
      {
         y[i] = fun(xl + i*dx);
      }
      double qtrap = trap(y, dx);
      double qtrapErr = qtrap-trueAns;
      double qsimp = simp(y, dx);
      double qsimpErr = qsimp-trueAns;
      double qsimp38 = simp38(y, dx);
      double qsimp38Err = qsimp38-trueAns;
      cout << nfmt << n << qfmt << qtrap << efmt << qtrapErr
           << qfmt << qsimp << efmt << qsimpErr
           << qfmt << qsimp38 << efmt << qsimp38Err << endl;
   }
}
Program Output:

     n     Trapezoidal        Error     Simpson 1/3        Error     Simpson 3/8        Error
     6 +0.593432007152   9.437e-002 +0.489804029540  -9.262e-003 +0.487864861693  -1.120e-002
    12 +0.522075555221   2.301e-002 +0.498290071244  -7.762e-004 +0.497461926447  -1.604e-003
    18 +0.509242154259   1.018e-002 +0.498906519492  -1.598e-004 +0.498718422647  -3.479e-004
    24 +0.504780171034   5.714e-003 +0.499015042972  -5.124e-005 +0.498952999619  -1.133e-004
    60 +0.499978757579   9.125e-004 +0.499064948418  -1.330e-006 +0.499063293543  -2.985e-006
   120 +0.499294335894   2.281e-004 +0.499066195332  -8.330e-008 +0.499066091327  -1.873e-007
   240 +0.499123289042   5.701e-005 +0.499066273425  -5.209e-009 +0.499066266916  -1.172e-008
   480 +0.499080530992   1.425e-005 +0.499066278309  -3.255e-010 +0.499066277902  -7.324e-010
   720 +0.499072612995   6.334e-006 +0.499066278570  -6.417e-011 +0.499066278489  -1.446e-010
  1440 +0.499067862221   1.584e-006 +0.499066278630  -3.875e-012 +0.499066278625  -8.898e-012
  4320 +0.499066454588   1.760e-007 +0.499066278634   9.781e-014 +0.499066278634   3.481e-014
  8640 +0.499066322623   4.399e-008 +0.499066278634   1.414e-013 +0.499066278634   1.395e-013
 25920 +0.499066283522   4.888e-009 +0.499066278634   1.489e-013 +0.499066278634   1.462e-013
 51840 +0.499066279856   1.222e-009 +0.499066278634   1.537e-013 +0.499066278634   1.485e-013
Comments: All three quadrature functions approach the correct answer as the number of strips increases. It is clear from the numerical data that the two Simpson's rules attain high accuracy using fewer strips, and thus less compute time. It can be shown analytically that for the same value of n, Simpson's 3/8 rule should have an error that is 2.25 times as large as that of Simpson's 1/3 rule. A few quick calculations from the numbers above will show that this is generally true in the middle half of the tabulated numbers. (This may be troubling at first glance. Shouldn't the formula that uses a higher degree polynomial have lower truncation error? Not really. For the same strip width, Simpson's 3/8 rule integrates over 50% more distance with each application than Simpson's 1/3 rule, so it computes the integral over n strips using only 2/3 as many applications. If the same number of applications of each rule are used to integrate a function, then Simpson's 3/8 rule will have lower truncation error.) To gain more insight, consult the graph of the absolute value of the error versus number of strips shown on the next page. Note that both axes are plotted on a logarithmic scale to highlight the relationships. All three curves appear to be linear until n approaches 10,000. Then the two Simpson's rule curves flatten out, while the trapezoidal rule curve continues to decline at a steady pace. Furthermore, the two Simpson's rule curves appear to decline at the same rate, which is much steeper than that of the trapezoidal rule curve. The linear portions of the curves exhibit classical truncation error behavior. A linear curve on a log-log plot suggests a power relationship. Indeed, it can be shown
Elementary Numerical Methods and C++ # 149
analytically that the truncation error of the trapezoidal rule decreases inversely with the square of the number of strips, while both Simpson's rules have truncation error that decreases inversely with the fourth power of the number of strips. But what happens to the Simpson's rules as the number of strips nears 10,000? At this point, the truncation error has approached the double precision resolution limit (machine epsilon) and roundoff error begins to dominate. There is no evidence of significant roundoff error effects in the trapezoidal rule results because the truncation error is still larger than the roundoff error. If still more strips were incorporated, eventually the trapezoidal error curve would also begin to deviate from its constant slope as roundoff error began to dominate.
O
12.3 ROMBERG INTEGRATION
As seen in the previous example, the accuracy of Newton-Cotes quadrature calculations is affected by both truncation error and roundoff error. For most engineering calculations the truncation error tends to dominate. The previous example shows that truncation error can be reduced by simply reducing the spacing between the base points, but this approach is not particularly effective for the trapezoidal rule. Romberg quadrature makes use of the fact that when the same function is integrated over the same limits using the same Newton-Cotes quadrature formula, varying only the number of base points used, the truncation error associated with each successive integration is likely to behave in a very orderly way, as dictated by the terms of the Taylor series that were truncated when deriving the quadrature formula. Thus, while the exact value of the truncation error for any strip width is not known in general, by making some reasonable assumptions about the behavior of the higher-order terms of the Taylor series, we can predict with reasonable accuracy how the truncation error will
behave when we compute the area under a curve using a Newton-Cotes formula with a given strip width, h1, as compared to the truncation error we will have when we use the same formula with a different strip width, h2. The graph shown in the previous example clearly demonstrates the reasonableness of this line of thinking. Richardson's extrapolation is a clever application of this logic which combines two applications of the same Newton-Cotes quadrature formula using different base point spacings to produce an improved estimate of the value of the integral that is considerably more accurate than either of the two original calculations. Furthermore, this process can be continued over subsequent iterations to quickly obtain an estimate of the integral that is essentially free of truncation error to almost any desired degree. When Richardson's extrapolation technique is applied to the trapezoidal rule, the method is called Romberg integration. Romberg's method works by successively applying the composite trapezoidal rule to sets of base points that are twice as dense as the previous set (i.e., h2 = 0.5h1), then combining those results to form estimates of the integral that are progressively more accurate. The particular choice of halving the base point spacing each iteration offers additional efficiency: because half of the base points needed at the finer spacing have already been calculated in the previous step, the additional cost in time to compute the new trapezoidal rule estimate is only half of what you would expect it to be. The application of Romberg integration requires creating a triangular tableau in conjunction with application of the composite trapezoidal rule as follows. Begin by using the trapezoidal rule, considering the entire range of integration as a single strip. Call this value I_{0,0}. Recompute the area, this time using two strips. Call this value I_{1,0}.
We could continue doubling the number of strips using the composite form of trapezoidal rule to get answers that have progressively improved truncation error, but this process is relatively time consuming and relatively slow to converge. It can be shown that if we have two values of the integral as calculated by the trapezoidal rule such that one value uses twice as many strips as the other value, they can be combined to form a more accurate value using the equation
\[ I_{i,1} = \frac{4\, I_{i,0} - I_{i-1,0}}{3} \, . \qquad (12.6) \]
This step requires no additional evaluations of the integrand, so it takes virtually no additional time to compute a column of significantly improved estimates of the integral. Once this column has been completed, a second column can be calculated using the formula
\[ I_{i,2} = \frac{16\, I_{i,1} - I_{i-1,1}}{15} \qquad (12.7) \]
The values in this column can be shown to be much more accurate than those in the previous column. Given enough values that are computed by the composite trapezoidal rule in the first column, subsequent columns can be created with very little additional calculation and the accuracy of the answers improves dramatically as the calculations move from column to column to the right and down from row to row. The general expression for calculating values in a new column, k, is given by
\[ I_{i,k} = \frac{4^{k}\, I_{i,k-1} - I_{i-1,k-1}}{4^{k} - 1} \qquad (12.8) \]
The particular implementation of Romberg's method shown in the enmquad.h listing in the previous example is an adaptation of the method given in the Numerical Recipes references shown in the Bibliography. It features an automatic in-process error estimate that halts the method when the specified error tolerance is reached. The
effectiveness of Romberg quadrature is illustrated in the following example.

O EXAMPLE 12.2 Romberg Quadrature

This example demonstrates how Romberg's method can usually obtain estimates of the integral of a function faster than the standard Newton-Cotes methods do. The C++ code below solves the same problem given in the previous example for different values of the user-specified accuracy requirement. The function F(x) defines the integrand. The implementation shown here includes a counter that increments each time the function is called. This count roughly corresponds to the cost of computing the integral, as the calculations done inside function Romberg() are relatively trivial. The main() driver function simply sets up the arguments for function Romberg(), computes the actual error of the approximation, and clears the cnt variable to prepare for the next computation of the integral. To see the body of the Romberg() function, see the listing of enmquad.h in the previous example.

#include <iostream>
#include <cmath>
#include <enmquad.h>
#include <enmmisc.h>

using namespace std;
using namespace ENM;

int cnt = 0;

double F(double x)
{
    ++cnt;
    return exp(-x)*cos(x);
}

int main(void)
{
    const double trueAns = 0.499066278634;
    format nfmt(ios::right, 3, 6),
           efmt(ios::right | ios::scientific, 3, 12),
           qfmt(ios::right | ios::showpos | ios::fixed, 12, 16),
           rst(cout.flags(),cout.precision(),cout.width());
    cout << nfmt << "eps" << nfmt << "cnt"
         << qfmt << "Romberg Value" << efmt << "Error"
         << rst << endl;

    double eps = 1.0e-3;
    double RombVal = Romberg(F,0.0,2.0*PI,eps);
    double errVal = RombVal - trueAns;
    cout << nfmt << eps << nfmt << cnt << qfmt << RombVal
         << efmt << errVal << endl;

    cnt = 0;
    eps = 1.0e-6;
    RombVal = Romberg(F,0.0,2.0*PI,eps);
    errVal = RombVal - trueAns;
    cout << nfmt << eps << nfmt << cnt << qfmt << RombVal
         << efmt << errVal << endl;

    cnt = 0;
    eps = 1.0e-9;
    RombVal = Romberg(F,0.0,2.0*PI,eps);
    errVal = RombVal - trueAns;
    cout << nfmt << eps << nfmt << cnt << qfmt << RombVal
         << efmt << errVal << endl;

    cnt = 0;
    eps = 1.0e-13;
    RombVal = Romberg(F,0.0,2.0*PI,eps);
    errVal = RombVal - trueAns;
    cout << nfmt << eps << nfmt << cnt << qfmt << RombVal
         << efmt << errVal << endl;
}
Program Output:
  eps   cnt   Romberg Value        Error
0.001    17   +0.499076273249   9.995e-06
1e-06    33   +0.499066315529   3.690e-08
1e-09    65   +0.499066278667   3.335e-11
1e-13   257   +0.499066278634   1.458e-13
Comments: The Romberg algorithm is often presented in a triangular tableau, the initial rows and columns of which are shown below. You should be able to verify the numbers in the tableau using the composite trapezoidal rule and the formulas given above.

3.147459397956727
1.437969170828067   0.868139095118513
0.718984585414033   0.479323056942689   0.453401987730967
0.551397542353895   0.495535194667183   0.496616003848816   0.497301940612591
Comparing the results of this program to those of the previous example reveals the dramatic efficiency of the Romberg method. For the first case, the requested accuracy was 1.0e-3, but the calculation using Romberg's method achieved an actual accuracy better than 1.0e-5, and it did so with only 17 function evaluations. To get comparable accuracy from the standard trapezoidal rule in the previous example required between 480 and 720 function evaluations. Similarly, for a requested accuracy of 1.0e-6, the actual accuracy was 3.69e-8 and required only 33 function evaluations. Using the regular trapezoidal rule to obtain similar accuracy would have required nearly 26,000 function evaluations. Note that Romberg integration also handily outperforms Simpson's one-third rule: to get about the same accuracy that Romberg obtained with 33 function evaluations, Simpson's one-third rule would have required over 120. This trend continues as the requested accuracy is tightened. Eventually, we see that the standard trapezoidal rule could not achieve with 51,840 function evaluations the accuracy that Romberg's method achieved with only 65. Such outstanding performance makes Romberg's method typically the first choice for numerical quadrature. Professional implementations of the algorithm even include the ability to handle discontinuities in the integrand.

O
Glossary

base points: points having known coordinates which control the shape of the approximating function.

composite quadrature formula: the formula that results from repeated application of a fundamental quadrature formula over successive series of base points.

integration: a general term applied to either quadrature or ordinary differential equation solving.

quadrature: the process of finding the approximate area under a curve described by a given set of base points.
Chapter 12 Problems

12.1 The Error Function, erf(x), is an example of an integral function, i.e., it is defined in terms of an integral. It often appears in the analytical solution of heat transfer problems. The definition of the function is given by

\[ \operatorname{erf}(x) = \frac{2}{\sqrt{\pi}} \int_0^x e^{-z^2}\, dz \]
Several good approximations exist for this function in order to avoid having to evaluate the integral. One simple approximation is given by
\[ \operatorname{erf}(x) \approx 1.0 - 1.5577\, e^{-0.7182\,(x + 0.7856)^2} \]
Investigate the accuracy of these two formulas, using numerical quadrature to compute the integral in the first method. You may wish to try various quadrature methods using different base point spacings. Compare your answers to those tabulated in an authoritative reference. Do not compare your calculated values to those produced by the erf(x) function available in the C++ library. (Why?)

12.2 A rapid robot arm with a laser is used to measure the diameter of six holes on a rectangular plate 12" x 8". The coordinates of the centers of the six holes are (2,7.2), (4.5,7.1), (5.25,6), (7.81,5), (9.2,3.5), (10.6,5). The path of the robot going from one point to another needs to be smooth to avoid sharp jerks in the arm that can cause premature wear and tear in the robot arm bearings. One suggestion is to fit a fifth degree polynomial through the six points. Another suggestion is to fit a cubic spline through the points. Which suggestion is better with respect to the length of the path the end of the robot arm will have to travel? (HINT:
\[ L = \int_a^b \sqrt{1 + \left(\frac{dy}{dx}\right)^{2}}\; dx. \] )
CHAPTER 13
Ordinary Differential Equations

CHAPTER OUTLINE
13.1 Introduction
13.2 First Order ODEs
13.3 Working With Higher Order ODEs
13.4 Initial Condition Problems
13.4.1 Taylor Series Expansion
13.4.2 Euler's Method
13.4.3 Runge-Kutta Methods
13.5 Boundary Value Problems

13.1 INTRODUCTION
Many phenomena in nature are commonly modeled using ordinary differential equations (ODEs). First order ODEs accurately model the depth of liquid in a tank discharging through an orifice, the temperature of a hot copper sphere after it is dropped into a large tank of oil or water, and myriad other common situations. Second order ODEs are used to describe spring-mass-damper systems such as an automobile suspension, R-L-C circuits, and projectile problems, among many others. In all of these cases, the problem is expressed in terms of the derivative of one or more dependent variables (e.g., the vertical acceleration of the wheel and the vertical acceleration of the chassis in the case of an automobile suspension system) with respect to a single independent variable, such as time. The solution to the problem consists of the behavior of the dependent variables themselves (i.e., zeroth derivatives) versus time. For a relatively small percentage of real-world problems, analytical methods may yield a closedform or series solution to the problem. However, in the vast majority of cases where the parameters vary nonlinearly and cross-coupling exists between dependent variables, analytical methods fail. In these cases, a numerical solution using the methods described in this chapter or other methods can be used to obtain a numerical solution. When available, the analytical solution is to be desired, for it reveals the relationships among the parameters to the trained eye. For example, the solution to the
second order ODE \( \frac{d^2 x}{dt^2} + \frac{k}{m}\, x = 0 \) is \( x(t) = A \sin\!\left( \sqrt{k/m}\; t + \phi \right) \), where A and φ are arbitrary
constants that are determined by the initial conditions. By inspecting the solution, it can be determined that increasing the spring stiffness k by a factor of four will double the frequency of oscillation if all other parameters are held constant. A numerical solution obtained by any method provides no opportunity for such insight. In order to infer this behavior, several runs of the computer program would have to be made using different values of k. Even after doing this, one would have to be very careful about generalizing the behavior of the solution if the governing equation(s) were nonlinear. To further compound the problem, we shall see that numerical solution methods tend to be computationally intensive, resulting in long solution times for complicated problems. In spite of these difficulties, numerical solutions of ordinary differential equations are used daily in fields ranging from orbital mechanics to pharmacokinetics.
13.2 FIRST ORDER ODES
The classic statement of the first order ODE is given as
\[ \frac{dy}{dx} = f(x, y), \qquad y(x_0) = y_0 \qquad (13.1) \]
The solution to this problem will be an expression for y(x), either analytical or numerical. At first glance it may appear that the methods of the previous chapter could be employed to obtain a numerical solution. After all, the solution is just the integral of the right-hand side. And indeed, if the right-hand side is a function of the independent variable x alone, then Simpson's one-third rule or some other quadrature formula can be used to obtain a solution. But if, as Equation 13.1 indicates, the right-hand side is a function of both the independent and the dependent variables, then quadrature formulas cannot be used, because the value of the right-hand side cannot be calculated until the value of the dependent variable is known, and the value of the dependent variable is the integral of the right-hand side. The methods of this chapter address this case for a single first order ODE or a set of several simultaneous first order ODEs.
13.3 WORKING WITH HIGHER ORDER ODES
Only relatively few interesting engineering problems can be modeled using a single first order ODE as given in Equation 13.1. The more common situation is to be given a set of simultaneous higher order ODEs. The first step in solving such systems is to convert the given equations into an equivalent set of simultaneous first order ODEs, for which the numerical solution methods given in the following sections are directly applicable. This conversion is undertaken apart from any programming language considerations, and must be completed before writing any code to solve the problem. This process of variable mapping is quite straightforward, yet often confusing to the beginner. It is outlined in the steps below and illustrated in Example 13.1.

• Given a set of m simultaneous higher order ODEs, determine the highest order derivative of each dependent variable. The number of simultaneous first order equations needed, n, is the sum of the highest orders of the respective dependent variables.

• Create m auxiliary variables—call them, say qi, representing the zeroth derivative of each of the dependent variables.

• Create new auxiliary variables by differentiating each of the first m auxiliary variables an appropriate number of times. And just what is meant by the phrase “an appropriate number of times”? Generally, it means one less than the highest order derivative of the dependent variable that appears in the original equations. For example, if the highest derivative of dependent variable z appearing in the original equations is d3z/dt3, then you must create two additional auxiliary variables, one representing dz/dt and one representing d2z/dt2. Thus, if the dependent variable had been mapped as auxiliary variable four (q4) in the previous step, then two additional auxiliary variables, say numbers 8 and 9, would be defined as q8 = dq4/dt and q9 = dq8/dt. Note that these two auxiliary variable definitions are themselves first order ODEs which will become part of the set of simultaneous first order ODEs that will be solved.

• Once the preceding step has been completed for all of the original dependent variables, the total complement of n auxiliary variables will have been defined. The first m of these will be the simple name substitutions assigned in the second step. The remaining n − m auxiliary variables will be defined in terms of first derivatives of other auxiliary variables as described in the previous step. The final step is to substitute the auxiliary variables into the original set of higher order ODEs. If the procedure is followed properly, there should be one term in each equation for which no auxiliary variable substitution is available. This term will be the highest derivative of each dependent variable. In each case, substitute the first derivative of an appropriate auxiliary variable, thus reducing the m original equations to first order. This will complete the set of n simultaneous first order ODEs involving the n auxiliary variables.

Again, the process sounds complicated when described in words, but it actually becomes rather simple to apply with a little practice, as shown in Example 13.1.
O EXAMPLE 13.1 Converting Higher Order ODEs to Simultaneous First Order ODEs

Given the following set of four simultaneous higher order ODEs, develop an equivalent set of simultaneous first order ODEs.

\[ \frac{d^3 w}{dt^3} + \frac{dx}{dt}\, z = 0 \]
\[ \frac{d^2 x}{dt^2} + \frac{d^2 w}{dt^2} \frac{dy}{dt} = 0 \]
\[ \frac{d^2 y}{dt^2} + \frac{dw}{dt}\, x z = 0 \]
\[ \frac{dw}{dt} + \frac{dx}{dt} + \frac{dy}{dt} + \frac{dz}{dt} = 0 \]

Since there are four dependent variables, m = 4, and the first four auxiliary variables are assigned as name aliases for the four dependent variables:

\[ q_0 = w, \quad q_1 = x, \quad q_2 = y, \quad q_3 = z \]
The highest order derivatives of the dependent variables are seen to be

\[ \frac{d^3 w}{dt^3}, \quad \frac{d^2 x}{dt^2}, \quad \frac{d^2 y}{dt^2}, \quad \text{and} \quad \frac{dz}{dt} \, ; \]

thus, the number of auxiliary variables and corresponding simultaneous first order equations, n, is 3 + 2 + 2 + 1 = 8. Since the highest derivative of dependent variable w is of order three, two additional auxiliary variables must be defined as follows:

\[ \frac{dq_0}{dt} = q_4, \qquad \frac{dq_4}{dt} = q_5 \, . \]

In a similar manner, since the highest derivatives of both x and y are of order two, one additional auxiliary variable must be defined for each of these dependent variables as follows:

\[ \frac{dq_1}{dt} = q_6, \qquad \frac{dq_2}{dt} = q_7 \, . \]

This completes the definition of all eight auxiliary variables and has defined four simultaneous first order ODEs in the process. Since the highest derivative of dependent variable z is of order one, no additional auxiliary variables need to be defined. The remaining four first order ODEs are obtained by substituting the auxiliary variables into the original ODEs. When doing this step, it is useful to construct an “alias” table that provides an easy reference for translating dependent variables and their derivatives into auxiliary variables. An alias table for this set of equations might be written as follows:

When the term in the          Substitute this
original equation is          auxiliary variable
w                             q0
x                             q1
y                             q2
z                             q3
dw/dt                         q4
d2w/dt2                       q5
dx/dt                         q6
dy/dt                         q7
Applying this alias table, the final four first order ODEs are found to be

\[ \frac{dq_5}{dt} + q_6 q_3 = 0 \]
\[ \frac{dq_6}{dt} + q_5 q_7 = 0 \]
\[ \frac{dq_7}{dt} + q_4 q_1 q_3 = 0 \]
\[ \frac{dq_3}{dt} + q_4 + q_6 + q_7 = 0 \, . \]
O
13.4 INITIAL CONDITION PROBLEMS
A set of ODEs can be solved only if the boundary conditions are known. There must be as many boundary conditions as the combined order of the given equations (or, in light of the previous section, as many as the number of equivalent simultaneous first order ODEs). If all of the boundary conditions are given at the same value of the independent variable (usually, but not necessarily, zero), then the boundary conditions are called initial conditions, and the combination of the ODEs and the initial conditions forms what is called an initial condition problem. If even one of the boundary conditions is specified at a value of the independent variable different from the others, the problem becomes a boundary value problem and cannot be solved directly using the methods developed below. Solution strategies for boundary value problems are discussed in Section 13.5.

13.4.1 Taylor Series Expansion

One possible approach to solving the first order ordinary differential equation given in Equation 13.1 is to apply the ubiquitous Taylor series, which is commonly expressed as

\[ y(x_0 + \Delta x) = y(x_0) + \Delta x \left. \frac{dy(x)}{dx} \right|_{x = x_0} + \frac{\Delta x^2}{2!} \left. \frac{d^2 y(x)}{dx^2} \right|_{x = x_0} + \frac{\Delta x^3}{3!} \left. \frac{d^3 y(x)}{dx^3} \right|_{x = x_0} + \cdots \qquad (13.2) \]
Substituting from Equation 13.1 for dy(x)/dx gives
\[ y(x_0 + \Delta x) = y(x_0) + \Delta x\, f\big(x_0, y(x_0)\big) + \frac{\Delta x^2}{2!} \left. \frac{df}{dx} \right|_{x = x_0} + \frac{\Delta x^3}{3!} \left. \frac{d^2 f}{dx^2} \right|_{x = x_0} + \cdots \qquad (13.3) \]
While Equation 13.3 is not really a numerical method, it can give accurate numerical answers. For example, if f(x, y) = p_n(x), a simple polynomial, then Equation 13.3 works quite well, since after n + 1 terms, all remaining derivatives are zero. In this case, all of the terms on the right-hand side can be evaluated and a numerical value for the left-hand side can be obtained by choosing any desired value for Δx. But most ODE problems are not that simple. Consider the case of

\[ \frac{dy}{dx} = x - y + 1 \]

(the analytical solution is y(x) = x + e^{-x}), for which subsequent derivatives become an infinitely repeating sequence of x − y and −x + y. At some point the sequence must be halted, thus introducing truncation error. But even more representative is a case where subsequent derivatives of f(x, y) become intractable, as in the case of

\[ \frac{dy}{dx} = -2xy \]

(the analytical solution is y(x) = e^{-x^2}). In such
cases, the accuracy of a solution obtained by a direct Taylor series expansion is limited by one's ability to differentiate a function. It would be more desirable to have at our disposal methods that produce a known truncation accuracy independent of the particular equation being solved and that do not require higher derivatives of the given ODE.

13.4.2 Euler's Method

The obvious solution to this dilemma is simply to truncate Equation 13.3 after the second term on the right-hand side. In doing so, an approximate solution at x + Δx is calculated using the formula

\[ y_{i+1} = y_i + \Delta x\, f(x_i, y_i) \qquad (13.4) \]

which is known as Euler's (pronounced “oiler's”) method. In Equation 13.4, the terms y_i and y_{i+1} refer to the values computed using Euler's method at points x_i and x_i + Δx, respectively, as opposed to the true solution values, which are designated y(x_i) and y(x_i + Δx), respectively. The numerical difference between the true solution values and the computed solution values is the truncation error. This difference is shown in Figure 13.1, which illustrates the behavior of Euler's method.
FIGURE 13.1 Operation of Euler's Method

Point (x0, y(x0)) represents the given initial condition for the ODE. At x0 + Δx, the true solution is y(x0 + Δx). Because Euler's method uses a truncated Taylor series, all knowledge of derivatives higher than first order is lost, so the calculated value y1 lies on a straight-line extrapolation of the slope of the curve at y(x0). The difference between the true and calculated values at x0 + Δx is the local truncation error.

Now the normal procedure for solving an initial condition problem is to use the supplied initial condition to calculate the initial slope and proceed out in either the positive or negative direction of the independent variable using some appropriate step size, Δx. Once a new value of the dependent variable has been found according to the algorithm being used, it is treated as if it were the supplied initial condition and another step is made in the desired direction. One step of this process is illustrated in Figure 13.1. At the initial condition supplied, the true values of x0 and y(x0) are known and the correct slope of the solution function is calculated according to Equation 13.1. These values are plugged into Euler's method (Equation 13.4) and yield the calculated value y1, which is the straight-line extrapolation of the slope of the curve over a distance Δx. For the curve shown in Figure 13.1, the error introduced by neglecting the changing slope of the curve is significant. This error is exacerbated in the next step (not illustrated in the figure), for the slope of the curve for the next step will be calculated as f(x1, y1) instead of f(x1, y(x1)). Thus, it is very likely that the error at the end of the second step will be much greater than it is at the end of the first step, because the value of the derivative which is extrapolated over the second step is incorrect to begin with! This is characteristic behavior of numerical solutions of ODE problems. If the method fails, it doesn't take very long to find out, because the numbers tend to grow wildly and floating point exceptions are generated after only a few steps.

Two actions may be taken to improve the truncation error condition: take smaller step sizes or find a method that has better truncation error characteristics. The first approach is the easiest one to implement. Unfortunately, taking smaller step sizes doesn't improve the truncation error of Euler's method as much as we would like. It can be shown both analytically and experimentally that the truncation error of Euler's method is roughly proportional to Δx. This means that halving the step size (which doubles the computer time required to solve the problem) merely halves the error. Generally, we are not satisfied with this rate of payback for our computational effort.
But for over one hundred years, Euler's method was the only available numerical method for solving ODEs.

13.4.3 Runge-Kutta Methods

Near the turn of the 20th century, two German mathematicians, Carle David Tolmé Runge and Martin Wilhelm Kutta, published methods for obtaining numerical solutions of ODEs that have better truncation error characteristics than Euler's method, yet do not require any derivatives of the function f(x, y) given in Equation 13.1. The family of methods incorporating their ideas are collectively called Runge-Kutta methods. There are an infinite number of Runge-Kutta methods possible, but only a dozen or so are commonly used. They all share the same underlying general structure of

\[ y_{i+1} = y_i + \Delta x\, \phi_i \qquad (13.5) \]

where φi is called the increment function. The genius of the Runge-Kutta approach to solving ODEs is the structure of the increment function, which has the general form

\[ \phi_i = a_1 k_1 + a_2 k_2 + a_3 k_3 + \cdots \qquad (13.6) \]

The aj's in Equation 13.6 are weighting constants and the kj's are values of the first derivative of the solution as computed by the supplied first order ODE as defined in Equation 13.1. Each of the kj values will be computed using a different value of the independent variable, x, and the dependent variable, y, following the general form
\[ k_j = f\!\left( x_i + \alpha_j\, \Delta x,\; y_i + \sum_k \beta_{jk}\, \Delta x\, k_k \right) \qquad (13.7) \]
Despite the complicated appearance of Equation 13.7, each kj is simply a sample of the first derivative function at a particular value of x and y. So Runge-Kutta methods “solve” ODE problems accurately not by considering higher order derivatives of the f(x,y) as a Taylor series does, but rather by taking multiple samples of the first derivative. The number of samples required (the number of k values required in Equation 13.6) and the various constants (the ai’s in Equation 13.6 and the α’s and β’s in Equation 13.7) are not arbitrary, of course. These must be determined very carefully in the derivation of each specific Runge-Kutta method so as to assure the method has the same order of truncation error as a Taylor series of the desired order. The derivation of these coefficients is quite involved even for the simplest cases and
Elementary Numerical Methods and C++ # 161

will not be attempted in these notes.

It is very helpful to consider a physical analogy at this point before the details of the particular Runge-Kutta methods are presented. Imagine that you are in a cave that has a very uneven floor and your lantern goes out. Everything is now pitch black. How will you choose your steps as you return to the surface? (Verrrrry carefully, of course!) You will keep one foot planted where you are currently located and use the other foot to probe the ground in front of you. Every time you place your "probe foot" on the ground in front of you, your ankle joint acts as a transducer which measures the slope, or derivative, of the ground surface at that point. If you take enough samples of the slope by placing your "probe foot" on the ground in front of you, you get a mental "picture" of the ground profile and are able to take a single step and then pull your "plant foot" up to your new position, which is one step closer to the cave exit. You then repeat the process again and again until you safely reach the exit.

This analogy is a good representation of the way all Runge-Kutta methods work. The k values of Equation 13.7 represent the samples of the slope of the cave floor as determined by the "probe foot" and the increment function of Equation 13.6 represents the mental image of the cave floor that is formed after several trial plants of the "probe foot." The value computed by the increment function is the weighted average of all the trial slope values. Following this analogy, it is easy to see that the more samples of the cave floor that are taken before each step, the more likely an accurate placement of the "plant foot" will be for the next step. However, this also means that each step takes longer because of the intermediate "probes" of the cave floor.
From the description above, Euler's method can be considered to be a degenerate form of Runge-Kutta method where the increment function consists of only a single value of k, the value at the beginning of the step. Higher order Runge-Kutta methods require more values of k in the increment function. The mathematics of the individual Runge-Kutta derivations always reveals that there are an infinite number of methods at each order, and Runge-Kutta methods have been derived for orders of at least ten. But not all methods at a given order will give the same numerical answers. Some can be shown to have better truncation error characteristics than others. But all second order Runge-Kutta methods exhibit the characteristic that the truncation error decreases with the second power of the step size. This means, for example, that if a solution for an ODE problem has been obtained using a second order method with step size Δx, then if the problem is repeated using step size Δx/2, the truncation error should be expected to decrease by a factor of four. Similarly, third order methods exhibit truncation error that varies according to the third power of the step size, fourth order methods by the fourth power of step size, and so on.

13.4.3.1 Second Order Runge-Kutta Methods
The increment function for the family of second order methods is

φ_i = a_1 k_1 + a_2 k_2        (13.8)

The k's are defined as

k_1 = f(x_i, y_i)
k_2 = f(x_i + α Δx, y_i + β Δx k_1)        (13.9)
yielding four constants, a1, a2, α, and β. The derivation of this method places the following requirements on the constants in order to provide accuracy equivalent to a second order Taylor series:
162 # Chapter 13 Ordinary Differential Equations
a_1 + a_2 = 1
a_2 α = 1/2        (13.10)
a_2 β = 1/2
With only three equations relating the four coefficients, there is no unique solution. Thus, the choice of a fourth equation to produce a unique solution is arbitrary and an infinite number of second order Runge-Kutta methods exist. Two common choices for the fourth equation have nice geometric interpretations. The first of these is to let a_2 = 1/2. Solving for the other coefficients and substituting from Equations 13.5, 13.8, and 13.9, this form of the second order Runge-Kutta method can be written in traditional form as

y_{i+1} = y_i + (Δx/2)(k_1 + k_2)
k_1 = f(x_i, y_i)        (13.11)
k_2 = f(x_i + Δx, y_i + Δx k_1)

Combining the definitions of the k's into the first equation leads to the following single line version of the same method.
y_{i+1} = y_i + (Δx/2) [ f(x_i, y_i) + f(x_i + Δx, y_i + Δx f(x_i, y_i)) ]        (13.12)
Despite its complicated appearance, Equation 13.12 is quite easy to understand when taken step-by-step with the help of Figure 13.2. First, Euler's method is used to step across the interval Δx. But, instead of accepting that value as the calculated value for y_{i+1}, the slope of the curve is computed at the right-hand side of the interval using the trial values generated by Euler's method. That slope is combined equally with the slope at the start of the interval to give an average value of the slope that is used to step across the interval. Thus, two calculations of f(x,y) are required for each step using this method, and it is intuitively reasonable that two samples of the slope, one computed at each end of the interval, should reasonably approximate a Taylor series solution that incorporates the second derivative. Equation 13.12 is often called the improved Euler method.
FIGURE 13.2 Second order Runge-Kutta method using a2 = 1/2

Figure 13.3 illustrates the second common choice for a second order Runge-Kutta method. This time we arbitrarily set a2 = 1. Again, after calculating the other three coefficients and substituting as before, this form of the second order Runge-Kutta method can be written in the traditional form as
y_{i+1} = y_i + Δx k_2
k_1 = f(x_i, y_i)        (13.13)
k_2 = f(x_i + Δx/2, y_i + (Δx/2) k_1)
Substituting for the k’s as before, the single line version of this second-order Runge-Kutta method becomes
y_{i+1} = y_i + Δx f( x_i + Δx/2, y_i + (Δx/2) f(x_i, y_i) )        (13.14)
The interpretation of Equation 13.14 is similar to that of Equation 13.12, except that Euler’s method is used to obtain an approximate solution at the midpoint of the interval. The slope of the curve is computed at the midpoint and this value of slope is used as the average slope across the whole interval. For this reason, Equation 13.14 is sometimes called the midpoint method. At first glance, it may appear that this form of the second order Runge-Kutta method involves fewer calculations than does Equation 13.12, but this is not the case. Two evaluations of the slope are required in both cases. Equation 13.14 is also intuitively reasonable in that it is logical to presume that computing the slope at the midpoint should give about the same value as the average of the slopes computed at both ends of the interval.
FIGURE 13.3 Second order Runge-Kutta method using a2=1
13.4.3.2 Fourth Order Runge-Kutta Methods
Fourth order methods are the most commonly used of all the Runge-Kutta methods because they consistently yield numerical solutions having low truncation error using reasonable step sizes. There are an infinite number of fourth order Runge-Kutta methods available, but the most widely used one is
y_{i+1} = y_i + (Δx/6)(k_1 + 2k_2 + 2k_3 + k_4)
k_1 = f(x_i, y_i)
k_2 = f(x_i + Δx/2, y_i + (Δx/2) k_1)        (13.15)
k_3 = f(x_i + Δx/2, y_i + (Δx/2) k_2)
k_4 = f(x_i + Δx, y_i + Δx k_3)

Notice that each step of the fourth order Runge-Kutta method requires four evaluations of f(x,y), twice as many as the second order methods. This means that solving a problem using the fourth order method will take twice as long as using a second order method if the same step size is used. (The other additions, multiplications, and divisions required are usually negligible when compared to the time required to evaluate f(x,y) for most problems.) To compensate for this, the fourth order method must be able to take longer step sizes, Δx, without introducing too much truncation error. This is invariably true. Fourth order methods can tolerate much larger step sizes without "blowing up" than second order methods can.

13.4.3.3 Sets of Simultaneous ODEs
All of the numerical methods for solving ODEs are designed to solve only first order equations. Example 13.1 demonstrated how a set of higher order ODEs can be converted to an equivalent set of simultaneous first-order ODEs. Extending any Runge-Kutta method to handle a simultaneous set of first order ODEs is straightforward, though somewhat tedious from a bookkeeping standpoint. Separate sets of k's must be maintained for each dependent variable and care must be taken to include all of the proper values of the dependent variables when calculating each k-value. Example 13.2 shows a Mathcad worksheet that properly implements the classic fourth order Runge-Kutta method for solving two simultaneous first order ODEs over a single step.

O EXAMPLE 13.2 Solving Simultaneous ODEs Using the 4th Order Runge-Kutta Method
Given the second order ODE: y'' = -2y' - 4y with initial conditions y(0) = 2 and y'(0) = 0. Convert the second order equation to a set of two simultaneous first order ODE's by creating two auxiliary variables, q1 and q2. Let q1 = y and let q2 = y'. Substituting these two definitions yields the two first order ODE's:

dq1/dt = f1(t, q1, q2) = q2
dq2/dt = f2(t, q1, q2) = -2q2 - 4q1

Set the initial conditions:

t := 0    q1 := 2    q2 := 0

Let Δt := 0.125 and define the derivative functions for the two equations:

f1(t, q1, q2) := q2
f2(t, q1, q2) := -2·q2 - 4·q1

k11 := f1(t, q1, q2)                                        k11 = 0
k12 := f2(t, q1, q2)                                        k12 = -8
k21 := f1(t + Δt/2, q1 + Δt/2·k11, q2 + Δt/2·k12)           k21 = -0.5
k22 := f2(t + Δt/2, q1 + Δt/2·k11, q2 + Δt/2·k12)           k22 = -7
k31 := f1(t + Δt/2, q1 + Δt/2·k21, q2 + Δt/2·k22)           k31 = -0.438
k32 := f2(t + Δt/2, q1 + Δt/2·k21, q2 + Δt/2·k22)           k32 = -7
k41 := f1(t + Δt, q1 + Δt·k31, q2 + Δt·k32)                 k41 = -0.875
k42 := f2(t + Δt, q1 + Δt·k31, q2 + Δt·k32)                 k42 = -6.031
q1 := q1 + Δt/6·(k11 + 2·k21 + 2·k31 + k41)                 q1 = 1.943
q2 := q2 + Δt/6·(k12 + 2·k22 + 2·k32 + k42)                 q2 = -0.876
Thus, at t=0.125, the value of y=1.943 and y'=-0.876. Subsequent values can be found by repeating the calculations above using these two values as initial conditions.
13.4.3.4 RKSUITE
Computer implementations of various order Runge-Kutta methods abound. Over the years various standard implementations have been shown to be excellent performers over a wide range of ODE problems. Perhaps the consummate ODE problem solver is a collection of routines that are collectively known as RKSUITE[7]. RKSUITE supersedes some very widely used codes written by the authors, incorporating the best features of those earlier codes. Some of the outstanding features of RKSUITE include:

• Automatic step size control to achieve user-specified accuracy requirements. RKSUITE subdivides the requested solution interval as necessary to achieve the required accuracy. It accomplishes this feat by actually using two different Runge-Kutta methods that have different orders of accuracy. The step size is adjusted until both methods give results that differ by less than the required accuracy. This is a dynamic process that is constantly applied to keep the step size as large as possible while maintaining the desired accuracy.

• Three different Runge-Kutta integrators designed to offer an optimum match to the speed and accuracy requirements of the given problem. The third order method is suitable for solving problems when the relative accuracy requirement is in the 10^-2 to 10^-4 range. A fifth order method is available for use when the accuracy requirement is in the 10^-3 to 10^-6 range. Finally, an eighth order method is available when the accuracy requirement is more stringent than 10^-5.

• Excellent run-time diagnostics. The routines are able to diagnose almost any conceivable problem that could be encountered during program execution. Helpful suggestions are provided to guide the user in correcting the problem.

• Two different modes of operation. The first is for "usual tasks," which consists of stepping a solution out from the given initial conditions and returning solutions to the calling program on demand until a final value of the independent variable is reached. This routine, named ut, is most often used to obtain solutions at regular intervals of the independent variable for plotting. The second routine, named ct, is for "complicated tasks" which do not require results to be obtained at regular intervals.

• Excellent documentation that thoroughly discusses each function parameter's meaning and purpose.
Several templates are provided illustrating the use of RKSUITE over a wide range of problem types. A complete discussion of the proper use of RKSUITE is far beyond the scope of these notes. The supplied RKSUITE documentation should be studied carefully to fully utilize the many features of the various routines included in the suite. For the traditional problem of solving the initial condition problem involving a set of well-behaved ODEs at uniform increments of the independent variable, three main steps are involved. First, the governing ODEs must be cast as a set of simultaneous first order ODEs using the method of Section 13.3. Then these equations must be implemented in a function that has the following specified signature: deriv(double t, const double* q, double* dq)
The first argument contains the current value of the independent variable. The second argument is an array of the n values of the set of auxiliary (dependent) variables. The third argument is an array into which the function must store the computed values of the n first derivatives of the dependent variables. The user program will never directly call function deriv. It will be called by various RKSUITE functions whenever the derivatives of the dependent variables are needed for calculation of Runge-Kutta method k-values.

The second step for solving the standard initial condition problem is to write a main driver program which initializes the various parameters that RKSUITE uses to solve the problem. The driver function must call function setup one time to set these parameters. Function setup has thirteen arguments which are described in detail in the RKSUITE documentation. The program in Example 13.3 also includes brief annotations describing these arguments.

Finally, the main program must include a repetition sequence (usually a for statement. Why?) to initialize and advance the independent variable over its prescribed range, calling function ut to advance the dependent variables with each step. Function ut has eight arguments which are also described in the RKSUITE documentation. Additionally, we offer the following clarification regarding the twant and tgot arguments. The value of twant must be set prior to calling ut. The argument specifies the value of the independent variable for which we desire the computed values of the dependent variables. In effect, ut is told to "go underwater" and solve the governing ODEs using as small a step size as necessary, but "come up for air" when the value of the independent variable reaches twant. The value of argument tgot can be consulted after ut returns and "comes up for air." If no errors were encountered, then you can be assured that tgot will have the same numerical value as twant. If tgot is not the same as twant, then you can be assured that appropriate error messages will have been generated if argument mesage in function setup was set to true.

O EXAMPLE 13.3 Using RKSUITE to Solve an Initial Condition Problem

A projectile traveling through the atmosphere is subjected to a drag force which acts in the opposite direction of the velocity and a body force due to gravity which always acts in the negative y-direction. The governing ordinary differential equations for such a projectile in free flight can be expressed as

d²x/dt² = -(C/M) √((dx/dt)² + (dy/dt)²) (dx/dt)
d²y/dt² = -g - (C/M) √((dx/dt)² + (dy/dt)²) (dy/dt)
where C is a drag coefficient, M is the mass of the projectile, and g is the acceleration due to gravity. A given projectile having M = 20 kg and C = 0.03 kg/m is launched on level ground with an initial velocity of 100 m/s in the x-direction and 49 m/s in the y-direction. Use RKSUITE to obtain a trajectory of the projectile's flight for 8.5 seconds.

Following the procedure of Section 13.3, define q0 = x, q1 = y. Since the highest derivative of each dependent variable is of order two, two additional auxiliary variables are needed to represent the first derivatives of the first two auxiliary variables: q2 = dq0/dt, and q3 = dq1/dt. For computational efficiency, define speed = sqrt(q2*q2 + q3*q3) and substitute the auxiliary variables into the governing equations to obtain the two remaining first order ODEs: dq2/dt = -C/M*speed*q2 and dq3/dt = -g - C/M*speed*q3. The following C++ program obtains the numerical solution to this problem by calling RKSUITE routines setup and ut:

#include "rksuite.h"
#include <iostream>
#include <cmath>
using namespace std;

const double mass = 20.0;  // declare system parameters such
const double C = 0.03;     // as these to be global so that they
const double g = 9.806;    // are accessible in both main and derivs

void derivs(double t, double q[], double dq[])
{
    // q[0]=x, q[1]=y, q[2]=x', q[3]=y'
    double speed = sqrt(q[2]*q[2] + q[3]*q[3]);
    dq[0] = q[2];
    dq[1] = q[3];
    dq[2] = -C/mass*speed*q[2];
    dq[3] = -g - C/mass*speed*q[3];
}

int main(void)
{
    const int neq = 4;          // 4 first order ODEs
    const int lenwrk = neq*32;  // size of workspace for RKSUITE
    const int method = 2;       // use 5th order RK method
    const double eps = 1.0e-6;  // accuracy requirement
    const double hstart = 0.0;  // let RKSUITE determine init step size
    const double tstart = 0.0;  // initial value of independent variable
    const double tend = 8.5;    // total time of flight
    const char task[] = "U";    // use the "usual task" solver
    const bool errass = false;  // do not estimate the actual error
    const bool mesage = true;   // do display any error messages
    int ncount = 20;            // number of times to "come up for air"
    int uflag;                  // variable to store ut error indicator
    double q[neq] = {0.0, 0.0, 100.0, 49.0};  // dependent variable IC's
    double thres[neq] = {1.0e-6, 1.0e-6,      // "ignore variable"
                         1.0e-6, 1.0e-6};     // thresholds
    double twant, tgot;         // desired/act. vals of indep. variable
    double qgot[neq];           // dep. variable values at end of step
    double qpgot[neq];          // derivs of dep. vars at end of step
    double qmax[neq];           // max vals of dep. vars (for plotting)
    double work[lenwrk];        // storage space used internally by ut
    double dt = tend / ncount;  // step size
    setup(neq,tstart,q,tend,eps,thres,method,task,errass,
          hstart,work,lenwrk,mesage);           // initialize RKSUITE
    cout << tstart << ", " << q[0] << ", " << q[1] << ", "
         << q[2] << ", " << q[3] << endl;       // print values of IC's
    for (int i = 1; i <= ncount; i++)  // compute the soln step-by-step
    {
        twant = i * dt;  // tell ut when to "come up for air"
        ut(derivs,twant,tgot,qgot,qpgot,qmax,work,uflag); // go solve ODEs
        cout << tgot << ", " << qgot[0] << ", " << qgot[1] << ", "
             << qgot[2] << ", " << qgot[3] << endl; // print current solns
    }
}
Program Output:

0, 0, 0, 100, 49
0.425, 41.0667, 19.2568, 93.4223, 41.7461
0.85, 79.5364, 35.5797, 87.7457, 35.1682
1.275, 115.753, 49.2267, 82.7963, 29.1342
1.7, 149.997, 60.4064, 78.4408, 23.5433
2.125, 182.497, 69.2902, 74.5747, 18.3179
2.55, 213.443, 76.0197, 71.1145, 13.3969
2.975, 242.993, 80.7139, 67.993, 8.73262
3.4, 271.277, 83.4734, 65.155, 4.28746
3.825, 298.408, 84.385, 62.5548, 0.0319348
4.25, 324.477, 83.5242, 60.1542, -4.05684
4.675, 349.563, 80.9578, 57.9209, -7.99632
5.1, 373.73, 76.7466, 55.8279, -11.7995
5.525, 397.033, 70.9463, 53.8523, -15.4755
5.95, 419.518, 63.6095, 51.9748, -19.0306
6.375, 441.223, 54.7869, 50.1796, -22.4685
6.8, 462.181, 44.5276, 48.4533, -25.791
7.225, 482.417, 32.8808, 46.7854, -28.9986
7.65, 501.955, 19.8952, 45.1674, -32.0907
8.075, 520.815, 5.6202, 43.5927, -35.0663
8.5, 539.014, -9.89443, 42.0565, -37.924
Comments: Each line of output contains the results obtained by ut after it "comes up for air" at the value of twant specified inside the for loop. The five numbers on each line are the current time and the four auxiliary variables, respectively. A plot of the trajectory of the projectile can be obtained by plotting the second auxiliary variable, q1 (y), versus the first auxiliary variable, q0 (x).
O
13.5 BOUNDARY VALUE PROBLEMS
A boundary value problem differs from an initial condition problem only in the specification of the problem boundary conditions. For example, the governing differential equations for a cantilever beam and a simply supported beam are exactly the same. But the cantilever beam is an initial condition problem because both the deflection and the slope are known at the point where the beam is attached to the wall and the simply supported beam constitutes a boundary value problem because only the deflection is known at the two ends of the beam. Nothing is known regarding the slope of
the beam at any point. Since there are still two boundary conditions, the problem can be solved, but the methods of the previous section cannot be directly applied because one auxiliary variable will represent the slope of the beam, and its value is not known at the origin. Suffice it to say that this is a much more difficult problem to solve.

Several techniques are available for solving the general ODE boundary value problem, but the most intuitive method for the two-point BVP is the so-called shooting method. In the case of the simply supported beam problem, the shooting method begins by assuming the slope of the beam at the origin and solving the problem using a Runge-Kutta method as if it were a cantilever beam problem. When the integration reaches the end of the beam, the calculated deflection is compared to the known boundary condition at the end of the beam and the assumed slope is adjusted accordingly. For example, if the calculated position at the end of the beam turns out to be higher than the true value, the initial assumed slope should be decreased and the ODEs solved again with the new value. If the second guess produces a calculated position at the end of the beam that is lower than the known boundary condition, then we can presume that the correct value of the initial slope is a number between the first two guesses. This process of guessing initial slopes and solving the ODEs to check the final position against the boundary condition is nothing more than zero-finding as discussed in Chapter 8, and is analogous to the process marksmen use for "sighting in" their guns; hence, the name. Thus, solving the two-point boundary value problem using the shooting method requires no new algorithms, but rather "piggy-backs" a zero-finding problem on top of an initial condition ODE problem.
When solving the two-point BVP using the shooting method with RKSUITE, the trial solutions should be obtained using routine ct, since this is a "complicated task." Routine ct differs from ut only in that it "comes up for air" whenever it wants to rather than at regular intervals. However, it is guaranteed to stop at the ending value of the independent variable (e.g., the end of the beam). Routine ct should still be called from inside a repetition structure, but a do-while repetition is appropriate since it is not known in advance how many times ct will have to be called.

O EXAMPLE 13.4 Two-Point Boundary Value Problem Using the Shooting Method

For this example, we reconsider Example 13.3 with the following twist: What value of the drag coefficient, C, would be necessary to allow the projectile fired with the same initial conditions as before to land exactly 600 meters down range at the same elevation as the firing point?

Examining the results of Example 13.3 reveals that with C = 0.03 kg/m, the projectile impacts the ground at about 530 meters down range. Energy considerations tell us that the maximum range for a given launch condition would occur if C = 0.0 so that there are no drag friction losses. Thus, we know that the correct value of C lies in the interval 0.0 ≤ C < 0.03. Knowing these two values bracket the solution suggests the use of either the bisection or Regula Falsi zero-finding method. However, another problem complicates the situation. The exact time of flight for the projectile is not known for any value of C (except C = 0.0, which can be solved analytically). This problem is handled by giving routine setup a conservative value for tend, the final time for ending the ODE solution, and monitoring the results from ct inside the repetition structure. When the value of y (q1) becomes negative, we know that the projectile has penetrated the surface of the ground.
Now, to find the exact time of impact, RKSUITE routine intrp is called to interpolate between the current solution and the solution at the previous step to find the exact time at which the projectile struck the ground. The down range distance at the point of impact is the value of x (q0) at the time of impact. This interpolation is performed with the same accuracy as the Runge-Kutta method being used, but it is considerably faster.

The code for solving this problem is shown below. Function derivs is exactly the same as it was in Example 13.3. Note that global variable C is no longer declared const since it gets changed in the zero-finding process. The new global variable impactTime contains the time of flight for the latest value of C that was tried. Function bisection is copied directly from Example 8.2. The code is considerably more complex than for the IC problem. Function solveCT does the bulk of the work and returns the distance by which the projectile misses the 600 meter target for the value of C currently being used. Function solveUT takes the computed value of C and solves for the correct trajectory just like Example 13.3.
#include "rksuite.h"
#include <iostream>
#include <cmath>
#include <cstdlib>
using namespace std;

const double mass = 20.0;
double C;            // no longer const; changed by the zero-finder
double impactTime;   // time of flight for the latest value of C
const double g = 9.806;

void derivs(double t, double q[], double dq[])
{
    // q[0]=x, q[1]=y, q[2]=x', q[3]=y'
    double speed = sqrt(q[2]*q[2] + q[3]*q[3]);
    dq[0] = q[2];
    dq[1] = q[3];
    dq[2] = -C/mass*speed*q[2];
    dq[3] = -g - C/mass*speed*q[3];
}

void solveUT(void)
{
    const int neq = 4;
    const int lenwrk = neq*32;
    const int method = 2;
    const double eps = 1.0e-6;
    const double hstart = 0.0;
    const double tstart = 0.0;
    const double tend = impactTime;
    const char task[] = "U";
    const bool errass = false;
    const bool mesage = true;
    int ncount = 20;
    int uflag;
    double q[neq] = {0.0, 0.0, 100.0, 49.0};
    double thres[neq] = {1.0e-6, 1.0e-6, 1.0e-6, 1.0e-6};
    double twant, tgot;
    double qgot[neq];
    double qpgot[neq];
    double qmax[neq];
    double work[lenwrk];
    double dt = tend / ncount;
    setup(neq,tstart,q,tend,eps,thres,method,task,errass,
          hstart,work,lenwrk,mesage);
    cout << tstart << ", " << q[0] << ", " << q[1] << ", "
         << q[2] << ", " << q[3] << endl;
    for (int i = 1; i <= ncount; i++)
    {
        twant = i * dt;
        ut(derivs,twant,tgot,qgot,qpgot,qmax,work,uflag);
        cout << tgot << ", " << qgot[0] << ", " << qgot[1] << ", "
             << qgot[2] << ", " << qgot[3] << endl;
    }
}

double solveCT(double trialC)
{
    C = trialC;
    const int neq = 4;
    const int lenwrk = neq*32;
    const int method = 2;
    const double eps = 1.0e-6;
    const double hstart = 0.0;
    const double tstart = 0.0;
    const double tend = 10.0;   // assume a conservative value
    const char task[] = "C";
    const bool errass = false, mesage = true;
    int cflag;
    double q[neq] = {0.0, 0.0, 100.0, 49.0};
    double thres[neq] = {eps, eps, eps, eps};
    double tnow = tstart, tprev;
    double qnow[neq], qpnow[neq], work[lenwrk];
    setup(neq,tstart,q,tend,eps,thres,method,task,errass,
          hstart,work,lenwrk,mesage);
    do
    {
        tprev = tnow;
        ct(derivs,tnow,qnow,qpnow,work,cflag);
    } while (qnow[1] >= 0.0);   // repeat until y<0.0
    double twant, qwant[neq], qpwant[neq];
    const int lenint = 32;
    double wrkint[lenint];
    // Use a simplified interval-halving search to find the
    // exact time of impact. The qwant vector contains the
    // interpolated values of the dependent variables at
    // time twant.
    do
    {
        twant = (tprev + tnow)/2.0;
        intrp(twant,"S",neq,qwant,qpwant,derivs,work,
              wrkint,lenint);
        if (qwant[1] >= 0.0)
            tprev = twant;
        else
            tnow = twant;
    } while (fabs(qwant[1]) > eps);
    impactTime = (tprev + tnow)/2.0;   // update global variable
    return (qwant[0] - 600.0);
}

double bisection(double (*f)(double), double xl, double xr, double eps)
{
    double fl = f(xl), fr = f(xr);
    if (fl * fr > 0.0)
    {
        cerr << "fl * fr > 0.0 in bisection\n";
        exit(-1);
    }
    while (fabs(xr - xl)/2.0 > eps)
    {
        double xc = (xr + xl)/2.0, fc = f(xc);
        if (fl * fc < 0.0)
        {
            xr = xc;
            fr = fc;
        }
        else
        {
            xl = xc;
            fl = fc;
        }
    }
    return (xr + xl)/2.0;
}

void findTrueC(double lowC, double highC)
{
    double trueC = bisection(solveCT, lowC, highC, 1.0e-6);
    C = trueC;        // set global C to final value
    solveCT(trueC);   // make sure impactTime is computed correctly
}

int main()
{
    double lowC = 0.0, highC = 0.03;
    findTrueC(lowC, highC);
    cout << "True value of C is " << C << endl
         << "Time of flight is " << impactTime << " seconds\n";
    solveUT();
}
Program Output:

True value of C is 0.0216367
Time of flight is 8.59705 seconds
0, 0, 0, 100, 49
0.429852, 41.9149, 19.6473, 95.1151, 42.494
0.859705, 81.846, 36.5951, 90.7544, 36.427
1.28956, 120, 51.0167, 86.8365, 30.73
1.71941, 156.554, 63.0588, 83.295, 25.3473
2.14926, 191.655, 72.8464, 80.0748, 20.2336
2.57911, 225.434, 80.4869, 77.13, 15.3517
3.00897, 257.998, 86.0732, 74.4214, 10.6713
3.43882, 289.443, 89.6864, 71.9159, 6.16782
3.86867, 319.85, 91.3981, 69.5847, 1.82104
4.29852, 349.287, 91.2721, 67.4033, -2.38511
4.72838, 377.815, 89.366, 65.35, -6.46334
5.15823, 405.484, 85.7325, 63.4062, -10.4234
5.58808, 432.339, 80.4207, 61.5556, -14.2727
6.01793, 458.415, 73.4773, 59.7843, -18.0163
6.44779, 483.745, 64.9467, 58.0801, -21.6576
6.87764, 508.355, 54.8725, 56.4327, -25.1984
7.30749, 532.268, 43.2978, 54.8337, -28.6395
7.73734, 555.502, 30.2653, 53.2757, -31.9807
8.1672, 578.074, 15.8182, 51.7531, -35.2212
8.59705, 599.999, 3.21769e-05, 50.2612, -38.3599
O
Glossary

boundary condition  a value of a dependent variable or one of its derivatives at a particular value of the independent variable that is known to satisfy the governing ordinary differential equation.

boundary value problem  an ordinary differential equation which has boundary conditions specified at more than one value of the independent variable.

improved Euler method  a second-order Runge-Kutta method that uses the average of the values of the first derivative at the start and the end of the current step as the increment function.

increment function  a linear combination of first derivatives which is used to "increment" the dependent variable from its current value to the value at the end of the current step.

initial condition  a value of a dependent variable or one of its derivatives at the initial value of the independent variable.

initial condition problem  an ordinary differential equation which has boundary conditions specified at a single value of the independent variable (usually zero).

midpoint method  a second-order Runge-Kutta method that uses the value of the first derivative at the midpoint of the current interval as the increment function.

ordinary differential equation  an equation that relates one or more dependent variables and their derivatives with respect to a single independent variable.
Chapter 13 Problems
13.1 The diagram below shows a mass attached to a fixed support by means of a viscous damper and two springs--one linear and one "hardening" nonlinear spring. The governing second-order ordinary differential equation of motion for this system is
176 # Chapter 13 Ordinary Differential Equations
m ẍ + c ẋ + k x + k* x³ = 0

Use RKSUITE to solve this problem using the following parameters:

k = 2.0 N/cm
k* = 2.0 N/cm³
c = 0.15 N sec/cm
m = 1.0 kg = 0.01 N sec²/cm
x0 = 10.0 cm
vel0 = 0.0 cm/sec
tmax = 1.0 sec

Compare your results to the case of replacing the two springs with a single linear spring having k = 4.0 N/cm.

13.2 The response of a motor controlled by a governor can be modeled by the following system of differential equations:

d²s/dt² + 0.042 ds/dt + 0.961 s = dθ/dt + 0.063 θ,
d²u/dt² + 0.087 du/dt = ds/dt + 0.025 s,
dv/dt = 0.873 (u − v),
dw/dt = 0.433 (v − w),
dx/dt = 0.508 (w − x),
dθ/dt = −0.396 (x − 47.6).

The motor should approach a constant speed as t → ∞. Assume that s(0) = ds/dt|t=0 = du/dt|t=0 = θ(0) = 0.0, u(0) = 50, and v(0) = w(0) = x(0) = 75. Evaluate v(t) for t = 0, 25, 50, 75, 100, 150, 200, 250, 300, 400, and 500. What does lim t→∞ v(t) appear to be? You can check this by calculating the exact steady-state solution, which occurs when all derivatives are zero.

13.3 A two-stage amateur rocket has the following properties:

Tare mass of stage 1 = 5.03 lbm
Tare mass of stage 2 = 5.19 lbm
Stage 1 propellant mass = 2.72 lbm
Stage 2 propellant mass = 1.16 lbm
Stage 1 burn time = 3.5 sec
Stage 2 burn time = 2.3 sec
Stage 1 maximum thrust = 250 lbf
Stage 2 maximum thrust = 175 lbf

The engines of both stages generate thrust that decreases linearly from the maximum value at ignition to 0.0 at the end of the burn time. Assume that the fuel for each stage is also consumed linearly with burn time. A timer ignites the second stage engine 0.1 sec after the first stage burns out. Assuming the rocket is launched vertically on a windless day, the governing differential equation of motion can be written as

d²y/dt² = (T − FD)/Mass − g

where
y is the altitude of the rocket
T is the instantaneous thrust generated by the rocket motor
FD is the drag force due to air resistance
Mass is the instantaneous mass of the rocket
g is the acceleration due to gravity, 32.2 ft/sec²

The drag force, FD, for this rocket can be modeled as
FD = 3.0 × 10⁻⁵ (dy/dt)²
where dy/dt is the vertical velocity of the rocket. Write a C++ program that uses RKSUITE to find the maximum altitude of the rocket and the time required to reach this altitude. You may also wish to generate plots of the altitude and vertical velocity versus time.
Bibliography

1. Raimondi, A. A., and J. Boyd, "A Solution for the Finite Journal Bearing and Its Application to Analysis and Design," Parts I, II, and III, Trans. ASLE, Vol. 1, No. 1, pp. 159-209, in Lubrication Science and Technology, Pergamon Press, New York, 1958.

2. Deitel, H. M., and P. J. Deitel, C++ How to Program, 3rd ed., Prentice Hall, 2001, ISBN 0-13-089571-7.

3. Gerald, Curtis F., and Patrick O. Wheatley, Applied Numerical Methods, 3rd ed., Addison-Wesley, 1970, ISBN 0-201-11577-8.

4. Pozo, Roldan, Template Numerical Toolkit: A Numeric Library for Scientific Computing in C++, National Institute of Standards and Technology, http://math.nist.gov/tnt/.

5. Flowers, B. H., An Introduction to Numerical Methods in C++, Oxford University Press, 1995, ISBN 0-19-853863-4.

6. Press, William H., Brian P. Flannery, Saul A. Teukolsky, and William T. Vetterling, Numerical Recipes in C, Cambridge University Press, 1988, ISBN 0-521-43108-5.

7. Brankin, R. W., I. Gladwell, and L. F. Shampine, RKSUITE Release 1.0, ftp://netlib.att.com/netlib/ode/rksuite.
Appendix A
Table of C++ Operators

Operators at the top of the table have the highest precedence, while those at the bottom have the lowest precedence. Operators in the same group share a precedence level and associate in the direction shown in the Grouping column.

Operator                   Explanation                           Grouping
::                         binary scope resolution               left to right
::                         unary scope resolution

( )                        parentheses                           left to right
[ ]                        array subscript
.                          member selection via object
->                         member selection via pointer
++                         unary postincrement
--                         unary postdecrement
typeid                     run-time type identification
dynamic_cast< type >       run-time type-checked cast
static_cast< type >        compile-time type-checked cast
reinterpret_cast< type >   cast for non-standard conversions
const_cast< type >         cast away const-ness

++                         unary preincrement                    right to left
--                         unary predecrement
+                          unary plus
-                          unary minus
!                          unary logical complement
~                          unary bitwise complement
( type )                   C-style unary cast
sizeof                     determine size in bytes
&                          address of
*                          dereference
new                        dynamic memory allocation
new[]                      dynamic array allocation
delete                     dynamic memory deallocation
delete[]                   dynamic array deallocation

.*                         pointer to member via object          left to right
->*                        pointer to member via pointer

*                          multiplication                        left to right
/                          division
%                          modulus

+                          addition                              left to right
-                          subtraction

<<                         bitwise left shift                    left to right
>>                         bitwise right shift

<                          relational less than                  left to right
<=                         relational less than or equal to
>                          relational greater than
>=                         relational greater than or equal to

==                         relational is equal to                left to right
!=                         relational is not equal to

&                          bitwise AND                           left to right

^                          bitwise exclusive OR                  left to right

|                          bitwise inclusive OR                  left to right

&&                         logical AND                           left to right

||                         logical OR                            left to right

?:                         ternary conditional                   right to left

=                          assignment                            right to left
+=                         addition assignment
-=                         subtraction assignment
*=                         multiplication assignment
/=                         division assignment
%=                         modulus assignment
&=                         bitwise AND assignment
^=                         bitwise exclusive OR assignment
|=                         bitwise inclusive OR assignment
<<=                        bitwise left shift assignment
>>=                        bitwise right shift assignment

,                          comma                                 left to right