This content was uploaded by our users and we assume good faith they have the permission to share this book. If you own the copyright to this book and it is wrongfully on our website, we offer a simple DMCA procedure to remove your content from our site. Start by pressing the button below!
r , the natural method of choosing the rules ( 27) takes account of the form of k and one anticipates that if r = Nh then Q, may vanish for I > N. rifp> 4and a=l; 0>rifp> 4anda>1.
4. Stability of linear recurrence relations. 4.1. We may now turn to the various recurrence relations associated with the above formulae when f (t, u, v) _ ^u + rlv + d(t) in (15) or G(t, u, v) = to + tlv + d(t), g(s, w) = to in (25). The analytic test equations thus assume the form y'(t) _ ^y(t) + r/y(t - r) + d(t), t > 0; y(t) = 0(t), fort E [-r, 0],
and Y'(0 =
WO + p
J e k(t - s)y(s ) ds + d(t), t > 0;
y(O) = yo.
0 The analytic equations are asymptotically stable when bounded changes in the initial data ( that is, in ?k(•) or in y(0)) produce bounded decaying changes in the respective solution y(.). In the first instance , the equation is asymptotically stable for all r > 0 when I77 I < -^. In the second , the equation is asymptotically stable when In1IIk1I1 < - (we assume here that IIkili = .fo Ik (s)Ids < oo). These are sufficient conditions. We seek results for the recurrence relations which are to some extent analogues of the latter observations. In each case our recurrence relations hold for n >1 no with some choice no which does not concern us, and the inhomogeneous terms here have no affect on the stability of either the functional equations or their discretized counterparts ( so that one can consider the homogeneous counterparts of our test equations, with d(t) = 0). It is a consequence of asymptotic stability that (i) for arbitrary starting values , yn -> Oas n -+ no, (ii ) if d(t) undergoes a uniformly bounded perturbation, so does the sequence { g }. The property (i), involving qualitative behaviour of the , unperturbed solution , is sometimes taken as the definition of asymptotic stability; the definition we gave previously has, in our view, pedagogical advantages.
4.2. The case f(t,u,v) = fu+r7v+d(t)
(29)
in (17) gives rise to the recurrences k
E {atyn -t - h/t( Syn-t +
^yn - N-t)}
=
(30)
an
t=o (n >, k) when r = Nh and ( 19) applies , where bn = h ^^=o Qtd ( tn_t). In the case that r = (N - v)h and ( 21) applies, k
k
E{atyn-t - hQ1(6yn -1 + 11 1=0 i =o
1 .,
=
.
J -
(31)
35 equations (30) and (31) are finite- term recurrences , but the number of terms depends upon h since N = [r/h]. We can concentrate on (31), to the exclusion of (30), since in the special case n = Nh (v = 0) we obtain (30) if we replace E^=o yv,jyn-N-L-j by y.-N-t in (31).
Now we turn to the corresponding equations based upon equations (22) and (24). We have, in the case (29), the relations
E E &16.-1 + n#n+v-N-l} + S. k hk t=o t=o and yn +v
k / k t + Y, at(V)yn-t = h Pt(v)
{Gn -t +7/^n+v-N- t} + &(I')
t =o
t=o
( dn(v) = h Ei=o /3t(v)d(tn_t)), Sn = h E O Qtd(tn_t)). Recasting the first equation, k
k
k tt
qq
qq
E aLyn-t - Sh E Nlyn-[ _ qh Y, Ntyn+v-N-L + sni t=o t=o t=o
(32)
rearranging the second equation, k yn+v - I1h E NC(V )yn+v-N-t = t=o t
k
k
n
Sh#t(V) y -l t =o
at(v) yn-t
+ sn(v).
(33)
t=o
We assume that the off-step values yn+v may be eliminated to obtain a finite-term recurrence relation between values {yn I n > 0} with suitable starting values. The elimination proposed can be performed most readily by employing the advancement operator E, since then Eq. (32) reads s( f h , E )y n -k = I7h-(E)9 n- k-N+v + an
and (33) reads [EN+k _ 1)hvv(E)] yn-k-N+v = {-.s (f hi E)}Pn-k + 8.(V)From these forms, we deduce the recurrence relation
{[EN+k - iihQ,(E)]s(eh, E) +,jhsv(th; E)o,(E)}yn-k = an(v), (34) on writing S;,(v) for a certain linear combination of values d(tt).
36 4.3. In contrast to the recurrences obtained above, Eq. (25), with
G(t, u, v) _ (u + iv + d(t),
g(s, w) = w,
k(t) = 1, (35)
yields a relation between y'n and, in general, all previous values yl, viz. k'
k {CYl - Sh fll }yn -l -
l=o
rlh
l=o
R
n-l
[ ^
I 1: ^(h)l_.i y) + &. (36) i=o
L
Note, however, that when G(t, u, v) = ('u + gv + d(t), g(s, w) = w, and k(t) = 0, for t > r, then the analytical equation under consideration becomes
y'(t) =
Wt) + rl rt k(t - s)y(s)ds + d(t), t > 0, tT
and we require y(t) = fi(t) to be prescribed for t E 1-7-,0]. If T = Nh, a natural numerical method for this equation produces a recurrence which now assumes the same form (36), but where S2l = 0 if ! > N (so not all previous values y"l are actually involved). This case is subsumed in our discussion. 4.4. The recurrences (30), (31), (34) and (36) play roles in the stability analysis of delay- and integro- differential equations similar to those familiar in the study of ordinary differential equations employing (7). We can relate the stability of the recurrences (30), (31), and (34) (assumed solvable for the given h) to stability polynomials: Lemma (a) The recurrences (30) or, respectively, ( 31), are asymptotically stable if and only if the stability polynomial
Eh(f, rl; M) pNP(µ) - (EµN + 11)ha(p),
(37)
Eh (C, rl i li) := µN+kP(µ) - ((11N+k + rl^^(R))ha(fi),
(38)
respectively is a Schur polynomial in the variable p. (b) The recurrence (34) is asymptotically stable if and only if
Et (f, q ;
,u) {[µN +k _
iiha^ (µ)]s(Eh, µ) +
rl hsv (^h;
µ)a(µ)}
(39)
is Schur. [Proof:] The result is the analogue of the classical result for (7). ❑ The polynomial (C, g; µ) reduces to µ kEh (C, rl; Is) on setting -y„(µ) = µk Also, Eh (C, rl; µ) reduces to (C , r); µ) on a formal substitution of -y,, ( µ) in place of p„(µ) and on replacing a„(µ) by 0 . Insight can thus be obtained by concentrating on
Eh
Eh
Et (C, 71; ,U). Recall our notation for the reflected polynomials : s(A; C) _ Cks (A; (-1), p(() and i.(() = _ ('p((-1), a(() = Cka(C-1 ), sv(A; () = Cks„(A; (- 1), 6,.(() = (ky ((-1) Rephrasing part ( b) of the last Lemma, the recurrence (34) is stable if,
t t,I 1,. 4,tI I,.. f,i
37 and only if, the polynomial Eh(C, rt;,u) does not vanish when l l >, 1. This condition can, in turn, be expressed in terms of a function of C = µ-1: The recurrence (34) is stable if and only if
[C-N -
7ha.(C)]9(f h; () +
?hg. (C h;
()a(()
54 0
I< 1. Such arguments show that the preceding lemma can be put in a new when Cl form; moreover, it is this form that we hope to generalize: Proposition 4.1 The recurrences (30), (31) are stable if and only if (respectively):
Cl with SZ(C,77; C) := s(fh,C) -1hCNa(() (a) Sh(6,77;C) # 0 when I1,
(b) Sh (, 77; C) 34 0 when ICI 1, with Sh (^, n; C) S(h, C) - qhCN7'(()a(C) 0 when ICI < 1. Then ( 34) is stable if and only if Sh(C, r/; (c) Let [1 - jhCNa'(C)] is non -vanishing when ICI < 1, where Sh(C, 77; C) s(Ch; () + 71hXh(C, r); C)a(() and wherein Xh(f, r1; C) = {[1 - rlhCNay(()]-1CNS,(Ch; C)}. The condition [1 - rlhCn'a„(C)] # 0, above, is not entirely welcome and is the price paid for uniformity of presentation; a different approach could avoid it. The preceding proposition merely rephrases stability criteria traditionally expressed in terms of Schur polynomials, and we now turn to the integro-differential equation. Because its discretization does not ( in general ) yield a finite-term recurrence relation, stability of the recurrence is determined by the zeros of a function which will be defined as a power series rather than as a polynomial. In the case of Eq. (36), In(h) I = S2 < oo. suppose the LMF 1p, o-1 is strongly stable and we require that Et o Write S2(h)(C) = Eio -4h1Ct (the series is absolutely convergent if ICI < 1 ), and denote by w(C) the function
(40)
w(C) _ ^(C-1)
P(C -1)
which, by the assumptions on {p,Q}, is analytic when ICI < 1 and C # 1 (where it has a simple pole). Then: Lemma Suppose that the recurrence ( 36) is solvable , that o I1(h)I < oo, and the function Eh(C, 77; C) := (1 - C){1 - w(()(Ch + i h52(h)(C-1))} is defined , when C = 1, by its limiting value as C --+ 1. Then (36) is strictly stable if (and I< 1. only if) Eh(6,77;C ) does not vanish when Cl
[Proof:] This result is a minor generalization of a result established by Lubich9 [Theorem 7.1], employing a discrete Paley-Wiener theorem. ❑ We observe that
w(C) pYC) ,
YlC)
= CkP(( -1), a(C) =
C"o (C
-1),
(41)
38 and we write p (() for (k-lp*((-i) By the assumption of strong stability, p (() does not vanish for I(I < 1. In consequence, on considering Eh (C rl; (), we deduce from the preceding result: Proposition 4.2
With the preceding assumptions , the recurrence relation (36) is stable if and only if Sh((, r/; () # 0 when I(I < 1, where (42)
Sh(f +0; () := I(((h; () - 7lh6(()^(h11() r with s(A; C() ) = (ks(A; C-1), a(() _ (ka((-1), and C1 (S)
5. Stability results for the adapted schemes 5.1. We now prepare to indicate the unifying role of (12). The functions employed in Propositions 4.1 and 4.2 each assume the form of
Sh((,,7; ^) = s((h; () - rIh 'wh((, 11; where wh(f 0 7;() is analytic for I(I
Suppose that 9(6h;() has no zeros on or inside the disk I(I < 1 and , together with wh((, 77; (), is analytic in (on this closed disk; if
lihl
ICI p
{wh((, i;
()I}
< ICl f
{IS(f h; ()I}
then Sh((, r/; () := s((h; () +,7hwh((, r); () has no zeros on or inside Cl I< 1. By way of application, suppose that (h E So, the strict stability region of the LMF, so that s((h; () has no zeros on or inside I(I < 1 and infi i=l{Is((h; ()l} = L1 ((h). Applying the previous result with s((h; () for (h E SO, and replacing Wh((,'i; () with the forms appropriate in view of Propositions 4.1 and 4.2, we deduce:
Proposition 5.2 Let (h E SO, the strict stability region for (7). Let ti(A) be defined by Eq. (12). If, respectively,
16(()l < it (f h), l I7.(() °(()l < tl((h), 54 0 when I(l < 1, and I,hI supltl lXh(E,7;()d(()l =l I,7hlsupltl=l lcl(h)(()a(()I < t1 ((h).
Inhl
(a) supIcI =1 (b) l')hl supltl=
(c) [1 or (d) then the recurrence (30), (31), (34), or (36), respectively, is stable.
< i1((h),
Remarks : Proposition 5.1 (and hence its corollaries ) can be strengthened. Thus, if the functions involved are analytic and supICI=1{Ir7hwh(6, 77; ()/s((h; ()I} < 1 then Sh((, rl; () :_ s((h; () - 7hwh((', r/; () has no zeros inside I(I < 1 when s((h; () has none.
39 5.2. The results given by Proposition 5.2 are not always sharp (though they can be surprisingly good ), so an alternative approach can be worthwhile: Lemma Suppose that (h E SO. If {(h + 77hwh((,77; ()I 1(1 < 1} C SO where wh((, r/; () _ -h((, r); ()Q((), then s((h; () +phwh( r/, h; () has no zeros on or inside ICI = 1. [Proof:] Suppose, by way of a contradiction, that s((h; (*) +rjhwh(rl, h; (*)v((*) = 0, where I< Cl 1. Then there exists a value C such that I(I < 1 and
9(((h - 77hwh(77, h; (*); () = 0 Cl 1, which (such a value is C (*). Thus, (h - rlhwh(rl, h; (*) E So where I< ❑ contradicts the assumption. In consequence of the above, if I' denotes the circle centered on (h with radius R = 177h1 suplµl=,{lwh(77, h; ()I) and I' C Sa then stability is preserved in moving from the LMF to the adapted LMF. Since sup1C14, I wh(17, h; ()l = suplCl=, (wh(77, h; ()l when wh(rl, h ; () is analytic on the unit disk, we can state: Proposition 5.3 Let (h E Sa, the strict stability region for the LMF . Let t2(A) be defined by Eq. (13). Then the recurrence (30), (31), (34), or ( 36), respectively, is stable if < t2((h), (a)
177h1
(b) 1,7h1 suplCl=, I'7„(C)I < 1,2(Eh), 0 and 177hl suplCl=, IXh((, rl; ()I < t2(f h), (c) [1 -'7h(Nd.(()] or (d) 1r7hlsuplCl=, ISl(h)(() I < t2(6h) respectively. In general , comparison of Propositions 5.2 and 5.3 is not possible, as the lefthand sides and right-hand sides differ in each inequality. However, in the case of BDF formulae, a(() = 1 and then one has only to compare the size of tl(.) with that of t2(•) at the value of interest. 6. Summary We formalized the description of adapted LMFs and provided a set of simple inequalities ensuring stability of these adapted formulae; the stability conditions are sufficient but not necessary. Watanbe & Roth" cover certain LMFs and Runge-Kutta methods for delay equations; in 't Hout & Spijkers recall related results and give necessary and sufficient conditions for stability for certain discretized delay equations; these treatments do not include (25). If stability regions are required, in principle these can be computed2,10 by the boundary-locus method, computing the loci ( = ((O), q = r7(0 ) such that 32{Sh(6, r); exp{iO} )} = 0, {Sh(f, r); exp{iO})} = 0, where Sh((, rl; () has the appropriate form s(f h; () -rlh&(()wh((, rl; (). This calculation may be time-consuming and cannot always be effected exactly (in particular in the general case (42)). It is possible to obtain extensions of our results. For example, we may employ such functions as
t*((,77) := inflµl=1{lEh((, 71,,u)I }, (43)
40 to deduce stability results for (31). This circumvents difficulties inherent in the use of cl(•), L2(.) associated with the observation that the introduction of a delay term can stabilize an unstable differential equation and its discretized version. 7. Acknowledgement. I thank Mark D. Baker (Oxford), for comments on an early draft, and Dr. Ruth Thomas (UMIST) and Dr. Neville Ford (Chester College), who read the final paper.
8. References
1. C.T.H. Baker, Stability functions for multistep formulae for Volterra functional equations . Numer. Anal. Tech. Rep. 220 , Univ . Manchester, 1992. 2. C.T.H. Baker and N.J. Ford, Some applications of the boundary-locus method and the method of D-partitions . IMA. J. Numer. Anal. 11 (1991) pp. 143 - 158. 3. V.K. Barwell, Numerical solution of Differential- Difference Equations. Ph.D. thesis, Rep. CS-76-04, Univ. Waterloo (Canada), 1976. 4. C.W. Cryer, Numerical methods for functional differential equations. In: Delay and Functional Differential Equations and their Applications (edited: K. Schmitt), Academic Press (New York) 1972, pp. 17-101. 5. G. Hall and J.M. Watt ( editors ), Modern Numerical Methods for Ordinary Differential Equations, Oxford University Press (Oxford) 1976. 6. K.J. in 't Hout and M.N. Spijker, Stability analysis of numerical methods for delay differential equations . Numer. Math . 59 (1991 ) pp.807-814. 7. J.D. Lambert, Computational Methods in Ordinary Differential Equations, John Wiley (London), 1973. 8. V. Lakshmikantham and D . Trigiante, Theory of Difference Equations: Numerical Methods and Applications , Academic Press (New York ), 1988.
9. C. Lubich, On the stability of linear multistep methods for Volterra convolution Equations . IMA. J. Numer. Anal. (1983 ) 3 pp. 439 - 465. 10. C.A.H. Paul and C.T.H. Baker, Stability boundaries revisited - RK methods for delay equations , Numer. Anal. Tech. Rep. 205,Univ. Manchester, 1991. 11. D.S. Watanbe and M.G. Roth, The stability of difference formulas for delay differential equations SIAM J. Numer. Anal. 22 (1985) pp. 132-145.
If f " , .
,_,a...^4 i j - I ;'
I,,^._ I' 4.
WSSIAA 2 ( 1993) pp. 41-53 ©World Scientific Publishing Company
RKSUITE: A Suite of Explicit Runge-Kutta Codes R.W. Brankin The Numerical Algorithm Group Ltd. Wilkinson House, Jordan Hill Road Oxford, OX2 8DR, U.K. I. Gladwell L.F. Shampine Department of Mathematics Department of Mathematics Southern Methodist University Southern Methodist University
Dallas, TX 75275, USA Dallas, TX 75275, USA Abstract New software based on explicit Runge- Kutta formulas have been developed to replace well-established , widely-used codes written by the authors (RKF45 and its successors in the SLATEC Library and the NAG Fortran 77 Library Runge-Kutta codes). The new software has greater functionality than its predecessors . Also, it is more efficient , more robust and better documented.
1 Introduction Two of the authors are responsible , in part , for some of the most widely used software based on explicit Runge-Kutta (RK) formulas for solving the initial value problem in ordinary differential equations (ODEs), b = f(x,y), y(a) = a, a < x < b
(1)
where y, f,a E R. Specifically, Shampine is an author of RKF45 [20] (succeeded by DERKF in the SLATEC Library) and Gladwell wrote D02PAF (and its driver routines D02BxF where x has the values A, B, D, G and H ) available in the NAG Fortran 77 Library [5]. These very successful codes have been used in a variety of applications , embedded in packages and translated into several other programming languages . Since the early 1970's when the codes were designed and written , there have been a variety of improvements in technology, in user expectations and in formulas. RKSUITE [3] has been written in response
41
42 to these developments . The design philosophy for RKSUITE is discussed in [17]. In Section 2, we discuss the design of the suite with special emphasis on the added functionality. The choice of RK formulas is justified in Section 3. Implementation questions are addressed in Section 4. Finally, the documentation and templates are described in Section 5.
2 Design 2.1 Overview The aim is to solve the initial value problem ( 1.1) by computing the solution y(x) at user-specified values of x: a=to
on which the approximate solutions y; szts y (x1) are computed . It uses a RK pair which provides a local error estimate that is used to select step lengths, hence the meshpoints xq, so that a local accuracy requirement is satisfied. The algorithm is similar to that of Watts [22]. Generally the user-defined points t; will not coincide with the points xq chosen by the code. So, either the code must include the t; among the zq, or an interpolation facility must be provided . We permit both approaches to producing the solution at the points t; through separate codes for the two different "tasks". In many applications , the number of output points ti is small; in some, output is required only at x = b. In such cases stepping to the output points (and thus avoiding the overhead associated with interpolation ) can be the more efficient approach . In RKSUITE the overhead associated with interpolation is isolated in a user-callable interpolation routine . Hence, the overhead is incurred only when the user, directly or indirectly, requests interpolation. Most explicit RK codes in wide use today are based on formulas with local truncation error of order between four and six . It was shown in [9] that within this range of orders explicit RK codes are the most efficient for low to moderate accuracy calculations. More recently, it has become clear that for moderate to quite high accuracy requirements , higher order explicit RK formulas can be competitive with the extrapolation methods and the variable order, variable step Adams codes recommended in [9) for this range of accuracies. Time-dependent partial differential equations , when discretized in space by the method of lines, lead to large systems of ordinary differential equations.
tt,I
43 Usually, fairly low accuracy but good stability is required in the solution of these ODEs. In many applications, the ODEs are moderately stiff. It has become common practice to employ a low order RK formula that has a large region of absolute stability. There is a trade-off in efficiency between taking fairly short steps with an explicit RK method and using longer steps with an implicit method of similar order but with additional linear algebra overhead. The comparison favors explicit RK methods for some very large problems, particularly those arising from a method of lines approximation to problems in three space dimensions. In RKSUITE, we supply three explicit RK formula pairs: (i) A (4,5) pair with efficiency superior to the popular Fehlberg (4,5) pair for solving problems at typical scientific accuracy requirements. (ii) A (7,8) pair that is significantly more efficient than the (4,5) pair for higher accuracy requirements. (iii) A (2,3) pair intended for problems where stability rather than accuracy is the determining factor for efficiency.
All these pairs are used in error per step, local extrapolation mode. The user of RKSUITE chooses the appropriate pair by means of a parameter in the setup routine . The documentation suggests overlapping ranges of accuracy requirements for which each formula pair might be most appropriate. These accuracy ranges have been determined by experiment (see Section 3). A user concerned about efficiency should experiment to determine the most appropriate pair, starting from the documented recommendation. The recommendations assume that the frequency of output (the number of points t;) does not impact efficiency. In this first release of RKSUITE, the (7,8) pair has no associated interpolant , so frequent output may impact its efficiency. If this occurs during an integration , the code issues a warning. In addition to the formula pair, all other parameters that define the problem are specified in the setup routine . So, for example , the integration task, the points a and b , the initial solution vector, yo, and the accuracy requirements are all specified. Then, the integration routine corresponding to the specified task (step- or interval-oriented integration) is called . It has minimal problem definition parameters , basically those defining the ODES and how the integration is to proceed , e.g. where output is next required . Integration past an output point is permitted, but not past the endpoint , b. Output parameters are also minimal: position (xq or t; depending on the chosen task ), the solution and its derivative there . These parameters are output only. That is , the user cannot impact the integration by changing them before calling the integrator again to proceed further towards b.
Integration proceeds by repeatedly calling the integration routine for the chosen task. A repeat call to the setup routine forces a "cold " restart and so should be used only when necessary. There are a number of warnings that may be issued by the integrator as the integration proceeds . After each of these a "warm" restart can be made by calling the integration routine again without
44 resetting parameters . After any of the possible failures, for example that due to an internal choice of step size too small to make progress on the computer in use, a "cold" restart by a call to the setup routine is the only way to proceed. A repeat call without a restart leads to a catastrophic error . Any catastrophic error , such as checkably incorrect input arguments in either the setup or an integration routine, leads to a message and termination of the program. There are a variety of internal checks on the sequencing and validity of subroutine calls. For example, a call to an integrator not preceded by an initial setup call or a call for a global error assessment without first specifying global error assessment in the setup are both catastrophic. Integration to, but not past, a specific point t, or b is achieved by setting the integration endpoint. When the endpoint is reached , it must be redefined to a point further away from the starting point for the integration to proceed . To avoid the overhead of a "cold" restart , which is not usually needed in this situation, a separate routine is provided just for the purpose of redefining the endpoint. Next , we summarize the user-callable routines in RKSUITE. Some of the underlying routines , that are not user-callable are discussed with other implementation questions in Section 4.
2.2 Setup and integrator routines Values which normally remain unchanged in the integration are defined in the setup routine . So, a= TSTART, b= TEND , n= NEQ , yo= a= y(a)= YSTART( 1:NEQ), are set here. A scalar relative local error control, TOL, and vector absolute local error threshold , THRES(1:NEQ ), must also be supplied. The error control is designed to be used in a relative fashion with absolute error thresholds . On each step , the local error estimate is computed relative to an average magnitude of the computed solution on that step. TOL is bounded by values
0.01 > TOL > 10.0e where a is the machine precision . This is a range of tolerances encompassing all reasonable accuracy requests . That is, the user is required to request some accuracy but is not permitted to ask for so much accuracy that it cannot be delivered in the precision available . The (positive) threshold values, THRES(I), are bounded below by f where r is the smallest number on the machine. This restriction is designed to prevent pure relative error control . It is intended that the user set THRES(I) so that if lyil
I
I
,',f
1
45 In addition, the user specifies which method is to be used , which task is to be attempted and whether a global error assessment is to be computed. A numerical flag is used by the codes to inform the user of warnings and errors. In the setup the user specifies whether these messages should be printed, too. In case of catastrophic error, a message is always printed. A further argument is an estimate for the starting step size . When a value of zero is input, a starting step size is computed automatically . An initial stepsize should be supplied only if an on-scale value is known from a previous integration of a similar problem . When the code computes an on-scale step size internally, it uses a modified version of the procedure described in [7]. A user-supplied step size is "checked " by this procedure too and modified if necessary ; that is, the user cannot control the step size by specifying it through the setup routine. Finally, a work array WORK(1:LENWRK) must be supplied . The setup routine checks that this array is long enough for the specified method and task, and for global error assessment if it is requested . This array is used to pass all "variable dimensioned" work arrays, i.e., all internally used arrays of length NEQ. The user must pass it to other routines in RKSUITE on demand but is never expected to set or access it. Internally it is packed and unpacked as necessary though usually it is accessed through internally stored "pointers" or broken up in a subroutine argument list. All saved scalar information is kept in named COMMON blocks (see Section 4). The -setup routine has no error indicator - all input errors are catastrophic! The environmental constants c, ,r and the output channel for messages, are isolated in the routine ENVIRN. A sample ENVIRN is distributed with RKSUITE. It contains appropriate values for most computer systems in common use. To insure that appropriate values are set, the distribution version of ENVIRN will not run without modification so as to specify these values. The interval-oriented integrator requires the definition of the ODE (1.1) and the next output location, t ;. The output arguments are the location reached, the solution and its derivative at that location, and a flag. Internally , calls are made to the step-oriented integrator and, where appropriate , to the interpolation or endpoint redefinition routine . The requested output location, t;, must be closer to the endpoint , b, than the location, t;_1, returned from the previous call. Unless t; is identical to b it must be clearly distinguishable in the precision available from b and from t,_1.
The step-oriented routine requires the user to supply only the definition of the differential equation. It takes a step from the previous (stored ) point xq to the new point xq+1 and outputs xq+l, yq+1, yq+1 and a flag. As for the interval-oriented integrator , all output values are stored internally (in named COMMON) so that the integration may be continued with minimal user intervention and minimal danger of unintentional or misinformed user interference with the integration. If the user wishes to intervene , this may be achieved by calling the setup routine and hence inducing a restart. There are five failure flag values for each integrator (the indicators are identical in the two integrators ). All other failures here and elsewhere in RKSUITE
46 are catastrophic . The three "soft" failures (or warnings) are: (i) The efficiency of the code is being impacted by output. (ii) The code may be working too hard to solve the problem , as measured by the number of evaluations of (1.1). (iii) It appears the problem is stiff so there may be much more efficient integrators available.
In each of these cases the user may call the STAT routine (see Section 2.3) to find out more . These "soft" failures are not fatal. The user can simply call the integrator again to proceed . On a recall of the integrator the internal counters which keep track of the information used to generate these warnings are re-initialized. So, the warnings are only repeated if the problem persists. The stiffness diagnosis is based on a complex algorithm of Shampine [17]. If a number of inexpensive tests show that stiffness is possible , a nonlinear power method is used to compute a pair of dominant eigenvalues of the local Jacobian. A criterion involving these eigenvalues and the radius of the stability region of the integration formula is used to determine if stiffness is likely to be the reason for the seeming inefficiency. In RKSUITE the stability region information is treated as a part of the method definition (see Section 4). The "hard" failures which may be returned from the integrator are: (i) The step size has been forced too small in an attempt to achieve the user-specified local error requirement. (ii) If global error assessment was requested , then from this point onwards in the integration the results may no longer be reliable. These "hard" failures are fatal . The integration can be restarted only by calling the setup routine. We discuss global error assessment in Section 2.4.
2.3 Interpolation An interpolation routine is provided for the (2,3) and (4,5) pairs. See [1] and [2] for a discussion of the choices of formulas. Hermite interpolation at interval endpoints [x;, x;+i] is more than adequate for the (2,3) pair. This approach requires no more evaluations of (1.1) for interpolation . The high-quality interpolant associated with the (4,5) pair is relatively expensive per step , in terms of additional evaluations of (1.1 ). These costs are incurred only on those steps where interpolation is required and only once per step , no matter how many. The user must supply: (i) The subroutine to evaluate f (x, y) in (1.1). (ii) The location , t;, where interpolation is required. Extrapolation is permitted, but the user is advised to interpolate only.
(iii) An argument to specify whether to return the values of the interpolated solution or its derivative or both.
47 (iv) A positive integer to indicate how many components of the solution are desired. Only the first NWANT< NEQ components are approximated. This is for use with very large problems where only a few components are of interest . Often NWANT=NEQ. (v) The workspace associated with the integrator and additional workspace for the interpolation procedure. If an argument is incorrectly specified, insufficient additional workspace is supplied, or the interpolation routine is called when not applicable , for example with the (7,8) pair, a catastrophic error occurs with an appropriate message. The interpolated result is returned in (one or both of) the output arrays for solution and derivative . Great care has been taken to stabilize the interpolation procedure and minimize the storage required using techniques described in [6, 13, 14]. We have used a standard internal representation , a scaled power series expanded round the current interval endpoint, x;+1• In part choosing a standard representation is to permit changes in interpolant without changes in the interface for the interpolation subroutine . But, the main reason is that it is the chosen form for input to the authors' software (18] for event location associated with the solution of equation (1.1).
2.4
Reporting Routines
Subroutine GLBERR may be called at any time after a call to an integrator, assuming global error assessment was specified in the setup . It reports statistics collected during the integration . The first statistic is an array of length NEQ of root-mean-square global errors for each component taken from the initial point to the current point . The other statistic is the maximum (in norm) global error seen up to this point and the location where it was observed. The global error assessment is computed by step halving. That is, after each integration step from xq to xq+1i two half steps are taken over the same interval in a subsidiary integration . The subsidiary integration is started from V0 at x0. The assessment is the difference between the primary integration and the (more accurate) subsidiary integration . This approach is related to that taken in subroutine D02BDF [5] and in GERK [19], but it assumes far less of the problem and the internal safeguards are much more stringent . Global error assessment is terminated if the assessment is clearly implausible , for example because it is reporting no accuracy or because the estimated local error in the subsidiary integration is not sufficiently smaller than that of the main integration. The subsidiary integration is terminated in these cases because there is no reason to believe that the assessment has any validity from this point onwards. The user should then consider whether the results of the primary integration can be trusted.
Subroutine STAT may be called after any exit from an integrator. It reports statistics about both the integration and the method to help the user determine what further action to take. The statistics include the number of evaluations
48 of (1.1), the cost of a step for the chosen method , the number of accepted steps (that is, the number of points xq so far ) and the next step to be attempted from the current integration point. Also returned is an argument WASTE which is the fraction of the steps attempted that have been "wasted ", the ratio of the number of failed steps to the total number of steps attempted . If this fraction is large, say greater than a tenth , then the integrator is dearly having difficulty integrating (1.1). This may happen, for example, because of stiffness, or the occurrence of many discontinuities in f or in some low order derivative of f.
2.5
Error Trapping Routines
Two routines are supplied to permit the "package writer" to trap catastrophic errors that ordinarily lead to termination of the program with a printed message. A routine is supplied which the user may call to set internal flags to indicate that catastrophic error should be trapped . A second routine is provided for the user to call after each user call to a routine in RKSUITE to find out whether a catastrophic error has occurred . Because these facilities are intended only for the package writer , they are only documented internally.
3 Choice of Runge-Kutta Formulas The aim was to find or develop a set of RK formula pairs which would outperform significantly the currently used formula pairs , would broaden the lass of problems and accuracy requirements for which RK pairs would be used, and would have an associated interpolant. The comparison routines are the widely used production quality library RK codes RKF45 [20], D02PAF [5] and DVERK [10] (in the IMSL Library as DIVPRK) and the (4,5) formulas of Dormand and Prince [4] used in the code DOPRI5 in the text [8]. The latter were derived as an improved pair to replace those developed by Fehlberg and used in RKF45. Dormand and Prince [4] demonstrated an average 40% improvement in efficiency for their pair in comparison with the Fehlberg pair. Bogacki and Shampine [2] developed a non-minimal stage (4,5) pair which theoretically outperforms the DormandPrince (4,5) pair by a similar factor. These theoretical predictions are matched closely in their numerical tests. This pair is so efficient that it is difficult to find an interpolant with matching properties without using several more evaluations of (1.1). As implemented in RKSUITE, the (4,5) pair requires 7 evaluations per step , it is FSAL and the associated interpolant requires 3 additional evaluations per step.
The criterion for choosing the higher order (7,8) pair is that it be more efficient than the (4,5) pair at moderate to high accuracies. A pair derived by Prince and Dormand [12] was used. The criterion for the (2,3) pair is that it have small local truncation coefficients , good error estimation properties and a large absolute stability region. The pair derived by Bogacki and Shampine [1]
49 has these properties. In [11] the three library codes were compared for efficiency and tolerance proportionality with the three codes implemented in RKSUITE . The library codes performed as expected for high accuracy requirements , i.e. higher order resulted in greater efficiency. At low to medium accuracy requirements the situation was more confused with D02PAF performing somewhat better than expected and DVERK somewhat worse. In the comparison between the formulas implemented in RKSUITE , the suite performed essentially as predicted theoretically. It is noteworthy that , from these tests, the (2,3) pair would hardly ever be recommended on grounds of efficiency alone, and that the (7,8) pair is competitive with the (4,5) pair at somewhat lower accuracy requirements than might be predicted from the folklore of ODE solving. In the cross- comparison between RKSUITE and the three library codes, the (4 ,5) pair outperforms RKF45 by about the theoretically predicted factor. It is clear that an accuracy-dependent choice of method from RKSUITE is always significantly more efficient than any of the library codes and that just using the (4,5) or just using the (7,8) pair would be a superior choice in most circumstances . Generally RKSUITE and the library codes show good tolerance proportionality, indicating a successful local error control mechanism has been implemented . See [15] for a discussion of tolerance proportionality and its implications.
4 Implementation RKSUITE is a highly structured Fortran 77 suite of subroutines . It uses generic types throughout. All constants are defined in parameter statements and kept in symbolic form . The user-callable subroutines are heavily documented and all subroutines have sufficient documentation for the user to understand the algorithmic choices. There are a number of unusual features in RKSUITE. Two are the structured use of named COMMON blocks and some novel routines used internally. Subroutine MCONST computes a variety of machine-dependent quantities which are kept in a named COMMON block used only for this purpose. These machine- dependent quantities are computed from the values returned by a call to ENVIRN . For example , r is returned by ENVIRN and fr is computed and kept in MCONST for use in constraining the threshold values THRES(I) in a test in the setup routine . This use of ENVIRN (and MCONST ) is intended to permit easy transportability of the software.
The three RK pairs in RKSUITE are defined in subroutine CONST. The formula values and some basic properties (for example the number of stages and whether interpolation is possible ) are set and kept in a named COMMON block. A number of associated quantities (for example, the radius of the stability region and a lower limit on stepsizes for the method) are set or, where feasible, computed and kept in another named COMMON block . The formula
50 coefficients are kept as the ratios of two integers stored as double precision constants or as a computed 30 decimal place double precision constants. In either format they provide sufficient accuracy for double precision calculations on today's computers. The design of CONST is intended to permit the painless substitution of new formula pairs . The first author has demonstrated feasibility by introducing a new ( 7,8) pair due to Verner [21] into CONST. All error messages in RKSUITE are written to a standard record which is kept in a named COMMON block used solely for this purpose . All messages are printed using subroutine RKMSG which accesses this COMMON block. Keeping the record in a COMMON block permits us to build the error message through a (reverse) sequence of subroutine calls. Subroutine RKSIT is used to set and monitor the status of the integration. There are seven states associated with seven routines . The states can be set or checked . The array of states is local to RKSIT. It is set in a data statement and "saved". Using this mechanism we are able to monitor the complex interrelationships of the routines in RKSUITE , so that out-of-order calls by the user can be prohibited. The remaining named COMMON blocks are used for (i) Problem definition information kept from the setup call. (ii) Problem status information taken from the integration for providing information to the user through a call to routine STAT. (iii) Pointers to workspace to identify the starting positions of "dynamic arrays" associated with the integrator. (iv) Global error assessment information, including quantities accessed by GLBERR and pointers to starting positions of "dynamic arrays" used in the assessment. (v) Integrator options kept from the setup call.
5 Documentation and Templates The documentation for It.KSUITE has four parts: two separate files of information, internal subroutine comments and a number of templates. The major documentation file contains a careful but conventional description of the suite, and of the user-callable subroutines and their argument lists. The subroutines are presented in the order they would usually be called, with the interval-oriented routine described after the setup routine and with the step-oriented integrator, and the interpolation and endpoint redefinition routines described after all the other user-callable routines . (The order is chosen because a user would normally call only one integrator in a program and the interval-oriented one is likely to be the more popular . As user callables, interpolation and endpoint redefinition are associated with the step-oriented integrator.)
The second documentation file contains implementation details including:
51 (i) an ordered list of the subroutines in the RKSUITE file;
(ii) a description of each routine , what it calls and what calls it; (iii) a description of the labelled COMMON blocks in RKSUITE; (iv) a table of usage of the COMMON blocks including which routines initialize, read or alter them.
The internal documentation of the user-callable routine is a briefer version of the information in the main documentation file plus comments about the algorithmic activity of the routines. The latter information is also present in the routines that are internal to the suite. Three example templates are included. Two illustrate computations both with and without global error assessment . The results of running both forms of the template is contained in an output file. Results with different choices of formulas and parameters are also provided. The first template shows how to use the interval -oriented routine to obtain the solution of an ODE on an equispaced mesh of points t;. It demonstrates the care that must be taken if the last output point is to be precisely the endpoint b. The second template integrates the ODE for a two body problem using the step-oriented code. It is shown how to interpret physically the information obtained for such a problem which is mildly unstable near the points of closest approach. The third template illustrates the use of the step-oriented integrator and the interpolation routine to compute intermediate output. A forced Duffing equation is integrated over a long time interval. Two choices of parameters are made, one for which the solution is periodic and another for which it is chaotic. A Poincare map is computed for the study of the qualitative behavior of the solutions.
Acknowledgments The authors Shampine and Gladwell wish to acknowledge the support of NATO Scientific Affairs Division Grant 898-83. The authors Brankin and Gladwell wish to acknowledge the support of NATO grant CRG 920037.
References [1] P. Bogacki and L .F. Shampine, "A 3(2) Pair of Runge-Kutta Formulas", Appl. Math. Lett., 2 (1989) 321-325. [2] P. Bogacki and L.F. Shampine , "An Efficient Runge-Kutta (4,5) Pair", Rept . 89-20, Math. Dept ., Southern Methodist University, Dallas, Texas 75275, USA, 1989. [3] R.W. Brankin, I. Gladwell, and L.F. Shampine, RKSUITE: a Suite of Runge-Kutta Codes for the Initial Value Problem for ODEs, Softreport
52 92-1, Math. Dept., Southern Methodist University, Dallas, Texas 75275, U.S.A, 1992.
[4] J.R. Dormand and P.J. Prince, "A Family of Embedded Runge-Kutta Formulae," J. Comp. 5 Appl. Math., 6 (1980) 19-26. [5] I. Gladwell, "Initial Value Routines in the NAG Library" , ACM Trans. on Math. Soft., 5 (1979) 386-400. [6] I. Gladwell, L.F. Shampine, L.S. Baca and R.W. Brankin, "Practical Aspects of Interpolation with Runge -Kutta Codes", SIAM J. Sci., Stat. Comp., 8 (1987) 322-341. [7] I. Gladwell, L.F. Shampine and R.W. Brankin,"Automatic Selection of the Initial Step Size for an ODE Solver", J. Comp. Appl. Math., 18 (1987) 175-192. [8] E. Hairer , S.P. Norsett and G . Wanner, Solving Ordinary Differential Equations I Nonstiff Problems, Springer, Berlin 1987.
[9] T.E. Hull, W.H. Enright, B.M. Fellen and A.E. Sedgwick, "Comparing numerical methods for ordinary differential equations", SIAM J. Numer. Anal., 4 (1972) 603-637. (10] T.E. Hull, W.H. Enright and K.R. Jackson, "User's Guide for DVERK - a Subroutine for Solving Non-stiff ODEs," Report 100, Dept. of Comp. Sci., University of Toronto, Toronto, Canada M5S 1A4, 1976. [11] G. Kraut, "A Comparison of RKSUITE with Runge-Kutta Codes from the IMSL, NAG and SLATEC Libraries", Rept. 91-6, Math. Dept., Southern Methodist University, Dallas, Texas 75275, U.S.A., 1991. [12] P.J. Prince and J.R. Dormand, " High Order Embedded Runge-Kutta Formulae", J. Comput. Appl. Math., 7 (1981) 67-85.
[13] L.F. Shampine, "Storage Reduction for Runge-Kutta Codes", ACM Trans. on Math. Soft., 5 (1979) 245-250, [14] L.F. Shampine, "Interpolation for Runge-Kutta Methods", SIAM J. Numer. Anal., 22 (1985) 1014-1027. [15] L.F. Shampine, "Tolerance Proportionality in ODE Codes", pp. 118-136 in Numerical Methods in ODEs, A. Bellen et al., eds., Lecture Notes in Math. No. 1386, Springer, Berlin, 1989.
[16] L.F. Shampine, "Diagnosing Stiffness for Runge-Kutta Methods", SIAM J. Sci., Stat. Comput., 12 (1991) 260-272. [17] L.F. Shampine and I. Gladwell, "The Next' Generation of Runge-Kutta Codes", pp. 145-164 in Computational Ordinary Differential Equations, J.R.Cash and I.Gladwell, eds., Oxford University Press, Oxford, 1992. [18] L.F. Shampine, I. Gladwell, and R.W. Brankin, "Reliable Solution of Special Root Finding Problems for ODE's", ACM Trans. on Math. Soft., 17 (1991) 11-25.
53 [19] L.F. Shampine and H .A. Watts, "Global Error Estimation for Ordinary Differential Equations", ACM Trans. on Math. Soft., 2 (1976) 172-186. [20] L.F. Shampine and H .A. Watts, "Practical Solution of Ordinary Differential Equations by Runge- Kutta Methods," SAND 76-0585, Sandia National Laboratories , Albuquerque, New Mexico 87185, U.S.A., 1976. [21] J.H. Verner, Private Communication, 1992. [22] H.A. Watts, "Step Size Control in Ordinary Differential Equation Solvers", Trans. Soc. for Computer Simulation, 1 (1984) 15-25.
WSSIAA 2 ( 1993) pp. 55-70 ©World Scientific Publishing Company
Biorthogonality and conjugate gradient-type algorithms C. Brezinski Laboratoire d'Analyse Numerique et d'Optimisation UFR IEEA - M3 Universite des Sciences et Technologies de Lille 59655 Villeneuve d'Ascq - Cedex France
Abstract The aim of this paper is to use the theory of biorthogonal polynomials to derive many algorithms for solving a system of linear equations in a unified framework. These algorithms include extrapolation and projection algorithms , the most important ones being conjugate gradient-type algorithms . New algorithms are also obtained. 1
1 Introduction Projection and conjugate gradient-type algorithms form a large class of algorithms for solving systems of linear equations. They were discovered (and also rediscovered) during the past decades by various authors and it is often difficult to clearly see the links between them, although some advances in this direction have been made [12,17). Thus the aim of this paper is to give a unified framework for the presentation and the derivation of some of them (I have not tried to be exhaustive and some other algorithms could also 'Subject classification : AMS(MOS), 65F10, 65F05, 65F25 Keywords : linear systems, Lanczos method, conjugate gradient , projection, biorthogonality
55
56 be possibly included into the framework). This framework is the theory of biorthogonal polynomials as given in [2] which is the most general presentation of these polynomials. It may seem strange to escape from linear algebra for treating a problem which obviously belongs to this topics. However, the theory of biorthogonality is a quite simple one (simpler than linear algebra) allowing a unification of various algorithms and clearly showing were they come from and what are their differences and their relations . This synthesis will have to be compared to that obtained from the theory of orthogonal polynomials [9] which is the basis on which Lanczos-type methods and the corresponding algorithms for their implementation are built and is also the main tool for avoiding breakdown and near-breakdown in such algorithms [4,5,6,7,8,14], an important open problem for a long time. Thus the aim of this paper is to show that the theory of biorthogonal polynomials plays the same role for other projection and conjugate gradienttype algorithms given in the literature. But, moreover, this approach also allows to derive new methods (some of them are presented below) and it could help solving the problems of breakdown and near-breakdown in the corresponding algorithms.
The approach given in this paper is complementary to the more traditional one explained in [16] where purely linear algebra techniques are used.
2 Preliminary results Let us first give some preliminary results that will be useful later. We consider the system of n linear equations in n unknowns
Ax = b. We set A = I - B and we consider the vectors (yk) generated by yk+i = Byk + b k = 0,1, .. . where yo is an arbitrary vector. Let z be an arbitrary vector. We set r(z) = b - Az.
57
We have Uk r(yk+1)
= AYk=yk+1 -Yk=(B -I)yk+b=b-Ayk =r(yk)
= b - A(Byk + b) = (I - A)b - AByk = Bb - AByk = Br(yk)
since AB = BA. Thus r(yk) = Bkro by setting ro = r(yo). We also have wk = I Uk = Ar (yk) = Bk+'ro - Bkro = -ABB = Bkwo.
Let us now give some preliminary results on biorthogonal polynomials [2]. Let Lo, L,.... be linear functionals on the space of complex polynomials, let Pk be the polynomial of degree at most k such that Li(Pk) = 0 for i = 0, ... , k - 1 and let Gk = span (Lo,... , Lk-1). Then obviously
VL E Gk,
L(Pk) = 0.
We shall see now another way of expressing this property and some consequences . For that, let us set csi =
Li(^')
and define the linear functional c on (1 k] ® Q[v] by [13] C(^1 ) = Cki.
Let us write L E Gk as L = aoLo + ... + ak-1Lk-1 and let p E Pk_1 be the polynomial p(^) = ao + a1 S
+... + ak-1Sk-1.
Then L(Pk) c(pPk) = 0 where p designates p(4) = p(a
We have, by linear combination using do ,..., ak_1 instead of a0 ,... , ak_1 c(p(^)Pk(f)) = 0.
58
By linear combinations and using the preceding property, we also have c(P1(^)Pk(^)) = c(P1(f)Pk( ^)) = 0 `di < k. Remark : In the Hermitian case, that is if cki = cik we have c(P;Pk)= 0 Vi#k. Indeed, setting Pk(^) _ E. =o aj^j , we have for i < k
c(^'Pk) = Edjc(^i^') = Eajcji = Eaicij = Ea.icii = c(4'Pk) = 0.
By linear combinations we have Vp E Pk-1 c(pPk) = 0 which yields the result. We shall also make use of the linear functional c(1) defined by C(1)(Sie) = Ck,i+1
and of the family {Pk(l) of biorthogonal polynomials with respect to c(1). Biorthogonal polynomials are completely defined by the biorthogonality conditions apart a multiplying factor. We have the Lemma 2 .1 Let { li} be linear functionals on the space of polynomials and let Qk be the polynomial (assumed to exist) satisfying
li(Qk) = 0 for i = 0, ... , k - 1 and Qk(l) = 1.
Let {Li } be the linear functionals defined by Li(C3) = E aiplp ((1 - ^)') P=O
with aii 0 0 for all i.
59
Then, the polynomial Pk defi ed by n
Pk(1=)
rr
= Qk(1 - S)
satisfies
Li(Pk)= 0 for i=0,..., k-1 and Pk(O) = 1. Proof : replacing ^ by 1 - ^, we see that
p=0
Since aii # 0 for all i, the triangular k x k matrix (aip) is regular `dk and there exists (bip) such that (bip) = (aip)-l and
li(f') = E b. Lp((1 - ^)')• P=O
Thus i
li(Qk(e)) _ F bipLp (Qk(1 P=O
Setting Pk (C) = Qk(1
e)) = 0
for i = 0, ... , k - 1.
-c), we obtain the result since the triangular matrix
(bip) is regular. ■ Remark : Obviously we also have Qk(') = Pk(1-C). The normalization condition Qk(1) = 1 can be replaced by Qk(0) = 1 thus leading to the normalization condition Pk(1) = 1 for Pk.
3 Extrapolation and projection methods For solving Ax = b, let us consider the following sequence of vectors given by a vector extrapolation method [17] when applied to (yk) 1 ... 1 doo ... dok
Yo ... Yk doo ... dok
I
Xk = dk-1,0
• • • dk-1,k
dk-1,0 ... dk-l,k
k = 0,1,...
60 where (yk) is generated by yk+l = Byk + b with yo an arbitrary vector. We shall set x0 = yo.
Setting rk = b - Axk, it is easy to see that 1 1 doo ... dok
r(yo) ... r(yk) doo ... dok rk = dk-1,0
dk_1,0
. . . dk-l,k
... dk_l,k
Thus if we set
Qk(S) _
tk
1 doo
1 ... 1 doo ... dok
dok /
dk-1,0
. . . dk-l,k
dk_i,0 ...
dk-1,k
then, using the preliminary results of section 2, we have rk = Qk(B)ro.
Moreover let us define the linear functionals l; on the space of polynomials
by l.(S') = di3.
Then Qk satisfies li(Qk)=0
fori=0,...,k-1
and Qk(1) = 1. If we define the linear functionals Li by i cii
= Li(f') _ Ea ipll ((1 P=O
with `di, aii # 0 and if we set Pk(S) = Qk(l - t)
then, by the lemma
t-^t
61 Li(Pk)=O Pk(0) = 1
^k
1 C00
for i = 0,...,k - 1
...
r-01 ...
COk
COk
Pk(^) =
I Ck-1,0
...
Ck_1,1
Ck-1,k
...
Ck_l,k
and rk = Pk(A)ro. These results generalize those given by Sidi [17] (see also Gander, Golub and Gruntz [12]) on the connection between vector extrapolation processes and projection methods since the last relation shows that rk is obtained by a projection method based on biorthogonal polynomials, see [2]. Such a connection was first considered by Brezinski [2] who showed the identity between the topological e-algorithm and the biconjugate gradient algorithm. Among these methods are the MPE di, = (ui,ui) RRE di i = (wi, ui) zi arbitrary vectors MMPE di; = (zi, uj) TEA di; = (z,ui+i) z arbitrary vector. Indeed we have • For the MPE
li(^') _ (B'ro,B-ro),
Li(t3) = (A'ro, A'ro)•
Indeed li((1 - c)j) = (B'ro, (I - B).'ro) = ((I - A)'ro, A'ro) i i = E(-1)PCp(APro,Afro) = E(-1)PC'LP(^')• 0 o
62
• For the RRE li(^') = (B'wo , Biro ),
L.(t1) _ (A' wo,Afro)•
Indeed = (B'wo, (I - B)1ro) _ ((I - A)'wo, Afro)
= >(-1)'Cp(Apwo, A'ro) _ E(-1)"CpLp(^f)• P=O
P=O
• For the MMPE lc(^') = (z1, Biro), L:(^1) _ (z., Airo). Indeed
l.((1- V) =
:, (I - B)fro ) = ( z., A'ro ) = Lf(f')•
• For the TEA 1=(^') _ (z, B'+iro ), L,(^') _ (z, A'+1ro)• Indeed
(z, B'(I - B)1ro) _ (z, (I - A)'A'ro) E(-1)'Cp(z, Ap+iro) = >(-1)'C'Lp(')• P=O
P=0
Other methods can be included into this framework as we shall see now.
4 Conjugate gradient-type algorithms Let us now give some general results on such methods and begin by results on biorthogonal polynomials.
We consider the monic polynomials P(l) defined by
63 Cl
• • • CO,k+1
C01 ... COk
Ck-1 1 . . . Ck-1 k+1
1
'
...
I I Ck
fk
- 1 ,1
...
Ck_l,k I
These polynomials are the same as those of section 2. Thus Pk and P(1) both exist and are uniquely determined if and only if C01 ... Cok
I Ck_1,1
...
Ck-l,k I
We shall assume that this condition holds Vk. It is easy to see that satisfies
P(1)
L.(fPkl^)=0 for i=0,...,k-1. Thus {Pk} and
{Pkl>} are adjacent families of biorthogonal polynomials
and it holds [2] Pk +1(f) =
Pk(f)
- Akf Pkl)(f
k = 0,1,...
with Po(C) = P,(') Indeed we have
Lk(Pk+1) = Lk(Pk) - akLi(fPk')) This quantity is zero for i = 0, ...,k- 1 since Lk(Pk) = Li (f order to have Lk(Pk+l) = 0 we must take Ak = Lk (Pk)/Lk
kl )) = 0. In
P
(fPkl)).
We also see that Vk, Pk(O) = 1.
If no relation holds between the Li's then no relation exists for computing recursively the polynomials
Pk1).
However we can always write k
Pk+1 (f)
= akPk+l(f) + E flkiPi(l) (1)
i=o
64 Since Pkl) is monic, the coefficient of tk+1 in Pk+1 is equal to -)'k. Thus we shall take ak = -1/)k in order that Pk+l be monic. Now we must have fori=0,...,k k Li(ePk+1) = akLi (lPk+l) + EQkJLi( ^P^^1) = 0.
i=0
That is akLo(tPk+l) + QkoLo(^Po )) = 0 akLl(f Pk+l) + /lkoLl(^Pol)) + 13k1L1(ePili) = 0
...... . ...................................... akLk(SPk+1) + #koLk(ePpl)) + ... + NkkLk(4Pkl)) = 0
which is a triangular system giving NO, ... , Qkk Let us set tt
tt
Vk+1(b) = Pk+1(b)/ak = -'\ kPk1( rr S)•
Thus we have
Li(^Vk) = 0 Pk +1(S)
=
for i = 0, . .. , k -1
Pk(f) -
cc .x' Vk(S )
k tt tt cc Vk+l(S) = Pk+1(S)+>#kiV(S) i=0
with Ak = Lk(Pk)lLk(CVk) and
PkoLo(^Vo) = -Lo(^Pk+l) fkoLl(CVo) +Ik1L1(CVl) = -Ll(ePk+l) ......................... ................ tt IkOLk(eVO) + ... + QkkLk(SVk) = -Lk(SPk+l).
Setting rk = Pk( A)ro and zk = Vk(A)ro we thus obtain rk+1 = rk - A' Azk k Zk+1
= rk+1 + E Pki'zi i=0
65
which is a generalization of Lanczos/Orthomin. We can also look for a formula of the form r r k /^ ( Pk + 1 (Sc ) = SPkll (S) + E/jkiPill) 0 i=O
We have, by the biorthogonality conditions l3koLo(^Po1) = -Lo(^2Pk1) QkoLl(ePP )) + /3k1L1(^Pi1)) = -L1(f2Pk1)) F3kOLk(M11)) + ... + /3 Lk(SPk1) = -Lk(S2Pk1))•
This is a generalization, up to a normalization factor, of Lanczos/Orthomin [19]. In (1), we can put fPkl)(^) instead of akPk+1(S) and thus we obtain a generalization of Lanczos/Orthodir and of the orthogonal error methods of Faber and Manteuffel [11] which includes the orthogonal residual method of Elman [10] as a particular case. Let 4+1 = span(Lo, ... , Lk). Then, as seen in the preliminary results of section 2, VL E 4+1 we shall have L(Pk+l) = 0 and L(eVk+1) = 0. Moreover if p E Pk is the polynomial having the same coefficients as L, these orthogonality conditions can be written c(3Pk+1) = 0 and
c{1)(PVk+l) = 0. Thus we have c(Pk+lVk) = 0 = c(PkVk ) - A' c(1)(VkVk) which gives X. Similarly, for i = 0, ... , k c(1)(Vk+1Vi) = c(fVk+1Vi) = 0.
Thus clll(Vk+1Vi) = 0 = c(ePk+1Vi) + Nkjclll (V Vi) j=O
66
and we obtain a triangular system of equations for the ak; since c(') (V Vi) = 0 ifi<j. Remark : In the Hermitian case (that is cki = Cik) Vi # j, cl'l(V Vi) = 0 and thus for i = 0,...,k- 1 Nki = -C ( Pk+1Vi)/C(')(VVi)
and the preceding method simplifies since the coefficients Nki are now given by a diagonal system of equations. Thus the orthogonal error algorithm [11] has been simplified . Similarly we have Qki = -c(') (pii)) M') ((ii)) for i = 0, ... , k.
5 Various methods Let us now study some choices for the linear functionals Li. For the choice k+1 i i cki=c (^ 4 )=(A ro,A ro)
we have Ak = (rk,Azk)/(Azk,Azk) = C(PkVk)1C(1)(VkVk)
and the Ski are such that c(') (Vk+1 Vi ) = 0 for i = 0, ... , k that is (Azk+1i Azi) = 0, i = 0, ... , k and the GCR as explained by Saad and Schultz [15] is recovered . A similar procedure can be used for computing the /9ki's.
The preceding method breaks down only if 3i such that zi = 0. If zi # 0 for i = 0, ... , k - 1 and zk = 0 then PM is an annihilating polynomial for the vector ro. Changing the definition of cki leads to other methods. For example the choice k .' cki = (A y, A ro) = co,i+k = Ci+k,o
gives Orthomin again. The choices cki = (Aiy, Akro) or cki = (Aky, Airo) or cki = (Akro, Ai+'ro) can also be interesting. Hermitian choices are more interesting since they lead to a diagonal system for the Nk,.
I
67 Let us now consider a choice such that Cki = Ck+1,i+1
which corresponds to vector orthogonal polynomials of dimension -1 [3]. Thanks to this relation, all the cki's are defined from
L o( et)
= Coi
and Li(1) = cio for i = 0, 1, .. . For example we can take Vi, j i coi = ci,i+.i = (ro, A y) and i
In that case we also have rr tt t ` )1(S) k = 0, 1, .. . Pk+1(S) = ( S + Bk+l )Pk11(S) - Ck+1SPk1
with P(i)() = 0, Pol)(^) = 1 and Ck+1 = Lo(e2Pk1))//Lo(^2Pk1)1)) Bk+1
Cl
= Ck+lLk-1(SPkll)/Lk(Vk1)) = 0.
Thus, setting xk = Pkl)(A)ro, we have Zk+1 = (A + Bk+1)xk - Ck+lAxk_l. This method seems to be new.
Vector orthogonal polynomials of dimension d > 0 can also be used. In that case +1) L i( ' = Li+d(M and we have
68
Lo( b ) Li(f2Pk1)) = L i+d(ePk1)) = Dk
... Lo(1k+1)
Lk-1 (f) Li +d (S )
Lk-1(Sk+1) Li+d(Sk+1)
When i + d < k - 1, two rows are identical in this determinant and thus Li(f2Pk1)) = 0 for i = 0,...,k - d - 1 when k>d+1. It follows from the preceding system that Qki = 0 for
i = 0,...,k - d - 1
and thus we have a recurrence relation of the form k tt Pk+1(S)
t
=
c r + L,
fPk1 )(f )
QkiPP)(b)
i=k-d
which is a relation with d + 2 terms as proved by van Iseghem [18]. This is a (d + 2)-term orthogonal error iteration in the terminology of Faber and Manteuffel [11]. Vector orthogonal polynomials of dimension -d, with d > 0, could also be used, (see [3] for those of dimension -1).
References [1] C. Brezinski, Pade-Type Approximation and General Orthogonal Polynomials, ISNM Vol. 50, Birkhauser, Basel, 1980. [2] C. Brezinski, Biorthogonality and its Applications of Numerical Analysis, Marcel Dekker, New York, 1992. [3] C. Brezinski, A unified approach to various orthogonalities, Ann. Fac. Sci. Toulouse, to appear.
[4] C. Brezinski, CGM : a whole class of Lanczos-type solvers for linear systems, submitted.
69
[5] C. Brezinski, M. Redivo Zaglia, Treatment of near-breakdown in the CGS algorithm, submitted. [6] C. Brezinski, M. Redivo Zaglia, H. Sadok, Avoiding breakdown and near-breakdown in Lanczos type algorithms, Numerical Algorithms, 1 (1991) 261-284; Addendum, Numerical Algorithms, 2 (1992) 133-136. [7] C. Brezinski, M. Redivo Zaglia, H. Sadok, A breakdown-free Lanczos type algorithm for solving linear systems, Numer. Math., to appear. [8] C. Brezinski, H. Sadok, Avoiding breakdown in the CGS algorithm, Numerical Algorithms, 1 (1991) 199-206. [9] C. Brezinski, H. Sadok, Lanczos type algorithms for solving systems of linear equations , submitted. [10] H.C. Elman, Iterative Methods for Large Sparse Nonsymmetric Systems of Linear Equations, Ph. D. Thesis, Yale University, Computer Science Dept., 1982.
[11] V. Faber, T.A. Manteuffel, Orthogonal error methods, SIAM J. Numer. Anal., 24 (1987) 170-187. [12] W. Gander, G.H. Golub, D. Gruntz, Solving linear equations by extrapolation, Report NA-89-11, Computer Science Dept., Stanford University, 1989.
[13] Ja. L. Geronimus, Orthogonal polynomials , Amer. Math. Soc. Transl., (2) 108 (1977) 37-130. [14] M.H. Gutknecht, A completed theory of the unsymmetric Lanczos process and related algorithms , part I , SIAM J. Matrix Anal. Appl., 13 (1992 ) 594-639. [15] Y. Saad, M.H. Schultz, Conjugate gradient -like algorithms for solving nonsymmetric linear systems , Math. Comput., 44 (1985) 417-424.
[16] H. Sadok, A unified approach to conjugate gradient algorithms for solving nonsymmetric linear systems.
70
[17] A. Sidi, Extrapolation vs. projection methods for linear systems of equations, J. Comput . Appl. Math., 22 (1988) 71-88 [ 18] J. van Iseghem, Vector orthogonal relations. Vector qd-algorithm, J. Comput . Appl. Math., 19 (1987 ) 141-150. [19] D.M . Young, K.C. Jea, Generalized conjugate-gradient acceleration of nonsymmetrizable iterative methods , Linear Alg. Appl . 34 (1980) 159194.
WSSIAA 2( 1993) pp. 71-83 @World Scientific Publishing Company
TRIDIAGONAL MATRICES AND NUMERICAL SOLUTION OF ORDINARY DIFFERENTIAL EQUATIONS
LUIGI BRUGNANO Dipartimento di Energetica, Universitd di Firenze Via Lombroso 6/17 50134 Firenze, Italy and DONATO TRIGIANTE Dipartimento di Energetica, Universitd di Firenze Via Lombroso 6/17 50134 Firenze, Italy
ABSTRACT Tridiagonal matrices have been used in the field of numerical methods for ordinary differential equations . So far they were mainly related to boundary value problems and only to prove the stability of the methods used. Recent results concerning the conditioning of tridiagonal matrices allow to use their properties in order to design efficient methods both for BVP and IVP.
1. Introduction Numerical linear algebra is heavily used in the field of the numerical solution of differential problems. This use is essentially related to the final stage, when large linear systems have to be solved. This is evident in the PDE case, but also when solving stiff ODE problems, where implicit methods need to be used. Less known is the fact that numerical linear algebra can be succesfully used in the initial stage, when the mentioned methods are defined. In this paper we shall give two examples of such a use. The two problems considered, a boundary value one and an initial value one, are considered difficult in their respective classes. The first problem is the second order singular perturbation problem; the second one is the linear stiff problem. In both cases we shall use numerical methods which lead to a discrete problem of the following type:
Tiyi+ 1 + yi + O'i_1yi-1 = fi, yo = Ya,
YN+1 = Yb,
71
i = 1, ... , N,
(1)
72 or in matrix form,
Ty = f, where Ti
1
(2) TN-1 QN_1 1
It will be essential in the following to control ITT-1 11 as function of N. We shall describe some results on such dependence in section 2. In section 3 and 4 applications in deriving methods for the above mentioned problems will be presented. 2. Invertibility and Conditioning of Tridiagonal Matrices In this section we shall review some results concerning the invertibility and the conditioning of tridiagonal matrices . For simplicity, we shall consider the normalized tridiagonal matrix T in Eq. 2; moreover we shall assume that o,iri # 0, i = 1,... , N - 1 (conversely, the problem can be reduced), while, for completness , we set oo = To = aN = TN = 0. The following result2 gives sufficient conditions for the invertibility of the matrix T: Theorem 1 If oiri < 1, i = 1, ... , N - 1, then the matrix T is invertible. A geometric interpretation of the conditions in theorem 1 is that the points (o-,, Ti) must lie inside the region (see Figure 1) of the (o,, T )-plane which is bounded by the hyperbola 1 - 4QT = 0, and contains the origin. This condition, however, doesn't imply the well-conditioning of the matrix, according to the following Definition 1 A nonsingular matrix is said well-conditioned if its condition number is bounded by a quantity independent on the size of the matrix. If this quantity depends as a polinomial of small degree (1 or 2) on the size of the matrix, then the matrix is said weakly well- conditioned. We shall see sufficient conditions for the matrix T to be well-conditioned, when the hypotheses of theorem 1 are satisfied. For sake of simplicity, we shall consider the simpler case in which the products o i ri, i = 1, ... , N-1 have constant
73
0
Figure 1.
sign. More complicated cases may be treated similarly, by using the ShermannMorrison formula, provided that the number of changes of sign of the products airi is independent on the size N of the matrix T. Therefore, two main cases may be considered: 1. 0
In the first case, the following result may be proved': Theorem 2 Suppose that for all i = 1, ... , N one has 0 < aiTi 1 and, moreover, the following set of conditions is verified:
IaiI + lri_iI < 1,
l
ITi + I0,i-1l < 1,
then the matrix T is nonsingular and well -conditioned . If the inequalities are not strict , then the matrix T is at least weakly well-conditioned. The geometric interpretation of the above result is that the off diagonal entries on each row and column of the matrix T, that is the points (ai-1j Ti) and must he inside the square region of the (a, T)-plane in Figure 2. In the case airi < 0, the following result holds true2:
74
o +T=-1
o
-T
=1
Figure 2.
Theorem 3 Suppose that for all i = 1, ... , N one has oiri < 0 and, moreover, one of the following two sets of conditions is verified: 1. 1 Ioil - ITi-1I < 1,
11 1riI - Iai-1l < 1, 2.
r ITi-1I-10'iI<1, t Io_i1 -Iril<1, then the matrix T is nonsingular and well-conditioned. If the inequalities are not strict, then the matrix T is at least weakly well-conditioned. The geometric interpretation of the above result is that the off-diagonal entries on each row and column of the matrix T, that is the points (ai_1i Ti) and (ai,Ti_1), must respectively lie inside the regions (those containing the origin) of the (a, T)-plane in Figures 3 and 4, or vice versa. A particular case for the matrix T is obtained when the entries on each diagonal have constant sign. In this case, from theorems 2 and 3 the following result derives at once2:
f - f-1 4,
75
T
O - T=
0 \
Figure 3.
T
0
0
a-T=1 /I\o+T=-1
Figure 4.
76
T
0
Figure 5.
Corollary 1 Suppose that for all i = 1, ... , N one has oiri < 4, of and ri have constant sign along each diagonal, and moreover, the following set of conditions is verified:
1
Iai +'ri -1I
< 1,
ITi + 0'i-1I <
then the matrix T is nonsingular and well-conditioned. If the inequalities are not strict, then the matrix T is at least weakly well-conditioned. The geometric interpretation of this last result is that the off-diagonal entries on each row and column of the matrix T, that is the points (vi-1i Ti) and (o' , Ti-1), must lie inside the "strip" of the (a, T)-plane shown in Figure 5. 3. Second Order Singular Perturbation Problems The conditioning of tridiagonal matrices has been considered by many authors in order to prove the stability of numerical schemes for boundary value problems 6,10,13. Here we shall use the results about conditioning seen in section 2 to derive a numerical scheme for solving boundary value problems, so that the matrix which defines the corresponding discrete problem is well-conditioned by construction. This will lead to a strategy for choosing the steps of integration, as we are going
0
77 to see by analysing the numerical solution of the following second order singular perturbation problem:
Ey" + s(t )y' + c(t)y = d(t), t E [a, b], y(a) = ya,
(3)
y( b) = Yb,
where a is positive and very small. We shall assume that c(t), d(t) are continuous, while s(t) is differentiable, for simplicity. We assume also c(t) < 0, because this is sufficient to ensure the existence and the unicity of the solution of the problem in Eq. 3. Let be a - to < t1 < ... < tN+l - b , hi-1 = ti - ti-1. Then, if we discretize the second derivative at ti as" / / ,(ti) N
y
2(hi-
l y (ti +1) - (hi + hi - 1) y (t
hi-lhi(hi -,
i) +
hiy(ti-1))
+ hi)
and the first derivative at ti as y (ti) ^ hZ
- 1y(ti+1) - (h2-1 - h?)y(ti) + h?y(ti-1) hi-lhi ( hi-1 + hi)
then, by neglecting the error due to the approximation , and denoting , as usual, with yi the approximation to y(ti ), si = s(ti ), ci = c(ti ), and di = d(ti), we obtain the discrete problem in Eq. 1, where
of
_
_
-hi_1(2e + sihi_1) (hi-1 + hi)(2e + si(hi-1 - hi) - cihihi-1)'
T' fi
-`hi+1(2c - si+lhi+l) / (hi+l + hi)(2e + si+, ( hi - hi+,) - ci+lhihi+l)'
=
-dihi-1hi (2e + si(hi_1 - hi) - cihihi_1).
( 4)
The steps hi are chosen in order to have all the denominators positive. Moreover, the matrix T will be composed of submatrices in which vi , Ti, and the products viTi have constant sign. It will be convenient to examine these submatrices separately, as we said in section 2. For simplicity, we shall analyze only the case in which s(t) > 0 for a < t < b: in this case11 we have exactly one boundary layer at t = a. It follows that the matrix T is composed of two submatrices , the first one having the products viii all positive, and all negative in the second one. The general case is treated similarlyll In correspondence of the layer it is convenient to use a small constant stepsize h . It follows that, if h is sufficiently small (see Eq . 4), then we have viii > 0.
78 In this case, the submatrix is invertible and well-conditioned provided that, for t E [a, b] and k > 0 the following conditions are satisfied": 1. c(t) < -k, 2. s'(t) > -k, 3. h < min(a,2e/s(a)), where a - f (2e/(maxa
+1((1 - q:)-,:+l - ci+lgih1)+ h;+1(2e(gc - 1) - c;+igih?) + (2e + h;s;+1)h:q+ ? 0, where q; = 7-i_ 1 - p,, with p, > -1 and such that q;r_1 > 0 . In fact , in this case the points (o;_1i r, ) are contained in the strip of Figure 5, and the points (a,, ri_1) are contained in one of the two regions in Figure 3 and 4; then we can apply the results in theorem 3. If h;+1 > yh;, where y > 1 is a fixed parameter, or some of the denominators in Eq. 4 are negative , then we choose h;+1 as the maximum step which satisfies both these two conditions . In this case, the products air; are still negative, and the points (o;_1i r;) are in the strip in Figure 5, but it may happen that some of the points (a,,r;_1) don't fulfill the hypotheses of theorem 3. As this exceptional points are very few, the resulting matrix T is still well -conditioned. The idea just described has been applied succesfully to the most difficult test problems in literature". 4. Boundary Value Methods for Initial Value Problems In this section we consider a particular class of numerical methods for ODE, called Boundary Value Methods (BVMs9). For simplicity, we shall consider the solution of the following scalar problem:
79 y'(t) =
Ay(t) + b(t),
.A < 0, t E [a, b], (5)
y(a) = Ya,
where b(t) is a bounded function , but the results can be extended to linear systems of ODE3'4. A BVM method is obtained by fixing a partition' ,',' of the interval of integration , a - to < tl < ... < tN = b, such that t; = t;_1 + h;, i = 1, ... , N. The problem in Eq. 5 is then discretized by using a two-step method ( main method), while the last step is discretized by using an implicit one-step method ( last-point method ). In particular , we shall consider the BVM method which utilizes the Simpson method as main method , and the trapezoidal rule as last-point method. We shall consider the stability of this method , when the steps h; are chosen as follows: h;+1=rh; ,
i=1,...,N-1, (6)
where the initial step h, is given, and r > 1 is a fixed parameter. In this case, the discrete problem consists in the solution of the linear system
Ay = f, where the vector y = (Y1, ... , yN )T contains the approximated solution , while the structure of the right -hand side is, by posing b; = b(t;), i = 0, . . . , N:
fi = -3 ((q - 3)ya + hlbo + 2h, ( 1 + r-1)bl + hlr-1b2) , i=2,...,N-1,
f; =-h ri- 1(b;_1+2( 1+r-1)bi+r-1b;+1), h rN-1 fN = --(bN-1 + bN),
with q = -h,\ > 0. The matrix A is tridiagonal, and can be expressed as (7)
A=T,+qT2, where:
Ii = (1 -r -2 ) r-2
-1
1
80 / 2(r + 1) 1 r
T2= ID . 3r r 2(r + 1) 1 2r zr and D = diag ( 1, r, r2,
... , rN-1)
If we write the matrix A in Eq. 7 as A = bT, where b is a diagonal matrix, and T is the tridiagonal matrix in Eq. 2, then the BVM method is stable if the matrix T is well-conditioned. From the structure of T1 and T2, it follows at once that the matrix T is composed by no more then two submatrices , the first one having the products ai ri all negative, and the second one having the products Qir; all positive. Moreover , the off-diagonal entries of the two submatrices have constant sign along the diagonals ; therefore, we can utilize the result in corollary 1. If we neglect the entries on the last row, which depend on the last-point method, then the matrices T1 and D-1T2 are Toeplitz matrices. Let us denote by 1 -r(1) Ti =a Or(1) .
and 1
T(2)
D-1T2 = Q o.(2)
where: a = (1 - r 2),
= 3(1 + r-1), T(1) = r 2« 1,
0,(2) = 3N-1, r(2) = 3(rfl)-1•
Then, one easily obtains that the off-diagonal entries of the matrix T are given by
81 av(1) + g fr'Q(2) Ui = a+qar' aT (1) +
gflr'
-1.r(2)
r, = a + q,9r'-1 It is easy to check that the condition I0i-1+Til <1, i=1,...,N is always satisfied . Moreover , it can be shown3 that if the condition ITi_i+ail<1, i=1,...,N (8) is satisfied, then also the condition QiTi < 4, i = 1, ... , N is satisfied. Therefore, if the condition in Eq. 8 holds true, then the matrix T results to be well-conditioned, from corollary 1. This condition is equivalent to require that the vector Qi
_
a
v (1) +
qfr'-2
v(2)
Ti-1 a + gpri- 2 ' a+ q/3ri-2 s 7
where a + ri-2Q(1)
1
a+q/3r'
and art + r' c(2)
v;2) =
a + q/r' r(2)
must be inside the strip shown in Figure 5. We observe that this re gion is convex; moreover, (U,,Ti_I )T is a convex combination of the vectors v^1) and v;2) Therefore, it will be sufficient to show that the vectors v(1) and v(2 ) are inside the region shown in Figure 5. The vector (a('), T(1 ) )T is on the left part of the boundary of this region; it follows that v!1) will be inside the strip if a + q/3ri-2 -1 < -1 +2 a + q/jri 1 - r-2
1 - r_2 '
that is, q/3r' < 2(a + q/jr').
0 This condition is satisfied Vr > 1. Similarly, the vector ( .(2),T(2))T is inside the region shown in Figure 5; after simple calculations , one obtains that the vector v 2) is inside the strip if it is verified the condition are + q/3r' r 2r + 1 a+q/3ri 2(r+1) < 2(r+ 1)' that is, a(r3 - 2r - 1) < q/3r`(r + 1). this condition is obviously satisfied if we require r3-2r-1 < 0, which is true for r < r" 1.618. We observe that the condition r < is not restrictive in practice, because the used values of the parameter r are usually much smaller'. A similar analysis can be done for other BVM methods3, by considering different main and last-point methods. In general, by imposing that the matrix of the discrete problem is well-conditioned, when scaled by its diagonal, we obtain upper bounds for the parameter r in Eq. 6. The BVM method, shortly described in this section, has been succesfully used for large linear systems of ODE'. It turns out that this method is superior, with respect to known solvers like LSODE, for stiff problems. Moreover, it can be efficiently implemented on parallel computers. References 1. A. O. H. Axelsson and J. G. Verwer, Boundary Value Techniques for Initial Value Problems in Ordinary Differential Equations, Preprint No. 147, Mathemathisch Centrum, Amsterdam, 1983. 2. L. Brugnano and D. Trigiante, Tridiagonal Matrices: Invertibility and Conditioning, Lin. Alg. Appl. 166 (1992), 131-150. 3. L. Brugnano and D. Trigiante, Stability Properties of Some BVM Methods, Preprint, 1992. 4. L. Brugnano and D. Trigiante, A Parallel Preconditioning Technique for BVM Methods, Preprint, 1992. 5. J. It. Cash, Stable Recursions, Academic Press, New York, 1976. 6. C. F. Fisher and R. A. Usmani, Properties of Some Tridiagonal Matrices and their Applications to Boundary Value Problems, SIAM J. Numer. Anal. 6 (1969), 127-142. 7. S. Godounov and V. Riabenki, Schemas aux Differences, MIR, Moscow, 1977.
83 8. J. W. Lewis, Inversion of Tridiagonal Matrices, Numer. Math. 38 (1982), 333345. 9. L. Lopez and D. Trigiante, Boundary Value Methods and BV-Stability in the Solution of Initial Value Problems, Appl. Numer. Math. (to appear). 10. R. M. M. Mattheij and M. D. Smooke, Estimates for the Inverse of Tridiagonal Matrices Arising in Boundary Value Problems, Lin. Alg. Appl. 73 (1986), 3357. 11. F. Mazzia and D. Trigiante, Numerical Methods for Second Order Singular Perturbation Problems, Computers Math. Appl. 23 (1992), 81-89. 12. F. Mazzia and D. Trigiante, FDSPP: a Solver for Singular Perturbation Problems, Submitted. 13. R. A. Usmani, On the Explicit Inverse and Conditioning of a Tridiagonal Matrix, Int. J. Computers Math. 41 (1992), 201-213.
WSSIAA 2(1993) pp. 85-98 ©World Scientific Publishing Company
Runge-Kutta Methods for Neutral Differential Equations M.D. Buhmann', A. Iserles2 and S.P. Norsett3
Abstract In this paper we examine the numerical solution of the delay differential equation y (t) = ay(t ) +Qy(Zt)+7y (2 t), t > 0, y(0) = 1,
where a, /3 and ry are complex numbers, using a Runge-Kutta approach. Sufficient conditions for the asymptotic stability of the numerical solution, i.e. limn- , y„ = 0, are found and particular attention to the simple case of collocation with one collocation point only is given. Subject Classification: 34K20, 65L20 (primary); 34K40 (secondary).
1 Introduction A multitude of papers have been devoted in the last two decades to the numerical solution of delay differential equations with constant lag, of the generic form Y '(t) = f (t, y(t), y(t - r )),
t > 0, (1.1)
' Magdalene College , Cambridge CB3 OAG, England. ' Department of Applied Mathematics and Theoretical Physics, University of Cambridge , Silver Street, Cambridge CB3 9EW, England. 3 lnstitutt for Matematiske Fag, Norges Tekniske Hpgskole, Trondheim , Norway.
85
86 with an initial condition y(t) = '(t), -r < t < 0. However, considerably less effort has been expanded in analysing computational algorithms for delay differential equations with variable lag. In the last few years the authors and others have acted to repair this oversight and gain better understanding of numerical algorithms for the equation y'(t) = f (t, y(t), y(0(t))), t > 0, where 0 is a smooth function, 0(0) = 0 and 0 < 0(t) < t for t > 0 [2, 3, 4, 5]. The initial value need be provided just at the origin, y(O) = yo, say. It is instructive to compare equations (1.1) and (1.2), since they pose very different numerical challenges. Firstly, the solution of (1.2) is smooth-indeed, it retains the degree of smoothness of f-unlike the solution of (1.1), which possesses discontinuous derivatives even when f is analytic [1, 7]. Hence, there is no need to keep track in the numerical solution of (1.2) of discontinuities. In this sense, (1.2) is simpler to compute. However, this is offset by the considerably more complicated form of discretized equations, a form that renders them difficult to analyse. For example, compare the solution of the linear equations Y V) = ay(t ) + Qy(t - r) and Y 'M = ay(t ) +Qy(12 t), say, by the trapezoidal rule. We choose a step size h > 0 which is an integral fraction of r, h = r/N, say. It is an easy exercise to derive the underlying recurrences , namely 1+1ha 1h/3 ( yn+1 = 1 - 2 ha yn + 1 - 1 ha yn-N + Y.-N+1)
and 1--ha 4hp Y2n
y2n+1
y2n-1 + 1 - 1 ha(yn-1 +3yn), ha 1 - 12 2 1 + z ha 1h)3 1 - zha y2n + 1 - 2 ha (3yn + yn+1)
Here and in the sequel Yn denotes the underlying numerical approximant to y(nh), n E 7G+.
87 Note that the recurrence ( 1.3) is stationary-its nature does not change with n. Thus, it can be analysed by techniques from the theory of difference equations with constant coefficients. Although this is not always easy and might call for considerable sophistication [11], an investigation of (1.4) is clearly of an altogether different level of difficulty. This state of affairs is further aggravated by paucity of analytic information on equations of type (1.2). There exists a large body of research on qualitative behaviour of (1.1) [1, 7] but only modest inroads have been made into the case of variable delay [5, 9, 8, 10]. Needless to say, there is little use in investigating asymptotic stability of numerical schemes, unless it is known for which choice of parameters the differential equation itself displays asymptotic stability! In this paper we focus on the equation
y'(t) = ay(t) + Qy(zt ) + yy'(Zt),
t > 0, y(0) = 1.
(1.5)
a, Q and ry are given complex constants, except that we require that y # 2-' for all t E 7G+. According to [9], this implies existence and uniqueness of the solution. Equations of this type feature in a multitude of applications [9], but their main significance in the context of this paper is as a paradigm for (1.2) and a convenient test equation. Note that, unlike (1.2), the equation (1.5) is of a neutral type.
It is known that the solution of (1.5) is asymptotically stable if and only if Rea < 0 and 101 < Jal [9]. (Interestingly enough , y has no bearing on asymptotic stability!) In [2] the first two authors considered the question of how well is this qualitative feature maintained when ( 1.5) is discretized by the trapezoidal rule. This has been generalized in [3] to encompass general multistep methods. Stability analysis of Runge-Kutta schemes, the subject matter of this paper , is more difficult , because of the multistage nature of the underlying recurrences. We conclude this introduction with the remark that the authors are keenly interested in extending this approach to more general settings , for instance by considering general proportional delays or indeed general monotone delays as in [4 ]. This will be the subject matter of a future paper.
88
2 Sufficient conditions for stability of a general RungeKutta scheme In this section we shall derive sufficient conditions for the asymptotic stability of a RungeKutta Ansatz for a differential equation which we initially restrict to the delay differential equation Y (t) = ay(t) + Qy(§t), y( 0) = 1,
(2.1)
where a, O E C, in order to make the exposition easier. By asymptotic stability we mean here that the numerical solution yn tends to zero as n - on. We solve (2.1) by a general Runge-Kutta scheme with dense output [6]; this is a new approach to solving (2.1), cf., for instance [2]:
k2t^ =
k2em +1
Yn+1
=
a I y2m + h ), ae,ik2 i_1
I + Q I ym + h i=1
d1,iki,il
i k +i ) (Y2 m+ i+hat a +0 (Ym + h
1
et .i k) ) ;
(2.2)
(2 . 3)
(2.4)
yn+h>btk$ti e=1
First, we consider the z-transforms for the solution vector {y„},°,°_o and the coefficient vectors {k$el }°°=o, f = 1, 2, ... , s: Y(z) ynzn
and
n=0
Kt(z) keel zn, P = 1,2,...,s. n=0
We seek conditions on h, a, p, A {ata}ii=1 and b {be}e=1 such that Y(z) is bounded at z = 1, because then it follows that yn -, 0 as n -r no. This requires us first to find an expression for Y in terms of the parameters of the Runge-Kutta method and a and P. To this end we multiply (2.4) by zn+l and sum up for n E 7G+ (formally, as we do not know yet whether the series converge). This yields the identity
Y(z) =
1 I 1 + hz 2 btKt(z)} . x
t=1
Next we need expansions for the Ke which we shall then require to be convergent at 1 and be such that the sum in (2.5) is -h-1. This ascertains boundedness of Y at 1 and
89 therefore asymptotic stability. Multiplying (2.2) by z2m, (2.3) by z2m+1 and summing up formally for m E 7G+ yields
Kt(z) =
1 a (y2. 'f' h atjk2 ) + N (Ym
M=O Il
>
+hdtik.1 I Z2m
i=1
j=1 /
( l + (Y2m+l + h ar,ik2+1 + ym + h er ikl,"1 m=0
j=1
s
/
z2m+1
j=1
e
aY(z) + ha ar,jKj(z) + /3(1 + z)Y(z2) + h/3 E( dr,j + et,jz)Kj(z2). j=1
j=1
We substitute the value of Y from (2.5), and this produces, after trivial manipulation, the recurrence relation U(z)K(z) = (a + /3)1 + h/3V (z)K(z2 ),
(2.6)
where
and Ut,j(z) = ( 1 - z)bt, i - ha(at,j + z(bj - at,j)), Vt,i(z) =
de,i+z( eta -dt,j)+z2(bj -et,j),
I,1 = 1 ,2,...,s.
Iterating (2.6) yields 00 K(z) = (a + 0) E(h)3)n Qn(Z)l,
(2.7)
n=0
where 1lo(z) = U- 1(z),
ln (z) = U- 1(z)V(z) 1n_ 1(z2), n = 1,2,....
(2.8)
These operations are once again formal, because we do not know yet whether U(z) is invertible, nor have we ascertained the summability of the series in (2.7). The former will be dealt with below, as we will in fact find explicit forms of the inverse of U(z) and its determinant. And the summability of (2.7), especially at one, is of course our main concern in this section as it will, together with bTK(1) = -h-1, imply that Y is bounded at 1 and thus the asymptotic stability of the numerical solution itself. We now employ
90 (2.5) to deduce from (2.8) that r l Y(z) =
1 z S 1 + h(a + Q)z > (hQ)"bT SZn(z)1
ll
n=0
1.
The above construction can easily be extended to the neutral equation y'(t)=ay(t)+ay(zt)+7y(2t), y(0)=1.
(2.10)
Here, the general scheme
k2t) _ a (iim + h i at,jk2+n f + Q (Ym + h
dtk(m j=1
+ 7 (Ym+hf ik); j=1
k2t)+1 =
a ^y 2m+1 + h at.jk2m +1 I +0 ym + h etjk(m) \ \ j=1 / j=1 +7 (Ym+h9t ik); j=1
yn+hbtknt);
Yn+1
t=1
leads us to the relation
U(z)K(z) = (a + 3 + 7)1 + h [i3V(z) t 7V(z)] K(z2),
(2.11)
where Vt,j(z) = ft,j+z(9t,j-ft, j)+z2(bj-9t,j),
Q,7= 1,2,...,s.
Upon iterating (2.11), we obtain K(z) _ (a +)3 +7) E(hf)nQn(z)l n=0
(2.12)
with 1lo(z) = U
_ 1(z) 1n
(z) = U-1(z) [V(z)+ JV(z)] 52n_ 1 (z2), n = 1,2,.... (2.13)
We seek a condition for the convergence of (2.12 )-and therefore of (2.7 )-at the origin. To this end, we note the following identity which can be verified easily: T
(I-deT)-1=I+1-eTd,
(2.14)
91 whenever eTd # 0. When we apply (2.14) to the expression U(z) = (1 - z) (I - haA) - hazlbT = (I - haz 1 [(I - haA)-1 b] T\I (I - haA), 1-z / we obtain U-1(z) = 1 1 z (I - haA)-1 J(z), (2.15) where haz J(z) I+ 1 - zR(ha) (1bT) (I - haA)-1. Expression ( 2.15) shows in particular the invertibility of U(z) for z # 1, but we shall also find, at the end of this section, an explicit expression for its determinant. As a consequence of (2.15), the entries in the sum (2.12) for z = 1 can be bounded by m h' 101'
li
m II U-1(z)[V(z) + 2V(z)]II" = h"I/3I" II(I - haA)-1IIn in
where
II J(z)[H(z)JH(z)]II", (2.16)
z H(z) = D + zE + z 1b T, 1-z
and fI(z)=F+zG+1
zlbT
II ' II being an arbitrary matrix norm. We shall require that (2.16) is less than 1 for convergence of the series at z = 1. In order to estimate the size of IIJ(z)[H(z) + jk(z)]II near z = 1 we note that H(z) = 1 z2 z lbT + 0(1), z -+ 1, (2.17) and that the same estimate holds for H. We also note that , for small enough h, haz 1bT (I - haA)-11 1 1- zR(ha) _ 1-z 1 1- zR(ha) ,
J(z)1 =
(I +
where we have exploited the explicit representation
R(ha) = 1 + habT(I - haA )-11, (2.18)
92 which follows at once from the definition of the Runge-Kutta method. Therefore we obtain, together with (2.17), for any vector v and for z near 1 (1 + p)bTV1 J(z)[H(z) + JH(z)]v =
1 - R(ha)
+ 10(11 - zI). (2.19)
Expression (2.19), in combination with (2.16), means that we need to require
(hl al + hl7i) IIbIIgiiviig
(2.20)
11- R(ha)I(l - hIaIIIAII) <
where s + 9 = 1 and where IIvii = 1 in the norm associated with the matrix norm we have chosen above. For example , when the entries of b are positive , so that
then q' =
Ilblll
= 1, and we choose q = 1,
oo , II II = II III and (2.20) reads
h IQI + h 17I I1- R(ha )I (1- hIaIIIAII) < 1. We now have the following result: Theorem 1. The solution {yn}' o of the neutral differential equation (2.10) satisfies the stability property yn -* 0, n - oo, if (2.20) is satisfied. Proof. We only need to remark that bTK(1) = -h-1, as required by (2.5) in order to make Y(z) bounded at z = 1. Specifically, (2.19) and (2.18) provide the following identities:
bTK(1 ) = ( a +0 +7)bT >(h,0)" [(I -haA)-'G( 1)(H(1)+ JH ( 1))]nU-1(1)1 n=0
= (a + 3+7)bTU-1(1)1 + (a +,0 +7)(h/3)bT (I - haA )- 1G(1)(H(1) + "H(1 )) U-1(1)1 + (a + R +7)(h /3)2bT(I - haA)-1G(1)(H ( 1) + 3H(1 ))(I - haA)-1 x G(1)(H ( 1)+ 3H(1 )) U-1(1)1
_ -(a+Q+7)h «+(a+Q+7)h -(a+Q+7) h«r+. _ l +)0+ h^ 1 + _ -h-1,
93 as required. Here we have used in particular the identity U-1(1)1 = -11 and bT(I - haA)-1G(1)(H(1)3II(1))1 = -(1 + J)• The theorem is proved.
❑
We also remark that (2.12) converges in fact everywhere on the unit circle if h is small enough (but we have no specific estimates on h yet which give sufficient conditions on the size of h for convergence ). The reason for this is the simple observation that ( 2.12) can be rewritten , after simple manipulation, as
;i'
1 K(z) = (a + /3 +7) :t (h/)- [ t=o 1 - z2`R(ha) n=o
j(Z) =
J
J (z2`) 1,
(I - haA)-1 Iq
J
[(1 + 1)z2lbT + {( 1 - zR(ha )) I + hazlbT (I - haA)-1}
x [D + zE + JF + 'fzG]] (I - haA)-1. Because J(z) 1 - zR(ha) can be uniformly bounded for all z with Jzj = 1, it follows that (2.12) is convergent for small enough h. It is also of interest to evaluate the determinant of the matrices U(z) explicitly: let bT bT u(z) := det U(z) = det
(1 - z)(I - haA) - haz bT
Let us consider first dmu(1)/ dz76 for m E {O, 1,...,s - 2}. Clearly, each mth derivative of u is a sum over determinants of the form (di,/dz")ui (die / dz3 2 )u2 (di,/ dz'' )u; ,
94 where u1,... , u, are the rows of U and E;_1 j, = m. Since m < s - 2, necessarily j; = 0 for at least two distinct values of i E {1 , 2, ... , s}. Thus , as z = 1 , at least two rows in each such determinant become bT and the determinant vanishes . We conclude that dmu(l)/ dz' = 0 for m < s - 2, therefore , u being a s-degree polynomial, it is of the form u(z) = (1 - z )'- 1(0 - pz). It is trivial to identify a:
u(O) = det (I - haA) = Q(ha), where P(z)/Q(z) is the linear stability function of the underlying Runge-Kutta method. To find p we consider u for jzj > 1. We have u(z) = (-z)' det U '( ha)(1 + o(1)), where bT U'(w) := I - wA+ w bT hus, w(a l,, - b 3) -w(a2,, - b,)
1 - w(a1,1 - bi) -w(a2,1 - b2) -w(a2,1 - b1) 1 - w( a2,2 - b2)
det U'(w) =
0
det 1-w(a,,,-b,) b,
-w(a,,1 - b1) -w(a,,2 - b2) b1 b2 =
0
I - wA -wl
det
bT
0 1 (2 . 21)
1
Since Q(w) = det (I - wA), it is true that Q(w) det (IOT `^)-l 0
1
J
1.
We multiply (2.21) by the latter expression. Hence, a product of determinants being the determinant of a product, det U* (w) =
I - wA -w1 Q(w) det bT 1
1
(I - wA)-1 0 det I OT 1,
det L I -WI 1 = Q(w ) L bT (I - wA)-1 1
95 Let I -wl F, := del [ fT 1 , where f, = [ f1, ... , f, ]. Expanding F, in the penultimate column , we derive at once the recurrence F, = Fa-1 + f,w,
therefore , by induction , F, = 1 + wf,T1. Letting f, = bT(I - wA)-1, we obtain det U`(w) = Q(w) (1 + wbT(I - wA)-11) = P(w), since = 1 + wbT(I - wA)-11. Q(w^ It follows that a = P(ha), consequently u(z) = (1 - z )'-'(Q(ha) - P(ha)z ).
(2.22)
As a final observation we point out that this work can be generalized to the neutral differential equation y'(t) = ay(t) +Qy(L-1t) + yy'(L-1t), t > 0, y(O) = 1, where L > 2 is an integer , in the same vein as in [2]. We shall, however, dispense with this generalization here.
3 An example with one collocation point In this section, we solve (2.1) by the one-stage Runge-Kutta scheme with collocation at c E [0,1]: yn+1 = yn + hkn,
where k2n = a ( y2n + chk2n ) + Q(yn + 2 chkn), k2n+1
a(y2n+1 + chk2n+1) +Q(yn + 2(l + c)hkn)•
96 In this instance , it is easier to find conditions on h, a, Q and c for the stability of the numerical solution . Our approach gives (2.5) where we have ((1 - cha) - (1 + (1 - c)ha)z) K(z) = a + Q + Zh,3(1 + z) (c + (1 - c)z) K(z2)
(3.2)
or K(x) =
a + Q 1 (1 +hQZ) (c + (1 - c)z) 2 ). K (z 1 - cha - (1 + (1 - c)ha)z + 2 1 - cha - (1 + (1 - c)ha)z
( 3 .3 )
We therefore have formally a + a n-1
(1 + x2') (C + (1 - C)z21)
1 K(x) = n-1 1 - cha - (1 + (1 - c)ha)x2" 1 2ha1 - cha - (1 + (1 - c)ha)z2`
(3.4)
Due to the simplicity of this approach we may study the convergence of this series on the whole unit circle, in analogy to the analysis in [2), which will give not only stability of the numerical solution but in fact square-summability of the {yn}0 o. This is because if K(z) is convergent almost everywhere on the unit circle and square-integrable, then so is Y(z), and therefore the coefficients of Y(z), namely {yn}°°_0i have to be square-summable and thus in particular yn = 0(1). In order to conclude square-integrability of Y from square-integrability of K, it is important to note that K(1)=-a+Q a+Q -h-1. han=O a) ha(1 + L) Therefore , for the convergence of K(z) at z = 1 we require that
10 1 < lal, whereas, according to the Mean Ergodic Theorem , applied in the same fashion as in the aforementioned paper , it suffices for the product n^-71
( 1 + z2' ) (c + (1 - c)z21)
(3.6)
1 1 Zh11 - cha - ( 1 + (1 - c)ha )z2` to decay exponentially almost everywhere on IzI = 1 that it is true that 1-c ZhO1 - cha
<
1
(3.7)
ifc<1and, ifc>2, 1 I s hf3c 1-cha
<
1.
(3.8)
97 Adding the neutral term ryy'(2t) to ( 2.1) requires the following modifications to our RungeKutta scheme with one collocation point: It is now (3.1) where
k2n k2n+1
=
01 ( y2n
+ chk2n ) +
13(yn
+ 2chkn) +
7'kn,
a(y2n+1 + chk2n+1 ) + N(yn + 2(1 + c)hkn) +7kn•
This gives ((1 - cha ) - ( 1 + (1 - c)ha )z) K(z) = a + 0 + [2i @( 1 + z) (c + (1 - c)z) + 7(1 - z2), K(z2) or
K(z) = a+Q +1 h1(1+z)(c+(1-c)z) K(z2)+7(1-z2)K(z2). 1-cha-(1+(1-c)ha)z 2 1 - cha - (1 + (1 - c)ha)z We therefore have formally K(z) = °° a+Q n7-^12h/(1+z2`)(c+(1-c)z2`)+7(1-z21+1) 1-cha-(1+(1-c)ha)z2n t=1 1-cha-(1+(1-c)ha)z2` Now proceeding in the same way as above, we come to the conclusion that the series converges almost everywhere on the unit circle if (3.5) is true and, according to whether c < 2 or not, the following two inequalities instead of (3.7) and (3.8) are true: 1 1-c+ P 1-cha <1 (3.9)
2hd and
2ha(c+V) 1-cha <1.
(3.10)
We formulate our result in the following theorem: Theorem 2 . Let (3.5) holds and let furthermore (3.9) or (3.10) hold, depending on whether c < 2 or c > 2. Then the solution vector {yn}°n Q of (3.1) is square-summable, so in particular limn-^ yn = 0.
❑
98 References [1] Bellman , R. and K.L.Cooke (1963), Differential-Difference Equations , Academic Press (New York). [2] Buhmann, M.D. and A. Iserles (1992), "On the dynamics of a discretized neutral equation", IMA J. Num. Anal. 12, 339-363. [3] Buhmann , M.D. and A . Iserles ( 1992 ), "Numerical analysis of functional equations with a variable delay ", in Numerical Analysis, Dundee 1991 ( D.F. Griffiths and G.A. Watson, eds), Longman (London ), 17-33. [4] Buhmann , M.D. and A . Iserles ( 1992 ), "Stability of the discretized pantograph differential equation ", Mathematics of Computation, to appear. [5] Feldstein, A.M., A. Iserles and D . Levin ( 1991 ), "Embedding of delay equations into an infinite-dimensional ODE system", Cambridge University Tech . Rep. DAMTP 1991/NA21. [6] Hairer , E., S.P. NOrsett and G . Wanner ( 1987 ), Solving Ordinary Differential Equations I.• Nonstiff Problems , Springer-Verlag (Berlin). [7] Hale, J. (1977), Theory of Functional Differential Equations, Springer-Verlag (New York). [8] Iserles, A. (1992 ), "On nonlinear delay differential equations ", Cambridge University Tech. Rep. DAMTP 1992/NA13. [9] Iserles, A. (1993), "On the generalized pantograph functional-differential equation", Europ. J. Appl. Maths, to appear. [10] Iserles , A. and J. Terjeki (1992), "Stability and asymptotic stability of functionaldifferential equations", Cambridge University Tech. Rep. DAMTP 1992/NAl. [11] Liu, M.Z. and M . N. Spijker ( 1990 ), "The stability of the 8-methods in the numerical solution of delay differential equations ", IMA J. Num. Anal. 10, 31-48.
11
WSSIAA 2( 1993) pp. 99-111 ©World Scientific Publishing Company
General linear methods for the parallel solution of ordinary differential equations by J. C. Butcher Department of Mathematics and Statistics University of Auckland Auckland, New Zealand In memory of L. Collatz Abstract The special type of general linear method known as a DIMSIM can be specialized to parallel computation by restricting attention to type 3 (for non-stiff problems) and type 4 (for stiff problems). In this paper we consider the selection of practical methods from these large families of methods, with special reference to type 3 methods. 1. Introduction . In [1] the special class of general linear methods known as diagonallyimplicit multi-stage integration methods (DIMSIMs) was introduced. As the name suggests, these methods have a similar structure to diagonally-implicit Runge-Kutta methods as regards the relationship between stage values and stage derivatives expressed by the matrix A. Thus they are cheap to implement . However, they do not suffer the disadvantage of low stage order because, unlike Runge-Kutta methods with diagonally-implicit structure, they are multivalue methods and are capable of achieving a high stage order. In type 3 and 4 DIMSIM methods, A is either the zero matrix or a diagonal matrix. Thus, the stages can be evaluated entirely in parallel and this is the feature which gives them a special interest . In section 2 of this paper we will discuss type 3 and 4 DIMSIMs in a general way and indicate some of the features which will guide the search for acceptable methods in the rest of the paper. In section 3, we will discuss a special function whose Pade approximation is related to the selection of type 3 DIMSIMs. In section 4 we will explain how specific methods may be derived using the results of section 3. The introduction of DIMSIM methods, and of type 3 methods in particular, is very recent. For this reason , many questions of implementation remain to be investigated and there are no numerical comparisons yet available. The influence of the late Professor Collatz goes well beyond his famous book [3] and his numerous other publications. Those who have been guided by him, as I have, know that numerical analysis without algorithms and numerical experience is incomplete. In dedicating my paper to the memory of this great mathematician, I am accepting an obligation to provide, in a later publication, a more practical account of these promising numerical methods. Type 4 methods, which are intended for the solution of stiff problems on parallel computers, will also be the subject of a further paper. These methods are particularly interesting in that there seems to be no limit to the order that can be achieved without any sacrifice of A-stability.
2. Type 3 and 4 DIMSIMs . Consider the general linear method characterized by the partitioned matrix C = Cii Cie 1 (2.1) IC21 C2z '
99
100 where, for convenience , we will write the submatrices as
C 11 =A, C12 =U, C21 = B , C22=V. Throughout this paper we will assume that the number of incoming approximations r is equal to the number of stages s. We will also asume that U = I and that the stage values, as well as the final results in a step , are computed to order p = r = s. In this case , it was shown in [1] that B=Bo-AB1 -VB2+VA,
(2.3)
where Bo, Bl and B2 are certain matrices whose elements depend on the vector c of offsets of stage approximations. For type 3 methods A = 0 so that ( 2.3) simplifies to
B = Bo - VB2 (2.4) and one of the aims of this paper will be to consider the choice of V which, along with a suitable v, yields methods with suitable stability regions and other characteristics. From the outset we will assume that V is a matrix of rank 1 (as is the case for Runge -Kutta and Adams methods). This means that the stability of the method is determined by the polynomial q5(w, z) = det (wI - (V + zB)),
(2.5)
in the sense that the interior of the stability region is the set of points in the complex plane given by {z EC : ¢(w,z) = 0 = IwI < 1}. (2.6) If V = evT, where VTe = 1, as implied by the rank 1 assumption together with consistency of the method , then q(w, z) necessarily takes the form 4lw, z) = wr - (so - r iz)wr -1 - z(al - Q2z) wr-2 - ... - zr-1 (ar-1 - NrZ), (2.7)
and the coefficients ao, al , ..., 0r-l, Ql, N2, ..., Pr depend on the choice of the vector v. Note that , because of the relation, 0 (exp(z), z) = O(zp+ l ),
(2.8)
which follows from the order conditions for the method, there are, in the case that p = r, exactly as many free coefficients as there are available parameters vi, v2 , ..., or to be chosen. This means that once a stability function has been chosen there will be a choice of method which will yield this stability. Amongst the options available, we will discuss the choice of the coefficients in 0 which will allow a factorization (2.9) O(w, z) = wr-r4,( w, z), for as low a value of i as possible . A choice based on this principle will make the methods as close to what can be achieved for type 1 methods as possible. In addition to this option, we
101 will consider the approach in which all of al, a2, ..., ar _ 1 are available as parameters and the choice is made which gives , in some sense, the best stability region. For type 4 methods, A takes the form
A=AI so that B=Bo-AB1-VB2 +AV
(2.10)
and the stability is determined from the polynomial 3(w, z) = det (( 1 - Az)wI - (V + z(B - AV))) = det ((1 - Az)wI - (V + z(Bo -.B1 - VB2))).
(2.11)
Making the same rank 1 assumption V = evT, as for type 3 methods, we find that ^ now takes the form ^(w, z) =(1 - Az)'w' - (ao - A/ilz)(1 - Az)r-lwr-1 -
Az(a 1
- ./32z
)(1
-
Az )r - 2wr-2
(2.12)
- ... - Xr-lZr - 1 (ar_1 A NZ),
It will be convenient in the analysis to make the substitution z --i )-lz and to write 4,(w, Az) = 4,(w, z), where 4(w, z) =(1 - z)rwrqq- (ao -, 6 ,z)(1 - z)r-lwr-1 # - z(al - N2z)(1 - z)r-2wr- 2 - ... - z'-1(ar -1 -
(2.13) ,Z)-
Because of the resealing of z, we must replace (2.8), in the case of d defined by (2.13), by d (exp(tz), z) = O( zp+1), (2.14)
where t = 1/). The main stability questions with type 4 methods, which are intended for stiff problems, are that the stability region includes all points in the left half complex plane , with perhaps some stronger property at infinity, and these properties are invariant under the substitution z - A-lz. Hence, we will use ¢, rather than 4, in stability discussions.
Just as for type 3 methods , we could attempt to choose free parameters so that 4(w, z) = w'- '4,(w, z), (2.15) for r" as low as possible. An alternative approach would be to choose al, a2i ..., ar _ 1 so that all of N1, Q2, ..., /3r are zero and to consider if A-stability, combined with perfect stability at infinity, is possible.
3. Stability polynomials for type 3 methods. Write (2.7) in the form sn - ao
+ al(z/w) + a2(z/w)2 +... + ar-1(z/w)r-1 1 +
#,(Z/W ) + N2(Z /w)2 +... + Nr ( z/w)r
= F(z/w),
(3.1)
102 say, so that (2.8) implies that the rational function F is an approximation, correct to within 0 ((z/w)P+1) to the function F defined in a neighbourhood of 0 by the functional equation exp(z) = F (z exp (-z)).
(3.2)
Our aim now will be to study the function F, with a view to finding rational approximations with numerator and denominator degrees as low as possible for a given order of approximation. In particular , we will study the Pade approximations to F.
The following result tells us at least how to find the first row and column of the Pade table for this function. Theorem 3. 1. Let D denote the open disc in the complex plane with centre 0 and radius exp(-1), then the function F defined in D by the series F( Z ) -
E
I (n + )n-1 Zn>
n=0
satisfies the functional equation exp(z) = F(zexp(- z)),
(3.4)
for z such that z exp(-z) E D. Furthermore, the function G defined in D by the series -1
G(Z) = 1 -Z- (" n )n
(3.5)
Z n,
n=2
satisfies the equation F(Z)G(Z) = 1,
(3.6)
for Z E D. Proof. Multiply both sides of (3.2) by Z = z exp(-z) so that z = ZF(Z), where the variables z and Z are related by z = Z exp(z). The relation between z and Z can now be written using the Lagrange series (see for example [4, 5]). In general, if z = ZO(z), where +/i is analytic for z in a neighbourhood of 0, then for IZI sufficiently small, M X(z) = E 7.Z", n=0
103 where, yo = X(0) and, for n = 1, 2, ..., n-1 N = -
zn_1
(X1(Z)^'(z)n) s=0
In the case r/^ = X = exp, it is found that yo = 1 and that 1 do-1 yn = W, dz 1 exp ((n + 1)z)
:=o (n + 1)n-1 n! '
in agreement with (3.3). Using standard analysis, it is found that the radius of convergence is 1/e so that, by analytic continuation, the set where (3.3) holds includes D. To prove (3.5), choose +/i = exp and X(z) = exp(-z). It is found that yo = 1 and that 1
1
yn = n1 dzn_
1 (- exp((n - 1)z)) s=0
(n - 1)n-1
\ n!
Again, standard analysis shows that the radius of convergence for the series representing the function G has radius of convergence 1/e. ■ Having found the first row and column of the Pade table for F, it is a straightforward but tedious matter to compute the rest of the table up to some maximum degrees. Denoting Pmn by the Pade approximation with degrees m (denominator) and n (numerator), these approximations for 0 < m < 5 and 0 < n < 5 are as follows
P0o(Z) = 1, P01(Z)=1+Z, P02(Z) = 1 + Z + 3Z2/2, P03(Z) = 1 + Z + 3Z2/2 + 8Z3/3, P04(Z) = 1 + Z + 3Z2/2 + 8Z3/3 + 125Z4 /24, Po5(Z) = 1 + Z + 3Z2/2 + 8Z3/3 + 125Z4/24 + 54Z5/5, P10(Z) = 1/(1 - Z), P11(Z) = (1 - Z/2)/(1 - 3Z/2), P12(Z) = (1 - 7Z/9 - 5Z2/18)/(1 - 16Z/9), P13(Z) = (1 - 61Z/64 - 29Z2/64 - 101Z3/384)/(1 - 125Z/64), P14(Z) = (1 - 671Z/625 - 717Z2/1250 - 832Z3/1875 - 4819Z4 / 15000 )/( 1 - 1296Z/625), P15(Z ) = (1 - 9031Z /7776 - 5143Z2 /7776 - 2983Z3/5184 - 3239Z4/5832 - 426679Z5 /933120)/( 1 - 16807Z /7776),
104 P2o(Z ) = 1/(1 - Z - Z2/2), P21(Z) = (1 - 4Z /3)(1 - 7Z/3 + 5Z2/6), P22(Z ) = ( 1 - 19Z/ 10 + 17Z2 /60)(1 - 29Z/ 10 + 101Z2/60), P23(Z ) = ( 1 - 1159Z/505 + 1193Z2 /2020 + 133Z3 / 1212)/( 1 - 1664Z/505 + 4819Z2 /2020), P24(Z ) = ( 1 - 473Z/ 183 + 13Z2 / 15 + 147Z3 /610 + 1673Z4 /21960 )/( 1 - 656Z/183 + 5401Z2 / 1830), P25(Z) = (1 - 8381043Z /2986753 + 899291Z2 /814569 + 2196995Z3 /5973506 + 12492787Z4 /71682072 + 120832Z5 / 1701315 )/( 1 - 11366896Z /2986753 + 61105001Z2 / 17920518), P30(Z ) = 1/(1 - Z - Z2 /2 - 2Z3/3), P31(Z ) = ( 1 - 27Z/ 16)/(1 - 43Z/ 16 + 19Z2 / 16 + 17Z3/96), P32(Z ) = ( 1 - 228Z/85 + 451Z2 /340)/(1 - 313Z/85 + 1193Z2/340 - 133Z3/204), P33(Z ) = ( 1 - 623Z/ 190 + 123Z2 /50 - 1927Z3 / 11400)/(1 - 813Z / 190 + 4977Z2/950 - 18881Z3/11400), P34(Z ) = ( 1 - 3458303Z/925169 + 3250675Z2 /925169 - 12216739Z3 /27755070 - 798983Z4 / 15860040 )/( 1 - 4383472Z/925169 + 12492787Z2 / 1850338 - 483328Z3/1756653), P35(Z ) = ( 1 - 31713823Z/7733248 + 19037684121Z2 /4276486144 - 9682611023Z3 /12829458432 - 3536864687Z4 /25658916864 - 14189787721Z5 /513178337280 )/( 1 - 39447071Z/ 7733248 + 2152324073Z2 /267280384 - 98084038991Z3 /25658916864), P40(Z ) = 1/(1 - Z - Z2 /2 - 2Z3/3 - 9Z4/8), P41(Z) = ( 1 - 256Z/ 135)/(1 - 391Z/ 135 + 377Z2 /270 + 38Z3/135 + 451Z4 /3240), P42(Z ) = ( 1 - 104Z/33 + 233Z2 / 110)/(1 - 137Z/33 +787Z2 /165 - 133Z3/110 - 329Z4/3960), P43(Z ) = ( 1 - 381096Z/94423 + 848073Z2 / 188846 - 40532Z3 /34545)/( 1 - 475519Z/94423 + 757921Z2 /94423 - 12216739Z3 /2832690 + 798983Z4 / 1618680), P44(Z ) = ( 1 - 7430297Z /1597966 + 1018440443Z2/156600668 - 1260595681Z3 /46980200 + 974868241Z4 /9396040080)/( 1 - 9028263Z/1597966 + 1668309215Z2/156600668 - 3536864687Z3 /469802004 + 14189787721Z4/9396040080),
105 P45(Z) = (1 - 657604959679Z/ 127708089489 + 4275832163599Z2 /510832357956 - 15869750530913 Z3/3575826505692 + 1983576598463Z4 /6129988295472 + 5398089761801Z5 /214549590341520)/(1 - 785313049168Z/ 127708089489 + 6650835823337Z2 /510832357956 - 9744497644432Z3 /893956626423 + 123766944830293Z4 /42909918068304), P5o(Z ) = 1/(1 - Z - Z2/2 - 2Z3 /3 - 9Z4 /8 - 32Z5/15), P51(Z) = (1 - 3125Z/1536 )/( 1 - 4661Z/ 1536 + 2357Z2 / 1536 + 359Z3/1024 + 533Z4 /2304 + 9553Z5 /61440), P52(Z ) = ( 1 - 698560Z/200613 + 3304241Z21203678 )/( 1 - 899173Z/200613 + 3446881Z2 /601839 - 2011013Z3 / 1203678 - 282691Z4 / 1604904 - 10133Z5220185), P53(Z ) = ( 1 - 2958135Z/648512 + 143001161Z2 /23265368 - 2553014437Z3 / 1116737664 )/( 1 - 3606647Z /648512 + 1899932561Z2 /186122944 - 3807303853Z3 /558368832 + 1260595681Z4/1116737664 + 974868241Z5 /22334753280), P54(Z ) ( 1 - 47306490920Z/8773814169 + 37036845053Z2 /3899472964 - 41047808321Z3 /6824077687 + 2872158214405Z4 /2948001560784) /( 1 - 56080305089Z/8773814169 + 505009940819Z2 /35095256676 - 3312529329503Z3 /245666796732 + 1983576598463Z4 /421143080112 - 539808976180125 / 14740007803920), P55(Z) = ( 1 - 64958411931481Z/ 10796179523602 + 602627559979987Z2 /48582807856209 - 3917670006940357Z3 /388662462849672 + 4808893764386087Z4 / 1813758159965136 - 10585540152954929Z5/163238234396862240) /( 1 - 75754591455083Z / 10796179523602 + 870649009743547Z2 /48582807856209 - 869838597842227Z3 /43184718094408 + 5692390072282141Z4 /604586053321712 - 213859991604212971Z5 / 163238234396862240), For the approximation Pm., the corresponding stability polynomial has order p = m + n and degree in the w variable given by r" = min(m, n + 1). Hence, to obtain order p, it is necesary that r" = [(p + 2)/2]. For p odd, there is only a single best choice given by P(p+l)/2,(p-1)/2 . However, for p even, there is a family of possible best choices. These can be written in the form P(Z) = Npl2,p/2(Z) + 9ZNp/ 2,(p-2)/2 (Z) (3.12) D,/2,p12(Z) + 9ZDp/2,(p-2)/2(Z)
106 where for each m and n, N-n and D-n are the numerator and denominator respectively of P,nn and 0 is a real parameter. An investigation of the stability regions of the methods defined in this way has given disappointing results. This has lead to the suggestion that methods corresponding to members of the Fade or to those of the form (3.12) may not be as acceptable as hoped for. Accordingly, another choice has also been considered and this gives more promising results.
The new approach is to regard al, a2, ..., ar_l as free parameters and to choose Nl, Q2, ..., /3r in terms of them so as to satisfy the conditions for order p = r. The criterion for choosing al, a2, ..., ar_1 can be based on obtaining a large stability interval [-X, 0]. One way of doing this is to force the stability polynomial evaluated at -X to have all its zeros on the unit circle and we will report on some investigations along these lines. We first note that if N(Z) = 1 + a1Z + a2 Z2 +... + ar_1Zr_1
and
D(Z) = 1+/j1Z +/3222+...+/3rZ", then for order r it is necessary that D(Z) = N(Z)Pro(Z) + O(Zr+l).
(3.13)
this determines D in terms of the coefficients in N. Having selected D in this way, we note that the stability polynomial for the method, evaluated at z = -X is given by wr(wD(-X/w) N(-X/w)). A necessary condition for this to have all its zeros on the unit disc is that this polynomial is either symmetric (the coefficient of wk is equal to the coefficient of wr-k for k = 0, 1, 2, ... , [(r - 1)/2]) or anti-symmetric (the coefficient of wk is equal to minus the coefficient of wr-k for k = 0, 1, 2, ... , [(r - 1)/2] with the coefficient of wr/2 equal to 0 if r is even).
In the case of r = 3, it is found that D(Z) = 1 + (al - 1)Z + (-al + a2 - 1)Z2 + (- 2 - a2 - 2)23, so that the stability polynomial for the method is w3-(1+(1-al)z)w2-(alz+(2+a1-a2)z2)w-(a2z2+(3+ 2 +a2)z3). (3.14)
For this polynomial to be symmetric in the coefficients of w when z = -X, it is found that al and a2 must be chosen as _ 12-3X-7X2 al 12-18X+9X2' _ -24 + 12X - 6X2 + 22X3 - 5X4 a2 24X2 36X3 + 18X4 and it turns out that for the three zeros of (3.14) to be on the unit circle when z = -X it is necessary and sufficient that X E [0.9375,1.52719],
107 where the ends of the interval are equal respectively to 15/16 and to the only real zero of the polynomial 16X3-51X2+72X-48. A numerical investigation of the stability regions corresponding to stability polynomials defined in this way indicates that as X increases above a value of approximately 1.35, the width in the imaginary direction of the stability region for Re(z) -_ -1 decreases to the extent that whatever benefits would seem to follow from a long stability interval are lost. A suggested value of X is 4/3 leading to al = -10/9 and 0`2 = 179/144. In this case the values
of the ,Bk are given by 19 89 p 65 91 =-9, /32=48, N3=-48' The stability region defined by this choice is given in figure 3.1. A similar calculation but with antisymmetry rather than symmetry, does not lead to any non-empty stability intervals. For order 4, the criterion that at z = -X, the stability polynomial is antisymmetric in its coefficients of powers of w if al, a2 and a3 are given by
48 - 36X + 40X2 - 61X3 al 48X - 96X2 + 64Y3 -- ' -48 + 84X - 52X2 + 53X3 - 29X4 a2 = 48X2 - 96X3 + 64X4 ' -288 + 288X - 240X2 + 204X3 + 228X4 - 101X5 a3 = 288X3 - 576X4 + 384X5 For the stability polynomial actually to have three zeros on the unit circle when z = -X, it turns out to be necessary that X E [0.711018,1.21516], where the ends of the interval are respectively ( 36 + 2 699 )/125 and the only real zero of 125X3 - 328X2 + 372X - 192. Just as for r = 3, the stability region becomes unacceptable when X is chosen near the upper end of this interval and the value of X = 1 seems to be about the best choice available. In fact the corresponding values of al, 02 and a3 turn out to be
9
_ 91 a3 96The stability region corresponding to this case is shown in figure 3.2. For orders 5 and 6, a similar approach has been investigated. That is, for order 5 at z = -X, the stability polynomial was assumed to be symmetric and for order 6, it was assumed to be antisymmetric. In each case, there was a free parameter to be selected and this was chosen to ensure that for z = -X, stability was actually achieved. Results for these orders will be presented in a later paper. Further investigations by Peter Johnston have taken these results considerably further and these will also be announced in a later paper.
108
Figure 3.1 The stability region for a method with p = 3.
Figure 3.2 The stability region for a method with p = 4.
^w^ w.ldk
109 4. The construction of special type 3 methods Consider the representation of a type 3 DIMSIM with p = q = r = s in terms of the transformed matrices, A = 0, U = I, B and V, given in [2], where 1 V2 V3 ... V. 0 0 0 ... 0 0 0 0 ... 0
V=
L0 0 0 ... 0 J It is a simple matter to write the stability polynomials in terms of v2, v3, ..., Vr. In fact, we have
¢(w, z) = det (wI - (V + zB)) , = det (wI - (V + z(Bo - VB2))) . (4.1) To relate ¢(w, z) as given by (4.1) with the form given by (2.7), it is enough to show how ak, for k = 1, 2,.. ., r - 1 can be written in terms of vk, k = 2, 3, ... , r. Since ao = 1 and Qk, k = 1, 2, ... , r are given by (3.125), this determines 6(w, z) as given by (2.7) completely. On the other hand, if ak, k = 1, 2, ... , r - 1 are to be chosen first by stability considerations, it will be possible to determine vk, k = 2, 3, ... , r to achieve this choice. In fact we will establish the result
Theorem 4.1. There exists a lower triangular matrix L of the form r 0! 0 0 ... 0 121 1! 0 ... 0 0 L= 131 132 2! ... 0 I. Ir1
(r-1)!
Ir2 Ira
such that 1 V2
ao
=L
al
Vr
(4.2)
ar-1
Proof. This is proved by noting that det(wI - V - zBo + zVB2) = det(wI - zBo) + det(D(wI - zBo) - V(I - zB2)), where D = diag(0,1,1, ... ,1) and that all terms in the expansion of this polynomial for which the sum of the degrees in w and z is less than r, are given by the terms in the expansion of
det(D(wI - zBo) - V) =
-V1
-V2
-z
w + zm22
0 0
-V3
...
...
-Vr
-z/2
zm23 w + zm33
...
zm2r zm3r
0
0
•••
w+zmrr
110 where v1 = 1 and the coefficients mid, for 2 < i < j < r, are identical with the corresponding elements of - Bo. The coefficient of vk , k = 1, 2, ... , r is found to be zk-1
^wr
-k + zg(w, z))
where the polynomial g(w, z) is homogenous of degree r-k-1. Thus, ak-1 equals - vk/(k-1)! plus a linear combination of vj , j < k. This is equivalent to (4.2). ■ The actual construction of the matrices B and V in a type 3 method is now straightforward. The first step is the selection of the c vector, indicating the values of the abscissae in a step . The matrices Bo and B2 are then evaluated as functions of c using the results of [2]. Having selected the values of ao , a1, ..., ar_l , according to stability requirements, the vector v is found using theorem 4.1. This gives V and hence B = Bo - VB2 . Finally, B and V are found by reversing the transformation discussed in [2]. In carrying out this derivation for an r = p = 4 type 3 method , we find two natural choices for c. The first of these is given by cT = [0,1 / 3, 2/3,1 ], corresponding to a method generalizing explicit Runge- Kutta methods . A second choice is given by cT = [-2, -1, 0,1], giving a method resembling Adams methods but in which approximations to the solution at grid points are subject to recalculation in parallel with preliminary approximations at later grid points . It is found that details of the two methods found in this way are
0 1/3 _ B 2/ 3 ' 1
c=
V -
-45347/3456 21541 / 1152 10229 /115 2 - 56659 / 3456 -45779/3456 22133 / 1152 3095/384 -54019/3456 -48419 /3456 8503 /384 4597 / 1152 -46291/3456 -56147/3456 34933 / 1152 - 7483 / 1152 -29443/3456 - 7133 /96 -7133/96 - 7133 /96 - 7133 /96
23669/96 23669/96 23669 /96 23669/96
and c
-2 -1 0 1
_ ' B
V -
-1313/ 1152 -1361/1152 -1313/ 1152 -1745/ 1152 - 187/96 -187/96 - 187/96 - 187/96
-26497/96 -26497/96 -26497/96 -26497/96
10057/96 10057/96 10057/96 10057/96
253 / 128 -569 /384 -2641/1152 967 /384 -361 /384 -2689/1152 887 /384 -19/ 128 -2257/1152 493 / 128 -1001 /384 383/1152
707/96 707/96 707/96 707/96
-943/96 -943 /96 -943/96 -943/96
173/32 173/32 173/32 173/32
The second of these seems to have potential advantages over the first because of the abscence of elements in B and V as large as those occurring in the first method. Even for such low orders as 4, we see that there is a bewildering range of possibilities available . How c should best be chosen is clearly an open question . The choice of a good stability region from the choices that are available , balanced against error constants for the individual stages as well as for the method overall , is also a question requiring more consideration than is possible in this brief introductory paper.
111 References [1] J. C. Butcher, Diagonally implicit multistage intgration methods, to appear in App. Numer. Math.
[2] J. C. Butcher, A transformation for the analysis of DIMSIMs, submitted to BIT [3] L. Collatz , Numerische Behandlung von Differentialgleichungen , (1955), English translation: The Numerical Treatment of Differential Equations, (1960), Springer, Berlin. [4] G. Polya and G. Szego, Aufagben and Lehrsfitze aus der Analysis I, (1925), Springer, Berlin. [5] E. T. Whittaker and G. N. Watson, A Course of Modern Analysis, (1915), Cambridge University Press.
WSSIAA 2(1993) pp. 113-125 ©World Scientific Publishing Company
ITERATED DEFERRED CORRECTION ALGORITHMS FOR TWO-POINT BVPs
J.R. CASH Department of Mathematics, Imperial College 180 Queen's Gate, London SW7 2BZ, England
ABSTRACT When implementing a discretization algorithm for the efficient numerical solution of two-point boundary value problems it is necessary to solve several important sub-problems. The performance of the resulting code is often critically dependent on how well these problems are solved. Indeed the failure of a code is often the result of one or more of these tasks being performed in an unsatisfactory manner rather than an inadequacy of the underlying discretization scheme. In this paper we consider these problems in detail. In particular we highlight those areas where existing algorithms are satisfactory and point out areas where a fundamentally new approach is needed.
1. Introduction The development of a piece of high class numerical software is a complex procedure which often evolves in a certain well defined way. Initially a problem area is identified as being important and a few ad-hoc research codes are developed. The necessary mathematical theory which underpins the algorithm is then worked out resulting in the production of much better quality software. Finally the codes are refined, and significant improvements are obtained, in the light of computational experience. In the case of initial value problems for ODE's, existing codes are reasonably satisfactory. Much of the basic theory was developed by Dahlquist12,13 and this is now reasonably complete although gaps do remain in our understanding of the stiff case. Despite extensive research, codes based on Runge-Kutta, Adams and backward differentiation formulae still remain the most widely used. Boundary value problems, however, have lagged well behind initial value problems in their development. This is not at all surprising since boundary value problems are intrinsically more difficult than their initial value counterpart and the underlying theory has developed more slowly as a consequence . However in the last twenty years or so there has been a considerable advance in the theoretical understanding of BVPs largely through the important early work of Keller1$'19. As a result good quality codes for BVPs are now starting to appear and gain general acceptance. It therefore seems to be an appropriate time to look closely at what codes we have for BVPs, to determine their weaknesses and to identify what else is needed.
113
114 An excellent survey of the theory and numerical solution of BVPs is given Broadly speaking we can divide boundary by Ascher , Mattheij and Russell2 . value problems into two classes , normally termed stiff and non-stiff. It is perhaps unfortunate that the word "stiff" has become associated with BVPs since it already has a well understood meaning for IVPs which does not allow growing modes . Although the idea of stiffness is reasonably well understood conceptually it has proved very difficult to define it mathematically and the boundary between stiff and non-stiff problems is certainly not clear! The two main classes of numerical methods for solving BVPs are the initial value, or shooting , methods and the so-called global methods. Simple shooting methods are in a relatively satisfactory state for non-stiff problems (this is of course to be expected since they are based on initial value solvers ) and often perform very effectively. In view of this we will confine our attention to global methods for solving stiff problems . In particular we will be interested in singular perturbation problems which are typified as having boundary layers , corner layers, shocks and general turning points. There are, two widely available codes which have been shown to deal very effectively>, with such problems namely the collocation codes COLSYS/COLNEWI,a . and the deferred correction code HAGRON5 In what follows we will concentrate We will briefly describe deferred correction and how it on the code HAGRON . can be modified, identify the important components of the code and highlight areas of weakness . Much of what we say about HAGRON is also applicable to COLSYS since many of the subproblems the two codes have to solve are the same. HAGRON uses The main difference is in the choice of discretization algorithm . mono-implicit Runge-Kutta8 formulae whereas COLSYS , which is fully described elsewhere1 '2, uses collocation.
2. Deferred Correction The idea of iterated deferred correction Is an old one. Originally proposed by Foxy it was further developed by Pereyra24, Stetter28 and Lindberg22. A widely used deferred correction code PASVA was developed by Lentini and Pereyra21 and has appeared in several subroutine libraries. To explain the elements of deferred correction we consider the general first order system of two-point boundary value problems y' = f(x,y), a s x s b, g(y(a),y(b)) = 0 .
(1)
Suppose we have available a "cheap" low order discretization method 0 and a C
higher order but more expensive one 0E. Ideally the problem we would like to solve is OE(n) = 0
.u t..... ..4...
(2)
115 but we do not wish to pay the cost of solving the equations defining rl. Instead we rewrite (2) as
(3)
oE=0c+0E-0c This splitting suggests the use of the iteration scheme C
(ii)= 0
0C(n)
_
0,01)
-
OE(*1)
(4)
This is a particular version of a rather more general procedure
0(n) = 0 ON)
=
11(11)
.
(5)
We see from (5) that the role of di(n) is to estimate the local truncation error in 0. Deferred correction formulae of the general type (5) have been analysed by Skeel26 who gives the following order of convergence theorem. Consider the numerical solution of the two-point boundary value problem (1) on the grid
rz: a=x
0
<x
1
<x
2
<... x
N
=b .
Let the true solution of (1) be z(x) and denote the restriction of z to the grid n by Az. Consider also the deferred correction scheme (5). Then if
(i) 11 = Az + O(hp) (ii) O (Az) = O(Az ) + O(h`+P) (iii) *(Aw) = 0(hr) for arbitrary functions w having at least r continuous derivatives then n = Az + O(h`+P) Suppose now that 0,01 are discretization formulae of orders p,r+p respectively and let 0 _ -0+m. In this case conditions (i) and (ii) of the above theorem are trivially satisfied. It therefore follows that if we adopt this form for 0 it only remains to satisfy condition (iii) to ensure an order of convergence of r+p.
When deriving discretization schemes for the numerical solution of twopoint boundary value problems an immediate question to ask is whether (1) is always the best form in which to present a problem. For IVPs it is normally
116 However for assumed that the problem is presented as a first order system . BVPs the situation is quite different. For example , the special second order equation
y" = f(x,Y) , g1(y( a)) = g2(y(b )) = 0
(6)
is normally solved much more efficiently by a method which does not require y'. Such problems often arise in the semi -discretization of elliptic partial Since COLSYS is able to deal directly with higher differential equations . order equations it would be of interest to have more information concerning the relative merits of dealing with an equation in its original form or Methods for dealing with the special converting it to a first order system . equation (6) have been given by Daniel and Martin14 and by Chawla10,11. The potential gains in efficiency using deferred correction are present in the However it is not straightforward to put these second order case as well . schemes in a deferred correction framework on a non-uniform grid because of , the phenomenon of supraconvergence20 23 Numerov' s method, for example, although having order 3 on a non-uniform grid behaves like a fourth order method providing the grid satisfies certain rather weak conditions. Because of this behaviour , a theory for deferred correction in the presence of In addition the theory of error supraconvergence needs to be derived . estimation and grid selection is much more complicated for the second order case than for first order and a coherent approach to this problem is needed.
In the case of first order systems it is natural to take 0.01 as being Indeed the theory developed by Skeel is implicit Runge -Kutta methods . applicable to arbitrary Runge-Kutta methods. The code HAGRON is based on a special family of methods known as mono - implicit Runge-Kutta formulae8. These are specially designed so that the deferred correction term is relatively straightforward to compute. However since this correction term is explicit there is in general a real need to investigate other classes of Runge-Kutta formulae which may have advantages over mono -implicit formulae. Another important research area is the investigation of alternative ways in which the deferred correction can be applied . For example we could replace (5) by
0(n) = 0 00) i+1)= 00 )
i = 0 , 1,2,...;
710 = n
(7)
This would be more expensive than ( 5) as regards function evaluations, would have about the same linear algebra cost but would have the advantage that the converged solution n would satisfy OE(n) = 0. There are several other ways in For example Stetter's defect correction version A which (5) can be modified . could be applied or, perhaps more interestingly, (5) could be applied in a multi-grid framework . For comments on these approaches the reader is referred
117 to the original work of Stetter28 and Skeel25. deferred correction code based on the iteration
04(71) = 0,
0 4(7)) = -m6(7))'
In conclusion , HAGRON is a
ma(rl) = -06(71) - 08(r^) (8)
where the underlying discretization methods are mono-implicit Runge-Kutta formulae . Scheme ( 8) is just one of many possible deferred correction schemes. There is a need to examine carefully how the deferred correction method is applied and to examine other classes of Runge-Kutta formulae to see which are potentially the most efficient.
3. Solution of the Algebraic Equations 3.1
Linear Equations
Recall from the previous section that the basic problem we need to solve at each step of the deferred correction is m(y) = 0 .
We solve this using some form of damped Newton iteration y(t+1) - yw = -a (y(t)) I 1 („) m(y ay fm
J
, i = 0,1,2 ,...
( 9)
where A is a damping factor to be chosen dynamically. The linear equations (9) are of the form Ax = b where the coefficient matrix A is highly structured . Several algorithms have been proposed to deal with systems of this form. For a survey of these algorithms the reader is referred to the book of Ascher, Mattheij and Russell2 . Basically these methods are Gaussian , Elimination15 16 adapted to the special block structure of the matrix. An alternative, and very promising idea , is stable compactification . This takes account of the special structure of A associated with collocation . For more comments on this the reader is referred to Ascher , Mattheij and Russell.
As would be expected , the codes available for solving the special linear equations are relatively satisfactory . However there are two remarks we wish to make specifically aimed at improving the solvers for boundary value problems . It should be remembered that for many problems much of the computational effort goes into solving these linear equations so any potential speed ups are likely to be important . Consider the problem (I-hJ)x
Often for "very bad"
=
b
.
(10)
singular perturbation problems, a Gaussian elimination
118 What codes algorithm will think the matrix is singular if h is "not small ". tend to do when a singularity is flagged is to double the number of mesh This is often undesirable since storage space is an important points . consideration when dealing with two-point BVPs . An important question is how to decrease the grid spacing in a non-uniform manner so that the linear equation solver no longer claims that I-hJ is singular while still ensuring That is, can we add points to just a few subintervals that dim( x) is small. A related problem concerns the fact so that the "singularity" is removed. that boundary value codes , and in particular HAGRON, often refine the mesh by For example , from one grid to the next, the adding just a few mesh points . coefficient matrix A may have the form A
A A = Al A= 2 A
1
z where appropriate zeros are added to A1, Az to make the equations consistent. It is clear that in the LU factorisation of A we should be able to use part of the factorisation of A especially if Al is large compared to Az. An implementation of this could considerably reduce the linear algebra cost for some singular perturbation problems. 3.2 Nonlinear Equations The numerical solution of the nonlinear algebraic equations ¢( y) = 0 is of course a much harder problem than solving the linear equations . Indeed the solution of nonlinear equations is the most important subproblem and the one The first question to be asked is that is in most need of investigation . whether the standard Newton method applied to m(y) = 0 is likely to converge when 0 comes from the discretization of a singular perturbation problem. The answer to this question is generally no, essentially because low order A technique for solving these derivatives of 0 are often very large . nonlinear equations was given in Cash and Wright6. Although this has been found to perform well in practice there is a need for much further investigation . The equations arising from singular perturbation problems are special in several ways and some account should be taken of this: (1) The components of the solution are often very badly scaled and we therefore need an affine invariant objective function. The most obvious objective function is g(s ) = J111O(s)11 but this is not affine invariant. ( 2) We are solving a sequence of problems moving from one grid to another. How do we make use of this information i.e. how do we interpolate past solutions and grids.
119 (3) Perhaps the most tantalising of all . problem
The solution of a general nonlinear
F(x)
=
0
(11)
is likely to be totally arbitrary and have no structure . However the solution of a discretization equation is a vector approximation to a continuous solution and so has structure. How can we use the fact that the solution has continuity structure when deriving an algorithm for (11). The strategy used in HAGRON for the solution of the nonlinear equations is to control the residual
II - O(y) 112
(12)
T i i.e. a scaled norm rather than II0(y ) II2 . However there is no convergence proof associated with this scaled objective function and indeed a counterexample is given by Ascher and Osborne3 .
The approach used by HAGRON is the so-called
watchdog9 scheme where we seek to control ( 12) but "keep an eye" on II0(y)II. This involves choosing two fixed integers IWATCH, KWATCH and using the following strategy: (i) We demand at each iteration step a sufficient decrease in the scaled norm.
( ii) After KWATCH steps we check whether II0(y)II2 has increased "too much". If it has then stop. ( iii) If II0 ( y)II2 has not been decreased for IWATCH steps then stop.
The watchdog scheme implemented in HAGRON takes KWATCH = 5, IWATCH = 8. However this choice is rather arbitrary and there is no doubt that Newton's method is very sensitive to this choice of parameters . This is an area where far more understanding of the algorithm is needed. We now describe two components of HAGRON which can have a critical effect on its performance but which are not backed by any theory at all. As an example we consider the use of HAGRON to solve y" = µsinhµy,
x e [0,1 ), y(O) = 0, y(1) = 1 . (13)
This is a famous problem due to Troesch . For increasing µ the problem has an increasingly sharp boundary layer near x = 1. For values of µ greater than about 20 the problem becomes rather difficult to solve numerically . In what follows we give the results obtained using HAGRON with µ = 30, using seven
120 equally spaced grid points initially and with an initial guess y = 0. At each iteration i we list the damping factor a used in (9), the values of the scaled norms 1 _1 ra# (yu -1) ) I O(y(1 -1)), [L, (y(1-1)1 O(yn))
8y
J
-91y
_1
and the unscaled norm O(y(1-1)). The results obtained are given below
Iteration
A
Scaled Norm Unscaled Norm
1
0.01
30.18 29.88
0.9801
2
0.02
30.06 30.04
0.9479
3
0.04
35.98 35.17
1.992
4
0.01825
73.46
3.182
72.44 5
0.035
109.3
2.154
107.6 6
0.07
235.1 222.6
16050.0
At this point the watchdog stops the iteration scheme because the unscaled norm has increased by too much . It should be noted that it is typical for the norms to increase in this way before convergence because all that is happening is that the code is realising that there is a boundary layer present. If we print out the odd components of the residuals of O(y(6 )) at each grid interval we have -2; [0,1/61 < 10- '; [ 1/6,1/31 < 10- '; [ 1/3,1/2] < 10 [ 1/2,2/31 0 . 0156 ; [ 2/3,5/61 4 . 44; [5/6,11 0.835. The question we ask is : since the residual is large only in the fifth segment should we reduce the grid spacing only in this segment? This is an important component of HAGRON, which can dramatically improve its performance on
tt"^
121 difficult nonlinear problems, but which is not supported by any theory essentially because we are dealing with a non-converged solution. The second problem we are concerned with is how to move efficiently from one grid to another when Newton converges . Suppose for example that we have solved for a solution y on a grid n and our mesh refinement algorithm generated a new grid n. Intuitively it would seem to be a sound approach to interpolate the solution from s to n • However numerical experience has shown that this is not always a good idea . The point cannot be made too strongly that Newton' s method is extremely sensitive to the way in which it is implemented . We have examples where HAGRON takes . 1 seconds to solve a problem if no interpolation is used but 20 seconds if interpolation is used; and vice versa . In some cases , therefore, interpolation gives tremendous speed up but in other cases is a disaster. We badly need a theory which gives some guidance as whether to interpolate or not.
4. Some Further Problems
4.1 Continuation Singular perturbation problems normally have a small parameter associated with them with the problem getting increasing difficult as the parameter gets smaller. It is natural to seek to solve such problems using continuation. A nice description of this process is given in [2, p.344] and in what follows we use their notation. Suppose we wish to solve the problem 0(y) = 0. We imbed within this problem a family of related problems m(Y;T)=0, 0 S T S 1
where •(y;0) is "easy" to solve and #(y;l) $ 0(y). On the assumption that there is a homotopy path ^(T) defined by the identity m(^(T);T) = 0, 0 S T S 1
we wish to move along this path from T = 0 to T = 1 using as large steps in T as possible . A discussion of various possible homotopy paths is given in 2. In what follows we wish to consider the important case where the problem involves a small parameter c. In this case we can rewrite our problem in the form 4(Y;E) = 0
and carry out continuation
in c. We now
wish to find
c ,e.... ..,E(N) = c so that the first problem in the easy to solve and so that the problem
sequence
a sequence 0(y;E`0') is
122 #(y;c(1)) = o
can be solved easily using information from the previous problem 0(y;e(1-1) Two important questions are
(i) How do we choose the sequence c(1)7 (ii) How do we use information from one step to the next?
This is a classic example of a problem which is much harder than it looks. It seems very likely that continuation will be a very powerful way of solving difficult singular perturbation problems and there is a real need to develop efficient, automatic continuation procedures. 4.2. Singularities of the First Kind Consider the equation xy' - f(x,y) = 0 . (14) Assuming that lim y(x) exists, it is necessary to assume that f(O,y(0)) = 0. X40 /
However because of the singular behaviour of this problem it is no longer the case that the smoothness of the function f(x,y) implies smoothness of y(x) near x = 0. It therefore does not seem possible in general to guarantee the good performance of a code ab initto on singular problems of this kind and modifications to codes such as HAGRON are required. In some cases it may be possible to get an appropriate boundary condition at x = 0 by use of L'Hopital's rule. Clearly, however, we should avoid trying to compute the term y' at x = 0 and this requires some user intervention for HAGRON, i.e. the determination of a series expansion, or the use of a different class of formulae. In contrast the code COLSYS, which is based on Gauss points, (HAGRON is based on Lobatto points) can be applied directly to such problems since it does not call for the evaluation of derivatives at the end point of the integration range. However its performance on singular problems is at best variable. If the solution y(x) does have the desired smoothness then COLSYS is normally perfectly satisfactory. However a lack of smoothness due to the singularity can have an adverse effect on COLSYS. Sometimes it suffers from an order reduction i.e. instead of behaving like an order 8 Runge-Kutta method it behaves like an order 5 one. No account of this is taken by COLSYS which re-distributes the mesh and computes errors on incorrect asymptotic assumptions. Even worse is the fact that COLSYS may accept completely wrong solutions. There is a need therefore to derive an a posteriori algorithm to test the validity of solutions when boundary value codes are used to solve singular problems. To identify that we are experiencing a reduced rate of convergence should be a relatively straightforward problem but to recognise that a wrong solution has been accepted is a much more challenging one7.
123 4.3 A Continuous Solution
A major advantage of COLSYS is that, being a collocation method, it provides a continuous solution approximation in the form of a polynomial. HAGRON on the other hand, being a standard finite difference formula, provides solution approximations on only a discrete set of points i.e. on a grid. The ability to provide a continuous approximation to the solution is an important facility for a code to have. If, for example, output is required at many points there is likely to be a very high linear algebra cost if all these points are included in the grid . Both COLSYS and HAGRON allow fixed points to appear in all grids and if the number of fixed points is small then this is normally the most satisfactory way of dealing with this problem. However if the number of output points is large or if , for example , we require an event location such as g(x,y(x)) = 0 The idea of providing a continuous then an interpolant is indispensible. solution approximation does not seem too difficult. After having used a fourth order mono-implicit Runge-Kutta formula for example, we have at our This allows disposal the data ( yn'yn,yn+ly' ) in each subinterval ( xn'xn+i )' us to derive a cubic hermit interpolating polynomial . However there is a need to be able to efficiently derive still higher order polynomials and to be able If a continuous to measure the quality of the approximations they give . approximation P(x) is available we will be able to measure the defect P'(x) - f(x,P(x)) and this would be a possible way of measuring the quality of an accepted solution . This is an important area where much remains to be done. 4.4 Parallel Computation
Boundary value problems , in contrast to the initial value case which is essentially a sequential process, lend themselves to speed up by parallelism. An obvious area for this speedup is in the solution of the linear algebraic Again this is not quite as straightforward as it seems because equations . some of the more inviting algorithms turn out to be unstable. However a stable algorithm with good speed up has recently been given by Steve Wright29 and both COLSYS and HAGRON should benefit from this in a parallel environment. Deferred correction codes can benefit even more from parallelism since all of the deferred corrections are independent of each other and so can be computed in parallel . This will be of considerable benefit to HAGRON since the eighth order deferred correction involves 9 function evaluations per grid point. This will be a very significant cost if function evaluations are expensive compared with linear algebra costs but will probably be insignificant if done Perhaps more importantly, a parallel environment opens up the in parallel . possibility of using more general deferred correction schemes of the form (5) since the cost of the corrections is no longer so important. For example we
124 could use the scheme
0 4(71) = 0 0 4(71) _ -•6(71)
where 04,06 are Gauss schemes of order 4,6 respectively. Since the deferred correction term ¢6(71) is defined implicitly it is likely to have a much higher cost than if mono-implicit formulae are used. However if all deferred corrections are done in parallel then this cost will be insignificant. It seems likely therefore that the possibility of using parallel machines will throw up methods which are impractical on serial machines but which are very effective in a parallel environment.
5. References 1. U. Ascher , J. Christiansen and R.D. Russell , A collocation solver for mixed order systems of boundary value problems , Math . Comp. 33 (1979), 659-679. 2. U. Ascher, R.M. Mattheij and R.D. Russell , Numerical Solution of Boundary Value Problems for Ordinary Differential Equations , Prentice Hall, Englewood Cliffs, 1988. 3. U. Ascher and M.R. Osborne , A note on solving nonlinear equations and the "natural" criterion function , J. Opt. Th. Appl., 15, (1987 ), 147-152. 4. G. Bader and U. Ascher, A new basis implementation for a mixed order boundary value ODE -solver, SIAM J. Sci. Statist. Comput., 8, (1987), 483500.
5. J.R. Cash and M.H. Wright , A deferred correction algorithm for nonlinear two-point boundary value problems: Implementation and numerical evaluation , SIAM J. Sci. Statist. Comput., 12, (1991 ), 971-989. 6. J.R. Cash and M.H. Wright , Implementation issues in solving nonlinear equations for two-point boundary value problems , Comput., 45, (1990), 1737. 7. J.R. Cash and H.H.M. Silva, On the numerical solution of a class of singular two -point boundary value problems , J. Comp. and Appl . Math. (to appear).
8. J.R. Cash and A. Singhal, High -order methods for the numerical solution of two-point boundary value problems , BIT, 22 , ( 1982), 184-199. 9. R.M. Chamberlain , C. Lemar6chal, H.C. Pedersen and M.J.D. Powell, The watchdog technique for forcing convergence in algorithms for constrained optimization , Mathematical Programming Study, 16, (1982 ), 1-17. 10. M.M. Chawla, A sixth order tri-diagonal finite difference method for nonlinear two-point boundary value problems , BIT, 17, ( 1977 ), 128-133. 11. M.M. Chawla, An eighth order tri -diagonal finite difference method for nonlinear two-point boundary value problems , BIT, 17, (1977 ), 281-285. 12. G. Dahlquist, Convergence and stability In the numerical integration of ordinary differential equations , Math. Scand., 4, (1956), 33-35.
125 13. G. Dahlquist, A special stability problem for linear multistep methods, BIT, 8, (1963), 27-43.
14. J.W. Daniel and A.J. Martin, Numerov's method with deferred corrections for two-point boundary value problems, SIAM J. Numer. Anal., 14, (1977), 1033-1050. 15. C. de Boor and R. Weiss, SOLVEBLOK: A package for solving almost block diagonal linear systems, ACM TOMS, 6, (1980), 80-87. 16. J.C. Diaz, G. Fairweather and P. Keast, COLROW and ARECO: Fortran packages for solving certain almost block diagonal linear systems by modified row and column elimination , ACM TOMS, 9, (1983), 376-380. 17. L. Fox, Ordinary differential equations : Boundary value problems and methods, In Numerical Solution of Ordinary and Partial Differential Equations, (L. Fox, ed.) Pergamon Press, Oxford, pp.58-72. 18. H.B. Keller, Numerical Methods for Two-Point Boundary Value Problems, Waltham: Blaisdell, (1968). 19. H.B. Keller, Numerical Solution of Two-Point Boundary Value Problems, CBMS Regional Conference Series in Applied Math., 24, SIAM, Philadelphia, (1976).
20. H.-0. Kreiss, T.A. Manteuffel, B. Swartz, B. Wendroff and A.B. White, Supra-Convergent schemes on irregular grids, Math. Comp., 47, (1986), 537554. 21. B. Lindberg, Error estimation and iterative improvement for discretization algorithms, BIT, 20, (1980), 486-500.
22. M. Lentini and V. Pereyra, A variable order finite difference method for nonlinear multipoint boundary value problems, M. Comp., 28, (1974), 9811004. 23. T.A. Manteuffel and A.B. White, On the efficient numerical solution of systems of second order boundary value problems, SIAM J. Numer. Anal., 23, (1986), 996-1007.
24. V. Pereyra, Iterated deferred corrections for nonlinear boundary value problems, Numer. Math., 11, (1968), 111-125. 25. R.D. Skeel, Thirteen ways to estimate global error, (1986), 1-20.
Numer.
Math., 48,
26. R.D. Skeel, A theoretical framework for proving accuracy results for deferred corrections, SIAM J. Numer. Anal ., 19, (1982), 171-196. 27. H.J. Stetter, Global error estimation in ODE-solvers, In Numerical Analysis Proc., Dundee, (1977) (G.A. Watson ed), Lec. Notes Math. Volume 630, Berlin-Heidelberg-New York, Springer (1978), pp.179-189.
28. H.J. Stetter, The defect correction principle and discretization methods, Numer. Math., 29, (1978), 425-443. 29. S.J. Wright, Stable parallel algorithms for two-point boundary value problems, SIAM J. Sci. Statist. Comput., 13, (1992), 742-764.
WSSIAA 2 ( 1993) pp. 127-140 @World Scientific Publishing Company
ON A QUADRATURE METHOD FOR A LOGARITHMIC INTEGRAL EQUATION OF THE FIRST KIND
ROMAN CHAPKO Department of Applied Mathematics, Lviv University, University St. 1, 290000 Lviv, Ukraine
and
RAINER KRESS Institut fur Numerische and Angewandte Mathematik, Universitdt Gottingen, Lotzestr. 16-18, W-9400 Gottingen, Germany
Abstract We describe a weighted quadrature method for the numerical solution of the logarithmic integral equation of the first kind arising from a single-layer approach to the Dirichlet problem for the two-dimensional Helmholtz equation and develop an error analysis based on the theory of collectively compact operators.
1 Introduction Recently, the numerical solution of the logarithmic integral equation of the first kind resulting from the single- layer approach for solving the plane Dirichlet boundary value problem for the Laplace equation has received much attention . A considerable part of the research on this integral equation is concerned with the application of collocation and Galerkin methods (using splines and trigonometric polynomials as trial functions ) and their error analysis . In addition also some fully discrete versions which take into account the effect of numerical integration on the matrix entries of the linear systems have been investigated . For an extensive bibliography we refer to [7]. In addition, some of the more practical quadrature methods have been considered
127
128 by Sloan and Burn [8], and by Saranen [7]. We consider quadrature methods to be more practical since here the computation of the matrix elements is less costly than in the corresponding collocation and Galerkin methods. The logarithmic single-layer integral equation for the Laplace equation has the basic property that it can be split into two parts. The first part corresponds to the integral equation for the boundary curve being a circle and the second part, representing the perturbation from a circle for an arbitrary boundary curve is a highly smoothing operator, provided the boundary is smooth. However, for the Helmholtz equation, because of the more complicated structure of the fundamental solution, after splitting off the integral operator for a circle the remaining operator still contains a logarithmic singularity. Hence, both the setting up of an approximation method and its error analysis have to take into account this fact. In this paper we present a quadrature method for integral equatios of the first kind with a certain logarithmic singularity which appropriately covers the case of the Helmholtz equation. Our approach is based on weighted trigonometric quadrature rules on an equidistant mesh. We describe the method in some detail and give an error analysis based on the theory of collectively compact operators which gives error estimates with respect to uniform convergence. For analytic boundaries and boundary data this convergence is at least exponential. We also briefly indicate how our method can be satisfactorily extended to integral equations on open arcs. The paper concludes with two numerical examples.
2 Single-layer Approach Let D C 1R2 be a simply connected bounded domain with boundary r = 8D. We consider the following exterior Dirichlet problem for the Helmholtz equation with purely imaginary wave number : Find a solution u E C2(1R2 \ D) n C (1R2 \ D) of A U - X2u
0 in 1R2 \ D,
,c > O,
(2.1)
satisfying the boundar y condition u=f
on r
(2.2)
and the condition at infinity u(x) = 0(1),
IxI - 00,
(2.3)
uniformly for all directions where we assume the given function f to belong to the Holder space Cl.°(I'). This boundary value problem is of interest since it arises in the treatment of initial boundary value problems for the time-dependent wave equation by the Tchebycheff-Laguerre transformation (see [2]).
129
Uniqueness of a solution to (2.1)-(2.3) follows immediately from the maximumminimum principle. Existence of a solution can be constructively shown by seeking the solution in the form of a single-layer potential x E R2 \ D, (2.4)
u(x) = J 1P(x, y),(y) ds(y), where the fundamental solution to (2.1) is given by
(2.5)
,D(x, y) := - Ko(xlx - yl ), x 54 y,
in terms of the modified Hankel function Ko of order zero which is also known as Basset function or as Macdonald function. The single-layer potential (2.4) solves the Dirichlet problem (2.1)-(2.3) provided the density 0 E C°'"(r) is a solution of the integral equation Jr 1b(x, y) O(y) ds (y) = f (x), x E r. (2.6)
This integral equation can be shown to be uniquely solvable in the Holder space C0 (r) (c.f. Theorem 7.29 in [4] for the related case of Laplace's equation). We assume that the boundary curve r is analytic with a 27r-periodic regular parametric representation of the form x(t) = (xl (t), x2(t)), 0 < t < 2ir, (2.7) satisfying [x1I (t)]2+[x2I (t)]2 > 0 for all t. Then we transform (2.6) into the parametric form 1 2a 2v r K(t, )W(r) dr = g(t), 0 < t < 2r, (2.8) where we have set W ( t) := O(x (t)){[x11(t)]2 + [x2I (t)]2}1 / 2 and g (t) := f (x(t)) and the kernel is given by K(t, r) := -2K°( xr(t, r )), t # r, with the distance function r(t, r) := {[xi(t) - xl(r)] 2 + [x2(t ) - x2(r)]2}1"2.
From the series Z
Ko(z)
00
Z2n
oo
n
(in 2 + C) E (nl)2 22n +
Z2n
m
(n !)2 22n (2.9)
n=0 n=1 m=1
with Euler's constant C = 0.57721... we see that the kernel K has a logarithmic singularity and can be written in the form K(t, r) =1n
(e
sine t
2
r)
1 1 + Ki(t, r) sine t 2 r 1 + K2(t,7)
(2.10)
130 where the kernels 1
K1(t,r) =
(xr( t , T))2n
t#r,
E l (n!)2 22n n=1
sine t 2
{1+K1(t,T)sin 2t Zr }, t#r,
K2(t,T)= K(t,T)-In(4sin 2t 2r/
turn out to be analytic. In particular, using the expansion (2.9) we can deduce the diagonal terms K1( t, t) = k2 {[x ( t )] 2 + [x2I (t)]2}, K2(t, t) = 2C + In 4e {[x (t)]2 + [XI(t)]21) After introducing the integral operators S, A, B : Co'" [0, 27r] -+ C1"°[0, 27r] by (Sp)(t) :=
1 (Acp)(t) :=
!
Z^r
1 2^r
f
2w
fp (r)In
C
4
_
sin2
0e
t2
1
T I dr, /
_ _ \ / cp(r)K1(t, r) sin2 t 2 r In I e sin2 t 2 r I dr, 1
(Bca)(t)
f 21 W(,)K2(t"r) dr
27r
we can rewrite the integral equation (2.8) in the short form So+A^p+B(p=g.
(2.11)
The operator S corresponds to the integral equation which occurs in the approach analogous to (2.4) for the Laplace equation for r a circle of radius a-1/2 and therefore can be seen to be bounded with a bounded inverse . This result can also be derived from the fact that for the trigonometric monomials um (t) a"nt we have
S um = where
C„mum ,
m = 0, ±1, ±2, ... , ( 2.12) 1
c"' max(1, Iml) This follows from the elementary integrals 1 J 2a In ( e sin 2 2 eimrdr = C,,,, m = 0, ±1, ±2, ... . o J
(2.13)
131
3 The Quadrature Method Our quadrature method is based on trigonometric interpolation. We choose n E IN and an equidistant mesh by setting tjn)
0, ... , 2n - 1.
The interpolation problem with respect to the 2n-dimensional space Tn of trigonometric polynomials of the form n
n-1
v(t) = an cos mt + E bm sin mt m=o m=1
and the nodal points t^nl , j = 0,. .. , 2n - 1, is uniquely solvable . We denote by Pn : C[0 , 2x] -+ Tn the corresponding interpolation operator and note the explicit expression 2n-1
Pn f
f (tjn)) L
=
jn)
(3.1)
f=o with the Lagrange factors given by n-1
L^nl (t)
sn
1+2 E cos m(t - t^ni ) + cos n (t - tIni ) . (3.2) M=1
For the trigonometric interpolation of periodic analytic functions f we have an error estimate of the form (c.f. Theorem 11.5 in [4])
II Pn f - f 11. <- ce -n° for some positive constants c and o depending on f. We will use the following interpolatory quadrature rules n-
^sa t2r a sine 2a f (,) In C dr . R^"1(t) f (t;n)), o _o
1 2, f( 1 2^r
r) sins t 2 r In e sin2
t2r
dr 2F!n)(t) f(t^n)), i=o
f( T) dT 2v f
2
f (t)
f= o
with the weights 2x
RR(n) (t) 2a J
L(n) (r) In a
Ce
r l
2 dr )
sine t
(3.4)
(3.5)
132 and F(nl(t) ram Lj(nl (r) sins t 2 r In I e sins t 2 r) dr. 2x o From (2.12) and (3.2) we derive the explicit form n-1
Rinl(t) _ { c° 2 E Cm cos m(t - t^n)) + C„ cos n(t - t^nl) M=1
1 FJnl (t) = 1
n-1
'yo + 2 7m cos m(t - t(n)) + 7n cos ri(t - t^nl) M=1
where the c,,, are given by (2.13) and rym 4 (24,, - Cm + 1 - cm-1 )•
(3.7)
We apply the quadrature rules (3.4)-(3.6) to the integral equation (2.8) according to the splitting (2.10) of the kernel K and obtain the approximating equation 2n-1
;
; ;
(tnl){R(t) + FJn)(t)K1(t,t4)) +2nK2( ttY )} = 9 ( t), 0 < t < 2a, (3.8)
which we solve for On E T. Using the fact that SP,,O = S O for 0 E Tn, we can rewrite (3.8) in operator notation as (3.9)
SO n + Ancn + BnCn = 9
with the numerical quadrature operators 2n-1 (AnW)( t)
F!n'(t)Kl(t,tnl ),(t;n)), i=o 1 2n-1
(Bn W)(t)
:
2n
E K2(t,3 j=0
In order to arrive at an approximating equation which can be reduced to solving a finite dimensional linear system we apply a collocation method with the interpolation operator Pn to (3.9). Thus our approximation scheme finally consists in solving PnS,Pn + PPA,(Pn + PnBnWn = Pn9
(3.10)
for W n E T. Clearly, this is equivalent to the linear system 2n-1
EW n(tcn l){Rlk)il +FlkilKi(tkn1t;nl)+ i=o
2n
K2(tkn, ,tjn) )}
=9(tk°l), k =0,...,2n-1, (3.11)
133
which we have to solve for the nodal values Wn(tkn)) of wn E Tn and where Ran) .
_
n-1 m a co + 2 cm cos
RIn)(0)
2n
( -1)nCn
m=1
n + tt )(0) = jn .= F(
1
n-1
m rr
+ ( -1)nyn cos 2n J yo + 2 M=1 ym n
After we have solved this linear system the solution of the boundary value problem at an arbitrary point can be obtained by numerical integration of the single-layer potential (2.4) by means of the rectangular rule (3 .6). At this point we would like to emphasize on the main advantage of the quadrature method as compared with collocation and Galerkin methods: the numerical evaluation of the matrix entries of the linear system (3.11) only requires the computation of the kernels K1 and K2 at the quadrature points. For the subsequent error analysis we note that equation (3.10) can be equivalently transformed into
con + S-1PnAncpn + S-1PnBncPn = S-1P g
(3.12)
corresponding to the equivalent form (3.13)
p + S-1A:p + S-'BV = S-19
of (2.11). The equivalence of (3.10) and (3.12) is an immediate consequence of the fact that P,,S1/i = St,i for 0 E Tn.
4 Error and Convergence Analysis We will base our error analysis on Anselone's concept of collectively compact operators (c.f. Chapter 10 in [4]). For this we will need the following lemma. Lemma 4 .1 Let f E C9}1[0,27r], q > 0, and 0 < a < 1/2. Then for the Holder norm of the q-th derivative we have
IIPnf - f 11,,. < rn lif(4+1)II .
(4.1)
where the rn satisfy rn--+0 ,
n --+oo.
Proof: Consider the monomials um(t) = e'mt and write m = 2kn + e where k is an integer and - n < I < n. Since um(t^n)) = ut(t^n)) for j = 0, ... , 2n - 1 , we have
Pnum=PnUI= 1 2 (un + U-,,),
e
134 and consequently
IIPnum - um114,a < C mv +a
for all Imi > n and some constant C. Since f E C9+'[ 0, 2rr] we can expand in uniformly convergent Fourier series 00
f(v)(t) _
f (t) _ a eimt m e
(irn)vane'
mt
m=-oo 1
M=_00
By partial integration we have 1 am ^^ f(
21r
t)e-imtdt
f(v+1)( t)e-imtdt, m 1 0,
2ir (im)a+1
whence
00
Ilf(9 +1) Il i2[0, 2a] = 2ar
m2v +2lam12.
E m=-oo
Hence, making use of Pnurn = urn for Iml < n, with the aid of Schwarz' inequality from (4.3) and (4.4) we can now derive
IIPnf - fliq, a _< c2
00
2
2
Iamlm9 + a
<
00
(v
2
ImI=n
+1) 2 L2
Ilf II
[°,2x]
rrt2
--2a
ImI=n
13
and this implies (4.1).
Theorem 4.2 Assume that the kernels K1 and K2 are twice continuously differentiable and that 0 < a< < 1. Then both operator sequences (An) and ( Bn) are collectively compact from C°""[0, 27r] into C1'0[0, 27r] and converge pointwise to the operators A and B, respectively. Proof: We only prove the statement of the theorem for the operators A. The proof for the Bn is analogous and simpler. Since C"[0, 27r] is compactly imbedded in C[0, 2w] (c.f. Theorem 7.4 in [4]) and since the norm on C2[0, 27r] is stronger than the norm on C''"[0, 27r], for the collective compactness it suffices to show that the operators An are uniformly bounded as operators from C[0, 27r] into C2[0, 21r]. For arbitrary fixed t E [0, 2rr] we define the linear functionals Fn,t on C[0, 2rr] by 2n-1
Fn,tV Fj' ( t) cp(t1n))
f=o and clearly have 2n-1
II Fn ,t ll0 =
.I
IF,(n)(t)I. i=o
135 On the other hand, by the construction of the quadrature rule (3.5) we have 2t-T1 1 f(p.W)(,) Fn,tW = 2a sin 2t-7 2 InC4 a in s dT. 2 J From this with the aid of Schwarz ' inequality we get
I
Fn,
t'PI
<- II f0II L Q[0, ] IIpn1pIIL2[o,2x] 2x
where we have set 7 7' fo(r):=sin 22 In (esin 22),
0
From (3.1) and (3.2), again by Schwarz' inequality and elementary integrations we find that 2n-1 I 2x
L;n) (t)Lkn) (t) dt < 3 1r IIwhI 0
IIPn^vIIL2[o,2x] < Il^vllw E j,k=O 0
whence
IFn,t'PI -< 37r I1 f0hI L2 [0, 2x] IIVIIo for all t E [0, 27r] and all cp E C[0, 2ir] and therefore 2n-1
E IF(n)(t)I <_
3jrIIfOIIL2[0, 2x]
i=o
for all t E [0, 27r]. Similarly, for the derivatives we have 2n-1
E j=O
l , J F \(n)) (s) (01 <
37r
Ilfo ) IIL-[o,2x], q = 1, 2.
Note, that fo" indeed belongs to L2 [0, 27r]. From (4.5) and (4.6) and the definition of the An we now can deduce an estimate
II AnW II. + II(AnW)"II00 < ChIVIIo
(4.7)
for all n E IN, all 1P E C[0, 27r] and some constant C. This establishes the desired uniform boundedness of the operators An : C[0, 21r] -^ C2[0, 2a]. The uniform boundedness (4.7), by the Banach-Steinhaus theorem, also establishes convergence
II AnW - Aw II. + II(Ansa - App) "hh
0, n -a oo, (4.8)
for all cp E C[0, 27r] since based on the estimate (4.1) for q = 2 it can be shown that (4.8) is valid for all cp from the dense subspace C3[0, 27r] of C[0, 21r]. Hence, we have pointwise convergence of the operators An and the proof is finished. 0 We note that from Theorem 4.2 it follows that the operators A and B both are compact from CO,°[0, 27r] into Cl.°'[0, 27r] since the pointwise limit of a sequence of collectively compact operators is compact.
136 Theorem 4.3 Assume that the kernels K1 and K2 are twice continuously differentiable and that 0 < a < 1/2. Then both operator sequences (PnAn) and (P,,Bn) are collectively compact from Co"a[0, 21r] into Cl.c [0, 21r] and converge pointwise to the operators A and B, respectively.
Proof: As in the previous proof we only verify the statement for the operators PnAn and show that the PnAn : C[0, 21r] -+ C',a[0, 21r] are uniformly bounded. In the triangle inequality
I PnAn'P lll,a <- Il PnAn(P - An'P lll,a + II Ancp lll,a we estimate the first term by applying Lemma 4 .1 for q = 1 to f = AnW and the second term by a stronger norm to obtain
IIPnAn(alll,a < Cl{IlAnsvll. + II( Anc° )"IIo} for all n E IN , all w E C[ 0, 21r] and some constant Cl. By (4.7) this implies
II PnAn(P lll,a < C2 11VII. for all n E IN , all W E C[ 0, 21r] and some constant C2 and collective compactness is established . Pointwise convergence follows analogously from
II Pn An(P - Aco IIl,a <- II PnAnWP - AnW Ill,a + II An(P - Avlll,, by using Lemma 4.1, the boundedness (4.7) and Theorem 4.2.
❑
Theorem 4.4 For sufficiently large n the approximating equation (3.10) has a unique solution (P n and for the unique solution W to the original equation (2.11) we have the error estimate
II(vn - WIlo,a < C ( II Pn9 - gill ," + IIPn(An + Bn)w - (A + B)Wlll, a) (4.9) for some constant C = C(a) and 0 < a < 1/2. Proof: Recall the equivalent formulations (3.12) and (3.13). By Theorem 4.3, the operators Ln := S-1Pn(An + Bn) : Co'a[0, 2a] -^ Co"a[0, 21r] are collectively compact and pointwise convergent to L := S-'(A +B). Hence, for sufficiently large n the inverse operators (I + Ln)-1 : Co,a[0, 2a] --r Co' a[0, 27r] exist and are uniformly bounded 11(1 + Ln)-1Ilo,a < C (4.10) for some constant C (c.f. Theorem 10.8 in [4]). The error estimate follows by writing Vn - W = (I + Ln)-1{S-1(Pn9 - 9) + (L - L,)cp} and using (4.10) and the boundedness of S'1 : C',* [0, 2a] --^ C',* [0, 21r].
_t t,
❑
137 The error estimate (4.9) shows that the accuracy of the approximation depends on how well P„A„gyp approximates Acp for the exact solution . In particular , if the exact solution is analytic (and this is the case if the boundary r and the boundary data f are analytic), then from (3. 3) it can be derived that the error 11'pn - WIlo,a decreases at least exponentially.
5 The Integral Equation for Open Arcs In the case where the boundary r is an open analytic arc we still can use the parametric form (2.8) of the integral equation (2.6) with the splitting (2.10). However, we loose the 27r-periodicity of the parametric representation (2.7). From the theory of (2.6) (c.f. [3]) it is known that its solution for an open arc has square root singularities at the end points . In this case an approximation based on an equidistant grading can yield only poor convergence and therefore has to be replaced by a graded mesh. As in the case of integral equations in domains with corners [5, 6] we suggest to base this grading upon the idea of substituting a new variable in such a way that the derivatives of the new integrand vanish up to a certain order at the endpoints of the integration interval. The idea to grade the mesh for the open arc by an appropriate substitution has also been employed by Atkinson and Sloan [1 ] who used the substitution t = 27r cos s . Our approach differs through the use of a substitution which leads to more smoothness at the end points. We substitute t = w(s) with the function w
defined w(s) = 2ir [v(s)]P [v(s)1P + [v(2a - s)]P' 0 ( < s < 2^r,) 5.1
where
/1 1 v(s)= Cp -2
J
v -s \ 3 1 s - 7r 1 I x f +p ir +2
Note that w( )(0) = W( q ) (2ir) = 0 for q = 1, 2,. .., p - 1. By the substitution (5.1) we replace (2.8) through
1 2a
f
4a
K( w(s), w(Q))cp(w(a))w (o) do, = g( w(s)),
0 < s < 27r, (5.2)
and introduce a new unknown function 0(s) :_ cp(w(s))w'(s ) which now vanishes at the endpoints s = 0 and s = 27r together with its derivatives up to a certain order depending on the choice of p. Hence , we may view 0 as a smooth 27r-periodic function . Then, after splitting the new kernel K(s, o) := K(w(s), w(Q)) according to,(2.10) we can carry the error analysis of Theorem 4.4 over from the case of the closed contour to an open contour . Due to space limitations , we have to leave out the details and confine ourselves to a numerical example in the next section.
138
6 Numerical Examples We conclude the paper with two numerical examples . First, we consider a domain with a non-convex boundary r illustrated on the left hand side of Fig. 1 and described by the parametric representation x(t) = (cos t + 0 .65 cos 2t - 0.65 , 1.5 sin t), 0 < t < 2a.
(6.1)
Table 1 gives some approximate values for the solution u to (2.1)-(2.3) for constant boundary values f = 1 at different locations. Table 1 . Numerical results for dosed boundary (6.1) x
n
x= -1.5,0)
x= (-1 . 5,2)
1
8 16 32 64
0.66785075 0.66766125 0.66766121 0.66766121
4
8 16 32 64
0.16438097 0.16394107 0.16394070 0.16394070
x = 1.5,0
x = (1.5,2)
0.43255478 0.43227621 0.43227610 0.43227610
0.49694155 0.49431007 0.49428553 0.49428553
0.15629933 0.15634709 0.15634709 0.15634709
0.07923738 0.08021588 0.08021324 0.08021324
0.11193959 0.10479525 0.10474644 0.10474643
0.00137965 0.00139750 0.00139750 0.00139750
In a second example we consider as boundary r the open arc x(t) = (2 sin t , sin t ), 4 < t < 74 ,
which is illustrated on the right hand side of Fig. 1. For the use of the substitution (5.1) the Table 2 contains the numerical results for the solution u to (2.1)-(2.3) for K = 1 and constant boundary values f = 1.
Fig. 1 . Boundary curves (6.1) and (6.2)
139 Table 2 . Numerical results for open boundary (6.2) q
n
x=(-1,0 )
x= (0,0)
x = (1,0)
4
8 16 32 64
0.09317349 0.09318288 0.09318340 0.09318343
0.28247123 0.28250140 0.28250308 0.28250317
0.63914749 0.63918091 0.63918294 0.63918306
6
8 16 32 64
0.09318314 0.09318343 0.09318343 0.09318343
0.28250273 0.28250317 0.28250318 0.28250318
0.63918095 0.63918305 0.63918306 0.63918306
8
8 16 32 64
0.09318186 0.09318343 0.09318343 0.09318343
0.28249939 0.28250318 0.28250318 0.28250318
0.63916424 0.63918306 0.63918306 0.63918306
Acknowledgement . This research was carried out while the first author was visiting the University of Gottingen on a DAAD-stipendium
References [1] K. E. Atkinson and I. H. Sloan, The numerical solution of first-kind logarithmickernel integral equations on smooth open arcs, Math. Comp. 56 (1991), 119-139.
[2] W. Galazjuk and R. Chapko, The Tchebycheff-Laguerre transformation and integral equations for exterior boundary value problems for the telegraph equation, Dokl. Akad. Nauk. UkrSRR 8 (1990), 11-14 (in Russian). [3] Y. Hayashi, The Dirichlet problem for the two-dimensional Helmholtz equation for an open boundary, J. Math. Anal. Appl. 44 (1973), 489-530. [4] R. Kress, Linear Integral Equations, Springer-Verlag, Berlin Heidelberg New York, 1989.
[5] R. Kress, A Nystrom method for boundary integral equations in domains with corners, Numer. Math. 58 (1990), 145-161. [6] R. Kress, Boundary integral equations in time-harmonic acoustic scattering, Mathl. Comput. Modelling 15 (1991), 229-243.
140 [7] J. Saranen , The modified quadrature method for logarithmic-kernel integral equations on closed curves , J. Integral Equations and Appl. 3 (1991), 575-600. [8] I. H. Sloan and B. J. Burn, An unconventional quadrature method for logarithmic-kernel integral equations of closed curves , J. Integral Equations and Appl. 4 ( 1992), 117-151.
WSSIAA 2( 1993) pp. 141-150 ©World Scientific Publishing Company
OPTIMAL RECOVERY FOR SOME CLASSES OF FUNCTIONS WITH L2(IR)-BOUNDED FRACTIONAL DERIVATIVES
Han-lin Chen*t, Chun Li' Institute of Mathematics, Academia Sinica, Beijing, 100080, People's Republic of China and Charles A. Micchellit$ IBM Research Division, T. J. Watson Research Center, Yorktown Heights, NY 10598, U.S.A.
ABSTRACT This paper is devoted to studying an optimal recovery problem for some classes of functions with certain bounded fractional derivatives by using the information of function values at all integers. The exact solutions, including the intrinsic error and the optimal algorithm, are obtained. A generalized version of the Hardy-Littlewood-Polya inequality plays an important role in our discussion, and Schoenberg's formula for the error in cardinal spline interpolation of odd degree is also extended to the setting of the fractional derivatives.
1. Introduction This paper concerns the exact solution of an optimal recovery problem for functions defined on the whole real line. Generally speaking, we wish to recover a function (or one of its fractional order derivatives) given its value at intergers and an a priori bound on a (higher) derivatives. Our probern takes place in a Hilbert space. however, the theory of optimal recovery applies more generally. The interested readers pray refer to the survey papers2.3. Let us state the problem under consideration. For real r > 0 (not necessary integer), we define the Sobolev class HC(IR) of functions by
H''(IR) :_ { f E L2(IR) E L2(IR)},
(1.1)
The research of these authors was partially surported by the NNSF of China. t The research of these authors was partially supported by an NSF Grant INT-87-124244. t The research of this author was also partially supported by an SERC visiting fellowship at the University of Cambridge.
141
142 where ho f f(x)c-ax{dx, ^ E IR, i = ,
(1.2)
is the Fourier transform of f E L2(IR). For f E Hr(IR), we can define the r-th derivative f (') of f via the Fourier transform [f (r)]A(^) :_ ^ E IR. (1.3) It is well known that f (r) above concides with the usual derivative when r is an nonnegative integer. By Sobolev's imbedding theorem, H'(IR) C C0(IR) whenever r > As we will see the formulation of our problem requires us to assume throughout that r > 2. We now define the function class which interest us. Specifically, we set
Br(R) := If E Hr(E) : IIf(')II2 < 1}, (1.4) where
119112
:_
(j
Ig(x)I2dx)2 is the usual L2(IR) -norm of function g E L2(IR). We
will also use later If I . := IIf II2 + Ilf (') II2
as
the norm on H'(IR).
Given an f E B"(IR) we wish to recover f(') := D' f (0 < s < r) optimally from the values of f at all integers. This entails determining the intrinsic error E(D', B' (IR)) := inf sup{IIf^y) - A(I(f ))112 : f E Br(IR)}, A
(1.5)
where I: f --* (f( j))jEZ is an information operator which maps Br(IR) into 1HZ, and A : I(Br(IR)) ^ L2(IR) is an arbitrary mapping (algorithm). Our goal is to determine the exact value of E(D', Br(IR)) and find an optimal algorithm A* which realizes E(D', Br(IR)) in the sense that
sup {Ilf(3)
- A *(I(f))112 :
f E Br(IR)} = E(D',Br(IR)).
(1.6)
2. Main Results Define e(D
',B'(E)) := Sup{Ilf(')112 : f E Br(E), f(j) = 0,Vj
E 7L}.
(2.1)
Then we generally have2 E(D', Br(IR)) > e(D', Br(IR)). (2.2) Moreover, since the problem of optimal recovery we consider is formulated in a Hilbert space, specifically. L2 (E) we even have E(D', B"(IR)) = e(D', Br(E))• (2.3) in this case2'3. We will see (2.3) directly during our subsequent analysis. We begin by obtaining the exact value of e(D',B'(IR)). We develop our formula by estimating it first from below and then above, respectively.
...._ t,.^_.
143 Lemma 1. Suppose r > 1 and 0 < s < r. Then
(2.4)
e(D',B'(,R)) > 7r'-'.
Proof. Take an even function F(x) with continuous derivatives of all orders such that IF(x)I < 1, x E IR, and
1, if IxI < 1; if IxI > 2; F(x) = 0, 1>0, if1
(2.6)
where N is a positive integer. We want to compute Tlim II f N ) I I2 / I f N) 11 2 . Now, -oo
fNW
=
f
e_F(.) sin irxdx
ff? 2i a-
x{( e'wx _ e-aax )
F(N)dx
2i (F (N(f - ^)) - F (N(f +7r))], and,
27r f I(D'fN)(e)I2de = 2,< f I612sIfN(6)l2de
YN'Il2
Ilt
E
_ N2
I6I2sIF(N(6 - ir)) - F(N(6 + ,))12 d6. 872 f
Since F(x) is an even function so too is F(e). Thus,
IIfN'I122 =
N 47 r
I
0
I
I2slF(N(6 - 7r)) - F (N(6 + ir))I2d6.
On the other hand.
IIfNII 2 =
27r IIfN I12
=
2 1r
f
IF (N)
I2 si
n2 7rxdx
2 N
= 27r I_2N IF(N)I2 sin2 irxdx < 8irN, while
II f N II2
N IF(x) l2 sin2 7rxdx = 27r >_ 27r IN N
N sin2 7rxdx >_ CN, N
I
144 for some positive constant C. Thus we conlude that IIfNII2 N, and therefore from (2.7), we have ( with s = 0)
oo I F(N(^ - 7r)) - F(N(^ +
f 0 Next for any given
c
ir))I2<
N-'.
( 2.8)
> 0, there exists S > 0 such that
11^I29 -
2, 7r2s
1 < E for all
e E (ir - S, 7r + S).
Thus 00
IIfN'll22 -,.2s II fN II2
fo
(1^11.,
- x2s)I F(N(f - 7r)) - F(N(^ + ir))I2<
fo I F(N(^ - r)) - F(N(^ +
ir))I2<
<E+C1N J (IeI2'+ r2s )IF(N(^-7r))-F(N(^+x)) I2<, (2.9) where C1 is a constant which is independent of N.
Since by choice F(x) is in
C°°(R) and supp F = [-2,2], we have
F(NS) = 2f F(x)e-ix(Nf)dx = C((N^)-"), 0, z where n is any positive integer. Taking n > s + 2 in (2.9) and letting N -+ oo, we have hill sup N-+oo
IIfN'II2_ 7r 2s II fNII2
< E.
Thus, line II f N) II2 = 7r2s and finally we conclude that N--
IIfNII2 y lull IIfN'112 N-oc
IIfN)Ilz
lilll N-.
IIfN'I12 IIfNII2 IIfNII2 11f(N)112 -
Hence, by (2.1) and the fact that I(fN) = 0 we obtain (s)
e(Ds,Br(IR)) > limsup N-+oo
which proves (2.4).
IIfN
11f( r
II2 =
7rs
-r
■
To obtain the upper bound for e(D5, Br(n)), we use the following generalized Hardy-littlewood-Polya inequality.
t t"
145 Lemma 2 . Suppose 0 < s < r < oo and f E H''(IR). Then
Il f (s'112 <_ Ilf(r'112111112
(2.10)
Proof. We may assume without loss of generality that s < r. Now, let 1 = T:1, 1p' Then p, p' E (1, oo) and I + p = 1. Thus,
Ilf(s'112 2 - 27r f^ 1^12sIf(^)12de = 2^ f^ 1^12s1f(^)IZ' ' If(^)12(' r>d^ I)Pd£ v 1^12s''If(^)I^d^) (air f If(^)l2(1-1 (22r fIR Ift 1 (27r f If12' If(^)12<). (2 2
If(e)12d^)l
1-'-
= Ilf(''II2 Ilfll2(1 ''. ■
which is the desired inequality.
Now, let us estimate e(Ds,B'(IR)) from above. Let 0 < s < r < oo and f E Hr(IR)\{0} satisfying f(j) = 0, for all j E Z.
Thus both f(s) # 0 and
f(r) # 0. From (2.10) and (2.1) we have
Ilf(11)112
(ll
l2)
< [e(id,BV (IR))]e,
(2.11)
where id = D° : L2(IR) L2(IR.) is the identity operator . Thus, we have also from (2.1) and (2.11) that
e(id, B' (JR)) = snp{11f112 : f E Hr(ff?), Ilf(r'112 = 1, f(j) = 0, Vj E X} < [e(id,Br ( IR))].
(2.12)
From Sun and Li'. see also Lil, we know that e(id,B1(IR)) = 7r-1 Actually, for us here it suffices to note that e(id, B1(IR)) < it-1.
(2.13)
146 To see this we observe that for every f E Bo [0,1] := {g : g E H' [0, 1], IJ9r1IL2(o,1) <_ 1, 9(0) = 9(1) = 01, we have Il f [IL2[o,1[ < 7r-1. The easiest way to prove this inequality is to first observe that there is an extremal function h E B01 [0,1] which maximizes 11 h1l L- [0,11 in Bo [0, 1]. Then a simple variational argument shows that h must be (uniquely up to a multiplicative constant) sin,7rx. Thus, if r > 1 (= s), then (2.12) gives
e(id,Br(1R))
(2.14)
e(D',B''(lR)) = sup{IIfl'11I2 : f E Br(1R), f(j) = 0,Vj E 7L} -r. < [e(id, Br( 1R))]1 Lr < .s
(2.15)
Using (2.10) again we obtain
Combining (2.14) and (2.15) with Lenima 1 gives the following result. Theorem 1. Suppose 1 < r < oo and 0 < s < r. Then e(D', Br(1R)) _ 7r'-r.
(2.16)
Remark 2.1. For the integers r and s satisfying 0 < s < r, this theorem was also obtained by Sun and Lib. However, even in this case the proof given here is different from that given by Sun and Li. We also remark that e(D', Br(1R)) makes sense when 2 < r < 1 and 0 < s < r. However, we don't know whether (2.16) still holds in this situation. It remains an open problem. Finally, let us explicitly construct an optimal algorithm A* in the sense of (1.6). For r > i , we define
M,.(x) . (i.e., Mr(e) _
1
21r
1' J sin 2/2I re`y {d^,
X E IR,
(2.17)
sin 2/2I'). The condition r > 1 implies Mr E L2 (F). Moreover, if
r > 1, then M,. E Co(,R)f1L2(IR). Note that Mr is the usual even order (odd degree) central B-spline when r is a positive even integer. In what follows we consider the case r>2. Note that if r > 2, then by integration by parts, Mr(x) = O(x-2) as JxJ -+ oo; if r = 2, then M2(x) is the linear central B-spline with compact support. In particular,
147 (Mr(k))kEZ E £1 whenever r > 2. Thus by (2.17) and Poisson summation formula we have for 161 < 7r, (2 /7r)r
Mr (^ + 27rk ) = E Mr(- k)e`kt < 00 kEZ kEZ
and by periodicity for all ^ E IR,
(2 /7r) ' <
M,.(^ + 27rk) _ E Mr(-k )e`k^ < 00. (2.18) kEZ kEZ
Define ix{
L'(x):
27r
< MM)e +(^ + 27r k ) flit EkE7L
1
Mr([)eix{ S
27r /R Ek1--
Mr( -k)eik£
d£,
x E IR.
( 2.19)
Then one can easily verify that Lr(j) = 6o,2 for all j E 7l, and by Wiener's lemma there is a sequence (cj)jEz E £1 such that Lr = EkE7L ckMr(• + k). Moreover, if r = 2, Lr(x) is the usual linear hat function with support on [-1, 1]. If r > 2, then integration by parts gives L,.(x) = O(1 1-T), x E P. Now, let r _> 1 and f E H7(IR) C H1(IR). Then by Lemma 3.1 of Lil, we have (f (j))jEa E t2 (note that H1 (R) was denoted by W2 (IR) there). For the convenience of the reader we prove this fact. For j E 7l, define g(x) := (-1)j+leixxl f(x)I2 so that j+1
If(j + 1)12 + If(7)I2 = f
g'(x)dx.
Since
g'(x) = (-1)j+lei' [(zir)If(x)12 +2f'(x)If(x)Isgnf(x)], we get j+1
^ Jj
g'(x)dx
< (
1
+ 7r)
f
1 j+
if ( X) 12 dx +
7
f
j+1
lf' (x)I2d x,
7
and thus
Y-
1
If(j)I2 <_ ^( 1+7r)1f1i 5 (1+7r)lflr
jEll
Set s2r(f ;x) > f(j) L2r(x j ), jEll
x E IR,
(2.20)
148 for r > 1 and f E H'(,R). Then
II52r
(f) II2
=
1 _. . 2_ -f L
__,12
..
_
._
jEZ
. I jEZ
Since E
IL2rV +2,7rj)I2
_
jEZ we obtain
Ilszr(f ) II2
IL2rl( 5
<_ EJEZ
+27,7 ) I2
If(7)I2
EjE7L( M2r(S
+2,rj))22 <
(F.jEZ M2r ( e
+2irj))
< (1 +ir) If I2. Similarly, using the fact that
_>jEZ I^+27rjl2r ( M2r(^+21rj))2 2
jEZ
( EjETL M2,((^ + 29rj))
+ 27rj) 1 2 = 22r ( M2r(^ + 21rj )) C ( e + 2irj)) jEZ
<22r EjEZ k2(EjEl M2r
(S
Clearly we have
S 2r
II
)II2
^)Iflr•
^r2r (1 + s2r)( f and s2,. are bounded linear operators from Hr(IR) into L2 (p).
which by (2.18) does not exceed ir2r, we also have Therefore, both
1
52r (f ;
j) = f (j) for all j E 7l and for this reason we call s2r the
cardinal interpolation operator. Theorem 2. Let 1 < r < x and f E Hr (IR). Then
Ilf(''112 = Ilf(r' - s2;'(f)II2 + II32r'( f)II2, (2.21) Proof. We have already seen that to prove
L
32r)
(f) E L2 (In ). To show ( 2.21), it is sufficient
h(r)(x)s(r) (f; x) dx = 0
(2.22)
for all h E kerI := {g E Hr(E) : g(j) = 0, Vi E 7l}. By the Plancheral identity this is equivalent to
fii2r e2ru;ed^ = 0.
(2.23)
In the same way, h E ker I means that fh()e'3d^ = 0,
for all j E Z.
(2.24)
b- 4$.
149 Since h E L1(1R) (because r > 1), the Weierstrass theorem and (2.24) imply that fii(e)()d= o
(2.25)
for all continuous , 27r-periodic function 0. Now,
IS I2rh(S )S2r(f+S) =I^I2rh(S)(
>
f(e)e- `'£) L2r(0
jEa =h(^) f (j)e-1j (jElL
1 22rsin 2r
EkEZ M2r( k)e- k^ J 2 1
Thus, according to (2.25 ), ( 2.23) holds at least for all f E Co (1R). Since Co (]R) is dense in H' ( 1R), (2.23 ), and therefore ( 2.22) holds for all f E Hr(1R ). This proves the theorem.
■
Remark 2.2. Theorem 2 extends a result of Schoenberg's4 on cardinal spline interpolation of odd degree. See also Sun and Li'. Theorem 2 leads us to our main theorem . We know from (2.21) that for f E
B'(]R) (i.e., II f (')112 -< 1), 11(f - 12r(f))(r)II2 = 11f(r) - s(2r)U)112 < IIf( r)II2 < 1, and f ( j) - s2r ( f ; j) = 0 for all j E Z. Combining this fact with Theorem 1 gives II f (S) - S2r)(f )II2 < e(D', Br(1R )) = 7r'-r, 0 < s < r. Consequently, ( 2.2) and ( 1.5) lead us to Theorem 3 . Let 1 < r < oo and 0 < s < r. Let E(D', B' (1R)) and e(D-, Br (R)) be defined in (1.5) and (2. 1), respectively. Then
E(D8, B ' ( 1R))
=
e(D8 , B r(lR))
=
SUP IIf (') -
fEBr(IR)
S2r)( f
)II2
=
lr' -r,
and therefore, s2r (f) is an optimal algorithm for recovering f (') from (f (j ))jE1 for given f E Br (]R).
150 References 1. Chun Li, Infinite-dimensional widths in the spaces of functions, II, J. Approx. Theory, 69 (1992), 15-34. 2. C. A. Micchelli and T. J. Rivlin, A survey of optimal recovery, in Optimal Estimation in Approximation Theory, eds. by C. A. Micchelli and T. J. Rivlin, Plenum, New York, 1977, 1-54. 3. C. A. Micchelli and T. J. Rivlin, Lectures on optimal recovery, Lect. Notes in Math., Springer-Verlag, New York/Berlin, 1129 (1985), 21-93. 4. I. J. Schoenberg, Cardinal Spline Interpolation, CBMS-NSF Series in Applied Math., #12, SIAM Publication, Philadelphia, 1973. 5. Sun Yongsheng and Chun Li, Optimal recovery for Wz (E) in L'(E), Acta Math. Sinica (N.S.), 7 (1991), 309-323. 6. Sun Yongsheng and Chun Li, Best applroximation of some classes of smooth functions on the whole real axis by cardinal splines of higher order, Mat. Zametki, 48 (1990), 100-109. [In Russian]
WSSIAA 2(1993) pp. 151-164 ©World Scientific Publishing Company
NONLINEAR GALERKIN METHOD WITH MULTILEVEL INCREMENTAL UNKNOWNS
MIN CHEN Department of Mathematics, Penn State University University Park, PA 16802 U.S.A. and ROGER TEMAM
Laboratoire d'Analyse Numerique , University de Paris-Sud 91405, Orsay, France
ABSTRACT Multilevel methods are indispensable for the approximation of nonlinear evolution equations when complex physical phenomena involving the interaction of many scales are present (such as in, but without being limited to fluid turbulence). Incremental unknowns of different types have been proposed as a means to develop such numerical schemes in the context of finite difference discretizations. In this article, we present several numerical schemes using the so-called multilevel wavelet-like incremental unknowns. The fully discretized explicit and semiexplicit schemes for reaction-diffusion equations are presented and analyzed. The stability conditions are improved when compared with the corresponding standard algorithms. Furthermore the complexity of the computation on each time step is comparable to the corresponding standard algorithm.
1. Introduction. In the past, the approximation of nonlinear evolution equations was mostly restricted to short intervals of time or to long intervals of time when the solution converges to a stationary one as t -+oo. The new technologies and the increased power of the new computers offer to the numerical analysts new challenging problems, namely the approximation of nonlinear evolution equations on large intervals of time when complex physical phenomena appear. New numerical methods adapted to such problems need to be developed (see [9]); in particular multilevel methods are needed in order to treat appropriately the different scales appearing in a complex problem and to resolve at reasonable cost the smaller scales. Incremental unknowns have been proposed as a means to address this new type of problems when finite difference discretizations are used. The idea is to treat differently the small and large
151
152 scale components of a flow and in this way to avoid stiff systems; to save on computing time; and to obtain better CFL (Courant-Fredriche-Levy) stability conditions (see [9]). After studying linear elliptic problems in [2] and [1], we consider here nonlinear evolution equations. As a first example, we apply the multilevel wavelet-like incremental unknowns to a Reaction-Diffusion equation: 8u N -VAu+ g(u)=0 in R,
(1.1)
u=0 on OR,
(1.2)
u(x, 0) = uo in Sl.
(1.3)
Here v > 0, Sl is an open bounded set in 1R" with sufficient smooth boundary and 2p-1
g(S) = F_ bis', i=o
b2p-1 > 0.
For the sake of simplicity, we shall consider only the one-dimensional case and 12 = [0,1] in the rest of the paper. The higher dimensional cases can be treated in the same way. The definition of the wavelet-like incremental unknowns in dimension two can be found in [4] and it is recalled below in dimension one. The article is organized as follows. In Section 2 we recall the definition of the wavelet-like incremental unknowns (WIU) and describe their implementation in the space discretization of problem (1.1)-(1.3). Then in Section 3 we consider space and time discretization . Four different schemes are proposed which are of the nonlinear Galerkin type. Finally in Section 4 we develop the stability analysis of these schemes. The limitation of the time mesh k = At are much better than those obtained with usual one-level spatial discretizations . Of course as usual for nonlinear problems, our stability analysis provides only sufficient stability conditions; however there are also numerical simulations performed for Burgers equation which confirm these improvements [5]. 2. Multilevel Wavelet-like Incremental Unknowns (WIU). In this section , we shall transform a spatial finite difference discretization in terms of U into a scheme involving Y and Z where U is the finite difference approximation of the solution u, while Y represents a coarse grid approximation and Z represents a fine grid correction (the incremental unknowns). Considering spatial discretization by finite difference with mesh size hd = 1/(2'N + 1), where N E 1N, we have
ate' + vAh, Uh, + g(Uh,) = 0,
(2.1)
where Uh, is the vector of approximate values of u at the grid points, Uh, E IR2'N and Ah, is a regular matrix of order 2dN. For simplicity, we write Ad = Ah„ Ud = Uh,. When a central difference scheme is used for the convection term, we have AdUd(i) = h2 (2Ud (i) - Ud(i + 1) - Ud(i - 1)),
153 where Ud(i) is the finite difference approximation of u at x = ihd. Ordering Ud(i), i = 1,2,... ,2 d N in its natural way, we see that Ad is a tri-diagonal matrix. We now introduce the (d + 1) levels Wavelet Incremental Unknowns (WIU) into equation (2.1). We first separate evenly the unknowns into two parts, one part represents a coarser grid approximation, another represents a correction to the coarser grid approximation. We obtain 2-level wavelet-like incremental unknowns. After the first split, the unknowns which represent a coarser grid approximation can be separated again into two parts... . After d-time of separations, there are N unknowns (Y part) which represent the coarsest grid approximation and each of them is an average of 2d unknowns from Ud. The other (2d -1)N unknowns (Z part) are the correction of Y to bring the total accuracy of the approximation into the accuracy of Ud. We now introduce the first separation. The incremental unknown Ud in this level as stated consists of two parts: • the coarser grid approximation which is the average of two neighboring values of finer grid y2. _ (Ud(2i - 1) + Ud(2i))/2, i=1, ...,2 d-1 N, (2.2) • the increment on the fine grid approximation
z2i_j = (Ud(2i - 1) - Ud(2i))/2,
i = 1,...,2d-1N. (2.3)
The transformation from Ud to incremental unknowns Ud is the inverse of (2.2) and (2.3): d d Ud(2i) = y2, - z2.-1,
for i = 1,...,2d-1N. (2.4)
Ud(2i - 1) = z2.-1 + y2., We reordering Ud into Ud by letting Ud = (Ud(2), ... , Ud(2dN), Ud(1),... , Ud(2dN - 1))T, and we see that
Ud = PdUd, where Pd is a permutation matrix. We then write (2.4) in the matrix form
Ud = SdUd, where Ud = (Yd, Zd) _ (y2 , ... , y2.N, ZI , ... , zz4N-1), and
/1 0 ... 0 -1 0 0 f 0 1. 0 0 -1 0
Sd =
0 0 ... 1 0 0 1 0 ... 0 1 0 0 1 ... 0 0 1
_ Id-1 -Id-1 Id-1 Id-1
\0 0 ... 1 0 0 ... 1/ Id-1 being the identity matrix of order 2d-1N . We can easily see that S-1 = ZST, and Ud = PdSdUd•
(2.5)
154 Substituting (2.5) into the finite difference equation (2.1) and multiplying the equation by (PdSd)T, we find
8(PdSd)T PdSdUd
+ v(PdSd)T AdPdSdUd + (PdSd)T a(PdSdUd) = 0.
Noticing that Pd Pd = Id, ST Sd = 21d, Pd and g commute, we obtain
2 atd
+ vSd Pd AdPdSdUd + STg(SdUd) = 0,
which is the finite difference scheme obtained when 2-level wavelet-like incremental unknowns are used. Equation (2.6) is equivalent to (2.1) except that we have replaced Ud by Ud = (Yd, Zd)T We can again introduce the next level of WIU on Yd by repeating exactly the same procedure. Let Yd = (Yd-l,Zd-1)T = (y4- 1,yg-1 ,...,y2l N, z2-l, 4"...'2I_1 ^_2)T, we replace Ud by Yd and Zd by Zd_1 and change the corresponding subscript in (2.4). Namely, we define d d-1 -d-1 y4i = y4i 4i-2+
_ for i = 1,... , 2d-2 N. (2.7)
y4i-2 = z4i 2 + y4i
Therefore, Yd = Pd-1Sd-lYd,
where Pd_1 is a permutation matrix of order 2d-1N and Id-2 -Id-2 Sd_1 Id-2 Id-2 Noticing that the size of Yd is only half that of Ud, we let Ud_1 = (Yd-1, Zd-1 *Zd)T
Pd_1
\ Pd-1 0 Id-1 , 0 0 / Sd-1 = 1 VId-1 Sd-1 0
Here Pd-1 and §d_1 are matrices of order 2dN. We see that PdPd = Id, SaST = 21d, Ud =
Pd-1Sd- lUd-1•
Generally for I = d - 1, d - 2, ... ,1, we define Y,+1 = (Yi, Z,)T and l W= l y2d-1}1i - Z2 d-1+1i - 2d-11 1 . 1 for i = 1,...,2 1N.
y2d 1 1+1i
++1
y2d-1+1i _ 2d-1 = 'Z2d - i +li_2d - 1 + y2d-1+1i,
We can easily see as previously that
Y1+1=P,S,Y,, 1=d-1,d-2,...,1,
_1 t ,.,
(2.8)
155 where P1, St have similar structures as Pd and Sd but with different sizes. We can include (2.5) into formula (2.9) with 1 = d by denoting Yd+1 = Ud and Yd = Ud. Setting Ut = (Y11 Z11 I Zt_1,... , rZd)T, we obtain 7 7F (2.10)
Ut = P1S,Ut-1 where _ (Pt 0 1 where k = (2d - 21)N. PI 0 Ikxk
-I1-1 0
S0 1 = (SO' Ikxk l
Sl ZI-1 ZI-1 0 0 V LIkXk
SISI
= 21d,
'
PIPIT = Id.
Substituting (2.10) into equation ( 2.6) successively with I = d,...,1, we obtain the evolution equation expressed in terms of Y and Z where Y = Yo, Z = (Zo, 71 Z,,..., 1 Zd)T : 2 72
2d + VSTAdSU0 +
ST9(SUo )
=
0,
(2.11)
with S = PdSd ... P1§1. 3. Nonlinear Galerkin method. In this section, we propose some new schemes based on the utilization of the incremental unknowns introduced in the last section . The new schemes will not only simplify the formulas which make them easier to implement, but also improve the stability conditions comparing to the corresponding schemes in the last section while maintaining the same complexity of the computation (see Section 4). The schemes we shall propose are obtained by neglecting some small terms involving Z. A partial justification of these schemes can also be seen through the dynamical system theory (c.f. [6], [7], [8]). In this section, we shall first propose the new treatment of the spatial discretization. We then propose several fully discretized schemes. The stability conditions for the fully discretized schemes will be presented in the next section. The convergence of these schemes can be proved by using the stability results in the next section and then proceeding as in the proof of convergence of the nonlinear galerkin method for Navier-stokes type equations in [4]. We now analyze (2.11) and start with d = 1; we write 2 5jd +VSd Pd AdPdSdUd + Sd 9(SdUd) = 0.
But SdUd _ (
Id-1 -1d-1 Yd \ Id-1 Id - 1 Zd
Sd 9('SdUd )=
C
(Yd - Zd Yd + Zd
Zd-1 Id - 1 9(yd - Zd)
-Id-1 1
Id _ 1
)
\ 9(Yd
+Zd)
= 1 9(Yd-Zd )+ 9(Yd+Zd )1 = 1 29(Yd)+O(IZdI2) 1. 1 9(yd + Zd ) - 9(yd - Zd ) O(JZdJ)
156 We therefore obtain the 2-level nonlinear galerkin method 2 \ Zd) + VSd Pa AdPdSdUd + 2 (g(j)) = 0, by neglecting a O(IZdl2) term in the evolution equation for Yd and a O(IZd1) term in the equation for the evolution of Zd. Now, when d = 2, we have
4a 1 +VSTAdSUd-1 + Sd 1Pd 1Sd 9(Sd7d) = 0.
Using the approximation for (3.1), Pd_i and g commute, we have
Sd 1Pd 1Sd 9(SdUd) 2Sd 1Pd 1 1 9(O\ d)/_ -2 (Sa 1 0
\0 1 (Pd 19(Pd-1Sd-1Yd)1 -2 (Sd 19(Sa 11'd) Ikxkl 0 J ` 0
We can again use the same approximation technique as for (3.1) and obtain
Sd 1 Pd 1 Sd 9(SdUd)
, 4 f
9(Yd-1)
1
0 /1 .
Therefore, we can easily see that the nonlinear galerkin method with the use of (d + 1)-level incremental unknowns leads to equation
2d I Z0) +VSTA,SUO+ 2d (9(00 )1 =0 (3.2) where Yo E RN and Z is a vector of dimension (2d - 1)N. f From the theory of inertial manifolds, we sometime prefer to neglect also the ? term. Therea fore, another similar scheme can be proposed
2d 00) + VSTAdSU0 +2d (9(00)) =0. Now we consider time discretization. We can easily obtain an explicit scheme for (3.2) by using the explicit Euler scheme. Scheme I. Explicit scheme
k
( Zn+1
_Zn ) +vSTA,SU0 + 2d
C9(
00)1=0.
Based on (3.3), we can also obtain a similar scheme by omitting the discretized time derivative of Z:
157 Scheme P.
T
(YO+1o YO) + STA,SUa+2d (9(oo))=0.
Alternatively, taking a\backward Euler scheme for the time discretization of the linear terms, we obtain semi-implicit schemes: Scheme II. Semi-implicit scheme
2d Y1+1 T (Z"+1 n
+ vST AdSUo}1 + 2d (
g(Yo")) = 0.
Scheme II'.
k (Ya+l0 y01 +
ISTAdSUo+1
+ 2d (g(00 )1 0.
The effective implementation of the above schemes is very similar to using incremental unknowns for solving linear problems (c.f. [2], [1]). The product of STAdST with a vector can be obtained without writing out the explicit form of S; 0(2dN) flops are required which is the order of flops required for the product of Ad with a vector. 4. Stability Analysis for the Fully Discretized schemes. Let Vh, be the function space spanned by the basis functions W h,,M, M = ihd, i = 1,2..... 2dN; whd,ihd is equal to 1 on the interval [ihd, (i + 1)hd) and vanishes outside this interval; let uh,(x) be a step function in Vh, and uhd(x) = Ud(i), for ihd < x < (i + 1)hd,i = 1,2,...,2"N. Hence 2dN uhd(x ) Ud(i)whd,ihd, X E 1. i=1
We introduce the finite difference operator
Vhd O( x) =
{m(x Td-
+ hd) -
m(x)},
and endow Vh, with the scalar product ((uhd, vhd ))hd = (Vhd uhd, Vhdvhd ),
where (•, •) is the scalar product in L2(1).
We set
II Ilhd = {((•>')) hd } 1 /2 and observe that II • I Ihd
and I • I are Hilbert norms on Vhd. Using the space Vhd , we can write the finite difference discretization scheme (2.1) in variational form as ( ihd u) + v((uhd ,u)) ha + (9(uhd )+u) = 0, Vu E Vhd. (4.1)
We can recover (2.1) by choosing u = Whd,ihd. It is not hard to see that we can recover the definition of wavelet-like incremental unknowns by a suitable decomposition of the space Vhd (c.f.
158 [4]). We define Yha (or simply Yd) as the space spanned by the basis functions ,b2hd ,M, where M = 2ihd, i = 1, 2,._ 2d-IN; here w2ha,2iha is equal to 1 on the interval [2ihd - hd, 2ihd + hd) and vanishes outside this interval. Thus 2d -'N
Yd(X) _
yd(2ihd)tb2ha,2iha,
x E 52, Vyd E Yd.
i=1
We then define Zd as the space spanned by Xha,M = W hd,M -W ha,M+ha , where M = (2i-1)hd, i = 1, 2,... ,2d-IN. We have 2d-1N Zd(x) =
Zd((2i - 1)hd)X2hd, 2ihd _ha,
> i=1
x E S2, Vzd E Z.
Therefore, Vha = Yd ® 2d• (4.2) We now decompose the approximate solution uha E Vha into: uha = yd + Zd, Yd E Yd, Zd E 2d•
By identifying yd and zd on each interval [2ihd - hd, 2ihd + hd), i = 1, ... , 2d-1N and writing yd(2ihd) = y2ti zd (ihd) = zd, we obtain exactly (2.4). With decomposition (4.2), (2.6) is identical to (d , Olt
y) +
v((yd
+ zd,
y ))ha
+ (9(yd + zd),y) = 0,
( ,z) + -((Yd + zd,z))ha + (9(yd + zd ),z) = 0,
Vy E
Yd
Vzd E Zd
Multilevel incremental unknowns can be recovered in a similar fashion. We decompose Y1i =d , . . . , 1, into
Yi = Y,-1 ® 21_1, and we recover (2.8) by defining Y1_1 and 21_1 accordingly. We therefore see that for any function uha E Vha, we can write it as Uhd = y + Z, where y = yho E Y = Y o and z E 2 = Zo ®21 6 . • • ®2d; Yo is a function space spanned by the step functions with step size ho = 2dhd and (y, z) = 0, Vy E Yo, Vz E Z. Equation ( 2.11) is therefore identical to ( ao , (V6
+ v((yho + z, y)) hd + (9(yho + z ), y) = 0 ,
, z) + v (( yho + z, z )) ha + (9(yho + z ), z) = 0,
V y E Yo, Vi E Z,
and (3.3) is identical to (ayh2 , y) + v(( yho + z, y )) ha + (9(yho ), y) = 0,
Vy E
Yo,
(4.3) 5j , z) + v((yho + z, z))ha = 0,
Vi E Z.
Before presenting the stability theory, let us introduce some easy lemmas . Their proof can be found for example in [3].
159 Lemma 2.1. There exist constants cl and c2 such that the function g above satisfies b2p-IS2P -
g(S)s ^
g(S)2 G 2bz,_ 184'-2 + C2 ,
C1, Vs.
(4.4) (4.5)
Lemma 2 .2. For every function uh E Vh,
IIuhIIh
V'21-hl
^
Sl(h)Iuhl,
where S1(h) = h/2. Lemma 2 .3. For every function yho E Yo,
Sz(ho )ly40I
< Iyhol2, with S2(ho) = ho.
I Yhol,
51(ho,hd)Ilyho Ilhd
with 3I(ho,hd)=2
hohd.
Here Iyho I- is the maximum (L°°) norm of yho. Theorem 2 .1. Stability condition for Scheme I We assume that k < KO for some KO fixed and let MO = Iuh.I2 + 1(cl +c2Ko)I1I• v
if k 1 2d < 4v( 2+2d)
Td and
k 2d(p-1) < (hd)p-1 4b2p -1 Mop-',
we have for Scheme I the following estimate: Iuhdl2 =Iyhol2+Iz°I2 <Mo foranyn> 0. Proof. Using ( 4.3), we can write Scheme I in its variational form: n+1_ n (yho
k yho,y) + v ((yho + Zn, :J))hd + (g ( yno ),'✓ ) = 0,
Vy E Yo, (4.9)
Zn+1 - Zn
k , x) + v((yno + z n, i))hd = 0,
Vi E Z. ( 4.10)
We let y = 2kyho in (4.9) and z = 2kzn in (4.10) and add these relations, since 2(a - b, b) _ Ial2 - Ib12 - la - bI2, we obtain lyh +1
12 0
12 + l2 _0 Iyn+I 0 - yn - Iyn 0
zn+112 - 1Zn12 -
I Zn+I - Zn12 + 2kvIlyho +Zn11hd + 2k(g(yeo ),yho) = 0.
160 By using (4.4) in Lemma 2.1, we find
ly,,o lI2 - I yho l2 -
Iyho 1 -
yho
l2 + Izn+l l2 -
Izn12 _ Izn+I - zn l2
+ 2kvllyho +znllhd + kb2r -1 J (yho)2pdx < 2kc1ISZI. n Now, let k(yho 1 - yho) in (4.9): n+1 n I yho - y hol
2 n + kv((yho
n
n+1
n
z ,yho - y ho))hd
n n+l + k(9(yho ),yho -
Y.
yho) = 0.
We obtain by using Cauchy-Schwarz inequality and Lemma 2.3
Iyho 1 - yho l2 < kv llyho+=nllhd IlYha 1- yhollha +kI9(yho)I Iyho 1 - yhol Ilyho + Znllhe Iyho ' - yho I + k2I9(yho )I24 + 1 Iyho1- yho I2 < kv 1 31(ho,hd) 21yho 1- yho l2 + k2I9(yho)I2 + S h vh, )2 11yho + = n llhd. 1(
Iyho 1 _ yho 12 < 2k 219(yho)12 +
2k2v2
s1(ho,hd)2
o,
Ilyha + Zn llh..
We can bound Izn+1 - zn l2 by the same method. Let a' = k(zn+l - zn) in (4.10): Izn+1 - zn12 + kv((yh0 + zn zn + 1 - zn))he = 0.
We therefore obtain using Lemma 2 .2 with h = hd,
Izn +1 -
z nl2
< kvllyho + znllhe Ilzn+l - znllhe < kvsl( hd)Ilyho + Z n lihd Izn +1 _ znl.
Therefore zn+l - znl2
V V2
<
S
Ilyho + znllhe
( hd)2
Combining these relations , we see that 2 , 2 2k2v2
Iyho 1I2 - Iyho l 2 + Izn+1I2 _ Izn 12+ (2kv - S (hd)
2 - Si(hohd) 2
)Ily ho
+ znllhd
+ kb2p_1 r (yho )2pdx < 2kc1I0I + 2k219(yho )l2. n Using (4.5) pointwise and Lemma 2.3, we obtain k2I9(yho)l2 < 2k2b2p_1 JS?(yho)'p/-2 dx + k2czl^2l
< 2k2b2p -
Iyho 1oo-2 J
-1
n
(yho )2pdx + k2c2IQI
( ) p _11yhon I2p -2 J,(yho ) 2pdx + k 2 c2lcu 1•
5,2 h
op
(4.11)
1 61 This yields
I yho
1I2 - I yho I2 + I zn+1 I2 - Iz n I2 + (2kv 4k
2,2 5.
(hd)2
2
2
al(^10, hd)2
)Ilyho
+
zn
llh,
2b22p_l
+(kb2p-1 - S2(ho)p l I!'hol2p-2) J (yho)2pd2 <2kc1ISlI+2k2c21S21.
Since S1(hd) = hd/2, 91(ho,hd) = 2^, S2(ho) = ho,
l11ho l l2
-
lyno I2 + lZn +112 _ IznI2 + 2kv(1 -
h2
h2
d- 2d
)Ilyno
+
znllh,
4k2b2p_ 1 n 2p -2 n 2p 2 ) fn(yho) dx<2kc1IS2I+2k c21121. +(kb2p-1- hp-1 Iyho1 0 We are now ready to prove Theorem 2 .1 by induction: • q = 0 is obvious since lyho I2 + Iz° 12 < Mo. • Assuming (4.8) is correct up to q = n, we then have lyno I2 + IznI2 < Mo. • For q = n + 1, using condition (4.6) and (4.7), we write Iyho
ll2 - Iyho I2 + IZ ntll2 _ IZn 12 + k
With the use of Lemma 2.2 and Iyho 1I2
lyno
+ znJ2 =
vllyho
+ znllha < 2kcl
if I + 2k2C21f1.
ly nx I2 + Iznl2, we obtain
+ lzn+l lz < (1 - 2kv)( lynho I2 + IznI2) + 2kc11121 + 2k2c21121.
Using above inequality corresponding to n, n - 1, ... ,1, 0, we obtain
Iyho 112 + Izn +112 < (1- 2kv)n
+1(Iyhol2
Iz °12)
+ (1 + (1 - 2kv) + (1 - 2kv)2 + • . • + (1 - 2kv)")(2kci ISlI + 2k2c21a1) < (1- 2kv)n}1(lyho I2 + lz° l2) + 1 (2kc11S21 + 2k2c2lSZl)• 1-(1-2kv) Therefore
Iyho
1I2
+ Izn + 1 I2 < (1 - 2kv)n +1(I yh o l2 + Iz °I2) + II (c1 + Koc2) < M°.
(4.12)
Hence (4.8). Remark 1. By using the same method, we observe that the stability condition for the classical explicit approximation scheme of (4.1) (i.e. two levels in time, one level in space) reads
and k < ha-1 4 b2p-1 Mo -1 (4.13)
162 Since for d > 1, z > Z, we observe an improved stability for any d > 1. When the nonlinear effect is strong , that is when (4.7) and (4.13) are dominant, the stability condition of the nonlinear galerkin method is better. The time step can be taken about 2d(p-1) larger than the step size if we deal with uha directly. Remark 2 . From (4.12), we see that there is an absorbing set in the L'-norm for the approximate solution. That is there exists an R0 > 0 (Ro = ^ni(c1 + Koc2)), which depends only on g (and not on k and h), such that the ball centered at origin with radius R, BR(0), for any R > R0 absorbs the solutions: namely for any initial data uo, we have Iyho1z + Iznl2 :< !,R when n is large enough. Remark 3. We have better stability results for Scheme I'. But the equation involving 'a becomes implicit which might add the complexity of computation. In order to present the stability condition for Scheme II, we need the following well-known result: Lemma. (Discrete Gronwall Lemma) Let an , bn be two nonnegative sequences satisfying an+1 _ an
k
+ Aan + < b", bn < b, Vn > 0.
Then,
n 1 0 1+kA 1 a < (1+kA)na + a (1 (l+kA)n+1)b'
Vn>0,
provided k>0and1+kA>0. Theorem 2.2. Assuming k < K0 for some K0 fixed, we set
M1 = Jyho 12 + Iz01z +
1 + 4K0v 4v (2c1 + czKo)I f
Then if k
2d(p-1)
(4.14)
(hd)P-1 - 2b2p -IMi-1' we have for Scheme II the following estimate:
lyho12+Iznl2 <M1 for all n> 0.
Proof. Again, we first write Scheme II in its variational form (yho 1
yh0 k
Vy E Yo,
(4.15)
( kz) + v((yho 1 + zn+1 z))e, = 0. Vi E Z.
(4.16)
, y) + v((yho}1 + z' 1, y))ha + (g(yho ), y ) = 0, Zn+1 - zn
163 We let 2kyho 1 in (4.15) and a = 2kzn+1 in (4.16) and use 2(a-b, a) = IaI2 - IbI2 + Ia - b12 and inequality (4.4). We obtain after adding these relations
Iyho 1I2 - Iyho I2 + Iyho 1- yha I2 +Izn +1 I2 - IznI2 + I zn+1 - zni2 + 2kvliyho 1 + zn +l llh, _ -2k(9(yho),yho1) _ -2k(9(yho),yho1 - yho) - 2k(9(yho ), 1✓ ha)
< -2k(9(yho ), yho 1 - yho) - kb2p-1 J (yho )2pdx + 2kc1 IQ 1, R
Iy' +112 - Iyho I2 + Iyho 1- yho l2 + Izn I2 - Izn I2 +Izn+1- Z. I2 +1
+
2kvllt✓ ho 1
+
zn+
1
Ilh,
+ kb2p -1 J (yho )2pdx < -2k(9(yho ), yho 1 - yho) + 2kc1ISlI. n
Now using formula (4.11) ), yn+1 _ - 2k(9(yho yh) S ho
_ 2kly(yh Iyh+1 0 yh I 0 )I 0
+ k2I9(yh iyh +1a - yh l2o o )I2.
<_ o 2 b2
(ho)p 11 Iyho I2a-2
/ (yho )2pdz + k2c2IfII.
< Iyho 1 - yho I2 + 2
We therefore obtain
Iyho 1I2 - Iyho I2 + Izn + 1 I2 - Iz nI2 + Izn+1 - zni2 + 2kv liyho + zn IIh, + kb2n-1(l -
S2(ho)p 1l
lyhol2p -2) Jn (yho)2pdx < 2kc1I12I + k2C2I12I.
Lemma 2.2 yields then
Iyho 112 -Iyho l2 + Izn}1I2 - IznI2 +4k vlyho + zni2 < 2kciIflI + k2c2Is2I. Since
Iyho +zni2 = Iyho l2 + Iznl2, we have
Iyho' 12 -Iyho 12 + Izn+1 I2 - IznI2- +4-( Iyha 1I2 + Izn+1 I2) k
2cl Ili + kc2IflI.
Now the discrete Gronwall lemma implies
Iyho 12 +IznI2 <- (1+4kv )n(lyhol2 + Iz°I2 ) + 1 4vkv(1- (1 +4kv)n)IQl(2c1 +Koc2). Therefore
Iyho 12+IznI2 (1+4kv) n(ly1°oI2+Iz°I2 )+ 1 4ykvISlI(2c1+Koc2), Iyho I2 + IznI2 < M1. Remark !. Comparing the stability condition of Scheme II with that of the standard semi-implicit approximation scheme of (4.1), we see that the size of the time step can be taken 2d(p-1) times larger. Remark 5. As in Remark 2, any ball BR(0) with R > Ro is an absorbing ball, where Ro = 1+4k. IQI(2c1 + Koc2)•
Acknowledgements . This work was supported in part by NSF, grant no DMS-9205300 and grant no DMS-.9024769; and by ONR, grant no 00014-91-J-1140.
164 References [1] M. Chen & R. Temam, "Incremental Unknowns for Convection-Diffusion Equations," to appear in Applied Numerical Mathematics. [2] M. Chen & R. Temam, " Incremental Unknowns for solving Partial Differential Equations ," Numer. Math.59 (1991), 255-271. [3] M. Chen & R. Temam, "Nonlinear Galerkin Method in Finite Difference Case: Application to evolution equations," In preparation. [4] M. Chen & R. Temam, "Nonlinear Galerkin Method in the Finite Difference Case and Waveletlike Incremental Unknowns," Numer. Math., 1992, to appear. [5] H. Choi & R. Temam, "Applications of Incremental Unknowns to the Burgers equations," Annual Reserch Briefs, Center for turbulence Reserch, Stanford University(1992). [6] C. Foias, 0. Manley & R. Temam, "Modeling of the interaction of small and large eddies in two dimensional turbulent flows," Math. Model. and Num. Anal. 22 (1988). [7] C. Foias, G. R. Sell & R. Temam, "Inertial manifolds for nonlinear evolutionary equations," J. Diff. Egn.73 (1988), 308-353. (8] R. Temam, Infinite-Dimensional Dynamical Systems in Mechanics and Physics, SpringerVerlag, New York-Heidelberg-Berlin, 1988. [9] R. Temam, "New emerging methods in numerical analysis; Applications to fluid mechanics," in Incompressible Computational Fluid Dynamics - Trend and Advances, M. Gunzberger and N. Nicolaides, ed., Cambridge University Press, 1991.
WSSIAA 2( 1993) pp. 165-176 ©World Scientific Publishing Company
SPLITTING METHODS FOR SOLVING LARGE SYSTEMS OF LINEAR ORDINARY DIFFERENTIAL EQUATIONS ON A VECTOR COMPUTER
ILIO GALLIGANI Department of Mathematics , University of Bologna, Italy I - 40126 Bologna, Italy
ABSTRACT This paper is concerned with the development of two splitting methods for solving large systems of linear ordinary differential equations. These methods are well suitable for parallel implementation on a «Cray-likes vector processor. The consistency and the stability of the two methods are analysed. A trial and error procedure for determining the dominant eigenvalue of the system is proposed-
1. Introduction At present the splitting methods have become a powerful tool for solving highly complicated problems of many areas of science and technology. These methods are essentially based on certain special relaxation processes which allow one to reduce the complicated problem into a sequence of simple problems which can be effectively solved with a computer. The most complete theory has been given for the case when the operator characterizing the original problem can be decomposed into a sum of two or more operators with a simple structure3. For example, the well-known alternating direction implicit (ADI) method is based on this decomposition-principle of the operator. This paper is concerned with the analysis and the development of two splitting methods (an ADI-type method and a method of arithmetic mean) for solving numerically on a vector computer large sparse systems of linear ordinary differential equations of the form dv(t) - Av(t) = b(t) 0
(1)
where v(O) = g is given. Here, the constant real nxn matrix A and the real nxl vector b(t) are given. The nxl vector v(t) denotes the unknown vector-valued function of the real variable r, t is interpreted as time. We assume that the matrix A has a non-random sparsity pattern. That is, A = (aid) is an irreducible matrix with positive diagonal entries and non positive off-diagonal entries which can be partitioned in the form
A =
All A12
Alq
A21 A22
A2q
Aql Aq2
Aqq
165
(2)
166 where the submatrices Am (r, s = 1, 2, ..., q) are square of order p (n = p q) and A. can be expressed as the matrix sum Arr=Xr+Dr+Yr where Dr is a diagonal matrix of order p with positive diagonal elements and Xr and Yr are pxp matrices of simple structure with positive entries on the diagonal and non positive offdiagonal entries . For example, Xr and PY1PT may be tridiagonal matrices ; P is a permutation matrix. Besides , if we define
0
0A12 ... A14I
A21 0 U =
L =
^Aql Aq2 D = diag {D1 ,..., Dq},
... 0) X = diag {X1 ,..., Xq},
0 Y = diag {Y1 ,...,Yq}
the matrix L + D + U is strictly diagonally dominant and X and Y are the direct sum of q irreducibly diagonally dominant matrices. Thus, the matrix A = L + (X+D+Y) + U is non singular and A71 > 0; that is , A is a matrix of monotone kind. Since -A is an irreducible matrix with non negative off-diagonal elements, the matrix -A is essentially positive . Thus5, the matrix -A has a real simple eigenvalue ? such that the associated eigenvector xl is positive and Re(2,.;) < a'1 for any other eigenvalue Aq of -A (i = 2,3,...,n). The scalar ?,1 is called the dominant eigenvalue of the system ( 1). Since A is a matrix of monotone kind, this eigenvalue A'1 is negative . Indeed, from -Ax1 = ,1x1, with x1 > 0 and A7 1 > 0, it follows that (- 1/2L1)xl = A-1x1; thus, Xr < 0. Then, the solution v (t) of (1 ) is asymptotically stable. The analytical solution of (1) is t v(t) = e -tA v(0) + e -tA f e vt b (t) dt (3) 0 and, for a time-step h > 0, this vector satisfies the recurrence relation h v(t+h) = e-hA v(t) + e_hq f eTA b(t+,r)dti (4)
with t = 0, h, 2h, ... 0 If the components of the initial vector g are non negative and the components of the source vector b(t) are non negative , the condition aaJ S 0 for i * j upon A is a necessary and sufficient condition to ensure that the vector (3) is non negative.
One often refers to (1) as a continuous -time linear dynamic system of monotone kinds.
167 2. Splitting Methods For any h/2 > 0 the solution of the system (1) or dv(t) + Dv(t) (L+X+Y+U) v(t) + b(t) (5) dt
satisfies the recurrence relation - hD -hD h/2
v(t+2) = e 7 v (t) - e f e,D (L+X+Y+U) v(t+tt)dti 0
(6)
_hD h/2
+ e I f eTD b(t+t)dT
0 with t = 0,
2,
h,
2
h,...
It is this recurrence relation which will yield a family of splitting methods for solving the system (5) by using different approximations to the integrals h12
h/2
f eTD (L+X) v(t +'t)d't f eTD(Y+U)v(t+ti)d-t 0 0
(7)
h/2
f
eTD
b (t+ti)dt
0
If we make the assumption that b is constant and equal to b (mh + h) = bm + 1/2 in the timesubinterval [mh, (m+l )h], the alternate use of the equation (6), in which the integrals (7) are approximated by making the assumptions that v(t+,t) =
ru(ti h
e ' v(t)
v(t+ti ) = e v(t
+2),
suggests the following method (ADI-type method): _hD
-hD
h/2 (0(,C- h)
-1 f eTD e Wm+t/2 = e T Wm - e -
(L+X) wm+t/2 di
0 hD h/2
-e 7 f eTD e" (Y+U) wm dc + sm
0
(8)
168
hD
Wm+1 = e T
-hD
W(T- h)
^
(Y+U) wm+l do
wm+1/L - e 3 f e TD a
0 hD h/2
- e 1 f e 'CD e wT (L + X) wm+1
/2 dT + s m
0 or (w0=g): (91)
Hlwm+1/2 = K1 Wm + sm
m = 0,1,2,... (92)
H2Wm+1 = K2 Wm+112 + sm
where H1 = I+M (L+X) H2 = I+M(Y+U) h
h
_h
h
Kl = e 7D - eW-1 M(Y+U) K2 = e - D- e^7 M(L+X) and 1 -h(w/+D) M=(W+D)- -e
_hD sm = D-1 (!-e _f ) bm+1/2
I is the identity matrix of order n. If the diagonal matrix wI + D is non singular, the matrices Hl and H2 are non singular. The system of equations (9) then appears as a discrete approximation of (5); starting with the given initial vector w0 = g, one can find the approximation wm to the solution v(t) of (5) at the time t = mh for any time-level m = 1,2,.... It is, for m = 0,1,... and w0 = g: wm+1 = H21K2H1 1Klwm + (H 1K2H1 1 + H2 1) sm
(10)
The matrices Hl and H2 in (9) are block-triangular matrices with diagonal blocks of simple structure; thus, wm+l/l and wm in (91) and (92) are obtained with a low computational cost on a vector computer. For h sufficiently small we have
169
M = 2 I - 2 (wI+D) + O(h3) Hl 1 = I - 2 (L+X) + 8 2 (2(L +X)2 + (wI+D) (L+X)) + O(h3) KI = I - (D +Y +U) - 82 6)I-D) (Y+U) - D2) + 0 (h3) HI 1 K1 = I - 2 A + 82 (2 (L. +X)2 + D2 + (wI +D) (L +X) -
- (wI-D) (Y+U) + 2 (L +X) (D +Y +U)) + O(h3) H21K2 = I - hA + S2 (2(Y+U)2+D2 + (wI+D) (Y+U) - (col -D) (L +X) + 2(Y +U) (L +X +D)) + 0(h') Thus, the matrix 2 T(w) =H21K2H11K1 = I - hA +h A +O(h3) 2 and the vector fm = (1I21K2HI1 + H21)sm = h(I-hA )bm+1/l + O(h) of formula ( 10) have an expansion , for h sufficiently small, which agrees through quadratic terms with the expansion of h e-hA
and
a-hA f e'`A b(t+2)dti = (I-e-M) A-1bm+1/2 0
of formula (4), respectively, Therefore, the splitting method (9) is consistent with the differential equation ( 1) and is a method of order two. In this method the two systems (91) and (92) must be solved sequentially. Starting from (5), if h/2 in formulae (6)-(8) is replaced by h, the alternate use of (6) suggests also the following method of arithmetic mean t
(1) HI Wm+1 = K1 W m + s m
m = 0,1,2,... H2 Wm2+1 = K2Wm
+
Sm
(112)
170 1 k (1) (2) wm+I = _f win + I + wm+1
w0 =g
where H1 = I+M (L+X) H2 = I+M(Y+U)
K1 e-hD-e0 1(Y+U)
K2=e-hD -ewhM(L+X)
and I-e-h(coI+D) ) M = (wI+D)-1 (
s"m = D-1 (I-e-hD) bm+I/2
Method ( 11) is ideally suited for implementation on a multiprocessor computer with two or more vector processors . The lower block-triangular system ( 111) and the upper blocktriangolar system ( 112) can be solved simultaneously on two different processors. It can easily be seen that also method ( 11) is consistent with the differential equation (1) and is accurate of order two. 3. Stability Analysis In practice two different values of Co are used in (9) and (11 ): Co = 0 and Co = X1. This choice of co makes the methods (9) and (11) stable. Indeed, we prove the following theorems. We indicate the generic diagonal element of D with di and the generic element of X and Y on the diagonal with xii and yii, respectively (i = 1,2,3,...,n). Theorem 1. With the choice w = 0 the method (9) is stable for all positive h < h, where 2 di ho = min log 1 +-
2 log 1 +i
di Xii
di yii
(12)
Proof. We denote the generic diagonal element of M with m,, i =1,2,... A thus,mi = ( 1 -e -dih/2)/di . For h > 0 we have mi > 0. Therefore, H1 and H2 are matrices with positive entries on the diagonal and with non positive off-diagonal elements. Since (i = 1,2,...,n) n E I (H1)ij I < midi + mixii < I
( ^f^ + mixii = `"I)li
j=1 j#i
E I (H2)ij I < midi + miyii < I + miyii = (H2)i1 j=1 ja
171 the matrices H1 and H2 are strictly diagonally dominant matrices . Then5, H1 and H2 are non singular M-matrices : Hl Iz0 HZ 1^:0 . For 0
we have
e -d' -mixii>0 for all i = 1,2,...,n. Thus , the diagonal entries of K2 are positive . For h > 0 the off-diagonal entries of K1 and K2 are non negative . Therefore, for 0 < h < ho, the matrices K1 and K2 are non negative. Thus, the matrix T = T (O) = HZ I K2H1 I K1 is non negative and Tm > - 0 for m = 0,1,... Since H1-K1 = H2-K2 = D-'(1-e-Dh/2)A = B,wecan wr* T = H21 (H2-B )H1I(H1-B) _ I-}1 (H1 +H2-B )Hl'B = I-HZ I (I +e - Dh/2)H11 B or S = H2 1(I + e -Dh/2 ) H11= (I--T)B-I. Since S ? 0, A- 1 > 0 and hence B- 1 > 0, we have the result 0 5 (I+T+...+Tm)S = (I+T +...+T') (I-T)B - 1 = (I-T`+1)B-15B-1 Each row of S must contain at least one positive element , and it follows that the elements of T(m) = I + I + ... + Ttn are bounded above as m -. ^. Indeed, letT<m) = (tipm)) , S = (sib) and B-1 =(Rj^ Forall i,j(1 <-i,j5n) wehave05 E t'rm) sri- <_ pig.Let r befixed and choose j = j (r) such that srj(r) > 0. Since ti"`) , sri , ii >- 0, we have t^m)sr/(r) 5 (3i^(r) or R (m) tir <_ Rij(r) / SrJ(r). Since the right side is independent of m, the matrix T(m) is uniformly bounded. Now, the sequence ( T(m)) is such that T(m+l) 2, T(m) for all m ? 0. Then , the series I + T + T2 ... converges and consequently the spectral radius of T, p(HZ IK2H11K1), is less than unity. #
Theorem 2. With the choice w = a1 the method ( 9) is stable if d1 # A, for i = 1,2,...,n and 0
2
?1 +di 2 X1 +di + log 1 + log 1 (a'I l TT+di
TT+di
xii
(13) 2
Al +di log 1 + X1 +dii yii
Proof. For a1 * -d1 (i = 1,2,...,n) and h > 0 the diagonal entries ml of M are positive: -()LI +d')h!z)f (A1 mi =(1 -e +di)>0. Therefore, H1 and H2 are matrices with positive entries on the diagonal and with non positive off-diagonal elements. Since
172 n
(H1)ii = 1 +mixii
E I (H1)ij I < midi + mixii j=1 jxi n
(H2)ii = 1 +mixii
E I (H2)ij I < midi + miyii j=1 j#i
and midi51 for O
-dih/2 ^1h/2 ) we have a -e miyu>0 for all to ( 1 +( 1+d,. 2/(X1+d•)) ^ For O < h<min (( ) /yeT) g -d'hi2-a-X1hn m• i = 1,2,...,n. If (X 1 +dt.+ yii) < 0 , the element (K 1 ) ii =e ^ ^'ii is ^ sitiveforall +di) /xii)) we have h > 0. Analogously, for 0
mixii>0 for all i = 1,2,...,n . If (A 1 +d1 +xi1) < 0 , the element (K2)u is positive for all h > 0. Thus , for 0 < h < h. the diagonal entries of K1 and K2 are positive. For h > 0 e -e
the off-diagonal entries of K1 and K2 are non negative. Therefore , for 0 < h < h., the matrices K1 and K2 are non negative. Then, the matrix T(X1) =H21 K2H1 1 K1 is non negative. Since A is an irreducible matrix, also T(Xl) is irreducible . From the Perron-Frobenius theorem5, we find that T (L1) has a simple positive eigenvalue , say a, equal to its spectral radius p(T()L1)) and that the eigenvector z corresponding to a has all positive components: T(X1) z = az with a > 0 and z > 0. We recall that x1 and a'1 obey the relation -Ax1 = a,1x1, from which we have (Y+U)xl = - (L+X)x1 - (?1I+D)x1 and (L+X)x1 = -(Y+U)x1 - (?,1I+D)x1 Thus,
K1x1 = (e
- h/2D
= e2'1
- e" hf2 M (Y +U)) x1 =
(I +M(L +X)) x1 = eX1
K2x1 = (e-h/2D _ eXl
= eX1
14) H1x1
M(L+X)) x1 =
(I+M(Y +U)) x1 = eX 1hn H2 x1
and T(a,1)x1 = e )L1h
1
173 Now, the transpose of T(1l1) has the same eigenvalues of TQ. 1) and the same spectral radius a; besides, it is irreducible and non negative. From the Perron - Frobenius theorem there exists a positive vector y for which
(T (k1))T y = ay
(15)
Multiplying equation ( 14) by YT, equation (15) by xi, and subtracting, yields 0 = yTT(A.1)x1 - (T(X1)x1) Ty = (elL h-(y)yTxl
which can be true only if e;L h = a = p(T(? 1)) , since the eigenvectors y and x1 are positive. The eigenvalue a' 1 is negative; thus p(T(.1)) < 1. Therefore , the method (9) is stable. #.
Theorem 3. With the choice co = 0 the method ( 11) is stable for all positive h < ho/2, where h0 has the expression (12). Proof. The proof runs parallel to that of theorem 1. #
Theorem 4. With the choice w = a't the method ( 11) is stable if dl * Al for i = 1 ,2,...,n and 0 < h< h. / 2, where h„ has the expression (13). Proof. The proof of this theorem runs parallel to that of theorem 2. # The choice w = a'1 in methods (9) and ( 11) is interesting because it can easily be seen from equation ( 14) that the asymptotic numerical eigensolution is proportional to the asymptotic analytic eigensolution. 4. Computation of the dominant eigenvalue The choice co = X1 in methods (9) and (11 ) requires the computation of the dominant eigenvalue X, of the system (1):
- Ax1 = ?.1x1 , x1 > 0 and Re(al) < )L1 < 0
(i=2,3,...,n)
This eigenvalue XI is obtained by a trial and error method where one solves the problem
174 (X+(AI+D)+Y)z = - 1 (L+U)z
(16)
for the eigenvector z with eigenvalue µ of maximal modulus. The problem (16) has the eigenvalue µ = 1 for A = A. Thus, the value A is adjusted until the computed µ equals unity. When this occurs A = A1. Now, we prove the correctness of this method. It is assumed that µ; = µ1(),), i = 1,2,...,n, are the eigenvalues of (16) or (X+D+Y - I?II)z = - 1 (L+U)z where (X + D + Y)-1 >- 0 and - (L + U) 2 0. Since each submatrix AR (r = 1,2,...,q) in (2) satisfies the property Ant > 0, it can be shown5 that for 1 X I <1/p (A; 1) , where p(4) is the spectral radius of 4, the result is (Arr- IAA I)-1>0 . We then note that the values of A for which det (AR- I k I I) =0 , r= 1,2,...,q , are t h e s o l u t i o n s of t h e e q u a t i o n der ( X +D + ` Y - J A J 1) =0. Thmfoe,maxp(An1) = P((X+D+Y)-1) . (1nvaquer111y,for-1/(p((X+D+Y)-1))=-A ■ I µ;(A) I , i = 2,3,...,n. Beside, to µ1(7l) there corresponds an eigenvector z1 > 0. Since det (X + D + Y - A.I) = 0, it follows from (17) that lim µ1(A,) _ +oo. Finally, it can easily be seen that dµ1(A.)/dA.<0. Therefore, we conclude that for A. < A. < 0, µ1()L) is a positive monotone decreasing function and A. = A. is a vertical asymptote.
Since µl (0) = p(-(X +D +Y)-1 (L +U)) < 1, it follows that there exists a value of A., let's say A1, (-A` < ), < 0) for which µ1(A.1) = 1, completing the proof of the correctness of the trial and error method. For a fixed value of A in (- A., 0), the eigenvalue of maximal modulus 41(A.) of (17) is determined using the power method. At each iteration k (k = 1,2,...) of the power method we must solve a linear system of the form (X + (AI+D)+Y)z(k) = -(L+ U)z(k-1) (18) (P) is an arbitrary vector with Iz(0)N = 1). Since X and Y are matrices of simple structure, the vector z(k) can be obtained applying for ( 18) an alternating-direction implicit iterative
175 method. Remembering that X and Y are the direct sum of matrices of simple structure , this iterative method can be easily implemented on a parallel computer. 5. An Application Methods (9) and (11) have been implemented on the multivector CRAY X-MP for solving a system of ordinary differential equations which was originated by the application of the method of lines in the domain S2, using centered -in space-difference equations, to the initial-boundary value problem
au
a Alan
a A2 . au + _ + B1
an
au
+B2
at =ax ( ax ay ay ax ay u(xyr) = 4(xyr)
Cu + f (19)
(x,y)eail, t>0
u(xyO) = w(xy) (x,y)e 6 where u = u(xyt ) = ( u1 u2 ... uq )T, f = f(xyt) = (f1 f2 ... fc1) T, $(xyt)
yr9)T andAk Ak(x y) = diag {a, ...,aq(k) } , Bk = Bk(x y) _ V(xy) = (' VI V2 diag {bikl,...,bQ'tl} , (k=1,2) , C = C(x y) _ ( Cr) , (r,s=1,2,...,q) with ark), b.kl , crs continuous functions of the spatial variables x and y and a;k>> 0, cr,>O and crs
Systems (91)-(92) or ( 111)-(112) are solved with the block forward -elimination algorithm and with the block back-substitution algorithm, respectively. Thus, at each iteration 1(1 = 1,2,...,q) we have to solve a tridiagonal system of order p ; for the solution of this system we use the cyclic reduction algorithm. These algorithms can be easily and efficiently implemented on a <
176 solving the initial-boundary problem (19) is very high . For example, if we consider a system (19) of two (q = 2) parabolic equations in a rectangular domain f2 = [0, 90] x [0, 90] whose ) bit) =b (1) =bi21=b?21 coefficients ai 11=ai 2) and a2 11=a22 have values in the interval [0.8, 2.5], are zero, c11 and c22 have values in [0.28 , 2.1] and I c121 and I C211 have values in [3.3 1074, 2.1 10-2], method (9) with w = 0 gives a solution at time t" = 0.2 with an accuracy of three significant digits, at least; the number of interior mesh points is p = 256 x 256 and the time-step is h = 10-3. The computer-time on CRAY-MP, utilizing one processor, for solving this problem is 13.7 seconds. References [1] L. Collatz, Functional Analysis and Numerical Mathematics, Academic Press, New York, 1966. [2] I. Galligani, V. Ruggiero, The arithmetic mean method for solving essentially positive systems on a vector computer, Intern. J. Computer Math. 32 (1990), 113-121. [3] G. I. Marchuk, Methods of Numerical Mathematics, Springer-Verlag, New York (1975). [4] W. Schonauer, Scientific Computing on Vector Computers. North-Holland, Amsterdam, 1987. [5] R. S. Varga, Matrix Iterative Analysis, Prentice Hall, Englewood Cliffs N.J. (1962).
WSSIAA 2( 1993) pp. 177-195 QWorld Scientific Publishing Company
A GENERALISED CAYLEY TRANSFORM FOR THE NUMERICAL DETECTION OF HOPF BIFURCATIONS IN LARGE SYSTEMS
T. J. GARRATT Aspentech UK, Sheraton House, Castle Park , Cambridge , CB3 OAX, UK
G. MOORE Department of Mathematics , Imperial College, Queen's Gate, London SW7 2BZ, UK and A.SPENCE School of Mathematical Sciences, University of Bath, Bath , Avon, BA2 7AY, UK
Abstract A study of the mathematical properties of the Generalised Cayley transform is given. The transform is used in an algorithm to compute approximations to a small number of eigenvalues of a large sparse nonsymmetric matrix. The eigenvalues of interest are those with smallest real part and the algorithm can be used to determine the stability of a nonsymmetric matrix and to detect Hopf bifurcations of parameter dependent equations. Some numerical experiments are performed using a model of a tubular reactor.
1 Introduction Let A be an N x N real large sparse nonsymmetric matrix with eigenvalues pi, i = Re(,ui) <, Re(µ,), and 1,... , N ordered by increasing real part, i.e. i < j corresponding eigenvectors (and possibly generalised eigenvectors) u . This paper is concerned with the following problem: Find the eigenvalue(s) of A with smallest real part. It is worth mentioning straight away that this is not a problem for which
177
178 standard software is readily available and that for large sparse matrices the standard QR algorithm [9] will be at best inefficient, or may indeed not be feasible. Our chief motivation for this problem is for the bifurcation analysis of the parameter dependent nonlinear system
dt
f(x,A)=o, (1)
where f : RN+1 -+ RN, x E RN, A E R which might arise from the spatial discretisation of a partial differential equation, where N is typically very large. The simplest solutions of (1) are the steady state solutions which satisfy f(x,A)
=
0
(2)
and may be computed using the usual continuation techniques [4, 13] to obtain a branch of solutions F = {(x, A) E RN}1 : f (x, A) = 01. For simplicity we assume that r is parameterised by A, write x = x(A) and introduce A(A) := f x(x(A), A), the Jacobian matrix which is typically large and sparse. (One could equally well assume a parameterisation of r by s, say arc-length, and write ( x, A) = (x(s ),A(s)) and A(s) := fx(x(s), A(s)).) An important question is: what is the stability of a branch of I' with respect to small disturbances in the initial conditions of (1) ? The principle of linearised stability [11] requires that we analyse the solutions of
v+A(A)v=0, vERN (3) and hence the eigenvalues y of A(A)u = uu (4) determine the stability of F. We call the right half plane the stable half plane and say that A(A) is stable if and only if all its eigenvalues lie in the stable half plane. In this case (x, A) is a stable steady state, and a loss of stability occurs along r as A varies when either (i) a real eigenvalue crosses the imaginary axis - a steady state bifurcation of (1), or (ii) a pair of complex conjugate eigenvalues cross the imaginary axis - a Hopf bifurcation of (1) where periodic solutions branch away from r. The first case may be detected easily, for example by observing a sign change in det(A(A)), but the second is much more difficult since the determinant does not change sign and no other indicator is available from an LU factorisation of A(A). In this paper we adopt the approach of calculating µ1(a), the eigenvalue of smallest real part of A(A). (In applications µ1(A) is often called the "dangerous" eigenvalue.) Then we monitor pl(A) as A varies and note that, generically, abifurcation of (1) will be detected as Re(,ul(A)) changes sign. Our approach to the problem is closely related to the technique for the numerical identification of stable matrices described in section 7.13 of Franklin [5], which we now outline. Denote the spectrum of A by o(A) = {µ1,i = 1, ... , N}. For any real positive a satisfying -a ^ o(A), consider the Cayley transform of A
F(A; a) := (A + aI)-1(A -
aI).
(5)
179 It is easily shown (see Lemma 1) that A and F(A; a) have the same eigenvectors and F(A; a) has eigenvalues 0, := (µi - a)(µi + a)-', i = 1, ... , N. It then follows that Re(,ai)>0 (<0) jBil <1 (>1) (6) and Franklin proposed finding ro(F(A; a)), the spectral radius of F(A; a), by the power method. Clearly if ro(F(A; a)) < 1, then A is stable, but if r,(F(A; a)) > 1, then A is unstable. Three key features of this approach are (i) The power method applied to F(A; a) requires the solution of systems of equations of the form (A + aI)y = (A - aI)x. This is the most costly operation involving A and will only be feasible if the sparsity in A can be exploited in a direct solver for A+aI. This is often the case in applications arising from partial differential equations. (ii) A suitable choice of a can help in two ways. First it can help separate close eigenvalues near the imaginary axis. (For a simple example, consider A with p, = -e, µ2 = 3e. If we take a = 2e, then F(A; a) has eigenvalues 01 = -3 and 02 = 1/5.) Second, if Re(µl) < 0 and hence ro (F(A; a)) > 1, an appropriate a (see [5]) can maximise the rate of convergence of the power method. (iii) A drawback is that if A has eigenvalues with Re(p) >> 0, as is likely in many applications from partial differential equations, then F(A; a) will have eigenvalues clustered near 1 inside the unit circle. Thus power-type methods to find eigenvalues of F(A; a) on or near the unit circle arising because of eigenvalues of A on or near the imaginary axis may take a long time to converge. We extend Franklin's approach in two ways. First we use the Generalised Cayley transform defined as follows: for a1, a2 E R with (a) a1 < a2
and
(b) a1 ^ a(A),
(7)
define C(A; a1, a2) by C(A; al, a2) := (A - aiI)-'(A - a2I). (8) which has eigenvalues Bi = (µi - a2)(µ, - al)-', i = N. With two parameters to be chosen there are a variety of options available. We explore this in some detail in Section 2. Second, we replace the power method by the Arnoldi or subspace iteration method, and the aim is to choose the parameters a1 and a2 to accelerate the convergence of an approximation to 01 = (hi - a2)(fil - al)-', the eigenvalue of C(A; al, a2) which corresponds to the "dangerous" eigenvalue of A. In today's language, we can think of the Generalised Cayley transform as "preconditioning" A, with C(A; al, a2) having more desirable spectral properties. The use of the Generalised Cayley transform for the numerical determination of prescribed eigenvalues and eigenvectors is not new. Peters and Wilkinson (Section 5,
180 [121) suggest its use for the generalised eigenvalue problem Au = pBu. Christodoulou and Scriven [ 2] use a generalised Cayley transformation in a scheme to compute approximations to eigenvalues of a generalised eigenvalue problem arising from a model of two dimensional slide coating flow. At first sight, their approach is very similar to ours, but there are significant differences in detail which we discuss at the end of Section 2. The plan of this paper is as follows. In Section 2 we give a theoretical study of the Generalised Cayley transform , concentrating on the relationship between the eigensolutions of A and C( A; al, a2). Assuming the eigenvalues of A are known, we define precisely how they are mapped to those of C(A; al, a2 ) and determine the role of the parameters a1 and a2. We outline a number of strategies for choosing a1 and a2 to optimise the convergence of an approximation to 01 computed by the subspace iteration or Arnoldi methods. In Section 3 we describe the implementation of these strategies when the eigenvalues of A are unknown , and in particular , discuss the "continuation " setting, when A = A(.\), a nd present an algorithm for detecting Hopf bifurcations of (1) along a branch of r using the Generalised Cayley transform. Finally, in Section 4, we present numerical results using our algorithm to detect Hopf bifurcation of equations modelling the tubular reactor [10]. The aims of this paper are to present the mapping properties of the Generalised Cayley transform ( Theorem 2 and Lemma 3 ), to describe and justify two of the many strategies possible when the Generalised Cayley transform is used with an iterative eigenvalue solver and to illustrate their use on a test problem. We do not claim that these will the optimal strategies for all problems, but these strategies have been successful for the problems we have tried here and in [3, 7].
2 Properties of the Generalised Cayley transform In this section we concentrate on the mathematical aspects of the Generalised Cayley transform of A defined by (7) and (8). First note that, as mentioned in the introduction, there is a simple relationship between the eigensolutions of A and those of C(A; al, a2), given by the following lemma. LEMMA 1 Assume ( 7). (µ, u) is an eigensolution of A if and only if ( 0, u) is an eigensolution of C(A; a1, a2), where 0 = c(µ) (µ - a2 )(µ - al) -' 1,
µ = c-1(0):= (a1O - a2)(0 - 1)-1. (9)
Here c : C\{a1} --+ C\{1} is a bijection. The fact that c is a bijection is important since it means that once eigenvalues of C(A; a1, a2) have been found the corresponding eigenvalues of A may be computed trivially. A more general result, which shows that one eigenvalue of A has the same
181 algebraic and geometric multiplicity as the corresponding eigenvalue of C(A; al, a2), is found in [6]. THEOREM 2 Suppose A has k eigenvalues pi each of algebraic multiplicity mi and geometric multiplicity 1, so that 1 mi = N. (Note that the µi need not be distinct.) If J(A) = diag (Jj(pi ), J2(µ2), ... , Jk(µk)) represents the Jordan form of A where JJ( pi) is the mi x mi Jordan block associated with the eigenvalue pi, then the Jordan form of C(A; al, a2) is J(C(A; a1i a2)) = diag (J1(01), J2(02 ), ... , J.(8, )),
( 10)
where Oi = (µi - a2 )( µi - al)-1
2.1
The mapping properties of c(µ)
We now turn our attention to the role of the parameters a1 and a2 in the mapping c(µ) = (µ - a2)(µ - al)-1, 11 E C\{al}. In this subsection we discuss the precise mapping properties of 0 = c(µ) and show how eigenvalues of A are mapped to eigenvalues of C(A; al, a2) under c(p). In fact (µ - a2)(µ - a,)_1 is a special case of the Mobius Transform and as such is circle and line preserving [21], where a line is a circle with infinite radius. The next lemma states the well-known result that vertical lines are mapped to circles under c. But first, we introduce some notation. The line parallel to the imaginary axis passing through v is denoted 1(v) :_ {z E C : Re(z) = v}, the circle with centre d E C and radius r E R+ is denoted R(d, r) {z E C : Iz - dl = r} and the disk with centre d E C and radius r E R+ is denoted B(d, r) := {z E C : Iz - dl < r}. LEMMA 3 Let v be a fixed real number with v # a1. Then under c(µ) the line 1(v) is mapped to the circle R(d(v, al, a2), r(v, al, a2)), where d(v, al, a2) = 1 + al - a2 2(v -
a2) = a1 - a2 r(v, al, al) 2(v - al)
Thus complex numbers in the p-plane with real part v are mapped to complex numbers of the 0-plane on the circle centre d(v, a1, a2) and radius r(v, a1, a2). Note that the radius and centre of the circle both depend crucially on the parameters a1 and a2. The next theorem states that there are four distinct regions of the µ-plane which are mapped to four distinct regions of the 0-plane as illustrated in Figure 2.1.
182 Im(p)
IIIp IV0
Ip
Re(p) al
2 (al + a2) a2
Im(0)
IIo
Re(0)
Figure 1: This figure shows how regions in the it-plane are mapped to regions in the 0-plane.
THEOREM 4 Assume (7) . Let 0 = c(p) where it E C. Then Re(p) E (-oo,al) f al + a2 Re(p) E [al,
Re(p) E
[al
0 E {zEC:Re(z)>1} 0E{{zEC:R.e{z}S1}-B(0,1)}
2 2 a2,12)
0 E {B(0,1) - B
(2' 2)}
0 E B (1 1) Re(p) E [a2,00 )
2
Proof The results follow after straightforward manipulation. ❑ The splittings of the p and 0 planes provide opportunities for several different strategies for the eigenvalue problem stated in the first part of the introduction. The following Corollary gives the theoretical basis for the two strategies on which we shall concentrate in the next subsection. COROLLARY 5 Let pl, p2, ... , pN be eigenvalues of A ordered by increasing real part with Re(pr) < Re(p,+l) for some r, 1 <, r < N. Assume (7) and let Bi = c(pi).
183 (a) If (al + a2)/ 2 = Re(µr+1) then
ei0B(o,1), and
10r
+1I
0 B (O,
i =1 ,...,r,
1), i=r+1,...,N
= 1.
(b) If a2 = Re(µr + 1) then
Oi^B\2'2 /
i=1,...,r,
BiEB\2'2 /
i=r+1,...,N
2.2 Iterative eigenvalue solvers and transformations In this subsection we shall introduce two strategies for the computation of 0 = c(µ1) and hence µl. Our overall idea is to choose a transformation (i.e. al and a2) and an iterative eigenvalue solver (IEVS), either subspace iteration or Arnoldi's method, and to use the IEVS to find certain eigenvalues of C(A; al, a2) which correspond to required eigenvalues of A. Clearly we need to know the convergence theory for the IEVS in order to make appropriate choices for al and 02 and so before explaining possible strategies in more detail, we need first to recall the convergence properties of subspace iteration and Arnoldi's method. For further details, consult [1] and the references therein. (a) Subspace iteration (with k, vectors and l orthogonal iterations): From a set Q. = [q1,. .. , qk,] of k, linearly independent vectors, compute I orthogonal iterations AQi-1 = QiR;
where
QHQJ = Ik,
for j = 1, 2, ... ,1.
Approximate eigensolutions of A are given by (µi, ui ), i = 1, ... , k, where ui = Qtyi and (jti, yi) are the eigensolutions of the k, x k, matrix Ak. = QHAQ1. Assuming
IiII > IP2I > ... > IIlk, I > IILk ,+II >
IEtk ,
+2l > ... > I1NI
(12)
and under reasonable assumptions on Qo ( see [1]), approximations to the k, dominant
eigenvalues of A converge and for simple Eti, IFti -
µil = O((IItk, + 1l/I1 I)`)•
(b) Arnoldi ' s method ( with ka vectors ): Using a normalised initial vector v1 E R, generate an orthonormal set of vectors 1/k. = [vi .... , vka] using the relation
hi+r,ivi +l = Avi -
hi,ivi,
hi, 3 = (vi, Av3 )•
(13)
i=1
Approximate eigensolutions of A given by (µi, ui) , i = 1, ... , ka where ui = Vkayi and ( µi, yi ) are eigensolutions of the ka x ka upper Hessenberg matrix Hka = VnAVk, {hi,;}. There are few convergence results for this method [ 1] and it is not possible to say in general which eigenvalue approximations can be expected to converge first
184 in Arnoldi's method. However, theory and experiment [17, 18] both indicate that convergence to extrernal well separated eigenvalues is to be expected. Given an IEVS, an appropriate transform C(A; al, a2) is chosen, namely one for which convergence to the required eigenvalues can be expected. In this paper, we concentrate on two particular strategies. These are Strategy 1: Given 1 < r < N map Or+l, ... , ON into B(0,1) and apply the subspace iteration method to C(A; al, a2); and Strategy 2: Given 1 < r < N map 9r+1, ... , ON into B(2, 2) and apply Arnoldi's method to C(A; al, a2). However other variations are possible [2, 7] and we discuss [2] later. To illustrate the methods we shall as far as possible quote general results and then restrict attention to the case where the dangerous eigenvalues are complex. This is the most difficult situation to detect in applications and leads to the case of Hopf bifurcation in equation (1). We emphasise that this restriction is not essential for our methods and once our approach is explained for one situation, its extension to other cases is usually straightforward. In the sequel we often assume either or both of the following: (Al) 1.11,2 = vi ± iw1 where vi, w1 E R, w1 > 0 and µ1,2 are simple eigenvalues of A. (A2) 1 < r < N is such that Re(pr) < Re(pr+i ). From Corollary 5 we see that the choice 2(ai + a2) = Re(pr+i) ensures Ier+iI = 1 and 9i E B(0,1), i = r + 1, ... , N. This leads to the following lemma. LEMMA 6 Assume (7) and A2. Let 2(a1 + a2) = Re(µr+i). Also assume that the starting vectors for subspace iteration satisfy the projection condition of [1]. Then the subspace iteration method with k, > r vectors and l orthogonal iterations applied to C(A; al, a2) produces approximations 9i, ... , Br which converge to 91i ... , 0r with a rate of convergence of at least O(I9i I-r) for 9;, i = 1, ... , r. With the further assumption Al we can be more explicit about choices for al, a2. In particular 1011 can be maximised to achieve the fastest convergence possible for 91, with an approximation to pi given by ii = c-1(01). This is equivalent to the approach by Garratt, Moore and Spence in [8]. LEMMA 7 Assume (7), Al and A2 and let 2(a1 +a2) = Re(pr+i). The maximum of 1011 is attained when a1 = Re(pr +1)
- 7r}1 and a2 =
R.e( pr+i)
+Tr+i, (14)
where Tr+l = (Re(pr +i) - vi)2 + wl)1 . With this choice [(1 + el+i)2 - ^r+1l-1 > 1 (15) where Sr+1 (Re(pr
1.. t
+i) - vi)/wi•
(16)
185 Proof Let al = yr+1 - T, a2 = yr+1 + T where T > 0 and yr+1 = Re(µr+1 ). The problem of maximising 1011 subject to 2(al + a2) = Re(µ.r}1) is equivalent to the problem of maximising D(T) where
^01^
D(T)
= (yi - Vr}1 -
T)2 +
wi wi.
(17)
(vl - Vr+1 + T)2 +
The maximum of D(T) occurs at a turning point which satisfies dD(T)/dr = 0. It is easy to show that
dD _ dT 0
(Vr + 1 - v1)[wl + (v1 - yr+1 -
r)( vl - Vr+1 + T )) = 0.
(18)
Since v1 < yr + 1 we require that 2 (vr+1 - vl)
- T2 + 2v2 = 0,
that is T = ±((V r +1 - vl)2 + wi ) +.
(19)
It is straightforward to verify that T = Tr+1 = +(( vr+1 - v1)2 + w1 2 is the value of T at which the global maximum of D ( T) is attained . Furthermore , with a1i a2 given by (14) we have 01 = W1 [7 - (Vr+l - v1)] _1i
and hence 1011 = [(1 + fr+l )' where Sr+1 is defined by (16). Hence assuming Al and A2 hold we have: Strategy 1: Choose al, a2 according to (14) and take subspace iteration with k, > r vectors as IEVS. Note that the rate of convergence depends on Sr+1 = (Re(ar+i) - v,)/w1, and the smaller this ratio, the slower the convergence of 91,2 and hence 7i1,2 = c-1(01,2). Note also that the optimal choice of T puts 01,2 on the imaginary axis of the 0-plane. Our second strategy concentrates on mapping eigenvalues into B(2, and using Arnoldi's method. Since it is difficult to give explicit rates of convergence of approximations to eigenvalues computed by Arnoldi's method for general spectra, it is not possible to write down exactly optimal al and a2 to give the fastest convergence possible for 01,2. However various reasonable strategies are possible. Consider now a specific case with r = 2 and assume A2. If a2 = Re(Y3) then 0, E B(2, 2), i = 31 .... N. The strategy we propose is now to choose al such that 101,2 - 21 is maximised and 01,2 are extremal and well separated from B(2, 2). Hence an approximation to 01,2 is likely to be found when Arnoldi's method is applied to C(A; al, a2).
186 LEMMA 8 The maximum of 101,2 - 2 such that a2 = Re(µ3) is attained when 1 al = Re (µ3) - [(Re(µ3) - v1)2 +'wl]2. Furthermore 01,2 = 2 2 [( 1 + S3)^
where ^3 is defined by (16). Proof Let al = v3 - r and a2 = v3 where v3 = Re(p3). Observe that (VI - V3 - r) + w1 1 (v1 - v3 + T) + wl 2
2
_ 1 (v1 - v3 - T )2
+
wj
(20)
2 ^ (VI - V3 + r)2 + w1
Using Lemma 7 it is clear that the maximum `` of IO1 - s I2 occurs when r = ((v3-vl)2+wi)z and 01,2- 2 =±Z[(1+S3)s -S3]-l
Note that the ratio ^3 affects the rate of convergence; if this is small then 01,2 will be near to 1 + i.e. B(2, z ), which may slow down the convergence. Hence assuming Al and A2 for r = 2 we have Strategy 2: Choose a1i a2 according to Lemma S and take Arnoldi's method as IEVS.
We illustrate these strategies in Section 4. Our work is related in spirit to, but different in detail, from that in [2], where a generalised Cayley transform is applied to the eigenvalue problem Ax = uBx with B singular. In the context of this paper their strategy would be to choose al, a2 such that all eigenvalues are mapped into B(0,1), with 01 and 02 mapped onto the unit circle R(0,1) and to apply Arnoldi's method to C(A; al, a2). If Arnoldi's method is chosen as the IEVS applied to C(A;a1ia2), then our recommended approach is Strategy 2 where 01 and 02 are mapped outside B(2, 2) rather than the method in [2]. Another extension over [2] is that we provide a strategy which could utilise subspace iteration, which might be more readily available in software libraries. Of course, once the detailed mapping picture is understood, then variations on these strategies are possible.
3 Practical details of algorithm We present the practical details involved in the implementation of Strategy 1 or Strategy 2 to compute an approximation µl to ll, the eigenvalue with smallest real part of the N x N nonsymmetric matrix A. Note in passing that the approach may easily be extended to compute approximations to several of the eigenvalues with smallest real part by utilising the deflation technique proposed by Saad in [20]. The choice of the parameters al and a2 to accelerate the convergence of 01 depends on the particular strategy adopted, and the theory was presented in the previous section. However, the theory assumes exact information about the eigenvalues of A, which of course is not available. Thus, the scheme we use follows the same philosophy as Saad [18, 20]. First assume that approximations µl, ... , Et,.+1 to µl, ... , µ,.+1 are
ft
187 known ( see Section 3.1 for details ). Use the approximate eigenvalues to choose al, a2 according to which strategy has been chosen and then begin an iterative process where al, a2 are updated by new computed eigenvalue approximations until convergence of is achieved. The usual method of assessing the accuracy of an approximate eigensolution (µ;, u;) of the eigenvalue problem Au = µu is to compute the 2-norm of the residual vector ( A - ii )u;, that is r(A,µt,iii)
II(A-Ei=)u;II2IIut!Iz1.
(21)
The approximation µl is deemed sufficiently accurate when r(A, µl, ul) < e, where e is some specified tolerance . For the purpose of bifurcation detection , the aim is not to compute an accurate approximation to yj and thus a reasonable value for e is 10 -3, say. The proposed scheme for computing it, is summarised in the following algorithm. Algorithm 3 .1 IEVS accelerated by C(A; al, a2) to compute µl. Let IEVS be an iterative eigenvalue solver which computes k > r eigenvalue approximations (where k = k, or ka) and a be a fixed tolerance. 1. Compute initial eigenvalue approximations µl, ... , µr+i . 2. For j = 1, 2, ..., do: (i) Use µ1,...,µr+, to obtain C(A;al(j),a2(j)). (ii) Apply IEVS to C(A; al (j), a2( j)) to compute approximate eigensolutions ( Bi, u;), i = 1, . . . , k. (iii) The corresponding approximate eigensolutions of A are given by (µ;,u;), where µ; = c 1(B;). until r(A, µi, u1) < E.
The practical details for the implementation of this algorithm are presented in the remainder of this section. Remark 3 . 1 The matrix- vector operation y = (A - a1I )- 1(A - a21 ) x required by IEVS is implemented by y = x + (al - a2 )z,
where (A - aiI )z =
x
which requires only one back solve and no matrix - vector multiplications.
(22)
188
3.1
Initial eigenvalue approximations
Some initial eigenvalue approximations hi,. .. , fir+1 are required for step 1 of Algorithm 3.1. We have in mind the case that A is a parameter dependent matrix A(.), where .A varies along a branch of steady state solutions r. It is not unreasonable to expect that eigenvalues computed at points along r provide good approximations to eigenvalues at neighbouring points. Thus, at all but the first point computed along r, we propose that eigenvalue approximations computed at each point on r can be used to provide initial eigenvalue approximations at the subsequent point. The remaining question concerns how initial eigenvalue approximations at the first computed point on I' may be obtained and there are several possibilities. The one favoured here is to apply the subspace iteration method to A'. The reasons for this are two-fold: (i) Approximations to the eigenvalues of A nearest the origin will be computed by the subspace iteration method applied to A-' and since in applications most of the eigenvalues of A have positive real part, the eigenvalues of A near the origin provide reasonable values to obtain initial estimates for al, a2; (ii) Continuation codes often compute a factorisation of A to perform the steady state calculations, and this initial subspace iteration on A-' will cost very little extra work. Should any of the eigenvalues with smallest real part and large imaginary parts be 'missed' by this method, then they should be computed by Algorithm 3.1.
3.2 Starting vectors We use Saad's method of utilising previously computed approximate eigenvectors [16, 18, 20] to find starting vectors for the Arnoldi and subspace iteration methods as follows where k = k, or ka: ( i) Initial point on r: Let vi E RN be a random vector. For the subspace iteration method applied to A-' let 1 '-k, Vil
(ii) For C(A; al(l), a2(1)): Let ui, ... , uk be the approximate eigenvectors computed at the previous point on r (or by (a) if initial point). If using Arnoldi's method: vi = ul; if using the subspace iteration method: Qo = [u1, ... , uk, ]. (iii) For C(A;al(j),a2(j)), j = 2...: Let 71,...,uk be the computed approximate eigenvectors of C(A;a,(j - 1), 02(1 - 1)). If using Arnoldi's method: v, = ul; if using the subspace iteration method: Qo = [ul, ... , z/.k,].
An important practical point is that even though the eigenvectors of A may be complex, real arithmetic for the Arnoldi and subspace iteration methods is maintained in practice by replacing u,, ... ,'Uk with a, set of real quasi-Schur vectors [19].
t. t I , ,1.. i . 1 .. ,1.._4. to t14
189
3.3
Choosing r,ks,ka and 1
In this small subsection we give our choices for r, k„ ka and 1 used by the strategies for the numerical results presented in Section 4. For Arnoldi's method the existing theory doesn't allow precise statements about convergence rates. For that reason we have only considered the case r = 2 and varied ka (see Section 4). For the subspace iteration method when 81 f ... , Or are the dominant eigenvalues outside B(0,1) and 0, E B(0,1), i = r + 1, ... , N, k, should be chosen such that k, > r and we put k, = 2r. To predict a rough estimate 1 for the number of orthogonal iterations in the subspace iteration method 1 that in theory ensures IB1 - 01I ti O(e) we use the approach in [151 as adopted by Saad [18].
3.4
A checking procedure
In this subsection we present a simple example to show that caution must be exercised in order to avoid accepting an approximation to an eigenvalue which is not the one with smallest real part. We then present a strategy that has proved reasonably robust for ensuring that we have indeed found the eigenvalue of.smallest real part. We first present the simple example. Example . Let A be a 6 x 6 matrix with eigenvalues p1,2 = 2 ± 10i, p3 = 3, p4,5 = 6 + 6i and 06 = 100 . Suppose the following rough approximations to P1,2 and p3 are available: i 1,2 = 3 ± 2i, p3 = 4. Using Lemma 7 with r = 2, the optimum values for the parameters a1 and a2 using 1 1,2 and p3 are a1 = 4 -v and a2 = 4+V. The eigenvalues of C(A; 4 - v/5-,4 + v/75) are 01,2 03 04,5
06
(99 f 204i)(109 - 4V^5-)- 1 0.9894 ± 0.4470i -(f + 1)(/ - 1)- 1 zz: -2.6180 (35 f 12fi)(45 + 4f)- 1 0.6488 ± 0.4974i (96 - v'-5)(96 + f)- 1 : 0.9545.
Observe the choice of a1, a2 with a1 reasonably close to p3 has made 03 the dominant eigenvalue of C(A; 4 + -\,F5,4 - v). The subspace iteration method with k, = 3 vectors applied to C(A; 4 - f, 4 + f ) yields approximations 01i 02, 83 to 01, 02, 03, which we could expect satisfy
10 1,2 - 0 1 ,21
= 0( 1041101H )
1 03 - 03I = 0 (1 0411031 -')
PZ^
0(0.7530') 0(0.3123').
It is clear that convergence of 81,2 is much slower than that of 03. If the iteration is stopped too soon , a good approximation j13 = C- 103) to 43 is obtained whereas a poorer approximation %21,2 = C-1 01,2) to 111 , 2 is obtained. It can then happen that Re(µ3 ) < Re(µ1 ) which means that we accept µ3 as an approximation to the eigenvalue with smallest real part!
190 We emphasise that this example is not pathological and have in practice observed similar problems in numerical experiments using both the Arnoldi and subspace iteration methods. The problem often occurs when IRe(p,+1) - Re(pl) l is small and this is not surprising since we found in Section 2 that the ratio (Re(µ,+1) Re(pl))/Im(yj) affects the speed of convergence of 01,2. If this ratio is small, then convergence of 01,2 may be slow and an unfortunate choice of a1 and a2 may lead to rapid convergence of 03 before that of 01,2. This situation is much less likely to be a problem at points on F subsequent to the initial one, since good eigenvalue approximations to compute a1 and a2 should be available from previous computations. To attempt to overcome the problem of accepting a `bogus' approximation to the dangerous eigenvalue, we use the following technique for testing a candidate approximation µ° to pl where we assume also that an approximation µ3 to µ3 is available. Algorithm 3 .2 To test an approximation µi to iq. 1. Set a1 = Re(ji°) - r and a2 = Re(µ°) + r where r = Re(µ° - µi) (or Re(µ2 - µ°) if µ1 is real). 2. Apply the subspace iteration method to C(A; al, a2) with suitable starting vectors to yield new eigenvalue approximations Al, • • • , µk, 3. If Re(AC) < Re(ii) then update a1, a2 and return to Algorithm 3.1. Otherwise, accept µ ° as an approximation to the eigenvalue with smallest real part. The reason why this algorithm is useful is that the above choices for a1 and a2 ensure that µi is mapped to an eigenvalue of C(A; al, a2) on the unit circle; if there are any eigenvalues of A to the left of AC, then these will be mapped to eigenvalues of C(A; al, a2) outside the unit disk and thus should be computed by the subspace iteration method. A checking procedure should also be included in a strategy where Arnoldi's method is used as IEVS. µ° is mapped to an eigenvalue of C(A; al, a2) on the boundary of B(2, 2) by setting a2 = Re(µ1). There is a free choice over a1. For example, if it is suspected that Re(pi) > 0 at the initial point on I' we propose that a1 = 0 which maximises 101 for all imaginary 0 [7] because we are interested in detecting any eigenvalues near the imaginary axis. This may not be as robust as the subspace iteration method for testing candidate approximations since eigenvalue separation also affects convergence.
In step 2 of Algorithm 3.2, there is the question of choosing suitable starting vectors for the Arnoldi or subspace iteration method. A sensible choice is to use random vectors which in theory contain components of a large number of eigenvectors of A and thus components of any eigenvectors of eigenvalues to the left of µl. A more robust approach is to adopt a deflation technique whereby the random vectors are
191 orthonormal to the computed approximate eigenvector ul [20). The reason is that the subsequent subspace computed by the Arnoldi or the subspace iteration method will be orthogonal to ul and this increases the possibility of approximations to eigenvalues to the left of A, being computed, should they exist.
3.5
Algorithm for continuation
To summarise the details of the previous subsections, we give an algorithm for computing the eigenvalues with smallest real part of A(A) := fx(x(A),A) along a branch r = {(x(Ai), A1) : f (x (Ai), Ai) = 0, i = 1, 2, ...) of steady state solutions. Algorithm 3 .3 Hopf bifurcation detection using C(A; a,, a2). Let IEVS be an iterative eigenvalue method which computes k (= k, or ka) eigenvalue approximations and e be a fixed tolerance. 1. Compute initial approximate eigensolutions ( fti, ul ), ... , ( µ,.+1, it,.+i ) of A where A:= fx(x(A1), A ,) by applying the subspace iteration method to A-1. 2. Use u1, ... , u,+1 to obtain starting vectors and perform step 2 of Algorithm 3.1, yielding approximate eigensolutions (µi, ui), i = 1,. .. , k, where r(A,µ1iiii) < e. 3. Use Algorithm 3.2 to check if it, is an approximation to the eigenvalue with smallest real part. If it is not, goto 2.
4•
For j = 2 , . . . , let A:= fx(x(.j),A3) and do: (i) Use u1i ... , u ,.}1 to obtain starting vectors and perform step 2 of Algorithm 3.1, yielding approximate eigensolutions (µi, u1), i = 1, ... , k, where r(A, µ1, ul) 5 e. (ii) Use Algorithm 3.2 to check if A, is an approximation to the eigenvalue with smallest real part. If it is not, goto 4(i). (iii) Check for Hopf or steady state bifurcation by observing the sign of Re(µ,) at (x(Aj_1),Aj_1) and (x(A3),A3).
In the next section we present the results of some numerical experiments using the above algorithm to detect Hopf bifurcation of equations modelling the tubular reactor.
4 Numerical results Numerical experiments were performed using a model of the tubular reactor [10), which we now outline. The equations in dimensionless form modelling the
192 conservation of reactant and energy of a reaction in a homogeneous tube of length L are: L 1 52y _ ay V y + Pe 832 8s
i f Dy exp ly - T-
0, (23)
2 V T + Pe h
8s - /3(T - To) } BDy exp [ry - yT 1]
= 0 (24)
with some given boundary conditions (see [10], p. 1412), where y(s) represents the concentration and T(s) the temperature, both parameterised by s. Pem, Peh, D, y, Q, B, L and v are certain parameters of the model and D, the Damkohler number, is the parameter to be varied. The simplest finite difference scheme, based on central differences, was used to discretise the steady state solutions of (23) and (24) using p discretisation points with central differences, producing the nonlinear equation f (x, D) = 0 (25) where x = [yo, To, yl, T1i ... , yn, T,,] and the Jacobian f x (x, D) is of dimension 2(p+ 1) x 2(p+ 1), nonsymmetric and five banded. The continuation package PITCON [14] was used to compute a branch F of solutions of (25). Algorithm 3.3 was implemented to compute the eigenvalues of f x (x, D) at every point computed along a particular branch 1 of steady state solutions for D E [0.23, 0.31] arising when Peh = PCm = 5, B = 0.5, y = 25 and ,Q = 3.5. The bifurcation diagram for this choice of parameters is illustrated in [10], Fig. 1, which features two Hopf bifurcations at D 0.262 and D 0.295 produced by a single complex conjugate pair of eigenvalues crossing and then re-crossing the imaginary axis with all other eigenvalues having positive real parts. We used the choices of IEVS and strategies for choosing al and a2 defined by Strategies 1 and 2 in Section 2. Different numbers of discretisation points pi, i = 1, 2.... were used, which yield Jacobians of dimension Ni x Ni where Ni = 2(pi + 1). We present results for various values of k, and ka, where lmax = 10 for the subspace iteration method and e = 10-3. However we note that direct comparison of subspace iteration with k, vectors and l orthogonal iterations against Arnoldi's method with ka vectors is difficult because of their different convergence properties and given the complicated interplay between IEVS and the mapping properties of al, a2. All computations were performed using a SUN-4/470 workstation in double precision. The first point to make is that both methods worked and produced the same results as [10], obtained using the QR algorithm applied to f x(x, D). To compare the relative efficiency of both methods, we present in Table 4 numerical results obtained during the computation of the branch F. For each Ni we give the number of points of r found by PITCON (Pi); the total CPU time (in seconds) to compute F (PTI); for each value of k„ ka = 10, 20, 30, 40, the average number per point on F of matrixvector operations involving f x(x, D) (AMV) and the total CPU time to compute an
193
Ni
P;
PTI
100
22
0.90
200
29
1.80
400
68
7.67
800
81
17.67
1600
85
38.18
k, 10 20 30 40 10 20 30 40 10 20 30 40 10 20 30 40 10 20 30 40
Strategy 1 AMV ETI 114 8.92 217 22.26 285 37.17 325 55.38 122 24.48 259 65.94 356 112.02 475 176.77 139 132.41 293 345.00 420 602.63 598 1008.87 140 319.22 293 843.75 426 1490.56 560 2313.75 145 741.57 278 1772.08 412 3211.01 563 5132.08
k, 10 20 30 40 10 20 30 40 10 20 30 40 10 20 30 40 10 20 30 40
Strategy 2 AMV ETI 2.46 25 49 6.76 74 13.89 23.71 98 25 5.88 50 15.43 74 30.36 99 50.48 25 25.90 50 66.35 75 126.86 100 208.74 24 58.40 49 153.45 33 290.12 98 470.27 24 131.46 48 333.98 73 634.91 97 1007.42
QR QTI 85.97
690.57
10924.25
93070.59
746876.73
Table 1: Comparison of methods Strategy 1 and Strategy 2 for the tubular reactor.
approximation to the eigenvalue with smallest real part (ETI); and finally the CPU time used if the QR algorithm were applied directly to f x (x, D ) (QTI).
4.1 Conclusions 1. Both Strategy 1 and Strategy 2 succeeded in computing an approximation to the eigenvalue with smallest real part of the Jacobian matrix at every computed point of r (verified by the QR algorithm). We were thus able to detect the two Hopf bifurcations known to exist and given in [10]. 2. From Table 4 we observe that that Strategy 1 always required more matrixvector multiplications than Strategy 2. 3. For this example both methods worked for k, and ka equal to 10, and there is no advantage in using values of k, or ka above 10. 4. As the number of discretisation points increases, each method did not require more matrix vector operations to compute the approximate eigenvalue, though of course the CPU time increases because larger systems must be solved. The
194 reason for this is that increasing the number of discretisation points tends only to introduce extra eigenvalues of large real part, unaffecting those with smallest real part. These extra eigenvalues are always mapped to eigenvalues of the Generalised Cayley transform close to +1 and inside B(0,1) or B(2, 2), and hence are not a problem numerically. This is a promising feature of these strategies and one which is likely to hold for other examples. Acknowledgements The authors would like to thank Y. Saad and K . A. Cliffe for helpful advice. The work of T.J. Garratt was supported by a SERC CASE award and AEA Technology, Harwell, UK.
References [1] F. CHATELIN, Spectral approximation of linear operators, Academic Press, New York, 1983. [2] K. N. CHRISTODOULOU AND L. E. SCRIVEN, Finding leading modes of a viscous free surface flow: an asymmetric generalised eigenproblem , J. Sci. Comput., 3 (1988), pp. 355-406. [3] K. A. CLIFFE, T. J. GARRATT, AND A. SPENCE, Iterative methods for the detection of hopf bifurcations in finite element discretisations of incompressible flow problems. Submitted SIAM J. Sci. Stat. Comp., 1992. [4] E. J. DOEDEL AND J. P. KERNEVEZ, AUTO: software for continuation and bifurcation problems in ordinary differential equations. Appl. Math. Tech. Rep., Cal. Tech., 1986.
[5] J. FRANKLIN, Matrix theory, Prentice-Hall, New Jersey, 1968. [6] F. R. GANTMACHER. [7] T. J. GARRATT, The numerical detection of Hopf bifurcations in large systems arising in fluid mechanics, PhD thesis, The University of Bath, 1991. [8] T. J. GARRATT, G. MOORE, AND A. SPENCE, Two methods for the numerical detection of Hopf bifurcations, in Bifurcation and chaos: analysis, algorithms, applications , R. Seydel, F. W. Schneider, and H. Troger, eds., Birkhauser, 1991, pp. 119-123.
[9]
G. H. GOLUB AND C. F. VAN LOAN, Matrix Computations, The Johns Hopkins University Press, Baltimore, Maryland, 1983.
195 [10) R. F. HEINEMANN AND A. B. POORE, Multiplicity, stability, and oscillatory dynamics of the tubular reactor , Chem. Eng. Sci., 36 (1981 ), pp. 14111419. [11] M. W. HIRSCH AND S. SMALE, Differential equations, dynamical systems and linear algebra, Academic Press, New York, 1974.
[12] G. PETERS AND J. H. WILKINSON, Inverse iteration , ill-conditioned equations and Newton's method, SIAM Rev., 21 (1979), pp. 339-360. [13] W. C. RHEINBOLDT, Numerical analysis of parameterised nonlinear equations, Wiley-Interscience Publication, 1986. [14] W. C. RHEINBOLDT AND J . BURKARDT, A locally parameterised continuation process, ACM Trans. Math. Software, 9 (1983), pp. 215-235. [15] H. RUTISHAUSER, Computational aspects of F.L. Bauer's simultaneous iteration method, Numer. Math, 13 (1969), pp. 4-13. [16] Y. SAAD, Variations on Arnoldi 's method for computing eigenelements of large unsymmetric matrices, Linear Algebra Appl., 34 ( 1980), pp. 269-295. [17] , Projection methods for solving large sparse eigenvalue problems, in Lecture Notes In Mathematics , Matrix Pencils Proceedings, A. Ruhe and B. Kagstrom, eds., Springer-Verlag, Berlin, 1982 , pp. 121-144. [18] -, Chebyshev acceleration techniques for solving nonsymmetric eigenvalue problems, Math. Comp., 42 (1984), pp. 567-588. [19] , Partial eigensolutions of large nonsymmetric matrices. Yale University Department of Computer Science Research Report YALEU/DCS/RR-397, 1985. [20] -, Numerical solution of large nonsymmetric eigenvalue problems, Comput. Phys. Comms., 53 (1989), pp. 71-90. [21] R. A. SILVERMAN, Introductory complex analysis, Prentice-Hall, Englewood Cliffs, 1967.
WSSIAA 2(1993) pp. 197-201 ©World Scientific Publishing Company
COMPLETELY CONSERVATIVE NUMERICAL b=ODOLOGY FOR N-BODY PROBLEMS WITH DISTANCE-DEPENDENT POTENUMS DONALD GREENSPAN Madmaw d s DepwvnaK The University of Taal at Arlington Arlington, Taos 76014-0408, USA
ABSTRACT We consider a general class of N-body problems for arbitrary, distance-dependent potentials. Numerical methodology is described which conserves exactly the same energy, linear momentum and angular momentum as does the given differential system.
1. Introduction There is broad interest in numerical methodology which is energy conserving (see, e.g., [11-[4]). Our aim here is to develop methodology which is not only energy conserving, but is also linear and angular momentum conserving . Such methodology has not been available previously. Attention will be restricted at present to N-body problems with distance-dependent potentials. The approach will be a molecular dynamics generalization of methodology developed for the motion of a single particle 4
2. Mathematical Considerations For clarity, we proceed in three dimensions with the prototype N-body problem, that is, with N=3. Extension to arbitrary N follows using entirely similar ideas and proofs as for N=3. Throughout, cgs units are used. For i=1, 2, 3, let P; of mass mi be at >i= (zi,yi,zi) at time t. Let the positive distance between Pi and Pi., ixj, be ry, with ry=rI.i. Let d,(ry)=^••, given in ergs, be a potential for the pair P;Pi. Then the Newtonian dynamical equations for the three-body interaction are mid=i;=- f'FJ - ftFk, i=1,2,3, (2.1) dt2 arv rv errs rk where j=2 and k=3 when i=1; j=1 and k=3 when i=2; j=1 and k=2 when i=3. System 2. 1 conserves energy, linear momentum, and angular momentum. Our problem is to devise a numerical scheme for solving system 2.1 from given initial data so that
197
198 the numerical scheme preserves the very same system invariants. 3. Numerical Methodology For &t>O, let t„=n(At), n=O, 1, 2,3,.-. At time t,,, let P; be at it,,=(xy,,yk,,,z jd with velocity ir,,=(vf,,,vi „vim„) and denote distances IP1P211 IP?P3I1 IP2P3I by r12,,n r13/n r23rn respectively. We now approximate the second order differential system 2.1 by the first order difference system F44.1'^iN= '74-1 -74.
At 2 m ____^^=_4(r, ,, )'$(r, ) ^4 ■.1+r
F.■.1'Pr
^(r ,1)'$(r 4■.1+^^ y4■.i' h■
ryk ■.1+rmk■
ruk■.1'ra ■
with j=2 and k=3 when i=1; j=1 and k=3 when i=2; j=1 and k=2 when i=3. System 3.1 - 3.2 constitutes 18 implicit recursion equations for the unknowns xx„+1, yin+v zi,+1 vixn+v vi,+ 1, v"„+1 in the 18 known. xi,,, y,,, zi,,,, v1, ,,, v,n,,, v;ZN i-1,2,3. These equations are solved readily by the generalized Newton's methods to yield the numerical solution. 4. Conservation Laws Let us show now that the numerical solution generated by Eq. 3.1- Eq. 3.2 conserves the same energy, linear momentum, and angular momentum as does Eq . 2.1. Consider first energy conservation. For this purpose, define ■-1 WN=^ m^i'4■. l'!'^,) (iTt,,.i_ )IOtl ■-0
1.1
)
Then, insertion of Eq. 3.1 into Eq. 4.1 implies +1 W■ =1mv2 21 1X
2 +l mv2 v2N 33,N
22 "`
so that
2
1, 2
1m v2
lm v2
1
2
2"'2 2A- 2
1
WN=KN-KO.
"O
3
3.0 I
199 Insertion of Eq. 3.2 into Eq. 4.1 implies, with some tedious but elementary algebraic manipulation, N-1 WN=E (-012,
1'*19..- I-#23.x-1**14.s+019.+'0Yi,)'
so that WN= -ONi'0O
Elimination of WN between Eq. 4.2 and Eq. 4.3 then yields conservation of energy, that is, KN+46N=Ko+00, N=1, 2, 3, ... . Moreover, since KO and 0p depend only on initial data, it follows that KO and 40 are the same in both the continuous and the discrete cases, so that the energy conserved by the numerical method is exactly that of the continuous system . Note also that the proof was independent of At. Thus we have proved the following theorem. Theorem 4.1 Independently of at, the numerical method of Section 3 is energy conserving, that is,
KN+cbN=Ko+46o,
N-1,2,3,....
Next, the linear momentum A P; at t„ is defined to be the vector ;; ZZof .ra)
M4^M4VR*a'v1JJdvl
(4.4)
The linear momentum 12, of the three -body system at time t„ is defined to be the vector 3
baX =E A
.
(4.5)
Now, from Eq. 3.2, ^1^^1.^. 1-vla^+m21^2a • 1-^2r/*^/73J^.1-Y3 Jul=^
Thus, in particular, for n =0, 1, 2, ... , m1(vl.s,.l -v1.x.R)+m4v2 ..• 1-v2. r s)+A13(ya. , l -v3.s.,)=0. Summing both sides of Eq. 4.7 from n=0 to n=N-1 implies
200 (4.8)
Mt1v1..OM12v2.N + M13v3 Cp N21,
in which (4.9)
M1v1AO+M12v2 . W+M3v3AA=C1.
Similarly, +M3v3 2v M i v1J. N +m 2.yN
(4 . 10)
C2
(4.11)
m1v1 N+m2v2,+m3v3, =C3, ,
in which (4.12)
MIV1,,O+M2v2JA+M3v3JA'C2
(4.13)
M11v1,A + Mt2v2AA + m3v3,=C3.
Thus 3
jai„= =E M;i„=(C1,C2,C3) =M,,
N=1, 2, 3, ... ,
1-1
which is the classical law of conservation of linear momentum . Note again that MO depends only on the initial data. Thus, we have the following theorem. Theorem 4.2 Independently of At, the numerical method of Section 3 conserves linear momentum, that is,
MN=Mo, N=1, 2, 3, ... . We turn finally to angular momentum. The angular momentum defined to be the vector
of Pi at t„ is
L tM,(^^x17 (4.14) The angular momentum of a three -body system at t,, is defined to be the vector 3
=E Lip,.
L
i-1
It then follows readily that Li^t.l-Lid ' 2M1(^M•1+^ti,)X(^l^.1-^i.^)'
Thus, 3
(4.15)
L..1-L.= 2^ Mi(J7iJ'a1+^1+^^X(j7 •1_^ir).
t t
1
1 4,
201 However, substitution of Eq. 3.2 into Eq. 4.15 yields, after some tedious calculation, r 1-L.A,
n=0, 1, 2, ...
which implies, independently of At, the conservation of angular momentum. Note again that i; depends only on initial data. Thus the following theorem has been proved. Theorem 4.3 Independently of At, the numerical method of Section 3 conserves angular momentum, that is, L,=% n=0, 1, 2, 3, .... 5. References 1. D. B. Kitchen, F. Hirata, J. D. Westbrook, R. Levy, D. Kofke, and M. Yarmush, Conserving energy during molecular dynamics simulations of water, proteins, and proteins in water, 1. Comp. Chem., 11 (1990), 1169. 2. A. Chorin, T. J. R. Hughes, J. E. Marsden, and M. McCracken, Product formulas and numerical algorithms, Comm. PureAppl. Math., 31(1978), 205. 3. Y.-S. Chin and C. Qin, Explicit energy-conserving scheme for the three-body problem, J.. Comp. Phys., 83 (1989), 485. 4. D. Greenspan, Conservative numerical methods for 2 Rx), J.. Comp. Phys., 56 (1984), 28.
5. D. Greenspan, Arithmetic Applied Mathematics, Pergamon, Oxford, 1980, 13.
WSSIAA 2( 1993) pp. 203-211 ©World Scientific Publishing Company
Inclusions for the Moore-Penrose Inverse with Applications to Computational Methods C.W. GROETSCH Department of Mathematical Sciences, University of Cincinnati Cincinnati, Ohio 45221-0025, U.S.A.
Abstract Certain inclusions for the Moore-Penrose inverse of a closed densely defined unbounded linear operator lead to a characterization of the value of the Moore-Penrose inverse of an unbounded linear operator in terms of the Moore-Penrose inverse of an associated bounded linear operator. This allows the direct application of certain computational methods, for example, gradient and regularization methods, to bounded operators while computing the value of the Moore-Penrose inverse of an unbounded operator.
1 Introduction The Moore-Penrose inverse leads a double life. On one hand it is an algebraic entity characterized by certain identities, while in its other life (secret to algebraists) it is a metrical optimizer characterized by variational principles. The algebraic personality of the Moore-Penrose inverse is dominant in the theory of generalized inverses of matrices1'2, while the variational viewpoint is a natural springboard for the theory of Moore-Penrose inverses of bounded linear operators in Hilbert space3. Our concern in this note is an aspect of the computational theory of the Moore-Penrose inverse for unbounded linear operators. Domain restrictions render the usual identities for such operators problematic, however, a number of inclusions generalizing known identities
203
204 for bounded operators have been developed'. One such inclusion serves as the jumping-off point for our discussion of a general approach to computational methods for the Moore-Penrose inverse of closed densely defined linear operators. Consider a closed linear operator A defined on a dense subspace D(A) of a Hilbert space Hl and taking values in a Hilbert space H2. The inner product, norm and identity operator in each space will be denoted by (•, •), II • I I, and I, respectively. The Moore-Penrose inverse of A is the linear operator At defined on D(At) = R(A) + R(A)1 (R(A) is the range of A) and taking values in N(A)' fl D(A) (N(A) is the nullspace of A) satisfying
N(At) = R(A)1 and AtAx = Px for all x E D(A) where P is the orthogonal projector of Hl onto N(A)' (when more specificity is called for, PS will denote the projector of a given Hilbert space onto a closed subspace S). Equivalently, At associates with each y E D(At) the vector x E D(A) satisfying x E L = {z E D(A) : JJAz - yll = min {JJAu - yll : u E D(A)}} and JJxJJ < IIzII for all z E L. It is well known that At is a closed linear operator which is bounded if and only if R(A) is closed. Previous results on the computational theory of (generalized) inverses of unbounded linear operators7,s,i° have been based on the construction of an auxiliary Hilbert space and the introduction of a generalized Friedrichs extension of the operator. The generalized solution which was approximated in these works does not necessarily belong to the domain of the unbounded operator. We take a different tack, inspired by the work of Lardy9 and based on a certain operator inclusion. In our approach the generalized solution lies in the domain of the original operator and no completion of the domain or extension of the operator is carried out. Rather, we characterize the generalized solution as a value of the Moore-Penrose inverse of an associated bounded linear operator. This provides the opportunity to apply well-known methods for computing the Moore-Penrose of a bounded linear operator to the problem of computing the Moore-Penrose inverse of an unbounded linear operator.
205
2 Background Inclusions Our development depends on a fundamental result of von Neumann, namely, that the operators
A := (I + AA*)- and A*A are everywhere defined and bounded. Moreover, A is self-adjoint9,11,12. We also note that
A := (I + A*A)-1 and AA are bounded, A is self-adjoint and AA C AA and AA* C A*A. (By A C_ B, we mean that B is an extension of A, that is, D(A) C_ D(B) and Ax = Bx for each x E D(A)). Because unbounded operators are not defined on the entire space, the many identities for the Moore-Penrose inverse that are so familiar in the matrix case are not always meaningful when A and At are unbounded. However, it is possible to develop a number of inclusions (and identities) for the Moore-Penrose inverse of unbounded operators4. One such inclusion, namely,
A*A(I - A)t C At C (I - A)tA*A
(1)
has the curious feature that it traps the Moore - Penrose inverse of a possibly unbounded operator A between two operators which are products of a bounded operator , A*A, with the Moore-Penrose inverse of a bounded operator , I - A and I - A, respectively. We are principally interested in the upper end of this chain of inclusions . As a prelude to its proof, we present the following lemma.
LEMMA . R(AA*) = R(I - A) and R(A*A) = R(I - A). PROOF . Suppose that y = AA*w. Then we find that (I - A)(y + w ) = y + w - Aw - A((I + AA*) w - w) = y. Therefore, R(AA*) C R(I - A). Conversely, if y = (I - A)z, then setting w = Az, we find y=w+AA*w-A (I+AA*)w=AA*w
206
and hence R(I - A) C R(AA*). On replacing A by A*, the second inclusion ❑ follows from the first . We now present an inclusion , related to the upper inclusion in (1), that will be the basis for our general computational approach to At. PROPOSITION 1. (I - A)At C A*A. PROOF. First we show that
At c (i - A)tA*A
(2)
Suppose that y E D(At), say, y = Az + y, where z = Aty and y E R(A)1 C_
N(A*). Then 0=AA*77 =A*An and hence (I - A)tA*Arl = 0. Also, A*AAz = A*AAz = (I+A*A)Az-Az = (I - A)z. Therefore , A*AAz E D((I - A)t) and, using the Lemma, we have (I - A)tA*AAz = (I - A)t(I - A)z
However , since y = Az + ^, where q E R(A)1 = R(AA*)1 = R(I - A)1 =
N(I - A), we have A *Aii= A*rJ=0 and hence,
(I - A)tA*Ay = (I - A)tA*AAz = Aty,
f t
207 giving (2). Suppose now that y E D(At), then from above we have ( using again the Lemma)
(I - A)Aty =
R A*Ay RA*Ay
= P (A)A*Ay = A*Ay, proving the proposition.
❑
In the next section this result on inclusions is applied to develop a general scheme for computing the Moore-Penrose inverse of an unbounded linear operator by use of associated bounded linear operators.
3 A General Approach to Approximation Let T = I - A. Then T is a self-adjoint , bounded nonnegative linear operator having spectrum in the unit interval [0 , 1]. From Eq. ( 2), we are led to consider the equation Tx = A*Ay. (3) Our main result shows that finding the generalized solution Aty of the equation Ax=y (4) involving the possibly unbounded operator A is equivalent to finding a classical solution of the bounded linear operator equation Eq. (3). PROPOSITION 2. y E D(At) if and only if A*Ay E R(T). Moreover, x is a solution of Eq. (3) if and only if Aty = PN(A)LX. PROOF . If y E D(At), then there is a x E D ( A) and a z E N(A*) with y = Ax + z. We then have
A*Ay = A*A(Ax+z) = A*AAx+AA*z = A*AAx = (I - A)x = Tx, and hence A*Ay E R(T).
208
Conversely, if A*Ay = Tx, then x = Ax + A*Ay E D(A*A) + D(A) and hence x E D(A). Also,
Ax=AAx+AA*Ay=AAx+(I-A)y that is,
(I - A)Ax = (I - A)y. Therefore, using the fact that R(I - A) = R(A*A), we have Ax = (I - A)t(I - A)y = PN(I_A)ly =
R(
=
RteY
(5)
and hence PR ,g y E R(A), i.e., y E D(At).
Moreover , if x satisfies Eq. (3), then y E D(At) and, by Eq. ( 5), x is a least squares solution of Eq. (4). Also, by Eq. (5) R` (A)y = APN(A)ly
and hence Aty = PN(A)ly.
Conversely, if y E D(At), then for any z E N(A) = N(I -A), x = Aty+z satisfies
Tx = (I - A)Aty + (I - A)z = A*Ay.
❑
The proposition allows the application of standard bounded operator techniques to Eq. (3) as a method of computing Aty. It shall be noted, however, that when R(T) is not closed, the operator equation Eq. (3) is ill-posed. By this we mean that Tt is unbounded and hence solutions of Eq. (3) do not depend continuously on y. An extensive theory of regularization has been developed for such equations for the case of unbounded' as well as bounded' operators. Proposition 1 allows us to apply the theory of regularization for bounded operators to approximate Aty. For example, since T is bounded and nonnegative, one can apply simplified Tikhonov regularization xa = (T + aI)-ly (a > 0)
209 where y = A*Ay. For these approximations the standard theory3,4 gives xa --> Aty as a -40 if and only if y E D(At). Moreover , if only approximate data y6 satisfying
IIy - ybll <- b (6) is available and approximations x^ are defined as above with y replaced by y6, then xa -4 Aty provided that the regularity condition a = o(b) is satisfied. Standard iterative methods may also be applied to Eq. (3). For example, Richardson's method, that is, xn + 1
= (I - /3T)xn + i3 9
converges to Aty as n -> oo provided that 0 < /j < 2/IITII (in the special case /i = 1, this is Lardy's methods). In the case where only approximate data y6 is available satisfying Eq. (6) and the approximateions xn are formed using yb instead of y, then the iteration number n has the role of a regularization Aty as n -+ oo provided parameter. In this case it follows easily that xn n(8)b -40 as b --+ 0. that n = n(b) satisfies The simple nature of T also allows the application of nonstationary iterative methods. For example, the steepest descent method consists of interpreting a solution of Eq. (3) as a minimizer of the quadratic functional u ^-4 (Tu, u) - 2(u, y) and following the negative gradient. This leads to the iterative method Xn+1 = xn + anrn
where rn = y - Txn, an = II rnII2 /(Trn, rn). Standard analysis3 can then be used to show that for any y E D(At),xn - ^ Aty as n - ^ 00 if xo E R(T). Finally, we make a brief proposal concerning the computation of values of the operator T = I - A. Clearly we need only treat the computation of values y = Ax of the operator A = (I + A*A)-1. That is, given x E H1, we wish to compute y E D(A*A) satisfying
y+A*Ay=x. (7) Let Si C_ S2 C . . . be a sequence of finite-dimensional subspaces of D(A*A) and let yn E Sn be the Galerkin approximation to y defined by yn+A*Ayn
-X
E
SS.
210 From Eq. (7) we then see that (yn-
y,^) +(A(yn- y),A^) =0
for all ^ E Sn and hence yn = Pny, where Pn is the orthoognal projector of y onto Sn with respect to the graph inner product
[w, v] = (w, v) + (Aw, Av). Therefore, if the coordinate subspaces are chosen so that UnSn is dense in D(A*A) with respect to the graph norm, then Yn approximates y to any desired accuracy since yn = Pny -* y as n -4 oo. Acknowledgements : This work was partially supported by grants from the Charles Phelps Taft Memorial Fund and the National Science Foundation.
REFERENCES 1. A. Ben- Israel and T.N.E. Greville, Generalized Inverses : Theory and Applications, Wiley, New York, 1974. 2. T. Boullion and P. Odell, Generalized Inverse Matrices , Wiley, New York, 1971. 3. C.W. Groetsch, Generalized Inverses of Linear Operators: Representation and Approximation, Dekker , New York, 1977. 4. C.W. Groetsch, The Theory of Tikhonov Regularization for Fredholm Equations of the First Kind, Pitman, London, 1984. 5. C.W. Groetsch, Spectral methods for linear inverse problems with unbounded operators, J. Approximation Th. 70(1992), 16-28. 6. C.W. Groetsch, Inclusions and identities for the Moore-Penrose inverse of a closed linear operator, preprint. 7. W.J. Kammerer and M.Z. Nashed, Steepest descent for singular linear operators with nonclosed range , Applicable Analysis 1 (1971), 143-159. 8. L.V. Kantorovich, Functional Analysis and Applied Mathematics, translated from the Russian by C.D. Benster , University Microfilms, Ann Arbor, 1979.
211
9. L.J. Lardy, A series representation for the generalized inverse of a closed linear operator, Atti. Accad. Naz. Lincei Rend. Cl. Sci. Fis. Mat. Natur., Ser. VIII, 58(1975), 152-157. 10. W.V. Petryshyn, Direct and iterative methods for the solution of linear operator equations in Hilbert space, Trans. Amer. Math. Soc., 105(1962), 136-175. 11. F. Riesz and B . Sz.-Nagy, Functional Analysis , Ungar, New York, 1955. 12. K. Yosida, Functional Analysis, 2nd ed., Springer-Verlag, Berlin, 1968.
WSSIAA 2(1993) pp. 213-224 ©World Scientific Publishing Company
IMPLICIT RUNGE-KUTTA METHODS FOR HIGHER INDEX DIFFERENTIAL-ALGEBRAIC SYSTEMS
E. HAIRER and L. JAY Section de Mathimatiques, Universitd de Geneve Case postale 240, CH-1211 Geneve 24, Switzerland
ABSTRACT This article considers the numerical treatment of differential -algebraic systems by implicit Runge - Kutta methods. The perturbation index of a problem is discussed and its relation to the numerical solution is explained . Optimal convergence results of implicit Runge- Kutta methods for problems of index 1, 2, and 3 in Hessenberg form are then surveyed and completed. Their importance in the study of convergence for singular perturbation problems is shown and some comments on the numerical treatment of stiff Hamiltonian systems are given.
1. Introduction The subject of this paper is the numerical treatment of nonlinear differentialalgebraic equations (DAEs) of the form 0 = F(u, u') (1) where u and F are of the same dimension. The matrix 8F/8u' may be singular but is assumed to have constant rank. An important special case is the situation where the components can be separated in differential and algebraic parts as follows y' = Ay' z), 0 = g(y, z). (2) Differential-algebraic equations arise in a variety of applications, e.g., constrained mechanical systems , robotics, simulation of electrical networks and control engineering. They are also obtained as the limit of singular perturbation problems. There are several ways for solving numerically the above problem. All of them have their own advantages:
- Index reduction with projection. Differentiate analytically the algebraic constraints and do some algebraic manipulations until an ordinary differential equation (ODE) is obtained. This ODE can then be solved by any ODE method (explicit or implicit, one-step or multistep). In order to avoid a "drift off" from the original algebraic constraints , it is recommended to combine this approach with certain projections onto the manifolds where the exact solution lies. - Direct approach. Embed the original problem into a singular perturbation problem (e.g., y' = f (y, z), ez' = g(y, z) for Eq. 2), apply formally an ODE method and
213
214 consider the limit e -. 0. This approach is restricted to implicit methods whose stability function is bounded by one at infinity. However, it provides much insight into the numerical solution of stiff and singular perturbation problems. - Special methods. They are adapted to the particular problem. Usually one applies some explicit ODE method to the differential part of the DAE and some nonlinear equation solver to the algebraic part. In this article we restrict ourselves to the direct approach combined with the use of implicit Runge-Kutta methods. For further results on the numerical treatment of DAEs we refer the reader to the monographs of Griepentrog and Marz5, Brenan, Campbell and Petzold2, and Hairer, Lubich and Roche7. Consider the DAE of Eq. 1 and assume that consistent initial values are given (uo is consistent if a function u(t) exists which satisfies Eq. 1 and u(to) = uo). For an s-stage implicit Runge-Kutta method with coefficients aid, bi, the direct approach yields12
Ui=uo
+haijU, 0=F( Ui,Ui),
ul =uo+hb1Uj'. i-i
(3)
The first line of Eq. 3 represents a nonlinear system for Ui, Ui, i = 1.... , s, and ul is the numerical approximation for the solution at to + h. Due to the wide variety of problems included in the formulation of Eq. 1, there is no hope for a unified convergence theory. There exist perfectly meaningful DAEs which cause difficulties to every numerical method. Fortunately, the problems arising in practice have some additional structures and permit a successful application of the above method. This paper is organized in the following way. We begin with a classification of DAEs (perturbation index) which measures how strong the problem is ill-conditioned. For several important problems in Hessenberg form (of index 1, 2, and 3) we then present optimal convergence results for implicit Runge-Kutta methods. These are used to give some insight into the numerical solution of singular perturbation problems. As an example, the numerical treatment of stiff Hamiltonian systems is discussed.
2. Influence of Perturbations For ordinary differential equations u' = f (u) it follows from the lemma of Gronwall that the difference between the exact solution u(t) and a perturbed solution u(t) with defect b(t) := W(t) - f (u(t)) can be estimated as
a
max I f 8(r) dr II) • (4) 11i(t) - u(t) II <_ J( t -to) (IIa (to) - u( to)II + to
II'
215 Working on a bounded interval (to < t < T) this means that small perturbations in the data of the problem cause small perturbations in the solution. For differentialalgebraic equations the situation may be completely different. Index 1 problems . We consider the DAE of Eq. 2 together with a perturbed version y' = f(y,z), 0 = g(y, z ),
= f (l, 0 = g(y,
y'
+ 6,
(t),
+ 82(t),
(5)
and assume that
gz
is invertible
(6)
in a neighbourhood of the solution (gz denotes the derivative of g with respect to z). Using the implicit function theorem we can solve the algebraic relations in Eq. 5 for z (reap. z). Inserted into the differential equation of Eq . 5 this yields an ODE for y (resp . y) and the estimate of Eq. 4 can be used to obtain
Ily(t)-y(t)II
0 = g(y) + 0(t). (7)
It represents a typical control problem where z acts as a control variable and forces the solution y to stay on the manifold defined by 0 = g(y). The essential idea is to differentiate the algebraic constraints with respect to t. This yields 0 = gy(y ) f (y, z) and a similar relation for the perturbed system . If we assume that gy fz
is invertible (8)
in a neighbourhood of the solution , the differentiated constraint can be solved for z. We thus obtain , as for the index 1 example , the estimates?
Ily(t ) - y(t) II 5 C (ll v(to) - y(to) II + f^o ( 116(8 )ll + 110'(8)11) da), IIz(t) - z(t)II <- C(IIv (to) - y(to ) II + j ax 116(9)11 + tm ax 110'(8)11) • (9) Although the estimate for the y-component can be slightly improved' , the dependence on 0'(t) cannot be suppressed in general . Therefore , small perturbations in Eq. 8 may lead to large perturbations in the solution. Index 3 problems . The equations of motion of constrained mechanical systems can be written in the form y' = AY, Z ), y' = f(y,2)+6(t),
216 z' = k(y, z, u), 2' = k(y, 2, u) + µ(t), (10)
0 = g(y), 0 = g(y) + B(t),
where typically k depends linearly on u. If we differentiate twice the algebraic constraint, the assumption gy fzku
is invertible (11)
allows to express u in terms of y, z, and estimates for y(t) - y(t),... can be obtained . These estimates will depend on 0"(t ), B'(t), and on b'(t), so that this problem is even more ill - conditioned than the previous one. However , in some important situations (e.g., f (y , z) = fo(y) + fl(y)z, k(y, z, u ) = ko(y, z ) + kl(y)u) the differences y(t) - y(t), 2(t) - z(t ) (but not fi(t) - u(t)) are independent of B"(t) and b'(t). These examples motivate the following definition of the index. Definition7 . Eq. 1 has perturbation index m along a solution u(t) on [to, T], if m is the smallest integer such that , for all functions fi(t) having a defect
F(l'(t),fi(t)) = b(t), there exists on [to,T] an estimate
I fi(t) - u(t)II <_ C (11 fi(to) - u(to)II + Amax 115(3)11 + ... + max Ilbi"`-1)(3)11) whenever the expression on the right -hand side is sufficiently small. It is also of interest to study the influence of perturbations in the RungeKutta equations ( Eq. 3) on the numerical solution. Let us explain this at the example of the index 2 problem ( Eq. 7). The internal stages of the Runge-Kutta method satisfy Yni = yn + h L : aijYnj, Yni = f(Yni, Zni), j=1
Zni = zn + h L aij Znj r 0 = g(Yni), j=1
and the numerical approximation at to +1 = to + (n + 1)h is given by a
yn +1 = yn + h
biYni, i=1
a
zn+l = zn + h
bi Zni-
i=1
If we eliminate the variables Yni, Zni we obtain the equivalent formulas Yni=yn +h>
aijf(Ynj ,Znj),
0=9(Yni ),
(12a)
j=1 a
a
biwij Znj, zn+1 = Pzn + yn+1 = yn + h bif (Yni, Zni)+ i=1 1,j=1
(12b)
217
where (aij )- 1
and
p=1-
biwij
(13)
i,j=1
(of course, we have assumed that A = (aij) is invertible ; p is the value of the stability function at infinity ). Eq. 12a represents a nonlinear system for Yni, Zni, i = 1, ... , s, and the numerical solution after one step is then explicitly given by Eq. 12b.
We next compare Eq. 12 to the perturbed Runge-Kutta method Yni = yn + h E aijf( Ynj, Znj) + hbni , j=1 a
T^Jn +1 = yn + h bif ( Yni,
_
0 = 9(Yni) + Oni, (14a)
_
s
_
Zni) + hbn,,s + 1,
i=1
biwij Znj zn+1 = p zn + i,j=1
(14b)
with the aim of estimating the differences Dyn = yn - yn and .zn =' zn - zn. Since the nonlinear systems do not depend on the initial values of the z -component, it is possible to derive the estimates for Dyn independently of those for Ozn. Substracting Eq. 12 from Eq . 14 we obtain by linearization that? Dyn+1 = PnOyn + pQn'yn + O(hlloynll + hbn + On)
(15)
where Qn = I - Pn
Pn = (fz(9yfz)- 19y)(yn, zn), are suitable projectors and
bn
= t=l,...,a+1 max Ilbnill,
=
On . Max t=11 ...,s
Ilenill
(we have implicitely used the fact that h, bn, On and Dyn are sufficiently small and that yn is sufficiently close to consistent initial values , so that the numerical solution exists ). We next assume that IpI < 1 and deduce from Eq . 15 that N-1
IIAYNII_C(IIPoAyoll+ ( IPIN+h)IlQooyoll +h (bn + h)) (16a) n=0 for Nh < Const . Similarly one gets for the z-component
IIAzNII <-C(IIPoAyoll+(IPIN+h)IIQ0Ayoll+ n=o Max Sn+n=oax- On). (16b) Eq. 16 is the numerical analogue of Eq. 9. We observe that the derivative in Eq. 9 becomes a division by the stepsize h in Eq. 16. These estimates can be exploited in several ways . If one replaces yn, 2n by values on the exact solution , one can obtain convergence results of the method. Furthermore, it is possible to interpret the perturbations as round -off errors or errors in the iterative solution of the nonlinear systems . Eq. 16 shows how such errors can affect the numerical solution . The ill-conditioning of the problem is
218 reflected by the factor 1/h in front of the perturbations 0,,. Therefore, one has to take care of the above-mentioned errors when the stepsize is very small. Estimates similar to Eq. 16 can be derived also for the index 1 and index 3 problems of the beginning of this section . However , similar results for general DAEs , having perturbation index m , are not yet known.
3. Convergence Results The above investigations make it clear that there is no unified convergence theory of Runge- Kutta methodA for general DAEs, and that one has to treat separately all different types of problems . The first results (for linear inhomogeneous DAEs with constant coefficients , but of arbitrary index) have been obtained by Petzold12 . In this section we collect convergence results for nonlinear semi -explicit problems of index 1, 2, and 3 . For nearly all Runge -Kutta methods the estimates are optimal in the sense that the exponent of h cannot be improved without restricting the class of problems. For the presentation of the convergence results we need the following abbreviations:
B(p) :
Lbick - 1 = k for
k=1,---,p;
:=1
^ C(q): )L aiic3k-
k = 1,...,q;
for
1 = k
9=1
b.
'
bick- 1ai^ = k(1-c) for
D(r)
j= 1,...,s; k=1 ,..., r;
i=1
(S): a,i=bi for
i=1,...,s.
The assumptions B(p), C(q), D(r) have been introduced by Butcher4 and play a crucial role in the construction of implicit Runge -Kutta methods4,8,9. The condition ( S) means that the method is "stiffly accurate". Throughout this section we shall assume that the Runge- Kutta matrix A = (aid) is invertible and we shall denote by p the value defined in Eq. 13. The integer q of condition C(q) is called the "stage order". We always assume that p > q. Theorem (index 1 ). Consider the index 1 problem (Eq. 5, Eq. 6) and assume that the initial values are consistent. If the Runge-Kutta method satisfies B(p), C(q), D(r), and Ipl < 1, then the global error satisfies for hn < Const Y(tn ) - yn = o(hn ),
z(tn) - zn = o(ho,
where q = min(p,2q + 2,q+r+1) and n if (S) holds, if -1 < p < 1, C = min(p , q + 1) min(p -1 , q) if p = 1.
219 The proof of this theorem is based on the following idea: due to Eq. 6, the algebraic constraint 0 = g(y, z) can formally be written as z = G(y). Inserted into the differential equation of Eq. 5 we get the ODE y' = f (y, G(y)). Convergence for the y-component now follows from the fact that the numerical solution, obtained from Eq. 3, is identical to the approximation of the Runge-Kutta method applied to y' = f (y, G(y)). Hence the results from the ODE theory can be applied7'8. The above estimates are valid for a constant stepsize application of the method. For variable stepsizes the same estimates hold with h = max. hn (with the exception of the case p = -1 where the results become those of the case p = +1). The order reduction in the z-component can be avoided, if one computes the approximation zn from g(yn, zn) = 0. Similar remarks can be made also for the subsequent theorems. Theorem (index 2). Consider the index 2 problem (Eq. 7, Eq. 8) and assume that the initial values are consistent. If the Runge-Kutta method satisfies B(p), C(q), D(r), and IpI < 1, then the global error satisfies for hn < Const y(tn) - yn = 0(h"),
z(tn) -
zn
=
0(W),
where rmin(p, 2g,q +r+1)
if (S) holds,
q = mm(p,4 + 1) if -1 < p < 1, q
if
P= 1,
q if IPI < 1, q-1 if P=-1, q-2 if p=1. The proof of this theorem needs the study of the local error (using rooted trees , elementary differentials , ...) and of the error propagation (see Eq . 15). Details are given in Hairer , Lubich and Roche7. Optimal convergence results for the index 3 problem have been obtained only very recently10 under the assumption ( S). In view of the application to Hamiltonian systems ( section 4 ) we also include new results for the case IpI = 1. Theorem (index 3). Consider the index 3 problem (Eq. 10, Eq. 11) and assume that the initial values are consistent. If the Runge-Kutta method satisfies B(p), C(q), D(r), and IpI < 1, then the global error satisfies for hn < Const y(tn) - Y. = o(h'"), z (t.) - zn = 0(ht ),
u (tn) - un = 0(h°),
where (min(p, 2q - 2, q + r) if (S) holds, if -1 < p < 1, r> 1, q> 3, r/ = min(p , q + 1) q else, (q-1 if IPI<1, (q if IPI<1, [`= q-1 if p=-1, v= q -3 if p=-1, q-2 if p=1, q-4 if P=1.
220 In the above two theorems the stage order q has to be sufficiently large such that the numerical solution remains close to the exact solution . Otherwise the nonlinear system may not have a solution. 4. Singular Perturbation Problems As mentioned in the introduction, the direct approach for solving DAEs provides much insight into the numerical solution of singular perturbation problems. Let us illustrate this at two examples. Consider first the problem y, = f(y,z), ez' = g(y,z),
0 < e « 1 (17)
where f and g are sufficiently differentiable. Under suitable assumptions on g (e.g., (gz(y, z)V,V) < -IIv112) and on the initial values, the solution of Eq. 17 possesses an asymptotic expansion of the form y(t) = yo(t) + eyl(t) + E2y2(t) + ...,
z(t) = zo(t) + Ezl(t) + e2z2(t) + ....
Inserting these expansions into Eq. 17 and collecting equal powers of a we obtain yo = f(yo,zo) (18a)
0 = g(yo , zo) A = fv (yo,zo )y1 + fz (yo,zo)zl zo = gv ( yo,zo)yl + gz (yo,zo )z1.
(18b)
We see that Eq. 18a constitutes an index 1 problem ( Eq. 5) for yo(t),zo ( t). The Eqs. 18a and 18b together are an index 2 problem (Eq. 7) with (yo, zo , y1) in the role of y and z1 in the role of z.
The main point is now that the numerical solution of a Runge - Kutta method applied to Eq. 17 also has an expansion of the form6,9 Y. = A. + Eyn + elyn + ...,
Z. = zn + ezn + c2zn + ...,
and the coefficients AI zn, ... are exactly the numerical solution of the Runge-Kutta method applied to Eq. 18. Consequently, the convergence results for the index 1 problem yield an estimate for yo (tn) - y°,^, zo(tn) - zn, those for the index 2 problem yield an estimate for yl (tn) - yn, z1 (tn) - zn, etc. Since the higher order terms in the asymptotic expansions can be neglected6,9 , this leads to sharp convergence statements for the singular perturbation problem of Eq. 17. As a second example we consider the problemll q'i=P1, Pi=-E2 q1 2( ql+q2-1), % + q2 12 P2,
q2 2
P12 C2 Vf 2 -l q2 q
%+q2-1)-1,
(19)
221 which describes a stiff spring pendulum (mass point suspended at a massless spring with Hooke' s constant 1/E2, 0 < E «1). If we introduce the new variable A by qI2 +qz EZa = -1 qi + q2 Eq. 19 becomes
4'i = P1 ,
Pi = -qla,
qz = P2 ,
P2 = -q2A - 1.
The so obtained system is a DAE of index 1 if c > 0, and it is of index 3 for the limit case e = 0 ((ql,q2) corresponds to y, (plip2) to z, and A to u in Eq. 10). Since the Runge-Kutta method (direct approach) is invariant under the above transformation , we expect that for small e the numerical method behaves similarly as for an index 3 problem. A rigorous analysis of this fact in a more general context has been presented by Lubichll
The following numerical experiment illustrates this behaviour for the s-stage methods GAUSS and RADAU IIA, whose coefficients satisfy the following conditions: GAUSS
B(2s), C(s), D(s), p = (-1)',
RADAU IIA B(2s - 1), C(s), D(s - 1), (S), p = 0.
We put c = 0.001, take the initial values7 qi(0) = 1 - 3E4 + 0(E8),
q2(0) = 0,
(20) P2(0) = 0, P1(0) = O(E$), 0, 0.5] with several different constant stepsizes. and integrate Eq. 19 on the interval [ The initial values are chosen such that the solution of Eq. 19 does not contain highly oscillatory terms (smooth motion11). This allows the use of stepsizes which are significantly larger than E. Fig. 1 shows the global errors as a function of h. Since we have used a double logarithmic scale , a function Ch' appears as a straight line with slope r. We see that for the RADAU IIA method the errors behave like 0(h2),O(h3),O(h5) for the components A, P2, q2, respectively. For the GAUSS method they behave like O(h°), 0(h2) for the components A,P2 (at least for sufficiently large h). The error of the position coordinate q2 oscillates around a line of slope 4, indicating a 0(h4)behaviour. The errors for g1,P1 behave similarly to those for g2,P2 and are not plotted. This experiment confirms the results predicted by the theorem (index 3) of the previous section. As a consequence, methods satisfying condition (S) (like RADAU IIA) are preferred . However , for a long time integration the situation may be different. Eq. 19 represents a Hamiltonian system q' = 8P
- 8q (P, q)
(21)
222
10-2 10-' 10° 10-2 10-1 10° Fig. 1 . Global error as a function of the stepsize f o r Eq. 19, h = 0.5/n, n = 1, 2, 3, .. .
with Hamiltonian function
H(p, 4)
2
(+ Pz)+2ea\
qi + qa-1) +q2
Recently, much research has been devoted to the numerical integration of such systems (see, for example , the survey article by Sanz-Serna13 ). In order to retain the qualitative properties of the flow of Eq. 21, it is important for the numerical scheme to be symplectic. For implicit Runge-Kutta methods this means that the coefficients have to satisfy13 biaii + biaii - bibs = 0
for all i, j. (22)
It is known that the GAUSS methods satisfy Eq. 22, whereas the RADAU IIA methods do not. One can prove (multiply Eq. 22 by wikwjl and sum over all indices) that Eq. 22 implies 1p1= 1, a rather undesirable property for the integration of index 3 problems. In our second experiment we integrate Eq. 19 (e = 0.001) with the initial values of Eq. 20 and with constant stepsize h = 0.05 over a long interval [0,26] (about 3 periods). The Hamiltonian function for the numerical solution, obtained by the GAUSS and RADAU IIA methods with s = 3, is plotted in Fig . 2 (for the exact solution the Hamiltonian is constant and equals 4.5e6 + 0(E8)). We observe that it remains between tolerable bounds for the GAUSS method, but drifts away from the exact value for the RADAU IIA method. This experiment demonstrates the different behaviour of symplectic and non symplectic integrators. It should be mentioned that for "stiff" Hamiltonian systems , such as Eq. 19, the use of explicit integration methods is not recommended, because they require small stepsizes (usually not larger than e). If one uses implicit symplectic integrators, such as GAUSS, the stage order q has to be sufficiently large, so that the numerical solution is precise enough (see index 3 theorem of section 3).
223
GAUSS (s=3)
6.10-9 3.10-9
0 0
20
25
5 10 15 20 Fig. 2 . Numerical Hamiltonian for Eq. 19
25
5
10
15
-5.10-8 -10-7
- RADAU IIA (s=3)
5. References 1. M. Arnold , Stability of numerical methods for differential-algebraic equations of higher index, APNUM, to appear.
2. K.E. Brenan , S.L. Campbell and L . R. Petzold , Numerical Solution of InitialValue Problems in Differential- Algebraic Equations, North-Holland, New York, 1989. 3. K.E. Brenan and L . R. Petzold , The numerical solution of higher index differential/algebraic equations by implicit Runge-Kutta methods , SIAM J. Numer. Anal. 26 (1989), 976-996. 4. J.C. Butcher , Coefficients for the study of Runge-Kutta integration processes, J. Austral. Math. Soc. 3 (1963), 185-201. 5. E. Griepentrog and R. Mirz , Differential-Algebraic Equations and Their Numerical Treatment, Teubner- Texte zur Mathematik , Band 88, Leipzig, 1986.
6. E. Hairer , Ch. Lubich and M. Roche, Error of Runge-Kutta methods for stiff problems studied via differential -algebraic equations , BIT 28 (1988), 678-700. 7. E. Hairer , Ch. Lubich and M. Roche, The Numerical Solution of Differential-Algebraic Systems by Runge-Kutta Methods , Springer Lecture Notes in Mathematics 1409, Berlin, 1989. 8. E. Hairer , S.P Norsett and G . Wanner , Solving Ordinary Differential Equations I. Nonstiff problems, Computational Mathematics 8, Springer-Verlag, Berlin, 1987. 9. E. Hairer and G. Wanner , Solving Ordinary Differential Equations II. Stiff and Differential-Algebraic problems , Computational Mathematics 14, Springer-Verlag , Berlin, 1991.
224 10. L. Jay, Convergence of Runge-Kutta methods for differential-algebraic systems of index 3, Report, Sect. de mathematiques, Univ. de Geneve, 1992. 11. Ch. Lubich, Integration of stiff mechanical systems by Runge-Kutta methods, ZAMP, to appear. 12. L.R. Petzold, Order results for implicit Runge-Kutta methods applied to differential/algebraic systems, SIAM J. Numer. Anal. 23 (1986), 837-852. 13. J.M. Sanz-Serna, Symplectic integrators for Hamiltonian problems: an overview, Acta Numerica 1 (1992), 243-286.
1 t
WSSIAA 2 ( 1993) pp. 225-238 ©World Scientific Publishing Company
Parallel Jacobi Iteration in Implicit Step -by-Step Methods
P.J. van der Houwen & B.P. Sommeijer CWI Post box 4079, 1009 AB Amsterdam , The Netherlands An iteration scheme is descibed to solve the implicit relations that result from the application of an implicit integration method to an initial value problem (IVP). In this iteration scheme the amount of implicitness is still free so as to comprise a large variety of methods, running from fully explicit (functional iteration) to fully implicit (Newton's method). In the intermediate variants (the so-called Jacobi-type methods), the influence of the Jacobian matrix of the problem is gradually increased. Special emphasis is placed on the 'stagevalue-Jacobi' iteration which uses only the diagonal of the Jacobian matrix . Therefore, the convergence of this method crucially depends on the diagonally dominance of the Jacobian. Another characteristic of this scheme is that it allows for massive parallelism: for a ddimensional IVP, d uncoupled systems of dimension s have to be solved, where s is the number of stages in the underlying implicit method (e.g., an s-stage Runge-Kutta method). Hence, on a parallel architecture with d processors (d»1), we may expect an efficient process (for high-dimensional problems). 1980 Mathematical Subject Classification: 65M 10, 65M20 Key Words and Phrases: numerical analysis, stability, parallelism.
1. Introduction We shall be concerned with parallel predictor-corrector iteration of implicit step-by-step methods for solving initial value problems (IVPs). For a wide class of functional equations, including ordinary differential equations (ODEs), Volterra integral equations (VIEs), Volterra integro-differential equations (VIDEs), delay-differential equations (DDEs), etc., these step-by-step methods (referred to as corrector equations) can be represented in the form: Y = Fn(h, U0 , U1, ... , Un) + h°(M®I)Gn (Y), Un+1 = Hn(h, U0, U1, ..., Un, Y), (1.1)
Y:= (Y1T, Y2T,... , y T)T, Un (Un1T, Un2T, ..., UnrT)T , n = 0. 1, 2.....
Here, It is the stepsize, v is the order of the IVP, M is an s-by-s matrix characterizing the corrector, Un and Y present an r-dimensional and s -dimensional block vector of numerical approximations to the exact solution of the IVP. If the IVP has dimension d, then Un and Y are vectors in rd-dimensional and sd-dimensional vector spaces, respectively , and Fn , Gn and Hn are functions depending both on the WP and the step-by-step method. Furthermore, M®I denotes the direct product of the matrices M and I. In each step, the block vectors (U0, U1, ... , Un) are the input vectors, Un+1 is the output
225
226 vector, and Y is the internal stage vector. We shall say that the corrector method has s internal stages and r output points . The representation (1.1) is similar to the partitioned general linear method (GLM) format introduced in [5]. On sequential computers , multi-stage corrector equations are seldom used in predictor-corrector iteration methods, because of the increased computational complexity if s > 1. However, parallel computers have charged the scene. A number of papers [15, 9, 11, 2, 3, 4, 14, 16] discuss the parallel aspects of functional iteration of Runge-Kutta-type correctors in the case of first-order and secondorder, nonstiff ODES and show that the sequential costs can be reduced to such an extent that they are at least competitive with, but often superior to, the best sequential codes . For stiff ODEs and VIEs, it has been shown in [10, 7] that so-called diagonally implicit iteration of Runge-Kutta-type correctors is suitable for implementation on parallel computers (see also Section 2 of the present paper). In [8], these functional and diagonally implicit iteration methods are discussed for solving the general class of correctors defined by (1.1) and preconditioning techniques for accelerating their convergence are studied. In this paper, we consider another approach to accelerating the convergence of parallel iteration methods, which leads us to Jacobi-type iteration methods. For nonstiff problems, we investigate Jacobi iteration methods that are implicit in the s stage values Yiq (i = 1, ... , s) corresponding to the qth component of the stage vectors Y1, and we show that its computational costs per step are hardly higher than those of function iteration. This type of Jacobi iteration will be called stage-value-Jacobi iteration. It turns out that diagonal dominancy of the Jacobian of the function Gn plays a crucial role in the rate of convergence of stage-value-Jacobi iteration . This is not surprising, because , as is well known, diagonal dominancy also plays an important role in classical (point-)Jacobi iteration . For example, we have the following classical theorem , the proof of which can be found in Collatz [6]: Theorem 1 .1. Let the matrix A in the linear system Ax = b be irreducibly diagonally dominant. Then the point-Jacobi iteration method xn+1 = xn - D-1 [Axn - bl, n = 0, 1,... converges for any starting vector x0. 0
For linear problems, we shall derive a safe estimate for the convergence factor of stage-valueJacobi iteration and it will be shown that for IVPs with strongly diagonally dominant Jacobian matrix, we obtain fast convergence , in spite of the modest degree of implicitness of the method . For a number of numerical examples, we compare its efficiency with that of function iteration and we test the reliability of the convergence factor estimate. 2. Parallel iteration methods We shall study iterative methods for solving the stage vector equation on parallel computers. Let us write the stage vector equation in (1.1) in the form (2.1) Rn(h , Y) := Y - F I(h, Uo, U1 .... , Un) - h°(M ® I)Gn(Y) = 0,
and consider Jacobi-type iteration methods of the form (2.2)
Iq(Y(j) - hVQGn(Y(i-l) + Iq(Y(i) - Y(j-l)))) = lq(YG- l) - h°QGn ( Y(1-i)) - PRn (h. YO-0)), q = 1,... k,
where the iteration index j runs from 1 to in. Here, P and Q are real, nonzero sd-by-sd matrices, and for a given value of q, Iq is an sd-by-sd diagonal matrix of which the diagonal entries are either I or 0 (if Q = 0, then (2.2) becomes fully explicit and reduces tofunctional iteration).
227 Each iteration in (2.2 ) requires itself the application of an iteration process for c uting YO). This iteration process will be called the inner iteration method and the iteration method (2 2) will be called the outer iteration method. It will be assumed that the inner iteration is defined by the modified Newton method. P may be considered as a preconditioning matrix and the matrices Q and determine the degree of implicitness of the iteration scheme. It will be assumed that the matrix ob^ned by summing all matrices In equals the identity matrix I, so that all components of the stage vector are iterated. Each iteration o the iteration method (2.2) requires the solution of a set of k uncoupled, implicit subsystems of dimension Trace(lq). Hence, it can be efficiently implemented on a k-processor computer. There are various obvious options for choosing the 'partitioning ' matrices Denoting the unit vector (with only unit entries) and the qth unit vector bye and eq, respectively , both having dimension sd, we recognize the following special cases: Point-Jacobi: k = sd Trac = 1 Iqe = e^ Stage-value-Jacobi: k = d Trace = s Iqe = e®e^ Stage-vector-Jacobi: k = s Trace(Iq) = d Iqe = eq®E k = 1 Trace(Iq) = sd 1q = I, Newton: where q = 1, ... , k. The most simple option is point -Jacobi iteration. It has optimal parallelism in the sense that k is as large as possible . The next simple option is stage-value-Jacobi iteration. It allows for massive parallelism for large systems (k = d). The qth processor iterates on the s stage values Yi (i = 1, ... , s) corresponding to the qth component of the stage vectors Yj, so that per step each processor has to solve to systems of equations of equal dimension s. However, in actual computations , the major part of the computational effort per step pei rocessor usually goes into the evaluation of ms components of the residual function Rn(h, YtJ )). A disadvantage of stage-value iteration may be the poor load balancing if the computational complexity of the components of the residual function vary widely. This disadvantage disappears in the case of stage-vector-Jacobi iteration , where the qth processor iterates on the d components Yqi (i =1, ... , d) of the qth stage vector Yg. Now the systems of equations have dimension d, so that for larger dimensions d the major part of -the computational effort per step per essor consists of solving these d dimensional systems. For IVPs on inating from ODES and VIES , this iteration method has been analysed in the case here P= land =DO with D an s-by-s matrix (cf. f 10, 7] where this type of iteration was called diagonally implicit iteration). It was shown that the sets of equations are of comparable computational complexity , so that we have more or less equal load balancm of the processors. Stage-vector-Jacobi iteration has the additional advantage of using the full Jacobian matrix of the IVP in the inner iteration which enables us to solve stiff systems efficiently . The disadvantage is the low number of processors that can efficiently be employed (k = s). At the end of the scale , we have Newton iteration with k = 1 and hence no intrinsic parallelism. In a more sophisticated partitioning approach , the matrices IIa^, are chosen such that sets of strongly coupled equations are taken together on one processor . Ii`owever, this requires precise information on the IVP to be solved , and can only be analysed for specific classes of problems. Finally, we remark that the iteration scheme (2.2) can be generalized by allowin the matrix M occurnn$ in the residual function Rn, to depend on the partitioning index q . This enables us to adapt the iteration method and the corrector to the particular subsystem to be iterated. However, in this paper, we confine our considerations to constant M. 2.1. The iteration error In order to analyse the behaviour of the iteration error Y(1) - Y we consider the error equation associated with (2.2) in the case where Gn is linear in Y, satisfying the relation (2.3) Gn(V) - Gn(W) _ (I®Jn) IV - WI,
with Jn the d-by-d Jacobian matrix of Gn (evaluated at tn). Omitting in Jn the step index n, the innerouter iteration method reduces to the recursion
228 (2.4) Iq [I - h"Q(I®J)Iq] [Y(i) - YO-')] = - Iq PRn(h, Y( i-I)). from which we deduce the iteration error equation (2.5) Iq [I - hvQ(I®J)Iq] [Y(i) - Y] = Iq [I - P + h`'P(M®J) - hvQ(I®J)Iq] [Y(1-i) - Y].
The combined effect of the iteration process for q = 1, 2, ... , k can be studied by considering the summed recursions given by k (2.6) [1- h""S] [Y() - Y] = [I - P - hv(S - P(M®J))] [YO-1) - Y], S := E I.Q(I®J)lq q=1
(recall that the summing the matrices Iq was assumed to yield the identity matrix). The matrix S can be expressed as ••• Q11J
k
'1
Q. ••• Qls
q=1 [Q11J
Qs1J ... QsaJ I1 Qsl ... Qss
where the Qij are d-by-d matrices . In the cases of stage-vector-Jacobi iteration (Iqe = eq®e), pointJacobi iteration (Iqe = eq), and stage-value-Jacobi iteration (Iqe = e®eq), we respec vely obtain /QiiJ ... 0 ' 1511 ... 0 \ (Sit ... S1s \
0 ... Qasl ) I. 0 ... Sss ) I. S51 ... Sss
where S := diag (QijJ). From these representations it follows that stage-vector-Jacobi iteration does not split to Jacobian matrix , while the diagonal operation in the point-Jacobi and stage-value-Jacobi iteration methods will in general not preserve the complete Jacobian. In the remainder of this paper, we restrict our considerations to point-Jacobi iteration and stagevalue-Jacobi iteration without preconditioning (P = I). 3. Jacobi iteration versus functional iteration In this section, we discuss various aspects of Jacobi iteration with (3.1) P=1' Q = M®I.
Assuming that the Jacobian of Gn(Y(J)) at to is given by I®Jn (cf. (2.3)) and solving (2.2) for YO) by just one Newton iteration , we obtain (3.2) Iq[I - h"(M®Jn)Iq] [Y(') - Y6-1)] IgRn(h, Y('-1)), q = 1 , ... , d; j = 1,... , m.
This equation shows that for point- and stage-value-Jacobi iteration methods only diagonal entries of the Jacobian matrix of the NP enter into the iteration process, so that stiff systems can only be solved if the Jacobian Jn is sufficiently diagonally dominant . Hence, in practice, one should consider the methods using point- and stage-value-Jacobi iteration as nonstif' solvers. This immediately raises the question whether Jacobi iteration has any advantage over (explicit) functional iteration obtained for P =
229 I and Q = O. Let us first compare the computational costs of the two type of methods when implemented with some stepsize and iteration error strategy. 3.1. Computational costs. Denoting the total number of steps in the integration process by N and the number of steps where we need a new LU-decomposition of the matrix I - Iqh (M®Jn)Iq by ON, we conclude that the major costs of the stage-value-Jacobi iteration method are: N evaluations of the sd components of Y(O) mN evaluations of the sd components of the residual function Rn mN estimates of the sd components of the iteration error ON evaluations of the d diagonal entries of J nn dON LU-decom ^^ppoossiitions of s-by-s matrices of the form I - Ighv(M®Jn)Iq mdN backward/forward substitutions by s-by-s matrices. Here, in should be interpreted as the averaged number of iterations over all N steps. To the iteration costs listed above, we have to add the costs of in (1.1) N evaluations of the rd components of the function Hn defining the step point formula N estimates of the rd components of the truncation error associated with Un+1 These costs have intrinsic parallelism of degree at least d, so that d processors can efficiently be employed. Suppose that the evaluation of one (block)component of Rq and Hn, and the evaluation of the diagonal entries of Jn require FR, F^;, and F Dating-point operations (fl s), respectively , and let us assume that F^ also contains the costs of Y^ and iteration error costs , and that truncation error costs are included m FH. Then , the total number of flopps processor per step required by functional iteration and stage-value-Jacobi iteration are given by := msFR+ rF^ and FSVJ :=msFR + 6Fj + 28s /3 + 2ms + rFH, respectively. Thus, FSVJ _ + OFj + 29x3/3 + 2ms2 < I + FFJ - 1 msFR + rFH
9FJ + 29s3/3 + 2ms2 msFR
In general, FJ < FR, so that we find FSVJ < ins +9 + 2s 9s+3m 1+ 2s 9s+3m FFJ ms 3m FR 3m FR
This costs-increase factor changes per step and per processor because the value of FR usually varies with t (e.g., in the case of Volterra equations ) and with the components of Rn. It is larger as FR is smaller. On the other hand, the run time per processor per step is largest for the processor to which the most expensive components of the residual function are assigned. Hence, the relevant costs-increase factor is bounded by 1 + s(1 + Os/3m)/max(FR). In most applications , this factor is only marginally larger than 1. For example, using an s-stage Gauss-Legendre corrector and iterating until the order of the corrector is reached leads to in = 2s-1 iterations per step. Hence, stage-value-Jacobi iteration is about a factor 1 + s/max(FR ) more expensive than functional iteration. In the case of point-Jacobi iteration, we have similar costs, except for the LU-decompositions and backward/ forward substitutions which are negligible because only scalarly implicit relations are involved. As a consequence, the main costs have parallelism of degree sd. We find Fpj F FF1 - 1 + msFR+ rFH < 1 + ms '
230 so that point-Jacobi increases the computational costs only marginally. Summarizing, we conclude that point-Jacobi and stage-value-Jacobi are generally not much more expensive than functional iteration. 3.2. The convergence factor. Next, we consider the convergence of the Jacobi method (3.2). The error equation corresponding to (3.2) reads (3.3) YU) - Y = Z [Y(' 1) - Y], Z := by (I - hvK®JD)-1 (M®J - K®JD), JD := diag (J),
where for functional iteration , point-Jacobi and stage-value-Jacobi we have K = 0, K = diag (M) and K = M, respectively. We shall call Z the iteration matrix and its spectral radius p(Z) the convergence factor of the iteration method. The expression (3.3) shows that we always have convergence (i.e., p(Z) < 1) if h is sufficiently small. For functional iteration the iteration matrix reduces to Z:= hV M®J,
so that we have convergence factor (3.4) P(Z) = h"P(M)P(J)• For Jacobi iteration, it is convenient to factorize Z according to (3.5) Z := Z1Z2 ,
ZI := (hvK®JD) (I - hvK ®JD)-1. Z2:= K-IM®Jo 1J - I,
where K and JI) are assumed to be nonsingular. This representation shows that, unlike functional iteration, Jacobi iteration has a bouqded iteration matrix Z for all h and J, provided that the entries of the'diagonally- scaled'-JacobianJD J are bounded. Furthermore, the matrix ZI can be partitioned into a matrix with diagonal blocks ci 'Jp, whereas the blocks in the partitioning of Z2 contains the full matrix JD J. Therefore, the mftnx Z2 will largely determine the convergence behaviour of the iteration process. The convergence wil be faster as the magnitude (in some sense) of the iteration matrix Z = ZIZ2 is smaller. We shall estimate the magnitude of this matrix by the quantity p(Zl)p(Z2). The following theorem presents an easy estimate for p(Zl)p(Z2) and specifies a few cases where p (Zl)p(Z2 provides an estimate for the convergence factor p(Z). In this theorem, it is convenient to use the minimal value of the real parts of the eigenvalues of a matrix A. Denoting the spectrum of A by a(A), this quantity is defined by (3.6) µ(A):= min (Re(a): a e a(A)). Theorem 3.2. Let
( 3 . 7 ) 6(J D) E R' , a( K) E C+ , E (h ) :_
hvP(K)P(JD) P ( K-IM®JD 1J - I) 1 + 2hvµ (K)µ(-Jtr) + h2VP(K)2P(JD)2
Then the following assertions hold: (a)
Arbitrary K, M and JD
P(Z1)P(Z2) 5 E(h).
(b)
KM = MK, JD = 61
PM :5 P(Z1) P(Z2) <_ E(h).
(c)
K=M, JD=8I
P(Z) = P(Zt)P(Z2) 5 E(h).
f )
231 (d)
K = M, Re (a(K)) = IKK), JD = 81
=>
P(Z) - P(ZI)P(Z2) _ E(h).
(e)
K = rl, JD-8I
=>
P(Z) = P(Zl)P(72) - E(h).
Proof. Let x and 8 denote the eigenvalues of K and JD, respectively. From the definition of Zl and Z2 it then follows that (3.8) p(Z) P(
I Z2) = maxx8 iI1C'1 P(Z2) = maxKB h_I1C81 P(Z2) 1 - hvKB I 1 + 2h"Re(-x8) + h2vh1C812 5 maxK8
h"IK81 PCZ 1 + 2h"µ(K)µ(-JD) + h2"IK812
Since the righthand side in this inequality is an increasing function of I41, we obtain the result (a). The convergence factor pp(3) is bou ed by p(Zl)p(Z2) if Zl and Z2 commute, or equivalently, if the matrices K®JD and KM®J ' J commute. This happens if both K and M, and JD and J commute. The condition on J)? and J implies that the Jacobian matrix J has constant diagonal entries, to obtain the result (b).have Thirdly, if also K =^nffi'n Z becomes thdirect e have ). The assertions (d} and (e) follow by observing that in these ses we sp Zle)gpu^al2), g oc In the case of stage-value-Jacob ) i iteration (K = M), the estimate E(h) reduces to ( 3 .7 9
E ( h)
h"p(M) P(JD) P (JD 1J - I) 1 + 2h"µ(M)µ(-JD) + h2vp(M)2p(JD)2
showing that, independent of the particular corrector used , fast convergence can be expected when applied to IVPs possessing strongly diagonal dominant Jacobian matrices , i.e., p(JD J - 1) << 1. Therefore, from now on , we concentrate on stage-value-Jacobi iteration . For future reference, we list in Table 3. 1 the radius p(M) and the minimal real part µ (M) of the spectrum of the matrices M of Gauss-Legendre correctors. Table 3.1. Values of p(M) and µ(M) for s-stage Gauss-Legendre correctors. s=2 s=3 s=4 s=5 s=6 p(M) 0.289 0.216 0.166 0.133 0.115 µ(M) 0.250 0.143 0.092 0.064 0.048
3.3. Transformation to constant diagonal entries in the Jacobian In general, the Jacobian J will have variable diagonal entries, so that the condition JD = 81 in Theorem 3.2 (b) - (e) is not satisfied and consequently the estimate E(h) is not necessarily an upper bound for the convergence factor p(Z). In order to gain some apnori insight into the true convergence factors for problems with nonconstant diagonal entries in the Jacobian, we may try to transform the problem into a problem with constant diagonal entries in its Jacobian. If the integration method applied to the original and transformed problems show a comparable convergence behaviour of the iteration process, then the convergence factor corresponding to the transformed problem is indicative for the convergence factor corresponding to the original problem. We illustrate this for the IVP for ODEs. Let
232 the ODE be given by y'(t) = f(y(t)), and define z(t) = Ty(t) wit :4 a constant nonsingular d-by,d matrix. In terms of z(t), we have the ODE z '(t) = y(z(t)) := Tf(T` z(t)) with Jacobian matrix TJT where J = J(y) denotes the Jacobian of the ori ginal right hand side function f. Suppose that we can find a matrix T such that at y = y(t) the matrix TJT has constant diagonal entries 8. Then, instead of integrating the equation y'(t) = ff(y(t)) from to to tn+1, we can integrate the equation z'(t) = g(z (t)) over this interval, while satisfying the condition of constant diagonal entries . The iteration matnx defining the iteration process for the transformed problem is given by (3.9) Z := Z1Z2 ,
Zi :_ (h"K®SI) (I - h"K®8I)-i, Z2 K-1M®S-ITJT-1 - I.
A comparison of the iteration matrices defined by (3.5) and (3.8) reveals that they are rather similar indicating that we can expect comparable convergence behaviour. We shall call the iteration method with iteration matrix (3.9) the transformed iteration method. Let us consider the case of triangular transformation matrices T. In order to construct such a transformation matrix T, we write T = L -t D, where L is strictly lower triangular and D is diagonal. To obtain constant diagonal entries S in TJT , we have to satisfy the relation (3.10) diag((L + D) J (L + D)- I) = 81.
Given the matrix L and S, this equation presents a system of d equations for the d diagonal entries of D. Theorem 3. 3 presents an extremely simple transformation that can be used for deriving apriori estimates for the convergence factor in cases where the Jacobian contains at least one row with nonzero off-diagonal elements. Theorem 3.3. Let J be an d-by-d matrix with entries aij and let T be the triangular matrix defined by di 0 0 0 ... 1 d2 0 0 ... T:= 1 0 d 3 0 ... ,
di:=gala.. ,
r a c e (j) J , i=2,...,d. :=T
I00dq:::
If dl,aliand8-aiidonotvanishfori= 2,...,d,then diag (TJT i)= 81. Proof. Substitution of L + D = T and (di)-1 0 0 ... 0 did2)- i (d2)-i 0 ... 0 (L + D) -l = T-1 ( (did3) 1 0 (d3)-1 ... 0
into (3.9) yields the following system for the diagonal entries di: n all - I ai) (d)Y4 = 8; aii(di)-i + aii = S. i = 2..... d. j=2
233 Choosing 8 = Trace(J)/d, this system is solved by di = ali/(S-aii), i = 2,..., d, leaving di free. Il 4. Numerical experiments In this section, we report numerical comparisons of results obtained by functional iteration and by stage-value-Jacobi iteration for IVPs for first-order ODEs y'(t) = f(t, y(t)). In our experiments, we used the fourth-order Gauss-Legendre corrector, so that the residual function occurring in (3.2) is given by Rn(h, Y) = Y - ye - h(M401) f(etn + ch , Y), M = 12 I 3 3-2I` 3+215- 3
We used the simple last step value' predictor Y(O) = eyn . In order to 'tune' the arguments of f, we set c = 0 in the coruputation of Rn (h, Y ), and c = Me otherwise. In particular, we check the relevance of the estimate E(h) defined by (3.7) as an indicator for convergence of the iteration method. For functional iteration and stage-value-Jacobi iteration the estimates E(h) are respectively given by 0.29hp(JD)P(Jn71J -1)
(4 . 1) EFI(h) = 0 29h P(J ), ESVJ(h) =
1 + 0.5hµ(-JD) + 0.084h2p(JD)2
4.1. Effect of the constant-diagonal-transformation Firstly, we compare the convergence of functional iteration and of stage-value-Jacobi iteration for the untransformed and transformed problem. Consider the linear problem
(0
1 1 1 1
(4.2) =Jy(t)+v, y(0)= 0, J:=
-2 1 1, v:= (-11, OStST. 1 1 -1/21 l2
At t = T = 5, the solution is approximately given by y(5) = (41.529764, 18.516263 , 51.537861)T. The rapid increase of the solution values is caused by a positive eigenvalue of the Jacobian m4trix J (they are approximate) given by -2.19, -2 and +0.69). Since p(J) = 2.2, P(JD) = 2 and P(JD J - I) - 1.9, the estimates E(h) for functional iteration and stage -value-Jacobi iteration are given by (43a) EFi(h) = 0.64h, ESVJ(h) _
1.1h 1 + 0.25h + 0.336h2
Next we consider the transformed version of (4.1). According to Theorem 3.3, we define the matrices
T :=
I 1 0 -3R J
03 r t _ (2/3 S 0 b -2
/
so that we can transform (4.2) to the constant-Jacobian-diagonal-form: (4A)
= TJT" tz(t) + Tv ,
z(0) = 0 ,
/ -7/6 5/6 -4/6 TJTt = 49/30 - 7/6 - 22/15 -11/12 -5/12 -7/6
,
05t<_T.
234 We now have (JD) = 7/6 and p(JD-lJ - I) = p(-(6/7)TJT-1 - I) = p(-(617)J - I) = 1.59. Denoting the estimate E(h) for transformed stage-value iteration by ETSVJ(h), we find (4.3b) ETSVJ(h) 0.54h 1 + 0.58h + 0.11h2
We integrate (4.2) and (4.4) from t = 0 until t = 5 using stepsizes h = T/N with N = 1, ... ,5. In - zN. the case (4.4), the numerical solution yN at t = 5 is obtained by the back transformation yN = T1 For the two-point Gauss-Legendre corrector, Table 4.2 lists the estimates E(h) defined by (4.2) and (4.4), and the numbers of correct significant decimal digits A at the endpoint defined by (devision is meant componentwise) A -loglo (IIY(tNAIL). YON)
These results show that direct and transformed stage-value- Jacobi iteration perform similarly , but for transformed stage-value-Jacobi iteration the estimate ETSVJ(h) is a much better predictor for the actual performance of the iteration process than the estimate ESVJ( h) corresponding to direct stage -valueJacobi iteration . Furthermore , the convergence region of stage -value-Jacobi is considerably larger than that of functional iteration. However, if the functional iteration method does converge, then its true convergence factor seems to be smaller than that of stage-value-Jacobi. Table 4.2. Correct significant decimal digits A for problem (4.1) and (4.3) at t = T = 5 obtained for the two-point Gauss-Legendre corrector (* indicates A < 0). Iteration mode T/h E(h) m=2 m=3 m=4 m=5 ... m=10 ------------------------------------------------------------------------------------------------1.1 * * 0.8 0.8 ... 0.2 Functional iteration 3 4 .80 0.5 1.2 2.7 2.7 ... 2.6 1.5 2.4 3.0 3.0 ... 2.9 .64 5 1.7 0.1 0.2 0.3 0.4 ... 1.5 1 Direct 1.4 0.4 0.6 0.8 1.2 ... 1.6 stage-value iteration 2 1.2 0.6 0.9 1.4 2.0 ... 2.1 3 4 1.0 0.8 1.3 1.8 2.6 ... 2.6 1.0 1.5 2.2 3.2 ... 3.0 .88 5 1 1.04 * 0.1 0.1 0.3 ... 1.4 Transformed .76 0.3 0.5 0.7 1.0 ... 2.3 stage-value iteration 2 0.5 0.8 1.2 1.6 ... 2.4 3 .59 4 .49 0.7 1.1 1.6 2.3 ... 2.7 .41 0.9 1.4 2.0 2.9 ... 3.0 5 -------------------------------------------------------------------------------------------------
235 4.2. Widely spaced diagonal entries Our next test problem is a system of 10 nonlinear equations: -1 Y2(t) 0 ... 0 0 0 Yi(t) -2 Y3(t) ... 0 0 0 (4.5) d = A [y(t) - e sin(t)] + e cos(t), A :_ . 0 5 t:9 T, 0 0 0 ..yg(() 1o (t) 0 0 0 ... 0 Y9(t) -1 0
with exact solution y(t) = e lin(t). The problem is constructed such that the dia gonal entries of its Jacobian are widely varying, so that the constant-diagonal condition occurring in Theorem 3.2 is far from being satisfied. Along the solution , the Jacobian of (4.5) is given by the matrix A, so that usin gg Gerschgonn's disk theorem, we have for p (J) and p(J - lJ: I) the estimates - 10 + Isin(t)I and Isin(t)f, respectively. Hence, the diagonal dominance of the Jacobian depends on t, resulting in intervals of strong, weak or no diagonal dominancy . The estimates E(h) are given by EFI(h) = 3 . 19h, ESVI(h) =
2.9h 1+0.5h+8.4h2
Table 4.3 lists the number of correct decimal digits , defined by A := - Iog1o ( I yN - y(T) IL. ), N := h
These results clearly demonstrate the superior convergence behaviour of stage -value-Jacobi iteration. Table 4.3. Correct decimal digits for problem (4.5) at t = T = 5 obtained by two-point Gauss-Legend, corrector (* indicates divergence). Iteration mode h E(h)S m=1 m =2 m=3 m=4 m=5 m=6 m=7 m=8 m=9 m=10 * * * * Functional iterationl/2 1.6 1/4 0.80 * 2.1 1.5 4.5 2.3 1 .9 1/8 0.40 2. 1 2.9 3 .4 5.9 4. 3 4.4 Stage-value-Jacobi 1 0.92 0.6 1 .0 1.6 2.0 1/2 0.79 1.1 2.5 3 . 1 4.1 1/4 0.56 2. 8 3.6 4.7 1/8 0.33 3.0 4.3 6. 1 5.8 5.9 -------------------------------------------------------------------
* 2.3 2.1 2.7 4.7 4.8 5.1 6.1 5.9 2.0 ... 4.1 ... 4.7 ... 5.9 ...
4.3. Reaction -diffusion equations In order to see the effect of stage-value-Jacobi iteration in the case of a large system, we consider the two-dimensional reaction-diffusion equation (aba) C)U( at = eeu( t,x1,x2) - f(u(t,x1,x2)),
236 defined on the unit square. Here a is a small parameter and A denotes the Laplacian in the spatial variables x1 and x2. We selected a problem from combustion theory for which f(u) is defined as (4.6b) f(u) := D(1 + a - u) exp(- a ), D:= Re aS
Details about this model can be found in [12] . The temperature u(t, x1 , x2) is subject to the initial and boundary conditions (4.6c) u(0, x1, x2) = 1, an = 0 at x1= 0 , x2 = 0, u = 1 at xt = 1, x2 = 1.
Semidiscretization of (4.6a) on a uniform grid of width Ax, using symmetric second-order differences and incorporating the boundary conditions leads to a system of ODEs (4.6)
-dyLtJ
= e(A& )- %(t) - f(y(t)),
where f(y) has to be understood componentwise . For this problem we have J = e(Ax)"2A - diag(af(y(t))/ay) and JD = - diag(4e(Ax)-2e + af(y(t))/ay)•
In our test, we selected the following parameter values : R = 5, S = 10, a = 1 (see also [1]). Furthermore , e was set to 10- and Ax = 1/40, resulting in a set of 1600 ODEs . The effect of this parameter choice is that the solution u increases from u = 1 (at t = 0) to the 'steady state' u = 2 at t = 0.5, the endpoint of the integration interval. The main difficulty in this problem is caused by th%reaction term which changes sign in the interval of integration : af(u)/du = Dexp (-6/u)[(1 + a - u)6/u - 1] is positive until u reaches the value u = 1.71 (the so-called ' ignition' point where a reaction front is formed running to the outer Dirichlet boundaries). For components having a value > 1.71, af/au is negative, ending at af/au = -74 for uvalues close to the steady state. As a consequence of this behaviour, the elements of the matrix JD are small in some parts of the integration interval , resulting in large values of the factor p (JD- J - I). Once the ignition point has been reached, af/au becomes negative, the diagonal dominance of the Jacobian is re-established and p(JD J - 1) quickly decreases ; at the end of the integration interval we have (JD J - I) = 0.02. Hence, for this problem the estimate E(h) is only relevant in part of the integration P terv al ( we remark that the assumption 6(JD) a R- of Theorem 3.2 is even violated for some [values). Nontheless, we have applied the algorithms to this problem , particularly because reactiondiffusion equations have great practical relevance . The results of this test are collected in Table 4.4. Table 4.4. Correct decimal digits for problem (4.6') at t = T = 0.5 obtained by two-point Gauss-Legendre corrector (* indicates divergence). Iteration mode h Functional iteration 1/10
m=1 m=2 m=3 m=4 m=5 ... m=10
* * * * ... * 0.5 -0.1 0. 8 1.1 1 .2 ... 1.3 1/40 1.9 3.9 3.7 5.1 ... 5.1 1/80 3.6 4.6 5.3 6.6 ... 6.4 Stage-value-Jacobi 1/10 * 0.0 0.0 * 0.2 ... 1/20 2.6 4.1 3.5 ... 3.6 1/40 4.3 5.2 ... 5.1 1/80 5.4 6.4 ... 6.4 -------------------------------------------------------------------------------------------------
1120
237 We see that stage-value-Jacobi shows a much better convergence behaviour than functional iteration: 2 or 3 iterations are sufficient (for h S 1/20), whereas functional iteration needs at least 4 iterations. Hence, in spite of the aforementioned deficiencies of the stage-value -Jacobi method for this problem, it seems to possess a rather wide applicability. 4.4. Mildly stiff problems Finally, we show that stage-value-Jacobi iteration can even be applied to mildly stiff problems. Consider a test problem proposed by Kaps [ 13]: dYt(4 d = - (2 + 0) Y1(t) + e-1(Y2(t))2. y1(0)= y2(0)=1, 05151, d = Yi (t) - y(t) (1 + y2(t)) .
with exact solution yl = exp(-2t) and Y2 = exp(-t) for all values of the parameter e. For this problem we have (2+e-1) 0 l (2+e-1) 2e- 1Y2 ll l J- ( 1 -(1+2y2)/' JD= ( 0 -(1+2Y2))
Wf integrate thy problem wing the two-ppooint Gat1csLegendre corrector. For small e we have P(J) =c- , P( JD) - e- , and P(JD- J - 1) = (2Y2/(1+2Y2)) /L, leading to (4.9) EFp(h) = 0.29 (h/e), ESvt(
h) = 0.29 (h/e)(2Y2/(1+2Y2))1n 1 + 0.5h (1+2Y2) + 0.084(h/e)2
For e = .01, Table 4.5 lists the numbers of correct decimal digits (in absolute sense) for various values of the stepsize h. As in the preceding example, the convergence region of stage -value-Jacobi is considerably larger than that of functional iteration (assuming that the numerical approximation to Y2 varies from 1 until exp(-1 ) - 0.37, the interval for E$VJ(h) is easily calculated and given in the table)) Furthermore, although the Jacobian of this problem is only weakly diagonally dominant (i.e., p(JD- J -1) is not much smaller than 1 ), the rate of convergence of the stage-value-Jacobi method appears to be substantially larger than that of functional iteration.
Table 4.5. Correct decimal digits for problem (4.8) at t = 1 obtained by two-point Gauss-Legendre corrector for e =.01 (* indicates A < 0). ------------------------------------------------------------------------------------------------h E(h) m=1 m=2 m=3 m=4 ... m=10 Functional iteration2120 2 1.02 * 1/40 0.73 * Stage-value-Jacobi 1/2 [0.65, 0.82] * 1/5 [0.64, 0.80] * 1/10[0.61, 0.77]0.0 1/20[0.53, 0.66] 1.5 1/40[0.38, 0.47] 2.3
* 1.9 * 1.9 3.2 3.9 4.7
* 4.1 * 0.8 2.4 3.8 5.0
* 7.3 1.8 3.3 4.9 6.1 7.3
... ... ... ... ... ... ...
* 7.0 1.9 3.2 4.6 5.9 7.1
238 References [1] Adjerid , S. & Flaherty , J.E. (1988): A local refinement finite element method for two dimensional parabolic systems , SIAM J. Sci. Stat. Comput . 9,792-811. [ 2] Bun-age, K. (1991): The error behaviour of a general class of predictor -corrector methods, Appl. Numer. Math. 8, 201-216. [3] Burrage, K. (1992): The search for the Holy Grail , or Predictor-Corrector methods for solving ODEIVPs, to appear in Appl. Numer. Math.. [4] Burrage, K. (1993 ): Efficient block predictor-corrector methods with a small number of iterations, to appear in J. Comp. Appl. Math. [5] Burrage, K. & Butcher, J.C. (1980): Nonlinear stability of a general class of differential equations methods , BIT 20, 185-203. [6] Collatz, L. (1950): Ober die Konvergentzkriterien bei Iterationsverfahren fur lineare Gleichungssysteme, Math. Z. 53, 149-61. [7] Crisci, M.R., Houwen, P.J. van der, Russo, E. & Vecchio, A. (1992): Stability of parallel Volterra-Runge-Kutta methods, to appear in Appl. Numer. Math.. [8] Houwen, P.J. van der (1993): Preconditioning in implicit initial value problem methods on parallel computers, to appear in Advances in Comp. Math... [9] Houwen, P.J. van der & Sommeijer, B.P. (1990 ): Parallel iteration of high-order Runge-Kutta methods with stepsize control , J. Comp. Appl. Math. 29, 111-127. [10] Houwen, P.J. van der, & Sommeijer, B.P. (1991): Iterated Runge-Kutta methods on parallel computers, SIAM J. Sci. Stat. Comput. 12, 1000-1028. [11] Jackson , K.R. & Ngrsett, S.P. (1990): The potential for parallelism in Runge-Kutta methods, Part I: RK formulas in standard form, Technical Report No . 239/90, Department of Computer Science, University of Toronto. [12] Kapila, A.K. (1983 ): Asymptotic treatment of chemically reacting systems , Pitman Advanced Publ. Company. [13] Kaps, P. (1981 ): Rosenbrock-type methods, in: Numerical Methods for Stiff Initial Value Problems, G. Dahlquist and R. Jeltsch , eds., Bericht nr. 9, Inst. frir Geometric and Praktische Mathematik der RWTH Aachen , Aachen, Germany. [14] Nguyen huu Cong (1993 ): Note on the performance of direct and indirect Runge-Kutta -Nystrom methods, to appear in J. Comp . Appl. Math.. [ 15] Norsett, S.P. & Simonsen, H.H. (1989): Aspects of parallel Runge-Kutta methods, in: A. Bellen, C.W. Gear and E Russo (Eds.): Numerical Methods for Ordinary Differential Equations, Proceedings L'Aquila 1987 , LNM 1386, Springer-Verlag, Berlin. [16] Sommeijer, B.P. (1993): Explicit, high-order Runge-Kutta- Nystrom methods for parallel computers, submitted for publication.
WSSIAA 2(1993) pp. 239-253 ©World Scientific Publishing Company
Solving non-linear equations by preconditioned time stepping using an implicit integration method.
M.E. Kramer Koninklijke/Shell-Laboratorium Amsterdam,
P.O.Box 3003, Amsterdam, 1003 AA, the Netherlands. and R.M.M. Mattheij
Department of Mathematics, Eindhoven University of Technology, P.O.Box 513 , Eindhoven , 5600 MB, the Netherlands.
ABSTRACT In this paper preconditioned time stepping to solve non -linear equations is discussed. Sufficient conditions on the preconditioner are derived to guarantee local contractivity . In addition an implicit time stepping algorithm is introduced, which gives asympotic superlinear convergence.
0. Introduction Many problems in numerical analysis involve solving a set of non -linear equations, say
f(x)
=
0
f : R"` -* B'" m > 1. Newton's method is often used, because of its rapid asymptotic convergence. However, if the Jacobian is nearly singular or is very sensitive to changes in some directions, the convergence domain may be small. This is not only a theoretical consideration ; when solving boundary value problems (BVP's) with exponentially growing modes by the multiple shooting method , we actually encountered problems with Newton ' s method6 . And apparently this is not the only type of problems, where Newton ' s method does not perform flawlessly , for in literature several alternative solution methods can be found . One class of alternative solution methods is (parameter) continuation : a series of non -linear equations is solved where a (possibly artificial) parameter is varied , using the solution of the previous problem as initial guess for the next one, see e.g. 's'12. An idea that is theoretically related, though different in implementation , is to embed the non-linear equation into an ordinary differential equation (ODE), see e. g 8.11. Indeed , Newton ' s method can be considered as the application of the explicit Euler integration method with step size 1 on the initial value problem (IVP)
239
240 dx = J -'(x(t))f(x(t)) , t>0 , dt
(0.1)
X(O) = xo where J(x) denotes the Jacobian of f(x). This differential equation is often called Davidenko ' s equation; it is sometimes referred to as the closure of Newton's method. Notice that discretizing Eq. 0.1 with the explicit Euler scheme and a step size less than 1, yields damped Newton. In this paper we look at a larger class of initial value problems, viz.
dx = M(x(t))f(x(t)) , t>0 dt x(0) = x0 in order to obtain a zero x* of f(x). The matrix function M(x)e C(R -> Rm") is called the preconditioner. It is obvious that any zero x* of f(x) induces a constant solution x =_ x* of Eq 0.2 and that vice versa any constant solution of the IVP corresponds to a zero of f(x), if M(x*) is non-singular. In section 1 sufficient conditions on M(x) are derived to guarantee that Eq. 0.2 is asymptotically stable at x = x*. Moreover, we introduce an implicit integration method for the IVP, that is computationally cheaper than implicit Euler, but that does have its asymptotic stability properties in section 2. Additionaly we consider the convergence behaviour of the integration method. In the third section we derive a preconditioner for the set of non-linear equations arising in the solution of BVP's by multiple shooting. Finally, we give some numerical examples in thelast section.
§1 Preconditioners We shall use the following notion of local contractivity, defined by 3s>o Vx ° Ix ° -x` 1 <s : I x(t) -x ` 1 decreases monotonically to zero, o with x(t) the solution of Eq. 0.2 with x(0) = x .
(1.1)
Note that this is a stability requirement. The remainder of this section we will investigate under what conditions the preconditioner satisfies such a property. Let < . , . > denote the Euclidian inner product and I . I the corresponding vector norm in V. Furthermore let B(x*;R) denote the set ( xE B"' I I x-x* 15 R ) and let I. denote the mxm identity matrix. A useful concept in this case is the logarithmic norm, see e.g.3'4'9 1.2 Definition
For any matrix AERm' the logarithmic norm p[A] with respect to I . I is defined by
241
P[A] l o
Il+hh I -1
(1.2a)
The logarithmic norm with respect to the Euclidian norm satisfies p [A] = max o , > = max { ? I ? eigenvalue of i (A +A T) } . (1.3) An important result on the relationship between local contractivity and logarithmic norm was derived in3 . Here we formulate a partial result only. 1.4 Theorem Suppose there is a ball B(x*;R) such that (t)
3a>o VxEB(x',R)
: p[M(x)J(x)l <_ -a .
(ii) The functions flx), J(x) and M(x) are bounded on B(x*,R) by constants Cf, Cj and CM, respectively. (iii) The functions J(x) and M(x) are Lipschitz continuous on B(x*,R) with Lipschitz constants LL and LM , respective ly. And define the constant C by
C
<x-x',M(x) f' [J(x*+t(x-x'))-J(x)](x-x*)dt> := max ° .
xE B(x',R)
I x -x *
(1.4a)
I3
If r < min(aC -1,R), then dx °E e(x':.)
the solution x(t) of Eq. 0.2 with x(O) =x ° remains in B(x * ;r)
(1.4b)
and Ix(t)-x* I <_ exp((-(x + C r)t) Ix°-x The proof is based on a well-known contraction argument. Next we compare the contraction domain of Eq. 0.2 as meant in theorem 1.4 to the convergence domain of Newton ' s method. According to the affine invariant Newton-Kantorovich theorem, see e.g.s, the latter domain exists of those points x° that satisfy I J-1(x°)f(x°) Iy <_ .-, with y an upper bound on U-1 (x)(J(x)-J(y))I / Ix yI, where x,y are in a convex neighbourhood of x°. Comparing this to theorem 1.4 for Davidenko 's equation Eq. 0.1 , i.e. M(x) = -J-1(x), we see that y is of the same order of magnitude as C and that I T1(x°)f(x°) I - I x°-x* I can be identified with r. Since a = 1 (= -p[-T'(x)J(x)] ), the lemma shows a contraction domain for Davidenko ' s equation of approximately the same size as the convergence domain of Newton ' s method according to the Newton-Kantorovich theorem. Hence a nonDavidenko choice for the preconditioner can be beneficial if either the value of C is reduced or the value of a is increased.
242 §2 The integration method ; mixed Euler In this section we introduce a special integration method for the IVP 5 x(t) = M(x)f(x) , t>0, x(0) = x
o
(
21 )
and investigate its properties . Recall that it is our aim to obtain a zero x'" of f(x); not to obtain an accurate estimate of the solution x(t) of Eq. 2.1. When using explicit integration methods, numerical stability considerations invariably lead to step size restrictions . The trapezoidal scheme as used in Boggs2 does not yield ultimate fast convergence for large step sizes since x j+1 -X* = x` -x So we look for a simple method e.g . Euler backward, xi" = xi +h.M (xi+')f(xi+') , . 1?0 . (2.2) Instead of using an iterative scheme to solve Eq. 2.2 we use the following mixture of implicit and explicit Euler, to be referred to as mixed Eider xi+1 = xi +h.M (x')f(x'' ) , j?0 . (2.3) Solving this equation requires essentially less work and, as we prove later on, still inhibits the stability properties of an implicit Euler method. Some authors reject the use of implicit integration methods in this case, see e.g.'. An often used argument is that implicit integration methods require the solution of a non-linear equation, which was our original problem. However, Eq. 2.3 contains the step size hi. We show in §3 that this non-linear equation can always be solved with Newton's method, if h is sufficiently small. And, moreover, once xi is close to x% Newton's method converges for all step sizes hi > 0. First we show that the mixed Euler method is consistent of order one. 2.4 Lemma The discretization error S(ti,x,hi) of the mixed Euler method defined by S(ti,x,hi) := hj-'. {x(ti+,) -x(ti) -hhM(x(ti))f(x(ti+t))} , (2.4a) is bounded, as follows : S(ti,x,hi) I < _ 2 hi
max I-0) I +hi I M(x(ti))J(v2)
max I x(t) I , (2.4b)
with v2 in the convex hull of x(t) and x(ti+t)• Since the mixed Euler method is a consistent one-step scheme, it is a convergent integration method if M(x)f(x) is locally Lipschitz continuous on an appropriate domain . Thus far we looked at the properties of the mixed Euler method as an ODE-solver. Now we investigate its behaviour as a 'root-finder'.
243 2.5 Theorem (convergence of the iterative process to x*) Let r < min(aC-',R) and let x i be in B(x*;r) and suppose that hj is sufficiently small to guarantee that xj+' of Eq. 2.3 is in B(x*;r).
I f a -e r -CALM I x i -xi" I > 0, then the constant bj defined by bj := 1 +hj(a-CjLMIxi-xJ+' 1-0 Ixi-x*I) .
(2.5a)
is larger than 1 and I xj+' -x * 1 5 bj-' Ix i -x * I and, moreover, the vector x j+' is in the sphere with centre
1- 1 x*+ l x i and radius x x 2 bj 2 b, 2 bb
(2.5b)
For a proof see Kramer6. Once x' is in B(x*;r) this theorem can be applied , since a suitable choice of the parameter hj can guarantee that
(i) xi+' is in B(x*;r) (ii)a>Cr+CjLMIxi-xi+' I As soon as I xi-x* I < ((x - Cr)/2CjLM , the constant bb is larger than 1 independent of hj, i.e. the restrictions on hj are lifted . In that case it is recommended to choose hj large since this yields superlinear convergence, viz. Ixi+t -x*I 1 lim <_ lim = 0 , if lim h j i-'- Ix' -x*I i -'°° I+ht(a-CjLMJxj-x'•'I) j-+ -
We will show that if x°eB(x*;r), it is indeed possible to choose a step size sequence {hj} such that the corresponding sequence of mixed Euler elements {xi} converges to x* and reaches B(x*; (a - Cr)/C/LM ) in a finite amount of steps. Given an iterate xj, we obtain the next iterate x'+1 in the mixed Euler process from formula Eq. 2.3, by solving the non-linear equation
with
g(y;xj,hj) = 0
(2.6a)
g(y;xj,hj) := hj-' (y-xj) -M(xj)f(y) .
(2.6b)
We show that convergence of the Newton method with starting point x i can be influenced by the choice of the step size hj. Let the Newton iterates on g(y;xJ,hj) be denoted by {y'}, i.e.
y° yi,i
= xj =yi-Idg (y`;xj,ht )l ^g(y'.;xIhj)
y
J
, iz0 . (2.7)
244 2.8 Lemma Let x .'eB (x*;R). Then under the assumptions of theorem 1.4 the following statements hold.
hi (i) If < l+ahi
(2C2LJ
I f(xJ) I)-112
if 1-2R2LJ > If(xi) I
2R (CM (R2LJ+2If(xi)I))-' , if ^2LJ <_ If(xJ)I
then the Newton process Eq. 2.7 on g(y;xi;hi) converges. (ii) If
If(xi)I
<_ min (
a2
,_R2L. ), then the Newton process Eq. 2.7 con
2CMLJ 2 verges for all step sizes hi > 0. For proof see Kramer6. This lemma disproves an often used argument to reject implicit integration methods for Davidenko's equation, viz. it requires at each step solving a non-linear equation, which is, wrongly, considered to be equal to the original problem. For in this case convergence is guaranteed for appropriate values of hi. Now we are able to show that it is indeed possible the form a sequence of step sizes {hi} such that the corresponding mixed Euler sequence does not stall, but converges to x .
2.9 Theorem Let r < min(aO-',R) and x°EB(x*,r). There are e > 0 and h > 0 such that the mixed Euler sequence {x i} with step size h exists, lies in B(x*,r) and satisfies (2.9a)
djZo Proof See Kramer6.
2.10 Remark The sequence {x '} formed in theorem 2.9 reaches after a finite amount of steps the ball around x*, where neither the Newton-Kantorovich theorem nor theorem 2.5 imposes any bound on the step size. Hence the super-linear convergence is reached eventually. If we use -J-'(x) as a preconditioner, the first Newton step on g(y;xi,h) reads i
y ' = x' - 1hh,J '(xj)f(xi) ,
(2.11)
i.e. a damped Newton step for the original problem f(x) = 0. There are two major differences between damped Newton and our algorithm. First of all we generally perform several Newton steps on Eq. 2.6a. So y' is not the next
245 iterate, but only an intermediate result. Secondly, and more importantly, we base our choice of the damping factor on controlling the discretization error and not on iteratively adapting the damping factor until the value of some object function decreases. However, once the iterates xi approach x* the first Newton iterate on g(y;x ',h) is accepted as x j+1. At the same time hj tends to infinity, so the implementation of the mixed Euler method tends asymptotically to the ordinary Newton method. This shows that in this case our method has second order convergence eventually.
§3 Shooting methods and the construction of a preconditioner Consider a BVP with separated boundary conditions
dy = h(x,y(x)) , a < x < b
and
y r= C' ([a,b] -* R') , (3.1)
gl(y(b)) = 0 and g2(y(a)) = 0 with
gl : R" -4 R"-v and
g2 : R' -* BP
,
1!5p5n
We shall consider multiple shooting as a means to solve this BVP. To recall, for this method one chooses N+1, say, shooting points x; E (a,b], a =xl <x2<..... <xNl =b.
On each subinterval one has the following initial value problem Y = h(x,Y(x)) , x, <x <x,+
l
(3.2)
Y(X) = Si . The solution of the IVP on the eh interval is denoted by y; (x,-s;). The concatenation of these solutions is the solution to the original BVP 3. 1 by requiring that they form a continuous function satisfying the boundary conditions . So the initial vectors si have to be the solution of the following set of nonlinear equations f ( s ) = 0 , f E C'( Rn(N+') _, ]&"(N+1)) ; with S := (S1T,S2 , . . . ,SN+1)T (3.3) and Yl(x2;sl) - S2 Y2(x3;s2) - S3
AS) :_ YN(@N+1;SN) - SN+I g(SI , SN+1)
246 Let y(x) denote an approximation of the solution of Eq. 3.1. The linearization of Eq. 3.1 at y(x) will be denoted by z =A(x), a<x
(3.5) with
Ba= B I a2 p
B bl and B b = 1 J n P 0
The Jacobian J(s) of the nonlinear function 3.4 is r''I(x2) -I
'P2(x3) -I
(3.6)
J(s) = "N((N+I) -I Ba
Bb
where 'F(x) is the fundamental solution of Eq. 3.5 with Ti(xi) = I. We assume that the BVP 3.1 is well-conditioned, cf.6. This implies that (if y(x) is sufficiently close to the solution of Eq. 3.1.) Eq. 3.5 has a p dimensional subspace of growing modes and a (n-p) dimensional subspace of decreasing modes( i.e. the fundamental solution is dichotomic cf.6). Let Y(x) be the fundamental solution of Eq 3.5 with BaY(a) + BbY(b) = I , then the first p columns of Y(x) span the subspace of growing modes. Generally speaking it is expensive to obtain Y(x). However, if Z(x) is a fundamental solution with Ba2Z(a) = (0 I *), then the matrix H, such that Z(x) = Y(x)H, has a zero left lower (n p)xp block; i.e. the first p columns of Z(x) and Y(x) span the same space of increasing modes. So we choose the fundamental solution 4)i(x) of Eq. 3.5 on the subintervals such that (D; (x,) = Q;, with QI is an orthogonal matrix satisfying Ba2Q1 = (O I Ba2 ^ ) and Q. the orthogonal matrix resulting from the QU-decomposition (D,-1(x;) = Q1U;_1.
Then the first p columns of 1(x2) span the increasing modes integrated up to x = x2 and so do the first columns of Q2. By an induction argument this holds for all Q; Q. The fundamental matrices 4 can be introduced by differentiating f(s) with respect to another argument. Define Q := diag(Q1,Q2, ....QN+I) and then
Q := diag (Q2,Q3, ... .QN+I+I) (3.7)
247
U2 -I
af(s)
(3.8)
aQ TS
UN I BoQI BbQN+1
So in fact at every subinterval-endpoint xi1 the fundamental solution is split into an orthogonal matrix Qi+1 that contains information on the evolution of the directions of the various modes and an upper triangular matrix Ui, that contains information on the growth behaviour of those modes. This growth behaviour now is described by the dichotomy of the problem and so we can relate the magnitude of the elements of Ui to the dichotomy constants. Let the upper triangular matrix Ui be split into four blocks as in
B1 Ci Ui = with BiE R(n 0 Ei
p)x (n o)
and
E1E RPXP
This splitting implies that both I E. I and I Bi-1 I become small as we integrate over larger intervals. The part C. would be zero if the increasing and decreasing modes were orthogonal to each other. This would be an ideal case as it would give a complete decoupling between the two modes. However, by a local coordinate system transformation this situation can be reached. For this purpose we use a Ricatti-transformation as follows. From Eq. 3.8 one can see that, due to the zero structure of BoQI and BbQN+I, non singularity of the Jacobian J(s) implies non singularity of the matrices B., Ei and the left upper block Bb(l) of BbQN+I: B(1) Bb(2) (1)
BbQN.I = , Bb 0 0
E R(n v)x(n n)
(3.10)
Then RN
.I
Bb Bb (I)) 1 (2)
(3.11) for i = N downto 1
B;Ri - Ci - Ri.1E1 = 0 , This Ricatti transformation can be written as (I R. S. = ` 0 I
S.E Rnxn Si
which induces a blockdiagonal transformation
(3.12)
248 S := diag (SpS2, ... SN,1) It can be checked that
and
SQT af(s)
S := diag (S2,S3, ... SN,t,I) .
(3.13)
has blocks Bi and E, on the diagonal and I
asQ Ts blocks on the second upper diagonal and some blocks resulting from the boundary conditions on the two bottom block rows. The E, blocks have a small norm as they represent the growth of the decaying modes. However, the Bi blocks have a norm larger than 1. So, in order to have a diagonal dominant matrix, we have to rescale the Bi and boundary condition blocks out of this matrix and perform a permutation to move the I blocks to the diagonal. Finally define the scaling matrix diag(-B1,I,-B2,I, ... , BN,I,-Bb1),-B(2)
(3.14)
and the permutation matrix
(3.15) 0
1
-
FO 1 L0 0 0
001 This leads us to the desired form
af(s) = Q -^ .B ,P
(3.16)
asQ Ts
with j the matrix defined in Eq. 3.18.
3.17 Theorem Let 0 < E < 1 and suppose that the interval length xi.,-xi is sufficiently long so that `di : I Ei I < 1 -e and I Bi-' I < 1 -E
(3.17a)
And let K be an upper bound for I Ri I, ie 11, .., N+1). Then with M =QSTPTB-'gOT,
(3.17b)
p[MJ} < -e(K2+K+1)-'
(3.17c)
249
-I 0 B1_1 I 0 I 0
B21
El 0 -I 0 I 0 B31 E2 0 -I
(3.18)
J=
I 0 BNI EN-1 0 I 0 I 0 EN 0 I
The contractivity of the IVP
ds = M(s)f(s) , t>0 Ti s(0) = so .
(3.19)
increases if the shooting intervals are enlarged , because it decreases the size of I Ei I and I B ;-1 I. However, the IVP-integrator generally still bounds the length of the subintervals. Also, since we take I M(s)f(s) I as an estimate for the error in s, it is important that I M-1 I is not too large, for the main term therein is I B I. If a BVP has both exponentially fast growing and decaying modes, but does not have separated boundary conditions, the algorithm derived in this section can not be applied directly. By adding a few variables the system can be transformed to one with separated boundary conditions. However, the additional variables introduce constant modes, so that the theory does not hold. Nevertheless application of the algorithm may still be worthwhile, because one can not expect "redundant" BC to have an impact on properly controlling modes (which is essential in our decoupling analysis). When applying the mixed Euler method to the NP 3.19, the calculation of the iterates requires less work if the norm of M(s) is small (see lemma 2.9). From Eq. 3.17b it follows that
250
IM(S)1 < ISTI in-'II (3.20)
<_ (K2+K+1)•max(1, I (Bn11) 1 I, I(Ba2)) 1 I) = O(1) if the boundary conditions are properly scaled.
§4 Numerical results In' several explicit integration methods for Davidenko's equation are studied and tested on some problems. We have applied the mixed Euler implementation described above to those test problems, in order to illustrate the performance of the method and to compare it with the explicit integration methods presented in1 and with damped Newton. We tested the mixed Euler algorithm described in §3 on these problems with M(x) = -J-1(x). As a measure for the amount of work we use the number of function calls (#f) plus m (= dimension of the system) times the number of Jacobian evaluations (#J). This is also done int. We set the tolerances for the local error to 10-1, i.e. we require the approximation of the path x(t) to have approximately one correct number. Of course this large discretization error may jeopardize the convergence of the process if the iterates stray off the correct path. On the other hand larger tolerances for the discretization error allow larger step sizes and hence require less function evaluations. For the local error tolerances set to 10-1, the mixed Euler method converges for all eight test problems. The amount of function evaluations to obtain an approximation xi with error I J-1(x')fxj) I < 10-6, is listed in Table 1. Note that #J is equal to the amount of steps.
The mixed Euler method for the test problems in' with local error 10-1 and required accuracy 10'.
problem no.
1
2
3
4
5
6
#f #J #f+m#J
13 11 35
28 16 60
9 6 21
111 31 173
23 13 62
18 10 78
7 8 18 13 57
24 14 66
Table 1
We also compare the results of the mixed Euler method with the results from other explicit integration methods as presented in'. Table 2 gives the amount of work measured by #f+m.#J . This shows that the mixed Euler method performs better on all test problems , than the two explicit integrators used here and the trapezoidal rule.
251 Comparison with other timestep methods, #f+m#J for the test problems
ME RK3 AB3
PECE
1 35 64 71
2 60 89 95
3 21 55 43
4 173 334 299
5 62 113 109
6 78 169 127
7 57 280 221
8 66 280 229
133
157
115
337
185
309
347
355
ME = Mixed Euler RK3 = third order Runge Kutta AB3 = Adams-Bashforth variable step method order 3 PECE = Trapezoidal rule as described in Boggs2 Table 2
Time stepping methods are introduced, because in some cases the convergence domain of Newton ' s method is too small for practical use . Hence for comparison we applied a version of damped Newton :
xj+1 = xi - A. J 1(xJ).f(xJ) , 1>_0 , X E (0,1] (4.1) to the test functions . The damping factor XI is first chosen to be Xj = min(2.71j_1 1), but if some object function does not decrease, aj is halved until it does or A.j < 10-3. In the latter case the process is terminated , which is denoted by FAIL. in Table 4 (N.B. the mixed Euler process on the test problems converged with a step size hj >_ 10-1). Three different object functions were used (1) 1 f(Xp =) I , (2) 1 J-' (xo,d)f(x.) I , (3) I J-'(x.)f(x.) I ,
where x0Id is the last accepted Newton iterate and x„C1,, is the update obtained using Xj. If the object function decreases the update is accepted. Table 3 shows the number of iterations necessary to reach convergence I J-'(x j)f(x ) I < 10-6, for the three different object functions. Damped Newton's method with either object function could not solve the second problem , whereas our implementation of the mixed Euler method converged in 20 steps . For the most reliable (and expensive) choice I J-'(x,,,,)f(x„,,,,) I damped Newton failed to converge in three cases.
Finally we present a problem that is solved with a preconditioner outlined in section 3. Consider the two point BVP attributed tolo i(x) _ ? sinh(? x) , 0<x
(4.2)
252 Number of iterations with damped Newton
I f(x-) I 1
8
I J-I (x°wy(x„°,,,) I
I J-I (x,, )f(x-)I
6
6
2 FAIL FAIL 3 4 4 4 ? 50 23 5 2 6
FAIL 4 FAIL FAIL
6
6
6
6
7
7
7
7
8
8
8
7 Table 3
We use the multiple shooting code MUSN7, to solve Eq. 4.2. This code uses a RK45 integration method to integrate the ODE and uses modified Newton's method to solve the shooting vectors. Additionally we developed a timestep code (TS) which uses the same integrator as MUSN, but uses the mixed Euler algorithm with the preconditioner described in6 to solve the non-linear equations. For this BVP a measure for the robustness of a solver is the smallest amount of uniform subintervals needed to obtain convergence. Indeed, the larger the subintervals the more sensitive the non-linear equations are for changes in the shooting vectors. In table 4 we list results for some values of X. Here subint denotes the number of equispaced subintervals, iter is the number of iterations and #f 1O"-calls denotes the number of function calls. This shows that the mixed Euler time stepping is more robust, but that a Newton's method requires less work if it converges.
Results for BVP (4.2) with required tolerance 10-6 MUSN
TS
?
subint
iter
2 2
1 5
3
3
3
5
3 4 4
10 10 15
6 4 11 6
fail cony fail cony
16, 955 6,710 38,625 12,978
#&-calls
steps
fail
3, 350
21
cony
3,228
cony
2, 923
18 23
cony cony
5,027 10,314
22 32
cony cony
13,684 35,482
30
cony
36, 831
result
Table 4
, x,
result #fO0-calls
253 Conclusion We have seen that the mixed Euler method has stability properties similar to those of the implicit Euler method. The price for this is a restriction on the step size if we are far from the steady state. On the other hand every time step requires only 1 computation of the preconditioner M(x), so with respect to computational effort the method is competitive with explicit integration methods. On approaching the steady state the step sizes can increase without jeopardizing stability or existence of the next iterate, yielding a superlinear convergence rate. If the preconditioner M(x) equals -J-1(x) our algorithm tends asymptotically to Newton's method.
References 1. J.P. Abbott and R.P. Brent, Fast local convergence with single and multistep methods for nonlinear equations, J.Austral.Math.Soc.19 (series B) (1975), 173-199.
2. P.T. Boggs, The solution of nonlinear systems of equations by A-stable integration techniques, SIAM J.Numer.Anal. 8 (1971), 767-785. 3. G. Dahlquist, Stability and error bounds in the numerical integration of ordinary differential equations, Thesis 1958, in: Trans. Royal Inst. of Technology, No.130, Stockholm. 4. K. Dekker and J.G. Verwer, Stability of Runge-Kutta methods for stiff nonlinear differential equations, Amsterdam : CWI Monographs, 1984. 5. M. KubiI ek and V. Hlava ek, Numerical solution of nonlinear boundary value problems with applications, Englewood Cliffs: Prentice-Hall, 1983. 6. M.E. Kramer, Aspects of solving non-linear boundary value problems numerically, PhD.Thesis, Eindhoven Univ. Techn., 1992. 7. R.M.M. Mattheij and G. Staarink, Implementing multiple shooting for nonlinear BVP. Report of Eindhoven University of Technology RANA report 87-14.
8. J.M. Ortega and W.C. Rheinboldt, Iterative solution of nonlinear equations in several variables, New York : Academic Press, 1970. 9. T. Strom, On logarithmic norms, SIAM J. Numer.Anal. 12 (1975), 741-753. 10. B. Troesch, A simple approach to a sensitive two-point boundary value problem. J. of Comput. Physics 21 (1976), 279-290. 11. H.Wacker, A summary of the developments on imbedding methods, in Continuation methods, ed. H.Wacker. New York: Academic Press, 1978, 1-35. 12. E. Wasserstrom, Numerical solution by the continuation method. SIAM Review 15 (1973), 209-224.
WSSIAA 2(1993) pp. 255-270 ©World Scientific Publishing Company
C O N V E R G E N C E OF P R O D U C T I N T E G R A T I O N R U L E S FOR WEIGHTS ON THE WHOLE REAL LINE D.S. LUBINSKY Department of Mathematics, University of the Witwatersrand, P.O. Wits 2050, South Africa. Abstract. We investigate convergence of interpolatory product integration rules associated with weights w on the real line. If {x/n}?-i *re the seros of the nth orthonormal polynomial for w, and we are given a function k € £ i ( R ) such that k(z)z> 6 Z»i(R), j — 1,2,3 we can form an interpolatory product integration rule In[k; f] with points {*y B }? = i- We investigate convergence of these rules for a class of weights including w(z) := exp(—|x|'), P > 1, under relevant conditions on k. We obtain necessary and sufficient conditions for convergence of the rules {Jn[k; ] } „ = 1 and their companion rules, and also investigate asymptotic behaviour of the weights tuy„. The results extend work of Smith, Sloan and Opie for the Hermite weight, and depend on recent results of the author and D.M. Matjila on mean convergence for Lagrange interpolation for Freud weights.
1.
Introduction and Statement of Results
The product integration approach for approximating integrals involves splitting the integrand into a "difficult" but known function k(x), and a relatively smooth, but unknown function f(x). Thus we seek to approximate OO
(1.1)
'[*?/]:= J f(x)k(x)dx. — OO
Typically the function k(x) is absorbed into the weights in the integration rule, and in this paper, we consider interpolatory rules, that is rules (1.2)
n
/»|*;/]:=5>iB/(*,-»)
satisfying OO
(1.3)
In\k;P] = I[k;P] = J P(x)k(x)dx, — oo
255
P e
Pn-u
256 where Pm denotes the set of polynomials of degree at most in. The rule is called interpolatory because it admits the representation
I.[k; f] = J Ln[f1( x)k(x)dx,
where L,,[f] E P,,_1 is the Lagrange interpolation polynomial to f at Interpolatory product integration rules have been widely applied on finite intervals 1,3,4,16,17. There has been less work for interpolatory product integration rules on infinite intervals6,11," In this paper, we shall consider interpolatory rules based on the zeros of orthonormal polynomials for Freud weights e-'Q, such as exp(-Ixl,6), /9 > 1, and obtain necessary and sufficient conditions for convergence of the rules. Moreover, we investigate the convergence of the companion rules n
(1.5)
In[k;f]
E Iwjnlf(x9n) j=1
and also the asymptotic behaviour of the weights win. This paper may be viewed as an infinite interval analogue of the papers of Sloan and Smith16 , 17 for finite intervals , and we use many of the ideas from those classic papers. We also depend very heavily on the necessary and sufficient conditions for mean convergence of Lagrange interpolation associated with Freud weights, established by the author and D.M. Matjila9 , and that paper in turn depended on bounds for orthonormal polynomials established by A.L. Levin and the author7. To state our results, we need a little more notation . Let w : R -+ [0, oo) be a weight, that is 0
f
0<
J
Itilw( t)dt < oo, .£ = 0,1, 2,... .
We denote the orthonormal polynomials for w by pn(x) = pn(w, x), n = 0, 1, 2. .. so that
(1.7)
pn(x)
:=
pn(w , x ) : _ ,rnxn
} ... 1n =
7n(w)
which is defined by the condition
(1.8)
J Pn()Pm(x)w(x)d x =limn.
> 0,
257 The zeros of pn are ordered so that (1.9) -00 < xnn < xn_l,n < xn_2 n < • • • < xln < 00.
Throughout, we assume that the integration rules In[k; •] are defined by (1.2), (1.3) and that the points xin, j, n > 1, are zeros of orthogonal polynomials with respect to a fixed weight w, which will be specified in the theorems. Following is our first result for the special weights w(x) = exp(-IxlQ), /3 > 1. Theorem 1 .1
Let / 3 > 1, 1 < p < oo, q := p/(p - 1), 0 E Ili., a > 0, and
( 1.10) & := min{1, a}. Let (1.11) W,6( x) := exp (-
2IxI"),
x E lR,
and let w (x) := W2(x) = exp(-IxLet iP - &, p < 4, (1.12) pi s(1-P), p>4. Then for
(1.13) lim
n-+oo
I. [k;
f] = I[k; f]
to hold for every continuous function f : 1R -i ]R satisfying (1.14) lim I f(x)IWe(x)(1 + IxI)a = 0, Ixloo
and for every measurable function k : Ili, -+ ]R satisfying
(1.15)
I
k(x)Wj
1(x)(1 + IxI)AIIL,(]R> <
it is necessary and sufficient that 0>r if 1
00,
258 If p > 4, and a < 1, it is necessary that 0 > T and sufficient that 0 > T. Moreover the sufficient conditions guarantee that (1.13) holds, and that (1.16)
lim n In[k;f]= I [Ikl;f], ca
for every function f : ]R -3 ]R that is bounded and R.iemann integrable function in each finite interval, and that satisfies (1.14). Remarks ( a) The sufficient conditions here contain those in 6,11 (b) When we require convergence for all continuous f satisfying (1.14) and for all k satisfying ( 1.15) and for all 1 < p < oo, the necessary and sufficient conditions above simplify to (see Corollary 1.2 in 9)
(1.17) 0>-&+ max{1,6}. To formulate our result for more general weights, we need the MhaskarRahmanov-Saff number a„ 12,13. Let W := e-Q, where Q : ]R, -> 1R is even, continuous , and xQ'( x) is positive and increasing in (0, oo ), with limits 0 and oo at 0 and oo respectively. For u > 0, the uth Mhaskar-Rahmanov-Saff number a„ is the positive root of the equation
(1.18)
u= 2
J
autQ'(a.t)dt/ 1 - P.
0
Under the conditions on Q below, which guarantee that Q(s) and Q'(s) increase strictly in (0, oo), au is uniquely defined, and increases with u. It grows roughly like Q['1] (u), where Q[-1] denotes the inverse of Q on (0, cc). Its significance lies partly in the identity (1.19)
IIPWIIL-(IR) = IIPWIIL00[-a, ,an],
which holds for polynomials P of degree < n 12,13. Moreover 10,12, the largest zero x1,, of the nth orthonormal polynomial pn( e-2Q,x) satisfies lim xln/an = 1. n-co
For W = Wp, it is known that for u > 0, ( 1.20)
au = C,3u1/R,
where
(1.21)
CR := [2$-1r(i/2)2/r(Q)] 1/1s
259 Theorem 1.1. is a special case of the following general result: Let W := e-Q, where Q : lR -+ lR is even and continuous in IR, Q" Theorem 1.2 is continuous in (0, oo) and Q' > 0 in (0, oo), while for some A, B > 1, (1.22)
A < x (xQ'(x))/Q'(x) < B, X E (0, oo).
Let w := a-2Q and let 1 < p < oo, 0 E ]R,, a > 0, and & be given by (1.10). Then for (1.13) to hold for every continuous function f : lR -+ 1R. satisfying
(1.23) I imI.f(x)IW(x)(1 + IxI)a =0, and for every measurable function k : lR -^ lR satisfying (1.24) II
k(x)W-1(x)(1 + Ixl)"'IIL9(]R) < oo,
if p < 4, it is necessary and sufficient that >-&+ 1 -, p
(1.25)
and if p > 4 and a > 1, it is necessary and sufficient that (1.26) al/P-(&+o)n(1/6)(1-4/n) = 0(1), n , oo; and if p > 4 and a = 1, it is necessary and sufficient that (1.27) an/p-(a+o)n(1/6)(1-4/D) = 0 (l01 n) ' n --+ oo. log For p > 4 and a < 1, it is necessary that (1.26) holds, and sufficient that (1.27) holds. Moreover the sufficient conditions guarantee that (1.13) and (1.16) hold for every function f : lR -> lR that is bounded and Riemann integrable function in each finite interval, and that satisfies (1.23). Next, we turn to the asymptotic behaviour of the weights. We present only one such result, which under "global" conditions on k gives asymptotics for the weights wj,,, in terms of the Christoffel numbers (or Gauss quadrature weights) for the weight w := e-2Q, which we denote by Aj.n. Recall that these arise in the Gauss quadrature formula:
(1.28)
I..[w ;1]
_
Ajn,f(xjn), j=1
260 satisfying OO
(1.29)
In[w;P] = f P(x)w(x)dx,
P G V2n-i-
— OO
We shall also need the Nevai- Ullman density of order /3, which plays the same role for the weights exp(—|x|^) as does the arcsin density T:~X J\J\ — x2 on [—1,1]. This is12 1
(1.30)
^(x) = ^ j t ^ i t
2
- x2)-l'2dt,
x 6 [-1,1].
1*1 Finally, we need the error in weighted polynomial approximation (1.31)
En\f; W] := inf {||(/ - P ) W | | t o o ( R ) : P € Vn}, n > 1.
Theorem 1.3 Assume that W = e~Q and w = e _2< 2 satisfy the hypotheses of Theorem 1.2 and a„ = an(Q), n > 1. Let k : I t —> I I satisfy
ll*Wr-,|L„(R)<~.
(1.32) and
En[kW-2;W] =o(^-A,
(1.33)
n ^ oo.
Let 0 < a < 1. Then the weights Wjn in the rule In[k; •] satisfy (1.34)
lim
max \wjn - X^kix^W^ix^MW-^x^)
— = 0.
In particular, if W = Wp, 0 > 1, then (1.35)
lim
max \wjn—
n—>oo | l j „ | < ( T o n I
Remarks
VT _ I (x > n ) = 0.
- k(xjn)/np(xjn/an)
On
I
(a) Note that (1.35) implies that if x j n -» x, n -+ oo, then lim w ; n — =
k{x)/np(Q),
261 uniformly for x in any closed interval. (b) The implicit condition (1.33) forces k to be continuous in 1R, and forces a mild smoothness on k. If for example, kWi 2 satisfies a weighted Lipschitz condition of order a E (0, 1), then it is known (see p.186 of 2) that (1.33) is true. More transparently, if k is differentiable, and
II k'W I L-( ]R ) + II kQ'W-1 I -1
(1.36)
L
-( 1R)
<
then (see p.185 of 2)
En[kW-2; W] = O
(an) , n -* oo, n
and the last term is 0(n-T), some r > 0. Since product integration rules are frequently applied to oscillatory kernels, we mention one choice of k satisfying (1.36): k (x) := exp (- aQ(x)) sin px cos or k (x) := exp (- aQ(x)) cos px, where p is fixed, and a > 1. We prove the results in Section 2. 2. Proofs We shall first need some more notation. Throughout, C, C1, C2, ... denote positive constants independent of n, x and polynomials of degree n. The same symbol does not necessarily represent the same constant in different occurrences. Throughout , we assume that W := a-Q and w := W2 are as in Theorem 1.2, that 1 < p < oo; q := p/(p - 1 ); a > 0; 0 E R. To emphasize the dependence of the weights w,jn in I. [k; •] on k , we write w,n[k] below . In this context, note that win[k] is linear in k.
If J : X -+ Y is a linear operator between normed spaces X, Y, with norms II • Ilx, II . III', respectively, we write
IIJIIx-+Y := sup{MMJ[f]IIy : 11f 11X < 1} for the usual operator norm, and X* denotes the dual of X, that is the space of bounded linear functionals from X to It, with norm II - IIx-IR. Lemma 2.1 and satisfies
(2.1)
Let t : R-4 [0,1] bean even function, which is decreasing in [0, oo)
lim i(x)=0
X-00
and (2.2)
1 > 17(x) > (1 + x2)-1, x E [0, 00).
262 Define spaces X, Y, Z as follows: X is the space of continuous g : JR --+ 1R, with
(2.3) IIgIIx
I g (x)W(x)(1
+ Ixl)"7(x)-1
I
L,o(IR)
< oo;
Y is the space of measurable k : 1R -+ JR with
IIkIlr
(2.4)
:= k(x)W-1(x)(1 +
I xI )'6'
ILq(1R) <
oo,
Z is the space of measurable h : ]R -> JR with
I
(2.5) IIhlIz := h(x)W(x)(1 + I xl)-'6'
Ly(1R)
< oo.
Then (i) n
(2.6)
IIln[k;']II x-.1R = Elwjn[ k ]IW- 1(xjn )(1 + Ixjnl)- "l (xjn), j=1
and Illn[',']Ilr_x* = sup
II k IIY<1
IIIn[k
; .]llx-.]R = IILnIIx_z.
(ii) Moreover,
(2.8)
suplln [k; f] I < 00, V f E X, V k E Y,
if (2.9)
L := sup II LnII x.Z < 00, n>1
and in this case, (2.10) Illn[k;']IIX +R LIIkIlyVn > 1, Vk E Y. Proof (i) Firstly, ( 2.6) is an immediate consequence of the definition of IIIn[k; •]II x_IR. Next, from (1.4), and by duality of LP(R.) and Lq(JR.),
sup I In [k; f] s Ii Ln[.f](x) k(x)dx It l II l 00 sup J Ln[f](x)W(x )(1 + Ixl)-"lh(x)dx IIhIIL,JR)sl V II Ln[f](x) W(x)(1 + I x
[)-"I Ln
(1R)
=
IILn[f]IIZ'
263 So sup sup
IIn [k ;f]I =
sup
IIfIIx<3 IIkjly<1 IIfIIx<1
I Ln [f]IIZ = II L
nIIX_z.
Interchanging the sups on the left -hand side gives (2.7). (ii) Note that I,, [k ; f] is linear in both f and k. Firstly, if L in (2.9) is finite, III[k; f]I
II In [ k ;']IIx-.]R I f IIx < III.[-;-] IIY_,x *IIkIIYIIfIIx 5 5
LIIkIIYIIfIIx,
by (2.7) and (2.9). Then ( 2.8) follows. Conversely, suppose that ( 2.8) is true V f E X and V k E Y. Note that X , Y, Z are Banach spaces. Then by the uniform boundedness principle, we have for each fixed k E Y,
(2.11)
supIIIn [k;.]IIX n>1
R < 00.
But as the map that sends k E Y to In[k; •] E X* is linear in k, the uniform boundedness principle and (2.11) give
L := supIIIn[.;.]IIY-+X* < 00,
and then (2.7) yields (2.9). ■ Fix a function f : JR -4 JR that is Proof of the Sufficiency Part of Theorem 1.2 bounded and Riemann integrable in each finite interval , and satisfies (1.23). Then we can choose a function n as in Lemma 2.1 such that f E X. Let Y, Z be the spaces defined in Lemma 2 .1. Now Theorem 1.3 in 9 asserts (under the relevant conditions in (1.25 ) to (1.27)) that lim JILn[9 ] n -oo
911_, = Odg E X.
Then the uniform boundedness principle shows that (2.9) holds. We shall use (2.10) and convergence theorems for Gauss quadrature to prove (1.13) and (1.16) for the given f. First let U be a polynomial, of degree t, say, and
k1:=Uw=UW2. For n > Q, we have wjn[k1 ] = wjn[UW2] = AjnU( 2jn), 1 < j < n.
264 Recall that the Ajn are the Christoffel numbers in the Gauss quadrature (1.28). It is easy to establish this identity for wjn[k] using the fact that the Gauss quadrature has precision 2n - 1, that is (1.29) holds. Then we have
(
In[kl;f]
=
r n r = EA7n
(IUIf)(xjn).
In [Uw;f]
j=1
Now choose an entire function 00 G(x) = E 92jx2j, 92j ? 0, 7 ? 0, j=o such that
Cl < G(x)W3/2(x) < C2, X E R. (Such functions were constructed in, for example, Chapter 6 of 8). Then
I Uf I (x)/G(x) < Cl 1 I Uf I (x)W3/2(x) < C11IIflixIU(x)I(1 + Ixl)- "ri( x)W1 /2 ( x) --+ 0, Ixl -+ 00, as W(x) decreases faster than any polynomial (see Lemma 5.1(a), (b) in 7). Moreover,
G(x)w( x)dx < C2 r W1/2( x)dx < oo. f -00 -00 Then a classical theorem on convergence of Gauss quadrature (see p.94 of 5) implies that 00
lim II[kl;f]
noo
= I
(IUIf)(x) w(x)dx =I [Ikii; f]•
-00
Next, consider an arbitary k E Y. We have
II[k;f] - I [Ikl;f1l IIn[k;f] -II[Uw;f] =: Tin + T2n +
+
II[Uw
;f]-I[IUlw;f]I +II[IUIw; f ] - I[Ikl;f]
T3
1
t'
1
1
i
,W
r
1,
^
III
265 Here
I
Ti. = ^: (Iwjn[k]I - wjn[Uw]I)f(xin) j=1 n
IlfliX I:I wjn[k] - wjn[Uw]I
W- 1(xjn )( 1 +
I xjnl)- ai(xjn)
j=1 n
=
llf liX EI wjn[ k - Uw]IW - 1(xi n)( 1 + lxjnl)- a77(xjn) j=1
= ill ilx IIIn[k - Uw; ]IIx ->R
< LII f iixiik - UwIl y,
by (2.6) and (2. 10). Next, we proved above that lim Ten = 0.
n-co
Finally, 00 T3 = II[IUlw;f] - I[Ikl;f]I =
LL
(1_ lkl)(x)f
(x)dx
00
Ilfllx
J
IUw-ki( x)W-1(x)(1+ lxl)-'r7(x)dx
-00 ( < IIflixilUw - kliy I (1 + Ixl)-O,-otj x)I Lp(1R), by Holder 's inequality. Since n < 1, the estimates for T1n , Ten and T3 yield lim oplI[k ; f]-I [Ikl; f]I 5
IlfIIXIIUw -kIIY(L + II(1 +IxI)- a- A I L,( L.))-
Now for p < 4, (1.25 ) implies that (a+0)p>_ (&
+o)p>
1,
and for p > 4, since 1 - 4/p > 0, (1. 26-7) show that this inequality persists. So
(1 + Ixl)-a- ^IILy (111) < 00. Now (recall the definition (2.4) of (2.12)
IIUw - kIIY
=
11(U
II • IIY)
- kW-2)(x)W(x)(1 +
Ixi)I l L9(R).
266 Since W(x)('+IxI)a decreases faster than exp(-IxIQ), some a > 1 (see Lemma 5.1 of 7), we can find a polynomial U for which the right-hand side of (2.12) becomes arbitrarily small. See, for example, 2. Then (1.16) follows. The proof of (1.13) is very similar, but easier. ■ Fix an rt(x) as in Lemma 2.1. We Proof of the Necessity Part of Theorem 1.2 assume that (1.13) holds for every continuous f and measurable k satisfying (1.23) and (1.24) respectively. Then ( 2.8) follows, and by Lemma 2.1( ii), we have (2.9). That is,`df EX and n > 1, we have
Ln[f](x)W(x)(1 + I xI )_p 11 L,(]R)
=
IILn[ f
1MMZ < L 11 f ll x = L f(x)W(x)(1 + Ix1)a7 -1(x) 1L_(R).
This is exactly ( 3.30) in 9. There ( 3.30) is all that is used in the proof of the necessary conditions of Theorem 1.3 to show that 0 > l/p - & for all 1 < p < oo. Moreover, if we define log(' + x) if a = 1 (log x)* .- 1 l 1 if a#1, then it was shown at (3.33) in 9 that for p > 4, n)*n(1/s)(1-4/p) < 00. limsup rl(an)an/p-(&+o) (log n-.oo
Since this holds for every rt satisfying the hypotheses of Lemma 2.1 (no matter how slowly 77 decays at infinity), (1.26-7) follow. ■ Proof of Theorem 1.1 This is the special case Q(x) = 2 jxj' of Theorem 1.2. The necessary and sufficient conditions in Theorem 1.2 easily reduce to those in Theorem 1.1, since an = Cpn11 , n > 1. ■ In the proof of Theorem 1.3, we shall need more notation. Given g : ]R -+ ]R such that (gw)(x)xi E L1(]R), j > 0, we can define its orthonormal expansion in terms of {pj} 0. We denote the nth partial sum of this series by n-1
00
Sn[9](x) E bipi(x), bj f pj(t)9(t)w(t)dt, j > 0. j=o _00
Then S,,[g] admits the representation 00
Sn[9](x ) =
f
-00
Kn(x,
t)( 9w)(t)dt,
267 where 5 n-l (2.13)
Kn(x, t) ;- E Pj(x)Pj (t) = ryyn l [Pn
( x)Pn-1 (t) - Pn-l ( x)Pn(t) l
L
j=o
x
-
t
1
by the Christoffel-Darboux formula. We shall use the representation (2.14) wjn[k] = AjnSn[kw-'](xjn) = AjnSn[kW-2](xjn),
which is an easy consequence of the orthogonality of kw-1 - Sn[kw-1 ] to Pn_l, and the fact that the Gauss rule (1.28) has precision 2n - 1. We also need the Christoffel function n-1
(2.15)
An(W2, x) = 1/E p ( x), n > 1. j=o
Note that (2.16)
Ajn = An(W2,xjn)•
Lemma 2.2 Let 0 < o < 1. (i)Forn>2,
IISn[9]WIIL-(IxI
(2.17)
C C(logn)II9WIIL-(]P,).
Here C is independent of n and g. (ii)Forn>2,
(2.18) II{Sn[9]-g)WIIL-(Ixl
where En-1[9; W] is defined by (1.31). Proof
(i) Choose p E (o, 1). It was shown in Theorem 12.3(b) of 7 that 'fn-1
ryn
and in Corollary 1.4 of 7 that
( 2.19)
< Clan 1"2 , n> 1. IIPnW IIL- t < (I I_ Pan)
268 From (2 . 13), we see that
(2.20)
IKn ( x,t
)IW(x)W(t) < C3 /Ix -tI, IxI, Itl<pan, n>1.
For Ix I < a an, we split
00
Sn[ 9](x )W (x)
=
f
Kn( x,t)W(x ) g(t)W2(t)dt
-00
_{
J + J -{ r}Kn( x,t)W(x )g(t)W2 (t)dt =: 11 + I2 + I3,
A A JJ3
where J1
[-pan, pan ]/[x - n-2, x + n-2]; J2 := [-pan, pan ] r1 [x - n-2, x + n-2]; J3 :_ (-oo, -pan ) U (pan, oo). From (2.20), we see that
I'll
since an is of at most polynomial growth (see Lemma 5.2(b) of 7). Next, the Cauchy-Schwarz inequality shows that I Kn(x,t )IW(x)W(t) < {.\n'( W2,x)W2 (x)}-1/2 {Anl(W2,t)W2(t)}-1/2 C5n/an , x, t E 1R., by Theorem 1.1(b) in 7. Then
II2I -< II9WIIL- (]R) f C5n /an dt < II9WIIL- ( lR)2C5(nan) -1. J2
Finally, from (2.13) and ( 2.19), we have for Ixl < aan, Itl > pan,
W(X) Kn(x,t)IW(x)
<
C6 a''
/2
Ipn-1 ( t)I +
I pn(t)I <
C7a'',/
2 Ipn-1(t ) I +
Ipn(t)I
Ix - tl - Itl
Then
II3I < II9WIIL- (>R.)C7an/2 f Ipn-1(t)I + I pn(t)I W(t)dt ItI J3 n < II9WIIL-(]R)C7an /2
t)W2(t )dt}1/2 {f t-2dt}1/z{ fp'(
j=n-1
< C8JI9WIIL_(1R),
J3
J3
where we have used the Cauchy-Schwarz inequality, and orthonormality. Adding the estimates for Ij, j = 1, 2, 3, yields (2.17). (ii) This is an easy consequence of (i) and the fact S. [P] = P, P E Pn-i. ■ Proof of Theorem 1.3
We have by (2.14),
n bn = max win - Xjnk(xjn)W- 2(xjn ) W-1(xjn) x;„j
n
Ix maaxan Ain Sn[kW- 2](xjn ) - k(xjn )W-2(xjn) W-1(rjn)a an
< C1 max I Sn[kW- 2](xjn ) - k(xjn)W -2(xjn ) W(xjn)x;,, 1
Here we have used the upper bound for Ajn = An(W2,xjn) given in Theorem 1.1 in 7. By Lemma 2.2(ii), we obtain
bn < C2(log n)En_,[kW-2;W] = 0(1), by our hypothesis (1.33). This establishes (1.34). Next, if W(x) = exp (-2 IxI0), > 1, then it is known (see Theorem 1.2, pp.184-5 of') that uniformly for Isl < o, a
lim An1(W2,ans)W2(ans) n = n_oo n
/cp(s)'
Since up(s) is positive and continuous in [-o, o], we obtain AjnW- 2(xjn ) an = 1/up(xjn/an) + o(1), uniformly for Ixjnl < oan. In view of (1.32), we obtain (1.35) from (1.34). ■ References 1. P.J. Davis and P. Rabinowitz, Methods of Numerical Integration, Academic Press, New York, 1975. 2. Z. Ditzian and V. Totik, Moduli of Smoothness, Springer Series in Computational Mathematics, Vol. 9, Springer, Berlin, 1987. 3. D. Elliott and D.F. Paget, Product Integration Rules and Their Convergence, BIT 16 (1976), 32-40. 4. D. Elliott and D.F. Paget, The Convergence of Product Integration Rules, BIT 16 (1976), 32-40. 5. G. Freud, Orthogonal Polynomials, Pergamon Press/Akademiai Kiado, Budapest/Oxford, 1971.
270 6. A. Knopfmacher and D.S. Lubinsky, Mean Convergence of Lagrange Interpolation for Freud's Weights with Application to Product Integration Rules, J. Comp. Appl. Math. 17 (1987), 79-103. 7. A.L. Levin and D.S. Lubinsky, Christoffel Functions, Orthogonal Polynomials, and Nevai 's Conjecture for Freud Weights, Constructive Approximation 8 (1992), 461-533. 8. D.S. Lubinsky, Strong Asymptotics for Extremal Errors and Polynomials Associated with Erdos-Type Weights,Pitman Research Notes in Mathematics, Vol. 202 , Longman, Harlow, 1989. 9. D.S. Lubinsky and D.M. Matjila, Necessary and Sufficient Conditions for Mean Convergence of Lagrange Interpolation for Freud Weights, submitted to SIAM J. Math. Anal. 10. D.S. Lubinsky and E.B. Saff, Strong Asymptotics for Extremal Polynomials Associated with Weights on IR, Springer Lecture Notes in Mathematics, Vol. 1305 , Springer, Berlin, 1988. 11. D.S. Lubinsky and A. Sidi, Convergence of Product Integration Rules for Functions with Interior and Endpoint Singularities over Bounded and Unbounded Intervals, Math. Comp. 46 (1986), 297-313. 12. H.N. Mhaskar and E.B. Saff, Extremal Problems for Polynomials with Exponential Weights, Trans. Amer. Math. Soc. 285 (1984), 203-234. 13. H.N. Mhaskar and E. B. Saff, Where Does the Sup-Norm of a Weighted Polynomial Live?, Constructive Approximation 1 (1985), 71-91. 14. P. Nevai, Mean Convergence of Lagrange Interpolation, II, J. Approx. Theory 30 ( 1980 ), 263-276. 15. P. Nevai, Geza Freud, Orthogonal Polynomials and Christoffel Functions: A Case Study, J. Approx. Theory 48 (1986), 3-167. 16. I.H . Sloan and W.E. Smith, Properties of Interpolatory Product Integration Rules, SIAM J. Numer. Anal. 19 (1982), 427-442. 17. W.E. Smith and I.H. Sloan , Product Integration Rules Based on the Zeros of Jacobi Polynomials, SIAM J. Numer. Anal. 17 (1980), 1-13. 18. W.E. Smith, I.H. Sloan and A.H. Opie, Product Integration Rules Over Infinite Intervals I. Rules Based on the Zeros of Hermite Polynomials, Math. Comp . 40 (1983 ), 519-535.
WSSIAA 2(1993) pp. 271-284 ©World Scientific Publishing Company
AN e-UNIFORMLY CONVERGENT FINITE BOX METHOD FOR A SINGULARLY PERTURBED ADVECTION-DIFFUSION EQUATION JOHN J.H. MILLER Department of Mathematics, Trinity College, Dublin 2, Ireland SONG WANG Tritech, 26 Temple Lane, Dublin 2, Ireland ABSTRACT An exponentially fitted finite box method on uniform rectangular meshes for a singularly perturbed stationary advection-diffusion problem in two dimensions is presented . This method is analysed in a non-conforming finite element framework. Existence, uniqueness and stability of the approximate solution are proved with respect to an e-independent discrete energy norm . An error estimate is given which shows that the approximate solution converges e-uniformly to the exact solution like 0(h1/2) in this norm. The coefficient matrix of the associated linear system is an unsymmetric M-matrix which can be transformed into a symmetric and positive -definite Mmatrix. Numerical computations are described for several test problems . These validate the theoretical results and suggest that they may hold also for problems with non-constant coefficients
1. Introduction It is well known that solutions of singularly perturbed partial differential equations exhibit boundary and interior layers when the singular perturbation parameter c is small. Application of standard numerical methods, such as the central difference or linear finite element methods, to such problems often results in solutions with spurious oscillations if the mesh is not sufficiently fine or the mesh nodes are not specially distributed. Upwind methods can overcome this difficulty, but it is well known that they result in inaccurate solutions [2]. Much attention has been devoted in recent years to exponentially fitted methods for 1-dimensional problems. These methods retain the good stability properties of upwind methods without sacrificing accurary. In addition, c-uniform error estimates have been established for such methods in cases where no parabolic boundary layers are present. In [1] Allen and Southwell introduced an exponential fitting technique for a fluid dynamics problem in 2 dimensions. Independently Scharfetter and Gummel [10] developed a simpler exponential fitting technique for the numerical solution of the 1-dimensional semiconductor device equations. Extensions of the latter to 2 and 3 dimensions have been discussed by many authors and have been very successful in practice for the solution of problems with boundary and interior layers. However, mathematical understanding of the method in higher dimensions is very limited, especially in the context of c-uniform
271
272 convergence. Recently O'Riordan and Stynes [8] proved the global e-uniform convergence of an exponentially fitted Bubnov-Galerkin finite element method for a singularly perturbed problem. However, their analysis is based on an e-dependent energy norm. In [9] Risch gave a local e-dependent error estimate for a singularly perturbed problem based on triangular elements. In this paper we analyse an exponentially fitted finite box method on a uniform rectangular mesh for a singularly perturbed advection-diffusion equation using a discrete e-independent energy norm. In the following section we state the problem and give some estimates for the exact solution and its derivatives. In Section 3 we construct the finite box method using piecewise exponential trial functions and piecewise constant test functions on a rectanglar mesh. The coefficient matrix of the resulting linear system is an M-matrix. The method is analysed in Section 4. We prove that the stability of the method is independent of e. A global 0(h1/2) error estimate in a discrete e-independent energy norm is proved. This shows that the numerical solution obtained from the method is e-uniformly convergent to the interpolant of the exact solution in the piecewise exponential trial space. Numerical results are given in Section 5 in support of the theoretical results.
2. Statement of the Problem Consider a singularly perturbed advection-diffusion problem of the form -O•f+Gu=F on f]
(2.1)
f = eVu - au
(2.2)
Ulan
(2.3)
=0
where e > 0 is a parameter , f1 = (0, 1 ) 2 E IR2, 851 denotes the boundary of f1 , n denotes the unit outward normal vector on 851 , a = (a(1), a (2)) > (0, 0) on f1 and f = (f (1), f(2)) denotes the flux. We assume that a and G are such that 2a+G>0 on 5]
When a is small the problem is said to be advection-dominated. In this case regular boundary layers appear, the width of which is O(e). In what follows L2(S), L°°(S) and Wm,P(S) denote the usual Sobolev spaces with norms II • Ilo,s, II • Iioo,s and II • Iim,p,s, respectively, for any measurable open set S C R' (n = 1,2). The inner product on L2(S) and (L2(S))2 is denoted by (•,•)s and the kth order seminorm on Wm•P(S) by I • Ik,p,s. The Sobolev space Wm•2(S) is written Hm(S) with corresponding norm and kth order seminorm II - IIm,s and I • Ik,s respectively.
273 When S = f2, we omit the subscript S in the above notation. We put Ho (f2) = {v E H1(f2) : vlao = 0}. For any integer k > 0 let Pk(S) denote the space of polynomials of degree at most k and CI (S) denote the space of continuous functions, all of whose derivatives up to and including the kth order are continuous on S. For any a E (0, 1) let C'02(S) be the subset of functions v E Ck(S) which satisfy the following Holder condition max sup IPI
ID'v(x) - D,6v(x')I < K
Ix - x'Ia/a
for some positive constant K, where DO denotes the usual partial derivative corresponding to the multi-index /f and x = (z, y). We use I • I to denote absolute value, Euclidean length, or area depending on the context. First order partial derivatives of a function u are denoted by uy and uy.
The variational problem corresponding to (2.1-3) is Problem 2.1: Find u E Ho (f)) such that for all v E Ho (f2) (eVu - au , Vv) + (Gu, v) = (F, v) (2.5)
Using standard arguments based on the Lax-Milgram lemma it is easy to show that when (2.4) is satisfied there exists a unique solution to Problem 2.1. Before discussing the numerical solution to Problem 2.1, we impose the following additional smoothness and compatability conditions on the data in (2.1-2) a(l) and a(2 ) are positive constants (2.6) F,G E C°'a(f2) n C2'a(fl) (2.7) F(0,0) = F(1,0) = F(0,1) = F(1,1) = 0 (2.8)
Under the conditions (2.6-8) the following stability theorem can be proved. Theorem 2.1. Let u be the solution to Problem 2.1 and f the corresponding flux. If (2.6-8) hold, then there exists a positive constant C, independent of u and c, such that for all xEfl
Jul
< C, lux1 < C( 1 + ems(:-1)), luyl < C(1 + Ie (y -1)) on 12
f(l)l + lfy2)1 < C on
f2
(2.9) (2.10)
Proof. For the proof for this theorem we refer to [8]. ❑ We comment that estimates for ux and uy similar to those in (2.9) can also be obtained if instead of (2.6) a(l) and a(2) are negative constants.
274 3. The Finite Box Method To discuss the method we first define some meshes on (I. Let Q denote the following family of uniform rectangular meshes Q = {Qh(l),h(2) : 0 < h(l), h(2) < ho,
ho < 1}
where Qh(i) h(2) is a mesh on ft consisting of equal rectangles with sides parallel to the coordinate axes of length h(l) in the x-direction and h(2) in the y-direction. Definition 3.1: The family of meshes Q is regular if there exist constants ryl > 0 and 12 > 0, independent of h(l) and h(2), such that for all h(l), h(2) E (0,1] h(l) 'Y1 < h(2) < 1/2
(3.1)
We assume hereafter that Q is regular. In this case Qh(1),h(2) can be denoted simply by Qh, where h = max{h(l), h(2)} is called the parameter of the mesh Qh.
For each Qh E Q, let Xh = {xi}N be the set of all vertices of Qh and Eh = {ei}M the set of all edges of Qh. Without loss of generality we assume that the nodes in Xh and the edges in Eh are numbered such that Xh = {xi}N , and Eh = {ei}M, are respectively the set of nodes in X not on an and the set of edges in Eh not on an. Definition 3.2: The Dirichlet tessellation Dh corresponding to the mesh Qh is defined by Dh = {di}N where
di = {x : Ix - xil < Ix - x9 I, x' E Xh,j # i} (3.2) for all xi E Xh (cf. [3]) .
We remark that for each xi E Xh, the boundary adi of the tile di is also a rectangle. The Dirichlet tessellation Dh is a rectanglular mesh dual to Qh. We let D = {Dh 0 < h < ho } be the family of all such meshes . The subset of Dh corresponding to Xh is denoted by Dh. A second mesh dual to Qh is defined as follows . With each edge ek E Eh we associate the open polygon bk having as its vertices the two end-points of ek and the circumcentres of the rectangles having ek as a common edge. It is clear that bk consists of either one or two triangles depending on whether the edge is on the boundary or not and that IbkI > 0. The set of all these bk forms a polygonal mesh on fZ which is denoted by Bh. This is also a dual mesh to Qh and the number of its elements is equal to the number of edges in Eh . We put B = { Bh : 0 < h < ho}. Corresponding to the two meshes Dh and Qh, we now construct a finite -dimensional test space Th C L2(l) and a finite- dimensional solution space Sh c L2(l).
t 1
275 To construct the test space Th we define a set of piecewise constant basis functions (i = 1, 2, ... , N') corresponding to the mesh Dh as follows: 1 on di 0 otherwise We then define Th = span{^,}N'. To construct the solution space Sh, for each edge ei,1 E Eh connecting the neighbouring nodes xi and x9, we define an exponential function Oi,' on ei,j by d 4^ dei (Edei ,i a+acia) = 0
(3.3)
Ma(xi) = 1, Oi, i(x1) = 0
where ei,j denotes the unit vector from xi to x1 and ai,9 = a • ei,1. From this definition it is clear that ai,1 = -a1,i and ai,1 = -aj,i . We also note that jai,jI is equal to either a( ' ) or a(2) since each edge is parallel to one of the axes . We then extend the domain of 4i,i to bi,; by defining it to be constant along perpendiculars to ei ,;. Using this exponential function we can now define a basis function for Th on fl by 4ia on biaif jEIi 0 otherwise where bi,1 is the element of Bh containing ei,j and Ii= {j:ei,EEh} (3.4)
is the index set of all neighbour nodes of xi. We define the solution space Sh to be Sh = span{¢i}N'. Obviously Sh C L2(12). For any sufficiently smooth function u it is easy to show that the Sh-interpolant uI of u satisfies d duj (c- - ai,1ul ) = 0 on eia deia dei,1
uI(xi) = u(xi), uI(x1) = u(x1) for ei,1 E Eh. Solving the above we get duj I E (^) u 2 (-,.ijej,.ij u x u B ( i) - B ) ( i)) ( 3.5 ) dei,i Tei,iI s e
= f+,, -s--ai
where B(z) denotes the Bernoulli function defined by _at
B(x) =
11
x0
x=0
(3.6)
276 Since the mapping from W1"°°(ei ,j) to Po(ei,j) defined by f • ei,j I--p fi,j preserves constants, we have I
I f . ei,j - fi,j I l.,e;,i <_
Ch lf
- ei
,j I l,oo ,e;,i
Since ei, j is parallel to one of the axes, from (2.10) we have IIf - ei,j - fi, jll .,,,j <
Ch
(3.7)
where C is a positive constant, independent of h, u and E.
We introduce the mass lumping operator P such that N
P(u)(x)
u( Xi)Si(x)
_ i=1
For any d E Dh, we use vhld to denote the restriction of vh to d and yo(vld) to denote the continuous extension of vld to ad. Using the two finite dimensional spaces Sh and Th we now define the following discrete Petrov-Galerkin problem Problem 3.1: Find Uh E Sh such that for all vh E Th a(uh, Vh ) + (P(Guh), vh) = (F, vh) (3.9) where P is an approximation to F and a(., •) denotes the bilinear form on Sh x Th defined
by a(uh, Vh ) - E ,
f ad dED6 ad
(EDuh - alih ) - n'yo(vhId) ds (3.10)
Let uh = EN,1 uilt , where {ui}N' is a set of constants. Substituting this into (3.9) and taking vh = £j, we get, for j = 1, 2, ... , N', N
(EVuh - auh) • nds + Gju1ldj l = f Fdx i-1
8di
(3.11)
di
where Gj = G(xj). Let ljk = 8dj n 8dk denote the intersection of the boundaries of di and dk. Clearly IljkI = 2e .k if k E Ij and IljkI = 0 otherwise . It is also easy to verify that 8dj = Uker; lj,k for all di E D. We thus have from (3.11) for j = 1,2,...,N' f \ duh
- , E dn - auh)Ibikds + Gjujldjl = f Pdx di
kEI i,k
i I 1 1
-
I
.4
tn , 11,
11
(3.12)
277 Using (3.5) we then obtain the system of N equations E(B(_Qi,kje^,kj)u1-B( d'.kEei.kj )uk) kEI,
'e^;kl^+G,uiI dil = f Fdx ` / j
(3.13)
The coefficient matrix of (3.13) is only structurally symmetric, i.e. a non-zero element of the upper triangular part of the matrix implies that the corresponding element in the lower triangular part is nonzero and vice versa. It is easy to show that the matrix is irreducible and diagonally dominant with respect to its columns and in addition it has positive diagonal elements and non-positive off-diagonal elements. Thus the matrix is an M-matrix (cf. [11]). This matrix can be transformed into a symmetric and positive definite M-matrix by pre-multiplication by a diagonal matrix with positive entries. For details we refer to [7]. 4. e-Uniform Convergence In the previous section we showed that the finite box method gives rise to an unsymmetric M-matrix and that this matrix can be transformed to a symmetric and positivedefinite M-matrix by pre-multiplication by a diagonal matrix with positive entries. From elementary linear algebra this implies that the solution to Problem 3.1 exists and is unique. In this section we show that the solution is stable with respect to a discrete e-independent energy norm by verifying that the bilinear form in (2.11) is coercive. We also show that the solution converges E-uniformly to the Sh-interpolant uI of u in the same norm. We start with the following lemma. Lemma 4.1. The mass lumping operator defined in (3.8) is surjective from Sh to Th. Proof. The proof is trivial and is omitted here. Let b(., •) be the bilinear form on Sh x Sh defined by b(uh,vh) = a ( uh,P(vh )) + (P(Guh),P(vh))
0
(4.1)
We consider the following Bubnov-Galerkin problem: Problem 4.1. Find uh E Sh such that for all eh E Sh b(uh,vh) = (F,P(vh))
(4.2)
We say that Problem 4.1 is equivalent to Problem 3.1 if any solution to Problem 4.1 is also a solution to Problem 3.1, and vice versa.
278 Lemma 4.2. Problem 4.1 is equivalent to Problem 3.1. Proof. The result is obvious since the operator P is surjective from Sh to Th by Lemma 4.1. ❑ We introduce the functional I I 'IIh on Sh defined, for any Uh E Sh, by Uj s
IIuhIIh =
( :,IEER
^Z
( 4 . 3)
I 6 i,9 I
We have Lemma 4.3. The functional I I ' IIh defined in (4.3) is a norm on Sh. Proof. The result is obvious since for any uh E Sh, uh vanishes on an. ❑ Letting uh = F-N'1 uici E Sh, we define the E-independent discrete energy norm I I IIQi on Sh by N'
II uh I IQ ,,= h lluhllh + F Gju?I di l 2
2
(4.4)
i=1
The following theorem shows the coercivity of the bilinear form (•, •) with respect to this norm. It also implies that the solution to Problem 4.1 is stable with respect to this norm. Theorem 4.4. There exists a positive constant C, independent of h, c and uh such that for all uh E Sh we have b(uh, uh) ? CI Iuhlleh (4.5) Proof. We let C denote a generic positive constant, independent of h, c and uh. If uh = 0, then (4.5) holds. Assume now that uh E Sh and uh # 0. Let ai,1 = E/ Iei,1l and pi,1 = ai,1/vi,1. Using the same method as in the derivation of (3.13) and following the argument in [7] we have a(uh, P ( uh )) _
°a2' B(pi,1) ( 1 + ep',i) (u1 -
ui)2Ili ,1I
+ a2'
(u2'
- u9)
Il ,1I i
e:i1EE,
(4.7) Since the mesh is uniform in both the x and y directions, we know that the coefficients of the last term in (4.7) are equal to either a('2 c2> or a( '2 (1). Since uh = 0 on an, if we sum the last term in (4.7) along the edges on each mesh line we have E
ai,1( u?
- u')
Il ,i I = 0
(4.8)
i
e:, 1EEh
f
I
I
I
*,t
y.
1
279 Therefore from (4.7) and (4. 8) we have Q2^ B(Pi, i)(1 + e°i ^ ) (ui - ui ) 2lli,il
a(uh, P(uh)) = e;, j EEA
_ a2i e ,jEEA
l
(e°i . + 1 )(ui - ui )2 Ili,i
> C min{ai'i, ai2i}h >2 ej j EEA
>
ui (
-
ui 2
I ei,.,
Chlluhllh
Ibi,i I (4.9)
In (4.9) we used the relation Ili i = 2G 'j . Finally (4.5) follows from (4.1), (4.4) and ❑ (4.9). Before considering the error estimate we state the following lemma. Lemma 4.5. For any Uh E Sh, there is a constant C > 0, independent of h, c and uh, such that
IIP(uh)Ilo < ClluhIIh
(4.10)
Proof. The lemma has been proved in [6, Lemma 3.4] for the case of triangular meshes. The mesh used here can be regarded as a triangular mesh by dividing each rectangle into two triangles . In this case the Dirichlet tessellation remains unchanged . The lemma ❑ follows therefore from the corresponding result for triangular meshes. The following theorem establishes the c -uniform convergence with respect to the c-independent discrete energy norm of the approximate solution when c is sufficiently small. Theorem 4.2. Let u and uh be the solutions to Problems 3.1 and 4.1 respectively, f be the exact flux and ur be the Sh- interpolant of u. Assume that c is sufficiently small so that the widths of the boundary layers are smaller than min{h(')/2, h(2)/2}. Then, there exists a constant C > 0, independent of h, c and u, such that
Il ul - uhIIQ,, < Ch 112(1 + h'1 IIF - Fllo))
(4.11)
Proof. Let C be a generic positive constant,independent of h, c and u. For any vh E Sh, multiplying (2.1) by P(vh) and integrating by parts, we get a(u,P(vh)) + (Gu, P(vh)) = (F, P(vh))
280 From (3.9) and the above we have a(uh - uI,P(vh))+(P(uh) - P(Guj),P(vh)) = a(u - uI,P(vh)) + (Gu - P(GuI), P(vh)) + (F - P, P(vh))
(4.12)
Since P(Gui) = P(Gu), by the definition of the bilinear form b(., •) and the CauchySchwarz inequality, we have from (4.12)
Ib( uh - ul,Vh )I
a(u - uJ, P(vh))l + I (Gu - P(Gu), P( vh ))I + IIF - P IIOIIP( vh ) Ilo = R1 + R2 + IIF - PIIoIIP(vh) IIo (4.13)
For the first term on the right side of (4.13) we have R1 = Y, f (,V(u - uI) - a(u - uj)) • nP(vh)ds dEDti ad
lv
j
-
vi
lf
If ' e i,j - fi,ilds (4.14)
ei iEE,,
where fi, j is as defined in (3.5). Noticing that each ei, j is parallel to one of the axes and li, j 1 ei, j, using Taylor's expansion we have
<_ Clli , jI(hmax{ sup If^ll(X)I, sup If.2)(X)I } +IIf'ei j- fi ,ill_ ,ej,j) I f •e i 'j -fi, jl xEl,,j fl[,j xEi,,j (4.15) Since a is sufficiently small so that the widths of the boundary layers are smaller than min{h(') / 2,h(2)/2} , it follows that the line segment li,j corresponding to any ei,i E Eh is outside the boundary layers . Therefore , we have max { sup I fyl)(X)I, sup xEi,j xEh,j
I f.(2) (X) I I < C
Substituting this upper bound and (3.7) into (4.15) and putting the result into (4.14) we obtain Ri < Ch
v•-
EI ' e,,jEEA
vil2lbi'1I lei.jl
< Chllvhllh (4.16)
In the above we used the Cauchy-Schwarz inequality. We now consider the term R2. Using the Taylor's expansion we have R2 = vi f (Cu - Giui)dxdyl i=1
d.
1<s
Ch max
Ivil
i_1
d;
(I (Gu) yI + I (Gu)vI)dxdy
<1C1
281 Using (2.9) we have I I
1 + e-1e°y (y- 1)/ f)dxdy < C I Iu = I l o, 1 < C J 0 0(
A similar bound holds for IIu1,IIo,i. Substituting the above into (4.17) and using (2.7) and (2.9) we obtain R2 < Ch max I vi l 1
Substituting (4.16) and the above into (4.13) we get
b(uh - uI, Vh) < C(hIIVhIIh + h Imax
Ivi
l + IIF - FIIOI IP(Vh)II0)
(4.18)
Since the system matrix of (3.13) is an M-matrix, it satisfies the discrete maximum principle. Thus, using (2.9) we have max lui - ui (xi)I < C 1
Letting vh = uh - uI and using (4.10), (4.4) and the above inequality we obtain from (4.18)
Iluh - uIlIQh < C [(hl/2 + h-1/2IIF - FIIO)IIuh - UIIIQa + h] This is a quadratic inequality for Iluh - ujIIq,. Solving this inequality we finally obtain ❑ (4.11), which completes the proof of the theorem. We remark that Theorem 4.2 shows that the computed solution converges c-uniformly to the Sh-interpolant of the exact solution in the c-independent discrete energy norm. Previous results are based on energy norms depending on a (cf. for example [8]). Finally we remark that Theorem 4.2 holds also for the case that instead of (2.6) a(1) and a(2) are negative constants. 5. Numerical Results In this section we present some numerical results in support of the £-uniform convergence theory established in the previous section . All the computations were carried out in double precision on an FPS M64/330 computer. Consider the set of mesh parameters {hk}i where hk = 2-k-1. Let {Qhk}i denote the corresponding set of uniform restangular meshes and {uh,k } i the set of numerical solutions on these meshes obtained with the exponentially fitted finite box method. Let {eI}i = { 1.0, 0.5, 0.1, 0.05, 0.01, 0.005, 0.002, 0.0009}. Using the double mesh method (see [4] for details) we can determine a computed rate of convergence p, for
282 each ei (1 = 1, • • • , 8). The computed rate of c-uniform convergence is then taken to be p = mini<,<8p1,. Analogously we can evaluate the computed rates or convergence in the L2- and L°°-norms. The test problems are the same as those in [5]. Test Problem 1: cV2u + guy +
uy = -100xy ( 1 - x) (1 - y) in fl = (0,1)2
U la n
=0
When c is small this problem has regular boundary layers on the edges {(x,0) : 0 < x < 1} and {(0, y) : 0 < y < 1}. The coefficients of the differential equation are constant. Clearly conditions (2.4) and (2.6-8) are fulfilled and therefore the theoretical results on c-uniform convergence proved in Section 4 are valid. The computed rates of convergence in different norms are listed in Table 5.1 for different values of c. The bottom line of the table gives the computed rates of c-uniform convergence. From the table we see that in the energy norm II • I1, this is 0.684, while the theoretical rate is 0.5. The computed rates of c-uniform convergence in the L°° and L2 norms are respectively 0.411 and 0.562. C
ll•IIQ"k
11 - 11.
11-Ilo
1.0000
2.427
1.963
1.969
0.5000
2.412
1.927
1.960
0.1000
2.050
1.581
1.622
0.0500
1.644
1.338
0.952
0.0100
0.818
0.544
0.671
0.0050
0.709
0.558
0.583
0.0020
0.684
0.411
0.562
0.0009
0.684
0.411
0.562
Uniform rate
0.684
0.411
0.576271
Table 5.1: Computed rates of convergence in various norms for Test Problem 1 Test Problem 2: eV2u + (1 + x2)ux + (1 + y2)Uy = -100xy(1 - x)(1 - y) in fl = (0,1)2
Ulan = 0 When a is small the solution of this problem also has regular layers on the edges {(x, 0) : 0 < x < 1} and {(0, y) : 0 < y < 1}. In this equation the coefficients of the first order
K^K derivatives are non-constant. When the equation is written in the same conservation form as (2.1), the coefficient of the zero order term is also non-constant and so condition (2.6) is violated. We have not considered the theoretical rate of the c-uniform convergence for such cases. However, from Table 5.2 it is seen that the computed rate of E-uniform convergence in the discrete energy norm is still 0.516, which is in agreement with the theoretical result for problems with constant coefficients. We conjecture that the theoretical results of the previous section also hold for a wide class of non-constant coefficient problems.
E
II - IIQ""
II•IIo
II•IIo
1.0000
2.170
1.947
1.955
0.5000
2.121
1.887
1.906
0.1000
2.047
1.853
1.743 1.444
0.0500
1.719
1.470
0.0100
0.705
0.480
0.580
0.0050
0.544
0.286
0.445
0.0020
0.516
0.251
0.421
0.0009
0.516
0.251
0.421
Uniform rate
0.516
0.251
0.421
Table 5.2: Computed rates of convergence in various norms for Test Problem 2 6. Conclusion In this paper we analysed an exponentially fitted finite box method based on uniform rectangular meshes for a two dimensional singularly perturbed stationary advection-diffusion problem. The method is based on a non-conforming finite element formulation. Existence, uniqueness and stability were proved with respect to an c-independent discrete energy norm. An error estimate was given which shows that the approximate solution converges E-uniformly to the exact solution like 0(h1/2) in this norm. The coefficient matrix of the associated linear system is an unsymmetric M-matrix which can be transformed into a symmetric and positive-definite M-matrix. Numerical results were presented which support the theoretical results and which suggest that they may be valid also for problems with non-constant coefficients. Acknowledgment : The authors wish to thank Dr E O'Riordan for some useful suggestions on the proof of Theorem 4.4.
284 References 1. D.N.de G. Allen and R.V. Southwell, Relaxation methods applied to determine the motion, in two dimensions, of a viscous fluid past a fixed cylinder, Quart. J. Meth. and Appl. Math., VIII (1955) 129-145 2. A. Brandt and I. Yavneh, Inaccuracy of first-order upwind difference scheme for some recirculating flows, J. Comput. Phys., 93 (1991) 128-143 3. G.L. Dirichlet, Uber die Reduction der positiven quadratischen Formen mit drei unbestimmten ganzen Zahlen, J. Reine Angew. Math., 40, No.3 (1850) 209-227 4. E.P. Doolan, J.J.H. Miller and W.H.A. Schilders Uniform Numerical Methods for Problems with Initial and Boundary Layers, Boole Press, Dublin (1980) 5. A.F. Hegarty, E. O'Riordan and M. Stynes , A comparison of uniformly convergent difference schemes for two-dimensional convection-diffusion problems, submitted 6. J.J.H. Miller and S. Wang, A triangular mixed finite element method for the stationary semiconductor device equations, M2AN, 25 (1991) 441-463 7. J.J.H. Miller and S. Wang, A new non-conforming Petrov-Galerkin finite element method with triangular elements for an advection-diffusion equation, submitted 8. E. O'Riordan and M. Stynes, A global uniformly convergent finite element method for a singularly perturbed elliptic problem in two dimensions, Math. Comp., 57 (1991) 47-62 9. U. Risch, An upwind finite element method for singularly perturbed elliptic problems and local estimates in the L°°-norm, M2AN, 24 (1990) 235-264 10. D.L. Scharfetter and H.K. Gummel, Large- signal analysis of a silicon read diode oscillator, IEEE Trans. Elec. Dev., ED-16 (1969) 64-77 11. R.S. Varga Matrix Iterative Analysis, Prentice-Hall, Englewood Cliffs, New Jersey (1962)
WSSIAA 2( 1993) pp. 285-299 ©World Scientific Publishing Company
UNIFORM CONVERGENCE ESTIMATES FOR A COLLOCATION AND A DISCRETE COLLOCATION METHOD FOR
THE GENERALIZED AIRFOIL EQUATION
GIOVANNI MONEGATO Dipartimento di Matematica , Politecnico di Torino Corso Duca degli Abruzzi 24, 10129 Torino, Italia and
SIEGFRIED PROSSDORF Institut fur Angewandte Analysis and Stochastik Hausvogteiplatz 5-7, D-0-1086 Berlin, Germany
ABSTRACT We consider the well known generalized airfoil integral equation, arising from the two-dimensional oscillating airfoil in a wind tunnel in subsonic flow. In particular we examine the degree of smoothness of its solution and derive uniform convergence estimates for a polynomial collocation and a corresponding discrete collocation method.
1. Introduction Several two-dimensional problems of Mathematical Physics, particularly in the area of aerodynamics and elasticity3,'1,17, lead to singular integral equations of the form a(y)u(y) + b( y)
1
x (xydx+ f 1 k( x, y)u(x)dx = f(y),
-1 < y < 1, (1)
where the, symbol f means that the integral is defined in the Cauchy principal value sense. The theory of such type of equations is well established under very general assumptions 14.17. Moreover, their numerical solution has been studied by several authors4, 5,7,8,9 , 10,13 ,14,19,21; stability and convergence results are available for methods such as collocation, Galerkin and quadratures. However, until very recently, the only known (weighted) uniform convergence results for collocation and quadrature methods required the kernel k(x, y) to be at least Holder continuous on [-1,1]. In 1992 two quadrature methods have been proposed' 13 to solve Eq.1 when the kernel is weakly singular. Under some assumptions on the coefficients a(y) and b(y), they allow to obtain weighted uniform and uniform convergence estimates. However, while the first method can be applied only if the kernel k(x, y) is exactly
285
286 of the form log Ix - yI, the second gives a very low rate of convergence. In this paper we consider the so-called generalized airfoil equation, arising from the two-dimensional oscillating airfoil in a wind tunnel in subsonic flow. This equation is a particular case of Eq.1, with a(y) _- 0, b(y) - 1. Its kernel k(x, y) can be represented as follows6: k(x,y) = ki(x - y)logIx - yI +k2(-- y), (2) where ki(t) and k2(t) are both entire function.
To solve this equation we can apply any of the numerical methods proposed for Eq.1, except for that in 1. However, when the input function f(y) is smooth, and the solution u(x) is assumed3 to be of the form (1 - x)(1 + x )v(x), it appears more efficient to approximate the new function v(x) by an algebraic polynomial. In this case weighted L2-convergence results have been obtained" 7 for corresponding Galerkin and collocation methods. In this paper we will consider the latter collocation method and a discrete version of it. After deriving a result concerning the smoothness of v(x), we will obtain uniform convergence estimates for both the numerical approaches.
2. Preliminary Results Let us consider the equation 1 ri U X
) dx + 1
ix - y +
J
i ki(x,y) logI- - ylu(x)dx
x Ji
k2(x, y)u(x)dx = f(y),
-1 < y < 1,
(3)
f E C"'"[-1, 1],
(4)
with k1, k2 E Cm ,"([- 1,1]2) and
in the space LW := {u
x2)2, 1u (x)12dx < oo}
equipped with the norm hull _(f 1(1 -x2 )2 lu(x)l2dx)2.
The symbol C-1 above denotes the space of functions whose m-derivatives satisfy a Holder condition with exponent µ, 0 < p < 1, in their domain of definition (see, e.g., 14). Later we will also consider the space C":= CO,,'. We recall that the finite Hilbert transform Hu(y) := a 1 i x (xydx
287 maps L2 into itself and 1
]2. IIHuII2 = IIuII2 - 1 [f 1 u(x)dx
Sufficient conditions for k1 and k2 ensuring the unique solvability of Eq.3 in L2 are given, e.g., in 22. Under the assumptions above we have Hu+K1u+K2u = f,
or, equivalently, Hu+K1u=F,,,
(5)
with Klu(y) J k1( x, y)loglx - yIu(x)dx,
f1 K2u(y) :=
J
k2(x, y)u(x)dx E C-"[-1, 1],
1
and F„ f - K2u E Cm'"[-r1,1], provided u E L2 is a solution of Eq.3. Furthermore,
[Klu(y)]' =
= k,(y,y
1 ak,(x , y) log Ix - ylu(x)dx +- kl( 1 x, y)x(xydx 1J 1 J
)Hu(y) + 11 .9k,(-, y) loglx
- ylu (x)dx + J 1 ks(x , y)u(x)dx
J 1
with k3(x, y)
-kl(x , y) - ki
y) E 1]2).
Lemma 1. Let U E L2 and (1 9(y) J kl( x,y)loglx - ylu(x)dx, with k1 E C"([-1, 1]2), 0 <,u < 1. Then 1 9ECA[-1, 1], A=min(µ,4).
Moreover,
II9IIca < constllull. Proof. For yl,y2 E [-1,1] we have /1 9(yl) - 9 (y2) = J [k1 (x,y1) - k1 (x,y2)]loglx - yllu(x)dx 1
(6)
288
y2 )[log Ix - yi I - log Ix - y2I]u(x )dx =:I,+ 12. + f kl(x, 1 1 Using the well known formula22 r
where x =
cos.,
1 tz )
(log
Ix - tI ) zdt =
C
a[ z (2 12 +
b)2
+ (log 2)21,
and Schwarz's inequality, we obtain IIII<-clyl-Y2I' f Iloglx - yllu(x)ldx
:5 C'Iy1 -y2I°IIuII.
(7)
Setting h(x):= k1 ( x,y2)u (x) (EL 2)
and G(y) 1 1 log Ix - yl h(x)dx, 1
we have G'(y) = f 1 h(x) dx E LW lx - y
and
vl I2 = G(yi) - G(y2) = f G'(x)dx. J.
Hence, by applying Schwarz's inequality again, we get
II2I2 < IIG'III f 2 (1- x2)- 12 dxl. v
(8)
Because of the symmetry of the function (1 - x2)-l, it is sufficient to consider the case when yl, y2> -z. Thus we have
If
1(1 -x2)
y]
_
2dxl
=l f( 1 +x)-2(1-x)- '2dxl
v
(1-x) -Idxl Y
yz2
Y2
= 2v'21(1 - y1)' - ( l - yz)'I < 2y 2 y1 - Y2 1 '
(9)
In the latter estimate we have used the inequality Is '2 - t'2'1 < Is - t12'. Combining the inequalities ( 7),(8) and ( 9), the assertion of the lemma follows. Now from Eqs.5, 6 and Lemma 1 it follows that Hu E CA[-1,11, with A = min(µ, 4). Moreover, by successively applying formulas (5) and (6) we obtain 1 ;
(Hu)(') = F (,) - G, -f1 8 ki( x, y) log Ix yl u(x)dx 7r 8y*
l
f
289 t-i
-£> ; (s,)(//u)«>,
i=l
m,
1=0
with Gi,aj e C[—1,1]. This latter result, together with Lemma 1, leads to L e m m a 2. Let ue Li be the solution of Eq.S under the assumptions (4). Then we have Klu=Fu-HueCm'x[-l,\], A = min(/i,-).
If we consider the subspace of Z,£: Cw:=wC[-l,l],
*"(*) = Y T ^ f .
equipped with the norm ||u||c_ := ll«/«'l|oo, the following specification of Lemma 1 is true. L e m m a 3. Let u e Cw and g be defined as in Lemma 1. Then ?eC[-l,l], ■with v = ft if \i < % and v Proof. Following the proof estimate for /2 = G(yi)-G{yi). of real valued functions
||y||c- < elHIc,
\ - i if pi>.\, where e > 0 is arbitrarily small. of Lemma 1, we only need to give the corresponding To this end let us introduce the space IT, (1 < p < oo) with finite norm
\\
G'(x)dx,
Jy2
we obtain \h\<\\G,\\Zr\[\l-^)-Ux\'71 <21^r1\\G'\\lr\yi-y2\'^. Since p > 1 can be chosen arbitrarily large, the proof is complete.
290 In what follows we assume that the hypothesis (4) are satisfied and3 u E C,,, is a solution of Eq.5. Then K1u = F„ - Hu E C-"[-l, 1], (10) where v is defined as in Lemma 3. This follows from Lemma 3 and the calculation performed to prove Lemma 2.
Hence, the solution u satisfies the equation Hu=g with g = F„ - K1u E Cm,°[-1, 1]. Since C. C L2 and the operator 1/W H in the space L2 I/W has the inverse H-' = FI defined by Ho:= -wH(¢/w),
and since H(1/w) = -1, the solution u/can be represented in the form g(x) ^(y) dx + g u(y)/w(y) = -a J 1 1 (y)• 1 W (X) X
(11)
Remark 4. If 0 E C""°[-1,1], v > 2, then H (4/w) E C'" 1] with c > 0 arbitrarily small (see Theorem A.5 in 5). Since in (10 ) v < 1, we are not sure if the operator HK1 maps the space C. into itself. By this reason we need to take into consideration some weighted Sobolev-like spaces for our numerical analysis. These spaces have been used and studied in 21,1
3. The weighted spaces
LP,
Let p be the Jacobi weight function P(y)=(1-y)'(1+y)',
yE(-1,1), a,(3>-1,
and LP = LP(-1, 1) be the Hilbert space of all square integrable functions on (-1, 1) with respect to the weight p(y), equipped with the scalar product _ (u,v)v -_ J u (y) v(y)P( y)dy. 1
(12)
Let P. p(° 0)(y), n = 0 ,1, ..., denote the Jacobi polynomials of degree n, orthonormal with respect to the scalar product (12) and with positive leading coefficients. For any real number s > 0 we define the subspace LP2" of as follows: LP
LP,, :_ {u E LP : ( IuIIv,^ < oo}, where llullv,, = (u, u)v,, and (u,v)P,,
E (I + 7)2,( u,Pi)P ( v,Pj)P. 1=0
(13)
291 The following properties are well known1,21: Property 5: (a) Lo, (s > 0 ) is a Halbert space, and L2,0 = Lo. (b) Lp, C Lp and Ilullp,t <- IIullp,. for u E Lo, and all 0 < t < s. ( e) (Pn, Pm ) p,. = (1 + n)2,6nm , S > 0, n, m = 0, 1'...
(6n,,, stands for the Kronecker symbol). (d) L2" is compactly imbedded in L2p,t for all 0 < t < s. Theorem 6 (see Theorem 2.5 in 1).
Let u E LP,,,
s > 2. Then , for arbitrary
C, 0<e<1, C[-1,1] if a,0<-2 uE C[-1+ e, 1] if a<-113>-^ C[-1 + E, 1 - E] if a,,3 > -12'
In the following we will restrict ourselves to the weights w(y) = (1-y) ' (1+y)- i and p(y) = c(y) := 1/w(y) = (1- y)-2 (1 + y)'2. Furthermore, we define C[-1, 1] := 1+y C[-1, 1], with Ilullj = II + yu(y)II=.
For the subsequent error analysis we need the following specification of Theorem 5. Theorem 7 . Let
u E Lo„
s > 2. Then u c c[- 1, 1] and IIuI1& <- cllullo,..
Proof. Setting u(y) = 1 + yu(y), Pn(y) = 1 + yPn(y), p(y) = (1 - 1y2) ' we obviously have (u,Pn)p = ( u,Pn)o,
(Pm,Pn )5 = (Pm,Pn ), = 6mn-
Moreover (see Theorem 7.32.2 in 24),
lPn(y)I= 1+yiPn(y)I<e 1+y
I(u,Pi)µPi(y)I <- cl(u,Pi)ul = cl(u,pj)ol p;)"l (I+ j)2,1(u,PS o12 E(1+j)-2.
292 Remark 8. In an analogous way we can give a more precise characterization of the behaviour of the function u E Lo,,, s > 2, in all other cases mentioned in Theorem 6. Theorem 9. Let U E C- " [-l, 1], m E No, 0 < A < 1, and m + A - s > 0. Then L
for s < m + A - z, IaI = 101 = 2, and IIuIIP,.* < cllullc ,A Proof. Since it is well known that I(u,Pn)vl 5 n- m"all u llcm•a,
and we have assumed
2(s-m -A) <-1, we obtain
IIuIIP,,
= E n-0
(1
+ n)2a I ( u, Pn)PI2
00
cllullc_,a F(1 + n )2(, -m-a) < cIllullcm.A. n=0
In the sequel we denote by Pn = Pn the Lagrange interpolation projector associated with the zeros of the Jacobi polynomial P(a,")(y ). Moreover , we assume Joel = 1/31 = a. In such a case, if u c La,,, s > z then the following interpolation error estimate is valid (see Theorem 3.4 in 1):
11- - PnullPs < ns IIu11P,,
for all
0 < t < s.
(14)
4. Stability and error estimates for the collocation method In what follows the hypotheses (4) are assumed to be satisfied. Eq.5 can be rewritten in the equivalent form (I +HK)u = Iii
(15)
with u E C.. The operator K := K1 + K2 : C. C' [- 1, 11, with r = ,u if µ < .1 and r = a - e if u > z (c > 0 arbitrarily small), is compact and bounded.
The collocation method we consider is represented by the equation (I + HPK)un = HPn f,
un = wvn, Vn E lln-1,
(16)
where lln_1 denotes the set of all polynomials of degree < n - 1. Computational details can be found in 8,15 In order to prove stability and error estimates for our collocation method we need to turn to the spaces L , (cf. Remark 4). Notice that H E £(Lo,,) is invertible and H-1 = H E £(L2,,) for all s > 0 (see Lemma 4.1 in 1). Moreover, the operators K1 and K2 are bounded linear operators mapping L2" into L;,a}1, provided k1(x, y), k2(x, y) E C'([-1,1]2), where r is the integer satisfying r - 1 < s < r (this easily follows from Sect.2 in 1). Note that the latter condition follows from assumptions (4) for m > r.
293 I in Lo,, for s > z (see (14)) and K : Lv,, -. Lo,,+1-E is compact, where Since P. e is as small as we like (see Property 5,(d)), for s > 0 we have
IIH(PP - I) K ll,,,
< cII(P, - I)KII o,,
0,
as n
Co.
This latter convergence implies the stability of the method in Lo,, provided the operator I + HK is invertible in Go , (or, equivalently, ker(H + K) = {0} in Le). That is, Theorem 10 . Assume that conditions (4) are verified and that I+HK is invertible in Lo, for s > 0. Then for all integers n sufficiently large the operator I + HP„K is also invertible in the same Lo,,, and
II(I+ HPnK)-IIIo,, < c. (17) Remark 11. We have sufficient conditions ensuring that the solution u(x) of Eq.3 belongs to a space Lo,,, with s > z. Indeed, since we have Hu = g with g E C°,A (see Lemma 2), Theorem 9 implies that g E La, for s < m + A - z (provided that m + A - i > 0). Moreover, since H is invertible in Lo , for all s > 0 with inverse H-1 = Ii, we obtain u = Hg E La, fora < s < m + A - 1 (provided m + A > 1). From inequality (17) and the identity (I+HPK)( u-un)=H [(f -Pnf)+(P„K-K)u] the following error estimate follows:
IIu - unIlo,a < c[Ilf - PnfII o ,, + II(I - P,. ) Kull ,,,].
(18)
The right hand side of (18) can be estimated by using (14), provided f,Ku E Lo ,s for t > s. Since under assumptions (4) we have f E Cm'°[-1, 1] and Ku E Cm,'\[-1, 1], with A = min(µ, 1) if u E Lo,, (see Lemma 1) and . = min(p, 1) if u E C. (see Lemma 3), inequality (14) and Theorem 9 imply that
Il l -P,,fllv,,
!^ nil,
(19)
forallt<m+µ-2, 0<s
IIKu - P„Kullo,,
(20)
forallt< m+A-Z, 0<s< t,provided m+A-Z>0. Hence, Theorem 12 . Assume that conditions (4), with m +A > 1, are fulfilled. Suppose the homogeneous equation associated with Eq.3 (i.e. with f = 0) has only the trivial
294 solution in L. Then for all sufficiently large n Eq.16 has a unique solution u„ _ wv,,, v„ E 11„_1. Furthermore, the following error estimates hold: (z) IJu-u„11o, <
for all
ntc'
0<s
where A = min(p, 4); (ii) JIn - u„jj& < ni's
for all 2 <s
where A = min(p, 16),
Assertion (ii) follows from (i), Theorem 7 and Lemma 2.
Remark 13 . If in (4) we assume m > 2, then following the proof of Theorem 3.2 in 4 from (ii) above we easily derive a corresponding uniform error estimate: (21) with e > 0 as small as we like.
5. Stability and error estimates for the discrete collocation method In Sect.4 we have assumed to be able to compute exactly, or at least to full accuracy, integrals of the form 1 f w(x)kl(x,y)loglx - ylpj(x)dx, 1 w(x)k2(x,y)pj(x)dx,
(22)
✓ -1
where pj(x) is a basis polynomial for the representation of v„ E 11„_1. Indeed the integrals above can be evaluated to full accuracy using corresponding quadrature formulas constructed in 15, with a sufficiently large number of nodes. The number of nodes may be different for each element of the collocation matrix. In this section we simply assume to discretize the integrals in (22) by using the above mentioned quadrature rules, but with a number of nodes which is the same for all integrals; let us say N = L(1 + y)nj nodes, where 0 < y < 1 is a real constant. The factorization u = wv of the solution of Eq.3 is of importance in the construction of a discrete collocation method, especially when v(x) turns out to be a smooth function. Indeed, under assumptions (4) with m > 1 we have v E Cm_1,12-(-1, 1], > 0 arbitrarily small (see Remark 4 and Lemma 3). It is convenient to consider Eq.3 in the form (H + K)u = f,
(23)
295 where Ku = ! J
w(x)[kl(x,y)logIx - yj + k2(x,y )] v(x)dx.
(24)
To construct our discrete collocation method we approximate the operator K defined in (24) by using the quadrature formula proposed in 15. This rule is of the form N Kjiu := KNV := 1 J: [ANi(y)k1(rNi, Y) + HNik2(xNi, y)]v(xNi). 'r i=1
(25)
The sets {HNi} and {XNi} denote the weights and the nodes of the N-point Gaussian rule associated with the weight function w(x), respectively. The coefficients {ANi} are given by the expression N-1
-,) ANi(y) = HNi E µi(y)Pi (xNi), i=o
where the Jacobi polynomials pi(x) _= 1 -')(x) satisfy the three-term recurrence relation Po(x) = 1 p1(x)= 1 ( 2x+1), Pi(x) = 2xpi-1 ( x) - Pi-2(x),
The quantities
j=2,3,....
1 w (x) log Ix - yI pi (x)dx
ilf (y) = f
are given by the relationships µo(y) = v(y - log 2), µi (y) = f[ Ti+1( y) - jTj ( y)], j = 1, 2, ...,
where T, (y) denotes the j-degree Chebyshev polynomial of first kind. Formula ( 25) is exact , i.e., Ku =- KNu, whenever kl(x, y) v(x) and k2(x, y)v(x) are polynomials (in the x-variable) of degree N - 1 and 2N -1, respectively. Finally, we recall that for the coefficients { ANi(y)} the bound N
E IANi(y)I < c i=1
(26)
holds uniformly with respect to y E [-1, 1] (see 20,15)
Thus we propose the following discrete collocation method (H+PPKi) un =Pn .f,
(27)
296 where P. is the Lagrange interpolation projector which defines the collocation method of Sect.4, and un (x) = w(x)vn (x), vn E nn-1•
For the quadrature error term RNUn := (K - Kj,)un we have the following representation 1
RNUn = T J 1 w(x){log 1x - yl[ki(z, y) - gM L(x, y)] + [k2 ( x, y) - gM , L(x, y)]}vn (x)dx
1 - {A Ni( y)[kl (xNi, y ) - 4M,L(xNi , Y)] + HNi[k2 ( xNi, y) - gM, L(xNi, y)]}vn (xNi), i=1
where g'M L(x, y) denotes the best (uniform) approximation polynomial associated with the function kj (x, y), of degree M = N - n - 1 in the z-variable, and of degree L = [(1 -y)nj +In, with 11, 1 < c', in the y-variable. Since it is well known12 that under assumptions (4)
II k1 - gM,LIIOo
= O(n---"), j = 1, 2,
we have II RN Un
II=
nn+'+ v
Ilvn
III •
(28)
Lemma 14 . Let {fn}- o, with fn E Co,'[-1, 1], 0 < r < 1, be a sequence of functions, and for each n assume there exists a polynomial qn E nn such that C II1n- q,lloo < n,
Then we have
Ilfpn - of IIco ,r
I> 2r.
c' < ni 2r '
Proof. The statement follows by a straightforward generalization of the proof of a corresponding result given in 11,p.105
Remark 15. Lemma 14 can be applied to our remainder RNUn. Indeed, since K,q1 E 11j}1 and K2q; E H, whenever qj E III, RNUn can be represented in the form RNUn = fn -QN+L, where QN+L E 1IN+L. Moreover, in L = [(1-y) nj +In we can choose In so that N + L describes the whole sequence of positive integers. Furthermore, under assumptions (4), by differentiating K1vn as in (6), and recalling Lemma 3, it is not difficult to verify that K1vn E C-,'[-1, 11, with v defined as in Lemma 3. Thus we have RNUn E C-[-1, 1]. Assuming m + µ - 2 - 2r > 0, by applying first Lemma 14 with r = 1 we find the bound IIRNUnIIco,, = IIRNUflIC . O = 0(n---v}2 )IIvnII_• Then, by applying Lemma 14 to (RNUn)' we obtain IIRNU'IIc^,, = 0(n--N+2+2r )IIvnIIc . Recalling Remark 11, we consider Eqs.23 and 27 in the space L2,,, s > 2, and rewrite them in the form (I + HK)u = R f,
(29)
297 (I + HPnKN) Un = HI'n .f,
( 30)
respectively. To prove the stability of Eq.30 in L;,,, s > 2, we set first I + HPnKjy = I + HPnK + HPP(Kj, - K).
The invertibility and the uniform boundedness of the operators I + HP„K (for all n sufficiently large) has been proved in Sect.4. By assuming m > 3 + 2c -,a (e > 0 arbitrarily small) in (4), from the remainder estimate (28) we can derive the new bound II (K -
KN)unll od
< O( 1)IIUn Ila , a,
(31)
for some value of s > 2. This result follows from Theorem 9 and Remark 15. Indeed for 2 < s < 2 +c, 0 < <1, we have IIRNU'IIo,a < C
IIRNUn IIC=,<
< C,
Il
Ih
yn n-+µ-2-2c
Furthermore, from the proof of Theorem 3.2 in 4 we find Ilvn Iloo <
Since Ilan II& < clluN IIa,, for s >
2
CInII
1
N - xvn il_ - C1nhIun IIC.
(see Theorem 7), we obtain
C N IIRNUn IIa,' - ,nm+N- 3-2e Ilun Ila,a,
from which (31) follows, provided m > 3 + 2c - µ. Since H is bounded in o,, and the projections Pn are uniformly bounded in L;,, for s > 2 (see (14)), inequality (31) yields en
II HPn(K - KN)un Ila, a = O(1)IIun Ila^a.
Consequently, under the above assumptions on s and m, for all sufficiently large n we have (32) C2IIUn Ilo ,a < II(I + HPnKN) Un Ilo,a ,
where c2 is a positive constant . This latter inequality implies19 stability, i.e., Theorem 16 . Under assumptions (4) with m > 3 + 2e - p , 0 < c < 1, for all integers it sufficiently large the operator I+ HPnKN is invertible in La„ 2 < s < 2 + c, and
II(I + HPnKN) - l llo,a
< c.
The convergence estimate then follows from the inequality IIU - UN lla ,a
<
c[II H(f - Pnf)I )a,a + II H(PnKN - K)ull a,al
298 < c'
[11f - Pnfll o,,
+ II (PnKN - K) ullo,,]
together with Theorems 7, 9 and Lemma 2. Indeed, recalling that v E 1], with f > 0 arbitrarily small, from15
IIKu - Kivull ,
=
O( n- m+;+c)
we obtain (see above)
IIKu - Knoll,,, < cliKu - K;, ul lc.,. <
nm
c, -3c
For the estimate of 11 f - Pnfll o,, see (19), for II (I - Pn)KIIO,, see (20). Thus we have: Theorem 17. Under the same hypotheses of Theorem 12, but now with m > 3, for all n sufficiently large Eq.27 has a unique solution uN (x) = w(x)vn(x), with V, E 11n-1. Furthermore, the following error estimate holds:
Ilu - un III s
nm- 2-3c
(33)
where e > 0 is as small as we like. As in Remark 13, from (33), assuming m > 4, we can easily derive the estimate
Ilu - un IIoo = o(n-m+,}3E).
6. References 1. D.Berthold,W.Hoppe and B.Silbermann, A fast algorithm for solving the generalized airfoil equation, in Orthogonal Polynomials and Numerical Methods, J.Comput.Appl.Math. 43 (1992), 185-219. 2. D.Berthold,W.Hoppe and B.Silbermann, The numerical solution of the generalized airfoil equation, J.Integral Equations Appl. 4 (1992), 309-336. 3. S.R.Bland, The two-dimensional oscillating airfoil in a wind tunnel in subsonic flow, SIAM J.Appl.Math. 18 (1970), 830-848. 4. M.R.Capobianco, The stability and the convergence of a collocation method for a class of Cauchy singular integral equations, Math.Nachr., submitted for publication. 5. D.Elliott, A comprehensive approach to the approximate solution of singular integral equations over the arc (-1,1), J.Integral Equations Appl. 2 (1989), 5994. 6. J.A.Fromme and M.A.Golberg, Reformulation of Possio's kernel with application to unsteady wind tunnel interference, AIAA J. 18 (1980), 951-957.
299 7. J.A.Fromme and M.A.Golberg, Convergence and stability of a collocation method for the generalized airfoil equation, Appl. Math. Comput. 8 (1981), 281-292. 8. M.A.Golberg , Solution Methods for Integral Equations , Plenum, New York, 1978. 9. M.A.Golberg and J.A.Fromme, On the L2 convergence of collocation for the generalized airfoil equation, J.Math.Anal.Appl. 71 (1979), 271-286. 10. P.Junghanns and B.Silbermann, The numerical treatment of singular integral equations by means of polynomial approximations, Preprint, P-MATH-3586, AdW der DDR, Karl-Weierstrass-Institut fur Math., Berlin, 1986. 11. A.I.Kalandyia, Mathematical Methods of two-dimensional Elasticity , MIR, Moscow, 1975. 12. G.G.Lorentz, Approximation of Functions, Holt, Rinehart and Winston, New York, 1966. 13. G.Mastroianni and S.Prossdorf, A quadrature method for Cauchy integral equations with weakly singular perturbation kernel, J.Integral Equations Appl. 4 (1992), 205- 228. 14. S.G.Mikhlin and S.Prossdorf, Singular Integral Operators, Springer-Verlag, Berlin, 1986. 15. G.Monegato and P.Lepora, On the numerical resolution of the generalized airfoil equation with Possio kernel, Numer.Math. 56 (1990 ), 775-787. 16. W.F.Moss, The two dimensional oscillating airfoil: a new implementation of Galerkin's method, SIAM J.Numer.Anal. 20 (1983 ), 391-399. 17. N.I.Muskhelishvili, Singular Integral Equations , Noordhoff, Groningen, 1953. 18. S.Prossdorf, Some Classes of Singular Equations , North-Holland, Amsterdam, 1978. 19. S.Prossdorf and B.Silbermann, Numerical Analysis for Integral and Related Operator Equation, Birkhauser -Verlag, Basel, 1991. 20. I.H . Sloan , Analysis of general quadrature methods for integral equations of the second kind, Numer. Math. 38 (1981 ), 263-278. 21. I.H.Sloan and E.P.Stephan, Collocation with Chebyshev polynomials for Symm's integral equation on an interval, J. Austral. Math. Soc. Ser. B, to appear. 22. M.Schleiff, Tiber eine singulare Integralgleichung mit logarithmischem Zusatzkern, Math.Nachr. 42 (1969 ), 79-88. 23. M.Schleiff, Untersuchung einer linearen singularen Integrodifferentialgleichung der Tragfliigeltheorie, Wiss.Z. Univ.Halle, XVII'68M,H .6 (1968), 981-1000. 24. G.Szego, Orthogonal Polynomials , Amer.Math.Soc.Colloq.Publ. 23, Providence, R.I., 1975.
WSSIAA 2(1993) pp. 301-308 ©World Scientific Publishing Company
DOUBLE EXPONENTIAL FORMULAS FOR FOURIER TYPE INTEGRALS WITH A DIVERGENT INTEGRAND MASATAKE MORI and TAKUYA OOURA Department of Applied Physics, Faculty of Engineering, University of Tokyo, 7-3-1, Hongo, Bunkyo -ku, Tokyo, 113 Japan
ABSTRACT An efficient method of numerical evaluation of Fourier tyle improper integrals such as fo f (x) sin wxdx or f _00 f (x) sin wxdx where the integrand f (x) sin wx diverges as x -* ±oo is proposed based on the double exponential transformation. It consists of a single quadrature formula and any additional operation such as an acceleration procedure is unnecessary . It is applied to some improper integrals and gives a good result.
1. Introduction Suppose that an improper integral such as I=
J0
00 log x sin xdx
(1)
is given. This should be defined properly as I = lim J ao e-ex log x sin xdx. 0
(2)
When When we evaluate (1) numerically we usually use an extrapolation method with respect to e. However, this kind of method is not general in the sense that we usually must device some special acceleration of extrapolation for each specific integral. Also, there has been no general purpose quadrature formula for such kind of integrals with a divergent integrand. In the present paper we discuss about the possibility of constructing a subroutine which enables us to evaluate such kind of improper integrals based on the double exponential transformation. The double exponential transformation is an optimal variable transformation useful for numerical integration of analytic functions. The basic idea of the double exponential transformation is as follows5'2. Let the given integral be I=
f
b f (x)dx. (3)
a
301
302 We assume that f (x) is analytic on (a, b) but that it may have an integrable singularity at x = a or x = b. A variable transformation x = 0(u), ¢(-oo) = a, q5(+oo) = b (4) leads to
00 (5) I = J f(q5(u))cb'(u)du. 00 Since this is an integral over (-oo, oo) we apply the trapezoidal rule with equal mesh size h which is known to be optimal' for numerical integration of an analytic function over (-oo, oo). Then we have a quadrature formula 00 Ih = h E f (O(kh))¢'(kh)• (6) k=-oo
This is an infinite summation and in the actual computation we need to truncate the sum IhN) = h > f (cb(kh))q5' ( kh),
(7)
k=-N_
where N = N_ + N+ + 1 is the number of function evaluations . Therefore the overall error of (7) is DI(N)=I-I(N)=I-Ih+Ih-IIN, =DIh+st, (8) where DIh is the discretization error defined by 00 f (O(u))c6'(u)du h = f(O(kh))O'(kh) 0o k=-oo
oo DIh = I - Ih = f
(9)
and et is the truncation error defined by -N_
00
st = Ih - I( N) = h E f (cf (kh)) O'(kh) + h E f (O( kh))c' ( kh).
(10)
k=-oo k=N+
In general , for a fixed h, if f (O(u))O'(u) decays rapidly as u -r ±oo, then AIh becomes large because in that case h is relatively large compared with the shape of f (O(u))¢'(u). On the other hand if f (O(u))O'(u) decays slowly as u --* ±oo, then Et becomes large. Therefore JDIhI and let cannot be made small at the same time and there should be an optimal decay rate of If (¢(u)¢'( u))J as u , ±oo.
Takahasi and Mori' found that the optimal decay of If (O(u))O'(u) I is double exponential, i.e. Jf(0(u))0'(u )I - exp (-cexp Jul ), Jul-' oo, (11) and the quadrature formula obtained based on this optimal transformation is called a double exponential formula, abbreviated as a DE -formula. The optimality of the
fl, 1 11,
11
303 double exponential formula is also established through a function analytic analysis by M. Sugihara2. Specifically for the integral over (-1, 1) r I = J_f(x)dx
(12)
x = tanh(2 sinh u)
(13)
the transformation gives a DE-formula. For the integral
the transformation
I = f f (x)dx 0
(14)
T x = exp(2 sinh u)
(15)
gives a DE-formula over (0, oo). Similarly, for the integral f(x)dx
1 = J
(16)
00
the transformation x = sinh(2 sinh u)
(17)
gives a DE-formula over (-oo, oo). However, it is known that the transformation (15) or (17) does not work well for integrals with a slowly decaying oscillatory integrand. There have been several devices for efficient numerical evaluation of such kind of integrals. They are mainly based on the Richardson's extrapolation method'. In the preceding paper3 we proposed another kind of transformation useful for such kind of integrals. In that paper we also showed that the transformation is useful even for integrals with a divergent integrand. In this paper we propose a similar transformation for integrals over (-co, 00), and show that it is useful not only for integrals with a slowly decaying oscillatory integrand but also for integrals with a divergent integrand. The purpose of the original double exponential transformation exists in that it makes the decay of the integrand double exponential at large jxj. On the other hand the idea of the double exponential transformation presented in this paper is to make the points of the formula approach double exponentially to the zeros of the integrand. 2. Transformation for Integrals over (0, oo) First we review3 the transformation for integrals over (0, oo). Let the given integral be 1 =
I 0
00 f(x)dx.
(18)
304 We assume that f (nA + 9) = 0
for large positive integer n.
(19)
In other words we assume that f (x) has infinite number of zeros with period A at large x and that the phase of the zeros may be shifted by a constant 9 with respect to the origin. If f (x) = f, (x) sin wx, then A = it/w and 9 = 0, while if f (x) = f, (x) cos wx, then A = it/w and 9 = it/(2w). We apply a variable transformation x = M¢(t), 0(-oo) = 0, 0(+00) = oo, (20) to (18) which gives
00 I = f f (M¢(t))Mc'(t)dt, (21) 00 where M is some positive constant . Next we apply the trapezoidal rule with an equal mesh size h to this integral which leads to Ih = Mh
.f (Mc( nh + M))O' ( nh + M ). (22) n=-oo
Here we choose such 0(t) that qf(t) t double exponentially as t
+oo, (23)
i.e. that Mc(nh + 9/M) approaches to Mnh + 9 double exponentially when n --> oo, and that 0 double exponentially as t -+ -oo. (24) Next we choose a mesh size h such that h = M.
(25)
Then we have f (MO(nh + M )) = f (Mnh + 9) = f (nA + 9 ) = 0, h=
M
(26)
from ( 19) for large n, so that we can truncate the summation Ih at some moderate n. For large negative n we can truncate the summation by (24). A typical and useful example of such kind of transformation is given by fi(t) = t 1 - exp (-
K Binh t)' (27)
where K is some positive constant . It is easy to write a subroutine of the double exponential formula for integrals ( 18) satisfying (19) based on the transformation
^t^
305 (27). The infinite summation in (22 ) should be truncated at n = -N_ and n = N+ where the integrand Mhf(M¢(nh+ M))¢'(nh + -L) of (22 ) is sufficiently small. As an example we evaluate the improper integral
f00
(28) logxsin xdx 0 by the subroutine. This integral should properly be defined as (2) and it is equal to -y where y is the Euler's constant. What we must do here is only to write a function subprogram which defines (29) f (x) = log x sin x I=
and to give it to the subroutine. In Fig.1 the integrand log x sin x is shown. Short vertical dashes along the x- axis indicate the location of the sampling points of the formula. We see that the points of the formula approach quickly to the zeros of log x sin x as x becomes large.
4
0
-4 0
4 7r
87r
121r
161r
Fig.l . The integrand f (x) = log x sin x. Dashes along the x-axis indicate the location of the sampling points of the formula.
First we choose K = 6 and M = 30. Then we obtain a result whose absolute error is 1.0 x 10-7 with 34 function evaluations. Next we choose K = 6 and M = 50 and obtain a result whose absolute error is 2.1 x 10-13 with 75 function evaluations. In this way, even if we do not know the proper definition of the integral (2), we can easily obtain the value of the integral (28). The reason why this formula gives a good result for such kind of integrals is discussed in the reference3. 3. Transformation for Integrals over (-oo, co) In this section we consider the integral over (-oo, co): 00 I J f(x)dx. 00
(30)
306 In this case we assume that f (nA + 9 ) = 0
for large integer ± n. (31)
Then we apply a variable transformation x = Mq(t), 0(-oo) = -00, 0(+ oo) = 00, (32) to (30) which gives I=
1
00 f (Mc( t))MO'(t)dt,
(33)
where M
= LA. (34) c c is a positive constant and µ is a positive integer. Next we apply the trapezoidal rule to this integral which leads to 00 0 Ih = Mh E .f (MO(nh + M)) c'(nh + 9 ). (35) M n=-oo Here we choose such ¢(t) that q(t) -+ t c double exponentially as t -+ ±oo, (36) where c is the constant appearing in (34). We again choose a mesh size h as (25). This makes Mq(nh+9/M) approach to (n:Fµ)A+9 double exponentially as n -+ ±oo, and hence it makes f (Mc(nh+9/M))¢'(nh+9/M) approach to zero double exponentially as n -+ ±oo by (31). An example of this type of transformation is given by x = 0(t) = t - c tanh (2 Binh t) ,
(37)
where a is a positive constant. We chose a = 1.6 and c = (2/a)/(1 + 0.8/µ), and wrote a subroutine of the double exponetial formula based on this transformation. The infinite summation in (35) should be truncated at n = -N_ and n = N+ where the integrand Mhf (M¢(nh + M))¢'(nh + M) of (35) is sufficiently small.
As the first example we evaluate the improper integral sin
J
x dx
ool+e-x
(38)
by the subroutine. Although the integrand is not divergent, it does not converge to zero as x --# oo. This integral should properly be defined as °O sin x I = lim e-0z dx C-o 0O 1 + e-=
(39)
307 and it is equal to 7r/ sinh 7r. Again what we must do here is only to write a function subprogram which defines sin
x +e-a
f (x) =
(40)
and to give it to the subroutine. In Fig.2 the integrand sin x/(1 + e-z) is shown. We see that the points of the formula approach quickly to the zeros of sin x/(1 + e-=) as jxj becomes large.
2
0
1
-2 -4 7r
0
47r
87r
12 7r
Fig.2. The integrand f (x) = sin x/(1 + e-z). Dashes along the x-axis indicate the location of the sampling points of the formula.
First we choose µ = 6. Then we obtain a result whose absolute error is 5.2 x 10-s with 30 function evaluations. Next we choose µ = 12 and obtain a result whose absolute error is 3.9 x 10-11 with 62 function evaluations. As the second example we evaluate the improper integral 00 I = f log(1 + x2) cos xdx (41) 00 by the subroutine. The integrand of this integral is divergent as x -+ oo. This integral should properly be defined as e-`Isl log(1 + x2) cos xdx (42) I = lim ,0 f 00 and it is equal to -27r/e. In Fig . 3 the integrand log(1 + x2 ) cos x is shown. We see that the points of the formula approach quickly to the zeros of log ( 1 + x2) cos x as lxJ becomes large. First we choose µ = 6. Then we obtain a result whose absolute error is 2.5 x 10_5 with 38 function evaluations . Next we choose µ = 12 and obtain a result whose absolute error is 1.4 x 10 - 11 with 80 function evaluations.
308 The reason why this formula gives a good result for such kind of integrals is similar to the case of integrals over (0, oo). A detailed analysis of this kind of formulas will be given in the succeeding paper.
Fig.3. The integrand f (x) = log(1 + x2) cos x. Dashes along the x-axis indicate the location of the sampling points of the formula.
References 1. M. Mori, The double exponential formulas for numerical integration over the half infinite interval, in Numerical Mathematics Singapore 1988, International Series of Numerical Mathematics, Vol.86, 1988, Birkhauser, 367-379. 2. M. Mori, Developments in the double exponential formulas for numerical integration, Proceedings of the International Congress of Mathematicians, Kyoto 1990, 1991, Springer-Verlag, 1585-1594. 3. T. Ooura and M. Mori , The double exponential formula for oscillatory functions over the half infinite interval , J. Comput. Applied Math. 38 (1991 ), 353-360. 4. H. Takahasi and M . Mori, Error estimation in the numerical integration of analytic functions, Rep. Comput. Centre Univ. Tokyo 3 (1970), 41-108. 5. H. Takahasi and M . Mori, Double exponential formulas for numerical integration, Publ. RIMS Kyoto Univ. 9 (1974), 721-741.
WSSIAA 2(1993) pp. 309-319 ©World Scientific Publishing Company
Computable L°° error estimates in the finite element method with applications to nonlinear elliptic problems Mitsuhiro T. Nakao Department of Mathematics , Faculty of Science Kyushu University 33, Fukuoka 812, JAPAN
Abstract In this paper, we consider a numerical technique to verify the solutions for nonlinear elliptic boundary value problems with guranteed L°° error bounds. Some computable a priori L°° error estimates to the linear finite element solution of Poisson's equation is derived for rectangular and triangular elements. These estimates play an essential role applying our verification method to nonlinear elliptic problems. Following the description of outline of the fundamental verification procedures, a numerical example is presented.
1. Introduction In the last decade, various kinds of numerics with result verification have been proposed for differential equations. However, there are not so many such works for partial differential equations(PDEs) up to now. The author has studied for years the nunerical verification of the solutions of PDEs[4][ll]etc. The basic approach of this method consists of the fixed point formulation of PDE and the construction of the function set, in computer, satisfying the validation condition of some kind of infinite dimensional fixed point theorem. However, due to the fact that our verification method is based on the L2 theory in a certain Sobolev space, we could not obtain the L°° error bounds at all. In the present paper, we propose a technique which overcome this difficulty. Since, in our method, the computable a priori error estimates for the linear finite element solution to Poisson's equation plays an essential role, our main purpose consists of the computable L°° error estimates to that simple equation with rectangular and triangular elements. Then we describe the application of the results to nonlinear elliptic problems as well as we present a numerical example. Plum[13]-[16], recently, proposed an alternative L°° approach by the use of the C' finite element method for rectangular domain with piecewise biquintic polynomials. His
309
310 method is essentially based upon the smooth approximation with high accurary and the numerical enclosure of eigenvalues using the homotopy method for the linearized elliptic operator. As we are able to use the low smooth function, i.e., C° element, the verification procedures are simple compared with the method with C' element. Furthermore, our method needs no highly accurate approximation of the solution to the original problems, because we avoid the norm estimation for the inverse of the linerarized operator, which is the second characteristic of our method.
2. Basic concepts and preliminaries We consider the following nonlinear elliptic boundary value problem: -Du = f (x, u, Vu), x E S2, (1)
u = 0, x E 852, where Q is a convex polygon in R2. We use the simplified notation f (u) f (x, u, Vu). As we will see in the section 4, it is sufficient for our purpose to establish the computable L°° error estimates to the solution of following Poisson's equation:
-Ow = g, xES2, (2)
w = 0, x E Of t. Now, for an integer m, let H"`(fl ) - H"` denote the L2-Sobolev space of order m on Q. And set Ha =_ {0 E H'Itr ( O) = 0 on 812} with the inner product < (V¢, V O), where (•, •) means the inner product on L2 (Q ) with associated norm Also let Sh be a finite element subspace of Ho dependent on the parameter h and let define, for any w E HO, the Ho- projection Phw E Sh as follows: (Ow - V(Phw), Vv) = 0, "v E Sh.
(3)
We suppose the following approximation property of Ph:
"O E H2 n Ho = IIV(O - Phc)I 1 5 Cihl qI H2,
(4)
where I0IHZ = ^Z,j=1 Ilaa;a IIL2• Here, C, is a positive constant independent of h. This estimates hold for many finite element subspace of piecewise linear polynomials with quasi-uniform partition of mesh size h. We consider the actual values of C, for several cases in the next section. Furthermore, since S2 is convex, we have the following well known regularlity estimates:
IWI H 5 C2II9II, 2
where the constant C2 can also be numerically obtained . Indeed, for the polygonal domain we can take it as C2 = 1 ([2]).
t t
311
Therefore, setting C - C1 C2, we have IIV(w -
Phw)II
!^
ChII9Ij•
(5)
Moreover, by the Aubin-Nitsche trick, we obtain the L2 error estimates
11w - PhwII <- C2h2II9I I.
(6)
Now let Th be the usual decomposition of f2 into trianglular or rectangular meshes. In order to obtain the computable L°° error estimates, we use the following constructive imbedding : H2'- L°° for 0 = w - Phw on each T E Th
I IOI IL-(r) s C3h_1I IqI I L2(T) + C4I IocI IL2(r) + CshIgI H2(r), (7) where C3 - C5 are positive constants independent of h. To determine these constants, we refer the explicit imbedding theorem obtained by Plum[15]. Lemma 1 ([15]). For each ¢ E H2(r), where r E Th,
IIOII L °°(r) s I7-I1/2 [II0II L2(r) + k1M,(T )IIoOIIL2(T) + k 2 M2 (r)I0I H2(T)], (8) where kl = 1.1548, k2 = 0 .22361,
In
= [measure of T] and
Mj(r) = sup[ 1 f Ix - xol2jdx]1/'2, j = 1, 2, xo ET I
'd
(9)
r
where Ix - xoI means the Euclid distance in R2 between x and xo. Also we use the following property to apply Lemma 1. Proposition 1. For each r E Th, set for j = 1,2, Ij(xo) = f I x - xo122dx. r
Then Ij(xo) attains the maximum at the corner point of r. Furthermore, for r a triangle, the maximum is taken at the corner having the minimum angle. Proof. For each line segment 1 = {x E Tlx = (1 -t)xl +tx2i 0 < t < 1} contained in r, Ij(xo) takes the mamimum value at the end point of 1. Indeed, otherwise, there exists some t, 0 < t < 1, such that for 7 = (1 - t)xl + tx2 I2(xl) < Ij(-Z)
and
Ij(x2) < Ij(a).
But, using the fact that Ix - xol2j is the convex function in xo for fixed x, we have Ij(x) = f I(1 - t)(xl) + t(x - x2)I2jdx
< (1 - t)I,(xl) + tlj(x2) < Ij(x)
312 which is a contradiction. The latter statement easily follows by some consideration on the geometric shape of the triangle. 3. Computation of constants for piecewise linear element 3.1 Rectangular element. For simplicity, we resrict ourselvs to the unit square, i.e., fl = I x I, where I = (0,1), with uniform mesh. The extension to the nonuniform partition and/or the general rectangle is straightforward. Let Sh(x) denote the set of continuous and piecewise linear polynomials on I with uniform mesh size h and homogeneous boundary conditions. Then, we set Sh - Sh(x) 0 Sh(y). Also define the auxiliary projection Px : HH(I) --+ Sh(x) by (u' - (Pxu)', v')1 = 0,
v E Sh(x),
(10)
where (•,•)I implies the inner product on L2(I). Py : HO(I) --+ Sh(y) is similarly defined. We present the following inverse inequlity for later use. Lemma 2 . Let J = [a, b] and v a linear polynomial on J. Then it holds that
IWJIL2(J)
< 12h- 1IIvIILz(J),
(11)
where h= b - a. Proof. Let v(x) = Ax + B(A $ 0). The direct calculation implies that
3 A2h2
A2h + ABh2 + B2h 1
B h 2 hz (A + 2) + 12
Hence, the conclusion readily follows. Now we present the L°° estimates for thre rectangular case. Theorem 1. Let w and Phw be the solutions to (2) and (3), respectively. Then the following estimates holds:
IIw - Ph wII Lo'(n) < 1.054h1I9II. Proof. First, by a simple computation using Proposition 1 or by the result of [15], M;(r) in Lemma 1 are determined as Mj(T) =
V3h
and
M2(r) =
3 V5h2
(12)
Now, from the well known error estimates(e.g., [4],[8]), we have z
IIw - PhwII < T2IWIH2.
(13)
313
IIV(w - Phw) II <- hIwIH2.
(14)
Namely, C1 = 1 in (4). Further , since (Phw)xy # 0, we need the estimates: 7r (max 1w - Phwl H2( T))112• TETn
Observe that for each r E Th
II(w - Phw)xyjlT <_ II (w - PyPxw)x yII T + II (PyPxw - Phw)x ylIT> where II •
(15)
IIT means the L2 norm on r.
The first term of the right-hand side of thr above is bounded as:
II (w - P5Pxw)II7 = II(wx - Py(Pxw)x)5IIT II (wy - Pxwy)=IIT + II ((Pxw)x - Py(Pxw)x )YIIT < IIwyxIIT + II (Pxw)xyIIT (16)
< 2I I 'wxy I I r,
where we have used the equality: ((wy - Pxwy ) x, (P,(wy - ( PPw)y))x)T = 0. Next, by the definition of Phw, we have
IIV(P5Pxw - Phw)I I2
=
(V(P5Pxw - Phw), V(PYPxw - Phw)) (V (PFPxw - w),V(P5Pxw - Phw)).
Hence, using the estimates in [4]
IIV(P,,Pxw - Phw)II 5 IIV (P,Pxw - W)II h1wjH2
(17)
Ir Combining (17) with Lemma 2, we obtain
(18)
II (PyPxw - Phw)xyll <- 7r2I w[H2. Thus ( 16), (18 ) and (15 ) imply that
max 1w - PhwI H=(T
TETh
= IIwxxll2 +
Ilwyyll2 + 211 (w - Phw)xyII2 /1-2 ) 2) 1 W 1 2
(1+2(V2+ < v
7r
H2-
(19)
Finally, some simple calculations using (12)-(14) and (19), incorporating Lemma 1, yield the desired estimates.
314
3.2 Triangular element. For the case of the triangle r illustrated in Fig. 1, a simple computation using Proposition 1 show that M;(7-) in Lemma 1 are given as follows:
C Figure 1: Model triangle T (A: minimum angle)
M1(T) <
M2 (1)
{(3a2 + b2)b + (3a2 + c2)c}1/2. (20) 6(b + c)
(15a' + 10a2b2 + 3b')b + (15a4 + 10a2c2 +3 C4)C_ 11/2. (21) :5
f
+ 45(b c)
We only consider here for the uniform mesh, i.e., a = b = h and c = 0. Then, it is known
by [12] 11W - PhwII < 0.812h2JwJH2 and JIV(w - Phw)ll < 0.81hlwIH2. (22) Computations using (20)-(22) prove the following estimates by Lemma 1. Theorem 2 . Let Th be the uniform triangulation of Q. Then, for w in (2) and Phw in (3), it holds that 11w - PhwIILo(n) < 1.818hI!gll. This estimates is rather large compared with the similar result in [12] derived for the knots or the midpoint of the hypotenuse of a triangle. For the more general triangulation, we are also able to obtain the L°° estimates by usual arguments on the interpolation error based upon the affine transformation(e.g., [1],[3],[12]). 4. Application to nonlinear problems Based upon the computable L°° error estimates of the finite element solution (3) to the Poisson equation (2) obtained in the previous section, we now consider a method to provide an a posteriori L°° error bound for the nonlinear problem (1). In order to keep the paper self-contained, we briefly review the fundamental idea of our technique to get the guaranteed error bounds in L2 sense (see [4],[8]etc. for details). When we denote the solution of (2) by w = Ag, the map A : L2 --+ Ho is compact. We also assume that f in the right-hand side of (1) is a bounded and continuous map from Ho to L2. For example, f (•, it, Vu) = 1 • Vu+g2un satisfies this assumption, where g1 = (gi,gi) and 92 are in L'°(St), and p an arbitrary nonnegative integer.
Now, using the following compact map on H. F(u) _- Af (u),
315 we have the fixed point equation which is equivalent to (1): (23)
u = F(u).
Now, according to the similar manner in [8],[9]etc., we introduce a Newton-like operator for the equation (23). Let uh E Sh be some approximate solution to (23) and let F'(uh) Ho --p Ho be the Frechet derivative of F at uh. Further we suppose that the restriction to Sh of the operator Ph[I - F'(uh)] : Ho ---+ Sh has an inverse [I - F'(uh)]h' : Sh -+ Sh , where I means the identity map on Ho. This assumption corresponds to the unique existence of the finite element solution in Sh to the linearlized equation of (23). Next, for a fixed positive parameter E such that 0 < E < 1, we define the nonlinear operator Te : Ho -+ Ho as follows: TEu = u - ([I - F'(uh)]h1Ph + EI)(u - F(u)). (24) Then TE becomes a condensing map([211) on Ho, i.e., Te can be decomposed as the sum of contraction and compact operators. It is seen that if -E-1 does not coincide with any eigenvalue of the operator Ph[I - F'(uh)] on Sh, then u = F(u) is equivalent to u = Tu. Now, as in [4],[8] etc., we define two concepts, the rounding and the rounding error, which enable us to treat the infinite dimensional problem by finite procedures. Let {c6j}j=1,...,y7 be a fixed basis of Sh, where M = dimSh. And let Shj be the set of all linear combinations of {0j} with interval coefficients. That is, M Sh,I = {uh C Sh I
Uh = E[aj,aj]0j, ({j, aj E R1} C 2S'`• j=1
Then, we define for any bounded set U C Ho as follows: Rounding : R(U) = n{uh E Sh,I I Ph (U) C Uh}, (25)
where n means the intersection for all elements in Sh,I satisfying the condition. Note that R(U) is also an element in Sh,7. Rounding error :
RE(U) _- {0 E Sh I II0II Ho <- a, II0II LZ where Sh means the orthogonal complement of Sh in Hp, a
(26)
Cha}, supllu
UEU
-
PhulI MOl
and C is defined in the section 2. Then, by using Sadovskii's fixed point theorem([21]), we have the following computable verification condition.
316 Lemma 3 . Let U C Ho ( S2) be a non -empty, bounded, convex, and closed subset such that R(TEU) ® RE(TU) C U, (27) then there exists a solution of u = F(u) in U. Here, ® denotes the direct sum in H,(9), and M1 C M2 implies Ml CM2 for any sets M1, M2. In order to construct the set U satisfying the verification condition (27) in a computer, we use the following iterative technique. Let uh E Sh be the initial approximation. For any a E R+, the set of all nonnegative real numbers, we set
E Sh
I
<
I I0I IHa
co,
II1IIL
2
< Cha }.
We now generate the following iteration sequence 1(.(n" an)} n=o,1..., where (uhn), an) E Sh I x R+. For n = 0, set uh°) = {uh} E Sh, I, ao = 0 E R+. For n > 1, first , for a given 0 < 6 << 1, define the 6-in f lation (cf .[17]) of (uhn-1) an-1) by M (n-1) (n-1)
uh
uh
+
1]60j,
(28)
j=1 an-1
=
an-1
+
Next , for the set U(n-1) =uhn-1) + [ an_1] define uhn) E Sh,I and co, E R+ by uhn) = R( ToU(n-1)), an
Ch sup IIf(u) IIL2• (29) uE
Notice that these calculations are independent of the parameter e([22]). Now we have the following verification condition , in a computer , which provides the a posteriori L°° error bound. Theorem 3 . If, for an integer N, uhN) C UhN-1)
and
aN < 6N-1,
then there exists an element u in uhN) ®a_ [Ni] satisfying u = F(u). Furthermore, sup II^IIL^( n) < CoocN-1, bE(-_N_^]
(30)
where C. is the constant in Theorem 1 and 2, i.e., C. = 1.054 and C. = 1.818, u(hN) C uhN-1) respectively. Here , implies that each coefficient interval in uhN) is strictly included by the corresponding interval in uhN-1)
317 (30) means that the L°° error bound to the set of approximate solutions represented by uhN) E Sh,l is given by the right-hand side. We omit the proof of this theorem, for it is quite similar to that of the corresponding theorem in [5] or [8]. And the second part of the theorem easily follows by Theorem 1 or 2. A numerical example. Consider the following Emden's equation: ( -Du = u2
in S2,
(31) Sl u = 0 on ac. Let S2 = (0, 1) x (0, 1) and let Sh be the finite element space which consists of the continuous piecewise bilinear polynomials with uniform rectangular mesh as in the section 3. Then, M = dimSh = (L - 1)2, where L denotes the number of partitions for the interval (0,1), i.e., h = Since the magnitude of the solution of this problem is very large, i.e., IjuhlIL-(n) .:: 30 for the initial approximation uh, we use a residual iteration method with smoothing. That is, we take a new approximation uh E H2(S2) as the pseudo-Hermite interpolation of uh ([23]). Then, setting v = u - uh, we have the following residual equation equivalent to the original problem:
-A v
= v2 +
NO
+
Aiih + 7lh
v = 0
in S2, (32)
on aSt.
Execution conditions: number of partitions : L = 80, residual error of approximation : J1Duh + wh11 14.7, inflation parameter : b = 10-3. Results: iteration numbers for verification : N = 12, Ho- error: all = 0.06661, L2 - error : Chall = 0.0002651, a(°°> L°° - error : a(°°) = 0.2206 (relative error :
0.008). IIuhIILo(n)
Fig. 2 shows the shape of uhi2) at y = 0.5 having the interval coefficients. Therefore, the exact solution at y = 0.5 lies in the range between two curves in Fig. 2 with additional L--error: 0.2206.
318 30
30
25
25
20
20
15
10
5
0. 0.0 0.2 0.4
0 0.6 0.1
1.0
X-AXIS
Figure 2: Range of the nontrivial solution of Emden's Equation at Y = 0.5
References [1] Ciarlet, P.G., The finite element method for elliptic problems, North-Holland, Amsterdam, 1978. [2] Grisvard, P., Elliptic problems in nonsmooth domain, Pitman, Boston, 1985. [3] Lehmann, R., Computable error bounds in the finite-element method, IMA J. Numerical Analysis 6 (1986), 265-271. [4] Nakao, M.T., A numerical approach to the proof of existence of solutions for elliptic problems, Japan Journal of Applied Mathematics, 5, (1988) 313 - 332. [5] -, A computational verification method of existence of solutions for nonlinear elliptic equations, Lecture Notes in Num. Appl. Anal., 10, (1989) 101 - 120. In proc. Recent Topics in Nonlinear PDE 4, Kyoto, 1988, North - Holland/Kinokuniya, 1989. [6] -, A numerical approach to the proof of existence of solutions for elliptic problems II, Japan Journal of Applied Mathematics 7 (1990), 477-488. [7] -, A numerical verification method for the existence of weak solutions for nonlinear boundary value problems, Journal of Mathematical Analysis and Applications 164 (1992), 489-507. [8] -, Solving nonlinear parabolic problems with result verification: Part I, in Jounal of Computational and Applied Mathematics 38 (1991) 323-334. [9] Nakao, M.T. & Watanabe, Y., : Part II, Several space dimensional case, Re-
319 search Report of Mathematics of Computation, Kyushu University, RMC 67-05 (1992), 10 pages. [10] Nakao, M.T. & Yamamoto, N., Numerical verifications of solutions for elliptic equations with strong nonlinearity, Numerical Functional Analysis and Optimization 12 (1991), 535-543. [11] Nakao, M.T., Numerical verifications of solutions for nonlinear hyperbolic equations, Research Report of Mathematics of Computation, Kyushu University, RMC 66-07 (1991), 13 pages. [12] Natterer, F., Berechenbare Fehlerschranken fur die Finite Elemente Methode, ISNM vol. 28, Birkhiiuser Verlag, Basel (1975), 109-121. [13] Plum, M., Computer-assisted existence proofs for two-point boundary value problems, Computing 46 (1991), 19-34. [14] -, Existence proofs in combination with error bounds for approximate solutions of weakly nonlinear second-order elliptic boundary value problems, ZAMM 71 (1991) 6, T660 - T662. [15] -, Explicit H2-estimates and pointwise bounds for solutions of second-order elliptic boundary value problems, Journal of Mathematical Analysis and Applications 165 (1992), 36-61. [16] , Numerical existence proofs and explicit bounds for solutions of nonlinear elliptic boundary value problems, Computing 49 (1992), 25-44. [17] Rump, S.M., Solving algebraic problems with high accuracy, A new approach to scientific computation (eds. Kulisch, U. & Miranker, W.L.), Academic Press, New York, 1983. [18] Tsuchiya, T. & Nakao, M.T., Numerical verification of solutions of parametrized nonlinear boundary value problems with turning points, Research Report of Mathematics of Computation, Kyushu University, RMC 67-02 (1992), 17 pages. [19] Watanabe, Y. & Nakao, M.T., Numerical verifications of solutions for nonlinear elliptic equations, to appear in Japan Journal of Industrial and Applied Mathematics. [20] Yamamoto, N. & Nakao, M.T., Numerical verifications of solutions for elliptic equations in nonconvex polygonal domains, Research Report of Mathematics of Computation, Kyushu University, RMC 67-06 (1992), 28 pages. [21] Zeidler, E., Nonlinear functional analysis and its applications I, Springer-Verlag, New York, 1986. [22] Nakao, M.T., Solving nonlinear elliptic problems with result verification using an H-1 residual iteration, Research Report of Mathematics of Computation, Kyushu University, RMC 67-07 (1992), 17 pages. [23] Akima, H., Bivariate interpolation and smooth surface fitting based on local procedures, Comm. ACM 17 (1974), 26-31.
WSSIAA 2(1993) pp. 321-332 ©World Scientific Publishing Company
A NUMERICAL METHOD FOR MULTI-SOLUTIONS OF A SYSTEM OF NONLINEAR ALGEBRAIC EQUATIONS TAKEO OJIKA Department of Electronics and Computer Engineering Gifu University
1-1 Yanagido, Gifu 501-11 JAPAN
ABSTRACT In this paper, a new method , termed here as billiard method, is proposed to find sequentially the multi-solutions of Eq.( 1.1) starting from a single initial guess , in which singularity in the neighborhood of a solution is adopted positively. Mathematical properties are discussed in detail and a practical algorithm for realization of the present method is also given . Its effectiveness is shown by solving several examples in which it is also shown that the billiard method can be extended to a system of nonlinear equations.
1. Introduction Consider a system of nonlinear algebraic equation given by f(x)=0, f:R"-'R",
(1.1)
where x = (x1ix2,...,xn )' and f = (f1, f2,..., fn)'• Suppose now that kx at the k-th iteration is given. Then one linearizes the function f at kx by expanding into a Taylor 's series and keeping only the terms of degree 0 and 1: f(111X) ,,
Z, f (kx) + J(kx)(k+lx _k x) = 0,
where the Jacobian J is given by
J( x)
= Of /OX ,
(1.3)
which is well-known as Newton method. If det J(x*) is nonsingular at a root x*, then it is well-known that starting from the neighborhood of the root, the sequence {kx} by the Newton method converges to x* with a quadratic convergence 2, 10, 11 , 13.17 , 19, 20 and if det J(x*) is singular, then the convergence of the sequence is linear 4- 8,12-1 4 , 16,18.
For a high-order nonlinear algebraic equation, i.e., a polynomial, DuhranKerner-Aberth method is well-known and is possible to obtain all the roots simultaneously 1. However, any practical method for obtaining many solutions(roots), i.e., multi-solutions, is not proposed for a system of nonlinear algebraic equations given by Eq.(1.1).
321
322 In this paper, a new method, termed here as billiard method, is proposed to find sequentially the multi-solutions of Eq.(1.1) starting form a single initial guess, in which singularity in the neighborhood of a solution is adopted positively. Mathematical properties are discussed in detail and a practical algorithm for realization of the present method is also given. Its effectiveness is shown by solving several examples in which it is also shown that the billiard method can be extended to a system of nonlinear equations.
2. Billiard Method For simplicity, we consider hereafter the 2-dimensional system of nonlinear algebraic equations Eq.(1.1):
fl(X) = 0,
(2.1.1)
= 0,
(2.1.2)
f2(x)
and we assume that all the roots are nonsingular. As for singular problems, refer to 11,12,14
Starting from a set of the initial guess °x = (°x1i °xz)', suppose that a root 1x* = (lxi,1x2)' is obtained. If we want to have another solutions, then it is necessary to give an appropriate different set of initial guesses which does not converge to the previous solution lx*, but may converge to an another new solution. In general it is difficult to select a proper initial guess. We consider here to restart the solution procedure from the neighborhood of lx*. It is not necessary to mention that if we simply restart from an initial guess in the neighborhood of lx*, then the sequence {kx} by the Newton method converges to lx*. In order to generate escaping force(repulsion force) from the neighborhood of the previous solution to a new solution area, consider the following new equations: 91(x) = f1(x)/II x -1 x*II = 0, (2.2.1) 92(x) = f2(x) = 0, (2.2.2) where II . II denotes the Euclidean norm.
Introduce here a new coordinate y = (y1, y2)' defined by y
=
x
x*,
(2.3)
then Eqs.(2.2.1) and (2.2.2) can be rewritten as 91(y)
= fl(y)/IIyII
92(y)
= f2(y) = a21y1 + a22y2 + b21yi + b22YIY2 + b23y2 + h2(y3),
[allyl + a12y2 +
b11yi + b 12yly2 + b13y2 + h1(y3)] /IIyII,
(2.4.1) (2.4.2)
where g = (g1, g2)', a11, i, j E [1, 2] and biy, i E [1, 2], j E [1, 3] are constants, and
hi, i E [1, 2] denotes the nonlinear homogeneous terms greater than or equal to degree 3.
323 The Newton method for Eq.(2.4) is now given by S(ky)(k+ly -k y) = -9(ky), where the Jacobian ( adjusting matrix) S(y) is given by S(y) = 1S(y) + 2S(y ),
(2.6)
and the submatrices 1S = [1sii] and 2S = [2sij] are given by 1511 = (ally2 - a12y1)y2
/llyll3,
1512 = -(ally2 - a12y1)yl /llyll3, 1521 = a21,
(2.7.1)
1622 = a22,
and 2511 = [(Y + y2)(2bllyl + b12y2 + Ohl/Oyl) - (bilyl + b12y1y2 + b13y2 + h1)yi]/Ilyll3, 2512 = [(Y + yz)(bl2yl + 2b13y2 + ahi/aye) - (bilyi + bl2yly2 + b13yz + hi)y2]/llyll3, 2521 = 2b21y1 + b22y2 + Ohl/ayl,
(2.8.1)
2522 = b22Y1 + 2b23y2 + '9h2/aye,
respectively. Note here that (i) the degrees with respect to y of numerators in the first row of 1S are lower than those of denominators by one, and (ii) the degrees of numerators in the first row of 2S are greater than or equal to those of denominators. Introducing a parameter a, termed here as billiard paramter, put y2
=
ayl.
(2.9)
Then lsij and 25ij are denoted by isii = ( alga - al2 )a/[(l + a2)3/2y1], 1512 = - ( ally - a12 )/[(l +a2)3/2y1], (2.7.2)
and 2511 = [(1 + a2){(2b11 + bl2a)yl + Ohl/Oyl }yl
-{(b11 + b12a + bi3a2)yi + hl}]/(1 + a2)3/2yi, 2512 = [ (1 + a2){(b12 + 2b13a)yl + ah1/aye}y1 {(b11 + b12a + b13a2)yl + hl}a]/(l + a' )3/2Y2 2521 = (2b21 + b22a)yl + ah2/ay1, 2522 = (b22 + 2b23a)yl + ah2/ay2i
(2.8.2)
324 respectively. We now have the following. Theorem 2.1 Suppose that the initial guess of the Newton method Eq.(2.5) is taken in the neighborhood of the new origin , i.e., °y = (°yl, a*Oy1)', 0 < l°y1I « 1. If the parameter a for the equations defined by Eqs.(2 .4.1) and (2.4.2) satisfies (i)
a* = a12 /all,
(2.10.1)
if all 54 0,
or
(2.10.2)
( ii) a* = -a21 /a22, if a22 0,
then det 1S(y) = 0
(2.11)
for Vy, and the determinant det S(y) of the adjusting matrix S is independent of y. [Proof ] From Eqs. ( 2.7.1) and (2.7.2), we easily have the following relation: det 1S ( y) = (alla - a12)(a22a + a21)/[(1 + a*2)3/2y1] ,
( 2.12)
and from Eqs.(2 . 7) and ( 2.8), and ( 2.12), we have
det S( y) = T(a*)/(1 + a*2)3/2, (2.13.1) where T(a*) = (1 + a*2){a22 ( 2b11 + b12a* ) - a21(b12 + 2bl3a* )} ( 2.13.2) * bll+b12a*+b13a +(-a22+a2la *)( 2)+2(alla*-al2)(b21+b22a *+ b23a*2). ❑
It is easily seen that (i) Eq.(2 . 10.1) shows the gradient of the normal line to fl(y) = 0, and (ii) Eq.(2.10 . 2) shows the gradient of the tangent line to f2(y) = 0 at the new origin y = (0, 0)', respectively. We now have the following. Corollary 2.1 If fl ( y) is linear, i.e., b11 = b12 = b13 = 0, and h1(y3) 0 for Vy, then
T(a12/ all) = 0.
(2.14)
The corollary suggests us that if fl( y) is linear , then we should select a* _ -a21/a22, a22 0 0.
Analogously to Eqs. ( 2.4.1) and (2.4.2), if we consider the following equations: 91(y) = a11y1 + a12y2 + buyi + bl2yly2 + b13yz + hi(y3), g2(y) = [a21y1 + a22y2 + b21yi + b22y1y2 + b23y2 + h2(y3 )] then the following corollary holds.
11
(2.4.3)
/ Ilyll,
(2.4.4)
325 Corollary 2.2 If 0 < °y1I << 1 and the parameter a for the equations defined by Eqs.(2.4.3) and (2.4.4) satisfies (i) a* = -all/ a12,
if a12 0 0, (2.10.3)
or (ii) a* = a22 /a21, if a21 0 0, (2.10.4) then the determinant of 1S(y) analogously defined by Eq.(2.7.1) satisfies Eq.(2.11) for Vy. It is worth mentioning here that Eqs.(2.10.1) and (2.10.3), and Eqs.(2.10.2) and (2.10.4) are orthogoal each others. Theorem 2 .2 Assume that a* for Eqs. (2.4.1) and (2.4.2) is given by Eq.(2.10.1) or (2.10.2), and °y = (°y1, a*Oy1), 0 < I°y1I << 1. If T(a*) # 0, then the approximate solution 1y and the ratio 1y2/1y1, azz j4 0, at the first iteration are given by 1y = -(a11 + a12a*)(1 + a*2)/T(a*) [ -a22 (2.15.1) which is independent to the value of the initial guess °y1, and 1y2/1y1 = -a21/a22,
(2.15.2)
which is equal to Eq.(2.10.2) and is independent to the selection of a*. [Proof] Since 0 < I°y1l << 1, from Eqs.(2.7.2) and (2.8.2), we have
(2.16)
1y = -S(°y)-19(°y)
1 a22 1 +a*2au +alza*+ a1 1a* - a12a2za*+az1
(1 + a*2)3/2 det S(° y) -a21 ( 1+a*2)(all +al2a* )+(alla*-a12 )(a22a* +a21) J Since (all a*-a12)(a22a*+a21) = 0, from Eqs.(2.13.1) and (2.13.2), we have Eq.(2.15.1) ❑ and hence it is easily seen that Eq.(2.15.2) also holds.
3. Algorithm for Billiard Method In order to obtain multi-solutions of the system of nonlinear algebraic equations by the billiard method, we now consider its numerical algorithms.
3.1 --Secant Method. Suppose that kx of Eqs.(2.2.1) and (2.2.2) at the k-th iteration is known, and consider the following perturbation:
z = kx + kee„ j=1,2,
k=0,1,2,., (3.1)
326 where e3 denotes the j-th unit vector, and ke(0 < Ik6J << 1) is called as perturbation parameter. Substituting Eq.(3.1) into Eqs.(2.2.1) and (2.2.2), we define the adjusting matrix S(kx; ke) whose ij-element s;j is given by s+i(kx; ke) = [9+(^ z) - 9r(kx)]1ke, i, j = 1,2 (3.2) Then analogously to the Newton method Eq.(1.2), we can define the following iterative* algorithm: S(kx;ke)(k+lx _ kx) = _g(kx), (3.3) which is called as e-secant method 11, 21. As for S(kx; ke), the following theorem holds. Theorem 3.1 Let ke(0 < IkE l << 1) be a perturbation parameter and S(kx; ke) be the adjusting matrix for the e-secant method whose ij-element is defined by Eq.(3.2). Then li kE_
n
S(kx; kE)
S(kx). (3.4)
As for the proof, see 10- 13,2 0. Under appropriate conditions, it is proved that the sequence {kx} by Eq.(3.3) converges to a solution with a quadratic convergence 20, and in the present algorithm the iterative formula Eq. (3.3) is adopted.
3.2 Billiard Algorithm with 6-Secant Method. In the practical computation, we apply the billiard method with c-secant method and since it is troublesome to compute the billiard parameter a , we simply use the uniform restarting parameter 6 for the initial guess of the next solution as mentioned below. (i) We solve the original Eqs.(2.1.1) and (2.1.2) with an appropriate initial guess 1x by the e-secant method and obtain 1x* = (1xi,1x2)'. (ii) Letting 2x = (1xi + 6, 1x2 + 6)', solve Eqs.(2.2.1) and (2.2.2) by the e-secant method, and obtain 2X*, where 6(0 < 161 « 1) is termed here as restarting parameter. (iii) Similarly letting 3x = (2x1 + 6, 2x2 + 6)', solve the following equations:
91(x) = f1(x)/IIx - lx*11 = 0, (2.2.3) 92(x) = f2(x)/IIz - 2x*11 = 0, (2.2.4) and obtain 3x*. (iv) Then recursively we solve 91(x) = fl( x) A z - 3x* lI = 0, (2.2.5) 92(x) = f2( x)/II x - 2x* 11 = 0. (2.2.6)
327 If we can not obtain a new solution, e.g., from Eqs.(2.2.5) and (2.2.6), then we replace here the equations by
3.3
«?,(*) = / , ( * ) / ! ! * - 2 z * | | = 0,
(2.2.7)
92(x) = f2(x)/\\x-ax'\\
(2.2.8)
= 0.
Cyclic Solutions.
Suppose that a solution \x' = (ix*, ix*,)', ix* ^ \x\ is obtained. Then there is a case such that 2x" = (ix*,,ix*)' also satisfies Eqs.(2.1.1) and (2.1.2). In the present algorithm such an automatic solution procedure, termed here as cyclic solution procedure, is also realized. 3.4
Symmetric Solutions.
Suppose that a solution ix* = (ixj, ix*,)' is obtained. Then there is a case such that 2^* = {-ix',\X2)', (1^1,-1^2)'' and/or (—ix", -ix*,)' also satisfy Eqs.(2.1.1) and (2.1.2). Its automatic solution procedure, termed here as symmetric solution procedure, is also realized in the present algorithm. 3.5
Stopping Condition.
After obtaining several solutions, it is important from practical point of view when the computation should be stopped. A stopping condition is proposed here. It is well-known that a system of n-dimensional nonlinear algebraic equations can be reduced to a polynomial(resultant) of one variable by applying recursively the Sylvester's elimination method. Hence let us now consider a polynomial of degree 2m given by f{x)
=
{x - x\)(x - x2) ■ ■ ■ {x - x2m)
= x 2m + ( x ' + x * + ---+x 2 m )x 2 m - 1 +---+x*-x* : ---x*, m . If Eq.(3.5) has a symmetric solutions, e.g., x2i = -x2,_u
(3.5)
i — 1,2, • • • m, then
x* -I- x* + • • • + x'2m = 0
(3.6)
holds. Taking this fact into account, we establish a following stopping rule. Stopping rule: Suppose that (m—1)(> 1) sets of simple solutions iX*,2X*, • • • , m _i x* are obtained. Then compute the m-th solution by mx
= -(ix'
+ 2x' + ---m-1x'),
(3.7)
and substitute it into the original Eqs.(2.1.1) and (2.1.2). If /( m x) = 0, then let be a solution and stop the computation.
mx
328 4. Numerical Examples Let us solve two examples under the following conditions:
(i) convergence criterion G = max Ifil < 10-14, (ii) perturbation parameter e = 10-8, (iii) restarting parameter 6 = 10-3. Ex.1 Consider the following system of nonlinear algebraic equations: fi(x) = x1 + x2 - 5 = 0,
(4.1.1) (4.1.2)
f2(x) = 5x1 - X2 - 3 = 0,
which have four solutions 11 as shown in Fig.4.1. Suppose now that a solution 1x* = (1, 2) is known, then we have new equations: + y22)1,2 =
(yl + 4y2 + y22)/(y2, + y2)1"2 = 0, (4.1.3) = 10y1 Y2 + 5y1 = 0. (4.1.4) g2(y) = f2(y) 91(y) = fi(y)/(y1
X2 x2
2T
f2 1.5
0.5 0 -0.5 -1 -1.5
- 1 -0.5 6 0.5 1 1.5 xI
Fig.4 . 1. Curves of fl(x) and f2( x) of Ex . 1. Fig.4.2. Curves of fl(x) and f2(x) of Ex.2. From Eqs.(2.10.1) and (4.1.3), let the billiard parameter a* = 4 and starting
from various sets of initial guesses (°y1i a*0y1)' for Eqs. (4.1.3) and (4.1.4), the values of (1y1i ly2)' and 1G at the first iteration are shown in Table 4.1, in which ite shows iteration numbers to attain the convergence criterion. From the table, it is easily seen that the new solution (-0.5856, -4.1414)' is obtained for wide ranges of °y1. From (2.15.1) and (2.15.2), we have the theoretical values of the first approximation (1y1i1y2)' = (-0.4105,-4.1051)' and the ratio 1y2/1y1 = 10, respectively.
1t
329 Table 4 . 1 Initial guesses and results at the first iteration(a' = 4)
yl
-10 -10°
-4*10' -4*10°
-10-1
-4*10-' -4*10-2 -4*10-3
-10-2 -10-3
Gr
y2 = a" yl
440.0 1.0 3.7350 4.0843 4.1192
-10-4
-4*10-4
4.1227
-10-5 10-5
-4*10-5 4*10-5
4.1231 4.1231
10-3 10-2
4*10-3 4*10-2
4.1235 4.1270 4.1619
10 -1
4*10-1
100 101
4*10° 4*10'
10-4
4* 10-4
4.1511
11.0 560.0
yI -6.3979
y2 -12.1884
y2/ yI 1.91
G 152.8753
ite 10
-3.7500 -0.3812
-4.8000 -3.3714
1.28 8.84
37.6125 0.7370
9 6
-0.3338
-3.2928
9.86
0.8044
6
-0.3290 -0.3346 0.3877 0.3881 -0.3345 -0.3279 -0.3230 -0.2742 0.2399 5.6145
-3.2856 -3.3460 3.8770 3.8815 -3.3458 -3.2838 -3.2753 -3.1965 -2.4022 5.5921
9.98 10.00 10.00 10.00 10.00 10.01 10.14 11.66 10.01 1.00
0.8105 0.7503 7.9373 7.9418 0.7504 0.8120 0.8193 0.8860 5.0889 208.1642
6 6 7 7 6 6 6 6 7 11
Starting from the initial guess (1.5,2.5)', the original Egs.(4.1.1) and (4.1.2) are solved by the algorithm in Sec. 3. The convergence tendencies to the solutions (i) lx* = (1,2)', (ii) 2x* = (0.4144, -2.1414)', ( iii) 3x* = (-0.3695, -2.3172)', and (iv) 4x* = (-1.0449, -2.3172)' are shown in Table 4.2. Table 4.2 Convergence tendencies to the solutions Sol.
(1)
(ii)
ite 0
G .575 * 10'
G .403 * 10'
G .177 * 10
1 2 3
.170 * 101 .418* 100 .905 * 10-2
.405 * 101 .873* 100 .996 * 10-1
.245 * 101 .397* 101 .393 * 101
4 5 6
.404 * 10-5 .795 * 10-12 converge
.288 * 10-2 .269 * 10-5 .239 * 10-11
.373 * 10-1 .121 * 10-3 .480 * 10-8
converge
converge
7
( iii)
(iv)
G converge
Ex.2 Consider now the following nonlinear equations given by (4.2.1) fi(x) = xi - 3x,x2 + x2 = 0, f2(x) = xi + x2 + sin x1 - 2 = 0, (4.2.2) whose curves are shown in Fig. 4.2 21. Starting from the initial guess (°yl,°y2) = (2,1), Egs.(4.2.1) and (4.2.2) are solved first by the e-secant method and the first solution is obtained, and then we apply sequentially the billiard method. The convergence tendencies to the solutions
330 (i) 1x* _ (0.4715,1.150)', (ii) 2x* = (1.0123,0.3565)', (iii) 3x* = (-1.568,0.7353)" and 4x* _ (0.4377, -1.1766)' are given in Table 4.3. Table 4.3 Convergence tendencies to the solutions Sol.
(i)
(ii )
( iii) (iv)
ite G G G G 0 .775 * 10.131 * 10.265 * 10.200 * 10 1 .593 * 101 .283 * 101 .250 * 101 .422 * 101 2 .232 * 101 .378 * 101 .633 * 102 .817 * 101 3 .485 * 100 .299 * 101 .193 * 102 .791 * 101 4 .451 * 10-1 .496 * 10° .132 * 101 .111 * 103 5 .592 * 10-3 .377 * 10-1 .824 * 10° .205 * 104 6 .107 * 10-6 .152 * 10-3 .926 * 10' .346 7 converge .966 * 10_8 .115 * 10-2 .930 * 10° 8 converge .484 * 10-6 .210 * 10° 9 converge .369 * 10-1 10 .134 * 10-3 11 .188 * 10-8 12 converge 5. Concluding Remarks In this paper, the billiard method to obtain sequentially multi-solutions of a system of nonlinear algebraic equations is proposed. Its mathematical properties in case of n = 2 have been discussed in details and a practical algorithm for numerical realization of the present method is also given. However it is not necessary to mention that the billiard method can be applied to a system of not only 2-dimensional nonlinear algebraic equations, but also general multi-dimensional ones. Two numerical examples are solved, and effectiveness of the present method are also shown. It is worth mentioning that the present method can be applied to multisolutions of (i) a polynomial and (ii) usual nonlinear equations, and to (iii) singular(multiple) roots 14.
6. Acknowledgements The author wishes to thank Mr.Masatosi Noguchi, Graduate school of Engineering, Gifu University, Gifu, JAPAN, for his valuable discussions and helps on this paper. 7. References 1. 0. Aberth, Iteration methods for finding all zeros of a polynomial simultaneously, Math. Comp., 27(1973), 339-344.
1 i
331 2. R. E. Bank and D. J. Rose, Global approximate Newton method, Numer. Math., 37(1981), 279-295. 3. D. W. Decker and C. T. Kelley, Newton 's method at singular points .I, SIAM J. Numer. Anal., 17( 1980), 66-70. 4. D. W. Decker and C. T. Kelley, Newton's method at singular points.II, SIAM J. Numer. Anal., 17(1980), 465-471. 5. D. W. Decker and C. T. Kelley, Convergence acceleration for Newton's method at singular points.I, SIAM J. Numer. Anal., 19 (1981), 219-229. 6. D. W. Decker , H. B. Keller, and C. T. Kelley, Convergence rates for Newton's method at singular points.II, SIAM J. Numer. Anal., 20 (1983), 296-314. 7. A. Griewank, Starlike domains of convergence for Newton's method at singularities, Numer. Math., 35(1980), 95-111. 8. A. Griewank and M. R. Osborne, Analysis of Newton's method at irregular singularities , SIAM J. Numer. Anal., 20(1983), 747-773. 9. L. V. Kantorovich and G. P. Akilov, Functional Analysis in Normed Spaces, Pergamon, New York, 1964. 10. T. Ojika, S. Watanabe, and T. Mitsui, Deflation algorithm for the multiple roots of a system of nonlinear algebraic equations, J. Math. Anal. Appl., 96(1983), 463-479. 11. T. Ojika, Modified deflation algorithm for the solution of singular problems I. a system of nonlinear algebraic equations, J. Math. Anal. Appl., 123(1987), 199221. 12. T. Ojika, An algorithm for the branch points of a system of nonlinear algebraic equations, Appl. Numer. Math., 4(1988), 419-430. 13. T. Ojika, M. Harayama, M. Noguchi, and T. Kato, Hybrid manipulation for the solution of a system of nonlinear algebraic equations, Res. Rept. Fac. Eng. Gifu Univ., 43(in print). 14. T. Ojika, M. Harayama and M. Noguchi, Billiard Method for Multi-Solutions of a System of Nonlinear Algebraic Equations (in preparation). 15. J. M. Ortega and W. C. Rheinboldt, Iterative Solution of Nonlinear Equations in Several Variables, Academic Press , New York, 1970. 16. L. B. Rall, Convergence of Newton process to multiple solutions, Numer. Math., 9(1966), 23-37.
332 17. L. B. Rail, Computational Solution of Nonlinear Operator Equations, John Wiley Sons Inc., New York, 1978. 18. G. W. Reddian, On Newton's method for singular problems , SIAM J. Numer. Anal., 15(1978), 993-996. 19. S. Watanabe, Hybrid manipulations for the solution of systems of nonlinear algebraic equations, Publ. RIMS, Kyoto Univ., 19(1983), 367-395. 20. S. Watanabe, T. Ojika, and T. Mitsui, On the quadratic convergence properties of the --secant method for the solution of a system of nonlinear equations and its application to a chemical reaction problem, J. Math. Anal. Appl., 95(1983), 69-84. 21. T. Watabe, M. Natori, and T. Kokuni, Numerical Computation Software(in Japanese), Maruzen,1989.
IF
WSSIAA 2(1993) pp. 333-344 ©World Scientific Publishing Company
T-stability of Numerial Scheme for Stochastic Differential Equations
Yoshihiro SAITO Shotoku Gakuen Women 's Junior College, 1-38 Nakauzura, Gifu-Shi 500, Japan and Taketomo MITSUI Graduate School of Human Informatics, Nagoya Univ., Nagoya 464-01, Japan Abstract Stochastic differential equations (SDEs) represent physical phenomena dominated by stochastic processes. Like as for deterministic differential equations (DDEs), various numerical schemes are proposed for SDEs. Recently we investigated stability notion of numerical solution for SDE. In this note we propose a new stability, T-stability, of numerical schemes for scalar multiplicative SDE with respect to its pathwise approximation. When the two- or three-point random variables are chosen as the driving Wiener process, we examine T-stability of the Euler-Maruyama scheme together with numerical results.
1. Introduction We consider stochastic initial value problem (SIVP) for scalar autonomous Ito stochastic differential equation given by J dX(t) = f (X)dt + g(X)dW(t), t E [0,T], (1) l X(0) = x, where W (t) represents the standard Wiener process and the initial value x is fixed. Some authors1, 4' 7, 8, 9, 10, 12, 13, 14, 16 propose various numerical schemes
for SDE (1), which recursively compute sample paths (trajectories) of solution X (t) at step-points. Numerical experiments for these schemes have appeared in literatures5' 1, 9, 10, 16In particular, some of them are devoted to the numerical stability2, 5, 10,11,1s However the numerical experiments which demonstrate stability are shown in a few papers 16,17 We proposed a stochastic version of the absolute stability analysis under the mean-square norm 14. Here for the purpose of discrimination we will mention it
333
334 as MS-stability which stands for the stability in Mean-Square sense. This idea is similar to the one by Petersen". However this is restricted in the weak stability analysis. Therefore we will propose another stability analysis for scalar multipicative SDE. This is the one with respect to the pathwise approximating feature of numerical solutions. Since in other words it will focus on their trajectory, we call it T-stability. Similar discussions can be seen in literatures2,16. However, because it is difficult to generally investigate the stability of numerical solution with arbitrary discritized Wiener process as driving process, as the first step, we analyse the Euler-Maruyama scheme equipped with two- or three-point random variables for the driving process. We will derive the region of absolute stability of the scheme together with the numerical results.
2. Euler-Maruyama scheme The Euler-Maruyama scheme for SDE (1) is defined as follows: X„+1 = X. + f (Xn)h + g(X.)AW, , (2) where h is the stepsize and LW„ is the random increment . The increment OW„ is known to be well approximated as AW„ = U„vlh-. (3) Here U. is the two- or three-point random variables satisfying either of the following probability distribution. i) Two-point random variables. P(U„ = ±1) = 1/2
(4)
ii) Three-point random variables. P(U,, = fv) = 1/6,
P(UU = 0) = 2/3
(5)
Note that this scheme is of first order in weak sense (Monte Calro Approximation), independent of the choice of i) or ii) as the driving process. Pardoux and Talay10 discuss the pathwise approximation of the Wiener process as well as its influence on the numerical solution. Of course the condition of order of convergence is satisfied for two- or three-point random variable. The scheme with three-point random variable is required when Lyapunov exponets of stochastic system is calculated16, 11. We will take the scheme (2) with (3) for the T-stability analysis because of its simplicity. 3. T-stability We consider the test equation with real numbers A and p J dX(t) _ AXdt + pXdW(t), t E [0,T] (6) l X(0) = 1,
335
whose exact solution is X(t) = exp{(A - 2µ2)t +µW(t)}.
(7)
From the qualitative theory of SDEs for (6) the zero solution X(t) - 0 is stochastically asymptotically stable in the large if A - Zµ2 < 0, and unstable if A - Zµ2 > 0. Thus we expect the solution brought by the scheme (2) satisfies these conditions.
In fact some authors 2, 10, 16 , 17 are treating numerical stability in a similar sense as above. Among them we proposed a linear stability analysis of numerical schemes for SDE with the multiplicative noise in the mean-square norm15, like as in numerical methods for ODEs. (Linear stability analysis of numerical methods for ODEs can be seen, e.g. in a literatures.) Because it is an analysis under the mean-square norm, hereafter we will mention it as MS-stability. In this case we need to estimate the quantity R. =ElXn12. However in a certain probability an approximate sequence of sample path Xn happens to overflow in computer simulations. This violates the evaluation of Yn. Thus we need to go further to another stability analysis. Here we raise up a stability notion with respect to approximate sequence of sample path (trajectory). That is, assume the condition A - Zµ2 < 0 and let {X.} (n = 1, 2, ...) be the sequece of approximate solutions of Eq. (6) by a certain numerical scheme. Strictly speaking, each X. should be denoted as X,,(w) with an event w, but we fix it. The numerical scheme equipped with a driving process said to be Tstable if the convergence JXnJ -+ 0 (n -+ oo) holds for the driving process which approximats the Wiener process.
Note that T-stability is weaker than MS-stability. Actually if a numerical sheme is MS-stable, it is obviously T-stable. 4. Region of T-stability of the Euler-Maruyama scheme
Next problem is how to get the criterion of T-stability. In case of the EulerMaruyama scheme (2) with (3) it can be done as follows. When we apply (2) with (3) to Eq. (6), it leads to Xn+1 =
(1+Ah+pUnvh)Xn n
ft(1+Ah+/U;vr1)Xo. -o Taking the mean with respect to (n + 1) time-steps, we obtain an averaged onestep difference equation of the form Xn+l = R(h; A, /.1)Xn• (8)
336 We will call R(h; A, µ) the averaged stability function of the scheme. The way to take the mean of course depends on the sequence {U„}. As for the calculation of the function, see the examples below. Clearly X„ --> 0 as n -* oo if IR(h; A, p)I < 1.
(9)
Hence the scheme is T-stable for those values of triplets (h; A, µ) satisfying (9). The region R given by R = {(h; A, µ); (9) holds} is analogously called the region of T-stability of the scheme. Depending on the selection of the driving process, we will study the T-stability and its region. i) Two-point random variables Considering on the probability distribution (4), it is enough to take two steps for averaging. Thus we have R2(h;A,µ) = (1+Ah+µ/)(1+Ah-µvrh_) = (1 + Ah)2 - µ2h
(10)
a)A=0,p>0. R2(h; 0, µ) = 1 - µ2h Thus IR(h;0,µ)I <1q0
(11)
The region of T-stability is shown in Figs.1 and 2. ii) Three-point random variables We take six time-steps for averaging because of (5). R6(h;A,µ) = (1+Ah+µ 3h)(1+Ah)4(1+Ah-µ 3h) = (1 +Ah)4{ ( 1 +Ah)2 -3 M2 hl a)A=0,p>0.
(12)
337 R6(h;0,µ) = 1 - 3µ2h Thus IR(h;0,p)l <14* 0
R6(h, k) = (1 + h)4{(1 + h)2 - 3kh}.
(13)
The region of T-stability is shown in Figs. 3 and 4. They show the existence of some minor isolated spots, which are however not effective in the stability analysis. 5. Numerical confirmations We will show the numerical results which confirm the analysis described in the previous section. We carried out numerical experiments for three types of test equation.
Type 1 J dX = 2XdW(t) X(0) = 1
(14)
Type 2
dX = -2Xdt + 2XdW(t) X(0) = 1
(15)
Type 3 J dX = 3Xdt + 3XdW(t) l X(0) = 1
(16)
First we will show the results in the case of selecting the two-point random variables as the driving process. Type 1 In the stepsizes h = 0.2 and 0 .4 the Euler-Maruyama scheme is T-stable , whereas in h = 0 .6 it is unstable. See Figs. 5a to 5c for numerical results. Type 2 The stepsizes h = 0.25 and 1 .8, which implies the pair (h, k) = (-0.5 , - 2) and
338 (-3.6, -2), respectively, make the Euler- Maruyama scheme T-stable, whereas the stepsizes h = 0.5 and 2 ( (h, k) _ (-1 , -2) and (- 4, -2), respectively ) do not. See Figs. 6a to 6d.
Type 3 The stepsize h = 0.2, which means ( h, k) = (0.6 , 3), lets the Euler-Maruyama scheme stable , whereas h = 0.5 ((h , k) = (1.5, 3)) does not. See Figs. 7a and 7b. Second , the three-point random variables is chosen for the driving process. Type 1 The stepsize h = 0.1 makes the scheme T-stable, while h = 0.2 does not. See Figs. 8a and 8b. Type 2 The stepsizes h = 0.25 and 0.4 ((h, k) _ (- 0.5, -2) and (-0.8, -2), respectively) make the scheme T-stable, while h = 1.0 does not. See Figs. 9a to 9c. Type 3 The stepsize h = 0.05 ((h, k) = (0.15, 3)) implies a T-stable scheme , while h = 1.0 does not. See Figs. 10a and 10b. 6. Conclusions and future aspects We introduce the notion of T-stability of numerical scheme for SDE , and derive its criterion for the Euler- Maruyama scheme with the two- and three -point random variables. It can be shown in the (h , k)-plane as the region of T-stability. We are further to examine our notion for other numerical schemes as well as for other Wiener random variables as driving process. To this end , it might be useful to interprete the Ito SDE into the Staratonovich one with a varibale transformation14
References 1. T.C. Gard, Introduction to Stochastic Differential Equations, Marcel Dekker, New York, 1988. 2. D.B. Hernandez and R. Spigler, Numerical stability for stochastic implicit Runge-Kutta methods, ICIAM'91 Abstracts, July 8-12, 1991.
3. J.R. Klauder and W.P. Petersen, Numerical integration of multiplicativenoise stochastic differential equations, SIAM J. Numer. Anal. 22(1985), 1153-1166 4. P.E. Kloeden and E. Platen, A survey of numerical methods for stochastic differential equations, J. Stoch. Hydrol. Hydraulics, 3(1989),155-178.
339 5. P.E. Kloeden and E. Platen, Higher order implicet strong numerical schemes for stochastic differential equations, J. Statist. Physics, to appear. 6. J.D.Lambert, Numerical Methods for Ordinary Differential Systems, John Wiley, Chichester, 1991.
7. H. H. Liske and E. Platen , Simulation studies on time discrete diffusion approximations, Mathematics and Computers in Simulation, 29(1987), 253260. 8. G.N. Mil'shtein , Approximate integration of stochastic differential equations, Theory Prob. Appl.,19(1974), 557-562. 9. N.J. Newton, Asymptotically efficient Runge-Kutta methods for a class of Ito and Stratonovich equations, SIAM J. Appl. Math., 51 (1991 ), 542-567. 10. E. Pardoux and D.Talay, Discretization and simulation of stochastic differential equations, Acta Appl. Math., 3(1985), 23-47. 11. W.P. Petersen, Numerical simulation of Ito stochastic differential equations on supercomputers, Proc. IMA Conference, 18-24th September 1985. 12. E. Platen, An approximation method for a class of Ito processes, Lith. Math. J., 21(1981), 121-133.
13. W. Riimelin, Numerical treatment of stochastic differential equations, SIAM J. Numer. Anal., 19(1982),604-613 14. Y. Saito and T. Mitsui, Discrete approximations for stochastic differential equations, Trans. Japan SIAM, 2(1992), 1-16 (in Japanese). 15. Y. Saito and T. Mitsui, Stability analysis of numerical schemes for stochastic differential equations, submitted. 16. D. Talay, Simulation and numerical analysis of stochastic differential systems, INRIA Report 1313, 1990. 17. D. Talay, Approximation of upper Lyapunov exponents of bilinear stochastic differential systems, SIAM J. Numer. Anal., 28(1991), 1141-1164.
340
h
k Fig 1 . The region of stability of (11 )
k
Fig 2 . The region of stability of (11)
h
k Fig 3 . The region of stability of (13 )
Fig 4. The region of stability of (13)
341
X
TEST 1 (h=0.2)
-I
0
I
2
3
4
5
t
Fig 5b
Fig 5a.
X
TEST1 ( h=0.6)
x
TEST2 ( h=0.25)
t
Fig 5c. Fig 6a.
342
X
x
TEST2 (h=1.8)
30
TEST2
(h=0.5)
200
20 100 ^ 10
-10
100 20 40 0 t t
0
Fig
X
TEST2
6b.
(h=2.0)
40
10
Fig
x
6c.
TEST3
( h=0.2)
2
1
30 20 10 0 -10
0 0
10
20 30 40 t
0
2 4 6 t
Fig 6d. Fig 7a.
8
10
343
X
TEST3 ( h=0.5)
100-1 80-I 60 I 40 20
10
IY
15 20 25 t
x
TEST I ( h=0.1)
2n
30
0 2 4 t
Fig 7b.
X
6
8
10
Fig 8a.
x
TEST1 (h=0.2)
TEST2
(h=0.25)
20 , 15^ 1
0^ 5y
-1o
T_. - T
0
10
20 It
5
30
Fig
8b.
Fig
10 It
15
9a.
20
344
X
x
TEST2 (h=0.4)
TEST2 ( h=0.8)
30
I
20 10^
I -10.
5
10 15 t
Fig
X
TEST3 ( h=0.05 )
-30 20 0 t
9b.
10
Fig
x
20
30
9c.
TEST3 ( h=0.1)
30
1.5
20 1.0 10-^ 0.5 -10^ 0.0 -20 -0.5 0
I
2 t
- 30 3 0 t
5
10
Fig 10a. Fig 10b.
WSSIAA 2(1993) pp. 345-354 ©World Scientific Publishing Company
SHAPE PRESERVING APPROXIMATION BY RATIONAL SPLINES
Manabu Sakai Department of Mathematics, Faculty of Science University of Kagoshima, Kagoshima 890, Japan
and Raiz A. Usmani Department of Applied Mathematics, University of Manitoba, Winnipeg, Manitoba, R3T 2N2, Canada ABSTRACT By use of simple rational splines of the continuity class C2, we consider a problem of shape-positivity , monotonicity and convexity -preserving approixamation to the data which can or cannot be represented as a single-valued function . No efficient shape preserving approximation to the plane data has been known so far . Some numerical examples are given to show usefulness of our methods.
1. Introduction and Description of Methods Recently a considerable number of papers have appeared which are concerned with shape preserving approximation; see [5], [6] and the references therein. In this paper, twice contiouosly differentiable splines are of interest since they are sufficiently smooth for our eyes. Let the data points (xi , yi ), j = 0(1) n on the partition: x0 <xl < - - - <xn be given. First we consider the case when the data can be represented as a single-valued function.Then it is shown that there is a rational spline which preserves the positivity, monotonicity and convexity at the same time provided that a rationanilty parameterp is greater than and sufficiently close to - 1. We use the following simple rational spline s proposed in [5]: (i) S E C2[XO,xn]
(1)
(ii) on each [x3, xj+l], s(x) is a linear combination oft, r, t3/ (1 + pt) and r3/(1+pr) withx =xj +thj (0 s t s 1 ),r =1-t, hj =xj+l -xj and the rational parameter p ( > -1), i.e., s is of the form s(x) = ajt + bjr + cjt3/(1 + pt) + djr3/(1 + pr)
with real constants aj, bj, cj and dj where forp = 0, the above s reduces to the usual cubic spline. For the given data (xj, yj), j = 0(1) n, we take a splines of the form (1) so that (2) sj(=s(xj))=yj,j = 0(1)n. For later use, we introduce the following quantities As(s),Bj(s), j = 0(1) n -1: (i) Aj(s) _ (2+p)sj+1 + (1 +p)sj - (3+ 2p)s[xj,xj+i]
345
346 (3) (ii)
Bj (s) = (2+ p)sj + (1 +p)sj+i (3 + 2p)s[xj,xj+i]
where s[...] means Newton divided difference of s, for example, s[xj,xj+1] = (sj+l - sj)/ (xj+1 -Xj).
Since s depends on n + 3 paramerters in view of 1(i), there are two additional conditions to (2) required for a unique determination of it . Here we take these to be end ones: (4) Bo(s )= An-As) =0Note that conditions (4) are equivalent to do = cn-1= 0, i .e., they correspond to the socalled natural end ones for the cubic spline function . Then one gets Theorem 1. For p greater than and sufficiently close to - 1, the rational spline s of the form ( 1) is uniquely determined by (2) - (4), and s(kl(x) > 0
on (xj,xj+1)
( k =0,1,2)
if Yi, Yi+12 0, Yj + Yi+ 1 > 0 (k = 0 ) ;y[xj, xj+ 1] > 0 (k = l) ;Y[xi - 1,xj,x1+1] Y[xj, xj+I,xj+2] a 0,Y[xj-l,xj,xj+l] +y[xj, xj+1,xj+21 > 0 (k = 2 ), respectively. In addition,
s(k>(xj) > 0 (k = 1, 2 )
if hj-ly[xj, xj+1] + hjy[xj- 1, xj] > 0 (k =1) ;Y[xj- l,xj,xj+l] > 0 (k = 2 ), respectively. That is, the rational interpolant subject to (2) and (4) strictly reflects the local behaviour, i. e., positivity, monotonicity, and convexity of the given data at the same time. Remark. If y[xi- 1, x3, xj+1] and y[xj, xj+1, xj+2] have the opposite signs, then the interpolant has only one point of inflection in (x1, x j+1) for p greater than and sufficiently close to - 1. Next, we consider an application of the above spline of the form (1) to convexity preserving approximation of plane open or closed data which cannot be represented as a single-valued function. Now for the data (xj, y1), j = 0(1) n chosen in order, splines u and v of the form (1) can be fitted to the points (r,, x1) and (rj, yj), respectively, i. e., (5) ui (= u(ri )) = x1 and vj (= v (rj)) = Yi , j = 0(1) n wherero =0, Arj={(Axj)2+(Ay1)2}. For a unique determination of u and v, we take the following end conditions: (i) for the closed data , i . e., (xo, yo) = (xn, yn), 4) = u(k)"(k) =vnkl
(k =0,1,2);
(6) (ii) for the open data:
Bo(u) = Bo (v) = 0, An-1(u) = An-1(v) = 0. Then, the resulting curve: (u (r), v (r)) (0 s r s rn) is the parametric rational spline. Now one gets
347 Theorem 2. For p greater than and sufficientlty close to - 1, the splines u and v of the form ( 1) are uniquely determined under (5) - (6), and if (7) x[-ri,Ti+11y[rj-1,rj,rj+I] -x[tj-1,z1,'ri+1]y['ri,rt+1] > 0 (i =j, j+1) then
u'(r)v"(s)- u"(x)v'(i) > 0 on [ij,ri+1] i. e., the parametric rational spline subject to (5) - (6) is locally convex on [ri , Tj +11 where for the closed data 6 (i), (xi, yi) = (xn+j, Yn+j ) (j = -1, 0, 1 ), and for the open data 6 (ii), the above convexity condition (7) is not required when (i, j) _ (0, 0) and (n-1,n).
2. Proof of Theorems 1 and 2. Before we proceed with analysis, we shall require the following two lemmas. Lemma 1. For the spline s of the form (1), we have the following consistency relation : Sj+1 2+p )!
1
sj-i
hj + (1+p)^ hj +hj- l + hj-1 - (2+p) (s[xj,xj+1] + s[xi-l,xi] (1+p) h1 hi-1
j =l(1)n -1.
Lemma 2. Lets be of the form (1) subject to (2) and (4). Then, with 6 =1 + p Aj(s) = hjy[xj,xj+l,xj+2] + 0(S ), j =0(1)n - 2; Bi(s) = - hjy [xi- l,xj,xj+1] + 0(6 ), j =1(1) n -1. Proof of Lemma 2. Note that the equations derived from the consistency relations in Lemma 1 and the end conditions (4) are now rewritten as 2 AO 0 ••• 0 0 0
so
do
N1 2 Al ... 0 0 0
s1
dl
0 M2 2 ••• 0 0 0
S2
d2
0
0
0
2 An-2 0
Sn-2
0
0
0
Nn-1 2 An-1
Sn-1
do-1
Sn
do
(8)
0
0
0
N,
2
do-2
348
where
+p) - 2(1 +p) /(1+_L ,o = ,un = 2(l (2 +p) (2+p)hj-1 hj hj-1 µj +Aj=
dj
do
(2+p),j=1(
1)n-1, and do =2(2+p) Y[xo,x1],
+ -1-)}
-12(3 +2p)^(1 (Y[xj,xj+1] +Y[xj -1,xjl (2+p) h1 hi - 1 hi hj-1 = 2(2+ p)
1(1)n-1
[xn-lxn]-
Note that the above coefficient marix is equal to the well-known one for the cubic spline ([ 1]).Since the coefficient matrix of the above system of equations is diagonally dominant for anyp > -1([1]. p. 21), it is nonsingular and, in addition, lettingp -0 gives
so = Y[xo, X11 +O(b )
(9) s. =(y[xj,xj +1] hj
Y[xj-1,xj]
)A(1 +
hj_ 1 hj hj-1
+O(6),j =1(1)n-1
Sn = Y[xn - l,xnl + O(t )
from which follow the desired asymptotic expansions for A1(s) and Bj(s)_ Now we are ready to prove Theorem 1 . First we note that, on [xj, xj+1] (10)
s(x) = yjr + yj+
lt - (3 + 2p
) j{Aj(s) B(t) - Bj(s)B(r)}
(x = xj + t hi , 0 s t s 1 and r= 1- t) where (1 +pt)O(t) = t r(1 + 6t) (6 = l +p ) Because of s, , = 0(1) ( 6 -. 0) by use of lemma 2, the positivity of s on (xj, xj+1) is equivalent to (11)
(1 + pt)(1 +pr)(yjr+yj+it) + trO(S )> 0 for 0
Since (1 +pt)(1 +pr) s (2+p)2tr (> 0) (0 < t < 1 ), the above inequality (11) is valid for sufficiently small positive 6 if (12) Yj, Yj+1 a 0 and Yj +Yj+1 > 0 ([41).
349 Next, from (10) one (13)
gets
f
l
his'(x) =yj+1-Yi - (3+ )L`i{Aj(s)6 (t)+Bj(s)B (r)}
where (1+pt)20 (t)=(1-2t)(1+6t)(1+pt)+t(1-t).
Since 0 (t) - . 1 asp -^
1 + fora fixed t (0 < t < 1), the monotonicity of son
(xi, Xj+1) , j = 0 (1) n is equivalent to (14) y[xi,xi+1]+O(S)>0 (0
y[xi, xj +1] > 0.
( 1 5)
In addition, since 0 (0) = 1 and 6 (1) (1 + 1/6 ) , s'(xi) > 0 is equivalent to y[xi, xj+1] -hi-jy[xj-l,xj,xj+1] + O(8) > 0
(16)
which holds true for sufficiently small 0 provided that (17) hi-1Y[xi,xi+l] +hiy[xj-l,xi] > 0. Similarly, the convexity of s on (xi, xi+1), i. e., s '(x) > 0 on (xi, xi+1) if and only if (18)
Ai(s)t(1 + pr)' - B^(s)r (1 + pt)3 > 0
for 0 < t < 1
with (1 + pt)3 0'(t) = - 20 t(3 + 3pt + p 2t 2) . On using the asymptotic expansions for A1(s) and B1(s) given in Lemma 2, the above inequality ( 18) is valid for sufficiently small S provided that
Y[xj-1, xi,xi+l], Y[xi,xi+l,xi+2] Z 0
(19) y[xj-1, xi,xi+l] +y[xj,xj+l,xi+2] >0. This completes the proof of Theorem 1. Before we prove Theorem 2, we shall require the following inequality: Lemma 3. With a positive constant C ( < 6) independent of t, {6"(t)+0" (r)}+0(8){0"(t)0(r)+0 (t)0 (r)} s-C ( 0 st s 1, r =1- t Proof of Lemma 3. By a simple calculation , the above inequality is equivalent to t(1 + pr)2(3 + 3pt + 3p2t2) + r(1 + pt)2(3 + 3pr + 3p2r2)
+O(S)
[t(1 +pr)(3+ 3pt+ 3p2t2){(1 +pr)(1-2r)(1+ 6r) +rt}
350 + r(1 + pt)(3 + 3pr + 3p 2r2) {(1 + pt)(1- 2t)(1 + 6t) + t r } ] (20)
2 C(1+pt)3(1+pr)3
(0sts1,r=1-t).
Here, fort E[0,1] and r =1- t
{t(1 +pr)2(3+3pt+p2t2) +r(1 +pt)2(3+3pr+p2r) } { t(1 + pr)3 + r(1 + pt)3 }
34-(t4+ r4) z 32'
2 4
(21) (1 +pt)3(1 + pr)3 s (1 +2)6 =A- +00). 64
Therefore, we have the desired inequality with a positive constant C (< 6 ). In addition, we note that (i) for the open data 6 (ii), the determining equations for the splines u and v are of the form (8), and that (ii) for the closed data 6 (i), the equations for u (or for v , x is to be replaced by y ) are given by
0
...
0
0
Al
S1
dl
µ2 2
A2
•••
0
0
0
S2
d2
0 µ3
2
...
0
0
0
S3
d3
0
0
0
...
2
0 0
0
...
µn -1
0
0
...
2 Al
(22)
where
0
4 -2 0 2 µn
An-1 2
2(1 +P) µ^ = 1 +.. µf + A1= 2 (1 + p (2+p),i-1 T! Ii-1 (2+p)
dt _{2(3+2p (2+p)
1+x[^i"S+1]+ x[51,^i
do-2
Sn-2
do-1
Sn-1 L Sn
L
do
= 1(1) n , and
]),j =1(1)n ( T. =To ).
Note that the coefficient matrices for determining u and v are exactly the same, i . e., v is easily calculated with additional effort and that they reduce to the well-known one for the periodic cubic spline as p -^ 0 ([1]). Similarly as in proof of Lemma 2, we can easily check the asymptotic expansions for A1(u),Bi(u),Aj(v) and Bj(v) of the same forms
351 for As(s) and Bj(s) given in Lemma 2, where for the closed data, j = 0(1) n -1 and for the open data, j = 1(1) n - 2.
We are now ready to prove Theorem 2. By a simple calculation, on
['y,
+i]
hj{u'(r)V'(i)-u"(i)v'(r) } = {x[Tj,ij+1] - 6 {A1(u) 0(t) +Bj(u)0 (r)}x 1 +25
{Aj(v)e "(t)+Bj(v)B"(r)} - {y[i5,i+11 -
6 {Aj(v)B'(t) +B;(v)e'(r)}x 1+26
{Aj(u)B"(t)+Bj(u )B "(r)} { Aj(v)x[ti,ii+l] -Aj(u)y[ii,zi+ l]}0"(t)
(23) -{Bj(u)y[li,li +i] - Bj(v)x[ ij,rj +1]}0"(r)+o(6) {0 "(t)0 (r)+
e '(t)0 (r)}
(r=r1+tArj,Ost slwith r=1 -t andh1 =At ).
By making use of Lemmas 2 and 3, one gets u'(r)v "(r) - u "(i)v '(r) >0 on [Ty, rj+1]
(24) if
x[sj, y+1,i + 2]y[r1,ri + l l
-Y[ rj, rj+1,rj +2]x[rj,Ii +1] > 0,
(25) x[ri-1, y,'S +11Y[15,^9 +1] - Y[ri-1, 5,i+ 1]x[rj,ri+1] > 0,
where for 6 (i), j = 0(1) n - 1, i. e., on [io,rn] and for 6 (ii), j = 1(1) n - 2, i. e., On [r1,rn
-11
For the open data, on [r o, r 11 and [ r „ . ], r„] , from the end conditions 6 (ii) one gets
ho{ u'(r)v "(i) -u "(r)v'(r) } =- {A o(v)x[ro,r i 1- Ao(u)y [ro,r 1 1} 0(t) ( r=ro+tAro, 05t sl with ho =Aro), (26) hn-1{u ( Z)V"(r)-
(r=
rn.i
u "(i)V (r) } = - {Bn - i(u)y[ rn ,rn +1]- Bn -1(V)x[rn , ln +1] }0(r)
+tA
rn-1 ,0st
Bland r =1-t with hn-1=
din -1).
Hence, by use of the asymptotic expansion in Lemma 2, we also have the desired result on [r o , r 11 and [rn-1, rn ] for the open data.
352 This completes the proof of Theorem 2. 3. Numerical Illustration For calculation of the spline interpolant s, first determine unknowns s'i by use of the consistency relations given in Lemma 1 and the end conditions (4) or (6), and then we have only to use (10). On a uniform partition of [0, 1] with knots xi (= jh, nh = 1), by means of the consistency relation in Lemma 1 t
(27) 6{ei+1+21+ 3 +p p i+et-1}
phn 2
18(1+p
i3) +... j=1(1)n -1
where e = s - f for yi = f (xi) with f E C4[0, 1].
Hence, we have the asymptotic expansion of the error at the mesh point xi bounded away from the both end points x = 0 and 1: (28) Si - fi = -
i3)+... (p"0). ph2 6(1 + 2p
This shows that the absolute error would be smaller for a smaller value of I , while the method withp = 0 (the well-known cubic spline interpolation method) might not always give the shape preserving interpolant. Therefore, in practical application, it would be sufficient to decrese the rationality parameterp, starting at zero, until the curve would be satisfactory ([5]). In order to illustrate the above stated methods,we take the following four examples where the data points are denoted by the solid circles. In Figures 1- 2, the data is obtained from a quadratic polynomial y = 1/(2 -x)2 (X = 0, 1, 1.7, 1.8 ). In Figure 2, the data is taken to be the plane one, i. e., the parametric rational spline is used. In Figures 3 - 4, we take the random data and the one from Cardioid r = a (1 + cos 6) (a > 0 ), respectively. Numerical results show that our methods give visually pleasing curves for the data which can and can not be represented as single-valued functions. Acknowledgement . Part of this paper was written while the first author visited the Department of Applied Mathematics,University of Manitoba,whose support he gratefully acknowledges. References 1. J. Ahlberg., E. Nilson and J. Walsh: Theory of Splines and Their Applications, Academic Press, New York, 1967.
2. H. Hanna, D. Evans and P. Schweitzer: On the approximation of plane curves by parametric cubic splines, BIT 2 6 (1986), 217 - 233. 3. W. Hoskins and H . Sager: Spline Algorithms for Curves and Surfaces ( translated from the German [ 5] ), Utilitas Mathematica, Winnipeg, 1974. 4. M. Sakai and J. Schmidt: Positive interepolation with rational splines, BIT 2 9 (1989), 140 - 147. 5. H. Spath: Spline-Algorithmen zur Konstruktion glatter Curven and Ffachen, Oldenbourg-Verlag, Miinchen, 1973.
353 6. H. Soh: Eindimensionale Spline-InterpolationsAlgorithmen,Oldenbourg-Verlag, Miinchen,1990.
FIG. 1
p=0
p = - 0.9
FIG. 2
354
FIG. 3
FIG. 4
I
I
t1
WSSIAA 2(1993) pp. 355-371 ©World Scientific Publishing Company
Multivariate Polynomial Equations as Matrix Eigenproblems Hans J . STETTER Inst . f. Appl. & Numer. Math. Technical University of Vienna A-1040 Vienna, Austria Abstract In this paper , we give an overview of how systems of multivariate polynomial equations may be expressed and solved as matrix eigenproblems . The eigenproblems are determined by the multiplication tables of the residue class ring associated with the polynomial ideal generated by the polynomials whose joint zeros are to be found. We consider the case of zero manifolds as well as that of isolated zeros . Various technical details will be treated in separate papers.
1 Introduction The solution of systems of multivariate polynomial equations was the driving force in the development of algebra from early times to the first decades of this century; cf. textbooks like Perron's Algebra [1] or others from that time. Then a different concept of algebra took over and the interest in polynomial systems vanished. It has largely been due to the arrival of computer algebra systems that a renewed interest in this classical subject has arisen in a small faction of algebraists within the past 10 years. Quite recently, a huge European project called POSSO (POlynomial System SOlving) has been initiated for the study and implementation of algebraic approaches to the constructive solution of multivariate polynomial systems. Meanwhile, polynomial systems have been solved numerically through homotopy (continuation) and local linearization methods; cf. any more advanced textbook of Numerical Analysis. However, a good deal of mathematical structure remains unattended in these approaches; furthermore, it has to be commonly assumed that the number of equations equals that of the unknowns, which is not at all a natural restriction in polynomial systems. Also it may be quite difficult - or at least time-consuming - to obtain all solutions, even for not so large systems. Therefore, it appears important to utilize the algebraic structure of a polynomial system as far as possible. Through an attempted stability analysis of Buchberger's algorithm, W. Auzinger and the author of this paper were led to the design of an elimination algorithm which aimed at the solution of a polynomial system via its reformulation in terms of a matrix eigenproblem; see [2]. Since we were not able to make the algorithm work for all systems with only isolated zeros, we aborted the effort; a rudimentary report [3] was circulated but not published. Motivated by our paper [2], Wu Wenda and his students also attempted to overcome the difficulties described in [3]. By clever considerations , they have succeeded in pushing the limits of the elimination algorithm a bit further ([4]); but they, too, did not achieve a breakthrough. At a recent meeting in Beijing (July 1992), Wu Wenda and the author came to the
355
356 conclusion that the desired algorithm would have to incorporate all algorithmic techniques of Buchberger 's algorithm ; besides it has to start with an extremely large matrix in nontrivial applications . Therefore , it appeared wiser to us not to compete with Buchberger's algorithm but to concentrate on the elaboration of our eigenproblem approach, beginning from a Groebner Basis form of the multivariate polynomial system . This path has been followed in this paper. In his attempt to generalize the elimination procedure of [2], Wu Wenda gained another fundamental insight: He found that the transformation of a polynomial system into a matrix eigenproblem could be extended to the case of systems with zero manifolds; cf. [4]. There, the eigenproblems become singular eigenproblems for rectangular matrices; such eigenproblems have been studied by Kronecker a century ago (see e.g . [5]). They have parametrized solutions which may represent zero manifolds of positive dimension. This approach has been further developed by the author; a general account of it is given in section 4 of this paper. In section 2 of this paper, we will explain how the eigenproblem approach to the solution of multivariate polynomial systems is most natural if one shifts the attention from the consideration of the ideal generated by the set of polynomials to the residue class ring modulo this ideal. Section 3 will be devoted to a further analysis of this approach in the case where the system has only isolated solutions (i.e. the ideal is 0-dimensional ). The consideration of the inverse problem - given the zeros, find the polynomials - gives further insight into the situation. The simultaneous determination of zero manifolds and isolated zeros from associated singular matrix eigenproblems will be explained in section 4. The algorithmic details must be reserved for a separate report to keep this paper at a reasonable length. Finally, I would like to emphasize that I have regarded the solution of multivariate polynomial systems as a constructive numerical problem throughout ; this is the aspect in which I have been interested above all. I am sure that my unforgettable friend Lothar Collatz who happened to attend my first presentation of the material in [2] at a major conference (Singapore, Spring 1988), would have enjoyed the intuitive but nontrivial use of ideas from pure mathematics for the solution of a numerical problem described in this paper. Therefore, this volume in honor of the memory of L. Collatz is a good place for its publication.
2 Representation of a Polynomial System as a Matrix Eigenproblem Let IP' be the ring of all polynomials in s variables , with complex coefficients. We consider the following problem: (PZ, polynomial zeros ): Given a set F of m polynomials f„ E IP', determine the set T C C' of all joint zeros t' = (ti, ... , t;) of the f,,. In other words, we want to find all solutions t' E C' of the multivariate polynomial system of equations
357
fl(xl , ..., x,)
1
f(x)= ... fm(xl,
=0.
(2.1)
..., x,)
It is well-known that T is also the set of all joint zeros of all polynomials in the polynomial ideal.F which consists of all polynomial combinations of the fµ: .F := span {fl, ... , f} := {f E ]P' : f = E cµ f,„ f„ E F , cµ E 1P'} . (2.2) µ=1
Thus F may also be characterized by .F={pEIP':p(t)=0fortET}. (2.3) Therefore , for the determination of the zero set T, the set F may be replaced by any polynomial set G = {g„ E 1P'} such that .F = span {91,•••,9k} ; (2.4) the system of equations 91(x1, ..., x,) g(x)
=
...
9k(x1 , ...,
=
0
(2.5)
x,)
is then equivalent to the system (2.1): Each solution of (2.5) is a solution of (2.1) and vice versa. The transition from (2.1) to an equivalent system (2.5) which admits an easier determination of T has always been the fundamental algebraic approach to the constructive solution of (2.1). E.g., if the f„ are linear polynomials, (2.1) is commonly transformed into an equivalent triangular system (2.5).
Proposition 2.1: (2.5 ) is equivalent to (2.1) if 9(x) = C(x) . f(x)
(2.6)
where the k x m polynomial matrix C satisfies rank (C(t*)) = m for all zeros t` of g. (The rank condition is not necessary as the set F of the f,, may be redundant.) In ]P', let R = IP' /.F be the residue class ring mod F. R is a vector space over C whose dimension n is finite if F is 0-dimensional, otherwise it is infinite. There exist representations R of R through bases of power products (PPs) or monomials xj := x11 ... x;• , j E ]N' (the set of s-tuples of non-negative integers). Let Z = {x'(" , j(v) E IN' , v = 1(1)n} (2.7) 'Wherever n denotes the dimension of R, v = i(1)n denotes an infinite sequence if the dimension of R is infinite.
358 be such a PP -basis of R; then the residue class mod .T of each p E IP' has a unique representation n
p(x)
F b„x3(" mod F, b„ E C . (2.8) V=1
In the representation it, the multiplicative structure of the ring R is described by multiplication tables containing the coefficients a(a) E C of the representation (2.8) mod F of the products x,.xj(') ,a = 1(1)s: n
X, * xj(.)
> a^ a xj(") mod .F , v = 1(1)n .
(2.9)
A=1
These a, a permit the recursive computation of the b„ in (2.8) for any p E 1P'. For a convenient handling of the multiplication tables, we denote by Z(x) the polynomial n-vector of the PPs xj(') in the set Z in some specified order, and let the matrices A, (a(') E Cnxn Then (2.9) becomes x,•Z(x)=A,•Z(x)mod.F, a=1(1)s.
(2.10)
If F is of a positive dimension d and dim(R) = oo, we will only use finite sections of the multiplication tables (cf. section 4) so that (2.10) will have a standard interpretation even in this case.
Since a basis Z and the associated multiplication table matrices A„ a = 1(1)s, characterize a residue class ring R in IP' completely, and since there is a one-to-one correspondence between an ideal F C IP' and its residue class-ring R = IP' IF, the set of equations (cf. (2.10)) (A, - x,z)Z(x) = 0 , a = 1(1)s , (2.11) must be equivalent to (2.1) : Each t' E C' which satisfies (2.11) must also satisfy (2.1) and hence be in the zero set T of (2.1), the converse is trivial by (2.10). From the special structure of (2.11), we obtain our fundamental result: Theorem 2 .2: For each zero tµ of (2.1), the vector zµ = Z(tµ) E C' is a joint eigenvector of the matrices A„ with eigenvalues (tµ), resp.; conversely, upon suitable normalization, each joint eigenvector zµ E Cn of the A, , with resp. eigenvalues (tµ)„ yields a zero tµ of (2.1) via the interpretation
Z(tµ) = u , (tµ),, = (1,),, •
(2.12)
The request for joint eigenvectors is not really restrictive: Theorem 2 .3: The multiplication table matrices A, of (2.10) commute. Proof: For some a1, a2 E {1, ... , s}, (2.9) implies A„A„Z(x) __ A,IA,2Z(x) mod F ;
since Z is a basis of R this proves the assertion. ❑
2.13)
359 Corollary 2.4: If some A„ is nonderogatory, its eigenvectors ztf are also eigenvectors of all the other A„. Corollary 2.4 shows that - under the specified mild restriction - one of the sets of equations (2.11), i.e. one of the eigenproblems, is already equivalent to (2.1). Example 2.1: Consider the system (m = 3, s = 2) /i(xi,u) /afxi.u) /s(n,xj)
= 3x5x3+ 9 x ? + 2 x i i j + 5zi + r a - 3 = 0 , = 2x?xj-6x?-2i?-Xixj-3xi-x2+3 =0, = x ? u + 3xf + i i i 2 + 2 x ? = 0 .
Foi total degree ordering with x 2 <xi, the associated Groebner Basis system (2.5) is Ji(xi,X2) SJ(II,IJ) »3(xi,xj)
= = =
xj-4xi-|xj-| XiXj+xi-u+3 x? - | n + xj - 3
= 0, = 0, = 0.
By Theorem 3.1, Z = { l . x j . i i } is a PP-basis for the associated residue class ring and multiplication by xi mod T is specified by
"(;•)■(;':
YK;)"" 1 "-
Since A\ is nonderogatory, the eigenproblem (Al-z1I)Z(z) = 0
(2.14)
is equivalent to the systems /(x) = 0 and g{x) = 0 above. Q
Note that many of the a x n equations (2.11) may beempty becausex <7 tx J = x3 € Z so that there is only the one non-vanishing element 1 in the (i/,i/)-position of Ac. However, in the interpretation (2.12), the associated relations between T^ and the (f^)„ guarantee that the components of Tu are consistent in the sense that they may be interpreted as PPs x'i") of the (<;)„.
3 3.1
T h e Computation of Isolated zeros T h e C o m p u t a t i o n a l G e n e r a t i o n of t h e E q u i v a l e n t E i g e n p r o b l e m
There exists a well-established algorithmic approach to the generation of a PP-basis Z for a representation R of the residue class ring 72 associated with the ideal T spanned by a given set F of multivariate polynomials /„ € P ' , viz. via the generation of a reduced Groebner Basis G = {gx>K = l(l)fc} for T. The polynomials gK in a Groebner Basis G span T (cf. (2.4)), and they satisfy certain irreducibility conditions, cf. e.g. [6] and [7]. For a specified total order in the set of all PPs in P ' which is consistent with multiplication, there exists a unique Groebner Basis G for the ideal span{/^}; the gK of this set G may be computed from the coefficients of the /,, by a finite algorithm, the so-called Buchberger algorithm (see e.g. [7]). This algorithm is essentially an elimination algorithm; it uses only rational operations. Hence, if the coefficients of the /M are rational (real or complex) numbers,the coefficients of
360 the g" will also be rational. All existing implementations of the Buchberger algorithm assume this case and employ exact rational arithmetic, with potentially arbitrary numerators and denominators. In the following, we will assume a specified admissible order within the set of PPs which is kept fixed (unless stated otherwise ). Such an order defines a total order in the set IN' of s-tuples of nonnegative integers which arise as exponents in the PPs x' of IP'. This order is therefore consistent with addition, we will denote it with <: 0<j for each j # 0;
j(1)<j(2) = j + j(1)<j + j(2) . (3.1)
Besides this total order, we will also use the standard componentwisr the IN': j(1) < j(2) q j(1) < j(2) , a = 1(1)s .
"-l ordering < of (3.2)
For each polynomial p E IP', we define its leading PP as that PP whose exponent j is largest in the sense of <. Theorem 3 .1 (Buchberger [7]): Let xjl"l,r, = 1(l)k, be the leading PPs of the polynomials g" in the Groebner Basis G of the polynomial ideal F C ]P. Then Z := {xj : xj is not a multiple of some xj("), it = 1(1)k} (3.3) is a PP-basis for a representation R of the residue class ring R = ]P' /.F. The associated multiplication table matrices Ao,a = 1(1)s, are obtained by the reduction to normal form mod G of all products xo * x', x' E Z, which are not in Z. Since a normal form algorithm also uses rational operations only, the Aa for a system F with rational coefficients will be rational. Corollary 3.2: Let j("),r. = 1(1)k, be the exponents of the leading PPs in G (leading exponents), then J:={jEIN' : x3EZ}= IN'- U "-1(1)k
Corollary 3.3: If and only if Z contains a pure power xo for each a E {1, ... IS}, the ideal .F is 0-dimensional and n = dimR = IZI = JJI is finite. Example 3.1: Cf. Example 2.1. There we had used (3.3)/(3.4) to determine the PP-basis Z for the residue class ring.
The PP-basis Z which is defined by (3.3) may lead to multiplication table matrices A, for which the eigenproblems (2.11) are very ill-conditioned. As has been shown in [8], this is likely to happen when a lexicographic ordering has been chosen for the total ordering of the PPs. But even for a total degree ordering this may happen when the given polynomial system is actually a `perturbation'F of a nearby system F which leads to a degenerate situation or to a more degenerate situation than the given system F; cf. [8]. In this case, the PP-basis Z associated with the nearby degenerate system F should be used; naturally, the multiplication table matrices must still be formed modF.
361 In [8], it has also been indicated how this near-degeneracy of a given polynomial system F may be detected during the execution of the Buchberger algorithm. Further research in this direction may lead to safe floating-point implementations of the Buchberger algorithm and of the generation of the multiplication table matrices. For the purpose of this paper, we will assume that we can determine a PP-basis Z of size n and the associated set of n x n-matrices Ao such that the polynomial system (2.11) is equivalent to the original system (2.1) and the related eigenproblem is well-conditioned. This includes the assumption that we use a total degree ordering of the PPs.
3.2 Relation to Multivariate Polynomial Interpolation In the case of a 0-dimensional ideal F where the system (2.1) has only isolated solutions, a good deal of additional insight in the constructive solution process may be gained from a consideration of the inverse problem (PI, polynomial interpolation) to our given problem (PZ, polynomial zeros): (PZ) Given: a set F C IP' which spans a 0-dimensional ideal F Find: the set T = {t*,, v = 1(1)n} C C' such that p(t;,)=0,v=1 (1)n, for each pE.F (PI) Given: a finite set T = {t*^,v = 1(1)n} C C' Find: a set F C 1P' such that (3.5) holds for T = span F Thus, in (PI), .F consists of all polynomials which satisfy the homogeneous interpolation problem (3.5); cf. also (2.3). Remark : The duality (PI) q (PZ) may be extended to Hermite interpolation and multiple zeros resp. cf. [9]. For the sake of simplicity, we restrict ourselves to the case of (3.5) and simple zeros at present; some generalizations will be considered in section 3.4. It is easy to see that, for (PI), the residue class ring R = ]P' /F, in its representation by a PP-basis Z C IP' and the associated multiplication tables A,,a = 1(1)s, plays the same fundamental role as for (PZ): Take a set Z of PPs xj('),v = 1(1)n, such that spanZ is a correct interpolation space for T which means that Find q E spanZ such that q(tµ) = w„ ,µ = 1(1)n ,
(3.6)
has a unique solution for any data to E C". This is equivalent to the nonsingularity of the Vandermonde matrix V := ((t;)i(V)) . (3.7) A set Z which is correct for a given set T C C' may be found by the procedure described in [10].
362 For some fixed u E we interpolate the functions x, * x2(')(x),v = 1(1)n, on T, according to (3.6), i.e. we choose the data (ti )o 0 (wµp)
1(1)n,
(3.8)
0 to o
or, in a shorter notation, (3.9)
W,=T,•V.
The corresponding solutions of (3.6) are given by (... xj(') ...)(b^ a) where the matrix B, = (b(a)) satisfies
V.B,=W,=T,•V .
(3.10)
Thus the rows (...(t;,)3 ...,v = 1(1)n),µ = 1(1)n, of the Vandermonde matrix V of (3.7) are the left eigenvectors of each of the n x n-matrices B„o = 1(1)s, with eigenvalues (tµ), resp.
Transposing (3.10) and using the notation Z(x) as in section 2, we obtain Bo - Z(tµ) = (tµ),Z(tµ) ,µ = 1(1)n ,
(3.11)
and by comparison with Theorem 2.2 and (2.12) we find that the BT are the multiplication table matrices A, of the representation by the basis Z of the residue class ring 1Z = 1P' /Y. Therefore, the nontrivial polynomials in the set (BT - z,I) Z(x) , v = 1(1)s
(3.12)
span the requested ideal F which satisfies (3.5), cf. (2.11). Thus we have the same set of joint eigenproblems (A,-x,I)z=0,
a=1(1)s,
(3.13)
as the central tools for solving both (PI) and (PZ):
Z, V ( PI )
T
(PZ)
T
(2.11) -
1Z
re p resented b y Z and BT
(3.12)
R G, Z represented by ZandA,
F
F
The duality between (PI) and (PZ) has recently been described and analyzed in a more general and abstract form in [9]; we have used it in [8] without knowing of [9]. Here, we want to emphasize the crucial role of the intermediate transition to the residue class ring IZ in both (PI) and (PZ). This establishes the approach of section 2 as the natural approach to the constructive solution of multivariate polynomial systems by algebraic means.
363 3.3 Computation of Simple Zeros At first, we consider the case where the A, have a full joint eigenvector basis {z,,, v = 1(1)n} where n = dim(R) = number of zeros of (2.1). Hence each eigenvector z„ represents one zero t*, and vice versa. Obviously, all A. must be diagonizable. If the PP-basis Z of R contains all 1st powers x„ a = 1(1)s, all components (t;,), of the n zeros appear directly as components of the z,,, cf. (2.12). According to Corollary 2.4, it is generally sufficient to form one of the matrices A, and to solve the associated eigenproblem (3.13). If some x,, is not in Z, we may obtain the resp. component of the zeros as `first' components (x°-components) of A0'z,,: (tw)o' = ((t*.)o'z,)o = (A0,z„)o . (3.14) If all A, are diagonizable, a derogatory A, can only occur thus: We may have chosen an A, such that the a-components of the zeros t*, are not all disjoint, say (t*,), = (tµ), for t;, # tµ. Then (t;,), will be a multiple eigenvalue of A„ with its multiplicity m equal to the number of zeros whose o-components coincide, and there will be an associated eigenspace of dimension m. Thus the individual m disjoint zeros cannot be identified. In this case we may check a linear combination of basis vectors of this eigenspace (with `first' component 1) z cµzµ=( zl...z„,)•c, with Ecµ=l,
(3.15)
µ.1
against some other A,,: A,,(zl ... zm)c - x,,(z1 ... zm)c = 0 ,
(3.16)
(zl ... Zm)TA,,(zl ... z,,,)c - x,,(zs ... zm)T (zl ... zm)c = 0 .
(3.17)
Thus, the coefficients c„ in (3.15) may be obtained from an m x m-eigenproblem (3.17). Naturally, one may wish to avoid this complication by selecting an A, with disjoint eigenvalues if it exists. (There may be no coordinate direction which separates all zeros.) The selection may be effected by means of proposition 2.1: Since (2.11) is equivalent to (2.1) and to the associated Groebner Basis system (2.5), there are n x k-matrices C,(x) such that (A, - x01) Z(x) = C,(x) • g(x) , a = 1(1)s . (3.18) If rank C,(x) = k for some a, each eigenvector of A, must generate a zero of g, i.e. there cannot be redundant eigenvectors. (Again, there may not be such a a although the complete ns x k-matrix C always has rank k.) If the C, are computed together with the multiplication table matrices A„ it should not be difficult to select an appropriate A,. Remember that many rows of the C, are 0 (cf. last paragraph of section 2). The actual solution of the eigenproblem (3.13) will generally have to be done by a numerical approximation algorithm. Suitable software may be found in the NAG and IMSL mathematical software libraries or in LAPACK ([11]).
364 Example 3 . 2: Cf. Example 2.1. The 3 normalized eigenvectors of Al are
0)
1
3
.26556. 10.76556 ...) (
6
3.26556...
which yield the zeros ( 0, 3), (-0.76556 ... 1 1.26556 ...), (3.26556..., -2.76556 ...).
3.4 Computation of Multiple Zeros For some fixed a E {1,...,s}, consider the identity (3.18) (A, - x,I) Z(x) = C,(x)g(x) .
(3.19)
Fechet differentiation of (3.19) and multiplication by some v E C3 yields (A, - x,I) Z'(x)v - Z(x)v, = (C,(x)v)g(x) + C,(x)g'(x)v .
(3.20)
Upon substitution of a zero t*, of (2.5) or (2.1) resp., this reduces to (A, - (t;,),I) Z'(t,)v - Z(t*,)v, = C,(t;,) g'(t;,)v .
(3.21)
Now assume that t*, is a multiple zero of (2.5), with multiplicity at least 2. This implies that there exists a direction v E C" such that g(t*^ + Av) = 0(a2) or g'(t*,)v = 0 .
(3.22)
For all a such that v, 0, this implies (cf. (3.21)) that (t*,), is a defective eigenvalue of A, (with algebraic multiplicity at least 2), with the joint eigenvector z„ = Z(t;,), and that z„ = Z'(t;,)v is the associated joint principal vector. For those or with v, = 0, z„ is a further eigenvector of A, for the multiple eigenvalue (t*,),; cf. (3.21). However, since its `first' component is 0, it cannot be interpreted as a representation of a zero of .1 via (2.12). (Note that the `first' component of Z(x) is 1.) Conversely, assume that A, has an eigenvalue with algebraic multiplicity (at least) 2 which is either defective or whose eigenspace contains nonvanishing elements with a vanishing `first' component so that it defines only one zero t*,. We may then form the linear homogeneous system (cf. (3.21)). [(A, - (t*^),I) Z'(t ) - Z(t*.)eo] v = 0 , (3.23) where e,T = (0 ...1...0); if rank [...] < s, there is a nontrivial solution v E C of (3.23). If also rank C,(t*,) = k, then (3.23) and (3.21) imply (3.22), and t: is (at least) a double zero of (2.5) resp. (2.1), i.e. of.F. Else one has to check that v from (3.23) also satisfies (3.23) for sufficiently many other values of a so that (3.22) may be concluded. (Note that the matrix C composed of all C, has rank k.) This approach may be extended to higher multiplicities on the basis of the theory developed in [9]. Technical details will be reported in a joint paper by M. Moeller and the author of this paper.
365 Example 3 .3: Consider the system xl-x2-2x1+1=0 x22x2-x1-2x1x2+2x1 +x2-1=0 with the associated Groebner Basis system x1-2x1-x2+1=0 2 x2-x2=0 According to Theorem 3.1, Z = {1, x2,xl , x2xr) and
Al - (
0
0
1
0
0
1
0 0
0
0
0
1
0 0
-1
1
2
_ 1 0 0) A2 , 0
0
0 1
0
0
0
2
0
0 1
0
There are 3 joint eigenvectors : ( 1, 1, 0,0 ) T,(1,0,1,0 )T,(1,1,2,2)T. The components of the zero tz = (1,0) are algebraically double eigenvalues of Al and A2 resp. but rank (Al - I) = 3. The associated principal vector is ( 0, 0, 1, 0 ) T. (3.23) as well as the relation 0
0
0
0 At;)v = 0
1
t1o
0
1
0
1
1
lead to v = (1, 0)T which also satisfies (3.22): 0
9(t;)v= 0
-1
1
-1 ) 0 1 =0.
4 The Computation of Zero Manifolds 4.1 Finite Sections of Multiplication Tables For a polynomial ideal F of positive dimension d, with Groebner Basis G = {g4, the residue lass ring R = ]P' /.F has infinite dimension. But Theorem 3.1 remains valid so xj(') of a representation R that we may still consider the infinite vector Z(x) of basis PPs of R in the specified order and the associated infinite multiplication table matrices A, of (2.10): x,Z(x) = A, • Z(x) + Co • g(x) ,
a=1(1)s. (4.1)
Each row of the A, contains only a finite number of non-zero elements due to its generation by a normal form algorithm. The matrices Ca have infinitely many rows but only k columns (k = # of g,5 in G). Because of the one-to-one relation between F and R, the oo x k-matrix C composed of the Co must have rank k. Consider the `first' n equations of (4.1) for each a, where `first' means lowest w.r.t. the PP order in Z(x). The rank of the ns x k-matrix C(") composed of the corresponding n-row sections Co") of the C, is nondecreasing with n; hence it must achieve its maximal value k
366 at some finite n and preserve it for all larger sections . Consider such a finite rank-k section of (4.1). xoZ(n )(x) = A(n) . Z(x) + C'(n)(x) . 9(x ) , a = 1(1)s . (4.2) Since there are only finitely many non-zero elements in each row of A^n ), there must be a smallest number n > is such that there are no non - zero elements in the Aon) beyond the nth column , and (4 . 2) becomes
x,Z(n )(x) = A(n.n ) . Z(")(x) + C(n)( x) • 9(x)
or = 1(1)s . (4.3)
Thus we have proved Theorem 4 . 1: In the case of a d-dimensional ideal .F, there exist n- and ft - dimensional sections Z(x) and Z(x) resp. of the infinite PP-basis vector of R and the corresponding n x n-sections A, of the infinite multiplication table matrices such that the polynomial system (A, - x,I) Z( x) = 0 , o = 1 (1)s , (4.4) is equivalent to the systems (2.5) resp. (2.1). Here the is x n-matrix I is the unit matrix for the `first' otherwise, i.e. I • Z(x) = Z( x)
n components of Z and 0 (4.5)
Note that we have dropped the superscripts n and n of (4.3) and returned to the notation of section 2, with - denoting n x n-matrices and n- vectors resp .; thus the analogy between ( 2.11) and (4.4) becomes fully apparent. Furthermore , if we extend the eigenproblem terminology to the more general systems (4.4), Theorem 4.1 asserts that our fundamental Theorem 2.1 remains valid verbatim. Thus we should be able to determine all zeros (i.e. zero manifolds and isolated zeros) of a ddimensional ideal F defined by a polynomial system ( 2.1) from the set of eigenproblems (4.4). Note that a (locally) parametrized manifold {x(u), u E U} C C' is a solution of (2.1) if f(x(u)) = 0 for u E U. For (4.4) this requires
(A, - x,(u)I) Z(x(u)) = 0 , a = 1(1)s . (4.6) For the practical application of this result their remain two tasks which we must be able to perform algorithmically:
(i) We must determine suitable finite sections Z(x) and 2(x) j Z(x) of the PP basis vector of R, with is and Is as small as possible. (ii) We must compute all joint solutions z' E Ca of the n x n-eigenproblems (A,-XI)z =0, o=1(1 )s.
(4.7)
367 For task (i), the described search procedure for is and n must be made more efficient; we will not discuss this algorithm here (cf. e.g. [4]). The principles of the algorithmic solution of (4.7) will be described in the following section. Example 4 . 1: Consider the ideal T spanned by the Groebner Basis 91(51,x2 ) =
zi + z1z7 - X2 - X2 - z1 + 1 ,
92(x1,52) = xiz2 + xz + xi + z2-X2-1-
According to Theorem 3.1, the infinite PP-basis for the residue lass -ring R is (in increasing order) Z :_ {1,z2, z1,z],zjz2 , zi,z2 , x152, 5Z , 5152,...}
and the infinite multiplication table matrix Al of (4.1) begins thus: 1 1
0
52
0
0
1
x2
0
0
0
0
0
1
x1
0
0
0
0
1
57
0
0
x1
0
0 0
57
0
0 0 0 0 0 0 1 z2z1 0 0
= 1 1 0 -1 0 -1 -1 0 X2 + 0 1 xl 5251 1 -1 0 1 1 0 1 0 -1 z2 1 0
91(51,x2) ( 92(51,52)
xz 0 0 0 0 0 0 0 0 0 1 zz 51 0 0 57x1 -1 0 0 2 0 1 0 0 -1 0 X2 0 y- 1 57x1
Obviously (cf. C,(6) ), the finite section with n = 6 and n = 8 suffices for the equivalent polynomial system (4.4) in eigenproblem form.
4.2 Algorithmic Solution of Singular Eigenproblems We consider a so-called singular eigenproblem
(A-Al)i=0, (4.8) with I,A E C nx4,ft >
n,
i E Cf , A E C. Since rank(A-xI)
(4.8) has r := it - n nontrivial solutions i,(A), p = 1(1)r, for arbitrary A.
According to Kronecker [ 5], these eigensolutions I (A) are, with suitable normalization, polynomial in A of degrees k, i.e. kP
zP(,\)
= L 1\K Ia,K ,
ip,K E Cn ,
K
=
0(1)kp
,
(4.10)
K=0
with E p=1 kp =: n < is. If is - n =: ko > 0, there exist ko regular eigenvalues ao,lin C (counting multiplicities) for which
rank(A - AoKI) < n (4.11) which permits associated eigenvectors zo, E Cn.
368 Thus the complete solution set of (4.8) consists of the general eigensolution r
i(a)
with co E 1P1 ,
_ E c,(A)i,(A) ,
(4.12)
P=1
and ko = n - is isolated eigenelements (\os, -;Oa). Kronecker describes a recursive algorithm for the computation of the ip,,, p = 0(1)r, rc = 0(1)k,, in a more general context than above. In our special context, a direct, non-recursive algorithm for the computation of the ip,,, has been derived by the author which has been implemented by his student J. Haunschmied. The algorithm and its implementation will be described in a separate paper [12]. It should be mentioned that the z,,,, p = 1(1) r, arise from the elements of A by a purely linear process which needs only rational operations. The regular eigenvalues Ao,c - if any are determined as eigenvalues of a ko x ko matrix and the associated eigenvectors ;0,c are obtained as those solutions of (A - Ao,5I).z = 0 which are orthogonal to the ip(Ao,c). Example 4. 1: (continued): The singular eigenproblem (4.8) for
Al =
0
0
1
0
0
0 0
0
0
0
0
0
1
0 0
0
0
0
0
0
0
1 0
0
0
0
0
0
0
0 0
1
1
1
0
-1
0
-1 -1
-1
0
1
1
0
1 0
-
E
Casa
0 I 1 /1
has r = ii - n = 2; it yields the general eigensolution (cf. (4.12)) 1
0
0
1
A
0
i(A) = cl(A) 1 0 + cs(A) 0
a'
0 A-
I
I
0 1-A2 k
0
The degrees of the i. are k, = 3, k2 = 2, hence ko = n - >3 kp = 1. The one regular eigenvalue is 1 and =0i = (0, 0, 0, 1 , 0, 0, -1,1)1.
4.3 Zero Manifolds from Solutions of Singular Eigenproblems When we compare the general singular eigenproblem (4.8) with (4.4) we find that the role of the parameter A is occupied by x,; thus, the general solution of (4.4) will be parametrized by x,. Therefore it is essential to know which of the x, may serve as parameters for the zero manifold(s) of the polynomial system (2.1).
Theorem 4.2 (Kredel-Weispfenning [13], Gerdt [14]): Let S be a subset of {x1..... x„} and Z(S) the set of all PPs over S. S is a maximal independent set of variables (MIS) mod T if no PP in Z(S) is a leading PP of one of the
369 polynomials g^ in a Groebner Basis G of F. The elements of any MIS may serve as free parameters for a zero manifold of dimension ISI of Y. Thus, from a Groebner Basis G, we can tell the dimension of the highest dimensional zero manifold of F and a potential parametrization. Naturally, this parametrization need not be the only possible parametrization; a different order of the variables will normally lead to different leading PPs and thus to different MIS. Only an xo which cannot be used as parameter - because xo = const on the manifold - will naturally not appear in a MIS for any order of the variables. In this paper, we will only discuss our approach to the computation of zero manifolds for manifolds of dimension 1. The extension to zero manifolds of higher dimension which involves a generalization of the singular eigenproblems in section 4.2, will be presented in a separate paper. Let xl be one of the variables admissible for parametrization. The general solution of (cf. Theorem 4.1 and (4.4))
(A1- x17)2=0
(4.13)
has the form (cf. (4.12)) 2(x1) = E cp( xl )
2p(x1) •
(4.14)
p=1
If we assume for the moment that Z contains the linear powers of all xv, then (4.14) must contain representations of all 1- dimensional zero manifolds of F in the form xo(xl ) = E °p(xl) xo.p(x1 ) , p=1
(4.15)
where xo,p denotes the 'x,-component' of 2,,, due to Theorem 4.1. Thus it remains to choose the coefficients cp in (4.14) such that the components (4.15) are consistent with the remaining components of Z(x(xi)). The respective conditions for the cp are found either by direct comparison within Z(x(xl)) or, more algorithmically, by the introduction of (4.14) and (4.15) into (4.6), with xl in place of u and for a = 2(1)s: (Ao - (EC'( xl)xo,p (xl))I) (Ecp( x1)ip(x1 )) = 0 , a = 2(1)s • (4.16) P=1 p=1
(4.16) shows that the conditions for the cp can be at most quadratic in the cp. Naturally, one of the conditions is always the normalization condition for the x°-component (2(xl))o:
E cp(xl)(zp(x1))o = 1 .
(4.17)
P=1
(4.17) implies the validity of (4.15) for or = 1. Each isolated solution ( cp(xj ), p = 1(1)r) of (4.16) describes an isolated one-dimensional zero manifold of.F parametrized by xl. Under our assumptions about IZI = on and about the feasibility of xl as a parameter , there must be solutions of (4.16) due to Theorem 4.1.
370 If (4.16) admits solution manifolds, with d-1 parameters beyond x1, we have found a d-dimensional zero manifold of F. This case will be considered in more detail in a separate paper. Example 4.1 (continued ): Although the Groebner Basis G suggests x2 as parametrization parameter through Theorem 4.2, zl is feasible as well and we have already found the general solution (4.14) of (4.13): 1 0
) (°' 0
xl
0 + cs(xr)
i(xl) = cr(xr)
= 2L(x(xr )) .
(4.18)
xl
Condition (4.17) yields cr(xi) = 1 and (4.15) becomes xz(xl) = c2(xl)• The internal consistency of Z requires that the xi-component be the square of the z2-component: 1 - xl = (xz(xl))z
This yields the equation of the zero manifold xi + x2 - 1 = 0. There is no further independent consistency condition. The direct substitution of (4.18) into (4.16) yields 1 - xi - (cz(xl))z = 0 as the only condition besides cr(xl ) = 1 from (4.17). The regular eigenvector for of the singular eigenproblem (4.8) yields one further isolated zero (1, -1) of F, by a procedure which will be described in [12].
5 Conclusions In this paper, we have shown that the zero set of a polynomial ideal F may be obtained from the multiplication tables of the associated residue class ring R = IP' /F through matrix eigenproblems. In the case of a 0-dimensional ideal F, the size of the matrix eigenproblems is equal to the dimension of R, i.e. the number of isolated zeros (counting multiplicities). It appears that the numerical solution of the eigenproblem is the most efficient approach to a numerical approximation of the solution set. The numerical approximations of individual zeros can be improved by Newton's method if necessary. A total-degree order Groebner Basis of F is used for the generation of the multiplication tables. For the case of a d-dimensional ideal F, d > 0, we have shown that there exist finite sections of the infinite multiplication tables which carry the complete information about the zero set. These rectangular matrices define singular eigenproblems whose general eigensolutions contain the zero manifolds and whose regular eigenvectors determine the isolated zeros. However, the restriction of the coefficients in the general eigensolution such that it defines a zero manifold may require the solution of another quadratic system. While this may make the approach numerically unattractive, the fact that zero manifolds can also be found from eigenproblems associated with the multiplication tables of the associated residue
371 class ring 1Z is an interesting observation. This case will be considered in more detail in a further paper by the author. Acknowledgements : The author owes a great deal of insight to an exchange of ideas with the group of Professor Wu Wenda in the Academics Sinica. He has also been given valuable information about his related work by Professor M. Moeller in Hagen.
References [1] 0. Perron : Algebra, vol. 1 (Die Grundlagen), 2nd edition, Walter de Gruyter, 1931. [2] W. Auzinger , H.J. Stetter : An Elimination Algorithm for the Computation of all Zeros of a System of Multivariate Polynomial Equations , in: Conference in Numerical Analysis, ISNM vol. 86 , 11-30, Birkhaeuser, 1988. [3] W. Auzinger , H.J. Stetter: A Study of Numerical Elimination for the Solution of Multivariate Polynomial Systems, unpublished manuscript. [4] Huang Yu-shen, Wu Wen-da: A Modified Version of an Algorithm for Solving Multivariate Polynomial Systems, MM Research Preprints No. 5 (1990 ), 23-29, Academia Sinica. [5] F.R. Gantmacher : Matrizenrechnung , Teil II , Deutscher Verlag der Wissenschaften, 1959. [6] J.H. Davenport, Y. Siret, E. Tournier: Computer Algebra: Systems and Algorithms for Algebraic Computation , Academic Press, 1988. [7] B. Buchberger: Groebner Bases : An Algorithmic Method in Polynomial Ideal Theory, in: N.K. Bose ( ed.), Multidimensional Systems Theory , 184-232, D. Reidel, 1985. [8] H.J. Stetter : Verification in Computer Algebra Systems, to appear in: Validation Numerics Theory and Applications , Computing Suppl . 8, 1993. [9] M.G. Marinari , H.M. Moeller, T. Mora : Groebner Bases of Ideals Given by Dual Bases , Proceed. ISSAC 91, 55-63, acm press. [10] C. de Boor, A. Ron : Computational Aspects of Polynomial Interpolation in Several Variables, CS Tech. Rep. # 924 , Univ. Wisc., 1990. [11] E. Anderson et al .: LAPACK User's Guide, SIAM, 1992. [12] H.J . Stetter, J. Haunschmied: Computational Solution of Singular Matrix Eigenproblems, in preparation. [13] H. Kredel , V. Weispfenning: Computing Dimension and Indepedent Sets for Polynomical Ideals, J. Symb . Logic ( 1988), 231-247. [14] V.P. Gerdt, N.V. Khutornoy, A.Yu. Zharkov: Groebner Basis Technique, Homogeneity, and Solving Polynomical Equations , Preprint , to appear.
WSSIAA 2(1993) pp. 373-388 ©World Scientific Publishing Company
BACKWARD EULER TYPE METHODS FOR PARABOLIC INTEGRO-DIFFERENTIAL EQUATIONS WITH NONSMOOTH DATA
VIDAR THOMEE Department of Mathematics , Chalmers University of Technology S-412 96 Goteborg, Sweden
NAI-YING ZHANG Department of Computational and Applied Mathematics, Rice University P.O. Box 1892, Houston, Texas 77251, USA ABSTRACT The discretization in time of a parabolic equation with a memory term is considered . The time derivative is approximated by a backward difference quotient, and the memory term by quadrature formulas specially adapted to the case that the solution is nonsmooth at t = 0 , and with advantageous storage requirements.
1. Introduction. Consider the initial value problem B(t, s)u(s)ds Bu(t),
u t + Au =
for t E J = (0 , T], u(0) = uo. (1.1)
j Here u : J -* H where H is a Hilbert space, ut = 8u/8t, A is a positive definite, densely defined selfadjoint operator with compact inverse in H, B(t, s) is a linear D(A) depending smoothly on t and s, and u0 E H. operator with D(B(t, s)) We shall discuss time discretizations of (1.1) based on the backward Euler method. Letting k be the time step, to = nk, Un = U(tn) the approximation of u(t,,), atU" = (Un - Un-1)/k, and n-1
_
f
()d
(1.2)
the time discrete versions of (1.1) we shall study are of the form atUn + AUn = Qn(BU), for n > 1, U0 = u0i (1.3)
373
374 where on(BU) = vn(B(tn, •)U). In applications, A and B(t, s) will be a second order elliptic operator under some homogeneous boundary condition and an arbitrary second order partial differential operator, respectively, or finite dimensional operators obtained from these by finite element discretization.
Analyses of such numerical methods have been presented in Sloan and Thomee [2] and Zhang [4], [5]. In these the stability properties needed were based on the assumption that the quadrature coefficients w,, in (1.2) are dominated in the sense that there exist w; such that n
IwnjI < wj, for 0 < j < n,
with >wj < C, i=o
for to E J. (i)
Stability then resulted from a discrete version of Gronwall's lemma, and the error estimate II Un - u(tn)II < C(u)k (1.4) followed under the assumption of sufficient regularity of the solution of (1.1). One particular aspect treated in [2] and [4], [5] is the storage requirement
imposed by the choice of the quadrature rule. If an is taken to be the left side rectangle rule, i.e. wn; = k for j = 0, • • • , n - 1, then condition (i) is satisfied and the error estimate (1.4) holds. A drawback with this rule, however, is that, unless B(t, s) has a special structure, all previous values of U' need to be saved, which results in excessive demands on storage. In order to cope with this problem, alternative rules were introduced based on the idea of using more accurate quadrature rules on longer time steps, with certain modifications near the ends of the intervals (0, tn). Such rules were constructed employing the trapezoidal rule and Simpson's rule, with basic step length of order 0(k1/2) and 0(k1/4), respectively. In both cases it was possible to do this in such a way that the rules satisfy (i) and such that only the values of Uj at 0(k -1/2 ) and 0(k-1/4) different levels, respectively, need to be saved, rather than the O(k-1) levels required for the rectangle rule. This reduction in storage requirement was accompanied by an increase of regularity demands on the exact solution. In [2] and [4], [5] the same questions were addressed also for some other approximations of the time derivative in (1.1), such as the Crank-Nicolson method. For the pure differential equation case of (1.1), i.e. when B = 0, it is known that optimal order convergence for the backward Euler method takes place for positive time, even without any regularity assumption on the initial data, or, more precisely, II Un -
u(tn)II
<
Cktnl lluoll,
for n > 1. (1.5)
This fact is related to the smoothness property for the parabolic equation expressed by (Dt = d/dt)
IID'A`u(t)II <
c,;t-
`Iluoll, for t > 0,
Z',
j' > 0.
(1.6)
375 The purpose of the present paper is now to extend this analysis to backward Euler type schemes of the form (1.3) for the integro-differential equation problem (1.1). The presentation builds on Zhang [4], which in this case was based on a previous study by Le Roux and Thomee [1] concerning an integro-differential equation in which the integrand depends nonlinearly on u. In Section 2 we begin by discussing the regularity of the solution and show that the estimate in (1.6) holds for the solution of (1.1), for t E J and j > 0, if i = 0,1, and if the operator B(t, s) is dominated by A in a sense to be made precise below. In Section 3 we study discretizations of (1.1) of the form (1.3). We show that if an additional assumption is satisfied for the quadrature coefficients, namely that for the wi occurring in (i) we have n-1
wit; 1 < Clog for
to
E
J,
(ii)
i=1
then
IIUn - u(tn)II :5 Ck{tn1 + logk}IIu0II +CQn(u), (1.7) where Qn(u) is a global quadrature error. We then prove that if in addition the quadrature formula vn is appropriate for nonsmooth data in the sense that, with qn(g) = ,n(g) - fan gds the quadrature error, there exists m > 0 such that Ilgn(g)II < Cklog 1, if
IIDtg(t)II < t
for t E J, 0 < j < m,
(iii)
then the global quadrature error term can essentially be absorbed into the first term on the right in (1.7). In Section 4 we give examples for which this condition holds. As a first example we consider again the above rectangle rule, for which condition (ii) and (iii) are simply verified. Next we note that the modified trapezoidal rule with stepsize O(k-1/2), which was appropriate for smooth solutions, now demands more regularity than we may assume at t = 0 to yield an optimal order error estimate. Instead we construct a quadrature rule in which the mesh underlying the rule is graded, with short intervals near t = 0, and for which (i), (ii), and (iii) hold, the quadrature error has the right order, and which uses Uj at 0(k-1/2) levels. A similarly graded scheme based on Simpson's formula with O(k-1/4) level storage requirement is finally analyzed. In Section 5 we discuss briefly the application of our results to equations obtained by spatial discretization in finite element spaces of integro-differential equations in which A and B(t, s) are partial differential operators. In Section 6, finally, we illustrate our results by a simple numerical example. The authors are grateful to John Carroll and Lanzhen Xue of the Dublin City University for their help in carrying out the computations.
376 2. A regularity result. In this section we shall show that (1.1) has a solution which is continuous on J, and which belongs to D(A) and depends smoothly on t for t E J, under a natural assumption on B(t, s): We shall say that an operator B is dominated by A if BA-1 and A-1B are bounded. Note that by elliptic regularity this holds in H = L2(St) when A is a second order positive elliptic differential operator in 52, under Dirichlet boundary conditions, and B is an arbitrary second order partial differential operator. Theorem 2.1. Assume that B(t, s) and its derivatives with respect tot and s are dominated by A. Then (1.1) has a unique solution u E C(J, H) fl C°o(J, D(A)) and
IIDtA1u(t)II
(2.1)
In view of wellknown estimates for E(t)uo (cf. (1.6)) this will complete the proof. Note that thus w(t) has a weaker singularity at t = 0 than E(t)uodoes.
The proof of (2.1) will be by induction over n. By our definitions (1.1) may be rewritten as wt + Aw =
J0 t B(t, s)u(s)ds = Bu (t) = Bw(t) + BE(t)uo. (2.2)
Hence, since w(0) = 0, (1 . 1) is equivalent to the integral equation w(t) =
fo
t E(t - s)Bw (s)ds + / t E(t - s ) BE(s)uods = Kw(t) + V (t).
(2.3)
We shall first demonstrate that
IIBE(t)uoll
S CIIuoII,
for t E J, (2.4)
and that as a result thereof,
IIAV(t)II < CIIuoII, for t E J. (2.5) We then show that the Volterra type integral operator K defined in (2.3) satisfies
t
IIAKg(t) II s C f II Ag(s)Ilds. (2.6) 0
377 Together these estimates show that K is a Volterra operator on D(A) with
t
II Aw(t )II
< CIIuoll + C f
II A w(s)Ilds,
and hence that ( 2.3) has a solution w which satisifies (2.1) for i = 0, j = 1. Turning to the proof of (2.4 ) we have
BE(t)uo =
(B(t, s) - B(t, 0))E(s)uods + B(t, 0) t Jo Jt
E(s)uods = L1 + L2-
Here, Here, after integration by parts and using DtE(t)uo = -AE(t)uo, we have L1 = ft B,(t,r)( o
J t E(s)uods )dr t B,(t,Tr)A-1(E(-r) - E(t))uod-r, o
and since B, is dominated by A, the desired estimate for L1 follows. The estimate for L2 holds similarly, and the proof of (2.4) is complete. To show (2.5) we write, using AE(t - s) = D,E(t - s) and integration by parts,
t AV(t) =
J ' B(s, r)E(r)uodrds = BE(t)uo - J E(t - s)B(s, s)E(s)uo ds - J E(i - s)tE(s)uodrds =
J
AE(t - s)
0
0
3
.i=1
By (2.4) we know that M1 is bounded as desired and the argument proving (2.4), with B replaced by Bt, shows that so is BtE(s)uo and hence M3. As for M2, we have, using the smoothness property (1.6) for E(t), that
IIM2II
< C
t/2 (t - s )-1 llA-1B(s, s)E(s)uo Ilds + C f t IIAE(s)uo lids < CIluoll, t/2 Jo
which completes the proof of (2.5). To show (2.6), finally, let ¢ be such that 0(0) = 0. By changing the order of integration we have, with k defined by the first equality,
AEO(t) =
( I - E(t - s))cb'(s)ds, = t AE(t - s)¢(s) ds = t J E,(t - s)4(s)dst fo J0 0
= B(t,t)g(t) +Btg(t), t t dr < C 0 II AgII dr. II DT(B9 )(r)II0 II AKg(t ) II = II AE(B9 )(t )II s 0
whence II AEO (t)II <
fo
IIc' Ilds. Hence, since
f
Dt(Bg )(t)
378 We have thus proved (2.1) for i - 0,j = 1. Note, in particular, that ||5u(<)|| < ||S£(t)wo|| + ||£u>(<)ll < c\\uo\\- To complete the proof of (2.1) for n = 1 we note that by (2.2), using the result already obtained and (2.4), we have ||to t || < \\Aw\\ + \\Bu\\ < C\\uQ\\. We shall now carry out the induction step of the proof of the bound (2.1) for n > 1 and assume thus the result for n < m. We begin with the case i = m, j = 1, and recall that w - EBu. In order to be able to estimate D?Aw(t) = D™AEBu(t) we shall first prove the following lemma. L e m m a 2 . 1 . For g appropriately t"+l\\D?AEg(t)\\
smooth and n > 0 we have, with C = Cn, C f a" + , ||ff ( B + I ) (a)|| ds.
Jo
J^o '<*
Proof. We write ft/2 Eg(t) =
ft/2 E{t-
s)g(s) ds +
Jo
E(s)g(t
By straightforward manipulation, using AE(t) coefficients c „ ; ,
D?APg(t)
-s)ds
= Pg(t) + Rg(t).
Jo
= T
Cnj&>\t/2)gln-'\t/2)
= —Et(t),
-
f
we have, with some
E
- s)g(s) ds.
Here, using the smoothing property of E(t), we have < n + 1 ||£°' ) (2)<7 ( " _ '' ) (2)|| <
ct„+i-;||ff(„-7)(t/2)||
and *»+»|| / ' Jo
2
E^n+1\t
- s)g(s) ds\\
Similarly,
D?ARg(t) = J2dnjE^\t/2)g(n-»{tl2)+g(n\t) .7=1
-
Rg(n+l\t).
(2.7)
379 The sum is estimated by (2.7) and similarly for g(-)(t). Finally
to
+l IIRg(n+l)(t)II < Cft^2 (t C
I
) n+111g(n
+1)(t - s)IIds
t Sn+111g(n+1)(5)11 ds.
0
Together these estimates complete the proof of the lemma. We return to the proof of ( 2.1) for i = m, j = 1. As a result of the lemma we have, with C = C„,., m
5'IIDtBu(s)II) tmIIADt"w(t)11
t
+Ct-1
f Sm+IIIDi +1Bu(5)II
(2.8)
ds.
0
Setting Bt') (t, s ) = Dit B(t, s ) and Bt^)u(t) = f0 Bt') (t, s ) u(s) ds we shall show j-1
IID'Bu(t)II <
IIAu(`)(t)II + II Bt')u(t)II, for 0 < j < m + 1, t E
C
J. (2.9)
i=o In fact, we have DtBu(t) = B(t, t) u(t) + Btu (t), and hence Dt Bu(t) = Dt
( B(t, t)u(t)) + Dt-1 .-'B(1)u(t)
9-1
_ Dt(Bti
-1-i)(t,t)u(t))+Bt '
)u(t)
i=0
Further, noting that
Dt(Bt
(t,t)u(t))
C
l/Dt-'Bt 1 '- 1-s)(t,t)A-1(Aul^)(t)),
and using our assumption of domination, we conclude that
IIDt(Bt'-1-')(t,t)u(t))II
which shows (2.9).
380 Using our induction assumption , (2.9) and an analogue of (2.4) imply that i-1
t'IIDiBu(t)II
< Ct'.
IIAu1'1(t) II +t'IIBtj' u (t)II < CIIuoII, for j < m, E i=O
(2.10)
and that similarly tm+lIIDm .+1Bu(t)II < CIIuoII+Ctm+hII ADi'u(t)II < CIIuoII +Ctm+III ADr w(t)II Together with (2.8) this implies
tmIIADt'w(t)II
CIIuoII
+CfsmIIADt' t
0
w(s)II ds,
whence the desired estimate for ADt `w follows by Gronwall's lemma. The corresponding estimate for D'n+1w now results from (2.2) and (2.10), and the proof is complete.
3. Completely discrete schemes for nonsmooth initial data. In this section we shall discuss backward Euler time discretizations of the form (1.3) of the integro-differential equation (1.1), in the case that the initial data are nonsmooth. We start with a preliminary error estimate in which the precise choice of quadrature rule is yet to be made. Lemma 3.1. Assume that B, Bt, and B. are dominated by A. Assume that an satisfies (i) and (ii). Then we have for the solutions of (1.3) and (1.1)
IV - u(tn)II
C Ck(tnl + log
1 )Iluoll + Clma^xn IIQ'(u)ll, k
fort,, E J,
where, with Ek = (I + kA)-1 and qn(B¢) = gn(B(tn, •)0), we have set n-1 Qn(O)
= k 1 Ek- iqi +1(BO) i=o
Proof. Letting Fk = Ek - E( tn) denote the error operator for the purely parabolic problem , we write en = Un - u(tn ) = Fk uo + en. Then ( 1.5) may be expressed as (3.1) IlFk uoll < Cktnllluoll, for t o > 0.
381 To complete the proof we shall show 11enII < Cklog
k
IluoII + C 1max IjQj(u)jj, for to E J.
Letting n-1
Sn(O)
= k Ek -^Q^+ 1 j=o
we have, with (Fkv)n = Fk v and Ij = (tj, tj+1), since Un = Fk uo + en + u(tn), that en
= Sn (U) -
J to E(tn - s)Bu(s)ds = Sn(Fkuo ) + Sn(e)
n-1
n-1
+ k Er (oj+1(Bu) - Bu(tj +1)) + j=0 n-1
j=o
j Ek-'(Bu(tj+1) - Bu(s))ds
j
n-1
+ f Fk -'Bu(s)ds + J (E(tn-j) - E(tn - s))Bu(s)ds j=o I' j=o 3
= Sn(Fkuo) + Sn(e) + Qn(u) + E e.7
.
(3.2)
j=1
We note that the P are independent of the quadrature formula. We shall show n-1
IISn(O)II < CEw7IIoiII,
i
f 0(0) = 0.
j=1
Assuming this for a moment, we shall now invoke the following discrete form of Gronwall's lemma (cf. [2]). Lemma 3 .2. Assume that the nonnegative sequence {9n}000 satisfies n-1
gn <- 7n + wjgj,
forn > 0,
j=o
where {wn}0 and {-in}' are nonnegative and nondecreasing, respectively. Then n-1
gn
< 7n e xp(>
wj),
j=o
forn > 1.
382 In view of this lemma, we obtain from (3.2) and (3.3) 3
Ilenll < C,max ( IIS'(Fkuo)II + IIQ' (u)II + : IIe:II), 2, n i=1
and it therefore now remains to estimate the S3(Fkuo) and V. We have first, using (3.3), (3.1), and condition (ii), n-1
n-1
Ilsn(Fkuo)II S C>wjliFFuoll < Ck>wjtj 1II u oii <- Cklog k Iluoll. j=1
j=1
To estimate en we note that, using (2.10) with j = 0, 1, k
n-1
IIBu(k) - B u(s)Ilds + f II(j+1) - Bu(s)Iids Ile i II s fo , n-1
Ck lluoll
+Ck2 E tj 1 Iluo1I
< Cklog k lluoll•
j=1
For ez we have similarly n-I
n-1
f tn1jiiBu(s) Iids < Ck2 ^ tnl jlluoll < Cklog k lluoll,
Ile2 II <- Ck
j=0
;
j=o
and for e3 , finally, since for s E Ij,
II E(tn-j ) - E(tn - s )II = II
n ET(,r)drii < C In _ _T-1d, < Cktnlj -1, I 1 Jtn -S
we find n-2
IIe3II <- J n_ II(E (k) -E(tn - s))Bu(s )llds +Ck2 EtnI j-llluoiI I
j =o
1
< Cklog klluoll.
The proof is now complete, modulo the proof of (3.3). To show this we make use of our definition of On and a change of the order of summation to obtain n-1 n-1
Sn(^) = k
1:(Ewj +l,iEk-')B(tn,ti)
Ui
i=0 j=i n-1
kE E
n-1
Ek
o jt
-' I
i=0 j=i +1
Bt(s, ti)ds Ui.
0 Since B and Bt are dominated by A, the proof is now a simple consequence of the following smoothing properties of Ek, namely, with r(A) = (1 + A)-1,
sup I Ar(kA) n I < k-1 sup A( 1 + A)-n < k-'n-' = to 1 (3.4) I AEk II =AESp(A) A>O and n-1
n-1
Ilk E wj +1,iAEk -' II =
sup
n-1
1k E wi +l,iAr(kA)n-l I < wi sup
j =i AESp(A) j=i
<
E Ar(a)n-i
A>o j=i
Ar(A)
wisup = W. A>o 1 - r(A)
We are now in a position to show a complete error estimate, and assume that the quadrature rule an is appropriate for nonsmooth data in the sense of condition (iii). Theorem 3.1. Assume that B = B(t, s) and its derivatives with respect to t and s are dominated by A, and that on satisfies (i), (ii), and (iii). Then we have for the solutions of (1.3) and (1.1)
IIUn - u(tn )Il s Ck(tn' + (log k)2)IIuoII, for tn E J. Proof. It remains only to estimate Qn(u). By our assumptions and Theorem 2.1 we have IID^(A-' B(t,s) u ( s ))II Cis-'IIuoll, fort E J. Hence, by (iii) and the linearity of qn,
IIA-'gn(Bu)ll <_ Cklog k lluoll and, using also (3.4) we find n-1
IlQn(u)II = Ilk E AEk-'A- 1q5+'(Bu)ll i=o
1n-1 _ 1
< Ck21og tn1;Iluoli < C k(log k)2IIuoII i=o
In view of Lemma 3.1 this completes the proof.
384 4. Examples of appropriate quadrature rules. In this section we give some examples of quadrature rules that are appropriate for nonsmooth data and have advantageous storage properties. We note first that, in addition to (i), condition (ii) obviously holds for the rectangle rule described in the introduction since, with wj = k, n-1
n-1
Ewjt^l=^j-1
fortnEJ.
j=1 j=1
In this case we have, using the assumption of (iii) with m = 1, that rz-1
rt„
lIgn(g)II = Il k E g(tj)-J
(t„
g(s)ds ll < 2kmax llg(s)II+k J llgtllds < Cklog k, S
j=o
and hence (iii) also holds. In order to reduce the storage demands a quadrature rule was proposed in [2] which is sparser than the above, but nevertheless retains the order of accuracy of the backward Euler discretization. This was accomplished by basing the quadrature rule on the longer time step k1 = mk, where m = [k-1l2], the integral part of k-1/2, thus with k1 = 0(k1/2). Setting tj = jkli j = 0,1, • • • , and with jn the largest integer such that tin _< tn_l, the trapezoidal rule was applied with step k1 on (0, t j„) and the rectangle rule with step k on the remaining part, i.e., o,n(g) = kl {2g(0)+g(tl)+..+2g(tj..)}+k{g(tjn)+g(tj,+k)+"+g(tn-1)}. (4.1) Then qn(g) = 0(k2) + O(klk) = O(k) as k •-> 0, for g smooth enough. The number of Un that need to be stored thus reduces to O(kl 1)+O(m) = O(k-1l2), without loss of accuracy. This rule has dominated weights, with wj = Ck1 if j 0(mod m), and wj = Ck otherwise, since then E " wj < Cjnkl + Cnk < CT,
and it is also easy to see that (ii) is satisfied. However, the main contribution to the bound for the quadrature error is Ck1 fki llgttllds, and since we may now only assume llgtt(s)II = O(s-2) for s small, this gives an error bound of order 0(k1) = 0(k1/2), thus with a loss of accuracy. We shall therefore present an alternative quadrature rule which behaves better for non-smooth data, but still performs well for smooth data. We introduce this time the sequence of time steps defined by tj = j2k, i = 0, 1, • • • , and let jn be the largest integer for which fin < to-1. We now propose to use the basic trapezoidal rule on the intervals (t j, fj+l) and then the rectangle rule with time step k on (tj,,,tn _ 1), i.e., we set j
-1
n-1
o,n(g) = 2 E (tj+l - tj)(g(tj+1) +g(tj)) + k E g(tj). (4.2) j=o j=jn
385 It is easy to see that the number of nonzero weights in o "(o) is n - jn + in, and since n (jn + 1)2 this is bounded by 3jn + 1 < Ck-1/2, so that, like the earlier modified trapezoidal rule, the number of levels of the solution which need to be stored is O(k-1/2) We find easily that this rule satisfies (i) and (ii) with wj = 1j+1 - tj_1 for j = i2,i > 0 and wj = k for other j. For the purpose of showing that (iii) holds we note that qn(g) _ { 2 k (g(0) + g(k )) - f k gds} J tn 1 in-1 tin n-1 +{ 2 (tj+l - tj)(g(tj+1) + g(tj)) - Ik gds } + {k E g(ti) - 'n gds}, j =1
j=jn
whence in-1
to
) 3 ' ma^ }1 llgtt ( s)Il Ilgn(g)II < Ck max llg(s)II + C E (tj+l - 1j
jIgtIIds. + k 'yin
This yields, using the assumption of (iii) with m = 2, in -1
Ilgn(g)II < Ck + C E
to tj)3 (tj)-2 + Ck f s-lds,
j=1
where jn-1(
in-1
jn-1
tj) 3(tj)-2 = k j=1
1
(( Cklog k. ((j + 1)2 -j2)3j-4 < Ck j-1 <
j=1
j=1
These estimates complete the proof of (iii). We finish our examples by a method requiring only 0(k-1/4) values of U" to be stored. As above, our quadrature points will now be chosen denser near the origin. We first recall a quadrature formula which was introduced in [4] and which was suitable for smooth solutions. Let mo = [k-1/4] and k; = m'- 'k, 1 _< i < 4, and define ti = jk4 for j > 0. To define o"(g) let jn be the largest even integer such that tin tn_1i and use Simpson's rule with steplength k4 on the interval we first use the trapezoidal (01tjn ). On the remaining subinterval (tjn, tn_1) rule with step length k3 on as many intervals of length k3 that fit in, starting from the left, then the trapezoidal rule with step length k2 on as many intervals of length k2 that fit into the then remaining interval, and finally the left hand rectangle rule with step k1 = k on the remaining interval of (0,tn). This rule
386 has dominated weights, but requires that g(4) E L1(0,T) to secure the desired accuracy. We now propose an alternative quadrature rule which is appropriate for nonsmooth data also. This time our basic quadrature intervals are defined by Ij = (ij,tj+l) where tj = 2j4k, j = 0, 1, • • • . For n = 1, 2 we choose the rectangle rule, i.e., Q1(g) = kg(0) and v2(g) = k(g(0) + g(k)). When n > 3, let jn be the largest integer such that tin _< tn_1 and use Simpson's formula based on the points tj, tj+1/2 = (tj + tj+1)/2, and tj+1 in Ij for j < jn. In the remaining interval in = (fin, tn_1) we now use the trapezoidal rule on as many intervals of length j,2^k as can fit into In, starting from the left, then the trapezoidal rule on as many intervals of length jnk as fit into the remaining part of In, and finally the left hand rectangle rule on the remaining intervals of length k in (0, tn). We denote by In, I' , I' n' the corresponding subintervals of In.
This quadrature rule has dominated weights: The weights corresponding to the points tj are bounded by (t j+1 - tj) + (T j - fj_1) _< C j 3k, and their contribution to Ej wj is bounded by C4k < CT where tj is the largest such point in J. The contributions from the points tj+ 1/2 are similarly bounded. The endpoints of the intervals of length j2k in Ij for which the trapezoidal rule is used at some time of the process give coefficients bounded by j2k. Those in Ij contribute at most Cj3k to >j wj, and after summation over j, Cj4k < CT. Similarly, the intervals of length jk in Ij contribute at most Cj3k and altogether Cj4k < CT, and finally the intervals for which the rectangle rule is used correspond to quadrature weights k, with a bounded sum nk < T on J. It is also easy to see that condition ( ii) is satisfied . For instance, the contribution of the points tj to E, wj/tj is now bounded by C Ej j3k/(j4k) = C>j j-1 < Clog(1/k), and that of the first of the two trapezoidal rules by C>jI •j2k/(j4k) < Clog(1/k). We finally show that this quadrature rule satisfies (iii) with m = 4: For n = 1, 2 we find at once, for g bounded, that Ilgn(g)II < Ck, and for n > 3 we have, using the assumptions of (iii), since tj - tj_1 < Cj3k and j4nk < T, in-1
I1gn(9)II < Ck sup I19(s)II + C
1, (tj
+1 -
tj)4 f II9(4)Ilds
j=1
f II9"Ilds + Ck f II9'Ilds + C(jnk)2 f I II9"Ilds + C(jnk)2 'l, in -1
< Ck + C E (j 3k)5 (j 4k)-4 j=1
+ C ((.7nk)2jnk + (jnk)2I nk )(jnk)-2 + Ck(jnk)(jnk) < C k (log j n+ 1):5 C k log ..
387 5. Application to finite element discretizations in space. In this section we shall briefly discuss the application of the above results to the spatial discretization by finite elements of an equation of the form (1.1) when A and B(t, s) are differential operators with respect to a spatial variable x. Let thus A denote a second order selfadjoint elliptic operator with homogeneous Dirichlet boundary conditions in a bounded domain Il C Rd with a smooth boundary, corresponding to a bilinear form A(u, v) such that, with c > 0, d au au
A(u' u)
a'j ax; axe + ao u2
) dx > c u H^ , for u E Ho l.
Further , let B(t, s ) be an arbitrary partial differential operator of order a < 2 in I1 and B(t, s; u, v) the corresponding bilinear form on Ho. Here we denote as usual by H' = H''( Il) the Sobolev space of functions with derivatives of order at most r in L2(I2 ), and by Ho the functions in H1 which vanish on asz. Our basic Hilbert space is now L2(IZ). Letting { Sh}h
uh(0) = uoh.
It was shown in [3] that if Sh satisfies the standard O(h'') approximation assumption associated with piecewise polynomials of degree < r - 1 and if uoh is chosen as Phuo where Ph denotes the L2-projection onto Sh, then for the error in this semidiscrete solution one has
Ilu h (t) - u (t)II
:5 Ch'Yt_-t/2
IIuoII,
with y = min(4 - 0, r).
The completely discrete backward Euler problem (1.3) now reads atUn + AhUn = Un(BhU ), for n > 1, U° = Phuo. In order to be able to apply Theorem 3 . 1 we need to assume that Bh(t, s) and its derivatives are dominated by Ah, which may easily be shown to be the case in the presence of an inverse assumption by using a standard error estimate for the associated elliptic problem ( cf. [4], Section 3.3). (In some cases , such as when B = g(x)A + B1 with B1 a first order operator , the inverse assumption is not needed for this ( cf. also [2, Section 4]).) Theorem 3 . 1 then yields, with y=min(4 -/3,r), II Un - u(tn )II C C (h'Ytn 1'/2 +
k(tn 1
+ ( log
1 )2)) IIuO
II,
for to
E
J.
388 6. A numerical example. To illustrate our results we have considered the simple special case when A = -32/8x2 on [0,1], with B(t, s) = A. Here the integro-differential equation may be differentiated with respect to t to yield the second order differential equation utt + Aut - Au = 0,
for t > 0,
with u(0) = uo, ut (O) = -Auo,
and the exact solution may easily be calculated by separation of variables. We remark that in the very special case the rectangle rule gives a"(AU) = k E o AUj which may be stored separately so that no need exists to use memory saving quadrature rules. We have chosen the initial function uo(x) - 1 in (0, 1), for which the solution is discontinuous at x = 0, 1. As comparison, we have also treated the smooth initial function uo(x) _ (4 /ur) sin irx, which is the first term in the Fourier series expansion of the above discontinuous initial function. The errors e2 and e,,,, are measured in discrete 12 and l,,. norms, respectively, and the order of corresponding convergence rates p2 and p. are calculated. The following table shows some results for the error at t = 1, using (a) the rectangle rule, (b) the trapezoidal rule (4.1), (c) the graded trapezoidal rule (4.2). The discretization in space is accomplished by replacing A by the standard second order finite difference operator on a uniform partition, using xj = j h = j /M, j = 0, • • • , M = 500, and N denotes k-1. Smax is the maximum storage requirement.
Method N a b 160 c a b 320 c a b 640 c
Smax
160 17 28 320 32 48 640 40 40
Smooth data 104e2 02 104e". 119 137 156 180 100 115 60 1.0 69 80 1.0 92 50 1.0 58 30 1.0 35 42 0.9 49 25 1.0 29
Nonsmooth data P".
104e2
1.0 1.0 1.0 1.0 0.9 1.0
120 357 96 60 237 49 30 163 25
02
1.0 0.6 1.0 1.0 0.5 1.0
104e 0o Poo
168 914 114 84 662 60 42 494 31
1.0 0.5 0.9 1.0 0.4 0.9
REFERENCES 1. M.-N. Le Roux and V. Thomee, Numerical solution of semilinear integrodifferential equations of pa rabolic type with nonsmooth data, SIAM J. Numer. Anal. 26 (1989), 1291-1309. 2. I. H. Sloan and V. Thomee, Time discretization of an integro-differential equation of parabolic type 23 (1986 ), 1052-1061. 3. V. Thomee and N.-Y. Zhang, Error estimates for semidiscrete finite element methods for parabolic integro-differential equations , Math. Comp. 53 (1989 ), 121-139. 4. N.-Y. Zhang, On the Discretization in Time and Space of Parabolic Integro-Differential Equations, Thesis, Chalmers University of Technology, Goteborg, 1990. 5. N.-Y. Zhang , On fully discrete Galerkin approximations for partial integro-differential equations of parabolic type, Math. Comp. (to appear).
WSSIAA 2(1993) pp. 389-398 ©World Scientific Publishing Company
GLOBAL METHOD FOR THE POLES OF ANALYTIC FUNCTION BY RATIONAL INTERPOLANT ON THE UNIT CIRCLE
TATSUO TORII and TETSUYA SAKURAI Department of Information Engineering, Nagoya University Nagoya, 464-01, Japan
ABSTRACT We present an algorithm to find the poles of an analytic function f(z) in the neighborhood of unit disk of complex plane except the finite number of poles within a circle. Using FFT we expand f(z) into a polynomial of z and then transform it to rational interpolant . According to the increase of the sample points on the unit circle with the rate of geometric progression, we can construct the sequence of rational interpolant of f (z) from the polynomials generated by FFT. The zeros of the denominator of the rational interpolant are approximation of poles of f(z) under the suitable choice of the degree of the denominator.
1. Introduction In order to find a zero of a given analytic function , we need normally a close approximation of the zero when we use Newton 's method or many kind of Newton like methods. As all of these methods converge locally, how to find the initial approximation for the solution is generally the difficult problem. Our method can be applied to find initial approximation of the poles of an analytic function in the complex plane. In this sense this can be called global method. Let f ( z) be analytic in a neighborhood of the unit disk except a finite number of poles within the circle . The function f (z) is expanded in Laurent series
f (Z) _ c k
zk
(1)
k--oo
in a annulus including the unit circle Izj = 1. We consider the discrete Fourier transform of complex function on the equidistributed points on the unit circle zj = W/,, 0 < j < N where WN = exp(2ai/N). When N is power of 2, FFT is efficiently applicable to get the interpolant of
f (z) N-1 f(z)
fN(z) _
CkNlzk^ k-0
0
(2)
390 where
CkNI =
N-1
r
f
0
<
k
<
N.
(3)
j=0
Sometimes this is called discrete Fourier transform , abbreviated to DFT of f. For the interpolant of f (z ) on the zeros of zN - 1 it is convenient to use the notation as defined next section, fN(z) = f( z) mod (zN - 1). (4) The relation between the coefficients ck of Laurent series of f (z) and discrete Fourier coefficients CkNI of the sample { f( z;)} is well known as the name of aliasing in signal processing C (N)
00
(5)
k = Ck+mN• m=-W
For the poles {wk} of f(z), we arrange as IW1I <
I W21
< ... < IWrI < 1 < IWr
+ll
< Iwr+
2I < ... .
(6)
We choose suitable positive number R1 and R2 for the poles at the outside of unit disk
1
(7)
Suppose that there is no other kind of singularity in the disk Izl < R2. Then
(8)
f(z) _ µ. + W(z), i=1z-w; where p(z) is an analytic function in the disk Izl < R2. Hence we have r oo^ a X^00`` [[te^ [ µiwi CA; = L ^Lµiwk-1 / z-k - L k =1 i =1 k=0 =r+1 i k-1/ Zk
+O(R2k)
(9)
for sufficiently large N. And for relatively small positive k < 1 there holds r
CN^k=Yµiw; -1+O(Ri-k ),
k=1,2,...,1
(10)
i=1
from Eq. 5 and Eq. 9. Therefore, significant information about poles within the unit circle is included in the coefficients CkNI of higher degree of polynomial fN(z) given by Eq. 2. The complex number sequence k SA; = µ;w ;, i=1
k = 0,1,...
391 is a particular solution of some difference equation of order r whose characteristic polynomial is given by q*(z) = 11 (z - w;).
(12)
In order to find the polynomial q*(z), Pade expansion technique may be used. In our case instead of Pade expansion, we apply rational interpolant on the unit circle. Let us explain the principle of our method. We consider the sequence of polynomials of degree N - 1 zkfN(z)mod(zN-1), k=0,1,...,1-1
of which arbitrary linear combination is expressed by the arbitrary polynomial q(z) of degree I - 1 as q(z) fN(z) mod (zN - 1). Determination of q(z) to satisfy the degree of q(z) fN(z) mod (zN - 1) at most N - I - 1 is equivalent to solve the simultaneous linear equation Eq. 10 with respect to µ; and w;. To this end we use rational interpolant. There are many literatures for rational approximations. 2. Algorithm Let a polynomial p(z) of degree n with complex coefficients be p(z) = ao + asz + ... + a „zn, an # 0.
(13)
We use the following notations deg p = n : degree of p(z), deg p/ q = deg p + deg q : degree of rational function p(z)/q(z), IIPII = maxla ; I : norm of p(z), f (z) mod w(z) : Interpolant of f (z) on the zeros of polynomial w(z), fN(z) = f( z) mod (zN - 1), (14) fN(z) = f ( z) mod (ZN + 1). (15) By the composition of FFT, f2N(Z) = 2(zN + 1)fN(Z) - 2(zN - 1)JN(z),
(16)
we can generate polynomial sequence f2i (z), I = 1, 2, ... . Setting initial condition for Euclid' s algorithm PO (Z) = zN - 1, P1(z) = fN(z) = f (Z) + O(zN - 1). (17)
392 We generate the sequence of remainder polynomials such that pi+1(z) = pi-1(z) - ai(z)pi(z),
degpi+l < degpi, i = 1,2,..., (18)
where ai(z) is a quotient polynomial of pi-1/pi. Finally pi+1 (z) may be none zero constant or zero according to relatively prime or not with respect to given po(z) and pi(z), respectively. On the remainder polynomial it is well known that pi(z) = Ai( z)po(z ) + Bi(z)pl ( z),
deg Bi + deg pi < deg po. (19)
Substituting Eq. 17 into right hand side, we get pi(z) = Bi(z )f(z) + O(zN - 1).
(20)
It means that the rational function pi/Bi whose degree is less than N is just the rational interpolant of f (z) on the zeros of zN - 1. So we change the notation Bi(z) to qi(z), that is, gi(z)f( z)-pi(z )=_ 0
(mod zN-1 ),
deggi+degpi
(21)
The most difficult problem is to determine the degree of the denominator qi(z) of the rational interpolant. For the description for our algorithm , we need further notations . We divide the polynomial p(z) into two parts one of which is the lower degree and another part is the higher degree according to the number m < n, as follows, p(z) = Lm(p) + zmHm(p), (22)
where p(z) = Lm(p) H. (p) =
ao + alz + ... + anzn, ao + a1z + ... + am_lzm_l, am + am+lz + ... + anzn.
For the error estimation of DFT, we define
e(p) = Ian-1I + Ianl. (23) Hereafter to avoid the waste of computation of pi and qi through Euclid's
algorithm, we set the lower bound for the degree of residual polynomial, for instance, degpi > 3N/4. And the numerator pi(z) of rational interpolant is normalized to Ilpill=1. We check the convergence for given tolerance c > 0. First criterion:
IIHN1 2(pi ) II 5 max {e(LN/2(pi), V}.
(24)
393 Second criterion: e(p.) < e.
(25)
When the both conditions are satisfied, we adopt qt(z) corresponding to Pi(z) as the solution. If none of the element of sequence {p,} satisfies above two conditions, we need to iterate the process mentioned above after doubling N. Algorithm. Let the initial conditions be e : small positive number, N : multiple of 4, /*(*) = /(*) mod ( z w - l ) . Stage for FFT. Compute fN{z) = f(z) mod (zN + 1). Compose fw(z)
by Eq. 16.
N «- N x 2. Stage for convergence check. Generate the sequence {p,} and {;} Through Euclid's algorithm until the both conditions Eq. 24 and 25 are satisfied. Then the final element g,(z) is the solution. So, the zeros of 9,(2) are the approximation of the poles of f(z) in the neighborhood of the unit disk. If none of the element of {p,} satisfies the both conditions, then return to Stage for FFT. 3.
Numerical examples We define three kinds of polynomials:
A(z) =
li(z-a^), i=o
B(z) = fi(z-a-v), 1=0
Q(z) = £(k+\)z\
/=1,2.
We tested problem (1) to (4) as follows. d(z)/A(z),
C2(z)/{A(z)B{z)),
cos2nz/A(z),
where we set
a = yr
„ = 5.
cos2nz/(A{z)B(z)),
394 Table 1: Prob. (1) N(e) = 32 0.06249999999847504 0.25000000000293065 0.49999999999807024 0.70710678118718011 0.84089641525360938
Prob. (2) N(e) = 128 0.06249999854954685 0.25000000228443052 0.49999999882077747 0.70710678148593961 0.84089641521948099 1.18918789762274121 + 0.00000128926607739i
Table 2: Prob. (3) N(c) = 64 0.06250000000228643 0.25000000000526976 0.50000000001562673 0.70710678118204512 0.84089641525164608
Prob. (4) N(e) = 64 0.06250000005101599 0.25000000014644407 0.50000000060502682 0.70710678095635713 0.84089641512288177 1.18920729048269239 - 0.00000001007125467i
We computed by double precision, that is, 16 digits. As the initial conditions, we set 10"10, N = 16. We forced the coefficient of highest degree of the polynomial to zero in the process of Euclid's algorithm, if its modulus does not exceed c/100. We show numerical examples in Table 1 and 2 where N(c) means the number of samples for given c. The part of figures underlined means the error. We could know from these numerical results as follows. First, the accuracy of distinct poles within the circle is comparable to given tolerance c. Secondarily, the accuracy of poles in the exterior of the circle is less than that of the poles within the circle. We describe the behavior of the numerator pi(z) at final stage denoting n pi(z) = akzk,
Ilpil) = 1,
k=0
where n and each ak depend on i but we omit it for simplicity. Figure 1 to 4 show the graphs (k, log10 IakI) for problem (1) to (4), respectively.
395 4. Acknowledgments We wish to thank graduated student Mr. Hideo Yamagata whom we owe the programming and many numerical tests. 5. References 1. A. Cuyt and L. Wuytack, Nonlinear methods in Numerical Analysis, Mathematics studies; 136, North-Holand, Amsterdam, 1987.
k
0
30
0
-4
-4
-8
-8
-12
-12
1og10
lakl
k
0
0
20
loglolakI 1st
FFT
k
k 0
30
0
-4
-4
-8
-8
-12
-12
log10lakI
2nd
log10Iakl 3rd
k 0
30
0
k 1
ZU---- 30
-4
-4
-8
-8
30
20
-12
-12 1
1og10lakl
log10IakI
4th
Figure 1:
5th
396
k 0
0
60
90
-4 -8
-8
-12
-12
log10Jakl 1st
FFT
log10 lake
k 0
60
k 0
-3210
-4
-4
-8
-8
-12
-12
k 3
60
90
3rd
log10 lak)
log10 lak) 2nd
0
60 90
90
0
120
k 60
90
-4
-4
-8
-8
-12
-12
loglo yak)
log10
)akI
log10lak) 5th
4th
6th
Figure 2:
120
397
-4
-8 -12
logiolakl
FFT k
k -4
-8
-8
-12
-12
log10 Iakl
2nd k
k
0 -4
-8
-8
-12
-12
log10 lakl
3rd
log10Iakl
I
-,1III1VV
log10 Iakl
4th
Figure 3:
60
30
h,
398
k 0
k
0
30
30 -4
-4
-8
-8
-12
-12 loglolakl
loglolakl
FFT
1st k
k
0
30
60
-4 -8
-12 loglolaki
3rd k
_k 60
0 -4
-8
-8
-12
-12
logloIakI
loglo Iakl
4th
5th
k
Figure 4:
I t
I
WSSIAA 2(1993) pp. 399-412 ©World Scientific Publishing Company
SEQUENTIAL AND PARALLEL ALGORITHMS FOR SECOND-ORDER INITIAL-VALUE PROBLEMS
E.H. TWIZELL Department of Mathematics and Statistics, Brunel University Uxbridge, Middlesex England. UB8 3PH
A.Q.M. KHALIQ and D.A. VOSS Department of Mathematics, Western Illinois University Macomb, Illinois 61455, U.S.A.
ABSTRACT Complex splittings of a family of multiderivative methods, based on diagonal Pad6 approximants to the exponential function , which were developed in an earilier paper, are considered for the solution of the periodic initial-value problem y'(t) = f(t,y), t > 0, y(O) and y'(0) given. The major benefit from the use of the complex-splitting techniques is that the need to obtain expressions for higher derivatives of y(t) is obviated. The numerical methods arising from the splittings are tested on two problems from the literature , one linear and one nonlinear. Linear problems in which the function f(t,y) = Ay, where A is a constant matrix , arise following the semi-discretization of the simple wave equation . A parallel implementation of the methods based on the diagonal Pad6 approximants, in solving such linear problems, is also considered for MIMD machines . Speed-up and efficiency factors of parallel-over-sequential implementations are given for two particular methods on an Alliant FX/8.
1.
Introduction
Second-order initial-value problems of the form Y,(t) = f (ta (t)) ,
t > 0 ; Y (0) = Yo , Y'(0) = zo , (1)
where y, y', y', f, yo, zo are vector-valued functions, arise in the theory of orbital mechanics, and recent years have seen growing interest in the numerical solution of such problems. Problems of the type (1) with periodic solutions can be classified in one of two ways: (i) problems for which the solution period is known in advance, (ii) problems for which this period is not known at the outset. Numerical methods applied to class (i) which yield a computed solution that stays on the orbit are described as orbitally stable; numerical methods which yield a computed solution that spirals inwards or outwards are said to be orbitally unstable (Stiefel and Bettis"'). The literature contains numerous computational methods which can be used
399
400 to solve problems of class (ii). Multistep methods are discussed by Lambert and Watson 16, for instance; hybrid methods by Cash2', Chawla and Rao', Hairer'2 and Voss and Serbin29; global extrapolation techniques by Khaliq and Twizell14; RungeKutta-Nystrom methods by Chawla4, Chawla and Rao', Dormand et a1.9, Hairer12, van der Houwen and Sommeijer24,u and van der Houwen et al. 'b; and predictor-corrector formulations by van der Houwen and Sommeijer23 and Voss and Khaliq's. It was demonstrated in Twizell and Khalig22 that multiderivative methods can be powerful in that they produce high accuracy combined with good stability/ periodicity properties. This is borne out by Ananthakrishnaiah', Coleman8 and Meneguette'7. The methods developed in Twizell and Khalig22 are based on the (M,K) Pade approximant to the exponential function e' given by e e - P(e)/Q(e) , (2) where P(O) and Q(O) are polynomials of degrees K and M, respectively; the error in the (M,K) Pade approximant is 0(9M+x+1\ The multiderivative methods were then developed by adapting (2) for use in a recurrence relation (see §2 below). In the special cases for which M = K in (2) the resulting numerical methods are known alternatively as Obrechkoff methods. The method based on the (3,3) Pade approximant also appeared in a later paper by Ananthakrishnaiah' (that author's equation (2.12)) who, like Twizell and Khalig22, erroneously claimed Pstability for the method in the sense of Lambert and Watson 16 (see Chawla and Subramanian'). However, this and other Obrechkoff methods, which are absolutely stable are "almost P-stable" in the sense of Coleman8 and Thomas20. The benefits, already noted, which are enjoyed by multiderivative methods are somewhat offset, especially for nonlinear problems, by the need to obtain expressions for higher-order derivatives for direct implementations. Fortunately, this need is obviated by turning to the complex factors of the polynomial differential operators resulting from the use of (2) in the recurrence relation (Twizell and Khalig22). These factors are given in §2 of the present paper where they are applied to (1). Numerical results relating to one linear and one nonlinear problem are reported in §3. Gladwell and Thomas" discuss the efficient implementation of some fourth-order Pstable methods for the solution of second-order initial-value problems. Their findings are based on both complex and real arithmetic techniques. Early work on complex splittings for the second-order partial differential equations (vibrating beam problems) is reported briefly in the thesis by Khaliq". An expansion of this work appears in Khaliq et al.'s Following the semi-discretization of the simple wave equation 32y/3t2 = a2y/axe , in which y = y(x,t), obtained by replacing the space derivative a'y/ax2 by some finite
401 difference or finite element approximant, there results a linear second-order initialvalue problem of the form ( 1). The right hand side of the associated ordinary differential system has the form
J(t,y) = Ay
(3)
in which the constant matrix A depends on the replacement of a2y/fib. In §4, a parallel implementation of the numerical methods based on the use of the diagonal Pade approximants to the exponential function will be discussed with regard to ( 1), (3). The speed-up and efficiency of the parallel implementations of the (3,3) and (4,4) Pade approximants, with respect to the sequential implementations, will be given. 2. The Complex Splittings and Their Sequential Application The (M, M) Pade approximant to ee may be written in the form e° = II (1-a/Cm)/ II (1 +e/Cm) + 0( 62"'.') ,s -'
(4)
m -'
where Cm is the complex conjugate of C. (henceforth, all products II are defined for m = 1,2,...,M). The values of Cm = a. + ibm, i = + ✓ -1 , m = 1,2,...,M are given in the paper by Reusch et aL 18 (p.831). It follows, therefore, that 11(1-a/C.)II(1-6/Cm) + 11(1+0/Cr)II(l+e/C,) e ea + e - > (5) 11(1 +e/Cm) 11(1-e/Cm) from which it may be shown that ee + e-B - 2II( 1+P 02) . 11(1-une2-ivme2)
(6)
The values of pm, um, vm (m=1,...,M) are given in Table Al of the Appendix for M = 1,2,3,4,5. Consider, now, the special, second-order, initial-value problem ( 1) and the exact relation y(t+Q) - [exp(QD) + exp(- QD)]y (t) +y(t-l) = 0 ; t = Q, 29,... (7)
402
in which D = diag{d/dt) is the differential matrix operator and e > 0 is an increment in t. Adapting (6) and substituting in (7) gives y(t+Q) - 2]I(I-umDZD2-i%eD2)-1(I+pmD2D2)Y(t) - Y(t -Q) . (8) This gives (I-uJQ2D 2-ivJQ2D 2)...(I-u2Q2D 2 -iv2Q2D 2)(I-u1 Q2D 2 -iv1Q2D 2)y(t +Q)
= 2(I+p1i'D2)(I+p202D2 )...(I+p D2)Y(t)
(9)
- (I-UmeD2 - ivMQ2D2)...(I-u2Q2D2-iv2Q2D2)(I-u1Q2D2-iv1Q2D)y(t-Q)
which suggests the procedure To : = YO-Q) , Y. (I- u.Q2D2- ivmQ2D2)Y,„ -1 ; m = 1,2,...,M PO
: = Y(t) ,
(10)
P. _ (I+PM-m +1eD2) Y,,,-1 ; m ao=2pm-Ym , (I-um_M +1 Q2D2 -ivM- m+1#2D2) as,
=
am -1 ;
m = 1 ,2,...,M ,
Y(t+Q) - «M for computing an approximation to y(t+Z). This splitting algorithm obviates the need to derive theoretical expressions for the higher derivatives of the function f(t,y) while preserving the accuracy and stability/periodicity properties of the multiderivative methods. It is noted that, using (10), am (m = 1,2,...,M) is given implicitly and an iterative method, such as the Newton-Raphson method for a nonlinear algebraic system, must be used to compute am. It is known (Twizell and Khalig22) that the numerical method associated with the (M,M) Pade approximant is of order 2M, with local truncation error 0(e2M+2). The splitting based on the (2,2) Pade approximant is applied in the paper on cantilever beams by Khaliq et al.'s 3. Numerical Results The complex splittings of the numerical methods based on the (M,M) Pade approximants (M = 1,2,...,5), detailed in §2 were tested on two problems from the literature Problem 1. This is the single, linear problem y"(t) = - 100y (t) , t > 0 ; y(0) = 1 , y^(0) = 0 (11) the solution of which is
403
y(t) = coslOf .
( 12 )
The numerical solutions y, (n = 2,3,...,Q) were computed to time t = lOir using time steps ( = w/12 and ir/48 for which Q = 120 and 480, respectively. The absolute errors en = | y(n€) - y„ | , n = iQ/10 (i = 1,2,4,6,8,10), are given in Table 1 for the two time steps, together with the corresponding results without using the splitting technique. The results obtained using the complex splittings are seen to be generally superior to those obtained without splitting, especially for the higherorder methods used with a large time step. The latter results lose accuracy because, in the higher derivatives of y(t) obtained from the ODE (11), the coefficient -100 is raised to a large power. It was noted that, for M = 4,5 and (. = ir/48, the error in the real part of the computed solution was less than 5 x 1017 and the errors reported in Table 1 for this value of M and I are due to residual imaginary parts in the computed solution. Calculations were carried out on a Pyramid 9820 computer using Fortran with double precision complex arithmetic. Ananthakrishnaiah1 reported results (that author's Table 1) for Problem 1, to three significant figues, using I = TT/12 and the sixth- and eighth-order methods used in that paper. Ananthakrishnaiah's results agree with the results obtained for M = 3,4 without using complex splittings to three significant figures. Problem 2. This is the single, nonlinear undamped Duffing problem y"(t)
+
y® * y\t) = Fcosuf , t > 0 ; y(0) = A , y'(0) = 0
with F = 0.002, to = 1.01 and A = 0.200426728067. Numerical results will be compared with the series solution y(f) = a, cos oof + a3cos3»r + a}cos5o>t * a7cos7o>f in which
a, = 0.200179477536 , a5 = 0.000000304014 ,
a, = 0.000246946143 , a7 = 0.000000000374
(see van Dooren27). The numerical solutions y„ (n = 2,3,...,Q) were computed to time t = 10rr using a time-step of i = TT/5 for which Q = 50. The absolute errors e„ = |y(n€)-y„| are given in Table 2 for the case M = 3, together with the corresponding sixth-order results of Ananthakrishnaiah1 (that author's equation (2.12)). The results obtained using the complex splittings are seen to be more accurate than those without splittings reported by Ananthakrishnaiah1.
404
Table 1 Error moduli for Problem 1 using t = ir/12 and ir/48 W12
^/4i
M Time
with splitting without splitting
with splitting without splitting
1
0.19667D+01 0.69121D-01 0.14799D 00 0.23578DOO 0.33158D 00 0.43438D 00 0.64910D 00 0.18225D+01 0.56566D 00 0.93502DO0 0.15477D+01 0.10988D 00 0.24005D-02 0.10542D-01 0.43826D-01 0.99088D-01 0.17506D 00 0.26999D 00 0.20862D-05 0.90599D-05 0.37968D-04 0.86665D-04 0.15521D-03 0.24348D-03 0.91234D-16 0.91234D-16 0.13685D-15 0.23949D-15 0.59605D-07 0.59605D-07
0.48694D 00 0.14934D+01 (see 0.14941D+01 previous 0.17524D-04 column) 0.153HD-01 0.14556D+01 0.29147D-04 0.27564D-03 0.11921D-03 0.11348D-02 0.48196D-03 0.46015D-02 0.10881D-02 0.10392D-01 0.19374D-02 0.18493D-01 0.30299D-02 0.28885D-01 0.20390D-16 0.84304D-08 0.33983D-17 0.34714D-07 0.13593D-16 0.14084D-06 0.23788D-16 0.31837D-06 0.16992D-16 0.56732D-06 0.67966D-17 0.88768D-06 0.85892D-18 0.71054D-13 0.85892D-18 0.31264D-12 0.51535D-17 0.13074D-11 0 0.29701D-11 0.60124D-17 0.52651D-11 0.85892D-18 0.82352D-11 0.68537D-17 0.71054D-14 0.34268D-17 0.21316D-13 0.94238D-17 0.71054D-14 0.17134D-16 0 0.37695D-16 0.71054D-14 0.49689D-16 0.14211D-13
Ir Itr 4TT 6v 8TT lOir 2 v lir 4ir (m 8ir 10ir v 2TT 4TT frrr 8ir lOir 4 v 2ir 4ir 6TT 8TT IOTT 5 w 2ir 4TT 6ir 8ir IOTT
(see previous column)
0.64910D 00 0.18225D+01 0.56566D 00 0.93502D+00 0.15477D+01 0.10988D 00 0.24005D-02 0.10542D-01 0.43826D-01 0.99089D-01 0.17506D 00 0.26999D 00 0.20632D-05 0.90781D-05 0.37962D-04 0.86652D-04 0.15515D-03 0.24345D-03 0.67138D-09 0.29541D-08 0.12354D-07 0.28199D-07 0.50489D-07 0.79225D-07
405 Table 2 Error moduli for Problem 2 using e = it/5 Error Moduli Time
M = 3 with splitting
Ananthakrishnaiah' 0.453D-04 0.18813-03 0.746D-03 0.163D-02 0.278D-02 0.411D-02
0.117D-04 0.469D-04 0.18713-03 0.417D-03 0.731D-03 0.11213-02
7r 21r 41r 67r 8a 10ir
4. Parallel Implementations Consider the second-order hyperbolic equation
a2y/at2 = a2ylax2 ; 0 < x < X , t > 0
(13)
(the simple wave equation with unit wave speed) together with the boundary conditions y(0,t) =y(X,t) = 0 ; t>0,
(14)
the initial displacement
y(x,0)= fi(x) ; 0
sx:X
(15)
and initial velocity-function ay(x,0)/at = g(x) ; 0 S x s X.
(16)
It is well known (see, for instance, Twizell21) that the initial/boundary-value problem (4.1) - (4.4) can be approximated by a second-order initial-value problem of the form d2Y/dt2 = AY , t > 0 ; Y(0) = # , r(O) = g .
(17)
The solution vector Y = Y(t) of the semi-discrete problem (17) is an approximation to the solution y = y(x,t) at each point x of the discretization of the interval 0 5 x <_ X at some time t. In (17) the vectors 0 and g follow in an obvious way from (15) and (16), respectively, and the matrix A depends on the discretization of the interval 0 s x <_ X and on the approximation chosen to replace a2y/axe in (13). It
406 then follows from (1) and (7) that Y(t) satisfies the recurrence relation Y(t+Q) - [exp(QB) + exp(-QB)]Y(t) + Y(t-Q) = 0
(18)
in which B is a matrix such that B2 = A, 0 is the zero vector of order N (N being the number of interior points at which the interval 0 <_ x <_ X is discretized) and, as before, e > 0 is an increment in t. Consider again the (M,M) Pade approximant to ee, where 0 is a real scalar, written now in the form (19)
ee - PM (0)/PM(-0) = RM,M(6) , say, in which PM is a polynomial of degree M. Then e e + e -e = [P(0) + P - 0)]/[PM(0)PM(-0)J = SM(62)
(20)
say, so that SM(02) has the form (21)
SM(AZ) = PM(u)/O,(u) where PM(u) and QM(u) are obvious from (20)^and u = 02 Denoting the zeros of QM(u) by c„c2,...,cM and noting that PM(0) PM(9)
=(-1)M
+E q1
i =1
w ! +2 0-ci
q'*q2
F
Re
i =gj.1
(22) 0-ci 1
is the partial fraction expansion of R,,,.M(9)(see, for instance, Gallopoulos and Saad10), which has poles ci of which q1 are real and 2q2 are complex . These poles are listed in Reusch et al. 18 In (22) the weights w; (i=1,2,...,g1+q2) are given by wi = -PM(c) /PM(-c) . (23) Similarly, SM(02) may be expressed in partial-fraction expansion form as q,
SM(02) = SM(u) = 2 (-1)M + E i=1
gl'gx
wi u-ci
*Vi
+ 2
(24)
i q,.1 u-ci
where c; = ci and w; = PM(a,)/{2Q M(ci)} for i = 1,2,...,M. The poles c, and the weights w; which will be used in (24) are given in Tables A2 and A3, respectively, of the Appendix. By way of illustration, consider first of all the parallel implementation of the
407
numerical method based on the use of the (3,3) Fade approximants to exp(±lB) in (18), in solving the hyperbolic problem (13)-(16) with X=l, y(x,0)=simrx, O i x s l , in (15), and g(x) = 0, O^xsX, in (16). The theoretical solution of this problem is y(x,t) = sinwx cosirt It is easy to see that the semidiscrete solution Y(t) satisfies the recurrence relation y(f+t) - 53(M)K(r) ♦ y(r-0 « 0 ,
(25)
in which, using I to denote the identity matrix of the same order as A, S3(M) - 2[-/+w,(M-c,/)"1 +2/te(w2(M-£J)X)]
(26)
(see Tables A2 and A3 of the Appendix for the values of the poles and weights to be used in (26)). The fully discrete solution, Y*+1, at time t=(n+l)€ may then be calculated using the following algorithm for which two processors are needed (1) Solve
(M-^17)V, = Y* using Processor 1 .
(27a)
(2) Solve
(M-CjflVj = F" using Processor 2 .
(27b)
(3) Compute
r M - -2Y" + 2wtVt * 4/fe(*2V2) - J"-1
(27c)
In a series of numerical experiments, the interval O S x i l was divided into N+l subintervals each of width h, so that (N+l)h = 1, with N = 50, 100, 200 and 400; these values of N gave the corresponding order of the matrix A and of all the vectors in (27). The problem was integrated to time t = 1 using T time steps each of length €, so that Tt = 1, with (i) T = 50, (ii) T = 200. The approximation used to replace the space derivative in the partial differential equation was the usual second-order central difference approximant afyofcc2 - h-2fy(x-h,t) - 2y(x,t) + y(x+A,f)J + 0(h2)
( 28 )
so that A = [a^] was the familiar matrix defined, for i = 1,2,...,N, by au = -2A-2 ; ay = h1 , \i-j\ = 1 ;
atf = 0 , \i-j\ > 1.
(29)
The experiments were run sequentially, using the algorithm of §2 with M=3, and using the parallel algorithm (27) on an Alliant FX/8. The benefits of using the parallel algorithm come in the form of a saving in CPU time. This can be seen in Table 3 where the CPU time (in seconds) is given for each numerical experiment Defining the speedup and efficiency to be
408
speedup =
sequential time
speedup efficiency = parallel time No . of processors '
these quantities are easily calculated and are given in Table 3. Table 3 CPU times (s) in solving (13)-(16), speedup and efficiency Sequential Parallel Speedup Efficiency N 50 100 200 400
T=50 T=200
T=50 T=200
T=50 T=200
T=50 T=200
0.18 0.38 0.78 1.56
0.10 0.20 0.40 0.80
1.80 1.90 1.95 1.98
0.90 0.95 0.97 0.99
0.77 1.55 3.12 6.27
0.40 0.80 1.58 3.17
1.93 1.94 1.97 1.98
0.96 0.97 0.98 0.99
It is noted in (27a,b) that one processor solves a real linear system while the other solves a complex linear system. This creates a time delay in that one processor is idle whilst waiting for the other to catch up before the computation can proceed to the calculation of Y..,, given by (27c). The CPU time used in integrating from time t to time t+e is, of course, governed by the time taken by the processor which is not idle. An increase in accuracy may be obtained, at no overall increase in CPU time, by keeping both processors busy and by using the numerical method derived from (24) with M=4. Here q1=0 and q2=2 so that (24) leads to S4(QA) = 2[1+2Re{r"v1(PA-e1n-1} + 2Re( 2(DA-e2n-1 }]
(30)
(see Tables A2 and A3 of the Appendix for the values of the poles and weights to be used in (30)). The solution vector Y"+1 may then be computed using the algorithm (1) Solve (0A-c11)V1 = Y" using Processor I
(31a)
(2) Solve (U - c21) V2 = Y" using Processor 2
(31b)
(3) Compute
(31c)
Y"'1 = 2Y" + 4Re(w1 V1) + 4Re(iv2 V2) - Yn-1
409 Clearly both processors here are solving different, complex, linear systems of the same order and no time delay is experienced. In fact the CPU time, speedup and efficiency figures were the same for all numerical experiments carried out using (27) and (31) (see Table 3). Using the higher-order method (M=4) did give some improvement in accuracy for all values of N. 6.
Summary
Complex splittings of a family of multiderivative methods, based on diagonal Pade approximants to the exponential function, which were developed, analysed and tested by Twizell and Khaliq22, have been considered for the second-order initial-value problem y*(t) = f(t,y), t > 0, with initial conditions y(0) and y'(0) specified. The algorithms arising from the splittings, which were employed sequentially, were tested on two problems from the literature, one linear and one nonlinear. It was seen for the latter that not having to calculate the higher derivatives was a considerable advantage. Parallel implementations of two methods based on diagonal Pade approximants to the exponential function for solving the linear system of differential equations arising from the semi-discretization of the simple wave equation, were also considered. Speed up and efficiency factors of parallel-over-sequential implementations highlighted the saving in CPU time of the parallel algorithms. Acknowledgements The research for this paper was begun whilst one of the authors (EHT) was a visitor to Western Illinois University (WIU). The financial support for this visit received from WIU Foundation, the Department of Mathematics (WIU) and Sigma Xi (WIU) is gratefully acknowledged. The authors are also grateful for the use of the computing facilities at Brunei University and at the Advanced Computing Research Facility at Argonne National Laboratory. We would also like to thank Ms. M.E. Demmar, Department of Mathematics and Statistics, Brunei University, for her patience and expert preparation of the typescript References 1. 2. 3. 4.
U. Ananthakrishnaiah, P-stable Obrechkoff methods with minimal phase-lag for periodic initial-value problems, Math. Comp. 49 (180) (1987), 553-559. J.R. Cash, High order P-stable formulae for the numerical integration of periodic initial-value problems, Numer. Math. 37 (1981), 355-370. J.R. Cash, High order P-stable methods for periodic initial-value problems, BIT 24 (1984), 248-252. M.M. Chawla, On the order and attainable intervals of periodicity of explicit Nystrom methods for y"=f(t,y), SINUM 22 (1) (1985), 127-131.
410
5. M.M. Chawla and P.S. Rao, High accuracy P-stable methods for y"=f(t,y), IMA J. Numer. Anal. 5 (1985), 215-220. 6. M.M. Chawla and P.S. Rao, Phase-lag analysis of explicit Nystrom methods for y"=f(x,y), BIT 26 (1986), 64-70. 7. M.M. Chawla and R. Subramanian, Diagonal Pade approximations to the exponential function and P-stability, Intern. J. Computer Math. 33 (1990), 103106. 8. J.P. Coleman, Numerical methods for y"=f(x,y) via rational approximations for the cosine, IMA J. Numer. AnaL 9 (1989), 145-165. 9. J.R. Dormand , M.E.A. El-Mikkawy and P.J. Prince, Families of Runge-KuttaNystrom formulae, IMA J. Numer. Anal. 7 (1987), 235-250. 10. E. Gallopoulos and Y. Saad, On the parallel solution of parabolic equations, CSRD Report No.854 (1989), University of Illinois, Urbana-Champaign. 11. I. Gladwell and R.M. Thomas, Efficiency of methods for second-order problems, IMA J. Numer. Anal. 10 (1990), 181-207. 12. E. Hairer, Unconditionally stable methods for second-order differential equations, Numer. Math. 32 (1979), 373-379. 13. A.Q.M. Khaliq, Numerical Methods for Ordinary Differential Equations with Applications to Partial Differential Equations, Ph.D. thesis, Brunel University, 1983. 14. A.Q.M. Khaliq and E.H. Twizell, Global extrapolation of numerical methods for initial-value problems, AppL Maths. & Comp. 31 (1989), 148-160. 15. A.Q.M. Khaliq, E.H. Twizell and A.Y. Al-Hawaj, The dynamic analysis of cantilever beams by the finite element method, in J.R. Whiteman (ed.) The Mathematics of Finite Elements and Applications (MAFELAP 1990) (1991). 16. J.D. Lambert and I.A. Watson, Symmetric multistep methods for periodic initial-value problems, J. Inst. Maths. Applics. 18 (1976), 189-202. 17. M. Meneguette, Multistep-Multiderivative Methods and Related Topics, D. Phil. thesis, University of Oxford, 1987. 18. M.F. Reusch, L. Ratzan, N. Pomphrey and W. Park, Diagonal Pade approximants for initial-value problems, SIDDC 9 (5) (1988), 829-838. 19. E. Stiefel and D.G. Bettis, Stabilization of Cowell's methods, Numer. Math. 13 (1969), 154-175. 20. R.M. Thomas, Phase properties of high order, almost P-stable formulae, BIT 24 (1984), 225-238. 21. E.H. Twizell, An explicit difference method for the wave equation with extended stability range, BIT 19 (3) (1979), 378-383. 22. E.H. Twizell and A.Q.M. Khaliq, Multiderivative methods for periodic initialvalue problems, SINUM 21 (1) (1984), 111-122. 23. P.J. van der Houwen and B.P. Sommeijer, Predictor-corrector methods for periodic second-order initial-value problems, IMA J. Numer. Anal. 7 (1987), 407-422. 24. P.J. van der Houwen and B.P. Sommeijer, Phase- lag analysis of implicit
411 25. P.J. van der Houwen and B.P. Sommeijer, Diagonally implicit Runge -KuttaNystrom methods for oscillatory problems, SINUM 26 (2) (1989), 414-429. 26. P.J. van der Houwen, B.P. Sommeijer, K. Strehmel and R. Weiner, On the numerical integration of second -order initial-value problems with a periodic forcing function, Computing 37 (1986), 195-218. 27. R. van Dooren, Stabilization of Cowell's classical finite difference method for numerical integration, 1. Comput. Phys. 16 (1974), 186-192. 28. D.A. Voss and A.Q.M. Khaliq, A sixth-order predictor-corrector method for periodic IVP's, AppL Math. Lett. 2 (1) (1989), 65-68. 29. D.A. Voss and S .M. Serbin, Two-step hybrid methods for periodic initial-value problems, Comput. Math. Applic. 15 (3) (1988), 203-208. APPENDIX Table Al Values of pm, um, v. (m=1,...,M) for m=1,2,3,4,5. M m
pm
um
1
1
1/4
1/4
1
(5-421)/24
1/24
-V 3/24
2
(5+421)124
1/24
3/24
2
3
4
5
vm
0.0
1
0.4221124574542046D-02
0.4636030080182417D-01
2
0.4060379415750504D-01
0.1819849599087895D.02
-0.3866028019101487D-0 1
3
0.4051750898361206D 00
0.1819849599087895D-02
0.3866028019101487D-0 1 -0.1503251597448412D-01
0.0
1
0.1520828693173826D-02
0.2285060654976824D-01
2
0.1289164921847655D-01
0.2285060654976824D-01
0.1503251597448412D-01
3
0.4458960145711899D-01
-0.4993463692625382D-02
-0.2118158122584153D-01
4
0.4052836298942566D 00
-0.4993463692625382D-02
0.2118158122584153D-01
1
0.4052847325801849D 00
0.1879882659272389D-01
2
0.4500343870828825D-01
0.1006209373012608D-01
0.0 -0.1433779819776761D-01
3
0.1558017637580633D-01
0.1006209373012608D-01
0.1433779819776761D-01
4
0.5675885826349258D-02
-0.5572618137599262D-02
-0.1259126383891092D-01
5
0.6779918330721557D-03
-0.5572618137599262D-02
0.1259126383891092D-01
412
Table A2 The poles c; of S (u)
M 1 2 3 4 5
Imaginary part
Real part 4.0 6.0 0.2157017928495951D+02 0.1214910357520242D+01 0.3054376328600142D+02 -0.1054376328600142D+02 0.5319480953066981D+02 0.3279497156148978D+02 -0.2939237632682468D +02
0.0 613 0.0 0.2580915194983086D+02 0.2009354143478240D+02 0.4472518320265668D+02 0.0 0.4673060068426140D+02 0.6641172175186130D+02
Table A3 The weights w; of S (u) Real part
1 2 3 4 5
-8.0 36.0 -0.2656698024186751D+03 -0.1116509879066245 D + 02 0.6846648265824569D+03 -0.2846648265824571D+03 -0.5914862140286641D+04 0.1478317349114175 D + 04 0.5791137210298139D+03
Imaginary part
0.0 1213 0.0 0.1339510410172097D+03 0.8885576639531694D+03 0.1751150081589036D+03 0.0 0.3349888808606921D+04 0.4035801364253665D+03
WSSIAA 2( 1993) pp. 413-426 ®World Scientific Publishing Company
SOLVING GENERALIZATIONS OF ORTHOGONAL PROCRUSTES PROBLEMS G A WATSON Department of Mathematics and Computer Science University of Dundee Dundee DD14HN Scotland
ABSTRACT The orthogonal Procrustes problem involves finding an orthogonal matrix which transforms one given matrix into another in the least squares sense: thus it requires the minimization of the Frobenius matrix norm. We consider the solution of this problem for some other orthogonally invariant norms.
1. Introduction Let A, B be given real m xn matrices, and consider the following problem: minimize { IIA-BXII : XEIR"x", XTX=1" }.
(1.1)
where the norm is a matrix norm on m xn matrices. When this is the Frobemus norm, II•IIF, then the problem is referred to as an orthogonal Procrustes problem: the intention is to find an orthogonal matrix which most nearly transforms B into A in the least squares sense. A solution to the problem in this case is readily obtained from the singular value decomposition of BTA (see for example Golub and Van Loan 3, p 582): the solution is in fact the orthogonal polar factor of BT A. This problem arises in factor analysis: see for example 5,10; some other applications are to be found in 1,4.6 When m=n and B is the unit matrix, then the solution in the Frobenius norm is also the solution in all orthogonally invariant norms (for example Fan and Hoffman 2). However, this is not true in general, and the purpose of this paper is to consider the solution of (1.1) for some other orthogonally invariant norms, principally the spectral norm, but also other members of the class of so-called Schatten p norms. It will be assumed for convenience in what follows that m 2m.
413
414 For given X satisfying the constraint of (1.1), let R=A -BX, and let the singular value decomposition (SVD) of R be given by R =U VT, where U is m xm orthogonal, V is n xn orthogonal, and E is an m xn matrix zero except for the singular values cr z ..Zcr down the main diagonal .
Then a class of orthogonally
invariant norms can be defined by taking
(1.2)
IIRII=1IaIIp,
and the vector norm on /R" is the usual lp norm. This class of
where a=(al,
norms is usually referred to as Schatten p or cp norms. Important special cases are the Frobenius norm (p =2), the spectral norm (p =) and the trace norm (p =1). All orthogonally invariant norms can be defined by replacing the right hand side of (1.2) by a symmetric gauge function s, but we will not require this level of generality. In order to give useful conditions satisfied by a solution of (1.1), we require the subdifferential or set of subgradients of the norm. For any convex function f :/R 8 SIR , the subdifferential off at x is defined by
af (x) = {v: VE]R's, f (y)Zf (x) + (y-x)TV,
for all yE IR s }.
For the norms defined by (1.2), it may be shown (for example 12,13) that aRRR11 = (GE/Rm : G=UDVT, R=UEVT is any SVD, D=diag{dl,..,d" }, dEallallp }• In particular, when 1
Theorem 1. Let X solve (1.1). Then there exists G E aIIR II such that XT BT G is a symmetric matrix. Proof By standard Lagrange multiplier theory for nondifferential optimization (for example ), there is a symmetric matrix of Lagrange multipliers AEIR"' and G E aIIR 11 such that BTG =XA. The result follows. ❑
415 These conditions are necessary for a solution, but because the problem is not a convex one, they are not generally sufficient.
2. The case p -. Consider now the special case when p=er in (1.2). Then Theorem 1 can be expressed in the following form, using the form of the subdifferential in this case 12. Theorem 2 Let p=c and let X solve (1.1) with R=[U1: U2] E [V1:V2]T an SVD, where a1 has multiplicity t, and U 1 and V 1 have t columns. Then there exists a t xt symmetric positive semi-definite matrix M such that XTBTU1MVi is symmetric, trace(M) = 1. If the value t of the multiplicity of a1 at a solution is known, then locally, the problem (1.1) may be expressed in the form minimize h subject to a? = h, i=1,..,t,
(2.1)
XTX =I". The treatment of this problem is complicated by the fact that when t>1, the set of t constraints representing coincidence of singular
values
is not immediately
differentiable. However, this set can be replaced by a differentiable system of constraints, and this is the basis of the method of Watson 11 for minimizing the spectral norm of a matrix whose elements depend on some free parameters : the underlying technique uses results from Overton 9. Clearly the two problems differ only in the presence of the differentiable set of constraints representing the orthogonality restriction in (2.1). The basic idea is to solve a sequence of quadratic programming subproblems effectively obtained from (2.1) by linearization of the constraints about the current approximation to X, and replacement of the objective function by a quadratic approximation; the solution then gives an increment for the current approximation. The objective function of the quadratic programming subproblem can be defined in such a way
416 that a second order convergent process is obtained analogous to Newton's method applied to the solution of the nonlinear system of equations representing the conditions of Theorem 2. Let D E IR n "n be a matrix whose elements can be regarded as increments in the corresponding components of X, and define d and x as vectors in IRn2 whose elements are given by concatenating the columns of these matrices: thus x = (X11,X21,..,Xn1,X12,..,Xnn)T,
and similarly for D. Then the appropriate quadratic programming problem is 11 minimize w + 112dT W d subject to n2
wI, - F, djCj = diag{0, aZ-vl,..,c2 -ai }, (2.2) j=1
XTD +DTX =I„ XTX, (2.3) where
Cj
= EJUl a(a BX) V 1 + (E1Ul a( -BX) V 1)T ,
.l =l,••,n2,
axi El = diag{al,..,at }, W is a given symmetric matrix , and U 1 and V 1 have t columns .
Performing the
differentiation , it is readily seen that (2.2) can be expressed as wit +ElU1BDV1+ViDTBTU 1 ,=diag{O,a2_01,..,at-a;}. (2.4) The key to efficient solution of this quadratic programming problem is the removal of the linear equality constraints associated with the orthogonality condition, and this can be done by observing that D satisfies equation (2.3) if and only if D = P + QS, (2.5) where Q = X.T, P = 1h(Q X ), and S is any skew-symmetric matrix . Thus the linear constraints (2.3) can be eliminated, and the number of variables of the problem immediately reduced from n2 to 'hn(n-1). The disadvantage of this is that a matrix inversion is required . However, let E
=XTX-I,,.
( 2.6)
Then a simpler choice of D which avoids the matrix inversion is to use (2.5) but with
417 P =-JhXE, Q =X(I-E),
(2.7)
both of which are correct to O(IIEIF). It follows that in this case XTD +DTX =-E +O(IIE112). Thus if the system of equations characterizing the solution to the quadratic programming problem remains nonsingular, (2.3) is satisfied to first order in D ; the linearization is still correct and so a second order rate of convergence is not inhibited. Now let the vector SE IR `i" (n-1) be formed from the skew-symmetric matrix S by concatenating the rows above the diagonal, i.e. S = (S 12,513,..,51n ,S 23, ... ,'Sn_1,n )T.
Then (2.5) can be expressed as d=p+Zs, (2.8) where p is formed from P just as d is formed from D, and where Z is an n2x'hn(n-1) matrix formed from Q so that (2.8) is equivalent to (2.5). Therefore making the substitution from (2.8) into the quadratic programming problem (and ignoring constants in the objective function) gives minimize w +'hsT ZT WZ s + pT WZ S subject to n
wit + Y, sij[CeTe^V1--CejeiTVi Vie,e^CT +VTejeTCT] i,j=1 i<j
= diag{0, 62-i,..., a? -vi }-HV 1-V fHT , (2.9) where C = E1UfBX, H = E1U1BP, and ei denotes the i th coordinate vector in /R". If (2.9) is written as a linear system of equations in the unknowns, then these constraints can be eliminated in a straightforward manner via a QR decomposition of the matrix on the left hand side to leave an unconstrained quadratic problem which can be solved as a linear system of equations. Locally this will normally have a second order rate of convergence to a solution of the problem (2.1) provided that the matrix W is chosen appropriately as the Hessian matrix of the Lagrangian function for that problem. The vector of first derivatives of the Lagrangian may be expressed as the n xn matrix BTU 1MV i -X A where AE /R"""
418 is symmetric and M is t xt symmetric positive definite (this n xn matrix being zero corresponds to the conditions of Theorem 2). The matrix W is then formed by taking the partial derivatives of this matrix with respect to the components of x (or equivalently those of X). Thus the components of W are formed from the partial derivatives
a (BT UIMV )kt - Ajtstk, i j=l,..,n; k,l=l,..,n, axtj taken in the correct order: for details of this, see Watson". Now, in addition to dependence on the current X, W depends on the Lagrange multipliers M and A, and estimates of these matrices are required. An appropriate way to do this is to solve in the least squares sense the system of linear equations BTU1MVI-XA=0, trace(M)=1, for the 'ht(t+l) + lhn (n+1) elements of the (symmetric) matrices M and A. This is a standard technique in optimization, and this choice does not inhibit a second order rate of convergence of the algorithm. Some simplification can be possible. For example in the case when t=1, only A is required and we can take A = th(XT BT UIVi + VIUIBX ), where ul and v1 are singular vectors corresponding to a1. We have so far only considered local convergence. A natural initial approximation to take for X is the Frobenius norm solution to the problem, which as already explained can be computed by a single singular value decomposition; however, this may not be good enough to guarantee convergence of the Newton iteration. It is appropriate therefore to use solutions of (2.9) to generate directions of progress towards a solution. One way to measure progress is with respect to an appropriate penalty function: here, we use
4(X) = IIA BX112+eIIXTX-1114, for some suitably large scalar 0, where the norm 11.1114 is given by the sum of moduli of the components. The solution to (2.9), converted to D, is potentially a descent direction in X at the current point with respect to this function. This will be the case if W
419 is positive definite and 0 is big enough, and so to improve the chances of D being a descent direction, the implementation used here had W supplemented by the addition of µl, where p20. This can have the effect of restricting the size of the components of s, and an alternative procedure would be to introduce bounds on the components of s directly into (2.9). However, the advantage of being able to solve an equality constrained quadratic programming problem would then be lost. The value of µ has to be adjusted adaptively according to the success or otherwise of the previous step. The intention is to let µ eventually become zero. Of course the correct value of t is also not normally known in advance, and has to be estimated, this estimate being used in the construction of the quadratic programming problem. A reasonable criterion to use is the relative nearness of other singular values to a at each iteration, starting for example with t=1. It may also be necessary to adaptively adjust the penalty parameter 0 as the iteration progresses: the main requirement is for 0 to exceed the largest optimal multiplier in absolute value. If a descent direction D is obtained, then a line search may have to be performed; in the numerical examples, this was done simply by halving the step (starting from a unit step) until a suitable decrease was obtained. Two numerical examples are given to demonstrate an implementation of the method. The intention is merely to illustrate the potential of the method, and the present code is purely experimental and does not at this stage automatically deal with all aspects of the solution process or handle completely general problems of the type. Note that it is possible to converge to a matrix X which satisfies Theorem 2 except that M is not positive semi-definite. Movement away from such a point may be achieved as described by Watson ii Example 1 This example has m =4, n =2 and is taken from Golub and Van Loan 3, p 582, where a solution to the (Frobenius norm) Procrustes problem is given. This is used as a starting 112 approximation, and the initial value of J A -BX (4(X)) is 0.142988. Results are given in Table 1 for a constant value 0=10 of the penalty parameter. The column headed y gives the step length and ^(X) gives the new penalty function value. Here, the largest singular value is simple at the solution, so that this problem is straightforward and the estimated value of t is 1 throughout the calculation. The final matrix X
420 is
X = [0.999994 -0.0033611 0.003361 0.9999941
Table 1: Results for Example 1
i
IIE IIA
a1
62
1
0.000000
0.378137
11
IIslI ,
0.272480
1.0
Y
4(X )
0.045
0.25
0.1333195
4.0
0.001664
1.0
0.128753
2
0.000025
0.361692
0.335413
3
0.000006
0.358744
0.327430
1.0
0.000291
1.0
0.128667
4
0.0000001
0.358699
0.325748
0.25
0.000024
1.0
0.128665
5
0.0000000
0.358699
0.325888
0.0
0.000004
1.0
0.128665
Example 2 Here m =n =4 and the matrices A and B are filled using the Fortran random number generator RAND. Again the initial approximation is the Frobenius norm solution, and the penalty parameter remained fixed at 0=0.25. The estimated value of t was 1 up to iteration 4, when it was reset to 2, its final correct value.
The final matrix M is M = 10.049158 0.2161971 0.216197 0.950842P showing that the conditions of Theorem 2 are correctly obtained. These results indicate that the method can work in a satisfactory way. However, some questions remain to be addressed: how to best fix the penalty parameter initially, and how best to alter it; also how best to detect coalescing singular values, and how best to adjust t to ensure descent. These issues will not be considered further at this time.
421 Table 2: Results for Example 2
i
IIE 14
61
µ
a2
IIsli,.
Y
^(X ) 0.279339
1
0.000000
0.532236
0.188600
1.0
0.095
0.5
2
0.016657
0.524571
0.191329
1.0
0.047
1.0
0.277351
3
0.048922
0.514898
0.297462
0.25
0.079
0.5
0.271645
4
0.051130
0.508786
0.389545
0.25
0.064
0.5
0.264761
1.0
0.260142
5
0.040040
0.504729
0.461244
0.25
0.068
6
0.024870
0.503909
0.499949
0.0625
0.065
0.5
0.257790
7
0.024096
0.501765
0.499480
0.0625
0.042
1.0
0.254399
0.018890
0.499676
0.499045
0.015625
0.014
1.0
0.249226
0.0
0.0013
1.0
0.248542
1.0
0.248536
8 9
0.002237
0.498665
0.498591
10
0.000018
0.498535
0.498534
0.0
0.000003
11
0.000000
0.498534
0.498534
0.0
0.000000
3. Some other values of p. For values of p satisfying l
XTX =1n. This is normally a straightforward differentiable optimization problem and a successive quadratic programming based method can be applied directly, with the linearized constraints corresponding to the orthogonality conditions removed exactly as described in the previous section. Therefore we will not pursue that approach , but we will turn to another possibility which involves the solution of a sequence of weighted Frobenius norm problems, and is analogous to the method for solving vector Ip approximation problems by means of a sequence of weighted least squares problems . The use of a penalty function is avoided, as feasibility is maintained at all times.
422 For any X, XTX =I,, , let U EV T be any SVD of R A -BX, and let U (1) denote the first n columns of U, and E(1) denote the first n rows of E. In this case G E aIIR II if and only if 12 G =
aU(1)E (1)P-1VT,
where a is a scalar chosen so that the dual norm JIG II` is equal to one: the dual norm is just the 19 vector norm of the singular values, where 1/p +1/q=1. Therefore assuming that 6„>0 when p <2, the conditions of Theorem 1 correspond to the symmetry of the matrix XTBT U(1)E(1)P-1VT = XTBT U(1)E(1)P-2E(1)VT
=XT BT U(1)E(1)P-2U(1)T (A-BX ) =XTBTW(A-BX)
where W = IM + U(1)(E(1)P-2-IR)U(1)T , in other words the symmetry of the matrix XTBT WA. Notice that if p =2 then W Im and we require the symmetry of X T B T A ; the role of the SVD of BTA in this can readily be seen . Let the SVD of B T A be PSQT T. Then for symmetry of (3.1)
XTPSQT =QSPTX, and it is clear that a solution is given by taking X = PQT T. Note, however, that this derivation only gives a necessary condition and so offers no theoretical advantage over the straightforward direct approach given in Golub and Van Loan 3, where this choice of X is shown also to be sufficient. Returning to (3.1), if W is regarded as a fixed matrix, then this condition corresponds to a characterization of the solution to the problem for the Frobenius norm minimize {IIWA BXIIF: XEIR""", XTX=1„ }. (3.2) A fixed point iterative method is therefore suggested for solving problems with p #2 based on the solution of a sequence of weighted Frobenius norm problems , and having
I
1
II
423 the following form: ( 1) Set WO = I,,,, i=0.
(2) Let BTW,A have the SVD P,S,QT. (3) Set X, = P, QT. (4) Let A-BX, have the SVD U, E, VT. (5) Set Wi +1 = I,n + U, t1l(E,1° -2 - In)U,(l)T . (6) Return to (2) with i increased by 1.
This iteration process should be stopped if Xi = X,_1 to a given tolerance (or if the numbers pA -BX, 11 have converged to a given tolerance ) or if the process is obviously divergent. An analysis of this simple iteration process represented in the form Xi+1 = G (Xi),
for the appropriate matrix G, can be given, and it may be seen to be locally convergent for values of p close enough to 2. The analysis is based on standard derivative calculations. Notice that steps (2) and (3) are just the solution of (3.2) with W=W,. The initial approximation is of course just the (unweighted) Frobenius norm solution. Note that the process will break down when p <2 if any A BXi becomes rank deficient. We consider the application of this method to Example 1. In Table 3 are shown the number of iterations k for convergence for a range of values of p, together with the final value of the p th power of the norm. Note that the case p =2 is given no special treatment so that 2 iterations are required to give convergence. The stopping criterion is the condition that successive values of the p th power of the norm differ by less than 0.000001. In contrast to some other examples tried, convergence occurred for quite large values of p. Finally, we consider the application of the same iteration scheme to Example 2 above, and similar results to those in Table 3 are given in Table 4. In this case, despite converging rapidly for p =2.2, the method diverged for p =2.3 and higher values . From these and other numerical examples , it would appear that the process converges (but sometimes very slowly) for values of p <2, but for p much greater than 2, there is the risk of divergence. It is an open question whether the iteration can
424 be modified (as can the analogous process in vector 1, approximation) to either force convergence or indeed to speed it up.
Table 3: Results for Example 1: different values of P.
P
IIA-BXI!
k
1.1
0.581362
15
1.2
0.520632
10
1.3
0.466371
8
1.4
0.417871
6
1.5
0.374506
5
1.6
0.335717
5
1.7
0.301011
4
1.8
0.269949
3
1.9
0.242139
3
2.0
0.217233
2
2.1
0.194924
3
2.2
0.174934
3
2.3
0.157019
4
2.4
0.140959
4
2.5
0.126559
5
2.6
0.113646
5
2.7
0.102063
5
2.8
0.091671
6
2.9
0.082347
6
3.0
0.073979
7
I
1
"tl
I
425 Table 4: Results for Example 2: different values of p.
p
1I4 -BXIr
k
1.1
0.643672
39
1.2
0.594944
25
1.3
0.550255
98
1.4
0.509812
30
1.5
0.472522
14
1.6
0.437929
9
1.7
0.405768
6
1.8
0.375863
5
1.9
0.348081
4
2.0
0.322298
2
2.1
0.298392
4
2.2
0.276247
5
References 1. J. E. Brock, Optimal matrices describing linear systems, AIAA Journal 6 (1968), 1292-1296. 2. K. Fan and A. J. Hoffman, Some metric inequalities in the space of matrices, Proc Amer Math Soc 6 (1955), 111-116. 3. G. H. Golub and C. F. Van Loan, Matrix Computations, Second Edition, Johns Hopkins University Press, Baltimore, Maryland, 1989. 4. J. C. Gower, Multivariable analysis: ordination, multidimensional scaling and allied topics, in
Handbook of Applicable Mathematics, Vol VI: Statistics,
Lloyd, E H ed, John Wiley, Chichester (1984), 727-781. 5. B. Green, The orthogonal approximation of an oblique structure in factor analysis, Psychometrika 17 (1952), 429 - 440.
6. R. J. Hanson and M. J. Norris, Analysis of measurements based on the singular value decomposition, SIAM J Sci Stat Comp 2 (1981), 363-374.
426 7. J. B. Hiriart-Urruty, Tangent cones, generalized gradients and mathematical programming in Banach spaces , Math Oper Res 4 (1979), 79- 97.
8. J. von Neumann, Some matrix inequalities and metrization of metric spaces, Quart Jour Math Oxford (Ser 2) 11 (1937), 50-59. 9. M. L. Overton, On mimimizing the maximum eigenvalue of a symmetric matrix, SIAM J Matrix Anal and Appl 9 (1988), 256-268. 10. P Schonemann , A generalized solution of the orthogonal Procrustes problem, Psychometrika 31 (1966), 1-10. 11. G. A. Watson, The solution of matrix approximation problems in cp norms, in Approximation Theory, Proc of 6th South Eastern Approximation Theorists Conference, ed G Anastassiou, Marcel Dekker Inc, New York (1992), 513-524. 12. G. A. Watson, Characterization of the subdifferential of some matrix norms, Lin Algebra and Appl 170 (1992 ), 33-45. 13. G. A. Watson, On matrix approximation problems with Ky Fan k norms, preprint (1992).
WSSIAA 2(1993) pp. 427-434 ©World Scientific Publishing Company
SOME EXPLICIT FORMULAS FOR PADS APPROXIMANTS OF RATIOS OF HYPERGEOMETRIC FUNCTIONS
JET WIMP Department of Mathematics and Computer Science, Drexel University Philadelphia , Pennsylvania 19104, USA
and
BERNHARD BECKERMANN Institut fur Angewandte Mathematik,' Universitdt Hannover Welfengarten 1, W-3000 Hannover 1, Germany
Abstract We discuss close-form expressions for the numerator and denominator of [n - l1n] and [nin] Pade approximants for the classical continued fraction of Gauf. Several limiting cases are studied like ratios of confluent hypergeometric functions, Bessel functions and trigonometric functions.
1 Introduction Suppose that a (formal ) power series f (z) = E'o cjz' and non-negative integers m, n E INp are given . Then there exist polynomials P[minl(f) and Q[m1n](f) with deg P[minl(f) 5 m, deg Q[rnln](.f) 5 n, such that R(z) := P[mIn](f)(z) - f (z) Q[minl (f)(z) has the order of at least m +n + 1, i.e., the coefficients of z0, z1, ... , z"`+n in R vanish . It is well-known that the rational function P[minl(f)/Q[min](f) is unique; it is called the [mjn] Pade approximant of f. Padd approximants of f are displayed in the so-called Pade table. The formal power series f (or its Pade table) is normal, if any two entries of the table are distinct. In a recent paper ([4]), one of the authors presented an explicit formula for the numerator and denominator of the [n - 1 1 n] Pade approximant of the function 2F 1(a1,2+l z )
427
/2F11 c x/
428 where the hypergeometric function PF9, p, q E INp , is (formally) defined by h(a + j)
a'1)) • ... • (a'P)J . z>> (a )7 =
PF9(6,, ...;6p I z) _ (
^_^ (b5 )i j! r(a)
Applying some contiguous relations, we also obtain information about the convergents of GauB' classical interpolating continued fraction 2F1(ac+1' I z)/2F1( acbl z)
a c-b (b+1 )( c-a+1) (a+1 )( c-b+1) (b+1 )( c-a+2) 1 _ c(c+i) z I _ (c+1)(i 2) ' z I _ (c+2)(i 3) Z I _ (c+3)(1 4) z I _ .. .
In this paper, we show how Wimp's results can be written down in a more compact form as a truncated product of two 2F1 functions. A shorter proof is given based on a simple divided difference calculus. We present some generalizations with respect to slightly different parameters in the numerator hypergeometric function. Finally, many examples for `nearly diagonal' Pade approximants are considered; we also discuss confluent cases. In [5], the results of this paper already have been extended to certain two-point Pade approximants. The following assertions about particular non-normal Pade tables are immediate. Lemma: (a) Let g(z) = u • z° • f (w • z) with v E INp , u, w E Q, u, w # 0, then for m, n E INp
P[m +vin](g)(Z) = u
- Z° • P[min](f)(w . z), Q[m+uI,](g)(Z) = Q[min](f)(w • Z).
( b) Let g (z) = f (z2), then for m, n E INp with rn' _ [ 2 ] (denoting the largest integer less or equal to m/2) and n' = [2] P[min](g)(z) = P[m'In,](f)(Z2),Q[min](g)(z) = Q[m'In'](f)(Z2),
where on the right hand sides we have an additional factor z if both m and n are odd. Proof: Left to the reader. ❑
2 Definition of polynomials It is well-known, see [1, 4.3.], that the product 2F1( ac'b
I
(a'' b Iz)' 2 F 1 c' IZ)
I
FI
I
429 simplifies to a pFq- function only if certain connections between a, b, c and a', b', c' hold. In the sequel, we will consider a product which in general will not simplify. Let the
{ EJ'!° o cjzj } = E
projection operator IIk be defined by 1Ik
cj zj.
For m, n E [NO let the following polynomial be defined
Definition : Vm,n
j^=o
{ (°'6I m+n+1 C z} 2 n1 c
F
2
( a,l .-m, -b -n z) • ( 1 -c-m-n
I
F
I z) } .
The main feature of Vm,n in proving explicit formulas for Pade approximants is stated in the next theorem. Theorem 1 :
For all in, n E MVO deg Vm,n (
'N z) < max{m, n}. m+n+1
Proof.
dkzk
Taking the Cauchy product, we find that Vm,n("f I z) _ k=0
with k ^ (a)7 (b(e))( (l6 a 171k-7 n)k b - n)k-l 1 dk k. ^=0 (j
_ 1 1 k(-1)k-' - Gk(i) I [O,1,.-.,k]Gk, Go(O) kl=G o(0) i=o
where
_ I'(a+z) I'(b+z) I'(c+m+n+1-k+z) Gk(z) I'( a+m+ 1 - k+ z ) I'(b+n+ 1 - k+z) I'(c+z) and [•] denotes a divided difference . Since for max{m, n } < k < m + n + 1, Gk is a ❑ polynomial of degree k - 1, we obtain the assertion dk = 0 In addition , we easily obtain the following identities V, ( c' I Z) = Vm,n ( a c-m bn n
I
Vn,m (bc Vmn(c-6c c aIz)
Z)
I
z)
(2)
(3)
where for (3) we have applied the identity
2F1( a I z)
(1)
)" ---o . 2F1 ( -Y-;Y -If I z) .
430 Moreover, for particular values of a, b the function Vm,n simplifies z) V m,n (c,n c J
I
= F j -n,-b+c-m 2 1 -c-m-n
j
Vm n ( a,OI z ) = F C 2 1 -c-m-n
_ (-a - m) n
z)
j
I
(4)
z)
(5)
n n,1+c+m 1
z ' 2F1( I+a+m-n ( 1+c+Min (1 + c - a)n (1 +C+m)n 2F1
- n,-a-m ( 1+c-a I 1 - z).
I z) (6)
(7)
Formulas (6), (7) can be easily proved by comparing the coefficients and by using the explicit value 2
_ (7 -a)n
F 1(-
J
7"I I)
(7)n
Let us have a closer look at the special case m = n. In [4], one of the authors introduced the associated Jacobi polynomials P„ "'p)(z; c), defined by a three-term recursion. He showed a closed-form expression, which -using our notations- takes the form Pna,") 2z - 1 • c) = Q + 1 + 2c ( ' (a + a Q + +1 +c)n•( 1
2n +c)n
Z " ' Vn,n (al+R+2c I z)- (8)
By the recurrence relation or directly by (8), (6) we see that P("'R](z; 0) are the Jacobi polynomials of first kind,
P("'Q](2z - 1; 0)
_ (a + Q + 1 + n)n n - P-n,-n 1 n! Z 2F, ( - n-0-2n I z) + On (a1)n.n !
+ +P+n, -n 2F
1( 1 " 1+P Iz)
whereas P„"33(z;1) are the Jacobi polynomials of second kind.
3 Pade approximants for ratios of two 2F1 Theorem 2: Fors E {-1, 0,1} let .+.,b+1 c(+.+1 IZ) fe(z) = 2F1( 2F1( cb l z)
Suppose that n - 1 < m and, if b • (c - a) 0, also m < n - s holds. Then for the numerator P]mjn](fe) and the denominator Q[,nln](fe) of the [min]-Pade approximant
I
t
III
I
I
431 of f, we have the formulas ( a+,,b +1 a+3 +1 I Z) _ J a+s,b+ 1 Ilp \ ,m { 2F1( c+,+1 I'z) 2F11 -c-m- n-s ( - 1-7-811-n I'zl}
P[min](fs)(Z) /^
/
= Vm,n -1
-
(9)
Q[minl(fs)( Z) = Vm+s,n(acbl z) He,n { 2 F 1 (a,b I Z)
Proof.•
2F 1 (
a-m-,,-b-n I z) } , (10) -c-m-n-s
The degree of Fminl(f,) is correct due to Theorem 1. Similarly,
deg Q[ml,y](f,) < n follows either from Theorem 1 or (in the cases c = a and b = 0) from (4) and (5). For the order we use the identity Hm+n
If - 9} = II m +n If •
I k {9}}
(k > n + m ).
Consequently, flm+n {P[minl(f,)(z) - f,(z) - Q[minl(fs)(z)} _ i7 nm+n {2F.(acbI z) - P[minl(fs)(Z) 2F1( ac+s+11 I Z ) - Q[min[(fs)(Z)} llm+n 2F1( cbI z) = 0.
Theorem 2 is well-known for b = 0 and c = a (see, e.g., [3, p.341]). In both cases, the function f, reduces to a particular 2F1 function, and by (5), (10) we get Q[minl(9)(z) = 2F1( Inc-m n I z), 9(z) = 2F1(a^l I z), m > n - 1. (11) Applying the Lemma we obtain Pade denominators for some elementary functions as displayed in Table 1 (confer also [1, 2.5.5.,2.8.(4)-(17)]).
g(z)
Pade denominator Q[min]W(Z)
(1-z)d= 2 F 1 ( i'1 Iz)
2 F 1 ( (-n,--n Iz) , m>n
log(1 + z) = z . 2F1(1,1 z) 2 2 arctan z = z • 2F1 1/2,1 3/2 - z
2
F1( -m-n I - z), m > n 2 i i' n :_ ^ 2F 1 --n',-1/2-m 1/2 -m,-n, - z , m :_ z [ m-1' n 2
],
m' > n' - 1 (with factor z if m even and n odd) arcsinz
= 2F1
1,1 2) 3/2 Z
-m ), 2F 1^ --n"-1 1/2-m,-n Z2
n_1 m :=[ m-1] 2 n , :_ ^2J
1 - Z2 - Z •
m'
n' - 1 (with factor z if m even and n odd)
TABLE 1.: PADt DENOMINATORS FOR SOME ELEMENTARY 2F1 FUNCTIONS
432 For arbitrary a, b, c, s = I and therefore m = n - 1, Theorem 2 together with (8) yields the result of [4] described above that, up to a constant, the corresponding reversed Pade denominator and numerators of f, coincide with associated Jacobi polynomials zn Q[n- 11n]
zn
(fi )(I)
z
=
const • P('-'-b"-')(2z - 1; b),
-' Ftn- 1ln ](f1)(1) = coast • pnc ^a-1'a-1 ) (2z - 1; b + 1). z
In particular, for reversed Pade denominators of the functions 2 F 11 2 F (,+1/2,1 I ), (1,1 log(1 - z) (1 21 I ) -z + 2F1 ` /1 z f 1 a"++01+"2 l 2v+1 z zFl(z I z) = we obtain the classical Jacobi, Gegenbauer, Legendre, and Chebychev polynomials of degree n, respectively.
4 Pade approximants for ratios of two 1F1, 2F0, OF1 Since lim 2F1( cbl a) = 1F1(^I x), Jim 2F1(acbl g) = 1F1(,'I z), a-a-b-.oo lim 2F1( c6I c. z) =a,b-.oo 2Fo(a,b I z), lim 2F1( a,b I - b) = OF1(c I z), c-+oo we have several limiting cases of Theorem 2. Define Vm n (°°'b l z) := lira Vmm,n ( a l z ) C a--
c
a
Ilm+n+l {1F1(cIz)'1F1(- c'mnnl-z))> Jim Vm,n(acblc•z) c-+oo 11m+ntl
2F
( a,b I Z) . I
(12)
1
F, ( -a-m,-b-n I - x) 1 2 0
(13)
lim Vr,n ( cb l -a b ) a,b-+oo
1Tm +n+1
{0
F1( c I z)
• OF1( -c-m-n I
l
z) 1
(14)
Of course, Equations (1)-(7) have their confluent analoga. In addition, we obtain the following simplifications (see [1, 4.3.(2),(6)]) Vn,n(oo, bIz) = F 3 ( \ -n/2,-n/2+1/2 I s 1 2b 2 b+1/2,-b-n + l/2,-n 4 / Vm,n ( ooco° I z) = 2F3 ((m cn))/ 2,(' mn nn- 1)
(15)
/2 14z) . (16)
433 By means of (13), (16), we are able to desribe Laguerre and Lommel polynomials a +
Ln(z) =
n!1) n ' 1F1( 1+al z) - n! V,n( oo l z1 ), (17) r(v+n)
".n(z) = F(v)
2" ' ,zn Un_k k-1 ( _'m I - 4z2) 1 G k G n. (18)
An immediate consequence of Theorem 2 are the representations as displayed in Table
2.
g(z)
P[m1n
(( 1F1\cbal1I Z) IFI(c
1
V ( oo,b
+11 Z )
c+8+1
'n,n-1
z)
](9)(Z)
F 1(_ Z) a}3,b}1
2F2F,0 (a
( -bi
Vm}a,n
-1 G S G 1
z Z
/ a++1 I ) Z \
Z)
Vm+s,n \ W
I
Z
n- 1 G m n- 1 < m < n S
Z
)
-1 G S G 1 n - 1 < m
2F0( -a-m -n I - Z)
I z) c
Z)
c
n-1 Gm Gn - s
IFi(1-c-m-n - z)
2F0( a,1 Z)
OF0 1
Range
Q[m1n)(9)(Z)
Vm,n-1 ( c+a+1 I Z )
Vm+s,n c^
I
1SL
it Z)
- S OG
G 1
TABLE 2.: CLOSED- FORM EXPRESSIONS FOR PADS APPROXIMANTS
Of course, for several special choices of the parameters (a, b, c) we obtain wellknown results which will not be mentioned in detail. Note that the complementary error function, the exponential integral, the logarithmic integral, the integral sine and cosine and, more generally, the function 'l are closely connected to 2Fo functions, whereas the incomplete gamma function, parabolic cylinder functions, Laguerre functions and, more generally, the function $ are connected to 1F1 functions (for more details, confer, e.g., [3, p.335ff], [1, Chapter VII). In view of (17), (18), let us only discuss two particular cases. From the third line of Table 2 we see that the reversed [n-11n] Pade denominator of g(z) = 2Fo( a}1'1 I z) is a Laguerre polynomial. Similarly, by applying the Lemma we can conclude that for the ratio of Bessel functions 9o(z)
J0(z) __ _z Jc_1(z) 2c
oF1(c +I i - a2) OF1(cI - )
(including the case g1/2(z) tan(z)) the reversed [mjn] Pade denominator coincides with the kth Lommel polynomial, where k = m' + n' + 1, n' = N21 _> m' = 1'n21 1 > n'- 1.
434
References [1] A. Erdelyi et al., Higher transcendental York, 1953.
functions,
Volume 1, McGraw-Hill, New
[2] A. Erdelyi et al., Higher transcendental York, 1953.
functions,
Volume 2, McGraw-Hill, New
[3] H.S. Wall, Analytic York, 1973.
Theorie of Continued Fractions, Chelsea Publishing Co., New
[4] J. W i m p , Explicit formulas for the associated Jacobi polynomials and some applications, Can.J.Math. 3 9 (1987), 983-1000. [5] J . W i m p and B. Beckermann, Families of Two-point Pade Approximants and New 4 F 3 (1) Identities, to appear.
WSSIAA 2(1993) pp. 435-475 ©World Scientific Publishing Company
THE SEQUENTIAL GRADIENT-RESTORATION ALGORITHM FOR A CLASS OF OPTIMAL PROBLEMS:
CONVERGENCE ANALYSIS`
K.H. Wong Department of Computational and Applied Mathematics, University of the Witwatersrand,
Wits 2050 , South Africa. K. Kaji Department of Computational and Applied Mathematics, University of the Witwatersrand,
Wits 2050 , South Africa. K.L. Teo Department of Mathematics, The University of Western Australia, Nedlands, Western Australia 6009 , Australia.
Abstract. The sequential Gradient -Restoration Algorithm (SGRA ) was developed for the solution of equality constrained nonlinear optimization problems by Miele and his co-workers in the late 1960 's. It has been used with great success for many largescale problems, mainly in the area of engineering optimization . However, convergence properties of the AGRA was only reported by Rom and Avriel in the late 1980's. In the early 1970 's, Miele et al extended the use of SGRA to optimal control problems. This paper presents the sequential gradient -restoration algorithm for a class of optimal control problems with initial equality constraints and terminal equality constraints . The aim is to investigate certain convergence properties of the algorithm. A numerical example is then solved to illustrate the efficiency of the technique.
This paper was partially supported by a research grant from the Australian Research Council
435
436 1. Introduction There are many efficient gradient-type algorithms in the literature for solving optimal control problems. As early as in 1968, it was used in conjunction with the method of penalty functions1-2. In 1970, the gradient information was used together with the approach of Bryson and Denham'. However, both of these methods did not ensure the satisfaction of all the constraints and the boundary conditions at the end of each interation. Thus, the comparison of the cost functional values between each iteration was not possible. To avoid this drawback, Kelley' proposed to have a restoration phase at the end of the gradient phase. He superimposed to the change of the control associated with the gradient phase a perturabation obtained by combining linearly some arbitrarily prescribed functions fl(t), f2(t),.... The constants of the combination were determined iteratively in order to satisfy the terminal conditions. Unfortunately, there is no clear-cut way to choose the functions fl (t), f2 (t), ... . In the late 1960's, the Sequential Gradient-Restoration Algorithm (SGRA) was developed for the solution of equality constrained nonlinear optimization problems by Miele and his co-workers-s. The algorithm consists of a sequence of two major phases: a gradient-type minimization in a subspace tangent to the constraint surface, and a feasibility restoration procedure. It has been used with great success for many large-scale problems, mainly in the area of engineering optimization. Convergence properties of the SGRA was reported by Rom and Avriel7-8, where global convergence was successfully proven, and an asymptotic rate of convergence is established.
Miele and his co-workers9 extended the use of SGRA to optimal control problems. Similarly, no convergence properties of the SGRA were included. In this paper, we consider a class of optimal control problems involving both initial and terminal equality constraints. This class of optimal control problems is more general than that considered in the paper by Miele et a19, where the state variable x is fixed at the time t = 0. The problem is summarized as follows: Minimize a functional J which depends on the state variable x(t), the control variable u(t), and the parameter 7r, where J is a scalar, x is an n-vector, u an m-vector and 7r a p-vector, and t is the normalized time10 with 0 < t < 1. x(t), u(t) and 7r are required to satisfy n scalar equality differential equations. Furthermore, x(t) and 7r are required to satisfy the initial condition (r scalar equality equations) as well as the final conditions (s scalar equality equations). In this paper, a computational algorithm similar to the version of the SGRA7-8 is developed for solving this class of equality constrained optimal control problems. Furthermore, the convergence result is also established for the first time in this paper. Note that the proof of this convergence result is strongly motivated by that reported in the paper by Mukai and Polak". More precisely, we have extended the convergence result for the mathematical programming problems reported in this paper" to that for the optimal control problems.
437 An important feature of our algorithm is that a scheme is incorporated in such a way that at points far from optimium, only a rough estimate of feasibility is required; and the feasibility requirements are gradually tightened as the optimal solution is approached. This scheme could save a great deal of computational time. 2. Statement of the problem The purpose of this paper is to study the minimization of the functional J=
J0 1 f (t, x(t,
(1)
u(t), ir)dt + [g(x, -)]o + [h(x, ir)] 1
with respect to the functions x (t), u (t ) together with the parameter Tr which satisfy the differential equation x (t) = ¢ (t, x (t) , u (t) , ir) (2) together with the initial condition (3)
[w (x, x)]o = 0 and the final condition
10 (x, x)]1 = 0 (4) where x : [0,1] -+ R" is an absolutely continuous function such that x (t) is essentially bounded on [0,1]; u : [0,1] -+ R' is a bounded measurable function; 7r is a vector of RP; and 0: [0, 1]xR"xRmxRP -+R", f:[0,1]xR"xR'"xRP-+R, g:R°xRP--,R,h:R"xRP-+R,w:RnxR"-,R',O:RnxK"-+R'axe given functions. For convenience, the above problem is referred to as Problem (P). 3. Basic Definitions and Assumptions Let I • I denote the usual norm in an Euclidean space. For any matrix M (M,,):=,,....", , let 1/2
IMI = E (Mi)2 1=1,...,n, j=1 ,..., nz
For any G: R"' -,Rn2' let
8G ay =
8G1 /ayn, aGns 149Y1
aA • UGny /"ynl
Let L,, denote the Banach space L,,. ([0,1] , R") of all essentially bounded measurable functions from [0, 1] into R". Its norm is defined by jjxjj^ ess sup lx(t)l tE[0,1)
(5)
438 Let U be the set defined by U = {z = (x, u, 7r) : x : [0, 1] -. R" is an absolutely continuous function such that icEL ,;uEL ,;7rERp} (6) For any z E U, let
IIZII = IIXII- + Hill. + [lull + 1.1 + NO + Ixli (7) Let F be a subset of U defined by T = {z = (x, u, 7r) E U : (x, u, 7r) satisfies constraints (2), (3) and (4)} We assume that the following conditions are satisfied.
Assumption (3A) (3.A1) The functions f and 0 are continuous on [0, 1] for each (x, u, 7r) E R" X R' X RP. (3.A2) There exists positive constants Kl and K2 such that
IFa
(t,X1 , u1,7f' )
1
:5
K1,
(8)
(9)
lop (x',7r1) I ^ K1,
and
j P# (t, X' , U1, x l ) < K2
(IX1
- F,B
I u1 - x21 +
(t, x2, -
u 21
u2, 7r2)
I
+ I7rl - 7r21) , (10)
I GP (x1, X1) - Ga (X2, 7r2) I
(3.A3) The functions 82F/8/318/32 and 820/8/318/32 are continuous, where for each i = 1, 2, represents any of the variables xi, i = 1, ... , n; ui, i = 1, ... m, and7ri,i=1,...,p.
439 4. First- Order Necessary Conditions and Constraint Qualification In this section, we shall derive first-order necessary conditions for optimality together with a constraint qualification for the problem (9). Note that the constraint qualification in the mathematical programming is to ensure that the algorithm does not jam up at undesirable points. For brevity, we write F=F(t, x,u,ir)
(12)
G = 6 (X" r)
(13)
Fp = Fp ( t, x, u, ir)
(14)
Gp = Gp (x, 70
(15)
and
and G,s are as defined before.
where
Remark 4.1. From the standard existence and uniqueness theorem in the theory of ordinary differential equations, it follows that the trajectory x (t) that satisfies (2) is uniquely determined once (x (0) , u (t ) , a) is specified. Let z = ( x (t) , u (t ) , ir) E 3 be given. By adjoining the differential equation (2) to J with the Lagrange multiplier function a° (t), we get
J= [9]o+[h] i+Joi{f +(A0)T(* - 0)}dt
(16)
Integrating by parts the last term on the right hand side of (16), we get
J = [g[0 +
[h[i
+
x dt f - (A0)To - (A°) Ti
i Jo [ +
\
[(,\°)T X]i
-
[(A °)T X] 0
(17)
In view of Remark 4.1, z is completely determined by (x (0), u (t), a). Thus, by considering the variation in J due to the variation in (x (0) , u (t) , a), we obtain
OJ
+ (AO)T =
+ +
[9x
1 j
- (A0) T
i
{fx -
J
(a
[AXj0 +
°)T ^x
-
[hx
(a°)
I, r
T
}
Oxdt +
J
{fu - (A0)T ou } Dudt
T {f^- (,\°) 0,r} o^rat + {[9x]° + [h]1} oW
(18)
440 Choose a° to satisfy the following system of differential equations: 0
T
T
(19a) (19b)
[_X°]1 = - [hx]1 Then, OJ simplifies to
OJ =
- (,\ O) T] [9x
0
[OX]o + (, °)T 0u] Oudt 0
{ r l + { r 1 [ fw - (A°)T ¢x] dt } Oar + {[9.]o + [hx]1 } O7r
(20)
Consider the constraint functions: [wi]0, i = 1,...,r (21) where [wi]0 denotes the ith component of [w]0. In this case, we have
[Awi]o = [wi.Ax + wi.,O7r]0 , i = 1, ... , r. (22) We now consider the constraint functions: [T/'i]1,
i= 1'...'s
(23)
where [1/ii], denotes the ith component of Then, by using a similar argument as that given for OJ, we can show that
[ - i]1 = -([-\i]0)T[AX]0 - J 1(A1')T 4 u dt 0
-j (i)Tdt}i = 1, ... , s, (24) where
T
(.') = - (A' )T
0.
(25a)
[(Ai) T] 1 - [0i, x] 1 (25b) Let us now construct (Ox (0) , Du (t) , sir) such that OJ < 0 and that the following constraints are satisfied. [Owi]0 = 0, i = 1, ... , r (26) and [Aoi] 1=0, i=1 ,...,s
(27)
441 Multiply the ith equation in (22) by pi and the ith equation in (24) by µi. Sum them together. Then, by adding the resulting equation to (20), we obtain
OJ +
E
µi
[owilo +
=1
J 1 K"Dudt + K"0i + K" [Ax]0 (28)
µ; [i']1 =
;=1
°
where
a
(A
Ku (t)
O )T o._
(A') T
(29a)
0.
i=1
- j1
K*
:
!fa - ('\°)T oa - ^ N'i
o
l
(),')T Oa
r [ hr + + g,c + µiui,,r l + o i=1 1
1
dt
(29b) 1
and r
a
TI
(29c)
Kx = 9x - (^°)T + i=1 i=1
0
Now choose Au (t) _ -rc
(h" (t)) T
(30)
L7r = -rc (K" )T
(31)
(Q-)0 = -rc(Kx)T
(32)
where K A. Then,
OJ + µi[^ilo + ^i[OWi]1 i=1
i=1
= -K {j1 K"(t)12dt + jK12 + 1Kx12} (33) 111 which is negative unless K"(t) =- 0 Vt E [0, 11, K" = 0, and Kx = 0. Next, we need to determine µi, i = 1, ... . r, and µi, Z' = 1, ... , s, so that the constraints (26) and (27) are satisfied. In view of (22), (24), (30) - (32) and (29), the matrix equation for µ and µ takes the form
[M3 M4] LµJ - [w2I
(34)
where M1 E Rrxr, M2 E Rrxa, M3 E Rsxr, M4 E Rax', W1 E Rr, W2 E Ra, and
Mi j = - [wi,x (wj,x)T] o - [Wi,a (Wj,'a)T, o
(35a)
442 1
M, _ [wi,x \^]o - [wi,*]o { ([^1 ]1)T - f (0^)T Aidt } 0 111
(35b)
Mi, = ([A'] 0)T ([w.i,x]0)T
J
J( /' /' 1 - 1 [TYi,a]1 -
1
Mi,)
-
([a`] 0)T
[ A i] + J
1 '(,\i)T + 1 [v^;,+r]1 -
[wr a]o L ✓ O1
Wi2
{(fx)T
(A i)T
0
- (0a)T
A
l = - ([As]0)T
+ {[ir1i -
\l dt
0
Oadt} lJ 1 (flT A dt - ([ Yii,x]1 )T] (35d) 0
Wil = [wi, x]0 [(gx )T -'\O1
+
(A )T Oadt j ([wi,, ]o ) T (35c)
1(9x)T -' \OJ O
o}
dt + ([g,r]0 )T +
1 (,\ ,)T 0. [(fu)T -
(36a)
([h *]1)TJ
ll
(Ou)T AO
- J0
J
dt
1 {(f,r)T - (O^r)T AO} dt f l (Ai)T Oxdt} {J
+ ([9^]o ) T + ([ha]1)Tl
(36b)
Let M(z) = [M3(z) M4(z) J
(37)
Then, for each z E U, µ (z) and µ (z) exist if and only if the matrix M (z) is nonsingular. In other words, if (M (z))-1 does not exists, it is not possible to control (Ox (0), Du (t), Aa) to satisfy one or more of the constraints (26) and (27). Definition 4.1. Any z E U (not necessarily feasible) is said to satisfy the constraint qualification of the problem (T) if (M(z))-1 exists. Note that we have allowed non-feasible z to be included in the definition of constraint qualification. This is done for the convenience of future development. At this stage, we have constructed (Ox (0), Au (t), Ox) that deceases J and satisfies the constraints (22) and (24). From (33), the only case in which we cannot decrease J is when
K° = 0 Vt E [0 ,1] , K" = 0,K- = 0 (38)
Now, let
a (t) = A° (t) + > 1 A` (t)
(39)
Then, by ( 19) and (25), we have (40a)
(. (t))T = ff - ATm,
[AT + hX + (µ)Tk=11 = 0 (40b) Next, by combining (29) and (39), the conditions (38) become (41a)
fa-,Tou=0
J 1 {ff - AT OM} dt + [gn + µTW^ ] o + [hn + -AT +
g,
+ µTWxJ ° =
AT y,n]1
0
=0
(41b)
(41c)
Thus, under the constraint qualification condition , we seek a z E 3' together with A (t), y, and µ, such that (40) and (41) are satisfied. Note that the technique used in this section is motivated by that reported in the book by Bryson and Ho12. 5. Approximate Method For each z = (x, u, 7r ) E U, \ (t) E R", p E R', f E R.', let P, P and Q be defined by P (z) = P (x, ic, u, x) Ji-012dt +[w(x,x)10+[^(x,^)11 (42) and Q(z,µ) = {ITT - fX + ATOXI2 + Ifu - A TO.12}dt j 1
2
+ I f {f^ -A T OM }dt + [9,^ + N'TWa10 + [ham +µTV)"]1I
+ I[- a T +gx +izTWX1 °I2 +I[AT+h.
+ATOX] 1
I2
(43)
Here, P measures the cumulative errors in the constraints, while Q measures the cumulative errors in the optimum condition.
444 Remark
5.1.
In view of Assumption (3.A2), we see that the function P(i)
is con
tinuous in z with respect to the norm defined by (7). Thus, we conclude that for any ZGU,
P(z) ^oo.
Define T, = {z 6 U : V ^ W < «}
(44)
and A e = {z G U : 3A(c) G R n , /i G R r and fc G R J such thatyjQ(z,\,\,fi,fi)
< e},
(45)
where A : [0,1) —♦ R " is an absolutely continuous function. The approximate method is to seek a z such that z€llnTtinAl2, where £\ and £2 are small, preselected numbers. 6.
Restoration Phase For any nominal function z G ( x , u, Jr) £ U — X 0 , we seek a function: z-f A z = ( x - f A x , u + AU,TT + Air) G
UnT0.
If quasilinearization is employed, equations (2) — (4) are approximated by A x - 4>*Ax - ^ u A u -
+ u>„ Air] 0 = 0
[^ + 0 x A x + ^ A * ]
]
= 0,
(46a) (46b) (46c)
where the tilde on top of the functions <j>, u>, 0 and their derivatives denotes evaluation of the functions at the beginning of the restoration phase. In order to prevent Ax(<), A u ( t ) , and A i from becoming too large, we imbed (46) in a one-parameter family: A x - <£xAx - ^ u A u - ^jrAir + a (x - >J = 0
where a is a constant. functional
(47a)
[aui + £ x A x + u>„ AJf]0 = 0
(47b)
\aij> + fcAx 4- ^ A * ]
(47c)
= 0,
For each prescribed a, we seek the minimum of the cost 1 ff' J = —
^
a
/
Uo
|Au|2
1 (48)
J
445 subject to (47). Let the above problem be referred to as (T). Remark 6.1. Since system (46) (and hence system (47)) is obtained from equations (2) - (4) by quasilinearization, it is clear that the constraint qualification for the problem (T) is exactly the same as that for the problem (T). Thus, by following exactly the same argument as that given in §4, we can show that ifz satisfy the constraint qualification of the problem (T) (i.e., (M (z))-1 exist), then there exist Oz = (Oz, On 0fl and A (t) E R", µ E R' , E R' such that (47) together with the following equations are satisfied: (49a) (49b)
AllT/a - ^T^u = 0
0 [ - T - 7rT {aTdt +1 TWx]0+ A ^^]1+ -=0 a
(49c) (49d)
[_T + ^TtDx] o ^^ I [^T + T Y'X µ 9v
1
1
0
(49e)
= 0
6.1. Coordinate Transformation Let A(t) _ fa(t)
(50)
B (t) _ hu(t)
(51)
C
=
^n
(52)
a
Then (47a), and (49a) - (49c) become:
A= ¢xA+¢uB+¢,rC(:i-¢)
(53)
A = AT& (54)
=
BT = ^T^u (55)
j
dt - pT]
(56)
- [T 1
and the boundary conditions (47b) - (47c) and (49d) - (49e) become: [w +w.,& +w"CJ
0
=
0
(57)
446
[^++.k++//,rC]1 =0
(58)
_ T + jT,^,xj0 = 0
(59)
(60)
[^T + FLT^x] = 0 6.2. Integration Technique
The aim of this section is to describe a method for solving the restoration corrections A, (t), B(t), and From (54) and (60), we obtain
a (t)=-(N(1,t))T T
(61)
where N (t, T) satisfies the following matrix equation: aN(i,T) - at - 0x(t)N(t,T)
(62a)
N (T, T) = I
(62b)
N (t, T) = 0 if r > t,
(62c)
where I denotes the identity matrix of appropriate dimension. Using equation (61), it follows from (55), (56) and (59) that B (t) = k1 (t) µ k2µ
(63) +
k3
(64)
and (65)
144/1 + k5/1 = 0
where 141 (t ) = - (^u (t)) T (N (1, t)) T
k2
(66a)
( px] 1) T
(66b)
= - (LW,]0)T T
R3 -
fl
(N(l,t))T ([x]1) T dt- ([w^]1)
K4
(66c) (66d)
and k5 = (N (1,0))T ([^x]1) T
(66e)
447 Thus, from (53), (63) and (64), we obtain A (t) = N (t, 0) A (0) + k6 (t) µ + k7 (t) /t + k8 (t) (67) where k6 (t) =
R7 (t)=
I
J
N (t, T) (T) dT k2 (68a)
1
[f
N(t,T)(bu(T) K1(T)
0
k8
t
= -
d T+
V
J
N(t, T)^a(T)dT1 kg
(68b)
0
t
N
t, T X T _ ( T
)1 d
T
( 68 c )
Then, it follows from (65), (57), (58), (67), and (64) that K4/2 + ksµ=0
(69a)
[wX}0A (0)+[w,rk2]o µ+ [W*ka]o µ=-[w]o
(69b)
[t ] N (1, 0) A (0) + [TXk6 + cbxk2] + +b,k3] µ [T + Xka]i
(69c)
Remark 6.2.1. System (69) constitutes a system of (n+r+s) equations in (n+r+s) unknowns (i.e., A(0), ) and µ). It has a unique solution if and only if the nominal functionz satisfies the constraint qualification (cf. Remark 6.1). Remark 6.2.2. Note that A (t), B (t) and C can be easily obtained from (67), (55) and (56), respectively, once A0i µ and Al are determined from (69). Then, of (t), Dn, 07r can be calculated from (50) - (52) if & is known. The method for determining the step- size & is discussed in the next section. 6.3. The Restoration Algorithm Algorithm
6.3 (Restoration Algorithm)
Step 0 Let ry" E (0, 2), ,B E (2, 1) be given constants. Choose z° = (io, n°, fro) E U. Set i = 0. Step 1 If P (z') =0, let i'+' = i' Vj > 0 and stop; otherwise go to Step 2 .
Step 2 Let the nominal function be V. Then, calculate A (i') (t), B (i') (t) and C (z') by the method described in §6.2. 2 Step 3 Choose ai to be the first element in the sequence 1, , ... such that P(i' + & U (i')) - P (i') < -2ry&1P ( z') (70)
448 where D (it) = (A ( ii) , f3 (ii ) , C
(it))
(71)
Step 4 Set ii+1 = ii + & i f) (it)
(72)
and go to Step 1. Definition 6.3.
Let r (z, i) be the function generated at the ith iteration of Algorithm
6.3, starting from the nominal function r (i, 0) = i. To continue, we assume that the following assumptions are satisfied. Assumption (6.3A)
(6.3A1) Assumption (3.A) is satisfied. (6.3A2) Let {r(i,i)} be a sequence generated by Algorithm 6.3. Then, each of its elements satisfies the constraint qualification. Hence, from Remark 6.2.1, for each i and i, Det [M (r (i, i))] # 0 (73) where M is the coefficient matrix of the system (69). If {r(z,i)} is an infinite sequence, then lim Det [M (r (i, i))] # 0 (74)
Lemma 6.3.1. Let Assumption (6.3A) be satisfied. If {r(ii,i)} is a sequence generated by Algorithm 6.3, then {r (z, i)} has the following properties: (i) There exists a constant el >0, independent of both i and i, such that
11 f) (r (Z, i)) II <_ el [P (r (z, i))11,2
(75)
( ii) D (r (i, i)) E U for any r (i, i) E U. (76) Proof. For brevity, ( z, i) is used to denote r(i, i) in the proof. Proof of Part (i).
From (62), we have
N (i, Z-) (t, r) = I + Jt 0. ( z, i) (s) N ( i, i) (s, r) ds (77) r
Thus, from Assumption (3.A2), we obtain IN (i, i) (t, r) I <_
(n')1/2
+
ft
K1 I N (Z, i) (s, r) Ids (78)
r
I
1
uIi
449 By Gronwall's lemma (cf. the book by Ray and Szekely13, p. 62, Lemma 2.3.5), it follows that IN (z, i) (t, T) ^ < exp [(t -,r) Ki ] : exp (Ki) (79) Thus, by (79), Asssumption (3.A2), the Cauchy-Schwarz inequality, and (42), we deduce from (66) and (68) that there exists a constant K1 >0, independent of both z and i, such that Iij(z,i )I <_ K1, j =2,...,5 (80)
I.i
i, i) (t) 1
< Kl, Vt E [0,1 ] , j = 1, 6,7
(81)
and 1R8 (Z,i)(t)I
[A(i, i)]0 µ(i, i)
ii(i, i) 0
-[w (z, i)[0 -[J'(i, i) 11 + [,b ( i, i)ks(i, i)]1
(83)
where k is the coefficient matrix of the system (69). In view of (79) - (81) and Assumption (3.A2), it is clear that all the coefficents of the matrix M (i, i) are bounded. Thus, we deduce from (75) that there exists a constant K2 >0, independent of bothz and i, such that I [M(z,i)]1< KZ (84) By (83), (84), (42), (82) and Assumption (3.A2), it follows that [A (i, i)] u
2 ] 1/2
< k2 [P(z,i)11/2 1+ (1 + K1K1)
= K3 [p (j, i)]1/2
(85a)
Ili (z, i) I < K3 [P ( z, i)]1/2
(85b)
(z,i)I K3[P( z,i)]1/2
(85c)
and
where
2] 1/2 K3 k2 [1 + (1 + K1K1)
(86)
450 Thus, by virtue of (79), ( 81) - (82 ), (85) and (80 ), we deduce from the formulae of A(t), B(t ) and C (defined by (67), ( 63) and (64), respectively ) that there exists a constant K4 > 0, independent of bothz and i, such that
I[A(i,z)]lI
(87a)
1/2
)11/2
(87b)
(87c)
< K4 [P (i, i)11,2
and (87d)
IC (Z, i) 1< K4 [P (z, i)] 1/2
Hence, from (53), (87b) - (87d), Assumption (3.A2) and (42), it follows that there exists a constant K5 >0, independent of bothz and i, such that
IIA (Z, i) 11. <
K5
[P (Z,
(88)
.)11/2
Thus, from (7), (87b), (88), (87c) - (87d), ( 85a), and ( 87a), we obtain IID (i, i) II
= IIA(i,i)11o+IIA(i,i)11.+IIB(i,i)1ho+IC(i,i)I+I[A(Z,t)]ol (89)
+ I [A (Z, Z)] 1, < fl [P (i, i)11/2 where
(90)
fi = K3 +4k4+ KS Proof of Part (ii). The proof follows easily from (89) and Remark 5.1. Let Assumption (6.3A) be satisfied. Then,
Lemma 6.3.2. li Proof.
P(r(i, i) + db(r(i, i))) - P(r(i, i)) _ -2P u
(r( i , i ))
( 91 )
For brevity, define r(i,i)-z =(z,u,7r)
(92)
From (42), we have P(i + dO(i)) - P(z)
1 { X(t ) + & A(i ) (t) - O(t, x (t) + &A (i) (t), u (t) + &B (i) (t), i + &C (i)) IZ =l - IX (t) - (t, X(t), u(t), *)I2 }dt + I [w(x+aA( z),^r+&C ( z))]ol2 - I[w(X,j)10I2 + I[^(i + &A(i) fr +
«c(i))11
12 - I[+G (X,'<)11I2
(93)
451 By the mean valued theorem , we obtain
Ix(t) + & A (i) (t) - ¢(t ,X (t) + &A (i) (t), u (t) + aB (i) (t),+ &C (i))I2 - X(t)- ^(t,X(t),a(t),X)12 = IX(t)+ &A(i)(t) - 0(t,1(t),u(t),^) - &L, (z)(t )12
- IX (t) - 0(t,X(t), ii(t)I2 L, &21 A =
(Z) ( t) 12
(Z) (t) -
[X(t) - ¢ (t, x (t) , u (t) ,;r)] T [A ( i) (t) - L, (z) (t)]
+ 2& I
= &21A (Z) (t) - L2 (Z) ( t) 12 + 2a2 [A (Z) (t) - L2 (Z) (t)] Tf [L2 (i) (t) - L, (i) (t) l 1 + &
21 L
(t) 12
2 (Z) (t)
+2a[x(t)-4)(t,1(t),ii (t),X)]T [A( i)( t)- L2( z)(t), + 2& [u (t) - ¢ (t, X (t), u (t), *)] T [L2 (i) (t) - L, (i) (t)] (94) where L, (z) (t) = 4)(t, x (t) + aA (z) (t ) , u (t) + &B (i ) (t) , x + aC (z))A ( z) (t)
+ 4,.(t, X (t) + aA ( i) (t), u (t) + aB (i) (t), 1 + «c (i))B (i) (t)
+ 4)„(t, X (t) + &A (i) (t) , u (t) + &B (i) (t) ,;r + «C
(95)
L2 (i) (t)
= 4)X (t,x(t), u(t), 7f ) A( z) (t)+4„(t,1(t),ii (t),1)B(i)(t) +0, (t,1(t),a(t), I )C(i),
(96)
and a is an intermediate value satisfying
(97)
0
IX ( t) + & A ( i) ( t) - ^(t, x ( t) + a A (i) (t) , u (t) + aB (i) (t) , x + a (i))12
-Ix_
2
(&2 -
+ (2& + &2
1L
2 a)
1x
2&2)
(t)
-4) (t,1(t),u(t), 1)12
[1(t) -
2 (Z) (t) -
L,
0
(Z)
(t, :j (t) ,
(t) 12
u (t)
, 7f )J T [L2 (i) (t) - L, (i)
(t )) (98)
452 From (95), (96), Assumption (3.A2), (89) and (97), we obtain IIL2 (i) (t) - L1 (Z) (t) ll- < K2t1&P (z) < K2ti&P (i) (99) Thus, from Cauchy-Schwartz inequality, (99) and (42), it follows that
IL
1 [" (t) - (t, :j (t) , u (t)
T [L2 (z) (t) - L1 (Z) (t)] dtl
< K2t1&[P(i)]3/2
(100)
Now, by (98), (100), and (99), we deduce that
lim 1 j1 [I X(t) + nA(i)(t) 1 6 -0 & o
- q(t, x(t) + &A(z)(t), u(t) + & B(i)(t), ii + &c(z))I2 - I x(t) - 0(t, x(t), n(t ), x) I2 dt
+2
f
1
]
1 X(t) - 0(t, x(t), u(t), *)I2dt)
< li m{&P(i) + [2& - 2&2]K2t2 [P(i)]3/2 + &3Kzti [P(z)]'/2 } (101) Thus, it follows from (101) and Remark 5.1 that lim 1 Ix( t) + &A(i)(t) &_O a o L
J
12 1
- ¢(t z(t) + &A( z)(t), n(t ) + &B(z)(t), i1 + &C(i))12
- I X(t) - m(t, x (t), fi(t), j) dt _ -2
I'
I X (t) - ¢(t, x(t), fi(t
), j) j2dt
(102)
Next, by following a similar argument as that used to obtain (102) from (94), with the exception that (53) is being replaced by both (57) and (58), we can readily show that h o tip \x+crA(z) 1r+aC(
z)/JoI2 - I[w (X,'')]oI2}
= -2I [w (x, *)]o 12 (103) and o l l ' (x + &A (i)' fr + 112 - 1[0 (X, x)^i 12 -21[0
r)] 1 12
} ^a
(104)
453 Combining (93), (102) - (104), we obtain lim
P(z + &D(z)) - P(z) = -2P(i)
a-0
(105)
a
This completes the proof of the lemma. Remark 6.3.2.
Consider Step 3 of Algorithm 6.3. Since '5E (0,1/2) and P( r(i, i)) >
0, it is clear from Lemma 6.3.2 that the inequality (70) will be satisfied when a; is sufficiently small. Thus, Step 3 of Algorithm 6.3 is well-defined. Let {%'} be a sequence in U such that lim P(V) = a, where a is iW a finite number. Furthermore, let fail be a sequence of real numbers such that Remark 6.3.3.
lim ai =0. Then, by following a similar argument as that given in the proof of Lemma 6.3.2, we can show that
lim
Theorem 6.3.
P(i' + aiD(z')) - P(i') = -2a ai
Let { i' } be a sequence generated by Algorithm 6.3. Then the sequence
{ P( V )} converges to zero at a quadratic rate.
Proof
Firstly, let us show that lime(i') = 0 ioo
Since { P (i')) is a non-increasing sequence of real numbers, which has a lower bound zero, it is convergent. Suppose that limP(i')=a>0 i^
(106)
2 'a;P (i') < P (z') - P (Z'+1)
(107)
From (70) and (72), we have
for all i. Hence, it follows from (107) and (106) that lim &i = 0 i-00
(108)
Now, Let N be a positive integer such that ai < 1 for all i > N . Since ai is the first element in the sequence 1, /j, /2,... such that (70) holds. Furthermore, ai//3 is the element which just preceeds &i in the above sequence. Thus, (70) does not hold if we replace ai by ai/Q in (70). On this basis, it follows that P(z' + (ai /6)D (ii)) - P (i' ) > - (2ai/Q) yP (i')
(109)
454 for alli>N. From (106) and (108), we note that Jim P (z') =a and Em {& /%j} = 0, respeci-+oo
ioo
tively. Thus, by substituting a!; = &0 in Remark 6 . 3.3, we have
slim lP(Z^ + (n`l $)D (
z')) - P (i`) } /2 (d;/,B)
On the other hand , it is dear from ( 109) that lim {P(i' + (& •//)D (i')) - P (Z') } /2 (&+/i) ? -7a Combining them together , we obtain ilirn (1 - ') P (z') < 0 This is, however, a contradiction. Therefore, we conclude that limP(i')=0
(110)
Our next task is to show that { P (i`)} converges to zero at a quadratic rate. By letting & = 1 in (98), we obtain from (98) and (99) that
X: (t) + _ (i`) (t) - 0(t, X` (t) +
(i') (t) , u' (t) + B (i') (t) , jr_ + C (i')) I2
< (K2)2 (11)4 [P(i)j2 (111) In view of the mean value theorem , (57), Assumption (3.A2) and ( 75), we obtain [w(x'+A(z ), ii +C(z`))}o12 [CJ(X', 7r ') + CJX(% +
+ 5'(x' + (j'),
7r'
&A(z'),
7r' +
C ( 7''))A(^'`)
&
+ &C(z`))O(Z')]o 12
+&A(i'), +&c(i')) +A(i`), i' + C(i')) } A(i') + {Cw,(x` + &A(i`), jr` + 0x(X` +
A(i`),
*' +
C(i`)) } C(i')]
< (1 - &)2(K2 )2I[A(z') + C(Z') ]o14 < (K2 ) 2Q1)4[P(i')]2
o 12 (112)
where & is an intermediate value satisfying 0<&<1. Similarly, we can show that
I[V,(X'+A(i'),k'+C Z`))] 1 1 2
1
t
(t )4 < (K2)2
[P (Z`)] 2
1'
(113)
455 From (111)-(113), we obtain (114)
P(i1 + D (ii.)) < 3 (K2)2 (e1)4 [P (ii)}2 Now, in view of (110), there exists an integer io > 0 such that i 1 - 2j P(%) < 3(K2) 2(el)4
(115)
whenever i > io. Combining ( 114) and (115), it follows that
P(i' + D (i' )) < [1 - 255] P (z') whenever i > io. Thus, inequality (70) is satisfied if we choose di = 1 for all i > io. Consequently, we have z
i+1
i
= z i + f) (i )
(116)
for all i > io. From (110), (114) and (116), we conclude that {P (k') }converges to zero at a quadratic rate. This completes the proof. Remark 6.3.4.
From Thoerem 6.3, it is clear that under Assumption (6.3A), Algo-
rithm 6.3 generates in a finite number, i (e, i), of iterations, a function r (i, i (e, i)) such that r (i, i (e, i)) C T. Lemma 6.3.3. Suppose that Assumption (6.A3) is satisfied. Let i E U be such that P (i) >0, and let r (i, i) be as defined in Definition 6.3. Then, there exists a constant e2 >0, independent of both i and i, such that Ir(i,i)-i l <12 [P(z)]112
Proof.
For brevity, let r (i, i) = i` = (i', ui, ii)
Let a (i, i) be the value of & obtained in the ith iteration of Step 3 of Algorithm 6.3. Without loss of generality, we can assume that {a ( z, i)} is an infinite sequence. From the proof of Theorem 6.3, it is clear that for eachz E U fl T. and for all i
a
(i,i)
>
0.
(117)
and ilim a (i, i)
=
1
#
0.
(118)
456 Now, it is clear from (117) and (118) that there exists an &, 0 < & <1, independent of both i and i, such that 1
>
&
(i,i)
>
&
(119)
From (70) and (72), we have P (ii+1) < [1 - 2y&] P (a') = &P (Z') where 0<&=1-2y&<1
(120)
P (ii) < (&)' P (i)
(121)
IID (Z') II s el [(&)' P (Z)] 1/2
(122)
Thus,
By (75) and (121), we have
But
i-1 (123)
= i + E & (z,7) D (i') i=0 Thus, from (123), (119), (122), and (120), it follows that
Ili, - ill i-1 < 21 [P(i )]1/2 E[(&)j]1/2 j =0 <
£l[P(i)]1 /2/{1 - (&)1/2} = Q2[P( Z)]1/2'
where Q2 = fl/ 11 - (&)1/2
J
This completes the proof.
°II
457 7. Sequential Gradient -Restoration Algorithm 7.1. Gradiant Phase Let z = (x, u, a) E Te be any nominal function for the gradient phase. Define z+Az = (x + i x,u+Au ,Tr+A7r). Then , the first variation of
{J(z + Az) - J(z)} can be expressed as: ={ f f.Ax + f. Au + f,Ox } dt + [g,4x + 9,Ox] o + [hx Ax + h.,Air] 1 (124) If e > 0 is sufficiently small, then the differential equations (2) together with the boundary conditions (3) - (4) can be approximated up to the first order by Ak =
Y'xOx
+ Oul u + 0"0r
(125a)
[fixAx + W,r07r]o = 0
(125b)
[ xtx+
(125c)
'O,r07r]1
=0
To the first order, the minimum of J is achieved if the first variation (124) is minimized subject to (125a) -(125c). To make the problem meaningful, we want the variations Au (t), O7r to satisfy the isoperimetric constraint
J 1 IDu12dt + IA 7rI2 = K (126)
Let the above problem be referred to as (T). Remark 7.1. Since system (125) is obtained from equations (2) - (4) by quasilinearization, it is clear that the constraint qualification for the problem (T) is exactly the same as that for the problem (T) (cf. Remark 6.1). Thus, by following exactly the same argument as that given in §4, we can show that if z satisfy the constraint qualification of the problem (T) (i.e., (M(z))-1 exists, where M is defined by (37)), then there exist Az=(Ax, Au, Da) and )(t) ER", y ER', µ ER',aER(1/2a is the Lagrange multiplier of the isoperimetric constraint (126)) such that (125) (126) together with the following equations are satisfied.
AT = fx - ATOx (127a)
458 A U T /0, - ATo„ (.f a -AT o
+
x) dt + [gw + µTWr] 0 + [hx
fu
= 0 (127b)
[-AT + gx + PTW.] 0 = [•1T + hx +
A
+ M^rT /a = 0 (127c)
+ ATka] 1
0
(127d)
T b ] 1 = 0 (127e)
7.2. Coordinate Transformation Let A = Ax/a (128a) B = Au/a (128b) (128c)
C = A7r/a Then, equations ( 125a) and (127a) - (127c) become:
A=q A +¢uB+OnC (129a) AT=fx - ATox (129b)
BT = -fu + ATo. (129c) CT ={f+T} dt - [g+[h j
+
(129d)
and the boundary conditions (125b) - (125c) and (127d) - (127e) become :
[WxA + W,rC]0 = 0 [1'XA +
[-
AT
[AT
,i,C]1
(130a) (130b)
=0
+gx +,UTWx]0 =0
0
+hx+ATp.]1
(130c) (130d)
Moreover, the isoperimetric constraint ( 126) becomes: 1
a
j
1 BTBdt+CTC =K
J
2[ ,
(131)
Since we do not know what value of K to use in (131), we shall consider a as the steplength in the equations Ax (t) = aA (t)
(132a)
Au(t) = aB (t)
(132b)
459 Air = aC (132c) in order to perform a line search on the objective function J. The method for determining the step-size a is given in §7.4. 7.9. Integration Technique In this section , we shall describe a method for solving the gradient corrections A (t), B (t) and C. From (129b), (130d) and (62), we obtain A (t) = - (N (1, t))T [([h.], )T + ([T^,x]1)T I'J (N (r, t))T (
f
fx (r))T
t
(133)
dr
Using (133), we obtain from (129c), (129d) and (130c) that B (t) = k1 (t) µ + k2 (t)
(134a)
C = k3µ + k4µ + k5
(134b)
k6µ + k7µ = ks
(135a)
and
k1(t) = -(4„(t))T(N(l, t))T([ox]1)T (135b)
k2(t) = -(fu (t))T - (0.(t))T [(N(l,t))T( [h.],)T
(135c)
+ f 1 N(r, t )((fx(r))T dr] /
T
(135d)
T
] ([+/^x]1^ (135e) T
[j1 (
k5 = -
J1
l
(fx(t))T dt - ([g,.]. +
(q5,.(t))T [(N(l,
- [
[h,r]1)T
t))T ([hx]1)T + L N(t, r)(fx(r))T dr dt
j
10 1
k6 = ([wx]o)T
(135f)
(135g)
k7 = (N(l,0))T([,x]1)T (135h)
460 , k8 = -([gx]o)T - (N(1, 0))T([h.]1)T - f N(r,0)(ff(r))Tdr 1
(135i)
0
Thus, from (129a), (134a) and (134b), we obtain
A (t) = N (t, 0) A (0) + k9 (t) ,u + klo (t) µ + k11 (t)
(136)
k9 (t) = k3 f N t (t, r) 0, (r) dT 0
(137a)
k1o (t ) = f N t (t, r ) [45„ (r) k1 ( r) + (r) k4 ] dr a
(137b)
( r) k5] dT
(137c)
where
k11 (t ) = f N t (t, r) [Ou ( r) k2 (r) + 0 From (135a), ( 130a), ( 130b ), ( 136) and ( 134b ), we have
kep + k7µ = k8
(138a)
[w,r]o A ( 0) + ['.',rk3]o µ + [W ^k4 ] o µ = - [w,rk5]o (138b)
[^G =l1 N ( 1, 0) A (0) + [OXk9 + ,, k3]1 a + ['+/Xk1o + O,rk4]i µ = - ['.k11 + &,k5]1 (138c)
Remark 7.3.1. Similar to the system (69), the system (138) has a unique solution in A (0), y and µ if and only if the nominal function z satisfies the constraint qualification (cf. Remark 7.1). Remark 7.3.2.
Note also that A (t), B (t) and C can be easily calculated from (136),
(134a) and (134b), respectively, once A (0), µ and µ are determined from (138). Thus, Ax (t), Au (t), A7r can be obtained from (128a) - (128c) if a is known.
An Iteration of the Sequential Gradient Restoration Algorithm Let -ft E (0, 1), /3 E (2 ,1) be arbitrary but fixed. Then, we define
A(e) = {(/3)t : (/3)1 >-
yee,
t E {0, 1, 2, ...
11 ,
(139)
i.e., the set, A (e), of step-size corrections contains all the values of the form 1, /3, (0)2 (/3)3 .... which are equal to or greater than yte. We now define an iteration of the Sequential Gradient-Restoration Algorithm as follows :
461 Algorithm 7.4. Step 0 Let y, yj E (0, 1); 6 E (2, 1) be given constants.
Step 1 From a nominal function z = (x, u, r ) E TE fl U, obtain D (z) = (A (z), B (z), C (z)) E U together with A (z), µ (z), µ (z) by solving system (129) - (130). Step 2 Compute the step-size a(z, e) as follow: a (Z, E)
0 if Q(z) < -yIe 0 if J( r(z + aD(z ), i(e, z + aD(z)))) -J(z) > -yaQ(z), Va E A(e) max{a E A(e) : J(r( z + aD(z ), i(e, z + aD(z)))) otherwise -J(z) < -yaQ(z)}, (140) where Q(z) =Q(z,A(z),A(z), p(z), ji(z ))
(141)
and i (e, z) denotes the minimum number of iterations of Algorithm 6.3 required for obtaining a function belonging to the set Te, starting from the nominal function z.
Step 3 Let GR(z,e) = r (z + a (z , e) D (z), i (e, z + a (z , e) D (z))) (142) In order to establish some important properties concerning Algorithm 7.3 and to prove a convergence result for the sequential gradient-restoration algorithm given in the next section, we need the following assumptions : Assumption (7.4A) (7.4A1) The constraint qualification is satisfied for any nominal function z in Step 1 of Algorithm 7.4. (7.4A2) Assumptions (6.3A1 ) and (6 .3A2) are satisfied. (7.4A2) The set Ao fl To is non-empty, i.e., there exists a z E 3 such that the necessary condition for optimality is satisfied. Lemma 7.4.1.
Suppose that Assumption (7.4A) is satisfied. Then, there exists a
constant e3 > 0 such that JID (z)
11
<
t3
(143)
462 for any nominal function z defined in Step 1 of Algorithm 7.4. Proof.
The proof can be easily obtained by using a similar approach as that given
in the proof of Lemma 6.3.1.
Lemma 7.4.2. Suppose that Assumption ( 7.4A) is satisfied. Then, there exists a constant e4 > 0 such that JID (z2) - D (z1 ) 11 <- £4Ilz2 - zl lI (144) and IQ (z2) - Q (Z1) 411Z2 - Z1II (145) for any nominal function z1, z2 defined in Step 1 of Algorithm 7.4.
Proof.
Let z', z2 be any nominal functions defined in Step 1 of Algorithm 7.4.
Using ( 62) and Assumption (3.A2), we get I N(z2)(t, r ) - N(z')(t,T)f J t I0x( z2)(s)N (z2)(s, T) - 0X (z1)(s)N(z1 )( s, 7-)Ids r f t{^0x(z2)(s) N(z2 )(s, T) - 0x( z2)(s) N( zl)(s, T)I /+ 19Sx(z2)(s) N(zl )( s,T) - q5x ( z1)(s) N(zl)(s,T)I}ds
J t KIIN( z2)(s,T ) - N(zl)(s,T)Ids r
+
1
K2 N( z1)(s,T ) 1 11z2 -z1IIds (146)
By using a similar approach as that used to obtain (79) from (62), it follows that
IN(z)(s,T)I
<exp(Ki)
(147)
for all nominal functions z defined in Step 1 of Algorithm 7.4. Thus, from (146) and (147), we get I N(zJ 2)(t, r) - N(zl)(t,T)I K 1IN(z2)(s,T)-N(z1)(s,T)lds+K2exp(Kl)I1z2-z1II
148)
Applying Gronwall's lemma, it follows that I N(z2)(t,T) - N(z1)(t, r)J < K2 exp(Ki) exp(Ki (t - T)) 11 z2 - z1 lI < K2[exp(K1)]2 JJz2 - z1II <_ ki Ilz2 - zl 11
(149)
463 where ki = K2[eXp(Kl)]2 (150) Using (149), Assumption (3.A2), and (147), and then following a similar approach as that used in obtaining (149) from (62), we deduce from (135) and (137) that there exists a constant x2 > 0 such that for all t E [0, 1], we have I kj(z2 )( t) - kj(z ')(t)I <-:k2 IIz2 - z111 j = 1, 2, 9,10,11
(151)
jkj(z2 )- kj(z1 )I :^ k2IIz2- z1II, 7 =3 ,4,5,6,7,8
(152)
and
where kj, j = 1,...,11, are as defined in (135) and (137). From (138), Assumption (7.4.A1) and Remark 7.3.1, we have [A(z2)]o µ(z2) %i(z2)
[A(z')]o
-
µ(z1) A(z1)
= [M(z2)]-1b(z2) - [M(z1)]-1b(z1) (153) where b(z) _
kg(z) [(-, ( z)ks(z )]o -[wx(z)kii( z) +w,(z ) ks(z)]1
(154)
and M is the coefficient matrix of the system (138). Using Assumption (3.A2), (152), (151), (149) and (147), we deduce from (138) that there exists a constant X4 > 0 such that lb(z2) - b( z1)I <_ x4lIz2 - z' lI (155) and IM,j(z2)-M;j(z1)I s k4IIz2-z111, i,7=1,...,n+r+s (156) where Mij is the coefficient of the matrix M of system (138). By Assumption (7.4A1) and Remark 7.3.1, it follows that there exists a constant 9Cs > 0 such that Det [M(z)] > X5 (157) for any nominal function z defined in Step 1 of Algorithm 7.4. Thus, we deduce from (156) and (157) that there exists a constant k6 > 0 such that I(M(z2)) -' - ( M(z 1))_11
< xsIIZ2 - z'
1
(158)
Thus, from (153), (158) and (155), we can show that there exists a constant X7 > 0 such that I[A(z2)]o - [A(z1)]o1 <- x7[pz2 - z ' I[ (159a)
464
Iµ(Z2) - P( z 1)I
<-
x7112 2 - 2 111
1A(z2) - f(z1 )l
<-
x7IIz2 -
(159b)
and 2111
(159c)
In view of (149), (151), (159) and (152), we deduce from the formulae of A(t), B(t) and C (defined in (136), (134a) and (134b), respectively) that there exists a constant 7C8 > 0 such that
I[A(Z2)]1 -
[A(z1
)ll I
x81122 - 2111
<_
(160a)
IIA(Z2) - A(z')II. < x81122 - 2111 (160b) 1IB(z2) - B(z1 )II, < x81122 - 2111 (160c) and
IC(z2) - C( z1)1 <- x81122 - 2111 (160d) Thus, from (129a), Assumption (3.A2) and (160b) - (160d), it follows that there exists a constant 9C9 > 0 such that
IIA(z2) - A(ZI)11, < x91122 - 2111
(161)
Thus, from (7), (160b), ( 161), (160c), ( 160d), (159a) and (160a), we get
IID(Z2) - D(z 1)il = IIA(z2) - A(z')II. + IIA(Z2) -
A( ZI )II. + IIB(z2 )
- B( z 1)11.
+ IIC(z2) - C(ZI)IIoo + I[A(Z2)]o -
[ A(z 1)lo l+I [A(z2)]1
<
-
9(IoIIz2
2111
- [
A(z 1)]1I
(162)
where x1o = x7 + 4X8 + x9 (163) From the definition of ()(z) and Q(z) (defined in (141) and (43), respectively) together with the definition of A(z), µ(z) and µ(z) (obtained by solving the system (129) and (130)), it follows that (Xz2) - Q(ZI)
= ^I
[IB(Z2 )(t)12
- IB(z1 )(t)I2Jdt + IC(Z2)I2 - IC(ZI)I2 (164)
Thus, from (164), (160c), (160d), Lemma 7.4.1 and the definition of D(z), we obtain
iQ(z 2) - Q(z1 )I
<_ 2
x81122 -
2111 (165)
465 Let £4 = max{x10, 2Qs 8 } (166) Then, the rest of the proof follows easily from ( 162) a nd (165).
Remark 7.4.1.
In view of Lemma 7.4.1, it is clear that for any z E U, the set {z + aD (z ) : a E [0, 1]}
is compact with respect to the norm defined by (7). Lemma 7.4.9.
Suppose that Assumption (7.4A) is satisfied. Then, lim
J(z + aD(z)) - J(z) _ _Q(z)
a-.0
Proof.
(167)
a
Define z = (x, u, K ).
(168)
Flom (1), (128) and the mean value theorem, we obtain J (z + aD (z)) - J (z) =a
J
1 [L1 (z) ( t) + L2 (z ) (t) - L1 (z) (t)] dt
+a[L3(z)+L4 (z)-L3(z ) ]
(169)
where L1 (z) (t) = fX(t,x(t),u(t),7r)A(z)(t)+f„ (t,x(t),u(t),7r)B(z)(t) +f,(t,x(t),u(t),7r)C(z)
(170)
L2 (z) (t) = f, (t, x (t) + a1A (z) (t), u (t ) + it, B (z) (t ), 7r + a1 C (z)) A (z) (t)
+ f„( t,x(t )+a1A( z)( t),u(t) +a,B(z)(t), 7r +a,C( z))B(z)(t) +f, (t,x(t)+a,A(z)(t), u(t)+a1B (z)(t),7r+&1C(z))C(z) (171) L3 (z) = [gx (x, 7r) A ( z)]0 + [g, (x , 7r) C (z)]0 + [hx (x, 7r ) A (z)], + [h,r (x, 7r) C (z)], (172)
466 L4 (z) = [gx(x+a2A(z)(t),ir+62C(z))A(z)]0 +[gx(x+a2A(z)(t),x+a2C(z))C(z)1o + [hx (x + 63A (z) (t), 7r + 63C (z)) A (z)]1 +[h,r(x+«3A(z)(t),7r+a3C(z))C(z)]1
173)
and di, i =1,2,3, are appropriate intermediate values satisfying 0 < ai < a
(174)
From (170), (171), Assumption (3.A2), (143), and (174), we deduce that IL1(z) (t) - L2 (z) (t) I < K2a1C3 < K2a23
(175)
Similarly, by (172), (173), Assumption (3.A2), we obtain IL3 (z) (t) - L4 (z) (t) I < K2e3 (a2 +a3) < 2K2a23
(176)
Now, combining (169), (175), (176), it follows that li
J(z + aD(z)) - J(z) o
= j
L 1(z)( t ) dt + L 3(z)
( 177 )
From (170), (129b) - (129d), (129a), (130d), (130c), (172), (130b) and (130a), we can show that
f 1 L1 (z) (t) a
= f (A(z)(t))TA(z)(t) + (A(z)(t))T {0x(t, x(t), u(t),1r)A(z)(t) + ¢,.(t, x(t), u(t),7r) B(z)(t) + ¢,r(t, x(t), u(t), 7r)C(z)}dt -
f
1(B(z)( t))TB(z)(t )dt - (C(Z))TC(Z)
{9,(z, lr ) + (µ(z))T'''n(x, _) } C(z)] o
- [{h,r(x, 7r) + (µ(z))T'b,r(x, 7r)}C(Z)j
[(A( z))TA( z)]1 -
f
- [(,\ (z))TA(z)]o
1(B(z)( t))TB(z )(t)dt - (C( z))TC(z)
- [{g,r(x, ir) + (µ( z))T°J, ( x, 7r)}C(z )lo - llh, (x, ir) + (F1(z))Tb,r (x, ir)}C(z)] 1 = - [{h.(x, ir) + (A(z ))T', (x, 7r)}A(z)] 1
- [{gx(x, 7r ) + (/U(z))TfOx ( x, 7r))A(z)] o
467
- J (B(Z)(t))T B(z)(t)dt - (C( Z))TC(Z) - [{g, (X, jr) + (p(z))Twa(X, r)}C(z)] 0
- [{h,r(x, 7r) + (i(z))T ,(X, ir)}C(Z)]1
-L3(Z) - f 1 (B(z)(t))T B(z)(t)dt - (C(Z))TC(Z)
(178)
On the other hand, from the definitions of Q (z) and Q (z) (defined in (141) and (43), respectively) together with the definitions of A (z), µ (z) and ji (z) (obtained by solving the system (129) and (130)), it follows that
Q (Z) = J0 1 (B (z) (t))T B (z) (t) dt + (C (z))T C (z) (179) Combining (177), (178) and (179), the conclusion of the lemma follows readily. Definition 7.4. For any z E U and e >0, let B(z, E) _ {z E U: IIz - zll < E} (180)
Lemma 7.4.4. Let Assumption (7.4A) be satisfied, and let z E To be such that z ^ Do. Then, there exist a constant p' > 0 and an integer i' > 0 such that for any zEB(z,p')andforalliE{0,1,2.... } J(r(z +
(//)`I
D (z)), i) - J (z) <
-7 (,0)'; Q (z)
(181)
Proof For each z E U, let z+aD(z )= z=(x,u,*)
(182)
r(z + aD( z), i) - [z + aD(z)] = D = (A, B , C)
(183)
and
Thus, by (1), the mean value theorem, (3.A2), definition of IIDII, (183), and Lemma 6.3.3, it follows that J(r(z+aD ( z)),i) - J(z+aD(z)) =
f
1 { f,^(z + a1D) (t) A ( t) + ff(1 + a1D) (t) B (t) + ff (z + a1 D) (t) C} dt
0
+ [9.(X + a2A , i + a2C)A - g,r(x + a2A, * + a2C)C] +
[h
.(X
+ a3A ,
$
+ a3C)A - h.,(X + a3A ,
a
+ CY3C)C
0
l
< 2K1lIDII <_ 2K1t2 [P ( z + aD (Z))]1/'2
(184)
468 where a;, i =1, 2, 3, are appropriate intermediate values satisfying 0
(185)
Now, from (169), (178), (179), (175) and (176), it is clear that J(z + aD(z)) - J(z) + Q(z) a
< 3K2a(e3)2
(186)
for all z E U and for all a E [0 ,1]. Thus, J(z + aD(z)) - J(z) + Q(z) < 0.01Q(z) (187) a for all z E U and for all a E [0, a41, where 0.01Q(z) l a4 = Max 11
3K2(e3 )2
(188)
Combining (184) and (185), we have J(r(z+aD( z)),i) - J(z )+ryaQ(z) < 2K1e2 P(z + aD(z)) - all - ry)Q(z) + 0.01aQ( z) (189) for all z E U and for all aE[0,a4]. In view of (189) and Lemma 7.4.2, we have J (r (z + aD (z)), i) - J (z ) + ryaQ (z) < 2Kie2
P(z + aD(z)) - a(0.99 - ry)Q(z) + 0- 01a4 11Z - ill (190)
for all z E U and for all a E [0, a4]. Define 9 (z, a) = 2K, e2
P(z + aD (z)) - a (0.99 - ry) Q (z) + 0.01ae4 11z - z11 (191)
Since z E To satisfy the system (2) - (4) exactly, it follows from the mean value theorem and (129a) that
469 0(t, R (t ) + aA(z)(t ), u(t) + aB(z)(t), a + aC (i)) - [iz(t ) + aA(z)(t)]
=
[¢(t, z(t) + aA (z)( t), u(t) + aB(z)(t), a + aC(Z)) - q(t, X(t), u(t), a)] - aA(z)(t)
= 2(a)2
a20(t, X(t))A i (z)(t)Ai(z)(t)
A a 7 i-1 i=1
+ M M a20(t, X ( t)) Bi(z)(t)Bi(z)(t) L. auiaui j=1 i=1
^ + E E -T E 1 = 1 i=1
X (t))Ci
a^ri ar;
+ m n 2a20(
j=1 i=1
(Z)(t)C,(Z)(t)
t, X (t)) Ai
axiaui
( z)(t)Bi(z)(t)
+ p n 2 020(t, X (t)) Ai(z)(t)Ci(z)(t) L1 ;=1 i= 1
ax ;a,;
+ 1 m 2820( t, X (t)) Bi (Z)(t)Cc(z)(t)l au;airi J
( 192)
=1 i=1
where
X = ( x ( t) + &A ( z) (t) , n ( t) + & B (z) (t) , *r + &C (Z)) = Z (t) + &D (Z) (t) (193) and & is an intermediate value satisfying 0
<
&
<
a
(194)
Hence, from (192 ), Assumption (3.A3), Remark 7.4.1, and Lemma 7.4.1, there exists a constant 9Ci (z ) > 0 such that
(t X (t) + aA (Z) (t), u (t) + aB (z) (t), zr + aC ( z)) - I z (t) + aA (z) (t)] (195)
< X1(Z)(a)2
for almost all t E [0,1]. Now, by using a similar argument as that used to obtain ( 195) with the exception that ( 129a) is being replaced by both ( 130a) and (130b), we can readily show that there exists a constant X2 (z ) > 0 such that
I[w (x + aA ( z), ir + aC (z))]o I < X2 ( z)
(a)2 (196)
and [^(X+aA(z ), ir +aC(z))] lI <XC2 (z)(a)2 (197)
470 Thus, it follows from (42) and (195) - (197) that P(z + .D (j)) < 7C (z) (a)2 where
(198)
12 (199)
9C (z) = [(9C1 (z))2 + 2(5(2 (z))2] Let
(0.99 - y)Q(z)
as = mi
{ a4
(200)
, 5C(z)4KIE2
Thus, by (191), (198) and (200), we obtain (201)
O(z, a) < - a (0.99 - 7)Q(z)
for all a E [0, as]. In view of Remark 5.1 and Lemma 7.4.2, it follows that for each a > 0, 0 (., a) is continuous with respect to the norm defined by (7). Now, let a = /i`z, where i" is a positive integer such that /i'` < a5. Thus, there exists a positive constant ps such that for all z E B (z, ps), we have
0(z, 0(z, /3`) < a4 (0.99 - 7)Q(z) Consequently, it follows from (201) that
0(z, - 4 (0.99 - 7)Q( z)
(202)
Combining (190), (191) and (202), we obtain J(r(z + / 3 ' D (z)), i) - J (z) + 7/3 Q (z) < 0(z, /i') < 0 for all z E B (z, p'). This completes the proof.
Theorem 7.4. Let Assumptions (7.4A) be satisfied, and let z E To be such that z V A o. Then, there exist p (z) > 0, e(z) > 0 and 6 (z) < 0 such that for all e E [0, e (Z)], J (GR (z, e)) - J (z) < 6 (z) , Vz E B (z, p (z)) fl T. (203) where GR (z, e) is the function generated from z E Te by Algorithm 7.4. Proof.
The proof is exactly the same as that given for Proposition 4 of the paper by
Mukai and Polak", except that Assumption 2(i) and Assumption 2(iii) in the proof
471 of Proposition 4 of the paper by Mukai and Polak", are being replaced by Lemma 7.4.2 and Lemma 7.4.4.
7.5. The Main Algorithm In this section, we present the Sequential Gradient-Restoration Algorithm (SGRA), for solving the problem (T). Its convergence properties will also be investigated. This algorithm has two main features. Firstly, sufficient amount of decrease in the objective function J within each combined minimization - restoration step is guaranteed in order to ensure convergence of the algorithm. Secondly, in order to prevent the disadvantage of requiring a huge amount of iterations in the restoration algorithm (Algorithm 6.3), we develop a scheme in which the value of e is reduced as the algorithm approaches the optimal solution. Thus, at points far from the optimum, we require only a rough estimate of feasibility, whereas by approaching the optimal solution, the feasibility requirements are gradually tightened.
Algorithm 7.5.
(SGRA)
Step 0 Let Ty E (0, 1), P E (2 ,1), e° > 0 be given constants. Choose a function z° EU. Set k=0, I=O,E=e°. Step 1 Use Algorithm 6.3 to generate a function r (zk, i (E, Zk)) E T. Set z = r (zk , i (e, zk ))
(204)
Step 2 Use Algorithm 7.4 to compute GR(z,E). Step 3 If
J(GR(z,E)) - J(z) < -ye,
(205)
go to Step 4 ; else set yt = z, e = fie, t = i + 1 and go to Step 1. Step 4 Set zk+1 = GR (z, e), k = k + 1 and go to step 1 . Theorem7.5. Let Assumption (7.4A) be satisfied. Suppose that the sequence {zk} generated by Algorithm 7.5 is contained in a compact set (with respect to the norm defined by (7)). Then, one of the following two terminations can occur : (i) If Algorithm 7.5 jams at a particular point i and generates an infinite sequence then any accumulation point y*(with respect to the norm defined b y (7)) satisfies y*ET°f10° (206) (ii) If Algorithm 7.5 generates an infinite sequence {zk}, then any accumulation point z* of this sequence ( with respect to the norm defined by (7)) satisfies z* E Ton 0° (207)
472 Proof.
The proof of (i) follows easily from Theorem 7.4 and Theorem 6.3. The proof
of the fact that z` E To in part (ii) is exactly the same as that given for Proposition 2 of the paper by Mukai and Polak". The proof of the fact that z' E Ao is exactly the same as that given for Proposition 3 of the paper by Mukai and Polak", except that Assumption (2.3) in the proof of Proposition 3 of the paper by Mukai and Polak" is being replaced by Theorem 7.4.
8. An Illustrative Example Consider the problem (cf.
the book by Ray and Szekely13, pp.
245-247) of
minimizing the cost functional J=
-x2
(1)
(208)
subject to the differential constraints 21 = - (u + rut) x1
(209a)
22 = ux1
(209b)
7x1 (1) + (1 - 7) = x1 (0)
(210a)
7x2 (1) = x2 (0)
(210b)
and the boundary conditions
By letting x1 (0) = 7r1 and x2 (0) = ire, (210a) and (210b) are transformed to x1 (0) - Irl = 0
(211a)
x2(0)-7r2 =0
(211b)
7x1(1)+(1-7)--I =0
(211c)
7x2 (1) - 7r2 = 0,
(211d)
Thus, this problem belongs to the same class of problem as that described in §2 of this paper. Choose 7 = 0.1 and /j =0.5. Furthermore, let z° (x° u° 7ro)
where x° (t) = f i 1, u°(t) = 0, 7ro = f i t (212) as the nominal function. Then, Algorithm 7.5 is employed to solve the problem iteratively. From the computed result tabulated in Table 8.1, it is clear that at the 40th iteration, both P (z) and Q (z) are extremely close to zero.
473
J(zk)
P(zk)
Q(zk)
0
-0.5350
6.48x10 -6
0.91046
2
-0.5516
6.26x105
0.00701
4
-0.5573
8.54x105
0.00323
6
-0.5610
9 . 38x105
0.00106
8
-0.5644
1.53x105
0.00121
10
-0.5662
3.98x10 - 5
0.00085
12
-0.5675
2 .26x10- 5
0.00063
14
-0.5685
1 . 78x10-5
0.00048
16
-0.5692
1 .59x105
0.00038
18
-0.5697
1.55x10 -5
0.00031
20
-0.5703
2.00x10-5
0.00026
No. of iteration
22
-0.5707
1.64x10 -5
0.00022
24
-0.5710
1 . 61x10-5
0.00018
26
-0.5714
1.60x10 -5
0.00016
28
-0.5716
1 .56x10 -5
0.00014
30
-0.5719
1.60x10-5
0.00012
32
-0.5721
2.23x10-5
0.00011
34
-0.5722
1 .95X10-5
0.00010
36
-0.5723
3 . 04x10-1
0.00009
38
-0.5725
1.83x10 -5
0.00008
40
-0.5727
1.96x10 -5
0.00007
Table 8.1
474 References 1. A.V. Fiacco and G.P. McCormick, Nonlinear Programming : Sequential Unconstrained Minimization Technique, John Wiley, New York, New York, (1968). 2. F.A. Lootsma, Boundary Properties of Penalty Functions for Constrained Minimization , Ph.D. Thesis, Technische Hogeschool Eindhoven, The Netherlands, (1970). 3. Bryson, A.E., and Denham, W.F., A Steepest-Ascent Method for Solving Optimum Programming Problems, Journal of Applied Mechanics , Vol. 84, (1962). 4. H.J. Kelley, Gradient Theory of Optimal Flight Paths, ARS Journal, Vol. 30, (1960). 5. A. Miele, H.Y. Huang and J.C. Heideman, Sequential Gradient-Restoration Algorithm for the Minimization of Constrained Functions, Ordinary and Conjugate Gradient Version, Journal of Optimization Theory and Applications, Vol. 4, (1969), 213-246. 6. A. Miele, A. Levy and E.E. Cragg, Modifications and Extensions of the Conjugate Gradient-Restoration Algorithm for Mathematical Programming Problems, Journal of Optimization Theory and Applications , Vol. 7, (1971), 450-472. 7. M. Rom and M. Avriel, Properties of the Sequential Gradient-Restoration Algorithm (SGRA), Part 1: Introduction and Comparison with Related Methods, Journal of Optimization Theory and Applications , Vol. 62, (1989), 77-98. 8. M. Rom and M. Avriel, Properties of the Sequential Gradient-Restoration Algorithm (SGRA), Part 2: Convergence Analysis, Journal of Optimization Theory and Applications , Vol. 62, (1989), 99-125. 9. A. Miele, E. Pritchard and J.N. Damoulakis, Sequential Gradient-Restoration Algorithm for Optimal Control Problems, Journal of Optimization Theory and Applications , Vol. 5, (1970), 235-280. 10. R.S. Long, Newton-Raphson Operator; Problems with Undetermined End Points, AIAA Journal, Vol. 3, (1965). 11. H. Mukai and E. Polak, On the Use of Approximations in Algorithms for Optimization Problems with Equality and Inequality Constraints, SIAM Journal of Numerical Analysis, Vol. 15, (1978), 674-693. 12. A.E. Bryson and Y.C. Ho, Applied Optimal Control - Optimization, Estimation and Control, John- Wiley & Sons, (1975). 13. W.H. Ray, and J. Szekely, Process Optimization, John- Wiley F4 Sons, New York, New York, (1973). 14. A.A. Goldstein, Convex Programming in Hilbert Space, Bulletin of American Methematical Society, Vol. 70, (1964), 709-710. 15. E.S. Levitin and B.T. Polyak, Constrained Minimization Methods, USSR Computational Mathematics and Mathematical Physics, Vol. 6, (1966), 1-50.
475 16. D.G. Luenberger, The Gradient Projection Method Along Geodesics, Management Science, Vol.18, (1972), 620-631. 17. G.P. McCormick and R.A. Tapia,_ The Gradient Projection Method Under Mild Differentiability Conditions, SIAM Journal on Control, Vol. 10, (1972), 93-98. 18. J.B. Rosen, The Gradient Projection Method for Nonlinear Programming. Part 1. Linear Constraints, Journal of Industrial and Applied Mathematics Society, Vol 8, (1960), 181-217. 19. J.B. Rosen, The Gradient Projection Method for Nonlinear Programming. Part 2. Nonolinear Constraints, Journal of Industrial and Applied Mathematics Society, Vol.9, (1961), 514-532. 20. J. Abadie and J. Carpentier, Generalization of the Wolfe Reduced Gradient for the Case of Nonlinear Constraints, Optimization, edited by R. Fletcher, Academic Press, New York, New York, ( 1969). 21. J. Abadie and J. Guigou, Numerical Experiments with the GRG Method, Integer and Nonlinear Programming, edited by J. Abadie, North-Holland, Amsterdam, (1970). 22. D. Gabay and D.G. Luenberger, Efficiently Converging Minimization Methods Based on the Reduced Gradients, SIAM Journal on Control and Optimization, Vol. 14, (1976), 42-61. 23. D.G Luenberger, Introduction to Linear and Nonlinear Programming, AddisonWesley, Reading, MA, (1973). 24. P. Wolfe, Methods for Nonlinear Constraints, Nonlinear Programming, edited by J. Abadie, North-Holland, Amsterdam, (1967).