The Approximate Minimization of Functionals

THE APPROXIMATE MINIMIZATION OF FUNCTIONALS THE APPROXIMATE MINIMIZATION OF FUNCTIONALS JAMES W. DANIEL Computer Sc...

Author: James W. Daniel

32 downloads 767 Views 4MB Size Report

This content was uploaded by our users and we assume good faith they have the permission to share this book. If you own the copyright to this book and it is wrongfully on our website, we offer a simple DMCA procedure to remove your content from our site. Start by pressing the button below!

Report copyright / DMCA form

DOWNLOAD PDF

THE APPROXIMATE MINIMIZATION

OF FUNCTIONALS

THE APPROXIMATE MINIMIZATION

OF FUNCTIONALS

JAMES W. DANIEL Computer Sciences Department University of Wisconsin

PRENTICE-HALL, INC.

ENGLEWOOD CLIFFS, N. J.

PREFACE

This material was prepared for use as a text in support of my lectures at the 1969 session on "Minimization problems" of l'Ecole d'Ete Analyse Numerique

held at Le Breau-sans-Nappe, France, and sponsored by the Commissariat a l'Energie Atomique, Gaz de France, and l'Electricite de France; I gratefully acknowledge the financial support of those organizations and their cooperation in reproducing this text. Of course those ideas in this text which are primarily the author's did not come to light only during the period of writing; therefore I must also acknowledge several other sources of support in recent years, particularly the National Science Foundation, the- Office of Naval Research, the Mathematics Research Center at the University of Wisconsin, and the Computer Sciences Department at the University of Wisconsin, Madison. It is a great pleasure fpr me to acknowledge the invaluable coopera-

tion of my three assistants at I'Ecole d'Ete, Mssrs. Chavent, Yvon, and Tremolieres, and the students there, especially Mr. Lascaux, whose questions, comments, and suggestions were of great help to me. This book represents an attempt at a general presentation of problems

involved in the approximate minimization of functionals. As such it treats in the first chapter the general area of variational problems and the question of the existence of solutions, in the next two chapters the question of the convergence of approximate solutions of discretized variational problems in general and in specific important cases, in the next two chapters the theory of gradient methods of minimization in general spaces, and in the final four chapters practical computational methods of minimization in ER'. Although a significant amount of material is presented concerning constrained minimization problems, the primary emphasis of this text is on unconstrained problems, particularly in the last four chapters; this is not a text on mathe-

Vill

PREFACE

matical programming. In the last four chapters on "practical" methods, detailed computational algorithms or computer programs are not presented. The rather "theoretical" presentation of this "practical" material reflects my feeling that analysis and understanding of an algorithm, even under specialized

hypotheses, is all-important in the creation of new methods; whenever pos-

sible, however, the references include sources of computer programs or detailed computational descriptions of the implementations of algorithms. The references presented here give a reasonably thorough coverage of the most important English language literature on minimization methods, although no claim is made of completeness. Although many of the important results of Russian papers can also be found in English, the largest gaps in the references are with respect to the very extensive literature in Russian. I hope to correct this shortcoming at some future time. JAMES W. DANIEL

Madison, Wisconsin

CONTENTS

1

VARIATIONAL PROBLEMS IN AN ABSTRACT SETTING

I

I.I. Introduction

1

1.2. Typical Problems 1.3. Basic Functional Analysis 1.4. General Functional Analysis for Minimization Problems 1.5. The Role of Convexity 1.6. Minimizing Sequences

2

THEORY OF DISCRETIZATION

3

7 13

17

28

2.1. Introduction 2.2. Constrained Minimization 2.3. Unconstrained Minimization 2.4. Remarks on Operator Equations

3

3

EXAMPLES OF DISCRETIZATION

28 28 31

34

37

3.1. Introduction 3.2. Regularization 3.3. A Numerical Method for Optimal Control Problems 3.4. Chebyshev Solution of Differential Equations 3.5. Calculus of Variations 3.6. Two-Point Boundary Value Problems 3.7. The Ritz Method ix

37

37 41

52 55

59 65

X

CONTENTS

4 GENERAL BANACH-SPACE METHODS OF GRADIENT TYPE

70

4.1. Introduction 4.2. Criticizing Sequences and Convergence in General 4.3. Glgbal Minimum along the Line 4.4. First Local Minimum along the Line: Positive Weight Thereon 4.5. A Simple Interval along the Line 4.6. A Range Function along the Line 4.7. Search Methods along the Line 4.8. Specialization to Steepest Descent 4.9. Step-Size Algorithms for Constrained Problems 4.10. Direction Algorithms for Constrained Problems 4.11. Other Methods for Constrained Problems

70 71

76 77

80 84 89 93 95 105

110

5 CONJUGATE-GRADIENT AND SIMILAR METHODS IN HILBERT SPACE

5.1. Introduction 5.2. Conjugate Directions for Quadratic Functionals 5.3. Conjugate Gradients for Quadratic Functionals 5.4. Conjugate Gradients as an Optimal Process 5.5. The Projected-Gradient Viewpoint 5.6. Conjugate Gradients for General Functionals 5.7. Local-Convergence Rates 5.8. Computational Modifications

6

GRADIENT METHODS IN IR'

6.1. Introduction 6.2. Convergence of x,+ i - x, to zero 6.3. The Limit Set of {x,} 6.4. Improved Convergence Results 6.5. Constrained Problems 6.6. Minimization along the Line

114 114 114 117 119 122 125

127

136

142 142 142 145

150 154 154

7 VARIABLE-METRIC GRADIENT METHODS IN IR'

7.1. Introduction 7.2. Variable-Metric Directions 7.3. Exact Methods for Quadratics 7.4. Some Particular Methods 7.5. Constrained Problems

159 159 159 165 168

178

CONTENTS A

8 OPERATOR-EQUATION METHODS

8.1. Introduction 8.2. Newton and Newton-like Methods 8.3. Generalized Linear Iterations 8.4. Least-Squares Problems

9

AVOIDING CALCULATION OF DERIVATIVES

9.1, Introduction 9.2. Modifying Davidon's First Method 9.3. Modifying Newton's Method 9.4. Modifying the Gauss-Newton Method 9.5. Modifying the Gauss-Newton-Gradient Compromise 9.6. Methods Ignoring Derivatives

180 180 181

185

190

194 194

194 197

200 201

203

EPILOGUE

211

REFERENCES

213

INDEX

225

1

VARIATIONAL PROBLEMS IN AN ABSTRACT SETTING

1.1. INTRODUCTION

Many problems of pure and applied mathematics either arise or can be formulated as variational problems, that is, as problems of locating a minimizes

ing point for some (nonlinear) real-valued functional over a certain set. Such a setting is often beneficial from the analytic viewpoint of determining the existence and uniqueness of such points; however, we shall primarily emphasize here the computational aspects of this approach, that is, how we can compute a solution to the problem by actually minimizing the appropriate functional. In some cases, in which the problem was originally formulated as a variational problem (e.g., calculus-of-variations problems), this may well be one of the best computational approaches, rather than considering the problem in some equivalent form (e.g., the Euler-Lagrange differential equation). In other cases, in which the problem is artificially converted into a variational one, this approach provides new computational methods for consideration. General references: Kantorovich-Krylov (1958), Courant-Hilbert (1953), Kantorovich (1948). 1.2. TYPICAL PROBLEMS

The variety of minimization proI ems is indeed immense; we consider some typical problems. A large class is that of optimal-control problems, which arise so often in the highly technological industries. Here one seeks to minimize a cost functional

f(x, u) = f c[t, x(t), u(t)] dt u

1

2

VARIATIONAL PROBLEMS IN AN ABSTRACT SETTING

SEC.

1. 2

over the collection of points (functions) x, u satisfying constraints such as x - d1 = s[t, x(t), u(t)) X(O) E X, X(t p) E Xp

x(t) E X(t) U(1) E U(t)

where X Xp, X(t), and U(t) are certain specified sets and set functions. Such problems arise, for example, in determining the minimal time or minimal expenditure for striking a given missile target.

As a special case with s(t, x, u) - u, we have a basic problem of the calculus of variations, in which we seek to minimize

f(x) = f c[t, x(t), *(t)] dt 0

subject to some boundary conditions on x(t), such as x(0) = x(1) = 0

Many problems of applied mathematics are of this latter form if we allow the variable t to represent a vector of dimension higher than one and x to represent the vector of first partial. derivatives of x with respect to those variables. The classical problem of finding the height x(t,> t2) of a surface of minimal area stretched across a closed space curve C = c(t t2) for (t t2) E G, a simple closed curve in the plane, for example, is most naturally described mathematically as the problem of finding x(t t2) to minimize

f rl

+

(d)2

+

(dX)2] 112

)

dt, dt2

(G = region enclosed by G)

over the set of x(t t2) satisfying

x(t t2) = c(t t2) for (t t2) E G A discrete analogue of the continuous optimal-control problem described above is the mathematical programming problem in which one seeks to minimize some function of a finite number of real variables subject to finitely many constraints; these problems arise, for example, very commonly in the petroleum industry. As we shall later see, it is also often solved as a discretized approximation to a continuous optimal-control problem. Many problems in data approximation ultimately are of this form also and provide perhaps one of the most common forms of such problems. Where one has

SEC. 1. 3

VARIATIONAL PROBLEMS IN AN ABSTRACT SETTING

3

some data y, depending on a parameter t measured at certain points t;, i = 1, 2, . . . , N, it is desired to approximate the formula generating the data by some expression

y(t) = g(t; a) where the choice of the parameter a determines the expression. We can do this by picking a so as to minimize some norm of the vector in E^' with components

Y,-g(ti;a). Essentially, this technique has been applied lately in a somewhat novel fashion for the solution of such problems as differential or integral equations. For example, one seeks to solve the differential equation

Du =f in

C,

where C is some domain. If q

u = 0 on dC = boundary of C . .

., 97,Y are some functions satisfying the boun-

dary condition on dC, we try to choose numbers a ... , a,,, to minimize some norm of the vector in EM with components given by

[Du](t) -f(t),

i= 1, ... , M

where {t,) is some grid of points over the domain C; alip, + ... + axqv is then taken as an approximate solution to the differential equation. This represents only a few of the types of variational problems commonly of practical interest at pcesent; rather than proceed to list more types, we shall develop tools to allow us to consider such problems in a more general setting.

General references: Hestenes (1966), Pontryagin et al. (1962), Balakrishnan-Neustadt (1964), Courant-Hilbert (1953), Morrey (1966), Akhiezer

(1962), Abadie (1967), Fiacco-McGormick (1968), Hadley (1964), Mangasarian (1969), Zangwill (1969), Lorentz (1966), Rosen-Meyer (1967), Rabinowitz (1968), Mikhlin-Smolitskiy (1967). 1.3. BASIC FUNCTIONAL ANALYSIS

We shall consider variational problems which can be stated as problems in a general function-space setting; for simplicity, we restrict ourselves to problems in Banach spaces. We recall that a real or complex Banach space E (hereafter assumed to be real unless otherwise stated) is a vector space over the real or complex numbers which is complete in the topology generated by the norm I

I defined on E. A Hilbert space is a Banach space in which the norm is given via an inner product < -, -), that is, IIxl I = `x, x>'-z. I

-

I

4

SEC. 1. 3

VARIATIONAL PROBLEMS IN AN ABSTRACT SETTING

The Banach space E* of bounded (real-valued) linear functionals on E is called the dual or adjoins space; it generates another topology for E, the weak topology, for which a basis at the origin 0 is given by

(0,0 =(x;If,(x)I

in this topology, that is, weak

Convergence to a point x of a sequence convergence, is denoted by X.

x

Convergence in the norm topology, for which lim I I xa - x I I = 0, is denoted by

Recall that x, - x if and only if lim I f(x)

0 for each fixed f e E*.

We can consider E as a subset of E** since, for each fixed x in E, we can define a bounded linear functional over E* via

x(f) =-f(x) for alif in E5 If E = E** under this identification, E is said to be reflexive. Recall that a compact (countably compact) set is one such that every open cover (countable open cover) has a finite subcover, and a sequentially compact set C is one such

that every sequence in C has a subsequence converging to a point of C. Every bounded, weakly closed subset of reflexive space is weakly sequentially

compact; the important theorem of Alaoglu moreover says that every such set is weakly compact. Since every continuous linear functional f on a Hilbert space E is given by an inner product-that is, there exists a unique y in E withf(x) = <x, y> for all x-we can write E = E*, which in particular implies that E is reflexive. Given a finite. set of points (x ... , in E, a convex combination of the points is a point x such that

x=Ea,x;, t-1

1.

a, =1,

a,>0,

i= 1,2,...,n

I

A set C in E is said to be convex if every convex combination of every finite subset of C is itself an element of C. The following important result is known:

If x - x, then for each n = 1, 2, ... , there is a convex combination y of .the points x ... , x such that y. x. This is implied by the fact that a convex set which is closed in the norm topology must be closed in the weak topology; a weakly closed set is always norm-closed of course since a norm convergent sequence is weakly convergent.

SEC. 1. 3

VARIATIONAL PROBLEMS IN AN ABSTRACT SETTING

5

EXERCISE. Prove that the fact that a convex set is norm-closed if and only if it is weakly closed implies the fact that for every weakly convergent sequence that norm-conof convex combinations of there is a sequence verges to the same limit.

In the sequel we shall be considering (real-valued) nonlinear functionals f on a space E and various continuity properties of those functionals. Recall

that a functional f is lower semicontinuous in a topology if and only if (x; f(x) > a) is open for each real a. From this we deduce that if f is weakly. lower semicontinuous-that is lower semicontinuous in the weak topology-

then from x, -, x it follows that f(x) < lim inf f(x.) A similar statement holds for norm lower semicontinuity, of course, under the hypothesis that x -. x. Since most of our arguments depend only on the above sequential property, we define a functional to be sequentially lower semicontinuous (for some topology) if and only if the convergence of (xe) to x impliesf(x) < lim inf Sequential continuity properties are generally weaker than the usual continuity properties; the two kinds of properties are equivalent in the norm topology since the norm topology has a countable basis at each point. EXERCISE. Prove that a functional on a normed space is norm sequentially lower semicontinuous if and only if it is norm lower semicontinuous.

We shall often consider linear mappings from one Banach space E, into another E2; If A is such a mapping, then it is continuous (between the norm topologies) if and only if it is bounded, that is, II AxIIE,< IIAI1=sup IIAxIIE,=sup IxIIE,=I xeE,

0o

IIXIIE,

The Banach space of all such bounded linear operators is denoted by L(E E2); thus E* = L(E, IR). If A E L(E E2) and f2 E E2, we can define

f, EE*by

f1(x) =f2(Ax) The mapping of f2 into f, is linear and is denoted by At, the adjoint of A. If E is a Hilbert space, add if A E L(E, E) and At = A, then A is called self-adjoint; in this case we have _ (x, A*y> = <x, Ay> for all

x,yEE.

If E is a real (complex) Banach space and A E L(E, E), then a(A), the spectrum of A, is defined via

6

SEC. 1.3

VARIATIONAL PROBLEMS IN AN ABSTRACT SETTING

a(A) = [A E IR (Q); Ax - Al - A is not one to one, L[R(A,), E]] Ax' or R(A,) # E, or

where R(A,) denotes the range of A, and S denotes the norm closure of any set S. If Eis a Hilbert space and A is self-adjoint, then all points irl a(A) are real and lie in the interval [m, M] where

m = IEf

and

<x, z>

M = sup E

<X, >

Then we can write

ml
m<x, x> < <x, Axy < M<x, x>

It follows that A 11 = max (I m I, I MI). If m > 0 (> 0), A is called positivedefrnite (semidefinite) or coercive (semicoercive). Examples of simple spaces 1. E = IR", dimensional real Euclidean space with any norm whatsoever since all norms on IR" are topologically equivalent. Some standard

norms are, for x = (x,, ... , x"): IIxII",ve =

maxlx,l

and

1SiSi

IIxII",, -

lEIx1I,]1/,

in the second equation particularly for p = 1 or p = 2, in which latter case E is a Hilbert space. 2. E = C[a, b] is the real Banach space of real-valued continuous functions defined on [a, b] c iR with the norm I If I I- = max I f(t) 1; norm converla bl

gence here is equivalent to the usual uniform convergence. E is not reflexive.

3. E = L,(a, b), the Banach space of real-valued pth-power Lebesgue integrable (if p = oo, then essentially bounded instead) functions defined on (a, b) c IR, with norm III III, = [J: I f(t)11, dt]

'P

or

IIxII-

esssuplf(t)I (a, b)

For I < p < co, L,(a, b) is reflexive; L,(a, b) is a Hilbert space. 4. E = W;(a, b), the Banach space of all real-valued functions defined on [a, b], having absolutely continuous derivatives of orders less than or equal to (m - 1) on [a, b], and having the mth derivative (and of course all

SEC.

VARIATIONAL PROBLEMS IN AN ABSTRACT SETTING

1. 4

7

lower-order derivatives) in LD(a, b). Many equivalent norms are possible for this Sobolev space, one of which is I/P

IIf1I =

{ 1-0

Ifs"(a)1°+ JlfT(t)I°dt o

)

Ifp = 2, WZ is a Hilbert space with the obvious inner product. (Such spaces can be defined in higher dimensions, although the concept of the derivatives of absolutely continuous functions must be replaced by that of generalized or distributional derivatives, and the norm should be the sum of the L, norms of all the derivatives of order i for 0 < i < m.) For clarity we consider the Hilbert space WZ(0, 1). Here we have I If I

I = [=]12 ° {If(o)12 + f o If'(t)12 dt)12

Thus, if f. -f in WZ, then f. converges to f' in L2(0, 1) and f converges to f uniformly, that is, in C[0, 1]. Even weak convergence in W21 implies a

strong type of convergence. For, if f -f in W;,, then f(0) converges to f(O) and f;, converges to f' weakly in L2(0, 1). From this it follows that (f }, is equicontinuous and uniformly bounded, which implies that f,, converges to f uniformly on [0, I], that is, it converges in C[0, 1].

EXERCISE. Provide the details proving that weak convergence in WZ implies, norm convergence in C[0, 1].

Since many algorithms yield sequences only weakly convergent to some desired solution, it is important to realize, as this example indicates, that weak convergence can sometimes be quite "strong."

General references: Dunford-Schwartz (1962), Kantorovich (1948),. Kantorovich-Akilov (1964), Morrey (1966), Taylor (1961). 1.4. GENERAL FUNCTIONAL ANALYSIS FOR MINIMIZATION PROBLEMS

Perhaps the fundamental result concerning the existence of a solution for a variational problem is the well-known general fact that a sequentially lower semicontinuous real-valued function defined on a countably compact set must achieve its infimum there; the important special case for our use is the following.

THEOREM 1.4.1. Let f be a weakly sequentially lower semicontinuous functional defined on a weakly sequentially compact set C; then there exists

xQ in C such that f(xo) = inff(x) = min f(x). xEC

xEr

8

SEC.

VARIATIONAL PROBLEMS IN AN ABSTRAO SETTING

1. 4

Proof: Let m = inff(x); then there exists a sequence of points x E C, xEC m. Since C is weakly sequentially compact, there is an with xo E C and a subsequence x such that lim f(x,,) = m and x - xo. By the semicontinuity,

f(xo) < lim inf

m < f(xo)

Thus f(xo) = m. Q.E.D. COROLLARY 1.4.1. A weakly sequentially lower semicontinuous functional achieves its infimum on every convex, bounded, norm-closed subset of a reflexive space. Proof: From earlier statements we know that a convex, norm-closed set is weakly closed and that a weakly closed, bounded set is weakly sequentially compact in reflexive space; the corollary then follows immediately from the theorem. Q.E.D. EXERCISE. Prove that a lower semicontinuous functional achieves its infimum on each countably compact set.

Theorem 1.4.1, of course, directly applies only to constrained minimization problems, that is, where x is considered only in some-weakly sequentially

compact, and hence bounded, set C; our maim concern here is with unconstrained problems, but for proving existence of a solution it is often possible and wise to reduce them to constrained problems. (An interesting sidelight is that for computational purposes one often tries to reduce the constrained

problem to an unconstrained one.) Here one tries to find a bound-say B-on the norm of the solution if it exists, and then consider the problem over S. = {x; I I x I I S B). This is possible if, for example, the functional satisfies a T-property in a reflexive space.

DEFINITION 1.4.1. A functional f is said to satisfy a T-property (at xo

with T = To) if there exists an xo and To > 0 such that I Ix implies f(x) > f(xo).

-- xo I1

To

THEOREM 1.4.2. A weakly sequentially lower semicontinuous functional f satisfying a T-property in a reflexive space E achieves its infimum over E.

Proof: Let S(xo; T0) be the norm-closed sphere of radius To about xo; then S(xo; To) is weakly sequentially compact. Since f(xo)
f(x) < f(xo) <

inf

x f S(xo; T,)

f(x)

VARIATIONAL PROBLEMS IN AN ABSTRACT SETTING

SEC. 1. 4

9

Thus

inf

sES(x,;T,)

f(x) = inf f(x) xEE

By Theorem 1.4.1 there exists x' E S(x0; T0) achieving the infimum over S(x0; T0) and hence over E. Q.E.D. Satisfaction of a T-property is usually guaranteed by a growth property

of the functional; for example, if lim f(x) = +c, then f certainly satisfies IIxII-

a T-property. In the next section we shall see that various convexity assumptions can lead to such growth properties as well as the needed semicontinuity property. One important application of and reason for studying variational problems lies in the study of the solution of operator equations; this arises from the generalization of the calculus result that if a differentiable real-valued function of several real values achieves a local extremum at an interior point x0 of a set, then the gradient of the functional vanishes at x0. For this and other reasons it is necessary to consider the concept of the derivative of a general operator. Let Finap a Banach space E, into a Banach space E2. Although a weaker concept of the derivative-namely, the Gateaux derivative or differential-is often of interest, for our purposes we need only consider the stronger concept.

DEFINITION 1.4.2. If at a point x in E there exists a bounded linear operator denoted F; mapping h in E, into F' h in E2-that is, FE L(E E2)and satisfying lim IIhII--0

I

I F(x + h) - F(x) - FXh I = 0 I

then Fx is called the (Frechet) derivative of Fat x.

It is simple to prove that the properties of the Frechet derivative are analogous to those statements in the standard calculus. PROPOSITION 1.4.1. If the Frechet derivative f z of a real-valued functional,

f exists at each point of a convex set C, then for x e C and x + h E C, the Lagrange formula

f(x + h) -f(x) =f=+,hh is valid for some t c (0, 1). If the operator F from E, into E2 has a Frechet derivative F; at each point x in a convex set C, then for x E C and x + h E C, the Lipschitz condition

10

SEC. 1. 4

VARIATIONAL PROBLEMS IN AN ABSTRACT SETTING

I

I F(x + h) - F(x)II < I I F''=+,h I

I

I

IhI

I

is valid for some t E (0, 1). If F, exists at some point x, then F is norm continuous at x. If F maps E, into E2 and G maps E2 into E, with F and G Frechet-differentiable, then GF is Frechet-differentiable and (GF)', = G'(,,)FX.

The usual rules (F, + F2)' = F, + F'2 and (cF)' = cF' are also valid under the obvious assumptions. EXERCISE. Prove the statements in Proposition 1.4.1.

If f is a functional over E, then f' is a bounded linear operator into IR, that is, f, (=- E*; in such a casef. is called the gradient off at x and is denoted

I; = Vf(x) = (Vf)(x) Thus we can write

f(x + h) =f(x) + + o(h)

as

I Ih I I

)0

where

= g(h) for g E E*,

hEE

Since Vf = f' is an operator from E into E* we can define its derivative (the second derivative of f) and so on through derivatives of arbitrary order; f' , for example, then maps elements h of E into fxh in E*. It is easy to find conditions in terms of derivatives which will imply the

growth and continuity properties important for variational problems. Although we shall return to this later with a stronger result (Theorem 1.5.1), we give a typical theorem below. THEOREM 1.4.3. Suppose the functional f has first and second derivatives

at all points of a convex set C, and that for x in C, f" satisfies > 0 for all h in E. Then f is weakly sequentially lower semicontinuous in C.

Proof. Let x E C, x,, --- x, x E C. Then by the Lagrange property,

f(x.) = f(x) + <x --- x, Vf(x +

t E (0, 1)

with h = x - x. Then f(x.) =f(x) + <x, - x, Vf(x)> + By the Lagrange property for the function

g(r) -
r e [0, 1 ]

SEC.

VARIATIONAL PROBLEMS IN AN ABSTRACT SETTING

1. 4

11

it is easy to see that

f(x,.) =f(x) + `x - x, Vf(x)> +
X+r,h.thR>

for some point (1, r) in (0,, 1) x (0, 1). Since x - x, > 0, and Q.E.D. Vf(x) E E*, this implies f(x) < lim We can deduce growth conditions in a similar manner. THEOREM 1.4.4. Let f have first and second derivatives throughoui.

E and let f X satisfy at each x E E the condition > I I h I Ig(I I h I I) where g is a nonnegative continuous function for t > 0 tending to infinity

with t. Then lim f(x) _ +oc I;x i -

Proof: Let s(t) = <x, Vf(tx)> for 0 + f ( <x, f;xx> dt

Therefore, <x, Vf(x)> >- I I x I I I Vf(0) I I+ I I

I x I Ig(I I x 11). Since

f(x) = f(0) + f <x, Vf(tx)> dt 0

= f(0) + f o t dt

= f(0) - IIxIl IIVf(0)II + IIxII f ' g(tIIxll)dt >_ f(0) + 11X11 I- I I

VP I I + f " g(:I I x 11) dt, 0

and since limg(t) = oo, we then conclude that lim f g(t1Ix11) dt = 011. OxU-.

0

Q.E.D.

EXERCISE. Consider

f(x) = f [ [1(t)]2 + c[t, x(t)1} dt,

1(1) = d x(t)

for x in c

W21

(See Example 3, Section 1.3.) We let 11 x112 = f o [x(1)12 dt. Suppose

12

VARIATIONAL PROBLEMS IN AN ABSTRACT SETTING 2

cum(t, u)

=

for all t, u. Prove that lim f(x)

SEC.

1. 4

t, u ) > y > -7c2

dug

+oo. Hint: Use the fact that for all

x E E we have f0 [z(t)]2 dt > nz f o [x(t)]z dt to prove that Z e, for small enough e > 0.

As we stated earlier, we wish to see how minimization problems are related to the solution of operator equations. The fundamental statement is as follows. THEOREM 1.4.5. Let the functional f defined on a set Sin a Banach space

E be minimized at a point xo E S, with x0 an interior point in the norm topology. If f has a derivative Vf(xo) at x0, then Vf(xo) = 0. Proof: For any fixed h E E, f(xo + th) is a real-valued function of the real variable t on some open interval containing zero, having a derivative at zero, and being minimized there. Therefore,

f(xo + th)

0=

d

G-o

=

Since h E E was arbitrary, = 0 for all h c= E and, therefore, Vf(xo) = 0. Q.E.D. Similarly, minimization problems over certain types of sets can be related to generalized eigenvalue problems via Lagrange multiplier theorems. As an example, we merely state one type of such result.

PROPOSITION 1.4.2. Let f and g be functionals on a Banach space E, both of which are differentiable at the point xo which is local minimum off on the set [x; g(x) = g(x0)}. If I I Vg(xo) I I > 0, then there exists a real number A such that Vf(x(,) = t,Vg(x,,). Proposition 1.4.2 gives a necessary condition for a point to give a local minimum over a set defined by one real equality constraint. Much of the theory of optimal control and mathematical programming involves determining necessary conditions and also additional sufficient conditions for a point to be an extremizing point over sets with more complicated constraints. Another simple necessary condition for xo to minimize a differentiable f(x) over a convex set C is that <x - x0, Vf(x,,)> > 0 for all x in C. The amount of literature in this extremely complex area is vast. Since our main concern here is with the theory of and methods for approximate solutions primarily

to unconstrained problems, we shall pursue this no further. General references: Levitin-Poljak (1966a), Vainberg (1964).

VARIATIONAL PROBLEMS IN AN ABSTRACT SETTING

SEC. 1. 5

13

1.5. THE ROLE OF CONVEXITY

We saw already in Corollary 1.4.1 that convexity of the set over which we seek a minimum can be of importance in proving existence of a solution; various kinds of convexity assumptions about the functional to be minimized also play a vital role. We shall consider next some examples of how these properties are related to continuity properties of the functional and to the nature of the points solving the minimization problem. Although more general definitions are possible, for the following let C be a convex subset of a Banach space E. DEFINITION 1.5.1. A functional f is convex (strictly convex) on C if and

only if for each x x2 in C and A in (0, 1),

f[Ax + (1 - x)x2l < (<) Af(x0 + (1 - A)f(x2) EXERCISE. When is Definition 1.5.1 equivalent to the following?

f(x'

x2)

2 Ax') + 2 f(x2)

2 This definition is the natural extension of the concept of a convex realvalued function of a real variable to that of a functional on a general space. The usual characterizations for differentiable convex functionals are valid here also; we state them without proof. PROPOSITION 1.5.1. A Frechet-differentiable

functional f is convex

(strictly convex) on a norm-open convex set C if and only if

f(x2) -f(x1) >_ (>) <x2 for each x x2 in C, x,

x., Vf(x1)>

x2. Another equivalent condition is that

<X2 - x1,

Vf(x2) - Vf(x1)> >_ (>) 0

for x x2 in C, x, $ x2. If f x exists in the convex open set C, then f is convex if and only if 0 for all h in Land x in C. If we combine the last statement in Proposition 1.5.1 with Theorem 1.4.3, we immediately see that a twice-differentiable convex functional on an open convex set C is weakly sequentially lower semicontinuous on C. The hypotheses leading to this conclusion can be weakened considerably, leading to a generalized concept of convexity which, as we shall see, is in a certain sense the natural hypothesis for minimization problems. First we observe that if f is convex on E, then fx;f(x) < a) is convex for each fixed real a; we take this as a defining property.

14

SEC. 1. 5

VARIATIONAL PROBLEMS IN AN ABSTRACT SETTING

DEFINITION 1.5.2. A functional f is quasi-convex on the convex set C

if and only if {x; f(x) < a, x E C) is convex for each fixed real at. This equivalent to f[Ax1

+ (1 - A)x2] < max

is

[f(x,),f(x2))

in C.

for

EXERCISE. Prove that the two statements in Definition 1.5.2 are equivalent.

EXERCISE. When is Definition 1.5.2 equivalent to the following?

f(x, + x2) 2

1

max (f(x1),f(x2)]

Thus a convex functional is quasi-convex. So also is the product of two

nonnegative convex functions [Tremolieres (1969)] or the quotient of a convex, nonpositive function by a convex, positive function [Mangasarian (1969)].

EXERCISE. Prove that a convex functional is quasi-convex.

We can now prove a weakened form of our earlier statement concerning the weak lower semicontinuity of a twice-differentiable convex functional on an open convex set. THEOREM 1.5.1. Let f be a quasi-convex, norm sequentially lower semicontinuous functional defined on a convex set C. Then f is weakly sequentially lower semicontinuous on C.

Proof: Let x x, x E C, x -E C, and let x,, be a subsequence such that lim f(x,,) = lim inf f(x ). According to a statement in Section 1.3, ,_a there exist for each I convex combinations y,,, of the points x, for i > 1 norm converging to x, that is,

and

limIly,,,-xII=0 r-b Then f(x) < lim inf f(y,,,). But by the quasi-convexity one can see that f(y,,,) < max f(x,,,). Thus tshr

VARIATIONAL PROBLEMS IN AN ABSTRACT SETTING

SEC. 1. 5

15

f(x) < lim inf f(y,,,) < sup Letting 1 tend to infinity we find

f(x) < lim sup f(x.,) = lim f(xN,) = lim inf f(x.) Q.E.D.

EXERCISE. Prove that a quasi-convex, norm lower semicontinuous functional defined on a weakly closed convex set C is weakly lower semicon-

tinuous on C by considering (x, f(x) < a) for real a.

Thus a functional as described by Theorem 1.5.1 achieves its minimum on each weakly sequentially compact set; it is simple to describe hypotheses under which the minimizing point is unique. DEFINITION 1.5.3. A functional f is called strongly quasi-convex on a convex set C if and only if for x, ;i,,- x2,

f[,.x, + (1 -- ))x2] < max [f(x,),f(x2)] for a, in (0, 1). EXERCISE. When is Definition 1.5.3 equivalent to the following?

j(x, 2

x2)

<max (f(x,),f(x2)}

From strong quasi-convexity follows a uniqueness theorem:

THEOREM 1.5.2. A strongly quasi-convex functional can achieve its minimum over a convex set C at no more than one point.

Proof: If f(x,) = f(x2) = inf f(x), then --x, + 3x2 E C, but AEC

f(-Fx, + x2) < max [ f(x,), f(x2)) = inff(x) AEC

which is a contradiction. Q.E.D. In a sense, strong quasi-convexity is the natural assumption for minimization problems. More precisely, it is easy to show that iffis a norm-continuous functional on a reflexive space E, then it achieves its minimum at a unique point over each norm-closed, convex, bounded set if and only iff is strongly quasi-convex [Poljak (1966)]. EXERCISE. Prove that a norm-continuous functional (defined in a reflexive space achieves its minimum at a unique point over each norm-closed, convex, bounded set if and only if j is strongly quasi-convex.

16

VARIATIONAL PROBLEMS IN AN ABSTRACT SETTING

SEC. 1. 5

A related but weaker concept is that of strict quasi-convexity, in which the

inequality in Definition 1.5.3 is required to hold only if f(x,) :P-4 f(x2); strongly quasi-convex functionals and convex functionals are strictly quasiconvex, but a strictly quasi-convex functional need not even be quasi-convex. EXERCISE. Prove that both strongly quasi-convex functionals and convex functionals are strictly quasi-convex; find a strictly quasi-convex functional that is not quasi-convex.

The importance of strict quasi-convexity can be seen from the following theorem.

THEOREM 1.5.3. Let x0 yield a local minimum for the strictly quasiconvex functional f over the convex set C. Then xo yields the global minimum.

Proof: If there exists an x, in C with f(x,) < f(xa), then f(2x(, + (1 - 2)x1) < f(xo) for all 2 in (0, 1) and A arbitrarily close to 1. But then xo cannot yield a local minimum. Q.E.D.

Convexity assumptions often appear in somewhat different form than discussed so far. For example, in the calculus-of-variations problem of minimizing

f(x) = f c[t, x(t), z(t)J dt with certain boundary conditions, a standard assumption [Akhiezer (1962)] is that c(t, x, y) is differentiable and convex in the variable y for fixed (t, x), a condition to which we shall return in Section 3.5; under some additional continuity and growth hypotheses on c(t, x, y), one can deduce the weak lower semicontinuity of f in the appropriate space. This is really a special case of the following theorem generalizing Theorem 1.5.1 [Nashed (1967)]. THEOREM 1.5.4. Let the functional c(x, y) on the space E x E for a Banach space E be norm lower semicontinuous and quasi-convex in y for each fixed x, and weakly sequentially lower semicontinuous in x on bounded sets, uniformly so for y in a bounded set. Then f(x) = c(x, x) is weakly sequentially lower semicontinuous.

Proof: Let x x'. For any subsequence {x,,,}, if c(x', x,,) converges to r, then for all A > r we consider Cz = {x; c(x', x) < Al which by the norm lower semicontinuity and quasi-convexity of c(x, y) in y must be norm-closed and convex, and hence weakly closed. Thus x' E Cx for all A > r, so

SEC.

VARIATIONAL PROBLEMS IN AN ABSTRAC' SETTING

1. 6

17

c(x', x') < lim inf c(x', x.) Now, .1'(x')

:_

c(x', x') = c(x', x') - c(x'. x,) + c(x', xJ - c(x., xj + f(xj

From the preceding inequality and the uniform (in y) weak sequential lower semicontinuity of c(x, y) in x, it follows that

f(x') < lira inf Q.E.D.

Convexity properties are also of vital importance from a more computational viewpoint; we consider this aspect. next.

General references: Levitin-Poijak (1966a, b), Mangasarian (1969), Ponstein (1967). 1.6. MINIMIZING SEQUENCES

.In computing a sequence of approximate solutions x. to a minimization problem, one of our goals nearly always is to be sure that f(x,) is converging to the minimum value off; thus we desire {x.} to be a minimizing sequence. follow directly from Often other conclusions, such as convergence of is a minimizing sequence of elements from the this fact. For example, if weakly sequentially compact set C for the weakly sequentially lower semicontinuous f over C, then for any weakly convergent subsequence (at least one of which of course exists) {x,,,} converging to an x', we have

f(x) < lim inf f(x,,,) = inf f(x) xeC

that is, x' minimizes f over C; if the minimizing point is unique, then weakly converges to it. The above argument is extremely common and is almost always the least we can say about convergence in some scheme of approximate minimization; often it is possible to say much more-for example, to deduce norm convergence. We indicate next some of the typical concepts that are of value in this respect; in most of the rest of this book we shall not state,all the various conclusions about stronger convergence in each approximation method which can be drawn from the type of considerations to be presented now, but leave such synthesizing to the student. We now take a more general view. Since in many practical cases the approximate solutions can be guaranteed only approximately to satisfy the constraints on the problem, we introduce the concept of an approximate minimizing sequence (or generalized minimizing sequence [Levitin-Poljak (1966a, b), Poljak (1966))).

18

SEC. 1. 6

VARIATIONAL PROBLEMS IN AN ABSTRACT SETTING

DEFINITION 1.6.1. The sequence {xn} is an approximate minimizing sequence for the functional f over the set C if lim d(xn, C) = 0, where d(xn, C)

= inf I I x - x 11, and lim f(x) n-.°

XEC

xEC

PROPOSITION 1.6.I. If f is a weakly sequentially lower semicontinuous

functional on a norm neighborhood of the weakly sequentially compact is an approximate minimizing sequence, then all weak limit set C and points of [xn}, at least one of which exists, lie in C and minimize f over C. If f has a unique minimizing point x* in C, then xn

x*.

EXERCISE. Prove Proposition 1.6.1 above.

The goal of this section is to discover conditions under which the convergence is more useful-that is, so that {x,} itself converges or so that we have norm convergence. We saw in Section 1.5 that strong quasi-convexity is in a sense the patural hypothesis for minimization problems; a stronger form of it will allow us to deduce norm convergence of approximate minimizing sequences. DEFINITION 1.6.2. A quasi-convex functional is called uniformly quasi-

convex on a convex set C if and only if there is a real-valued, continuous,

monotone-increasing function 6(t) for t > 0 with 6(t) = 0 if and only if t = 0, such that

f(x 2 Y) S max (f(x),f(y)} - b(II x - y1D for all x, y in C. EXERCISE. Find a condition equivalent to the one in Definition 1.6.2 for f[2x + (1 - A)y], A in [0, 1].

THEOREM 1.6.1. Let C be a norm-closed, bounded, convex subset of a reflexive space E; f a weakly sequentially lower semicontinuous uniformly quasi-convex functional on a norm neighborhood of C; and an approximate minimizing sequence. Then x, x*, the unique point in C minimiz-

ingf. Proof. By Proposition 1.6.1, there exists at least one subsequence x,,, weakly converging to a minimizing point x'. But the minimizing point x' = x* clearly is unique, just as in Theorem 1.5.2, so x -, x*. Since f(xn) converges tof(x*), since (x + x*)/2 -- x*, and since f is weakly sequentially lower semicontinuous, for all c > 0 there is an N, such that n > N, implies

I f(xn) -f(x*) I < E

and

fx

*) > f(x*) - E

19

VARIATIONAL PROBLEMS IN AN ABSTRACT SETTING

SEC. 1. 6

By the uniform quasi-convyity, a(I l x

f (x*)} - f

x* I I) < max (f

X.

±

x*) < 2e for n

N,

and hence I I x. - x* I I -- 0. Q.E.D.

It is also possible to prove strong convergence recalts on the basis of properties of C, not f. DEFINrrION 1.6.3. A set C is called uniforrr;y convex if and only if there

exists a real-valued, continuous, monoton function 8(t) for t > 0 with 8(t) =0 if and only if t = 0, such that ., y c- C and I I z I I < S(I I x - y I ) imply (x + y)/2 + z E C. THEOREM 1.6.2. Let C be a no .n-closed, bounded, uniformly convex subset of a reflexive space E; f a v sequentially lower semicontinuous functional on a norm neighborhoc i of C with a unique point x* minimizing f on C; x* on the norm boundary )f C; and (xc} an approximate Minimizing sequence. Then x. -* x*. Proof.: By Proposition 1.6 L and the uniqueness of x*, we have x, -> x*. Since d(x,,, C) ---. 0, there i, a sequence x; E C with Jim I Ix. - x; I I = 0.

Suppose there exists e > ' and a subsequence (x;,} with I x', - x* I I > e for all i. By the uniform onvexity of C, if I I z I I < b(e), then I

x",+x*+zCC 2

for all i. Since

x',,+x* 2 -l- z -- x+ z and C i, weakly closed, x* + z E C for all z satisfying I z I I < b(e) > 0. Then c* is not a boundary point, yielding a contradiction. Thus x. (and henc also converges in norm to x*. Q.E.D. I

EXERCISE. Consider f(x) = f (t[x(t)]2 + V0171 dt on o W](0, 1), the 0 set of x in W2(0, 1) with x(0) = x(1) = 0 and with IIx112

f 1 [x(t)]2 dt. Let 0

n1/2t xw(t)

for 0
n-1/2 - n1,2+t 0

for

- K/

n
for

n
20

SEC. 1. 6

VARIATIONAL PROBLEMS IN AN ABSTRACT SETTING

is a weakly convergent but not norm-convergent minimizing Show that sequence. Is there a better norm to use for this problem?

Computationally one often desires to have some type of error bound or estimate of the rate of convergence; such bounds can be found for certain classes of problems. As an example, we state merely one simple theorem. THEOREM 1.6.3. Let C be a norm-closed, bounded, convex subset of a reflexive space E; f a twice-differentiable functional whose second derivative satisfies > 5 ( 1 1 h I I) for all x in a norm neighborhood of C where 6(t)

is a real-valued, continuous, monotone function for t > 0 with 6(t) = 0 if and only if t = 0; and an approximate minimizing sequence. Then x -- x*, the unique point minimizing f in C, and 60 1 X. - x* 11) <_ f(x.) - f(x*)

Proof. According to Proposition 1.5.1, f is convex. It is also uniformly quasi-convex, since the equations

+ x 2 y) > f(x 2 y) + \x

f(x) = f (X +

'

Vr(x 2 yII

y'

Vf(x

a(IIx 2 y11)

f(y) =f(x 2 y +

y

2

x) > f(x 2 y) - (x 2

2

y))

T6(11X2 yl1i

imply

(X + )

2f(x)+ 2f(y)- 28(11x2 y111

a stronger. condition (uniform convexity) than asserted. Thus by Theorem 1.6.1, this theorem is proved except for the error estimate. In the remarks following Proposition 1.4.2 we noted that we must have <x -x*, Vf(x*)>>0 for all x in C; otherwise, f decreases in the direction of x, a contradiction to the optimality of x*. Hence

f(x) - f(x*) > <x - x*, Vf(x*)> + 6(I I X_ x* 11) >_ a(I I x - x* I I) Q.E.D.

We shall have cause to make use of the type of estimate in the above proof later, especially when f, is a uniformly positive-definite self-adjoint operator in a Hilbert space, that is, 6(t) = mt2, m > 0, so that

SEC. 1. 6

VARIATIONAL PROBLEMS IN AN ABSTRACT SETTING

IIX. - X*II <

21

{f(x_) -f(x*)l1i: m

f(x*) to be In order for bounds of the form d(I 1 x, - x* 11) < f(x*), computable, we need a good computable lower bound-say indicate how this can so that we have 8(I I x. - x* 11) < f(x,) - m,; we now be done in some cases by complementary variational principles, which have been rather thoroughly studied for differential equations [Noble (1964, 1966)],

operator-theory extensions thereof [Rail (1966)], and error estimation for variational boundary-value problems [Shampine (1968)]. THEOREM 1.6.4. Let f be a real-valued nonlinear functional on a real Hilbert space H and let u minimize f on the norm-closed convex set S. For each w in some given set S* c H, let a self-adjoint linear operator P. be defined, satisfying > a,, for all h in H, with a, > 0. Suppose that f is twice continuously Frechet-differentiable on the set of points of the form lv + (1 - .l)w for v in S, w in S*, and A in [0, 1]; and that

>0 for all v in S, w in S*, and A in [0, 1]. For each w in S*, let yr in S be defined as the (unique) point in S minimizing 4 + , and let

f* be defined on S* as

f*(w) =f(w) + 1 If Ii is in S*, then f *(w) is maximized by w = u, and f *(u) = f(u). For all u in S and w in S*, we have the estimates

f(u) - f *(w). > $ f(u) -f*(w) >_ $

s-

Proof: Since u minimizes f over the convex set S, we have +

Then

<s - u, V&)> = <s - a, Pau + Vf(a) - P A>

=<s-u,Vf(u)>>0

for all s in u. Since the quadratic functional g(v) has go - Pa, a positivedefinite operator, a unique point va exists (similarly for v, for all w in S*) and is characterized by the condition <s - va, Vg(va)> > 0; therefore, va = u and hence f *(u) =f(4). To prove that u maximizes f *, we write, for w in S*,

letting v. = v and P. = P for notational ease,

22

SEC. 1. 6

VARIATIONAL PROBLEMS IN AN ABSTRACT SETTING

f *(u) - f *(w) = f(u) - f(w) - 4 - = +l - [f(w) - . + by adding and subtracting #. Since v minimizes # over S, and u is in S, the inner-product inequality characterizing v gives > 0. Hence

f*(u)-f*(w)>I+d(I)-d(w) +
d(u) - d(w) = + f o

t
w), u - w> dt

_ i

+ f0

t<[fro+u-n.-P.l(u-w),u-w>dt

> Inserting this we find f*(u)-f*(w)z4+

+ ZI-> 2 Iu - V I F Thus 11 maximizes f * over S*. To obtain the first error estimate, we merely write

f(u) -f*(w) zf(u) -f*(w) =f*(a) -f*(w) >_ i For the second, we write

f(u) - f *(w) zf(U) - f *(u) = f(u) - f(u) = + Jo t<[f,.+(! -, - Pl(u - u), u - a> dt

+ - Zf<Pa(u-u),u-u> by the assumption on f" - P and the necessary condition for u to minimize f over S, since u is in S.

1. 6

SEC.

VARIATIONAL PROBLEMS IN AN ABSTRACT SETTING

23

Remarks on Theorem 1.6.4.

1. The closedness and convexity of S are used only to guarantee the existence and uniqueness of v, for w in S and to deduce that v, = u; these properties can be guaranteed in other ways as well. 2. The differentiability hypotheses can be weakened easily; in particular,

f(v) - . need only be differentiably convex on S for each w in S*. 3. Independent of any convexity hypotheses on f, points other than u can maximize f * without added restrictions on f" - P; this is of no consequence for our purposes of error-bounding, however. 4. In order for the error bounds to be effective, one would require that f* be continuous at u; this requires further study of P. An examination of the expression for f *(u) - f *(w) shows that if f;' and P. are uniformly bounded for u and w near a, then for some constants a, b, c we have

0<,f*(u)-f*(w) 0, I I of(w) - Vf(u) II < KI I w - u II

III! -v.ll<[1 +(KIE)]IIu- wII.

near u, and S = H, then

For the general situation we have the following more restrictive results: THEOREM 1.6.5. Let the assumptions of Theorem 1.6.4 hold. Moreover,

suppose that IIf."II<_A,1IP.-Poll
E>0,allfor

w near a. Then I112 - v, I I = OQ I u - w 11' 12) and hence f *(Cl) - f *(w) _ 0(I I u - wID.

Proof. Let

g,(v)+ Then

Ig.(v)-g.(v)1=041vllllu-wID Then we have

8.(v.) S 8.(u) = g.(u) + [g.(12) - g(u)) _< g.(u) + 0(11 u I I and, similarly,

g.(v,)+O0Iv,111Iu- wID Hence

19-02) - g.(v.)15IIv.11001u - wID

I I u - w 11)

24

VARIATIONAL PROBLEMS IN AN ABSTRACT SETTING

sec. 1. 6

Now also for w near u, gg(v) Z 3a,f I v 112 - I I v IIM for a fixed constant M; and since g,(0) = 0, we have I I v. I I < 2M/e. In addition, g .(V) = g0(a) + +

zg.(a)+4-IIv-a1I2

_

Therefore,

g(v.)-g.(a)
Inv.

Q.E.D.

We wish to give two concrete examples illustrating the meaning of the general theorem above, Theorem 1.6.5; it is simplest to consider differential equations, and in order to minimize technical complexities we consider the equation u"(t) = c[t, u(t)],

u(0) = u(1) = 0

t in (0, 1),

More precisely, we consider minimizing the functional

f(u) = If' [u'(t)]2 dt + f ' 0

over the set H

0

f.uf

c(t, x) dx dt

0

W 2(0, 1). For u, v in H, we take

= f o [u (t)v (t) + u(t)v(t)] dt Since we wish to illustrate ideas rather t4an technicalities, we shall be rather sloppy and speak blithely of D2u - u" for u in H; the precise formulation is easily filled in.

A: For the first example, let us suppose that u) > y > -x2 for all u in (-oo, oo), tin [0, 1]. Let (u, v) = f u(t)v(t) dt. Then 0

flu + h) =f(u) + (-D2U + c(t, u), h) + 4([- D2 + c (t, u)]h, h) + small terms Thus

_ (-D2u + c(t, u), h)

VARIATIONAL PROBLEMS IN AN ABSTRACT SETTING

SEC. 1. 6

25

and

_ ([- D2 + ca(t, u)Jh, h)

We let S = S. = H and for all w define P. by -D2 + y which is positivedefinite since y > -n2. Since ca > y, we have <[f: - P.]h, h> _ ([c. - y1 h, h) Z 0 and the hypotheses are fulfilled. Here

f'(w) = 2

f u [w'(t))2 dt

+f {

f

o

tee)

2 v2(t) + [v(t) - w(t)][c[t, w(t)] - yw(t)) [c(t, x) - yx] dx} dt

0

where v(t) solves

v"(t) - yv(t) = c[t, w(t)] - y%(t) for t in (0, 1),

v(0) = v(l) = 0

In this case, discussed in Shampine (1968), the error bounds are in the norm
This yields useful bounds for any approximate solution w and for the corresponding v,,. Such a w might, for example, be obtained by the Ritz procedure

or by an iterative process. For some problems the Newton iterative process yields a sequence u, decreasing to the desired solution [Bellman (1957, 1962), Collatz (1966), Shampine (1966)]; often, then v,,, turns out to lie below the solution [Shampine (1966)] yielding error bounds for u. The variational procedure above in addition furnishes bounds involving the derivatives. B. In some cases, the Newton iteration mentioned above may be costly to carry out. Certain Picard-type iterations, though.more slowly convergent, are sometimes used at least until one is near the solution where Newton's method might be worth the cost. The process above in A will often yield twosided bounds and the bounds as well in this case too. We wish to observe that one Newton step also provides. such bounds in some cases. Suppose now that ( c(t, u) I < N for all t, u; that uo solves u'0 = -N, u0(0) =

u0(1) = 0; and that k > c,(t, u) > y > -R2. Then it is known that the sequence

u,;, - ku,+,

c(t, u,) - ku

u,(0) = u,(1) = 0,

n = 0, 1, .. .

26

SEC. 1. 6

VARIATIONAL PROBLEMS IN AN ABSTRACT SETTING

is a monotone-decreasing sequence converging to the solution u. Now let

S=[u;u
S* ={u;u>uin [0, 1))

and

For w in S *, define P. by - D2 + c,(t, w). As we saw before, this is positive-

definite. Let us also suppose that y > 0-that is,

u) > 0-and that

c,,,,(1, u) < 0. Then for w in S* and u in S we have

Pw](u - w), u - w>

=<[c[t,w+2(u-w)]-c(t,w))(u-w),u-w>>0 since u < w and c < 0 implies c[t, w + .Z(u - w)] > c(t, w). Thus the hypotheses are satisfied. We now claim that, for w in S*, the v, that minimizes

4- + over S in fact minimizes it over all H-that is, that the gradient is P,v + Vf(w) - P,w = 0

at

v = v,

This is well known. To do this, we show that if P,v + Vf(w) - P,v = 0, then v is in S and hence v = v,. This equation for v yields

v" - c,(t, w)v = c(t, w) - c,(t, w)w,

v(0)

v(1) = 0

which is just the Newton iteration from w to v. Since also

u" - c,(t, w)u = c(t, u) - c,(t, w)u,

12(0) = a(l) = 0

subtracting we have

(v-u)"-c,(t,w)(v-u)=c(t,w)-c(t, u)-c,(t,w)(w-u)>0 since w > u, and c,,,, < 0. But then the maximum principle implies that v u < 0-that is, that v is in S and hence v = v,. Thus we find

f *(w) _ f I [v'(t)]2 dt + f I f 0

0

.a

c(t, x) dx dt

0

+4 f l Cw[tf w(t)][v(t) - w(t)]2 dt 0

+ f c[t, w(t)][v(t) - w(t)] dt 0

where v is the Newton iterate of w solving

v" - c,(t, w)v = c(t, w) - c,(t, w)w, These error bounds are in the norms

v(0) = v(l) = 0

-

SEC. 1. 6

VARIATIONAL PROBLEMS IN AN ABSTRACT SETTING o

J

([e'(t)]2 + c.(t, z)e2(t)} dt

Since cs(t, z) > 0, we have bounds for

for

27

z = w and z = u

[e'(t)]2 dt as well as the fact that

vim. < uG w. The computable bound for derivatives would thus use the variational results,

!o w(l)

-

dt
General reference: Levitin-Poljak (1966a, b).

2

THEORY OF DISCRETIZATION

2.1. INTRODUCTION

As we remarked in Section 1.1, many problems of practical interest can be considered minimization problems in general spaces. Computationally, however, one is often unable to work in such general spaces since one is restricted usually to dealing with discrete data in the real world. While it is useful, as we shall see later, to study methods of minimization in general spaces as well as in finite dimensional spaces, it is also necessary to study the relationships between the solutions of problems in the "discrete" and "continuous" domains. Therefore, we shall now look at this question in a rather general way, to see what kinds of relationships need to exist in order for our computations to be meaningful; later, we shall see how these ideas apply to certain problems. 2.2. CONSTRAINED MINIMIZATION

As usual, we suppose that we seek to minimize the weakly sequentially lower semicontinuous functional f over a weakly sequentially compact subset C of a reflexive space E; an x* E C solving this will be called a solution to the MPC-the minimization problem over C. We shall hope to compute x* by dealing with some approximating functionals over approximating spaces, such as quadrature sums for discrete data instead of integrals as in the case

of the calc+s-of-variations problem. DEFINITION 2.2.1. A discrelization for the MPC consists of a family of

normed spaces E a family of functionals f. over E., a family of mappings p of E. into E, a family of mapping r of E into E., and a family of subsets C. of E,,. 28

THEORY OF DISCRETIZATION

SEC. 2. 2

29

We are thinking of C. and f, as "approximations" to C and f, r,x E E. as an "approximation" (restriction) to x E E, and px, E E as an "approximation" (prolongation) to x, E E. We shall call x.* a solution to the MPC;,-E,, the E,-approximate minimization problem over C. with e, > 0 converging to zero, if

f4(x"*) Sx.EC. inf f" (x,) + E, For a solution x,* to the MPC-E, to converge in some sense to a solution of the MPC, we need to have some relationships between the two problems to measure the degree of approximation. DEFINITION 2.2.2. A discretization for the MPC is consistent if and only if

1. Jim sup f,(rx*) < f(x*) for some x* solving the MPC; 2. Jim sup [ f(p,x,) - f,(x; )] < 0 if x; solves the MPC,-E,; 3. the sets Ca - p.C. U C are uniformly bounded, and, if z,, E CM and z z, then z E C; 4. solutions x; of the MPC,-E, exist for all n; 5. r,x* E C. for the same x* in condition 1 above. We remark that one might of course prove conditions I and 2 in Defini-

tion 2.2.2 by proving them for all points, not just for the solutions of the minimizing problems. Condition 3 is trivial if p.C. c C. Condition 3 is also trivially true if e(C", C) ---. 0, where C = p.C. U C, e(C", C) = sup d(x, C), and

d(x,C)-inf11x-y1{.

x EC'

)Ec

If f is weakly sequentially lower semicontinuous on a set containing, for sufficiently large n, the sets C = p,C, u.C, then the numbers y, f(x*) -

inf f(x) > 0 must converge to zero; this follows since, if f(z,) S inf f(x) xEC' + (1/n), there is z in C and a subsequence z,, --- z yielding f(x*) < f(z) S lim inf f(z,). XEC'

r-.

EXERCISE. Give a rigorous proof that Jim y, = 0, where y. is defined in the preceding paragraph.

We can now prove the following fundamental theorem on approximate minimization via discretizations. THEOREM 2.2.1. Let f be a weakly sequentially lower semicontinuous functional on a set containing C" for large n, where Co _ p.C. U C, and let

[E f C., p r,}.be a consistent discretization of the MPC for a weakly sequentially compact set C in a reflexive space E. Let x,* and x* solve the MPC,--E, and MPC, x* satisfying conditions 1 and 5 of Definition 2.2.2.Ther

30

sEc. 2.2

THEORY OF DISCRETIZATION

limf(p,.x.) = lim f"(x,*) =f(x*) and all weak limit points of p"x,*, at least one of which exists, solve the MPC; in particular, if x* is unique, then pnx. , x*. Proof.- Let y" - f(x*) - inf f(x) for large n as above. Since x* solves s EC"

the MPC, by the consistency we have

f(x)+Y"
f(pnxn) - fn(x")

where nn satisfies Jim sup ,n < 0 by condition 2 of Definition 2.2.2. On the +-w

other hand xA solves the MPC"-Er, so

f(x*)
Sf,(r,,x*)+Y"+rl"+E" =f(x*)+Y,+'i,+E"+b, where bn - f"(r"x*) - f(x*) satisfies JimA--sup a" < 0 by condition

I

of

Definition 2.2.2. From the first and last terms of this basic inequality we have since

lim sup b" < 0, we have R--

0 <Jiminf(yn+n,+e,+b")
and therefore lim (q, + b") = 0. If lim inf ?In < a < 0 for some a, then there exist n, with n", < a/2 < 0 for all i, and hence 6, > -a/4 for infinitely many i, since Jim (q" + bn) = 0, contradicting lim sup 8 < 0. Thus lim q, = 0 and, similarly, Jim b" = 0. Letting n now tend to infinity in the basic inequality yields

lim f(pn x.) = lim fn(x.) = J(x*) n-.oo

Since C" is uniformly bounded, {p xp) has weak limit points, all of which lie in C by condition 3 of Definition 2.2.2. For any such limit point z with z,f(z) < lim inf f(p",x*) =f(x*), so z must solve the MPC. Q.E.D. General reference: Daniel (1968b).

SEC. 2. 3

THEORY OF DISCRETIZATION

31

2.3. UNCONSTRAINED MINIMIZATION

As we saw in Section 1.4, for the purposes of analysis unconstrained minimization problems are often reduced to constrained problems by means of growth conditions. This same approach is useful for the analysis of constrained minimization via discretization methods. Hence we consider the MPE-the minimization problem over E-to locate x* with f(x*) = inf f(x) for reflexive E and weakly sequentially lower semicontinuous f. We also consider the discretized problem, the MPE the Es-approximate minimization problem over E of finding x,* in E., such that f (.x,*) < inf f .(x.) f . x.E E If x* exists and the discretization is consistent (here C. = E., C - E), then the proof of Theorem 2.2.1 with minor modifications shows that lim f (p,x* )

lim f(x.) =f(x*) while, if f satisfies a T-property, one can also conlude that has limit points, all of which solve the MPE. EXERCISE. State and prove the modification of Theorem 2.2.1 outlined

above.

We omit this modification of Theorem 2.2.1, however, because it is generally not useful; the difficulty is that, in practice, one cannot usually prove consistency of a given discretization without having some additional information on the points x.*, such as that I I x. I I. < B for a fixed constant B. One way of guaranteeing such a uniform bound is via a uniform-growth condition.

DEFINITION 2.3.1. A discretization for the MPE satisfies a uniformgrowth condition if and only if 1im sup

oo

whenever lim sup I I x I

+ o0

Actually, one can describe a uniform T-property, but the above definition covers most of the cases of interest. DEFINITION 2.3.2. A discretization for the MPE is stable if and only If there is a real-valued function B(t) for t ? 0, bounded on bounded set,, such that

II x Ih < r implies IIpnx II < d(r) Now we can prove the following fundamental theorem on unconstrained minimization via discretization. THEOREM 2.3.1. Let f be a weakly sequentially lower semicontinuous functional on the reflexive space E, let( satisfy a T-property at 0 with T 7'. .

32

sEc. 2. 3

THEORY OF DISCRETIZATION

be stable, be consistent (for uniformly

Let the discretization {E., f., p

bounded I I x,* 11, and IIPx: I I), and satisfy a uniform-growth condition. Then

limJ(p.x;) = lim f.(x;) = f(x*) where x* solves the MPE, and all weak limit points of {px }, at least one of x*. which exists, solve the MPE; in particular, if x* is unique, then is in E., Proof: By our assumptions, x,* and x* exist. Since and

.(x.*)
lim sup f.(r.x*) X_

Therefore, there exists a constant R, > 0 such that

f(x;) < R

R,

for all n

Thus, from the uniform-growth condition, there is an R2 > 0 such that I I :x.* II. < R2.,

II r.x* 11 < R2

for all n

Hence, by stability, II P.x. II < 5(R2)

Let R3 = max {To, S(R2)}, and let

C.={x.;IIx.II.
C = (x;l(xll
Note that the proof of Theorem 2.3.1, above, shows that condition 2 in the consistency definition (Definition 2.2.2) could be proved valid for x,* by proving it valid for all sequences {x,} with II x. II. bounded independently of n.

In some cases one can find estimates for the rate of convergence of to x*. From the proof of Theorem 2.2.1, we see that for large n in the constrained case and all n in the unconstrained case (where y 4 0) we have

-y.
-D,(n)
sEc. 2. 3

THEORY -OF DISCRETIZATION

33

where D,, D, and D, measure the defect in consistency via D,(n) >_ f.(r.x*) - f(x*) D.(n) > f(x*) - infi(x) C.

D,(n) f(p.x:) -f.(x*) In some cases D,(n), D,(n), and D,(n) can be estimated beforehand; if, for example, f involves integration and f, is a quadrature approximation, D,(n)

and D,(n) might be estimated from known facts about the accuracy of quadrature formulas. Once we can estimate the speed of convergence of f(p,x;) tof(x*), methods such as described by Theorem 1.6.3 can sometimes be used to conclude that x* and to bound IIp.x - x* 11The theorems of the previous section, especially Theorem 2.2.1, are

related to unpublished work of Aubin and Lions [Aubin-Lions (1966)) treating similar problems. In their work one seeks to minimizef(x) = J[G(x)] over a weakly compact subset C of a reflexive space E, where G maps E into a reflexive space H and is continuous between the weak topologies, and where J is a weakly lower semicontinuous functional on H. Instead, one minimizes J,[G,(x,)] over a weakly compact subset C. of a reflexive space where G. maps E. into a reflexive space H. and is continuous between the weak topologies, and where J. is weakly lower semicontinuous on H,. Mappings p,: C. - C, r,: C -- C q,: H. - H, and s,: H H. are assumed to exist; the authors make the following assumptions: 1. IJ.(w.) - J.(v.)I = 0(1) as II W. - v.ll. --' 0; 2. Jim I J(q,w,) - J,(w,J I = 0 if II w.11. is bounded;

3. lira I J,(s,w) - J(w) I = 0 for w E H;

4. If

IIP.x.11 is bounded, then S. Jim 11 G,(r x) - s,G(x) I L = 0

G(p,x,) - q,G,(x,) - 0;

for x c- E;

Under these assumptions the authors show that the x.* exactly minimizing. f, over C. satisfy the properties proved for our x.* in Theorem 2.2.1. EXERCISE. Show that, under hypotheses 1-5, listed above, the discretization [E f p r C,} is consistent and hence Theorem 2.2.1 applies directly to the Aubin-Lions problem.

The special form off and f, in the above presentation is related to nonlinear integral equations that are posed as variational problems. We discuss this and other kinds of operator, equations briefly. General reference: Daniel (1968b).

34

SEC. 2. 4

THEORY OF DISCRETIZATION

2.4. REMARKS ON OPERATOR EQUATIONS

Suppose one wishes to solve the following nonlinear integral equation [Anselone (1964), Vainberg (1964)1:

u(t) = f K(t, r)c[r, u(T)] dr 0

where we suppose that the integral operator Au

f K(., r) u(r) d r 0

is bounded from L.(0, 1) into L,(0, 1), where p > 2, and I/p + 1/q = 1; that K(t, T) = K(r, t), that the spectrum of A as an operatodfrom Lq into Lq is positive and that A maps bounded sets into precompact sets (that is, A is a compact operator). We suppose that the operator

c(u) = c[.,

is norm-continuous from L,(0, 1) into L,(0, 1)-i.e., that I c(t, u) I < a(t) + b I u li JF, where a E L,,(0, 1), and b > 0; that c(t, u) is continuous in u for almost all t and measureable in t for all u; and that C(t,u)

fc(t,s)ds 0

satisfies

C(t, U) < Gcu2 + fi(t)' u jr + d(t) where

a E (0, m),

m = inf {A; A E Q(A)},

0 < f E L2,(,_,)(0, 1) for some F in (0, 2),

and

0 - 2 f 1 C[t, G ,,:(t)] dt 0

is defined on L2(0, 1) and achieves its minimum there. At the minimum x*, Vf vanishes, and so

THEORY OF DISCRETIZATION

SEC. 2. 4

35

0 = Vf(x*) = 2x* - 2G* c(Gx*) Defining u* = Gx* E Lp(0, 1), we see that u* = Ac(u*) and that u* solves the integral equation. if we define

J(w) = - 2 f t C[t, w(t)) dt 0

for w e L,(O, 1), we see that f(x) = J[G(x)]

as described by the Aubin-Lions work. Thus the theory described in Section 2.3 applies to numerical solution of integral equations, where integration is, for example, discretized by means of quadrature formulas. A particularly attractive feature of integral equations is the compactness of the operator A. If A is approximated by a quadrature sum, A. u = t=t w.,, K(., T;) u(zt) = Q,,[Ku)

where the quadrature formula Q. gatisfies Jim --.m

f l = f ' f(t) at o

for all continuous f, then the operators A. turn out to be collectively compact in many cases; that is, the union of the images by A. of each bounded set is

compact. This fact has been exploited greatly [Anselone (1965, 1967), Anselone-Moore (1964), Moore (1966)] to analyze numerical methods for linear integral equations. Essentially, the same viewpoint has been used to analyze nonlinear equations given by variational problems [Daniel (1968a)]. A typical result using this viewpoint is as follows. If f and f., n = 1 , 2, ... , are weakly lower semicontinuous functionals such that for each x in a weakly compact subset C of a reflexive space E we have

lim f (x) =f(x) and such that {Vf - Vf } is a collectively compact set of norm-continuous mappings of E into E*, then if x,* E C satisfies f(x.) < inf c,, x E C

36

sec. 2.4

THEORY OF DISCRETIZATION

0, it follows that every weak limit point of {x,*), at least one of with e. which exists, minimizes f over C. EXERCISE. Show that the collective compactness referred to above merely serves to guarantee the consistency of the discretization with C. = C, E. = E,

p. = r = the identity map. Thus most results of Daniel (1968a) follow from either of the above Theorems 2.2.1 or 2.3.1. It is a trivial exercise further to deduce from these theorems results concerning the solution of operator equations via discretizations; one need only recall that Vf(x*) = 0 at an interior minimum of f. Thus one is led to results concerning the weak convergence of p x,* to x* where

Vf.(x.) = 0

and

Vf(x*) = 0

Stronger convexity hypotheses on f will then give norm convergence. The analysis of convergence for discretization methods of solution of nonlinear equations has been carried much further, however, than can be covered from the variational viewpoint. Rather than give such an incomplete picture of the subject, therefore, we merely refer the interested reader to the literature. We proceed, in the following chapter, to examine a number of examples to which the variational viewpoint naturally applies in order to demonstrate some particular cases of discretization methods. General references: Aubin (1967a, b; 1968), Browder.(1967), Petryshyn (1968).

3

EXAMPLES OF DISCRETIZATION

3.1. INTRODUCTION

In the well-developed theory of discretizations for operator' equations, many examples of particular discretization schemes can be found, particularly

for partial differential equations (see the General References at the end of this section for such general examples). In this chapter we shall examine some specific types of problems or methods which, by our considering a particular form for the discretization, can be analyzed from the viewpoint of the discretization theory of variational problems and from the theorems presented in Chapter 2 or extensions of those theorems. In some cases this leads to new results, in some it provides a different way of looking at well-known results, and in one it shows how the approach can be used to guide the direction of one's research on a new method. General references: Aubin (1967a, b; 1968).

3.2. REGULARIZATION

The idea of regularization has been studied from at least two different viewpoints. Under the name of regularization it was developed theoretically largely by the Russian school [Levitin-Poljak (1966b), Tikhonov (1965)] for the situation in which one seeks to minimize a functional g and, out of all the solutions to this problem, find the one which is "smoothest" or "most regular" with respect to another functional h-that is, which minimizes h over the set of solutions of the first problem. If g represents a calculus-of-variations problem, for example, one might take. 37

38

SEC. 3. 2

EXAMPLES OF DISCRETIZATION

h(x) = f' I z(t) V z dr 0

to make x "smooth." This goal can be accomplished in many cases by minimizing

g + a"h,

a" > 0, and lima" =0

where

(3.2.1)

and noting that the solutions to these problems converge to the desired regular solution. This same technique has also been studied as a form of the penalty function method, since minimization of the functional in Equation 3.2.1 is equivalent to minimizing

h + 1a"g,

where

a" > 0,

lima" = 0

and

In this form we recognize the procedure as a form of the penalty-function technique to minimize h over the set of x satisfying g(x) = 0, if g(x) > 0 for all x [Courant (1943), Butler-Martin (1962)]. We shall briefly consider this method (from the regularization viewpoint) as a discretization. First we shall generalize it somewhat because of its relevance for numerical work, and then we shall specialize to the above description. Suppose we seek to minimize the nonnegative weakly sequentially lower semicontinuous functional h over the set of points which minimize the weakly sequentially lower semicontinuous functional g over a weakly sequentially compact subset C of a reflexive space E. Suppose we have a discretization for this problem described by [E", g", h", C", p", r"}. For a sequence of positive a" tending to zero we shall define

.f"(x") = g"(x") + a,h"(x"),

for

x" E E"

We also define

f(x) = g(x),

for

xEE

and thus. we have a discretization [E", f", C", p", r,j. We list assumptions for this example corresponding to the consistency definition (Definition 2.2.2), but stronger:

1. lim sup [g"(r,rx*) + a"h"(r"x;) = g(x*) - ah(x`)] = lim sup Z > 0

" " A--

for every x minimizing g over C; 2. lira sup a,h(p-x*) - g"(x.) - a,h"(x,*)] = lim sup 8" S 0 if x,* erapproximately minimizesf" over C";

3. the sets C" = p"C" U C are uniformly bounded, and, if z", E C with z", z, then z c C;

EXAMPLES OF DISCRETIZATION

SEC. 3. 2

39

4. solutions x; exist; E C. for every x* minimizing g over C. 5. THEOREM 3.2.1. Suppose x.* satisfies j,

inf [g.(x.) + a h.(x.)) + C. g.(x.) + a.h.(x ) < x.CC. Under hypotheses 1-5 (above) on the discretization for the weakly sequentially lower semicontinuous functionals g and h, with h > 0, h. > 0, and C weakly sequentially compact, if in addition all > 0 converges to zero slowly enough

that urn supC, n-.o^

+§.+En+y,,<0 a.

where y. = g(x*) - inf g(x), then all weak limit points x' of p.x., at least C.

one of which exists, minimize h over the set of minimizing points of g over C.

Proof: Since h. > 0 and h > 0, it is trivial to verify that hypotheses 1-5 (above) imply the consistency of the discretization f En, f., C.; p., r.) for minimizing f = g over C. EXERCISE. Prove that hypotheses 1-5 (above) imply the consistency of the discretization [E., f., C., p., r.] for minimizing f = g over C.

Thus by Theorem 2.2.1, weak limits x' exist and all minimize f =g over C. As in the proof of Theorem 2.2.1, for large n we have g(x*) < y. for any x* minimizing g over C. Thus for large n, we write

g(x*) < g(x*) + a.h(p.x.) < g(p.x.) + a.h(p.x.) + Y.

=g.(x.)+a.h.(x.)+Y.+Yn
g(x*) + a.h(p.x.*) < g(x*) + a.h(x*) + y. + E. + 8. + C. which implies

h(p.x,*)
x', we have

h(x') < lim inf h(p.,x.,) < h(x*) I--

Thus x' minimizes h over the set of minimizing points of g. Q.E.D.

40

sEc. 3.2

EXAMPLES OF DISCRETIZATION

COROLLARY 3.2.1. If.g and h are weakly sequentially lower semicontinuous functionals over the weakly sequentially compact subset C of a reflexive space E, with h > 0, and if x,* satisfies

g(x,) + ah(x.) < inf [g(x) + a,h(x)1 + a.b, xEC where a,,, b > 0, Jim a, = lim b = 0, then all weak limit points x' of x.*, at least one of which exists, minimize h over the set of minimizing points of g over C.

Proof: Let E. = E, C. = C, p = r = the identity map, g = g, and h = h; the hypotheses in and immediately preceding Theorem 3.2.1 are clearly satisfied, since g + a,,h is weakly sequentially lower semicontinuous. Q.E.D.

The above corollary describes the nature of the regularization method as it is most often described [Levitin-Poijak (1966b)). It is possible in many cases to guarantee more than the rather weak convergence properties guaranteed in Theorem 3.2.1; we give an example below. THEOREM 3.2.2. If, in addition to the hypotheses in and immediately preceding Theorem 3.2.1, C is convex, g is quasi-convex, and either g or h is strongly quasi-convex, then the entire sequence [ is weakly convergent. if, in addition, either g or h is uniformly quasi-convex, the sequence is normconvergent.

Proof: If g is strongly quasi-convex, then by Theorem 1.5.2 the set

C' = (x*; x* E C, g(x*) = inf g(x)) XEC

consists of only one point, so

x*. If g is only quasi-convex, then C' is convex, and by Theorem 1.5.2 the strongly quasi-convex functional h is

minimized over C' at a unique point x', so p x x'. If g is uniformly quasi-convex in addition, then p,,x. --- x* by Theorem 1.6.1. If only Is is uniformly quasi-convex, then we write

Px: 11) < max fh(z ), h(P.x. )) - Is (xl

Recalling from the proof of Theorem 3.2.1 that h(P.x,*)
a

2PN

)

EXAMPLES OF DISCRETIZATION

SEC. 3. 3

41

we have

lim sup 5(11 x' - p,,x (I) < lim sup

{max [x'), h(x') + E- + fe a+ C + y J h(x) - lim inf h rx'

-h

x'. Q.E.D.

x', and hence

since (x' +

0

In some cases one can also compute the order of accuracy of the regularized solution as a function of the parameter a.; in the following paragraph we briefly describe some recent results of this type [Aubin (1969a, 1969b)] which have application to optimal-control problems. Suppose we wish to minimize the convex and differentiable functional f over the set C _= [x; Lx = b) where b is given and L is a bounded linear operator from the Hilbert space E, into a Hilbert space E2 such that L has a closed range. To do this we instead minimize fn(x) = II Lx - b 112 + over E,. Suppose that x* minimizes f over C and x; minimizes I I Lx - b 112 + an f(x) over all of E,. Then it can be proved [Aubin (1969a)] that the following

estimates hold for some constant k > 0: (1) 11 b - Lx I I S ka.; and (2) f(x*) - (1 /a )f (x,*) < f(x*) - f(x,*) < ka,,. Under stronger hypotheses on f, this of course leads to error bounds for II x,* - x* 11. For computational purposes, if the problem is discretized via mappings p,,, rA for example, by using finite differences to replace the differential equations Lx = b of a control-theory problem-the same type of error estimates are known as those given above [Aubin (1969b)]. 3.3. A NUMERICAL METHOD FOR OPTIMAL-CONTROL PROBLEMS

We seek to compute numerically an approximate solution to an optimalcontrol (C-problem) of the following type: minimize 91

AY, u) = f c[t, y(t), u(t)] dt where the cost function c(t, y, u) is nonnegative, over the collection of functions (y, u) satisfying y - dY = s[t, y(t), u(t)],

to < t < t

y(t) E Y(t),

y(to) E Y,

U(t) E U(t),

and y(l) E YP

42

EXAMPLES OF DISCRETIZATION

SEC. 3. 3

where to and t, are unknown points in some fixed interval [0, T] and Y YF, Y(t), and U(t) are specified subsets of E', E', E', and Ek, respectively. As is well known [Warga (1962)], this problem can be transformed to one with fixed time-i.e., to to = 0, t, = 1--essentially by introducing to and t, as components of an "extended" y-vector; this transformation preserves important properties of the problem, including the form of the y- and uconstraints, so hereafter we shall assume that we have a fixed-time problem

with to = 0, t, = 1. ASSUMPTION Al. to = 0, t, = 1. EXERCISE. Supply the details to justify the above specialization to to = 0,

t, = 1.

The following numerical method has been proposed to solve the Cproblem [Rosen (1966)]: for positive integers n, set k = k = 1/n, t, = ik for

0 S i S n; find vectors y. _ (y, 0, k

.

0, . , u,,,,,) minimizing . , y.,.), u.(u., =..

c(t y,,,,, u,,,,) over the collection of vectors satisfying Y..i+,

Y.,' = s(t1, y.,,, us,,) for i = 0, ... , n - 1, Y.,, E Y(ti) k and u,,, E U(t) for i = 0, ... , n, y,,,o E Y,, Y., E Y,,

This method has proved useful in practice; under certain assumptions [Rosen (1966)], the nonlinear programming problem (P problem) defined by the numerical approximation can be computed rapidly by a variant of Newton's method. We are concerned not with methods for computing (y,, but with whether or not the sequence (y or-more precisely-approximations to (y,,, converge in some sense to a solution (y, u) to the original C-problem. In Cullum (1969), some results are obtained concerning this convergence, particularly for C-problems with s(t, y, u) linear in y and u; in a certain sense which will become clear later, the convergence statements of that approach do not quite face the computational problems squarely (except for problems lacking state constraints [Cullum(1970)]). We shall examine the method of Rosen (1966) in detail and see that the convergence theory can be treated nicely by the discretization approach. In fact, Rosen (1966) treats the problem computationally as one involving inequality constraints Y.,r+ I < Y.,1 + ks(tr, y.,,, u..,)

and indicates that one can solve the problem under equality constraints by a penalty-function approach. This allows us to analyze the method by means of the tools of regularization as discussed in Section 3.2 and, in particular,

EXAMPLES OF DISCRETTZATION,

SEC. 3. 3

43

in Theorem 3.2.1. Therefore, for a sequence of positive numbers a converging to zero, we shall approximately minimize

k

[s(11

Y j' u,,t-,) -

t + a k F c(t,, y

,u

,1)

under the above inequality constraints and with the sets Y(t), U(t) slightly expanded; the points obtained will be shown to converge to a solution of the control problem under equality constraints. The sets Y(t) and U(t) must be expanded slightly in order to guarantee in general the existence of feasible

points for the discrete P-problem arbitrarily near the solution to the Cproblem. To see the reason for this, consider the following C-problem: Solve y = u, t2 < y < t2 + 1, 2t < u < 3t, t e [0, 1], minimizing 10 (y2 + u2) A The solution is y = t2, u = 2t; but there are no feasible points at all for the P-problem with constraints y,+, = y, + ku (ik)2 < y, < (ik)2 + ik, 2ik < u, < 3ik, because yo = uo = 0 implies y, = 0. In this example, however, it is clear that there exist points satisfying the equality constraints for the P-problem and which are very near to being in Y(t,), U(t).

EXERCISE. For the example immediately above, show that there exist points satisfying the equality constraints which are very "near" to being in Y(t,) and U(t) for all i, where "near" is some reasonable and precise concept.

The assumption in general that this is true is essentially equivalent to the assumption of the existence of a mapping r in a consistent discretization scheme.

Continuing with the intuitive approach, we also see that in order for the numerical method to have a chance of success, the nature of the sets Y(t), U(t) must be revealed fully by their nature at the discrete points t,; for example, if Y(t) = (t) for irrational t, but Y(t) = (-oo, oo) for rational 1, then the numerical method using Y(i/n) would never detect restrictions. Thus we need to assume that Y(t), U(t) vary nicely in the sense that, given feasible vectors for the P-problem (y., with y,,, E Y(t), u,,,, E U(t,), then there exist feasible functions (y, u) for the C-problem with y(t) near Y(t), u(t) near U(t), and y and u near y. and u in some sense; the point (y, u) will be called p.(y,,, un), where p will turn out to be the relevant mapping in a consistent discretization. For notational convenience, we restrict ourselves henceforth to scalar problems; that is, we assume that y and u are in E'. The situation for y E E', u E E" is exactly the same except that some statements-such as those regarding convexity of functions s(t, y, u)--must be read with regard to the vectorvalued function's individual components.

44

EXAMPLES OF DISCRETIZATION

SEC. 3. 3

Now we are ready to make our intuitive assumptions more precise. Define the Hilbert space E = [(y, u); u c- L2(0, 1), y c= L2(0,1), y is absolutely continuous}. For x = (y, u) E E, let I I x 112 = f 1 y2 dt + f 1 u2 dt + y(0)2. 0

0

This is essentially a standard Sobolev space. Define the discretized space E. = [(Y", un) ; y" = (Y", o, ... , Y",n), U. = (un, o, ... , un.n)}

For x" = (y", u") E E", let

)=+k" u2, + Y2

11x112=kj:(y

As we noted in Section 1.3, weak convergence in E is equivalent to weak convergence of the components y and u in L2(0, I) and convergence of y(O) in ER, which implies uniform convergence of y-i.e., convergence in C[0, 1]. Next we define functionals h(x), g(x), h"(x"), g"(x") for x = (y, u), xn = (y", u") asfollows:

h(x) = f 1 c[t, y(t), u(t)] dt 0

g(x) = f o [s[t,Y(t), u(t)] - At)) dt h"(xn) = k g"(xn) = k

f=1

c(t Y

u ,)

[s(ij_UY'..f-I, un,l_I) -Y".f

'yn.(-11

Let

Q' = [(y, u) E E; y(i) E Y(t) for all t E [0, 1], y(O) E Y Al) E YP} Q" _ [(y, u) E E; u(t) c U(t), y(t) < s[t, y(t), u(t)] for almost all t E [0, 1]}

Q"' _ [(y, u) E E; g(y, u) = 0) Our C-problem now takes the following.form: find an x* _ (y*, u*) E Qo =

Q' n Q" n Q"' satisfying h(x*) = h(y*, u*) = inf h(x). x E Q,

ASSUMPTION A2. We assume there exists 80

> 0 such that for all

the set Q(b) _ [(y, u) E E; d[y(t), Y(t)] < 6 for all t E [0, 1], d[y(0), Y,] < a, d[y(l), YP] < d, d[u(t), U(t)] < S and y(t) < s[t, y(t), u(t)] for almost all t c [0, 1]} is weakly sequentially compact and bounded by the constant B.

The boundedness of Q(S) can be deduced if, for example, the sets Y(t), U(t) are bounded above and below by functions in L2(0, 1) or if in some other

SEC. 3. 3

EXAMPLES OF DISCRETIZATION

45

fashion one can find a priori bounds on the solutions and then include the bounds (theoretically) in the constraints. If, furthermore, the set of (y, p) satisfying the differential inequalities forms a weakly closed subset and if Y,, Y,, Y(t), and U(t) are closed for each t and U(t) is a convex set for each 1, it is easy to deduce that the set of (y, u) satisfying the other constraints is weakly closed and hence Q(b) is a weakly closed subset of a bounded set and is therefore weakly sequentially compact. EXERCISE. Supply the details for the above argument concerning the weak sequential compactness of Q(b).

ASSUMPTION A3. Suppose h and g are weakly sequentially lower semi-

continuous functionals and that x* solves the C-problem-i.e., x* E Q0, h(x*) = inf h(x). XEQ,

AssuMPTION A4. Assume there exists a map r of E into E. such that, for some x* solving the C-problem, (y,,, x - rrx* satisfies the following:

h(x*);

1. iim

.

2. Y.,,+I = y.,r + ks(t,, y.,,, u.,,) for 0 < i < n - 1, Y..o E Y,; 3. lim d = 0, where d max max [d[y.,,, Y(t,)], d(y,,,,,, Y,), os1s4

d[u..n U(011-

We define the expanded constraint sets for the P-problem now as

Q.=[x.=(y.,u.) E E.;IIx.I1.
Our P-problem will be to approximately minimize

over Q..

ASSUMPTION A5. Assume there exists a map p of E. into E such that, if

x E Q,,, then 1. lim I h.(x.) - h(p.x.) I = lim I g.(x.) - g(p.x.) 0; a2. (Z:, satisfies z.(t) < s[t, z (t), w.(:)], for almost all

t E [0, 1], z.(0) E Y,;

3. .-m lim e = 0, where e > e(p,,x.) - sup max d[z.(1), Yr], d[w (t), U(t)1j, and 4. I1p.x.11 < B + e,,.

osrs t

Y(t)],

Finally, we define the slightly enlarged constraint set Q' for the Cproblem :

Q" _ [x = (y, u) E E; II x 11 < B + e., e(x) < e., y(t) S s[t, Y(t), u(t)] for almost all t E [0, 1], y(O) E Y,}.

46

SEC. 3. 3

EXAMPLES OF DISCRETIZATION

Under all the above sets of assumptions, we can now prove that sufficiently accurate apyroximate solutions to the penalty-function form of the P-problem will converge to a solution of the C-problem, by using Theorem 3.2.1; we shall later examine hypotheses under which Assumptions Al-A5 will be valid.

THEOREM 3.3.1. Let Assumptions Al-A5 hold, let h, g, h", g", Q0, Q",

Q", r", p" be as described above; and let a" > 0, lim a" = 0. For each n, let x* satisfy

\

S

g"(x*) + a,,h"(x*) S g"(x") + anh"(x") + a"8"

for.all x" E Q.

where 6" Z 0, lim 6" = 0. Then all weak limit points x' of p"x*, at least one

of which exists, solve the C-problem-i.e., if x' = (y, u), then y = s[t, y(t), u(t)] almost everywhere, x' E Q0, and h(x') < h(x) for all x in Q0. Proof: We wish to apply Theorem 3.2.1, if possible; we check five numbered hypotheses preceding that theorem with C =_ Q0, C. = Q". Number 1 is true by Assumption A4, but only for some x*, not all x*; No. 2 is valid by Assumption A5; No. 4 is assumed above; No.5 is valid by Assumption A4,'

but only for some, not all, x*. For condition 3, we note that C" - p"C" U

C e Q". If z", e C and z", -- - z, since Q"+' c Q" and Q" is weakly sequentially compact because of Assumption A2, we conclude that z E Q for all i and hence z e n Q. c Q0, as demanded by condition 3. Although we cannot exactly apply Theorem 3.2.1, we can follow the lines of its proof, making use of additional information we have in this case. Thus we can conclude, as in Theorem 3.2.1, that a weak limit point x' of exists, must lie in Q0, and minimizes g over C-that is, g(x') = 0.

The hypotheses I through 5 and that on the decay rate of a" were used in Theorem 3.2.1 only to show that h(x') < h(x*) (where x* solves the Cproblem in our case); we can handle this differently. We have

0= g"(r"x*) < g"(x:) < g"(x*) +- ah"(x*) < g"(r"x*) 4 a,h"(r"x. ) + a"v".< g,,(xn) -

anh"(r,x*) + a"a"

which implies

h"(x*) < h"(r"x*) + J. Then

h(x') < lim inf

lim inf h"(x*)

< lim inf [h"(r"x*) + 8"] = h(x*)

Since h(x') < h(x*) and x* minimizes h, so must x'. Q.E.D.

SEC. 3.3

EXAMPLES OF DISCRETIZATION.

47

EXERCISE. Provide all the details for the Proof of Theorem 3.3.1 above.

The results of Section 3.2 concerning stronger convergence properties of course apply here also, but we shall not state them again. Rather we must consider conditions under which the assumptions in the theorem are true. The theorem above (Theorem 3.3.1) merely serves to identify conditions sufficient to guarantee the applicability of the numerical method of Rosen (1966). We note that the existence of p. is important only to the proof, while

the existence of r. and the numbers d related thereto are crucial to the numerical algorithm itself; we are required to treat the P-problem over Q., a set defined via d., and we must therefore know d. in order actually to compute. In Cullum (1969), it is shown that, for certain problems, if the sets Y(t) are expanded by distances y., U(t) by distances a and a. discretized step size k of length k = 1/m is used, then sequences m(n) and 1(n) exist such that maps

p., r. exist for the problem defined by y., a,(.), k = 1/m(n), with d S y. + 0g.). This does not really yield a computational procedure, since for a given sequence of step sizes k we still do not know by how much to expand the constraint sets. Now we shall attempt to make the numerical method really implementable; another approach to this can be found in Cullum (1970) for problems lacking state constraints. First, however, we remark that the assumptions other than A4 and A5 are reasonable assumptions insofar as the existence of the solution to the C-problem and the computability of approximate solutions of the P-problem are concerned. In Rosen (1966), for example, in order

to prove that the numerical method used there for minimizing g. + a.h. works, it is assumed that s(t, y, u) and c(t, y, u) are convex jointly in y and u; it is a simple matter to show-using this assumption, the assumptions in the

paragraph following A2, and the additional one that s,(t, y, u), S.(t, y, u), c,(t, y, u), c.(t, y, u) exist and as functions of t are in L2(0, l) for fixed (y, u) E E-that f, h, Q satisfy their needed assumptions. We therefore do not discuss these assumptions further. EXERCISE. Indicate how the assumptions of the preceding paragraph can be

used to deduce that f, h, and Q satisfy the assumptions demanded by the theory developed so far.

Let us consider the mapping p.; we must apply it to points X. _ (y., u.) satisfying

y,..,+, _ y,,; + ks(t, y.,,, u.,,) - kb.,, with b.,, > 0 for 0 < i

n-I

If we define w.(t) = p.u. as a step function constant on each interval (t t,+,) with value u.,, and b.(t) similarly, then y. looks like the numerical solution of the equation 1 = s[t, z(t), w.(t)) - b.(t), z(0) = y..o; if we define v.(t) = p.y. as the solution of this equation, then we are asking y. and v.(t) to be close

48

SEC. 3. 3

EXAMPLES OF DISCRETIZATION

in some sense uniformly in u, and b,. Even then, one needs to know that Y(t) and U(t) are continuous enough that v,(t,) near Y(t) for all i will imply the nearness for all t, and similarly for w,, U. Finally, to conclude satisfying h,(V,) and g(p,V,) - g,(V,) to tend to Assumption A5, we need h(p,V,) zero. We give some conditions under which Assumption A5 is valid via this approach. For any set T and positive number c, let N(T, e) = [z; d(z, 7) < e}. We.shall say a set function T(t) is continuous on 0 < t < 1 if and only if for e > 0 there exists b > 0 such that I t' - t" I < 6 implies T(t') c N[T(t"), e]. ASSUMPTION A6. Assume that Y(t) and U(t) are continuous set functions.

ASSUMPTION A7. Suppose that for each w E LZ(0, 1) with w(t) E U(t) almost everywhere and z(0) E Y there exists a unique solution of i(t) _

s[t, z(t), w(t)], z(O) = 0 for almost all t E [0, 1], and that the set of such solutions z(t) is bounded uniformly in such w and zo. ASSUMPTION AS. Assume there exists a function q(t, y) continuous in (t, y) for (t, y) in [0, 1] x (- oo, cc) and such that, if 1(t, y, u) is either of the functions s(t, y, u) or c(t, y, u), we have I 1(t', y', u)

- 1(t", y", u) I S I q(t', y') -

q(t",

y") I

for all u E U* _ fu; u E U(t) for some t E [0, 1]).

Remark: If both Y(t) and U(t) are of the form Y(t) _ [y; m(t) S y S M(t)) for continuous m, M, then they are continuous set functions. If U* and Yare compact, if s(t, y, u) is Lipschitz-continuous in y uniformly in

(t, u) E [0, 1] x U*, and if I s(t, y, u) I < p(t) a(l y 1) for u E U* where µ(t) is integrable on [0, 1] and a(I y ) = O(1 y ) as Iy I -- oo, then Assumption A7 is valid [Roxin (1962)].

EXERCISE. Prove that a set function of the form Y(t) = [y; m(t) < y < M(t)} is continuous if m and M are continuous real-valued functions.

THEOREM 3.3.2. Under Assumptions A6, A7, A8, the mapping p, described above satisfies Assumption A5.

Sketch of proof: Letting (v w,,) = p,(y u,) as described above, it is easy to show that I v,(t,) - y,,, I = o(1) uniformly in i as k the difference equation for y,,,, the equation

v,(t,,.,) = v,(t,) +

0 by examining

s[t, v,(t), u,.,] dt

and using the continuity assumptions on s(t, y, u). Since d, > d[y,,,, Y(t)] tends to zero, we have d[v,(t,), Y(t,)] = o(l) + O(d,) which, by the continuity

EXAMPLES OF DISCRETIZATION

sEc. 3. 3

49

of Y(t), yields d[a,(t), Y(t)] = o(1). Similarly, we find d[w,(t), U(t)] = o(1). Writing

h(v., w.) - h.(y,,, u.) = E

j

(c[t, v.(t), u..,] - c[t,, y..,, u,.,]) dt

and using the continuity property of c(t, y, u), we find that

Jim I h(v w,) - h,(y u,) I = 0 and similarly for g - g,,. Q.E.D. EXERCISE. Supply the details in the Proof of Theorem 3.3.2 above.

We remark that the estimates "o(l)" above are satisfactory for p,, since we have no need for the actual bounds; dealing with r,,, however, we must have computable numbers d,. Consider the definition now of an operator r,,, to be applied to x* = (y*, u*), the solution of the C-problem. Thus y* satisfies y*(t) = s(t, y*(t), u*(t)]

almost everywhere. Suppose for the moment we can define u, - r,u* via u,,, = u*(t,). Then y, = r, y* can be defined via

y.,,+, = y.,, + ks(t,, y,,,, u,,,) for 0 < i < n - 1, y,,, = y*(0) that is, so that y, is a numerical solution of the differential equation for y*; under suitable hypotheses we can then bound y,,, - y*(t,). If u* is only

measurable, we cannot estimate d but can only show that, for certain problems, there exist satisfactory d using the techniques of Cullum (1969) as sketched in the first paragraph of this section. To derive computable d,, we need more continuity assumptions on u*(t). Using these hypotheses we can

bound y; , - y*(t,) and hence bound d while Assumption A8 is more than sufficient to guarantee lim I h,(x,) - h(x*) I= 0. ASSUMPTION A9. Assume that s(t, y, u) is Lipschitz-continuous with respect to y uniformly in (t, u) E [0, 1) x U* and continuous in (t, y, u) e [0, 1] x (-oo, oo) x U*. Assume u*(t) is piecewise continuous, having only finitely many discontinuities, each of finite-jump type. ASSUMPTION AlO. Assume in addition to Assumption A9 that At, y, u) is continuously differentiable with respect tot and u, and that u*(t) is piecewise

continuously differentiable, both u* and 9* having only finitely many discontinuities, each of finite jump type. THEOREM 3.3.3. Under Assumptions A8 and A9, r, as described above

satisfies Assumption A4, with d, = O[k + m(k)], where w(k) - sup Is[t',

50

EXAMPLES OF DISCRETIZATION

ssc. 3. 3

y, u*(t')] - s[t", y, u*(t")] with the supremum taken over all t', t" with 0 < t' < t" < 1, 1 t' - t" I < k, t' and t" in the same-interval of continuity of u*, and y in a certain bounded set R. If Assumption A10 holds, then w(k) = 0(k), and we may take the computable value dA = k' e > 0. Sketch of proof: The only real task is to bound l y.,, --- y*(t,) I. Were it not for the discontinuities in the equation, we could immediately write that ly.., - y*(t,) I = 0[k + co(k)] uniformly in i by the standard theory in Henrici (1962); it is trivial to generalize this to allow the discontinuities. Essentially the argument is as follows. Up to the first discontinuity r, the 0[k + co(k)] result is valid. One can consider the calculation between T, and the next discontinuity T, as the solution of a new initial-value problem in which the initial data used in the numerical method-that is, y*(t,) for the last t, < T,are inaccurate of order 0[k + co(k)]. Since the initial error propagates in a bounded fashion, the error on is also 0[k + co(k)]. The argument proceeds in this manner throughout the finitely many discontinuities Ti. Q.E.D. EXERCISE. Supply the details for the Proof to Theorem 3.3.3, above.

The reader should note that we have only partly attained our goal of finding computable constants d,,. Our estimates-saying that we may take d = 0(k'-,), for example-only mean that the numerical method will thus work for sufficiently small k; we do not have a computable expression for d guaranteed to work for all k. Although one would like to be able to prove convergence of the numerical

solutions without the continuity requirements in Assumptions A9 and A10, this does not seem possible in general (for a special case, see Daniel [1970]); however, very broad classes of problems do have solutions satisfying A10, and one might even call this a typical situation. Thus the assumptions in

A10 do not appear to be unreasonably strong. As a simple special case, the optimal-time problem for y = Ay + Bu,, with A and B constant, with y(0) = yo given, and with u restricted by I I u II,. S 1, can be treated by making use of the classical theory of optimal-time processes; and it can be shown that, if a solution exists, it will be approximated by approximate solutions of the discretized problem with k' expanded constraint sets, extending slightly a result in Krasovskii (1957). More generally, under Assumptions Al-A10, we have proved that approximate solutions to a penalty-function form of the P-problem have weak limit points solving the C-problem.

Another approach for defining the mapping r without assuming the control u*(t) to be piecewise continuous is as follows (only the outline of the procedure is given). Suppose that, for each e > 0, u*(t) can be approximated

by a continuous function u,(t) "nearly" satisfying the constraints-say,

EXAMPLES OF DISCRETIZATION

SEC. 3. 3

d[u,(t), U(t)] < bl(f) with bl(f)

51

0 as e , 0; and suppose that y,(t),

defined as the solution to y, = s(t, y u,), y,(0) = y*(0), is also "near" constraints-say, d[y,(t), Y(t)] sb2(e), d[y,(1), Yp] < b2(E), with b2(e) ---i 0 as f ---' 0-and "near" y* so that the

I h(y*, u*) - h(y,, u,) 1 < 63(E)

with

b,(E) - . 0 as c --+ 0

Pick n so large that the oscillation of u, over intervals of length kis less than

c, and define u, as the piecewise constant interpolant of u, at the points 0, k, 2k.... and y, as the solution to y,,. = s(t, y,,,,, u,,.), v,,.(0) = T*(0). Again we can argue that y,,,, and u, are "near" the constraint sets and I h(y*, u*) - h(y,,., u,,.) I < b,(E). For each n, let (z., w.) be the solution (assuming it exists) of the original C-problem only with the control restricted

to be constant on each interval

[t. t

)

and define (y u) = r (y*

u*) via u.,, = w.(tr), y.,r+1 = y.,, + ks[t ,Y.,r, w.(tr)], Y.,0 = z.(0). If, for example, s(t, y, u) is (uniformly) Lipschitzian in y and t, then it is simple to see that l y.,, - z.(t,) I 0(k) uniformly in n and i. EXERCISE. Prove that l y.,, - z.(t) I = 0(k) uniformly in n and i if s(t, y, u) is uniformly Lipschitzian in y and t, as asserted in the preceding paragraph.

From this estimate for y. - z. one can conclude that I h(z., w.) - h.(y., u.) - ) 0

Because of the minimal property of (z., w.) and the fact that (y,, u,,.) is "near" the constraint set, one can conclude that h(z., w.) < h(y,,., u,,.) + bs(E)

Therefore, we can write h(y*, u*) < h(z., w.) < h(y,,., u,,.) + b,(f). Since h(y*, u*) __ h(y,,., u,,.) I < b,(e) and

I h(z., w.) - h. [r.(y*, u*)] I

)0

we conclude that I h(y*, u*) - h.[r.(y*, u*)] 1 --, 0. Thus Assumption A4 is satisfied for this r. and d. can be taken to be k' for any fixed f > 0. EXERCISE. Consider the simpler C-problem in which Y, = [y0], Yp = Y(t) = (-oo, oo), u(t) _ [-a, a] for some fixed a. Provide the detailed and precise hypotheses and arguments for the above construction of r.. [For the solution of this problem, see Budak et al. (1968-69)].

52

sEC. 3.4

EXAMPLES OF DISCRETIZATION

3.4. CHEBYSHEV SOLUTION OF DIFFERENTIAL EQUATIONS A

We wish to consider at this point a problem which can be examined best from the discretization viewpoint, although the theorems of Chapter 2 are not directly applicable. An attempt to apply the concepts of that chapter, however, will reveal the fundamental difficulties and research areas in the particular problem. This will show, as stated in Section 3.1, how the abstract discretization can be useful in guiding one's research. Suppose one seeks to solve Au = b where A is a uniformly elliptic linear (for simplicity here only) differential operator in two variables over a bounded

domain D, under the condition u = 0 on r, the boundary of D, assumed to be sufficiently smooth; more general types of equations may also be treated by the method to, be presented. A numerical method of recent popularity [Krabs (1963), Rosen (1968)], given a sequence of functions (qrr} satisfying the boundary data, consists in choosing numbers aw,,, . . . , aw,w to minimize II ImaxM

A

(r aw,# )] (x,) - b(x,)

where the M points (x,) form a "grid" over D. Strictly for convenience we take M = cn for fixed c (experience indicates that c = 4 is a good choice [Rosen (1968)]) and suppose that the grid is such that any point in D is at a distance of at most hw from a grid point xi. We wish to find conditions under which the miminizing point u,* a.,r#, will converge, in some sense, to the solution u* of our problem. Since we seek to minimize a supremum norm, the norm must be defined; therefore let

E = (u; u = 0 on r, all partial derivatives of u through second-order are continuous on b = D u r)

For u E E, let 1 1u1 1 = II u I I,, = max I u(x) I. Let zED

f(u) = II Au - b I I where we now need to assume that b is continuous and bounded on D. Let . E w be that subset of E spanned by the functions 9,, ... , qw, assumed to lie in E; let pw be the identity mapping, and rw be at the moment undefined. Define f1(uw) = I I Auw - b I I

m = max I [Auw] (xr) - b(x;) I 1 SrScw

SEC. 3. 4

EXAMPLES OF DISCRETIZATION

53

We now seek conditions for consistency. Consider condition 2 of Definition 2.2.2:

f(paua) -fa(un) =11 Au. - b 110. - 11,4u.* - b Ilta,w

Since this quantity is always nonnegative, the requirement lim sup [f(pau?) - fa(u: )] < 0

in fact demands convergence; in order to compare suprema over discrete and continuous sets, we need to know something about the growth of the functions Aug' - b between grid points. Hence we now assume that Ac, satisfies a Lipschitz condition with Lipschitz constant A, (this restricts A somewhat also) and that b satisfies one with a constant Ao. From this it follows that

If(pau.) -fa(u.)I
I a.,, I I A, I.

1. that there exists a constant C such that E Iaa,,I < C for all n; and ,= 2. that h1A1 tends to zero, where A. = A,. maxo

15(58

In practice, the A. do in fact become large, while the restriction on the aa,, is easy to implement. In essence, the above restrictions are defining Cw--i.e., a

Ca= (U.;

L

j Iaa,,I
1-0

Next consider condition 1 of Definition We require lim aim

2.2.2,

where ra is to be defined.

supfa(rau*)
Now fa(rau*)
so we need only require that lim sup f(rau*) < f(u*); this is certainly true if rau* is an approximation method in which Arau* converges uniformly to Au*-if, for example, rau* and all its partial derivatives through secondorder converge uniformly to those for u*. Note that it is necessary to have rau* in Ca.

54

EXAMPLES OF DISCRETIZATION

sEc. 3.4

Under the above conditions, it follows in the same manner as Theorem 2.2.1 that

limf»(u.) = 1im f(p,.u:) =f(u*) = 0 .-m where u* solves Au* = b and lies in E; the conditions on weak sequential compactness and weak sequential lower semicontinuity are needed only to a probleni easily handled differently here. We prove convergence for know that

JJAu; - bJJ. =f(p.u:) converges to, zero. By a simple use of the maximum principle [ProtterWeinberger (1967)], we deduce

Iiu. - u*II, <JIAu,* - bJJ.IJwfI_

where w solves Aw = -1 in D, w = 0 on t; therefore, u: converges uniformly to the solution u*. EXERCISE. Provide the details for the above arguments showing that u* converges uniformly to u*.

The application of the theory in Chapter 2 to this problem indicates the type of approach necessary to prove convergence for-this numerical method.

We require: (1) smooth functions q, with Lipschitz constants A, for Alp, that do not grow too rapidly; (2) results from approximation theory that state that if one approximates functions b by combinations of functions AV the. sums E I a.., I remain bounded; and (3) results from approximation theory that state that functions b can be approximated by functions AV,. The requirements 1 and 3 here are probably less difficult; generalized Bessel-inequality

results such as 2, however, are not known to this author for general cases. While numerical work with this method proceeds, theoretical results of the type suggested by Theorem 2.2.1 should and are being sought.

Using known general results, from approximation theory [RivlinCheney (1966)] comparing discrete and continuous approximations, we can avoid the questions of the growth of the a»,, and A,, although other problems arise. In particular, if b and the yr, - AV, are merely continuous, then there exists a sequence [h») tending to zero so that, for the resulting discretization,

f(p»u;) - f,(u;) tends to zero, leading us to the uniform convergence of u,* to u* as above. In general, however, we cannot give an explicit form for h,,; special results defining h can be given in one dimension in which the linear span of the yr, is the space of polynomials of a certain degree, but we are not aware of more widely applicable results in this direction.

EXAMPLES OF DISCRETIZATION

SEC. 3. 5

55

EXAMPLE [Rosen (1968)]. The Chebyshev method described above can also be used on mildly nonlinear problems as well as linear ones, although the

computation of the a,,,, is then a nonlinear programming problem. We consider, for example, the approximate solution to

in D

u=0 on dD where D is the unit square in two dimensions. Using 45 polynomials c, satisfying the boundary conditions exactly and using a grid of 225 points in

the interior of D for computing the discrete maximum norm, a relative error bound of 0.0023 was computed, making use of a maximum principle for the bound. Using only 21 functions, the error bound increased to 0.021. General references: Daniel (1968b), Rosen (1968).

3.5. CALCULUS OF VARIATIONS

We wish to consider now the standard problem in the calculus of variations for functions with given boundary values. For such problems over an arbitrary region in lR it has been suggested [Greenspan;(1967)] that a numer-

ical solution be computed by minimizing a certain type of quadrature sum with derivatives in the integrand replaced by differences. The quadrature formula in two dimensions, for example, exactly integrates functions which are piecewise constant-in particular, constant over each component of a

triangularization of the domain. In order to simplify the notation and eliminate some minor technical problems, we shall greatly specialize our analy-

sis to the case of only one dimension. The techniques, assumptions, and results go over without essential change to rectangular domains in (R"; we have not yet looked at the problem of arbitrary domains from the special viewpoint of discretizations. The space E, which we shall define, has of course some

properties in IR" for n > I that are different from those for n = 1; J particular, weak convergence for n > 1 is rather "weaker." For a thorough analysis of the calculus of variations in the reader is referred to Morrey (1966); relevant approximation concepts are in DiGuglielmo (1969). Consider the problem of minimizing the functional

f(x) = J o g(t, x, x) dt

subject to

x(0) = x(1) = 0

where x = dxldt. The following simple case of a general numerical method has been suggested [Greenspan (1967)]: minimize (or nearly minimize)

56

SEC. 3.5

EXAMPLES OF DISCRETIZATION

subject to

hig(tr- , x".,

f"(x")

X.,0 =

0,

h, = t, - t,-,

where the minimization is over the set of values of x" ,, ... , x,, ,,_, ; this method can be fitted neatly into the theory of Theorem 2.3.1. In Greenspan (1967), under the assumption that there exist unique minimizing points x* for f (in C' [0, 1]) and xA for f" satisfying the spike condition-

for some constant A independent of n-it was purportedly proved that p"x,*, the piecewise linear interpolation to x;, converges uniformly to x*; because the author inadvertently left out an assumption guaranteeing a lower semicontinuity property for the functional f, the proof is in fact incorrect. However, as we shall-show below by use of Theorem 2.3.1, the usual assumptions guaranteeing a unique minimizing point for f, in conjunction with an assump-

tion guaranteeing the satisfaction of a type of spike condition, yield a convergence proof.

For convenience, let us take h, = h = 1/n for all i. For a fixed p > 1, let

E = (x; x(O) = x(1) = 0, x is absolutely continuous on [0, 1], x E L,[0, l]}

For x E E, let 11X11= Ilxil,

For each n, let E. be (n the norm Ilx.II

k(')1 ' dt}'i°

1)-dimensional Euclidean space where X. E E. has

={hE[Ix".(-x

h

I11I

where x,,,0 = x,," = 0 by definition. Let p" be the mapping defined by piecewise linear joining of the values x",, at t, = ih, so p"x" E E. Define the mapping =x(t,),i = 1,...,n - 1. We now make the standard type of assumption in the calculus of variations [Akhiezer (1962)] in order to guarantee the existence of a minimizing point for f. Note that E, as a closed linear subspace of W 1(0, 1), is reflexive, and that weak convergence in E implies uniform convergence-that is, convergence in C[0, l]-as noted in Section 1.3.

SEC. 3. 5

EXAMPLES OF DISCRETIZATION

S7

ASSUMPTIONS: Al. g(t, x, w) is jointly continuous in its variables for

0 S t < 1 and - oo < x, w < 00. A2. There exist constants a, b with b > 0 such that g(t, x, w) > a + b l`w I' for all tin [0,1], x finite. A3. g iF differentiably convex in w; i.e.,

g(t, x, wI) - g(t, x, w2) > (WI - w:)g.(t, x, w2) with g continuous in x, uniformly for (t, w) bounded. PROPOSITION 3.5.1. The functional f is weakly sequentially lower semicontinuous on E, bounded below, and satisfies a T-condition. Proof: For the last two assertions in this proposition, note that

f(x) = f Ig(t,x,2)dt> fa[a+blzlljdt = a+bllxll° The proof of the weak sequential lower semicontinuity is straightforward using the convexity of g; details may be found in Akhiezer (1962), pp. 137-139. Q.E.D.

THEOREM 3.5.1. The discretization scheme defined above is stable and

satisfies a uniform-growth condition. Proof:

l

II° = f o ! (p"x")' 11 dt dt

=Ji

h

= Ilx"Ila

proving stability. For the growth condition, " f"(x")=h±g

x .!,I -- x",,_ I h

)>h!Ir, Ia + b I x" , hx"

= a + bllx"II: Q. Q.E.D.

The only remaining ingredient for application of Theorem 2.3.1 is the Consistency; in Greenspan (1967), the spike condition was needed for this. In our case, we must make the following assumptions.

58

EXAMPLES OF DISCRETIZATION

SEC. 3. 5

ASSUMPTIONS: A4. Some solution x* minimizing f(x) lies in C1[0, 1];

i.e., z* is continuous. A5. There exist constants c and d and a continuous function s(t, v) such that 19(t I Iv., z) - g(12, v2, z) I < (c + dI z 1P) I s(t., v) - s(t2, V2)1

where t, t2 are aTibtrary points in [0, 1] and v v2, z are arbitrary real numbers.

Remarks. If

g(t, x, w) _ (w2/2) + r(t, x) then Assumption A5 is satisfied with s = r. If

g(t, x, w) = l(w)m(t, x) with I l(w)I < c + dl wI

then Assumption AS is satisfied with r = m; many actual problems are of the above types. Assumption A4 is probably superfluous in many cases. THEOREM 3.5.2. The discretization described above is consistent.

Proof: For condition 1 of Definition 2.2.2 we prove lim A- I f,(rrx*) - f(x*) I = 0

Since, by assumption, x* is in C'[0, 1], given e, for sufficiently large n, I x*(t,_,) - x*(i) I < e

and

*

z*(t) - xf h xr-. < e

for tr_, < t
Thus,

I f(x*) - IA(rAx*) I < ri f

I go, x*' x*)

-

g (tr

x* x*

h

x* .

j I dt

But, by uniform continuity of g, given 6 > 0 there exists e > 0 and then N

such that n > N implies . '

I f(x*) -

fA(r,x*) I .<_

,f

b dt = 6

Since d > 0 is arbitrary, condition 1 is proved. For condition 2 of Definition 2.2.2, we show that lim I fA(xA) I = 0 if 11 p xA 11 is bounded:

EXAMPLES OF DISCRETIZATION

SEC. 3. 6

f

og[t,_, + ah,(1

-9

xi

-F-

hxA r

I

X.' I

-h

59

x,,.1_ I

da

h:L f (c + dlx,' - X 1-1

x s(t,_, + ah, (1 - a)x,,,_I + ac,,,,) Now, I I x I I = I I P,,xn I I is bounded, I x,,,, I

Ix.,, - xA,

I

da

is bounded, and

(

Thus, using the uniform continuity of s(t, x), given c > 0, there exists N such that n > N implies hx.,I- I

)

Since c > 0 is arbitrary, condition 2 follows. Q.E.D. EXERCISE. Show that I x,,, the Proof of Theorem 3.5.2.

is bounded independently of n, i as asserted

We now can state the following theorem which follows immediately from Theorem 2.3.1 and the above theorems.

THEOREM 3.5.3. Let Assumptions Al - A5 be valid and let the discretization method described above be used. Then all weak limit points of at least one of which exists, minimize f. If the solution x* is unique, then, in particular, p.x,* converges uniformly to x* and the derivatives converge L; weakly. EXAMPLE [Greenspan (1965)]. Consider minimizing

f' I xI (1 + *,)'n dt,

subject to

x(0) = 1, x(1) = cosh 1

having solution x(t) = cosh I. Using h - 0.2, a maximum error of 0.046 is found, while for h = 0.01 the error is 0.0015. Q

For similar results, see Simpson (1968, 1969). 3.6. TWO-POINT BOUNDARY-VALUE PROBLEMS

The problem discussed in the previous section is of course essentially a two-point boundary-value problem for a second-order ordinary differential

60

EXAMPLES OF DISCRETIZATION

sEC. 3. 6

equation; the method described is only one of many possible for use on this problem. Another recent method of great interest is the application of the Ritz method to this problem, using certain special classes of functions as basis functions. In Section 3.7 we shall examine the general Ritz method, but in this section we wish to look at the more special problem indicated above. For clarity we shall consider only simple boundary conditions, ahhough more complex ones can be treated [Ciarlet et al. (1968a, b)]. The method has been thoroughly analyzed [Ciarlet (1966), Ciarlet et al. (1967)] for solving J=O

(-1)1"D'[q,{t)D'x(t)l = g[t, x(t)], t c- (0, 1)

Dkx(0) = Dkx(1) = 0, k = 0, I,

... , n - I

where Dy = dy/dt. The results in this general case, if q (t) Z E > 0, are more complicated to state, but just the same as those for the equation D2x(t) = g[t, x(t)], t E (0, 1) (3.6.1)

X(0) = x(1) = 0

that is, for n = 1, q0(t) = 0, q,(t) - 1; therefore, we shall present only this simpler but sufficiently representative problem.

Let the Hilbert space E = o WZ = [x; x is absolutely continuous, x E L2(0, 1), x(O) = x(l) = 0), and, for x, y E E, define <x, y> =f ` Dx(t) Dy(t) dt 0

Assume that g(t, x) is continuous in (t, x) in [0, 1) x (-oo, oo), and satisfies

(1) g(t,xx_g(:,y)}Y>-n2 (2)

if x#Y

g(t, z_y(t,Y)
if

lxl
Define the functional

f(x) = f0

(xcn g(t,

l3 [Dx(t)]2 + Jo

z) dzl dt 1

It is easy to deduce that, if x*(t) is a classical ;olution of Equation 3.6.1, then x* minimizes f over E. Clearly also x* is tie unique minimizing point, since f is convex and, in particular,

f(x + Y) >_f(x) 4- f0 ([Dx(t)] [Dy(t)] -I-Y(t)g[t, x(t)]} dt+(Y+7r2) f

`y2(t) dt a

SEC. 3. 6

EXAMPLES OF DISCRETIZATION

61

which implies

f(x* + Y) Zf(x*) + (Y + n2) J' y2(t) dt

as we have seen before. Moreover, if SM is a subspace of dimension M spanned by the functions 97 . in SM minimizing f over SM,

then there exists a unique element YAM

. .

M

0MartDr r=. which is also the unique solution of M

afd

0,

l = 1,..., M

r

that is,

Ba+G(a)=0

(3.6.2)

where a = (a ... , a,)r, B is the matrix B = ((B1)) Bra ° <9r, rt> = Jo [D,r(t)l[Dc/t)l dt G(a) = [G, (a),

G,(a) = Jo g [t,

... ,

GM(a)l T

a,p,(t) I p,(t) dt

For various kinds of subspaces SM, bounds on the error between cM and x* have been computed; the basic argument for obtaining the bound is simple. If we write Vf(x) = J(x), then since x* minimizes f on E we have J(x*) = 0, while <J(oM), tp,> = 0 since 0, minimizes f over SM. Thus 0 = <J(x*) - J(qM), qp1> = <Jx,(x* - 0M), 91,>

for some fixed xo. Defining [x, y] _ <J,,(x), y> as a new inner product, we see that 0M is the closest point to x* in SM in the sense of this inner product. Thus any theorems about how well x* can be approximated by elements of

SM can be used to lead to statements about the error x* - c6M in various

norms. Much of the theory has been developed for the case of SM being various "piecewise polynomial" subspaces, making use of the well-developed theory

of spline and polynomial approximation. For example, let P denote the

partition 0 = to < t, <

< t,,., = I of [0, I]. For m > 1, we define. a

62

sEc. 3.6

EXAMPLES OF DISCRETIZATION

class of splines H'(P) _ [q(t); tp is in C"'-'[0, 1], p(0) = c(l) = 0 and' c is a polynomial of degree at most 2m - 1 on [t;, t;+,] for 0 < i < N]. This space Ho (P) is spanned by the m(N + 2) - 2 functions S,,k(t) for 1 < i

andfori=0,N+1,1
D'S,,,k(t1) = 8,,f8k,,

for 0 < 1 < m - 1

The functions S,,k(t) are zero except in [t,_ t;+,]. For example, with m = 1, H' (P) is spanned by the N functions S,, 0, 1 < i < N, where S,, 0(t) is given by the roof function (Figure 3.1).

Figure 3.1

EXERCISE. Find the basis functions for H20(p).

By using known results about approximation (in fact, interpolation) by elements of Ho(P), we can give bounds for 0,,, - x*, as described above, where M = m(N + 2) - 2. For example [Ciarlet et al. (1967)], if

IPI= max It,+, -t,l 05(SN then if x* E CQ[O, 1] with q > 2m, then there exists a constant K suith that II Dk(om - x*) II= < KII D2ax* II. I PI2m-1,

where I I u I I = ess sup u(t) I

k = 0, 1

for u E Q0, 1). Thus, if x* E C 2[0, 1 ],

01't' we can use Ho as our subspace and find error bounds of order I PI. If x* E

C4[0,1] we can use Ho and find bounds of order I PI'. In fact, by more subtle

arguments [Perrin-Price-Varga (1969)], one can show that the order of convergence for H,' approximation is actually 1 PI2.

sEC. 3. 6

EXAMPLES OF DISCRETIZATION

63

In a practical sense, however, the above results are not meaningful unless one can compute q,,,; that is, solve

Ba + G(a) =0 For the type of subspaces we are considering, the matrix B can be assumed to be known exactly, since it is computed by integration using polynomials; using Ho spaces, the matrix in fact is a band matrix. The operator G(a), a,c,(t)j, which,we cannot perform however, involves integration of g rt,

exactly; the use of a quadrature formula gives us a computable method [Herbold (1968), Herbold et al. (1969)]. Suppose we use a quadrature formula

f s(y) dy = ' a,s(y,) y.

(3.6.3)

1=0

with error given by K,

(Yk k

yok.+1

)

S(k.)(y)

as usual. Given the partition P: 0 = to < t, <

dt = I f

f0

< t,,,+, = 1, if we write dt

and apply the quadrature formula to each subinterval, we obtain a composite quadrature formula

f

0

s(t)gr(t) dt L 6Js(t')

where M,, = k(N + 1) and the fi, and tJ are obtainable from the a; and t1. EXERCISE. Find explicit expressions for the fii and tj' in the composite quadrature formula above.

If we use the above sum to approximate the operator G(a), we get Gk(a) = [llk,l(a). .. ., vk.A[(a)J

vk,(a) =

k(E') J=0

fi/g[tl, Fi a, 1.I

and we now can solve

Ba + 0,(a) = 0

(3.6.4)

64

SEC. 3.6

EXAMPLES OF DISCRETIZATION

numerically. This, however, is the gradient equation for the functional 1

.fk(a)

M

2

= J 0 7 1D m a,c,(t)j } dt +

M

k(N+1) J=o

a,¢,(U)

,8i f

g(tj, 1) d,,

EXERCISE. Prove that Equation 3.6.4 above is equivalent to Vfk(a) = 0, where fk is as defined above.

if one assumes that the quadrature scheme of Equation 3.6.1 has a, > 0, E /, = 1), and that k E a, = Yk - Yo (which implies f, > 0 and k k(N+1) R

fl

f

1

o

q'r(t)c,(t) dt

for .l
dt exactly-then one can show -that is, our formula computes f that there is a unique a* - ak = (a1,.. . , ak M)T minimizing fk(a) for each k and solving Equation 3.6.4. The condition on the positivity of the weightsis valid, of course, for all Gauss formulas and for L-point Newton-Cotes formulas with 2 < L < 8, while the condition on integration of the polynomial 9A requires that the quadrature formula be of a certain degree. Thus what we now have is a discretization scheme. Rather than minimize

f(x) over E, we now minimize fk(a) for a E IRM; the various parameters M, N, and k must of course be related in some way. We want the vector M

M.k = 1=, ak,c, to be near x* and infact to be as accurate as

,/

96M

itself.

,/

If for some set of spaces SM we have the error bound for x* - 'YM as a function eM(x*), we shall say that the quadrature scheme is compatible if the error From the discretization viewIM,k - x* is also bounded by point, we can prove the existence of the mapping r by using interpolation of

x* in SM, and we can take p to be the obvious embedding operator. We now state what is the final result for use of the space H;(P). PROPOSITION 3.6.1 [Herbold (1968)]. Suppose we use the subspaces

Ho(PN) for a partition PN: 0 = to < t1 < . . . < tN+1 = 1, a subspace of dimension m(N + 2) - 2, where the partition PN saitsfies I PN I < L min I t,+, - t, 1 for all N. Suppose g(t, x) is so smooth that for any 051SN

la E HZ(PN), D'g[t, ip(t)] is continuous on each interval of the partition for 0 < s < ko, where ko describes the accuracy of the quadrature formula of

Equation 3.6.3 and satisfies ko > 4m - 1. Suppose that the weights a, k in the quadrature formula satisfy a, > t', E a, = Yk - Yo. Suppose x* solving the original problem is in C21e[O, 1). Then M = ,, f over Ho and `YM,k minimizing fk are related by

N+2)_2 minimizing

EXAMPLES OF DISCRETIZATION

SEC. 3.7

k) I Im =

0(I PN 12m),

65

1 = 0, 1

and thus the quadrature scheme is compatible and II LY(x* -- 0t, k) II- = 0(I PN

12- -1),

1= 0, 1

To be more specific, consider use of Ho(PN); for a quadrature scheme one could use a two-point Gaussian scheme, but we consider

5"s(t) dt

,,, Y2 6 Yo [s(Yo) + 4s(Y1) + s(Y2)],

ko = 4

Since ko Z m - 1, we deduce I I MOM -

0M,

2) 11- = 0(1 P. I2),

1 = 0, 1

For the special case of Ho we know that the error in 0,,, - x* is also 0(1 PN 12),

so we conclude that, if x* E C2[0, 1], 111Y(x* - OM. 2) 11- = 0(I PN 12),

1 - 0,1 N -->,oo

As a further example, the four-point Gaussian scheme, with ko = 8, is compatible with H2(PN), so that the error using this subspace and quadrature formula, if x* E C4[0, 1], is of order I PN 13. As a concrete example, consider D2x(t) = e"('",

x(0) = x(l) = 0

to be solved using HI(PN), PM: t; = ih, h =1/(N + 1), and the compatible four-point Gaussian integration scheme. The errors 11x* are:

3.13 x 10-' at h=-, 4.40 x 10-6 at

-

and 7.15 x 10-7 at h=g

[Ciarlet et al. (1967), Herbold (1968)]. In this special case, one can show that Ilx* - 0;,211. G Kh7"2 [Perrin-Price-Varga (1969)]. General references: Herbold (1968), Simpson (1968). 3.7. THE RITZ METHOD

The method discussed in the previous section was simply a special case of the general Ritz method. Suppose we wish to minimize a functional f(x) over a Hilbert space E. Suppose that for each n we have a finite dimensional subspace E. of E with the property that for each x in E (actually, we need this only for the point x* minimizing f),

lim d(x, E.) = 0 n--

66

SEC. 3.7

EXAMPLES OF DISCRETIZATION

where d(x, En) = min lax - x,II. We can write "min" rather than "inf" x.E E. above since in a Hilbert space the distance to a closed linear subspace is always attained, and uniquely so. For each n, we then find x.* minimizing f over E. and hope that x,* converges to x* in some sense. We can describe this easily as a discretization method. For each n let

fn - f and let p be the identity mapping; let r. be the best approximation mapping-that is, x - rnx I I = min H H x - xn 1I. If f is norm-upper semix.E E.

continuous, then this discretization is consistent. Employing the conditions of Definition 2.2.2 to check this, for condition 1 we need lim sup f(x*), which follows directly from the norm-upper semicontinuity, the definition of r,,, and the assumption that d(x*, En) approaches zero. For condition 2, we have fn(x,*) f(x,*) -- f(x.) = 0; the other conditions are irrelevant here. Thus by a simple modification in Theorem 2.3.1, we have the following; essentially the same theorem is valid for minimization over a set C, when the Ritz problem is then solved over C r) E.. PROPOSITION 3.7.1. Let f be a weakly sequentially lower semicontinuous,

norm-upper semicontinuous functional on the Hilbert space E and let f satisfy a T-prcperty. Let {E.) be a sequence of closed linear subspaces such that lim d(x*, En) = 0, and for each n let x,* satisfy

f(x.) G f(xn) -}- e for all xn E E. with Jim c. = 0. Then all weak limit points of x:, at least one of which exists, minimize f over E, and lim f(x,*) =f(x*). EXERCISE. Give a rigorous Proof for Proposition 3.7.1.

As mentioned before and shown in a previous section, if f satisfies some

type of a uniform-convexity assumption, then x* is unique and x; -. x*. As also shown in a previous sectio}t, in practice one cannot compute f precisely but must use some approxitn. tion to it; for the particular example of the preceding section-namely. two-point boundary-value problemswe saw that this still cotiki lead to satisfactory results under suitable hypotheses. Clearly we could consider this problem in greater generality via the discretization viewpoint; we prefer to leave this as an exercise and look briefly instead at some known results [Mikhlin-Smolitskiy (1967)] in this direction for the special case in which f is a quadratic functional. Suppose we seek to solve the equation

Ax = k where A is a bounded, positive-definite, self-adjoint linear operator in a

EXAMPLES OF DISCRETIZATION

SEC. 3.7

67

Hilbert space E; this equation has a unique solution, x*, which clearly must also be the unique point minimizing the functional

f(x)

- <x, k>

over E. Since f = A for all x, f is convex and in fact X

fix + h) = f(x) + + I f(x) + + 4mII h 112

where 0 < mI < A and Vf(x) = Ax - k. Thus if lim f(x,,) = f(x*), we have IIx,,-x*II2

f(x.)-f(x*)>+ 2

x*. Thus if we use the Ritz method on this f, we find that

which implies x,,

x,* - x*. We suppose that, for the Ritz method, E. is, for each n, the linear subspace spanned by V . . . , l). where (q; ...} is a complete basis for E-that E is, a set of linearly independent elements with d(x, E. is precisely equivalent to solving

f

k,,

where k E IR", k _ (, , matrix, A. = ((A, ,1)),

k>)T and A. is an n x n

A,,, = EXERCISE. Prove that minimizing f over E. is equivalent to solving Ax.* = k for A. and k as described above.

Since the (V,} are linearly independent, there is a unique solution x,, =

a,,,,qt,

for each n. We recognize however that the vector k,, and matrix A. will not be computed exactly but will involve some errors. Thus we shall compute

A' = A. + r., k', = k + b where I F I. and 18p I are assumed to be small, and I I. denotes a standard norm on RR (and its induced matrix norm). We wish to see how this affects the computed x.*.

DEFINITION 3.7.1. Let (yr ...} be a finite or infinite sequence in a Hilbert space E. The sequence is called minimal if the removal of each single element tiv, restricts the subspace spanned by the sequence. EXERCISE. Show that a finite set of linearly independent elements is minimal and that an infinite set of orthonormal elements is minimal.

68

SEC. 3.7

EXAMPLES OF DISCRETIZATION

DEFINITION 3.7.2. An infinite sequence f , . ..) in a Hilbert space E is called strongly minimal if the smallest eigenvalue with inner product

of the matrix

is bounded below by a positive number independent of n. EXERCISE. Show that any orthonormal sequence is strongly minimal and that a strongly minimal sequence is a minimal sequence.

The following result is known [Mikhlin-Smolitskiy (1967)]. PROPOSITION 3.7.2. If the coordinate system f q , ...) is strongly minimal in E, then the solution by the Ritz method is stable under small variations in the matrix and right-hand side; that is, if A;x'. = k;, A. = A. + F,,, and k;, = k + b,,, then for I r.1. and 16. 1. sufficiently small there exist constants c, and c2 independent of n such that I1

x.-x:IlSc1 Ir.I.+c216-1,,

We still are not being realistic computationally, however, since we are only considering x., the exact solution of the perturbed equations. In attempting to solve the linear system A. x'.

= k;, we shall of course make further

errors. The size of these errors, roughly speaking, varies directly with the condition number of the matrix A' -say, the ratio of its largest to smallest eigenvalues. Even for a general strongly minimal system, it is possible for this ratio to grow without bound as n increases. One can say the following, however, for strongly minimal sets of a special nature [Mikhlin-Smolitskiy (1967)] :

PROPOSITION 3.7.3. Let B be a positive-definite self-adjoint bounded linear operator on E which satisfies m, < < M,

for all x E E

with m, > 0, M, < 00. Suppose the complete basis (c ...) satisfies

i=j

I

if

0

if i

J,

i,JZI

Then the basis is strongly minimal and moreover the condition number of A is bounded by M, /m, and thus that of A' is uniformly bounded for I F.1. and 16.1. sufficiently small. In particular, these statements hold if f 9p, ...) is a complete orthonormal basis in E.

SEC. 3.7

EXAMPLES OF DISCRETIZATION

69

The Ritz method has been very popular for use in solving certain differential equations of physical interest, and much theory has been developed in this area. For more details and excellent examples the reader is referred to Mikhlin-Smolitskiy (1967). A special feature of the finite dimensional subspaces E., namely that E. is the span of Vl, . , p., allowed us to derive the special results above; often, however, one does not have such expanding subspaces. For example, in ER', if E. is the set of piecewise linear functions on [0, 11 with nodes at i/n, i = 1, 2, . . ., n - 1, we have no such expanding basis; another feature in this case makes analysis easy as we shall now see. More generally, in ER', let P(x)

be a function with compact support and define

for x in RR' and 1-integers j; if we let E. (for h = I/n) be the space of linear combinations of these functions, we have the finite-element method. Some steps have been taken to analyze this very general method [Fix-Strang (1969)]. For example, on the sample problem

-ux'xi - uxu. + u =f(xl, x2) in IR2, the relevant square matrices important in the Ritz method (finiteelement method) have uniformly bounded inverses (in the 1Z norm) if and only if there is no 0a in IR2 such that 0(2 rj + 0o) = 0 for all 2-integers j, where 0() is the Fourier transform

f

dx

Moreover, the resulting numerical method is convergent if and only if for some

integer p > 1, 0(0) $ 0 but 0 has zeros of order at least p + I at

= 2a j

for all other 2-integers j. More widely applicable theory is under development. EXERCISE. In 1R' rather than {R2, find some functions r having Fourier transforms satisfying the above necessary-and-sufficient condition for convergence.

4

GENERAL BANACH-SPACE METHODS OF GRADIENT TYPE

4.1. INTRODUCTION

We now wish to consider iterative methods for minimizing a functional f in some real Banach space E; primarily, we shall be concerned with the unconstrained problem-that is, minimization over all of E-but we shall also briefly consider methods for the constrained problem when they are natural extensions of earlier methods. If f is differentiable, then from the formula dt

f(x + tp)

we see that f is instantaneously decreasing most rapidly in the direction p (that is, with I I p II = 1) if

° -11 Vf(x) II

In a Hilbert space this gives

Vfx)

p II

f(x) Ti

the direction of steepest descent [Cauchy (1847), Curry (1944)]. More generally [Altman (1966a, b)], we consider a steepest-descent direction to be any direction p E E, 1 1 p1 1 =1, such that = -11 Vf(x)11. If the unit sphere in E is strictly convex-that is, if II x II = I I Y II = 1 and 0 < A < I imply

ll Ax + (1 - A)Y II < 1-then such a direction p is unique. The function f of course instantaneously decreases in any direction p

satisfying < 0. We shall consider iterative methods which, at each point x in the iterative sequence, provide such a direction p 70

GENERAL BANACH-SPACE METHODS OF GRADIENT TYPE

SEC. 4.2

71

along which we move to the next point x.., = x + t.p where the distance t of movement must be expeditiously chosen. We must be sure that the directions do not become nearly orthogonal to Vf(x) too rapidly. DEFINITION 4.1.1. A sequence of vectors

if and only if < 0, and II P.(x.) 11 P.(X.)71

Vf(x.))

will be called admissible II --. 0 whenever

-- 0

For example a sequence of steepest-descent directions is admissible, where B: E* --' E satisfies as is a sequence generated by p.(x.) _ > m II c 112, m > 0, for all c E E*. We shall be able to analyze the iteration x + t.p.(x.) for admissible direction sequences (p.(x.)) and various methods of choosing t > 0. 4.2. CRITICIZING SEQUENCES AND CONVERGENCE IN GENERAL

Throughout this chapter we shall denote by W(xo) the following set:

W(xo) = the intersection of all norm-closed convex sets containing L(xo) - (x; f(x) < f(xo)} as a subset Thus W(xo) is the closed convex hull of the level set L(x,,). The problem of minimizing f over E can of course be reduced to that over W(x,,), which we shall often assume is bounded. We shall always assume that f is bounded below, so that we can speak of trying to minimize f. The goal of our analysis of each method will be to compute a sequence (x.} such that (f(x.)} is decreasing, hopefully toward the infimum off. Generally we shall discover that, for some 6 > 0, we have

f(x.) -

f(x.+ l) ?

-S
Vf(x.)>I l p.(x.) I I-1

If f is bounded below, then f(x.) - f(x.+,) and hence

must converge to zero; from the admissibility of we can then conclude that II Vf(x.) II 0. If fix.) -- inf f(x), we call (x.) a minimizing sex0

quence; if Vf(x) = 0, x is called a critical_'point [Vainberg (1964)]. Thus we are led to the following definition. DEFINITION 4.2.1. A sequence (x.} is called a criticizing sequence if and only if 11 Vf(x.) II -- 0. Thus our numerical methods to be discussed will provide us merely with

72

SEC. 4.2

GENERAL BANACH-SPACE METHODS OF GRADIENT TYPE

criticizing sequences; we wish to know under what circumstances this yields indeed a minimizing sequence. is a criticizing sequence for THEOREM 4.2.1. If W(x,,) is bounded and is a minimizing sequence. the convex functional f bounded below, then f(y.) = inf f(x). By the Proof: Let y E W(xo) be chosen such that lim XXCE convexity of f in -E we have from Proposition 1.5.1 that

f(y.) -f(x.) >_ Since II y. - x. II is bounded and II

inf f(x) < lim inf f(x,,) < lim sup f(x) x

0, we have

II

S

xE

point 0 x* implies and for all y E E, then each criticizing sequence converges weakly to the unique point x* minimizing f over W(xo). Proof: Since W(xo) is a convex, bounded, norm-closed subset of reflexive space, it is weakly compact and thus some subsequence x,,, , x' E W(xo).

E

Then since II Vf(x.) II - 0, for each y in E we have 0 -= lim =

and hence Vf(x') = 0. This then implies x' = x* for every weak limit point x*. Q.E.U. x' of {x}, and hence x Further results on the convergence of criticizing sequences can be obtained by considering properties of Vf, as in Theorem 4.2.2, or by using Theorem 4.2.1 in conjunction with statements about minimizing sequences in Section 1.6. Therefore, in what follows we shall often go no further than making statements about criticizing sequences. EXERCISE. Derive further results concerning the convergence of criticizing sequences by considering properties of Vf, as in Theorem 4.2.2, and by using Theorem 4.2.1 together with results in Section 1.6.

We shall often prove that certain criticizing sequences satisfy lim 11 x.+

- x.l l =

0

a fact which is useful in many cases since such a sequence cannot have two

distinct norm-limit points x' and x" unless there is a continuum of limit

SEC. 4.2

GENERAL BANACH-SPACE METHODS OF GRADIENT TYPE

73

points "connecting" x' and x". For, if there are only finitely many limit points x"), . . . , x'^'), then there is an c > 0 with II x111 - x'J II > E if r j; for large enough n, x must lie in some one of the spheres of radius e/3 0, this implies in fact about the x"', i = 1, 2, ... , N. Since JJ x.11 that all the x must be in some one fixed sphere, since to jump to another x. J J > e/3 which is never true for large n. Although we shall improve this result later, we have proved the following theorem. one requires I I

THEOREM 4.2.3. If Vf(x) is norm-continuous in x, if Vf(x) = 0 has only finitely many solutions in W(xo), and if [xn} is a criticizing sequence with JJ

x JJ --+ 0, then

either has no norm-limit points or x, -. x*

with Vf(x*) = 0. We do not wish to give the impression that the only way to treat minimization is from the criticizing-sequence point of view; other approaches also

can be taken. For example [Yakovlev (1965)], suppose the directions p are generated via p = where for each n, H. is a bounded, positivedefinite self-adjoint linear operator from the Hilbert space E into itself. Suppose

0 < a < S A for all x, y in E, and suppose we take

0 < E, < t

X.+1 = X. + taPs'

A

- EZ

Then f of course is uniquely minimized at a point x* and one can prove that x, -- p x*. Arguing much as we shall in Section 4.6, one can show that f(x,,

1) -Ax.) <

t (1

- tz )

and

x*), x - x*>

f(x.) -f(x*) < 4 aZ

Vf(x,.)>

Therefore,

f(x.+,) -f(xJ

r2 } AZ[f(x,)

-f(x*)]

74

GENERAL BANACH-SPACE METHODS OF GRADIENT TYPE

SEC. 4.2

which yields

f(x.+,) - f(x*) _< q[f (x,.) - f(x*)]

for a certain q < 1, since 0 < E, < t.< (2/A) - E2. Thus f(x,,) - f(x*) <_ q"U(xo) - f(x*)] and the sequence is a minimizing sequence. Since

f(xJ _ f(x*) ?

-T
x*), x - x*>

x*. Thus convergence proofs can be given by estimating it follows that x. f(x*) directly rather than using the criticizing-sequence approach. It is true, however, that the direction sequence so generated is admissible, implying that the analysis is possible from either viewpoint. Historically, most convergence proofs for the step-size algorithms we are

about to consider have been performed by contradiction; this is rigorous but often not intuitive. Recently it has been shown [Cea (1969)] that a single unifying principle can be used to analyze directly many methods; we shall use an extended version of this approach whenever possible. DEFINITION 4.2.2. A function c(t), defined for t > 0, is called a forcing

function if and only if c(t) > 0, and

can converge to zero only if t

converges to zero. EXERCISE. Give some examples of forcing functions.

Throughout this chapter we shall be assuming that Vf is uniformly continuous on W(xo); this implies that for every c > 0 there exists 6 > 0 such that II x - y I{ < b implies II Vf(x) - Vf(y) 11 < e for x, y E W(xo). In particular, we can let b = s(E), where s(t) is the forcing function (reverse modulus of continuity) defined by the following. DEFINITION 4.2.3.

,f(t) = inf {II x - y li; x, y E W(xu), II Vf(x) - Vf(y) II > t} EXERCISE. Prove that s(t) is a monotone nonincreasing forcing function and that we can set b = s(E) in the description above of uniform continuity of Vf.

In terms of these concepts, we can prove the following theorem, which will be our fundamental tool in subsequent sections.

GENERAL BANACH-SPACE METHODS OF GRADIENT TYPE

SEC. 4.2

75

THEOREM 4.2.4. Let f be bounded below on W(xo), let Vf be uniformly be an admissible direction sequence. continuous on W(xo), and let p = p c2(t) such that c,(t) and t - c2(t) are forcing E et there exist functions c,(t) and

functions. (A) If the step sizes satisfy \\

IIP.II

`'((-Vf(x.).

then

is a criticizing sequence. (B) If t;, is any step size so that ``((-Vf(x.).

II P.

I()) <_ t.IlP.ll <_

IIP.II))]

LL

and if x.+, is chosen as any point such that

f(x.) - f(x.+;) >_ f[f(x.) - f(x,, + t' p.)] for f > 0 then

is a criticizing sequence. In either case A or B, f(x.) - f(x.+,) >_ At. I I P. I I [Y. - c2(Y.)] >_ 2c, (Y.)[Y. - c2(Y.)]

where

Vf(x.), P-

Y.

and A = I or f in cases A or B, respectively. Proof: First we consider case A. Since t II p. 11 < s[c2(y )], we have (Vf(x. + IPA) - Vf(x.),

II P.

II

c2(Y.)

0
for

and hence

(-Vf(x. + IPA), IIP.II) > y. Since <-Vf(x. t. I I P. I I > c, (Y.), we have

for

0
+ tp.),

t in (0, t.) and

f(x.) - f(x.+) ? I. I I P. I I IF. - c2(YJ] >_ c, (Y.) [Y. - c2(Y.)] >_ 0

Hence y. - 0 and

is criticizing. Now consider case B. We can, by case A,

write

-f(x.+1) > lc,(R)[Y. - c2(7.)] and we are again done. Q.E.D.

76

SEC. 4.3

GENERAL BANACH-SPACE METHODS OF GRADIENT TYPE

Remark 1. We point out that we can get by with weaker assumptions than the admissibility of the direction sequence. In Theorem 4.2.4 we found that

f(x.)

-f(xN+1) >_ CO.)

for some forcing function c(t). This implies that

E C(y,) <=0 Fi [Ax.) -f(x.-!)J < 00

w=0

and thus we only need choose directions such that E c(y). < oo implies

.o

11 Vf(x) I I --, 0. In particular, in the theorems in the next sections, we shall

find c(t) = const ts(t), where s(t) is the reverse modulus of continuity of Vf ; if Vf is Lipschitz-continuous, we have s(t) = const t and hence c(t) _ const t2. Therefore, if

a; = oo where a =
then

11

1(x,,)11 lip. 11

c(y,.) < oo implies E a.2 11 Vf(x.) 112 < oo which implies 11 Vf(x.) 11

0.

Generally speaking, however, we do not feel that this analysis is applicable to many direction algorithms; in our experience, direction sequences used in practice are usually admissible in our sense. We shall not, therefore, state theorems based on the fact that E c(y.) < oo ; the reader should, however, be .-0 aware of this approach.

Remark 2. In the following sections we shall be proving that various choices of t. yield criticizing sequences. By part B of Theorem 4.2.4, there is always the obvious corollary concerning the choice of x..,; although this is a useful fact since, in particular, it indicates that t. need not be found exactly, we shall not bore the reader by continually stating this corollary. It should, however, be remembered. 4.3. GLOBAL MINIMUM ALONG THE LINE

We consider first intuitively the most natural way of choosing t., by minimizing f(x. + tp.) as a function of t > 0; we assume that such t. always exists. We prove a more general theorem. EXERCISE. Prove that t exists if W(xo) is bounded.

THEOREM 4.3.1. Let f be bounded below on W(xo), let Vf be uniformly

continuous on W(xo), and let p. = p.(x.) define an admissible sequence of

GENERAL BANACH-SPACE METHODS OF GRADIENT TYPE

SEC. 4.4

77

directions. For a set of numbers a, E [0, a] with a < 1, choose t. so that

f(x. + t. P.) - a.t. Sf(x. + tP,,) - a.t for all t > 0. Then

is a criticizing sequence and

f(x.) - f(x.+,) > s(CY.)( I - c)y. for all c in (0, 1 - a), where we let

y. = t\--Vf(x.),11 XTI

\\

Proof. Since t mini mizes f(x,+
P.\/

we have

- a. -- 0

For any fixed c in (0, 1 - a), if t lip. 1I < s(cy ), we would have \[Vf(x.

+ t. P.) - a.Vf(x.)] - [Vf(x.) - a.Vf(x.)], H

II)

I

c3'

This would then give (1 - a)y, < (I - a.)y,. < cy1, a contradiction to c E (0, 1 - a). Therefore, t I I P. I I ? s(cy ). By part A of Theorem 4.2.4 with s(ct) and c2(t) = ct, the method generated by t' I I p. II = s(cy.) c, (t) yields a criticizing sequence. By the defining property of t,,, we have

f(x.) - f(x. + t.P.) >_ f(x.) - f(x. + t;,P.) + M.O. - t:)<-Vf(x.), P.> >_ f(xJ -Ax. + t:P.) since we showed above that t > t'. The theorem now follows from part B of Theorem 4.2.4 with f = 1. Q.E.D. EXERCISE. Assuming Vf to be Lipschitzian, apply the approach of Remark I after Theorem 4.2.4 to derive another convergence theorem for the method of Theorem 4.3.1.

Remark. Setting a - a = 0 yields the usual method. General references: Altman (I966a), Cda (1969), Elkin (1968), Goldstein (1964b, 1965, 1966, 1967). 4.4. FIRST LOCAL MINIMUM ALONG THE LINE : POSITIVE WEIGHT THEREON

The problem of locating the absolute minimum along x + ip is quite difficult unless one knows, for example, that every local minimum is a global minimum. In any case, it would be simpler to seek the first local minimum-

78

GENERAL BANACH-SPACE METHODS OF GRADIENT TYPE

SEC. 4.4

say, by using a one-dimensional root-finding method to locate the first root

of = 0; since such a root may give only an inflection point, we must analyze this possibility also. EXERCISE. Find conditions under which the first local minimum with respect to t along x + tp" is a global one.

Actually, as we saw in Remark 2 following Theorem 4.2.4, it is not vital to reach the local minimum exactly. We analyze some additional ways to describe how close one need come. THEOREM 4.4.1. Let f be bounded below on W(xa), let Vf be uniformly

continuous on W(xa), and let p = p.(x.) define an admissible direction sequence. Let t be either (1) the smallest positive t providing a local minimum for

f(x. + tP.) - a.t or (2) the first positive root of - (%" = 0

or (3) the following:

p,> < 0 for 0 < z < t}

I- = SUP {t; KVf(x. + ?p.), A,> -

We assume 0 < at. < a < 1. Let

x+

then (x") is a criticizing

sequence and

f(x.) -f(x..,,) >_ s(cy,.)(l - c)y,. for all c E (0, 1 - (x) with Vf (x")'

I I P I I\

Proof. In any determination of t", clearly - a. = 0 and

dt [f(x. + tP.) - a.t ] G 0

for 0 < t < t", implying f(x. + t. P.) - a.t. (Vf(x"), P.> for 0 < t < t.. The proof is now identical to that of Theorem 4.3.1. Q.E.D.

GENERAL BANACH-SPACE METHODS OF GRADIENT TYPE

SEC. 4.4

79

EXERCISE. As a generalization of Theorems 4.3.1 and 4.4.1, show that t may be chosen as any number satisfying

- a, = 0 and

f(X. + t.P.) - ante

for 0
As a special case, one may take an - a E [0, 1), a method often called the generalized Curry method in recognition of Curry's result [Curry (1944)] with a = 0 in which one essentially seeks the first local minimum of f(x,, + tp,). We state this as a corollary. COROLLARY 4.4.1. Let f be bounded below on W(xo), let Vf be uniformly

continuous on W(x(,), and let p, = p,(x,) define an admissible direction sequence. Let t be defined as either (1) the smallest positive t providing a local minimum for f(x, + tp,); or (2) the first positive root of

= 0 or (3) the following:

t, = sup {t; < 0 for 0 < T < t}

Let x,+, = x, + t, p,; (then {x,} is a criticizing sequence and

f(xn) -f(x.+1) > s(cy,x1 - OF. for all c in (0, 1) with

Vf(x.),NP.IIi Another method of describing the weight on the minimum is as follows. First, choose a t.' precisely as t, is determined in the Theorem 4.4.1, above,

with 0 < a, < a < 1; then define t, = 2,t' for an appropriate relaxation factor 2,. THEOREM 4.4.2. Under the hypotheses of the preceding theorem, let

t.' be determined as is t, there. Let t, = 2,t; where d(y,) < A. d(y.)s(c J(1 ,,

c)Y.

for all cin(0,1-a). Proof.- From the proof of the preceding theorem we know that t;, 11 p, I I

s(cy,) for all c in (0, 1 - a); therefore, t;, I I p. I I >_ t, I I p. I I >_ d(YJs(cy.)

II p, I I - d(y,)s(cy,) yields a criticizing sequence by part A of Theorem 4.2.4 with c,(t) - d(t)s(ct) and c2(t) - ct. Since Therefore,

dt l f (x, + tp,) - a,t ] < 0 for 0 < t < t, then

f(x.+J - a.t. and thus

f(x, -f(x.+) -2! Ax.) -f(X. + t.p.) yielding convergence by part B of Theorem 4.2.4 with f = 1. Q.E.D. General references: Altman (1966a), Elkin (1968), Goldstein (1964b, 1965, 1966, 1967), Levitin-Poljak (1966a). 4.5. A SIMPLE INTERVAL ALONG THE LINE

In some cases it is possible to write down beforehand a simple interval from which t, can be chosen arbitrarily, guaranteeing the generation of a criticizing sequence. If Vf satisfies II Vf(x) - Vf(y) II < L II x - y II in W(xo)

and L is the best such constant, then we have from Definition 4.2.3 that s(t) > t/L. Our Theorem 4.2.4 then tells us that the choice of t, so that, for example,

EiY.0, Ez>0 and

GENERAL BANACH-SPACE METHODS OF GRADIENT TYPE

sEC. 4.5

Y. --

81

Vf(X.), IIP.II)

will yield a criticizing sequence. By more careful analysis, the size of this interval can be doubled, as we now proceed to show. THEOREM 4.5.1. Let f be bounded below on W(xo), let Vf satisfy

IIVf(x) - Vf(y)II >_ A. II Vf(x.)112,

II P.(x.) II <_ A, I I Vf(x.) 11,

A2 > 0

Then if 8, > 0 and b2 > 0 and if t is chosen with 2A b,
then the sequence

0, and

is criticizing, I I x.., - x. I I

f(x.) - f(x.+,) > E I I Vf (xJ 112

for some E > 0

Proof:

Ax..') =f(x.) + f o da =f(x.) +
+

o

da

f (x.) - t.A21I V f ( x. ) I I 2 + Lt, l i p . 112 f

Sf(x.)

t.IIVf(x.)112[A2

- t. 2

da 0 o

A11

For t in the given interval, the term'in brackets is bounded away from zero below, so and I I Vf(x.) I I - 0. Since I I P.(x.) I I <_ A, I I Vf(xJ I1--) 0

and I t I is bounded, II x.+. = x. I I = I I t.P.(x.) Il

"0

Q.E.D.

EXERCISE. Apply the approach of Remark I after Theorem 4.2.4 to find another convergence theorem for the method of Theorem 4.5.1.

82

SEC. 4.5

GENERAL BANACH-SPACE METHODS OF GRADIENT TYPE

A similar simple range is given as follows.

THEOREM 4.5.2. Let f be bounded below on W(xo), let Vf satisfy

IIVf(x)-Vf(y)II<_Lllx-yII and let X. -

where p - p

is an admissible sequence of directions with I I P I I = 1

and

0<6,
The

I

2

x I I -- 0, and

e I I Vf(x.) II2

for some e>0

Proof: Proceeding just as above via integration we find -LTA2

4LTJ

f(x,), and and II Vf(xA)1I converge to zero.

Thus Finally, II

x. I I = I I T.
I I <_ L 1
10

Q.E.D.

Remark. It is a simple matter to allow a slightly larger range for T. in Theorem 4.5.2-namely,

s(<-Vf(x.), R,>) < T < G

P.>)

where 0 < s(t) < 1, and s(t) is a forcing function [Elkin (1968)]. EXERCISE. Prove the assertion in the above Remark. Finally, we state one more result giving a simple range but not depending on a priori knowledge of Lipschitz constants.

THEOREM 4.5.3. Suppose f is a convex functional bounded below on W(xo) and such that Vf(x) is uniformly continuous in x in W(xo). Let p = be an admissible sequence of directions with II PA II = 1, and pick

8 S2 satisfying 0 < 8, < b2 < 1. Let mined bv:

xq + T p where T is deter-

sEc. 4.5

GENERAL BANACH-SPACE METHODS OF GRADIENT TYPE

83

1. if - P,), p >, then set T. _ <-Vf(x.), P.>, 2. otherwjse compute T., which always exists, to satisfy 0 < T C - and

-61 < - < -a2 Then {x.} is a criticizing sequence and I I x.+, - x. I

0.

Proof: By Proposition 1.5.1 we have > 0 if

t" > t'

Therefore,

f(x.) -f(x,, 1) = > -T. > T.b,<-Vf(x.),P.>

Thus tf(x.)} is decreasing and hence convergent and T.S,<-Vf(x.), p.> tends to zero. If infinitely often we have

- > e > 0 then it cannot occur under condition l since then we have

f(x,)

-f(x.+1) >_ 612 >_ 6"F1

in contradiction to the boundedness below off. Under condition 2, however, we have

E(1 - b2) < (1 - a2)<-Vf(x.),P.> S the right-hand side of which tends to zero since Vf is uniformly continuous and since T. must tend to zero if <-Vf(x.), p.> does not. This gives a contradiction, leading us to conclude that <-Vf(x.), p.> and hence I I Vf(x.) II tends to zero. Finally, Ii x.+1 - x.lI = II T.P.II = I T.I _< I I

0

Q.E.D.

General references: Altman (1966a), Elkin (1968), Goldstein (1964b,

1965, 1966, 1967), Levitin-PoIjak (1966a).

84

GENERAL BANACH-SPACE METHODS OF GRADIENT TYPE

SEC. 4.6

4.8. A RANGE FUNCTION ALONG THE LINE

We shall now describe another way of selecting IN by making use of a function g(x, t, p) which will determine the range of values t can assume.

The method is similar to that in Theorem 4.5.3 except that a different measure of the distance to be moved is used. The main idea is to pick t to guarantee that the decrease in f dominates

as discussed in Section 4.2. We shall determine admissible values of tM in terms of the range function

g(x, t, P) = (x) - f(x + zP)

-t< f(x),P>

which is continuous at t = 0 if we define g(x, 0, p) = I. We shall assume that an admissible sequence of directions pp is given satisfying II p II = 1. Given a number 6 satisfying 0 < 6 < # and a forcing function d satisfying d(t) < 8t,

we shall attempt to move from x to x,+, = xn + and x,-= x. + p. we find g(x., then we set and also

t.

as follows: if, for

d(<-Vf(xN),

P

x;; otherwise, find t E (0, 1) satisfying Equation 4.6.1 g(x,,, t., ,r

d(<-Vf(x ), p.>)

(4.6.2)

First we observe that this algorithm is well defined. Since g(x., 0, p.) = 1, and

1 - dtt) > !At)

for all t

if we have g(x,,, 1,

where z = <-Vf(xj,

d(z) z

d(zz

then by the continuity of g(x,,, t, pN) in t there exists

z)

< g(x., t,,,

I

- d(z)

Now we prove the convergence of the method.

sEC. 4.6

GENERAL BANACH-SPACE METHODS OF GRADIENT TYPE

85

THEOREM 4.6.1 [Elkin (1968)]. Let f be bounded below on W(xo), Vf(x)

be uniformly continuous in x in W(xo), and p, = p,(x,) give an admissible sequence of directions with I I p. I I = 1. Let d(t) be a forcing function with x', d(t) < at, 0 < 6 < 4. If, for t = 1, Equation 4.6.1 is valid, let x. + p,. Otherwise, find t, E (0, 1) satisfying Equation 4.6.1 and Equation 4.6.2. Then {x,} is a'criticizing sequence, and

f(x.) - f(x.+,) > A,d(<-Vf(x.), p.>) where A. = I if t = 1 and A. = s(d(<-Vf(x,), p,>)) if t :p6 1, where s is the reverse modulus of continuity of Vf. Proof: By Equation 4.6.1, f(x,) is decreasing and

f(x.) -f(x.+,) > t.d(< - Vf(x.), p.>)

(4.6.3)

If t = I does not satisfy Equation 4.6.1, then t, E (0, 1). For these n, we write

f(x.+,) -f(x.) _ for some A. E (0, 1) Thus, from Equation 4.6.2,

d(<-Vf(x.), p.>) < g(x., t., P.) -II x.), P. x.), P.

d(<--Vf(x.), p.>) and hence

t. = II x.+, - X. II > II A.t.P. II > soI Vf(x. + 2.t.P.) - Vf(x.) II] > s[d(<-Vf(x.), P.>)] (4.6.4) Hence, using Equation 4.6.3, we conclude that

f(x,) - f(x.+.) >_ d(<-Vf(x.), P.>)s[d(<-Vf(x.), p.>)] Thus

f(x.) -f(x.+) >_ A.d(<-Vf(x.),P.>) as asserted, which implies, as before, <-Vf(x,), p,> --, 0. Q.E.D.

86

GENERAL BANACH-SPACE METHODS OF GRADIENT TYPE

SEC. 4.6

Computationally one needs a procedure for computing a tM E (0, 1) satisfying Equations 4.6.1 and 4.6.2 if tM = 1 does not satisfy Equation 4.6.1. We consider doing this [Armijo (1966), Elkin (1968)] by successively trying the values

tM = a, a2, a', ... , for some a E (0, 1) THEOREM 4.6.2. Under the hypotheses of Theorem 4.6.1, t may be chosen

as the first of the numbers a°, al, a2, ... satisfying Equation 4.6.1, and then (x,) is a criticizing sequence,

f(xM) -

f(xM+ 1) >_ 1d(<- Vf(xa), PM>)

where

A. = 1

if tM = 1, aM = as[(1 - a)<-Vf(XM), PM>]

ift,$1. Proof. As in the previous theorem, tM = 1 yields no problem. in the other case, we have xM+1 = xM + alp,,, j > 1. Let xM = x, + a' 'p,. Then we have

f(x,) - f(xr) ) f(XM) - f(xM+ t )

II

XM+ I - xM I I d(< - Vf(x,), PM>)

Therefore,

f(x,+,) -

f(x:) < (1 - a) I I xM - xM I I d(<-Vf(XM), PM>)

We can write

f(xa+,)

-{y J X.,) =

This leads to > -d(<-Vf(XM),PM>)

PM>

Hence II

Vf[2MXM + (I - ~M)xn+I]

>-

- Vf(x,) II

> (1 - a)<-Vf(XM),P.>

(4.6.5)

We then have I I XM+

1

X. I I

a 112M x, + (1 -

1 - XM I I >_ as[(1 -

aK-Vf(x,), PM>]

GENERAL BANACH-SPACE METHODS OF GRADIENT TYPE

SEC. 4.6

87

Therefore, from this and from Equation 4.6.1 we have f(xw)

-f(x.+1) > as[(1 - 6)<-Vf(x.),P.>]d(<-Vf(xw),Pw>)

0. Q.E.D.

which implies that

In particular, one can consider this algorithm with d(t) = 8t and, instead of Equation 4.6.2, the stronger condition

1-6

g (x,,, t.,

This method has been considered often [Goldstein (1964b, 1965, 1966, 1967)].

The two theorems above can be extended somewhat. For example, rather than demanding, in Theorem 4.6.1, that lip. II = 1, suppose we assume that lip. 11 >_ d,

Vf(x.).

I I P.11

))

for some forcing function d and that

(_Vf(x ) , ..

P.

lip. II

tends to zero whenever

d(<-Vf(x.), p.>) IIP.II

tends to zero. EXERCISE. Show that the latter condition immediately above is valid, for example, if 11 p. II is bounded above or d(t) = qt, q # 0.

Looking at the proof of Theorem 4.6.1, we see that under these conditions Equation 4.6.3 with tw = 1 becomes

f(xw) - f(x.+J >_ d(<-Vf(x.), Pw>) = d((-Vf(x.3,

II P. Pw

fl>

IIP.II)

so that either (-Vf(xw),11P.

II

or

lip. 11 >_

d1\-Vf(x.).

IIp. [I))

must tend to zero, yielding U Vf(xw) II -- 0. For rw E (0, 1), Equation 4.6.4 becomes instead t. I I P. II >_

Sd(<-ol(xw), Pw>)

IIP.II

88

GENERAL BANACH-SPACE METHODS OF GRADIENT TYPE

SEC. 4.6

and thus d(<-Vf(x.), P.>) d( - Vf(x.), PR )

f(x,) - f(x.+ 1) Z s

IIP.I

IIPRII

which implies that d(<- Vf(xw), p.>) II P. II

and thereby II Vf(x,) II tends to zero. Thus we have proved the following corollary. COROLLARY 4.6.1. Theorem 4.6.1 is valid [except for the bound on

f(x,+,)] with the assumption that Iip.Il = I being replaced by I. Ip.II >_

11 d,\(-Vf(x.),

IIP.II/)'

and

2. (- Vf(x,J, I I - \

0 whenever d. (<- Vf(x.), Pte.) -, 0 IIP.11

Looking next at Theorem 4.6.2, we see that the case to = I follows as above. For t. E (0, 1), Equation 4.6.5 becomes

IIP.II IIVf[2.x. + (1 - R)xA+IJ - Vf(x.)II z (1 - aX-Vf(x.),P.> and thence 1 xw+ 1 - X. 11 >

P.

L(I - b) (- Vf(x,),

lip. 11

»

and

f(x.) - f(x.+l)

acsl (I - O)(_Vf(x.),II l

\

1]

<-V ( .), A

. )

Ij

Thus we have proved the following two corollaries. COROLLARY 4.6.2. Theorem 4.6.2 is valid [except for the bound on

f(x,) - f(xx+,)] with the assumption that IIp.II = I being replaced by conditions I and 2 in Corollary 4.6.1. COROLLARY 4.6.3 [Armijo (1966)]. The conclusions of Theorems 4.6.1 and 4.6.2 are valid for the method defined by p. = -Vf(x,) and d(t) = bt, 6 E (0, 11-that is, with tw determined so that

f(x,,) - f(x. - twvf(xj) c

Also, I l x.+ I .- X.11

0.

` II Vf(xw) 112

GENERAL BANACH-SPACE METHODS OF GRADIENT TYPE

SEC. 4.7

89

Proof.- We may take d, (t) - t for condition 1 of Corollary 4.6.1. For condition 2, d(< II

Vf(x.), p../) = a II Vf(x.) I I

IIP.II xn II = to II Vf(x.)II

II Vf(x.)II -0

Q.E.D. In all of the above, note that if II Vf(xn) 11 > E 11p. 11, c > 0, then II

x. II , 0, since to is bounded; this is true in particular for p _ -Vf(x.), as we saw above.

General references: Altman (1966a), Elkin (1968), Goldstein (1964b, 1965, 1966, 1967). 4.7. SEARCH METHODS ALONG THE LINE

In actual computation it is of course necessary to deal with discrete data; this means, for example, that one cannot generally minimize f(x;, + tpn) over all t > 0 but only over some discrete set of t-values. In this section we shall indicate how, in some cases, we can guarantee convergence for practical, computationally convenient choices of step size. For theoretical analysis, we shall restrict ourselves to strictly unimodal functions-that is, to those that have a unique minimizing point along each

straight line; from Section 1.5 we know that this is equivalent to strong quasi-convexity. EXERCISE. Prove the equivalence of strict unimodality and strong quasiconvexity as asserted above.

This equivalence implies that if we have three t-values t, < t2 < t, such that

f(x + t2 P) < f(x + t, p) and f(x + t2p) < f(x + t,p), then f(x + tp) is minimized at a value of t between t, and t,. EXERCISE. Prove the preceding assertion concerning the location of the t-value minimizing the strictly unimodal function f(x + tp).

We combine this fact with Theorem 4.4.2 for a. = a = 0 to prove the following.

THEOREM 4.7.1. Let f be strongly quasi-convex and bounded below on W(x,), let Vf be uniformly continuous on W(xa), and let pn = define an admissible direction sequence. Suppose that for each n there are values such that tn, 1 , tn.29 ' ' ' f tn.ko

90

GENERAL BANACH-SPACE METHODS OF GRADIENT TYPE

SEC. 4.7

f(x.) >f(x. + t..tP.) > ... >f(x. + t., k. P.)
for some constant A. Then either t. = t..k._ I or t = t..k. and makes [x.} a criticizing sequence with

x. + t. p.

f(x.) - f(x.+ l ) < 1s(cy.)(l - C/!. for all c in (0, 1), with P.

Y.

IIPII/

Proof. The point t. providing the first local minimum for f(x. + tp.) must satisfy t..k.- I < t. < t..k.+,. Therefore, t,,,k._ I = where

>t..k.->>2 ,

tn. k.+1

and 2. < 1. Thus) Theorem 4.4.2 with d(t) = 2 implies our theorem for t. = t..k._I- Since

f(x. + t..k.P.)
COROLLARY 4.7.1. Under the hypotheses of Theorem 4.7.1, if in addition t..,+, h. for all i, then k. > 2 is sufficient to guarantee that t. _ (k. - 1)h. or t. = kh. will make (x.) a criticizing sequence. Proof. In this case,

t..k.-I =k. - 1> 1 =2 3 k. + 1

t..k.+I Q.E.D.

COROLLARY 4.7.2 [Cea {1969)]. Under the hypotheses of Theorem 4.7.1, if in addition

f(x. + 2h.p.) > f(x. + h.p.) 0, then t. = h. makes {x.} a criticizing sequence.

SEC. 4.7

GENERAL BANACH-SPACE METHODS OF GRADIENT TYPE

91

Proof: If f(x. + 3 h;.p.) f(x. + h: p.) we have an example of Corollary 4.7.1 with k = 2 and h = h./2. Q.E.D.

We shall combine these results into a single algorithm in a moment; since a simplification is possible if f is actually convex, we derive one more result first. THEOREM 4.7.2 [C6a (1969)]. In addition to the hypotheses of Theorem

4.7.1, suppose that f is convex and that for all n we have h. > 0 such that

f(x,, + h.p.) < f(x. + 2h.p.) <_f(x.)

Then t - h makes

a criticizing sequence and

f(x.) -f(x.+,) >_ +s(cy,)(1 - c)y. for all c in (0, 1), with

(-Vf(x),II Proof.-.The point t;, providing the global minimum for f(x. + tp,) must satisfy 0 < t; < 2h and of course t = t; would yield a criticizing sequence. Since f is convex, for 0 < t < h we have

f(x. + tp.) > 2f(x. + h.p.) - f(x. + 2h. p.) + fix. + 2h. p.) -f(xx + h. p.)

h

> 2f(x. + h.p.) -f(x. + 2h.p.) > 2f(x. + h. p.) - f(x,) while arguing similarly for h < t G 2h we deduce

f(x. + tp.)

2f(x. + h. p.) -f(x.)

Setting t; = t thus gives

f(x. + t.'p.) > 2f(x. + h.p.) -f(x.)

92

SEC. 4.7

GENERAL BANACH-SPACE METHODS OF GRADIENT TYPE

and therefore

f(x.) -f(x. + h.p.) >.[f(x.) -f(x. + t,p.)] The theorem follows from part B of Theorem 4.2.4 with 8 = }. Q.E.D. We can now give a practical algorithm of a search type to locate a suitable value of We assume that the algorithm is entered with a point x4, direction p,,, and a number h > 0 given. We write in a pseudo-ALGOL language for convenience.

Search routine [Cea (1969)]

if f(x + hp.) < reduce:

h

then go to first;

-T;

if f(x + hp,) > (x. + h pal > f(x4 + hp.) then EXIT FROM ROU-

f

TINE NOW WITH t = h; if f IS CONVEX then EXIT FROM ROUTINE NOW

WITH t. --2 h T;

loop:

while f (x + h

first:

EXIT FROM ROUTINE NOW WITH t = h; if f(x + f(x. + hp.) then go to oldway;

f(x + hp.) do h -

T;

t.- 2h; change: oldway:

while f(x + (t + f(x. + tp.) do t t + h; EXIT FROM ROUTINE NOW WITH t = t; if f IS CONVEX and f(x. + 2hp,J
It would also of course be possible to move more rapidly by replacing the one line with the label "change" with the line change:

while f(x + 2tp.) < f(x +

t

2t;

THEOREM 4.7.3. Let f be strongly quasi-convex and bounded below on W(xo), let Vf be uniformly continuous on W(xo), and let p. = p define an admissible direction sequence. Let t be determined by the above search routine. Then is a criticizing sequence. EXERCISE. Supply the Proof to Theorem 4.7.3.

SEC. 4.8

GENERAL BANACH-SPACE METHODS OF GRADIENT TYPE

93

4.8. SPECIALIZATION TO STEEPEST DESCENT

The general gradient-type methods we have been discussing are generalizations of the original method of steepest descent [Cauchy (1847)]. In that -Vf(x.), special case we suppose that E is a Hilbert space and we let p which clearly is an admissible sequence of directions. Thus all of the theorems we have developed in this chapter yield corollaries when applied to this method. In some cases, however, one can go further for the original steepest-

descent method and give estimates on the rate of convergence. The next chapter, for example, will contain, as a by-product, convergence estimates for the steepest-descent directions selecting t as in Section 4.3 and Section 4.4. Therefore, at this point we shall only demonstrate the results obtainable for selecting t from a simple interval along the line.

THEOREM 4.8.1. Suppose f is a twice-differentiable functional on a

Hilbert space E and that mi < f x < MI for 0 < m < M < oo for all x. Let 6, > 0 and a2 > 0 be chosen and choose t to satisfy S,

and set xn+, = x - t.Vf(xn). Then x -+ x*, the unique point minimizing f over E, starting at any x0. Given any e > 0 there exists an N such that for

n > N,

-t.M1)

11x*-x.... II<_11x*-x.11(A.4-E), The error estimate is best when

t - t* = M + m for all

n

In this case, then, x converges faster than any geometric series with ratio greater than (M - m)/(M + m). Proof: By Theorem 1.4.4, lim f(x) = oo, so we may restrict the problem

to a bounded set-that,is, W(xa)"1is7bounded for each x0. Since

f(x)>_f(0)- IIVf(0)11Ilxll+;IIx112 f is bounded below. Since II f I I < M, Vf is Lipschitz-continuous with

'

Lipschitz constant M; Theorem 4.5.1 with A, = A2 = 1 then says that is a criticizing sequence. From Theorem 4.2.1 it follows that

is a minimizing sequence and then from Theorem 1.6.3' we conclude that x. x*,

94

GENERAL BANACH-SPACE METHODS OF GRADIENT TYPE

SEC. 4.8

the unique point in W(xo) and E minimizing f. Now we wish to consider the convergence rate:

xx"+, =x*-x"+t"Vf(x")=x*-x"+t"fx-(x"-x*) x*)] + t"[Vf(x") _ [t - t" f'.'-] (x* - x") + t"[Vf(x") - Vf(x*) - f

x*)]

since Vf(x*) = 0. By the definition of f'.")

Vf(x") - Vf(x*) - f '.'.(x. - x*) = II X. - x* II w(4 X. - x* II) where lim w(s) = 0. Thus 1-0

IIx*-x"+,ilSllx*-x"Ii[III-t"f'.II+Mw(IIX"-x*II)] Therefore,

IIx*-x"+(IISIIx*-x"II X [max (I 1

-

I,1 1 - t" M I) + M w(I I x" - x* I I)]

Given c < 0, then for large n > N,

Mw(Iix" - X* 11) <E which gives IIx*-xn+,II

with A. = max (I 1 -

IIx*-x"II(A"+e)

I, 11 - t"M I) < 1. Each A. can be minimized by

choosing

t"=t*=m+M=a,=8 2 in which case

=M-m

A

"

M+ M

Q.E.D.

Computer programs implementing steepest descent in IR' may be found in Whitley (1962), Wasscher (1963). General references: Kantorovich (1948), Levitin-Poljak (1966a).

SEC. 4.9

GENERAL BANACH-SPACE METHODS OF GRADIENT TYPE

95

4.9. STEP-SIZE ALGORITHMS FOR CONSTRAINED PROBLEMS

Although we shall not attempt to examine the many various kinds of iterative methods for treating problems with constraints, we do wish to see to what extent the methods of the previous sections can be modified for use on these problems. We remarked in Section 1.4 that a necessary condition for a point x* to provide a minimum for a differentiable function f over a convex

set C is that <x - x*, - Vf(x*)> < 0 for all x E C-that is, that all directions leading into C make obtuse angles with the direction of steepest descent,

implying that f is nondecreasing in every direction pointing into C. If this condition is not satisfied at a point x0, the natural step would be to find a direction p, = x', - x0, xo E C, with > 0 and move to a new point x, E C along this direction. One might hope to find such a point x', by "projecting" the gradient direction or some other admissible direction onto

the set C. Unfortunately, an arbitrary direction making a strict acute angle with -Vf(x,) can "project" into a direction p, making an obtuse angle with -Vf(x,), leading to no decrease in the value for f. Thus when we consider "projection" methods, we shall have to deal directly with -Vf(x,). First, however, let us consider in general what happens if one uses feasible directions p [Topkis-Veinott (1967), Zangwill (1969) Zoutendijk (1960)]. DEFINITION 4.9.1. A sequence of directions p, =

is called feasible

if and only if p, = xw - x where )x + (1 - 2)x, E C for all A E [0, 1], x'. # x,,, and > 0. For unconstrained problems, with 11p. 11 = 1, if
implied

the sense that reasonable step-size algorithms yielded criticizing sequences. For the constrained problem we shall similarly deduce
for many methods, so the problem will be to choose directions to avoid "jamming" [Zangwill (1969)] or "zigzagging" [Zoutendijk (1960)] so that "in the limit" the condition " = 0" is a necessary or sufficient optimality criterion; in Section 4.10 we give some examples of directions for which
96

sEC. 4.9

GENERAL BANACH-SPACE METHODS OF GRADIENT TYPE

for unconstrained problems, the t that was determined always guaranteed that

f(x.) - f(x. + t. p.) >- d((_ Vf(x.),

I I P. I

for some forcing function d(t), depending on the method; we can analyze all such algorithms of this type. As usual, of course, we remark that any different x'.+, may be used satisfying f(x.+,) Z f[f(x.) - f(x,+,)] for fixed /1 > 0 as in Theorem 4.2.4; we shall not continually repeat this obvious fact. THEOREM,4.9.1. Let the convex functional f be bounded below on the

bounded convex set C and, for some xo in C, let the set {x; f(x) < f(xo)} be bounded; let p = p define a feasible direction sequence and let II Vf(x) II be uniformly bounded for x c C n {x; f(x) < f(xo)). Let the numbers t: be some steps satisfying

f(x.) -1(x. + t:P.)

d(\-Vf(x.), 11P. )) t = t; if X. + t;p E C and t = t?, > e > 0

x+

Let

with s + t'p E C otherwise. Then Proof: If t. = g, then

0.

f{x.) - f(x.+) ?

5

\))

If

t = t;, and f(x + t'

f(x + t:

then also

f(x.) - f(x.+,) ? d

(\-Vf(x.),

i!

p.11))

We consider the final case of

t = t. and f(x. + t`.P.) > f(x. + I :p.) Since f is convex and t: > t', we have

f(x. + t.P.)

- A )f(x.) + r f(x. 4- t.-p.)

- fx. +

and thus

f(x.) - f(x. + t,p.) > t: I I P. I I [f(x.) - f(x. + t.P.)] 1.1,

11P.

GENERAL BANACH-SPACE METHODS OF GRADIENT TYPE

SEC. 4.9

97

Since fx;f(x) < f(xo)) is bounded, there is a K such that II t'P II < K and, therefore, f(xn) - f(xw+1) > to Il P. II IAd(\-

V(x-),

11p-

l pn

II)

Since t;, > e > 0 and II Vf(xn) II is uniformly bounded and

then from the three inequalities for f(xn) - f(xn+,) we deduce that
Remarks. Since p = x;, - xn for some x, E C, t;, = I is always allowed,

and so certainly t!, > e > 0 is possible; in particular, t,, = max {t; xn + tp e C) is possible. If C is itself bounded, by modifying and redefining f outside of C we can generally guarantee that (x; f(x) < f(xo)) is bounded. For the step-size algorithms studied for the unconstrained problems, it is possible to eliminate the hypotheses that f is convex in the above theorem. In general, the proofs of these facts follow the arguments for the unconstrained case, so we shall be rather brief; first we state a theorem similar to Theorem 4.2.4, again using the reverse modulus of continuity s(t) defined in Definition 4.2.3.

THEOREM 4.9.2. Let f be bounded below on W(x,,), let Vf be uniformly continuous and uniformly bounded on W(xo), let C be convex and bounded, and let p define a feasible direction sequence for C. Let there exist functions c,(t) and c2(t) such that c;(t) and t - c2(t) are forcing functions. Let t.,, be step sizes such that C`(\-Vf(x.), II P. II))

<

I

I pn I

I

11P.

and let t', be step sizes such that

x + t;,p. E C and

II)) l

\

t;,IIp.II

_ d1(IIpnIUd#

-Vf(xn),

rip. ll/

for two forcing functions d,(t) and d2(t). 1. If we set to = to if x. + t."pq E C and t = t;, otherwise, with xn+, = xn + tnpn, we conclude that
f(xn) -

f(xn+.) > f [f(xl.)

._ f(xn + t.Pn)J

for a fixed fl > 0, then - 0.

98

GENERAL BANACH-SPACE METHODS OF GRADIENT TYPE

SEC. 4.9

Proof: As in Theorem 4.2.4, we easily find f(x.) --.1(x. + t:P.) > tu. II P.11 [Y. - c2(Y.)] > c,(Y.)[Y. - c2(Y.)] where

/-Vf(x.),

-\

Y. =

P. IIP.II

If t = t", we then have -f(x.) - f(x.+,) > cl(Y.)[Y. - c2(Y.)] If t = t',, then t;, I I P. 11 < t.u 11p, 11, and arguing as for t.0 we get

f(x.) - f(x. + t.P.) >- tI[P.II IF. - c2(Y.)] d2(Y.)d1(IIP.IDEV. - c20.)]

Thus y.

0 or

11 P. 11

0; since 11 p.11= I I x. - x.11 and 11 Vf(x.) 11 are

bounded, this gives
0. Part 2 follows easily from the estimates

Remark. The convexity hypothesis on C can be removed easily here and in what follows. As in the unconstrained case, this general theorem makes it easy to analyze the convergence of many step-size algorithms. THEOREM 4.9.3. Let f be bounded below on W(xo), let Vf be uniformly continuous and uniformly bounded on W(xo), let C be convex and bounded, and let p. = p.(x.) define a feasible direction sequence for C. For numbers

aE

with a < 1, choose t such that x.+, - x. + t.p. E C and f(x. + t.P.) - a.t.
for all t > 0 such thatx + tp. E C. Then - 0.

Proof: By part 1 of Theorem 4.9.2 with c,(t) - s(a) and c2(t) - ct for fixed (0, 1 - a), d,(t) - t, and d2(t) - 1, the algorithm with t'. determined from IIP.II = s[c` -Vf(x.), \

IIP-.

and _4-- 1

gives the desired convergence. We know that either

- a. = 0 or

t = sup [t; X. + tp. E C)

GENERAL BANACH-SPACE METHODS OF GRADIENT TYPE

SEC. 4.9

99

In the first case, arguing as in the proof of Theorem 4.3.1, we find to > t.a', which implies t. = t:' and thus t > t;,; in the second case, clearly t > t;,. In either case, by the defining property of t. and the fact that t. > t,', we have

f(xa) - f(xa + tap.) > f(X.) - f(x. + t:P.) + a.(t. - t.X

f(x.) -

P.>

f(x. + to P.)

so that the theorem follows from part 2 of Theorem 4.9.2 with f

1. Q.E.D.

EXERCISE. Fill in the details in the above Proof.

Remark. Setting a - a = 0 yields the usual method. THEOREM 4.9.4. Let f and C be as in Theorem 4.9.3 and let C be normclosed. Let t be either: (1) the smallest positive t providing a local minimum for

f(xa + t.P.) - a t(Vf(x.), P.>

over the set of t such that x + tp E C, t > 0; or (2) the first positive root r of

- aa(Vf(x.),

if x. + rp E C, otherwise t = sup {t; X. + tP. (=- C) or (3) the following:

t = sup {t; < 0

for 0
x + tap,; then (Vf(x.), p.> ---. 0. We assume 0 < as < a < 1. Let Proof: The theorem follows from Theorem 4.9.2 precisely as does Theorem 4.4.1 from Theorem 4.2.4. Q.E.D. EXERCISE. Give the complete proof for the constrained-minimization analogue of Corollary 4.4.1-that is, for a = a = 0, in Theorem 4.9.4. THEOREM 4.9.5. Let f and C be as in Theorem 4.9.4 and let i be deter-

mined as is t in that theorem. Let t _ Aai. where

d((-Vf(xa),

F1P- 1 I/

)

< A. < 1

for some forcing function d(t), and let xa+, = xa + taps; then (Vf(xa), P.> --. 0.

100

SEC. 4.9

GENERAL BANACH-SPACE METHODS OF GRADIENT TYPE

Proof: The theorem follows from Theorem 4.9.2 precisely as does Theorem 4.4.2 from Theorem 4.2.4 by using, for fixed c c (0, 1 - a), c,(t) = d(t)s(ct), c2(t) = ct, t."I1 PAII =

d((-Vf(xA),

\ IIPA

II))s(`(-Vf(x")'IIPAII/

sup {t; xn -+- tpn E C}

t;,

>

d((-Vf(xn), II

A P.

d,(1) = t, d2(t) = d(t), /3 = 1. Q.E.D. EXERCISE. Fill in the details in the above Proof.

As in the unconstrained case, if Vf is Lipschitz-continuous, while the 0, it is above theorems define a range of t-values leading to possible to double the size of this range by a more careful analysis. THEOREM 4.9.6. Let f be bounded below on C, let Vf satisfy

11Vf(x) - Vf(y)II <_Lllx-yl1 for x, y in C, and let pA = pn(xn) define a feasible direction sequence. Pick b S2, b, all greater than zero and let yA lie in

[min(o

l2

allp ll2

_< f i(xp.> /' L

for all n. For each n let xn+, = xn + to pA where to is defined via

to = min (1, yn<-Vf(x,), PA>l I I Pn 112

J

Then f (xn) decreases to a limit. If 1I PA 11 is uniformly bounded-for example,

if C is bounded-then lim = 0

If 1lpAI l -0 implies (Vf(x,i),

P. IIPAII

-> 0

then

Jim (Vf(xn), 11 P.

II = 0

SEC. 4.9

GENERAL BANACH-SPACE METHODS OF GRADIENT TYPE

101

Proof.

f(xn+ 1)- f(xn) <-

+ f dd 0 T1.2 lip. 112

+

< -rn<-Vf(xn),Pw> + 2 r.

IIP.112

If < Y. <-Vf(xw), Pw> Ii P.11,

then rn = 1, xn+, is in C, and

L

f(xn+.) - f(xn) < <-Vf(xn), Pn> - 1 +

11

< <-Vf(xn), P.> L- I + L,-]

<

-,r (

2

11

23L <-Of( n), Pn> < 0

If, however, 1 > tw = Yw <- Vf(xw), Pw> 11P.112

then xn+, is in C and Ax., J -f (xn) < Yn <- Vf(xw), pw>2 + L I I Pw I I2 y22 <-Vf(xw), pw>2 IIpnll I pwll <-Vf(x.), Pw>2 1ywL 11 P.

2

112

L

< either

ywJ

<-Vf(xw), pw>2 11p.li2

or

2

3L<_.Vf(xw),Pw>

In either case, f(xw+,) - f(xw) 0 and f(x,) decreases to a limit. If II P Ii = 11x;, - xn 11 is bounded, then from the three inequalities bounding the decrease

in f we obtain a a > 0 such that f(xw) - f(x., ) z a<- VAX.), Pw>' for either r = 1 or r = 2, which implies lim <-Vf(xj, pw> = 0. Since
P. IIPwI!

\=

the final conclusion also follows. Q.E.D.

Vf(xw), Pw IP.I1

102

GENERAL BANACH-SPACE METHODS OF GRADIENT TYPE

SEC. 4.9

Other simple range theorems in terms of Lipschitz constants can be given analogous to Theorems 4.5.2 and 4.5.3 in the unconstrained case; we leave these as exercises and proceed to the more complex methods of Section 4.6. We determine admissible values of t in terms of the range function

f(x + zP) g(x, t, P) = f (X)t<- f(x),P Given a feasible direction sequence defined by p. = p (x.), for the moment assuming IIP.II = 1, a real number 6 E (0, i-], and a forcing function d(t) < bt, we move from x. to as follows. If, for t = I and x = x + p,,, we find

d(<-Vf(x.), p.>) f(x.), P.

(- V

g(x., t., P.) ?

(4 . 9 . 1)

we set x,,,, = x;; otherwise, find t in (0, 1) satisfying Equation 4.9.1 and also

g(x., t., P.) - 1 I >

d(<-Vf(x.), a.) <- AX.), P.

(4.9.2)

and set x,,., = X. + t p E C since x + p. E C. We observe that the algorithm is well defined. Since g(x,,, 0, I and 1 - d(t) > d(t)

I-I

for all t, if we have g(x., 1, P.) <

dzZ)

where

z = <-Vf(x.), P.>

then by continuity of g(x,,, 1, p.) in t and the fact that x + tp is in C for tin [0, 1] since p is a feasible direction, there exists t in (0, 1) with J(Z)

S g(x., t., P.) < 1 -

d(z)

which certainly satisfies Equations 4.9.1 and 4.9.2. THEOREM 4.9.7. Let f be bounded below on C, Vf be uniformly continu-

ous on C, and p -

be a feasible direction sequence with IIP.II = 1. Let d be a forcing function with d(t) < bt for b in (0, 41. Let the algorithm described above be applied. Then lim = 0. . -.oo

Proof. The proof is exactly the same as that for Theorem 4.6.1. Q.E.D. For problems in which C is not the whole space-that is, in which there

SEC. 4.9

GENERAL BANACH-SPACE METHODS OF GRADIENT TYPE

103

are constraints-the restriction IIp.II = 1 is unrealistic; the following corollary shows that it is not needed so long asp. cannot be "too small" compared to how "near" one is to a solution. COROLLARY 4.9.1. Under the hypotheses of Theorem 4.9.7, above, with the assumption I I p1. I I = 1 replaced by

1. IIP.II >

for a forcing function d and d(<- f(xl),

I

2. (-Vf(x,), II P-n

II) - 0

p^) -. 0,

whenever

it follows that Jim

(Vf(x ), P" ) = 0 IIP.

F

Proof. - The proof is exactly the same as that for Corollary 4.6.1. Q.E.D.

For the direction algorithms we shall consider in Section 4.10 for constrained problems, it will not be necessary.for us to have (Vf(xs), II

P. H

) --

0

The condition
COROLLARY 4.9.2. Under the hypotheses of Theorem 4.9.7, above, with the hypothesis IIp. II = I replaced by

(-Vf(x,), II P II)

)0

p1.>)

3.0

whenever

Ilp,ii it follows that

0.

EXERCISE. Prove Corollary 4.9.2 in detail.

The algorithm above is not computational in that it may well be very difficult to locate a t,. E (0, 1) satisfying Equations 4.9.1 and 4.9.2; the algo-

rithm of Theorem 4.6.2 and Corollary 4.6.2 works for the unconstrained

104

GENERAL BANACH-SPACE METHODS OF GRADIENT TYPE

SEC. 4.9

problem as well, yielding the desired computational procedure. The proofs of the following results are exactly the same as those for Theorem 4.6.2 and Corollary 4.6.2. THEOREM 4.9.8. Under the hypotheses of Theorem 4.9.7 or Corollary

4.9.1 or Corollary 4.9.2, t may be chosen as the first of the numbers a°, xA + a', a2, ... satisfying Equation 4.9.1 for a fixed a E (0, 1); if t,, p,., then -+ 0.

It is also quite clear that the search routine described in Section 4.7 works equally well on constrained problems so long as x + hpn E C for the initial h and the increasing oft in the step labeled "change" is not allowed

to force x + tp. outside of C. EXERCISE, By proving the analogues to Theorems 4.7.1 and 4.7.2 and to Corollaries 4.7.1 and 4.7.2, prove Theorem 4.9.9 below.

THEOREM 4.9.9. Let f be strongly quasi-convex and bounded below on W(x°), let Vf be uniformly continuous and uniformly bounded on W(x°), let C be convex and bounded, and let p = define a feasible direction sequence for C. Let the search routine described below be used to determine

t,,, where x + hp. E C, and set x,,

,=x+

then -+ 0.

Search routine

start:

if f(x. +

reduce:

h

f(x.) then go to first;

T;

f f(x. + hp.) >

f(x

f(x +

then EXIT FROM ROUTINE NOW WITH t = h;

f f IS CONVEX then EXIT FROM ROUTINE NOW WITH

2;

loop:

while f(x + 2 per) < f(x. + hp.) do h -- 2 ;

first:

EXIT FROM ROUTINE NOW WITH t = h; f x + 2hp IS IN C then go to inside;

h-

h

;

2 go to start; inside:

if f(x. + 2hp.) > f(x, + hp.) then go to oldway;

t- 2h;

GENERAL BANACH-SPACE METHODS OF GRADIENT TYPE

SEC. 4.10

f(x +

105

and x + (t + h)PA

change:

while f[x + (t -}r

oldway:

EXIT FROM ROUTINE NOW WITH t,, = t; then EXIT if f IS CONVEX and f(x + 2hp.) < FROM ROUTINE NOW WITH t = h;

IS INCdot-t+h;

go to loop; 4.10. DIRECTION ALGORITHMS FOR CONSTRAINED PROBLEMS

As we have mentioned before, since the step-size algorithms above 0, we must have a direction sequence such guarantee that that this is a useful condition: For unconstrained problems-for example, condition was II Vf(x.) II 0; under with p = sufficient regularity assumptions, this implied that limit points x' of was [xA} satisfied Vf(x') = 0 and that-say, for convex a criticizing sequence. For constrained problems, analogously one should pick directions p(x) that the 0 the

if f is for

x

that

feasible

x in

consider of x satisfying x > 0. For each x let direction d(x) = (d,, ... , d,) have components 0 if

d; = af

r;; = 0 and df Z 0 d, otherwise

and let p(x) - a(x) d(x) for some scalar a(x) chosen so that p is feasible. Show that, if g is convex and = 0, then xs minimizes f Sketch an example to show that choosing x+, to minimize f Ix. + tp(x,)) need not yield convergence to an optimal point. [The difficulty here is that - 0 is an optimality condition but
We shall now consider, for illustrative purposes, three methods of choosing directions p (xfl) so that -b 0 is a useful condition. These methods can be combined with the step-size algorithms of Section 4.9

to yield complete numerical-minimization algorithms; we shall not state theorems concerning the resulting combined algorithms, although-the-reader should pause to consider such statements himself. Recall that C is a convex set. Then a well-known necessary condition

106

GENERAL BANACH-SPACE METHODS OF GRADIENT TYPE

SEC. 4.10

for x* to minimize f over C, one that is sufficient if f is convex, is that <x - x*, Vf(x*)> z 0 for all x in C-that is, every direction into C is a direction of increase for f. If one has a point x, which does not satisfy this condition, then it is reasonable to seek the x, which most violates this condi-

tion and then take p, x.' - x,; this conditional-gradient method is well known [bemyanov-Rubinov (1967), Frank-Wolfe (1906), Gilbert (1966), Goldstein (1964a), Levitin-Poljak (1966a)]. Thus we seek x, such that < inf + e. xEC

for some nonnegative f. tending to zero. If C is bounded, we can always find

x;; if C is bsunded and norm-closed as well as convex, then we can take c: = 0 if desired, although this causes unnecessary computation. THEOREM 4.10.1. Let f be convex, bounded below on the bounded convex set C, and attain its minimum at some point x* in C. Let x be a sequence in C such that tends to zero, where p, = x' . - x and x,' satisfies < inf + e, for a sequence :EC

of nonnegative e,

0. Then (x,) is a minimizing sequence-that is, f(x,)

f(x*). Proof. From the convexity of f and the definition of x, we can write

0 S fix.) - .f(x*) < + < - + E. wnlcn tends to zero. Q.E.D. EXERCISE. State some convergence theorems combining some step-size algorithms with the above direction algorithm.

The steepest-descent method for unconstrained problems, in which p, = -Vf(x,), has been a popular method for many years, for some applications undeservedly. For constrained problems, that direction need not point into the constraint set C, so it is not directly applicable. Perhaps the most successful way of handling this has been to "project" the direction onto C; more precisely, one proceeds in the direction p, = x.' - x where x,' is the orthogqpal projection onto C of x, for some scalar a. > 0. This is the well-known gradient projection method [Rosen (1960-61)]. In view of the numerical evidence that certain so-called variable-metric methods are much

better than steepest descent for unconstrained problems [Chapter 7 of this volume, Fletcher-Powell (1963)] and the growing interest in such methods for

constrained problems [Goldfarb (1966, 1969a, 1969b), Goldfarb-Lapidus

SEC. 4.10

GENERAL BANACH-SPACE METHODS OF GRADIENT TYPE

107

(1968)], we consider an analogous variable-metric projected-gradient method. We suppose that (AA) is a uniformly bounded, uniformly positive-definite

family of self-adjoint linear operators on the space E-that is, that there are x> < M<x, x> for all x in E. m > 0, M < 00 such that m <x, x> <

For each n, let x, be the projection, with respect to the variable metric AA- >, of x - a,A;'Vf(xA) onto C; that is, x; minimizes [x,

<x - [xA - aAA;'Vf(xA)],

-

over x in C. If C is norm-closed and convex, a unique xr exists. By the usual necessary condition, the variational definition of x' means that for all x in C we must have

<x - x, AA(x, - we)> > 0

(4.10.1)

where w = x - aAA;'Vf(xA). If we set x = x in this inequality, we obtain

0 > <x, - x', AA(w, - xi)> <XA - x,', AA(w, - XA)> + <xA - x, A,(x,, - x,)>

and since w - x = -aAA;'Vf(xx), we obtain <xA

- x, - a.VJ (xn)>

- <x - x., A.(xn - x.)>

or

aA<-Vf(X.), P.> >

(4.10.2)

Therefore, the direction sequence is feasible. We mow show that the condition lim = 0 is useful. ATHEOREM 4.10.2. Let f be convex, bounded below on the norm-closed,

bounded convex set C, and attain its minimum over C at x*. Let x be a sequence in C such that the projected-gradient directions pA defined above satisfy lim = 0 and a > e > 0. Then (xn] is a minimizing Sequence-that is, f(xA) --, f(x*). Proof: We write

0 ` f(X.) - f(x*) < Vf(XA), X. - X*> l`- +
- X*>

-I

M.

<x, - a, A;' Vf(xA) - xA, A,(x* - x',)',

a

a- <X,, - X,,, AA(x, - x *)>

< <-

1

of(Xn)) PAI +

(/ \ XA

xll, AO(xA

- x *)>

108

SEC. 4.10

GENERAL BANACH-SPACE METHODS OF GRADIENT TYPE

by Equation 4.10.1. Therefore,

0 + aM 11P.11 11x, - x* II

<<-Vf(x"),P,>+ M II x:-x*11 [_<-vf(x)P>]

1/2

using Equation 4.10.2 and the positive-definiteness of A,,. Thus

0 <_f(xn) - f(x*) <

P"> ± M 1 x2m1 /2 *

II <-Vf(x,),

which tends to zero. Q.E.D. EXERCISE. State some convergence theorems combining some step-size algorithms with the above direction algorithm.

EXERCISE. Show that, if f(x) = <x* - x, A(x* - x)>, A. = A, and a = 1, then x" = x*.

We note that our projected-gradient method for A. = I, E = ER', and C a polyhedral set is not quite the same as the gradient-projection method originally described in Rosen (1960-61), since the latter requires that xn be the projection onto one of the faces to which xn belongs or, in some implementations [Cross (1968)], onto a small neighborhood of x" in C. The computational versions of gradient projection in use apply a special technique near edges of C which turns out to be essentially equivalent to bounding a away from zero but keeping it small enough so that the projection is always very near x,,. Thus it is clear that a simple convergence proof for Rosen's original computational gradient-projection method can be fashioned in this way from our results above; this has been done [Kreuser (1969)]. If one, however, does not take a" small, one needs a good, efficient method for projection, in an arbitrary quadratic metric, onto a full polyhedral set. Such an algorithm has been brought to our attention [Golub-Saunders (1969)] and raises the possibility of using larger a,,, which may well be more powerful than the original gradient-projection approach, at least far away from the solution.

The method analyzed in Theorem 4.10.1 can be considered intuitively in a fashion different from that presented before if we notice that x', is chosen so as to (approximately) minimize f(xn) + over C; that is, x" solves the problem with the cost function linearized at x,,. We consider next the approximation off by a quadratic at x (Newton's method)-namely, f(xn) + + 'T' where f;,' denotes the derivative f', . If x" is chosen to minimize g"(x)

+ +

SEC. 4.10

GENERAL BANACH-SPACE METHODS OF GRADIENT TYPE

109

then

0

for all x in C. Setting x = x and defining p = x;, - xn in this inequality yields

< - < 0

for convex f, implying that p is a direction of nonincreasing f-values as needed.

THEOREM 4.10.3. Let f be convex, bounded below on the norm-closed,

bounded convex set C, and attain its minimum over C at x*; let f', exist in C and I I f' I I 0 for all x in C. Let [xj be a sequence in C such that lim = 0 where p = xn - x and xn minimizes

gn(x)=

f(x*).

over C. Then

Proof: By the definition of xn as we saw above, we have <- VAX.), P.> >_

However,

I If .,P. IIZ

B

and hence from for all x in C, we write

pn> -, 0 we conclude that II f.'P.II

<x' - x,,, Vf(xn}> - <x - X.,

Q. Thus

Vf(x.)>

= <x. - x, Vf(x.) + f.'(x. - x.)> - <x - x, f.'(`v.' - X.) > - <x - x'., V n(x'n)> - <--e. - x, f n'P.>

- <x'i - x, f ,,P.>

since
E.su?I<x.-x, tend to zero. Thus we have

<xe - x., Vf(x.)> < <x - X., Vf(x)> + E.

for all x in C with E. Q.E.D.

0. The result now follows from Theorem 4.10.1.

110

GENERAL BANACH-SPACE METHODS OF GRADIENT TYPE

SEC. 4.11

EXERCISE. State some convergence theorems combining some step-size algorithms with the above direction algorithm.

A local convergence theorem yielding the usual quadratic convergence rate when xn+, = x, has been given in Levitin-Poljak (1966a). EXERCISE. Contrary to the unconstrained case (see Section 7.2), by considering the minimization of x2 + y2 over [(x, y); y Z 1), show that picking ;:n+, so that f(xn+,) < f(x') need not maintain quadratic convergence, where x; is generated by the above Newton's method.

For more extensive discussions of algorithms for constrained problems, the reader is referred to Fiacco-McCormick (1968), Zangwill (1969), and references therein.

4.11. OTHER METHODS FOR CONSTRAINED PROBLEMS

'A considerably different kind of method has been developed for the case in which the constraints take the form P(x) = 0 where P: E - E is nonlinear, implying that one need not be able to proceed from xn into C along

straight lines. In this case, under suitable hypotheses [Altman (1966b)], one can find s(x, t) E E such that P[x - tVf(x) + s(x, t)] = 0 for all t > 0 and 11 s(x, t) I( S Kt2 for some K. Thus only a small perturbation of the linear motion keeps us in C. Algorithms have been given for determining t-values,

and convergence proofs are known. The methods for computing s(x, t), however, are very complex and do not appear to lend themselves to practical computation; therefore, we consider the method no further. One further type of method for constrained problems which we wish to consider is the penalty function method. We have met this approach before in Sections 3.2 and 3.3 in a more specialized form. In fact, the whole approach fits into the discretization analysis if one makes some extensions in those results, but this adds but little to the general applicability of those theorems; therefore, we treat the penalty-function method briefly in the more classical fashion. We seek to minimizef(x) over C = {x; g(x) < 0), where g is some nonlinear functional. Instead, we shall approximately minimize f(x) + PP[g(x)] over E, where the penalty functions P. are such that, fort > 0, lim PP(t) = 00; n-.«

uniformly for

t > 6 > 0 for all 6 > 0

Thus P. will penalize us for having an x with g(x) > 0. EXERCISE.. Give some examples of penalty functions that satisfy the conditions immediately above.

GENERAL BANACH-SPACE METHODS OF GRADIENT TYPE

sEc. 4.11

111

What we can hope will occur, then, is that our computed sequence xp will satisfy

lim sup g(xp) < 0 p+w

This, however, is not enough in general to guarantee that d(xp, C) = inf I I xp - x l l rEC

is tending to zero. DEFINITION 4.11.1 [Levitin-Poljak (1966a)]. The constraint defined by g

is called correct if lim sup p-.w

0 implies lim d(xp, C) = 0. 11-.w

EXERCISE. Find some explicit conditions under which constraints are correct.

THEOREM 4.11.1. Let g define a correct constraint; for some e > 0 let I f(x) - f(y) I < L II x - yli if d(x, C) < E and d(y, C) < e; let Pp[g(x)] > 0

for all x E E; let lim PP(t) = oo for t > 0, uniformly for t > a > 0 for all 6 > 0; and let lim PP[g(x)] = 0 for all x r= CO, a dense subset of C. Define mp = inf {f(x) + PP[g(x)]}, xEE

m = inf f(x) xEC

and assume inff(x) = rn > - oo. For a sequence ep > 0, satisfy

xEE

Ep ---p

0, let xp E E

f(x,) + PP[g(x,)] < mp + ep Then {xp} is an approximate minimizing sequence for f over C in the sense of Definition 1.6.1.

Proof: Let wp E C, lim f(wp) w, E Co with

m. Since f is continuous, there exists

If(w) -f(wp)I
f(x) + PA[g(xJ]
Since Pp[g(xp)] > 0, we also have lim sup f(xp) < m. Also pm

112

GENERAL BANACH-SPACE METHODS OF GRADIENT TYPE

P.[g(xj] = (f(xw) +

SEC. 4.11

P.[g(x,,)]} - f(x.)

which implies

lim sup PA[g(x,)] < m - rn < o0 A-.w

Therefore, also lim sup g(xA) < 0

oo, a or otherwise for some subsequence xA, we would have PA,[g(xA)] contradiction. Since g is a correct constraint, d(xA, C) - 0. Thus, for large n, d(x,, C) < E and we write

I f(x,) - f(x') I < L

'xA - x'A I

I S 2Ld(x,, C)

where

x,E C,

Ilx.-x'.II<2d(x.,C)

Then we have

m
f(x,) > m - 2Ld(x,, C) which implies lim inf f(xA) > m

We already have lim sup f(xA) < m so, therefore, {xA} is an approximate minimizing sequence. Q.E.D. As usual, one can use the results of Section 1.6 to deduce stronger convergence results. It should be pointed out that the problem of minimizing a convex function, bounded below, over a bounded convex set can be reduced to that of a linear function over a convex set. For example, suppose that C is a convex bounded

set in E and that f is a convex functional. Define the new space E, = E x IR, f, (x, A) = A, and for some z E C let

C, = [(x, .1); x E C, f(x) < d < f(X)} The original problem is solved now by minimizing the linear functional f,

SEC. 4.1 1

GENERAL BANACH-SPACE METHODS OF GRADIENT TYPE

113

over the convex bounded set C,. Since Vf, _ (0, 1), it is Lipschitz-continuous with constant L arbitrarily small, allowing for application of the preceding methods, in particular,. gradient projection.

Finally, we mention briefly another method, somewhat similar to the penalty-function method, but for which the penalty is introduced in a different way. For simplicity only, consider the problem of minimizing a strictly convex functional f such thatf(x) + 00 as 11 x 11 00 over the convex subset C of IR' defined by

C=[x;g,(x)<0,i = 1,...,k} where the g; are convex functionals; we suppose C has an interior point. The method of centers proceeds as follows [Cba (1969), Huard (1967), Zangwill (1969)]; given an initial xo E C, we compute xa+, as a point minimizing

f (x) = [f(x) - f(x )] J [-g,(x)) over {x; f(x)
gradient methods can never increase the f; values, and hence 11 1- g,(x)]

-

will always be positive andf(x) will-always be negative; therefore, the iterative method to compute x,+, can ignore the constraints, just as in the penalty-function-method. Under the hypotheses we have stated, it is known that Jfx - x* (j 0, where x* is the solution to the minimization problem. Similar results hold for much more general versions of this method.

We remark again that there are many other methods for constrained minimization; since this is not a book on mathematical-programming methods, we have only mentioned a few. General reference: Levitin-Poljak (I966a).

5

CONJUGATE-GRADIENT AND SIMILAR METHODS IN HILBERT SPACE

5.1. INTRODUCTION

In recent years there has been a great deal of interest in iterative minimi-

zation methods, for both constrained and unconstrained problems, which make use of the idea of conjugate directions; we shall discuss some of the practical algorithms in IR' in other chapters. In the present chapter we wish to describe, in a general setting, the basic theory behind conjugate-direction and particularly conjugate-gradient methods. We shall examine the method first for the simple case in which the function f to be minimized without con-

straints is a quadratic functional, essentially with 0 < aI < f < Al. It is the great power of the methods when applied to this problem that has made them appear attractive for the more general nonlinear problems. Later we shall extend the results to the more general case. We shall throughout this chapter

consider the problem as defined over a real, separable Hilbert space lo with inner product

5.2. CONJUGATE DIRECTIONS FOR QUADRATIC FUNCTIONALS

If we seek a critical point of a quadratic functional f, then we are really trying to solve a linear equation. Thus let M be a bounded linear operator with bounded inverse from .*' into . . Let H be a positive-definite, self-adjoint, bounded linear operator from into .*'; then N = M*HM has the same

properties. The problem of solving Mx- k for given k with h = M-'k now can be stated as the problem of minimizing the functionalf(x) = Kr, Hr-

114

SEC. 5.2

CONJUGATE-GRADIENT AND SIMILAR METHODS IN HILBERT SPACE

115

where r = k - Mx is the residual. When we consider more general functionals f, we shall still find it necessary to examine a functional
E(x) - = = Since N is positive-definite and E, = 2N, we see that E is strictly minimized over .*' at x = h. We shall attempt to minimize E over Al by minimizing it over a sequence of expanding subspaces. THEOREM 5.2.1. Let [B.] be a sequence of closed linear subspaces of B. c B.+ B _ norm closure of U B,,. Let x,, minimize Eon B,,. Then xA --i

x' E B, and x' minimizes E over B. Proof: The points x exist and are unique because of the growth property and the uniform (quasi-)convexity of E(x). Since E(x,) forms a decreasing is a Cauchy sequence. For n > m, sequence bounded below by zero, we write

<xm - xn, N(xm - xn)> = E(xm) -

2<M*Hrn, xm - X.>

Since n > m, xm - x is in B.; since x minimizes E(x) over B., which equals -2M*Hr,,, is orthogoral to B. and therefore to xm - x,,. Thus we conclude that <xm - xn, N(xm - xn)> = E(xm) -

which tends to zero, since {E(x,)} is a Cauchy sequence. Since N is positivex'. definite, {x,} is a Cauchy sequence and there exists x' E B such that x By a continuity argument we find, setting r' = k - Mx', that <M*Hr', z> = 0 for all z E B. Since again VE(x') = -2M*Hr', this implies that x' minimizes E over B. Q.E.D. EXERCISE. Prove that <M *Hr', z> = 0 for all z E B, where r' is defined it the Proof of Theorem 5.2.1 above.

As a practical matter, the minimization is easier if B. is finite-dimensional-if, say, B. is spanned by the linearly independent vectors [po, p ... for all n. Then of course x a,,,jpj. It would be convenient if a., j

were independent of n, so that xn+, = x + a,p,,. It is a simple matter to or

= 0 show that this is the case if and only if either x = for i = 0, 1, . . . , j - 1 [Antosiewicz-Rheinboldt (1962)]. We include the proof in one direction as a part of the following.

116

CONJUGATE-GRADIENT AND SIMILAR METHODS IN HILBERT SPACE

SEC. 5.2

THEOREM 5.2.2. Let {p,); be a sequence of linearly independent vectors

satisfying

= 0 if i x,+ i = X. + c P,W

j and let x0 be arbitrary. Let c,

_<

*HNP> '

r, - k - Mx,

Let B. be spanned bypo, ... , p,-, and let B = closure of U B,. Then x, -- 'x' minimizing E over B.

Proof: For i < n - 1, <M*Hr P+) _ <M*Hr,-t, P,> - c,-,

For i < n - 1 we have <M*Hr,,, p,> = <M*Hr,_ , p,>

while, by the definition of c, <M*Hr,+,, p,> = 0. Thus <M*Hr z> = 0 for all z in B. and hence x, minimizes E over B,. The rest follows from Theorem 5.2.1. Q.E.D.

EXERCISE. Prove the converse of Theorem 5.2.2 by proving that the a,,, defined immediately before Theorem 5.2.2 are independent of n only if the directions [p,)o satisfy = 0 for i :;6 j.

A set of linearly independent directions [p,]o satisfying

= 0 if i # j is called a set of N-conjugate (or conjugate) directions. A general scheme has been devised [Hestenes (1956)] by which such directions can be generated. One can show fairly directly that the following is valid (Daniel (1965, 1967b), Hestenes (1956)].

PROPOSITION 5.2.1. Let K, N be positive-definite, bounded, self-adjoint

linear operators in .e, and let go # 0 be given in .*'. The algorithm

g-

K&+1 +

po = Kgo,

with C. =_

and

b = -
generates directions satisfying = = 0 if i = ,

b. __

1

j, <91+ 1,p,>=0,

SEC. 5.3

CONJUGATE-GRADIENT AND SIMILAR METHODS IN HILBERT SPACE

117

The algorithm terminates at n = no if and only if ga, = O. If we define Kx v(x) = <x, N_,x>

<x'K xx>' A(x) _ fix,

T = KN

then the spectrum of T lies in an interval [a, A), a > 0, and for any such a, A, we have

a < 4u(p) < I < ,u(Kg,) < A

and a< v(g) <

I

< v(Np,) < A r

According to Theorem 5.2.2, the iteration defined therein yields a solu-

tion to Mx '= k if B = .°; if B #

,, x' need not equal h in general. Of

course, if X1 is finite-dimensional, the iteration terminates, B = and x' _= h. For infinite-dimensional problems, however, we need additional conditions to assure x' = h. EXERCISE. Find an example of a conjugate-direction method for a specific problem for which the limit x' _,£ h. General references: Hayes (1954), Hestenes-Stiefel (1952).

5.3. CONJUGATE GRADIENTS FOR QUADRATIC FUNCTIONALS

We consider a special conjugate-direction algorithm-namely, one in

which, in the algorithm of Proposition 5.2.1, we take go = M*Hra = -3VE(x,,). Clearly, then, g = M*Hr,,, which implies that the c of Proposition 5.2.1 and Theorem 5.2.2 are the same if we write x,,,, = x + c,,p,,. Thus x --y x' minimizing E on some closed subspace of ;'. If K = I, then g = M*Hr = -and, since p,,,., = Kgn+, + b p,,, we see that the new direction is obtained by "conjugatizing" the direction -,, VE(x )that is, by projecting onto the space of vectors conjugate to p0, p ... , p,,; hence the name conjugate-gradient method. We shall return later to the projection aspect of the method. We now wish to show that, for the conjugate-gradient method,. x,, --- h. THEOREM 5.3.1. Using the conjugate-gradient method, x,, - h. We have

E(x,,,.,) <

q,

0
Let a, A be the positive spectral bounds for T = KN. If KN = NK, then we can take

A - az q

A + a)

118

CONJUGATE-GRADIENT AND SIMILAR METHODS IN HILBERT SPACE

SEC. 5.3

Otherwise, we can take q = I - (a/A). The same convergence rates obtain

forllx,--hll2 Proof: It is trivial to verify that E(x.) - E(x.+,) = c. Next we have

E(x,) = = <M-'r M*Hr,> = <M-'(M*H)-'M*Hr M*Hr,>

_ = y(gi) 1

E(x,) - E(xr+,) = c, = E(x,)c,v(g,)

The estimates of the theorem follow from cr > 1/A, Y(g) > a. If K and N commute, then ary(gr)

y(g,)

K)

,u , gr

_-

[Kg,, Kg,]2

[Kg,, TKg,][Kg,, T -'

where Ix, y] - <x, K-' y>. It is easy to see that T is self-adjoint positivedefinite relative to with spectral bounds a, A; thus 4aA

c.v(g) > (A+a)2 by the inequality of Kantorovich [Faddeev-Faddeeva (1963), Kantorovich (1948)]. Now let fi > 0 be the lower spectral bound for N. Then

fill x.-

I12S=E(x.)
Thus I I x. - h 112 < q" E

,

X.

)h

and the stated convergence rate is valid. Q.E.D.

It is also possible to show that another error measure-namely,

F(x) = decreases [Hestenes-Stiefel (1952), Daniel (1965, 1967b)]. EXERCISE. Prove that

F(x.) -- F(x.+,) > (x.+, - x., K-'(x.+, -- x.)>

SEC. 5.4

CONJUGATE-GRADIENT AND SIMILAR METHODS IN HILBERT SPACE

119

In some cases the method can be shown to converge even when a = 0, but examples are known in whit, i we then have 11x - x* 11 > (inn)-' for some ).. > 0, showing that no geometric convergence rate is possible [Odloleskal (1969), Poljak (1969a)]. General references: Antosiewicz-Rheinboldt (1962), Daniel (1965, 1967b), Hayes (1954), Hestenes (1956), Hestenes-Stiefel (1952). 5.4. CONJUGATE GRADIENTS AS AN OPTIMAL PROCESS

Much-improved bounds on the convergence rate can be obtained by viewing the conjugate-gradient method in a different light, one which shows more clearly the great power of the method-as opposed, say, to the steepestdescent method, which also has a convergence factor like (A - a)/(A + a).

Suppose we seek to solve Mx = k-that is, M*HMx = M*Hk-by

some sort of gradient method; for more generality we allow ourselves to multiply gradients also by an operator K, where M, H, N, K, Tare as defined earlier. If at each step we allow ourselves to make use of all previous information, we are lead to consider iterations of the form

x.+I =

h = M -'k

P JA) are polynomials of degree less than or equal to n. If we should by chance have x0 = h, we would want x = h for all n. This leads, since h should be considered arbitrary, to the requirement that where

xo + P.(T)T(h - x0)

(5.4.1)

where P,.(.t) is a polynomial of degree less than or equal to n. We wish to use methods of spectral analysis to discuss such methods,

so we are forced to assume that

N = p(T) where pa,) is a positive function continuous on some neighborhood of the spectrum of T. As we shall later see, this is satisfied in the practical methods, where usually p(..) ) or p(A) -- 1. For each n, we wish to choose so that E(x..,I) is the least possible under any method of the form of Equation 5.4.h According to the spectral theorem, we can write 1) _ A p(2)[1 - 2PP(1)12 ds(2) J

(5.4.2)

0

where s(.) is a known increasing function. The fact that there is a polynomial PP(2) yielding this least value follows from a straightforward generalization

120

CONJUGATE-GRADIENT AND SIMILAR METHODS IN HILBERT SPACE

SEC. 5.4

[Daniel (1965, 1967b)] of the theorem in finite dimensions as proved in Stiefel (1954, 1955).

is minimized by setting

PROPOSITION 5.4.1. The error measure

1-

to be the (n + 1)st element of the orthogonal [on

[a, A] relative to the weight function .p(A) ds(A)} set of polynomials R,(A) satisfying R,(0) = 1. EXERCISE. Prove Proposition 5.4.1.

We shall now show that, for each n, the vectors generated by the conjugate-gradient method are precisely those generated by this optimal process. THEOREM 5.4.1. For each 'n, the vector x generated by the conjugategradient (CG) method coincides with that generated by the optimal process of the form in Equation 5.4.1. Proof: Given n, the vectors p0, ... , p,-, in the CG method are independent. Since p0 = Kgo and p,+I = Kg,,.., + b,p it is clear that any linear combination of p0, ... , can be written as a linear combination of Kg0, ... , Thus the n vectors Kg0, ... , Kg, span at least the ndimensional space B.- sp[po, . . . , hence B = sp[Kg0, . . . , Kg._,].

Now Kg0 =g°Kg0; assume that for j< i, Kg, can be written as a linear combination of T°Kg0, T'Kg...... T'Kgo. Then Kgr+, = K(g, - c1Np,) = Kg, - c,Tp, We can write p, as a linear combination of Kg0, ... , Kgt, each of which, by

the inductive assumption, is a linear combination of T°Kg0, ... , T'Kgo. Therefore, Kg,+, is a linear combination of T°Kg0, ... , T'*'Kg,,. Reasoning as above, we have

B. = sp[T°Kg0,... ,

T"-'Kg.)

Now x, minimizes E(x) on x0 + B. if x is generated by the CG method. By what we have shown above, this says that the x generated by the CG method minimizes E(x) on the set of points ,.-1

x = x0 + E1-0s1T`Kg0 = x0 + P,,-,(T)T(h - x0) where P,,_,(A) is the (n - 1)st-degree polynomial P.-,(A)

among all iterations of the form

X. I = x0 + P (T)T(h - x°) the CG method makes

the least. Q.E.D.

_'E-,

1-0

s,V'. That is,

SEC. 5.4

CONJUGATE-GRADIENT AND SIMILAR METHODS IN HILBERT SPACE

121

Thus, if we insert any polynomial into Equation 5.4.2, we can get a where x,+, is generated by the conjugate-gradient method, bound for If we choose for comparison since that method gives the least value of as the (n + 1)st Chebyshev polynomial relative to )p()) ds(1) 1on [a, A], we find the following bound.

PROPOSITION 5.4.2. Let a - a/A. Then, for the conjugate-gradient method,

E(x.) S w.E(x0) S 4(1 \1 +

/a

a ) 2"E(xo)

and 11 x. - h 112 converges to zero at this same rate, where

"'. = (1 +

2(1 - ar )2. +

EXERCISE. Prove Proposition 5.4.2.

By this result we have reduced our estimate of the convergence factor from (1 - a)/(1 + a) to at least (1 - /-,% )/(1 _ix ). When one uses the steepest-descent algorithm to solve Mx = k by minimizing E, one moves from xx to x.., in the direction M*Hr.. Therefore, the steepest-descent method has the form of Equation 5.4.1 and, therefore, reduces the error E(x) by less than the conjugate-gradient method for every n. Since the best-known and in

certain cases best possible convergence estimates for steepest descent [Akaike (1959)] are of the form (1 - a)/(1 + a), while we have at least (1 - ^/ T )/(1 + ,/ _a), we see that the convergence of the conjugate-gradient method is also asymptotically better. For clarity, we now state the form that the conjugate-gradient algorithm takes in certain special cases. The iteration takes its simplest form in the case in which the operator M is itself positive-definite and self-adjoint; it was this case for which the method was originally developed. Here we may now take H = M-' and K = I. Thus

N=T=M,

E(x)=

Since N = T, we have p(A) = A, and the analysis of this section applies. The iteration becomes as follows:

Given x0, let po = ro = k - Mxo. For n = 0, 1,.. . , let

=

11r.11'

_

P., MP.>

r.+i = r. - c.MP.,

,

x.., = X. + c .P.

p.+1 = r.+1 + b.P.

122

SEC. 5.5

CONJUGATE-GRADIENT AND SIMILAR METHODS IN HILBERT SPACE

where

_

-11 r"+1112

-

b"

I I r" 112

A second special case which is simple enough for practical use arises from setting H = K = I, so that T = N = M*M. Again, p(2) - A, and we have E(x) = 11 r 112. Fortunately, for computational purposes one can avoid

the actual calculation of M*M and can put the iteration in the following form :

Given x0, let ro = k - Mxo, Po = go = M*ra. For n = 0, 1, C.

_

11 % 112

<MP MP.> - 11 MP" I I2'

r.+1 = r" - C. MP",

... , let

x"+ 1 = X. + C.P.

g"+ 1= M r"+ 1

P"+1 = g"+1 + b"P"

where

_<MP Mgr+I>=11 g"+1112

b"

II MP"112

11&11,

A third special case arises from H = (M*M)-', K - M*M, so that N = I, T = M*M, p(2) - 1, E(x) = II h - x112. By some manipulation, the iteration takes the following form:

Given x0, let r,, = k -- Mx0, p0 = M*ro. For n = 0, 1, . c" =

11 r" 112 I

IP P.

rn .1 = r

X" 1

112'

c"MP",

- X.

. .

, let

CnPn

P"+ I_ M*r"+ t

'?

b"P"

where b" . _.

II r"+1 E. Ilr"II2

EXERCISE. Show that the last two algorithms above generate the desired iterates.

General references: Daniel (1965, 1967b), Faddeev-Faddeeva (1963). S.S. THE PROJECTED-GRADIENT VIEWPOINT

It has been widely believed that the CG method exhibits superlinear convergence-that is, that II x" - h II tends to zero faster than any geometric

SEC. 5.5

CONJUGATE-GRADIENT AND SIMILAR METHODS IN HILBERT SPACE

123

sequence An with ) > 0-although the best error estimates in general only yield

,,/A

,/a

!

If we view the method as one of projecting the gradient direction onto the space conjugate to all preceding directions, we obtain an indication that the convergence might in fact be superlinear; the result we obtain in this way is also needed later for the analysis of nonquadratic functionals. For simplicity of notation, we restrict ourselves to the simplest special case of the CG method

with M itself positive-definite and self-adjoins, with N = T == Al, K 1. Without loss of generality, we consider the CG iteration starting with a first guess x0 - 0. Suppose we are given a vector d $ 0 such that 'd, k; = 0. by We define an equivalent inner product [x, y] = <x, My; Then we have [h, d] = 0-that is, h is M-conjugate to d. Let P, be the orthogonal (in the sense of the inner product projection onto the linear

subspace spanned by d, and let P, = I - PF Define the Hilbert space A', = P,. with inner product and define the operator M, = P,M in. ,. EXERCISE. Prove that M, is a bounded, self-adjoint, positive-definite linear operator from .7f', onto . V , and that, therefore, h is the unique solution of the equation

M,x -- k, = P,k Show that the spectral bounds a,, A, of M, are related to those a, A of M by a < a, A, A. Hint: For example, to solve Mix k' for k',,- Al',, let

aM-ld

x, - A4-1k' If

-
Mix, --: P,Mxa :_- P,(k' -?- Ord) = k'

If, also, Mix' == k' and x' C

then P,M(xa -- x') = 0, which implies

M(x, - x')

fad

CONJUGATE-GRADIENT AND SIMILAR METHODS IN HILBERT SPACE

124

SEC. 5.5

and

0 = [x,, - x', d] _ <M(xa - x'), d> = 48

soft = 0 and xa = x'.

To solve M,x = k, in .Y° we consider the general form of the CG method obtained by letting

K=M H=M2, sothat N=I,T=M, All the theory of the,CG method applies here, and we can in particular deduce

that E,(x") S w.zE,(xo)

where W. = (XI

2(1-a)" (1 + ax" + (I - -777A,

E,(x) = [h - x, h - x] = A straightforward calculation shows that the iterates x, generated by this general algorithm on M, in .", are precisely the same as the iterates generated by using the standard simple algorithm on M in ,' if the initial direction po

in the simple algorithm is not chosen as ro = k - Mxo = k as usual, but by the formula

Po=P,r0=ro

b_,d,

b_,=-

ro, Md>

that is, by the usual way of generating CG directions if we identify d with p_,. EXERCISE. Prove the assertion in the preceding paragraph.

All that the preceding paragraph says is that the standard CG method, modified to require the first direction po to be conjugate to d, is equivalent to a general CG method in a space M-conjugate to d; therefore, the modification of the standard method converges and, in fact, since

E,(k)=[h-x,h-x)=

E(x)

we have

E(x") < w2E(xo)

More generally, if we have proceeded through standard CG directions

SEC. 5.6

CONJUGATE-GRADIENT AND SIMILAR METHODS IN HILBERT SPACE

125

P09 pI, ... , PL-, to arrive at xL = 0, then the solution his M-conjugate to pi, 0 < i < L - 1, and we can define P,, as the orthogonal projection (in the [ ,

sense) onto the span of [p0, ... , PL-,}, P, = I - PA, .'I = P,.,°,

]

M, = PM. Then the remainder of the standard CG iterates are precisely the same as those generated by the more general CG method applied to M, in A, and, therefore, our convergence estimates can make use of the spectral bounds of M. on A e, rather than of M on A. Since the projections P, are "contracting" as we do this analysis after each new standard CG step, the spectral bounds on the operators M, might be contracting, allowing a proof of superlinear convergence. While we have not been successful in accomplishing this, it seems a worthwhile approach. 5.0. CONJUGATE GRADIENTS FOR GENERAL FUNCTIONALS

We now wish to consider minimizing a general functional f(x) over a

Hilbert space .° by some analogue of the conjugate-gradient method. In this case, Vf(x) plays the role of 2(Mx - k) and fx plays the role of 2M. For notational convenience we shall write J(x) _ Vf(x), P. = f we shall also write r" _ -J(x), J',.. Thus, in analogy to the quadratic problem, given x0, let pa = ro = -J(x0); for n = 0, 1, . . . , let x.+, = x" + c" to be determined; set r.+, = -J(x.+,), and p.+, = b"p", where

_ b"

- r"+ , J.+ i P" P., J.'. P.>

If the sequence of vectors p" that we generate in this manner is admissible,

then all the results of Chapter 4 apply to determine the choice of c,,; we consider the admissibility. If we desire

>allr"II2, then

a>0

precisely what we need is

b.-Xr.,P.-J > -(1 - a)IIr.II2 This follows, for example, if b.-

(I -a)

Ilr"II

IIP.-i Il

for which and

IIr.II

126

CONJUGATE-GRADIENT AND SIMILAR METHODS IN HILBERT SPACE

SEC. 5.6

However, unless the b as determined by the algorithm satisfy such a condition, we must modify bn and thus lose the relationship with conjugate gradi-

ents. Although the study of such methods may be of interest, the rapid convergence of the conjugate-gradient method for quadratic functionals is so desirable in general that we shall limit ourselves to the situation in which similar results can be proved for general functionals. Therefore, we shall now always assume that there exist positive numbers a, A such that

al <J;< = 0

We call this the pure CG algorithm. From these conditions it is simple to prove the following [Daniel (1965, 1967b)]. PROPOSITION 5.6.1. 0
I Ir,,112

l\Pn, Jnrn/ = I\Pn, Jn p \ /1 >

l\rn, Jarn/

= IIrn112+ N_I IIPn-, I12

II\Pn112

+ bn

.1

\

(Pn- 1, Jnpn- 1 >

Ilrn112
The following theorem follows from several earlier theorems in Chapters

1 and 4; for clarity we prove it directly here.

THEOREM 5.6.1. The sequence xn generated by the pure CG algorithm

starting with an arbitrary x0 conyerges to the unique x* minimizing f over J. The error estimate

Ilx. - x*II < cl 11J(x.)11 is valid.

Proof., Let fn(c) = f(xn -i- cpn); then

f Xc) = (J(x + all Pn III < f" (c)

CPA), PRA

<Jx.+c,,PA,

PO/

A II PA II2

SEC. 5.7

CONJUGATE-GRADIENT AND SIMILAR METHODS IN HILBERT SPACE

127

Since

-IIrn112 <0

f' (0) _ we deduce that c. exists and satisfies

a 11 r 112 A IIP.I12-A2 I

C.

Thus, for all c < c,,, we have for some 0 < t < 1, 2

f

f (X. +

2 2c2a I1 r,112

Thus

Ax.).-

I

l r 112 +

f(x) is bounded below, it follows that f (x)

a

A2 II r.ll2 <

converges to zero. Since

f (xo) is bounded, hence I I

xII

is bounded;

but

a l l xn+k - X. 112 < <J(xe+k) - J(x.), xa+k - X.>

which converges to zero. Thus there exists x' such that x converges to x'; clearly J(x') = 0 and f (x') = min (f (x) ; x in.*']. Uniqueness follows from I I J(x) - J(y) II I I x - y I I >_ <J(x) - J(y), x - y> > a II x -

1I2

as does the error estimate with x = x', y = x,,. Q.E.D. This theorem by itself does not indicate any special value for the method; all of the methods of Chapter 4 behave essentially in this fashion. The advan-

tage of the method for quadratic functions is its rapid convergence rate; we show that, asymptotically, this same rate is obtained in general. 5.7. LOCAL-CONVERGENCE RATES

In examining the local-convergence rate, we discover that estimates can

be found simultaneously for a larger class of methods-namely, without

128

CONJUGATE-GRADIENT AND SIMILAR METHODS IN HILBERT SPACE

SEC. 5.7

choosing b. via the conjugacy requirement. We assume instead that I I b.-, P.-, I I < D 11 r.11 for some D; then I I P.112

1 1r

. 1 1I2 + I I b.-, P.-, III < (1 + D2) II r.112

which yields

r.,.-=11r.11Z 11r.1111P.11

IIP.II

121 (1+D)

so that the p are admissible directions. (This assumption can be weakened via Remark I following Theorem 4.2.4.) If we examine the effect of this change on the Proof of Theorem 5.6.1, we find instead that 1

A(1+D2) f(x.+,) _< f(x.) - A(1

I

D2 II r.li2

so that the conclusions of the theorem follow. Thus we have proved the following. THEoi

i 5.7.1. Let 0 < aI < Jx < Al for x E .-°, J = Df.

Given x0, let po = ro = -J(xp). For n = 0, 1, . . ., let

(5.7.1)

Then x. converges to the unique x* minim;zing f over .*°, and

iix-x*II
SEC. 5.7

CONJUGATE-GRADIENT AND SIMILAR METHODS IN HILBERT SPACE

129

EXERCISE. Supply the details in the Proof of Theorem 5.7.1.

EXERCISE. Suppose we only know that Vf(x) is Lipschitz-continuous in Jr with a fixed Lipschitz constant, and that the algorithm of Theorem 5.7.1 is well defined; prove that 11 V f(x.) 11 , 0.

We shall analyze the local-convergence properties of this method; we merely note that when b =

P.,1.+ I P.

we have

D=(a - 1)

1/2

EXERCISE. Prove that

D = (A Q -

1)I/2

for the choice of b. immediately above.

Our approach will be to analyze the convergence in terms of an error measure E.(x) similar to E(x) in the quadratic case; the work lies in proving that. asymptotically, the convergence is the same for the more general case. LEMMA 5.7.1.

1

A(1+AIP.112-'Sa lip.lisa IIC.P.II< all

Ilr.+1115(1 + a)Ilr.ll Proof. The lower and upper bounds on c follow easily by considering f.(c) as in the Proof of Theorem 5.6.1; since 11 r.

112 = _< IIr.111IP.1I

we have I

11r. 11

a ilp s a

and II c.P.II < ll a 11

130

CONJUGATE-GRADIENT AND SIMILAR METHODS IN HILBERT SPACE

SEC. 5.7

Finally,

Ilr.+,

-r.II+IIr.II
-x.Ii+Ilr.1

<(Q + I)Ilr.ll Q.E.D.

In the quadratic case we found an error measure E(x) such that E(x,.,) < qE(x.) with q < 1. We attempt the same here. DEFINITION 5.7.1.

r - J(x)

E.(x) _ , Note that

E.(x.) =

where h. - x. + J'- I r. is the approximate solution given by Newton's method; thus E.(x) measures, in a sense, our deviation from that method. We also remark that E .(x,) and r. are of the same order of magnitude-that is, 11 .11'

< E.(x.) < 11 a 11'

We shall, for convenience, write

E.(x.) _ e. We now assume that there is a constant B such that

IIJx-Jy11

Bllx - yll

This assumption only needs to be valid in some neighborhood of the solution x' since eventually all iterates x. will be inside that neighborhood. LEMMA 5.7.2. 1Ir.IIZ

(

1

<J.P., Pw> \ 1 + >7.i

I IIr.l12 ( <J.P., P.> (I - 1.

Where

n. = e.

[B,I-Ala

j

CONJUGATE-GRADIENT AND SIMILAR METHODS IN HILBERT SPACE

SEC. 5.7

131

Proof: Define

&(C) = (Ax. + cP.), P.> JJ)P., P.>dt

_ - II r. I I2 + C<J,P., P.> -+- c J This gives

-II r. 112 + c<J.'p.,p.> -+c2BIIP.1I' + c2B

I I P II3

On the interval

p
c<J.' P., P.>

--

1

11 r. 11

c a 11p.11

B II P II' < g.(c)

and similarly above. Using aI < JX and II r II < e."IA, we deduce - I I r. II2 + c(l - q,,) <J.1 p., p.> < g.(c) and hence derive the upper bound on c,,. The lower bound is derived similarly. Q.E.D.

With Lemma 5.7.2 as a tool, we demonstrate that E decreasing, just as was

is strongly

for quadratics.

LEMMA 5.7.3.

E.+,(x.+,) - E.(x.) < -c. II r.II2 + de.1 where d

a\3

}

a/

Proof:

E, (x.+.) - E.(x.) = Kr., (J',+,

- J'-')r.>

+ r.+, - r., J.+, r.+, + = X + Y + Z

132

CONJUGATE-GRADIENT AND SIMILAR METHODS IN HILBERT SPACE

For the first term, J.-,

1r+1-1

4-1

= J'..' -, (J. -

So we have 11

III:+1-1

a2

Ilc.p.ll_a Ilr.ll

yielding

IXIsa Ilr.lI For the second term,

Y _ <1(x.) - Ax.' 1), X..'-Ir.+> + d1 = d, where, using an integral to represent d, as in Lemma 5.7.2, we have

Id115 2

a

llr.il2IIr.+1 II

Using the same device we derive

Z=-c.Ilr.Il2+d2 where

Id2I5- - "r-II' The proof then follows from II r.11 S e. ,I-A. Q.E.D.

LamA 5.7.4. E.+ 1(x.+1) S E.(x.)(q + s.]

where

q=1-A(l+D2)
converges to zero. If we use the pure CG method,

_(A - a2 q-\A+a)

SEC. 5.7

SEC. 5.7

CONJUGATE-GRADIENT AND SIMILAR METHODS IN HILBERT SPACE

133

Proof:. C. I I

IIT.1i2

IIr"IHZ

I Iz >

P., P.>

1+ 1.

2

11 r112

p.> 1

I I P.

I12

E"(x")

IIr"II2

r., "-'r"

1 + rl.

1

Therefore, by the previous lemma, de.'

A(l + D2x1 + q") so that

E"+,(x.+.) < E.(x.) [1 - A(1 + D2x1 + q.) + de.] E.(x.)[q + s.],

s. = O(e.)

For the pure CG method, <J' p., p.> = <J' r., r.> + b'-

I

so that C.

II r. 112 I

I r. 112 >

E. (x.)

1

<J. P., P.>

1 + rl.

I

I r. I I2

z

r.><J. r., r.5 4aA E"(x")

(A -+a)'

1+

1 + rl.

/.

by the inequality of Kantorovich [Faddeev-Faddeeva (1963), Kantorovich (1948)]. The remainder follows easily. Q.E.D. Since

A>A IIx.-x'I12 E.(x.)>IIII2 the above lemma completes the Proof of the following theorem. THEOREM 5.7.2. The sequence generated via Equation 5.7.1 in Theorem 5.7.1 is such that I I x -- x* I I2 converges to zero faster than any geometric sequence with convergence factor greater than a

A(1

D2)

134

SEC. 5.7

CONJUGATE-GRADIENT AND SIMILAR METHODS IN HILBERT SPACE

If we use the pure CG method, then q

_-A+a) (A-a\2

The above theorem, however, is not a really sharp theorem for the pure CG method, since it does not contain the convergence-rate factor IV2 < 4

(%1A

a )l 2n

+N/a)

found in the quadratic case. Since the factor (A

a)2 -}- a

is also valid for steepest descent by the same argument made in the Proof of Lemma 5.7.4 using the Kantorovich inequality, our CG estimate is no better. We now show that the rate factor w,2, is essentially valid here, showing the greater convergence rate for the pure CG method. THEOREM 5.7.3. For the pure CG algorithm, the following error estimate

holds. For any m > 0 there exists an N. such that for n > Nm, we have En+m(xn+m) l (wm + sn

= O(e.1

tends to zero. Here

V = (1

2[1 - (a/A)]m

n

\f5m

a/A)2m -+ - (1 -

a/A)2m

(

A -1

-a

Proof: Consider the iterate xn and the linear equation J'z = J,xn -1for z, having solution hn = xn + J;,-'r,,. We note that hn - xn is J'; conjugate to P, - If we consider the standard CG method to compute z = hn starting with zo = xn but requiring that the first direction po be J.-conjugate to the given direction d = pn._ we have precisely the situation discussed in Section 5.5. Therefore, the sequence of such iterates z; converges to hn and hn - Zm, Jn(hn - Zm)>

x'm2
J' -1 rn>

The first direction po in the modified method is the projection of J',xn + rn - JnZo - rn onto the J;-conjugate complement of pn_, ; that is, Po

pn

(5.7.2)

SEC. 5.7

CONJUGATE-GRADIENT AND SIMILAR METHODS IN HIL.BERT SPACE-

135

If we show that I
J;(hn - zm)> - En+m(Xn+m)1.

which equals I _.
'-- .fin+m)

I

is of order e.2 +I'/(4m-3)I, then we shall have En+m(Xn+m) =
-T- [En+m(X.+m) -
n

n

-m

1

n

We indicate the proof of the order of magnitude. The sum to he e timated splits into h - zm, (J;, - Jn'.m)(hn

I

-- zm) .

I

and

- zm

hn+m

the first of which is less than

BIIhn -

O(en)

Zm112I1Xn+m - x.

by Equation 5.7.2 and the fact that

II X.--X*II''\

en

Clearly the second part of the sum is less than Ilhn

---

xntm - =n. IIO(en)

We estimate the normed term. First, Il hn - hntmll

mII

lhn+; - h,+I)iI

while

I I h,+ I- hill

II

+I

Ic1Pi J

xi

J,'

.1

'(rn.,I

- r.)

r, :l

(J;.I

I

-J,-')rill

O(e;)

136

CONJUGATE-GRADIENT AND SIMILAR METHODS IN HILBERT SPACE

SEC. 5.8

since

rj+, - r, _ -JJ+,(c,PJ) + O(e:) We still must estimate I I XR+m - Z . 1 1 = I I

zm- I - e,.-,

Xn+,n-, +-

Pm-, 11

where the - indicates the z,-iteration. Since Po = p,,, an inductive argument [Daniel (1969)] yields -f(41-4)/(4m- 3)1)

II x. +, - z,II = O(e.'

or

II

II =

O(e;+11/(4m-3)1)

for all i, which leads to II =

II

0(e.1+(1/(4.-3)))

Q.E.D.

Thus we have proved that, asymptotically, the rapid convergence of the CG iterates for quadratic functionals carries over to more general functionals.

It is in part this convergence, more rapid than any other gradient type of method, that has led to the great popularity of conjugate-gradient methods recently. Of course, so far as the analysis above has been taken, it appears that one must precisely compute c. and make p and pb-, precisely J;-conjugate in order to guarantee convergence. Since such precision is impossible computationally, it is important to know that the rapid-convergence behavior will be maintained under computationally convenient modifications. Much

,

the same results apply, of course, to nearly any method; we consider the methods for which 1 1b

II S D II r,.ll

5.8. COMPUTATIONAL MODIFICATIONS

Consider the class of methods given by Equation 5.7.1. The condition that 0-that is, that be precisely orthogonal to p.-is very restrictive. Let us consider the algorithm with the sole modification that c be chosen so

If

I=

P <8 IIr+,IIIIPAId-

for some small 3 > 0. Since b.- <_IIr.IIIIP II + SDIIrRII2

II r IIZ = r,,

CONJUGATE-GRADIENT AND SIMILAR METHODS IN HILBERT SPACE

SEC. 5.8

137

we have (1 - 3D) Ii rw lI <_ IIPw it

LEMMA 5.8.1. If 1

a< 1+2D then cw is bounded away from zero. Proof.

\

/

\\

SS

(I - VD) I I rw 112 <

P-> 1111 rw+ 1 11 11P. 11 + On - rw+ ,P.> 11

rwr+ l 11

6 11p.11Ilrw+IlI+Ac"llp lie

Now

IIP"ll<(1 +D)Ilrwll and

I1 rw+,Il
and hence (1 - JD) I I rw ! I2 < 6(1 + D)11r.11 [11 r.11 + cwA(1 + D) 11 r.111

+ Ac"(1 + D)2

Ill

IIrw

which implies

1 - b(1 + 2D)

)>0

if

d<1+2D 1

Q.E.D.

THEOREM 5.8.1. For arbitrary x0, with J > 0 small enough (independent of x0) and cw and b" determined as described above, it follows that xw - x*. Proof..

r(xw) -

f(xw+,) _ <J(xw+,), -cwPa> + 2 C.NPw, Jxw+rc.P.> C. IIP"112

{

II rw+, Il IPw

dIIP"112[Ta-6

II I1Pwil

ll

c"a}

(A+cw(1+6L)IJ

Because of the lower bound for cw, if 6 is small enough, then f(xw) - f (xw+,) > d, I I Pw 112

+

138

CONJUGATE-GRADIENT AND SIMILAR METHODS IN HILBERT SPACE

SEC. 5.8

for some d, > 0, and hence I I p.11 > (1 - SD) I I r I I tends to zero, implying x*. Q.E.D.

X.

The above theorem is somewhat similar to Theorem 4.5.1. In order to obtain good estimates of the local-convergence' rate, we need to determine c more accurately. According to Lemma 5.7.2, c. is approximately given by I r I J2

I

-
Let us consider, using this latter value as an approximation c to cn, and let

us denote the elements of this method by an overbar (-)-that is, z., p,,, etc.-starting with xu = x0, given. Proceeding for this iteration just as we did in Section 5.7, we can easily find that

IIP,,II<(1+D)Ili.II that c solving

0 exists and satisfies

IC - e.I = O(e) that I

I

I

I I= O(e,2, )

that with

qn = (1

- A(1 + D2)) + O(e) + W.- 1)

and that

This in essence proves the following proposition.

PROPOSITION 5.8.1. The asymptotic convergence rate for JIx -- x*II2 for the general algorithm with I I bn-, I < DII i, I I and c. determined by I

its linearized value

is greater than that of any geometric sequence with convergence factor

greater than

q=1-

a

A(1

,

D2)

CONJUGATE-GRADIENT AND SIMILAR METHODS IN HILBERT SPACE

SEC. 5.8

139

If b" = 0 (steepest descent) or b" is determined by the requirement of Ja conjugacy, then 9

_(A-a12 `A + a )

If we use the conjugacy requirement to determine b then, as one would expect, the better convergence rate holds [Daniel (1967a, 1969)]. PROPOSITION 5.8.2. If

b= "

-,+, P"> and CP "+, P">

r P"> c - P", .,P">

then the asymptotic convergence rate-that is, for e,, small enough-is described as follows: for every m there exists N;" such that for n > Nm, we have E"+m(x.,+m) < [wn, ± O(e.11(4m-3)) 1. O(e+/(4m-»)]En(x")

where wm is given in Theorem 5.7.3. When J(x) is linear, we know that

b" = II r"+,IIZ IIr"II Since this formula does not involve Jn in any way, it is computationally useful

and has been used in practice for general problems; a computer program can be found in Fletcher-Reeves (1964). If b" satisfies II b"p" II < DII r"+, II, then convergence is guaranteed. by previous theorems; such an inequality

does not appear to be valid in general, however. It can be guaranteed by setting

b"=min{11r"+,IIZ A ,IIr"+,II I1r"IIZ

' a

IIP"II

Another way to compute a b" which is just as convenient from the computa-. tional viewpoint as that above, but more easily analyzed, is via the formula [Poljak (1969a)] IIr"II2

which is a correct formula for quadratics. EXERCISE. Prove that the three determinations of b", namely

11 r"+, III 11r"IIZ

r

Z

and

f,

P

,

are equivalent on quadratics.

140

CONJUGATE-GRADIENT AND SIMILAR METHODS IN HILBERT SPACE

SEC. 5.8

For the global convergence question, we have x.D l

I l b"P" I I =11 P I I

I
I

At"IIP"Illlr"+,11
IIAt.11P"I111r"+I11

which then implies that we get global convergence. This choice has been used widely in practical computations with optimal-control problems in the Soviet Union [Poljak (1969b, 1969c), Poljak-Skokov (1967a, 1967b), Poljak-Orlov

et al. (1967), Poljak-Ivanov-Pukov (1967)]; essentially the same localconvergence results as above are known in this case also [Poljak (1969a)). In fact, if one uses either of'these computationally convenient values of b" and even the linearized c" (or, of course, a c" such that = 0), then the rapid local convergence described in Proposition 5.8.2 is valid here also [Daniel (1967a, 1969)]. Thus reasonable computational modifications will preserve the global-convergence properties, while the same asymptotic behavior of the error will occur if the modification is asymptotically exact, such as in the linearizations. It seems reasonable that a criterion such as

Illr"+,IIIIP"II < for small fixed 6, with b" determined by the conjugacy requirement, should lead to the rapid convergence described by w"; such results do not appear to be known, however. We have recently learned via private communication with G. Zoutendijk of a global convergence theorem for b,, = 11 r,+, 112/11 r, 112 with no additional modification of b,, assuming exact minimization along the line. THEOREM 5.8.2. Let f be bounded below and Vf be Lipschitz continuous and bounded on W(xo), the closed convex hull of {x; f(x)
minimization along the line x" + cp.. Then there exists at least one subsequence x", such that Vf(x",) - 0; if W(x.) is bounded and f is convex, then the entire sequence [x,} is a minimizing sequence.

Proof: If no subsequence has the stated property, then there are positive numbers B, E, and N such that e < 11 V f(x") 11 N. Now

P. Ilr"112

r" P"- i llr"I12 + 11 r"-, 112

SEC. 5.3

CONJUGATE-GRADIENT AND SIMILAR METHODS IN HILBERT SPACE

141

and hence 2

P.- I

II I

I

p"II2II'

II

rl 112 + II1.1 r. ,

11,11

which in turn yields P.

2

II r" IIZII

IIPNZ + 11 rN 11`

for n > N. If we define In -=

IIr"II1IP"II

=

fl r. r. l I

I

I I r I12

IIP"II - B IIP.II

for n > N, we see that

a"fin-N

1

IIPNI2

and hence

But then, according to Remark I after Theorem 4.2.4, this implies that II VJ(x.)11-> 0, a contradiction. Therefore Vfix,,,) - 0 for some subsequence. Theorem 4.2.1 applied to this subsequence implies that the subsequence is

minimizing, while the inequality f(x",,,) j > n, implies that {x.) is a minimizing sequence. Q.E.D.

6

GRADIENT METHODS IN

(RI

6.1. INTRODUCTION

Since ER' under any norm (all of which are equivalent) is a Banach space, and is in fact a Hilbert space under the usual inner-product, all the results of Chapters 4 and 5 apply here. In fact, of course, more detailed results can be obtained for gradient methods in R' because of the especially simple structure of this space; in this chapter we examine some of these results.

First, because of the finite dimensionality of E', the weak and norm topologies coincide, and any closed, bounded set is (sequentially) compact and vice versa; thus the existence theory of Cnapter I is simplified, the precise simplifications being left to the reader. Second, because of the nature of the topology in R', criticizing sequences 1x,.) for a functional f are generally more valuable since, if W(x,,) is bounded (see Section 4.2), then limit points x' of {xn} exist and must be critical points off; in the following sections we shall examine the consequences of this more closely.

Finally, the asymptotic convergence rates of particular methods can be studied in more detail in ER'; w_- describe some of these results. 6.2. CONVERGENCE OF x,,, i - X. TO ZERO

We mentioned in Section 4.2, particularly in Theorem 4.2.3, that the

convergence of x,t, - x to zero could be of great value; in Section 6.3 we shall examine this in some detail. In the present section we shall examine

situations in which one can assert that x,,, - xn does converge to zero. We have already seen in Chapter 4 -according to Theorems 4.6.1, 4.6.2, x tends to zero when is determined by use of

and 4.6.3-that

142

SEC. 6.2

143

GRADIENT METHODS IN IR'

simple intervals along the line. For the methods of Section 4.7 involving a

range function along the line, we could not in general prove that 0, as indicated by Theorems 4.7.1 and 4.7.2 and their exxII tended versions in Corollaries 4.7.1 and 4.7.2. As shown in Corollary 4.7.3, II --+ 0 implies more generally, whenever II where p = IIP. II - . 0-in many special cases of this general method we can assert that II x.11 -- 0. It is not true in general, however, that the algorithms of Sections 4.3, 4.4, and 4.5 involving minimization along the line necessarily II

yield 11 x.+, - x II - 0; contradicting examples- can -be created. We can,

however, show that for many methods and certain kinds of functions we must always have I I x. (I --. 0.

If W(xo) is compact, then

has limit points x' and Vf(x') = 0.

Hence the following proposition follows. PROPOSITION 6.2.1. If Vf is continuous on the compact set W(xo) and

Vf(x) = 0 has only one solution x*, then x -- x*. We seek more significant results.

THEOREM 6.2.1 [Elkin (1968)]. If W(xo) is compact, if there exists a

a > 0 such that

f(x.+,)
for 0 < t < 5 for all

and if f is not constant on any line segments in W(x(,), then 11 x,,, , - x I I Proof., If II xn+, - x I I a 0, then we may assume that x,, x,,,

-, x", x' :# x" for some subsequence n;. Thus

f(x,,,,,) <

0.

- x',

t <8

(1 --

f(x,,,+,) < f

n

which implies

f(x') < f(x") < f[tx"

(1

t)x'] < f(x') for 0 < t < 6

which means that f is constant on a line segment. Q.E.D. THEOREM 6.2.2. If is determined as in Theorem 4.4.1, if Vf is continuous on W(xo), and if f is not constant on any line segment in the compact, set W(x0), then II xn+, - x I - -. 0. I

Proof: In the Proof of Theorem 4.4.1 we observed that t provides the global minimum of

f(x + 1Pa) -

for 0 < t

t.

144

SEC. 6.2

GRADIENT MEMODS IN RI

Since I I x.+, - x 11 is bounded because W(xo) is compact, we know from Theorem 4.4.1 that = (Vf(x.), If we have x,,, -per x',

xw,+,

Ipw1I>IIx.+l

-+ x" : x', then

Jlxw,), x.,+l - x,) aw. <Jlxw.+l)

which yields

f(x') < f(x") Sfbtx' + (t - 2)x'] Sf(x') a contradiction to the assumptions about f. Q.E.D.

The above results treat the methods of Section 4.4; we still have not considered the method of Section 4.3, in which minimizes f along the line xw + tp,,. We only know how to treat this via more general results applying to all methods. THEOREM 6.2.3. If W(xo) is compact, if f(xw+,) S f(xw) for all n, if I I Vfl (x.) I I -' 0, and if there exists a function b(t) for t > 0, a(t) Z 0, with 0 if and only if to - 0 and satisfying 11(x) -f(Y) l + 1I Vf(x) - of(y) I I >_ RI x - Y ID, for x, y e W(xo) then 11x.+,

- x. I1--+ 0.

Proof: If II x.+, - x.11 + 0, we take Then clearly, f(x') = f(x"). Thus

x,,,

x', x,,+, -+ x" # x'.

a(II xw.+, - x., II) S I f(x.,+,) -f(xw,) I + I I o1(xw,+,) - V1(xw,) I I -' 0

Q.E.D.

We recall that if f is strictly convex and Vf is continuous, then

f(x2) -f(x1)> <x2 - x Vf(x,)> if x, :e-x2 Thus, if <x2 - x V1(x,)> > 0, we conclude f(x2) > f(x,); a function satisfying this property is called strictlypseudo-con vex [Elkin (1968), Ponstein (1967)].

GRADIENT METHODS IN IR'

SEC. 6.3

145

THEOREM 6.2.4 [Elkin (1968)]. For all x, y in the compact set W(x,,), f(x,,) for all n, let let (x - y, V f ( y ) > > 0 imply f(x) > f (y). Let f 0. Then
II X., - xnll 0. Proof: As usual, we take x., --i x', x.,*, , x" :;,-, x' if II 0. Of course, f(x') = f(x") and, therefore,

xn II +

<x" - x', Vlx') j < 0 by the strict-pseudo-convexity assumption. However, <x,,,+, - x,,,, VAx,,,p converges to <x" - x', Vflx')> by continuity and to zero by assumption, a contradiction. Q.E.D.

Thus we have found a large variety of ways to guarantee that 0; let us now see how this restricts the nature of the limit set of (x,)-that is, the set of limit points. l l x.,., --x.11

6.3. THE LIMIT SET OF lx,l We Let L denote the (closed) set of limit points of the sequence next study the nature of L in terms of the sequence (x.], particularly under the

assumption that

I1

x. II

0. First we strengthen Theorem 4.2.3;

recall that a continuum is a closed set which cannot be written as the union of two nonempty, disjoint, closed sets. THEOREM 6.3.1 [Ostrowski (1966a, b)]. If the sequence (x.] in 1R is 0, and if (x.) does not converge, then the limit x II set L is a continuum.

bounded, if 11

Proof: Suppose we can write the closed set L as L = C, u C2 where C, and C2 are closed, nonempty, and C, n C2 = 0. Then there is an e > 0

such that II c, - c211 > e for all c, E C c_ E C2. For n > N, we have 11 < E. Choose c, in C, ; there exist arbitrarily large n > NF with

IIxA-c,11>_ 3 For such n there exist n: > n with 1lxm-C2,mlI<

S

for some c2,m in C2. Let mo be the smallest such index. Then I I xmo- - c211 > 3

146

sEC. 6.3

GRADIENT METHODS IN IR!

for all c2 in C2 and hence, since II xm. - xm.-, II _< 3

we have Ilxm. -

czll>

3

for all c2 in C2. Thus we have <11xm.-c211

for all. c2 E CZ and

Ilxm.-C2.m.11< 36

If we do this for an infinite sequence of indices m0, there is a limit point x' of the sequence, so x' E L and the distance of x' from'C2 is at least 6/3 since <11xm.-CZ11

for all c2 E C2 so x' must lie in C, . Yet its distance from C2 is at most 26/3 since II xm. - CZ,m. 11 <

36

while C, and C2 are 6 apart-a contradiction. Q.E.D. The import of the above result should be quite obvious; if Vf is continu0, and therefore ous, then Vj(x) = 0 for all x in the limit set L, if I I Vf

must be convergent if I I x.+, - x. I I - 0 and if we can conclude that (x; Vf(x) = 0) contains no continuum as a subset. Under some additional it is possible to discover still hypotheses on the method of generating more about the properties of the limit set L, following Ostrowski (1966a); if these properties are not valid on (x; Vf(x) = 0), then we again conclude is convergent. We consider some of these results. that We assume that x.+, = X. + t. p.

(6.3.1)

where

a.=llt.p.Il
R 0,

}

11 VJ(x.)11- 0

(6.3.2)

These assumptions are valid, for example, for the simple-interval methods of

147

GRADIENT METHODS IN IR'

sEC. 6.3

Section 4.5, and for other methods under some added hypotheses. It is, for instance, valid for the following form of the method of Section 4.6 using a range function. Let d(t) = it, 0 < 0 < 4, pw =11 V f ( x w )IIq", I14. 11= 1

E> 0

<-VJ (xw), P.> ? E I I Vf(xw)112,

and suppose that

IIVf(x)-VJ(y)II
xw+, -

xw I

I - 0 and

R= I

vw-lltP.IISR11Vf(x")11, We also have that

f(xw+,) > 3tw <- Vf (xw), Pw> Z 3 e tw 11 VAX.)

f(xw) -

11,

Thus, if tw is bounded away from zero, I(x") -1(x"+,) >_ rIIVf(x")II2

with r > 0 as asserted; we show that tw is bounded from zero. If not, suppose 0 (actually a subsequence). Since tw # 1, we know that t" has been chosen t" so that

0
g(xw,

tw,

Pw) - 1 I

Then

a
f(xw) - l lxw+ )

tw -Vf(x"), Pw>

-1

<--Vf(xw + 2NINPN) + VAX.), tNPw t" f(xw), Pw

for some A. E (0, 1)

which yields

8<-

Vf(xw), P.> <_ lip. 1111 Vf(xw) - Vf (xw + .tw t" Pw) I I < Ltw lip.

1(2

Therefore,

t"-

8

-Vf(x.), Pw

LIIPw11

Of 11 VAX.) II2

ZL

11

x"II2

Je

L>0

which is a contradiction. Thus the assumptions of Equations 6.3.1 and 6.3.2 are valid for this important method.

148

GRADIENT METHODS IN IRt

SEC. 6.3

EXERCISE. By the same kind of argument as above, show that Equations 6.3.1 and 6.3.2 are also valid for a functional f and directionspn as described above if>t is chosen as the first local minimum or global minimum off along X.

1P.-

Thus our assumptions in Equations 6.3.1 and 6.3.2 are valid for a large class of methods; we now consider the implications of these assumptions. Without loss of generality, we assume f(xn) 0.

LEMMA 6.3.1. Assume that z is a limit point of fxn} and that for Ix - ZII

0 we have

r<

f(x) <-1' I I Vf(x)112, Then (xn) converges to z.

Proof: Without loss of generality, we take z = 0. Define

Q=max( ir i l) D = the greatest integer less than or equal to 4Q2. Suppose we have an integer m and an integer p > D such that I I x,, +, I I <_ p for s = 0, 1, ... , p. we write

r

II vf(x,n+!) II2 < J(x,n+,) -

for s = 0, 1, i=,

, . .

f(x,n+o+1) < f(xm+,)

1, I I Vflxm+,)112

, p, since I xn,+, I I < p. Thus we have I

ui
Q>

ui =

I1Vf(xm+i)112,

S - 0, 1, ...,p

Solving this inequality [Ostrowski (1966a)] yields Ui+D < a ui,

i.e.,

I I Vf(x

+,+D) I I <

I I Vf

I I,

1= 0, 1, . . ..,p - D

Now, since the origin z is a limit point of [xn} and 11Vf(xn)11 0 and I I xn+1 - x I I = 0, there exists a fixed m depending on p such that an

11X.11< P

Vf(X,n+i)

I I< 3R

GRADIENT METHODS IN IR'

SEC. 6.3

149

We now show that. I I x I I m, which, since p can be taken 0. If not, let x.+,+, be the first such x arbitrarily small, gives I I x - z I I with I I xm+,+, I I > p; then I I x, II D so that m and p are allowable values for the inequality found above. Thus we have I I x.+,+, - xm I I =

1-0

x,,.+,+, - x,.+, II

II

<_ R

1-0

I I Vf(xm+,)

< R{ E I I Vf(xm+,) I I+ ZE I I Vf

I I+ .. .

-xmll<_4+3<_p which is a contradiction. Q.E.D. Now we can use the above to prove a theorem on the nature of the limit set L.

THEOREM 6.3.2 (Ostrowski (1966a)]. If f is twice continuously differ-

entiable on W(xo), if z is a limit point of and if the Jacobian matrix f" off at z is nonsingular, then converges to z. Proof: Since f(z) = 0 and Vf(z) = 0, near z we have f(x) = + 0(11 x - z 11)2

Without loss of generality, we let z = 0. Since f', is symmetric and nonsingular, we order its eigenvalues 0 < 11, I <

[A;+o(l)]IIx-z112 and thus

f(x) <

[ti + 0(1)]

11-x112

so that the hypotheses of Lemma 6.3.1 are satisfied. Q.E.D.

The import of the above is that, if f " is not singular on any continuum, then we conclude that converges, since otherwise L is a continuum and

150

sEc. 6.4

GRADIENT METHODS IN IR'

f" must be singular on it. Actually, in Ostrowski (1966a) it is proved that if f is four times continuously differentiable in W(xo), then, if for some particular z in L the rank off;' is I - 1, it follows that {x,} converges to z, providing more detail for the theorem above. Thus we see that in IR', if the assumptions in

Equations 6.3.1 and 6.3.2 are valid-as they are for many methods-the

sequence (x,} is convergent except for very pathological functionals f whose

gradient and second-derivative matrix "vanish" to a high degree on a continuum. 6.4. IMPROVED CONVERGENCE RESULTS

In Section 5.7 we derived a local-convergence-rate estimate for methods in which Pn+t - r +l + b,.p,.,

r,+, = -Vf(x,+i)

and 11 b.p. II < D 11 r,, 11 for some constant D. In particular, for the pure

steepest-descent algorithm with b - 0, we found that the convergence was at least as fast as a geometric sequence with convergence ratio (A - a)/(A + a), where

0
f(x) = <x, Mx> with

0 = 0; in IR2 this implies that one uses precisely the same two directions ro and r, continually. Asymptotically this is true in IR' for I Z 2 as well. More precisely [Akaike (1959), Forsythe (1968)], asymptotically the directions alternate between two fixed directions in the space spanned by a certain two eigenvectors of M and thus the convergence is precisely linear

and described by the ratio (A - a)/(A + a) for any iteration not starting with an eigenvector as x0. Essentially the same results have been found for the

s-dimensional optimum-gradient method [Forsythe (1968)], in which x,+, is chosen to minimizef over the s-dimensional plane

X. +

aM'x,

GRADIENT METHODS IN IR'

SEC. 6.4

151

Since no better results can possibly hold for more general : anctionals, we see

that the rate given for the steepest-descent method is the best possible. A better estimate, however, can be given for the conjugate-gradient method also discussed in Section 5.7; in that case, we know by Theorem 5.2.2 that for quadratic functionals in IR' the precise solution is found in at most I steps-that is, x, is the solution. Therefore, we must look at general functionals in (R' to find a better convergence estimate; the estimate is provided by using Theorem 5.7.3, which states that the convergence rate is essentially that given for quadratics by considering the method as an optima: process.

If we use the symbol D'to represent arbitrary constants, then in the proof of that theorem we found r.II1.[1/(4m-3))

II x.+m - zm II < D II

Thus

IIx,+m-x*II<_IIxA+m-z,ll+lIz.,-h.ll+lfh.-x*II is the Newton step and zm is the point obtained where h. = x - J,' in m steps of the conjugate-gradient method used to compute h. by solving J;, z = J;,x r- J(x ). If we are in RR! and m = 1, then we know that z, = zm = h,,, and thus I1

xa+,-z,l1 +11h.-x*11
- x*II

-J(x*)II'+[1/(41-3)]

D II x.

- x* III + 11

/(41- 3)].

Thus we have proved the following theorem.

THEOREM 6.4.1. If 0 < aI < J < AI, and

I I JX - Y. 1 1 < B I I x - y I I,

then the pure conjugate-gradient (CG) method in IR' yields a sequence converging to the point x* minimizing f and the asymptotic-convergence rate is described by

IIx.+,- x* 11<_ Dllx. - x*

111+(1/(41-3)1)

D constant

Since I steps of this method yield the same error-reduction factor as one

step of a method with superlinear-actually, {1 + [1/(41 - 3)]}th orderconvergence, we call this (1/1)-superlinear convergence. It would seem likely from this result and from our later Theorems 7.4.3 and 7.4.5 that the conver-

gence is in fact at least superlinear-that is, that jim IIx.+1 - x* II = 0 -o FIX. - x* I I

152

sec. 6.4

GRADIENT METHODS IN CR'

and probably that lim 11 X.+, - x* I I

=

( P=\1+ 4

Y

)

I/'

Neither of these results has been proved or disproved, however. At the end of Section 5.8 we considered the possibility of using the conjugate-gradient method with b determined by the formula

b"-

Ilr.+, III

IrIIII

as in the quadratic case; for this method, all of the convergence results, including the 1/1-superlinear convergence described above, are valid near the

solution and we have global convergence. Practical experience in IR' has shown that one must periodically restart the algorithm with a steepest-descent direction; that is, one should let

b. = 0 if n = 0 (modulo m) for some m Commonly m =1 or m =1 + 1. We analyze this method. THEOREM 6.4.2 (Ortega-Rheinboldt(1968)]. Let 0 < aI < f < Al < oo for all x in IR', and let be determined by

X.+t = X. + C.P.,

P.=r.+b.-tP.-t,

J(x. + C.P.) = minJ(X. + CP.) czo

b.-, = 0 if n - 1 = 0 (modulo m) and

t=

Y.

Y.-t

otherwise.

Y. = IIr.112

Then I I x. - x* I I

0 and all of the convergence estimates for the conjugate-

gradient method hold near x*. Proof. For any n > 0 let no be the greatest integer not exceeding n which is congruent to zero modulo m. Then it is easy to see that

=I

. Yr

r,r

Since Wo = {x; f(x)
Yr. = .,srs. min ',

GRADIENT METHODS IN IR,

SEC. 6.4

153

Then

lip. II<_

y"1)C7 E ilrill5(m Y, '

Therefore,

C(m+1)Y' P. IIP" II

-C(m+ 1)

Proceeding as in Theorem 4.3.1, we conclude that

Ir"'

IIP.II\ - 0

and hence I I ri,112 - 0 However, since IIP

.Ii/ 0

and W. is bounded, we conclude from Theorem 6.2.4 that II

x, I I - ' 0.

Since Vf is uniformly continuous in Wo it follows then that IIVRx,,) V1(x,.) II - 0 and hence I I Vf(xJ I I ---)-0

0. Once the algorithm essentially restarts This then implies I I x" - x* I I [n = 0 (modulo m)] near enough to x*, all of the original conjugate-gradient theory applies to give the convergence results. Q.E.D. EXERCISE. Give another, simpler proof of the above theorem by showing,

first, that the sequence (z1) converges to x*, where z, - x, is essentially generated by the steepest-descent method; and then by showing that this implies the convergence of the entire sequence (x"}.

Theorem 6.4.2 was first proved in Ortega-Rheinboldt (1968), where it is given, since the local-rate-of-convergence estimates were not desired there, under more general assumptions on f-namely, that Wo be compact and that some condition guaranteeing I I x,+, - x, I I - 0 be satisfied. The theorem is valuable from a computational viewpoint, since the conjugate-gradient method is known to exhibit its powerful convergence properties thereby, even when implemented in a fashion not requiring explicit use of the Hessian f" of second derivatives.

154

SEC. 6.6

GRADIENT METHODS IN IR'

EXAMPLE [Daniel (I 967a)). Consider obtaining a solution to V2u = 16yz(ey'° - 1) in [0, 1] x [0, 1] with u(y, z) = 0 on the boundary, having unique solution u(y, z) = 0, using the usual five-point formula with an h x h mesh. The discretized equations are just Vflx) = 0 for some uniformly convex functional f, with x in IR', I = [(1/h) - 1]2. Using (1) the pure CG method. with by determined for exact conjugacy; (2) the modified CG method with b

mini 11r141III, A I l rn 1 1

a

11p.11

and (3) steepest descent-that is, b - 0-the number of iterations and computer time required to reduce 11Vf(x)11 from 100 to 10-6 on an Electrologica X1, a very slow machine, for h = }, were respectively (1) 12 iteratioqs, 210 seconds; (2) 13 iterations, 211 seconds; and (3) 40 iterations, 454 seconds. 6.5. CONSTRAINED PROBLEMS

Most of the previous comments of this chapter apply to problems with constraints; the topology of IR' simplifies convergence questions. We shall not attempt, however, to look into the details of these specializations. The methods discussed in Section 4.10 are of course applicable in IR' and, in fact, usually originated there. In particular, the methods of Theorems 4.10.1 and 4.10.2 are extensions of methods in Frank-Wolfe (1956), Gilbert (1966), and Rosen (1960-61). Once we restrict ourselves to Ct' and constrained

problems, all of the complex theory of mathematical programming and its many algorithms presents itself. Since we could not hope to proceed to any

real depth of presentation of this material in this text, we go no further with mathematical-programming methods but rather refer the reader to the literature [Fiacco-McCormick (1968), Hadley (1964), Mangasarian (1969), Zangwill (1969)]. 6.6. MINIMIZATION ALONG THE LINE

To turn a theoretical method into a useful computational algorithm, one needs to be able to implement all steps of the method reasonably quickly and accurately. The simple-interval method of Section 4.5 in theory requires a

knowledge of the Lipschitz constant L; in practice, of course, one would usually try some such procedure as letting t. equal the first of the numbers T, OCT, a2T, ... , a r (0, 1), for which f decreases significantly. We saw how this approach could be justified in Section 4.6, where it was applied to the method there for finding t,,. We have not, however, indicated how one might

implement the methods of Sections 4.3 and 4.4, except for the material concerning the search routine in Section 4.7. It is very difficult to say how

one should proceed with the approximate minimization along the line

SEC. 6.6

GRADIENT METHODS IN IR1

.155

x,, + tp,,. Clearly one need not waste time doing this too accurately "far away" from the solution, but one does demand accuracy "near" the solution. These

are difficult terms to define, but one might reasonably use as a measure, if all the variables in f are scaled so as to have essentially the same importance; such scaling is always important computationally. We shall assume that such questions of needed accuracy can be answered and shall proceed with a presentation of methods for acquiring this accuracy; we shall present methods which appear from practice to give satisfactory accuracy at a reasonable cost of efficiency. Most algorithms in use for finding a minimum along a line rely on an itei-ative interpolation method rather than direct search; they do, however, often incorporate as a first step a preliminary search to isolate the minimizing point in a certain interval. Therefore, we shall first look briefly at the results of direct-search methods. We have already considered in Section 4.7 how one

can search to locate the minimizing point for a strongly quasi-convex functional. Although we generally prefer interpolation methods for accurate determination of the minimizing point in practice, we describe a direct-search method for finding the minimizing point as accurately as possible. Suppose

that the minimizing point for a strongly quasi-convex function g(t) = f(x,, + t is known to lie in the interval [ao, bo]. If we insert two points, ao < to,, < to,2 < bo, and evaluate g there, then the minimizing point is in [ao, to,2] if g(to.,) < g(to,z); in [to,,, bo] if g(to,1) > g(to, 2); and in [to.1, to, 2] if g(to, 1) = g(to, 2). Thus we have located the minimum

in an interval [a b1] smaller than [ao, bo] and we can proceed iteratively. The method would be most efficient if we only need to evaluate g at one new point each time-that is, if either t,,, or t1,2 would equal whichever of to,1 and to,2 lies in (a b,); to allow this, we never choose a, = to,,, b1 = to,2 but in that case of g(to,,) = g(to,2) we define a, = ao, b, = 111,2. If one seeks the smallest final interval [a,,,, bm] for a given m, then it is known [Kiefer (1957), Spang (1962)] that one should choose tj, I

___

Fm- I -1

m+I -i

(b, - a1) + a

tt.2 =

F. -, Fm+1-t

(b, - a) + a,

where Fo = 1, F, = 1, F; = F1-1 + F,-2 are the Fibonacci numbers. This Fibonacci search always requires only one evaluation of g per step. On the

final step, one takes

tm- i,I = (# +

am-I) + am_,

t.-J,2 - 3-I

+ am-i

in order to isolate the minimum best. The final interval has width bm-am=(b,-ao)-1-E

2F.

156

sEC. 6.6

GRADIENT METHODS IN IR'

Since F20 > 104, we see that the intervals shrink rapidly. It is known that for large i, we have

F!-'

0.382,

F,r i

F' F,+t

- 0.618

which allows one to use the simpler formulas:

, = 0.382(b, - a,) + a, t, z = 0.618(b, - a,) + a, t,

The final interval in this way satisfies

b. - a.

= (0.618)m(bo - ao)

Thus one can isolate the minimum in this way as accurately as desired.

Next we turn to methods using interpolation, although some of our remarks apply to direct-search techniques as well. Some of the procedures first seek an interval in which the minimizing point t* lies. Usually this is done by taking some number t, as an estimate of t* and then evaluating g

at 0, to a2t a,t ... , for some sequence a, (often a, = 2') and stopping at the first instance that the values of g do not decrease; if one is willing to evaluate g'(a,t,) as well, one can also stop whenever g'(a, t,) becomes positive.

If the termination procedure occurs at t = t then t, is reduced and the process restarted. Thus we tnally find a, t, with g(a,t,) < g(a,_,t,), g(a,t,) < g(a,+I t,) and t* is isolated in [a,_, t a,+, t,]. The number of evaluations of g will be reduced if t, at least near the solution x*, is a good estimate, for then one would expect to isolate t* in [0, a, t,] every time. In fact if near the soution x* one sets t'. = it, where t, is asymptotically correct, then we should isolate t* easily in [t:, 3t,] and a choice of t* = t' or 2t,', whichever gave the lower f-value, would lead to convergence, as we saw at the start of this section. In this light we see that Theorem 5.8.1 on the convergence of the conjugate-gradient method with c determined as C.

_

.P P

can be considered as providing a good estimate t, which is asymptotically the correct t*; this has been used [Daniel (1967a)] as t, and has given good results. If is any admissible sequence of directions and the functional f on RI satisfies 0 < aI < f x < Al, the analogous choice for t, is P.>

fPP>

GRADIENT METHODS IN R'

sEC. 6.6

157

It has been shown [Elkin (1968)] that one obtains global convergence with t = /,t, where 0 < e < f, < 2 - E, and of course that t, is asymptotically correct. Thus linearization can always be used to get a good estimate t, if but has an estimate one can afford to evaluate f'.. If one cannot compute x f* for the minimum value off, then

f

f(x.) - f

t`

<-Vf(x.), P.

is usually an underestimate of t* near the solution x* while 2t; is usually an overestimate near the solution. Another choice of the estimate t, is simply the value of the actual step used the preceding time; in the end this usually requires little computation to trap t* in an interval. Now we turn to the problem of locating t* more accurately by inter-

polation. The interpolation procedures are sometimes used without first bounding t*; in this case, the "interpolation" becomes extrapolation but the formulas are essentially the same. Such methods are, therefore, contained within the ensuing discussion, although we generally prefer the methods which first bound t*.

Nearly all of the analysis of minimization methods is based on the assumption that the function f is nearly quadratic near its minimizing point;

thus it would be reasonable, and asymptotically exact, to approximate f(x + tp,) by a quadratic or by a linear function. Although it is often assumed in the literature that Vf is much more costly to evaluate than is f, in certain important practical problems such as arise from differential equations we can afford to evaluate Vf reasonably often. In this case, one can then try to solve = 0, and it is quite reasonable to use linear interpolation to solve this-that is, use the secant method. Since one can only approximately satisfy the equation, scaling becomes important and one should probably treat instead Vf(x. -t- tP.) . + IP.) I

11 V A ( x

, I

_0

P. _11

It is no longer clear that linear interpolation is appropriate here, so one might consider using quadratics-that is, Muller's method. In our experience, however, the linear interpolation is usually satisfactory. Essentially the same idea as using a quadratic on the gradient equation is that of using a cubic on the function f. Again we assume that we can evaluate Vf conveniently. Thus we suppose that we have the real-valued functiong(t) of one real variable to minimize, and that we know the function values g g2 and the derivatives g,, g2 at two points z, < z2; we wish to interpolate the data by a cubic and then minimize the cubic. This method is usually used when t* is bracketed by [r, z2], in which case the next estimate is a zero of the quadratic derivative in

158

GRADIENT METHODS IN R'

sEC. 6.6

[z1, Tz]. In many implementations for which the basic interval ['r T2] was chosen so that estimating t* by t, would yield convergence to x*, the interpolation is performed only until an estimate of t* is provided by the scheme at which g is smaller than at r, or T,. This guarantees very accurate minimization near x*. Quadratic interpolation to the values of g at three points followed by minimization of the interpolating quadratic appears in general to be an excellent scheme, particularly if evaluation of Vf is very costly. This is the method commonly used with algorithms which never evaluate Vf(see Chapter

9). A variation is to use one value of the derivative for at least the first estimation of t*; this is easy, since the derivative at t=0-that is, at x-is often known. EXERCISE. Write an algorithm implementing one of the above interpolation schemes.

We are not aware of any good tests comparing the efficiencies of the various interpolation methods; our limited experience indicates that the simpler approaches-say, using quadratics on f-values-are usually satisfactory. One should not generally spend much effort locating t* unless one appears to be very near x* and wants to avoid badly over- or undershooting it. Even then, driving the cosine of the angle between

to be less

than 0.1 often is quite satisfactory. Precisely how accurate one needs be at this point depends on the criterion used for determining convergence of to x* ; if this is based on the size of I I x.+, - x. I (, clearly one must approxi-

mate t*p. to at least the accuracy demanded for the cutoff of I I x 11. As is always true with numerical methods, no single, good, universally applicable-method is known for deciding when has converged. If necessary, one can use the very stringent test of moving away from the computed "x*" and starting the algorithm over to see if the sequence returns toward "x*" again. No really good method is known.

General references: Fletcher (1965, 1968), Fletcher-Powell (1963), Fletcher-Reeves (1964), Kowalik-Osborne (1968), Powell (1964a, b), Stewart (1967).

7

VARIABLE-METRIC GRADIENT METHODS IN

RI

7.1. INTRODUCTION

The earliest gradient-type methods relied strictly on the steepest-descent

direction; that is, to minimize f given an initial point x0, one wrote

f(x0 + tp) = f(x0) + t + 0(t) so that

t f(x0 + tp)I a0 = which is the most negative for directions p with I I p I I = I if

P_

Vf(x)

Ilf)II

The results of Chapters 4, 5, and 6 show, however, that using the steepestdescent direction itself is not necessary; essentially, any direction bounded away from being orthogonal to -Vf(x0) will suffice. In fact, as we saw with

respect to the conjugate-gradient (CG) methods in Chapter 5, one may well obtain remarkably more rapid convergence by purposefully avoiding the steepest-descent direction. Let us therefore consider other ways of generating directions. 7.2. VARIABLE-METRIC DIRECTIONS

Consider again'the expression

1(x0 + 1P) =f(x0) + t(Vf(X0),P> + a(t) 159

160

VARIABLE-METRIC GRADIENT METHODS IN RI

SEC. 7.2

Suppose Qo is some self-adjoint positive-definite operator (I x I matrix, in IR'. Then we can write

f(xu + tP) =f(xo) + t + o(t) =f(xo) + t + o(t)

(7.2.1)

Since Qo is self-adjoint and positive-definite, we can use it to define a new metric-that is, a new inner product-on IR` which will determine a topology via equivalent to the usual one; precisely, we define the inner product [x, yl = <x, Qoy>

Then we can rewrite Equation 7.2.1 as

f(xo + tP) = f(xo) + t[Q;'Vf(xo), Pl + o(t) and suddenly the steepest-descent direction with respect to this new metric has become -Q,'Vf(x(,). Since Qp' is itself positive-definite and self-adjoint, we call the direction -H0Vf(xo) where Ho = Qa' is positive-definite and self-adjoint. If we use a different "H" (that is, "Q") at each successive approximation x to the minimizing point x*-that is, if we use a different metric each time-we thereby generate the sequence of directions

p,, = -H,Vf(x.) A method of this type is called a variable-metric method [Davidon (1959, 1968), Fletcher-Powell (1963)].

From this viewpoint it is clear that any method yielding directions p 0 if 0 is a variable-metric method, such that
p=

the purpose of viewing gradient methods in this fashion, however, is to discover what properties H. should have to generate directions that are good ones. We have already seen in Section 5.5, for example, that if one is trying to solve Mx = k where M is self-adjoint and positive-definite and if one chooses H. as the orthogonal-projection operator, with respect to

the inner product <x, My>, onto the subspace orthogonal in the <x, My> sense to po, pt, ..., then the method obtained is the conjugate-gradient method. As we shall later see, the Davidon method is in fact a way of generating the conjugate-gradient iterates directly from a recursively defined set of matrices For more general problems with J(x) - Vf(x) nonlinear, H.

becomes the orthogonal projection in the <x, J;y> sense onto the subspace J'. >-orthogonal to p, and-asymptotically, of course-to po, . . . , also. Thus we can consider that the power of the conjugate-gradient method compared to the steepest-descent method comes from the former's use of a

VARIABLE-METRIC GRADIENT METHODS IN R'

SEC. 7.L

161

good variable metric. In Yakolev (1965), gradient-type methods are considered

strictly in the setting of variable-metric methods-that is, x"+! = X. - t"H"VJ lx")

for some sequence of operators H. and steps t Most of the results there concern convergence under various choices of t" given certain properties of H. such as

0 < a < < A These correspond, with some minor changes, to the methods of Chapter 4, although more detailed convergence rates are often given in Yakovlev (1965). Thus we consider the methods in this completely general setting no further.

In a sense the best metric would be one which turns the level curves J(x) = c into spheres so that the interior normal direction to the surface-that

is, -Vf(x)-points to the point minimizing f. For quadratic functionals

f(x)_Kh-x,M(h-x)>=[h-x,h-x) where

[u, v] _
makes the level curves appear to be spheres;

this leads to the direction -M-'Vf(x) = 2(h - x)-that is, directly toward the solution. Analogously, for nonlinear equations, the optimum metric would appear to be given by <-,J'.> and thus generates the direction

P. = -J.'-'AX.) This is the direction of Newton's method. Because of this intuitive viewpoint and because Newton's method leads to quadratic convergence [KantorovichAkilov (1964), Rail (1969)], one often tries to pick the variable-metric formulation to mimic Newton's method; thus variable-metric methods are also called quasi-Newton methods [Broyden (1965, 1967), Zeleznik (1968)]. Because of the situation in the constrained case (see Section 4.10), one might

not greatly expect quadratic convergence from mimicking the Newton process if one proceeds close to the Newton direction to the minimum off along that line rather than using the pure Newton step -1

However, the value of t. which minimizes f(x + tp") is asymptotically P")
-

r r.> = I r., J. r.>

162

VARIABLE-METRIC GRADIENT METHODS IN IR,

SEC. 7.2

in this case, and thus near the solution x* the minimization along x + tp, nearly leads to the normal Newton step. While one should then hope for quadratic convergence, most results known to us guarantee only superlinear convergence [Levitin-Poljak (1966a), Yakovlev (1965)]. From what we have done, this can most easily be seen from the viewpoint of conjugate gradients. In Sections 5.3, 5.4, and 5.5 we considered a very general form of conjugategradient methods involving arbitrary self-adjoint positive-definite operators H and K, while in Section 5.6 such extra operators were missing. Clearly one may define a general method using operators H., K. at each point x and develop convergence theory and error estimates in terms of the operator T, _ just as in the quadratic-functional case; this is done in Daniel (1965, 1967a, b), and the convergence rates are given via the spectral bounds a, A of T, as usual. If one takes Hz = K, = J` 1, where J', is self-adjoint, uniformly positive-definite, and uniformly bounded, one gets T, = I and a = A = 1, which implies superlinear convergence. In this case, of course, p, = J; 'r, = -J,''J(x,) and we have the minimization modification of Newton's method and a proof of superlinear convergence. It is possible, however, to show that the convergence is actually quadratic. If we let

so that f(x +,) < f(x,), from 0 < aI < J,,< AI for scalars and pick a, A, one can conclude a 11X.., - x* 112 <_ f (x.+) - f (x*) <_ Ax.') - f(x*)

211x.-x*112 < const X 11x. - x* 11`

so that IJ x.+ I - x* I I < const x I I x. - x* I I2 -EXERCISE. Provide the details for the above argument showing Il x.+, - X* II < const x II x. - x* IIr.

Thus we hope that a good quasi-Newton or variable-metric method will yield very rapid convergence. To obtain this convergence, most of the methods always choose t by minimization along x + tp,,; in some cases this is neces-

sary in order that the next "metric"-that is, H,+ defined often in terms of x,+,-be a good one. A recent method of Davidon (1968), however, attempts

to pick t, automatically and include it in ff. so that really we have t = 1; this can be viewed as one of the interval methods along x, + tp,,. The iteration is as follows:

VARIABLE-METRIC GRADIENT METHODS IN

SEC. 7.2

163

Set

Given x and H., one sets x', = x. P. _ ,

1R,

Y.

P.

For two fixed positive constants a, f, with 0 < a < 1 < f, one then defines a

a

if

_

+a
if

Y

1+v. w

1

-

a

<7,<1

-a -i-a

1

If Y.

jT --1

1 - Y.

otherwise

and defines

H.+, = H. + (A. - 1)H.Vf(x.)[H.Vf(x.)]* P

where * denotes conjugate transpose. If f(x.) > x,,; if f(x,) < f(x ), then we set xp+, = x..

then we set

The value of 1. that is chosen minimizes the length of

H.+ 1 [Vf(x.) - Vf(x )] - (xx - x,) Hx-+, >; this distance would be zero for Newton's in the inner product method applied to a quadratic f, and therein lies the reason for so choosing A.. It can be shown [Davidon (1968)] that if H. is positive-definite, then each R. is positive-definite. If f is a quadratic,

f(x) = and if Y.

1 +Y. for all n, then x

to guarantee that

E [a, f]

h and, in fact,, in (RJ, x, = h. The only hypothesis known

y.) E [a, f] is as follows [Vercoustre (1969)]:

if

0<<x,Mx> and fl= 1

164

VARIABLE-METRIC GRADIENT METHODS IN IR1

SEC. 7.2

then

Y.

I+ Y.

E

[a, 48l

Thus, even for quadratics, one cannot guarantee convergence in general. In fact, if yn = -i-, then A. = I and Hp+, = Hn; so if yn = -} and f(x') > f(xn), the iteration halts at x,,. Computationally, a similar phenomenon has been observed, and the method as presently developed does not appear to this author to be exceptionally useful; when the method does work, it works fairly well [Vercoustre (1969)].

Let us return to the question of generating the matrices H. If we used precisely Newton's method with H.. f= f;'-l, we would have HH[VJ (xn) - V/(xn- 1)] = Je-1 [s (xn -,,((xn- 1) + o(I I xn - Xn- 1 11)]

= X. - Xn- 1 + 0(I I X. - xn- 1 I I )

so it seems reasonable to ask that in general H,[V (xn)

- Vf(xn-1)] = X. - xn-1

If we let H. = H. + B. then we have

Hn+ 1 [Vf(xn+ 1) - V (xw)] = H,[Vf (xn+ 1) - Vf (xn)] + B,[VJ (xn+ 1) - VJ (xn)]

= Xn+1 - X. For convenience, we define an = Vf(xn+1) - Vf(xn)

and then we must pick B. so that Bnan = xn+1 - xn - Hnan

Computationally, we desire B. to be rather simple; for example, one might allow B. two degrees of freedom and set B. = (xn+, - x,,)q.

-

Hnanz,*

where

(7.2.2)

_ = 1

VARIABLE-METRIC GRADIENT METHODS IN 1k'

SEC. 7.3

165

and * denotes conjugate transpose. This defines a very general class of variable-metric methods in terms of the two families [q^] and (z^) [Broyden

(1967)].

If each H. is positive-definite with

0E>0 then

vJ (x^)+

Fn

lie^11>-

IIH,W(x)11 a^ I I Vf(x^) I K

- A^ I I Vf(x^)11 f I I Vf(x^) I

I

and we see that the direction sequence is admissible and that the variablemetric method will yield convergence under these conditions. Using the

approach of Remark I after Theorem 4.2.4, for the case in which Vf is Lipschitzian, we would only require a R-0

2

^

A,

; 00

for convergence. Rather than try to see how one should pick ff. so as to satisfy either of these conditions, we note that we were seeking not just a convergent method but one with Newton-like rapid convergence. For the conjugate-gradient method we saw that the 1/1-superlinear convergence in UR' depends on the fact that the method exactly minimizes quadratics in finitely many steps. We consider other such algorithms in the following sections. 7.3. EXACT METHODS FOR QUADRATICS

It has been widely stated that any method which minimizes a quadratic exactly in a finite number of steps will be a good one for more general func-

tionals and should exhibit superlinear or at least very rapid convergence; the 1/1-superlinear convergence of the conjugate-gradient method depends largely on its exactness for quadratics in RR'. This by itself is certainly not sufficient in fact to define a good method; one at least should require the direction sequence to be admissible. Exactness does, however, appear to be a reasonably useful property which we should try to obtain for the variablemetric methods. EXERCISE. Find an exact method which is useless on general functionals.-

VARIABLE-METRIC GRADIENT METHODS IN RI

166

SEC. 7.3

Necessary and sufficient conditions for a variable-metric method as presented here to be exact-that is, exact for quadratics in RI-do not appear to be known. As our remarks prior to Theorem 5.2.2 indicate, however, it is sufficient and "almost necessary" that the directions be generated by conjugate directions; experience also seems to indicate that the conjugate-direc-

tion methods of some form are usually the best. This in fact has been the approach used implicitly, if not explicitly, in the presentation of most methods. One way of developing the methods is as follows [Broyden (1967), Zelcznik (1968)]:

Since for quadratics with

f(x) _ we always have

M-' 6. = M-' [VAX.-,) - Vf(x.)] = x.+, - X. and we are trying to duplicate Newton's method, let us insist that H,+, not only satisfy H4+16. = x,; but also satisfy

H.+,a, =x,+, -x,,

i =0, 1, .

. .

,n

This gives, for i S n - 1,

xr+, -xI =H.+,d, =H.b,+B.6, =x,+, -x,+B.6I and hence

B.6, =0 for i =0, 1,...,n- 1 In the special case of Equation 7.2.2, where we allow two degrees of freedom, we find

B. = (x.+, - x.)q: - H.a.z. where

(7.3.1)

.=

1

=

if i=n

0 if

i S n- 1

Such q. and z. exist if 6. is linearly independent of the set If this is true for all n, then So, . . . , (J._, are linearly independent and we have

MR.. 16, = M(x,+, - x) = 5,,

i = 0, 1, ... , n

sEC. 7.3

VARIABLE-METRIC GRADIENT METHODS it,; IR'

167

Thus in IR',

i=0,1,...,1- 1

MH,d,=b

which says that MH, has I linearly independent eigenvectors b, associated with the eigenvalue 1 and hence

MH, =I Thus p, _ -H,Vf(x,) will yield the exact solution x,+, = h for quadratic f. Since the argument above did not depend on the two-parameter nature of the matrices H we have proved the following theorem. THEOREM 7.3.1. Suppose that [O , ... , O) is a linearly independent set of vectors for 0 < n . COROLLARY 7.3.1 [Vercoustre (1969)]. Suppose that H.6, = x,+, - x, for 0 .

n-1

Proof. If for some n we have 6. _ E T,8 then JQo .-I

.-1

1-0

(-0

M-'

r=o

.-s

HB,=ET,H.S.-ETJ(x,+1

-X)=ET,M'a, 1-0

T,6, = M-'6. = x.+,

- x.

which is a contradiction. Thus [b( ..... 6.) is linearly independent for all n. Q.E.D. We now suppose that Hn+, is symmetric and that t is always chosen to

minimize f(x. + tp,J, so that

0 = _ J.>

- , for n = 0, 1, ... , r Under these hypotheses for the two-parameter methods of Equation 7.3.1 it can then be proved [Broyden (1967)] that = 0 if i :*j for 0 < i, j < r and, therefore, that we have a conjugate-direction method. The proof goes roughly as follows. From the definitions of and it easily follows that

P.+. = a.P. +

for some scalars a,,, 8.

168

SEC. 7.4

VARIABLE-METRIC GRADIENT METHODS IN (R,

Now,

= 0

_ for

n=0,1,...,r We then get

_ + f.
\H"8 n- > r
,

P.- I> = 0,

n=0,1,...,r

The induction then proceeds easily to give

i = 0, 1,...,n,

=0,

n =0, 1,...,r

Thus we have found a two-parameter class of exact variable metric methods; the admissibility of the direction sequence for nonquadratic functionals still remains unknown, however. Since a study of the admissibility requires considerable specialization of the vectors z,,, we consider this question for special methods, although little information is available even in special cases. 7.4. SOME PARTICULAR METHODS

We consider first the class of variable-metric methods [Broyden (1967)] defined via

q. ° (a.P. - R.H.a.) Z.

+ P.P.)

an = (I + <8,,, H.6.>)

(7.4.1)

P.>

a.,P. ) ., H..> Y.=(1-Q.t.

where P. is arbitrary, t is chosen so PA> = 0, and Ho is symmetric. By a straightforward inductive argument [Broyden (1967)] or by using Corollary 7.3.1, one can show that, if M is symmetric and nonsingular, then the b, are linearly independent and hence the method is exact.

VARIABLE-METRIC GRADIENT METHODS IN IRt

SEC. 7.4

169

THEOREM 7.4.1. If Ho and M are positive-definite and fin > 0, then it follows that H. is positive-definite for each n. Proof: The proof goes by induction. If H. is positive-definite, let LL* _ v =/L*x, and w/= L\*5N, we then have H,,. If we let u = Kx, H., 1X> = -

+

V,''

2

&t.
+

,2

/\ + (w, w>12 [l ;, U>

The only possible negative terms come from

v, v

lv w>2 .w, w>

which in fact is nulinegative by the Schwarz inequality and is positive unless

v = )w. If the term 2 is also zero, then = 0 but = -Ku, u> # 0. Therefore, Kx, H.+ x> > 0 if x:;& 0 and hence definite. Q.E.D.

is positive-

Since only finitely many iterations need be used for quadratics, we have with

">E>0

An

in that case. If we use the algorithm for more general functionals, however,

we cannot immediately conclude such bounds (see, however, Theorems 7.4.2 and 7.4.5). Similarly, it is not known whether or not such bounds exist for quadratics in infinite-dimensional spaces, a result which would at least give some indications for the nonquadratic case in IR'. At this time numerical experience testing various choices of the parameters f, is rather limited; thus the method for arbitrary P. requires further study both theoretically and computationally. In practice, when one uses this method one seldom actually performs an exact minimization along the direction p to reach.-(.,, -that is, p,> = 0; it is striking to note, however, that if one seldom has = 0 for all n, then the directions are independent of the parameters {f,). Computationally, however, one finds great dependence on the choice of these parameters.

THEOREM 7.4.2. Under the assumptions of this section, if pp> = 0, then the direction n+1

11p.+' 11

determined by Equation 7.4.1 is independent of f,,.

170

SEC. 7.4

VARIABLE-METRIC GRADIENT METHODS IN R,

Proof: ,, Hf(j)> H,,Vf(xw+,) - Hwaw
P" Vf(xw+I)> +

Nwi'w

where tfpf{<JP H"a"9 - }

e>

w,

- H"a"{t"(P Vf(x"+J> - t"
Vf(x +l)>1 J

Using = 0, we have w>I)

-Pw+I = HUVf(xw+I) - H"b"
+

a"
Rwt"{Pw<-aw, H"Vf(x"+I)> ± Hwaw

Now, 1

:a,,, H,Vf(xw+, )

times the term in braces { } in the above expression for -pw+I yields }

<

_

H.Vf(xw+,)>

a">

-Pw

H"aw

H,,a,i

-Pw

`a". H.Vf(x")> H a w

H"aw>

w

aw, H"(- aa>S -+HA> H a

"w

= HwVf (x,,) -? HwY'f (xw+l)

Pw+, _ -[1

±

Hwaw

-

- law, H.Vf(xw+l )> H.6.
fi tw

X {H"Vf(x"+,) -Hwaw<s5N>l)>} and p./il Pw I I is independent of P.. Q.E.D.

SEC. 7.4

VARIABLE-METRIC GRADIENT METHODS IN RI

171

One reason for feeling that this class of methods deserves study is that

the special case with f. = 0 for all n-to which the others are in a sense equivalent, as we have seen- has been shown to be one of the most powerful methods available at present; this is the original Davidon method [Davidon (1959)] as modified by Fletcher and Powell (1963). In this way, H. is modified as follows:

H.+ I = H. + B.

B = tp,, ,

-

8. = VAX.-,) - Vf(x,.),

H^a

x.+, = X. + V.

Therefore, from the results for arbitrary f. > 0, we know that for quadratics this yields an exact method-in fact, a conjugate-direction method. A somewhat startling fact is that for quadratics this yields precisely the same iterates as a form of the conjugate-gradient method; this was first noted for Ho = I [Myers (1968)], but is true in general. THEOREM 7.4.3. Let f(x) = -'I, let M be self-adjoint

and positive-definite, and let H,, be self-adjoint and positive-definite. For a given x0, let the Davidon method be used to generate points x x2, .. . by directions p0, p .... Starting with x', = x0, let the general form of the conjugate-gradient method of Section 5.3 be used with, in the notation of that section, K- Ho, H - M-', N _ M, T = Ho M, generating a sequence x' x2, ... by directions p,,, p;, .... Then x = x' . and p = 2,,p; for scalars A. for all n.

Proof: Define g, = -Vf(x,). Since _

0

we have

<J., H.a.> = + Therefore,

a.-, g., H.- g. l P. = Hg. - S.,[H.H.-I g. + g.- H.-. g. Then for n < m, we have

_ + fig,., H.- I [g.- - g,J>

g.,H.- g. g.- H.-Ig.=0+- X

X

g.,

1, H.

g., H.- g. + g,,

I g.

172

VARIABLE-METRIC GRADIENT METHODS IN (R'

SEC. 7.4

is positive-definite, we must have

Therefore, since

= 0 for n < m But then, from the definitions of H. and B,,, we have for n < m that

H.g..=H.-1g.=... =Hog. and hence g,,, is an eigenvector with eigenvalue 1 for Ho 1 H. if m > n. There-

fore, we find that P.

[I + H0g.g" dP.-

8.. Hog.> + g.-1,p.-1>

For the conjugate-gradient method, po = Kg0 = po, and therefore x', = x1.

If p, = 2,p; and x;+1 = x,+, for i = 0, 1, ... , n, as is true for n = 0, then P.+1 =

Kg.. , Hog.+1>I + Hog... 1g:J2.P., 8.+1, Hog.+1/ + g.,P.

while

P;.+1 = Hog.+1 + <94g.

Hog.>1>P;.

H0g.+, + p g., Hog. 8.+1, Hg.+1> + AX g.' Haa.> If + Hog.+3 X f
g.+1, H09.+1>

1

+ Vg.' H0g.>H0g +1- Hog.+,$., A.P., + P.+

(where a...

is defined by the equality) since = - 6.-1 =

Thus

2.+1P:+1 and hence

x,+:. Q.E.D.

Thus, because of Theorems 7.4.2 and 7.4.3 and the results of Chapter 5 and Section 6.4, we know that the methods of Equation 7.4.1 (and in particular, Davidon's method) yield global convergence when implemented

with exact minimization along x + tp and applied to uniformly convex

SEC. 7.4

VARIABLE-METRIC GRADIENT METHODS IN IR'

173

quadratic functionals-for another proof see [Horwitz-Sarachik (1968)]-and we have estimates on the local-convergence rate. Similarly, for x, near x*, for uniformly convex nonquadratic functionals, their relationship to conjugate-gradient methods gives a local-convergence result for these methods as well. Only very recently [Powell (1970)] has a global-convergence result for the Davidon method been found; we give this result and outline the proof. THEOREM 7.4.4 (Powell (1970)]. Let f be twice continuously differenti-

able, let f > aI, a > 0, and let the Davidon algorithm be used with exact x minimization along x + tp,,, starting with an arbitrary x, and positivedefinite symmetric H,. Then the sequence tion 2.

converges to the unique solu-

Proof. Since H. is positive-definite for each n [Fletcher-Powell (1963)],

we can define r. - H;'. Writing

7,.x.+, -x. one can check that

r.+, _(I-
>)r"\I -
/+
t

Writing tr(A) for the trace of a matrix A-that is, tr(A) equals the sum of the eigenvalues for symmetric A-we have

tr (r.+,) = tr (r.)

- 2 + + Y., 0.>

r. ,

A,,, which

a., a.> 7.,

,,>

Using the fact that = t.

0

and 1

1

I(x,.+, ), H.+, f(x.+,)> = < f(x,.+, ), H. f(x.+, )> I

+

Ax.), H. f(x,.)>

we conclude that

tr(r.) +

11 ofx.+,) I I;

-

Ilof(x.+ IIZ + 1i H. x.+,) I_I

I I Vf(x.)112

H (x,.)

174

SEC. 7.4

VARIABLE-METRIC GRADIENT METHODS IN IR,

Solving this recursion thus yields tr

tr (ro) + 1-0

I I Vf(x

I I VAX. + J II{2

)112

J (x0), Ho J (x0)

+

,A simple use of the mean-value theorem implies that 116,112

/

for some constant Bo. Since II=112

is less than or equal to the largest eigenvalue of Ham+, = r.+ which is less than and since I I Vf(xo)112

tr(ro) and

are constants, we find that a B exists such that l I VAX, +,

< Bn

Arguing similarly for H. instead of r and using conclude that there is a constant M such that

(7.4.2)

0, we can alsc

IIH,6,112<Mn ,=o < j

Now, using <6 H,6,>2 > we finally conclude that < Mn 116,112

(7.4.3)

VARIABLE-METRIC GRADIENT METHODS IN IR'

SEC. 7.4

175-

Thus, from Equation 7.4.2, for at least two-thirds of the integers i in 0 

< 3B

From Equation 7.4.3, at least two-thirds of the terms must also satisfy < 3M 11

5,112

Therefore, for at least one-third of the integers i in 0 

< 9BM116,112 = 9BM Ijl 1-2

Y,l1211Y,I12

I

Since, by the mean-value theorem, 116,112

II Y,II

2

is bounded and f(x,) - f(x,+,) >'a II Y,112 must converge to zero, we deduce

the existence of an infinite subsequence of x,, such that

I I Vf(x,,) I I --' 0.

It follows from the uniform convexity off that x,, -- Sr; since f(x,,) if n Z ip we then deduce that the entire sequence (x,} converges to x. Q.E.D.

So far as the local-convergence rate is concerned, one can also prove the following theorem. THEOREM 7.4.5 [Powell (1970)]. Under the hypotheses of Theorem

7.4.4, there exists a scalar a E (0, 1) such that

f(xJ - f(2) < a"[f (x0) - AX)] If in addition f' satisfies a uniform Lipschitz condition in (x; f(x) < f(xo)},

then there exist a > 0, A < 00 such that aI < H. < Al, and

zI Il x. - x l l

converges to zero; (H.) need not converge to [fz]The local-convergence rate has also been studied in McCormick (1969), where somewhat weaker results than those above are obtained; an analysis

176

sEc. 7.4

VARIABLE-METRIC GRADIENT METHODS IN R,

of a restart Davidon method-in which, after every 1 steps, H. is reset to H, is also presented there. A computer program for the Davidon method can be found in Wells (1965). EXERCISE. Prove global convergence for the restart Davidon method.

A method similar to the Davidon method attempts to compute the inverse of fs more directly. If one has proceeded through I successive direct

tions p ... , p, and defines the matrix P with columns p ... , p,, then for a quadratic functional

f(x) = we have P*MP = D so that M-' = PD-'P'. Thus in general a reasonable approximation f o r ( f '. ' ) - ', given 1 directions p .

.

. , p would be

H = PD-'P Of course, one cannot actually compute D without first computing the Hessian matrix, which we seek td avoid. However, we note for quadratics that

D = P"MP = P*GT -' where G is a matrix whose ith column is b, = Vf(x,+,) - Vf(x,) and T is a diagonal matrix having as its diagonal (t ... , t). Defining G thus in terms of differences of gradients for arbitrary f, we can then take as a good initial matrix the matrix

Ho = PTG -' It has been suggested [Ritter (1969)] that one use essentially this choice of the H-matrix at every step. The point of this is that even for nonquadratic functionals, setting H. = P.T G,' where

P. = (P.-,, P.-,+ ... , P.G. = (a.- . . ., a,-) and

T. = diag(t.-,, . . . , t.-,) guarantees that

H.6, =x,+, -x, for n-1 Z Eli Vf(x.) 112 for a fixed e > 0

sEC. 7.4

VARIABLE-METRIC GRADMNT METHODS IN (R'

177

and as, say, -Vf(x,J or some other globally convergent choice otherwise. This clearly gives global convergence for sufficiently convex problems, as-

suming that a reasonable t is chosen. EXERCISE. Prove the global convergence asserted in the preceding paragraph.

If the choice of H. is modified so that it is computed from a "sufficiently

linearly independent" set of I preceding directions-say, by requiring Z e-the local-convergence properties can be analyzed. In particular [Ritter (1969)], x converges superlinearly to 2 and H. converges, to f %-'. Computationally, one would not want to be forced to compute G;' every time; fortunately, it is simple to compute G,-+, recursively from G.-' using and G.. This appears to be an excellent method the relationship between in theory, but we are not aware of any significant computational results. EXAMPLE [Kowalik-Osborne (1969)]. Consider using Davidon's variable-metric method with Ho = I to minimize

AX, Y) = 100(y - x2)2 + (1 - x)2

whose unique minimum is at x = y = 1. Letting xo = -1.2, yo = 1.0, 50 functional calls of f were required to find (x, y) = (0.6885, 0.4580), f = 0.122, and 80 calls yielded (x, y) = (0.992, 0.9984), f = 6.27 x 10-7. Another particular variable-metric method [Broyden (1965)] is obtained by setting

H p. 4.=Z.= H.P., .> This method is not exact, but if applied to , one finds

IIHM+t -MII
H.a.-t.

H.. - t.P.,

H.+, =H.+B.

. a>

178

SEC. 7.5

VARIABLE-METRIC GRADIENT METHODS IN (R'

if no division by zero is required, and H , = H. otherwise. For convenience

we write A > B (A > B) to mean that A - B is positive-definite (semidefinite). By straightforward arguments one can show that for this method with

f (x) = if Ha > M-' (Ho < M-'), then

Ho > H. > M-' (Ha < H. < M-') for all

n

> For a general nonquadratic functional f, if 0 < al < f < Al, if f x f(x + p) < f(x), Ha > k1 for some k < 1/A, and if f x+, < fs whenever then kl < H. < (1/a)I for all n. This then states that the direction sequence, under these hypotheses, is admissible and hence that the iterative method will yield a criticizing sequence, in this case converging to k. A similar theorem

is valid for Ha > f,- '. To the author's knowledge this is the only result on the admissibility of a variable-metric direction sequence. The statement of the result with Ha- < f" -I is in fact useful computationally, since this is satisfied with Ha = .1-'I where A > I I f X, 11; thus if f' is nonincreasing along

paths of nonincreasing functional values, such a computable choice of Ho will assure convergence. Some other two-parameter-type methods have been considered briefly

in the literature [Broyden (1967), Vercoustre (1969)] but have not proved particularly successful; for example only, we mention the possibilities

=z=

P.

and

= z, -

H^a^

7.5. CONSTRAINED PROBLEMS

Because of the experimental evidence and, in some few cases, theoretical analyses showing that variable-metric and in particular the conjugate-gradient and Davidon methods are extremely powerful tools for unconstrained minimization, a great deal of interest has arisen lately in variable-metric methods for constrained minimization; conjugate-direction methods have been applied in various ways [Barnes (1965,) Goldfarb (1966, 1969a), Goldfarb-Lapidus (1968), Zoutendijk (1960)], perhaps the most promising ofwhich is an analogue

of the Davidon-Fletcher-Powell method. In the unconstrained case for a quadratic f, this method merely uses the steepest-descent direction defined by a metric approximating f". An analogue for constrained minimization is to

use the projected-gradient method where the projection is with respect to some different metric. This of course can be described in the variable-metric

SEC. 7.5

VARIABLE-METRIC GRADIENT METHODS IN IR'

179

setting with p =

where H. is a projection matrix for a varying metric. Such an algorithm can be made to be exact for quadratics with linear

constraints. This Davidon-type method for constrained problems has not been thoroughly analyzed with respect to its global-convergence properties. Evidence indicates that this will be a very powerful method, but much of the

theoretical analysis is lacking and should be provided. The theory that is known can be found in Goldfarb-Lapidus (1968) and in part in Section 4.10, particularly Theorem 4.10.2 of this volume.

8

OPERATOR-EQUATION METHODS

8.1. INTRODUCTION

The methods discussed in the preceding chapters have all been of gradient type-that is, such that <x... - x1, Vf(x )> < 0; these are the types of methods which arise most naturally for unconstrained minimization problems. Another large class of useful methods, however, is comprised of those which in essence ignore the minimization aspect of the problem and treat directly the operator equation

Vf(x) = 0 A solution to this equation need not of course minimize f unlessf is further restricted-for example, by some convexity hypotheses. We cannot hope to describe in any detail the tremendous variety of methods available for the solution of such equations; entire books, and large sections of others, are devoted to these methods [Collatz (1966), Ostrowski (1966b), Rail (1969), Traub (1964)). We shall, therefore, be very brief and merely mention the various methods which appear to be useful and the kinds of problems to which their application is understood; only the barest outline is presented here.

We are primarily interested in methods for which global-convergence results are known; that is, we wish to locate a solution x' iteratively starting with any x. whatsoever, or at least with restricted x0 that can easily be found. Although local-convergence results have long been known for many iterative methods, global results are much harder to obtain and are commensurably less common. Global results usually are obtained either by variational analysis or by an analysis using some type of abstract, generalized monotonicity viewpoint. From the variational viewpoint, some methods are shown to be of gradient type and then the g.neral analysis of the preceding chapters an be 1911

sac. 8.2

OPERATOR-EQUATION METHODS

181

applied to analyze convergence; we have already seen this, for example, for Newton's method. We shall look further into Newton's method from both viewpoints in this chapter, examine briefly the iterative methods generalizing

well-known methods for linear equations-such as Jacobi, Gauss-Seidel, successive-relaxation, and alternating-direction iterations-and conclude with a study of methods based on minimizing I I Vf (x)112-that is, on least-

squares methods. Let us then turn to the question of iterative solution of nonlinear operator equations J(x) = 0 8.2. NEWTON AND NEWTON-LIKE METHODS

The original Newton method for nonlinear equations takes the form

X... =x, - J,-' 'Jz.) ( where as usual we write J; for the Frechet derivative at x,; that is, J' =1zj We have mentioned briefly the use of Newton's method when J(x) = Vf(x) for a function f to be minimized, in which case we take the direction p, = -J, 'J(x,) and let x.,, = x, + t, p, where t, is determined by one of the

general methods of Chapter 4; we know that choosing t, to minimize J (x. + tp,), for example, yields quadratic convergence. We can show in addition that, roughly speaking, one can always eventually choose t, equal to 1, in which case we have the pure Newton method which also converges quadratically [Kantorovich-Akilov (1964), Rail (1969)]. THEOREM 8.2.1. Suppose J(x) = Vf (x) is such that 0 < & < J' <,41 for all x in W(x,), the closed convex hull of (x; f(x) S f (x,)), and suppose that J;, is uniformly continuous in x in W(x,). Let

'('JxJ

zw+l = X. - twJ

where, for a fixed a in (0, 1), t, is the first of the numbers t = a°, at, a2, satisfying

... ,

f (x,) - f(x. + tp,,) Z qt. <J(x,), J1-'J(x >

for a fixed q in (0, 4). Then x, - . 2, the unique point minimizing f. For sufficiently large n, we may take t, = I and hence the convergence is quadratic.

Proof. Since W(x,) is bounded and II J= I I S A, Vf is uniformly continuous in W(x,). We have A l l o1(xJ I I

S A IIp. II

182

sec. 8.2

OPERATOR-EQUATION METHODS

and hence the assumptions of Corollary 4.7.2 are satisfied with d(t) = bt, d,(t) - (11A)t. We thus conclude that 0

Then by Theorems 4.2.1, 1.5.1, and 1.6.1 we conclude that x - Q. To show that we may take t = I for large n, we write

f(xn) - M. + Pn)

f(xn) -

1

[f(xn) + + W'. Pn, Pn>J

<J(x ), Jn

J(xn), Jn 'J(xn)>

+

'J(xn)

#<Jn P., Pn> - 4<J''=,+,,D.P., Pn> for some An E (0, 1)

<J(x ), J 'J(xn)>

_ +<J(xn), J'

o(I I AX.) I F) <J(xn), J w 'J(xn)

=4+o(1) Since q c= (0, 1), for large n we may take to = 1, thus yielding the wellknown quadratic convergence. Q.E.D. The above theorem shows how variational analysis can be used to prove .a global-convergence theorem for (an asymptotically null modification of) Newton's method. We next state a similar result proved by monotonicity methods [Baluev (1952), Ortega-Rheinboldt (1967a)]; for simplicity we

restrict ourselves to C'. We remark that if x and y are two elements of a finite-dimensional space, x > y is defined to mean that the components of x - y are nonnegative. THEOREM 8.2.2. Let J: R', C' be continuously differentiable and satisfy

J[Ax + (1 - a)y] < AJ(x) + (1 - 2)J(y) for x, y in

ER',

2 E [0, 1]

Let J'- I exist, and let J;' > 0 for all x in ER'. If J(x) = 0 has a solution R

R', then it is unique and for any xo in ER' the sequence xn+, = X. - J;'J(xn) converges to z and satisfies x > xn+, > 9 for n > 1.

in

Proof: The "convexity" assumption on J merely says that J is convex in each of its components; thus from Proposition 1.5.1 we have

J(x) - J(Y) > Jr(x - Y) If J(x) = J(y) = 0, then one has

0>J;(x-Y) and since Jy ' > 0, we have

x-y<0

OPERATOR-EQUATFON METHODS

SEC. 8.2

183

Similarly, y - x < 0 and hence x = y-that is, x is unique. Now, for any Xo,

X1 - 2 = xo - Jo 1J(xo) - [ac - Jo 'J(X)] = x0 _2 +J10_ 1[J(x) - J(X0)] Xo - 2 + J'0 -1V'0(! - Xo)]

>0 Therefore, since xo is arbitrary, x. > z for n > I. But then

-J(xn_1) = J,'-,(x. - xw-1) <_ J(x.) and hence

for n> I yielding

X. = -J' 'J(X,) < 0 Thus we have

for n>l Therefore, there exists at least one convergent subsequence; for any such subsequence {x ], with x,,, - x', since X,,,., S X,,,+ 1 < X,

it is easy to show that x,,,+, x' also and hence J(x') = 0. Therefore, all limit points of [x.] must equal.9 and hence x - ' I. Q.E.D.

Since Newton's method is often cumbersome, because of the need to compute the derivatives in J, and to solve a linear system, one often desires to use modifications of the method, such as

x.,. = x - H where H. is some "approximation" to J; 1; thus we are back to the quasiNewton methods (called variable-metric methods when considered from the variational viewpoint). If we ask only local-convergence theorems, these methods have been thoroughly studied in general. A typical such theorem is the following [Dennis (1969), Rheinboldt (1967)]. for H. = PROPOSITION 8.2.1. Let J : X - Y for Banach spaces X, Y, I I J;, - J' I I

K II x - y I I for x, y E D0, a convex subset of X. Let B: Do -. L[X, Y], and' for a given xo E Do, let

IIB(x)-B(xo)II <_gllx-xoll, jIJ'=-B(x)Ilsao+allix-xoll

184

sEc. 8.2

OPERATOR-EQUATION METHODS

for x in Do. Let II B-'(xo) II S f, II B-'(xo)J(xo) II S at,

and assume that fibo < 1 and

h _ (1 o_Ka o)z

' 21

where

o= max(1,n+al) K Set

1-

h-

-2h

a

r1 0

+ 1-(ho) x h

If S(xo, r,) c Do, then the sequence

3 am

0

xa+, = xa - B(x,J-' J(x.) exists, lies in S(xo, r,), and converges to a solution.2 ofJ(x) = 0 unique inside S(xo, rs) (1 Do. We are looking, however, for theorems about global convergence, not

local; to our knowledge such theorems other than contraction-mapping theorems are not known (except for the variational results of earlier chapters)

for the general quasi-Newton method. By making much more specialized assumptions on the precise form of the method, some global results have been obtained; this is the subject, in effect, of the next sections.

We wish to point out that some of the quasi-Newton methods are implemented with a variational flavor: given x and a direction pa, one sets xa + to pa where t, is chosen to minimize or reduce some local-error measure EE(x); for the pure variational methods we have of course EE(x) J(x) for all n. A particularly popular choice in general is Ea(x) = II J(x) II'. The general form of the conjugate-gradient method for nonlinear operators, mentioned in' Section 7.3, treats the error measure E,(x) = <J(x), HsJ(x)> for some operator Ha,. The direction is

p. = -K: J(xJ + ba- Pawhere K,. is another operator and ba_, is chosen to make pa and pa_, (J,*HH.J;)-conjugate [Daniel (1965; 1967a, b)]. This exhibits very rapid convergence near the solution, but also reveals the difficult problem of such

methods-namely, making a good choice for the operator H. In Daniel

aec. 8.3

OPERATDR-EQUATION METHODS

185

(1967a), an example of two equations in two unknowns is handled in this way with two different, apparently reasonable error measures, with very different results; one method gives 1J J(x.o) 11:!g 0.007 II J(xo) I I, while the other gives II J(x:,) I I S 10-' 1I J(x0) I I. Thus this type of variable error-measure

method, while seeming to grow in popularity, is but poorly understood in general.

8.3. GENERALIZED LINEAR ITERATIONS

We next consider some iterative methods which are natural extensions of well-known methods f o r linear equations. Let vectors x in R' have components Z ... , Z,, and consider the linear system

Ax =b The following iterative relaxation methods are well known [Douglas-KelloggVarga (1963), Varga (1962)].

Jacobi method: Given x,,, compute

via

akkc.+,,k = bk - A ak1C.,lr

k=1,2,...,1

!wk

There is also a block version in which several C.+ ,k are found simultaneously.

Successive overrelaxation (SOR) method: Given x., compute x.+, by letting k-1

`k = bk - L. akjc.+t.l - 1-+1 J akkxxi

1

.+1.k=(1-w).,k+w

k,

akJCk.l

k=1,2,...,1

for some relaxation factor w. There is also a block version in which several .+1,k are found simultaneously.

Gauss-Seidel method: This is just SOR with the relaxation parameter

w - 1. Alternating direction implicit (A DI) method: Suppose A = H + V. Given x and some parameters s > 0 and r > 0, compute via (s.I + H)x.+(l,2) = V)x. + b (r.I + V)x.+1 = (r.I - H)x.+(1 i:) + b There are many ways in which these methods can be generalized to nonlinear problems; we mention three common approaches [Ortega-Rheinboldt

186

OPERATOR-EQUATION METHODS

sEc. 8.3

(1968)]. Let ,. be the name of one of the methods above having the form xw+, = F(x,,; A; b). 1. Nonlinear H- method. In each of the first three of the above methods, one or more components of xw+, were computed successively; for the non-

linear version, precisely the same is done. Thus to solve G(x) = 0 where G: R1 --r CR`, G = (g . . ., g,)*, the nonlinear Jacobi method, for example, yields

from xw via gk(Cm,I)

-

-'

w,k-1, Sw+l,k, w,k+1,

.., Cw,l) = 0,

k = 1, ...,1

For nonlinear AD1, we assume G(x) = H(x) + V(x), and then the method looks just like the linear one. EXERCISE. Write out the nonlinear Jacobi method for solving , +

= 0,

e°i - n = 4. 2. K.-m-step-Newton method. In each nonlinear method, it was necessary to solve at least I nonlinear equations to get from xw to xw+,. Since one cannot do this exactly, one might try to compute xw+, by using m steps of

Newton's method on the required nonlinear problem [Lieberstein (1959)]. Thus the Jacobi-l-step-Newton method is g (x")

T9 -,

k=1,2,...,1

Rk (X

EXERCISE. Write out the Jacobi-l-step-Newton method for solving

'1+'}=

3. Newton-m-step-+a method. A third possibility is to consider solving G(x) = 0 by Newton's method: xw+, =xw - G- 1 G(xw) Since this involves solving a linear system at each step, we might consider using m steps of a method F(xw; A; b) for solving the linear system, thus getting the method x,,., = F'"[xw; G.; G'fxw - G(xw)]

Thus the Newton-l-step-Gauss-Seidel method is agk // \xw) ZlG

k

1a jg&J

w+l,k = Fj

/

(xw)(

,/ - w 1.!) - Ok(xn), k - 1, 2, ..., 1

EXERCISE. Write Vyout the Newton- l-step-Jacobi method for solving

'1

=0, ec-- z =4.

SEC. 8.3

OPERATOR-EQUATION METHODS

187

The local convergence of such methods can be analyzed in the usual way by using the fact that, near solutions, the non'_nearities contribute only

second-order, effects; thus many of the results for linear equations carry over without change to the local convergence for nonlinear problems [Ortega-Rockoff (1966), Ortega-Rheinboldt (1968)). The global-convergence properties are far more difficult, however. With the exception of the ADItype methods, some useful results can be obtained from the variational view-

point, especially for the pure nonlinear methods. This is because the step from xq to x, can be considered as the result of I steps through the coordinate directions, which directions will be of gradient type for certain problems,

thus allowing the application of any of the various step-size algorithms. The following proposition is typical of the type of result that has been obtained from the variational viewpoint. PROPOSITION 8.3.1 [Schechter (1962, 1968)]. Let f(x) be twice continu-

ously differentiable in IR' with 0 < al
an M-matrix A = ((a;;)) is one such that A-1 > 0 and a,j < 0 if i # j,

and that 0: RI -p R' is isotope if -x < y implies O(x) < ra(y); is diagonal if, for I < i < 1, the ith component of O(x) depends only on the ith component of 0; and is convex if

0 [,lx + (1 - ))Y) < AO(x) + (I - AWY) for all x, y in [R'. Such operators commonly arise in the numerical solution of boundary-value problems for mildly nonlinear differential equations [Bers (1953), Greenspan-Parter (1965), Schechter (1962)]. The following two results are typical of what is known for the full nonlinear methods. PROPOSITION 8.3.2 [Bers (1953), Ortega-Rheiboldt (1967a, 1968)]. Let A be an M-matrix, 0: IR' IR''a continuous, diagonal isotone mapping, and set

G(x) = Ax + O(x)

Then for any xu in

IR',

the nonlinear Jacobi and nonlinear SOR (for

189

sEc. 8.3

OPERATOR-EQUATION METHODS

0 < w S 1 (and hence, for w = 1, the nonlinear Gauss-Seidel)] methods all yield sequences xa converging to the unique solution R of G(x) = 0.

PROPOSITION 8.3.3 (Caspar (1968), Kellogg (1969), Ortega-Rheinboldt

(1968)]. Let G(x) = H(x) + V(x) with H and V continuous, let G(R) = 0, let

Z 0 for all x, y in 1, and for each bounded set B let there exist positive constants

L, and a, such that

IH(x)-H(Y)IISL,llx-YII and Z a, I I x - y 112

for all x, y in B. Suppose that

0
s
for n Z 0 and that

lima a- = r a- =limsa Then the sequence x,/2 generated by the nonlinear ADI method converges to R for every initial xo.

For the nonlinear Jacobi and SOR methods, one can also compute with monotone iterates. PteoPOSITIoN 8.3.4 [Ortega-Rheinboldt (1967a, b, 1968)]. Let A be an M-matrix, #: IR' -+ K' be continuous, diagonal, and isotone, and suppose that yo S x, satisfies G(yo) S 0 S G(xo) where G(x) = Ax + #(x). Then the nonlinear Jacobi and (for 0 < w < 1) nonlinear SOR methods starting from xo and yo yield sequences satisfying

YaSYa+1 SRSx.+t Sxa for nZO and xa -,R, ya -+.2, where G(2) = 0. We now look at the Newton- K- methods; the following results are typical. PROPOSITION 8.3.5 [Greenspan-Parter (1965), Ortega-Rheinboldt (1967a,

R' be continuously differentiable and convex, let a, be an M-matrix for all x, and let G(2) = 0. Then for any xo satisfying G(xo) Z 0, and for any m Z 1 (actually m may vary) and 0 < w S 1, the 1968)]. Let G: 6t'

SEC. 8.3

OPERATOR-EQUATION METHODS

189

iterates generated by the Newton-m-step-SOR method converge to R R for n > 0. If G(x) = Ax + Ox) where A is an and satisfy x Z M-matrix and 0 is continuously differentiable, diagonal, isotone, and convex,

then the Newton-l-step-Gauss-Seidel iterates converge to R for any xo and monotonically if G(x,,) > 0.

PROPOSITION 8.3.6 [Caspar (1968)]. Let G(x) = H(x) + V(x) with H

and V continuously differentiable, let G(x) - G(y) < G',(x - y) if x < y or y < x, let rI + Hx and rI + V' be M-matrices for all r > 0 and all x, let G(R) = 0 and xo > 2, G(xo) > 0. Then, for the Newton-l-step-ADI method, if

d(x.)
x

the iterates satisfy x Z

R.

For ADI-Newton methods, we have the following result. PROPOSITION 8.3.7 [Caspar (1968)]. Let H, V, G, H', V', x0, and x* satisfy

the hypotheses of Proposition 8.3.6. For the ADI-1-step-Newton method with

r. = S.< s < oo and

s.+(112) < S < 00

if r. is greater than or equal to each diagonal element of V',. and r.+(1/2) is greater than or equal to each diagonal element of Hx,, then x-/2 > X(.+1)/2 G R

and x.12 - R. To treat SOR-Newton methods, however, it has appeared necessary to

use variational methods entirely. Thus one considers the SOR-1-step

Newton method applied at x as yielding a direction p,,, and then one must set x.+, = x. + t p. for suitable t just as in the full Newton method analyzed in Theorem 8.2.1. The step t. can be chosen by essentially any of the usual

step-size methods, but we know of no general results allowing t. = 1, even for large n. One of the simplest methods for choosing t. is one of the techniques of Section 4.6, which in this case is as follows.

PROPOSITION 8.3.8 [Elkin (1968)]. Suppose f' is continuous and 0 < aI < f" on W(xo). Let p. be the direction generated by the GaussSeidel-l-step-Newton method applied at x, to solve Vf(x) = 0..-Let

190

OPERATOR-EQUATION METHODS

x+

SEC. 8.4

where, for a c- (0, 1), t is the first of the numbers

t = a°, a', a2, ..' . , to satisfy

f(x.) - f(x. + tP.) > -at. for a fixed 6 > 0. Then x.

2, minimizing f.

Most other results concerning the choice of t restrict t to an interval, as in Section 4.5, depending on values of the eigenvalues of f; these results as well as that above are valid for block methods also [Elkin k1968), Schechter (1968)]. 8.4. LEAST-SQUARES PROBLEMS

One approach for solving the operator equation J(x) = 0, and one which in fact is the source of many minimization problems, is that of minimizing E(x) = I IIJ(x)112; we shall consider this general question without the assumption that an k exists with J(9) = 0. Thus we shall seek a point .2 where

VE(2) = 0-that is, J'*J(z) = 0 where * indicates conjugate transpose. If Jr is invertible, this will then imply J(9) = 0. If we write G(x) = J''*J(x)

then we merely have a general nonlinear operator equation G(x) = 0 where in fact G = VE and we can use any of the methods mentioned before. For example, we can consider for many problems the Newton direction for the equation J(x) = 0 and use this as a descent direction for E(x). Since <-J' 'J(x),VE(x)> = -11J(x) 111,

this generates directions driving 11 J(x)11 to zero if 11 J' ' 11:!9 B for all x. This approach has been widely used [Fletcher (1968)] with a modification to avoid derivative calculations, so we leave it until the next chapter. We shall now dwell on modifications of Newton's method for this problem. To use Newton's method one needs Gx; we see here that GX = Y'Ax) + X *Yx

Thus to use Newton's method one needs second derivatives of J, while J may itself involve first derivatives of some function f. We note, however, that in the case in which we expect J(2) = 0, the term J"J(x) for x near R

sEc. 8.4

OPERATOR-EQUATION METHODS

191

is very small; we might hope, therefore to be able to replace G.', by J',*J'. If this latter operator is invertible, we are led to the Gauss-Newton method for solving J',*J(x) = 0-namely, X..' = X. -- [J.*J.] - I J.*J(x,)

We note that the direction p = -[J;*Jr]-'J;*J(x,) is a direction of decrease 0. The presentation in Prop0 if for E(x) since
immediately turn to our general results on variational methods if we are willing to modify the method via X. + I.P. and allow t :?,-, 1. What we really need is for the direction sequence to be admissible, which will be valid if, for example,

0
If J'. J'X is singular, then the inverse of course makes no sense and the Gauss-Newton iteration is not defined. However, we can consider an extension of the method in which the inverse is replaced by a pseudo-inverse. Denoting the (Penrose) pseudo-inverse of a matrix A by A+ [Langenhop (1967), Penrose (1955)], we have the iteration

x.+1 = X. - [J.*Jo+J.*J(x.) This extension of the Gauss-Newton method, along with the Newton method similarly modified, can be analyzed just as the original methods. If, essentially,

the pseudo-inverse operation is continuous in some neighborhood of the iterates [x.), then a local-convergence proof can be given for the solution of [J=*J')+J',*J(x) = 0 [Ben-Israel (1965, 1966)]. Under the same kind of hypotheses, global-convergence results for the usual choices of the step t. are known. This type of method appears quite useful for such least-squares problems; we remind the reader, however, that all of the preceding variational techniques are available here also, including the very powerful conjugate-gradienttype methods with the "conjugatizing" performed with respect toJ''*J,-that is, with H, = I in the general notation for conjugate gradients. One of the drawbacks in practice of the Gauss-Newton approach is that in some cases

192

W. 8.4

OPERATOR-EQUATION METHODS

E(x) _ 11j(X) 112 cannot be reduced appreciably in the computed direction p often because p, makes nearly a right angle with VE(x), although such an angle is not necessarily bad. In such a case it is sometimes wise to use the gradient direction instead, or, perhaps better yet, some compromise between the pure gradient and pure Gauss-Newton directions. We consider this approach next. One way [Kowalik-Osborne (1969), Levenberg (1944), Marquardt (1963), Morrison (1960)] of effecting such a compromise between the directions is to use the direction. p,(a,) defined by [

?fit

i

+ aw ]p,, = -Js J(x,), a.

0

For a = 0, this yields the Gauss-Newton direction, while for a, large, the direction is

p1 = a

O\a 1

which is asymptotically the gradient direction. EXERCISE. Show that, as a increases, IUp,(a,)II and the angle between

p,(a,) and -V

decrease monotonically to zero.

Since, moreover, J,*J', is positive-semidefinite, we can always solve for p, if a > 0 and, if a, is bounded away from zero, [J;*J;, + a,l] will be uniformly positive-definite; therefore, if I I J: I I is bounded and 0 < a < a, S A < 00, then the direction sequence is admissible and the various methods for choosing

t, may be used. In practice, of course, one of the difficulties is in choosing a, so as to get a significant decrease in E(x) without moving too far from the Gauss-Newton direction, which should, if J(x*) = 0, yield ultimately superlinear -convergence. Since computing p,(a,) involves solution of a linear system for each value of a,, one apparently would not want to experiment with different values of at. for fixed n if. this can be avoided; in fact this can be done easily using matrix decompositions [Golub-Saunders (1969)f. A more direct way of effecting a compromise between two choices of direction has recently been proposed [Powell (1968, 1969)]. The method there is merely to set

-p,(a,) =

(1 - ajJ,*J(x,),

a, E [0, 1]

the compromise here being between and the pure Newton direction for J, although the Gauss-Newton direction could be used as well. Because

each of the directions of which p,(a,) is a convex combination is itself a direction of decrease for E(x), so also is p,(a,J. In fact, if a is bounded away from 1, even if the inverse is replaced by the pseudo-inverse the directions

sEC. 8.4

OPERATOR-EQUATION METHODS

193

will be admissible and our various step-size algorithms are applicable.

Again, the problem of choosing a at each step is not well understood; however, trying various choices is not too costly, since a set of linear equations is only solved once. A heuristic approach in Powell (1968) is used to select a+, with excellent results; since the method there is actually implemented

with differences-as opposed to derivatives-and with other computationally convenient modifications, we postpone further discussion to the next chapter.

9

AVOIDING CALCULATION OF DERIVATIVES

9.1. INTRODUCTION

In the preceding chapters a variety of methods has been presented for minimizing a functional f; these methods require a variety of kinds of data

about f, such as values of f, Vf, and f". Although for certain important problems the derivatives needed in Vf and f" are easy to compute, generally speaking we should like to avoid using any more derivatives than absolutely necessary. Thus recently, methods have been created to reduce dependency on knowledge of derivatives. In some cases these methods take the simple expedient of approximating the derivatives, usually by differences, but there have also been some techniques developed which entirely avoid all derivatives and approximation thereto. In this chapter we shall examine some of these methods and see how they can be analyzed. 9.2. MODIFYING DAVIDON'S FIRST METHOD

Recall that the first method of Davidon is a variable-metric algorithm which proceeds as follows, given x,, and Ho :

Given x and H., set

x.+, = x. + t.P.

p,. =

where t minimizes f(x. +

let 6. = Vf(xa+,) -

H.+ 1 = H. + t.

p

.\

/

and let

.a

where * indicates conjugate transpose. This method requires use of the 194

AVOIDING CALCULATION OF DERIVATIVES

sEC. 9.2

195

derivatives Vf(x,J; these can be approximated very simply by the vector y of forward differences-that is, Y, =

f(x, + d, e) - f(x,,)

d, > 0

d,

where e, is the vector with e,, = 6,,, the Kronecker delta. It is clear that if the d, for i = 1, 2, . . ., I are small enough, this method will behave essentially

as well as the method with derivatives; the problem here, as with all the

other methods wherein derivatives are replaced by differences, lies in how to choose the d,. An excellent analysis of this problem has been given for Davidon's method [Stewart (1967)]; since the viewpoint is of interest for use on any method, we present the ideas here once and for all. The whole basis of most gradient methods is to treatf(x) as a quadratic, locally; thus we shall consider the problem of approximating the derivative

y = f'(0) of the quadratic f(t) =f(0) + yt + 40&t2 by the difference

f(d)

f(0 d

= Ya

The two sources of error in approximating the scalar y by the scalar Yd are the truncation error in the divided difference and the cancellation produced in computing f(d) - f(0) for small d; clearly we should balance these errors. We can estimate the relative truncation error by

YdY

YI

1I

Ild Y

If we assume that, in computing f(t), we actually compute

f +(t) = f(t)(l + E),

l e I < it known

then the relative cancellation error can be estimated as 21

f(o) f(d) - f(0)

I

q

If we equate the two estimates and solve for the number d, we find that I dl should be the positive root of

3a2z3 + lal ly! z2 - 41f(0)I IYI n =0 and sign (d) = sign (ay).

196

ssc. 9.2

AVOIDING CALCULATION OF DERIVATIVES

EXERCISE. Show that the above choice of d is correct.

To avoid solving the above cubic, experimentally it has been satisfactory to ignore the cubic or quadratic term, depending on which gives a smaller root, to solve the resulting simpler equation, and to refine the result by one Newton step applied to the original cubic. This gives the following:

2{ILI}u2'

Idl=r[1-

z = 211 fao)I}"'

Idl = s[1 -

JaIr 1a T1+14Iy1]

if

y2ZIaf(o)I17

if

y2 < Iaf(O)I,

This computation requires a crude estimate of y, which can easily be obtained, and also an estimate a ,of f "(0). For the Davidon method we are considering,

f(xx + te,) =f(xJ + t + 3g,'12+ o(t2) where a,, is the ith diagonal element of f' s'. Recall that H. - f' - '; thus we seek the diagonal elements of H.'. These can in fact be computed recursively (see Proof of Theorem 7.4.4) without knowledge of the off-diagonal elements H.-' from the equations [Stewart (1967)]

Hi+'i = Hr ' + ( ltr

-

f) a,a* +

I

[Vf(x)o' + a,vf(xi)*]

where

P, = (Vf(x), P,>,

.8r =

and we replace Vf(x,) by its approximation. EXERCISE. Show that the above recursion for H;+', is correct.

Thus we have a rule for determining the size of the numbers d, in the difference approximation to Vf(xJ. Certainly we have ignored many problems, implying that our analysis is far from rigorous, but the ideas in practice appear to lead to good results. For further computational details and examples the reader is referred to Stewart (1967), where the method is shown to be quite powerful in practice. A similar approach has been applied for a Davidonlike method applied to minimize I I J(x) 112 for a nonlinear operator J [Fletcher (1968)]; here J'- I usually does not exist, so one is led to finding a Davidontype approximation to a pseudo-inverse J'+ using differences. The method as proposed in Fletcher (1968) is exact for quadratics. EXAMPLE [Stewart (1967)]. The Davidon modification without derivatives was used to minimize

AVOIDING CALCULATION OF DERIVATIVES

SEC. 9.3

197

f(x, y) = 100(y - x2)2 + (1 - x)2 starting with xo = -1.2, yo = 1.0. After 163 function evaluations, f was reduced to 9 x 10-12 with (x, y) = (1.000002, 1.006003)

9.3. MODIFYING NEWTON'S METHOD

Computationally, at least two problems are involved in Newton's method: the need for solving a linear system at each step, and the need for evaluating roughly 12,derivatives at each step. One can eliminate most of the derivative evaluation by using the derivatives at one fixed point throughout,

but this eliminates the powerful feature of quadratic convergence. An alternative, of course, is to use differences to evaluate the derivatives. If the step size used for the differences is small enough, one should maintain the rapid convergence, it would seem. The local-convergence properties can be

analyzed by means of Proposition 8.2.1; if we replace the derivative of J(x) by AJ(x, f) ^ y, where the components yr =

E

[J(x + fe) - J(x.)]

as in Section 9.2, we have the following local result [Dennis (1969)]. PROPOSITION 9.3.1. Let

Ao' ° [AJ(xo, fo)] exist with

IIA0 'J'-A0 1J;II_KIIx-yIl for x, y E S(xo, r). Let e > 0 be such that

K(e + Suppose that

fo) < 1 and

2

>h=

KII Ao'J(xo) II

(1 - Ke - Kfo)2

2

r ro =

1

-

h

x

IIAo'J(xo)II

1-Kf-+Kfo

and that f is a sequence of numbers such that If. I < f and x + f.e, E S(xo, r) f o r i = 1, ... , 1. Then the sequence

x.+, = X. - [AJ(x., fJ]-'J(xJ

198

sEc. 9.3

AVOIDING CALCULATION OF DERIVATIVES

is well defined and converges to 2, solving J(x) = 0. If, for a constant C, we have I E. I S C I I J(x.)11, then the convergence is quadratic. For the global properties the situation is somewhat Less simple; we have to be sure that the directions generated are descent directions. Thus let us suppose that J(x) = Vf(x) and that AJ(x, e) is as defined before; let us consider the follwoing direction-generating algorithm [Goldstein-Price (1967)]: If AJ(x1, e.) is singular, or if

S 0

then set p -

(9.3.1)

--Vf(x1; otherwise set

P. = -[AJ(x., f,)]-1Vf(x.) Suppose that 0 < aI S Jx S AI; clearly this "uniform nonsingularity" of J' implies, if J' is continuous in x, that for bounded sequences x with J' I I -p 0, that AJ(x1, c.) is nonsingular c. --, 0, we have that I I AJ(x1, for large n and is in fact positive-definite, and that

I I [AJ(x., fJ]-' - J: '

II

00

Thus we may use, as usual, any of the step-determining methods in conjunc-

tion with the admissible direction sequence this generates. In particular, if we can take t.= 1 for large n and I e. I S C I I Vf(x) Ii, we would expect quadratic convergence near the solution 9 because of Proposition 9.3.1 if J', is Lipschitz-continuous in x near .2. The following theorem describes how this can be done by means of the method of Section 4.6.

THEoREM 9.3.1. Let 0 < al :!&- J;, S AI for all x, with J(x) = Vf(x) and YX continuous in x. Let the directions p. for n Z 1 be generated by Equation 9.3.1 with I EI I < C II t.-, p.-, II and let p0 = -Vf(x0). Let q E

(0, 1) and a E (0, 1) be given and choose t as the first of the numbers t = 1%°, a', a', ... , for which

f(x,) - f(x. + tp.) Z qt <-Vf(x.), p.> and let

X. + t p.. Then for any x0, the sequence

converges to the

unique R minimizing f over R. There exists an N such that for n Z N we have t = 1; if J' is Lipschitz-continuous in some neighborhood of .21 then x -.2 quadratically. Proof. The iteration is certainly well defined, since <-Vf(x ), unless x =.R. Therefore, we have

0

AVOIDING CALCULATION OF DERIVATIVES

sec. 9.3

199

qt.<-V f(x.), p.> S f(x.) - f(x. + t.P.) a

_

t.<- VAX.), P.> --1J=.+Iy.P., P.> for some A. e (0, 1)

and hence

(l -q)<-Vf(x.), t.P.> z -- Il t.P.lI2 Since f is bounded below and

f(x,) -f(x.+,) z q<-Vf(xJ, t.P.> hence 11 t. p 112-tends to zero. Therefore, it follows that <-Vf(xj, is bounded; hence f(x,), the sequence e. -- 0 and, since

)0,

I I AJ(x., E.) - J; II

11 [AJ(x., E.)]-' - J'.-' I1-) 0

Hence for large n we have

I S AJ(x., c.) S 2AI We may now apply Corollary 4.6.2 with

d(t) = qt,

d1(t) = t min (-, 1)

0. Thus, by Theorem 4.2.1 and either Proposiand conclude that tion 1.6.1 or Theorem 1.6.1, we deduce that x - 2. Arguing just as in the Proof of Theorem 8.2.1 and using the fact that

_[AJ(z,,,e.])-'+o(1)

as n-boo

we conclude that we may take t = I for large n. If J, is Lipschitz-continuous in x, we conclude from Proposition 9.3.1 that the convergence is quadratic. Q.E.D.

In practice this method appears [Goldstein-Price (1967)] to exhibit rapid convergence without the cost in time caused by derivative evaluation. . Ex LE [Goldstein-Price (1967)]. The method was applied to minimize f(x, y) = 100(y - x2)2 + (1 - x)2 and generally required about onethird as many function and gradient evaluations as the original Davidon method to reduce the function to a given value. In particular, about 70 and.

200

sEC. 9.4

AVOIDING CALCULATION OF DERIVATIVES

205 evaluations were required by the respective methods to reduce f from an initial value of about 24 to a value of about 10-14.

A similar modification of the Gauss-Newton method that has been proposed also attempts to reduce computing time by eliminating some of the difficulties caused by having to solve a linear system at each step of the itera-

tion; the technique is certainly applicable to the general Newton method, but we shall discuss it for the Gauss-Newton method, to which it was first applied. 9.4. MODIFYING THE GAUSS-NEWTON METHOD

Recall that the Gauss-Newton method is applied to minimize E(x) IIJ(x)112 for a nonlinear operator J mapping 1R' into IRE'; the iteration pro-

t minimizes E(xa + tpa) and ceeds by setting x+ pa = -(J;,*J.)-1J'.*J(xa) where * indicates conjugate transpose. We wish to eliminate the derivative calculations and reduce the difficulty inherent in computing pa as the solution of a linear system at each iteration. The theory behind approximating the derivatives causes no problem, as we saw in the last section; we consider the following approach for treating the other difficulty, essentially as in Powell (1964a).

Suppose that at the iterate x we have 1 (in IR') independent directions d., 1, da,2, ... , d,,,, and estimates ye,, of the derivative J;,*J'd.., of Eat x in the

direction da,, for i = 1, 2,

1. If we write pa = scalars q,,,, then we need to solve . . . ,

r

ga,,da,, for unknown

J,*J,.P. = -JA*J(x.), that is, I

J:*J1. . q..rd..r Assuming, of course, that J;, is itself approximated by some matrix Pa, probably by differences, we must solve the linear system

I'.q. = -P.J(x.) where

r. =(y..,,Y..2,...,y..),

q. =(q..r,...,q..r)

During the minimization of E(xa + tpa) it is quite simple to compute estimates of the derivative J;,*J', p. of E along pa; therefore, it is natural to let pa replace

one of the directions da,,. With some extra effort one can choose the da,,.

SEC. 9.5

AVOIDING CALCULATION OF DERIVATIVES

201

to be replaced in such a way that the new direction set is as "linearly independent" as possible [Powell(1964a)]; that direction and its derivative estimate are then replaced, while the other directions and derivative estimates are kept fixed. This is based on the assumption that J' d.., is not significantly

different from J;,*J',dd,,, or, if it is different, that the direction d,,,, and its derivative estimate will soon be replaced automatically. Thus, from step n to step (n + 1), only one vector y.., is changed-that is, only one column

of r, is different from that of r ; in this situation, however, r.-,', can be computed by a simple updating of r;'. Therefore, if I is small enough that the 1 x 1 matrices r.-, can be stored, the iteration can proceed rapidly without constantly resolving a linear system. Clearly a local-convergence proof for this method can be given under suitable hypotheses; we are not aware of any global-convergence results, however. At any-rate, this heuristic approach and other modifications of it appear quite useful. EXAMPLE [Powell (1964a)]. The method of Powell (1964a) was applied to minimize

f(x, y) = [10(y - x2)]2 + [(I - x)]2

starting at x0 = -1.2, yo = 1.0. About 65 function evaluations yielded (x, y) = (0.9919, 0.9827) with f = 0.0002, while about 70 yielded (x, y) _ (0.9986, 0.9973) with f = 0.000003. 9.5. MODIFYING THE GAUSS-NEWTON-GRADIENT COMPROMISE

In Section 8.4 we considered some "compromise-direction" methods,

one given by

(J;,*J;, + a.T)P = -J.*J(xn) where * indicates conjugate transpose, and another given by

p. = a. N. + (1 -

a, E [0, 1]

where N. and G. were two descent directions; it is proposed in Powell (1968) that

G_

and N. _

= might also be used, especially if J: (R' ER"' and J;, is not invertible. Computationally, the more direct method of the two, setting although N.

P. = a. N. + (1 - a)G.

202

AVOIDING CALCULATION OF DERIVATIVES

SEC. 9.5

appears more reasonable and we therefore discuss the implementation of this approach. The algorithm as presented in Powell (1968) gives a full description of an operational method of apparently great power, partially because of the

way in which it attempts to simplify computations by using some of the various techniques described in other sections and chapters of this text; we sketch the approach to show the way in which many concepts can be combined. First of all, the computation of N. as first described requires solution of a linear system. This can be handled instead by a variable-metric approach with N. = - H,J(x;,) [or - H,J' *J(x,)] where H. is computed recursively as

an approximation to the appropriate matrix inverse, we saw in Sections 7.2 and 7.3 how simply H. could be generated to satisfy this condition, and in Section 9.2 how all derivatives could be replaced by differences. Once N. and G. are available, we try to generate p,. In order to avoid searching for the

minimum of E(x) along x, + tp, each time, the computer program in Powell (1968) carries ilong a varying estimate s, of the step it t,P, I I, computed by using essentially the previous step 11 t,_, p,_, II. Therefore, if Il N,115 s,,

the direction p, = N. is used; if I I N. I1 > s some component of G. is introduced. Now the minimum of E(x) along x, + W. can be estimated to be at X. + pG,,,

k = 1JGGill 2

If s, S {t 11 G,1l, the direction vector p, is chosen as

P.=S"IIG.11 Otherwise we set

p, = a,N, + (1 - a,)G,tc with a, chosen so that 11 p, ll = s,. If E(x, + p,) is "sufficiently" smaller than

E(x,), then x,+, is set equal to x, + p while otherwise s, is "reduced" and the direction-generating process recommences; the terms "sufficiently" and "reduced" are defined in Powell (1968) heuristically in an attempt to guarantee convergence without having an excessively small step size 11 x,+, - x, ll very often.

EXAMPLE [Powell (1968)]. The implementation of the above method in Powell (1968) was applied to minimize

f(x, y) = [10(y - X2)]2 + [(1 - x)]2

starting with xo = -1.2, yo = 1.0. After 26 function evaluations, the point (x, y) = (1.0000, 1.0014) had been reached with f = 0.0002.

SEC. 9.6

AVOIDING CALCULATION OF DERIVATIVES

203

The rough outline of the above computational algorithm was given only to show how various computational methods could be blended together to avoid such problems as derivative computation, matrix inversion, and so forth; it is certainly not possible for us to list all possible techniques, but the reader should now have a basic grasp of the kinds of techniques available for simplifying computations and of the variety of ways in which they can be combined. Evidence seems to indicate that, for efficiency without loss of accuracy, those methods which avoid derivative calculations by using differences and avoid the matrix inversion of Newton or Gauss-Newton methods by using variationally oriented variable-metric Davidon-type methods are the best available at present.

9.6. METHODS IGNORING DERIVATIVES

The methods mentioned so far in this chapter have been simply modifications of other methods in order to avoid the calculation of derivatives required by those methods. We turn now to methods, designed essentially for nonsmooth or noisy functions, which do not, even by approximation, attempt to make any use of derivatives. Perhaps the earliest such method is the basic relaxation method [Southwell (1946)], in which the functional f(x), x E IRI, is minimized as a function of each coordinate direction e, successively-that is,

X. + t p,,, where t minimizes f(xx + tp3, and p = e where

i = n(modulo 1) + 1; sometimes also t. was chosen merely to decrease

sufficiently.

Since the above method does not allow for diagonal movement, the following method was developed [Hooke-Jeeves (1961)]. Staring at any x, we define the exploratory-move operator EM(x) as follows. Given some

step sizes b, = J,(x) > 0, setting x' = x, starting with i = 1 up to i = 1, if f(x' + a,e,) < f(x'), then x' is replaced by x' + b,e, and i by i + 1; if f(x' + b,e) Z f(x'), but f(x' - 26,e) < f(x'), then x' is replaced by x' 26,e, and i by i + 1; otherwise x' is not changed. Finally, when i reaches 1-+- 1, we set EM(x) = x'. The entire algorithm proceeds from x to as follows, starting with some initial x, and x, = EM(xo) # x0. We compute 4 = EM(x,); if x ' . = x then is cut in half f o r i = 1, 2, ... , l and the

-

iteration restarts at x,,. Otherwise, if f[EM(2x f(x,), we set xx+, = EM(2x, if the latter inequality is invalid, we set x = EM(x ). Iff is strictly convex with Vf continuous and lim f(x) = +oo, then it can be shown [Cea (1969)] that II x - 2 -.0, where .2 minimizes f over IR'. A computer implementing this method can be found in Kaupe (1963).

Since the coordinate directions, which are used in the above algorithm, need not be the best ones, the process has been modified as follows (Rosen-

204

AVOIDING CALCULATION OF DERIVATIVES

SEC. 9.6

brook (1960)]. Given a vector x and I orthonormal directions d,(x),I step sizes 8,(x), and two parameters a > 1, ft E (0, 1), the exploratory-move operator EM(x) is defined as follows. For i cycling through the values 1, 2, ... ., 1setting

x' = x, if f(x' + 6,d,) < f(x'), then we replace x' by x' + 6, d,, i by i + 1 (or 1 by 1), b; by a3 and record a success; otherwise b, is replaced by -,86, and a failure is recorded. After one success and one failure have been recorded for each value of i, we set EM(x) = x'. The iteration, starting with xa and

and from directions d,(xa), ... , d,(xo), now proceeds from x to EM(x ). Let 2., be the sum of the steps in the as follows. Set to 2.,d,(x ). New directions direction d,(x,) and define vectors d,(x,+,) are now obtained by orthonormalizing the vectors q,; this completes the description of the method. Roughly speaking, we can say that is the most successful motion found so far, d2(x,+,) is the most successful direction orthogonal to d,(x,+,), and so on. A further modification of this to move to the point minmethod [Swann (1964)) is, for each direction imizing f in that direction, and then to compute new directions as before. We do not consider these methods further since we believe the methods to be considered next to be of greater importance and usefulness. We have seen several times in earlier chapters that there is great advantage to using conjugate directions of some type; the above methods, however,

all deal with orthogonal directions-that is, 1-conjugate directions. It is possible, however, to generate directions that are conjugate (at least for quadratics) without dealing with derivatives. These methods, which appear to be the best of those that ignore derivatives, are based on the fact that if one minimizes a quadratic

f(x) = in the direction p from two points x, and x2, arriving at the points x', and x'2, then x', - x'2 is M-conjugate to p, since

= =0 implies = 0. The first good method to make use of this was given by Smith (1962); the method was considerably simplified without losing the conjugate-gradient

property in Powell (1964b), but unfortunately the latter's algorithm could break down even for quadratics [Zangwill (1967)]. A correct version of this idea was developed in Zangwill (1967), and it is this version we shall present. The method is as follows. Let the initial directions d0,,, i = 1, . . . ,1 be the coordinate directions,

do,, - e let e E (0, 1), set b, = 1, and let x,,o be the initial point. Iteratively apply the basic k-iteration starting with k = 1; the basic k-iteration is as

AVOIDING CALCULATION OF DERIVATIVES

SEC. 9.6

205

follows: (1) for r = 1, 2, ... , 1, compute tk,, to minimize f(xk,,_, + tdk_1,,) and let

(2) Let ak = I I xk., - xk,0 Ii

xk,, - xk, o and dk,+, = ak

Compute tk,l+I to minimize

f(xk., + tdk.,+l) and set Xk+ 1,0 = xk.l+1 = xk,, + tk,l+1 dk,,+I

(3) Let tk,, = max {tk,,; r = 1, 2, . . . , 1); if tk,,ak ak

>E

let dk+ I., = dk,,

for r # s,

dk+1,, = dk,,+,

and let

vk+1 - tk.sak ak yy

If, however, tk.syk <E ak

let dk+,,, = dk,, for r = 1, 2, . . . ,1 and set vk+I = 6kConsider the method applied to minimize the quadratic

f(x)_ Suppose each time that the last direction dk,, is replaced by dk,,+,. Then the last step of the k-iteration and that of the (k + 1)-iteration are in the same direction and, therefore, because of our earlier remark, we shall next- introduce a conjugate direction. After 1 + 1 steps we would have I conjugate directions and, if they are linearly independent, we shall therefore get the correct solution on the next iteration. The method of choosing the direction to be

206

sec. 9.6

AVOIDING CALCULATION OF DERIVATIVES

eliminated is the technique that keeps the directions dk....... dk,, independent, as we shall see; it also determines which direction, if any, is eliminated at each step and thereby invalidates the above argument. Thus it does not appear possible to prove that this method is exact for quadratics, although we can prove convergence. First we show that the directions dk,, (for arbitrary functionals) are linearly independent.

[Zangwill (1967)]. The directions dk....... dk,, are

THEOREM 9.6.1

linearly independent: In fact, their determinant satisfies det [(dk....... dk.,)] ak > E.

Proof: The result is true for k = 1; assume it for k. Then, since xk.J - xk.0 = akdk.J+1, we have det [(dk,,, ..., dk..-1, dk.,+I,

I

'om'' det [(dk,,,

ak

dk.,+I, ..., dk.,)]

..., dk.J)] = tk.sak for all s ak

The choice of s-that is, the direction to try to replace-gives us the greatest chance of replacement, while the criterion for replacing or not yields {k+1 > E. Q.E.D.

Having the above fact, we can prove convergence. THEOREM 9.6.2. Let f be a continuous and strongly quasi-convex functional on IR', and let the above method be applied starting with an arbitrary x1,,, ; suppose the sequence {xk,,],

r =0, 1,...,1,

k = 1,2,...,

is bounded. Then any limit point x' of xk,,, as k - oo for any ro = 0, 1,

... ,1 is also a limit point of xk.,, r:# r0, as k

oo, and for each such limit point there exist I linearly independent directions dl, ... , d, such that f(x') S

f(x'+td,)foralltandr=1,2,...,1.

Proof Since I I dk., II = I also, given any subsequence K of integers k, we can find a further subsequence K, such that dk,, d, for r = 1, 2, . . . , I as k , oo with k E KI, and xk,, -+ x, for r = 0, 1,. .. , I as k oo with k E K, ; we show next that x,+ 1 = x, for r = 0, 1, ... , 1- 1. Recalling that

f(xk,,+,)
C

lim

f(xk,r) =f(x,)

In fact, for all t, f(xk,,+1) Sf(xk, + tdk-l.r+l)

AVOIDING CALCULATION OF DERIVATIVES

SEC. 9.6

.

207

which then yields, just as above,

f(x,+1) =f(x,)
x,+1 - x, = t,+1d,+1

where in fact t,+1 = lim tk,,+ which implies that f is minimized along the kEL,

line x, + td,,, at two points x, and x,+ 1; because of the convexity assumption on f, we must have x, = x,+,. Denote the common limit by x'-that is,

X =xo=x=...=x, We now have f(x') < f(x' + td) for all t and r = 1, ... ,1, since this was true for x,-,, as we saw above. However, since det [(dk, I, dk.2, ... , dk.1)] > E and I I dk,, I I = 1, we must have

det[(d,, d2..... d,)]>E which implies that the d, are linearly independent. Since for any subsequence K, we could find a further subsequence K, as above, the theorem is proved. Q.E.D.

COROLLARY 9.6.1. If, in addition to the hypotheses of Theorem 9.6.2, f is continuously differentiable, then Vf(x') = 0 for any limit point x' of xk.,.

Proof: As we saw above, we have f(x') < f(x' + td,) for all t and r =

1,

... ,1; thus (Vf(x'), d,> = 0. Since the d, are linearly independent, we have

Vf(x') = 0. Q.E.D. COROLLARY 9.6.2. If, in addition to the hypotheses of Corollary 9.6.1,

we know that [x; Vf(x) = 0} contains no continuum, then the sequence Xk., converges.

Proof. We have shown that the difference of successive elements in the sequence xk.0, ... , xk,, tends to zero; the same argument shows that I I xk,1 -

I

II I I =I I xk.1 - xk+1,011- -> 0

Thus we may apply Theorem 6.3.1. Q.E.D. COROLLARY 9.6.3. If the continuously differentiable, strongly quasiconvex functional f is strictly pseudo-convex-that is, if <x - y, Vf(y)> Z 0

SEC. 9.6

AVOIDING CALCULATION OF DERIVATIVES

implies f(x) > f(y) for x # y-and if [xk,,} is bounded, then xk,, --+ z, the unique minimizer for f.

Proof: Limit points x' exist with Vf(x') = 0 by Corollary 9.6.1; then, by the strict pseudo-convexity, for any x, we have 0 = <x - x', Vf(x )> and hence f(x) > f(x'), so x' minimizes f. By the strong quasi-convexity, such a minimizer is unique. Q.E.D. COROLLARY 9.6.4. If f is uniformly quasi-convex, strictly pseudo-con-

vex, and continuously differentiable, then for any x,,0, the sequence (xk,,} converges to the unique k minimizing f. COROLLARY 9.6.5 [Zangwill (1967)]. If 0 < aI S f' in IR', then for any x1,0, the sequence [xk,,) converges to the unique z minimizing f. EXERCISE. Prove Corollaries 9.6.4 and 9.6.5.

We have not, however, been able to prove that the method is exact for quadratics; Zangwill (1967) has developed a modification of the method that is exact. The Zangwill method is as follows. Let e i = 1, . . . , I be the coordinate directions and let an initial point x0., and directions dl,,, . . . , d,,,, II d,,, II = I be given. Let to,, minimize

f(x0,, + td,,,) and let xo.I+1 = x0,1 + to,ldl,,

Set n = 1 and iteratively apply the basic k-iteration starting with k = 1; the basic k-iteration is as follows, given xk_ I.,+ dk....... dk,, and n: (1) compute a' to minimize f(xk_1,,+, + let n' = n, and replace n by n(modulo 1) + 1. If a' :?,- 0, let xk,o = xk_,,,+1 + a'e,; ; if, however, a' = 0, return to the start of step 1, noting that if step 1 is performed I times, we may consider xk_,,,+, to be the solution. (2) For r = 1,. .. ,1, compute tk,, to minimize f(xk,,_ 1 + tdk,,) and let

Define dk,,+ l

I I xk,, - xk- 1.1+

1

compute tk,,+, to minimize f(xk., + tdk,J+,), and setxk,l+1 = xk,, + tk.,+1 dk,,+l

Define directions dk+,,, ° dk,,+, f o r r = 1 , 2, ... ,1.

SEC. 9.6

AVOIDING CALCULATION OF DERIVATIVES

209

This method differs from the preceding primarily in its feature of minimizing over the coordinate directions as well as the directions dk,,; this feature allows us to revise the directions dk,, in the simple manner of the algorithm and thus obtain exact convergence for quadratics, as the following theorem shows. THEOREM 9.6.3

[Zangwill (1967)].

Let f(x) = ,

where M is self-adjoint and positive-definite, and let the initial point xo,, be given. Then the iteration stops during step 1, with xk_,,,+, = h, the solution, for some k < 1. Proof: Assume that at the start of the basic k-iteration at step k for

k < n - 1, the directions

dk.1-k+I, dk.1-k+2, ..., dk,,

are mutually M-conjugate and linearly independent; clearly this is true for k = 1, starting the induction. If we do not stop in step I this time, then xk-1,1+I : Xk.o and, since M is positive-definite, f(xk,o)
f(xk.,)

so xk-,,,+I :P1- xk,, and hence dk,,+, # 0. At iteration k - 1, since dk_I,,+, _ dk,,, the last k directions to be used were dk,1-k+I, ... , dk,,. Since these directions were assumed to be linearly independent and M-conjugate, the point xk_,,1+1 minimizes f over a k-dimensional flat spanned by those directions; similarly, however, xk,, yields the minimum over a parallel flat. Therefore, dk,,+, is M-conjugate to each of the directions dk,,_k+l, ... , dk,,. Since none of these directions nor dk,,+I is zero and M is positive-definite, it follows that . , dk,,, dk,,+l are linearly independent and mutually conjugate. . Since these directions equal the directions

dk.1-k+I+

dk+l,1-(k+l)+l , ... s dk+1.1

the induction is complete. Thus, if the procedure has not stopped by the time we reach k = 1, then the directions d,,,, . . . , d,,, are linearly independent and M-conjugate. In step 2 of the iteration with k = 1- 1, we have minimized successively over these I directions and, therefore, x,_, therefore, we shall stop automatically during step 1 of the iteration with k =1. Q.E.D.

Thus this new method is exact for quadratics; it also is convergent for more general functionals, as was the preceding method. Since the Proof of the following is nearly the same as that for Theorem 9.6.2, we shall be very brief.

210

AVOIDING CALCULATION OF DERIVATIVES

sEc. 9.6

THEOREM 9.6.4. Let f be a continuous and strongly quasi-convex functional on 1R', and let the new method of Zangwill described above be used, starting with an intitial point xo,, and directions d,....... d,,,; suppose that the sequence {x,,,) is bounded. Then any limit point x' of xk,,, for any fixed ro is also a limit point of xk,, for r:;,,- r0, and for each such limit point x', we have f(x') < f(x' + te,) for all t and r = 1, . . . , 1, where the {e,} are the coordinate directions. Proof: For any subsequence of integers K, we can take a further subsequence K, such that

lim xk+,,, = x,., for

kELt

r = 0, 1, ...,/+ 1 and j=0, 1, ..., l - 1

Arguing precisely as in the Proof of Theorem 9.6.2, if necessary taking a further subsequence so that the directions also converge, we can finally conclude that there is a point x' with

x'=x',

for r=0,1,...,1+1 and j=0,1,...,1-1

Since we have been considering the results of 1 consecutive basic k-iterations

by dealing with the xk+,,, for j = 0, 1, . . . ,1- 1, all I of the coordinate directions are used in the minimizations for each value of k as j and r vary. Arguing for these directions e, as w did for the directions d . . . , d, in the

Proof of Theorem 9.6.2, we conclude that f(x') < f(x' + te,) for all t and

r= 1,2,...,1.Q.E.D.

Since the conclusions of this theorem for the new Zangwill method are effectively the same as for the old method in Theorem 9.6.2, all the corollaries of that theorem follow immediately. Thus we have the following corollary. COROLLARY 9.6.6. Corollaries 9.6.1, 9.6.2, 9.6.3, 9.6.4, and 9.6.5 are valid for {xk, j generated by the new Zangwill method above.

The two methods of Theorems 9.6.2 and 9.6.3, which attempt to generate conjugate directions without dealing with derivatives in any way whatsoever, appear to be about the best of the methods of this section, all of which ignore derivatives entirely.

EPILOGUE

We wish to close with some brief comments concerning the last four chapters of the text. The reader will have observed that we have not been able to give concrete statements as to which methods are the best to use; this is not

only because there is no one "best" method, but also, and as importantly, because there have not been any really carefully performed tests, comparisons, and documented results for the many types%f methods available for use on large classes of problems. Such a thorough study has been made [Engeli et al.

(1959)] for the special case of convex quadratics, but nothing comparable exists for more general classes of problems. The few numerical examples we have given throughout the text have been taken from the literature and serve to indicate the difficulty of comparing results computed by different authors. Although no really thorough comparative studies of methods have been made, the reader can find a number of numerical examples for various methods in

Broyden (1965), Fletcher (1965, 1968), Fletcher-Powell (1963), FletcherReeves (1964), Goldstein-Price (1967), Kowalik-Osborne (1969), Powell (1964a, 1964b, 1968, 1969), and Stewart (1967). There is still a great need for readily available documentation and thorough comparison of those algorithms which appear to be useful, and until such data are available we can only give general guidelines based on incomplete data.

Some rough guidelines are as follows. It is generally best to use a conjugate-direction (especially conjugate-gradient), Newton-like, or even Newton method if this is feasible; since algorithms are available which exhibit the properties of these methods and which are designed to use various numbers of derivatives, some method of this type is almost always suitable for any given problem, and should be preferred. Generally, one need not try too hard to perform an exact minimization along a line. If the theoretical 211

212

EPILOGUE

success of the method depends on the exact minimization, then some effort should be made; but in the other cases, function reduction is usually more important than function minimization, and some such method as in Section 4.5 or 4.6 might best be used. Admittedly, these are rather sparse guidelines, but until more valid comparisons are performed on significant classes of problems, little more can safely be said.

REFERENCES

ABADIE, J. (ed.) (1967), Methods of nonlinear programming, North Holland, Amsterdam. AKAIKE, H. (1959), "On a successive transformation of probability distribution and its application to the analysis of the optimum gradient method," Ann. Inst. Statist.

Math., Tokyo, vol. 11, 1-16. AKHIEZER, N. (1962), The calculus of variations, Blaisdell, Waltham, Mass.

ALTMAN, M. (1966a), "Generalized gradient methods of minimizing a functional," Bull. Acad. Polon. Sci., vol. 14, 313-318. ALTMAN, M. (1966b), "A generalized gradient method for the conditional minimum of a functional," Bull. Acad. Polon. Sci., vol. 14, 445.451.

ANSELONE, P. M. (ed.) (1964), Nonlinear integral equations, Univ. of Wisc. Press, Madison. ANSELONE, P. M. (1965), "Convergence and error bounds for approximate solutions

of integral and operator equations," in Error in digital computation, vol. 2, ed. L. B. Rall, Wiley, N. Y. ANSELONE, P. M. (1967), "Collectively compact operator approximations," Comput. Sci. Report #76, Stanford Univ., Stanford, Calif. Also to be published in Prentice-Hall series on Automatic Computation. ANSELONE, P. M., R. H. MOORE (1964), "Approximate solutions of integral and operator equations," J. Math. Anal. App!., vol. 9, 268-277. ANTOSIEWICZ, H. A., W. C. RHEINBOLDT (1962), "Conjugate direction methods and

the method of steepest descent," 501-512, in A survey of numerical analysis, ed. J. Todd, McGraw-Hill, N. Y. ARMUo, L. (1966), "Minimization of functions having Lipschitz continuous first partial derivatives," Pacific J. Math., vol. 16, 1-3. 213

214

REFERENCES

AUBIN, J.-P. (1967a), "Approximation des espaces de distributions et des operateurs diff6rentiels," Bull. Soc. Math. France Memoire 12, 1-139 (French).

AUBIN, J.-P. (1967b), "Behavior of the error of the approximate solutions of boundary value problems," Ann. Scu. Norm. Pisa, vol. 21, 599-637.

AuBIN, J.-P. (1968), "Evaluation des erreurs de troncations des approximations des espaces de Sobolev," J. Math. Anal. App!., vol. 21, 356-368 (French).

AUBIN, J.-P. (1969a), "Estimate of the error in the approximation of optimization problems with constraints by problems without constraints," 153-175, in Control theory and the calculus of variations, ed. A. V. Balakrishnan, Academic, N. Y. AUBIN, J.-P. (1969b), private communication. AUBIN, J.-P., J. L. Lions (1966), unpublished notes on minimization, private communication. BALAKRISHNAN, A. V., L. W. NEUSTADT (1964), Computing methods in optimization

problems, Academic, N. Y.

BALUEV, A. (1952), "On the method of Chaplygin," Dok1. Akad. Nauk CCCP, vol. 83, 781-784 (Russian). BARNES, J. (1965), "An algorithm for solving nonlinear equations based on the secant method," Comput. J., vol. 8, 66-72. BELLMAN, R. (1957), "On monotone convergence to solutions of u' = g(u, t)," Proc. AMS, vol. 8, 1007-1009. BELLMAN, R. (1962), "Quasi-linearization and upper and lower bounds for varia-

tional problems," Quart. App!. Math., vol. 19, 349-350. BEN-ISRAEL, A. (1965), "A modified Newton-Raphson method for the solution of

equations," Israel J. Math., vol. 3, 94-98.

BEN-ISRAEL, A. (1966), "A Newton-Raphson method for the solution of systems of equations," J. Math. Anal. App!., vol. 15, 243-252.

BERS, L. (1953), "On mildly nonlinear partial difference equations of elliptic type," J. Res. Nat. Bur. Standards, vol. 51, 229-236.

BRowDER, F. E. (1967), "Approximation-solvability of nonlinear functional equations in normed linear spaces," Arch. Rat. Mech. Anal., vol. 26, 33-42. BROYDEN, C. G. (1965), "A class of methods for solving nonlinear simultaneous equations," Math. Comp., vol. 19, 577-593. BROYDEN, C. G. (1967), "Quasi,,Newton methods and their application to function

minimization," Math. Comp., vol. 21, 368-381. BUDAK, B. M., E. M. BERKOVICH, E. N. SOLOVEVA, (1968-69), "Difference approx-

imations in optimal control problems," Vest. Mosc. Univ. Ser. Mat. Mech. (1968), 41-55 (Russian), and translated in SIAM J. Control, vol. 7 (1969), 18-31.,

BUTLER, T., A. MARTIN (1962), "On a method of Courant for minimizing functionals," J. Math. Phys., vol. 41, 291-299.

CASPAR, J. R. (1968), "Applications of alternating direction methods to mildly nonlinear problems," Comput. Sci. Report # 68-83, Univ. of Md., College Park.

REFERENCES

215

CAUCHY, A. (1847), "M6thodes g6n6rales pour la resolution des syst6mes d'6quations

simultan6es," C. R. Acad. Sci. Par., vol. 25, 536-538 (French). CEA, J. (1969), Lectures on minimization problems, l'Ecole d'Et6 Analyse Num6rique, France (French). CIARLET, P. G. (1966), "Variational methods for nonlinear boundary value problems," dissertation, Case Inst. Tech., Cleveland. CIARLET, P. G., M. H. Schultz, R. S. Varga (1967), "Numerical methods of high order accuracy for nonlinear boundary value problems I: One dimensional problems," Numer. Math., vol. 9, 394-430. CIARLET, P. G., M. H. SCHULTZ, R. S. VARGA (1968a), "Numerical methods of

high order accuracy for nonlinear boundary value problems II: Nonlinear boundary conditions," Numer. Math., vol. 11, 331-345. CIARLET, P. G., M. H. SCHULTZ, R. S. VARGA (1968b), "Numerical methods of

high order accuracy for nonlinear boundary value problems IV: Periodic boundary conditions," Numer. Math., vol. 12, 266-279.

COLLATZ, L. (1966), Functional analysis and numerical mathematics, Academic, N. Y. COURANT, R. (1943), "Variational methods for the solution of problems of equilibrium and vibrations," Bull. AMS, vol. 49, 1-23. COURANT, R., D. HILBERT (1953),. Methods of mathematical physics, vol. 1, Interscience, N. Y.

CROss, K. E. (1968), "A gradient projection method for constrained optimization," Union Carbide Nuclear Div. Report # K-1746. CULLUM, J.

(1969), "Discrete approximations to continuous optimal control

problems," SIAM J. Control, vol. 7, 32-50. CULLUM, J. (1970), "An explicit procedure for discretizing continuous optimal control problems," to appear in J. Opt. Th. Appl. CURRY, H. B. (1944), "The method of steepest descent for nonlinear minimization

problems," Quart. Appl. Math., vol. 2, 258-263. DANIEL, J. W. (1965), "The conjugate gradient method for linear and nonlinear operator equations," dissertation, Stanford Univ., Stanford, Calif. DANIEL, J. W. (1967a), "Convergence of the conjugate gradient method with computationally convenient modifications," Numer. Math., vol. 10, 125-131. DANIEL, J. W. (1967b), "The conjugate gradient method for linear and nonlinear operator equations," SIAM J. Numer. Anal., vol. 4, 10-26. DANIEL, J. W. (1968a), "Collectively compact sets of gradient mappings," Indag. Math., vol. 30, 270-279. DANIEL, J. W. (1968b), "On the approximate minimization of functionals," Comput. Sci. Report # 42, Univ. of Wisc., Madison, also in Math. Comp., vol. 23, 573-582. DANIEL, J.W. (1969), "Convergence of the conjugate gradient method: a correction,"--

to appear in SIAM J. Nun er. Anal.

216

REFERENCES

DANIEL, J. W. (1970), "Convergence of a discretization for constrained spline func-

tion problems," Comput. Sci. Report # 76, Univ. of Wisc., Madison. DAVIDON, W. C. (1959), "Variable metric method for minimization," A. E. C. Res. and Dev. Report # ANL-5990. DAvrpoN, W. C. (1968), "Variance algorithm for minimization," Comput. J., vol. 10,406-410. DEMYANOV, V. F., A. M. Rubinov (1967), "The minimization of a smooth convex

functional on a convex set," J. SIAM Control, vol. 5, 280-294. DENNIS, J. E., Jr. (1969), "On the convergence of Newton-like methods," presented at Conf. on Numer. Meth. Nonlinear Algebraic Equations, Univ. of Essex. Di GUGLIELMO, F. (1969), "Construction d'approximations des espaces de Sobolev Hm((R") sur des r6saux en simplexes," to appear (French). DOUGLAS, J. E., Jr., R. B. KELLOGG, R. S. VARGA (1963), "Alternating direction methods for n space variables," Math. Comp., vol. 17, 279-282.

DUNFoRD, N., J. Schwartz (1962), Linear operators I: General theory, Interscience, N. Y. ELKIN, R. M. (1968), "Convergence theorems for Gauss-Seidel and other minimiza-

tion algorithms," Comput. Sci. Report # 68-59, Univ. of Md., College Park. ENGELI, M., TH. GINSBURG, H. RUTISHAUSER, E. STIEFEL (1959), "Refined iterative

methods for computation of the solution and the eigenvalues of self-adjoint boundary value problems," Mitt. Inst. Angew. Math., Zurich, # 8. FADDEEV, D. K., V. N. FADDEEVA (1963), Computational methods of linear algebra,

Freeman, San Francisco. FIAcco, A., G. MCCORMICK (1968), Nonlinear programming: sequential unconstrained minimization techniques, Wiley, N. Y. Fix, G., G. STRANG (1969), "Fourier analysis of the finite element method in RitzGalerkin theory," to appear.

FLETCHER, R. (1965), "Function minimization without evaluating derivatives, a review," Comput. J., vol. 8, 33-41. FLETCHER, R. (1968), "Generalized inverse methods for the best least squares solution of systems of nonlinear equations," Comput. J., vol. 10, 392-399. FLETCHER, R., M. Powell (1963), "A rapidly convergent descent method for minimi-

zation," Comput. J., vol. 6, 163-168. FLErcHER, R., C. Reeves (1964), "Function minimization by conjugate gradients,"

Comput. J., vol. 7, 149-154. FoRsYTHE, G. E. (1968), "On the asymptotic directions of the &-dimensional optimum gradient method," Numer. Math., vol. 11, 57-76. FRANK, M., P. WOLFS (1956), "An algorithm for quadratic programming," Nav. Res. Log. Quar., vol. 3, 95-110. GILBERT, E. G. (1966), "An iterative procedure for computing the minimum of a quadratic form on a convex set," SIAM J. Control, vol. 4, 61-81.

REFERENCES

217

GOLDFARB, D. (1966), "A conjugate gradient method for nonlinear programming,"

dissertation, Princeton Univ., Princeton, N. J. GOLDFARB, D. (1969a), "Extension of Davidon's variable metric method to maximization under linear inequality and equality constraints," SIAMJ. Appl. Math., vol. 17, 739-764.

GOLDFARB, D. (1969b), "Sufficient conditions for the convergence of a variable metric algorithm," 273-282, in Optimization, ed. R. Fletcher, Academic, N. Y.

GOLDFARB, D., L. LAPIDUS (1968), "Conjugate gradient method for nonlinear programming problems with linear constraints," 1. and E. C. Fundamentals, vol. 7, 142-151. GOLDSTEIN, A. A. (1964a), "Convex programming in Hilbert space," Bull. AMS, vol. 70, 709-710. GOLDSTEIN, A. A. (1964b), "Minimizing functionals on Hilbert space," 159-166, in Balakrishnan and Neustadt (1964). GOLDSTEIN, A. A. (1965), "On steepest descent," J. SIAM Control, vol. 3, 147-151.

GOLDSTEIN, A. A. (1966), "Minimizing functionals on normed linear spaces," J. SIAM Control, vol. 4, 81-89. GOLDSTEIN, A. A. (1967), Constructive real analysis, Harper & Row, N. Y.

GOLDSTEIN, A. A., J. F. PRICE (1967), "An effective algorithm for minimization," Numer. Math., vol. 10, 184-189.

GOLUB, G. H., M. SAUNDERS (1969), "Linear least squares and quadratic programming," Comput. Sci. Report CS 134, Stanford Univ., Stanford, Calif. GREENSPAN, D. (1965), "On approximating extremals of functionals I," ICC Bull., vol. 4, 99-120.

GREENSPAN, D. (1967), "On approximating extremals of functionals II," Int. J. Eng. Sci., vol. 5, 571-588. GREENSPAN, D., S. PARTER (1965), "Mildly nonlinear elliptic partial differential equations and their numerical solution II," Numer. Math., vol. 7, 129-145. HADLEY, G. (1964), Nonlinear and dynamic programming, Addison-Wesley, Reading, Mass.

HAYES, R. M. (1954), "Iterative methods of solving linear problems in Hilbert space," 71-104, in Contributions to the solution of systems of linear equations and the determination of eigenvalues, ed. O. Taussky, vol. 39, Nat. Bur. Standards Appl. Math. Ser. HENRICI, P. (1962), Discrete variable methods in ordinary differential equations, Wiley, N. Y. HERBOLD, R. J. (1968), "Consistent quadrature schemes for the numerical solution of boundary value problems by variational techniques," dissertation, Case Western Reserve, Cleveland. HERBOLD, R., M. ScHULTZ, R. VARGA (1969), "Quadrature schemes for the numer-

ical solution of boundary value problems by variational techniques," Comput.

218

REFERENCES

Sci. Report, Carnegie-Mellon Univ., Pittsburgh, Pa. Also in Aeq. Math., vol. 3, 96-119.

HESr r as, M. R. (1956), "The conjugate gradient method for solving linear systems," Proc. Symp. Appl. Math., vol. 6 (Num. Anal.), 83-102. HESIENES, M. R. (1966), Calculus of variations and optimal control theory, Wiley, N. Y. HESrENES, M. R., E. STIEFEL (1952), "Method of conjugate gradients for solving

linear systems," J. Res. Nat. Bur. Standards, vol. 49, 409136. HOOKE, R., T. JEEVES (1961), "Direct search solution of numerical and statistical

problems," JACM, vol. 8, 212-229. HORWITZ, L. B., P. E. SARACHIK (1968), "Davidon's method in Hilbert space," SIAM J. Appl. Math., vol. 16, 676-696. HUARD, P. (1967), "Resolution of mathematical programming with nonlinear constraints by the method of centres," 209-219, in Abadie (1967). KANTOROVICH, L. V. (1948), "Functional analysis and applied mathematics," Uspekhi Mat. Nauk, vol. 3, 89-185 (Russian). Also translated as Nat. Bur. of Standards Report # 1509 (1952). KANTOROVICH, L. V., G. P. AKILOV (1964), Functional analysis in normed linear

spaces, Macmillan, N. Y. KANTOROVICH, L. V., V. I. KRYLOV (1958), Approximate methods of higher analysis,

Noordhoff, Groningen, Netherlands. KAUPE, A. F., Jr. (1963), Algorithm 178, "Direct search," CACM, vol, 6, 313, and CA CM, vol. 9, 684.

KELLOGG, R. (1969), "A nonlinear alternating direction method," Math. Comp., vol. 23, 23-27. KIEFER, J. (1957), "Optimum sequential search and approximation methods under minimum regularity assumptions," SIAM J. App!. Math., vol. 5, 105-136. KOWALIK, J., M. OSBORNE (1968), Methods for unconstrained optimization problems,

American Elsevier, N. Y. KRABS, W. (1963), "Einige Methoden zur Losung des diskreteh linearen Tschebyscheff-Problems," dissertation, Univ. of Hamburg (German).

KRASOvsxu, N. N. (1957), "On a problem in optimal control," Prik. Mat. Mex., vol. 21, 670-677 (Russian). KREUSER, J. (1969), private communication.

LANOENHOP, C. E. (1967), "On generalized inverses of matrices," SIAM J. Appl. Math.,, vol. 15, 1239-1246. LEVENBERO, K. (1944), "A method for the solution of certain nonlinear problems in least squares," Quart. App!. Math., vol. 2, 164-168.

LEVITIN, E. S., B. T. POLJAK (1966a), "Constrained minimization methods," Zh. vych. Mat, mat. Fiz., vol. 6, 787-823 (Russian). Also translated in USSR Comput. Math. Math. Phys., vol. 6 (1968), 1-50.

REFERENCES

219

LEVITIN, E. S., B. T. PoLIAK (1966b), "Convergence of minimizing sequences in

conditional extremum problems," Soviet Math. Dokl., vol. 7, 764-767. LIEBERSTEIN, H. (1959), "Over-relaxation for nonlinear elliptic partial differential

equations," Math. Res. Cntr. Report # 80, Madison, Wisc. LORENTZ, G. G. (1966), Approximation of junctions, Holt, N. Y.

MANGASARIAN, O. L. (1969), Nonlinear programming, McGraw-Hill, N. Y. MARQUARDT, D. W. (1963), "An algorithm for least squares estimation of nonlinear

parameters," SIAM J. Appl. Math., vol. 11, 431-441. MCCORMICK, G. P. (1969), private communication. MIKHLIN, S. G., K. L. SMOLITSKIY (1967), Approximate methods for solution of differ-

ential and integral equations, American Elsevier, N. Y.

MooRE, R. H. (1966), "Differentiability and convergence for compact nonlinear operators," J. Math. Anal. Appl., vol. 16, 65-72. MORREY, C. B., JR. (1966), Multiple integrals in the calculus of variations, Springer-

Verlag, N. Y. MORRISON, D. D. (1960), "Methods for nonlinear least squares problems and convergence proofs," JPL Seminar Proc., Space Tech. Lab., Los-Angeles. MYERs, G. E. (1968), "Properties of the conjugate gradient and Davidon methods,"

J. Opt. 7h. Appl., vol. 2, 209-219. NASHED, M. Z. (1967), "Supportably and weakly convex functionals with applica-

tions to approximation theory and nonlinear programming," J. Math. Anal. Appl. vol. 18, 504-521. NOBLE, B. (1964), "Complementary variational principles for boundary value prob-

lems I: Basic principles," Math. Res. Cntr. Report # 473, Madison, Wisc. NOBLE, B. (1966), "Complementary variational principles for boundary value problems II: Nonlinear networks," Math. Res. Cntr. Report # 643, Madison, Wisc.

ODLOLESKAL, L. (1969), "Questions of accuracy and effectiveness of computing

methods," Trudy Symp., Kiev, vol. 4, 94-103 (Russian). ORTEGA, J., W. C. RHEINBOLDT (1967a), "Monotone iterations for nonlinear equa-

tions with applications to Gauss-Seidel methods," SIAM J. Numer. Anal., vol. 4, 171-190.

ORTEGA, J., W. C. RHEINBOLDT (1967b), "On a class of approximate iterative processes," Arch. Rat. Mech. Anal., vol. 3, 352-365. ORTEGA, J. M., W. C. RHEINBOLDT (1968), "Local and global convergence of

generalized linear iterations," Comput. Sci. Report # 68-82, Univ. of Md., College Park.

ORTEGA, J., M. ROCKOFF (1966), "Nonlinear difference equations and GaussSeidel type iterative methods," SIAMJ. Numer. Anal., vol. 3, 497-513. OsTRowsKu, A. M. (1966a), "Contributions to the theory of the method of steepest descent I," Math. Res. Cntr. Report # 615, Madison, Wisc.

220

REFERENCES

OSTROwsxl, A. M. (1966b), Solution of equations and systems of equations, Academic,

N. Y. PENROSE, R. (1955), "A generalized inverse for matrices," Proc. Camb. Phil. Soc.,

vol. 51, 406413. PEREYRA, V. (1967), "Iterative methods for solving nonlinear least squares problem,"

SIAM J. Numer. Anal., vol. 4, 27-36.

PERR'N, F. M., H. S. PRICE, R. S. VARGA (1969), "On higher order numerical methods for nonlinear two-point boundary value problems," Numer. Math., vol. 13, 180-198. PETRYSITYN, W. V. (1968), "On the approximation-solvability of nonlinear equations," Math. Ann., vol. 177, 156-164.

POL3AK, B. T. (1966), "Existence theorems and convergence of minimizing sequences in extremum problems with restrictions," Soviet Math. Dokl., vol. 7, 72-75. POLJAK, B. T. (1969a), "Conjugate gradient method in extremal problems," Works

of the All-Union School on Mathematical Programming (Russian), also in J. Comput. Math. and Math. Phys., vol. 9, 809-821 (Russian). POUAK, B. T. (1969b), "Iterative methods of solving some incorrect variational problems," Computing methods and programming, vol. 12, Moscow State Univ., 38-52 (Russian). POLJAK, B. T. (1969c), "On one method of solving large size linear and quadratic programming problems," Computing methods and programming, vol. 12, Moscow State Univ., 10-17 (Russian). PouAK, B. T., T. P. Ivanov, G. V. PUKOV (1967), "Numerical methods of solving some extremal problems with partial derivatives," Computing methods and programming, vol. 9, Moscow State Univ., 194-203 (Russian). POLJAK, B. T., V. S. ORLov, V. A. REBRY, N. V. TRETJAKov (1967), "Experimental solution of the problem of optimal control," Computing methods and programming,

vol. 9, Moscow State Univ., 179-193 (Russian). PoLrAK, B. T., V. A. SxoKov (1967a), "Choice of parameters of kinetic equations with respect to experimental data," Computing methods and programming, vol. 9, Moscow State Univ., 167-178 (Russian). POLrAK, B. T., V. A. SKOKOV (1967b), "Standard programs for minimizing func-

tions of several variables (for the machine M-20)," in "Standard programs for solving mathematical programming problems," vol. 4, Moscow State Univ. (Russian). PONSTEIN, J. (1967), "Seven types of convexity," SIAM Rev., vol. 9, 115-119. PONTRYAUIN, L. S., V. G. BOLTYANSKII, R. V. GAMKRILIDZE, E. F. MISCENxo (1962),

The mathematical theory of optimal processes, Wiley, N. Y. POWELL, M. J. (1964a), "A method for minimizing a sum of squares of nonlinear functions without calculating derivatives," Comput. J., vol. 7, 303-307.

REFERENCES

221

POWELL. M. J. (1964b), "An efficient method for finding the minimum of a function of several variables without calculating derivatives," Comput. J., vol. 7, 155-162.

POWELL, M. J. (1968), "A FORTRAN subroutine for solving systems of nonlinear

algebraic equations," Unit. King. Atom. Energy Auth. Res. Grp. Report # R5947, Harwell, Berkshire. POWELL, M. J. (1969), "A hybrid method for nonlinear equations," to appear.

POWELL, M. J. (1970), "On the convergende of the variable metric algorithm," to appear. PROTTER, M. H., H. F. WEINBERGER (1967), Maximum principles in differential equations, Prentice-Hall, Englewood Cliffs, N. J. RABINOWITZ, P. (1968), "Applications of linear programming to numerical analysis," SIAM Rev., vol. 10, 121-160.

RALL, L. (1966), "On complementary variational principles," J. Math. Anal. Appl., vol. 14, 174-184. RALL, L. B. (1969,) Computational solution of nonlinear operator equations, Wiley, N. Y. RHEINBOLDT, W. C. (1967), "On a unified convergence theory for a class of iterative

processes," Comput. Sci. Report #TR-67-46, Univ. of Md., College Park. RITTER, K. (1969), private communication.

RIVLIN, T., W. CHENEY (1966), "A comparison of uniform approximation on an

interval and a finite subset thereof," SIAM J. Numer. Anal., vol. 3, 311-320. ROSEN, J. B. (1960-61), "The gradient projection method for nonlinear programming I : Linear constraints," J. SIAM, vol. 8,181-217; "The gradient projection method for nonlinear programming II: Nonlinear constraints," J. SIAM, vol. 9, 514-532.

RosEN, J. B. (1966), "Iterative solution of nonlinear optimal control problems," J. SIAM Control. vol. 4, 223-244. ROSEN, J. B. (1968), "Approximate solution and error bounds for quasi-linear elliptic

boundary value problem," Comput. Sci. Report # 30, Univ. of Wisc., Madison. ROSEN, J., R. MEYER (1967), "Solution of nonlinear two point boundary value problems by linear programming," Comput. Sci. Report #1, Univ. of Wise., Madison. ROSENBROCK, G. H. (1960), "An automatic method for finding the greatest or the least value of a function," Comput. J., vol. 3,175-184.

RoxIN, E. (1962), "The existence of optimal controls." Mich. Math. J., vol. 9, 109-119.

SCHECHTER, S. (1962), "Iteration methods for nonlinear problems," Trans. AMS, vol. 104, 179-189. SCHECHTER, S. (1968), "Relaxation methods for convex problems," SIAMJ. Numer. Anal., vol. 5, 601-612.

SHAMPINE, L. (1966), "Monotone iterations and two-sided convergence," SIAMJ. Numer. Anal., vol. 3, 607-615.

222

REFERENCES

SHAMPINE, L. (1968), "Error bounds and variational methods for nonlinear boundary value problem," Numer. Math., vol. 12, 410-415.

SIMPSON, R. B. (1968), "Approximation of the minimizing element for a class of functionals," SIAM J. Numer. Anal., vol. 5; 26-41. SIMPSON, R. B. (1969) "The Rayleigh-Ritz process for the simplest problem in the calculus of variation," SIAM J. Numer. Anal., vol. 5, 258-271. SMITH, C. S. (1962), "The automatic computation of maximum likelihood estimates,"

N. C. B. Sci. Dept. Report #SC846/MR/40. SOUTHWELL, R. V. (1946), Relaxation methods in theoretical physics, vol. I, Claren-

don, Oxford. SPANG, H. A., III (1962) "A review of minimization techniques for nonlinear functions." SIAM Rev., vol. 4, 343-365. STEWART, G. W., I11(1967), "A modification of Davidon's minimization method to

accept difference approximations of derivatives," JACM, vol. 14, 72-83.

STIEFEL, E. (1954), "Recent developments in relaxation techniques," Proc. Int. Cong. App!. Math., vol. 1, 384-391.

STIEFEL, E. (1955), "Relaxationsmethoden bester Strategic zur Losung linearer Gleichungs-systeme," Comment. Math Helv., vol. 29, 157-179. SWANN, W. H. (1964), "Report on the development of a new direct search method of optimization," I. C. I., Ltd., Cent Instr. Lab. Res. Note #64/3, Wilmslow, England. TAYLOR, A. E. (1961), Introduction to functional analysis, Wiley, N. Y.

TIKHONOV, A. N. (1965), "Methods for the regularization of optimal control problems," Soviet Math. Dok1., vol. 6, 761-763. ToPKis, D. M., A. F. VEINOTF, Jr. (1967), "On the convergence of some feasible direc-

tion algorithms for nonlinear programming," SIAM J. Control, vol. 5, 268-279. TRAUB, J. F. (1964), Iterative methods for the solution of equations, Prentice-Hall, Englewood Cliffs, N. J. TREMOLIERES, R. (1969), private communication.

VAINBERG, M. M. (1964), Variational methods for the study of nonlinear operators, Holden-Day, San Francisco.

VARGA; R. S. (1962), Matrix iterative analysis, Prentice-Hall, Englewood Cliffs, N. J. VERCOUSTRE, A.-M. (1969), "Critisre g6n6ral de convergence pour les m6thodes de

quasi-Newton," to appear.

.

WARGA, J. (1962), "Necessary conditions for a minimum in a relaxed variational problem," J. Math. Anal. App!., vol. 4, 129-145.

WASSCHER, E. J. (1963), Algorithms 203-204, "Steep 1" and "Steep 2," CACM, vol. 6, 517-519, CA CM, vol. 7, 585, and CACM, vol. 8,171.

REFERENCES

223

WELLS, M. (1965), Algorithm 251, "Flepomin," CACM, vol. 8, 169 and CA CM, vol. 9, 686.

WHITLEY, V. W. (1932), Algorithm 129, "Minifun," CACM, vol. 5.550, and CACM, vol. 6, 521.

YAKOVLEV, M. N. (1965), "On some methods of solving nonlinear equations," Trudy Mat. Inst. Steklov, vol. 84, 8-40 (Russian). Translated as Comput. Sci. Report #68-75, Univ. of Md., College Park. ZANGWILL, W. I. (1967), "Minimizing a function without calculating derivatives," Comput. J., vol. 10, 293-296. ZANGWILL, W. I. (1969), Nonlinear programming: a unified approach, Prentice-Hall,

Englewood Cliffs, N. J. ZELEZNIK, F. J. (1968), "Quasi-Newton methods for nonlinear equations," JACM, Vol. 15, 265-271. ZOUTENDIJK, G. (1960), Methods of feasible directions, Elsevier, Amsterdam.

INDEX

A Adjoint operator, 5 Adjoint space (see Dual space) Admissible direction sequence, 71

Alternating direction implicit (ADI) method, 185, 188, 189 Approximate minimizing sequence, 18 convergence of, 18-20

B Banach space, 3 Basic relaxation method, 203

Boundary value problems, 3, 52-55, 59-65

Compact set, 4 countably, 4 sequentially, 4 Compatible quadrature scheme, 64 Complementary variational principles,

21-27 Conditional-gradient method (see Direction algorithms for constrained problems, conditionalgradient) Conjugate directions, 116 without derivatives, 204-210 Conjugate-gradient method, 115-141 computational modifications, 136141 convergence rates, 117, 121, 124,

Bounded linear operator, 5

127-136, 138, 139, 151-153, 175

C

in finite dimensions, 151-154 general functionals, 125-141 least-squares problems, 191 optimality of, 119-122 quadratic functionals, 117-119

Calculus of variations, 2 discretization of, 55-59

Cauchy's method (see Steepest descent)

Centers, method of (see Method of centers)

CG method (see Conjugate-gradient method) Chebysbev solution of differential equations, 52-55

Consistency (see Discretization, consistent) Convergence of 142-150 Convex function, 13 quasi-, 14

0

x,, to zero, 72,

226

INDEX

Convex function (cont.) : quasi- (cont.): strictly, 16 strongly, 15, 89 uniformly, 18 strictly, 13 uniformly, 20 Convex mapping, 187 Convex set, 4 uniformly, 19 Correct constraint, 111 Critical point, 71 Criticizing sequence, 71-76 Curry's method (see Generalized Curry method)

D Davidon's first method, 171-175, 177 without high derivatives, 194-197 Davidon's second method, 163-164 Derivative, 9 Diagonal mapping, 187

Direction algorithms for constrained problems, 105-110 conditional-gradient, 106 gradient-projection, 106 quadratic approximations (Newton's method), 108-109 variable-metric projected-gradient, 106-108,178-179

Discretization: boundary value problems for partial differential equations, 52-55 calculus of variations, 55-59 consistent, 29 definition of, 28 integral equations, 34-36 optimal control problems, 41-51 Rayleigh-Ritz method, 65-69 regularization, 38-39 stable, 31 two-point boundary-value problems,

59-65 uniform growth condition for, 31 Dual space, 4

E Eigenvalues, 12 Error bounds, 20-27, 33, 41, 61, 62

Exact methods (see Variable-metric methods, exact)

Existence of solutions to variational problems, 7-9

F Feasible direction sequence, 95 Finite-element method, 69 Forcing function, 74 Fourier transform, 69 Frechet derivative (see Derivative)

G

Gauss-Newton method, 191 without high derivatives, 200-201 Gauss-Seidel method, 185, 187 Generalized Curry method, 79 Gradient, 10, 12 Gradient-Newton compromises, 192, 201-203

Gradient-projection method (see Direction algorithms for constrained problems, projection) Growth property, 9, 11

gradient-

H Hilbert space, 3 Hooke-Jeeves method, 203

I Integral equations, 34-36 Isotone mapping, 187 J

Jacobi's method, 185, 187, 188

INDEX

L Least-squares problems, 190-193 Level set L(xo), 71 Levenberg-Marquardt method, 192 Limit set of {x ), 73, 145-150 Linear functional, 4 Lower semicontinuity, 5 sequential, 5 weak, 5, 10, 13, 14 L(xo)[see Level set L(xo)]

227

Piecewise polynomial functions (see Spline functions) Positive definite (semidefinite) operator, 6 uniformly, 20 Q

Quasi-convex (see Convex function, quasi-) Quasi-Newton methods (see Variablemetric methods)

M

R

Mathematical programming, 2, 42, 55 Method of centers, 113 Minimal sequence, 67 strongly, 68 Minimizing sequences, 17-27 (see also Approximate minimizing sequence) M-matrix, 187

Range function g(x, t, p), 84, 102 Rayleigh-Ritz method, 65-69 stability of, 68 two-point boundary-value problems, 59-65

N Necessary optimality condition, 12 Newton-like methods, 181-185 Newton-m-step- K_ methods, 186, 188, 189

Reflexive space, 4 Regularization, 37-41 Relaxation factor, 79, 99

Relaxation method (see Basic relaxation method) Restart Davidon method, 176 Reverse modulus of continuity, 74 Ritz method (see Rayleigh-Ritz method) Rosenbrocks' method, 203-204

Newton's method, 161-162, 181-185, S

190

without high derivatives, 197-200 --m-step-Newton methods, 186, 189 Nonlinear -methods, 186, 187, 188

0 Optimal control, 1-2 discretization of, 41-51 .

P Penalty function, 110 Penalty-function method, 38, 42, 110112

Search routines along line, 92, 104, 154-158 Sobolev space, 7

Southwell's method (see Basic relaxation method) Spectrum, 5 Spline functions, 61-62 s(t) (see Reverse modulus of continuity) Stability (see Djscretization, stable) Steepest descent:'. direction, 70 in finite dimensions, 150-151 step-sizes for, 93-94

228

INDEX

Step-size choices, 76-105 constrained problems, 95-105 first local minimum along line, 99 global minimum along line, 99

range function along line, 102104

U

Unimodal function (see Strictly unimodal function) Uniqueness of solutions of variational problems, 15

computational version of, 103-

V

104

search methods along line, 104105,154-158simple interval along line, 100101

unconstrained problems, 76-95

first local minimum along line, 77-80

171

directions, 159-162 exact, 165-168, 208

quasi-Newton methods, 161, 181185

global minimum along line, 76-

rank-one versions, 177-178

rank-two versions, 164, 166, 167-

77

range function along line, 84-89

computational version of, 8687

search methods along line, 89-92, 154-158 simple interval along line, 80-84 Strictly unimodal function, 89 Strongly minimal sequence (see Minimal sequence, strongly) Successive overrelaxation' (SOR ) method, 185, 187, 188, 189 Swann's method, 204

T T-property, 8

Variable-metric methods, 159-179 conjugate directions, 160, 167-168,

175

Variable-metric, projected-gradient method (see Direction algo-

rithms for constrained problems, variable-metric projectedgradient) Variational problems, 1

W Weak convergence, 4 Weak topology, 4 W(xo), 71

Z Zangwill's method, 208-210

Recursive functionals

Read more

Approximate Calculation of Integrals

Read more

Recursive Functionals

Read more

Recursive functionals

Read more

Unbounded functionals in the calculus of variations

Read more

Unbounded functionals in the calculus of variations

Read more

Recursion on the Countable Functionals

Read more

Exponential functionals

Read more

Approximate solutions of operator equations

Read more

Approximate Approximations (MSM141)

Read more

Approximate methods of higher analysis

Read more

Handbook Of Pollution Control And Waste Minimization

Read more

Approximate and Renormgroup Symmetries

Read more

Approximate Commutative Algebra

Read more

Approximate commutative algebra

Read more

Approximate and renormgroup symmetries

Read more

Approximate Kalman filtering

Read more

Case-Based Approximate Reasoning

Read more

Approximate Iterative Algorithms

Read more

Extras Approximate Chemical Shifts

Read more

Density Functionals: Theory and Applications

Read more

Handbook of Pollution Control & Waste Minimization

Read more

Handbook of Pollution Control and Waste Minimization

Read more

Lectures on white noise functionals

Read more

On radially symmetric minima of nonconvex functionals

Read more

Theory of Functionals and of Integral and Integro-Differential Equations

Read more

Stable Approximate Evaluation of Unbonded Operators

Read more

Nonlinear Programming: Sequential Unconstrained Minimization Techniques

Read more

Complex convexity and analytic functionals

Read more

Approximate Dynamic Programming: Solving the Curses of Dimensionality

Read more

Recommend Documents

Recursive functionals

Approximate Calculation of Integrals

Recursive Functionals

RECURSIVE FUNCTIONALS STUDIES IN LOGIC AND THE FOUNDATIONS OF MATHEMATICS VOLUME 131 Honorary Editor: P. SUPPES, S...

Recursive functionals

Unbounded functionals in the calculus of variations

Unbounded functionals in the calculus of variations

1C CHAPMAN & HALUCRC Monographs and Surveys in Pure and Applied Mathematics 125 UNBOUNDED FUNCTIONALS IN THE CALCULU...

Recursion on the Countable Functionals

Exponential functionals

which by formulas (5.4) and (5.6) means that the definition of A(D)u(z) does not depend on the form of representing u(z)...

Approximate solutions of operator equations

APPROXIMATE SOLUTIONS OF OPERATOR EQUATIONS APPROXIMATIONS AND DECOMPOSITIONS Editor-in-Chief: CHARLES K. CHUI Vol. 1...

Approximate Approximations (MSM141)