MATHEMATICAL PROGRAMMING STUDIES
Editor-in-Chief
M.L. BALINSKI, International Institute for Applied Systems Analysis, Laxenburg, Austria, and City University of New York, N.Y., U.S.A.

Senior Editors
E.M.L. BEALE, Scientific Control Systems, Ltd., London, Great Britain
GEORGE B. DANTZIG, Stanford University, Stanford, Calif., U.S.A.
L. KANTOROVICH, National Academy of Sciences, Moscow, U.S.S.R.
TJALLING C. KOOPMANS, Yale University, New Haven, Conn., U.S.A.
A.W. TUCKER, Princeton University, Princeton, N.J., U.S.A.
PHILIP WOLFE, IBM Research, Yorktown Heights, N.Y., U.S.A.

Associate Editors
PETER BOD, Hungarian Academy of Sciences, Budapest, Hungary
VACLAV CHVATAL, Stanford University, Stanford, Calif., U.S.A.
RICHARD W. COTTLE, Stanford University, Stanford, Calif., U.S.A.
J.E. DENNIS, Jr., Cornell University, Ithaca, N.Y., U.S.A.
B. CURTIS EAVES, Stanford University, Stanford, Calif., U.S.A.
R. FLETCHER, The University, Dundee, Scotland
D.R. FULKERSON, Cornell University, Ithaca, N.Y., U.S.A.
ARTHUR M. GEOFFRION, University of California, Los Angeles, Calif., U.S.A.
TERJE HANSEN, Norwegian School of Economics and Business Administration, Bergen, Norway
ELI HELLERMAN, Bureau of the Census, Washington, D.C., U.S.A.
PIERRE HUARD, Electricite de France, Paris, France
ELLIS L. JOHNSON, IBM Research, Yorktown Heights, N.Y., U.S.A.
C.E. LEMKE, Rensselaer Polytechnic Institute, Troy, N.Y., U.S.A.
GARTH P. McCORMICK, George Washington University, Washington, D.C., U.S.A.
GEORGE L. NEMHAUSER, Cornell University, Ithaca, N.Y., U.S.A.
WERNER OETTLI, Universität Mannheim, Mannheim, West Germany
MANFRED W. PADBERG, New York University, New York, U.S.A.
L.S. SHAPLEY, The RAND Corporation, Santa Monica, Calif., U.S.A.
K. SPIELBERG, IBM Scientific Center, Philadelphia, Pa., U.S.A.
D.W. WALKUP, Washington University, Saint Louis, Mo., U.S.A.
C. WITZGALL, National Bureau of Standards, Washington, D.C., U.S.A.
MATHEMATICAL PROGRAMMING STUDY 3

Nondifferentiable Optimization

Edited by M.L. BALINSKI and Philip WOLFE
November 1975
NORTH-HOLLAND PUBLISHING COMPANY - AMSTERDAM
© The Mathematical Programming Society, 1975. All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, without the prior permission of the copyright owner.
This STUDY is also available to non-subscribers in a book edition.
Printed in The Netherlands
PREFACE

We will attempt to describe nondifferentiable optimization (NDO), give some reasons for doing it, and sketch the contributions of the eight papers constituting the major part of this volume. NDO is concerned with the minimization of a continuous real-valued function, of several real variables, which is not necessarily differentiable. The nondifferentiability is not due to properties of the domain of the function, which in most cases can be taken to be all of Euclidean n-space, but to the function itself; yet we will rule out truly wild functions. The examples below indicate the scope we consider, and they are difficult enough. There are at least three good reasons for concern with NDO: curiosity; the practical importance of some problems having seriously nondifferentiable objective functions; and the possibility of recasting some familiar problems as NDO problems so as to improve our understanding or our ability to handle them. On the first count, the assumption of differentiability was once important in the existence theory of optimization (necessary conditions and duality). Now, thanks in large part to Fenchel and Rockafellar, we realize that it is really irrelevant to most of that theory. On the other hand, most algorithms for optimization problems seem to make heavy use of derivatives, or at least of differentiability, and the study of convergence of the algorithms, and especially of their rate of convergence, seems to require more than that: it is customary to assume that the functions involved have positive lower and upper bounds for the eigenvalues of their matrices of second derivatives. Are these assumptions necessary for useful results? Is there an effective algorithm for minimizing an arbitrary convex function? Do the methods of steepest descent, the conjugate gradient methods, and the quasi-Newton methods have reasonable extensions to the kinds of problem described below? The most general class of NDO problem much studied is the Minimax Problem: let g(x, y) be defined on a suitable subset of X × Y (X and Y being, usually, subsets of Euclidean space); we seek the minimum of f(x) for x ∈ X, where f(x) = max {g(x, y): y ∈ Y}. No matter how smooth g is,
f is almost surely significantly nondifferentiable. In the simplest case Y is finite and we may write f(x) = max {g_i(x): i = 1, ..., m}, taking each g_i differentiable. Such functions will generally be differentiable almost everywhere, as convex functions are, but the minimum sought is most likely to fall at a nondifferentiable point: f(x) = |x| = max {x, −x} fails only for x = 0, but if one depends on differentiability at the minimum, he can say little about this problem. Danskin* originated the computational study of the minimax problem in the context of some military problems of "one-sided" games, and much subsequent work is in that spirit. It forms an important part of approximation theory, for example in posing the problem of finding the best "Chebycheff" approximation of a given function h(y) by linear combinations of functions f_i: for that problem g(x, y) = |h(y) − Σ_i x^i f_i(y)|. Further, since any convex function f has the form max {g(x, y): y ∈ Y} with g linear in x, problems of the minimization of general convex functions overlap largely with minimax problems. Each paper in this STUDY deals essentially with a minimax problem, although that fact is not always immediately obvious. Paper 3 offers an example of a problem whose statement as a minimax problem (a convex one, at that) comes as a pleasant surprise: the objective function is calculated by finding eigenvalues of a certain matrix function of the variables. Among problems, traditionally handled by other means, whose study as NDO problems may well be rewarding are the ordinary nonlinear programming problem and large-scale column-generation problems. It has often been observed that, under pretty general conditions, the problem min {f(x): g_i(x) ≤ 0} is equivalent to minimizing f(x) + K Σ_i max {0, g_i(x)}, or a similar nondifferentiable objective, for all sufficiently large K. A powerful method for general NDO would then have immediate application to this formulation, as well as to the more sophisticated formulation of Fletcher. The "column-generation" problem has the form of the minimax problem above, where g is usually linear in both variables and X, Y polyhedral. A much-studied example is that of the decomposable structure introduced by Dantzig and Wolfe, whose dual may be written as the minimization of the convex, piecewise linear f(x) = ⟨b, x⟩ + Σ_k max {⟨c_k − A_k x, y⟩: B_k y ≤ b_k}. Calculating f(x) requires solving the linear programming problems of which it is composed, a process which yields differential information about f. The Dantzig-Wolfe procedure used that information as a column in a "master"
* See the bibliography at the end of this volume for this and other citations.
linear program (whence the term "column generation"), which is equivalent to minimizing f using a particular kind of cutting-plane method. Now we know that cutting-plane methods for smooth problems are generally inferior to those now available; if the latter can be extended to nonsmooth problems, we may have a new attack on the very large scale linear problem. While it has not yet, as far as we know, been handled this way in practice, the theory of doing so has been well explored (see e.g. Geoffrion; Grinold; Hogan et al.). Meanwhile "subgradient optimization", about the simplest possible general method, has been proved effective for a variety of column-generation problems (see e.g. Held, Wolfe, Crowder), greatly raising our hopes that improved methods for NDO problems will impact the large-scale optimization area. The short survey above can, of course, only give a quick and narrow view of some of the reasons for the existence of a subject of NDO. The eight papers of this volume, to whose brief description we now turn, will serve to flesh out this view. Each is concerned with the practical solution of a certain class of NDO problem. We think that they form a convincing argument for the existence of such a subject as a well-defined and important, although difficult, entity within the subject of mathematical programming. The authors of these papers were encouraged to relate their computational experience. Each of the first seven describes some specific set of problems and the results obtained from running them: we hope they will serve as useful benchmarks for the continuing development of NDO algorithms. The newcomer to the study of functions without derivatives might wish to read the initial portions of Sections 2 and 3 of Paper 8, which review some basic concepts from the differential analysis of convex functions and discuss the notion of steepest descent in NDO. Paper 1 by D.P. Bertsekas deals with problems whose nondifferentiability is due to the presence of terms of the form max {g_1(x), g_2(x)} in the explicit definition of the objective function. Each such max is replaced by a single smooth function, depending on a parameter c, such that the smooth function tends to the max as c → ∞. For given c the resulting smooth problem is solved by any suitable method. Results of the smooth optimization are used to guide successive choices of larger values of c, and the nature of convergence to the solution of the original problem is studied. This general approach, while differing in spirit from the later papers which confront nondifferentiability directly, gives a benchmark against which other methods must be compared. Paper 2 by P.M. Camerini, L. Fratta and F. Maffioli reports another advance in improving the efficiency of the beautifully simple "subgradient
optimization" (or "generalized gradient descent", from the Russian) procedure pioneered by N.Z. Shor. An easy modification of the ordinary subgradient direction (reminiscent of the modification employed in the method of conjugate gradients) yields a -procedure which should never be inferior to the usual one, and is supported by their experiments. Paper 3 by J.K. Cullum, W.E. Oon'ath and P. Wolfe, treats the problem of minimizing the sum of a certain few highest eigenvalues of a certain class of symmetric matrices. This perhaps strange-sounding problem has application to a hard graph theory problem (and other, related, problems of eigenvalue optimization are important in engineering design). The algorithm is related to steepest descent; the key contributions are the means of effectively calculating a steepest descent direction for such a function and a practical way to modify that calculation so that "jamming" (convergence to a non-solution, which can happen when the method of steepest descent is used on a nondifferentiable function) does not occur. In Paper 4, M.L. Fisher, W.O. Northup and J.F. Shapiro explore an important area of mathematical programming which benefits from the ability of some methods for NOO to obtain rapid, approximate solutions of problems usually handled by other means. The main theme has application to almost any branch-and-bound procedure: when, at some point in the branch-and-bound tree, a bound is computed as the solution of a certain mathematical programming subproblem, a quickly obtained approximate solution is likely to be of more help in making further progress than a laboriously obtained exact one. The authors use several integer programming problems and constrained network scheduling problems as a test bed. They show how a "Lagrangian relaxation" dual linear programming problem is appropriately formed to yield the desired bounds, describe a "primal-dual ascent algorithm", a "simplicial approximation algorithm", and the subgradient optimization algorithm, all aimed at solving the dual, and report on their extensive experience with the first and third methods. While the first method can be run so as to give exact answers and the third in general cannot, the quickness of the latter turns out to make it advantageous. Paper 5 by C. Lemarechal is a direct attack on the problem of minimizing an arbitrary convex function under the severe, but realistic assumption that at a given point it is feasible to obtain only the value ofthe function and one subgradient. The algorithm presented has the remarkable property of being effective for that problem while at the same time constituting an extension of the powerful method of conjugate gradients, which arose as a minimiza-
minimization method for the smoothest of all nonlinear functions, the quadratic. A central idea of the procedure is that the construction of a direction conjugate to a given family of directions (as extended by Zoutendijk from quadratic to differentiable functions) can be generalized to functions for which only subgradients are available, and in such a way that in the extremely nondifferentiable case the direction is that of steepest descent. The direction thus determined provides a usable step for any convex problem, and appropriate restarting provisions avoid the "jamming" difficulty, resulting in a generally convergent algorithm. The generality and ease of implementation of this scheme, together with the known success of its predecessors for smooth optimization problems, promise it a special place among algorithms for NDO. Paper 6 by K. Madsen treats the special minimax problem of minimizing f(x) = max {|f_i(x)|: i = 1, ..., m} by a method mathematical programmers will recognize as related to the Method of Approximation Programming of Griffith and Stewart. (The technique used would seem to apply to the more general problem obtained by replacing |f_i(x)| by f_i(x).) At a trial point x the functions f_i are replaced by linear approximations (in the variable x + h), and the corresponding linear minimax problem is solved under the "box" restriction ‖h‖_∞ ≤ Λ for some Λ > 0. While the f_i are assumed smooth, their derivatives are not taken; rather, the Broyden updating scheme is used to approximate their Jacobian. A particularly interesting feature of the algorithm is the procedure for altering the box size Λ: the quality of the linear approximations used is simply assessed, and Λ is raised when they are good and lowered when they are not. The numerical experience seems quite encouraging. In Paper 7, R. Marsten reports on further experience with the "Boxstep" method (of Hogan, Marsten and Blankenship) on problems of the kind studied in Paper 4. The essence of that method for minimizing a function is replacing the original problem by the sequence of problems min {f(x): ‖x − x_k‖ ≤ Λ_k}, x_{k+1} being defined as the solution of problem k. One hopes, of course, that the "box" restriction will make the individual problems much easier to solve than the original, which has turned out to be the case for many column-generation problems in which a piecewise linear f is minimized by a cutting-plane algorithm. An optional feature is that of using the line through x_k and x_{k+1} as a search direction; for small Λ_k, it is the direction of steepest descent in the L_∞ norm. The author's extensive experimentation with various Lagrangian problems and choices of box size indicates some of the boundaries of success for this approach; for many
purposes the quick initial convergence of subgradient optimization makes it the preferred method. Paper 8, by P. Wolfe, deals with the same problem as does Paper 5, and presents a procedure which, although differently arrived at, matches that of Paper 5 in most points; so the sketch of that paper given above will suffice for this one. (The points of view of the two papers seem sufficiently different to warrant the inclusion of both in this volume although, had the opportunity been present, their merging into a single work would have been even better.) We draw attention only to the example given in Section 3, which shows that the method of steepest descent does not in general work in NDO, a proposition assumed by most workers in the area but not previously established. We hope that the Bibliography which concludes this volume will be of service to the student of NDO. Intending to include only serviceable material, we have restricted it to internationally distributed books and technical journals. We know it must be far from complete, and will welcome notice of appropriate material, published or not, that has not been cited. We are indebted to many individuals for their encouragement and assistance in selecting the papers included in this STUDY. We take this opportunity to thank them collectively; their names will appear in due course in the pages of MATHEMATICAL PROGRAMMING.

M.L. Balinski
Philip Wolfe
CONTENTS
Preface . . . v
Contents . . . xi
(1) Nondifferentiable optimization via approximation, D.P. Bertsekas . . . 1
(2) On improving relaxation methods by modified gradient techniques, P.M. Camerini, L. Fratta and F. Maffioli . . . 26
(3) The minimization of certain nondifferentiable sums of eigenvalues of symmetric matrices, J. Cullum, W.E. Donath and P. Wolfe . . . 35
(4) Using duality to solve discrete optimization problems: theory and computational experience, M.L. Fisher, W.D. Northup and J.F. Shapiro . . . 56
(5) An extension of Davidon methods to nondifferentiable problems, C. Lemarechal . . . 95
(6) Minimax solution of non-linear equations without calculating derivatives, K. Madsen . . . 110
(7) The use of the Boxstep method in discrete optimization, R.E. Marsten . . . 127
(8) A method of conjugate subgradients for minimizing nondifferentiable functions, P. Wolfe . . . 145
Bibliography . . . 174
Mathematical Programming Study 3 (1975) 1-25. North-Holland Publishing Company
NONDIFFERENTIABLE OPTIMIZATION VIA APPROXIMATION*

Dimitri P. BERTSEKAS
University of Illinois, Urbana, Ill., U.S.A.

Received 8 November 1974
Revised manuscript received 11 April 1975
This paper presents a systematic approach for minimization of a wide class of nondifferentiable functions. The technique is based on approximation of the nondifferentiable function by a smooth function and is related to penalty and multiplier methods for constrained minimization. Some convergence results are given and the method is illustrated by means of examples from nonlinear programming.
1. Introduction
Optimization problems with nondifferentiable cost functionals, particularly minimax problems, have received considerable attention recently since they arise naturally in a variety of contexts. Optimality conditions for such problems have been derived by several authors, while a number of computational methods have been proposed for their solution (the reader is referred to [1] for a fairly complete list of references up to 1973). Among the computational algorithms currently available are the subgradient methods of [10, 15, 19], the ε-subgradient method [1, 2] coupled with an interesting implementation of the direction finding step given in [12], the minimax methods of [6, 7, 9, 17] which were among the first proposed in the nondifferentiable area, and the recent interesting methods proposed in [5, 20]. While the advances in the area of computational algorithms have been significant, the methods mentioned above are by no means capable of handling all problems encountered in practice since they are often limited in their scope by assumptions such as convexity, cannot handle

* This work was supported in part by the Joint Services Electronics Program (U.S. Army, U.S. Navy and U.S. Air Force) under Contract DAAB-07-72-C-0259, and in part by the U.S. Air Force under Grant AFOSR-73-2570.
nonlinear constraints, or they are applicable only to a special class of problems such as minimax problems of particular form. Furthermore several of these methods are similar in their behavior to either the method of steepest descent or first order methods of feasible directions, and converge slowly when faced with relatively ill-conditioned problems. Thus there is considerable room for new methods and approaches for solution of nondifferentiable problems, and the purpose of this paper is to provide a class of methods which is simple to implement, is quite broad in its scope, and relies on an entirely different philosophy than those underlying methods already available. We consider minimization problems of the form

minimize g(x), subject to x ∈ Q ⊂ R^n,  (1)

where g is a real-valued function on R^n (n-dimensional Euclidean space). We consider the case where the objective function g is nondifferentiable exclusively due to the presence of several terms of the form

γ[f_i(x)] = max {0, f_i(x)},  i ∈ I,  (2)

where {f_i: i ∈ I} is an arbitrary collection of real-valued functions on R^n. By this we mean that if the terms γ(·) in the functional expression of g were replaced by some continuously differentiable functions γ̃(·), then the resulting function would also be everywhere continuously differentiable. For purposes of easy reference we shall call a term of the form (2) a simple kink. It should be emphasized that while we concentrate attention on simple kinks, the approach is quite general since we do not necessarily require that the functions f_i in (2) are differentiable, but rather we allow them to contain in their functional expressions other simple kinks. In this way some other kinds of nondifferentiable terms, such as for example terms of the form

max {f_1(x), ..., f_m(x)},  (3)

can be expressed in terms of simple kinks by writing

max {f_1, ..., f_m} = f_1 + γ[f_2 − f_1 + γ[⋯ γ[f_{m−1} − f_{m−2} + γ[f_m − f_{m−1}]] ⋯]].  (4)

Since there are no restrictions on the manner in which simple kinks enter in the functional expression of g, a little reflection should convince the
reader that the class of nondifferentiable problems that we are considering is indeed quite broad. The basic idea of our approach for numerical solution of problems of the form (1) is to approximate every simple kink in the functional expression of g by a smooth function and solve the resulting differentiable problem by conventional methods. In this way an approximate solution of problem (1) will be obtained which hopefully converges to an exact solution as the approximation of the simple kinks becomes more and more accurate. While, as will be explained shortly, other approximation methods are possible, we shall concentrate on the following two-parameter approximation γ̃[f(x), y, c] of a simple kink γ[f(x)]:

γ̃[f(x), y, c] =
  f(x) − (1 − y)²/2c         if (1 − y)/c ≤ f(x),
  y f(x) + ½c [f(x)]²        if −y/c ≤ f(x) ≤ (1 − y)/c,    (5)
  −y²/2c                     if f(x) ≤ −y/c,

where y and c are parameters with

0 ≤ y ≤ 1,  0 < c.  (6)

If the function f is differentiable, then the function γ̃[f(x), y, c] above is also differentiable with respect to x. Its gradient is given by

∇γ̃[f(x), y, c] =
  ∇f(x)                      if (1 − y)/c ≤ f(x),
  [y + c f(x)] ∇f(x)         if −y/c ≤ f(x) ≤ (1 − y)/c,    (7)
  0                          if f(x) ≤ −y/c.

The functional form of γ̃ is depicted in Fig. 1. It may be seen that

γ̃(t, y, c) ≤ γ(t) ≤ γ̃(t, y, c) + (1/2c) max {y², (1 − y)²} ≤ γ̃(t, y, c) + 1/2c  for all t ∈ R.  (8)
Thus the parameter c controls the accuracy of the approximation. The parameter y determines whether the approximation is more accurate for positive or negative values of the argument t. Thus for the extreme case y = 1 the approximation to γ(t) is exact for 0 ≤ t, while for y = 0 the approximation is exact for t ≤ 0. Let us now formally describe the approximation procedure for solving problem (1), where we assume that the nondifferentiability of g is exclusively due to the presence of terms

γ[f_i(x)] = max {0, f_i(x)},  i ∈ I,  (9)

and I is an arbitrary index set.
Fig. 1. The simple kink γ(t) and its two-parameter approximation γ̃(t, y, c).
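As a quick illustration of (5) and (7), here is a minimal Python sketch for a scalar argument t; the function names are ours, and the middle branch uses the closed form consistent with (14) below.

```python
def gamma_tilde(t, y, c):
    """Two-parameter smooth approximation (5) of the simple kink gamma(t) = max{0, t}."""
    if (1.0 - y) / c <= t:
        return t - (1.0 - y) ** 2 / (2.0 * c)
    if -y / c <= t:
        return y * t + 0.5 * c * t * t   # equals ((y + c*t)**2 - y**2) / (2c)
    return -(y ** 2) / (2.0 * c)

def gamma_tilde_prime(t, y, c):
    """Derivative of (5) with respect to t; composed with grad f(x) this gives (7)."""
    if (1.0 - y) / c <= t:
        return 1.0
    if -y / c <= t:
        return y + c * t
    return 0.0
```

One can check numerically that 0 ≤ γ(t) − γ̃(t, y, c) ≤ 1/2c for all t, as stated in (8).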
Given parameters c_k and y_k^i, i ∈ I, with c_k > 0, 0 ≤ y_k^i ≤ 1, replace each term γ[f_i(x)], i ∈ I, in the functional expression of g by γ̃[f_i(x), y_k^i, c_k] to obtain a function g̃_k and solve the problem

minimize g̃_k(x),
subject to x ∈ Q ⊂ R^n.  (10)

If x_k is a solution of the above problem, update c by setting c_{k+1} = β c_k, where β > 1, and update the multipliers y_k^i in some fashion to obtain y_{k+1}^i, with 0 ≤ y_{k+1}^i ≤ 1, i ∈ I. Solve the problem

minimize g̃_{k+1}(x),
subject to x ∈ Q ⊂ R^n  (11)

to obtain a solution x_{k+1} and repeat the procedure. It is important to note that the choice of the approximation (5) is by no means arbitrary and in fact it is closely related to penalty and multiplier methods for constrained minimization (see e.g. [3, 4, 8, 11, 16, 18]). By
introducing an auxiliary variable z, a simple kink may be written as

γ[f(x)] = min {z: f(x) ≤ z, 0 ≤ z}.  (12)

By using a quadratic penalty function, the minimization problem above may be approximated by

min_{0≤z} [z + ½c (max {0, f(x) − z})²],  c > 0.  (13)

Carrying out the minimization with respect to z, the expression above is equal to

γ̃[f(x), 0, c] =
  f(x) − 1/2c      if 1/c ≤ f(x),
  ½c [f(x)]²       if 0 ≤ f(x) ≤ 1/c,
  0                if f(x) ≤ 0.

If we use the generalized quadratic penalty function used in the method of multipliers [4, 18], the minimization problem in (12) may be approximated by the problem

min_{0≤z} [z + (1/2c)[(max {0, y + c[f(x) − z]})² − y²]].  (14)

Again by carrying out the minimization explicitly, the expression above is equal to γ̃[f(x), y, c] as given by (5). Notice that we limit the range of the multiplier y to the interval [0, 1] since one may prove that the Lagrange multiplier for problem (12) lies in that interval. The interpretation of the approximation procedure in terms of penalty and multiplier methods is very valuable for a number of reasons. First it provides guidelines for approximation of simple kinks by a wide variety of functions. Every penalty function suitable for a penalty or multiplier method yields an approximating function via the procedure described above. A wide class of such functions is given in [13, 14]. Many of these functions yield twice differentiable approximating functions, a property which may be desirable from the computational point of view. Second, our interpretation reveals that we may expect that the basic attributes of the behavior of penalty and multiplier methods will also be present in our approximation procedures. Thus we may expect ill-conditioning for large values of the parameter c. This fact necessitates sequential operation of the approximation method, i.e., repetitive solution of the approximate problem for ever increasing values of the parameter c. Third, the interpretation motivates us to consider an updating procedure for the parameters y which
is analogous to the one used in the method of multipliers. This updating procedure results in most cases in significant improvements in computational efficiency. The updating formula for the multipliers y_k^i, which closely parallels the one of the method of multipliers and which will be discussed in some detail in this paper, is given by
y_{k+1}^i =
  1                        if 1 ≤ y_k^i + c_k f_i(x_k),
  y_k^i + c_k f_i(x_k)     if 0 ≤ y_k^i + c_k f_i(x_k) ≤ 1,    (15)
  0                        if y_k^i + c_k f_i(x_k) ≤ 0.
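To make the procedure concrete, the following Python sketch combines the outer loop of (10)-(11) with the update (15); solve_smooth stands for whatever smooth minimization routine is available (the paper's experiments use a Davidon-Fletcher-Powell routine), and all names here are ours.

```python
def update_multiplier(y, c, f_val):
    """Iteration (15): clip y + c * f_i(x_k) to the interval [0, 1]."""
    return min(1.0, max(0.0, y + c * f_val))

def approximation_method(solve_smooth, f_list, y0, c0, beta=5.0, num_cycles=10):
    """Sequential approximation: smooth the kinks, minimize, update y and c.

    solve_smooth(y, c) is assumed to minimize the smoothed objective over Q
    and return a minimizer x_k; f_list holds the functions f_i appearing in (9).
    """
    y, c = list(y0), c0
    x = None
    for _ in range(num_cycles):
        x = solve_smooth(y, c)                   # solve problem (10)
        y = [update_multiplier(yi, c, f(x))      # multiplier update (15)
             for yi, f in zip(y, f_list)]
        c = beta * c                             # c_{k+1} = beta * c_k, beta > 1
    return x, y
```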
One heuristic way to justify (15) is based on the following observation. If we assume for a moment that x_k → x̄, where x̄ is an optimal solution of (1), and for some k we have (1 − y_k^i)/c_k < f_i(x_k), this indicates that likely there holds 0 < f_i(x̄), and therefore it is better to approximate more accurately the simple kink γ[f_i(x)] for positive values of f_i(x) rather than for negative values. The iteration (15) accomplishes precisely that by setting y_{k+1}^i = 1 (cf. (5)). Similarly, for the other cases f_i(x_k) ≤ −y_k^i/c_k and −y_k^i/c_k ≤ f_i(x_k) ≤ (1 − y_k^i)/c_k, the iteration (15) may be viewed as adaptively adjusting the accuracy of approximation of the simple kink γ(t) from positive to negative values of the scalar argument t and vice versa as the circumstances dictate. A more rigorous and satisfying justification for the employment of (15), together with a clarification of the connection with the method of multipliers, is provided in Section 4. Since the class of problems of the form (1) is extremely rich in variety, it is very difficult to provide a convergence and rate of convergence analysis for the general case. The utter impossibility of providing a unifying notational description for the general case of problem (1) is one of the main obstacles here. For this reason we shall restrict ourselves to the specific class of problems
minimize g[x, γ[f_1(x)], ..., γ[f_m(x)]],
subject to x ∈ Q ⊂ R^n,  (16)
where g, f_1, ..., f_m are continuously differentiable functions. Notice that this class of problems includes the problem of solving systems of equations of the form

h_i[x, γ[f_1(x)], ..., γ[f_m(x)]] = 0,  i = 1, ..., n,

by means of the minimization problem
minimize Σ_{i=1}^n |h_i[x, γ[f_1(x)], ..., γ[f_m(x)]]|.

Our results also apply with simple modifications in statement and proof to the class of problems

minimize g[x, max {f_1^1(x), ..., f_1^{s_1}(x)}, ..., max {f_m^1(x), ..., f_m^{s_m}(x)}],
subject to x ∈ Q ⊂ R^n.  (17)

On the other hand, our analysis can serve as the prototype for the analysis of other different or more general cases and provides a measure of what kind of behavior one may expect from the approximation methods that we propose. The paper is organized as follows: In the next section we prove the basic convergence results for our approximation methods. Section 3 shows that one may obtain, as a byproduct of the computation, quantities which play a role analogous to Lagrange multipliers in constrained minimization. We also show that our convergence results may be used to obtain some optimality conditions for the problem that we are considering. Section 4 examines the possibility of acceleration of convergence by using iteration (15). The connection with the method of multipliers is clarified and some convergence and rate of convergence results are inferred. Finally in Section 5 we present some computational results.
2. Some convergence results
Consider problem (16), where we adopt the standing assumption that the set Q is nonempty and that the functions

g: R^{n+m} → R,  f_i: R^n → R,  i = 1, ..., m,

are everywhere continuously differentiable. We denote by ∇_x g the (column) vector of the first n partial derivatives of g, while we denote by ∂g/∂t_i, i = 1, ..., m, the partial derivative of g with respect to the (n + i)th argument (i.e., ∂g/∂t_i = ∂g(x, t_1, ..., t_m)/∂t_i). Consider now the kth approximate minimization problem

min_{x∈Q} g[x, γ̃[f_1(x), y_k^1, c_k], ..., γ̃[f_m(x), y_k^m, c_k]],  (18)

where

0 < c_k ≤ c_{k+1},  c_k → ∞,  0 ≤ y_k^i ≤ 1,  i = 1, ..., m,  k = 0, 1, ...,
and the approximate kink γ̃[f_i(x), y_k^i, c_k] is given by (5). Any rule may be used for updating y_k^i; for example, y_k^i may be left constant. Let x_k be an optimal solution of problem (18) (assuming one exists). We have the following basic proposition:

Proposition 2.1. There holds

|g* − g[x_k, γ[f_1(x_k)], ..., γ[f_m(x_k)]]| ≤ L/c_k,  k = 0, 1, ...,  (19)

where

g* = inf_{x∈Q} g[x, γ[f_1(x)], ..., γ[f_m(x)]],  (20)

L = sup_{(x, t_1, ..., t_m)∈M} Σ_{i=1}^m |∂g/∂t_i (x, t_1, ..., t_m)|,  (21)

M = {(x, t_1, ..., t_m): x ∈ Q, γ[f_i(x)] − 1/2c_0 ≤ t_i ≤ γ[f_i(x)], i = 1, ..., m},  (22)

provided L above is finite.

Proof. By Taylor's formula, (8), and (21), (22) we have for every x ∈ Q

|g[x, γ[f_1(x)], ..., γ[f_m(x)]] − g[x, γ̃[f_1(x), y_k^1, c_k], ..., γ̃[f_m(x), y_k^m, c_k]]| ≤ L/2c_k,  (23)
from which the result follows.

As a direct consequence of the proposition above we have the following convergence results.

Corollary 2.2. Let Q be a bounded set. Then

lim_{k→∞} g[x_k, γ[f_1(x_k)], ..., γ[f_m(x_k)]] = g*.  (24)
Proof. The boundedness of Q implies that L as given by (21) is finite. Hence by (19) and the fact c_k → ∞ the result follows.

Corollary 2.3. Let g have the particular form

g[x, γ[f_1(x)], ..., γ[f_m(x)]] = g_0(x) + Σ_{i=1}^m γ[f_i(x)],  g_0: R^n → R.

Then

|g* − g[x_k, γ[f_1(x_k)], ..., γ[f_m(x_k)]]| ≤ m/c_k.

Proof. Immediate from (23), (21) and (19).

The above corollary is interesting from a computational point of view since it shows that for the problem above there are available a priori bounds on the approximation error.

Corollary 2.4. Let x̄ be any limit point of the sequence {x_k} and let Q be a closed set. Then

g[x̄, γ[f_1(x̄)], ..., γ[f_m(x̄)]] = min_{x∈Q} g[x, γ[f_1(x)], ..., γ[f_m(x)]],

i.e., x̄ is an optimal solution of problem (16).

Proof. Without loss of generality assume that the whole sequence {x_k} converges to x̄ and let S be any closed sphere containing the sequence {x_k}. Clearly each vector x_k is an optimal solution of the problem
min_{x∈Q∩S} g[x, γ̃[f_1(x), y_k^1, c_k], ..., γ̃[f_m(x), y_k^m, c_k]],

and since Q ∩ S is bounded, by Corollary 2.2 and (23) we have

lim_{k→∞} g[x_k, γ̃[f_1(x_k), y_k^1, c_k], ..., γ̃[f_m(x_k), y_k^m, c_k]] =
  = g[x̄, γ[f_1(x̄)], ..., γ[f_m(x̄)]] = min_{x∈Q∩S} g[x, γ[f_1(x)], ..., γ[f_m(x)]].

It remains to show that the minimum of g over Q ∩ S above is equal to the minimum of g over Q. But this follows from the fact that S can be any closed sphere containing {x_k} and hence it can have an arbitrarily large radius.

Notice that the proposition states that every limit point of {x_k} is a minimizing point of problem (16), but does not guarantee the existence of at least one limit point. This is to be expected since problem (16) may not have a solution. On the other hand the existence of a solution, as well as of at least one limit point of {x_k}, is guaranteed when Q is a compact set.

The above proposition and corollaries establish the validity of the approximation procedure. One may notice that the proofs are very simple and rest on the fact that the function γ̃[f_i(x), y_k^i, c_k] approximates uniformly the simple kink γ[f_i(x)] with an approximation error at most equal to 1/2c_k, as shown by (8). It is interesting to observe that convergence does not
depend on the particular values y_k^i, i ∈ I, employed. This allows a great deal of freedom in adjusting y_k^i for the purpose of accelerating convergence. Since our procedures are related to penalty methods, one expects that they must yield, as a by-product of the computation, quantities which may be viewed as Lagrange multipliers. In the next section we show that such multipliers may indeed be obtained. Furthermore we show that these multipliers enter in optimality conditions which, aside from their analytical value, may serve as a basis for termination of our approximation procedure.
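To illustrate the bound of Corollary 2.3 numerically, consider the one-dimensional example g(x) = 1 + |x| with |x| written as x + γ[−2x] (the representation used for test problem 1 in Section 5). A small sketch, reusing gamma_tilde from the sketch in Section 1 and minimizing the smoothed objective by brute force on a grid; all names are ours:

```python
def smoothed_g(x, y, c):
    # smoothed version of g(x) = 1 + |x|, using the kink |x| = x + gamma[-2x]
    return 1.0 + x + gamma_tilde(-2.0 * x, y, c)

def argmin_on_grid(h, lo=-1.0, hi=1.0, n=400001):
    return min((lo + i * (hi - lo) / (n - 1) for i in range(n)), key=h)

for k in range(5):
    c = 5.0 ** k
    xk = argmin_on_grid(lambda x: smoothed_g(x, 0.0, c))
    err = abs(1.0 - (1.0 + abs(xk)))   # |g* - g(x_k)| with g* = 1
    print(f"c_k = {c:6.0f}  x_k = {xk:+.6f}  error = {err:.2e}  bound 1/c_k = {1/c:.2e}")
```

The observed error behaves like 1/(4c_k), comfortably inside the m/c_k bound of Corollary 2.3 (here m = 1).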
3. Multiplier convergence and conditions for optimality

Let us assume throughout this section that Q is a closed convex set. Using (9), the gradient with respect to x of the objective function in problem (18) may be calculated to be
∇g[x, γ̃[f_1(x), y_k^1, c_k], ..., γ̃[f_m(x), y_k^m, c_k]] = ∇_x g + Σ_{i=1}^m (∂g/∂t_i) ỹ_k^i(x) ∇f_i,  (25)

where ỹ_k^i(x), i = 1, ..., m, is given by

ỹ_k^i(x) =
  1                      if 1 ≤ y_k^i + c_k f_i(x),
  y_k^i + c_k f_i(x)     if 0 ≤ y_k^i + c_k f_i(x) ≤ 1,    (26)
  0                      if y_k^i + c_k f_i(x) ≤ 0,

and all gradients in the right-hand side of (25) are evaluated at the point x. Since Q is a convex set and x_k is an optimal solution of problem (18), we have the necessary condition

⟨(∇_x g + Σ_{i=1}^m (∂g/∂t_i) ỹ_k^i(x_k) ∇f_i)|_{x=x_k}, x − x_k⟩ ≥ 0  for all x ∈ Q,  (27)

where ⟨·, ·⟩ denotes the usual inner product in R^n. Let now {x_k}_{k∈K} be a subsequence of {x_k} which converges to x̄. By Corollary 2.4, x̄ is an optimal solution of problem (16). In addition, the subsequence {ȳ_k} = {ỹ_k^1(x_k), ..., ỹ_k^m(x_k)}_{k∈K} defined by (26) has at least one limit point and, by taking limits in (27), each of its limit points ȳ = {ȳ^1, ..., ȳ^m} must satisfy
⟨(∇_x g + Σ_{i=1}^m (∂g/∂t_i) ȳ^i ∇f_i)|_{x=x̄}, x − x̄⟩ ≥ 0  for all x ∈ Q.
Combining the above observations with Corollary 2.4 we obtain the following proposition:
Proposition 3.1. Let Q be a closed convex set, and let (x̄, ȳ) be any limit point of the sequence {x_k, ȳ_k}, where ȳ_k = (ỹ_k^1(x_k), ..., ỹ_k^m(x_k)) is defined by (26). Then x̄ is an optimal solution of problem (16); and ȳ is a multiplier vector satisfying

⟨(∇_x g + Σ_{i=1}^m (∂g/∂t_i) ȳ^i ∇f_i)|_{x=x̄}, x − x̄⟩ ≥ 0  for all x ∈ Q;
ȳ^i = 0 if f_i(x̄) < 0,  ȳ^i = 1 if f_i(x̄) > 0,  0 ≤ ȳ^i ≤ 1 if f_i(x̄) = 0.  (28)
Proposition 3.1 together with Proposition 2.1 and its corollaries may be used to provide a simple proof of an optimality condition for problem (16). We shall say that x̄ is a local minimum for problem (16) if x̄ is a minimizing point of g[x, γ[f_1(x)], ..., γ[f_m(x)]] over a set of the form Q ∩ {x: |x − x̄| ≤ ε}, where |·| denotes the Euclidean norm and ε > 0 is some positive scalar. If x̄ is the unique minimizing point over Q ∩ {x: |x − x̄| ≤ ε}, we shall say that x̄ is an isolated local minimum.

Proposition 3.2. Let Q be a closed convex set and x̄ be an isolated local minimum for problem (16). Then there exists a multiplier vector ȳ = (ȳ^1, ..., ȳ^m) satisfying
⟨(∇_x g + Σ_{i=1}^m (∂g/∂t_i) ȳ^i ∇f_i)|_{x=x̄}, x − x̄⟩ ≥ 0  for all x ∈ Q,  (29)
ȳ^i = 0  if f_i(x̄) < 0,  i = 1, ..., m,  (30)
ȳ^i = 1  if f_i(x̄) > 0,  i = 1, ..., m,  (31)
0 ≤ ȳ^i ≤ 1  if f_i(x̄) = 0,  i = 1, ..., m.  (32)
Proof. Let Q̄ be a set of the form Q ∩ {x: |x − x̄| ≤ ε} within which x̄ is a unique minimum of g. Consider the approximation procedure for the problem

min_{x∈Q̄} g[x, γ[f_1(x)], ..., γ[f_m(x)]].
The generated sequence {x_k} is well defined since Q̄ is compact. Furthermore, since x̄ is the unique minimizing point of g within Q̄, we have x_k → x̄ and {ȳ_k}_{k∈K} → ȳ for some vector ȳ ∈ R^m and some subsequence {ȳ_k}_{k∈K}, where ȳ_k is defined in Proposition 3.1. Then (29) follows from (28). The relations (30), (31) and (32) follow directly from the definition of ȳ_k and the fact x_k → x̄. We note that when Q = R^n, the above proposition yields the stationarity condition
(∇_x g + Σ_{i=1}^m (∂g/∂t_i) ȳ^i ∇f_i)|_{x=x̄} = 0.  (33)
The above condition and in some cases the more general condition of Proposition 3.2 may be used as a basis for termination criteria of the approximation procedure. We note that necessary conditions similar to the one of Proposition 3.2 may also be proved in an analogous manner for problem (17) as well as for many other problems which are similar in nature and are amenable to the same type of analysis as the one presented for problem (16).
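When Q = R^n, condition (33) suggests a practical termination test: form the multiplier estimates (26) at the current iterate and check that the resulting stationarity residual is small. A minimal sketch, with all function and variable names ours:

```python
def multiplier_estimate(y, c, f_val):
    """The quantity y_tilde_k^i(x) of (26): y + c * f_i(x) clipped to [0, 1]."""
    return min(1.0, max(0.0, y + c * f_val))

def stationarity_residual(grad_x_g, dg_dt, grad_f, y, c, f_vals):
    """Euclidean norm of the left side of (33) at the current iterate.

    grad_x_g and grad_f[i] are gradient vectors (lists of floats);
    dg_dt[i] and f_vals[i] are scalars, all evaluated at x_k.
    """
    r = list(grad_x_g)
    for i in range(len(f_vals)):
        yi = multiplier_estimate(y[i], c, f_vals[i])
        for j in range(len(r)):
            r[j] += dg_dt[i] * yi * grad_f[i][j]
    return sum(v * v for v in r) ** 0.5
```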
4. Acceleration by multiplier iterations

In this section we examine an updating procedure for the multiplier vectors y_k which in many cases can greatly improve the computational efficiency of the approximation method. We consider the case where the approximation procedure is operated sequentially and the multipliers y_k^i, i = 1, ..., m, used in the approximations are updated by means of the iteration
y_{k+1}^i =
  1                        if 1 ≤ y_k^i + c_k f_i(x_k),
  y_k^i + c_k f_i(x_k)     if 0 ≤ y_k^i + c_k f_i(x_k) ≤ 1,    (34)
  0                        if y_k^i + c_k f_i(x_k) ≤ 0.
A heuristic justification for this iteration was given in the introduction, where we also mentioned its connection with iterations for the method of multipliers. We now concentrate on clarifying this latter connection further. Some familiarity with the method of multipliers is required on the part of the reader for the purpose of following the discussion. Consider the following simple special case of problem (16)
minimize g_0(x) + Σ_{i=1}^m a_i γ[f_i(x)],

where a_i, i = 1, ..., m, are some positive scalars. By introducing additional variables z_1, ..., z_m this problem is equivalent to the problem

min_{a_i f_i(x) ≤ z_i, 0 ≤ z_i} {g_0(x) + Σ_{i=1}^m z_i}.

The problem above may be solved by the method of multipliers with a quadratic penalty function as described for example in [4, 13, 14, 18]. One may either eliminate only the constraints a_i f_i(x) ≤ z_i by means of a generalized penalty, or eliminate both the constraints a_i f_i(x) ≤ z_i and 0 ≤ z_i. It is possible to show that both approaches lead to identical results in our case. The reader may easily verify that our approximation procedure coupled with iteration (34) is in fact equivalent to solving the minimization problem above by the method of multipliers mentioned earlier. Our approximation procedure however is not equivalent to the method of multipliers for problems which do not have the simple form above, although there is a certain relation which we now proceed to discuss. Let Q = R^n and let x̄ be a unique (isolated) local minimum of problem (16) within some open sphere S(x̄; ε) = {x: |x − x̄| < ε}. Let us use the notation

I⁺ = {i: f_i(x̄) > 0, i = 1, 2, ..., m},
I⁻ = {i: f_i(x̄) < 0, i = 1, 2, ..., m},
I⁰ = {i: f_i(x̄) = 0, i = 1, 2, ..., m}.

Assume that ε > 0 is taken sufficiently small to guarantee that

f_i(x) > 0 for all x ∈ S(x̄; ε), i ∈ I⁺,
f_i(x) < 0 for all x ∈ S(x̄; ε), i ∈ I⁻.
Let us first consider the case where the objective function g in problem (16) has the particular form

g[x, γ[f_1(x)], ..., γ[f_m(x)]] = g_0(x) + Σ_{i=1}^m g_i(x) γ[f_i(x)],  (35)
where g_i: R^n → R, i = 0, ..., m, are continuously differentiable functions. Now if we make the assumption

g_i(x̄) ≠ 0 for all i ∈ I⁰,  (36)
we have that, when g has the form (35), problem (16), locally within a neighborhood of x̄, may be written as

min {g_0(x) + Σ_{i∈I⁺} g_i(x) f_i(x) + Σ_{i∈I⁰⁺} max [0, g_i(x) f_i(x)] + Σ_{i∈I⁰⁻} min [0, g_i(x) f_i(x)]},  (37)

where

I⁰⁺ = {i: g_i(x̄) > 0, f_i(x̄) = 0},  I⁰⁻ = {i: g_i(x̄) < 0, f_i(x̄) = 0}.
Since x̄ is an isolated local minimum of the above problem, it follows under the mild assumption

∇[g_i(x) f_i(x)]|_{x=x̄}, i ∈ I⁰, are linearly independent vectors  (38)

that the set I⁰⁻ is empty and we have

g_i(x̄) > 0 for all i ∈ I⁰.  (39)
Notice that the previous assumption (36) is implied by assumption (39). This fact can be verified by noting that x̄ is an optimal solution of the problem

min_{g_i(x) f_i(x) ≤ z_i, 0 ≤ z_i, i∈I⁰⁺} {g_0(x) + Σ_{i∈I⁺} g_i(x) f_i(x) + Σ_{i∈I⁰⁺} z_i + Σ_{i∈Ĩ⁰⁻} g_i(x) f_i(x)}  (40)

for every subset Ĩ⁰⁻ of the set I⁰⁻. By applying the Kuhn-Tucker theorem to problem (40) and using (38), it follows that the set I⁰⁻ must be empty, i.e., (39) holds. The basic conclusion from the preceding analysis is that, assuming (38), the problem of minimizing (35) is equivalent, locally around x̄, to the nonlinear programming problem

min_{g_i(x) f_i(x) ≤ z_i, 0 ≤ z_i, i∈I⁰} {g_0(x) + Σ_{i∈I⁺} g_i(x) f_i(x) + Σ_{i∈I⁰} z_i}.  (41)
At this point we deviate somewhat from our main subject and discuss briefly a constrained minimization method which is identical to the method of multipliers as described for example in [4, 18], except for the fact that the penalty parameter may depend on the vector x. It turns out that this method is closely related to our approximation procedure.
Consider a constrained minimization problem of the form

min_{p_i(x) ≤ 0, i=1,...,m} p_0(x),  (42)

where p_0, p_i: R^n → R. The method of multipliers consists of sequential unconstrained minimization of the function

p_0(x) + Σ_{i=1}^m (1/2c_k) {[max(0, y_k^i + c_k p_i(x))]² − (y_k^i)²},  (43)

where {c_k} is a sequence of positive numbers and the multipliers y_k^i are updated at the end of each unconstrained minimization by means of the iteration

y_{k+1}^i = max [0, y_k^i + c_k p_i(x_k)],  i = 1, ..., m,  (44)

where x_k minimizes (43). The same updating formula may be used even if x_k is only an approximate minimizing point of (43). The method maintains its basic convergence characteristics provided the unconstrained minimization is asymptotically exact. We refer for a detailed discussion of this point, as well as for supporting analysis, to [3, 4]. Now consider a variation of the method above whereby x-dependent penalty parameters c_k^i(x), c_k^i: R^n → R, are used in (43), (44) in place of c_k, i.e., we minimize
P(x, y_k) = p_0(x) + Σ_{i=1}^m (1/2c_k^i(x)) {[max(0, y_k^i + c_k^i(x) p_i(x))]² − (y_k^i)²}  (45)

and update the multipliers by means of

y_{k+1}^i = max [0, y_k^i + c_k^i(x_k) p_i(x_k)],  i = 1, ..., m,  (46)

where x_k minimizes (45). Here we assume that c_k^i(x) is positive over a region of interest and that there is some form of control over the magnitude of c_k^i(x) so that it can be uniformly increased if desired. For example we may have c_k^i(x) = c_k^i r_i(x), where c_k^i is a scalar penalty parameter that may be increased from one minimization to the next and r_i(x) is a positive function of x over the region of interest which does not depend on the index k. It is not difficult to see that a method of multipliers with an x-dependent penalty parameter of the type described above should behave similarly to a method of multipliers of the ordinary type. The reason is that if x_k is a minimizing point of the function P(x, y_k) of (45), then x_k is also an approximate minimizing point of the function
P̄(x, y_k) = p_0(x) + Σ_{i=1}^m (1/2c_k^i(x_k)) {[max(0, y_k^i + c_k^i(x_k) p_i(x))]² − (y_k^i)²}  (47)

in the sense that

∇P̄(x_k, y_k) = −½ Σ_{i=1}^m [max(−y_k^i/c_k^i(x_k), p_i(x_k))]² ∇c_k^i(x_k).
Now if x_k is close to an optimal solution x̄ and y_k is close to a corresponding Lagrange multiplier vector ȳ of problem (42), then ∇P̄(x_k, y_k) is small, and in the limit as x_k → x̄, y_k → ȳ we have ∇P̄(x_k, y_k) → 0. In other words, x_k is an asymptotically exact minimizing point of (47). As a result, the multiplier method with x-dependent penalty parameters is equivalent to an ordinary multiplier method with penalty parameter sequences {c_k^i(x_k)} where asymptotically exact minimization is employed. It follows, under suitable assumptions that the reader may easily provide using the analysis of [3, 4], that such methods of multipliers employing x-dependent parameters possess the well-known advantages of multiplier methods over penalty methods. In particular the multiplier iteration (46) accelerates convergence and the penalty parameters need not be increased to infinity in order for the method to converge. Now it may be verified that our approximation method, when iteration (34) is employed, is equivalent, within a neighborhood of x̄, to a multiplier method for solving the constrained minimization problem (41) where the penalty parameter is x-dependent as described above. To be precise, let {c_k} be a parameter sequence used in the approximation method for minimizing (35), and let {x_k}, {y_k} be the corresponding generated sequences. Then the vectors x_k, y_k, for k > k̄, where k̄ is sufficiently large, are identical to the ones that would be generated by a method of multipliers for problem (41) for which:
(a) Only the constraints g_i(x) f_i(x) ≤ z_i, i ∈ I⁰, are eliminated by means of a generalized quadratic penalty.
(b) The penalty parameter for the (k + 1)th minimization corresponding to the ith constraint, i ∈ I⁰, depends continuously on x and is given by c̄_k^i(x) = c_k/g_i(x).
(c) The multiplier vector at the beginning of the (k + 1)th minimization is equal to y_k.
Alternatively, the vectors x_k, y_k for k > k̄, where k̄ is sufficiently large, are identical to the ones that would be generated by the method of multipliers for problem (41) for which:
(a) Both constraints g_i(x) f_i(x) ≤ z_i, i ∈ I⁰, and 0 ≤ z_i, i ∈ I⁰, are eliminated by means of a generalized quadratic penalty.
(b) The penalty parameter for the (k + 1)th minimization corresponding to the ith constraints depends continuously on x and is given by c̄_k^i(x) = 2c_k/g_i(x).
(c) The multiplier vectors ȳ, ŷ at the beginning of the (k + 1)th minimization (where ȳ corresponds to the constraints g_i(x) f_i(x) ≤ z_i, i ∈ I⁰, and ŷ corresponds to the constraints 0 ≤ z_i, i ∈ I⁰) satisfy ȳ = y_k and ŷ = 1 − y_k.
= min[z + (g/4c){[max(0,y + (2c/g)(g f -
z))] 2 - y2
+ [max(0, 1 - y - (2c/g)z)] 2 - (1 - y)2}]. The above relations show that after a certain index (when ck is sufficiently high and the multipliers y~, i e I +, y~, i e I - have converged to one and zero respectively) the k th unconstrained minimization in our approximation method is equivalent to the kth unconstrained minimization in the multiplier methods mentioned above. The conclusion from the above analysis is that our approximation procedure with the iteration (34) when applied to minimization of a function of the form (35) has similar behavior as a method of multipliers with xdependent penalty parameter applied to problem (41). Thus we can conclude that results concerning convergence and rate of convergence for the method of multipliers carry over to our case. In particular we expect that under assumptions which hold in most cases of interest the iteration (34) will accelerate convergence and will avoid the need to increase ck to infinity, i.e., the approximation procedure will converge even for ck constant but sufficiently large. We turn now to the general case of problem (16) where the cost function g does not have the form (35). Let us assume for convenience and without loss of generality that I ~ = { 1,..., m}, and consider the following Taylor series expansion around
a(x) = O[x, ~[fl(x)] ..... 7[fm(X)]] =
mr? gEx, ?[f,(~)] .... ,7[fm~)]] + 2 ~ [ X , 7[f,(x)], ..., 7[fm~)]]TEf~(x)] i = l tJLi
(48,
18
D.P. Bertsekas / Nondifferentiable optimization via approximation
where Ox is a function of x, 7[f/(x)], i = 1..... m such that for every x,
lim
=0. i=1
Now the function (48) is of the form (35) except for terms of order higher than one. Thus we expect that in the limit and close to x̄, where the term o_x(Σ_{i=1}^m γ[f_i(x)]) is negligible relative to first order terms, the approximation method will yield similar behavior as for the case of a function of the form (35), and the iteration (34) can be viewed as asymptotically equivalent to the iteration of a method of multipliers. A more precise justification of the point above can be given by considering the function (48) up to first order:

Ĝ(x) = g[x, γ[f_1(x̄)], ..., γ[f_m(x̄)]] + Σ_{i=1}^m (∂g/∂t_i)[x, γ[f_1(x̄)], ..., γ[f_m(x̄)]] γ[f_i(x)].  (49)
Now the point x̄ satisfies the necessary condition of the previous section for an isolated local minimum of the function above. Let us assume that x̄ is indeed an isolated local minimum of (49). This assumption is not, of course, always satisfied but is likely to be true in many cases of interest. Then the approximation procedure for minimizing G(x) is equivalent to the approximation procedure for minimizing Ĝ(x), except for the fact that in the latter case we terminate the unconstrained minimizations at points x_k where the gradient of the approximate objective function equals the gradient of the remainder term o_x(Σ_{i=1}^m γ̃[f_i(x), y_k^i, c_k]) evaluated at x_k.
Now when o_x contains terms of second order and higher and x_k → x̄, y_k → ȳ (this is guaranteed for example if c_k → ∞), the term above tends to the zero vector and the minimization of the approximation of G(x) is asymptotically exact. Our discussion above has been somewhat brief, since a detailed analysis of our viewpoint would be extremely long and tedious. However it is felt that enough explanation has been provided to the interested reader in
order for him to supply the necessary assumptions and analysis and firmly establish the conclusions reached. We mention that for practical purposes it may be computationally efficient to update the multipliers prior to completing the minimization in the approximate optimization problems. One may use analogs of termination criteria used for penalty and multiplier methods with inexact minimization [3, 4]. While it seems that the employment of such termination criteria should result in more efficient computation for many problems, our computational experiments were inconclusive in this respect. Finally we note that when the constraint set Q in problem (16) is specified by equality and inequality constraints, it is convenient to eliminate these constraints by means of penalty functions while solving the approximate minimization problems. In this way the approximation method is combined with the penalty function method in a natural way. One may use the same parameter c_k to control both the accuracy of the approximation and the severity of the penalty. Assuming that c_k → ∞, one may prove various convergence theorems for methods of this type simply by combining standard arguments of convergence proofs of penalty methods [8] together with the convergence arguments of this paper. As an example consider the problem

minimize g_0[x, γ[f_1(x)], ..., γ[f_m(x)]],
subject to g_1[x, γ[h_1(x)], ..., γ[h_p(x)]] = 0,

where g_0, g_1 are real-valued functions. One may consider sequential minimization of the function

g_0[x, γ̃[f_1(x), y_k^1, c_k], ..., γ̃[f_m(x), y_k^m, c_k]] + λ_k g_1[x, γ̃[h_1(x), w_k^1, c_k], ..., γ̃[h_p(x), w_k^p, c_k]] + ½c_k [g_1[x, γ̃[h_1(x), w_k^1, c_k], ..., γ̃[h_p(x), w_k^p, c_k]]]²,
where {c_k} is an increasing sequence tending to infinity with c_k > 0, {y_k^i}, {w_k^i} satisfy

0 ≤ y_k^i ≤ 1,  0 ≤ w_k^i ≤ 1  for all k, i,

and {λ_k} is a bounded scalar sequence. The updating procedure for the multipliers which corresponds to the iteration of the method of multipliers is given by
y_{k+1}^i =
  1                        if 1 ≤ y_k^i + c_k f_i(x_k),
  y_k^i + c_k f_i(x_k)     if 0 ≤ y_k^i + c_k f_i(x_k) ≤ 1,    (50)
  0                        if y_k^i + c_k f_i(x_k) ≤ 0;

w_{k+1}^i =
  1                        if 1 ≤ w_k^i + c_k h_i(x_k),
  w_k^i + c_k h_i(x_k)     if 0 ≤ w_k^i + c_k h_i(x_k) ≤ 1,    (51)
  0                        if w_k^i + c_k h_i(x_k) ≤ 0;

λ_{k+1} = λ_k + c_k g_1[x_k, γ̃[h_1(x_k), w_k^1, c_k], ..., γ̃[h_p(x_k), w_k^p, c_k]].  (52)
The updating procedure (50)-(52) appears to be a reasonable one, and when employed it greatly improved the speed of convergence in our computational experiments. However we offer here no theoretical analysis which supports the conjecture that it accelerates convergence for any broad class of problems.
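As an illustration of how the smoothing combines with the penalty treatment of the equality constraint above, here is a schematic Python sketch of one multiplier update following a smoothed minimization, using (50)-(52); g1_smooth stands for a routine (not defined in the paper) that evaluates g_1 with smoothed kinks, and all names are ours.

```python
def clip01(v):
    """Clip to [0, 1]: the common form of the updates (50) and (51)."""
    return min(1.0, max(0.0, v))

def multiplier_step(x_k, y, w, lam, c, f_list, h_list, g1_smooth):
    """One update after the k-th minimization of the augmented objective
    g0 + lam * g1 + (c/2) * g1**2, all built with smoothed kinks."""
    y_new = [clip01(yi + c * f(x_k)) for yi, f in zip(y, f_list)]   # (50)
    w_new = [clip01(wi + c * h(x_k)) for wi, h in zip(w, h_list)]   # (51)
    lam_new = lam + c * g1_smooth(x_k, w, c)                        # (52)
    return y_new, w_new, lam_new
```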
5. Computational results

We have performed, with the assistance of L. Berman, a number of computational experiments to test the analysis of this paper. We present here some of the results related to two test problems. In both problems we performed the unconstrained minimizations by using the Davidon-Fletcher-Powell method available on the IBM-360 and referred to as the FMFP Scientific Subroutine. The value of the parameter ε which controls accuracy of minimization in this method was taken to be ε = 10⁻⁵. Double precision was used throughout. The starting point for each unconstrained minimization, except the first one in each problem, was the final point obtained from the previous minimization. The computational results are reported in terms of number of iterations required (the number of function evaluations not being readily available). These results, naturally, are highly dependent upon the efficiency of the particular unconstrained minimization subroutine employed. It is possible that much better (or worse) results may be obtained by employing a different unconstrained minimization method, such as for example Newton's method.

Test problem 1. The problem is
minimize ∏_{i=1}^n (1 + |x^i|),
where x^i denotes the ith coordinate of x ∈ R^n. We represented |x^i| by x^i + γ[−2x^i] and used our approximation procedure with and without updating of the multipliers, starting with x^i = −1, y^i = 0, i = 1, ..., n. We solved the problem for n = 5 and n = 50 and a penalty parameter sequence c_k = 5^k. We also solved the problem by using a constant value of penalty parameter c_k = 10 for all k in conjunction with iteration (34). Table 1 shows the results of the computation.
Table 1. (Value of objective at minimizing point x_k)/(Number of iterations required)

      c_k = 5^k, y_k ≡ 0          c_k = 5^k, y_k updated        c_k = 10, y_k updated
 k    n = 5        n = 50         n = 5        n = 50           n = 5        n = 50
 0    32.6820/10   14753.3/6      32.6820/10   14753.3/6        1.89062/16   2028.52/56
 1    3.06250/19   6420.92/13     2.00493/18   5807.64/73       1.00000/10   131.120/86
 2    1.32250/16   441.704/72     1.00000/13   72.7005/60                    50.7946/84
 3    1.06090/13   11.9509/93                  1.93668/139                   18.6044/115
 4    1.00240/13   2.28010/132                 1.00000/100                   1.00091/71
 5    1.00048/34   1.21441/165                                               1.00090/50
 6    1.00010/24   1.04122/131                                               1.00000/36
 7    1.00002/29   1.00818/180
 8    1.00000/30   1.00163/103
 9                 1.00033/186
10                 1.00007/143
11                 1.00001/168
12                 1.00000/140

Total number of iterations:  188 | 1532 | 41 | 378 | 26 | 498
We also solved a constrained version of the problem
subject to
Ix 1-
2L + Ix 1 + ... § Ix l = 1
by using the combination of the penalty or multiplier m e t h o d and the approximation m e t h o d described at the end of the previous section. The starting points were x i = - 1, yi = 0, wi = 0, i = 1, 2 . . . . . n and ;t = 0. The results are summarized in Table 2. In each case the constraint equation was satisfied within six significant digits at the final solution point.
22
D:P. Bertsekas / Nondifferentiable optimization via approximation
Table 2 (Value of objective at minimizing point X k ) / ( N u m b e r of iterations required) Ck = 5 k, yk,Wk,2k =- 0 k
n = 5
0 23.6249 1 4.35020 2 3.84160 3 3.95553 4 3.99054 5 3.99809 6 3.99962 7 3.99992 8 4.00000 9 10 11 12 13
Ck ---- 5 k, yk,Wk,2k:Updated
n = 50 13 19 15 19 16 26 44 36 36
164410. 7095.79 432.923 20.1098 6.21738 4.40393 4.07921 4.01578 4.00315 4.00063 4.00013 4.00003 4.00001 4.00000
Total number 224 of iterations
n = 5 61 93 102 108 137 222 132 167 139 149 186 131 500* 500*
n = 50
23.6249 13 3.17223 16 3.80147 15 3.99683 5 3.99999 5 4.00000 6
2,627
Ck = 10, yk,wk,2 k :updated
164410. 4400.94 99.9155 5.73819 3.99912 4.00002 4.00000
60
n = 5
n = 50
61 103 114 133 109 60 50
3.85141 15 3.63379 16 3.93774 5 3.98959 5 3.99826 5 3.99971 5 3.99995 5 3.99999 5 4.00000 5
2611.20 92 161.944 133 52.6251 131 3.22301 111 3.86473 84 3.97907 51 3.99800 54 3.99937 50 3.99989 50 3.99998 50 4.00000 50
630
66
856
* Limit on # of iterations
Test problem 2. This is a minimax problem suggested to us by Claude Lemarechal: min max{fx(x), f2(x) ..... fs(x)}, X
where x E R t~ and f~(x) = ~x, A i x ) - (bi, x ) ,
i = 1, 2 . . . . , 5.
The elements at(m, n), m, n = 1..... 10 of the matrices Ai and the coordinates hi(m) of the vectors bi are given by: at(m, n) = e m/n cos(m- n) sin(i) ai(m, n) = ai(n, m)
for m < n,
for m > n,
a,(m,m) -- 2 Isin (i)1 i/m + Y~ la,(m,j)l, j~p rn
bi(m)
=
e "/i
s i n ( / " m),
i =
We represented max{f1, f2 ..... fs} by
1.....
5,
m =
1, ..., 10.
D.P. Bertsekas / Nondifferentiable
optimization
0
.=.~. I I I I I I I I
~
~ t r l
I I I I I 1 ~ 1 r
r162
N
I
.~
I
I
I r
I
I
,...~ t'q
0
tr
tr
I
~'%
t
I
.8 t"q ("4
II
.=.
I
t,r
II
I
I
tr
I
I
it-i
I
I
O 0
0,.,.
[.-o
via approximation
23
24
D.P. Bertsekas / Nondifferentiable optimization via approximation
m a x { A , . . . , fs} = A + 7[f2 - A
+ ~:[f3 - A + ~'[A - f3 + ~:[f5 - A ] ] ] ]
and used our approximation procedure in conjunction with iteration (34). The starting points were x ~ = x z = ..- = x 1~ = 0 and y01 = y~ = yo3 = yo4 = 0. The optimal value obtained is -0.51800 and the corresponding multiplier vector was = (1.00000, 0.99550, 0.89262, 0.58783). It is worth noting that for minimax problems of this type the optimal values of the approximate objective obtained during the computation constitute useful lower bounds for the optimal value of the problem. Table 3 shows the results of the computation for the case where unconstrained minimization was "exact" (i.e., e = 10- 5 in the D F P routine% It also shows the results of the computation when the unconstrained minimization was inexact in the sense that the k th minimization was terminated when t h e / t - n o r m of the direction vector in the D F P was less than max[lO -s, lO-k].
References [17 D.P. Bertsekas and S.K. Mitter, "A descent numerical method for optimization problems with nondifferentiable cost functionals", SlAM Journal on Control 11 (1973) 637-652. [2] D.P. Bertsekas and S.K. Mitter, "Steepest descent for optimization problems with nondifferentiable cost functionals", Proceedinos of the 5th annual Princeton conference on information sciences and systems, Princeton, N.J., March 1971, pp. 347-351. [3] D.P. Bertsekas, "Combined primal-dual and penalty methods for constrained minimization", SIAM Journal on Control to appear. [4] D.P. Bertsekas, "On penalty and multiplier methods for constrained minimization", SIAM Journal on Control 13 (1975) 521-544. [5] J. Cullum, W.E. Donath and P. Wolfe, "An algorithm for minimizing certain nondifferentiable convex functions", RC 4611, IBM Research, Yorktown Heights, N.Y. (November 1973). [6] V.F. Demyanov, "On the solution of certain minimax problems", Kibernetica 2 (1966). [7] V.F. Demyanov and A.M. Rubinov, Approximate methods in optimization problems (American Elsevier, New York, 1970). [8] A.V. Fiacco and G.P. McCormick, Nonlinear programming : sequential unconstrained minimization techniques (Wiley, New York, 1968). [9] A.A. Goldstein, Constructive real analysis (Harper & Row, New York, 1967). [10] M. Held, P. Wolfe and H.P. Crowder, "Validation of subgradient optimization", Mathematical Programmin0 6 (1) (1974) 62-88. [11] M.R. Hestenes, "Multiplier and gradient methods", Journal of Optimization Theory and Applications 4 (5) (1969) 303-320. [12] C. Lemarechal, "An algorithm for minimizing convex functions", in : J.L. Rosenfeld, ed., Proceedings of the IFIP Conoress 74 (North-Holland, Amsterdam, 1974) pp. 552-556.
D.P. Bertsekas / Nondifferentiable optimization via approximation
25
[13] B.W. K o n and D.P. Bertsekas, "Combined primal-dual and penalty methods for convex programming", SIAM Journal on Control, to appear. [14] B.W. Kort and D.P. Bertsekas, "Multiplier methods for convex programming", Proceedings of 1973 IEEE conference on decision and control, San Diego, Calif., December 1973, pp. 428-432. [15] B.T. Polyak, "Minimization of unsmooth functionals", ~,urnal Vy~islitel'noi Matematiki i Matemati~eskoi Fiziki 9 (3) (1969) 509-521. [16] M.J.D. Powell, "A method for nonlinear constraints in minimization problems", in : R. Fletcher, ed., Optimization (Academic Press, New York, 1969) pp. 283-298 [17] B.N. Pschenishnyi, "Dual methods in extremum problems", Kibernetica I (3) (1965) 89-95. [18] R.T. Rockafr "The multiplier method of Hestenes and Powell applied to convex programming", Journal of Optimization Theory and Applications 12 (6) (1973). [19] N.Z. Shot, "On the structure of algorithms for the numerical solution of planning and design problems", Dissertation, Kiev (1964). ['20] P. Wolfe, "A method of conjugate subgradients for minimizing nondifferentiable functions", Mathematical Programming Study 3 (1975) 145-173 (this volume).
Mathematical Programming Study 3 (1975) 26-34. North-Holland Publishing Company
O N I M P R O V I N G R E L A X A T I O N M E T H O D S BY M O D I F I E D GRADIENT TECHNIQUES*
P.M. CAMERINI, L. F R A T T A and F. M A F F I O L I lstituto di Elettrotecnica ed Elettronica, Politecnico di Milano, Milano, Italy
Received 14 November 1974 Revised manuscript received 4 August 1975
Relaxation methods have been recently shown to be very effectivefor some large scale linear problems. The aim of this paper is to show that these procedures can be considerably improved by following a modified gradient step direction.
1. Introduction Given the constants Ck ~ R, #k e R" (k = 1..... KL n and K being two positive integers, and a variable vector ~ ~ R", find V. 1
maxw(n) = rnkin {c ~ + ~ . / ~ ) ~--- Ck(~) "4- 7C" ~ktn ).
It has been shown [9, 16] that whenever K is large and an efficient m e t h o d exists to find w(Tr) for each value of ~ (this happens for instance for some combinatorial optimization problems [1,9, 16]) a m e t h o d related to gradient techniques provides a very effective attack on P. 1. Specifically one can a d o p t the following iterative scheme:
{~
o
-----0,
~m + 1 = 7,cm "Jr tm Sin,
(1)
{ t.,} being a sequence of scalars, and sm the gradient of worm) such that s" = Vw(Tr") = #k(,..J.
(2)
* Partially supported by the Centro di Telecomunicazioni Spaziali of CNR (Italy). A provisional version of this work was presented at the International Conference on Operation Research (Eger, August 1974).
P.M. Camerini et al. / Improving relaxation methods by gradient techniques
27
Since k(n) is in general a multivalued function, there exist some points where the gradient is undefined. At such points a set of subgradients can be defined and any of them can be used as s" [16]. (Note that this iteration scheme is not applied as in conventional gradient procedures, but rather, as in relaxation methods, in order to come closer and closer to the optimal region; the objective function need not be improved at each step.) Computational experience with some large scale problems (assignment, traveling salesman, multicommodity max flow) [1, 9, 10, 16] has shown that the above relaxation method works better than classical gradient procedures or column generation techniques [7, 15]. The aim of this paper is to prove that the efficiency of the relaxation technique is further improved by selecting the modified gradient direction s" = ~ ' + / ~ . s m- 1,
(3)
where #m = #k<~m>and tim is a suitable scalar (sm- 1 = 0 for m = 0). Note that (3) is in fact equivalent to a weighted sum of all preceding gradient directions, which has been successfully used by Crowder [5] in order to avoid some possible troublesome effects due to the "subgradient's alternating components". In [5], two kinds of modified gradient directions are proposed and, although their general properties have not been investigated, computational experience shows their advantage over use of the simple gradient direction. In the following section we give a rationale for choosing the iteration scheme (1) and (3), leading to a policy for determining tim and t m at each step. We also show that the properties given in [9] for (2) can be extended to (3) and that (3) is always at least as good a direction as (2). In Section 3, some computational experiments are discussed which confirm the theoretical results.
2. Properties of modified gradient directions
Lemma 1. [9]. L e t ~z and n m be such that ~ = w(~z) >>. w(rc'~) = wm. Then (~ - n"). #m I> g, _ wm i> 0.
(4)
Let, f o r all m, ~V--Wm
0 < t. <
ils.[i 2
and
fl~>10.
(5)
28
P.M. Camerini et al. / Improving relaxation methods by gradient techniques
Then (~ -
n').
s '~ > / ( n
-
(6)
re"). ~u=
for all m. Theorem 1. Let S,n- 1 .pm
-7./is._,112
fl1r
/ys--I
0
-
(7)
otherwise,
with
(8)
0 ~< 7,, ~< 2. Then
Ilrll
>1
(9)
I1 '1t
Theorem 2.
(10) Proofs. Lemma 2, and Theorems 1 and 2 are proved in the Appendix. Lemma 1 guarantees that the direction of/~" forms an acute angle with the direction leading from rg" to the optimum ~, while Lemma 2 extends this property to sm. Theorem 1 shows that by a proper choice of tim, s" is always at least as good a direction as/~". Fig. 1 attempts to illustrate such a behaviour in a two-dimensional case. K
"~m =sm
Sm,, p:O
rim
P~ S ~-1
I
1~r n ' l
l~rn- 1
Fig. 1.
r
'
~
m IBm> O
P.M. Camerini et aL / Improving relaxation methods by 9radient techniques
29
Theorem 2 guarantees that a point closer and closer to the optimum is obtained at each iteration, and that the following convergence property holds. If (5) holds and, for some e > 0,
t,,
--
W m
rorallm,
(ll)
where ~, < max,, w(r0, then the sequence ( ~ } either includes a point n I e Pw or converges to a point on the boundary of Pw, where Pw denotes the polyhedron of feasible solutions to ~ ~< ck + n" p~ for all k. In fact it can be shown exactly as in [-9] and [14] that {n m} is F6jer-monotone relative to Pw and hence converges. From (1), (5) and (11) it follows that if no n~ in {n"} exists such that rrt e Pw, then w (lim,._~~ n ' ) = ~ and the limit point is on the boundary of Pw. Similar results were proved in [9] for the iteration scheme (1) and (2) with a condition on t., less restrictive than (5), namely 2(~, - w,,,)
o < t. <
11,.112
(12)
However, as it is represented in Fig. 2, the best choice for t,, would be that yielding the nearest position to ~ in both directions (H and H'). Following Lemmas 1 and 2, an estimate for this step is given either by letting t,, be equal to half the upper limit in (12) or equal to the upper limit in (5).
Fig. 2.
P.M. Camerini et al, / Improvino relaxation methods by gradient techniques
30
As a final remark, note that the policy (7) tends to avoid "zig-zag" behaviour of the sequence {rim}, since, when the actual gradient direction forms an obtuse angle with the preceding step direction, fl,, is set greater than zero, thus favouring the "subgradient's persistent components" [5].
3. Computational results For choosing tm and s" three policies have been tested. (a) s" -/~'~ and t., = 1 [9]. (b) s m --/~" and t m = (w* - w.)/lls'll max~ w(rc). (c) s m = #m + rims"-1, where s m- 1./~m 0
where w* is a good estimate of
if s m- 1./~m < 0, otherwise,
and W* ~ W m
t.--- 0:112 Choosing ~ = 1 would a m o u n t to using a direction orthogonal to. sm- 1. Better computational results have been obtained choosing y = 1.5. This value of ~ may be heuristically justified by the following considerations. Let us define an improvement ratio r / = cos ~s / cos ~ , where 6s and 6~ are respectively the angles which the vector s" and the vector #" form with the optimum direction ~ - r: ~. It is easy to verify that, when s m- t . #m < 0, 1 + yp cos a t / = [1 - y(2 - y)cos20t] 1/2' where ~ is the (acute) angle between - s m- 1 and #% and p = cos tp / cos ~O, tp and ~, being respectively the angles which s " - 1 and/~" form with ~ - n m. The maximum value of r/is
fl=[l + 2pc~176 + P2~1/2 which is obtained by the following value of 7: p + cos 0c cos ct (1 + p cos 0t)"
(14)
P . M . Camerini et at. / I m p r o v i n g relaxation m e t h o d s b y gradient techniques
31
A simple heuristic estimate of p is p = 1 (which amounts to assuming that on the average, when s~- 1. # , < 0, sm- 1 and/z m are equivalent directions with respect to ~ - 7rm).From (13) and (14), two estimates for ~/and ~ follow, namely ~=
1 -cos~
~=
1/cose.
Therefore a policy for choosing ~,~ is Yra ---
IIs'-11L I1 '1t sra - 1 . [lra '
115)
and we may note that if we assume ~ as the mean value of 0~, we obtain = x/2 and q --- 2.61. This value of ~ agrees fairly well with the value of 1.5, suggested by our computational experience. In order to experiment with the three above policies, the shortest hamiltonian path (SHP) problem (traveling salesman problem) has been solved for the graphs listed in Table 1 by utilizing the heuristically guided algorithm presented in [1, 3]. For any state vi of the search, a lower bound to the length of a SHP spanning the set Ni of nodes not yet connected is obtained by solving a problem of the form max,, wi(rc), r~ being a IN, I- dimensional vector. For all successors vj of the state v~ the corresponding problems of the form max~ %(7r) have to be solved on the same sub-graph. Table 1 Example 1 2 3 4 5 6 7 8 9 10 11
Croes [4] Random not euclidean (35)a Random euclidean (350)~ Karg & Thompson [11] Dantzig et al. [6] Held & Karp [8] Random euclidean (1000) a Karg & Thompson [11] Random euclidean (1000) a Join I & 6 b Join 4 & l0 b
Number of nodes
SHP length
20 20 20 33 42 48 48 57 67 67 100
246 25 1078 10861 699 11461 5394 12955 5456 16689 28298
a Random (not) euclidean (x) is a randomly generated graph with (not) euclidean distances between nodes not greater than x. b Join x & y is a graph obtained by joining graphs x and y by means of Lin's procedure [12].
32
P.M. Camerini et al. / Improving relaxation methods by gradient techniques
As a consequence, these problems lead to almost the same performance of the relaxation method. Let M be the mean number of iterations of(l) and Aw the mean relative increment of wj(n) over the set {vi} of all successors of the state vs. For each vi, we assume as a performance measure of the method the ratio a = Aw/M. Some of these values for the three policies and for different n = IN, I - 1 are reported in Fig. 3. The corresponding three
-o -
o
I
o
9
a
x o
b c
10 -1
p-
E I
60
80
n
100
Fig. 3. minimum mean square regression lines are also represented. Even if the number of samples is not sufficiently large for a satisfactory statistical analysis, one can see that for any n (except n = 38) a steadily increases when passing from policy (a), to policy (b), to policy (c), in accordance with the previous theoretical results.
4. Conclusions Relaxation methods, recently revived, and applied to some large scale linear problems have been shown here to be considerably improved by a
P.M. Camerini et al. / Improvin 0 relaxation methods by gradient techniques
33
suitable choice of the direction of search, which turns out to be given by a modified gradient vector. More computational experience will be obtained by applying these methods to other problems such as those mentioned in [2, 5, 13, 16] and in testing the performance of policy (15) for choosing ~m.
Appendix Proof of Lemma 2. The proof is by induction on m, since (6) is valid for m = 0 with an equal sign. Assume therefore (6) is valid for m. Hence from (5) and Lemma 1 t . IIs-II s ~< r
~ ' ) . e'.
Since/~,+ 1 >t 0, we may write
~.+ 1[(~ - ~ ' ) s- - t. IIs" [Is]/> 0, i.e., from (1) tim+ 1(~ - rim+ 1). s" >i 0. Then Lemma 2 follows from (16), Lemma 1 and (3).
Proof of Theorem 1. The proof is trivial when ~m = 0. When/~, > 0,
IIs-II 2 - II~," II2 = ~. IIs - - ills + 2flm(S'-l" #m) <~0 provided (8) holds. Then
and from Lemma 2, the theorem follows:
Proof of Theorem 2. From (5) t . IIs'11 s ~< ~ - w . < 2 ( ~ - w . ) .
From Lemmas 1 and 2, t . tls'll s < 2(~ - ~ ' ) . e'.
This may be written, since tm > 0, as
II~ - ~112 + t~ IIs-II s - 2tm(n - ~m)" sm < I1~ - ~112.
(16)
34
P.M. Camerini et al. / Improving relaxation methods by gradient techniques
References [1] P.M. Camerini, L. Fratta and F. Maffioli, "A heuristically guided algorithm for the traveling salesman problem", Journal of the Institution of Computer Science 4 (1973) 31-35. [2] P.M. Camerini and F. Maffioli, "Bounds for 3-matroid intersection problems", Information Processing Letters 3 (1975) 81-83. [3] P.M. Camerini, L. Fratta and F. Maffioli, "Traveling salesman problem: heuristically guided search and modified gradient techniques", to appear. [4] G.A. Croes, "A method for solving traveling salesman problems", Operations Research 6 (1958) 791-812. [5] H. Crowder, "Computational improvements for subgradient optimization", IBM Research Rept. RC 4907 (No. 21841). Thomas J. Watson Research Center (June 1974). [6] G.B. Dantzig, D.R. Fulkerson and S.M. Johnson, "Solution of a large scale traveling salesman problem", Operations Research 2 (1954) 393-410. 1-7] J.B. Dantzig, Linear programming and extensions (Princeton University Press, Princeton, N.J., 1963) ch. 23. [8] M. Held and R.M. Karp, "A dynamic programming approach to sequencing problems", SIAM Journal on Applied Mathematics l0 (1962) 195-210. [9] M. Held and R.M. Karp, "The traveling salesman problem and minimum spanning trees: part II", Mathematical Programming 1 (1971) 6-25. [10] M. Held, R.M. Karp and P. Wolfe, "Large scale optimization and the relaxation methods", in: Proceedings of the 25th Conference of the ACM, August 1972, pp. 507-509. 1-11] L.L. Karg and G.L. Thompson, "A heuristic approach to solving traveling salesman problems", Management Science 10 (1964) 225-248. [12] S. Lin, "Computer solution of the traveling salesman problem", The Bell System Technical Journal 44 (1965) 2245-2269. El 3] F. Maffioli, "Shortest spanning hypertrees", in: Symposium on optimization problems in engineering and economics, Naples, December 1974. [14] T. Motzkin and I.J. Schoenberg, "The relaxation method for linear inequalities", Canadian Journal of Mathematics 6 (1954) 393-404. [15] J.F. Shapiro, "A decomposition algorithm for integer programming problems with many columns", in: Proceedings of the 25th Conference of the ACM, August 1972, pp. 528-533. [16] P. Wolfe, M. Held and H. Crowder, "Validation of subgradient optimization", Mathematical Programming 6 (1974) 62-88.
Mathematical Programming Study 3 (1975) 35-55. North-Holland Publishing Company
THE MINIMIZATION OF CERTAIN NONDIFFERENTIABLE SUMS OF EIGENVALUES OF SYMMETRIC MATRICES
Jane CULLUM, W.E. DONATH and P. WOLFE IBM Thomas J. Watson Research Center, Yorktown Heights, New York, U.S.A.
Received Revised manuscript received 28 April 1975 Properties of the sum of the q algebraically largest eigenvalues of any real symmetric matrix as a function of the diagonal entries of the matrix are derived. Such a sum is convex but not necessarily everywhere differentiable. A convergent procedure is presented for determining a minimizingpoint of any such sum subject to the condition that the trace of the matrix is held constant. An implementation of this procedure is described and numerical results are included. Minimization problems of this kind arose in graph partitioning studies [8]. Use of existing procedures for minimizing required either a strategy for selecting, at each stage, a direction of search from the subdifferential and an appropriate step along the direction chosen [-10,13] or computationally feasible characterizations of certain enlargements of subdifferentials [1,6] neither of which could be easily determined for the given problem. The arguments use results from eigenelement analysis and from optimization theory.
I. Introduction
This paper is concerned with the problem of minimizing a certain convex but not necessarily differentiable function, the sum of the q largest eigenvalues of a real symmetric matrix as a function of the diagonal entries of the matrix, constrained only by the requirement that the sum of these entries is constant. Use of existing procedures for the minimization of a general convex function requires either a strategy for selecting directions and steps at each stage [10, 13] or computationally tractable characterizations of certain enlargements of subdifferentials [1, 6], neither of which can be easily determined for this particular function. The analog for convex functions of the method of steepest descent is known not to work in general.
36
J. Cullum et al. / Minimization of nondifferentiable sums of eigenvalues
The practical origin of this problem is sketched in the next section. The simplest problem of this kind (and not a practical one) comes from taking the entries of the matrix all zero. The function to be minimized is then just the sum of the q largest components of the vector d = (dl ..... d~) of diagonal entries, constrained by ~7= t dj = 0. The unique answer is d = 0, a point at which the function is not differentiable. In general, we must cope not only with the likelihood of nondifferentiability at a minimizing point, but also with the fact that our highly nonlinear objective function has no simple analytical expression. Fortunately, the objective function is convex, so that we can use, as in [1] and E6], some of the machinery of "convex analysis". After specifying notation in Section 3, we develop in Section 4 the essential properties of the objective function and find a computationally tractable description of its subdifferential. Section 5 is devoted to the algorithm itself. As discussed there, the procedure is motivated by the method of steepest descent; its outstanding feature is its use of computationally tractable enlarged subdifferentials to yield a convergent and implementable procedure. The proof of convergence is given in Section 6. Section 7 is devoted to a more detailed discussion of the computational steps, while Section 8 gives some results from using the algorithm on a problem of reasonable test size. This report is a condensed version of another [3]; we have, for conciseness, omitted a discussion of the possible failure of steepest descent on a general convex function, some details on the implementation of the algorithm, and a large number of test results. As presented in Section 5 the minimization procedure is applicable only to the specific class of problems described. However, the basic idea in this algorithm of adaptively modifying the subdifferentials to anticipate any nondifferentiability has general application.
2. Origin of the problem The problem studied here--that of minimizing the sum of the q algebraically largest eigenvalues of a symmetric matrix, as a function of the diagonal entries--arose in a problem of graph theory: partition the n vertices of a given graph into q sets of specified cardinalities ml < ... < mq in such a way as to minimize the total number of edges connecting different sets. In problems of practical interest, lsuch as the packaging of computer circuitry, the graph may have as many as 2 000 vertices, while 2 < q < 20.
J. Cullum et al. / Minimization of nondifferentiable sums of eioenvalues
37
A procedure due to Donath and Hoffman [7] for obtaining a suboptimal partitioning uses the eigenelements of a matrix A + D* that achieves the following minimization. - 89 i=
mi2i(A+D): D diagonal, Tr(D) =
.
(2.1)
In [8] Donath and Hoffman showed that this minimum is a lower bound on the minimum number of connecting edges in all such partitions. In (2.1), the n-order matrix A is the incidence matrix of the given graph excepting along the diagonal, where Au = - ~ i , ~ Aii, so it is real, symmetric and usually sparse; 2~(A + D) is the ith eigenvalue of A + D with the ordering 2~(A + D) > ... > 2,(A + D) and Tr(D) is the trace of the matrix D. In the sequel we assume that m~ is independent of i, in which case the problem posed in (2.1) can be written
min{~2~(A+D):Tr(D)=O, Ddiagonal}.i=
(2.2)
This corresponds to a partition into q groups of equal cardinality. We note that the only property of A required below is its symmetry; its provenance in a graph problem has no bearing on the minimization procedure. The sparsity of A makes the eigenvalue, eigenspace computations required by the algorithm feasible for large values of n.
3. Notation and definitions Given any d = (dl .... , d,)~ E", D is the corresponding diagonal matrix of order n with Du = d~ 1 < i < n. A is a fixed real, symmetric matrix of order n. The quantities 2j(A + D), 1 < j < n are the n eigenvalues of A + D arranged in descending order: 21(A + D) > 22(A + D) > ... > 2,(A + D). The objective function is q
f(d)= j =~ l " 2j(A + O).
(3.1)
Given the ordering of the eigenvalues of A + D above, we define the integer r(e) by supposing 21(A + O), ..., 2q_ a-r~)(A + D) to be all of those eigenvalues greater than 2q + e, and define
38
J. Cullum et al. / Minimization of nondifferentiable sums of eioenvalues
Yl(d, e) = {Yl ..... Yq -1 -rtO} as a set of corresponding eigenvectors. The integer s(e) is then defined by supposing that Aq-r~(A + D) ..... 2q+~lo(A + D) comprise all the eigenvalues lying in the closed interval + o ) - e,
q(a + D) + el,
and Y2(d, e) = {Yq-r<~), ..., Yq+,~)} is defined as a set of corresponding eigenvectors. (If eigenvectors are not unique, we choose an arbitrary orthonormal basis from the corresponding eigenspace.) The numbers r(e) and s(e) are respectively called the "interior" and the "exterior" e-multiplicities of 2q(A + D). The e-multiplicity of 2q(A + D) is r(e) + s(e) + 1. If X = {x 1,.., Xm} is any set of vectors, then T(X) is the vector defined by T(X)i = ~ xj2,
1< i< n
(3.2)
j=l
(x~i is the ith component of xi). sp{X} is the linear manifold spanned by the set X. H~ is the set of all a x b matrices with orthonormal columns. 0f(d) is the subdifferential o f f at d [15]. By definition, Of(d) = {u: f ( d + Ad) - f(d) > (u, Ad> for all Ad~E"}. (3.3) (a, b> denotes the inner product of the two vectors a and b. Any u ~ Of(d) is subgradient o f f The directional derivative o f f at d in the direction w, f'(d, w) = lim [(f(d + t w) - f(d))/t]. ttO +
(3.4)
A finite-valued convex function on E" is continuous and has a directional derivative at each point in each direction. Moreover, for each d, Of(d) is nonempty, convex, compact and for each w f'(d, w) = max <w, u>: u ~ Of(d).
(3.5)
Relation (3.5) expresses the directional derivative in terms of the support function of the subdifferential and is used extensively in the development of the minimization procedure.
J. Cullum et aL / Minimization of nondifferentiable sums of eigenvalues
39
For any set G, conv G is the convex hull of G. For any matrix E, Tr(E) is the trace of E. For any vector z, z r is the transpose of z, and Ilzll is the Euclidean norm of z. e will always denote the vector (1, 1, ..., 1).
4. Properties of the objective function In this section we develop some of the properties of the function f in (3.1) that we want to minimize. Recall that q < n.
Theorem 4.1. ( i ) f i s bounded below on the set C = {d: (d, e> = 0}, (ii)The intersection of C with any level set {d: f(d) < oe} o f f is bounded, (iii) f assumes its minimum value on C. Proof. (i) Suppose that, on the contrary, f ( d k ) ~ - ov for a sequence {dk} c C. Then 2 i ( A + D k ) ~ - - G o for i > q , and so T r ( A + D k ) = ~7= 1 2i(A + Dk) ~ -- oo, a contradiction since Tr(A + Dk) = Tr(A) on C. (ii) Since f is bounded on the intersection and Tr(A + D) is constant there, ~7>q 2i(A + D), and hence each eigenvalue 2~(A + D) is bounded on the intersection. Let Ck = A + Dk, then by the Hoffman-Wielandt theorem [17], i=1
i=1
Thus, since the 2i(Ck) are uniformly bounded for all dk ~ C, so are the 2i(Dk) which range over all components of vectors dk ~ C. (iii) This is an immediate consequence of (ii) and the continuity off. Lemma 4.2 was proved in [8] using inequalities from matrix theory. Lemma 4.2 [8]. The function f is convex. Proof. Ky Fan [11] has shown that for each d, f(d) = max
(xj, (A + D ) x j ) : X = (xl .... ,Xq) ,
(4.1)
J
where X is any set of orthonormal vectors. For fixed X the summand in (4.1) is a linear function of d, s o f i s the pointwise supremum of a family of convex functions and hence is convex [15].
J. Cullum et al. / Minimization of nondifferentiable sums of eigenvalues
40
For the rest of this section the vector d and the corresponding diagonal matrix D will be fixed, as will the associated eigenvalues 2j = 2j(A + D), 1 < j < n. Y = {Yl ..... y,} is any corresponding orthonormal basis of eigenvectors of A + D. For each such Y, Y1 = Y~(d, 0) is the set of eigenvectors in Y corresponding to eigenvalues exceeding 2q--see Section 3 - and Y2 = Y2(d, 0) is the set of eigenvectors in Y corresponding to 2q. Let r and s denote respectively the interior and exterior multiplicities of 2q. Lemma 4.3. The maximand of(4.1), q
<(A + D) x j, x j>, j=l
attains its maximum value on a set X = { x l . . . . . xq} of orthonormal vectors if and only if
X ___ sp{Y,, Y2}
and
sp{Y,} ___ sp{X}.
Proof. Since Y is a set of orthonormal vectors, for each 1 < k < q, XR = ~,3= 1 <XR, yj> yj. Let trj = ~ = 1 <Xk, yj>2. Then )-', (Xk, (A + D) x , ) = )-" k=l
k=l
2 i <x,, yj)2 + 2q q j=l
q
<Xk, yj)2 k=l
j=l
q
= E &k+ E ( - - 2 d + 2 q ) [ 1 - - t r j ] k=l
j=l
+ j=q+l
k=l
Since each term in the j sums is nonpositive, equality holds if and only if each term vanishes. Thus, f o r j > q + s and 1 < k < q, (Xk, Y~> ---- 0, SO each Xk~Sp{Y1, Y2}. Similarly, for e a c h j < q - r - 1, trj = ~ = 1 (Xk, Yj> 2 = 1 so each such y j ~ s p { X } . N o w let M be the family of all 'maximizing' sets of orthonormal vectors X = {xl ..... xq} described by Lemma 4.3. Define G(d) = { u ~ E": u = T ( X ) for some X ~ M }.
(4.2)
Lemma 4.4. Of(d) = c o n v G(d).
Proof. By a theorem of Danskin [5], the directional derivative o f f at any point d in any direction w is given by
J. Cullum et al. / Minimization o f nondifferentiable sums o f eiflenvalues
f'(d, w) = max {(u, w>: u e G(d)}.
41
(4.3)
By (3.5) and (4.3), f'(d, w) is the support function of the set G(d) and of the convex set Of(d). Therefore, Of(d) = conv G(d). L e m m a 4.5 expresses G(d) in terms of any orthonormal set of eigenvectors Y = {Yl ..... y,} of A + D. Lenuna 4.5. For any orthonormal set of eigenvectors Y = {Yl ..... y,} of
A+D, f.lr+ 1 G(d) = {u: u = T(Yt(d,O)) + T(Y2(d,O)H)for some ,,l-I e--r+s+l}.
(4.4)
Recall that, Yl(d, 0 ) = {Yl, ..-,Yq-,-1} and Y2(d, 0 ) = {y~_~..... y~+~}, where r and s are the interior and exterior multiplicities of 2q(A + D), and H ,r §+ s + l is the set of all (r + s + 1) x (r + 1) matrices with orthonormal columns. Proof. Clearly, for any H e ,t-t,+ , , + s +1 l the set of vectors { Yl(d, 0), Y2(d, 0)H} is orthonormal and by Lemma 4.3 is in M. Conversely, let u e G(d) in (4.2) be generated by X = {Xl .... , xq} e M. Since sp{ Y1, Y2} - sp{X} _ sp{ }'1 }, there is a matrix Ve Hg such that X V = { Yl(d, 0), Z} witlf'sp{ Z} _c sp{ Y2} ~.irr + 1 and the columns of Z orthonormal. Therefore, there exists an H e ,,,+~+ 1 such that Z = Y2(d, O)H. Observe that the components of u, ui = ~ = 1 x2~, 1 < i < n arethe diagonal entries of the matrix X X T. Since V is orthogonal, X V v T x T = X X T. Therefore, X and X V generate the same vector. That is, every u e G(d) has a representation T(YI(d, 0)) + T(Y2(d, O)H) for some H = ~tJtr+s+ t-/r+ 1 1" Lemmas 4.2 to 4.5 yield the following characterization of the differentiability off. Theorem4.6. For any
real symmetric matrix A, the function f = ~,~= 1 21(A + D) is differentiable except at those points d such that 2q(A + D) = 2q+ I(A + D).
ProoL By Lemma 4.4, f is differentiable at a point d if and only if G(d) contains exactly one vector. Let u e G(d). By Lemma 4.5, for any Y there is t_tr + 1 1 such that some matrix n e --,+s+
u -- T(Y~(d, 0)) + T(Y2(d, O)H).
42
J. Cullum et al. / Minimization of nondifferentiabte sums of eigenvalues
Casea. If2q+l(A + D) ~ )~q(A + D), then s = 0, r + 1 = r + s + 1 and H is an orthogonal matrix. But then, as in the proof of Lemma 4.5, the sets { Yl(d, 0), Y2(d,0)} and { Y~(d, 0), Y2(d, 0)H} generate the same vector. Thus, any u = T(YI(d, 0)) + T(Y2(d, 0)). So G(d) consists of the single vector, q
ui = ~ y f,
1 <_ i < n.
(4.5)
j=l g/r+ 1 Case b. Let 2q+ I(A + D) = 2q(A + D). Then s > 0 and for any /,,4 ~--r+~+ 1, sp{ Y2(d, 0)} #- sp{ Y2(d, 0)H}. In particular, G(d) contains all the vectors. q-1
ui = ~ yj2 + (yqicosO + y~q+a)isinO)2,
1 <_ i < n
j=l
forO
J. Cullum et al. / Minimization of nondifferentiable sums of eioenvalues
43
5. A minimization procedure for problem (2.2) Curry's proof of convergence [4] of the method of steepest descent for differentiable functions rests upon the continuity of the directional derivatives off. General convex functions need not have continuous directional derivatives. Some of the procedures proposed for minimizing nondifferentiable convex functions simulate this continuity by using suitable adaptively modified versions of the subdifferentials of the given function, In particular, if the procedure in Demyanov [6] were applied to problem (2.2) the subdifferentials 0f(d) would be enlarged to sets S(d, e) obtained by including in the set G(d) in (4.2) all orthonormal sets that are e-maximal; that is, yield the maximum value to within _ e. If the procedure in Bertsekas and Mitter [1] were used, the set S(d, e) would be the e-subdifferential o f f at d, that is,
S(d,e) = O~f(d) = {u: f(d) > f(d) + (u,-d - d) - e for all d}. In both procedures, non-local information about the function f is required to generate these sets, making them difficult to characterize. Convergence is proved using continuity properties of these sets. The procedure presented in this paper requires only local information, namely the subdifferential of f at the current iterate. What seems to be required for convergence of such a procedure is the construction of a family of sets S(d, e) for suitable e > 0 which approximates 0f(d) in the following sense: If d k --* d, then any u ~ 0f(-d) is the limit of some subsequence Uk ~ S(dk, e). Conversely, limit points of sequences Uk ~ S(dk, e) are in Of(d). The sets S(d, e) that we construct satisfy those requirements and also are fully characterized by computable quantities. It is not difficult to show, moreover, that for any d and r / > 0, S(d, e) ~_ Oj(d) for all sufficiently small e and all d sufficiently close to d. By Theorem 4.6, the only potential sources of trouble are points d with )],q+l(A q - D ) = ,~q(A q - D ) . As indicated earlier, in the applications described in [8], a minimization point of f will often be such a point, at which by Theorem 4.6,fis not differentiable. The proposed descent procedure has the ability to anticipate the multiplicity of 2q at a minimizing point. If at an iterate dk, the multiplicity of2q(A + Dk) is 1, but 12q+I(A + Dk) -- ,~q(h -t- Dk) I is 'small', then the procedure will treat 2q(A + Dk) as though it were an eigenvalue of multiplicity at least 2. In particular the subdifferential o f f at dk will be computed under this assumption. We now specify the procedure. As before, at any point d, let Y =
44
J. Cullum et aL / Minimization o f nondifferentiable sums o f eigenvalues
{Yx, ..., Y,} be a complete set of orthonormal eigenvectors for the matrix A + D. For any e > 0, define
G(d, 8) = {u 9 u = T(Yt(d, e)) + T(Y2(d, e)H) for some
H e
H,(~)+ rt~)+s(,) a + * }. (5.1)
The sets Yj(d, e ) j = l, 2 as well as the t-multiplicities r(e) and s(e) were defined in Section 3. Clearly, G(d, e) ~_ G(d). Let S(d, e) = c o n v G(d, e), and P S(d, e) and P G(d, e) denote the corresponding projections of these sets onto the constraint <e, d> = 0. By Caratheodory [9] for each e >_ 0, P S(d, e) = cony P G(d, e).
Sum of eigenvalues algorithm (SEV) At iteration k, dk and ek > 0 are given. Define
trk = tr(dk, ek) =
min
max
.
Ilwll-< I ueP S(dk, gk)
(5.2)
O"k is the minimal ek-first order change in f a t d k o n the space <e, d> -- 0. Step 1. Determine a direction of movement Wk in <e, d> -- 0 such that crk =
max
ucP S(dk, ek)
.
Step 2. Perform a one-dimensional search. Choose ~ to minimize q~(~) -- f(dk + O~Wk),0 < ~. Step 3. Update the parameter ek. If o"k < ek, set 8k+ 1 = ek" If ak --> -- ek, set ek+ 1 ---- ek/a, where a is some fixed number larger than 1. Step 4. Terminate when {0} e P Of(dk). Otherwise return to step 1. -
-
Remarks. In equation (5.2), if ek = O, ak is the minimal first order change i n f a n d Wk is the direction of steepest descent at dk in the plane = 0. Bertsekas and Mitter [1] have demonstrated that in (5.2) Wk = --zJIIz~ll, where Zk is the vector of m i n i m u m n o r m in the set P S(dk, ek). In Section 7 details of the computation of Wk are given along with comments about the other steps in the algorithm.
6. Proof of convergence
The convergence theorem is stated in Theorem 6.5. Lemma 6.1. Let f b e continuous and Gateaux differentiable. Let the sequences
{dk}, {Wk} satisfy f(dk+l) < f(dk + t Wk), 0 < t < T, Wk ~ Woo and dk ~ do~.
J. Cullum et al. / Minimization o f nondifferentiable sums o f eigenvalues
45
Then f'(d~, w~) >_ O. Proof. By (3.4),
f'(d~, w~) = lim [f(doo + t woo) - f(d~)]/t. t~O +
But for any t ~ [0, T], f ( d ~ + t w~) = limk f(dk + t Wk) > limk f(dk+ 1) = f(d~), since f is continuous. The result is now clear. The following L e m m a is in Wilkinson [17]. Lemma 6.2. Let the sequence (dR}, k E K have the limit doo, and let ~ be any eigenvector for 2 ~(A + DR~ 1 <- j < n, k ~ K. Then for each j, ).j(A + DR) 2j(A + D~), and any limit point of the sequence {~}, k ~ K is an eigenvector for 2j(A + D~). Theorem 6.3. I f {dk} ~ do~ for k ~ K, then for any e > O, any u ~ P ~f(doo) is the limit of a sequence {Uk} with Uke P S(dk, e). Proof. By L e m m a 4.4, for any d, P df(d) = P conv G(d) = c o n v P G(d). Let r and s be the interior and exterior multiplicities of 2q(A + D~). Let yk be an ordered orthonormal basis of eigenvectors of A + Dk. Let yk(oo) denote the first q - r - 1 vectors in yk and y2k(oo)denote the next r + s + 1 vectors. By L e m m a 6.2 there exists a subsequence K' _ K and a set of vectors {Y1, Y2} such that for k e K ' , {Yk(oO)} ~ Y1, {Yff(~)} ~ Y2, and { Y1, II2} is an orthonormal set of eigenvectors of A + D~ corresponding to 2j{A + O~o), 1 < j < q + s. + 1 By L e m m a 4.5, # ~ P G(d~o) if and only if for some .~./. . .~. T-,/r,+s+ 1,
g = T(Y1) + T(Y2H) - e q/n where e = (1, 1,..., 1). Clearly, {Ok} -o #, k e K' for
gk = T(yk(~176 + T(yk(~
H) -- e q/n.
But, by Lemma 6.2, for any e > 0, for large k, Yl(dk, e) contains at most q -- r -- 1 vectors and Y2(dk, e) contains at least r + s + 1 vectors. Therefore, for large k the set
P Gk(OO) = ( U : U = T(yk(oo)) + T(yk(oo)H) - e q/n for some H e/-/~, ++1s+ 1}
(6.1)
is contained in P G(dk, e). The desired result follows immediately since any
J. Cullum et al. / Minimization of nondifferentiable sums of eioenvalues
46
u ~ P Of(d~) has the representation n+l
n+l
u= Y~#,~j, #j->0, j=l
~#j=l,
gj ~ P G(d~).
(6.2)
j=l
T h e o r e m 6.4 is the converse of T h e o r e m 6.3. Together these theorems provide the weak continuity arguments needed in the p r o o f of convergence of the minimization procedure. Theorem 6.4. For k ~ K let {dk} ~ d~o, Uk 6 P S(dk, e), {Uk} --* U~o. I f e < e* = m i n { [ 2 / -
)],j[" "~i ~- "~j, eigenvalues
o f A + Do},
then u~ ~ P 3 f(d~o). Proof. Each Uk = ~"~+=l #ktgkb where gkt e P G(dk, e), #k, >-- O, ~,"Z+l #k, = 1. By compactness there exists a subsequence K ' _~ K such that for 1 _< l _< n + 1 and k e K'{Pkt} --' #t and {gkt} ~ g~. Clearly, it is sufficient to prove that each g~ e P Of(doo), since uo~ = ~_-+ 1 #tgl. Let r and s be the interior and exterior multiplicities of 2~(A + Do~). By L e m m a 6.2, since e < e* for large k, Yl(dk, e) has q - r - 1 m e m b e r s and Y2(dk, e) has r + s + 1 members. By compactness and L e m m a 6.2, there exists a further subsequence K" ~_ K' such that Yl(dk, e)--, I"1 and Y2(dk, e)--, II2, where Y1 is an o r t h o n o r m a l basis for the eigenspace of A + Doo corresponding to 2j(A + Doo), 1 _ j ___ q - r - 1 and similarly I12 corresponds to 2~(A + D~o) = I~(A + D), q - r _< j _< q + s. By L e m m a 4.5, for each 1 -< l < n + 1 and k ~ K", gkt = r(Yl(dk, e)) + T(Y2(dk, e)Hu) u , + 1 . 1. But there exists K"' c K" such that for each e q/n for s o m e Hkl e--r+s+ _ l _< n + 1, Hkl--* ,u, t e n- t ~,r+ Therefore, {gu} --* gl = T(YO + T(Y2H,) 1< + s + 1 1.
= e q/n. By L e m m a 4.5, gt ~ P G(d~), 1 < I < n + 1. Therefore, uoo ~ P Of(d~o). T h e o r e m s 6.3 and 6.4 are used to prove the following convergence theorem.
Theorem 6.5. (i) The sequence {dR} of iterates generated by the sum of eigenvalues algorithm is bounded. (ii) f(dk), k = 1, 2 . . . . converges to the minimum value o f f on the space (e, d ) = 0. (iii) Any limit point do~ of the iterates is a minimizing point o f f on (e, d ) = 0.
J. Cullum et al. / Minimization o f nondifferentiable sums o f eigenvalues
47
Proof. The boundedness of the iterates is a consequence of Theorem 4.1. (iii) is an immediate consequence of (ii) and the continuity of f, so consider (ii). We first show that {sk} ~ 0. Otherwise, for large k, trk< - 5 k = - e #- 0. Let d* be any limit point of the iterates. Let K' =__K and w* be such that for k e K ' , {dk} ~ d* and {Wk} --" W*, where Wk is the direction of search used at step k. Since the sets P S(dk, 5) are uniformly bounded, for some R and large k ~ K', max (u, u~P S(dk, e)
w*) <
(7k -~- R Iw :~ - - Wkl "~
-5/2.
(6.3)
Let u be any member of P Of(d*). By Theorem 6.3 for some K" ___ K' and uk ~ P S(dk, 5), {Uk} --* U for k e K". By construction,
( u , w * ) = lim(uk, w*) < lim k~K"
max
k~K" gEP S(dk,e)
( g , w * ) < -5/2.
Since u is arbitrary and (w*, e) = 0, f'(d*, w*) < - 5 / 2 . But, this is a contradiction to Lemma 6.1 since f is convex, and for successive integers k _ l in K' and any t >_ 0,
f(dk) < f(dt+ x) < f(dl + t wz). Therefore, ek ~ 0. Since ek ~ 0, there is a subsequence K' _ K such that r > --ek. Thus, for some subsequence K" _ K' and some point d*, ak ~ 6 --> 0 and dk ~ d*, k ~ K". Now, we need only to prove that for any direction w with (e, w) = 0, f'(d*, w) >__O. By construction, for any w with Ilwll = 1, ~k<-
max
u~P S(dk, ek)
.
Thus, there exist, gk ~ P S(dk, 5k) with (Ok,W) >__ O"k. Choose K"' ___ K" such that gk ---)"g*" Since 5k --~ 0 for large k, gk ~ P S(dk, 5*), where 5" -- m i n { [ 2 i - )~)[: 2i ~- 2), 2i eigenvalues of A + D* }. So by Theorem 6.4, g* ~ P Of(d*). Therefore, for any (w, e) = 0, f'(d*, w) >_ (g*, w) >_ 0, so d* is a minimizing point of f on (d, e) = 0. Since f is continuous, and f(dk), k ~ K is m o n o t o n e decreasing, f ( d k ) ~ f(d*), the minimum value of f on (e, d) = 0. Therefore, any limit point of iterates is a minimizing point of f
48
J. Cullum et aL / Minimization of nondifferentiable sums of eigenvalues
7. Implementation of the algorithm Use of the algorithm requires the repeated computation of the q algebraically largest eigenvalues and associated sets of eigenvectors of real symmetric matrices. In the applications described in [-7] these matrices are large (n > 500) but very sparse, approximately 5 nonzero entries in each row and column. Because of the large amount of storage space required by similarity methods only eigenelement procedures, such as simultaneous iterations [16], that do not explicitly modify the given matrix can be used. For the matrices in [,7], the algebraically largest eigenvalues are usually not largest in magnitude and the matrices are not usually definite. A typical eigenvalue configuration is 0.01, -0.015, - 0 . 0 2 ..... - 1 0 , and one expects to encounter multiple eigenvalues near a minimizing point. Application of a simultaneous iteration scheme to such a matrix requires heuristic matrix shifting (A ~ A - aI, a a scalar) to achieve magnitude dominance of the desired eigenvalues, and there is no guarantee of convergence to the desired eigenelements. Single vector Lanczos procedures [-14] that do not require dominance are available but other types of difficulties such as losses of orthogonality between the vectors generated arise. A block Lanczos algorithm suggested by W. Kahan was developed for the eigenelement computations. The use of blocks alleviates the problems of orthogonality associated with the single vector Lanczos schemes and allows multiple eigenvalues to be handled easily. Since it is a Lanczos procedure the desired eigenvalues need not be dominant. The sparsity of the matrices was used to make the matrix-vector multiplications required by the method very efficient. At each iterate dk the direction of search Wk is a solution of the following min-max problem: min
max
Ilwll-< 1 u~PS(dk,~k)
(u, w).
To simplify the notation, let P S(k) = P S(dk, ek) and P G(k) = P G(dk, 8k). By [-1], wk = -z /llz ll, where z~ is the element in P S ( k ) c l o s e s t to the origin. By [,12] this element can be determined computationally if contact points of P S(k) can be computed. A point u* ~ P S(k) is a contact point of P S(k) corresponding to the direction w, if the plane through u* with normal w is a supporting hyperplane of P S(k). Algebraically, (u*, w) = max (u,w). ueP S(k)
(7.1)
The following lemma demonstrates the ease with which contact points of
J. Cullum et al. / Minimization of nondifferentiable sums of eigenvalues
49
P S(k) can be computed. Fix k, let Y = {Yl ..... y,} be any ordered orthonormal basis of eigenvectors for the matrix A + Dk. Let r(k) and s(k) denote, respectively, the interior and exterior ek-multiplicities of ~,q(A + Dk). Let t = q - r(k) - 1 a n d m = r ( k ) + s ( k ) + 1. Lemma 7.1. For any direction w, a contact point u of P S(dk, ek) correspondin9 to w can be computed from any ordered set Z = {zl ..... Zr(k)+1} of orthonormal eioenvectors of the symmetric m • m matrix M, where
Mkj = ~ Y(j+t)iYtk+t)iWi,
1 < k , j < m.
i=1
In fact, for any such set Z, the vector u = T(Yl(dk, ek)) + T(Y2(dk, ek)H) -- e q/n, where H is the matrix whose fh column is z i, is a contact point of P S(dk, ek) corresponding to the direction w. Proof. From the linearity in (7.1) and the convexity of P S(k), the maximum in (7.1) must occur on a generator u ~ P G(k). By construction, u ~ P G(k) only if for some H ~ H~ k) § 1,
u = T(Yx(dk, ek)) + T(Y2(dk, ek)n) -- e q/n. Let ~lo, 1 < l < m, 1 < j < r(k) + 1 be the entries in H. Then (u, w ) = ~ i=1
yj2 + j=l
E t=l
rhu-oY;,
-q/n
w,.
(7.2)
\j=q-rtk)
The only variables in (7.2) are the orthonormal vectors t/j, 1 < j < r(k) + 1, so only the second summation must be considered in the maximization of (7.1). A rearrangement of this sum yields the expression, ~, /=1
j,k=l
th~
Yu+t) iY(k+t)iWi
?hk
9
(7.3)
i=1
If we define M(w, y) to be the real, symmetric d x d matrix whose j, k element is
~ Y(j+t) iYtk+t) iWi,
(7.4)
i=1
then (7.3) can be written as r(k) + 1
Z 1=1
Qh, M (w, Y)n,).
(7.5)
50
J. Cullum et al. / Minimization of nondifferentiable sums o f eigenvalues
Lemma 4.3 is applicable with A replaced by M(w, y) and q by r(k) + 1. The maximum of (7.5)over all sets Z = {zl ..... Z,tk)§ of orthonormal vectors is assumed when Z is any set of vectors satisfying the conclusions of Lemma 4.3. For any such set Z = {r/1..... t/,~k)+1}, let H = (the), 1 _< I < m, 1 <_ j < r(k) + 1, be the matrix whose j,h column is t/j. Then
u = T(Y1 (dk, ek)) + T(Y2(dk, ek)n) -- e q/n is a contact point to P S(dk, ek) corresponding to the direction w. In particular, Z may be any ordered set of eigenvectors of M(w, y) corresponding to the r(k) + 1 algebraically largest eigenvalues. Lemma 7.1 demonstrates that a contact point of P S(k) is easily computed from the eigenvectors of a small m x m real symmetric matrix. Since, for any w and d,
f'(d, w) = max (u, w): u ~ 0f(d),
(7.6)
(7.2) and Lemma 7.1 were also used to compute the directional derivatives of f used in the line search. The computation in (7.5) was used in conjunction with the algorithm in [12] to determine each direction of search. Since at dk, the desired direction Wk = -- Zk/N Zk II' Zk ~ P S(k) satisfies o-k = .-[I Zk I[, the direction computation Zkj, j = 1, 2. . . . was terminated at the first iterate z~a such that max ( u , z ~ )
u~P S(k)
+ IlZks]l < ek.
This adaptive criterion worked very well. If for some J, ][zkj ]1 < •k, then '0' ~ P S(k) and the procedure reduced ek to ek/2, set P S(k) = S(dk, eJ2) and repeated the direction finding step. If at some stage ek = emi,, the smallest value of e allowed, and iIZkj [I < eminfor some J, the minimization algorithm terminated, since to within working accuracy we had a point dk where ' 0 ' e P Of(dk). The procedure in [12] will sometimes produce iterates that zigzag into the vector of minimum norm, this happened mainly on the first few iterations of the SEV algorithm when ek was 'large'. Zigzagging was detected by comparing the values of tkj = maxu~p S~k~(U, Zkj;' at alternating points, i.e., [tkj -- tk (j+ 2)[
and
Irk (j+ 1, -- tk (j+
3)[.
If at any j + 3, these 2 quantities were both less than e,nin, this indicated a very flat zig-zag approach to the vector of minimum norm, and the direction
J. Cullum et al. / Minimization of nondifferentiable sums of eioenvalues
51
finding procedure terminated and took the best direction obtained thus far. After the direction of search Wk was determined, it was necessary to minimize f along the ray d = dk + CtWk, ~ > O. The convexity of the function, especially the monotonicity of the directional derivative of f along this ray, was used explicitly in these searches. Line derivatives can be easily computed using (7.3) since the relevant eigenelements of A + Dk + eWk are available. (In the block Lanczos algorithm the required vectors are computed simultaneously with the eigenvalues). The initial step aa in the search was taken to be -ek/f'(dk, Wk). This would yield an ek decrease in f if f(dk + ~Wk)= f'(dk, Wk)~ + f(dk). A sequence of quadratic fits is constructed using the 2 currently most appropriate points and the 3 most appropriate values from the associated function and derivative values. Extrapolation and interpolation are allowed at each stage and some information from the best point obtained thus far is always used in the next quadratic fit. Once a 2 point bracket is obtained for the minimizing point, note that this can always be achieved since the derivative of f along dk + ew, is a monotone increasing function of ~, a final quadratic is fit. The best point in this fit is taken to be the next iterate dk+ 1. The stopping rule was mentioned in the discussion of the direction finding procedure. The condition {0} E P 0f(d) is a necessary and sufficient condition for d to be a minimizing point of f on the set (d, e) = 0. However, it is not practical numerically since the subdifferentials are not continuous functions of d. A simple example is the function f ( x ) = Ix I, x E E 1. Clearly f = max(x, - x ) , {0} e 0 f ( 0 ) b u t Of(x) = {1} for any x > 0. However, if one enlarges Of(x) to S(x, e) generated by using t-optimal forms of f then for Ix] < e,
{O}eS(x,e) = { - 1 < u _< 1}. Thus, the termination criterion used was { 0 } ~ P S(dk, emin),
where emin was determined by the accuracy to which one could compute the eigenvectors used in the computation of the sets S(dk, ek).
8. Numerical tests
A few results of tests on small graphs are presented here. Computations using much larger graphs (500 to 1300 vertices) have been made.
J. Cullum et al. / Minimization of nondifferentiable sums of eigenvalues
52
Detailed descriptions of 4, 20-node graphs and of the results of numerical tests on these graphs and on a 90-node graph for a practical device are given in [3]. Typical results are presented in Tables 2 and 3. In each table, ITN = iteration number; F C N = value of the function Aq, the sum of the q algebraically largest eigenvalues; e-GRAD = norm of the direction of search; FE = cumulative number of function evaluations; 5 = relaxation parameter; 5min minimal allowed value of 5; RT, Re, (ST, $5) = the interior and exterior multiplicities of 2q(A + D) obtained using 5minand 5, respectively. =
Table 1 Description o f 20node4 graph Node
Connecting nodes
1 2 3 4 5 6 8 9 10 11 12 13 15 16 18
4,5,6,7,20 4, 10, 12, 15, 17 4, 5, 10, 13, 16, 19 7,8,11,12 10, 11, 12 13, 14, 18 9, 16, 17 11, 12, 14 19, 20 14, 15, 19, 20 15, 18, 19 15, 19 18 17 20
The 20-node graph used in these tables is given in Table 1. In Tables 2 and 3, f = A2. The numbers in Table 2 were obtained using the sum of eigenvalues (SEV) algorithm. Those in Table 3 come from using the convex analog of the method of steepest descent, obtained from the SEV algorithm by setting 5k -- emin for all k. The SEV algorithm terminated at iteration 23 with '0' ~ P S(d23, 5mi,), where 5min 2 x 10 -4. At this iteration 21 = 0.288, 22 = -2.22280, 2a = -2.22296, 24 = -2.22306 and 25 = -2.708. Thus, at the minimizing point o f f = A2, the interior multiplicity r of22(A + D23) ~-
J. Cullum et al. / Minimization of nondifferentiable sums of eigenoalues
53
Table 2 20node4,f = A2, SEV Algorithm, eo = 0.1, e,,l, = 0.0002 ITN 0 1
3 5 7 9 11 13 15 17 19 21 23
FCN -1.145 - 1.428 - 1.807 -1.907 -1.928 -1.931 -1.932 -1.934 -1.934 -1.934 a -1.934 -1.934 -1.934
e-GRAD
3.9 1.4 2.4 1.1 7.3 6.3 7.6 9.1 6.3 9.5 6.2 4.3
• x x x • x x x • x
10-1. 10-1 10 -1 10 -1 10 -2 10 -a 10 -2 10 -2 10 -2 10 -2
•
10 -2
x 10 -4
FE 1 4 10 17 23 28 35 41 46 53 62 68 71
e
5 2.5 1.3 6.3 3.1 1.6 7.8 3.9 2 2
10-1 10-1 t0 -2 10 -2 10 -2 10 -3 10 -3 10 -3 10 -4 10 -4 10 -4 10 -4
• x x x x x x x x x
RT
RE
ST
Se
0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 1 1 2
0 1 1 1 1 2 1 1 1 1 1 2
Changing in the 6th digit. Iterates converged on iteration 23. a p p e a r s t o b e 0 a n d t h e e x t e r i o r m u l t i p l i c i t y s a p p e a r s t o b e 2. T h e S E V a l g o r i t h m s e t R e = 0 a t e a c h i t e r a t i o n a n d s e t Se s u c c e s s i v e l y e q u a l t o 0, 1, 1 , 2 , 1 , 2 , 1 , 2 , 1, 1 , 2 , 1, 1 , 2 , 1 , 2 , 1 , 2 , 1, 1, 1, 1 , 2 . AS c a n b e s e e n i n T a b l e 3, t h e s t e e p e s t d e s c e n t a l g o r i t h m w i t h 2 e x c e p t i o n s s e t b o t h R e a n d Se e q u a l t o 0 a t e a c h i t e r a t i o n . T h u s , it w a s n o t a b l e t o r e c o g n i z e t h e n u m e r i c a l n o n d i f f e r e n t i a b i l i t y o f A2 a n d c o n s e q u e n t l y c o n v e r g e d v e r y s l o w l y . B y i t e r a t i o n 23 a r e d u c t i o n by iteration 3 of the SEV algorithm
of f equivalent only to that obtained had been obtained.
Table 3 20node4,f = A2, Steepest descent, ek = erain = 0.0002 ITN
FCN
0 1 3 5 7 9 11 13 15 17 19 21 23
-1.145 -1.428 -1.563 -1.721 -1.729 -1.732 --1.733 --1.733 --1.770 -1.790 -1.791 -1.804 -1.805
e-GRAD
3.9 3.3 2.7 2.3 2.1 2.0 1.9 1.6 2.3 2.2 3.9 1.2
• • • • x X X X x x x x
10 -1 10 -1 10 -1 10 -1 10 -1 10 -1 10 -1 10 -1 10 -1 10 -1 10 -1 10 -1
FE 1 5 9 16 20 26 34 40 48 54 58 66 73
e
2 2 2 2 2 2 2 2 2 2 2 2
• • • •
10 -4 10 -4 10 -4 10 -4
X 10 -4
x X X x • x •
10 -4 10 -4 10 -4 10 -4 10 -4 10 -4 10 -4
RT
Re
ST
Se
0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 1 0 0 0 1
0 0 0 0 0 0 0 1 0 0 0 1
54
J. Cullum et al. / Minimization of nondifferentiable sums of eigenvalues
For each test, the initial value of e, Co, was set equal to a (2. - ,~l)/n, where 2, - 2, was an estimate of the spread of the eigen4alues of A + D o, obtainable from the block Lanczos computations, n was the order of A and a was a smudge factor usually taken to be 89 Thus, eo was approximately 89 of the average spacing of the eigenvalues in the initial matrix A + Do. In the tests Do = 0. Tests were made with different starting values for e and the convergence seemed relatively insensitive to the values chosen. Behavior similar to that displayed in Tables 2 and 3 was observed on all the nondifferentiable problems tested. Several problems with f differentiable at the minimizing point were also run and in each such case it was still better to use the SEV algorithm than the convex steepest descent procedure. The reason for this is clear. If the eigenvalues surrounding 2q(A + D) are well separated numerically, then at some early iteration the SEV procedure will revert to the steepest descent procedure. If however, there are eigenvalues 2j(A + D) numerically close to 2q, the SEV procedure will recognize this and compensate for it by enlarging the associated subdifferential sets.
References [1] D.P. Bertsekas and S.K. Mitter, "Steepest descent for optimization problems with nondifferentiable cost functions", Proceedings of the 5th Annual Princeton Conference on Information Sciences and Systems, 1971/. ['2] Jane Cullum and W.E. Donath, ' A~l-ock generalization of the symmetric s-step Lanczos algorithm", IBM RC 4845 (May 1974). 1-3] Jane Cullum, W~E. IYonath and P. Wolfe, "An algorithm for minimizing certain nondifferentiable convex functions", IBM RC 4611 (November 1973). [4] Haskell B. Curry, "The method of steepest descent for nonlinear minimization problems", Quarterly of Applied Mathematics 2, 258-26 I. [5] John Danskin, The theory of max-rain (Springer, Berlin, 1964). ['6] V.F. Demyanov, "Seeking a minimax on a bounded set", Doklady Akademii Nauk SSSR 191 (1970) [in Russian; English transl. : Soviet Mathematics Doklady 11 (1970)
517-521]. ['7] W.E. Donath and A.J. Hoffman, "Algorithms for partitioning of graphs and computer logic based on eigenvectors of connection matrices", IBM Technical Disclosure Bulletin 15 (3) (1972) 938-944. ['8] W.E. Donath and A.J. Hoffman, "Lower bounds for the partitioning of graphs", IBM Journal of Research and Development 17 (5) (1973) 420-425. I-9] H.G. Eggleston, Convexity (Cambridge University Press, London, 1958). [10] Ju. Ermol'ev, "Methods of solution of nonlinear extremal problems", Kibernetika 4 (1966) 1-14. [1 I] Ky Fan, "On a theorem of Weyl concerning eigenvalues of linear transformations", Proceedings of the National Academy of Sciences of the U.S.A. 35 (1949) 652455.
J. Cullum et al. / Minimization of nondifferentiable sums of eigenvalues
55
[-12] E. Gilbert, "An iterative procedure for computing the minimum of a quadratic form on a convex set", S l A M Journal Control 4 (1) (1966) 61-80. [13] M. Held, P. Wolfe and H.P. Crowder, "Validation of subgradient optimization", IBM RC 4462 (August 1973). [-14] M. Newman and A. Pipano, "Fast model extraction in NASTRAN via the FEER computer Program NASTRAN: Users' experiences', Sept. 11-t2 1973, NASA Langley Research Center, Hampton, Virgina. [15] R.T. Rockafellar, Convex analysis (Princeton University Press, Princeton, N.J., 1970). [16"] H. Rutishauser, "Computational aspects of F.L. Bauer's simultaneous iteration method", Numerische Mathematik 13 (1969) 4-13. [17] J.H. Wilkinson, The algebraic ei#envalue problem (Clarendon Press, Oxford, 1965). [18] Philip Wolfe, "A method of conjugate subgradients for minimizing nondifferentiable functions, Mathematical Programming Study 3 (1975) 145-173 (this volume).
Mathematical Programming Study 3 (1975) 56-94. North-Holland Publishing Company
USING DUALITY TO SOLVE DISCRETE OPTIMIZATION PROBLEMS: THEORY AND COMPUTATIONAL EXPERIENCE* M.L. F I S H E R University of Per.nsylvania, Philadelphia, Pa., U.S.A.
W.D. N O R T H U P National Bureau of Economic Research, Cambridge, Moss., U.S.A.
and J.F. S H A P I R O M.L T., Cambridoe, Mass., U.S.A.
Received 6 February 1974 Revised manuscript received 7 April 1975 Meaningful dual problems have recently been identified for the integer programming problem, the resource constrained network scheduling problem and the traveling salesman problem. In this paper a general class of discrete optimization problem is given for which dual problems of this type may be derived. We discuss the use of dual problems for obtaining strong bounds, feasible solutions, and for guiding the search in enumeration schemes for this class of problems. Properties of dual problems and three algorithms are discussed, a primal-dual ascent algorithm, a simplicial approximation algorithm and an algorithm based on the relaxation method for solving systems of inequalities. Finally, computational experience is given for integer programming and resource constrained network scheduling dual problems.
1. Introduction T h e p u r p o s e of this p a p e r is to consolidate a n d extend recent advances in the use of m a t h e m a t i c a l p r o g r a m m i n g duality theory in discrete optim i z a t i o n and to r e p o r t on s o m e new c o m p u t a t i o n a l experience. In particular, meaningful dual p r o g r a m s have been identified for the integer p r o g r a m m i n g p r o b l e m [13], the resource constrained n e t w o r k scheduling * Supported in part by the U.S. Army Research Office (Durham) under Contract No. DA HCO4-73-C-0032.
M.L. Fisher et al. / Using duality to solve discrete optimization problems
57
problem [10, 11,1 and the traveling salesman problem [26, 27,1. The solution techniques developed to exploit the structure of these dual problems are similar and may be applied to a general class of discrete optimization problems. The general problem we consider is:
v(b) = min f(x), subject to g(x) < b,
(1.1)
x ~ X,
where X _ R n is a finite set, say X = {x'}~= 1, f(x) a real-valued function and g(x) a function with values in R m both defined over the set X. Note that f(x) and g(x) may always be extended to continuous functions in R n. We call problem (1.1) the primal problem. The dual problem is constructed in the usual fashion [15,1. For u > 0, define the Lagrangean function
L(u) = - u b + mixn(f(x ) + ug(x)}. We assume the set X is chosen so that the problem minx~x{f(x) + u g(x)} is much easier to solve than (1.1). Examples of algorithms which have been used for this problem are the "greedy algorithm" for finding a maximum weight basis of a matroid [7-1, or in its graph theory form, the minimum spanning tree algorithm, other algorithms from graph and combinatorial theory, and, for problems where one can prove strong theorems characterizing the form of an optimal solution, dynamic programming (e.g., shortest route algorithms) or branch-and-bound algorithms. It is well known and easily shown that L(u) < v(b) and the dual problem is to find the best lower bound; namely, find
w(b) = max L(u) subject to u > 0.
(1.2)
Clearly, we will always have w(b) < v(b), but with the non-convex structure of our primal problem there can be no guarantee that w(b) = v(b). It is often difficult to solve (1.1) directly and some search of the finite set X is required. Our purpose in constructing and solving one or more dual problems is to make this search efficient by exploiting the following properties: (1) The duals are easier to solve than the primal, (2) feasible dual solutions provide lower bounds to the primal problem objective function and those of any primal subproblems derived during the search,
58
M.L. Fisher et al. / Usin 9 duality to solve discrete optimization problems
(3) Lagrangean minimizations may yield optimal or good feasible solutions to the primal. The plan of this paper is the following. Section 2 is devoted to a review of the properties of dual problems that we use in the construction of algorithms for solving them. Included are constructive necessary and sufficient conditions that the Lagrangean function L is optimal at a point u _> 0. Section 3 contains a primal-dual ascent algorithm for the dual problem based on these conditions. Section 4 contains an approximation algorithm closely related to the primal-dual algorithm of Section 3 and based on Scarf's method of simplicial approximation [25, 36]. The nature of the simplicial approximation is given a precise interpretation, and sufficient conditions that the approximation is exact are presented. The following section, Section 5, contains a brief description of subgradient relaxation methods for solving the dual problem which have been studied and used by Held and Karp [27], and Held, Wolfe and Crowder [28]. Section 6 contains a general branch-and-bound schema for solving the primal problem (1.1) using the dual problem (1.2). The final section, Section 7, reports on new computational experience with these dual methods. Specifically, we give results for (1) the primal-dual ascent algorithm applied to the integer programming dual problem [-13], (2) a variant of the primal-dual algorithm and subgradient relaxation algorithms applied to a variety of resource constrained network scheduling problems [10].
2. Properties of dual problem In this section, we review the principal properties of the dual problem (1.2) which will be used in the remainder of the paper. Many of these properties can be found in [35] in somewhat different form. Our orientation is different in that we wish to solve the dual problem, either exactly or approximately, and we wish to do so whether or not a duality gap exists between the primal and dual problems. For notational convenience, let L(u, x') = - u b + f ( x ' ) + u 9(x') = f ( x t ) + u ),',
where xt s X and 7t = 9 ( x s) _ b. Then L(u) =
min t=l .... ,T
L(u, xt).
M.L. Fisher et al. / Using duality to solve discrete optimization problems
59
It is well known and easily shown that L(u) is a continuous concave function defined on R'. With this background, the dual problem (1.2) can be rewritten as the linear programming problem: Find w(b) = max w, subject to w < - u b u~O.
(2.1) + f ( x t) + ug(xt),
t = 1.... , Z
The corresponding LP primal problem is T
w(b) = min ~ 2tf(xt), t= 1
(2.2)
T
T
2 2t9(xt) < b, t=l
).t ~ 0,
Z 2t = 1, t=l
t=
1 . . . . . T.
Unfortunately, the LP problem (2.1) cannot be efficiently solved as an ordinary LP problem because the number of rows T is quite large in most applications. Moreover, only a small number of the rows will be binding at an optimal solution. One method of solution of (2.1) is to apply the generalized programming algorithm of Dantzig and Wolfe [4, Ch. 24] to (2.2) or equivalently the dual cutting plane algorithm [41] to (2.1); see also [31 ]. There are two reasons why the generalized programming algorithm is unsatisfactory for solving problem (2.1). First, it does not provide monotonically increasing lower bounds for the branch and bound algorithm of Section 6. In addition, the algorithm's performance has been inconsistent; for example, in [34, p. 240], Orchard-Hays says: "Nevertheless, the D-W (generalized programming) algorithm presents difficulties, and overall experience with its use has been mixed". It is important to emphasize, however, that the generalized linear programming algorithm is extremely important as a concept and that its discovery around 1960 anticipated the later developments in duality theory. In the next section we will present an algorithm which generates feasible solutions to (2.1) with monotonically increasing objective values. This algorithm is derived from necessary and sufficient conditions given in this section that a solution is optimal in the dual problem. The algorithm converges finitely to an optimal solution to the dual and it serves as the basis for other approximate and heuristic methods. The function L(u) is not differentiable everywhere but directional
60
M.L. Fisher et al. / Using duality to solve discrete optimization problems
derivatives exist in all directions at any point u ~ R m. Directions with positive directional derivative are the ones to use in ascent algorithms. In order to make these ideas operational, we need some definitions. An m-vector ~ is called a s u b g r a d i e n t of L at ~ if L(u)
for allu_>0.
Any subgradient of L at ~ points into the half-space containing all optimal solutions to the dual problem; namely, {u: ( u - ~)~ > 0} contains all optimal dual solutions. Thus, it is plausible to construct ascent algorithms for the dual problem of the form u k§ 1 = u k + Ok?k, where ?k is a subgradient of L at u k and Ok satisfies L(U k A- Ok7k) = max L(u k + o])k). 0~<0
Subgradient relaxation methods discussed in Section 6 can be viewed as an ascent algorithm of this general type, where Ok is chosen by a different rule. The theoretical difficulty is, however, that L may not increase in the direction yk although uk is not optimal and L increases in the direction of another subgradient. To put this difficulty in its correct perspective and to see how it can be overcome, we need the following characterization of all subgradients of L at a point, and a closely related characterization of directional derivatives of L. Let T(-~) = { t: L(-ff) = L ~ , x t) } ;
then 7t = g(x t) - b for any t ~ T(~) is a subgradient of L at ~. The collection OLN) of all subgradients of L at ~ is given by (see [22]) 8 L ( u ) = { 7:7=,~r(~)~ 2tTt, t~r(~) ~ 2'=1'2r>--0
f o r a l l t ~ T ( ~ ) } . (2.3)
Moreover, the directional derivative V L ( ~ ; v) of L at fi in the direction v is given by [22] V L ( ~ ; v) = min v 7'. t~ Tfff)
(2.4)
This last result establishes the existence of directional derivatives in all directions v ~ R m at all points ~ ~ R', as well as characterizing them. The following theorem, stated without proof, gives us the constructive mechanism for using the characterizations. In words, the theorem says that
M . L . F i s h e r et al, / Using d u a l i t y to solve discrete o p t i m i z a t i o n p r o b l e m s
61
a point ~ > 0 is optimal if and only if VL~; v) < 0 for all feasible directions v from ~. For any ~ > 0, let I(-fi) = {i: ~i
=
O}
and let IC(~) denote the complementary set. Theorem
2.1. The point ~ >_ 0 is optimal in the dual problem (1.2) /f and
only if
max VL(fi; v) = O, vEV
where V c_ R " is the set of points satisfying -1
O <_ vi <_ 1, < v i <_ 1,
i ~ l~), i e lC~).
We use (2.4) to construct a linear programming problem equivalent to max VLffi; v). This problem is max min v 7t, or more specifically, find a*, vEV
v c V t~T(ll)
where a* = max v, subject to v _< v ?t,
(2.5) t ~ T(-a),
- 1 < vi < 1,
i ~ IC(-6),
0 < vi < 1, i~ I(~), v unconstrained in sign.
Problem (2.5) clearly has an optimal solution. We will need the following dual linear program to (2.5), a*
min ~ sF + ~ S+i , i= 1 i~tr subject to ~ 2 , ? I - s i - +s~ + = 0 , ~-
(2.6) i = 1.... ,m,
t ~ T ffl)
2,=1, tr T(u-)
2,>0,
t~T~),
s7 > 0 ,
s~+ > 0 ,
i = 1,...,m.
Let v*, v* = a* denote an optimal solution to (2.5), and let 2* denote an optimal solution to (2.6). Theorem 2.2 below gives necessary and sufficient conditions that ~ > 0 is optimal in the dual in terms of the dual linear programs (2.5) and (2.6). The proof of theorem 2.2 is self contained and does not rely on the charac-
62
M.L. Fisher et al. / Using duality to solve discrete optimization problems
terizations (2.3) and (2.4) except for the obvious fact that a convex combination of subgradients is a subgradient. The difficult part of (2.3) is to establish that all subgradients can be generated as convex combinations of the 7 t, t ~ T(-fi). Theorem 2.2. The dual solution ~ > 0 is optimal in the dual problem (1.2) (equivalently (2.1)) if and only if a* = 0 in the dual L P problems (2.5) and (2.6). Proof. Sufficiency: If a* = 0 in (2.6), then the subgradient t~ T(~)
satisfies 7* < 0, i z I(fi), ~* = 0, ir lCC~). Thus, for all u > 0,
L(u) < LCu) + (u - -~)~* = LCu) + ~ (u, - -fii)?*
~
= L(fi) +
i~lC~)
uiy* < L(fi).
i~I(~
The-first inequality follows because 7" is a subgradient, the first equality because ~* = 0, i ~ IC(fi), the second equality because ui = 0, i 6 I(~) and the second inequality because ~* < 0, i~ I(-fi) and ui > 0. This establishes sufficiency. Necessity: We prove necessity by showing that a* > 0 in (2.5) implies is not optimal. If a* > 0, then v*7 t > 0 for all t 6 T(fi) and therefore v* ~ 0. Consider now the problem max LCu + Or*). 0>/0
If the maximum of L(u) along the half-line fi + Or* for 0 -> 0 is greater than L(fi), or if L(u) is u n b o u n d e d along the half-line, then clearly fi is not optimal in the dual problem and the theorem is proven. Suppose then that for all 0 _> 0,
LC~ + Or*) < LC~).
(2.7)
We will show a contradiction. For 0 > 0, we have L(~ + Or*) = rain f ( x 9 + -~ 7t + Ov*~,', t=l
.....
T
= min f ( x 9 + -u 7t + Ov*~'~, tr T(~)
where the second equality follows from (2.7) and the fact that v*y t > 0 for all t ~ T~). Consider a sequence {0 k}k=l ~ such that Ok --* 0 +. Since the
M.L. Fisher et al. / Using duality to solve discrete optimization problems
63
number of x t is finite, we can select 01 sufficiently small that there will be an x s, s r T(fi), satisfying for all k,
Lffi + Okv:#) = f ( x s) + ~ 7s + Okv*7s.
(2.8)
Taking limits on both sides of (2.8) and using the continuity of L, we have L~) = f ( x s) + fi7 s or s ~ T(-fi), a contradiction. This proves necessity and the theorem. Suppose fi is optimal in the dual problem (2.2). It can easily be shown that 2 ~ X satisfying g ( 2 ) < b is optimal in the primal problem (1.1) if L(fi) = L(~, 2) and fi(g(2) - b) = 0. There is no guarantee, however, that such an ~ exists. Even if one does, it might still be quite difficult to find it since the number ofx ~ X satisfying L(~) = L(fi, x) may be large. In Section 6, we show how the dual problem can be used in a "fail-safe" manner as an analytic tool in a general branch and bound algorithm. The problems (2.5) and (2.6) are the ones to use to solve the dual problem in the linear programming form (2.1) or (2.2) using the primal-dual simplex algorithm. The set T~) can be extremely large, however, and it should be generated as needed. Details are given in the next section. There are some additional observations to be made about the dual problem. First, it is possible to compute an upper bound on the optimal dual objective function value w(b). The following result, stated without proof, will be used in the branch and bound algorithm in Section 6. Corollary 2.3. Consider the points x 1, ..., x K ~ X and suppose there exist non-negative weights 21 .... , ~,r satisfying ~ = 1 2k = 1 and ~kr= 1 g(xR) 2k -< b, then K
w(b) <- ~ f ( x k) 2k. k=l
A common primal problem manipulation which can be combined with duality to form many duals is relaxation; see [16] for a greater discussion of relaxation in discrete optimization. A relaxation of (1.1) is
v'(b) = min f(x) subject to gi(x) < hi,
(2.9)
i ~ I',
x ~ X',
where I' ___ { 1, ..., m}, X _ X'. Clearly, v(b) > v'(b) and a dual problem of the form (1.2) with maximum value w'(b) constructed from (2.9) bears the
64
M.L. Fisher et al. / Using duality to solve discrete optimization problems
same relationship to (1.1) as does (1.2) (i.e., v(b) > v'(b) > w'(b)). There is, of course, an increased likelihood that v(b) > w'(b~ or more generally that v(b) - w'(b) > v(b) - w(b) > O, but the larger duality gap may be compensated by the shorter time to calculate w'(b). Reference [I2] describes an application, where this phenomenon occurs. A more complex form of relaxation, called data relaxation, has been found for integer programming [20]. This type of relaxation consists of a change in the functions O~(x) - b~ so that subsequent Lagrangean minimizations can be more efficiently accomplished. We remark that it is possible to use the duality theory developed here to try to minimize or reduce the error introduced by the data change (see [20, ~4]).
3. Primal-dual ascent algorithm for dual problem In this section we give a primal-dual ascent algorithm for solving the dual problem in the LP form (2.2). The algorithm we propose is an adaptation of the primal-dual simplex algorithm applied to (2.2) where the large number of points in X are handled implicitly. The algorithm may also be interpreted as an ascent algorithm in which the successive restricted primal problems solved in the primal-dual for a given u value generate a subgradient which may be used in a manner similar to the gradient in differentiable ascent algorithms to determine a direction of ascent. Our algorithm bears a close resemblance to the algorithm of Grinold in [23] for large scale linear programming and also to the algorithm of Bradley in [3]. It was demonstrated in Theorem 2.2 that the point fi > 0 is optimal in (2.1) if and only if there exists a convex combination y* of the subgradients ~t, t ~ T(~), such that 7" = 0 if ~ > 0 and 7~ --- 0 if ~ = 0. The algorithm proceeds by generating successive yt, t ~ T(-ff) until either fi is proven optimal in the dual by the test of Theorem 2.2 or a new ~ # ~ is found such that L~) > L(~). Thus, we consider a generic step of the algorithm beginning at a point > 0 and a set T'(~) ___ Tfa). Recall that I(-a) = {i: ~ = 0}. We solve the following LP problem which is simply problem (2.6) with a restricted number of columns: rtl
t r * = min )-" s/- + i= 1
subject to
~ IcT'('~)
~
s~/, (3.1)
i~tct~)
2,7 ~-s~- + s + = 0 ,
i = 1..... m,
M.L. Fisher et al. / Using duality to solve discrete optimization problems
65
2,=1, teT'(-~)
,~,>0,
teT'(fi),
si- > 0 ,
s+ > 0 ,
i = 1,...,m.
The dual of (3.1) is: a* = max v, subject to v < v ?',
- 1 ~vi<~ 1, 0_
t e T'~),
(3.2a)
i e Ir
(3.2b)
i e I~).
(3.2c)
Let a* be the optimal solution value in (3.1) and (3.2), and let 2*, t e T'~), v* be optimal values of the primal and dual variables. It should be clear that the direction v* produced by (3.2) is somewhat arbitrary; for example, the ranges on the v i in {3.2b) and (3.2c) except vl -> 0 in {3.2c) could be adjusted to reflect differences in scaling on different rows. By Theorem 2.2, ~ is optimal in the dual if a* = 0. We describe a generic step of the algorithm for the case a* > 0. Notice from (3.1) and (3.2) that ?* = ~t~r'(~)2"? t is a subgradient such that by complementary slackness v*~* = a* > 0 and an increase in the Lagrangean function is indicated in the direction v* namely, for 0 >- 0,
L ~ + Or*) <_ LC~) + Or*v* = L ~ ) + Oa*.
(3.3)
The primal-dual algorithm proceeds by finding the maximal value of 0 such that equality holds in (3.3). Let VN) = {i: v* < 0}; note that V(~) _c IOfa).The search for a maximal 0 with the desired property is confined to the interval [0, 0maJ, where 0ma~= m i n { ~ d - v * : i s V~)} > 0. This interval is the one in which u remains non-negative. It is possible, of course, that V(h-)= 0 and 0max = + ~ . As long as w(b) < + ~ , 0,,ax = + ~ does not cause any real difficulty because we can easily devise a finite search of the half-line ~ + Or*, for 0 > 0, for the maximum of L along it. If w(b) = + ~ , then a finite search to discover this fact can be constructed because there are a finite number of pieces in the piecewise linear function L being maximized. In this regard, it is important to mention that when the primal-dual ascent algorithm is used in conjunction with branch and bound as described in Section 6, we can also impose the restriction 0max< (~ - LC~))/a*, where ~ is the incumbent cost
66
M.L. Fisher et al. / Usin 0 duality to solve discrete optimization problems
as defined in Section 6. In our discussion henceforth Of the primal-dual ascent algorithm, we assume 0max< + ~ . Let 0* ~ [0, 0m,~ denote the maximal value of 0 such that L(~ + Or*) = Lfu) + Oa*. The procedure for finding 0* is the following. Let 01 = 0max and at iteration k, compute L(fi + OkV*) = Lf6 + Ok?)* , X k)
for some x k ~ X.
Notice that L(-d + Ok?)*) = L(~, x k) + OkV*Yk. IfL(~ + OkV*) = LC6) + OkV*7*, then Ok = 0* and the search terminates. If L(~ + OkV*) < Lf6) + Okv*?*,
(3.4)
then compute L~, x k) - L ~ ) Ok+ 1 =
?)*7* - - ?)*~)k "
(3.5)
The denominator in this fraction is positive because of (3.4) and the fact that L(-6, x k) _> L~). If0k+ 1 = 0, then set 0* = 0 and the search terminates. If Ok+l > 0, the basic iteration is repeated. We show now that (i) the solution of the LP (3.1) defined at ~ + O'v* can begin with the optimal basis for (3.1) at fi, (ii) the value of the objective function at ~ + O'v* will be (barring degeneracy) less than a*. Finite convergence of the primal-dual algorithm is thereby assured by the usual simplex criterion (see [38; pp. 128-134]). There are three cases to consider. Case l: 0 < 0* < 0m~x- In this case, we want to show that the optimal basic variables from (3.1) at ~ remain in the problem at K + O'v* and with the same objective function coefficients. First, for t E T'~) such that 2t is in the optimal basis for (3.1), we have by complementary slackness (see (3.2)) that v*? t = a*. This implies that t ~ T(-6 + O'v*) since by construction, Lfd + O'v*) = L(-U-) + O'a* = L(~) + O'v*? t = L(~ + O'v*, xt).
To complete the argument, we need to consider the s7 and s + variables. They clearly remain in (3.1) at ~ + O'v* and can be in a starting feasible basis. We must simply show that the objective is unchanged for the s + that are basic. For i z i ~ we have that ICfu)~_ ICf~ + O'v*) because 0 " < 0m~x; the coefficient of s + for such an i thus remains equal to 1. Consider s + for i ~ I(-6). If s + is basic in (3.1), then by complementary slackness,
M.L. Fisher et al. / Using duality to solve discrete optimization problems
67
v* = 0 which implies i ~ I(~ + O'v*) and the objective function coefficient remains equal to zero. In short, we use the previously optimal basis as a starting feasible basis for (3.1) at ~ + O'v* and the objective function value equals tr*. To show that this value of a* can be reduced for (3.1) at fi + O'v*, note that 0* > 0 means we have found an x ~ X such that L~, x ~) > L ~ ) = L(~, x t) for 2, basic in the starting basis. By constructio n, L(~ + O'v*) = L(fi, x ~) + 0*v*7~ = L(fi) + O'a* and therefore v*y" < tr*. In other words, the LP reduced cost relative to the starting basis of the subgradient 7~ of L at fi + O'v* is v*7~ - a* < 0. The activity 7' should be pivoted into the starting basis and (barring degeneracy) the objective function will be decreased. Case 2: 0* = 0max. This case is the same as case 1 except for the possibility that for some i e Ic~), we have s~ basic and ui + O*vi = 0 implying i e I ~ + O'v). If this is so, t h e objective function coefficient of s~- needs to be changed from 1 to 0 which (again barring degeneracy) causes a reduction in the objective function. Problem (3.1) is then reoptimized. Case 3: 0* = 0. In this case, we have found an s ~ T(-~) such that L~, x ") = L~) and v'y* - v*?s > 0 (see (3.5)). Thus, v*~~ < tr*, we have s ~ T~) T'(~) and the non-basic 7~ can enter the basis causing (barring degeneracy) the objective function to be reduced. The primal-dual rule of moving to the next breakpoint of the L function in the direction v* ensures convergence of the ascent algorithm. It is also reasonable to move to the maximum of L in the direction v*, but then convergence cannot be established. This is a curious point which needs more study from both a theoretical and computational viewpoint (see also [-23, p. 452]).
4. Simplicial approximation algorithm for dual problem In this section, we present a method for approximating an optimal solution to the dual problem (1.2) by approximating the optimality conditions of Theorem 2.2 as expressed by problem (2.6) with a* = 0. The motivation for this approximation is the slow increase in the dual objective function value due to small ascent steps using the primal-dual algorithm on some problems. The simplicial approximation algorithm takes steps of fixed size which can be preselected. Since Scarf's original work in reference [36] on fixed point approxima-
68
M.L. Fisher et al. / Usino duality to solve discrete optimization problems
tion, there have been a number of important theoretical and computational extensions. For expositional reasons, however, we will ignore these extensions here. Our method draws on some of the construction of Hansen and Scarf for simplicial approximation in nonlinear programming [-25, pp. 14-18]. The major conceptual difference between our method and theirs is that we are approximating optimal dual solutions while they are approximating optimal primal solutions. Wagner [39] has also applied simplicial approximation to nonlinear programming. It is convenient to bound the region in which the simplicial approximation algorithm searches for an approximate solution by adding the constraint ~'=1 ui < U for U sufficiently large. Lemma 4.1 gives sufficient conditions under which a valid U can be computed. Lemma 4.1. L e t x k, k = 1, ..., m, be points in X such that gi(Xk) ~ b i f o r all i and gk(X k) < b k. Then without loss o f optimality the constraint ~ = 1 ui < U can be added to (4.1), where ~,
f ( x k) - Wo
v =
and w o is a lower bound on w(b) (e.g., w o = L(u) f o r some u > 0).
Proof. Let u*, w* = w(b) be any optimal solution in (2.1). We have u~(bk -- gk(Xk)) < ~ u*(bi -- gi(xk)) --< f ( x k) -- W* < f ( x k) -- WO, i=1 or
u~ < -
f ( x k) - w o bk -- gk(Xk) '
k=
1,...,m.
Summation over i gives us ~7'=, u* < U. Thus, the addition of the constraint ~T= ~ ui < U does not cut off any optimal solution of (2.1) and the lemma is proven. With the addition of the bounding constraint, the dual problem (2.1) becomes w(b) = max
subject to
w, w < - u b + f ( x ' ) + u g(x'),
~
i=1
ui < U,
u > 0.
(4.1) t = 1,..., T,
M.L. Fisher et al. / Using duality to solve discrete optimization problems
69
The necessary and sufficient condition for optimality of fi in (4.1) from Theorem 2.2 becomes a* = 0 in a*=min
~ st+
subject to
~
si+,
i= 1
i~IC(u~
~
2t~-si-+s~
(4.2) +-0=0,
i = 1,...,m,
t~ T ( ~
~
2,=1,
t~T(~)
2i > 0,
t e T(fi),
sT>O,
s+>0,
i=
1 . . . . . m,
where m
0=0
.if ~ f i i <
U,
i=1
0_>0
if ~ R i =
U.
i=1
For technical reasons, it is convenient to add a z e r o th c o m p o n e n t to the u vector and consider our search to be over the simplex
This modification holds only for this section. In [36], Scarf selects points to form a set P = {u ~ u 1, ..., uK}, where u k ~ S, k > m and where u ~ = (0, Mo,..., Mo), u I = (M1, O, Mx,..., M1), u" = (M,, ..... M,,, 0) and Mo > M1 > ... > M,, > U. In addition, the uk are chosen so that u/k1 r u~2 for any i and any kl ~ k2, k~ > m and k2 > m. With the points uk so chosen, Scarf defines a primitive set u k~ . . . , u k" to be one such that there is no u k satisfying u~ >
min
(u~J),
i = 0 . . . . , m.
j=O, 1 . . . . . m
A primitive set describes a subsimplex {u: u i >I minju~, i = 0, ..., m} in which no other u k lie. F r o m a computational viewpoint, the simplex S can be partitioned into a regular grid of subsimplices and moreover, the points uk can be generated as required (see [25]).
70
M.L. Fisher et al. / Using duality to solve discrete optimization problems
The main tool in Scarf's simplicial approximation algorithm is the following result. Lemma 4.2. [-36, p. 1332]. L e t u k~..... u k" be a primitive set, and let u k~ be a specific one o f these vectors. Then, aside f r o m one exceptional case, there is a unique vector u k different f r o m u k~ and such that uk~ Uk~- ', Uk, Uk~+ 1, . .~ Uk", f o r m a primitive set. The exceptional case occurs when the m vectors u k', i ~ ct, are all selected f r o m u ~. . . . . u s, and in this case no replacement is possible.
To apply this result to solving approximately (4.1), we need to apply another theorem which extends the above result. Theorem 4.3. [36, p. 1341]. L e t
/1 A=
0
0
ao,,,+l
ao.K\
am,m+t
a,n,K
" 0 0 0
0
1
)
be an (m + 1) x (K + 1) matrix and b T = (bo . . . . , bin) a (m + 1) non-negative vector, such that the k th column o f A is associated with the k th vector uk ~ P. A s s u m e that the set o f non-negative vectors y, satisfying A y = b, is bounded. Then there exists a primitive set u k~ ..., u k~ so that the columns ko, ..., k,, f o r m a feasible basis f o r A y = b.
The theorem is applied to (4.1) by specifying the appropriate matrix A and vector b. The matrix A is given below and the appropriate uk are shown above their associated columns. U0
UI
U rn
Um + l
UK
1
0
0
1
1
0
1 0
y~'+~ + 1
?~c + 1
0 1 7ss + ~ + 1
?sg + 1
A =
0
0
)
M.L. Fisher et al. / Using duality to solve discrete optimization problems
71
where 7 k is any subgradient of L at u k. The vector b x = (1 ..... 1). Clearly, the set {y: A y = b, y > 0} is bounded in this case. We sketch the constructive proof of Theorem 4.3 because it implies an algorithm which we need to discuss in some detail. After the discussion of the algorithm we will give a discussion of the results of the algorithm and their interpretation as an approximation to the optimality conditions of Theorem 2.2. The algorithm begins with the set u ~ u 1, ..., um which is not a primitive set. A primitive set is uniquely formed by replacing u ~ with the vector u k*, where =
max
k>m
We then take the feasible basis corresponding to the first m + 1 columns of A and pivot in column k*. This pivot is the ordinary simplex pivot. If the column of A corresponding to u ~ is pivoted out of the starting basis, then the algorithm terminates. If another column is pivoted out of the starting basis, say column ki, then the primitive set u k*, u 1. . . . , u m and the new basis correspond except for u i and the zero th column of A which is still in the basis. In this case, we remove u ~ from the primitive set, and replace it with a unique u ~ P according to Lemma 4.2. If u ~ replaces u i, then the algorithm terminates. Otherwise, the algorithm continues by repeating these same operations. At an arbitrary iteration, we have a primitive set u k~ u kl, . . . , u km and a feasible basis corresponding to the columns 0, kl . . . . . kin. S t e p 1. Pivot column ko into the basis. If column 0 is pivoted out of the basis, terminate the algorithm. Otherwise, go to step 2. S t e p 2. If column k, was pivoted out of the basis in step 1, take u k' out of the primitive set. If u ~ comes into the primitive set, terminate the algorithm. Otherwise, repeat step 1. The finiteness of the algorithm is due to the uniqueness of the changes at each of the above steps a n d the finiteness of the set P and the possible bases of A y = b. Thus, the algorithm implicit in Theorem 4.3 terminates with a primitive set u k~ . . . , u k" and a non-negative basic solution of A y = b using the basis columns k0,..., km of A. Let s0,..., sm denote the variables corresponding to the first m + 1 columns of A, and let 2m+ 1, ..., 2r denote the remaining variables. Let ~ and ~[k denote the specific values of these variables in the terminal solution; we will show that the ~'k and any u in the final simplex satisfy an approximate sufficient condition for optimality that tr* -- 0 in
72
M.L. Fisher et al. / Using duality to solve discrete optimization problems
(4.2). The condition is approximated because the subgradients in the desired convex combination are subgradients of L chosen not at a single point but from rn + 1 points which are close together. There are two cases to consider. Case 1 : 0 r {ko, ..., k.,}. In this case, we have s0 = 0 and row 0 o f A y = b gives us K
Z
L = 1.
(4.4)
k=m+l
Also, in this case ~7'= 1 < u so we require 0 = 0. For row i, i = 1, ..., m, we have K
si+
)--
(~/k+l)2k=
1,
k=ra+ 1
and using (4.4), we have K
si+
~
yik~k = 0 .
(4.5)
k=ra+ 1
If si > 0, then u ~ is in the final primitive set and u k~ ~ 0, j = 0 .... , m and ~kX=,.+ 1 Y~k~k= --Si < 0 is the appropriate approximate optimality condition for row i. On the other hand, if u ~ is not in the final primitive set, st is not a basic variable implying ~ = 0 and ~kr=.,+ 1 ~'ik~k = 0 which is again the appropriate approximate optimality condition. Case 2 : 0 ~ {k 0. . . . , kin}. In this case, the algorithm has progressed from the region of S, where ~7'= 1 ui ~ 0 to the b o u n d a r y ~7'= 1 us ~ U. By the definition of primitive sets, there must be some u ~, 1 < l _< m, not in the final primitive set implying g, = 0. Since the right-hand side on row I is 1, there must be at least one ~k > 0 and therefore ~kr=.,+ 1 ~'k > 0. We norrealize the non-negative weights to sum to one by taking k ~
K
~k
k=m+
1,...,K.
E k = ra + 1 ~[k
F r o m row 0, we have ~ = , . + l ~ [ k = 1 --30 < 1, where the inequality is strict (barring degeneracy) because Do is basic. F r o m row i, we have K
K
k=m+ l
k=m+ l
M.L. Fisher et at. / Usino duality to solve discrete optimization problems
73
and dividing by ~ = , , + 1 ~[k, 1
X
~[k~ , +
~
S0
~,~[k--
2k=O"
(4.6)
Since the uk~,j = 0, 1. . . . , m, in the final primitive set all satisfy ~7'= 1 u~~ ~ U, it is appropriate to allow the non-negative quantity 3o on every row i, i = 1, ..., m. As before, ~i > 0 implies u i in the final primitive set and the condition ~ = m + 1 ~'~-k - ff -< 0 is appropriate. The above analysis establishes the nature of the approximation in terms of approximating optimality conditions. It is also natural to ask: what is an approximately optimal solution t7 in (4.1)? It is reasonable to use any point a in the terminal simplex described by uk~. . . . , u*-. The best choice, however, appears to be any optimal solution in the linear programming problem ~(b) = max w, subject to w < - u b + f ( x ' ) + u g(x'), ~u~
(4.7) t~
u>O,
i=1
where T is the index set of the yt corresponding to basic 2 variables in the terminal basic feasible solution. We have argued above that ~ contains at least one t. L e t a be an optimal solution in (4.7) and compute L(tT). It is easy to see that t7 is optimal in the dual problem (4.1) if ~(b) = L(t2). This is because L(~) <- - a b + f ( x t) + ~ #(x t) for all t and therefore the constraints from (4.1) omitted in forming (4.7) are satisfied a fortiori. In any case, ~(b) > w(b) and v~(b) - L(a) > 0 is an upper bound on the objective function error of stopping with a. Let us now consider briefly the convergence properties of the simplicial approximation as the uk, k > m are increased in number and the diameter of the subsimplices defined by primitive sets approaches zero. For r = 1, 2, ..., let u k~ ..., u kin'" denote the defining points for the terminal primitive set at the r ta partition of S. As r goes to infinity, the diameter of these subsimplices converges to zero, and they contain a sequence with a subsequence converging to a single point u*. Any such point must be optimal in (4.1). To see this, recall that there is only a finite number of distinct columns possible
74
M.L. Fisher et al. / Using duality to solve discrete optimization problems
for the system A y = b. Thus, there is some basic set repeated infinitely often in the converging subsequence. Since the sets Q~ = {u: u > 0 and L(u) = L(u, x')} are closed and convex, the approximate optimality conditions represented by this fixed basic set become exact at u*.
5. Subgradient relaxation methods for dual problem In this section we discuss the very interesting work of Held and Karp [27] and Held, Wolfe and Crowder [28] on the application and extension of "The relaxation method for linear inequalities" of Agmon [1] and Motzkin and Schoenberg [32] to the traveling salesman and other optimization problems. We will refer to this method as subgradient relaxation to distinguish it from the data and problem relaxation methods discussed in Section 2. Our discussion in this section will be brief and the development will follow the development of [27] incorporating an observation from [28]. The idea behind subgradient relaxation is to generate a sequence {u'} of non-negative dual variables by the rule Ur+l
=
max{u~. + 0,77,0},
i = 1..... m,
where ?r is any subgradient of L at ur and 0r is a positive constant to be chosen from a specified range. There is no guarantee that L(u r+l) > L(u r) but if w(b) is known, the method can be made to converge (in a certain case, finitely) to a fi such that L(fi) > ~, where ~ is arbitrarily close to w(b). Of course, w(b) is generally not known although it can often be closely estimated. Another point worth noting is that in using the subgradient relaxation method in the branch and bound algorithm of Fig. 6.1, we try to take ~ to be the lower bound required for fathoming the subproblem by bound in step 5. In any case, the justification for the subgradient relaxation method rests with its computation effectiveness. The method has proven effective for the traveling salesman problem [27], the assignment problem and multi-commodity flow problems [28]. Consider now a generic step from the point u > 0 to the point u + 07 + 6, where 0 is some positive number to be chosen, ? is a subgradient of L at u andbi=0ifui+0?i>_0,6i= - u i - OTi > O if ui + O?i < O, i = 1. . . . . m. The following lemma stated without proof summarizes the results of [27, p. 9] and [28]. In words, it says that if a sufficiently small step in
M.L. Fisher et al. / Usin 0 duality to solve discrete optimization problems
75
the direction of the subgradient 7 is taken followed by a projection onto the non-negative orthant, then the new point u + 07 + 6 is closer to a maximal u* than u. Lemma 5.1 (Held, Karp and Wolfe). Let u > 0 with dual value L(u) be given. Suppose -~ is such that ~ >_ 0 and L(-d) > L(u). I f 0 is chosen to satisfy
0<0< then
- (u +
2(L(-fi) - L(u))
+ 6)[I <
- ull
This lemma indicates the type of step size required for convergence. Assume ~ < w(b) and let U~ denote the set of u ~ R m satisfying < - u b + f ( x t) + u g(xt),
t = 1. . . . , T,
u > O.
(5.1)
The subgradient relaxation method generates a sequence {u'} according to the rule: Stop if L(u*) > ~; otherwise, select u '+t such that -
U "+t i
=max{u~ + p,
117,112 ] 7 i , 0 l,
i = 1, ..,m.
(5.2)
The results of Agrnon [-1] and Motzkin and Schoenberg [-32] are concerned with convergence properties of this sequence. If p, = 2 for all r, the sequence always includes a point ute U~, whereas if 0 < e < p, < 2 for all r, the sequence includes a point ut e U~ or converges to a point on the boundary of U~. If strict convergence is not the primary goal there is great flexibility possible in the use of subgradient relaxation. For example, Held, Wolfe and Crowder [-28] have experimented with ~ > w(b) in (5.2) as well as other changes in the basic method. The advantages of subgradient relaxation over the primal-dual ascent algorithm of Section 3 is the elimination of computational overhead. One disadvantage is the absence of guaranteed monotonically increasing lower bounds. Moreover, when there are constraints other than u > 0 on the dual problem (1.2) (e.g., the IP dual in [13]), the subgradient relaxation method requires that the point u + 07 be projected m a non-trivial fashion onto the set of dual feasible solutions.
76
M.L. Fisher et al. / Using duality to solve discrete optimization problems
6. Branch and bound algorithm for discrete primal problem As we mentioned in the introduction, the nonconvexity of the set X makes it possible, and for some problems even likely, that there exists a duality gap. Nevertheless, dual problems can be usefully exploited in a branch and bound algorithm which searches over the finite set X for an optimal solution to the primal. In this section we describe the general form such a search would take. The approach can be viewed as a generalization of the branch and bound traveling salesman algorithm given by Held and Karp [27]. The search of the set X is done in a non-redundant and implicitly exhaustive fashion. At any stage of computation, the least cost known solution & ~ X satisfying 9(5r < b is called the incumbent with incumbent cost = f(~). The branch and bound algorithm generates a series of subproblems of the form v(b; X k) = min f ( x ) , subject to O(x) < b,
(6.1)
x ~ X k,
where X k c X . The set X k is selected to preserve the special structure of X used in solving minx~x f ( x ) + u O(x). If we can find an optimal solution to (6.1), then we have implicitly tested all subproblems of the form (6.1) with X k replaced by X ~ ~_ X k and such subproblems do not have to be explicitly enumerated. The same conclusion is true if we can ascertain that v(b; X ~) > ~, without actually discovering the precise value of v(b; Xk). If either of these two cases obtain, then we say that the subproblem (6.1) has been fathomed. If it is not fathomed, then we separate (6.1) into new subproblems of t h e form (6.1) with X k replaced by X t, l = 1..... L, and L
U X~ = x k ,
X 11 ca X '2 = O,
11 ~ 12.
1=1
We attempt to fathom (6.1) by solution of the dual problem w(b ; X k) = max w, subject to w <_ - u b + f ( x t) + u g(xt), t ~ T k, u >_ O, (6.2)
where T k ~ T is the index set of the x ~e X k. The use of (6.2) in analysing (6.1) is illustrated in figure 6.1 which we will now discuss step by step. Steps 1 and 2: Often the initial subproblem list consists of only one subproblem corresponding to X.
M,L. Fisher et al. / Usino duality to solve discrete optimization problems
Initialize [ subproblem list Select I I initial [ subproblem
Compute Lagrangean
/
~ot~L~..
solution from ~ . ~ y J Lagrangea_n 7 -
i
~ ~
Update ]incumbentJ
Y N 11 Separate subproblem ]
l"
subproblem
1 Fig. 6.1.
T~
SubproblemI I I fathomed J ]
77
78
M.L. Fisher et al. / Usin 9 duality to solve discrete optimization problems
Step 3: A good starting dual solution fi > 0 is usually available from previous computations. Step 4: Computing the Lagrangean is a network optimization problem; shortest route computation for integer programming, minimum spanning tree for the traveling salesman problem, dynamic programming shortest route computation for the resource constrained network scheduling problem, etc. Step 5: As a result of step 4, the lower bound L(fi; X k) on v(b; X k) is available, where LN; X*) = - ~b + mirn{f(xt) + ~g(x')}.
(6.3)
It should be clear that (6.1) is fathomed if L(fi; X*) > 2 since L(~; X*) _<
v(b ; xk). Steps 6, 7, 8: Let x s, s~ T k, be an optimal solution in (6.3) and suppose x~ is feasible; i.e., g(xs) < b. Since (6.3) was not fathomed (step 5), we have L(fi, X k) = f ( x ~) + -~(g(x~) - b) < 2, with the quantity ~(g(xs) - b) < 0. Thus, it may or may not be true that f ( x ~) < 3, but if so, then the incumbent should be replaced by x ~. In any case, if x~ is feasible, we have by duality theory
f ( x ~) + ~(g(x ~) -- b) < v(b; X k) <_ f ( x ~) and it should be clear that x ~ is optimal in (6.1) if-~(g(x ~) - b) = 0; i.e., if complementary slackness holds. Step 9: This may be a test for optimality in the dual of the current ft. Alternatively, it may be a test of recent improvement in the dual lower bound. Another possibility for evaluating whether or not to persist with the dual is to try to prove that w(b; X k) < 2 by the mechanism of Corollary 2.3 and therefore the subproblem will never be fathomed by bound. Specifically, this can be accomplished by finding x ~..... xRe X k such that ~-~R= 1 f ( x ' ) 2 , < ~ for non-negative 4, summing to one and satisfying ~ = ~ g(x r) 2r < b. If the primal-dual algorithm of Section 3 is being used, this can be readily accomplished by allowing any vectors g(x) - b for x ~ X to enter (2.6) rather than limiting the generating process to vectors g(x t) - b, t e T~). Finally, for some dual problems such as the IP dual, it is possible to persist with the dual beyond dual optimality by the addition of a cut to (1.1) (see Section 7.1). Step 10: The selection of a new fi > 0 depends upon the methods being used, and which of these methods have proven effective on the same type of problem in the past.
M.L. Fisher et al. / Usin# duality to solve discrete optimization problems
79
Steps 11, 12: The separation of the given subproblem can often be done on the basis of information provided by the dual problem. For example, in integer programming, problem (6.1) may be separated into two descendants with a zero-one variable x i set to zero and one respectively, where xj is chosen so that the reduced cost is minimal. It is important to point out that the greatest lower bound obtained during the dual analysis of (6.1) remains a valid lower bound on a subproblem derived from (6.1) with X ~ ~ X k. In step 12, the new subproblem selected from the subproblem list can be one with low or minimal lower bound.
7. Computational experience Most of the dual methods described above have been implemented and applied to various discrete optimization problems. The authors have applied dual methods to several types of integer programming problems and to resource constrained network scheduling problems. This is the experience reported on here. Held and Karp [-27] have successfully applied the subgradient relaxation method to the traveling salesman problem. Held, Wolfe and Crowder [28] report success with the same method applied to multi-commodity flow problems. Hogan, Marsten and Blakenship [-29] have applied a different dual algorithm than the ones described here to a resource constrained network problem. Our computational experimentation is continuing. The main goals to date have been (1) to establish that dual methods provide tight lower bounds, (2) to show that these bounds can be obtained quickly, (3) to contrast and integrate the different dual methods, and (4) to evaluate the use of dual methods in conjunction with branch and bound. 7.1. The integer p r o g r a m m i n g ( I P ) p r o b l e m 1
We begin with an IP problem written in the form minimize subject to
c w, A w = d,
(7.1) w >_ 0 and integer,
where A is an m x (n + m) integer matrix of rank m, d is an m x 1 integer vector, and c is a 1 x (n + m) vector. Let B be any m x m nonsingular 1 Cf. [13] and [37].
80
M.L. Fisher et al. / Using duality to solve discrete optimization problems
matrix formed from the columns of A and N the matrix of remaining columns. We partition w into y and x, with y corresponding to the columns in B, and in the same way partition c into ca and cN. Then (7.1) can be written as minimize subject to
c a B - l d + (cN - c a B - I N ) x ,
B-1N x < B-ld,
(7.2)
B - 1 N x =- B - l d (mod 1),
x > 0 and integer. The correspondence with (1.1) is obtained by letting X = {x: B - 1 N x - B - l d (mod 1), x > 0 and integer}, f ( x ) = (cN -- c a B - 1N) x, g(x) = B - 1 N x, b = B-ld. The problem m i n x ~ x { f ( x ) + u g(x)} is equivalent to finding the shortest route in a special network with [det B I nodes and efficient algorithms have been developed for the problem (e.g., see [21]). In the context of the IP dual problem, the shortest route problem can be solved even faster because only a single path from the origin node to a specified node is required and because the previously optimal path provides a good starting solution when the problem is solved repetitively. It may be shown that if any component of cs - c a B - 1N + u B - 1N for u > 0 is less than zero, then L(u) = - o o . Hence, it is useful to require cN - c a B - I N + u B - ~ N > 0 when we solve the dual. The modifications required to include these linear side constraints in the analysis of the dual problem are straightforward ['13]. There are additional practical factors to be considered. For most IP problems, many or all of the variables have explicit or implicit upper bounds and these should be taken into account in the dual analysis. An upper bound on a basic variable y~ has the effect of eliminating the constraint u~ > 0. An upper bound on a non-basic variable can be explicitly added to the set X in which case the Lagrangean shortest route problem needs to be modified. Of course, upper bounds affect the Lagrangean subgradients and analysis of Section 2; we omit details. A second practical need are data relaxation methods to control the size of the shortest route problems [-20]. Our experimentation on the IP dual problem has been limited to solution by the primal-dual algorithm of Section 3 with the modification that ascent steps are to the maximum of L along the indicated direction rather than
M.L. Fisher et al. / Using duality to solve discrete optimization problems
81
the next breakpoint. Although the modified algorithm can no longer be proven to converge, this has not appeared to present practical problems. The basis B used in the transformation of (7.1) to (7.2) was always an optimal LP basis and the starting dual solution was always u = 0. The algorithm was encoded for the MULTICS time sharing system on the G.E. 645 computer at M.I.T. The group theoretic IP algorithm IPA [21] was also established on MULTICS, and we used it to find feasible and optimal IP solutions while the primal-dual provided lower bounds. Segment size on MULTICS effectively limits search space and, as a result, IPA would either find a feasible or optimal IP solution within 10 to 60 seconds after the LP solution was found, or be forced to quit because the available space was filled. The IP dual approach has not yet been combined with branch and bound except on an ad hoc basis. The first IP test problem is a covering problem due to Day [5] consisting of 30 rows and 28 zero-one variables. A number of test problems were constructed from this problem by altering the objective function. Fig. 7.1 illustrates the performance of the primal-dual algorithm on one of the first test problems. The abscissa represents the number of iterations of the primal-dual algorithm and the ordinate the lower bound. The lower bounds have been normalized so that the LP cost is 100. The circled points on the figure are the successive lower bounds. The first two points are on the ordinate axis because the primal-dual had not yet begun to pivot. The arrow around iteration 50 on Fig. 7.1 indicates that a feasible IP solution to the IP problem was found while maximizing along a line on which lie the points yielding the indicated adjacent bounds. Coincidentally, it took 48 simplex iterations to find an optimal LP solution. The feasible IP solution was in fact optimal, but it was not possible to prove it because complementary slackness did not hold (see Section 6). Optimality was established by a separate run with IPA. We did not allow the algorithm to converge to an optimal dual solution because it was taking too much time, but the asymptote seems clear. The behavior illustrated in Fig. 7.1 has typical and atypical characteristics. The typical characteristics are the rapid rise in dual lower bound during the first few steps, the long tail towards the maximal dual lower bound, and the existence of a duality gap. The atypical characteristic is the discovery of a feasible solution. The discovery of a feasible IP solution as a result of a Lagrangean minimization was a rare event for all of the IP problems tested thus far. To see why this might be so for any discrete dual problem, consider the general model (1.1) and let ~ E X be an arbitrary
M.L. Fisher et al.
82
Using duality to solve discrete optimization problems
MINIMAL
106
|
(9
|
|
|
|
IP C O S T
(9
|
|
105
|
C3 Z E3 O r
104
rY 103 LU O d 102
101
I00
,
I
50
100 PRIMAL- DUAL
150
200
250
ITERATIONS
Fig.
7.1.
feasible solution (i.e., g(~) _< b). Since J(~) _> L(u) for any u _> 0, the scalar u(g(Yc) - b) may need to be a negative number fairly large in magnitude in order that L(u) = f(YO + u(g(YO - b). This is particularly true when there is a significant duality gap between the primal and dual problems, i.e., when f(~) :~ L(u) for all u _> 0. Table 7.1 summarizes our experience with six test problems derived from the problem of Fig. 7.1 by altering the objective function. Problem 1 is the one depicted in Fig. 7.1. Note that there was, or appeared to be, a duality gap in all cases between the IP problem and its dual (a ">_" or "_<" indicates that dual or primal optimality was not necessarily attained). The magnitude of the gap appears to bear no relationship to the magnitude of the determinant of the optimal LP basis B although a determinant of one would trivially ensure that there is no gap. For all of these problems, the Lagrangean shortest route problem was solved with upper bounds on the
M.L. Fisher et al. / Using duality to solve discrete optimization problems
83
Table 7.1 Day coverin problem variants [5]. 30 rows; 28 zero-one variables
Problem
LP
First lower bound
Lower bound after
Dual
IP
of optimal
cost
L(0)
first step
cost
cost
LP basis
100.0 100.0 100.0 100.0 100.0 100.0
102.2 105.3 101.8 102.7 102.3 100.3
104.5 105.5 104.8 104.7 102.7 102.0
--- 105.3 110.1 > 107.3 106.7 > 104.5 > 102.5
106.0 118.0 118.6 111.5 _< 126.9 105.0
2 42 42 36 11 20
Determinant
IP
non-basic variables. Our experience has been that these bounds can be ignored with little effect on the lower bound. For example, the dual to problem 2 was solved without regard to the upper bounds on the non-basic variables and the best lower bound obtained was 109.8 as compared to 110.1. Problem ! was the only one for which a feasible integer solution was found during solution of the IP dual problem. Table 7.2 gives results for the test problem IBM9 1-24] which is a 35 row covering problem with 15 zero-one columns. Run 1 is the original test problem and the other runs were obtained by varying the objective function. For run 1, the initial point u = 0 was optimal in the dual; 5 subgradients had to be generated to prove it. The next test problem is a multiple-choice capital budgeting problem with additional constraints for R & D selection. This problem is due to E. M. L. Beale and was sent to us by Laurence Wolsey. It consists of 50 rows Table 7.2 IBM9 test problem variants [24]. 35 rows; 15 zero-one variables
Problem 1 2
3 4 5
First lower
Lower
IP
LP
bound
bound after
dual
IP
o f optimal
cost
L(0)
first step
cost
cost
LP basis
100.0 100.0 100.0 100.0 100.0
112.5 105.3 100.9 102.3 < 108.3
-
109.2 106.0 102.4 108.5
112.5 111.2 110.8 104.2 113.1
150.0 120.8 < 137.0 111.5 _< 140.1
21 15 33 7 36
Determinant
84
M.L. Fisher et al. / Using duality to solve discrete optimization problems
and 115 columns and is quite difficult for its size; for example, the optimal LP solution to the problem has fractions (i.e., mixed strategies) for nine out of 10 multiple choice sets. For the five capital budgeting constraints, we did some mild data relaxation [20] which did not cause serious approximation errors (IPA solves the original problem using the relaxed problem to generate lower bounds and trial solutions). The first run on this problem T a b l e 7.3 R & D s e l e c t i o n p r o b l e m . 50 r o w s ; 115 i n t e g e r v a r i a b l e s Determinant Subproblem
IP dual
IP
of optimal
Variables
profit
profit
LP basis
fixed
31.76
0 . 2 0 4 1 x 10 s
No variables fixed
30.69
120960
x 0206 = 1 x 0704 = 1
30.08
27846
x 0206 = 1 x 0301 = 1 x 0704 = 1
2898
x x x x
0206 0301 0704 0807
7956
x x x x
0206 = 1 0301 = 1 0704= 1 0810=0
LP profit
30.05
~27.59
30.00
~27.44
19.45
= = = =
29.94
26.60
540
x x x x
0206 0301 0704 0807
29.94
27.15
1080
x x x x x x
0206 = I 0301 = 1 0704 = I 1005 = 1 0807 = 0 0810=0
1800
x x x x x x
0206 = 1 0301 = 1 0505 = 1 0704 = 1 0807=0 0809 = 0
29.75
~27.43
= = = =
I 1 1 1
1 1 1 l
M.L, Fisher et al. / Using duality to solve discrete optimization problems
85
is subproblem 1 in Table 7.3; the determinant of the optimal LP basis was too large and we did some manual fixing of variables to simulate a branch and bound search. The variables fixed were from the multiple choices sets with variables prefixed x01, x02, ..., xl0. A " " in Table 7.3 indicates that the IP dual calculation was not made or a feasible IP solution was not found. The search space for IPA on this problem was extremely limited and a feasible solution was found only for subproblem 5. For subproblem 5, IPA and the IP dual problem worked with a relaxation of X which gave a shortest route subnetwork with 1326 nodes rather than 7956 (the use of relaxation in conjunction with duality is given at the end of Section 2). Results for an extremely difficult small test problem ESBAL4 sent to us by Ellis Johnson are given in Table 7.4. The problem consists of 20 rows and 28 zero-one variables and it is almost 1 0 0 ~ dense with integer coefficients between - 9 and + 9. The determinant of the optimal LP basis was T a b l e 7.4 E S B A L 4 . 2 0 rows; 28 z e r o - o n e variables Determinant o f optimal L P basis
Remarks
Subproblem
LP cost
IP dual cost
IP Cost
1
52.15
58.16
-< 71.00
876
2
34.18
37.87
39.87
3170
d a t a relaxation 6 variables fixed
3
24.18
27.44
-
296
d a t a relaxation 4 variables fixed
4
24.58
28.05
-
96
d a t a relaxation 4 variables fixed
5
27.85
33.26
-
4224
d a t a relaxation 9 variables fixed
no d a t a relaxation 10 variables fixed
immense, and as with Beale's problem, we did some ad hoc branch and bound to find more favorable subproblems. The feasible IP solution found by IPA on run 2 is probably optimal. In spite of the fact that most of the subproblems were relaxed by the procedures of [-20], the dual lower bounds were significantly higher than the LP lower bounds and they were obtained fairly rapidly. Table 7.5 gives some results on two other test problems. The capital budgeting problems were taken from a Harvard Business School case study. The one with no duality gap was solved by the dual; i.e., the optimal IP
86
M.L. Fisher et al. / Using duality to solve discrete optimization problems
Table 7.5 Miscellaneous test problems
IBM5 Variant [24] IBM5 Variant [24] Capital budgeting Capital budgeting
Rows
Variables
LP cost
tP d u a l c o s t
IP cost
15 15 18 18
15 15 76 76
122.7 131.7 -2301. -2361.
129.0 137.2 -2275.5 -2359.
130.0 139.0 ~ - 1801.0 -2359.
solution was found by the dual and optimality proven. Moreover, the optimal solution was found along the first direction of ascent. The above experimentation indicates a number of areas of future experimentation and research. First, it is probably best to combine the primal-dual algorithm with subgradient optimization and heuristic methods. The primal-dual does a great deal of LP pivoting in solving the LP problem (3.2) in order to perfect the direction chosen. In many cases, this is unnecessary because any subgradient direction would be a true direction of ascent. Perhaps the primal-dual should only be used if simple ascent methods like maximizing along subgradient directions get jammed at a point. The most extreme example of jamming was on IBM9 with a Gomory cut added. More than 20 subgradients were required to prove dual optimality or find a direction of ascent (we did not pursue computation long enough to find out which was the case). On many problems, the primal-dual exhibited the familiar difficulties of other ascent methods including slow convergence and zig-zagging on low dimensional affine subspaces. For these reasons, we believe the simplicial approximation algorithm of Section 4 may prove effective because it takes forced steps of fixed size. A n o t h e r important area of future experimentation are methods for continuing the dual analysis of an IP problem after the optimal solution to the IP dual has been obtained, or after the dual ascent has slowed down. One method for doing this at a point u > 0 is to add to the IP problem (7.1) the cut (CN -- (CB -- U) B - 1N) x > L ( u ) + u B - ld.
The validity of this cut and its properties are given in [13]. A potentially superior method which subsumes the cutting plane method is given by Bell and Fisher [2]. In addition, if an unsatisfactorily large duality gap is the result of data relaxation, some of the gap can be recovered by putting multipliers on the approximation error (see [20, w 4]). The success of the subgradient relaxation method on other discrete
M.L. Fisher et al. / Using duality to solve discrete optimization problems
87
optimization problems makes it particularly attractive for use in IP. Experimentation should begin with those problems for which tight upper bounds on all variables, including slacks, are known or given; e.g., set partitioning problems (see [,14]). For these problems, the subgradient relaxation method can search without constraints. Further experimentation needs to be done on whether the Lagrangean shortest route problem should be solved with or without upper bounds on the non-basic variables. The inclusion of upper bounds has not produced thus far significantly higher lower bounds, and the additional computational effort is not justified. If, however, there is a strong possibility of finding a good feasible or optimal IP solution, then the explicit imposition of upper bounds on the non-basics becomes more attractive. Finally, the IP dual methods need to be formally combined with branch and bound. Included for consideration are the conventional branch and bounds methods for IP, and also the special purpose branch and bound search over the non-basic variables used by IPA [,21]. We believe that the experience above plus the success of the dual approach in solving the resource constrained network scheduling problem discussed below, and the traveling salesman problem [-27], justifies our confidence in its potential for IP. 7.2. Resource constrained network scheduling problem 2
This problem concerns the scheduling of n tasks to be performed over T time periods. For each task i we are given an integer processing time Pi and a set of tasks Bi which must be completed before task i is started. K different resources are available, with bkt denoting the amount of resource k available in the time interval from t - 1 to t and rik denoting the amount of resource k required by task i while in process. Letting xi denote the start time for task i we would like to obtain a value for x which minimizes ~7= 1 max(O, xi + Pi - di), where di is a due date for job i. Letting It(x ) = { i : x~+ 1 < t < x ~ + p ~ } denote the set of tasks in process during the time interval from t - 1 to t, the correspondence with (1.1) may be given by X = {x:0 < x i < T -
Pi, xiinteger, a n d x i >- xt + Pt, l~Bi},
f ( x ) = ~ max(0, xi + Pi - d3, i=1 2 Cf. [10].
88
M.L. Fisher et al. / Using duality to solve discrete optimization problems gkt(X) =
~.
rik,
i~lt(x)
g()r = (gll(X), ffl2(X),-.., aKT(X)),
b = (b11, b12..... bgr). The integrality of p, allows us to restrict xi to integer values without loss of optimality. For the case where the set A, = {j: i~Bj} contains at most one element for each j, a dynamic programming algorithm requiring at most nT steps is given in [12] for the problem minx~ x f(x) + ug(x). In our computational work with these problems two algorithms for the dual were used, a modified version of the primal-dual algorithm of Section 3, and the subgradient relaxation method described in Section 5. In the modified primal-dual, the dual simplex method rather than the primaldual simplex method was applied to the LP problem (2.2). Columns for this LP were generated by solving the Lagrangean problem. Table 7.6 summarizes our experience in solving 9 specific problems. The first eight are problems for which earlier computational results have been reported in [10]. Problems 1 and 2 are job shop problems with values for Pi selected randomly between 1 and 9. Problem 9 is a 36 task job shop problem published in [33]. Problems 3 through 8 were obtained from problem 9 by removing selected tasks. These problems used a different objective function from the problem in [33] and it was necessary to randomly generate due dates for tasks. Table 7.6 Computational experience with job sho , problems (1)
(2)
(3)
(4)
(5)
(6)
L(0)
Problem
Number of tasks 13 13 16 16 20 20 20 20 36
Optimal primal objective value
Greatest lower bound computed
First lower bound L(O)
Solution time
secs
secs
(360-65)
(360-65)
98
95
8
8
97 0 128 2 139 2
97 0 125 0 134
1
0
68 0 81 0 97 0 98 0 0
3.85 3.22 2.29 1.15 9.80 7.04 20.15 16.65 30.24
4.58 3.26 2.78 1.64 8.24 30.05 25.39 30.88 48.95
1
Solution time
M.L. Fisher et al. / Using duality to solve discrete optimization problems
89
Solution times for all problems are given in column (5) of Table 7.6 using a branch and bound algorithm of the type described in Section 6. Problems 1-4 were solved using the algorithm given in [,10] with some adjustment made to the effort expended in solving the dual. This had an effect only for problem 2 where the solution time was slightly reduced. Problems 5-9 were solved by a branch and bound algorithm which differed in two respects from that in [,10]. First, an improved algorithm for the Lagrangean problem given in [,12] was used. This algorithm requires an amount of work proportional to n T and has been found to be significantly faster than the algorithm for the Lagrangean problem given in [-10], particularly for large problems. Second, the subgradient relaxation method was used on the dual problems. These changes resulted in solution times which are significantly lower, in some cases by as much as an order of ma'gnitude, than those reported in [-10]. Column (3) of Table 7.6 gives the greatest lower bound obtained for each problem. For problems 2-4, this lower bound is known to be optimal in the dual because there is no duality gap; in the other problems, it is not known if the greatest lower bounds are optimal. Most algorithmic work in the literature for problems of this type has been concerned with the minimization of the time to complete all tasks. This objective is different from the minimization of tardiness objective used here and hence it is difficult to compare our computation times with those reported elsewhere. To obtain solution times for comparison, we disabled the solution of the dual in our branch and bound algorithm so that the bound used everywhere was L(0). This is the bound commonly used in other algorithms for this problem. The solution times using this bound are given in column (6) and may be compared with the times in column (5). As well as solving each problem by branch and bound using dual lower bounds, we separately applied the modified primal-dual algorithm and the subgradient relaxation method to the duals of problems 5-8. The primaldual was run until optimality was attained, or until 50 columns for the LP (2.2) had been generated. The subgradient relaxation method was run for a maximum of 100 steps. Step sizes were selected as in (5.2) with P0 initialized to 2.0 and halved whenever L(u) failed to increase for 5 consecutive steps. Because the amount of time spent optimizing a particular dual was generally different for each algorithm, it is difficult to make a direct comparison between the two methods. To help make comparisons, we have displayed in Figs. 7.2, 7.3 and 7.4 the value of L(u) as a function of compu-
M.L. Fisher et al. / Using duality to solve discrete optimization problems
90
OPTIMAL PRIMAL VALUE 128
r--
!
120 , F, v U
? .......
,__FI T
/
PRIMAL- DUAL
/
SUBGRADIENT METHOD
/
RELAXATION
110
100 97
I
I
I
I
I
I
I
2
4
6
8
'i0
12
14
D
SECONDS Fig. 7.2. Comparison of primal-dual and subgradient relaxation method for scheduling problem 5.
139 t
OPTIMAL. PRIMAL VALUE
/
PRIMAL-DUAL
1301
! /
r . . . ~ . _ . . ~ , j, ~ ' ,
P - - ""L~ "
/
r'l.l " ~ . ~ 1
SUBGRADIENT RELAXATION
M ETHOD
.,F"J ~
G J
,oohl 98 III
t 2
/
I 4
~ 6
I 8
I 10
I 12
I I.14
SECONDS Fig. 7.3. Comparison of primal-dual and subgradient relaxation method for scheduling problem 7.
M.L. Fisher et al. / Using duality to solve discrete optimization problems
OPTIMAL PRIMAL VALUE =
/
.5
-
~
91
2
PROBLEM 8
'
Ur--" s I
-j
6
L,)_
B SECONDS
I
10
I
12
I
14
I.-
_J
Jj 'l!il
rl
Ilir
--11, !
i
J
Ii I i
rJLJ 'I/PROBLEM 6 -2
I - I I I I I
Fig. 7.4. Subgradientrelaxation method for scheduling problems 6 and 8.
tation time for the two algorithms and for problems 5-8. The slower Lagrangean algorithm of [10] was used in these runs so computation time is larger than it needs to be, but this should not invalidate comparisons between the two algorithms. Fig. 7.2 gives L(u) for both algorithms applied to problem 5, and Fig. 7.3 for problem 7. For problems 6 and 8, the primaldual produced L(u) values which remained close to zero and thus the results for the primal-dual are not very interesting. In Fig. 7.4, however, we do show L(u) for the subgradient relaxation method.
92
M.L. Fisher et al. / Using duality to solve discrete optimization problems
Generally, these figures show that the subgradient relaxation method performed somewhat better than the primal-dual algorithm. Note from Table 7.7, however, that the primal-dual method required the solution of significantly fewer Lagrangean problems per unit of computation time. This is entirely consistent with the nature of the two algorithms and suggests that comparisons are sensitive to the time to solve the Lagrangean problems. Table 7.7 Primal-dual and subgradient relaxation methods
Problem
Greatest lower bound computed 123 0 133 0
Primal-dual Compute time Lagrangean secs problems (360--65) solved 13.21 13.99 10.35 7.84
49 14 54 24
Subgradient method Greatest Compute lower time Lagrangean bound secs problems computed (360-65) solved 126 1 134 1
17.52 17.93 25.76 26.00
100 100 100 100
The subgradient relaxation method should be more favorable if this time is relatively short. The runs depicted in Figs. 7.2, 7.3 and 7.4 substantiate the principle that it is best not to try to optimize the dual problem, but to use quickly obtained good lower bounds in conjunction with branch and bound.
Acknowledgment The authors would like to acknowledge the helpful comments of A.M. Geoffrion, M. Held, T.L. Magnanti and M.H. Wagner.
References [1] S. Agmon, "The relaxation method for linear inequalities", Canadian Journal of Mathematics 6 (1954) 382-392. [2] D.E. Bell and M.L. Fisher, "Improved integer programming bounds using intersections of corner polyhedra, Mathematical Programming 8 (1975) 345-368. [3] S.P. Bradley, "Decomposition programming and economic planning", ORC Rept. 67-20, Operations Research Center, University of California, Berkeley, Calif. (1967).
M.L. Fisher et al. / Using duality to solve discrete optimization problems
93
[4] G.B. Dantzig, Linear programming and extensions (Princeton University Press, Princeton, N.J., 1963). [5] R.H. Day, "On optimal extracting from a multiple file data storage system : an application of integer programming", Operations Research 13 (1965) 482--494. 1-6] B.P. Dzielinski and R. Gomory, "Optimal programming of lot size inventory and labor allocations", Management Science 11 (1965) 874--890. [7] J. Edmonds, "Matroids and the Greedy algorithm", Mathematical Programming 1 (1971) 127-136. 1-8] J.E. Falk, "Lagrange multipliers and nonconvex programs", SIAM Journal on Control 7 (1969) 523-545. [9] J.E. Falk and R.M. Soland, "An algorithm for separable nonconvex programming problems", Management Science 15 (1969) 550-569. [10] M.L Fisher, "Optimal solution of scheduling problems using Lagrange multipliers : Part 1", Operations Research 21 (1973) 1114-1127. [11] M.L. Fisher, "Optimal solution of scheduling problems using generalized Lagrange multipliers and relaxation: Part II", in: Symposium on the theory of scheduling and its applications (Springer, Berlin, 1973). 1-12] M.L Fisher, "A dual algorithm for one machine scheduling problems", Mathematical Programming, to appear. [13] M.L. Fisher and J.F. Shapiro, "Constructive duality in integer programming", SIAM Journal on Applied Mathematics 27 (1974) 31-52. [14] R.S. Garfinkel and G.L. Nemhauser, Integer programming (Wiley, New York, 1972). [15] A.M. Geoffrion, "Duality in nonlinear programming : a simplified applications-oriented development", SIAM Review 13 (1971) 1-37. [16] A.M. Geoffrion, "Lagrangian relaxation and its uses in integer programming", Mathematical Programming Study 2 (1974) 82-114. [17] P.C. Gilmore and R.E. Gomory, "A linear programming approach to the cutting-stock problem, Part II", Operations Research 11 (1963) 863-888. [18] R.E. Gomory and T.C. Hu, "Synthesis of a communication network", SlAM Journal on Applied Mathematics 12 (1964) 348-389. [19] G.A. Gorry and J.F. Shapiro, "An adaptive group theoretic algorithm for integer programming problems", Management Science 17 (1971) 285-306. [20] G.A. Gorry, J.F. Shapiro and LA. Wolsey, "Relaxation methods for pure and mixed integer programming problems", Management Science 18 (1) (1972) 229-239. [21] G.A. Gorry, W.D. Northup and J.F. Shapiro, "Computational experience with a group theoretic integer programming algorithm", Mathematical Programming 4 (1973) 171-t92. [22] R.C. Grinold, "Lagrangean subgradients", Management Science 17 (1970) 185-188. [23] R.C. Grinold, "Steepest ascent for large scale linear programs", SIAM Review 14 (1972) 447-464. [24] J. Haldi, "25 integer programming test problems", Working Paper No. 43, Graduate School of Business, Stanford University, Stanford, Calif. (1964). [25] T. Hansen and H. Scarf, "On the applications of a recent combinatorial algorithm", Cowles Foundation Discussion Paper No. 272, Yale University, New Haven, Conn. (1969). [26] M. Held and R.M. Karp, "The traveling salesman problem and minimum spanning trees", Operations Research 18 (1970) 1138-1162. [27] M. Held and R.M. Karp, "The traveling salesman problem and minimum spanning trees : Part II", Mathematical Programming 1 (1971) 6-25. [28] M. Held, P. Wolfe and H.D. Crowder, "Validation of subgradient optimization", Mathematical Programming 6 (1974) 62-88.
94
M.L. Fischer et al. / Using duality to solve discrete optimization problems
[-29] W.W. Hogan, R.E. Marsten and J.W. Blankenship, "The Boxstep method for large scale optimization", Operations Research 23 (3)(1975). [30] L.S. Lasdon, "Duality and decomposition in mathematical programming", IEEE Transactions on Systems, Man and Cybernetics 4 (1968) 86-100. [-31] T.L. Magnanti, J.F. Shapiro and MIH. Wagner, "Generalized linear programming solves the dual", Working Paper OR 019-73, Operations Research Center, Massachusetts Institute of Technology, Cambridge, Mass. (September 1973). [32] T. Motzkin and l.J. Schoenberg, "The relaxation methods for linear inequalities", Canadian Journal of Mathematics 6 (1954) 393--404. [-33] J.F. Muth and G.L. Thompson, eds., Industrial scheduling (Wiley, New York, 1963). [-34] W. Orchard-Hays, Advanced linear programming computing techniques (McGraw-Hill, New York, 1968). [-35] R.T. Rockafellar, Convex analysis (Princeton University Press, Princeton, N.J., 1970). [36] H. Scarf, "The approximation of fixed points of a continuous mapping", SIAM Journal on Applied Mathematics 15 (1967) 1328-1343. 1-37] J.F. Shapiro, "Generalized Lagrange multipliers in integer programming", Operations Research 19 (1971) 68-76. [38] M. Simonnard, Linear programming (Prentice-Hall, Englewood Cliffs, N.J., 1966). [-39] M.H. Wagner, "Constructive fixed point theory and duality in nonlinear programming", Tech. Rept. No. 67, Operations Research Center, Massachusetts Institute of Technology, Cambridge, Mass. (September 1971). [-40] P. Wolfe, "A method of conjugate subgradients for minimizing nondifferentiable functions", Mathematical Programming Study 3 (1975) 145-173 (this volume). ]-41] W.I. Zangwill, Nonlinear programmin 9 : a unified approach (Prentice-Hall, Englewood Cliffs, N.J., 1969).
Mathematical Programming Study 3 (1975) 95-109. North-Holland Publishing Company
"AN EXTENSION OF DAVIDON M E T H O D S TO N O N DIFFERENTIABLE PROBLEMS
C. L E M A R E C H A L LR.I.A., Laboria, Le Chesnay, France Received 14 May 1974 Revised manuscript received 14 May 1975 Making use of convex analysis a property possessed by almost all Davidon methods is exhibited. This property--although true only in the quadratic c a s e ~ o e s not depend on the quadratic nature of the objective function. An algorithm is given which is shown to coincide with the conjugate gradient algorithm in the quadratic case. The convergence is proven when applied to uniformly convex functions. Numerical aspects are considered.
O. Introduction
The algorithm presented here is very similar to one found in [9]. Let H be a Hilbert space, with inner product (,) and associated norm [ I. Let f be a convex function defined on H, such that domf = H Va ~ R, { x ~ H: f (x) ~< e} is weakly compact.
(1)
Furthermore, for purposes of proofs, we shall need the property that f be locally uniformly convex: namely, that there exists an increasing function d from R + into R +, such that:
d(O) = O,
d(t) > O
Vx, y e H , V2~[O, 1],
if t > 0 ,
f[2x + (1-
2)y] ~<
(2)
~< 2f(x) + (1 - 2)f(y) -- 2(1 - 2)d(Jx - y]). For numerical convenience, d will be supposed to be convex. Note that this hypothesis is consistent; in particular, if f is strongly convex, then d(t) = ~t 2, ~ > 0, and so d is convex.
96
C. Lemarechal / An extension of Davidon methods
We consider the problem of minimizingfover H. To solve this problem, if H is finite-dimensional and i f f is regular enough, Davidon methods are among the best ones. In these methods, a sequence {xn} is generated according to xn + 1 = x~ + p,s~ where p, ~ R is the step size, and sn E H the direction. Adachi [1] points out that many Davidon methods (at least 12) are equivalent in the quadratic case, in the sense that they generate sequences of directions which are colinear, Zoutendijk [1(3] remarks that, if g, = f'(xn) is the sequence of gradients, then, for each n, all the gradients have the same projection onto s~: (s~, gi) is a constant with respect to i. McCormick and Ritter [7] derive an algorithm in connection with this remark. Intuitively the fact that (s~, gi) equals a constant suggests that sn indeed realizes mins maxi(s, g~) (as in the Tshebychev theory), whence the term "projection method". But maxi(s, gi) is related to some perturbed derivative in the direction s, and this remark establishes a link with the "steepest descent" algorithm of Demjanov [4]. This in turn is related to some perturbed subgradients, used first by Bertsekas and Mitter [2]. These informal remarks establish the intuitive background of this paper: starting from the ideas developed in [6], we show that a natural way for extending Davidon methods is to calculate s~ as the projection of the origin onto the convex hull of the preceding gradients. Furthermore, we exhibit a possible test for cutting off the sequence (i.e. restarting onto the gradient), and a convergent law for an approximate one-dimensional minimization. Thus, we establish an algorithm, prove its convergence, and apply it to some examples.
1. Theoretical background Letfsatisfy (1). For each e i> 0, the e-subdifferential off a t x is defined by a j ( x ) = {g ~ H: Vy ~ H, f ( y ) > / f ( x ) + (g, y - x) - e}.
(3)
We denote by Of(x) the set Oof(x), i.e. the ordinary subdifferential. Let f * be the convex-conjugate function o f f It is easy to show [8, p. 220] that d , f ( x ) = {g ~ H: f * ( g ) + f ( x ) ~< (g, x) + e}.
(4)
Theorem 1.1. (i) O,f(x) is a non-empty convex and weakly compact set. (ii) I f (~, x) remains bounded in R x H, then O,f(x) remains bounded in H.
C. Lemarechal / An extension of Davidon methods
97
(iii) I f e~ ~ e, x, ~ x weakly, g, 9 O~n f ( x . ) and g, ~ g strongly, then g 9 Oj(x). (iv) For each direction s 9 H, one has : inf {(f(x + ps) - f ( x ) + O/P: P > 0} = = sup{(s, g): g 9 d~f(x)}.
(5)
Proof. (i) This is well-known. For the infinite-dimensional case, see I-5, ch. 6-]. (ii) F r o m (1), f is continuous, and weakly lower semi-continuous. For each e, x, and each g 9 O~f(x), set y = x + g/Igl. Then by (3)
(g, y - x) = I gl
f(Y) - f ( x ) + e.
But y is bounded, and so are f ( y ) and f(x), thus proving the proposition.
(iii) By (4), f*(g~) + f(x~) <% (gn, xn) + e,. Hence, lim inf[f*(gn) + f(x~)-] ~< lim[(gn, x~) + e~] = (g, x) + e. By lower semi-continuity, f*(g) + f ( x ) <%lim i n f f * ( g , ) + lim inff(xn) ~< lim inf[f*(gn) + f(x~)] ~< (g, x) + e, and g 9 O j ( x ) by definition (4). (iv) Formula (5) is given in [8, p. 220].
Theorem 1.2. Let x, y 9 H, g 9 Of(y). A necessary and sufficient condition for g to belong to Off(x) is f ( y ) >1f ( x ) + (g, y - x) - e.
(6)
Proof. The condition is obviously necessary: the relation of definition (3) must be satisfied at least at the point y. Conversely, let g 9 Of(y). Then Vz 9 H, f ( z ) >>.f ( y ) + (g, z - y) = f ( x ) + (g, z - x) + f ( y ) - f ( x ) + (g, x - y).
But, by (6), f ( y ) - f ( x ) - (g, y - x) >1 - e , completing the proof.
Theorem 1.3. Let x, s 9 H, and Po 9 R be such that f ( x + pos) <%f ( x + ps)
Vp 9
Then there exists go 9 Of(x + pos) such that (go, s) = O.
C. Lemarechal / An extension of Davidon methods
98
Proof. We intentionally give a non-elegant proof, which will be useful in Section 4. Since d f ( x + pos) is weakly compact, there exists g2 ~ d f (x + pos) such that (s, g2) = sup{(s, g): g ~ Of(x + PoS)} = inf{(f(x + ps) - f ( x + poS))/(p - Po): P i> Po } (by (5). But since f ( x + ps) achieves its minimum for p = Po, the last term is non-negative; in other words, 392 ~ Of(x + pos)
such that (s, 92)/> 0.
Similarly we can prove (by setting s' = - s , p' - Po = - ( P - Po)) that 391 ~ d f ( x + pos)
such that (s, gl) ~< 0.
Then, there exists 2 ~ [0, 1] such that 2(s, ga) + (1 - 2)(s, g2) = 0. But since Of(x + poS) is convex, go = 291 + (1 - 2)92 belongs to Of(x + pos) and the proposition is proven. The following results will be useful for convergence proofs.
Lemma 1.4. Let ct~R, and let s, g ~ H be such that (s,g) ~< ~]S] 2. Let us denote by z the projection of the origin onto the segment [s, g]. Then for each 6>0
11 + ~)lzl ~ ~ (~ + 6)lsl ~ § Proof.
igl ~.
(7)
The optimality conditions for z are
(z,g) >/Izl ~,
(z,s)/> Izl ~
One has (z - s, g) = (z, g) - (s, g)/> Izl ~ - o, Isl ~.
(8)
We apply now the well-known inequality2(a,b)<~6la[2+ V6 > O. Afortiori, (z - s , g ) ~ 6[Iz[ 2 - 2(z, s) + Isl2] + 6 - ' ~< 6E-Izl~ § IsiS3 + ,~-' Igl ~.
Ig[2
Combining (8), (9), we deduce
Izl 2 - ~lsl 2 ..< 6E-Izl; + Isl;3 46 -11gl ~, which is (7).
6-'lb[ 2, (9)
C. Lemarechal / An extension of Davidon methods
99
Corollary 1.5. Let G be a bounded set containing O. Let us define two sequences {s,}, {g.} by (i) go ~ G, s o = go, (ii) g.§ e G such that (g,+l, s.) = 0, (iii) s,+ 1 = the projection of the origin onto the convex hull of{go,..., g. +1}. Then s, --* 0 strongly.
Proof. One obviously obtains (7) for each n, when s, z, g, ~ are replaced by s,, s.+ 1, gn§ 1, 0. Denote by fl(6) and V(~) the quantities 3/(1 + 6) and 1/6(1 + 5). Recursively, we obtain (note that ~(6) < 1 V5 > 0) n-1
Is l 2 <
( )Isol 2 + 7(6) E
2.
i=0
Let M be a bound for G. By direct calculation it is easy to obtain
For e > 0, choose 5 = 2M2/e. It is then possible to choose n such that (6/(1 + 6))" ~< e/2M z. So, Isnl2 < and the proof is complete. We now study Davidon methods, keeping in mind Theorem 1.3 and Corollary 1,5.
2. Some properties of Davidon methods In Davidon methods one defines the sequence x. by x.+ 1 = x. + p.s.. Thus, s. is the sequence of directions, g. = g(x.) is the sequence of gradients. When applied to a quadratic objective f ( x ) = (x, A x) - 2(f, x), the fundamental characteristics of the method are (s,, A s j) = 0, ( g . + . s.) = 0, g.+l=g.+p.
j < n,
(10) As..
It is well-known that, for each n t> 0, (g,+i, s~) = 0, i ~< n,
(11)
that is to say, x. + 1 minimizes f in the affine manifold spanned by {So, ..., s. }. We consider those methods which, in addition to (10), satisfy, for each n,
C. Lemarechal / An extension of Davidon methods
100
s. = ~ 27g,.
2." • 0.
(12)
i=0
In other words, we seek the n 'h direction in the subspace spanned by the gradients which are already known. It is quite natural to proceed in this way. Lemma 2.1.
For each method satisfying (10), (12), one has, for each n,
(g.+ a, gi) = O,
Proof.
i =.0,..., n.
The proof is by recurrence on i. For i = 0,
(g.+ 1, go) = (g.+ 1, So) = 0
For i
0 = (gn+ 1, Si) =
> 0,
from (11). 2}gj
i-1
=
g,) + j =2O 2}(g.+,,
But the last sum is supposed to be zero, and 21 :~ 0, so the proposition is proven. Thus, in a Davidon method, (12) is a sufficient condition for the gradients gi to be orthogonal to one another. Theorem 2.2.
s.=
For each method satisfying (10) and (12), one has
k,__~~ gi "=
(13)
igii 2 ,
where {k.}, a scalar sequence, is the only possible freedom allowed in such methods. Furthermore, if-g. is the projection of 0 onto the convex hull of{go . . . . . g.}, then the directions s, generated by all such methods are colinear to -g..
Proof. F r o m (10), 0 = (s.; A si) = (s., gi+ l - gi), a constant with respect to i. F r o m (12), (s,, gi) = ~. 2~(g,, g j) = k,,
i < n. Thus, (s., gi) is
i ~ n.
j=O
This is a linear system with respect to 2, the solution of which, by L e m m a 2.1,
C. Lemarechal / An extension of Davidon methods
101
is obviously 27 = kJlg;[ 2. Now, adjust kn = kn, so that sn = L~q,/lo, I2 belongs to the convex hull of {go. . . . . 0n}. kn is defined by 1.
Is.I 2 = _~.~21/]g,I ~ = kn = (s.,o,), i <~ n. So the second proposition results from the fact that there exists a unique point ~-. verifying ~-n~ conv{go,..., 9n} and (Y.,g,) ~> ]~.12 for i ~< n, and this point is exactly the stated projection. But in this case,
Remark 2.3. Adachi [1] mentions twelve Davidon methods which are equivalent. Hence the question is: since "(10) and (12) =~ (13)" and since one does not know any method satisfying "(10) and not (13)", is (12) really useful to prove (13)? The answer is obviously yes: for n = 1 there are many directions Sl such that (sx, A So) = 0, but only a few of them verify sl = 2~go + 2~ga. In fact, the phenomenon mentioned by Adachi justifies a posteriori the hypothesis (12): if no method is known verifying "not (12)", perhaps it is because (12) is really natural. Theorem 2.4.
Proof.
F o r each n, 9, ~ O~,f(Xo), w h e r e e. = f ( X o ) - f ( x , ) .
From (6), we know that the theorem holds if f ( x . ) >f f ( X o ) + (9., x . - Xo) - ( f ( X o ) - f ( x . ) ) ,
i.e., if (9,, x. - x0) ~< 0. But from (11) we have n--1
(0n, X. -- XO) =
~
Pi(O., Si) = 0.
i=0
We conclude this section by claiming that when applied to a quadratic objective function all methods defined by (10) and (12), i.e., in practice all Davidon methods, have the following properties: Sn = Proj O / c o n v { - 9 o . . . . . --gn} (with Pn > 0), On~ ~,nf(Xo), where en = f ( x o ) -- f(Xn).
3. Extension to non-differentiable functions
Both the properties exhibited in the preceding section are intrinsic and do not depend on the nature of the objective function. Hence, we are able to
C. Lemarechal / An extension of Davidon methods
102
define an extension of conjugate gradient methods in the following way: Sn is calculated as the projection of O onto the convex hull of { - g0 . . . . . - g . } as long as g. e 8,nf(xo). This latter can easily be checked; if it does not hold, then the sequence is cut off and, for example, sn is taken as -On. In order to ensure convergence, we have to slightly modify e. by defining en=f(xo)-f(x.)+e,
wheree>O.
We propose the following algorithm. Algorithm 3.1. Let Xo 9 H, e > 0 be given. Let go e ~3f(xo). Set i = n = 0, Pi = O. Step 1. Calculate sn, being the projection of O onto the convex hull of . . . ,
- g n ) .
Step 2. Calculate Pn such that f ( x n + pnsn) ~ f(xn + psn), Vp >10. Set Xn+ l = Xn -F pnSn.
Step 3. Determine gn+ 1 9 df(xn+ 1) such that (gn+ 1, sn) = 0. Set n = n + 1. Step 4. If (gn, x. - xp) -%<e, then go to step 1. Step 5. Set i = i + 1, p~ = n, and go to step 1. In step 2, note that since (s,, g,) < 0, f ( x . + ps,) >1 f ( x . ) if p < 0. Our convergence theorem will be: f ( x . ) (which is decreasing) has a limit f * , such that f * ~< inf{f(x): x 9 H} + e. This property is easily proven if the sequence of cuts is finite. Suppose for example that Pi = 0. Then s, ---, 0 from Corollary 1.5. But en is an increasing sequence; hence, if m <~ n, g,~ 9 ~,,~f(Xo) c d,.f(xo) implies that s . e ? , , f ( X o ) . Also e. --. e* = f(Xo) - f * + e. F r o m Theorem 1,1(iii), O 9 d,,f(x0), and from Theorem 1.1(iv), this implies that f ( x o ) <%inf{f(x): x 9 H} + e*. Replacing e* by its value we obtain the desired property. In the preceding algorithm, call a chain a sequence delimited by two successive cuts. Thus the algorithm generates a sequence of chains. If a chain is infinite, the argument above applies. We shall prove that the length of the chains is u n b o u n d e d in order to derive a similar argument. First we need the following result. Lemma 3.2.
Let f b e a uniformly convex function (cf. (2)). I f g e Of(x), then
Vy E H, f ( y ) >>,f ( x ) + (g, y - x) + d([y Furthermore, if d(t) ~ O, then t ~ O.
x[).
C. Lemarechal / An extension of Davidon methods
103
The proof of this classical result is not given. Theorem 3.3.
When applied to a uniformly convex function, Algorithm 3.1 is such that f(x,) ~ f*, where f * <<,inf{f(x): x e n } + ~. Proof. The sequence x. is bounded; from Theorem 1.1(ii), Ig.I is also bounded, say by M; f ( x . ) is decreasing, bounded from below (1), and so converges to f * . For each n, we have f ( x . ) >1 f ( X n + l) + ( g . + l , X. -- X.+ I) -t- d(lx, - x,+~l). But (g,+~, x, - x.+l) = -p,(g.+~, s.) = 0 by step 3. Hence d(lx. - x.+ll) <~ f ( x . ) - f ( x , + 1) ~ 0. F r o m L e m m a 3.2, this implies that Ix. - x.+ 11 ~ 0. On the other hand, if p~ and p~+ 1 are two consecutive cuts, one has by step 4,
< (gp,+,,xp,+l- xp,) <~ M l x p , . ,
- x~, I.
It follows that the length of the chains generated by the algorithm is unbounded. Then, as in Corollary 1.5, for p~ ~< n < p~+ ~, we have
> 0, Is.I s
+
MS'
since n - p~ is the number of projections performed to obtain s.. Let r / > 0 be given; we choose 6 = 2MZ/q. Since p~+ 1 - Pi is unbounded, we can choose p~ such that 2M 2
\2M
) v , + t - p , -' M 2 ~< ~/
+
Thus, Isv,+,_al 2 ~< q. In other words, we have proven that there exists a subsequence {n'} such that Is.,I ~ 0. Each n' belongs to a chain initialized on some iteration denoted by Pl. Then, by Step 1, s., e d~f(xv~), where e', = f(xt,~) - f ( x . , ) + e. Two cases may occur: either P'i <~ N or p; ~ + oo. Suppose p; ~< N. This means that the number of chains is finite; let P be the last one. Then e',-, e * = f ( x v ) - f * + e and, by Theorem 1.1(iii), 0 ~ d~.f(xv). By (iv), f ( x v ) <~ inf{f(x): x ~ H } + f ( x v ) - f * + e and the theorem is proven. Otherwise, p'~ ---, + ~ . Then e'. ~ e. By extracting again a subsequence, we may suppose that xv~ ~ x* weakly. Following the proof of Theorem l.l(iii), we have f ( x * ) <<,f * <~ inf{f(x): x ~ H} + e and the theorem is also proven.
104
Remark may be replaced uniform
C. Lemarechal / An extension of Davidon methods
3.4. This proof indicates that a possible stop in Algorithm 3.1 [s.I ~< r/. On the other hand, the proof remains valid if step 1 is by s. + 1 = Proj O/conv{ s., - g. + 1 } (cf. L e m m a 1.4). Finally, the convexity is only useful to prove that Ix,+1 - x.[ ~ 0.
4. Line search
Even though we know, by Theorem 1.3, that step 3 is feasible, the numerical determination of g. + 1 in Algorithm 3.1 needs to be clarified. This section is closely related to I-6, w Just as in Section 3, where we give a convergent rule for cutting the sequences of conjugate gradients, we give here a convergent rule for an approximate one-dimensional minimization. Our aim is that the proof of Theorem 3.3 remain valid despite the fact that g, + 1 is only approximated. It is easy to see that this will be the case if g. + 1 satisfies: (i) (0.+1, s.)/> -0t Is.I 2 for a certain number ~ < 1 (see (7)); (ii) Ix.+ 1 - x.[ ~ O, which is ensured if f ( x . ) > / f ( x . + l ) + sthg + d(lx.+l
- x.I),
where sthg is some non-negative number; (iii) g.+ 1 9 O~.+, f(xp,) if possible; (iv) (g.+x, x . + l - xp,)/> e' > 0 if a cut must appear, g.+l will of course be sought by a dichotomic search in the presence of: (J) Pl 9 [0, p.], x l = x. + p l s . and gi 9 Of(x1) such that (s., 91) ~< O, (JJ) P2 9 ]P., + ~ [, x2 = x . + p2s. and g2 9 Of(x2) such that (s., 02) >/O. We seek g.+l as a convex combination of gl and g2. Let 21, 22/> O, 21 + 2 2 = 1 and set X -~- )~IX1 "-~ ,~.2X2,
g -- "~lgl q- )]'292-
x and g are candidates for x. + 1 and g. + 1, respectively, Lemma 4.1.
g 9 O~,f(x), where
e' = ,~l,~2(Xl - x2, gl - g2). Proof.
Vz e H, f ( z ) >~ f ( x 0 + (gl, z - Xl) f ( z ) >~ f ( x 2 ) "+" (g2, Z -- X2).
By convex combination, we deduce
(14)
105
C. Lemarechal / An extension of Davidon methods
f ( z ) >~ f ( x ) + (9, z - x) = ~1(41 "~ 42)(91, Xl) -- 42(~1 -~ /1,2)(92, X2)
+ 4~(91, xl) + 41~2[(9,, x2) + (92, x0] + ~(92, x2) and a direct calculation gives the stated value for e'. As in Theorem 1.2, it is easy to see that 9 ~ ~ +,f(xp,), with e', +1 = f(Xv,) - f ( x ) + (g, x -- xv, ) + e'
and e' as in (14). Therefore, since e ~ d , f ( x ) is an increasing mapping, we see that we have (iii) if e'n+ 1 ~< f(xp,) - f ( x ) + e, i.e., if (g, x - xp,) +
41,~,2(X 1 -- X2, 91
--
g2) ~< 8"
(15)
If (15) does not hold, we should cut the chain. But, before doing this, we have to be sure that Ix - xp,[ remains b o u n d e d from below (see (iv)). Note that e'/> 0 in (14). Therefore, if fl~]0,1[, then, as soon as 2142(xl - x2, gl - 92) ~< fl e; "not (15)",
(16)
the inequalities (9, x - xp,) >/(1 - fl)e > 0 obtain. As the dichotomic search proceeds, e' ~ 0. So, if (15) never obtains, we are sure that (16) obtains after a finite number of iterations and then we can safely cut the chain. The precise choices for 41 and 42 will be given in view of (i) and (ii). Lemma 4.2.
Suppose f is uniformly convex with d convex (cf. (2)). I f
41P1(S,, 91) "~- 42P2(Sn, 92) ~ 0,
(17)
then f ( x , ) >1 f ( x ) + d([xn - x[).
Proof.
We have f ( x , ) >~ f(Xl) + (91, Xn -- Xl) "~ d(p 1 ISn[) f ( x , ) >t f(x2) + (92, x , -- x2) + d(p2 Is.I).
By convex combination making use of the convexity of f, we obtain f ( x , ) >i f ( x ) - 2,PI(gl, s,) - 22P2(g2, s,) + ).,d(pl
Is,l) +
4~d(p~Is,l).
But since d is convex,
41d(p, Is,I) + ~=d(p= Is.I) f> dr(~lp~ + ~2p2)Is.I] -- d(Ix - x,I) and thus the desired property is obtained when (17) holds.
106
C. Lemarechal / An extension of Davidon methods
To summarize this section, let us choose e > 0, 0 < ~ < 1, 0 < fl < 1. The dichotomic line-search is terminated as soon as we have simultaneously (see (j), (jj) for notations),
2,(g1, s.) + a2(g , s.) 1> •lPl(gl,
(18)
]s.I 2,
(19)
Sn) + "~'2P2(92, Sn) <~ 0
and either (2191 + 2292, 21xl + 22x2 - xv,) + + )q22(xx - x2, 91 - 92) ~< e,
(20)
or
not (20), )~lJ,2(Xl - x2, 91 - 92) ~< fl/~.
(21)
One of the two following events must occur: (a) Pl becomes strictly positive. Then we can choose 2 1 p l ( g 1, s,) + 22p2(g2, s,) = 0
"~1, 22
such that (22)
and it is shown in [6] that for this choice, (18) and (19) become consistent (we have already seen that there is no problem with (20) and (21)). (b) Otherwise, pa remains zero. In this case, the choice (22) remains certainly inconsistent. It is better to choose 2a, 22 such that (23)
'~1(01, Sn) + 22(92, Sn) -----0.
Then, (19) will never obtain; but this is less important, because we can admit in this case that p, = 0 in Algorithm 3.1, so x,+ 1 = x, and (ii) is automatic. Figure 1 presents a possible flow chart for the algorithm.
5. Numerical experiments Algorithm 3.1 and the algorithm in [6] have been compared on the following "academic" example" 5 f ( x ) = m a x {(AkX, X) -- (fk, X)},
X ~ R 1~
(24)
For a given value attributed to matrices Ak and vectors fk, the optimal solution appears to be near 0. Starting from two different initializations: Xo = 0 and Xo --- 10, and running on two different computers, we generate
C. Lemarechal / An extension of Davidon methods
I
=X p
G={-g}
L~,~~ ~,~
107
I
s=-g
~=~1
[ g E af(xp + ps) [
I pl = p
p=
2p
xl = xp + ps gl = g ,02 = p ii,
[
,1[
X2 = X
n
g2 =
l I ]
I x = ;~xl+(1-X)x~ g ffi ;~#'~ + ( 1 - 'h)Z2
I
I y
( g , x - x o) + X ( 1 - ~ ) ( x l - x 2 , g l - g 2 ) ~
(s,g)~-~lsl2 X
- ~
)
~,(1-h)(x~-x2,g~-g2)
q
i
Y
I
c=cu{-~} I s = projO/G
i J
I
I
1
(
I
J T
P2 = P
[
x 2 = xp +ps g2 = g
[ x I = xp+ps gl = g
PJ = P
Fig.
1.
C. Lemarechal / An extension of Davidon methods
108
Table 1 CII 90-80 xo =0 2
Algorithm 3.1
57 298 -0.51798
4
Algorithm in [6]
15 537 -0.51800
CII 10070
x o = 10 3
20 361 -0.51798 2
20 399 -0.51798
xo =0
1
O0 336 -0.51799 2
22 839 -0.51799
x o = 10
1
22 471 -0.51799 0
41 448 -0.51799
four tests for each method. Table 1 contains for each test a vector giving: the computing time in min. sec., the number of subgradients needed, and the value of the computed optimum. It should be mentioned that (24) is quite difficult: at the optimum point there are some subgradients whose norms are not less than 100, a fact which substantially slows down the convergence (cf. Corollary 1.5). Moreover, at the optimum, f ( x ) = (AkX, x) -- (fk, X) for k = 2, 3, 4, 5. In both methods, the projections have been achieved by making use of the "reduced gradient" method. CII 90-80 has 12 digits/word and is very slow. CII 10 070 has 7 digits/word; the speed is comparable to that of IBM 360-50 but the computing times may have little significance since the system is time-shared.
6. Conclusion It has already been said that two very similar algorithms have been presented, here and in [9]. Both these algorithms are based on "bundling" subgradients, their only difference lying in the test for cutting the chains, apart from the numerical aspects. It can be concluded that bundling is a valuable idea, since it has two justifications: Section 2 here, and the geometric discussion, in [9, Section 3]. But it can also be said that the underlying reason for cutting the chains is the same in both the papers: in [9] this test is n--|
J=pi
c. Lemarechal / An extension of Davidon methods
109
while here it is (Xn -- Xpl, gn) > e.
T h e s e in fact are devised to e n s u r e t h a t Ixn - x~,,I does n o t b e c o m e t o o large. For, as long as sn+ 1 is c a l c u l a t e d by projection, o n e has (s j, sn+ 1) > 0 for all p r e c e d i n g d i r e c t i o n s sj. T h u s , zig-zagging is a v o i d e d ; the a l g o r i t h m c a n n e v e r " r e t r a c e its steps". But it is n o t sure that, as the a l g o r i t h m proceeds, the m i n i m u m p o i n t is n e v e r o v e r s t e p p e d . H e n c e the necessity of a cut if Ixn - xp, I b e c o m e s t o o large.
References [-1] N. Adachi, "On the uniqueness of search directions in variable-metric algorithms", Journal of Optimization Theory and Applications 11 (6) (1973) 590-604. [2] D.P. Bertsekas and S.K. Mitter, "A descent numerical method for optimization problems with nondifferentiable cost functionals", S I A M Journal on Control 11 (4) (1973) 637-652. [3] W.C. Davidon, "Variable-metric algorithms for minimization", A.E.C. Research and Development Report ANL 5990 (1959). 14] V.F. Demjanov, "Algorithms for some minimax problems", Journal of Computer and Systems Science 2 (1968) 342-380. [5] P.J. Laurent, Approximation et optimisation (Hermann, Paris, 1972). 1"6] C. Lemarechal, "An Algorithm for minimizing convex functions", in: J.L. Rosenfeld, ed., Information processing "74 (North-Holland, Amsterdam, 1972) pp. 552-556. [7] G.P. McCormick and K. Ritter, "Projection method for unconstrained optimization", Journal of Optimization Theory and Applications 10 (2) (1972) 57-66. ['8] R.T. Rockafellar, Convex analysis (Princeton University Press, Princeton, N.J., 1970). 1"9] P. Wolfe, A method of conjugate suhgradients for minimizing nondifferentiable functions, Mathematical Programming Study 3 (1975) 145-173 (this volume). [10] G. Zoutendijk, "Some algorithms based on the principle of feasible directions", in: J.B. Rosen, O.L. Mangasarian and K. Ritter, eds., Nonlinear Programming (Academic Press, New York, 1970).
Mathematical Programming Study 3 (1975) 110-126. North-Holland Publishing Company
MINIMAX SOLUTION OF NON-LINEAR EQUATIONS WITHOUT CALCULATING DERIVATIVES Kaj MADSEN Technical University of Denmark, Lyngby, Denmark Received 8 November 1974 Revised manuscript received 12 May 1975
The problem of minimizing the maximum residual of a set of non-linear algebraic equations is considered in the case where the functions defining the problem have continuous first derivatives, but no expression for these derivatives is available. The method is based on successive linear approximations of the functions, and solution of the resulting linear systems in the minimax sense subject to bounds on the solutions. Approximations to the matrix of derivatives are updated by using the Broyden l-2] rank-one updating formula. It is shown that the method has good convergence properties. Some numerical examples are given.
1. Introduction T h e p r o b l e m of minimizing the m a x i m u m residual
F(x) =- m a x J
If (x)l
(1,1)
of a set of nonlinear algebraic functions fj(x) = f j ( x l , . . . , x,),
j = 1. . . . . m
(1.2)
can be solved by a m e t h o d described by M a d s e n [6]. As m o s t other m e t h o d s for o p t i m i z a t i o n this m e t h o d is iterative and ifxk is an estimate of a solution, the increment hk is found as a vector which minimizes the linear a p p r o x imation maxlfj(xk) + ~ 9 J
i=l
, I Tdfj,xtx jh,
(1.3)
subject to the constraint
IIh[I = I](hl,..., h.)ll
- max
Ih, I _<
(1.4)
K. Madsen / Minimax solution of nonlinear equations
111
The rules for adjusting the bounds "~'k, k = 1, 2, ..., ensure that the method will converge and that normally the rate of convergence will be very rapid. The method of Osborne and Watson [8] is similar but it is based on unconstrained solutions to the systems (1.3), and convergence cannot be guaranteed in all cases. In this paper we are concerned with the case where the functions fj are differentiable, but no expressions for the derivatives are available. In this situation it has been shown (see e.g. Broyden [2], Powell [9]) that if the objective function is a sum of squares, then the amount of computational labour is smaller when the Jacobians are generated by a rank one updating formula rather than approximated at each step by linear expressions of the type ~ ( " Xk) ~-- fj{Xk + h e,) - fj{Xk) h '
j ___ 1, . . . , m, i=l,...,n.
(1.5)
Therefore we minimize the function given in expression (1.1) by a method which is very similar to that of Madsen [6] and approximate the derivatives of the functions by using the Broyden formula [2]. We give a detailed description of the algorithm in Section 2 and in Section 3 we prove convergence theorems. We show that if the generated sequence of vectors converges, then the limit point must be a stationary point, and if the derivatives Vfi satisfy a certain linear independence condition at the limit point, then the rate of convergence is super-linear. In Section 4 some examples are given.
2. T h e a l g o r i t h m
We use the notation f ( x ) ==-(fi(x) . . . . . fro(X)),
(2.1)
irs(x)/i
(2.2)
ax =
= max
IA(x)l,
(2.3)
In an iteration we find an increment hk to add to the approximate solution Xk and the next iterand Xk+l is either Xk or (Xk + hk), depending on the function values at these points. Now the approximation Bk+l to the
112
K. Madsen / Minimax solution of nonlinear equations
Jacobian at xk+ 1 is found by the updating formula of Broyden, the idea of which is as follows. Since
fj,
f, xk + hk) -L xk)~- i=1 ~xi~xd"''nu,
(2.4)
it is desirable that the matrix Bk+ x, which should be an approximation of either ~/xk or J,~k+h~, has the property f ( x k -1- hk) - - f ( X k ) :--- Bk+ lhk.
(2.5)
If we further add the restrictions (2.6)
BkV = Bk+ 1 V
for all v orthogonal to hk, then we obtain the updating formula
Bk+ 1 = t3k +
(f(xk + hk) --f(xk) -- Bkhk)h~ h~hk
(2.7)
(Note that this formula requires no extra calculation of function values.) However we have observed that rather slow convergence is often obtained if the sequence of directions hk, k = 1, 2. . . . does not satisfy some linear independence condition. Therefore we introduce some "special iterations" in order to maintain a linear independence. These will be described later in this section. The initial approximation Bo is found by a difference approximation
b!O) =_ fi(xo + t ej) - f ( x o ) t
,
(2.8)
where t is a small positive number and e~ is the normalized j~ coordinate vector. We will refer to the iterations which are not special iterations as "normal iterations". These find the increment hk and the approximation Xk+l in a way which is much the same as that given in [6]. In order to calculate hk we consider the linear function
fk(h) --f(xk) + Bkh
(2.9)
which is a good approximation to f(x) in a neighbourhood of Xk provided that Bk is a good approximation to the Jacobian at xk. N o w we wish to minimize the functional [IA(h)[I, but since expression (2.9)is only a local approximation to f(x) it is reasonable to introduce an upper bound for
K. Madsen / Minimax solution of nonlinear equations
the length of h. Therefore we find
IIA(hk)ll =
rain
II~ II -< ,lk
h k as
113
a solution to the problem
IIf~(h)ll,
(2.10)
where the bound ~.k is positive. This linear problem may be solved by a standard linear programming routine, but it is solved more efficiently by the method of Powell 1-12]. A natural criterion for accepting (xk + hk) as the next point in the iteration is to ensure that the value of the objective function (1.1) will decrease. However this is not sufficient to guarantee convergence, even in the case where the derivatives are calculated exactly at each iteration. Therefore we use a slightly stronger condition: We test whether the decrease in F exceeds a small multiple of the decrease in the linear approximation (2.9), i.e., we use the test f(x~) - g(x~ + n~) >_ pl(IIA(o)II
- IIA(h~)ll),
(2.11)
where Pl is a small positive number. If this condition is satisfied, we choose xk+l = xk + hk, otherwise we let xk+l = xk. In the case where we can calculate the derivatives, the criterion (2.11) enables us to prove that the generated sequence, {xk}, will always converge to the set of stationary points of F. When the approximations Bk of the Jacobians are used, however, we cannot prove a similar result, but we believe that the criterion (2.11) will ensure convergence in practice. N o w the new bound 2k+ 1 is found in the following way: If the decrease in the objective function F is poor compared to the decrease predicted by the approximation (2.9), we choose 2k+ 1 < 2k. More precisely, if F(xk) - F(x~+ 1) <- P2(ItA(O)II - IlA
(2.12)
then we let 2k+ 1 = al:.~,, where 2~, = max{
IIh~ll, ~l~k)
(2.13)
Here the numbers P2, al and a 2 satisfy Pl < P2 < 1 and 0 < cr1 < 1 < a2. As a consequence, the bound will decrease when xk+ 1 = xk. If the iteration is more successful, i.e., if (2.12) fails, then we will allow the value of the bound to increase provided that the agreement b e t w e e n f a n d the approximation (2.9) is good. In fact we test the inequality
Ilf(x~ + hk) --A(hk)l[ -< p3(F(xk) -- Ftxk + hk)),
(2.14)
where 0 < P3 < 1, and if it holds, we let 2k+ 1 = min(a22~,, A) where A > 0
114
K. Madsen / Minimax solution of nonlinear equations
is a general upper bound on the step length. (Note that if IIhk [l is small and Bk is a good approximation to the Jacobian, then the term on the left-hand side of (2.14)is of order Ilhkll 2 whereas the right-hand side is of order ]lhkH.). If neither (2.12) nor (2.14) is satisfied, we let 2k+~ = 2~,. The main difference between this procedure for finding 2k+ ~ and the method we use when the derivatives are given is that in the latter case 2k+~ is defined as a multiple of []hk[I. This difference is due to the fact that errors in the approximation of the Jacobian may yield a very small value of ]lhk I] in a situation where the distance t o the solution is large, and this can be disastrous for the rate of convergence. Another difference is the choice of the constants Pi and tri. Since inequality (2.12) may be true not only because '~k is tOO large but also because of the difference between Bk and the Jacobian, we choose a smaller value of P2 and a rather large value of tr~ in order to prevent the sequence of bounds from decreasing too fast. We tried several values of the constants and found the following satisfactory: (Pl, P2, Pa, o'1, tr2) = (0.01, 0.1, 0.33, 0.5, 2.0). (2.15) It remains to describe the "special iterations". In the first version of our programme we used the algorithm as we have described it so far and it worked very well in most of the examples tried. But in some cases the rate of convergence was extremely slow because a poor Jacobian approximation at some point Xk caused XR+j = Xk in several successive iterations. After a number of iterations of this type, however, Bk+i was a good approximation to J~k, but then the bound 2k+~ had become very small because of the rules for updating the bounds. We tried in several ways to cure this trouble, and found that the method of linear independent directions given by Powell I11] was satisfactory. Every third iteration is a "special iteration" which uses an increment hk of length 2k chosen so that the following identity holds for each value of i /
hyhg
= O.
(2.16)
This condition ensures "uniform linear independence" of the directions In each special iteration we let 2k+ ~ = '~k, and Xk+ ~ = XR unless F(Xk + hk) < F(Xk), in which case Xk+ ~ = Xk Jr- h k.
h k,
(2.17)
K. Madsen / Minimax solution of nonlinear equations
115
3. Convergence theorems The presence of the special iterations and formula (2.16) enables us to prove that if the sequence of vectors generated by the algorithm converges to x*, then the sequence of Jacobian approximations Bk converges towards the Jacobian at x*. This fact makes the proofs of our theorems much easier, but it seems that the theorems would hold without the presence of linear independent directions. We need a smoothness assumption which permits local linearization, and we will use
f ( x + h) = f ( x ) + J= h + o(h),
(3.1)
where the vector o(h) has the property [Io(h)ll/lJhl[ --, 0 for h ~ O. We prove that our algorithm can converge only towards "stationary points", this term being defined as follows.
Definition 3.1. The vector x is a stationary point of F if
F(x) = rain{ [[f(x) + Jxhll }.
(3.2)
If we calculated the derivatives exactly, an iteration starting at a stationary point would give hk = 0 for k = 0, 1, 2. . . . , so we cannot prove anything stronger than convergence towards stationary points for our algorithm. Further, when the linear systems
lk(h) --f(Xk) + J,,.h
(3.3)
are bounded away from singularities it is easy to prove (see [8]) that every direction from a stationary point is uphill. The convergence theorem for the Jacobian approximations is as follows.
Theorem 3.1. I f f satisfies the condition (3,1), and the generated sequence of vectors Xk converges to x*, then
Bk ~
J.,.
(3.4)
Powell [10] proves a very similar result. In the appendix we give a modification of Powelrs proof to prove Theorem 3.1. In the proof of our convergence theorems we also need the following lemma.
K. Madsen / Minimax solution of nonlinear equations
116
Lemma 3.1.
I f Xk ~ X* and there exist positive numbers c and d such that
f ( X k ) -- IIf(xk) + Jxkhk II ~ C min {ll h~ [], d},
(3.5)
then 2k+1 --> IIh~l/for k sufficiently large. Proof. F r o m Theorem 3.1 it follows that B k ~ '-Ix* and therefore we have the inequality
F(Xk) -- F(Xk + hk) >_ F(Xk) -- IIf(xk) + dx~hk + O(hk)iI F ( X k ) - I/Ath~)ll F(x~)- IIf(x~) + J~,~hkl[ § [IJ~ -- B~II" rlh~[I
o(llh~ll)
=1+
F ( x k ) - [If(xk) + Jxkhkll +
o(llh~l[)
= 1 + o(1).
(3.6)
Therefore inequality (2.12) is never satisfied when k is sufficiently large, and the lemma is proved. Theorem 3.2.
I f xk ~ x*, then x* is a stationary point of F.
Proof. Suppose x* is not a stationary point. In this case there exists a vector h* such that
d - II/(x*)ll -II:(x*) + J.,h* II > O.
(3.7)
Because of continuity this implies that for k sufficiently large, k > ko say, we have
Ilf(xk)/I - I[f(xk)
+
ax~n*ll - d/2.
(3.8)
If we let tk be defined by the equation [Ihk/I = tk [[h*l[,
(3.9)
then we deduce from inequality (3.8) and the definition of hk that for k_koandtk< 1,
Ilf(x~)l[ - [If(x~) + J~hkll --- [[f(x~)[I -IIf(x~) + Jx~(tdi*)lt = i]f(Xk)[j -- [[f(Xk)(1 -- tk) + tk(f(Xk) + ,J,.h*)il >-
t~(]l f(x~)ll - [If(x~) + -/~h*il)
>- tk" d/2.
(3.10)
Now (3.8), (3.10) and the definition ofhk imply that for k > ko we have the
K. Madsen / Minimax solution of nonlinear equations
117
inequality
IIf(xk)ll
-
Ilf(xk) + J~)ik[[
> min(tk, 1)' d/2,
(3.11)
and therefore it follows from Lemma 3.1 that
~a+x -Ileal[
(3.12)
for k sufficiently large. Further, Theorem 3.1 implies that B a ~ J~,, and therefore inequality (3.7) gives that for k sufficiently large we have
[If(xk)ll - Ilf(xa) + eah* I[ > d/2.
(3.13)
However, hk --} 0 and consequently,
IIf(x~)[I -IIf(xa) + B~hall-~ 0.
(3.14)
(3.13) and (3.14) can only hold if ha is restricted for all large values of k, and therefore
Ilha[I = ~k
(3.15)
when k is large. Now (3.12) and (3.15)imply that Hhk+~ 11 >-- I/hall for all large values of k. This contradicts the fact that hk ~ 0 and therefore x* must be a stationary point of F. In the remaining part of this section we assume that m > n, and that the objective function F is positive at the solution x*. In this situation the rate of convergence for the algorithm depends on the following condition. Definition 3.2. If every n x n submatrix of,/x is nonsingular, then we say that ,/~, satisfies the Haar-condition.
Lemma 3.2. Let hx be the unconstrained minimax solution of the linear system Ij(h) = fj(x) + ,=~ ~xitx) h, = O,
j = 1. . . . . m.
(3.16)
I f the Haar-condition holds at a solution x* of the non-linear problem (1.1), then there exist 6 > 0 and c > 0 such that for IIx - x* If - ~, F(x) -
[If(x) + ,L, hx[t ->
F(x) -
F ( x * ) >_ cllx
cllhxll
(3.17)
and -
x*ll
(3.18)
K. Madsen / Minimax solution of nonlinear equations
118
Proof. It follows from the theory of linear minimax approximation that when the matrix J , satisfies the Haar-condition, the solution to the problem (3.16) is the minimax solution to (n + 1) of the equations in (3.16) and the maximum residual is attained by each of these equations. Such a set of (n + 1) equations is called a reference, and we use the notation refz for the set of indices defining the reference at the point x. Without loss of generality we can suppose that A(x*) = . . .
= A(I*) > If,<x*)l,
i > k.
(3.19)
Then the Haar-condition implies that k > n (see [-3J). It follows that if x is close enough to x*, then the reference defining the solution of (3.16) will consist of functionsfi with j _< k, and further the signs of the residuals corresponding to functions in the reference will all be positive. This means that there exists 51 > 0 such that
IIx - x*ll < 5, ~ I[f(x) + J~h~ll = f/x) § (vf~<x), h~),
j~ref~, (3.20)
where (Vfj(x), h) is the inner product, (Vfj(x), h) = ~ ~fJ(x) h,.
(3.21)
i= 1 (~Xi
Again, from the theory of linear minimax approximation it follows that if x satisfies the condition in (3.20) then there exist positive constants r such that
Z
je ref=
~JVL(x) = O.
(3.22)
For any vector h # 0 we deduce from this jeref=
~j(Vfj(x), h) = 0
(3.23)
and consequently one of the terms in the sum must be positive, since tl~ Haar-condition implies that at least two of the inner products are non-zero. Therefore c~, =
rain ~ max (Vf~(x), h) ~ > 0 )
Ilhl= l ~ j~refx
(3.24)
and since c~ is continuous as a function of x there exists cl > 0 such that i[x - x * l l -< 51 ~ c x >- c~. From (3.20) and (3.24), with h = - / i x
(3.25) inserted, it follows that for
K. Madsen / Minimax solution of nonlinear equations
IIx - x*ll ~ ~1
we
119
have
max f j ( x ) = max
jr refx
j~ refx
{ IIf(x) + a~hxll
-(vf,<x), hx))
-> Itf(x) + Jxh~ll + Cl IIh~ll
(3.26)
and since the left-hand side of this inequality does not exceed F(x), inequality (3.17) is proved. Now it is easy to prove (3.18). Inequality (3.24) implies that
max _ cl Ilhll
(3.27)
j_k
and consequently max {fi(x*) + (Vfj(x*), h)} >- [If(x*)[] + cl j
llh]l.
(3.28)
From this inequality and the smoothness condition (3.1) we deduce that for IIhll sufficiently small, F(x* + h) > max {fj{x* + h)} > F(x*) + j<_k
Cl/2llhll.
(3.29)
This is the same as (3.18) and the lemma is proved. We now prove that in certain circumstances our algorithm converges superlinearly to the solution. This term is defined as follows. Definition 3.3.
If Xk ~ X* we say that the convergence is superlinear if
IIx~§ - x*ll -< ~llx~ - x*ll,
(3.30)
where the sequence ek, k = 1, 2, . . . , tends to zero for k ~ oo. In the normal definition of this term, Xk+2 on the left-hand side of (3.30) is replaced by Xk+ 1, but we have to use this weaker condition because of the special iterations in our algorithm. T h e o r e m 3.3. I f the sequence of vectors generated by the algorithm is convergent then the rate of convergence is superlinear provided the Haarcondition holds at the limit point.
Proof.
First we prove that the inequality
IIhk II < ~
13.31)
holds for arbitrary large values of k. Next it is shown that if k corresponds
120
K. Madsen / Minimax solution of nonlinear equations
to a normal iteration, and min {1[hk 1[. IIXk -- X* I1} < '~k'
(3.32)
IIx~+,
(3.33)
then - x*lf ~ ~ Ifx~ - x*fl,
where ek ~ 0 for k --* oo. From this it will follow that (3.33) fails only for a finite number of normal iterations, and finally it is shown that the special iterations do not spoil the result. In the following, N is the set of indices corresponding to normal iterations. For the first part of the proof we use inequality (3.17) in Lemma 3.2. Suppose that k ~ N and define tk by IIh~ll = t~ IIh=~ll,
(3.34)
where h~ is defined as in L e m m a 3.2. Because of the definition of the vectors, tk < 1, and then (3.17) implies that there exists a positive number c such that
IIf(x~)ll -IIf(x~)+ Jx, hk[[ >-- IIf(x~)ll -[If(x~) + Jx.(t~h=.)ll _> t~(ilf(xk)ll- ilf(xk) + J,,h,,,]l) >_ t~c llh,.ll -- c iln~ll.
(3.35)
Therefore we can use Lemma 3.1 and we find that 2k +, > IIh~ II for k > ko (say). Note that we can choose k0 such that inequality (2.11) always holds for k > ko. N o w hk --* 0 for k --* oo, and this can be true only because (3.31) holds for infinitely many values of k. In order to prove that (3.32) implies (3.33) we let e;,, k = 1, 2 . . . . , be a sequence of positive numbers satisfying ][f(Xk +.hk) -- {f(Xk) + J=khk}]l < e'k IIh~l[,
t[ J,,. - B R[I <- e'k, e;, --* 0
for k --* oe.
(3.36)
The existence of {e;,} is guaranteed by the smoothness condition on f a n d by Theorem 3.1. N o w suppose that iteration number k is a normal iteration. If k > ko, then (2.11) holds and Xk+l = Xk + hk. Then (3.18) of L e m m a 3.2 implies that there exists a positive number c such that for k sufficiently large,
K. Madsen / Minimax solution of nonlinear equations
x*ll
C Ilxk+ 1 -
121
IIfIx*lll ~ IIfIx~+~/ll--IIf(x~ § h~)ll <- IIfIx~) § J~h~ll + ~;, Ilhkll <- Ilftx~t § a~h~ll § 2~;, IIh~ll. I3.37t
§
If (3.32) holds, then we have the following inequality because of the definition of hk
IIf(x~) + a~h~ll < IIf(x~) + ak(x* - x~)ll < Ilf(xk) + Jxk( x* -
Xk)II
+ ~'~ IIx* - xkll
-< IIf(x*)ll + 2~ IIx* - x~ll.
(3.38)
Combining (3.37) and (3.38) we find that < IIx~., - x*ll-(2/c)~tilh~il
§ ilx* - xkil)
(3.39)
and since Ilhkll = IIx* -- xk + x ~ l
-- x*ll
(3.40)
IIx* -- x~ll + IIx~l -- x* II, we deduce that
(3.41) and (3.33) follows. Further, since x k and ek < 89 then
--
X*
Ilxk.,-x*ll-<
~----
(Xk+ 1
--
X*)
~kllx~-x*ll-<
h k we find that if (3.33) is true
--
~ -IIh~ll-< 2~kllh~ll.
1 -
ek
(3.42)
Because of the smoothness condition (3.1), and Lemma 3.2 there exist positive numbers Cl and c2 such that cl llx; - x*ll <- IIf(xj)ll - IIf(x*)ll <- c2 IIx~ - x*I 1
(3.43)
for j sufficiently large. Therefore, since
Ilf(x;§
___ IIf(x,)ll
for q -> 0,
it follows that
IIx~+~- x*ll-< c_~ Ilxj- x*ll. c,
(3.44)
Now we choose a number kl such that (3.44) holds for j > kl and also
K. Madsen / Minimax solution of nonlinear equations
122
ej < 89c--L[/0-1/2\ r
for j > kl.
(3.45)
\ 0"2.,/
Since the following inequality holds t 0-1 ';]'k+ 1 ~ 0-1/~k ~ --/~k 0-2
(3.46)
we deduce from (3.42), (3.45) and (3.44) that if k E N, k > kl and (3.32) holds, then
IIx k + ,
.
28kRk. <-- c l ((71x 2t~,k .
x * [I .<
C2 \0-2,]
and
IIx +a --
x*ll
_
+, -- x*ll
Cl 0"1 C2 0"2
~k+ 1 < ~k+ 1
0"1 "( - - ' ~ k + l ~-~ ~k+2' 0"2
(3.47) (3.48)
From this it follows that if (3.32) holds for k E N, k >_ ka, then it will hold also in the next 2 iterations. Therefore, since every special iteration is followed by a normal iteration, (3.33) will hold for every k e N following the first index k 2 ~ k 1 for which (3.31) holds. N o w the result is a consequence of (3.44) since
IIx +2 - x*ll -< max{ek+1, ek}
C- llxk- x*[I
(3.49)
cl
for every k > k2.
4. Numerical results
We have solved several test problems with our algorithm, and we illustrate the results by giving the number of function evaluations for some of these. The calculations were performed in double precision on an IBM 370/165 computer. The first problem is a data fitting problem given in [1]: Find the parameters xi, i = 1, 2 . . . . . 5 such that the rational expression Xl + x2y 1 +Xay+x4y 2+xsy 3
(4.1)
is the best fit to e y at the points yj = - 1 (0.1) 1. Thus the functions are
xl + xzyj - e yJ, f ~ ( x l , . . . , x 5 ) = 1 + x3y j + x4y~ + xsy 3
j = 1, 2 . . . .
21
(4.2)
K. Madsen / Minimax solution of nonlinear equations
123
and we use the starting point Xo = (0.5, 0, 0, 0, 0) and the initial bound 20 = 0.125. The second example is the Chebyquad equations as defined by Fletcher [-4]. Here we find zeros for n = 2, 4, 6 and 9, whereas no zero exists in the case n = 8. Here a local m i n i m u m x* with F ( x * ) > 0 is found, the convergence being rather slow because the Jacobian is singular at the solution. We used the starting vectors given by Fletcher and the initial values 2o = 0.01 for n < 6 and 20 = 0.005 for n = 8, 9. Finally we consider the functions fl(x)
= x21 + xg -
1,
f2(x) = 3 -
x21 -
x~,
f3(x) = x 4 + x 4 + 2 x ~ x ~ - 3(x~ + x~2) + 1.
(4.3)
For these functions the m a x i m u m residual attains its minimum at every point on the circle with centre (0, 0) and radius x/2. We tried several starting points and in all the cases the sequence of vectors generated converged to a point on the circle though not always the same point. We also tried the method where the derivatives are given [-6] on these test examples and here the sequence of vectors corresponding to the functions (4.3) always converges to the set of points on the circle but not always to a specific point. We give the number of function evaluations used when Xo = (3, 1) and 2o = 0.2. In Table 1 we quote the number of function evaluations used by the two methods to find the solutions with 6 decimal accuracy. The figures for this m e t h o d are considerably smaller than the figures which would be the result of a derivative-approximation of the form (1.5) at each step in the iteration, since these figures would be at least (n -~ 1) times the figures in the last column. Table 1 Number of function evaluations 6 decimal accuracy
Functions (4.2) Chebyquad
n
m
This method
Derivativesgiven
5 2
21 2
31 13
9 7
--
4
4
18
8
--
6
6
29
9
--
8
8
123
40
--
9
9
2
3
Functions (4.3)
60 29
12 16
K. Madsen / Minimax solution of nonlinear equations
124
Our results indicate that the method is also efficient for finding zeros of non-linear equations in the case where m = n. For comparison we give the number of function evaluations used by the method of Powell [-9] to find the zeros of the Chebyquad functions, and also to find the zeros of the trigonometric functions
fj(x) = ~ {Aijsin xi + Bijcos xi} - Ej,
j = 1. . . . . n
(4.4)
i=1
whose parameters are given by Fletcher and Powell [-5]. Table 2 Number of functionevaluationsused to meet the criterion F(Xk) < n Chebyquad ----
Functions(4.4)
This method
2
11
4 6 9
16 26 54
5
10 - 4
Powell 7 14 34 46
16
16
--
10
21
22
--
20
33
45
These results, and the results from other test examples, show that our method is efficient for solving overdetermined systems of non-linear equations in the minimax sense. This includes minimax data fitting. Further, the algorithm is a good alternative to existing methods for finding zeros of systems of non-linear equations.
Appendix Proof
of
Theorem 3.1. Define the matrix C k by
(
B k + , - J x , = (Bk--Jx,) I - h ~ h k / i - Ok,
(5.1)
that is, Ck = (Bk -- dx,) ( I - - h k h T k ~ - - ( B k + (f(Xk + hk)--_h~ffkkf(Xk)- Bkhk)hTk)
h hJ
(f(Xk + hk) -- f(Xk) -- d~*hk)h T
hrhk
+4. (5.2)
K. Madsen / Minimax solution of nonlinear equations
125
Since hk -~ 0 for k ~ ~ the smoothness condition (3.1) implies that 1]f(Xk + hk) -- f(Xk) -- Jx, hk il _~ 0
for k ---, oo
(5.3)
iFhkll and therefore
[ICk[[ ~ 0
for k ~ ~ .
(5.4)
Now suppose that ilck[I < ~ for k >__ ko. Then it follows from (5.1) and (2.16) that for k > ko
J,,*l[ -< (Bk
-
-
J'*)"
+
IIC [I
3n'F,.
(5.5)
F r o m this it follows that Bk ~ Jx* and Theorem 3.1 is proved.
Acknowledgment Most of this work was carried out during my stay at AERE Harwell, England, and I wish to express my thanks to M.J.D. Powell for m a n y helpful discussions and suggestions.
References [1] I. Barrodale, M.J.D. Powell and F.D.K. Roberts, "The differential correction algorithm for rational l~-approximation", SIAM Journal on Numerical Analysis 9 (1972) 493. [2] C.G. Broyden, "A class of methods for solving non-linear simultaneous equations", Mathematics of Computation 19 (1965) 577. [-3] A.R. Curtis and M.JD. Powell, "Necessary conditions for a minimax approximation", The Computer Journal 8 (1966) 358-361. [-4] R. Fletcher, "Function minimization without evaluating derivatives--a review", The Computer Journal 8 (1965) 33. [-5] R. Fletcher and M.J.D. Powell, "A rapidly convergent descent method for minimization", The Computer Journal 6 (1963) 163. [-6] K. Madsen, "An algorithm for minimax solution of overdetermined systems of nonlinear equations", Journal of the Institute of Mathematics and its Applications, to appear. [7"] D.W. Marquardt, "An algorithm for least squares estimation of non-linear parameters", SIAM Journal on Applied Mathematics 11 (1963) 431. [8] M.R. Osborne and G.A. Watson, "An algorithm for.minimax approximation in the non-linear case", The Computer Journal 12 (1969) 63.
126
K. Madsen / Minimax solution of nonlinear equations
[9] M.J.D. Powell, "A hybrid method for non-linear equations", in: P. Rabinowitz, ed., Numerical methods for non-linear algebraic equations (Gordon and Breach, New York, 1970) p. 87. [10] M.J.D. Powell, "A new algorithm for unconstrained optimization", in: Nonlinear programminff (Academic Press, New York, 1970). [11] M.J.D. Powell, "A Fortran subroutine for unconstrained minimization, requiring first derivatives of the objective function", Rept. R6469, A.E.R.E., Harwell, England (1970). [12] M.J.D. Powell, "The minimax solution of linear equations subject to bounds on the variables", Rept. CSS11, A.E.R.E., Harwell, England (1974).
Mathematical Programming Study 3 (1975) 127-144. North-Holland Publishing Company
THE USE OF THE BOXSTEP METHOD IN DISCRETE OPTIMIZATION Roy E. MARSTEN Massachusetts Institute of Technology, Cambridge, Mass., U.S.A. Received 10 March 1975 Revised manuscript received 9 April 1975 The Boxstep method is used to maximize Lagrangean functions in the context of a branch-and-bound algorithm for the general discrete optimization problem. Resuits are presented for three applications: facility location, multi-item production scheduling, and single machine scheduling. The performance of the Boxstep method is contrasted with that of the subgradient optimization method.
1. Introduction
The Boxstep method [-15] has recently been introduced as a general approach to maximizing a concave, nondifferentiable function over a compact convex set. The purpose of this paper is to present some computational experience in the use of the Boxstep method in the area of discrete optimization. The motivation and context for this work, originated by Held and Karp [10], is provided by Geoffrion [-9] and by Fisher, Northup and Shapiro [-5, 6], who have shown how the maximization of concave, piecewise linear (hence nondifferentiable) Lagrangean functions can provide strong bounds for a branch-and-bound algorithm. We shall consider three applications: facility location, multi-item production scheduling, and single machine scheduling. Our experience with these applications, while limited, is quite clear in its implications about the suitability of Boxstep for this class of problems. We shall also take this * The research reported here was partially supported by National Science Foundation grants GP-36090X (University of California at Los Angeles), GJ-1154X2 and GJ-1154X3 (National Bureau of Economic Research).
128
R.E. Marsten / The use of the Boxstep method in discrete optimization
opportunity to introduce two refinements of the original Boxstep method which are of general applicability. Three appendices give complete data for the test problems used.
2. The Boxstep method We present here a specialized version of the Boxstep method which is adequate for maximizing the Lagrangean functions which arise in discrete optimization. We address the problem max w(lr),
(2.1)
w(~) = m i n ( f k + ~ ok).
(2.2)
~t>O
where
fk is a scalar, it, gk ~ R m, and K is a finite index set. Thus w(ir) is a concave, piecewise linear function. The Boxstep method solves (2.1) by solving a finite sequence of local problems. Using (2.2), the local problem at 7rt with box size fl may be written as P(Tr'; fl)
maximize subject to
a, fk+~gk>a fork~K, t t 7ri--fl~< lri--< ~ i + f l f o r i = 1,...,m, 7r>0.
This local problem may be solved with a cutting plane algorithm [7, 12, 14]. If a global optimum lies within the current box, it will be discovered. If not, then the solution of the local problem provides a direction of ascent from 7rt. The Boxstep method seeks out a global optimum as follows. Let P(Trt; fl) denotes p(;~t; fl) with K replaced by some K" _~ K.
Step 1 (start). C h o o s e 7~1 ~__ 0,/~ ~_~ 0, j~ > 0. S e t t = 1. Step 2 (cutting plane algorithm). (a) (initialization). Choose K" ___K. (b) (reoptimization). Solve P(rrt; fl). Let ~r, # denote an optimal solution. (c) (function evaluation). Determine k* ~ K such that w(~) = f k , + ~rgk, (d) (local optimality test). If w(r > # - e, go to step 3; otherwise set = K w {k*} and return to (b). Step 3 (line search). Choose n t+l as any point on the ray {$ + 0c($ - it'): >- 0} such that w(rct+ 1) >_ wffr). Step 4 (global optimality test). If w(Trt+ 1) < w(rct) + e, stop. Otherwise sett=t+ 1 and go to step 2.
R.E. Marsten / The use of the Boxstep method in discrete optimization
129
The convergence of the method is proved in [,15]. In the piecewise linear case (finite K) we may take e = 0, at least in theory. The implementation of the method works with the dual of P(~r'; fl) so that new cuts can be added as new columns and the primal simplex method used for reoptimization at step 2(b). The motivation behind the method is the empirical observation that the number of cutting plane iterations required to solve P(~zt; fl) is a monotonically increasing function of B. This presents the opportunity for a trade-off between the computational work per box (directly related to fl) and the number of boxes required to reach a global optimum (inversely related to B). Computational results reported in [,15] demonstrate that, for a wide variety of problems, the best choice of B is "intermediate", i.e., neither very small nor very large. If B is sufficiently small, then (in the piecewise linear case) we obtain a steepest ascent method; while if fl is sufficiently large, Boxstep is indistinguishable from a pure cutting plane method. (For ~1 = 0 and B = oo we recover the Dantzig-Wolfe method [2].) For intermediate values of fl we have something "between" these two extremes. The three applications which follow are all of the form: v* = min fix), x~X
s.t. 9(x) < b,
(2.3)
where f : X - - . R, g : X ~ R m, and X = {x*: k s K } is a finite set. The Boxstep method will be used to maximize a Lagrangean function w(rO, defined for rr ~ R~ as
w(n) = min f(x) + n[g(x) - b]. xeX
(2.4)
Any branch-and-bound algorithm for (2.3) can compute lower bounds by evaluating this Lagrangean, since w(n)< v* for all n > 0. Finding the greatest lower bound, i.e., maximizing w(n) over all n > 0, is a dual problem for (2.3). Thus we shall be using Boxstep to solve a Lagrangean dual of the discrete program (2.3). By defining fk = f ( x k) and gk = g(x k) _ b for all k ~ K we obtain the form assumed above, (2.2).
3. Facility location with side constraints The first application is a facility location model [-8] of the form:
130
R.E. Marsten / The use of the Boxstep method in discrete optimization
min ~, f x i + ~ i=l
~ ciYij,
(3.1)
i=1 j=l
subject to
~ y~j = 1,
j = 1..... n~
(3.2)
i=1
A x + B y _< r,
(3.3)
vixi < Z djyij <- Vixi,
i = 1.... , m,
(3.4)
j=l
O
xi = 0 or 1,
for alli, j, i = 1, ..., m.
(3.5) (3.6)
The variable .xi is an open/close variable for facility i, which will have minimum and m a x i m u m throughput v~ and V~, respectively, if opened. The cost of opening facility i is f~, the unit cost of serving customer j from facility i is c~j and the demand of customer j is dj. The (3.3) constraints are general linear side constraints. (In fact these will be Benders cuts since the problem displayed here is actually the master problem in a Benders decomposition context [8].) Let p denote the number of these side constraints. Geoffrion has devised a branch-and-bound algorithm for (3.1)-(3.6) which uses the Lagrangean function formed by dualizing with respect to constraints (3.2) and (3.3). If we let (21,.., 2n) and (/~1.... , pp) be the dual variables for (3.2) and (3.3), respectively, then the Lagrangean is -
--
i=1
§
j=l
k=l
where, for each facility i,
wi(2, /~) -- min(f~ + # Ai) x~ + Z (cij + 2j + I~ B~j) Yij, xi,yij
subject to
j=l
v~xi _<
djyij < V~xi, j=l
0 < y~j < 1, X i = 0 or
j=l
.... ,n,
1.
A~ and Bij are columns of A and B, respectively. Each function w i is easily evaluated by considering the two alternatives xi = 0 and x~ = 1. For xi = 1 we have a continuous knapsack problem.
R.E. Marsten / The use of the Boxstep method in discrete optimization
131
An attempt was made to maximize w(2,/t) over all 2, p > 0 with the Boxstep method. The test problem (Problem A of [8]) has m = 9 facilities, n = 40 customers, and p = 7 side constraints (Benders cuts). Thus rr = (2,/~)~R 47. Problem (3.1)-(3.5) with (3.6) replaced by (0 < xi < 1) was solved as a linear program and the optimal dual variables for constraints (3.2) and (3.3) are taken as the starting values for 2 and #, respectively. At step 2(a), K" is taken as all cuts already generated, if any. The line search at step 3 is omitted (ret+l = r the tolerance is e = 10 - 6 . Table 1 reports the outcome of four runs, each of 20 seconds' duration (IBM360/91). The last four columns give the number of w(zr) evaluations, number of linear programming pivots, the pivot/evaluation ratio, and the value of the best solution found. Table 1 Facility location problem fl
boxes
w(~)eval
LP piv
piv/eval
best
0.250 0.100 0.010 0.001
<1 <1 <1 8
37 34 43 58
745 706 433 282
20.1 20.7 10.1 4.9
10.595676" 10.595676" 10.793467 10.789490
starting value. Global optimum at 10.850098.
The results are not encouraging. Convergence of the first local problem could not be achieved for a box size of 0.25, 0.10, or 0.01. Convergence was finally achieved with fl = 0.001 and 8 local problems were completed in the 20 seconds. The increase in the Lagrangean over these 8 boxes amounted to 76~o of the difference between the starting value (10.595676) and the global optimum (10.850098). The price paid for this increase, in terms of computation time, is prohibitively high, however. Geoffrion [8] has executed an entire branch-and-bound algorithm for this problem in under 2 seconds (same IBM360/91). Geoffrion did not attempt to maximize the Lagrangean in his algorithm but simply used it to compute strong penalties [9]. The computational burden on the cutting plane algorithm for a given local problem P(rct; fl) depends on the number of cuts needed and on the average number of LP pivots required to reoptimize after a cut is added. This average is given by the pivot/evaluation ratio and is recorded in
132
R.E. Marsten / The use of the Boxstep method in discrete optimization
Table 1. In the present application, difficulty was encountered with both of these factors. First, some cuts had no effect on the objective function value &. As many as ten successive cuts had to be added before the value of & dropped. This is simply a reflection of degeneracy in the dual of p(nt; fl). The effect of this degeneracy is to increase the number of cuts needed for convergence. The second and more serious difficulty, however, is the great number of pivots (more than 20 for fl > 0.10) required for each reoptimization. This is in marked contrast to other applications where, typically, only one or two pivots are required (see Section 4, and the results in [15]). This behavior was quite unexpected and appears to be a kind of instability. Starting with only one negative reduced cost coefficient (for the newly introduced cut), each pivot eliminates one negative but also creates one (or more). This process continues for several pivots before optimality is finally regained. Unfortunately this phenomenon is not unique to this class of facility location problems but arises in the application of Section 5 as well. Its effect is to impose a heavy "overhead" on the Boxstep method, rendering it very expensive computationally. Three suggestions that might be offered are: (a) generate a separate column for each facility at each iteration [-12, p. 221], (b) use the dual simplex method at step 2(b), (c) use a larger tolerance, say e = 10-3. The outcomes are: (a) much worse, (b) much worse, and (c) no change. We shall return to this test problem in Section 6.
4. Multi-item production scheduling The second application we shall consider is the well-known DzielinskiGomory multi-item production scheduling model with one shared resource [-3, 12, 13]. Two test problems are used: one with I = 25 items and T = 6 time periods, the other with I = 50 and T = 6. The variables n = (zt1..... nr) are the prices of the shared resource in each time period; resource availability in each period is given by b = (bl .... , br). The Langrangean function w(rt) is given by I
T
W(g) = E wi(T[) -- E i=1
~kbk,
(4.1)
k=l
where wi(rc) is the optimal value of a single-item production scheduling
R.E. Marsten / The use of the Boxstep method in discrete optimization
133
problem of the Wagner-Whitin type [16] and is evaluated by a dynamic programming algorithm. Thus evaluating w(r0 involves solving I separate T-period dynamic programs. The 25- and 50-item problems are solved, for several box sizes, using Boxstep. The origin is taken as the starting point at step 1 (re1 = 0) and the line search is omitted at step 3 (r: + 1 = ~). No more than 13 cuts are carried. (Once 13 cuts have been accumulated, old non-basic cuts are discarded at random to make room for new ones.) A tolerance of e = 10-6 is used. Note that z~e R 6. The results are presented in Tables 2 and 3. For each run the number of w(n) evaluations, LP pivots, and pivot/evaluation ratio are recorded. The computation times are in seconds for an IBM370/165. Table 2 Twenty-five item problem; original Boxstep method fl
boxes
v(y) evaluations
LP pivots
piv/eval
time
0.10 0.20 0.30 0.40 0.50 0.60 0.70 0.80 0.90 1.00 1.25 1.50 1.75 2.00 10.00 20.00
18 9 6 5 4 3 3 3 2 2 2 2 1 1 1 1
98 85 70 68 77 56 65 61 44 53 49 54 32 36 42 50
105 104 99 93 111 72 81 86 64 74 67 78 47 56 58 75
1.1 1.2 1.4 1.4 1.4 1.3 1.2 1.4 1.5 1.4 1.4 1.4 1.5 1.6 1.4 1.5
2.81 2.45 2.23 2.02 2.20 1.57 1.85 1.80 1.21 1.42 1.27 1.45 0.86 0.96 1.10 1.31
Table 3 Fifty item problem; original Boxstep method /3 1.0 2.0 5.0 10.0 20.0 40.0
boxes
v(y) evaluations
LP pivots
piv/eval
time
8 4 2 1 1 1
187 130 99 68 71 72
229 154 126 103 96 97
1.2 1.2 1.3 1.5 1.4 1.3
7.05 4.82 3.59 2.71 2.77 2.83
134
R.E, Marsten / The use of the Boxstep method in discrete optimization
For the 25-item problem, w(0) = 47 754.00 and v* = w(lt*) = 48 208.80; while for the 50-item problem w(0) = 92 602.00 and v* = w(rc*) = 94 384.06. In this application the Boxstep method has no difficulty in reaching a global optimum. Notice that the pivot/evaluation ratio never exceeds 2. This is a significant qualitative difference from the facility location problem of Section 2. Examination of local problem convergence reveals the signs of degeneracy in the dual of P(rct; fl), that is, several cuts may be required to reduce & This difficulty can apparently be overcome (at least in R 6) as long as each reoptimization takes only one or two pivots. The same two test problems were also solved by the direct generalized upper bounding (GUB) approach advocated by Lasdon and Terjung 1-13]. The times are 2.20 seconds and 6.87 seconds for the 25-item and 50-item problems, respectively. This suggests that Boxstep may be quite successful on Dzielinski-Gomory problems, particularly since these usually involve only a T = 6 or 12 month horizon. This will require testing on industrialsize problems for verification (e.g. I = 400, T = 12). These production scheduling problems will serve to illustrate a refinement of the original Boxstep method. Let r~ denote an optimal solution of the local problem P(rc'; fl). We may define G t = w(~) - w(zd) as the 9ain achieved in box t. Because of the concavity of w(rc), we would expect the gain achieved in successive boxes to decline as we approach a global optimum. For example, in the 13 = 0.20 run from Table 2, the sequence of gains over the nine boxes is: 271, 71, 33, 25, 23, 14, 10, 6, 2 (rounded). Notice that solving the first local problem gives us some idea of the gain to be expected in the second. Since solving a local problem to completion is often not worth the computational cost when we are far from a global optimum, this suggests the following cutoff rule. Choose an "anticipated gain" factor 7, 0 < ~ _< 1.0, and if while working on P(lr t+ 1; fl) a point ~ is generated with w(~) >_ w(rc t+ 1) + 7 G', then stop the cutting plane algorithm, set r = ~, and proceed immediately to step 3. (In this event take G t+~ = Gt.) A large value of 7 should have little effect on the trajectory {rcr: t = 1, 2, ... } while offering the possibility of computational savings. Too small a value of ~, however, may cause wandering in response to small improvements and hence an increase in the number of boxes required. These effects may be observed in Table 4, where the fl = 0.10 and fl = 0.20 runs from Table 2 are repeated with alternative gain factors (~ = 1 reproduces the original results). The column headed
R.E. Marsten / The use of the Boxstep method in discrete optimization
135
Table 4 Twenty-five item problem; suboptimization based on anticipated gain factors (~) r
7
boxes
subopt
v(y) evaluations
LP pivots
time
0.10 0.10 0.10 0.10 0.10
0.10 0.30 0.50 0.80 1.00
19 18 18 18 18
16 11 11 2 0
64 66 77 99 98
93 91 105 114 105
1.72 1.77 2.05 2.52 2.81
0.20 0.20 0.20 0.20 0.20
0.10 0.30 0.50 0.80 1.00
11 9 9 9 9
8 5 4 1 0
60 62 69 77 85
89 80 84 91 104
1.57 1.55 1.78 2.06 2.45
"subopt" gives the number of local problems which are terminated when the anticipated gain is achieved. In both cases the maximum reduction in computation time is a little less than 40 %.
5. Single machine scheduling Finally we consider the single machine scheduling model of Fisher [-4] 1. The problem is to schedule the processing of n jobs on a single machine so as to minimize total tardiness. Job i has processing time Pi, due date di, and start time xi (all integer-valued). To obtain bounds for a branch-andbound algorithm, Fisher constructs the Lagrangean function w(zO=min ~ x~X j = l
max{xj+pj-dj,
O} +
rck , k=xy+ 1
where rck is the price charged for using the machine in period k and X is a finite set determined by precedence constraints on the starting times. Fisher, who has devised an ingenious special algorithm for evaluating w(rc), uses the subgradient optimization method [-11] to maximize w(rc). When using subgradient optimization, the sequence of Lagrangean values { w(rct): t = 1, 2, ... } is not monotonic and there is no clear indication of whether or not a global optimum has been found. Consequently, a predetermined number of steps is made and the biggest w(rc) value found is 1 The author is grateful to Marshall Fisher for his collaboration in the experiments reported in this section.
136
R.E. Marsten / The use of the Boxstep method in discrete optimization
taken as an approximation of the m a x i m u m value of the Lagrangean. It was therefore of interest to determine how close the subgradient optimization method was coming to the true m a x i m u m value. To answer this question, one of these Lagrangeans was maximized by the Boxstep method. A second refinement of the original Boxstep method is illustrated in this application. An upper limit is placed on the number of cutting plane iterations at step 2. If this limit is exceeded, then the local problem P(rct; fl) is terminated and the box is contracted (set fl = file for E > 1). Furthermore, if ~ is the best solution of P(rct; fl) generated so far, and w(~) > w(rct), then we take rct+ 1 = ~; otherwise ret+l = rc~. This provides another opportunity for suboptimizing local problems and also offers some automatic adjustment if the initial box size is too large. The test problem used is taken from [-4] and has n = 20 jobs. The number of time periods is the sum of all n processing times, in this case 53. Thus zce R s3. The starting point for Boxstep is the best solution found by the subgradient optimization method. Furthermore, some of the subgradients that are generated are used to supply Boxstep with an initial set of linear supports. (If w(~c) = f * + ~ 9", then 9* is a subgradient of w(rc) at rc = ~.) For the 20-job test problem, subgradient optimization took about one second (IBM360/67) to increase the Lagrangean from w(0) = 54 to w(n ~) = 91.967804. The Boxstep method was started at r? with fl = 0.1. Up to 55 cuts were carried and a tolerance of e = 10-6 was used. A m a x i m u m of 10 cutting plane iterations was allowed for each local problem. Each contraction replaced the current fl by 89 These parameters (B = 0.1, 55 cuts, 10 iterations, E = 2) were chosen after some exploratory runs had been made. The final run is summarized in Table 5. Four boxes were required to reach the global optimum, v* = w(rc*) = 92. The first two of these boxes had to be contracted; the last two converged. The time for Boxstep was 180 Table 5 Single machine scheduling problem
Box Box Box Box
1 2 3 4
fl
w(rr) evaluations
LP piv
piv/eval
0.0100 0.0050 0.0025 0.0025
10 10 7 1
168 82 75 20
16.8 8.2 10.7 20.0
R.E. Marsten / The use of the Boxstep method in discrete optimization
137
seconds. As with the facility location problem, this is exceedingly expensive. Fisher [4] reports that the entire branch-and-bound algorithm for this problem took only 1.8 seconds. The details of this run display the same two phenomena we have encountered before: a high pivot/evaluation ratio (as in Section 3) and degeneracy in the dual of P(rct; fl) (as in Sections 3 and 4).
6. Conclusions
The most promising alternative method for maximizing the class of Lagrangean functions we have considered here is subgradient optimization [10, 11]. Subgradient optimization tends to produce a close approximation to the global maximum, v*, for a very modest computational cost. Fortunately, this is precisely what is needed for a branch-and-bound algorithm. Since v* is not actually required, the time spent pursuing it must be weighed against the enumeration time that can be saved by having a tighter bound. This is dramatically illustrated in the example of Section 5. Subgradient optimization obtained w(n 1) = 91.967804 in about one second. Since it is known that the optimal value of the problem is integer, any w(rc)value can be rounded up to the nearest integer, in this case 92. Boxstep spent 180 seconds verifying that 92 was indeed the global maximum. This is a factor of 10z longer than the 1.8 seconds required for the complete branch-andbound algorithm ! To further illustrate this qualitative difference, the performance of Boxstep and subgradient optimization was compared on the facility location problem of Section 3. An approximate line search was used at step 3 of the Boxstep method and suboptimization of the local problems was done as in Section 4, with 7 = 89 The box size was held fixed at fl = 0.001 and up to 56 cuts were carried. The global maximum was found at w(rc*) = 10.850098 after a sequence of 28 local problems and line searches. This required 318 w(rc) evaluations, 929 LP pivots, and over 90 seconds of CPU time (IBM370/168). The subgradient optimization method, starting from the same initial solution, reached the global maximum (exactly) in only 0.9 seconds-again a factor of 10 2 ! This required only 75 steps (w(rt) evaluations). It is apparent from these and other results [4, 5, 11] that subgradient optimization is the preferred method in this context. Boxstep may be viewed as a method of"last resort" to be used if it is essential to find an exact
138
R.E. Marsten / The use of the Boxstep method in discrete optimization
global maximum. In this event, Boxstep can start from the best solution found by subgradient optimization and can be primed with an initial set (K c K) of subgradients. The performance of the Boxstep method is clearly limited by the rate of convergence of the imbedded cutting plane algorithm. Wolfe [17] has provided an invaluable insight into the fundamental difficulty we are encountering. He shows that for a strongly and boundedly concave function (as our Lagrangeans would typically be), the convergence ratio is at best (a/4A) 1/2", where 0 < a < A and n is the dimension of the space. Notice that the convergence ratio gets worse (i.e., approaches unity) quite rapidly as n increases. The Boxstep method attempts to overcome this slow convergence by imposing the box constraints, thereby limiting the number of relevant cuts (indices k ~ K). What we observe when n is large, however, is that to achieve even near convergence the box must be made so small that we are forced into an approximate steepest ascent method. (Boxstep can do no worse than steepest ascent, given the same line search, since it is based on actual gain rather than initial rate of gain.) Steepest ascent is already known to work very poorly on these problems [5]. Degeneracy in the dual of the local problem P(rct;/~) is an important characteristic of all of the problems we have considered. This is not surprising since this dual is a convexification of the original problem (2.3) and degeneracy in the linear programming approximations of discrete problems is a well-known phenomenon. The effect of this degeneracy is to slow further the convergence of the cutting plane algorithm. In two of the three applications we have encountered the phenomenon of high pivot/evaluation ratios. That is, many LP pivots are required to reoptimize after each new cut is added. This difficulty, when present, increases the computational burden associated with each cut. It is not clear yet whether this is caused by problem structure or is another consequence of higher dimensionality. There remains one opportunity which we have not investigated here. In the course of a branch-and-bound algorithm we have to solve many problems of the form (2.1). The Lagrangean function is somewhat different in each case, but the optimal rc-vector may be nearly the same. When this is the case, starting Boxstep at the previous optimal ~-vector and using a small box can produce rapid detection of the new global optimum. This has recently been applied with considerable success by Austin and Hogan
[I].
R.E. Marsten / The use of the Boxstep method in discrete optimization
139
Appendix 1. D a t a for the facility location problem (m = 9, p = 7, n = 40) i
f~
vl
vi
1 2 3 4 5 6 7 8 9
0.069837 0.065390 0.072986 0.068788 0.072986 0.064241 0.067739 0.0 0.0
0.600000 0.600000 0.600000 0.600000 0.600000 0.600000 0.600000 0.153312 0.271873
1.709064 1.823961 1.316324 2.163716 2.203619 1.959994 2.691376 3.063312 2.638127
j
dj
j
dr
j
dr
1 2 3 4 5 6 7 8 9 10 11 12 13
0.081426 0.206631 0.027246 0.303292 0.044033 0.034300 0.128681 0.093232 0.024378 0.254841 0.337371 0.173633 0.229527
14 15 16 17 18 19 20 21 22 23 24 25 26 27
0.066465 0.114898 0.054371 0.114093 0.178713 0.087465 0.136781 0.088804 0.084573 0.159997 0.292932 0.100370 0.104647 0.079968
28 29 30 31 32 33 34 35 36 37 38 39 40
0.032030 0.185120 0.136249 0.386822 0.031088 0.028759 0.526940 0.245977 0.121745 0.060440 0.038508 0.113301 0.072122
Let C = (C 1, C2,..., C9), w h e r e C i = (C~) for i = 1..... 9; a n d j = 1.... ,40 (see T a b l e A. 1). O n l y the finite c o m p o n e n t s of C will be listed. (Each facility can serve o n l y a subset of 40 c u s t o m e r s . ) Let B = [ B 1, B 2, ..., B 9] w h e r e B i = (b~j) for i = 1. . . . . 9; p = 1, ..., 7; a n d j = 1, ..., 40. O n l y the n o n - z e r o c o m p o n e n t s of B will be listed. N o t e t h a t B 8 = 0 a n d B 9 = 0.
B 1"
b~j=
B 2"
b2j = - 1.0
B3:
b3j = - 1.0 b4j = - 1.0 b~j = - 1 . 0 b6j 1.0 bTaj = - 1 . 0
B 4"
Bs: B 6" B7:
=
-1.0
-
forj forj for j forj forj forj forj
= = = = = = =
5, 7, 8, 10, 11, 12, 10, 11, 12, 17, 23, 24, 25, 31, 38, 39, 34, 35, 36, 37, 13, 14, 16, 4, 16, 19.
140
R.E. Marsten / The use of the Boxstep method in discrete optimization
<
R . E . M a r s t e n / T h e use o f the B o x s t e p m e t h o d in d i s c r e t e o p t i m i z a t i o n
1 2 3
Xl
X2
7.0 1.0
1.0 4.0
X3
X4
X5
A=4 5 6 7
X6
X7
1.0
4.0
4.0
1.0
X8
141
X9
4.0 3.0 4.0
r = (1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0).
Appendix 2. Data for the multi-item production scheduling problem W e shall use the following n o t a t i o n [,12, pp. 171-177]. si hi Pi ai ki bt
= set-up cost for item i, = i n v e n t o r y holding cost for item i, = unit p r o d u c t i o n cost for item i, = a m o u n t of resource r e q u i r e d for one set-up for item i, = a m o u n t of resource r e q u i r e d to p r o d u c e one u n i t of item i, = a m o u n t of resource available in time period t. Dit = d e m a n d for item i in p e r i o d t. (See T a b l e A.2.).
T h e d a t a for the 50-item p r o b l e m will be given. T h e first 25 items constitute the 25-item p r o b l e m w h e n used with the resource v e c t o r b 25 = (3550, 3550, 3250, 3250, 3100, 3100). T h e resource v e c t o r for the full 50-item p r o b l e m is b 5~ = (7000, 6000, 6000, 6000, 6000, 6000). B o t h p r o b l e m s have 6 time periods. Let hi = 1, Pi = 2, and ki = 1 for all i = 1, ..., 50. Let I x ] d e n o t e the largest integer that does not exceed the real n u m b e r x, and let ~b(j) = j ( m o d 5) for any integer j. T h e n for i = 1, .... 25 we h a v e ai=
10 • { 89
1] + 1},
si=75
+25
• q S ( i - 1)
while for i = 26 . . . . . 50 we h a v e a, = 5 + 10 • qS(i - 26),
si = 30 • {89 - 26] + 2}.
142
R.E. Marsten / The use of the Boxstep method in discrete optimization Table A.2. Demand
i
Dil
D~2
D~3
Di4
D~5
Di6
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50
120 100 147 58 112 62 103 96 112 86 100 125 110 103 90 87 112 100 65 150 92 99 100 101 77 125 86 104 100 131 114 96 102 88 74 16 95 107 124 62 72 23 123 107 32 121 150 56 69 100
113 125 99 95 118 121 106 96 106 85 109 130 140 102 94 93 84 120 73 100 120 126 114 103 86 138 81 106 90 152 100 56 52 100 53 43 104 120 124 64 98 41 163 104 102 117 103 64 84 107
97 78 108 136 64 133 124 98 141 91 98 120 103 106 99 92 76 130 102 75 79 133 136 124 99 84 92 63 126 186 130 88 77 94 82 72 130 96 97 87 73 55 180 117 74 138 84 79 107 120
88 116 77 154 132 98 114 103 106 103 ~ 94 90 89 121 108 109 92 102 114 50 92 85 102 93 104 102 120 41 87 142 85 100 63 74 91 86 100 100 99 4i 114 58 100 104 42 104 100 81 48 101
96 97 162 91 154 96 126 94 107 114 85 100 85 93 117 105 105 97 91 125 107 122 105 91 121 103 110 120 77 130 63 115 75 140 51 103 103 116 120 124 135 99 84 87 130 142 110 89 63 95
103 84 141 113 104 63 95 100 111 96 86 120 110 96 124 108 100 88 106 100 103 106 105 90 134 52 94 113 96 120 128 115 115 120 115 110 85 120 125 i33 166 130 93 93 152 100 112 90 132 88
R.E. Marsten / The use of the Boxstep method in discrete optimization
143
Appendix 3. Data for the single machine scheduling problem job i
Pi
1
6
2 3 4 5 6 7 8 9 10
l0 5 I 9 9 1 5 2 3
dl
job i
Pi
di
34 61 56 23 80 1 18 21 14 113
11 12 13 14 15 16 17 18 19 20
7 4 6 2 3 8 10 6 9 8
95 77 63 56 60 78 1 58 27 24
References [1] L.M. Austin and W.W. Hogan, "Optimizing procurement of aviation fuels", Management Science, to appear. [2] G.B. Dantzig and P. Wolfe, "Decomposition principles for linear programs", Operations Research 8 (1) (1960) 101-I 11. [3] B.P. Dzielinski and R.E. Gomory, "Optimal programming of lot sizes, inventory, and labor allocations", Management Science 11 (1965) 874-890. [4] M.L. Fisher, "A dual algorithm for the one-machine scheduling problem", Tech. Rept. No. 243, Department of Operations Research, Cornell University, Ithaca, N.Y. (1974). [5] M.L. Fisher, W.D. Northup and J.F. Shapiro, "Using duality to solve discrete optimization problems : theory and computational experience", Mathematical Programming Study 3 (1975) 56-94 (this volume). [6] M.L. Fisher and J.F. Shapiro, "Constructive duality in integer programming", SIAM Journal on Applied Mathematics 27 (1) (1974). [7] A.M. Geoffrion, "Elements of large-scale mathematical programming", Management Science 16 (11) (1970) 652-691. [-8] A.M. Geoffrion, "The capacitated facility location problem with additional constraints", Graduate School of Management, University of California, Los Angeles, Calif. (1973). [-9] A.M. Geoffrion, "Lagrangean relaxation for integer programming", Mathematical Programming Study 2 (1974) 82-114. [10] M. Held and R.M. Karp, "The traveling-salesman problem and minimum spanning trees: Part II", Mathematical Programming 1 (1971)6-25. [-11] M. Held, P. Wolfe and H. Crowder, "Validation of subgradient optimization", Mathematical Programming 6 (1974) 62-88. [12] L.S. Lasdon, Optimization theory for large systems (Macmillan, New York, 1970). [13] L.S. Lasdon and R.C. Terjung, "An efficient algorithm for multi-item scheduling", Operations Research 19 (4) (1971) 946 969.
144
R.E. Marsten / The use of the Boxstep method in discrete optimization
[14] D.G. Luenberger, Introduction to linear and nonlinear programmin 9 (Addison-Wesley, Reading, Mass., 1973). [15] R.E. Marsten, W.W. Hogan and J.W. Blankenship, "The boxstep method for large scale optimization", Operations Research 23 (3) (1975). [-16] H.M. Wagner and T.M. Whitin, "A dynamic version of the economic lot size model", Management Science 5 (1958) 89-96. [17] P. Wolfe, "Convergence theory in nonlinear programming", in:J. Abadie, ed., Integer and nonlinear programmih 9 (North-Holland, Amsterdam, 1970) pp. 1-36.
Mathematical Programming Study 3 (1975) 145-173. North-Holland Publishing Company
A M E T H O D OF CONJUGATE SUBGRADIENTS FOR
MINIMIZING NONDIFFERENTIABLE FUNCTIONS Philip WOLFE I B M Thomas J. Watson Research Center, Yorktown Heights, New York, U.S.A.
Received 28 May 1974 Revised manuscript received 25 June 1975 An algorithm is described for finding the minimum of any convex, not necessarily differentiable, function f of several variables. The algorithm yields a sequence of points tending to the solution of the problem, if any, requiring only the calculation o f f and one subgradient o f f at designated points. Its rate of convergence is estimated for convex and for differentiable convex functions. For the latter, it is an extension of the method of conjugate gradients and terminates for quadratic functions.
1. Introduction
Many problems of mathematical programming pose the task of minimizing a convex, but not necessarily differentiable, function of several variables. In some cases that task is inherent in the problem: an example of interest to us has been the problem of minimizing the sum of the k highest eigenvalues of a given symmetric real matrix whose diagonal entries are variable. In other cases setting such a task is just one of several ways of formulating the problem. For example, the linear programming problem mincTx+dTy
subject t o A x + B y < b
x,y
may be posed as the minimization of the convex, and usually nondifferentiable, function f ( x ) = cTx +
min{dTy: B y < b - A x } . r
That idea is fundamental to most "decomposition" schemes aimed a t exploiting the structure of B to solve the problem efficiently [-5, 9], and any improvement in methods for minimizing nondifferentiable functions
146
P. Wolfe / A method of conjugate subgradients
will yield rewards in the area of large-scale optimization [10]. Another application is to the constrained convex optimization problem, which under simple restrictions [-6] can be posed as a problem of our type. An effective procedure has been devised for the eigenvalue problem based on the idea of steepest descent, using some special features of that problem [2]; and for certain structured linear programming problems, whose size makes any known version of the simplex method impractical, a very simple "subgradient algorithm" due to Shor applicable to any convex function has proved effective [13], despite the fact that it does not converge as fast even as steepest descent when the function is differentiable. These results raise the question as to whether there is an algorithm which is reasonably effective for both differentiable and nondifferentiable convex objective functions. We think the procedure offered here will meet those requirements. We place strict limitations on our ability to get information about the objective function f being minimized: we suppose that, given x ~ R", we have a finite process which will determine f(x), and a finite process which will find one subgradient off a t x (that is, one member of the subdifferential 0f(x)--see Section 2) on which we cannot impose any further conditions. Such a limitation is realistic for the problems of interest to us: in the eigenvalue problem determining f(x) requires finding the k largest eigenvalues of a large symmetric matrix, and Of(x) is a nonlinear transform of an associated invariant subspace (in which, however, some selection can be made); for linear programming problems the theoretical characterization (2.5, 2.6) below of Of(x) just cannot be used for a problem in which which m > 101~176 which is typical. A number of proposals have been made for treating nondifferentiable optimization problems by some version of steepest descent. Outstanding among them are those of Demjanov [-4] and Bertsekas and Mitter, whose paper [1] gives an excellent survey of the area and an extensive bibliography. None of these proposals, however, seems to have a straightforward implementation in the general case. Demjanov calls for knowledge of the complete subdifferential or, worse, an e-subdifferential [20, ch. 23] at any point, which seems impractical, while Bertsekas and Mitter confine their procedure to objective functions whose conjugate functions can be given analytically, and require the solution of a (smooth) constrained nonlinear programming problem at each step of their method. We know of three alternatives to the present method which are fully articulated for our class of problem. The convergence of the "cutting-
P. Wolfe / A method of conjugate subgradients
147
plane" method does not depend on differentiability [23]. For smooth problems, even quadratic, it can have quite slow convergence, but it seems more attractive in the nondifferentiable case: the closer the epigraph o f f can be approximated by a polyhedron, the better it should work. A recent refinement of the method [11] seems even more promising. A second alternative is the "subgradient algorithm" mentioned above which has, we think, worked well, but whose generally effective implementation and rate of convergence is still a mystery. (Shot's modification [21, 22] of subgradient optimization does, however, allow some convergence rate estimates to be made.) A third we should mention is the recent very interesting procedure of Lemarechal [14, 15], which of all these is closest to ours in spirit, and has shown encouraging computational results. We expect the present method to work well even on smooth problems, for it constitutes for them a version of the method of conjugate gradients. Indeed, w h e n f i s quadratic and the obvious specialization of the procedure to that case is made, this method is exactly that of Hestenes and Stiefel [12], and terminates in the solution of the problem. Our belief in the effectiveness of the present method is based largely on the above observation, for our practical experience with it is small. It will be seen that there is great freedom of choice in how the algorithm can be run, and we have only begun to explore some of the possibilities. Feuer [7] has tried it in its simplest form (i.e., as presented below) on several of the small but hard differentiable problems in the literature, and found that it required from seven to ten times as much work, measured by number of function evaluations, as standard good methods. Our few experiments, employing a good line search adapted to the differentiable case, seem to bring the required work to within a factor of two of that taken by a standard good implementation of conjugate gradients [8]. This limited use has been encouraging, but we are not ready to offer the procedure as the answer to our class of problem without more study. We have confined our study to convex functions. While those to which this procedure applies are many and useful, some extensions are possible; for example, Feuer [7] has developed some of the ideas of this work into an effective algorithm for the type of minimax problem studied by Demjanov [4]. We can hope that they will be useful in any problem for which a suitable analogue of the notion of subgradient is at hand. In Section 2 below we review some notions from convex analysis which are not yet household words in optimization and develop formulas needed in the sequel. The bulk of the paper, Sections 3~5, concentrates on the
P. Wolfe / A method of conjugate subgradients
148
"main algorithm": a procedure which, given certain positive "tolerances", terminates in a point which approximates the solution to a degree measured by the tolerances. Section 3 gives an informal account of the main algorithm and reasons for its several parts. Section 4 presents the main algorithm, while Section 5 discusses its convergence. Section 6 shows its connection with the method of conjugate gradients. Section 7 is devoted to some theoretical observations on how the main algorithm may be run sequentially, reducing its tolerances to zero, in an "outer algorithm" which converges to the solution of the problem if one exists. The main results are that the minimum is approached as k -3 in k line searches in the general case, and as A -k (A > 1) if f is twice differentiable and strongly convex.
2. Preliminaries
We first summarize some notions we need from convex analysis, as presented by Rockafellar [20]. Let f be a closed, proper convex function on E" (i.e., f is convex and lower semicontinuous and never assumes the value - Go, although + oo is allowed). The effective domain o f f is the convex set dora f = {x:f(x) < + ~ } . For any x E d o m f a n d nonnull d~ E", the
directional derivative if(x;d) = lim
t~0 +
f ( x + t d) - f ( x ) t
(2.1)
exists. It is a convex function of d, and is closed and proper if x belongs to the relative interior of d o m f The vector u ~ E" is a subgradient of f a t x if
f(y) > f(x) + (u, y - x)
(2.2)
for all y. The set of all subgradients at x is the subdifferential Of(x) of f a t x, which is a closed, convex set. Denoting by f'(x; d) the closure of f '(x; d) as a function of d (that is, the function whose epigraph is the closure of that of f '(x; "), which is the greatest lower semicontinuous function majorized by f'(x; ")), J~(x; d) = sup{{u, d): u E c3f(x)}.
(2.3)
If x belongs to the relative interior of d o m f , then Of(x) is nonempty, and if(x; ") is closed (i.e., ~' = f ' ) and proper; further, Of(x) is nonempty and bounded if and only if x belongs to the interior of d o m f in which case
if(x; d) is finite for all d, and closed.
P. Wolfe / A method of conjugate suboradients
149
The convex function f is called polyhedral if it has the form
f(x) = max{f/(x), i = 1. . . . . m},
(2.4)
where the functions f~ are affine, so that Vf~(x) = gi is constant for each i; f and f'(x; ") are closed and proper 9 Letting I(x) = {i: f/(x) = f(x)}
(2.5)
Of(x) = conv{9i: i e I(x)},
(29
for all x, where for any set S, conv S denotes the convex hull of S. Suppose that x e dom f, which consists of more than a single point, and that f~(x; d) r - o o for all d. Then f~(x; d) has a finite minimum value M on the unit ball {d: Idl < 1}, where I l denotes the Euclidean norm 9 If M > 0, then x minimizes f; otherwise, the value M is assumed for a unique d. To prove the latter statement, note that when M < 0 any minimizer d has norm 1, for Idl < 1 would imply f~(x; d/ldl) = if(x; d)/Id I = M/Id I < M. If there were two different minimizers dl, d2, we should have 189 + 89 < 1, while iT(x,9 8 9 89 < M since f ~ X( ; ' ) is convex. We will call the direction d thus defined the direction of steepest descent for f at x, and correspondingly define the gradient o f f at x to be
Vf(x) = if(x; d)d,
where Idl _< 1 minimizes J~(x; .).
(2.7)
(Note Vf(x) = 0 just when x minimizes f ) Of course, i f f is differentiable at x, then Vf(x) is just the ordinary gradient. For any set S g E" there is a unique point v in the closure of the convex hull of S having minimum norm; it will be denoted by Nr S. Algebraically, the point is characterized by the relation (v, s) >_ 1/312 for all s sS. The characterization below of the gradient was essentially given by Demjanov [4]: Vf(x) = - Nr af(x).
(2.8)
The proof uses the minimax theorem for a bilinear function on the product of two sets, one of which is bounded [20, Corollary 37.3.2], and (2.3): min f~(x; d) = min sup{(u, d): u E 0f(x)}
[dl < 1
[d[ < 1
=
~SoU~ ,
min ( u , d ) =
sup
l a l -< 1
=
-
min
u~i~Of (x)
lu[ =
...
-]Nr~f(x)].
m
lul
P. Wolfe / A method of conjugate subgradients
150
The saddlepoint of(u, d) is given by u = Nr 0f(x), d = -u/[u], from which (2.8) is immediate. Using formula (2.8) to calculate Vfis quite feasible w h e n f i s polyhedral and I(x) can be found, for by (2.6) Vf(x) = Nr{gi: i t I(x)},
(2.9)
so the problem of determining Vf(x) is the quadratic programming problem of finding the point of smallest norm in the convex hull of a given finite point set in E". We note that an algorithm especially designed for that calculation has been given [26]; it requires 89 + O(r/) cells of auxiliary storage to carry out. At a typical step of the algorithm of this paper, Nr G has already been found for some finite set G; a point 9 is determined, and Nr[G w {g}] must be found. The latter step requires O(n z) multiplications to perform. Unfortunately, the operator V does not have many useful properties. While V~ f = ~Vf for real ~, V(f + 9) # Vf + Vg (although 0(f + g) = Of + 0g in general), nor has Vf(x) any useful continuity properties we know of although 0fis upper semicontinuous [20, Corollary 24.5.1]). It may be worth mentioning that we use the Euclidean norm throughout only for convenience; but the convenience is great. We believe that all the theoretical results hold if]'] is taken as any strictly convex norm on R", and the induced dual norm is used for points of the dual space, in which Of(x) lies. The definition of Vf(x) would then be modified to designate that d for which minld I _
fork=
1,2 . . . . .
(2.10)
To simplify notation, in discussing a single step of the procedure we may suppress "k" and "1" from an expression like (2.10), which then reads (2.11)
x+ = x + td.
At a particular step (with x and d 4:0 given) we define Q(t) =
f ( x + t d) - f ( x ) t]dl 2
for t > 0.
(2.12)
P. Wolfe / A method o f conjuoate subgradients
151
Note that Q is continuous and m o n o t o n e nondecreasing, and that lim Q(t) = if(x; d)/Id] 2.
(2.13)
t-+O+
We have supposed that for any x E d o m f we can obtain, in a uniquely determined way, some member 9 of Of(x). Given x and d :~ 0 and t > 0, let that member of Of(x + t d) be denoted by 9(0; we define
M(t)-
(9(t),d)
id[2
for t > 0.
(2.14)
M(') is m o n o t o n e nondecreasing, and
- f ' ( x + td; - d ) <_ Idl2M(t) <_ f'(x + td;d).
(2.15)
Hence M(.) is integrable for t _> 0, and for T > 0 T
SM(t)dt = [ f ( x + r d ) - f ( x ) ] / I d [ z = TQ(T).
(2.16)
o
Finally, if [g(t)[ < C for 0 < t < T, we have the bounds [Q(t)[, [M(t)[ _< C/[d[
for 0 < t _< T.
(2.17)
3. Motivation
We begin by seeing why the natural use in a Cauchy procedure of the notion of steepest descent developed in the last section does not work. The following example is one of several we have studied I-2, 25]" Define
f(x,y) =
lyl
5(9x + 16y2) 1/2
for x >
9x + 16[y[
for x _< [y[
(Region I), (Region II).
Its contours are sketched in Fig. 1. It fails to be differentiable only on the ray x < 0, y = 0 (but which, typically, is just where we would like to be: f(x, y) ~ - oo as x ---, - oo). The heavy arc of Fig. 2 (a portion of the ellipse (u/15) 2 + (v/20) z = 1) constitutes all Vf(x, y) for w h i c h f i s differentiable at (x, y); its two endpoints suffice for Region II. For x < 0, Of(x, O) is the line segment {(9, v): Ivl < 16}, and ~3f(0, 0) is the convex hull of the arc; thus Vf(x, 0) = ( - 9 , 0) for x < 0. One can calculate that if the Cauchy algorithm is begun anywhere in the region x > lyl > (9/16)21x1, then it follows a polygonal path of successively orthogonal segments, like that sketched in Fig. 1, converging to the origin.
P. Wolfe / A method of conjugate subgradients
152
3 4._5~
6_Iy
7
8.._
2 -I -2
Fig. 1.
--
-
-----------~- x
C o n t o u r s o f f a n d steepest descent path.
The gradients calculated alternate between two points like A, B in Fig. 2; the slope of OA lies between ~ and 9 , and OB _L OA. Convergence to the wrong point with steepest descent would seem to be due to the fact that neither gradient A nor B alone says much about ~3f(0, 0), no matter how close we approach (0, 0); yet we know 1-20, Theorem 25.6] that ~3f(0,0) can be approximated arbitrarily well by the convex hull of gradients in the neighborhood of (0, 0). Indeed in this case A, B suffice: with P = Nr{A, B}, the slope of O P turns out to be bounded by +0.203,
(9,16) A
u
(9,16) Fig. 2.
Gradients off.
P. Wolfe / A method o f conjugate subgradients
153
and it can be seen that f descends monotonically to - ~ along any ray that points to the left and has slope between 5_ 9 . The main idea of our algorithm can now be stated. We will generate a sequence {Xk} of points, {dk} of directions, {gk} of subgradients, and {tk} of scalars such that for k = O, 1, 2 . . . . t k >---0 ,
[approximately], f(Xk + 1) <- f(Xk)"
gk E 3 f ( X k ) -
Xk+ 1 = Xk + tkdk,
At step k the bundle G k will be formed (by rules given later). It will consist of gk, possibly - d k_ 1, and a (possibly empty) string gk-1, gk-2 of limited length. We set . . . .
dk
=
-
Nr
G k.
The step size t k will be determined by an approximate minimization of f ( x k -t- t dk) for t _> 0. Our aim is that at some step the last p + 1 points ---Xk-p . . . . . Xk-I, xk--are all close together, so that gk-p, ..., gk are approximately contained in Of(xk), and also that d k ~- - N r G k = - N r {gk-p ..... gk} is small; then Vf(XR) ---- --Nr ~f(Xk) will be small, and XR will be an approximate solution. We see that such a procedure will resolve the first example. As another kind of illustration, consider f ( x , y) = max{fl,fz, fa }, where fl = - x , f2 = x + y,
fa = x -
2y,
sketched in Fig. 3. Differentiability fails at points of the form (x, 0), ( - x, 2x), and ( - x , - x ) for x > 0. A reasonable prescription giving a subgradient o f f at any (x, y) would be the selection of Vf~, where i is the least index for which f (x, y) = f~; it yields [1, 1] when x > 0, y _> 0, and [1, - 2 ] when x>0, y<0. Steepest descent proceeds from the point (6, 3) in the direction -Vf(6, 3) = [ - I, - 1] to the point (3, 0), but there the only available direction, still [ - 1, - 1], is not a descent direction. As we shall see, we must take a small step in that direction anyway. Then the new subgradient will be [1, - 2 ] ; forming the bundle G = {[1, 1], [1, - 2 ] } , we find the direction - N r G = [ - 1, 0] just what we need to reach the solution from (3, 0). On arrival at (0, 0) or its neighborhood we will similarly some time add the subgradient [ - 1 , 0] from fl to the bundle. Then Nr G = 0, and we accept the last point reached as an approximate solution if the points at which the subgradients were found are deemed close enough together. If not, then we
P. Wolfe / A method of conjugate subgradients
154
i
__
Fig. 3.
i~
IO
, "(6,3) \ ~
X
Second example--contours and subgradients.
begin all over again from that point--we "reset". For simplicity here, take Gk = {gk, --dk-1} for k > 0. Note that Gk c_ conv{go . . . . . gk} for all k. We begin with go ~ ~?f(xo), Go = {go}, do = -go. In minimizing f(Xk + tdk) we will be able to find (iff does not tend to - oo), as in the differentiable case, gk+~ ~ ~f(Xk + t dk) SO that 9k+ ~ is "approximately" normal to dk. Knowing that 19k+11 <- e, we can show that ]dk+l] is less than Idk] by a definite amount (see Fig. 4), and thus that ]dkl ~ 0 as long as this goes on. Small Idkl is only of use, however, if Xo,..., Xk are close together. Accordingly we suppose we have chosen small e > 0, 6 > 0. The iteration above continues until Idk] < e for k = K, say. If then
Z ]Xk+, - Xk] <-- 6,
k
(3.1)
we are finished (this is step 2 of the algorithm, Section 4). What we can then conclude about the optimality of Xk constitutes the conclusion of the Theorem of Section 5. If (3.1) does not hold, we reset (Step 3 of the algorithm): discard all previous subgradients by taking Gr = {gr}, and begin the algorithm again from XK. As shown in Section 5, resetting does not go on for ever; the requirement (3.1) is eventually satisfied, unless in the course of some line search we decide that f has been reduced as far as we need.
P. Wolfe / A method of conjugate subgradients
I55
CONV G+ -d
/, / f//]/:~
Fig. 4. Constructing G+,D+.
The line search needs some explanation. It is modelled on the differentiable case, where we would have f ' ( x ; d ) < 13, and in minimizing f ( x + t d) would try to approximately satisfyf'(x + t d; d) = 0. In practice, we would settle for reducing f ' ( x + t d ; d ) = d W f ( x + td) to a certain fraction ml, with 0 < ml < 1, of f '(x; d) = dWf(x). Noting that generally dTVT(x) = -Idl z, we can write that requirement dTo+ >_ -m ld[ 2, or, using (2.14), M(tk) >-- --ml.
(3.2)
The rule above is not sufficient, however. Our descent direction d may be so feeble that (3.2) is easily satisfied and we could take endless steps without reducing f much. Accordingly, we also impose the requirement that the drop f ( x + t d ) - f ( x ) in moving the distance IAxl = rid [ be at least - m , ldl.lAxl . Using (2.12), we require in other words that Q(t) <_ - m z.
(3.3)
Choosing 0 < m2 < m~ < 89turns out to ensure the needed reduction of [d[ in successive steps. In Fig. 5, the heavy portions of the graphs of the two functions sketched indicate values of t satisfying both (3.2) and (3.3). W h e n f i s differentiable, our two requirements can be shown to give the conclusion that either gk ~ 0 o r f ( x , ) ~ -- oo. (They are taken from a more
P. Wolfe / A method of conjugate subyradients
156
f(x+td)-f(x) -tO(t) 2
\
tl~~ t/ 2
~t
SLOPE -m2 SLOPE-m
Fig. 5.
I
t Q(t) for two functions.
general study of the differentiable case [23]. We would like to point out that such a statement about convergence is stronger than saying that gk -~ 0 iff is bounded below.) However, satisfaction of (3.2, 3.3) is only one possible outcome (Case (A) of step 4) of the line search here, since d may not be a descent direction at all--we may have f ' ( x ; d) > O. In that case we find in the course of doing the line search that (3.3) cannot be satisfied even for rather small t, and settle for 9+ ~ t?f(x + t d) for some small t, while taking x + = x. The line search algorithm of Section 4 shows, by exhibition, the existence of a finite process which can satisfy the requirements of step 4 of the algorithm. Being a bisection search, that is about its only virtue; for particular classes of problem we can devise much better searches, but we suppose they would all use the same idea: since the functions Q and M are nondecreasing, the set of all t > 0 satisfying (3.3) is an interval (possibly empty) and the set of all t > 0 satisfying (3.2) is an interval which, if not empty, extends to infinity. We show in Section 5 that if these intervals are nonempty, then they overlap in an interval of positive length, and bisection will find a point in their overlap, giving Case (A); otherwise, "stepping in" will end in Case (B) or "stepping out" will cause f ( x + t d)--, - o o .
P. Wolfe / A method of conjugate subgradients
157
4. The main algorithm The parameters e, 6, b, m 2 < m I < i, all positive, are fixed throughout. Initially we have the starting point Xo, some 9o ~ Of(xo), Go = {9o}, and the scalar ao = 0. The k 'h step of the algorithm constitutes one execution of the procedure below. See the end of Section 2 for the notation used.
Step 1. Set d = - N r G. If ]d[ < e, do step 2; else step 4. Step 2. If a < 8, stop; else do step 3. Step 3 (reset). Set x+ = x, 9+ = 9, G+ = {9}, and a+ = 0. Step k is finished. Step 4 (Line search: see below for details). Find t > 0 and g + ~ 8f(x + t d) such that (4.1) >- - m , l d l 2 and either Case A
f ( x + t d ) - f(x) <- -mztld[ 2
(4.2)
or
Case B t [d[ < b. (4.3) In Case A, set x+ = x + t d and do step 5; in Case B, set x+ = x and do step 5. Step 5. Choose H ___G and set G+ = H w { - d , 9 + } a n d a + = a + Ix+ - x I. Step k is finished.
Line search (bisection) Let x and d be given, with [d[ # 0. Define
L = { t : t > 0, Q(t) < - m z } , R = { t : t > 0, M(t) >_ - m l } .
(4.4)
(a) Choose t > 0. Go to (b), (f), or (c) according as t ~ L\R, L ~ R, or R~L. (b) Replace t by 2t until t ~ R. Then if also t ~ L, go to ( f ) ; otherwise go to (d). (c) Replace t by i t until t[d[ <_ b or t ~ L. In the former case go to (g); if t ~ Lc~ R, go to (f); otherwise go to (d). (d) Let I be the interval whose endpoints are the last two values of t found by (b) or (c). (e) Let t be the midpoint of I. If t ~ L c~ R, go to (f). Otherwise replace
P. Wolfe / A method of conjugate subgradients
158
I by its right half if t e L, and by its left half if t ~ R, and repeat this step. (f) Declare Case A and go to (h). (g) Declare Case B and go to (h). (h) Set 9+ = 9(0 and exit. Remark 4.1. Fletcher [-8] gives a good choice for the starting value of t in step (a) for differentiable functions: Use 2 ( f - f _ ) / ( g , d), where f f _ are the values o f f at the current point and the previous point of the main algorithm. If f were quadratic and f _ - f = f - f+ (the latter would be approximately true if the procedure had slow linear convergence), then the recommended value of t would minimize f on x + t d.
5. Convergence We will show that the algorithm of Section 4 either generates a sequence of points on which f tends to - o o , or terminates in a point Xk which approximates the solution of the problem in the sense of the conclusion of Theorem 1 below. In Lemma 1 and Theorem 1 we hypothesize: f is a convex function on R", Xo ~ R", and e > 0, b > 0, | 6 > 0, 0 < mz < m~ < 89are given. Let S be the set of all points of R" lying within Euclidean distance 2b of (5.1) {x: f ( x ) <_ f ( x o ) } ; f i s finite-valued on S, and there is C so that ]g[ < C for all x ~ S and 9 ~ Of(x).
r
Lemma 1. I f f ( x ) < f(xo) and d ~ 0, then the line search procedure of Section 4 either terminates, yieldin9 t, 9+ satisfyino (4.1, 4.2) or (4.1, 4.3), or forms a sequence t ~ + oo for which f (x + t d) ~ - ~ . Proof. Since Q and M are nondecreasing and Q is continuous, L is an interval which, if neither empty nor the whole line, has the form {t: 0 < t < tz} for some t2, while R, if not empty, is either {t: tl < t} or {t: t~ < t} for some t~. Satisfying (4.1) amounts to having t e R, and (4.2) amounts to t e L. In step (b), as long as t r we have M(s) < - m x for 0 < s < t, so by (2.16) f ( x . + t d) < f ( x ) - mlt Id[ 2. Thus either f ( x + id) --, - oo or eventually t ~ R. If then t E L ~ R, all is well; otherwise we pass to (d) with t ~ R ~ L and 89 ~ L-,R.
P. WolJe / A method of conjugate suboradients
159
It is clear that step (c) terminates, either satisfying all requirements or passing to (d) with t ~ L \ R and 2t ~ R--L. Step (e) begins with an interval having one endpoint in L--R and the other in R ~ L . The bisection procedure replaces it with an interval half its size having the same property unless the midpoint was in L c~ R. Such a midpoint must be found if the interval L c~ R has positive length, which we now show. On entering step (d) neither L nor R is empty nor the whole line, so they have the forms mentioned above. The graph of t Q(t) against t is a convex curve (see Fig. 5). Since Q is continuous, t2Q(t2) -- -m2t2, so the line of slope - r n 2 through the origin intersects the graph at the origin and at P 2 = (t2, - m 2 t 2 ) . Since M ( t ) < - m I for 0 < t < tl, by (2.16) taQ(t~) < -m~t~ < - m z t l , so the point P~ = (tl, t~Q(t~)) lies below the line. It follows that t~ < t2 ; L ~ R has the length t2 - t~ > 0. The Lemma is proved. Remark 5.1. The number of steps (each consisting of one function and one subgradient evaluation) the line search requires can be bounded if a lower bound for f is known. Let
f ( x ) - inf{f(y): y E S} _< D,
(5.2)
and choose the initial value of t in step (a) so that
b < t ld I <-'D/m2ld[.
(5.3)
Then the number of steps does not exceed 1 +
[logz(D/m2ldl
b) + [logz[(m = + C / [ d l ) / ( m l
- m2)],
(5.4)
where "[a" denotes the smallest integer not less than a. The three terms summed in (5.4) come from: step (a); either (b) or (c); and step (e), respectively. For the second, since Q(t) > - D / t Idl E for t > 0, t will not grow larger than t2 = max{t: Q(t) < - m z } < D/mz[dl 2, and the bound is clear. For the third, note that the slope of the line P a P 2 (see the proof of the lemma) is bounded, by (2.17), by C/Id I, so that - - m 2 t 2 + m l t l < - - m z t 2 -- t l Q ( t x ) = tzQ(t2) -
<- (C/Idl)(t2
tlQ(tl)
-
tl),
whence (m 1 + c/Idl)tl < (m 2 + C/Idl)t2 so that we have (t 2 - - tl)/t 1 > (ml -- mz)/(m2 + C/[d[). Step (e) begins with an interval of length no greater than t~ and terminates at the latest when its interval has length t2 - tl, from which the bound follows.
P. Wolfe / A method of conjugate subgradients
160
Remark 5.2. If f were known to be differentiable, it would be appropriate to set b -- 0. By construction (in step 1 of the main algorithm) ( - d, g) _> [d[2, so that M(0) =
in this case; so d is a descent direction, L has positive length, and there is no need to provide for Case B. L e m m a 1 holds, and with "or (4.1, 4.3)" deleted. Theorem 1. L e t the hypothesis (5.1) hold. The algorithm o f Section 4 generates either a sequence {Xk} for which f(Xk) ~ -- 00, or a point Xk and a sequence tj ~ + ~ for which f ( x k + tjdk) -* -- o0, or terminates at a point x k such that, for all y ~ R", f ( y ) - f(Xk) >-- - - e l Y - xkl - 2c(6 + b).
(5.5)
Proof. We suppose that the second possibility of the conclusion does not occur, so that the line search always terminates. (i) We show that the algorithm resets (executes step 3) at least once every 4C2/e 2 iterations. For any iteration which is not a reset, ]d+[ = [ N r H u { - d , g+}l < I N r { - d , g + } l 9 N o w N r { - d , g+} = - d ( 1 - 0) + Og+ for some 0 < 0 < 1, and I - d ( 1 - o) + 0o+15 --Idl2(1 - 0) 5 _< Id]2(1 - 0) +
2o(1
-
O)(,d,g+) + OZ[g+[ z
02c ~,
since - ( d , g + ) <_ m,[d[ z < 89 2 and [g+[Z < C 2. The m i n i m u m value of the last expression for 0 _< 0 _<1 is Idl 2 - Idl'/4c ~, so t h a t we have Id+l 2 < [d[ 2 (1 -- [d[2/4C2), whence [d+[ -2 > [d[-2 (1
+
]d[2/4C 2)
=
[d[ -2 + 1/4C 2.
Repeated application of this inequality for k = K . . . . . L yields IdL[-z > ]dr[ -z + (L - K ) / 4 C 2, so that 1> 12>L-K e-~ - ~ - ~ 4C
1 + idol z
(5.6)
as long as no resets occur. The result follows. (ii) In Case A of step 4 of the algorithm we have f(Xk+l)
-- f ( X k ) <-- --m2tk[dk[ 2 <-- -- m2g. [Xk+ 1 - - Xk] ,
(5.7)
P. Wolfe / A method of conjugate subgradients
161
and only in Case A is Xk+l ~ Xk, s o f ( x k ) is nonincreasing. Summing (5.7) over K - 1 steps, we have K-1
f ( x o ) - f ( X r ) >-- m2e ~
(5.8)
[xk+l - xkl.
k=0
Suppose from now on t h a t f ( x O is bounded below; then the series of (5.8) is finite or convergent. Let K(k) be the iteration number at which the latest reset occurred, for k X j+l -- Xj[. Since k - K(k) is bounded, by (i), we any k; then ak = ~_.~=~(k)[ know ak ~ O, SO that the stop of step 2 will eventually be executed. (iii) We have stopped in step 2 with x = x k. Denote by J the indices K(k) . . . . , k. Then d=dk
= -~2jgj,
j~J,
for some 2j > 0, ~ 2~ = 1. Set in caseA,
yj=xj+tfl~
yj=xj
in C a s e B ,
for all j. Then g~ e Of(yj) for all j, and ]yj - x~] < b for all j, so [x - yj[ _< + b for allj. Let C' = m a x { [ g j [ : j e J } . Take TEE". By convexity, f ( Y ) - f(Yi) > ( g J, Y -- YJ} = (9i, Y - x} + (g j, x - yj} > ( g j, y - x ) -
c ' ( 6 + b)
for all j, whence multiplying by 2j and summing, f ( y ) - ~ 2H(yj) > - (d, y - x ) - C'(5 + b) >_ - I d [ ' [ y - x[ - C'(6 + b).
(5.9)
N o w f(yj) - f ( x ) > - C ' [ y j - x[ >>_ - C ' ( 6 + b) for all j, so ).jf(yj) - f ( x ) >_ - C'(6 + b).
(5.10)
Adding (5.9) and (5.10) we have f(y) -f(x)>
-]d I 9 ]y - x I - 2C'(6 + b),
(5.11)
which proves the theorem. Remark 5.2. The proof gives a crude bound on the number of steps required by the algorithm when it terminates. By (5.8), the total "path length" is bounded by [ f ( X o ) - f(Xk)]/m28. The path length between resettings must be at least 6, so [f(Xo) - f(xk)]/m2e6
(5.12)
162
P. Wolfe / A method of conju,qate subgradients
bounds the number of resettings and, by (i), 4C 2[f(xo) - f(xk)]/me~a6
(5.13)
bounds the total number of steps. When the bound D of (5.2) is known, the product of the expression (5.4), with e substituted for ]d[, and the bound on steps 4C2D/m2e3•
(5.14)
bounds the total number of function and gradient evaluations for the main algorithm. Remark 5.3. Paralleling a suggestion of Hestenes and Stiefel [12, Section 7], the point ~ = ~ 2jyj (see (iii) above for 2j) may be better than x = Xk: adding (5.9) to the relation ~ 2J(yj) > f(:~), we find that for any y ~ R" f ( y ) - f ( 2 ) >_ - ~ [ y - x[ - C'(,5 + b).
(5.15)
It seems sensible to replace x by 2 at each resetting when f(2) < f(x); the proof is not affected. While it is not clear that doing this will have a great effect in the general case, it appears to be a very good idea whenfis smooth. S u p p o s e f t o be twice differentiable at the solution x* of the problem and to have a nonsingular Hessian H there; then gj = Vf(yj) = H . (yj - x*) + O(]Y~ - x'l), whence - d = H . ( 2 - x*) + O(max~ ]yj - x'l). The degree of improvement using 2 over Xk is roughly the degree to which Idl is smaller than IVf(xk)l, which could be considerable. Remark 5.4. When the conjugate gradient method is used to minimize a strongly convex differentiable function, it is known that resetting every n or more iterations makes the procedure superlinearly convergent [16, 17], while otherwise it will usually converge linearly [3]. The proof above is not altered by such resettings, provided they are done infrequently enough to allow the resetting of Step 3 of the algorithm to happen at least once every I iterations for some fixed number I. However, a valuable example due to Powell [18] briefly mentioned in [19, Section 4] shows that we cannot in general expect better than linear convergence, even when f is piecewiselinear and exact line search is done. Let f ( x ) = max(ui, x), where ui = (cos 72i, sin 72i) for i = 0, 1, 2, 3, 4 (all angles in degrees). Contours o f f and a track of the main algorithm between restarts are shown in Fig. 6. Using polar coordinates Jr, 0], choose x0 = [ro, 36], so that (Uo, Xo) = ( u l , x l ) , and choose 9o = Uo. We find
P. Wolfe / A method of conjugate subgradients
163
X1 [/'1, 108], g , - U2, dl = -Nr{uo, u2} = Is, 252], x2 = [r2, 180], g2 U3, d2 = -Nr{u0, u2, u3} = 0. The isosceles triangles (0, Xo, x0, (0, xl. x2) are similar, the ratio of side to base being 89 + x/5), so that Ix] has been reduced by that factor in each of the two steps. We restart at Xz and find a similar path being traced again. =
=
u0
Fig. 6.
Powell's example.
Remark 5.5. The above example suggests strongly that one should not restart with the bundle G reset to only a single subgradient but rather to the set of all "active" subgradients--those which actually serve to determine the descent direction in the neighborhood of x. (That determination is easy if the subgradients generated after restarting are retained, as well as the corresponding numbers f(Xk) -- (gk, Xk), since 9k is active at x just when f ( x ) = f ( X k ) + (gk, X - Xk).) Since u2, ua are both active at x2 in the example, and - N r { u2, u3 } points directly from x2 to the origin, this device would solve the problem in one step after restarting. Such a modification seems promising and could hardly do any harm, but we do not think that it will in general yield a terminating procedure for the piecewise-linear case.
6. The algorithm and conjugate gradients The algorithm of Section 4 constitutes an extension of the method of conjugate gradients developed by Hestenes and Stiefel [12] for the solution of linear equation systems or, equivalently, for the minimization of convex quadratic functions. That fact was not at all apparent at first, but emerged empirically.
164
P. Wolfe / A method of conjugate subgradients
Having done most of the work reported above, we thought it as well to make sure that the algorithm would not do something actually ridiculous on smooth problems. Applying it to a quadratic function is particularly easy (the appropriate version of the algorithm for quadratics is given below). The algebra for a general problem of two variables is not hard to do, and we were surprised to find that the procedure terminated with the exact solution in two steps. The algebra for problems of more than two variables is tedious, so we formulated several numerical problems of three and four variables and ran them with a simple interactive computer routine. We were dumbfounded to find that the algorithm terminated in an apparently exact solution in each case, in as many steps as the number of variables. Since every known optimization scheme requiring only n linear minimizations to solve any quadratic problem in n variables involves conjugate directions, we knew where to look for an explanation. Suppose for a bundle G that - d ~ Nr G is not one of the vectors in G; then there is some subset G' ~_ G, having at least two members, such that - d E conv G' and (d, g') = (d, g")
for all g', g" ~ G'.
(6.1)
In the differentiable case. taking g' = Vf(x'), g" = Vf(x"), and x" = x' + t d' for suitable t, d' the relation (6.1) constitutes a workable definition of "conjugacy" between the directions d and d' even when f is not quadratic; for when f is the quadratic function (Q is a symmetric matrix)
f(x) = (p, x ) + 89
Q x),
(6.2)
we have V f ( x ) = p + Q x, so g " - g ' = t Qd', and (6.1) becomes just d Q d' = 0. (We find (6.1) first given essentially by Zoutendijk [28, Section 11.2].) As a matter of fact, our procedure for a quadratic problem is exactly the method of conjugate gradients. We could show that most quickly by referring to Hestenes and Stiefel: the points x and gradients g Cr" in [12]) generated are the same, and using the scaling of their Modification I [12, Section 9], our "d" is their "p'. We give, however, a self-contained demonstration, since we want the stronger than customary result that the method makes sense even whenfis not convex. (That fact does not seem to be generally known about the conjugate gradient method, but it emerges from a thoughtful reading [12, Section 10, first paragraph].) We restate the algorithm of Section 4 for the case t h a t f i s the quadratic function given by (6.2). It can be greatly simplified; effectively the parameters e, 6, rn~, m2 are all set to zero, and resetting is not needed.
P. Wolfe / A method of conjugate subgradients
165
Algorithm for a quadratic objective T h e starting point x o is given. Set Go = {go} = {Vf(xo)}, and p e r f o r m the steps b e l o w for each k = 0, 1. . . . 'until stopped in step 2 or step 3.
Step 1. Set d = - N r G. Step 2. I f ( d , Qd> <_ 0, stop. ( T h e n f ( x + td) ~ - ~
as t ~ ~ . ) O t h e r -
wise do step 3. Step 3. Set
t = - ( g , d > / ( d , Qd>, x+ = x + td, g+= Vf(x+)=g + tQd. If g+ = 0, stop. Step 4. C h o o s e H _ G a n d s e t G + = H w { - d , g + } .
Lemma 2. Let the vectors go. . . . . gk be nonnull and mutually orthogonal, and let u = Nr{go . . . . . gk-,},V = Nr{go, . . . , gk}. Then (u, gi> = lul ~ # 0 for j < k, and v = Nr{u, gk}. Proof.
W e m a y write (uniquely) u = )-'4 < k 2fli with ~ 2 i = 1 and 2 i > 0,
>_ lul 2 for a l l j < k. Since (u, gi) = ~ilgil 2 and Igil r 0 for allj < k and 2 i # 0 for some j, we k n o w u # 0, and thus 2 i > 0 for j < k. gi) - In[ 2] = (u, u> - ( E 2j)lul 2 = 0 then implies (u, gi> [ul a for a l l j < k. N o t e 2 i = lul2/Igil ~ for j < k Similarly writing v = ~iXk#jgj, we have ~j --Ivl=/Igil 2 for j < k, so
2j
=
v = Zi
+ (IvlVlgkl2)g~
+ (Ivl2/Ig~12)g~,
The sum of the coefficients of this last term equals ~i- = 0, we have (v, u> = (v, gk> = Iv?, p r o v i n g v = N r { u , gk}. Theorem 2. Let f be the quadratic function of (6.2), let Xo ~ R", and let the sequences {Xk}, {dk} be generated by the algorithm above. For some k no greater than the number of distinct eigenvalues of Q, either Vf(Xk) = 0 or f(x k + tdk)~ -~ ast ~ +~. Proof.
(We d e n o t e by [ . . . ] the smallest linear subspace containing the
166
P. Wolfe / A method of conjugate subgradients
vectors . . . . and understand the expression (a, [. ; . ] ) = 0 to mean that (a, v) = 0 for all v ~ [...].) (i) We first prove the theorem in the case that H = G, and later show that the choice of H is irrelevant. We have
Go = {go},
G, = {go, g,},
a z = {go, g l , - d l , g 2 } ,
Ga = (go, g l , - d , , g z , - d 2 ,
ga},
etc., but clearly nothing else in the algorithm changes if we replace this Gk by just {g0. . . . . gk}, SO we suppose that done. (ii) Define, for k _> O, the linear subspaces
Mk = [Gk] = [ g o , ' ' ' , gk],
M~, = [ d o , . . . , dk].
(6.3)
We will prove by induction, in (iii, iv) below, that (gk+ 1, Mk) = 0 (that is, g's are orthogonal) and Mk = Mk' as long as the procedure continues. We suppose, then, that 9i ~ 0 for all j < k and that d i Q d i > 0 for all j < k. The induction hypothesis is that ( g j+ 1, m j )
(6.4)
= O,
M i = M)
(6.5)
for all j < k. (As previously, the subscript "k" will be suppressed, and "k + 1" replaced by " + ' , where convenient.) (iii) Since d = - Nr G, by L e m m a 2, d = - ~ i < k 21gj, with 2k r 0. Thus both d ~ M and g ~ [d, M k - 1] = [d, M~,_ 1] (by (6.5)) = M', whence M = M'. (iv) By L e m m a 2, (d, g) < 0. Since
f ( x + t d) = f ( x ) + t ( d , g ) + 89
Qd)
for all t, if (d, Q d) _< 0, then f ( x + t d) --* - oo as t ~ + oo. Otherwise, f ( x + t d) is minimized for t = - ( d , g)/(d, Q d) > 0, and setting g+ = x + td we have (g+, d) = 0. Thus (g+, M ) = (g+, [d, M k - 1]) = 0, which shows (6.4) for j = k. The induction is complete, the truth of (6.4, 6.5) for k = 1 being obvious. (v) We have seen in (iv) that tj # 0 for j < k. By L e m m a 2, (dj, gi) = - I d l l 2 for i -<j _< k, so that 0 = (d i, gi+ l - - g i ) -= ti (dj, Q d i ) , whence (dj, Q di) = 0, for i < j _< k; {di: j _< k} is a conjugate system. (vi) Define M~ = [go, Q go .... , Qkgo]. Then M~ = Mo; and ifM~ = Mk, t h e n gk+l = gk -~- tkQ dk implies gk+l ~ [Mk, Q Mk] -----M~+ 1, so Mk+l ~-M~+ ~. N o w as long as gk :7[=0 the dimension of M k is k -t- 1, and the dimension of M~ can be no more, so M [ = M k. Since the dimension of'ME can
P. Wolfe / A method of conjugate subgradients
167
be no greater than the degree of the minimal polynomial for Q, which is the number of distinct eigenvalues of Q, the number of steps to termination is proved. (vii) It remains to show by induction that the choice of H in step 4 of the algorithm is irrelevant. That is obvious when k = 1 ; suppose it true for allj 1. Choosing Hj = {9o, 91, . . . , 9j} for all j < k we have all the results (i-vi) above, and dk = - N r { 9 o . . . . . 9k}. Now whatever is chosen for Hk, we have from Gk+ 1 = H k t.) { - - d k , gk+l} that conv{--dk, gk+l} ~ conv Gk+l c_ conv{9o, . . . , 9k+l}" By Lemma 2 (with k replaced by k + 1, u by --dk, and v by --dk+,), N r { - d k , 9k+1} = Nr{#o . . . . . 9k+1}; it follows that Nr Gk+ 1 = Nr{#o . . . . . 9k+ 1} = --dk+ 1- The theorem is proved.
7. The outer algorithm The main algorithm uses the five parameters ~, 6, b, ml, m2 in arriving at a point which deviates from a solution of the problem to a degree measured by e, 6, and b. A complete algorithm, converging to the solution of the problem if one exists, embraces a strategy for altering those parameters appropriately between runs of the main algorithm. We call the result the "outer algorithm", and investigate here how it might be sensibly designed. We find that the best strategy we can devise for the general case can be shown to r e d u c e f t o within h of its minimum value, if any, in a number of steps (as the term is used in the main algorithm) bounded by a multiple of h-3. We fare better, however, i f f is smooth (without our knowing i t ) that is, twice differentiable and strongly convex; then convergence is linear, the number of steps being bounded by a multiple of - l o g h. The parameters ml, m2, and b do not strike us as very significant to the strategy of the outer algorithm. We will take the first two fixed at the convenient values ml = 89and m2 = -~. The parameter b plays a role in the line search like that of 6 in the main algorithm, and we shall set b = (fl - 1)6, with fl fixed at some value slightly greater than 1 ; we shall see that its exact value is not important here. Three assumptions are important for further progress: (i) The function f assumes a minimum value at some point x*~ S:
f(x*) = M = inf{f(x): x e S} > - ~ .
(7.1)
P. Wolfe / A method of conjugate subgradients
168
(ii) There is L such that
Ix - x*l - L
(7.21
for any x used in the course of any run of the main algorithm. (iii) We are willing to make the estimates l ,,~ L ,
c ,,~ C,
ho ~ f ( X o )
- M,
(7.3)
where X0 is the starting point of the outer algorithm. (As will be seen, the accuracy of the estimates does not much influence the form of the outer algorithm strategy.) We believe the estimates called for in (7.3) can be found for most problems of interest; indeed, the bounds L and C and a bound forf(Xo) - M can usually be found. For example, the eigenvalue problem [2] mentioned in Section 1 for a matrix of order n easily yields the bounds C = n 1/2, L = n [ f ( X o ) / k - 2], where 2 is the least eigenvalue of the matrix when x = 0; and the assignment problem for a matrix A of order n [13] immediately yields C = n,
L = [-maxij Aij] - [-minij Aij].
Setting y = x* in (5.11) tells us that for a terminal point x of the main algorithm f ( x ) - M <_ I d l l x
- x*l + 2c'B a,
(7.4)
where d is the last direction found (Idl and C' is the supremum of the norms of the subgradients in terms of which - d is expressed. Without those details, all we know is f ( x ) - M <_ L e + 2Cfl6,
(7.5)
and by (7.3) we are guessing f(x) -
M < l~ + 2cfl6.
(7.6)
In order to design a strategy we will assume that we have made reasonable estimates in (7.3), and we further pretend the step-number bounds developed in the proof of Theorem 1 (Remark 5.3) also to be good estimates. We really suspect them all to be pretty bad, greatly overstating the work required; but the form of the best strategy is not highly sensitive to their quality. We are even more doubtful of the estimate (5.4) of the work involved in line search, and for that reason set our goal here as that of minimizing the total number of steps, in the sense used in describing the main algorithm,
P. Wolfe / A method of conjuyate suboradients
169
required to reduce the value o f f to a specified departure from its minimum. The outer algorithm consists of the following stages for j = 1, 2, ... : Given X i_ 1: (i) Choose es, c5i. (ii) Run the main algorithm with e = ei,, 6 = 6j,, letting the final point reached be X i. Desirable choices es, 6i for (i) will now be deduced. For each stage j we shall set the goal of achieving (7.7)
f ( X s ) - M < hi;
the sequences hi, ei, 6j are to of steps needed to arrive at Given hs, the best we can to choose ei = e and 6~ = 6
be chosen so as to minimize the total number some final goal. do to ensure (7.7) seems, in view of (7.6), to be so that (7.8)
It + 2cfl6 < h i.
By (5.13) the number of steps required to get Xs will be bounded by (7.9)
2 5 c E [ f ( X i - 1) - M]/e36,
so we will choose e, 6 to maximize e36 as constrained by (7.8). Those values are then (7.10)
e = ei = 3hH41, 6 = 6 i = hi/8fle,
and (7.9) becomes A[f(Xi_I)
where A
- M]/h~,
=
2143-3C213flc.
(7.11)
For j > 0, (7.5) and (7.10) yield f(Xi)
- M < Le i + 2Cflfs = B hi,
(7.12)
where (7.13)
B = 3L/41 + C/4c,
so the number of steps in stage j is bounded by A[f(Xo)
- M]/h 4
whenj = 1
and by A B h i_ 1/h~ when j > 1.
Thus with the prescription (7.10), J stages require at most A
f(Xo) - M
h~
+ A
Ba~I
~
j=l
hi
h~+l
(7.14)
170
P. Wolfe / A method of conjugate subgradients
steps. In our state of knowledge, all we can do is assume that the estimates (7.3) yield ho close tof(X0) - M and B close to 1. Reducing f to within a given amount h of its minimum thus poses the problem: Given ho > h > 0, find an integer J and hi, h2. . . . , hs satisfying ho > hi > ... -> hj_ 1 > hs = h which minimize J-1
hjhfPl
for p = 4.
(7.15)
j=O
That problem can be solved algebraically for any p > 0, although the analysis is lengthy [27]. The main features of its solution for p > 1 are: The optimal value of J is (p2__ 1) [log(ho/h)]/log p if that is an integer, and otherwise is one of the integers bracketing that number; denoting by s(ho, h) the minimum value of the sum (7.15), lim s(ho, h)h p-
pp/p-1/(p _ 1);
1 =
h-~ O
and denoting by hj(ho, h) the value of hi when the minimum is assumed, lim hj(ho, h)/hj+ l(ho, h) -= pl/tp-1)
for anyj.
(7.16)
h~O
(The limits are, in fact, approached quite rapidly.) Thanks to the asymptotic behavior (7.16) our prescription is simply stated: set
h~+ 1 = 4-1/3hj
for allj.
(7.17)
Theorem 3. Let the hypotheses (5.1, 7.1, 7.2) hold, let positive numbers, ho, l, c be given, let hj = 4-J/3h o
(7.18)
and let ej, 6j be defined by (7.10) for j = 1, 2, ... ; then J stages of the outer algorithm yield X j ~ S with f(Xs) - M < B ha
(7.19)
in a total number of steps bounded by
44/3A [ 3 [ f ( X o ) - M] - 4B ho 3 Proof. (7.18).
L
h~
+
.
(720)
Relation (7.19) is (7.12) above, and (7.20) is (7.14) evaluated using
P. Wolfe / A method of conjugate suboradients
171
Theorem 4. Let the hypotheses of Theorem 3 hold, and in addition let f be twice differentiable, with uniform positive upper and lower bounds on the eigenvalues of its Hessian on S. Then the outer alyorithm converges linearly as counted by steps: the total number of steps required to find Xj with
f(Xj) - M < B
4-J/3h o
is bounded by E J, ,for a fixed E. Proof. Let the eigenvalue bounds be A > 2 > 0. Suppose the main algorithm to have arrived at the point X = X~ after a run of the main algorithm using e = ej, ,5 = 8j. We consider its next,run, which starts from X and uses e+ = 4- 1/3 e, 8+ = 4 - 1/38. Let 9 = Vf(X). The bound (7.9) on the number of steps per stage requires only that C bound the norms of the gradients of successors of X in the main algorithm. Knowing [24, pp. 6-8] that 22[f(x)-
M] < Ivftx)l 2 _< 2A I f ( x ) -
M]
for all x eS, replacing C 2 by 2 A [ f ( x ) - M] and then [ f ( x ) Igl2/2;. in (7.9), we obtain the bound 24AIgl'/; 38, or by (7.10) 213hl3flc ~lOi~ 2
M] by (7.21)
27; 2
bounding the number of steps per stage. At the end of stagej we had (Theorem 1, Proof, part iii) - d = ~_,2iVf(yi), with lYi - X[ < 8 for all i. It follows that [Vf(y~) - 9l < A 6 for all i, whence ] - d - 91 < A 8, whence ],q] < A 8 + Idl < A 8 + e, so that by (7.10)
fgl/hj <- 21 + A/8flc.
(7.22)
This relation, used in (7.21), gives the uniform bound E on the number of steps per stage required for the conclusion of the theorem. Remark 7.1. In specifying the "goal" h~+l for stage j + 1 of the outer algorithm, one might be tempted to replace hj of (7.17) by the better estimate lid I + 2C'fl6 for f(Xj) - M available at the beginning of the stage. We have not done that because it seems it could work badly w h e n f i s smooth. Under the hypotheses of Theorem 4, C' = 0(8), so hj = Lid[ + 0(82); and it appears that we can have Igl -- 0(8) while, by chance, Idl is nearly zero. Then lal/hj -- O(8-1), and the bound (7.21) explodes. Replacing Xj by the
172
P. Wolfe / A method of conjugate subgradients
better point 2 of Remark 5.4 makes the explosion less violent, but does not yield a uniform bound. We are not sure, however, that the difficulty is not simply an artifact of the analysis.
Acknowledgment In am indebted to Albert Feuer for his careful reading of an early version of this paper, and to Magnus Hestenes and Michael J.D. Powell for much valuable advice.
References [-1] D.P. Bertsekas and S. Mitter, "A descent numerical method for optimization problems with nondifferentiable cost functionals", SIAM Journal on Control 11 (1973) 637 652. [2] J. Cullum, W.E. Donath and P. Wolfe, "The minimization of certain nondifferentiable sums of eigenvalues of symmetric matrices", Mathematical Programming Study 3 (1975) 35-55 (this volume). [3] H.P. Crowder and P. Wolfe, "Linear convergence of the conjugate gradient method", IBM Journal of Research and Development 16 (1972) 431-433. [4] V.F. Demjanov, "Algorithms for some minimax problems", Journal of Computer and System Sciences 2 (1968) 342 380. [5] G.B. Dantzig and P. Wolfe, "The decomposition algorithm for linear programming", Econometrica 29 (1961) 767-778. [6] J.P. Evans, F.J. Gould and J.W. Tolle, "Exact penalty functions in nonlinear programming", Mathematical Programming 4 (1973) 72-97. [-7] A. Feuer, "Minimizing well-behaved functions", in: Proceedings of the twelfth annual Allerton conference on circuit and system theory, University of Illinois (October, 1974). [-8] R. Fletcher, "A FORTRAN subroutine for minimization by the method of conjugate gradients", AERE-R 7073, Theoretical Physics Division, A.E.R.E., Harwell, Berkshire (1972). [9] A.M. Geoffrion, "Elements of large-scale mathematical programming", in: A.M. Geoffrion, ed., Perspectives on optimization (Addison-Wesley, Reading, Mass., 1972), ch. 2. [-10] M. Held, R.M. Karp and P. Wolfe, "Large-scale optimization and the relaxation method", in: Proceedings of the 25th national ACM meeting, Boston, Mass. (August, 1972). [11] W.W. Hogan, R.E. Marsten and J.W. Blankenship, "Boxstep: a new strategy for large scale mathematical programming", Discussion Paper No. 46, Northwestern University, Evanston, I11. (May, 1973). [,12] M.R. Hestenes and E. Stiefel, "Methods of conjugate gradients for solving linear systems", Journal of Research of the National Bureau of Standards 49 (1952) 409-436. [,13] M. Held, P. Wolfe and H.P. Crowder, "Validation of subgradient optimization", Mathematical Programming 6 (1974) 62-88. [14] C. Lemarechak An algorithm for minimizing convex functions. Proceedings IFIP Congress 74 (August, 1974) 552-555.
P. Wolfe / A method of conjugate subgradients
173
1-15] C. Lemarechal, "An extension of Davidon methods to non differentiable functions", Mathematical Programming Study 3 (1975) 95-109 (this volume). 1-16] M. Lenard, "Practical convergence conditions for restarted conjugate gradient methods", MRC Tech. Summary Rept. No. 1373, Mathematical Research Center, University of Wisconsin, Madison (December 1973). [17] G.P. McCormick and K. Ritter, "On the convergence and rate of convergence of the conjugate gradient method", George Washington University Serial T-266 (1972). [18] M.J.D. Powell, Conversation at Yorktown Heights, N.Y., March, 1975. [19] M.J.D. Powell, "A view of unconstrained optimization", Rept. C.S.S. 14, Computer Science and Systems Division, A.E.R.E. Harwell, Oxfordshire (January, 1975). [-20] R.T. Rockafellar, "'Convex analysis" (Princeton University Press, Princeton, N.J., 1970). [-21] N.Z. Shor, "Utilization of the operation of space dilatation in the minimization of convex functions", Cybernetics (1970) 7-15. [Translation of Kybernetika 1 (1970) 6-12.] [22] N.Z. Shor, "Convergence rate of the gradient descent method with dilatation of the space", Cybernetics (1970) 102-108. [Translation of Kybernetika 2 (1970) 80-85.] [-23] P. Wolfe, "Convergence conditions for ascent methods", SlAM Review 11 (1969) 226 235. [24] P. Wolfe, "Convergence theory in nonlinear programming", in: J. Abadie, ed., Integer and nonlinear programming (North-Holland, Amsterdam, 1970) pp. 1 36. [-25] P. Wolfe, "On the convergence of gradient methods under constraint", IBM Journal of Research and Development 16 (1972) 407-411. [26] P. Wolfe, "An algorithm for the nearest point in a polytope", IBM Research Center Report RC 4887 (June, 1974). [27] P. Wolfe, "An optimization problem arising in algorithm design". Unpublishable manuscript (April, 1974). [28] G. Zoutendijk, Methods of feasible directions (Elsevier, Amsterdam, 1960).
Mathematical Programming Study 3 (1975) 174-178. North-Holland Publishing Company
BIBLIOGRAPHY*
S. Agmon, "The relaxation method for linear inequalities", Canadian Journal of Mathematics 6 (1954) 382-392. A. Auslender, "Mrthodes numeriques pour la drcomposition et la minimisation de fonctions non-differentiables", Numerische Mathematik 18 (1971) 213-223. A. Auslender, "Recherche de point de selle d'une fonction", Cahiers du Centre arEtudes de Recherche Operationnelle 12 (1970). J.W. Bandler and C. Charalambous, "Nonlinear programming using minimax techniques", Journal of Optimization Theory and Applications 13 (1974) 607-619. M.S. Bazaraa, J.F. Goode and C.M. Shetty, "Optimality criteria in nonlinear programming without differentiability", Operations Research 19 (1971) 77-86. D.P. Bertsekas, "Stochastic optimization problems with nondifferentiable cost functionals", Journal of Optimization Theory and Applications 12 (1973) 218-231. D.P. Bertsekas and S.K. Mitter, "A descent numerical method for optimization problems with nondifferentiable cost functionals", SIAM Journal on Control 11 (1973) 637-652. D.P. Bertsekas and S.K. Mitter, "Steepest descent for optimization problems with nondifferentiable cost functionals", in: Proceedings of the 5th annual Princeton conference on information sciences and systems (Princeton, N.J., U.S.A., 1971) pp. 347-351. B. Birzak and B.N. Psenichnyi, "Some problems of the minimization of unsmooth functions", Kibernetika 2 (6) (1966) 43-46. Jerome Bracken and James T. McGill, "Mathematical programs with optimization problems in the constraints", Operations Research 21 (1973) 37-44. Jerome Bracken and James T. McGill, "A method for solving mathematical programs with nonlinear programs in the constraints", Operations Research 22 (1974). * Items not easily attainable have not been included.
Bibliography
175
F.H. Clarke, "Generalized gradients and applications", Transactions of the American Mathematical Society 205 (1975) 247-262. A.R. Conn, "Constrained optimization using a nondifferentiable penalty function", SIAM Journal on Numerical Analysis 10 (1973) 760-784. J.M. Danskin, "The theory of max-min with applications", SIAM Journal on Applied Mathematics 14 (1966) 641-664. J.M. Danskin, The theory ofmax-min (Springer, Berlin, 1967). G.B. Dantzig, "On a problem of Rosanov", Applied Mathematics and Optimization 1 (1974). V.F. Demjanov, "Algorithms for some minimax problems", Journal of Computer and Systems Sciences 2 (1968) 342-380. V.F. Demjanov, "The solution of several minimax problems", Kibernetika 2 (6) (1966) 58-66. V.F. Demjanov and V.N. Malozemov, "The theory of nonlinear minimax problems", Uspekhi Matematcheski Nauk 26 (1971) 53-104. V.F. Demjanov and A.M. Rubinov, "Minimization of functionals in normed spaces", S l A M Journal on Control 6 (1968) 73-89. I.I. Eremin, "A generalization of the Motzkin-Agmon relaxation method", Uspekhi Matematcheski Nauk 20 (1965) 183-187. Yu. M. Ermolev, "Methods of solution of nonlinear extremal problems", Kibernetika 2 (4) (1966) 1-17. Yu. M. Ermolev, "On the stochastic gradient method and stochastic quasi-F6jer sequences", Kibernetika 5 (2) (1969) 73-84. Yu. M. Ermolev and L.G. Ermolieva, "A method of parametric decomposition", Kibernetika 9 (2) (1973). Yu. M. Ermolev and T.P. Marianovich, "Optimization and modeling", Problemy Kibernetiki 27 (1973). Yu. M. Ermolev and E.A. Nurminskii, "Boundary extremum problems", Kibernetika 9 (4) (1973). J.P. Evans, F.J. Gould and J.W. Tolle, "Exact penalty functions in nonlinear programming", Mathematical Programming 4 (1973) 72-97. Albert Feuer, "Minimizing well-behaved functions", in: Proceedings of the twelfth annual Allerton conference on circuit and system theory (Coordinated Science Laboratory, University of Illinois, Urbana, Illinois, 1974). M.L. Fisher, "A dual algorithm for the one-machine scheduling problem", Graduate School of Business Rept., University of Chicago, Chicago, Ill. (1974). R. Fletcher, "An exact penalty function for nonlinear programming with
176
Bibliography
inequalities", Mathematical Programming 5 (1973) 129-150. A.M. Geoffrion, "Elements of large-scale mathematical programming", in: A.M. Geoffrion, ed., Perspectives in optimization (Addison-Wesley, Reading, Mass., 1972) ch. 2. R.C. Grinold, "Lagrangian subgradients", Management Science 17 (1970) 185-188. R.C. Grinold, "Steepest ascent for large scale linear programs", SIAM Review 14 (1972) 447-464. M. Held, R.M. Karp and P. Wolfe, "Large-scale optimization and the relaxation method", in: Proceedings of the 25th national ACM meeting, held at Boston, Mass., August 1972. M. Held, P. Wolfe and H. Crowder, "Validation of subgradient optimization", Mathematical Programming 6 (1974) 62-88. W.W. Hogan, R.E. Marsten and J.W. Blankenship, "The Boxstep method for large scale optimization", Operations Research 23 (1975). C. Lemarechal, "An algorithm for minimizing convex functions", in: J.L. Rosenfeld, ed., Information processing '74 (North-Holland, Amsterdam, 1974) pp. 552-556. C. Lemarechal, "Note on an extension of 'Davidon~ methods to nondifferentiable functions", Mathematical Programming 7 (1974) 384387. E.S. Levitin, " A general minimization method for unsmooth extremal problems", Zurnal Vycislitel'noi Matematik i Matematiceskoi Fiziki (U.S.S.R. Computational Mathematics and Mathematical Physics) 9 (1969) 783-806. D.G. Luenberger, "Control problems with kinks", I.E.E.E. Transactions on Automatic Control AC-15 (1970) 570-575. V.C. Mikhalevich, Yu. M. Ermolev, V.V. Skurba and N.Z. Shor, "Complex systems and the solution ofextremum problems", Kibernetika 3 (5) (1967) 29-39. J.J. Moreau, "Fonctionneiles sous-diff6rentiables", Comptes-Rendus Hebdomadaires des S~ances de l'Acad~mie des Sciences 257 (1963) 4117-4119. J.J. Moreau, "Semi-continuit6 de sous-gradient d'une fonctionnelle", Comptes-Rendus Hebdomadaires des S~ances de l'Acad~mie des Sciences, Groupe 1 260 (1965) 1067-1070. T. Motzkin and I.J. Schoenberg, "The relaxation method for linear inequalities", Canadian Journal of Mathematics 6 (1954) 393-404. L.W. Neustadt, "A general theory of extremals", Journal of Computer and Systems Sciences 3 (1969) 57-92.
Bibliography
177
E.A. Nurminskii, "Minimization of non-differentiable functions with perturbations", Kibernetika 10 (4) (1974). E.A. Nurminskii, "The subgradient method for solving problems in nonlinear programming", Kibernetika 9 (1) (1973). B.T. Polyak, "A general method for solving extremal problems", Doklady Akademii Nauk SSR 174 (1967) 33-36 [-In Russian; Engl. transl.: Soviet Mathematics Doklady 8 (1967) 593-597]. B.T. Polyak, "Minimization of unsmooth functionals", Zurnal VycisliteFkoi Mathematik i Matematiceskoi Fiziki (USSR Computational Mathematics and Mathematical Physics)9 (1969) 509-521. R.A. Polyak, "On optimal convex Tchebycheff approximation", Doklady Akademii Nauk SSSR 200 (1971) 538-540. B.N. Pshenichnyi, "Convex programming in a normed space", Kibernetika 1 (5) (1965) 46-54. B.N. Pshenichnyi, "Dual methods in extremum problems", Kibernetika 1 (3) (1965) 89 -95. R.T. Rockafeilar, "Conjugate convex functions in optimal control and the calculus of variations", Journal of Mathematical Analysis and Applications 32 (1970) 174-222. R.T. Rockafellar, Convex analysis (Princeton University Press, Princeton, N.J., 1970). N.Z. Shor, "Application of the generalized gradient method in blockprogramming", Kibernetika 3 (3)(1967)53-55. N.Z. Shor, "Generalized gradient search", in: Transactions ofthefirst winter school on mathematical programming, Drogobich, Vol. 3, Moscow 1969. N.Z. Shor, "On a method for minimizing nearly differentiable functions", Kibernetika 8 (4) (1972). N.Z. Shor, "On the convergence rate of the generalized gradient method with space dilation", Kibernetika 6 (2) (1970). N.Z. Shor, "On the rate of convergence of the generalized gradient method", Kibernetika 4 (3) (1968). N.Z. Shor, "Utilization of the operation of space dilation in the minimization of convex functions", Kibernetika 6 (1) (1970) 6-12 [Engl. transl.: Cybernetics 6 (1) 1970, 7-15]. N.Z. Shor and P.R. Gamburg, "Some questions concerning the convergence of the generalized gradient method", Kibernetika 7 (6) (1971) 82-84. N.Z. Shor and N.G. Jourbenko, "A method of minimization using space dilation in the direction of two successive gradients", Kibernetika 7 (3) (1971) 51-59.
178
Bibliooraphy
N.Z. Shor and L.P. Shabashova, "On the solution of minimax problems by the generalized gradient method with space dilation", Kibernetika 8 (1)(1972). H.S. Witsenhausen, "A minimax control problem for sampled linear systems", I.E.E.E. Transactions on Automatic Control AC-13 (1968) 5-21. H.S. Witsenhausen, "Minimax control of uncertain systems", M.I.T. Electronic Systems Laboratory Rept. ESL-R-269, Cambridge, Mass. (1969). P. Wolfe, "Convergence theory in nonlinear programming", in: J. Abadie, ed., Integer and nonlinear programmin9 (North-Holland, Amsterdam, 1970) ch. 1. P. Wolfe, "Note on a method of conjugate subgradients for minimizing nondifferentiable functions", Mathematical Programmino 7 (1974) 380383. S.I. Zukhovitskii, "On the approximation of real functions in the sense of Tchebycheff", Uspehi Matematieeskhi Nauk 11 (1956)125-159. S.I. Zukhovitskii and R.A. Polyak, "An Algorithm for the solution of rational Tchebycheff approximation problems", Doklady Akademii Nauk SSSR 159 (1964). S.I. Zukhovitskii, R.A. Polyak and M.E. Primak, "An Algorithm for the solution of convex Tchebycheff approximation problems", Doklady Akademii Nauk SSSR 151 (1963) 27-30.