MATHEMATICAL PROGRAMMING STUDIES
Editor-in-Chief
R.W. COTTLE, Department of Operations Research, Stanford University, Stanford, CA 94305, U.S.A.

Co-Editors
L.C.W. DIXON, Numerical Optimisation Centre, The Hatfield Polytechnic, College Lane, Hatfield, Hertfordshire AL10 9AB, England
B. KORTE, Institut für Ökonometrie und Operations Research, Universität Bonn, Nassestrasse 2, D-5300 Bonn 1, W. Germany
T.L. MAGNANTI, Sloan School of Management, Massachusetts Institute of Technology, Cambridge, MA 02139, U.S.A.
M.J. TODD, School of Operations Research and Industrial Engineering, Upson Hall, Cornell University, Ithaca, NY 14853, U.S.A.

Associate Editors
E.L. ALLGOWER, Colorado State University, Fort Collins, CO, U.S.A.
R. BARTELS, University of Waterloo, Waterloo, Ontario, Canada
V. CHVATAL, McGill University, Montreal, Quebec, Canada
J.E. DENNIS, Jr., Rice University, Houston, TX, U.S.A.
B.C. EAVES, Stanford University, CA, U.S.A.
R. FLETCHER, University of Dundee, Dundee, Scotland
J.-B. HIRIART-URRUTY, Université Paul Sabatier, Toulouse, France
M. IRI, University of Tokyo, Tokyo, Japan
R.G. JEROSLOW, Georgia Institute of Technology, Atlanta, GA, U.S.A.
D.S. JOHNSON, Bell Telephone Laboratories, Murray Hill, NJ, U.S.A.
C. LEMARECHAL, INRIA-Laboria, Le Chesnay, France
L. LOVASZ, University of Szeged, Szeged, Hungary
L. MCLINDEN, University of Illinois, Urbana, IL, U.S.A.
M.W. PADBERG, New York University, New York, U.S.A.
M.J.D. POWELL, University of Cambridge, Cambridge, England
W.R. PULLEYBLANK, University of Calgary, Calgary, Alberta, Canada
K. RITTER, University of Stuttgart, Stuttgart, W. Germany
R.W.H. SARGENT, Imperial College, London, England
D.F. SHANNO, University of Arizona, Tucson, AZ, U.S.A.
L.E. TROTTER, Jr., Cornell University, Ithaca, NY, U.S.A.
H. TUY, Institute of Mathematics, Hanoi, Socialist Republic of Vietnam
R.J.B. WETS, University of Kentucky, Lexington, KY, U.S.A.
C. WITZGALL, National Bureau of Standards, Washington, DC, U.S.A.

Senior Editors
E.M.L. BEALE, Scicon Computer Services Ltd., Milton Keynes, England
G.B. DANTZIG, Stanford University, Stanford, CA, U.S.A.
L.V. KANTOROVICH, Academy of Sciences, Moscow, U.S.S.R.
T.C. KOOPMANS, Yale University, New Haven, CT, U.S.A.
A.W. TUCKER, Princeton University, Princeton, NJ, U.S.A.
P. WOLFE, IBM Research Center, Yorktown Heights, NY, U.S.A.
MATHEMATICAL PROGRAMMING STUDY 16 A PUBLICATION OF THE MATHEMATICAL PROGRAMMING SOCIETY
Algorithms for Constrained Minimization of Smooth Nonlinear Functions Edited by A.G. BUCKLEY and J.-L. GOFFIN
April 1982
NORTH-HOLLAND PUBLISHING COMPANY - AMSTERDAM
© The Mathematical Programming Society, Inc. - 1982 All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, without the prior permission of the copyright owner. Submission to this journal of a paper entails the author's irrevocable and exclusive authorization of the publisher to collect any sums or considerations for copying or reproduction payable by third parties (as mentioned in article 17 paragraph 2 of the Dutch Copyright Act of 1912 and in the Royal Decree of June 20, 1974 (S. 351) pursuant to article 16b of the Dutch Copyright Act of 1912) and/or to act in or out of Court in connection therewith.
This STUDY is also available to non-subscribers in a book edition.
Printed in The Netherlands
PREFACE

Although the subject of this Study could adequately be described as 'nonlinear programming', we have chosen to be more specific, for nonlinear programming is rather a broad field and it is perhaps not even actually agreed just what topics it encompasses. If you like, this Study examines one topic in nonlinear programming, as reflected in the title. The papers presented here do vary in the problem they discuss and they do differ in their point of view, but they are unified in their consideration of the minimization of smooth nonlinear functions with constrained variables.

The impetus for this Study was the Tenth Mathematical Programming Symposium held in Montreal in August, 1979. Most of these papers were presented at that time, but the presentations then were often much briefer than the papers here, and indeed, the Study was not specifically limited to any connection with the Montreal meeting. By the same token, this Study is not intended to be a Proceedings for that meeting, and some Montreal papers which were submitted to the Study were not included because they did not jibe with the theme chosen for the Study (although some were undoubtedly of sufficient calibre).

In general, all of the papers consider the problem

NLP: minimize f(x), x ∈ Rⁿ,
subject to ci(x) ≥ 0, i ∈ I,
ci(x) = 0, i ∈ E,

where all functions are smooth. One notable exception is Schnabel's paper (7), which considers the question of finding a point which satisfies the constraints, without regard for any objective function. Nonetheless, we do feel the paper is suitable for the Study. If for no other reason, finding a feasible point amounts to minimizing f ≡ 0 subject to the constraints. But, more importantly, effective determination of an initial feasible point can be important for some algorithms. Schnabel's method is penalty function based, but the author comments that a finite value normally suffices for the penalty parameter. The algorithm either returns a feasible point or indicates infeasibility. Some numerical results indicate the efficiency of this penalty function approach for the particular problem of finding feasible points. Schnabel mentions, as well, a situation in which feasibility and not optimality is the relevant point.

The second paper which is perhaps slightly exceptional is that by Shanno and Marsten (8), for they alone restrict consideration to linear constraints. They examine the application of conjugate gradient algorithms to linearly constrained
problems, with particular reference to related work of Murtagh and Saunders on MINOS. This really ties the Shanno and Marsten paper to the Study, for MINOS is again considered by Murtagh and Saunders in paper (5) (where they extend it to nonlinear constraints). Furthermore, algorithms discussed in some of the other papers require the solution of linearly constrained subproblems, to which the work of Shanno and Marsten could have relevance. The paper essentially considers the extension of Shanno's 'memoryless' quasi-Newton algorithm to constrained problems.

The other papers in the Study deal specifically with algorithms for solving NLP, but with varying balances of theory and computation. At one extreme, the paper (6) of Sandgren and Ragsdell is almost exclusively computational. They present an evaluation of several algorithms, in fact of classes of algorithms, based both on general problems of the form NLP and on the sort of NLP problems arising in engineering design. Their results are favorable to reduced gradient codes, but the authors note that their tests were performed only with codes accessible to them, and did not include any recursive quadratic programming codes. This paper is interesting both for the results presented and for the means by which they are presented.

Murtagh and Saunders also present significant computational results, but for one specific algorithm, MINOS/AUGMENTED, which is the extension of their well known MINOS code to nonlinear constraints. One significant feature of their work is that they give considerable emphasis to the actual implementation of the code, an important point which is often glossed over. Their method is based on Robinson's algorithm and solves successive linearly constrained subproblems, using MINOS of course. An interesting aspect of their work is that it alone, of all papers in the Study, is designed for large sparse problems.

The remaining papers are primarily theoretical.
Murray and Wright in (4) consider one specific concern, that of determining a direction of search by means of solving a quadratic subproblem. Linear constraints are specifically discussed, as are issues of solving either equality or inequality constrained subproblems. Their conclusion discusses unresolved difficulties for line searches in constrained minimization, a point which is taken up by Chamberlain, Lemarechal, Pedersen and Powell in (1). The 'Watchdog Technique' suggests a relaxation of the line search criteria used in the algorithm of Han, so as to avoid large numbers of very short steps near curved constraint boundaries. Global and superlinear convergence theorems are proven, and, although no numerical results are given, the authors indicate that the watchdog technique has been successfully implemented.

Each of the last three papers presents an algorithm for solving NLP. Each also presents results concerning the global and superlinear convergence properties of the proposed algorithm. The paper (2) of Gabay discusses a quasi-Newton method in which an approximation to a reduced Hessian of the Lagrangian is stored. A line search is again required, but along a curve defined by two directions; an exact penalty function determines the step size. The relation to multiplier and recursive quadratic
programming methods is investigated. The method is limited to equality constraints. In (3), Mayne and Polak address similar points. A search direction found from a quadratic approximation is determined; an exact penalty function determines the step length. Near the solution, searching along an arc prevents loss of superlinear convergence. Some test results are given. Finally, van der Hoek (9) discusses another algorithm based on successively solving reduced and linearly constrained subproblems. No numerical results are given, but there is careful consideration of the convergence properties of the proposed algorithm. The use of an active set strategy is considered.

We would like to express our sincere thanks to those researchers who found the time to give to these papers. The papers are neither short nor simple (as befits the problem), and therefore demanded more than a few moments of time for the careful evaluations produced by the referees. One anonymous referee deserves special thanks, and we the editors and that referee know well to whom the thanks are due!

Albert G. Buckley
Jean-Louis Goffin
CONTENTS
Preface  v
(1) The watchdog technique for forcing convergence in algorithms for constrained optimization, R.M. Chamberlain, M.J.D. Powell, C. Lemarechal and H.C. Pedersen  1
(2) Reduced quasi-Newton methods with feasibility improvement for nonlinearly constrained optimization, D. Gabay  18
(3) A superlinearly convergent algorithm for constrained optimization problems, D.Q. Mayne and E. Polak  45
(4) Computation of the search direction in constrained optimization algorithms, W. Murray and M.H. Wright  62
(5) A projected Lagrangian algorithm and its implementation for sparse nonlinear constraints, B.A. Murtagh and M.A. Saunders  84
(6) On some experiments which delimit the utility of nonlinear programming methods for engineering design, E. Sandgren and K.M. Ragsdell  118
(7) Determining feasibility of a set of nonlinear inequality constraints, R.B. Schnabel  137
(8) Conjugate gradient methods for linearly constrained nonlinear programming, D.F. Shanno and R.E. Marsten  149
(9) Asymptotic properties of reduction methods applying linearly equality constrained reduced problems, G. van der Hoek  162
Mathematical Programming Study 16 (1982) 1-17 North-Holland Publishing Company
THE WATCHDOG TECHNIQUE FOR FORCING CONVERGENCE IN ALGORITHMS FOR CONSTRAINED OPTIMIZATION*

R.M. CHAMBERLAIN and M.J.D. POWELL
Department of Applied Mathematics and Theoretical Physics, University of Cambridge, Silver Street, Cambridge CB3 9EW, England

C. LEMARECHAL
I.N.R.I.A., Domaine de Voluceau, 78150 Rocquencourt, France

H.C. PEDERSEN
Numerisk Institut, Danmarks Tekniske Højskole, 2800 Lyngby, Denmark

Received 18 March 1980
Revised manuscript received 1 April 1981
The watchdog technique is an extension to iterative optimization algorithms that use line searches. The purpose is to allow some iterations to choose step-lengths that are much longer than those that would normally be allowed by the line search objective function. Reasons for using the technique are that it can give large gains in efficiency when a sequence of steps has to follow a curved constraint boundary, and it provides some highly useful algorithms with a Q-superlinear rate of convergence. The watchdog technique is described and discussed, and some global and Q-superlinear convergence properties are proved.
Key words: Constrained Optimization, Convergence Theory, Line Searches.
1. Introduction

The general nonlinear programming problem is to calculate a vector of variables x that minimizes an objective function F(x), subject to constraints of the form

ci(x) = 0,  i = 1, 2, ..., m',
ci(x) ≥ 0,  i = m' + 1, ..., m.  (1.1)
We suppose that the functions F and {ci; i = 1, 2, ..., m} are real-valued and twice continuously differentiable. Most algorithms for solving this problem are
*Presented at the Tenth International Symposium on Mathematical Programming (Montreal, 1979).
R.M. Chamberlain et al./The watchdog technique
iterative. Given a starting vector x0, they generate a sequence of points {xk; k = 0, 1, 2, ...} that is intended to converge to the required solution, x* say. One of the main difficulties in practice is to achieve the required convergence when x0 is far from x*. The main technique for this purpose depends on a 'line search objective function', which is usually the sum of F(x) and a penalty term that becomes positive when the constraints are violated. For example, Han [4] recommends the function

W(x) = F(x) + Σ_{i=1}^{m'} μi |ci(x)| + Σ_{i=m'+1}^{m} μi |min[0, ci(x)]|,  (1.2)

where each μi is a positive parameter. In order to help convergence, each new vector of variables in the sequence {xk; k = 0, 1, 2, ...} is calculated to satisfy an inequality that is at least as strong as the condition

W(xk+1) < W(xk).  (1.3)
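As an aside, the penalty function (1.2) is simple to evaluate in practice. The sketch below (ours, not from the paper; the callables and weights are hypothetical inputs) builds W from an objective, equality constraints (indices i ≤ m') and inequality constraints (indices i > m'):

```python
# Illustrative sketch of Han's line search objective (1.2).
# f, c_eq, c_ineq and the positive weights mu_eq, mu_ineq are
# hypothetical problem data, not names used by the paper.

def make_W(f, c_eq, c_ineq, mu_eq, mu_ineq):
    """Return W(x) = F(x) + sum mu_i*|c_i(x)| + sum mu_i*|min(0, c_i(x))|."""
    def W(x):
        total = f(x)
        for mu, c in zip(mu_eq, c_eq):        # equality violations
            total += mu * abs(c(x))
        for mu, c in zip(mu_ineq, c_ineq):    # inequality violations
            total += mu * abs(min(0.0, c(x)))
        return total
    return W

# Example: minimize -x1 subject to x1^2 + x2^2 - 1 = 0, with mu = 2.
W = make_W(lambda x: -x[0], [lambda x: x[0]**2 + x[1]**2 - 1.0], [], [2.0], [])
print(W([1.0, 0.0]))   # feasible point: W = F = -1.0
print(W([2.0, 0.0]))   # infeasible: W = -2 + 2*|3| = 4.0
```

On feasible points the penalty term vanishes and W coincides with F; off the constraint set the weighted violations dominate, which is what forces the steep sides discussed below.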
Han [4] analyses a particular algorithm that obtains xk+1 by searching from xk along a direction dk, where the step-length is chosen to give the reduction (1.3). The search direction dk is the vector d that minimizes the quadratic function

Q(d) = F(xk) + dᵀ∇F(xk) + ½dᵀBk d,  (1.4)

subject to the linear constraints

ci(xk) + dᵀ∇ci(xk) = 0,  i = 1, 2, ..., m',
ci(xk) + dᵀ∇ci(xk) ≥ 0,  i = m' + 1, ..., m,  (1.5)
where Bk is a positive definite approximation to the second derivative matrix of the Lagrangian function. Han assumes that the matrices {Bk; k = 0, 1, 2, ...} are uniformly bounded, that their eigenvalues are bounded away from zero, and that the constraints (1.5) are consistent. Under these assumptions and some other conditions, he proves that all accumulation points of the sequence {xk; k = 0, 1, 2, ...} are Kuhn-Tucker points of the optimization calculation. Let {λi; i = 1, 2, ..., m} be the Lagrange parameters at the solution of the quadratic programming problem that defines dk. One of the conditions of Han's analysis is that the parameters {μi; i = 1, 2, ..., m} of the objective function (1.2) are bounded below by the inequality

μi ≥ |λi|,  i = 1, 2, ..., m.  (1.6)
Each μi is a constant that has to satisfy this condition for every value of the iteration number k. Therefore sometimes these parameters have to be quite large. Large parameter values cause the line search objective function (1.2) to have steep sides at the boundary of the feasible region. Therefore, because of condition (1.3), many iterations may be needed if the calculated vectors of variables have to follow a curved constraint boundary. This inefficiency can
often be avoided by revising the parameters {μi; i = 1, 2, ..., m} on each iteration. Han [4] proves that the reduction (1.3) can be obtained for any values of the parameters that satisfy condition (1.6). An algorithm that calculates each search direction by minimizing the function (1.4) subject to the constraints (1.5), that adjusts the parameters {μi; i = 1, 2, ..., m} automatically on each iteration, and that satisfies condition (1.3) for each value of k, is described by Powell [6]. A recent comparison of algorithms for constrained optimization, made by Schittkowski [7], suggests that this algorithm is more efficient than other methods, if one measures efficiency by counting function and gradient evaluations. The automatic adjustment of {μi; i = 1, 2, ..., m}, however, violates the conditions of Han's convergence theory, and some pathological examples, due to Chamberlain [1], show that the algorithm can cycle instead of converging. Therefore a technique, called the 'watchdog technique', is proposed in Section 2, that relaxes the line search condition on xk+1 in a way that preserves a global convergence property. The idea is to let the parameters {μi; i = 1, 2, ..., m} be constants that satisfy Han's conditions, but to allow a violation of inequality (1.3) if W(xk) is substantially smaller than the numbers {W(xi); i = 0, 1, ..., k − 1}. Thus many of the inefficiencies at curved constraint boundaries are avoided.

Another advantage of the watchdog technique is that it gives a Q-superlinear rate of convergence in some cases when condition (1.3) prevents both Han's and Powell's algorithms from converging superlinearly. This subject is studied in Section 3, where it is assumed that the sequence {xk; k = 0, 1, 2, ...} converges to a Kuhn-Tucker point at which the gradients of the active constraints are linearly independent, and at which some strict complementarity and second order sufficiency conditions are satisfied.
In this case there are several useful ways of choosing the matrix Bk of expression (1.4) so that the step-length αk = 1 in the definition

xk+1 = xk + αk dk  (1.7)

would give superlinear convergence. Therefore we ask whether condition (1.3) allows a step-length of one. A simple example shows that it is possible for the answer to this question to be unfavourable if Han's or Powell's algorithm is used. It is proved, however, that the watchdog technique avoids the need to satisfy condition (1.3) on the iterations where the condition would prevent a superlinear rate of convergence.

When the watchdog technique is used, and when xk+1 does not have to satisfy condition (1.3), then one may let αk = 1 in equation (1.7). Sometimes, however, it may be helpful to include a restriction on xk+1. For instance one may prefer to reject any changes to the variables that make the objective function and all the constraint violations worse. If W(xk + dk) is greater than W(xk), then it is usual for any constraint violations at xk to be so small that the search direction dk is almost parallel to the boundaries of the active constraints. Often in this case the
Lagrangian function has positive curvature along the search direction. Therefore we recommend the inequality

Lk(xk+1) ≤ Lk(xk)  (1.8)

as an alternative to condition (1.3), where Lk is an approximation to the Lagrangian function of the main calculation. The Lagrange parameters of Lk can be estimated from the quadratic programming calculation that determines dk. Section 4 studies a version of the watchdog technique that requires inequality (1.8) to hold if the reduction (1.3) is not obtained. It is proved that the new restriction on xk+1 does not impair the superlinear convergence result of Section 3, provided that the Lagrange parameter estimates of Lk converge to the Lagrange parameters of the main calculation at the Kuhn-Tucker limit point of the sequence {xk; k = 0, 1, 2, ...}. A discussion of the usefulness of the watchdog technique to practical algorithms is given in Section 5. It includes some comments on the problem of choosing suitable values of the parameters {μi; i = 1, 2, ..., m} automatically.
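For equality constraints only, the search-direction subproblem (1.4)-(1.5) reduces to a linear KKT system. The following sketch (ours, not the paper's; Python with NumPy) solves that system, and as a usage check reproduces the direction (3.4) of the circle example of Section 3 with τ = 0:

```python
import math
import numpy as np

# Sketch: with equality constraints only, the d that minimizes (1.4)
# subject to (1.5) satisfies the KKT system
#   [ B  -A^T ] [ d      ]   [ -g ]
#   [ A    0  ] [ lambda ] = [ -c ]
# where the rows of A are the constraint gradients at x_k,
# g = grad F(x_k), and c holds the constraint values c_i(x_k).

def qp_direction(B, g, A, c):
    n, m = B.shape[0], A.shape[0]
    K = np.block([[B, -A.T], [A, np.zeros((m, m))]])
    sol = np.linalg.solve(K, np.concatenate([-g, -c]))
    return sol[:n], sol[n:]   # d_k and the Lagrange parameter estimates

# Circle example of Section 3 with tau = 0: F(x) = -x1 on x1^2 + x2^2 = 1,
# B_k = I, at x_k = (cos phi, sin phi) (a feasible point, so c = 0).
phi = 0.5
d, lam = qp_direction(np.eye(2),
                      np.array([-1.0, 0.0]),
                      np.array([[2.0 * math.cos(phi), 2.0 * math.sin(phi)]]),
                      np.array([0.0]))
# d agrees with (3.4): d_k = (sin^2 phi, -sin phi cos phi)
```

In the inequality-constrained case a quadratic programming solver with an active set is needed instead of one linear solve; the estimates returned in `lam` are the λi that enter condition (1.6).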
2. The watchdog technique and global convergence

We consider iterative algorithms for constrained optimization that apply (1.7) to calculate a sequence {xk; k = 0, 1, 2, ...} of vectors of variables that is intended to converge to the solution of the optimization problem. We assume that, for any xk, a suitable search direction dk can be calculated. In order to include the watchdog technique in an algorithm, it is necessary for the unmodified algorithm to define the step-length αk in the following way. A step-length of one is tried initially. If necessary it is reduced recursively, where the ratio of successive trial values is bounded away from zero, until a prescribed condition is satisfied. We refer to this condition in the unmodified algorithm as the standard criterion. The watchdog technique allows certain iterations to use a relaxed criterion instead of the standard criterion to decide whether or not a trial step-length is acceptable. The relaxed criterion may be any condition that is never more restrictive than the standard criterion; for example the relaxed criterion may be to accept the first trial step-length in all cases.

The following description of the watchdog technique indicates in general terms the iterations that may use the relaxed criterion instead of the standard one. In this description it is assumed that a line search objective function W(x) is being used to force global convergence, but, even if the standard criterion gives the reduction (1.3), it can happen due to the relaxed criterion that the numbers {W(xk); k = 0, 1, 2, ...} do not decrease strictly monotonically. We let k be the iteration number, we let l be the greatest integer in [0, k] such that W(xl) is the least of the numbers {W(xi); i = 0, 1, ..., k}, and we let t be a fixed positive integer.
Step 0: The initial vector of variables x0 and the positive integer t are given. Set k = l = 0 and set the line search criterion to standard.
Step 1: Calculate the search direction dk and set the initial trial value of the step-length αk = 1.
Step 2: Reduce αk if necessary until the current line search criterion is satisfied. Let xk+1 have the value (1.7).
Step 3: If W(xk+1) is 'sufficiently less' than W(xl), then set the line search criterion for the next iteration to relaxed. Otherwise set it to standard.
Step 4: If W(xk+1) ≤ W(xl), set l = k + 1.
Step 5: If k = l + t, replace xk+1 by xl and set l = k + 1.
Step 6: If another iteration is required, then increase k by one and return to Step 1.

The watchdog technique is an extension to an optimization algorithm, because, if the relaxed criterion is the same as the standard criterion, then the original algorithm is obtained. We consider the way in which the technique may work when W is the function (1.2), and when a nonlinear equality constraint is present whose multiplier μi is very large. In this case, if xk is close to the constraint boundary, then usually the standard criterion requires a very short step-length. When xk is close to the constraint boundary, however, then it is usual for W(xk) to be so small that the current line search criterion is relaxed, which may allow a more efficient value of αk. If the calculated step is suitable, then one or two iterations of the standard line search procedure should provide a new least value of W(x). However, if W(xk+1) is larger than W(xk) due to the relaxed condition, and if t further iterations of the standard procedure fail to bring the objective function back to its least calculated value, then, because it seems that the alternative procedure is not helpful, Step 5 restores the vector of variables to the value that gives the least calculated W(x), in order that a further reduction to the objective function can be made.
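The bookkeeping of Steps 0-6 can be outlined in code. The following Python schematic is ours, not the authors' implementation: the direction computation, the Step 2 line search, and the 'sufficiently less' test of Step 3 are supplied as hypothetical callables, and the toy run simply accepts the unit step.

```python
# Schematic of Steps 0-6 of the watchdog technique (an illustration only).
# direction(x) returns d_k; try_step(x, d, criterion) performs Step 2 and
# returns x_{k+1}; sufficiently_less(W_new, x_best) is the Step 3 test.

def watchdog(x0, t, W, direction, try_step, sufficiently_less, n_iter):
    xs = [x0]
    l = 0                                   # Step 0: index of least W so far
    criterion = 'standard'
    for k in range(n_iter):
        d = direction(xs[k])                # Step 1
        xs.append(try_step(xs[k], d, criterion))   # Step 2
        x_new = xs[k + 1]
        criterion = ('relaxed' if sufficiently_less(W(x_new), xs[l])
                     else 'standard')       # Step 3
        if W(x_new) <= W(xs[l]):            # Step 4: record a new best point
            l = k + 1
        if k == l + t:                      # Step 5: restore the best point
            xs[k + 1] = xs[l]
            l = k + 1
    return xs                               # Step 6: iterate

# Toy run on W(x) = x^2 with the Newton direction d = -x.
W = lambda x: x * x
xs = watchdog(4.0, t=2, W=W,
              direction=lambda x: -x,
              try_step=lambda x, d, crit: x + d,   # always accept unit step
              sufficiently_less=lambda Wn, xb: Wn < W(xb) - 1e-3,
              n_iter=3)
print(xs[-1])   # 0.0: the unit Newton step reaches the minimizer
```

Note that Step 5 cannot fire on an iteration where Step 4 has just set l = k + 1, which matches the ordering of the steps in the text.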
In the remainder of this paper we study the watchdog technique applied to a particular algorithm, that is similar to the one proposed by Han [4], except that Han makes an exact line search on each iteration. Therefore the following conditions are assumed in all the given theorems. The search direction dk is calculated by minimizing the function (1.4) subject to the constraints (1.5). The parameters of the line search objective function (1.2) are constants that satisfy condition (1.6) on each iteration. The standard criterion is that the vector (1.7) must satisfy the inequality

W(xk+1) ≤ W(xk) − θ[W(xk) − Wk(xk+1)],  (2.1)

where θ is any constant from the interval

0 < θ < ½,  (2.2)

and where Wk is the approximation to W that is obtained by replacing the
functions F(x) and {ci(x); i = 1, 2, ..., m} in equation (1.2) by their first order Taylor series approximations about x = xk. Condition (2.1) can be satisfied because inequality (1.6) and positive definiteness imply that [W(xk) − Wk(xk + αdk)] is positive for all 0 < α ≤ 1, and because the ratio

[W(xk) − W(xk + αdk)]/[W(xk) − Wk(xk + αdk)]  (2.3)
tends to one as α tends to zero. For consistency with expression (2.1), we define W(xk+1) to be 'sufficiently less' than W(xl) in Step 3 of the watchdog technique if and only if it satisfies the condition

W(xk+1) ≤ W(xl) − θ[W(xl) − Wl(xl+1)].  (2.4)
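In code, the standard criterion (2.1) and the 'sufficiently less' test (2.4) are one-line comparisons. The sketch below (ours; names and the sample value θ = 0.25 are illustrative) shows both:

```python
# Sketch of the acceptance tests, with theta drawn from (0, 1/2) as in (2.2).

def accept_standard(W_new, W_k, Wk_lin_new, theta=0.25):
    # (2.1): W(x_{k+1}) <= W(x_k) - theta * [W(x_k) - W_k(x_{k+1})]
    return W_new <= W_k - theta * (W_k - Wk_lin_new)

def suff_less(W_new, W_l, Wl_lin_new, theta=0.25):
    # (2.4): W(x_{k+1}) <= W(x_l) - theta * [W(x_l) - W_l(x_{l+1})]
    return W_new <= W_l - theta * (W_l - Wl_lin_new)

# Because the ratio (2.3) tends to one as alpha -> 0, halving alpha must
# eventually satisfy (2.1).  Numerically, with W(x_k) = 1.0 and linearized
# prediction W_k(x_{k+1}) = 0.8, the acceptance threshold is
# 1.0 - 0.25*(1.0 - 0.8) = 0.95:
print(accept_standard(0.90, 1.0, 0.8))   # True
print(accept_standard(0.99, 1.0, 0.8))   # False
```

The two tests have the same form; (2.4) simply measures progress against the best point xl rather than the current point xk.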
We do not specify the relaxed criterion yet, because we will consider two different cases. In the remainder of this section it is proved that this algorithm has a global convergence property, and the rate of convergence is studied in Sections 3 and 4. The main purpose of the theory is to show that the good properties of the underlying algorithm are not impaired by the watchdog technique.

The consideration of global convergence depends on the following statement, which is a corollary of the method of proof of Theorem 3.4 in [4]. If {xl(j); j = 0, 1, 2, ...} is any sequence that converges to a point that is not a Kuhn-Tucker point of the constrained optimization problem, where the numbers {W(xl(j)); j = 0, 1, 2, ...} need not decrease monotonically, and if for each j the vector xl(j)+1 = xl(j) + αl(j)dl(j) is calculated using the standard criterion, then the differences {W(xl(j)) − Wl(j)(xl(j)+1); j = 0, 1, 2, ...} are bounded below by a positive constant.

Theorem 1. If the given algorithm calculates the infinite sequence {xk; k = 0, 1, 2, ...}, if the sequence is bounded, and if the condition stated immediately before the theorem is satisfied, then a subsequence of {xk; k = 0, 1, 2, ...} converges to a Kuhn-Tucker point of the constrained optimization calculation.

Proof. We show first that inequality (2.4) is obtained an infinite number of times. The watchdog technique sets l = k + 1 at least once every (t + 1) iterations. On such an iteration either inequality (2.4) is satisfied, or this inequality is satisfied on the following iteration, due to the use of the standard criterion and condition (2.1). Thus inequality (2.4) holds at least once every (t + 2) iterations. For j = 1, 2, 3, ... we let i(j) be the value of l on the jth occasion of satisfying inequality (2.4), and we let {l(j); j = 1, 2, 3, ...} be a subsequence of {i(j); j = 1, 2, 3, ...}, such that the points {xl(j); j = 1, 2, 3, ...} converge to a limit.
We consider the differences {W(xl(j)) − Wl(j)(xl(j)+1); j = 1, 2, 3, ...}. If xl(j)+1 is calculated by an iteration that uses the relaxed criterion, then, because the function {Wl(j)(xl(j) + αdl(j)); 0 ≤ α ≤ 1} is a monotonically decreasing function of α, the difference [W(xl(j)) − Wl(j)(xl(j)+1)] is no less than the value that would have been obtained if the iteration had used the standard criterion. It follows from the
statement that is given immediately before the theorem, that, if {xl(j); j = 1, 2, 3, ...} does not converge to a Kuhn-Tucker point, then the numbers {W(xl(j)) − Wl(j)(xl(j)+1); j = 1, 2, 3, ...} are bounded below by a positive constant. Hence the definition of l and inequality (2.4) imply that the numbers {W(xk); k = 0, 1, 2, ...} are not bounded below. This is a contradiction, however, because W is a continuous function and the sequence {xk; k = 0, 1, 2, ...} is bounded. Therefore the theorem is true.

The theorem remains valid if inequality (2.4) is replaced by the condition

W(xk+1) ≤ W(xl) − θ[W(xk) − Wk(xk+1)],  (2.5)

but this change is not recommended because it would prevent the Q-superlinear convergence property that is proved in the next section.
3. Superlinear convergence

The following example shows that condition (1.3) can prevent some algorithms for constrained minimization calculations from converging superlinearly. A similar example is given by Maratos [5]. There are two variables, the objective function is the expression

F(x) = −x1 + τ(x1² + x2² − 1),  (3.1)

where τ is a parameter, and the only constraint is the equation

x1² + x2² − 1 = 0.  (3.2)
We let the matrix Bk of expression (1.4) be the unit matrix, which is the second derivative matrix with respect to x of the Lagrangian function at the solution x* = (1, 0)ᵀ of the constrained optimization problem. Therefore, if xk is the point

xk = (cos φ, sin φ)ᵀ,  (3.3)

where φ is a parameter, then the method of Section 1 gives the search direction

dk = (sin²φ, −sin φ cos φ)ᵀ.  (3.4)
Hence the Euclidean distances from xk and from (xk + dk) to the solution x* are defined by the equations

||xk − x*||² = 2(1 − cos φ),
||xk + dk − x*||² = (1 − cos φ)².  (3.5)
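These distances are easy to confirm numerically. The short check below (ours) evaluates both sides of (3.5) at a sample angle φ:

```python
import math

# Numerical check of the distances (3.5) at a sample angle phi.
phi = 0.3
xk = (math.cos(phi), math.sin(phi))
dk = (math.sin(phi) ** 2, -math.sin(phi) * math.cos(phi))   # (3.4)
xstar = (1.0, 0.0)

dist2 = lambda a, b: (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2
before = dist2(xk, xstar)                              # ||x_k - x*||^2
after = dist2((xk[0] + dk[0], xk[1] + dk[1]), xstar)   # ||x_k + d_k - x*||^2
print(abs(before - 2 * (1 - math.cos(phi))) < 1e-12)   # True
print(abs(after - (1 - math.cos(phi)) ** 2) < 1e-12)   # True
```

Since ||xk + dk − x*|| / ||xk − x*|| = √((1 − cos φ)/2), the ratio of successive distances tends to zero as xk approaches x*, which is the superlinear behaviour claimed below.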
It follows that the rate of convergence of the iteration is superlinear if xk+1 is made equal to (xk + dk). Therefore we ask whether a step-length of one is allowed by condition (1.3). We find the values

W(xk) = −cos φ,
W(xk + αdk) = −cos φ − α sin²φ + (τ + μ)α² sin²φ,  (3.6)
where μ is the multiplier of the modulus of the constraint function in W. Hence condition (1.3) allows α = 1 only if (τ + μ) is less than one. When (τ + μ) exceeds one, then the step-length must be less than 1/(τ + μ). Thus the line search objective function can prevent a Q-superlinear rate of convergence.

The example shows that the rate of convergence is liable to be only linear if the distance from xk to the feasible region is much smaller than the distance from xk to the nearest Kuhn-Tucker point. It is therefore important to note that, if first derivatives of constraint functions are calculated exactly, if the second derivative matrix Bk is estimated, if xk is close to a Kuhn-Tucker point, and if xk+1 has the value (xk + dk), then, due to the conditions (1.5), any infeasibilities at xk+1 are of order ||dk||², but the distance from xk+1 to the Kuhn-Tucker point is sometimes of order ||dk|| ||Bk − ∇²L*||, where (Bk − ∇²L*) is the difference between Bk and the second derivative matrix of the Lagrangian function. Because ||Bk − ∇²L*|| is usually much larger than ||dk||, it follows that it may be difficult to achieve a superlinear rate of convergence when xk+1 is calculated from (xk + dk). Maratos [5] suggests a modification to the usual line search procedure, which regains superlinear convergence by making first order corrections for constraint violations. It is shown below, however, that there is no need for this refinement if the watchdog technique is employed.

We study the rate of convergence of the sequence {xk; k = 0, 1, 2, ...} when the watchdog technique is used, when the sequence converges to a Kuhn-Tucker point x* of the constrained optimization problem, when the method of calculating dk for each k satisfies the condition

lim (k → ∞) ||xk + dk − x*|| / ||xk − x*|| = 0,  (3.7)
and when the matrices {B_k; k = 0, 1, 2, ...} are bounded. We assume that the constraint gradients {∇c_i(x*); i ∈ I} are linearly independent, where I is the set
I = {i: c_i(x*) = 0, 1 ≤ i ≤ m}.
(3.8)
Therefore there exist Lagrange parameters {λ_i^*; i ∈ I} such that the function
L*(x) = F(x) − Σ_{i∈I} λ_i^* c_i(x)
(3.9)
is stationary at x*. We assume the strict complementarity condition, which means that λ_i^* is positive if i is the index of an active inequality constraint. We
R.M. Chamberlain et al./ The watchdog technique
assume also that the parameters {μ_i; i ∈ I} of the objective function (1.2) satisfy the strict bounds
μ_i > |λ_i^*|,  i ∈ I.
(3.10)
The parameters of the inactive inequality constraints are unimportant when k is sufficiently large. We let c(x) be the vector whose components are {c_i(x); i ∈ I}. Finally we assume the second order sufficiency condition that, if d is any direction that is orthogonal to the gradients {∇c_i(x*); i ∈ I}, then the inequality
d^T ∇^2 L*(x*) d ≥ ν_1 ‖d‖^2
(3.11)
holds, where ν_1 is a positive constant. The remainder of this section establishes a Q-superlinear convergence property of the watchdog technique under the conditions that have just been stated. Three lemmas are proved before the main theorem. They include some results that are required in Section 4.
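Before the lemmas, the line search objective function (1.2) and the two decrease tests that the analysis refers to can be sketched in code. This is an illustrative reconstruction, not the authors' implementation; the problem data used in the test (the objective F, the constraint functions and the weights mu) are assumptions for the example only.

```python
# Illustrative sketch (not the authors' code) of the line search objective
# function W of equation (1.2) and of the two decrease tests analysed below:
# the standard criterion (2.1) and the relaxed criterion (1.3).

def W(x, F, eq, ineq, mu):
    """l1 penalty (1.2): F(x) + sum mu_i*|c_i(x)| over equality constraints
    plus sum mu_i*|min(0, c_i(x))| over inequality constraints c_i(x) >= 0."""
    val = F(x)
    for i, ci in enumerate(eq):                 # i = 1, ..., m'
        val += mu[i] * abs(ci(x))
    for j, cj in enumerate(ineq):               # i = m'+1, ..., m
        val += mu[len(eq) + j] * abs(min(0.0, cj(x)))
    return val

def standard_criterion(W_new, W_old, W_pred, theta):
    """Condition (2.1): sufficient decrease, measured against the
    predicted value W_k(x_{k+1}) of the approximation W_k."""
    return W_new <= W_old - theta * (W_old - W_pred)

def relaxed_criterion(W_new, W_old):
    """Condition (1.3): any strict decrease of W."""
    return W_new < W_old
```

For instance, with F(x) = x^2 and the single equality constraint c(x) = x − 1 with μ_1 = 10, one has W(1) = 1 and W(0) = 10, so a step from 0 to 1 passes both tests.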
Lemma 1. There exist positive constants ν_2, ν_3 and ν_4 such that the inequalities
W(x) ≥ L*(x)
(3.12)
and
W(x) ≥ F(x*) + ν_2 ‖c(x)‖ + ν_3 ‖x − x*‖^2
(3.13)
are satisfied for all points x in the neighbourhood
‖x − x*‖ ≤ ν_4.
(3.14)
Proof. It is shown by Fletcher [3], for instance, that the second order sufficiency condition implies the existence of a positive constant r such that the function
L^+(x) = L*(x) + r ‖c(x)‖^2
(3.15)
has a positive definite second derivative matrix at x*. Therefore we may choose the constants ν_3 and ν_4 so that the condition
L^+(x) ≥ F(x*) + ν_3 ‖x − x*‖^2
(3.16)
is obtained in the neighbourhood (3.14). The definitions (1.2), (3.9) and (3.15) imply the identity
W(x) − L^+(x) = Σ_{i∈I} {λ_i^* c_i(x) − r [c_i(x)]^2} + Σ_{i=1}^{m'} μ_i |c_i(x)| + Σ_{i=m'+1}^{m} μ_i |min[0, c_i(x)]|.
(3.17)
We consider the contribution to the right-hand side of this equation from each of the m constraints. It is clear that the contribution is non-negative if the index i is not in I. When i is the index of an equality constraint the contribution is bounded below by the expression
[μ_i − |λ_i^*| − r |c_i(x)|] |c_i(x)|,
(3.18)
and when i is the index of an active inequality constraint the contribution has the value
[μ_i − |λ_i^*| − r |c_i(x)|] |c_i(x)|,  c_i(x) ≤ 0,
[|λ_i^*| − r |c_i(x)|] |c_i(x)|,  c_i(x) ≥ 0.
(3.19)
Because each of the functions {c_i; i ∈ I} is continuous and has a zero at x*, it follows from the bound (3.10) and from the strict complementarity condition that expressions (3.18) and (3.19) are bounded below by positive multiples of |c_i(x)| when x is sufficiently close to x*. We reduce ν_4 if necessary so that the points in the neighbourhood (3.14) are sufficiently close, and we let ν_2 be the least of the positive multipliers, which gives the relation
W(x) − L^+(x) ≥ ν_2 ‖c(x)‖.
(3.20)
Inequality (3.12) is a consequence of expressions (3.15) and (3.20), and inequality (3.13) is implied by expressions (3.16) and (3.20).
Lemma 2. There exist an integer K_1 and a positive constant ν_5 such that the inequality
W(x_k) − F(x*) − θ[W(x_k) − W_k(x_k + d_k)] ≥ ν_5 [W(x_k) − F(x*)]
(3.21)
holds for all k ≥ K_1.
Proof. Condition (1.5) and the definitions of W_k and L* give the equation
W_k(x_k + d_k) = F(x_k) + d_k^T ∇F(x_k) = L*(x_k) + d_k^T ∇L*(x_k) + Σ_{i∈I} λ_i^* [c_i(x_k) + d_k^T ∇c_i(x_k)].
(3.22)
The strict complementarity condition and the boundedness of B_k allow K_1 to be chosen so that, for all k ≥ K_1, the terms {c_i(x_k) + d_k^T ∇c_i(x_k); i ∈ I} are zero. Moreover the limit (3.7) and the fact that ‖∇L*(x_k)‖ is of the same magnitude as ‖x_k − x*‖ imply the relation
d_k^T ∇L*(x_k) = (x* − x_k)^T ∇L*(x_k) + o(‖x_k − x*‖^2),
(3.23)
where the notation o(‖x_k − x*‖^2) indicates a term whose ratio to ‖x_k − x*‖^2 tends to zero as k → ∞. Therefore, for k ≥ K_1, expression (3.22) has the value
W_k(x_k + d_k) = L*(x_k) + (x* − x_k)^T ∇L*(x_k) + o(‖x_k − x*‖^2).
(3.24)
In order to remove the first derivative term we note that, because L* has a continuous second derivative, the equation
L*(x*) − L*(x_k) = ½ (x* − x_k)^T [∇L*(x*) + ∇L*(x_k)] + o(‖x_k − x*‖^2)
(3.25)
is satisfied. Therefore, remembering that L*(x*) = F(x*) and that ∇L*(x*) = 0, we write the relation (3.24) in the form
W_k(x_k + d_k) = 2F(x*) − L*(x_k) + o(‖x_k − x*‖^2).
(3.26)
It follows that the left-hand side of inequality (3.21) has the value
(1 − 2θ)[W(x_k) − F(x*)] + θ[W(x_k) − L*(x_k)] + o(‖x_k − x*‖^2).
(3.27)
We increase K_1 if necessary so that the points {x_k; k ≥ K_1} are all in the neighbourhood (3.14). It follows from conditions (2.2) and (3.13) that the first term of expression (3.27) is bounded below by ν_3 (1 − 2θ) ‖x_k − x*‖^2. Therefore it dominates the last term of the expression when k is large. We make a further increase of K_1 if necessary so that, for all k ≥ K_1, the sum of these two terms is at least ½(1 − 2θ)[W(x_k) − F(x*)]. Because of inequality (3.12), it follows that the lemma is true, and that we may let ν_5 have the value ½(1 − 2θ).

Lemma 3. There exist constants ν_6 and ν_7 such that, for all k ≥ K_1, the bounds
‖c(x_k + d_k)‖ ≤ ν_6 ‖x_k − x*‖^2
(3.28)
and
W(x_k + d_k) ≤ F(x*) + ν_7 ‖x_k − x*‖^2
(3.29)
are satisfied, where K_1 is defined in the statement of Lemma 2.
Proof. It is noted in the proof of Lemma 2 that, for k ≥ K_1, the terms {c_i(x_k) + d_k^T ∇c_i(x_k); i ∈ I} are all zero. Therefore the existence of second derivatives implies that ‖c(x_k + d_k)‖ is of order of magnitude ‖d_k‖^2. Because expression (3.7) shows that ‖x_k − x*‖ is of the same magnitude as ‖d_k‖, it follows that inequality (3.28) is valid. Similarly the difference [W(x_k + d_k) − W_k(x_k + d_k)] is of order of magnitude ‖x_k − x*‖^2. Therefore equation (3.26) implies the relation
W(x_k + d_k) = F(x*) + [F(x*) − L*(x_k)] + O(‖x_k − x*‖^2).
(3.30)
The condition (3.29) is now a consequence of the equations F(x*) = L*(x*) and ∇L*(x*) = 0, which completes the proof of the lemma.

Theorem 2. If the relaxed criterion of the watchdog technique is to accept the first trial step-length α_k = 1, and if the conditions that are stated before Lemma 1
are satisfied, then there exists an integer K_2 such that the watchdog technique sets x_{k+1} = x_k + d_k for all k ≥ K_2.
Proof. One of the conditions that we impose on the integer K_2 is that, for k = K_2, the relaxed criterion is used when the step-length α_k is calculated, which is not a severe restriction because we found in the proof of Theorem 1 that the relaxed criterion is used at least once every (t + 2) iterations. We prove the theorem by showing by induction that the equations
x_{k+1} = x_k + d_k
(3.31)
and
l = k or l = k − 1
(3.32)
are satisfied at Step 3 of the watchdog technique, where k is any integer that is not less than K_2. We suppose first that the relaxed criterion was used in Step 2, which is the case when k = K_2. In this case l = k and equations (3.31) and (3.32) hold for the current value of k. If Step 3 chooses the relaxed criterion for the next iteration, then the inductive hypothesis is maintained. Otherwise, because x_{k+1} = x_k + d_k and l = k, the failure of condition (2.4) can be expressed in the form
W(x_{k+1}) > W(x_k) − θ[W(x_k) − W_k(x_{k+1})].
(3.33)
The next Step 4 may increase l from k to k + 1. Even without this increase, the value of k at Step 5 is less than (l + t), which is important because a change here to x_{k+1} would disprove the theorem. On the next iteration the standard criterion is used, and we have to show that condition (2.1) allows a step-length of one. We increase K_2 if necessary so that it is no less than the integer K_1 of Lemma 2, and so that the limit (3.7) gives the bound
‖x_j + d_j − x*‖ / ‖x_j − x*‖ ≤ (ν_3 ν_5^2 / ν_7)^{1/2},  j ≥ K_2,
(3.34)
where the constants ν_3, ν_5 and ν_7 are defined in the lemmas. It follows from expression (3.29), from condition (3.34), from inequality (3.13), from Lemma 2, from the relation (3.33), and from Lemma 2 again, that on the next iteration the bound
W(x_k + d_k) ≤ F(x*) + ν_7 ‖x_k − x*‖^2
  ≤ F(x*) + ν_3 ν_5^2 ‖x_{k−1} − x*‖^2
  ≤ F(x*) + ν_5^2 [W(x_{k−1}) − F(x*)]
  ≤ F(x*) + ν_5 {W(x_{k−1}) − F(x*) − θ[W(x_{k−1}) − W_{k−1}(x_k)]}
  < F(x*) + ν_5 [W(x_k) − F(x*)]
  ≤ W(x_k) − θ[W(x_k) − W_k(x_k + d_k)]
(3.35)
is obtained. Therefore x_{k+1} is given the value (x_k + d_k) by the standard line search criterion, and the inductive conditions (3.31) and (3.32) hold when the watchdog technique next reaches Step 3. At this stage the value of l is either k or k − 1. If l = k, then inequality (2.4) is identical to expression (3.35). Therefore the relaxed criterion is used on the next iteration, which maintains the inductive hypothesis. To complete the proof of the theorem we show that Step 3 also gives the relaxed criterion when l = k − 1. We make a further increase to K_2 if necessary in order that the bound
‖x_j + d_j − x*‖ / ‖x_j − x*‖ ≤ (ν_3 ν_5 / ν_7)^{1/4},  j ≥ K_2,
(3.36)
is obtained from the limit (3.7). Because x_{k+1} = x_k + d_k and because x_k = x_{k−1} + d_{k−1}, we deduce from expressions (3.29), (3.36), (3.13) and (3.21) that the inequality
W(x_{k+1}) ≤ F(x*) + ν_7 ‖x_k − x*‖^2
  ≤ F(x*) + ν_3 ν_5 ‖x_l − x*‖^2
  ≤ F(x*) + ν_5 [W(x_l) − F(x*)]
  ≤ W(x_l) − θ[W(x_l) − W_l(x_{l+1})]
(3.37)
holds, which is the same as condition (2.4). The proof of the theorem is complete.
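The acceptance logic analysed in Theorem 2 can be summarised schematically. The precise Steps 1-7 of the watchdog technique are specified in an earlier section of the paper, so the function below is a hedged sketch of the decision made at a single trial point, not the complete algorithm.

```python
# Schematic sketch of the acceptance decision of Theorem 2 (an illustrative
# reconstruction, not the authors' code): under the relaxed criterion the
# first trial step-length is accepted unconditionally, while the standard
# criterion enforces a sufficient-decrease condition of the form (2.1)/(2.4)
# relative to the best reference point x_l.

def step_accepted(W_trial, W_ref, W_pred_ref, theta, relaxed):
    """W_trial    : W at the trial point
    W_ref        : W(x_l) at the reference (best) point
    W_pred_ref   : predicted value W_l(x_{l+1}) used in the decrease test
    relaxed      : True -> the watchdog lets the full step through."""
    if relaxed:
        return True
    return W_trial <= W_ref - theta * (W_ref - W_pred_ref)
```

The point of the theorem is that, close to x*, even the steps that the relaxed branch lets through despite a temporary increase of W are eventually confirmed by the standard branch, so the full step x_k + d_k is always taken.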
4. An extension that depends on the Lagrangian function

Instead of always using a step-length of one in the relaxed criterion, it may be more efficient in some applications to include a restriction on x_{k+1}. Because of the second order sufficiency condition (3.11), and because the Lagrangian function is invariant under constraint scaling and additions of constraint functions to F(x), it is suitable to require x_{k+1} to satisfy the condition
L_k(x_{k+1}) ≤ L_k(x_k),
(4.1)
if the inequality
W(x_{k+1}) ≤ W(x_k)
(4.2)
does not hold, where the function
L_k(x) = F(x) − Σ_{i=1}^{m} λ_i c_i(x)
(4.3)
is an approximation to L*(x). The parameters {λ_i; i = 1, 2, ..., m} of expression (4.3) have to be estimated, and the index k on L_k is present to show that these estimates may be changed on
each iteration. It is assumed in the analysis of this section that, as k → ∞, the parameters {λ_i; i ∈ I} converge to {λ_i^*; i ∈ I}. It is assumed also that there exists an integer K_3 such that, if i is the index of an inactive inequality constraint, then λ_i is zero for all k ≥ K_3. It is not difficult to achieve these conditions in practice when the sequence {x_k; k = 0, 1, 2, ...} converges to a Kuhn-Tucker point at which the strict complementarity condition is satisfied. In this section we let the relaxed criterion be that the step-length α_k is accepted if one or both of the inequalities (4.1) and (4.2) are obtained, where x_{k+1} is the vector (x_k + α_k d_k). It is proved below that, although the new relaxed criterion sometimes gives a step-length that is less than one, the Q-superlinear convergence property of Theorem 2 is preserved.
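The modified relaxed criterion can be sketched directly from conditions (4.1)-(4.3). The functions F and c_i and the multiplier estimates λ_i are supplied by the caller; the data in the usage below are illustrative assumptions, not taken from the paper.

```python
# Sketch of the Section 4 relaxed criterion: a step-length is accepted if
# either the approximate Lagrangian (4.1) or the penalty function (4.2) does
# not increase.

def Lk(x, F, cons, lam):
    """Approximate Lagrangian (4.3): L_k(x) = F(x) - sum_i lambda_i c_i(x)."""
    return F(x) - sum(li * ci(x) for li, ci in zip(lam, cons))

def relaxed_accepts(x_new, x_old, F, cons, lam, W):
    """Accept alpha_k if (4.1) or (4.2) holds."""
    return (Lk(x_new, F, cons, lam) <= Lk(x_old, F, cons, lam)
            or W(x_new) <= W(x_old))
```

For example, with F(x) = x^2, one constraint c(x) = x − 1 and λ_1 = 2, a step from x = 0 to x = 1 is accepted because L_k decreases from 2 to 1, even before W is inspected.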
Theorem 3. If the conditions of Theorem 2 hold, except that the relaxed criterion is altered in the way that has just been described, and if the parameters {λ_i; i = 1, 2, ..., m} satisfy the conditions that are stated in the second paragraph of this section, then there exists an integer K_4 such that the watchdog technique sets x_{k+1} = x_k + d_k for all k ≥ K_4.
Proof. We show that there exists an integer K_5 such that the new line search procedure would use a step-length of one for all k ≥ K_5. We let K_5 be any integer that is not less than K_1 and K_3, and that gives the conditions
L*(x_k + d_k) ≤ F(x*) + ¼ ν_3 ‖x_k − x*‖^2,
(4.4)
‖λ − λ*‖ ≤ ¼ min[ν_3/ν_6, ν_2 ν_3/ν_7]
(4.5)
and
‖x_k − x*‖^2 ≤ ¼ (ν_2^2 ν_3)/(r ν_7^2)
(4.6)
for all k ≥ K_5, where ν_2, ν_3, ν_6 and ν_7 are defined in Lemmas 1 and 3, and where r occurs in (3.15). Inequality (4.4) can be satisfied because the difference between L*(x_k + d_k) and F(x*) is of order ‖x_k + d_k − x*‖^2, and because the limit (3.7) implies that ‖x_k + d_k − x*‖^2 can be bounded above by an arbitrarily small multiple of ‖x_k − x*‖^2 for sufficiently large k. If (4.2) does not allow a step-length of one, then (3.29) implies the inequality
W(x_k) ≤ F(x*) + ν_7 ‖x_k − x*‖^2.
(4.7)
It follows from (3.13) that the bound
‖c(x_k)‖ ≤ (ν_7/ν_2) ‖x_k − x*‖^2
(4.8)
is obtained. In this case by applying the expressions (4.3), (4.4), (4.5), (3.28), (3.16), (3.15), (4.3), (4.8), (4.6), (4.5) and (4.8) in sequence, we deduce the
inequality
L_k(x_k + d_k) ≤ L*(x_k + d_k) + ‖λ − λ*‖ ‖c(x_k + d_k)‖
  ≤ F(x*) + ½ ν_3 ‖x_k − x*‖^2
  ≤ L^+(x_k) − ½ ν_3 ‖x_k − x*‖^2
  = L*(x_k) + r ‖c(x_k)‖^2 − ½ ν_3 ‖x_k − x*‖^2
  ≤ L_k(x_k) + ‖λ − λ*‖ ‖c(x_k)‖ + r ‖c(x_k)‖^2 − ½ ν_3 ‖x_k − x*‖^2
  ≤ L_k(x_k).
(4.9)
Therefore Step 7 gives a step-length of one if k ≥ K_5. It follows that, if in the proof of Theorem 2 we impose the additional restriction K_2 ≥ K_5 on the integer K_2, then the proof of Theorem 2 is valid for the new line search procedure of Step 7 of the watchdog technique. Therefore Theorem 3 is correct.
5. Discussion
Sections 2 and 3 give reasons for including the watchdog technique if the line search objective function (1.2) is used to help the convergence of an algorithm for constrained optimization. The technique has been applied to several numerical examples, and it seems to be highly valuable, particularly when the parameters {μ_i; i = 1, 2, ..., m} of expression (1.2) are large. Therefore it is unnecessary to change the parameters on every iteration in the way that is proposed by Powell [6], which is an important advance because Chamberlain [1] shows that Powell's algorithm can cycle instead of converging. Some of the difficulties that are reported by Chamberlain, however, are due to the method of estimating Lagrange parameters that is suggested at the end of Section 1. Therefore it may be better to calculate the parameters {λ_i; i = 1, 2, ..., m} of expression (4.3) by a technique that does not depend on the matrix B_k. For example the method that is used by Fletcher [2] may be successful.

A suitable automatic way of choosing the parameters {μ_i; i = 1, 2, ..., m} of expression (1.2) is as follows. Initially μ_i is set to 2|λ_i| for i = 1, 2, ..., m, where {λ_i; i = 1, 2, ..., m} are the Lagrange parameter estimates of the first iteration. If on any later iteration it is found that μ_i is less than 1.5|λ_i|, where {λ_i; i = 1, 2, ..., m} are now the Lagrange parameter estimates of the later iteration, then μ_i is increased to 2|λ_i|. Thus, provided that the Lagrange parameter estimates are bounded, there exists an integer K_6 such that each μ_i is constant for all k ≥ K_6. It follows that our convergence theorems still hold.

The Fortran version of Powell's [6] algorithm that is in the Harwell subroutine library will be extended to include the watchdog technique. The technique may also be useful to some algorithms for unconstrained optimization, in particular when the objective function F(x) has some narrow curving valleys.
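The automatic choice of the parameters μ_i described above is simple enough to state as a short sketch (an illustrative rendering, not the Harwell library code):

```python
# Sketch of the automatic penalty-parameter rule of the Discussion: start
# with mu_i = 2|lambda_i|, and whenever a later multiplier estimate makes
# mu_i < 1.5|lambda_i|, raise mu_i to 2|lambda_i|.  If the estimates are
# bounded, each mu_i changes only finitely often.

def update_mu(mu, lam):
    """mu : current penalty parameters, or None on the first iteration.
    lam  : current Lagrange parameter estimates."""
    if mu is None:
        return [2.0 * abs(li) for li in lam]
    return [2.0 * abs(li) if mi < 1.5 * abs(li) else mi
            for mi, li in zip(mu, lam)]
```

The factor-of-two margin is what guarantees that μ_i eventually stays constant, so the convergence theorems above continue to apply.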
Acknowledgement

The referees' reports suggested valuable improvements to the organisation of the paper. In particular they helped to show that the use of the Lagrangian function, given in Section 4, is not an essential part of the watchdog technique.
Glossary of formulae

W(x) = F(x) + Σ_{i=1}^{m'} μ_i |c_i(x)| + Σ_{i=m'+1}^{m} μ_i |min[0, c_i(x)]|.   (1.2)

W(x_{k+1}) < W(x_k).   (1.3)

c_i(x_k) + d_k^T ∇c_i(x_k) = 0,  i = 1, 2, ..., m',
c_i(x_k) + d_k^T ∇c_i(x_k) ≥ 0,  i = m' + 1, ..., m.   (1.5)

x_{k+1} = x_k + α_k d_k.   (1.7)

W(x_{k+1}) ≤ W(x_k) − θ[W(x_k) − W_k(x_{k+1})].   (2.1)

W(x_{k+1}) ≤ W(x_l) − θ[W(x_l) − W_l(x_{l+1})].   (2.4)

lim_{k→∞} ‖x_k + d_k − x*‖ / ‖x_k − x*‖ = 0.   (3.7)

L*(x) = F(x) − Σ_{i∈I} λ_i^* c_i(x).   (3.9)

μ_i > |λ_i^*|,  i ∈ I.   (3.10)

W(x) ≥ L*(x).   (3.12)

W(x) ≥ F(x*) + ν_2 ‖c(x)‖ + ν_3 ‖x − x*‖^2.   (3.13)

L^+(x) = L*(x) + r ‖c(x)‖^2.   (3.15)

L^+(x) ≥ F(x*) + ν_3 ‖x − x*‖^2.   (3.16)

W(x_k) − F(x*) − θ[W(x_k) − W_k(x_k + d_k)] ≥ ν_5 [W(x_k) − F(x*)].   (3.21)

‖c(x_k + d_k)‖ ≤ ν_6 ‖x_k − x*‖^2.   (3.28)

W(x_k + d_k) ≤ F(x*) + ν_7 ‖x_k − x*‖^2.   (3.29)

L_k(x) = F(x) − Σ_{i=1}^{m} λ_i c_i(x).   (4.3)
References

[1] R.M. Chamberlain, "Some examples of cycling in variable metric algorithms for constrained minimization", Mathematical Programming 16 (1979) 378-383.
[2] R. Fletcher, "An exact penalty function for nonlinear programming with inequalities", Mathematical Programming 5 (1973) 129-150.
[3] R. Fletcher, "An ideal penalty function for constrained optimization", Journal of the Institute of Mathematics and its Applications 15 (1975) 319-342.
[4] S.P. Han, "A globally convergent method for nonlinear programming", Journal of Optimization Theory and Applications 22 (1977) 297-309.
[5] N. Maratos, "Exact penalty function algorithms for finite dimensional and control optimization problems", Thesis, University of London (London, 1978).
[6] M.J.D. Powell, "A fast algorithm for nonlinearly constrained optimization calculations", in: G.A. Watson, ed., Numerical analysis, Dundee, 1977, Lecture notes in mathematics 630 (Springer-Verlag, Berlin, 1978) pp. 144-157.
[7] K. Schittkowski, Nonlinear programming codes, Lecture notes in economics and mathematical systems 183 (Springer-Verlag, Berlin, 1980).
Mathematical Programming Study 16 (1982) 18-44
North-Holland Publishing Company

REDUCED QUASI-NEWTON METHODS WITH FEASIBILITY IMPROVEMENT FOR NONLINEARLY CONSTRAINED OPTIMIZATION*

Daniel GABAY
Laboratoire d'Analyse Numérique, Université P. et M. Curie, Paris, and Institut National de Recherche en Informatique et Automatique, Domaine de Voluceau, 78153 Le Chesnay, France
Received 5 February 1980
Revised manuscript received 20 March 1981

We present a globally and superlinearly converging method for equality-constrained optimization requiring the updating of a reduced size matrix approximating a restriction of the Hessian of the Lagrangian. Each iterate is obtained by a search along a simple curve defined by a quasi-Newton direction and a feasibility improving direction; an exact penalty function is used to determine the stepsize. The method can be viewed as an efficient approximation to the quasi-Newton method along geodesics of [1], where feasibility was enforced at each step. Its relation with multiplier methods and recursive quadratic programming methods is also investigated.

Key words: Algorithms, Rate of Convergence, Constrained Optimization, Quasi-Newton Methods, Multiplier Methods, Stepsize Selection Procedures.
1. Introduction

In a previous paper [1] we have studied a class of descent methods for the solution of the problem
Min{f(x) | x ∈ R^n, c_i(x) = 0 for i = 1, ..., m}
(1.1)
where the successive iterates x^k remain on the constraint manifold C = c^{-1}(0). This class of methods includes most of the primal methods available today (e.g. gradient projection, reduced gradient methods). In particular, the framework adopted allows one to define quasi-Newton methods for the solution of (1.1) which require only the updating of an (n − m)-dimensional positive definite symmetric matrix approximating the Hessian of f on the manifold C at the minimum x*; such a method is shown to be superlinearly convergent. It requires, however, the search for an approximate local minimum of f along geodesic curves of the manifold C, endowed with a Riemannian metric. It is conceptually possible to achieve this scheme by two-phased iterations: a tangent step in the tangent space to C followed by a restoration phase to enforce the constraints. The resulting algorithm may be viewed as a sequence of solutions of m nonlinear equations, themselves solved iteratively (by a modified Newton's method); for large problems, like the optimal control of a nonlinear discrete time

*Presented at the 10th International Symposium on Mathematical Programming, Montreal, August 1979.
D. Gabay/ Reduced quasi-Newton methods
(implicit) dynamic system over a large number of periods, this second phase may require a much larger computational effort than the first phase. In a practical implementation the restoration phase can only be performed approximately; the convergence theory becomes much more complex and requires the introduction of adaptive parameters for precision (see [2]).

It has been recognized for a long time that the constrained problem (1.1) could be solved by successive unconstrained minimizations of an augmented performance criterion. The penalty approach consists of successive minimizations of the functional
p(x, r_k) = f(x) + ½ r_k ‖c(x)‖^2
(1.2)
for a monotonically increasing sequence of positive parameters r_k → +∞. The resulting problems become increasingly ill-conditioned, which severely impairs the efficient minimization of (1.2). For this reason, the multiplier method is preferred. At each iteration the augmented Lagrangian functional
q(x, λ^k, r_k) = f(x) + ⟨λ^k, c(x)⟩ + ½ r_k ‖c(x)‖^2,
(1.3)
obtained by adding a penalty term to the ordinary Lagrangian function
l(x, λ^k) = f(x) + ⟨λ^k, c(x)⟩,
(1.4)
is minimized and the parameters λ^k (and r_k) updated. The resulting algorithm may be viewed as a dual method and, under some regularity assumptions [3], it may be shown to converge to a solution x* without increasing r_k to +∞ (see the excellent survey by Bertsekas [4]).

Primal methods and multiplier methods require practically a sequence of (approximate) solutions of, respectively, a system of m nonlinear equations and an n-dimensional minimization problem. To overcome this inconvenience a class of methods has recently attracted much attention [5, 6, 7, 8, 9]: it can be viewed as a family of methods to solve the system of (n + m) equations arising from the first order optimality conditions for problem (1.1):
∇l(x*, λ*) = [ ∇_x l(x*, λ*) ] = 0.
              [ c(x*)        ]
(1.5)
Notice that the augmented Lagrangian q(x, λ, r) could also be used. It is possible to express both Newton and structured quasi-Newton iterations for the solution of (1.5) as
x^{k+1} = x^k + t_k d^k
(1.6)
(QP) Min{~(d, M k d ) + ( V f ( x k ) , d ) [ d E R " , c ( x ~ ) + [ V c ( x k ) ] T d = O } ;
(1.7)
M_k is the Hessian with respect to x of the Lagrangian function or an approximation based upon a quasi-Newton update formula. The stepsize t_k is introduced to force the global convergence of the method; it can be selected to achieve at each iteration a sufficient decrease of the (non-differentiable) exact penalty function
φ(x, r) = f(x) + r Σ_{i=1}^{m} |c_i(x)|,
(1.8)
as suggested by Han [10]. If, after a finite number of iterations, the stepsize t_k can be chosen equal to 1, then method (1.6) has a superlinear rate of convergence [6, 7]. Notice that in contrast to the superlinearly convergent quasi-Newton method presented in [1], the Han-Powell method requires the updating of an n-dimensional positive definite matrix M_k. Although such quasi-Newton methods involve a sequence of solutions of QP, their advantage over the primal methods and multiplier methods lies in the key property that the QP can be solved efficiently in a finite number of iterations. In this paper we propose a quasi-Newton method which offers the nice features of both the Han-Powell method and the quasi-Newton method of [1]. It consists of a sequence of quadratic programs (1.7) where the matrix M_k is of rank (n − m) and can be expressed and updated in terms of an (n − m)-dimensional positive definite symmetric matrix H_k and its update H_{k+1}, defined by the generalized Broyden-Fletcher-Goldfarb-Shanno formula of [1]
$k(yk)T,~
(I
yk(sk)T,~ -
sk(sk) T
(y ,s
(1.9)
where yk = gk-t _ gk
S k = - tkHkg k,
gk and gk+l being
the reduced gradients defined now at the non-feasible points x^k and x^{k+1}. This new method can thus be interpreted as a generalization of the quasi-Newton method of [1] to non-feasible points and with partial restoration (one step of the modified Newton's method for the solution of the constraint equations). Geometrically each iteration can be viewed as a combination of a step in the tangent space to the manifold C_k = c^{-1}(c(x^k)), 'parallel' to C = c^{-1}(0), and a partial restoration step to improve (and no longer enforce) feasibility. This method presents some similarities with the combined gradient-restoration algorithm experienced by Miele et al. [11] and with the quasi-Newton feasibility improving GRG method mentioned by Tanabe [12].

In Section 2 we state regularity assumptions on problem (1.1) and review the optimality conditions in terms of the ordinary and augmented Lagrangian functionals. In Section 3 we present a family of methods requiring only that the restriction of M_k to the tangent space to C_k = c^{-1}(c(x^k)) be positive definite; we show the equivalence of such structured quasi-Newton methods with the QP
quasi-Newton methods and a special class of diagonalized multiplier methods. The rate of convergence of algorithm (1.6) (with t_k ≡ 1) is investigated in Section 4; the two-step superlinear convergence result of Powell [7] is shown to hold for the reduced quasi-Newton method defined by a sequence of matrices M_k of rank (n − m). Superlinear convergence is also established for a simply modified algorithm where a feasibility improving step is added at each iteration. Finally in Section 5 we introduce a stepsize parameter to define the new iterate on a parabola defined by the quasi-Newton and the feasibility improving directions; the resulting algorithm is shown to be in general globally and superlinearly convergent.
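The update formula (1.9) can be sketched concretely. The code below is an illustrative rendering for plain Python lists (the 2-dimensional data in the test are assumptions for the example): it is the familiar BFGS form, so the updated matrix satisfies the secant condition H_{k+1} y^k = s^k.

```python
# Sketch of the update (1.9):
#   H_{k+1} = (I - s y^T/<y,s>) H_k (I - y s^T/<y,s>) + s s^T/<y,s>.
# Written for small dense matrices stored as lists of rows.

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def bfgs_update(H, s, y):
    rho = 1.0 / sum(si * yi for si, yi in zip(s, y))      # 1 / <y, s>
    n = len(s)
    I = [[float(i == j) for j in range(n)] for i in range(n)]
    A = [[I[i][j] - rho * s[i] * y[j] for j in range(n)] for i in range(n)]
    B = [[I[i][j] - rho * y[i] * s[j] for j in range(n)] for i in range(n)]
    HB = matmul(matmul(A, H), B)                          # (I - rho s y^T) H (I - rho y s^T)
    return [[HB[i][j] + rho * s[i] * s[j] for j in range(n)] for i in range(n)]
```

Because (I − ρ y s^T) y = 0 whenever ρ = 1/⟨y, s⟩, the product term annihilates y and the rank-one correction returns exactly s, which is the secant property used in the convergence analysis.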
2. Assumptions and review of optimality conditions

We consider the nonlinear programming problem
Min{f(x) | x ∈ R^n, c(x) = 0}
(2.1)
where f: R^n → R and c: R^n → R^m (m ≤ n) are C^σ differentiable functions (σ ≥ 2). In addition we assume that the map c is a submersion, i.e. for all x ∈ R^n the Jacobian map c'(x) ∈ L(R^n, R^m) is of full rank m. The regular value theorem (see Milnor [13]) implies that the set C_w = c^{-1}(w) is a differential manifold of class C^σ for all w ∈ R^m; thus any point x ∈ R^n belongs to a differential manifold C_{c(x)} and it is possible to define the tangent space T_x to C_{c(x)} at x:
T_x = {y ∈ R^n | c'(x)(y) = 0}.
(2.2)
Remark 2.1. This assumption on c is stronger than the regularity assumption introduced in [1], namely that 0 is a regular value of c. Under the latter assumption there may exist points x ∈ R^n − c^{-1}(0) where the Jacobian map c'(x) is not surjective; such points are called critical points of c. However, if c is a C^σ differentiable map with σ > n − m, the Morse-Sard theorem (see e.g. [14]) establishes that the image by c of the critical points is a set of measure zero in R^m; hence C_w is a differential manifold for 'almost' every w ∈ R^m.
I(x, A) = f(x) + (A, c(x)) (see e.g. l~uenberger [15] for the proofs of the following classical results).
(2.3)
If x* is a local minimum of (2.1) there exists a vector of Lagrange multipliers λ* ∈ R^m such that (x*, λ*) is a critical point of the Lagrangian functional (2.3). An equivalent formulation of this condition characterizes (x*, λ*) as a solution of the system of (n + m) equations:
∇_x l(x, λ) = ∇f(x) + ∇c(x) λ = 0,
(2.4a)
∇_λ l(x, λ) = c(x) = 0.
(2.4b)
Assuming that the functions f and c are C^σ differentiable with σ ≥ 2, it is possible to define the Hessian with respect to x of the Lagrangian functional
∇^2_{xx} l(x, λ) = ∇^2 f(x) + Σ_{i=1}^{m} λ_i ∇^2 c_i(x)
(2.5)
and derive second-order characterizations of a solution x*. Denote by L(x, λ) the n × n symmetric matrix associated with the Hessian ∇^2_{xx} l(x, λ). Given (x*, λ*) a critical point of l(x, λ), x* is actually an isolated local minimum of (2.1) if the following second-order sufficient optimality condition is satisfied:
⟨v, L(x*, λ*) v⟩ > 0  for all v ∈ T_{x*}, v ≠ 0.
(2.6)
We now recall a result which will be used several times in our analysis.

Proposition 2.1. Assume that c is a submersion and that the restriction to the tangent space T_x of the n × n matrix M_x is positive definite. Then the (n + m) × (n + m) matrix D(x, M_x) defined by
D(x, M_x) = [ M_x  A_x^T ]
            [ A_x   0    ]
(2.7)
where A_x denotes the Jacobian matrix of the map c at x, is non-singular.
where A, denotes the Jacobian matrix of the map c at x, is non-singular. Proof. Let z = (y,/x) E R "+~ such that D(x, Mx)" z = 0. By (2.7) we have T Mxy + Axtx = 0,
(2.8)
Axy
(2.9)
= 0.
(2.9) implies that y ~ Tx and (2.8) yields (y, Mxy) = - (y, A~r/z) = - (A~y,/x)
= 0.
(2.10)
Since M~ is positive definite on Tx, (2.10) implies y = 0 and (2.8) reduces to A ~g = 0, which has for unique solution tx = 0 since Ax is of full rank m.
(2.11)
Corollary 2.2. Assume that (x*, λ*) satisfies the second-order sufficient optimality condition. Then the Hessian matrix of l(x, λ) with respect to x and λ at (x*, λ*) is non-singular.

We conclude this section by a review of optimality conditions in terms of the augmented Lagrangian q: R^n × R^m × R^+ → R defined by
q(x, λ, r) = l(x, λ) + ½ r ‖c(x)‖^2.
(2.12)
If (x*, λ*) is a critical point of the ordinary Lagrangian l(x, λ) it is also a critical point of the augmented Lagrangian (with respect to x and λ) for any r > 0, since
∇_x q(x*, λ*, r) = ∇_x l(x*, λ* + r c(x*)) = ∇_x l(x*, λ*) = 0,
(2.13)
∇_λ q(x*, λ*, r) = ∇_λ l(x*, λ*) = 0.
(2.14)
The Hessian of the augmented Lagrangian is given by
∇^2_{xx} q(x, λ, r) = ∇^2_{xx} l(x, λ + r c(x)) + r ∇c(x)·[∇c(x)]^T;
(2.15)
hence, denoting by Q_r(x, λ) the matrix of the quadratic form ∇^2_{xx} q(x, λ, r),
⟨v, Q_r(x, λ) v⟩ = ⟨v, L(x, λ + r c(x)) v⟩  for all v ∈ T_x,
and the second-order optimality conditions can be equivalently expressed in terms of the augmented Lagrangian functional (2.12). The key benefit of introducing this augmented functional lies in the following now classical result (see e.g. [4]): if (x*, λ*) satisfies the second-order sufficient optimality condition, there exists r* ≥ 0 such that the matrix Q_r(x*, λ*) is positive definite for all r ≥ r*. The continuity of ∇^2 f(x) and ∇^2 c_i(x) for i = 1, ..., m guarantees that the matrix Q_r(x, λ) remains positive definite in a neighborhood B(x*, ε_1) × B(λ*, ε_2) of the critical point (x*, λ*), which can thus be characterized as a (local) saddle-point of the (locally) convex-concave function q(x, λ, r). This is the basis for the interpretation of the multiplier method as a duality scheme (see [4] and the references therein): we can associate to the augmented Lagrangian the dual functional Ψ_r: B(λ*, ε_2) ⊂ R^m → R,
Ψ_r(λ) = Min_{x ∈ B(x*, ε_1)} q(x, λ, r);
(2.16)
q being locally strictly convex in x around x*, the minimum is achieved at the unique point x(λ) and Ψ_r is locally well defined. Notice that
∇_λ Ψ_r(λ) = c[x(λ)],
(2.17)
∇^2_{λλ} Ψ_r(λ) = −A_{x(λ)} [Q_r(x(λ), λ)]^{-1} A_{x(λ)}^T,
(2.18)
where A_{x(λ)} denotes the m × n Jacobian matrix of c at x(λ). We observe
that ∇^2_{λλ} Ψ_r(λ) is negative definite (since A_{x(λ)} has full rank m); hence Ψ_r is locally concave for any r ≥ r* and we can define the dual problem
Max_{λ ∈ B(λ*, ε_2)} Ψ_r(λ).
(2.19)
3. A class of quasi-Newton directions

3.1. Review of the quadratic programming subproblem approach

In this paper we shall study a family of methods for solving the system of (n + m) equations arising from the first order optimality conditions (2.4), where a sequence of approximate solutions (x^k, λ^k) is generated iteratively according to
[ M_k  A_kᵀ ] [ x^{k+1} − x^k ]   [ ∇f(x^k) ]
[ A_k   0   ] [   λ^{k+1}    ] + [  c(x^k) ] = 0 (3.1)
where A_k denotes the Jacobian matrix of the map c at x^k. With the choice M_k = L(x^k, λ^k), (3.1) is simply Newton's method for solving system (2.4) and its convergence analysis is well known. To avoid computations of second-order derivatives, M_k is defined as an approximation to the Hessian L(x^k, λ^k) based upon accumulated first-order information. Iterative procedure (3.1) is then a structured multiplier extension quasi-Newton method, according to the terminology due to Dennis and used by Tapia [16] to emphasize that only the second-order information in the coefficient matrix D(x^k, L(x^k, λ^k)) of system (2.4) has been approximated. Introducing the notation
d^k = x^{k+1} − x^k, (3.2)
method (3.1) yields the system of equations

∇f(x^k) + M_kd^k + A_kᵀλ^{k+1} = 0, (3.3a)
c(x^k) + A_kd^k = 0, (3.3b)
which characterizes (d^k, λ^{k+1}) as a solution¹ of the quadratic programming problem (QP)

Min{(∇f(x^k), d) + ½(d, M_kd) | d ∈ Rⁿ s.t. c(x^k) + A_kd = 0}; (3.4)
Thus method (3.1) can be viewed as consisting of solving a sequence of QP defined successively; since such problems can be efficiently solved in a finite number of iterations, this approach presents a clear advantage over primal
¹ By a solution of the QP we mean the couple consisting of a minimizing element and of the corresponding vector of Lagrange multipliers associated with the constraints.
methods, which require successive iterative solutions of systems of nonlinear equations, as well as over multiplier methods, which require successive minimizations of the augmented Lagrangian. Methods using a QP subproblem to obtain the search direction d^k have recently received a great deal of attention; most of the results actually apply to general problems constrained by nonlinear inequalities. In the particular case of equality constrained problems, the equivalence of (3.1) and (3.4) has already been studied by Tapia [16]. It is however interesting to review the updating schemes proposed to define the successive QP. In the method introduced by Garcia Palomares and Mangasarian [5], an (n + m) × (n + m) matrix is updated at each iteration to approximate the matrix D(x^k, L(x^k, λ^k)) and only the upper left n × n block is used to define M_k. The wasteful character of this procedure (especially if m, the number of constraints, is large) has been recognized by Han [6], who proposed to update directly the n × n matrix M_k by a quasi-Newton update formula to approximate L(x^k, λ^k); to establish the superlinear convergence of the method he, however, assumed that the full Hessian L(x, λ) is everywhere positive definite. Powell [7] explored Han's method further: he proposed to update M_k by a positive definite quasi-Newton update formula to approximate L(x^k, λ^k), even when the Hessian is indefinite; he also showed that, under certain assumptions, this procedure exhibits a two-step superlinear convergence. Our aim in this paper is to further refine the analysis of such methods and to show that it is only necessary to update a reduced-size matrix G_k of dimension (n − m) × (n − m), namely the restriction of M_k to the tangent space T_k, by a positive definite quasi-Newton update formula (as in the quasi-Newton methods along geodesics proposed in [1]); notice that the QP (3.4) then has a unique solution by Proposition 2.1 and d^k is still well defined.
3.2. A change of coordinates

In [1] we have found it convenient to associate to the full rank m × n Jacobian matrix A_x of c at x a linear map of ℒ(Rⁿ, R^{n−m}) defined by an (n − m) × n matrix Z_x which is non-singular on the nullspace N(A_x) of A_x, i.e. such that N(Z_x) ∩ N(A_x) = {0}; the n × n matrix S_x defined by

S_x = [ A_x ]
      [ Z_x ] (3.5)

is non-singular and thus defines a change of coordinates in Rⁿ. We recall the following result established in [1, Proposition 2.1].

Proposition 3.2. Assume that c is a submersion. Let A_x⁻ be a right inverse for
A_x (i.e. A_xA_x⁻ = I_m). Then there exists a matrix Z_x of full rank (n − m) with right inverse Z_x⁻ satisfying

Z_xA_x⁻ = 0,  A_xZ_x⁻ = 0. (3.6)

The matrix S_x defined by (3.5) is non-singular and its inverse is given by

S_x⁻¹ = [A_x⁻, Z_x⁻]. (3.7)

If c is a C^σ map, A_x⁻, Z_x, Z_x⁻ can be chosen as C^{σ−1} differentiable functions of x in Rⁿ.
This change of coordinates allows us to express the unique solution of the QP (3.4) in a form convenient for computations.
Proposition 3.3. Assume that the restriction of the matrix M to the (n − m)-dimensional subspace {y ∈ Rⁿ | Ay = 0} is positive definite. Then

[ M  Aᵀ ]⁻¹   [ Z⁻HZ⁻ᵀ             (I − Z⁻HZ⁻ᵀM)A⁻        ]
[ A   0 ]   = [ A⁻ᵀ(I − MZ⁻HZ⁻ᵀ)   −A⁻ᵀ(I − MZ⁻HZ⁻ᵀ)MA⁻  ] (3.8)

where H = (Z⁻ᵀMZ⁻)⁻¹.

Proof. We want to solve in (y, μ) the system of equations

My + Aᵀμ = g, (3.9a)
Ay = h; (3.9b)
it is non-singular by Proposition 3.1 and thus has a unique solution. Using the change of coordinates defined by S_x⁻¹ we can write

y = A⁻w + Z⁻z (3.10)

where w = Ay ∈ Rᵐ and z = Zy ∈ R^{n−m}. (3.6), (3.9b) and (3.10) imply that w = h; premultiplying (3.9a) by the non-singular matrix (S_x⁻¹)ᵀ yields the system of n equations in z, μ:

μ + A⁻ᵀMZ⁻z + A⁻ᵀMA⁻h = A⁻ᵀg, (3.11a)
Z⁻ᵀMZ⁻z + Z⁻ᵀMA⁻h = Z⁻ᵀg. (3.11b)

The (n − m) × (n − m) matrix Z⁻ᵀMZ⁻ represents the restriction of M to the subspace {y ∈ Rⁿ | Ay = 0} = {Z⁻z | z ∈ R^{n−m}}; by assumption it is positive definite, hence invertible. Let H = (Z⁻ᵀMZ⁻)⁻¹; H is positive definite. Solving (3.11b) for z and getting μ from (3.11a) after substitution, we obtain formula (3.8).
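The partitioned-inverse formula (3.8) lends itself to a direct numerical check. The sketch below (our own data and construction; NumPy assumed) picks an orthonormal nullspace basis for Z⁻, a nonsymmetric M that is positive definite only on the nullspace of A, and verifies that the block formula reproduces the solution of a direct solve of the full system.

```python
import numpy as np

# Check of formula (3.8): the block-wise solution of [M A^T; A 0][y; mu] = [g; h]
# built from A-, Z- and H = (Z-^T M Z-)^(-1) must match a direct linear solve.
rng = np.random.default_rng(0)
n, m = 6, 2
A = rng.standard_normal((m, n))                  # full-rank Jacobian
A_r = A.T @ np.linalg.inv(A @ A.T)               # a right inverse: A @ A_r = I_m
# Orthonormal basis of null(A) gives Z- (n x (n-m)); with Z = Z-^T we get
# Z A- = 0, A Z- = 0, Z Z- = I, as in Proposition 3.2.
_, _, Vt = np.linalg.svd(A)
Z_r = Vt[m:].T
M = rng.standard_normal((n, n)) + 5.0 * Z_r @ Z_r.T  # PD on null(A), nonsymmetric

H = np.linalg.inv(Z_r.T @ M @ Z_r)
g, h = rng.standard_normal(n), rng.standard_normal(m)

# Formula (3.8), row by row:
y = Z_r @ H @ Z_r.T @ g + (np.eye(n) - Z_r @ H @ Z_r.T @ M) @ A_r @ h
mu = A_r.T @ (np.eye(n) - M @ Z_r @ H @ Z_r.T) @ (g - M @ A_r @ h)

# Direct solve of the full KKT system for comparison:
K = np.block([[M, A.T], [A, np.zeros((m, m))]])
sol = np.linalg.solve(K, np.concatenate([g, h]))
assert np.allclose(np.concatenate([y, mu]), sol)
print("formula (3.8) matches the direct solve")
```

Note that, exactly as the proposition states, neither symmetry nor invertibility of M on all of Rⁿ is needed; only the restriction Z⁻ᵀMZ⁻ must be invertible.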
This formula generalizes the well-known formula for the inverse in partitioned form (see e.g. [17]), valid only when M is non-singular. It includes as a special case a formula used by Powell [18] and Tapia [8], valid only when M is non-singular. A more complex formula is given by Tanabe [12] in terms of a generalized inverse A⁼ of A (such that AA⁼A = A) but is of little value in practical computation; assuming that A is of full rank, the generalized inverse is also a right inverse and Tanabe's formula should reduce to (3.8).

3.3. The quasi-Newton direction
It is legitimate at this point to define, as in [1], the reduced gradient of f at the nonfeasible point x as the (n − m)-dimensional vector

g(x) = Z_x⁻ᵀ∇f(x); (3.12)

obviously the reduced gradient depends upon the change of coordinates defined by S_x. The quasi-Newton direction, defined as the solution d^k of the QP (3.4), can be expressed using (3.8) as

d^k = −Z_k⁻H_k(g^k − Z_k⁻ᵀM_kA_k⁻c^k) − A_k⁻c^k (3.13)

where the subscripts and superscripts k indicate that the respective matrices and vectors are evaluated at the current iterate x^k. Formula (3.13) indicates that the quasi-Newton direction d^k is a combination of a direction in the tangent space T_k to the submanifold C_k = c⁻¹(c^k) at x^k and a direction pointing toward the constraint manifold C = c⁻¹(0) (since −A_k⁻c^k can be viewed as the first step of a Newton's method starting from x^k to solve the system of equations c(x) = 0). See Fig. 1.

Fig. 1. The search direction d^k.
Formula (3.13) defines a double family of quasi-Newton directions. On one hand it depends upon the choice of the change of coordinates S_x, i.e. of the choice of the right inverse A_x⁻ of A_x, which then conditions the choices of Z_x and Z_x⁻ according to Proposition 3.2. Two convenient choices have been considered in [1]. The partitioned right inverse consists in partitioning A_x as

A_x = [B, D], (3.14)

(where B is an m × m non-singular matrix) and choosing respectively

A_x⁻ = [ B⁻¹ ],   Z_x = [0, I_{n−m}],   Z_x⁻ = [ −B⁻¹D  ]
       [  0  ]                                 [ I_{n−m} ]. (3.15)
The QR right inverse consists in factorizing A_x as

A_x = [L, 0] [ Q₁ ]
             [ Q₂ ] (3.16)

(where L is an m × m lower triangular matrix and Q₁ and Q₂ are submatrices of respective dimensions m and (n − m) of an orthogonal matrix) and choosing respectively

A_x⁻ = Q₁ᵀL⁻¹,   Z_x⁻ = Q₂ᵀ,   Z_x = Q₂. (3.17)
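In practice the QR right inverse (3.17) is obtained from a QR factorization of A_xᵀ. The following sketch (our own illustration; NumPy assumed) builds A⁻, Z and Z⁻ this way and verifies the identities (3.6) and (3.7) of Proposition 3.2.

```python
import numpy as np

# QR right inverse (3.16)-(3.17): factorizing A^T = Q [L^T; 0] gives
# A = [L, 0][Q1; Q2], hence A- = Q1^T L^(-1), Z- = Q2^T, Z = Q2.
rng = np.random.default_rng(1)
m, n = 2, 5
A = rng.standard_normal((m, n))            # full-rank m x n Jacobian A_x

Qfull, R = np.linalg.qr(A.T, mode="complete")   # A^T = Qfull @ R, R is n x m
Q1 = Qfull[:, :m].T                        # m x n block with orthonormal rows
Q2 = Qfull[:, m:].T                        # (n-m) x n block with orthonormal rows
L = R[:m, :].T                             # m x m lower triangular, A = L @ Q1

A_r = Q1.T @ np.linalg.inv(L)              # right inverse A- = Q1^T L^(-1)
Z = Q2                                     # Z_x = Q2
Z_r = Q2.T                                 # Z- = Q2^T

assert np.allclose(A @ A_r, np.eye(m))                       # A A- = I_m
assert np.allclose(Z @ A_r, 0) and np.allclose(A @ Z_r, 0)   # identities (3.6)
S = np.vstack([A, Z])                      # the change of coordinates (3.5)
assert np.allclose(np.linalg.inv(S), np.hstack([A_r, Z_r]))  # identity (3.7)
print("QR right inverse satisfies (3.6) and (3.7)")
```

Because Q₂ has orthonormal rows, this choice makes Z⁻ = Zᵀ, which is numerically well conditioned; this is one reason the QR variant is usually preferred over the partitioned one when B is nearly singular.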
On the other hand the quasi-Newton direction (3.13) depends upon the scheme adopted to update the matrix M_k. Many update formulae are available since we do not require M_k to be symmetric (in fact the quadratic programming problem (3.4) can be defined for a non-symmetric M_k by substituting ½(M_k + M_kᵀ) for M_k in its formulation, as noticed by Han [10]). However, our analysis requires that M_k remain positive definite on the (changing) subspace T_k. A convenient scheme to achieve this property in a very simple way consists in considering matrices M_k of the form

M_k = Z_kᵀG_kZ_k (3.18)

where G_k is an (n − m) × (n − m) positive definite matrix; in this case the quasi-Newton direction is given by

d^k = −Z_k⁻H_kg^k − A_k⁻c^k, (3.19)

while the corresponding Lagrange multipliers vector of the QP (3.4) is

λ^{k+1} = −A_k⁻ᵀ∇f(x^k). (3.20)
Observe that the solution of (3.4) depends only upon H_k = (Z_k⁻ᵀM_kZ_k⁻)⁻¹ = G_k⁻¹, and it is thus only necessary to update G_k = H_k⁻¹ as a positive definite symmetric matrix of dimension (n − m) approximating the restriction to the tangent space of the Hessian of the Lagrangian. By analogy with the reduced quasi-Newton method of [1] we can think of
updating G_k by the generalized Broyden-Fletcher-Goldfarb-Shanno formula

G_{k+1} = G_k + y^k(y^k)ᵀ/(y^k, s^k) − G_ks^k(s^k)ᵀG_k/(s^k, G_ks^k) (3.21)

where

s^k = x^{k+1} − x^k, (3.22a)
y^k = g^{k+1} − g^k; (3.22b)
(we introduce the notation s^k to allow for the possibility of introducing a stepsize t_k in the iteration x^{k+1} = x^k + t_kd^k). However, there is no guarantee in the new method that

(y^k, s^k) > 0,

which is a necessary and sufficient condition for G_{k+1} to be positive definite if G_k is. We must therefore modify the update formula (3.21) using a device suggested by Powell [19]:

G_{k+1} = G_k + z^k(z^k)ᵀ/(z^k, s^k) − G_ks^k(s^k)ᵀG_k/(s^k, G_ks^k) (3.23)

where

z^k = θ_ky^k + (1 − θ_k)G_ks^k (3.24a)

and θ_k is a scalar between 0 and 1 chosen according to

θ_k = 1                                                   if (y^k, s^k) ≥ σ(s^k, G_ks^k),
θ_k = (1 − σ)(s^k, G_ks^k)/[(s^k, G_ks^k) − (y^k, s^k)]   otherwise, (3.24b)

with σ ∈ (0, 1) (Powell suggests using σ = 0.2). Then (z^k, s^k) > 0; hence (3.23) preserves positive definiteness. Notice that H_k = G_k⁻¹ can then be updated directly according to

H_{k+1} = [I − s^k(z^k)ᵀ/(z^k, s^k)] H_k [I − z^k(s^k)ᵀ/(z^k, s^k)] + s^k(s^k)ᵀ/(z^k, s^k), (3.25)
although we do not recommend the use of this formula in practical computations because of its potential numerical instability.
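Powell's damped update (3.23)-(3.24b) is straightforward to implement. The sketch below (our own code; NumPy assumed) applies it to random pairs (s^k, y^k) with no curvature condition imposed, and checks that positive definiteness is indeed preserved even when (y^k, s^k) ≤ 0.

```python
import numpy as np

def damped_bfgs_update(G, s, y, sigma=0.2):
    """One update (3.23) of the reduced matrix G, with the damping (3.24a)-(3.24b);
    sigma = 0.2 is the value suggested by Powell."""
    Gs = G @ s
    sGs = s @ Gs
    ys = y @ s
    if ys >= sigma * sGs:
        theta = 1.0                                  # plain BFGS step
    else:
        theta = (1.0 - sigma) * sGs / (sGs - ys)     # damping coefficient (3.24b)
    z = theta * y + (1.0 - theta) * Gs               # interpolated vector (3.24a)
    # With this z one always has (z, s) = max(ys, sigma*sGs) > 0.
    return G + np.outer(z, z) / (z @ s) - np.outer(Gs, Gs) / sGs

G = np.eye(3)
rng = np.random.default_rng(2)
for _ in range(30):
    s = rng.standard_normal(3)
    y = rng.standard_normal(3)   # no sign condition: (y, s) is often negative
    G = damped_bfgs_update(G, s, y)
    assert np.linalg.eigvalsh((G + G.T) / 2).min() > 0   # PD is preserved
print("G remains positive definite after 30 damped updates")
```

Without the damping (θ_k forced to 1) the same experiment breaks down as soon as a pair with (y^k, s^k) < 0 is encountered, which is exactly the situation the device of Powell [19] is designed to handle.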
3.4. Relation with multiplier methods

We could also define Newton and quasi-Newton methods for solving the system of (n + m) equations arising from the first-order optimality conditions expressed in terms of the augmented Lagrangian:

∇ₓq(x, λ, r) = ∇ₓl(x, λ) + r∇c(x)c(x) = 0, (3.26a)
∇_λq(x, λ, r) = c(x) = 0. (3.26b)
This scheme leads to the iterative definition of a sequence {(y^k, μ^k)} according to

[ N_k  A_kᵀ ] [ y^{k+1} − y^k ]   [ ∇ₓq(y^k, μ^k, r) ]
[ A_k   0   ] [ μ^{k+1} − μ^k ] + [      c(y^k)      ] = 0 (3.27)

where N_k is an approximation of the Hessian matrix of the augmented Lagrangian given by (2.15); it is thus legitimate to define

N_k = M_k + rA_kᵀA_k (3.28)

where M_k is an approximation of the Hessian matrix of the ordinary Lagrangian. Formula (3.8) can again be used to compute the solution of (3.27). Noticing that (3.6) implies

Z_k⁻ᵀN_kZ_k⁻ = Z_k⁻ᵀM_kZ_k⁻ = H_k⁻¹,
it is easy to show that, starting from (y^k, μ^k) = (x^k, λ^k), method (3.27) generates the same iterate (y^{k+1}, μ^{k+1}) = (x^{k+1}, λ^{k+1}) as method (3.1) and is therefore (theoretically) independent of the choice of the penalty parameter r; this equivalence result between (3.1), (3.4) and (3.27) generalizes a result of Tapia [16].² Suppose that (y^k, μ^k) is in a small enough neighborhood of (x*, λ*) satisfying the second-order sufficient optimality condition, and that r is chosen large enough so that N_k given by (3.28) is positive definite, hence invertible. Notice that the multipliers vector μ^{k+1} is given by

μ^{k+1} = μ^k + (A_kN_k⁻¹A_kᵀ)⁻¹(c^k − A_kN_k⁻¹∇ₓq(y^k, μ^k, r)). (3.29)
The first block of equations of (3.27) can then be solved for y^{k+1}:

y^{k+1} = y^k − N_k⁻¹∇ₓq(y^k, μ^{k+1}, r); (3.30)
(3.30) can be viewed as one step of a quasi-Newton method starting from y^k to solve the unconstrained (locally convex) minimization problem

Min_{y ∈ B(x*, ε₁)} q(y, μ^{k+1}, r). (3.31)
If the minimization phase (3.31) had been performed exactly at the previous iteration, we would have ∇ₓq(y^k, μ^k, r) = 0 and (3.29) would reduce to

μ^{k+1} = μ^k + (A_kN_k⁻¹A_kᵀ)⁻¹c^k. (3.32)
Iteration (3.32) can be viewed as the kth step of a quasi-Newton method to maximize the dual functional ψ_r defined in (2.16), since −A_kN_k⁻¹A_kᵀ is an
² Tapia called (3.27) with the approximation (3.28) the superstructured version of the structured multiplier extension quasi-Newton method and showed its equivalence with the ordinary version (3.1), corresponding to r = 0, and with the QP quasi-Newton method (3.4) when M_k is non-singular.
approximation to ∇²ψ_r(μ^k) = −A_k(Q_r(y^k, μ^k))⁻¹A_kᵀ. Thus method (3.27) can be interpreted as a particularly efficient implementation of a quasi-Newton method for solving the dual problem Max ψ_r(μ), where the minimization phase (3.31) is performed only approximately by one step of a quasi-Newton method. Such a method has been called by Tapia [8] a diagonalized multiplier method; a particular implementation (corresponding to M_k = I) has been experimented with by Miele et al. [20].
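The claimed independence of the penalty parameter r can be observed directly: solving the linear system (3.27) with N_k = M_k + rA_kᵀA_k yields the same step and multiplier for every r. The sketch below (our own toy data; NumPy assumed) verifies this, with r = 0 reproducing iteration (3.1).

```python
import numpy as np

# Equivalence of (3.27) and (3.1): starting from (y^k, mu^k) = (x^k, lambda^k),
# the step d and multiplier mu^{k+1} do not depend on r.
rng = np.random.default_rng(3)
n, m = 5, 2
A = rng.standard_normal((m, n))           # A_k
M = rng.standard_normal((n, n))
M = M @ M.T + n * np.eye(n)               # an SPD M_k, to keep systems solvable
grad_f = rng.standard_normal(n)           # grad f(x^k)
c = rng.standard_normal(m)                # c(x^k)
lam = rng.standard_normal(m)              # lambda^k = mu^k

def step(r):
    """Solve (3.27): N d + A^T dmu = -grad_x q(x^k, lam, r), A d = -c(x^k)."""
    N = M + r * A.T @ A                               # definition (3.28)
    grad_q = grad_f + A.T @ (lam + r * c)             # grad_x q at (x^k, lam)
    K = np.block([[N, A.T], [A, np.zeros((m, m))]])
    sol = np.linalg.solve(K, -np.concatenate([grad_q, c]))
    return sol[:n], lam + sol[n:]                     # (d^k, mu^{k+1})

d0, mu0 = step(0.0)          # r = 0 is exactly iteration (3.1)
for r in (1.0, 10.0, 100.0):
    d, mu = step(r)
    assert np.allclose(d, d0) and np.allclose(mu, mu0)
print("step and multiplier are independent of r, as claimed")
```

In exact arithmetic this is an identity; in floating point, very large r degrades the conditioning of N_k, which is one practical motivation for working with (3.1) rather than (3.27).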
4. Superlinear convergence

The object of this section is to show that the greatly simplifying strategy (3.19), (3.20), requiring only the updating of a reduced (n − m) × (n − m) matrix G_k (approximating the restriction to the tangent space of the Hessian of the Lagrangian), still preserves the attractive superlinear rate of convergence of the quasi-Newton method of Han-Powell, which requires the update of the n × n matrix M_k. From now on, we assume σ ≥ 3.

4.1. Two-step superlinear convergence
We consider the iterative method

x^{k+1} = x^k + d^k (4.1)

where d^k is the general search direction defined in Section 3.3,

d^k = −Z_k⁻H_k(g^k − Z_k⁻ᵀM_kA_k⁻c^k) − A_k⁻c^k (4.2)

with

H_k = (Z_k⁻ᵀM_kZ_k⁻)⁻¹, (4.3a)
g^k = Z_k⁻ᵀ∇f(x^k), (4.3b)
c^k = c(x^k). (4.3c)
Assume that the sequence {x^k} converges to a local minimum x* of f on C and let λ* be the corresponding Lagrange multipliers vector associated with the constraint equations. Assume moreover that (x*, λ*) satisfies the second-order sufficient optimality condition. Assume also that the method uses a sequence of bounded matrices M_k such that

|(v, M_kv)| ≤ m₁‖v‖² for all v ∈ Rⁿ and all k, (4.4a)

and that the matrices H_k remain positive definite and such that

m₂‖p‖² ≤ (p, H_kp) ≤ m₃‖p‖² for all p ∈ R^{n−m} and all k. (4.4b)
Assume finally that M_k is the value at x^k of a Lipschitz-continuous map M, i.e. such that

‖M(x) − M(y)‖ ≤ m₄‖x − y‖ for all x, y ∈ Rⁿ; (4.5a)

this implies that H_k given by (4.3a) is also the image of x^k by a map H satisfying

‖H(x) − H(y)‖ ≤ m₅‖x − y‖ for all x, y ∈ Rⁿ. (4.5b)
The direction d^k is a solution of the QP (3.4), to which is associated a Lagrange multipliers vector λ^{k+1} ∈ Rᵐ given (by application of (3.8)) by

λ^{k+1} = −A_k⁻ᵀ(I − M_kZ_k⁻H_kZ_k⁻ᵀ)(∇f(x^k) − M_kA_k⁻c(x^k)). (4.6)
Proposition 4.1. If {x^k} → x*, the sequence of Lagrange multipliers {λ^k} constructed by the successive QP converges to λ*.
Proof. Since (d^k, λ^{k+1}) is the unique solution of QP (3.4), it satisfies

M_kd^k + A_kᵀλ^{k+1} + ∇f(x^k) = 0. (4.7)

Since x^k → x*, d^k → 0; the pair (x*, λ*) satisfies the first-order optimality condition

∇f(x*) + ∇c(x*)λ* = 0. (4.8)

Subtracting (4.8) from (4.7), we obtain

M_kd^k + A_kᵀ(λ^{k+1} − λ*) + ∇f(x^k) − ∇f(x*) + (∇c(x^k) − ∇c(x*))λ* = 0,

which by continuity of ∇f and ∇c and (4.4) yields λ^{k+1} → λ*.

Proposition 4.2. If (x*, λ*) satisfies the second-order sufficient optimality condition, there exist K₁ > K₂ > 0 such that
K₂[‖g^k‖ + ‖c^k‖] ≤ ‖x^k − x*‖ ≤ K₁[‖g^k‖ + ‖c^k‖]. (4.9)
Proof. By Corollary 2.5 the Hessian matrix D(x*, L(x*, λ*)) of the Lagrangian with respect to x and λ at (x*, λ*) is non-singular; by continuity of the second-order derivatives of f and c, the matrix D(x, L(x, λ)) remains non-singular in a neighborhood B(x*, ε₁) × B(λ*, ε₂). Suppose that x^k ∈ B(x*, ε₁) and λ^{k+1} ∈ B(λ*, ε₂); Taylor expansion of ∇l(x^k, λ^{k+1}) around the critical point (x*, λ*) yields

∇l(x^k, λ^{k+1}) = [∫₀¹ D(x* + t(x^k − x*), L(x* + t(x^k − x*), λ* + t(λ^{k+1} − λ*))) dt] [ x^k − x*      ]   =  D̄ [ x^k − x*      ]
                                                                                        [ λ^{k+1} − λ* ]        [ λ^{k+1} − λ* ]
where the matrix D̄ is non-singular. Hence

‖D̄‖⁻¹‖∇l(x^k, λ^{k+1})‖ ≤ (‖x^k − x*‖² + ‖λ^{k+1} − λ*‖²)^{1/2}, (4.10a)

‖x^k − x*‖ ≤ (‖x^k − x*‖² + ‖λ^{k+1} − λ*‖²)^{1/2} ≤ ‖D̄⁻¹‖‖∇l(x^k, λ^{k+1})‖. (4.10b)

(4.7) yields

∇ₓl(x^k, λ^{k+1}) = −M_kd^k = M_kZ_k⁻H_k(g^k − Z_k⁻ᵀM_kA_k⁻c^k) + M_kA_k⁻c^k. (4.11)

Since A_x⁻ and Z_x⁻ are C^{σ−1} maps, they are bounded in norm on B̄(x*, ε₁), and (4.11) together with assumptions (4.4) yields

‖∇ₓl(x^k, λ^{k+1})‖ ≤ K[‖g^k‖ + ‖c^k‖],

which, combined with (4.10b), proves the second inequality (4.9). Multiplying (4.11) by Z_k⁻ᵀ and using definition (4.3a) of H_k, we obtain

Z_k⁻ᵀ∇ₓl(x^k, λ^{k+1}) = g^k;
hence

‖Z_k⁻‖⁻¹‖g^k‖ ≤ ‖∇ₓl(x^k, λ^{k+1})‖. (4.12)
Finally the optimality conditions (2.4) yield

g* = Z_*⁻ᵀ∇f(x*) = 0,   λ* = −A_*⁻ᵀ∇f(x*),   c* = c(x*) = 0;

hence, using the expression (4.6) of λ^{k+1},

‖λ^{k+1} − λ*‖ ≤ ‖A_k⁻ᵀ(I − M_kZ_k⁻H_kZ_k⁻ᵀ)[∇f(x^k) − ∇f(x*) − M_kA_k⁻(c^k − c*)]‖
              + ‖[A_k⁻ᵀ − A_*⁻ᵀ − (A_k⁻ᵀM_kZ_k⁻H_kZ_k⁻ᵀ − A_*⁻ᵀM_*Z_*⁻H_*Z_*⁻ᵀ)]∇f(x*)‖
              ≤ K‖x^k − x*‖, (4.13)

resulting from the Lipschitz-continuity of ∇f and A⁻ (as C^{σ−1} maps) and of M and H (by assumption (4.5)). Combining (4.10a), (4.12) and (4.13) yields the first inequality (4.9).
Proposition 4.3. There exist K₃ > K₄ > 0 such that

K₄[‖g^k‖ + ‖c^k‖] ≤ ‖d^k‖ ≤ K₃[‖g^k‖ + ‖c^k‖]. (4.14)

Proof. The direction d^k given by (4.2) can also be viewed as the solution of a non-singular linear system

E_kd^k = −[ c^k ]
           [ g^k ],

which yields

‖E_k‖⁻¹(‖c^k‖² + ‖g^k‖²)^{1/2} ≤ ‖d^k‖ ≤ ‖E_k⁻¹‖(‖c^k‖² + ‖g^k‖²)^{1/2};

hence (4.14), using the equivalence of norms in Rⁿ.
Proposition 4.4. There exists K₅ > 0 such that

‖c(x^k + d^k)‖ ≤ K₅‖d^k‖². (4.15)

Proof. Since the functionals c_i are C^σ differentiable with σ ≥ 2, we have

γ_i = Sup_{x ∈ B̄(x*, ε)} ( Sup_{‖v‖=1} ‖c″_i(x)(v)(v)‖ ) < +∞.

Taylor expansion of the map c around x^k yields

c(x^k + d^k) = c(x^k) + c′(x^k)(d^k) + ∫₀¹ (1 − t)c″(x^k + td^k)(d^k)(d^k) dt.

By definition, d^k satisfies the linearized constraint (3.3b); hence

‖c(x^k + d^k)‖ ≤ ½ Max_i γ_i ‖d^k‖².
We finally establish a bound on the norm of the reduced gradient at x^{k+1}:

g^{k+1} = Z_{k+1}⁻ᵀ∇f(x^{k+1}) = Z_{k+1}⁻ᵀ∇ₓl(x^{k+1}, λ^{k+1}), (4.16)

because of the property (3.6) of the right inverse Z_x⁻. Notice also that, the C^{σ−1} map Z_x⁻ being Lipschitz-continuous,

‖Z_{k+1}⁻ − Z_k⁻‖ ≤ m₆‖d^k‖. (4.17)

Taylor expansion of ∇ₓl(·, λ^{k+1}) around x^k yields

∇ₓl(x^{k+1}, λ^{k+1}) = ∇ₓl(x^k, λ^{k+1}) + L(x^k, λ^{k+1})d^k + ∫₀¹ [L(x^k + td^k, λ^{k+1}) − L(x^k, λ^{k+1})]d^k dt. (4.18)

The integral term in the RHS is, in norm, of the order of ‖d^k‖², since L(·, λ^{k+1}) is Lipschitz-continuous; the first terms of the RHS can be written, using (3.3a),

∇ₓl(x^k, λ^{k+1}) + L(x^k, λ^{k+1})d^k = [L(x^k, λ^{k+1}) − M_k]d^k,

which is of the order of ‖d^k‖. Hence, for k large enough,

‖∇ₓl(x^{k+1}, λ^{k+1})‖ ≤ K‖d^k‖. (4.19)
Combining these results and observing that we can write d^k = −(Z_k⁻p^k + q^k), where p^k = H_kg^k and q^k = (I − Z_k⁻H_kZ_k⁻ᵀM_k)A_k⁻c^k (hence ‖q^k‖ ≤ m₇‖c^k‖), we finally obtain

‖g^{k+1}‖ ≤ ‖Z_k⁻ᵀ∇ₓl(x^{k+1}, λ^{k+1})‖ + ‖(Z_{k+1}⁻ − Z_k⁻)ᵀ∇ₓl(x^{k+1}, λ^{k+1})‖
          ≤ ‖Z_k⁻ᵀ[L(x^k, λ^{k+1}) − M_k]Z_k⁻p^k‖ + K₆‖d^k‖² + K₇‖c^k‖.

Proposition 4.5. Assume that f and c are C^σ differentiable with σ ≥ 3;³ then there exist K₆, K₇ > 0 such that

‖g^{k+1}‖ ≤ ‖[Z_k⁻ᵀL(x^k, λ^{k+1})Z_k⁻ − H_k⁻¹]p^k‖ + K₆‖d^k‖² + K₇‖c^k‖ (4.20)

with p^k = H_kg^k.

We are now in position to establish the two-step superlinear convergence for the class of methods (4.1), (4.2), extending a result established by Powell [7] for a particular implementation.

Theorem 4.1. Assume that f and c are C^σ
differentiable with σ ≥ 3,³ and that the general quasi-Newton method (4.1), (4.2) generates a sequence of approximate solutions {x^k} together with Lagrange multipliers {λ^k} (given by (4.6)) such that {(x^k, λ^k)} converges to (x*, λ*) satisfying the second-order sufficient optimality condition. Suppose that the method uses a sequence of bounded matrices M_k such that H_k = (Z_k⁻ᵀM_kZ_k⁻)⁻¹ is positive definite and satisfies

lim_{k→+∞} ‖[Z_k⁻ᵀL(x^k, λ^{k+1})Z_k⁻ − H_k⁻¹]p^k‖ / ‖x^{k+1} − x^k‖ = 0 (4.21)

with p^k = H_kg^k. Then the sequence {x^k} is TWO-STEP SUPERLINEARLY convergent to x*, i.e.

lim_{k→+∞} ‖x^{k+1} − x*‖ / ‖x^{k−1} − x*‖ = 0. (4.22)
Proof. Notice first, by combining (4.14) and (4.9), that

(K₄/K₁)‖x^k − x*‖ ≤ ‖x^{k+1} − x^k‖ = ‖d^k‖ ≤ (K₃/K₂)‖x^k − x*‖; (4.23)

hence

‖x^{k+1} − x*‖ ≤ (1 + K₃/K₂)‖x^k − x*‖. (4.24)

³ It is sufficient to assume that f and c are C² differentiable and have Lipschitz continuous second-order derivatives.
Combining (4.9) (for index k + 1), (4.15), (4.20) we obtain

‖x^{k+1} − x*‖ ≤ K₁[‖[Z_k⁻ᵀL(x^k, λ^{k+1})Z_k⁻ − H_k⁻¹]p^k‖ + (K₅ + K₆)‖d^k‖² + K₇‖c^k‖]; (4.25)

using (4.23) and (4.24) (for index k − 1) yields

‖x^{k+1} − x*‖ / ‖x^{k−1} − x*‖ ≤ (K₁/K₂)(1 + K₃/K₂)[ ‖[Z_k⁻ᵀL(x^k, λ^{k+1})Z_k⁻ − H_k⁻¹]p^k‖ / ‖x^{k+1} − x^k‖ + (K₅ + K₆)‖d^k‖ ] + (K₁K₃K₅K₇/K₂)‖d^{k−1}‖.
Taking the limit as k → +∞ yields (4.22), given that d^k → 0 and (4.21) holds.

Theorem 4.1 shows that the two-step superlinear convergence of the general method (4.1), (4.2) depends only upon how the restriction of M_k to the subspace T_k approximates the similar restriction of the Hessian of the Lagrangian L(x^k, λ^{k+1}). Actually it only requires that this approximation be adequate along the direction p^k; this result, mentioned by Powell for a particular method where the orthogonal restriction was considered, thus generalizes a similar condition given by Dennis-Moré [21] for quasi-Newton methods for unconstrained minimization. In other words, it is not necessary, in order to obtain two-step superlinear convergence, to update the full n × n matrix M_k. In our general framework, we are able to exploit this observation fully by considering, as discussed in Section 3.3, matrices M_k of the form

M_k = Z_kᵀG_kZ_k (4.27)

where G_k is an (n − m) × (n − m) positive definite symmetric matrix updated according to the modified generalized BFGS formula (3.23); notice that H_k = G_k⁻¹ and can be updated directly according to (3.25). It remains to show that the update formula (3.23) verifies the generalized Dennis-Moré condition (4.21), which requires very technical estimates; we shall not attempt to prove this result in this paper (see Powell [7], where (4.21) is established for the updating of M_k by a formula analogous to (3.23)). We shall refer to such methods as reduced quasi-Newton methods.
As discussed in Section 3.3, method (4.1) then uses the simplified quasi-Newton direction

d^k = −Z_k⁻H_kg^k − A_k⁻c^k (4.28)

and generates Lagrange multipliers

λ^{k+1} = −A_k⁻ᵀ∇f(x^k); (4.29)
notice that only H_k is required in the computations.

Finally, Theorem 4.1 can be interpreted in the framework of multiplier methods proposed in Section 3.4. It establishes the two-step superlinear convergence of the diagonalized multiplier method where the minimization of the augmented Lagrangian

Min_y q(y, μ^k, r) (4.30)

is performed by one step of a quasi-Newton method starting from x^k and using the approximate Hessian

N_k = Z_kᵀG_kZ_k + rA_kᵀA_k = S_kᵀ [ rI_m  0  ] S_k, (4.31)
                                  [  0   G_k ]

and the multipliers are updated according to the formula

μ^{k+1} = −A_k⁻ᵀ∇f(x^k) = μ^k + rc^k − A_k⁻ᵀ∇ₓq(x^k, μ^k, r). (4.32)
This result cannot be directly compared with those of Tapia [16] and Byrd [22], where M_k is assumed to be non-singular; in particular, the multiplier update formula (4.32) does not fit in the framework of Tapia's general update formula. Besides, Byrd is chiefly interested in proving two-step or uniform quadratic convergence of diagonalized multiplier Newton methods.

4.2. Superlinear convergence of a reduced quasi-Newton method with feasibility improvement
Going back to the proof of Theorem 4.1, we can observe that superlinear convergence could be established if we had ‖c(x^k)‖ = 0. To achieve this result we consider a modification of method (4.1), (4.2) where the next iterate is now given by

x^{k+1} = x^k + d^k + e^k (4.33)

where d^k is still defined by (4.2) and e^k is an additional step to improve the error on the constraints, defined by

e^k = −A_k⁻c(x^k + d^k); (4.34)

the additional step e^k is incorporated only if it actually improves feasibility, i.e. if, given δ ∈ (0, 1),

‖c(x^k + d^k + e^k)‖ ≤ (1 − δ)‖c(x^k + d^k)‖. (4.35)

By (4.34) we conclude, using (4.15), that

‖e^k‖ ≤ K₈‖d^k‖², (4.36)

which shows that (4.35) is eventually satisfied as d^k → 0.
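The effect of the restoration step (4.34) is easy to see on a single constraint. In the sketch below (our own toy example; NumPy assumed; d^k is an arbitrary trial step rather than one produced by (4.2)), one step e = −A⁻c(x + d) reduces the constraint violation by far more than the factor (1 − δ) required by the acceptance test (4.35).

```python
import numpy as np

# Feasibility improvement (4.34) for c(x) = x1^2 + x2^2 - 1 (m = 1, n = 2).

def c(x):
    return np.array([x[0]**2 + x[1]**2 - 1.0])

x = np.array([1.1, 0.0])                  # current iterate x^k
d = np.array([-0.05, 0.1])                # some trial step d^k (illustrative)
A = np.array([[2 * x[0], 2 * x[1]]])      # Jacobian A_k at x^k
A_r = A.T @ np.linalg.inv(A @ A.T)        # right inverse A- (here the pseudoinverse)

e = -(A_r @ c(x + d)).ravel()             # restoration step (4.34)

delta = 0.2
viol_before = abs(c(x + d)[0])
viol_after = abs(c(x + d + e)[0])
assert viol_after <= (1 - delta) * viol_before   # acceptance test (4.35) holds
print(f"violation: {viol_before:.4f} -> {viol_after:.4f}")
```

Here the violation drops from about 0.11 to under 0.01 in one step, consistent with the quadratic estimate ‖e^k‖ ≤ K₈‖d^k‖² of (4.36).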
We assume again that the sequence {x^k} defined by (4.33) and the associated sequence of Lagrange multipliers {λ^k} (still defined by (4.6)) converge to (x*, λ*) satisfying the second-order sufficient optimality conditions. We also assume that (4.4) and (4.5) still hold. Propositions 4.1, 4.2, 4.3, 4.4 are obviously still valid, together with estimates (4.9), (4.14), (4.15). It is easily verified, using (4.36), that the estimate (4.20) of ‖g^{k+1}‖ still holds, with possibly a larger constant K₆. To establish the rate of convergence we need an estimate of ‖c(x^{k+1})‖ for the new scheme. Taylor expansion of c(x^{k+1}) leads now to

‖c(x^{k+1})‖ = ‖c(x^k + d^k) + A_ke^k + ∫₀¹ [c′(x^k + d^k + te^k) − c′(x^k)]e^k dt‖;

hence, by the Lipschitz-continuity of c′ and by definition (4.34) of e^k,

‖c^{k+1}‖ ≤ K₉‖d^k + e^k‖‖e^k‖. (4.37)
Theorem 4.2. Assume that f and c are C^σ differentiable with σ ≥ 3 and that the MODIFIED quasi-Newton method (4.33), (4.2), (4.34) generates a sequence of approximate solutions {x^k} and a sequence of Lagrange multipliers {λ^k} (defined by (4.6)) converging to (x*, λ*) satisfying the second-order sufficient optimality condition. Suppose that the method uses a sequence of bounded matrices M_k such that H_k = (Z_k⁻ᵀM_kZ_k⁻)⁻¹ is positive definite and satisfies

lim_{k→+∞} ‖[Z_k⁻ᵀL(x^k, λ^{k+1})Z_k⁻ − H_k⁻¹]p^k‖ / ‖x^{k+1} − x^k‖ = 0, (4.38)

with p^k = H_kg^k. Then the sequence {x^k} is SUPERLINEARLY convergent to x*, i.e.

lim_{k→+∞} ‖x^{k+1} − x*‖ / ‖x^k − x*‖ = 0. (4.39)

Proof. We still have

(K₄/K₁)‖x^k − x*‖ ≤ ‖d^k‖ ≤ (K₃/K₂)‖x^k − x*‖ (4.40)

for the modified method; however,

‖x^{k+1} − x^k‖ = ‖d^k + e^k‖ ≤ ‖d^k‖ + K₈‖d^k‖² ≤ K₁₀‖d^k‖. (4.41)
We thus obtain, using (4.25), (4.37), (4.40), (4.41),

‖x^{k+1} − x*‖ / ‖x^k − x*‖ ≤ (K₁K₁₀/K₂){ ‖[Z_k⁻ᵀL(x^k, λ^{k+1})Z_k⁻ − H_k⁻¹]p^k‖ / ‖x^{k+1} − x^k‖ + K₇K₉‖e^{k−1}‖ } + (K₁K₃/K₂)(K₅ + K₆)‖d^k‖. (4.42)
Given (4.38) and the assumption that d^k → 0 (hence e^k → 0), we obtain (4.39), showing the superlinear convergence of the modified method. The comments following Theorem 4.1 are still in order and lead to the same strategy: we only need to update an (n − m) × (n − m) positive definite matrix G_k = H_k⁻¹ according to (3.23). Notice that, starting from a feasible point x^k (i.e. e^k = 0), method (4.33) is equivalent to the quasi-Newton method along geodesics of [1] where only one step of the restoration phase is performed (the stepsize being taken as 1). Examples of choice of A_x⁻ and Z_x⁻ have been presented in [1] and include the partitioned right inverse formulae (3.15) and the QR right inverse formulae (3.17).
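The direction (4.28), multiplier (4.29) and restoration step (4.34) can be assembled into a working iteration of the modified method (4.33). The sketch below (our own toy problem; NumPy assumed) uses, for simplicity, the exact reduced Hessian in place of the quasi-Newton approximation G_k, so it illustrates only the geometry of the step, not the BFGS updating.

```python
import numpy as np

# Modified method (4.33) on: min x1 + x2  s.t.  x1^2 + x2^2 = 1
# (solution x* = -(1, 1)/sqrt(2), lambda* = 1/sqrt(2)).

def f_grad(x): return np.array([1.0, 1.0])
def c(x):      return np.array([x[0]**2 + x[1]**2 - 1.0])
def A_of(x):   return np.array([[2 * x[0], 2 * x[1]]])

x = np.array([-0.6, -0.8])
for _ in range(25):
    A = A_of(x)
    A_r = A.T @ np.linalg.inv(A @ A.T)     # right inverse A-
    _, _, Vt = np.linalg.svd(A)
    Z_r = Vt[1:].T                         # Z-: orthonormal nullspace basis
    lam = -(A_r.T @ f_grad(x))             # multiplier (4.29)
    g = Z_r.T @ f_grad(x)                  # reduced gradient (3.12)
    L = 2 * lam[0] * np.eye(2)             # exact Hessian of f + lam.c here
    H = np.linalg.inv(Z_r.T @ L @ Z_r)     # exact reduced inverse Hessian
    d = -(Z_r @ (H @ g)) - (A_r @ c(x)).ravel()   # direction (4.28)
    e = -(A_r @ c(x + d)).ravel()          # feasibility step (4.34)
    x = x + d + e                          # iterate (4.33)

x_star = -np.ones(2) / np.sqrt(2)
assert np.allclose(x, x_star, atol=1e-6)
assert abs(c(x)[0]) < 1e-10
print("converged to", x)
```

Starting from the feasible point (−0.6, −0.8), the very first step already lands close to x*, in line with the fast local convergence established in this section for the method with a good reduced Hessian approximation.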
5. Stepsize selection and global behavior

We now introduce a stepsize parameter t > 0 and consider, for the reduced quasi-Newton method with feasibility improvement, the parametrized arc of parabola starting from x^k and tangent to d^k, formally similar to the parabolic arc introduced in [1]:

x(t) = x^k + td^k + t²e^k (5.1)

where d^k and e^k are respectively defined by (4.28) and (4.34) (if (4.35) is satisfied). Following Han [10], we choose the stepsize t_k, defining the new iterate

x^{k+1} = x(t_k), (5.2)

to achieve a sufficient decrease of the exact penalty function

Φ(x, r_{k+1}) = f(x) + r_{k+1} Σ_{i=1}^m |c_i(x)|. (5.3)

The non-decreasing sequence of penalty parameters is defined recursively by

r_{k+1} = Max{r_k, Max_i |λ_i^{k+1}|}, (5.4)

starting from r₀ > 0; the m-dimensional vector λ^{k+1} is taken as the Lagrange multipliers vector at the solution of the quadratic programming problem (3.4), given by (4.29). Instead of requiring t_k to achieve an approximate minimization of the form proposed in [10], we follow the spirit of Powell [19] and select t_k = 2^{−l} for the first index l of the sequence {0, 1, 2, ...} such that

Φ(x(t_k), r_{k+1}) ≤ Φ(x^k, r_{k+1}) − αt_kΨ(x^k, d^k, r_{k+1}) (5.5)
with α ∈ (0, ½) and Ψ(x, d, r) defined by

Ψ(x, d, r) = r Σ_{i=1}^m |c_i(x)| − (∇f(x), d). (5.6)
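The rule (5.5)-(5.6) is an Armijo-type backtracking test on the exact penalty function. The sketch below (our own toy instance; NumPy assumed; H_k = I and the feasibility step e^k is omitted for brevity) computes Ψ from (5.6), checks that it is positive at a non-critical point, and halves t until (5.5) holds.

```python
import numpy as np

# Backtracking on the exact penalty, for min x1 + x2 s.t. x1^2 + x2^2 = 1.

def phi(x, r):            # exact penalty function (5.3), m = 1
    return x[0] + x[1] + r * abs(x[0]**2 + x[1]**2 - 1.0)

def psi(x, d, r):         # decrease measure (5.6)
    return r * abs(x[0]**2 + x[1]**2 - 1.0) - np.array([1.0, 1.0]) @ d

x = np.array([-0.3, -0.5])                 # a point far from feasibility
A = np.array([[2 * x[0], 2 * x[1]]])
A_r = A.T @ np.linalg.inv(A @ A.T)
cx = np.array([x[0]**2 + x[1]**2 - 1.0])
_, _, Vt = np.linalg.svd(A)
Z_r = Vt[1:].T
g = Z_r.T @ np.array([1.0, 1.0])
d = -(Z_r @ g) - (A_r @ cx).ravel()        # direction (4.28) with H_k = I
lam = -(A_r.T @ np.array([1.0, 1.0]))      # multiplier (4.29)

r = max(1.0, abs(lam[0]) + 0.5)            # penalty parameter exceeding |lambda|, cf. (5.4)
alpha = 0.1
assert psi(x, d, r) > 0                    # Proposition 5.1: Psi > 0 at a non-critical x^k

t = 1.0
while phi(x + t * d, r) > phi(x, r) - alpha * t * psi(x, d, r):
    t *= 0.5                               # t_k = 2^(-l), first l satisfying (5.5)
print("accepted stepsize:", t, "penalty decrease:", phi(x, r) - phi(x + t * d, r))
```

Near the solution the full step t = 1 passes the test immediately, which is consistent with Proposition 5.2 below; the backtracking only bites far from the solution or with a poor Hessian approximation.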
Proposition 5.1. Given x^k, let d^k, λ^{k+1} and r_{k+1} be defined by (4.28), (4.29) and (5.4); then

Ψ(x^k, d^k, r_{k+1}) ≥ 0, (5.7)

where equality holds iff (x^k, λ^{k+1}) satisfies the first-order optimality conditions (2.4).

Proof. Notice that, using (4.28) and (4.29), formula (5.6) yields

Ψ(x^k, d^k, r_{k+1}) = (g^k, H_kg^k) + r_{k+1} Σ_{i=1}^m |c_i(x^k)| − (λ^{k+1}, c(x^k)); (5.8)
inequality (5.7) results from the choice (5.4) of r_{k+1} and the positive definiteness of H_k.

The stepsize selection rule (5.5) thus insures a sufficient decrease of the exact penalty function from a non-critical iterate x^k. The convergence analysis of the algorithm must, however, distinguish between two situations.

Theorem 5.1 (Global behaviour). Assume that f and c are C^σ differentiable with σ ≥ 2 and that c is a submersion. If the sequence {r_k} defined by (5.4) increases infinitely, then the sequence {x^k}, constructed by (5.1), (5.2), (5.5), has no accumulation point; if r_k is increased only a finite number of times according to (5.4), then any accumulation point of the sequence {x^k, λ^{k+1}} satisfies the first-order optimality conditions.

Proof. (a) Suppose that r_k → +∞ as k → +∞ and that the sequence {x^k} has an accumulation point x*, i.e. there exists a subsequence x^{k_i} → x* as i → +∞. Define

r(x) = Max_{1≤i≤m} |λ_i(x)| (5.9)

where λ(x) is the Lagrange multipliers vector defined by

λ(x) = −A_x⁻ᵀ∇f(x). (5.10)

Since r is a continuous function, given ε > 0,

r* = Max_{x ∈ B̄(x*, ε)} r(x) < +∞. (5.11)

There also exists N > 0 such that x^{k_i} ∈ B(x*, ε) for all i ≥ N. Since r_k is
increased infinitely often according to (5.4), we must have

r_{k_i} ≤ r(x^{k_i}) ≤ r* < +∞ for all i ≥ N, (5.12)

which contradicts the fact that r_k → +∞.

(b) Suppose now that r_k is increased finitely many times, i.e. that there exist r > 0 and an integer N such that

r_{k+1} = r for all k ≥ N, (5.13)
which implies that the Lagrange multipliers remain bounded:

|λ_i^{k+1}| ≤ r for i = 1, ..., m and all k ≥ N. (5.14)

Let (x*, λ*) be an accumulation point of the sequence {x^k, λ^{k+1}}; we assume for simplicity that x^k → x*. Suppose that (x*, λ*) does not satisfy the first-order optimality conditions (2.4); then by Proposition 5.1

Ψ(x*, d*, r) = δ > 0. (5.15)

There exists N′ ≥ N such that

Ψ(x^k, d^k, r) > δ/2 for all k ≥ N′. (5.16)
We can evaluate Φ(x(t), r) along the parabolic arc (5.1) starting from x^k, k ≥ N′, for t ∈ [0, 1]:

Φ(x(t), r) = f(x^k + td^k + t²e^k) + r Σ_{i=1}^m |c_i(x^k + td^k + t²e^k)|. (5.17)

Using the definitions (4.28) and (4.34) of d^k and e^k, second-order Taylor expansions of f and c_i yield the majorization

Φ(x(t), r) ≤ Φ(x^k, r) − tΨ(x^k, d^k, r) + t²Θ(x^k, d^k, e^k, r) (5.18)

where Θ(x^k, d^k, e^k, r) is a positive bounded term, since ∇²f and ∇²c_i are continuous, hence bounded, on B̄(x*, ε):

0 < Θ(x^k, d^k, e^k, r) ≤ M. (5.19)
There exists an integer L ≥ 0 such that

2^{−L} ≤ (1 − α)δ/(2M) < 2^{−L+1},

provided 0 < α < 1 and δ ≤ 4M; then t = 2^{−L} is an admissible stepsize since

Φ(x(2^{−L}), r) ≤ Φ(x^k, r) − α2^{−L}Ψ(x^k, d^k, r). (5.20)

Thus the selection rule (5.5) defines a sequence of stepsizes t_k bounded from below:

t_k ≥ 2^{−L} for all k ≥ N′.
By definition (5.2) of the new iterate x^{k+1}, we have

Φ(x^{k+1}, r) − Φ(x^k, r) ≤ −α t_k Ψ(x^k, d^k, r) ≤ −α δ 2^{−L−1},  (5.21)

which implies that Φ(x^k, r) → −∞, in contradiction with the continuity of Φ(·, r), which guarantees that Φ(x^k, r) → Φ(x*, r). Hence (x*, λ*) must satisfy the first-order optimality conditions.

We assume now that f and c_i are C^σ differentiable functions with σ ≥ 3. The superlinear convergence established in Section 4 holds, since we can establish that the stepsize t_k = 1 satisfies (5.5) after a finite number of iterations.
Proposition 5.2. There exists an integer N > 0 such that the stepsize t_k = 1 satisfies the selection rule (5.5) for k ≥ N.
Proof. A third-order Taylor expansion of f yields, using (4.36),

f(x^k + d^k + e^k) = f(x^k) + ⟨∇f(x^k), d^k⟩ + ⟨∇f(x^k), e^k⟩ + ½⟨d^k, ∇²f(x^k) d^k⟩ + O(‖d^k‖³).  (5.22)

Combining (4.36) and (4.37), we obtain

‖c(x^k + d^k + e^k)‖ = O(‖d^k‖³),  (5.23)

hence

Φ(x^k + d^k + e^k, r) = Φ(x^k, r) − Ψ(x^k, d^k, r) + ⟨λ^{k+1}, c(x^k + d^k)⟩ + ½⟨d^k, ∇²f(x^k) d^k⟩ + O(‖d^k‖³),  (5.24)

where we have made use of the definitions (4.34) and (4.29) of e^k and λ^{k+1}. Using the third-order Taylor expansions of c_i, we obtain

Φ(x^k + d^k + e^k, r) = Φ(x^k, r) − Ψ(x^k, d^k, r) + ½⟨d^k, L(x^k, λ^{k+1}) d^k⟩ + O(‖d^k‖³).  (5.25)

With the expressions (5.8) of Ψ(x^k, d^k, r) and (4.28) of d^k, we can rewrite (5.25) as

Φ(x^k + d^k + e^k, r) − Φ(x^k, r) + α Ψ(x^k, d^k, r) ≤ −(½ − α) Ψ(x^k, d^k, r) + ½⟨p^k, [Z_k^T L(x^k, λ^{k+1}) Z_k − H_k^{−1}] p^k⟩ + O(‖d^k‖³)  (5.26)

with p^k = H_k g^k. Notice that if the sequence H_k generated by the algorithm satisfies (4.38), the second term of the right-hand side is o(‖d^k‖²), while, by (5.8),

Ψ(x^k, d^k, r) is of the order of ‖d^k‖²;  (5.27)
hence, provided α < ½, (5.26) shows that

Φ(x^k + d^k + e^k, r) ≤ Φ(x^k, r) − α Ψ(x^k, d^k, r)  (5.28)

once ‖d^k‖ is sufficiently small (i.e. after a finite number of iterations, since {x^k} → x* satisfying the first-order optimality conditions).

We have thus established the 'global' and superlinear convergence of the reduced quasi-Newton method with feasibility improvement, which we summarize as follows.

Given x^k ∈ R^n, H_k positive definite, A_k, Z_k satisfying (3.6), α ∈ (0, ½), r_k > 0:
(i) compute the constraint residuals: c^k = c(x^k);
(ii) compute the reduced gradient: g^k = Z_k^T ∇f(x^k);
(iii) compute the quasi-Newton direction: d^k = −Z_k H_k g^k − A_k c^k;
(iv) compute the feasibility improvement direction: let x̄^k = x^k + d^k and e^k = −A_k c(x̄^k); if ‖c(x̄^k + e^k)‖ > ‖c(x̄^k)‖, then e^k ← 0;
(v) penalty parameter: compute the Lagrange multipliers λ^{k+1} = −A_k^{−T} ∇f(x^k); let r_{k+1} = Max{r_k, Max_i |λ_i^{k+1}|};
(vi) stepsize selection: let l be the smallest integer such that

Φ(x^k + 2^{−l} d^k + 2^{−2l} e^k, r_{k+1}) ≤ Φ(x^k, r_{k+1}) − α 2^{−l} Ψ(x^k, d^k, r_{k+1});

let x^{k+1} = x^k + 2^{−l} d^k + 2^{−2l} e^k.
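The iteration just summarized is easy to prototype. The sketch below implements only the stepsize selection (vi), i.e. the halving rule (5.5), in Python; Φ is formed as f + r Σ_i |c_i|, while the value of Ψ(x^k, d^k, r), whose expression (5.8) is given earlier in the paper, is passed in by the caller. All function names are illustrative, not from the paper.

```python
def phi(x, r, f, c):
    # Exact penalty function Phi(x, r) = f(x) + r * sum_i |c_i(x)|
    return f(x) + r * sum(abs(ci) for ci in c(x))

def armijo_parabolic_step(x, d, e, r, psi, f, c, alpha=0.25, max_halvings=30):
    """Rule (vi)/(5.5): find the smallest integer l such that
    Phi(x + 2^-l d + 2^-2l e, r) <= Phi(x, r) - alpha * 2^-l * psi,
    and return the new iterate on the parabolic arc together with t = 2^-l."""
    phi0 = phi(x, r, f, c)
    for l in range(max_halvings):
        t = 2.0 ** (-l)
        xt = [xi + t * di + t * t * ei for xi, di, ei in zip(x, d, e)]
        if phi(xt, r, f, c) <= phi0 - alpha * t * psi:
            return xt, t
    raise RuntimeError("no admissible stepsize found")
```

For instance, with f(x) = x₁² + x₂², a single constraint c(x) = x₁ + x₂ − 1, r = 1 and a supplied Ψ = 2, the full step t = 1 is accepted from x = (1, 1).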
6. Conclusions

In this paper we have presented a superlinearly and globally convergent algorithm for the equality constrained nonlinear programming problem. This method can be viewed as a particularly efficient approximation of the quasi-Newton method along geodesics presented in [1], where the feasibility of the successive iterates is not enforced. It is also related to the class of diagonalized multiplier methods [8] and to the increasingly popular variable metric methods for constrained optimization [6, 7], but offers several advantages: from a computational viewpoint it requires updating only a reduced-size matrix approximating the Hessian of the Lagrangian; on the theoretical side the method is guaranteed to converge globally in general, and with a superlinear rate. Numerical results will be presented in another report. Additional analysis is also needed to extend our method to nonlinear programming problems with inequality constraints (see [1, §...] for a possible approach).
Acknowledgment The author expresses his thanks to two anonymous referees for their very careful readings of a previous version of this paper and for their comments and suggestions.
References

[1] D. Gabay, "Minimizing a differentiable functional over a differentiable manifold", Journal of Optimization Theory and Applications 37 (1982) to appear.
[2] H. Mukai and E. Polak, "On the use of approximations in algorithms for optimization problems with equality and inequality constraints", SIAM Journal on Numerical Analysis 15 (1978) 674-693.
[3] R.T. Rockafellar, "Augmented Lagrange multiplier functions and duality in nonconvex programming", SIAM Journal on Control 12 (1973) 555-562.
[4] D.P. Bertsekas, "Multiplier methods: A survey", Automatica 12 (1976) 133-145.
[5] U.M. Garcia Palomares and O.L. Mangasarian, "Superlinearly convergent quasi-Newton algorithms for nonlinearly constrained optimization problems", Mathematical Programming 11 (1976) 1-13.
[6] S.P. Han, "Superlinearly convergent variable metric algorithms for general nonlinear programming problems", Mathematical Programming 11 (1976) 263-282.
[7] M.J.D. Powell, "The convergence of variable metric methods for nonlinearly constrained optimization calculations", in: O.L. Mangasarian, R.R. Meyer and S.M. Robinson, eds., Nonlinear programming 3 (Academic Press, New York, 1978) pp. 27-63.
[8] R.A. Tapia, "Diagonalized multiplier methods and quasi-Newton methods for constrained optimization", Journal of Optimization Theory and Applications 22 (1977) 135-194.
[9] S.T. Glad, "Properties of updating methods for the multipliers in augmented Lagrangians", Journal of Optimization Theory and Applications 28 (1979) 135-156.
[10] S.P. Han, "A globally convergent method for nonlinear programming", Journal of Optimization Theory and Applications 22 (1977) 297-309.
[11] A. Miele, J. Tietze and A.V. Levy, "Summary and comparison of gradient restoration algorithms for optimal control", Journal of Optimization Theory and Applications 10 (1972) 381-403.
[12] K. Tanabe, "Differential geometric methods in nonlinear programming", in: V. Lakshmikantham, ed., Applied nonlinear analysis (Academic Press, New York, 1979) pp. 707-719.
[13] J.W. Milnor, Topology from the differentiable viewpoint (University of Virginia Press, Charlottesville, VA, 1965).
[14] M.W. Hirsch, Differential topology (Springer-Verlag, Heidelberg, 1976).
[15] D.G. Luenberger, Introduction to linear and nonlinear programming (Addison-Wesley, Reading, MA, 1973).
[16] R.A. Tapia, "Quasi-Newton methods for equality constrained optimization: Equivalence of existing methods and a new implementation", in: O.L. Mangasarian, R.R. Meyer and S.M. Robinson, eds., Nonlinear programming 3 (Academic Press, New York, 1978) pp. 125-164.
[17] B. Noble, Applied linear algebra (Prentice-Hall, Englewood Cliffs, NJ, 1969).
[18] M.J.D. Powell, "Algorithms for nonlinear constraints that use Lagrangian functions", Mathematical Programming 14 (1978) 224-248.
[19] M.J.D. Powell, "A fast algorithm for nonlinearly constrained optimization calculations", in: G.A. Watson, ed., Numerical analysis (Springer-Verlag, Heidelberg, 1978) pp. 144-157.
[20] A. Miele, E.E. Cragg and A.V. Levy, "Use of the augmented penalty function in mathematical programming problems, Part II", Journal of Optimization Theory and Applications 8 (1971) 336-349.
[21] J.E. Dennis and J.J. Moré, "Quasi-Newton methods, motivation and theory", SIAM Review 19 (1977) 46-89.
[22] R.E. Byrd, "Local convergence of the diagonalized method of multipliers", Journal of Optimization Theory and Applications 26 (1978) 485-500.
Mathematical Programming Study 16 (1982) 45-61
North-Holland Publishing Company

A SUPERLINEARLY CONVERGENT ALGORITHM FOR CONSTRAINED OPTIMIZATION PROBLEMS*

D.Q. MAYNE
Department of Electrical Engineering, Imperial College of Science and Technology, London SW7 2BZ, Great Britain

E. POLAK
Department of Electrical Engineering and Computer Sciences and the Electronics Research Laboratory, University of California, Berkeley, CA 94720, U.S.A.

Received 29 January 1980
Revised manuscript received 18 March 1981
An algorithm for constrained optimization is presented which is globally convergent and whose rate of convergence is superlinear. The algorithm employs a quadratic approximation to the original problem to obtain a search direction and an exact penalty function to determine step length. The algorithm incorporates a rule for choosing the penalty parameter and, near a solution, employs a search arc rather than a search direction to avoid truncation of the step length, and thereby loss of superlinear convergence.
Key words: Superlinear Convergence, Constrained Optimization, Exact Penalty Functions.
1. Introduction

We consider the following problem:

P: min{f(x) | g(x) ≤ 0, h(x) = 0}  (1)

where f: R^n → R, g: R^n → R^m, h: R^n → R^r. There exist many algorithms, such as the constrained Newton method of [1] and the methods of [2], [3] and [4], which have a superlinear rate of convergence but are only locally convergent, for solving this problem. A major difficulty in globally stabilizing these algorithms is that they generate sequences which are not necessarily feasible, making it difficult to decide whether the next iterate is an improvement or not. A mechanism for doing precisely this is the exact penalty function, which was employed for this purpose in [5], [6], [7] and [8]. In this class of algorithms the search direction is determined by solving a first or second order approximation to the original problem and the step length is determined by approximately minimizing an exact penalty function; the penalty

* Research sponsored by the National Science Foundation Grant ENG 73 08214-A01, National Science Foundation (RANN) Grant ENV 76 04264 and the Joint Services Electronic Programme Contract F44620-76-C-0100.
parameter is chosen so that the search direction is a descent direction for the exact penalty function. In this respect the algorithms differ from the pioneering algorithms of [9] and [10], in which the penalty parameter is first chosen to ensure equivalence with the original problem and the search direction then chosen to be a descent direction for the penalty function. The advantages gained by the new procedure were accompanied by several difficulties, mostly foreseen by Han. The first is the choice of the penalty parameter. Much of the literature either assumes that a suitable value is known a priori (in practice many computations with different choices of the parameter may be necessary) or employs heuristic procedures for iteratively updating the parameter. It would seem desirable to have an automatic rule which ensures global convergence. The second difficulty, the choice of step length, arises as a consequence of the non-differentiability of the exact penalty function. Exact minimization was employed in the earlier papers; obviously approximate minimization is preferable if convergence is not thereby destroyed. A third problem is the 'Maratos effect' [7]: the exact penalty function step length procedure can truncate the step length near a solution when a quadratic approximation is employed to determine the search direction, thus destroying superlinear convergence. Finally, the approximation to the original problem may not have a solution and may yield discontinuous estimates of the multipliers (required for choosing the penalty parameter). This paper presents an algorithm, utilizing a quadratic approximation for determining the search direction and an exact penalty function for choosing the step length, which overcomes these difficulties. We believe that it is the only algorithm currently available which does. Nevertheless the algorithm can be improved further; its structure is such that heuristic modifications can easily be made without destroying its properties of global and superlinear convergence; several such modifications are suggested. The algorithm is naturally more complex than that given in [5]; nevertheless the extra complexity does not add a significant computational burden and appears necessary.
2. The algorithm

Superlinear convergence is achieved by using a search direction which is obtained by solving a quadratic approximation to the original problem P. Let L: R^n × R^m × R^r → R denote the Lagrangian, defined by:

L(x, λ, μ) ≜ f(x) + ⟨λ, g(x)⟩ + ⟨μ, h(x)⟩.  (2)

Let H ∈ R^{n×n} denote an estimate of the Hessian L_xx(x, λ, μ). The quadratic approximation to P, given x and H, is defined to be the program:

QP(x, H): min{f_x(x)p + ½⟨p, Hp⟩ | p ∈ F(x)}  (3)
where:

F(x) ≜ {p ∈ R^n | g(x) + g_x(x)p ≤ 0, h(x) + h_x(x)p = 0}.  (4)

Clearly x + F(x) is an estimate of the feasible set F for P. Solving QP(x, H) yields a search direction which is certainly satisfactory (if H is) near a solution point, but which may not be satisfactory, or may not even exist (F(x) may be empty, for example) elsewhere. A mechanism for obtaining an alternative search direction is necessary. One simple possibility is to compute a search direction which is known to be a descent direction for the exact penalty function employed in calculating step length. The exact penalty function employed in this paper is γ: R^n × R → R defined by:

γ(x, c) ≜ f(x) + c ψ(x)  (5)
where ψ: R^n → R is defined by:

ψ(x) ≜ max{g^j(x), j ∈ m; |h^j(x)|, j ∈ r; 0}  (6)

where m, r denote, respectively, the sets {1, 2, …, m}, {1, 2, …, r}. The penalty parameter is c. Clearly ψ(x) ≥ 0, and ψ(x) = 0 if and only if x is feasible. However, this choice is not essential; ψ(x) may be replaced, for example, by Σ |h^j(x)| + Σ g^j(x)^+, where g^j(x)^+ ≜ max{0, g^j(x)}, or a separate penalty parameter may be employed for each constraint. The major consequential change in the algorithm is in the rule for choosing the penalty parameter(s). To ascertain whether a given search direction p is a descent direction for γ(x, c) we require first order estimates γ̄(x, p, c), ψ̄(x, p) of γ(x + p, c) and ψ(x + p); these are obtained by replacing g^j(x + p) by g^j(x) + g^j_x(x)p and h^j(x + p) by h^j(x) + h^j_x(x)p in the appropriate definitions, yielding:

γ̄(x, p, c) ≜ f(x) + f_x(x)p + c ψ̄(x, p),  (7)

ψ̄(x, p) ≜ max{g^j(x) + g^j_x(x)p, j ∈ m; |h^j(x) + h^j_x(x)p|, j ∈ r; 0}.  (8)
It is assumed that f, g and h are continuously differentiable. Let θ: R^n × R^n × R → R be defined by:

θ(x, p, c) ≜ γ̄(x, p, c) − γ(x, c)  (9)

so that θ(x, p, c) is a first order estimate of γ(x + p, c) − γ(x, c); p is a descent direction for γ(x, c) if θ(x, p, c) < 0. If p is a solution of QP(x, H) then, under certain conditions, c can be chosen so that p is a descent direction for γ(x, c):
Proposition 1. Let {p, λ, μ} be a Kuhn-Tucker triple¹ for QP(x, H). Then p is a descent direction for γ(x, c) if: (i) H is positive definite, and (ii)

c ≥ Σ_{j=1}^m λ^j + Σ_{j=1}^r |μ^j|.  (10)

This result is easily established from the fact that if {p, λ, μ} is a Kuhn-Tucker triple for QP(x, H), then:

∇f(x) + Hp + Σ_{j=1}^m λ^j ∇g^j(x) + Σ_{j=1}^r μ^j ∇h^j(x) = 0.  (11)
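The estimates (6), (8) and (9) are inexpensive to evaluate. The following Python sketch computes ψ, ψ̄ and θ, with the problem functions and their gradients supplied as hypothetical callables; it is an illustration, not the paper's implementation.

```python
def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def psi(x, g, h):
    # (6): worst constraint violation, never negative
    return max([0.0] + [gj(x) for gj in g] + [abs(hj(x)) for hj in h])

def psi_bar(x, p, g, dg, h, dh):
    # (8): violation of the constraints linearized at x, evaluated at x + p
    vals = [0.0]
    vals += [gj(x) + dot(dgj(x), p) for gj, dgj in zip(g, dg)]
    vals += [abs(hj(x) + dot(dhj(x), p)) for hj, dhj in zip(h, dh)]
    return max(vals)

def theta(x, p, c, df, g, dg, h, dh):
    # (9): first order estimate of gamma(x+p, c) - gamma(x, c);
    # p is a descent direction for gamma(., c) when this is negative
    return dot(df(x), p) + c * (psi_bar(x, p, g, dg, h, dh) - psi(x, g, h))
```

For example, with the single equality h(x) = x₁ + x₂ and f(x) = x₁, the step p = (−1, 0) from x = (1, 0) gives ψ = 1, ψ̄ = 0 and θ = −1 − c, a descent direction for every c > 0.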
The algorithm presented in this paper has the following components: a procedure for choosing the penalty parameter; a procedure for determining a search direction; a procedure for determining step length; and a mechanism for avoiding truncation of the step length near a solution. To see what properties are required of each component it is convenient to analyse first an abstract form of the algorithm; this is:

Algorithm Model
Data: x_1 ∈ R^n, c_0 > 0, δ̄ > 1.
Step 0: Set i = 1.
Step 1: If c_{i−1} ≥ ĉ(x_i), set c_i = c_{i−1}. If c_{i−1} < ĉ(x_i), set c_i = max{δ̄ c_{i−1}, ĉ(x_i)}.
Step 2: Compute any x_{i+1} ∈ A(x_i, c_i). Set i = i + 1 and go to Step 1.

The procedure for choosing the penalty parameter is given in Step 1. In normal operation c_i is increased finitely often and then stays constant at c*, say. The purpose of ĉ is to ensure that c* is sufficiently large for P_{c*}: min{γ(x, c*)} to be equivalent to P. If x_i is the current value of x and c_i the current value of the penalty parameter, then A(x_i, c_i) is the set of all possible successors to x_i that can be generated by the algorithm, so that x_{i+1} ∈ A(x_i, c_i). Letting A be set valued covers the possibility that there exists more than one solution to the search direction subproblem. Let D and D_c, respectively, denote the sets of points satisfying first order necessary conditions for P and P_c: min{γ(x, c)}. The following result, proven in [6], gives sufficient conditions, in the form of properties of ĉ and A, for convergence to D.

¹ I.e. {p, λ, μ} satisfies the Kuhn-Tucker conditions of optimality for QP(x, H); this includes satisfaction of constraints.
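Step 1 of the Algorithm Model is a one-line rule; a minimal Python sketch (with δ̄ written delta_bar) is:

```python
def update_penalty(c_prev, c_hat, delta_bar=2.0):
    # Step 1: keep c when it already dominates the test value c_hat(x_i);
    # otherwise raise it at least geometrically, and at least to c_hat
    if c_prev >= c_hat:
        return c_prev
    return max(delta_bar * c_prev, c_hat)
```

In normal operation this rule fires finitely often, after which c stays constant at c*.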
Theorem 1. Suppose that, for all c > 0, A(·, c) has the following property:
(i) If {x_i} is any infinite sequence such that x_{i+1} ∈ A(x_i, c), c ≥ ĉ(x_i), for all i, then any accumulation point x* of {x_i} satisfies x* ∈ D_c.
Suppose ĉ: R^n → R has the following properties:
(ii) x ∈ D_c and c ≥ ĉ(x) ⟹ x ∈ D,
(iii) ĉ is continuous.
Then any sequence {x_i} generated by the Algorithm Model has the following properties:
(a) If c_{i−1} is increased finitely often (and then remains constant), any accumulation point x* of {x_i} satisfies x* ∈ D.
(b) If c_{i−1} is increased infinitely often, for i ∈ K say, then the infinite sequence {x_i}_{i∈K} has no accumulation points.

An immediate consequence of Theorem 1 is the following:

Corollary. If the sequence {x_i}, generated by the Algorithm Model, is bounded, then c_i is increased only finitely often, and any accumulation point x* of {x_i} is desirable (x* ∈ D).

It is possible, though unusual, for c_i to increase infinitely often; this defect currently appears to be inherent in any algorithm with an adjustable parameter unless further assumptions are made. We return to this point later and show that in our case {x_i} is indeed bounded under reasonable assumptions, so that c_i increases only finitely often. Since Theorem 1 is the most general convergence result available for algorithms with adjustable parameters (strictly speaking, minor extensions, such as replacing the test c ≥ ĉ(x) by t_c(x) ≤ 0 where t_c is semi-continuous, are possible), we shall employ it to construct our algorithm. We consider each component of the algorithm in turn.

Penalty parameter. A possible formula for choosing c is suggested by (10); ĉ(x) can be defined as the right hand side of (10) where {λ, μ} are the multipliers obtained by solving QP(x, H). However, QP(x, H) may not have a solution, and even when it does the multipliers may vary discontinuously with x.
Hence we replace {λ, μ} in (10) by continuous first order estimates {λ̃(x), μ̃(x)}, where λ̃: R^n → R^m and μ̃: R^n → R^r are defined by:

(λ̃(x), μ̃(x)) ≜ arg min{‖∇f(x) + g_x(x)^T λ + h_x(x)^T μ‖² + Σ_{j=1}^m [(ψ(x) − g^j(x))² (λ^j)²] + Σ_{j=1}^r [(ψ(x) − |h^j(x)|)² (μ^j)²] | λ ∈ R^m, μ ∈ R^r}.  (12)
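To make (12) concrete: with a single equality constraint and no inequalities, the minimization is one-dimensional in μ and the normal equation gives a closed form. The data below (f(x) = −x₁² + x₂, h(x) = x₁ + x₂, the example used in Section 5) are chosen purely for illustration:

```python
def mu_tilde(x):
    # (12) for one equality constraint: minimize over mu
    #   ||grad f(x) + mu * grad h(x)||^2 + (psi(x) - |h(x)|)^2 * mu^2.
    # With a single constraint psi(x) = |h(x)|, so the second term vanishes
    # and the normal equation gives mu = -<grad h, grad f> / ||grad h||^2.
    grad_f = [-2.0 * x[0], 1.0]          # f(x) = -x1^2 + x2 (illustrative)
    grad_h = [1.0, 1.0]                  # h(x) = x1 + x2 (illustrative)
    num = -(grad_h[0] * grad_f[0] + grad_h[1] * grad_f[1])
    den = grad_h[0] ** 2 + grad_h[1] ** 2
    return num / den                     # equals x1 - 1/2 for this problem
```

At x = (3, −3) this returns x₁ − ½ = 2.5, the estimate quoted for this problem in Section 5.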
The first term in the right hand side of (12) ensures that λ̃ and μ̃ are estimates of the multipliers, and the second and third terms (which are zero when x is optimal)
ensure (under conditions specified in Section 3) their continuity; if {x̂, λ̂, μ̂} is a Kuhn-Tucker triple for P, then λ̃(x̂) = λ̂ and μ̃(x̂) = μ̂. Solving (12) may be reduced to solving a set of linear equations. Our test function ĉ: R^n → R is now defined by:

ĉ(x) ≜ max{Σ_{j=1}^m λ̃^j(x) + Σ_{j=1}^r |μ̃^j(x)| + b, b}  (13)

where b is an arbitrary small constant. The b in the first term ensures that ĉ(x) exceeds the minimum value; the second b ensures that ĉ(x) > 0 (λ̃^j(x) may be negative). An alternative definition, perhaps less problem dependent, is:

ĉ(x) ≜ 1.1 {Σ |λ̃^j(x)| + Σ |μ̃^j(x)|}.

Search direction (search arc). Let p̂(x, H) denote any solution of QP(x, H) (if a solution exists). The algorithm selects p̂ for the search direction if this is consistent with convergence. Suitable conditions for acceptance are:

(i) a solution p̂(x, H) of QP(x, H) exists,
(ii) ‖p̂(x, H)‖ ≤ L,
(iii) θ(x, p̂(x, H), c) ≤ −T(x),  (14)
where L is a large positive constant and T: R^n → R is a continuous function satisfying T(x) ≥ 0, and T(x) = 0 if and only if x ∈ D. A suitable function is:

T(x) ≜ min{ε, [ψ(x) + ‖∇f(x) + g_x(x)^T λ̃(x)⁺ + h_x(x)^T μ̃(x)‖²]²}  (15)
where ε is a small positive constant and λ̃(x)⁺ denotes the vector whose ith component is λ̃^i(x)⁺. Test (iii) ensures that p̂(x, H), if accepted, is a descent direction for γ(x, c); the constant ε in (15) ensures that the test is easily satisfied (T(x) ≤ ε). If the tests are not all satisfied, an alternative search direction p̄(x, c) must be employed. A suitable choice for p̄(x, c) is the solution of:

QP_c(x): min{½ p^T H̄ p + θ(x, p, c) | p ∈ R^n}  (16)

where H̄ is any positive definite matrix (e.g. ηI, η > 0). Using the fact that:

θ(x, p, c) = f_x(x)p + c(ψ̄(x, p) − ψ(x)),  (17)

QP_c(x) is seen to be equivalent to the following quadratic programme:

θ'(x, c) ≜ min{f_x(x)p + ½ p^T H̄ p + c(w − ψ(x)) | g^j(x) + g^j_x(x)p ≤ w, j ∈ m; |h^j(x) + h^j_x(x)p| ≤ w, j ∈ r; w ≥ 0; p ∈ R^n}.

A solution p̄(x, c) to QP_c(x) always exists and is unique. Clearly

θ'(x, c) = θ(x, p̄(x, c), c) + ½ p̄(x, c)^T H̄ p̄(x, c).  (18)
A convergent algorithm, employing p̂(x, H) as a search direction if conditions (14) are satisfied and p̄(x, c) otherwise, can be constructed. However a superlinearly convergent algorithm requires, inter alia, that, near a solution, the search direction p̂(x, H) is always employed and that the step length is unity. The test conditions (14) are such that the former requirement is satisfied. An asymptotic step length of unity requires, in turn, that γ(x + p̂(x, H), c) < γ(x, c) (c and H appropriately chosen) near a solution. Maratos [7] has pointed out that this is not necessarily the case. Consider, for example, the case of a single equality constraint and suppose x is feasible (ψ(x) = 0); then x + p̂(x, H) lies on the plane tangential to the constraint surface at x. Unless the constraint is linear, ψ(x + p̂(x, H)) > 0, and hence, for all c sufficiently large, γ(x + p̂(x, H), c) > γ(x, c). Maratos gives an example which shows that the step length is truncated no matter how close the current iterate is to a solution. The difficulty occurs because x + p̂(x, H) satisfies the constraints to first order only. We overcome the difficulty by replacing the search direction strategy (x_{i+1} = x_i + α p̂(x_i, H_i)) by a search arc (x_{i+1} = x_i + α p̂(x_i, H_i) + α² p̃(x_i, H_i)), where α is the step length. It is easily seen that x + α p̂(x, H) + α² p̃(x, H) traces out an arc in R^n as α ranges over the interval [0, 1]; p̃ is chosen so that this arc satisfies the constraints to second order. This may be achieved by obtaining a second order approximation to the constraint surface at x. Since this involves calculating second order derivatives, which we wish to avoid, we obtain p̃ as follows. Let Î(x, H) denote the set of active inequality constraints predicted by QP(x, H), i.e.

Î(x, H) ≜ {j ∈ m | λ^j > 0}  (19)

where {p, λ, μ} is the Kuhn-Tucker triple for QP(x, H). If:

(a) T(x) < ε,  (20)

(b) there exists a solution p to

g^j(x + p̂(x, H)) + g^j_x(x)p = 0, j ∈ Î(x, H),  (21)

h(x + p̂(x, H)) + h_x(x)p = 0,  (22)

(c) the minimum norm solution to (21), (22) has a norm less than or equal to ½‖p̂(x, H)‖,

then p̃(x, H) is set equal to the minimum norm solution of (21), (22); else p̃(x, H) is set equal to the zero vector. Condition (a) ensures that p̃ is computed only in the neighbourhood of a solution. Condition (b) ensures that h(x + p̂ + p̃) ≈ h(x + p̂) + h_x(x)p̃ = 0 to second order, with a similar relation for the predicted active inequality constraints.
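For a single equality constraint the minimum norm solution of (22) is explicit, p̃ = −h(x + p̂) ∇h(x)/‖∇h(x)‖². A Python sketch of this special case (not the general minimum norm computation) is:

```python
def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def second_order_correction(x, p_hat, h, grad_h):
    # Minimum-norm solution of (22) for one equality constraint:
    # h(x + p_hat) + <grad h(x), p> = 0.
    xp = [xi + pi for xi, pi in zip(x, p_hat)]
    gh = grad_h(x)
    scale = -h(xp) / dot(gh, gh)
    return [scale * gi for gi in gh]
```

With h(x) = x₁² + x₂² − 1, x = (1, 0) feasible and the tangential step p̂ = (0, 0.5), this yields p̃ = (−0.125, 0); the corrected point x + p̂ + p̃ violates h by about 0.016, versus 0.25 for x + p̂, illustrating the second order agreement.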
Step length. Since γ is not continuously differentiable, the standard Armijo test, which employs the gradient of γ, cannot be employed. However, the essence of the Armijo test is the comparison of the actual change in γ with its first order estimate. Thus the standard Armijo procedure is easily modified using our estimate θ(x, αp, c) of γ(x + αp, c) − γ(x, c) (in place of αγ_x(x, c)p). It is easily shown that θ(x, αp, c) ≤ α θ(x, p, c) for all α ∈ [0, 1], so that the modified Armijo procedure is: choose the largest α ∈ S ≜ {1, β, β², …}, where β ∈ (0, 1), such that

γ(x + αp, c) − γ(x, c) ≤ δ α θ(x, p, c)  (23)

for some δ ∈ (0, 1) (e.g. δ = ½ for a 'first order' search direction and δ ∈ (0, ½) (e.g. δ = ¼) for a 'second order' search direction). We now have all the necessary ingredients to state the algorithm.

Main Algorithm
Data: x_1, H_1, c_0 ≥ 0, δ̄ > 1, L ∈ (0, ∞), ε ≪ 1, β ∈ (0, 1).
Step 0: Set i = 1.
Step 1: If c_{i−1} ≥ ĉ(x_i), set c_i = c_{i−1}. If c_{i−1} < ĉ(x_i), set c_i = max{δ̄ c_{i−1}, ĉ(x_i)}.
Step 2: If:
(α) a minimum norm solution p̂(x_i, H_i) of QP(x_i, H_i) exists,
(β) ‖p̂(x_i, H_i)‖ ≤ L,
(γ) θ(x_i, p̂(x_i, H_i), c_i) ≤ −T(x_i),
then compute p̃(x_i, H_i) and set p_i = p̂(x_i, H_i), p̃_i = p̃(x_i, H_i). Else set p_i = p̄(x_i, c_i).
Step 3: If p_i = p̂(x_i, H_i), compute the largest α_i ∈ S such that:

γ(x_i + α_i p_i + α_i² p̃_i, c_i) − γ(x_i, c_i) ≤ ¼ α_i θ(x_i, p_i, c_i)

and set x_{i+1} = x_i + α_i p_i + α_i² p̃_i. If p_i = p̄(x_i, c_i), compute the largest α_i ∈ S such that:

γ(x_i + α_i p_i, c_i) − γ(x_i, c_i) ≤ ½ α_i θ(x_i, p_i, c_i)

and set x_{i+1} = x_i + α_i p_i.
Step 4: Update H_i to H_{i+1}.
Step 5: Set i = i + 1 and go to Step 1.

Global convergence is established in Section 3, superlinear convergence in Section 4, and performance is discussed in Section 5; a specific procedure for updating H_i, as well as a modification to test (β) in Step 2 (L is replaced by k_1 δ_1^J, δ_1 ∈ (0, 1), J equal to the number of times p̂ has previously been accepted), are required for superlinear convergence.
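The modified Armijo rule (23) can be sketched in Python as follows; the value θ(x, p, c) is supplied by the caller, and δ, β are data of the Main Algorithm (the sketch omits the α² p̃ term used along the search arc):

```python
def armijo_alpha(x, p, c, gamma, theta_val, delta=0.5, beta=0.5, kmax=50):
    # (23): largest alpha in {1, beta, beta^2, ...} with
    # gamma(x + alpha*p, c) - gamma(x, c) <= delta * alpha * theta(x, p, c)
    g0 = gamma(x, c)
    alpha = 1.0
    for _ in range(kmax):
        xa = [xi + alpha * pi for xi, pi in zip(x, p)]
        if gamma(xa, c) - g0 <= delta * alpha * theta_val:
            return alpha
        alpha *= beta
    raise RuntimeError("no admissible step length")
```

Because the test compares the actual change in γ with a fraction of its first order estimate, no gradient of the non-differentiable γ is ever needed.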
3. Global convergence
For all x ∈ R^n let I(x) index the most-active inequality constraints and E(x) the most-active equality constraints, i.e.

I(x) ≜ {j ∈ m | g^j(x) = ψ(x)}  (24)

and

E(x) ≜ {j ∈ r | |h^j(x)| = ψ(x)}.  (25)
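The most-active index sets (24), (25) can be computed directly; a small Python sketch (constraints supplied as hypothetical callables):

```python
def most_active(x, g, h):
    # (24)-(25): indices of constraints attaining the max in psi(x)
    gv = [gj(x) for gj in g]
    hv = [abs(hj(x)) for hj in h]
    psi = max([0.0] + gv + hv)
    I = [j for j, v in enumerate(gv) if v == psi]
    E = [j for j, v in enumerate(hv) if v == psi]
    return I, E
```

For almost all x only one index attains the maximum, which is why I(x) ∪ E(x) is almost everywhere a singleton.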
Either I(x) or E(x) may be empty; indeed for almost all x in R^n there will exist only one integer in I(x) ∪ E(x), which is therefore a (small) subset of the active constraints. The following assumptions suffice to establish global convergence (in the sense of Theorem 1):
(H1) The functions f, g, h are continuously differentiable.
(H2) For all x the vectors {∇g^j(x), j ∈ I(x); ∇h^j(x), j ∈ E(x)} are linearly independent.
Note that, because of the almost everywhere low cardinality of I(x) ∪ E(x), (H2) is relatively mild given that F is non empty; it is certainly much milder than linear independence of all equality and active inequality constraints, and cannot be much further relaxed without destroying the ability of the algorithm to determine feasible points. To establish global convergence we prove that ĉ and A satisfy the hypotheses of Theorem 1. Because of space limitations only an outline of each proof is presented; full details are available in [12]. To proceed we need to define more precisely the sets D and D_c:

D ≜ {x ∈ R^n | (x, λ, μ) is a Kuhn-Tucker triple² for P}  (26)

and

D_c ≜ {x ∈ R^n | θ'(x, c) = θ(x, p̄(x, c), c) = 0}  (27)

(see (18) for the definitions of p̄ and θ'). It follows from the linear independence of the gradients of the most active constraints that a solution (λ̃(x), μ̃(x)) to (12) always exists and is unique, and that λ̃, μ̃ are continuous. Also, if (x̂, λ̂, μ̂) is a Kuhn-Tucker triple for P, then λ̂ = λ̃(x̂), μ̂ = μ̃(x̂). Hence we have:

Proposition 2. ĉ: R^n → R is continuous.
Next we note from (18) that θ'(x, c) ≤ 0, and that p̄(x, c) is a descent direction for γ(x, c) if θ'(x, c) ≤ θ(x, p̄(x, c), c) < 0, so that θ'(x, c) = 0 is a necessary condition of optimality for the unconstrained optimization problem:

P_c: min{γ(x, c) | x ∈ R^n}.

Hence D_c is a set of desirable points for P_c. Using the dual form of QP_c(x) (see (18)) the following result can be established [12]:

Proposition 3. Let c ≥ ĉ(x). Then x ∈ D_c if and only if x ∈ D.
² Hence satisfying ψ(x) = 0.
Proposition 3 is a typical exact penalty function result, establishing 'equivalence' of P and P_c. We have now established that ĉ satisfies hypotheses (ii) and (iii) of Theorem 1, so we turn our attention to A. For given x, c and H, let p(x; H, c) and α(x; H, c) denote, respectively, (any) search direction and step length generated by the algorithm; p(x; H, c) is equal either to p̂(x, H) (a solution of QP(x, H)) or to p̄(x, c) and, hence, is not necessarily unique; neither, therefore, is α(x; H, c). The set {x + α(x; H, c)p(x; H, c) + α(x; H, c)² p̃(x, H)} (p̃(x, H) ≜ 0 if p(x; H, c) = p̄(x, c)) of all possible successor points to x is denoted A(x; H, c). We recall that a solution p̄(x, c) to QP_c(x) always exists and is unique. It can also be established, given c > 0, that p̄(·, c) is continuous [12]. Moreover the continuity of λ̃ and μ̃ implies that T is continuous; also T(x) > 0 if x ∉ D. Hence π: R^n × R → R defined by π(x, c) = max{−T(x), θ'(x, c)} is continuous in x, and (by virtue of (27) and Proposition 3) π(x, c) < 0 if c ≥ ĉ(x) and x ∉ D_c. Since θ(x, p̂(x, H), c) ≤ −T(x) ≤ π(x, c) and θ(x, p̄(x, c), c) ≤ θ'(x, c) ≤ π(x, c), it follows that:

θ(x, p(x; H, c), c) ≤ π(x, c)  (28)

for all x and all H. Since θ(x, αp(x; H, c), c) is a first order estimate of γ(x + αp(x; H, c) + α² p̃(x, H), c) − γ(x, c) and since π(·, c) is continuous, it follows [12] that for all (x, c) such that c ≥ ĉ(x) and x ∉ D_c there exist ε > 0, δ > 0 such that

γ(x″, c) − γ(x′, c) ≤ −δ  (29)

for all x′ ∈ B(x, ε), all x″ ∈ A(x′; H, c) and all (symmetric) H. Hence [13] the following result can be established:

Proposition 4. Let x* be any accumulation point of an infinite sequence {x_i} satisfying x_{i+1} ∈ A(x_i; H, c) and c ≥ ĉ(x_i) for all i; then x* ∈ D_c.

Thus ĉ and A satisfy the hypotheses (i)-(iii) of Theorem 1, yielding:

Theorem 2. Let {x_i} be an infinite sequence generated by the main algorithm. Then {x_i} has the convergence properties specified in conclusions (a) and (b) of Theorem 1 and the corollary to Theorem 1.
4. Rate of convergence

To proceed further we need to strengthen our hypotheses. We replace (H1) by:
(H1') f, g and h are three times continuously differentiable,
and add a further assumption:
(H3) At each Kuhn-Tucker triple {x̂, λ̂, μ̂} for P the second order sufficiency assumptions hold with strict complementary slackness, i.e. λ̂^j > 0 for all j ∈ I(x̂), and L_xx(x̂, λ̂, μ̂) is positive definite on the subspace {p | g^j_x(x̂)p = 0, j ∈ I(x̂); h_x(x̂)p = 0}.
We also replace test (β) in Step 2 of the Main Algorithm by:

(β) ‖p̂(x_i, H_i)‖ ≤ k_1 δ_1^J  (30)

where k_1 > 0 and δ_1 ∈ (0, 1), and J is the number of times p̂(x_i, H_i) has satisfied the tests (α), (β) and (γ) in Step 2. To avoid choosing too many constants, k_1 can be set equal to, say, 10‖p̂(x_1, H_1)‖ or max{‖p̂(x_i, H_i)‖ | i = 1, …, n}. Choosing k_1 and δ_1 large makes satisfaction of the test easy. These additions enable a stronger convergence result, that {x_i} converges to a desirable point, to be proven.
Theorem 3. Let {x_i} be a bounded infinite sequence generated by the algorithm. Then x_i → x̂ ∈ D (and ‖x_{i+1} − x_i‖ → 0) as i → ∞.

The proof of this result relies on the fact [14] that the Kuhn-Tucker points (x is said to be a Kuhn-Tucker point if {x, λ, μ} is a Kuhn-Tucker triple) for P are isolated, and that p̂(x_i, H_i) satisfies (30) while p̄(·, c) is continuous, forcing the search direction p(x_i; H_i, c_i) to converge to the zero vector as i → ∞. Note that Theorem 3 holds no matter what method is used for updating H_i. Suppose, however, the following secant update is employed:

Step 4: Set H_{i+1} equal to H_i with column j = i mod(n) of H_i replaced by:

(1/Δ_i)[∇_x L(x_{i+1} + Δ_i e_j, λ̃(x_{i+1}), μ̃(x_{i+1})) − ∇_x L(x_{i+1}, λ̃(x_{i+1}), μ̃(x_{i+1}))]  (31)

where e_j is the jth unit vector and

Δ_i ≜ max{‖x_{i+1} − x_i‖, ε_i}.  (32)
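The column replacement in (31) is a forward-difference evaluation of one column of the Hessian of the Lagrangian. A Python sketch, with grad_L standing for ∇_x L(·, λ̃(x_{i+1}), μ̃(x_{i+1})) held fixed (an illustrative callable, not the paper's code):

```python
def secant_column_update(H, j, x_next, delta, grad_L):
    # (31): replace column j of H by a forward-difference approximation
    # of column j of the Hessian of the Lagrangian at x_next
    xe = list(x_next)
    xe[j] += delta                       # x_next + delta * e_j
    g1, g0 = grad_L(xe), grad_L(x_next)
    col = [(a - b) / delta for a, b in zip(g1, g0)]
    Hnew = [row[:] for row in H]         # copy, then overwrite column j
    for i in range(len(H)):
        Hnew[i][j] = col[i]
    return Hnew
```

When grad_L is linear (quadratic Lagrangian), the difference quotient recovers the Hessian column exactly for any Δ_i > 0.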
Since xᵢ → x̂ ∈ D as i → ∞, so that λ̂(xᵢ) → λ̂ and μ̂(xᵢ) → μ̂ where (x̂, λ̂, μ̂) is a Kuhn–Tucker triple for P, it follows [15, 16] that Hᵢ → H̄ ≜ Lₓₓ(x̂, λ̂, μ̂). For all i let {p̂ᵢ, λᵢ, μᵢ} denote a Kuhn–Tucker triple for QP(xᵢ, Hᵢ) (i.e. p̂ᵢ = p̂(xᵢ, Hᵢ)). Since xᵢ → x̂ and Hᵢ → H̄, an analysis based on [14] can be employed to show that λᵢ → λ̂ and μᵢ → μ̂, that for all i sufficiently large the solution p̂ᵢ is unique, and that p̂ᵢ → 0. Also, for all i sufficiently large, Î(xᵢ, Hᵢ) = I(x̂) (see (19) and (24)), so that the active constraints are correctly predicted. An analysis, similar to that in [17], shows that θ(xᵢ, p̂ᵢ, cᵢ) ≤ −T(xᵢ) for all i sufficiently large (satisfaction of test γ). It is shown in [12] that for all i sufficiently large p̃ᵢ exists, that ‖p̃ᵢ‖ = O[‖p̂ᵢ‖²] and that ‖ψ(xᵢ + p̂ᵢ + p̃ᵢ)‖ = O[‖p̂ᵢ‖³], where p̃ᵢ ≜ p̃(xᵢ, cᵢ) (this justifies the choice of p̃), and consequently that, for all i sufficiently large:

γ(xᵢ + p̂ᵢ + p̃ᵢ, cᵢ) − γ(xᵢ, cᵢ) ≤ αθ(xᵢ, p̂ᵢ, cᵢ).   (33)
D.Q. Mayne and E. Polak/ Superlinearly convergent algorithm
Finally it can be shown, using results from [18], that for all i sufficiently large ‖p̂ᵢ‖ ≤ k₁δ₁ʲ (satisfaction of test β) and:

‖zᵢ₊₂ − zᵢ₊₁‖ ≤ βᵢ‖zᵢ₊₁ − zᵢ‖   (34)

where zᵢ ≜ {xᵢ, λᵢ₋₁, μᵢ₋₁} and βᵢ → 0 as i → ∞. It follows, therefore, that for all i sufficiently large, tests α, β and γ in Step 2 of the main algorithm are satisfied; for such i, p(xᵢ, Hᵢ, cᵢ) = p̂(xᵢ, Hᵢ) = p̂ᵢ. Also, for all i sufficiently large, T(xᵢ) ≤ ε, p̃ᵢ = p̃(xᵢ, cᵢ) exists and ‖p̃ᵢ‖ ≤ ½‖p̂ᵢ‖ (satisfying (20)–(22)); since (33) is also satisfied, we obtain:

xᵢ₊₁ = xᵢ + p̂ᵢ + p̃ᵢ,   (35)

i.e. a step length of unity. Finally (34) shows that convergence is superlinear, yielding:

Theorem 4. Suppose {xᵢ} is a bounded infinite sequence generated by the main algorithm using the secant update. Then xᵢ → x̂ ∈ D superlinearly.

The secant updating procedure in Step 4 can be replaced by other procedures such as the BFGS procedure with Powell's modification [17]. Theorem 3 (convergence of {xᵢ} to x̂) remains true, but this does not now necessarily imply convergence of Hᵢ. It is shown in [17] that if xᵢ → x̂ ∈ D as i → ∞, where xᵢ₊₁ = xᵢ + p̂ᵢ and p̂ᵢ = p̂(xᵢ, Hᵢ) for all i, then xᵢ converges superlinearly. Since ‖p̃ᵢ‖ = O[‖p̂ᵢ‖²], setting xᵢ₊₁ = xᵢ + p̂ᵢ + p̃ᵢ does not destroy superlinear convergence, so that Theorem 4 remains true under the added assumption that p̂ᵢ is accepted for all i sufficiently large.
5. Discussion
We discuss various points made by colleagues and referees, thus enabling a fairly penetrating appraisal of the algorithm to be made.

How useful is the global convergence result, since it permits case (b) in Theorem 1? Note firstly that the Corollary to Theorem 1 is a direct consequence of (b). Secondly, there do exist rules for choosing c (e.g. cᵢ = ĉ(xᵢ)) for which it is possible for the subsequence {xᵢ}ᵢ∈K to have accumulation points which are not necessarily desirable; this possibility is excluded in our algorithm. Most importantly, however, it should be recognized that case (b) can occur. Consider, for example, the problem: min{−(x₁)² + x₂ | x₁ + x₂ = 0}. The exact penalty function is −(x₁)² + x₂ + c|x₁ + x₂| and the first order multiplier estimate is λ̂(x) = x₁ − ½. As i → ∞, x₁ᵢ → +∞ and x₂ᵢ → −∞, so that ĉ(xᵢ) = |λ̂(xᵢ)| + b → ∞. However, {xᵢ} is bounded in general if a few relatively minor additional assumptions are satisfied. Suppose that v: ℝ₊ → ℝ, defined by v(α) = inf{f(x) | ψ(x) = α}, is bounded from
below (v(α) ≥ d for all α ≥ 0). This is certainly true if f is bounded from below. Since cᵢ ≥ c₁, it can be shown that ψ(xᵢ) ≤ e for all i, where e = [γ(x₁, c₁) − d]/c₁. Hence {xᵢ} is bounded if V ≜ {x | ψ(x) ≤ e} is compact. Since V is the intersection of the m + 2r sets {x | gʲ(x) ≤ e}, j = 1, …, m, and {x | hʲ(x) ≤ e}, {x | hʲ(x) ≥ −e}, j = 1, …, r, V is certainly compact if any of the latter sets are. Summarising, {xᵢ} is bounded if v (or f) is bounded from below and V is compact.

Should not the scheme for c permit high initial values and low final values, since this has been found to be practically useful? By contrast the procedure for c in Step 1 has a ratchet effect; {cᵢ} is non-decreasing. The actual formula in Step 1 is unimportant; what is necessary for the convergence result is that cᵢ → ∞ if cᵢ is changed infinitely often. The counter example in [19] shows that cycling can occur even if apparently sensible procedures (without this property) are employed. Nevertheless our results are asymptotic, and it is possible to employ any heuristic for choosing c finitely often without destroying our conclusions. High initial values of c can be achieved by replacing Step 1 of the algorithm by: If c̄ᵢ₋₁ ≥ ĉ(xᵢ), set c̄ᵢ = c̄ᵢ₋₁. If c̄ᵢ₋₁ < ĉ(xᵢ), set c̄ᵢ = max{δc̄ᵢ₋₁, ĉ(xᵢ)}. Set c̃ᵢ = ½c̃ᵢ₋₁. Set cᵢ = max{c̄ᵢ, c̃ᵢ}. If c̃₀ is chosen to be high, then for the first iterations at least, cᵢ = c̃₀/2ⁱ. Asymptotically cᵢ will be dictated by the original procedure. Alternatively cᵢ can be set equal to a large constant for a specified number of iterations. A low final value of cᵢ can be achieved by adding yet another instruction to Step 1 which reduces cᵢ to ĉ(xᵢ) the first time (or the first few times) the condition ψ(xᵢ) ≤ δ₂ (or T(xᵢ) ≤ δ₂), δ₂ ≪ 1, is satisfied. One more criticism of the rule for choosing c can be levelled: setting c > ĉ(x) does not guarantee that p̂(x, H) is a descent direction for γ(x, c) (given that H is positive definite).
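The modified Step 1 just described, which permits a high initial penalty parameter, can be sketched as follows (our reconstruction under the notation above; ĉ(xᵢ) is assumed to be supplied by the caller):

```python
def step1_penalty(c_bar_prev, c_tilde_prev, c_hat, delta=1.5):
    """Modified Step 1: the original ratchet produces the non-decreasing
    sequence c_bar; a halving sequence c_tilde permits a high initial
    value; the parameter actually used is the larger of the two."""
    if c_bar_prev >= c_hat:
        c_bar = c_bar_prev
    else:
        c_bar = max(delta * c_bar_prev, c_hat)
    c_tilde = 0.5 * c_tilde_prev
    return c_bar, c_tilde, max(c_bar, c_tilde)
```

With c̃₀ = 100 and ĉ(xᵢ) small, the first few iterations give cᵢ = 100/2ⁱ, after which the ratchet takes over, exactly as described above.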
This difficulty can be overcome by defining ĉ(x) as 1.1[Σⱼ |λʲ| + Σⱼ |μʲ|], where {p, λ, μ} is a Kuhn–Tucker triple for QP(x, H). Although ĉ is now no longer continuous, it nevertheless can be shown that Theorem 1 is still true.

Could a better exact penalty be chosen? The conditions in Step 2 can be chosen so that the second order search direction p̂ is nearly always chosen (always, sufficiently near a solution), in which case the exact penalty plays a role only in the determination of step length. It would not appear to matter very much which exact penalty function is employed for this purpose (if the constraints are prescaled). However, p̃ must be a descent direction for the exact penalty function and must therefore cope with non-differentiability whatever penalty function is chosen. The search direction p̃ (which exists and is unique for all x) does this by considering all constraint gradients, not only the most active (see (16)–(18)). This search direction has been employed by the authors in complex design problems (e.g. with constraints of the form max{φ(x, α) | α ∈ 𝒜} ≤ 0, 𝒜 ⊂ ℝᵖ) with success. However, the algorithm can be easily modified to use other penalty functions such as γ(x, c) ≜ f(x) + Σⱼ cʲgʲ(x)⁺ + Σⱼ dʲ|hʲ(x)|, where c is now
a vector with components cʲ, j = 1, …, m, and dʲ, j = 1, …, r. Let the components of ĉ be defined by ĉʲ(x) ≜ 1.1|λ̂ʲ(x)|, j = 1, …, m, and d̂ʲ(x) ≜ 1.1|μ̂ʲ(x)|, j = 1, …, r. This ĉ can be employed in Step 1, the operations there being applied to each component. The estimate γ̃(x, p, c) of γ(x + p, c) is obtained by replacing gʲ(x + p) by gʲ(x) + gₓʲ(x)p, with a similar substitution for hʲ(x + p). As before θ(x, p, c) is defined as γ̃(x, p, c) − γ(x, c), and the descent direction p̃(x, c) is the solution of (16); this is equivalent to the following quadratic program:

θ̃(x, c) = min{fₓ(x)p + ½pᵀH̃p + Σⱼ cʲ(wʲ − gʲ(x)⁺) + Σⱼ dʲ(vʲ − |hʲ(x)|) | gʲ(x) + gₓʲ(x)p ≤ wʲ; |hʲ(x) + hₓʲ(x)p| ≤ vʲ; p ∈ ℝⁿ; wʲ ≥ 0, j = 1, …, m; vʲ ≥ 0, j = 1, …, r}.

The algorithm is otherwise unaltered; our results hold for the modified version, which is probably advantageous if the constraints are not scaled, although the dimensionality of the quadratic program for determining p̃ has been increased.

Next it should be questioned whether the additional constants introduced (e.g. c₀, δ, δ₁, k₁, ε, b) are as difficult to choose as the penalty parameter, negating the advantage of an automatic rule for choosing c. First of all, the algorithm has the stated convergence properties whatever values the additional constants have in their specified ranges, whereas a wrong choice for a constant penalty parameter prevents convergence or makes it difficult. Secondly, it is easy to choose suitable constants, modifying the algorithm if necessary, to make the parameters relatively problem independent. Thus c₀ = 0 is suitable, especially if the modification to permit high initial values of cᵢ is employed; in Step 1, δ ∈ [1.1, 2] would appear suitable; the constants k₁ and δ₁ appear in test (β) of Step 2 (see (30)) and can be chosen as stated in Section 3; the constant ε (occurring in the definition of T) is, like k₁ and δ₁, chosen to make the tests easy to satisfy and is thus made small (e.g. 10⁻⁸); the constant b is problem dependent, so that the definition of ĉ should be modified to, for example, 1.1[Σⱼ |λ̂ʲ(x)| + Σⱼ |μ̂ʲ(x)|], which is problem independent. A final constant is the matrix H̃ appearing in the quadratic program defining p̃; the choice of H̃ is discussed below.

Could the algorithm be simpler? The major complication compared with Han's algorithm is the solving of an extra quadratic program QPc(x) if p̂ does not satisfy tests α, β and γ in Step 2. However, test α (existence of p̂) is implicit in any algorithm of this type, and provision must be made to cope with failure, e.g. by relaxing the constraints. But this involves solving another quadratic program, so the complexity is substantially the same. The other addition is the rule for changing c (which requires the solution of a set of linear equations) and the determination of p̃ (also requiring the solution of linear equations) near a solution. These are relatively minor and carry with them the advantages of convergence and closer following of the constraint surface respectively. However, it is desirable that the definition of λ̂, μ̂ (see (12)) be modified to
improve the conditioning of the resultant linear equations. Finally it should be noted that the strategy of employing p̃ if p̂ does not satisfy certain tests guarantees convergence even if the second order assumptions are not satisfied.

The choice of H̃ is not significant if p̂ is (nearly) always selected; the choice H̃ = I will always work, but may cause p̃ to be badly scaled. To see what is a good choice for H̃, note that QPc(x) is the exact penalty function equivalent of QP(x, H) if H̃ is set equal to H and c satisfies (10). Hence if the BFGS update (with Powell's modification) is employed, a good rule for H̃ is to set it equal, in succession, to Hᵢ₁, Hᵢ₂, …, Hᵢⱼ where J is finite. The equivalence of QPc(x) and QP(x, H) suggests also that p̂(xᵢ, Hᵢ) be determined by solving QPc(xᵢ) with H̃ = Hᵢ rather than by solving QP(xᵢ, Hᵢ). This procedure automatically relaxes the constraints (a solution always exists and, if Hᵢ is positive definite, the solution is unique), allowing test α to be discarded. The resultant algorithm is fairly close to that suggested recently [20] in an interesting paper. The algorithm requires a quadratic program which yields a minimum norm solution to QPc(x). Any quadratic program which first determines the nearest feasible point and then generates feasible descent directions will suffice. If another type of quadratic program is employed, then tests β and γ (which can easily be expressed as linear constraints) should be incorporated as extra (pseudo) constraints in QP(x, H). Solution of the modified problem automatically satisfies tests β and γ, which can therefore be discarded. It should be evident that the algorithm is a useful 'workhorse' which can be modified in many ways to improve performance without affecting its asymptotic properties.

Two versions of the algorithm were tested. Both used the modification presented above to give high initial values of c, with c̃₀ set equal to 100.
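The componentwise penalty γ(x, c) = f(x) + Σⱼ cʲgʲ(x)⁺ + Σⱼ dʲ|hʲ(x)| introduced in this section, together with its first-order estimate γ̃(x, p, c), can be sketched as follows (our illustration only; the callables f, fx, g, gx, h, hx are hypothetical, and we assume, as an illustration choice, that f is also kept to first order in the estimate):

```python
import numpy as np

def gamma(x, f, g, h, c, d):
    """Componentwise exact penalty: f(x) + sum_j c_j g_j(x)^+ + sum_j d_j |h_j(x)|."""
    return f(x) + c @ np.maximum(g(x), 0.0) + d @ np.abs(h(x))

def gamma_tilde(x, p, f, fx, g, gx, h, hx, c, d):
    """Estimate of gamma(x + p, c) with g_j and h_j replaced by their
    linearizations at x; theta(x, p, c) = gamma_tilde(x, p, c) - gamma(x, c)."""
    g_lin = g(x) + gx(x) @ p
    h_lin = h(x) + hx(x) @ p
    return f(x) + fx(x) @ p + c @ np.maximum(g_lin, 0.0) + d @ np.abs(h_lin)
```

By construction θ(x, 0, c) = 0, and θ(x, p, c) < 0 signals a predicted decrease of the penalty along p.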
The first version employed the secant update (31) with ε₁ = 10⁻⁸, using a program in [21] for solving QP(xᵢ, Hᵢ); since this program does not determine asymptotically a minimum norm solution, the descent direction constraint θ(x, p, c) ≤ −T(x) (test γ) was incorporated in QP(x, H) as an additional linear constraint ⟨∇f(x), p⟩ − cψ(x) ≤ −T(x). The second version employed the BFGS update with Powell's modification. Step 1 employed the (problem dependent) rule cᵢ = max{cᵢ₋₁ + 0.1, ĉ(xᵢ)} rather than cᵢ = max{δcᵢ₋₁, ĉ(xᵢ)}. The other constants were b = 0.1 (definition of ĉ), ε = 10⁻⁸ (definition of T), δ₁ = 0.95, k₁ = 100 (test β), β = ½ (Armijo procedure (23)) and H̃ = ½I. The results obtained are shown in Table 1, where Powell's problem is a problem in [8] and Colville 1, 3 refer to problems in [22]. In Colville 1 the initial point was perturbed by 10⁻⁵ to obtain satisfaction of (H2).

Table 1

              Modified BFGS update        Secant update
Problem       Iterations   T^(1/2)        Iterations   T^(1/2)
Powell's      11           1.5·10⁻¹⁰      18           2.2·10⁻¹⁰
Colville 1     7           8.4·10⁻¹⁰       9           1.7·10⁻¹⁰
Colville 3     3           6.6·10⁻⁸        3           6.6·10⁻⁸
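The Armijo procedure (23) referred to above, with halving factor β = ½, can be sketched generically as follows (our illustration; `gamma_along` is a hypothetical callable evaluating the exact penalty along the chosen search direction, and theta is its predicted decrease, theta < 0 for a descent direction):

```python
def armijo_step(gamma_along, theta, alpha=0.5, beta=0.5, max_halvings=30):
    """Armijo step-length rule on the exact penalty function: accept the
    largest step beta^k achieving the fraction alpha of the predicted
    decrease theta."""
    step = 1.0
    gamma0 = gamma_along(0.0)
    for _ in range(max_halvings):
        if gamma_along(step) - gamma0 <= alpha * step * theta:
            return step
        step *= beta
    return step
```

Near a solution the full step of unity is accepted, which is what makes the superlinear rate attainable.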
Acknowledgment

The authors gratefully acknowledge the assistance of A. Heunis who critically reviewed an initial draft of the paper, computed the examples quoted above, and contributed to some of the proofs.
References

[1] E.S. Levitin and B.T. Polyak, "Constrained minimization methods", USSR Computational Mathematics and Mathematical Physics 6 (1966) 1-15.
[2] R.B. Wilson, "A simplicial algorithm for concave programming", Ph.D. Dissertation, Graduate School of Business Administration, Harvard University (Cambridge, MA, 1963).
[3] E.M.L. Beale, "Numerical methods", in: J. Abadie, ed., Nonlinear programming (North-Holland, Amsterdam, 1967) pp. 132-205.
[4] S.M. Robinson, "A quadratically-convergent algorithm for general nonlinear programming problems", Mathematical Programming 3 (1972) 145-156.
[5] S.P. Han, "A globally convergent method for nonlinear programming", Journal of Optimization Theory and Applications 22 (1977) 297-309.
[6] D.Q. Mayne and N. Maratos, "A first order exact penalty function algorithm for equality constrained optimization problems", Mathematical Programming 16 (1979) 303-324.
[7] N. Maratos, "Exact penalty function algorithms for finite dimensional and control optimization problems", Dissertation, Imperial College of Science and Technology, University of London (1978).
[8] M.J.D. Powell, "A fast algorithm for nonlinearly constrained optimization calculations", in: G.A. Watson, ed., Numerical analysis, Lecture Notes in Mathematics, Volume 630 (Springer-Verlag, Berlin, 1978) pp. 144-157.
[9] A.R. Conn, "Constrained optimization using a nondifferentiable penalty function", SIAM Journal on Numerical Analysis 10 (1973) 760-784.
[10] A.R. Conn and T. Pietrzykowski, "A penalty function method converging directly to a constrained optimum", SIAM Journal on Numerical Analysis 14 (1977) 348-378.
[11] E. Polak, "On the global stabilization of locally convergent algorithms", Automatica 12 (1976) 337-342.
[12] D.Q. Mayne and E. Polak, "A superlinearly convergent algorithm for constrained optimization problems", Computing and Control Publication 78/52, Imperial College of Science and Technology (London, 1978).
[13] E. Polak, Computational methods in optimization: A unified approach (Academic Press, New York, 1971).
[14] S.M. Robinson, "Perturbed Kuhn-Tucker points and rates of convergence for a class of nonlinear programming algorithms", Mathematical Programming 7 (1974) 1-16.
[15] E. Polak and I. Teodoru, "Newton derived methods for nonlinear equations and inequalities", in: O.L. Mangasarian, R.R. Meyer and S.M. Robinson, eds., Nonlinear programming 2 (Academic Press, New York, 1975) pp. 244-277.
[16] E. Polak, "A globally converging secant method with applications to boundary value problems", SIAM Journal on Numerical Analysis 11 (1974) 529-537.
[17] M.J.D. Powell, "The convergence of variable metric methods for nonlinearly constrained optimization problems", in: O.L. Mangasarian, R.R. Meyer and S.M. Robinson, eds., Nonlinear programming 3 (Academic Press, New York, 1978) pp. 27-63.
[18] U.M. Garcia Palomares and O.L. Mangasarian, "Superlinearly convergent quasi-Newton algorithms for nonlinearly constrained optimization problems", Mathematical Programming 11 (1976) 1-13.
[19] R.M. Chamberlain, "Some examples of cycling in variable metric methods for constrained minimization", Mathematical Programming 16 (1979) 378-383.
[20] R. Fletcher, "Numerical experiments with an exact L1 penalty function method", in: O.L. Mangasarian, R.R. Meyer and S.M. Robinson, eds., Nonlinear programming 4 (Academic Press, New York, 1981).
[21] S.J. Byrne and R.W.H. Sargent, "A version of Lemke's algorithm using only elementary principal pivots", Presented at the Tenth International Symposium on Mathematical Programming (Montreal, August 27-31, 1979).
[22] D.M. Himmelblau, Applied nonlinear programming (McGraw-Hill, New York, 1972) p. 404 and p. 406.
Mathematical Programming Study 16 (1982) 62-83 North-Holland Publishing Company
COMPUTATION OF THE SEARCH DIRECTION IN CONSTRAINED OPTIMIZATION ALGORITHMS†

Walter MURRAY and Margaret H. WRIGHT

Systems Optimization Laboratory, Department of Operations Research, Stanford University, Stanford, CA 94305, U.S.A.

Received 26 February 1980
Revised manuscript received 26 November 1980
Several algorithms for constrained optimization are based on the idea of choosing the search direction as the solution of a quadratic program (QP). However, there is considerable variation in the precise nature of the quadratic program to be solved. Furthermore, significant differences exist in the procedures advocated to ensure that the search direction is well defined, and some algorithms abandon the quadratic programming approach for particular iterations under certain conditions. In this paper, we discuss some crucial differences in formulation and solution that arise in QP-based methods for linearly constrained and nonlinearly constrained optimization, with particular emphasis on the treatment of inequality constraints. For linearly constrained problems, we consider the effect of formulating the constraints of the QP sub-problem as equalities or inequalities. In the case of nonlinear constraints, the issues to be discussed include incompatibility or ill-conditioning of the constraints, determination of the active set, Lagrange multiplier estimates, and approximation of the Lagrangian function.
Key words: Linearly Constrained Optimization, Active Set Strategies, Nonlinearly Constrained Optimization, Lagrange Multiplier Estimates, Quadratic Programming Sub-problem. This research was supported by the Department of Energy Contract DE-AS03-76SF00326, PADE-AT-03-76ER72018; the National Science Foundation Grants MCS-7926009 and ENG7706761A01; and the U.S. Army Research Office Contract DAAG29-79-C-0!I0.
1. Introduction
The methods to be considered in this paper are designed to solve the inequality-constrained nonlinear programming problem
P1:  minimize   F(x),   x ∈ ℝⁿ,
     subject to  cᵢ(x) ≥ 0,   i = 1, …, m,
where F (the objective function) and {cᵢ} (the constraint functions) are twice continuously differentiable. No further properties (such as convexity) will be assumed, and hence we shall be concerned only with local minima. The major focus of the paper will be on the treatment of inequality constraints.

† Presented at the Tenth International Symposium on Mathematical Programming, Montréal, August 1979.

The
algorithms to be considered can be applied to equality constraints in a straightforward manner; in fact, many of the complications to be discussed do not arise with equality constraints.

A typical iteration of a method for solving P1 includes the following procedure. If x, the current iterate, does not satisfy the appropriate optimality conditions: (i) compute a search direction p by solving a sub-problem; (ii) determine a step α such that specified properties hold at x + αp (this portion of the iteration is usually termed the step length procedure or line search). Following these steps, x + αp becomes the new iterate.

In this paper, we shall be concerned with algorithms in which computation of the search direction is achieved by solving a quadratic programming sub-problem. Methods of this type have been suggested for many years, and have recently enjoyed a surge of popularity, especially for nonlinearly constrained problems. Although it can be shown that, under certain conditions, QP-based methods are 'equivalent' (with exact arithmetic) to other classes of methods [34], our sole concern will be with methods in which a QP sub-problem is explicitly formulated and solved.

We shall examine and contrast the two extremes of QP sub-problems that might be posed when solving constrained problems with this approach. At one extreme, an equality-constrained QP (EQP) is solved whose linear constraints represent a subset of the original constraints. At the other extreme, the sub-problem is an inequality-constrained QP (IQP), in which all the original constraints are represented as inequalities. The two approaches differ in several ways: the effort required to solve the sub-problem; the information required about the original problem functions; the theoretical properties of the resulting search direction; and the information returned after the sub-problem has been solved. All these differences affect the performance of the outer algorithms.
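The typical iteration described above can be sketched generically (our illustration; the three callables stand in for whatever optimality test, sub-problem, and step-length procedure a particular method uses):

```python
def outer_loop(x, optimal, search_direction, step_length, max_iter=100):
    """Generic outer iteration: while x is not optimal, compute a search
    direction p from a sub-problem and a step alpha, then set x := x + alpha*p."""
    for _ in range(max_iter):
        if optimal(x):
            return x
        p = search_direction(x)
        alpha = step_length(x, p)
        x = x + alpha * p
    return x
```

The papers' point is that everything interesting happens inside `search_direction`: the rest of the loop is shared by EQP- and IQP-based methods alike.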
Any algorithm for P1 must include some procedure, usually termed an active set strategy, for determining which constraints hold with equality at the solution. The decision about the active set is one of the most important differences between the pure EQP and IQP approaches. Algorithms based on solving an EQP include a pre-assigned active set strategy that specifies which constraints are to be treated as equalities in the QP; the term 'pre-assigned' signifies that the decision about the active set is made before posing the QP sub-problem. With a pure IQP approach, on the other hand, a QP-assigned active set strategy is used, in that the set of constraints active at the solution of the QP will be taken as a prediction of the active set of the original problem. Between the two extremes, there are many variants in formulation of QP sub-problems, some of which are briefly noted. By discussing the features of the two 'pure' strategies, we hope to shed light on the course of action most likely to be successful.
2. Notation and assumptions

The following definitions will be used throughout:

g(x) ≡ ∇F(x),   G(x) ≡ ∇²F(x),   aᵢ(x) ≡ ∇cᵢ(x),   Gᵢ(x) ≡ ∇²cᵢ(x).
The vector ĉ(x) will be used to represent the set of constraints that are considered active at x; let âᵢ(x) ≡ ∇ĉᵢ(x) and Ĝᵢ(x) ≡ ∇²ĉᵢ(x). The matrix A(x) will denote the matrix whose ith column is aᵢ(x), and similarly for Â(x). Where the meaning is clear, the argument of a function may be suppressed, i.e., A = A(x). The vector x* will denote a solution of P1. In general, it will be assumed that the first- and second-order Kuhn–Tucker conditions (see [11]) hold at x*, namely:

(i) there exist Lagrange multipliers {λᵢ*} corresponding to the active constraints, such that

g(x*) = Â(x*)λ*,   (1a)
λᵢ* ≥ 0,   i = 1, …, t.   (1b)
(ii) Let Z(x) denote a matrix whose columns form a basis for the set of vectors orthogonal to the columns of Â(x), so that Â(x)ᵀZ(x) = 0. Then the matrix

Z(x*)ᵀ(G(x*) − Σᵢ₌₁ᵗ λᵢ*Ĝᵢ(x*))Z(x*)   (2)

is positive definite.
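Numerically, conditions (1) and (2) can be verified along the following lines (our sketch, assuming Â(x*) has full column rank; W stands for the Hessian of the Lagrangian, G(x*) − Σᵢ λᵢ*Ĝᵢ(x*), which the caller is assumed to supply):

```python
import numpy as np

def kuhn_tucker_check(g, A_hat, W, tol=1e-8):
    """Check (1a)-(1b) and (2) at a candidate x*: least-squares multipliers
    for g = A_hat @ lam, the sign test lam >= 0, and positive definiteness
    of Z^T W Z with Z a basis for the null space of A_hat^T."""
    lam, *_ = np.linalg.lstsq(A_hat, g, rcond=None)
    stationary = np.linalg.norm(g - A_hat @ lam) <= tol           # (1a)
    signs_ok = bool(np.all(lam >= -tol))                          # (1b)
    q, _ = np.linalg.qr(A_hat, mode='complete')
    Z = q[:, A_hat.shape[1]:]                                     # null(A_hat^T)
    curvature_ok = Z.size == 0 or bool(
        np.all(np.linalg.eigvalsh(Z.T @ W @ Z) > tol))            # (2)
    return stationary and signs_ok and curvature_ok, lam
```

For example, at the minimizer x* = (1, 0) of ½‖x‖² subject to x₁ ≥ 1, the test accepts with multiplier λ* = 1.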
3. Methods for solving quadratic programs

Before considering different QP sub-problems, we present a brief overview of some aspects of solving the two quadratic programs of interest.

3.1. Equality-constrained QP

The problem to be considered is

minimize   ½pᵀHp + pᵀd,   p ∈ ℝⁿ,   (3a)
subject to  Âᵀp = b̂,   (3b)
where the matrix Â contains t columns. Note that (3) has a solution only if the constraints (3b) are consistent; this will always be true if the columns of Â are linearly independent (in which case t ≤ n). Let r be the rank of Â, and let the r columns of a matrix Y form a basis for the range space of Â. Similarly, the (n − r) columns of a matrix Z are assumed to form a basis for the set of vectors orthogonal to the columns of Â, i.e.

ÂᵀZ = YᵀZ = 0.   (4)
The matrix Z depends on Â, but we shall generally suppress this dependence for economy of notation. In general, the coefficient vectors in a particular set of linear equalities define a subspace, and Z is meant to be a generic representation of a basis for its orthogonal complement. The solution of (3), p*, can be written as

p* = Yp_Y + Zp_Z.   (5)
Substituting (5) into (3b) gives

Âᵀp* = ÂᵀYp_Y = b̂.   (6)
The t × r matrix ÂᵀY must contain an r × r non-singular submatrix. Because the constraints are consistent, p_Y is uniquely determined by any r independent equations of (6), the remaining equations being satisfied automatically. The vector p_Z is determined by minimization of the quadratic form (3a) with respect to the remaining (n − r) degrees of freedom. Substituting (5) into (3a), differentiating with respect to the unknown p_Z, and equating the derivative to zero, we obtain the linear system

ZᵀHZp_Z = −Zᵀd − ZᵀHYp_Y.   (7)
If ZᵀHZ is non-singular, p_Z is unique. If ZᵀHZ is positive definite, (5) is the desired solution of (3). If ZᵀHZ is positive semi-definite, p_Z is not unique; however, the quadratic form may nonetheless have a finite minimum value in the subspace defined by Z. If ZᵀHZ is indefinite, the vector p* defined by (5), (6) and (7) is not a local minimum of (3), and the quadratic function (3a) is not bounded below. Even if ZᵀHZ is positive definite, the projection of H onto the range of Â can have negative eigenvalues. However, it is significant in some QP-based methods that there exists a finite ρ̄ > 0 such that H + ρÂÂᵀ is positive definite for all ρ > ρ̄ (see [12]). Furthermore, the solution of (3) is unaltered if d in (3a) is replaced by d + Âs for any vector s, or if the matrix H is replaced by H + ÂSÂᵀ for any symmetric matrix S. There are many alternative ways of computing the solution of (3). The key advantage of the method outlined above is that the determination of whether the solution to (3) is well defined can be made during the computation. Other
methods may yield a constrained stationary point without being able to check whether it is a minimum, saddle point, or maximum. 'Ideal' choices of the matrices Y and Z can be obtained from the complete orthogonal factorization of Â, which is usually computed by applying simple orthogonal transformations with column interchanges (see [33] for a full discussion). This process also provides an extremely reliable estimate of the rank (see [17]).

3.2. Inequality-constrained QP

The problem of concern can be stated as

minimize   ½pᵀHp + pᵀd,   p ∈ ℝⁿ,   (8a)
subject to  Aᵀp ≥ b,   (8b)
where A has m columns. Unlike the equality-constrained case, where the solution can be determined by solving just two linear systems, in general the solution of (8) must be found by iteration; and although the number of iterations is bounded by a finite number involving the numbers of unknowns and constraints, it is potentially very large. The essential feature of algorithms for (8) is an iterative search for the correct active set, although some algorithms are not usually described in these terms (see [7, 10]). Almost all algorithms for (8) assume that an initial feasible point is available (if it is not, one can be found by applying 'Phase I' of the simplex method for linear programming; see, e.g., [9]). Each subsequent iteration contains the two procedures mentioned in Section 1: the determination of a search direction and of a step length.

The search direction is determined as follows. At the current point, say p̄, some subset of the constraints (8b) hold with equality, and are considered as the active set. Let the matrix Â contain the columns of A corresponding to the active constraints, and let b̂ be the vector of corresponding elements of b, so that

Âᵀp̄ = b̂.   (9)
The vector p̂ will denote the solution of the EQP with objective function (8a) and constraints Âᵀp = b̂. Let d̄ denote the gradient of the function (8a) at p̄, i.e.,

d̄ = Hp̄ + d.

If p̄ ≠ p̂ (i.e., Zᵀd̄ ≠ 0), the search direction δ* is the solution of a further EQP. In order for the same constraints to continue to be satisfied exactly at the next iterate, we require that

Âᵀ(p̄ + δ*) = b̂.   (10)

The search direction δ* thus solves
minimize   ½δᵀHδ + δᵀd̄,   δ ∈ ℝⁿ,   (11a)
subject to  Âᵀδ = 0,   (11b)
and can be computed using (5), (6), and (7). Although a step of unity along δ* would yield the vector p̂, this step may violate some of the inactive constraints. Hence, the step ᾱ to be taken along δ* is min(1, α̂), where α̂ is the maximum feasible step. If ᾱ = α̂, the inactive constraint corresponding to α̂ is added to the active set before the next iteration. If Zᵀd̄ = 0, p̄ = p̂. To determine whether p̂ is optimal for the original QP (8), the Lagrange multipliers are checked; in effect, such a test determines whether the constraints in the EQP that defines p̂ are the correct active set for the original inequality-constrained problem. The Lagrange multipliers λ̂ of the EQP are the solution of the compatible system
Âλ̂ = d + Hp̂.
If λ̂ᵢ > 0 for all i, p̂ is optimal for (8). However, if some λ̂ᵢ is negative, the objective function can be reduced by deleting the ith constraint from the active set.

The relevant features of solving an inequality-constrained QP sub-problem within an outer algorithm are: (i) the need to solve a sequence of EQP sub-problems; (ii) the revision of the predicted active set based on the Lagrange multipliers of the intermediate EQP sub-problems.
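A compact sketch of both procedures follows (our illustration, not production code): `solve_eqp` implements the null-space computation (5)-(7) of Section 3.1 with Y and Z taken from a complete QR factorization of Â, and `solve_iqp` is the active-set iteration of this section, assuming H positive definite, a feasible starting point, and nondegenerate working sets:

```python
import numpy as np

def solve_eqp(H, d, A_hat, b_hat):
    """Null-space method of Section 3.1 for min .5 p'Hp + p'd, A_hat'p = b_hat:
    p* = Y p_Y + Z p_Z, with Y, Z from a complete QR of A_hat (cf. (4)),
    p_Y from (6), and p_Z from the projected system (7)."""
    n, t = A_hat.shape
    q, _ = np.linalg.qr(A_hat, mode='complete')
    Y, Z = q[:, :t], q[:, t:]
    p_Y = np.linalg.solve(A_hat.T @ Y, b_hat)                         # (6)
    if Z.shape[1] == 0:
        return Y @ p_Y
    p_Z = np.linalg.solve(Z.T @ H @ Z, -Z.T @ d - Z.T @ H @ Y @ p_Y)  # (7)
    return Y @ p_Y + Z @ p_Z

def solve_iqp(H, d, A, b, p, max_iter=100):
    """Active-set iteration of Section 3.2 for min .5 p'Hp + p'd, A'p >= b:
    solve an EQP on the working set, take the largest feasible step toward
    its solution, add the blocking constraint, and delete a constraint
    whose multiplier is negative."""
    m = A.shape[1]
    work = [i for i in range(m) if abs(A[:, i] @ p - b[i]) < 1e-10]
    for _ in range(max_iter):
        if work:
            A_w = A[:, work]
            p_hat = solve_eqp(H, d, A_w, b[work])
        else:
            p_hat = np.linalg.solve(H, -d)            # unconstrained minimizer
        s = p_hat - p                                 # direction delta*
        if np.linalg.norm(s) < 1e-10:
            if not work:
                return p, {}
            # multipliers from the compatible system A_w lam = d + H p_hat
            lam, *_ = np.linalg.lstsq(A_w, d + H @ p_hat, rcond=None)
            if lam.min() >= -1e-10:
                return p, dict(zip(work, lam))        # optimal for (8)
            work.pop(int(np.argmin(lam)))             # delete a constraint
            continue
        alpha, blocking = 1.0, None                   # largest feasible step
        for i in range(m):
            if i not in work and A[:, i] @ s < -1e-12:
                a_i = (b[i] - A[:, i] @ p) / (A[:, i] @ s)
                if a_i < alpha:
                    alpha, blocking = a_i, i
        p = p + alpha * s
        if blocking is not None:
            work.append(blocking)
    return p, {}
```

The loop makes feature (i) concrete (every pass calls an EQP solve) and feature (ii) appears in the deletion test on the multipliers.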
4. Linearly constrained optimization

We first consider algorithms in which all the constraints are linear, and shall discuss only algorithms that retain feasibility throughout. The problem of concern is then

minimize   F(x),   x ∈ ℝⁿ,   (12a)
subject to  Aᵀx ≥ b.   (12b)
A typical EQP sub-problem for (12) is given by

minimize   ½pᵀHp + gᵀp,   p ∈ ℝⁿ,   (13a)
subject to  Âᵀp = 0,   (13b)
where Â consists of the constraints from (12b) selected by a pre-assigned active set strategy.
The IQP sub-problem corresponding to (12) is

minimize   ½pᵀHp + gᵀp,   p ∈ ℝⁿ,   (14a)
subject to  Aᵀ(x + p) ≥ b.   (14b)

In both (13) and (14), the objective function is a quadratic approximation to F(x + p) at x, with g = g(x). The search direction computed via the QP sub-problem is to be used within an outer algorithm, and hence should be a descent direction for F at x, i.e. gᵀp < 0.
4.1. Equality-constrained QP

Using (5), (6), and (7), the solution p* of (13) is given by

p* = Zp_Z,   (15a)

where

ZᵀHZp_Z = −Zᵀg.   (15b)
Condition (15a) means that p* is a feasible direction with respect to the constraints (13b). The equation (15b) is directly analogous to the standard definition of an unconstrained search direction in the subspace spanned by the columns of Z. If ZᵀHZ is positive definite, p* is a descent direction for F. Several benefits accrue because the definition of the search direction involves only the projected Hessian matrix ZᵀHZ and the projected gradient Zᵀg; in effect, the EQP sub-problem deals with a quadratic approximation of F within a reduced subspace. If Â is the correct active set for (12), ZᵀG(x*)Z must be positive semi-definite, and thus it is reasonable to require a positive definite matrix in (15b). The positive definite matrix to be used depends on the information available within the outer algorithm. When exact second derivatives are known, the matrix G(x) typically serves as H; the matrix ZᵀG(x)Z is then formed, and procedures analogous to those for unconstrained Newton-type methods can be applied to ensure that the matrix used in solving for p_Z is positive definite (see [24]). It is important to note that any such modification of the matrix alters the QP sub-problem during the process of solution, because the objective function approximation has been found to be inadequate with respect to the needs of the outer algorithm. A further benefit of the EQP approach is that quasi-Newton methods can be extended to the linearly constrained problem in a natural way by approximating the projected Hessian (rather than the Hessian itself). The standard quasi-Newton updates satisfy the quasi-Newton condition in the subspace defined by Z; furthermore, the essential property of hereditary positive definiteness can be retained. For further details, see [14]. Because of the special nature of the EQP (13), a reduction can be made in the
W. Murray, M.H. Wright / Computing the search direction
number of function and gradient evaluations required to compute the search direction. In finite-difference Newton methods, for example, a direct finite-difference approximation to the matrix Z^T G(x) Z may be obtained by differencing along the columns of Z, since the full matrix G(x) is not required. In a quasi-Newton method that uses function values only, the required vector Z^T g can be approximated directly in a similar manner. Simply solving the equality-constrained QP (13) at each iteration and performing a line search with respect to F is not sufficient to provide a reasonable algorithm for solving the original linearly constrained problem, since the active set selected for (13) may be incorrect. An EQP-based method must therefore include some procedure for altering the matrix Â between iterations. The usual choice of active set includes all the constraints that are exactly satisfied at the current iterate. Further constraints are added to the active set as the iterations proceed, in order to remain feasible. The missing ingredient is a procedure for deleting constraints from the predicted active set. One possibility is to delete a constraint only when the minimum of F in a subspace has been located, but this strategy is inefficient. Usually, a test is made at each iteration to see whether any constraint should be considered for deletion. Lagrange multiplier estimates are typically used to decide whether to delete a constraint, but they need not be computed at every iteration. For example, it may appear that F can be reduced substantially on the current subspace, in which case no constraints need to be deleted (in addition, any multiplier estimate at the current point will tend to be unreliable). A more complete discussion of the role of Lagrange multiplier estimates is given in [15]. Other information may also be used in predicting the active set.
For example, if a constraint enters and leaves the predicted active set several times, the criteria for its deletion should be made more stringent; such a policy can avoid the phenomenon of zig-zagging (see [36]).
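The computation (15a)-(15b) can be sketched concretely. The following small example is illustrative only: the matrices H and Z, the constraint normal a, and the gradient g are made-up values, not data from the text. It forms the projected Hessian and projected gradient, solves for p_Z, and checks that the resulting p is feasible (a^T p = 0) and a descent direction (g^T p < 0).

```python
# Sketch of the EQP search direction: p = Z p_Z with (Z^T H Z) p_Z = -Z^T g.
# All data below are illustrative assumptions.

def matvec(M, v):
    return [sum(M[i][j] * v[j] for j in range(len(v))) for i in range(len(M))]

def transpose(M):
    return [list(col) for col in zip(*M)]

def matmul(A, B):
    Bt = transpose(B)
    return [[sum(a * b for a, b in zip(row, col)) for col in Bt] for row in A]

def solve2(M, rhs):
    # Cramer's rule for a 2x2 system
    det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
    return [(rhs[0] * M[1][1] - M[0][1] * rhs[1]) / det,
            (M[0][0] * rhs[1] - rhs[0] * M[1][0]) / det]

# One active constraint a^T p = 0 in R^3; the columns of Z span its null space.
a = [1.0, 1.0, 1.0]
Z = [[1.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]               # n x 2 basis
H = [[4.0, 1.0, 0.0], [1.0, 3.0, 0.0], [0.0, 0.0, 2.0]]  # positive definite
g = [1.0, -2.0, 0.5]

Zt = transpose(Z)
ZtHZ = matmul(Zt, matmul(H, Z))        # projected Hessian
Ztg = matvec(Zt, g)                    # projected gradient
p_Z = solve2(ZtHZ, [-v for v in Ztg])  # (15b)
p = matvec(Z, p_Z)                     # (15a)

feas = sum(ai * pi for ai, pi in zip(a, p))     # a^T p, should be 0
descent = sum(gi * pi for gi, pi in zip(g, p))  # g^T p, should be < 0
```

Note that only the 2x2 projected quantities are ever factorized, which is the reduction in work the text describes.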
4.2. Inequality-constrained QP

If the solution p* of the IQP (14) is used as the search direction, complications arise because it may be necessary to store or represent the full matrix H. Although only the matrix Z_i^T H Z_i is required at iteration i, it is not known a priori which sets of constraints will define Z_i as the iterations proceed. Hence, most IQP methods assume that the full matrix H is available. In contrast to the positive-definiteness of the projected Hessian in the EQP case, there is no presumption that H should be positive definite. In particular, H may be viewed as an approximation to the Hessian of F, which need not be positive definite, even at x*. If H is indefinite, (14) may not have a bounded solution; furthermore, the solution of (14) for an indefinite H is not necessarily a descent direction for F, since it may happen that g^T p* > 0. Descent is not assured even if Z^T H Z is positive definite during every iteration while solving the QP, in which case the solution of the IQP is a strong local minimum. To avoid such problems,
it may be possible to devise some a priori tests on H to ensure that the solution of (14) is bounded and is a descent direction for F. Because of the substantial work that might be required to solve the IQP, a policy of testing the solution only a posteriori is unappealing. In view of this analysis, it might seem desirable to require that (14) should be posed only with a positive definite matrix H. However, the question then arises as to how to devise a suitable H from the information available within a standard algorithm. In a Newton-type method, for example, H is usually taken as the exact Hessian G(x). If G(x) is found to be indefinite or singular (which can be verified only after computing a factorization or eigenvalue decomposition), what related positive definite matrix should be used as H? If the correct active set Â were known, and Z^T G(x) Z were positive definite, H could be taken as the augmented matrix G(x) + ρ Â Â^T, where ρ is sufficiently large to make H positive definite (see Section 3.1). Unfortunately, this strategy presupposes that Â is known, which undermines the motivation for solving the IQP. If the wrong active set has been predicted, it may not be possible to make the corresponding augmented matrix positive definite. Even if it is, this alteration in the QP will in general change the solution. For the quasi-Newton case, the conditions that ensure continued positive definiteness of the updated matrices do not in general apply when updating the full matrix; this may lead to serious problems with numerical instability and the loss of desirable convergence properties. In addition, the convergence results [28] associated with Powell's [27] technique for retaining a strictly positive definite quasi-Newton approximation do not apply when the active set is varying, since the proofs of convergence assume that the correct active set has been determined and is not altered.
A number of approaches have been proposed to overcome the inherent deficiencies of solving an inequality-constrained QP. For bounds-constrained problems, Brayton and Cullum [3, 4] have suggested a method based on solving an initial simplified IQP. If the result is unsatisfactory, a more complicated inequality-constrained QP is solved; finally, if neither QP sub-problem has succeeded in producing a satisfactory search direction, an alternative method is used that does not involve a QP. Although the results reported by Brayton and Cullum indicate that the first QP solution is acceptable most of the time, the reason that both may fail is the presence of a possibly indefinite matrix H in the QP formulation, so that neither QP can be guaranteed to yield a descent direction. For the linearly constrained case, Fletcher [12] suggested a QP sub-problem that includes all the original inequality constraints as well as additional bounds on each component of the search direction. The purpose of the extra constraints is to restrict the solution of the QP to lie in a region where the current quadratic approximation is likely to be reasonably accurate. The bounds on p are readjusted at each iteration if necessary to reflect the adequacy of the quadratic
model. Fletcher's algorithm effectively includes the 'trust region' idea that is used in other areas of optimization. A similar approach is to add a positive definite matrix (e.g., a positive diagonal) to the Hessian or its approximation [20, 21, 22].
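The device of adding a positive diagonal to an indefinite Hessian can be sketched as follows. The 2x2 matrix and the Gershgorin-based choice of shift below are illustrative assumptions, not the specific modification used in [20, 21, 22].

```python
# Minimal sketch: if H is not positive definite, add tau * I with tau chosen
# from a Gershgorin bound, so that H + tau*I is strictly diagonally dominant
# with positive diagonal and hence positive definite.
# The matrix H is a made-up indefinite example (eigenvalues 4 and -2).

H = [[1.0, 3.0], [3.0, 1.0]]

def gershgorin_shift(M):
    # smallest tau >= 0 making M + tau*I strictly row diagonally dominant
    tau = 0.0
    for i, row in enumerate(M):
        off = sum(abs(v) for j, v in enumerate(row) if j != i)
        tau = max(tau, off - row[i] + 1e-8)
    return tau

tau = gershgorin_shift(H)
H_mod = [[H[i][j] + (tau if i == j else 0.0) for j in range(2)]
         for i in range(2)]

# 2x2 positive-definiteness check via leading principal minors
pd = (H_mod[0][0] > 0 and
      H_mod[0][0] * H_mod[1][1] - H_mod[0][1] * H_mod[1][0] > 0)
```

The Gershgorin bound is cheap but can be very pessimistic; practical codes usually derive the shift from a modified Cholesky factorization instead.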
4.3. Summary of the linearly constrained case

For linearly constrained problems, both possible approaches to formulating QP sub-problems are unsatisfactory to varying degrees in their 'pure' forms. A simple strategy for deleting constraints makes the EQP approach a viable one with many advantages, most accruing from the need to approximate only the projected Hessian. With the IQP approach, the difficulties are more fundamental, and it remains unclear which variation on the basic strategy is best. All current suggestions for the IQP strategy involve heuristic procedures that risk inhibiting the rate of convergence.
5. Nonlinearly constrained optimization

Methods for nonlinearly constrained optimization based on QP sub-problems have recently received a great deal of attention. In this section we consider in some detail various aspects of the resulting procedures for computing the search direction. The comments concerning the constraints apply not only to QP-based methods, but also to methods based on linearly constrained sub-problems in which the objective function is a general (rather than quadratic) approximation to the Lagrangian function or an augmented Lagrangian function [26, 29, 31, 32].
5.1. Definition of the QP sub-problem

With linearly constrained problems, the constraints of the QP sub-problem are taken directly from the original constraints. In the nonlinear case, the constraints must be transformed in some way. A 'linear approximation' of a smooth nonlinear constraint c_i at the point x can be derived from the Taylor series expansion:

c_i(x + p) = c_i(x) + a_i(x)^T p + (1/2) p^T G_i(x) p + O(||p||^3),    (16)

where G_i denotes the Hessian matrix of c_i.
Using only the linear terms of (16), we obtain

c_i(x + p) ≈ c_i(x) + a_i(x)^T p.    (17)
The relationship (17) suggests that the coefficient vectors of the sub-problem constraints be given by the gradients of the nonlinear constraints evaluated at x, and this choice is made in all published QP-based algorithms.
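The quality of the linearization (17) can be checked numerically. The constraint c(x) = x1^2 + x2^2 - 1, the point x, and the step p in the sketch below are all illustrative choices; the error of the linear model is exactly the neglected quadratic term, i.e. O(||p||^2).

```python
# Compare a made-up nonlinear constraint with its linearization (17) at x.

def c(x):
    return x[0] ** 2 + x[1] ** 2 - 1.0

def a(x):
    # gradient of c
    return [2.0 * x[0], 2.0 * x[1]]

x = [0.6, 0.9]
p = [1e-3, -2e-3]

exact = c([x[0] + p[0], x[1] + p[1]])
linear = c(x) + sum(ai * pi for ai, pi in zip(a(x), p))
err = abs(exact - linear)   # here exactly p1^2 + p2^2 = 5e-6, i.e. O(||p||^2)
```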
Various options have been proposed for the right-hand side of the QP constraints. If the nonlinear constraint c_i were an equality, the linear constraint

a_i(x)^T p = -c_i(x)    (18)
would be a statement that the linear approximation (17) is zero at the point x + p. Hence, the value -c_i(x) is a common choice for the right-hand side of the ith linear constraint [18, 27, 35]. Other algorithms construct a more general right-hand side whose typical element is -γ_i c_i(x) + σ_i, where the values of γ_i and σ_i depend on the derivation of the linear constraints [2, 23]. The quadratic function of the sub-problem is usually based on the Lagrangian function because of its essential role in the second-order optimality conditions for nonlinear constraints. Of course, the Lagrangian function depends not only on x, but also on the Lagrange multipliers. A QP-based method for nonlinearly constrained problems thus involves, either explicitly or implicitly, Lagrange multiplier estimates. If the quadratic function in the sub-problem is meant to approximate the Lagrangian function, one might assume that the linear term of the quadratic function would be the gradient of the Lagrangian function. However, as noted in Section 3.1, if a QP has only equality constraints with matrix Â, then its solution is unaltered if the linear term of the objective function includes a term of the form Âs. Hence, since the gradient of the Lagrangian function is g(x) - A(x)λ (if A(x) defines the equality constraints of the QP), g(x) alone is usually taken as the linear term of the objective function. With an inequality-constrained QP, however, it is expected that the active set at the solution is not the same as the set considered to be active at the initial point. In this case, the solution of the QP will vary depending on whether g(x) or g(x) - A(x)λ is used as the linear term of the objective function. Almost universally, the Hessian matrix in the sub-problem is viewed as an approximation to the Hessian of the Lagrangian function. However, algorithms vary in the form in which this matrix is represented, the properties of the matrix, and the role of Lagrange multipliers in the matrix.
5.2. History

QP-based methods for nonlinear constraints have an interesting history. We shall cite selected references to illustrate the origin of, and trends in, the use of certain strategies. To the best of our knowledge, the first suggestion of using a QP sub-problem to obtain the search direction in a nonlinearly constrained problem was made for the special case of convex programming by Wilson [35], in his unpublished PhD thesis. Wilson's method was subsequently described by Beale [1]. In Wilson's method, an inequality-constrained QP is solved at each iteration,
with linear constraints
A(x)^T p ≥ -c(x).    (19)
The quadratic function is an approximation to the Lagrangian function in which the exact Hessians of F and {c_i} are used, and the QP multipliers from the previous iteration serve as Lagrange multiplier estimates. The new iterate is given by x + p, where p is the solution of the QP, so that no line search is performed. Murray [23] proposed a different motivation for QP-based methods for general nonlinearly constrained problems. His derivation of the QP sub-problem was based on the limiting behavior of the solution trajectory of the quadratic penalty function (see [11]), and he showed that certain equivalences exist between the projected Hessians of the Lagrangian function and the penalty function. The right-hand side of the linear constraints is the 'damped' value -γc(x). Several possibilities were suggested for the Hessian of the quadratic function, including a quasi-Newton approximation, and also several multiplier estimates. The idea of 'partial' solution was also proposed--e.g., 'deleting' only one or two constraints of the QP. Finally, Murray suggested that the QP solution be used as a search direction, and that a line search be performed at each iteration with respect to a 'merit function' (in this case, the quadratic penalty function), to ensure a consistent measure of progress toward the solution of the original problem. This feature overcame the problems of divergence that often occurred when Wilson's method was applied outside a small neighborhood of the solution. Biggs [2] presented a variation of Murray's method in which the QP sub-problem contained equality constraints only. In Biggs' method, a typical component of the right-hand side of the linear constraints is of the form -γ_i c_i(x) + σ_i, where the values of γ_i and σ_i depend on the Lagrange multiplier estimate and a penalty parameter. Biggs also proposed some special multiplier estimates.
Han [18, 19] revived the idea of obtaining the search direction by solving an inequality-constrained QP with constraints (19), as in Wilson's method. He suggested quasi-Newton updates to an approximation to the Hessian of the Lagrangian function, but assumed that the full Hessian of the Lagrangian function was everywhere positive definite. In Han's method, the Lagrange multipliers of the QP sub-problem from the previous iteration are used as estimates of the multipliers of the original problem. Han's algorithm is shown to have superlinear convergence under certain conditions. Han also suggested the use of the non-differentiable 'exact' penalty function (see [6])
F(x) - ρ Σ_{i=1}^{m} min(0, c_i(x))
as a 'merit function' within a line search. Powell [27, 28] proposed an inequality-constrained QP procedure in which a positive definite quasi-Newton approximation to the Hessian of the Lagrangian
function is retained, even when the full matrix is indefinite. He also showed that this procedure would not impede the superlinear convergence of the method under certain assumptions. This historical view is necessarily limited. All the authors mentioned above and many others have continued to publish new results on QP-based methods.
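The non-differentiable merit function above is cheap to evaluate inside a line search. The sketch below uses an illustrative objective F, an illustrative pair of constraints c_i(x) ≥ 0, and an illustrative penalty parameter ρ; none of these are taken from the papers cited.

```python
# Sketch of an exact-penalty merit function of the form
# P(x) = F(x) - rho * sum_i min(0, c_i(x)) for constraints c_i(x) >= 0.
# F, the constraints, rho, and the trial points are made-up examples.

def F(x):
    return (x[0] - 1.0) ** 2 + x[1] ** 2

def constraints(x):
    return [x[0], 1.0 - x[0] - x[1]]   # x1 >= 0 and x1 + x2 <= 1

def merit(x, rho=10.0):
    return F(x) - rho * sum(min(0.0, ci) for ci in constraints(x))

feasible = merit([0.5, 0.2])     # no violation, so P(x) = F(x) = 0.29
infeasible = merit([-0.1, 0.2])  # violates x1 >= 0, penalised: 1.25 + 1.0
```

A line search on P rather than on F alone gives the consistent measure of progress described in the text.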
5.3. Incompatible constraints

The first difficulty that can occur in formulating the linear constraints of the QP sub-problem is incompatibility--i.e., the feasible region of the sub-problem is empty even though that of the original problem is not. This difficulty does not arise in the linearly constrained case, since incompatible constraints in the sub-problem would imply that the original constraints were incompatible. The phenomenon of incompatibility is not at all rare; even if it were, any robust algorithm would nonetheless have to deal with it. For example, consider any nonlinear constraint whose gradient vector at x is zero, but whose value is negative. Both the equality and inequality linear versions of this constraint of the type (18) or (19) are incompatible. In practice, incompatibility appears to be more likely with an IQP sub-problem, for two reasons. First, by definition the IQP sub-problem contains more constraints. Second, and probably more important, is the fact that the linearization of an inactive constraint represents a restriction involving the boundary of the feasible region that is made at a point far removed from the boundary. In order to define a satisfactory search direction in such a situation, a different sub-problem must be posed. We stress this point because some algorithm designers do not mention the possibility that the QP sub-problem might be completely rejected. With an EQP approach, the constraints are of the form
Â^T p = â.    (20)
If (20) is incompatible, the columns of Â must be linearly dependent. Several techniques for treating incompatibility are possible in an algorithm based on a pre-assigned active set strategy. As mentioned earlier, some algorithms include a flexible strategy for specifying â, which can be invoked to eliminate or reduce the likelihood of incompatible constraints. However, even within such a method the constraints of the EQP may need to be adjusted by an alternative procedure. First, the procedure that selects the active constraints can attempt to exclude constraints whose gradients are linearly dependent. As a first step in this direction, some pre-assignment strategies do not allow more than n constraints to be considered active. A more elaborate procedure can be carried out when the matrix Â is factorized while the active set is constructed. If, for example, the QR factorization is used, each new column can be tested for linear dependence and rejected if necessary. If the
active set is selected before Â is factorized, a column interchange procedure can be used to select a set of columns from Â that appear to be 'sufficiently' linearly independent. The above strategies not only influence the selection of the constraints that are to be treated as 'active', but may also alter the intended QP sub-problem. With another strategy, it might be considered that the motivation underlying the formulation of the sub-problem constraints is still valid, and that linear dependence of the constraint gradients is not an appropriate criterion in deciding which constraints are active. With this view, the search direction could be constrained not by the linear equalities (20), but to be a least-squares solution of

minimize_p ||Â^T p - â||.    (21)
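A tiny illustration of this least-squares fallback, with made-up data: two linearly dependent constraint normals a1 = (1, 0) and a2 = (2, 0), and a right-hand side (1, 1) with which (20) has no exact solution. Because only one coordinate of p is constrained, the minimizer has a simple closed form.

```python
# Least-squares solution of an incompatible system (21) with dependent rows.
# The data are contrived so the minimizer can be written in closed form.

rows = [[1.0, 0.0], [2.0, 0.0]]   # the rows of the constraint matrix
rhs = [1.0, 1.0]                  # incompatible right-hand side

# Only p1 matters; the minimizer of (p1 - 1)^2 + (2 p1 - 1)^2 is
# p1 = (1*1 + 2*1) / (1^2 + 2^2) = 3/5.
v = [row[0] for row in rows]
p1 = sum(vi * bi for vi, bi in zip(v, rhs)) / sum(vi * vi for vi in v)
p = [p1, 0.0]                     # minimum-norm least-squares solution

# The residual is the perturbation delta in the 'least perturbed' system (22).
delta = [sum(r * pi for r, pi in zip(row, p)) - b
         for row, b in zip(rows, rhs)]
```

In practice the minimum-norm solution would be obtained from a complete orthogonal (or QR with column pivoting) factorization rather than a closed form.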
The solution of (21) can be computed using the complete orthogonal factorization of Â. Note that if â happens to be consistent with Â^T, the solution of (21) will also satisfy (20). If p is required to solve (21), p will also satisfy a perturbed system of linear equality constraints

Â^T p = â + δ    (22)

for some vector δ, and thus p may be regarded as the solution of a 'least perturbed' EQP sub-problem. Great care must be exercised when re-defining p in the case when (20) is incompatible, to ensure that p still satisfies the requisite conditions within the outer algorithm. For example, the proof that p is a descent direction for some merit function should not rely on exact satisfaction of (20). Incompatibility leads to a more complicated situation with an inequality-constrained QP sub-problem. In the equality case, incompatibility can be determined during the process of solving (21), and an alternative definition of p typically makes use of the quantities already computed. With inequalities, however, incompatibility is determined only at the end of an iterative procedure to find a feasible point. If it turns out that there is none, the question arises of defining a new sub-problem. Powell [27] suggested that the first QP sub-problem be posed with constraints

A(x)^T p ≥ -c(x).    (23)
If (23) is incompatible, a sequence of QP sub-problems is then posed and solved, in which the right-hand side of (23) is successively replaced by -βc(x), -β²c(x), ..., for some positive 'damping' factor β (β < 1). As the right-hand side approaches zero, the zero vector becomes closer and closer to being feasible, and thus the hope is that eventually some non-zero vector will be feasible. Nonetheless, although this procedure succeeds in some cases, it is not guaranteed to produce a compatible system of inequalities except in the limit, with a zero solution.
Another idea for the IQP approach is simply to use the final result of the 'Phase I' procedure applied to the original constraints (23). In some sense, this should be the 'least infeasible' vector. In this case, no further attempt would be made to solve the QP. Obviously, many other strategies are possible, and will undoubtedly be proposed. It seems clear that there is a danger of great inefficiency with an inequality-constrained QP sub-problem unless the computational effort expended to discover incompatibility can be exploited in the same way as in the equality-constrained case.

5.4. Ill-conditioning in the constraints
The difficulties just noted with incompatibility imply that the determination of the search direction from a QP sub-problem may be ill-conditioned when the linear constraints are 'almost' incompatible. In the case of a pre-assigned active set strategy, the columns of Â in (20) can be 'nearly' linearly dependent. If the original constraints were linear, the sub-problem then would represent a real situation--namely, that the intersection of the constraints, or possibly x* itself, is ill-defined. When the constraints are nonlinear, however, it is quite possible for their intersection to be perfectly well defined, and hence the current linear approximation is misleading. The effect of ill-conditioning in Â on the QP sub-problem is thus to make the constraints (20) of questionable value. Usually, the component of p determined by (20) becomes extremely large if Â is ill-conditioned, and hence tends to dominate the search direction. Even if by chance this component is of acceptable size, its reliability is dubious because, by definition, small perturbations in the right-hand side of (20) can induce large relative changes in its solution. Since (20) only provides an approximation to the desired behavior of the nonlinear constraints, it is important to take precautions so that the entire sub-problem is not invalidated. For example, it has been suggested [25] that the components of p be re-scaled to overcome any significant imbalance caused by ill-conditioning. Alternatively, a very loose criterion for 'linear dependence' could be used in factorizing Â, and p could be chosen to solve the least-squares problem (21). This latter procedure would, however, induce discontinuities into the definition of p when a certain quantity crosses the boundary between 'non-negligible' and 'negligible'. An analogous phenomenon can be observed with an inequality-constrained QP, in which case the active set at the solution is nearly linearly dependent, and the ill-conditioning is typically revealed by a very large search direction.
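The sensitivity just described can be demonstrated numerically. In the sketch below the 2x2 system and the size of the perturbation are contrived: the rows of the matrix are nearly dependent, and a right-hand-side change of order 1e-8 shifts the solution by order 1e-2.

```python
# Ill-conditioning demonstration: for nearly dependent constraint rows,
# a tiny right-hand-side perturbation produces a large change in p.
# The data are illustrative.

eps = 1e-6
A_t = [[1.0, 1.0], [1.0, 1.0 + eps]]   # nearly dependent rows

def solve2(M, rhs):
    # Cramer's rule for a 2x2 system
    det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
    return [(rhs[0] * M[1][1] - M[0][1] * rhs[1]) / det,
            (M[0][0] * rhs[1] - rhs[0] * M[1][0]) / det]

p = solve2(A_t, [1.0, 1.0])              # exact solution (1, 0)
p_pert = solve2(A_t, [1.0, 1.0 + 1e-8])  # rhs perturbed by 1e-8

# the solution moves by about 1e-2, six orders larger than the perturbation
shift = max(abs(p[i] - p_pert[i]) for i in range(2))
```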
It could be argued that such 'bad scaling' of the search direction is unimportant in either case if a line search is to be executed with respect to a merit function after the QP is solved. However, a badly scaled search direction will tend to produce an inefficient performance by even the best line search procedures. Moreover, there is the danger that an algorithm will be unable to make any progress away from the neighborhood in which the ill-conditioning is present. In addition to its effect on the search direction, ill-conditioning in Â affects the Lagrange multipliers of the QP, which solve the compatible system

Âλ = Hp* + g.    (24)
(In the case of IQP, the matrix Â in (24) represents the matrix of constraints active at the solution of the IQP.) When Â is ill-conditioned, the solution of (24) must be considered unreliable because of the inherent numerical instability in its computation. The difficulties are of course further exacerbated by the presence on the right-hand side of (24) of the vector p*, which is known to be the result of solving an equally ill-conditioned linear system.

5.5. Determination of the active set
The nonlinearity of the constraint functions introduces great complexity into the decision as to which constraints are considered to be active. With linear constraints, the active set invariably includes the constraints that are satisfied 'exactly' (i.e., within working precision). In the nonlinear case, the active constraints at the solution are usually satisfied exactly only in the limit, and hence other criteria must be employed. Any method based on the Lagrangian function inherently includes some decision about the active set in defining which constraints correspond to non-zero multipliers. In order to formulate an EQP sub-problem, a pre-assigned active set strategy is used to determine which set of the original nonlinear constraints will be linearized and included in the linear equality constraints of the sub-problem. A pre-assigned active set strategy typically involves examination of the local behavior of the constraints at the current point in terms of properties that hold for the correct active set in a neighborhood of the solution. The active set selection may also be based on the known behavior of the merit function used in the line search. For example, penalty function methods approach the solution along a path of points that are infeasible with respect to the active constraints; hence, there is justification for including the violated constraints in the predicted active set within an algorithm that measures progress via a penalty function. With nonlinear constraints, it is no longer essential to 'delete' constraints from the active set, since movement along the search direction will in general alter the constraint value. With a QP-assigned active set strategy, the Lagrange multipliers from the IQP sub-problem at the previous iteration determine the selection of the active set. 
It can be shown that in a neighborhood of x*, under certain conditions on H and A, some IQP sub-problems will indeed make the correct choice of active set (see [8, 18, 30]), in the sense that the set of active linear constraints at the solution of
the QP is equivalent to the set of active nonlinear constraints at the solution of the original problem. Although this result is clearly essential in order to place any reliance on the QP multipliers, it does not imply that a QP-assigned active set strategy possesses an inherent superiority over a pre-assigned procedure. Almost any sensible set of criteria will predict the active set correctly in a small neighborhood of the solution under the same assumptions required to guarantee a correct prediction by the IQP sub-problem. Furthermore, it is not difficult to construct examples in which the neighborhood of correct prediction for the IQP is arbitrarily small. The justification for any active set strategy arises from its reliability when the current iterate is not in a small neighborhood of the solution. Since the prediction of the active set influences the logic of either an EQP or IQP method, it is advisable to include tests for consistency before considering the prediction to be reliable. If, for example, a different active set is predicted in two consecutive iterations, it seems doubtful that the prediction is accurate. See Chamberlain [5] for some interesting examples of cycling in the active set prediction with the IQP approach.
5.6. Lagrange multiplier estimates

Lagrange multiplier estimates are used within the EQP approach in two ways: (1) an approximation to the Lagrangian function must be constructed; (2) many pre-assigned active set strategies consider multiplier estimates in selecting the active set. Since g(x) and Â(x) are evaluated before defining the approximation to the Lagrangian function at x, a first-order Lagrange multiplier estimate can be computed as the solution of the least-squares problem

minimize_λ ||Â(x)λ - g(x)||₂.    (25)

An alternative estimate, which allows for the fact that ||ĉ(x)|| is not zero, is given by

λ₂ = λ₁ - (Â^T Â)^{-1} ĉ,    (26)

where λ₁ is the solution of (25), and Â and ĉ are evaluated at x. The definition (26) is based on correcting the least-squares estimate to include the predicted change in the gradient of the Lagrangian function following a first-order step to a zero of the active constraint functions (with the identity matrix as an approximation of the Hessian of the Lagrangian function). Neither λ₁ nor λ₂ is guaranteed to be non-negative. Therefore, some alternative definition may be appropriate, such as setting a negative value to zero. Lagrange multiplier estimates of higher than first order may in some circumstances be obtained from the relationship (24), which defines the multipliers of the QP sub-problem. If the full matrix H is available, the compatible system (24)
can be solved after p* is computed. If only a projection of the Hessian of the Lagrangian function (or an approximation to it) is known, additional computation is necessary in order to solve (24); e.g., Hp* may be approximated by a finite-difference of the gradient of the Lagrangian function along the vector p*. When using the IQP approach, the QP multipliers from the previous iteration are used to define the new quadratic approximation to the Lagrangian function. In this way, the QP multipliers can be interpreted as providing a prediction of the active set, in the sense that they define which constraints are included in and which are omitted from the Lagrangian function. Because they are the exact multipliers of an inequality-constrained sub-problem, the IQP multipliers must be non-negative. It is often emphasized that the IQP multipliers are a better than first-order approximation to λ* if the second derivative approximations are sufficiently accurate. However, the accuracy of the IQP multipliers is guaranteed only if the IQP sub-problem will eventually predict the correct active set. If the predicted active set varies between iterations, or is incorrect for several consecutive iterations, in effect the 'wrong' quadratic objective function has been chosen for the sub-problem, even when exact second derivatives are known. Some similarity between multiplier estimates on successive iterations should be required before the estimates can be considered reliable. With an IQP approach, g and A must be computed at the new iterate before defining the approximation to the Lagrangian function. It would seem appropriate to compare the QP multipliers with a first-order estimate from (25) or (26), using the predicted active set from the previous iteration. An IQP multiplier should be considered unreliable if its sign differs from that of the corresponding first-order estimate at an improved point.
With either an EQP or IQP sub-problem, the quality of the estimate (24) critically depends not only on the correctness of the active set, but also on how well H approximates the Hessian of the Lagrangian function. The right-hand side of (24) can be interpreted as a prediction of the gradient of the Lagrangian function at x + p*, and the value of the estimate from (24) hence depends on the fact that x + p* is a 'better' point than x. Consequently, the value of the estimate (24) is questionable when a step of unity is not taken along p*, which is often the case except very near the solution.
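A sketch of the first-order least-squares estimate (25), together with a corrected variant in the spirit of (26), for the case of a single active constraint. The constraint gradient, objective gradient, and constraint value below are all illustrative assumptions, and the one-constraint case admits a closed form.

```python
# First-order multiplier estimate: lambda minimizing ||a_hat * lambda - g||_2.
# With a single active constraint gradient a_hat, the closed form is
# (a_hat^T g) / (a_hat^T a_hat).  All numerical data are made up.

a_hat = [1.0, 1.0]          # gradient of the (single) active constraint
g = [2.0, 3.0]              # objective gradient at x

aa = sum(ai * ai for ai in a_hat)
lam1 = sum(ai * gi for ai, gi in zip(a_hat, g)) / aa   # = 2.5

# Corrected estimate toward a zero of the active constraint, with the
# identity as the Hessian approximation (in the spirit of (26)).
c_hat = -0.1                # illustrative active-constraint value at x
lam2 = lam1 - c_hat / aa    # = 2.55
```

Note that neither estimate is guaranteed non-negative, which is why the text suggests fallback rules such as truncation at zero.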
5.7. Approximation of the Lagrangian function

A key element in the success of both the EQP and IQP approaches is the quality of approximation of the Lagrangian function by the quadratic objective function in the QP. As already observed, there seems little hope in either case of obtaining a good approximation while the predicted active set is changing, in large part because of the nonlinearity of the constraint functions. When solving linearly constrained problems by an EQP approach, a more
limited objective is adopted--namely, to approximate F within a given subspace. The active set is altered only when the multiplier estimates are considered sufficiently accurate, and the dimension of the subspace can be increased by only one at each iteration. Thus, it is possible to accumulate and retain information about the second-order behavior of F within previous subspaces. However, constraint nonlinearities make it difficult to obtain accurate information about the second-order behavior of the Lagrangian function even when the active set remains fixed. When the dimension of the active set changes, there is a discrete change in all the Lagrange multipliers. Furthermore, changes in the active set are not controlled by the accuracy of the multiplier estimates. A varying active set poses no particular difficulties in the EQP approach when the Hessian is re-approximated from scratch at each iteration. However, if information is to be accumulated (as in quasi-Newton updating schemes), the whole validity of the updated approximation is questionable, even when an approximation to the full Hessian is recurred. This suggests that the competitiveness of Newton-type methods may be improved relative to quasi-Newton methods for nonlinearly constrained problems with inequality constraints. There are several serious difficulties in approximating the Lagrangian function with an IQP approach. If exact second derivatives are available, the matrix H in the QP sub-problem is not necessarily positive definite, even in a neighborhood of the solution; thus, Newton-type IQP methods may suffer from the problems noted earlier in the linearly constrained case of defining a satisfactory sub-problem with a possibly indefinite Hessian--e.g., the search direction may not be a descent direction for the associated merit function. The problems associated with quasi-Newton updates are the same as for an EQP approach.
A further difficulty in approximating the Lagrangian function with an IQP approach is its reliance on the QP multipliers in forming the Hessian approximation, since even the linear term of the QP objective function may be erroneous. In effect, the IQP approach uses a pre-assigned selection of the active set to define the objective function of the QP. Thus, the prediction of the active set at the end of the iteration is dependent on the initial prediction that defines the quadratic objective function of the sub-problem, even in cases when this information is bound to be questionable. Much further research is needed concerning techniques for approximating the Hessian of the Lagrangian function with either QP approach.
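The dependence of accumulated curvature information on the multiplier estimates can be made concrete. The sketch below is our own Python illustration (not from the paper): it applies a standard BFGS update to an approximate Lagrangian Hessian on a toy problem, and shows that the update differs according to the multiplier estimate used to form the gradient difference, which is the crux of the difficulty described above.

```python
import numpy as np

def lagrangian_grad(x, lam, grad_f, jac_c):
    """Gradient of L(x, lam) = f(x) - lam^T c(x) for multiplier estimate lam."""
    return grad_f(x) - jac_c(x).T @ lam

def bfgs_update(H, s, y):
    """Standard BFGS update of the Hessian approximation H; skipped if y^T s is too small."""
    ys = y @ s
    if ys <= 1e-12:
        return H  # curvature condition fails; retain the previous approximation
    Hs = H @ s
    return H - np.outer(Hs, Hs) / (s @ Hs) + np.outer(y, y) / ys

# Toy problem: f(x) = x0^4 + x1^2, one constraint c(x) = x0 + x1^2 - 1.
grad_f = lambda x: np.array([4 * x[0]**3, 2 * x[1]])
jac_c = lambda x: np.array([[1.0, 2 * x[1]]])

x_old, x_new = np.array([0.5, 0.5]), np.array([0.6, 0.4])
s = x_new - x_old
updates = []
for lam in (np.array([0.0]), np.array([1.0])):  # two different multiplier estimates
    y = lagrangian_grad(x_new, lam, grad_f, jac_c) - lagrangian_grad(x_old, lam, grad_f, jac_c)
    updates.append(bfgs_update(np.eye(2), s, y))
# The two updated approximations differ: the recorded curvature information
# depends on the multiplier estimate, not only on the points visited.
```

With the same pair of iterates, the two multiplier estimates produce two different updated Hessian approximations, which is why a discrete change in the multipliers casts doubt on information accumulated under the old estimates.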
6. Conclusions
Computing the search direction by solving a QP sub-problem is not a cure for all algorithmic ills in constrained optimization. Certainly a QP-based formulation overcomes some of the disadvantages of alternative methods, and QP-based methods typically work extremely well in the neighborhood of the solution. The
real challenge for QP-based methods is to demonstrate their robustness and efficiency when the current iterate is not close to the solution. In particular, it is crucial that the QP sub-problem, which is posed as an approximation to conditions that hold at the solution, can be shown to have a sensible interpretation far from the solution as well. For instance, what should be the interpretation of Lagrange multiplier estimates computed with the wrong active set and an inaccurate Hessian approximation at a point that is far from optimal?

It may be thought that the greater effort involved in solving an IQP subproblem would cause fewer major iterations to be required to solve the original problem. If this were true, maximum efficiency would be achieved by balancing the extra effort per iteration against the savings in number of iterations. Such a phenomenon occurs, for example, when performing a linear search in quasi-Newton methods for unconstrained optimization, since numerical experimentation (see, e.g., [16]) confirms that, in general, the more accurate the linear search, the fewer major iterations. Unfortunately, unlike performing an accurate line search, solving an IQP sub-problem yields no obvious benefit (in fact, it may even be detrimental). Furthermore, there is a clear tangible cost to solving the IQP which makes the approach impractical for large problems.

The purpose of this paper has been to point out considerations of importance in evaluating any proposed method that includes a QP sub-problem. Further theoretical analysis and computational experience are clearly necessary before any reliable understanding can be gained of the merits of alternative formulations and strategies.
Acknowledgment

We thank Dr. Philip E. Gill for his careful reading of the manuscript and his helpful suggestions. We also thank the referees for their comments.
References

[1] E.M.L. Beale, "Numerical methods", in: J. Abadie, ed., Nonlinear programming (North-Holland, Amsterdam, 1967) pp. 132-205.
[2] M.C. Biggs, "Constrained minimization using recursive equality quadratic programming", in: F.A. Lootsma, ed., Numerical methods for nonlinear optimization (Academic Press, London, 1972) pp. 411-428.
[3] R.K. Brayton and J. Cullum, "Optimization with the parameters constrained to a box", in: Proceedings of the IMACS international symposium on simulation software and numerical methods for differential equations, IMACS (1977).
[4] R.K. Brayton and J. Cullum, "An algorithm for minimizing a differentiable function subject to box constraints and errors", Journal of Optimization Theory and Applications 29 (1979) 521-558.
[5] R.M. Chamberlain, "Some examples of cycling in variable metric methods for constrained minimization", Mathematical Programming 16 (1979) 378-383.
[6] A.R. Conn, "Constrained optimization using a non-differentiable penalty function", SIAM Journal on Numerical Analysis 10 (1973) 760-779.
[7] R.W. Cottle and A. Djang, "Algorithmic equivalence in quadratic programming, I: a least distance programming problem", Journal of Optimization Theory and Applications 28 (1974) 275-301.
[8] J.W. Daniel, "Stability of the solution of definite quadratic programs", Mathematical Programming 5 (1973) 41-53.
[9] G.B. Dantzig, Linear programming and extensions (Princeton University Press, Princeton, NJ, 1963).
[10] A. Djang, "Algorithmic equivalence in quadratic programming, II: on the Best-Ritter algorithm and other active set methods for indefinite problems", Technical report, Department of Operations Research, Stanford University, to appear.
[11] A.V. Fiacco and G.P. McCormick, Nonlinear programming: sequential unconstrained minimization techniques (John Wiley and Sons, New York, 1968).
[12] R. Fletcher, "Minimizing general functions subject to linear constraints", in: F.A. Lootsma, ed., Numerical methods for nonlinear optimization (Academic Press, London, 1972) pp. 279-296.
[13] R. Fletcher, "Methods related to Lagrangian functions", in: P.E. Gill and W. Murray, eds., Numerical methods for constrained optimization (Academic Press, London, 1974) pp. 219-240.
[14] P.E. Gill and W. Murray, "Quasi-Newton methods for linearly constrained optimization", in: P.E. Gill and W. Murray, eds., Numerical methods for constrained optimization (Academic Press, London, 1974) pp. 67-92.
[15] P.E. Gill and W. Murray, "The computation of Lagrange multiplier estimates for optimization", Mathematical Programming 17 (1979) 32-60.
[16] P.E. Gill, W. Murray and R.A. Pitfield, "The implementation of two revised quasi-Newton algorithms for unconstrained optimization", Report NAC 24, National Physical Laboratory (Teddington, England, 1972).
[17] G.H. Golub, V. Klema and G.W. Stewart, "Rank degeneracy and least-squares problems", Report CS-76-559, Computer Science Department, Stanford University (1976).
[18] S.-P. Han, "Superlinearly convergent variable metric algorithms for general nonlinear programming problems", Mathematical Programming 11 (1976) 263-282.
[19] S.-P. Han, "A globally convergent method for nonlinear programming", Journal of Optimization Theory and Applications 22 (1977) 297-310.
[20] K. Levenberg, "A method for the solution of certain problems in least-squares", Quarterly of Applied Mathematics 2 (1944) 164-168.
[21] D. Marquardt, "An algorithm for least-squares estimation of nonlinear parameters", SIAM Journal on Applied Mathematics 11 (1963) 431-441.
[22] J.J. Moré, "The Levenberg-Marquardt algorithm: implementation and theory", in: G.A. Watson, ed., Numerical analysis, Lecture Notes in Mathematics 630 (Springer-Verlag, Berlin, 1977).
[23] W. Murray, "An algorithm for constrained minimization", in: R. Fletcher, ed., Optimization (Academic Press, London, 1969) pp. 247-258.
[24] W. Murray, "Second derivative methods", in: W. Murray, ed., Numerical methods for unconstrained optimization (Academic Press, London, 1972) pp. 57-71.
[25] W. Murray and M.H. Wright, "Projected Lagrangian methods based on the trajectories of penalty and barrier functions", Report SOL 78-23, Department of Operations Research, Stanford University (1978).
[26] B.A. Murtagh and M.A. Saunders, "The implementation of a Lagrangian-based algorithm for sparse nonlinear constraints", Report SOL 80-1, Department of Operations Research, Stanford University (1980).
[27] M.J.D. Powell, "A fast algorithm for nonlinearly constrained optimization calculations", Report DAMTP 77/NA 2, University of Cambridge, England (1977).
[28] M.J.D. Powell, "The convergence of variable metric methods for non-linearly constrained optimization calculations", in: O.L. Mangasarian, R.R. Meyer and S.M. Robinson, eds., Nonlinear programming 3 (Academic Press, London, 1978) pp. 27-63.
[29] S.M. Robinson, "A quadratically convergent algorithm for general nonlinear programming problems", Mathematical Programming 3 (1972) 145-156.
[30] S.M. Robinson, "Perturbed Kuhn-Tucker points and rates of convergence for a class of nonlinear programming algorithms", Mathematical Programming (1974) 1-16.
[31] J.B. Rosen and J. Kreuser, "A gradient projection algorithm for nonlinear constraints", in: F.A. Lootsma, ed., Numerical methods for nonlinear optimization (Academic Press, London, 1972) pp. 297-300.
[32] J.B. Rosen, "Two phase algorithm for nonlinear constraint problems", in: O.L. Mangasarian, R.R. Meyer and S.M. Robinson, eds., Nonlinear programming 3 (Academic Press, London, 1978) pp. 97-124.
[33] G.W. Stewart, Introduction to matrix computations (Academic Press, London, 1973).
[34] R.A. Tapia, "Quasi-Newton methods for equality constrained optimization: equivalence of existing methods and a new implementation", in: O.L. Mangasarian, R.R. Meyer and S.M. Robinson, eds., Nonlinear programming 3 (Academic Press, London, 1978) pp. 125-164.
[35] R.B. Wilson, "A simplicial algorithm for concave programming", Thesis, Harvard University (1963).
[36] G. Zoutendijk, "Nonlinear programming, computational methods", in: J. Abadie, ed., Integer and nonlinear programming (North-Holland, Amsterdam, 1970) pp. 37-86.
Mathematical Programming Study 16 (1982) 84-117 North-Holland Publishing Company
A PROJECTED LAGRANGIAN ALGORITHM AND ITS IMPLEMENTATION FOR SPARSE NONLINEAR CONSTRAINTS†

Bruce A. MURTAGH
The University of New South Wales, Sydney, N.S.W., Australia
Michael A. SAUNDERS* Stanford University, Stanford, CA, U.S.A. Received 29 January 1980 Revised manuscript received 2 February 1981
An algorithm is described for solving large-scale nonlinear programs whose objective and constraint functions are smooth and continuously differentiable. The algorithm is of the projected Lagrangian type, involving a sequence of sparse, linearly constrained subproblems whose objective functions include a modified Lagrangian term and a modified quadratic penalty function.

The algorithm has been implemented in a general purpose FORTRAN program called MINOS/AUGMENTED. Some aspects of the implementation are described, and computational results are given for some nontrivial test problems.

The system is intended for use on problems whose Jacobian matrix is sparse. (Such problems usually include a large set of purely linear constraints.) The bulk of the data for a problem may be assembled using a standard linear-programming matrix generator. Function and gradient values for nonlinear terms are supplied by two user-written subroutines.

Future applications could include some of the problems that are currently being solved in industry by the method of successive linear programming (SLP). We would expect the rate of convergence and the attainable accuracy to be better than that achieved by SLP, but comparisons are not yet available on problems of significant size.

One of the largest nonlinear programs solved by MINOS/AUGMENTED involved about 850 constraints and 4000 variables, with a nonlinear objective function and 32 nonlinear constraints. From a cold start, about 6000 iterations and 1 hour of computer time were required on a DEC VAX 11/780.
Key words: Large-scale Optimization, Optimization Software, Nonlinear Programming, Projected Lagrangian.
† Presented at the Tenth International Symposium on Mathematical Programming, Montreal, August 1979.
* Supported by the Department of Energy Contract DE-AS03-76SF00326, PA DE-AT-0376ER72018; the National Science Foundation Grants MCS-7926009 and ENG-7706761 A01; the U.S. Army Research Office Contract DAAG29-79-C-0110; and the Applied Mathematics Division, DSIR, New Zealand.

1. Introduction

The work reported here was prompted by consideration of various ways to extend the linearly constrained optimization code MINOS [35, 36, 52] to problems containing both linear and nonlinear constraints. In particular, we are concerned with large, sparse problems, in the sense that each variable is involved in relatively few constraints.

Ignoring sparsity for the moment, consider the model problem

    minimize    f0(x),
    subject to  f(x) = 0,    l ≤ x ≤ u,    (1.1)

where the functions f0 and f of x are assumed to be twice differentiable with bounded Hessians. For this problem, the algorithm presented here would solve a sequence of linearly constrained subproblems of the following form: given xk, λk and ρ,

    minimize    L(x, xk, λk, ρ),    (1.2a)
    subject to  f̄ = 0,              (1.2b)
                l ≤ x ≤ u,

where the objective function is a modified augmented Lagrangian, and f̄ is the linear approximation to f(x) at the point xk. These quantities are defined as follows:

    L(x, xk, λk, ρ) = f0(x) − λkT(f − f̄) + ½ρ(f − f̄)T(f − f̄),
    f̄ = fk + Jk(x − xk),
where fk and Jk are the constraint vector and Jacobian matrix evaluated at xk.

Our approach is based on the algorithm given by Robinson [45]. When applied to problem (1.1), Robinson's algorithm would solve a sequence of subproblems of the form (1.2), except that the penalty term involving ρ would not be present. If xk and λk are taken to be the solution and corresponding Lagrange multipliers for the previous subproblem, Robinson has shown for the case ρ = 0 that the sequence {(xk, λk)} will converge under certain circumstances to a solution of the original problem, and that the rate of convergence will be quadratic. A case for which convergence can be expected from an arbitrary starting point is when the modified Lagrangian L(x, xk, λk, 0) is convex. In general, however, convergence may not occur unless the initial point (x0, λ0) is sufficiently close to the desired solution.

In order to allow convergence from a wider range of starting points, Rosen [48] proposed a two-phase algorithm in which a penalty function is first used to locate a point in a neighborhood of the solution. For the model problem, Phase 1 of Rosen's method would involve solving the problem

    minimize    f0(x) + ½ρ fTf,    (1.3a)
    subject to  l ≤ x ≤ u          (1.3b)

for some choice of ρ. (Phase 2 is then Robinson's method.) The general constraints are not linearized in Phase 1; they appear only in the objective function (1.3a). However, any that are known to be linear would be included in (1.3b). A linearly constrained optimizer is therefore needed in both phases.
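The modified augmented Lagrangian L(x, xk, λk, ρ) defined above can be evaluated directly once f and its Jacobian at xk are available. The following Python sketch is our own illustration on a toy one-constraint problem (the names are hypothetical; it is not part of MINOS/AUGMENTED):

```python
import numpy as np

def modified_augmented_lagrangian(x, f0, f, xk, fk, Jk, lam_k, rho):
    """L(x, xk, lam_k, rho) = f0(x) - lam_k^T (f - fbar) + (rho/2)(f - fbar)^T (f - fbar),
    where fbar = fk + Jk (x - xk) is the linearization of f at xk."""
    fbar = fk + Jk @ (x - xk)
    d = f(x) - fbar  # departure from the linearization: the higher-order terms
    return f0(x) - lam_k @ d + 0.5 * rho * (d @ d)

# Toy data: objective f0(x) = x0 + x1^2, one constraint f(x) = x0^2 + x1 - 1.
f0 = lambda x: x[0] + x[1]**2
f = lambda x: np.array([x[0]**2 + x[1] - 1.0])
xk = np.array([1.0, 0.0])
fk, Jk = f(xk), np.array([[2.0, 1.0]])  # constraint value and Jacobian at xk
lam_k, rho = np.array([0.3]), 10.0

# At x = xk the linearization is exact (f = fbar), so both the Lagrangian
# term and the penalty term vanish and L reduces to f0(xk) = 1.0.
L_at_xk = modified_augmented_lagrangian(xk, f0, f, xk, fk, Jk, lam_k, rho)
```

Away from xk the penalty term grows with the departure of f from its linearization, which is how it discourages large changes in x within a subproblem.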
This method depends heavily on a good choice for the penalty parameter ρ. If ρ is too large, the Phase 1 subproblem may be difficult and expensive to solve. Conversely, if ρ is too small, the point obtained from Phase 1 may be too far from the desired solution for Robinson's method to converge. To overcome these difficulties, Best et al. [9] have recently proposed an algorithm similar to Rosen's, in which provision is made to return to Phase 1 with an increased ρ if it is determined that Robinson's method is not converging. Under certain weak assumptions they are then able to guarantee convergence to a Kuhn-Tucker point for the original problem.

As an alternative to the two-phase approach, the method we propose uses subproblem (1.2) throughout. The penalty term is chosen to ensure that the Hessian of L(x, xk, λk, ρ) is positive definite within an appropriate subspace. It also inhibits large discrepancies between f and f̄, thereby discouraging large changes in x in each subproblem if the nonlinearities are such that the linearized constraints have little meaning far from the point of linearization. As in Rosen's method, we depend rather heavily on making a good choice for ρ. Heuristically, we increase ρ whenever it seems necessary to prevent nonconvergence. More rigorously, we develop a mechanism for deciding when to reduce ρ to zero, in order to benefit from Robinson's method's quadratic convergence. The reason for choosing the modified quadratic penalty function in (1.2a), rather than the conventional ½ρfTf, will become clear when sparsity is reintroduced (Section 3.1).
1.1. Other approaches

One of the few general-purpose methods for large-scale nonlinear programs is the method of approximation programming [23]. This is now often called successive linear programming (SLP), and has been implemented in various forms, typically in conjunction with an existing mathematical programming (MP) system (for example, [4-7]). As the name implies, the method involves a sequence of linear subproblems. These can be formulated and solved using all of the facilities provided by the MP system. Clearly this carries many advantages.

One difficulty with SLP is that the solution to an LP subproblem is unlikely to provide a good approximation to a nonlinear solution, since the latter is normally not at a vertex of the (linearized) constraint set. The bounds on the variables must therefore be manipulated between subproblems, and the methods for doing this tend to be heuristic.

An algorithm based on successive quadratic programs has been implemented for large problems by Escudero [16]. Again this takes advantage of many of the facilities in an existing MP system. GRG [1, 2, 30] is the only other general-purpose algorithm we know of that has been applied to large-scale problems with some success.
1.2. Use of MINOS

For a certain class of objective functions, the development of MINOS has opened the way to solving large, linearly constrained problems quite efficiently. Hence, for large versions of problem (1.1) involving a sparse Jacobian matrix and many purely linear constraints, it is natural to apply MINOS to the corresponding subproblems (1.2). The resulting system is called MINOS/AUGMENTED [37]. Our aim is to describe the algorithm used and some details of its practical implementation, and to discuss its performance on some nontrivial problems.

Note that the Lagrangian and penalty terms in (1.2a) require continual evaluation of the nonlinear constraint functions during the solution of the subproblems. In some cases this may be expensive. MINOS/AUGMENTED therefore allows the option of setting λk = 0 and ρ = 0, so that only the true objective f0(x) remains in (1.2a). Some results obtained using this option are also reported.
2. Brief description of MINOS

MINOS is a particular implementation of Wolfe's reduced-gradient algorithm [54]. It is designed to solve large problems with nonlinear objective functions, expressed in the following standard form:

    minimize    f0(x) + cTx + dTy,    (2.1a)
    subject to  A [x; y] = b,         (2.1b)
                l ≤ [x; y] ≤ u,       (2.1c)

where A is m by n, m ≤ n, and the variables are partitioned into 'nonlinear' and 'linear' variables x and y respectively. (This standard form is a slight generalization of the one normally used for linear programming problems; it emphasizes the fact that nonlinearities often involve just a few of the variables. The concept of linear and nonlinear variables was introduced by Griffith and Stewart [23] in 1961.) For numerous practical reasons the last m columns of A form the identity matrix I, and the last m components of y are the usual logical ('slack' or 'surplus') variables.

MINOS uses an 'active constraint' strategy, with the general constraints and some portion of the bound constraints being active at any given time. Thus if A is partitioned as [B S N], where N is a set of 'nonbasic' columns, the active constraints are always of the form
    [ B  S  N ] [ xB ]   [ b  ]
    [ 0  0  I ] [ xS ] = [ bN ].    (2.2)
                [ xN ]

The first part of this equation is equivalent to

    B xB + S xS + N xN = b,

while the second part, reading xN = bN, indicates that the nonbasic variables xN are being held equal to one or other of their bounds. (The components of bN come from l or u as appropriate, and the partition [xB, xS, xN] is some permutation of [x, y].) The remaining columns of A are partitioned into 'basic' and 'superbasic' sets B and S, such that the basis matrix B is square and nonsingular. The corresponding basic and superbasic variables xB and xS are free to vary between their bounds during the next iteration.

It can readily be shown that an optimal solution of the above form exists for which the number of superbasic variables is less than or equal to the number of nonlinear variables. This point is discussed later in a more general context (Sections 3.1 and 3.2).
2.1. Some aspects of the algorithm used in MINOS

The operator

        [ -B^(-1)S ]
    Z = [     I    ]    (2.3)
        [     0    ]

will be useful for descriptive purposes. The active constraints (2.2) are of the form Âx̂ = b̂, and Z happens to satisfy ÂZ = 0. Under suitable conditions a feasible descent direction p may be obtained from the equations

    ZTGZ pS = -ZTg,    p = Z pS

(see Gill and Murray [18]), where g is the gradient of the objective function (2.1a). Thus, if the reduced gradient ZTg is nonzero and if the reduced Hessian ZTGZ is positive definite (or if any positive definite matrix is used in place of ZTGZ), then the point x + αp lies on the active constraints and some scalar α > 0 exists for which the objective function has a lower value than at the point x.

Other matrices Z exist satisfying ÂZ = 0, but the form chosen above, together with a sparse LU factorization of B, allows efficient computation of the products ZTg and Z pS. (Neither B^(-1) nor Z is computed.) A positive-definite approximation to ZTGZ is maintained in the form RTR where R is upper triangular. Quasi-Newton updates to R lead to superlinear convergence.
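These operations can be sketched as follows (an illustrative dense Python version of our own; MINOS itself works with a sparse LU factorization of B and never forms B^(-1) or Z):

```python
import numpy as np

def reduced_gradient(B, S, gB, gS):
    """Z^T g = gS - S^T pi, where pi solves B^T pi = gB."""
    pi = np.linalg.solve(B.T, gB)
    return gS - S.T @ pi

def search_direction(B, S, pS):
    """p = Z pS computed without forming Z or B^(-1): solve B pB = -S pS; pN = 0."""
    pB = np.linalg.solve(B, -(S @ pS))
    return pB, pS

# Small dense example (a production code would factorize the sparse B once
# and reuse the factors for every solve with B and B^T).
B = np.array([[2.0, 1.0], [0.0, 3.0]])
S = np.array([[1.0], [1.0]])
pS = np.array([1.0])
pB, pS = search_direction(B, S, pS)
# The step stays on the active general constraints: B pB + S pS + N pN = 0 (pN = 0).
```

Because only solves with B and B^T are needed, the cost per iteration is close to that of the revised simplex method when the number of superbasic variables is small.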
Let the gradient of the objective function be partitioned as [gB, gS, gN]. If π satisfies BTπ = gB, it is easily seen that the reduced gradient is

    ZTg = gS - STπ.
Hence in linear programming terminology the reduced gradient is obtained by 'pricing' the superbasic columns S. This is a cheap operation once π has been computed. Likewise for p we have RTR pS = -ZTg and then

        [ -B^(-1)S pS ]   [ pB ]
    p = [      pS     ] = [ pS ],
        [       0     ]   [ pN ]

so most of the work lies in solving B pB = -S pS. (The value pN = 0 indicates that no change will be made to the current nonbasic variables. As long as the reduced gradient ZTg is nonzero, only the variables in [B S] are optimized. If any such variables encounter an upper or lower bound they are moved into N and the partition [B S] is suitably redefined.)

Note that if the reduced gradient does prove to be zero (ZTg = 0), the reduced objective has reached its optimal value. If we compute σ = gN - NTπ (i.e., the usual pricing of nonbasic columns) we then have

    [ BT  0 ]          [ gB ]
    [ ST  0 ] [ π ] =  [ gS ],
    [ NT  I ] [ σ ]    [ gN ]

so that π and σ are exact Lagrange multipliers for the current active constraints. The components of σ indicate whether any nonbasic variables should be released from their bounds. If so, one or more are moved from N into S and optimization continues for the new set [B S]. If not, an optimum has been obtained for the original problem.

In practice, optimization for each [B S] will be curtailed when ZTg is sufficiently small, rather than zero. In this case π will be just an approximation to the Lagrange multipliers for the general constraints. The accuracy of π will depend on the size of ‖ZTg‖ and on the condition number of the current basis B.
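The pricing computation can be sketched in the same spirit (again our own illustration, not MINOS code): at a point where the reduced gradient is zero, π and σ computed as above satisfy the stacked optimality system.

```python
import numpy as np

def price(B, S, N, gB, gN):
    """Multiplier estimates at a subspace minimizer: B^T pi = gB, sigma = gN - N^T pi."""
    pi = np.linalg.solve(B.T, gB)
    sigma = gN - N.T @ pi
    return pi, sigma

# Small example in which gS is chosen so that the reduced gradient is exactly zero.
B = np.array([[4.0, 1.0], [0.0, 2.0]])
S = np.array([[1.0], [3.0]])
N = np.array([[2.0], [1.0]])
gB, gN = np.array([1.0, 1.0]), np.array([5.0])
pi, sigma = price(B, S, N, gB, gN)
gS = S.T @ pi  # at a subspace minimizer, Z^T g = gS - S^T pi = 0
# pi and sigma then satisfy B^T pi = gB, S^T pi = gS, N^T pi + sigma = gN,
# i.e. they are Lagrange multipliers for the current active constraints.
```

A negative component of sigma (for a variable at its lower bound) signals that releasing that nonbasic variable would reduce the objective, which is exactly the test used to move columns from N into S.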
2.2. Key points

The algorithm implemented in MINOS provides a natural extension of linear programming technology to problems whose objective function is nonlinear. If the number of nonlinear variables is moderate (or more precisely, if the number of superbasic variables and hence the dimension of R is moderate), then the
work per iteration is not substantially greater than for one iteration of the revised simplex method on the same data. Here we assume that the cost of evaluating the objective function and its gradient is moderate compared to manipulation of a sparse factorization of the basis matrix B.

At the same time it is important that the step-size α be determined efficiently. The linesearch procedure used in MINOS/AUGMENTED is that of [19, 21], in the form of subroutine GETPTC. This employs cubic interpolation with safeguards, and allows the user to control the accuracy of the search by means of a parameter ETA, where 0.0 ≤ ETA < 1.0. Even with a relatively accurate search (e.g., ETA = 0.01), the number of function and gradient evaluations required is typically very few (usually 1, 2 or 3 per search). This is increasingly beneficial for the algorithm discussed next, where the objective function is modified to include an arbitrary number of nonlinear functions.
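GETPTC itself is not reproduced here, but the role of the accuracy parameter can be illustrated with a minimal safeguarded search of our own (using bisection rather than safeguarded cubic interpolation): it accepts a step α giving sufficient decrease whose directional derivative has been reduced by the factor eta.

```python
def linesearch(phi, dphi, eta=0.01, mu=1e-4, alpha=1.0, max_iter=50):
    """Minimal safeguarded linesearch sketch (not GETPTC).
    Returns alpha with phi(alpha) <= phi(0) + mu*alpha*phi'(0)  (sufficient decrease)
    and |phi'(alpha)| <= eta*|phi'(0)|  (accuracy, the role played by ETA)."""
    phi0, dphi0 = phi(0.0), dphi(0.0)
    assert dphi0 < 0.0, "search direction must be a descent direction"
    lo, hi = 0.0, None  # lo satisfies sufficient decrease; hi (once set) does not, or lies past a minimizer
    for _ in range(max_iter):
        if phi(alpha) > phi0 + mu * alpha * dphi0:
            hi = alpha  # step too long: shrink the bracket
        elif abs(dphi(alpha)) <= eta * abs(dphi0):
            return alpha  # both conditions hold: accept the step
        elif dphi(alpha) > 0.0:
            hi = alpha  # past a one-dimensional minimizer
        else:
            lo = alpha  # still descending steeply: move the lower end up
        alpha = 2.0 * alpha if hi is None else 0.5 * (lo + hi)
    return alpha
```

For phi(a) = (a - 1)^2 - 1 the unit step is accepted immediately; a smaller eta forces the search closer to the one-dimensional minimizer at the cost of extra function and gradient evaluations.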
3. Extension to nonlinear constraints
3.1. Statement of the problem

It is assumed that the nonlinearly constrained problem can be expressed in the following standard form:

    minimize    f0(x) + cTx + dTy,            (3.1a)
    subject to  f(x) + A1y = b1   (m1 rows),  (3.1b)
                A2x + A3y = b2   (m2 rows),   (3.1c)
                l ≤ [x; y] ≤ u,               (3.1d)

where m = m1 + m2 and f(x) = [f1(x), ..., fm1(x)]T. The first n1 variables x are again called 'nonlinear variables'. They occur nonlinearly in either the objective function or the first m1 constraints. There may be purely linear constraints, (3.1c). As before, a full set of slack variables is included as the last m components of the 'linear variables' y, so that general equality and inequality constraints can be accommodated in (3.1b, c) by means of suitable bounds in (3.1d).

We shall assume that the functions fi(x) are twice continuously differentiable with gradients gi(x) and bounded Hessians Gi(x), i = 0, 1, ..., m1. We shall also assume that the first- and second-order Kuhn-Tucker conditions hold for a local minimum x* with corresponding Lagrange multipliers λ*.

The standard form above is essentially the same as the one adopted by Griffith and Stewart [23]. Clearly, if the nonlinear variables x are given fixed values, the
problem reduces to a linear program in y. Beale [6, 7] points out that it can be useful to partition the variables further, such that the problem reduces to a linear program when just some of the nonlinear variables are given fixed values. Without loss of generality, x and f(x) can be expressed in the form

    x = [xN; xL],    f(x) = f̂(xN) + A(xN) xL,    (3.2)

where the vector f̂(xN) and the matrix A(xN) depend only on xN. The objective function can be partitioned similarly. The extent to which an optimal solution deviates from a normal 'basic solution' is then bounded by the dimension of xN rather than the dimension of x. Although we do not exploit such structure explicitly, it is important to be aware that real-life models often arise in this form. Hence, a model expressed in the simpler standard form (3.1) may not be as nonlinear as it looks. This is fortunate, and we bear it in mind in formulating the subproblem below.

The partitioning algorithms of Benders [8] and Rosen [46] were designed for problems in the form (3.1) and (3.2) respectively (see Lasdon [29, Ch. 7]). With certain restrictions, (3.2) also resembles the 'variable coefficients' form of Wolfe's generalized linear program [14, Ch. 22].

3.2. The linearized subproblem

The solution process consists of a sequence of major iterations, each one involving a linearization of the nonlinear constraints at some point xk, corresponding to a first-order Taylor's series approximation:

    fi(x) = fi(xk) + gi(xk)T(x - xk) + O(‖x - xk‖²).
We thus define

    f̄(x, xk) = f(xk) + J(xk)(x - xk),    or    f̄ = fk + Jk(x - xk),    (3.3)

where J(x) is the m1 × n1 Jacobian matrix whose ijth element is ∂fi/∂xj. We then see that

    f - f̄ = (f - fk) - Jk(x - xk)

consists of the higher-order ('nonlinear') terms in the Taylor's expansion of f(x) about the point xk.

At the kth major iteration of our algorithm, the following linearly constrained subproblem is formed:

    minimize over x, y:
        L(x, y, xk, λk, ρ) = f0(x) + cTx + dTy - λkT(f - f̄) + ½ρ(f - f̄)T(f - f̄),    (3.4a)
    subject to
        f̄ + A1y = b1,        (3.4b)
        A2x + A3y = b2,      (3.4c)
        l ≤ [x; y] ≤ u.      (3.4d)

The objective function L is a modified augmented Lagrangian, in which f - f̄ is used in place of the conventional constraint violation, f + A1y - b1. The partial derivatives are

    ∂L/∂x = g0(x) + c - (J - Jk)T[λk - ρ(f - f̄)]    (3.5)

and ∂L/∂y = d. We see that the nonlinearities in L involve x but not y, which means that the subproblem has the same nonlinear variables as the original problem. Further, if the functions are written in the form (3.2) above, f(x) is by definition a linear function of xL, and so f - f̄ is identically zero when xN is held fixed. Thus, in Beale's generalized sense, L is effectively nonlinear in xN alone. The dimension of the reduced Hessian for the subproblem is therefore bounded in a way that is consistent with the nonlinearity of the original problem. This property seems highly desirable.

The modified Lagrangian was used by Robinson [45] with ρ = 0. The use of a penalty term to ensure the augmented Lagrangian maintains a positive-definite Hessian in the appropriate subspace was suggested by Arrow and Solow [3] and adopted later by, among others, Hestenes [25] and Powell [39] in their sequential unconstrained procedures, and by Sargent and Murtagh [50] in conjunction with their 'variable-metric projection' algorithm involving a sequence of linearized constraints. To our knowledge, the modified penalty function has not been used elsewhere. Note that it is identical to the conventional penalty function in the subspace defined by the linearized constraints.

A potential disadvantage of the quadratic penalty function has been pointed out by Greenberg [22], namely, that it destroys separability if any fi(x) is the sum of two or more separable functions. This means that the subproblems can be solved by a separable programming code only in the case where each fi(x) = fi(xj) for some j, or in the convex programming case where the penalty term is not required anyway.
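Formula (3.5) can be checked numerically on a toy problem. The sketch below is our own Python illustration (omitting the linear terms cTx + dTy): it compares the analytic partial derivatives of L with central differences.

```python
import numpy as np

def dL_dx(x, lam_k, rho, g0, f, J, xk, fk, Jk):
    """Formula (3.5): dL/dx = g0(x) - (J(x) - Jk)^T [lam_k - rho (f - fbar)]
    (linear-variable terms omitted in this toy sketch)."""
    fbar = fk + Jk @ (x - xk)
    return g0(x) - (J(x) - Jk).T @ (lam_k - rho * (f(x) - fbar))

# Toy data: f0(x) = sin(x0) + x1^2, one constraint f(x) = [x0^2 + x1^3].
f0 = lambda x: np.sin(x[0]) + x[1]**2
g0 = lambda x: np.array([np.cos(x[0]), 2 * x[1]])
f = lambda x: np.array([x[0]**2 + x[1]**3])
J = lambda x: np.array([[2 * x[0], 3 * x[1]**2]])
xk = np.array([0.3, 0.7])
fk, Jk = f(xk), J(xk)
lam_k, rho = np.array([0.5]), 2.0

def L(x):
    """Objective (3.4a) without the linear terms: f0 - lam^T(f - fbar) + (rho/2)|f - fbar|^2."""
    fbar = fk + Jk @ (x - xk)
    d = f(x) - fbar
    return f0(x) - lam_k @ d + 0.5 * rho * (d @ d)

x = np.array([0.4, 0.6])
h = 1e-6
fd = np.array([(L(x + h * e) - L(x - h * e)) / (2 * h) for e in np.eye(2)])
assert np.allclose(fd, dL_dx(x, lam_k, rho, g0, f, J, xk, fk, Jk), atol=1e-5)
```

Note also that at x = xk the factor J - Jk vanishes, so the gradient of L reduces to g0(xk) there, consistent with the linearization being exact at the point of expansion.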
3.3. Choice of λk

The choice λk = 0, ρ = 0 corresponds to simple sequential linearization of the nonlinear constraints, with no additional terms added to f0(x) in the objective function. We shall call this the Newton strategy, although it should not be confused with applying Newton's method to the Kuhn-Tucker equations for a solution of (3.1). This simplified procedure does converge in certain cases, particularly if the solution is at a vertex (e.g., reverse-convex programs [47, 33, 26]).
Ideally, λk should be as close as possible to λ*, but of course the optimal multipliers are normally unknown. The simplest choice is λk = λ̂, the multipliers corresponding to the linearized constraints at the solution of the previous subproblem. As we shall see, this choice is the best of several alternatives.

For convenience suppose there are no linear constraints, so that λ̂ = π is the solution of BTπ = gB at the end of the previous major iteration. We know that π also satisfies STπ = gS (at least to within the convergence tolerance used for the subproblem). We thus have

    [B S]T λ̂ = [gB; gS],

and it is immaterial which variables are in B and which are in S. Now gi is zero for all slack variables, and it follows immediately that λ̂i = 0 if the ith linearized constraint is inactive. The choice λk = λ̂ therefore ensures that an apparently inactive nonlinear constraint will be excluded from the Lagrangian term λkT(f - f̄) in the next subproblem. This is a desirable property.

It may seem that a better approximation to λ* could be obtained by evaluating the new Jacobian J(x̂), which is required anyway for the next subproblem. Let the resulting 'new' [B S] be denoted by [B̂ Ŝ]. One possibility is to define λk as the solution of the least-squares problem

    minimize ‖[B̂ Ŝ]T λ - [gB; gS]‖,

where the right-hand side is still the 'old' gradient vector for the previous augmented Lagrangian. However, this least-squares problem would be very expensive to solve for large problems. Furthermore it is not guaranteed that λ̂i = 0 would result where desired. A cheaper alternative would be to solve B̂Tλ̂ = gB and take λk = λ̂, but then λ̂i = 0 for inactive constraints would be assured only if the corresponding slack variable happened to be basic and not superbasic.

If the new B̂ is to be used, the method of Sargent and Murtagh [50] shows that

    B̂Tλ = gB + [I 0 0] GL Δx

would produce the correct multipliers for the solution of the new subproblem if the original objective and constraints were quadratic and GL was an adequate approximation to the Hessian of the new Lagrangian. (See equation (12) in [35].) However, this again is not a practical alternative for large problems.

3.4. Choice of ρ

It is well known that x* need not be a local minimum of the Lagrangian function (with ρ = 0). If we assume that J(x*) is of full rank, then λ* exists and
B.A. Murtagh and M.A. Saunders/Sparse nonlinear constraints
is such that the Lagrangian

    L(x, λ) = f⁰(x) + c^T x + d^T y - λ^T [f(x) + A₁y - b₁]

is stationary at (x*, λ*), but L(x*, λ*) may well display negative curvature in x at x*. The most that can be said [55] is that, if we consider the constraints satisfied at x* as equalities and ignore the inactive ones, then a necessary (sufficient) condition that x* is a local minimum is

    Z(x*)^T (∂L/∂x)(x*, λ*) = 0

and

    Z(x*)^T (∂²L/∂x²)(x*, λ*) Z(x*) is positive semidefinite (positive definite),

where Z(x*) is as defined in (2.3) using J(x*) in the appropriate part of A. Thus if we restrict our search to the linearly constrained subspace defined by Z(x*) we do indeed seek a minimum of the Lagrangian, and we may expect that when x_k is sufficiently close to x* for J(x_k) to be close to J(x*) we may minimize (3.4a) with ρ = 0. This is confirmed by Robinson's theorem on quadratic convergence [45]. Difficulty arises when x_k is far removed from x*, since the linearized constraints may define a subspace in which a saddle point would perhaps be closer to x* than a minimum would be. Successive minima of (3.4) with ρ = 0 may therefore fail to converge to x*. The addition of a penalty term ρ(f - f̄)^T (f - f̄) imposes the correct curvature properties on (3.4a) for sufficiently large ρ > 0. For general nonconvex problems it is not practical to determine a priori what the appropriate order of magnitude of ρ should be (indeed, ρ = 0 is often adequate even in the nonconvex case). The more important consideration is when to reduce ρ to zero, for we know that there is a radius of convergence around (x*, λ*) within which Robinson's theorem holds for ρ = 0, and we can then expect a quadratic rate of convergence.

Two parameters we can monitor at the solution (x̂, λ̂) of each linearized subproblem are the constraint violation or 'row error',

    ‖f(x̂) + A₁ŷ - b₁‖ = ‖f(x̂) - f̄(x̂, x_k)‖,

and the change in multiplier estimates, ‖λ̂ - λ_k‖. The question that arises is whether these can be used to provide adequate measures of convergence toward x*. For simplicity, consider the equality-constrained problem
P₀:  minimize    f⁰(x),
     subject to  f(x) = 0,
where the functions of x are twice continuously differentiable with bounded Hessians. We shall assume that at some point x* the Jacobian J(x*) is of full rank, there exists a λ* such that g⁰(x*) = J(x*)^T λ*, and the reduced Hessian Z(x*)^T (∂²L/∂x²)(x*, λ*) Z(x*) is positive definite (i.e., the sufficiency conditions are satisfied for x* to be a local optimum).

Theorem 1. Let (x_k, λ_k) be an approximate solution to P₀ and let (x̂, λ̂) be a solution to the linearized subproblem

S₁:  minimize    f⁰(x) - λ_k^T (f - f̄) + ½ρ(f - f̄)^T (f - f̄),
     subject to  f̄(x, x_k) = 0.

If λ̂ - λ_k = ε₁ and f(x̂) = ε₂, then (x̂, λ̂) is also a solution to the perturbed problem

P₁:  minimize    f⁰(x) + (ε₁ + ρε₂)^T (f - f̄),
     subject to  f(x) = ε₂,

for sufficiently small ε₁ and ε₂.

Proof. If (x̂, λ̂) is a solution of S₁ we must have f̄ = 0 and

    g⁰ - (J - J_k)^T λ_k + ρ(J - J_k)^T (f - f̄) = J_k^T λ̂,

where J_k is the Jacobian at x_k but J, f and f̄ are evaluated at x̂. Adding (J - J_k)^T λ̂ to both sides and inserting the expressions for ε₁ and ε₂ gives

    g⁰ + (J - J_k)^T ε₁ + ρ(J - J_k)^T ε₂ = J^T λ̂,

which shows that (x̂, λ̂) also satisfies the conditions for a stationary point of P₁. Now it can be shown that the Hessians of the Lagrangian functions of S₁ and P₁ differ only by the amount ρ(J - J_k)^T (J - J_k) at the solution of P₁, which is of order ρ‖Δx_k‖² where Δx_k = x̂ - x_k. Hence for sufficiently small ε₁, ε₂ and Δx_k, if the reduced Hessian of S₁ is positive definite at x̂, then by continuity the reduced Hessian of P₁ will also be positive definite, thus satisfying the sufficiency conditions for a local minimum of P₁ at x̂.

It is of interest to examine the corresponding result for the conventional penalty term.

Theorem 2. Let (x_k, λ_k) be an approximate solution to P₀ and let (x̂, λ̂) be a solution to the linearized subproblem

S₂:  minimize    f⁰(x) - λ_k^T (f - f̄) + ½ρ f^T f,
     subject to  f̄(x, x_k) = 0.

If λ̂ - λ_k = ε₁ and f(x̂) = ε₂, then (x̂, λ̂) is also a solution to the perturbed problem
P₂:  minimize    f⁰(x) + ε₁^T (f - f̄) + ρ ε₂^T f,
     subject to  f(x) = ε₂.
Proof. Analogous to the proof of Theorem 1.

Again it follows that if ε₁ and ε₂ are sufficiently small, (x̂, λ̂) will be within the radius of convergence of Robinson's theorem and ρ can safely be reduced to zero.

A point of interest is that problem P₁ appears to be less sensitive than P₂ to deviations from its optimum. Thus, let Δx be an arbitrary small change to the solution x̂ of P₁. The objective function for P₁ then differs from the true objective f⁰ by an amount

    δ₁ = (ε₁ + ρε₂)^T (f - f̄),    |δ₁| ≤ (‖ε₁‖ + ρ‖ε₂‖) O(‖Δx‖²).

For P₂ the analogous deviation is

    δ₂ = ε₁^T (f - f̄) + ρε₂^T f = ε₁^T (f - f̄) + ρε₂^T (ε₂ + JΔx + O(‖Δx‖²)),
    |δ₂| ≤ (‖ε₁‖ + ρ‖ε₂‖) O(‖Δx‖²) + ρ‖ε₂‖² + ρ‖J^T ε₂‖ ‖Δx‖.

Since δ₁ is of order ‖Δx‖² while δ₂ is of order ‖Δx‖, it appears that the modified penalty term in S₁ has a theoretical advantage over the conventional penalty term of S₂.

3.5. Summary of procedure
The cycle of major iterations can be described as follows:

Step 0. Set k = 0. Choose some initial estimates x₀, y₀ and λ₀. Specify a penalty parameter ρ ≥ 0 and a convergence tolerance ε_c > 0.

Step 1. (a) Given x_k, y_k, λ_k and ρ, solve the linearly constrained subproblem (3.4) to obtain new quantities x_{k+1}, y_{k+1} and π (where π is the vector of Lagrange multipliers for the subproblem). (b) Set λ_{k+1} = the first m₁ components of π.

Step 2. (a) Test for convergence (see Section 4.8). If optimal, exit. (b) If

    ‖f(x_{k+1}) + A₁y_{k+1} - b₁‖/(1 + ‖[x_{k+1}, y_{k+1}]‖) ≤ ε_c

and

    ‖λ_{k+1} - λ_k‖/(1 + ‖λ_{k+1}‖) ≤ ε_c,

then set ρ = 0. (c) Relinearize the constraints at x_{k+1}. (d) Set k = k + 1 and repeat from Step 1.
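As a rough illustration, the cycle of Steps 0-2 can be sketched as follows. This is a schematic only, not the MINOS/AUGMENTED implementation: `solve_subproblem` and `row_error` are hypothetical stand-ins for the linearly constrained solver and the scaled constraint violation.

```python
# Schematic of the major-iteration cycle (Steps 0-2).  solve_subproblem
# and row_error are hypothetical stand-ins supplied by the caller.
def major_iterations(x0, lam0, solve_subproblem, row_error,
                     rho=1.0, eps_c=1e-2, max_major=60):
    """Iterate until the subproblem solution stops moving and is feasible."""
    x, lam = x0, lam0
    for k in range(max_major):
        # Step 1: solve the linearized subproblem at the current point.
        x_new, lam_new = solve_subproblem(x, lam, rho)
        # Step 2(b): drop the penalty term once the row error and the
        # relative change in the multipliers are both below eps_c.
        dlam = max((abs(a - b) for a, b in zip(lam_new, lam)), default=0.0)
        if (row_error(x_new) <= eps_c and
                dlam / (1.0 + max(map(abs, lam_new), default=1.0)) <= eps_c):
            rho = 0.0
        # Step 2(a): converged when feasible and the iterate has settled.
        if (row_error(x_new) <= 1e-6 and
                all(abs(a - b) <= 1e-8 for a, b in zip(x_new, x))):
            return x_new, lam_new
        x, lam = x_new, lam_new  # Steps 2(c)-(d): relinearize and repeat
    return x, lam
```

The convergence tolerances here are placeholders; the actual tests used by the implementation are described in Section 4.8.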
This procedure would not be complete without an algorithm for increasing the penalty parameter in certain circumstances. In Step 2(b) of the present implementation, we raise ρ by some factor if the relative change in λ_k proves to be very large.
4. Computer implementation

4.1. Sparse matrices

Using equation (3.3), the linearized constraints (3.4b) can be expressed in the form

    J_k x + A₁ y = b₁ + J_k x_k - f_k,                                (4.1)

where f_k = f(x_k). The terms on the right-hand side of (4.1) are constant and become part of 'b', the current right-hand side. The set of linear constraints 'Ax = b' for each major iteration is thus of the form

    [J_k A₁; A₂ A₃] [x; y] = [b₁ + J_k x_k - f_k; b₂].               (4.2)
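For illustration only, forming the constant right-hand side of (4.1) might look as follows in a dense sketch; the implementation of course works with sparse structures, and the names here are hypothetical.

```python
# Sketch: assemble the constant right-hand side b1 + Jk*xk - fk of the
# linearized constraints (4.1).  Dense lists stand in for sparse storage.
def linearized_rhs(b1, Jk, xk, fk):
    """b1, fk: vectors (lists); Jk: Jacobian as a list of rows."""
    Jk_xk = [sum(a * x for a, x in zip(row, xk)) for row in Jk]
    return [b + jx - f for b, jx, f in zip(b1, Jk_xk, fk)]
```

The resulting vector stays fixed over the minor iterations of one subproblem and is rebuilt at each relinearization.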
The major implication of A being large and sparse is that efficient methods are available for forming and updating an LU factorization of the basis matrix B (cf. (2.2)). (In particular, a 'bump and spike' algorithm [24] is used to preserve sparsity at each refactorization of B. This occurs at the start of every relinearization and occasionally thereafter as necessary. At each intervening change of basis the LU factors are updated using the scheme described by Saunders [51] to preserve both sparseness and numerical stability.)

4.2. Infeasible subproblems
One of the difficulties with sequential linearization is that some of the linearized subproblems may prove to be infeasible. In particular, the point [x_k, y_k] used to define subproblem k is usually not a feasible point for the subproblem. However, it will typically be feasible for the previous subproblem (and possibly optimal). This can be used to advantage in the manner suggested by Powell [41]. Thus we write the linearized constraints (4.1) in the form

    J_k x + A₁ y = b₁ + J_k x_k - f_k + γq,                          (4.3)

where γq is a perturbation to the right-hand side. If [x_k, y_k] is the final feasible point from subproblem k - 1, we can show that it will also be feasible with respect to the new linearized constraints (4.3) if γ = 1 and q = f(x_k) - f̄(x_k, x_{k-1}). (Thus q is the value of f - f̄ at the end of the previous major iteration.)

In MINOS/AUGMENTED the right-hand side of (4.3) is initialized with γ = 0. If the subproblem proves to be infeasible we add ½q to the right-hand side and continue the solution process. If there is still no feasible solution we add ¼q, ⅛q and so on. This simulates the sequence of values γ = ½, ¾, ⅞, ..., tending to 1 as desired.

If the above procedure fails after 10 modifications, or if it is not applicable (e.g., when k = 0 or the previous subproblem was infeasible), a new linearization is requested as long as at least one minor iteration has been performed. Otherwise the algorithm is terminated with the assumption that the original problem itself is infeasible.

In [48], Rosen guards against infeasible subproblems by linearizing perhaps only some of the nonlinear constraints, namely those that have been active or reasonably close to active at any earlier stage. This alternative could be implemented in MINOS/AUGMENTED by adjusting the bounds on the slack variables associated with the linearized constraints.

4.3. User options

Various implementation options are discussed in the following sections. Capitalized keywords at the head of each section illustrate the input data needed to select any particular option. Fuller details are given in the user's manual [37].

4.4. Subroutines CALCFG and CALCON

VERIFY OBJECTIVE GRADIENT
VERIFY CONSTRAINT GRADIENTS

As in the linearly constrained version of MINOS, a user-written subroutine CALCFG is required to calculate the objective function f⁰(x) and its gradient. The Lagrangian terms in (3.4a) are calculated internally. The user also supplies a subroutine CALCON to define the constraint vector f(x) and the current Jacobian J(x). The nonzeros in J are returned column-wise in an output vector and must be in the same order as the corresponding entries in the MPS file (see below).

Subroutine CALCON is called every time the constraints are linearized. It is also called one or more times each linesearch (except with the Newton strategy), to allow computation of (3.4a) and (3.5).
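The successive right-hand-side modifications of Section 4.2 (γ = ½, ¾, ⅞, ... tending to 1) can be sketched as follows; `is_feasible` is a hypothetical test of whether the perturbed subproblem admits a feasible point.

```python
# Sketch of the feasibility-restoration scheme of Section 4.2: add q/2,
# then q/4, q/8, ..., simulating gamma = 1/2, 3/4, 7/8, ... tending to 1.
def restore_feasibility(rhs, q, is_feasible, max_mods=10):
    """Return (gamma, rhs) once feasible, or (None, rhs) after max_mods."""
    gamma, step = 0.0, 0.5
    for _ in range(max_mods):
        if is_feasible(rhs):
            return gamma, rhs
        rhs = [r + step * qi for r, qi in zip(rhs, q)]
        gamma += step
        step *= 0.5  # each retry adds half the previous amount
    return (gamma, rhs) if is_feasible(rhs) else (None, rhs)
```

On failure (γ never reaches a feasible value within 10 modifications) the caller would request a new linearization or declare the problem infeasible, as described above.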
The expense of evaluating the constraints and their gradients should therefore be taken into account when specifying the linesearch accuracy. Note that every function and Jacobian element is computed in every call to CALCON. Although some of these values may effectively be wasted (e.g. if some of the constraints are a long way from being active), the resulting simplicity of the subroutine from the user's point of view cannot be overemphasized.

Since the programming of gradients is notoriously prone to error, the VERIFY option is an essential aid to the user. This requests a check on the output from
CALCFG and/or CALCON, using finite differences of f⁰ or f(x) along the coordinate directions. The check is performed at the first feasible point obtained (where feasibility is with respect to the first linearized subproblem). This point will not satisfy the nonlinear constraints in general, but at least it will satisfy the linear constraints and the upper and lower bounds on x. Hence it is usually possible to avoid singularities in the nonlinear functions, both in the gradient check and in subsequent iterations.
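The idea behind such a VERIFY check can be sketched as a forward-difference comparison along the coordinate directions; the tolerances here are illustrative, not those used by MINOS.

```python
# Sketch of a VERIFY-style gradient check: compare a user-supplied
# gradient with forward-difference estimates along coordinate directions.
def verify_gradient(f, grad, x, h=1e-6, tol=1e-4):
    """Return indices i where grad(x)[i] disagrees with (f(x+h*e_i)-f(x))/h."""
    g, fx, suspect = grad(x), f(x), []
    for i in range(len(x)):
        xp = list(x)
        xp[i] += h
        approx = (f(xp) - fx) / h
        if abs(approx - g[i]) > tol * (1.0 + abs(g[i])):
            suspect.append(i)
    return suspect
```

An empty result means every gradient component agrees with its difference estimate to within the (loose) tolerance.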
4.5. Jacobian option

JACOBIAN = SPARSE or DENSE

The submatrices A₁, A₂, A₃ and vectors b₁, b₂ in (4.2) are constant data and so may be entered using a standard MPS input file, as in linear programming, whereby only the nonzero coefficients and their row locations are entered column by column. Since we envisage that the Jacobian submatrix J will also be large and sparse, we use the same scheme for recording the row and column locations of its nonzeros. Thus (with JACOBIAN = SPARSE) the sparsity pattern of J is entered as part of the MPS file. The corresponding numerical values in the MPS file may be genuine coefficients (if they are constant) or else dummy values, such as zero. Each call to subroutine CALCON subsequently replaces all dummy entries by their actual values at the current point x. Of course the intention here is to allow use of standard matrix generators to specify as much of the constraint matrix as possible.

Pinpointing the nonzeros of J by name rather than number has the usual advantages, and in subroutine CALCON some code of the form

    LJAC = LJAC + 1
    G(LJAC) = ...

is usually adequate to define the next nonzero in a column of the Jacobian, without explicit reference to any row or column numbers. Nevertheless, the user is effectively required to give the sparsity pattern twice (in the MPS file and in CALCON), and it is essential that mismatches be avoided. At present the VERIFY option is the only aid to detecting incompatibility.

In the interest of simplicity, the option JACOBIAN = DENSE allows J to be treated as a dense matrix. In this case the MPS file need not specify any elements of J, and subroutine CALCON can use assignment statements of the form

    G(I, J) = ...

to specify J_ij by row and column number. The danger of mismatches is thereby eliminated, but the storage requirements may be excessive for large problems. It may also give rise to an unnecessarily large 'bump' in the basis factorizations.
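The column-wise convention can be mimicked in a sketch: the sparsity pattern fixes the order of the nonzeros, and a CALCON-like routine fills the values in that same order, as in the FORTRAN fragment above. All names here are hypothetical.

```python
# Sketch of JACOBIAN = SPARSE value filling: the pattern (row indices,
# column by column) is fixed by the MPS file; a CALCON-like routine
# produces the nonzero values in exactly that order.
def calcon_fill(pattern, value_funcs, x):
    """pattern[j]: row indices of column j's nonzeros;
    value_funcs[j][k]: function of x for the k-th nonzero of column j."""
    g = []  # output vector of Jacobian nonzeros (LJAC implicit in len(g))
    for j, rows in enumerate(pattern):
        for k in range(len(rows)):
            g.append(value_funcs[j][k](x))
    return g
```

A mismatch between `pattern` and `value_funcs` corresponds to the MPS/CALCON incompatibility warned about above.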
4.6. Partial completion

COMPLETION = PARTIAL or FULL

Partial completion is a compromise between the two extremes of relinearizing after each linesearch and solving each subproblem to high accuracy (full completion). The idea of attaining only partial completion at each major iteration can be accommodated effectively via the convergence tolerances. Following Gill and Murray [20], MINOS uses relatively loose tolerances for minimizing the reduced objective, until it appears that the optimal partition [B S N] has been identified. The partial completion option is effected by terminating a major iteration at this stage. Otherwise, for full completion, the normal optimization procedure is continued using tighter tolerances to measure the change in x, the size of the reduced gradient (‖Z^T g‖/‖π‖), etc. This usually gives only small changes in x and π without changing the partition [B S N]. An alternative means of achieving partial completion for early major iterations is via the MINOR ITERATIONS limit (see Section 4.8).
4.7. Lagrangian option, penalty parameter

Newton strategy:

    LAGRANGIAN           NO
    PENALTY PARAMETER    0.0
With this option the constraint functions and gradients are evaluated only once per major iteration.
Augmented Lagrangian:

    LAGRANGIAN           YES
    PENALTY PARAMETER    ρ    (ρ ≥ 0)

Here the constraints and Jacobian are evaluated as often as the objective. Evaluation of the augmented Lagrangian and its gradient with ρ > 0 is negligibly more expensive than with ρ = 0.
4.8. Convergence conditions

    MAJOR ITERATIONS         60
    MINOR ITERATIONS         40
    RADIUS OF CONVERGENCE    ε_c  (≈ 10⁻²)
    ROW TOLERANCE            ε_r  (≈ 10⁻⁶)
Apart from terminating within each major iteration, we also need a terminating condition for the cycle of major iterations (Step 2(a), Section 3.5). The point [xk, yk] is assumed to be a solution to the nonlinearly constrained problem (3.1) if both the following conditions are satisfied:
(1) [x_k, y_k] satisfies the nonlinear constraints (3.1b) to within a predefined error tolerance, i.e.,

    ‖f(x_k) + A₁y_k - b₁‖/(1 + ‖[x_k, y_k]‖) ≤ ε_r.

(2) [x_k, y_k] satisfies the first-order Kuhn-Tucker conditions for a solution to the linearized problem.

The tolerance parameter ε_r is specified by the user, and was set equal to 10⁻⁶ for most of the test problems described in the subsequent sections. If the partial completion option is used, then once these two criteria are satisfied, a switch to full completion is invoked to obtain an accurate solution for the current subproblem, as well as for any possible further linearizations.

The error tolerance ε_c is used to define a radius of convergence about (x*, λ*) within which Robinson's theorem is assumed to hold. If the row error defined above and the relative change in Lagrange multipliers between successive subproblems are both less than ε_c in magnitude, then the penalty term is dropped (i.e. ρ is set to 0.0).

The MINOR ITERATIONS limit is used to terminate each linearized subproblem when the number of linesearches becomes excessive. The limit of 40 was used in most of the numerical experiments. A much smaller number would result in more frequent use of expensive housekeeping operations such as the refactorization of B. Similarly, a much larger number may be wasteful; if significant changes to x have occurred, then a new linearization is appropriate, while if there has been little progress, then updating the Lagrangian information gives some hope of more rapid progress.

It must be noted that for some choices of x₀, λ₀ and ρ the sequence of major iterations may not converge. The MAJOR ITERATIONS limit provides a safeguard for such circumstances. For a discussion of linearly constrained Lagrangian methods and their drawbacks see Wright [55, pp. 137-147].
In the present implementation of MINOS/AUGMENTED, if convergence does not occur, the only recourse is to restart with a different (x₀, λ₀) or a larger value of the penalty parameter ρ (or both).
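As a sketch, the two-part test of this section might be coded as follows. Representing the Kuhn-Tucker condition by a scaled reduced-gradient norm is an assumption of this sketch, and all names are hypothetical.

```python
# Sketch of the two-part convergence test: (1) scaled row error below
# eps_r; (2) first-order conditions for the linearized problem, stood in
# for here by a scaled reduced-gradient norm.
def converged(row_err, iterate_norm, reduced_grad_norm, pi_norm,
              eps_r=1e-6, eps_g=1e-6):
    feasible = row_err / (1.0 + iterate_norm) <= eps_r         # condition (1)
    stationary = reduced_grad_norm / (1.0 + pi_norm) <= eps_g  # condition (2)
    return feasible and stationary
```

Both scalings include the "1 +" term so the test behaves sensibly whether the iterate and multipliers are large or near zero.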
5. Test problems

Most of the test examples reported here appear in the published literature. The last two examples are new and are described in detail. They can be made arbitrarily large by increasing one parameter.

5.1. Colville No. 2

This well-known problem is one of the more testing of the Colville series of problems [12] and has been used frequently to compare different algorithms [2,
17, 45, 50]. It has a cubic objective function and 15 quadratic constraints. Even in this small problem the variables can be partitioned into linear and nonlinear sets, of dimension 10 and 5 respectively. (a) Feasible starting point. (b) Infeasible starting point. 5.2. Colville No. 3
This problem has a quadratic objective function of five variables and three quadratic constraints. It also has upper and lower bounds on all the variables, and upper and lower limits on the constraints. These can be accommodated effectively by using the BOUNDS and RANGES options in the MPS file; the BOUNDS option allows variables to be nonbasic at their upper or lower bound, and the RANGES option assigns both upper and lower bounds to the slack variables associated with the constraints (thus allowing the right-hand side to range between bounds). (a) Feasible starting point. (b) Infeasible starting point.

5.3. Colville No. 8
This is a typical process design problem, where some of the variables are determined by solving nonlinear equations. It has 3 independent variables and 7 constraints. 5.4. Powell's problem [40]
This has 5 variables and 3 constraints. Although small, this is a good test problem, as the nonlinearities in the constraints are quite significant.

    minimize    exp(x₁x₂x₃x₄x₅),
    subject to  x₁² + x₂² + x₃² + x₄² + x₅² = 10,
                x₂x₃ - 5x₄x₅ = 0,
                x₁³ + x₂³ = -1.

Starting point: (-2, 2, 2, -1, -1).

5.5. Power scheduling
This is a comparatively large test problem reported recently by Biggs and Laughton [10], with 79 variables and 91 constraints (all nonlinear). It also has upper and lower bounds on some of the variables. Although all the variables and constraints are nonlinear, the linearized constraint matrix J_k (4.3) is quite sparse, with on average a little under 6 nonzero row entries per column. Treating it as a dense matrix could result in a 'bump' of 79 columns, which is clearly undesirable. A number of minor discrepancies between Biggs and Laughton's paper and the original statement of the problem [53] were resolved by using the original data.
5.6. Launch vehicle design

This problem appears in Bracken and McCormick's book on nonlinear programming applications [11] and also appears in [49]. There are 12 linear constraints and 10 nonlinear constraints, and the Jacobian of the nonlinear constraints is quite sparse (density 23%), yielding an overall matrix density of 15%. All 25 variables are nonlinear.
5.7. Quartic with quartic constraints This problem appears in Pierre and Lowe [38]. Only a few terms are quartic, the remainder being predominantly quadratic. It has 20 variables (all nonlinear) and 17 constraints, 13 of which are nonlinear.
5.8. Dembo No. 7 This is one of Dembo's set of 8 Geometric Programming test problems [15]; in particular, it required the most computation time in Dembo's results. Other authors have reported difficulty with the problem [13, 42]. There are 16 variables (all nonlinear) and 19 general constraints (11 of them nonlinear). The solution has a few primal and dual degeneracies, but it is essentially at a vertex of the constraint space (i.e., a vertex of the final linearization).
5.9. Wright No. 4 [55]

This is a highly nonlinear nonconvex problem for which four local minima have been determined.

    minimize    (x₁ - 1)² + (x₁ - x₂)² + (x₂ - x₃)³ + (x₃ - x₄)⁴ + (x₄ - x₅)⁴,
    subject to  x₁ + x₂² + x₃³ = 2 + 3√2,
                x₂ - x₃² + x₄ = -2 + 2√2,
                x₁x₅ = 2.

Starting points:
(a) (1, 1, 1, 1, 1),
(b) (2, 2, 2, 2, 2),
(c) (-1, 3, -½, -2, -3),
(d) (-1, 2, 1, -2, -2),
(e) (-2, -2, -2, -2, -2).
Local optima:
x*(1) = (1.11663, 1.22044, 1.53779, 1.97277, 1.79110),
x*(2) = (-2.79087, -3.00414, 0.205371, 3.87474, -0.716623),
x*(3) = (-1.27305, 2.41035, 1.19486, -0.154239, -1.57103),
x*(4) = (-0.703393, 2.63570, -0.0963618, -1.79799, -2.84336).
5.10. Wright No. 9 [55]

This is another highly nonlinear example.

    minimize    10x₁x₄ - 6x₃x₂² + x₂x₁³ + 9 sin(x₅ - x₃) + x₅⁴x₄²x₂³,
    subject to  x₁² + x₂² + x₃² + x₄² + x₅² ≤ 20,
                x₁²x₃ + x₄x₅ ≥ 2,
                x₂²x₄ + 10x₁x₅ ≥ 5.

Starting points:
(a) (1, 1, 1, 1, 1),
(b) (1.091, -3.174, 1.214, -1.614, 2.134).

Local optima:
x*(1) = (-0.0814522, 3.69238, 2.48741, 0.377134, 0.173983),
x*(2) = (1.47963, -2.63661, 1.05468, -1.61151, 2.67388).

With the barrier trajectory algorithm, Wright [55] obtained convergence to x*(1) from (a) and convergence to x*(2) from (b). Note that starting point (b) was originally chosen to cause difficulty for the barrier algorithm and other methods that retain feasibility throughout.
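As a check on the formulas as transcribed here (they have been reconstructed from a damaged scan, so verify against [55]), the functions can be evaluated directly; at starting point (a) all three constraints hold, the second with equality.

```python
import math

# Problem 5.10 as transcribed above (reconstructed from the scan).
def f0(x):
    x1, x2, x3, x4, x5 = x
    return (10.0 * x1 * x4 - 6.0 * x3 * x2 ** 2 + x2 * x1 ** 3
            + 9.0 * math.sin(x5 - x3) + x5 ** 4 * x4 ** 2 * x2 ** 3)

def constraint_slacks(x):
    """Nonnegative at a feasible point."""
    x1, x2, x3, x4, x5 = x
    return [20.0 - (x1 ** 2 + x2 ** 2 + x3 ** 2 + x4 ** 2 + x5 ** 2),
            x1 ** 2 * x3 + x4 * x5 - 2.0,
            x2 ** 2 * x4 + 10.0 * x1 * x5 - 5.0]
```

At (1, 1, 1, 1, 1) the objective evaluates to 6 and the slacks to [15, 0, 6], so the second constraint is active at starting point (a).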
5.11. Optimal control This problem investigates the optimal control of a spring, mass and damper system. It is adapted from [44]. While it is acknowledged that there may be simpler ways of solving the problem by taking specific advantage of the nature of the constraints, it serves the present purpose of providing a large, sparse test problem. T
minimize subject to
f(x, y, u) = 89 ~ Xzt, xt+l = xt +
0.2yt
Yt+l = Yt - O.Olyt2 - O.O04xt + 0.2ut -0.2 - ut -< 0.2 y, -> - 1.0 xo = 10, Yo= O, Yr = O. Starting point: xt = O, Yt = - 1, t = 1.... , T.
t=0
.....
T-1
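The recurrences above are easy to simulate forward for a given control sequence, which shows the sparsity structure at work (each constraint couples only a few neighbouring variables). A sketch, not attempting the optimization itself:

```python
# Sketch: forward simulation of the spring-mass-damper recurrences of
# Problem 5.11 for a given control sequence u of length T.
def simulate(u, x0=10.0, y0=0.0):
    x, y = [x0], [y0]
    for t in range(len(u)):
        x.append(x[t] + 0.2 * y[t])
        y.append(y[t] - 0.01 * y[t] ** 2 - 0.004 * x[t] + 0.2 * u[t])
    return x, y

def objective(x):
    return 0.5 * sum(xt * xt for xt in x[1:])  # (1/2) sum_{t=1}^T x_t^2
```

The optimization must in addition enforce the bounds on u_t and y_t and the terminal condition y_T = 0, which this forward recursion ignores.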
For the results below we set T = 100. There are thus 202 nonlinear variables (x_t, y_t, t = 0, ..., 100) and 100 linear variables (u_t, t = 0, ..., 99). There are also 100 quadratic constraints and 100 linear constraints. Note that the nonlinear constraints are very sparse, with only 4 nonzeros per row.

The solution is dominated to a large extent by the lower bound on y_t; the optimal y_t decreases with increasing t and remains at -1.0 from t = 20 to t = 40, and then increases again, goes slightly positive and settles to 0.0 at t = 100. The corresponding values of x_t can be calculated directly from the linear constraint x_{t+1} = x_t + 0.2y_t. The optimal value of the objective is ½‖x‖² = 1186.382.

5.12. Economic growth model (Manne [31])

This is a model for optimizing aggregate consumption over time. The variables are C_t, I_t and K_t, which represent consumption, investment and capital in time period t, for t = 1, ..., T.
Utility function:

    maximize  Σ_{t=1}^{T} β_t log_e C_t.

Nonlinear constraints:

    a_t K_t^b ≥ C_t + I_t,    1 ≤ t ≤ T.

Linear constraints:

    K_{t+1} ≤ K_t + I_t,    1 ≤ t < T,
    g K_T ≤ I_T.

Bounds:

    K₁ = I₀ + K₀,
    K_t ≥ I₀ + K₀,
    C_t ≥ C₀,              1 ≤ t ≤ T.
    I_t ≥ I₀,
    I_t ≤ (1.04)^t I₀,

Data:

    β = 0.95,   b = 0.25,   g = 0.03,
    C₀ = 0.95,  I₀ = 0.05,  K₀ = 3.0,
    β_t = β^t except β_T = β^T/(1 - β),
    a_t = a(1 + g)^{(1-b)t} where a = (C₀ + I₀)/K₀^b.
With b in the range [0, 1], this is a convex program with separable nonlinear functions. It should therefore be a useful test problem for more specialized convex programming algorithms.
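The data sequences can be generated directly from the definitions above; a sketch (the terminal weight β_T and the growth factor in a_t follow the Data block):

```python
# Sketch: data sequences for the growth model, following the Data block.
BETA, G, B = 0.95, 0.03, 0.25
C0, I0, K0 = 0.95, 0.05, 3.0

def model_data(T):
    a = (C0 + I0) / K0 ** B                        # a = (C0 + I0)/K0^b
    a_t = [a * (1.0 + G) ** ((1.0 - B) * t) for t in range(1, T + 1)]
    beta_t = [BETA ** t for t in range(1, T + 1)]
    beta_t[-1] = BETA ** T / (1.0 - BETA)          # beta_T = beta^T/(1 - beta)
    return a_t, beta_t
```

Note that a_t grows geometrically by the factor (1 + g)^(1-b) per period, while the utility weights β_t decay, apart from the inflated terminal weight β_T.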
For test purposes we have used T = 100, which gives a problem with 200 general constraints and 300 variables. The optimal value of the utility function is 9.287547. All general constraints are active at the solution, and the first 74 upper bounds on I_t are also active.

It should be mentioned that specialized methods are known to economists for solving this problem, with and without the upper bounds ('absorptive capacities') on the variables I_t. However, in more practical circumstances the model would be imbedded in a much larger model for which an analytical solution is not known. If the larger model contains no additional nonlinearities, the performance of MINOS/AUGMENTED should degrade only in proportion to the problem size.
6. Results and discussion
MINOS/AUGMENTED is implemented in standard FORTRAN. Various options can be selected at run-time by means of a SPECS file, and initial vectors x₀, y₀ and λ₀ can be specified in the MPS file containing the constraint data. For the present results, some components of x₀ were specified to match the given starting point, if any. (The corresponding variables were initially superbasic at the specified values.) Any remaining variables in [x₀, y₀] were subjected to a normal CRASH start, in which a triangular basis is selected from the associated columns of the Jacobian matrix. (The resulting initial values for these variables are not predictable.) The default value λ₀ = 0 was used. The following parameter values were also specified:

    LINESEARCH PARAMETER ETA = 0.1     (moderately accurate search),
    RADIUS OF CONVERGENCE    = 0.01    (ε_c in Section 3.5),
    ROW TOLERANCE            = 10⁻⁶    (ε_r in Section 4.8),
    MINOR ITERATIONS LIMIT   = 40      (not active on small examples).
Run-times are reported below in order to allow comparison among various algorithmic options. For the Lagrangian algorithm, the item labeled 'Functions' means the number of times the objective and the constraints and all of their gradients were evaluated. For the N e w t o n procedure, 'Functions' means the number of times the objective and its gradient were evaluated; the number of constraint and Jacobian evaluations is equal to the number of major iterations.
6.1. Problems 5.1-5.8

These examples were solved on a CDC Cyber 70. In all cases, convergence was obtained with zero penalty parameter (ρ = 0). The results are summarized in Table 1. In general the partial completion option converged more rapidly than
[Table 1 (results for Test Problems 5.1-5.8) is not legible in this scan.]
full completion. However, Example 5.4 illustrates that fewer major iterations are likely if subproblems are solved accurately once the correct subspace has been identified. This was observed in several other cases and is probably explained by the discussion of λ_k in Section 3.3. In terms of total run-time the Newton strategy was often superior, but it failed to converge on Problems 5.4 and 5.5. This deficiency becomes more prominent in the examples below. Problem 5.8 was run with two different MINOR ITERATIONS limits (3 and 40, which could have been 3 and 15 without altering the comparison). The results illustrate that terminating major iterations early can sometimes help rather than hinder.
6.2. Problems 5.9-5.12

The results for these examples were obtained on an IBM 370/168 using the FORTRAN H Extended (Enhanced) compiler with full optimization (OPT(3)). Computation was performed in double precision, but the constraint data were stored in single precision, including J_k in the linearization of f(x). This limits the achievable constraint error to about 10⁻⁶, but that is usually adequate in practice. Full completion was used throughout.
Problem 5.9 (Wright No. 4)

This highly nonlinear problem illustrates the difficulties discussed in Section 3.4 when no penalty term is used. The Newton strategy and the Lagrangian algorithm with ρ = 0 both gave rise to subproblems which changed radically each major iteration. The results using the Lagrangian algorithm with ρ = 10 and ρ = 100 are shown in Table 2. Infeasible subproblems were encountered with starting point (e) using the penalty parameter ρ = 10, but the procedure discussed in Section 4.2 successfully overcame this difficulty. Case (e) was also the only one for which the larger ρ was important in stabilizing progress from the starting point to the solution.
6.3. Problem 5.10 (Wright No. 9)

Again the Newton strategy and the Lagrangian algorithm with ρ = 0 failed to converge. Results for the Lagrangian algorithm with ρ = 10 and ρ = 100 are shown in Table 3. A value of ρ = 10 is almost too low for starting point (b), the subproblem solutions changing radically as they did for ρ = 0, but finally converging to a new local optimum, x*(3) = (-0.07427, -3.69170, 2.48792, 0.37693, 0.18446)^T.

In general the results for these last two problems are inferior to those obtained by Murray and Wright [34, 55] with their trajectory algorithms, in terms of total function evaluations. However, the difference is less than a factor of 4 in all cases, and averaged 2.2 over the seven starting points.
[Table 2 (results for Test Problem 5.9, Wright No. 4) is not legible in this scan.]
Table 1
Results for Test Problem 5.10 (Wright No. 9)

    Starting point    (a)      (a)      (b)      (b)
    ρ                 100      10       100      10
    Major itns        12       9        5        19
    Total itns        92       71       32       201
    Functions         183      146      61       386
    Solution          x*(1)    x*(1)    x*(2)    x*(3)

(Table 3 in the original numbering.)

Table 4
Results for Test Problem 5.11 (optimal control)

    Method        N        L (ρ = 0)
    Major itns    6        6
    Total itns    254      247
    Functions     213      203
    Time (s)      10.55    11.56
6.4. Problem 5.11 (Optimal control)

Despite the large size of this problem both the Newton strategy and the Lagrangian algorithm with ρ = 0 converged rapidly. Recall that procedure N evaluates the constraint functions only once per major iteration (in this case 6 times, compared to 203 times for procedure L). If f(x) were more costly to compute, the time advantage would be that much greater. The Lagrangian algorithm displayed insensitivity to nonzero values of ρ in the range 0.01-10.0, taking the same iterations and calculations as shown in Table 4.
6.5. Problem 5.12 (Economic growth)

On this example the Newton strategy led to oscillation and no convergence. The Lagrangian algorithm did converge rapidly with ρ = 0. Without the upper bounds I_t ≤ (1.04)^t I₀ it required 11 major iterations, 355 minor iterations, 859 function calculations and 34.3 seconds, and there were 99 superbasic variables at the optimal solution. However, when these bounds were imposed, the optimal number of superbasics was only 25 and convergence was obtained in 7 major iterations, 183 minor iterations, 497 function calculations and 11.9 seconds. This illustrates the gains that are made when the presence of constraints reduces the dimensionality of the optimal subspace.

As an experiment on the effect of ρ on the rate of convergence, the problem was solved several times with different values of ρ in the range 10⁻⁴ ≤ ρ ≤ 1.0.
[Figure 1: curves of row error versus minor iteration (x-axis roughly 0-400); dotted curve annotated "p reduced from 1.0 to 0.0".]
Fig. 1. Solution of Problem 5.12 (Economic growth) using various values for the penalty parameter p. The constraint violation, log10 ||f(x) + A1y - b1||, is plotted against minor iteration number. Dots signify the end of each major iteration.
(The tolerance εc for dropping p was set to zero, thus forcing p to stay constant for each run.) The results are shown in Fig. 1. This is a plot of minor iterations versus the log of the constraint violation or 'row error', log10 ||f(x) + A1y - b1||, immediately following relinearization. The dots represent the end of each major iteration. Initially these occur every 40 iterations (the MINOR ITERATIONS limit) but later subproblems were solved accurately before the limit was reached. In fact the number of minor iterations reduces rapidly to only one or two per subproblem as convergence is approached. This behavior was also observed in all of the preceding examples. It can be seen that higher values of p give lower row errors at the end of the first major iteration (as we would expect), but they lead to consistently slower convergence. It is interesting to note that rapid convergence does occur ultimately in all cases (except for p = 1.0, which was terminated after 500 iterations). However, this is not until the correct active constraints have been identified, by which time xk is very close to the optimum and the penalty term is having a negligible effect on the Lagrangian and its reduced gradient. These results confirm that the benefit of Robinson's proof of quadratic convergence for the case p = 0 cannot be realized unless p is actually reduced to zero as soon as precautions allow. The dotted line in Fig. 1 shows the result for p = 1.0 (the worst case) with εc set to 0.01, allowing p to be reset to zero at the start of major iteration 3. (Note
that on the highly nonlinear Examples 5.9 and 5.10, the same value of εc ensured that p was not set to zero until a near optimum point was reached. This was the desired effect since the multiplier estimates λk were changing radically in the early major iterations.) It will be seen in Fig. 1 that the row error increases sharply once p is reset to zero. This is consistent with the algorithm suddenly being free to take a large step. We could therefore regard the first two major iterations as having served to verify that the problem is only mildly nonlinear.

6.6. Other results
Successive linear programming (SLP) algorithms have been used in industry for many years [4, 7, 23] on problems of the type discussed here. Some informal comparisons [32] on the Colville problems indicate that SLP is likely to require more iterations and function evaluations than the Lagrangian algorithm given here. However, it is clearly unwise to draw any firm conclusion from a few small test cases. We hope that comparison with SLP on a large model will be possible in the near future.
Generalized reduced-gradient (GRG) algorithms [1] have also been in use for many years. In particular, the large-scale implementation MINOS/GRG [27, 30] has been applied to a variant of Manne's economic growth model (Problem 5.12). The results reported in [30] show that the GRG algorithm required very many function evaluations for the case T = 40 (considerably more than the numbers given above for the Lagrangian algorithm in the larger case T = 100). This is consistent with GRG's emphasis on retaining feasibility at the expense of many short steps that follow the curvature of the constraints.

Table 5
Iterations for air pollution model

Major         Iterations to find a     Total minor         Constraint violation      ||λk - λk-1|| /
iteration k   feasible point for Sk    iterations for Sk   after termination of Sk   (1 + ||λk-1||)
 1            1554                        0                0                         1554
 2               0                     1000                0.50                      0
 3             289                     1000                0.27                      0.02
 4              11                      600                0.58                      0.28
 5             126                      626                0.51                      0.45
 6              12                      204                0.29                      0.40
 7               7                      172                0.15                      0.49
 8               1                      167                0.067                     0.38
 9               4                      196                0.022                     0.27
10               0                       82                0.0035                    0.21
11               0                       22                0.00041                   0.05
12               0                        2                0.0000058                 0.002
13               0                        1                --                        --
6.7. A practical application
To date, the largest nonlinear programs solved by MINOS/AUGMENTED have come from an energy production model concerned with air pollution control [28]. One example of the model involved about 850 constraints and 4150 variables. The objective function was nonlinear in 225 of the variables (a sum of terms of the form e^{x_j} and e^{x_j}x_k), and 32 of the constraints were quadratic in 78 of those variables (ellipsoids of the form x'E_i x <= β_i with E_i positive semidefinite). Some statistics follow for the solution of this problem [28]. A cold start was used with default values for all parameters, except that the MINOR ITERATIONS limit was set to the unusually high value of 1000.

Major iterations (number of subproblems)        13
Minor iterations                                5626
Objective function and gradient evaluations     5955
Constraint function and gradient evaluations    5968
Active nonlinear constraints at optimum         12
Superbasic variables at optimum                 18
CPU time on a DEC VAX 11/780                    63 minutes
Details for each major iteration are given in Table 5, where Sk denotes the kth subproblem. The high MINOR ITERATIONS limit probably just reduced the number of major iterations, without having a substantial effect on the total work required. The first subproblem was terminated at the first feasible point, since the limit on minor iterations had then been exceeded. By chance, this point was also feasible for the original problem. As a result, the penalty parameter p was then reduced to zero, the ideal value for a convex program such as this. (In general, however, it seems clear that the criterion for reducing p should not take effect if the most recent subproblem was terminated short of optimality.) The next two subproblems were principally concerned with optimizing the objective function, without much regard for the nonlinear constraints. Thereafter, the constraint violation decreased steadily and, ultimately, quadratically. (The quantity shown is the largest violation, normalized by the size of the corresponding constraint gradient, ||e_i'[Jk A1]||, and by the solution size, 1 + ||[xk, yk]||.) The Lagrange multiplier estimates converged more slowly as might be expected, but it is known that this does not inhibit the final rapid convergence of xk.
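The normalized violation reported in Table 5 can be computed as in the sketch below. This is our reading of the definition in the text; the function names and the choice of the Euclidean norm are assumptions, not the authors' code.

```python
import numpy as np

def normalized_violation(f, J, A1, b1, x, y):
    """Largest nonlinear-row violation |f_i(x) + (A1 y)_i - (b1)_i|,
    normalized by the corresponding constraint-gradient row norm
    and by the solution size 1 + ||[x, y]||."""
    r = f(x) + A1 @ y - b1                # row residuals after relinearization
    G = np.hstack([J(x), A1])             # gradient of each row w.r.t. (x, y)
    row_norms = np.linalg.norm(G, axis=1)
    scale = 1.0 + np.linalg.norm(np.concatenate([x, y]))
    return np.max(np.abs(r) / (row_norms * scale))
```

For a single constraint x^2 + y = 3 evaluated at (x, y) = (1, 1), the residual is -1, the gradient row is (2, 1), and the measure is 1/(sqrt(5)(1 + sqrt(2))).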
7. Conclusions
Many real-life optimization problems originate as linear programming models that are not quite linear; i.e., they contain simple, differentiable functions in the
constraints and objective, but otherwise the constraint set is large, sparse and linear. For such problems the Jacobian matrix is also likely to be sparse, and the strategy of solving a sequence of linearly constrained subproblems has many advantages. This is clear from the results obtained for the larger test problems above.
For convex problems the Lagrangian term in the objective of the subproblems is usually necessary to ensure convergence. The Newton strategy performed adequately without it on some occasions, but in general the saving in run time will seldom be substantial. For non-convex problems, both the Lagrangian and the penalty term are clearly useful, but the actual choice of the penalty parameter p remains a critical decision for the user.
In practice, optimization problems are often solved repeatedly on a production basis. In this situation it is worthwhile experimenting with different values of the parameters and tolerances (and perhaps with the Newton strategy). However, the case of an isolated problem with unknown characteristics is no less important. A conservative (high) value of p is then virtually mandatory. One of our aims has been to minimize the risk of subsequent slow convergence by determining an opportune time to reduce p to zero. The analysis in Section 3.4 has suggested a practical procedure for achieving this aim. In particular, the 'radius of convergence' tolerance εc (applied to both the constraint violation and the relative change in the estimates of λk) allows early removal of p in moderately nonlinear cases but otherwise retains it until convergence to a local solution is imminent.
The results reported here should provide a benchmark for measuring the performance of other algorithms and their implementations. Clearly no single algorithm can be expected to perform uniformly better than all others in such a diverse field as nonlinearly constrained optimization.
As it happens, MINOS/AUGMENTED has proved to be reasonably efficient on small, highly nonlinear problems, but more importantly, it represents an advance in the development of general-purpose software for large-scale optimization. Future research could include an investigation of the following:
(a) Comparison with other large-scale algorithms such as SLP.
(b) An algorithm for adjusting the penalty parameter between subproblems.
(c) Alternative means for obtaining multiplier estimates λk.
(d) Use of a merit function to evaluate the connection between consecutive subproblems; e.g., to allow interpolation between the points (xk, λk) and (xk+1, λk+1) of the present algorithm.
(e) Methods for the case where some or all of the gradient functions are not available.
Acknowledgement
The authors wish to thank Noel Diaz, Philip Gill, Harvey Greenberg, Charles Kolstad, Alan Manne, Walter Murray, Ben Rosen, Ken Schofield, John Tomlin, Wesley Winkler, and Margaret Wright for various guiding comments and assistance with the test problems.

Availability
MINOS/AUGMENTED is distributed by the Systems Optimization Laboratory, Department of Operations Research, Stanford University, Stanford, CA 94305. Most of the source code is portable and can be installed on any reasonably large system that includes a FORTRAN IV compiler. The distribution tape contains source code for most of the current scientific computers. It also contains some routines [43] to facilitate the solution of sequences of related problems.
Added in proof
Since this paper was submitted, some particularly successful results have been obtained on electric power scheduling (optimal power flow) problems considerably larger than problem 5.5 above. This is an application where a good starting basis presents itself naturally. Some approximate statistics for a 600-bus Q-dispatch case follow [56]: 1200 nonlinear constraints, 1300 variables, 10 000 Jacobian nonzeros; 15 major iterations, 370 minor iterations, 700 function evaluations, 1 hour of CPU time on a DEC VAX 11/780.
References
[1] J. Abadie and J. Carpentier, "Generalization of the Wolfe reduced gradient method to the case of nonlinear constraints", in: R. Fletcher, ed., Optimization (Academic Press, London, 1969) pp. 37-49.
[2] J. Abadie and J. Guigou, "Numerical experiments with the GRG method", in: J. Abadie, ed., Integer and nonlinear programming (North-Holland, Amsterdam, 1970) pp. 529-536.
[3] K.J. Arrow and R.M. Solow, "Gradient methods for constrained maxima, with weakened assumptions", in: K.J. Arrow, L. Hurwicz and H. Uzawa, eds., Studies in linear and nonlinear programming (Stanford University Press, Stanford, CA, 1958) pp. 166-176.
[4] T.E. Baker and R. Ventker, "Successive linear programming in refinery logistic models", presented at ORSA/TIMS joint national meeting, Colorado Springs, CO (November 1980).
[5] A.S.J. Batchelor and E.M.L. Beale, "A revised method of conjugate gradient approximation programming", presented to the Ninth International Symposium on Mathematical Programming (Budapest, 1976).
[6] E.M.L. Beale, "A conjugate gradient method of approximation programming", in: R.W. Cottle and J. Krarup, eds., Optimization methods for resource allocation (English Universities Press, 1974) pp. 261-277.
[7] E.M.L. Beale, "Nonlinear programming using a general Mathematical Programming System", in: H.J. Greenberg, ed., Design and implementation of optimization software (Sijthoff and Noordhoff, The Netherlands, 1978) pp. 259-279.
[8] J.F. Benders, "Partitioning procedures for solving mixed variables programming problems", Numerische Mathematik 4 (1962) 238-252.
[9] M.J. Best, J. Bräuninger, K. Ritter and S.M. Robinson, "A globally and quadratically convergent algorithm for general nonlinear programming problems", Computing 26 (1981) 141-153.
[10] M.C. Biggs and M.A. Laughton, "Optimal electric power scheduling: a large nonlinear test problem solved by recursive quadratic programming", Mathematical Programming 13 (1977) 167-182.
[11] J. Bracken and G.P. McCormick, Selected applications of nonlinear programming (Wiley, New York, 1968) pp. 58-82.
[12] A.R. Colville, "A comparative study on nonlinear programming codes", Report 320-2949 (1968), IBM New York Scientific Center, New York.
[13] I.D. Coope and R. Fletcher, "Some numerical experience with a globally convergent algorithm for nonlinearly constrained optimization", Journal of Optimization Theory and Applications 32 (1) (1980) 1-16.
[14] G.B. Dantzig, Linear programming and extensions (Princeton University Press, New Jersey, 1963).
[15] R.S. Dembo, "A set of geometric programming test problems and their solutions", Mathematical Programming 10 (1976) 192-213.
[16] L.F. Escudero, "A projected Lagrangian method for nonlinear programming", Report G320-3401 (1980), IBM Palo Alto Scientific Center, CA.
[17] U.M. Garcia-Palomares and O.L. Mangasarian, "Superlinearly convergent quasi-Newton algorithms for nonlinearly constrained optimization problems", Mathematical Programming 11 (1976) 1-13.
[18] P.E. Gill and W. Murray, "Newton-type methods for unconstrained and linearly constrained optimization", Mathematical Programming 7 (1974) 311-350.
[19] P.E. Gill and W.
Murray, "Safeguarded steplength algorithms for optimization using descent methods", Report NAC 37 (1974), National Physical Laboratory, Teddington, England.
[20] P.E. Gill and W. Murray, Private communication (1975), National Physical Laboratory, Teddington, England.
[21] P.E. Gill, W. Murray, M.A. Saunders and M.H. Wright, "Two steplength algorithms for numerical optimization", Report SOL 79-25 (1979), Department of Operations Research, Stanford University, CA. (Revised 1981.)
[22] H.J. Greenberg, Private communication (1979), Department of Energy, Washington, DC.
[23] R.E. Griffith and R.A. Stewart, "A nonlinear programming technique for the optimization of continuous processing systems", Management Science 7 (1961) 379-392.
[24] E. Hellerman and D.C. Rarick, "The partitioned preassigned pivot procedure", in: D.J. Rose and R.A. Willoughby, eds., Sparse matrices and their applications (Plenum Press, New York, 1972) pp. 67-76.
[25] M.R. Hestenes, "Multiplier and gradient methods", Journal of Optimization Theory and Applications 4 (1969) 303-320.
[26] R.J. Hillestad and S.E. Jacobsen, "Reverse convex programming", Research paper (1979), RAND Corporation, Santa Monica, CA, and Department of System Science, University of California, Los Angeles, CA.
[27] A. Jain, L.S. Lasdon and M.A. Saunders, "An in-core nonlinear mathematical programming system for large sparse nonlinear programs", presented at ORSA/TIMS joint national meeting, Miami, FL (November 1976).
[28] C.D. Kolstad, Private communication (1980), Los Alamos Scientific Laboratory, New Mexico.
[29] L.S. Lasdon, Optimization theory for large systems (Macmillan, New York, 1970).
[30] L.S. Lasdon and A.D. Waren, "Generalized reduced gradient software for linearly and nonlinearly constrained problems", in: H.J. Greenberg, ed., Design and implementation of optimization software (Sijthoff and Noordhoff, The Netherlands, 1978) pp. 363-396.
[31] A.S. Manne, Private communication (1979), Stanford University, CA.
[32] P.H. Merz, Private communication (1980), Chevron Research Company, Richmond, CA.
[33] R.R. Meyer, "The validity of a family of optimization methods", SIAM Journal on Control 8 (1970) 41-54.
[34] W. Murray and M.H. Wright, "Projected Lagrangian methods based on trajectories of penalty and barrier functions", Report SOL 78-23 (1978), Department of Operations Research, Stanford University, CA.
[35] B.A. Murtagh and M.A. Saunders, "Large-scale linearly constrained optimization", Mathematical Programming 14 (1978) 41-72.
[36] B.A. Murtagh and M.A. Saunders, "MINOS user's guide", Report SOL 77-9 (1977), Department of Operations Research, Stanford University, CA.
[37] B.A. Murtagh and M.A. Saunders, "MINOS/AUGMENTED user's manual", Report SOL 80-14 (1980), Department of Operations Research, Stanford University, CA.
[38] D.A. Pierre and M.J. Lowe, Mathematical programming via augmented Lagrangians (Addison-Wesley, Reading, MA, 1975) pp. 239-241.
[39] M.J.D. Powell, "A method for nonlinear constraints in optimization problems", in: R. Fletcher, ed., Optimization (Academic Press, New York, 1969) pp. 283-297.
[40] M.J.D. Powell, "Algorithms for nonlinear constraints that use Lagrangian functions", Mathematical Programming 14 (1978) 224-248.
[41] M.J.D. Powell, "A fast algorithm for nonlinearly constrained optimization calculations", presented at the 1977 Dundee Conference on Numerical Analysis, Dundee, Scotland.
[42] M.J.D. Powell, "Constrained optimization by a variable metric method", Report 77/NA6 (1977), Department of Applied Mathematics and Theoretical Physics, Cambridge University, England.
[43] P.V. Preckel, "Modules for use with MINOS/AUGMENTED in solving sequences of mathematical programs", Report SOL 80-15 (1980), Department of Operations Research, Stanford University, CA.
[44] P.S. Ritch, "Discrete optimal control with multiple constraints I: constraint separation and transformation technique", Automatica 9 (1973) 415-429.
[45] S.M.
Robinson, "A quadratically convergent algorithm for general nonlinear programming problems", Mathematical Programming 3 (1972) 145-156.
[46] J.B. Rosen, "Convex partition programming", in: R.L. Graves and P. Wolfe, eds., Recent advances in mathematical programming (McGraw-Hill, New York, 1963) pp. 159-176.
[47] J.B. Rosen, "Iterative solution of nonlinear optimal control problems", SIAM Journal on Control 4 (1966) 223-244.
[48] J.B. Rosen, "Two-phase algorithm for nonlinear constraint problems", in: O.L. Mangasarian, R.R. Meyer and S.M. Robinson, eds., Nonlinear programming 3 (Academic Press, London, 1978) pp. 97-124.
[49] B.C. Rush, J. Bracken and G.P. McCormick, "A nonlinear programming model for launch vehicle design and costing", Operations Research 15 (1967) 185-210.
[50] R.W.H. Sargent and B.A. Murtagh, "Projection methods for nonlinear programming", Mathematical Programming 4 (1973) 245-268.
[51] M.A. Saunders, "A fast, stable implementation of the simplex method using Bartels-Golub updating", in: J.R. Bunch and D.J. Rose, eds., Sparse matrix computations (Academic Press, New York, 1976) pp. 213-226.
[52] M.A. Saunders, "MINOS system manual", Report SOL 77-31 (1977), Department of Operations Research, Stanford University, CA.
[53] C.M. Shen and M.A. Laughton, "Determination of optimum power system operating conditions under constraints", Proceedings of the Institute of Electrical Engineers 116 (1969) 225-239.
[54] P. Wolfe, "Methods of nonlinear programming", in: J. Abadie, ed., Nonlinear programming (North-Holland, Amsterdam, 1967) pp. 97-131.
[55] M.H. Wright, "Numerical methods for nonlinearly constrained optimization", SLAC Report No. 193 (1976), Stanford University, CA (Ph.D. Dissertation).
[56] R.C. Burchett, Private communication (1981), General Electric Company, Schenectady, NY.
Mathematical Programming Study 16 (1982) 118-136
North-Holland Publishing Company

ON SOME EXPERIMENTS WHICH DELIMIT THE UTILITY OF NONLINEAR PROGRAMMING METHODS FOR ENGINEERING DESIGN

E. SANDGREN*
IBM, Information Systems Division, Lexington, KY, U.S.A.
K.M. RAGSDELL
School of Mechanical Engineering, Purdue University, West Lafayette, IN, U.S.A.

Received 18 March 1980
Revised manuscript received 5 November 1981

A comprehensive comparative study of nonlinear programming algorithms as applied to engineering design is presented. Linear approximation methods, interior penalty function methods and exterior penalty function methods were tested on a set of thirty problems and were rated on their ability to solve problems within a reasonable amount of computational time. The effect of the problem parameters on the solution time for the various classifications of algorithms was studied. The variable parameters included the number of design variables, the number of inequality constraints, the number of equality constraints and the degree of nonlinearity of the objective function and constraints. Also a combined penalty function and linear approximation algorithm was investigated.

Key words: Comparative Study, Nonlinear Programming Methods, Hybrid Method.
1. Introduction
Conducting a comparative study of general nonlinear programming algorithms is not an easy task. In the field of linear, integer or quadratic programming, it is possible to generate a random set of problems which are quite representative of the field. This enables the use of statistical methods to arrive at a carefully designed comparative experiment. Unfortunately, there does not exist a representative class of general nonlinear programming problems. This means that one must either select a wide variety of problems which are typical of a chosen field or limit oneself to a specific problem set, such as cubic objective functions with quadratic constraints. Both approaches have advantages and disadvantages. Using a wide variety of problems which are typical of a chosen field results in an overall picture of how each algorithm might perform in general usage, but does not indicate how the algorithms are affected by the structure of a problem. On the other hand, using a random set of specifically structured problems allows one to trace the specific effects of such parameters as the number of design variables or number and type of constraints on each algorithm. Unfortunately the trends may not be valid to any degree on a different class of
* At present: Dept. of Mechanical and Aerospace Engineering, Univ. of Missouri, Columbia, MO, U.S.A.
problems. Both approaches have been employed in this study. A comprehensive study using a selected set of engineering problems is discussed and then a few of the algorithms which performed well are compared on specific sets of mathematically constructed problems. These comparisons are then followed by a brief description of a hybrid method employing both a biased penalty function method and the generalized reduced gradient method.
2. Generation of comparative data
The major goal of this study is to determine the utility of the world's leading nonlinear programming methods for use in an engineering design environment. For this reason an emphasis was placed on the use of industrial based problems and codes. Algorithms were solicited throughout the world and a wide selection of algorithms was obtained from both industry and the academic community. It should be noted that some of the codes which were initially considered were not tested due to the fact that not everyone who was invited agreed to participate. The overall response, however, was supportive and in all thirty-five codes were tested. A brief listing of the codes tested, their general classification and the unconstrained optimization technique used (if any) is given in Table 1. Included in the test set are four generalized reduced gradient codes, a Method of Multipliers (BIAS) code, two methods based on extensions of linear programming, and a large variety of penalty methods. BIAS [15] is a Method of Multipliers code developed at Purdue under the direction of Ragsdell. SEEK1, SEEK3, APPROX, SIMPLX, DAVID and MEMGRD [22] were contributed by Professor Siddall at McMaster University. GRGDFP [10] is a generalized reduced gradient code developed by LaFrance, Hamilton and Ragsdell at Purdue. RALP [20] is a variation of the Griffith-Stewart [6] method developed by Spencer Schuldt at Honeywell. GRG [11] is the generalized reduced gradient code developed by Lasdon and co-workers. OPT [5] is a generalized reduced gradient code developed by Gabriele and Ragsdell at Purdue. GREG [7] is a generalized reduced gradient code developed by Abadie in France. Compute II [8] is a penalty function package contributed by Gulf Oil Corporation. The package employs an exterior penalty formulation, several unconstrained search methods and automatic constraint scaling. Mayne [1] is another penalty package which was developed and contributed by Professor Mayne at State University of New York at Buffalo. The package contains a variety of unconstrained searching methods, and interior and exterior penalty formulations. SUMT-IV [12] is the very well known penalty package developed by Fiacco, McCormick, Mylander and others at Research Analysis Corporation in Virginia. Various unconstrained search methods are provided, as well as a choice of penalty parameter updating schemes in conjunction with a standard interior formulation.
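The interior/exterior classification used in Table 1 refers to how infeasibility is penalized. Below is a minimal sketch of the textbook forms; the sign convention g(x) <= 0 and these particular formulas are our assumptions, and the individual packages implement many variants:

```python
import numpy as np

def exterior_penalty(f, g, r):
    """Exterior penalty: infeasible iterates are allowed; violations of
    g(x) <= 0 are penalized quadratically with weight r."""
    return lambda x: f(x) + r * np.sum(np.maximum(0.0, g(x)) ** 2)

def interior_penalty(f, g, r):
    """Interior (barrier) penalty: iterates must stay strictly feasible;
    the barrier term blows up as any g_i(x) approaches 0 from below."""
    return lambda x: f(x) - r * np.sum(1.0 / g(x))
```

In both families the weight r is driven through a sequence (increasing for exterior, decreasing for interior) and the penalized function is minimized by one of the unconstrained search methods listed in the table.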
Table 1
Codes in study

Code     Name and/         Class                   Unconstrained
number   or source                                 search method
 1       BIAS              exterior penalty        variable metric (DFP)
 2       SEEK1             interior penalty        random pattern
 3       SEEK3             interior penalty        Hooke-Jeeves
 4       APPROX            linear approximation    none
 5       SIMPLX            interior penalty        simplex
 6       DAVID             interior penalty        variable metric (DFP)
 7       MEMGRD            interior penalty        memory gradient
 8       GRGDFP            reduced gradient        variable metric (DFP)
 9       RALP              linear approximation    none
10       GRG               reduced gradient        variable metric (BFS)
11       OPT               reduced gradient        conjugate gradient (FR)
12       GREG              reduced gradient        conjugate gradient (FR)
13       Compute II (0)    exterior penalty        Hooke-Jeeves
14       Compute II (1)    exterior penalty        conjugate gradient (FR)
15       Compute II (2)    exterior penalty        variable metric (DFP)
16       Compute II (3)    exterior penalty        simplex/Hooke-Jeeves
17       Mayne (1)         exterior penalty        pattern
18       Mayne (2)         exterior penalty        steepest descent
19       Mayne (3)         exterior penalty        conjugate direction
20       Mayne (4)         exterior penalty        conjugate gradient (FR)
21       Mayne (5)         exterior penalty        variable metric (DFP)
22       Mayne (6)         exterior penalty        Hooke-Jeeves
23       Mayne (7)         interior penalty        pattern
24       Mayne (8)         interior penalty        steepest descent
25       Mayne (9)         interior penalty        conjugate direction
26       Mayne (10)        interior penalty        conjugate gradient
27       Mayne (11)        interior penalty        variable metric (DFP)
28       SUMT IV (1)       interior penalty        Newton
29       SUMT IV (2)       interior penalty        Newton
30       SUMT IV (3)       interior penalty        steepest descent
31       SUMT IV (4)       interior penalty        variable metric (DFP)
32       MINIFUN (0)       mixed penalty           conjugate directions
33       MINIFUN (1)       mixed penalty           variable metric (BFS)
34       MINIFUN (2)       mixed penalty           Newton
35       COMET             exterior penalty        variable metric (BFS)
MINIFUN [25] is a penalty method developed by Lootsma and contributed by Lill at the University of Liverpool. The method offers three unconstrained search methods. COMET [23] is an exterior penalty method developed by Himmelblau employing a variable metric unconstrained search method.
All codes were modified to run on Purdue University's CDC-6500 computer system. The changes varied somewhat from code to code and were generally quite small. All codes were converted to single precision, since the CDC-6500 system has a 60 bit word (48 bit mantissa and 12 bit exponent), which results in approximately 14 decimal places of precision using single precision arithmetic.
Any code which required analytical gradient information was altered to accept numerical gradient approximations using forward differences. Finally all print instructions were removed from the basic iteration loop of the algorithms in order that the time of calculation could be measured more accurately. A more complete description of each of the codes, including names and addresses of code sources, is given by Sandgren [17]. To simplify the terminology, the numbering of the algorithms in Table 1 will be used throughout the paper. For example, the code APPROX will be referred to as code 4. Of course there are many additional algorithms which could have been included in the study, and the fact that they have not been included is in no way a reflection on the value of the code. The codes included are simply a representative sample of the codes available to us when we began the study in 1973. No recursive quadratic programming codes were included because they were not available at this time.
The test problems were selected so as to include a wide range in the number of variables and the number and type of constraints. Several problems were selected from past studies to give this study a sound historical foundation. These problems were basically selected as initial test problems which were used to gain familiarity with each of the codes. Other problems were selected from a wide range of additional engineering applications. The final total of thirty test problems includes all sixteen of the problems from the Colville [2] and Eason and Fenton [4] studies, eight problems from a comparative study of geometric programming codes by Dembo [3], and seven problems which come directly from industrial application and have not to the best of our knowledge appeared in any previous comparative study. The welded beam problem [14] involves the design of a structural weldment of minimum cost with constraints on stress and buckling.
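The forward-difference substitution mentioned above replaces each analytical partial derivative with a one-sided difference quotient, at a cost of n + 1 function evaluations for n variables. A sketch follows; the step size and this exact scheme are assumptions, and the study's modified codes may differ:

```python
import numpy as np

def forward_difference_gradient(f, x, h=1.0e-6):
    """Approximate the gradient of scalar f at x by forward differences."""
    x = np.asarray(x, dtype=float)
    f0 = f(x)
    grad = np.empty_like(x)
    for i in range(x.size):
        xp = x.copy()
        xp[i] += h                 # perturb one coordinate at a time
        grad[i] = (f(xp) - f0) / h
    return grad
```

The truncation error is O(h) per component, which is why forward differencing trades accuracy for the halved cost relative to central differences.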
The coupler curve problem is a classical problem from the field of mechanism synthesis [24]. The Whirlpool problem involves the design of a Whirlpool product [26]. The SNG [21] problem was contributed by Spencer Schuldt of Honeywell Corporation, and involves a simplified model of a synthetic natural gas production facility. The flywheel problem [19] involves the design of a rotating disk with maximum energy storage capacity subject to constraints on the volume and maximum stress. The automatic lathe problem [13] represents an application in industrial engineering and was contributed by Philipson and Professor Ravindran of the School of Industrial Engineering at Purdue. Finally, the waste water problem [9] was contributed by Professor Himmelblau of the University of Texas. A brief listing of the problems and the number of variables and constraints is contained in Table 2. A detailed description including complete starting and solution information along with a Fortran listing of each problem is given by Sandgren [17]. As can be seen from Table 2, the problems vary greatly in size and structure. The number of variables ranges from 2 to 48 while the number of constraints goes from 0 to 19. The number of variable bounds ranges from 3 to 72. Larger
Table 2
Problems in study

Problem  Name and/          N    J    K   Number of
number   or source                        variable bounds
 1       Eason #1            5   10   0    5
 2       Eason #2            3    2   0    6
 3       Eason #3            5    6   0   10
 4       Eason #4            4    0   0    8
 5       Eason #5            2    0   0    4
 6       Eason #6            7    0   4   12
 7       Eason #7            2    1   0    4
 8       Eason #8            3    2   0    6
 9       Eason #9            3    9   0    4
10       Eason #10           2    0   0    4
11       Eason #11           2    2   0    4
12       Eason #12           4    0   0    8
13       Eason #13           5    4   0    3
14       Colville #2        15    5   0   15
15       Colville #7        16    0   8   32
16       Colville #8         3   14   0    6
17       Dembo #1           12    3   0   24
18       Dembo #3            7   14   0   14
19       Dembo #4            8    4   0   16
20       Dembo #5            8    6   0   16
21       Dembo #6           13   13   0   26
22       Dembo #7           16   19   0   32
23       Dembo #8            7    4   0   14
24       Welded beam         4    5   0    3
25       Coupler curve       6    4   0    6
26       Whirlpool           3    0   1    6
27       SNG                48    1   2   72
28       Flywheel            5    3   0   10
29       Automatic lathe    10   14   1   20
30       Waste water        19    1  11   38
problems were considered, but eliminated due to excessive computational cost. Consideration was also given to the availability of solution information. Thirty problems may not seem like a very large test set, but one must recognize that application of each of the thirty-five codes to each of the thirty problems requires over one thousand individual computer runs. When this number is multiplied by an average number of runs per code, the time and cost quickly exceed reasonable bounds.

A fundamental decision on the handling of program input parameters such as initial penalty parameters and line search termination criteria also had to be made. To simulate actual usage, an initial run was made using the recommended values for the input parameters. If a normal, satisfactory termination was not obtained after this run, an attempt was made to adjust the input parameters using the information contained in the user's manual as a guide to the adjustment. It should be noted that no attempt was made to decrease the solution times by this adjustment of parameters. The only goal was to achieve a satisfactory solution to each problem. The drawback to this approach is that the nebulous factor of user skill was introduced into the results, but this is most certainly a factor which must be considered in actual usage. The results using only the standard parameters were not felt to be representative and would have drastically affected the performance of several of the codes. The errors introduced into the calculated solution times due to timing errors and system load were determined to be extremely small. A further discussion of these factors is given by Sandgren [17].

In order to reduce the total computational time for the study and to gain facility in selecting the input program parameters, a preliminary test was conducted. In this way codes were 'qualified' for the final test. The 14 problems used in the preliminary testing are numbers 1 through 8, 10 through 12, and 14, 15 and 16. All 35 codes were applied to the 14 problems in the preliminary test set. The codes which solved fewer than 7 of these problems were eliminated from further study. It should be noted that in this phase any code which required more than three times the average time to solve a given problem was considered to have failed on that problem. Here the average solution time is simply the sum of the solution times of all the algorithms which solved a given test problem divided by the number of codes which solved that problem. Accordingly, codes 2, 4, 5, 6, 7, 17, 18, 23, 24, 25 and 30 were not considered in the final testing.
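The qualification rule just described (a run counts as a failure if its time exceeds three times the per-problem average over the codes that solved it; a code is eliminated if it solves fewer than a threshold number of problems) is easy to state in code. The sketch below is my own illustration with made-up times, not the study's data; the study's threshold was 7 of the 14 preliminary problems.

```python
# times[code][problem] = solution time in seconds, or None if the code failed.
# Illustrative numbers only; the study's actual timing data is in Sandgren [17].
times = {
    "A": {1: 1.0, 2: 2.0, 3: 1.2},
    "B": {1: 10.0, 2: None, 3: None},
    "C": {1: 0.9, 2: 2.4, 3: 1.0},
}

def qualified_codes(times, min_solved, slow_factor=3.0):
    problems = sorted({p for t in times.values() for p in t})
    # Average time per problem, over the codes that solved it.
    avg = {}
    for p in problems:
        solved = [t[p] for t in times.values() if t.get(p) is not None]
        avg[p] = sum(solved) / len(solved)
    survivors = []
    for code, t in times.items():
        n_ok = sum(
            1 for p in problems
            if t.get(p) is not None and t[p] <= slow_factor * avg[p]
        )
        if n_ok >= min_solved:
            survivors.append(code)
    return sorted(survivors)

survivors = qualified_codes(times, min_solved=2)
```

Here code "B" solves only one of the three toy problems and is dropped, mirroring how codes 2, 4, 5, 6, 7, 17, 18, 23, 24, 25 and 30 were screened out of the final testing.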
3. Data collection and reduction
The information obtained during the data collection phase consisted of the value of the objective function, the constraint values and the solution time at each stage for each code on each problem. This data had to be reduced to some common level in order to compare the effectiveness of the codes on each problem. This involves the development of some measure of the accuracy of a given point relative to the known solution. This is necessary so that all of the codes may be compared at some uniform level of accuracy. The accuracy criterion chosen was the total relative error, which may be defined as:

    ε_t = ε₁ + Σ_{j=1}^{J} ⟨g_j(x)⟩ + Σ_{k=1}^{K} |h_k(x)|,    (1)

where

    ε₁ = |f(x) − f(x*)| / |f(x*)|   for f(x*) ≠ 0    (2)

and

    ε₁ = |f(x) − f(x*)|   for f(x*) = 0.    (3)
Fig. 1. Relative error for a hypothetical problem.
Here f(x) represents the objective function, f(x*) represents the optimal value of the objective function, and g(x) and h(x) represent the inequality and equality constraints respectively. The bracket operator is defined by

    ⟨a⟩ = 0     if a ≥ 0,
    ⟨a⟩ = −a    if a < 0.    (4)
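As a concrete restatement of this accuracy measure, here is a small sketch in Python (my own, not the study's code); feasible inequality constraints are taken as g_j(x) ≥ 0 and equalities as h_k(x) = 0, following the definitions above.

```python
def total_relative_error(f_x, f_star, g_vals, h_vals):
    """Total relative error: objective error plus the bracketed
    (violated) inequality values and all equality residuals."""
    if f_star != 0.0:
        e1 = abs(f_x - f_star) / abs(f_star)
    else:
        e1 = abs(f_x - f_star)

    def bracket(a):
        # <a> = 0 if a >= 0, -a if a < 0: only violations contribute.
        return 0.0 if a >= 0.0 else -a

    return e1 + sum(bracket(g) for g in g_vals) + sum(abs(h) for h in h_vals)

# A point with 10% objective error, one violated inequality (-0.2)
# and a small equality residual (0.05): total error 0.1 + 0.2 + 0.05.
err = total_relative_error(1.1, 1.0, [0.5, -0.2], [0.05])
```

Note that the satisfied inequality (value 0.5) contributes nothing, exactly as the bracket operator prescribes.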
In other words only the violated subset of inequality constraints is included in the total relative error. Several other definitions of relative error were also monitored along with the number of objective function and constraint evaluations in order that comparison to previous studies might be made. As far as this paper is concerned, however, the total relative error will serve as the sole accuracy criterion. A plot of the total relative error versus the computational time for each algorithm on a given problem allows one to compare the solution times of all algorithms at the same level of accuracy, regardless of the path taken to the solution. For example the relative accuracies for two algorithms are plotted for a hypothetical problem in Fig. 1. Each circled point represents the end of a computational stage for that algorithm. Once the required level of accuracy for the problem has been specified, the solution times are simply the intersections of the relative error curves with a horizontal line drawn at the desired accuracy level. Plots of the total relative error for the tested algorithms on three of the
test problems are presented in Figs. 2, 3 and 4. A complete set of plots for all of the test problems is given by Sandgren [17].

Fig. 2. Relative error versus time for test problem number 1.
Fig. 3. Relative error versus time for test problem number 2.
Fig. 4. Relative error versus time for test problem number 26.
4. Results
Many attempts have been made at developing a relative ranking criterion for comparing nonlinear programming codes. In selecting a representative ranking criterion careful consideration must be given to the characteristics a 'good' code should possess. Certainly the main two criteria would have to be robustness and computational speed. It is upon these two criteria that all relative rankings will be based. The rankings are determined by the number of problems solved within a series of specified limits on the relative solution times. The limits on the solution times are based on a fraction of the average solution times for all of the codes on each problem. Each solution time for a problem was normalized by the average solution time on that problem. This normalization essentially equalizes the time ratings on the various problems. The number of problems solved may then be directly related to the fraction or percentage of the average solution time of all of the methods tested on each problem. It should be noted that problems 9, 13, 21, 22, 28, 29 and 30 were not included in this analysis since less than five algorithms generated the same solution point. A discussion of the results on these problems is given by Sandgren [17]. The final results, then, included twenty-four codes on twenty-three problems.
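The normalization and counting procedure behind this ranking can be sketched as follows; the array contents are illustrative stand-ins, since the study's raw times are only available in Sandgren [17].

```python
import numpy as np

# times[i, j]: solution time of code i on problem j at the target accuracy;
# np.nan marks a failure. A toy 3-code, 4-problem array for illustration.
times = np.array([
    [1.0, 2.0, 4.0, np.nan],
    [2.0, 8.0, 2.0, 6.0],
    [9.0, 2.0, 6.0, 6.0],
])

def problems_solved(times, fraction):
    """Per-code count of problems solved within `fraction` of the average
    solution time of the codes that solved each problem (cf. Table 3)."""
    avg = np.nanmean(times, axis=0)           # per-problem average time
    with np.errstate(invalid="ignore"):
        within = times <= fraction * avg      # nan entries compare False
    return within.sum(axis=1)

counts = problems_solved(times, 1.0)
```

Sweeping `fraction` over 0.25, 0.50, ..., 2.50 and tabulating `counts` for each code reproduces the structure of Table 3; plotting the counts against the fraction gives the algorithm utility graph described below.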
Table 3
Number of problems solved with ε_t = 10⁻⁴

Code    Name and/or source    Number of problems solved*
                              0.25   0.50   0.75   1.00   1.50   2.50
 1      BIAS                    0      9     14     17     19     20
 3      SEEK3                   0      1      2      3      6     10
 8      GRGDFP                 10     15     17     17     18     18
 9      RALP                   12     13     13     14     14     16
10      GRG                    15     17     19     19     19     19
11      OPT                    16     21     21     21     21     21
12      GREG                   14     18     20     23     23     23
13      Compute II (0)          2      7      9     11     13     15
14      Compute II (1)          2      4      6      6      8      9
15      Compute II (2)          4      9     11     15     15     15
16      Compute II (3)          1      5      7      8      9     11
19      Mayne (3)               0      0      2      3      7     11
20      Mayne (4)               1      4     10     10     11     11
21      Mayne (5)               7     13     14     16     16     16
22      Mayne (6)               2      4      7      8      8      9
26      Mayne (10)              0      1      4      6      9     10
27      Mayne (11)              2      7     11     14     14     17
28      SUMT IV (1)             0      0      0      0      3      9
29      SUMT IV (2)             0      0      0      1      2      7
31      SUMT IV (4)             0      1      3      5      9     13
32      MINIFUN (0)             0      0      0      4      8     15
33      MINIFUN (1)             0      2      4      7     13     15
34      MINIFUN (2)             0      1      2      4      8     10
35      COMET                   0      2      5      7      9     15

* Times the average solution time.
With this background, consider the performance of the codes tested at a total error criterion of ε_t = 10⁻⁴. Table 3 lists the number of problems solved at various solution time limits. From the table it is quite easy to identify the very fast codes, since they have solved a significant number of problems at 25% of the average solution time. This same information is shown in graphical form in Fig. 5, where the number of problems solved is plotted as a function of the fraction of the average solution time. This graph is denoted as the algorithm utility graph. The legend given in Fig. 5 identifying the various code performance curves is implied for all following algorithm utility figures. The maximum height attained on the y axis represents the robustness, and a steep slope represents a computationally fast algorithm. Thus a 'good' code should have a steep rise and attain a high maximum value on the ordinate axis. When viewed in this fashion it is again quite easy to pick out the superior codes. The difference in the performance of the various algorithm classes becomes apparent when the average utility graph is considered. This graph, as plotted in Fig. 6, consists of the average performance of the best three codes from each
Fig. 5. Algorithm utility (ε_t = 10⁻⁴).
algorithm class, including the generalized reduced gradient, exterior penalty function and interior penalty function methods. The generalized reduced gradient methods averaged were codes 10, 11 and 12; the exterior penalty function methods averaged were codes 1, 15 and 21; and the interior penalty function methods averaged were codes 27, 33 and 35. The relative performance of the algorithm classes is now apparent.
Fig. 6. Average algorithm utility.
The same type of analysis may be performed at higher accuracy criterion levels. This has been given by Sandgren and Ragsdell [18], along with a more detailed description of the results and conclusions. A general discussion of the performance of each algorithm on the test problem set is given by Sandgren [17]. The level of compiler optimization and the variation of input parameters were found to have little effect on the final ratings.
5. An alternate approach to comparison

The results of the comparative study so far have been based on the number of problems solved within a reasonable amount of computational time. The problems in the test set were intentionally selected to be widely varied in nature, since the performance on the test set was to indicate the performance one would expect in general usage. It would be beneficial, however, to have additional information pertaining to the performance of some of the better algorithms on a specific type of problem. This type of information cannot be generated from the results on the selected test problems because very few of the test problems were closely enough related to support a performance judgement on a specific type of problem. To obtain this information, a different approach was taken. The performance of the algorithms was rated on the basis of how the solution time varies as a function of the type of problem considered. A problem containing five variables and ten inequality constraints with a quadratic objective function and constraints was selected as the standard problem. This standard problem was then altered by changing one problem factor, such as the number of variables, the number of inequality constraints, the number of equality constraints or the degree of nonlinearity, to form another problem class. The solution times reported for each problem class are the average time for each algorithm on a set of ten randomly generated problems within that problem class to an accuracy of ε_t = 10⁻⁴. The selection of a quadratic objective function and constraints for the standard problem has several distinct advantages. First of all, each algorithm tested was able to solve this type of problem with the recommended values for the input parameters, so only one run per problem was required. Also, the degree of nonlinearity can be increased simply by considering higher order terms such as cubic or quartic terms.
In addition the quadratic functions may be easily selected so that the constrained region is a convex set, which guarantees the presence of a single optimum. The test problems were generated following the procedure of Rosen and Suzuki [16]. The quadratic form for the objective function may be expressed as

    f(x) = xᵀQ₀x + aᵀx    (5)
and for each constraint as

    g_j(x) = xᵀQ_jx + b_jᵀx + C_j ≥ 0,   j = 1, 2, ..., J.    (6)
For these expressions the Q₀ and Q_j are randomly generated N by N matrices, with Q₀ forced to be positive definite to guarantee unimodality and the Q_j forced to be negative definite to guarantee a convex feasible region. The a and b_j are column vectors containing N elements. In addition to the Q matrices, the b_j vectors and the Lagrange multipliers are randomly generated, and the solution vector is also selected. The Lagrange multipliers are either set to zero, if a constraint is not to be active at the solution, or to a random number between 0.5 and 10. So that the problem is not overconstrained, the number of constraints allowed to be active at the solution was also selected as a random integer between one and N − 1. Now a and the C_j may be determined by the conditions required to make the selected optimal vector a Kuhn-Tucker point. The procedure used to guarantee that Q₀ is positive definite and the Q_j are negative definite is described by Sandgren [17]. The starting vector for all problems was the origin, and the solution point for all of the five variable problems was x_i = 2.0. As the number of variables increased, the solution point was adjusted so that the distance from the origin to the solution vector was the same as for the five variable problem. Extending the standard problem to include additional variables or inequality constraints is elementary, and the basic procedure for the addition of equality constraints remains unchanged, with the exception that the Lagrange multipliers for the equality constraints may now be positive or negative. However, the addition of nonlinear equality constraints does introduce the possibility of local minima. This problem was handled by including only the problems generated where all of the algorithms reached the selected optimal vector. For the increase in nonlinearity, additional higher order terms were added to the basic quadratic form.
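A minimal sketch of this construction follows (my own, based on the description above; the slack of 1.0 used to keep inactive constraints strictly satisfied, and the function names, are illustrative choices). The vector a is fixed by stationarity of the Lagrangian at the chosen solution, and each C_j is fixed by making active constraints tight.

```python
import numpy as np

def generate_problem(n=5, m=10, seed=0):
    """Random convex QP test problem in the spirit of Rosen and Suzuki [16]:
    minimize f(x) = x^T Q0 x + a^T x  subject to
    g_j(x) = x^T Qj x + bj^T x + Cj >= 0, with a known solution x*.
    """
    rng = np.random.default_rng(seed)

    def definite(sign):
        # M^T M is positive semidefinite; adding I makes it definite.
        M = rng.standard_normal((n, n))
        return sign * (M.T @ M + np.eye(n))

    Q0 = definite(+1.0)                       # positive definite objective
    Qs = [definite(-1.0) for _ in range(m)]   # negative definite constraints
    bs = [rng.standard_normal(n) for _ in range(m)]
    x_star = np.full(n, 2.0)                  # chosen solution point

    n_active = rng.integers(1, n)             # random active set size, 1..n-1
    u = np.zeros(m)
    u[:n_active] = rng.uniform(0.5, 10.0, n_active)   # multipliers of active g_j

    # Stationarity of L = f - sum_j u_j g_j at x* determines a:
    a = sum(u[j] * (2 * Qs[j] @ x_star + bs[j]) for j in range(m)) - 2 * Q0 @ x_star
    # C_j makes active constraints tight and inactive ones strictly satisfied.
    cs = [
        -(x_star @ Qs[j] @ x_star + bs[j] @ x_star) + (0.0 if j < n_active else 1.0)
        for j in range(m)
    ]
    return Q0, a, Qs, bs, cs, x_star

Q0, a, Qs, bs, cs, x_star = generate_problem()
g_vals = [x_star @ Qs[j] @ x_star + bs[j] @ x_star + cs[j] for j in range(10)]
```

By construction x* is feasible, the chosen constraints are active there, and the Kuhn-Tucker conditions hold with the generated multipliers.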
Again, for this case no check was made for positive and negative definiteness, and problems where alternate solutions were found were not included. The average solution times on each set of ten problems for the increase in design variables, the number of inequality constraints, the number of equality constraints and the increase in nonlinearity are presented in Tables 4-7. The codes tested were code 1, the method of multipliers algorithm (exterior penalty function); code 9, a repetitive linear programming algorithm; codes 10 and 11, generalized reduced gradient algorithms; codes 15 and 21, exterior penalty function algorithms; and code 31, an interior penalty function algorithm. Individual code results may be easily discerned from the tables, but the main point is that the results are fairly consistent with those from the general comparative study. The generalized reduced gradient algorithms again demonstrate a significantly reduced computational solution time as compared to the penalty function methods.
Table 4
Solution time and percentage increase in solution time for an increase in the number of design variables

            Number of variables
Algorithm    5 (time)    10 (time/% increase)    15 (time/% increase)
 1           6.306       23.482/272.4             70.288/1014.6
 9           4.153       16.356/293.8             42.865/932.1
10           2.171       10.217/307.6             23.307/973.6
11           3.491       15.025/330.4             26.318/653.9
15           9.081       42.248/365.2            132.320/1357.6
21           6.639       33.637/406.7             75.191/1032.6
31           9.050       32.650/260.8            128.086/1315.3
Table 5
Solution time and percentage increase in solution time for an increase in the number of inequality constraints

            Number of constraints
Algorithm   10 (time)   15 (time/% increase)   20 (time/% increase)
 1          6.306        8.938/41.7            11.304/79.3
 9          4.153        4.750/14.4             6.602/59.0
10          2.171        3.475/60.1             3.737/72.1
11          3.491        6.716/92.4             8.404/140.7
15          9.081       13.523/48.9            17.258/90.0
21          6.639       12.270/84.8            15.616/135.2
31          9.050       11.417/26.2            17.773/96.4
Table 6
Solution time and percentage increase in solution time for an increase in the number of equality constraints

            Number of equality constraints
Algorithm   0 (time)    1 (time/% increase)    3 (time/% increase)
 1          6.306        6.892/9.3              7.278/15.4
 9          4.153        5.382/29.6            --*
10          2.171        2.976/37.1             2.749/26.6
11          3.491        6.234/78.6             3.405/-2.5
15          9.081       10.108/11.3             9.956/9.6
21          6.639        8.504/28.1            10.320/55.4
31          9.050       18.380/103.1           27.156/200.2

* Could not locate a feasible starting point.
Table 7
Solution time and percentage increase in solution time for an increase in problem nonlinearity

            Highest nonlinear term
Algorithm   Quadratic (time)   Cubic (time/% increase)   Quartic (time/% increase)
 1          6.306               8.263/31.0               10.680/69.4
 9          4.153               5.324/28.2                9.386/126.0
10          2.171               2.306/6.2                 2.890/33.1
11          3.491               5.735/64.3                7.247/107.6
15          9.081               9.787/7.8                11.023/21.4
21          6.639               8.507/28.1               11.069/66.7
31          9.050              10.393/14.8               12.453/37.6
6. Combination of algorithms

Upon close examination of the results of each algorithm class on each of the test problems, several interesting observations may be made. The penalty function algorithms had difficulty in obtaining constraint satisfaction on problems where several constraints were active at the solution. The vicinity of the solution was generally located quickly, but convergence to the final solution was slow. This is most likely due to the increasing distortion of the contours as the stages progress. To solve this type of problem the input parameters had to be carefully selected so that each stage was solved accurately. This was found in general to be a very time consuming process. On the other hand, the generalized reduced gradient algorithms demonstrated rapid convergence once the solution vicinity was located. The only problem encountered with this class of algorithm was a tendency to follow a constraint to a local minimum. This was especially true on some of the heavily constrained problems where the starting point was infeasible. The phase I solution for the GRG algorithms generally locates a feasible point without any regard to the objective function. The methods then start from this point and can use a significant amount of time following constraints which may not even be active at the solution. These observations point out the desirability of combining the traits of both penalty function and generalized reduced gradient algorithms to produce a hybrid algorithm which not only locates the vicinity of the solution quickly but is then able to converge to the final solution in a minimum amount of time. An exterior penalty function method is a natural choice for the first phase of the combined algorithm technique. The convergence criteria for each unconstrained stage may be very loose and the increase in the penalty parameter could also be fairly large.
This results in locating the vicinity of the minimum in a few stages, each of which requires only a small amount of time due to the loose convergence criteria at each stage. The active constraint set would then be easy to obtain, since at this point the active constraints will generally be slightly to moderately violated. The second phase, involving locating the feasible optimal point, could apply the logic of the generalized reduced gradient algorithm with the decision and state variables selected with a good idea of the active constraint set. To test the feasibility of such a hybrid algorithm, the method of multipliers, algorithm 1 [15], was combined with the generalized reduced gradient method, algorithm 11 [5]. These two algorithms were chosen due to the fact that both were developed by the Optimal Design Group at Purdue University under the direction of Ragsdell, and therefore we did not violate any prior agreements concerning the use of the nonlinear programming codes submitted for the study. Also, the input format for both algorithms was very similar, which simplified the implementation of the program to interface the algorithms. The combination of these two algorithms was applied to the test problems from the general comparative study and the results were quite good. First of all, the combined method solved all twenty-three of the rated test problem set,
Table 8
Percentage improvement in solution time for the biased penalty-reduced gradient combined algorithm over algorithms 1 and 11

Problem   Improvement over algorithm 1   Improvement over algorithm 11
  1        69.6                           -32.7
  2        81.3                            18.0
  3        92.6                             8.0
  4        43.8                            42.1
  5        14.7                             1.5
  6        --*                             --*
  7        44.4                            25.0
  8        80.3                           -87.7
 10        69.2                           -90.9
 11        75.4                           -15.7
 12         2.6                            --*
 14        75.4                             7.4
 15        92.1                           -20.0
 16        --*                              4.9
 17        71.9                           -35.0
 18        90.9                            22.9
 19        77.0                            33.5
 20        97.3                            61.1
 23        61.1                             6.1
 24        74.9                            68.2
 25        79.9                            -0.3
 26        --*                              5.3
 27        74.1                            --*

* No solution was found by algorithm 1 or 11 alone.
which neither algorithm 1 nor algorithm 11 was able to do alone. The percentage decrease in solution time for the combined method over the normal solution times for algorithms 1 and 11 on the rated test problem set is presented in Table 8. The percentage reduction in solution time for the combined method over the solution time for algorithm 1 is quite large. In fact, the average percentage time savings over the entire test problem set was a remarkable 68%. This is slightly better than a two-thirds reduction in computation time. When compared to the generalized reduced gradient algorithm, code 11, the improvement is not quite so obvious. On most of the initial test problems the combined method was significantly slower than code 11. These were generally problems in which only one or two constraints were active at the solution and algorithm 11 produced extremely fast solution times. Even for these problems, however, the solution time for the combined method is well below the average solution times for all of the methods. The more difficult problems in the study demonstrate the real effectiveness of the combined method. Many of these problems were solved significantly faster by the combined method than by code 11. To get a better idea of how well the combined code would have fared in the general comparative study, consider the following relative rankings. The combined method solved 18 of the rated test problem set within 25% of the average solution time, 20 problems within 50% of the average solution time, and all 23 of the problems within 75% of the average solution time. This ranks the combination of algorithms at the top in every ranking, even above the reduced gradient algorithms. Performance on the additional test problem set was also a marked improvement over the performance of either algorithm 1 or 11. A discussion of the relative performance of the combined method on these problems is given by Sandgren [17].
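The two-phase idea can be sketched generically. In the sketch below (my own), scipy's BFGS stands in for the unconstrained penalty-stage solver and SLSQP for the second, constrained phase, since the study's actual components (BIAS and OPT) are not publicly packaged; the tolerances and penalty schedule are illustrative choices.

```python
import numpy as np
from scipy.optimize import minimize

def hybrid_solve(f, gs, x0, p0=10.0, growth=10.0, stages=3, tol=1e-2):
    """Two-phase hybrid in the spirit of Section 6: a few loosely converged
    exterior penalty stages locate the solution region and suggest the
    active set, then a constrained solver finishes from that point.

    Constraints use the convention g(x) >= 0 feasible.
    """
    x, p = np.asarray(x0, float), p0
    for _ in range(stages):   # phase 1: cheap stages, loose convergence
        penalized = lambda x, p=p: f(x) + p * sum(min(g(x), 0.0) ** 2 for g in gs)
        x = minimize(penalized, x, method="BFGS", options={"gtol": tol}).x
        p *= growth
    # Constraints near zero (or violated) are the estimated active set.
    active = [i for i, g in enumerate(gs) if g(x) < tol]
    cons = [{"type": "ineq", "fun": g} for g in gs]
    res = minimize(f, x, method="SLSQP", constraints=cons)  # phase 2
    return res.x, active

# Toy problem: minimize distance to (2, 2) subject to x1 + x2 <= 1.
f = lambda x: (x[0] - 2.0) ** 2 + (x[1] - 2.0) ** 2
g1 = lambda x: 1.0 - x[0] - x[1]
x_opt, active = hybrid_solve(f, [g1], [0.0, 0.0])
```

The loose penalty stages end slightly on the infeasible side of the binding constraint, which is exactly what makes the active-set estimate cheap to read off before the constrained phase starts.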
7. Conclusions
Of the nonlinear programming algorithms currently available for use, the generalized reduced gradient algorithms are computationally faster and are capable of solving a large variety of problems. This was evident in the results from both the general study of engineering problems and from the study involving specifically structured mathematical problems. The fact that much of the same information was obtained through both approaches of comparison is encouraging in that carefully designed statistical experiments can be set up to generate some specific comparative information. As far as defining the ability of an algorithm to solve a wide variety of engineering problems, however, the information would be difficult to obtain in any other manner than in a general comparative study. Finally the initial results on the combination of penalty function and generalized reduced gradient methods are very promising and deserve continued study.
Acknowledgment

The people who contributed to this work are too many to mention. While it was not the intention of the authors to advertise any codes in this work, it is understandable that more information on the codes and problems will be desired. Detailed problem and code descriptions are available in the dissertation entitled "The utility of nonlinear programming algorithms", by Sandgren [17]. Problem information includes complete starting and solution point data, a general description of each problem and a complete Fortran listing of the objective function and constraints. The information contained for each of the codes tested includes the availability of the code, a brief description of the code and the originator of the code. A copy of the thesis may be obtained through University Microfilms International, located at 300 N. Zeeb Road, Ann Arbor, MI 48106. The document number is 7813115.
References

[1] K.A. Afimiwala, "Program package for design optimization", Department of Mechanical Engineering, State University of New York at Buffalo, NY (1974).
[2] A.R. Colville, "A comparative study on nonlinear programming codes", in: H.W. Kuhn, ed., Proceedings of the Princeton symposium on mathematical programming (Princeton University Press, Princeton, NJ, 1970) pp. 481-501.
[3] R.S. Dembo, "A set of geometric programming test problems and their solutions", Working Paper Number 87, Department of Management Sciences, University of Waterloo, Ontario (June 1974).
[4] E.D. Eason and R.G. Fenton, "A comparison of numerical optimization methods for engineering design", Journal of Engineering for Industry Series B 96 (1974) 196-200.
[5] G.A. Gabriele and K.M. Ragsdell, "OPT: A nonlinear programming code in Fortran-IV-users manual", Purdue Research Foundation, West Lafayette, IN (January 1976).
[6] R.E. Griffith and R.A. Stewart, "A nonlinear programming technique for the optimization of continuous processing systems", Management Science 7 (1961) 379-392.
[7] J. Guigou, "Présentation et utilisation du code GREG", Département traitement de l'information et études mathématiques, Electricité de France, Paris, France (1971).
[8] Gulf Oil Corporation, "Compute II, a general purpose optimizer for continuous, nonlinear models-description and user's manual", Gulf Computer Services Incorporated, Houston, TX (1972).
[9] D.M. Himmelblau, "Optimal design via structural parameters and nonlinear programming", Engineering Optimization 2 (1976).
[10] L.J. LaFrance, "User's manual for GRGDFP, an optimization program", Herrick Laboratories Report 45, West Lafayette, IN (April 1974).
[11] L.S. Lasdon, A.D. Waren, M.W. Ratner and A. Jain, "GRG user's guide", Cleveland State University Technical Memorandum CIS-75-02 (November 1975).
[12] W.C. Mylander, R.L. Holmes and G.P. McCormick, "A guide to SUMT-version 4", Research Analysis Corporation RAC-P-63, McLean, VA (1974).
[13] R. Phillipson, Department of Industrial Engineering, Purdue University, West Lafayette, IN, private communication.
[14] K.M. Ragsdell and D.T. Phillips, "Optimization of a class of welded structures using geometric programming", Journal of Engineering for Industry Series B 98 (1976) 119-130.
[15] R.R. Root and K.M. Ragsdell, "BIAS: A nonlinear programming code in Fortran-IV-users manual", Purdue Research Foundation, West Lafayette, IN (September 1978).
[16] J.B. Rosen and S. Suzuki, "Construction of nonlinear programming test problems", Association for Computing Machinery Communications 8 (1965).
[17] E. Sandgren, "The utility of nonlinear programming algorithms", Dissertation, Purdue University, West Lafayette, IN (1977).
[18] E. Sandgren and K.M. Ragsdell, "The utility of nonlinear programming algorithms: A comparative study, Part I and Part II", Journal of Mechanical Design 102 (1980) 540-551.
[19] E. Sandgren and K.M. Ragsdell, "The optimal design of rotating disks", to appear.
[20] S.B. Schuldt, Honeywell Corporation, Bloomington, MN, private communication.
[21] S. Schuldt, Honeywell Corporation, Bloomington, MN, private communication.
[22] J.N. Siddall, "Opti-Sep: Designers optimization subroutines", McMaster University, Hamilton, Ontario (1971).
[23] R.L. Staha, "Documentation for program Comet, a constrained optimization code", The University of Texas, Austin, TX (April 1973).
[24] J. Tomas, "The synthesis of mechanisms as a nonlinear programming problem", Journal of Mechanisms 3 (1968) 119-130.
[25] The University of Liverpool Computer Laboratory, E04HAF, Numerical Algorithms Group, Document 737 (1974).
[26] D. Williams, Whirlpool Corporation, Benton Harbor, MI, private communication.
Mathematical Programming Study 16 (1982) 137-148 North-Holland Publishing Company
DETERMINING FEASIBILITY OF A SET OF NONLINEAR INEQUALITY CONSTRAINTS

Robert B. SCHNABEL
Department of Computer Science, University of Colorado at Boulder, Boulder, CO 80309, U.S.A. Received 12 February 1980 Revised manuscript received 10 October 1980
We show that a particular penalty function algorithm, using an exponential-type penalty function which was introduced by Kort and Bertsekas and by Hartman, is well suited to the problem of finding whether a set of nonlinear inequality constraints has a feasible point. In particular, our algorithm has the property that on virtually all problems, once the penalty parameter reaches or exceeds some finite value, either a feasible point is produced or infeasibility of the set of constraints is demonstrated. Thus it shares an advantage of augmented Lagrangian algorithms even though no multiplier estimates are used. We also demonstrate that with certain choices of the penalty function, our algorithm will produce points which are as feasible as possible, on feasible problems. Some computational results are also presented.
Key words: Nonlinear Inequality Constraints, Feasibility, Penalty Function
1. Introduction

We consider the problem:

Given D ⊆ Rⁿ, c_i : D → R, i = 1, ..., m. Does there exist any x ∈ D such that c_i(x) ≤ 0, i = 1, ..., m?  (1.1)
The functions c_i are called constraint functions, and we will assume them to be twice continuously differentiable. If the answer to (1.1) is yes, the set of constraints {c_i} is said to be feasible; otherwise it is said to be infeasible. Problem (1.1) is a special case of the nonlinear programming problem and occurs frequently in practice. However, not much work has been devoted to developing algorithms specifically for (1.1). (Robinson [12] discusses a local method for finding a feasible point.) The purpose of this paper is to show that a certain penalty function approach is well suited to this problem. The class of penalty functions we use was introduced by Kort and Bertsekas [8] and Hartman [7] for the general nonlinear programming problem, and is also discussed in [1, 2, 9, 10]. Problem (1.1) can easily be cast as a non-differentiable optimization problem, for example a minimax problem (see e.g., [3]). Instead, our method solves (1.1)
by solving a sequence of problems

min_{x∈D} φ(x, p) ≜ (1/p) Σ_{i=1}^{m} w(p · c_i(x))  (1.2)
for a monotonically increasing sequence of values of the nonnegative penalty parameter p. The penalty or weighting function w is any twice continuously differentiable function w : R → R which obeys

w(0) = 0,  (1.3a)

w'(y) > 0 for all y ∈ R,  (1.3b)

w''(y) > 0 for all y ∈ R,  (1.3c)

lim_{y→-∞} w(y) = l for some l > -∞.  (1.3d)

Note that (1.3a-c) also imply

lim_{y→+∞} w(y) = +∞.  (1.3e)
An example is w(y) = e^y - 1, but many other possibilities exist, including

w₁(y) = e^y - 1,  y ≤ α,
w₁(y) = ½(y - α)² e^α + (y - α) e^α + (e^α - 1),  y > α,

and

w₂(y) = y/(1 - y),  y ≤ β,
w₂(y) = (y - β)²/(1 - β)³ + (y - β)/(1 - β)² + β/(1 - β),  y > β,
for any α > 0 or β ∈ (0, 1). Our motivation for (1.2) was to form a weighted sum of the constraint values that penalizes constraint violation and mildly rewards constraint satisfaction. Similarly, (1.2) can be viewed as the application of a penalty function approach to

min f(x),
subject to c_i(x) ≤ 0, i = 1, ..., m,  (1.4)
where f is assumed to be identically zero and the penalty function obeys (1.3). Problem (1.4) is the context in which Kort and Bertsekas [8] and Hartman [7] introduced this penalty function.

In Section 2 we show that when this type of penalty function is used in (1.2), then in almost all cases, either a feasible point is produced for {c_i} or the system of constraints is shown to be infeasible, as soon as p reaches or exceeds some finite threshold value. Thus for the system of inequality constraints problem, a penalty function algorithm exhibits the type of behavior usually shown by augmented Lagrangian algorithms (see e.g., [11]). Furthermore, the infeasible as well as the feasible case of (1.1) is solved. In Section 3 we show that if {c_i} is feasible, then certain choices of w in (1.2) will cause (1.2) to produce a point which is as feasible as possible in the limit as p → ∞, while with other choices of w this will not always occur. In Section 4 we present some computational results.
2. Termination properties of the penalty algorithm

In this section, we study the properties of the minimization problem (1.2) for solving the feasibility problem (1.1). We will need the following definitions.

Definition 2.1. Let D ⊆ Rⁿ, and let c_i : D → R, i = 1, ..., m. The set of constraints {c_i} is said to be strictly γ feasible for a given γ > 0 if there exists x ∈ D such that c_i(x) ≤ -γ, i = 1, ..., m; strictly feasible if it is strictly γ feasible for some γ > 0; exactly feasible if it is feasible but not strictly feasible; within δ of feasibility for a given δ > 0 if {c_i} is infeasible but {c_i - δ} is feasible; strictly infeasible if there exists δ > 0 such that {c_i} is not within δ of feasibility; exactly infeasible if it is infeasible but not strictly infeasible.

We will assume in this paper that any infeasible set of constraints is strictly infeasible. This is true, for example, if D is closed and bounded. For p > 0, let us define

φ*(p) = min_{x∈D} φ(x, p) = (1/p) Σ_{i=1}^{m} w(p · c_i(x)),

x*(p) = a minimizer of φ(x, p) for x ∈ D.

Since w(y) ≤ 0 for all y ≤ 0, it is obvious that if {c_i} is feasible, then φ*(p) ≤ 0 for all p. Thus the result of (1.2) will fall into one of three categories: (i) x*(p) is feasible, (ii) φ*(p) > 0, that is, infeasibility is established, or (iii) φ*(p) ≤ 0 and x*(p) is infeasible, which can occur whether {c_i} is feasible or not. The idea of our method is to increase p until either (i) or (ii) occurs. In Theorems 2.3-2.5 we show that: (1) φ*(p) is a monotonically increasing function of p, (2) if {c_i} is strictly feasible, then once p reaches or exceeds some finite value, any x*(p) is feasible, and (3) if {c_i} is strictly infeasible, then once p reaches or exceeds some finite value, φ*(p) > 0.
Thus our method deals nicely with the strictly feasible and strictly infeasible cases. We first need a standard result from convexity theory (see e.g., [13]).
Lemma 2.2. Let w : R → R be continuously differentiable and strictly convex, w(0) = 0. Let p₊ > p > 0. Then for any y ∈ R,

w(p₊ · y)/p₊ ≥ w(p · y)/p,

with equality if and only if y = 0.

Proof. If y = 0 the lemma is trivially true. Otherwise let t = p/p₊ ∈ (0, 1). Since w is strictly convex, one has

w(py) = w(t p₊ y + (1 - t) · 0) < t w(p₊ y) + (1 - t) w(0) = (p/p₊) w(p₊ y).
Theorem 2.3. Let w : R → R satisfy the assumptions of Lemma 2.2, p₊ > p > 0. Then φ*(p₊) ≥ φ*(p), with equality only if c_i(x*(p₊)) = 0, i = 1, ..., m.

Proof. From Lemma 2.2,

w(p₊ · c_i(x*(p₊)))/p₊ ≥ w(p · c_i(x*(p₊)))/p, i = 1, ..., m,  (2.1)

with equality if and only if c_i(x*(p₊)) = 0. Summing (2.1) for i going from 1 to m yields

φ*(p₊) ≥ φ(x*(p₊), p)  (2.2)

with equality if and only if c_i(x*(p₊)) = 0, i = 1, ..., m. Also, by the definition of φ*(p),

φ(x*(p₊), p) ≥ φ*(p).  (2.3)

Combining (2.2) and (2.3) completes the proof.
Theorem 2.4. Let w : R → R obey (1.3a,b,d). If {c_i} is strictly feasible, then there exists p₁ ≥ 0 such that for any p ≥ p₁, any x*(p) is feasible.

Proof. Since {c_i} is strictly feasible, there exists γ > 0, x₁ ∈ Rⁿ such that c_i(x₁) ≤ -γ, i = 1, ..., m. Let y₁ < 0 be chosen so that

w(y₁) ≤ (m - 1)l/m  (2.4)

and define p₁ = -y₁/γ. We show that for any p ≥ p₁, any x*(p) is feasible. Suppose for some p ≥ p₁, some x*(p) is infeasible. Then since at least one constraint is infeasible at x*(p) and w(y) > 0 for all y > 0,

φ*(p) > (m - 1)l/p.  (2.5)

Also, from the definition of x₁ and the monotonicity of w,

φ(x₁, p) ≤ m · w(-pγ)/p.  (2.6)

From the definitions of p₁ and y₁ and the monotonicity of w,

w(-pγ) ≤ w(-p₁γ) = w(y₁) ≤ (m - 1)l/m  (2.7)

so that combining (2.6) and (2.7),

φ(x₁, p) ≤ (m - 1)l/p.

This contradicts (2.5) and completes the proof.
obey (l.3a,b,c,d). I f {c~} is strictly infeasible, then
there exists P2 >- 0 s u c h that f o r any p >- P2, cb.(p) > O.
Proof. Since {c~} is strictly infeasible, we can choose ~ > 0 such that {c~} is not within 6 of feasibility. Let y~,> 0 be chosen so that w(y2) -> - (m - 1)1
(2.8)
and define P2 = yd6. We show that for any p -> p2, 4~,(P) --- 0. Since {c~} is not within 6 of feasibility, by the monotonicity of w we have ~b,(p) > ((m - 1)/+ w ( p 6 ) ) / p .
(2.9)
From the definition of p,, and Y2 and the monotonicity of w, w ( p 6 ) >- w(p23) = w(y2) >- - (m - 1)1.
(2.10)
Combining (2.9) and (2.10) completes the proof. The remaining case is when {c~} is exactly feasible. It is easily shown by the same techniques as in the proof of T h e o r e m 2.4 that in this case, for any 6 > 0, that once p reaches or exceeds some finite value p(6), any x , ( p ) is within 6 of A feasibility (i.e., m a x l~_i~,,, {ci(x , ( p ))} <- ~ ). Thus x , = l i m p _ ~ x , ( p ) is feasible. However, it is easily shown that for any w obeying (1.3), there exist some exactly feasible problems, such as c l ( x ) = l - x 3, c2(x) = x 2 - 1
for which x , ( p ) is infeasible for all p > 0, and other exactly feasible problems, such as
c₁(x) = 1 - x³, c₂(x) = x³ - 1,

for which x*(p) is feasible for all p > 0. Thus in practice, our method cannot decide the exactly feasible case. Instead, our computational algorithm uses the following stopping criterion, for some small δ* > 0:

If φ*(p) > 0, then stop (the constraints are considered infeasible);
else if max_{1≤i≤m} c_i(x*(p)) < δ*, then stop (the constraints are considered feasible);
else continue (augment p).

Clearly, this allows either conclusion for a set of constraints within δ* of infeasibility. However, it is shown below that once p reaches or exceeds some finite value, our method using this stopping criterion will halt.

Theorem 2.6. Let w : R → R obey (1.3a,b,c,d). Then given any δ* > 0, there exists p₃ ≥ 0 such that for any p ≥ p₃, either φ*(p) > 0 or any x*(p) is within δ* of feasibility.
Proof. Let y₂ > 0 be defined by (2.8) and define δ = δ*/(m + 1), p₃ = y₂/δ. If {c_i} is not within δ of feasibility, then by the proof of Theorem 2.5, φ*(p) > 0 for any p ≥ p₃. Now consider the other possibility, that {c_i} is feasible or within δ of feasibility. Then there exists x₃ ∈ Rⁿ such that c_i(x₃) ≤ δ, i = 1, ..., m, and so for any p > 0,

φ(x₃, p) ≤ m · w(pδ)/p.  (2.11)

Now suppose that for some p ≥ p₃, some x*(p) is not within δ* of feasibility. Then by the monotonicity of w,

φ*(p) > ((m - 1)l + w(pδ*))/p.  (2.12)

Since w is strictly convex and w(0) = 0,

w(pδ*) > (m + 1) w(pδ*/(m + 1)) = (m + 1) w(pδ),  (2.13)

and combining (2.12) and (2.13),

φ*(p) > ([(m - 1)l + w(pδ)] + m · w(pδ))/p.  (2.14)

Also, by the monotonicity of w and the definitions of p₃ and y₂,

w(pδ) ≥ w(p₃δ) = w(y₂) ≥ -(m - 1)l  (2.15)

so that combining (2.14) and (2.15),

φ*(p) > m · w(pδ)/p,

which contradicts (2.11) and completes the proof.
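The overall method of this section — minimize φ(·, p), apply the stopping test, and augment p — can be sketched in a few lines. The sketch below is our simplified illustration, not the author's implementation: a naive finite-difference steepest-descent routine stands in for the real unconstrained minimizer, w(y) = e^y − 1 is fixed, and the penalty parameter is updated by a plain multiplicative increase:

```python
import numpy as np

def feasibility_search(constraints, x0, delta_star=1e-6, p0=0.4,
                       factor=10.0, max_outer=20):
    """Outer penalty loop for (1.1) with w(y) = exp(y) - 1.
    Returns ('feasible' | 'infeasible' | 'undecided', x)."""

    def phi(x, p):
        c = np.array([ci(x) for ci in constraints])
        return np.sum(np.exp(p * c) - 1.0) / p          # eq. (1.2)

    def grad(x, p, h=1e-6):                              # forward differences
        g = np.zeros_like(x)
        for j in range(len(x)):
            e = np.zeros_like(x); e[j] = h
            g[j] = (phi(x + e, p) - phi(x, p)) / h
        return g

    def minimize(x, p, iters=200):
        for _ in range(iters):
            g = grad(x, p)
            step = 1.0
            while phi(x - step * g, p) > phi(x, p) - 1e-4 * step * (g @ g):
                step *= 0.5                              # backtracking search
                if step < 1e-12:
                    return x
            x = x - step * g
        return x

    x, p = np.asarray(x0, dtype=float), p0
    for _ in range(max_outer):
        x = minimize(x, p)
        if phi(x, p) > 0.0:                  # case (ii): infeasibility shown
            return "infeasible", x
        if max(ci(x) for ci in constraints) < delta_star:
            return "feasible", x             # case (i): feasible point found
        p *= factor                          # augment the penalty parameter
    return "undecided", x

# Feasible system: c1(x) = x - 1 <= 0, c2(x) = -x - 1 <= 0
status, x = feasibility_search([lambda v: v[0] - 1.0,
                                lambda v: -v[0] - 1.0], [3.0])
```

On this example the inner minimizer drives x toward 0, where both constraints are strictly satisfied, so the loop halts with status "feasible" at the first value of p, consistent with the small number of p values observed in Section 4.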
3. Finding points as feasible as possible
When {c_i} is strictly feasible, many practical applications desire an interior point of the constraint set, that is, any x ∈ Rⁿ such that c_i(x) < 0, i = 1, ..., m. In this case, let k* be the largest value of k for which {c_i} is strictly k feasible, i.e.,

k* = -min_{x∈D} max_{1≤i≤m} {c_i(x)}.

In this section, we show that if w(y) = e^y - 1 for y ≤ 0, then for any k < k*, once p reaches or exceeds some finite value, any x*(p) will be strictly k feasible. Thus the algorithm is very effective for finding interior points. However, if w(y) = y/(1 - y) for y ≤ 0, this result does not hold. Thus the choice of weighting function is important if one wants to find points which are as feasible as possible. Theorem 3.1 gives a sufficient condition for (1.2) to find a strictly k feasible point for a strictly k* feasible problem, if k < k*.

Theorem 3.1. Let w : R → R obey (1.3a,b,c) and define ŵ(y) = w(y) - l. Let 0 < k < k* and assume that {c_i} is strictly k* feasible. Suppose there exists p₄ ≥ 0 such that for all p ≥ p₄,

m · ŵ(-pk*) ≤ ŵ(-pk).  (3.1)

Then for any p ≥ p₄, any x*(p) is strictly k feasible.

Proof. Suppose there exists p₄ satisfying the conditions of the theorem, and that for some p ≥ p₄, some x*(p) is not strictly k feasible. Then at least one constraint has value greater than -k at x*(p), so that

φ*(p) > ((m - 1)l + w(-pk))/p = (m · l + ŵ(-pk))/p.  (3.2)

However, since {c_i} is strictly k* feasible, there exists x₄ ∈ Rⁿ such that c_i(x₄) ≤ -k*, i = 1, ..., m. Thus from the definition of x*(p) and the monotonicity of w,

φ*(p) ≤ m · w(-pk*)/p = (m · l + m · ŵ(-pk*))/p.  (3.3)

Eqs. (3.2) and (3.3) imply that ŵ(-pk) < m · ŵ(-pk*), which contradicts (3.1) and completes the proof.
Corollary 3.2. Let w(y) = e^y - 1 for y ≤ 0, 0 < k < k*, and assume that {c_i} is strictly k* feasible. Then there exists p₅ ≥ 0 such that for all p ≥ p₅, any x*(p) is strictly k feasible.

Proof. From Theorem 3.1, it suffices to show that there exists p₅ such that for all p ≥ p₅, (3.1) is true. For w(y) = e^y - 1, ŵ(y) = e^y, so (3.1) requires that

m e^{-pk*} ≤ e^{-pk},

or m ≤ e^{p(k*-k)}. This is clearly true for any p ≥ ln m/(k* - k) ≜ p₅.

If w(y) = y/(1 - y) for y ≤ 0, then ŵ(y) = 1/(1 - y) and so (3.1) requires m/(1 + pk*) ≤ 1/(1 + pk), or (m - 1) ≤ p(k* - mk), which is only possible if k < k*/m. Thus with this weighting function, Theorem 3.1 only guarantees that (1.2) will produce strictly (k*/m - ε) feasible points on a strictly k* feasible problem. Indeed, Example 3.3 is one where (1.2) using w(y) = y/(1 - y) does not produce any x*(p) which is strictly 2.5 feasible on a problem which is strictly 3 feasible. More extreme examples are possible for nonlinear constraints.
Example 3.3. Let c₁(x) = -2 - x, c₂(x) = -10 + 7x. This set of constraints has k* = 3, with c₁(1) = c₂(1) = -3. However, we show that if w(y) = y/(1 - y) for y ≤ 0, then x*(p) < 0.5 for all p > 0. Thus c₁(x*(p)) > -2.5 for all p > 0. For any p > 0, direct computation yields

x*(p) ≜ h(p) = (7(1 + 6p) - √(49(1 + 6p)² + 21(3 + 4p - 36p²)))/(21p).

The expression under the root simplifies to 112(1 + 3p)², so that

h(p) = 2 - 4√7/7 + (7 - 4√7)/(21p).

Since 7 - 4√7 < 0, h is monotonically increasing for p > 0, with

h'(p) = (4√7 - 7)/(21p²) > 0 for all p > 0.

Finally, lim_{p→∞} h(p) = 2 - 4√7/7 ≈ 0.488, which completes the example.
4. Computational results

We have implemented the penalty function approach (1.2) for finding a feasible point of a set of nonlinear inequality constraints, using the weighting function w(y) = e^y - 1. Our algorithm handles linear inequality constraints separately from nonlinear ones. It first determines whether the linear constraints are feasible (using phase I of linear programming, see e.g., [4]) and if they are, solves a sequence of linearly constrained optimization problems

min_{x∈D} φ(x, p) = (1/p) Σ_{i=1}^{m} (e^{c_i(x)·p} - 1),
subject to l_j(x) ≤ 0, j = 1, ..., q,  (4.1)

where the c_i's are the nonlinear constraints and the l_j's the linear constraints. Our algorithm for (4.1) is a variant of the algorithms of Fletcher [5] and Goldfarb [6]. It includes a restarting strategy to enhance its chances of finding a global solution to (4.1). This increases our chances of finding a global rather than a local solution to (1.1), and in practice the algorithm has been successful in finding feasible points of some simple problems where a local algorithm for (4.1) would not have been. It is shown by L'Hopital's rule that

φ(x, 0) ≜ lim_{p→0} φ(x, p) = Σ_{i=1}^{m} c_i(x).

Thus we start our feasibility algorithm by minimizing φ(x, 0), subject to the linear constraints and some special safeguards for this case. When the current penalty parameter p_c is greater than zero, it is increased to p₊ by constructing an appropriate model m(p) of φ*(p) based on current and past values of φ*(p), and choosing p₊ so that m(p₊) = 0, subject to p₊ ∈ [2p_c, 10p_c]. A different strategy is used when p_c = 0. Tables 4.1-4.4 present four indicative runs of our algorithm. Mainly they illustrate that it works as the theory predicts. What is perhaps most striking in our tests so far is how few values of p seem to be required to obtain the correct
Table 1

c₁(x) = sin(x₁ + x₂) + x₁
Initial guess = (0, 3π/2)

p      Iterations inside linearly        x*(p)          φ*(p)    Constraint values at x*(p)
       constrained optimization routine
0      14                                (-6.3, 0.33)   -5.2     -5.3, 0.1
0.37   2                                 (-6.3, 0.01)   -2.3     -5.3, 0.1
3.7    1                                 (-6.0, 0.01)   -0.47    -6.8, -0.36
(feasible point found)
Table 2

c₁(x) = (½x₁ - 3)x₁ + 2x₂ - 1
c₂(x) = x₁ + (½x₂ - 3)x₂ + 2x₃ - 1
c₃(x) = x₂ + (½x₃ - 3)x₃ + 2x₄ - 1
c₄(x) = x₃ + (½x₄ - 3)x₄ + 2x₅ - 1
c₅(x) = x₄ + (½x₅ - 3)x₅ - 1
Initial guess = (1, 1, 1, 1, 1)

p      Iterations inside linearly        x*(p)                       φ*(p)   Constraint values at x*(p)
       constrained optimization routine
0      1                                 (2, 0, 0, 0, 1)             -7.5    -5, 1, -1, 1, -3.5
0.37   2                                 (1.3, 0.7, 0.3, 0.2, 0.3)   -4.9    -2.5, -1, -0.7, -0.6, -1.7
(feasible point found)
Table 3

c₁(x) = x₁² + 2x₂² - 4
c₂(x) = x₁² + 2x₂ + x₃² - 8
c₃(x) = (x₁ - 1)² + (2x₂ - √2)² + (x₃ - 5)² - 4
Initial guess = (1.0, 0.7, 5.0)

p     Iterations inside linearly        x*(p)             φ*(p)   Constraint values at x*(p)
      constrained optimization routine
0     4                                 (0.3, 0.3, 1.5)   1.7     -3.7, -3.7, 9.1
(concluded infeasible)
Table 4

c₁(x) = quadratic with minimum value of 1 at x = 0 ∈ R¹⁰
c₂(x), ..., c₁₀(x) = quadratic with minimum value of -1 at x = 0 ∈ R¹⁰
Initial guess = (1, 1, ..., 1)

p     Iterations inside linearly        x*(p)            φ*(p)   Constraint values at x*(p)
      constrained optimization routine
0     1                                 (0, 0, ..., 0)   -9.0    1, -1, -1, ..., -1
0.4   1                                 (0, 0, ..., 0)   -3.7    1, -1, -1, ..., -1
4.0   1                                 (0, 0, ..., 0)   11.2    1, -1, -1, ..., -1
(concluded infeasible)
answer. Tables 1 and 2 illustrate the algorithm on feasible problems. Table 1 is a case where the global characteristics of our linearly constrained optimization were required due to the shape of the tilted sine function. Tables 3 and 4 illustrate the algorithm on infeasible problems. Table 4 is interesting because the optimal point x*(p) remains the same for all p, but p must reach a positive threshold value before φ*(p) > 0 and infeasibility is detected.

It has been suggested that our algorithm could be improved by modifying our subproblem to
min_{x∈D} (1/p) Σ_{i=1}^{m} λ_i(p) · w(p · c_i(x)),

and then modifying the Lagrange multiplier estimates λ_i(p) to λ_i(p₊) = λ_i(p) · w'(p · c_i(x*(p))) before the start of the next iteration (see [2, 8, 10]). This suggestion stems directly from the standard practice in augmented Lagrangian algorithms for the general nonlinear programming problem. However, it is not clear to us that it will lead to an improvement in our case. First, the justification for using multiplier estimates in the absence of an objective function is less clear. Second, we have shown that our algorithm has theoretical properties comparable to those achieved by augmented Lagrangian algorithms for general nonlinear programming problems. In fact, it is not obvious that all the results of this paper would hold for the above algorithm.
Acknowledgments

The author thanks Professors L. Fosdick and L. Osterweil of the University of Colorado at Boulder, who first interested him in this problem in the context of their work in static software validation; R. Lear, who programmed the initial version of our algorithm; and Professor R.T. Rockafellar of the University of Washington for several helpful discussions at the early stages of this project, and for pointing out some relevant references. He is also indebted to the referees for many helpful comments which led to a shorter, improved paper. These included informing him of the previous work on exponential-type functions, and providing a simpler proof of Lemma 2.2.
References

[1] D.P. Bertsekas, "A new algorithm for solution of nonlinear resistive networks involving diodes", IEEE Transactions on Circuits and Systems CAS-23 (1976) 599-608.
[2] D.P. Bertsekas, "Approximation procedures based on the method of multipliers", Journal of Optimization Theory and its Applications 23 (1977) 487-510.
[3] C. Charalambous and A.R. Conn, "An efficient method to solve the minimax problem directly", SIAM Journal on Numerical Analysis 15 (1978) 162-187.
[4] G.B. Dantzig, Linear programming and extensions (Princeton University Press, Princeton, NJ, 1963).
[5] R. Fletcher, "An algorithm for solving linearly constrained optimization problems", Mathematical Programming 2 (1972) 133-165.
[6] D. Goldfarb, "Extensions of Davidon's variable metric algorithm to maximization under linear inequality and equality constraints", SIAM Journal on Applied Mathematics 17 (1969) 739-764.
[7] J.K. Hartman, "Iterative determination of parameters for an exact penalty function", Journal of Optimization Theory and its Applications 16 (1975) 49-66.
[8] B.W. Kort and D.P. Bertsekas, "A new penalty function method for constrained minimization", Proceedings of 1972 IEEE Conference on Decision and Control (1972) 162-166.
[9] B.W. Kort and D.P. Bertsekas, "Combined primal-dual and penalty methods for convex programming", SIAM Journal on Control and Optimization 14 (1976) 268-294.
[10] V.H. Nguyen and J.J. Strodiot, "On the convergence rate for a penalty function method of exponential type", Journal of Optimization Theory and its Applications 27 (1979) 495-508.
[11] D.A. Pierre and M.J. Lowe, Mathematical programming via augmented Lagrangians (Addison-Wesley, Reading, MA, 1975).
[12] S.M. Robinson, "Extension of Newton's method to nonlinear functions with values in a cone", Numerische Mathematik 19 (1972) 341-347.
[13] R.T. Rockafellar, Convex analysis (Princeton University Press, Princeton, NJ, 1970).
Mathematical Programming Study 16 (1982) 149-161 North-Holland Publishing Company
CONJUGATE GRADIENT METHODS FOR LINEARLY CONSTRAINED NONLINEAR PROGRAMMING

D.F. SHANNO and R.E. MARSTEN

Department of Management Information Systems, College of Business and Public Administration, The University of Arizona, Tucson, AZ 85721, U.S.A.

Received 14 July 1980
Revised manuscript received 25 March 1981
This paper considers the problem of minimizing a nonlinear function subject to linear constraints. The method adopted is the reduced gradient method as described by Murtagh and Saunders, with a conjugate gradient method due to Shanno used for unconstrained minimization on manifolds. It is shown how to preserve past information on search directions when a basis change occurs and when a superbasic variable is dropped. Numerical results show a substantial improvement over the reported results of Murtagh and Saunders when conjugate gradient methods are used.
Key words: Linear Programming, Nonlinear Programming, Optimization, Linear Constraints, Conjugate Gradient Method, Reduced Gradient Method, Simplex Method, Sparse Matrix, Large-Scale Systems.
1. Introduction

The linearly constrained nonlinear optimization problem is

minimize f(x), x = (x₁, ..., x_n),  (1)
subject to Ax = b,  (2)
l ≤ x ≤ u,  (3)

where A is an m × n matrix, m < n. This is the formulation considered by Murtagh and Saunders [7] in their seminal paper on methods for solving large problems of the form (1)-(3). The method they adopt is a reduced gradient method, which we will summarize briefly here. Following Murtagh and Saunders, the matrix A is partitioned as

Ax = [B | S | N] (x_B, x_S, x_N)ᵀ = b,  (4)

where the basic variables x_B are used to satisfy the constraint set, the superbasic variables x_S are allowed to vary to minimize f(x), and the nonbasic variables x_N are fixed at their bounds. Here B is a square, nonsingular matrix. At each step, the problem then becomes determining a step vector Δx = (Δx_B, Δx_S, Δx_N) which
reduces the value of f(x) while forcing x + Δx to continue to satisfy the constraints (2). As the values x_N are fixed, we get immediately

Δx_N = 0,  (5)

and allowing Δx_S to be chosen freely means that Δx_B is determined by

Δx_B = -B⁻¹ S Δx_S.  (6)
Considering the unconstrained minimization problem of minimizing f as a function of the current set of superbasic variables x_S, we note that the reduced gradient h of f(x) in the variables x_S is given by

h = g_S - Sᵀπ,  (7)

π = (Bᵀ)⁻¹ g_B,  (8)

where g_B, g_S and g_N are the components of the gradient vector ∇f(x) corresponding to x_B, x_S, and x_N, respectively. Theoretically, the algorithm continues until ||h|| = 0, in which case Δx_S = 0 and x_S is a stationary point of f on the manifold determined by x_B and x_S. The Lagrange multipliers λ defined by

λ = g_N - Nᵀπ  (9)
are computed, and if λ ≥ 0 for each x in x_N at its lower bound and λ ≤ 0 for each x in x_N at its upper bound, the point x is a stationary point of f. If not, a set of candidates can be chosen from x_N and allowed to become superbasic, and a new unconstrained problem is solved. The algorithm continues until a stationary point has been found.

The only complicating factor beyond normal unconstrained optimization while minimizing on a given manifold is the possibility that either a basic or a superbasic variable may strike a bound during the search. If a superbasic variable strikes a bound, it is made nonbasic, the dimension of the manifold is reduced by one, and the search continues. If a basic variable strikes a bound, the basic variable is exchanged with an appropriate superbasic variable, and the resulting new superbasic variable is made nonbasic.

The algorithm described above is incomplete in several details. First, the numerical method of doing unconstrained minimization on each manifold is not specified. Second, the question of how many nonbasic variables should be allowed to become superbasic upon completion of minimization on a manifold has not been determined. Finally, the convergence on manifolds, while theoretically defined as ||h|| = 0, is open as a practical question.

Murtagh and Saunders used both variable metric and conjugate gradient methods for minimization on a manifold. The major purpose of this paper is to suggest a new conjugate gradient algorithm, based on a recent method of Shanno [9], which can be modified to preserve information about good search directions when superbasic variables are dropped and when basis changes occur on manifolds. The computational results of Section 3 will show that this method is much more efficient than the reported results of Murtagh and Saunders for conjugate gradient methods.

A secondary thrust of the paper is to examine how the choice of the superbasic set of variables affects numerical efficiency. The computational results of Section 3 are in this case slightly less definitive, but indicate strongly that the initial manifold should be as large as possible. Furthermore, making each subsequent manifold as large as possible appears not to hurt, and often to help, numerical efficiency, at least for conjugate gradient methods, which are all that are tested here. Murtagh and Saunders also note that for conjugate gradient algorithms, many superbasics may be introduced simultaneously. Finally, the paper tests loose versus strict convergence criteria on all but the suspected final manifold, and finds strong numerical evidence that the strategy of loose convergence on preliminary manifolds substantially improves performance in most cases, agreeing with the results reported for conjugate gradient methods by Murtagh and Saunders.

2. Memoryless variable metric methods with linear constraints
The classical conjugate gradient method for unconstrained optimization of a function f(x) with gradient g(x) is an iterative method due to Hestenes and Stiefel [4], and applied to nonlinear problems by Fletcher and Reeves [3], which is defined by

x_{k+1} = x_k + α_k d_k,  (10)

d_{k+1} = -g_{k+1} + β_k d_k,  (11)

where β_k = g_{k+1}ᵀ y_k / d_kᵀ y_k, y_k = g_{k+1} - g_k, and α_k is an appropriately chosen scalar. In [9], it is shown that if α_k is chosen at each step to exactly minimize f(x) along d_k, the direction d_{k+1} defined by (11) is equivalent to

d_{k+1} = -Q_{k+1} g_{k+1},  (12)

where Q_{k+1} is the positive definite matrix defined by

Q_{k+1} = I - (p_k y_kᵀ + y_k p_kᵀ)/(p_kᵀ y_k) + (1 + y_kᵀ y_k/(p_kᵀ y_k)) (p_k p_kᵀ)/(p_kᵀ y_k),  (13)
with p_k = α_k d_k = x_{k+1} - x_k. In order to maintain superlinear convergence of the sequence defined by (10) and (11) when d₀ ≠ -g₀, Beale [1] proposed modifying (11) to

d_{t+1} = -g_{t+1} + β_t d_t,  (14)

d_{k+1} = -g_{k+1} + β_k d_k + μ_k d_t,  (15)

where μ_k = g_{k+1}ᵀ y_t / d_tᵀ y_t, with t < k < t + n. In particular, starting with t = 0, at
every n steps a new pair of vectors d_t and y_t, known as the Beale restart vectors, are saved, and search directions generated conjugate to both the restart vector and the previous search vector. After n steps, the restart vector is discarded and the current vector becomes the restart vector. In [9], it is shown that, similar to (12) and (13), (15) can be interpreted in matrix form as

Q_{t+1} = γ_t (I - (p_t y_tᵀ + y_t p_tᵀ)/(p_tᵀ y_t) + (y_tᵀ y_t/(p_tᵀ y_t)) (p_t p_tᵀ)/(p_tᵀ y_t)) + (p_t p_tᵀ)/(p_tᵀ y_t),  (16)

with γ_t = p_tᵀ y_t / y_tᵀ y_t, and

Q_{k+1} = Q_{t+1} - (p_k y_kᵀ Q_{t+1} + Q_{t+1} y_k p_kᵀ)/(p_kᵀ y_k) + (1 + y_kᵀ Q_{t+1} y_k/(p_kᵀ y_k)) (p_k p_kᵀ)/(p_kᵀ y_k),  (17)
where d_{k+1} is again chosen by (12). It is shown in [9] how the matrices Q_{t+1} and Q_{k+1} need never be computed, and how the modified conjugate gradient method defined by (14), (16) and (17), called a memoryless variable metric method, can be implemented using only seven vectors of storage, the same requirement as the Beale restart method defined by (14) and (15). Clearly, for the method defined by (12), (16) and (17), when k = t only the single update (16) is performed, with Q_{t+1} defined by (16). Finally, in [9], the restart vector is changed whenever k - t = n or whenever the Powell restart criterion [8] that

|g_{k+1}ᵀ g_k| ≥ 0.2 ||g_{k+1}||²  (18)

is satisfied. In [9] the superiority of the method (10), (12), (14), (16), (17) and (18) for unconstrained problems is shown, with the particular advantage that at every step the direction d_{k+1} is a descent direction even when crude line searches are used to determine α_k.

In applying conjugate gradient methods to the linearly constrained problem by the reduced gradient method described in the preceding section, the algorithm proceeds on a manifold with a reduced gradient h precisely as for the unconstrained problem until a superbasic reaches a bound or a basis change occurs. At this point, traditional conjugate gradient methods adjust the manifold accordingly, and then restart the conjugate gradient search with the negative gradient of the function in terms of the new superbasic variables at the point where the previous iteration terminated. This restart is mandated by the consideration of the need to provide a descent direction as well as to prevent linear convergence. To see this, consider first the case where an exact line search is done at each step of the conjugate gradient iteration, and the exact line minimum corresponds to a point at which a superbasic variable must be dropped, but no basis change occurs. The fact that an exact line minimum was found guarantees that d_kᵀ h_{k+1} = 0. Assuming that the
qth superbasic is dropped, the corresponding search vector produced by continuing with the conjugate gradient method would be

d_{k+1} = -P_q h_{k+1} + [(h_{k+1}^T P_q y_k)/(d_k^T P_q y_k)] P_q d_k,   (19)
where

P_q = I - e_q e_q^T,   (20)

and e_q is the unit vector with 1 as the qth element. These equations follow directly by simply dropping the qth element from h_{k+1}, d_k, and y_k, ensuring that Δx_q is 0. From (19),
h_{k+1}^T P_q d_{k+1} = -h_{k+1}^T P_q h_{k+1} + [(h_{k+1}^T P_q y_k)/(d_k^T P_q y_k)] h_{k+1}^T P_q d_k,   (21)
and while -h_{k+1}^T P_q h_{k+1} < 0, h_{k+1}^T d_k = 0 does not imply that h_{k+1}^T P_q d_k = 0, so d_{k+1} is not necessarily a descent direction. Further, in practice a bound will usually be encountered at a point where h_{k+1}^T d_k < 0, and the condition is again not satisfied. The situation when a basic and a superbasic variable are exchanged and the superbasic then dropped is even more complex, but again yields the same difficulty that a descent direction cannot be guaranteed, so a restart must be effected. While restarts pose no theoretical difficulties, in practice a great deal of time may have been spent determining a good search direction, and this must all be discarded simply because an arbitrary bound has been encountered. It is important to keep in mind the fact that conjugate gradient methods will generally only be used on problems with large numbers of variables, and if 500 iterations have been used on a 1000-variable problem to determine a good search direction, the prospect of restarting the search because a bound has been encountered is unpalatable. For this reason, the remainder of this section will be concerned with a means of preserving search direction information when a bound is encountered, while guaranteeing that a descent direction is maintained on the revised manifold. In the simple case where a superbasic variable encounters a bound, we note that the search direction d_{k+1} of (12) can be easily modified, since dropping a superbasic simply eliminates one variable. Thus we first compute Q_{k+1} by (17), then perform the equivalent of deleting a row and column from the approximate Hessian Q_{k+1}^{-1}, as Q_{k+1} is a memoryless BFGS approximation to the inverse Hessian. Dropping the qth row and column from Q_{k+1}^{-1} is equivalent to modifying Q_{k+1} to

Q̄_{k+1} = Q_{k+1} - (Q_{k+1} e_q e_q^T Q_{k+1})/(e_q^T Q_{k+1} e_q).   (22)
We also note that for P_q defined by (20),

P_q Q̄_{k+1} = Q̄_{k+1} P_q = Q̄_{k+1}.   (23)
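The claim behind (22), that the rank-one correction reproduces the inverse of the Hessian approximation with its qth row and column deleted, can be checked on a small example. A sketch in Python (the 3x3 matrix is illustrative only, not from the paper); exact rational arithmetic makes the check unambiguous:

```python
from fractions import Fraction as F

# Q is the inverse of the symmetric positive definite matrix
# A = [[4,1,0],[1,3,1],[0,1,2]] (det A = 18), so Q = adj(A)/18.
Q = [[F(5, 18), F(-2, 18), F(1, 18)],
     [F(-2, 18), F(8, 18), F(-4, 18)],
     [F(1, 18), F(-4, 18), F(11, 18)]]
q = 1  # index of the superbasic to drop (0-based)

# Qbar = Q - Q e_q e_q^T Q / (e_q^T Q e_q), the correction of (22)
denom = Q[q][q]
Qbar = [[Q[i][j] - Q[i][q] * Q[q][j] / denom for j in range(3)]
        for i in range(3)]

# The qth row and column of Qbar vanish, and the remaining block is
# the inverse of A with row/col q deleted: [[4,0],[0,2]]^{-1} = [[1/4,0],[0,1/2]].
```

Here Qbar comes out as [[1/4, 0, 0], [0, 0, 0], [0, 0, 1/2]], confirming both (22) and the projection identity (23), since the zeroed row and column make Qbar invariant under P_q.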
We now define
d_{k+1} = -Q̄_{k+1} h_{k+1}.
(24)
Since Q_{k+1} is defined by (17) and is positive definite, (23) ensures

h_{k+1}^T P_q d_{k+1} = -h_{k+1}^T P_q [Q_{k+1} - (Q_{k+1} e_q e_q^T Q_{k+1})/(e_q^T Q_{k+1} e_q)] P_q h_{k+1} <= 0,   (25)
where equality occurs if and only if P_q h_{k+1} = 0, which is the condition which occurs when convergence has been achieved on the revised manifold. Theoretically the search direction defined by (24) is simple to compute, computing -Q_{k+1}h_{k+1} by (12), (16) and (17), while simultaneously accumulating the vector Q_{k+1}e_q, again using (12), (16) and (17) but substituting e_q for h_{k+1}, then using (24) to compute d_{k+1}. However, as the matrix Q_{k+1} is never stored, two points must be kept in mind. First, the formula for Q_{k+1} given by (17) has in the denominator the term p_k^T y_k, and a necessary and sufficient condition that Q_{k+1} be positive definite is that p_k^T y_k > 0. Now
p_k^T y_k = α_k d_k^T (h_{k+1} - h_k),
(26)
so p_k^T y_k > 0 if and only if d_k^T h_{k+1} > d_k^T h_k. In practice, the line search criterion used by the program documented in the next section forces
|d_k^T h_{k+1}| / |d_k^T h_k| <= 0.9
(27)
whenever possible, assuring that p_k^T y_k in (26) is positive. If, however, a bound is encountered before a large enough step can be made to satisfy (27), it is possible that Q_{k+1} cannot be made positive definite. Thus d_{k+1} is chosen by (24) only if
d_k^T h_{k+1} >= δ d_k^T h_k,   0 <= δ < 1,
(28)
and we use δ = 0.999 in the program documented in the next section. If (28) is not satisfied, d_{k+1} must be set equal to -P_q h_{k+1}, and the method is restarted. A second point is that even if f(x) is quadratic, the transformation (24) may produce an initial direction on the revised manifold which will produce an at best linear rate of convergence to the optimum by traditional conjugate gradient methods, unless d_{k+1} = -h_{k+1} or d_{k+1} is a restart direction. Thus, whenever the transformation (24) is used, the resulting vector d_{k+1} must replace the Beale restart vector for the subsequent iterations. Thus, the Beale restart vector is updated whenever k - t = n or (18) is satisfied, or the manifold is changed. Under these conditions, it is trivial to show that if proper measures are taken to prevent cycling, the method described in this section will converge to the minimum of a quadratic programming problem in a finite number of steps in the absence of roundoff. We have so far considered only the case where a superbasic encounters a bound. To consider the case where a basic and superbasic variable are
exchanged, we first note, following Murtagh and Saunders, that if Q_{k+1} were carried explicitly, the Q_{k+1} on the new manifold would be defined by

Q̃_{k+1} = [I - (e_q v^T)/(1 + e_q^T v)] Q_{k+1} [I - (v e_q^T)/(1 + e_q^T v)],   (29)
where v is defined by

B^T π_r = e_r,   (30a)
y = S^T π_r,   (30b)
v = -(y + e_q)/(y^T e_q),   (30c)
where the rth basic variable is being exchanged with the qth superbasic. We note that once (29) has been effected, the new qth superbasic, which is at its bound, must be dropped. Thus, the resulting search vector d̃_{k+1} is defined by

d̃_{k+1} = -[Q̃_{k+1} - (Q̃_{k+1} e_q e_q^T Q̃_{k+1})/(e_q^T Q̃_{k+1} e_q)] h̃_{k+1},   (31)
where h̃_{k+1} is the reduced gradient on the new manifold, defined by

h̃_{k+1} = g_S - S^T π,   (32)
with S corresponding to the new set of superbasics. Now by (23), (29), and the fact that for any vector v

(I - e_q e_q^T)[I - (e_q v^T)/(1 + e_q^T v)] = I - e_q e_q^T,   (33)
we find by simple algebra that the search vector d̃_{k+1} on the new manifold is defined by

d̃_{k+1} = -[Q_{k+1} - (Q_{k+1} w w^T Q_{k+1})/(w^T Q_{k+1} w)] h̃_{k+1},   w = e_q - v/(1 + e_q^T v).   (34)
Thus as in (25), a new vector, in this case Q_{k+1}w, must be accumulated along with Q_{k+1}h̃_{k+1}. In addition, when (34) is used, the vectors v and h̃_{k+1} must also be stored, as h_{k+1} must be used to update Q_{k+1}, and h̃_{k+1} must be multiplied by Q_{k+1}. Thus the method requires three additional vectors beyond the seven used by the unconstrained version of this conjugate gradient method. Again, when a basis change occurs, the vector -h̃_{k+1} is used in place of d̃_{k+1} if the condition (28) is not satisfied. Summarizing the results of this section, whenever a new set of superbasics is introduced the full reduced conjugate gradient algorithm starts with a negative gradient direction at the initial point on the new manifold. Subsequent points are defined by
x_{k+1} = x_k + α_k d_k.
(35)
If no bound has been encountered,
d_{k+1} = -Q_{k+1} h_{k+1},
(36)
where Q_{k+1} is given by (17) if k ≠ t and by (16) if k = t. In both cases, y_k is the difference in reduced gradients defined by y_k = h_{k+1} - h_k. Further, t is set equal to k if k - t = n or |h_{k+1}^T h_k| >= 0.2||h_{k+1}||^2. If a bound has been encountered, d_{k+1} is given by (24) or (34), according to whether a superbasic variable reached its bound or a basis change occurred, and t is set equal to k. The search continues until ||h_{k+1}|| is appropriately small, at which time the minimum on the manifold is considered to have been found. A full discussion of convergence is included in the next section. As another note on the section, it should be noted that Benveniste [2] has, in a different context, examined transformations of a matrix corresponding to those in (24) and (31). However, Benveniste's transformations were done on the reduced Hessian, and a stored set of past search vectors was then reorthogonalized in the metric determined by the new reduced Hessian in order to preserve direction information for a new Fletcher-Reeves conjugate gradient step. The above method applies the transformation directly to the new conjugate gradient step, and needs to store no past history, which is the purpose of a conjugate gradient method, for if past history could be stored, a variable metric method could be used. As a final note on this section, it is worthwhile to examine the convergence properties of the algorithm defined in this section. The global convergence properties of the algorithm defined by (16) and (17) have been demonstrated in [10]. As the proofs of [10] require nothing more than an arbitrary descent direction for each Beale restart direction, they apply directly to the method of this section and need not be reproduced here.
Also, as no new variables are ever added on a manifold, if s is the number of superbasics at the beginning of a manifold, at most s transformations defined by (24) or (34) can take place on a manifold, as, independent of whether (24) or (34) is used, in each case a superbasic variable is dropped. Thus no jamming can occur on a single manifold, and the rate of convergence on any manifold must eventually be that of the unconstrained conjugate gradient method defined by (16) and (17). Thus the only theoretical difficulty with convergence lies in jamming caused by continually dropping a variable on a manifold and restoring it on the next manifold. As is well known, this can be prevented by requiring exact convergence on each manifold. However, as the results of the next section show, this appears generally to be computationally inefficient. It is for this reason, however, that users of the system must be wary if they choose a loose convergence criterion on all but the expected final manifold.
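The per-iteration choice between keeping the transformed direction and restarting hinges on test (28) with δ = 0.999. A minimal sketch (the helper name is hypothetical, not the actual COSMOS code):

```python
DELTA = 0.999  # the delta of condition (28) used in the program

def accept_transformed_direction(d_k, h_k, h_k1, delta=DELTA):
    """Condition (28): keep the transformed direction only if
    d_k^T h_{k+1} >= delta * d_k^T h_k, which in turn guarantees
    p_k^T y_k > 0 and hence a positive definite memoryless update;
    otherwise the method must be restarted."""
    dot = lambda u, v: sum(a * b for a, b in zip(u, v))
    return dot(d_k, h_k1) >= delta * dot(d_k, h_k)

# A step cut short by a bound may leave d_k^T h_{k+1} barely above or
# below delta * d_k^T h_k (both products are negative for a descent step):
d_k = [1.0, -2.0]
h_k = [-3.0, 1.0]    # d_k^T h_k = -5
h_ok = [-1.0, 1.0]   # d_k^T h_{k+1} = -3 >= 0.999*(-5): accept (24)/(34)
h_bad = [-3.0, 1.0]  # d_k^T h_{k+1} = -5 <  0.999*(-5): restart
```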
3. The COSMOS program and numerical results
COSMOS, a program to minimize a nonlinear function subject to linear constraints using conjugate gradient methods, has been coded utilizing the ideas of the preceding sections. (COSMOS stands for COnstrained Shanno-Marsten Optimization System.) The program is built upon the XMP linear programming code of Marsten [6], and takes full advantage of sparsity in the constraint matrix and of variables which appear only linearly. COSMOS and XMP are written entirely in 1966 Standard FORTRAN. Basically, the program uses a phase I algorithm to find a basic feasible solution to the constraint set. The user may supply an initial feasible point if he so desires, but in all the numerical results documented in this section, the first feasible point was found by the phase I algorithm. A first set of superbasics is then chosen, and minimization on the first manifold completed. The vector λ given by (9) is then computed, and if possible, new superbasics are introduced. This procedure continues until no new superbasics can be introduced. A test is performed to see if ||h|| < ε||π||. If this is true, then we are done. Otherwise we continue minimizing on the current manifold until it is true. Then λ is tested again to see if any of the non-basic variables should be made superbasic. The process continues until no new superbasics can be introduced and the reduced gradient is sufficiently small. Murtagh and Saunders suggest a dual convergence test on a manifold, stopping if either

|Δf| <= ε_f (1 + |f|)   (37)

and

||Δx_S|| <= ε_x (1 + ||x_S||),   (38)

or

||h|| <= η_g ||h̄||,   (39)

with final convergence when

||h|| <= ε ||π||,   (40)
where h̄ is the initial reduced gradient on the manifold and ε_f, ε_x, η_g, and ε are appropriately chosen scalars. The underlying idea is that by making ε_f, ε_x and η_g appropriately large, little time is wasted minimizing f to great precision on a manifold which is not the final manifold. Rather, the binding constraints are first identified, and then a more careful minimization done on the final manifold to ensure that (40) is satisfied. The numerical results we have obtained indicate that this strategy appears not to hurt, and often greatly to accelerate, the rate of convergence. Another question is that of how many superbasic candidates to allow to become superbasic at any given time. Several strategies were tested, and from our limited computational experience, it appears that the strategy of allowing every nonbasic with an appropriate Lagrange multiplier to become superbasic
does not hurt, and can definitely help, the speed of the algorithm. Certainly, initially one wishes to choose the largest superbasic set possible, for otherwise one can be solving what is equivalent to an unconstrained problem in say 1000 variables by first solving in one dimension, then two, and continuing to 1000. This is known to be an extremely poor strategy for unconstrained optimization. Once the function has been minimized on the initial manifold, introducing several superbasic variables at a time should not hurt, as in any case the initial direction after adding the new superbasics is a negative gradient direction. The numerical results of this section appear to verify this strategy. In order to test the COSMOS conjugate gradient method, three strictly nonlinear problems and one basically linear problem with a few quadratic terms were tried. The nonlinear problems are all in Himmelblau [5]. They are J.M. Gauthier's IBM France problem (Himmelblau 19) with 16 nonlinear variables and 8 equality constraints (FRENCH); Bracken and McCormick's weapon assignment problem (Himmelblau 23) with 100 nonlinear variables and 12 constraints (WEAPONS); and Jones' chemical equilibrium problem (Himmelblau 6) with 45 nonlinear variables and 16 constraints (CHEMICAL). We note here that WEAPONS is the only problem on which Murtagh and Saunders published test results for conjugate gradient methods. The quadratic problem had 78 variables, of which 5 were quadratic and 73 linear, and 39 constraints (QUADRATIC). It was used solely to assure the efficiency of COSMOS on nearly linear problems. The authors also note that this is a meager set of test problems, but it comprises virtually all relevant test problems of more than a few variables which appear to exist in the literature. Table 1 contains a comparison of various candidate introduction strategies and manifold convergence strategies for WEAPONS. Here a strategy of r, s means that up to r appropriate variables were allowed to become superbasic initially, and up to s new superbasics could be introduced on any subsequent manifold. ITER is the total number of conjugate gradient

Table 1

Convergence                Strategy   ITER   IFUN    CPU      No. of Manifolds
ε_m = 0.00001, η_g = 0.5   n,n         279    534     3.816    5
                           n,5         272    515     3.688    6
                           n,1         477    931     6.286   12
                           5,5         458    901     5.283   12
                           5,1        1043   2081    12.074   37
ε_m = 0.05, η_g = 0.5      n,n         170    310     2.584    6
                           n,5         172    317     2.657    7
                           n,1         178    336     2.738    9
                           5,5         169    324     2.216   12
                           5,1         257    511     3.642   38
directions chosen, while IFUN is the number of function and gradient calculations made. The CPU time is in seconds on a CDC CYBER 175 computer (using the FTN compiler with OPT = 2) in single precision arithmetic, while linear searches are done as in CONMIN [10]. Here ε_f = ε_x = ε_m is used to test for convergence on a manifold. As the table shows, at least for this problem, the loose convergence criterion is clearly preferable. In both cases, final convergence was achieved when ||h|| <= 10^{-5}||π||. The best test results for similar, but not identical, manifold convergence criteria reported by Murtagh and Saunders for traditional conjugate gradient methods are 508 iterations and 1461 function evaluations for fairly strict minimization on each manifold, and 228 iterations and 665 function evaluations for the looser convergence criterion, showing a definitive computational advantage for the conjugate gradient method of this paper. It is clear that for any convergence criterion it is imperative to introduce as many superbasics as possible initially. It also appears not to hurt, and with strict convergence to help, to allow as many as possible to enter at any time. Table 2 examines the effect of the manifold tolerance ε_m on convergence rates with strategy n,n on all three nonlinear problems. Here in CHEMICAL, a lower bound of 10^{-4} is placed on all variables (see Murtagh and Saunders for a full discussion). The results in general bear out the fact that looser minimization on preliminary manifolds can improve the performance of the algorithm. However, for a badly nonlinear problem such as CHEMICAL, conjugate gradient methods remain very sensitive to all convergence parameters, and must still be carefully tuned for specific problems.

Table 2

Problem and ε_m    ITER   IFUN   CPU     No. of Manifolds
CHEMICAL
  10^{-5}           361    772*   2.313    4
  10^{-3}           294    587    1.985   13
  10^{-2}           358    712    2.276   14
  0.05              294    568    1.994   13
WEAPONS
  10^{-5}           279    534    3.816    5
  10^{-3}           195    363    2.823    5
  10^{-2}           162    292    2.492    5
  0.05              170    310    2.584    6
FRENCH
  10^{-5}            26     45    0.273    3
  10^{-3}            28     49    0.285    4
  10^{-2}            27     46    0.287    4
  0.05               26     44    0.283    4

* The correct optimum was found, but the problem exited due to an abnormally small step size.
Table 3

Problem                   ITER   IFUN   CPU     No. of Manifolds
CHEMICAL   Transformed     294    568   1.994   13
           Restarted       274    513   1.807   11
FRENCH     Transformed      26     44   0.283    4
           Restarted        34     53   0.325    5
WEAPONS    Transformed     170    310   2.584    6
           Restarted       203    387   3.180    6
QUADRATIC  Transformed      23     28   0.494    3
           Restarted        23     30   0.503    3
Also, small, relatively easy problems such as FRENCH are relatively insensitive to such tuning. Finally, in order to test the computational efficiency of retaining search direction information via the transformations (24) and (34), all four problems were run using the looser convergence criterion with ε_m = 0.05 and η_g = 0.5 and the strategy n,n, both with and without the transformations (24) and (34). The results in Table 3 strongly support using the transformations to preserve previous direction information, especially in view of the nature of the CHEMICAL problem, as described below. In conclusion, while the set of test problems is limited, it appears useful to preserve direction information, and this should be confirmed as computational experience increases. The CHEMICAL problem, for which restarting is better in Table 3, is very unstable. For ε_m = 10^{-3}, restarting yields 361 iterations, 709 function evaluations and 6 manifolds, as compared to 294, 587 and 13 respectively for transforming (Table 2). The evidence for introducing as many superbasics as possible initially is very strong, while strategies for the subsequent introduction of superbasics remain for further testing. Also, looser convergence criteria on preliminary manifolds appear desirable, but must, as Murtagh and Saunders noted, be used with caution.
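The stopping rules used throughout these experiments, the manifold tests (37)-(39) and the final test (40), can be sketched as follows. The helper names are hypothetical, and the reading that takes (37) and (38) jointly, with (39) as an alternative, is an assumption:

```python
def manifold_converged(delta_f, f, delta_xs, xs_norm, h_norm, h0_norm,
                       eps_f, eps_x, eta_g):
    """Stop minimising on the current manifold if the change in f and
    in the superbasics is small, tests (37) and (38), or if the reduced
    gradient has shrunk enough relative to its initial value on the
    manifold, test (39)."""
    small_step = (abs(delta_f) <= eps_f * (1.0 + abs(f))
                  and delta_xs <= eps_x * (1.0 + xs_norm))
    small_grad = h_norm <= eta_g * h0_norm
    return small_step or small_grad

def final_converged(h_norm, pi_norm, eps):
    """Overall stopping test (40): ||h|| <= eps * ||pi||."""
    return h_norm <= eps * pi_norm
```

A loose ε_m (= ε_f = ε_x) makes the step test easy to pass on preliminary manifolds, which is exactly the strategy whose effect Tables 1 and 2 measure.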
References

[1] E.M.L. Beale, "A derivation of conjugate gradients", in: F.A. Lootsma, ed., Numerical methods for nonlinear optimization (Academic Press, London, 1971) pp. 39-43.
[2] R. Benveniste, "A quadratic programming algorithm using conjugate search directions", Mathematical Programming 16 (1979) 63-80.
[3] R. Fletcher and C.M. Reeves, "Function minimization by conjugate gradients", Computer Journal 7 (1964) 149-154.
[4] M.R. Hestenes and E. Stiefel, "Methods of conjugate gradients for solving linear systems", Journal of Research of the National Bureau of Standards 49 (1952) 409-436.
[5] D.M. Himmelblau, Applied nonlinear programming (McGraw-Hill, New York, 1972) pp. 393-431.
[6] R.E. Marsten, "The design of the XMP linear programming library", ACM Transactions on Mathematical Software, to appear.
[7] B.A. Murtagh and M.A. Saunders, "Large scale linearly constrained optimization", Mathematical Programming 14 (1978) 41-72.
[8] M.J.D. Powell, "Restart procedures for the conjugate gradient method", Mathematical Programming 12 (1977) 241-254.
[9] D.F. Shanno, "Conjugate gradient methods with inexact searches", Mathematics of Operations Research 3 (1978) 244-256.
[10] D.F. Shanno, "On the convergence of a new conjugate gradient algorithm", SIAM Journal on Numerical Analysis 15 (1978) 1247-1257.
[11] D.F. Shanno and K.H. Phua, "A variable metric subroutine for unconstrained nonlinear minimization", ACM Transactions on Mathematical Software 6 (1980) 618-622.
Mathematical Programming Study 16 (1982) 162-189 North-Holland Publishing Company
ASYMPTOTIC PROPERTIES OF REDUCTION METHODS APPLYING LINEARLY EQUALITY CONSTRAINED REDUCED PROBLEMS

G. VAN DER HOEK

Econometric Institute, Erasmus University, Rotterdam, The Netherlands

Received 29 January 1980
Revised manuscript received 8 June 1981
This paper deals with the solution of constrained nonlinear programming problems by means of the solution of a sequence of linearly constrained reduced problems. Convergence theorems are derived that describe the convergence of the sequence of first-order Kuhn-Tucker points of the reduced problems to a first-order Kuhn-Tucker point of the original problem. The convergence properties of simplified reduced problems using only the currently active constraints are investigated. A preceding phase I (to obtain a good starting point) is discussed in conjunction with strategies to recognise the final active set as soon as possible. Finally, the convergence of the resulting 2-phase algorithm is proved.
Key words: Constrained Nonlinear Programming, Reduction Methods, Sequences of Kuhn-Tucker Points, Convergence, Rate of Convergence, Active Set of Constraints.
1. Introduction
This paper concerns the general nonlinear programming problem

min  F(x),
s.t.  c_i(x) {>=, =} 0,   i = 1, 2, ..., m.   (1)
The problem functions F(x) and -c_i(x), i = 1, ..., m, are assumed to be sufficiently differentiable convex real functions on E^n. Our attention will be focused on reduction methods that are based on linearisation of the constraints. The idea of replacing the minimisation of a constrained nonlinear programming problem by sequentially minimising local linearisations of (1) is not new. One of the first successful implementations of such a reduction method is the Method of Approximation Programming (MAP) of Griffith and Stewart [5]. It attempts to solve (1) by solving the following sequence of problems:

min  LF(x_k, x),
s.t.  Lc_i(x_k, x) {>=, =} 0,   i = 1, ..., m.   (2)
The functions LF(x_k, x) and Lc_i(x_k, x), i = 1, ..., m, are the linearisations of F(x) and c_i(x), respectively, in a neighbourhood of x_k.
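The linearisations LF(x_k, x) and Lc_i(x_k, x) are first-order Taylor expansions around x_k. A minimal sketch (the constraint and hand-coded gradient are hypothetical, not from the paper):

```python
def linearise(c, grad_c, xk):
    """Return Lc(xk, .) : x -> c(xk) + grad_c(xk)^T (x - xk),
    the first-order Taylor expansion of c around xk."""
    cxk, gxk = c(xk), grad_c(xk)
    def Lc(x):
        return cxk + sum(g * (xi - xki) for g, xi, xki in zip(gxk, x, xk))
    return Lc

# Example: c(x) = 1 - x1^2 - x2^2, linearised at xk = (1, 0):
c = lambda x: 1.0 - x[0] ** 2 - x[1] ** 2
grad_c = lambda x: [-2.0 * x[0], -2.0 * x[1]]
Lc = linearise(c, grad_c, (1.0, 0.0))
# Lc(x) = 0 - 2(x1 - 1), i.e. the half-space x1 <= 1 replaces the disc.
```

Each major iteration of MAP rebuilds these affine functions at the current point x_k, which is why their validity, and hence the step size, is only local.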
G. van der Hoek/ Asymptotic properties of reduction methods
163
A considerable improvement of the above algorithm is Wilson's method [28], in which a quadratic approximation of the Lagrangian function of (1) is minimised subject to the local linearisations of the constraints. We use the term 'linearisation method' to indicate that linearised problem functions occur in the formulation of the subproblem. The main advantages and drawbacks of these methods to solve (1) are:
(i) Linearisation methods are relatively simple to describe and to implement.
(ii) If the next iteration point x_{k+1} happens to be infeasible with respect to the constraints of (1), an intermediate step is required to move back to the feasible region if the algorithm is a feasible-point method. This may give rise to slow convergence.
(iii) In almost all suggested linearisation methods all constraints are linearised at every step, and no use is made of information on the status of constraints (active, passive) gathered in the course of the iteration process used to solve the subproblem.
(iv) The linearisations of nonlinear problem functions are acceptable approximations only in a neighbourhood of x_k. This makes stepsize limitations inevitable.
(v) A consequence of linearisation is also that poor search directions may be generated (e.g., not sufficiently downhill).
(vi) The local validity of the linearisations prohibits the application of extrapolation techniques to accelerate convergence, except near the solution.
(vii) Wilson's method requires the expensive calculation of second order derivatives.
During the last two decades alternative, more sophisticated reduction methods based on linearisations have been proposed which are designed to avoid the drawbacks mentioned above. We mention, for example, [19, 20, 18, 24, 6, 7, 23, 1, 27, 26]. In those cases the reduced problem is defined by:

min  F(x) + φ(x_k, x),
s.t.  Lc_i(x_k, x) {>=, =} 0,   i = 1, 2, ..., m.   (3)
In (3) the objective function is corrected by a term φ(x_k, x); this term is intended to offset, by means of a corrected objective function in the reduced problem, possible poor behaviour of the algorithm caused by the linearisations. Therefore φ(x_k, x) will generally depend on c_i(x) and/or on Lc_i(x_k, x), (i = 1, 2, ..., m), and different reduction methods arise from different choices of φ(x_k, x). For instance, Kreuser [10] and Rosen and Kreuser [24] use

φ(x_k, x) = Σ_{λ_i < 0} λ_i(x_k) c_i(x),   (4)

where λ_i(x_k), i = 1, ..., m, are the current Lagrange multiplier estimates. Here φ(x_k, x) can be viewed as a linear penalty term or as a restricted Lagrangian
function. In [25] this reduction method is further simplified by merely linearising the constraints of the current active set of constraints I(x_k) (this active set usually consists of at least all equality constraints and all inequality constraints with positive estimated Lagrange multiplier). Robinson [18] proposes the use of

φ(x_k, x) = Σ_{i=1}^{m} λ_i(x_k) [Lc_i(x_k, x) - c_i(x)].   (5)
Rosen [23], Bräuniger [3], Ballintijn, Van der Hoek and Hooykaas [1] apply modifications of (5) in their definitions of the reduced problem. We shall want to take advantage of the possible presence of linear constraints in the original problem, and we shall want to distinguish equality constraints from inequality constraints. Thus we formulate problem (1) in another way: renumber the constraints in such a way that the indices i = 1, ..., m_1 correspond with linear equality constraints and i = m_1 + 1, ..., m_2 with linear inequality constraints. Let L ⊂ E^n be the collection of all x ∈ E^n satisfying the linear constraints: a_i^T x {=, >=} b_i, i = 1, ..., m_2, with a_i ∈ E^n, b_i ∈ R. Further, using i = m_2 + 1, ..., m_3 and i = m_3 + 1, ..., m_4 for the nonlinear equality and inequality constraints respectively, we denote by NL the collection of all x ∈ E^n that satisfy the nonlinear constraints. Then (1) can be stated as

min_{x ∈ L ∩ NL}  F(x).   (6)
Finally, the collection of all linearised nonlinear constraints, linearised around x_k, is denoted by LNL(x_k). Then the reduced problem corresponding to (5) is

min_{x ∈ L ∩ LNL(x_k)}  F(x) + Σ_{i=m_2+1}^{m_4} λ_i(x_k) [Lc_i(x_k, x) - c_i(x)].   (7)
Clearly the linear constraints (the indices i = 1, ..., m_2) do not contribute to the objective function of the reduced problem. One of the algorithms proposed in this paper linearises merely the constraints in the current active set I(x_k). Then the reduced problem becomes:

min_{x ∈ L ∩ LNL(x_k)}  F(x) + Σ_{i ∈ I(x_k)} λ_i(x_k) [Lc_i(x_k, x) - c_i(x)].   (8)
It is obvious that the active set strategy must define I(x_k) in such a way that the indices of all equality constraints belong to the active set. In comparison with existing reduction methods, we have investigated and implemented the following new aspects:
(i) The reduced problem is defined solely in terms of the objective function and the constraints of the current active set;
(ii) The (non)linearity of constraints is used explicitly;
(iii) The coupling of the applied so-called phase I, designed to provide a good starting point, and phase II (the algorithm to be developed in this paper) is discussed. Suggestions are made to obtain a good starting point, as well as a good initial set of active constraints.
(iv) Theoretical results on the convergence and the rate of convergence of the algorithm are presented.
2. Definition and solution of linearly constrained reduced problems
A general form for a sequence of linearly constrained reduced problems was given in (3). Once the idea of linearisation of the constraints is accepted, the reduction method is characterised by the particular choice of φ(x_k, x). For instance, Kreuser [10] and Rosen and Kreuser [24] considered the linear penalty term (4) for φ(x_k, x). Their reduction method uses an analogous function φ(x_k, x) as Kelly and Speyer [9] used to improve Rosen [20]. Lill [11] also applies a similar function φ(x_k, x). In (3) and (4) the linearisations of all nonlinear constraints around x_k are included. In a computational study [25] we investigated an implementation of this reduction method in which only the constraints of the current active set I(x_k) contribute to the objective function of the reduced problem, and only those constraints are linearised. If we compare the reduction methods based on the application of (3) with the MAP method of Griffith and Stewart, we see that φ(x_k, x) should compensate in the objective function of the subproblem for the effect of linearising the constraints. Beale [2] summarises the geometrical background as: "the constraints are straightened out at the expense of the contours of constant value of the objective function". Rosen [21] showed that this geometrical transformation can be obtained algebraically using the shadow prices of the constraints. In that way the nonlinearities in the constraints are 'thrown' into the objective function. By means of a counterexample in [23] he showed that the trivial case φ(x_k, x) = 0 will not, in general, solve the standard convex nonlinear programming problem. A comparison of the first-order Kuhn-Tucker conditions of (1) and (3) suggests that the functions φ(x_k, x) should have the properties

φ(x_k, x_k) = 0   and   ∇_x φ(x_k, x)|_{x=x_k} = 0.   (9)

The reduced problems (7) and (8) possess φ-functions which meet these requirements, while (4) does not! To summarise, we present a stepwise description of reduction methods based on the application of (3).
Step 1. Set k := 0. Initialise variables; if x_0 is not optimal, go to Step 2.
Step 2. Given x_k, find a first-order Kuhn-Tucker point x_{k+1} with Lagrange multipliers λ_{k+1} of

min_{x ∈ L ∩ LNL(x_k)}  F(x) + φ(x_k, x).   (10)

If x_{k+1} is not unique, choose the Kuhn-Tucker point which is closest to the preceding Kuhn-Tucker point x_k.
Step 3. Apply convergence tests. If x_{k+1} is not the solution, set k := k + 1 and go to Step 2.
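The requirements (9) behind Step 2's subproblem are easy to verify numerically for Robinson's correction (5). A sketch for a single hypothetical constraint, checking φ(x_k, x_k) = 0 and a vanishing derivative at x_k via central differences:

```python
def phi(xk, x, lam, c, grad_c):
    """Robinson's correction (5) for one constraint:
    phi(xk, x) = lam * (Lc(xk, x) - c(x)), where Lc is the
    linearisation of c around xk."""
    Lc = c(xk) + grad_c(xk) * (x - xk)
    return lam * (Lc - c(x))

c = lambda x: 1.0 - x * x       # hypothetical concave constraint function
grad_c = lambda x: -2.0 * x
xk, lam = 0.5, 1.3              # hypothetical iterate and multiplier estimate

# phi(xk, xk) = 0 exactly, and the derivative at x = xk vanishes too,
# since d/dx [Lc(xk, x) - c(x)] = grad_c(xk) - grad_c(x) -> 0 at x = xk.
eps = 1e-6
deriv = (phi(xk, xk + eps, lam, c, grad_c)
         - phi(xk, xk - eps, lam, c, grad_c)) / (2 * eps)
```

For this c, φ(x_k, x) = 1.3 (x - 0.5)^2, so both properties of (9) hold exactly, which is what makes the first-order Kuhn-Tucker conditions of (3) match those of (1) at x_k.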
This is illustrated by the following example: L := {x ]~-<x-< 10} and N L := {x I c ( x ) = 1 - x 2 -> 0}. Then L n N L = 0: the original problem is infeasible. The choice Xk = *2 leads to L n L N L ( ~ = fl (an infeasible reduced problem), whereas Xk = 4 leads to L n L N L (4) = {x [ ~-----x -< ~} (a feasible reduced problem). (7) As a consequence of Remarks 2.1 (5) and (6), we shall require that xk+~~ L N LNL(xk) for our algorithm. Since infeasible points might be generated during the iteration process, some restoration procedure should be available to move back to the feasible region.
G. van der Hoek/ Asymptotic properties of reduction methods
167
The procedure implemented is described in [26]. It starts by transforming an infeasible point $x_k$ into a point which satisfies the linear equality constraints. Then an auxiliary problem is solved in which the magnitude of the violations is decreased, subject to the condition that no further constraints are violated. This feasibility step thus provides a feasible point by applying the same linearly constrained algorithm that also solves the reduced problems. If this feasibility step does not succeed in finding a feasible point, the problem is considered to be infeasible.
(8) In order to apply the convergence results, it is necessary that the starting point should be 'close enough' to the solution. That is why an initialising procedure, the so-called 'phase I', is incorporated. Phase I generates an acceptable starting point. It amounts to solving the problem
$$\min_{x \in L} \; F(x) + t_k \Big[ \sum_{i=m_2+1}^{m_3} [c_i(x)]^2 + \sum_{i=m_3+1}^{m} [\min(0, c_i(x))]^2 \Big]. \qquad (11)$$
In this definition $t_k > 0$ is a penalty parameter whose choice will be discussed in Section 7, while $c_i(x)$ has the usual meaning. The solution of the phase I subproblem defined by (11) requires an algorithm for linearly constrained nonlinear programming, which is also used in phase II of the algorithm.
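As a rough illustration of how the phase I objective (11) is assembled, the following sketch builds a quadratic penalty function from the constraint residuals. The helper names and the toy problem are our own assumptions, and the linear constraint set $L$ is ignored here for simplicity.

```python
# Phase I penalty term of (11): squared violations of equality
# constraints plus squared violations of inequality constraints
# c_i(x) >= 0, weighted by the penalty parameter t_k.

def phase1_objective(F, eq_constraints, ineq_constraints, tk):
    def penalised(x):
        viol = sum(ci(x) ** 2 for ci in eq_constraints)
        viol += sum(min(0.0, ci(x)) ** 2 for ci in ineq_constraints)
        return F(x) + tk * viol
    return penalised

# toy data: F(x) = x^2, one equality x - 1 = 0, one inequality x >= 0
P = phase1_objective(lambda x: x * x,
                     [lambda x: x - 1.0],
                     [lambda x: x],
                     tk=10.0)
print(P(1.0))   # feasible point: objective only -> 1.0
print(P(-1.0))  # infeasible: 1 + 10*(4 + 1) -> 51.0
```

Minimising such a penalised objective over $L$ with a linearly constrained solver is exactly the structure of the phase I subproblem described above.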
3. Relations between the first-order Kuhn-Tucker conditions of the original and the reduced problems

Since the properties to be discussed in this section depend only on whether a constraint is an equality or an inequality, and not on whether it is linear or not, we return to the original problem formulation (1), in which the constraints are numbered such that the indices $i = 1, \ldots, p$ correspond with the inequality constraints and $i = p+1, \ldots, m$ correspond with the equality constraints. Thus the problem formulation is

$$\min F(x) \quad \text{s.t.} \quad c_i(x) \ge 0, \; i = 1, \ldots, p, \qquad c_i(x) = 0, \; i = p+1, \ldots, m. \qquad (12)$$
The Lagrangian function associated with problem (12) is

$$\mathcal{L}(x, u, v) = F(x) - \sum_{i=1}^{p} u_i c_i(x) - \sum_{i=p+1}^{m} v_i c_i(x), \qquad (13)$$
where $u_i$, $i = 1, \ldots, p$, and $v_i$, $i = p+1, \ldots, m$, denote the Lagrange multipliers of the inequality and equality constraints respectively. The first-order Kuhn-Tucker conditions for problem (12) are:
$$\nabla_x \mathcal{L}(x, u, v) = 0, \qquad (14)$$
$$u_i c_i(x) = 0, \quad i = 1, \ldots, p, \qquad (15)$$
$$c_i(x) = 0, \quad i = p+1, \ldots, m, \qquad (16)$$
$$c_i(x) \ge 0, \quad i = 1, \ldots, p, \qquad (17)$$
$$u_i \ge 0, \quad i = 1, \ldots, p. \qquad (18)$$
We shall denote the first-order Kuhn-Tucker points of (12) by $z = (x, u, v) \in E^{n+m}$ and the $k$th approximation by $z_k = (x_k, u_k, v_k) \in E^{n+m}$. In this notation the conditions (14)-(16) can be studied in terms of a mapping $f: E^{n+m} \to E^{n+m}$ given by the following

Definition 3.1.

$$f(z) := \begin{pmatrix} \nabla_x F(x) - \sum_{i=1}^{p} u_i \nabla_x c_i(x) - \sum_{i=p+1}^{m} v_i \nabla_x c_i(x) \\ u_1 c_1(x) \\ \vdots \\ u_p c_p(x) \\ c_{p+1}(x) \\ \vdots \\ c_m(x) \end{pmatrix} \qquad (19)$$

The following lemma is clear from the definition of $f(z)$.

Lemma 3.2. $z \in E^{n+m}$ is a first-order Kuhn-Tucker point of (12) if and only if $f(z) = 0$ and (17) and (18) are satisfied.

It was pointed out in [13] that $\nabla_z f(z_k)$ is nonsingular if $z_k$ satisfies the second-order sufficiency conditions of problem (12) with strict complementary slackness in (15). If we state the $k$th reduced problem in a formulation analogous to the original problem, we get

$$\min F(x) + \phi(x_k, x) \quad \text{s.t.} \quad Lc_i(x_k, x) \ge 0, \; i = 1, \ldots, p, \qquad Lc_i(x_k, x) = 0, \; i = p+1, \ldots, m, \qquad (20)$$
with the additional requirements (see (9)) that $\phi(x_k, x) = 0$ and $\nabla_x \phi(x_k, x) = 0$ at $x = x_k$. The Lagrangian function associated with (20) is

$$\mathcal{L}'(x, \mu, \nu) = F(x) + \phi(x_k, x) - \sum_{i=1}^{p} \mu_{k,i} Lc_i(x_k, x) - \sum_{i=p+1}^{m} \nu_{k,i} Lc_i(x_k, x), \qquad (21)$$

with $\mu_{k,i}$, $i = 1, \ldots, p$, and $\nu_{k,i}$, $i = p+1, \ldots, m$, as Lagrange multipliers for the inequality and the equality constraints respectively.
The first-order Kuhn-Tucker conditions for (20) are

$$\nabla_x \mathcal{L}'(x, \mu, \nu) = 0, \qquad (22)$$
$$\mu_{k,i} Lc_i(x_k, x) = 0, \quad i = 1, \ldots, p, \qquad (23)$$
$$Lc_i(x_k, x) = 0, \quad i = p+1, \ldots, m, \qquad (24)$$
$$Lc_i(x_k, x) \ge 0, \quad i = 1, \ldots, p, \qquad (25)$$
$$\mu_{k,i} \ge 0, \quad i = 1, \ldots, p. \qquad (26)$$
Note that in the reduced problem (20) all nonlinear constraints are linearised. Since it is our intention to linearise only the nonlinear constraints of the current active set, we shall pay only limited attention to the relations between the Kuhn-Tucker points of (12) and (20). Therefore we merely mention the following proposition.
Proposition 3.3. If $z_k = (x_k, u_k, v_k)$ is a regular Kuhn-Tucker point of (12) and if strict complementary slackness holds in both (15) and (23), then $z_k$ is also a regular Kuhn-Tucker point of (20).
Proof. $\nabla_x \mathcal{L}'(x, \mu, \nu) = \nabla_x F(x) + \nabla_x \phi(x_k, x) - \sum_{i=1}^{p} \mu_{k,i} \nabla_x Lc_i(x_k, x) - \sum_{i=p+1}^{m} \nu_{k,i} \nabla_x Lc_i(x_k, x)$. The requirement (9) gives $\nabla_x \phi(x_k, x)(x_k) = 0$ and $\nabla_x Lc_i(x_k, x) = \nabla_x c_i(x_k)$. Hence it follows that

$$\nabla_x \mathcal{L}'(x_k, \mu, \nu) = \nabla_x F(x_k) - \sum_{i=1}^{p} \mu_{k,i} \nabla_x c_i(x_k) - \sum_{i=p+1}^{m} \nu_{k,i} \nabla_x c_i(x_k) = \nabla_x \mathcal{L}(x_k, \mu, \nu).$$

Since $z_k$ is a regular Kuhn-Tucker point of (12) under the strict complementarity assumption, we know that the Lagrange multipliers $u_k, v_k$ are uniquely determined by the two conditions $\nabla_x \mathcal{L}(x_k, u, v) = 0$ and (15), which means that $\nabla_x \mathcal{L}'(x_k, u_k, v_k) = 0$; the validity of the remaining conditions (23)-(26) follows directly from (15)-(18), while the regularity of $z_k$ with respect to (20) is a consequence of its regularity with respect to (12). For the special case in which $\phi(x_k, x)$ is defined by (5), this proposition extends to the following theorem.
Theorem 3.4 (Robinson [18]). Let all problem functions be differentiable. Then the following statements concerning a given point $(x^*, u^*, v^*)$ are equivalent:
(i) There exist $u \in R^p$, $v \in R^{m-p}$ such that $(x^*, u^*, v^*)$ satisfies the Kuhn-Tucker conditions for (20) with $x_k = x^*$.
(ii) $(x^*, u^*, v^*)$ satisfies the Kuhn-Tucker conditions for (12).
(iii) For every $u \in R^p$ and every $v \in R^{m-p}$, $(x^*, u^*, v^*)$ satisfies the Kuhn-Tucker conditions for (20) with $x_k = x^*$.
The proof of this theorem is clear from the definition of the Kuhn-Tucker conditions above. Proposition 3.3 and Theorem 3.4 mean that as soon as the primal variables $x_k$ of a Kuhn-Tucker point of (12) are identified, there exist dual variables $u_k, v_k$ such that $z_k$ solves the next reduced problem, independent of the correctness of the dual variables applied in the definition of the reduced problem. Just as in problem (12), the first-order Kuhn-Tucker conditions of (20) can be described in terms of a mapping $d(z_k, z)$ from $E^{n+m}$ into $E^{n+m}$. For an arbitrary $z_k = (x_k, u_k, v_k)$ this mapping is defined as follows for $z = (x, u, v)$.

Definition 3.5.

$$d(z_k, z) := \begin{pmatrix} \nabla_x \mathcal{L}'(x, \mu, \nu) \\ \mu_{k,1} Lc_1(x_k, x) \\ \vdots \\ \mu_{k,p} Lc_p(x_k, x) \\ Lc_{p+1}(x_k, x) \\ \vdots \\ Lc_m(x_k, x) \end{pmatrix} = \begin{pmatrix} \nabla_x F(x) + \nabla_x \phi(x_k, x) - \sum_{i=1}^{p} \mu_{k,i} \nabla_x Lc_i(x_k, x) - \sum_{i=p+1}^{m} \nu_{k,i} \nabla_x Lc_i(x_k, x) \\ \mu_{k,1} Lc_1(x_k, x) \\ \vdots \\ \mu_{k,p} Lc_p(x_k, x) \\ Lc_{p+1}(x_k, x) \\ \vdots \\ Lc_m(x_k, x) \end{pmatrix} \qquad (27)$$

Comparison with (19) shows that

$$d(z_k, z) = f(z) - \psi(z_k, z), \qquad (28)$$

where $\psi(z_k, z)$ follows from the equation above. In the case of Robinson's reduction method (see (7)) this gives rise to

$$d(z_k, z) = f(z) - \begin{pmatrix} -\nabla_x \phi(x_k, x) - \sum_{i=1}^{p} u_i \nabla_x c_i(x) + \sum_{i=1}^{p} \mu_{k,i} \nabla_x c_i(x_k) - \sum_{i=p+1}^{m} v_i \nabla_x c_i(x) + \sum_{i=p+1}^{m} \nu_{k,i} \nabla_x c_i(x_k) \\ u_1 c_1(x) - \mu_{k,1} Lc_1(x_k, x) \\ \vdots \\ u_p c_p(x) - \mu_{k,p} Lc_p(x_k, x) \\ c_{p+1}(x) - Lc_{p+1}(x_k, x) \\ \vdots \\ c_m(x) - Lc_m(x_k, x) \end{pmatrix} \qquad (29)$$

which can be interpreted as a relation expressing the difference between the Kuhn-Tucker conditions of (12) and (20). In an analogous way as for $f(z)$, we can formulate from the definition of $d(z_k, z)$ a lemma on the first-order Kuhn-Tucker points of (20).

Lemma 3.6. $z \in E^{n+m}$ is a first-order Kuhn-Tucker point of (20) if and only if $d(z_k, z) = 0$ and (25) and (26) are satisfied.
We shall denote by $S(z_k)$ the collection of all first-order Kuhn-Tucker points of (20). Hence $S(z_k)$ is defined by

$$S(z_k) := \{z \in R^{n+m} \mid d(z_k, z) = 0 \text{ and (25) and (26) are satisfied}\}. \qquad (30)$$
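By Lemma 3.2 and Lemma 3.6, Kuhn-Tucker points are roots of the residual mappings $f(z)$ and $d(z_k, z)$ that also satisfy the sign conditions. For a toy problem the residual $f(z)$ of (19) can be evaluated directly; the sketch below is our own one-dimensional illustration ($n = 1$, $p = 1$, $m = 1$), not the paper's implementation.

```python
# f(z) from (19) for a problem with a single inequality constraint:
#   min F(x)  s.t.  c(x) >= 0,
# with z = (x, u) and f(z) = (F'(x) - u c'(x), u c(x)).

def f_residual(x, u, F_grad, c, c_grad):
    return (F_grad(x) - u * c_grad(x), u * c(x))

# toy problem: min x^2 s.t. x - 1 >= 0; solution x* = 1 with u* = 2
F_grad = lambda x: 2.0 * x
c      = lambda x: x - 1.0
c_grad = lambda x: 1.0

print(f_residual(1.0, 2.0, F_grad, c, c_grad))  # -> (0.0, 0.0)
# at a non-stationary point the first component stays nonzero:
print(f_residual(0.5, 0.0, F_grad, c, c_grad))
```

Since $u^* = 2 > 0$ and $c(x^*) = 0 \ge 0$, the sign conditions (17)-(18) also hold, so $(1, 2)$ is indeed a first-order Kuhn-Tucker point of this toy problem.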
The relation between the first-order Kuhn-Tucker points of (12) and (20), as given in Proposition 3.3, now extends as follows, where $u_{k,i}, v_{k,i}$ and $\mu_{k,i}, \nu_{k,i}$ denote the Lagrange multiplier estimates at $x_k$ of the original and the linearised constraints.

Proposition 3.7. The Taylor expansions around $z_k = (x_k, u_k, v_k)$ of $f(z)$ and $d(z_k, z)$ are equal up to terms of the second order if

$$\phi(x_k, x) = \sum_{i=1}^{p} u_{k,i} \big( Lc_i(x_k, x) - c_i(x) \big) + \sum_{i=p+1}^{m} v_{k,i} \big( Lc_i(x_k, x) - c_i(x) \big).$$
Proof. From the Taylor expansions

$$f(z) = f(z_k) + (z - z_k)^T \nabla_z f(z_k) + \cdots$$

and

$$d(z_k, z) = d(z_k, z_k) + (z - z_k)^T \nabla_z d(z_k, z_k) + \cdots$$

we see that it is necessary to prove the following:
(i) $f(z_k) = d(z_k, z_k)$;
(ii) $\nabla_z f(z_k) = \nabla_z d(z_k, z)(z_k)$, which splits into
(iia) $\nabla_x f(z_k) = \nabla_x d(z_k, z)(z_k)$,
(iib) $\nabla_u f(z_k) = \nabla_\mu d(z_k, z)(z_k)$,
(iic) $\nabla_v f(z_k) = \nabla_\nu d(z_k, z)(z_k)$.
Relation (i) follows immediately from (29). For (iia) we observe:

$$\nabla_x f(z_k) = \begin{pmatrix} \nabla_x^2 F(x_k) - \sum_{i=1}^{p} u_{k,i} \nabla_x^2 c_i(x_k) - \sum_{i=p+1}^{m} v_{k,i} \nabla_x^2 c_i(x_k) \\ u_{k,1} \nabla_x c_1(x_k)^T \\ \vdots \\ u_{k,p} \nabla_x c_p(x_k)^T \\ \nabla_x c_{p+1}(x_k)^T \\ \vdots \\ \nabla_x c_m(x_k)^T \end{pmatrix}$$

and

$$\nabla_x d(z_k, z)(z_k) = \begin{pmatrix} \nabla_x^2 F(x_k) + \nabla_x^2 \phi(x_k, x)(x_k) \\ u_{k,1} \nabla_x Lc_1(x_k, x)(x_k)^T \\ \vdots \\ u_{k,p} \nabla_x Lc_p(x_k, x)(x_k)^T \\ \nabla_x Lc_{p+1}(x_k, x)(x_k)^T \\ \vdots \\ \nabla_x Lc_m(x_k, x)(x_k)^T \end{pmatrix}.$$

From the definition of $\phi(x_k, x)$ we derive

$$\nabla_x^2 \phi(x_k, x)(x_k) = -\sum_{i=1}^{p} u_{k,i} \nabla_x^2 c_i(x_k) - \sum_{i=p+1}^{m} v_{k,i} \nabla_x^2 c_i(x_k),$$

which, together with $\nabla_x Lc_i(x_k, x)(x_k) = \nabla_x c_i(x_k)$ for $i = 1, \ldots, m$, yields $\nabla_x f(z_k) = \nabla_x d(z_k, z)(z_k)$. The proof of (iib) follows directly from (19) and (27). Indeed,

$$\frac{\partial f}{\partial u_i}(z_k) = \begin{pmatrix} -\nabla_x c_i(x_k) \\ 0 \\ \vdots \\ c_i(x_k) \\ \vdots \\ 0 \end{pmatrix}, \qquad \frac{\partial d(z_k, z)}{\partial \mu_{k,i}}(z_k) = \begin{pmatrix} -\nabla_x c_i(x_k) \\ 0 \\ \vdots \\ Lc_i(x_k, x)(x_k) \\ \vdots \\ 0 \end{pmatrix},$$

combined with $Lc_i(x_k, x)(x_k) = c_i(x_k)$ for all $i$. Finally, (iic) follows from (19) and (27) again:

$$\frac{\partial f}{\partial v_i}(z_k) = \begin{pmatrix} -\nabla_x c_i(x_k) \\ 0 \\ \vdots \\ 0 \end{pmatrix} = \frac{\partial d(z_k, z)}{\partial \nu_{k,i}}(z_k).$$
Corollary 3.8. $\nabla_z f(z) = \nabla_z d(z, z)$ for all $z \in E^{n+m}$.

Proof. This corollary restates part (ii) of the preceding proof. Note that the proof is independent of whether or not $z$ is a Kuhn-Tucker point.

Remark 3.9. In terms of the original problem formulation, Proposition 3.7 can be interpreted as follows: for the given choice of $\phi(x_k, x)$, the quadratic approximations of the Lagrangian functions $\mathcal{L}(x, u, v)$ and $\mathcal{L}'(x, \mu, \nu)$ associated with the original problem (12) and the reduced problem (20) respectively are identical in a neighbourhood of $z_k \in E^{n+m}$.

Robinson [18] stated a number of properties relating $f(z)$, $d(z_k, z)$ and their respective gradients in a neighbourhood of a Kuhn-Tucker point $z^*$ of the original problem (12), which we mention without proof. Note that at $z^*$ we have $f(z^*) = 0$, and $\nabla_z f(z^*)$ is nonsingular. We set $\beta := \|\nabla_z f(z^*)^{-1}\|$. There exists an open neighbourhood of $z^*$ in which $z^*$ is the unique solution of $f(z) = 0$; hence $z^*$ is the locally unique Kuhn-Tucker point of (12). The following shortened notation will be used from now on: $\nabla_z d(z_1, z_2) := \nabla_z d(z_1, z)(z_2)$, etc.

Theorem 3.10 (Robinson [18]). If all problem functions are twice continuously differentiable in an open neighbourhood $O(x^*)$ of $x^*$, there exist constants $r > 0$ and $M > 0$ such that $z^*$ is the unique solution of $f(z) = 0$ in the closed ball $\bar{B}(z^*, \tfrac12 r)$ with radius $\tfrac12 r$ about $z^*$. Moreover, for any $z_1, z_2 \in B(z^*, r)$, with $\mu_i$ as Lagrange multipliers of the reduced problem, we have
(i) $\|\nabla_z d(z_1, z_2) - \nabla_z d(z^*, z^*)\| \le (2\beta)^{-1}$;
(ii) $\|f(z_2) - d(z_1, z_2)\| \le M \|z_1 - z_2\|^2$;
(iii) $c_i(x^*) > 0$ implies $Lc_i(x_1, x_2) > 0$;
(iv) $u_i^* > 0$ implies $\mu_i > 0$.

This theorem of Robinson will be applied in the comparison below of Kuhn-Tucker points of the original problem and of equality-constrained reduced problems. First we mention that a simplified problem can be obtained from (12) if the constraint set is reduced to a set of equality constraints $c_i(x)$ whose index $i$ belongs to a currently defined active set $I(z_k)$. Usually this active set $I(z_k)$ consists of all equality constraints, the currently binding inequality constraints and the inequality constraints that are expected to be binding at the next iteration point. For example, the algorithm to be described in this paper defines the active set $I(z_k)$ as all equality constraints and a selection of linear and nonlinear constraints, containing at least the binding constraints. This means that $i \notin I(z_k)$ corresponds with $c_i(x_k) > 0$. Thus, we consider the reduced problem

$$\min F(x) \quad \text{s.t.} \quad c_i(x) = 0 \text{ for all } i \in I(z_k). \qquad (31)$$
The first-order Kuhn-Tucker conditions of (31) are:

$$\nabla_x \Big( F(x) - \sum_{i \in I(z_k)} v_i c_i(x) \Big) = 0, \qquad (32)$$
$$c_i(x) = 0, \quad i \in I(z_k), \qquad (33)$$
where $v_i$, $i \in I(z_k)$, is the Lagrange multiplier corresponding to $c_i(x) = 0$. The equality-constrained problem (31) can be solved using the reduced problem

$$\min F(x) + \phi(x_k, x) \quad \text{s.t.} \quad Lc_i(x_k, x) = 0 \text{ for all } i \in I(z_k), \qquad (34)$$
with the following first-order Kuhn-Tucker conditions:

$$\nabla_x \Big( F(x) + \phi(x_k, x) - \sum_{i \in I(z_k)} \nu_{k,i} Lc_i(x_k, x) \Big) = 0, \qquad (35)$$
$$Lc_i(x_k, x) = 0, \quad i \in I(z_k). \qquad (36)$$
Analogous to the definition of $S(z_k)$ as the collection of Kuhn-Tucker points of (20), we define $S(z_k, I(z_k))$ to be the collection of all solutions of the Kuhn-Tucker conditions of problem (34). If $I(z_k)$ contains all equality constraints and all inequality constraints with positive estimated Lagrange multiplier (estimated from the linearly constrained subproblem; if $x_k$ is close enough to $x^*$ this estimate has the correct sign), then conditions (32), (33) arise from (14)-(18).
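When the objective is quadratic and $\phi \equiv 0$, finding a point of $S(z_k, I(z_k))$ via (35)-(36) reduces to one linear KKT system. The sketch below is our own minimal illustration (toy data and helper names assumed, one linear equality constraint), not the paper's implementation.

```python
# Equality-constrained reduced problem (34) with quadratic objective:
#   grad F(x) - nu * a = 0,   a^T x = b     (cf. (35)-(36))
# assembled and solved as one linear system in (x, nu).

def solve_linear(A, rhs):
    # tiny Gauss-Jordan elimination with partial pivoting
    n = len(A)
    M = [row[:] + [r] for row, r in zip(A, rhs)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(n):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [a - factor * b for a, b in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]

# toy problem: min x1^2 + x2^2  s.t.  x1 + x2 = 2
# KKT system: 2*x1 - nu = 0, 2*x2 - nu = 0, x1 + x2 = 2
K = [[2.0, 0.0, -1.0],
     [0.0, 2.0, -1.0],
     [1.0, 1.0,  0.0]]
x1, x2, nu = solve_linear(K, [0.0, 0.0, 2.0])
print(round(x1, 6), round(x2, 6), round(nu, 6))  # -> 1.0 1.0 2.0
```

For general nonlinear $F$ and nonzero $\phi$, (35)-(36) are of course nonlinear, and the linearly constrained algorithm mentioned earlier takes the place of this single solve.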
For the linearised reduced problems, similar relations apply for the conditions (35) and (36) as compared to (22)-(26). Our next point of interest is to find relations between the solution sets $S(z_k)$ and $S(z_k, I(z_k))$. The next two lemmas contain mutual inclusion relations.

Lemma 3.11. Let a point $z_k \in B(z^*, r)$ be given with $S(z_k, I(z_k)) \subset B(z^*, r)$ and strict complementary slackness in (15). If $I(z_k) := \{i \mid c_i(x^*) = 0,\ i = 1, \ldots, m\}$ with $\mu_{k,i} = 0$ for all $i \notin I(z_k)$, then $S(z_k, I(z_k)) \subset S(z_k)$.

Proof. By definition $z_{k+1} \in S(z_k, I(z_k))$ satisfies (35), (36). These equations can be extended to (22), (23) and (24) using $\mu_{k,i} = 0$ for all $i \notin I(z_k)$. To prove (25) we remark that, using (36), we only need to consider indices $i \in \{1, \ldots, p\} - I(z_k)$, which correspond to inactive inequality constraints. Then, since $z_k, z_{k+1} \in B(z^*, r)$ and $c_i(x^*) > 0$, we know from Theorem 3.10 (iii) that $Lc_i(x_k, x_{k+1}) > 0$; hence (25) is satisfied. Finally, (26) is true for all $i \notin I(z_k)$ because of the definition that $\mu_{k,i} = 0$ for these $i$. For $i \in I(z_k)$, (36) gives $Lc_i(x_k, x_{k+1}) = 0$, which implies that $c_i(x^*) = 0$: indeed, $c_i(x^*) < 0$ would violate the Kuhn-Tucker conditions at $x^*$, and $c_i(x^*) > 0$ would contradict $Lc_i(x_k, x_{k+1}) = 0$ by Theorem 3.10 (iii). But this means that $u_i^* > 0$ (strict complementary slackness in (15)), which again implies that $\mu_{k,i} > 0$ (Theorem 3.10 (iv)). Thus (26) is proved and the lemma is true.

Lemma 3.12. Let a point $z_k \in B(z^*, r)$ be given with $S(z_k, I(z_k)) \subset B(z^*, r)$ and strict complementary slackness in (15) and (23). If $I(z_k) := \{i \mid c_i(x^*) = 0,\ i = 1, \ldots, m\}$, then $S(z_k) \subset S(z_k, I(z_k))$.

Proof. We have to prove (35), (36) for $z_{k+1}$ satisfying (22)-(26). For $i \notin I(z_k)$ we have $c_i(x^*) > 0$ (definition of $I(z_k)$), so that $Lc_i(x_k, x_{k+1}) > 0$ (Theorem 3.10 (iii)) and $\mu_{k,i} = 0$ (strict complementary slackness in (23)). Then (35) follows from (21), (22) and the substitution $\mu_{k,i} = 0$ for nonbinding inequality constraints.
Since (36) obviously applies for equality constraints, we only need to consider indices $i \in I(z_k) \cap \{1, \ldots, p\}$. For these indices $c_i(x^*) = 0$, so that $u_i^* > 0$ (strict complementary slackness in (15)); but then $\mu_{k,i} > 0$ (Theorem 3.10 (iv)) and $Lc_i(x_k, x_{k+1}) = 0$ (strict complementary slackness in (23)), which completes the proof of (36) and the proof of the lemma.

When comparing Lemmas 3.11 and 3.12 we see that, in addition to the strict complementary slackness, the definition of the correct set $I(z_k)$ is of importance. The lemmas can be summarised in a corollary:
Corollary 3.13. Let $z_k \in B(z^*, r)$ be given with $S(z_k, I(z_k)) \subset B(z^*, r)$ and strict complementary slackness in (15) and (23). Let $I(z_k) := \{i \mid c_i(x^*) = 0,\ i = 1, \ldots, m\}$. Then $S(z_k) = S(z_k, I(z_k))$.
Proof. The proof is obvious from Lemmas 3.11 and 3.12.

How stringent or unrealistic are the conditions of Lemmas 3.11 and 3.12? The required strict complementary slackness means that there should be no weakly active constraints. This condition was imposed on the problem to be considered. In case of violation, the condition can be satisfied by suitably perturbing weakly active constraints (although this will generally lead to very small values of $r$). Further, the condition that $z_k \in B(z^*, r)$ can be realised by a preceding phase I procedure which yields a starting point that is 'close enough' to $z^*$. In practice the correctness of $I(z_k)$ is usually obtained after a few iterations in the initialisation procedure, unless zigzagging occurs. This means that the conditions of Lemmas 3.11 and 3.12 are not unrealistic if a suitable phase I procedure is used.
4. Convergence of sequences of Kuhn-Tucker points

In Section 3 the mappings $f(z)$ and $d(z_k, z)$ from $E^{n+m}$ to $E^{n+m}$ were introduced. A further investigation of the algorithms to be considered requires definitions and properties of the operators $f(z)$ and $d(z_k, z)$ as presented in [15, 8]. We refer to this literature for the definitions of a (strictly) nonexpansive mapping, a contractive mapping, a (non)deterministic mapping and a fixed point, and for Banach's contraction mapping theorem. $\|z\|$ will denote the Euclidean norm (it is easy to see, however, that the results remain valid for any norm on $R^{n+m}$). Note that though we use point-to-set maps below, the resulting implementations will uniquely define the next iteration point. Let $X_0$ and $X$ be subsets of $E^{n+m}$ with $X_0 \subset X$, where $X$ is assumed to be bounded, and let $A$ be a mapping from $X$ to $E^{n+m}$. In general $A$ may be a nonlinear operator on $X \subset E^{n+m}$. Examples are $f(z)$ and $d(z_k, z)$, with as domains the feasible regions $L \cap NL$ and $L \cap LNL(x_k)$ respectively. Another example of such an operator is provided by the algorithm given by (10) which, starting from $z_k = (x_k, u_k, v_k)$, defines the next iteration point $z_{k+1}$ as a certain Kuhn-Tucker point of the reduced problem. In this example the uniqueness of $z_{k+1}$ is established by an additional requirement, which makes the mapping deterministic. Removing this additional selection rule yields a nondeterministic mapping, which does not necessarily determine $z_{k+1}$ uniquely. The next proposition, which originates from [18], guarantees under weak conditions that if the iteration process given by (10) is initiated at a point $z_k$ close enough to a Kuhn-Tucker point $z^*$ of (12), then the reduced problem (20) defined at $z_k$ yields a unique Kuhn-Tucker point $z_{k+1}$ close to $z_k$, hence close to $z^*$.
An extension of this proposition to a reduction method which applies purely equality-constrained reduced problems will be used to prove a theorem on the convergence and the rate of convergence of this reduction method. The proof is a slight modification of Robinson's proof, and can be found in [26].
Proposition 4.1. If (i) both (12) and (20) satisfy strict complementary slackness, (ii) the problem functions are twice continuously differentiable, (iii) $z_k \in B(z^*, \tfrac12 r)$ is such that $4\beta \|f(z_k)\| \le r$, and (iv) $z^*$ is a Kuhn-Tucker point satisfying the second-order sufficiency conditions for (12), then there exists a point $z_{k+1} \in B(z_k, \tfrac12 r)$ such that $z_{k+1}$ is the unique Kuhn-Tucker triple of (20) defined at $z_k$ satisfying $\|z_{k+1} - z_k\| \le 2\beta \|f(z_k)\|$.

This proposition guarantees the existence in $B(z_k, \tfrac12 r)$ of a unique Kuhn-Tucker point of the reduced problem (20). The next proposition extends this result in that it shows that under certain conditions this point satisfies $z_{k+1} \in S(z_k, I(z^*))$. This means that to find $z_{k+1}$, it suffices to solve a smaller reduced problem, with only equality constraints. Note that Propositions 4.1 and 4.2 concern only the case when $\phi(x_k, x)$ is defined by (5).
Proposition 4.2. Let $z^*$ be a regular point satisfying the sufficient second-order Kuhn-Tucker conditions of (12), under strict complementary slackness in (15) and (23). If $z_k \in B(z^*, \tfrac12 r)$ with $4\beta \|f(z_k)\| \le r$ and $I(z_k) = \{i \mid c_i(x^*) = 0,\ i = 1, \ldots, m\}$, then there exists a unique $z_{k+1} \in B(z_k, \tfrac12 r)$ with $z_{k+1} \in S(z_k) = S(z_k, I(z^*))$ and $\|z_{k+1} - z_k\| \le 2\beta \|f(z_k)\|$.
Proof. Proposition 4.1 implies the existence of a unique $z_{k+1} \in B(z_k, \tfrac12 r)$ such that $z_{k+1} \in S(z_k)$ and $\|z_{k+1} - z_k\| \le 2\beta \|f(z_k)\|$. Under the conditions that $z_k$ and $z_{k+1} \in B(z^*, r)$ and of strict complementary slackness, we conclude from the discussion of Lemmas 3.11 and 3.12 that $S(z_k) = S(z_k, I(z^*))$, which proves the proposition.

The next Propositions 4.3 and 4.4 demonstrate that under the same assumptions at $z_k$, and assuming that $I(z^*)$ is known, it suffices to solve the equality-constrained reduced problem (34) with $I(z_k) = I(z^*)$ to obtain the next iteration point $z_{k+1}$. Thus the question remains: how to recognise the set $I(z^*) = \{i \mid c_i(x^*) = 0,\ i = 1, \ldots, m\}$ at $z_k \ne z^*$? A more thorough discussion of the determination of the correct active set is postponed to Section 5. For a given arbitrary index set $I_0$, the following result can be proven, where this time $\phi(x_k, x)$ can be any function which is continuously differentiable and satisfies (9). The proposition is an extension of a theorem of Bräuniger [3].
Proposition 4.3. If all problem functions and $\phi(x_k, x)$ are continuously differentiable, the sequence $\{\hat{z}_k\}$ converges to $\hat{z}$, and

$$\lim_{k \to \infty} \Big[ \inf_{z_{k+1} \in S(\hat{z}_k, I_0)} \|z_{k+1} - \hat{z}_{k+1}\| \Big] = 0,$$

then $\hat{z}$ is a Kuhn-Tucker point of the reduced equality-constrained problem (31) with $I(\hat{z}) = I_0$.
Proof. First note that the proposition is formulated in terms of an arbitrary sequence $\{\hat{z}_k\}$ with $\lim_{k \to \infty} \hat{z}_k = \hat{z}$. Let $z_{k+1}$ be the element of $S(\hat{z}_k, I_0)$ defined above which achieves $\inf \|z_{k+1} - \hat{z}_{k+1}\|$ for $z_{k+1} \in S(\hat{z}_k, I_0)$. Then $\lim_{k \to \infty} \hat{z}_k = \hat{z}$ and $\lim_{k \to \infty} \|z_{k+1} - \hat{z}_{k+1}\| = 0$, so that $\lim_{k \to \infty} z_k = \hat{z}$. Then the Kuhn-Tucker conditions (35) and (36) yield for $I(\hat{z}_k) = I_0$:

$$Lc_i(\hat{x}_k, x) = 0 \quad \text{for all } i \in I_0 \qquad (37)$$

and

$$\nabla_x \Big( F(x) + \phi(\hat{x}_k, x) - \sum_{i \in I_0} \nu_i Lc_i(\hat{x}_k, x) \Big) = 0. \qquad (38)$$

Substituting $z = z_{k+1}$, and using (9), (37) and (38), the continuity of all functions and their gradients, and the convergence of the sequence $\{z_k\}$, we obtain:

$$c_i(\hat{x}) = 0 \quad \text{for all } i \in I_0 \qquad (39)$$

and

$$\nabla_x F(\hat{x}) = \sum_{i \in I_0} \hat{\nu}_i \nabla_x c_i(\hat{x}), \qquad (40)$$

which proves the proposition.

Bräuniger's theorem concerns the function $\phi(x_k, x)$ as defined in (5). For the special case that $\phi(x_k, x)$ is defined by (5), Propositions 4.2 and 4.3 uniquely define a sequence converging to the point $z^*$. This is proved in the next proposition.
Proposition 4.4. If the conditions of Propositions 4.2 and 4.3 are satisfied and $\lim_{k \to \infty} z_k = z^*$, a solution of (12), then $z^*$ is a Kuhn-Tucker point of the problem (31) with $I(z_k) = I(z^*)$.

Proof. Starting from a point $z_0$ which satisfies the requirements of Proposition 4.2, the sequence $\{z_k\}$ with $z_{k+1} = S(z_k, I(z^*))$ is uniquely determined and satisfies $\|S(z_k, I(z^*)) - z_{k+1}\| = 0$. Hence Proposition 4.3 can be applied and the proof is complete.

The meaning of this proposition is that, although in practice $z^*$ and $I(z^*)$ are unknown, the algorithm still generates sequences $\{z_k\}$ and $\{I(z_k)\}$ which can be proved to converge to a Kuhn-Tucker point $z^*$ of (12) and to $I(z^*)$ respectively. To prove this, a more general approach by means of point-to-set maps will be followed. It will turn out to be possible to prove that under suitable conditions the sequence of iteration points generated by the algorithm converges to a fixed point of a point-to-set mapping which characterises the algorithm. This point will turn out to be a Kuhn-Tucker point of the problem considered.
This approach of defining an algorithm via point-to-set maps is originally due to Zangwill [29]. Later Polak [17], Luenberger [12], Meyer [14] and others followed Zangwill's approach in describing algorithms and investigating their properties. The major part of the definitions and results which we shall need can be found in [14, 26]. Again we assume that $z \in E^{n+m}$, while $A$ acts on $E^{n+m}$. Let $X$ and $Y$ be subsets of $E^{n+m}$.
Definition 4.5. A point-to-set map $A$ from $X$ to $Y$ is a map which associates a subset $A(z) \subset Y$ with each $z \in X$.

We shall assume that $X \subset E^{n+m}$ is closed: for instance, $X = \bar{B}(z^*, r)$ is the closed ball with radius $r$ around the Kuhn-Tucker point $z^*$ of (12). Then, given a point $z_0 \in X$ and $X = Y$, an algorithm is defined by any scheme of the following type:

Step 0. Set $k = 0$.
Step 1. Pick a point $z_{k+1} \in A(z_k)$.    (41)
Step 2. Set $k := k + 1$, and go to Step 1.
In this scheme no stopping rule is included (only infinite sequences are generated). The algorithm (41) is termed nondeterministic since $z_{k+1}$ may be chosen arbitrarily in $A(z_k)$. We shall denote by $C$ the characteristic set of the algorithm defined by (41), by $P$ the set of all limit points of all convergent sequences generated by (41), and by $Q$ the set of all cluster points of all sequences generated by (41).
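For a deterministic, contractive map $A$, the scheme (41) is the classical fixed-point iteration of Banach's theorem. The sketch below is our own illustration with a toy contraction, not one of the reduction mappings of the paper; a stopping test is added only to make the sketch runnable.

```python
# Algorithm (41) with a deterministic map A: pick z_{k+1} = A(z_k).
# For a contractive A, Banach's theorem guarantees convergence to the
# unique fixed point z* = A(z*).

def iterate(A, z0, tol=1e-12, max_iter=1000):
    z = z0
    for _ in range(max_iter):
        z_next = A(z)
        if abs(z_next - z) <= tol:
            return z_next
        z = z_next
    return z

# A(z) = z/2 + 1 is a contraction (modulus 1/2) with fixed point z* = 2
z_star = iterate(lambda z: 0.5 * z + 1.0, z0=10.0)
print(round(z_star, 9))  # -> 2.0
```

The results below replace the contractiveness assumption by upper-semicontinuity, compact-valuedness and asymptotic regularity, which is precisely what makes them applicable to the point-to-set reduction maps.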
Proposition 4.6 (Meyer [14]). If the map $A$ is upper-semicontinuous (u.s.c.) and compact-valued on $X - D$, where $D \subset X$, and algorithm (41) is asymptotically regular, then $C \subset Q \subset C \cup D$.
Corollary 4.7. If $A$ is u.s.c. and compact-valued on $X - C$, and algorithm (41) is asymptotically regular, then $P = Q = C$.

Ostrowski [16] proved that an asymptotically regular sequence on a bounded set $X$ has either a unique cluster point or a continuum of cluster points. Hence the following holds:
Corollary 4.8. Suppose that $A$ is u.s.c. and compact-valued on $X - C$, $X$ is bounded, algorithm (41) is asymptotically regular, and $C$ contains at most a countable number of points; then every sequence generated by (41) converges to a point in $C$, which is a fixed point of algorithm (41).

In order to apply the above results to the algorithms considered, we must ensure that they use mappings $A$ which are u.s.c. and compact-valued on $X - C$.
Then, according to a lemma in [14], it suffices to verify that $X - C$ is bounded and $A$ is closed on $X - C$. Since $X - C \subset X$, which is assumed to be bounded, it remains only to investigate whether $A$ is closed on $X - C$. Instead of investigating the general algorithm given by (10) with a currently defined active set $I(z_k)$, we focus on the special case in which the final, correct active set $I(z^*) = I^*$ has already been found and $I(z_k) = I(z^*)$ is used for each subsequent $k$. The following lemma (cf., e.g., [14]), which guarantees the existence of a subsequence converging to any cluster point of a sequence $\{z_k\}$, will be used in the next theorem.

Lemma 4.9. Let $q$ be the set of all cluster points of the sequence $\{z_k\}$ generated by the algorithm (41) with $A$ u.s.c. and compact-valued on $X$. If $q$ is nonempty and bounded, then given any neighbourhood $N(q)$ of $q$, there exists an index $k$, depending on $N(q)$, such that $z_i \in N(q)$ for all $i \ge k$.

Application of this lemma to a suitable region $X$ such that $q = \{z^*\}$ yields the required subsequence.

Theorem 4.10. Let the sequence $\{z_k\}$ be defined by (41) with $z_{k+1} = A(z_k)$, where $A$ is given by (34) with $I(z_k) = I^*$. Let $X$ be bounded and closed, with all problem functions and $\phi(x_k, x)$ continuously differentiable, while (9) is satisfied. Then every cluster point $z^*$ of $\{z_k\}$ is a fixed point of $A$.

Proof. The infinite sequence $\{z_k\}$ thus generated by the feasible point algorithm (41) has at least one limit point $z^* \in X$. Then there exists a subsequence converging to $z^*$, which we shall denote again by $\{z_k\}$. It follows from the definition of $z_{k+1}$ that (35) and (36) are satisfied by $z_{k+1}$. The continuity of the gradients, (9), (35), (36) and $\lim_{k \to \infty} z_k = z^*$ then imply that

$$Lc_i(x^*, x^*) = 0 \quad \text{for all } i \in I(z^*)$$

and

$$\nabla_x \Big( F(x^*) - \sum_{i \in I(z^*)} \nu_i^* Lc_i(x^*, x^*) \Big) = 0.$$

Thus $z^*$ satisfies the Kuhn-Tucker conditions of the reduced problem defined at $z^*$, and so $z^* \in A(z^*)$. Hence $z^*$ is a fixed point of $A$.

Proposition 4.11. The deterministic algorithm defined in Theorem 4.10 is continuous at every $z \in X$, and hence is closed on $X$.

Proof. The continuity follows directly from the continuity assumptions on the problem functions. The closedness is an immediate consequence of this continuity.
Corollary 4.12. The deterministic algorithm defined in Theorem 4.10 is u.s.c. and compact-valued on $X$.

Proof. This follows from the application of [14, Lemma 10].

As a consequence, all statements derived above for u.s.c. and compact-valued mappings apply to the algorithms developed, for instance those using reduced problem formulations such as (10), (20) and (34) with $I(z_k) = I^*$. We make special mention of the corollary of Proposition 12 in [14]: if a sequence generated by the algorithm (41) converges, it converges to a fixed point of $A$. And of Corollary 4.8 above: if (i) the applied mapping $A$ is u.s.c. and compact-valued on $X - C$, (ii) $X$ is bounded, (iii) algorithm (41) is asymptotically regular, and (iv) $C$ contains at most a countable number of points, then every sequence generated by (41) converges to a point in $C$ which is a fixed point of the mapping $A$. Compared with Banach's contraction theorem, we see that the convergence to a fixed point is now established under conditions such as asymptotic regularity, upper-semicontinuity and compact-valuedness, instead of the condition of contractiveness. It is to be hoped that the requirement that $C$ should be countable will be satisfied in most real-life problems. Let $A$ be defined by (10) and let $z^*$ be a fixed point of the mapping $A$. If we assume strict complementary slackness in all occurring Kuhn-Tucker conditions, we can prove the next theorem, following [23].

Theorem 4.13. A fixed point $z^*$ of the mapping $A$ is a Kuhn-Tucker point of
(12).

Proof. Since $z^*$ is a fixed point of $A$, we know that $A(z^*) = z^*$ and $d(z^*, z^*) = 0$, while (25) and (26) are satisfied ($z^*$ is a first-order Kuhn-Tucker point of (20) with $z_k = z^*$, and hence Lemma 3.6 can be applied). From (28) we see that $f(z^*) = d(z^*, z^*) = 0$. Furthermore, (17) is satisfied, since if $c_i(x^*)$ were negative for at least one $i \in \{1, \ldots, p\}$, this would imply that $Lc_i(x^*, x^*) = c_i(x^*) < 0$, which contradicts (25). Finally, (18) is satisfied, as can be seen from the relation $u_i^* = \mu_i^* \ge 0$ for $i = 1, \ldots, p$. Hence Lemma 3.2 applies and $z^*$ is a first-order Kuhn-Tucker point of (12).

In Corollary 3.13 we saw that $S(z_k) = S(z_k, I(z^*))$, which meant that if $I(z^*)$ is known at $z_k$, it suffices to solve the equality-constrained reduced problem (34) with $I(z_k) = I(z^*)$. In that situation we can formulate an analogous theorem for the resulting equality-constrained reduction method.

Theorem 4.14. If $z^*$ is a fixed point of the mapping $A$ defined by the application of the reduced problem (34) with the correct active set $I(z_k) = I(z^*)$ in the algorithmic scheme given by (41), then $z^*$ is a Kuhn-Tucker point of (12).
Proof. The proof follows from the combination of Lemmas 3.11 and 3.12 and Theorem 4.13, using strict complementary slackness in the Kuhn-Tucker conditions.

An important question which should be considered now is under which conditions the correct final active set $I(z^*)$ can be defined at $z_k \ne z^*$. Possible active set strategies to answer this question are treated in the next section.
5. Active set strategies

In Section 4 we derived conditions under which $S(z_k) = S(z_k, I(z^*))$, which means that the first-order Kuhn-Tucker points of the original problem (12) are first-order Kuhn-Tucker points of the purely equality-constrained reduced problem (31) with $I(z_k) = I(z^*)$, and vice versa. Since $I(z^*)$ is not usually known beforehand, we seek criteria by which an algorithm can recognise $I(z^*)$. That is why a current active set $I(z_k)$ is defined, which is intended to contain the constraints that are most relevant, at least locally. Adjustment of $I(z_k)$ should be accomplished using the information gathered about the constraints, such as multiplier estimates and their current status: binding ($c_i(x_k) = 0$); violated ($c_i(x_k) < 0$ for $i \in \{1, \ldots, p\}$, or $c_i(x_k) \ne 0$ for $i \in \{p+1, \ldots, m\}$); or satisfied ($c_i(x_k) \ge 0$ or $c_i(x_k) = 0$ for the respective cases). Constraints will be added to or dropped from the active set, with the ultimate goal that $I(z_k) = I(z^*)$ holds for all $k \ge K$. By the application of such an 'active set strategy', all reduced problems are equality-constrained problems in which all constraints with index $i \notin I(z_k)$ are omitted. Hence the reduced problems are simplifications of the original problem. Obviously, the above approach is equally valid for both the original problem (12) and the linearised problem (20). The convergence of the sequence of iteration points $\{z_k\}$ is expected to be accelerated if $I(z^*)$ can be recognised in an early stage of the iteration process. We recall Robinson's Theorem 3.10 (iii) and (iv), which express the fact that as soon as $z_k \in B(z^*, r)$, it holds at all points $z_{k+1} \in B(z^*, r)$ that:
(iii) if $c_i(x)$ is passive at $x^*$, its linearisation at $x_k$ is passive as well;
(iv) if $c_i(x)$ is active at $x^*$, its linearisation at $x_k$ is active as well.
The next two propositions concern relations between the index sets $I(z_k)$ and
I(z*).

Proposition 5.1. Let z* be a regular Kuhn-Tucker solution of problem (12) satisfying the conditions of Theorem 3.10 under strict complementary slackness. If z_k ∈ B(z*, ½r) with 4β||f(z_k)|| ≤ r, then I(z*) ⊂ I(z_k).

Proof. We know from Proposition 4.1 that
S(z_k) exists, with

||S(z_k) - z_k|| ≤ 2β||f(z_k)|| ≤ ½r.

Hence for z_{k+1} = S(z_k):
G. van der Hoek / Asymptotic properties of reduction methods
||z_{k+1} - z*|| ≤ ||z_{k+1} - z_k|| + ||z_k - z*|| < r.
Now Theorem 3.10 can be applied. Let i ∈ I(z*) be arbitrary. Then u*_i > 0 (strict complementary slackness in z*), which gives μ_{k,i} > 0 (Theorem 3.10 (iv)) and hence Lc_i(x_k, x_{k+1}) = 0 (complementary slackness in (23)), which implies that i ∈ I(z_k). Thus it has been proved that I(z*) ⊂ I(z_k).

The interpretation of Proposition 5.1 is obvious: if we are close enough to z*, all constraints of I(z*) belong to the current active set I(z_k). A reverse statement can also be proved:

Proposition 5.2. Assume strict complementary slackness in (23) for all i ∈ I(z_k), with z_k ∈ B(z*, r), while z* and r satisfy the conditions of Theorem 3.10. Then I(z_k) ⊂ I(z*).

Proof. For any i ∈ I(z_k) we know by assumption that μ_{k,i} > 0. Hence Lc_i(x_k, x) = 0 (complementary slackness), which yields c_i(x*) = 0, and thus i ∈ I(z*).

Corollary 5.3. Under the conditions on z* and z_k given in Proposition 5.1 and the condition of strict complementary slackness in both propositions, we conclude that I(z_k) = I(z*), i.e., the currently defined active set equals the final active set. The proof is obvious.

Remark 5.4. The procedure suggested in [3] does not satisfy μ_{k,i} > 0 for all i ∈ I(z_k), since it generates weakly active constraints.

In summary, I(z_k) = I(z*) if z_k and z* satisfy some reasonable conditions and an appropriate active set strategy is applied.
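The constraint bookkeeping described above can be sketched in code. The sketch below is illustrative only: the function names, the tolerance, and the exact add/drop rule are assumptions, not the paper's procedure; the constraint convention c_i(x) ≥ 0 for i = 1, ..., p and c_i(x) = 0 for i = p+1, ..., m is taken from the text.

```python
def classify(ci, i, p, tol=1e-8):
    """Classify the value ci = c_i(x_k) of constraint i as in Section 5:
    c_i(x) >= 0 for i = 1..p, c_i(x) = 0 for i = p+1..m (assumed)."""
    if abs(ci) <= tol:
        return "binding"
    if i <= p:                       # inequality constraint c_i(x) >= 0
        return "violated" if ci < 0 else "satisfied"
    return "violated"                # equality constraint with ci != 0

def update_active_set(c_values, multipliers, p, tol=1e-8):
    """Hypothetical active set update: I(z_k) collects binding and
    violated constraints, plus satisfied inequalities whose multiplier
    estimate is positive; the rest are dropped."""
    active = set()
    for i, ci in enumerate(c_values, start=1):
        status = classify(ci, i, p, tol)
        if status in ("binding", "violated"):
            active.add(i)
        elif i <= p and multipliers[i - 1] > tol:
            active.add(i)            # multiplier indicates an active bound
    return active
```

With such a rule, the reduced problem at iteration k only keeps the constraints indexed by the returned set, matching the equality-constrained simplification described above.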
6. Convergence of the solutions of the reduced problems

In Section 4, conditions were discussed for convergence of sequences of iteration points {z_k} generated by the application of the proposed reduction methods. The limit points turned out to be fixed points z* of the applied mapping A. These points also appeared among the first-order Kuhn-Tucker points of the original problem (12), or among those of the reduced problem (31) with I(z_k) = I(z*). Now that conditions for convergence have been established, the next point of interest is the rate of convergence. Given z*, the question arises whether it is possible to derive bounds on ||z_k - z*||. The next theorem deals with this problem.
It is proved that the supremum (over all sequences {z_k} → z* generated by the algorithm) of

lim sup_{k→∞} ||z_k - z*||^{1/2^k}    (42)

is an element of (0, 1). This means, in the terminology of [15], that the convergence is R-quadratic. The theorem is closely related to theorems of Robinson [18] and Bräuniger [3], and the proof is a slight modification of theirs.

Theorem 6.1. Let z* be a regular Kuhn-Tucker solution of (12) satisfying the second-order sufficiency conditions with strict complementary slackness in (15), while all problem functions are at least twice continuously differentiable in an open neighbourhood of z*. Then there exists a δ > 0 such that if z_0 ∈ B(z*, δ), the algorithm with the active set strategy defined in Section 5 generates a sequence {z_k} which converges R-quadratically to z*. In particular:

||z_k - z*|| ≤ Σ_{j=k}^∞ ||z_{j+1} - z_j|| ≤ (η^{2^k}/(2βM)) Σ_{j=0}^∞ (½)^j    (43)

where the scalars β, r and M were defined in Theorem 3.10 and η is defined as

η := min(½, ½βMr).    (44)
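The shape of the bound (43) can be illustrated numerically. In the sketch below the constants β, M and η are assumed sample values, not taken from the paper; the sketch checks that error bounds of the form e_k = η^(2^k)/(βM) are squared (up to the constant βM) at each step, and that the root factors e_k^(1/2^k) of (42) approach η < 1, which is R-quadratic convergence.

```python
import math

# Assumed sample constants (illustrative only, not from the paper).
beta, M, eta = 2.0, 5.0, 0.4

# Hypothetical error bounds e_k = eta**(2**k) / (beta*M), k = 0..5.
errors = [eta ** (2 ** k) / (beta * M) for k in range(6)]

# Each step squares the error up to the constant beta*M:
# e_{k+1} = beta*M * e_k**2.
for e_k, e_next in zip(errors, errors[1:]):
    assert math.isclose(e_next, beta * M * e_k ** 2, rel_tol=1e-9)

# The root factors e_k**(1/2**k) of (42) tend to eta < 1.
factors = [e ** (1.0 / 2 ** k) for k, e in enumerate(errors)]
```

After only six steps the bound has dropped by some thirteen orders of magnitude, which is the practical meaning of R-quadratic convergence.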
Proof. First we shall prove by induction a result on ||z_j - z_{j-1}|| and ||f(z_j)|| for j = 1, 2, .... From f(z*) = 0, Lemma 3.2 and the continuity of f(z), we see that there exists a δ (0 < δ < ½r) such that for all z ∈ B(z*, δ) we have

||f(z)|| ≤ η/(4β²M).    (45)

Let z_0 ∈ B(z*, δ) be the starting point of the algorithm. Then ||z_0 - z*|| < δ ≤ ½r and

4β||f(z_0)|| ≤ η/(βM) ≤ (½βMr)/(βM) = ½r ≤ r    (by (45) and (44)).

Hence Propositions 5.1 and 5.2 now yield that I(z_0) = I(z*), and Proposition 4.2 implies that there exists a unique point z_1 with z_1 = S(z_0) = S(z_0, I(z*)) such that

||z_1 - z_0|| ≤ 2β||f(z_0)||.    (46)
From the definition z_1 = S(z_0) we see that d(z_0, z_1) = 0, and hence

||f(z_1)|| = ||f(z_1) - d(z_0, z_1)|| ≤ M||z_1 - z_0||²    (by Theorem 3.10 (ii))
≤ 4β²M||f(z_0)||² ≤ 4β²M(η/(4β²M))²    (by (45)).

Therefore

||f(z_1)|| ≤ η^{2^1}/(4β²M).    (47)
We claim that inequalities analogous to (46) and (47) are valid for any j = 1, 2, .... This statement has been proved for j = 1. To complete a proof by induction, it remains to be proved for j = k + 1, assuming its validity for j = 1, 2, ..., k. Thus for 1 ≤ j ≤ k we assume:

||z_j - z_{j-1}|| ≤ 2β||f(z_{j-1})||    (48)

and

||f(z_j)|| ≤ η^{2^j}/(4β²M).    (49)
Consider (48) for some 1 ≤ j ≤ k:

||z_j - z_{j-1}|| ≤ 2β||f(z_{j-1})|| ≤ η^{2^{j-1}}/(2βM)    (by (49) for j - 1).

Then

||z_j - z_{j-1}|| ≤ η^{2^{j-1}}/(2βM) ≤ (½βMr)(½)^{2^{j-1}-1}/(2βM) = ¼r(½)^{2^{j-1}-1}    (by (44)).
Using this we obtain

||z_k - z*|| ≤ ||z_0 - z*|| + Σ_{j=1}^k ||z_j - z_{j-1}||,    (50)

which means that z_k ∈ B(z*, ½r). Further, inequality (49) for j = k gives

4β||f(z_k)|| ≤ 4β·η^{2^k}/(4β²M) = η^{2^k}/(βM) ≤ η/(βM) ≤ ½r    (51)    (by (44)).
Now that (50) and (51) have been proved, Proposition 4.2 can be applied. This
yields the unique point z_{k+1} = S(z_k) = S(z_k, I(z*)) with

||z_{k+1} - z_k|| ≤ 2β||f(z_k)||,

which is equivalent to (48) for j = k + 1. Finally, (49) is true for j = k + 1, since the definition z_{k+1} = S(z_k) yields d(z_k, z_{k+1}) = 0 and hence

||f(z_{k+1})|| = ||f(z_{k+1}) - d(z_k, z_{k+1})|| ≤ M||z_{k+1} - z_k||²    (from Theorem 3.10 (ii))
≤ 4β²M||f(z_k)||²    (just proved)
≤ η^{2^{k+1}}/(4β²M)    (by (49) for j = k).
The sequence {z_k} thus generated by the described application of the algorithm is an infinite sequence in B(z*, ½r), so that it has at least one cluster point z' ∈ B(z*, ½r). Then Theorem 4.10 implies that z' is a fixed point of A, and Theorem 4.13 implies that z' is a Kuhn-Tucker point of (12). The uniqueness of z* in B(z*, ½r) implies that necessarily z' = z*. Finally,
||z_k - z*|| ≤ Σ_{j=k}^∞ ||z_{j+1} - z_j|| ≤ (η^{2^k}/(2βM)) Σ_{j=0}^∞ (½)^j.    (52)
Corollary 6.2. The algorithm is asymptotically regular on B(z*, ½r).

Proof. The proof is obvious from (52).

Corollary 6.3. If C contains at most a countable number of points, then every sequence generated by (41), with z_{k+1} = A(z_k), where A is defined by the reduction method developed in this section, converges to a point in C. Thus a fixed point of the algorithm is a first-order Kuhn-Tucker point of (12).

Proof. The proof follows from the application of Corollary 4.8.

Corollary 6.4. The effect of only using equality-constrained reduced problems is reflected in the fact that ||f(z)|| for problem (31) is less than or equal to ||f(z)|| for problem (12).

Proof. The proof follows from a comparison of Propositions 4.1 and 4.2.
7. A 2-phase implementation

The proofs of convergence in the preceding sections rely heavily on Theorem 3.10, which assumes that the starting point z_0 is close enough to z*. As
mentioned above, a first phase, called phase I, is applied to provide a suitable starting point for phase II, which consists of the solution of reduced problems such as (20) or (34). In phase I an exterior point penalty function is minimised subject to the given linear constraints (11). The question now is how the quality of the generated starting point z_0 for phase II can be improved, and how the results of phase I can be used to define a reasonable approximation of the final active set I(z*). It may be expected that a penalty parameter in phase I that is too large creates an undesirably ill-conditioned reduced problem, or that this early stage of the iteration process pays too much attention to feasibility at the expense of reduction of the objective function. On the other hand, a weighting factor of the penalty term that is too low may yield a starting point z_0 ∉ B(z*, r). It seems reasonable to assert that the choice of the penalty parameter depends on the function F(x) to be minimised, the constraint functions c_i(x), i = 1, ..., m, and their respective gradients and Hessians.

The concept of defining a hybrid, 2-phase algorithm in this way was first suggested by Rosen [22], who proposed one exterior penalty minimisation as phase I. It gives rise to the following concept of an algorithm.

Step 1. Initialisation: choose a starting point z_0 and a penalty parameter t_0.
Step 2. Solve one (or more) problems of the form (11) as phase I to obtain z_1 = (x_1, u_1, v_1). Set z_0 to z_1 and k to zero to initialise phase II.
Step 3. Solve a reduced problem as developed in this chapter (e.g. (20) or (34)) as phase II, to obtain z_{k+1} = (x_{k+1}, u_{k+1}, v_{k+1}).
Step 4. Apply convergence criteria. In case of convergence, stop. Otherwise, set k equal to k + 1 and return to Step 3.

The algorithmic and numerical aspects of this 2-phase proposal are discussed in more detail in [26, Chapter V].
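The four steps above can be arranged as a driver loop. In this sketch the callables solve_penalty_phase, solve_reduced_problem and converged are hypothetical placeholders supplied by the caller; they stand in for the phase I penalty minimisation, the reduced problems (20)/(34), and the convergence criteria of Step 4, and are not the paper's implementation.

```python
# Schematic 2-phase driver (Steps 1-4 above). The three callables are
# hypothetical placeholders, not the paper's actual routines.
def two_phase(z0, t0, solve_penalty_phase, solve_reduced_problem,
              converged, max_iter=100):
    # Step 1: starting point z0 and penalty parameter t0 are given.
    # Step 2: phase I -- one (or more) penalty minimisations give the
    # starting point for phase II.
    z = solve_penalty_phase(z0, t0)
    # Steps 3-4: phase II -- solve reduced problems until the
    # convergence criteria are met.
    for _ in range(max_iter):
        z_next = solve_reduced_problem(z)
        if converged(z, z_next):
            return z_next
        z = z_next
    raise RuntimeError("phase II did not converge within max_iter steps")
```

For instance, with a trivial phase I and a Newton-like contraction as the reduced step, the driver reproduces an ordinary quadratically convergent fixed-point iteration.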
The resulting phase I consists of one minimisation of an exterior penalty function whose penalty parameter is inversely proportional to the constraint violations at the initial point. The theoretical justification of such an approach is given in the next theorem. In essence, this theorem states that if the penalty parameter t_0 is large enough, the convergence of the exterior penalty methods [4] assures that z_0 ∈ B(z*, r); hence the final point of phase I is close enough to z*. Assuming that all other conditions of the theorem on the ultimate convergence are satisfied (as in Section 6), this means that the convergence is R-quadratic.

Theorem 7.1. Consider the nonlinear programming problem (12) with a regular point z* which satisfies the second-order sufficiency conditions with strict complementary slackness. Assume further that all problem functions are at least twice continuously differentiable in an open neighbourhood of z*. Then there exists a t̄ such that if we use t_0 ≥ t̄ in phase I of the algorithm, the generated sequence {z_k} of iteration points converges R-quadratically to z*.

Proof. As mentioned above, there exists a t̄ such that for t_0 ≥ t̄, phase I
provides a starting point z_0 which meets the requirements of Theorem 6.1. Thus, if the other conditions such as differentiability of the problem functions are satisfied and an appropriate active set strategy is used, Theorem 6.1 can be applied to prove Theorem 7.1.
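The exterior point penalty function minimised in phase I can be sketched as follows. The quadratic-loss form below is one standard choice from the penalty literature [4], assumed here for illustration; the paper's exact penalty term and the rule linking t_0 to the initial constraint violations are not reproduced. The constraint convention c_i(x) ≥ 0 for i = 1, ..., p and c_i(x) = 0 for i = p+1, ..., m follows the text.

```python
# Sketch of an exterior-point penalty objective for phase I; the
# quadratic-loss form is an assumed standard choice, not necessarily
# the paper's exact function.
def exterior_penalty(F, cons, p, t):
    """Return P(x) = F(x) + t * (sum of squared constraint violations),
    with cons[0..p-1] inequalities c_i(x) >= 0 and the rest equalities."""
    def P(x):
        violation = 0.0
        for i, c in enumerate(cons, start=1):
            ci = c(x)
            if i <= p:
                violation += min(ci, 0.0) ** 2   # only violated inequalities
            else:
                violation += ci ** 2             # equalities always penalised
        return F(x) + t * violation
    return P
```

Minimisers of P move toward feasibility as t grows, which is the mechanism by which a sufficiently large t_0 places the phase I result inside B(z*, r).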
8. Conclusion

After a discussion of the application of linearisations to solve constrained nonlinear programming problems, a modification of Robinson's algorithm is suggested in which only the constraints of the currently defined active set contribute to the definition of the reduced problem. The convergence and rate of convergence of the resulting algorithm are established. It is proved that, under suitable conditions, the algorithm generates a sequence of Kuhn-Tucker points of reduced problems which converges R-quadratically to a Kuhn-Tucker point of the original problem. Appropriate active set strategies are mentioned. Finally, a 2-phase implementation is presented and proved to be convergent.
Acknowledgment The author is greatly indebted to Prof. Van den Meerendonk for supervising this research project. He is also grateful to Prof. Hazewinkel, Prof. Rinnooy Kan, and the referees for their accurate reading of various versions of the manuscript.
References

[1] J.F. Ballintijn, G. Van der Hoek and C.L. Hooykaas, "Optimization methods based on projected variable metric search directions", Report 7821/0, Econometric Institute (Erasmus University, Rotterdam, 1978).
[2] E.M.L. Beale, "Numerical methods", in: J. Abadie, ed., Nonlinear programming (North-Holland, Amsterdam, 1967) pp. 135-209.
[3] J. Bräuniger, "A modification of Robinson's algorithm for general nonlinear programming problems requiring only approximate solutions of subproblems with linear equality constraints", in: J. Stoer, ed., Optimization techniques, Proc. 8th IFIP Conference on Optimization Techniques, Würzburg, Lecture Notes in Control and Information Sciences 7(2) (Springer, Heidelberg, 1977).
[4] A.V. Fiacco and G.P. McCormick, Nonlinear programming: Sequential unconstrained minimization techniques (Wiley, New York, 1968).
[5] R.E. Griffith and R.A. Stewart, "A nonlinear programming technique for the optimization of continuous processing systems", Management Science 7 (1961) 379-392.
[6] W.A. Gruver and N.H. Engersbach, "Nonlinear programming by projection-restoration applied to optimal geostationary satellite positioning", American Institute of Aeronautics and Astronautics 12 (1974) 1715-1720.
[7] W.A. Gruver and N.H. Engersbach, "A variable metric algorithm for constrained minimization based on an augmented Lagrangian", International Journal for Numerical Methods in Engineering 10 (1976) 1047-1056.
[8] L.V. Kantorovich and G.P. Akilov, Functional analysis in normed spaces (Macmillan, New York, 1964).
[9] H.J. Kelley and J.L. Speyer, "Accelerated gradient projection", in: Proceedings of Colloquium on Optimization, Lecture Notes in Mathematics 132 (Springer, Heidelberg, 1970).
[10] J. Kreuser, "Convergence rates for nonlinear constraint Lagrangian methods", Ph.D. Thesis, Computer Science Department, University of Wisconsin, Madison, WI (1972).
[11] S.A. Lill, "Generalization of an exact method for solving equality constrained problems to deal with inequality constraints", in: F.A. Lootsma, ed., Numerical methods for non-linear optimization (Academic Press, London, 1972) pp. 383-395.
[12] D.G. Luenberger, Introduction to linear and nonlinear programming (Addison-Wesley, Reading, MA, 1973).
[13] G.P. McCormick, "Penalty function versus non-penalty function methods for constrained nonlinear programming problems", Mathematical Programming 1 (1971) 217-238.
[14] G.G.L. Meyer, "Asymptotic properties of sequences iteratively generated by point-to-set maps", Mathematical Programming Study 10 (1979) 115-127.
[15] J.M. Ortega and W.C. Rheinboldt, Iterative solution of nonlinear equations in several variables (Academic Press, London, 1970).
[16] A.M. Ostrowski, Solutions of equations and systems of equations (Academic Press, London, 1966).
[17] E. Polak, Computational methods in optimization: A unified approach (Academic Press, London, 1971).
[18] S.M. Robinson, "A quadratically convergent algorithm for general nonlinear programming problems", Mathematical Programming 3 (1972) 145-156.
[19] J.B. Rosen, "The gradient projection method for nonlinear programming, Part I: linear constraints", Journal of the Society for Industrial and Applied Mathematics 8 (1960) 181-217.
[20] J.B. Rosen, "The gradient projection method for nonlinear programming, Part II: nonlinear constraints", Journal of the Society for Industrial and Applied Mathematics 9 (1961) 514-532.
[21] J.B.
Rosen, "Convex partitioning programming", in: R.L. Graves and P. Wolfe, eds., Recent advances in mathematical programming (McGraw-Hill, New York, 1963) pp. 159-176.
[22] J.B. Rosen, "A two-phase method for nonlinear constraint problems", paper presented at the IX International Symposium on Mathematical Programming, Budapest (1976).
[23] J.B. Rosen, "Two-phase algorithm for nonlinear constraint problems", Technical Report 77-8, University of Minnesota, Minneapolis, MN (1977).
[24] J.B. Rosen and J. Kreuser, "A gradient projection algorithm for nonlinear constraints", in: F.A. Lootsma, ed., Numerical methods for non-linear optimization (Academic Press, London, 1972) pp. 297-301.
[25] G. Van der Hoek, "Experiments with a reduction method for nonlinear programming, based on a restricted Lagrangian", Report 7804/0, Econometric Institute, Erasmus University, Rotterdam (1978).
[26] G. Van der Hoek, Reduction methods in nonlinear programming (Mathematical Centre, Amsterdam, 1980).
[27] G. Van der Hoek and C.L. Hooykaas, "A reduction method for nonlinear programming, based on a restricted Lagrangian (RESLA)", in: Methods of Operations Research, Band 31, Proceedings of the Third Symposium on Operations Research (Mannheim, 1978).
[28] R.B. Wilson, "A simplicial algorithm for concave programming", Ph.D. Thesis, Graduate School of Business Administration, Harvard University, Boston, MA (1963).
[29] W.I. Zangwill, Nonlinear programming: A unified approach (Prentice-Hall, Englewood Cliffs, NJ, 1969).