Optimal Control
WILEY-INTERSCIENCE SERIES IN SYSTEMS AND OPTIMIZATION
Advisory Editors
Sheldon Ross, Department of Industrial Engineering and Operations Research, University of California, Berkeley, CA 94720, USA
Richard Weber, Cambridge University, Engineering Department, Management Studies Group, Mill Lane, Cambridge CB2 1RX, UK
GITTINS - Multi-armed Bandit Allocation Indices
KALL/WALLACE - Stochastic Programming
KAMP/HASLER - Recursive Neural Networks for Associative Memory
KIBZUN/KAN - Stochastic Programming Problems with Probability and Quantile Functions
VAN DIJK - Queueing Networks and Product Forms: A Systems Approach
WHITTLE - Optimal Control: Basics and Beyond
WHITTLE - Risk-sensitive Optimal Control
Optimal Control
Basics and Beyond

Peter Whittle
Statistical Laboratory, University of Cambridge, UK
JOHN WILEY & SONS
Copyright © 1996 by John Wiley & Sons Ltd, Baffins Lane, Chichester, West Sussex PO19 1UD, England
National 01243 779777  International (+44) 1243 779777
All rights reserved. No part of this book may be reproduced by any means, or transmitted, or translated into a machine language without the written permission of the publisher.
Cover photograph by courtesy of The News, Portsmouth
Other Wiley Editorial Offices
John Wiley & Sons, Inc., 605 Third Avenue, New York, NY 10158-0012, USA
Jacaranda Wiley Ltd, 33 Park Road, Milton, Queensland 4064, Australia
John Wiley & Sons (Canada) Ltd, 22 Worcester Road, Rexdale, Ontario M9W 1L1, Canada
John Wiley & Sons (SEA) Pte Ltd, Jalan Pemimpin #05-04, Block B, Union Industrial Building, Singapore 2057
Library of Congress Cataloging-in-Publication Data
Whittle, Peter.
Optimal control : basics and beyond / Peter Whittle.
p. cm. - (Wiley-Interscience series in systems and optimization)
Includes bibliographical references and index.
ISBN 0 471 95679 1 (hc : alk. paper). - ISBN 0 471 96099 3 (pb : alk. paper)
1. Automatic control. 2. Control theory. I. Title. II. Series.
TJ213.W442 1996
629.8 - dc20    95-22113 CIP
British Library Cataloguing in Publication Data
A catalogue record for this book is available from the British Library
ISBN 0 471 95679 1; 0 471 96099 3 (pbk)
Typeset in 10/12pt Times by Pure Tech India Limited, Pondicherry
Printed and bound in Great Britain by Biddles Ltd, Guildford and King's Lynn
This book is printed on acid-free paper responsibly manufactured from sustainable forestation, for which at least two trees are planted for each one used for paper production.
Contents

Preface

1 First ideas

BASICS

Part 1 Deterministic Models
2 Deterministic models and their optimisation
3 A sketch of infinite-horizon behaviour; policy improvement
4 The classic formulation of the control problem; operators and filters
5 State-structured deterministic models
6 Stationary rules and direct optimisation for the LQ model
7 The Pontryagin maximum principle

Part 2 Stochastic Models
8 Stochastic dynamic programming
9 Stochastic dynamics in continuous time
10 Some stochastic examples
11 Policy improvement: stochastic versions and examples
12 The LQG model with imperfect observation
13 Stationary processes; spectral theory
14 Optimal allocation; the multi-armed bandit
15 Imperfect state observation

BEYOND

Part 3 Risk-sensitive and H∞ Criteria
16 Risk-sensitivity: the LEQG model
17 The H∞ formulation

Part 4 Time-integral Methods and Optimal Stationary Policies
18 The time-integral formalism
19 Optimal stationary LQG policies: perfect observation
20 Optimal stationary LQG policies: imperfect observation
21 The risk-sensitive (LEQG) version

Part 5 Near-determinism and Large Deviation Theory
22 The essentials of large deviation theory
23 Control optimisation in the large deviation limit
24 Controlled first passage
25 Imperfect observation; non-linear filtering

Appendices
A1 Notation and conventions
A2 The structural basis of temporal optimisation
A3 Moment generating functions; basic properties

References

Index
Preface

Anyone who writes on the subject of control without having faced the responsibility of practical implementation should be conscious of his presumption, and the strength of this sense should be at least doubled if he writes on optimal control. Beautiful theories commonly wither when put to the test, usually because factors are present which simply had not been envisaged. This is the reason why the design of practical control systems still has aspects of an art, for all the science on which it now calls. Nevertheless, even an art requires guidelines, and it can be claimed that the proper function of a quest for optimality is just the revelation of fundamental guidelines. The notion of achieving optimality in systems of the degree of complexity encountered in practice is a delusion, but the attempt to optimise idealised systems does generate the fundamental concepts needed for the enlightened treatment of less ideal cases. This observation then has a corollary: the theory must be natural and incisive enough that it does generate recognisable concepts; a theory which ends in an opaque jumble of formulae has served no purpose.

'Control theory' is now understood not merely in the narrow sense of the control of mechanisms but in the wider sense of the control of any dynamic system (e.g. communication, distribution, production, financial, economic), in general stochastic and imperfectly observed. The text takes this wider view and so covers general techniques of optimisation (e.g. dynamic programming and the maximum principle) as well as topics more classically associated with narrow-sense control theory (e.g. stability, feedback, controllability). There is now a great deal of standard material in this area, and it is to this that the 'basics' component of the book provides an introduction. However, while the material may be standard, the treatment of the section is shaped considerably by consciousness of the 'beyond' component into which it leads.

There are two pieces of standard theory which impress one as complete: one is the Pontryagin maximum principle for the optimisation of deterministic processes; the other is the optimisation of LQG models (a class of stochastic models with Linear dynamics, Quadratic costs and Gaussian noise). These have appeared like two islands in a sea of problems for which little more than an ad hoc treatment was available. However, in recent years the sea-bed has begun to rise and depths have become shallows, shallows have become bridging dry land. The class of risk-sensitive models, LEQG models, was introduced, and it was
found that the LQG theory could be extended to these, although the mode of extension was sufficiently unevident that its perception added considerable insight. At about the same time it was found that optimisation on the H∞ criterion was both feasible, in that analytic advance was possible, and useful, in that it gave a robust criterion. Unexpectedly and beautifully, these two lines of work coalesced when it was realised that the H∞ criterion was a special case of the LEQG criterion, for all that one was phrased deterministically and the other stochastically. Finally, it was realised that, if large-deviation theory is applicable (as it is when a stochastic model is close to determinism in a certain sense), then all the exact results of the LQG theory have a version which holds in considerable generality. These successive insights revealed a structure in which concepts which had been familiar in special contexts for decades (e.g. time-integral solutions, Hamiltonian structure, certainty equivalence, solution by canonical factorisation) were seen to be closely related and to supply exactly the right view of a very general class of stochastic models. The 'beyond' component is devoted to exposition of this material, and it was the fact that such a connected treatment now seems possible which motivated the writing of this text.

Another motivation was the desire to write a successor to my earlier work Optimisation over Time (Wiley 1982, 1983). However, it is not squarely a successor. I wanted to write something much more homogeneous and tightly focused, and the restriction to the control theme provided that tightness. Remarkably, the recent advances mentioned above also induced a tightening, rather than the loosening one might have expected. For example, it turns out that the discounted cost criterion so beloved of exponents of dynamic programming is logically inconsistent outside a rather narrow context (see Section 16.12). In control contexts it is natural to work with either total or time-averaged cost (in terminating or non-terminating situations respectively). The algorithm which emerges as natural is the iterative one of policy improvement. This has intrinsically a clear variational basis; it can also be seen as a Newton-Raphson algorithm (Section 3.5) whose second-order convergence is often rapid enough that a single iteration is enlightening (see Section 3.7 and the examples of Chapter 11); it implies similarly effective algorithms in derived work, e.g. for the canonical factorisations of Chapters 18-21.

One very important topic to which we give little space is that of dual control. By this is meant the use of control actions to evoke information as well as to govern the dynamics of the system, with its associated concepts of adaptive control, self-tuning regulators, etc. Chapter 14 on the multi-armed bandit constitutes almost the only substantial discussion. Despite the fact that the idea of dual control emerges spontaneously in any effort to optimise the running of a stochastic dynamic system, the topic seems too demanding and idiosyncratic to be treated in passing. Indeed, one may say that the treatment of this book pushes a certain line about as far as it can be taken, and that this line necessarily skirts
dual control. In all our formulations of the LQG model, the LEQG model, large-deviation versions and even minimax control we find that there is a certainty equivalence principle. The principle indeed generally takes a more sophisticated form than that familiar from the simple LQG case, but any such principle must by its nature exclude dual control: the notion that control actions affect information gained.

Another topic from which we refrain, despite the attention it has received in recent years, is the use of J-factorisation techniques and the like to determine all stabilising controls satisfying some lower bound on performance. This topic is important because of the increased emphasis given to robustness: the realisation that it is of little use if a control is optimal for a specified model if its performance deteriorates rapidly with departure from that specification. However, we take reassurance from one conclusion which this body of work establishes: that if a control rule is optimised under the assumption that there is observation error then it is also proofed to some extent against errors in model specification (see Section 17.3). The factorisation techniques which we employ are those associated with the formulation of optimal control as the extremisation of a suitably defined time-integral (even in the stochastic case). This is a class of ideas completely distinct from that of J-factorisation, and with its own particular elegance.

My references to the literature are not systematic, but I have certainly given credit for all recent work for which I knew an attribution. However, there are many sections in which I have worked out my own treatment, very possibly in ignorance of existing work. Let me apologise in advance to authors thus unwittingly overlooked, and affirm my readiness to correct the record at the first opportunity.

A substantial proportion of this work was completed before my retirement in 1994 from the Churchill Chair, endowed by the Esso Petroleum Company. I am profoundly indebted to the Company for its support over my 27-year occupancy of the Chair.
CHAPTER 1
First Ideas

1 CONTROL AS AN OPTIMISATION PROBLEM

One tends to think of 'control' as meaning the control of mechanisms: e.g. the classic stabilisation of the speed of a steam engine by the centrifugal governor, the stabilisation of temperature in a central heating system, or the many automatic controls built into a modern aircraft. However, the controls built into an aircraft are modest compared with those which Nature has built into any higher organism; a biological rather than a mechanical system. This can be taken as an indication that any system operating in time, be it mechanical, electrical, biological, economic or industrial, will need continuous monitoring and correction if it is to keep on course. In other words, it needs control. The efficient running of the dynamic system constituted by an economy or a factory poses a control problem just as much as does the operation of an aircraft. The fact that control actions may be realised by procedures or by conscious decisions rather than by mechanisms is a matter of implementation rather than of principle. (Although it is also true that it is the higher-level decisions, laying out the general course one wishes the system to follow, which will be taken consciously, and it is the lower-level decisions which will be automated. The more complex the system, the more need there will be for an automated low-level decision structure which ensures that the system actually follows the course decided by higher-level policy.)

In traditional control theory the problem is regarded very much as one of stability: that departures from the desired course should certainly be corrected ultimately, and should preferably be corrected quickly, smoothly and effortlessly. Since the mid-century increasing attention has been given to more specific design criteria: control rules are chosen so as to minimise a cost function which appropriately penalises both deviation from course and excessive control action. That is, the design problem is formulated as an optimisation problem.

This has virtues, in that it leads to a sharpening of concepts; indeed, to the generation of concepts. It has faults, in that the model behind the optimisation may be so idealised that it leads to a non-robust solution: a solution which is likely to prove unacceptable if the actual system deviates at all from that supposed. However, as is usual when 'theory' is criticised, this objection is not a criticism of theory as such, but criticism of a naive theory. One may say, indeed, that optimisation exposes the weaknesses in thinking which are usually compensated for by soundness of intuition. By this is meant that, if one makes certain assumptions,
then an attempt at optimisation will go to the limit in some direction consistent with a literal interpretation of these assumptions. It is not a bad idea, then, to see how an ill-posed attempt at optimisation can reveal the pitfalls and point the way to their remedy.

2 AN EXAMPLE: THE HARVESTING OF A RENEWABLE RESOURCE

A good example of the harvesting of a renewable resource would be the operation of a fishery. Consider the simplest case, in which the description of current fish stocks is condensed to a single variable, x, the biomass. That is, we neglect the classification by species, age, size and location which a more adequate model would obviously require. We also neglect the effect of the seasons (although see Exercise 1) and suppose simply that, in the absence of fishing, biomass follows a differential equation
ẋ = a(x)   (1)
where ẋ is the rate of change of x with time, dx/dt. The function a(x) represents the rate of change of biomass, a net reproduction rate, and in practice has very much the course illustrated in Figure 1. It is initially positive and increasing with x, but then dips and becomes negative for large x, as the demands which a large biomass levies on environmental resources make themselves felt. Two significant stock levels are x_0 and x_m, distinguished in Figure 1. The stock level x_0 is the equilibrium level for the unharvested population, that at which the net reproduction rate is zero. The stock level x_m is that at which the net reproduction rate is greatest. If stocks are depleted at a rate u by fishing then the equation becomes
ẋ = a(x) − u.   (2)
Figure 1 The postulated form of the net reproduction rate for a population. This rate is maximal at x_m and it is zero at x_0, which would consequently be the equilibrium level of the unharvested population.
Figure 2 The values x_1 and x_2 are the possible equilibrium levels of population if harvesting is carried out at a fixed rate u for x > 0. These are respectively unstable and stable, as is seen from the indicated direction of movement of x.
Note that u is the actual catch rate, rather than, for example, fishing effort. Presumably a given effort yields less in the way of catch as x decreases until, when x becomes zero, one could catch at no faster rate than the rate a(0) at which the population is being replenished from external sources (which may be zero). Suppose, nevertheless, that one prescribes a fishing policy by announcing how one will determine u. If one chooses u varying with x then one is showing some responsiveness to the current state; in control terminology one is incorporating feedback. However, let us consider the most naive policy (which is not to say that it has not been used): that which sets u at a definite fixed value for x > 0.

An equilibrium value of x under this policy must satisfy a(x) = u, and we see from the graph of Figure 2 that this equation has in general two solutions, x_1 and x_2, say. Recall that the domain of attraction of an equilibrium point is the set of initial values x for which the trajectory would lead to that equilibrium. Further, that the equilibrium is stable (in a local sense) only if all points in some neighbourhood of it lie in its domain of attraction. Examining the sign of ẋ = a(x) − u, we see that the lesser value x_1 has only itself as domain of attraction, and so is unstable. The greater value x_2 has x > x_1 as domain of attraction, and so is stable. One might pose as a natural aspiration: to choose the value of u which is largest consistent with existence of a stable equilibrium solution, and this would seem to be

u = u_m := a(x_m).

That is, the maximal value of u for which a(x) = u has a solution, and so for which the equilibrium operating point is such that the biomass replaces itself at the maximal rate.
Figure 3 If the fixed harvesting rate is taken as high as u_m, then the equilibrium at x_m is only semi-stable.
However, this argument is fallacious, and its adoption is said to be the reason why the Peruvian anchovy fishery crashed between 1970 and 1973 from an annual catch of 12.3 million tons to one of 1.8 million tons (Clark, 1976). As u increases to u_m then x_1 and x_2 converge to the common value x_m. But x_m has domain of attraction x ≥ x_m, and so is only semi-stable (Figure 3). If the biomass drops at all from the value x_m then it crashes to zero. In Exercise 10.4.1 we consider a stochastic model of the situation which makes the same point in another way.

We shall see in the next chapter that the policy which indeed maximises the steady-state harvest rate is that which one might expect: to fish at the maximal feasible rate (presumably greater than u_m) for x > x_m and not to fish at all for x < x_m. This makes the stock level x_m a stable point of the controlled system, at which one achieves an effective harvest rate of a(x_m). At least, this is the optimal policy for this simple model; the model can be criticised on many grounds.
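The fragility of the fixed-rate policy is easily demonstrated numerically. The following sketch is illustrative only: it assumes, purely for the sake of example, a logistic net reproduction rate a(x) = rx(1 − x/K), for which x_m = K/2 and u_m = rK/4, and integrates equation (2) by Euler's method.

    # Illustrative simulation of xdot = a(x) - u under the naive fixed-rate
    # policy.  The logistic form of a(x) is an assumption made here for the
    # sake of the example; with r = K = 1 one has xm = 0.5 and um = 0.25.

    def a(x, r=1.0, K=1.0):
        return r * x * (1.0 - x / K)

    def simulate(x, u, dt=0.01, steps=20000):
        """Euler integration, harvesting at rate u only while stock remains."""
        for _ in range(steps):
            x += dt * (a(x) - (u if x > 0 else 0.0))
            x = max(x, 0.0)
        return x

    print(simulate(0.5, 0.24))   # u just below um: settles at the stable x2 = 0.6
    print(simulate(0.5, 0.26))   # u just above um: the stock crashes to zero

With u just below u_m the stock settles at the stable equilibrium x_2; with u just above u_m there is no equilibrium at all and the simulated biomass collapses, exactly the behaviour described above.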
Exercises and comments

(1) One can to some extent consider seasonal effects by considering a discrete-time model

x_{t+1} = a(x_t) − u_t

in which time t moves forwards in unit steps (corresponding to the annual cycle) rather than continuously. In this case the function a has the form of Figure 4 rather than of Figure 1. The same arguments can be applied as in the continuous-time case, although it is worth noting that it was this model (with u = 0) which provided the first and simplest demonstration of chaotic effects.

(2) Suppose that the constant value presumed for u when x > 0 exceeds a(0), with u = 0 for x = 0.
Figure 4 The form of the year-to-year reproduction rate.
Then x = 0 is effectively a stable equilibrium point, with an effective harvest rate u = a(0). This is because one harvests at the constant rate the moment x becomes positive, and drives the biomass back to zero again. One has then a 'chattering' equilibrium, at which the alternation of zero and infinitesimally positive values of x (and of zero and positive values of u) is infinitely rapid. The effective harvest rate must balance the immigration rate, a(0). At this level, a fish is caught the moment it appears from somewhere. Under the policy indicated at the end of the section the equilibrium at x_m is of course also a 'chattering' one. Practical considerations would smooth out both operation and solution around this transition point.

3 DYNAMIC OPTIMISATION TECHNIQUES

The crudity of the control rule of the previous section lay, of course, in the assumption of a constant harvest rate. The harvest rate must be adapted to current conditions, and in such a way as to ensure that, at the very least, a depleted population can recover. With improved dynamics it may well be possible to retain the point of maximal productivity x_m as the equilibrium operating point. However, one certainly needs a basis for the deduction of good dynamic rules. There are a number of approaches, all ultimately related.

The first is the classical design approach, with its primary concern to secure stability at the desired operating point and, after that, other desirable dynamic characteristics. This shares with later approaches at least one set of techniques: the techniques needed to handle dynamic systems (see Chapters 4 and 5).

One optimisation approach is that of laying direct variational conditions on the path of the process; of requiring that there should be no variation of the path, consistent with the prescribed dynamics, which would yield a smaller cost. The optimisation problem is then cast as a problem in the calculus of variations. However, this classic calculus needs modification if the control problem is to be
accommodated naturally, and the form in which it is effective is that of the Pontryagin maximum principle (Chapter 7). This is a valuable technique, but one which would seem to be applicable only in the deterministic case. However, it has a natural version for at least certain classes of stochastic models; see Chapters 16, 18-21, 23 and 25.

Another approach is the recursive one, in which one optimises the control action at a given time on the assumption that the optimal rule for later times has already been determined. This leads to the dynamic programming technique, a technique which is central and which has the merit of being immediately applicable also in the stochastic case (see Chapter 8). It is this approach which in a sense provides the spine of our treatment, although we shall see that all other methods are related to it and sometimes provide advantageous variants of it. It is also true that there is merit in methods which display the future options for the controlled process more clearly than does the dynamic programming technique (see the certainty equivalence principles of Chapters 12 and 16).

One might say that methods which are expressed in terms of the predicted future path of the process (such as the maximum principle, the certainty equivalence principle and the time-integral methods of Chapters 18-21) correspond to the approach of a chess-player who explores a range of future scenarios in his mind before he makes a move. The dynamic programming approach reflects the approach of the player who has built up a mental evaluation of all possible board configurations, and so can replace the long-term goal of winning by the short-term goal of choosing a move which leads to a higher-value configuration. There is virtue both in the explicit awareness of future possibilities and in the ability to be guided to the same effect by aiming for some more immediate goal.

Finally, there is the relatively naive approach of simply choosing a reasonable control rule and evaluating its performance (by, say, determination of the average cost associated with the rule under equilibrium conditions). It is seldom easy to optimise the rule at this stage; the indirect routes to optimisation are more effective and more revealing. However, there is a systematic method of improving such solutions to yield something which is well on the way to optimality. This is the technique of policy improvement (see Chapters 3 and 11), an approach also derived from dynamic programming. Judged either as an analytic or a computational technique, this may be the single most important tool. In cases where optimality may be an unrealistic ambition, even a false one, it offers a way of starting from a humble base and achieving performance comparable with the optimal. The revision of policy that it recommends can itself convey insight. Policy improvement has a good theoretical basis, has a natural expression in all the characterisations of optimality and, as an iterative technique, it shows second-order convergence to optimality.
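For a small finite problem the policy-improvement iteration just described can be written in a few lines. The following sketch is illustrative only: it uses a discounted criterion for simplicity (the text itself develops the total- and average-cost versions in Chapters 3 and 11), and the plant and costs are invented for the example.

    # Policy improvement for a finite deterministic problem with discounted
    # cost: plant x' = a(x, u), instantaneous cost c(x, u), discount beta.
    # All numerical ingredients here are invented for the example.

    X, U, beta = range(5), range(3), 0.9

    def a(x, u):                   # plant equation: next state
        return (x + u) % 5

    def c(x, u):                   # instantaneous cost
        return (x - 2) ** 2 + u

    def evaluate(policy, sweeps=500):
        """Value of a fixed policy: F(x) = c(x, pi(x)) + beta * F(a(x, pi(x)))."""
        F = {x: 0.0 for x in X}
        for _ in range(sweeps):
            F = {x: c(x, policy[x]) + beta * F[a(x, policy[x])] for x in X}
        return F

    def improve(policy):
        """Choose, at each state, the control that looks best against F."""
        F = evaluate(policy)
        return {x: min(U, key=lambda u: c(x, u) + beta * F[a(x, u)]) for x in X}

    policy = {x: 0 for x in X}
    while True:                    # usually terminates after very few rounds
        new = improve(policy)
        if new == policy:
            break
        policy = new
    print(policy)

The rapid termination of the loop in examples of this kind reflects the second-order convergence mentioned above.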
4 ORGANISATION OF THE TEXT
Conventions on notation and standard notations are listed in Appendix 1.

While much of the treatment of the text is informal, conclusions are either announced in advance or summarised afterwards in theorem-proof form. This form should be regarded as neither forbidding nor pretentious, but simply as the best way of punctuating and summarising the discussion. It is also by far the best form for readers looking for a quick reference on some point.

It does create one difficulty, however. There are theorems whose validity is completely assured by the conditions stated: mathematicians could conceive of nothing else. However, there are situations where arguments of less than full rigour have led one to considerable penetration and to what one indeed believes to be the essential insight, but for which the aspiration to full rigour would multiply the length of the treatment and obscure its point. This is particularly the case when the topic is new enough that a rigorous treatment, even if available, is itself not insightful. One would still wish to summarise assertions, however, leaving it to be understood that the truth of these is subject to technical conditions of a nature neither stated nor verified. Such summary assertions should not properly be termed 'theorems'. We cover this point by starring the second type. So, Theorem 2.3.1 is true as it stands. On the other hand, *Theorem 7.2.1 is 'essentially' valid in statement and proof, but both would need technical supplement before the star could be removed.

Exercises are in some cases substantial. In others they simply make points which, although important or interesting in themselves, would have interrupted the discussion if they had been incorporated into the main text.

Theorems carry chapter and section labels. Thus, Theorem 2.3.1 is the first theorem of Section 3 of Chapter 2. Equations are numbered consecutively through a chapter, however, without chapter label. A reference to equation (18) would thus mean equation (18) of the current chapter, but a reference to equation (3.18) would mean equation (18) of Chapter 3. A similar convention holds for figures.
BASICS

PART 1

Deterministic Models

CHAPTER 2

Deterministic Models and their Optimisation

1 STATE STRUCTURE, OPTIMISATION AND DYNAMIC PROGRAMMING
The dynamic operation one is controlling is referred to as the 'process' or the 'plant' more or less interchangeably; we shall usually take 'system' as including sensors, controls and even command signals as well as plant. The set of variables which describe the evolution of the process will be collectively termed the process variable and denoted by x. The control variable, whose value can be chosen by the optimiser, will be denoted by u. This is consistent with the notation of Chapter 1.

Models are termed stochastic or deterministic according to whether randomness enters the description or not. We shall see that the incorporation of stochastic effects (i.e. of randomness) is essentially a way of recognising that the values of certain variables may be unknown; in particular, that the future course of certain input variables may be only imperfectly predictable. We restrict ourselves to deterministic models in these first seven chapters.

We shall denote time by t. Physical models are naturally phrased in continuous time, when t may take any value on the real axis. However, it is also useful to consider models in discrete time, when t is considered to take only integer values t = ..., −2, −1, 0, 1, 2, .... This corresponds to the notion that the process develops in stages, of equal length. It is a natural view in economic contexts, for example, when data become available at regular intervals, and so decisions tend to be taken at the same intervals. Even engineers operate in this way, when they work with 'sampled data'. Discretisation of time is inevitable if control values are determined digitally. There are mathematical advantages in starting with a discrete-time formulation, even if one later transfers the treatment to the more physical continuous-time formulation. We shall in general try to cover material in both versions.

There are two aspects of the model which must be specified if the control optimisation problem is to be properly formulated. The first of these is the plant equation; the dynamic evolution rule that x obeys for given controls u. This describes the dynamics of the system which is to be controlled, and must be derived from a physical model of that system. The second aspect is the performance criterion, which usually implies specification of a cost function. This
cost function penalises all aspects of the path of the process which are regarded as undesirable (e.g. deviations from required path, lack of smoothness, depletion of resources) and the control policy is to be chosen to minimise it.

Consider first the case of an uncontrolled system in discrete time. The plant equation must then take the form of a recursion expressing x_t in terms of previous x-values. Suppose that this recursion is first-order, so taking the form
x_t = a(x_{t-1}, t)   (1)

where we have allowed dynamics also to depend upon time. In this case the variable x constitutes a dynamically complete description of the state of the system, in that the future course {x_τ; τ > t} of the process at time t is determined totally by x_t, and is independent of the path {x_τ; τ < t} by which x_t was reached. A model with this property is said to have state structure, and the process variable x can be more strongly characterised as the state variable.

State structure for a controlled process in discrete time will also require that the model is, in some sense, simply recursive. It is a property of system dynamics and the cost function jointly. We shall assume that the plant equation takes the form

x_t = a(x_{t-1}, u_{t-1}, t),   (2)
analogously to (1). Further, if one is optimising over the time period 0 ≤ t ≤ h, we shall assume that the cost function C takes the additive form

C = Σ_{τ=0}^{h-1} c(x_τ, u_τ, τ) + C_h(x_h) = Σ_{τ=0}^{h-1} c_τ + C_h,   (3)
say. The end-point h is referred to as the horizon, the point at which operations close. It is natural to regard the terms c_τ and C_h as costs incurred at time τ and time h respectively; we shall refer to them as the instantaneous and closing costs. We have thus assumed, not merely additivity, but also that the instantaneous cost depends only on current state, control and time, and that the closing cost depends only upon the closing state x_h. One would often refer to x_h and C_h as the terminal state and the terminal cost respectively. However, we shall encounter processes which may terminate in other ways before the horizon point is reached (e.g. by accident or by bankruptcy) and it is useful to distinguish between the cost incurred in such a physical termination and one incurred simply by the state one is left in at the expiry of the planning period.

We have yet to define what we mean by 'state structure' in the controlled case, but shall see in Theorem 2.1.1 that assumptions (2) and (3) do in fact imply the simply recursive character of the optimisation problem that one would wish. Relation (2) is of course a simple forward recursion, and the significant property of the cost function (3) turns out to be that it can be generated by a simple backward recursion. We can interpret the quantity
C_t = Σ_{τ=t}^{h-1} c_τ + C_h,   (4)

as the cost incurred from time t onwards. It plainly obeys the backward recursion

C_t = c(x_t, u_t, t) + C_{t+1}   (t < h).   (5)
We have the optimisation problem formulated as the problem of choosing {u_t; 0 ≤ t < h} to minimise the cost function (3), subject to validity of the plant equation (2) over the time interval 0 < t ≤ h. If the plant equation (2) can be solved for u_t in terms of x_t and x_{t+1}, then one can use it to eliminate the control variables, in which case the cost function can be written

C = Σ_{τ=0}^{h-1} g(x_τ, x_{τ+1}, τ) + C_h(x_h)   (3')
for some function g. The study of the extremisation of such forms is exactly the province of the calculus of variations (see Exercise 6), which was for long the classic tool for the treatment of dynamic optimisation problems. However, Equation (2) may simply not be soluble for u, and elimination again implies a loss of structure.

Yet a third approach is to regard the problem as one of minimising C with respect to both sequences {x_t} and {u_t} jointly. The plant equation (2) then constitutes a set of constraints on the variables which can be allowed for by the introduction of Lagrange multipliers. This approach avoids the unnatural elimination of significant variables, and, as always, the multipliers themselves turn out to be significant. It can be regarded as providing one route to the celebrated maximum principle (see Chapter 7), itself a special case of the 'time-integral' formulation which we shall find fundamental and valuable.

However, there is yet a fourth approach, which is to exploit the recursive character of the state formulation to derive a recursive treatment of the optimisation problem. In equation (4) we defined C_t, the cost incurred from time t. Define also the value function F(x_t, t) as the minimal value of C_t with respect to remaining control values u_τ (t ≤ τ < h).
Theorem 2.1.1 Assume the plant equation (2) and the cost function (3). Then:

(i) The value function F obeys the dynamic programming equation
F(x, t) = inf_u [c(x, u, t) + F(a(x, u, t), t + 1)]   (t < h)   (6)

with terminal condition

F(x, h) = C_h(x).   (7)

(ii) The minimising value of u in (6) is a function of x and t only, u(x, t), say, and the optimal value of control at time t is u_t = u(x_t, t).

(iii) This expression for u_t is optimal no matter how past values of control have been chosen.

Proof Relation (7) follows from the definition of F. The backward recursion (6) follows from recursion (5) if we minimise both sides, first over u_τ for τ > t and then over u_t. The minimising value of u_t is optimal, and this is exactly the minimising value in (6). It is clear that both the minimising value of u and the minimised value
of the right-hand member in (6) are functions of x and t alone. Assertion (iii) follows simply from the fact that nothing has been assumed of previous history (x_τ, u_τ; τ < t) except that it is known. □

The key recursion (6) is referred to as either the dynamic programming equation or the optimality equation. Let us consider the conclusions of the theorem in reverse order. Assertion (iii) follows simply from the way the dynamic programming equation was set up. An optimal control rule that remains optimal no matter how past controls have been chosen is said to be closed-loop optimal, for reasons to be explained in Chapter 4. Other methods of optimisation may well not yield this closed-loop form, since they assume that past as well as future has been optimised (see Section 4). The closed-loop property is closely related to that of feedback: that, just as control u affects the process variable x through the plant equation, so x affects u through the control rule.

If it is a general consequence of the recursive approach that past policy is irrelevant, it is a consequence of our specific assumptions that past history is irrelevant. This is the striking implication of assertion (ii): that the optimal value of u_t is a function of x_t and t alone. However, it is also striking that, to determine this rule, one must solve the dynamic programming Equation (6); i.e. determine the value function and the optimal control jointly. The value function in a way telescopes the future.

Let us repeat the analogy of Section 1.3. If one tries to determine the optimal value of u_t by directly minimising the cost function (3) with respect to all future controls u_τ (τ ≥ t) one is in the position of a chess-player who tries to determine his next move by working through all possible future patterns of play in his mind. If one determines the optimal u_t from the dynamic programming equation one is in the position of a chess-player who is so experienced that he can set a value F(x) on any board configuration x and chooses his move so as to improve that value. He
thus replaces the ultimate goal of winning by the immediate goal of moving to as favourable a position as possible at the next stage.

The plant equation (2) is a forward recursion in time, in that x_t is determined in terms of past values. However, the dynamic programming equation (6) is a backward recursion, in that the value function at time t is determined in terms of that at time t + 1. The reason for this is that one cannot optimise the control rule at a given time unless one knows what control policy will be after that time; optimisation runs backwards. As Kierkegaard says, life must be lived forwards but understood backwards.

However, all recursions occurring are simple; a consequence of our plant and cost assumptions. One might say that a variable ξ_t is sufficient for the optimisation problem if (i) both the value function and the optimal control at time t are functions of ξ_t alone, and (ii) ξ_t is updated by a simple recursion, in that ξ_t can be calculated as a function of ξ_{t-1} and u_{t-1} alone. Theorem 2.1.1 then implies that (x_t, t) is sufficient, under the assumptions stated. One can then reasonably term x a state variable for the optimisation problem.

Sometimes future costs are discounted by modifying the cost function (3) to
C = Σ_{τ=0}^{h-1} β^τ c_τ + β^h C_h   (8)
where β is a factor lying between 0 and 1, the discount factor. The thought behind this is that compound interest causes money to increase in value by a factor of β^{-1} per unit time. A unit amount of money available now is thus worth β^{-τ} after a time τ. Conversely, a unit amount of money available after time τ has a present value of β^τ. Hence the β-coefficients in expression (8), if we assume that costs are expressed in monetary terms.

Discounting is often introduced in this context, partly for the ostensible accounting reasons we have indicated, partly just as a mathematical device to ensure convergence of the total cost C as the horizon is allowed to recede to the distant future. However, while accounting considerations may be relevant for investment decisions, they are scarcely so for moment-by-moment control decisions, and the unthinking inclusion of a discount can indeed lead to inconsistencies (see Section 16.12). We shall consider the discounting option for the cases for which it is realistic, but shall see in the next chapter that there are other views of the infinite-horizon limit which are perhaps more appropriate in the control context.

One could simply absorb the factor β^τ into the τ-dependence we have already allowed in c_τ. However, in general it is useful to distinguish discounting from genuine time-dependence of costs, and to adapt the definition (4) of C_t to
C_t = Σ_{τ=t}^{h-1} β^{τ-t} c_τ + β^{h-t} C_h;   (9)
the present value (at time t) of future costs. With the value function F(x, t) defined in terms of C_t as before, we derive the modified optimality equation

F(x, t) = inf_u [c(x, u, t) + βF(a(x, u, t), t + 1)],   (10)
again with terminal condition (7).

If the dynamics and instantaneous costs are independent of time, so that the functions a and c are functions a(x, u) and c(x, u) of current state and control variables alone, then the problem is said to be time-homogeneous. In such cases the value function F(x, t) will still be t-dependent in general, because of the effect of a finite horizon h. However, it will depend upon t and h only in the difference s = h − t, the time to go. In such cases we shall often write F(x, t) as F_s(x). If F is also independent of t (which we expect may be the case for a time-homogeneous model if the horizon is infinite and the infinite-horizon total cost C is well defined) then the problem is said to be time-invariant. We shall consider the continuous-time analogue of this material in Section 6.

We shall often use the term control policy, meaning simply a control rule: a rule for determining control values from currently known observables. The clear definition of these terms emerges first when we admit that not everything is observable, and so adopt the stochastic formulation of Chapter 8. A stationary policy is one for which we dispense with the observable 'clock time', and so form the control rule in a time-invariant fashion.
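Before turning to the exercises, it may help to see recursion (6) in computational form. The following sketch is not from the text: the plant, costs, grids and horizon are invented purely for illustration, and setting β < 1 gives the discounted form (10).

    # Backward induction for F(x,t) = min_u [c(x,u,t) + beta*F(a(x,u,t), t+1)],
    # with terminal condition F(x,h) = Ch(x), as in equations (6), (7), (10).

    h, beta = 10, 1.0
    X = range(-5, 6)                        # state grid (invented)
    U = (-1, 0, 1)                          # admissible controls (invented)

    def a(x, u, t):                         # plant equation (2)
        return max(-5, min(5, x + u))

    def c(x, u, t):                         # instantaneous cost
        return x * x + u * u

    F = {x: 10 * x * x for x in X}          # terminal condition (7)
    policy = {}
    for t in reversed(range(h)):            # t = h-1, ..., 0: backwards in time
        Fn = {}
        for x in X:
            u = min(U, key=lambda w: c(x, w, t) + beta * F[a(x, w, t)])
            policy[(x, t)] = u
            Fn[x] = c(x, u, t) + beta * F[a(x, u, t)]
        F = Fn
    print(F[3], policy[(3, 0)])             # value and first control from x = 3

Note how the computation runs backwards from the horizon, exactly as the discussion above explains, while the resulting policy is applied forwards.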
Exercises and comments

(1) Retain the plant equation (2) but suppose the cost function (3) modified to C = max{max_{0≤τ<h} c(x_τ, u_τ, τ), C_h(x_h)}. Show that the conclusions of Theorem 2.1.1 continue to hold, with the sum in the dynamic programming equation (6) replaced by a maximum.

(2) More generally, one could derive a dynamic programming treatment under the assumptions of a simply-recursive plant equation (2) and a cost function for which the cost C_t from time t obeys a backward recursion C_t = c(x_t, u_t, t, C_{t+1}) for some c, monotonic increasing in the C argument. Recursion (5), linear in costs, then seems very special. However, the case of a linear cost-recursion is virtually the only one which is straightforward for stochastic models.

(3) Other variants of the model can be brought to state structure by appropriate transformations. For example, had the plant equation (2) been a second-order recursion x_t = a(x_{t-1}, x_{t-2}, u_{t-1}, t), then state structure could have been achieved by adoption of a transformed process variable X_t = (x_t, x_{t-1}).

(4) One can achieve time-homogeneity formally by changing the process variable from x_t to ξ_t = (x_t, t). What is the form of recursions (1) and (2) in terms of ξ? This standardisation is, although superficial, useful for the compact discussion of, for example, the maximum principle and stopping problems.
(5) A state-structured problem can sometimes be reduced even further, so that a lower-dimensional function of the state variable is sufficient, in the sense defined above. Suppose that a discrete-time system with a vector state variable has a plant equation which is linear in state, x_t = A_t x_{t-1} + b(u_{t-1}, t), an instantaneous cost c(u_t, t) which is independent of state and a terminal cost at time h which is a function of a linear function of terminal state, d^T x_h. Show that t and the 'predicted terminal miss-distance' d^T x_h^{(t)} are jointly sufficient at time t, where x_h^{(t)} is the value of x_h predicted at time t from the plant equation.
(6) Suppose that expression (3') for C is differentiable with respect to its x-arguments, which we suppose for simplicity can take arbitrary values on the real line. The stationarity conditions

∂/∂x_t [g(x_{t-1}, x_t, t − 1) + g(x_t, x_{t+1}, t)] = 0   (0 < t < h)   (11)
are then necessary if the path {x_t} is to minimise C. The continuous-time analogue of expression (3') would be an integral over time

C = ∫_0^h g(x, ẋ, t) dt + ···
where ẋ = dx/dt and + ··· indicates possible separate contributions to C from the end-points, t = 0 and h. If we perturb x(t) to x(t) + δ(t) then the perturbation in C is, to first-order terms in δ,

δC = ∫_0^h [(∂g/∂x)δ + (∂g/∂ẋ)δ̇] dt + ··· = ∫_0^h [∂g/∂x − (d/dt)(∂g/∂ẋ)]δ dt + end-point terms.
The requirement of stationarity with respect to variations δ then leads to the condition that the square-bracketed expression in the final integrand should be zero for 0 < t < h. This is the classic Euler condition, the continuous-time analogue of (11).

2 OPTIMAL CONSUMPTION

One can hope to solve the dynamic programming equation (6) analytically, or at least to reduce it in some way that yields insight. Failing this, the recursion can be solved computationally in favourable cases. A tractable example is provided by a simple model of consumption and investment. The model is too simple for realism, but does provide some insight, and is a first version of more substantial models of portfolio management and, indeed, of economic development (see Section 3.4).
We can regard the optimiser as an investor who wishes to split his capital between investment and consumption at each stage. The capital x can be taken as the state variable; a non-negative scalar assumed to follow the plant equation

x_t = a(x_{t-1} − u_{t-1}).   (12)

Here u_t is the amount diverted for consumption at time t and a is the factor by which invested capital appreciates over unit time. It is more natural to phrase the problem as one of maximising the total benefit the investor receives from consumption (which an economist would term utility) rather than of minimising cost. Let r(u) be the utility he derives from consumption u, and suppose that the total utility is just the sum of discounted instantaneous utilities. If debt is not permitted then necessarily x ≥ 0 and so u_t ≤ x_t; if consumption cannot be reversed to become production then u ≥ 0. Let F_s(x) be the maximal utility that can be gained with s time periods to go, starting from capital x. The dynamic programming equation is then
F_s(x) = max_{0≤u≤x} [r(u) + βF_{s-1}(a(x − u))]   (s > 0).   (13)
The concept of discounting was motivated in the last section by the notion that money invested will grow by compound interest. It may then seem perplexing to see the same feature built in twice: a discounting by a factor β as well as the growth of invested capital by a factor a (as expressed in the plant equation (12)). The more natural interpretation of β in this case is as a survival probability; the probability that the investor will survive the next unit of time to enjoy his savings.

If we make the assumptions

r(u) = u^{1-v},   F_0(x) = x^{1-v}

for the form of the utility functions then (13) is easily solved. Here v is a constant in (0, 1). The form for r(u) is plausible; there are economic reasons for assuming utility to be concave and non-decreasing. That the terminal reward F_0 should have the same form is perhaps debatable, but the assumption is a convenient one. We leave the reader to verify the solutions
F_s(x) = J_s x^{1-v},   u_opt = J_s^{-1/v} x

for the maximal utility and the optimal consumption at horizon s. Here {J_s} is a sequence of constants obeying

J_s^{1/v} = 1 + γ J_{s-1}^{1/v}

with γ = (βa^{1-v})^{1/v}, so that

J_s^{1/v} = γ^s J_0^{1/v} + (1 − γ^s)/(1 − γ).
The constant γ can be interpreted as the rate at which utility grows, so that sacrifice of an amount δ of current utility can yield utility γ^s δ after a time s (for small δ and with optimal decisions; see Ex. 2). If γ > 1, so that there is an incentive to delay consumption, then we see from the solution above that J_s grows as γ^{sv} and u_opt/x falls as γ^{-s} for large s. The second relation implies that, if utility growth is possible, then consumption is indeed delayed to almost the end of the programme. If γ < 1 then one finds the limiting values
F_s(x) → x^{1-v}/(1 − γ)^v,   u_opt → (1 − γ)x
as s becomes large. That is, if to delay consumption is unrewarding, then maximal utility over an infinite horizon is finite. Furthermore, the optimal policy then has the stationary form that one reinvests a fraction γ of one's capital at all times. However, this stationarity does not imply constant capital; see Exercise 1.

This kind of model in general implies the conclusion that one sacrifices consumption to investment (i.e. to growth) if economic growth is possible. The conclusion is one which is modified at the individual level by finite life expectancy, as we have seen. It is also sensitive to the form of the utility function, and of course fails if there are physical limits to growth.

Exercises and comments

(1) Under the optimal policy derived above capital grows at each stage by a factor θ = aγ = (aβ)^{1/v}. This can be greater or less than unity, consistent with γ < 1.
(2) Show that γ is indeed the growth rate of utility, in that, if one forgoes a small amount δ of current utility, one can realise an additional amount β^s θ^{s(1-v)} δ = γ^s δ after time s.

(3) Solve the linear case v = 0.

(4) The limit v ↑ 1 yields the case in which r(u) and F_0(x) are proportional to log(u − u_min) and log(x − x_min) respectively, where u_min and x_min are minimal subsistence values, below which consumption and final capital must not fall. Solve (13) in this case.
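The solution quoted above is easy to check numerically. The following sketch is illustrative only (the parameter values are arbitrary, and here γ < 1): it iterates the relation J_s^{1/v} = 1 + γJ_{s-1}^{1/v} and compares the result with a direct grid maximisation of (13).

    # Check of Fs(x) = Js * x**(1-v), with gamma = (beta * a**(1-v))**(1/v),
    # against brute-force maximisation of the recursion (13) over a grid.

    a, beta, v = 1.05, 0.95, 0.5        # arbitrary illustrative parameters
    gamma = (beta * a ** (1 - v)) ** (1 / v)

    def F(s, x, grid=100):
        """Brute-force Fs(x): maximise r(u) + beta*F_{s-1}(a(x-u)) over u."""
        if s == 0:
            return x ** (1 - v)
        return max(u ** (1 - v) + beta * F(s - 1, a * (x - u))
                   for u in (x * k / grid for k in range(grid + 1)))

    J_root = 1.0                        # J_0**(1/v), from F_0(x) = x**(1-v)
    for s in (1, 2, 3):
        J_root = 1 + gamma * J_root     # Js**(1/v) = 1 + gamma * J_{s-1}**(1/v)
        print(s, J_root ** v, F(s, 1.0))   # the two columns should roughly agree

The brute-force values fall slightly below the closed form because of the finite consumption grid; refining the grid closes the gap.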
3 PRODUCTION SCHEDULING

This is a version of an inventory problem which allows some useful reduction, but ultimately resorts to computation. We indicate variants of the problem at the end of the section.

A factory produces a single commodity and wishes to schedule its production so as to meet a known time-varying pattern of demand in the most economical fashion. Let us define the state x_t at time t as the stock held at the beginning of the t-th day; u_t as production during that day and d_t as demand during that day. In such problems one must indicate relative timing within a stage if one is to be clear about the sequence of events. The plant equation is then
x_{t+1} = x_t + u_t − d_t.   (14)
Suppose that the instantaneous cost function for day t is c_1(u_t) + c_2(x_{t+1}), where c_1(u) is the cost of producing an amount u in a day, and c_2(x) is the cost of carrying a stock x from one day to the next. The optimality equation (for costs) is then
F(x, t) = min_u [c_1(u) + c_2(x + u − d_t) + F(x + u − d_t, t + 1)].   (15)

Plausible simple forms for the cost functions are

c_1(u) = 0 (u ≤ 0);  c_1(u) = a + bu (u > 0),   (16)

c_2(x) = +∞ (x < 0);  c_2(x) = cx (x ≥ 0).   (17)
Assumption (16) implies that one can dispose of stock without cost, but that to go into production on a given day implies a base cost a and a unit cost b. Assumption (17) implies that negative stock ('back-logging') is forbidden, and that to carry positive stock incurs a unit cost of c per day. We assume a horizon point h at which negative stock is again forbidden, but positive stock valueless. There is no discounting.

It is the occurrence of the terms in a and c which make the problem non-trivial. If the base cost a were zero then one would produce only enough to satisfy current demand. If the storage cost c were zero then one would produce enough to satisfy all future demand in a single production run. With both costs present one must steer between these two extremes. The simplifying feature of the problem, as we have formulated it, can be summarised.

Theorem 2.3.1 In an optimal production programme one produces on a given day if and only if stocks are insufficient to meet that day's demand. When one produces one brings stock up to such a level that it satisfies demand for an integral number of days.
Proof Suppose that under a production programme {u_t} one has u_t > 0 for a given t. If x_t ≥ d_t and one transferred the production u_t to day t + 1 one would save a cost cu_t or a + cu_t according as u_{t+1} is zero or positive. Hence the first assertion.

For the second point, suppose that on day t one produces enough to satisfy demand before time τ, but not enough to satisfy demand at time τ as well. That is, u_t = Σ_{j=t}^{τ-1} d_j − x_t + δ, where 0 ≤ δ < d_τ. Then one must produce on day τ in any case. If one decreased u_t by δ and increased u_τ by δ then one would save a storage cost c(τ − t)δ, with no other effect on costs. Hence δ = 0 in an optimal policy. □

Thus, if x_t ≥ d_t one does not produce and lets stock run down. If x_t < d_t then one produces enough that demand is met up to some time τ − 1 exactly (so that x_τ = 0), where τ is an integer exceeding t. The optimal τ will be determined by the optimality equation

F(x, t) = min_{τ>t} [c(t, τ) + F(0, τ)],   (18)

where

c(t, τ) = a + Σ_{j=t}^{τ-1} [b + c(j − t)] d_j.   (19)
So, after a first production or disposal decision stock comes down to zero at some time, and from that point one operates between recurrent instants of zero stock. These instants are determined by the reduced form of the optimality equation lPt =
{
lPt+l
min [c(t, r)
t
+
T]
(dt = 0) (d, > 0; t
(20)
with φ_h = 0, where φ_t = F(0, t). The cost function (19) is easily calculated; by solving (20) recursively one then calculates the optimal instants at which to produce if initial stock is zero. If one begins with positive but inadequate stocks one produces according to (18). If one begins with more than adequate stock one either lets it run down or disposes of some.

The model is, of course, excessively simple. In practice there will be a premium for utilising plant and manpower at a constant rate. The base cost a could be regarded as embryonically representing the cost of switching between different types of production. In Section 7.1 we consider a somewhat related problem whose solution has a nice analytic characterisation.

If demand d is characterised statistically then the model can be given a time-invariant formulation and there is a stationary optimal policy. For example, if the d_t are supposed independently and identically distributed then the well-known 'two-bin policy' is optimal: that when stock falls below a lower critical value one produces so as to bring it to an upper critical value.
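For concreteness, recursion (20) is easily solved by machine. The sketch below is illustrative only: the demand pattern and the cost parameters are invented, and phi[t] stands for φ_t = F(0, t).

    # Solving the reduced optimality equation (20), with c(t, tau) given by
    # equation (19).  Demands and costs are invented for the example.

    d = [3, 0, 5, 2, 0, 4, 1]              # daily demands over the horizon
    h = len(d)
    base, unit, store = 10.0, 1.0, 0.5     # the costs a, b, c of (16)-(17)

    def run_cost(t, tau):                  # c(t, tau) of equation (19)
        return base + sum((unit + store * (j - t)) * d[j] for j in range(t, tau))

    phi = [0.0] * (h + 1)                  # phi[h] = 0
    cover_until = [None] * h
    for t in reversed(range(h)):
        if d[t] == 0:
            phi[t] = phi[t + 1]
        else:
            tau, best = min(((s, run_cost(t, s) + phi[s])
                             for s in range(t + 1, h + 1)),
                            key=lambda pair: pair[1])
            phi[t], cover_until[t] = best, tau
    print(phi[0], cover_until)             # minimal cost from empty stock

Reading off cover_until gives, for each production day, the day at which stock returns to zero, i.e. the 'integral number of days' of Theorem 2.3.1.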
4 LQ REGULATION
We come now to a class of models which are much closer to the models of traditional control theory and practice than are those of the last two sections. These are the models for which process and control variables are vector-valued, are constrained only by a linear plant equation and imply quadratic costs. They are for this reason termed 'linear/quadratic', a term abbreviated even further to 'LQ'. LQ models allow very explicit treatment and have a very full theory, which extends to the stochastic case. Not merely are they amenable; they show interesting properties. Indeed, much research in control theory consists in trying (sometimes realistically and sometimes futilely) to find an analogue of LQ theory for non-LQ models.

On the other side of the ledger, LQ assumptions are strong. For some applications, the assumptions may not be too unrealistic; for others, they plainly are so. Further, the very simplicity of the models and corresponding strength of conclusions means that there are interesting features which they cannot incorporate. Nonlinearity is one. For another, control actions cannot affect the informativeness of observations.

Let us formulate a state-structured LQ model. We shall assume that the state variable x is a column n-vector and the control variable u a column m-vector, so that there are m scalar components of control and n scalar components of state. We shall suppose the plant equation
x_t = A x_{t-1} + B u_{t-1}   (21)
and a cost function

C = Σ_{τ=0}^{h-1} c(x_τ, u_τ) + ½(x^T Π x)_h,   (22)

where

c(x, u) = ½(x^T R x + x^T S^T u + u^T S x + u^T Q u) = ½ [x; u]^T [R, S^T; S, Q] [x; u].   (23)
!
J!!"'""··-
4 LQ REGULATION
23
prescribe d paths; we re~rn. to these ~atters in Section 9. On present assumptions the cost functwn 1s such as to mduce movemen t of both x and u to zero; the model is then one of regulation to the set-point (0, 0). One tends to use the term 'regulatio n' when it is a matter of holding x to a fixed set-point, rather than (for example) of trying to make it follow an external reference signal. We need a simple general result on quadratic forms, and may conveniently take c(x, u) as a specimen such form.
Lemma 2.4.1 The quadratic form (23) is minimal with respect to u at u= -Q- 1 Sx
and the minimised value ofthe form is inf c(x, u) = !xT(R- STQ- 1S)x. u
These assertions are readily confirme d. The interest is that the minimisi ng value of u is linear in x and the minimise d form remains quadratic in x. A slightly related point is that we can always change variables so as to normalis e the matrix Sto zero; see Exerci~e 1. We now come to the central conclusion.
Theorem 2.4.2 The assumptions (21)-(23) have the consequences that (i) The value function is quadratic in x
F(x, t) = !xTII1x
(24)
and the time-dependent matrix II 1 obeys the Riccati equation Ilt = fiit+l
(t
(25)
where f is an operator having the action fiT= R
+ A 1 IIA- (S1 + ATIIB)(Q + BTIIBf 1 (S + B1 IIA).
(26)
( ii) The optimal control, in closed-loop form, is
(27)
where (28)
Proof Certainly F has the form (24) at time h. Suppose that it does so at time t + 1. Then the optimalit y equation evaluates F(x, t) as F(x, t) = inf[c(x, u) +!(Ax+ Bu?IIt+l (Ax+ Bu)] u
(29)
24
DETERMINISTIC MODELS AND THEIR OPTIMISATION
with c(x, u) given by (23). The minimising value of u is the optimal value of u1• But, appealing to Lemma 2.4.1, we see that the minimising value is that given by (27), (28) and the minimised value has the form (24) with ITt determined by (25) and (26). The induction is thus complete. 0 The proof is thus a simple one, with the elegant conclusions that the optimal control is linear (in x) and the value function quadratic. The backward recursion (25) for IT1 is celebrated, and referred to as the Riccati equation. (The continuoustime treatment of Section 8 exhibits dii/dt as a quadratic function of II, and it was just equations of this type which Riccati analysed.) A fast algorithm for solving the Riccati equation (at least in the infinite-horizon limit) is derived in Section 3.6. Despite the basis simplicity of the argument and the result, expressions (26) and (28) for the derived matrices may not seem very elegant. However, the structure behind them is simpler than the expressions themselves. One can formally write
fiT= inf[R + STK + KTS + KT QK +(A+ BK)TII(A + BK)], K
(30)
a way of writing the minimisation with respect to the variable u in (29) as a minimisation with respect to a matrix (see Exercise 2). The operator f then appears as the extremum of a linear operation, for which reason it has a particular character. We see from (21) and (27) that the optimal control process obeys the simple recursion
(31) say. If successful regulation is possible then one expects that the solution x 1 of this equation will become smaller with time, and indeed converge to zero in the infinite horizon. The matrix r I is often termed the gain matrix. For the simplest possible example, suppose that x and u are scalar and that the plant equation and cost functions are h-1
Xt
=
Xt-1 +Ut-I,
C=!QLu7+!Dx~. t=O
The cost function thus penalises deviation from zero of all control variables and of the state variable at termination. The larger D!Q, the greater relative penalty one assigns to terminal deviations in state. One readily finds the evaluation of II1 and of the optimal control II-
QD
1 -
Q+sD
Ut
D =- Q+sDXt
(32)
where s = h - t is the time to go. However, suppose we had optimised the path directly, by eliminating u from the cost function and so choosing {x 1} to minimise
5 LQ REGULATION IN THE SCALAR CASE
25
h-l
C = !Q 2)xr+ l - x 1) 2 + !D~. t=O
with constant gradie nt One readily finds that the optimal path is a straight line (and so consta nt u) given by D (33) Xt+l- Xt = Ut =- Q + hDXo . makes plain a fact not The control rules (32) and (33) are equivalent, and (33) is consta nt along the immediately evident from (32): that the control value value of u1 only if the t correc the gives (33) sion optimal orbit. However, expres gives the optimal value same rule has been applied at earlier times, whereas (32) been. have policy or s action l of ur no matte r what previous contro -loop and openclosed tively respec d terme are The control rules (32) and (33) ck from the feedba orate incorp not do and do loop, in that they respectively the process along the current state value. The open-loop rule endeavours to guide xo. The closed-loop rule desired path by dead-reckoning from the initial position ver, the presen ce of corrects the path by reference to the curren t position xr. Howe property of closed-loop feedback is not enough in itself to ensure the stronger may have been. policy l contro past ver whate al optimality: that the rule is optim a rule becom es of forms -loop closed and openThe distinction between the le disturbances; see vital when the path of the process is subject to unpredictab Section 9.1. Exercises
le u* = u + Q- 1Sx. (1) Suppo sethat onew orks in terms ofarev isedco ntrol variab by this change. What Show that the matrix S of the cross-term is reduced to zero B? are the revised values of R, Q, A and s that one symm etric (2) The infim um with respect to Kin expression (30) implie e-definiteness: positiv of matrix can be 'bigger' than another. The orderi ng is that e. we write Pt > P2 if the matrix Pt - P2 is non-negative definit Note that the open- and (3) Complete the deduc tion of the open-loop control (33). be true: the two always will This 0. = t closed-loop forms of control agree at the closed-loop so and point, g startin forms of the optim al rule will agree at the general initial lates formu one if form form can be deduc ed from the open-loop t). time l genera a at x state of conditions (i.e. a start from a general value 5 LQ REGU LATI ON IN THE SCAL AR CASE ial of the last section One can gain useful insights by following throug h the mater return to cases of shall We . scalar in the simplest case, when both x and u are Section 8. of lation formu ime greater physical interest in the continuous-t
26
DETERMINISTIC MODELS AND THEIR OPTIMISATION
fiT
II Figure 1 Convergence ofthe sequenceJ(sl to the unique positive root of II =fiT.
Assume then the scalar case, with S normalised to zero. The Riccati equation (25) reduces to lit = /IIr+t = R
A 2 Qllt+l
+ Q + BZII t+l .
(34)
and II 1 = fSIIh, where s = h- t is time to go and fs is the sth iterate of the operator f As a function of II the quantity /II has the character depicted in Figure 1; it is monoton e increasing and concave. We shall make these facts the basis of an informal proof of the following theorem; a formal proof will be given for the vector case in Chapter 6.
Theorem 2.5.1 Suppose that Q and Rare positive, that B is non-zero and that llh is non-negative. Then the sequence II1 converges monotonically with increasing time to go h-t to the unique positive solution II ofthe equilibrium Riccati equation II =/II.
(35) The corresponding limit gain factor r is the numerically smaller root ofthe quadratic "·equation (36) Proof It is obvious graphically that (34) has only a single positive root. The sequence PITh is given by the cobweb diagram of Figure 1, and it is again evident graphically that this converges to IT with increasing s. We leave formalisation of these assertions to the reader. We find that
(37)
5 LQ REGULATION IN THE SCALAR CASE
27
r satisfies (36), which has Eliminating II from (34) and (37) we deduc e that expression (37) can be that (34) reciprocal real solutions. It also follows from rewritten as
r =~[II ~R]
(38)
itude of A. The stead y-stat e and from (37) and (38) that jrj < 1, whatever the magn y small er root of (36). 0 gain factor r can thus be ident ified with the nume ricall : that in the infinite horiz on The concl usion is then what one migh t have hoped geneo us. Furth er, this timethe optim al regul ation rule becom es time- homo impli cation that the solut ion homo geneo us rule is stable in that jrj < 1, with the converges to zero with x 1 of the optim ally contr olled plant equat ion (31) the penal ties of instability. The increasing t. The stability is, of course, induc ed by ver; see Exercise 1. conclusion is sensitive to vario us assum ption s, howe We see then from (36) that large. is R Q/ Suppose that contr ol is expensive, in that That is, K ~ 0 if the plant -I. A and A of er r will be close to the nume ricall y small 1 plant is unsta ble (i.e. if 1 the if A) is stable (i.e. if !AI< I) and K ~ B- (Asive then virtua lly no expen is ol JAI > 1). So, if the plant is stable and contr if the plant is unsta ble ver, control will be exerted, which is not surpr ising. Howe effectively repla ces which d and control is expensive then a control will be exerte ocal. recipr its by the gain factor A for the uncon trolle d plant equat ion ly. If we use direct tory trajec The scala r case is one for which we can optim ise the ion can funct cost the le then the plant equat ion (21) to elimi nate the contr ol variab be writte n 00
C
=!I )&; + QB- 2 (xt+l -
Ax1) 2)
+ !IIhx~.
(39)
t=O
Minim isatio n with respe ct to x 1 yields an equat ion
which we can write as
(0 < t
(40) d-ord er difference equat ion so that :T- 1x 1 = Xt+!· The soluti on x 1 of the secon ZJ and z2 are the roots of (39) will be a linea r comb inatio n of zi and z~, where
Q(l- Az)( l- Az- 1 )" + ~ R = 0.
r± 1, so that the optimal trajectory
But we know from (36) that these roots are just has the form Xt = cr 1 + c'r-t,
{41)
28
DETERMINISTIC MODELS AND THEIR OPTIMISATION
where the coefficients c and c' are determined by initial and terminal conditions. The initial condition is prescription of xo, and the terminal condition is, in the infinite-horizon case, effectively that Xt - 0 with increasing t. Solution (41) thus becomes Xt = rt x 0 , consistent with the recursive analysis. Exercises and comments 1 Consider the following separate variations of the assumptions of the theorem. (i) If B = 0 (so that the process is uncontrolled) then (35) has a finite solution if the plant is strictly stable; otherwise it has no fmite positive solution. (ii) If R = 0 and the plant is stable then (35) has the unique solution II= 0. If R = 0 and the plant is unstable then there is a solution II = 0 and another positive solution. These two solutions correspond respectively to infinite horizon limits for the two cases IIh = 0 and IIh > 0. (iii) If Q = 0 then II = R. In this case the optimal control takes whatever value is needed to bring the state variable to zero at the next step.
6 DYNAMIC PROGRAMMING IN CONTINUOUS TIME Control models are usually framed in continuous time, and all the material of Section 1 has a continuous-time analogue. If we look for state structure then the analogues of relations (2) and (3) are a first-order differential equation
x = a(x, u, t),
(42)
as plant equation and a cost function
C=
1h
c(x,u,r)dr+C(x(h),h).
(43)
of integral form. Here xis the time rate of change dx/ dt of the state variable x. It thus seems that .an assumption is forced upon one: that the possible values and course of x are - such this rate of change can be defmed; see Exercise 1. We shall usually suppose x to be a vector taking values in Rn. We shall write the value of x at time t as x( t) rather than x" although often the time argument will be suppressed. So, it is understood that the variables are evaluated at timet in the plant equation (42), and at timer in the integral of (43). The quantity c(x, u, r) now represents an instantaneous rate of cost, and the final term in (43) again represents the closing cost incurred at the horizon point h. The general discussion of optimisation methods in Section 1 holds equally for the continuous-time case: there seem to be just two methods which are widely applicable. One is that of direct trajectory optimisation by Lagrangian methods, which we develop in Chapter 7 under its usual description of the maximum principle. The other is the recursive approach of dynamic programming.
6 DYNAMIC PROGRAMMING IN CONTINUOUS TIME
29
equation formally We can derive the continuous-time dynamic progr ammi ng t) is again defmed as from the discrete-time version (6). The value function F(x, t with state value x. the minimal future cost incurr ed if one starts from time e (cf. (6)) that Considering events over a positive time-increment lit we deduc (44) F(x, t) = inf[c(x, u, t)& + F(x + a(x, u, t)&, t + c5t)] + o(c5t). u
Letting c5t tend to zero in this relation we deduce the contin equation
.
1~
[
c(x, u, t) +
uous-time optimality
] t) 8F(x, t) 8F(x, = 0 81 + 8 x a(x, u2 t)
(t
(45)
respect to x (see the Here 8Ff 8x is the row vector of first derivatives ofF with the minimising case, e te-tim discre the in conventions listed in Appendix 1). As al condition termin a is there and t), u( of value of u in (45) is the optimal value F(x,h) = C(x,h ). In the discounted case the cost function will becom e C=
1h
e-= c(x, u, r)dr + e-<>~~qx(h),h)
member of (44) is then where a is the discount rate. The value of Fin the right-hand e that the optimality 61 quenc conse the with discounted by a factor f3 = e-"' , equation (45) is modified to 8F(x, t) 8F(x, t) . ( 46) 0. tnf[c(x, u, t) - aF(x, t) + 8 t + 8 X a(x, u, t)] = u ated differentials There are many questions of rigour. One is: whether the postul 2 and Section 10.7). exist, and whether it matters if they do not (see Exercise is either necessary or Another is: whether fulfilment of the optimality equation sing only points which sufficient for optimality. These we shall pass over, discus ve content. arise in partic ular applications, when they usually have intuiti Exercises and comments
the sets within which (1) In Section 1 we made no assumptions at all conce rning case restrictions are the variables x and u may vary. In the continuous-time e dx/dt should be implied by the seeming necessity that the rate of chang could one have a meaningful. The point arises even in the uncontrolled case: for which the state state-structured continuous-time deterministic mode l between the possible variable took values in a discrete set (say)? If the transitions see Section 9.1 (and x-values are governed by chance then the model is stochastic; stochastic which be it was the seeming necessity that a quantised system should value of some the by Einstein refused to accept). If the transitions are governed
DETERMINISTIC MODELS AND THEIR OPTIMISATION
30
underlying continuous variable ('hidden variables') then x is no longer a state variable. If the transitions in x are determined by x alone then, either they are instantaneous (in which case they occur at an infinite rate) or at some time delay (in which case one has reverted to something like a discrete-time formulation-a so-called discrete-event system or Petri net). (2) Consider the following simple first passage problem. Both x and u take values on the real line; the plant equation is x = u and the instantaneous cost function is c(x, u) =! (L + Qz?). Termination can take place at either x = 0 or x = 1, with corresponding terminal costs Co and C1 . Effectively, then, from an intermediate point one heads for whichever of these two termination points is the more favourable, on the basis of control, journey-time and termination costs. The problem is plainly time-invariant and the optimality equation is infu[! (L + Qu2 ) +uFx] = 0, where we have used the subscript notation for partial derivatives: Fx = 8F/ 8x. This is subject to boundary conditions F(O) = Co and F( 1) = C1. Show that u = -Q- 1Fx, that Fx = ±-./EQ and that F(x) = min[ C0 + xVLQ, cl + (1- x)v'LQ] for (0 <X< 1). One heads for the termination point corresponding to the minimising option at constant speed ..fiJQ. Thus, either F (x) consists of two linear segments or of a linear segment with a discontinuity at one of the end-points (see Figure 2). In the first case, Fx does not exist at the break-point (the point at which the two options are equally costly). In the second, it does not exist at one termination point (that for which it would be less costly to travel to and terminate at the other termination point, if one were permitted). In neither case does it matter: Fx exists in the directions in which one is both able and wishful to move. (3) A rather more general first-passage problem raises all kinds of physical associations. Let the vector x denote the Cartesian coordinates of a particle moving in a Euclidean space of n dimensions. When at position x the particle moves with speed v(x) and in a direction which can be chosen. The equation of motion is thus x = v(x)u, where u is a unit vector which can vary arbitrarily with F(x)
F(x)
(a)
X
(b)
X
Figure 2 The form of the value jUnction F(x)for the problem ofminimal-cost passage. In case (a) there is a discontinuity ofFx: at the break point at which the two termination points are equally costly. In case (b) there is a discontinuity ofFat a boundary which is never optimal for termination.
31
7. OPTIMAL HARVESTING
time. Let F(x) be the minimal time taken for the particle to reach a stopping set from a point x outside it. Show that the dynamic programming equation is lV7F(x)l = v(x)- 1; i.e.
t(a~)2= v(x)-2. j=l
ax]
This is the eikonal equation of geometric optics, a short-wavelength form of the wave equation, to be solved subject to F = 0 in Y. How is the optimal direction at a given point determined from F ? (4) Consider a version of Exercise 3 in which the magnitude of u may vary. Suppose that the plant equation is x = u and that the integral of the instantaneous cost c1(x) + c2(x)lul 2 from initial position to terminatio n in Y is to be minimised.
7. OPTIMA L HARVES TING Let us take as an example the model with which we began: the harvesting example of Section 1.2. The plant equation is x = a(x)- u, wherex is biomass, a(x) is the natural net growth rate and u the catch rate. Let us assume a reward function IR =
1
00
e-=u dr.
We could define a cost function as the negative of this quantity, but the more natural course is to consider maximisation of a reward, so that F(x, t) will now be the maximal value at time t of the present value of future reward starting from a current biomass of x. The integral will certainly converge if u is uniformly bounded. The problem has a totally time-invariant formulation, so it may be presumed that the value function is a function F(x) of initial state alone. The optimality equation will then be
s~p [u- aF(x) + (a(x) -
u) a~~)]
= 0.
(47)
Let us assume that practical considerations set an upper bound M on catch rate. Note that u occurs linearly in the bracketed expression, with coefficient 1 - Fx (where we use the subscript notation for differentials, see Appendix 1). We can -thus say that the optimal value of u is zero for values of x such that Fx > 1 and is M for values of x such that Fx < 1. That is, one does not fish at all if the effective marginal cost in terms of decreased stock exceeds unity (the market price of the catch), and one fishes at the maximal achievable rate if the marginal cost is less than unity. A policy in which the control adopts the extreme values in its permitted set in this way is termed a bang-bang policy.
DETERMINISTIC MODELS AND THEIR OPTIMISATION
32
It seems plausible then that the optimal policy should take the form
u= {0
(x c)
M
(48)
for some threshold value c. In fact we shall demonstrate that, if the function a( x) is strictly concave and bounded above by M, then the policy (48) is optimal if the threshold has the value determined by
d(c) =a
(49)
where a' denotes the differential. Note that we have not specified what u should be at the threshold c itself. In fact, specification is pointless. For x < c biomass increases under the policy, for x > c biomass decreases. Under the policy biomass thus converges to the equilibrium value c, and at this value natural growth and catch necessarily balance, so ihat effectively u =a( c). We say 'effectively' because u could be 'chattering' infinitely rapidly between the two extreme values as xis nudged either side of threshold. However, the time average of catch rate at c must be a(x) over any non-zero time- interval. To prove the assertions above, let V(x) be the value function under the policy (48), for a threshold as yet undetermined. Then plainly V (c) = a( c)/ a. For x < c there is zero catch and the biomass grows at rate a") until x equals c, which happens first after time
1 c
7t (x)
=
x
dy
a(y)
Thus
(x
(50)
Correspondingly, one finds that
(x
~c)
(51)
where 72 (x)
=
i c
x
dy
( ).
M-ay
The term in Min (51) reflects the return from a catch at this rate until the time 7 2 when biomass has been brought down to the value c. One readily verifies that the value of c which maximises either of expressions (50) or (51) is that determined by (49). If we can now show that V., (with this choice of c) is greater than or less than unity according as xis less than or greater than c then the implication will be that V satisfies the dynamic programming
33
8 LQ REGULATION IN CONTINUOUS TIME
equation (47). We shall see in Exercise 3.1.1 that this implies that V can be identified with F and that the policy is optimal. Verification is direct. It follows from expression (50) that for x :::; c
[1ca'(y)-a ] _ aV _a(c) -T1 (x)_ a(y) dy ~ 1 - exp x Vx- a(x)- a(x) e with equality only at x
= c. Correspondingly, one finds from (51) that for x Vx = exp
[lx ~(y~ ~~
~
c
dy] :::; 1
with equality only at x = c. Demonstration is thus complete. Note that both Fand Fx are continuous at the optimal threshold. Continuity of Fx constitutes the tangency condition, which we discuss more generally in Exercise 10.7.2. The policy may not seem a very practical one, in that fishing effort would have to vary between extreme values as stock varied in a small interval about the operating point c. However, this would be changed if one chose the performance criterion to put a premium also on continuity of operation. The main issue is to recognise an operating point, with complete cessation of fishing effort as soon as stocks fall markedly below it. The fishing fleet could then be reduced to match the effort required at this point. A more realistic model would also distinguish between 'catch' and 'effort' -the effort per unit catch increasing with decreasing x. Note that the solution makes sense even in the undiscounted case, despite the fact that the future reward IR is then infinite. In this case relation (49) yields a' (c) 0, so that c is just the value Xm maximising a(x). That is, the operating point is just that chosen for the crude policy of Section 1.2. However, there is the difference that this operating point is now stable, because of the cessation of fishing once stock falls below it. As the discount rate a increases so the operating point-the equilibrium stock level-falls. The reason for this is that, if x exceeds the value recommended, then it is worth while to reap the profits of a quick harvest, even if by so doing one reduces stocks to a lower-yielding equilibrium. The most extreme case is that in which a> a'(O), when the recommendation is to fish the population to virtual extinction. In this the model demonstrates the crudity of the narrow accounting view with its neglect of wider social, economic or environmental considerations. We consider stochastic versions of the model in Chapter 10, and are forced by these to face the issues. 8 LQ REGULATION IN CONTINUOUS TIME
In analogy to assumptions (21) and (22) of the discrete-time case we suppose the time-homogeneous forms of plant equation
34
DETER MINIS TIC MODELS AND THEIR OPTIMISATION
(52)
x=Ax +Bu and cost function C =!
foh c(x, u)dT +![xTllx)(h)
(53)
(23). Then the where the instantaneous cost rate c(x, u) still has the form immediately, follow analogues of the discrete-time assertions of Theor em 2.4.2 n of the solutio by either by passage to the continuous limit from these or continuous-time dynamic progra mming equation (45). The cost function is quadratic in state (54)
F(x, t) = !xTll( t)x and the time-dependent matrix ll obeys the Riccati equation
(0
~
This is to be regarded as a backward equation for ll( t) with termin of IT (h). The optimal control rule is linear in curren t state
t
(55)
al specification (56)
all quantities being evaluated at a comm on time t. case, in that There is thus something of a simplification over the discrete-time and control on 1 equati i the inverse matrix (Q + BTITB )- appearing in the Riccat the Riccati then zero rule is replac ed simply by Q- 1• If S is normalised to equation (55) reduces to
(57) role; we shall where J = BQ- 1Br. The matrix J turns out to play an impor tant effectiveness l contro of term it the control-power matrix . It is kind of matrix ratio basis. cost a on l to control cost, and so measures the effectiveness of contro n IT of solutio t " In most control contexts we are interested in the time-invarian nt efficie p develo (55) and the corresponding stationary policy (56). We shall a are There r. numerical methods for the determ ination of IT in the next chapte these of One er. few classic problems which can be treated explicitly, howev g or the standing hangin the either to lum, pendu a of sation concerns the stabili ('inverted') position. l. The equation Let a be the angle of deviation of the pendu lum from the vertica of motion in linearised form (i.e. for small a) is
a= aa +bu. e or positive Here the term aa represents the force due to gravity (where a is negativ n) and u is positio d inverte according as a is measured from the hanging or the
~t
I ! ~
i
I
8 LQ REGULATION IN CONTINUOUS TIME
35
the control force applied to the bob. One can consider stabilisation of a to zero in both cases; one tends to speak of 'the pendulum' or 'the inverted pendulum' according as one is speaking of the first or the second. If we adopt the state variable with components a and a then the coefficients in the standard form (52) of the plant equation are
Let us set
n=
[{
f]
and assume, for simplicity, that Sis zero and that R is diagonal with entries r 1 and r2 . The Riccati equation (55) then becomes, in full,
j + r1 + 2ag - Jg2 = 0 g +f + ah- Jgh = 0 h + r2 + 2g- Jh 2 = 0,
(58)
whereJ = b2 /Q. We seek an equilbrium solution of these equations which is non-negative definite and stable. (We discuss stability systematically in Section 5.1). If the problem is properly posed such a solution will exist, and will be both finite and unique. We find that
g=
a+ Ja 2 +Jr1 ' J
h=J'2+2 g J
'
f
= h(Jg-a)
(59)
is an equilibrium solution, and is in fact the only one satisfying the conditions for positive definiteness: that/, g andfh- g2 must all be non-negative. Positivity of h implies that we take the positive square root in the expression for h above. Positivity off implies that Jg- a ?!: 0 which implies that we take the ositive square root in the expression for g.We find thatfh- g2 = (r1 + r2 a 2 + Ort)/J so we require that J > 0. It is evident from the solution above that we require J > 0 if TI is to be finite, but we shall see in Section 16.9 that the condition acquires special force for stochastic versions of the model. The solution thus determined is also a stable solution of the Riccati equation; see Exercise 5.1.5. The optimal control is u = -bQ- 1(ga + ha). Note that this involves a as well as a; we shall see that this is necessary for stability. Note also that if r 1 > 0 then r2 can well be zero--a control which reduces a to zero automatically achieves the same fora A rather more substantial version of this model is the classic problem of the cart and the pendulum, illustrated in Figure 3. A cart of mass M can be moved horizontally under an applied control force u, without friction. An inverted·
36
DETERMINISTIC MODELS AND THEIR OPTIMISATION
Figure 3 The cart and the pendulum
pendulum of length L and bob mass m stands on the cart. One observes the horizontal displacement q ofthe cart and the angle a of the pendulum from the vertical, and the aim is to choose a control rule which stabilises these to zero, say. The difficulty of the problem is the indirectness of the means by which control is exerted on the pendulum. An effective control can be achieved, however, and it is impressive to see this realised; the pendulum stands firmly upright with no perceptible motion of any part of the system. The Newtonian equations of motion are second-order differentials in time, so the state vector x can be chosen as having components (q, q, a, a). It is shown in Exercise 5.2.3 that the linearised (small a) the equations of motion for the system have the form (52) with
0 0] 0 00 01 -(3 [ 0 1 ' A= 0 0 'Y 0 0 0 ~here
B-
-
[
1/M 0
-1/~LM)
l
(60)
f3=gm/M and-y= (g/L)(1 +m/M) and g is the acceleration due to
gravity. Let us consider a numerical example with the time and length scales chosen so that g = 1, L = 1 and suppose that M = 2, R =I, S = 0 and Q = 1. Nothing is gained for illustrative purposes by choosing more exotic values. Solution of the Riccati equation, either directly or by the policy improvement methods of the next chapter, leads to the numerical evaluations
4.44 9.37 11.32 11.36] II = [ 9.37 30.33 38.97 39.21 11.32 38.97 67.06 63.71 ' 11.36 39.21 63.71 61.86
K = [1.00
4.44 12.37 11.33].
The relatively large coefficients associated with the angle variables in both matrices is an indication of the difficulty of controlling the pendulum at one
37
8 LQ REGULATION IN CONTINUOUS TIME
angle variables inK (and so remove. It is not surpr ising that the coefficients of the form of B that angle is the from see in the control rule) are positive, since we the pendu lum to the tips right the to affected negatively by u; a push of the cart right (as indic ated the to cart the left. However; since the push indee d displaces the displa ceme nt of cients coeffi by the positive entry in B) it is surpr ising that the the posit ion that seems t variables in the control rule are also posit ive-i cart to the the move to s corre ction has the wrong sign. However, if one wishe falls to the lum pendu the right from a positi on of rest and does so directly, then falls to lum pendu that the left. One has first to move the cart to the left, so ve achie both to as such a way the right, and to then move the cart to the right in . again al vertic ulum to the the requi red displa ceme nt and bring the falling pend lex comp behaviour of more This is a mild example of the counter-intuitive cing will have devel oped balan stick of ience exper systems (although those with the requi red sophi sticat ion of intuition). Exercises and comments 1Sx as transf orme d contr ol (1) One again norm alises S to zero by choice of u + QR are A - BQ- 1S and and A of variable. Show that the transf orme d values R - sT Q- 1S, while those of B and Q are uncha nged. zero. The equil ibrium Ricca ti (2) Cons ider the scala r case with S norm alised to + BK satisfies r 2 = A 2 + JR. equat ion is R + 2AII = J II2 and the gain matri x r Comp lete the analo gue of the analysis of Secti on 5. d at the end of Secti on 4 (3) The conti nuous -time analo gue of the probl em treate u, an instan taneo us cost rate would have a plant equat ion .X = u in scala r x and 2 Show that the value function and ~ Qu2 and a termi nal cost Dx at time h. optim al contr ol with times to go are
!
U=-
Dx . Q+D s
projectile which will strike its One could regar d x as the lateral displa ceme nt of a tion explicitly penal ised is targe t at time h, and for which the only cours e devia ctile is regarded as havin g the termi nal miss- distan ce x(h). However, if the proje ise 4. inerti a then one shoul d rathe r take the mode l of Exerc the two comp onent s of (4) Cons ider the 'inert ial' version of Exercise 3 in which e) obey 1 = x 2 and chang of the state vecto r (lateral displa ceme nt and its rate nal cost is Dxi at time h. xz = u, the instantaneous cost is Qu2 and theoltermi times to go are Show that the value function and optim al contr with
!
!
x
38
DETERMINISTIC MODELS AND THEIR OPTIMISATION
(in. the absence of · dee d the predicted terminal miss-distance · m .. . +sx2 Is Herexi . further control) so the conclusion is consistent with that of Exercise 1.4. Note that fora given value of this quantity the optimal co~trol becomes small for large s (when plenty of time remains to make the ~orrect10n) and also for smal~ s ~when a iven change in x 2 scarcely affects termmal xi). In rocket contexts 1t IS more ~ealistic to take the instantaneous cost as proportional to lui, which affects optimisation profoundly; see Section 7.7. 9 OPTIMAL TRACKING FOR A DISTURBED LQ MODEL
In Section 4 we considered an LQ version of the regulation problem, for which the aim was to bring the path to the set point x = 0, u = 0 in an optimal fashion. However, the problems faced in applications are considerably more general. For one thing, the set point could be arbitrary; e.g. one might wish to be able to set a central heating system to stabilise at any prescribed room temperature in a certain range. For another, one may wish to follow a variable set point in time; i.e. to follow a command signal. For example, one may wish a yacht to follow a desired course, taking account of the winds and currents which are expected, or a pottery kiln to follow a desired temperature profile. We shall then generalise the model of Section 4 to incorporate two features: the feature of tracking, i.e. of aiming to follow a command signal, and the feature of disturbances to the dynamics, such as the winds and currents of the yachting example. Both the disturbances and the command signal may well not be known in advance. The sea and weather conditions of the yachting example will certainly be only imperfectly predictable. Further, if the vessel's aim is not to follow a prescribed course, but rather to shadow an unpredictable rival, then the command signal is likewise unpredictable. We shall not be able to handle such issues properly until we allow the effects of stochastic inputs (Chapter 10) and imperfect state observation (Chapter 12). For the moment we shall have to assume that both disturbances and command signal are known in advance, although we can presage the later treatment somewhat if we assume that these also are generated by a model. We begin with the discrete-time version, and can then deduce the continuoustime analogue immediately (see Exercise 3). Let us then assume the plant equation (21) modified to
(61) where {d1} is a known disturbance sequence. We assume the instantaneous cost function (23) modified to
(62)
9 OPTIMA L TRACK ING FOR A DISTUR BED LQ MODEL
39
un
are the desired state and control paths, known in advance. where {~} and { the ideal Normally one would regard x~ as the comma nd signal and assume that we can r, Howeve zero. value for u1 would be the 'minim al effort' value, probably n functio cost l termina as well allow prescription of a general control profile. The pt subscri h the where generalises correspondingly to [(x- _xC?II (x- r)]h, indicates that all quantities in the bracket are evaluated at time h. We can somewhat reduce the problem by changing to the variables
!
holds in representing deviation from the desired path. The plant equation (61) then te substitu these new variables if we
d1 = d1 - d~ = dr- X~+ Ax~-!+ Bu~-l on' terms; for d1• This normal ised d represents the effective disturb ance in 'deviati term the it is the old disturb ance less (63) d~ = x~- Ax~-!- Bu~_ 1 • path for the This modifying term represents the 'unnatu ralness ' of the prescribed rbed plant undistu the satisfies plant, and is zero if the desired path itself ly. repeated equation. This is a point which will manifest itself interestThe dynami c program ming treatme nt is straightforward and yields sketch only shall We tions. calcula ing conclusions, but by somewhat laborious in emerge will ch approa ul insightf these, because a much more powerful and Chapte r 6. that the Assum e xc and uc normalised to zero, as above. One finds then eous mogen non-ho a by d replace is n quadra tic form (24) of the value functio quadra tic form (64) equatio n The matrix II 1 obeys exactly the same equation as before, the Riccati (25) I (26), and the optima l control is Ut
= Krxr- D; 1BT(IIt+ldr+l
+ O't+I)·
Here the feedback matrix K 1 also has exactly its previous value (28) and Dr = Q + BTIIt+IB.
ances and The coefficient 0'1 of the linear term, which reflects the effect of disturb on recursi rd comma nd signals, obeys the backwa O'c
= rJ(O't+ I- rrt+ldt+l)
where rr is the gain matrix defined in (31).
40
DETERMINISTIC MODELS AND THEIR OPTIMISATION
The term 81 in (64) is also interesting; it is that component of the cost of future operations which is independent of the effect (presumably transient) of the value of initial state. For the regulation case of Section 4 it had the value zero; in the present case it reflects the total penalty for tracking error incurred up to the horizon point. One can derive a recursion for it, but we shall not do so. The term does not affect the control rule, we shall find other ways of assessing tracking error, and shall in any case see that the recursion can be expressed much more elegantly in a more general formulation (see Section 18.2). From these conclusions wd can progress to the final conclusions, stated for the unreduced formulation. Theorem 2.9.1 Assume the plant equation (61) and instantaneous costfunction (62). Then the optimal open-loop control rule and the optimally controlled plant equation take the respective forms h-t-l u~-u~=Kt(x~-x~)-D; 1 Br r;+lfi+ 2 .•. r;+1 rrt+J+t(dt+J+!-d~+J+!) (65)
L
}=0
h-t-! -BD; 1Br
L
r;+ 1r;+ 2 ... r;+1IIr+J+l (dt+J+l
- d~+J+!)
(66)
}=0
whered; is defined in (63).
We see that the optimal control (65) now has afeedback-feedforwardform. The feedback term K 1(x 1 - ~) is exactly as previously, except that x 1 has been replaced by the deviation x 1 - x~. The sum represents the feedforward term; a term which expresses an optimal anticipation of future disturbances and command signals. If we consider the infinite-horizon case and assume that infinite-horizon limits exist then the rule (65) will simplify to 00
Ut- u~
= K(xt- x~)- n-lBTL:(rT)jii(dt-;-j+!- d~+J+l).
(67)
j=O
We shall show in Section 6.1 that existence of the infinite-horizon limit for II is usually associated with the fact that the limit gain matrix r is a stability matrix. By this we mean that its powers rJ decay to zero with increasing j; in fact, exponentially fast. One can then see the feedforward term in (67) as incorporating an automatic 'matrix discounting' of the future. Intuitively: the further ahead that a problem lies, the less urgent is it. For a simple example, suppose that the amount of water x in a reservoir obeys the recursion
9 OPTIMAL TRACKlN G FOR A DISTURB ED LQ MODEL
Xt = Xt-1 -Ut-I
41
+ d,,
where u represents draw-off (to be chosen) and d represents inflow (supposed 2 known in advance). The instantaneous cost function is R(x- X:) + by rate utilisation of 2 y Q( u - uc) . That is, one would like to secure uniformit cost of nent x-compo The possible. holding u as near to the constant value uc as represents an attempt to enforce the constraints on x: that it should lie between zero and the capacity of the reservoir. One can enforce these constraints rigidly (see Section 7.13) but in an LQ formulation one can merely penalise deviations from half-capacity, say. Note the function of a reservoir: to act as a buffer which smooths out variations in inflow to deliver a more uniform outflow. Applying the above theory we find that the optimal infinite-horizon rule is
i
!
ut = (1 - r)
(x,- + f: r
1dt+J+I)
xc
(68)
;=0
2 where r is, as in Section 5, the smaller root of Q( 1 - r) = Rf. Note that this rule is independent of uc. The average level of outflow must equal the average level of inflow, whatever that may be, and there is no point in trying to prescribe it. The situation would be different if the two did not necessarily balance in the long run;
see Exercise 1. It is interesting that the 'discounting' matrix in (67) should just be the transpose of the gain matrix r. Indeed, there is a duality between past and future, as will emerge in Chapter 6. For validity of the infinite-horizon version (67) it is certainly necessary that the sum should converge. Suppose that the elements of fl converge to zero as pi with increasing}, for a positive scalar p. The sum in (67) will then certainly converge if d1 - d~ grows as fit with ifil < p- 1. However, for successful tracking we would really demand rather more than that. Ideally, the tracking errors should converge to zero with increasing t. A less stringent demand is that they should remain uniformly bounded , so that costs accumulate at a bounded rate. We see from (67) and the corresponding infinite-horizon version of (66) that these demands will holdiftheyholdforthesum~t = L~(fT) 1 IT(dr+J- d;~)· If the disturbance dt is uniformly bounded in value then its contribution to ~t will also be uniformly bounded; a conclusion which we can strengthen in Chapter 10, when disturbance is stochastic in nature. However, the path x~ which one is attempting to follow may well go off to infinity at some rate and the question is: what rates are consistent with vanishing or with bounded tracking errors? For simplicity, consider the case when both dt and u; are zero. If d'T is of order fit for large t (with 0 ~ fi < p- 1 ) then so is ~t- For tracking errors to converge to zero with increasing tit is thus necessary that d~ = x7- Ax~-! should converge to zero with increasing t; a conclusion which we could express as follows.
42
DETERMINISTIC MODELS AND THEIR OPTIMISATION
Theorem 2.9.2 A necessary and sufficient condition that the component oftracking error due to variation in the command signal should tend to zero with increasing time is that all unstable modes ofthe command signal should satisfy the uncontrolled plant equation. The continuous~ time analogues of these assertions are immediate; see Section 6.2. In developing LQ theory further in Chapter 6 we relate it to classic control theory, among other things, so continuous time then provides the more natural frame. Exercises and comments {1) Work through the reservoir example in the case when the plant equation is x 1 = j3x 1_ 1 - u1_t + d1• Here f3 is a coefficient lying between 0 and 1; reflecting the fact that a fraction 1 - f3 of the water is lost over a stage by leakage or evaporation. (2) Suppose that the inflow d is itself known to obey an equation d1 - v = a(d1_ 1 - v). The relation (68) yields the control rule Ut = V + (1 - r)(xt- r) provided larl
a(1- r)
+ } _ ar (dr- v),
< 1.
(3) Suppose that d1 and u~ are identically zero, that Shas been normalised to zero and that infinite~horizon limits hold. Show that the optimal control is 00
Ut = Kxr - n-I BT ~)rT)j~+j+l" j=O
Suppose that the command signal is generated by r, = Cw1 where w1 satisfies a recursion w1 =AtWr-1· Show then that u1 =Kx1 +Ktw1 , where Kt = -D- 1 BT f:}:o(rT)i RC_A{+ 1. 10 OPTIMAL EQUILIBRIU M POINTS: NEIGHBOUR ING OPTIMAL CONTROL A minimal aspiration would seem to be: to locate the optimal point at which a controlled deterministic process should settle in equilibrium. That is, we suppose a stationary control rule, suppose that it has a stable equilibrium point, and then ask what would be the optimal location of this equilibrium. This is to be contrasted with the more ambitious programme of (i) finding an optimal control rule, (ii) establishing that it has stationary limit form in the infinite horizon, and (iii) establishing that there is a stable equilibrium under this stationary policy.
10 OPTIMAL EQUILIBRIUM POINTS
43
but the first is One hopes that the two paths lead to the same conclusion, certainly easier. Consider the dynamic programming equation (69) F(x, t) = inf[c(x, u) + {3F(a(x, u), t + 1)] u
assumptions of for a discrete-time discounted state-structured process under the aneous cost instant and ) u , a(x 1 = 1 Xr+l on a time-homogeneous plant equati valued and vectorare u and x both that function c(x, u). We shall assume ives as derivat many as have F and c a, unconstrained and that the functions of ntials differe that is ed requir ption required. (In fact, the strongest assum be could a and c both ly, Actual second order should exist and be continuous). d existence of considered time-dependent in what follows, except when we deman a static equilibrium. that, by the If x and u are understood as column vectors, then recall a row vector. is t) Fx(x, conventions of Appendix 1, the vector of first derivatives Define (70) er the case when the negative transpose of Fx evaluated at (x 1, t). We shall consid sequence of {x 1} is the optimal orbit for prescribed xo, so that {.\} is then the differentials (70) defined on this optimal orbit. t to u1 will If we set x = x 1 in (69) then the minimality condition with respec imply a stationarity condition (71) ted at time t. where the row vector Cu and matrix au of derivatives are evalua on Differentiation of (69) with respect to x 1 yields the companion equati (72)
and that the Theorem 2.10.1 (Discrete time) Assume that the differentials above exist u and..\ at x, of values the Then optimally controlled process has an equilibrium point. an optimal equilibrium must satisfy the three equations (73) x=a, where arguments x and ufor a, c and their derivatives are understood. together with This follows simply by taking the equilibrium condition x = a the necessity for the equilibrium forms of equations (71) and (72) above. Note that we had to introduction of the conjugate variable ..\; it was because of this 1). One can, of establish the dynamic equations (71) and (72) first (see Exercise course, eliminate ,\to obtain the pair of equations
44
DETERMINISTIC MODELS AND THEIR OPTIMISATION
x=a,
(74)
The fact that the optimal equilibrium point varies with f3 is something we have already observed for the continuous-time model of Section 7, for which the optimal equilibrium point c was determined in terms of the discount rate a by equation (49). Discounting has the effect of encouraging one to realise return early in time, so there is an inducement to take a quick immediate yield from the system, even at the cost of sinking to a lower-yielding equilibrium. Equations (73) may have several solutions, of course, and the eligible solution must at least be stable. For a case in point, one can consider an optimal fishing policy under a model which allows for age structure in the fish population. For certain parameter choices, there is an equilibrium in that the optimal fishing pattern is constant from year to year. For other parameter values, one should opt for the non-stationary policy of 'pulse fishing', under which one allows the population to build up for a period before harvesting it; a static equilibrium solution may or may not be unstable under these circumstances. For the continuous-time analogue of this analysis the plant equation· is x = a(x, u) and the dynamic programming equation is inf[c - aF + 00F + Fxa] = 0, u t
(75)
where a is the discount rate. The equations analogous to (71) and (72) are
(76)
(77) We shall encounter these again in Chapter 7 when we develop the Pontryagin maximum principle; relations (76) and (77) are then seen as conditions that the orbit be optimal. For present purposes, we deduce the analogue of Theorem 2.10.1. Theorem 2.10.2 (Continuous time) Assume that the differentials above exist and that the optimally controlledprocess has an equilibrium point Then the values ofx, u, and A at an optimal equilibrium must satisfy the three equations a=O,
(78)
The question that should now really be faced is: what should the optimal control policy be in the neighbourhood of the equilibrium point? That is, if the equilibrium values are denoted by x and u, then how should u vary from u as x varies slightly from x? To determine this we must consider second-order effects and obtain what is in essence an LQ approximation to the process in the neighbourhood of equilibrium. More generally, one can do the same in the neighbourhood of an optimal orbit.
10 OPTIMAL EQUIUB RIUM POINTS
45
Consider again the discrete-time case, and define Ilr as the value of the square matrix of second-order derivatives Fxx on an optimal orbit at time t. Let 6.x1 denote a given perturbation of Xt from its value on the orbit at this point and 6.u 1 the corresponding perturbation in ur (whose optimal value is now to be determined). Theorem 2.10.3 (JJiscrete time) Assume that all di.lforentials now invoked exist. Define the Hamiltonian at timet, H(xt, ur, At+l) = -c(x,, ur) + j3XJ+ 1a(x1 , u1 ) and the matrices Ar = ax, Bt = au, Rr = - Hxx, Sr = - Hux, Qt+l = - Huu; these being evaluated on the original optimal orbit at time t. Then the matrix Il 1 satisfies the Riccati recursion Ilr = [R + j3ATilr +tA- (ST
+ j3ATIIr+tB)(Q + j3BTIIr+lB)- 1 (S + j3BTllr+tA)Jr
(79) where all terms in the square bracket save II are to bear the subscript t. The perturbation in optimal control is, to first order, 6.u1 = K16.x6 where
(80) Proof Set x = x 1 + 6.x1 and u = ur + !:l.ur in the dynamic programming equation (69), where x 1 and u1 are the values on the original optimal orbit, and expand all expressions as far as the second-order terms in these perturbations. The zerothorder terms cancel in virtue of the equation (69) itself, and the first-order terms cancel in virtue of relations (76) and (77). One is then left with a relation in second-order terms which is just the equation !xTII1x = inf[c(x, u) u
+ !f3(Ax + Bu) Tllt+l (Ax+ Bu)]
with x and u replaced by 6.x 1 and 6.u 1 and A, B, R, S and Q replaced by the t0 dependent quantities defined above. The conclusions thus follow.
.,
...
I
The interest lies in the replacement of the cost function c( x, u) (with cost matrices R, Sand Q) by the Lagrangian-like expression c(x, u)- j3_ATa(x, u). This is the negative of the Hamiltonian which will play such a role in the discussion of the maximum principle in Chapter 7. The additional term it -j3_ATa(x, u) would in fact contribute nothing at this point if a(x, u) were linear: is the non-linearicy in the plant equation which adds an effective supplementary cost. The negative signs which occur in our definition of .A come from a desire to be consistent with convention; these signs would all be positive if one had phrased the problem as one of reward maximisation rather than of cost minimisation. This perturbation calculation is, of course, of no value unless the perturbed orbit continues to lie close to the original optimal orbit. So, either one must be
46
DETERMINISTIC MODELS AND THEIR OPTIMISATION
ations conside ring events over a horizon which is short enough that perturb l origina the to back e converg fact remain small, or the perturb ed orbit must in r attracto an is orbit l origina orbit. The implication in the latter case is that the stance, under the optima l control rule. This would be a rather special circum stable a to settled itself had except in the particu lar case when the origina l orbit ndent indepe be also will equilib rium value. In such a case the matrice s II and K oft. only the The continu ous-tim e analog ue follows fairly immediately; we quote undisco unted case.
invoked exist. Theorem 2.10.4 (Continuous time) Assume that all differentials now time-dependent Define the Hamilt onianH (x, u, ..\) = -c(x, u) + ..\Ta(x, u) and the being evaluated matrices A = ax, B = au, R = - Hxx, S = -flux, Q = - Huu; these ed on the orif(;valuat Fxx = II matrix the Then t. time at on the original optima l orbit n equatio ginal orbit) satisfies the Riccati
(81) K has the The perturbation in optima l control is, to first order, 6.u = K6.x, where time-de penden t value
(82) u, so that For the harvest ing model of Section 7 the Hamilt onian is linear in we see Q = 0 and the above analysis fails. Such cases are spoken of as singular. As eless very from this example, an optima l control with an equilib rium may neverth control ofthe nature inuous discont the in itself well exist. The singularity reflects rule. Exercises and comments unted (1) One could have derived the conditi ons of Theore m 2.10.1 in the undisco int constra the to subject u and case simply by minimi sing c( x, u) with respect to x with ted associa ier x = a(x, u). The variable..\ then appear s as a Lagrange multipl ted case, the constra int. The approa ch of the text is better for the discoun however, and necessa ry for the dynam ic case.
CHA PTE R 3
A Sketch of Infinite-horizon Behaviour; Policy Improvement context one will very Suppose the model time-homogeneous. In the control an infinite horizon. If frequently consider indefinitely continuing operation, i.e. physical sense then the model is also such that infinite-horizon operation makes control rule will have one will expect that the value function and the optimal ndependent. That proper infinite-horizon limits, and that these are indeed time-i in the infinite-horizon is, that the optimal control policy exists and is stationary limit. might expect in the In this chapter we simply list the types of behaviour one tations can be false, as infinite-horizon limit, and that typil)r applications. Expec proportion if they are we illustrate by counter-example; dangers are best put in Chapter 11, when until is analys ntial identified. However, we defer any substa orated in that of incorp be can ent more examples have been seen and the treatm the stochastic case. tant and central Coupled with this discussion is introduction of the impor technique ofpolicy improvement. limit of equilibrium The infinite-horizon limit should not be confused with the l rule stationary contro behaviour. If the model is time-homogeneous and the with time, brium equili then one can expect behaviour to tend to some kind of to regard priate appro under suitable regularity conditions. It would then be more equilibrium as an infinite-history limit.
MING EQUATION 1 A FORM ALIS M FOR THE DYNAMIC PROG RAM so that we can appeal We suppose throughout that the model is state-structured, te-time case first. We to the material of Sections 2.1 and 2.6. Consider the discre argument can be time the that so s, shall suppose the model time-homogeneou possibility of the allow shall we dropped from a(x, u, t) and c(x, u, t), but s= h- t where Fs(x), n writte be discounting. The value function F(x, t) can then write the we if on ificati simpl is the time to go. We achieve considerable notational dynamic programming equation (2.10) simply as (1) (s > 0) Fs = fi!Fs-1 where !I! is the operator with action
48
A SKETCH OF INFINI TE HORIZ ON BEHAVIOUR
, u) !f'qy(x) = inf[c(x u
(2)
+ {3¢J(a(x, u))]
is that it is the cost on a scalar functio n of state qy(x). The interpe tation of !f'qy(x) sing ur, say, optimi so ion: operat of incurr ed if one optimises a single stage time t + 1. at +1) ¢J(x cost 1 closing a knowing that x 1 = x and that one will incur ns of functio scalar into state of ns The operat or !!' thus transfo rms scalar functio it since s, proces sed optimi the of state. We shall term is the forward operator n functio cost given a upon d indicates how minim al costs from time t (say) depen ic dynam the that fact the at time t + 1. The term is quite consistent with progra mming equati on is a backward equation. 1r which is not Let us also consid er how costs evolve if one chooses a policy only t and the s necessarily optimal. If the policy 1r is Markov in that u 1 depend will also be a t value x of curren t state x 1 then the value function from time ary then station also function only of these variables, V( 1r, x, t), say. If the policy is policy the case this it must be of the form u1 = g(x 1) for some fixed function g. In nt consta this of is often written as 1r = g< 00 ), to indica te indefinite application The x). as V3 (g(oo), rule, and one can write the value functio n with times to go is policy fixed this for backward cost recurs ion
Vs+l(g(ool,x) = c(x,g( x))
+ /3Vs(g(ool,a(x,g(x))),
(s > 0)
(3)
a relation which we shall conden se to
(4) g( 00 l. If it can be Here L(g) is then the forwar d operat or for the process with policy can suppress the taken for grante d that this is the policy being operat ed then we as argum ent g and write the recurs ion (4) simply
Vs+! = LVs. scalar functions The operat ors L and !!' transfo rm scalar functions of state to assum ed that ¢is a of state. If one applies either of them to a function c/J, then it is is that they are share they ty proper tant scalar function of state. One impor L. for ly similar 'lj;; ? ¢J if monotonic. By this we mean that !!' ¢J ? !1''1/J
Theorem 3.1.1 ( i) The operators !!'and L are monotonic. g (non(ii) If Ft ? ( ~ )Po then the sequence {Fs} is monotone non-decreasin increasing); correspondingly for { V9 }. Proof Lis plainly monot onic. We have then
!/'¢ = L¢? L'lj;? !1''1/J, ion of!/'¢, i.e. if we take u = g(x) as the particu lar contro l rule induce d by format . proven the minim ising value of u in (2). Assert ion (i) is thus
2 INFINITE-HORIZON LIMITS FOR TOTAL COST
Assert ion (ii) follows inductively. If Fs ~ Fs-1 then Fs+ 1
= .2! Fs ~ .2!Fs-1 = Fs.
In contin uous time we have (with the conventions of Sectio n of relations (1) and (4):
oF= vHF,
OS
ov =MV OS
(s
49
0 2.6) the analog ues
> 0).
(5)
operat ors vii and Here F and V are taken as functio ns of x and s and the M = M (g) have the action s
a(x, u)], A
M<jJ(x)
= c(x,g (x))- a<jJ(x) + 0 ~~) a(x,g(x)).
Exerci ses and comments unifor mly bound ed (1) Consid er a time-h omoge neous discret e-time model with f3 < 1). Suppo se also instan taneou s cost functio n and strict discou nting (so that l variables can take contro and state the that ity) (for simplicity rather than necess n ¢(x) of state functio scalar a of 11¢11 norm only finitely many values. Define the bound ed in ns functio of class the be f!J by supx l¢(x)j, the suprem um norm. Let this norm. Hence show that Show that for¢ and 'lj; in f!J we have 112¢ - .2!'1/JII :o::;; .BII ¢ - '!/!II· n Fin f!J, solutio unique a has F the equilib rium optima lity equati on F = .2! relatio n x u/ the that and f!J, identifiable with lims....oo fl!s'I/J for any ¢ of rule. l contro on determ ined by .2!F defines an optima l infinit e-horiz
2 INFINITE-HORIZON LIMITS FOR TOTAL COST ite operat ion, does The fact that one faces an infinit e horizo n, i.e. envisages indefin if one fires a guided not mean that the proces s may not termin ate. For example, falls back to Earth missile, it will contin ue until it either strikes some object or of escape into space.) with its fuel spent. (For simplicity we exclude the possibility al cost which is a In either case the traject ory has termin ated, with a termin functio n IK( x) of the termin al value x of state. has set no a priori The proble m is nevertheless an infinit e-horiz on one if one consid ered the one that in , bound a set bound on the time allowed. If one had time h, then ibed prescr a at flight in firing a failure if the missile were still The cost Ch ency. conting this to presum ably one should assign a cost Ch(xh) it from the uish disting to cost, might then more approp riately be termed a closing terminal cost IK, the cost of natura l termin ation.
50
A SKETCH OF INFINITE HORIZON BEHAVIOUR
In the infinite-horizon case h is infinite and there is no mention of a closing cost. One very regular case is that for which the total infinite-horizon cost is well defined, and the total costs V(x) and F(x) (under a policy g(co) and an optimal policy respectively) are finite for prescribed x. If instantaneous cost is nonnegative then this means that the trajectory of the controlled process must be such that the cost incurred after time t tends to zero with increasing t. One situation for which this would occur is that envisaged above: that in which the process terminates of itself at some finite time and incurs no further cost. Another is that discussed in Exercise 1.1, in which instantaneous costs are uniformly bounded and discounting is strict, when the value at time 0 of cost incurred after timet tends to zero as /3 1• Yet another case is that typified by the LQ regulation problem of Section 2.3. Suppose one has a fixed policy u = Kx which is stabilising, in that the elements of (A+ BK) 1 tend to zero as p1 with increasing t. Then x 1 and u1 also tend to zero as p1, and the instantaneous cost c(x1 , u1 ) tends to zero as p21 • Although there is no actual termination in this case, one approaches a costless equilibrium sufficiently fast that total cost is finite. One would hope that finiteness of total cost would imply that Vand F could be identified respectively as the limits of Vs = L'Ch and of Fs = sesch ass -+ oo, for any specified closing cost Ch in some natural class CC. One would also hope that these cost functions obeyed the equilibrium forms of the dynamic programming equations F = St'F, (6) V=LV
(7)
and that they were the unique solutions of these equations (at least in some natural class of functions). Further, that the minimising value of u in (6) would determine a stationary optimal policy. That is, that if
F = St'F = L(g)F (8) theng(oc) is optimal. In fact, counter-examples can be found to all these conjectures (see Exercises 14 below). Nevertheless, they hold under relatively mild conditions. One condition which assures a good part of them and which is natural in the control context is indeed that instantaneous costs should be non-negative c(x, u) ~ 0. This is referred to as the case of negative programming in the literature; 'negative' because positive cost corresponds to negative reward. The following two results are immediately useful; more substantial analysis is postponed until Chapter 11. Theorem3.2.1
Supposethatc ~ O.Then V(g(oo)) = limsrcoL(gro.
Proof Here by L'O we mean the finite-horizon cost Vs = Lsch in the case when the closing cost function Ch(x) is identically zero. Let c1 be the cost incurred at
2 INFINITE-HORIZON LIMITS FOR TOTAL COST
51
time t under the policy g< 00 l for a prescribed value of initial state. Then the assertion of the theorem simply amounts to the statement 00
s
0
sToo 0
Lct =lim L:c
1,
valid for a sum of non-negative terms, whether convergent or not.
D
Theorem3.2.2 Suppose that c ~ Oand that the function ¢(x) is such that¢~ Oand L(g)¢ <¢.Then V(g(oo)) < ¢. Proof Here by L(g)¢ <¢we mean that inequality holds for all x and strict inequality for some x. Let us denote L(g) and V (g< oo l) simply by L and V. Monotonicity of L implies that
(s= 1,2,3, ... ). Lettings tend to infinity, we deduce the assertion of the theorem.
0
A function ¢(x) such that L(g)¢ ~ ¢ is termed L(g)-excessive. It can be regarded as a cost function with the property that, if one has the options of either continuing with the policy g(oo) or of stopping and incurring a cost ¢(x) (where x is the current state), then one never does worse by continuing. The argument of Theorem 3.2.2 then implies that V(g(oo)) is the least non-negative L(g)-excessive function. This observation implies the following stronger statement. Theorem 3.2.3 Suppose that the instantaneous cost function is non-negative. Then V(g
Exercises and comments The following counterexamples to the regular behaviour envisaged in the text illustrate various types of behaviour. One is that the incurral of some element of cost can be indefinitely postponed but not ultimately avoided; this explains the instability shown in Exercise 1. Another is that a notional closing cost can affect costs and decisions at all horizons; this explains the non-uniqueness shown in Exercise 2 and 3. (In fact, it is taken as part of the infinite horizon specification that the closing cost is zero. This sets an absolute origin to cost, so that costs cannot necessarily be normalised by addition of a constant.) Finally, the case in which one maximises a positive reward ('positive programming') has the feature that an improvement in policy takes one away from the fixed bound of zero rather than towards, and this has consequences; see Exercises 4 and 5.2. Some exceptional cases are more easily illustrated by a stochastic example; see Exercises 11.1.1, 11.1.2 and 11.2.2.
52
A SKETCH OF INFINITE HORIZON BEHAVIOUR
The reader is invited to investigate the effect of discounting or of the addition of a time penalty (in the form of a constant positive component of instantaneous cost) on all examples. (1) Consider a process whose state space consists of the non-negative integers j = 0, 1,2, ... and a supplementary state a. In state a one can move to any integer state j > 0; in state j > 0 one can move only to j - I; state 0 is absorbing. All transitions are costless, except that from 1 to 0, which carries unit cost. Then Fs (a) is zero, because one can move from a to a j so large that the transition 1 --+ 0 does not occur before closing. Thus F 00 (a) := limFs(a) = 0. On the other hand F{ a) = 1, because the transition 1 --+ 0 must occur at some time under any policy (i.e. choice of move in state a). The fact that Foo =j:. F means that optimisation does not commute with passage to the infinite-horizon limit, and is referred to (unfortunately) as instability. (2) Suppose that one can either continue, at zero cost, or terminate, at unit cost. There is thus effectively only one continuation state; ifF is the minimal cost in this state then the equation F = ft'F becomes F = min(F, 1). The solution we want is F = 0, corresponding to the optimal policy of indefinite continuation. However, the equation is solved by F = K for any constant K ~ 1; that for K = 1 is indeed consistent with the non-optimal policy that one chooses termination. K can be regarded as a notional closing cost, whose value affects costs and decisions at all horizons. It is optimal to continue or to terminate according as K ~ 1 or K ~ 1. In fact, K = 0, by the conventions of the infinite-horizon formulation, but the non-uniqueness in the solution of the dynamic programming equation reflects a sensitivity to any other specification. (3) A more elaborate version of the same effect is to assume that x and u may take integral values, say, and that the plant equation and cost function are such as to imply the equilibrium dynamic programming equation. F(x) = min[lul + F(x- u)]. u
=
The desired solution is F = 0, u 0, corresponding to a zero closing cost. However, there are many other solutions, as the reader may verifY. corresponding to a non-zero notional closing cost. (4) Consider a process on the positive integers x = 1, 2, 3, ... such that when, in x, one has the options of either moving to x + 1 at zero reward ('continuation') or retiring with reward 1 - 1/x ('termination'). This then is a problem in 'positive programming': one is maximising a non-negative reward rather than minimising a non-negative cost. If G(x) is the maximal infinite-horizon reward from state x then the dynamic programming equation is G{x) = max[G(x+ 1), 1- 1/x]. This is solved by any constant G ~ 1, corresponding to the specification of some x-dependent closing reward which exceeds or equals the supremum terminal
3 AVERAGE-COST OPTIMALI TY
53
reward of 1. However, we know that there is no such closing reward, and we must restrict ourselves to solutions in G ::;;; 1. The only such solution is G = 1, attained for indefinite continuation. But indefinite continuation is non-optim al-one then never collects the reward. In short, this is a case for which there is a g such that.!£> F = L(g)F for the correct F, but g(oo) is nevertheless non-optimal. In fact, no optimal solution exists in this case. If one decides in advance to terminate in state x, then there is always an advantage in choosing x larger, but x may not be infinite.
3 AVERAGE-COST OPTIMALITY In most control applications it is not natural to discount, and the controlled process will, under a stationary and stabilising policy, converge to some kind of equilibrium behaviour. A cost will still be incurred under these conditions, but at a uniform rate 1, say. The dominant component of cost over a horizon h will thus be the linear growth term 1h, for large h. For example, suppose we consider the LQ regulation problem of Section 2.4, but with the cost function c(x, u) modified to! (x- q) TR(x- q) + !uT Qu. One is thus trying to regulate to the set point (q, 0) rather than to (0, 0). At the optimal equilibrium a constant control effort will be required to hold x in at least the direction of q. One then incurs a constant cost, because of the constant offset of (x, u) from the desired set point (q, 0); see Exercise 1. More generally, disturbances in the plant equation will demand continuing correction and so constitute a continuing source of cost, as we saw in Section 2.9. With known disturbances cost is incurred at a known but time-varying rate. One could doubtless develop the notion of a long-term average cost rate under appropriate hypotheses, but a truly time-invariant model can only be achieved if disturbances are specified statistically. For example, we shall see in Section 10.1 that, if the disturbance takes the form of 'white noise' with covariance matrix N, then the minimal expected cost incurred per unit time is 1 = ! tr( N II). Here II is the matrix of the deterministic value function derived in Section 2.4. In general, there are many aspects of average-cost optimality -concepts and counterex amples-w hich are best discussed in the stochastic context, and which we shall defer to Chapter 11. Let us denote the cost per unit time for the policy g(oo) and for an infinitehorizon optimal policy by lg and 1 respectively. It is as yet not clear how one is to tackle infinite-horizon cost evaluation or optimisation; if costs are going to build up at a constant rate then the total cost over an infinite horizon is certainly infinite. One way of reducing the situation to that of the last section is to subtract lg or 1 from the cost c. With this normalisation one has a problem for which total cost may again be well defined, under appropriate regularity conditions. The effect of the normalisation will be to change the dynamic programming equation V = LV under the policy g(oo) to
54
A SKETCH OF INFINITE HORIZON BEHAVIOUR
'Yg
+ v = Lv.
(9)
Here v(x) is the infinite-horizon value function for the reduced-cost problem, presumed finite. We interpret v(x) as the component of infinite-horizon cost arising from the fact that the operation started in state x rather than in equilibrium; we shall term it the transient cost. One presumes that equation (9) and the condition that v be finite determines 'Yg· However, the absence of discounting means that (9) determines v only to within an additive constant. This constant can be regarded as arising from an irrelevant closing cost, and can be normalised by prescribing the value of v( x) for some x. Actually, if the controlled process has several equilibrium points then both average cost 'Yg and the arbitrary 'constant' of integration will depend upon the particular equilibrium to which the process converges, and so will be independent of starting point x only for starting points with the same terminal equilibrium. Otherwise expressed, they are constant within the domain of attraction of a given equilibrium point, but will in general vary between domains. We expand on this point in discussion of the continuous-time case below. The analogue of the dynamic programming equation (6) for the optimally controlled process will be 'Y + f = !l'f
(10)
where 'Y is the minimal average cost and f (x) the minimal transient cost. If this optimally controlled process has distinct equilibria of different average cost then it must be that there is no finite-cost manoeuvre which will bring the state value from a domain of higher 'Y to one of oflower 'Y· If there were, then one would use it. We have the continuous-time equivalents of (9) and (10) 'Yg
= Mv,
'Y
= .Af
with 'Yg and 'Y still interpretable as rates of cost per unit time. We can write the first of these equations more explicitly as 'Yg
8v(x)
= c(x,g(x)) +a;-a(x,g(x)).
(11)
Suppose that xis an equilibrium point of the process under policy g
= c(x,g(x)).
(12)
If v( x) is indeed to be interpretable as the increase in total cost incurred by a start from x rather than from the equilibrium value x then it follows that the constant of integration is fixed by the condition v(x) = 0. Suppose now that we try to choose the policy optimally. We see from the above that there are two levels of optimality. First of all, one could try simply to choose
4 GROWTH OPTIMALITY
55
the control rule u = g(x) so that the equilibrium value xis such as to minimise expression (12). This is the criterion of average cost optimality; the equilibriu m values of x and u which achieve it are determine d in Theorem 210.2 (with the discount rate a set equal to zero). However, this does not determine the control rule away from equilibrium. A necessary property is that the rule should stabilise the equilibrium; a desirable property if that it should minimise the cost of passage to equilibrium. The full dynamic programm ing equation 'Y = vi{f achieves this, by minimising the costf (x) of passage to equilibrium as well as the cost "Y per unit time at equilibrium. The distinction between optimisation at and optimisation to the equilibrium point generally disappears both in the stochastic case and in practice, when one must assume that the process is constantly disturbed from what would be its deterministic equilibrium point. Exercises and comments (1) Consider the problem of regulation to (q, 0) considered in the text with s = o. Show that the actual equilibrium value of xis x = (I- R- 10.)q and the minimal average cost is "Y = qTO.q, where
4 GROWTH OPTIMAL ITY Suppose that the cost (or reward) incurred over a horizon oflength his of order I', where p > 1. That is, cost (reward) grows exponentially fast. This is the type of behaviour encountered in economic contexts, when p is the rate of economic growth. Maximisat ion of total reward over a long horizon amounts primarily to maximisation of the growth rate, just as it amounted to minimisation of the cost rate in the last section. This is not the kind of behaviour one encounters in conventional control contexts, but we sketch a few ideas so as to complete the gallery of types: processes for which (in rough terms) total cost remains bounded, grows linearly or grows expone~ially as the horizon recedes. The example we consider is the simplest possible, but serves to illustrate a number of points. Suppose that one can operate a number of activities, each of which both consumes and produces various commodities at rates proportion al to the intensity with which the activity is pursued. Let the kth element of the column vector x 1 denote the intensity with which activity k is pursued at time t. Let aik denote the amount of commodity j consumed by activity kat unit intensity and bik the amount produced. Let bi denote the amount of commodity j which is naturally available per unit time. One then has the constraints x 1 ;;?: 0 and
. r
'.:
.
•...·.·•··•·
56
A SKETCH OF INFINI TE HORIZ ON BEHAVIOUR
-~.· .;~ .~-
· ...
..
·~!'·• Y-?
( 13) Axr <: b + Bxr+ an activity patter n Inequa lity (13) reflects the fact that at time tone canno t choose A and B are the Here le. availab is which consum es more of any comm odity than ively. respect b and k b k, a ts 1 matric es and b the colum n vector with elemen 1 1 in the choice of e latitud The n. One must regard (13) as the plant equatio n: the decisio of choice of e fulfilment of this inequa lity represe nts the latitud where r(xr), ise maxim to choice of the intensity vector x 1• Suppo se one aims e imagin can one , Briefly x. vector r(x) is the utility associa ted with an intensi ty which ption, consum al person that some of the produc tion is siphon ed off into can be regard ed yields satisfa ction (i.e. utility). However, person al consum ption rate of person al of choice The as just anothe r activity, albeit an unprod uctive one. x. vector ty intensi the consum ption can thus be lumpe d in with choice of p > 1 such scalar a and x vector Suppo se now that one can find a non-ne gative
:LZ
that pAx<; Bx.
in the absenc e of That is, one can achieve an expand ing economy, x 1 ex: / x, even are such that ions condit initial the extern al resources repres ented by b, at least if this x and for +oo --+ r(>.x) all activities repres ented in x can be started up. If teed by guaran is ction satisfa with increa sing positive scalar ,\ (so that person al reward of sation optimi h, e the progra mme) then one sees that, in the limit oflarg p. amoun ts primar ily to maxim isation of the growth rate a solutio n of the The maxim al value of p consis tent with (13) is indeed eigenvalue proble m (pA- B)x = 0, the corres pondin g p being the maxim al solutio n of jpA- Bj = 0 and x gative implies that eigenvector. The fact that the elements of A and B are non-ne be productive and to is the same is true of x. One requires that p > 1 if the system able to start up the that x should satisfy the condit ions above if one is to be progra mme and satisfy the consum er. one will largely This maxim al-grow th path constit utes the 'turnpi ke' which ibed initial state follow in an optima l progra mme of expans ion from some prescr to some desired termin al state. ale model (see This model is a specia l case of the so-cal led von Neum ann-G lised since. e.g. Gale, 1960,1967, 1968) which has itself been much genera course, fallac iousof is, itely indefin grow can y The notion that an econom point. However, essential resour ce limitat ions will manife st themselves at some there may be a phase during which the idea is not unrealistic.
5 POLICY IMPROVEMENT consid er only for Consid er again the total-c ost case of Sectio n 2, which we shall on (7) for the equati of n Solutio u). x, c( cost aneous the case of non-negative instant
,;,
i
..
..---
5 POLICY IMPROVEMENT
57
value function V with a prescribed policy must be regarded as relatively straightforward, because this is linear in V. At the worst, it can be solved computationally if the number of states is finite. The dynamic programming equation (6), which determines the optimal policy, is another matter, however. It involves a combination of linear and extremal operations. One could approximate the determination of both F and an optimal policy ever more closely by calculating the iterates Fs = !l' Fs-1 for s = 1, 2, .... This is the method of value iteration. It has the merit of reproducing the effect of an increasing horizon, but is generally very slow to converge. A much faster method is that of policy improvement. This is also iterative, proceedinr in stages which we shall label by i. Suppose that at sta~e i one has a policy gj 00 • Denote the corresponding infinite-horizon cost V(g;oo ) by V1• (This is a bad notation, in that V1 is liable to be confused with V3 ; the cost at horizon s. We hope that strict adherence to the symbols i and s in the two contexts will prevent confusionJ We know V1 to be the minimal non-negative solution of V1 = L(g;) V1• Having determined V1, one then determines the g1+1 which yields an improved policy from L(g1+1)V1 = !l'V1• That is, g1+1(x) is the minimising value of u in !l' V1(x), and so is the optimal value of control if one can optimise for one step before reverting to the policy gl00 )_
Theorem 3.5.1 The inequality !l' V1 ~ L(g;) Vt holds.l{strict inequality holds for some x then the policy g;_<;{ is a strict improvement on gl 00 • Proof The inequality first asserted holds in virtue of the definition of !l'. If strict inequality holds (by which we mean that equality does not hold everywhere) then V;
> !l'V; = L(gi+l)V; ~ L(gi+dV; ~ L(gi+IY(O) - Vt+i·
That is, V1 > V1+1 and the iteration has produced a strict improvement.
0
One would like to assert that, if equality holds, so that V1 = !l' V1, then V1 = F and the policy gloo) is optimal. This will be true under the assumptions of Exercise 1.1, for example, but is not in general true without qualification. One may say that value iteration carries out the operations of improvement of policy and passage to the infinite-horizon limit simultaneously. Policy improvement realises these two operations alternately, with passage to the limit telescoped to solution of the linear equation system V =LV. Typically, policy improvement indeed approaches optimality (when it does so) in very few iterations. The following observation partly explains why.
Theorem 3.5.2 The policy improvement algorithm is identical with application ofthe Newton-Raphson algorithm to the equilibrium dynamic programming equation F = !l'F.
58
A SKETCH OF INFINITE HORIZON BEHAVIOUR
Proof Recall what is meant by the Newton-Raphson (henceforth NR) algorithm. If we have an approximate solution F = V; ofF = ft' F then we improve it to a solution V;+ 1 = V; + !::..; by setting
(14) expanding the right-hand member as far as first-order terms in !::..; and solving the consequent linear equation for!::..;. Suppose that ft' V; = L(g;+I) V;. Because u is optimised out in the application of fi' the variation in u induced by the variation !::..; of V; has no first-order effect on the value of fi'( V; + .6.;), and fi'(V;
+ !::..;) =
L(gi+I)(V; + .6.;) + o(.6.;)
It follows that the linearised version of equation (14) is just vi+ I = L(gi+I) vi+!·
That is, vi+ I is just the value function for the policy gi.';:"{, where gi+l was deduced D from the value function V; exactly by the policy improvement procedure. This is a useful observation, because the NR algorithm is such a natural one, as we shall find The algorithm now has a direct variational justification in this context. The equivalence also has implications for rates of convergence: in regular cases, at least, value iteration and policy improvement show respectively first-and second-order convergence; see Exercise 1. Policy improvement can be used also for the average-cost formulation. If we assume, for simplicity, that there is only a single equilibrium point under the various policies, then in discrete time it takes the form: determination of the average cost 'Yi and transient cost function v;(x) from 'Yi + v; = L(g;)v;
followed by determination of the improved control rule u L(gi+I)v;
= g;+ 1( x) from
= fi'v;.
By 'improvement' is meant that either the average cost has been decreased, or it is unchanged and the transient cost has been decreased. One generally hopes for improvement in the first and stronger sense; see Section 11.3. The continuoustime versions of both total- and average-cost procedures will be plain. The techniques of value iteration and policy improvement were formalised by Howard (1960). The equivalence of policy improvement and the NR algorithm was demonstrated in the LQ case by Whittle and Komarova (1988); in this case it holds in a tighter sense. However, we see from the last theorem that it holds generally. Puterman (1994) attributes the observation of the equivalence to Kalaba (1959) and Pollatschek and Avi-Itzhak (1969). However, it is only in recent years that the point and its application have bitten: see Chapter 18.
6 POLICY IMPROVEMENT AND LQ MODELS
59
Exercises and comments (1) Consider solution of the equation y = f (y) for a scalar y. Let y be the desired solution and {y;} a sequence of approximate solutions. Denote the error y;- y by Ei. The sequence of approximations is said to show rth-order convergence if Ei+I = 0( ~) for small E; and all sufficiently large i. Suppose one generates the sequence by iteration: Yi+I = f(yi). Then, if convergence toy indeed occurs, one fmds that Ei+l = /'Ei + o(t::;), where f' is the first derivative off at y. If the sequence is generated by the NR algorithm one finds that
!"
Ei+l =- 2 ( 1 _ f')
Ef +
o(Ef).
Granted the convergence and differentiability assumptions, we see then that the two methods show first- and second-order convergence respectively. (2) In the case of positive programming the analogue of the least-excessive property of Theorem 3.2.2 does not hold, with the consequence that policy improvement may not work. Consider Exercise 2.4, where one has the option of continuation or termination in each state. If one chooses termination in all states then the corresponding value function (in reward terms) is V(x) = 1- 1/x. If one then performs policy improvement (i.e. chooses the action corresponding to the larger of V(x + 1) or the retirement reward 1- lfx) then one chooses continuation in all states. However, indefinite continuation is non-optimal. 6 POLICY IMPROVEMENT AND LQ MODELS Policy improvement can be regarded as both an analytic and a computational technique. For LQ models it is a combination of both, in that it provides an excellent way of solving the Riccati equation iteratively. Consider the problem of undiscounted LQ regulation to zero treated in Section 2.3. For notational simplicity we shall assume the matrix S normalised to zero. We know that the optimal policy has the form ur = Kxr. Assume then that the policy at stage i has the form ur = Kixr, so that the corresponding infinitehorizon value function has the form V;(x) = !xTII;x. Then the two steps of the policy improvement algorithm reduce to: (i) Determine IIi from the linear equation IIi= R+K(QKi +(A +BK;)TII;(A +BKi)·
(15)
This will have a finite solutionifthematrixA + BKi is a stability matrix; i.e. if the assumed control law is stabilising. (ii) Determine the matrix Ki+l of the improved control law as (16)
A SKETCH OF INFINITE HORIZON BEHAVIOUR
60
For a numerical example, consider the case
i],
A=[~
B=
[n,
R=
[~ ~],
Q= 1,
S=O,
which can be regarded as a discrete-time version of the stabilisation to the origin of a point-mass moving on a line-New tonian in that its acceleration is proportio nal to the control force exerted on it. If we suppose initially a control rule with K = -[0.25 1.00], then successive iterations modify this to -[0.329 1.224] and then -[0.328 1.220]. The smallness of the change on the second iteration indicates that the first iteration had already brought the control very close to optimality. The value of II corresponding to the last (and virtually optimal) ru1e is
II= [3.715 3.044] 3.044 8.264 . The continuous-time versions of (15) and (16) are
R
+ K( QKi +(A+ BKi) TIIi + IIi(A + BKi) = 0,
(17) {18)
Ki+l = -Q- 1BTIIi.
For example, consider the pendu1um regulation problem of Section 2.8 with
A=
[:1
~],
B=
[n.
R=
[~ ~],
Q=1, S=O.
The ± option in A corresponds to consideration of the inverted or the hanging pendulum respectively. Of course, we solved this problem analytically in Section 2.8, but it is of interest to see how quickly the policy improvement algorithm will yield convergence to the known solution. For the hanging case we have the optimal solution
II= [2.;2 1 ] 1 .;2 '
K = -(1
v'i]
where .;2 = 1.414 to three decimal figures. The initial choice K = -(1 iterates to - (1 1.5] and - [1 1.416] on the first and second steps. For the inverted version the optimal solution is
II_ [4.898 3.000] - 3.000 2.449 ,
K = -(3.000
1]
2.449],
to three decimal figures. The initial choice K = -[2 1] iterates to -[3.5 4.0], -[3.05 2.76] and -(3.001 2.467] on successive steps. The numerica l solution of the cart/pendu1um problem was given in Section 2.8. The difficulty with this problem is to find a stable policy with which to start the
7. POLICY IMPROV EMENT FOR THE HARVESTING EXAMPLE
61
the algorith m-one cannot do so until one realises that the coefficients of which that to displacement terms in the control rule must have the sign opposite one would expect on naive grounds, as noted in Section 2.8. , The problem of regulation to zero can be regarded as a total-co st problem total finite since a policy u = Kx which stabilises the equilibr ium x = 0 achieves in cost from any finite starting point x. If we bring in disturbances or tracking as we although Section 2.9 then cost does indeed build up indefinitely with time, have to go to a statistical formula tion of the disturba nces and the reference signal all before the average cost has a simple definition; see Section 10.1. However, in n equatio of these cases the central problem remains the solution of the Riccati on for the matrix II, which does not differ from that for the simple regulati problem above.
7. POLICY IMPROVEMENT FOR THE HARVESTING EXAMPLE red Conside r again the harvesting example of Section 1.2. The naive policy conside policy optimal the one; hted short-sig very a be to out in that section turned in (under the extreme simplifYing assumpt ions of the model) was deduced in is ment improve policy of stage one ul successf Section 2.7. Let us see how taking us from the first policy towards the second. If V(x) is the infinite-horizon cost under a policy u = g(x) then V satisfies the equation (19) u- a V + (a(x) - u) Vx = 0 with u indeed given the value g(x). Recall that xis the stock size, a(x) the natural net reprodu ction rate, u the catch rate and a the discoun t rate. For the naive policy u is given the constan t value Urn = a(xm) for all x > 0, where Xm is the value rate maximising a~). For the optimal policy one harvests at the maxima l feasible M for x > Xa and not at all for x ~ X 0 , where Xa is the root of a' (x) = a. It follows from (19), with u = g(x ), that
Vx = 1 + a(x) - a V(x). g(x) - a(x)
(20)
or M It also follows that, for the improve d policy, one will set u equal to zero of case the in choice The unity. then less or than according as Vx is greater
see equality is immater ial; we might as well set u equal to zero for definiteness. We M or zero to equal u take should we that then from (20) that this implies u (with not or sign same the have a(x) according as a(x) - a V(x) and g(x) zero in the transitio nal case). For the naive policy it follows from (20), and is otherwise obvious, that (21)
62
A SKETCH OF INFINITE HORIZON BEHAVIOUR
where T(x) is the time taken for the stock to drop from x to zero under the policy. For x ~ Xm this time is infinite; for x < Xm T(x)
=
1
dy . o a(xm) - a(y) x
(22)
For the naive policy it is also true that the expression g(x) - a(x) = a(xm)a(x) is nonnegative for all positive x. It thus follows that u will be zero in the improved policy for all x such that a(x) ~ o: V(x). Appealing to evaluations (21) and (22) we find this to be the case if o:
1x
dy l a(xm) - a(O) :::::; og = o a(xm) - a(y) a(xm) - a(x)
1x 0
d(y) d y, a(xm) - a(y)
or
1 x
o:- a'(y) --;----:---'---':---:- dy :::::; 0.
o a(xm) - a(y)
(23)
The value of x separating the regimes of zero and maximal harvesting rate in the improved policy is that for which equality holds in (23). One sees that this root certainly lies between Xm and Xa. Repetition of the argumen t shows that the recommended transition value converges rapidly to the optimal value x" with iteration of the improvement. This argument was of course unnecessary, as the optimal policy is known and simple. However, the example is perhaps reassuring, in that it shows how a single stage of policy improvement may change a conspicuously stupid policy to one qualitatively close to the optimal.
CHAPTER 4
The Classic Formulation of the Control Problem: Operators and Filters 1 THE CLASSIC CONTROL FORMULATION By the 'classic' control formulation we mean the formulation adopted when design was achieved by art (or craft) rather than by optimisation. The two approaches necessarily have a good deal of common ground, however, some of which we explore in this chapter. A prototype example is that of a mariner trying to make a ship follow a desired course; we may as well suppose it a sailing ship, and so emphasise the classical nature of the example. The block diagram of Figure 1 represents the situation. The ship is affected by the actions taken by the mariner, in changing sail or helm. It is also subject to other and unintended disturbances, such as weather and current. The mariner observes how the ship reacts to these combined forces and revises his actions accordingly, thus closing the control loop. His actions must be based upon what he can observe: e.g. apparent speed and heading of the ship, roll and pitch, the state of the sea and weather, the hazards of the topography and, of course, the specified intended course. Actual course
Wind and sea
,----
Ship 1---
Helm and set of sail
Observations
~
Mariner
Desired course
Fipre 1 The block diagram for a system in which a mariner controls the course ofhis ship by the operation ofhelm and sails, endeavouring to follow a set course against the disturbances offered by weather and sea.
64
THE CLASSIC FORMULATION OF THE CONTROL PROBLEM X
d
Plant
cY y
u
Controller
X
w
Figure 2 An abstract version of the block diagram ofFig.l corresponding to equations (1) and (2).
In more general language the situation is that of Figure 2. The physically given system which is to be controlled is termed the plant. In our navigation example the plant is primarily the ship (although see the comments of Exercise 1), which is subject to the disturbances d of sea and weather as well as to the deliberate control actions u. The mariner becomes the controller, the unit which determines control actions u on the basis of current information. (This unit could then equally well be a person taking conscious decisions or a device following the rules built into it. The formal view is simply that a control policy formulates an action rule for all foreseeable situations. This view fails if situations arise which are not foreseeable or which are poorly quantifiable-ou tside our domain!) The current information will include all available observations y on the state of the plant and also the command signal w which specifies those aspects of the course which one wishes the plant to follow. The block diagram of Figure 2 is equivalent to a set of mathematical relations. The plant is regarded as a system whose inputs are the control sequence u and , disturbances d and whose outputs are the actual performance x and the observation sequence y. However, the only output upon which decisions can be based are the observations. For control purposes the plant is thus described by an operator <'1, known and given, which converts input into output:
(1) The physical variables of the process (such as the process or state variable x) are buried in the dynamics of the system which determine the relation (1). This formulation is then often referred to as an input-output formulation, to be compared with the more explicit state-structured dynamic models of later chapters. Of course, relation (1) is not just an 'instantaneous' relationship. It is the consequence of a dynamic mechanism, and so will in general represent y at a given time as being dependent upon all past u and d.
65
1 THE CLASSIC CONTROL FORMULATION
One chooses a control rule, necessarily in terms of observables,
(2)
u = .*"{y, w)
where the operator .*" is to be chosen to secure good properties for the controlled system. A simplified version of this model which is often treated is
(3)
u = .*"(w- y).
That is, it is supposed that the disturbance is simply added to the control input, that the only observable plant output y is the very variable which is required to follow the command signal w and that the control signal is determined purely from the tracking error e = y - w. These assumptions are represented in the simplified block diagram of Figure 3. At least in the case when plant and control operators are linear, the system (3) has the formal solution y = (J + ~.*")- 1 ~(X"w +d),
(4)
for the plant output y in terms of system inputs. Here J is the identity operator. Relation (4) can be regarded as the input/outp ut relationship for the controlled system. It yields, in particular, the expression
(5) for the tracking error e = y - w. Now, what one certainly requires of the controlled system is that the tracking error should be small Consider, for simplicity, the case d = 0, for which the tracking error is the solution of the equation
(J + ~.*")e = -w. d
+
(6) y
Plant
~
y
u
Controller
X
+
w
ligure 3 The special case ofthe system ofF~g. 2 in which the controller acts on the deviation of desired course wfrom actual course y, and in which the disturbance dis simply superimposed on the control u. This corresponds to the pair ofequations (3).
66
THE CLASSIC FORMUL ATION OF THE CONTROL PROBLEM
Then the minimal demand of standard control theory is that the dynamic system specified by (6) should behave well, in that its solution e should be small and even converge to zero as time passes for all w in some class of typical comman d signals. It is such considerations which generate the classical concepts of transient response, stability, dynamic lags etc. In developing these ideas we shall have to more specific about the nature of signals such as u, dandy and of operators such as ~ and %. We should also note that there are considerations other than the reduction of error e. This reduction should be achieved with reasonable economy of control effort Further, any control policy specified should be robust in that it works satisfactorily, not merely for a variety of inputs wand d, but also under some degree of mis-specification of the given system (i.e. of the plant operator~. Exercises and comments (I) In a higher-level formulation one will have a mathematical model of the sea and the weather, which will in this case also be included in the 'plant: Although one has no hope of controlling sea or weather they form part of the physical picture, and can be predicted to some degree. In such a case the disturbance d would represent just those forces driving sea, weather etc. which are 'primitive', in that they are not explainable by a model. A primitive signal is a combination of two extremes, either fully known in advance (in which case no model is needed) or unknown and completely random (so that a model can achieve no further reduction). Equally, the comman d signal may be specified in advance (a specified course for the ship) or not (the intended course may have to be revised in the light of contingencies, or one may be following an unpredictable target vessel). In the latter case the plant may include also a model for the evolution of the comman d signal, and in either case the comman d input to the system will be reduced to its 'primitive' component. (2) R~lation (4) follows by elimination of u from relations (3), and leads essentially to the demand that (J + ~%) should have a stable inverse. If we eliminate y instead then we obtain the equation (J + %~)u = %(~d- w) with the implication that the operator (J + %~) should have a stable inverse, if the control u is not to build up with time. If the dimension of u is smaller than that of y then the second formulation has the advantage of working in a lower dimension. (Note that the dimension of J will then differ in the two cases.) The operator ~% is the loop operator in that it gives the effect on a signal of being passed successively through controller and plant. The operator %~ is equally a loop operator, but calculated by starting from the plant input port rather than the controller input port.
2 FILTERS IN DISCRETE TIME: THE SCALAR (SISO) CASE
67
2 FILTERS IN DISCRETE TIME: THE SCALAR (SISO) CASE The blocks of an engineer's block diagram (such as those of Figures 1-3) are units specified only by their inputs and outputs and the operation which converts one into the other. Let us for the moment discuss this matter generally, so that symbols such as x and r§ refer only loosely to the control context. For our purposes a signal x is simply a function of time-the speed of a flywheel, the angle of an aileron, the price of fish, the pattern of power demand over the country, the picture on a television screen. In discrete time we shall write its value at timet as x 1, so that {xr} is a sequence. In continuous time we shall write rather x(t), if the time argument is indicated at all. The action of the unit represented by one of these blocks is then specified by a relationship between input d and output x, say. The quantity r§ is the operator which represents the effect of the block, transforming input signals into output signals. Such a unit or block is also often referred to as a 'filter', since in electronic contexts so many of the operations applied to a signal are regarded as filters in that they may, for example, pass frequencies only in some prescribed band. In those contexts filters are realised by electronic hardware. The response of a car to bumps and hollows in the road is equally well the response of a filter, realised mechanically by the suspension linked to the mass of the car. In the same way one can see filters which are realised by economic mechanisms (e.g. the response of imports to demand) or biological mechanisms (e.g. the response ofheart- rate to exertion). So, for our purposes, 'filter' and 'operator' are respectively concrete and abstract expressions ofthe same thing. A signal could well take qualitative values (e.g. the colour of a traffic light or the identity of a telephone caller) but we shall suppose it quantitative. A filter will in general have several inputs and several outputs, but, by taking signals as vectorvalued, we can combine these into a single vector input and a single vector output, usually of different dimensions. Treatment of the vector case scarcely needs more formal machinery than does that of the scalar case (i.e. that of a single scalar input and a single scalar output) if one uses the right formalism. We shall consider the two cases separately, nevertheless, since this gives us the chance to approach the topic in two different ways. In considering the scalar case we shall concern ourselves simply with what conclusions hold; in considering the vector case we shall also ask why they hold. What we have termed the scalar and vector cases are often referred to as the SISO and MIMO cases in the engineering literature. The terms are abbreviations of 'single input/single output' and 'multiple input/multiple output' respectively. Although we start with the SISO case in this section, we shall in general take the MIMO case as the standard one. That is, we take it for granted that signals are vector-valued. We shall consider discrete-time filters first. j:
68
THE CLASSIC FORMULATION OF THE CONTROL PROBLEM
One important sub-class of filters is constituted by the linear filters, for which indeed the relation between input and output is linear:
(7) The coefficient g,.,. is then the response of the fJJ.ter at time t to an input consisting of a unit pulse at time T. For this reason it is termed the transient response or the impulse response of the filter. Two significant properties of a fJJ.ter are causality and stability. A filter is causal (or non-anticipatory) if its output at time t cannot depend upon its input after timet. In the linear case (7) this would imply that
(r > t).
(8)
One expects that any physical mechanism (driven by an input d to produce an output x) will be causal. For example, consider the disturbed plant equation (2.61), and suppose it uncontrolled, in that u is set identically equal to zero. Then the solution x 1 of this equation is dependent (linearly, in fact) upon the disturbance input d-r only forT :::::; t. Non-causal filters can occur, however, if physical mechanisms are supplemented by human intervention. Consider the same system, but with the optimal control (2.65) applied. This makes allowance for future disturbances, and so the solution x 1 of the controlled system will depend upon disturbances d.,. for all T. That is, the corresponding filter is non-causal. Note, however, that this can come about only because future d-values are assumed known. Stability, understood in a wide sense, is simply the requirement that the filter output should lie in some regular class of signals if the input does so-e.g., the class of uniformly bounded signals. There are as many definitions of stability as there are of regularity. To be more specific we should consider a more specific class offilters, the translation-invariant filters, which are stationary in their action. Consider the backward translation operator §', with with effect
(9) This then achieves a time delay of a single time unit. The operator ?7 2 achieves a time delay of two units. More generally, the operator ?7' achieves a time delay of r units, where r is an integer of either sign. If r is negative then the effect is of a time advance of -r. A filter is said to be translation-invariant or time-homogeneous if a time-shift in its input merely produces the same time-shift in its output. This can be expressed as the commutation relation
(10) which in fact implies that
(11) .'
69
2 FILTERS IN DISCRETE TIME: THE SCALAR (SISO) CASE
Theorem 4.2.1 Fora translation-invariant linear filter relation (7) takes the form (12)
Proof Applying both sides of (10) to the input consisting of a unit pulse at time t'- 1, with<§ having the linear form (7), we deduce that g1r = g 1-J,r-J. Iterating this relation, we deduce thatg11'
= gt-t',O = gt-t', say.
0
The coefficient g1 is then interpreta ble as the transient response of the filter a time r after applicati on of a unit pulse to the input, actual clock time being irrelevant. We shall abbreviate the descripti on 'translati on-invari ant linear' to TIL. A stability condition for TIL filters is that of of lq-stability: that
( 13) We see that /1-stability ensures that uniforml y bounded input yields uniforml y bounded output, and so is sometimes referred to as BIBO stability. The demand of /2-stability will prove to be a natural one when stochasti c inputs are considered. Stability has in fact been assumed in some degree when one writes down a relation such as (7) or (12). There is an implicati on behind this that the filter has been operating since the indefinite past (as there is no upper limit on r) and that the output is zero if the input is zero. In other words, if the filter represents a physical mechanis m, then there is an assumpti on that any effects from the startup of this mechanis m in the remote past have died away (see Exercise 3.1). If a pulse input reveals the transient response, then, for TIL filters, an exponential input x 1 = z- 1 for scalar z reveals another simple response pattern. One sees readily that the output is just G(z)z- 1 where G(z) is the generatin g function of the transient response:
G(z)
= Lg,z',
(14)
at least if the sum (14) is convergent. That is, the effect of the filter on such an input is simply to multiply it by a factor G(z). The function G(z) is termed the transfer function of the filter. One sometime s expresses the relations hip (14) by saying that the transfer function G(z) is the z-transform of the transient response {g,}. Note that we could write C§ = G( .9"") in the TIL case. Note also that, if C§ 1 and C§2 are TIL filters, then C§ 1C§ 2 is a TIL filter with transfer function G1 (z)G2 (z). Inputs of the form x 1 = eiwt are of special interest, being both exponential and bounded. We can well consider complex-valued signals, and are indeed forced to do so if we require both bounded ness and exponent ial behaviour. The transfer function for this input is G( e-iw), the .frequency response at frequency w. If we write this as p(w)e-i0(w) then p and e (assumed real) represent respectively the
70
THE CLASSIC FORMULATION OF THE CONTROL PROBLEM
multiplication of amplitude and the shift in phase which the filter induces on the signal eiwt. One may ask: to what extent are the transient response, transfer and frequency response functions mutually determining? This raises questions both of mathematical nicety (essentially, the whole theory of Fourier/Lapl ace transforms and their inversion) and of physical understandin g. We shall escape as lightly as we can with the first and attempt to clarifY the second. The transient response function certainly determines the other two in principle; the question then is: for what values of z is the series expression (14) for G(z) meaningful? Stability conditions of the type of (13) certainly ensure that G(z) is defined for z on the unit circle in some sense, and so that the frequency response function G( e-iw) is correspondin gly defined. If q = 1 then the series defining G(z) is absolutely convergent on the unit circle. If q = 2 then G(z) is defined on the unit circle in an L 2 sense. If the frequency response function is given, then by hypothesis it has a Fourier expansion
(15) and the Fourier coefficients (the transient response) are then determined by the inverse relation g,
. . dw. = 2I7!" /71' -71' G(e'w)e•wr
(16)
One can then determine G(z) and continue the original specification of G(z) off the unit circle. If we are given the transfer function G(z) then it does not necessarily follow that the transient response is determined by (15), (16), for reasons we shall see in a moment. In discussing these matters it is helpful to avoid the subtleties of Fourier theory by making some strong (but not unrealistic) convergence conditions. Suppose that the filter is stable in the very strong sense that the transient response g, converges exponentially fast to zero as r --. ±oo, in that it behaves as Pi for r large and positive and as P2 for r large and negative, where 0 ~ PI < 1 < P2 ~ oo. Then.G(z) converges for P2 1 < lzl < Pi 1 and is analytic in this annulus. This argument can be reversed. Suppose that the response function G(z) is ?iven and is analytic in an annulus o: 2 ~ lzl ~ o: 1 . Then in this annulus G(z) IS represented by the series ~,g,z' (a Laurent series) where g, is determined by the contour integral g,
= 2 ~i
J
G(z)r-r-I dz
(17)
and the contour of integration is a circuit of the origin inside this annulus. This reduces to relation (16) in the case that the contour of integration is the unit
2 FILTERS IN DISCRETE TIME: THE SCALAR (SISO) CASE
71
circle. Furthermore, gr converges to zero at least as fast as a!' for large positive r and as a2' for large negative r. However, this g, is not necessarily the transient response of the physical model. There may be several such annuli of analyticity, and, if they are separated by a singularity of G, then the series representations will differ. We shall see just how this happens at the end of the next section. The appropriate representation will be determined on physical grounds: again, those of stability and causality.
Theorem 4.2.2 Suppose that the filter response function G(z) is analytic in an annulus a2 ~ lzl ~ a1. Then (i) The filter output is defined for inputs ofthe form z- 1for z in this annulus, andfor such inputs the filter behaves as one with the transient response g7 determined in (17). (ii) If a2 < 1 < a 1 then the filter thus determined is lq-stable for any positive q. (iii) If a2 = 0 then the filter thus determined is causal. These conclusions follow from the discussion before the theorem. Their implications go something like this. We try to determine a filter with prescribed transfer function G(z) by determining its transient response {g, }. If we look for a filter which is stable then we expand G(z) in its annulus of analyticity which includes the unit circle (if it has one). If we look for a filter which is causal then we take the contour of integral (17) so that all singularities of G(z) lie outside it. We can certainly find a filter which is both causal and stable if all its singularities lie in some region lzl ?! a1 > 1. When we spoke of a filter with a given frequency response function there was an implied assumption that the filter was stable to sinusoidal inputs eiwt, so that (16) gave the appropriate inversion. The class of filters whose transfer function G(z) is rational is practically i:tp.portant, for reasons which we shall see shortly. They are also mathematically straightforward: the transient response is a finite linear combination of exponentials, G(z) has singularities only at a finite number of poles, and one sees easily that the conditions of Theorem 4.2.2 are necessary as well as sufficient.
Theorem 4.2.3 Suppose that a causa/filter has a transfer function G(z) which is rational in z. Supposefurthermore than G( z) has been so normalised that numerator and denominator contain no common factor and no term in z raised to a negative power. Then the filter is stable (in the lq sense for any finite positive q) ifand only if the poles ofG lie strictly outside the unit circle. The transient response is then determined by (16) ~r, equivalently, by (17) with the unit circle as integration contour). Proof The simplest proof is an elementary one, relying on the partial fraction expansion
72
ROL PROBLEM THE CLASSIC FORMULATION OF THE CONT
G(z) = :Lc, z' + LL~k(z1 - z)-k j
k
positive integers and the z1 of G(z). Here all the sums have finite range ,j and k are are the finite non-zero poles. in non-negative powers, Causality requires us to seek an expansion (15) of G(z) is, G must not have That 0. < r and we certainly canno t find one unless Cr = 0 for tend to zero with d shoul gr a pole at z = 0. Stability certainly requires that to gr which is of n ibutio increasing r. The term in (z1 - z)-k generates a contr ly fast) with ential expon order rk-l zT', so g, will certainly converge to zero (and se the becau sary, also neces increasing r if jz1 j > 1 for all j. The condition is could gr and k), and ngj functions rk-l zt of r are linearly indep enden t (for varyi 0 then not possibly converge to zero if lzJI ~ 1 for any j.
-{
3 FROM DYNAMIC MODEL TO THE INPUT/OUTPUT RELATION: FILTER INVERSION

The specification of a filter by its operator $\mathcal{G}$ or (in the TIL case) by its transfer function $G$ amounts to an input-output description in that it indeed specifies what the output should be for a given input. However, something has been lost in such a specification: the dynamic mechanism by which the filter was realised. A return to the physical model both clarifies a number of questions (notably, that of stability) and raises others.

Consider an earlier example which we instanced in passing; that of a car driving over a bumpy road. The dynamic equations governing the response of the car are differential equations in the output variables $x$ (the vertical displacement of the car body, etc.) driven by the input variables $d$ (the irregularities in the road). These equations express input $d$ in terms of output $x$, and so in effect specify the operation $\mathcal{G}^{-1}$ of the inverse filter. It can be said that, in solving the equations, one is inverting this given filter to determine the action of $\mathcal{G}$, the direct filter.

For example, suppose that the dynamics of the filter are expressed by the $p$th-order difference equation

$$\sum_{r=0}^{p} A_r x_{t-r} = d_t. \qquad (18)$$
We can alternatively write this as

$$A(\mathcal{T})x = d, \qquad (19)$$

where $A(\mathcal{T}) = \sum_{r=0}^{p} A_r \mathcal{T}^r$. If this is to be equivalent to $x = G(\mathcal{T})d$ then we must have $A(\mathcal{T})G(\mathcal{T}) = \mathcal{I}$ or, in terms of response functions, $G(z)A(z) = 1$. From this and Theorem 4.2.3 we thus deduce
Theorem 4.3.1 The transfer function of the filter $d \to x$ implied by relation (18) is $G(z) = A(z)^{-1}$. The causal form of the filter is stable if and only if all zeros of $A(z)$ lie strictly outside the unit circle.

Inversion of the operator thus amounts, in the TIL case, to inversion of the transfer function in the most literal sense: $G(z)$ is simply the reciprocal of the transfer function $A(z)$ of the inverse filter. However, to this observation must be added some guidance as to how the series expansion of $A(z)^{-1}$ is to be understood. We do indeed demand causality if equation (18) is to be understood physically: as a forward recursion in time whose solution cannot be affected by future input.

We shall see in the next section that the theorem continues to hold in the vector case. In this case $A(z)$ is a matrix, with elements polynomial in $z$, which implies that $A(z)^{-1}$ is rational. That is, the transfer function from any component of the input to any component of the output is rational. This can be regarded as the way the rational transfer functions make their appearance in practice: as a consequence of finite-order finite-dimensional (but multivariable) linear dynamics.

The simplest example is that which we have already quoted in the last section; the uncontrolled version
$$x_t = Ax_{t-1} + d_t \qquad (20)$$

of relation (2.61). We understood this as a vector relation in Section 2.9, and shall soon do so again, but let us regard it as a scalar relation for the moment. We have then $A(z) = 1 - Az$, with its single zero at $z = A^{-1}$. The necessary and sufficient condition for stability is then that $|A| < 1$. The solution of the equation for prescribed initial conditions at $t = T$ is

$$x_t = \sum_{r=0}^{t-T-1} A^r d_{t-r} + A^{t-T} x_T, \qquad (21)$$
whatever $A$. From this we see that the stability condition $|A| < 1$ both assures stability in what one would regard as the usual dynamic sense (that the effect of initial conditions vanishes in the limit $T \to -\infty$) and in the filter sense (e.g. that the filter has the BIBO property). If $|A| \ge 1$ then solution (21) is still valid, but will in general diverge as $t - T$ increases.

We can return now to the point that the transfer function, in the present case $G(z) = (1 - Az)^{-1}$, has two distinct series expansions. These are

$$G(z) = \sum_{r=0}^{\infty} A^r z^r, \qquad G(z) = -\sum_{r=-\infty}^{-1} A^r z^r,$$

valid respectively for $|z| < |A|^{-1}$ and $|z| > |A|^{-1}$. The first of these corresponds to a filter which is causal, but stable only if $|A| < 1$. The second corresponds to a
filter which is stable if $|A| > 1$, but is non-causal. Indeed, it corresponds to a solution

$$x_t = -\sum_{r=1}^{\infty} A^{-r} d_{t+r}$$

of (20). This is mathematically acceptable if $|A| > 1$ (and $d$ uniformly bounded, say), but is of course physically unacceptable, being non-causal.
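The dichotomy can be seen numerically. A minimal sketch (with illustrative values of our own choosing): iterating (20) forward from a zero initial condition realises the causal solution (21), which stays bounded for $|A| < 1$ and diverges for $|A| > 1$, exactly as the theory predicts.

```python
import numpy as np

def causal_run(A, t_max=60, seed=0):
    """Forward iteration x_t = A x_{t-1} + d_t from x = 0, with bounded
    random input d: this realises the causal solution (21)."""
    rng = np.random.default_rng(seed)
    x, path = 0.0, []
    for _ in range(t_max):
        x = A * x + rng.uniform(-1.0, 1.0)
        path.append(x)
    return np.array(path)

print(np.abs(causal_run(0.5)).max())   # bounded: causal and stable
print(np.abs(causal_run(1.5)).max())   # grows geometrically: causal, unstable
```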
Exercises and comments

(1) Consider the filter $d \to x$ implied by the difference equation (18). The leading coefficient $A_0$ must be non-zero if the filter is to be causal. Consider the partial power expansion

$$A(z)^{-1} = \sum_{r=0}^{t-1} g_r z^r + A(z)^{-1} \sum_{k=0}^{p-1} c_{tk} z^{t+k},$$
in which the two sums are respectively the quotient and the remainder after $t$ steps of division of $A(z)$ into unity by the long-division algorithm. Show, by establishing recursions for these quantities, that the solution of system (18) for general initial conditions at $t = 0$ is just

$$x_t = \sum_{r=0}^{t-1} g_r d_{t-r} + \sum_{k=0}^{p-1} c_{tk} x_{-k}.$$
Relation (21) illustrates this in the case $p = 1$.

(2) Suppose that $A(z) = \prod_{j=1}^{p}(1 - \alpha_j z)$, so that we require $|\alpha_j| < 1$ for all $j$ for stability. Determine the coefficients $c_j$ in the partial fraction expansion

$$A(z)^{-1} = \sum_{j=1}^{p} c_j (1 - \alpha_j z)^{-1}$$

in the case when the $\alpha_j$ are distinct. Hence determine the coefficients $g_r$ and $c_{tk}$ of the partial inversion of Exercise 1.

(3) Model (20) has the frequency response function $(1 - Ae^{-i\omega})^{-1}$. An input signal $d_t = e^{i\omega t}$ will indeed be multiplied by this factor in the output $x$ if the filter is stable and sufficient time has elapsed since start-up. This will be true for an unstable filter only if special initial conditions hold (indeed, that $x$ showed this pattern already at start-up). What is the amplitude of the response function?
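The long-division construction of Exercise (1) is easy to mechanise. A sketch (with illustrative coefficients): dividing $A(z)$ into unity term by term generates the transient response $g_r$ of the causal inverse, which can then be checked against the partial fraction expansion of Exercise (2).

```python
import numpy as np

def transient_response(a, n):
    """First n coefficients g_r of A(z)^{-1} = sum_r g_r z^r, by long
    division of A(z) = a[0] + a[1] z + ... into unity; needs a[0] != 0."""
    g = np.zeros(n)
    g[0] = 1.0 / a[0]
    for r in range(1, n):
        acc = sum(a[j] * g[r - j] for j in range(1, min(r, len(a) - 1) + 1))
        g[r] = -acc / a[0]
    return g

# A(z) = (1 - 0.5 z)(1 - 0.2 z) = 1 - 0.7 z + 0.1 z^2
print(transient_response([1.0, -0.7, 0.1], 6))
# agrees with g_r = (5/3) 0.5^r - (2/3) 0.2^r from the partial fractions
r = np.arange(6)
print((5 / 3) * 0.5 ** r - (2 / 3) * 0.2 ** r)
```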
4 FILTERS IN DISCRETE TIME; THE VECTOR (MIMO) CASE

Suppose now that input $d$ is an $m$-vector and output $x$ an $n$-vector. Then the representations (7) and (12) still hold for linear and TIL filters respectively, the
first by definition and the second again by Theorem 4.2.1. The coefficients $g_{t\tau}$ and $g_r$ are still interpretable as transient responses, but are now $n \times m$ matrices, since they must give the response of all $n$ components of output to all $m$ components of input.

However, for the continuation of the treatment of the last section, let us take an approach which is both more oblique and more revealing. Note, first, that we are using $\mathcal{T}$ in different senses in the two places where it occurs in (11). On the right, it is applied to the input, and so converts $m$-vectors into $m$-vectors. On the left, it is applied to output, and does the same for $n$-vectors. For this reason, we should not regard $\mathcal{T}$ as a particular case of a filter operator $\mathcal{G}$; we should regard it rather as an operator of a special character, which can be applied to a signal of any dimension, and which in fact operates only on the time argument of that signal and not on its amplitude.

We now restrict ourselves wholly to the TIL case. Since $\mathcal{G}$ is a linear operator, one might look for its eigenvectors; i.e. signals $\xi_t$ with the property $\mathcal{G}\xi_t = \lambda\xi_t$ for some scalar constant $\lambda$. However, since the output dimension is in general not equal to the input dimension we look rather for a scalar signal $\sigma_t$ such that $\mathcal{G}\xi\sigma_t = \eta\sigma_t$ for some fixed vector $\eta$, for any fixed vector $\xi$. The translation-invariance condition (11) implies that, if the input-output pair $\{\xi\sigma_t, \eta\sigma_t\}$ has this property, then so does $\{\xi\sigma_{t-1}, \eta\sigma_{t-1}\}$. If these sequences are unique to within a multiplicative constant for a prescribed $\xi$ then one set of signals must be a scalar multiple of the other, so that $\sigma_{t-1} = z\sigma_t$ for some scalar $z$. This implies that $\sigma_t \propto z^{-t}$, which reveals the particular role of the exponential sequences. Further, $\eta$ must then be linearly dependent upon $\xi$, although in general by a $z$-dependent rule, so that we can set $\eta = G(z)\xi$ for some matrix $G(z)$. But it is then legitimate to write

$$\mathcal{G} = G(\mathcal{T}), \qquad (22)$$

since $\mathcal{G}$ has this action for any input of the form $\xi z^{-t}$ or a linear combination of such expressions for varying $\xi$ and $z$. If $G(z)$ has a power series expansion $\sum_r g_r z^r$ then relation (22) implies an expression (12) for the output of the filter, with $g_r$ identified as the matrix-valued transient response. We say 'a' rather than 'the' because, just as in the SISO case, there may be several such expansions, and the appropriate one must be resolved by considerations of causality and stability. These concepts are defined as before, with the lq-condition (13) modified to

$$\sum_r \sum_{j,k} |g_{rjk}|^q < \infty,$$
where $g_{rjk}$ is the $jk$th element of $g_r$. The transient response $g_r$ is obtained from $G(z)$ exactly as in (17); by determination of the coefficient of $z^r$ in the appropriate expansion of $G(z)$. If causality is demanded then the only acceptable expansion is that in non-negative powers of $z$.
If $G(z)$ is rational (meaning that the elements of $G(z)$ are all rational in $z$) then the necessary and sufficient condition that the filter be both causal and stable is that of Theorem 4.2.3, applied to each element of $G$.

Again, one returns to basics and to physical reality if one sees the filter as being generated by a model. Suppose that the filter is generated by the dynamic equation (18), with $x$ and $d$ now understood as vectors. If this relation is to be invertible we shall in general require that input and output be of the same dimension, so that $A(\mathcal{T})$ is a square matrix whose elements are polynomials of degree $p$ in $\mathcal{T}$. The analysis of Section 3 generalises immediately; we can summarise conclusions.

Theorem 4.4.1 The filter $\mathcal{G}$ determined by (18) has transfer function $G(z) = A(z)^{-1}$. The causal form of the filter is stable if and only if the zeros of $|A(z)|$ lie strictly outside the unit circle.
Here $|A(z)|$ is the determinant of $A(z)$, regarded as a function of $z$. The first conclusion follows from $A(z)G(z) = I$, established as before. The elements of $A(z)^{-1}$ are rational in $z$, with poles at the zeros of the determinant $|A(z)|$. (More exactly, these are the only possible poles, and all of them occur as the pole of some element.) The second conclusion then follows from Theorem 4.2.3.

The fact that $z = 0$ must not be a zero of $|A(z)|$ implies that $A_0$ is non-singular. This is of course necessary if (18) is to be seen as a forward recursion, determining $x_t$ in terms of current input and past $x$-values.

The actual filter output may be only part of the output generated by the dynamic equation (18). Suppose we again take the car driving over the bumpy road as our example, and take the actual filter output as being what the driver observes. He will observe only the grosser motions of the car body, and will in fact observe only some lower-dimensional function of the process variable $x$.
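For the important special case $A(z) = I - Az$ the zeros of $|A(z)|$ are the reciprocals of the non-zero eigenvalues of $A$, so the criterion of Theorem 4.4.1 reduces to an eigenvalue computation. A sketch with an arbitrary illustrative matrix:

```python
import numpy as np

A = np.array([[0.5, 0.2],
              [0.1, 0.3]])                 # illustrative coefficient matrix

eigs = np.linalg.eigvals(A)
zeros = 1.0 / eigs[np.abs(eigs) > 1e-12]   # zeros of |I - Az| at z = 1/lambda
print(zeros)
print(bool(np.all(np.abs(zeros) > 1.0)))   # True: the causal filter is stable
```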
Exercises and comments

(1) Return to the control context and consider the equation pair

$$x_t = Ax_{t-1} + Bu_{t-1}, \qquad y_t = Cx_{t-1}, \qquad (23)$$

as a model of plant and observer. In an input-output description of the plant one often regards the control $u$ as the input and the observation $y$ as the output, the state variable $x$ simply being a hidden variable. Show that the transfer function $u \to y$ is $C(I - Az)^{-1}Bz^2$. What is the condition for stability of this causal filter?

If a disturbance $d$ were added to the plant equation of (23), then this would constitute a second input. If a control policy has been determined then one has a higher-order formulation; $d$ is now the only input to the controlled system.

(2) The general version of this last model would be
$$\mathcal{A}x + \mathcal{B}u = 0, \qquad y + \mathcal{C}x = 0,$$

where $\mathcal{A}$, $\mathcal{B}$ and $\mathcal{C}$ are causal TIL operators. If $\mathcal{A} = A(\mathcal{T})$ etc. then the transfer function $u \to y$ is $C(z)A(z)^{-1}B(z)$.
5 COMPOSITION AND INVERSION OF FILTERS; z-TRANSFORMS
We assume for the remainder of the chapter that all filters are linear, translation-invariant and causal. Let us denote the class of such filters by $\mathcal{C}$. If filters $\mathcal{G}_1$ and $\mathcal{G}_2$ are applied in succession then the compound filter thus generated also lies in $\mathcal{C}$ and has action

$$G(\mathcal{T}) = G_2(\mathcal{T})G_1(\mathcal{T}). \qquad (24)$$

That is, its transient response at lag $r$ is $\sum_v g_{2v} g_{1,r-v}$, in an obvious terminology. However, relation (24) expresses the same fact much more neatly.

The formalism we have developed for TIL filters shows that we can manipulate the filter operators just as though the operator $\mathcal{T}$ were an ordinary scalar, with some guidance from physical considerations as to how power series expansions are to be taken. This formalism is just the Heaviside operator calculus, and is completely justified as a way of expressing identities between coefficients such as the $A_r$ of the vector version of (18) and the consequent transient response $g_r$. However, there is a parallel and useful formalism in terms of z-transforms (which become Laplace transforms in continuous time). This should not be seen as justifying the operator formalism (such justification not being needed) but as supplying useful analytic characterisations and evaluations.

Suppose that the vector system

$$A(\mathcal{T})x = d \qquad (25)$$

does indeed start up at time zero, in that both $x$ and $d$ are zero on the negative time axis. Define the z-transforms

$$\bar{x}(z) = \sum_{t=0}^{\infty} x_t z^t, \qquad \bar{d}(z) = \sum_{t=0}^{\infty} d_t z^t \qquad (26)$$
for scalar complex $z$. Then it is readily verified that relation (25) amounts to $A(z)\bar{x}(z) = \bar{d}(z)$, with inversion

$$\bar{x}(z) = A(z)^{-1}\bar{d}(z). \qquad (27)$$

This latter expression amounts to the known conclusion $G(z) = A(z)^{-1}$ if we understand that $A(z)^{-1}$ and $\bar{d}(z)$ are to be expanded in non-negative powers of $z$.
The inversion is completed by the assertion that $x_t$ is the coefficient of $z^t$ in the expansion of $\bar{x}(z)$. Analytically, this is expressed as

$$x_t = \frac{1}{2\pi i}\oint \bar{x}(z) z^{-t-1}\,dz, \qquad (28)$$
where the contour of integration is a circle around the origin in the complex plane small enough that all singularities of $\bar{x}(z)$ lie outside it.

Use of transforms supplies an alternative language in which the application of an operator, as in $A(\mathcal{T})x$, is replaced by the application of a simple matrix multiplier: $A(z)\bar{x}$. This can be useful, in that important properties of the operator $A(\mathcal{T})$ can be expressed in terms of the algebraic properties of $A(z)$, and calculable integral expressions can be obtained for the transient response $g_r$ of (17). However, it is also true that both the operator formalism and the concept of a transfer function continue to be valid in cases where the signal transforms $\bar{x}$ and $\bar{d}$ do not exist, as we shall see in Chapter 13.
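The equivalence of the operator and transform languages can be checked termwise. A scalar sketch (with illustrative numbers): the coefficients of $\bar{x}(z) = A(z)^{-1}\bar{d}(z)$, i.e. the convolution of the transient response with the input, agree with direct forward iteration of (25).

```python
import numpy as np

d = np.array([1.0, 0.0, 2.0, 0.0, 0.0, 0.0])   # input, zero before t = 0

# forward recursion for A(z) = 1 - 0.8 z, i.e. x_t = 0.8 x_{t-1} + d_t
x_rec, x = [], 0.0
for dt in d:
    x = 0.8 * x + dt
    x_rec.append(x)

# transform route: A(z)^{-1} = sum_r 0.8^r z^r, so x = g * d (convolution)
g = 0.8 ** np.arange(len(d))
x_tf = np.convolve(g, d)[:len(d)]

print(np.allclose(x_rec, x_tf))                # True
```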
Exercises and comments

(1) There is a version of (27) for arbitrary initial conditions. If one multiplies relation (25) by $z^t$ and then sums over $t \ge 0$ one obtains the relation

$$\bar{x}(z) = A(z)^{-1}\Big[\bar{d}(z) - \sum_j \sum_{k=1}^{j} A_j x_{-k} z^{j-k}\Big]. \qquad (29)$$
This is a transform version of the solution for $x_t$ of Exercise 3.1, implying that solution for all $t \ge 0$. Note that we could write it more compactly as
$$\bar{x}(z) = A(z)^{-1}\big[\bar{d}(z) - A(z)\bar{x}_-(z)\big]_+, \qquad (30)$$

where $\bar{x}_-(z) = \sum_{k \ge 1} x_{-k} z^{-k}$ and the operator $[\ ]_+$ retains only the terms in non-negative powers of $z$ from the power series in the bracket. It is plain from this solution that stability of the filter implies also that the effect of initial conditions dies away exponentially fast with increasing time.

(2) Consider the relation $x_t = g_0 d_t + g_1 d_{t-1}$, for which the transient response is zero at lags greater than unity. There is interest in seeing whether this could be generated by a recursion of finite order, where we would for example regard relation (25) as a recursion of order $p$. If we define the augmented process variable $\tilde{x}$ as that with vector components $x_t$ and $d_t$ then we see that it obeys a recursion $\tilde{x}_t = \tilde{A}\tilde{x}_{t-1} + \tilde{B}d_t$, where
$$\tilde{A} = \begin{bmatrix} 0 & g_1 \\ 0 & 0 \end{bmatrix}, \qquad \tilde{B} = \begin{bmatrix} g_0 \\ 1 \end{bmatrix}.$$
The fact that this system could not be unstable under any circumstances is reflected in the fact that $|I - \tilde{A}z|$ has no zeros, and so $G(z)$ has no poles.
6 FILTERS IN CONTINUOUS TIME

In continuous time the translation operator $\mathcal{T}^\tau$ is defined for any real $\tau$; let us replace $r$ by the continuous variable $\tau$. However, the role of the unit translation $\mathcal{T}$ must be taken over by that of an infinitesimal translation. More specifically, one must consider the rate of change with translation $\lim_{\tau \downarrow 0} \tau^{-1}[1 - \mathcal{T}^\tau]$, which is just the differential operator $\mathcal{D} = d/dt$. The relation

$$\lim_{\delta t \downarrow 0} \frac{x(t - \tau + \delta t) - x(t - \tau)}{\delta t} = \dot{x}(t - \tau)$$

amounts to the operator relation

$$-\frac{d}{d\tau}\mathcal{T}^\tau = \mathcal{D}\mathcal{T}^\tau.$$

Since $\mathcal{T}^0 = 1$ this has formal solution

$$\mathcal{T}^\tau = e^{-\tau\mathcal{D}}, \qquad (31)$$

which exhibits $\mathcal{D}$ as the infinitesimal generator of the translations. Relation (31) can be regarded as an expression of Taylor's theorem

$$e^{-\tau\mathcal{D}} x(t) = \sum_{j=0}^{\infty} \frac{(-\tau\mathcal{D})^j}{j!}\, x(t) = x(t - \tau) = \mathcal{T}^\tau x(t). \qquad (32)$$

Note, though, that the translation $x(t - \tau)$ of $x(t)$ makes sense even if $x$ is not differentiable to any given order, let alone indefinitely.
= G(!!))
(34)
on any linear comb inatio n of in analo gy to (22), since, by (33), <'§ has this action expon ential signals. that the differentials q;r x( t) However, use of the forma lism (34) does not imply to be applied. We have alread y need exist for any r for a funct ion x( t) to which <'§ is efined, even if x( t) is not seen in (32) that the transl ation e-r!'d x( t) is well-d differentiable at all.
In fact, identification (31) demonstrates that, if $G(s)$ has a Fourier-Laplace representation

$$G(s) = \int_0^{\infty} e^{-s\tau} g(\tau)\, d\tau, \qquad (35)$$

then the filter relationship $x = \mathcal{G}d$ can be written

$$x(t) = G(\mathcal{D})d(t) = \int_0^{\infty} g(\tau)\, e^{-\tau\mathcal{D}} d(t)\, d\tau = \int_0^{\infty} g(\tau)\, d(t - \tau)\, d\tau, \qquad (36)$$
whence we see that we can identify $g(\tau)$ with the transient response of the filter. However, we must be prepared to stretch our ideas. For example, one could envisage a filter $x = \mathcal{D}^r d$ which formed the $r$th time differential of the current input. One could represent this in the form (36) only by setting $g(\tau) = \delta^{(r)}(\tau)$, the $r$th differential of a delta-function.

We have taken the integral only over non-negative $\tau$ in (35) on the supposition that the filter is causal. If the integral (35) is absolutely convergent for some real positive value $\sigma$ of $s$ then it will define $G(s)$ as an analytic function for all $s$ such that $\operatorname{Re}(s) \ge \sigma$.
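A numerical check of (35)-(36) for the simplest proper stable case $G(s) = (s + a)^{-1}$, whose transient response works out from (35) as $g(\tau) = e^{-a\tau}$: convolving $g$ with a complex sinusoid reproduces multiplication by the frequency response $G(i\omega)$. The constants below are illustrative.

```python
import numpy as np

a, omega, t = 2.0, 3.0, 5.0
tau = np.linspace(0.0, 20.0, 200001)    # long enough for e^{-a tau} to vanish
dtau = tau[1] - tau[0]

g = np.exp(-a * tau)                    # transient response of G(s) = 1/(s+a)
d = np.exp(1j * omega * (t - tau))      # input values d(t - tau)
x = np.sum(g * d) * dtau                # output (36) at time t, by quadrature

G = 1.0 / (1j * omega + a)              # frequency response G(i omega)
print(np.allclose(x, G * np.exp(1j * omega * t), atol=1e-3))   # True
```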
7 DYNAMIC MODELS: THE INVERSION OF CONTINUOUS-TIME FILTERS

As we have already emphasised in Sections 3 and 4, one must look beyond the input-output specification of a filter to the dynamic mechanism behind it. Suppose that these equations take the form

$$\mathcal{A}x = d. \qquad (37)$$
The simplest finite-order TIL assumption is that $\mathcal{A}$ is a differential operator of order $p$, say:

$$\mathcal{A} = A(\mathcal{D}) = \sum_{r=0}^{p} A_r \mathcal{D}^r. \qquad (38)$$
(For economy of notation we have denoted the matrix coefficients by $A_r$, as for the discrete-time version (18), but the two sets of coefficients are completely distinct.) The system (37), (38) then constitutes a set of differential equations of degree $p$ at most. This is to be regarded as a forward equation in time, determining the forward course of the output $x$. In discrete time this led to the requirement that $A_0$ should be non-singular. The corresponding requirement now is that the matrix coefficient of the highest-order differentials should be non-singular. That is, if the $k$th individual output $x_k$ occurs differentiated to order $r_k$ at most in system (37), (38) then the matrix $\bar{A}$ whose $jk$th element is the $jk$th element of $A_{r_k}$ (for all relevant $j$, $k$) must be non-singular.
Just as for the discrete-time case of Sections 3 and 4 the actual filter $d \to x$, obtained by inversion of the relation (37), has $A(s)^{-1}$ as transfer function. We must also suppose the filter causal, if relation (37) is supposed on physical grounds to be a forward relation in time. Thus, if $A(s)^{-1}$ has the Laplace representation

$$A(s)^{-1} = \int_0^{\infty} e^{-s\tau} g(\tau)\, d\tau, \qquad (39)$$
then $g(\tau)$ is the transient response of the filter, and the solution of equation (37) is

$$x(t) = \int_0^{\infty} g(\tau)\, d(t - \tau)\, d\tau \qquad (40)$$
plus a possible contribution from initial conditions. There will be no such contribution if the system starts from a quiescent and undisturbed past or if the filter is stable and has operated indefinitely.

The Laplace transform is the continuous-time analogue of the z-transform of Section 5. Suppose the system (37) quiescent before time zero, in that both $x(t)$ and $d(t)$ are zero for $t < 0$. If we multiply relation (37) at time $t$ by $e^{-st}$ and integrate over $t$ from 0 to infinity then we obtain

$$A(s)\bar{x}(s) = \bar{d}(s), \qquad (41)$$

where $\bar{x}(s)$ is the Laplace transform of $x(t)$:

$$\bar{x}(s) = \int_{0-}^{\infty} e^{-st} x(t)\, dt. \qquad (42)$$
The reason for emphasising inclusion of the value $t = 0$ in the range of integration will transpire in the next section. However, relation (41) certainly implies that $\bar{x}(s) = A(s)^{-1}\bar{d}(s)$ for all $s$ for which both $\bar{d}(s)$ and $A(s)^{-1}$ are defined, and this is indeed equivalent to the solution implied by the evaluation of the transient response implicit in (39).

In a sense there is no need for the introduction of Laplace transforms, in that the solution determined by (39), (40) remains valid in cases when $\bar{d}(s)$ does not exist. However, the Laplace formalism provides the natural technique for the inversion constituted by relation (39); i.e. for the actual determination of $g(\tau)$ from $G(s) = A(s)^{-1}$.

Exercises and comments
(1) Show, by partial integration, that

$$\int_{0-}^{\infty} e^{-st}\mathcal{D}^r x(t)\, dt = s^r \bar{x}(s) - \sum_{q=0}^{r-1} s^q \mathcal{D}^{r-q-1} x(0).$$
(Here $\mathcal{D}^r x(0)$ is a somewhat loose expression for the $r$th differential of $x$ at time 0.) For general initial conditions at $t = 0$ relation (41) must thus be replaced by

$$A(s)\bar{x}(s) = \bar{d}(s) + \sum_r A_r \sum_{q=0}^{r-1} s^q \mathcal{D}^{r-q-1} x(0).$$

This is the continuous-time analogue of relation (29).
(2) A stock example is that of radioactive decay. Suppose that a radioactive substance can decay through consecutive elemental forms $j = 0, 1, 2, \ldots$, and that $x_j(t)$ is the amount of element $j$ at time $t$. Under standard assumptions the $x_j$ will obey the equations

$$\dot{x}_0 = -\mu_0 x_0 + d, \qquad \dot{x}_j = \mu_{j-1} x_{j-1} - \mu_j x_j \quad (j = 1, 2, \ldots),$$

where $\mu_j$ is the decay rate in state $j$. Here we have supposed for simplicity that only element 0 is replenished externally, at rate $d(t)$. In terms of Laplace transforms these relations become

$$(s + \mu_0)\bar{x}_0 = \bar{d}, \qquad (s + \mu_j)\bar{x}_j = \mu_{j-1}\bar{x}_{j-1} \quad (j = 1, 2, \ldots),$$

if we assume that $x_j(0) = 0$ for all $j$. We thus find that
$$\bar{x}_j = \frac{\mu_0\mu_1\cdots\mu_{j-1}}{(s+\mu_0)(s+\mu_1)\cdots(s+\mu_j)}\,\bar{d} = \frac{\bar{d}}{P_j(s)},$$

say. If $\mu_0, \mu_1, \ldots, \mu_j$ are distinct and positive then this corresponds to a transient response function (for $d \to x_j$):

$$G_j(\tau) = \sum_{k=0}^{j} \frac{e^{-\mu_k \tau}}{P_j'(-\mu_k)}. \qquad (43)$$
Suppose that $j = p$ is the terminal state, in that $\mu_p = 0$. The term corresponding to $k = p$ in expression (43) for $j = p$ is then simply 1. This corresponds to a singularity at $s = 0$ in the transfer function from $d$ to $x_p$. The singularity corresponds to a rather innocent instability in the response of $x_p$ to $d$: simply that all matter entering the system ultimately accumulates in state $p$.
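The residue formula (43) is easy to evaluate and to check against a direct matrix-exponential solution of the chain. A sketch with illustrative rates (scipy is assumed available for expm):

```python
import numpy as np
from scipy.linalg import expm

mu = np.array([1.0, 0.5, 0.25])            # illustrative decay rates

def G(j, tau):
    """Transient response (43) for d -> x_j, assuming distinct rates."""
    total = 0.0
    for k in range(j + 1):
        # P_j'(-mu_k) = prod_{i != k} (mu_i - mu_k) / (mu_0 ... mu_{j-1})
        Pp = np.prod([mu[i] - mu[k] for i in range(j + 1) if i != k]) \
             / np.prod(mu[:j])
        total += np.exp(-mu[k] * tau) / Pp
    return total

# impulse response of the chain xdot = Mx + e_0 d, computed via expm
M = np.diag(-mu) + np.diag(mu[:-1], -1)
tau = 3.7
x = expm(M * tau)[:, 0]
print(np.allclose([G(j, tau) for j in range(len(mu))], x))   # True
```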
8 LAPLACE TRANSFORMS

The Laplace transform (42) is often written as $\mathcal{L}x$ to emphasise that the function $\bar{x}(s)$ has been obtained from the function $x(t)$ by the transformation $\mathcal{L}$. (This is quite distinct from the forward operator defined in Section 3.1; we use $\mathcal{L}$ to denote Laplace transformation only in this section.) The transformation $\mathcal{L}$ is linear, as is its inverse, which is written $\mathcal{L}^{-1}$. One of the key results is that the inversion has the explicit form
$$x(t) = \mathcal{L}^{-1}\bar{x} = \frac{1}{2\pi i}\int_{\sigma - i\infty}^{\sigma + i\infty} e^{st}\bar{x}(s)\, ds, \qquad (44)$$
where the real number $\sigma$ is taken large enough that all singularities of $\bar{x}$ lie to the left of the axis of integration. (This choice of integration path yields $x(t) = 0$ for $t < 0$, which is what is effectively supposed in the formation (42) of the transform. It is analogous to the choice of integration path in (17) to exclude all singularities of $G(z)$, if one wishes to determine the causal form of the filter.) Inversion of a transform then often becomes an exercise in evaluation of residues at the various singularities of the integrand.

The glossary of simple transform-pairs in Table 4.1 covers all cases in which $\bar{x}(s)$ is rational and proper (a term defined in the next section). The reader may wish to verify validity of both direct and inverse transforms. In all cases the assumption is that $x(t)$ is zero for negative $t$.
Table 4.1

$x(t)$ | $\bar{x}(s)$
$1$ | $s^{-1}$
$t^n/n!$ | $s^{-n-1}$
$e^{-\alpha t}$ | $(s+\alpha)^{-1}$
$t^n e^{-\alpha t}/n!$ | $(s+\alpha)^{-n-1}$
$\delta(t)$ | $1$
If $s$ corresponds to the operation of differentiation then $s^{-1}$ presumably corresponds to the operation of integration. This is indeed true (as we saw in Exercise 7.2), but the operation of integration is unique only to within an additive constant, to be determined from initial conditions. That initial conditions should have a persistent effect is an indication of instability.

A very useful result is the final value theorem: that if $\lim_{t \uparrow \infty} x(t)$ exists, then so does $\lim_{s \downarrow 0} s\bar{x}(s)$, and the two are equal. This is easily proved, but one should note that the converse holds only under regularity conditions. Note an implication: that if $\lim_{t \uparrow \infty} \mathcal{D}^j x(t)$ exists for a given positive integer $j$, then so does $\lim_{s \downarrow 0} s^{j+1}\bar{x}(s)$, and the two are equal.

9 STABILITY OF CONTINUOUS-TIME FILTERS

Let us consider the SISO case to begin with, which again sets the general pattern. Lq-stability for a realisable filter requires that
$$\int_0^{\infty} |g(\tau)|^q\, d\tau < \infty.$$

$L_1$-stability is then again equivalent to BIBO stability, and implies that $G(s)$ is
analytic for $\operatorname{Re}(s) \ge 0$. This condition certainly excludes the possibility that $g$ could have a differentiated delta-function as component, i.e. that the filter would
actually differentiate the input at any lag. A bounded function need not have a differential at all, let alone a bounded one.

A filter for which $|G(s)|$ remains bounded as $|s| \to \infty$ is said to be proper. This excludes $s^r$ behaviour of $G(s)$ for any $r > 0$ and so excludes a differentiating action for the filter. If $|G(s)| \to 0$ as $|s| \to \infty$ then the filter is said to be strictly proper. In this case even delta-functions are excluded in $g$; i.e. response must be smoothly distributed.

A rational transfer function $G(s)$ is now one which is rational in $s$. As in the discrete-time case, this can be seen as the consequence of finite-order, finite-dimensional linear dynamics. The following theorem is the analogue of Theorem 4.2.3.
Theorem 4.9.1 Suppose that a causal filter has a response function $G(s)$ which is rational and proper. Then the filter is stable if and only if all poles of $G(s)$ have strictly negative real part (i.e. lie strictly in the left half of the complex plane).

Proof This is again analogous to the proof of Theorem 4.2.3. $G(s)$ will have the expansion in partial fractions

$$G(s) = \sum_r c_r s^r + \sum_j \sum_k d_{jk}(s - s_j)^{-k-1},$$
where the ranges of summation are finite, $j$ and $k$ are non-negative integers and the $s_j$ are the non-zero poles of $G(s)$. Negative powers $s^r$ cannot occur, since these would imply a component in the output consisting of integrated input, and the integral of a bounded function will not be bounded in general. Neither can positive powers occur, because of the condition that the filter be proper. The first sum thus reduces to a constant $c_0$. This corresponds to an instantaneous term $c_0 d(t)$ in the filter output $\mathcal{G}d$, which is plainly stable. The term in $(s - s_j)^{-k-1}$ gives a term proportional to $\tau^k \exp(s_j \tau)$ in the filter response; this component is then stable if and only if $\operatorname{Re}(s_j) < 0$. The condition of the theorem is thus sufficient, and necessity follows, as previously, from the linear independence of the components of filter response. □

The decay example of Exercise 7.2 illustrates these points. The transfer function $G_j(s)$ for the output $x_j$ had poles at the values $-\mu_k$ ($k = 0, 1, \ldots, j$). These are strictly negative for $k < p$, and so $G_j$ determined a stable filter for $j < p$. The final filter had a response singularity at $s = -\mu_p = 0$. This gave rise to an instability corresponding, as we saw, to the accumulation of matter in the final state.
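In computational terms Theorem 4.9.1 is again a root-location test. A sketch for an illustrative rational proper $G(s)$ of our own choosing:

```python
import numpy as np

# G(s) = (s + 2) / ((s + 1)(s^2 + 0.2 s + 1)); denominator, decreasing powers
den = np.polymul([1.0, 1.0], [1.0, 0.2, 1.0])
poles = np.roots(den)
print(poles)                         # -1 and approximately -0.1 +/- 0.995j
print(bool(np.all(poles.real < 0)))  # True: all poles in the left half-plane
```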
Exercises and comments

(1) The second stock example is that of the hanging pendulum: a damped harmonic oscillator. Suppose that the bob of the pendulum has unit mass and that it is driven by a forcing term $d$. The linearised equation of motion (see Section 5.1) for the angle of displacement $x$ of the pendulum is then $A(\mathcal{D})x = d$, where $A(s) = s^2 + a_1 s + a_0$. Here $a_1$ (non-negative) represents damping and $a_0$ (positive) represents the restoring force due to gravity. If $a_1 = 0$ then the zeros of $A(s)$ have the purely imaginary values $\pm i\sqrt{a_0}$ (corresponding to an undamped oscillation of the free pendulum, with an amplitude determined by initial conditions). If $0 < a_1 < 2\sqrt{a_0}$ then the zeros are complex with negative real part (corresponding to a damped oscillation of the free pendulum). If $a_1 \ge 2\sqrt{a_0}$, then they are negative real (corresponding to a damped non-oscillatory motion of the free pendulum). The equivalent filter is thus stable or unstable according as the pendulum is damped or not.

A damped harmonic oscillator would also provide the simplest useful model of our car driving over a bumpy road, the output variable $x$ being the vertical displacement of the car body. If the suspension is damped lightly enough that the car shows an oscillatory response near the natural frequency of vibration $\omega_0 = \sqrt{a_0}$ then the response function $A(s)^{-1}$ will be large in modulus at $s = \pm i\omega_0$. This can be observed when one drives along an unsealed road which has developed regular transverse ridges (as can happen on a dry creek bed). There is a critical speed which must be avoided if the car is not to develop violent oscillations. The effect is enhanced by the fact that the ridges develop in response to such oscillations!
10 SYSTEMS STABILISED BY FEEDBACK

We shall from now on often specify a filter simply by its transfer function, so that we write $G$ rather than $\mathcal{G}$. In continuous time the understanding is then that $G$ denotes $G(\mathcal{D})$ or $G(s)$ according as one is considering action of the filter in the time domain or the transform domain. Correspondingly, in discrete time it denotes $G(\mathcal{T})$ or $G(z)$, as appropriate.

We are now in a position to resume the discussion of Section 1. There the physically given system (the plant) was augmented to the controlled system by addition of a feedback loop incorporating a controller. The total system thus consists of two filters in a loop, corresponding to plant and controller, and one seeks to choose the controller so as to achieve satisfactory performance of the whole system. Optimisation considerations, a consciousness that 'plant' constitutes a model for more than just the process under control (see Exercise 1.1) and a later concern for robustness to misspecification (see Chapter 17) lead one to modify the block diagram of Figure 2 somewhat, to that of Figure 4. In this diagram $u$ and $y$ denote, as ever, the signals constituted by control and observations respectively. The signal $\zeta$ combines all primitive exogenous inputs to the system: e.g. plant noise, observation noise and command signals (or the noise that drives command signals if these are generated by a statistical model).
[Figure 4: block diagram of the plant $G$ and controller $K$ in a feedback loop, with inputs $\zeta$, $u$ and outputs $\Delta$, $y$.]

Figure 4 The block diagram corresponding to the canonical set of system equations (45). The plant $G$, understood in a wide sense, has as outputs the actual observations $y$ and the vector of deviations $\Delta$. It has as inputs the control $u$ and the vector $\zeta$ of 'primitive' inputs to the system.
The signal $\Delta$ comprises all the 'deviations' which are penalised in the cost function. These would for example include tracking error and those aspects of the control $u$ itself which incur cost.

Here the plant $G$ is understood as including all given aspects of the system. These certainly include plant in the narrow sense (the process being controlled) but also the sensor system which provides the observations. They may also include subsidiary models used to predict, for example, sea and weather for the long-distance yachtsman of Section 1, or the future inflow to the reservoir of Section 2.9, or the command signal constituted by the position of a vehicle one is attempting to follow. The optimiser may be unable to exert any control upon these aspects, but he must regard them as part of the total given physical model.

As well as the control input $u$ to this generalised plant one has the exogenous input $\zeta$. This comprises all quantities which are primitive inputs to the system; i.e. exogenous to it and not explained by it. These include statistical noise variables (white noise, which no model can reduce) and also command sequences and the like which are known in advance (and so for which no model is needed).

It may be thought that some of these inputs should enter the system at another point; e.g. that observation noise should enter just before the controller, and that a known command sequence should be a direct input to the controller. However, the simple formalism of Figure 4 covers all these cases. The input $\zeta$ is in general a vector input whose components feed into the plant at different ports. A command or noise signal destined for the controller can be routed through the plant, and either included in or superimposed upon the information stream $y$. As far as plant outputs are concerned, the deviation signal $\Delta$ will not be completely observable in general, but must be defined if one is to evaluate (and optimise) system performance.
If we assume time-invariant linear structure then the block diagram of Figure 4 is equivalent to a set of relations

$$\Delta = G_{11}\zeta + G_{12}u, \qquad y = G_{21}\zeta + G_{22}u, \qquad u = Ky. \qquad (45)$$

We can write this as an equation system determining the system variables in terms of the system inputs; the endogenous variables in terms of the exogenous variables:

$$\begin{bmatrix} I & 0 & -G_{12} \\ 0 & I & -G_{22} \\ 0 & -K & I \end{bmatrix} \begin{bmatrix} \Delta \\ y \\ u \end{bmatrix} = \begin{bmatrix} G_{11} \\ G_{21} \\ 0 \end{bmatrix} \zeta. \qquad (46)$$
By inverting this set of equations one determines the system transfer function, which specifies the transfer functions from all components of the system input $\zeta$ to the three system outputs: $\Delta$, $y$ and $u$.

The classic demand is that the response of tracking error to command signal should be stable, but this may not be enough. One will in general require that all signals occurring in the system should be finite throughout their course. Denote the first matrix of operators in (46) by $M$, so that it is $M$ which must be inverted. The simplest demand would be that the solution of (46) should be determinate; i.e. that $M(s)$ should not be singular identically in $s$. A stronger demand would be that the system transfer function thus determined should be proper, so that the controlled system does not effectively differentiate its inputs. A yet stronger demand is that of internal stability; that the system transfer function should be stable.

Suppose all the coefficient matrices in (46) are rational in $s$. Then the case which is most clear-cut is that in which the poles of all the individual transfer functions are exactly at the zeros of $|M(s)|$, i.e. at the zeros of $|I - G_{22}(s)K(s)|$. In such a case stability of any particular response (e.g. of error to command signal) would imply internal stability, and the necessary and sufficient condition for stability would be that $|I - G_{22}(s)K(s)|$ should have all its zeros strictly in the left half-plane.

In fact, it is only in quite special cases that this pattern fails. These cases are important, however, because performance deteriorates as one approaches them. To illustrate the kind of thing that can happen, let us revert to the model (3) represented in Figure 3, which is indeed a special case of that which we are considering. The plant output $y$ is required to follow the command signal $w$; both of these are observable and the controller works on their difference $e$. The only noise is process noise $d$, superimposed upon the control input. Let us suppose for simplicity that all signals are scalar. Solution (5) then becomes

$$e = (1 + GK)^{-1}(Gd - w) \qquad (47)$$
in the response function notation. Suppose that $G(s)$ and $K(s)$ are rational. Then the transfer function $-(1 + GK)^{-1}$ of $e$ to $w$ is rational and proper and its only poles are precisely at the zeros of $1 + G(s)K(s)$. It is then stable if and only if these zeros lie strictly in the left half-plane. The same will be true of the response function $(1 + GK)^{-1}G$ of $e$ to $d$ if all unstable poles of $G$ are also poles of $GK$.

However, suppose that the plant response $G$ has an unstable pole which is cancelled by a zero of the controller response $K$. Then this pole will persist in the transfer function $(1 + GK)^{-1}G$, which is consequently unstable. To take the simplest numerical example, suppose that $G = (s - 1)^{-1}$, $K = 1 - s^{-1}$. Then the transfer functions

$$-(1 + GK)^{-1} = -\frac{s}{s + 1}, \qquad (1 + GK)^{-1}G = \frac{s}{(s + 1)(s - 1)}$$
are respectively stable and unstable. One can say in such a case that the controller is such that its output does not excite this unstable plant mode, so that the mode seems innocuous. The mode is there and ready to be excited, however, and the noise does just that. Moreover, the fact that the controller cannot excite the mode means that it is also unable to dampen it.

These points are largely taken care of automatically when the controller is chosen optimally. If certain signal amplitudes are penalised in the cost function, then those signals will be stabilised to a low value in the optimal design, if they can be. If inputs are such that they will excite an instability of the total system then such instabilities will be designed out, if they can be. If inputs are such that their differentials do not exist then the optimal system will be proper, if it can be. One may say that optimality enforces a degree of robustness in that, as far as physical constraints permit, it protects against any irregularity permitted in system input which is penalised in system output. Optimisation, like computer programming, is a very literal procedure. It supplies all the protection it can against contingencies which are envisaged, none at all against others.
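The cancellation phenomenon above can be exhibited mechanically. A sketch using numpy polynomial arithmetic (coefficients in decreasing powers of $s$): forming the closed-loop denominator $d_G d_K + n_G n_K$ before any cancellation shows the pole at $s = 1$ that the reduced response of $e$ to $w$ conceals.

```python
import numpy as np

nG, dG = [1.0], [1.0, -1.0]            # G = 1/(s - 1)
nK, dK = [1.0, -1.0], [1.0, 0.0]       # K = 1 - 1/s = (s - 1)/s

# common closed-loop denominator dG dK + nG nK = s^2 - 1
den = np.polyadd(np.polymul(dG, dK), np.polymul(nG, nK))
print(np.roots(den))                   # [ 1., -1.]

# e to w: -(1+GK)^{-1} = -dG dK / den; the factor (s-1) cancels, leaving
#         -s/(s+1), which is stable.
# e to d:  (1+GK)^{-1} G = nG dK / den = s/((s+1)(s-1)); the unstable pole
#          at s = 1 persists, so this response is unstable.
```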
Exercises and comments

(1) Application of the converse to the final value theorem (Section 8) can yield useful information about dynamic lags: the limiting values for large time of the tracking error $e$ or its derivatives. Consider a scalar version of the simple system of Figure 3. If the loop transfer function has the form $ks^{-N}(1 + o(s))$ for small $s$, then $k$ is said to be the effective gain of the loop and $N$ its type number: the effective number of integrations achieved in passage around the loop. Consider a command signal $w$ which is equal to $t^n/n!$ for positive $t$ and zero otherwise. Then $\bar{w} = s^{-n-1}$, and it follows from (47) and an application of the converse to the final value theorem (if applicable) that the limit of $\mathcal{D}^j e$ for large $t$ is $\lim_{s \downarrow 0} O(s^{N-n+j})$. It thus follows that the limit offset in the $j$th differential of the output path $y$ is zero, finite or infinite according as $n$ is less than, equal to or greater than $N + j$.

So, suppose that $w$ is the position of a fleeing hare and $y$ the position of a dog pursuing it. Then a zero offset for $j = 1$ and $n = 2$ would mean that, if the hare maintained a constant acceleration (!) then at least the difference in the velocities of the dog and the hare would tend to zero with time. It appears then that an increase in $N$ improves offset. However, it also causes a decrease in stability, and $N = 2$ is regarded as a practical upper limit.
CHAPTER 5

State-structured Deterministic Models

In the last chapter we considered deterministic models in the classic input-output formulation. In this we discuss models in the more explicit state-space formulation, specialising rather quickly to the time-homogeneous linear case. The advantage of the state-space formulation is that one has a physically explicit model whose dynamics and whose optimisation can both be treated by recursive methods, without assumption of stationarity. Concepts such as those of controllability and observability are certainly best developed first in this framework. The advantage of the input-output formulation is that one can work with a more condensed formulation of the model (in that there is no necessity to expand it to a state description) and that the transform techniques then available permit a powerful treatment of, in particular, the stationary case. We shall later move freely between the two formulations, as appropriate.

1 STATE-STRUCTURE FOR THE UNCONTROLLED CASE: STABILITY; LINEARISATION

Let us set questions of control and observation to one side to begin with, and simply consider a dynamic system whose course is described by a process variable $x$. We have already introduced the notion of state structure for a discrete-time model in Section 2.1. The system has state structure if $x$ obeys a simple recursion
=
a(Xt-l,
t),
(1)
when x is termed the state variable. Dynamics are time-homogeneous if they are governed by time-independent rules, in which case (1) reduces to the form Xt
=
a(Xt-l)·
(2)
We have said nothing of the set of values within which $x$ may vary. In the majority of practical cases $x$ is numerical in value: we may suppose it a vector of finite dimension $n$. The most amenable models are those which are linear, and the assumption of linearity often has at least a local validity. A model which is state-structured, time-homogeneous and linear then necessarily has the form

$$x_t = Ax_{t-1} + b, \qquad (3)$$
where $A$ is a square matrix and $b$ an $n$-vector. If the equation $(I - A)\bar{x} = b$ has a solution for $\bar{x}$ (see Exercise 1) then we can normalise $b$ to zero by working with a new variable $x - \bar{x}$. If we assume this normalisation performed, then the model (3) reduces to

$$x_t = Ax_{t-1}. \qquad (4)$$

The model has by now been pared down considerably, but is still interesting enough to serve as a basis for elaboration in later sections to controlled and imperfectly observed versions.

We are now interested in the behaviour of the sequence $x_t = A^t x_0$ generated by (4). It obviously has an equilibrium point $x = 0$ (corresponding to the equilibrium point $x = \bar{x}$ of (3)). This will be the unique equilibrium point if $I - A$ is non-singular, when the only solution of $x = Ax$ is $x = 0$. Supposing this true, one may now ask whether this equilibrium is stable in that $x_t \to 0$ with increasing $t$ for any $x_0$.

Theorem 5.1.1 The equilibrium of system (4) at $x = 0$ is stable if and only if all eigenvalues of the matrix $A$ have modulus strictly less than unity.

Proof Let $\lambda$ be the eigenvalue of maximal modulus. Then there are sequences $A^t x_0$ which grow as $|\lambda|^t$, so $|\lambda| < 1$ is necessary for stability. On the other hand, no such sequence grows faster than $t^{n-1}|\lambda|^t$, so $|\lambda| < 1$ is also sufficient for stability. □

A matrix $A$ with this property is termed a stability matrix. More explicitly, it is termed a stability matrix 'in the discrete-time sense', since the corresponding property in continuous time differs somewhat. Note that if the equilibrium at zero is stable then it is necessarily unique; if it is not unique then it cannot be stable (Exercise 2).

Note that $g_t = A^t$ is the transient response function of the system (4) to a driving input. The fact that stability implies exponential convergence of this response to zero also implies lq-stability of the filter thus constituted, and so of the filter of Exercise 4.4.1. The stability criterion deduced there, that $C(I - Az)^{-1}B$ should have all its singularities strictly outside the unit circle, is implied by that of Theorem 5.1.1.

All this material has a direct analogue in the continuous-time case, at least for the case of vector $x$ (to which we are virtually forced; see the discussion of Exercise 21.1). The analogue of (2), a state-structured time-homogeneous model, is
$$\dot{x} = a(x). \qquad (5)$$

(For economy of notation we use the same notation $a(x)$ as in (2), but the functions in the two cases are quite distinct.) The normalised linear version of this model, corresponding to (4), is
$$\dot{x} = Ax. \qquad (6)$$

The analogue of the formal solution $x_t = A^t x_0$ of (4) is the solution

$$x(t) = e^{tA}x(0) := \sum_{j=0}^{\infty} \frac{(tA)^j}{j!}\, x(0) \qquad (7)$$

of (6). The stability criterion is also analogous.

Theorem 5.1.2 The equilibrium of system (6) at $x = 0$ is stable if and only if all eigenvalues of the matrix $A$ have real part strictly less than zero.

Proof Let $\sigma$ be the eigenvalue of maximal real part. Then there are functions $x(t) = e^{tA}x(0)$ which grow as $e^{\sigma t}$, so $\operatorname{Re}(\sigma) < 0$ is necessary for stability. On the other hand, no such function grows faster than $t^{n-1}e^{\operatorname{Re}(\sigma)t}$, so $\operatorname{Re}(\sigma) < 0$ is also sufficient for stability. □
A matrix $A$ with this property is a stability matrix in the continuous-time sense.

If $a(x)$ is nonlinear then there may be several solutions of $a(x) = 0$, and so several possible equilibrium points. Recall the definitions of Section 1.2: that the domain of attraction of an equilibrium point is the set of initial values from which the path would lead to that point, and that the equilibrium is locally stable if its domain of attraction includes a neighbourhood of the point. Henceforth we shall take 'stability' as meaning simply 'local stability'. For non-linear models the equilibrium points are usually separated (which is not possible in the linear case; see Exercise 2) and so one or more of them can be stable. Suppose that $\bar{x}$ is such an equilibrium point, and define the deviation $\Delta(t) = x(t) - \bar{x}$ of $x$ from the equilibrium value. If $a(x)$ possesses a matrix $a_x$ of first derivatives continuous in the neighbourhood of $\bar{x}$ and has value $A$ at $\bar{x}$ then equation (5) becomes

$$\dot{\Delta} = A\Delta \qquad (8)$$

to within a term $o(\Delta)$ in the neighbourhood of $\bar{x}$. The state variable $x$ will indeed thus remain in the neighbourhood of $\bar{x}$ if $A$ is a stability matrix, and it is by testing $A = a_x(\bar{x})$ that one determines whether or not $\bar{x}$ is locally stable.

The passage from (5) to (8) is termed a linearisation of the model in the neighbourhood of $\bar{x}$, for obvious reasons, and the technique of linearisation is indeed an invaluable tool for the study of local behaviour. However, one should be aware that nonlinear systems such as (2) and (5) can show limiting behaviour much more complicated than that of passage to a static equilibrium: e.g. limit cycles or chaotic behaviour. Either of these would represent something of a failure in most control contexts, however, and it is reasonable to expect that optimisation will exclude them for all but the most exotic of examples.
We have already seen an example of multiple equilibria in the harvesting example of Section 1.2. If the harvest rate was less than the maximal net reproduction rate then there were two equilibria; one stable and the other unstable.

The stock example is of course the pendulum; in its linearised form the archetypal harmonic oscillator. If we suppose the pendulum undamped then the equation of motion for the angle $\alpha$ of displacement from the hanging position is

$$\ddot{\alpha} + \omega^2 \sin\alpha = 0, \qquad (9)$$

where $\omega^2$ is inversely proportional to the effective length of the pendulum. There are two static equilibrium positions: $\alpha = 0$ (the hanging position) and $\alpha = \pi$ (the inverted position). Let us bring the model to state form and linearise it simultaneously, by defining $\Delta$ as the vector whose elements are the deviations of $\alpha$ and $\dot{\alpha}$ from their equilibrium values, and then retaining only first-order terms in $\Delta$. The matrix $A$ for the linearised system is then

$$A = \begin{bmatrix} 0 & 1 \\ \pm\omega^2 & 0 \end{bmatrix},$$

where the $+$ and $-$ options refer to the inverted and hanging equilibria respectively. We find that $A$ has eigenvalues $\pm\omega$ in the inverted position, so this is certainly unstable. In the hanging position the eigenvalues are $\pm i\omega$, so this is also unstable, but only just: the amplitude of the oscillation about equilibrium remains constant. Of course, one can calculate these eigenvalues simply by determining which values $\sigma$ are consistent with a solution $\alpha(t) = e^{\sigma t}$ of the linearised version of equation (9). However, we shall tend to discuss models in their state-reduced forms.

Discrete-time models can equally well be linearised; we leave details to the reader. We shall develop some examples of greater novelty in the next section, when we consider controlled processes.
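A quick numerical confirmation of the two eigenvalue computations (the value of $\omega$ is illustrative):

```python
import numpy as np

w = 2.0                                       # omega, illustrative
A_hang = np.array([[0.0, 1.0], [-w**2, 0.0]]) # hanging equilibrium
A_inv  = np.array([[0.0, 1.0], [ w**2, 0.0]]) # inverted equilibrium

print(np.linalg.eigvals(A_inv))    # [ 2., -2.]: +/- omega, clearly unstable
print(np.linalg.eigvals(A_hang))   # [0.+2.j, 0.-2.j]: +/- i omega, marginal
```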
Exercises and comments

(1) This exercise and the next refer to the discrete-time model (3). If $(I - A)\bar{x} = b$ has no solution for $\bar{x}$ then a finite equilibrium value certainly does not exist. It follows also that $I - A$ must be singular, so that $A$ has an eigenvalue $\lambda = 1$.

(2) If $(I - A)\bar{x} = b$ has more than one solution then, again, $I - A$ is singular. Furthermore, any linear combination of these solutions with scalar coefficients (i.e. any point in the smallest linear manifold $\mathcal{M}$ containing these points) is a solution, and a possible equilibrium. There is neutral equilibrium between points of $\mathcal{M}$ in that, once $x_t$ is in $\mathcal{M}$, there is no further motion.
(3) Suppose that the component $x_{jt}$ of the vector $x_t$ represents the number (assumed continuous-valued) of individuals of age $j$ in a population at time $t$, and that the $x_{jt}$ satisfy the dynamic equations

$$x_{0t} = \sum_{j=0}^{\infty} a_j x_{j,t-1}, \qquad x_{jt} = b_{j-1} x_{j-1,t-1} \quad (j > 0).$$

The interpretation is that $a_j$ and $b_j$ are respectively the reproduction and survival rates at age $j$. One may assume that $b_j = 0$ for some finite $j$ if one wishes the dimension of the vector $x$ to be finite. Show that the equilibrium at $x = 0$ is stable (i.e. the population becomes extinct in the course of time) if all roots $\lambda$ of

$$\sum_{j=0}^{\infty} b_0 b_1 \cdots b_{j-1} a_j \lambda^{-j-1} = 1$$

are strictly less than unity in modulus. Show that the root of greatest modulus is the unique positive real root.

(4) A pattern observed in many applications is that the recursion (2) holds for a scalar $x$ with the function $a(x)$ having a sigmoid form: e.g. $a(x) = x^2/(1 + x)^2$ ($x \ge 0$). [...]
(5) [...] are thus necessary and sufficient for stability of this equilibrium.

(6) The eigenvalues and eigenvectors of $A$ are important in determining the 'modes' of the system (4) or (6). Consider the continuous-time case (6) for definiteness, and suppose that $A$ has the full spectral representation $A = H\Lambda H^{-1}$, where $\Lambda$ is the diagonal matrix of eigenvalues $\lambda_j$ and the columns of $H$ (rows of $H^{-1}$) are the corresponding right (left) eigenvectors. Then, by adoption of a new state vector $\tilde{x} = H^{-1}x$, one can write the vector equation (6) as the $n$ decoupled scalar equations $\dot{\tilde{x}}_j = \lambda_j \tilde{x}_j$, corresponding to the $n$ decoupled modes of variation. An oscillatory mode will correspond to a pair of complex conjugate eigenvalues.
96
STATE-STRUCTURED DETERM INISTIC MODELS
for non-zero fl.· One can imagine that two populat ion groups both reproduce at net rate A, but that group 2 also generates member s of group 1 at rate fl.· There is a double eigenvalue of A at A, but Ai
= ;\i-1 [ ~
jf
l
eAt_ -eAt [ 01
f.Ll]
1 .
One can regard this as a situation in which a mode of transient response $e^{\lambda t}$ (in continuous time) is driven by a signal of the same type; the effect is to produce an output proportional to $te^{\lambda t}$. If there are $n$ consecutive such stages of driving then the response at the last stage is proportional to $t^n e^{\lambda t}$. In the case when $\lambda$ is purely imaginary ($i\omega$, say) this corresponds to the familiar phenomenon of resonance of response to input of the same frequency $\omega$. The effect of resonance is that output amplitude increases indefinitely with time until other effects (non-linearity, or slight damping) take over.

2 CONTROL, IMPERFECT OBSERVATION AND STATE STRUCTURE

We saw in Section 2.1 that achievement of 'state structure' for the optimisation of a controlled process implied conditions upon both dynamics and cost function. However, in this chapter we consider dynamics alone, and the controlled analogue of the state-structured dynamic relation (1) would seem to be

$$x_t = a(x_{t-1}, u_{t-1}, t), \qquad (10)$$

which is indeed the relation assumed previously.

Control can be based only upon what is currently observable, and it may well be that current state is not fully observable. Consider, for example, the task of an anaesthetist who is trying to hold a patient in a condition of light anaesthesia. The patient's body is a dynamical system, and so its 'physiological state' exists in principle, but is far too complex to be specifiable, let alone observable. The anaesthetist must then do as best he can on the basis of relatively crude indicators of state: e.g. appearance, pulse and breathing.

In general we shall assume that the new observation available at time $t$ is of the form

$$y_t = c(x_{t-1}, u_{t-1}, t). \qquad (11)$$

So, if the new information consisted of several numerical observations, then $y_t$ would be a vector. Note that $y_t$ is regarded as being an observation on immediate past state $x_{t-1}$ rather than on current state $x_t$. This turns out to be the formally natural convention, although it can certainly be modified. It is assumed that the past control $u_{t-1}$ is known; one remembers past actions taken. Relation (11) thus
represents an imperfect observation on $x_{t-1}$, whose nature is perhaps affected both by the value chosen for $u_{t-1}$ and by time. Information is cumulative; all past observations are supposed in principle to be available.

In the time-homogeneous linear case $x$, $u$ and $y$ are vectors and relations (10) and (11) reduce to

$$x_t = Ax_{t-1} + Bu_{t-1}, \qquad (12)$$

$$y_t = Cx_{t-1}. \qquad (13)$$
Formal completeness would demand the inclusion of a term $Du_{t-1}$ in the right-hand member of (13). However, this term is known in value, and can just as well be subtracted out. Control does not affect the nature of the information gained in this linear case.

System (12), (13) is often referred to as the system $[A, B, C]$, since it is specified by these three matrices. The dimension of the system is $n$, the dimension of $x$. The linear system is relatively tractable, which explains much of its popularity. However, for all its relative simplicity, the $[A, B, C]$ system generates a theory which as yet shows no signs of completion.

Once a particular control rule has been chosen then one is back in the situation of the last section. Suppose, for example, that current state is in fact observable, and that one chooses a control rule of the form $u_t = Kx_t$. The controlled plant equation for system (12) then becomes

$$x_t = (A + BK)x_{t-1},$$

whose solution will converge to zero if $A + BK$ is a stability matrix.

The continuous-time analogue of relations (10), (11) is
$$\dot{x} = a(x, u, t), \qquad y = c(x, u, t), \qquad (14)$$

and of (12), (13)

$$\dot{x} = Ax + Bu, \qquad y = Cx + Du, \qquad (15)$$
with $D$ usually normalised to zero. Note that, while the plant equation of (14) or (15) now becomes a first-order differential equation, the observation relation becomes an instantaneous relation, non-differential in form. This turns out to be the natural structure to adopt on the whole, although it can also be natural to recast the observation relation in differential form; see Chapter 25. In continuous time the system (15) with $D$ normalised to zero is also referred to as the system $[A, B, C]$.

One sometimes derives a linear system (15) by the linearisation of a time-homogeneous non-linear system in the neighbourhood of a stable equilibrium point of the controlled system. Suppose that state and control values fluctuate about constant values $\bar{x}$ and $\bar{u}$, so that $y$ fluctuates about $\bar{y} = c(\bar{x}, \bar{u})$. Defining the transformed variables $\tilde{x} = x - \bar{x}$, $\tilde{u} = u - \bar{u}$ and $\tilde{y} = y - \bar{y}$, we obtain the
STATE-STRUCTURED DETE RMIN ISTIC MODE LS
system (15) in these transf orme d variables as a linearised version of the system (14) with the identifications
B= au,
D =Cu. Here the derivatives are evaluated at .X u, and must be supposed continuous in a neigh bourh ood of this point. The appro xima tion remai ns valid only as long as x and u stay in this neighbourhood, which implies either that (.X, u) is a stable equilibrium of the controlled system or that one is consi dering only a short time span. Subject to these latter considerations, one can linear ise even about a timevariable orbit, as we saw in Section 2.9. Exercises and comments (1) Non-uniqueness of the state variable. If relations (12), (13) are regarded as just a way of realising a transfer function C(I ~ Az)- 1 2 Bz from u to y then this realisation is far from unique. By consi derin g a new state variable Hx (for square nonsingular H) one sees that the system [HAH - 1, HB, cs- 1] realises the same transfer function as does [A, B, C ]. (2) A satellite in a planar orbit. Let (r, B) be the polar coordinates of a particle of unit mass (the satellite) moving in a plane and gravi tationally attrac ted to the origin (where the centre of mass of the Earth is supposed situated). The Newt onian equations of motio n are then
where $\gamma r^{-2}$ represents the gravitational force and $u_r$ and $u_\theta$ are the radial and tangential components of a control force applied to the satellite. A possible equilibrium orbit under zero control forces is the circle of radius $r = \rho$, when the angular velocity must be $\dot{\theta} = \omega = \sqrt{\gamma/\rho^3}$. Suppose that small control forces are applied; define $x$ as the 4-vector whose components are the deviations of $r$, $\dot{r}$, $\theta$ and $\dot{\theta}$ from their values on the circular orbit and $u$ as the 2-vector whose elements are $u_r$ and $u_\theta$. Show that for the linearised version (15) of the dynamic equations one has

$$A = \begin{bmatrix} 0 & 1 & 0 & 0 \\ 3\omega^2 & 0 & 0 & 2\rho\omega \\ 0 & 0 & 0 & 1 \\ 0 & -2\omega/\rho & 0 & 0 \end{bmatrix}, \qquad B = \begin{bmatrix} 0 & 0 \\ 1 & 0 \\ 0 & 0 \\ 0 & 1/\rho \end{bmatrix}.$$

Note that the matrix $A$ has eigenvalues $0$, $0$ and $\pm i\omega$. The zero eigenvalues correspond to the neutral stability of the orbit, which is one of a continuous family of ellipses. The others correspond to the periodic motion of frequency $\omega$.
In deriving the dynamic equations for the following standard examples we appeal to the Lagrangian formalism for Newtonian dynamics. Suppose that the system is described by a vector $q$ of position coordinates $q_j$, that the potential and kinetic energies of the configuration are functions $V(q)$ and $T(q, \dot{q})$ and that an external force $u$ with components $u_j$ is applied. Then the dynamic equations can be written

$$\frac{d}{dt}\left(\frac{\partial T}{\partial \dot{q}_j}\right) - \frac{\partial T}{\partial q_j} + \frac{\partial V}{\partial q_j} = u_j \qquad (j = 1, 2, \ldots).$$
(3) The cart and the pendulum. This is the celebrated control problem formulated in Section 2.8: the stabilisation of an inverted pendulum mounted on a cart by the exertion of a horizontal control force on the cart. In the notation of that section we have the expressions
V= Lmg cos a, Show that the equations of motion, linearised for small a, are (M +m)q+m Lii
= u,
q+
La= ga,
and hence derive expressions (2.60) for the matrices A and B of the linearised state equations. Show that the eigenvalues of A are 0, 0 and ±Jg(l + m/ M)/ L. The zero eigenvalues correspond to the 'mode' in which the whole system moves at a constant (and arbitrary) horizontal velocity. The positive eigenvalue of course corresponds to the instability of the upright pendulum. (4) A popular class of controllers is provided by the PID (proportional, integral, differential) controllers, for which u is a linear function of current values of tracking error, its time-integral and its rate of change. Consider the equation for the controlled pendulum, linearised about the hanging position: ii + u?a = u. Suppose one wishes to stabilise this to rest, so that a is itself the error. Note that a purely proportion al control will never stabilise it. The LQ-optimal control of Section 2.8 would be linear in a and a, and so would be of the PD form. LQoptimisation will produce something like an integral term in the control only if there is observation error; see Chapter 12. 3 THE CAYLEY-HAMILTON THEOREM
A deeper study of the system [A, B, C ] takes one into the byways oflinear algebra. We shall manage with a knowledge of the standard elementary properties of matrices. However, there is one result which should be formalised. Theorem 5.3.1 Let A be an n x n matrix. Then the first n powers I, A, A 2 , ..• , An-I constitute a basis for all the powers A' of A, in that scalar coefficients CrJ exist such that
100
STATE-STRUCTURED DETERM INISTIC MODELS n-l
A'= LCrjAj
(r = 0, 1,2, ... ).
}=0
(16)
It is importa nt that the coefficients are scalar, so that each element of A' has the same representation in terms of the corresponding elements of I, A, . .. , An-I.
Proof Define the generating function CXl
if>(z) = L(Azj =(I- Az)- 1 }=0
where z is a scalar; this series will be convergent if lzl is smaller than the reciprocal of the largest eigenvalue of A. Writing the inverse as the adjugate divided by the determinant we have then II
Azi(z)
adj(I
Az)
(17)
Now II - Azl is a polynomial with scalar coefficients a1 : n
II -Azl
""a·zi L..J j ' }=0
say, and the elements of adj(I- Az) are polynomials in z of degree less than n. Evaluating the coefficient of z' on both sides of (17) we thus deduce that n
La1Ar- } = 0
(r
~
n).
(18)
}=0
Relation (18) constitutes a recursion for the powers of A with scalar coefficients a 1. It can be solved for r ~ n in the form (16). D . The Cayley-Hamilton theorem asserts simply relation (18) for r = n, but this has the extended relation (18) and Theorem 5.3.1 as immediate consequences. It is sometimes expressed verbally as 'a square matrix obeys its own characteristic equation', the characteristic equation being the equation 2:-}=o a ;...n-J = 0 for the 1 eigenvalues A..
Exercises and comments (1) Define the nth degree polynomial P(z) - II- Azl = 2:-}=o aJZi. If we have a discrete-time system x 1 = Ax1-1 then Theorem 5.3.1 implies that any scalar linear function~~= cT x 1 of the state variable satisfies the equation P(:Y)~ 1 = 0. That is, a first-order n-vector system has been reduced to an nth-order scalar system.
I
101
4 CONTROLLABILITY (DISCRETE TIME)
One can reverse this manipulation. Suppose that one has a model for which the process variable ~ is a scalar obeying the plant equation P( .:?T)~1 = bur-l (*) with ao = 1. Show that the column vector x 1 with elements (~1 ,~1 - 1 , .•. ,~r-n+I) is a state-variablew ithplantequatio nx1 = Axr-I +Bur- I. where
-an-I 0
-a2
0 1
0
-an] .:.
,
B~
II b :
0 The matrix A is often termed the companion matrix of the nth-order system ( *). (2) Consider the continuous-time analogue of Exercise 1. If x obeys x = Ax then it follows as above that the scalar ~ obeys the nth-order differential equation P(fl))~ = 0. Reverse this argument to obtain a companion form (i.e. state-reduced form) of the equation P(fl))~ =bu. Note that this equation must be regarded as expressing the highest-order differential of~ in terms oflower-order differentials, whereas the discrete-time relation (*) expresses the least-lagged variable~~ in terms oflagged variables.
4 CONTROLLABILITY (DISCRETE TIME) The twin concepts of controllability and observability concern the respective questions of whether control bites deeply enough that one can bring the state variable to a specified value and whether observations are revealing enough that one can indeed determine the value of the state variable from them. We shall consider these concepts only in the case of the time-homogeneous linear system (12), (13), when they must mirror properties of the three matrices A, B and C. The system is termed r-controllable if, given an arbitrary value of xo, one can choose control values u0 , ui, ... , Ur-I such that x, has an arbitrary prescribed value. For example, if m = n and B is non-singular then the system is 1controllable, because one can move from any value of xo to any value of XI by choosing u0 = B-I (xi - Ax0 ). As a second example, consider the system Xt =
[ au
a2I
0]
a22
Xr-I
+
[O1]
Ur-I
for which m = 1, n = 2. It is never 1-controllable, because u cannot affect the second component of x in a single step. It is uncontrollable if a2I = 0, because u can then never affect this second component. It is 2-controllable if a21 ¥= 0, because
102
STATE-STRUCTURED DETERMIN ISTIC MODELS
x2
-
A 2 Xo
[1 a"][uo] u, ,
= Bu, + ABuo = 0 a 21
and this has a solution for uo, u, if a21 Theorem 5.4.1 matrix
#
0. This argument generalises.
Then-dime nsional system [A, B, ·]is r-controllable T~on
= (B, AB, A 2 B, ... , A'- 1B]
ifand only if the (19)
hasrankn.
We write the system as [A, B, ·]rather than as {4, B, q since the matrix C evidently has no bearing on the question of controllability. The matrix (19) is written in a partitioned form; it has n rows and mr columns. The notation r;on is clumsy, but short-lived and motivated by the limitations of the alphabet; read it as 'test matrix of size r for the control context: Proof If we solve the plant equation (12) for x, in terms of the initial value x 0 and subsequent control values we obtain the relation x,- A' xo
=
Bu,_,
+ ABur-2 + · ·· + Ar-i Buo.
(20)
The question is whether this set of linear equations in uo, UJ, ... , u,_J has a solution, whatever the value of then-vecto r x,- A'x0 . Such a solution will always exist if and only if the coefficient matrix of the u-variables has rank n. This matrix is just T;on, whence the theorem follows. 0 If equation (20) has a solution at all then in general it has many. We shall find a way of determining 'good' solutions in Theorem 5.4.3. Meanwhile, the CayleyHamilton theorem has an important consequence. Theorem 5.4.2 Ifa system of dimension n is r-controllable, then it is also s-controllablefor s ~min (n, r). Proof The rank of ~on is non-decreasing in r, so r-controllability certainly implies s-controllability for s ~ r. However, Theorem 5.3.1 implies that the rank of ~on is constant for r ~ n, because it implies that the columns of T;on are then linear combinations (with scalar coefficients) of the columns of r;on_ The system is thus n-controllable if it is r-controllable for any r, and we deduce the complete assertion. 0
If the system is n-controllable then it is simply termed controllable. This is a reasonable convention, since Theorem 5.4.2 then implies that the system is
103
4 CONTROLLABILITY (DISCRETE TIME)
if controllable if it is r-controllable for any r, and that it is r-controllable for r ? n it is controllable. One should distinguish between controllability, which implies that one can bring the state vector to a prescribed value in at most n steps, and the weaker property of stabilisability, which requires only that a matrix K can be found such that A + BK is a stability matrix, and so that the policy u1 = Kx 1 will stabilise the equilibrium at x = 0. It will be proved incidentally in Section 6.1 that, if the process can be stabilised in any way, then it can be stabilised in this way; also that controllability implies stabilisability. That stabilisability does not imply controlis lability follows from the case in which A is a stability matrix and B is zero. This Note, ble. stabilisa so and stable, but uncontrolled, and so not controllable, however, that stabilisability does not imply the existence of a control which s stabilises the process to an arbitrary prescribed equilibrium point; see Section 6.2and6.6. Finally, the notion of finding a u-solution to the equation system (20) can be d made more definite if we require that the transfer from xo to x, be achieve optimally, in some sense. Theorem 5.4.3
( i) r-controllability is equivalent to the demand that the matrix r-1
co,on = LAjBQ -IBT(A TY
(21)
j=O
should be positive definite. Here Q is a prescribed positive definite matrix. (ii) Ifthe process is r-controllable then the control which achieves the passage from u; QuT is prescribed xo to prescribed x, with minimal control cost!
I:::C/
(0
~ r
< r).
(22)
Proof Let us take assertion (ii) first. We seek to minimise the control cost subject to the constraint (20) on the controls (if indeed this constraint can be satisfied, i.e. if controls exist which will effect the transfer). Free minimisation of the Lagrangian form ~I
~I
T=O
T=O
~ L:u'JQu T + AT(x,- A'xo- 'L,Ar-T -I B~)
yields the control evaluation UT
= Q-1 BT (ATr-T-1).
in terms of the Lagrange multiplier A. Evaluation of A by substitution of this solution back into the constraint (20) yields the asserted control (22).
104
STATE-STRUCTURED DETERMINISTIC MODELS
However, we see that (22) provides a control rule which is acceptable for general .xo and x, if and only if the matrix ~on is non-singular. The requirement of non-singularity of ~on must then be equivalent to that of controllability, and so to the rank condition of Theorem 5.4.1. This is a sufficient proof of assertion (i), but we can give an explicit argument. Suppose ~n singular. Then there exists an n-vector w such that wT ~on = 0, and so wT ~nw = 0, or r-1
~)wTAiB)Q- 1 (wTAiB)T = 0. j=fJ
But the terms in this sum are individwilly non-negative, so the sum can be zero only if the terms are individually zero, which implies in turn that wT~B = 0 (j = 0, 1, ... , r- 1). That is, wT~n = 0, so that r,:on is ofless than full row-rank, n. This chain of implications is easily reversed, demonstrating the equivalence of the two conditions: r,:on is of rank n if and only if ~n is, i.e. if and only if the non-negative definite matrix ~on is in fact positive definite. 0 The matrix ~on is known as the control Gramian. At least, this is the name given in the particular case Q = I and r = n. As the proof will have made clear, the choice of Q does not affect the definiteness properties of the Gramian, as long as Q is itself positive definite. Consideration of general Q has the advantage that we relate the controllability problem back to the optimal regulation problem of Section 2.4. We shall give some continuous-time examples in the exercises of the next section, for some of which the reader will see obvious discrete-time analogues. 5 CONTROLLAmLITY (CONTINUOUS TIME) Controllability considerations in continuous time are closely analogous to those in discrete time, but there are also special features. The system is controllable if, for a given t > 0, one can find a control {u(T); 0 ::;:;; T < t} which takes the state value from an arbitrary prescribed initial value x(O) to an arbitrary prescribed terminal value x(t). The value oft is immaterial to the extent that, if the system is controllable for one value oft, then it is controllable for any other. However, the smaller the value of t, and so the shorter the interval of time in which the transfer must be completed, the more vigorous must be the control actions. Indeed, in the limit t ! 0 infinite values of u will generally be required, corresponding to the application of impulses or differentials of impulses. This makes clear that the concept of ,..controllability does not carry over to continuous time, and also that some thought must be given to the class of controls regarded as admissible.
5 CONTROLLABIUTY (CONTINUOUS TIME)
105
It follows from the Cayley-Hamilton theorem and the relation 00
eAt = L)AtY /j!.
(23)
}=0
implicit in (7) that the matrices I,A,A 2 , .•• ,An-l constitute a basis also for the family of matrices {eAt; t ~ 0}. Here n is, as ever, the dimension of the system. We shall define T~on again as in (19), despite the somewhat different understanding of the matrix A in the continuous-time case.
Theorem 5.5.1 (i) The n-dimensional system [A, B, · ] is controllable ifand only if the matrix T;on has rank n. (ii) This condition is equivalent to the condition that the control Gramian G(tton = 1t eATBQ-IBTeATrdT
{24)
should be positive definite (for prescribed positive t and positive definite Q ). (iii) If the system is controllable, then the control which achieves the passage from prescribed x(O) to prescribed x( t) with minimal control cost! J~ uT Qu dr is
(25) Proof If transfer from prescribed x(O) to prescribed x( t) is possible then controls {u( r); 0 ~ T < t} exist which satisfy the equation x(t)
eAtx(O) =lot eA(t-r)Bu(r) dr,
(26)
analogous to (20). There must then be a control in this class which minimises the control cost defined in the theorem; we find this to be (25) by exactly the methods used to derive (22). This solution is acceptable if and only if the control Gramian G( tton is non-singular (i.e. positive definite); this is consequently the necessary and sufficient condition for controllability. As in the proof of Theorem 5.4.3 (i): if G(t)con were singular then there would be ann-vector w for which wTG(trn = 0, with the successive implications that wT eAt B = 0 (t ~ 0 ), wT Ai B = 0 (j = 0, 1, 2, ... ) and wTr,;on = 0. The reverse implications also hold. Thus, an alternative -necessary and sufficient condition for controllability is that r;on should have full ~~
0
While the Gramians G(tton for varying t > 0 and Q > 0 are either all singular or all non-singular, it is evident that G(tton approaches the zero matrix as t approaches zero, and that the control (25) will then become infinite.
106
STATE-STRUCTURED DETERMINISTIC MODELS
Exercises and comments (1) Consider the satellite example of Exercise 2.1 in its linearised form. Show that the system is controllable. Show that it is indeed controllable under tangential thrust alone, but not under radial thrust alone.
(2) Consider the two-variable system x1 = Ajx1 + u (j = 1, 2). One might regard this as a situation in which one has two rooms, roomj having temperature x1 and losing temperature at a rate ->...1x1, and heat being supplied (or extracted) exogenously at a common rate u. Show that the system is controllable if and only if >... 1 =f:. >...2 . Indeed, if )q = >...2 < 0 then the temperature difference between the two rooms will converge to zero in time, however u is varied. (3) The situation of Exercise 2 can be generalised. Suppose, as in Exercise 1.5, that the matrix A is diagonalisable to the form H- 1AH. With the change of state variable to the set of modal variables .X = Hx considered there the dynamic equations become
ij = AjXj + L
bjkUk,
k
where b1k is thejkth element ofthe matrix HB. Suppose all the A] distinct; it is a fact that the square matrix withjkth element >...J- 1 is then no~-singular. Use this fact to show that controllability, equivalent to the fact that r~on has rank n, is equivalent to the fact that HB should have no rows which are zero. In other words, the system is controllable if and only if there is some control input to any mode. This llSSertion does not, however, imply a failure of controllability if there are repeated eigenvalues. 6 OBSERVABILITY The notion of controllability rested on the assumption that the initial value of state was known. If, however, one must rely upon imperfect observations, then it is a question whether the value of state (either in the past or in the present) can be determined from these observations. The discrete-time system [A, B, C] is said to be r-observable if the value of x 0 can be inferred from knowledge of the subsequent observations YhY2 ... ,y, and subsequent relevant control values uo, u1 , ... , u,_ 2 . Note that, if xo can be thus determined, then x 1 is also in principle simultaneously determinable for all t for which one knows the control history. The notion of observability stands in a dual relation to that of controllability; a duality which indeed persists right throughout the subject. We have the determination
107
6 OBSERVABILITY
of Yr in terms of xo and subsequent controls. Thus, if we define the reduced observation r-2 2 C" y,.=y,.~ Ar-i- BUj
j=O
then xo is to be determined from the system ofequations
- = C'nA7"-l Xo Yr
(0 < r
~
r ).
(27)
These equations are mutually consistent, by hypothesis, and so have a solution. The question is whether this solution is unique. This is the reverse of the situation for controllability, when the question was whether equation (20) for the u-values had a solution at all, unique or not. Note an implication of the system (27): that the property of observability depends only upon the matrices A and C; not all upon B. We define the matrix
II CA
~bs=
r
c cA.r"-1 , CA 2
(28)
the test matrix of size r for the observability context.
Theorem 5.6.1 (i) Then-dimensional system [A,·, C] is r-observable if and only if the matrix T~bs has rank n. (ii) Equivalently, the system is r-observable ifand only ifthe matrix ~bs =
r-1
L(CAr-!)TM-!CAr-1
(29)
r=O
is positive definite, for prescribedpositive defznite M. (iii) Ifthe system is r-observable then the determination ofXo can be expressed r-1
xo = (~bs)-'M-1 L(CAr-!)Ty,..
(30)
r=O
(iv) Ifthe system is r-observable then it is s-observablefor s ~ min (n, r). Proof If system (27) has a solution for xo (which is so by hypothesis) then this solution is unique if and only if the coefficient matrix r,'bs of the system has rank n, implying assertion (i). Assertion (iv) follows by appeal to the Cayley-Hamilton theorem, as in Theorem 5.4.2. If we define the deviation T/r = Yr - CAr-! xo then equations (27) amount to T/r = 0 (0 < r ~ r). If these equations were not consistent we could still define a
J 108
STATE-STRUCTUR
I ED DE TE RM IN IS
TI C MODELS
'least-square' solution to them by minimising any positive-definite in these deviations quadratic form with respect to x • 0 In particular, I:~:~ rjJM- 1rJT· Th we could minimise is minimisation yield s ex indeed have a solutio n (i.e. are mutually co pression (30). If equations (27) ns unique then expressio n (30) must equal this istent, as we suppose) and this is so lution: the actual valu The criterion for uniq e of ueness of the least-sq uare solution is that G; x0 . non-singular, which is exactly condition (ii bs should be ). As in Theorem 5.4 conditions (i) and (i) ca .3, equivalence of n be verified directly, ifdesired. 0 Note that we have ag ain found it helpful to bring in an optimisa This time it was a qu tion criterion. estion, not of fmding a 'least cost' solution solutions are known to exist, but of fmdi when many ng a 'best fit' solutio solution may exist. Th n when no exact is approach lies close to the statistical appr when observations ar oach necessary e corrupted by noise ; see Chapter 12. Mat observation Gramian. rix (29) is the The continuous-time version of these results which bears the sam e relation to that of Th will now be apparent, with a proof eorem 5.6.1 as that of does to the material of Theorem 5.5.1 Section 4. Theorem 5.6.2 (i) Th en ifand only ifthe matrix T~-dimensional continuous-time system [A, ·, CJ is ob bs defined by (28) servable has rank n. (ii) This condition is eq uivalent to the condition that the observation G ramian G(t)obs = 1'(ceA'~") TM-IceA'~" dr (31) should be positive defin ite (for prescribed posit ive t and positive defin (iii) Ifthe system is obse ite M ). rvable then the determi nation ofx(O) can be wr itten x(O) = [G(t)obsrl M 1 (CeA'~")TM- 1 y(r) dr, where
1'
y(t) = y(t)
-1
1
ce A( t-T ) Bu(r)d
r.
Away of generating re al-time estimates of cu rrent state is to drive a plant by the appare nt discrepancy in ob model of the servation. For the co model (15) this would nt inuous-time amount to generating an estimate x(t) of x( of the equation t) as a solution
i = A x+ B u+ H (y - Cx ), where the matrix H is (32) to be chosen suitably. One can regard this as of a filter whose outp the realisation ut is the estimate x ge nerated from the know n inputs u and
6 OBSERVABILITY
109
y. Such a relation is spoken of as an observer, it is unaffected in its performance by the control policy. We shall see in Chapter 12 that the optimal estimating relation in the statistical context, the Kalman filter, is exactly of this form. Denote the estimation error x(t)- x(t) by ~(t~ By subtracting the plant equation from relation (32) and setting y = Cx we see that
A= (A -HC)!:::.. Estimation will thus be successful (in that the estimation error will tend to zero with time) if A- HC is a stability matrix. If it is possible to find a matrix H such that this is so then the system [A,·, C] is said to be detectable; a property corresponding to the control property of stabilisability.
Exercises and comments (1) Consider the linearised satellite model of Exercise 2.2. Show that state x is observable from angle measurements alone (i.e. from observation of eand B) but not from radial measurements alone. (2) The scalar variables x1 ( j equations XI
= 1, 2, ... , n)
= 2(1 + Xnr 1 -XI+ u, Xj = Xj-1
of a metabolic system obey the - Xj
(j
= 2, 3, ... 'n).
Show that in the absence of the control u there is a unique equilibrium point in the positive orthant. Consider the controlled system linearised about this equilibrium point. Show that it is controllable, and that it is observable from measurements of x 1 alone.
Notes We have covered the material which is of immediate relevance for our purposes, but this is only a small part of the very extensive theory which exists, even (and especially) for the time-homogeneous linear case. One classical piece of work is the Routh-Hurwicz criterion for stability, which states in verifiable form the necessary and sufficient conditions that the characteristic polynomial I>J -A I = 0 should have all its zeros strictly in the left half:plane. Modern work has been particularly concerned with the synthesis or realisation problem: can one find a system [A, B, C] which realises a given transfer function G? If one can find such a realisation, of finite dimension, then it is of course not unique (see Exercise 2.1). However, the main consideration is to achieve a realisation which is minimal in that it is of minimal dimension. One has the important and beautiful theorem: a system [A, B, C] realising G is minimal if and only if it is both controllable and observable. (See, for example, Brockett (1970) p. 94.)
110
STATE-STRUCTURED DETERMINISTIC MOD ELS
However, when we resume optimisation, the relev ant parts of this further theory are in a sense generated automatically and in the operational form dictated by the goal. So, existence theorems are repla ced by explicit solutions (as Theorem 5.4.3 gave greater definiteness to Theorem 5.4.1), the family of 'good' solutions is generated by the optimal solution as the cost function is varied, and the conditions for validity of the optimal solut ion provide the minimal and natural conditions for existence or realisability.
CHAPTER 6
Stationary Rules and Direct Optimisation for the LQModel The LQ model introduced in Sections 2.4, 2.8 and 2.9 has aspects which go far beyond what was indicated there, and a theory which is more elegant than the reader might have concluded from a first impression. In Section 1 we deal with a central issue: proof of the existence of infinite-horizon limits for the LQ regulation problem under appropriate hypotheses. The consequences of this for the LQ tracking problem are considered in Section 2. However, in Section 3 we move to deduction of an optimal policy, not by dynamic programming, but by direct optimisation of the trajectory by Lagrangian methods. This yields a treatment of the tracking problem which is much more elegant and insightful than that given earlier, at least in the stationary case. The approach is one which anticipates the maximum principle of the next chapter and provides a natural application of the transform methods of Chapter 4. As we see in Sections 5 and 6, it generalises with remarkable simplicity; we continue this line with the development of time-integral methods in Chapters 18-21. The material of Sections 3, 5, 6 and 7 must be regarded, not as a systematic exposition, but as a first sketch of an important pattern whose details will be progressively completed. 1 INFINITE-HORIZON LIMITS FOR THE LQ REGULATION PROBLEM One hopes that, if the horizon is allowed to become infinite, then the control problem will simplify in that it becomes time-invariant, i.e. such that a time-shift leaves the problem unchanged. One hopes in particular that the optimal policy will become time-invariant in form, when it is referred to as stationary. The stationary case is the natural one in a high proportion of control contexts, where one has a system which, for practical purposes, operates indefinitely under constant conditions. The existence of infinite-horizon limits has to be established by different arguments in different cases, and will certainly demand conditions of some kind-time homogeneity plus both the ability and the incentive to control In this section we shall study an important case, the LQ regulation problem of Section 2.4. In this case the value function F(x, t) has the form !xTII1x where II obeys the Riccati equation (2.25). It is convenient to write F(x, t) rather as Fs(x) = !xTIT(s)X where s = h- tis the 'time to go~ The matrix II(o) is then that
112
STATIONARY RULES AND DIRE CT OPTIMISA TION
assoc iated with the term inal cost function. The passa ge to an infin ite horiz on is then just the passa ges_ _, +oo, and infin ite-h orizo n limits will exist if IT(s) has a limit value 11 which is indep ende nt ofii(o) for the class of term inal cost functions one is likely to consider. In this case the matr ix K 1 = K(s) of the optim al contr ol rule (2.27) has a corre spon ding limit value K, so that the rule takes the statio nary form ut = Kxt. Two basic cond ition s are requi red for the existe nce of infin ite-h orizo n limits in this case. One is that of sensitivity: that any devia tion from the desir ed rest poin t x = 0, u = 0 shou ld ultimately carry a cost penal ty, and so dema nd correction. The other is that of controllability: that such any such devia tion can indee d be corre cted ultimately. We suppose S norm alise d to zero; a norm alisa tion which can be reversed if requi red by repla cing R and A in the calculation s below by R - sT Q- 1S and A - ST Q- 1B respectively. The Ricc ati equa tion then takes the form (s=l ,2, ... ).
(1)
wheref has the actio n fiT= R
+ ATIT A- ATITB(Q + BTITB)- 1BTITA.
(2)
Lemma 6.1.1 Suppose that IIco) = 0, R ~ 0, Q > 0. Then the sequence {IT(s)} is non-decreasing (in the ordering ofpositive-defmiteness) . Proof We have F1 = xT Rx ~ 0 = Fo. Thus, by Theo rem 3.1.1, Fs(x) = xTII(s)X is non-d ecrea sing ins for fixed x. That is, II(s) is non-d ecrea sing in the matri x-definiteness sense. Lemma 6.1.2 Suppose that II(o) = 0, R ~ 0, Q > 0 and that the system [A,B, ·]is either controllable or stabilisable. Then {II(s)} is boun ded above and has a finite limit II. Proof To demo nstra te boun dedn ess one must demo nstra te that a polic y can be found which incur s a finite infin ite-h orizo n cost for any presc ribed value x of initia l state. Controllability implies that there is a linea r control rule (e.g. that suggested in the proo f of Theo rem 5.4.3) which, for any xo = x, will bring the state to zero in at most n steps and at a finite cost xTrr• x, say. The cost of holdi ng it at zero there after is zero, so we can asser t that
(3) for all x. Stabilisability implies the same conclusio n, except that convergence to zero takes place exponentially fast rathe r than in a finite time. The nondecreasing sequence {II(s)} is thus boun ded abov e by II* (in the positive-definite
I LIMITS FOR THE LQ REGULATION PROBLEM
113
sense) and so has a limit II. (More explicitly, take x = e1 , the vector with a un.it in the jth place and zeros elsewhere. The previous lemma and relation (3) then imply that 1r;Js, the jjth element of II(s)• is non-decreasing and bounded above by 1rj;. It thus converges. By then taking x = e1 + ek we can similarly prove the convergence of 1rJks·) D We shall now show that, under natural conditions, II is indeed the limit of II(s) for any non-negative definite II(o)· The proof reveals more.
Theorem 6.1.3 Suppose that R > 0, Q > 0 and the system [A, B, ·] is either controllable or stabilisable. Then (i) The equilibrium Riccati equation II =/II
(4)
has a unique non-negative definite solution IT. (ii) For any finite non-negative definite Il(o) the sequence {II(s)} converges to II (iii) The gain matrix r corresponding to II is a stability matrix. Proof Define II as the limit of the sequencef(sJo. We know by Lemma 6.1.2 that this limit exists, is finite and satisfies (4). Setting u1 = Kx1 and Xr+I = (A+ BK)x1 = fx 1 in the relation
(5) where K and rare the values correspondin g to II, we see that we can write (4) as
(6) Consider the form
(7) and a sequence Xr
= f 1xo, for arbitrary Xo.Then (8)
Thus V(x1) decreases and, being bounded below by zero, tends to a limit. Thus
(9) which implies that x 1 --+ 0, since R + KT QK ~ R > 0. Since xo is arbitrary this implies that f 1 --+ 0, establishing (iii). We can thus deduce from (6) the convergent series expression for II: 00
II
= 2)rT)'" (R + KT QK)fi. }=0
(10)
114
STATIONARY RULES AND DIRE CT OPTIM ISATION
Note now that, for arbitr ary finite non-negative ll(s)
= f(s)n( o)
defin ite llco).
~ jCslo _. n.
(11)
Comparing the minim al s-hor izon cost with that incur red by using the statio nary rule Ut == Kx1 we dedu ce a reverse inequ ality s-1
IIcsJ ~ Z:)r Ty(R + KT QK) ri + (rTYllco)rs _. n.
(12)
j=O
Relations (11) and (12) imply (ii). Finally, asser tion (i) follows becau se, if another finite non-negative definite solut ion of (4), then
fi = f(slfi -+ ll.
fi
is 0
It is gratifying that proo f of the convergence ll(s) -+ n impli es incid ental ly that is a stability matrix. Of course, this is no more than one woul d expect: if the optimal policy is successful in drivi ng the state varia ble x to zero then it must indeed stabilise the equil ibriu m point x = 0. The proof appea led to the cond ition R > 0, which is exactly the cond ition that any deviation of x from zero shou ld be pena lised immediately. However, we can weaken this to the cond ition that the devia tion shoul d be pena lised ultimately.
r
Theorem 6.1.4 The conclusions of Theorem 6.1.3 rema in valid if the condition that R > 0 is replaced by the condition that, if R = LT L then the system [A, ·, L] is either observable or detectable. Proof ·Relation (9) now beco mes (Lxt)T(Lxt)
+ (Kx 1 )TQ(Kxt)-+ 0
which implies that Kx1 _, 0 and Lx1 -+ 0. Thes e convergences, with the relati on = (A+ BK)x1_ 1, imply that x 1 ultim ately enter s a mani fold for whic h
Xt
Xt = AXt-1 · (13) The observability cond ition implies that these relations can hold only if x 1 0. The detectability cond ition implies that we can find an H such that A - HL is a stability matrix, and since relati ons (13) imply that x 1 =(A - HL)x1_ 1 , then again x, -+ 0. Thus x 1 _, 0 unde r eithe r cond ition. This fact estab lished , the proof continues as in Theo rem 6.1.3. 0
=
We can note the corollaries of this result, already ment ioned in Secti on 5.4.
Corollary 6.1.5 (i) Controllability implies stabilisabil ity. ( ii) Stabilisability to x = 0 by any means impli es stabilisability by a control of the linear Markov form u1 = Kx1 •
2. STATIONARY TRACKING RULES
115
Proof (i) The proof of Theorem 6.1.3 demonstrated that a stabilising policy of the linear Markov form could be found if the system were controllable. (ii) The optimal policy under a quadratic cost function is exactly of the linear Markov form, so, if such a policy will not stabilise the system (in the sense of ensuring a finite-cost passage to x = 0), then neither will any other. 0 2. STATIONARY TRACKING RULES The proof of the existence of infinite horizon limits demonstrates the validity of the infinite-horizon tracking rule (267) of Section 2.9, at least if the hypotheses of the last section are satisfied and the disturbances and command signals are such that the feedforward term in (2.67) is convergent. We can now take matters somewhat further and begin, in the next section, to see the underlying structure. In order to avoid undue repetition of the material of Section 2.9 and to link up with conventional control ideas we shall discuss the continuous-time case. The continuous-time analogue of (267) would be
u- uc = K(x- ~)- Q- 1BT
1
00
erT'TII[d(t + r)- ~(t + r)] dr
( 14)
where II, K and rare the infinite-horizon limits of Section 1 (in a continuous-time version) and the time argument t is understood unless otherwise stated. We regard (14) as a stationary rule because, although it involves the time-dependent signal ( 15) this signal is seen as a system input on which the rule (14) operates in a stationary fashion. A classic control rule for the tracking of a command signal r in the case UC = 0 would be simply
u =K(x-~)
( 16)
where u = Kx is a control which is known to stabilise the equilibrium x = 0. We see that (14) differs from this in the feedforward term, which can of course be calculated only if the future courses of the command signal and disturbance are known. Neither rule in general leads ultimately to perfect following, ('zero offset') although rule (14) does so if d- ~. defmed in (15), tends to zero with increasing time. This is sometime expressed as the condition that all unstable modes of(~, UC, d) should satisfY the plant equation. There is one point that we should cover. In most cases one will not prescribe the course of all components of the process vector x, but merely that of certain linear functions of this vector. For example, an aeroplane following a moving target is merely required to keep that target in its sights from an appropriate distance and angle; not to specifY all aspects of its dynamic state. In such a case it is better not
116
STATIONARY RULES AND DIR ECT OPTIMISATION
to carr y out the normalisation of x, u and d ado pted in Section 2.9. If we assume that If = 0 and work with the raw variables then we find tha t the con trol rule (14) becomes rath er
(17) Details of derivation are omitted, because in the next section we sha ll develop an analysis which, at least for the stat ionary case, is much more direct and powerful than that of Section 2.9. Relation (17) is exactly wha t we want. A penalty term such as (xR( x- r) is a function only of tho se linear functions of (x- r) whi are penalised. The consequence ch is then that Rxc and Sr are fun ctions only of those linear functions of r which are prescribed. If we consider the case when S, uc and dar e zero and .XC is con stan t then relation (17) reduces to
.xcl
(18) which is to be com par ed with rela tion (16) and mu st be sup erio r to it (in average cost terms). Wh en we inse rt these two rules into the plan t equation we see tha t x settles to the equilibrium valu e r- 1BK x = -r- 1m.XC for rule (16) and r- 1J(r T)- 1Rr for rule (18). Her e Jis again the control-power mat rix BQ - 1BT. We obtain expressions for the tota l offset costs in Exercise 1 and Sec tion 6. Exercises and comments
(1) VerifY tha t the offset cost und er control (16) (assuming S zero and xc constant) is! (A rlP (A r), where
P= (rT )- 1(R+ KT QK )r-t = -n 1 r- - (rT )- 1TI. We shall come to evaluations und er the optimal rule in Section 6. How ever, if R is as_sumed non-singular (so tha t all components of r are necessarily specified) then location of the opt ima l equ ilibrium poi nt by the methods of Section 2.10 leads to the conclusion tha t the offs et cos t und er the optimal rule (18) again has the form !(A r)T P(A xc) , but now with P= (AR - 1AT +B Q- 1BT)- 1 . Thi s is generalised in equation (43).
3 DIRECT TRAJECTORY OPTIM ISATION: WHY THE OPTIMAL FEEDBACK/FEEDFORWARD CO NTROL RULE HAS THE FORM IT DOES Ou r analysis of the disturbed trac king problem in Section 2.9 won through to a solution with an appealing form , but only after some rath er unappealing
3 DIRECT TRAJECTORY OPTIMISATION
117
calculations. Direct trajectory optimisation turns out to offer a quick, powerful and transparent treatment of the problem, at least in the stationary case. The approach carries over to much more general models, and we shall develop it as a principal theme. Consider the discrete-time model of Section 2.9, assuming plant equation (2.61) and instantaneous cost function (2.62). Regard the plant equation at time r as a constraint and associate with it a vector Lagrange multiplier A.,., so that we have a Lagrangian form ~
= 2:)c(x.,., un r) + g(x.,.- Ax.,._I - Bu.,._I -d.,.)]+ terminal cost.
( 19)
T
Here the time variable r runs over the time-interval under consideration, wb.ich we shall now suppose to be h 1 < r < h2 ; the terminal cost is incurred at the horizon point r = h2. We user to denote a running value of time rather than t, and shall do so henceforth, reserving t to indicate the particular moment 'now'. In other words, it is assumed that controls u.,. for r < t have been determined, not necessarily optimally, and that the timet has come at which the value of u1 is to be determined in the light of information currently available. We shall refer to the form ~ of (19) as a 'time-integral' since it is indeed the discrete-time version of an integral. We shall also require of a 'time-integral' a property which~ possesses, that one optimises by extremising the integral .freely with respect to all variables except those whose values are currently known. That is, optimisation is subject to no other constraint The application of Lagrangian methods is certainly legitimate (at least in the case of a fixed and finite horizon) if all cost functions are non-negative definite quadratic forms; see Section 7.1. We can make the strong statement that the optimal trajectory from time t is determined by minimisation of~ with respect to (x,., u.,.) and maximisation with respect to A.,. for all relevant r ~ t. This extremisation then yields a linear system of equations which we can write
R [
ST
s
Q
I -A!T
-B!T
where ff is the backwards translation operator defined in (4.9) and normalised disturbance
(20)
d is
the
(21)
already introduced in Section 2.9. We have added a superscript (t) to emphasise that this is an optimisation from timet onwards; nothing is assumed of the (x, u) path before time t except that it is known. The effect of this is that the optimal control rule, when we deduce it, will be in closed-loop form.
118
STATIONARY RULES AN D DIR ECT OPTIMISATION
Let us write equation (20) as (22) where {(-r} is then known, and { ~~)}, the course of the deviati ons from the desired pat h of the (henceforth) optimally-controlled process, is presumably determined by (20) plus initial conditions at T = t and termina l con ditions at T = h. Note tha t the matrix is Hermitian, in that if we define the conjugate of = (.:T) as~= (.:r- 1 the n~= . Suppose tha t (z), with z a scalar complex variable, has a canonical factorisation
?
(23) where +(z) and +(z)- 1 can be validly expanded on the unit circ le wholly in non-negative powers of z and _ ditto for non-positive powers. Wh at this would mean is tha t an equation such as (24) for ~ with known v (and valid for all T before the current point of ope ration) can be regarded as a stable forward rec ursion for ~ with solution (25) Here the solution is tha t obtain ed by expanding the operator + (.:1) -t in nonnegative powers of .:1, and so is linear in present and pas t v; see Section 4.5. We have taken recursion (24) as a forw ard recursion in that we have solv ed it in terms of pas t v; it is stable in that solu tion (25) is certainly convergen t for uniformly bounded v. Factorisation (23) then implies the rewriting of (22) as
(26)
so representing the difference equation (22) as the compositio n of a stable f9rward recursion and a stable bac kward recursion. The situation may be plainer in terms of the scalar example of Exercise 1. One finds general ly tha t the optimisation of the pat h of a pro cess generated by a forward rec ursion yields a recursion of double order, sym metric in pas t and future, and tha t if we can represent this double-order rec ursion as the composition of a stable forward recursion and a stable backward recursion, then it is the stable forw ard recursion which determines the optimal forward pat h in the infmite-hor izo n case (see Chapter 18). Suppose we let the horizon point h2 tend to +oo, so that (26) holds for all T ~ t. We can the n legitimately half-inv ert (26) to
(27)
3 DIRECT TRAJECTORY OPTIMISATION
119
if(,. grows sufficiently slowly with increasing r that the expression on the right is convergent when ~ _ (!?") - 1 is expanded in non-positive powers of :!7. We thus have an expression for AT in terms of past A and present and future (.This gives us exactly what we want: an expression for the optimal control in the desired feedback/feedforward form. Theorem 6.3.1 Suppose that dT grows sufficiently slowly with T that the semi-inversion (27) is legitimate. More specifically, that the semi-inversion
~+(,.) [~A ~r ~ ~-(,.)_,
m.
(r
~
r)
(27')
is legitimate in that the right-hand member is convergent when the operator~_ (:!7) - 1 is expanded in non-positive powers offf. Then (i) The determination ofu1 obtained by setting r = tin relation (27') constitutes an expression ofthe infinite-horizon optimal control rule infoedback!feedforwardform. (ii) The Hermitian character of~ implies that the factorisation (26) can be chosen so that ~ _ = ~ +, whence it follows that the operator which gives the foedforward component is just the inverse ofthe conjugate ofthefoedback operator.
Relation (27') in fact determines the whole future course of the optimally controlled process recursively, but it is the determination of the current control u1 that is of immediate interest. The relation at r = t determines u1 (optimally) in terms of Xt and d1( r ~ t); the feedback/feedforward rule. Furthermore, the symmetry in the evaluation of these two components explains the structure which began to emerge in Section 2.9, and which we now see as inevitable. We shall both generalise this solution and make it more explicit in later chapters. The achievement of the canonical factorisation (23) is the equivalent of solving the stationary form of the Riccati recursion, and in fact the policyimprovement algorithm of Section 3.5 translates into a fast and natural algorithm for this factorisation. The assumptions behind solution (27) are two-fold. First, there is the assumption that the canonical factorisation (23) exists. This corresponds to the assumption that infmite-horizon limits exist for the original problem of Section 2.4; that of optimal regulation to zero in the absence of disturbances. Existence of the canonical factorisation is exactly the necessary and sufficient condition for existence of the infinite-horizon limits; the controllability/sensitivity assumptions of Theorem 6.1.3 were sufficient, but probably not necessary. We shall see in Chapter 18 that the policy-improvement method for deriving the optimal infinite-horizon policy indeed implies the natural algorithm for determination of the canonical factorisation. The second assumption is that the normalised disturbance dT should increase sufficiently slowly with r that the right-hand member of (27), the feedforward
120
STATIONARY RULES AND DIR ECT OPTIMISATION
term, should converge. Such convergence does not guarantee that the vector of 'errors' ll.t will converge to zero with incr easing t, even in the case when all components of this error are penalised. The re may well be non-zero offsets in the limit; the errors may even increase expone ntially fast However, convergence of the right-hand member of (27) implies that (4 increases slowly enough with time that an infinite-horizon optimisation is mea ningful. Specifically, suppose that the zeros of I
and II and K are the infinite-horizon limi ts of these quantities. Factorisation (28) differs slightly from (23) in that there is the interposing constant matrix
R
s
sT Q
1- Af / Bff
(t
~T
and perform the same semi-inversion on this system as previously. The command signal r occurs only in the combination s Rr and Sr, which are functions only of those components of r which are pres cribed. Exercises and comments (1) Consider the simple regulation problem in the undisturbed scalar case, when we can write the cost function as! E.,.[ Q.a - 2 (x,. - Ax-r- 1) 2 + R.x;] +te rmi nal cos t The problem is thus reduced, sinc e we have used the plant equation to eliminate u, and so need not introduce A. The stationarity condition on X-r yields
4 AN ALTERNATIVE FORM FOR THE RICCATI RECURSION
121
The symmetric (in past and future) second-order equation (•) can be legitimately reduced to a first-order stable forward equation, which determines the infinite-horizon optimal path from an arbitrary starting point, so yielding in effect an optimal control rule in open-loop form. Suppose the canonical factorisation iP(z) ex: (1- rz)(l- rz- 1) where r is less than unity in modulus. (Note that this is exactly the determination of the optimal gain factor r given by equation (236)). Then division of (*) by the 'future' factor leaves the equation (1 - r S")Xr = 0, or x.,. = rx.,.-1, which we know to be indeed the plant equation under optimal infinite-horizon control. (2) Use relations (27')-(29) to verify the determination (267) of the optimal control rule. (3) The continuous time version of equation (22) (for the state-structured model (2.52)/(2.53)) is iP(!i))A =(,where
iP(s)
=[
R S sf -A
ST Q -B
-sf -AT]
-BT 0
.
If we define the conjugate of iP(s) as iP(s) = iP(-s)T then iP is evidently self. conjugate. The analogue of the canonical factorisation (26) is iP(s) = iP_(s) iP+(s), where both iP+ and 'P+ 1have a Laplace representation valid for Re(s) ~ 0 which involves only exponentials e-st for non-negative t (and so have all singularities in the left half-plane). cp_ has the complementary definition, and will in fact be the conjugate of iP+. Solution by the semi-inversion (Zl) then proceeds analogously.
(4) Show that the roots of I'P-(z)l = 0 are just the eigenvalues of the optimal infinite-horizon gain matrix r. 4 AN ALTERNATIVE FORM FOR THE RICCATI RECURSION One could extremise the time-integral (19) recursively, and by doing so one is led to the expanded form of the optimality equation
F(x, t) = inf sup[c(x, u, t) + ,\T(Xt+I -Ax- Bu) + F(xt+1, t + 1)]. Xr+t.U
~
Here we have written x 11 u1 and At+1 simply as x, u and.\, and F(x, t) is the usual value function at time t, which we know to have the form! xTIT 1x. Let us suppose for simplicity that we are in the time-homogeneous case when c(x, u) is given by (223) and that S has been normalised to zero. This normalisation can be reversed by replacing A and R in the expression below by A - BQ- 1S and R - sT Q- 1S respectively. If we extremise ,\ out first in the equation above then this is reduced to the usual optimality equation (2.29), leading to expressions (2.25) and (2.Z7) for the Riccati
122
STATIONARY RULES AND DIRE CT OPTIM ISATION
recursion and the optim al control. If we extremise the variables out in the order xt+ 1 , u, Athen we find that the right-hand mem ber of the equa tion undergoes the successive trans form ation s ~(xT Rx + uT Qu) -AT (Ax+ Bu) - !ATII;~\ A ----)- !xT Rx- ATAx A-
p,Trr;;1 !.ATJ..\
----)- !xT[R +AT (J + rr;M- 1Ajx. where J = BQ- 1BT, as ever. That is, the Ricca ti equa tion is recovered in the alternative form
(31) Further, if we trace throu gh the extremal value of u1 which emerges from the above sequence of trans form ation s we find that it is u 1 = K 1x 1 , as ever, but with the gain matrix K 1 having the evaluation
(32) instead of (2.28). We shall find a use for these alternative forms (31) and (32) in Chap ter 16. In continuous time the two forms coalesce; expre ssions (2.55) and (2.56) for the continuous-time Ricca ti equa tion and control matr ix are (with the norm alisa tion to S = 0) the limit forms of both discrete-time versio ns. However, the alternative forms above have in comm on with the continuou s-time form s that they reveal the role of the control-power matr ix J = BQ- 1BT. 5 DIRECT TRAJECTORY OPTIMISATION FOR HIGHER-ORDER CASES The approach of Section 6.2 generalises imme diately. Suppose that the plant equation is generalised to
dx+ fllu =d
(33)
where .91 = A(ff ) = L:~=O A,ff ' and f!l = B(ff) to pth-o rder dynamics. The time-integral (19) then n=
= L:~=l B,ff' , corresponding generalises to
.l)c(Xn UTJ T) + A;(.s;/XT + PJUT- dT)] +term inal COSt T
(34)
and the equation system (20) correspondingly to
[!
sT Q
-?] [x-xcl(t) [0]
tJ
0
f!l
u-uc A
0
T
d
(35) T
6 THE CONCLUSIONS IN TRANSFER FUNCTION TERMS
123
Here an expression such as .91 again denotes A (g--I?, the conjugate of .s#. The normalised disturbance d now has the definition d- de= d- d r - ~uc. We can again write system (35) as (22); the matrix of operators (ff) thus defined again has the self. conjugacy property . Now, provided that a canonical factorisation (23) of (z) exists (which is again ensured by appropriate controllability and sensitivity hypotheses) then the manipulations (26)-(27) go through exactly as previously to yield the optimal control rule implied in (27). This is expressed in terms of present and past process variables, and so we must suppose these observable. The case of imperfect observation is best left until we treat the full stochastic case (see Chapters 12 and 20 for the cases p = 1 and p unrestricted respectively). Because we have restricted the dynamics to finite order p the canonical factors are polynomial of degree p and a fast iterative method of factorisation based upon the policy improvement algorithm is still available; see Chapter 18. The continuous time results are analogous, with, for example, d having a representation A(~) and its conjugate .91 a representation A(-~) T_ The canonical factors have the characterisation given in Exercise 3.3.
6 THE CONCLUSIONS IN TRANSFER FUNCTION TERMS The analysis of Sections 3 and 5 is almost in transfer function terms as it stands; completion of the view raises some interesting points. Consider the continuous-time version of system (20): (36) This then constitutes a filter (the optimal filter, as a transformation from input (to output .6.) with transfer function (sr 1• This conclusion seems so simple that one wonders whether the subtleties of canonical factorisation etc. were necessary. However, they were indeed so. For one thing, the symmetry of implies that the system (36) is unstable as a forward dynamic system, and that the corresponding filter cannot be both causal and stable. That it is not causal is, of course, because the optimal control has a feed-forward component, anticipating the effect of future disturbances. The stable inversion must be of the form .6.(t)
=
1:
g(r)((t- r) dr
(37)
where g( T) is determined from the Fourier inversion g(r)
= -21
1f
!
00
· dr. (iwr 1e-•wr
(38)
-00
(i.e. by taking the contour of integration as the imaginary axis ins-space when inverting the Laplace transform; cf. Theorem 4.2.2(ii)). The transient response
124
STATIONARY RULES AND DIRE CT OPTI MISATION
g( r) for r > 0 (r < 0) is mad e up of contributions from the poles of the integrand corresponding to zeros of I~P(s)\ in the left (right) half of the complex plane. It is the separation of these two sets of sing ularities which corresponds to the canonical factorisation of IP. The equivale nt of a canonical factorisation cann ot be avoided, and remains to be faced in the appa rently simple inversion (37). The second poin t is that the filter view of relation (36) implies that it holds for all time r, whereas we suppose that it hold s only for r ~ t, where tis the 'now' of the optimiser. Tha t is, in order to develop the optimal control rule in closed-loop form, we assume that optimisation of cont rol from time t does not presume optimality of control before time t. This is the reason why the set of optimality conditions (36) is inverted only partially, to ti1+(.@)6 =
where the vector 6 on the left can be rega rded as system error, the quantity that one would wish to tend to zero with the pass age of time. If the optimal control has been used at all times and xc(r )
= LWj esir j
then the contribution to system erro r of the -til( sjri [
(39)
jth term is
~
A(sj)Wj
l
eSJT.
We can thus assert Theorem 6.6.1 The process com man d sign al (39) will ultimately be followed with zero error by the optimally controlled proc ess if A(sj )wi = 0 for all j such that Re(sj) ~ 0.
This of course is simply a restatement of the conclusion already reached in Section 2.9: that there will be zero offset in the limit if all unstable modes of the com man d signal satisfY the uncontrolled plan t equation.
6 THE CONCLUSIONS IN TRANSFER FUNCTION TERMS
125
The conclusion extends to the case when only certain functions of x are subject to command. Suppose that only the linear combination Dxc is prescribed, and has the form
Dxc(r)
=
L>iesir.
(40)
j
Theorem 6.6.2 The partial process command signal (40) will ultimately be followed with zero error by the optimally controlled process iffor each j such that Re(s1) ~ 0 onecanfinda vectorw1such that A(s1)w1 = 0, Dw1 = v1. This corresponds to a completion of~ to a form (39) which is consistent with prescription (40) and with zero limiting tracking error. Let us calculate what the asymptotic offsets would be in the case of constant xc, if and zero d. The asymptotic values of x and u are then determined by
[1 Here the matrix in the left-hand member is just (O), so that Ao and Bo are the absolute terms in A(s) and B(s). Formulae condense considerably if we adopt a notation which implies a viewpoint which we shall increasingly see as natural. Suppose that we lump
process and control variables into a single vector, x = [ ~] , the system vector. In terms of this we can write the last equation system as
(41) Now, if (s) is to have a canonical factorisation then it cannot be singular on the imaginary axis, so that the matrix in (41), identifiable with (O), is certainly nonsingular.
Theorem 6.6.3
The equilibrium rate ofoffset cost under optimal control is
! (x- xc) T9t(x- xc) = !xcT (9\- 9tP9t)xc
(42)
where Pis the top left matrix in the partitioned inverse ofthe matrix ofsystem (41). Jf9t is non-singular then this reduces to
We leave verification to the reader. Expression (42) involves xc only in the combinations m~ and ~T m~, as it must if only certain components of xc have
126
STATIONARY RULES AN D DIR ECT OPTIMISATION
been specified and deviations from them costed. If, on the other han d, 9t is nonsingular, then all components of r have been specified. We see tha t expression (44) is zero if ~r = 0, which is the familiar statement tha t the constant values x = r and u = lf should sati sfy the plant equation.
7 DIRECT TRAJECTORY OP TIMISATION FOR THE INPU TOUTPUT FORMULATION: NO TES OF CAUTION
A higher-order model such as (33) can be regarded as a red uction of a statestructured model, but it is still a full dynamic mo del It is natura l to ask whether these Lagrangian techniques of trajectory optimisation stil l apply for the ultimate reduction; the case wh en the plant is simply specified by its transfer function, so tha t one has an inp ut- out put specification rather tha n a dynamic specification. In fact, this is not a good idea. An optimisation by any metho d of the problem in this form is in fact fraught with subtle complications, and these reveal themselves in a Lagrangian atta ck. One basic reason is tha t the significant physical variable of the proble m, the process variable, is no longer directly observable in general, and this lack of direct observation has con seq uences which we can list as follows. (i) The calculation in effect trie s to achieve two objects at the same time: to determine what the optimal control rule would have been had the process variable been known and to rec onstruct (estimate) this variable optimally from the observations. Th e whole ana lysis is much simplified, bot h in concept and in execu~ion, if one reverts to a process description, wh en these two aspects separate clearly and explicitly. (ii) In the inp ut- out put descrip tion the plant equation has alre ady been 'solved', but the point of a Lagrangian approach is to avoid prematur e elimination of variables (which is what solutio n amounts to1 and certainly to avoid solution of an equation which, in isolation, is unstable. Plant instabilities gre atly complicate the direct optimisation of traject ory if plant has been specified in inp ut- out put {orm. (iii) The optimal control rule ded uced und er an inp ut- out put spe cification can well be open-loop, and so sensiti ve to perturbation or mis-speci fica tion. (iv) The factorisation algori thm based upon policy imp rovement is straightforward only for polyno mial transfer functions, i.e. for systems expressed by finite-order dynamics. Some analysis will bea r out the se points. Th e analysis will pro pel us somewhat ahead of ourselves, in tha t it will indicate the necessity for the statistical treatment ofunobservables, but tha t is no bad thing. We ado pt the for mulation of Section 4.10:
\[
\Delta = G_{11}\zeta + G_{12}u, \tag{44}
\]
\[
y = G_{21}\zeta + G_{22}u. \tag{45}
\]
Here ζ subsumes all exogenous signals, i.e. inputs to the system from outside such as command signal, process noise and observation noise. The signal y is the observable plant output and Δ the deviation which is to be minimised, in the sense that one minimises a total cost ½Σ_τ Δ_τᵀℜΔ_τ. (We revert to discrete time for simplicity.) Consider then the minimisation of the Lagrangian form
\[
\sum_\tau\bigl[\tfrac12\Delta_\tau^T\Re\Delta_\tau + \mu_\tau^T(y - G_{21}\zeta - G_{22}u)_\tau\bigr] + \text{end terms} \tag{46}
\]
with expression (44) substituted for Δ. Here the multiplier at time τ has been denoted by μ_τ rather than λ_τ. This is because relation (45) is an observation relation rather than a plant equation: the plant equation has been 'lost' in the input-output specification. The 'end terms' comprise terminal cost and also an initial term reflecting beliefs about initial values. Expression (46) is a form in the variables ζ, u, y and μ. Let us suppose that at time t observations y_τ are available for τ < t. We suppose that the only information on the system input ζ is that derived from the observations plus a statistical characterisation which is incorporated in the penalty function ½Σ_τ Δ_τᵀℜΔ_τ. (We shall see from Chapter 12 and even more clearly from Chapter 16 that it can be so incorporated.) The form (46) then has to be extremised with respect to the values of all variables which are unobserved and all decisions which are unmade at time t. For τ ≥ t one can extremise out y_τ, leading to the conclusion that μ_τ = 0 and so to the equation system
\[
\begin{bmatrix} \bar G_{11}\Re G_{11} & \bar G_{11}\Re G_{12}\\ \bar G_{12}\Re G_{11} & \bar G_{12}\Re G_{12}\end{bmatrix}
\begin{bmatrix} \zeta\\ u\end{bmatrix}^{(t)}_\tau = 0 \qquad (\tau \geq t), \tag{47}
\]
(Ḡ denoting the operator conjugate to G)
derived by extremisation with respect to ζ_τ and u_τ. Here the superscript (t) indicates that all variables in the vector are optimised values based on information at time t. So ζ_τ^{(t)} is the effective estimate of ζ_τ based on information at time t. For τ < t the variables y and u are known and so cannot be extremised out. One then has the equation system
\[
\begin{bmatrix} \bar G_{11}\Re G_{11} & -\bar G_{21}\\ -G_{21} & 0\end{bmatrix}
\begin{bmatrix} \zeta\\ \mu\end{bmatrix}^{(t)}_\tau
+ \begin{bmatrix} 0 & \bar G_{11}\Re G_{12}\\ I & -G_{22}\end{bmatrix}
\begin{bmatrix} y\\ u\end{bmatrix}_\tau = 0 \qquad (\tau < t), \tag{48}
\]
derived by extremisation with respect to ζ_τ and μ_τ. This is essentially concerned with the estimation of past ζ; the equation system (47) with the combined optimisation of future u and prediction of future ζ. We say 'essentially' because the two systems are coupled by the occurrence of common variables around the present time t.
One presumes that a canonical factorisation of the operator for each equation system will achieve the same kind of reduction that we achieved in the passage from (26) to (27). However, this presumption is in general mistaken. Consider, for simplicity, the situation in which the whole course of ζ is in fact known. The only optimality condition is then the stationarity condition for present and future u:
\[
\bar G_{12}\Re(G_{11}\zeta + G_{12}u)^{(t)}_\tau = 0 \qquad (\tau \geq t). \tag{49}
\]
Let us specialise even further: to the simplest version of the regulation problem of Section 2.4, supposing that all variables are scalar, and that S is zero. There is no system input, so ζ is absent, and the components of Δ (i.e. the signals which enter the cost function ΔᵀℜΔ = Rx² + Qu²) are u itself and x = (1 − A𝒯)⁻¹B𝒯u. This last equation expresses the plant in input-output form, and minimisation of Σ_τ(Rx² + Qu²)_τ with x thus expressed leads to the version of (49) for this case:
\[
\Bigl[\,Q + \frac{RB^2}{(1 - A\mathcal{T})(1 - A\mathcal{T}^{-1})}\Bigr]u_\tau = 0 \qquad (\tau \geq t).
\]
If we write the left-hand member as (Φ(𝒯)u)_τ then Φ can be expressed
\[
\Phi(z) \propto \frac{(1 - \Gamma z)(1 - \Gamma z^{-1})}{(1 - Az)(1 - Az^{-1})}
\]
where Γ = A + BK is the optimal gain factor determined by (2.36). Now, the recursion in u which corresponds to the optimal control rule, u_τ = Kx_τ = K(1 − A𝒯)⁻¹B𝒯u_τ, is
\[
\Bigl[\frac{1 - \Gamma\mathcal{T}}{1 - A\mathcal{T}}\Bigr]u_\tau = 0. \tag{50}
\]
But the operator on u in this equation is the canonical factor of Φ(𝒯) only if |A| < 1, i.e. if the plant is stable. It cannot be assumed, then, that division of relation (49) by the 'future' canonical factor of the operator on the left-hand side will reduce it to the correct forward recursion for u. Recursion (50) would in any case be impracticable in the case of unstable plant. It is then itself an unstable recursion, which would be upset by the least rounding error in specification of past u, let alone by misspecification of the model; see Exercise 1. That the cases of stable and unstable plant differ qualitatively is well recognised in the literature (see e.g. Zames, 1981). It is sometimes coped with by decomposing the controller into an arbitrary stabiliser followed by an optimising compensator.
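The factorisation statement above is easy to check numerically. The following sketch is our own illustration, with arbitrarily chosen scalar values of A, B, Q and R (the plant deliberately unstable); it computes the optimal gain K by iterating the scalar Riccati equation and confirms that Γ = A + BK is precisely the root inside the unit circle of the numerator of Φ(z).

```python
# Check that Gamma = A + BK is the stable root of the numerator of
#   Phi(z) = Q + R B^2 / ((1 - A z)(1 - A z^{-1})).
# Illustrative scalar values; the plant is deliberately unstable.
import numpy as np

A, B, Q, R = 2.0, 1.0, 1.0, 1.0

# Scalar discrete-time Riccati equation, solved by iteration:
#   P = R + A^2 P - (A B P)^2 / (Q + B^2 P)
P = R
for _ in range(200):
    P = R + A * A * P - (A * B * P) ** 2 / (Q + B * B * P)
K = -A * B * P / (Q + B * B * P)     # optimal gain, u = K x
Gamma = A + B * K                    # closed-loop factor

# Numerator of Phi(z):  Q(1 + A^2) + R B^2 - Q A (z + z^{-1}),
# which factorises as  const * (1 - Gamma z)(1 - Gamma z^{-1});
# hence Gamma solves  Q A g^2 - [Q(1 + A^2) + R B^2] g + Q A = 0.
g = np.roots([Q * A, -(Q * (1 + A * A) + R * B * B), Q * A])
print(Gamma, g[abs(g) < 1][0])       # both ~0.38197 for these values
```

For |A| > 1, as here, the denominator factor (1 − Az) is unstable, which is exactly the circumstance in which recursion (50) fails.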
We shall return to direct trajectory optimisation in Chapters 18-21, but with the supposition that the plant is specified by explicit dynamic equations (possibly stochastic and not necessarily state-structured) rather than by an input-output relation. It will turn out that the canonical factorisation treatment is then, as in Section 5, direct and problem-free.

Exercises and comments

(1) Note that a control rule deduced by any argument from (49) (such as the correct rule (50)) would in any case be an open-loop rule, and so not robust to perturbation or misspecification. The supposition, explicit or implicit, that ζ is known and that the model assumed is correct leads to the conclusion that feedback is unnecessary. The control rule properly deduced from (47) will have some element of feedback, because of the assumption that ζ is unknown and must be estimated (indeed, re-estimated at each stage). This in turn implies a more robust rule; see Chapter 17.

Notes on the literature

Canonical factorisations of operators have been used in many contexts, notably that of Wiener-Kolmogorov prediction theory (see Section 13.4). However, their use for trajectory optimisation in the precise sense of Sections 3 and 5 is rather different; the first work on this line known to the author is that of Hagander (1973). The author (Whittle, 1983b) used the technique for the optimisation of both estimation and stochastic control, assuming state structure but allowing the feature of risk-sensitivity which we shall consider in Chapter 16. This work was extended to dynamics of general order in Whittle and Kuhn (1986) and further developed in Whittle (1990a). The approach leads one inevitably to a 'system' approach in which (x, u) is seen as a joint variable constrained by the plant equation. Willems (1991, 1992, 1993) comes to the same view when he develops his 'behavioural' approach. In our notation for the continuous-time case, he asserts that the system is controllable if and only if 𝔄(s) has the same rank for all complex s, and that a certain reduction of
CHAPTER 7
The Pontryagin Maximum Principle

The Pontryagin maximum principle states a method of direct trajectory optimisation which we have in fact already seen in two contexts and by two approaches. These are the approaches which yield both a rapid development of the formalism and a feeling for its meaning. In Section 2.10 we considered perturbations from an optimal trajectory and, in the continuous-time case, deduced the first-order relations (2.76) and (2.77) by dynamic programming methods. These can be regarded as necessary conditions for optimality of the trajectory, at least when the derivatives invoked exist. Then, in the last chapter, we found Lagrangian methods efficacious for a direct optimisation of trajectory. This was for the LQ case, but perhaps generalises.
The maximum principle was first seen as replacing the classical calculus of variations, which it superseded because it could deal with cases for which the optimal control turned out to be discontinuous. One finds such discontinuous controls in the 'bang-bang' control of the Bush problem (Section 6), rocket thrust programming (Section 7) and even our fishing problem (Section 2.7). The principle also revealed an attractive formalism, a Hamiltonian structure analogous to that of classical mechanics. The reason for this is that a Hamiltonian structure appears whenever an incomplete specification of dynamics is supplemented by an extremal principle. In the control case the dynamical specification is incomplete because the plant equation determines process dynamics for given control values, but gives no guide for the determination of these control values. The guide is supplied by the extremal principle of cost minimisation.
A fully rigorous derivation of the principle would be, at present, both lengthy and unattractive. We shall rather adopt a heuristic approach which reveals the structure very quickly and which provides a machinery for the quick generation, in any particular case, of the assertions which one may expect to hold. There is one interesting point, however. Versions of the principle hold in both discrete and continuous time, but in fact under milder conditions in the continuous-time case, for reasons explained in Section 1. Because of this, and because of the greater importance of the continuous-time case in the control context, we shall devote most of the treatment to this case.
1 THE PRINCIPLE AS A DIRECT LAGRANGIAN OPTIMISATION

We saw in the last chapter how effective direct trajectory optimisation was for the LQ model, especially in the infinite-horizon limit. It is then natural to ask whether the same methods could not be carried over to other models. That is, consider a discrete-time model with the state structure implied by the plant equation (2.2) and the cost function (2.3) and with a vector-valued state variable. Regard the plant equation at time τ as a constraint on the {x, u} path, with which can be associated a vector Lagrangian multiplier λ_τ. The Lagrangian form
\[
\sum_{\tau=0}^{h-1}\bigl[c(x_\tau, u_\tau, \tau) + \lambda_\tau^T\{x_\tau - a(x_{\tau-1}, u_{\tau-1}, \tau)\}\bigr] + C_h(x_h) \tag{1}
\]
should then be extremised with respect to the {x, u, λ} path, for given initial conditions. We have hitherto supposed the horizon point h prescribed, with h = +∞ as a limit case. It is conceivable, however, that the Lagrangian approach could be valid under other stopping rules; e.g. that h is the first time at which x enters some set. For instance, the flight of an aircraft terminates at the moment when it comes to rest on the ground, which may be at a time and in a manner unscheduled.
The formalism that one derives this way is exactly the formalism of the so-called 'discrete maximum principle', the maximum principle in a discrete-time formulation. However, the conclusions are correct under only quite restrictive conditions, because the strong form of Lagrangian methods to which we wish to appeal is valid only under such conditions. Briefly, we would like to assert that necessary and sufficient conditions for optimality of the {x, u} path are that the Lagrangian form (1) should be minimal with respect to {x, u} and maximal with respect to {λ}. This assertion is certainly true under the following hypotheses: that the set of permitted paths {x, u} is convex, that c is convex in its x, u arguments, that a is linear in its x, u arguments, that the horizon is fixed and finite, plus growth conditions ensuring boundedness of optimal path and cost. These hypotheses were all satisfied for the LQ model of the last chapter, at least so long as the horizon was held finite. However, if they are weakened then one can be sure of nothing without further investigation.
The continuous-time analogue of the Lagrangian form (1) would be
\[
\int_0^h \bigl[c(x, u, \tau) + \lambda^T\{\dot x - a(x, u, \tau)\}\bigr]\,d\tau + C(x(h), h) \tag{2}
\]
in the notation of Section 2.6. One now has a continuous infinity of variables, and so would expect additional reasons for possible failure of the Lagrangian approach. Greater care is indeed necessary, but the continuous-time case presents one significant simplification. Suppose that the control variable u is also vector-valued, but with its values restricted to a set 𝒰 which may very well not be
convex. However, by varying u rapidly relative to the rate at which x is changing one can effectively achieve any value of u which is an average of values in 𝒰. That is, 𝒰 is effectively replaced by its convex hull, a convex set. This is the intuitive content of the so-called 'chattering lemma'. This differing behaviour manifests itself in that the dynamic programming equations yield a version of the maximum principle in continuous time which can only be equalled in strength in discrete time if one makes restrictive assumptions. We shall then associate the principle almost entirely with the continuous-time case. The LQ model remains the outstanding example of a case in which the discrete-time maximum principle is valid and useful; we indicate a couple of others in the exercises.

Exercises and comments

(1) Economic growth Consider the dynamic allocation problem of Section 3.4, where x_t is the 'activity vector' at time t, the vector of intensities at which the various activities are pursued. Thus x_t ≥ 0, and resource limitations enforce the plant equation Ax_t ≤ b + Bx_{t-1}. Suppose that utility is linear: that one wishes to maximise Σ_t cᵀx_t, where the kth component of c is the rate at which activity k delivers utility. Application of Lagrangian techniques is valid; the Lagrangian form would be
\[
\sum_{t=1}^{h}\bigl[c^T x_t + \lambda_t^T(b + Bx_{t-1} - Ax_t - z_t)\bigr].
\]
Here z_t ≥ 0 is the margin of inequality in the plant equation: the vector of amounts of resources unused at stage t. The multiplier has the familiar price interpretation; the jth element of λ_t is the effective unit price of resource j at time t. Maximisation with respect to z_t yields the conclusion λ_t ≥ 0, with equality in those components for which the corresponding component of z_t is positive. Maximisation with respect to x_t yields c + Bᵀλ_{t+1} − Aᵀλ_t ≤ 0, with equality in those components for which the corresponding component of x_t is positive.
If the system is self-sufficient, in that it can maintain itself in the absence of external supplies, then, at times remote from both the beginning and the end of the optimisation period, it settles on to a maximal growth path (the turnpike of the economists), for which x_t is of the form ρᵗx̄ for some fixed x̄ and some ρ ≥ 1. The maximal growth rate ρ and the direction x̄ of the optimal path are determined by
\[
\rho = \max_{x \geq 0}\,\min_{\lambda \geq 0}\;\frac{\lambda^T B x}{\lambda^T A x}, \tag{3}
\]
so that ρ is just the maximal root of |ρA − B| = 0.
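Numerically, ρ can be read off as the largest root of the determinantal equation, i.e. as an eigenvalue of A⁻¹B. A minimal sketch, with two-activity matrices invented purely for illustration:

```python
# Maximal growth rate rho = largest root of |rho A - B| = 0, found as
# an eigenvalue of A^{-1} B.  Matrices invented for illustration only.
import numpy as np

A = np.array([[1.0, 0.2],        # resources absorbed per unit activity
              [0.3, 1.0]])
B = np.array([[1.2, 0.1],        # resources produced per unit activity
              [0.2, 1.3]])

vals, vecs = np.linalg.eig(np.linalg.solve(A, B))
i = np.argmax(vals.real)
rho = vals.real[i]
x_bar = vecs[:, i].real
x_bar = x_bar / x_bar.sum()      # turnpike direction, normalised
print(rho, x_bar)
```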
The von Neumann-Gale model (see e.g. Gale, 1960, 1967, 1968) generalises this in that it relaxes the assumptions of linear dependence of consumption, production and utility on x. Convexity assumptions must be retained, however, or the two extremal operations in the analogue of (3) no longer commute, and the concept of an optimal growth rate has to be qualified.

(2) Optimal dosage Consider the plant equation x_t = Ax_{t-1} + u_t in scalar variables, where the controls u are to be chosen to minimise Σ_t (x_t − x_t^c)² subject only to u ≥ 0. This would then be an LQ tracking problem but for the fact that control costing has been replaced by the positivity condition u ≥ 0. One might regard x as the concentration of a drug in a patient's body, attenuating at rate A in the absence of further administration, but maintained by dosage u. The sequence {x_t^c} is the desired concentration profile. (It is convenient to make a couple of temporary changes of convention: we write u_τ rather than u_{τ-1} in the plant equation, and shall change the sign of λ.) The Lagrangian form is
\[
\sum_{t=1}^{h}\bigl[\tfrac12(x_t - x_t^c)^2 + \lambda_t(Ax_{t-1} + u_t - x_t)\bigr].
\]
One then finds the conditions λ_t ≥ 0, with equality if u_t > 0, and
\[
0 \leq \lambda_t = \sum_{j\geq 0} A^j x_{t+j} - \sum_{j\geq 0} A^j x^c_{t+j} = X_t - X^c_t \qquad (1 \leq t \leq h),
\]
say, so that X_{h+1} = X^c_{h+1} = 0. Note then that
\[
(1 - A\mathcal{T})(1 - A\mathcal{T}^{-1})X_t = u_t \geq 0.
\]
Complete the argument to show that, if we define dif X_t = (1 − A𝒯)(1 − A𝒯⁻¹)X_t = (1 + A²)X_t − AX_{t-1} − AX_{t+1}, then the optimal solution corresponds to the sequence {X_t} which is minimal subject to the conditions
\[
X_t \geq X^c_t, \qquad \mathrm{dif}\,X_t \geq 0 \qquad (1 \leq t \leq h).
\]
The course of X corresponding to the optimal solution will consist of 'free' segments for which dif X_t = 0 and X_t ≥ X_t^c (at which no dose is administered), interspersed by points at which X_t = X_t^c and u_t = dif X_t > 0 (at which a dose is administered).
One can be more explicit about the algorithm. Denote the times t at which X_t = X_t^c by t₁, t₂, ..., going backwards in time. Then t₁ = h + 1, when X_{h+1} = 0.
The solution in the range t_{i+1} ≤ t ≤ t_i has the 'free solution' form X_t = c_{1i}Aᵗ + c_{2i}A⁻ᵗ. The coefficients c are to be chosen so that X_t = X_t^c for t = t_i, t_{i+1}. Once t₁, t₂, ..., t_i have been determined, then t_{i+1} is determined as the smallest value t (≥ 1) such that the free solution agreeing with X^c at t and t_i is not smaller than X^c at any intermediate point. If this value of t seems to be t = 1 then indeed it is, provided that the value of x₀ is such that dif X₁ > 0, i.e. such that medication should begin immediately. If, on the other hand, dif X₁ ≤ 0, then medication begins first at t_i.
We have used the notation dif to indicate potential generalisations of the argument. The problem has something in common with the production problem of Section 4.3. In the case A = 1 it reduces to the so-called 'monotone regression' problem, in which one tries to approximate a sequence {x_t^c} as closely as possible by a non-decreasing sequence. In this case X_t is the smallest concave sequence which exceeds X^c, and its 'free segments' are straight lines.
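For the A = 1 case, the monotone-regression fit just mentioned can be computed by the classical pool-adjacent-violators procedure. The sketch below is a generic implementation, not anything given in the text:

```python
# Pool-adjacent-violators: least-squares approximation of a sequence
# by a non-decreasing one (the A = 1 case of the dosage problem).
def monotone_regression(y):
    blocks = [[y[0], 1]]                 # each block stores [sum, count]
    for v in y[1:]:
        blocks.append([v, 1])
        # merge while adjacent block means violate monotonicity
        while len(blocks) > 1 and \
                blocks[-2][0] * blocks[-1][1] > blocks[-1][0] * blocks[-2][1]:
            s, n = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += n
    fit = []
    for s, n in blocks:
        fit.extend([s / n] * n)
    return fit

print(monotone_regression([1.0, 3.0, 2.0, 2.0, 5.0]))
# -> [1.0, 2.33..., 2.33..., 2.33..., 5.0]
```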
2 THE PONTRYAGIN MAXIMUM PRINCIPLE

The maximum principle (henceforth abbreviated to MP) is a direct optimality condition on the path of the process. It is a calculation for a fixed initial value x of state, whereas the DP approach is a calculation for a generic initial value. It can be regarded as both a computational and an analytic technique (and in the second case will then solve the problem for general initial value). The proof of the fact that derivatives etc. exist in the required sense is a very technical and lengthy matter, which we shall not attempt. It is much more important to have a feeling for the principle and to understand why it holds, coupled with an appreciation that caution may be necessary. We shall give a heuristic derivation based upon the dynamic programming equation, which is certainly the most direct and enlightening way to derive the conclusions which one may expect to be valid.
A conjugate variable p will make its appearance. This corresponds to the multiplier vector λ, the identification in fact being p = λᵀ, so that p is a row vector. The row notation p fits in naturally with gradient and Hamiltonian conventions; the column notation λ is better when, as in equation (6.20), we wish to write all the stationarity conditions as a single equation system. We shall refer to p as either the 'conjugate variable' or the 'dual variable'. Note the conventions on derivatives listed in Appendix 1: in particular, that the vector of first derivatives of a scalar variable with respect to a column (row) vector variable is a row (column) vector.
Consider first a time-invariant formulation. The state variable x is a column vector of dimension n; the control variable u may take values in a largely arbitrary set 𝒰. We suppose plant equation ẋ = a(x, u), instantaneous cost function c(x, u), and that the process stops when x first enters a prescribed stopping set 𝒮, when a terminal cost 𝕂(x) is incurred. The value function F(x) then obeys the dynamic programming equation
\[
\inf_u\,(c + F_x a) = 0 \qquad (x \notin \mathcal{S}), \tag{5}
\]
with the terminal condition
\[
F(x) = \mathbb{K}(x) \qquad (x \in \mathcal{S}). \tag{6}
\]
The derivative F_x may well not exist if x is close to the boundary of a forbidden region (within which F is effectively infinite) or even if it is close to the boundary of a highly penalised but avoidable region (when F will be discontinuous at the boundary). We have already seen examples of this in Exercise 2.6.2 and shall see others in Section 10. However, let us suppose for the moment that x is on a free orbit, on which any perturbation δx in position changes F only by a term F_x δx + o(δx). Define the conjugate variable
\[
p = -F_x \tag{7}
\]
(a row vector, to be regarded as a function of time p(t) on the path) and the Hamiltonian
\[
H(x, u, p) = pa(x, u) - c(x, u) \tag{8}
\]
(a scalar, defined at each point on the path as a function of current x, u and p).
*Theorem 7.2.1 (The Pontryagin maximum principle on a free orbit; time-invariant version)
(i) On the optimal path the variables x and p obey the equations
\[
\dot x = H_p \;[\,= a(x, u)\,] \tag{9}
\]
\[
\dot p = -H_x \tag{10}
\]
and the optimal value of u(t) is the value of u maximising H[x(t), u, p(t)].
(ii) The value of H is identically zero on this path.
*Proof Only assertions (9) and (10) need proof; the others follow from the dynamic programming equation (5) and the definition (7) of p. Assertion (9) is obviously valid. To demonstrate (10), write the dynamic programming equation in incremental form as
\[
F(x) = \inf_u\,[c(x, u)\,\delta t + F(x + a(x, u)\,\delta t)] + o(\delta t). \tag{11}
\]
Differentiation with respect to x yields
\[
-p(t) = c_x\,\delta t - p(t + \delta t)[I + a_x\,\delta t] + o(\delta t)
\]
whence (10) follows. ∎
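A tiny numerical illustration of the theorem, for the scalar problem ẋ = u with c(x, u) = ½(x² + u²) (our own choice of example): here H = pu − ½(x² + u²) is maximised by u = p, and integrating (9) and (10) from a point with p = −F_x = −x keeps H at zero along the orbit.

```python
# Integrate (9)-(10) for the scalar example xdot = u with
# c(x, u) = (x^2 + u^2)/2, so H = p u - (x^2 + u^2)/2, maximised
# by u = p.  On the optimal orbit F(x) = x^2/2 and p = -F_x = -x,
# and H should stay at zero along the whole path.
x, p, dt = 1.0, -1.0, 1e-4
for step in range(50_001):
    u = p                                    # maximiser of H
    H = p * u - 0.5 * (x * x + u * u)
    if step % 10_000 == 0:
        print(f"t = {step*dt:4.1f}   x = {x:8.5f}   H = {H:9.2e}")
    x, p = x + dt * p, p + dt * x            # Euler step of (9)-(10)
```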
The fact that the principle is such an immediate consequence of the dynamic programming equation may make one wonder what has been gained. What has been gained is that, instead of having to solve the partial differential equation (5) (with its associated extremal condition on u) over the whole continuation set, one has now merely to solve the two sets of ordinary differential equations (9) and (10) (with the associated extremal condition on u) on the orbit.
These conditions on the orbit are indeed those which one would obtain by a formal extremisation of the Lagrangian form (2) with respect to x, u and λ, as we leave the reader to verify. Note that the equations (9) and (10) demand only stationarity of the Lagrangian form with respect to the λ- and x-paths, whereas the condition with respect to u makes the stronger demand of maximality. It is (9) and (10) which one would regard as characterising Hamiltonian structure; they follow by extremisation of an integral ∫[pẋ − H(x, p)] dt with respect to the (x, p) path.
A substantial question, which we shall defer to the next section, is that of the terminal conditions which hold when x encounters 𝒮. Let us first transfer the conclusions above to the time-dependent case, when a, c, 𝒮 and 𝕂 may all be t-dependent. The DP equation for F(x, t) will now be
\[
\inf_u\,(c + F_t + F_x a) = 0 \tag{12}
\]
outside 𝒮, with F(x, t) = 𝕂(x, t) for (x, t) in 𝒮. However, we can reduce this case to a formally time-invariant case by augmenting the state variable x by the variable t. We then have the augmented variables
\[
x \to (x, t), \qquad p \to [\,p \;\; p_0\,], \tag{13}
\]
where the scalar p₀ is to be identified with −F_t. However, we shall still preserve the same definition (8) of H as before, so that, as we see from (12), the relation
\[
H + p_0 = 0 \tag{14}
\]
holds on the optimal orbit.

Theorem 7.2.2 (The Pontryagin maximum principle on a free orbit) (i) The assertions of Theorem 7.2.1 (i) still hold, but equation (10) is now augmented by the relation
\[
\dot p_0 = -H_t. \tag{15}
\]
(ii) H + p₀ is identically zero on an optimal orbit. Suppose the system time-homogeneous, in that a and c are independent of t. Then H is constant on an optimal orbit.
Proof All assertions save the last are simple translations of the assertions of Theorem 7.2.1. If a and c are independent of t then we see from (15) that p₀ is constant on an optimal orbit, whence the final assertion follows. ∎
However, the essential assertions of the maximum principle are those expressed in Theorem 7.2.1 (i) which, as we see, transfer to the time-dependent case unchanged. Note that H is now a function of t as well as of x(t), u(t) and p(t).

Exercises and comments

(1) As indicated above, one can expect H to be identically zero on an optimal orbit when the process is intrinsically time-invariant and the total cost F(x) is well-defined. The case of a scalar state variable is then particularly amenable. By eliminating p from the two relations, that H is identically zero and that it is maximal with respect to u, one derives the optimal control rule in closed-loop form.

(2) Suppose that the process is time-invariant and has a well-defined average cost γ. The total future cost is then F(x, t) = f(x) − γt plus an 'infinite constant' representing a cost of γ per unit time in perpetuity. We thus have H = −p₀ = F_t = −γ, so that the constant value of H can be identified with the average reward rate −γ. In the scalar case the optimal control rule can be determined, at least implicitly, from H + γ = 0 and the u-maximality condition. The equilibrium point is then determined from H_x = H_p = 0; γ can then be evaluated as the value of −H at this equilibrium point.

3 TERMINAL CONDITIONS

The most obvious example of departure from a free orbit is at termination of the path on the stopping set. Since the path is continuous, there is the obvious matching condition: that the terminal point is the limit point along the path. However, if one may vary the path so as to choose a favourable terminal point, then there will also be optimality conditions. The rigorous statement of these terminal conditions can be quite difficult if one allows rather general stopping sets and terminal costs. We shall give only the assertions which follow readily in the most regular cases.
However, even more difficult than termination is the case when parts of state space are forbidden, so constraining the path. (For example, an aircraft may be compelled to avoid obstacles, or an industrialist may not be allowed to incur debt, even temporarily.) In such a case the optimal path must presumably skirt the boundary of the forbidden region for a time before resuming a free path. The special conditions which hold on entry to and exit from such restricted phases are termed transversality conditions; we shall consider them in Section 11.
Consider first the fully time-invariant case. One then has the terminal condition F(x) = 𝕂(x) for x in the stopping set 𝒮. However, one can appeal to
this as a continuity condition, that F(x) → 𝕂(x̄) as x (outside 𝒮) approaches x̄ (inside 𝒮), only if x̄ is the optimal termination point for some free trajectory. Obviously x̄ must lie on the boundary ∂𝒮 of 𝒮, since motion is continuous. However, we shall see from the examples of the next section that there may be points on this boundary which are so costly that they are not optimal termination points for any trajectory terminating in 𝒮. Let 𝒮_opt denote the set of possible optimal termination points.
Partial integration of the Lagrangian expression for cost minimisation throws it into the form
\[
\int_0^{\bar t}(p\dot x - H)\,d\tau + \mathbb{K}(\bar x)
= -\int_0^{\bar t}(\dot p x + H)\,d\tau + \bar p\bar x + \mathbb{K}(\bar x) - p(0)x(0) \tag{16}
\]
where the overbar indicates terminal values. Let a be a direction from x̄ into 𝒮_opt, in that there is a value x̄_ε = x̄ + εa + o(ε) which lies in 𝒮_opt for all small enough values of the positive scalar ε. If x̄ is an optimal termination point for the trajectory under consideration then we deduce from (16) that p̄x̄ + 𝕂(x̄) ≤ p̄x̄_ε + 𝕂(x̄_ε). In the limit of small ε this yields
\[
(\bar p + \mathbb{K}_x)\,a \geq 0, \tag{17}
\]
where the derivative 𝕂_x is evaluated at x̄.

*Theorem 7.3.1 Let x̄ and p̄ be the terminal values of state and dual variable on an optimal trajectory for a time-invariant problem. Then the terminal optimality condition (17) holds for all directions a into 𝒮_opt from x̄. If x̄ is interior to 𝒮_opt, the derivative 𝕂_x is continuous at x̄ and a tangent plane to 𝒮 exists at x̄, then (17) can be strengthened to
\[
(\bar p + \mathbb{K}_x)\,a = 0 \tag{18}
\]
for all directions a in the tangent plane. This can be expressed: the vector (p̄ + 𝕂_x)ᵀ is normal to the boundary of 𝒮 at x̄.

The strengthening of (17) to (18) follows because, under the conditions stated, the inequality (17) holds for all directions in the tangent plane, and in particular holds for −a if it holds for a. If we transfer these conclusions to the problem with state variable ξ = (x, t) then we obtain the generalisation to the time-variable case. Note that 𝒮 will now be a set of ξ values and 𝕂 a function of ξ.
Let (x, t) and (J5,J5o) be the terminal values on an optimal trajec-
(19)
for all directions (a, τ) into 𝒮_opt from (x̄, t̄). If (x̄, t̄) is interior to 𝒮_opt, the derivatives of 𝕂 are continuous there and a tangent plane to 𝒮 exists there, then equality holds in (19) for all (a, τ) in this tangent plane.

4 MINIMAL TIME AND DISTANCE PROBLEMS

It is useful to discuss a problem whose solution is intuitively obvious before tackling a substantial control problem. In this way one gets some feeling for application of the maximum principle, and also for how it should be modified when the orbit encounters a constraint.
The first is an MP treatment of the problem already discussed in Exercise 2.6.3. Suppose that x is the coordinate of a particle in n-dimensional space. The particle moves at speed v(x) but its direction of movement can be chosen at will. The plant equation is thus ẋ = uv(x), where u is a unit vector to be chosen. Suppose that the cost associated with an orbit is the time taken to reach the stopping x-set 𝒮 plus a function 𝕂(x̄) of the terminal coordinate x̄. That is, c = 1 and a terminal cost function 𝕂(x) is defined on 𝒮.
The problem is thus time-invariant with H = puv(x) − 1. Since v(x) is a positive scalar then H is maximal with respect to u when u is chosen in the direction of pᵀ; i.e.
\[
u = \frac{p^T}{|p|}. \tag{20}
\]
On an optimal free orbit p obeys
\[
\dot p = -puv_x = -|p|\,v_x. \tag{21}
\]
The termination condition (18) becomes
\[
\bar p + \mathbb{K}_x \perp \partial\mathcal{S} \tag{22}
\]
(where the termination values are understood). If 𝕂 is constant in 𝒮 then this simply amounts to p̄ ⊥ ∂𝒮; that is, the optimal trajectories meet the boundary of the stopping set orthogonally.
Consider now the particular case when the velocity is constant; we may as well suppose that v = 1. We see then from (21) that p is constant, and so from (20) that an optimal path has a constant direction. That is, optimal free orbits are straight lines. This is no surprise; the minimisation of time now amounts to the minimisation of distance, and the minimal-distance path to 𝒮 is a straight line which meets the boundary of 𝒮 orthogonally. We can now visualise the optimal trajectories very easily, even in constrained cases, if we see them as a string stretched tightly between initial and terminal points. The value function would now have the form
\[
F(x) = \inf_{\bar x}\,\bigl[\,|x - \bar x| + \mathbb{K}(\bar x)\,\bigr]
\]
Figure 1 The minimal-distance path to a stopping set 𝒮, meeting the boundary of 𝒮 normally.
with x̄ of course constrained to 𝒮. Consider the particular case when the stopping set consists of just two points, x′ and x″, say, with zero terminal cost. Then
\[
F(x) = \min\bigl[\,|x - x'|,\; |x - x''|\,\bigr],
\]
and one simply moves in a straight line to the nearer terminal point. The interest of the example is that F_x does not exist at either of the possible terminal points or at points equidistant from x′ and x″. This does not matter at all; the relevant directional derivative always exists in directions in which it would be optimal to move. The loci of discontinuity in F_x correspond to break-points in optimal control; e.g. points at which the optimal direction of movement changes discontinuously with x.
A second example illustrates the effect of a discontinuity in 𝕂(x); the type of behaviour against which the conditions of the last section were intended to guard. Consider a two-dimensional example, x having components x₁ and x₂, with stopping set x₂ = 0 and terminal cost equal to unity for negative x₁ and zero for non-negative x₁. That is, the stopping set is the x₁-axis, and the positive half-axis is preferred to the negative half-axis. The optimal path is a straight line to the nearest point in one of the two half-axes; we leave the reader to show that (for x₂ > 0)
\[
F(x) = \begin{cases} x_2 & (x \in \mathcal{X}_1)\\ \sqrt{x_1^2 + x_2^2} & (x \in \mathcal{X}_2)\\ x_2 + 1 & (x \in \mathcal{X}_3).\end{cases}
\]
Here the three regions 𝒳_i are marked in Figure 2; their boundaries are determined by continuity of F(x). So, in 𝒳₁ (which is just x₁ ≥ 0) one moves straight to the nearest point on the x₁-axis. In 𝒳₂ one still aims for the nearest point on the non-negative x₁-axis, which is the origin. However, if x₁ is sufficiently large and negative for a given value of x₂ (i.e. in 𝒳₃) one gives up the struggle and goes for the nearest termination point, despite the fact that this will now be on the more heavily penalised negative axis.
Figure 2 Optimal paths for termination on the x₁-axis when termination on the negative part of the axis carries unit penalty. Termination on −1 < x₁ < 0 from a starting point off the axis is never optimal.
Figure 3 Minimal-distance paths to 𝒮₂ when 𝒮₁ is effectively forbidden. The free parts of the optimal trajectory are straight lines; otherwise they hug the boundary of 𝒮₁.
Note that the segment −1 < x₁ < 0 of 𝒮 is never entered by an optimal trajectory. Related to this is the fact that both F_x and the action rule are discontinuous on the 𝒳₂/𝒳₃ boundary.
The situation for which a part of state space is forbidden is equivalent to that for which passage into it is so heavily penalised as to be effectively forbidden. So, consider the example of Figure 3, for which the termination set breaks up into two parts: 𝒮₁, which is so heavily penalised that it is effectively forbidden, and 𝒮₂, within which 𝕂 = 0. Optimal paths will take the shortest route to 𝒮₂ which avoids 𝒮₁; this avoidance may take the form of skirting the boundary of 𝒮₁ for a time, as we see from the diagram.
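The three-region structure of the penalised-half-axis example translates directly into a termination rule. A sketch (our own coding of the discussion above, for points with x₂ > 0; the region labels follow the figure):

```python
# Termination on the x1-axis with unit penalty on the negative
# half-axis: compare the three candidate strategies for a point
# (x1, x2) with x2 > 0 and report the cheapest.
from math import hypot

def value_and_region(x1, x2):
    if x1 >= 0:
        return x2, "X1: straight down"
    to_origin = hypot(x1, x2)       # aim at the origin, no penalty
    give_up = x2 + 1.0              # straight down, unit penalty
    if to_origin <= give_up:
        return to_origin, "X2: aim at the origin"
    return give_up, "X3: accept the penalised half-axis"

for pt in [(1.0, 2.0), (-1.0, 2.0), (-5.0, 1.0)]:
    print(pt, value_and_region(*pt))
```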
5 SOME MISCELLANEOUS PROBLEMS

The maximum principle has found extensive application in economics. For what is probably the simplest example, consider the situation of a monopolist who holds an amount x of a non-renewable resource which he releases at a time-
variable rate u. Because he has a monopoly this release rate determines the current unit price of the commodity, p(u). He wishes to maximise the total discounted return
\[
\int_0^\infty e^{-\alpha t}\,u\,p(u)\,dt.
\]
The plant equation is ẋ = −u. We shall take λ as the conjugate variable (a scalar), since we are already using p to denote price. Discounting makes the problem time-dependent. (This time-dependence is removed in a dynamic programming approach only by a renormalisation to present value as time advances.) The instantaneous value of u is such as to maximise the Hamiltonian
\[
H = e^{-\alpha t}\,u\,p(u) - \lambda u. \tag{23}
\]
We have λ̇ = −H_x = 0, so that λ is constant. If λ₀ is the conjugate variable associated with time then λ₀ + H = 0. At termination we have λ̄₀ + 𝕂_t = 0. But 𝕂 is identically zero, so λ₀ is zero at termination, which implies that H = 0 at termination. This certainly holds if u = 0 at termination, and this turns out to be the only way of satisfying the terminal condition. The initial condition is
\[
x = \int_0^\infty u\,dt. \tag{24}
\]
Consider first the case p(u) = u^{−γ}, which in fact makes the problem a version of the consumption problem considered in Section 2.2. Maximisation of H gives the rule u = ke^{−αt/γ}, for some constant k. If t̄ is the termination time then the condition u(t̄) = 0 is satisfied only for t̄ infinite; the resource is never released completely in finite time. Condition (24) gives the evaluation k = αx/γ, so that the optimal release rule is u = (αx/γ)e^{−αt/γ}. Here x is the initial stock, and this is the open-loop release rule. If x is taken as current stock then the optimal rule in closed-loop form is u = αx/γ. If we insert the open-loop expression for u into the reward function above we find the maximal return F(x) = (γ/α)(αx/γ)^{1−γ}. These conclusions are consistent with those of Section 2.2, in the case when infinite-horizon limits existed.
u = 1 - e-a(i-t).
(25)
Condition (24) yields the determining condition for t̄:
\[
x = \bar t - (1 - e^{-\alpha\bar t})/\alpha \tag{26}
\]
and substitution of expression (25) for u into the reward function yields
\[
F(x) = (1 - e^{-\alpha\bar t})^2/(2\alpha). \tag{27}
\]
The determinations (25) and (27) of the optimal u and the value function are only implicit; they are expressed in terms of t̄, which is only implicitly determined as a function of x by (26). The dynamic programming equation, had we taken that route, could not have been solved more explicitly. Indeed, it would have taken some ingenuity to have spotted this implicit form of the solution.

Exercises and comments

(1) Zermelo's problem A straight river has a current of speed c(y), where y is the distance from the bank from which a boat is leaving. The boat then crosses the river at a constant speed v relative to the water, so that its position in downstream/cross-stream coordinates (x, y) satisfies ẋ = v cos u + c(y), ẏ = v sin u, where u is the heading angle indicated in Figure 4.
(i) Suppose that the boatman wishes to be carried downstream as little as possible in crossing. Show that he should follow a heading
\[
u = \cos^{-1}\Bigl(-\frac{v}{c(y)}\Bigr).
\]
(Note the implication: we must have c(y) ≥ v for all y! Otherwise the boatman could move upstream in the slack water as far as he liked.)
(ii) Suppose the boatman wishes to reach a given point on the opposite bank in minimal time. Show that he should follow the heading
\[
u = \cos^{-1}\Bigl(-\frac{v}{p^{-1} + c(y)}\Bigr),
\]
where p is a constant chosen to make the path pass through the target point.
Figure 4 The Zermelo problem of Exercise 5.1: the optimal crossing of a stream.
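The minimal-time heading of part (ii) is easy to integrate numerically. In the sketch below the current profile c(y), the boat speed v and the constant p are all invented for illustration; in practice one would shoot over p until the crossing hits the target point.

```python
# Zermelo crossing under the heading u = arccos(-v/(1/p + c(y))).
# Current profile, speed and p invented for illustration.
from math import acos, cos, sin

def cross(p, v=1.0, width=1.0):
    c = lambda y: 2.0 + y * (width - y)      # hypothetical current
    x = y = 0.0
    dt = 1e-4
    while y < width:
        u = acos(-v / (1.0 / p + c(y)))      # heading from part (ii)
        x += (v * cos(u) + c(y)) * dt
        y += v * sin(u) * dt
    return x                                 # downstream drift on arrival

for p in [0.2, 0.5, 1.0]:
    print(p, cross(p))   # larger p: head harder upstream, less drift
```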
6 THE BUSH PROBLEM

This is one of the celebrated early problems of optimal control: to bring a mass to rest in a given position in minimal time, using a force of bounded magnitude. One example would be that of bringing the rollers of a rolling mill to rest in standard position in minimal time, the angular momentum of the rollers corresponding to the linear momentum of the mass. In the one-dimensional case the solution turns out to be simple: to apply maximal force first in one direction and then in the other, the order and duration being such that the mass will come to rest and to the desired position simultaneously. However, to prove the optimality of this manoeuvre, extreme and discontinuous in character, was beyond classical methods of variational calculus.
Consider the one-dimensional case; let x denote the coordinate and v the velocity. Thus (x, v) is the state variable, with the plant equation
\[
\dot x = v, \qquad \dot v = u, \tag{28}
\]
where u can be interpreted as the force applied per unit mass. We suppose that |u| ≤ M, c = 1 and that 𝒮 consists of the single point (x, v) = (0, 0). That is, the mass is to be brought to rest at the origin in minimal time. If p, q are taken as the variables conjugate to x, v respectively, then
\[
H = pv + qu - 1 \tag{29}
\]
\[
\dot p = 0, \qquad \dot q = -p \tag{30}
\]
\[
u = M\,\mathrm{sgn}(q). \tag{31}
\]
If the terminal values of p and q are denoted α and β then (30) implies that
\[
p = \alpha, \qquad q = \beta + \alpha s \tag{32}
\]
in terms of time-to-go s. Since u = M sgn(β) and v = 0 at termination, the relation H = 0 implies that |β| = 1/M.
Consider first the positive option, β = 1/M. If α ≥ 0 then q ≥ 0 for s ≥ 0, so that u = M, and backwards integration along the orbit yields
\[
v = -Ms, \qquad x = \tfrac12 Ms^2.
\]
Thus x, v lie on the parabolic locus
\[
x = \frac{v^2}{2M} \qquad (x \geq 0,\; v \leq 0).
\]
This is the lower half of the switching locus drawn in Figure 5. If, on the other hand, α is negative, then q changes sign at s = β/|α| = s₀, say, as then does u. If we follow the optimal path in reverse time then in this case it
Figure 5 The Bush problem. The two half-parabolae through the origin constitute the switching locus; we illustrate a path which begins with maximal deceleration, then switches to maximal acceleration when it meets the locus.
leaves the switching locus at s = s₀ and then follows the lightly-drawn parabola in Figure 5. In forward time, if one started at a point on this path then one would apply maximal deceleration u = −M and hold it until the switching locus was reached. One would then apply maximal acceleration u = M and move along the switching locus to the origin. The case β = −1/M leads correspondingly to the other half of the switching locus: x = −v²/(2M) (x ≤ 0, v ≥ 0).
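In computational terms the closed-loop rule can be expressed through the single switching function σ(x, v) = x + v|v|/(2M), which vanishes on the locus and is positive above it. A simulation sketch (our own discretisation, with illustrative values):

```python
# Bang-bang regulation of xddot = u, |u| <= M, via the switching
# function sigma = x + v|v|/(2M).  Illustrative discretisation.
M, dt = 1.0, 1e-3
x, v, t = 2.0, 1.0, 0.0
while (abs(x) > 1e-2 or abs(v) > 1e-2) and t < 20.0:
    sigma = x + v * abs(v) / (2 * M)       # > 0: above the locus
    if sigma > 0:
        u = -M                             # maximal deceleration
    elif sigma < 0:
        u = M                              # maximal acceleration
    else:
        u = -M if v > 0 else M             # slide along the locus
    x, v = x + dt * v, v + dt * u
    t += dt
print("at rest near the origin, t =", round(t, 2))   # ~ 4.16 here
```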
The maximum principle yields the relations (31) and (32), and from these we have deduced the optimal rule in closed-loop form, expressed in terms of the switching locus. If one starts off the locus then one applies maximal deceleration or acceleration depending on whether one is above or below it. Once the locus is reached one applies maximal deceleration or acceleration depending upon whether one is upon the upper or the lower branch. Once the origin is reached then the force is of course removed.
The control rule we have deduced is an example of what is termed 'bang-bang' control. That is, the control variable u in general takes extreme values in its permitted set 𝒰, sometimes switching between these in a discontinuous fashion, as in this example. Small heating systems are usually run this way, with the gas flame either fully on or fully off, intermediate settings being impracticable. Fuel economy certainly requires that rocket thrust should be programmed this way: to operate in general either at full thrust (in some direction) or at zero thrust. In the Bush example the control force was not costed, merely limited. In the rocket case
there is a linear costing in addition to limitation; this implies the bang-bang character of the optimal rule, as we shall see in the next section.
The treatment of the Bush problem generalises to some degree to the n-dimensional case. If we work from the vector generalisations of (30) and (31) then we deduce the partial characterisation
\[
u = M\,\frac{(s\alpha + \beta)^T}{|s\alpha + \beta|}
\]
of the control law, for fixed α, β subject to |β| = 1/M. As an optimal path is followed we see that the direction of the control force now in general varies all the time, with up to n reversals of direction in some component. One generates all optimal orbits by variation of α and β. However, to deduce the closed-loop form of the control in this way seems difficult. Somewhat simpler is the LQ version of the problem, outlined in Exercise 1.

Exercises and comments

(1) Consider the vector version of the Bush problem with no bound on u, but with an instantaneous cost function (L + ½Q|u|²), penalising respectively time taken and control energy consumed during passage to termination at x = v = 0. Suppose, to begin with, that we prescribe the termination time t̄. Show (and we shall treat a more general case in Section 7) that the optimal control is
\[
u = -\frac{2}{s^2}(3x + 2vs) = -\frac{6}{s^2}\Bigl(x + \frac{vs}{2}\Bigr) - \frac{v}{s} \tag{33}
\]
and that the control cost incurred is
\[
F(x, v, s) = \frac{6Q}{s^3}\Bigl(x + \frac{vs}{2}\Bigr)^2 + \frac{Q}{2s}\,v^2, \tag{34}
\]
where s = t̄ − t is time to go. The first term in the final expression of (33) induces correction of final position (with an inference that the effective average velocity over the remaining time interval is v/2) and the second term induces correction of final velocity. One can now take account of the time-cost by choosing s to minimise Ls + F(x, v, s). This will give an (x, v)-dependent evaluation of s and so of t̄, but the evaluation of t̄ is necessarily constant along the optimal orbit.
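The closed-loop form of (33) can be checked by simulation. The sketch below (our own discretisation, with invented initial values) applies u = −(2/s²)(3x + 2vs) with s the remaining time, and confirms that the state is brought essentially to rest at the horizon with control cost close to (34); the integration is stopped just short of s = 0, where the feedback gain blows up.

```python
# Closed-loop simulation of Exercise 1:  u = -(2/s^2)(3x + 2vs),
# s = time to go.  Invented initial values; Euler integration,
# stopped just short of the horizon where the gain 1/s^2 blows up.
Q, T, dt = 1.0, 5.0, 1e-4
x, v, t, cost = 3.0, -1.0, 0.0, 0.0
x0, v0 = x, v
while T - t > 0.05:
    s = T - t
    u = -(2.0 / s**2) * (3.0 * x + 2.0 * v * s)
    cost += 0.5 * Q * u * u * dt
    x, v = x + dt * v, v + dt * u
    t += dt
predicted = (6*Q/T**3)*(x0 + 0.5*v0*T)**2 + (Q/(2*T))*v0**2   # expression (34)
print(x, v)             # essentially at rest
print(cost, predicted)  # close; the gap is the truncated tail + Euler error
```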
7 ROCKET THRUST PROGRAMME OPTIMISATION

The solution of the Bush problem of Section 6 is very much what intuition might suggest, but is nevertheless remarkable for its extreme and discontinuous character. It is just for this reason that the problem was so resistant to classical optimisation methods, such as the calculus of variations. The maximum
principle has been significant in that it provided a technique for just such cases.
A source of significant problems since the 'forties has been the incentive to determine a rocket thrust programme which is optimal in the time/fuel consumption needed to achieve some given manoeuvre. We shall assume the rocket to be effectively a point mass, so that its state is specified by its coordinates in physical space (x), its velocity (v) and its mass (m). That is, we neglect rotation, yaw, vibration or any aspect of the rocket other than translation of its centre of mass through space and the wasting of its mass through fuel consumption. Note that x in this case is not the state vector, but simply that part of the state vector (x, v, m) which describes the Euclidean coordinates of the rocket in physical space.
Suppose that the rocket jet has a backward vector velocity k relative to the rocket (so that the absolute velocity of the material of the jet is v − k) and that the rocket is subject to external forces of vector magnitude h = h(x, v, m, t). Then the condition of conservation of linear momentum over a time interval of length δt gives the equation
\[
(m - \delta m)(v + \delta v) + (v - k)\,\delta m = mv + h\,\delta t
\]
or
\[
m\dot v = k\dot m + h, \tag{35}
\]
the so-called rocket equation. We shall assume that the direction of the jet vector k can be freely controlled and that the rate ṁ at which mass can be expelled in the rocket jet can also be controlled within limits. If we set
\[
k\dot m = u,
\]
then we can regard the thrust vector u as the control vector, to be chosen freely subject to
\[
|u| \leq M, \tag{36}
\]
say. With these definitions the collective plant equation for the rocket can be written
\[
\dot x = v, \qquad m\dot v = u + h, \qquad \dot m = -c|u|. \tag{37}
\]
If conjugate variables p, q, r are associated with x, v, m respectively, then equations (10) become for this case
\[
\dot p = -\frac{qh_x}{m}, \qquad \dot q = -p - \frac{qh_v}{m}, \qquad \dot r = \frac{q(u + h)}{m^2} - \frac{qh_m}{m}. \tag{38}
\]
If we suppose that the cost function is purely terminal then the optimal control u should be such as to maximise the expression
\[
\frac{qu}{m} - cr|u|. \tag{39}
\]
For a given value of |u| the thrust vector u will then be chosen in the direction of qᵀ:
\[
u = \frac{q^T|u|}{|q|},
\]
and |u| will be chosen to maximise κ|u|, where κ = |q|/m − cr. That is,
\[
|u| = \begin{cases} M\\ \text{indeterminate}\\ 0\end{cases} \quad\text{according as}\quad \kappa \;\begin{cases} >\\ =\\ <\end{cases}\; 0. \tag{40}
\]
The vector q is often termed the primer; its direction determines the optimal thrust direction and the magnitudes of q and r determine the optimal thrust magnitude. The paths on which |u| is respectively M, 0 or intermediate in value are often called maximal thrust, null thrust and intermediate thrust arcs respectively, or simply MT, NT and IT arcs.
Equations (38) simplify somewhat if the rocket is moving in a purely gravitational field, when h = mγ, where γ is the gravitational vector. One is thus neglecting aerodynamic effects such as drag. The vector γ will have the form γ(x) = V_x if the field is a conservative one, V(x) being the potential function associated with the gravitational field. In this case equations (38) reduce to
\[
\dot p = -q\gamma_x, \qquad \dot q = -p, \qquad \dot r = \frac{qu}{m^2},
\]
and the primer obeys the equation
\[
\ddot q = q\gamma_x = qV_{xx} \tag{41}
\]
where V_xx is the matrix of second differentials of V. Equations (38) become particularly simple if V may be assumed quadratic, when V_xx is independent of x.
As a very special case, consider the problem of maximising the height reached by a sounding rocket, assuming constant gravity and neglecting (implausibly!) effects such as aerodynamic drag. The problem can be taken as a one-dimensional one in which x represents height measured upward from the starting point (the ground). The quantities v, p, q and r will all be scalar, and γ = −g, where g is the gravitational constant.
The problem is one of maximising x at termination. Let m₀ denote the mass of the rocket structure, so that w = m − m₀ is the mass of fuel remaining. The terminal conditions holding are then p̄ = 1, q̄ = 0, and r̄ is non-positive or non-negative according as the fuel reserve w is positive or zero. We find from equations (38) that p = 1, q = s (= time to go), so that
\[
\kappa = \frac{s}{m} - cr. \tag{42}
\]
But u, if non-zero, is in the direction of q = s and so positive (upward-directed). Thus κ̇ = −m⁻¹, and κ must decrease strictly through time. We thus see that the thrust programme must take the form of a phase of maximal thrust followed by a phase of null thrust, either or both of these phases possibly being of zero duration.
Denote the terminal value of r by r₀. If r₀ > 0 (so that w = 0 at termination) then it follows from (40) and (42) that thrust is zero for s < cm₀r₀ and maximal for larger s, and the condition H = 0 implies that v = 0 at termination. This is the usual case, in which an MT phase which exhausts fuel is followed by an NT phase during which the rocket coasts to its maximal height, when its velocity is zero.
If r₀ ≤ 0 then κ > 0 before termination. There is thus no NT arc; maximal thrust is applied throughout. If r₀ = 0 then it follows again from H = 0 that v = 0 at termination. If r₀ < 0 then v < 0 at termination. These are cases in which the thrust is insufficient to lift the rocket. If initially the rocket happens to be already rising then maximal thrust is applied until the rocket is on the point of reversing, which is taken as the terminal instant. If the rocket happens to be already falling then termination is immediate.
This last discussion illustrates the literal nature of the analysis. In discussing all the possible cases one comes across some which are indeed physically possible but which one would hardly envisage in practice.

Exercises and comments

(1) An approximate reverse of the sounding rocket problem is that of soft landing: to land a rocket on the surface of a planet with prescribed terminal velocity in such a way as to minimise fuel consumption. It may be assumed that gravitational forces are vertical and constant, that there is no atmosphere and that all motion is vertical. Note that equation (42) remains valid. Hence show that the thrust programme must consist of a phase of null thrust followed by one of maximal thrust upwards (the phases possibly being of zero duration). How is the solution affected if one also penalises the time taken?
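The deduced programme for the sounding rocket, an MT arc which exhausts the fuel followed by an NT coast, is easily simulated. All the parameter values below are invented for illustration:

```python
# Sounding rocket: maximal thrust until fuel is exhausted (MT arc),
# then a null-thrust coast (NT arc) to the apex.  Invented values.
g, c, M = 9.81, 1.0e-3, 4000.0    # gravity, fuel rate per thrust, max thrust
m0, w = 100.0, 50.0               # structural mass, initial fuel
x = v = 0.0
m, dt = m0 + w, 1e-3

while m > m0:                     # MT arc: u = M, mdot = -c M
    v += (M / m - g) * dt
    x += v * dt
    m -= c * M * dt
while v > 0.0:                    # NT arc: coast to zero velocity
    v -= g * dt
    x += v * dt
print("apex height:", round(x, 1), "m")
```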
8 PROBLEMS WHICH ARE PARTIALLY LQ

LQ models can be equally well treated by dynamic programming or by the maximum principle; one treatment is in fact only a slightly disguised version of
the other. However, there is a class of partially LQ models for which the maximum principle quickly reveals some simple conclusions. We shall treat these at some length, since the conclusions are both explicit and transfer in an interesting way to the stochastic case (see Chapter 24).
Assume vector state and control variables and a linear plant equation
\[
\dot x = Ax + Bu. \tag{43}
\]
Suppose an instantaneous cost function
\[
c(u) = \tfrac12 u^T Qu, \tag{44}
\]
which is quadratic in u and independent of x altogether. We shall suppose that the only state costs are those incurred at termination. The analysis which follows remains valid if we allow the matrices A, B and Q to depend upon time t, but we assume them constant for simplicity. However, we shall allow termination rules which are both time-dependent and non-LQ, in that we shall assume that a terminal cost 𝕂(ξ) is incurred upon first entry to a stopping set of ξ-values 𝒮, where ξ is the combined state/time variable ξ = (x, t). We assume that any constraint on the path is incorporated in the prescription of 𝒮 and 𝕂, so that ξ values which are 'forbidden' belong to 𝒮 and carry infinite penalty. The model is thus LQ except at termination.
The assumption that state costs are incurred first at termination is realistic under certain circumstances. For example, imagine a missile or an aircraft which is moving through a region of space which (outside the stopping set 𝒮) is uniform in its properties (i.e. in gravitational force and air density). Then no immediate position-dependent cost is incurred. This does not mean to say that spatial position is immaterial, however; one will certainly avoid any configuration of the craft which would take it to the wrong target or (in the case of the aircraft) lead it to crash. In other words, one will try so to manoeuvre the craft that flight terminates favourably, in that the sum of control costs and terminal cost is minimal.
This will be the interest of the problem: to chart out a course which is both economical and avoids hazards (e.g. mountain peaks) which would lead to premature termination. The effect of such hazards is even more interesting in the stochastic case, when even the controlled path of the craft is not completely predictable. It is then not enough to scrape past a hazard; one must allow a safe clearance. The analysis of this section has a natural stochastic analogue, which we pursue in Chapter 24.
The Hamiltonian for the problem is
\[
H(x, u, p) = \lambda^T[Ax + Bu] - \tfrac12 u^T Qu
\]
if we take p = λᵀ as the multiplier. It thus follows that on a free section of the optimal path (i.e. a section clear of 𝒮)
\[
u = Q^{-1}B^T\lambda \tag{45}
\]
\[
\dot\lambda = -A^T\lambda. \tag{46}
\]
Consider now the optimal passage from an initial position ξ = (x, t) to a terminal position ξ̄ = (x̄, t̄) by a path which we suppose to be free. We shall correspondingly denote the terminal value of λ by λ̄, and shall denote time-to-go t̄ − t by s. It follows then from (45) and (46) that the optimal value of u is given by
\[
u(\tau) = Q^{-1}B^T e^{A^T(\bar t - \tau)}\bar\lambda. \tag{47}
\]
Inserting this expression for u back into the plant equation and cost function we find the expressions
\[
\bar x = e^{As}x + V(s)\bar\lambda \tag{48}
\]
\[
F(\xi, \bar\xi) = \tfrac12\bar\lambda^T V(s)\bar\lambda \tag{49}
\]
for terminal x and total cost in terms of λ̄, where
\[
V(s) = \int_0^s e^{A\tau}Je^{A^T\tau}\,d\tau \tag{50}
\]
and J = BQ⁻¹Bᵀ, as ever. In (50) we recognise just the controllability Gramian. Solving for λ̄ from (48) and substituting in (47) and (49) we deduce
Theorem 7.8.1 Assume the model specified above. Then
(i) The minimal cost of free passage from ξ to ξ̄ is
\[
F(\xi, \bar\xi) = \tfrac12(\bar x - e^{As}x)^T V(s)^{-1}(\bar x - e^{As}x), \tag{51}
\]
and the open-loop form of the optimal control at ξ = (x, t) is
\[
u = Q^{-1}B^T e^{A^Ts}V(s)^{-1}(\bar x - e^{As}x), \tag{52}
\]
where s = t̄ − t.
(ii) The minimal cost of passage from ξ to the stopping set 𝒮, by an orbit which is free before termination, is
\[
F(\xi) = \inf_{\bar\xi\in\mathcal{S}}\,\bigl[F(\xi, \bar\xi) + \mathbb{K}(\bar\xi)\bigr], \tag{53}
\]
if this can be attained.
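The theorem is easily checked numerically. The sketch below (our own illustration) takes the double integrator treated in Section 10, computes the Gramian V(s) of (50) by quadrature, forms the open-loop control (47)/(52), and integrates the plant to confirm arrival at x̄:

```python
# Numerical check of Theorem 7.8.1 for the double integrator:
# Gramian V(s) of (50) by quadrature, open-loop control (47)/(52),
# then integrate the plant to the target x_bar.
import numpy as np

A = np.array([[0.0, 1.0], [0.0, 0.0]])
B = np.array([[0.0], [1.0]])
Q = np.array([[1.0]])
J = B @ np.linalg.inv(Q) @ B.T

def expA(s):                       # e^{As}; A is nilpotent here
    return np.eye(2) + A * s

s_bar, n = 3.0, 30_000
dt = s_bar / n
V = sum(expA(t) @ J @ expA(t).T * dt for t in np.linspace(0.0, s_bar, n))

x = np.array([1.0, 0.0])           # initial state
x_bar = np.array([0.0, 0.0])       # target
lam_bar = np.linalg.solve(V, x_bar - expA(s_bar) @ x)

for k in range(n):
    s = s_bar - k * dt                                  # time to go
    u = np.linalg.inv(Q) @ B.T @ expA(s).T @ lam_bar    # rule (47)
    x = x + dt * (A @ x + B @ u)
print(x)                           # ~ x_bar
```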
Expression (52) still gives the optimal control rule, with ξ̄ determined by the minimisation in (53). This value of ξ̄ will be constant along an optimal path. We have in effect used the simple and immediate consequence (47) of the maximum principle to solve the dynamic programming equation. Relation (52) is indeed the closed-loop rule which one would wish, but one would never have
imagined that it would imply the simple course (47) of actual control values along the optimal orbit.
Solution of (43) and (47) yields the optimal orbit as
\[
x(\tau) = e^{A(\tau - t)}x + V(\tau - t)\,e^{A^T(\bar t - \tau)}\bar\lambda \qquad (t \leq \tau \leq \bar t)
\]
where x = x(t) and λ̄ is determined by (48). For validity of evaluation (53) it is necessary that this orbit should not meet the stopping set 𝒮 before time t̄. Should it do so, then the orbit will have to break up into more than one free section, these sections being separated by grazing encounters with 𝒮 at which special transition conditions will hold. We shall consider some such cases by example in the following sections.

9 CONTROL OF THE INERTIALESS PARTICLE

The examples we now consider are grossly simpler than any actual practical problem, but bring out points which are important for such problems. We shall be able to generalise these to the stochastic case (see Chapter 24), where they are certainly non-trivial.
Let x be a scalar, corresponding to the height of an aircraft above level ground. We shall suppose that the craft is moving with a constant horizontal velocity, which we can normalise to unity, so that time can be equated to horizontal distance travelled. We suppose that the plant equation is simply
\[
\dot x = u, \tag{54}
\]
i.e. that velocity equals control force applied. This would represent the dynamics of a mass moving in treacle: there are no inertial effects, and it is velocity rather than acceleration which is proportional to applied force. We shall then refer to the object being controlled as an 'inertialess particle': inertialess for the reasons stated, and a particle because its dynamic state is supposed specified fully by its position. It is then the lamest possible example of an aircraft; it not merely shows no inertia, but also no directional effects, no angular inertia and no aerodynamic effects such as lift. We shall use the term 'aircraft' for vividness, however, and as a reminder of the physical object towards whose description we aspire by elaboration of the model.
We have A = 0, B = 1. We thus see from (45)/(46) that the optimal control value u is constant along a free section of the orbit, whence it follows from the plant equation (54) that such sections of orbit must be straight lines. We find that V(s) = Q⁻¹s, so that F(ξ, ξ̄) = Q(x̄ − x)²/2s.
Suppose that the stopping set is the solid, level earth, so that the region of free movement is x > 0 and the effective stopping set is the surface of the ground, x = 0. The terminal cost can then be specified as a function 𝕂(t̄) of time (i.e. distance along the ground). The expressions (53) and (52) for the value function and the closed-loop optimal control rule then become
154
THE PONTRYAGIN MAXIMUM PRINCIPLE
F(x, t) =
!~[~ + IK(t+s)],
(55)
u = -xjs.
(56) Here the time-to-go s must be determined from the minim isation in (55), which determines the optim al landing-point 1 = t + s. The rule (56) is indee d consistent with a const ant rate of descent along the straight-line path joining ~. t) and (0,7). However, suppose there is a sharp moun tain between the startin g point and the desired termin al point 1 determ ined above, sufficiently high that the path determined above will not clear it. That is, if the peak occur s at coordinate t 1 and has height h then we require that x(ti) >h. If the optim al straight-line path determined above does not satisfy this then the path must divide into two straight-line segments as illustrated in Figure 6. The total cost of this comp ound path is
Q(h -x) 2 F1(x, t) = 2 ( ) +F(h ,t1), ti-t
(57)
where Fis the 'free path' value function determined in (55). It is more realistic to regard a crash on the moun tain as carrying a high cost, K 1, say, rather than prohobited. In the stochastic case this is the view that one must necessarily take, becau se then the crash outcome always has positive probability. If one resigns and chooses to crash then there will be no control cost at all and the total cost incurr ed will be just K 1• One will then choose the crash option if one is in a position (x, t) for which expression (57) exceeds K1. i.e. for which
x
(58)
where dis the const ant (2/Q)[K1 - F(h, t 1)]. For K1lar gethis can occur only if the craft is much closer to the moun tain than it is to the peak; see Figure 6. X
t Figure 6 The straight-line segments ofa path surmounting a peak. The curve is the switching locus (determined by equality in (58)) on which one switchesfrom endeavouring to clear the peak to accepting a crash.
10 CONTROL OF THE INERTIAL PARTICLE
155
10 CONTROL OF THE INERTIAL PARTICLE The feature that must most urgently be added to the model above is that of inertia for vertical motion. Let us take the state variable as having compo nents x and v, where xis again height and vis vertical velocity: rate of increas e of height. We a.re thus guilty of an inconsistency in that x denotes a compo nent of the state vector rather than the full vector. However, continuity with the discuss ion above makes this desirable. We assum e then the plant equation
x= v, v= u, (59) and again the instantaneous cost functi on! Qul. Any coeffic ients which might have occurr ed in the second equation have been norma lised to unity by scalechanges in the variables. We have then As=
e
[10
s] 1 ·
It follows then from (45)/ (46) that the optimal control is linear in time, and so from the plant equati on (59) that, along the optimal orbit, the height is cubic in time:
(t
~
T
~
t).
Here a and (3 are coefficients to be determ ined from the termin al conditions. To determ ine the control rule in closed-loop form, we note first the evaluation
V(s) = Q-I [ ~~;
S
t2].
2
Expressions (51) and (52) for minim al cost of passage from and
~then becom e
F(~.~)
optimal control at
= (6Q/; )[x- x-! (v + v)sf + (Q/2s )(v- v) 2, u=(2 jl)[3 (x-x) -2vs- vs].
(60)
(61) If one is consid ering passage to fl' rather than to a prescribed ~then ~is again determ ined by the minimisation in (53). The control rule (61) is in the closed-loop form that one would wish, but indeed one would never have guessed from this that the optima l control in fact varies linearly with time. These evaluations are of course valid only if passage is free. To consider what happens in other cases, suppose we indeed consider an optima l 'landing' in that ~ = (.X, v, t) is prescr ibed as (0, w, t). That is, one is require d to land at timet (corresponding to prescription of a position on the ground) with vertical velocity w (necessarily non-positive). Then a cubic path satisfying initial and terminal conditions (and so the unique optima l free path) could take any of the forms
156
THE PONTRYAGIN MAXIMUM PRINCIPLE
't
(c)
(d)
Figure 7 The various possibilities for the free trajectory ofthe optimally controlled inertial particle.
(i}-{iv) of Figure 7. Cases (i) and (ii) are acceptable, because the orbit meets the stopping set (the ground) flrst at the desired terminal point. In cases (iii) and (iv) the aircraft begins in so violent a dive that it overshoots the plane x = 0 before it finally approaches the desired terminal point in the prescribed fashion. But such an 'Qvershoot' is a crash-a premature encounter with 9'-and is effectively forbidden if the penalty is set high enough. Note that case (iv) could not occur if w < 0; it is possible only if w = 0, and can be regarded as a limit version of case (iii). In either of these cases the actual optimal orbit must break into two parts, as illustrated in Figure 8. First the pilot makes strenuous efforts to come out of the dive, which he does at time t1, say.
Figure 8 The case in which the optimal path to a desired terminal configuration at t breaks into two parts: a grazing escape.from crash at ttfollowed by afree trajectory to termination.
10 CONTROL OF THE INERTIAL PARTICLE
157
Economy of control dictates that he only just escapes a crash, in that as he comes out of the dive (i.e. v = 0) he only just misses the ground (i.e. x = 0+). Having come out of the dive he approaches the desired terminal point by an optimal path which is free. The two sections of the path (those before and after tJ) are indeed distinct in that they are distinct cubic curves. We have to work out how the optimal grazing point t 1 is determined, and the transition conditions obtaining there. Optimisation of t1 means that the pilot has the desired terminal configuration~ in mind and comes out of the dive in such a way as to optimise costs after as well as before t 1• But this is to demand too much. In practice a pilot will simply concentrate on the immediate emergency, knowing that this is the priority. Consideration of what is to happen once the emergency is over is generally deferred, partly because one does not have the processing power to do anything else and partly because later events are secondary in cost terms. We shall find support for the latter assertion. Let us first consider, then, the problem of pulling out of a dive as economically as possible, without consideration of what is to happen afterwards. The cubic path must then take the form of Figure 9, touching x = 0 from above at a point t to be determined. We use the notation t because this is a terminal point as far as this phase of the operation is concerned.
e
Theorem 7.10.1 Suppose the initial configuration is (x, v, t) with v < 0. Then the closed-loop control rule which minimises the cost ofpulling out ofthe dive is
2v2
u = 3x.
(62)
This pullsthecraftoutofth ediveaftera times= t- tequalto -3xjv, and at a cost of
-(2Qv3 j9x). Proof We may as well set t = 0 and t = s. We know from (60), (61) that the minimal cost offree passage from (x, v, 0) to (0, 0, s) is
(63)
Figure 9 The presumedform ofan optimal crash-avoiding trajectory.
158
THE PONTRYAGIN MAXIMUM PRINCIPLE
and that the closed-loop rule which achieves it is u
= -(2j?)(3x + 2vs).
(64)
We can optimise with respect to s. The value of s minimising expression (63) for v negative is s = -3xjv; at this value the cost F and the control u have the evaluations asserted above. We have to confirm that the path thus generated does not in fact cross x = 0 before times. This path x(r) is a cubic with x(O) = x, x!(O) = v, and a double zero at -3xjv. These conditions determine the cubic, and its third zero is found also tolieat-3xjv. 0 Note a consequence of the last observation: that - 3x / v is in fact the largest value which could be chosen for s; any larger and the path would have crossed the axis before time s. At this point one will switch from control rule (62) to a zero control if crash avoidance is all that is demanded. The uncontrolled path now avoids !/, if only just. We can now return to the landing problem, and the question of determining whether the path to landing is free or is broken by a grazing encounter with the ground. In this latter case, we should also determine the point at which break occurs. We shall now use t to denote the time of landing (prescribed) and shall suppose that if there is an earlier grazing then it occurs at t 1• Theorem 7.10.2 Consider the landing problem enunciated above; sets = t - t. Then the optimalpath is a free one unless both the following inequalities hold: 3x+ vs
< 0,
(3x + vs) 2 w > - -'----:---':...._ 4xs
(65)
If these inequalities do indeed both hold then the optimal path suffers a grazing encounter with the ground after a time SJ = t 1 - t determined as the unique positive root of (3x + vsi) 2
w2
sj
(s- s 1) 2
(66)
which is less than -3xjv.
The first condition of (65) states that the prescribed landing time must be later than the time at which it would be optimal to pull out of the dive without consideration of the sequel. The second condition implies that, even if the first holds, the optimal path to the termination point will still be free if the required terminal rate of descent is large enough. That is, if one is willing to accept a crash landing at the destination!
10 CONTROL OF THE INERTIAL PARTICLE
159
Proof If the path is a free one then it has the cubic form given above. We may as well normalise the time origin by setting t = 0 and so t = s in this relation. The coefficients a and f3 in the cubic are then determined by the terminal conditions x(s) = O,x'(s) = w. The cubic then has a root at-r = s, and one finds that theremaining two roots are the roots of the quadratic ~- 2a-r+b
= 0,
(67)
where
2a =
s(x + vs) 2x+ (v+ w)s'
-Slx
b = ::------c;------;2x+ (v+ w)s·
The only case (consistent with x > 0, w < 0) in which the optimal path is not free is the case (iii) of Figure 7, so this is the one we must exclude. This will be the case in which the quadratic (67) has both roots between 0 and s. This in turn is true if and only if the quadratic expression is positive at both 0 and s, and has a turning point inside the interval at which it is negative. That is: b
< 0, ? - 2as + b > 0, 0 < a < s,
c? > b.
We find that the first two of these conditions are both equivalent to
2x+(v+w)s
(68)
This last with the inequality a > 0 implies that
x+vs < 0 The condition a
(69)
< sis equivalent to 3x + vs + 2ws < 0,
(70)
and the final condition, a 2 > b, is equivalent to
(3x + vs) 2 + 4xws > 0.
(71)
The free path is non-optimal if and only ifall of relations (68)-(71) hold. Relations (70) and (71) give the bounds on w
3x + vs (3x + vs) - - < - w < .o._---:----''2
2s
4xs
The upper bound in this relation exceeds the lower bound by (3x + vs)(x + vs)/ (4xs). It follows from (69) that the interval thus determined for w is empty unless 3x + vs < 0, a relation which implies (68), (69) and (70). We are thus left with the pair of conditions (65) asserted in the theorem. In the case that both these conditions are fulfilled the optimal path cannot be free, and is made up of two free segments meeting at time t1 = t + SJ. We choose s 1 to minimise the sum of the cost incurred on the two free segments, as given by expression (60); the stationarity condition (66) emerges immediately. It follows
160
THE PONTRYAGIN MAXIMUM PRINCIPLE
v FigrudO A graphical illustration ofthe solution of equation
the grazing point
(66)for the optimal timing SJ of
from the observation after Theo rem 7.10.1 that the root s 1 must be less than - 3x j v. Indeed, equation (66) has a single such root as we see from Figure 10; the left- and right-hand members of (66) are respectively decreasing and increasing, as functions of s1. in the interval 0 :::::; s1 :::::; - 3xfv. 0 Indeed, we can determine s 1 explicitly. In takin g square roots in (66) we must take the negative optio n on one side, because 3x + vs 1 is positive whereas w is negative. The appropriate root of the resulting quad ratic in s 1 is
31
(3x- vs) + V(3x + vs) 2 - 12xws = 2(w~v) '
atlea stifw - v > 0, whic hwem ayex pect. Thisa ppro
ache s-3xf vasw tends tozer o.
11 AVOIDANCE OF THE SfOPPING SEf: A GEN ERAL RESULT The conclusions of Theo rem 7.10.1 can be generalised , with implications for the
stochastic case. We consider the general mode l of Section 9 and envisage a situation in which the initial conditions are such that the uncontrolled path would meet the stopping set !/'; one's only wish is to avoid this encounter in the most economical fashion. Presumably the optim al avoiding path will graze !/ and then continue in a zero-cost fashion (i.e. subsequently avoid !/' without further control). We shall speak of this grazing point as the 'term inatio n' point, since it indee d marks the end of the controlled phase of the orbit. We then consider the linear system (43) with contr ol cost (44). Suppose also that the stopping set !/ is the half-space
11 AVOIDANCE OF THE STOPPING SET: A GENERAL RESULT
9'={x:aTx::;;;b}
161 (72)
This is not as special as it may appear; if the stopping set is one that is to be avoided rather than sought then its boundary will generally be (n - 1)dimensional, and can be regarded as locally planar under regularity conditions. Let us denote the the linear function aTx of state by z. Let F( ~) be the minimal cost of transition by a free orbit from an initial point = (x, 0) to a terminal point ~ = (x, t), already evaluated in (51). Then we shall find that, under certain conditions, the grazing point ~of the optimal avoiding orbit is determined by minimising F(e, ~)with respect to the free components of ~at termination and maximising it (at least locally) with respect to i That is, the grazing point is, as far as timing goes, the most expensive point on the surface of 9' on which to terminate from a free orbit. The principal condition required is that aTB = 0, implying that the component z is not directly affected by the control. This is equivalent to aTJ = 0. Certainly, if one is to terminate at a given timet then the value of z is prescribed as b, but one should then optimise over all other components of x. Let the value of F(e, ~) thus minimised be denoted G(e, t). The assertion is then that the optimal value oft is that which maximises G( t). We need a preparatory lemma.
e
e,
e,
Lemma 7.11.1 Add to the prescriptions (43), (44) and (72) ofplant equation, cost function and stopping set the conditions that the process be controllable and that aTB =O.Then (73)
e,
Proof Optimisation ofF( ~) with respect to the free components of x will imply that the Xof (47) is proportional to a, and so can be written Oa for some scalar 0. The values of() and tare related by the termination condition z = b, which, in virtue of (48), we can write as aTeA1x
+ ()aTV(t)a =b.
(74)
We further see from (49) that
G(e, t) = !fiaTV(t)a.
(75)
Let us write V(t) and its derivative with respect to t simply as V and V. Controllability then implies that aTVa > 0. By replacing T by s- T in the integrand of (50) we see that we can write Vas
V=J+AV+ VAT.
(76)
Differentiating (74) with respect tot we fmd that
(aTVa)(dOfdt) +aT AeA1a + oaTva = 0,
(77) .
162
THE PONTRYAGIN MAX IMUM PRIN CIPL E
so that
oG = !B2aTVa + OaTVa(dOfdt) =-B aT AeA1x- !01aTVa.
at
Finally, we have
z= ar(Ax + J5..) =aTAx= aTAeA x+ oar AVa = arAeA x +!Oarva. 1
1
The second equality follows beca use aT J = 0 and the fourth by appeal to (76). We thus deduce from the last two relations that 8Gja t = -Bz. Inser ting the evaluation of eimpl ied by (74) we deduce (73). Theorem 7.11.2 The assumptions of Theorem 7.11.1 imply that the grazing poin t ~ ofthe optimal Y' -avoiding orbit is determined by first mini misin g F (.;, ~) with respect to x subject to aTx = band then maxi misin g it, at least locally, with respect tot. Proo f As indic ated above, optimality will require the restricted x-optimisation; we have then to show that the optim al t maxi mises G(~, !). At any value t for which the controlled orbit crosses z = 0 the uncontrolled orbit will lie below z = 0, so that aT eA7x - b < 0. If, on the contr olled orbit, z decreases throu gh zero at time !, then one will increase tin an attem pt to find an orbit which does not enter :/.W e see from (73) that G(x, t) will then increase. Correspondingly, if z crosses zero from below then one will decrease t and G(x, t) will again increase. If the two original controlled orbits are such that the t values ultimately coincide unde r this exercise, then = 0 at the com mon value, so that the orbit grazes :/ locally, and G(x, t) is locally max imal with respect tot. The true grazing poin t (i.e. that for which the orbit meets Y at no othe r point) will be found by repeated elimination of crossings in this fashion. 0
x
One conjectures that G(x, 7) is in fact globally maxi mal with respect to 7 at the grazing point, but this has yet to be demonstra ted. We leave the reader to conf irm that this criterion yields the know n graz ing poin t t = -3xf v for the crash avoidance problem of Theo rem 7.10.1, for which the cond ition aT B = 0 was inde ed satisfied. We should now deter mine the optim al Y -avoi ding control explicitly. Theorem 7.11.3
Defin e
~ = ~(x,s)
= b- aTeAsx,
a= a(s) = JaTV (s)a .
Then, under the conditions of Theorem 7.11.1, the optim
al :/-av oidin g control at xis
(78)
12 CONSTRAINED PATHS: TRANSVERSALITY CONDITIONS
163
where s is given the values which maximises (~I (j) 2 . With this understanding ~I rf2 is constant along the optimal orbit (before the grazing point) and s is the time remaining until grazing.
e
Proof Let us set ~ = (X' 0)' = (X' s) so that X is the value of state when a time s remains before termination. The cost of passage along the optimal free orbit from~to eis (79) where V = V(s) is given by (50) and 8 = x- eAsx. The optimal control at time t = 0 for prescribed is
e
(80) The quantity v- 18 is the terminal value of A. and is consequently invariant along the optimal orbit. That is, if one considers it as a function of initial value ~then its value is the same for any initial point chosen on that orbit. Specialising now to the Y' -avoiding problem, we know from the previous theorem that we determine the optimal grazing point by minimising F (~, e) with respect to x subject to z = b and then maximising it with respect to s. The first minimisation yields (81) so the values of sat the optimal grazing point is that which maximises (~I (j) 2 • Expression (78) for the optimal control now follows from (80) and (81). The identification of ~I dl with v- 18 (with s determined in terms of current x) D demonstrates its invariance along the optimal orbit (before grazing). 12 CONSTRAINED PATHS: TRANSVERSALITY CONDITIONS We should now consider the transition rules which hold when a free orbit enters or emerges from a part of state space in which the orbit is constrained. We shall see that conclusions and argument are very similar to those which we deduced for termination in Section 3. Consider first of all the time-invariant case. Suppose that an optimal path which begins freely meets a set F in state-space which is forbidden. We shall assume that § is open, so that the path can traverse the boundary 8§' ofF for a while, during which time the path is of course constrained. One can ask: what conditions hold at the optimal points of entry to and exit from oF? Let x be an entry point and p and p' be the values of the conjugate variable immediately before and after entry. Just as for the treatment of the terminal problem in Section 3, we can partially integrate the Lagrangian expression for minimal total cost up to the transition point and so deduce an expression whose primary dependence upon the transition value x occurs in a term (p - p')x.
164
TH E PONTRYAGIN MA XIM UM PRI NC IPL E
Suppose we can vary x to x + w· + o( c), a neighbouring poi nt in 8-F . Then, by the same argument as tha t of Th eor em 7.3.1 we deduce tha t (p - p') a is zero if x is an optimal transition value. Th at is, the linear function pa of the conjugate variable is continuous at an opt ima l tran sition point for all directions a tangential to the surface of F at x. Otherwise expressed, the vector (p - p') T is nor ma l to the surface of .F at x. We deduce the sam e conclu sion for optimal exit points by appeal to a timereversed version of the proble m. Transferring these results to the time-varying problem by the usual device of taking an augmented state variable~= (x, t) we thus deduce *Th eor em 7.12.1 Let .F be an open set in (x, t) space whi ch is forbidden. Let (x, t) be an opt ima l transition poi nt (fo r eith er ent ry to or exi t fro m 8-F ) and (p, Po) and p',p the values of the aug me nte d conjugate variable imm edi ate ly before and after transition. The n
0)
(p - p') a + (po -
p~)r =
0 (82) for all directions (a, r) tan gen tial to 8-F at (x, t). In particular, if t can be var ied ind epe nde ntly of x then the Ha mil ton ian H is con tinu ous at the transition. Pro of Th e first assertion follows from the argument bef ore the theorem, as indicated. If we can vary tin bot h directions for fixed x at transiti on then (82) implies tha t p 0 is continuous at transiti on. Bu t we have p 0 + H = 0 on bot h sides of the transition, so the implication is tha t Hi s continuous at the transition. 0 One can also develop conclu sions concerning the form of the optimal pat h during the phase when it lies in 8-F. However, we shall con ten t ourselves with what can be gained from discus sion of a simple example in the next section. An example we have already considered by a direct discus sion of costs is the steering of the inertial particl e in Section 10. For this the sta te variable was (\", v) and ff was x < 0. Th e bou nda ry 8-F is then x = 0, but on this we mu st also require tha t v = 0, or the condition x ;;?: 0 would be vio late d in either the immediate pas t or the imm edi ate future. Suppose tha t we sta rt fro m ~. v) at t = 0 and reach x = 0 first at time t (when necessarily v = 0 as wel l). Sup the variables conjugate to x, pose p, q are v, so tha t the Hamiltonian is H = pv + qu - Qu2 /2, and so equal to pv + q2 j2Q when u is chosen optimally. Continuity of H at a transition point, when vis nec essarily zero, thus amounts to continuity of q2. Let us con firm tha t this con tinuity is consistent with the previously derived condition (66). Ifp and q den ote the values of the conjugate var iables at t 1 -, jus t before transition, then integra tion of equation (46) leads 1 to p(r ) = p, u(r ) = Q- q(r ) = Q- 1(q + ps) , v(r ) = -Q - 1(qs + pil /2) , x(r 1 ) = Q(qi l /2 + ps3/6) , where s = t 1 - r. The values of p and q are determined by the prescribed initial
13 REGULATION OF A RESERVOIR
165
values x and v. We find q proportional to (3x + vt) / t2, and one can find a similar expression for q immediately after transition in terms of the values 0 and w of terminal height and velocity. Assertion of the continuity of q2 at transition thus leads exactly to equation (66). 13 REGULATION OF A RESERVOIR This is a problem which the author, among others, has discussed as 'regulation of a dam'. However, purists are correct in their demur that the term 'dam' can refer only to the retaining wall, and that the object one wishes to control, the mass of water, is more properly referred to as the 'reservoir: Let x denote the amount of water in the reservoir, and suppose that it obeys the plant equation .X = v - u, where v is the inflow rate (a function of time known in advance) and u is the draw-off rate (a quantity at the disposition of the controller). One wishes to maximise a criterion with instantaneous reward rate g( u), where g is concave and monotonic increasing. This concavity will (by Jensen's inequality) discourage variability in u. One also has the natural constraint u ~ 0. The state variable x enters the analysis by the constraint 0 ~ x ~ C, where C is the capacity of the reservoir. We shall describe the situations in which x = C, x = 0 and 0 < x < Cas fun empty and intermediate phases respectively. One would of course wish to extend the analysis to the case for which v (which depends on future rainfall, for example) is imperfectly predictable, and so supposed stochastic. This can be achieved for LQ versions of the model (see Section 2.9) but is difficult if one retains the hard constraints on x and u and a non-quadratic reward rate g( u). We can start from minimisation of the Lagrangian form J[-g(u) + p(xv + u)] dr, so that the Hamiltonian is H(x, u,p) = g(u) + p(v- u). A price interpretation would indeed characterise p as an effective current price for water. We then deduce the following conclusions.
Theorem 7.13.1 An optimal draw-offprogramme shows thefollowingfoatures. (i) The value ofu is the non-negative value maximising g(u)- pu, which then increases with decreasingp. (ii) The value ofu is constant in any one intermediate phase. (iii) The value ofp is decreasing vncreasing) and so the value ofv = u is decreasing vncreasing) in an empty (full) phase. (iv) The value ofu is continuous at transition points. Proof Assertion (i) follows immediately from the form of the Hamiltonian and the nature of g. Assertions (ii) and (iii) follow from extremisation of the Lagrangian form with respect to x. In an intermediate phase x can be perturbed either way, and one deduces that jJ = 0 (the free orbit condition for this particular case). Hence p, and so u, is constant in such a phase. In an empty phase perturba-
166
THE PONTRYAGIN MAX IMU M PRIN CIPLE
tions of x can only be
non-negative, and so one can deduce only that P :::;: 0. Thu s p is decreasing, and so u is increasing. Since u and v are necessarily equal, if x is
being held constant, then v mus t also be increasing. The analogous assertion s for a full phase follow in the sam e way; perturbations can then only be nonpositive. The final assertion follows from continui ty of Hat transition points. With u set equal to its opti mal value H becomes a monotonic, continuous function of p. Continuity of Hat transition points then implies continuity ofp, and so of u. 0 The example is interesting for its appeal to transversality conditions, but also because there is some discussion of opti mal behaviour during the empty and full phases (which constitute the bou nda ry off of the forbidden region!!': the union of x < 0 and x > C). Trivially, one mus t have u = v in these phases. However, one should not regard this as the equation dete rmining u. In the case x = 0 (say) one is always free to take a smaller value of u (and so to let water accumulate and so to move into an intermediate phase). The optimal draw-off rate continues to be determined as the value extremising g( u) - pu; it is the development ofp which is constrained by the condition x = 0. Alth ough the rule u = v is trivial if one is determined not to leave the empty phase, the conclusion that v mus t be increasing during such a phase (for optimali ty) is non-trivial.
Notes Pontryagin is indeed the originator of the principle which bears his name, and whose theory and application has been so developed by himself and others. It is notable that be held the dyn ami c prog ram min g principle in great scorn; M.H.A. Davis describes him memorably as hold ing it up 'like a dead rat by its tail' in the preface to Pontryagin et al. (1962). This was because of the occasional nonexistence of the derivative Fx in the simp lest of cases. However, as we have seen, it is a rat which alive, ingenious, direct, and able to squeeze through where authorities say it cannot. The material of Section 11 is believed to be new.
i
I
1
I \
l
PART 2
Stochastic Models
CHAPTER 8
Stochastic Dynamic Programming A difficulty which must be faced is that of incompleteness of information. That is, one may simply not have all the information needed to make an optimal decision, and which we have hitherto supposed available. For example, it may be impossible or impracticable to observe all aspects of the process variable-tb.e workings of even a moderate-sized plant, or of the patient under anaesthesia which we instanced in Section 5.2, are far too complex. This might matter less if the plant were observable in the technical sense of Chapter 5, so that the observations available nevertheless allowed one to build up a complete picture of tb.e state of affairs in the course of time. However, there are other uncertainties which cannot be resolved in this way. Most systems will have exogenous inputs of some kind: disturbances, reference signals or time-varying parameters such as price or weather. If the future of these is imperfectly predictable, as is usually the case, then the basis for the methods we have used hitherto is lost. There are two approaches which lead to a natural mathematical resolution of this situation. One is to adopt a stochastic formulation. That is, one arrives somehow at a probability model for plant and observations, so that all variables are jointly defined as random variables. The variables which are observable can then be used to make inferences on those which are not. More specifically, one chooses a policy, a control rule in terms of current observables, which minimises the expectation of some criterion based on cost. The other, quite as natural mathematically, is the minimax approach. In this one assumes that all unobservables take the worst values they can take (judged on the optimisation criterion) consistently with the values of observables. The operation of conditional expectation is thus replaced by a conditional maximisation (of cost). The stochastic approach seems to be the one which takes account of average performance in the long run; it has the completer theory and is the one usually adopted. The minimax approach corresponds to a worst-case analysis, and is frankly pessimistic. We shall consider only the stochastic approach, but shall find minimax ideas playing a role when we later develop the idea of risk-sensitivity. Lastly, there is a point which should be made to maintain perspective, even if it cannot be followed up in this volume. The larger the system (i.e. the greater the number of individual variables) then the more unrealistic becomes the picture that there is a central optimiser who uses all currently available information to make all necessary decisions. The physical flow of information and commands
170
STOCHASTIC DYNAMIC PROGRAMMING
would be excessive, as would the central processing load. This is why an economy or a biological organism is partially decentralised: some control decisions are made locally, on the basis of local information plus central commands, leaving only the major decisions to be made centrally, on aggregated information. Indeed, the more complex the system, the greater the premium on trading a loss in optimality for a gain in simplicity-and, perhaps, the greater the possibility of doing so advantageously, and of recognising the essential which is to be optimised. We use the familiar notations E(x) and E(xiy) for expectation and conditional expectation, and shall rarely make the notational distinction (which is only occasionally called for) between random variables and particular values which they may adopt. Correspondingly, P(x) and P(xiy) denote the probability (unconditional and conditional) of a particular value x, at least if x is discretevalued. However, more generally and more loosely, we also use P(x) to denote simply the probability law of x. So, the Markov property of a process {x 1} would be expressed, whatever the nature ofthe state space, by P(xt+1IX1 ) = P(xt+ 1 lx 1), where X 1 is the history {X 7 ; T ~ t }. 1 ONE-STAGE OPTIMISATION A special feature of control optimisation is that it is a multi-stage problem: one makes a sequence of decisions in time, the later decisions being in general based on more information than the earlier ones. For this very reason it is helpful to begin by considering the single-stage case, in which one only has a single decision to inake. For example, suppose that the pollution level of a water supply is being monitored. One observes pollution level y in the sample taken and has then the choice of two actions u: to raise the alarm or to do nothing. In practice,. of course, one might well convert this into a dynamic problem by allowing sampling to continue over a period oftime until there is a more assured basis for action one way or the other. However, suppose that action must be taken on the basis of this single observation. A cost C is incurred; the costs of raising the alarm (perhaps wrongly) or of not doing so (perhaps wrongly). The magnitude of the cost will then depend upon the decision u and upon the unknown 'true state' of affairs. Let us denote the cost incurred if action u is taken by C(u1 a random variable whose distribution depends on u. One assumes a stochastic (probabilistic) model in which the value of the cost C(u) for varying u and of the observable y are jointly defined as random variables. Apolicy prescribes u as function u(y) of the observable u; the policy is to be chosen to minimise E[C(u(y))]. Theorem 8.1.1 The optimal decision function u(y) is determined by choosing u as the value minimising E[C(u)ly]. Proof If a decision rule u(y) is followed then the expected cost is
,
1 ONE-STAGE OPTIMISATION
171
E[C(u(y))] = E{E[C(u(y))iy]} ~ E{inf E[C(u)iy]} u
and the lower bound is attained by the rule suggested in the theorem.
0
The theorem may seem trivial, but the reader should understand its point: the reduction of a constrained minimisation problem to a free one. The initial problem is that of minimising E[C(u(y)] with respect to the jUnction u(y), so that the minimising u is constrained to be a function of y at most. This is reduced to the problem of minimising E[C( u) IY]freely with respect to the parameter u. One might regard u as a variable whose prescription affects the probability distribution of the cost C,just as does that of y, and so write E[C(u)iy] rather as E[qy, u]. However, to do this is to blur a distinction between the variablesy and u. The variable y is a random variable whose specification conditions the distribution of C. The variable u is not initially random, but a variable whose value can be chosen by the optimiser and which parametrises the distribution of C. We discuss the point in Appendix 2, where a distinction is made by writing the expectation as E[qy; u], the semicolon separating parametrising variables from conditioning variables. However, while the distinction is important in some contexts, it is not in this, for reasons explained in Appendix 2. The reader may be uneasy: the formulation of Theorem 81.1 makes no mention of an important physical variable: the 'true state' of affairs. This would be the actual level of pollution in the pollution example. It would be this variable of which the observationy is an imperfect indicator, and which in combination with the decision u determines the cost. Suppose that the problem admits a state variable x which really does express the 'true state' of affairs, in that the cost is in fact a deterministic function C(x, u) ofx and u. So, if one knew x, one would simply choose u to minimise C(x, u). However, one knows only y, which is to be regarded as an imperfect observation on x. The joint distribution of x andy is independent of u, because the values of these random variables, whether observable or not, have been realised before the decision u is taken.
Theorem 8.1.2 Suppose that the problem admits a state variable x, that C(x, u) is the cost jUnction andf(x,y) the joint density ofx andy with respect to a product measure Jl.i (dx)p,2 (dy ). Then the optimal value of u is that minimising J C(x, u) f(x,y)P,i dx. Proof Let us assume for simplicity that x and y are discrete random variables with a joint distribution P{?c, y); ~e formal generalisation is then clear. In this case E[C(u)iy] =
L C(x, u)P(xiy) ex: L C(x, u)P(x,y) = L C(x, u)P(x)P(yJx), X
X
X
{1)
172
STOCHASTIC DYNAMIC PROG RAMM ING
where the proportionality sign indicates a factor P(y) -I, indep enden t of u. The third of these expressions is the analogue of the integr al expression asserted in the theorem. 0 We give the fourth expression in (1) because P(x) and P(yjx) are often specified on fairly distinct grounds. The conditional distributio n P(y lx) of observation on state is supplied by one's statistical model of the observation process, whose mechanism may be fairly clear. The distribution P(x) constitutes the 'prior distribution' of state and its specification may be debat able; see Exercise 1. Exercises and comments (1) The so-called two-hypothesis two-action case is the simplest, but is both illuminating and useful. Cons ider the pollution exam ple of the text, and suppose that serious pollution of the river, if it occurs, can only be due to a catastrophic failure at a factory upstream. There are then only two 'pollu tion states', that this failure has not occurred or that it has, corresponding to x equal to 0 or 1, say. Denote the prior probabilities P(x) of these by ?rx and the probability density of the observation y conditional on the value of x by fx(y ~ Suppose that there are just two actions: to raise the alarm or not. The cost of raising the alarm is Co or zero according as x is 0 or 1; the cost of not raisin g the alarm is zero or C1 according as xis 0 or 1. It follows then from the last form of the criterion that one should raise the alarm if ?ro/o(y)Co < 7rifi(y)CJ. That is, if the likelihood ratio fi(Y)/fo(y) exceeds the threshold value 7roCo/7ri C1.
(2) Risk-sensitivity and hedging. The new effects that a stochastic element can bring can be demonstrated on a one-stage model. Supp ose that, if one divides a sum of money x so that an amou nt x1 is invested in activi ty j, then one receives a total return of r = E1 CjXJ. That is, c1 is the rate of return on activity j. If the c1 are known then one maximises r by investing the whole sum in an activity for which the return rate Cj is greatest. (If there are severa l activities achieving this rate then the investment can be spread over them, but there is no advantage in such diversification). If the c1 are unkn own, and regar ded as rando m variables, then one maximises the expected retur n by inves ting the whole sum in an activity for which the expected rate of return E( c ) is maximal, where this is an 1 expectation conditional on one's information at the time of the investment decision. However, suppose one chooses rathe r to extremise the expected value of exp( Or), maximising or minimising according as the risk-sensitivity parameter() is positive or negative. This criterion takes account ofvar iability of return as well as of expectation, variability being welcome or unwelcome according as () is positive or negative (the risk-seeking and risk-averse cases; see Chap ter 16).
2 MULTI-STAGE OPTIMISATION
173
For simplicity, let us assume (rather unrealistically) that the random variables c1 are independent. Then one chooses the allocation to maximise '2:.1 B- 1Fj(Bx1 ), where Fj(o:) = log{E[exp(o:cJ)]}. The functions Fj(o:) are convex (see Appendix 3). It follows then that the functions e-t Fj(Bx1) are convex or concave according as one is in the risk-seeking or risk-averse case. In the first case will invest the whole sum in an activity j for which e- 1Fj( Bx) is maximal. In the second case one spreads the investment (hedges against uncertainty) by choosing the x1 so that Fj(Bx1) ~>.,with equality if x1 is positive (j = 1, 2, ... ). Here>. is a Lagrange multiplier chosen so that '2:.1 x1 = x. Hedging is a very real feature in investment practice, and we see that it is induced by the two elements of uncertainty and riskaverseness. (3) Follow through the treatment of Exercise 2 in the case (again unrealistic!) of normally distributed CJ> when Fj(o:) = p1a+!v1a 2• Here 1-LJ and v1 are respectively the mean and variance of c1 (conditional on information at the time of decision). 2 MULTI-STAGE OPTIMISATIO N; THE DYNAMIC PROGRAMMI NG EQUATION If we extend the analysis of the last section to the multi-stage case then we are essentially treating a control optimisation problem in discrete time. Indeed, the discussion will link back to that of Section 2.1 in that we arrive at a stochastic version of the dynamic programming principle. There are two points to be made, however. Firstly, the stochastic formulation makes it particularly clear that the dynamic programming principle is valid without the assumption of state structure and, indeed, that state structure is a separate issue best brought in later. Secondly, the temporal structure of the problem implies properties which one often takes for grap.ted: this structure has to be made explicit. Suppose, as in Section 2.1, that the process is to be optimised over the time period t = 0, 1, 2, ... , h. Let W0 indicate all the information available at time 0; it is from this and the stochastic model that one must initially infer plant history up to time t = 0, insofar as this is necessary. Let x 1 denote the value of the process variable at time t, and X 1 the partial process history {x1,x2 , •.. ,x1 }. Correspondingly let y 1 denote the observation which becomes available and u1 the control action taken at time t, with corresponding partial histories Y 1 and Ut. Let W1 denote the information available at time t; i.e. the information on which choice of u1 is to be based. Then we assume that W1 = {Wo, Y, U1-t}. That is, current information consists just of initial information plus current observation history plus previous control history. It is taken for granted, and so not explicitly indicated, that W1 also implies prescription of the stochastic model and knowledge of clock time t.
174
STOCHASTIC DYNAMIC PROGRAMMING
A realisable policy 1ris one which specifies u1 as a function of W 1 fort = 1, 2, ... , One assumes a cost function C. This may be specified as a function of Xh and Uh-l, but is best regarded simply as a random variable whose distribution, jointly with that of the course of the process and observations, is parametrised by the chosen control sequence Uh-I· The aim is then to choose 1rto minimise E,,.(C). Define the total value function uh-I.
G( W 1)
=
inf ,. E1r[CI Wt],
(2)
the minimal expected cost conditional on information at timet. Here E,. is the expectation operator induced by policy 11: We term G the total value function because it refers to total cost, whereas the usual value function F refers only to present and future cost (in the case when cost can indeed be partitioned over time). G is automatically t-dependent, in that W1 takes values in different sets for different t. However, the simple specification of W1 as argument is enough to indicate this dependence. *Theorem 8.2.1 (The dynamic programming principle) The total value function G( W 1) obeys the backward recursion (the dynamic programming or optimality
equation) (t
= 1, 2, ... )h - 1)
(3)
with closing condition
(4) and the minimising value ofu 1 in (3) is the optimal value ofcontrol at timet.
We prove these assertions formally in Appendix 2. They may seem plausible in the light of the discussion of the previous section, but demonstration of even their formal truth requires a consideration of the structure implicit in a temporal optimisation problem. They are in fact rigorously true if all variables take values in finite sets and if the horizon is finite; the theorem is starred only because of possible technical complications in other cases. That some discussion is needed even of formal validity is clear from the facts that the conditioning variables W1 in (2) and ( W1, u1) in (3) are a mixture of random variables Y 1 which are genuinely conditioning and control histories U1_t or U1 which should be seen as parametrising. Further, it is implied that the expectations in (3) and (4) are policy-independent, the justification in the case of (4) being that all decisions lie in the past These points are covered in the discussion of Appendix 2. Relation (3) certainly provides a taut and general expression of the dynamic programming principle, couched, as it must be, in terms of the maximal current observable We.
3 STATE STRUCT URE
175
3 STATE STRUC TURE The two principa l properties which ensure state structur e are exactly the stochastic analogues of those assumed in Chapter 2 (i) Markov dynamics. It is required that the process variable x should have the property (5) where Xr. U1 are now complete histories. That is, if we conside r the distribu tion of Xr+l conditio nal on process history and paramet rised by control history then it is in fact only the values of process and control variables at time t which have any effect. This is the stochastic analogue of the simply-recursive deterministic plant equation (2.2), and specification of the right-hand member of (5) as a function of its three argumen ts amounts to specification of a stochastic plant equation. (ii) Decomposable cost function. It is required that the cost function should break into a sum of instanta neous and closing costs, of the form
c=
h-1
h-1
1=0
1=0
L c(xl, ul, t) + Ch(xh) = L
Cz
+ ch,
(6)
say. This is exactly the assumption (2.3) already made in the deterministic case. We recall the definition of sufficiency of a variable ~~ in Section 2.1 and the characte risation of x as a state variable if (x 1 , t) is sufficient. These definitio ns transfer to the stochastic case, and we shall see, by an argume nt parallel to that of the deterministic case, that assumptions (i) and (ii) do indeed imply that x is a state variable, if only it is observable. A model satisfYing these assumpt ions is often termed a Markov decision process, the point being that they defme a simply recursive structure. However, if one is to reap the maxima l benefit of this structur e then one must make an observational demand . (iii) Perfect state observation. It is required that the current value of state should be observable. That is, x 1 should be known at the time t when u is to be 1 determi ned, so that W1 =(X,, Ut-i)· As we have seen already in the deterministic case, assumpt ion (i) can in principl e be satisfied if there is a description x of the system which is detailed enough that it can be regarded as physically complete. Whethe r this detailed description is immediately observable is another matter, and one to which we return in Chapter s 12 and 15. We follow the pattern of Section 2.1. Define the future cost at time t h-1
C1 = Lcr+C h r=l and the value function
F( W1)
= inf EII"[C1 1Wz] 7r
(7)
l 176
I
STOCHASTIC DYNAMIC PROGRAMMING
so that G{ W1) = I:~~~ Cr + F ( W1). Then the following theorem spells out the sufficiency of et = (xt, t) under the assumptions above. Theorem 8.3.1 Assume conditions ( i)-( iii) above. Then (i) F(W1) is a function ofx1 and t alone. If we write it F(x1 , t) then it obeys the dynamic programming equation
(t
~h)
(8)
with terminal condition
(9) (ii) The minimising value ofu1 in (8) is the optimal value ofcontrol at timet, which is consequently also a function only ofx 1 and t. Proof The value of F(Wh) is Ch(xh), so the asserted reduction of Fis valid at time h. Assume it valid at time t + 1. The general dynamic programming equation (3) then reduces to F( W 1)
c(x1 , u 1, t) + E[F(xt+l, t + l)IX1, U1]} = inf{ u,
(10)
and the minimising value of u1 is optimal. But, by assumption (i), the right-hand member of (10) reduces to the right-hand member of (8). All assertions then D follow by induction. So, again one has the simplification that not all past information need be stored; it is sufficient for purposes of optimisation that one should know the current value of state. The optimal control rule derived by the minimisation in (8) is again in closed-loop form, since the policy before timet has not been specified. It is in the stochastic case that the necessity for closed-loop operation is especially clear, since continuing stochastic disturbance of the dynamics makes use of the most recent information imperative. At least in the time-homogeneous case it is convenient to write (8) simply as F(·,t)
= !l'F(·,t+ I)
(11)
where !l' is the operator with action !l'¢(x)
u) + E[¢(xt+l)lx1 = x, u = u]}. = inf{c(x, u 1
(12)
This is of course just the stochastic version of the forward operator already introduced in Section 3.1. As then, !l'¢(x1) is the minimal cost incurred if one is allowed to choose u1 optimally, in the knowledge that at time t + 1 one will incur a cost of¢( x 1+1). In the discounted case !l' would have the action !l'¢(x)
u) + ,8E[¢(xr+i)lx1 = x, Ur = u]}. = inf{c(x, u
(13)
4 THE EQUATION IN CONTINUOUS TIME
177
4 THE DYNAMIC PROGRAMMING EQUATION IN CONTINUOUS TIME It is convenient to note here the continuous-time analogue of the material of the last section and then to develop some continuous-time formalism in Chapter 9, before progressing to applications in Chapter 10. The analogues of assumptions (i)-(iii) of that section will be plain; we deduce the continuous-time analogues of the conclusions by a formal passage to the limit. It follows by the discrete-time argument that the value function, the infimum of expected remaining cost from timet conditional on previous process and control history, is a function F(x, t) of x(t) = x and t alone. The analogue of the dynamic programming equation (8) for passage from t to t + 8t is F(x, t) = inf{ c(x, u, t) u
+ E[F(x(t + 8t), t + 8t)lx(t) =
x, u(t) = u]}
+ o(5t). (14)
Defme now the infinitesimal generator A(uJ t) of the controlled process by A(u, t)¢(x)
= lim(8t)- 1{E[¢(x(t + 8t))lx(t) 8t!O
= x,u(t) = u]- ¢(x)}.
(15)
That is, there is an assumption that, at least for sufficiently regular ¢( x ), E[¢(x(t + 8t))lx(t) = x, u(t)
= u] = ¢(x) + [A(u, t)¢(x)]8t + o(8t).
The form of the term of order 8t defines the operator A; to write the coefficient of 8t as A(u, t)¢(x) emphasises that the distribution of x(t + 8t) is conditioned by the valuex of x(t) and parametrised by t and the value u of u(t).We shall consider the form of A in some particular cases in the next chapter. *Theorem 8.4.1 Assume the continuous-time analogues of conditions (i)-(iii) of the Section 3. Then x is a state variable and the value function F (x, t) obeys the dynamic programming equation
i~f [c(x, u, t) + BF~~, t) + A(u, t)F(x, t)]
= 0
(t
(16)
The minimising value ofu is the optimal value ofu( t).
Equation (16) follows formally from equation (14) in the limit of small 5t. We have starred the theorem because, to the technical complications which can arise when x and u may vary in infinite sets and when also h may later allowed to be infinite are added those which may appear in the passage to continuous time. However, any invalidity of (16) in particular cases is usually well signalled by one's realisation of some anomaly in the mathematical representation of the physical picture.
178
l
STOCHASTIC DYNAMIC PROGRAM MING
If costs are discounted at rate a then equation (16) becomes
it![c-aF +
0:: +AF] =0
(t
(17)
Here the arguments of the various functions have been left understood , as is often possible without ambiguity.
I
CHAP TER 9
Stochastic Dyna mics in Continuous
Time In general we assume the reader acquainted with basic probability theory. However, we shall devote a short chapter to continuous-time stochastic processes, since these show special features and are of special relevance in control contexts. Some of the material also prepares the way for that of Chapter 22. We assume a state-structured model with state variable x, but shall for simplicity consider only the time-homogeneous uncontrolled case. These limitations are easily removed later, by simply making the various rates which occur depend also upon t and u.
1 JUMP PROCESSES One extreme case is that in which the state space flt is discrete, so that changes in state must be discontinuous. For example, if x is the size of a queue then x can only take the values 0, 1, 2, ... and a transition (from x to x ± 1) takes place whenever a customer arrives or leaves. Similarly, if we consider a model of a population which is recognised as being made up of individuals, then populatio n size can only be integral. Notation is simplified if we assume that the values of state x are integral, taking values j = 0, 1, 2, ... , say. If one later wishes to attach other numerical values to the states then there is no trouble in doing so. Suppose now that
P[x(t + 6t) = kjx(t) = j] = Ajk8t + o(6t)
(k
i= j)
(I)
for small positive &. This is a regularity condition which turns out to be selfconsistent. The quantity >yk is termed the probability intensity of transition from j to k, or simply the transition intensity. The assumption itself implies that the transition has been a direct one: the probability of its having occurred by passage through some other state is of smaller order in 8t (see Exercise 1).
Theorem 9.1.1 The process with transition intensities >yk has infinitesimal generator A with action
A¢(j)
= _E>.ik[¢ (k)- ¢(j)]. k
(2)
180
1
STOCHASTIC DYNAMICS IN CONTINUOUS TIME
i
Proof It is a consequence of (1) that E[¢(x(t + &)) - ¢(x)lx(t) = j) =
L Ajk[¢(k)- ¢(j)]8t + o(6t) k#j
whence (2) follows, by the definition of A To include the case k summation plainly has no effect.
= j in the D
We can return to the controlled time-dependent case by making the transition intensity a function >..1k(u, t) of u and t, when the dynamic programming equation (8.16) becomes
i~f [c(j, u, t) +oF~, t) + ~ Ajk(u, t)[F(k, t)- F(j, t)]l
= 0.
(3)
Exercises and comments (1) It follows from (1) and the Markov character of the process that
P[x(t + 8t) = i,
x(t + 2 8t) = klx(t) = j] = AJiAik(8t) 2 + o[(8tf]
fori distinct from bothj and k. This at least makes it plausible that the probability of multiple transitions in an interval oflength 8t is o( 8t).
2 DETERMINISTIC AND PIECEWISE DETERMINISTIC PROCESSES The deterministic model for which x is a vector obeying the plant equation x = a(x) is indeed a special case of a stochastic model. The rate of change of ¢(x) in time is ¢xa, so that the infinitesimal generator A has the action
A¢(x)
= ¢x(x)a(x)
(4)
where ¢xis the row-vector of differentials of¢ with respect to the components of x. Consider a hybrid of this deterministic process and the jump process of the last section, in which the x-variable follows deterministic dynamics x = aJ(x) in the jth regime, but transition can take place from regime j to regime k with intensity >..1k(x). Such a process is termed a piecewise deterministic process. The study of such processes was initiated and developed by Davis (1984, 1986, 1993). For example, if we consider an animal population, then statistical variability can occur in the population for at least two reasons. One is the intrinsic variability due to the fact that the population consists of a finite number of individuals: demographic stochasticity. Another is that induced by variability of climate, weather etc.: environmental stochasticity. If the population is large then it is the second source which is dominant: the population will behave virtually deterministically under fixed environmental conditions. If we suppose, for
3 THE DERIVATE CHARACTERISTIC FUNCTION
181
simplicity, that the environmental states are discrete, with well-defined transition intensities, then the process is effectively piecewise deterministic. In such a case the state variable consists of the pair (j, x): the regime labelj and the plant variable x. We leave it to the reader to verify that the infinitesimal generator of the process has the action
+ LAik(x)[¢(k,x)- ¢(j,x)].
A¢(j,x) = ¢x(j,x)aj(x)
k
3 THE DERIVATE CHARACTERISTIC FUNCTION Recall that the moment-generating function (abbreviated MGF) of a random column vector xis defined as M(a) = E(e=), where the transform variable a is then a row vector. Some basic properties of MGFs are derived in Appendix 3. One would define M(iO) as the characteristic function of x; this always exists for real 0. The two definitions then differ only by a 90° rotation of the argument in the complex plane, and it is not uncommon to see the two terms loosely confused. · Suppose from now on that the state variable x is vector-valued. We can then define the function (5) of the column vector x and the row vector a It is known as the derivate characteristic function (abbreviated to DCF). We see, from the interpretation of A following its definition (8.15), that H has the corresponding interpretation
E(ea6xlx(t)
=
x)
=
1 + H(x, o:)8t + o(8t),
{6)
where 8x = x(t + 8t) - x(t) is the increment in x over the time interval (t, t + 8t]. The DCF thus determines the MGF of this increment for small 8t; to this fact plus the looseness of terminology mentioned above it owes its name. For example, consider a process for which x can jump to a value x + dj(x) with probability intensity Aj(x) (j = 1, 2, ... ) • For this the infinitesimal generator has the action A¢(x) = L Aj(x)[¢(x + tlj(x)) - ¢(x)]
(7)
j
and the DCF has the evaluation
H(x,a)
=
LAj(x)[eadj(x) -1].
(8)
j
Comparing these last two relations we see that we can make the assertion, at least for processes of this type:
182
STOCHASTIC DYNAMICS IN CONTINUOUS TIME
*Theorem 93.1 Relation (5) between the derivate characteristic function Hand the infinitesimal generator A has the inversion
A= H(x,alax)
(9)
if it is understood that, when we form A¢(x) = H(x, alax)¢(x), the differential operator a 1ax acts only on the x-argument of¢.
This indeed follows from comparison of (7), (8) and appeal to the 'Taylor' formula e(8f8x)d¢(x)
= ¢(x +d),
Here (8/ ax )d we mean the inner product L:k dk8 I Oxk' where dk and Xk are thekth components of the vectors d and x respectively. We shall return to the important identity (9) in Section 22.3. 4 PROCESSES OF INDEPENDEN T INCREMENTS , AND PROCESSES DRIVEN BY THEM A process {x(t)} is one of independent increments if the increments in x over disjoint time intervals are statistically independent. If, in addition, the distribution of the increment depends only upon the length of the interval then it is a homogeneous process of independent increments (abbreviated HPII). Such processes are important because their 'time derivative' (should it exist) provides the continuous-time equivalent of a sequence of liD random variables. The abbreviation HPII is not a standard one, but it will serve us for this and the next section. Let us define the MG F
M(a, t) =
E[e"[x(t)-x(O)l].
Then the HPII property is obviously equivalent to the property
M(a, *Theorem 94.1
t1
+ tz)
= M(a, tt)M(a, tz)
(10)
.lf{x( t)} is an HPJ! then there exists a function '1/J( a) such that M(a, t) = e1'¢(a)_
(11)
Furthermore
{12) for prescribedfunctions a( t)for which the right-hand member is defined.
4 PROCESSES OF INDEPENDENT INCREMENTS
183
*Proof Relation (11) is indeed a consequence of (10). It follows from (11) and the independence of increments that, if { t1} is an increasing sequence of timepoints, then E [exp{
~ aj[x(tJ)- x(tJ-d}] = exp{ ~(tj- tJ-d'!fJ(aJ) }·
Relation (12) then follows by a limiting argument.
0
*Theorem 9.4.2 The process {x( t)} is an HPII ifand only ifH(x, a) is independent ofx, so that A= '!fJ(8/8x), (13) say. One can then identify 1/J with the 1/J of(11). *Proof If the process is an HPII so that (11) holds then one indeed verifies that H (x, a) = 1/J( a), implying (13). To establish the reverse conclusion, define the conditional MGF M(a,x, t) = E[eru:(tllx(O) = xl This obeys the backward equation 8 01 M(a,x, t)
= 'l/J(8/8x)M(a,x, t)
which has a solution
M( a, x, t) = eru:+h/J(a).
(14)
But relation (14) is sufficient to establish both relation (10) and the representa~~
0
So, H (x, a) = 1/J( a) is both necessary and sufficient for the HPII property, with . the 1/J( a) of (11) identified with the DCF. It is natural to ask: what functions 1/J are possible? which are interesting? In fact, the HPII property implies that exp[.,P(a)] must be the MGF of an infmtely divisible distribution. The Uvy-Khintchine theorem then implies that an HPII process must be a superposition of independent Poisson and Wiener processes. It is a quick matter to describe these processes and verifY the HPII property for them. The simplest HPII process is the Poisson process. For this x takes only nonnegative integer values, the only transitions are of the form x --+ x + 1 and these all have the same intensity A. One could regard x(t)- x(O) as the number of events which have taken place in the time interval (0, t], if events take place independently with probability intensity A. These events might be insurance claims, traffic accidents or arrivals of cosmic particles in a chamber. The DCFofthe process is 1/J(a) = A(e0 -1). We see then from (11) that the distribution of the number of events in a time interval of length t is Poisson with expectation At, whence the name of the process. One speaks of the events occurring in a Poisson stream ofrate A.
184
STOCHASTIC DYNAMICS IN CONTINUOUS TIME
We can generalise the process by superimposing Poisson streams. Conside r the scalar process { w( t)} generated by w(t) = Laixi( t)
(15)
j
where the ai are constants and the {xj(t)} are independent Poisson processes of respective rates Aj. For example, the jth stream might be interpretable as a stream of particles each carrying charge ai; the variable w(t) is then the total charge which has accumulated by time t. Such a process is termed a compound Poisson process. It is plainly HPII with DCF 1/J(a)
=L
.Xj(eaaJ- 1).
(16)
j
One obtains an interesting class of models by drivfng a deterministic model by anHPII . Theorem 9.4.3 Consider the vector process {x(t)} generated by the different ial equation ~itten in incrementa/form) 8x = a(x) 8t + b(x) 8w
(17)
where { w( t)} is a vector-valued homogeneous process ofindependent increments with DCF'IjJ(a} Then {x(t)} isaMark ovproce sswithD CF H(x, a) = aa(x)
+ 1/J(ab{x)). •
(18)
The assertions follow immediately if one regards relation (6) as providing the effective definition of the DCFan d inserts expression (17) for 8x into it. Th.e 'plant equation' (17) is interesting as the continuous-time version of a process driven by 'noise' which is totally random, in the sense that its values at different times are statistically independent. One is tempted to rewrite it as
x = a(x) + b(x)e
(19)
where the 'noise' variable e is the time derivative of w. The process {w(t)} generated by (15) plainly does not have a derivative in the usual sense. If one admitte d the idea one would have to regard it as a random sequence of impulse s, those of magnitude ai occurring with intensity Aj, independently of other impulses. This is the 'shot noise' process formulated by physicists early in the century. The concept is proper in that integrals of shot noise are proper; one can write (12) as E[exp {j a(t)e(t) dt}] =exp{ j 1/J(a{t))dt}.
(20)
, .l
5 WIENER PROCESSES (BROWNIA N MOTION)
185
5 WIENER PROCESSES (BROWNIAN MOTION), WHITE NOISE AND DIFFUSION PROCESSES Ifwe consider a sequence of shot noise processes in which the impulses become ever weaker but ever more frequent then we reach the classic limit process associated with Brownian motion and white noise.
*Theorem 9.5.1 Consider the scalar HPII {w(t)} with DCF 'lj;(a.) = (eaa + e-aa)/2a2.
(21) The limit process for small a is then an HPII with DCFand infinitesimal generator
A= 'l/J(ofox) =
!(:x)
2
(22)
Further, since E[exp{/ a.(t)dw(t) }] =E[exp {/ a.(t)€(t)dt }] =exp{!/ a.(t) 2 dt}. (23)
increments ofthe process are jointly normally distributed. The assertions are formally immediate. They constitute in fact a process form of the central limit theorem, given operational meaning by the fact that the argument can be phrased in terms of the statistics of well-behaved functionals such as the integral Ja.(t) dw(t). We recognise in (21) a compound Poisson process consisting of two streams, each of rate 1j2a2 and carrying 'charges' of size a and opposite signs. If a is small then the charge carried is small, but particles arrive fast. The 'derivative' € of the limit process is an infinitely dense sequence of infinitesimal independe nt impulses, whose integral has zero expectation but is normally distributed. The limit process { w( t)} is a time-homogeneous process of independe nt increments which also has the property of being Gaussian. As a function of time it in fact continuous, and is the only continuous HPII. It is in all senses classical: sometimes referred to as Brownian motion and written B(t); sometimes referred to as the Wiener process and written W(t). Its formal differential f(t) is termed 'white noise' by engineers for reasons explained in Section 13.2 It is improper in having infinite variance, but proper in that integrals of white noise are proper, as we see in (23). The argument of Theorem 9.5.1 can be repeated in a vector version. We then arrive at the notion of a vector Wiener process w(t) and vector white noise f(t) with the property that the linear form Jo:(t) dw(t) = Ja.(t)€(t) dt is normally distributed with zero mean and covariance matrix
186
STOCHASTIC DYNAMICS IN CONTINUOUS TIME
cov(/ a(t)e(t) dt)
=
J
a(t)Na(t )T dt
(24)
for some non-negative definite matrix N, at least for functions a(t) such that this last integral is convergent. Relation (24) implies that
( 1
cov (6t)- 1
t+6t
1
•1 .I
)
e(r) dr = Nj6t,
which has an infinite limit as 6t tends to zero. This is an indication that e itselfhas an irregular character; its components have infinite variance. Such a conclusion is inevitable if one requires property (24) for non-zero N. The matrix N indeed gives the statistical scale of e but is not its covariance matrix. It is more appropriately to as the power matrix. As we have seen, one can derive white noise as the 'limit' of realistic physical processes. Mathematically, the concept is regularised in the so-called theory of generalised processes (see e.g. Gihman and Skorohod, 1972, 1979} White noise is to conventional processes what the Dirac 6-function is to conventional functions. Like the 6-function, it can be seen as a limit of conventional processes, and the operational characterisation in terms oflinear functionals above is analogous to the operational characterisation I a(t)6(t) dt = a(O) of the 6-function. Initial discussions ofthe a-function took the circumspect route of writing I a(t)6(t) dt as I a(t)d.H(t) where the Heaviside function H(t) is the formal integral of the 6-:-function; the function which is constant except for a single unit step at the origin. This is now seen as a crutch which can be kicked away, but the same caution leads to the writing of I a(t)e(t) dt as I a(t) dw(t) where {w(t)} is a Wiener process. The noise-driven processes (19) which are most considered, for various reasons, are those for which the driving noise e is white; such a process is termed a diffusion process for reasons which will soon become clear. If one takes a ·deterministic process as a first crude approximation to a given stochastic process, then there are good reasons for taking a diffusion process as the second; see Section 22.3. We can specialise Theorem 9.4.3 to the diffusion case.
J
'I I I
:I
I
J
Il 1
Theorem 95.2 Consider the vector process {x(t)} generated by the differential equation 6x = a(x)t + b(x) 6w
!
(25)
or
x=
a(x) + b(x)e
(26)
where e is a white noise process (wa Wiener process) ofpower matrix N Then {x( t)} is a Markov process with DCF
·I
whereN(x)
5 WIENER PROCESSES (BROWNIAN MOTION)
187
H(x, a) = aa(x) +! aN(x)aT
(27)
= b(x)Nb(x)T.
We see from (13) and (27) that the generator of the process has the action A¢ = ¢xa(x)
+! tr[N(x)¢xx]
(28)
where
+ F + Fxa(x, u, t) +! 1
tr(N(x, u, t)Fxx)]
= 0.
(29)
CHAPTERJO
Some Stochastic Examples 1 LQG REGULATION WITH PLANT NOISE Let us return to the LQ regulation example of Section 2.4 with the sole modification that the plant equation is changed to (1) where e1 is a random disturbance term. This term would often be referred to as plant noise-'plant' because it occurs in the plant equation and 'noise' because in audio-electronic contexts the disturbance manifests itself as a rush of background noise. Plant noise is indeed often to be regarded as random in that it is only partially predictable. The thermal noise of electronic circuits or the gusts to which an aircraft is subject provide examples of disturbances which are scarcely predictable at all. The mention of predictability is a reminder that the stochastic character of the noise must be specified. A common starting assumption is that the noise is as unpredictable as possible; that it is white in that the distribution of e1 conditional on process history before time tis independent ofboth process history and t. This implies that the process has the Markov property (i) of Section 8.3. It also implies that the variables e1 are independently and identically distributed (See Sections 9.5 and 13.2 for a discussion of the special character of white noise in continuous time and of the reason for the description 'white~ Process history before time t is specified by (Xr-1 , Ur-11 since this determines e,. for T < t. Actually, we do not need to demand complete 'whiteness', but simply that
(2) for a constant matrix N That is, that the conditional mean of e1 should be constant (and normalised to zero, by convention) and that the conditional covariance matrix of e1 should also be constant. A stronger assumption which is often made is that the noise should be Gaussian as well as white, i.e. that the noise variables should be jointly normally distributed. The resultant model is then termed an LQG model, indicating linear dynamics, quadratic costs and Gaussian noise. However, as far as discrete-time models are concerned, we need the Gaussian assumption first when we come to consider imperfect state observation in Chapter 12.
190
SOME STOCHASTIC EXAMPLES
Theorem 10.1.1 Assume the stochastic plant equation specified by (1), (2) and the quadratic cost structure specified in equations (2.21) and (2.23). Then the value function has the form F(x, t) = !xTII,x + 6,
(3)
and the optimal control the form
(4) Here the matrices II1 and K 1 have the same evaluations as in Theorem 2.4.2 and 61 = Dt+t
+! tr(NII,+l)
(5)
with terminal condition Dh = 0. That is, the matrix II1 is the same solution of the Riccati equation (2.25) as before, and K1 is determined from it by the same relation (2.28) as before. Indeed, the control rule (4) is exactly what it was in the noise-free case, and the only effect of the noise is to add a term 61 to the cost. The interpretation is that the quadratic term in (3) represents the cost caused by the initial deviation of state x from zero and the term 61 represents the cost caused by the continued disturbance of state by noise. In the infinite-horizon limit (if one exists) the presence of noise induces a cost of! tr( NII) per unit time.
Proof The course of the proof is inductive, as ever. Relation (3) holds at time
h. Assume then that it holds at time t + 1, so that
F(W1)
= inf{c(x, u) + !E[(Ax + Bu + et+t)TIIt+t(Ax + Bu + t:r+t)IX1,Ur]} "
where x, u and e are the values holding at time t. Assumptions (2) imply that the conditional expectation in this relation reduces to
!(Ax+ Bu) TIIt+l (Ax+ Bu) whence all assertions follow.
+! tr(NIIr+l) 0
The admission of plant noise with the 'second-order white' properties (2) thus affects the optimal control rule not at all (at least, if this is expressed in closed-loop form) and increases costs only by a term independent of state or policy. It may seem strange that, for our flrst stochastic example, the optimal control was unaffected (at least in its closed-loop form) by the presence of the stochastic element. This is a consequence of the very special assumptions. The essence of optimal control generally is that one exploits the statistical characteristics of
1 LQG REGULATION WITH PLANT NOISE
191
noise variables to the full. For example, suppose that the plant equation (1) is modified to = Axr-! +Bur-!
+ v1
(6) where v 1 is a stochastic noise signal of some more general form. One will often suppose that it can be generated as the output v1 = G(ff)Er of a linear filter driven by white noise, so that one still has in effect a system driven by white noise, but more complex than (1). This can be coped with in various ways, but the most insightful is that which follows from the certainty equivalence principle of Chapter 12. This principle implies that, under some further assumptions on the linear/Gaussian nature of observations, the optimal control at time t for system (6) would be the same as that for the deterministic system X1
(r > t),
(7)
where v~) is an appropriate linear predictor of v 7 based on information available at timet. That is, at timet one replaces future stochastic noise v7 (r > t) by an 'equivalent' deterministic disturbance v~r) and then applies the methods of Sections 2.9 or 6.3 to deduce the optimal feedback/feedforward control in terms of this predicted disturbance. We shall see in Chapter 12 that similar considerations hold if the state vector xis itself not perfectly observable. It turns out that E~) = 0 (1 > t) for a white-noise input E which has been perfectly observed up to time t. This explains why the closed-loop control rule was unaffected in case (1). Once we drop LQG assumptions then treatment of the stochastic case becomes much more difficult. For general non-linear cases there is not a great deal that can be said. We shall see in Section 7 and in Chapter 24 that one can treat some models for which LQG assumptions hold before termination, but for which rather general termination conditions and costs may be assumed. Some other models which we treat in this chapter are those concerned with the timing of a single definite action, or with the determination of a threshold for action. For systems of a realistic degree of complexity the natural appeal is often to asymptotic considerations: e.g. the 'heavy traffic' approximations for queueing systems or the large-deviation treatment oflarge-scale systems.
Exercises and comments (1) Consider the closed- and open-loop forms of optimal control (2.32) and (2.33) deduced for the simple LQ problem considered there. Show that if the plant equation is driven by white noise of variance N then the additional cost incurred from time t = 0 is D QN E;,;;-ci (Q + sD) -I or hDN according as the closed- or the open-loop rule is used. These then grow as log h or as h with increasing horizon.
192
SOME STOCHASTIC EXAMPLES
2 OPTIM AL EXERCISE OF A STOCK OPTIO N
As a last discrete-time example we shall consider a simple but typical financia l optimisation problem. One has an option, although not an obligation, to buy a share at price p. The option must be exercised by day h. If the option is exercised on day t then one can sell immediately at the current price x 1, realising a profit of x 1 - p. The price sequence obeys the equation Xt+l = x 1 + E1 where the € 1 are independently and identically distributed random variables for which EJEJ < oo. The aim is to exercise the option optimally. The state variable at time t is, strictly speaking, x 1 plus a variable which indicates whether the option has been exercised or not. However, it is only the latter case which is of interest, sox is the effective state variable. If F (x) is the 3 value function (maximal expected profit) with times to go then Fo(x) = max{x - p, 0} = (x- p)+ and
Fs(x) = max{x - p,E[Fs-! (x +E)]}
(s = 1,2, ... ).
The general character of Fs(x) is indicated in Figure 1; one can establish the following properties inductively: (i) Fs(x)- xisnon- increas inginx; (ii) Fs(x) is increasing in x; (iii) Fs(x) is continuous in x; (iv) Fs(x) is non-decreasing ins. For example, (iv) is obvious, since an increase in s amounts to a relaxation of the time constraint. However, for a formal proof: F, (x) = max{x - p, E[Fo(x +E)]};?: max{x - p, 0} = Fo(x),
Figure 1 The value function at horizon sfor the stock option example.
3 A QUEUEING MODE
193
whence Fs is nondecreasing ins, by Theorem 3.1.1. Correspondingly, an inductive proof of (i) follows from
Fs(x)- x =max{ -p, E[Fs-1 (x +e)- (x +e)]+ E(e)}. We then derive
Theorem 10.2.1 There exists a non- decreasing sequence {as} such that an optimal policy is to exercise the option first when x ~ as. where x is the current price and s is the number ofdays to go before expiry ofthe option. Proof From (i) and the fact that Fs(x) ~ x- pit follows that there exists an as such that Fs(x) is greater than x- p if x < as and equals x- p if x ~ as.It follows from (iv) that as is non-decreasing in s. 0 The constant as is then just the supremum ofvalues of x for which Fs(x)
>
x-p.
3 A QUEUEING MODEL Queues and systems of queues provide a rich source of optimisation models in continuous time and with discrete state variable. One must not think simply of the single queue ('line') at a ticket counter; computer and communication systems are examples of queueing models which constitute a fundamental type of stochastic system of great technological importance. However, consideration of queues which feed into each other opens too big a subject; we shall just cover a few of the simplest ideas for single and parallel queues in this section and the next chapter. Consider the case of a so-called M/M/1 queue, with x representing the size of the queue and the control variable u being regarded as something like service effort. If we say that customers arrive at rate A and are served at rate p,(u) then this is a loose way of stating that the transition x -+ x + 1 has intensity Aand the transition x-+ x- 1 has intensity p,(u) if x > 0 (and, of course, intensity zero if x = 0). We assume the process time-homogeneous, so that the dynamic programming equation takes the form
h![c(x,u)+ BF~:,t) +A(u)F(x,t)] =0.
(8)
Here the infinitesimal generator has the action (cf. (9.2))
A(u).p(x) = A[.P(x + 1) - .P(x)] + p,(u, x) [.P(x- 1) - .P(x)] where p,(x, u) equals p,(u) or zero according as xis positive or zero. If we were interested in average-optimisation over an infinite horizon then equation (8) would be replaced by
194
SOME STOCHASTIC EXAMPL ES
1 = inf(c(x, u) u
+ A(ulf(x)]
(9)
where .\ and f(x ) are respectively the average cost and the transien t cost associated with the average-optim al policy. In fact, we shall con cer n ourselves more with the question of optimal allocation of effort or of customers between several queues tha n with optimi sation of a single queue. In preparation for this , it is useful to solve (9) for the unc ontrolled case, when JL(u) reduces to a con stant f.L and c(x, u) to c(x). In fact , we shall assume the instantaneous cost pro portional to the num ber in the que ue, so that c(x) = ax. We leave it to the reader to verify tha t the solution of equation (9) is, in this reduced case,
a>.
!=--,, f.J, -A
f(x ) =
a f.L- .\
x(x + 1) 2
(10)
We have assumed the normalisat ion f(O) = 0. The solution is, of course, valid only if>. < f.L, for it is only then tha t queue size is finite in equilibrium .
4 THE HARVESTING EXAMPLE : A BIRTH-DEATH MODEL
Recall the deterministic harvesting model of Section 1.2, which we sha ll generally associate with fisheries, for definite ness. This had a scalar state variabl e x, the 'biomass' or stock level, which foll owed the equation
.X= a(x )- u.
(11)
Here 1J is the harvesting rate, which (it is supposed) may be varied as des ired. The rate of retu rn is also supposed pro portional to u, and nor mal ised to be equ al to it. (The model thus neglects two ver y imp orta nt elements: the age stru ctur e of the stock and the x-dependence of the cost ofharvesting at rate uJ We suppose again tha t the functio n a(x), the net reproduction rate of the unharvested population, has the form illustrated in Figure 1.1; see also Figure 2. We again denote by Xm and .xo the values at which a(x) is respectively maximal ·an d zero. An unharvested pop ula tion would thus reach an equilibrium at x = xo. We know from the discussion of Section 2.7 tha t the optimal policy has the threshold form: u is zero for x ~ c and takes its maximal value (M, say) for x > c. Here c is the threshold, and one seeks now to determine its optima l value. If a(x) > 0 for x ~ c and a(x )- M < 0 for x > c then the harvested pop ulation has the equilibrium value c and yields a return at rate 1 =a ( c). If we do not discount, and so choose a thresho ld value which maximises this ave rage retu rn A, then the optimal threshold is the valu e Xm which maximises a( c). A threshold policy will still be optimal for a stochastic model und er corresponding assumptions on birt h and death rates. However, there is an effect which at first sight seems remark able. If extinction of the pop ula tion is impossible then one will again choose a threshold value which maximises average
4 THE HARVESTING EXAMPLE: A BIRTH-DEATH MODEL
., I
I
I I I
195
return, and we shall see that, under a variety ofassumptions, this optimal threshold indeed approaches Xm as the stochastic model approaches determinism. However, if extinction is certain under harvesting then a natural criterion (in the absence of discounting) is to maximise the expected total return before extinction. It then turns out that the optimal threshold approaches x0 rather than Xm as the model approache s determinis m. There thus seems to be a radical discontinuity in optimal. policy between the situations in which the time to extinction is finite or infinite (with probability one, in both cases). We explain the apparent inconsistency in Section 6, and are again led to a more informed choice of criterion. We shall consider three distinct stochastic versions of the model; to follow them through at least provides exercise in the various types of continuou s-time dynamics described in the last chapter. The first is a birth-deat h process. Letjbe the actual number offish; we shall set x = j JK where K is a scaling parameter, reflecting the fact that quite usual levels of stock x correspon d to large values of j. (We are forced to assume biomass proportion al to population size, since we have not allowed for an age structure). We shall suppose thatj follows a continuous-time Markov process on the nonnegative integers with possible transitions j --t j + I and j --t j - 1 at respective probability intensities >.i and f..ti· These intensities thus correspond to populatio n birth and death rates. The net reproduction rate >.i - f..ti could be written ai, and correspon ds to Ka(x). Necessarily f..to = 0, but we shall suppose initially that >.o > 0. That is, that a zero population is replenished (by a trickle of immigration, say), so that extinction is impossible. Let 7rj denote the equilibrium distribution of population size; the probability that the population has size j in the steady state. Then the relation 7rjAj = 1ri+1f..ti+ 1 (expressing the balance of probability flux between states j and j + 1 in equilibrium) implies that 7rj ex: Pi> where
Pi =
>.o>.1 ... >-i-1 f..tlf..t2 ••• /-tj
(
j = O, 1, 2, . . .),
(12)
(with Po = 1, which is consistent with the convention that an empty product should be assigned the value unity). A threshold c for the x-process implies a threshold d ~ Kc for the j-process. For simplicity we shall suppose that the harvesting rate M is infinite, alth()ugh the case of a finite rate can be treated almost as easily. Any excess of population over d is then immediately removed and one effectively has Ad = 0 and Pi = 0 for j >d. The average return (i.e. expected rate of return in the steady state) on thexscale is then
(13)
196
SOME STOCHASTIC EXAMPLES
the term 1rd>.d representing the expected rate at which excess of population over c is produced and immediately harvested. Suppose now that the ratio 91 = J.i.JIAJ is effectively constant (and less th~ unity) for j in the neighbourhood of d. The effect of this is that 'lrd-J ~ 7rd()~ (j ~d), so the probability that the population is an amo untj below threshold falls away exponentially fast with increasingj. Form ula (13) then becomes 1 ~ ~~;- 1 ~(1- 9d)
= ~~;- 1 (>.d- JJ.d) = K- 1ad =a( c).
The optimal rule unde r these circumstances is then indeed to choose the threshold c as the level at which the net reproducti on rate is maximal, namely, Xm. This argument can be made precise if we pay attention to the scaling. The nature of the scaling leads one to suppose that the birth and death rates are of the forms >.1 = ~~;)..(jj~~;), J.£J = KJJ.(jf~~;) in terms of functions A(x) and p,(x), corresponding to the deterministic equation x = >.(x) - J.£(x) = a(x) in the limit of large l'i.. The implication is then that A]/ p, varies slowl y withj if~~; is large, with the 1 consequence that the equilibrium distributio n of x falls away virtually exponentially as x decreases from d = cf~~;. The details of this verbal argument are easily completed (although anomalous behaviour atj = 0 can invalidate it in an interesting fashion, as we shall see). The theory of large deviations (Chapter 22) deals precisely with such scaled processes, in the range for which the scale is large but not yet so large that the process has collapsed to determinism. Virtually any physical system whose complexity grows at a smaller rate than its size generates examples of such processes. Suppose now that >.o = 0, so that extinction is possi ble (and indeed certain if, as we shall suppose, passage to 0 is possible from all states and the population is harvested above some ftnite threshold,) Expressio n (12) then yields simply a distribution concentrated onj = 0. Let Fj be the expected total retur n before extinction conditional on an initial population ofj. (It is unde rstoo d that the policy is that ofharvesting at an infinite rate above the prescribed threshold value ~ The dynamic programming equation is then
(O< j
(15)
5 BEHAVIOUR FOR NEAR-DETERMINISTIC MODELS
197
where
(16) Proof We can write (14) as >..p!:l.j+l = p,16.1 where 6.1 = Fj- Ff-t· Using this equation to determine 6.1 in terms of .6.d+1 = I and then summing to determine Fj, we obtain the solution (15). D We see from (15) that the d-dependence of the Fj occurs only through the common factor lid, and the optimal threshold will maximise this. The maximising value will be that at which Ad/ JLd decreases from a value above unity to one below, so that ad = Ad - JLd decreases through zero. That is, the optimal value of c is xo, the equilibrium level of the unharvested population. More exactly, it is less than x 0 by an amount not exceeding K:- 1. This means in fact a very low rate of catch, even while the population is viable. The two cases thus lead to radically different recommendations: that the threshold should be set near to Xm or virtually at x 0 respectively. We shall explain the apparent conflict in the next two sections. It turns out that the issue is not really one of whether extinction is possible or not, but of two criteria which differ fundamentally and are both extreme in their way. A better understanding of the issues reveals the continuum between the two policies. Exercises and comments (1) Consider the naive policy first envisaged in Chapter 1, in which the population was harvested at a flat rate u for all positive x. Suppose that this translates into an extra mortality rate of v per fish in the stochastic model. The equilibrium distribution 1T"J of population size is then again given by expression (12), once normalised, but with JLJ now modified to JLJ + v. The roots Xt and x2 of a(x) = u, indicated in Figure 1.2 and corresponding to unstable and stable equilibria of the deterministic process, now correspond to a local minimum and a local maximum of 1T"j. It is on this local maximum that faith is placed, in that the probability mass is supposed to be concentrated there. However, as v (and so u) increases this local maximum becomes ever feebler, and vanishes altogether when u reaches the critical value Urn = a(xm)· 5 BEHAVIOUR FOR NEAR-DETERMINISTIC MODELS In order to explain the apparent stark contradiction between the two policies derived in Section 4 we need to obtain a feeling for orders of magnitude of the various quantities occurring as K: becomes large, and the process approaches determinism. We shall follow through the analysis just for the birth-death model
198
SOME STOCHASTIC EXAMPLES
of the last section, but it holds equally for the alternative models of Sections 8 and 9. Indeed, all three cases provide exam ples of the large deviation theo ry of Chapter22. Consider then the birth--death mod el in the case when extinction is possible. Since mos t of the time before extinction will be spent near the thre shol d value if r;, is large (an assertion which we shall shor tly justify) we shall cons ider only Ftt. the expected yield before extinction cond ition al on an initial valu ed ofj. Let 'Fj denote the expected time befo re extinction which is spen t in state j (conditional on a star t from d ). Then, by the same meth ods which led to the evaluation (15), we find that
'Fj = rrd
[t k=!
1-"11-"2 .. ·1-"k-1] )qA2 ... Ak-1
[~-"J+IPJ+2 . . ·1-"d]. .\.tAJ+l ... Ad
which is consistent with expression (15) for Fd = AdFd. Whe n we see the process in term s of the scaled variable x Fj = "'F( x) and Tj = T(x). Ifwe defi ne
R(x)
=
1x
log[A(y)l p(y)J dy
(17)
= j I"' we shall write (18)
then we deduce from expression (15) that
F(c) =
ex:R(c)+o(~<)
(19)
for large "'·We thus see that F( c) grows expo nentially fast with"'· Inde ed, the same holds true for the occupation times: T(x) = e~
Tj
j J.LlJ.L2 ••. J.Lk-1 ] = (PJIAo) [ 2.:: AA A ~ Sp11Ao, k=l I 2 • · ·
k-1
say, where
s=
t
k=l
[J.L!J.L2 ... J.Lk-1] A2A2 ... Ak-1
(21)
(22)
In the next section we shall inte rpre t S I Ao as the expected time taken for a popu latio n which has been extinguished to be reseeded and grow to viability. Since the term s in the sum S decline exponentially fast with increasing k, the relative erro r in approximation (21) is expo nentially small.
6 THE UNDERSTANDING OF THRESHOLD CHOICE
199
The scaling argument will again imply that ()j = J.l.j / Aj varies only slowly if"" is large, with the implication from (21) that Td-j "'
Td()~
for fixedj; an analogue of the corresponding assertion for 'Trj in the last section. If we define Text = 1 1), the expected time to extinction, then
I:,f=
(23)
so that 1 - ()d is a measure of the proportion of time before extinction which is actually spent at threshold. Moreover, since Lj ,;; Kx 1j /Text decreases to zero with increasing"" we can just as well interpret Text as the expected time needed to escape permanently from any neighbourhood of d. In other words, the time to extinction is asymptotic to the time at which harvesting becomes commercially nonviable. Suppose that rJ.!I is the actual total yield (on the x-scale) before extinction and ff the actual time to extinction, so that these are random variables with respective expectations F(c) and Text (conditional on a start from x =c). Then a closer analysis shows that in fact rJ.!I j F( c) and ff /Text converge to unity in almost any stochastic sense as K---+ oo. The point is that x recurs to c a great many times before it ultimately drops to zero, and the contributions to rJ.!I or ff from each of these excursions away from c and back are independent and identically distributed random variables. The consequence is that the expectation of the average rate of return before extinction, rJ.!I / ff, can be replaced by the ratio of expectations:
E(OJI/ff)
~
E(OJI)/E(ff) = F(c)/Text,
the relation becoming exact in the limit oflarge K. Relation (23) then implies the evaluation
F( c)/Text =
I'C- 1
Fct/Text ~
I'C- 1
Ad(l - Bd)
= K:- 1(>.d- J.Ld) = >.(c) - p,(c) =a( c) (24)
for the expected average return over the period before extinction. This demonstrates that the expected rate of return averaged over the time before extinction converges to the equilibrium rate of return for the deterministic process asK -+ oo. 6 THE UNDERSTANDING OF THRESHOLD CHOICE
Expressions (16) and (24) are again maximal for c near to xo and Xm respectively. We see that the discrepancy is not a consequence of differing assumptions, because these two evaluations have been made for the same process; one for which ultimate extinction is certain. It is a consequence of differing criteria. To
200
SOM E STOC HAST IC EXAM PLES
ask for max imal total retur n and to ask for maxi mal average retur n over the time to extinction (of stock or, almo st equivalent ly, of com merc ial viability) are two very different things. They differ so because the expo nenti al depe nden ce of Text upon "' mean s that it varies extremely rapid ly with c, wher eas the average retur n (24) varies only moderately. The maxi misa tion of expe cted total retur n then amou nts virtu ally to the maxi misa tion of expe cted survival time, with the rate of retur n playing a role which actually becomes ever less signi fican t as "' increases. Inde ed, the reco mme ndat ion is virtually that one shou ld not harvest. In fact, both crite ria are extreme, one takin g the yield rate before extinction and the othe r the time to extin ction as virtually the sole cons idera tion. A balan ced crite rion would be one which chose the thres hold c to maxi mise yield rate 1 =a( c) subje ct to a presc ribed lowe r boun d on Text· Since this last expression depe nds exponentially upon "' (which is what indu ces the sensitivity of its depe nden ce upon c) one migh t cons ider rathe r the norm alise d expression
L(c) = lim "'- 1 log /t-+00
Text·
This in fact has the evaluation L(c) = { R(c)
R(xo)
(c ~ xo)
(c
~
xo)
(25)
where R is inde ed the func tion defin ed in (18). The two quantities a( c) and L(c) vary joint ly with thres hold cas indic ated in Figure 2; a( c) increases with c up to the value Xm and decli nes there after ; LV:) increases with c up to the value x and 0 is cons tant thereafter. Thus, if one prescribes the value of Text as at least Tmin, then L( c) must be at least K- 1 log Tmin. For smal l enou gh "' one will thus have to take x 0 as thres hold , but as "' increases the reco mme nded thres hold will decre ase quite quickly until it reaches Xm.
Figure 2 The characterisers ofperformance for a stochastic fishing model operating at threshold c in the asymptotic (near-deterministic) case. The graphs ofa(c) the average yieldrate, and of L(c) the normalised logarithm ofexpect ed extinction time, as functions ofc.
7 CONTINUOUS-TIME LQG PROCESSES
201
Interesting~ one can reach this same view by returning to the first stochastic model of Section 4: the birth-death model without extinction. It follows from (21) that we can write expression (13) for the yield rate in the steady state as
A(c)Td "{ ~ (S/'Ao) +Text~ a(c)[l
+ S/'AoText)],
(26)
where the terms neglected are o( 1) in K.. Here Text is now to be interpreted as the time to first extinction for this immortal population. Now, Ao is the rate at which a population which has been reduced to zero is restarted by some external mechanism, so that A(} 1 is the expected 'seeding' time. This seeding does not amount to an effective restart until numbers have been brought up to viability; we can regard (S/'Ao) as the expected time needed for this to occur: the expected restart time. Let us then denote it by Tres, so that expression (26) for 7 becomes
I
-r
I
7 ~ a(c)[l + (Tres/Text)r 1. This is then approximate ly a(c) or a(c)(TextfTres) according as to whether the restart time is small or large relative to the extinction time. Since Tres is independent of c, these two extreme evaluations are equivalent to those obtained for the two extreme criteria in the case when extinction was possible. As one varies the value assumed for Tres one moves monotonical ly between these two extremes, just as one did by varying the prescribed value Tmin in the constrained optimisation ofyield rate. Ifwe consider discounted criteria then the possibility of extinction is seemingly irrelevant, because any accounting horizon is orders of magnitude smaller than a prudent lower bound for the extinction time-the conservation horizon. As above, then, the constraint of such a lower bound must be imposed explicitly. That is, suppose that a is the discount rate and that F(c, a) is the discounted future yield at an operating level (and threshold) of c, calculated on the basis of a deterministic model. Then one should choose c to maximise F(c, a), subject to prescription of a lower bound Tmin on Text· As in the undiscounte d case above, the effect of this constraint will weaken rapidly as the scale parameter Kincreases, and the model approaches determinism .
7 CONTINUO US-TIME LQG PROCESSE S AND PASSAGE TO A STOPPING SET One of the most tractable controlled diffusion processes is, as one might expect, the continuous-t ime version of the LQG regulation problem treated in Section 1. We assume the quadratic cost structure of Section 2.8 but modifY the plant equation to
x=Ax+Bu +e
(27)
202
SOME STOCHASTIC EXAMPLES
where E is white noise of power matrix N. The dynamic progr ammi ng equation is then inf[c(x, u) + F1 + Fx(Ax + Bu) +! tr(NFxx)] = 0 u
(t
(28)
with termin al condition F(x, h) = !xTll( h)x. One readily verifies the solutions
F(x, t) = !xTll (t)x + 8(t),
u(t) = K(t)x (t) (29) for value function and optim al control, where TI and K are exactly the matrices of the deterministic treatm ent of Section 2.8. The term 8( t), representing the future cost attributable to process noise, has the evaluation
8(t) =!
Jh tr[NTI(r)] dr
(30)
in analogue to the discrete-time relation (5). For a modification from the pure LQG form, consider a stochastic version of the first-passage problem of Exercise 2.6.2, in which one attempted to reach one end or the other of the interval [0, 1] with the optimal compromise between economy of time and of control effort. As first-passage proble ms go this is about the simplest, but it raises a numb er of interesting points. We asume the mode l modif ied only in that the plant equation becomes .X = u + E, where E is scalar white noise of power N. One's intentions are then to some extent frustrated by rando m disturbances to the path. The dynamic progr ammi ng equation (28) now becomes inf[! (L + Qtl) + uFx +! NFxx] = 0 u
(x:f:0 ,1).
(31)
This holds up to termin ation, which occurs on passage into either of the values x = 0 or x = 1, with respective termin al costs F(O) =Co,
F(l) =
c1.
We can actually perfo rm the minimisation in (31) to deduc U=
-Q- 1Fx
and the reduced version
(32) e (33)
(34) of the dynamic progr ammi ng equation. The non-linear equati on (34) can in fact be linearised by the transformation
1/J(x) to the form
= e-F(x)fQN
r
7 CONTINUOUS-TIME LQG PROCESSES
203
This has the general solution
'1/J(x)
= c1e"'x + cze-=
(O~x~l)
(35)
where a = J Lj QN2 and the coefficients c; are determined from the boundary conditions (32). In contrast to the piecewise-linear expression ofF (x) for the deterministic case derived in Exercise 2.6.2, F(x) now has a single analytic expression over the whole interval 0 ~ x ~ 1. Mathematically, the fact that F now obeys a secondorder equation means that both boundary conditions can be met by a single analytic expression. Physically, the fact that noise can carry one off the intended course means that one might end up at either boundary point, and induces a blurring of costs which makes F smooth. The evaluation ofF is now both greater and smoother than in the deterministic case, as indicated in Figure 3. In case (a) there was formerly (i.e. in the deterministic case) a change in goal at an intermediate value of x and there still is, at the point at which F is maximal. However, Fis now smoothly differentiable at this break-point, because noise can carry one across it in either direction. In case (b) Fx increases as one approaches the undesirable termination point x = 1, reflecting one's increasing efforts to avoid it. Avoidance is never certain,
(a) 0
X
(b)
0
X
Figure 3 The stochastic version ofFig. 22 The upper curves give the value function F(x)for the stochastic version ofthe optimalfirst-passage problem of Ex. 26.2. The two cases correspond, as previously. to those in which the optimal deterministic policy has a break-point in the interior ofthe continuation region or on its boundary.
204
SOME STOCHASTIC EXAMPLES
however, as reflected in the fact that F is now continuous at this boundary. This might be termed the 'fly-paper' effect, if one regards the moving point as a fly trying to avoid the fate of becomi ng stuck on fly-paper at x = 1. A deterministic fly, whose path is fully under its own control, can actually approach arbitrari ly close to the fly-paper with impunity, knowing that he can avoid entrapment. Hence the discontinuity ofF at x = 1, reflecting the difference between being close and being stuck. A stochastic fly cannot guarantee his excape; the nearer he is to the paper, the more certain it is that he will be carried on to it Hence the continuity ofF at x = 1 in this case. This also explains why the fly tries so much harder in the stochastic case than in the deterministic case to escape the neighbourhood of the fly-paper (as manifested by the greater magnitude in the stochastic case of Fx, and so of -u for x near 1). One may say that the penalty of ending on the fly-paper 'propagates' into the free-flight region in the stochast ic case, causing the fly to take avoiding action while still at a distance from the paper. A rigorous treatme nt of the dynamic programming equation in cases where the postulated differential Fx does not exist everywhere can be based on the concept of a 'viscosity colution' (see Fleming and Soner, 1992). In at least some cases this can be envisaged as obtaine d by the addition of a little stochasticity to a previously deterministic problem. As above, this has the effect of smoothing the solution, and the limit of this stochastic solution as one approaches the determi nistic case is in fact the correct deterministic solution. This approach must be regarded as justificatory rather than constructive, however; we shall see in the large deviation treatment of Chapter 22 that the boot is often on the other foot, and that the stochastic solution is, in a sense, approximated by a deterministic solution . To return to the stochastic fly: it is only the difference in penalties at the two termination points which matters, so we may as well set Co = 0. Suppose we now let C1 tend to +oo, so that absorpt ion at x = 1 carries infinite penalty and is to be avoided at all costs. One would imagine that the controlled process would then become improper in some sense, because there is a contingency, presumably of positive probability, which carries infinite penalty. This is not the case, however . The limit process is perfectly proper with a control rule which, although also perfectly proper, becomes vigorous enough as x = 1 is approached that absorption at this point in fact has zero probability. (Stronger than that: the expectation of cost due to a contingency of infinite cost but zero probability could be anything. In fact, it is zero in this case.) In this limit we must have '!f;(O) = 1, '1/J( 1) = 0, and the solution (35) which satisfies these bounda ry conditions is
e
-F(x)/QN
a{l-x)
= '!f;(x) = e
- e
a{x-1)
ea -e-a
.
The optimal control rule is ea(l-x)
u = -Q-1 Fx = N'I/Jx/'1/J = -VLJQ ea(l-x)
+ ea{x-1) - ea{x-1).
(36)
7 CONTINUOUS-TIME LQG PROCESSES
205
- yfiJQ for x For N small (and so a large) u has approximately the constant value u l - oo, then 1 j x as er, Howev 1. from between zero and a value bound ed away effort is this That I. = x to e passag indicating a strenuous effort to avoid < 1. x for finite is (31), by given as ), x ( F successful is indicated by the fact that we which ses proces lled contro of class a This example is a particular case of shall investigate in Chapter 24. for a genera l There is anothe r point which comes out of this example. Suppose, t)Fxx] in the u, , tr[N(x term final controlled diffusion, that F(x, t) is such that the it can be that so u, and x of ndent dynamic programming equation (28) is indepe differ will t) F(x, n functio value written as a function p(t) of time alone. Then the : oftime n functio a from that for the deterministic case Fctet(x, t) only by
F(x, t) = Fctet(X, t) +
1
p(T) dT,
case. This was and the optimal control will be the same as for the deterministic ourhoo d neighb the in true y imatel true for the LQG example; it would be approx it to be expect may One s. proces of a stable equilibrium of the deterministic large. is Fxx where points at failure approximately true if N is small, but to show al optim The above. le examp ssage This is exactly what we observe in the first-pa inistic determ the for that from policy in the stochastic case deviates seriously corres pond case exactly at those points where Fxx is large for some reason, which These were case. inistic to those points at which Fx failed to exist in the determ forced. or chosen points at which there was a discontinuity in policy, either um principle' These matters are related to the search for a 'stochastic maxim which we shall join in Chapter 23. Exercises and comments
be of the form (1) Note that the solution of (34) for x ~ 0 or x ~ I will equally well ary condit ion at (35), but with the c-coefficients now determined by the bound it. one stopping point and a growth condition at infinity. Determ ine n will satisfY (2) Optimal stopping and tangency conditions. The value functio ter in some charac s change s matching conditions at a bound ary where the proces n of the positio the if sense, and will in addition satisfY optimality conditions forms the of sion bound ary itself can be optimised. Even a non-rigorous discus upon s depend such conditions can take can be quite intricate, as a great deal (see etc. e, possibl continuity of the path, the directions in which movement is Whittle 1983a, pp.l02-108 and 201-207). that of optimal However, an impor tant case which appears in various guises is of continuation or stopping, where the optimiser can choose between the actions The example of termination, so that the stopping set itself can be optimised. which the value Section 2 was one such. Consider a discrete-time formulation in
SOME STOCHASTIC EXAMPLES
206
II< in the stopping function F obeys F = LF in the continuation region C and F = suppose, for shall region£'). The state-variable argum ent xis understood, and we uation/ contin the simplicity, that there are no actions to be optimised other than t forvarian stopping decision. We shall suppose an infinite-horizon time-in mulation. The optimality equation is then F = min(L F, II<), with F
= LF < II<
{~);
on holds. Here ~ where the brackets indicate the set of x within which the asserti by choosing the $ and £') have been chosen optimally and we have closed g set g*; that stoppin stopping option in cases of indifference. Define the accessible of~- The part part of g which can be reached by a single transition from some continued the also essential matching condit ion is then that F = II< in £')*. Define with Fas cost F*; the solution ofF* = LF* (holding everywhere) which agrees then, have s). We much as it can: on~ U £')* (which is the only set of x which matter from the relations above and the definition ofF*, F*
(~);
F* = II<
($*);
LF*
~
LF
(q;*).
cy condition, These relations imply optimality conditions which imply a tangen can go to a we if that F* - II< should have zero derivative at the stopping point, continuous limit in space and time. as Fj etc., and Suppose for example that x takes integer values j; let us write F (x) r) constitute a suppose that LFj = c1 + p1Fj_ 1 + q1Fj + r1Fj+ 1 where (p, q, could thus yield distribution. A continuous-limit version of this assumption the value j = k either a deterministic process or a diffusion process. Suppose that in£'). We suppose marks an optima l stopping boundary, in that k - 1 is in ~and k possible. Then the that r > 0, so that movement in the direction of increasing} is yields ively, respect k and k 1, k at d last set of relations above, asserte which will imply That is, F* - IK has a turnin g point at the stopping boundary, . version imit uous-l contin a zero derivative at this point in cy condition at Note that we have already demonstrated validity of the tangen of Section 2.7. model ting harves inistic the optimal threshold value for the determ L 8 THE HARV ESTIN G EXAM PLE: A DIFFU SION MODE to the stochastic Suppose the deterministic harvesting equation (11) modified differential equation
x = a(x)- u+ E
(37)
so that the model where E is white noise of power v( x) / K. We introduce the factor K, e large. The can be made to approach determinism by allowing K, to becom
207
8 THE HARVESTING EXAMPLE: A DIFFUSION MODEL
as we shall introdu ction of,.. in this way amoun ts to a scaling assump tion which, of the see in Chapte r 22, fits into a general pattern consist ent with the scaling birth-d eath process in Section 4. O) has an If the unharv ested version of the model (i.e. that for which u equilib rium distribu tion then this has probability density
=
(38) where
R(x) = 2
jx
a(y)v(y )- 1 dy.
(39)
is by some (see Exercise 1). We assume that x is kept to the positive half-ax it is best then c > x for ing harvest allow we If 0. 1 x conditi on such as v(x) 1 0 as so that say, M, to equal and finite is rate ing to assume initially that the harvest deduce thus We >c. x a(x) is modifi ed to a(x)- Mfor
Theorem 10.8.1 The average rate ofreturn for the diffusion model (3 7) is _ 'Y-
M
.fc 7r(x)e-~~;MQ(x) dx
00 fo 7r(x) dx + fc 7r(x)e-"MQ(x) dx
c
(40)
where 71" has the evaluation (38) and
Q(x)
=
21x
v(y)- 1 dy.
(41)
Evaluation (10) becomes 'Y =
7r( c)v( c) c 2 fo 7r(x) dx
(42)
in the limit oflarge M. M. These Evalua tion (40) is direct and (42) the limit version of it for large one would expressions must be maxim ised with respect to c. Let us confirm what -+ oo and the expect: that they approa ch the known determ inistic value a (c) as ,.. x < c then model approa ches determ inism. If we assume R(x) increasing in x for (42) ion we find the evaluat ion of express 'Y
rv
!v(c)R '(c) =a( c).
See Exercise 2 for the case of a finite harvest rate. x = 0 is Suppos e however that extinct ion is possible, in that the state value 4. Section of that to similar s absorbi ng for the process (37). We follow an analysi initial an on onal conditi ion If F(x) is the total expecte d return before extinct value x then this obeys the dynam ic progra mming equatio n
SOME STOCHASTIC EXAMPLES
208
(43)
u +(a - u)F' + (2K-)- 1 vF" = 0.
where the harvest rate u takes the values 0 or m according as
x
~
cor x > c.
harvested process. Then Theorem 10.8.2 Suppose that extinction is certain for the the relevant solution of (43) is F(x)
= e~R(c)
(1x
where F' (c) has the evaluation
F'(c) =2M
e-l
dy )F'(c )
(0
~ x ~c),
leo e~fR(y)-R(c)-MQ(y)lv(yr 1
dy.
and R(x), Q(x) are the functions defined in (39) and (41). F'(c) = 1.
(44)
(45)
If M is infinite then
on of(43) for x ~ c. Proof We obtain solution (44) as we obtained (15); by soluti c and appeal to the > x for (43) of tion (45) then follows by solution The evalua this evaluation. 0 condition of certain extinction. The last assertion follows from s. The analogue of the final conclusion of Section 4 then follow ising the expression Theorem 10.8.3 The optimal value of c is that maxim value ofc is then determined by e~
y the value x 0 of That is, if M is infinite, then optimal threshold value is exactl different optimality Figure 2, whatever K-. The phenomenon of the two evaluation (25) still recommendations is explained exactly as in Section 6, and (39). holds, with R now having the evaluation Exercises and comments
of the section must (1) The equilibrium distribution 1r defined at the beginning1 (v7r)" = 0. This insatisfy the Kolmogorov forward equation -(a1r)' + (2K-)the net probability tegrates once, and the integral must be zero since it represents F(O) = 0. flux. The resulting equation has (44) as the solution satisfying ) decreasing for (2) Suppose that R(x) is increasing for x < c and R(x) - MQ(x x > c. Show then that expression (40) has the evaluation Mj(M fZ -R') 1 ~ (1/ R') + 1/(M (l- R')
for large K, where all functions are evaluated at c.
=a,
ENT 9 A MODEL WITH A SWITCHING ENVIRONM
WIT 9 THE HARVESTING EXAMPLE: A MODEL T ENVIRONMEN
209
H A SWI TCH ING
wise-deterministic mod el to M indicated in Section 9.2, we can use a piece Suppose that the mod el has represent the effects of environmental variation. = 1, 2, . . . . In regime i the several environmental regimes, labelled by i , but transition can take place population grows deterministically at net rate ai(x) to regime h with probability intensity Kllih· a different nature to that of This is then a model whose stochasticity is of quite 8. It comes from with out and 4 the birth -deat h or diffusion models of Sections ronmental stochasticity', 'envi rather than within, and represents what is term ed affects conclusions has this ther as distinct from 'demographic stochasticity'. Whe to be deter mine d by equation (11) but with The equivalent deterministic model would be given a(x)
= :~::)iai(x)
(46)
i
m is in regime i. The model where Pi is the steady-state probability that the syste s between regimes take place converges to this deterministic version if transition ge regime'. This occurs in the 'avera so rapidly that one is essentially working in an scaling parameter. al limit oflarge, x:, so that K again appears as the natur for such a multi-regime al A fixed threshold would certainly not be optim hold nature, but with thres of a model. It is likely that the optimal policy would be ion whether the quest a it is a different threshold in each regime. Of course, n of the rate of rvatio . Obse regime is know n at the time decisions must be made e is currently regim mine which change ofx should in principle enable one to deter x itself is of n ver, observatio in force, and so what threshold to apply. Howe . If one error ht with extreme unreliable, and estimation of its rate of change fraug optimal policy would base allowed for such imperfect observation then the bution conditional on curre nt action on a poste rior distribution (i.e. a distri ter 15). An optim al policy observables) of the values of both x and i (see Chap t should be applied only if an would probably take the form that harvesting effor hold dependent on both the estimate of the current value of x exceeded a thres probabilities of the different . precision of this estimate and the current poste rior regimes. desperately crude thoug h it We shall consider only the fixed-threshold policy, threshold value compares with must be in this case, and shall see how the optim al that for the equivalent deterministic mod el able to analysis. A value of x We shall consider a two-regime case, which is amen ot have positive probability in at which a1(x) and a2(x) have the same sign cann ~ 0 and a2(x) = -JJ.(x) :E;; 0 equilibrium. Let us suppose then that a1 (x) = ..\(x) est We shall set 1112 = lit and over an interval which includes all x-values of inter 1121
= 112·
SOME STOCHASTIC EXAMPLES
210
Suppose initially that extinction is impossible, so that the aim is to maximise the expected rate of return 'Y in the steady state. We shall suppose that the maxima l harvest rate M is infinite. For the deterministic equivalent of the process we have, by (46),
a(x)
= z-'2..\(x) .
vw.(x).
(47)
v1 + v2
We shall suppose that this has the character previously assumed, see Figure 2. We also suppose that p,(x) = 0 for x :E;; 0, so that xis indeed confine d to x ;;;::: 0. The question of extinction or non-extinction is more subtle for this model Suppose, for example, that ..\(0) = 0 (so that a population cannot be replenished) and that p,(x) is bounde d away from zero for positive x. Then extinction would be 2 certain, because there is a non-zero probability that the unfavourable regime to on extincti For zero. to down can be held long enough that the populat ion is run be impossible in an isolated populat ion one requires that p,(x) tends to zero sufficiently fast as x decreases to zero; the exact condition will emerge. Let Pi(x) denote the probability/probability-density of the ilx pair in equilibrium. These obey the Kolmogorov forward equations
(0
:E;;
x
If The second equation continues to hold at x = c, but the first does not. is c over excess all since , increase to (i, x) = (1, c) then x cannot continue 2; to 1 from i of n transitio by only left immediately cropped. The state (1, c) can be a discrete event which occurs with intensity 1w1• The effect of this is that an atom of probability forms at (1, c). If this has magnitude q in equilibrium then the balance of probability flux into and out of the state implies the equation
(x =c). · Theorem 10.9.1 PI (x)
(49)
(i) The equilibrium distribution of(i, x) is given by
= k..\(x)- 1 e~(x), P2(x) = kp,(xr 1 e~(x), q = k(~vl)- 1 e~~:R(c)
{50)
where k is a normalising constant and R(x)
= fox[v2p.{yr 1 -
v1..\{yr 1J dy.
(51)
(ii) The average return under the c-thresholdpolicy is
..\(c)
'Y - ---:::::: ----:--- :,.-,---- --:...:-, -----~~~~ exp[~R(x) - ~R(c)]g(x) dx + 1
-
whereg(x)
fo
= ..\(xr 1 + p.(xr 1.
(52)
,
__
'
9 A MODEL WITH A SWITCHING ENVIRONMENT
211
Proof The balance of probability flux at x in equilibrium implies that (53)
(O~x
Using this equation to eliminate p 2 from the first equation of (48) we deduce an equation .Av2 ) tWt Pt + -:\+T --;- PI= 0.
, (X
This leads to the solution for p 1 asserted in (50). The solutions for pz and q then follow from (53) and (49). Assertion (ii) follows from (50) and the relation 0 'Y = qa(c). The condition of non-extinction is just that the distribution (50), when a normalised, should not have all its probability mass at the origin. This sets 1 1 0. x! as grows bound on the rate at which v2 f.L(xr - v1>.(x)-
Theorem10.9.2 ln the deterministic limit expression (57) reduces to 'Y =a( c), with a(x) specified by (52} Proof We may suppose that a(x) ~ Oin the interval [0, c), so that R is increasing in the interval. In the limit oflarge "'expression (52) then becomes
Y"'
,\ (vt/R!)(>.-1 + f.£-1)
+1 0
with the argument c understood throughout.
The optimal value of c thus converges to Xm in the deterministic limit, as one would expect. Suppose now that extinction is possible, in that >.(0) = 0 and that the population can be run down to zero in a bounde d time in the unfavourable regime 2. It will then in fact be certain under communication/boundedness conditions. Let F;(x) denote the expected total return before absorption conditional on a start from {i, x ). We have then the dynamic programming equations
(O<x< c). Since escape from x = 0 by any means is impossible we have F1 (0) = F2(0) However, the real assertion is that
Ft(O+) = ¢,
(54)
= 0. (55)
. where F;(O+) = limxto F;(x) and¢ is an as yet undetermined positive quantity 1 regime in grow to time has it The point is that, if x is small and positive, then and time to decline to zero in regime 2 (before there is a change in regime). The
212
SOME STOCHASTIC EXAMPLES
second equation of (54) continues to hold at x substitute
= c, but for the first we must (56)
Relation (56) follows from the fact that escape from (1, c) is possible only the transition of i from 1 to 2; this takes an expected time (Kv1)- 1 during which return is being built up at rate .A( c). Theoreml0.9.3
The valuefunctions Fi(x) have the evaluations
F1 (x)
=
¢ + (/3/112)
F2(x) = ({3/vi) where¢=
1x
1x
.A(y)-!e-"R(y) dy,
JL(y)-!e-KR(y) dy
(0
~ x < c),
(57)
/3/r;, /3 =
v} 1 .A(c)e~
and R( x) is the function (51) so normalised that R(0)
(58)
= 0.
Proof We deduce from equations (54) that
(59) Using this relation to eliminate F2 from the first equation of (54) we obtain an equation for F: similar to that for PI(x) above, with solution Ff = (r/v2)>..- 1e"R for some constant {3, with the consequence that F~ = ({3/vi)JL- 1e"R. Integrating with end conditions (55) we thus deduce equation (57). Substituting these expressions for the Fi in relation (56) and in the second equation of (54) at x = c we deduce the determinations of¢ and /3 asserted. D We see from solution (57) that optimisation with respect to c amounts to maximisation of {3, and so to maximisation of >..(c)e"R(c). For K large this amounts to the maximisation of R(c), i.e. to the equation a( c)= 0, with a~) having the determination (47). That is, the optimal threshold again approaches thevaluexo. To be exact, the stationarity condition with respect to c is
Ifwe assume that .A(k) is increasing with x then we see that, at least for sufficiently large r;, the optimal threshold c lies somewhat above the value x 0 . For the two previous models it lay below. It is in this that the nature of the stochasticity (environmental rather than demographic) reveals itself. In the previous examples
9 A MODEL WITH A SWITCHI NG ENVIRON MENT
213
there would virtually never have been any harvesting if c had been set above the equilibrium value xo. The effect in this case is that x can indeed rise sufficiently above x 0 during the favourable regime 1, and one waits for this to happen before harvesting. Notes on the literature
The fact that the threshold c should seemingly be set at the unharves ted equilibrium level x 0 if one sought to maximise the expected total return before a point of certain extinction was first observed by Lande, Engen and Saether (1994, 1995), for the case of a diffusion process. The analysis of Section 8 expands this treatment. The material of Sections 4-6 and 9 appears in Whittle and Horwood (1995).
_,.
CHAPTER 11
Policy Improvement: Stochastic Versions and Examples CONCLUSIONS 1 THE TRANSFER OF DETERMINISTIC ite-horizon behaviour and the In Chap ter 3 we considered patte rns of infin ministic case. All conclusions technique of policy improvement for the deter stochastic case if we appropriately reached there transfer as they stand to the L and 2 and their continuousextend the definitions of the forward operators time analogues. state-structured time-homogeneAs in Chap ter 3, attention is restricted to the al expectation operator ous case. In discrete time we define the condition
E(u)cp(x) = E[cp(xt+l)!xt = x, Ut = u].
(1)
policy g(oo) and an optimal policy The forward operators L(g) and .ff' for the respectively are then defined by (2) L(g)cp(x) = c(x,g(x)) + {JE(g(x))cp(x), (x, u) !l'cjJ(x) = inf[c u
+ {JE(u)cp(x)].
(3)
equations for the value functions The corresponding dynamic prog ramm ing s Vs = Vsg((oo)) and Fs then again take the form
Vs = L(g) Vs-1,
Fs = .PFs-1
(s > 0),
(4)
d. with the x-argument of these functions understoo it stands with these extended as valid now is 3 The material of Chap ter the monotonicity of the forward definitions. Explicitly: Theo rem 3.1.1, asserting value functions unde r appr opria te operators and the monotonicity (in s) of the the instantaneous cost function terminal conditions, still holds. If we assume on total cost still hold, as does c(x, u) non-negative then Theorems 3.2.1-3.2.3 Theorem 3.5.1 on policy improvement. (1) is the infinitesimal generator The continuous-time analogue of the oper ator . In terms of this the stochastic A(u) for the controlled process, defined in (8.15) M(g) and~ of Section 3.1 take the versions of the differential forward operators forms
216
POLICY IMPROVEMENT. STOCHASTIC VERSIONS
M(g)<jJ(x) = c(x,g(x ))- a<jJ(x) + A(g(x))
"
The assertions of Chapter 3 for the continuous time case then also transfer bodily.
Exercises and comments (1) We can supplement the example of instability given in Exercise 3.2.1 by the classic stochastic example of the simplest gambling problem. Suppose a gambler has a capital x which takes integral values, positive as well as negative, and has the choice of ceasing play with reward x, or of placing a unit stake and continuing. In this second case he doubles his stake or loses it, each with probability 1/2. If his aim is to maximise expected reward then the dynamic programming equation is
Gs(x) = max{x, ![Gs-1( x- 1) + Gs-1 (x + 1)]}
(s > 0)
where G3 (x) is his maximal expected reward if he has capital x with splays remaining. If at s = 0 he only has the option of retiring then Go(x) = x, and so Gs(x) = x for all s. However, the infinite-horizon version of this equation also has a solution G(x) = +oo. If the retirement reward xis replaced by min (a, x) for integral a then the equation has a solution G(x) =a for x::;:;; a. This corresponds to the policy in which the player continues until he has a capital of a (an event which ultimately occurs with probability one for any prescribed a) and then retires. The solution G(x) = +oo corresponds to an indefinitely large choice ofa. InVestigate how infmite-horizon conclusions are modified if any of the following concessions t6 reality is admitted: (i) debt is forbidden, so that termination is enforced in state 0; (ii) rewards are discounted; (iii) a constan t . positive transaction cost is levied at each play. (2) An int~resting example in positive programming is that of blackmail. Suppose there are two states: those in which the blackmailer's victim is compliant or resistan t Suppose that, if the blackmailer makes a demand of u (0 ::;:;; u ::;:;; 1), then a compliant victim pays it, but becomes resistant with probability zil. A resistant victim pays nothing and stays resistant. If Gs is the maxima l expected amount· the blackmailer can extract from a compliant victim in s further demands, then Go = 0 and
Gs+l = sup[u + (1- u2)Ga] = '1/J(Gs),
" say. Here the optimising value of u is the smaller of l and (2Gs) -I and '1/J( G) is 1 or G + 1/ (4G) according as G is less than or greater than!· Show that Gs grows as s112 and the optimal demand decreases as s- 112 for large s. There is thus no meaningful infmite-horizon limit, either in total reward or in
2 AVERAGE-COST OPTIMALITY
217
optimal policy. The blackmailer becomes ever more careful as his horizon increases, but the limiting policy u = 0 is of course not optimal. (3) Consider the discounted version of the problem, for which the infinite- horizon 1 reward G obeys G = 7/J(f3G). Show that, if! ~ f3 < 1, then G = (2yf/3( 1- {3))- . 2 AVERAGE-COST OPTIM ALITY
The problem of average-cost optimisation is one for which the stochastic model in fact shows significant additional features. Because of its importa nce in applications, it is also one for which we would wish to strengthen the discussion of Chapter 3. We observed in the deterministic contexts of Section 2.9 and 3.3 that one could relatively easily determine the value of control at an optimal equilibrium point, but that the determination of a control which stabilised that point (or, more ambitiously, optimised passage to it) was a distinct and more difficult matter. The r stochastic case is less degenerate, in that this distinction is then blurred. Conside of class a is there that Suppose y. the case of a discrete state space, for simplicit states 9l for which recurrence is certain under a class of control policies which includes the optimal policy. Then all these states will have positive probability in equilibrium (under any policy of the class) and, in minimising the average cost, one also optimises the infinite-horizon control rule at every state value in 91. Otherwise expressed: since equilibrium behaviour still implies continuing variation (within Bl) in the stochastic case, optimisation of average cost also implies optimisation against transient disturbance (within i?ll). These ideas allow us to give the equilibrium dynamic programming equations (3.9) and (3.10) an interpretation and a derivation independent of the sometimes troublesome infinite-horizon limit. Consider the cost recursion for the policy g(oo):
'Y + v = L(g)v,
(5)
where 'Y is the average cost under the policy and v(x) the transient cost from x, suitably normalised. (These are both g-dependent, but we take this as understood, for notational simplicity.) Suppose that the state space is discrete and all states are recurrent under the policy. Then 'Y can be regarded as an average cost over a recurrence cycle to any prescribed state (see Exercise 2). Equation (5) can be very easily derived in this approach, which completely avoids any mention of infinite horizon limits, although it does imply indefinite continuation. The natural normalisation of v is to require that E[v(x)] should be zero, where the expectation is that induced by policy in equilibrium. This is equivalent to requiring that the total transient cost over a recurrence cycle should have zero expectation. If there are several recurrence classes under the policy then there will be a separate equation (5) for each recurrence class. These recurrence classes are
218
POLICY IMPROVEMENT: STOCHASTIC VERSIONS
analogous to the stable equilibrium points of the deterministic case (although see Exercise 1). In the optimised case the equation
1+f=! £'f
(6)
has a similar interpretation, the equation holding for all x in a given recurrence class under the optimal policy, and 1 and f being the minima l average cost and transient cost in this class. Whether this equation is either necessary or sufficient for optimality depends again upon uniqueness questions: on whether the value function could be affected by a notional closing cost, even in the infinite-horizon limit. The supposition of a non-negative cost does indeed imply thatfma y be assumed bounde d below, and so that the relevan tfsoluti on of (6) is the minimal solution exceeding an arbitrary specified bound. However, one must frequently appeal to arguments more specific to the problem if one is to resolve the matter.
Theorem 11.2.1 Suppose that (6) can be strengthened to
1+ f
= 2f = L(g)f.
Then
(7)
(8) where dn = n- 1(J(xo) - E[f(xn) l Wo]), for any policy 1l; with equality if1r = gC 00 l. If dn ---+ 0 with increasing n for any 1r and for any Wo one can then assert that the policy gC 00 l is average-cost optimal. Proof Relation (7) is a strengthening in that the second equality asserts that the infimum with respect to u implied in the evaluation of .Ief(x) is attained by the choice u = g(x). If we denote the value of c(x, u) at timet by c1 then (7) can be written "t+f(xt ) ~E?r[ct+f(xt+l)IWt] for any policy 11; with equality if 1r = gCool. Taking expectations on both sides conditional on Wo and summing over t from 0 to n - 1 we deduce that
n7+ f(xo) 'i:
E,(~c, +f(x"))Wo}
whence the assertions from (8) onwards follow.
0
219
2 AVERAGE-COST OPTIM ALITY
assume the same off, If cis uniformly bound ed in both directions then we may l argum ents are specia cases other and it is then clear that dn has a zero limit. In required to establish the result. Exercises and comments ous to the doma ins of (1) One might think ofthe recurr ence classes as being analog that occup ation notion attraction of the deterministic case, but this is not so. The case this inistic determ of the states should actually recur is impor tant: in the e, the ermor or. Furth would true only of the points of the so-called attract of the many e with transient states in the stochastic formulation may communicat recurrence classes. are not conce med for A different set of ideas altogether (and one with which we that it appro aches a the moment) is encountered if a process is 'scaled' so case states which are in deterministic limit as a param eter ,., is increased. In this feebly as ,., is increa sed, the same recurrence class may comm unica te ever more inistic limit. until they belong to distinct doma ins of attrac tion in the determ v control policy is (2) A controlled Markov process with a statio nary Marko on. Suppose the statesimply a Markov process with an associated cost functi x and thatp( x, y) is the space discrete, that c(x) is the instan taneou s cost in state define a modif ied cost transition probability from state x to state y. Suppose we of modified costs over a function c(x)- "'/and define v(x) as the expected sum ence) to a specified path which starts at state x and ends at first entry (or recurr state, say state 0. Then
v(x)
= c(x) -7 + _Ep(x,y)v(y). y~O
of modified costs has Suppose we choose 1' so that v(O) = 0, i.e. so that the sum on becom es simply zero expectation over a recurr ence cycle. Then this last equati
(9)
v(x) = c(x)- 'Y + _Ep(x,y)v(y). y
which we can identifY with (5). We have 0 = v{O) sum is over a recurrence cycle to state 0. That is,
= E[2: (c(xt )- "'/)],where the 1
e cost over a recurr ence where Tis the recurr ence time; this exhibits~, as the averag = l:x 11'(x)c(x), where 1' that cycle. On the other hand, we deduce from (9) d interp retatio n is secon The {11'(X)} is the equilibrium distribution over states. ence, with no recurr of terms the 'infinite horizo n' one; the first is purely in explicit appea l to limiting operations.
220
POLICY IMPROVEMENT: STOCHAS TIC VERSIONS
(3) The blackmail example of Exercise 1.2 showed infinite total reward, but is not rephrasable as a regular average-reward problem. The maximal expected reward over horizons grows as s112 rather than s, so the average reward is zero and the transient reward infinite. Indeed, an average reward of 1 and a transient reward ofj(x) are associated with a development
F(x)
=
1 2/3 + f(x)
+ o(l- !3)
of the discounted value function for 1 - /3 small and positive. This contrasts with the ( 1 - {3) -I 12 behaviour observed in Exercise 1.2. 3 POLICY IMPROVEMENT FOR THE AVERAGE-COST CASE The average-cost criterion is the natural one in many control contexts, as we have emphasised. It is then desirable that we should obtain an average-cost analogue of Theorem 3.5.1, establishing the efficacy of policy improvement. We assume a policy g~ 001 at stage i. Let us distinguish the corresponding evaluations of {, v(x), L and expectation E by a subscript i. The policyimprovement step determines g;+I by 2v; = L;+I v;. Let us write the consequent equation
(10) in the form
(11) where 8 is some non-negative constant. If 8 can in fact be chosen positive then one has an improvement in a rather strong sense, as we shall see.
Theorem11.11 Relation (11) has the implication
(12) where
dn = n- 1(v;(xo)- Ei+I[vi(xn)lxo]).
If dn --t 0 with increasing n it then follows that: (i) there is a strict improvement in average cost if 8 is positive; (iz) average-optimality has been attained if equality holds in (1 1). The proof follows the same course as that of Theorem 11.2.1. Note that, effectively, the only policies considered are stationar y Markov. However, we expect the optimal policy to lie in this class.
221
4 MACHINE MAIN TENA NCE
In continuous time the optimality equation (7) is replaced by 1 1 = inf[e(x, u) + A(u)f(x)].
= .Af, i.e. (13)
u
4 MACHINE MAINTENANCE service effort over a As an example, consider the optimisation of the allocation of lated in continuous set of n machines. The model will be a very simple one, formu use, passes through time. Consider first a model for a single machine which, with x to x + 1 has states of increasing wear x = 0, 1, 2, . . . . The passage from while in state x, this intensity.>.. for all x, and the machine incurs a cost at a rate ex s the machine representing the effect of wear on operation. A service restore instantaneously to state 0. intensity p,. The Suppose the machine is serviced randomly, with probability be will policy blind dynamic programming equation under this (14) 1 =ex+ .>..[!(x + 1)- f(x)] + p,[f(x) -/(0) ] policy and ft?c) the if costs are undiscounted. Here 1 is the average cost under the transient cost in state x. The equation has the unique solution
= .Xe/Jl. if we make the normalising assumptionf(O) = 0. f(x) = ex/ J.i.,
1
(15)
parameters and Suppose now that we have n such machines and distinguish the whole system the state of the ith machine by a subscript i. The average cost for will then be (16) .X;c;/ Jb; li = I =
L i
L i
with the Jb; constrained by (17) if J.l is the intensity oftotal service effort available. a machine is Specification of the Jli amounts to a random policy in which, when probability Jbd Jb· A to be serviced, the maintenance man selects machine i with stage of policy One {x;}. = x state more intelligent policy will react to system ering policy consid before ver, Howe .~~·""!!J=~~"~improvement will achieve this. ing the Jli to choos by policy m rando improvement, one could simply optimise the that this finds readily One (17). minimise expression (16) subject to constraint leads to an allocation Jli ex ~machine i for Policy improvement will recommend one to next service the which
222
POLICY IMPROVEMENT: STO CHASTIC VERSIONS
:L{CjXJ + >.,[.tj(Xj+!)- jj(x})]} + tt(Ji(O)- f(x;)j j
is minimal; i.e. for which Ji(xi) = c;xd /ti is greatest. If we use the optimi sed value of f.L; derived above the n the recommendation is tha t the next machine i to be serviced is tha t for which x 1 jCJ>:; is greatest. . Note tha t this is an index rule, in tha t an index x 1 ~ is calculat ed for each machine and tha t machine chosen for service whose cur ren t index is greatest. The rule seems sensible: degree of wear, cost of wear and rapidity of wea r are all factors which would cause one to direct attention towards a give n mac hine. However, the rule is counter-intuiti ve in one respect: the index dec reases with increasing >.1, so tha t an increased rate of wear would seem to make the machine need service less urgently. Howeve r, the rate is already taken accoun t of in the state variable x 1 itself, which one wou ld expect to be of ord er At if a give n time has elapsed since the last service. The deflation of x1 by a factor A is a reflection of the fact that one expects a quickly wearing component to be mo re worn, even und er an optimal policy. An alternative arg um ent in Section 14.6 will demonstrate tha t this pol icy is indeed close to optimal. 5 CU STO ME R ALLOCATION BE TW EE N QU EU ES Suppose there are n queues of the type considered in Section 10.3. Quantities defined for the ith queue will be given subscript i, so tha t x 1, A;, f.Li and a1x 1 represent size, arrival rate, service rate and instantaneous cost rate for tha t queue. We suppose initially tha t these que ues operate independently, and use
(18) to denote the total arrival rate of cus tomers into the system. However, suppose that arriving cus tomers can in fact be routed into any of the queues (so tha t the queues are mu tually substitutable alternatives rath er tha n components of a stru ctu red netw ork). We look for a routing policy 1r which minimises the expected average cos t "/1r = E1r[L: 1 a1x 1]. The policy imp lied by the specification above simply sends an ving customer to queue iwi th pro bability >..d >..;the optimal policy will presumarri ably react to the current system stat e { x 1}. The ran dom routing policy achieve s an average cost of
(19)
6 ALLOCATION OF SERVER EFFORT BETWEEN QUEUES
223
(see (10.10)). As in the last section, before applying policy improvement we might ·as well optimise the random policy by choosing the allocation rates At to minimise expression (19) subject to (18), for given A. One readily fmds that the optimal choice is (20) where () is a Lagrange multiplier whose value is chosen to secure equality in (18). So, speed of service and cheapness of occupation both make a queue more attractive; a queue for which a;J JLi > 1j Owill not be used at all. Consider one stage of policy improvement. ·It follows from the form of the dynamic programming equation that, if the current system state is {xi}, then one will send the next arrival to that queue i for which
is minimal. Here the fi, are the transient costs under the random allocation policy, determined in (10.10). That is, one sends the arrival to that queue i for which
fi(xi + 1)- fi(xi) = at(Xt +A1) J.'i-
i
is minimal. If the At have already been optimised by the rule (20) then it follows that one sends the new arrival to the queue for which ((x 1+ 1) J a;J JJ.t) is minimal, although with i restricted to the set of values for which expression (20) is positive. This rule seems sensible: one tends to direct customers to queues which are small, cheap and fast-serving. Note that the rule again has the index form. 6 ALLOCATION OF SERVER EFFORT BETWEEN QUEUES We could have considered the problem of the last section under the assumption that it is server effort rather than customer arrivals which can be switched between queues. The problem could be rephrased in the form in which it often occlirs: that there is a single queue to which customers of different types (indexed by z) arrive, and such that any customer can be chosen from the queue for the next service. Customers of different types may take different times to serve, so it is as well to make a distinction between service effort and service rate. We shall suppose that if one puts service effort u1 into serving customers of type i (i.e. the equivalent of u1 servers working at a standard rate) then the intensity of service completion is the service rate O"tJ.I.i· One may suppose that a customer of type i has an exponentially distributed 'service requirement' of expectation JJ.i 1, and that this is worked off at rate ai if service effort ui is applied.
224
POLICY IMPROVEMENT: STOCHASTIC VERSIONS
As in the last section, we can begin by optimising the fixed service-allocation policy, which one will do by minimising the expression
with respect to the service efforts subject to the constraint
L:u =S,
(22)
1
i
on total service effort. The optimal allocation is
u1 = p.j 1 [A; + ylOa;A;J.L;]
(23) where ()is a Lagrange multipler, chosen to achieve equality in (22). An application of policy improvement leads us, by an argum ent analogous to that of the previous section, to the conclusion that all servic e effort should be allocated to that non-empty queue i for which p.1[.fi(x 1) - Ji(x1)] is minimal; 1 i.e. for which
v1(x1) = p.1[.fi(x1) - Ji(x; - 1)] =
a·J.L·X· 1 ' 'A
O';Jl.;-
i
(24)
is maximal. If the fixed rates p.1 have been given their optim al values (23) then the rule is: all service effort should be concentrated on that non-empty customer class i for which x1J a1p.t/ A; is maximal. It is reasonable that one should direct effort towards queue s whose size x 1 or unit cost a; is large, or for which the response p.; to servic e effort is good. However, again it seems paradoxical that a large arrival rate A; should work in the opposite direction. The explanation is analogous to that of the previous section: this arrival rate is already taken account of in the queue size x1 itself, and the deflation of x1 by a factor JX1 is a reflection of the fact that one expects a busy queue to be larger, even under an optimal policy. Of course, the notion that service effort can be switch ed wholly and instantaneously is unrealistic, and a policy that took accou nt of switching costs could not be a pure index policy. Suppose that to switch an amou nt of service effort u from one queue to anoth er costs c!u!. Suppose that one application of policy improvement to a policy of fixed allocation {u;} of service effort will modifY this to {~}.Then the~ will minimise
E!!c;l~- u;!- ~v;) i
(25)
subject to
(26)
7 REWARDED SERVICE RATHER THAN PENALISED WAITING
225
and non-negativity. Here Vt = v;(x;) is the index defined in (24) and the factor! occurs because the first sum in (25) effectively counts each transfer of effort twice. If we take account of constraint (26) by a Lagrangian multiplier B then the differential of the Lagrangian form L is
oL
8~
{Li+ :=!C-Vt-9 Lt- := -!c - Vj - e =
(cr; > cr1) (cr; < crt)
We must thus have d; equal to cr;., not less than cr1, not greater than cr1 or equal to zero according as L;_ < 0 < Lt+• L;+ = 0, L 1_ = 0 or L 1_ > 0. This leads us to the improved policy. Let 22+ be the set of i for which v 1 is maximal and 22_ the set of i (possibly empty) for which v 1 < max vj - c.
(27)
1
Then all serving effort for members of 22_ should be transferred to members of 22+. The definitions of the v; will then of course be updated by substitution of the new values of service allocation. Such a reallocation of effort will occur whenever discrepancies between queues become so great that (27) holds for some i. The larger c, the less frequently will this occur. 7 REWARDED SERVICE RATHER THAN PENALISED WAITING Suppose the model of the last section is modified in that that a reward r 1 is earned for every customer of type i whose service is completed, no cost being levied for waiting customers. This is then a completely different situation, in that queues can be allowed to build up indefmitely without penalty. It has long been known (Cox and Smith, 1961) that the optimal policy under these circumstances is to serve a customer of that type i for which r1p,1 is maximal among those customers present in the queue. This is intuitively right, and in fact holds for service requirements of general distribution. We shall see that policy improvement leads to what is effectively just this policy. For a queue of a single type served at unit rate we have the dynamic programming equation in the undiscounted case 1
= AA(x + 1) + p,(x)[r- A(x)],
(28)
where xis queue size, A is arrival rate, p,(x) equals the completion rate JL if xis positive and is otherwise zero, r is the reward and A(x) is the increment f (x) - f (x - 1) in transient cost. Equation (28) has the general solution for A A(x)
=I- W +(I- AT) [!!:.]x-1 A-p,
p,-A
A
(29)
If p, > A then finiteness implies that the second term must be absent in (29), whence we deduce that
226
POLIC Y IMPROVEMENT: STOCHASTIC VERS IONS
6(x)
= r.
(30) This is the situation in which it is the arrival rate which limits the rate at which reward can be earned. In the other case, 11 < A, it is the completion rate which is the limiting factor; the queue builds up and we have !=J.l r,
6(x)
= (J.L/>Y- 1•
(31) The total reward rate for a queue of several types unde r a fixed allocation { cr1} of service effort is then 1
=I>~ min[>.;, cr;Jt;]. i
If we rank projects in order of decreasing r;J.l; then an optim al fixed allocation is that for which cr1 = >..;/ J.l; for i = 1, 2, ... , m, where m is as large as possible consistent with (22), to allocate any remaining servic e effort to type m + 1 and zero service effort to remaining types. Now consider a policy improvement step. We should direct the next service to a customer of type i which is present in the queue and for which i maximises J.L;[r;- 6 1(x1)]. It follows from solutions (30), (31) that this expression is zero for i = 1,2, ... ,m and equal to r1tt1[1- (cr;J.L;/>.;)x1- 1 ] fori > m. That is, one will serve any of the first m types if one can. If none of these are present, then in effect one will serve the custo mer type present which maxim ises r;J.l;, because the fact that the traffic intensity for queue m + 1 exceeds unity means that Xm+l will be infinite in equilibrium. It may 'seem strange that the order in which one serve s members of the m queues of highest effective value r;J.l; appears imma terial. The point is that all arrivals of these types will be served ultimately in any case. If there were any discounting at all then one would, of course, always choose the type of highest value among those prese nt for first service. 8 CALL ROUTING IN LOSS NETWORKS Consider a network of telephone exchanges, with the nodes of the network (the exchanges) indexed by a variable j = 1, 2, .... Supp ose that there are mjk lines ('trun ks') on the directed link from exchange j to excha nge k, of which Xjk are busy at a given time. One might think then that the vector ~ = {Xjk} of these occupation numbers would adequately describe the state of the system, but we shall see that this is not quite so. Calls arrive for a jk conne ction in a Poisson stream of rate Ajk. these streams being supposed indep endent. Such calls, once established, terminate with probability intensity /ijk. When a call arrives for ajk-connection, it need not be established on a direc tjk link. There may be no free trunk s on this link, in which case one can either look for an alternative indirect routing (of which there may be many) or simply not
227
8 CALL ROUTING IN LOSS NETWORKS
accept the call. In this latter case we assume that the call is simply lost-no queueing facility is provided, and the caller is assumed to drop back into the population, resigned to disappo intment We see that a full description of the state of the network must indicate how many calls are in progress on each possible route. Let n, be the number of calls in progress on route r. Denote the vector with elements n, by !! and let !! + e, denote the same vector with n, increased by one. Let us suppose that the a establishment of a jk-connection brings in revenue WJk. and that one seeks c dynami routing policy which maximises the average expected revenue. The programming equation would then be 'Y=
2:2:>-1kmax{O,w1k+max[f(!! +e,) -/(!!)] } r
k
j
(32)
+ LnrtLr [f(!!- e,)- /(_!!)], r
where 'Y andfind icate average reward and transient reward respectively. Here the r-maximisation in the first sum is over feasible routes which establish a jkconnection. The zero option in this sum corresponds to rejection ofthe call on the grounds that connection is either impossible or uneconomic. (The difference f (!!) - f (!! + er) can be regarded as the implied cost of establishing an incoming call along router. If this exceeds w1k then the connection uses capacity which could be more profitably used elsewhere. We can take as a convention that this cost is infinite if the route is infeasib le-i.e. requires non-existent capacity.) In the second sum ILr is taken to equaliLJk if route r begins inj and ends ink. The term indicated is included in the sum only if !! - e, ~ 0; i.e. if there is indeed at least one call established on route r. Solution of this equation seems hopeless. However, the value function can be a determined for the simple policy in which one uses only direct routes, accepting stage one apply then shall We trunk. free a is there if only call for this route if and of policy improvement. The average and transient rewards on one such link (for which we shall drop thejk subscripts) are determined by 1 = >.[w + f(x
+ 1)- f(x)]
+ tLX[f(x - 1)- f(x)]
(0 < x < m).
(33)
For x = 0 this equation holds with the term in IL missing; for x = m it holds with the term in >. missing. Let us define the quantity
.6.(x) = f(x)- f(x + 1) which can be interpreted as the cost of accepting an incoming call if x trunks are currently busy. One finds then the elegant solution
B) A( ) = w B(m, B(x, ())
w. X
(0 < X < m ) ;
'Y
= >.w [1 - B(m, e)] )
228
POLICY IMPROVEMENT: STOCHASTIC VERSIONS
where() = )..j J-L is the traffic intensity on the link and B(m, B) is the Erlangfunction
This is the probability tha~ all trunks are busy, so that an incoming call cannot be accepted: the blocking probability. The formula for Wjk thus makes sense. Returning to the full network, define
Then we see from (32) that one stage of policy improvement yields the revised policy: If a call comes in, assign it the feasible route for which the sum of the costs x along the route is minimal, provided that there is such a feasible route and that this minimal cost does not exceed WJk. In other cases, the call must or should be rejected. The policy seems sensible, and is attractive in that the effective cost of a route is just the sum of individual components x. For this latter reason the routing policy is termed separable. Separability ignores the interactions between links, and to that extent misses the next level of sophistication one would wish to reach. The policy tends to assign indirect routings more readily than does a more sophisticated policy. This analysis is taken, with minor modifications, from Krishnan and Ott (1986). Later work has taken the view that, for a large system, policies should be decentralised, assume very limited knowledge of system state and demand little in the way of real-time calculation. One might expect that performance would fall well below the full-information optimum under such constraints, but it seems, quite remarkably, that this need not be the case. Gibbens et al. (1988, 1995) have proposed the dynamic alternative routing policy under which, if a one-step route is not available, two-step routes are tested at random, the call being rejected if this search is not quickly successful. One sees how little information or processing is required, and yet performance has been shown to be close to optimal.
CH AP TE R12
The LQG Model with Imperfect Observation 1 AN INDICATION OF CONCLUSIONS Section 8.3 we assumed that the When we introduced stochastic state structure in a property generally referred to current value of the state variable was observable, same as 'complete information', as 'perfect observation~ Note that this is not the e future course of the process is which describes the situation for which the whol predictable for a given policy. observed (i.e. if the current If the model is state-structured but imperfectly simple recursive treatment of value of state is imperfectly observable) then the in this case, an effective state Section 8.3 fails. We shall see in Chapter 15 that, n state: the distribution of x 1 variable is supplied by the so-called informatio !'£'with which we have to work conditional on W,. This means that the state space butions on!'£', a tremendous has perforce been expanded to the space of distri increase in dimensionality. tion, in that for them these However, LQG processes show a great simplifica and so are parametrised by the conditional distributions are always Gaussian, covariance matrix V1• Indeed, conditional mean value x1 and the conditional develop in time by deterministic matters are even simpler, in that V1 turns out to h we appeal to the observed whic and policy-independent rules. The only point at as an x of n 1• One can regard x1 course of the process is then in the calculatio g estin inter is It • W on 1 mati infor nt estimate of the current state x 1 based on curre even bles; serva unob ate estim to one s that the formulation of a control rule force rule implies criteria for the more interesting that the optimisation of this optimisation of estimation. properties of linear relations LQG processes have already been defined by the noise. We shall come shortly to a between variables, quadratic costs and Gaussian briefly and exactly. If we add definition which expresses their essence even more regulation problem of Section the feature of imperfect observation to the LQG prototype imperfectly observed 10.1 then we obtain what one might regard as the this the plant and observation state-structured LQG process in discrete time. For relations take the form (I) r-l+ Et
x, =
AXt- 1
Yt = CXt-1
+Bu
+ fJt
(2)
230
THE LQG MODEL WITH IMPERFECT OBSERVATION
where y 1 is the observation which becomes available at timet. We suppose that the process noise E and the observation noise 1J jointly constitute a Gaussian white noise process with zero mean and with covariance matrix (3)
Further, we retain the instantaneous and terminal cost functions
c(x,u)=~[~r[~ ~][~],
(4)
of Section 9.1. If we can treat this model then treatment of more general cases (e.g. incorporating tracking of a reference signal) will follow readily enough. We shall see that there are two principal and striking conclusions. One is that, if u1 = K 1x 1 is the optimal control in the case of perfect state information, then the optimal control in the imperfect-observation case is simply u1 = Kr!xr. This is a manifestation of what is termed the certainty equivalence principle (CEP). The CEP states, roughly, that one should proceed by replacing any unobservable by its current estimate, and then behaving as if one were in the perfect-observation case. ltturns outto be a key concept, not limited to LQG models. On the other hand, it cannot hold for models in which policy affects the information which is gained. The other useful conclusion is the recursion for the estimate i 1
(5) known as the Kalman filter. This might be said to be the plant equation for the effective state variable i 1; it takes the form of the original plant equation (1), but, instead of being driven by plant noise Er, is driven by the innovation Yt- Cic-1· The innovation is just the deviation of the observation y 1 from the value E(yrl Wt-i) that one would have predicted for it at timet- 1; hence the name. The matrix H 1 is calculated by rules depending on V1_ 1, , The Kalman filter provides the natural computational tool for the real-time determination of state estimates, a computation which would be realised by either a computational or an analogue model of the plant. Finally, LQG structure has really nothing to with state structure, and essential ideas are indeed obscured if one treats the state-structured case alone. Suppose that X, U and Y denote process, control and observation realisations over the complete course of the process. The cost function C will then be a function C(X, U). Suppose that the probability density of X and Yfor a control sequence U announced in advance is
f(X, Yl; U) =
e-O(X,YI;U)
(6)
(so that U is a parametrising variable). This must be a density relative to an appropriate measure; we return to this point below. We shall term [)
OBSERVATION 2 LQG STRUCTURE AND IMPE RFEC T
231
asing improbability of plan t/ the discrepancy, since it increases with incre The two functions C and I[]) of the observation realisations X and Y for given U. stochastic structure of the problem. variables indicated characterise the cost and cost funct ion C and the discrepancy One can say that the problem is LQG if both the density (6) is relative to Lebesgue []) are quadratic in their arguments when the measure. economically, that dyna mic This characterisation indeed implies, very Gaussian. It also implies that relations are linear, costs quadratic and noise the only effect of controls on policy cann ot affect information, in that r in know n controls, and so can be observations is to add on a term which is linea variables take values freely in a corrected out. It implies in addition that the Lebesgue measure) and that the vector space (since the density is relative to t h (since histories X, Y, U are stopping rule is specification of a horizon poin taken over a prescribed time interval). ntly ofLQ G ideas) that the two It can be asserted quite generally (i.e. independe se the model. One migh t say that quantities C and [j) between them characteri they could be), in that one wishes both C and I[]) should be small (relative to what to make C small and expects IDi to be small. variable; i.e. [J)(x) for x alon e or One can define the discrepancy for any rand om discrepancy is the negative the I[D(xjy) for x conditioned by y. The fact that be normalised in different may h logarithm of a density relative to a measure whic constant. We shall assu me ive addit ways means that it is determined only up to an it so normalised that inf[J)(x) = 0. X
n p. and covariance matrix V, then So, if xis a vector normally distributed with mea (x- p.). Note the general validity of [Jl(x) is just the quad ratic form ! (x- p.) T v-I formulae such as [Jl(x,y) = [Jl(y)
+ [J)(xjy).
is taken in the form (2), with y 1 It is often asked why the observation relation ely previous state rather than on essentially being an observation on immediat work out nicely that way. Probably curre nt state. The weak answer is that things as a joint state variable, then y 1 is a the right answer is that, if one regards (xt,Yt) . This is an alternative expression function of curre nt state, unco rrupt ed by noise deal to recommend it: that one of imperfect observation which has a good error. observes some aspect of current state without
ERVATION 2 LQG STRUCTURE AND IMPERFECT OBS models involves a whole train of The treatment of imperfectly observed LQG as they seek the right development. ideas, which auth ors order in different ways
232
THE LQG MODEL WITH IMPERFE CT OBSERVATION
We shall start from what is surely the most economical characterisation of LQG structure: the assumed quadratic character of the cost C(X, U) and the discrepancy [])(X, YJ; U) as functions of their arguments. We also regard the treatment as constrained by two considerations: it should not appeal to state structure and it should generalise naturally to what is a natural extension of LQG structure: the risk-sensitive models of Chapters 16 and 21. The consequent treatment is indeed a very economical and direct one; completed essentially in this section and the next. Sections 6-8 are added for completeness: to express the results already obtained in the traditional vocabulary oflinear least square estimates, innovations, etc. Section 9 introduces the dual variables, in terms of which the duality of optimal estimation to optimal control finds its natural statement. Note first some general points. The fact that we have written the cost C as C(X, U) implies that the process has been defined generally enough that the pair (X1, U1) includes all arguments entering the cost function Cup timet (such as values of reference signals as well of plant itself). The dynamics of the process and observations are specified by the probability law P(X, YJ; U), which is subject to some natural constraints. We have P(X, YJ; U)
h-1
h-1
t=O
1=0
=IT P(xt+!,Yt+J/Xr, Yt; U) =IT P(xt+!,Yt+I/Xr, Yt; Ur)
(7)
the second equality following from the basic condition of causality; equation (A2.4). Further
[])(xt+l ,yt+I!Xr; Ur) = [])(xt+IIXt, Yr; Ur) + [])(yt+dX r+l, Yt; Ur) = [])(xt+dX t; Ut) + [])(Yt+I!Xt+l• Yt; Ur)
(8)
the second equality expressing the fact that plant is autonomous and observation subsidiary to it, in that the pair (X1 , Y1) is no more informative for the prediction of xt+I than is X 1 alone. . Relations (7) and (8) then imply
Theorem 12.2.1 The discrepancy has the additive decomposition h-1
[ll(X, YJ; U) = L[[])(xr+ J!Xr; Ur)
+ [])(yt+IIXt+l> Yr; Ut)].
(9)
t=O
Consider now the LQG case, when all expressions in (9) are quadratic in their arguments. The dyamics of the process itself are specified by h-1
[])(Xj; U) = L t=O
[])(xt+1IXt; U 1)
(10)
2 LQG STRUCTURE AND IMPERF ECT OBSERVATION
233
and the conditio nal discrepancies will have the specific form [])(xt+tiXt; Ut)
=! {xc+l -
dt+t - AtXt - Bt Uz) TNi+\ (xr+l- dr+l- ArXt - Bt Uz)
(11)
(10) for some vector d and matrices A, B, N, in general time-dependent. Relations and (11) between them imply a stochastic plant equation (12) (0 ~ t t) have been substituted the values which minimise [))(XI; U). The value ofu1 thus determined, Udet(X1 , U1_t ), i.s the optimal value ofutfor the deterministic process in closed-loop form.
Proof All that we have done is to use the determi nistic version of (12) to express future process variables x 7 ( T > t) in the cost function in terms of control variexables U and current process history Xt, and then minimis e the consequ ent r, Howeve pression for C with respect to the as yet underm ined controls u7(T;;;: t). is the point of the theorem is that this future course of the deterministic process fudetermi ned (for given U) by minimis ation of [))(XI; U) with respect to these 0 ture variables. Otherwi se expressed, the 'deterministic future path' is exactly the most probable future path for given X 1 and U. In optimisi ng the 'deterministic process' we have suppose d current plant history X 1 known; a supposition we now drop.
Exercises and comments (1) Note that state structur e would be expressed, as far as plant dynamics go, by P(XI; U) = IlzP(xt+tlxz; ut)· (2) Relation (12) should indeed be seen as a canonic al plant equation rather than necessarily the 'physical' plant equation. The physical plant equation might have the form
234
TH E LQ G MO DEL WIT H IMP ERF ECT OBSERVATION
where the plant noise e* is autoco rrelated. However, we can substit ute e;+I = E( e;+ 1 1X~> Ut) + ft+l· This standar dises the equation to the form (12) expectation is linear in the con , since the ditioning variables and e has the required orthogonality properties. The det erministic forms of the two sets of equations will be equivalent, one being deriva ble from the other by linear operati ons. 3 TH E CERTAINTY EQUIV AL
ENCE PR IN CIP LE
When we later develop the ideas of projection estimates and innova tions then the certainty equivalence principle is rather easily proved in a ver sion which immediately generalises; see Exe rcise 7.2. Many readers may fmd this sufficient. However, we fmd it economical to give a version which does not presume this apparatus, does not assume state-s tructure and which holds for a mo re general optimisation criterion. The LQG criterion is that one cho oses a policy 1r to minimise the exp ected cost E'Jr (C). However, it is actually sim pler and more economical for pre sent purposes to consider the rather more genera l criterion: that 1r should maximise E'Jr [e-BC] for prescribed positive 9. Since this second criterion function is 1 fJE'Jr(C) + o(9) for small 0, we see that the two crit eria agree in the limit of small (). For convenience we shall refer to the two criteria as the LQG-crite rion and the LEQG-criterion respectively, LE QG being an established term in which the EQ stands for 'exponential of quadratic ~ The move to the LEQG criterion induces a measure of 'risk-sensitivity'; of regard for the variability of C as well as its expectation. We shall see in Chapt ers 16 and 21 that LQG theory has a complete LEQG analogue. Indeed, LQG theory appears as almost a degene rate case of LEQG theory, and it is that fact wh ich we exploit in this section: LEQG methods provide the insightful and econom ical treatment of the LQG case. We wish to set up a dynamic progra mming equation for the LEQG mo del. If we defme the somewhat transformed total value function e-G(W,) = f(Y t) supE.r[e-BCIWt], 'If
where f is the Gaussian probab ility density, then the dynamic programming equation takes the convenient form e-G(W,) =SUp Ut
J
dyt+le-G(WHI)
(t
(13)
We lose no generality by assuming perfect information at the horizo n point h; this cannot affect policy since no further actions are to be taken. Thus Wh = (X, Y, U) and the terminal con dition for equation (13) is
CIPLE 3 THE CERTAINTY EQUIVALENCE PRIN
G(Wh) =[]) (X, Y/; U) +BC(X, U). t. The following simple lemma expresses a key resul
235 (14)
function of vectors z andy, positive Lemma 12.3.1 Suppose that Q( z, y) is a quadratic definite in y. Then
j exp[-Q(z,y)Jdy
= exp [q- io/ Q(z,y)]
(27r)' /2 1Qyy 1- 1/ 2], r is the dimension of where q is a constant having the evaluation log[ s of Q( z, y) with respect toy. y and Qyy is the matrix ofsecond-order derivative an integration with respect toy by The point of the lemma is that, if one replaces the result is correct as far as terms a minimisation of Q with respect to y, then erned. dependent on the second argument z are conc in Proof If y is the minimising value (which will one can write Q(z,y) = Q(z,y)
general be z-dependent) then
+!( Y- .YfQ yy(y - y).
ect to the variable y - y. The lemma now follows by integration with resp
0
We can apply the Lemma to the solution of (13). the dynamic programming equation Lemma 12.3.2 Assume LEQG structure. Then (13) with terminal condition (14) has the solution (15) J)(X, Y/; U) + BC(X, U)] + g1 inf inf t inf[[ G( W1) = u,:r :;. t yT:r> X . The value ofUt thus determined is where g1 is a policy-independent function oft alone the LEQG-optimal value ofthe control at timet.
ratic in its arguments at time T = h; Proof We see from (14) that G( Wr) is quad n of the integral in (13) by appeal to suppose this so at time t + 1. Then evaluatio the lemma leads to the conclusion that (16) G(W1) = inf inf G(W1+J) + ... Ur
Yr+l
tion oft alone, and we know the where + ... indicates a policy-independent func nature of G is thus established infimising u1 to be optimal. The quadratic (16). In making this last deduction inductively, and solution (15) follows from (14), they indeed commute because they we have re-ordered the extremal operations; isation occurs because, in the are all infimising operations. The .X-minim
236
THE LQG MOD EL WIT H IMP ERF ECT OBSERVATION
dyn ami c prog ram min g equa tion at time h - 1, the inte grat ion over Yh amounts to . an integration over all variables of the full desc ripti on which are not dete rmin ed . from Wh-1 · 0 The restrictions on the form of[) whic h we dedu ced in the last section imply a strengthening of the last theorem.
Theorem 12.3.3 Assume LEQG structure. Then the dynamic programming equation (13) with terminal condition (I 4) has the solu tion G(W1) =in f{D (X1, Y 1 i; Ut-I) x,
+ u,.:r;; inf .t
inf [D(xt+t. ... ,xh!Xr; U)
~=r>t
+ OC(X, U)]} + g1
(17)
where g1 is a policy-independent/unction oft alone. The value ofu1 thus determined is the LEQG-optimal value of the control at time t. If the value of u1 minimising the square bracket is denoted u(X , u,_,) then the LEQG-optimal value of Ut is u( x, is the value of X dete rmined by thefina l X 1-minimisation. 1 The tran sfor med form (17) of (15) follo ws by appe al to the redu ctio n of [) expressed in (9). The final asse rtion of the theo rem follows from it. This asse rtion is very significant; it is actually the risk-sensitive form of the certainty equivalence principle (CEP), at least for the case 0 ~ 0. We shall develop it in Cha pter 16. The poin t for the mom ent is. that the CEP we seek follows immediately from it.
Theorem 12.3.4 (The Cert aint y Equivalence Principle for the LQG case) Assume LQG structure. Then the optimal value ofu 1 is Udet(X1(t), Ut-t ), where Udet (X1 , U,_I) is the optimal control for the deterministic case determined in Theorem 12.2.2 and x?> isth eva lueo fX1 minimising'D(X, Y1l; U,_,). Proof This is juSt the final asse rtion of the previous theo rem in the limit 0 L0. Suppose 0 small, and cons ider first the extremisations of the square bracket in (17). The values of ~( r ~ t) minimising the square bracket for given U are, to within a term O(O ),jus tthe valu es minimis ing D(xt+ 1, ••• ,xhi X1; U). Tha t is, just the future course of the deterministic vers ion of the process (for given X, U) to within a term 0(0). The se yield a value for D(x1+t. ... Xh!X, U) which is 0(02 ), and the u-minimisations which follow are essentially min imis atio ns of OC(X, U) with future process variables { Xri r > t) given thei r dete rmin istic prediction. The value u(X1, U1-1) of Ut which minimises the squa re bracket is thus Udet ( X 1, U,_ )+ 1 0(0). The min ima l value of the squa re brac ket in (17) is then 0(0), with the consequence that the value of X1 min imis ing the curly brac ket is just x, + 0( 0), with the defmition of x?> mod ified to that state d in the theorem. The fina l
4 APPLICATIONS: THE EXTENDE D CEP
237
determination of the minimising value of Ut, which we know to be LEQGoptimal at 0, is then U
Ut
= Kxr
+ L Ljdt+j+l j=O
(18)
238
THE LQG MOD EL WITH IMPE RFEC T OBSE RVATION
for know n matr ices L1, wher e we have for simpl icity writt en out only the timeinvar iant form of this relation. If one penalises the depa rture of x and u from presc ribed reference signals then we know (see the sectio ns quote d) that we can reduc e the optim al contr ol to the form (18), when d is a linea r comb inatio n of actua l distu rbanc es and reference signals. In practice, the future cours e of distu rbanc es and reference signals will in gene ral be unkn own, but, unde r LQG assum ption s, the CEP will be applicable, and one can asser t that the optim al contr ol rule will be just 00
Ut
(t) = Kx t
~ L d(t) + L-J i t+j+l
j=O
(19)
Here d~ 1 is an effective forecast of the value of dr+J form ed at time t, in that it is the most proba ble value of dt+J cond ition al on infor matio n available at time t. If one does not feel confi dent enou gh of the statis tics of the situa tion to form a probability mode l then the 'mos t proba ble' forec asts in (19) could be repla ced by the best infor mal or subjective forec ast available. One could not say much abou t perfo rman ce of the rule in such a case, but at least one could be sure that the rule was of a form know n to be optim al for a fully speci fied model. Altho ugh we deriv ed the CEP by appe al to the dyna mic prog ramm ing recursi on (in parti cular , in the passa ge from (13) to (17)), it could also be appli ed to a direc t trajec tory optim isatio n (as is indee d impli cit in Theo rem 12.2.2). That is, at tili\e t one just subst itutes cond itiona lly most proba ble estim ates for unob serya bles and then carri es out, by what ever mean s convenient, a direc t optim isatio n of the future trajec tory thus predi cted. The deter mina tion of the curre nt contr ol obtai ned in this way is then optim al. This extended form may be useful if one wishes to resto re the ment al pictu re of the future cours e of the proce ss whic h is largely lost in a recursive form ulation. The values u~l for r > t oflat er decisions form ed in this way are not in general the optim al values of these quantities, but they are indee d the best estim ates one can form at time t of what these optim al future decisions will be (an asser tion we make explicit in Exercise 7.1). This is one advan tage of the exten ded formulation; that optim isatio n is seen as part of a provisiona l forward plan. This plan will of cours e be upda ted as time moves on and one gains new infor matio n, but the fact that it has been form ed gives one much more consc ious appre ciatio n of the cours e of events and the conti ngen cies again st whic h one is guard ing. It is a view of optim isatio n whic h lies close to the intuitive meth ods by whic h we cope with daily life. Agai n, one can imag ine form al estim ates being repla ced by infor mal estimates in cases where one does not have the basis for a comp lete stoch astic model.
a
Of course, the point abou t the recursive appro ach is that it is econ omic al: one does not have to hold a whole future in one's mind (or processors). The chess analo gue we have already given in Secti on 1.3 bring s home the contr ast betw een
239
5 CALCULATION OF ESTIMATES: THE KALMAN FILTER
the two views. To play chess by the extended method is to choose one's current JilOVe after analysing all likely sequences of play some time into the future. To play ; chess by the recursive method is to give every board configuration x a value F(x) and to look for a move which changes the configuration to one of greater (or Jllaximal possible) value. In other words, an ultimate goal (winning) is replaced by an immedia te goal (improvement of configuration). Of course, the chess example is complicated by the fact that there are two players, optimising for different goals. However, there are plenty of other examples. One can say that Nature has allowed for the limited intelligence (and processing power) of her creations by replacing ultimate goals by intermediate goals. An animal does not eat in order that it may not die (a distant goal) or in order that the species may survive (even more distant) but in order that it may not feel hungry (immediate goal). Evolution has implante d a mechanism which both senses relevant aspects of state (a need for food) and evaluates them (so that a condition of extreme hunger is unpleasant). 5 CALCULATION OF ESTIMA TES: THE KALMA N FILTER In order to implement a control rule such as as u1 = K 1x1 for an imp,erfectly observed state-structured system we have to be able to calculate 1 = x/>. LQG assumptions will imply that the distribution of x 1 conditional on W1 is normal, with mean x1 and covariance matrix V, say. The conditional mean value x1 is indeed identifiable with the conditionally most probable value, because it is the value which maximises P(x11W,). If the observation relation is also 'state-structured' in the appropriate sense then one can develop recursions which express and V, in terms of 1. V1_ 1, 1 and u1- 1 . The recursion for 1 is the ubiquitous Kalman filter (5). Such recursions supply the natural way of calculating the estimates, whether control is realised by digital or analogue means. Before developing them we shall clarify notation and some properties of the normal distribution. Suppose that x and yare a pair of random vectors (so defmed as joint random variables). Let us suppose for convenience that they have zero means: E(x) = 0, E(y) = 0. Then the matrix E(xyT) with jkth element E(xjyk) is the covariance matrix between x and y (note that the order is relevant). We shall denote it variously (20) E(xyT) = cov(x, y) = Vxy·
x
x
x,
x,_
y
We shall denote cov(x, x) simply by cov(x); the covariance matrix of the random vector x. If xis scalar then cov(x) is simply the variance of x, which we shall write as var(x). If cov(x, y) = 0 then one says that x andy are mutually uncorrelated, and writes this x.iy. The relation is indeed an orthogonality relation under conventions natural to the context.
240
THE LQG MODEL WITH IMPERFECT OBSERVATION
Lemma 12.5.1 Suppose the vectors x andy are jointly normally distributed with means. Then the distribution of x conditional on y is normal with mean and iance matrix given by
Proof Denote the value asserted for E(xly) in (21) by x; we shall later see this notation as consistent. Then one readily verifies that x - x and y are independently (and normally) distributed, so that the unconditioned distribution of x - .X is the same as its distribution conditional on the value of y. But one again verifies that the unconditioned distribution of x - x is normal with zero mean and covariance matrix equal to the expression for cov(xly) asserted in (21). The conclusion of the theorem thus follows. 0 We have implicitly assumed Vyy non-singular, but the assertions of the theorem remain meaningful even if this is not the case; see Exercise 1. We come now to the principal conclusions.
Theorem 12.5.2 Assume the imperfectly observed state-structured model specified in relations (1)-(4) of Section 1. Suppose initial conditions prescribe xo as normally distributed with mean x0 and covariance matrix V0. The model is then LQG and the distribution of x 1 conditional on W 1 is normal with mean x1 and covariance matrix Vr, say. (i) The estimate x1 obeys the updating relation Xr
= Axt-1 +But-! + Hr(Yr- Cxt-d
(22)
(the Kalman filter), where
(23) (ii) The covariance matrix V1 obeys the updating relation
vi= N + AVt-tAT- (L + AV1-1 cT)(M + cv1-t cTrl (LT + cv1-1AT). (24) Proof There are many proofs, and we shall indeed give one in Section 8 which dispenses with the assumption of normality. However, this assumption has been intrinsic to the LQG formulation and makes for what is by far the most economicalproof. The preliminary assertions of the Theorem follow from previous discussion. If we denote the estimation error x 1 - x1 by b.1 then V1 is exactly the covariance matrix of b. 1. The quantities Xr-1 - Xr-1, Xr- Ax1-1 - But-1, and Yt - Cxt-! are
I
241
: THE KAL MAN FILTER 5 CALCULATION OF ESTIMATES
The se are jointly norm ally with ~t-1. Er and 'IJt respectively. with zero mea n and covariance mat rix
[ v0 0
0
N L1
0
L M
l
.
join t as V. This effectively gives us the where we have written Vt-1 simply is to do t mus nal on W,_t, ur-I · Wh at we distribution of Xt-i . x, and Yr conditio by Xt of tion then con ditio n the distribu integrate out the unobservable Xr-t and -t, (Wr dist ribu tion of x 1 conditional on the value of Yt. so obtaining the the to nt gration over values of x 1_ 1 is equivale Ut-1 ,y 1) = W1, as desired. Inte t-t and -Bu -t Ax 1 Xt= <:; quantities substitution Xr-t = Xc-J + ~t-J· The I and so At+C 'IJt and I Atidentifiable with Et +A ( 1 = y 1 - C.Xr-1 are then rix ted with zero mea n and covariance mat ribu are jointly norm ally dist
VA 1 [ N1 +A L + CVA 1
L +A VC1 ] M + CVC1
(25)
nal 12.5.1 that the dist ribu tion of C con ditio It then follows by an appeal to Lem ma H1 re whe , V rix 1 n H 1( 1 and covariance mat on the value of ( 1 is norm al with mea nt eme stat a and (24). Converting this into and V1 have the values asserted in (23) (i) s rtion asse (and W1_ 1) we deduce the of the dist ribu tion of Xt con ditio nal ony1 0 and (ii) of the Theorem. n of Ric cati type, analogous to the Rec ursi on (24) is a forward recursio a optimisation. The analogy is based on backward Ric cati recursion of control only is it ; istic rmin dete is 9. The recu rsio n real duality, as we shall see in Section V ating. However, we need the value of upd its for which need s the observations the 1, ion Sect in see from (23). As observed to dete rmi ne the updating, as we g the equ atio n with the estimate replacin t Kal man filter (22) is just the plan It is t. e nois t innovation (rep laci ng the plan actual stat e value x and a term in the y1 ion rvat ides a corr ecti on when the obse this latter term which of course prov ch would have bee n predicted. turn s out to differ from the value whi ly con tinu ous- time version. We shall simp All this mat eria l has an imm edia te or by a ined either by analogous arguments quote results, since these can be obta t. formal pass age to the continuous limi ion el specifies the plan t and observat mod ured The stan dard state-struct relations (26) Cx+ "l
x
x
x
.X =A x+ Bu + E,
y
=
n ion noise) are jointly white, with zero mea where f. and "' (plant noise and observat rential form, Note that the plan t relation is in diffe and pow er mat rix [
;!r
i£] .
242
THE LQG MODEL WITH IMPE RFEC T OBSE RVATION
the observation relation in instantaneous form. This is generally regarded as natural formulation, although we shall have occasion to recast it in Chapter 25.
Theorem 12.5.3 Assume the imperftctly observed statestructured model sveczrz,pil above. Suppose that initial conditions prescribe x( 0) as normally distributed with i(O) and covariance matrix V(O). Then the distributio n ofx(t ) conditional on is norma/with mean i(t) and covariance matrix V(t), say, determined as follows. (i) The estimate xobeys the updating relation
i =Ax +Bu + H(y -
Cx)
(the Kalman filter), where H
= (L + VC1 )M- 1 •
(ii) The covariance matrix Vobeys the updating relation
V = N +AV + VA 1
-
(L+ VC1 )M- 1(L1 + CV).
Exercises and comments (1) Suppose that Vyy is singular, so that vectors c exist such that Vyyc = 0. These are elements of the null space %of Vyy· For such vecto rs var( c1 y) = c1 Vyyc = 0, so that c1 y is zero in mean square. Show that VxyC = 0 for such c. An equation Vyya = b for vector a will have a solution, which we shall denote by V_;; 1b, only if b is orthogonal to all c of .JV, and this solution is arbit rary to within addition of an element of% . Show that different evaluations of the expression for E(xjy) in (21) obtained in this way differ by a term c1 y for some c of .JV (and so are zero in mean square) and that expression for cov(xjy) is uniqu ely determined.
6 PROJECTION ESTIMATES Our LQG analysis has generated effective estimates, the most probable values of unobservable random variables conditional on the values of observables, under the assumption that all random variables are jointl y normally distributed. However, the estimates thus generated have distin ctive properties if one makes no distributional assumptions at all, apart from speci fication of first and second moments of the variables concerned. This weakening of hypotheses leads to a new view of the conclusions derived under Gaussian hypotheses in Sections 1-5, and a valuable one. We shall assume notations (for covariance matrices, etc.) and defin itions (of orthogonality etc.) listed in Appendix 1. Suppose that x is a random vector of zero mean and covariance matrix Vxx· Then we can still define the quadratic form
6 PROJECTION ESTIMATES
243
D(x) = !xTV~1 x, norm al density had x which would have occu rred in the expo nent of the then have term ed the norm ally distributed, and which we woul d ugh it is only when x is norm ally iilel'l:Pillll'-'Y· Let us conti nue to do so (altho istrilbutc:a that we can identify D(x) with II) ( x)). non-singular A of appro priat e dime nsion we have
D(x) =!(Ax)T(AVx.xAT)- 1(Ax). A VxxAT is just the covariance matr ix of the trans form
ed vector Ax. We thus
the simple but significant conclusion:
(Ax) T v- 1(Ax). Then necessarily ,,,,, ,, , L,_,,,_J2.6.1 Suppose that D(x) can be written! ·· = cov(Ax).
k,. ·. ·. From this we furth er dedu ce the following lemm L:··: ...... L~mma12.6.2 infxD(x,y) = D(y).
a.
r
! .
linea r function of y. Then Proof Let the minimising value be deno ted x, a D(x, y) can be written as the sum of quad ratic forms (30) (30) and the previous prop ositio n for some Vt, Vz. It then follows from expression l, with V1 = cov( x- x) and that x- x and y are mutu ally ortho gona final form in (261 the V2 = cov(y). Since plainly infxD(x,y) equals the D conclusion then follows. ofx in terms of y with whic h Note that the .X thus deter mine d is just the estimate ble value if distributions are we are familiar, the conditionally most proba estimate. norm al. Let us now term it the minimal discrepancy not, and one would wish to has x and Suppose now that y has been observed lest estimate would be a linear form an estimate of x in term s of y. The simp estimate, ofthe form
x=H y.
(31)
a linear least-square estimate: to and the most imme diate conc ept would be that of re error E[ (x - x? G(x - x) choose the coefficient H in (31) so that the mean -squa readily finds that noth ing is is mini mal for presc ribed positive-definite G. One (i.e. the optim al value of the gaine d by the addit ion of a cons tant term to (31) H yields the condition to ct cons tant is zero) and that optim isatio n with respe (32) X- X ..L y,
244
THE LQG MODEL WITH IMPERFE CT OBSERVATION
independent of G. That is, that the estimation error is orthogonal to the observations. Let us term a linear estimate satisfYing (32) a projection estimate. Then we can assert that condition (32) is sufficient for the linear least-square property as well as necessary, in that a projection estimate also has the linear least-square property. In fact, we can assert a number of equivalences.
Theorem 126.3 The following characterisations ofan estimate x of x in terms ofy are equivalent. (!) It is a projection estimate, i.e. a linear function ofy for which x- x ..L y. qz) It is a linear least-square estimate. qiz) It is the value ofx which freely minimises the discrepancy D~, y). qv) It has the property that cov(x- x)
~
cov(x- x).
(33)
for any linear estimate x. All such estimates are effectively identical, in that they are equal in mean square. Proof Let us abbreviate 'linear least-square' and 'minimal discrepancy' to LLS and MD respectively. We know that the LLS estimate is a projection estimate and it follows from the proof of Lemma U.6.2 that the same is true of the MD estimate. If x = Hy then the orthogonality condition (32) implies that 0 = E[(Hy- x)yTJ = HVyy - Vxy·
(34)
So, if Vyy is non-singular, then the projection estimate has the unique determination X= A
vxy v-• yyY·
(35)
(cf. (21)). If Vyy is singular then the analysis of Exercise 5.1 shows that the solution (35) is still meaningful and different evaluations of it are equivalent in mean square. However, proof that a projection estimate has the stronger LLS property (34) both provides the reverse implications (i) => (iv) => (ii) and demonstrates mean square equivalence. because x - is a linear function of y. We have x ..L x Note first that then
x-
x,
x
cov(x- x) = cov(x- x + x- x) = cov(x- x) +cov(x - x,x- x) + cov(x- x,x- x) = cov(x- x) + cov(x- x)
+ cov(x- x) (36)
Now cov( 6) ~ 0 for any random vector 6, and in particular for the vector 6 = x - x. Thus inequality (33) holds.
6 PROJECTION ESTIMATES
245
the the argu men t above also establishes If x is also a projection estimate then lity. reverse inequality to (33), and so equa square. (x- x) = 0, so that x = x in mea n cov then (33) in s If equality hold als ty then shows that cov (x- x,y) equ An application of Cauchy's inequali 0 sis. Thus xis also a projection estimate. cov (x- x,y), which is zero, by hypothe
, y), re is trouble with the definition of D(x We have glossed over one point. The t join the if ate, estim imu m discrepancy and so with the con cep t of the min nt ifica sign very a ular. The poin t is actually covariance mat rix of x and y is sing the deal with it (see Exercise 1) provide we one, beca use the means by which 9. tion Sec whole con trol problem; see passage to the 'righ t' formulation of the have the erro r covariance mat rix We verifY that all proj ecti on estimates (37) 1 cov (x- x) = Vxx - Vxy V_;;, Vyx, rcise in the mor e circ ums pect sense of Exe immediately if Vyy is non-singular and 5.1 otherwise. on esti mat e x of x in term s of yas We shall som etim es write the projecti (38) jy),
x = c&'(x
of s. Thi s can be regarded as a tran slat ion to emphasise the role of the two variable al, norm were s able vari be app ropr iate if the the expression E(xjy), which would line ar prop ertie s are assumed. c&'(xiy) is a rder to the case whe n only seco nd-o that x. It is a line ar ope rato r in function of y having the dim ensi on of 8(x1 + x2iY) = 8(xJiy) + 8(x2iY) and 8(Axiy) = A8(xiy) is that it ension. However, the essential poin t for x 1 and x2 vectors of the sam e dim erties. prop c risti acte the following char is inde ed a proj ecti on operator, enjoying
Theorem 12.6.4
m8 2 = 8, in that $[8(xiy)iy]
vz) IfYI .l Y2 then
I l
I !
= 8(xJy). (39)
from S(Hyly) = HC(yjy) = Hy. AsProof Ass ertio n (i) follows immediately nd mem ber of (39) is from the observations that the righ t-ha
sert ion (ii) follows late d that the erro r of this estimate is unc orre cert ainl y a line ar esti mat e of x, and with both y1 and y2.
Exercises and comments
nite V 1 .[2..\T x- ..\TV..\] for non-negative defi (1) The mat rix identity xTv - x = sup.> +oo is on is sing ular (in that the evaluati is fundamental, and valid even if V
246
THE LQG MODEL WITH IMPERF ECT OBSERVATION
unless x lies in the orthogo nal complement of the null space of V). It gives us the evaluation of the discrepancy D(x,y)
=
r[
s~y [ATx + pTy- ![~ ~;: ~;;] [~ J]
The minimi sing values of A and p are the differentials of D with respect to x andy. If x has the discrepancy-minimising value then). must be zero, so the extremal equations with respect to A and f-t become x = Vxy/L andy = Vyy/.L· This amoun ts just to the equatio n (35). The analogues of A and f-t in the dynamic context are just the Lagrange multipliers associated with plant and observation equations, and 'it is by their introduction that a direct trajectory optimisation finds its natural completion (see Section 9) as it did already in the case of perfect observation (Sectio n 6.3).
x
7 INNOVATIONS The previous section was concer ned with a one-stage situation. In the temporal context we shall have a multi-stage situation, in which at every instant of time t (= 0, 1, 2, ... ) one receives a new vector observation y 1• The observa tion history at timet is then Y1 = {yr; 0 ~ T ~ t}. Define the innovation ( 1 at time t by (o =Yo- E(yo), (t=Yt-< if(YtlY t_I) (t=l,2 ,3, ... ). (40) The innovation is thus the deviation of the actual observation Yt at timet from the projection forecast one would have made of it the momen t before, at time t - 1. It thus represents the 'new' inform ation gained at time t, in that it is that part of y 1 which could not have been predict ed from earlier information.
Theorem 12.7.1 The innovations are mutually uncorrelated. Proof Certain ly ( 1 l. Yt-l· It follows then that ( 1 l. (r forT < t, becaus e (r is a linear function of Yt-l· 0 Now define
x(t) = .C(xl Yt), the projection estimate of a given random vector x based on inform ation at time t. We shall use this supersc ript notatio n consistently from now on, to denote the optima l determ ination (of the quantity to which the supersc ript is applied , and under a criterion of optimality which may vary) based on inform ation at timet. It follows, by appeal to (39), that
x(tl = C(xl Yt-d
+ tB'(xlCr)
= x(t-i)
+ HtCt
(41)
7 INNOVATIONS
247
where the matrix Hr in fact has the determination
Hr
= cov(x, Ct)[cov((r)r 1.
(42)
Equation (41) shows that the estimate of x based on Yz-l can be updated to the estimate based on Y1 simply by the addition of a term linear in the last innovation. This is elegant conceptually, and turns out to provide the natural algorithm in the dynamic context. Equation (37) of course implies further that X(t)
=
t
t
r=O
r=O
'L: l(xl(r) = L Hr(r-
which is nothing but the development of a projection in terms of an orthogonal basis. However, the innovations are more than just an orthogonalisation of the sequence {y1}; the fact that the orthogonalisation is achieved by the timeordered rule (40) means that ( 1 indeed has the character expressed in the term 'innovation: The innovations themselves are calculated by the forward recursion 1-l
(t
= Yt- L l(yrl(r)·
(43)
r=l
Alternatively, we can take the dual point of view, and calculate them by minimising the discrepancy backwards in time rather than by doing least square calculations forward in time. Theorem 12.7.2 The value ofy 1 minimising D( Y1) is C(y 1 1 Yr-1) and the minimised value is D ( Yt-1). Further,
(44) where Mt = cov( (t). Proof The first pair of assertions follows from the theorems of the last section,
and relation (44) is nothing but relation (30) transferred to this case.
0
We plainly then have the additive decomposition of the discrepancy
D(Yr)
=
n=~=l (~M; 1 Cr·
Exercises and comments
(1) Let u~l be the estimate of the optimal value of uT formed at time t by the application of the extended CEP in Section 4. Show that indeed uVl = l(u~r) IW1 ).
248
THE LQG MODEL WITH IMPERFEC T OBSERVATION
(2) A proof of the certainty equivalence principle. Consider the state-structured
LQG model expressed in equations (1)-(4), the criterion being minimisation of expected cost. Let F(x 1 , t) and u(x1 , t) be the value function and oftimal control rule at time t under perfect state-observation. Note that x~t+l = x~r) + z1 = x1 + z1 where, conditional on ( Wr, ur), the random variable z r has zero expectation and a covariance matrix independent of policy or process history. Hence show, by an inductive argument, that the value function and optimal control in the case of imperfect observation are F(x1, t) + · · · and u(x 11 t) respectively, where+ · · · indicates a term independent of policy or process history. 8 THE KALMAN FILTER REVISITE D We can now give a quick derivation of the Kalman filter under purely secondorder assumptions, when x1 is defined as the projection estimate tf(xrl Wr). The proof is in some ways more enlightening than that of Section 5, although it needs more in the way of preliminary analysis. We have Xr = tf(Axr-l +Bur-l+ =Bur-!+ O"(Axr-!
= A.Xr-1
c:rl We)
+ t:eiWr-1) + tf(AXr-I + t:rl(r)
(45)
+ Bur-! + Hr(t
for appropriate H 1• Here ( 1 is the innovation (46) so the form (22) of the Kalman filter recursion follows from (45) and (46). Further 1 Hr = cov(Axr-1 + er, (r)[cov((r )r 1 = cov((;, (r)[cov((r )r ,
(47)
where
(48) But, as in Section 5, (; and ( 1 can be written as e1 + A~r-1 and 'TJr + C~r-I. respectively, and are jointly normally distributed with zero means and with covariance matrix (25). The expression (23) for H 1 thus follows, as does expression (24) for (49) The quantity ( 1 is certainly the innovation in the observations. The quantity (; equals x 1 - tf(x11 Wr-1 ), and so can be regarded as the plant innovation. 9 ESTIMAT ION AS DUAL TO CONTRO L The recursion (24) for the error covariance matrix V1 stands in obvious parallel to the Riccati recursion (2.25)/(2.26) of LQ control optimisation. The analogy is
r
'';•i>
9 ESTIMATION AS DUAL TO CONTR OL
249
of covariance complete, with only the 'dualising' modifications that transposes d instead of forwar goes ion matrices replace cost matrices and the recurs which is tion estima and l backward. In fact, there is a duality between contro ngian Lagra the of on constantly in evidence, but is revealed cleanly only by adopti methods of Section 6.3. conditions, The stochastic model must be completed by prescription of initial ution distrib the of which in the present state-structured case means prescription . The V matrix 0 of xo conditional on Wo; norma l with mean .Xo and covariance to up tion realisa expression for Dr= [l)(X1, Ytl Wo; Ur- 1 ), the discrepancy for the time t, is then
u by appeal to where A, E and TJ are understood as expressed in terms of x, y and n relatio initial the and (2) the plant and observation relations (1), (51) Ao = .Xo- xo. {x~l; 0:::;; T~ t}, The projection estimate of the course of the process up to timet, x'1'-values. is obtained by minimising [l)r with respect to the corresponding less natural does the , inistic determ being to is model However, the nearer the s appear in (11) inverse whose es matric ance this formulation seem. The covari zero, and this ally identic is nent compo noise and (50) will be singular if any forms. these ising minim and sing plainly presents problems in discus reformulation The resolution turns out to be a continuation of the Lagrangian naturally to over carries indeed which of Section 6.3 for the deterministic case, plant and the view we is, That . the stochastic imperfectly observed model aints constr as (51) ion condit observation equations together with the initial us Let liers. multip ge Lagran of which are best taken care of by the introduction and 0 = T case the in (51) introduce multipliers IT to take account of constraint multiplier mT to the plant equation at time Tin the case T > 0; correspondingly, a the Lagrangian has take account of the observation relation at time T. One then
form r
[l)r-
IQ(xo- .Xo- b.o)- 2)z;( xT- AxT-1 - BuT-! - ET) T=J
+ m;(YT - CxT-1 -TJT)], to be maximised to be minimised with respect to x and the noise variables and a superscript have should liers with respect to the multipliers. Properly, the multip at time t. ation inform on t to indicate that the extremisation is one based when only cript supers this e However, for notational simplicity we shall includ the point needs emphasis.
THE LQG MODEL WITH IMPERFECT OBSERVATION
250
Theorem 12.9.1 If the noise variables are minimised out ofthe Lagrangian form then the problem is reduced to that ofrendering the time integral I
O(l,m, x) =[!lTV/+ zT(x ~ x)Jo + l:[v(lr,mr)
(52)
T=J
+ !J(xr ~ Axr-1 ~ Bur-d + m;(Yr ~ Cxr-I)] maximal with respect to the x variables and minimal with respect to the (/, m) variables. Here
v(l,m)=![~r[~ z][~]
(53)
and the noise estimates are related to the multipliers by T7
/(t) ~ -
r 0 0
(t) A (1) - Xo - Xo , uo
(54)
A
(55)
Proof Equations (54) and (55) are just the stationarity conditions for the Lagrangian form with respect to the noise variables. At the values thus determined the Lagrangian form reduces to -0(/, m, x), and remaining assertions follow. D Note that the transformation clears the form of matrix inverses. Note also that 0 is the analogue of the Lagrangian form (6.19) for the deterministic control case, with(/, m, x) replacing (x, u, >.). Thequadraticform v(l, m) isjustthe information analogue of the instantaneous cost c(x, u) defined in (4). In general, we shall see that the primal and dual variables switch roles when we move from the control problem defined on the future to the estimation problem defined on the past . . The multipliers I and m are to be regarded as differentials of []) 1 with respect to the values of noise variables, which in turn implies the relations (54) and (55) between multipliers and noise estimates. By taking the extremisations of the form 0 in various orders we can obtain the analogues of various familiar assertions. Firstly, by writing down all the stationarity conditions simultaneously we obtain the equation system
[~ ~ !Hfr+[~"] ,~ 0
(1
~ 7 ~
(56)
t).
together with the end conditions uz(t) -
A
r,o - xo
~
(t)
Xo ,
(l,m)~) = 0
(T
> t).
(57)
r
~.·.
9 ESTIMATION AS DUAL TO CONTROL
251
In (56) we have again used the operator notation d=l-A ff,
with il = I - AT ff- 1 etc. Note also that the translation operator acts on the subscript r, not on the superscript t, which is regarded as fixed for the moment. In equation (56) we see the complete dual of the corresponding system (6.20) for the control problem. We can use the operator and operator factorisation approach just as we did in Section 6.3 and 6.5; the analogue of an infmite horizon will now be the passage from a start-up point at r = 0 to r--+ -oo. We shall follow through these ideas explicitly and for a more general case in Chapter 20. Theorem 12.9.2 Ifone extremises out the x-variables in 0then the problem is reduced to that ofminimising the form I
!UTV1)0 + l:)v(l,.,mT ) -z;B~-1
+ m;yT]
(58)
T=1
with respect to the variable (~ m), subject to the backward equation (0
~
r
~
t)
(59)
and end-condition (I, m)T = 0 (r > t). The assertion is immediate. Its point is that exhibits the estimation problem as the complete analogue of the control problem originally posed in Section 2.4 Equation (59) is the analogue of the plant equation, v(!, m) is the analogue of the instantaneous cost, and the initial term ! WV/) 0 is the analogue of the terminal cost. The difference lies only the occurrence of a few terms linear in (l, m) in the sum (58)-the observation and control variables constitute effective input terms which lead to a non-homogeneity. However, we are in general not interested in estimating all of process history, but merely the value of current state x 1• For this we resort to recursive arguments, just as the optimal control was deduced by the recursive argument of the dynamic programming equation. Theorem 12.9.3 The extremum of form (52) for prescribed x 1 and 11 is (! [T VI + [T (x - x) ]1 + · · · , where + · · · indicates terms independent of these variables. Proof Let us drop the !-subscripts for the moment The constrained extremum is certainlyoftheform!JT PI+ zT(x- J.t) +···for someP, J.t. Ifweminimisewith respect to 1 then this becomes - ! (x ~. J.t) Tp- 1(x - J.t) + · · ·. But this must be 0 identical with-! (x- x) Ty- 1(x- x) + ···,whenc e the assertion follows.
252
THE LQG MODEL WITH IMPERF ECT OBSERVATION
Theorem12.9.4
The recursion
holds, with
lt-1 =AT[+ cTm. The Kalman filter (22) and theforward Riccati recursion (24)for Vfollow. Proof It follows from the previous theorem and expression (52) for the time integral that we have the recursion [!/TV/+ /T(x- x)] 1 = ext{(!zTV/ + zT(x- x)] 1_1 + v(l,,m,) + !'[(x,- AXt-1- Bu,_r) + m"f(y,- Cx,-1)}, (62) where the extremum is with respect to /1_ 1 , x 1_ 1 and m1• The extremum with respect to 1 yields the condition (61) and the recursion reduces to (60). The Kalman and Riccati relations then follow immediately. 0
x,_
The reader may ask why we should bother with yet a third derivation of the Kalman filter. Well, the derivation is incidental to the goal; a formalism which is claimed as significant could hardly be regarded as living up to that claim unless it delivered the Kalman filter in passing. On~ point is that the Lagrangian approach, already seen as valuable for the optifiisation of deterministic control, is now seen to extend to this stochastic imperf~ctiY observed case. More sigiuficantly, it achieves the goal which has loomed ever more clearly as desirable: the characterisation of the optimisation problem as the free extremisation of a timeintegral such as (52) with respect to its variables. Constraints can of course not ~e eliminated, but they can be exhibited as natural consequ ences of a free optimisation in a higher-level problem. This higher-level formulation was achieved in two steps. The first was the deduction of the certainty equivalence ,principle, which absorbed the constraint constituted by the realisability condition (that control must depend on current observables alone). The second was the introduction of the dual variables I and m, which absorbed the constraints constituted by plant and observation equations. Comparison of the equation systems (6.20) and (56) demonstrates the complete symmetry between control/future and estimation/past, with primal variables (x, u) and dual variable s(/, m) switching roles in the two cases. There are other byproducts in terms of insight and solution: reduction of these equation systems by the canonical factorisation methods of Section 6.3 solves the stationary optimisation problem for stochastic dynamics of any order (see Chapter 18-21). Remarkably, insight becomes complete first when we consider the risk-sensitive models of Chapters 16 and 21, for which past and future time-integrals are
9 ESTIMATION AS DUAL TO CONT ROL
253
,A. assoc iated in a single integral, and the Lagrange multipliers land d. relate be the plant equation in the past and in the future can
-~~""""
Exercises and comments extrema are taken in the (1) Alternative forms. By varying the order in which the holds with the alternative last equation, demo nstrat e that the Kalm an filter (22) evaluation of Hand recursion for P.' (63) (64) 1 ions (63)/(64) have the Here A= A- LM- 1 C and N = N- LM- LT. Relat I (24), just as was the case same continuous-time limit version (28)/ (29) as do (23) for the corre spond ing control relations.
CH AP TER 13
Stationary Process; Spectral Theory E FUN CTIO N 1 STATIONARITY: THE AUTOCOVARIANC homogeneous and for whic h a Consider a system which is intrinsically timerule is successful, in that it is stationary control rule is adopted. If this control to steady-state behaviour. A stabilising, then the system will eventually settle down viour is itself termed stationary. stochastic process showing such steady-state beha statistically invariant unde r More technically, a process is stationary if it is this, consider first the discretetime-translation. To appreciate what is mean t by xis the process variable, and time case. Deno te the process by {x1}, so that whole process in a parti cular denote a realisation (i.e. the actual course of the tor f7 introduced in Secti on case) by X. Recall the backward translation opera formed sequence f7 X is X 1-t 4.2, which has the effect that the tth term in the trans sequence fl' X is x 1_,, for any rather than x 1• More generally, the tth term in the nary if statio is } integral r. One says then that the process {x 1
E[>(fi'X)] = E[>(X)]
(1)
of the realisation for which the for any integral rand for any scalar functional> ly what is meant by 'statistical right-hand mem ber in (1) is defined. This is exact will in fact hold for all r if it invariance unde r time-translation'. Note that (1) holds for r = ±1. been to optimise in the finite Our approach to control optimisation hithe rto has tended to a stationary form ol horizon initially and then see if the optim al contr itself converge to stationarity in the infinite-horizon limit. The process may then . However, one could move in the course oftim e (i.e. in the infinite-history limit) g for the stationary rule which to the stationary situation immediately, by askin statio nary state. We make some minimises an expected cost per unit time in the clear, in any event, that there is observations on this point in Exercise 2, but it is d, unde r LQ assumptions there interest in the study of the stationary case. Indee with the state-structured case are two principal bodies of technique: one dealing the stationary case by trans form by recursive methods and the other dealing with methods. ption that xis vector-valued (a If we consider LQ models then there is an assum functions of X which we only column vector of dimension n, say) and that the process is Gaussian, then its need consider are linear/quadratic. Indeed, if the and second-order moments. statistics are completely characterised by its first-
STATIONARY PROCESSES; SPECTRAL THEORY
256
As far as frrst-order moments are concerned, the stationarity condition (1) will imply that E(x 1) is independent oft, so that
(2) for some constant n-vector J.L· As far as second-order moments are concerned, (1) will imply that that the covariance between x 1 and x 1_, is a function of r alone:
cov(xr,Xt-r)
=
v,,
(3)
say, for integral r. The quantity v, is termed the autocovariance at lag r of the process. It is an n x n matrix whose jkth element is the covariance between the jth element of x 1 and the kth element of x 1_,. It provides the only measure one has of dependence between values of the process variable at different times, if one is restricted to knowledge of first and second-order moments. It is therefore the only measure there is at all in the Gaussian case. If one regards v, as a function of the lag r then it is termed the autocovariance function. The full expression
v,
= E[(x 1 - J.L)(x 1_ , -
for v, and stationarity imply that v_,
JL)T],
(4)
= v'f, or v,
T = v_,.
(5)
That is, the autocovariance function is invariant under the combined operations of. transposition and lag-reversal. This plus the further property indicated in Exercise 1 are in fact the characterising properties of an autocovariance function. Commonly one supposes that the process has been reduced to zero mean (by adoption of a new variable x- J.L). In this case (4) reduces to v, = E[x1xJ_,]. A degenerate but important stationary process is vector white noise, {E1}, for which v, is zero for non-zero r. One normally supposes that this has zero mean, and, if v0 = V, then one speaks of {E1} as 'white noise of covariance matrix V~ The corresponding assertions for the case of continuous time are largely immediate; one simply allows the time variable t and the lag variable r to take any value on the real line. The one point that does need special discussion is the character of white noise, already covered to some extent in Section 9.5. Discretetime white noise {Er} of covariance matrix Vhas the property that the linear form I>=~ 1 E 1 has zero mean and covariance matrix 'L, 1 at V a'f, at least for all sequences of matrix coefficients {a 1} such that this last sum is convergent. The corresponding characterisation of continuous-time white noise is that the linear form a(t)E~t) dt is normally distributed with zero mean and covariance matrix Ja(t) Va(t) dt. The additional property of normality follows from the assumed independence and stationarity properties. If E had autocovariance function v(r) then the relation
J
DENSITY MATRIX 2 THE ACTION OF FILTERS: THE SPECTRAL
cov (j a(t)E(t) dt)
=
jj
257
a(tJ) v(t1- t2)a(t2)T dt1 dt2
asserted implies then that would hold in regular cases. The evaluation ion; an indication. of the v(r) = Vc5(r), where 6 is the Dirac delta funct noise. The matrix V is the exceptional character of continuous-time white time interval; it is appropriately covariance matrix of the integral oft: over a unit referred to as the power matrix oft:. Exercises and comments
2 0, for any sequence of column (1) Note that 1:1 L:k aJ v1-kak = E(l: 1 aJ Xt) ;;;;: ges. vectors {at} for which the first sum cover programming equa tion that (2) Note that there is no assumption in the dynamic one is considering an assum ed the controlled process is stationary, even when Kx 1). One may say that this is infinite-horizon limit (as with the simple rule Ut = nary, optimises passage to the because the control thus derived, although statio state. If one tries to determine steady state as well as performance in the steady s oneself an instr umen t (the an optimal steady-state rule directly then one denie less effective in dealing with reaction against transients) and may derive a rule al for the deterministic LQ optim such transients. For example, the rule u1 0 is dy been reached. It will alrea has problem of Section 2.4 if the equilibrium x 1 = 0 ising unless x = 0 is a stabil be not be optimal if x is not zero, and will not even stable equilibrium of the uncontrolled process. ge-cost optimal' but not This is an extreme example of a policy which is 'avera (as in Section 10.1, for N > 0) optimal. If non-degenerate plant noise is present st optimal. This is the infinitethen there is only one policy which is average-co rary initial conditions, and so horizon optimal policy, which is optimal for arbit also copes optimally with transients.
=
DENSITY MATRIX 2 THE ACTION OF FILTERS: THE SPECTRAL is as natural in this context as it The use of z-transforms and Fourier transforms se of the translation-invariant was in Chap ter 4, and for the same reason: becau ete-time case and define the character of the process. Consider again the discr ) autocovariance generating function (abbreviated AGF 00
g(z)
=
L
v,t',
(6)
r=-oo
x n matrix whose elements are where z is a complex scalar. The AGF is then ann 1 z- ) T, a relation which we shall functions of z. Property (5) implies that g( z) = g( ons 6.3 and 6.5. It follows then write as g = g, consistent with the usage of Secti
258
STATIONARY PROCESSES; SPECTRAL THEORY
that, if the doubly infinite series (6) converges at a given value of z, then it also converges at z- 1• In particular, if v, decays as p' with increasing r, for some scalar pin [0, 1), then g(z) converges in the annulus p < lzl < p- 1. Note that a white noise process of covariance matrix Vhas a constant AGF: g(z) = V. The AGF may well not converge anywhere. The standard example is the simple sinusoid x 1 = sin(w0 t - '1/J), which defines a stationary process if the phase 'lj; is assumed to be a random variable uniformly distributed over (-1r, 1r). Its autocovariance function is v, = cos(wor), so that the series (6) indeed converges nowhere. A regular class of stationary process is constituted by those for which x is the output of a stable system obeying finite-order linear dynamics and driven by white noise. We shall see below that g(z) is then rational and necessarily convergent in an annulus p < lzl < p- 1• The important property of the AGF is that it transforms very much more pleasantly under the action of a filter on the process than does the autocovariance function itself. Suppose that a process {y1} is derived as the output of a translation-invariant linear filter with stationary input {x 1}:
!
(7) Here we have used the notation of Chapter 4, so that b, is the transient response of the filter and B(z) = Ls b,z' its generating function; the transfer function of a discrete-time filter. If the sum in (7) is convergent (the appropriate sense here being that of convergence in mean square) then the output {yt} is defined and also stationary. Since we are dealing with more than one process then autocovariances etc. must be labelled by the process to which they refer. Let us denote the autocovariance and the AG F for the process {x 1 } by v~x) and g(x) ( z), etc. Theorem 13.2.1 Suppose that the functions g(x) (z) and B(z) are both analytic in an annulus p < lzl < p- 1• Then the sum (7) is mean-square convergent, the AGFg(yl(z) ofthe output is analytic in the same annulus and output statistics are related to input statistics by Vr(y) --
" "" " b·Jvr-j-j+k (x) bTk• L..L..j
(8)
k
(9) Proof Relations (8) and (9) are those which would be established by formal argument: assertion (8) follows from (7) and the definition of an autocovariance; assertion (9) then follows by the formation ofgenerating functions. Convergence of the sum (7) and validity of these manipulations is amply justified by our
DENSITY MATRIX 2 THE ACTION OF FILTERS: THE SPECTRAL
259
erge to zero as plrJ with incre as· very strong assumptions: that v~x) and b, conv 0 ing lrl. strong, but make for simplicity, The assumptions of the theorem are excessively h largely covers our needs. Som e and are justified in the rational case whic tantial relaxation takes one deep relaxation is indicated in Exercise 2, but subs into Fourier theory. form ation rule (8) for the The poin t of the theo rem is that the trans ly pleasing, but the rule (9) for the autocovariances themselves is not particular suggestive. We shall often write it transformation of AGF s is both comp act and simply as g(y) = Bg(x) lJ. a mod el already cons idere d in A prim e example is a statistical version of Section 4.4: A(ff )x = t. y= Cx, noise of covariance V, so that the Here A (ff) is polynomial in ff and t is white rated by a stochastic difference outp ut y is a linear function of a variable x gene have all its zeros strictly outside the equation. Stability requires that lA (z) Ishould esses converge to stationarity with unit circle. Und er these circumstances the proc 1 g(yl(z) = CA(z )- 1 V A(z f cT.
a neig hbou rhoo d of the unit circle. This is certainly rational, and free of poles in , if, as in Section 4.2, we see the The AGF can be given a more physical basis ed define power series (6) as a Four ier series. Let us inde
:2: v,e-irw, 00
f(w)
= g(e-iw) =
(10)
r=-oo
ency'. The transformation (10) with where the variable w is to be regarded as 'frequ unit circle, where we expect it to w real amounts to considering g(z) on the converge if it converges anywhere. Suppose x scalar, for simplicity. We have h-l
0 ~ h-1 Ell : Xte-itwl2 = h-1 t=O
h
h-1 h-1
:2: :2: j=O k=O
Vj-kei(k-j)w =
L (I -
lr!/h)v,e-irw.
r=-h
with increasing h if regularity This last expression will converge to f(w) is that 2::, lv,! < oo. We see then conditions are satisfied; a sufficient condition positive and can be identified with that, unde r such cond ition s,f(w ) is real and trans form of the sequence {x1 }, in the expected squa red mod ulus of the Fourier sense. This argu ment leads to an an appropriately norm alise d and limiting
260
STATIONARY PROCESSES; SPECTR AL THEORY
interpretation off(w) as the density of the expected distribution of 'power' in the process {x 1 } over the frequency spectrum. For this reason it is termed the spectral density function in the scalar case, and the spectral density matrix in the vector case. For the sinusoidal process sin(wot -1{;) mentioned above the Fourier series (10) of course does not converge, but limiting arguments evaluate it as f (w) = [8(w- wo) + 8(w + wo)]. That is, the energy of the process is indicated as being concentrated in the frequencies ±w0 , which is indeed consistent with the nature of the process. Relation (9) becomes, in terms of spectral densities,
!
J(Y) (w)
= f3(w)f(x) (w)f3( -w) T = f3f(x) ,8,
(11)
where f3(w) = L,, b,e-irw is. the frequency response function of the ftlter (see Section 4.2). This could be regarded as the consequence E[(CY>(CY>J = f3E[((x)((x)],8 of a relation (CY>(w) = f3(w)((x)(w), where ((x)(w) is the (random) Fourier amplitude at frequency w in a Fourier decomposition of x into 1 frequencies and ((xl(w) its transposed complex conjugate. In fact, such a decomposition is a delicate matter and one must be circumspect in speaking of a 'Fourier amplitude', but the view is nevertheless a correct one; see Exercise 1. Note that we expect the relation Vr
= -21 11"
1'/f ei""f(w) dw,
{12)
-'If
inverse to (10). 'The continuous-time results are formally analogous, with sums overt replaced by integrals over the whole time axis and integrals over w replaced by integrals over the infinite frequency axis ( -oo, oo). Thus, the mutually inverse pair of relations (10), (12) is replaced by
v(r)
11
= -2
11"
00
eiWTf(w) dw.
-oo
These relations are certainly valid if/(w) exists and is integrable. If we conside r the importa nt special case (in continuous time) of rational f(w), thenf(w ) can certainly have no singularity on the real w-axis, so integrability cannot fail for this reason. However, integrability over the infinite w-axis will also require thatf{w ) should tend to zero with increasing w, and should do so sufficiently quickly. The effect of this condition is to exclude a white noise component, since white noise has a constan t spectral density. If f(w) is rational and tends to zero with increasing w then it must indeed tend to zero at least as w- 2, and so is integrable. The ftlter x--+ y expressed by y = B(!'i)x has transfer function B(s) and frequency response function f3(w) = B(iw). With these understandings, relation (11) holds in regular cases.
I 4
.~
~ ~
i
1 ::;.
ESSIVE REPRESENTATIONS 3 MOVING-AVERAGE AND AUTOREGR
261
the cons tant value V (where Vis the . For a white noise process the SDF f( w) has as to whether the proc ess is in (;()variance matr ix or powe r matr ix, according as an indication that the ener gy discrete or continuous time). One interprets this rang e of meaningful frequencies of the process is uniformly distributed over the cont inuo us time). The proc ess was (i.e. (-11", 11") in discrete time and ( -oo, oo) in belie f that white light has such a then given the nam e 'white' noise in the mist aken even over the visible range, but uniform ener gy spectrum. It certainly does not, the term 'white noise' has stuck. Exercises and comments
onar y process can be unde rstoo d (1) Circulant processes. The struc ture of a stati by assu ming it periodic, with perio d much bette r if one makes the time axis finite of the sequence {xo, x 1 , x2, ... , m. That is, the complete realisation X consists cyclic perm utati on, so that .rx = Xm- 1}, and the shift oper ator ff effects a just means statistical inva rianc e {xm-I,xo,Xt, ... ,Xm-2}· 'Stationarity' then on. Assume stationarity in what unde r .r, and so unde r any cyclic perm utati follows. e with perio d m. Suppose, for Show that cov(xr, x,_,) is a function v, of r alon e matr ix of the m-vector X. rianc simplicity, that x is scalar, and let V be the cova the kth element of the that and Show that V has eigenvalues jj = }:, v,e-ir ) and sums run over 11"i/m exp(2 = corresponding right eigenvector is ()ik, where() a 'spectral density' is value eigen any m consecutive integers. That is, the jth ~ (j is the finite wher El([1 with evaluated at w = 21f'j/m, and can be identified 1 • Furth ermo re, 2 1 0-1 x ~ 1 1 1 (j = mFour ier trans form of the sequence X in that disti nctj, kin for 0 = k) that E((j( these Four ier amplitudes are uncorrelated, in thes etO, l, .. . ,m -1. b,.r ' is to multiply (j by B(()i). The effect of a filter with oper ator B(.r ) = }:, n by white noise, and so with outp ut (2) Cons ider a discrete-time SISO filter drive that this expression should }:, b,e,_,. The necessary and sufficient cond ition ld have finite seco nd-o rder shou ut converge in mean square (and so that the outp therefore sufficient if one is ition moments) is that }:, is fmite. The same cond condition will also be The e. abov considers an inpu t whose SDF is boun ded zero. away from nece ssary if the SDF of the inpu t is boun ded
b;
SSIV E REP RES ENT ATIO NS: 3 MOVING-AVERAGE AND AUT ORE GRE CAN ONI CAL FACTORISATIONS ut of a filter with a white noise Cons ider the process {x 1} obta ined as the outp input:
(13)
262
STATIONARY PROCESSES; SPECTRAL THEORY
In the time-series literature such a process would be referred to as a moving average process, the words 'average' and 'moving' being a loose indication of the properties 'linear' and 'translation-invariant~ If the white-noise process has covariance matrix V then it follows, as a special case of (9), that {x,} has AGF
g(z) = B(z)VB(z)
(14)
If the filter is causal then b, is zero for r negative, so that B(z) is a series in nonnegative powers of z. One then speaks of the moving average relation itself as being 'causal' or 'realisable: Suppose, as in Section 4.4, that the filter is specified as an input-drive n linear difference equation 00
A(ff)x, = L:a,x,_, =
Et
(15)
r=O
We regard (15) as a relation determinin g x 1 in terms of €1 and past x-values, so that a, is indeed zero for r negative and a0 is non-singular. If the relation (15) is stable
then the output {x 1} generated is stationary, and can indeed be represented as a realisable moving average of the input. That is, it can be represente d in the form (13) with B(z) = A(z)- 1, where these expressions are understoo d as power series in non-negative powers of z (see Section 4.4~ Its AGF is then
(16) In the time-series literature a process generated by a relation such as (15) is termed an autoregressive process, the relation itself being an autoregression. The term 'conveys, again loosely, the idea that the variable is linearly dependent upon its own past. The relation (15) is of course just a stochastic difference equation, linear and with constant coefficients, and of order p if a, is zero for r > p. Let us now reverse these deductions. If the AGF has the form (14) then the process could have been generated by a moving average relation (13); if the AGF has the form (16) then the process could have been generated by an autoregressive . relation (15). In other words, one would have, respectively, moving average and autoregressive representations of the process. The point of this manoeuvr e will appear in the next section. Let us make it more explicit.
Theorem 113.1 Suppose that the AGF g(z) ofa process x 1 can befactorised in the form Q4), where both B(z) and B(z)- 1 are analytic in lzl ~ 1. Then the process has a causal moving average representation QJ), with V identified as the covariance matrix ofthe white noise process {E1}. Furthermore, this representation is invertible, in that it can be inverted to the autoregressive relation Q5), with A(z) = B(zr 1• Proof Define a process { E1} by Et
= B(ff)- 1x 1 = A(ff)x1,
(17)
I
.d
ONS 3 MOVING-AVERAGE AND AUTOREGRESSIVE REPRESENTATI
263
of§. It where it is unders tood that the expansions are in non-negative powers fined, so follows from the assumptions of the theorem that this quantity is well-de (15) hold that the process { Er} is stationary and both the relations (13) and that (9) relation to appeal by now see We es. between the two process g(E)
= AgA = B- 1glr 1 = V,
so that the process {€ 1} is white, with covariance matrix V.
D
ation of The factorisation (14) deman ded in the theorem is a canonical factoris we look, essentially, for a g~). This type of factorisation will occur repeatedly as . The factorisation process the of ntation represe ressive moving-average or autoreg pty annulu s non-em a in c analyti is and AGF an is is indeed possible if g(z) p < lzl < p-1. with an We can restate the theorem in what is a completely equivalent form, but ant. signific prove will emphasis whose difference its inverse as Theorem 13.3.2 Suppose that the AGF g(z) ofa process x 1 is such that form a matrix can befactorised in the
(18) has a stable where both A(z) and A(zr 1 are analytic in izi ~ 1. Then the process of the matrix nce covaria the as ed identifi V autoregressive representation (15), with white noise process { € 1}.
. Note, This is indeed just a variation of the stateme nt of the previous theorem 1 rather g(zr of ation factoris cal however, that it is expressed in terms of a canoni factor than of g(z), and that the order of factors is reversed, in that it is the final inside ies propert ity regular have rather than the initial factor which is required to the unit circle. ion Of course, one factorisation immediately implies the other, but the distinct state to comes it when , between the two versions will prove significant. Roughly equivalent estimation (as it will) then factorisations (14) and (18) are respectively square least linear a as to the two characterisations of a projection estimate es process with deal estimate or as a minimu m discrepancy estimate. When we a is (18) ation whose dynamics are of finite order (as we shall) then factoris polynomial factorisation, whereas (14) is not. ing The € 1 defintJd by (17) are in fact just the innovations of the x-process (assum e -averag this to have begun in the infinite past) and the realisable moving own its of representation (13) is just a representation of the process in terms (1938). The innovations. It is a special case of a representation deduced by Wold above (the Wold representation breaks x into the moving-average term deduced ed non-deterministic component) and a term which can be perfectly predict purely
STATIONARY PROCESSES; SPECTRAL THEORY
264
by linear relations. Our factorisation hypotheses exclude the possibility of this second, deterministic, component.
4 THE WIENER FILTER Suppose that one has observed a process {x 1} up to timet, and so knows its partial history X 1 = {Xr; T ::;;;; t}. Suppose that one wishes to use the observations to estimate the value of some unknown random variable f It is natural to consider the projection estimate ~(r) = 8(~JX1 ), which we know, from the last chapter, to have optimality properties. We suppose all variables reduced to zero mean. Theorem 13.4.1 Suppose that the AGFofthe process {x1 } has canonicalfactorisation (13). Then the projection estimate of~ in terms of X 1 is given by 00
c(t)-"'
<,
-
(19)
L....- "frXt-n r=O
where
(20) and
(21)
Proof We can equivalently project~ upon the to-history, so obtaining 00
~(t)
00
00
= LG(~Jt:r-r) = L~~;,V- 1 Et-r = L~~;,.'T'V- 1 B(.r)- 1 xr r=O
r=O
r=O
whence (19) follows. The ftrst equality is a consequence of the orthogonality of thet:,. D Essentially, one obtains the estimate by projecting upon x-innovations and then expressing these innovations in terms ofobserved x. The filter (19) which determines ~(r) from X 1 is the Wiener filter. It is generally stated for the particular case ~ = Xr+m, when one is attempting to predict what the value of the process variable will be m steps in the future. Exercises and comments (1) Show that, for the prediction problem last indicated, solution (20) for the "{coefficients becomes E~ 'Yrz' = [z-m B(z)]+B(z)- 1 •
5 OPTIMISATION CRITER IA IN THE STEADY STATE
265
one could .· (2) Denote the predictor of Xr+m thus determined by x~2m Show that relations the of tion applica by ors predict these ted have e~uivalently have calcula relation, ressive autoreg the to appeal d repeate by is, That 'l:r a,x/l., = 0 (T > t). of the with future E set equal to zero. This is indeed an immediate consequence oise fact that tS'(ETIXr) = 0 forT> t, which is itself a consequence of the white-n assumptions. 5 OPTIMISATION CRITERIA IN THE STEADY STATE spectra l Optimisation criteria will often have a compact expression in terms of best one r Whethe densities in the stationary regime (i.e. the steady-state limit). these of sation goes about steady-state optimisation by a direct minimi ions are expressions with respect to policy is another matter, but the express certainly interesting and have operational significance. n Consider first the LQG criterion. Let us write an instantaneous cost functio such as (2.23) in the 'system' form
(22) , for Here ~ is the vector of 'deviations' which one wishes to penalise (having matrix example, x - r and u - uc as components) and 9t is the associated has auto(which would be just [ ~ ~] in the case (2.23) ). Suppose that { ~t} a given covariance generating function g(z) in the stationary. regime under is to aim the Then f(w). density l spectra stabilising policy 11; and corresponding choose 1r to minimise the average cost (23) tr[9t cov(~)]. !E{tr[9t~~T]} 'Y E(!~T!Jt~]
=!
=
=
and tr(P) Here E denotes an expectation under the stationary regime for policy 1r have not we n notatio of y econom For P matrix the denotes, as ever, the trace of to the ing Appeal 1r. upon g(z) and E "f, of ence explicitly indicated the depend formula cov(~) = 217r f(w) dw,
j
we thus deduce
Theorem 13.5.1 The criterion (23) can be expressed 'Y
= E(!~T9\A] =
4~
J
tr[9lf(w)J dw
(24)
ry where f (w) is the spectral density function of the A-process under the stationa the in 1r] [-1r, interval regime for policy 11; and the integral is is over the rea/frequency case ofdiscrete time and over the whole rea/line in the case ofcontinuous time.
266
STATIONARY PROCESSES; SPECTRA L THEORY
In the discrete-time case it is sometimes useful to see expression (24) in power series rather than Fourier terms, and so to write it
1 7 = !Abs{tr[9tg(z)]} = -47rl.
dz j tr[9tg(z)]-. z
(25)
Here the symbol ~s· denotes the operation of extracting the absolute term in the expansion of the bracketed term in powers of z upon the unit circle, and the integral is taken around the unit circle in an anticlockwise direction. If we had considere d a cost function h-1
C=z=c,
(26)
t=O
up to a horizon h then we could have regarded the average cost (23) as being characterised by the asymptotic relation E(C) = h7 + o(h)
{27)
for large h. Here E is again the expectation operator under policy 1r, but now conditional on the specified initial conditions. The o(h) term reflects the effect of these conditions; it will be zero if the stationary regime has already been reached at timet= 0. Consider now the LEQG criterion introduce d in Section 12.3. We saw already from that section that the LEQG model provided a natural embedding for the LQG model; we shall see in Chapters 16, 17 and 21 that it plays an increasing role as we bring in the concepts of risk-sensitivity, the H 00 criterion and largedeviation evaluations. For this criterion we would expect a relation analogous to(27):
(28) Here 7( 9) is a type of geometric-average cost, depending on both the policy 1r and _ the risk-sensitivity paramete r 9. The least ambitious aim of LEQG-optimisation in the infinite-horizon limit would be to choose the policy 1r to minimise 7(6). ('Least ambitious', because a full-dress dynamic program ming approach would minimise transient costs as well as average costsJ We aim then to derive an expression for 7( 9) analogous to (24). Theorem 13.5.2
The average cost 7( 9) de]med by (48) has the evaluation
!j
7(6) = 4 9
logjl + 99tf(w)l dw
{29)
for values of(} such that the symmetrisation of the matrix I+ 09tf(w) is positive definite for all real w. Here f (w) is the spectral density function of the fl.-process under the stationary regime for the specified policy 1r (assumed stationary and
5 OPTIMISATION CRITERIA IN THE STEADY STATE
267
linear), and the integral is again over the interval [-1r, 1r] or the whole real axis in discrete or continuous time respectively. Here /P/ denotes the determin ant of a matrix P; note that expression (29) indeed reduces to (24) in the limit(} --+ 0. We shall prove an intermediate lemma for the discrete-time case before proving the theorem. Suppose that the AGF of { 6-c} has a canonica l factorisation (16) with A(z) analytic in some annulus /zl ~ 1 + 6 for positive 8. That is, the 6.process has an autoregressive representation. It is also Gaussian, since the policy is linear. Let us further suppose this representation so normalis ed that Ao =I. The probability density of 6. 1 conditional on past values is then
f(6.c/6.r;T < t) = [(27r)m/V/]- 112 exp[-!ci- "V- 1E1] where Ec
= A(Y)6.c and m is the dimension of 6.. This implies that
f(6.o, 6.,' ... '6-h-1/6-r; 'T < 0) = [(27r)ml VIJ-h/Zexp [ = [(27r)'/ VIJ-h/ 2 exp [
-! ~ E; v-'fc]
l
-! ~ 6-'f M(ff)Ac + o(h)
(30)
where M(z) = A(z) v- 1A(z) = g(zf 1 . Since the multivariate density (30) integrates to unity, and since the normalis ation Ao = I implies the relation log/ V/ = dlogjg(z )/, we see that we can write conclusion (30) as follows.
Lemma 13.5.3 Suppose that M(z) is self-conjugate, positive definite on the unit circle and analytic in a neighbourhood ofthe unit circle. Then
II··· Iexr[-l~x;'M(ff)x}xodXI· .dxo-1
(31)
= (27r)hmf2exp{ -(h/2)Ab s[log/M( z)/J + o(h)}. *Proof of Theorem 13.5.2 The expectation in (28) can be expressed as the ratio
multivariate integrals of type (31), with the identifications M(z) = g(z) -I + (}91 and M(z) = g(z)- 1 in numerato r and denomin ator respectively. We thus have
of two
E(e-ec) = exp[-(h/2 )Abs{log jg(z)- 1 + 091/ + log!g(z)i} + o(h)] whence the evaluation
"!(B)
1
= 20 Abs[log/I + 091g(z)l}
.I
268
STATIONARY PROCESSES; SPECTRAL THEORY
and its alternative expression (29) follow. The continuous-time demonstration is analogous. 0 The only reason why we have starred this proof is because the o(h) term in the exponent of the final expression (30) is in fact Ll-dependent, so one should go into more detail to justify the passage to the assertion (31).
r' '
~''~' ,. .· : r '
!
CHA PTE R14
Optimal Allocation; The Multi-armed Bandit 1 AU.OCATION AND CONTROL topic of conThis chapter is somewhat off the principal theme, although on a he wishes. if it bypass may reader the and siderable importance in its own right, resources limited of g sharin the Allocation problems are concerned with problem of kind the not This d. pursue between various activities which are being m if proble l contro a indeed is envisaged in classical control theory, but of course le, examp For ons. conditi ng this allocation is being varied in time to meet changi rk netwo ons unicati comm a the adaptive allocation of transmission capacity in tance. impor logical techno provides just such an example; clearly of fundamental henceforth The classic dynamic allocation problem is the 'multi-armed bandit', ce of sequen a makes er referred to as MAB. This is the situation in which a gambl the choose to plays of any of n gambling machines (the 'bandits'1 and wishes f pay-of ed expect machine which he plays at each stage so as to maximise the total the of ility probab (perhaps discounted, in the infinite-horizon case). The pay-off er, the ith machine is a parameter, 0; say, whose value is unknown. Howev he as gains gambler builds up an estimate of Oi which becomes ever more exact n playing a more experience of the machine. The conflict, then, is betwee g with a machine which is known to have a good value of 0; and experimentin better. It is machine about which little is known, but which just might prove even m as an proble the ates formul one that t in order to resolve this conflic optimisation problem. features As an allocation problem this is quite special, on the one hand, but has ing is allocat is one which ce which confuse the issue, on the other. The resour at a ne machi one only play one's time (or, equivalently, effort) in that one can the split to able be will one s time, and must decide which. In more general model but ting fascina is allocation at a given time. The problem also has a feature which time is not is irrelevant in the first instance: the 'state' of the ith machine at a given of 0;, an value its physical state, but is the state of one's information about the chapter, next 'informational state~ We shall see how to handle this concept in the which is to but this aspect should not divert one from the essential problem, we may decide which machine to use next on the basis of something which indeed term the current 'state' of each machine.
270
OPTIMAL ALLOCATION; THE MULTI-ARMED BANDIT
In order to sideline such irrelevancies it is useful to formulate the MAB problem in greater generality. We shall suppose that one has n 'projects', the ith of which has state value xi. The current state of all projects is assumed known. One can engage only one project at a time; if one engages project i then xi changes to a value ~ by time-homog eneous Markov rules (i.e. the distribution of ~ conditional on all previous state values of all projects is in fact dependent only on xi), the states of unengaged projects do not change and there is a reward whose expectation is a function r;(x;) of x; and i. If one envisages indefinite operation starting from time t = 0 and discounts rewards by a factor f3 then the aim is to choose policy 1r so as to maximise E'lr[L::o {3 1R 1], where R 1 is the reward received at time t. The policy is the rule by which one decides which project to engage at any given time. One can generalise even this formulation so as to make it more realistic in several directions, as we shall see. However, the problem as stated is the best first formalisation, and captures the essential elements of a dynamic allocation problem. The problem in this guise proved frustratingly difficult, and resisted sustained attack from the 'forties to the 'seventies. However, it had in fact been solved by Gittins about 1970; his solution became generally known about 1981, when it opened up wide practical and conceptual horizons. Gittins' solution is simple to a degree which is found amazing by anyone who knew the frustrations of earlier work on the topic. One important feature which emerges is that the optimal policy is an index policy. That is, one can attach a index lli(xi) to the ith project which is a function of the project label i and the current state x; of the project alone. If the index is appropriatel y calculated (the Gittins index), then the optimal policy is simply to choose a project of currently greatest index at each stage. Furthermore , the Gittins index II; is determined by the statistical properties of project i alone. We shall describe this determinatio n, both simple and subtle, in the next section. The MAB formulation must be generalised if one is to approach a problem as complicated as, for example, the routing of telephone traffic through a network of exchanges. One must allow several types of resource; these must be capable of allocation over more than one 'project' at a time; projects which are unengaged may nevertheless be changing state; projects may indeed interact. We shall sketch one direction of generalisatio n in Sections 5-7. The Gittins solution of the MAB problem stands as the exact solution of a 'pure' problem. The inevitable next stage in the analysis is to see how this exact solution of an idealised problem implies a solution, necessarily optimal only in some asymptotic sense, for a large and complex system.
2 THE GITTINS INDEX The Gittins index is defined as follows. Consider the situation in which one has only two alternative actions: either to operate project i or to stop operation and
271
2 THE GITTINS INDE X
then in fact an optim al stopp ing receive a 'retir emen t' reward of M. One has proje ct, its state will not chan ge probl em (since once one ceases to opera te the te the value function for this and there is no reaso n to resum e operation). Deno on the retire ment reward as well probl em by ¢i (Xi, M), to make the depe nden ce dyna mic progr amm ing equa tion as on the proje ct state explicit. This will obey the
(1) a relati on whic h we shall abbreviate to
(2)
of proje ct i before and after one Here Xi and~ are, as above, the values of the state stage of opera tion. the form indic ated in Figu re As a funct ion of M for fixed Xi the funct ion ¢i has for M great er than a critic al value 1: non-d ecrea sing and convex, and equa l to M accep t the retire ment rewa rd M;(xi)· This is the range in whic h it is optim al to over value, at which M is just large rathe r than to continue, and M; (x;) is the cross ng are equally attractive. enou gh that the optio ns of conti nuing or term inati the proje ct when in state xi; it is Note that Mi(xi) is not the fair buy-o ut price for (in state Xi) if an offer is made more subtle than that. It is the price whic h is fair ct opera tor is free to acce pt at which is to rema in open, and so which the proje can be taken as the Gittins index , any time in the future. It is this quantity whic h altho ugh usual ly one scales it to take
(3)
ity that a capit al sum M woul d as the index. One can regard vas the size of an annu choic e betw een the altern ative s of buy, so that one is rephr asing the probl em as a n or of moving to the 'certa in' eithe r opera ting the proje ct with its unce rtain retur proje ct whic h offers a cons tant incom e of v.
M. (x;)
M
X; and Mare respectively the state of Figure 1 The graph of r/J;(x;, M) as afunction ofM Here and
272
OPTI MAL ALLOCATION; THE MULTI-ARMED BANDIT
Note that the index v;(x;) is indeed evaluated in term s of the properties ofproject i alone. The solution of then- proje ct probl em is phra sed in terms of these ; indices: the optim al policy (the Gittins index policy ) is to choose at each stage one of the projects of currently greatest index. One may indee d regard this as a reduction of the problem which is so powerful that one can term it a solution, since solution of the n-project problem has been reduced to solution of a stopping problem for individual projects. .~ We shall prove optimality of this policy in the next section, but let us note now :$ an associated assertion. Let x denote the comp osite state (x1, x2, ... , X 11 ) of the '~; set of all n projects, and let (x, M) denote the value function for the problem of choosing optimally between any one of these proje cts and the additional option of total retirement with a term inal reward M. Let A(l {3) and B(l - {3) be uniform lower and uppe r boun ds (presumed finite ) on the reward rates r;(x;), so that A and B are lower and uppe r boun ds on the total expected discounted reward obtainable if there is no retirement optio n. Then we shall show that (x, M) has the evaluation in term s of the one-p roject value functions c/Ji
;>.0
(4) Solution of this augmented problem for gener al M implies solution of the nproject problem without the retirement option, because if M < A then one will never accept the retirement option. Exercises and comments (1) Prove that ¢;(x;, M) has indee d the chara cter asser ted in the text and illustrated in Figure 1.
3 OPTIMALITY OF THE GITTINS INDEX POL ICY The value function ci> (x, M) will obey the dynamic prog ramm ing equation = max[M,m~xL;] l
(5)
where the opera tor L; defined implicitly by comp arison of (1) and (2) will act only on the x;-ar gume nt of . We shall prove valid ity of (4) and optimality of the Gittins policy by demonstrating that expression (4) is the unique solution of (5) and that the Gittins policy corresponds to the maxi mising options in (5). More explicitly, that one operates a project of maxi mal index v; if this exceeds M(l - {3) and otherwise accepts the retirement option. Many other proofs of optimality have now been offered in the literature which do not depe nd upon dynamic prog ramm ing ideas; one parti cular line is indicated in the exercises
3 OPTIMALITY OF THE GITTINS INDEX POLICY
273
not yield the extra below. However, these are no shorter, most of them do are more insightful. conclusion (4), and it is a matter of opinion as to whether they
Lemma 14.3.1 Expression (4) may alternatively be written
~(x,M) = ¢i(xi, M)Pi (x,M) + Loo ¢i(xi,m) dmPi(x,m) where
P·( ,x, M) .·=
, M) II 8¢j(Xj ;::.M #i
(6)
{7)
u
is non-negative, non-decreasing in M and equal to unity for
(8) dence upon xwhic h Proof Note that quantities such as Mi and M(i) have a depen l integration. Since ¢i, we have suppressed. Equation (6) follows from (4) by partia M for M ~ Mi. then as a function of M, is non-decreasing, convex and equal to for M ~ M;. The unity to equal and 8¢;/ aM is non-negative, non-decreasing 0 properties asserted for Pi thus follow. Consider the quantity
8i(Xi,M) = ¢;(xi ,M)- Li
with equality if M
~ m~xMi, J
\I>(M) - L;\I>(M) with equality if M;
= max Mi
M
(9)
and
= 8;(M)P;(M) + Loo 8;(m)dmPi(m) ~ 0 ~
(10)
M.
J
in m. Here dmPi(m) is the increment in P; (m) for an increment dm ty case follow from Proof Inequality (9) and the characterisations of the equali (6) and the properties of Pi. non-negativity of The first relation of (10) follows immediately from (6). The the non-negative and the expression follows from the non-negativity of 8; and
274
OPTIMAL ALLOCATION; THE MULTI-ARMED BANDIT
non-decreasing nature of P;. We know that 8;(M) = 0 for M ~ M; and that dmP;(m) = 0 form~ M(i)• so that expression (10) will be zero if M ~ M; and M(i) ~ M;. This pair of conditions is equivalent to those asserted in the lemma. 0 Theorem 14.3.3 The value function 4>(x, M) of the augmented problem indeed has the evaluation (4) and the Gittins index policy is optimal. Proof The assertions of Lemma 14.3.2 show both that expression (4) satisfies the dynamic programming equation (2) and that the Gittins index policy (augmented by the recommendatio n of termination if M exceeds max;M;) provides the maximising option in (2). But since (2) has a unique solution and the maximising option indicates the optimal action (see Exercise 3.1.1, valid also for the stochastic case) both assertions are proved.
Exercises and comments
We indicate an alternative line of argument which explains the form of solution (4) and avoids some of the appeal to dynamic programming ideas.
(1) (Whittle, 1980). Consider a policy which is such that project i is terminated as soon as xi enters a write-off setS; (i = 1, 2, ... , N) and retirement with reward M takes place the moment all projects have been written off. We assume that there is some rule for the choice from the set of projects still in use, which we need not specify. Let us term such a policy a write-offpolicy, and denote the value functions under a given such policy for theN-project and one-project situations by F(x, M) andfi(x;, M) respectively. Then
oF
oM= E(,ffjx)), where r is the (random) time taken to drive all N projects into their write-off sets and r; the time taken to drive project i into its write-off set. But r is distributed as L; T; with the r; taken as independently distributed. Hence it follows that
aF
oM
=IT oM' afi i
which would imply the validity of (4) if it could be asserted that the optimal policy was also a write-off policy. (2) (Tsitsiklis-Weber). Denote the value function for the augmented problem if i is restricted to a set I of projects by V (I). This should also show a dependence on M and the project states which we suppress. Then Vhas the sub modular property
V(I)
+ V(J)
~
V(I U J)
+ V(I n J).
(11)
4 EXAMPLES
275
appe al to the fact that choice of a Prove this by induction on the time-to-go sand be seen as a choice of one proje ct one project from each of I U J and I n J can tion is not in general true. from each of I and J, but that the converse asser cts which are written off (in an (3) (Tsitsiklis, 1986). Take I as the set of all proje and J as its complement. Then optim al policy for the full set of n projects) ci> ~ V(J). But plainly the reverse relation (11) becomes M + V(J) ~ ci> + M, or exceed M if and only if J is noninequality holds, so that ci> = V (J), and this will inde ed aban done d once they are empty. Thus, in an optim al policy, projects are projects are written off. Thus the written off and operation continues until all ssion (4) for the value function is optim al policy is a write-off policy, and expre al to something like the argu men t inde ed correct. However, one still has to appe policy. of the text to establish optimality of the Gittins
4 EXAMPLES tion of the index v(x) for an The problem has been reduced to deter mina i. Dete rmin ation of v(x) index ct individual project, so we can drop the proje and one may well have ct, proje that requires solution of the stopping problem for analytic solut ion is that fact t. The to resort to num erica l methods at this poin fact that the MAB the idate inval possible in only relatively few cases does not lem to the oneprob ect -proj of then prob lem is essentially solved by the reduction ples which in exam some list shall project problem (with a retirement option). We fact perm it rapid and transparent treatment. ¢(x( t), M) is necessarily nonLet us say that a project is deteriorating if mach ine whose state is sufficiently increasing in t. One may, for example, have a deteriorates with age. We leave the indicated by its age, and whose perfo rman ce ing equation (2), that v(x) = r(x) read er to show, from the dynamic prog ramm function. simply, where r( x) is the insta ntan eous reward al policy is a one-step lookIf all projects were deteriorating then the optim h the expected immediate reward ahead policy: one chooses the project i for whic situation in which the ri have ri(xi) is maximal. This will ultimately lead to the one then switches projects to keep been roughly equalised for all projects, and the tyres on one's car with the spare them so. That is, it is as if one kept changing of wear. Switching costs will ensu re so as to keep all five tyres in an equal state that one in fact tolerates a degree of inequality. project, for which ¢(x( t),M ) is The opposite situation is that of an improving perfo rman ce of a machine may non-decreasing with time. For example, the lasts, the more likely it is to be a improve with age in the sense that, the longer it in this case, good one. We leave to the reader to conf irm that,
276
OPTIMA L ALLOCATION; THE MULTI-ARMED BANDIT
That is, the index is the discounted constan t income equivalent of the expected discounted return from indefinite operati on of the project. If all projects are improving then, once one adopts a project, one stays with it. However, mentio n of 'lasting' brings home the possibility that a machin e may indeed 'improve' up to the point where it fails, and is thereafter valueles s. Let us denote this failure state by x = 0, presum e it absorbing and that r(O) = 0. Suppose that the machin e is otherwise improving in that ¢(x(t), M) is nondecreasing in t as long as the state x = 0 is avoided. Let a denote the random failure time, the smallest value oft for which x( t) = 0. In this case
v(x) E['E~:J ,81r(x(t))jx(O) = x] 1 - ,8 = I E[,B"ix(O) = x] If all projects follow this 'improving through life' pattern then, once one adopts a project, one will stay with it until it fails. Anothe r tractable example is provided by a diffusion process in continu ous time. Suppose that the state x of the project takes values on the real line, that the project yields reward at rate r(x) = x while it is being operate d, that reward is discounted at rate a, and that x itself follows a diffusion process with drift and diffusion coefficients J1. and N. This conveys the general idea of a project whose return improves with its 'condition', but whose conditi on varies random ly. The equatio n for ¢(x, M) is then X -
a¢ + Jl.¢x + !N ¢xx = 0
(x > ~)
(12)
where~ is the optima l breakp oint for retirem ent reward M. We find the solution of (12) to be
¢(x,M ) = (xja) + (Jl.Ja 2 ) + cePx
(13)
where pis the negative solution of
! Np2 + Jl.P -
a = 0.
and c is an arbitra ry constant. The general solution of (12) would also contain an exponential term corresp onding to the positive root of this last equatio n, but this will be excluded since ¢ cannot grow faster than linearly with increas ing x. The unknow ns c and ~ are determ ined by the bounda ry conditi ons ¢ = M and ¢x = 0 at x = ~(see Exercise 10.7.2~ If we substitute expression (13) into these two equations then the relation between M and ~ which results is equival ent to M = M(e). We leave it to the reader to verify that the calculation yields the determ ination
v(x) = aM(x) = x+
J1. + ..jJ.L 2 + 2aN
la
.
5 RESTLESS BANDITS
277
reward expected from The constant added to x represents the future discounted quence of the fact future change in x. This is positive even if J1. is negat ive-a conse if this occurs, but that one can take advantage of a random surge against trend can retire if it does not.
5 RESTLESS BANDITS be to allow projects to One desirable relaxation of the basic MAB model would by different rules. For course of gh change state even when not engaged, althou ent used to comb at a treatm al medic a example, one's knowledge of the efficacy of actually deteriorate could it ver, particular infection improves as one uses it. Howe virus causing the the le, examp when one ceased to use the treatment if, for infection were mutating. on of an enemy For a similar example, one's information concerning the positi would actually but it, submarine will in general improve as long as one tracks deliberate taking not deteriorate if one ceased tracking. Even if the vessel were evasive action its path would still not be perfectly predictable. of whom exactly As a final example, suppose that one has a pool of n employees employees who are m are to be set to work at a given time. One can imagine that yees who are resting working produce, but at a decreasing rate as they tire. Emplo is thus changing state do not produce, but recover. The 'project' (the employee) whether or not he is at work. tion or not as We shall speak of the phases when a project is in opera a project was static active and passive phases. For the traditional MAB model is not true: the this ms proble many for in its passive phase. As we have seen, space. For state in ents movem ry contra active and passive phases produce ation inform of loss and gain e induc s submarine surveillance the two phase and tiring to pond corres s phase respectively. For the labour force the two recovery. passive phase as a We shall refer to a project which may change even in the rine example. subma the 'restless bandit', the description being a literal one for er respect: one anoth in l The work-force example generalised the MAB mode one. We shall just than was allowed to engage m of then projects at a time rather se suppo that m ( < n) allow this, so that, for the submarine example, we could a matter of allocating aircraft are available to track the n submarines. It is then of all n submarines the surveillance effort of the m aircraft in order to keep track as well as possible. assume rewards We shall specialise the model in one respect: we shall e reward. This averag the ising maxim of undiscounted, so that the problem is that a number of of ns solutio n know the , makes for a much simpler analysis; indeed As we have case. d ounte undisc the standard problems are greatly simpler in n 16.9, the Sectio in ds groun maintained earlier, and argue on structural t. contex l contro undiscounted case is in general the realistic one in the
278
OPTIMAL ALLOCATION; THE MULTI-ARMED BANDIT
Let us label the phases by k; active and passive phases corresponding to k = 1 and k = 2 respectively. If one could operate project i without constraint then it would yield a maximal average reward "f; determined by the dynamic programming equation "fi
+ fi(x;)
= max{rik(x;) k
+ Ekffi(X:)Ix;]}.
(14)
Here r;k(x;) is the expected instantaneous reward for project i in phase k, Ek is the expectation operator in phase k and fi(x;) is the transient reward for the optimised project. We shall write (14) more compactly as
"(; + Ji = max[Lil Ji,Li2fi]
(i=l,2, ... ,n).
(15)
Let m(t) be the number of projects which are active at time t. We wish to optimise operation under the constraint m(t) = m for prescribed m and identically in t; that exactly m projects should be active at all times. Let Ropt(m) be the optimal average return (from the whole population of n projects) under this constraint. However, a more relaxed demand would be simply to require that (16)
E[m(t)] =m,
where the expectation is the equilibrium expectation for the policy adopted. Essentially, then, we wish to maximise ECL_; r;) subject to E(L_; !;) = n- m. Here r; is the reward yielded by project i (dependent on project state and phase) and l; is the indicator function which takes the value 1 or 0 according as to whether project i is in the passive or the active phase. However, this is a constraint we could take account of by maximising E[L_;(r; + 11/;)], where 11 is a Lagrangian multiplier. We are thus effectively solving the modified version of (15) "f;(v)
+Ji = max[L;,Ji, v + Li2fi]
(i = 1,2, ... ,n)
(17)
wherefi is a functionfi(x;, v) of x; and v. An economist would view v as a 'subsidy for passivity', pitched at just the level (which might be positive or negative) which ensures that m projects are active on average. Note that the subsidy is independent of the project; the constraint (16) is one on total activity, not on individual project activity. We thus see that the effect of relaxing the constraint m(t) = m to the averaged version (16) is to decouple the projects; relation (17) involves project i alone. This is also the point of the Gittins solution of the original MAB problem: that thenproject problem was decomposed into n one-project problems. A negative subsidy would usually be termed a 'tax: We shall use the term 'subsidy' under all circumstances, however, and shall refer to the policy induced by the optimality equations (17) as the subsidy policy. This is a policy optimal under the averaged constraint (16). If we wish to be specific about the value of 11 we shall refer to the policy as the v-subsidy policy. For definiteness, we shall close
•
. 110:.
5 RESTLESS BAND ITS
279
v + L,2 J; then project i is to be the passive set. That is, if x; is such that Ln J; = rested. cts are active on average. The value of v must be chosen so that indeed m proje cts. This induces a mild recoupling of the proje constraint (16) is Theorem 14.5.1 The maximal average reward under R(m) =
i~f [~ 'Y;(v)- v(n- m)]
(18)
and the minimising value ofv is the required subsidy level. ing (that, unde r regularity Proof This is a classic assertion in convex prog ramm problems are equal) but best conditions, the extrema attained in prim al and dual square-bracketed expression in seen directly. The average reward is indee d the acted. Since (18), because the average subsidy paid must be subtr
(where 1r denotes policy) then
a
avl; : 'Y;(v) = E1f I
L l; = (n- m). I
ality condition in (18). The This equation relates m and v and yields the minim is convex increasing in v. v) ( 'Y; condition is indeed one of minimality, because 2: mal average reward for maxi The function R(m) is concave, and represents the 0 any min [0, n].
x; as the value ofv which Define now the index v;(x;) of project iwhe n in state value of subsidy which the is it makes Ln J; = v + L;2 J; in (17). In other words, x;. This is an obvious state in i ct makes the two phases equally attractive for proje passive projects when es d reduc analogue of the Gittins index, to which it indee (which Gittins index the that are static and yield no reward. The interest is the Lagrange as seen is now characterised as a fair 'retirement income') is obviously index The ty. activi multiplier associated with a constraint on average project, the a rest to one e meaningful: the greater the subsidy needed to induc more rewarding must it be to operate that project. m(t) = m rigidly. Then a Suppose now that we wish to enforce the constraint to choose the projects to be plausible policy is the index policy; at all times index (i.e. the first m on a list operated as the m projects of currently greatest te the average return from this ranking projects by decreasing index). Let us deno policy by Rinct(m).
280
OPTIMAL ALLOCATION; THE MULTI-ARMED BANDIT
Theorem 14.5.2
Rind(m) ~ Ropt(m) ~ R(m).
(19)
Proof The first inequality holds because Ropt is by definition the optimal average return under the constrain t m(t) = m. The second holds because R(m) is the optimal average return under the relaxed version of this constraint, D E[m(t)] = m.
The question now is: how close are the inequalities (19), i.e. how close is the index policy to optimality? Suppose we reduce rewards to a per project basis in that we divide through by n. The relation (20) Rind(m)/n ~ Ropt(m)fn ~ R(m)/n then expresses inequalities between rewards (under various policies) averaged over both time and projects. One might conjecture that, if we let m and n tend to infinity in constant ratio and hold the populatio n of projects to some fixed composition, then all the quotients in (20) will have limits and equality will hold throughout in this limit This conjecture has in fact been essentially verified in a very ingenious analysis by Weber and Weiss (1990). However, there are a couple of interesting reservations. Let us say that a project is indexable if the set of values of state for which the project is rested increases from the empty set to the set of all states as v increases. This implies that, if the project is rested for a given value of subsidy, then it is rested for all greater values. It also implies that, if all projects are indexable, then the projects i which are active under a a 11-subsidy policy are just those for which
1/i(Xt) > 1/.
One might think that indexability would hold as a matter of course. It does so in the classic MAB case, but not in this. Counter-examples can be found, although they seem to constitute a small proportio n of the set of all examples. An example given in Whittle (1988) shows how non-indexability can come about. Let D(v) be the set of states for which a given project is rested under the v-subsidy policy. Suppose that a given state (x = {, say) enters D as v increases. It can be that paths starting from {with { in D show long excursions from D before they return. This implies a surrende r of subsidy which can become non-opti mal once v increases through some higher value, when {will leave D. Another point is that asymptotic equality in the second inequality of (20) can fail unless a certain stabilit;y condition is satisfied (explicitly, unless the solution of the deterministic version of the equations governing the distribution of index values in the populatio n under the index policy converges to a unique equilibrium). However, the statistics of the matter are interesting. In an investigation of over 20000 randomly generated test problems Weber and Weiss found that about 90% were indexable, but found no counterexamples to average-optimality (i.e. of
281
NTE NAN CE 6 AN EXAMPLE: MAC HIN E MAI
of s). In searching a more specific set instability of the dynamic equation 3 for and , 10in fewer than one case in examples they found counterexamples 5 the orde r of one part in 1o- . of these the mar gin of suboptimality was for average-optimal is then virtually true The assertion that the index policy is dity vali lute an assertion can escape abso all indexable cases; it is remarkable that on that asymptotic optimality can be cati by so little. The result gives some indi le policies. achieved in large systems by quite simp INT ENA NC E 6 AN EXAMPLE: MA CH INE MA d sidered in Section 11.4 constitutes a goo The machine maintenance problem con is it that s of costs rath er then rewards, so first example. Thi s is phrased in term ; action rath er than a subsidy for inaction now natu ral to thin k of v as a cost for 11.4, tion overhaul. In the notation of Sec i.e. to identify it as the cost of a machine atio n for a single machine analogous to equ the dynamic programming equation (17) is then (21) .X[f(x + 1)- f(x) ]. 'Y = min{v +ex + f(O )- f(x) , ex+ ice ecture that the optimal policy is to serv If we norm alis ef(O ) to zero and conj s tion equa the value ~then (21) implies the machine if x ;;::: ~for some critical 'Y + f(x) = v +ex 'Y = ex+ .X[f(x + 1)- f(x) ]
These have solution
(x;;:::
f(x )=c x+ v-' Y f(x) =
f::(ij=O
~)
ex)/ .X= "fX - cx( x- 1) 2.X .A
The identity of the two solutions at x for 'Y in terms of~
equ = .; thus provides the determining
atio n (22)
~ is effect by replacing~(~- 1) bye _ Now Here we have neglected discreteness with l ima min be ld requirement that 'Y shou determined by optimality; i.e. by the that the derivatives with respect to .; of iring respect to f This is equivalent to requ the be equal (see Exercise 10.7.2), i.e. to the two sides of relation (22) should a uce ded we evaluation of 1' into (22) condition .Xc "''Y - e~. Substituting this on uati eval the v "' v( ~). In this way we fmd relation between v and .; equivalent to
cx2
v(x) "' c(x +.X) + li,
282
BANDIT OPTIMAL ALLOCATION; THE MULTI-ARMED
dominant) term in this accurate to within a discreteness effect. The last (and policy improvement in by expression is essentially the index which was deduced Section 11.4.
7 QUEUEING PROBLEMS vement produ ced plausible Despite the fact that the technique of policy impro in Sections 11.5-11.7, the index policies for a numb er of queueing problems restless bandit technique fails to do so. of deducing an index Consider, for example, the costing of service, in the hope cost structure of Section for the allocation of service between queues. With the to (17) would be 11.5 the dynamic progr amm ing equation corresponding 7 =ex + min{A L\(x + 1), v + AA(x + 1)- J.£(x)L\(x)} , L\(x) is the increment Here v is the postulated cost of employing a server zero according as xis f(x) - f(x- 1) in transient cost and J.£(x) equals 1-£ or to be finite. One fmds (and positive or zero. One must assume that J.£ > A if 7 is non-negative v the active we leave this as an exercise to the reader) that for any r-is the whole set x > 0 of set-t he set in which it is optimal to employ a serve dary which changes with positive x. One thus does not have a decision boun the index was based in of ition defin the changing v; the very feature on which Section 5. considering policies for One can see how this comes about: there is no point in for some { > 0. Such ~ = x which, for example, there is no service in a state ever for the queue on as em policies would present the same optimisation probl ting a base-load of { accep the Set Of States X;:;::: {,but With the feature that one is and so incurring an unnecessary constant cost of c{. es of engagement, so One might argue that variation of J.£ allows varying degre sponding proportional that one might allow Jl to vary with state with a corre same conclusion in the s.ervice cost. However, one reaches essentially the states x > 0 at a comm on undiscounted case: that an optimal policy serves all rate. es manifest if one The special character of such queueing problems becom It is assumed that 5. n Sectio in aged envis considers the large-system limit n ___. oo so that there is unity, than less is the traffic intensity for the whole system this capacity is all case that In sufficient service capacity to cover all queues. s are either queue all that so directed to a queue the mom ent it has customers, This rather e. servic of e empty or (momentarily) have one customer in the cours ited by inhib is nse respo if unrealistic immediacy of response can be avoided only er. transf or vation switching costs or slowed down by delay in either obser problem of Section 11.7. We d rewar pure the for hold not do Such considerations on 6 indeed lead to the leave the reader to confi rm that the methods of Secti known optimal policy.
7 QUEUEING PROBLEMS
Notes on the literature
283
ns had arrived at his solution by The subject has a vast literature. Although Gitti s (1974~ His later book (Gittins, . 1970 it was published first in Gittins and Jone subject. The proo f of optimality e 1989) gives a collected exposition of the whol was given in Whittle (1980). tion associated with solution (4) for the value func their analysis completed and ) (1988 Restless bandits were intro duce d in Whittle by Weber and Weiss (1990).
CHA PTE R IS
Imperfect State Observation ent of the case of We saw in Chapt er 12 that, in the LQG case, there was a treatm By 'imperfect it. imperfect observation which was both complete and explic le variab (or of the state observation' we mean that the current value of the process able. In practice, variable, in state-structured cases) is not completely observ observation will seldom be complete. le; this tractability However, the LQG case is an exception in that it is so tractab 16). There are very few . carries no further than to the LEQG case (see Chapt er d both exactly and models with imperfect observation which can be treate case, which one might explicitly. Let us restrict attention to the state-structured l formal result which centra the Then form. regard as the standard normalised there is still a simply that is ed observ fectly emerges if the state variable is imper ent of the value argum the ver, Howe on. recursive dynamic programming equati ational' state 'inform an but le, variab function is no longer the physical state tional on condi state al physic of value variable: the distribution of the current nt of the eleme an rly forme as not current information. This argument is then, of the ality cardin in se increa state space !!l, but a distribution on !!l. This great lt. argument makes analysis very much more difficu distribution is The simplification of the LQG case is that this conditional covariance and x mean 1 t always Gaussian, and so parametrised by its curren and the alone, ents matrix V,. The value function depends then on these argum s even implie ideas validity of the certainty equivalence principle and associated further simplification. lence principle, Certainly, there is no general analogue of the certainty equiva lty. In general the and it must be said that this fact adds interest as well as difficu gains, so that control choice of control will also affect the information one in mind: to control actions must be chosen with two (usually conflicting) goals on aspects of the the system in the conventional sense and to gain information a strange car is of g' steerin the 'feel to le, system for which it is needed. For examp information sary neces gains one that effective as a long-term control measure (in rly, it is the Simila one. term shorton the car's driving characteristics) but not as a of the basis the on on misati conflict between the considerations of profit-maxi base ation inform this ve information one has and the choice of actions to impro ter. charac ular that gives the original multi-armed bandit problem its partic l. (A 'duality' quite Control with this dual goal is often referred to as dual contro and estimation/ e distinct from the mathematical duality between control/futur
286
IMPERFEC T STATE OBSERVATION
past which has been our constant theme.} An associated concept is that of adaptive control: the system may have parameters which are not merely unobservable but also changing, and procedures must be such as to track these changes as well as possible and adapt the control rule to them. A procedure which effectively estimates an unknown parameter will of course also track a changing parameter. The theory of dual and adaptive control requires a completely new set of ideas; it is subtle, technical and, while extensive, is as yet incomplete. For these reasons we shall simply not attempt any account of it, but shall merely outline the basic formalism and give a single tractable example. 1 SUFFICIE NCY OF THE POSTERI OR DISTRIBU TION OF STATE Let us suppose, for simplicity of notation, that all random variables are discretevalued-th e formal extension of conclusions to more general cases in then obvious. We shall use a naive notation, so that, for example P(x11W1) denotes the probability of a value x 1 of the state at time t conditional on the information W1 available at timet. We are thus not making a notational distinction between a random variable and particular values of that random variable, just as the 'P' in the above expression denotes simply 'probability of' rather than a defined function of the bracketed arguments. We shall use a more explicit functional notation when needed. Let us consider the discrete-time case. The structural axioms of Appendix 2 are taken for granted (and so also their implication: that past controlsparametrising variables -can be unequivocally lumped in with the conditioning variables). We assume the following modified version of the state-structure hypotheses of Section 8.3. (i) Markov dynamics It is assumed that process variable x and observation y have the property P(xt+l•Yt+liXt, Y,, U,)
= P(xt+l,Yt+IIx, u,).
(1)
(ii) Decomposable cost function It is assumed that the cost function separates into a sum of instantaneous and terminal costs, of the form
C=
h-1
L {3' c(x, u, t) + phch(xh).
(2)
1=0
(iii) Information It is assumed that W1 = (Wo, Y, U1_ 1 ) and that the information available at time t = 0 implies a prior distribution of initial state P(xoiWo). It is thus implied in (ill) that y, is the observation that becomes available at time t, when the value of u1 is to be determine d Assumption (i) asserts rather more than Markov structure; it states that, for given control values U, the stochastic
RIBUTION OF STATE 1 SUFF ICIEN CY OF THE POSTERIOR DIST
287
ibuti on of x, + 1 conditional on X 1 process {x1} is auto nom ous in that the distr r words, the causal depe nden ce is and Y1 is in fact depe nden t only on X 1• In othe . This is an assu mpti on which coul d one-way; y depe nds upon x but not conversely ded a disco unt factor in the cost be weakened; see Exercise 1. We have inclu function (2) for econ omy of treat men t variables and observations can Assu mpti ons (i) and (iii) imply that both state stochastic treat ment can be start ed be rega rded as rand om variables, and that the is presc ribed for initial state .xo (the up, in that an initia l distribution P(xol Wo) radical than it seems. For prior distribution). The implication may be more it the physical state variable is in fact example, for the origi nal mult i-arm ed band unkn own success probabilities. The the para mete r vecto r = {Oi}; the vecto r of r usua l assumptions, makes it no fact that this does not change with time, unde poin t which allows one to rega rd less a state variable. However, the change in view l. this para mete r as a rand om variable is non-trivia own plan t param eters are to unkn that ies impl e abov on Generally, the formulati rega rded as rand om variables, not be inclu ded in the state variable and are to be ibution. That is, one takes the directly observable but of know n prio r distr ture. The· controversy amo ng Bayesian poin t of view to inference on struc on and its interpretation has been statisticians conc ernin g the Bayesian form ulati take the prag mati c poin t of view a battle not yet cons igne d to histor)t We shall whic h lead to a natu ral recursive that, in this context, the only formulations the Bayesian formulation and its math emat ical analysis of the prob lem are mini max analogue. We shall refer to the distr ibuti on
e
(3) citly, the poste rior distr ibuti on of as the posterior distr ibuti on of state. More expli ition al upon the infor mati on x 1 at time t, in that it is the distribution of x 1 cond ral forward recursion. that has been gath ered by time t. It obeys a natu distribution) Under the asTheorem 15.1.1 (Bayes upda ting of the post erior bution P 1 obeys the updating sumptions (i)-(iii) listed above the posterior distri formula t+I, Yt+IIxr. u,) (4) ) I:x, P,(x,)P(x P ( I ) . ) ( t+l Xt+! = "' L..Jx, P, Xt P(yt+1 Xt, Ut
Proof We have, for fixed Wr+l and variable x 1+h P(xt+tiWt+t) ex: P(xt+IoYt+tiW,u,) = LP(x,,Xt+IoYt+!oiWr.ut) x,
288
IMPERFECT STATE OBSERVATION
= LP(xrl Wr, Ur)P(Xr+J,Yt+d W1,XI> u1) x,
= LPr(Xr)P(xr+l,Yt+dxr, ur)· X
The last step follows from (3) and the implication of causality: that P(x1 W 1 , u1 ) = P(x1 W1). Normalising this expression for the conditional distribution of x 1+1 we deduce recursion (4). 0 1
1
Just as the generic value of Xr is often denoted simply by x, so shall we often denote the generic value of P1 simply by P. We see from (4) that the updating formula for P can be expressed
P(x)--+ P'(x)
:=
L:z P(z)a (x,y, lz, u) L:x L:z P(z)ar(x,y!z, u)
(S)
1
where a1 (x, ylz, u) is the functional form of the conditional probability
P(xt+l = X,Yt+l = ylx1 = z, u1 = u). Recall now our definition of a sufficient variable ~~ in Section 2.1. Theorem 8.3.1 could be expressed as an assertion that, under the assumptions (i)-(iii) of Section 8.3, the pair (x1 , t) is sufficient, where x 1 is the dynamic state variable. What we shall now demonstrate is that, under the imperfect-observation versions of these assumptions expressed in (i)-(iii) above, it is the pair (P1 , t) which is sufficient, where P1 is the posterior distribution of dynamical state Xt· For this reason Pis sometimes referred to as an 'informational state' variable or a 'hyperstate' variable, to distinguish it from x itself, which still remains the underlying physical state variable. Theorem 15.1.2 (The optimality equation under imperfect state observation) Under the assumptions (i)-(iii) listed above, the variable (P1 , t) is sufficient, and the optimality equation for the minimal expected discountedfuture cost takes the form
F(P, t) = i~f [ L P(x)c(x, u, t) X
+f3LLLP(z) a,(x,y!z,u)F (P',t+ X
y
1)]
(6) (t
Z
where P' is defined by (5) and the terminal condition for system (6) is
F(P, h) = L P(x)Ch(x).
(7)
X
The prooffollows by the now-familiar backward induction; confirmation is left to the reader.
2 MACHINE SERVICING AND CHANGE -POINT DETECTION
289
Recursio n (6) may seem unattractive in view of the 'fractional linear' dependence (5) of P' on P. In fact, this reduces effectively to a linear relation. Recall that a function ¢(P) is homogeneous ofdegree r in P if ¢(>.P) = N ¢(P) for any positive scalar -\.We shall find it sometimes convenient to write F(P, t) as F(( {P(x)}, t) if we wish to indicate how P(x) transform s for a given value ofx.
Theorem 15.1.3 The value function F(P, t) can be consistently extended to unnormalised distributions P by the requirement that it be homogeneous ofdegree one in P, when the dynamic programming equation (6) simplifies to the form F(P, t)
=i~ [;; P(x)c(x, u, t)+{3 ~ F ( { ;;= P(z)at(x,ylz, u)}, t+ 1) ]·
(8)
Proof Recursion (6) would certainly reduce to (8) if F(P, t + 1) had the homogeneity property, and F(P, t) would then share this property. But it is evident from D (7) that F ( P, h) has the property. The conclusio n can be regarded as an indicatio n of the fact that it is only the relative values of P(x11W1) (for varying Xt and fixed W1) which matter, and that the normalis ation factor in (5) is then irrelevant. Exercises and comments (1) An alternative and in some ways more natural formulation is to regard (x1 , y 1) as jointly constituting the physical state variable, but of which only the compone nt Yr is observed. The Markov assumpti on (i) of the text will then be weakened to
P(xt+I,Yt+I!Xt, Yt, Ut) = P(xr+I,Yt+dxt,Yt.Ut) consisten t with the previous assumpti on (i) of Section 8.3. Show that the variable (P 1,y1,t) IS sufficient, where Pt = {P1(xt)} = {P(xt!Wt)} is updated by the formula
Pt+l (xt+d ex
Lx, Pt(Xt)P(xt+l, Yt+dxt, Yt, Ut)·
2 EXAMP LES: MACHI NE SERVIC ING AND CHANG E-POINT DETECT ION One might say that the whole of optimal adaptive control theory is latent in equation (8), if one could only extract it! Even somethin g like optimal statistical commun ication theory would be just a special case. However, we shall confine our ambitions to the simplest problem which is at all amenable.
290
IMPERFECT STATE OBSERVATION
Suppose that the dynamic state variable x represents the state of a machine; suppose that this takes integer values j = 0, 1, 2, . . . . Suppose that the only actions available are to let the machine run (in which case x follows a Markov chain with transition probabilities PJk) or to service it (in which case the machine is brought to state 0). To run the machine for unit time in statej costs ci> to service it costs d. At each stage one derives an observation y on machine state. Suppose, for simplicity, that this is discrete-valued, the probability of an observation y conditional on machine state j being Pi (y). Let P = { P1} denote the current posterior distribution of machine state, and let 'Y and f(P) denote the average and transient cost under an optimal policy (presumed stationary). The dynamic programming equation corresponding to (8) is then
where 8(P) is the unnormalised distribution which assigns the entire probability mass 2:1 P1 to state 0. The hope is, of course, to determine the set of P-va1ues for which the option of servicing is indicated. However, even equation (9) offers no obvious purchase for general solution. The trivial special case is that in which there are no observations at all. Then P is a function purely of the time which has elapsed since the last service, and the optimal policy must be to service at regular intervals. The optimal length of interval is easily determined in principle, without recourse to (9). A case which is still special, but less trivial, is that in which the machine can be in only two states, x = 0 or 1, say. We would interpret these as 'satisfactory' and 'faulty' respectively. In this case the informational state can be expressed in terms of the single number 7f=
P, Po+Pt
0
This is the probability (conditional on current information), that the machine is faulty. Let us suppose that Pot = p = 1 - q and Pto = 0. That is, the fault-free machine can develop a fault with probability p, but the faulty machine cannot spontaneously correct itself. If we setf(P) = ¢(1r) and assume the normalisation <;b(O) = 0 then equation (9) becomes, in this special case,
( 10) Here we have assumed that c0
= 0, and have defined
291
DETECTION 2 MACH INE SERVICING AND CHAN GE-PO INT
p(y) = (1- 1r)qpo(y) + (p + 1rq)p1 (y),
'(y) 7r
= (p + 1rq)pl (y) p(y)
.
(11)
optimal decision will Form ula (11) gives the updating rule 1r --+ n'. The old value, this value being presumably be to service when 1r exceeds some thresh in principle determinable from (10). of change-point detection This two-state model can also represent the problem in which a poten tial that mentioned in Section 8.1. Suppose that state 0 is state 1 that in which it and pollution source (say, a factory or a reactor) is inactive, alarm and take antithe give to is active. One must decide whether to 'service' (i.e. g the alarm costs Givin y s pollution measures) or not on the basis of observation time. unit per c d; delaying the alarm when it should be given costs in the literature: those in Two partic ular cases of this model have been analysed les respectively, with variab al norm which the observations y are Poisson or of determining the em probl The j. parameters dependent upon pollution state s is soluble, and is ption assum these critical threshold value of 1r from (10) unde r tively. respec em, probl er referred to as the Poisson or Gaus sian disord
BEYOND PAR T 3
Risk-Sensitive and H 00 Criteria
r~~' 4'0--<-
~'~~
,-,
CH APT ER1 6
Risk-sensitivity: The LEQG Model 1 UTILITY AND RISK-SENSITIVITY whereas economists work Control optimisation has been posed in terms of cost, shall be invoking some we Since cost. largely in terms of reward, i.e. negative of reward for the terms in ssion discu economic concepts, let us conduct the . lation moment, before reverting to the cost formu enterprise. One then Suppose that IR is the net mone tary reward from some as to maximise IR. so ) policy wishes to cond uct the enterprise (i.e. choose a follow that one sarily neces However, if IRis a rando m variable, then it does not e rather to choos t migh one will wish to maximise E,. (IR) with respect to policy 1r; mono tone ly usual on, maximise E,.[U(IR)], where U is some non-linear functi would be this then 1 e Figur increasing. For example, if Uhad the concave form of if IR were it benef be of less an indication that a given increment in reward would already large than if it were small. lly defined by the fact The function U is termed a utility function, and is virtua ed that E,.[U(IR)J is the that, on axiomatic or behavioural grounds, one has decid choice of a utility function quantity one wishes to maximise. The necessity for outcome, but it would also arises because one is averaging over an uncer tain from a reward which was arise if one wished to characterise the benefit derived distributed over time or over many enterprises. of expected cost E,. (C) to In cost terms, one could generalise the minim isatio n disutility function, again a Lis the minimisation of the criterion E,.[L(C)]. Here
u(R)
e increasing form usually conFigure 1 The graph ofa utility function U(IR) of the concav of return of utility with reward, sidered. The concavity expresses a decreasing marginal rate which induces a risk-averseness on the part ofthe optimiser.
296
RISK-SENSITIVITY: THE LEQC MODEL
Figure 1 Aconvexdisutilityfonction, implying an effective risk-averse attitude orpessimism.
presumably monotone increasing. One gains a feeling for the implications of such a generalisation if one considers the two cases of Figures 2 and 3. In the case of Figure 2 L is supposed convex, so that every successive increment in cost is reganied ever more seriously. In this case Jensen's inequality implies that E[L(C)] ~ L[E(C)] so that, for a given value of E(C), a certain outcome is preferred to an uncertain outcome. That is, an optimiser with a convex disutility function is risk-averse, in that he dislikes uncertainty. The concave disutility function of Figure 3 corresponds to the opposite attitude. In this case Lis supposed concave, so that successive increments in cost are regarded ever less seriously. Jensen's inequality is then reversed, with the implication that an optimiser with a concave disutility function is risk-seeking, in that he positively welcomes uncertainty. In the transition case, when L is linear, the optimiser is risk-neutral in that he is concerned only by the expectation of cost and not by its variability. All other cases correspond to a degree of risk-sensitivity on his part. One can interpret risk-seeking and risk-averse attitudes on the part of the optimiser as manifestations of optimism or pessimism respectively, in that they
Fig•re 3 A concave disutility jUnction, implying an effective risk-seeking attitude or optimism.
f ~::.
.·
1 UTILITY AND RISK-SENSITIVITY
297
to his advantage or disadvantage illlply his belief that uncertainties tend er mor e explicitly in the next section. respectively. This conclusion will emerge rath erts the criterion back on to a cost The attitude to risk is revealed also if one conv scale by defining (1) which certainly exists if L is stric tly Here L -I is the function inverse to L, then mini misa tion of x" is of cour se lllonotonic. If L is monotone increasing disutility. equivalent to minimisation of the expected C has expectation m and a sma ll Suppose now that unde r policy 1r the cost of C- m then leads to the conc lusio n variance v. Expansion of L(C) in powers (und er regularity conditions on that E"[L(C)] = L(m) +!L "(m )v + o(v) ies that differentials and moments). This in turn impl
L"(m) v x"= m+ L'(m )l+o (v).
(2)
negatively in the criterion according as That is, variability is weighted positively or the dis utility function is convex or concave. base d on Jensen's inequality, in that This argu men t is less convincing than that , it is illuminating in othe r respects; it makes unne cess ary assumptions. However see Exercise 1. icula r attention to the exponential Ther e are now good reasons for paying part r, the risk-sensitivity parameter. This disutility function e-/JC, where ()is a paramete tive (when it is mon oton e increasing) function should be minimised for () nega mon oton e decreasing). However, if we and max imis ed for () positive (when it is then we find that all cases are cove red norm alise back on to a cost scale as in (1), by the assertion that the norm alise d crite rion
(3) crite rion reduces in the case () = 0 to should be mini mise d with respect to Jr. The it corr espo nds to increasingly riskthe classic risk-neutral criterion E"(C); s through positive values or decreases seeking or risk-averse attitudes as (}increase that relation (2) now becomes through negative values respectively. Note
x"(O) = m- Ov/2 + o(v).
(4)
attractive for two reasons. (i) The The exponential disutility function is on a scale of optim ism- pess imis m. para mete r () places the optimiser naturally alone, the coefficient of v in (2) is (ii) In this case, and essentially in this case in this case there is an appr oxim ate inde pend ent of m (see Exercise 1). Tha t is, variability of cost. decoupling of the aspects of expectation and are what one migh t term math eHowever, the mor e compelling reasons ons have a way of turn ing out to be mati cal/p ragm atic in character; such reas the exponential criterion leads to a fundamental. (iii) Und er LQG assu mpti ons
298
RISK-SENSITIVITY: THE LEQC MODEL
the complete and attractive generalisation of LQG theory. Essentially, n Gaussia a ing resembl ng somethi expectation in (3) is then the integral of 'largeterm might one what by density. (iv) If LQG assump tions are replaced scale' assumpt ions and if an exponential criterion is adopted then large-deviation theory become s immedi ately applicable. It is striking that the econom ic concept of risk-sensitivity, interesting in itself, should mesh so naturall y with the mathematics. We shall explore the LQG a generalisation in the remaind er of this chapter. Large-deviation concept s open complex of ideas to which the final part of the book is devoted.
Exercises and comments (1) Show that L"(m)/ L'(m) is indepen dent of m if and only if L(m) is a linear ntly function of an exponen tial of m. A utility function is of course not significa e commut mations transfor such changed by a linear transfor mation, because with the operatio n of expectation. (2) Note an implica tion of Exercise 8.1.3: that relation (4) is exact (i.e. there is no remaind er term) if L(m) is an exponen tial function of a normal variable.
(3) A classic momen t inequality asserts that, if x is a non-negative scalar random is variable, then (Ex) 1/r is non-dec reasing in r. From this it follows that x1r(O) non-increasing in B.
2 THE RISK-SENSITIVE CERTAINTY-EQUIVALENCE PRINCIPLE tial The combin ation of the LQG hypotheses of Section 12.2 and the exponen to and 12.3, Section in d discusse criterion (3) leads us to the LEQG model already most the that 12.3 Section which this chapter is devoted. In fact, we found in
the economical way of proving the certainty-equivalence principl e (CEP) in or king risk-see LQG case was to do so first for the LEQG model in the there 'optimistic' case (} > 0. Let us slightly rephrase the conclusions summar ised in Lemma 12.3.2 and Theorem 12.3.3. Define the stress
(5) the the linear combin ation of cost and discrepancy which occurs naturally in value st evaluation of expecta tion (3). Let us also define the modifie d total-co function G( W1) as in Section 12.3 by (6) e-eG(W,) = f( Y1)extE7r[e-9Cj Wr]· 7r (} is where the extremisation is a maximi sation or a minimis ation according as 12.3.3 Theorem and 12.3.2 positive or negative. Then the conclusions of Lemma can be rephrase d as follows.
UIVALENCE PRINCIPLE 2 THE RISK-SENSITIVE CERTAINTY-EQ
299
equivalence theorem for the riskTheorem 16.2.1 (The risk-sensitive certainty G structure with B > 0. Then the total seeking (optimistic) case) Assume LEQ value function has the expression inf inf t inf§ G(Wr) = gt + u.,;r;; X . t JT:r> - 1[J)(Xr, Yrli Ur-I) + inf{B X, + inf inf [C(X, U) + e- 1UJ(xr+l) ... )XhiXri U)]}
= gc
(7)
u.,;r;;. t x.,.:r>t
. The value of u1 thus determined is where g1 is a policy-independent function oft alone t. If the value of u1 minimising the the LEQG-optimal value of the control at time the LEQG-optimal value of u1 is square bracket is denoted u(X1 , U1_I) then mined by the final Xrminimisation. u( xit), Ut-1) where x?l is the value ofX 1 deter e function at time t is obtained by The first equality of (7) asserts that the valu ess/observation variables currently minimising the stress with respect to all proc formed, and that the value of u1 deterunobservable and all decisions not already immediately suggest the expression of min ed in this way is optimal. This may not achieves what one might term convera CEP, but it is a powerful assertion which isation with respect of functions sion to free form. By this we mea n that an optim ined minimisation with respect to unu1( W1) has been replaced by an unconstra t is, the constraint that the opti mal observables and undetermined controls. Tha achieved automatically in a free u1 should depe nd only upon W1 has been ing constraints will be taken to its extremisation. This process of effectively relax h the full time-integral formulation. conclusion in Chapters 19-21, when we reac h mor e like a CEP, in that it asserts The final assertion of the theorem looks muc optimal value for known X 1 with X 1 that the optimal value of u1 is just the is often confused with the 1 ). The risk-neutral CEP replaced by an estimate not well agreed) the separateness of separation principle, which asserts (in terms Ther e is certainly no such separation the optimisations of estimation and controL and future) affect the value of the in the LEQ G case. Control costs (both past rol rule (even if the process variable is estimates and noise statistics affect the cont perfectly observed) should now be expressed. If Xr is However, we see how a separation principle uation of the two terms inside the provisionally assumed known then the eval which can be regarded as conc erne d curly brackets in the final expression of (7), can proceed separately. The two with estimation and control respectively, misation with respect to X 1, which evaluations are then coupled by the final mini tiveness of this separation is muc h also yields the final estimate x?l. The effec d below. clearer in the state-structured case considere rs interestingly, and requires mor e The CEP in the risk-averse case, B < 0, diffe the fact that relation (12.16) now careful statement. The distinction arises from becomes
Xi
RISK-SENSITIVITY: THE LEQC MODEL
300
(8)
G(W1) =sup infG(Wt+ 1) + ... u,
Yt+l
and that the order of the two extremal operations cannot in general be reversed. The consequence is that the analogue of Theorem 16.2.1 is Theorem 16.2.2 (The risk-sensitive certainty equivalence theorem for the riskaverse (pessimistic) case) Assume LEQG structure with(}< 0. Then the total value function has the expression
G( W1) = g1 + inf sup ... inf sup§ u,
= gt +
Yt+I
uh-I
stat{fJ- 1[])(Xt,
x,
Yh
(9)
Ytli Ur-i)
stat [C(X, U) + 8- 1[])(xt+l• ... ,xhiXti U)]}
+stat
u.,.:r ~ t x.,.:r>t
where g 1 is a policy-independent function oft alone. The value ofu1 thus determined is the LEQG-optimal value of the control at time t. If the value of u1 extremising the square bracket is denoted u(Xi, U1_!) then the LEQG-optimal value of u1 is is the value of X 1 determined by the final Xrextremisation. u( x?l, Ur-1) where
x?l
Here Yh can be regarded as yielding complete information, and so can be identified with X, and the operator 'stat' simply renders the expression to which it is applied stationary with respect to the variable indicated. The first relation in (9) follows as in the risk-seeking case, but the order of the extremisations must now be observed. However, all that is necessary in applying recursions such as (8) is that quantities being minimised (maximised) should be convex (concave) in the relevant argument The cases in which this requirement fails prove interesting; see Sections 3 and 4. The extremisation conditions in the first expression of (9) will yield linear relations which can be solved (i.e. variables eliminated) in any order. The rearrangement of extremal operations in the final expression of (9) corresponds to just such a reordering, but with the characterisations of maximality or minimality now weakened to stationarity of some kind. Although state-structure plays no part in this formalism (and it is for this reason plus economy that we have dragged the reader through the general case) the situation does indeed become more transparent if state structure is assumed. In such a case cost and discrepancy will have the additive decompositions h-1
h-l
c=L
c(xr, Ut, t) h
D(xt)
L Ct + ch
(10)
t=O
t=O
[]) =
+ Ch(xh) =
h
+ LD(xr,YriXt-l, Yt-!iUt-1) =Do+ LDt 1=1
(11)
t=l
say. Here we have not discounted explicitly, and D(x0 ) is the discrepancy derived from the prior distribution of x 0 conditional on W0 .
UIVA 2 THE RISK-SENSITIVE CERTAINTY-EQ
LENCE PRIN CIPL E
301
and the past stress P1 = P(x , W1) as Defm e now the future stress Fr = F(xr, t) the values of t-1
h-1
I:(c.,.+ e- Dr+t) + Ch 1
and
e-
Do+ L(c .,.+ 8- 1D.,.+t)
1
r=O
d out. Tha t is, all controls from time t with all variables except W1 and x, extremise servables at time t except x 1 itself are onwards are mini mise d out and all unob 8 is positive or negative. mini mise d or max imis ed out, acco rdin g as nce for the state -stru cture d
vale Theorem 16.2.3 (Recursions and certainty-equi Then the future and past stresses l. mode G LEQ d case) Assume a state-structure itions are determined by the recursions and terminal cond ext [c(x1, u, t) + e-t D(xr+tlx,; u,) F(xr, t) = inf Xt+l Ut
(12)
(O~t
+F( xr+ t,t+ 1)] F(xh) = Ch(xh)
(13) (14)
+ P(xr-1, Wr-t)) P(xo, Wo)
(0 < t ~h)
= e-t D(Xo)
(15)
ation according as B is positive or where ext indicates a minimisation or a maximis ted u(x, t) then the optimal value of negative.lftheminimisingvalueofu1in Q2) is deno g P + F1 = P(xr, W1) + P(x1 , W 1 ). u1 isu(x,, t), wherexr isthevalueofx1 extremisin 1 nces of the previous two theo rems All the assertions are imm edia te conseque familiar; the last asse rtion is more h and the definitions. Results now look muc r and in equa tions (12) and (14) we plainly certainty-equivalent in char acte mic prog ramm ing equation and the recognise respectively analogues of the dyna distr ibuti on P(x11W1). Because of the upda ting equa tion for the cond ition al these equa tions will reduce to Ricc ati quad ratic natu re of cost and discrepancy from Sect ions 2.4 and 12.5. and Kalm an relations of the type fami liar s itself. Provisional post ulati on of We see how the sepa ratio n prin ciple manifest evaluations of past and future stress, so the value x 1 of curr ent state decouples the cont rol/f utur e and estim ation /pas t that recursions (12) and (14), conc erne d with functions of observables W1 and the can be solved separately for F1 and P1 as lations are then recoupled by the (generally unobservable) x 1• The calcu ate 1 = x~t) , so yielding the fmal estim mini misa tion of P1 + F1 with resp ect to x 1 rol u(x" t). and the certainty-equivalent opti mal cont to that of the LQG case with, however, close then is rges eme h The patte rn whic disc repa ncy term in (12) enforces a a mixi ng of cost and stochastic effects. The
x
302
RISK-SENSITIVITY: THE LEQC MODEL
stochast ic plant equation but also induces a depende nce of the optimal control upon the covarian ce matrix N of plant noise, even if state observa tion is perfect. The cost term in (14) has an effect upon estimates, while the final estimate 1 shows a depende nce on future costs as well as past costs. The future stress F(x~> t) indeed differs from the normali sed value function for the case of perfect state observa tion only by a term depende nt on t alone. This term represen ts the cost due to plant noise, and will be evaluate d in Section 10. Finally, a word on notation . We shall term 1 the minimal stress estimate of x 1, even if indeed the extremis ation of stress is more complic ated than a simple minimis ation. It is identica l with what we have before written x)tl and shall do again: the best estimate of Xt based upon informa tion at time t. We shall use 1 to denote the estimate of x 1 based upon informa tion at time t which extremises past stress. In the risk-neu tral case the two estimate s coincide d, because future costs had no effect upon state estimates. However, we have now to make a distincti on.
x
x
x
3 SOME SIMPLE EXAMPLES It is helpful to gain a feeling for the effects of risk-sensitivity by examini ng a couple of examples which are simple in that they are concern ed purely with estimati on. Conside r the uncontr olled Markov model x 1 = Axr-1 + E1• We suppose that x is scalar and that the plant noise c is white and Gaussia n with variance N The cost function is C = ~:o R.x;- and it is suppose d that the total informa tion available at time 0 is knowled ge of x 0 • The stress function is consequ ently
!
00
§
= !L[R.x7 + (8Nf 1(xt- Axt-d 2]. t=O
At time 0 this is to be extremis ed with respect to the current unobser vables, which are just x 1 , x 2 , x 3 , • . . • We thus have a risk-sensitive formula tion of a pure predicti on problem from time t = 0. In the risk-neutral case the predicto r is evidently just the projecti on estimate: tS'(x 1 jx0 ) = A 1x . The values 0 x)0l extremis ing stress satisfY
( 1 + A 2 + ONR)x(O ) - A(x(O) t t-1
+ x(O) ) =0 x+l
(t
~
0).
Assume , for simplicity, that 0
(16)
e
If is positive then 0 < a < A, and the forecast is optimist ic in that it tends to zero faster (and so incurs a smaller cost) than does the risk-neu tral predicto r. Better expressed: the forecast er is optimist ic, in that he is effectively assumin g that the noise c will be such as to deflect the path of the process in a favourable direction .
3 SOME SIMPLE EXAMPLES
303
> A and the forecast is pessimistic. prediction as having broken down altogether if (} is so the regard One can the extremal negative that equation (16) has no root smaller than unity, when stress is infinite. This occurs when
If(} is negative then a
a~ e·=- (1- A)2 "'
.
NR
maximised with At the critical point Bthe stress function, which is now being has extremal and respect to unobservables, is no longer concave in its arguments, ster has becom e value +oo. One can regard this as the point at which the foreca his apprehenso pessimistic that he has essentially reached a breakdown point: sions ofloss outweigh the reassurance of statistics. the one-stage For an even simpler example, consider a risk-sensitive version of jointly norma l estimation problem of Section 12.6. Suppose that x and y are random vectors with zero mean and covariance matrix
l
cov[; = [
~:: t~
rl
observed and xis and that there is a cost function~ xT Rx. The variable y has been estimate x by the derives one lation formu to be estimated. In a risk-sensitive minimising
lxx 2(}§=( }xTRX + [X]T[ fyx Y
lxy] [X] fyy y
with respect to x, thus deriving
X= -(fxx + (}R)- 1lxyY· and approaches This estimate of course equals the projection estimate if(} is zero, ation matrix the minimal-cost value 0 as(} increases. The matrix lxx is the inform to lxx +OR. this modifY to is y nsitivit risk-se of x conditional on y; the effect of to let costhappy is one that (in e positiv is () That is, one 'gains' information if value 0 ing inimis cost-m its to nearer pressures supply the 'information' that xis e (in negativ is if(} ation' 'inform loses than the data would have suggested). One that e negativ so es becom () When that fear of costs makes one doubt the data). e becom has ser optimi the her: lxx + ()R is singular then estimation fails altoget this which at B value l critica pessimistic to the point of neurotic collapse. The occurs is the greatest root of llxx + 8R! = 0. in a negative In both examples optimisation failed only for (} sufficiently large by zero in below ed direction. This was because the cost function C was bound infinite an yielded it both cases, and stress-extremisation broke down only when and istic, pessim positive cost. This occurred when the optimiser was sufficiently e · negativ infinite one might term the breakdown that of neurosis. If C can take
LEQC MODEL RISK-SENSITIVITY: TH E
304
ur if() is then breakdown can also occ le) sib pos is ard rew te ini inf values (i.e. if an n can yield an infinite e. Th at is, stress-extremisatio sufficiently large and positiv imiser is sufficiently te positive reward) if the opt ini inf an . (i.e t cos e ativ neg s breakdown tha t of euphoria. optimistic; one might ter m thi n of the one-stage l is supplied by consideratio tro con of le mp exa ple sim A of the multi-stage something of the character es sag pre on uti sol ose wh R( xl - r) , problem, n is c = !uTQu +~(xi- r)T ctio fun t cos the t tha se ppo and u0 . Th e problem. Su u are jus t the values of x 0 and x t tha so £, + Bu £T (BN) - 1£. where x1 =A x+ remise the stress C ext to sen cho be to n the are variables u and £ this as convenient form if we write We obtain results in the mo st 1 2AT(Ax + Bu + € - xc)] . 1 [uT Qu + €T(ON)- £ - >.TR- A+
+!
§
= mq .X
pect to Awe find tha t the t to u and £an d then with res Extremising first with respec are given by optimal u and the extremal § 1 r) , §= !( Ax - r)T P- (A xincreasing 1 + J(O), say. Th e effect of R1 N= +O 1 BT BQ + efinite where P = Rke P larger (in the positive-d ma to eed ind is 0) g sin rea optimism (i.e. inc ponding stress value optimal control and corres sense) and to make bo th the mutation of them) are ns (and our implicit com tio isa rem ext ese Th r. alle sm > 0, where if is the e definite, i.e. as long as () itiv pos is P as g lon gest as ate legitim the critical value if is the lar 1 + J(O)J = 0. Equivalently, jRof t roo ar. largest is singul The n which is correct even if R tio lua eva an 0, = B)I RJ( It is always root of II+ int of 'neurotic' breakdown. po the eed ind is d ine erm value if thus det -negative definite. non-positive if Q and R are non le, in tha t they both ave as controls in this examp Note that u and £ bo th beh in the stress function. sion for x 1 and quadratically res exp the in ly ear lin ear app is chosen to to sen to minimise stress, £ cho ays alw is u as ere wh , However e. Th at is, this ing as () is positive or negativ ord acc it ise xim ma or it as he is minimise strate the optimiser according fru or p hel to sen cho is ol' auX.iliary 'contr in the next section. shall expand up on this point risk-seeking or risk-averse. We (x1 - r) T R(x1 - r) cost function then the ter m! If we modify R to - R in the Th e calculations ard, which we wish to be large. rew a as her rat ed ret erp int be is to of sign of R) is negative P (modified by the change as g lon as id val l stil are B) I = 0. This above the smallest root of II - RJ ( eed exc t no s doe ) as( g lon as er sign. definite, i.e. ric' breakdown. It can be of eith pho 'eu of int po the s ent res up per bo un d rep T STATE OBSERVATION 4 TH E CASE OF PE RF EC ncipal interest: r is devoted to the case of pri pte cha the of der ain rem the omogeneous Most of del in the standard time-h mo n tio ula reg d ure uct str Section 12, the state(12.1)-(12.4). Th e exception is ns atio equ of on lati mu for undiscounted
4 THE CASE OF PERFE CT STATE OBSERVATION
305
look at the question of where we find ourselves forced to take a funda menta l discounting. m reduces then to Let us first of all assume perfect state observation. The proble the dynam ic ially essent ; stress· the solution of the equation (12) for the future progr ammi ng equation. This can be written
F(xr,t)
= inf Ut
ext[c(xt,Ut) +!B- 1 (£TN- 1 c)t+ 1 +F(x t+t,t+ 1)]
(17)
Xt+l
ing as eis positive or where 'ext' denotes a minimisation or a maxim isatio n accord In virtue of this (12.1). ion equat negative, and €t+l and xt+ 1 are related by the plant last fact we can rewrite (12) as Er+l• t + 1)] 1 F(x,, t) = inf ext[c(x,, ut) + !B- 1 ( ET N- c) 1+ 1 + F(Ax, +Bur + Ut
(18) l variable, enteri ng But in this form we see that € can be seen as a subsidiary contro atic cost just as u quadr a ng carryi the plant equat ion linearly just as u does and e him if is oppos to and e does. It is chosen to help the optimiser if (J is positiv d by the wielde are negative. One might regard u and E as the controls which es that assum vely optimiser and by Nature respectively. The optim iser effecti is risk-seeking or riskNatur e is working with him or against him according as he respectively. Note that avers e-a fair characterisation of optim ism or pessimism Natur e makes its move first, at each stage. equation, but only in Of course, € does not appea r as a control in the actual plant In the actual plant s. proces the predicted course of the optimally controlled as a 'control' for the rs equation it is simply rando m process noise, as ever. It appea a stress extremisation; predic ted process because this prediction is generated by current control. the extremisation which determines the optim al value of the familiar Ricca ti The LQ character of recursion (18) implies a solution on on of the risk neutr al lines, which can indeed be expressed in terms of the soluti case.
e
has the quadratic · Theorem 16.4.1 The solution of equation ()8) for the future stress form (19) F(x1, t) =! (xTilx) 1
if it has thisform for t = h, and the optimal control then has the linear form Ut = KtXt·
(20)
(6.31) of the riskHere II1 is determined by either of the alternative forms (2.25) or ative equations neutral Riccati equation and K1 by either of the corresponding altern equations by (2.27) or (6.32) if ITt+ 1 is replaced in the right-hand side ofthese (21)
RISK-SENSITIVITY: THE LEQC MODEL
306
IIM\ Validity of these conclusions is subject to the proviso that the matrix J(B) + controlted augmen the is J(B) where t, should be positive definite for all relevant power matrix (22) 1. We Proof This is inductive, as ever. Suppose that relation holds at time t +
leave the reader to verify the identity ext[(t? (BN)- 1€ +(a+ €liT( a+ €)]
= aTfia
(23)
•
If we perform the €-extremisation in (23) we thus obtain · 1, Ut) F(xt, t) = mf[c(x u,
+ 2I (Axt +But) T-ITt+ I (Axt +But)].
But this is just the inductive relation which held in the risk-neutral case with the g substitution of IT1+1 for II 1+ 1. The question of validity is covered in the followin D discussion.
If we consider the solutions of the risk-neutral case in the alternative forms (6.311 (6.32) then we see that the only effect of risk-sensitivity is simply to replace for the control-power matrix J = BQ- 1BT by the augmented form (22), so that, example, the Riccati equation becomes
(24) if Shas been normali sed to zero. That is, the control-power matrix is augmented as by a multiple of the noise-power matrix. This illustrates again the role of noise control d intende the against or with working an effective auxiliary control, according as eis positive or negative. If the final maximisation with respect to >. of the second. set of displayed of equations in Section 6.4 is now to be valid then we require the final condition 1. the theorem. This sets a lower bound on 8: see Exercise The optimal (i.e. stress-extrernising) values of u1 and €t+I are matrix multiples + of x 1; if they are given these values then the quantities Ax1 + But and Ax1 the as BK =A+ 1 r But+ €t+l can be written r 1x 1 and f' 1x 1• We can regard 1 actual gain matrix for the optimally controlled process and f' 1 as what one might call the predictive gain matrix: the gain matrix that would hold if €1+1 really did by take the role of an auxiliary control variable and take the value predicted for it 6.4 Section stress extremisation. By appealing to the alternative derivations of one finds the evaluations
r1 =A -
J(J(e)
+ rr~.\)- 1 A,
(25)
by if S has been normali sed to zero. In other cases A should be replaced a ily A - sT Q- 1B. If infinite horizon limits exist then it is f' which is necessar
) CASE 5 THE DISTURBED (NON-HOMOGENEOUS
307
is excessively optimistic. Note the stability matr ix; r may not be if the optim iser relation
Exercises and comments of the Ricca ti equation (24) (1) In the scala r case the infin ite-h orizo n form beco mes
A 2 II II= R + 1 + J(8)II that the equa tion has a finite wher eJ(O ) is given by (22). Assu meth atR > 0. Show 2 ~ 1 and J(fJ) > -(1 -IAI )/ R non-negative solution iff J(B) > 0 in the case !AI 2 2 QIN d ii is - B IN Q or - B in the case lA I < 1. That is, the critical lower boun is unsta ble or stable. (l - IAI 2 )/NR accor ding as the unco ntrol led plant US) CAS E 5 THE DIST URB ED (NON -HOM OGE NEO the addit ion of a deter minis tic If the plant equa tion (12.1) is modi fied by distu rbanc e d1 : x, = Axr-1 +Bu r-l+ d, + tr the non-h omog eneo us quad ratic then we shall expe ct the future stress to have form
(26)
can generalise relation (23) to where + · · · indicates terms indep ende nt of x. We obtai n )] = aTfi a- 20'Ta + · · · ext,[(tT (BN)- 1 £+( a+ tlii( a + £) - 2aT (a+£ where + · · · indicates terms indep ende nt of a, the and
matr ix fi is again given by (21),
From this we dedu ce
ministic disturbance d then Theorem 16.5.1 If the plant equation includes a deter modified Riccati equation indithe future stress has the form (26) with Il1 obeying the recursion cated in Theorem 16.4.1 and a 1 obeying the backward (27) Here f\ is the predictive gain matr ix defined in
(25).
RISK-SENSITIVITY: THE LEQC MODE L
308
(27) differs from the Verification is immediate. We see that recursion by the substitution of of corresponding risk-neutral recursion of Section 2.9 only be, since we have f' 1 for the risk-neutral evaluation ofr 1 • This is indeed as it mustoptim isation with u by replaced optimisation with respect to a single control respect to the pair (u, e). al control as The same argument leads to the evaluation of the optim (28) 1 lN)-l (at+ I - IIt+ldt+l)· Ut = KtXt + (Q + BTfrt+IB)- BT (I+ OIIt+ the explicit feedbackThe combination of (28) and the recursion (27) gives one . (2.65) to gous analo feedforward formula for the optimal control more rapidly and much e emerg s As for the risk-neutral case, all these result isation techfactor the adopt cleanly (at least in the stationary case) when we . niques of Sections 6.3; see Section 11 and Chapter 21.
TION OF THE 6 IMPERFECT STATE OBSERVATION: THE SOLU SION P-RECUR the forward recursion In the case of imperfect observation one has also to solve the F-recursion implies (14) for the function P(xtJ W 1). Just as the solution of of perfect observation, solution of the control-optimisation problem in the case the estimation problem. so solution of the P-recursion largely implies solution of estimate x)Jased upon 'Largely', because the P-recursion produces only the state is then quickly derived, past stress.. However, the full minimal-stress estimate x1 as we shall see in the next section. ions (12.1)-(12.4) we For the standard regulation problem as formulated in equat can write the P-recursion (14) as (29) BP(xt, W,) = min[Bct-1 + Dt + BP(xt-1, Wt-d ) Xr-1
where Ct-1
=![~r_J~ ~H~t-~·
plant and observation and f, TJ are to be expressed in terms of x, y, u by the relations (12.1) and (12.2). have a limit, D, say, Now, if we take the risk-neutral limit()----+ 0 then ()p will which satisfies the risk-neutral form of (29) (30) D(xt, W,) = min[Dt + D(xt- l. Wi-1)] .l"t-1
and has the interpretation D(xr, W,)
= D(xr, Yr\; Ur-d
= D(xt\ W,)
+ ···
=! [(x- x? v- (x- .X)] + ··· 1
1
(31)
309
6 IMPE RFEC T STATE OBSERVATION
fact identifiable with D( Y1I; U1_ 1 ) ) , Here + · · ·indicates term s not involving x, (in cova rianc e of x 1 conditional on W1• and 1 and V1 are respectively the mea n and and, as we saw from Section 12.5, In this risk-neutral case is identifiable with of xand Y. the Kalm an filter and the relation (31) implies the recursive upda tings know n solution of (30) to dete rmin e Ricc ati recursion. We can in fact utilise the that of (29). that OP has the quadratic form In the risk-sensitive case we can again establish exhibited in the final mem ber of (31), so that
x
x
x
P(xr. W,) =
;o
[(x- x)
v-
1 (x-
x)Jr + · · ·,
(32)
of x, irrelevant for our purp oses . where + · · · again indicates term s inde pend ent estim ate of XT which extremises past The quantity X1 can now be identified as the ure the precision of this estim ate in stress at time t, and V1 can be said to meas stress as x 1 varies from 1• Rela tion that (OV1)- 1 measures the curvature of past quantities. (29) now deter mine s the upda ting rules for these
x
past stress has the quadratic form Theorem 16.6.1 Under the assumptions above the prescribed mean and variance of xo conditional ~2) with x0 and V1 identified as the those ofx1_ 1and Vr-1 by the Kalon Wo. The values of X1 and Vr are determinedfrom (1 2.24) with the modifications man filter (1 2.22)1 (1 2.23) and the Riccati recursion that x1 is replaced by X1, Vr-1 by Vt-l
-1
= ( Vt-1 +OR )
-1
(33)
andxr-1 by (34)
ofthese recursions it is necessary in the right-hand sides ofthese relations. For validity for all relevant t. that V;=_11 + cr M- 1C+ should be positive definite agai n inductively from (29). ReProof The quad ratic character (32) of P follows · If we assu me from (30) only in the addi tion of the term Ocr-! cursi on (29) differs ted then we fmd that that P(x 1_ 1, W1_ 1) has the quad ratic form asser (Jet-!
+ BP(Xt-1, Wr-1)
=! [(x- x)Tv -I (x- x)]t-1 +ter ms independent of Xr-1
whence it follows that
al case with the subs But this is just the recursion ofth e risk-neutr and Vt-! for Xt-! and Vt-!·
titution of Xr-1
0
310
RISK-SENSITIVITY: THE LEQC MODEL
The modified recursions are again more transparent in the alternative forms (12.59), (12.60). One sees that, as far as updatin g of Vis concern ed, the only effect 1 of risk- sensitivity is to modify CTM- 1 C to CTM- C+BR. That is, to the one adds the matrix ation information matrix associated with a single y-observ or negative (positive BR, reflecting the 'information' implied by cost-pressures according to the sign of B). The passage (34) from 1_ 1 to Xt-1 indicates how the estimate of Xt-1 changes if we add present cost to past stress. In continuous time the two forms of the modified recursions coincide and are more elegant; we set them out explicitly in Section 8.
x
7 IMPERFECT STATE OBSERVATION: RECOUPLING In the risk-neutral case the recipe for the coupling of estimation and control asserted by the CEP is so immedi ate that one scarcely gives it thought: the optimal control is obtaine d by simple substitution of the estimate x1 for Xt in the its perfect-observation form of the optimal rule. It is indeed too simple to suggest the that is This 16.2.3. Theorem from know we which sation, risk-sensitive generali optimal control at time t is u(.Xt, t), where u(x~o t) is the optimal perfectof observation (but risk-sensitive) control and x1 is the minimal-stress estimate t). , F(x + ) W 1 , P(x 1 1 xr: the value of x 1 extremising It was the provisional specification of current state x 1 which allowed us to d decouple the evaluations of past and future stress; the evaluations are recouple ofF (32) and (26) ons evaluati separate the that by the determi nation of 1• Now be and P have been made explicit in the last two sections, the recoupling can also made explicit.
x
Theorem 16.7.1 The optimal control at timet is given by u1 = K/x,, where
.x, = (I+ ev~rr~r\i·, + evto-t)·
(35)
Here II1, K1 and o-1 are the expressions determined in Theorems 16.4.1 and 16.5.1 and V1, xt those determined in Theorem 16.6.1. This follows immediately from the CEP assertion of Theorem 16.2.3 and the Xr evaluations of the last two sections. Note that 1 is an average of the value of which extremises past stress and the value II;- 1a1 which extremises future stress. As one approaches the risk-neutral case then the effect of past stress swamps that of future stress.
x
x,
8 CONTINUOUS TIME to The continuous-time analogue of all the LEQG conclusions follows by passage The . the continuous limit, and is in general simpler than its discrete-time original
7 IMPERFE CT STATE OBSERVATION: RECOUPL ING
311
analogues of the CEP theorems of Section 1 are obvious. Note that the u and t: extremisations of equation (18) are now virtually simultaneous: the optimiser and Nature are playing a differential game, with shared or opposing objectives according as eis positive or negative. The solutions for past and future stress in the state-stru ctured case simplifY in that the two alternative forms coalesce. The solution for the forward stress for the disturbed but otherwise time-homogeneous regulation problem is F(x, t)
= !xTIIx- aT x + · · ·
where II and a are functions of time. The matrix II obeys the backward Riccati equation
This reduces to
IT+ R + ATII +ITA- IIJ(O)II = 0
(37)
if Shas been normalis ed to zero, where the augmented control-power matrix J(O) again has the evaluation
The vector a obeys the backward linear equation
a-+ f'T
(T-
IId
=0
where f' is the predictive gain matrix
The optimal control in the case of perfect state observation is
u = Kx+ Q- 1BTa-
(38)
. where the time-dep endence of u, K, x and a is understo od, and
(39) In the case of imperfec t state observation the past stress has the solution
P(x, W)
= ;e(x- x)TV- 1(x- x) + ...
where the time-dependence of x, W, forward Riccati equation
xand Vis understood. The matrix Vobeys the
RISK-SENSITIVITY: THE LEQC MODEL
312 which reduces to
if L has been normalised to zero. The updating formula for Kalman filter) is
x (the risk-sensitive
dX/dt =Ax+ Bu + d + H(y- Cx)- OV(Rx +STu)
(40)
where H = (L + VC)M- 1 • Recoupling follows the discrete-time pattern exactly. That is, the optimal control is u = Kx + Q- 1BTu where Kis given by (39) and x is the minimal-stress estimate of x:
9 SOME CONTINUOUS-TIME EXAMPLES The simplest example is that of scalar regulation, the continuous-time equivalent of Exercise 4.1. Equation (37) for II can be written
~ = R + 2AIT- J(O)II2 =/(IT),
1
I
I
1
(41)
say, where J(O) = B2 Q- 1 + BN, and s represents time to go. Let us suppose that R > 0. We are interested in the non-negative solutions of f(IT) = 0 which moreover constitute stable equilibrium solutions of (41), in that/' (II) < 0. In the case J(B) > 0 there is only one non-negative root, and it is stable; see Figure 4. In the case J(8) = 0 there is such a root if A < 0 and no non-negative root at all if can be A~ 0. If J(O) < 0, then there is no non-negative root if A~ 0, but there of root one if A is negative and -J (0) not too large; see Figure 5. In fact, there is a the required character if A < 0 and R - A2 / J( 0) < 0. To summarise:
II II
Figure 4 The graph ofjTI in the case J > 0; it has a single positive zero.
9 SOME CONTINUOUS-TIME EXAMPLES
313
II is a positive zero if J exceeds a criFigure 5 The graph of[IT in the case J < 0, A < 0; there value. tical
em set out above, with S = 0 Theorem 1691 Assume the scalar regulation probl itude of K decrease as e inand R and Q both positive. Then both II and the magn creases. down value 0equals If A ~ 0 (i.e. the uncontrolled plant is unstable) then the break value. - B2 / N Q and II becomes infinite as() decreases to this breakdown value 0 equals the then ) stable is plant d If A < 0 (j. e. the uncontrolle solution TI of (41) at()= 0 is -B2 jNQ - A 2 jNR. The non-negative equilibrium finite, but is unstable to positive perturbations. linea rised pend ulum mode l of A secon d-ord er exam ple is provi ded by the on the angle of defle ction a versi Secti on 28 and Exercise 5.1.5. In a stoch astic from the vertic al obeys
a= aa+ bu+ E
icien t a is negative or posit ive wher e E is white noise of powe r N and the coeff ing or the inver ted posit ion. The accor ding as one seeks to stabi lise to the hang 2 Q strictly positive. 2 cost funct ion is! (r 1a 2 + r 2a + Qu ), with r 1 and 2 Q- 1 + BN. It follows from that The analysis of Secti on 2.8 applies with J = b II of the Ricca ti equa tion if and analysis that there is a finite equil ibriu m solut ion . The break down value is thus only if J > 0, and that this solut ion is then stable of the hang ing posit ion 0 = -1/ NQ, whatever the sign of a. The great er stabithelityrelati ve magn itude s of II in comp ared with the inver ted posit ion is reflected only neutr ally stable, rathe r than in the two cases, but the hang ing posit ion is still truly stable. stead y state is provi ded by the Finally, a secon d-ord er exam ple whic h has no solut ion for the optim al contr ol inert ial missile exam ple of Exerc ise 2.8.4. The obtai ned there now beco mes
u=-
Ds(x1 + x2s) Q + D(l + BNQ)s3 /3'
314
RISK-SENSITIVITY: THE LEQC MODEL
where sis time to go. The critical breakdown value is iJ = -1 INQ - 3INDs3, and so increases with s. The longer the time remaining, the more possibility there is for mischance, according to a pessimist. 10 AVERAGE COST The normalised value function F(x, t) defmed by e-BF(x,t)
= extuE[e-8C'Ixr = x,ur = u].
(42)
in the state-structured case should be distinguished from the future stress defined in Section 2, which is in fact the x-dependent part of the value function The neglected term, independent of x, is irrelevant for determination of the optimal control, but has interest in view of its interpretation as the cost due to noise in the risk-sensitive case. Let us make the distinction by provisionally denoting the value function by F,(x, t). Theorem 16.10.1 Consider the LEQG regulation problem in discrete time with perfect state observation and no deterministic disturbance (i.e. d = 0). Then the normalised value function Fv has the evaluation
F,(x, t) = F(x, t) + 8r
(43)
where F(x, t) is the future stress, evaluated in Theorems 16.4.1 and 16.5.1, and (44)
The proof follows by the usual inductive argument, applied now to the explicit form exp[-OF,(x, t)] = e:t(?rrn/2 INI- 1/ 2
j exp[-!eTN- e- Oc(x,u) 1
- OFv(Ax + Bu + e, t + 1)] de of recursion (42). The evaluation (44) of the increment in cost due to noise indeed reduces to the corresponding risk-neutral evaluation (10.5) in the limit of zero 0. It provides the evaluation 1 (45) 'Y(O) = 26 log!/+ ONITI of average error in the steady state, where II is now the equilibrium solution of the Riccati equation (24) (with A and R replaced by A - ST Q- 1B and R - ST Q- 1S if Shas not been normalised to zero). More generally, it provides an evaluation of the average cost for a policy u(t) = Kx(t) which is not necessarily optimal (but stabilising, in an appropriate sense) if II is now the solution of
315
10 AVERAGE COST
BK)T (II- 1 + ON)- 1(A + BK). II= R + KTS + ST K + KT QK +( A+
(46)
methods of rage cost is best evaluated by the For mo re general models the ave uced the expression Section 13.5. Recall tha t we there ded (47) 1(8) = 4 0 log ll + 09if(w)l dw ilising, but which is linear, stat ion ary and stab for the average cost und er a policy deviation the spectral density function of the otherwise arbitrary. Her e f(w) is s expression the associated penalty matrix. Thi vector 6. und er the policy and 9i tion of stat e variant model; the re is no assump is valid for a general linear time-in rem ain s task the d, han ervation. On the oth er structure or of perfect process obs of f(w ) s clas the g the policy and determinin of det erm inin gf( w) in terms of icy is varied. which can be generated as this pol rmined by evaluation (47) reduces to tha t dete It is by no means evident tha t the (supposed cy d regulation case with the poli (45) and (46) in the state-structure straightis iation in the risk-neutral case stabilising) u = Kx. The reconcil ht per hap s e case is less so and, as one mig forward. Tha t for the risk-sensitiv factorisation: follows by appeal to a canonical conjecture from the form of (47), see Exercise 20.2.1. years is to ch has become pop ula r over recent A view ofsystem per form anc e whi urbances dist s iou var system output, and the consider the deviation vector !::. as system as e) nois n atio (e.g. pla nt and observ to which the system may be subject funce ons resp cy characterised by the frequen input. A given control rule is the n rule the ose cho ch it induces. On e wishes to tion G( iw) from inp ut to out put whi sen. cho be ld there are many nor ms which cou to make G small in some nor m, and ut inp em ld be regarded as a collective syst So, suppose tha t the noise inputs cou r filte g lisin wer) mat rix 91, say. (A pre-norma (wh ich is white with covariance (po ion Express inc orp ora ted in the total filter GJ which would achieve this could be (47) the n becomes
!j
1(8) =
4~0 j
!j
log !/+ 09iG(iw)9lG(( -iw ?!d w = 4 0
Iog!I + 09iG9lGI dw (48)
respect to of minimising this expression with and the problem then becomes one of LE QG ion not the m on G, generated by G. Expression (48) is indeed a nor blem in pro on sati me). To phr ase the optimi optimisation (in the stationary regi pter. cha t nex the tain issues, as we shall see in this way helps in discussion of cer fact the with ntial issue: how does one cope However, it also glosses over an esse te qui is ied ns G generated as policy is var tha t the class of response functio es vid t it pro amic pro gra mm ing approach tha restricted? It is a virtue of the dyn on of the cati cifi spe by lied constraints imp an automatic recognition of the n. st be provided in a direct optimisatio system; some equivalent insight mu
316
RISK-SENSITIVITY: THE LEQC MODEL
For reasons which will emerge in the next chapter, one sometimes normalises the matrices if\ and 91 to identity matrices by regarding G as the transfer function from a normalised input to a normalised output (so absorbing 9l and 91 into the definition of G). In such a case (48) reduces to
1(8)
1 = 41re
J
-
logjl + BGGI dw.
(49)
The criterion function (48) or (49) is sometimes termed an entropy criterion, in view of its integrated logarithmic character. However, we should see it for what it is: an average cost under LEQG assumptions. In the risk-neutral case (49) reduces simply to the mean-square norm (47rf 1 Jtr[GG]jdw, also proportional to a mean-square norm for the transient response.
11 TIME-INTEGRAL METHODS FOR THE LEQG MODEL The time-integral methods of Sections 6.3 and 6.5 are equally powerful in the risksensitive case, and equally well cut through detailed calculations to reveal the essential structure. Furthermore, the modification introduced by risk-sensitivity is most interesting. We shall consider this approach more generally in Chapters 21 and 23, and so shall for the moment just consider the perfect observation version of the standard state-structured regulation model (12.1)-(12.4). The stress has the form §
= l_.)c +!ET(ON)- 1E]r +terminal cost. T
This is to be extremised with respect to u and f subject to the constraint of the plant equation (12.1). If we take account of the plant equation at time 7 by a Lagrange multiplier >.r and extremise out E then we are left with an expression
~ = I'::[c(xn ur) + >.;(xr- Axr-1- Bur-l)- !B>.JN>.r] +terminal cost (50) T
to be extremised with respect to x, u and>.. This is the analogue of the Lagrangian expression (6.19) for the deterministic case which we found so useful, with stochastic effects taken care of by the quadratic term in >.. If we have reached time t then stationarity conditions will apply only over the time-interval 7;;;: t. The stationarity condition with respect to fr implies the relation
between the multiplier and the estimate of process noise. In the risk-neutral case 8 0 this latter estimate will be zero, which is indeed the message of Section 10.1. Stationarity of the time-integral (50) with respect to remaining variables at time 7leads to the simple generalisation
317
12 WHY DISCOUNT?
sT
(r
Q
~
t).
(51)
-Bf / case, when deter mini stic distu rban ces of equa tion (6.20) (in the speci al regu latio n . The matr ix oper ator
J
[c(x, u) +
A.T(;~;- a(x, u))- !e.>..TN.\] dr +ter mina l cost,
(52)
linea r a and quad ratic c. This can be whic h can be treat ed in the same way for the form (7.2) assoc iated with the regar ded as a stoch astic gene ralisa tion of in .:\ repre senti ng the stoch astic max imum princ iple, with the quad ratic term ensitivity. We take up these matt ers effect insof ar as it repre sents the effect of risk-s in Part 5.
12 WH Y DISCOUNT? whic h is time- homo gene ous, and has Supp ose we assu me a mode l in discr ete time Supp ose also that it is disco unted , both an infin ite horiz on and an infin ite past. I:~;;, 0 f3T Cn wher e /3 is the disco unt so that the cost func tion from time 0 is§= depe nd explicitly upon time T. The facto r and the insta ntan eous cost cT does not expe cts the optim al polic y also to one form ulati on then seems time -inva riant and ider optim isatio n on an expo nenti al be so (i.e. statio nary) . However, if we cons 1r is chos en to extre mise crite rion from time t ( ~ 0), then the polic y (53) of futur e diSCOUnted COSt at time f. Where Cl =I:~~ I /3'"-IC T iS the prese nt Value rred befor e time t, which will not The prop ortio nalit y facto r involves costs incu the optim al policy cann ot poss ibly affect polic y from time t. We see from (53) that ing-p oint from time 0 to time t we be time -inva riant , beca use in shift ing the start para mete r from (} to {JI6, all othe r have in effect chan ged the risk-sensitivity facto rs rema ining statistically cons tant. -inva riant if correctly viewed, and One feels that the mod el is nevertheless time red. The reme dy is certa inly not to wond ers how the corre ct view can be resto value at each stage; this woul d be reno rmal ise the cost- funct ion to prese nt
318
RISK-SENSITIVITY: THE LEQC MODEL
inconsistent with the principle that an optimisation from a given time can be embedded in an optimisation from an earlier time. It seems necessary to think through the concept of discounting afresh. There seem to be three motivations or justifications for introducing discounting, and some doubt as to whether any of them are applicable in the context of process control. These are (i) mathematical convenience, (ii) the possibility of random termination, and (iii) the growth of capital by compound interest. For point (i), discounting is indeed a convenient way of reducing a total cost over an infinite horizon to something finite, mathematically attractive to handle and reminiscent of the concept of Abel summation. However, convenience is not a reason; any regularising reformulation must have an operational justification. If one expects that the controlled system will settle down to a steady state of some kind then minimisation of average cost in this steady state seems the natural criterion. The only point which should perhaps be emphasised is that already made earlier: the model should be such that transients are excited spontaneously if a policy effective in the steady state can also be guaranteed to be effective against transients. Point (ii) corresponds to the conviction that nothing goes on for ever. The investor of Section 2.2 cannot postpone consumption of his wealth for ever; he must reckon with the fact that he will die at some point. No industrial plant or process will be run for ever; the uncertainties of future economics and technology are at least certain in predicting that. Suppose that, if the plant has survived in its existing form to a given timet, then it has probability f3 of doing so to time t + 1. Then one can regard the discounted term {3F(-, t + 1) in the dynamic programming equation as representing the expectation of future cost over survival prospects. The expectation should also include a term ( 1 ~ {3) times the cost incurred if the plant does not survive to time t + 1. This term will presumably be zero if non-survival simply means termination; it will be something different if non-survival means movement into a new technology, for example. However, while one may have to reckon with such contingencies on a planning time-scale, one scarcely needs to do so on the time-scale of routine control, for a which an average criterion then seems more appropriate. A secondary observation is that a discounting induced in this way would, under LEQG assumptions, nevertheless destroy LEQG structure: see Exercise 1. It is alternative (iii), the effect of compound interest, which is the usual justification. However, a complete model has to make the interest mechanism explicit. The introduction of interest implies the existence of two alternatives: either to use one's money to cover the costs of the enterprise (i.e. of the system being controlled) or to leave it in the bank to grow by compound interest. (More generally, one could consider the allocation of resources between several enterprises competing for support, but the comparison of one uncertain enterprise with the certainty of an interest account is sufficient to make the
l
319
12 WH Y DIS COU NT?
t now the process being controlled we mus point.) To the variable x, describing who ner t-ow plan represents the capital of the add ano ther variable, Zt, say, which n rsio recu the y costs. This capital will obe is run ning the process and paying the (54) 1 at each stage and c, is to be if unused capital grows by a factor {3s net process cost to the plant-owner. Thu
Zh
= p-h ( zo -
L !Y c.,. h-1
)
regarded as the
.
-r=O
ose e terminal capital zh then he will cho If the plant-owner wishes to maximis if or ar line is U If U tion some utility func policy 1r to maximise E... [U(zh)] for the n of this will be equivalent to minimisatio there is no statistical variation then and the not, will it s case er p-r c.,.). In all oth expected discounted cost E... CE~:~ lem. prob the of able the state vari current capital z 1 must figure as part of izon t that one mus t have an infinite-hor poin the still is Of course, there e care is stationary opti mal policy, and som formulation if one is to derive a that because the surplus of capital above needed in finding such a formulation, ld, cou growing exponentially. However, one needed to cover net process costs is at will z ation of the probability of ruin: that for example, consider the minimis some future time fall below zero. of the point is that it is the non-linearity This is in a sense a side-issue. The el as mod the conventional discounted-cost exponential criterion which exposes ete in its state description. essentially incomplete, and incompl Exercises and comments unted one wishes to minimise total undisco (1) Suppose that the criterion is that one ory, hist , conditional on current process cost incu rred over a lifetime, and that ion isat xim g the the next unit time step. (Ma has constant probability f3 of survivin cost into but this can be translated of reward would be more natural, is using optimiser has survived to age t and minimisation) Then, given that the er is eaft ion which he must extremise ther an exponential criterion, the express
- f3)(3j = th, a rand om variable, and (1 Here t is the time of his dea tion is solu mal is invariant in t, and its opti P(t = i + jlt ~ t). This problem amics dyn and s es LQ G assumptions on cost stationary However, even if one mak G LEQ that lies imp of this last expression (apart from those of death), the form structure has been lost.
320
RISK-SENSITIVITY: THE LEQC MODEL
Notes on the literature The LEQG criterion was first investigated by Jacobson (1973, 1977), who solved the perfect-observation state-structured case and saw the role of the noise variable as an independent auxiliary control. This treatment was extended by Speyer (1976) and Speyer, Deyst and Jacobson (1974) and Kumar and van Schuppen (1981). However, the general state-structured case with imperfect observation resisted solution until Whittle (1981) deduced the risk-sensitive certainty equivalence principle; the assumption of state structure was dropped in Whittle (1986). Whittle did not observe the point that the extremal operations in equation (8) do not commute in general; this point was corrected by Fragopoulos (1994). However, it has no consequences if the appropriate saddlepoint exists, and is located by stationarity conditions. The entropy evaluation (47) of the average cost under LEQG assumptions is due to Glover and Doyle (1988). Chang and Sobel (1987) observed that the combination of an LEQG criterion and discounting implied non-stationarity of the optimal policy.
CH AP TE R1 7
The H00 Formulation 1 THE H oo LIMIT In Section 16.10 we quoted the expression "fa( B)=
4~&
J
log[ji + BG(iw)G( -iw)Tj] dw =
4~8
J
log[ji + BGG!] dw, (1)
rate of cost incurred in the stationary derived in Section 13.5, for the average sensitivity parameter e. This relation, regime und er the LEQG criterion with risk8), expresses the average cost very first derived by Glover and Doyle (198 function G induced by a stationary, explicitly in terms of the system transfer onse' we mean the response of a linear stabilising policy. By 'system resp dardised white-noise input. We are standardised deviation output to a stan n G(£i2) and so transfer function G(s). assuming that the equivalent filter has actio in the discrete-and continuous-time The limits of integration are ±n or ±oo cases respectively. by some linear stationary stabilising The class of response functions induced The aim of steady-state optimisation is policy constitutes the class of feasible G. minimal in an appropriate norm , and to choose a G in this class which is on G which is implied by choice of the expression (1) can be regarded as the norm immediately relevant properties of this LEQ G criterion. Let us summarise the norm. rable over the relevant frequency range. Theorem 17.1.1 Assume that GG is integ Then: + BGG is positive definite for all real (i) In the range e > () 0 for which the matrix I increasing in e. w, expression (1) is positive,finite, and nontive, below which ro(B) is no longer nega ily (ii) The critical value () 0 , necessar defined is determined by (2) -B( / = c?(G)
the maximal eigenvalue ofGG. where a1 (G) is the maximal value (over w) of (16.48) to expression (1) implies that Proof The normalisation of expression negative. It follows then that the criter9\ ~ 0, and so that the cost function is nonall real() for which it is defined; we see ion x1r(B) of (16.1.3) is non-negative for
322
THE Hoo FORMULATION
from Exercise 16.1.3 that it is also non-increasing in B. These properti es then transfer to the limit evaluation /G (B). Integrability of GG (which implies that G is proper, in the continuous-time case) implies integrability oflogj/ + BGGI as long as the matrix I + BGG remains positive definite. As 8 is decreased this condition fails first at()= ()G· As we shall see by example, the average cost /G(B) may or may not become infinite at that of value, but its definition certainl y fails from that point, because the derivation 0 ness. -definite evaluation (1) in Section 13.5 required positive Since the class of control rules corresponding to the feasible G includes the optimal stationa ry rules at any given value of fJ we have
Theorem 17.1.2 The breakdown value 8 of() in the stationa ry case can be identified with infGBG, where the infinum is over feasible values of G. Equivalently, (3) -e-l = inf al{G). G
The quantity cr( G), the non-negative square root of c?( G), is also a norm on G. It is known as the Hoo norm of G, otherwise written
IIGJioo = cr(G). ry The conclusion of Theorem 17.1.2 can then be expressed: an optimal stationa s minimise it that in , -optimal H also is 8 00 point wn breakdo LEQG control rule at the . function transfer the HX! norm ofthe effective This is the essential implication of the Glover-Doyle paper, and was found particularly striking because the Hoo criterion had attracte d intense interest since about 1981, for reasons we shall explain in the next two sections. The realisation that it is so closely related to the LEQG criterion, whose investigation had proceeded completely independently, marked a real breakthrough in understanding. We can make the immedi ate assertio n that the Hoo problem is solved for the . state-structured problem of the last chapter by the LEQG-o ptimal policies we deduced in Sections 16.4-16.7, applied at the breakdown point B = 8. However, should make the connect ion more explicit. If the actual white-noise input has covariance matrix m and if the cost replaced by attach~d to an output deviation !::,. is !::,. Tmt::,. then the form GG is evaluation affect not does which mGmG, to within a similarity transfor mation be wcould 9\ and m both of eigenvalues, and so of the Hoo norm. Indeed, n of inclusio and noise te dependent, correspo nding to allowance of non-whi system the r lagged terms in the quadrat ic cost function. Conside
!
dx+fl u
=f
y+~X='17
u = Jf"y
(4)
1 TH E Hoo LIMIT
323
matrix (123) an d the the white with covariance are uts inp ise no the where rate-interpretations in the is given by (12.4) (with ion nct for t-fu cos us eo tan ins tan pu re regulation pro ble m are thus considering a We e). va cas ser e ob , im on s-t ati ou nu equ conti pectively pla nt res te itu nst co (4) of s which the three relation s the solution e. Suppose system (5) ha rul l tro con d an on ati rel tio n erator :Y{ to mi nim ise the tha t which chooses the op is e rul o Bo al tim op the Th en er w) of the ma tri x al over eigenvalues an d ov im ax (m e alu env eig al ma xim G( -iw)T. G(iw) [ [ the rules de du ce d in state-structured case by the in ved sol is m ble breakdown value 8 = 9. Th is pro stationary lim it an d at the the in en e tak , 6.7 4-1 16. s Section linear Markov for the cas ptimal control rules are o-o Bo ter the lat t se tha n the the of s ns It follow uivalent versio ion, an d are certainty-eq ate 1 of pe rfe ct state observat the mi nim al stress est im is, at servation. Th ob te sta ct rfe nic al pe no im 'ca of e the , in the cas However tuted for x 1 in the rule. sti sub is ) see ate m, im for est ed ear tur te-struc at 9 (a lin tpu t form rat he r tha n sta ou utinp in is d en ite giv plo en ex oft e we hav model' is dynamical str uc tur e which the t tha s an me ich wh Section 3, largely lost. is a useful one, since it co nc ep t of B 00-optimality the er eth wh er nd wo y On e ma = 9 at which LEQGyty at the very po int (} ali tim -op QG LE to s 1 an d kn ow alread correspond by example in Exercise see ll sha we As ls. fai n d Kb ec om e inf ini te at optimisatio unstable the n bo th II an is nt pla the if , 9.1 16. m these quantities, bu t fro mT he ore n finite solutions exist for the ble sta is nt pla the If this point. perturbations. in fact unstable to positive its usefulness in the the determination of II is no rm has rat he r be en o Bo the of int po the xt two sections. However, which we develop in the ne rty pe pro a s, nes ust rob an B 2 criterion, tre atm en t of uld be sai d to be ba sed on wo e rul G LQ ry na tio sta Th e optimal . adratic no rm Jtr( GG) dw in tha t it minimises the qu
~ ~]
~
t]
x
Exercises and comments rule in the simplest case: nation of the B 00 -op tim al mi ter de ect dir the r de continuous-t im e case. (1) Consi tion to zero in the scalar ula reg n tio va ser ob ctrfe tha t of pe normalised to zero. If uctured mo de l with S str testa al usu the e the transfer function We ass um ising policy u = Kx the n bil sta a ts op ad e on d an N = wwT pa ir (x, u) is nt noise to the deviation from the normalised pla
G(s) =
[~] s~r
324 where r
THE Hoo FORMULATION
= A + BK. It follows that (with s = I
I
+
OHHI = 1 BN(R + QK2)
+
and so that (}-
iw)
= - [·1UKf sup w
w2
+r2
BN(R + QK2 )] -I ..? rz w-+
The maximising value of w is zero, so the H 00 -optimal value of K minimises (R + QK2 )/(A + BK) 2 subject to stability, i.e. to A+ BK < 0. The freely minimising value is at K = BR /AQ and is acceptable if A< 0; we have then -ON = B2 / Q + A 2 j R. If A ;;;:: 0 the n the minimising value of K (subject to stability) is K infinite and of opp osite sign to B, corresponding to -O N= B2/ Q. We thus confirm the identity of the H00 -optimal rule with the LEQG-optimal rule at the breakdown point, set out in Theorem 16.9.1. 2 CHARACTERISTICS OF TH E Hoo NORM The H 00 nor m was originally intr oduced for considerations which are not at all stochastic in nature, let alone LEQ G. In order to und ers tan d these we should establish some properties of the norm.
Theorem 17.2.1 Consider the case in which G is a constant matrix, so that II Gil~ is the maximal eigenvalue of GGT. We then have the characterisations 2 IIGII co
_ -
jG8j 2 _ tr(GTMG) 2 - sup (M ) o 181 M tr
sup
(5)
where the suprema are respectively over non-zero vectors 8 and non-zer o non-negative definite matrices M ofappropriate dimension. Proof The second expression in (5) is sup6((8T GT G8)/(8T 8)] which is indeed the· maximal eigenvalue of GT G and so also of GGT. If M has spe ctral representation M = L_1 >.AoJ, where the eigenvalues >..1 are necessarily non-negative, then
tr(GTMG) _ L_>..1 IG8/ tr(M ) - L_.>.1 18i which plainly has the same sha rp upp er bou nd as does the sec ond expression 0 One might express the first charac terisation verbally as: II Gil~ is the maximal 'energy' amplification that G can achieve when applied to a vector. Consider now in~
2 CHARACTERISTICS OF THE Hoo NORM
325
response function of a fllter with the dynamic case, when G( iw) is the frequency action G(!'J).
ion G(s) Theorem 17.2.2 In the case ofafilter with transfer funct IIGII2 = su EIG(9))812 p £1812 00
nary vector processes {8(t)} of where the supremum is over the class of statio 2 appropriate dimension for which 0 < Ej8j < oo. distribution matrix F(w). Then Proof Suppose that the 6-process has spectral we can write
EjG(!'J)oj 2 _ ftr(G dF G) Jtr(d F) Ejoj 2 -
definite matrix. We thus see from But the increment dF = dF(w) is a non-negative rem that the sharp uppe r boun d the second characterisation of the previous theo F) is the supremum over w of the to this last expression (under variations of The boun d is attained when 6( t) maximal eigenvalue of GG, which is just II Gil;,. 0 vector amplitude. is a pure sinusoid of appropriate frequency and previous theorem: II Gil;, can be We thus have the dynamic extension of the tion that the filter G can achieve characterised as the maximal 'power' amplifica for a stationary input. on of this last theorem which we shall not Finall~ there is a deterministic versi then give a finite-dimensional prove. We quote only the continuous-time case, in Exercise 2. Suppose that the proo f in Exercise 1 and a counterexample class, in that is is causal as well as response function G(s) belongs to the Hardy closed right half of the complex stable. This implies that G(s) is analytic in the eigenvalue of G(s) G(sl (over plane and (not trivially) that the maximal tive real part) is attained for a value eigenvalues and over complex s with non-nega s = iw on the imaginary axis. One can assert:
class then Theorem 17.2.3 Ifthefilter G belongs to the Hardy IG(!'J)8j2dt II GII2 = su J p J j6j2df I 00
6( t) ofappropriate dimension. where the supremum is over non-zero vector signals in that 6 is a deterministic This differs from the situation of Theorem 17.2.2 y stochastic signal of finite power. signal of finite energy rathe r than a stationar expectation. The assertion is that An integration over time replaces the statistical
THE Hoo FORMULATION
326
IIGII~ is the maximal amplification of 'total energy' that G can achieve when applied to a signal. Since we wish G to be 'small' in the control context, it is apparent from the above that adoption of the HXJ criterion amounts to design for protection against the 'worst case~ This is consistent with the fact that LEQG design is increasingly pessimistic as edecreases, and reaches blackest pessimism at e = 0. The contrast is with the H 2 or risk-neutral case, e = e, when one designs with the average case in mind. However, the importance of the Hoo criterion over the past few years has derived, not from its minimax character as such, but rather from its suitability for the analysis of robustness of design. This suitability stems from the property, easily established from the characterisations of the last two theorems, that
(6) Exercises and comments (1) The output from a discrete-time filter with input 81 and transient response g1 is y 1 = ~rgrDt-r· We wish to find a sequence ~81 } whose 'energy' is amplified maximally by the filter, in that (~ 1 1Ytl 2 / ~~ 181 1 is maximal. Consider the SISO case, and suppose time periodic in that gr has period m and all the sums over time are restricted to m consecutive values. Show (by appeal to Theorem 17.2.1 and Exercise 13.2.1, if desired) that the energy ratio is maximised for 81 = eiw1 with w some multiple of 2rr/m, and that the value of the ratio is then IG(iw)l 2 = I ~Tgre-iwrl2· (2) Consider the realisable SISO continuous-time filter with transient response 1 . Then II Gil;, is just the maximal eat and so transfer function G(s) = (s2 value of IG(iw)l , which is a- 2. Consider a signal 8(t) = e-f3t for t ~ 0, zero otherwise, where (3 is positive. Show that the ratio of integrated squared output to integrated squared input is
ar
2(3(a +
f3r2loo
(e2at- 2e(a-f3)t
+ e-2f3t) dt.
If a< 0 (so that the filter is causal) then this reduces to (a2- a(J)- 1 which is indeed less than a- 2 . If a > 0 (so that the filter is not causal) then the expression is infinite.
3 THE Hoo CRITERION AND ROBUSTNESS Suppose that a control rule is designed for a particular plant, and so presumably behaves well for that plant (in that, primarily, it stabilises adherence to set points or command signals). The rule is robust if it continues to behave well even if the actual plant deviates somewhat from that specified. The concept of robustness
327
N AN D ROBUSTNESS 3 TH E Hoo CR ITE RIO
G
u w K
+
of G(s) corresponding to a ofequation ( 7). lfa pole em syst the of m not be comple· gra dia ck Figure I A blo the controlled system will ed by a zero ofK(s), then plant instability is cancell tely stable.
may never be kn ow n tha t the plant structure t fac the of t un co ac as in statistics an d oth er thus takes time. In control theory, in ry va eed ind y ma d for optimality mu st be exactly, an grown that a co nc ern s ha ion ict nv co the , both qualities mu st be subjects, robustness. Furthermore for ern nc co a by ted en between goals which are complem the right compromise ch rea to is e on if ied quantif nflicting. ou tpu t necessarily somewhat co 1, designed to make plant ure Fig of tem sys the r For an example, conside nsfer functions of pla nt Here G an d K are the tra w. l na sig d an mm co a ed from observed pla nt v follow nt ou tpu t v is distinguish pla l ua act alent d an er, oll ntr an d co e block diagram is equiv is observation noise. Th TJ ere wh J, r ' + v = y t outpu (7) to the relations u = K( w- v- TJ) v= Gu , pressi whence we deduce the ex
on
(8) 1 - GKTJ) v- w =( I+ GK )- (w tem. Fo r stable of the inputs to the sys ms ter in w v or err have a stable causal for the tracking erator I+ GK should op the t tha e uir req all gain theorem operation we holds; the so-called sm ity bil sta t tha e um ass inverse. Let us t IIGKIIoo < l. ndition ensuring this is tha or to co mm an d asserts tha t a sufficient co tra functions of cking err nse po res the t tha (8) We see from vely vation noise are respecti signal an d (negative) obser
norm. They ca nn ot all, in some appropriate sm be to se the of ty of the th bo e One would lik is known as the sensitivi S se 1 + S2 =I . S 1 cau be , ver we ho , bo th be small
328
THE
n)O FORMULATION
system; its norm measures perform ance in that it measures the relative tracking error. S2 is known as the complementary sensitivity, and actually provides a measure· of robustness (or, rather, of lack of robustness) in that it measures the sensitivity of perform ance to a change in plant specification. This is plausible, in that noise-corruption of plant output is a kind of plant perturbation. Howeve r, for an explicit demonstration, suppose the plant operato r perturb ed from G to G + 8G. We see from the correspondingly perturb ed version of equations (7) that
v =(I+ GKr 1GK(w- TJ) +(I+ GKr 1(8G)K( w- v- TJ). The perturb ed system will remain stable if the operator I+ (I+ GK)- 1(8G)K acting on v has a stable causal inverse. It follows from another application of the small gain theorem that this continued stability will hold if
i.e. if the relative perturba tion in plant is less than the reciprocal of the complementary sensitivity, in the Hoo norm. Actually, one should take of the expected scale and dynamics of the inputs to the system. This isaccount achieved by setting w = W1 w and TJ = Wtii, say, where W 1 and W2 are prescribed filters. In the statistical LEQG approach one would regard wand ij as standard vector white noise variables. In the worst-case deterministic approach one would generate the class of typical inputs by letting w and ij vary in the class of signals of unit total energy. Performance and robustness are then measured by the smallness of IIS1 W1lloo and IIS2 W2lloo respectively. Specifically, the upper bound on I!G- 18GIIoo which ensures continued stability is
IIS2W211~1 .
Of course, in a simple minimisation of quadratic costs a compromise will be struck between 'minimisation' of the two operators sl and s2 in some norm. There will be a quadratic term in the tracking error v - w in the cost function , and this will lead to an expression in H 2 norms of S1 and S2. The more observation noise there is, the greater the consideration given to decreasing S 2, so assuring increased robustness. The advantages of an Hoo formulation were demonstrated first in Zames (1981), who began an analysis which has since been brought to a high technical level by Doyle, Francis and Glover, among others; see Francis (1987) and and Doyle, Francis and Tannenbaum (1992). The standard formulation of the problem is the system formulation of equations (6.44)/ (6.45), expressed in the block diagram of Figure 4.4. The design problem is phrased as the choice of K to minimise the response of A to ( in the Hoo norm, subject to the condition that 'K should stabilise the system', i.e. that all system outputs should be stabilised against all system inputs. The analysis of this formulation has generated a large specialist literature. We shall simply list a number of observations.
-~- ._-:• .•.·
:1
ROBUSTNESS 3 THE H 00 CRI TER ION AND
329
man y n to the LEQ G criterion means that (i) The relation of the H 00 criterio d LE QG tion in the now well-establishe Hoo problems already have a solu tion noise The need to take account of observa extension of classical LQ G ideas. criteria, s derived on either LQ G or LEQG ensures a degree of robustness in rule es. ht not have bee n apparent before Zam although this is an insight which mig utp ut ut-o inp in ed stat are in this section (ii) The two problems formulated rath er than simply by its response function G form, in tha t the plant is specified of attempts ber num a n bee e mple. The re hav by a state-structured model, for exa ework, s directly, usually in the LQ G fram over the years to attack such problem dra tic qua a ses imi min ch filter K in (7) whi by, for example, simply seeking the eta!. t Hol 7), (195 ser Kai Newton, Gou ld and (See ~). T9t E(~ as h suc n erio crit e a cer tain rno and Jabr (1976)). One can mak (1960),Whittle (1963),Youla, Bongio roach can app n d-o hea a ed in Section 6.7, amount of progress, but, as we not almost aled reve are ch s two insights whi prove perplexing and tends to mis se are The . blem pro ary lysis of the non-station automatically in a state-space ana and ion mat esti of ects separation of the asp s; (i) certainty-equivalence, with its able vari te juga con of of the introduction control, and (ii) the usefulness the by ted stitu con with the constraints Lagrange multipliers associated ies the ation of these properties simplif loit Exp prescribed dynamic relations. vector case. analysis radically, especially in the without of optimal stationary controls We consider the determination hod s met al tegr e-in pters 18-21, but using tim reduction to state structure in Cha t is wha y isel prec by their natu re and yield which incorporate these insights needed. tegral niques associated with these time-in (iii) The operator factorisation tech ck atta ct dire a in Wie ner -Ho pf methods used methods are distinct from bot h the ded oun exp es mat rix /-factorisation techniqu on the inp ut-o utp ut model and the of the Hoo a and Glover (1990) for solution by e.g. Vidyasagar (1985) and Mustaf problem. ility e nor m will in fact assure overall stab (iv) Simple minimisation of~ in som s of sure mea sically achievable) if Ll includes of the system (provided this is phy eria crit the example, the inclusion of u itself in relevant signals in the system. For ility or avoids the possibility that good stab used through the whole of this text , and ther Fur expense of infinite control forces. tracking could be achieved at the of ree e deg on will automatically achieve som as we have seen, such an optimisati assumed in the model. robustness if observation noise is of an mate guarantee of robustness is use Finally, one might say tha t the ulti n we can tion which opens wider vistas tha adaptive control rule; an observa explore.
I"
PART 4
d Time-integral Methods acnies Optimal Stationary Poli al stationary policy termination of the optim de t ec dir the to ted vo eneous, bu t no t This Pa rt is de LEQG an d time-homog or G LQ ed os pp su is that th e subfor a model which is an interesting one in me the e Th . ed tur uc um principle, necessarily state-str uivalence, the maxim eq y int rta ce ls, gra coalesce themes of time-inte d policy improvement an n tio isa tor fac al nic no eady applied to the transform methods, ca velopment of those alr de a are ds tho me e ralised in Section 6.5. naturally. Th in Section 6.3 and gene se ca ed tur uc str testa uld omit this Part deterministic state-structured case co the ly on r ide ns co to nt for enlightenThe reader conte h passing up a chance ug tho (al s low fol at wh without prejudice to w!). timal ment, in the author's vie deriving a stationary op of t tha is red ide ns co neral neither stateThe problem flrst to be eous LQ G model, in ge en og om e-h tim a for y ally assume the control polic Ou r methods intrinsic . ble va ser ob y ctl rfe pe derived as the structured no r t the optimal policy is tha so , its lim on riz ho on policy. existence of infinitean optimal finite-horiz of ry) na tio sta t fac (in i.e. of infinite-horizon limit of being average-optimal: y ert op pr the ve ha y inl stationary regime. It Such a policy will certa urred pe r un it time in the inc st co ge era av the on to transients, minimising erty of optimising reacti op pr r ge on str the ve ha is such that it will in fact also t have unless plant noise no y ma y lic po al tim which an average-op of the system. policy is can stimulate all modes in formula (13.24), and st co ge era av the for :ft" which one uses to We have an expression lisable linear operator rea t an ari nv e-i tim the e might regard the specified by current observables. On of ms ter in u ol ntr isation of this express the co matter of direct minim a as ply sim m ble optimisation pro
332
TIME-INTEGRAL METHODS
expression with respect to :It. This was the approach taken by a number of auth ors in the early days of control optimisati on (see Newton, Gould and Kaiser (1957), Holt et al. (1960), Whittle (1963)), whe n it seemed a natural developmen t of the techniques employed by Wiener for optimal prediction. However, while the approach yields results rather easily in the case of scalar variables, in the vector case one is led to equations which seem neither tractable nor transparent. In fact, by attacking the problem in this bull-like fashion one is forgoing all the insights of certainty equivalence, the Kalman filter etc. One could could argue that, if these insights had not already been gained, they should be revealed in any natural approach. If so, then this is not one. In fact, the seeming simplif ication of narrowing the problem down to one of average-optimisation blinds one to an even more direct approach. This is an approach which is familia r in the deterministic case and which turns out to be available even in the stochas tic case: the extremisation of a time-integral. We use this term in a rather specific technical sense; by a time-integral we mean a sum or integral over time of a fun ction of current variables of the mo del in which expectations are absent and which is such that the optimal valu es of decisions and estimates can be obt ained by a free and unconstrained extremisation of the integral. In earlier publications on this topi c the author has referred to these as 'pathintegrals', but this is a usage inconsis tent with the quantum-theoretic use of the term. Strictly speaking, a path-int egral is an integral over paths (i.e. an expectation over the many paths whi ch are possible) whereas a time-integr al is an integral along a path. The fact which makes substantial progress possible is that a path-integral can often be express ed as an extremum over time-integr als. For example, we we saw in Chapter 16 that the expectation (i.e. the path-int egral) E[exp( -OC)J could be expressed as an extremum of the str ess §= C + 1 o- [). If one clears matrix inverses from the stress by Legendre transformations (i.e. by introducing Lagrange multipliers to take account of the contraints of pla nt and observation equations) then one has the expectation exactly in the form of the extremum of a time-integral. , It is this reduction which we have exploited in Chapters 6 and 16, sha ll exploit for a general class of LQG and LEQ G models in this Part, and shall exte nd (under scaling assumptions) to the non -LQG case in Part 5. We saw in Section 6.3 that the state-st ructured LQ problem could be conver ted to a time-integral formulation by the introduction of Lagrange multiplier s, and that the powerful technique of can onical factorisation then determi ned the optimal stationary policy almost imm ediatel)t We saw in Sections 6.5 and 6.6 that these techniques extended directly to models which were not state-st ructured. These solutions extend to the stoc hastic and imperfectly observed case by the simple substitution of estimates for unobservables, justified by the cert aintyequivalence principle. We shall see in Chapter 20 that time-integral tech niq ues also take care of the estimation pro blem (the familiar control/estimation duality
TIME-INTEGRAL METHODS
333
methods extend to the finding perfect expression) and, in Chapt er 21, that these to the non-LQG case ion extens LEQG model. All these results are exact, but the ion of the pathximat appro of Part 5 is approximate in the same sense as is the n integrals) of (actio integrals of quant um mechanics by the time-integrals ximation at sensitive classical mechanics, with refmement to a higher-order appro parts of the trajectory. al formalism first, There is a case, then, for developing the general time- integr al pattern, unclu ttered which we do in Chap ter 18. In this way one sees the gener ation to control and applic the in arise by the special features which necessarily estimation. models are no longe r There is one point which should be made. Although our input/ outpu t form. in given state-structured, they are not so general that they are state-structured the in Rather, we assume plant and observation equations, as p. The loss value some case, but allow variables in them to occur to any lag up to mode l to the reduces of an explicit dynamic relationship which occurs when one n 6.7. input/ outpu t form has severe consequences, as we saw in Sectio r's previous work autho the of n versio lined stream This Part gives a somewhat ver, the mater ial of on time-integral methods as set out in Whittle (1990a). Howe edge, new. There knowl r's autho the of Sections 20.3-20.6 and 21.3 is, to the best r would be autho the and er, howev must be points of conta ct in the literature, grateful for notice ofthese.
l
CHAPTER18
The Time-integral Formalism CRETE TIME 1 QUADRATIC INTEGRALS IN DIS the term 'time-integral' even in discrete For uniformity we shall continue to use then a sum. Consider the integral in the time, despite the fact that the 'integral' is variable~
(1) is then a sum over time of a quad ratic with prescribed coefficients G and (. This the function being time-invariant in its function of the vector sequence {~T }, Sections 6.3 and 6.5 ~ is the vector with second-degree part. So, for the models of ying term (T would arise from kno wn sub-vectors x, u and >., and the time-var als r, uc. The matrix coefficients Gj and disturbances d or known com man d sign but no generality is lost by imposition of the vector coefficients (T are specified, the normalisation
(2) then the 'end terms' arise arise from If the sum in (1) runs over h1 < r < h2, h2 respectively. The final term can be contributions at times r ~ h1 and r ?: and the initial term as arising from a regarded as arising from a terminal cost itions. probabilistic specification of initial cond such that we wish to extremise the Suppose the optimisation problem is variable {~T}. We cann ot at the mom ent integral with respect to the course of the see by considering again the models of be more specific than 'extremise', as we respect to x and u and maximised with Cha pter 6, for which one minimised with condition with respect to ~T is easily seen respect to A. In any case, the stationarity to be
(3) and h2 that neither end term involves if r is sufficiently remote from both h1 where
IP(!/) :=
L j
Gj§-J.
~n
(4)
336
THE TIME-INTEGRAL FORMALISM
The normalisation (2) has the implication =(z- 1 onz-transfor ms. Suppose that we regard our optimisation as a 'forward' optimisation in that, if t is the current instant of time, then ~Tis already determined forT< t, and we can optimise only for r ~ t. We shall then write equation (3) rather as
r
(r
~
t)
(6) to emphasise this dependence on t. That is, relation (6) determ ines the optimal course of the process~ from timet onwards for a prescribed past (at timet). This was exactly the situation which prevailed for the control optimi sation of Sections 6.3 and 6.5. If the horizon h2 becomes infinite then we may indeed deman d that (6) should hold for all r ~ t and, if the sequen ce {(T} and the terminal term are sufficiently well-behaved, expect that the semi-i nversion ofthe system (6) analogous to equation (6.27) should be valid. For the estimation problems of Chapter 20 we shall see that the same ideas apply, but with the optimisation applied to the past rather than the future (cf. Sectio n 12.9). To consider the general formulation (1) is then already a simplificatio n, in that we see that these two cases can be unified, and that the operat or
If
(z) = ¢(z)
(7) where both ¢(z) and ¢(zr 1 have expansions in non-negative power z valid in ·lzi ~ 1. We have interposed a constant matrix factor o (necessarilys of symmetric) for generality. It could be normalised to the identity by redefin ition of ¢ (see Exercise 1), but we shall not find this to be the natural norma lisation. If we assume 'dynamics of order p' in that G1 = 0 for ljl > p then it will turn out that ¢( z) is a polynomial in z of degree p at most. As in Chapt er 6, we can semi-invert system (6) to obtain a closer determination of the optimal future course off Theorem 18.1.1 Suppose that
¢(.r){~) =
(r ~ t)
(8)
IN THE 'MARKOV' CASE 2 FACTORISATION AND RECURSION
337
---1
the subscript T and that ¢( f/) is to be (r;ith the understandings that f7 operates on right-hand member of (8) is defined. expanded in non-positive powers of ff) if the ys as p7 for increasing T, where pis less This latter condition will be satisfied if deca than the radius ofconvergence of¢( z)- .
fr
recursion (6), symmetric in past and The passage from (6) to (8) converts the recursion, expressing ~~r) for T ~ t in future in the sense (5), to a stable forward t-hand side of (8) constitutes a kno wn terms of past values ~~l(a < r); the righ d complete the inversion to obta in a driving term for this recursion. One coul is positively undesirable. Relation (8) complete determination, but to do so known ~r for t - p :::;; T < t) and, as we determines ~?) explicitly (in terms of the pter 6, this determination yields the saw for the control application of Cha is both closed-loop and natural for realoptimal control rule in the form which time realisation. orizon limit h2 -> oo for system (6) Of course, if (8) is to hold in the infinite-h The analogue of the controllability/ then cert ain conditions must be satisfied. , which were the basic ones for the sensitivity conditions of Theorem 6.1.3 ly that a canonical factorisation (7) existence of infmite-horizon limits, is simp also be required of the term inal should exist. Regularity conditions will of the path integral (1). contributions implied in the specification ite-horizon solution (8) is formally If such conditions are granted, then the infin e we must find a workable way of immediate. However, to make it workabl We shall achieve this by returning to the deriving the canonical factorisation (7). g and relating the canonical factor ¢ to recursive ideas of dynamic programmin need never be determined; it is enough the value function. This value function e of policy improvement implies a that the dynamic programming techniqu ation of the canonical factorisation. rapidly convergent algorithm for determin rkov' case, for which dynamics are of Let us begin with consideration of the 'Ma order one and recursions are simple.
Exercises and comments ld be invertible in stable form is that it (1) A necessary condition that 4>(z) shou e, which implies that all three factors should be non-singular for z on the unit circl icient condition for canonical factorisin (7) should be non-singular there. A suff tive definite on the unit circle. This latter ability is that 4>(z) should be strictly posi analysis, however, because the quadratic condition will in general not hold in our ed convex/concave character. form constituted by Dhas in general a mix IN THE 'MARKOV' CASE 2 FACTORISATION AND RECURSION (cross-terms) appear in the time-integral If G1 = 0 for Ul > p, so that interactions that we are dealing with pth- orde r only at time lags up to p, we shall say
338
THE TIME-INTEGRAL FORMALISM
dynamics. We shall then refer to the case p = 1 as the Markov or state-structured case. In the Markov case the time-integral can be written U=
L
Cr
+ end terms,
T
where
(9) Consider now a recursive extremisation of the integral, with cr regarded as an 'instantaneous cost', and with the closing cost at the horizon point h assumed to 2 be a quadratic function of ~h2 alone. If we are interested in a forward optimisation then we can define a 'value function' F(~1 , t), this being the extremised 'cost' from timet onwards for given ~1 . This will obey the dynamic program ming equation F(~t,
t) =stat [c1 + F(~t+l, t + 1)]. (t+l
(10)
where 'stat' signifies evaluation of the bracket at a value of ~t+l which renders it stationary. As a dynamic program ming equation this has the simplifying feature that one optimises with respect to the whole vector argume nt ~ rather than just some part of it. In the case under conside ration the value function will have the quadratic form
(11) for t ~ h2 if it does so at hz, with the coefficients (respectively matrix, vector and scalar) obeying the backward recursions lit
= Go - G_tii;;1Gt
at= (t- G_tii;;1at+I
(12)
81 = 81+1- a~ 1 1IH\at+1·
The first of these is just the Riccati recursion; strikingly simpler in this general formulation than it was when written out for the state-str uctured case of the control problem in equations (2.25) I (2.26). The second relation likewise corresponds to the equation before (2.65). The third gives the recursion for the ~ indepen dent compon ent of cost incurred which we thought too troublesome to derive in Section 2.9. The extremising value of ~t+ 1 in (10) is determined by
(13) Suppose that we are in the infinite horizon limit h 2 ~ +oo, and that the problem is well-behaved in that II 1 has, for fixed t, a limit II. It is under these conditio ns that one expects the forward recursion (13) for ~~ to be stable in that its solution
URSION 3 VALUE FUN CTI ONS AND REC
S
339
a stable t. In othe r words, that II + G1 ff is should tend to zero with increasing to the one s lead s Thi 1 a stability matrix. operator, or that r = - II- G1 is l to ona orti prop be ht mig factor >(z) of
(14)
onical from (13) that this is indeed the can is a factorisation of
(,. =
p-1
Cr
e;_ jGp -jer -r = !{;_p+l Go{r-p+l + L j=O
(15)
Ifwe define the value function
(16) then this will obey the optimality (dyn
amic programming) equation
[c, + F({t+l, e~> F({ t,et- 1, ... , {t-p+l; t) = stat (,+I
... ,et-p+2; t + 1)].
(17)
endence purely quadratic with a limited dep We suppose that the term inal cost is F will ord er dynamics then implies that on the past. The assumption of pthfact be ables indicated in (16), and it will in inde ed be a function of just the vari two) in thee-variables: homogeneous quadratic (i.e. of degree
340
THE TIME-INTEGRAL FORMALISM
say. The 'optimal polic y' will then be of a p-lag linea r form in that the extremal criterion in (17) will deter mine ~t+ 1linearly in term s of the p previous values of~: p
I>l! k( t + 1)~t-k+ 1 = k=O
o
(19)
say, with a:0 necessarily symmetric. Our expectation is that the coefficients a:k(t + 1) will beco me inde pend ent oft in the infin
ite-horizon limit (if this exists) and that (19) will then be a stable forw ard recu rsion which, written ¢(ff)~t+l = 0, defines the cano nical factor ¢(z). The argu men t follows. If we substitute the form (19) into the optimality equa tion (17) we obta in a 'superRiccati' equa tion for the p(p + 1) /2 matr ices IT1k which is most forbidding. The relations beco me muc h more com pact in term s of the generating functions
F(w,z; t) =
_L _LII;k(t)wiz'< j
k p
p
}=I
}=I
L G1z1 + L G_1wi
¢(z; t + 1) = F(O, z; t + 1) + GpzP
(20)
(21) (22)
p
a:(z;t+ 1) = _La :k(t+ 1)!'
(23)
k=O
in the complex scalars z and w. We shall refer to F ( w, z : t) as the value function transform. We shall now find it convenient to suppose that F is not necessarily the value function corre spon ding to an optimal polic y, but that corre spon ding to an arbit rary p-lag linea r policy of the form (19). It will then still have the homogeneous quad ratic form (18).
Theorem 18.3.1 (i) The value function transform under the policy (19) obeys the backward recursion (dynamic programming equa tion)
F(w, z; t) = (wz f 1 [(wz)P
t + 1.
{24)
carry the additional argument
RISATION RELATIONSHIP 4 A CENT RAL RESULT: THE RICCATIIFACTO
(ii) The optimal choice ofa is a(z; t + 1) = cjJ(z; t + 1) = F(O, z; t + 1) + GpzP With this choice (24) reduces to the optimality equation 1 ¢(w) T¢01¢(z)]t+ 1. F(w,z; t) = (wz)- 1 [(wz)P(z- 1 , w- ) + F(w ,z)-
341
(25)
(26)
in Whittle (1990a) p. 164. The proo f is by heavy verification; the detail is given the right -han d member of (24) One can see that the terms independent of a in (18) and the translation of the!originate from the presence of the cost term c1 in nate from the fact that ~t+l is subscripts of the ~-arguments. The terms in a origi expressed in terms of earlier ~-values by (19). in (26) we have an elegant Although derivation may be heavy, we see that an elegant expression of what expression of the optimality equation, and in (25) term s of the value function in ¢ r will soon be identified as a canonical facto generating furnction (21), le doub transform F Note the need to introduce the related to (z) by
(27)
(z) = (z- 1,z).
TORISATION 4 A CEN TRA L RESULT: THE RICCATI/FAC REL ATIO NSH IP unde r an optimal policy then If we assume the existence of infinite-horizon limits ler. simp the assertions of Theorem 18.3.1 beco me even
al Theorem 18.4.1 (i) In the infinite-horizon limit the optim has the evaluation
in terms of
cjJ(z) = F(O,z)
+ GpzP.
(ii) The equation ¢(5")~.,.
= 0
(r
~
t)
value function transform
(29)
(30)
holds along the optimalpath. The assertions also begin to have a very clear signi (28) we obtain
ficance. If we set w = z- 1 in
since the forward recur This is nothing but a canonical factorisation of (z), (30) is stable. We thus deduce
(31) sion
342
THE TIME-INTEGRAL FORMALISM
Theorem 18.4.2 (i) If infinite-horizon limit s exist then \P has a canonicalfactorisation (7) with cp(z) apth -ord er polynomial. (ii) This factorisation can be given the spec ial form (31), with cp(z) having identification (29). It then has the specialfeatu res (32) ¢o = Iloo is symmetric o
(33)
= ¢o 1.
(34) These two theorems express the close and elegant relation between the infinitehorizon value function transform F(w, z) and the canonical factor cp(z). The double generating function F ( w, z) is expr essed in terms of the single generating function cp(z) by (28), and ¢ is expressed in terms of Fby (29). The first relation implies the factorisation characterisation (31). Note that we find the fact or¢ with the part icular characteristics (32) and (33) by a very direct route: we simply write dow n the stationarity condition with respect to ~t+I in the optimality equation (17). This yields a linear relation ¢(5" , t + 1) ~t+l = 0 determined by (22) who se coefficients have the properties (32) and (33); these coefficients become the coefficients of¢ (z) in the infinite-horizon limit. We must regard these properties as constitut ing a natural normalisation; a view which will be confirmed.
5 POLICY IMPROVEMENT AND SUC CESSIVE APPROXIMATION TO THE CANONICAL FACTORISATION The analysis of the last two sections is imp orta nt enough in itself. However, it also proves to be crucial for the dedu ction of a good iterative algorithm for the determination of the canonical factorisation. Essentially, the policy improvement algorithm turns out to imp ly a simple iterative algorithm for the determination of ¢ which makes no refer ence to the value function. The policy , improvement algorithm has itself a clea r variational characterisation and shows second-order convergence (see Section 3.5); these properties will transfer to the ¢;- version. We shall again suppose pth- orde r dyna mics. Let the stage of iteration be labelled by i = 0, 1, 2, ... and let p
¢(i)(z)
= L r/Jijzj j=O
denote the approximation to cp(z) at stag e i. Consider the procedure in which is determined from rP(i) by the linear equa tion system u+IJ¢if/¢(iJ + ¢uJ¢ir/¢u+t) = + ¢uJ¢ ir/¢i+I,o¢ir/rP(iJ· (35)
rP(i+i)
·'i
N 5 POLICY IMPROVEMENT AND SUCCESSIVE APPROXIMATIO
343
Theorem18.5.1 Recursion (35) has thefollowingproperties. m to (i) It is the recursion generated by application ofthe Newton-Raphson algorith the factorisation (31), regarded as an equation for¢. (ii) It is also the recursion which would be derived by application of the policyimprovement algorithm. is a (iii) It conserves the normalisation properties (32) and (33) in that, if ¢(i) (z) and the polynomial in z of degree p for which the absolute term ¢i0 is symmetric coefficient ¢ip ofzP equals GP' then the same is true of¢(i+l)(z). uProof Consid er first assertion (i). For notatio nal simplicity, denote the consec in 6.. by¢+ ing¢ Replac ¢+b.. tive trial solutions ¢(i) and ¢(i+I) by¢ and 7/J = obtain we b. (31) and expand ing as far as first-order terms in
= ¢¢()!¢ + ti¢()1¢ + ¢¢()! Ll + ¢¢()1 Llo¢()1¢.
Setting b. = 7/J - ¢ in this relation we deduce (35). can be Asserti on (iii) is easily confirm ed from relation (35). Assertion (ii) of proof the confirm ed by appeal to relations (24) and (25). More economically, LQ control equivalence of the two algorithms established for the state-st ructure d 0 case. general more this to r transfe problem in Section 3.5 will Proper ty (34) is implicitly conserved under the algorithm, in that we as
define ci>io
,+,-! 'l'iO.
be more Equati on (35) constitutes a linear equatio n for ¢(i+I)• but we can which is form a in ed express specific. The solutio n of this equatio n can in fact be ion expans series the explicit to within application of the operati on of truncat ing ion expans of a function. Suppose we have a functio nf(z) with a power series 00
f(z) =
L Jjz 1 J=-00
obtaine d by valid on the unit circle. Then we shall define [f(z))+ as the function ion: expans the retention only of terms for non-negative j in this 00
[f(z))+ =
Ljjzl . j=O
Theorem 18.5.2
The solution ofequation (35) can be expressed
= ¢,u[¢(;{ci>¢(tn+¢uJ·
Proof We can write relation (35) as ¢ii17/J¢-1 + ¢-Iif;¢()1 = ¢-l
(36)
344
THE TIME-INTEGRAL FORMALISM
in the ¢>, '1/J notati on introd uced earlier. Applying the trunca tion throughout, we deduce (36). It is worth while noting a point which is not immediately eviden t from (36): that the right-hand memb er of this equation is indeed a polynomial of degree pin z. We shall refer to (35) and its more specific form (36) as the PI/NR algor ithman abbreviation for 'policy improvement!Newton-Raphso n'. We shall apply the algorithm to partic ular control examples in the next chapt er and shall see then that it is useful, not merely for nume rical conclusions, but also for structural · ones. Explicitly:
Theorem 18.5.3 Suppose that a property of the normalised trial factor ¢>(i) is conserved under the PIINR algorithm, so that it is shared by ¢>U+lr Then the normalised canonicalfactor¢> has this property. Proof If the policy at stage i is non-optimal, so that ¢(i) is not a canonical factor, · then ¢(i) will be modified by the algorithm. The conservatio n ofa prope rty under the algorithm thus implies that it is shared by some canonical factor. But the normalised canonical factorisation is unique, so the property must be shared by ¢>. 0 6 THE INTEGRAL FORMALISM IN CONTINUOUS TIME The continuous-time analogue of the path-integral (1) will now indee d be an integral over time
6(~) =
j c(~) dT +end effects
(37)
where c(e) is a quadratic function of the vector variable ~(t). We shall assume the specific form
(sr)
ckJ
= cik(rs)
( 39 )
However, this normalisation is not the only possible recast ing of the integral (37). Partial integration yields the effective equality c[r]c[s] _ _ c[r-I]c( s+l] ':.j c.,k
':.J ':.k -
(r > 0),
(40)
IN CONTINUOUS TIME 6 THE INTEGRAL FORMALISM
345
the the substitution will change only we mea n by 'effective equality' that rior inte ts poin e tim at ons ity con diti and so does not affect the optimal range of integration. as far as to write the path-integral (37) We can car ry the reduction (40) so 0
=!
J2::: L I: L cj~s) r~~r+s] J ~ J[~ e~(~)~dr -
(-
(,T dr +en d effects,
or, more compactly,
(T (] dr +en d effects.
fi =
rators wit hjk th elem Here q,(~) is ann x n mat rix of ope
\Pjk(~) = LLc};s)(-~)'EPs.
(41)
ent (42)
s
the integral the ~-path for r ~ t should render Theorem 18.6.1 The condition that ·' · - (37) stationary can be written (43) ~ t) ( 1"
rator \P(.@) is rval of integration. The mat rix ope at time points interior to the inte Hermitian in that (44)
first the n te consequence of (39) and (42); the The sec ond asse rtio n is an imm edia as =(s)
= <J>_(s)+(s).
sym met ric ns tha t we can ind eed look for a The Her mit ian character of mea factorisation (45) (s) = ¢(s)oc/J(s) left hal f of e all thei r singularities strictly in the say, where bot h cjJ(s) and cjJ(s) -I hav nite-horizon -horizon limits exist then the infi the complex s-plane. If infinite to the stab le n by the semi-inversion of (43) solution is again effectively give forward equ atio n
(46)
for the optimising ~. ion of the there is a nat ura l normalisat As in the discrete-time case, from the n atio oris by ded ucti on of this fact factorisation (45), dete rmi ned
346-
TH E TIME-INTEGRAL FOR
MALISM
optimality equation of a recursive approach. We cover these matter s in the next section, which reveals both the analogies to and some difference s from the discrete-time material of Sections 4 and 5. 7 RECURSIONS AND FACTOR ISATIONS IN CONTINUOUS TIME In considering the factorisati on question we can again nor malise to the homogeneous ('regulation to zero') case ( =0. There is a difference between the discrete- and continuous-time cas does not seem more than notatio es which nal, but is annoyingly persistent . It refers to degree of dynamics, and has alre ady dynamics were of order p in the disc arisen in Section 4.7. When we said that rete-time case we really meant tha t the of order p at most. If they were in fact of lesser order in some relation y were s then this did not matter; the only require ment was that a recursion such as (19) should really be a determining forward recursion, in that the coefficient ao of at the current time-argument should be non-singular. The corresponding requirement in the continuous-time case is tha t the matrix coefficient of the differentials of highest order should be non-singul ar, and this order may well be different for diff erent components of €. Suppose that, under the normalisation (39), the differe ntials of ej occur in c( €) up to ord er rj exactly (j = 1, 2, ... , n). Then what one might term a minimal-order line ar policy would determine the forward evolution of eby a set oflinear relations
e
(j= l,2 , ... ,n),
(47)
say, where the coefficients l'i. may be time-dependent. The point is that the optimal policy will lie within this class, and that determination of the optimal relation (47) in the infinite-horizon limit implies a determination of the canonical factor if>(s). Let F( €, t) denote the value functio n under policy (47), starting from a known €history at time t. Although we hav e loosely written this as €-depende nt, in fact it will be dependent on the functio n €(r) through the differentials ej'l(r < rj; j = 1, 2, ... , n) at the current instant t. Under appropriate restriction s on the form of tlie terminal cost it follows then tha t the value function under policy (47) is of the homogeneous quadratic form
F(€,t)
=! I:I:I:I:rrj~s>(t)ej'let1 j
and that Fwill obey the backward
k
$
(48)
equation
c + oF + " " e!r+lJ oF = at ~ L.J ') adrl 0 J r
(49)
~th eYJ) expressed interms oflower-o rder derivatives by (47). Under an policy (49) is replaced by the optima optimal lity equation
7 RECURSIONS AND FACTORISATIONS IN CONTINUOUS TIME
.
stat [c +
0: + 2:2: ~J'+l] 8~1] = o t
a~j
r
j
347
(so)
where the stationarity is with respect to the highest·order derivatives ~[r,] (j = 1, 2, ... , n). " Again we introduce transforms, and define the matrices ofgenerating functions
F(s 1,s2)
=
(:L2:rrJ~1 ''2 )s~1 s;2 ) '2
,,
the t·dependence ofF and II being understood. The bracketed expression is the jkth element of the matrix in question, j and k taking values 1, 2, ... , n if ~ has dimension n. We again refer to F(s1, s2) as the value function transform. We shall have occasion to distinguish the parts of these generating functions associated with the highest· degree terms, and so define
H
= (hJk) = (c)2''k)) d(s)
(51)
= diag(s'j).
Ct (s) =
(L c)'Jt sv) v)
s
v)
) () ( "rr(i-l,v FtS= S. L:jk
Theorem 18.7.1 (i) The optimality equation for the value function transform F = F(st,s2) is 1 T 8F (52) c(st,s2) +at+ (st +s2)F- cf>(s!) H- cf>(s2) = 0, where
cf>(s) = Hd(s)
+ Ct (s) + F1 (s).
(53)
( ii) The optima/future course of~ is determined by the forward equation
(54) This is the analogue of Theorem 18.3.1 (ii), and proved analogously; for further details see Whittle (1990a) pp.l94-5. Note that, as previously, the form (53) of the canonical factor is determined simply from the stationarity condition in the dynamic program ming equation. This gives a forward differential equation for ~.
348
THE TIME-INTEGRAL FORMALISM
necessarily stable if infinite-horizon limits exist, and we write it as¢(£'))~= 0 to deduce (52). The way is now open for our principal conclusion; the analogue of Theorem 18.4.2.
Theorem 18.7.2 Suppose infinite-horizon limits exist for the optimal policy. Then the canonicalfactorisation
(55) holds with H equal to the known matrix (51) and cp(s) equal to the infinite-horizon limit ofexpression (53). This factorisation then satisfies the normalisation that both cf> 01 and the matrix coefficient of the highest-degree terms in cp(s) equal the known matrix H. Thisfollowssimplybythesubstitutionofs1 = -s2 = -sintheeq uilibrium form of (52), and appeal to the fact that (54) must be a stable forward recursion. The continuous-time case thus shows one simplification: the matrix H defined in (51) is known before the factorisation is attempted, while the matrix ¢ 0 of (31) is not.
8 THE PIINR ALGORITHM IN CONTINUOUS TIME The same simplification persists in the policy-improvement algorithm which is, as ever, identical with the Newton-R aphson algorithm applied to relation (45) as an equation in ¢. If¢(;) is the determination ofthe canonical factor at stage i then the recursion analogous to (35) is -
1
-
I
I ¢(i+l)n- ¢uJ + ¢uJH- ¢u+Il =+ ¢uJH¢uJ (56) and the normalisa tion indicated in Theorem 18.7.2 is conserved under the algorithm. The more explicit recursion (36) also has an analogue. Suppose a functionf( s) . is representable for purely imaginary s by the integral
f(s) =
L:
est
dg(t).
We then define the truncation operator [ ]+ in continuous time by
[f(s)]+
=
L:
e'1 dg(t).
Then, just as in Section 5, the solution of (56) can be expressed ¢(i+l) = H[¢~l¢~n+¢(i)·
(57)
The derivation of these results is very much as for the discrete-time case; the argument is amplified in Whittle (1990a), pp.197-8.
CHAPTER 19
: Optimal Stationary LQG Policies Perfect Observation EGRAL TION FROM THE TIME-INT 1 DERIVATION OF THE SOLU
e by pathof the perfect-observation cas nt tme trea the in s step ial The essent this section we up in the short Section 6.5. In integral methods are summed sections we two ing that discussion. In the follow simply recapitulate and expand sions. clu con n us chapter can be used to sharpe see how the results of the previo ume a plant equation In the discrete-time case we ass
(1)
{d } is a kno wn iant causal linear operators, 1 where d and f!J are time-invar ite noise with ances and {ft} is Gaussian wh deterministic sequence of disturb dynamics tha n the trix N. If we assume pth -or der zero mean and covariance ma f!J = B( f/) = s d = A(5") = I:~=O A,5 "' and form the e hav l wil rs rato ope and have malising assumption Ao =I , nor the ke ma can We "'. L:~=I B,5 ermined at umption tha t the control u1 det ass the to g din pon res cor 0, set B0 = ble fact that, if u1 e t + 1. This reflects the inevita time t has an effect first at tim sally dependent cau includes x~> then x 1 can not be may depend upo n dat a which function has the 2.4, that the instantaneous cost We shall suppose, as in Section m time-homogeneous quadratic for
OnUt.
Ct=C(Xt,Ut)=![~r[~ ~][~t
(2)
have assumed that, tf make no appearance here. We and r s nal sig nd ma com The zero, by the they have been normalised to n the e, anc adv in wn kno are if they variables and of as redefined process and control adoption of x - r and u - uc 2.9). If they are tion disturbance variable (cf. Sec d - d r - f!Juc as a redefined model, and so presume them generated in the not known in advance, then we s variable x. subsumed as par t of the proces der lags in the the incorporation of higher-or by c We could generalise 1 to start from the equation. However, it is easier nt pla the in e hav we as les, variab s at a later l generalisation becomes obviou ura nat the (2); n ptio um ass r familia point.
350
OPT IMA L STATIONARY LQG POL ICIES: PER FEC T OBSERVATIO N
The development from now on is exactly as in Sec tion 6.3. By the cert ainty equivalence prin cipl e the white-n oise term E will have no effect on control optimisation, and we can for the mom ent delete it (although it will cert ainl y play a role in the next two chapters). Regard the con seq uen t dete rmi nist ic process equ atio n as a con stra int and asso ciate with it a vector Lag rang e mul tiplier .A 1. The con seq uen t Lag rang ian form h-1
H=
I:c( x,., Un T) r=O
h
+ LA ;(d x+ ~u- d)T + ch.
(3)
r=l
constitutes our time-integral. Her e Ch is the closing cos t at tim e h; we could, consistently with othe r assumpt ions, sup pos e it a qua drat ic func tion of Xr (h - p < T ~
h).
We aga in mak e the dist inct ion betw een T, a general run nin g tim e variable , and t, which labels the mom ent 'now : In oth er words, we assu me that ur has alre ady bee n dete rmi ned for T < t, not necessarily optimally, and that Ur is to be dete rmi ned forT;::;: t. Ext rem isat ion of 0 with respect to (x, u, .A)T gives a line ar system of equ atio ns which we can writ e as
(t
~
T
< h- p).
(4)
The corr esp ond ing equations for h - p < r ~ h will be mod ifie d in that they will include con trib utio ns from C11. As in Section 6.5, fJJ and .91 are the ope rato rs conjugate to d and f!J in that, for exam ple, .91 = A(.:T- 1) T = L::=o A"[.:r'. So d operates into the pas t and .91 into the future. Not e that , effectively, Ar = 0 for T >h . The sup ersc ript (t) indicates that the opti mis atio n is one holding from tim et onwards. The values of xVl and u~1 l for r ;::;: t dete rmi ned by (4) and sub sequ ent equ atio ns con stitu te the prediction at tim e t of the future course of the opti mal ly con trol led process. Thi s will coin cide with the actu al cou rse only if ther e is no plan t noise E. However, the dete rmi nati on of u) 1l thus obta ined is opti mal und er all circumstances. If we write the equ atio n system (4) as
(5)
then this can be identified with the equ atio n (18.3) of the general path -int egra l trea tme nt. The mat rix function ( z) thus implicitly defi ned cert ainl y sati sfies the sym met ry requ irem ent ~ = of the general trea tme nt. Suppose that (z) has the can onic al factorisation
(6)
TROL RULE IN DISCRETE TIME 2 EXPRESSION OF THE OPTIMAL CON
351
of (18.311 the natural discrete-time where we have supposed the normalisation normalisation. Then. the semi-inversion
(7)
(r;;:: t)
y solution in the infinite-horizon of (5), if valid, provides the optimal stationar limit. constitutes a stable forward relation It provides the solution in that relation (7) ally-controlled process. However, determining the predicted course of the optim gives an explicit expression for the more importantly, relation (7) for r = t shall make this determination mor e optimal value of u1 in closed-loop form. We well claim it as the most direct, explicit in the following section; one may opti mal stationary policy. economic and natural determination of the ion (6) should exist is just the The condition that a canonical factorisat ld exist in the case ( 0; the case of condition that infmite-horizon limits shou right-hand member of (7) shou ld be 'regulation to zero~ The condition that the future) is just the condition that the finite (with the operator acting into the to cope with the input (; see the optimally controlled system should be able discussion after Theorem 6.3.1. ediate. The operators in the plan t The continuous-time analogue is fairly imm al operator~. so that, for example, equation are now polynomials in the differenti integrals in (3), and relation (4) d = A(!~) = Er Ar~'. Sums are replaced by of conjugacy. In analogue to (5) one holds with the definition il =A ( -~)T writes this system as
=
(8)
(r;;?: t) ~ is unde rstoo d as
where a time argument ris understood and the canonical factorisation
d/dr . lfe)( s) has
(9) of Theorem 18.7.2, where we have supposed the normalisation of the semi-inversion (7) is
(r;;?: t).
I I
I
(10)
TRO L RULE IN DIS CRE TE 2 EXP RES SIO N OF THE OPT IMA L CON TIM E rating function i.P(z) has the form For the LQG control problem the matrix gene
i
I
then the analogue
S(z) A(z )l Q(z) B(z) . 0 A(z) B(z)
R(z)
i.P(z)
= [ S(z)
(11)
352
OPTIMA L STATIONARY LQG POLICIES: PERFEC T OBSERVATION
Here have in fact generalised the cost function somewhat by replacing
the matrix
R by a generating function R(z), etc. This corresp onds to the inclusion oflagg ed variables in the cost functio n c1; the generalisation can be made painles sly at this
point and the conclus ions we now reach remain valid under it. Recall that Bo = 0, andsup poseth enorma lisation Ao =I. This form is special enough that we can say someth ing about the form of the canoni cal factor ¢(z) in the normal ised factorisation (9).
Theorem 19.2.1 The canonicalfactor¢ ofthe matrix generating function (11), under the normalisation of Theorem 18.4.2, has theform ¢xx ¢= [ ¢ux
¢xu /] ¢uu 0 , B 0
A
(12)
the partitioning being the (x, u, >.) partitioning of~. and an argument z being understood throughout. If the cost function involves lags ofsize p - 1 at most then the submatrices with the ¢-labels are polynomials in z ofdegree p - 1 at most. Proof Validity of the form (12) follows by appeal to the recursive algorit hm (18.35). We leave the reader to verify that, if ¢(i) has the form (12), then so does ¢(i+I). So then does¢, by Theore m 18.5.3. The fmal assertio n follows from the fact
that, under the assump tion stated,
¢p
= Gp = [ ~
Ap
0 0] 0
0
Bp
0
j
see (18.32).
0
Theorem 19.22 Suppose that infinite-horizon limits exist. Then the optimal value ofUt is given explicitly and in closed-loop form by --
¢ux(ff)Xt + ¢uu(ff)ut = [¢o¢(ff-)1 Ju.A,
(13)
where the u>. subscript on the bracket indicates the extraction of the corresp onding submatrix in the (x, u, >.)-partitioning of the bracketed matrix. If the cost function involves lags ofsize p - 1 at most then relation (13) expresses the optimal u in terms 1 of(Xr, Xt-I, ... ,Xt-p+I), (ut-I. Ut-2 1 ••• , Ut-p+I)and (dt, dt+l, ... ). Proof Relation (13) follows if we extract the u-subvector relationship from relation (7) at r = t, taking accoun t of the form asserte d for¢ in Theore m 9.2.1. 0 The closed-loop determ ination (13) of the optima l control is explicit to within achievement of the canoni cal factorisation, and is both as neat and as explicit a solution as the general case will allow.
TINUOUS TIME 3 OPTIMAL CONTROL RULE IN CON
353
e insig ht if we intro duce the s:ystem We gain both som e econ omy and som m (4) in the generalised case (11) as nota tion of Sect ion 6.6, and write the syste
) IU(ff)] [x] (t) = [OJ [m(ff d / A 0 21(5")
(14)
T
where then
X=[~],
I!!= [d £!6']
The asse rtion (12) is then that the and the plan t equa tion reduces to lUx =d. norm alise d cano nica l facto r ¢ takes the form
¢(z) _ [ ¢xx(z) IU(z) -
l!lo]
0
(15)
'
ix poly nom ial IU(z). where 910 is the abso lute term in the matr rem 19.2.1, but in the mor e econ omic al We can follow thro ugh the proo f of Theo (15) directly. However, we have to retu rn system nota tion throu ghou t, to dedu ce ion (13) if we wish to disti ngui sh the to the mor e explicit nota tion of relat rule explicitly. part icula r role of u, and dedu ce the cont rol
Exercises and comments
= - Bff and there are no lagg ed term s (1) For the Mar kov case d = I - Aff, £!6' nica l facto r ¢(z) of\II(z) has the form in the cost function. The norm alise d cano
R+A TIIA ¢(z) = [ S +Br iTA
1-A z
sr +AT IIB /] Q + BTIIB
0
-Bz
0
whe re II is the matr ix of the infin ite-h orizo
T
=
[m + ~ IIil11
n value function F (x).
TROL RULE IN 3 EXPRESSION OF THE OPTIMAL CON CONTINUOUS TIME 18.7, that, for re, discussed alrea dy in Sections 4.7 and
We agai n have the featu of the discrete-time case is the matr ix insta nce, the anal ogue of the matr ix Ao atives of com pone nts of x occu rring coefficient, A. say, of the high est-o rder deriv is the high est orde r of deriv ative of r in the plan t equation. Tha t is, supp ose that 1 in the plan t equation. If A, has jkth the jth com pone nt x1 of x that occu rs com pone nt aj~l then
(16)
354
OP TI MA L STATIO NARY LQ G POLICI ES: PE RF EC T OBSE RVATION
This matrix must be no n-singular if the plant equation is to constitut relation which determ e a dynamic ines the forward path of the process. We corre define B. as the matr spondingly ix coefficient in f!J of the highest-order deriv occurring in the plant atives of u equation or the cost fu nction c, and Q* as the these highest-order de matrix in c of rivatives. For the control problem the matrix
(17)
The form of H follows from its general definiti on in (18.51), and the fo the argument of Theore rm of ¢ by m 18.2.1. The fact that B* is in ge neral non-zero makes the solution for the op of u(t) a little more co timal value mplicated than the co rresponding expressio discrete-time case. From n (13) for the relation (10) at r = t we deduce that ¢xx(~)x + ¢xu(~)u
+ AYA = Px>.(~)d Y A= Pu>.(!!))d
ifJux(~)x + ¢uu(~)u +B
whereP(~} = H¢(!!)) -I andPx>.(:!)) etc. indi
cates the corresponding All quantities are evalu submatrix. ated at time t. Eliminati ng A we deduce the relat ion [ifJuxU~)- B* A; 1¢xx(! !))jx + [¢uu(!!))- B*A:;- 1 ¢xu(!!))ju = [Pu>.(!!))- B*A-; 1 (18) Px>.(~)Jd. This must be regarded as a forward differentia l equation for the optim by terms in differentia al u, driven ls of known x. It can, of course, be solved in terms, but would be transform generated in real tim e by a mechanism rea differential equation. lising the The relation simplifie s considerably under certain conditions. Su example, that u does no ppose, for t appear in differentiat ed form in either cost plant equation and that function or the cross-terms betwee n u and x in the cost fu been normalised to zero. nction have Then
=
R(s) [ 0
0
Q A(s) Bo
A(s)
BJ 0
l
(19)
3 OPTIMAL CONTROL RULE IN CONTINUOUS TIME
355
Q is now a simple matrix, as is B0• Let us suppose, for simplicity, that A. has normalised to the identity. Then· one, finds, by the arguments of Theorem ·.· ·1s.1.2, that the expressions (17) reduce further: to
l
0 I Q BJ , I Bo 0
0 H = [0
¢{s) =
[rPxx(s) 0 A(s)
0 Q
Bo
l
I . BJ
0
It follows then that (18) simplifies to
.u = Q- 1{BJ¢xx(!'d)x + [Pu>.(~)- B6Px>.(~)]d}. Exercises and comments (1) The return difference equation Consider again the classic feedback loop of Figure 4.2. The operator -:f{'r§ is the loop operator, giving the effect on a signal which passes successively through the plant and the controller, and ; = ..F + :f{'r§ is the return difference operator, giving the difference in effectis · between no traverse and one traverse of the loop. If a disturbance d superimposed on the control u as it is fed into the plant then u satisfies J u = d. Let J(s) be the continuous-time transfer function of .f. The return difference equation is a version of the inequality JQJ > Q, which holds for LQ-optimal control, at least under state-assumptions, if Q is the penalty matrix for contr<Jl. It expresses the fact that the operation ; -1 attenuates at all frequencies. The relation is very easily proved from the canonical factorisation of with . factors of the known form (17). If the column vector with subvectors x, u and .X is written b.., let the solution of ¢b..= 0 in terms ofu be written b..= Du. Then the return difference equation amounts simply to (20)
h(j)n- 1¢D = hD. Show that in the special case (19) relation (20) implies ]QJ =
Q+BA- 1RA- 1B,
where J = I - Q- 1B"[ rPxxA- 1B. Show that in the case 0 R(s) Q(s) (s) = [ 0 A(s) B(s)
~]
B(s) , 0
special only in that S(s) has been normalised to zero, relation (20) amounts to
JQ.J= Q+BA- 1RA- 1B, where J = Q;:- 1{[¢uu(2)) - B*A;:- 1¢xu]- [¢ux- B.A;:- 1¢xx]A- 1B}. In both cases J is indeed the return difference transfer function for the optimal control.
CHAPT ER20
Optimal Stationary LQG Policies: Imperfect Observation 1 mE PROCESS/OBSERVATION MODEL: APPEAL TO CERTAINTY EQUIVALENCE We assume the linear model of equation (19.1) together with an observa.tion relation: d x + E!lu = d + f
(1)
y+~x= TJ.
(2)
This specification covers both the discrete- and continuous-time formulations; the time variable has not been explicitly indicated The operator ~ is, like d and E!4, a causal translation-invariant linear operator. Let us initially discuss the discrete-time case withpth-ord er dynamics, with CG having the form C(fr) = L:f= 1 C,fr'. As ever, dis a deterministic disturbance term and f and TJ are plant and observation noise respective!~ We suppose that these noise terms jointly constitute Gaussian white noise with zero mean and covariance matrix
In this case of imperfect observation the information W1 available at time t consists of the observation and control histories Y1 and U,_ 1, plus the complete course of the deterministic component of disturbance {d1}. We shall ultimately be passing to the stationary regime, in which the past as well a.s the future is infmite, so that observation and control histories extend into the infinite past. Since the model is totally LQG we can appeal to the certainty equivalence principle of Section 12.3 to deduce the optimal control in this imperfectly observed case from that for perfect observation. We know from the analysis of Section 19.2 that, in the case of perfect observation, the optimal stationary determination of u1 is given in closed-loop form by
(3)
358
OP TI MA L STATIONA RY LQG POLICIES
Th e left-hand member of this relation gives the feedback compon control, expressing u ent of optimal 1 in terms of x, (t p < T ~ t) an d u, right-hand member giv (t p < T < t). Th e es the feedforward ter m, in terms of d, (T ;;::: In the case of imperfe t). ct observation recursio n (3) for the optimal holds, except th at x, mu control still st be replaced, where it occurs, by the curre estimate x~l. We are th nt projection en led to the inference problem; the determi estimates. The duality na tio n of these of estimation an d cont rol has already been for the Markov case in demonstrated Section 12.9; we shall see how this extends to general order. dynamics of
2 PROCESS ESTIM ATION IN TIME-IN TEGRAL FORM (DISCRETE TIME) Th e characterisation th at we shall take of projection estimates least-square property is no t the linear asserted in (ii) of Th eorem 12.6.3, bu t ra th probability-maximisi er the dual ng (or discrepancy-mi nimising) property as the same theorem. It se rte d in (iii) of is this characterisatio n which yields the na integral formulation. tural timeA related po in t is th at canonical factorisa factorisations of some tions are then thing like the reciproc al of an AG F rather th LLS approach associa an (as in the ted with the Wiener fil ter) a factorisation of means that, in the ca an AGF. This se of pt h- or de r dyna mics, the factors are degreep. polynomials of We ca n set up the prob lem rigorously un de r the supposition th at ob began at a finite time servation h1, an d can th en pass to the infinite-history limit ht ju st as we passed to the --> -o o, infinite-horizon limit for control optimisatio will then be restricted n. Histories histories, so th at Xt is {x,; h1 - p ~ T ~ t}, etc negative exponent in th . Then the e Gaussian density of X 1 an d Y1 for prescr discrepancy ibed U1_ 1 is the
i0 1 =priorterms+!t[E]T[~ r=hr
'T] r
ft ]- l[ E ], 'T]
r where E an d ry are ex pressed in terms of x, y and u by appeal to observation relations the plant and (1) an d (2). Th e 'pr io r ter ms' reflect the distribut for relevant system hi ion assumed story before time h 1· In the case ofpt h- or de r will constitute a quad dy namics they ratic function of {x ,y , u,; h 1 - p ~ r < ht}. The projection estimate s x~l are ju st the value s of x,. minimising 10 thus amounts to a back 1 • Estimation ward rather th an a forw ard optimisation prob integral (or a precur lem; a timesor to one) is to be extremised over its co current m om en t t rath ur se before the er th an after. Actually, we would no t regard extremisation of 101 as extremisation integral, because it of a timeis subject to the co nstraints implied by observation relations the pl an t an d (1) an d (2). Le t us eli minate these by the in troduction of
2 PROCESS ESTIMATION IN TIME-INTE GRAL FORM
359
Lagrangian multiplier vectors l.r and m 7 for the constraints constituted by the relations at time T and so extremise a Lagrangian form [l)t
+ L:W (dx +P-lu-d -E) + mT (y +
7
T
with [1) 1 expressed in terms of the noise variables as above. Minimisation of this form with respect to the noise variables reduces the problem to extremisation of the past path integral t
llp(l, m,x)
=
(prior terms)+ L[v(l, m) -IT(dx + P-lu- d)- mT (y +
(4) Here
is the informational analogue of c(x, u). Strictly; we should should give the multipliers I and m superscripts (t), to indicate that they are the multipliers associated with optimisation on the basis of observables at time t. The relation between the multipliers and the noise estimates is
The direct interpretation of the multipliers is the familiar one of sensitivities: of Wl and m~l as respectively the rates of change of the minimal value of [)) 1 (subject to constraints (1) and (2)) with changes in the actual values of Er and TJr· Note one effect of the transformation from [)l, to flp; the expression for the timeintegral has been cleared of all inverses of covariance matrices. Thus one puzzling and seemingly perverse feature associated with a stochastic formulation is removed: that it actually seems to become anomalous if some noise components are zero, and so the noise covariance matrix is singular. One cannot directly relate the multipliers lr and A.r associated with the plant equation at past and future times. This is because the past optimisation takes precedence: one first minimises [)l with respect to unobservables and then C with respect to undetermined controls. This is again a reflection of the degenerate character of the risk-neutral model. When we come to the risk-sensitive formulation in the next chapter we shall see that effectively there is a single timeintegral which spans both past and future, with the consequence that past and future multipliers can be related. The estimates x~t) of process history at timet are obtained by minimising ll)r with respect to the corresponding Xr- The path integral llp must then be
360
OP TIM AL STATIONARY
LQ G POLICIES
ma xim ise d with respect to these variables and mi nim ise d wit h res pec t to cor res pon din g/, m. We ded uce the n the stationarity condition s
[fr
~ ~] [~~] (t) = [d -_:u] (t)
..ci'~O
XT
0
(hr < r ~ t);
the estimation analogue of the set of control optimisat ion relations we rewrote (19.4) in the con den sed for m (19.5), so we rewrite (5) as
(ht say. Th e ter mi nal con dit ion s for
=0
(5) :
T
(19.4~
Just as
< 'T ~ t),
(6)
this equation set are effectiv ely
) =0 (r> t), ' T (7) since ter ms for r > t do no t app ear in Dp. Th e initial con ditions are provided by the stationarity conditions forT~ ht. which dev iate from the pat ter n (5) in they will involve the pri or dis tha t tribution. Note tha t relation s (7) imply tha t (6) has a solution ind epe nde nt of the values of x~> for r > t, bec aus e x~l simply does no t occ ur in the equation system (5) for T > t. We shall refer to thi s feature as the lack of forward coupling. It reflec ts the fact tha t the estima tes of pre sen t and pas t variables do no t dep end upo n the estimates of future var iables. On e can con tra st it with the essential pre sen ce of backward coupling in the cor res pon din g control relation (19.5); pre sen t con tro l is undoubtedly expressed in ter ms of pre sen t and pas t values of the process var iable. Relation (6) plus its mo dif ied version at earlier tim e poi nts determines in principle all pas t estimates x~l. We wish to solve these equations efficiently, but we really only nee d to solve the m for the estimates x~l ( t - p < r ~ t) which are nee ded for im ple me nta tio n of the con tro l rule. In the inf inite-history lim it this poi nt is again me t by app eal to a can oni cal factorisation ; this tim e of '11 (z ). In the infinite-history limit h 1 -oo equations (5) and (6) hol d for all r ~ t. If w~ ass um e tha t '11 (z) has a can oni cal factorisation J(t) T
m
.;~
(8) which we have taken in the nor ma lise d form analogous to (19.6), the n the system (6) can be semi-inverted to (r ~ t). (9) He re the ope rat or 'lj;(!T) act s int o the future, with an effe ctive bou nda ry condition x~> = 0 (r > t) implied by the pro per ty of lack of for ward coupling. Th e ofterator 'lj;(!T)- 1 acts int o the lation (9) for r =!d ete rm ine x/>) explicitly. Th e relations forpasr t.= Re s x~e) (and so t - 1, t - 2 ... the n det erm ine • 1 X1(t)_ 1,x1(t) the _ 2 , ••• recurstve values of y.
\
.·~-1
' .,i
3 THE PARALLEL KALMAN FILTER (DISCRETE TIME)
361
With this one would seem to have estimates of the process variable at the relevant times in the form one would wish. This is not quite true, however. Rather than an expression for x~r) in the form of an infinite sum involving all past observations (which is what (9) yields if we set r = t) we would wish to generate this estimate by some simple updating recursion, as was achieved by the Kalman filter in the state-structured case. The question is, then, whether the backward recursion with respect to r implied by (9) can be converted into a natural forward recursion with respect to t: the pth-order equivalent of the Kalman filter. We shall see in succeeding sections that this conversion is almost immediate. Note one necessary difference between factorisations (19.6) and (8): the order of factors is now reversed, in that the factor 1/J{z) = :L:}=0 1/ljzl, for which both .,P(z) and its matrix inverse have expansions in non-negative powers of z valid in lzl ~ 1, is now the initial factor. Otherwise the factorisation is the complete analogue of (19.6) (e.g., in that 1/Jo is the constant term in 1/J(z) and is symmetric) and can be achieved by the same policy-improvement algorithm. Of course, the improvement is in inference rule rather than control rule. The existence of a canonical factorisation (8) is again the essential condition that infinite-history limits should exist for the estimation rule and the distribution of estimation errors. Exercises and comments (1) For the Markov cased = I - Aff, fJl = - Bff and fl = - Cff the normalised canonical factor '1/J(z) of w(z) has the form N +AVAT L+AVCT '1/J(z) = [ LT + CVAT M + VCVT 0 I
I -Azl -Cz 0
where Vis the limit value of the covariance matrix of the estimation error .X - x (cf. Exercise 19.2.1).
3 THE PARALLEL KALMAN FILTER (DISCRETE TIME) The innovation in the observations now has the form /" _
(t-1) _
.,, - Yr - Yr
rA
(t-1)
- Yr + -.xt
·
(10)
Here the time-translation operator acts, as ever, only on the subscript, so that
rtx~t-l) = :L:~ 1 Crx~~~I). We assume the normalised form (8) of the canonical factorisation.
Theorem 20.3.1 (The parallel Kalman filter) dated by the relations
The estimates x~l (r ~ t) are up-
362
OPTI MAL STATIONARY LQG POLICIES "" d + ( d - I ) X 1(t-1) + VfJUt ~ t + Ho(r, x(tl = x(t-1 ) + H r T T (r < t), f-T<,t
x 1(t)
( 11) (12)
where ( 1 is the innovation (I 0) and the matrix coeff icients Hj are determined by
L HjZ-j = -['1/i(z) -llxm· 00
H(z) :=
J=O
(13)
Equa tion (11) is recognisable as a generalise d form of the Kalm an filter; it must now be supp leme nted by the relations (12) whic h upda te also the estimates of the lagg ed x-values. The real novelty of the theo rem lies in the com pact evaluation (13) of the coefficients Hj in term s of the cano nical factor ¢. The xm factor indicates that we extra ct the corre spon ding subm atrix from the (j, m, x)parti tione d matr ix. We term relat ion (11) plus relations (12) for 0 < j < p the parallel filter because it simultaneously upda tes estimates of the all the comp onen ts of what would in fact be a state vector. It cont rasts with anot her possible version of the Kalm an filter, the serial filter, which emer ges in the next section.
Proof Note the cruc ial difference between relat ions (19.5) and (6) on which we have already comm ented . Relations (19.5) coup le back into the past in that they involve Xr for 7 < t. However, relations (6) do not couple forw ard into the future; future values of x are not involved and future values of land mare zero, as asser ted in (7). This abse nce offo rwar d coupling impl ies an effective boun dary cond ition
(14)
We dedu ce from (6) that
w(.r)(xVlxV- 1l)
= P!- 1 - P~- 1
(15) :aut the right -han d mem ber of (15) is equal to zero for 7 < t and to the colu mn vector 'Yr with parti tion (0, -(1 , 0) forT = t. Premultiplying relat ion (15) by the oper ator '1/1( Y) -I '1/Jo we thus dedu ce that
(r:::; t). This, toge ther with the effective boun dary cond
-h1-T Xr(t) - Xr(t-1 )- - ht-r'Y t-
ition (14), implies that
[~] ~
(r:::; t),
(16)
where h1 is the coefficient of z-i in the expa nsion of -'1/i(z) -I in non-positive powe rs of z. This demo nstra tes the validity of (12) for r :::; t with the evaluation
363
4 THE SERIAL KALMAN FILTER (DISCRETE TIME)
(13) of the coefficients Hi. For the particular case r for x~c-l) from the relation (t-1)
.s;l Xc
= t we deduce an expressi()n
+ !!4ur = d,,
so deducing (11) from (12) for the case r
=
0
t.
Exercises and comments (1) Consider again the Markov case, for which the canonical factor '¢ is given in 1 Exercise 1 of Section 2. The xm submatrix of -i/1( z) - is
H(z)
= H +z- 1 V(I- nTz- 1r 1 cT(M +
cvcT)- 1
(17)
where H = (L + AVCT)(M + CVCT)- 1 and 0 =A- HC. Then Ho indeed has the standard evaluation H. The relations (12) for r < t are of only academic 1 1 interest in this case, but we see from (17) that Hi= V[nTy- cT(M + CVCT)for j > 0. Part of the reason why this formula does not hold at j = 0 is tb.at observation of y 1 gives some informatio n on the value of t:. 1 if L =/:. 0 (i.e. if plant and observatio n noise are correlated). This helps in the estimation of Xc. but not of earlier x-values.
4 THE SERIAL KALMAN FILTER (DISCRETE TIME) Direct operations on the relation (6) yield an interesting alternative form of the higher-order Kalman filter. Let us define Xc as'¢(?/) -l Pt, i.e as the solution of '¢(?/)x
(18)
= p.
Theorem 20.4.1 The vector
(19) can be identified with x) 1l. In particular, current process variable.
x can be identified with xl l, the estimate of 1
1
Proof It follows from (6) and (18) that '¢() 1 '¢(?/)x~l =
Xr
(20)
But, because of the lack of forward couplin~ and the fact that '¢o is the absolute D term in i/1, relation (20) for r = t reduces to x/) = Xr-
364
OPT IMA L STATIONARY LQG POL ICIES
One mig ht define the serial estimate of process hist ory at tim et as {x.,.; r ~ t}3 and the upd ated or revised estimate as {x~lr ~ t}. Correspondingly, the best pred icto r of y 1 at time t- 1 is y~t-! ) = - E:=l C,x;~-;:'l, whereas the seria l predictor is 1 = - L::=l CrXt-r· Cor resp ond ing to the noti on of the inno vati on ( 1 = y 1 - y;t-I ) is then that of the serial innovation
y
1;
s
l
J (z= Yr- Yr= Yt + l?fxt. (21) By the same argu men t as that which proved the special form (19.12) for the ·~ cano nica l factor ¢of iP we ded uce that the cano nica l factor 1j; has the form J
1./J =
[
1./J11 1./Jtm 1./Jmt 1./Jmm
I
where an argu men t f7 or z is und erst
(22) .
0
ood .
Theorem 20A.2 (Th e seri al Kal man filte r) forward pair ofrecursions
The variables x and mobey the stable
dx + !Jiu = d + 1./Jzm(fl)m 1/lmm(fl)m = y so that xis determined recursively by the
(23)
+ l?lx
(24)
serial Kalman filter
dx + ~u = d + 1Ptm(fi)1./Jmm(ffr 1 (,
itselfa stable forward recursion.
(25)
Proof Written in full, relation (18) beco mes
1./Ju 1./Jtm [ 1./Jml 1./Jmm I 0
dl [-=/x ] [a- f?lul 1?1 0
m
'T
-y 0
(26) 'T
From this it follows that l = 0 and that the equation system reduces to (23), (24). J This redu ced system implies the dete rmin atio n m= 1/J;;;~ ( and the Kal man filter recursion (25) for x. Stability of all relations as forward recursio ns is gua rant eed by the cano nica l char acte r of .,P.
0
Relation (25) has inde ed the char acte r of the classical Kal man filter, in that it is equivalent to the driving of a plan t mod el by the serial innovations, or of a plan t/ observation mod el by the observations . The parallel filter of the last section has rath er the character of the driving of a state-reduced plan t mod el by the innovations; the serial filter avoids such a reduction. However, the fact that the serial filter works on serial innovations mea ns that the driving term 1/JJm'I/J;;;~ s (is
365
5 INVALIDITY OF THE SERIAL FILTER s
s
general not a function of current ( alone, but also of past (. Some ~orr111 .,u.,••~u·u is required to take account of the fact that history has been .,~t;m:atea serially. The best way to view relation (25) is in fact to revert to the equation-pair (23), (24) and regard this as a coupled system of plant/observation model and compensator driven by the observations. Of course, the reason why l1 = 1~ 1) is zero for all tis that the plant equation at t constitutes no essential constraint; the variable x 1 appears in this relation alone and can be given the value which satisfies it best, without affecting the estimates of earlier history or their fit. The following result clarifies the character of m1 and relates the two innovations. Theorem 20.4.3
The serial and parallel innovations are related by -1
-1 s
A
(27)
WommC = m = Wmm ( ·
T = t plus the lack of forward coupling implies that Because 1/J has the form (22) this last relation implies that /~ 1) -/(t-l) = 0, which we know, and also that 'l/Jomm(m) 1l - m)t-l)) = (1• Since m;t-!) = 0 the first equality of (27) thus follows; the second we know D from(24).
Proof Relation (16) at
1/Jo(xit) - xir-!))
= It·
Finally, summation of the equation x~"") - x~""-!) over the range T < a ~ t leads to the conclusion
= H
Theorem 20.4.4 The updated estimates of past process values are obtained from the serial estimates .X by the formula
X~) = Xr +
1-T
1-T
j=l
j=l
L Hj(r+j = Xr + L Hj'l/JOmmmr+J
(T~t).
(28)
5 THE CONTINUOUS-TIM E CASE; INVALIDITY OF THE SERIAL FILTER Interesting points arise in the continuous-time case. The transfer from the discrete-time case can not be taken mechanically, and some points are delicate enough to affect implementation. When it comes to estimation then relation (5), written again as (6), still holds, with the boundary condition of lack of forward coupling. We appeal to a canonical factorisation (29) However, it is now sometimes advantageous to vary the normalisation from that indicated in Theorem 18.7.2. In the next section we shall demonstrate that a
366
OPT IMA L STATIONARY LQG POL ICIES
factorisation can be foun d of the form (22); see (35). It will then also follow that , if we agai n define x( t) as the solu tion of (18), then .X( t) can aga in be iden tifie d with the esti mat e of curr ent proc ess valu e x(tl(t). Furt herm ore, it follows from the form of '1jJ that .X and fh obey the anal ogu es of (23), (24)
dx + r16u = d + '1/Jtm(~)fh '1/Jmm(~)fh = Y +~X,
x
so that agai n obeys the seri al Kal man filter s
dx + :18u
(30)
relation, anal ogo us to (25),
= d + 1/JJm(~)1/Jmm(!!))- 1 (s .
(31) Her e (is agai n the seri al inno vati on. The true and seri al inno vati ons now have the definitions s (=y+~x
(32)
respectively. Her e we have used x,( t) to den ote the rth differential!?)' x( t) of x at t and .X,( t) to den ote its proj ecti on esti mat e on info rma tion at t:
x,(t) = lim~'x(tl(-r). Tjt
(33)
Not e that !?) acts, as ever, on the runn ing time argu men t 7 in (33). However, rela tion s (31) and (32) have only a form al validity. The inno vati on (is not differentiable with resp ect to time ~since it has a whi te-n oise com pon ent) . Nei ther in fact is the seri al inno vati on (, and the fact that equ atio n (32) s rela tes differentials of .X to differentials of ( is an indi cati on that som e of the diffe rentials of do not exist either. Equ atio ns (30) are prop er in that they defi ne a stable filter with inpu t y and well-defined outp uts x and fh. However, the equ atio ns themselves con stitu te an imp rope r real isati on of this filter, in that they expr ess relations betw een signals (variables) whi ch are so ill-d efin ed mat hem atic ally as to be hopelessly ill-c ond ition ed phys ically. · In orde r to obta in a set of upd atin g relations in which all variables are well defi ned we have to reve rt to the para llel filter and so, effectively, to a stat e redu ctio n of the model. Thi s goes rath er agai nst our prog ram me, one of who se elements was the refusal to reso rt to stat e reductions. However, the redu ctio n is for purp oses of pro of only, and we shal l see that the parallel filter, gen erat ing all the relevant estim ates .X,, can be deri ved directly from the form of the cano nica l factor'ljJ.
x
6 THE PARALLEL KALMAN FILTER (CONTINUOUS TIME)
The con tinu ous- time case has show n itse lf to differ from the discrete-tim e case in that the seri al Kal man filter can not be imp lem ente d as it stands. Tur ning then to
6 THE PARALLEL KALMAN FILTER (CONTINUOUS TIME)
367
the parallel filter, we see that there is necessarily another difference. Whereas one can well consider the estimation of x at any lag (either in discrete or continuou s time) one cannot consider the estimate of differentials of x of any order, because differentials of componen ts of x will in general exist only up to order one less than the maximal order occurring in the plant equation. (Recall that the highestorder derivative has in general a white-noise component.) This is of course acceptable, in that these are the only estimates of differentials that one needs for purposes of control. However, if one restricts oneself to estimating just these differentials, then one has effectively reverted to a state-reduc ed treatment, which is rather against the spirit of our programm e. We shall in fact appeal to a state-reduction for purposes of proof, but the final conclusions do not appeal to such a reduction. That is, the continuou s-time analogue of the matrices Hj occurring in the parallel filter analogous to (11), (12) will be deduced directly from the canonical factor 'ljJ of w. Let us for the moment omit the input terms d and i?Ju in the plant equation; these can be left out and later restored without affecting estimation. In the Markov case (with d(s) =sf- A, C6'(s) = -C for constant matrices A and C) we find canonical factors
L+ VCT M 0
V
'f/J(s) = [ 0 I
si-Al -C 0
'l/J.
,
=
0 0 [0 M 0 I
/]
0 .
(34)
0
where Vis the stationary covariance matrix of the estimation error x- x. We shall see that this generalises to
'l/J=
[~I !0 ~]0 [~~~I t~:0 ~],'l/J. 0 =
(35)
where 'ljJ and its elements have arguments s or ~ as appropriat e. Let us initially suppose that the dynamics are of order p exactly, so that the matrix coefficient Ap of ~P in d is non-singular, and can be normalised to the identity. The Kalman filter for the state-reduc ed model, which gives the parallel updating for the unreduced model, then has the form (analogous to (11), (12)), (O~r
(36)
p-!
~Xp-1 + LA,x, = Hp-1(
(37)
r=O
where (is the innovation, expressed more explicitly in (32), (33). This constitutes the parallel filter. It is convenient to define the matrices G, = MH, and the generating function
' f
368
OPT IMA L STATIONARY LQG POLICIES
p-I
G(s) =
L G,s-r-I. r=O
The problem is then to determine the coefficie nts H, in (36), (37) directly from the .· . canonical factor '1/J of (35), without reversion to a state-reduced treatment. This is • · resolved as follows. Theorem 20.6.1 The coefficients llj
= GjM- 1 in the parallel Kalman filter
(36), • the canonicalfactorisation (35) by the
(37) are determined in terms ofthe factor '1/J of
relation
'1/J~m(s) =
[A(s)G(s)]+
(39) where the truncation operator [ ]+ retains only the terms in non-negative powers of s. This implies an equation system p
'1/Jrlm =
L
AkGk-r-l
(r = 0, 1, ... ,p- 1)
k=r+1
(40)
which determines Gp_ 1, Gp_ 2 , •.• recursively. Suppose dynamics are of variable order, in that rj is the order of the highest derivative of the jth component ofx which occu rs in the plant equation. Then the jkth elementof'I/Jrlm is zero for r ~ rjandwe may assu me the same ofG,. Relations (40) then again determine the components ofthe matr ices G, recursively from the highest order downwards. Proof The essential conclusion, expressed in (39), is as simple as one could wish. However, the proof is somewhat more exten ded. Consider first the Markov case and define the vector as the solution of '1/J( $'")::( = p, with the canonical factors given by (34). That is,
x
[~ L+f~
2}-t][ ~l] [7] =
It follows then immediately, as in Section 4, that l = 0, that x obeys the Kalman filter relation !'Jx =A x+ H(y - Cx),
and that m= M- 1(y- Cx) = M- 1(. Further, the relation -ifix(tl(r) = '1/J.xT now amounts to
1]0 [-m -1 ] 0
X
(t) T
=
[
~
x ] -Mm 0
(r~t),
(41)
6 THE PARALLEL KALM AN FILTER (CONTINUOUS TIME)
3691
since z(tl(t) = 0, it indeed follows that we can make the identification == xC1l(t) andalsomCtl(t) = m(t). er its state Now consider the model withpth-order dynamics exactly, and consid uishing tilde on '' reduction, using the notation of the Markov model with a disting vector with rth } variables and coefficients. The state vector x will be a column then implies · subcompopent x, = 9J'x(O ~ r < p). The analogue of relation (41) again that I = 0 and that mand the x, satisfY (42) -G,m + 9Jx, - x,+l = o (O~r
-Gp-lm + 9Jxp-l + L:A,x, = o
(43)
r=O
p-1
-Mm+ L:c,x, = -y
(44)
r=O
of L + VC1 • where the vertically partitioned matrix G = (G,) is the analogue H, = G,M- 1, These relations imply the parallel filter relations (36), (37) with x, indeed has and we know from consideration of the state-structured case that of 9)' x. value t curren the the interpretation (33); it is the projection estimate of rather sation factori ical It remains then to determine G, but from the canon ns equatio reduce than by actual calculation ofthe analogue of L+ veT. Ifwe x with xo of ication (42}-(44) to (30) by elimination of x, (0 < r < p) and identif 1/Jmm· and /J1m I ' then we can identifY the resulting operator coefficients of mwith Solution of (42) yields the expression r-1
x, =
P~'x-
L Gk9Jr-k- m= 9J'x- [9J'G{9J)]+.m 1
(45)
k=O
of PI (or s, as where the operator []+annihilates all terms in negative powers (43) and (44) into sion expres appropriate) in the bracket Substitution of this yields relations (30) with the identifications
'1/Jmm(s)
= M + [~(S)G(s)]+.
determining In particular, we have the key assertion (391 which implies the linear relations (40). canonical A continuation of this argument implies the asserted form (35) of the of the tion adapta factorisation. The final assertion of the theorem follows by D argument given to the case of variable order.
CH AP TE R 21
The Risk-sensitive (LEQG) Version optim isati on unde r LQG assu mpThe time-integral characterisation of control one inco rpora tes risk-sensitivity by tions seems to reach its completion first when is beca use the optimisations of generalising to an LEQ G characterisation. This the extremisation of a single time control and estim ation are then combined in the matr ices of operators 1> and \II integral, representing the stress. Furth ermo re, the LQG mod el appe ar inde ed generalise in so pleasing a fashion that they make when it is imbe dded in the class of as a degenerate case which finds its place first LEQ G models. pected that there is one step whose With so pleasing a completion it is then unex recoupling step of Section 16.7. We natural generalisation proves elusive: the final discuss the poin t in Sections 2 and 3.
Assu
1 DEDUCTION OF THE TIME-INTEGRAL me again the process and observation relations
+ :Jiu = d + E
d x
(1)
y+~x = 1J
(E, 'TJ). In discrete time we assu me with jointly white plan t and observation noise ntan eous cost function (12.4). In the noise cova rianc e matr ix (12.3) and insta ssion with a rate interpretation. The continuous time we assu me these same expre ion of risk-sensitivity are com mon struc tural poin ts which arise with the intro duct for definiteness. to both versions, so we shall keep to discrete time and know then from the risk-ac), E.,..(e We assume the risk-sensitive crite rion ion 16.2 that the optimal value of sensitive certa inty equivalence principle of Sect stress the of the control u1 is deter mine d by extremisation §
= c + e- 1 UJ
all currently unde term ined controls with respect to all curre nt unobservables and explicitly the sense in which this (at time t). Theo rems 16.2.1 and 16.2.2 state more extremisation is to be unde rstoo d. the stress as Und er the assu mpti ons above we can then write § = L(c- r + o-l Dr)+ end term s T
372
THE RISK-SENSITIVE (LEQG) VERSION
where
and e, TJ are to be expressed in terms of x, y and u by (1). As emphasis ed in Chapters 12 and 16, the effect of the certainty-equivalence principle is to reduce constrain ed minimisations to unconstr ained minimisations. We still have the constraints of the plant and observation relations, however, which we reduce, as in Sections 6.5 and 129, by the introduct ion of Lagrange multipliers. We thus consider the Lagrangi an form §
+ l:[AT{d x+ ~u- d-e)+ JLT(y+ ctx- TJ)].,.. .,.
{2)
If we extremise this with respect to the noise variables then we are left with the integral 0 = l:[c(x,u ) + AT(dx+ ~u- d)+ JLT(y+ Cx)- 8v(A,JL)].,. +end terms T
(3) where the cost and 'dual cost' terms c( x, u) and v( A, JL) have the expressions C.,.=
I
2
[X]T [R U T
s
ST] [X] Q
U .,.'
(4)
Expression (3) constitutes our time-integral, in that it can extremised freely with respect to every variable except those whose values are currently known. So, if one is determin ing the optimal value of u, then the control and observation histories U,_ 1 and Y, are known, as is also the whole course {d,} of the deterministic disturbance. All other x, y, u, A and JL values are then to be extremised out. Whatever the nature of the extreme, it will be attained at a stationar y point if the optimisation problem is properly posed Extremising values calculated subject to specification of observables at time will be given the superscri pt (t); they are to be regarded as 'minimal stress' determinations of the quantity in question conditional on informat ion at time t. (A 'determination' can then have aspects of both estimation and optimisation, and 'minimal stress' is really 'extremal stress', with the nature of the extremum dependin g on the sign of 8'J The dual variables A and JL have the physical meaning of 'sensitivities': rates of change of minimal stress with respect to changes in the value of plant and observation noise. Their extremising values are related to effective estimates of these noise inputs by
(5) as we see from the values of e and TJ which extremise expression (2}
373
2 THE STATIONARITY CONDITIONS
I ~
!
Note that the time-integral (3) is a linear combination of the time-integrals ion (19.3) and (20.4) arising in the separate considerations of control and estimat also Note y. naturall e for the LQG case. In the LEQG treatment these combin that the passage from (2) to (3) clears all matrix inverses from the integral. 2 THE STATIONARITY CONDITIONS to Extremising expression (3) with respect to all disposable variables, subject ns equatio of specification of the observables at timet, we deduce the two sets
sT
[!
[~
Q
91 L M tj
~f!l J -(}N
n·> = [Rr+S'Wr s.r + Qtf d + ()Lf.t
u
>.
T
=
-y ()STu
(6)
(r~t)
(7)
T
r [a-"·r
-ef.t $rt ] [ -9>. r x -{}R
(r;;;:: t)
r
f.tr for To these can be added the stationarity conditions with respect to Yr and saying t), > r ( 0 = f.tr relation ant import the r > t. The first of these yields the essentially that the observation relation is longer a constraint for r > t (since The . exactly) relation the satisfy to chosen be will estimate of a future observation future second gives a predictive relation for Yr (i.e. for what the values of t interes oflittle usually observations would be) which is and The two equation systems (6) and (7) plainly generalise equations (19.4) s previou these of tion comple natural the te (20.5). They indeed seem to constitu in entry empty the and s, variable n commo versions in that they are now linked by been the bottom right-hand corner of the two matrices of operators has now about bring entries These -eR. by other filled; in the one case by -(}Nan d in the sitive the effect which we have already remarked in Chapter 16: that in the risk-sen costby affected is ion estimat and case control is affected by noise properties ons conditi the them to add we pressures. The systems are complete for given t if
Ur specified (r
Yr specified (r ~ t);
< t).
(8)
As in Sections 18.1 and 19.2 we write the two equation systems (6) and (7) as
= (~t) 'lt(5")x~l = p~l ~( 5"){~)
(r;;;:: t)
(r
~
t)
with the revised definitions of these quantities implied in (6), (7). the Use of the canonical factorisations again reduces these equations almost to of ation factoris al canonic a have required degree. For example, ~{z) will again
374
THE RISK-SENSITIVE (LEQG) VERSION
the form (19.6) with ¢o symmetric, and the expression (19.12) of the canonical factor
rPxx ¢(z) = [ rPux d
rPxu
I 0 -fJN
l
where the submatrices are also functions of z. The optimisation of control in the case of perfect observation differs only from that of the risk-neutral case by the , substitution of this revised form of the canonical factor, and the optima l control rule is then again of the form (19.13). However, when observation is imperfect then solution of the equatio n pair (6), (7) presents a problem which was not encountered in the risk-neutral case. The two sets of equations are couple d in both directions, forward as well as back. The system (6) is now linked to the past by the occurrence of t-£) 1) in the right-hand membe r at T = taswel l as the occurrence of terms in x~l forT< t. Howev er, the real difference lies in the system (7), which was not previously linked to the future, but now is so by the fact that ).~) is non-zero for T ~ t and also by the occurrence of u1 in the right-hand membe r at T = t. Risk-sensitivity has the effect that estimates based on information acquired from the past are also affected by costs to be incurre d in the future. In Whittle (1990a) this point was dealt with in a distinctly unsatisfactory fashion. Appeal to the canonical factorisations of q,(z) and w(z) reduced the two infinite equation systems (6), (7) to a single system of 2p vector equations, these equations being related to the determination of the estimates of x-r(t- p < T::::; t) as the values minimising the sum of past and future stress. This reduce d system is readily solved in the state-structure d case p = 1, solution corresp onding simply to the recoupling step of Theore m 16.7.1. However, the form of solution in higher-order cases was not evident , and there the matter was left. One might think that, since plant noise E appear s as an effective auxilia ry control, one might simply revert to a risk-neutral formulation in which the · control u is replaced by the pair (u, «:).This view fails in the case of imperfect observation, however, because the estimates E~) for T < t are formed retrospectively. That is, if we regard E as the control wielded by Nature, then Nature has the unfair advantage that she can revise past controls in her own favour. We indicate an alternative approach in the next section which yields an explicit solution in terms of a canoni cal factorisation, but at the cost that the function being factorised is rational rather than polynomial (for dynamics of finite order). The analysis is a pleasing and interesting one, in that it demonstrates the efficacy of the innovation concept. However, we are left, for the moment, with the conclusion that one might as well revert to a state-reduction when it comes to implementation.
375
M 3 A GENERAL FORMALIS
viousness of the in Section 16.10: the non-ob We revert to the poi nt raised ve -sensiti average cost in ns (16.45) and (16.47) for the risk -·..n·mv'"-'"•-""""' of expressio We may as well the n er a stabilising policy u = Kx. •the sta te-s tru ctu red case, und a stable plant equati()n an uncontrolled one with : normalise the mo del to xT Rx. Th e dyn am ictantaneous cost function ;) ;x 1 == Ax1_ 1 + er and an ins log II + 9NTII where IT is of average cos t is 1 = (1/28 ) ! program~ng evaluation · the solutton of (9)
!
orem 13.5.2 is "Y = general evaluation f!om The e Th . 0.1) 16.1 em eor Th 1 (see 1 .sat= I -A z. P(z ) =I + fJR.91- Nd - and ere wh z)j IP( log s Ab 29) (1/ dicted pat h for the roach to evaluation of the pre Ifwe tak e the time-integral app responding to (18.4) is process the n the value of
~= [!
!oN]·
IPI so that Abs log nipulation, tha t 1~1 = ldl ld'I ma trix ma e som h wit , find We rkov case in Sec tio n w from the trea tme nt of the Ma kno we t Bu . j~j log s Ab = IPI on (18.14) wh ere 1 + Go + G1z has the can oni cal factorisati zG_1 = ) ~(z t tha 18.2 ation implies in the =G o- G-1:Q- 1G1. Th is equ ing sfy sati ue val a has II present case tha t · n I ] TI = [ I -O N s logj~l = log j@ = thus have Abs logjPj = Ab We (9). s sfie sati II ere wh s. y of the two evaluations follow log II + 9NIII, whence identit
n
n
3 A GENERAL FORMALISM
to a mo re the trees it is better to revert m fro od wo the h uis ting dis In ord er to Chapter 18 in som e ich in fact generalises tha t of abstractly pos ed problem, wh tegral 0, i.e. a sum e-in extremising a quadratic tim respects. Suppose tha t we are Xr· Suppose tha t le' iab vector 'system var a of ns ctio fun tic dra qua of over time r two components are par titi one d ({.. , 1Jr), where the the col um n vector x.- can be at time t the sec ond t t is never observed, but tha distinguished by the fact tha the integral at tim et of ed for r ~ t. Extremisation com pon ent 1Jr has bee n observ trol context above, con dge of pas t 1J values. In the is the n con diti one d by knowle (>., J.L, x).,. and TJr as having the as having the components we thi nk of com pon ent s (y.,., U..-J). has the form Suppose tha t the time-integral fiJTJr] +(e nd terms) 0 = Lb r({ ,TJ )-
e.
e.
.,.
a;e. -
376
THE RISK-SENSITIVE (LEQG) VERSION
where the sequences {ar} and {,Br} are know n and 'Y has the time-homogeneous quadratic form 1 T f ~( f {1/ ] 'Yr(~,TJ)= 2xrrxr=21 [ '~fJ ] TT [ r1/( r
[ E]
7117
'f}
7
•
The entries in the matri x are operators, so that r {{• for examp le, should be written more explicitly as r ee( f!i), a power series in the backward transl ation opera tor :!/. The integral Bthus involves cross-terms between values of the system variable at different times. We suppose symm etry of the matrix r in that f' = r. As ever, we denote the values of ~r and TJr on the extremal path constr ained by knowledge of {ry7 ; r ~ t} by ~~) and TJ~t). If x is any functi on of the path let us define ~x(tl = x(tl - x(t-l); the change in the value of x on the extremising path as the information gaine d at time tis added to that alread y available at time t - 1. Theorem 21.3.1 Suppose that infinite-horizon limits exist. Then (i) If x is any linear function of the path then ~x(t) is a matri x multiple of the 'innovation' ~~
=
'f/t-
(t-1)
TJt
.
(11)
Specifically, A
(t)-
l...l.'f}T
-
K T-1 lT•
(12)
where Ko =I andK j = Ofor j < 0. (ii) Suppose that the canonicalfactorisation , r 1P7(z) - r 77{(z)r{{(z)- 1r(77(z) = v(z)vi) v(z)
(13)
holds. Then the generating functions 00
H(z) = ''LJij zj, -oo
00
K(z) = LKj zj 0
have the evaluations
(14) Proof Extremising 0subject to information at time t we deduc e the linear equations
(all r)
(15)
(r > t)
(16)
3 A GENERAL FORMALISM
377
which then imply that
o
(1 7)
=o
(18)
r~~.6.~~~l +r{'7.6. 11~rl = r77{.6.~~~l + r 1)1).6. 11~l
But .6.ry~l is zero for r < t, and for r = t equals the 'innovation' (11). Assertion (i) then follows from this statement and equations (17), (18). If we form the generating functions H(z) and K(z) then it follows from equations (17) and (18) that r{~(z)H(z)
+ r{17 K(z) = 0
r 11{(z)H(z) + r 1J1J(z)K(z) =
G(z)
(19) (20)
where G(z) is a function whose expansion on the unit circle contains only nonpositive powers of z. Suppressing the z-argument for simplicity we then have H = -re-/r~77 K
(21)
(r1)1)- r 77~r«1 r{11 )K =G. From this last equation it follows that v0-I v K = v--!G .
But since one side of this last equation has an expansion on the unit circle in nonnegative powers and the other in non-positive powers they must both be constant, and the constant must be v01v0 Ko =I. Thus v01vK =I, which D together with (21) implies the determinations (14) of K and H. The conclusions are attractive. However, if we suppose 'pth-order dynamics' in that the matrix r(~) of the cost function (10) involves powers off/ only in the range [-p,p], then the expression factorised in (13) is not similarly restricted; it is a matrix of functions rational in z.
PAR T 5
Near-determinism and Large Deviation Theory of its Large deviation theory is enjoying an enormous vogue, both because and , viewed be can this which in ways many mathematical content and the tool. natural a proves it which for tions applica because of the large range of closer to Chapter 22 gives an introduction to the topic, by an approach perhaps before ation justific some er, Howev ilist. that of the physicist than of the probab ce to relevan clear a has theory The even that introduction would not come amiss. s respect some in r, Howeve 25. and 23 control theory, as we explain in Chapters tic stochas l essentia the of some one needs a more refined treatment to capture be read effects. We cover such refinements in Chapter 24, which could indeed now we than theory on directly, as it demands scarcely more of large deviati sketch. shall Large deviation theory is a shifted version (in quite a literal sense, as we of law the theory: see) of the two basic limit assertions of elementary probability a as d regarde large numbers and the central limit theorem. The second can be two The refined version of the first (in a certain sense; that of weak convergence). to a assertions have process versions: the convergence of a stochastic process creasing deterministic process or to a diffusion process if it is subject to an ever-in internal averaging in a sense which we make explicit in Section 22.3. process To see how large deviation theory extends these concepts, consider a x(O) and x(h) of values the e suppos and h), {x(t)} over a time-interval (0, path istic etermin limit-d the on lie not does prescribed. Suppose further that x(h) ude of the starting from x(O) (and we shall soon be more specific on the magnit between path le probab most a ine determ deviation assumed). Then one can still under that, trates demons theory n the prescribed end-points. Large-deviatio
380
NEAR-DETERMINISM AND LARGE DEVIATION THEORY
appropriate assumptions, this most probable path is just the limit-determinis tic path for a 'tilted' version of the process. It also provides a first approximation to the probability that x(h) takes the prescribed value, conditional on the value of x(O). This estimate is sufficient for many purposes, but can be improved if one approximates the tilted process by a diffusion rather than a deterministic process. Let us be somewhat more specific. Consider a random scalar x which is the arithmetic average of rc independently and identically distributed scalar random variables ~J ( j = 1, 2, ... , rc) of mean !-" and variance d2. (A symbol such as Nor n would be more conventional than rc, but these are already in full use.) Then x has mean !-" and variance rf2 j rc, and converges to !-" with increasing rc in almost any stochastic sense one cares to name (the 'law of large numbers', in its various versions). In particular, for sufficiently regular functions C(x) one has
E,,:[C(x)]
=
C(!-")
+ o(l).
(1)
for large rc. Here we have given the expectation operator a subscript rc to indicate that the distribution of x depends upon this parameter. A strengthening of (1) would be
(2) which makes clear that the remainder term in (1) may be in fact O(rc- 1) rather than anything weaker. One obtains stronger conclusions if one allows the function under the expectation to depend upon rc as well as upon x. For example, the central limit theorem amounts to the assertion (again for sufficiently regular C) that E~<{C[(x -~L)/(Y-/K,]}
= E[C(17)] + o(l)
where T/ is a standard normal variable. The large deviation assertion is that the D(x), known as the rate function, such that
~-distribution
(3)
determines a function
(4) (The precise result is Cramer's theorem, which also evaluates the rate function; see Section 22.2.) The interest is that the function under the expectation is exponential in rc, and it is this which forces the value of x contributing most to the expectation away from the central value!-"· The point of the assertion is that there is indeed such a dominant value, and that it is the value minimising C(x) + D(x). This last observation partly explains the reason for the term 'large deviation', which perhaps becomes even clearer if we consider distributions. If it is proper to allow e-c in (4) to be the indicator function of a set d then (4) becomes P~<(x
Ed) = exp[-rc inf D(x) XEd
+ o(~t)].
(5)
NEA:&.DETERMINISM AND LARGE DEVIATION THEORY
3-81
In considering the event x E done is considering deviations of x from J.l. of order one (if J.1. does not itself lie in d), whereas the probable deviations are of order 11 y'K,. One is then indeed considering deviations which are large relative to what is expected, and D{x) expresses the behaviour of the tails of the x-distribution well beyond the point at which the normal approximation is generally valid. Results of the type of (5) are extremely valuable for the evaluation of quantities such as the probability of transmission error in communication contexts, or the probability of system failure in reliability contexts. If we regard C( x) as a cost function then evaluation (4) seems perfectly tailored to treatment of the risk-sensitive case. For fixed() and large K. we would have
{6) If we wrote the left-hand member as exp(- K.9F), so defining F as a an effective cost under the risk-sensitive criterion, then we would have F = ext[C(x)
+ 9-l D(x)] + o(l),
(7)
X
where 'ext' indicates an infimum or a supremum according as () is positive or negative. This opens the intriguing possibility that the risk-sensitive treatment of control in Chapter 16 has a natural non-LQG version, at least for processes which are near-deterministic in the particular sense that large deviation theory requires. The rate function D(x) is then an asymptotic version of the discrepancy function of Section 12.1. However, the hope that all the theory based on LQG assumptions has more general validity should be qualified. If we revert to the risk-neutral case () -+ 0 then it will transpire that relation (7) becomes simply F = C(J.!.) + o(l), which is in most cases too crude to be useful-a warning that large deviation results will often be too crude. Nevertheless, they give a valuable first indication, which can be refined. Further, the rate function does in a sense encapsulate the essential stochastic features of the 77-distribution, and results such as those of the nonlinear filtering section (Section 25.5) are not at all too crude, but again encapsulate the essentials. In considering a single random variable x we have considered a static problem. As indicated above, these ideas can be generalised to the dynamic case, when one is concerned with a stochastic process {x(t)}. In this case one should see the variable K. as representing the physical (perhaps spatial) scale of a system, and x(t) as representing a physical average over the system at timet. We consider the process version in Section 224, a necessary preliminary to control applications. As we have emphasised and shall demonstrate, a large deviation evaluation such as (5) follows from the law of large numbers, and has nothing to do with normal approximation. However, in the Gaussian case, when the {-variables are normally distributed, then the central limit theorem is exact in that xis normally
382
NEAR-DETERMINISM AND LARGE DEVIATION THEORY
distributed over its whole range. The probability evaluation (5) must then necessarily be virtually coincident with the evaluation of the normal integral (after appropriate scaling of the variable) over d. Indeed, it turns out that the j2cr; the familiar negative exponent rate function D(x) has evaluation (xin the normal density. In other words, if variables are normally distributed then large deviation theory is 'exact', in that the effective asymptotic density const. exp[-D(x)] coincides with the actual density. As a further indication of the same point: if the 7)-variables are Gaussian and the function C~) in (4) is quadratic then the remainder term in the right-hand member is constant and 0(1), in that it is equal to! log[D" /( C" + D")], where the double-primed quantities are the second derivatives of C and D. It is for these reasons that results which we obtain as large deviation approximations coincide, under LQG assumptions, with results which we know from Chapter 16 to be exact.
CHAPTER 22

The Essentials of Large Deviation Theory

1 THE LARGE DEVIATION PROPERTY
There is a gain in clarity and speed if we define the large deviation property first and then motivate it, although of course this inverts the order of both history and understanding. There are many ways of approaching the subject; we follow a route which is economical and which is natural for the control context. A treatment as brief as this necessarily lacks rigour at many points, but the fact that the argument is so natural will perhaps reassure the reader that it is also rigorisable. We proceed through a succession of cases: first the 'static' case of a single random variable, then the dynamic case of a stochastic process and finally, in Chapters 23-25, the case of a controlled stochastic process.

We have already attempted a first orientation in the introductory section above. It is perhaps appropriate then to treat that introduction as the zeroth section of this chapter, and to refer to its equations as equations of this chapter.

Consider a random vector variable x whose distribution depends upon a non-negative parameter κ. One is thus considering a family of distributions indexed by κ. The corresponding probability measures and expectation operators can then be written $P_\kappa$ and $E_\kappa$ respectively, if one wishes to emphasise the dependence.

The large deviation property then states, roughly, that there exists a function D(x), the rate function, such that (5) (in the preamble to Part 5) holds for large κ and for sufficiently regular sets $\mathscr{A}$. Equivalently, at this level of rigour, (4) holds for large κ and sufficiently regular scalar functions C(x).

It is relation (5) which is usually taken as the characterising property, stated more carefully as an asymptotic inequality in one direction for closed sets $\mathscr{A}$ and in the other direction for open sets. It is the expectation characterisation (4) which is more natural in our context, however, regarded as valid at least for continuous functions C obeying appropriate growth conditions. Neither relation implies the other without supplementary conditions, but relation (5) would of course follow formally from property (4) if one took $\exp[-C(x)]$ as the indicator function of the set $\mathscr{A}$. However, such a discontinuous choice may be unacceptable, and it is because of the possible trouble associated with such discontinuities that rigour requires a more cautious formulation of assertion (5).

Why one should expect versions of either of these relations to hold and under what conditions has yet to be explained. However, note the implications. The
expression for both the probability and the expectation is exponential in κ to first order, with the coefficient of κ in the exponent determined in terms of the rate function as indicated. Further, it is a single extremising value of x which contributes dominantly to both probability and expectation. In the case of $P_\kappa(x \in \mathscr{A})$ this value of x is that value in $\mathscr{A}$ which is most probable on the basis of a kind of unnormalised density $\exp[-\kappa D(x)]$. In the case of the expectation the value of x which contributes dominantly has to compromise between achieving a high value of $\exp(-\kappa C)$ and high probability. In either case, the operation of integration (with respect to a probability measure) has been replaced by that of extremisation.

In fact, one seems to find the large deviation property in just one class of cases. This class is characterised by three properties.

(i) The parameter κ measures the scale of the stochastic system being considered. For example, both the capacity of and demand on a telephone network might be of order κ, or κ might measure the size of an insurance company.

(ii) The random variable x is an average over the system of a vector-valued random variable. Thus, it might represent the instantaneous traffic being carried per unit of capacity on the various links of the telephone network, or the current dividend which the insurance company announces.

(iii) The system has the stochastic homogeneity and ergodicity properties which imply that x behaves as the average of κ independently and identically distributed random variables, in that
$$E_\kappa(e^{\kappa\alpha x}) = e^{\kappa\psi(\alpha)+o(\kappa)} \qquad (8)$$

for some ψ and for all values of the row vector α for which the left-hand member is defined.

Let us note a few points of definition and notation. If ξ is a vector-valued random variable then

$$M(\alpha) = E(e^{\alpha\xi})$$

is its moment generating function (abbreviated to MGF). Here ξ is assumed to be a column vector and α then a row vector. M(α) certainly exists for purely imaginary α, and will exist for other values if the tails of the ξ-distribution decay at least exponentially fast. We shall have occasion to work with the function ψ(α) = log M(α), the cumulant generating function of ξ (abbreviated to CGF). These functions have a number of important properties which we summarise in Appendix 3. We use the abbreviation IID for 'independently and identically distributed'.
2 THE STATIC CASE: CRAMÉR'S THEOREM

Cramér's theorem concerns the case when relation (8) holds exactly; i.e.

$$E_\kappa(e^{\kappa\alpha x}) = e^{\kappa\psi(\alpha)} \qquad (9)$$
for some ψ for all real α for which the left-hand side exists. This will be the case (at least for κ integral, a case to which we can restrict ourselves) when x is the arithmetic mean of κ IID vector random variables $\xi_i$ with CGF ψ(α).
Theorem 22.2.1 (Cramér's theorem) Suppose that relation (9) holds for all α in some set of real values with non-empty interior. Then x has the large deviation property and the rate function has the evaluation

$$D(x) = \sup_\alpha\,[\alpha x - \psi(\alpha)]. \qquad (10)$$

Cramér's original theorem asserts the large deviation property in that (5) is shown to hold if $\mathscr{A}$ is an interval, or indeed any countable union of intervals. We shall rather give a proof of the expectation relation (4) which makes the line of proof for case (5) clear. Note the interesting evaluation (10) of the rate function: D(x) is seen to be the negative of the minimum transform (or Legendre transform) of the CGF ψ(α). This in itself raises a number of fascinating issues, which we briefly mention in Exercise 2.

We shall need to appeal to some differentiability and convexity properties of the CGF, listed in the theorems of Appendix 3. The proofs, although brief, are deferred to Appendix 3 so as not to break the argument.

Consider now the modification of the distribution of ξ by tilting, so that the expectation of a function φ(ξ) of ξ becomes rather

$$E^{(\alpha)}[\phi(\xi)] = \frac{E[\phi(\xi)e^{\alpha\xi}]}{M(\alpha)}. \qquad (11)$$
That is, one weights the original ξ-distribution by the exponential factor $e^{\alpha\xi}$ and then renormalises the distribution. This apparently arbitrary concept is in fact motivated naturally, both by the proofs of the theorems of Appendix 3 and by the use to which we shall shortly put it. We refer to α in (11) as the tilt parameter.

Theorem 22.2.2 (i) The mean of the α-tilted distribution is $E^{(\alpha)}(\xi) = \psi_\alpha$, the column vector of first differentials of ψ at α.
(ii) Define the convex function of the column vector a

$$D(a) = \sup_\alpha\,[\alpha a - \psi(\alpha)]. \qquad (12)$$

If α is chosen as the value at which the supremum is achieved in (12) then $E^{(\alpha)}(\xi) = a$.
(iii) Furthermore, at points where the derivatives exist,

$$D_a = \alpha, \qquad D_{aa} = [\psi_{\alpha\alpha}]^{-1}, \qquad (13)$$

where a and α are the corresponding values determined by (12).

Proof It follows from the definition (11) that $E^{(\alpha)}(\xi) = [\partial M(\alpha)/\partial\alpha]/M(\alpha) = \psi_\alpha$. The function D(a) defined in (12) is certainly convex, being a supremum of
linear functions of a. The supremum will be attained at the value of α determined by $\psi_\alpha = a$, so that the tilted distribution whose parameter has this value indeed has mean a.

To prove assertion (iii), write the definition (12) as

$$D(a) = \sup_\alpha G(a,\alpha).$$

It follows then by standard arguments that $D_a = G_a$ and $D_{aa} = G_{aa} - G_{a\alpha}G_{\alpha\alpha}^{-1}G_{\alpha a}$, where α is given its extremising value. These relations reduce to those asserted in (13) in the particular case (12). □

Suppose that the IID variables $\xi_j$ are given the tilted distribution (11). The correspondingly tilted expectation of a function φ(x) of the arithmetic mean $x = \kappa^{-1}\sum_{j=1}^\kappa\xi_j$ can then be written in terms of the untilted expectation as

$$E_\kappa^{(\alpha)}[\phi(x)] = M(\alpha)^{-\kappa}E_\kappa[\phi(x)e^{\kappa\alpha x}].$$
We can reverse this relationship to obtain

$$E_\kappa[e^{-\kappa C(x)}] = M(\alpha)^\kappa E_\kappa^{(\alpha)}\{e^{-\kappa[C(x)+\alpha x]}\} = e^{\kappa[\psi(\alpha)-\alpha a]}E_\kappa^{(\alpha)}\{e^{-\kappa[C(x)+\alpha(x-a)]}\} = e^{-\kappa D(a)}E_\kappa^{(\alpha)}\{e^{-\kappa[C(x)+\alpha(x-a)]}\} \qquad (14)$$

if α is taken as the extremising value in (12). This is the key relation for the proof of Cramér's theorem, which we now deduce in an expectation version under almost minimal conditions on C.
Theorem 22.2.3 (An expectation version of Cramér's theorem) Suppose the relation (9) holds for all real α for which the left-hand member is defined, these values constituting a set with non-empty interior. Suppose C(x) continuous and define D(x) by (12). Then the large deviation relation (4) holds.

Proof Choose a equal to the value of x which maximises the last curly-bracketed expression in (14), i.e. which minimises C(x) + αx. Since a and α are corresponding values this then implies a corresponding variation of α, but one ends up in any case with the relation

$$C(a) \leq C(x) + \alpha(x-a).$$

It follows from the definition (12) of D that

$$D(a) \leq D(x) - \alpha(x-a).$$

Adding these two inequalities, we see that a is indeed the value of x which minimises C(x) + D(x).
Now, since the tilted expectation of the last curly bracket in (14) cannot exceed its maximal value, we see that

$$E_\kappa[e^{-\kappa C(x)}] \leq e^{-\kappa[C(a)+D(a)]} \qquad (15)$$

for this value of α. On the other hand, since C is continuous, for any prescribed positive ε we can find a neighbourhood $\mathscr{N}(\varepsilon)$ of a such that $C(x) + \alpha(x-a) \leq C(a) + \varepsilon$ there. We then deduce from (14) that

$$E_\kappa[e^{-\kappa C(x)}] \geq e^{-\kappa[C(a)+D(a)+\varepsilon]}P_\kappa^{(\alpha)}(x \in \mathscr{N}(\varepsilon)). \qquad (16)$$

But it follows from the law of large numbers that, under the tilted distribution, x converges weakly to its expectation value a as κ increases, and so will ultimately lie in $\mathscr{N}(\varepsilon)$ with probability one for any prescribed ε. That is, the probability factor in (16) tends to unity with increasing κ, however small ε is chosen. Relations (15) and (16) then together imply (4). □

Establishment of the upper bound (15) requires very little argument; even less than we gave it (see Exercise 2). It is the lower bound (16) which appeals to deeper properties: the law of large numbers for the tilted distribution with an appropriately cunning choice of tilting parameter. If one is willing to make stronger assumptions on the behaviour of C then stronger results can be deduced; see Exercise 4.

Now that the large deviation property and the rate function evaluation (10) have been established under the hypothesis (9) it is plausible that they hold under its weakened version (8). This has indeed been demonstrated and exploited by Gärtner (1977) and Ellis (1984). See Exercises 5 and 6.
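The tilting device is easily illustrated numerically. The following sketch is not from the text (the distribution and tilt value are illustrative assumptions): it tilts a standard normal distribution by the factor $e^{\alpha\xi}/M(\alpha)$ and confirms Theorem 22.2.2 (i), that the tilted mean equals $\psi_\alpha$ (which is just α for the standard normal).

```python
import numpy as np

rng = np.random.default_rng(0)
xi = rng.standard_normal(200_000)          # samples of xi ~ N(0, 1)

alpha = 1.5                                # tilt parameter
M = np.exp(alpha**2 / 2.0)                 # MGF of N(0,1): M(alpha) = e^{alpha^2/2}
weights = np.exp(alpha * xi) / M           # tilting factor e^{alpha xi} / M(alpha)

tilted_mean = np.mean(xi * weights)        # E^{(alpha)}(xi) computed by reweighting
print(tilted_mean, alpha)                  # psi_alpha = alpha for the normal; both ~ 1.5
```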
on. For a norm al (1) Confi rm the following evaluations of CGF and rate functi + a VaT and D( x) = vector with mean p. and covariance matrix V, tj;( a) = ap. le with v- 1(x- p); just the normal exponent. For a Poissoxn+variab !(xp.; one rej p.)expectation p, tj;(a) = p(e" - 1) and D(x) = x log(x variable ential expon an For ial. factor the cognises Stirling's approximation to /p.). -log(x -1 (xjp) = (x) withm eanp, tf;(a) = -log (l- p.a) andD
!
t.t.l
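These closed forms can be checked against a direct numerical computation of the Legendre transform (10). The sketch below is illustrative rather than from the text (the grid and the value μ = 3 are arbitrary choices); it treats the Poisson case.

```python
import numpy as np

mu = 3.0
alpha = np.linspace(-5.0, 5.0, 20001)      # grid over the tilt parameter
psi = mu * (np.exp(alpha) - 1.0)           # Poisson CGF: psi(alpha) = mu(e^alpha - 1)

for x in [0.5, 3.0, 7.0]:
    D_num = np.max(alpha * x - psi)        # numerical sup over the grid, as in (10)
    D_exact = x * np.log(x / mu) - x + mu  # closed form from the exercise
    print(x, D_num, D_exact)               # the two evaluations agree closely
```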
(2) We have

$$P_\kappa(x \in \mathscr{A}) \leq E_\kappa(e^{\kappa\alpha(x-a)}) = e^{\kappa[\psi(\alpha)-\alpha a]}$$

if α and a are such that $\alpha(x-a) \geq 0$ for x in $\mathscr{A}$. If we minimise the last expression above with respect to α and a subject to this constraint then we obtain Chernoff's inequality and a probability version of (15). In particular, the commonest version of Chernoff's inequality takes the form
P"(x ~c)~ inf e"f..P(a)-ac].
(17) a;;. O for scalar x. Cramer's theore m demonstrates tha t this sim ple bou nd is surprisingly good, in tha t the leadin g ter m in the exponent (for lar ge fi.) is correct. Th e distribution of a sum of fi. IID ran dom variables could formally be obtained by Fourier invers ion of the fi.th power of the characteristic function M (W) of a summand. Cra mer's theorem demonstra tes tha t the Fourier transform can be replaced by a mi nim um transform, for cer tai n purposes and for large fi.. It achieves this analytic conclusion by pro babilistic arguments. A feature of large deviation the ory is indeed tha t there is often bot h an analytic route and a probabilistic route to con clu sio ns- cir cum sta nce s incline the investigator to one or the oth er, although habit or taste ma y in the end prevail. (3) Even the simplest form (17) of Chernoff's inequality is of great practical value. Suppose tha t one has an ass embly of fi. components which each have independent probability p of failure; the system will fail if the pro por tio n of components which fail exceeds c. Show tha t the probability of system fail ure is bou nde d above by [z-c(pz + 1 - p)]"' for any z ~ 1 Show that, if c ~ p, the n
P( system failure) ~ [Pc(l _ )1-c]"' p r( l -c ) 1-c , an expression with classic inf
ormation-theoretic significa
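The tightness of this bound is easily seen numerically. The sketch below is illustrative (the values of p, c and κ are arbitrary assumptions) and compares the bound with the exact binomial tail probability; scipy is assumed available.

```python
import numpy as np
from scipy.stats import binom

p, c, kappa = 0.1, 0.2, 200    # component failure probability, threshold, system size

# Exact probability that at least a proportion c of the kappa components fail
exact = binom.sf(int(np.ceil(kappa * c)) - 1, kappa, p)

# Chernoff bound with the information-theoretic exponent of the exercise
bound = ((p / c) ** c * ((1 - p) / (1 - c)) ** (1 - c)) ** kappa

print(exact, bound)   # the bound exceeds the exact value; the exponents agree for large kappa
```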
(4) Suppose that the vector value a which minimises C + D is unique and that C and D possess continuous second-order derivatives at a. An appeal to the central limit theorem rather than to the law of large numbers in the final expression of (14) then leads to the evaluation

$$E_\kappa\{e^{-\kappa C(x)}\} = \left[\frac{|D_{aa}|}{|C_{aa}+D_{aa}|}\right]^{1/2}e^{-\kappa[C(a)+D(a)]+o(1)}, \qquad (18)$$

where all evaluations are at a. This improves evaluation (4) in that it leaves only an o(1) remainder in the exponent.

(5) Occupation times for Markov chains Let $x_j$ be the proportion of time spent in state j over a time interval of length κ for a Markov chain with transition matrix $P = (p_{jk})$. Let x be the column vector of the $x_j$. Then (8) will hold with $\exp[\psi(\alpha)] = \lambda(\alpha)$, the maximal eigenvalue of the matrix $(p_{jk}e^{\alpha_j})$. The Gärtner-Ellis extension of Cramér's theorem then demonstrates the large deviation property for x with rate function

$$D(x) = \sup_\alpha\,[\alpha x - \log\lambda(\alpha)].$$
A very early result of this type was deduced by Miller (1961), who remarked on the surprising fact that λ(α) need not itself be an MGF, although it shares many of the properties. An alternative expression of the rate function, also illuminating, is

$$D(x) = \sup_{u>0}\left[\sum_j x_j\log\frac{u_j}{(Pu)_j}\right].$$

Note that this application corresponds to an averaging over time, and is quite distinct from our later consideration of temporal stochastic processes which show an averaging over the system.

(6) Ruin of an insurance company (Martin-Löf 1986) Suppose that an insurance company begins with capital κ and that capital then develops as a process of independent increments whose increment in unit time has CGF ψ(α). Let T be the time at which ruin occurs (i.e. when capital first runs negative). Then a classic result due to Lundberg states that
$$E(e^{\theta T}) = e^{\kappa\alpha(\theta)+o(\kappa)}$$

for large κ. Here, if θ is real, then α(θ) is the lesser of the two real α-roots of $\theta + \psi(\alpha) = 0$. The function ψ(α) is convex (see Appendix 3), so that if $\psi'(0)$ is positive (corresponding to a positive drift of capital) then the smaller root of $\psi(\alpha) = 0$ is negative, at −β, say. Cases for which ruin does not occur are understood not to contribute to the expectation, which then just reduces to the probability of ruin $e^{-\kappa\beta+o(\kappa)}$ when θ is zero. Deduce the assertion $P(T \leq \kappa y) = e^{-\kappa\beta(y)+o(\kappa)}$, where

$$-\beta(y) = \inf_{\theta\leq 0}\,[\alpha(\theta)-\theta y] = \inf_{\alpha\leq-\beta}\,[\alpha + y\psi(\alpha)].$$

The bound $e^{-\kappa\beta}$ on the probability of ruin is just the classic Lundberg bound.

3 OPERATORS AND SCALING FOR MARKOV PROCESSES

Consider a Markov process {x(t)} in continuous time, with infinitesimal generator Λ, for which a cost function
$$C(t) = \int_t^{\bar{t}}c(x)\,d\tau + \mathbb{K}(\bar{\xi}) \qquad (19)$$

has been defined. Here c and $\mathbb{K}$ are respectively instantaneous and terminal costs and $\bar{t}$ is the time at which the variable ξ = (x, t) first enters a prescribed stopping set $\mathscr{S}$, when it has terminal value $\bar{\xi}$. If we define the value function $F(x,t) = E[C(t)\,|\,x(t) = x]$, then it follows, as a special case of equation (8.16), that F(x, t) obeys the equation
$$c + \frac{\partial F}{\partial t} + \Lambda F = 0, \qquad (20)$$

a form of Kolmogorov's backward equation. It is subject to the terminal condition $F = \mathbb{K}$ in $\mathscr{S}$.

*Theorem 22.3.1 We can formally set

$$\Lambda = H\left(x, \frac{\partial}{\partial x}\right), \qquad (21)$$

where the differential operator acts only on the argument of the function of x to which Λ is applied, and not on the x-argument of H. With this understanding the backward equation (20) can be written

$$c(x) + \frac{\partial F(x,t)}{\partial t} + H\left(x, \frac{\partial}{\partial x}\right)F(x,t) = 0. \qquad (22)$$

Proof The definition (9.5) of H states that $\Lambda e^{\alpha x} = H(x,\alpha)e^{\alpha x}$. Formula (21) thus holds if Λ is applied to an exponential function of x, and so to any finite linear combination of exponentials. Just as in Section 4.6 we then make the formal identification (21) for the action of Λ on all functions in a sufficiently general class. Relation (22) then follows from (20). □
Note that if H(x, α) has a partial power series expansion

$$H(x,\alpha) = \alpha a(x) + \tfrac12\alpha N(x)\alpha^T + \cdots$$

then a(x)δt and N(x)δt can be identified with $E[\delta x\,|\,x(t) = x]$ and $E[(\delta x)(\delta x)^T\,|\,x(t) = x]$ respectively, to within terms of smaller order in δt. Here δx is the increment x(t+δt) − x(t). In particular, then,

$$\dot{x} = a(x) = H_\alpha(x, 0) \qquad (23)$$

is the deterministic approximation to the process, in that a(x) is the expected rate of change conditional on the current value of state.
So, if the process is deterministic then the DCF has exactly the form αa(x). If δx is conditionally normally distributed, then the process is a diffusion, in that

$$H(x,\alpha) = \alpha a(x) + \tfrac12\alpha N(x)\alpha^T \qquad (24)$$

exactly. For another example, suppose the process such that x shows the jump transition $x \to x + d_i(x)$ with probability intensity $\lambda_i(x)$ (i = 1, 2, ...). Then

$$H(x,\alpha) = \sum_i\lambda_i(x)\big(e^{\alpha d_i(x)} - 1\big). \qquad (25)$$

Just as in the static case, one requires a concept of indefinitely increasing scale if large deviation concepts are to be applicable. The natural way of achieving this is to suppose that the DCF of the process is in fact $\kappa H(x, \kappa^{-1}\alpha)$, where κ is a large positive parameter and H(x, α) is a 'standardised' or 'unscaled' DCF. For integral κ this is equivalent to the assumption that the increment δx(t) = x(t+δt) − x(t) for the scaled process is the average of κ realisations of the increment for the unscaled process, these realisations being independent conditional on the value x of x(t). This would correspond again to the idea that x is an average over a system of size κ, the system having the homogeneity and ergodicity properties which ensure that increments of this average are effectively the average of κ conditionally independent increments. However, even if the system is Markov, the requirement that the process {x(t)} constituted by the system average x should also be Markov is a very strong one. Nevertheless, there are natural examples.

If the unscaled process is a diffusion then we see from (24) that the scaled process is also a diffusion with DCF $\alpha a(x) + \tfrac12\kappa^{-1}\alpha N(x)\alpha^T$. That is, the effect of scaling is simply to suppose that the white-noise component in the plant equation is scaled by a factor $\kappa^{-1/2}$. For a non-diffusion example, consider the case

$$H(x,\alpha) = u(e^\alpha - 1) + \rho x(e^{-\alpha} - 1) \qquad (26)$$

with scalar x and α; a special case of (25). One can interpret x as the number of particles in a chamber which particles enter in a Poisson stream of rate u (later to be taken as a control variable) and leave independently with individual probability intensity ρ. The scaled process represents the same process scaled up by a factor of κ, in that it amounts to κ replicas of the original process with x now representing an 'average' in that it is the number of particles per replica. In fact, the set of replicas pool to a Markov process in that the pooled version is equivalent to a single chamber with immigration rate κu.
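The convergence of the scaled chamber to its fluid limit $\dot{x} = u - \rho x$ can be seen directly in simulation. The sketch below is illustrative rather than from the text (the values u = 1, ρ = 0.5 and κ = 500 are arbitrary assumptions); it runs the pooled chamber by the standard event-by-event (Gillespie) method.

```python
import numpy as np

rng = np.random.default_rng(1)
u, rho, kappa = 1.0, 0.5, 500        # immigration rate, departure intensity, scale factor
n, t, T = 0, 0.0, 10.0               # particle count, clock, simulation horizon

while t < T:
    rate_in, rate_out = kappa * u, rho * n      # pooled intensities: arrivals and departures
    total = rate_in + rate_out
    t += rng.exponential(1.0 / total)           # exponential time to the next event
    if rng.random() < rate_in / total:
        n += 1                                  # an arrival
    else:
        n -= 1                                  # a departure

x_sim = n / kappa                                   # the system average at time T
x_fluid = (u / rho) * (1.0 - np.exp(-rho * T))      # fluid limit started from x(0) = 0
print(x_sim, x_fluid)                               # both close to u/rho = 2
```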
The backward equation (22) becomes

$$c + \frac{\partial F}{\partial t} + \kappa H\left(x, \kappa^{-1}\frac{\partial}{\partial x}\right)F = 0 \qquad (\xi\notin\mathscr{S}) \qquad (27)$$

if the process is scaled by a factor κ. If we retain only terms of zeroth order in κ then this becomes
$$c + \frac{\partial F}{\partial t} + F_x a = 0 \qquad (\xi\notin\mathscr{S}). \qquad (28)$$

If we retain terms up to order $\kappa^{-1}$ then it becomes

$$c + \frac{\partial F}{\partial t} + F_x a + \frac{1}{2\kappa}\,\mathrm{tr}(NF_{xx}) = 0 \qquad (\xi\notin\mathscr{S}). \qquad (29)$$
Predictably, these are just the forms of the dynamic programming equation which will hold if the process is approximated by a deterministic or a diffusion process respectively. The deterministic approximation to a process is often referred to as the 'fluid approximation'; see Exercise 2.

In deducing (28) or (29) we have supposed the cost function C itself independent of κ. It is when we allow an exponential dependence of cost on κ (natural under a variety of circumstances) that we obtain the generalised forms of these equations which imply large deviation effects.
Exercises and comments

(1) We saw in Section 10.4 how scaling could come about in a superficially somewhat different fashion. We considered a birth-death process on the integers with birth and death intensities $\lambda_j$ and $\mu_j$ which were slowly varying with j. This slow variation was best expressed by representing these intensities as functions κλ(x) and κμ(x) of a variable x = j/κ, where κ is a large parameter. The x-process is then just the κ-scaled version of a process with DCF

$$H(x,\alpha) = \lambda(x)(e^\alpha - 1) + \mu(x)(e^{-\alpha} - 1).$$

In the fisheries context j would represent the number of fish in the population (for a given species and region) and x a measure of 'stock' in units natural for a region of extent κ.

(2) The term 'fluid approximation' is used for the deterministic approximation, just because it represents the flow in state space of the 'fluid' constituted by points representing the individual states of independent replicas of the process. This corresponds to a simple averaging of trajectories for these replicas, and so is a cruder idea than scaling, which constructs a single trajectory by averaging replica trajectory increments. The effect is the same in the deterministic limit, but in general only in this limit.

4 THE RATE FUNCTION FOR A MARKOV PROCESS

Consider the conditional expectation
$$G_\kappa(x,t) = E_\kappa\big(e^{-\kappa C(t)}\,\big|\,x(t) = x\big). \qquad (30)$$
That we consider an exponential of cost should seem reasonable after our discussion of risk-sensitivity and the exponential-of-cost criterion in Chapter 16; another motivation will emerge in Chapter 24. That the scale parameter κ should occur in the exponent is also reasonable, because the κ-independent quantity C must be seen as a system-average cost, and so κC as the total system cost.

$G_\kappa$ will obey a backward equation appropriate to the scaled process; we find this to be

$$-\kappa cG_\kappa + \frac{\partial G_\kappa}{\partial t} + \kappa H\left(x, \kappa^{-1}\frac{\partial}{\partial x}\right)G_\kappa = 0 \qquad (\xi\notin\mathscr{S}), \qquad (31)$$

with terminal condition $G_\kappa = e^{-\kappa\mathbb{K}}$ in $\mathscr{S}$. Assume now that $G_\kappa$ shows the exponential κ-dependence

$$G_\kappa(x,t) = e^{-\kappa F(x,t)+o(\kappa)}, \qquad (32)$$

so that F(x, t) can be regarded as a value function on the original cost scale.
*Theorem 22.4.1 Relation (32) holds and the value function F(x, t) obeys the equation

$$c + F_t - H(x, -F_x) = 0 \qquad (\xi\notin\mathscr{S}), \qquad (33)$$

with terminal condition $F = \mathbb{K}$ in $\mathscr{S}$.

Proof If x were scalar then the relation

$$e^{\kappa F}\left(\kappa^{-1}\frac{\partial}{\partial x}\right)^j e^{-\kappa F} = (-F_x)^j + o(1) \qquad (j = 0, 1, 2, \ldots) \qquad (34)$$
would hold, for large κ. There is an obvious vector analogue. Inserting expression (32) into (31), appealing to the vector analogue of (34) and retaining only leading terms in κ we obtain (33). The fact that we obtain a valid κ-independent equation for F supplies the basis of an inductive proof (working backwards from termination) that (32) holds. □

The argument is obviously not rigorous. In particular, there would be a Fourier version which would avoid appeal to possibly indefinite differentiability of F. However, the argument and the conclusion are perhaps simple enough to carry conviction.

It is the passage from the linear but in general higher-order differential equation (31) to the first-order but in general non-linear differential equation (33) which marks the passage from exact evaluation to large-deviation evaluation. This is virtually identical with the passage from the equations of wave optics to the eikonal equation of geometric optics as wavelength decreases, or from the
equations of wave mechanics to the Hamilton-Jacobi equation of classical mechanics as Planck's constant is assumed ever smaller. This is an indication that large deviation theory is indeed well rooted in physics: in the asymptotics of passage from diffuse wave propagation to a deterministic trajectory as wavelength is decreased.

We come now to the large deviation assertion.
*Theorem 22.4.2 (i) Equation (33) and its terminal condition $F = \mathbb{K}$ in $\mathscr{S}$ have the unique solution

$$F(x,t) = \inf_{x(\cdot)}\sup_{\alpha(\cdot)}\left\{\int_t^{\bar{t}}[c(x) + \alpha\dot{x} - H(x,\alpha)]\,d\tau + \mathbb{K}(\bar{\xi})\right\}. \qquad (35)$$

Here the extremisations with respect to the paths of the functions x(τ) and α(τ) are subject to the prescription of initial value x(t) = x, and $\bar{t}$ is the value of τ at which ξ(τ) = (x(τ), τ) first enters $\mathscr{S}$.

(ii) The scaled Markov process has the large deviation property, and its rate function for realisations over the time interval (0, h), conditional on the value of x(0), is

$$D_{0h}[x(\cdot)] = \sup_{\alpha(\cdot)}\int_0^h[\alpha\dot{x} - H(x,\alpha)]\,d\tau. \qquad (36)$$
*Proof Assertion (i) certainly implies a limited version of assertion (ii). If we choose t = 0 and specify stopping at time h with zero terminal cost, then (31), (33) imply assertion (4) with C of the particular form $\int c(x)\,d\tau$. This form is general enough for our purposes, but further generalisation is plainly possible.

To establish (35), consider the expression

$$\phi(x,t) = \inf_{x(\cdot)}\sup_{\alpha(\cdot)}\left\{\int_t^{\bar{t}}[\alpha\dot{x} - Q(x,\alpha)]\,d\tau + \mathbb{K}(\bar{\xi})\right\}, \qquad (37)$$

where the extremisation is subject to x(t) = x. This satisfies the terminal condition $\phi = \mathbb{K}$ in $\mathscr{S}$. The stationarity conditions with respect to the paths are $\delta x = Q_\alpha\,\delta t$ and $\alpha = -\phi_x$, and substitution of these back into (37) shows that

$$\phi_t - Q(x, -\phi_x) = 0 \qquad (\xi\notin\mathscr{S}). \qquad (38)$$

But if the integral expression for φ satisfies (38) and its terminal condition then the right-hand member of (35) satisfies (33) and its terminal condition. Since these
are uniquely determining, we see that expression (35) indeed provides the determination of F that we seek. □

Actually, we have appealed only to stationarity conditions with respect to the x- and α-paths, in which case there is no difficulty with the commutation of operations implicit in our 'proof'. However, these conditions can be strengthened to the sup/inf characterisation if the integrand c(x) + αẋ − H(x, α) is convex in x and concave in α. We know indeed that there can be failures: the 'neurotic' and 'euphoric' breakdowns of Section 16.3 marked just such points of failure, and the breakdown had significance.

We see in (36) the Markov-process analogue of the static evaluation (10), with the DCF H(x, α) replacing the CGF ψ(α). The single variable α which tilted the distribution is now replaced by the function of time α(τ) which performs a time-dependent tilt, sometimes referred to as a 'twist'. We return to this point in the next section.

In (35) we see a time-integral analogous to the time-integral (7.2) of the Pontryagin principle and the time-integral (16.52) of the LEQG treatment. This again raises the hope that there is a version of the Pontryagin maximum principle which transfers to a stochastic case much more general than that of the LEQG model. We shall investigate the matter in the next two chapters.

Note that the reduction of the expectation (30)/(32) to the evaluation (35) on a single extremising trajectory has essentially reduced evaluation of a path-integral to the evaluation of an extremal time-integral. This is exactly what large deviation theory achieves, at least at its crudest level.

Note also that the occurrence of $\dot{x}$ in the expression (36) for the rate function does not imply that $\dot{x}$ is assumed to exist for the stochastic process itself. Large deviation approximations make assertions only on the grosser aspects of the path (which is why the expectation characterisation (4) is preferable to the probability characterisation (5)), and the fact that x makes changes of number O(κ) and of size $O(\kappa^{-1})$ in a given interval of time means that it indeed has something like an effective rate of change for large κ.
5 HAMILTONIAN ASPECTS

The extremal path x(·) determined by the evaluation of expression (35) is 'optimal' in that it compromises best between high probability and low cost (or, equivalently, low discrepancy and low cost). We shall term it limit-optimal, since we shall wish to consider stochastic deviations from it in a more refined treatment. A special case is that in which the process is considered over the time interval (0, h] and there is no cost at all except that which effectively prescribes the terminal value x(h). The initial value x(0) is of course prescribed. The limit-optimal path in this case is then the (asymptotically) most probable path between the prescribed end-values. It minimises expression (36)
for the rate function with respect to this path, and so is subject to the pair of equations

$$\dot{x} = \frac{\partial H}{\partial\alpha}, \qquad \dot{\alpha} = -\frac{\partial H}{\partial x} \qquad (0 \leq t \leq h) \qquad (39)$$

with x(0) and x(h) prescribed. This conclusion can be expressed as
Theorem 22.5.1 The asymptotically most probable path obeys Hamiltonian dynamics in the variable pair (x, α) with Hamiltonian just the DCF H(x, α).

This would in fact already be clear from the fact that the (x, α) path extremises the integral (36). Indeed, 'Hamiltonian dynamics' can be defined just as those which arise by extremisation of an integral of form (36), with H(x, α) defined as the Hamiltonian and equations (39) necessarily holding on the extremal path. The conjugate variable α supplies the time-dependent 'tilting' of the process which guides it by the most probable route to the prescribed terminal value x(h). It acts then as a kind of force on the stochastic dynamics which is equivalent to the conditioning that the path should terminate at a prescribed value. Indeed, if x(h) is chosen as the most probable value for the prescribed initial value x(0) then α is identically zero on the path; see Exercise 1.

We began with a Markov process of no particular structure apart from the fact that it could be scaled, and yet Hamiltonian structure has now emerged. One naturally asks why this should be. The answer is that Hamiltonian structure will always emerge if one specifies dynamics only partially, and completes the specification by the imposition of an extremal principle. Thus, Hamiltonian structure emerges in the Pontryagin treatment of optimal deterministic control because the partial dynamic specification constituted by the plant equation is complemented by the optimality condition which determines the control. In the present case, one may say that specification of a stochastic process specifies only 'slack' dynamics; if one then adds the self-generated extremal principle that one seeks the most probable path (or the path which compromises best between high probability and low cost), then one has exactly the situation which generates Hamiltonian structure. Note, however, that the structure emerges only asymptotically, in the large-scale limit.

If we actually perform the α-extremisation in (36) then we obtain
$$D_{0h}[x(\cdot)] = \int_0^h D(x,\dot{x})\,d\tau \qquad (40)$$

where

$$D(x,\dot{x}) = \sup_\alpha\,[\alpha\dot{x} - H(x,\alpha)] \qquad (41)$$

can be regarded as the rate of increase of discrepancy with time along the path. In classical-mechanical contexts one would regard expression (40) as the action integral and $D(x,\dot{x})$ as the Lagrangian.
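The two-point boundary problem posed by (39) can be solved numerically by shooting on the initial value of α. The sketch below is a hypothetical illustration, not from the text: it takes the birth-death DCF $H(x,\alpha) = \lambda(e^\alpha - 1) + \mu(x)(e^{-\alpha} - 1)$ with the assumed rates λ = 1 and μ(x) = x (so that the deterministic rest point is x = 1) and computes the most probable path from x(0) = 1 to the unlikely level x(h) = 2.

```python
import numpy as np
from scipy.integrate import solve_ivp
from scipy.optimize import brentq

lam = 1.0                        # constant birth rate (assumed)
h, x0, xh = 2.0, 1.0, 2.0        # horizon and prescribed end-values

def hamilton(t, y):
    """Hamiltonian equations (39) for H = lam(e^a - 1) + x(e^{-a} - 1)."""
    x, a = y
    dx = lam * np.exp(a) - x * np.exp(-a)    # dx/dt =  dH/d(alpha)
    da = -(np.exp(-a) - 1.0)                 # da/dt = -dH/dx, since mu'(x) = 1
    return [dx, da]

def miss(a0):
    """Terminal error x(h) - xh when the initial tilt is a0 (shooting function)."""
    sol = solve_ivp(hamilton, (0.0, h), [x0, a0], rtol=1e-9)
    return sol.y[0, -1] - xh

a0 = brentq(miss, 0.0, 2.0)      # a0 = 0 gives the deterministic path, on which x stays at 1
sol = solve_ivp(hamilton, (0.0, h), [x0, a0])
print(a0, sol.y[0, -1])          # the required initial twist, and x(h) close to 2
```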
In the costed situation the equations (39) determining the limit-optimal path are modified by modification of the Hamiltonian to H(x, α) − c(x):

$$\dot{x} = \frac{\partial H}{\partial\alpha}, \qquad \dot{\alpha} = \frac{\partial(c-H)}{\partial x} \qquad (t < \tau < \bar{t}). \qquad (42)$$

The termination coordinates $(\bar{x}, \bar{t})$ will be determined by the extremal conditions implied in (35). Since we can identify α with $-F_x$ the path obeys the 'twisted' plant equation

$$\dot{x} = H_\alpha(x, -F_x). \qquad (43)$$
Exercises and comments

(1) The requirement that x(h) should also be chosen to minimise the rate function (36) yields the condition α(h) = 0. This then implies that ∂H/∂x = 0 and so $\dot{\alpha} = 0$ at time h. The second equation of (39) is to be regarded as a backward equation in time; continuation of this argument implies that α is identically zero at all points on the path. The most probable path is then given by $\dot{x} = (\partial H/\partial\alpha)_{\alpha=0}$. This is the equation which we have already determined in (23) as governing the path in the deterministic limit.

(2) Show that for the diffusion process specified by (24)

$$D(x,\dot{x}) = \tfrac12[\dot{x} - a(x)]^T N(x)^{-1}[\dot{x} - a(x)].$$

(3) Weiss (see Vanderbei and Weiss, 1988) has shown that for the jump process specified by (25)

$$D(x,\dot{x}) = \inf_u\left\{\sum_i[u_i\log(u_i/\lambda_i) + \lambda_i - u_i] : \sum_i u_i d_i = \dot{x}\right\}$$
with the possible x-dependence of the $\lambda_i$ and $d_i$ understood. Prove this by taking a Lagrangian multiplier α for the constraint indicated.

This relates interestingly to an approximation proposed many years ago by Bartlett (1955, 1960). Bartlett regarded the process as one of superimposed Poisson streams, the ith having instantaneous rate $\lambda_i$ and contributing $d_i$ at each event. If one regards $n_i = u_i\delta t$ as the number of such events in a time interval of length δt and regards $n_1, n_2, \ldots$ as jointly independent and Poisson, then the expression above is the negative logarithm of the probability of the most probable set of $n_i$ which would yield $\delta x = \dot{x}\,\delta t$ for prescribed $\dot{x}$.

6 REFINEMENTS OF THE LARGE DEVIATION PRINCIPLE

The large deviation treatment certainly takes account of the stochastic character of the model, in that it is based on the rate function which expresses the asymptotic essence of this character. On the other hand, it is quite crude in that it
reduces consideration of events to consideration of a single path: the limit-optimal path. The form of the path is dependent on both the stochastics and the costs assumed for the model, but stochastic variation about the path is neglected.

We shall see in the next chapter, when we bring in the idea of control, that there are circumstances under which the optimisation of stochastic control is adequately treated by large deviation methods. Indeed, the treatment is a beautiful and inevitable one, in that it exhibits the time-integral methods of Part 4 as applicable to a very much more general class of processes than the LQG or LEQG processes of that Part (although under the assumptions of large scale etc). One can say that this then also provides a natural extension of the Pontryagin maximum principle to a general class of stochastic problems, since the time-integral methods amount to just such a principle.

However, the treatment can prove inadequate at sensitive points on the path. Consider, for example, a model we shall treat in Chapter 24: a stochastic version of the landing problem of Section 7.11. We saw in that section that if the 'aircraft' began in so severe a dive that a crash (i.e. 'premature landing') could only be avoided by extreme measures, then the optimal path broke into two analytically distinct sections. In the first section one brought the plane out of the dive in a ground-grazing save; after that one could manoeuvre towards the desired terminal configuration in a more relaxed manner. In the large deviation treatment of the stochastic version of this model one will have the same phenomenon, although the limit-optimal path will now compromise between high probability and low cost. However, the notion that this path should graze the ground is unacceptable. At the grazing point stochastic deviation from the path will become important, and will force one to in fact ensure a clearance between plane and ground. The size of clearance one should aim for can only be determined by a more refined consideration of stochastic effects.

The situation is really exactly that of Section 10.7. We saw there that one could well use the optimal deterministic rule for a stochastic model as long as the second differentials of the stochastic value function were fairly constant in the neighbourhood. It is when one approaches critical points on the path that this condition is violated. We saw in Section 10.7 that an undesired termination point is just such a critical point; the grazing point in the example just considered has the same character.

To derive a more refined result, consider again the backward equation (31), which is exact. Assume however the refined version

$$G_\kappa(x,t) = e^{-\kappa F(x,t)-F_1(x,t)+o(1)} \qquad (44)$$

of (32). Then, by including the term of next order ($\kappa^{-1}$) in relation (34) one can extend the argument of Theorem 22.4.1 to conclude that F satisfies the same equation
$$c + F_t - H(x, -F_x) = 0, \qquad (33)$$
as previously, and that $F_1$ satisfies the equation

$$F_{1t} + F_{1x}H_\alpha(x, -F_x) + \tfrac12\,\mathrm{tr}[F_{xx}H_{\alpha\alpha}(x, -F_x)] = 0. \qquad (45)$$
That is, an extra cost is incurred from the point (x, t): the time-integral along the limit-optimal path starting from (x, t) of $\tfrac12\,\mathrm{tr}[F_{xx}N(x,t)]$, where N(x, t) is the effective noise covariance matrix $H_{\alpha\alpha}(x, -F_x)$. See Exercise 1.

This conclusion is analogous to that obtained in (29). It says essentially that the process is to be regarded as a diffusion process, for which to the deterministic drift (43) that would take it along the limit-optimal path is added a Gaussian white noise component with covariance matrix N(x, t). We have thus returned to a fully stochastic model, and one which is close to the original model if that was indeed itself a diffusion process.

Exercises and comments
(1) We saw in equation (43) that the limit-optimal orbit satisfied the equation $\dot{x} = H_\alpha(x, -F_x) = a(x,t)$, say. (At least, this holds so long as the orbit is 'free'; i.e. does not encounter constraints or additional penalties.) Show that the solution φ of the equation $\phi_t + \phi_x a(x,t) + \psi(x,t) = 0$ is just the time-integral of ψ along the limit-optimal orbit from the point (x, t) to the termination point.

7 EQUILIBRIUM DISTRIBUTIONS AND EXPECTED EXIT TIMES

There is a class of ideas which lies somewhat off our principal line of argument, but which nevertheless provides one of the truly interesting applications of large deviation concepts in control theory, and of which we have already seen an instance in Chapter 10. It involves notions of stationarity, quasi-stationarity, recurrence and first passage for which a brief treatment is necessarily cavalier and brief 'proofs' necessarily no more than plausibility arguments.

Consider again a scaled Markov process with unscaled DCF H(x, α) and suppose that x(t) has an effective 'density'

$$\pi(x,t) = e^{-\kappa U(x,t)+o(\kappa)}. \qquad (46)$$
*Theorem 22.7.1 U(x, t) as defined above obeys the forward equation

$$\frac{\partial U}{\partial t} = -H\left(x, -\frac{\partial U}{\partial x}\right). \qquad (47)$$

*Proof The following argument exploits earlier analysis; a better argument is sketched in Exercise 1. Suppose initial conditions specify the value of x(0). Then we know from the evaluation (36) of the rate function that (46) holds with
$$U(x,t) = \inf_{x(\cdot)}\sup_{\alpha(\cdot)}\int_0^t[\alpha\dot{x} - H(x,\alpha)]\,d\tau, \qquad (48)$$

where the infimum with respect to x(·) is subject to prescription of x(0), t and x(t) = x. Just as we found that the function (35) of initial state satisfied equation (33), so we find that the function (48) of state at the upper integration limit satisfies (47), whatever x(0). □
Suppose now that there is a value $x_*$ at which the equilibrium density π(x) is maximal; this would be the preferred equilibrium point in a deterministic treatment. Define the quasipotential

$$U(x) = \inf_{x(\cdot)}\sup_{\alpha(\cdot)}\int_0^{\bar{t}}[\alpha\dot{x} - H(x,\alpha)]\,d\tau, \qquad (49)$$

where the set of paths x(·) admitted prescribe $x(0) = x_*$, and $\bar{t}$ is the smallest value of τ for which x(τ) = x on the path.

*Theorem 22.7.2 The normalised equilibrium density for the scaled process is

$$\pi(x) = e^{-\kappa U(x)+o(\kappa)}. \qquad (50)$$
Proof One finds that expression (49) satisfies the equilibrium form of (47). The property $U(x_*) = 0$ implies normalisation to the order of magnitude indicated, since then $\int\pi(x)\,dx = e^{o(\kappa)}$. □
Consider, as an example, the scaled birth-death process defined in Exercise 3.1. For this we find that

$$U(x) = \int_{x_*}^{x}\log[\mu(y)/\lambda(y)]\,dy = \int_{x}^{x_*}\log[\lambda(y)/\mu(y)]\,dy; \qquad (51)$$
see Exercise 2. Relations (50) and (51) are consistent with the known equilibrium distribution given in equation (10.12).
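Relation (51) can be checked directly against the exact product-form equilibrium of a birth-death chain. The sketch below is illustrative and not from the text: it assumes the scaled rates λ(x) = 1 and μ(x) = x, for which $x_* = 1$ and $U(x) = x\log x - x + 1$, and compares $-\kappa^{-1}\log\pi_j$ at x = j/κ with the quasipotential.

```python
import numpy as np

kappa = 200
lam = lambda x: 1.0            # scaled birth rate (immigration at total rate kappa)
mu = lambda x: x               # scaled death rate (unit intensity per individual)

# Exact equilibrium: log pi_j = log pi_{j-1} + log(birth rate at j-1) - log(death rate at j)
J = 4 * kappa
logpi = np.zeros(J + 1)
for j in range(1, J + 1):
    logpi[j] = logpi[j - 1] + np.log(kappa * lam((j - 1) / kappa)) \
                            - np.log(kappa * mu(j / kappa))
logpi -= logpi.max() + np.log(np.sum(np.exp(logpi - logpi.max())))   # normalise

for x in [0.5, 1.0, 2.0]:
    j = int(x * kappa)
    U = x * np.log(x) - x + 1.0          # quasipotential (51) for these rates
    print(x, -logpi[j] / kappa, U)       # agreement up to O(log(kappa)/kappa)
```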
Suppose now that $x_*$ is only a local maximum of π(x); in the deterministic version of the process it then constitutes a locally stable equilibrium point with basin of attraction $\mathscr{B}$, say. In the stochastic version of the problem escape from $\mathscr{B}$ is possible and, in the long run, certain for an irreducible process. However, one would like to assert that the escape time T, a random variable, is large in some sense. For example, for the harvesting example of Sections 10.4 and 10.5 the threshold point c was the most probable value and was locally stable in the deterministic limit. However, if extinction was possible it was also certain, but one would look for a demonstration that the expected time to extinction would increase as increasing scale factor κ implied approach to determinism.

*Theorem 22.7.3 If T is the escape time from the basin of attraction $\mathscr{B}$ of $x_*$ then

$$E(T) = \exp[\kappa\inf_{x\in\partial\mathscr{B}}U(x) + o(\kappa)], \qquad (52)$$

where $\partial\mathscr{B}$ is the boundary of $\mathscr{B}$.

The expected escape time is thus exponentially large in κ. One might imagine that, if one defined $F(x) = E(T\,|\,x(0) = x)$ for x in $\mathscr{B}$, then one could set up a backward equation for F(x) and solve its large-deviation version. If one does so one finds that F(x) is essentially constant in $\mathscr{B}$, which is why there is no mention of initial value in the theorem. The explanation is that, for any starting point which is not too close to the boundary of $\mathscr{B}$, the probability tends to unity with increasing κ that the path rapidly makes its way to a prescribed neighbourhood $\mathscr{N}$ of $x_*$ before it ultimately reaches the boundary of $\mathscr{B}$. The escape problem thus 'standardises' itself to $x(0) = x_*$.

The alternative argument which we offer is extremely rough, but mirrors the actual course of events. Vary the neighbourhood $\mathscr{N}$ with κ so as to make the integral of π(x) over $\mathscr{N}$ approximately independent of κ. Every excursion of the process from $\mathscr{N}$ can be regarded as an attempt to reach the boundary of $\mathscr{B}$, which is successful with probability

$$p = \exp[-\kappa\inf_{x\in\partial\mathscr{B}}U(x) + o(\kappa)].$$

Consecutive attempts are statistically independent, so the probability of success at the jth attempt is $p(1-p)^{j-1}$ (j = 1, 2, ...) and the expected number of attempts needed until escape is achieved is $p^{-1}$. The expected time needed to escape is thus not less than $\sigma p^{-1}$, where σ is the expected duration of an excursion from $\mathscr{N}$; this will be approximately independent of κ. We thus confirm expression (52) as at least a lower bound on E(T). □

The argument is obviously extremely loose, but conveys a picture of actual events. Expression (52) in fact gives the expected number of 'escape attempts' needed, to the order indicated, and most of the escape time T is spent relatively close to the deterministic equilibrium value $x_*$.

The asymptotic evaluation (10.18)/(10.20) of expected extinction time for the population model, derived by direct arguments, is consistent with (52) and determination (51) of the quasipotential in this case. The value x = 0, at which extinction occurs, constitutes the boundary of $\mathscr{B}$. The asymptotic evaluation (10.20) holds also for the two other stochastic population models considered in Chapter 10, with the determinations of quasipotential implied by formulae (10.39) and (10.51) for R(x).
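The exponential growth (52) is visible in simulation even at modest scale. The following sketch is a hypothetical illustration: for the assumed rates λ(x) = 1 and μ(x) = x used above it measures the mean first-passage time of the scaled chain from $x_* = 1$ down to the level x = 1/4, for several κ; log E(T) should then grow roughly linearly in κ with slope near U(1/4) ≈ 0.40.

```python
import numpy as np

rng = np.random.default_rng(2)

def mean_escape_time(kappa, runs=20):
    """Mean time for the scaled chain x = j/kappa to fall from x = 1 to x = 1/4."""
    times = []
    for _ in range(runs):
        j, t = kappa, 0.0
        while j > kappa // 4:
            up, down = kappa * 1.0, float(j)   # intensities kappa*lam(x) and kappa*mu(x) = j
            t += rng.exponential(1.0 / (up + down))
            j += 1 if rng.random() < up / (up + down) else -1
        times.append(t)
    return np.mean(times)

for kappa in [8, 12, 16]:
    print(kappa, np.log(mean_escape_time(kappa)))   # roughly linear in kappa
```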
Exercises and comments

(1) Forward equations are more naturally expressed in terms of transforms, as we have emphasised in Section 12.9. Define M(α, t) as the MGF of x(t) for prescribed initial conditions. Then the forward analogue of (22) is $\partial M/\partial t = H(\partial/\partial\alpha, \alpha)M$, an operator form of the Kolmogorov equation derived by Bartlett (1949). If the process is now scaled by a factor κ and we assume $M(\alpha,t) = \exp[\kappa\psi(\kappa^{-1}\alpha, t) + o(\kappa)]$, then we find, as in Section 3, that the eikonal form of this last equation is $\partial\psi/\partial t = H(\partial\psi/\partial\alpha, \alpha)$. This and the relation $U(x,t) = \sup_\alpha[\alpha x - \psi(\alpha,t)]$ imply the Kolmogorov equation in the form (47).

(2) Since the boundary conditions are time-independent in (49) then H(x, α) is zero on the extremal path and the integral reduces to $\int_0^{\bar{t}}\alpha\dot{x}\,d\tau$. For the case of scalar x this reduces to $\int_{x_*}^{x}\alpha(y)\,dy$ where α(y) is the function of y determined by H(y, α) = 0. For the birth-death process of Exercise 3.1 we have then $\alpha(y) = \log[\mu(y)/\lambda(y)]$, whence (51) follows. Note that the value of $\dot{x}$ determined for the extremal path is μ(x) − λ(x), which is the velocity of the time-reversed deterministic process. This cannot be the actual velocity, and the time-argument of the integral (49) does not represent actual time, although the fact that μ − λ is zero at $x_*$ does reflect the difficulty the process has in leaving the neighbourhood of this point.
Notes on the literature

For a long time the standard texts were those due to the very significant early figures in the development of the theory: Varadhan (1984), Stroock (1984), Freidlin and Wentzell (1984), Ellis (1985); Donsker is a key figure in the journal literature. To a large extent the only stochastic processes considered were diffusions, and the Hamiltonian formalism of Section 4 is not really to be found. Bucklew (1990) reviewed applications. The unpublished notes circulated by Vanderbei and Weiss (1988) were widely valued; they considered jump processes occurring in telecommunication applications. A developed form of these notes appears as Shwartz and Weiss (1995). Other recent texts are those of Deuschel and Stroock (1989) and Dembo and Zeitouni (1993).

The operator formalism of Section 3 is due to Bartlett (1949). It provides the natural underpinning for the formalism of Section 4. This was developed in Whittle (1990b), although it must be contained in the physical literature in the general discussion of the Hamilton-Jacobi equation and its quantum equivalents. The approach contrasts with that of the probabilists, who employ martingale arguments and the like. Probabilistic arguments can be powerful and insightful; the technique of 'tilting' a distribution or 'twisting' a process provides an example.

Freidlin and Wentzell (1984) gave a very complete treatment of the expected exit time from a basin of attraction (and of transition between such basins) for the
case of a diffusion process. Tegeder (1993) elucidated the whole matter very clearly for the general Markov case.

Many authors have considered refinements of the basic large deviation results, usually by direct probabilistic arguments; see e.g. Azencott (1982, 1984) and Ben Arous (1988). The modified backward equations (33), (45) are deduced in Whittle (1990b, 1995), where the treatment is continued to derive formulae which are close to, but interestingly different from, formulae due to van Vleck (1928) for the quantum-mechanical context.
CHAPTER 23
Control Optimisation in the Large Deviation Limit The material of the last chapter has an immediate formal generalisation to the controlled (Markov) case, with the dynamic programming equation replacing the various backward equations. The time-integral solution in the risk-sensitive ('exponential-of-cost') formulation persists, with the addition of a u-extremisation. This form of solution implies validity of an asymptotic maximum principle, but one which is in general more fairly interpreted as a risk-sensitive maximum principle than as a stochastic maximum principle. 1 SPECIFICATION AND THE RISK-NEUTRAL CASE We now generalise the material of the last chapter to the controlled case. We assume the usual cost function, so that the future cost at timet is
C(t) =
li
c(x, u) dr + IK(~).
(1)
Here tis the time at which the process terminates, specified as the value of r for which ~(r) = (x(r),r) first enters the prescribed stopping set!/, and~ is the terminal value of~. Markov dynamics are economically specified in terms of the DCF:
E[e" 6x(t)IX(t), U(t)] = 1 +H(x(t),u(t),a) 8t+o(8t).
(2)
That is, the current value u of control now simply appears as an additional argument in the DCF H(x, u, a), and the dynamic programming equation for the value function
F(x, t) = inf ,. E[C(t)ix(t) = x] can be written
i~f[c(x,u) + aF~~· t) + H(x, u, :x)F(x, t)]
= 0
This is of course just the optimally controlled version of (22.22).
(~ ~
!/).
(3)
Suppose now that H has the series development

$$H(x, u, \alpha) = \alpha a(x,u) + \tfrac12\alpha N(x,u)\alpha^T + \cdots, \qquad (4)$$

so that a(x, u)δt and N(x, u)δt are the mean and covariance matrix of δx conditional on the immediately previous values of x and u. Suppose furthermore the process scaled by a factor κ, so that H(x, u, α) is replaced by $\kappa H(x, u, \kappa^{-1}\alpha)$ and κ is assumed large. If we retain only leading terms in the scaled version of (3) then we obtain

$$\inf_u\,[c(x,u) + F_t + F_x a(x,u)] = 0; \qquad (5)$$

just the dynamic programming equation for the deterministic approximation to the process (cf. (22.28)). If we retain terms of relative order $\kappa^{-1}$ then we obtain

$$\inf_u\left\{c(x,u) + F_t + F_x a(x,u) + \frac{1}{2\kappa}\,\mathrm{tr}[N(x,u)F_{xx}]\right\} = 0; \qquad (6)$$
just the dynamic programming equation for the diffusion approximation to the process (cf. (22.29)). We make these relatively trite points to emphasise again that the large deviation approximation, for all the interest and structure that it reveals, is only a 'tilted' form of the crude deterministic approximation associated with (5).

2 LARGE DEVIATION CONCLUSIONS IN THE RISK-SENSITIVE CASE

Suppose that we adopt the risk-sensitive criterion $E_\pi(e^{-\kappa\theta C})$, where θ is the risk-sensitivity parameter and κ the scale parameter of the controlled process. As in Section 22.3 it seems reasonable to include the factor κ in the exponent, since if C is the cost calculated on system averages then κC is a proper assessment of the cost for the whole system. The policy π will be chosen to minimise or maximise this criterion according as θ is negative or positive. The expectation is exactly of the form for which large deviation methods are applicable.

We shall essentially just transfer the analysis of Section 22.4 to the controlled case, so that the various backward equations for the expectations now become versions of the dynamic programming equation, incorporating the taking of an extremum with respect to the current control value u. We suppose now that the relation

$$G_\kappa(x,t) := \mathop{\rm ext}_\pi E_\pi[e^{-\kappa\theta C(t)}\,|\,x(t) = x] = e^{-\kappa\theta F(x,t)+o(\kappa)} \qquad (7)$$

holds; i.e. that the middle expression shows the exponential dependence on κ indicated, for large κ. We then interpret F(x, t) as a limiting normalised value function (i.e. 'limiting' for large κ and 'normalised' to the original cost scale).
We can then transfer the analysis of Section 22.4 bodily, the only differences being the inclusion of the risk-sensitivity parameter θ and the addition of an extremisation with respect to current u (which yields the optimal, or approximately optimal, value of u). The analogue of (22.31) is the dynamic programming equation for G:

$$\mathop{\rm ext}_u\left[-\kappa\theta c(x,u)G_\kappa + \frac{\partial G_\kappa}{\partial t} + \kappa H\left(x, u, \kappa^{-1}\frac{\partial}{\partial x}\right)G_\kappa\right] = 0 \qquad (8)$$

with terminal condition $G_\kappa = e^{-\kappa\theta\mathbb{K}}$ in $\mathscr{S}$. We shall use the term 'ext' loosely for the taking of an extremum; whether a maximum or a minimum depends upon the sign of θ and the point reached in the calculations.

If we insert expression (7) for G into (8) then we obtain the analogue of (22.33): the dynamic programming equation in terms of the normalised value function F,

$$\mathop{\rm ext}_u\left[c(x,u) + \frac{\partial F}{\partial t} - \theta^{-1}H(x, u, -\theta F_x)\right] = 0 \qquad (\xi\notin\mathscr{S}), \qquad (9)$$

with terminal condition $F = \mathbb{K}$ in $\mathscr{S}$. The relation is only approximate because we have neglected terms of relative order $\kappa^{-1}$. It is the approximation made here which is the essential large-deviation approximation. The extremising value of u is optimal at (x, t) to the same level of approximation.

We have now the analogue of Theorem 22.4.2, giving the path integral solution of equation (9).

*Theorem 23.2.1 Suppose that θ > 0. Then equation (9) and its boundary condition $F = \mathbb{K}$ in $\mathscr{S}$ have the unique solution
$$F(x,t) = \inf_{u(\cdot)}\inf_{x(\cdot)}\sup_{\lambda(\cdot)}\left[\int_t^{\bar{t}}[c(x,u) + \lambda^T\dot{x} - \theta^{-1}H(x,u,\theta\lambda^T)]\,d\tau + \mathbb{K}(\bar{\xi})\right], \qquad (10)$$

where the path x(·) is constrained by x(t) = x. The value of u(t) thus determined is optimal in the limit of large κ.

These assertions also hold in the case θ < 0 with the modification that the extremal operations with respect to x(·) and λ(·) become sup and inf respectively.

The argument of Theorem 22.4.2 carries over formally, the extremisation with respect to u presenting no difficulty. However, to the reasons for the lack of rigour of that argument we have now added those consequent upon the extremisation of H in the presence of the extra variable u. We plead in mitigation only the extreme disproportion between the efforts needed to present the formal and rigorous arguments, plus the inclination to defer these latter until they can themselves be made more insightful.

Note that in the passage from (22.35) to (10) above we have replaced the row vector α by $\theta\lambda^T$, so that λ is a column vector. The occurrence of the θ-factor is a consequence of its appearance in the exponent of the criterion function. We have
taken λ as a column vector so that it can be identified with the familiar Lagrange multiplier of earlier trajectory optimisations, including that of the maximum principle; see integral (7.2). Our reason for wanting to take λ as a column vector in those cases was that the extremising conditions with respect to x, u and λ could then be written down as a single equation system, as in (6.20).

Note the implication of solution (10): that the optimal control can be derived by application of a maximum principle, with the Hamiltonian $\lambda^T a(x,u) - c(x,u)$ of the Pontryagin treatment (Chapter 7) replaced simply by $\theta^{-1}H(x,u,\theta\lambda^T) - c(x,u)$. This seems to constitute what has long been sought: a stochastic maximum principle. Such a principle has been known to exist (determining the optimal control exactly in a stochastic case) in a few very special cases, but existence of a general principle has been a matter of extreme doubt. In many cases where such a principle has been announced it has amounted to little more than a statement of the dynamic programming principle, in that the present value of the dual variable λ is expressed as a conditional expectation (generally not calculable) over the future of the optimally controlled process. The principle determined from the path integral (10) requires no such conditional expectations: extremisation with respect to the (x, u, λ)-path determines the predicted future course of the optimally controlled process, just as the conventional principle determines the actual course.

However, our conclusions may be too crude for one to be happy to claim that they fulfil the hopes for a stochastic maximum principle. If we consider the risk-neutral case θ → 0 then the Hamiltonian becomes simply $\lambda^T a(x,u) - c(x,u)$, where a(x, u) is the conditional expectation of rate of change defined in (4). That is, the principle reduces just to the conventional maximum principle for the deterministic version $\dot{x} = a(x,u)$ of the process; no stochastic aspect is retained at all. The reduction implied in (10), of stochastics to consideration of a single 'limit-optimal' path, reflects the same point, already made in Section 22.6.

In the risk-sensitive case the stochastics of the process are of course reflected, even in the large deviation limit, in that the full DCF H(x, u, α) occurs in the Hamiltonian. However, as we shall see in the next section, what one obtains is rather a deterministic principle 'coloured' by optimism or pessimism. This is not sufficient to enforce a positive clearance of dangerous contingencies: for that we shall need the more refined treatment of the next chapter. Nevertheless, the principle undoubtedly embodies valid and valuable conclusions. We shall refer to it as the risk-sensitive maximum principle (abbreviated to RSMP), to avoid confusion with any stronger version of a stochastic maximum principle which may be developed.
3 AN EXAMPLE: THE INERTIALESS LANDING PROBLEM Conside r the inertiale ss landing problem of Section 7.9, but now with a stochast ic plant equation x = u + t, where t is white noise of power N. Recall that x is the
3 AN EXAMPLE: THE INERTIALESS LANDING PROBLEM
409
with unit current height of a particle above the ground, moving horizontally way as to velocity, and that one wishes to land it (i.e. bring x to zero) in such a 2 cost IK(t), minimise the sum of the integral of control costs Qu /2 and terminal ng. where tis the momen t oflandi al This is a simple but illuminating example: too simple to be of direct practic es. generat it insight the for t interes al interest in itself, but of practic 2 2 we The Hamilt onian for the RSMP is then )..u + NBA. /2- Qu /2, whence derive the equations
u = Q- 1A.,
x = u + NOA.,
(11)
N()).. on a free orbit. The second equation is a plant equation with predicted noise; ).. that then see We n. criterio stress l minima is the predicted value of noise <: on a by ined determ then is ).. of value the and u are constan t on the path, and to x = -(Q- 1 + NO)>-.s, where s = t- tis the further time taken to reduce height zero. The actual value of control is then X
u =- s(l
+ QNO) =
-lx/s,
(12)
ine the say. Evaluation of the integral in (10) is then immediate, and we determ total cost as
2(t- t)~t:
QNO)
+ IK(t).
(13)
t which The intended landing time t = t + s must be the value exceeding minimises this expression. led We see from (11) and (12) that the predicted path of the optimally control to t) (x, point initial the from line t straigh a is so process satisfies x = - x/ s, and 7.9. Section of version inistic determ the for as the terminal point (0, t ), just under the However, this is the minimal stress path; the actual plant equation the neglect and bed prescri t assume we If €. + control rule (12) will be x = -1x/ s · effects of noise then this has solution
x(t)
= c(t- tf',
(14)
less than, where cis a constant, to be determined from initial conditions. Now, 1 is e. We see negativ or zero , positive is e as ng accordi equal to or greater than unity noise from (apart will path led control lly then from (14) that the optima 1: Figure of graph the in as point l termina bed disturbances) approach a prescri riskking, risk-see is ler control the as ng concavely, linearly or convexly accordi neutral or risk-averse. er This pattern is in a way the opposite of what one might have expected. Consid have might One 1. > 1 and 0 < the case of a pessimistic controller, when e of an expected that pessimism would have made him conscious of the hazards
410
CONTROL OPTIMISATION IN THE LARG E DEVIATION LIMI T
X
Figure 1 Effective terminal approach paths for the inertialess particle in the risk-neutral (IJ = O),optimistic(IJ > O)andpessimistic(IJ < O)case s.
early landing (when a penalty would be incu rred) so that he would be more concerned than in the risk-neutral case to main tain height. That is, one would have expected the concave approach of Figure 1 rather than the convex one. The explanation is that a pessimistic controlle r fears that all his actions will be opposed by a contrary Nature; he therefore makes his actions all the stronger, which is why 1 > 1 in (14). That is, pessimis m may lead to convulsive and premature action rather than feeble and post poned action. Since he wishes to descend, he then tends to descend prematurely, implying the convex approach of Figure 1. The final part of his approach is a grou nd-skimming one, which is not what one would expect if there were a conscious ness of the possibility of being driven by rand om disturbances to a premature landing. To this extent stochastic effects are lacking; the penalty which may be incurred on x = 0 does not 'propagate' into x > 0 as an anticipation of danger. It is in the lack of this anticipation that the RSMP shows itself inade quate, particularly when the danger is close. We shall refine the treatment to deal with this point in the next chapter, and shall see that stochastic effects do then indeed induce an incentive to gain height when danger threatens. The intended termination point t is of course not prescribed, but determined by minimisation of expression (13). If IK(t) is conv ex with a minimum at !opt then the minimising twil l be somewhat greater than topt· The intended termination point tis constant along the predicted path, but will vary on the actual path. In the pessimistic case the controller will set a linear predicted course to the point (0, t), but will in fact immediately fall below it, as we saw above. A new determination of 7 will then yield a smaller value. That is, the intended landing point of a pessimistic pilot will tend to creep backwards in time, as he tries to escape from uncertainty. That of an optimistic pilot will tend to creep forwards, since he believes that uncertain ty is on his side. To take an explicit example, suppose that IK(t) = k/(d - t), so that penalty increases monotonically to an infinite value as t i d. One finds then that the
4 THE OPTIMISATION OF CONSUMPTION OVER A LIFETIME
of 1 minimising expression (13) is 1 = (t + fJxd)/(1 == JQ/ j2k. The actual noiseless plant equation is then
+ fJx),
411 where
1(1 + fJx) !X . x=---=-..;_;_,..----_,..:.. fJ(d-t) 1-t
fJ- 1[c(d- tf! -1],
x(t) =
where the constant c is determined by initial conditions. The intended terminal point is then given as a function oftime t by
1 = d- c- 1(d- t) 1- 1 . This varies with t as predicted above.
4 THE OPTIMISATION OF CONSUMPTION OVER A LIFETIME Consider a continuous-time version of the problem of Section 2.2: the classic problem of choosing consumption u so as to maximise the total utility
1 1
g(u) dr + G(x)
over a lifetime (0, 7), where 1 is initially assumed known. Here x(r) is the optimiser's capital at time r, assumed to obey the plant equation
x =ax- u,
( 15)
where a is the rate of interest. The variables x and u are intrinsically positive, so a totally LQG formulation is impossible. The deterministic maximum principle is associated with the path integral
~=
1 1
[g(u) +Ax- A(ax- u)] dr + G(x).
( 16)
If one chooses g(u) =log u and G(x) = k log x then one finds by conventional analysis that the optimal rule, in closed loop form, is X U=--
(17) k+s where s = 1 ~ tis time to go, the residual lifetime of the individual. Note that the residual lifetime is effectively augmented by an amount k, which reflects the utility to the individual of being able to leave an inheritance. Remarkably, the rule is independent of a. One could introduce a stochastic element into the model by introducing noise into the plant equation (15), so introducing some uncertainty into the growth of
412
CONTROL OPTIMISATION IN THE LARGE DEVIATION LIMIT
capital (see Exercise 1). However, a more interesting alternative is to introduce . some uncertainty into length of life. So, suppose that capital obeys the deterministic equation (15), qut that there is a second componen t of the state variable, 'apparent residual lifetime' y, which obeys the equation
.Y=-1+€ . Here f is white noise of power N, and we suppose that the moment of death tis the time at which y first equals zero. The occurrence of the noise term reflects the assumption that, because of changing health or circumstance, residual lifetime is not perfectly predictable from y. The path integral equivalent to the evaluation (10) of Fis now
~=
1 1
[g(u) + hx + .A2.Y- .At (ax- u) + .A2 +!ON.A~] dr + G(x).
(18)
Recall that the assumption of large scale amounts to the assumption that N is small. Application of the maximum principle leads one again to the closed-loop optimal control rule (17), but with the effective residual lifetime s determined in terms of current x andy by
J-; =
20NSZ{1- a(k + s) +log [(k + s)jx]}.
(19)
In the risk-neutral case this leads to the relatively crude conclusion s = y. However, we can observe the effects of risk-sensitivity. As in the last section, these may seem counterintuitive at first glance, but further thought confirms them as reasonable. The graph in Figure 2 illustrates how consumption changes (as a function of capital x and apparent augmented residual lifetime k + y) if e increases from
Decreased consumption
Assets
x
Figure 2 The effect ofan increased degree ofoptimism upon consumption, as a function of current assets and perceived residua/lifetime.
5 A NON-DIFFUSION EXAMPLE
413
istic. The interest rate a is ; i.e. if the individual becomes mildly optim . assumed positive. of consumption decreases. Life If assets x are large enough then the optimal rate individual to expect that he will · is good at this level, and so optimism leads the rces more carefully. However, if x live longer, and so must then husband his resou in which optimal consumpti()n is small enough then there is a band of y with idual is living at such a low increases with increasing optimism. Here the indiv attractive. His life expectation is level of consumption that life does not seem te to make his remaining days large enough that his resources appea:r inadequa d be wort h his while to build up his comfortable, but not large enough that it woul me that his residual life span is in capital. The optimistic course is then to assu ume at a greater rate than in the fact shorter than y would indicate, and so to cons risk-neutral case. Exercises and comments n but white noise of power N is (1) Consider the case where t is fixed and know then to add a term !ON.>.2 to the introduced into equation (15). The effect is given in terms ofx and s = t - t by integrand in (16), and the optimal value of u is
ON
2 (k + s)u- x = 2au (1 - e- as).
REGULATION OF PARTICLE 5 A NON -DIF FUS ION EXAMPLE: THE NUM BER S sion examples, in that the plan t The examples of the last two sections were all diffu natu ral example of another type, equation was driven by white noise. For a (22.26). For this x represents the consider the process with standardised DCF rolled admission rate u and num ber of particles in a chamber with cont scaling, the actual num ber of independent departure rate p. (Note that, after departure rates are ~tu and p~tx, particles is ~tx, the actual admission and total r ~t.) If the instantaneous cost and· actual costs must also be scaled by a facto ge cost (in the risk-sensitive sense) function is u + c(x), say, and the minimal avera is "f, then, on the optimal path, 1 "( = u + c(x) - o-l [u(. and we can identify the term . We suppose that 0 ~ u ~ M, say; negative Ham ilton ian -ff (see Exercise 7.2.2) ct to u then implies that the the condition that -:ff be minimal with respe ter or less than 1 + 0. The grea is optimal u must be zero or M according as ljJ ive or negative according as to 1 value of x = rf>. = u¢- 'px<jJ- must be posit condition >. = -rfx = 0 yields which of these holds. The equilibrium
414
CONTROL OPTIMISATION IN THE LARGE DEVIATION LIMIT
c'(x)- e- 1p(¢- 1 - 1) = 0. The equilibrium value of x obviously coincides the threshold value Xthr at which u changes discontinuously, and so at ¢ = 1 +e. Eliminatin g¢ (and so>.) from these last two equations we obtain equation
c'(x) = -p(1
+ e)- 1
determining Xthr· If c(x) is convex with a minimum at Xopt then Xthr will be than Xopt, correspond ing to a displacement cost which is balanced out by the ·• decreased u-cost. The value of Xthr will increase with increasing e, representing increased optimism that a deficit in input u will be made good by random variation. The value of eat which neurotic breakdown occurs is plainly -1. Notes on the literature The use oflarge deviation methods for the optimisation of stochastic control was suggested passingly in Friedlin and Wentzell (1984) (section 8.2) and applied to the minimisation of exit time probabilities by Fleming (1971, 1978, 1985) and Fleming and Tsai (1981). Its association with a risk-sensitive criterion and so the deduction of the Hamilton- Jacobi equation (9) with solution (10) and its implied maximum principle are due to Whittle (1990b). This was also the first paper to break away from diffusion models and consider a general type of Markov process.
, CHA PTE R24
Controlled First Passage
on metho ds have already emphasised in the last two chapters that large deviati forbidden or need refinement in cases for which the limit-optimal path grazes ment will Refine m. proble g landin the in as · high-penalty regions of state space, tes the genera which s proces ' inistic 'determ take the form that one replaces the limit-optimal path by a diffusion. generated and · There is a class of cases for which this diffusion process is readily expected ising minim of m proble the ·.• · treated. This class also has the property that expected ting calcula of that to cost for a controlled process can be transformed is, 'riskThat s. proces trolled uncon ~,, __vApv•_...,•• uru.-m-cosl: for the corres pondin g itself. sation optimi l sensitivity' is not assumed, but is generated by contro ssage proble m An example of this class has already been seen in the first-pa generalises to le examp The associated with the 'fly-paper' effect (Section 10.7). deduce the shall we lar, cover plausible forms of the landing problem. In particu far too are which s optimal margi n of clearance (of hazards) for some model simple for realism, but which are archetypal. which largely It is the author's preoccupation with the landing problem ch have often motivates this work. Attempts to optimise the landing approa from a prescribed simply optimised the control of deviations (on an LQ model) derived from a ideal approach path. But the ideal approach path should itself be and free of other full stochastic model. If air-space were physically homogeneous particular height, traffic (a legitimate first simplification!) then one would hold a s of the groun d not because there was virtue in that height, but because the hazard considerations r Simila it. from ce distan compelled one to keep a respectful stic effects stocha by d induce ity ictabil would dictate a landing path: the unpred influence to ce air-spa into gate' 'propa would cause costs defined on the ground to actions there.
1 REDUCTION OF THE CONTROL PROBLEM Consider the controlled diffusion process with plant equation (1) x = a(x, t) + Bu + € 1 Qu and terminal cost function IK(~) instantaneous cost function c(u) = g set9'. The incurr ed atfirst entryo fthe state/time variable~= (x, t) to a stoppin N. white noise compo nent f is assumed to have power matrix
!uT
416
CON TRO LLE D FIR ST PAS SAGE
The specialising assumptions are thus (i) tha t the problem is LQ in its depend- · ence on the control variable and (ii) tha t no stat e-d epe nde nt ins tan tan eou s costs are incurred. Th e sec ond ass um ptio n is less restrictive tha n it seems, in tha t one can often redefine the state variable to achieve this (see Exercise 1). Also, as emphasised above, it is the ess ence of some problems tha t sensitivity to state value is ind uce d, not by an exp licit state cost, but by the pro pag atio n of term ina l costs into state space. On e cou ld allow the ma tric es B, Q and N to dep end upo n time, but we shall suppose the m con stan t, for simplicity. We must, however, add the key ass um ptio n (iii) tha t
(2) for som e scalar K.. The ass um ptio n is the n tha t control and noise are comparable in the ir driving effect on the pla nt equation, in tha t the control power ma trix N is pro por tion al to the noise pow er ma trix J. The ass um ptio n may seem a very restrictive one, but it certainly holds in a num ber of cases of inte rest. No te tha t if K- is large the n N is of ord er K.- 1, for fixed B and Q. Theorem 24.1.1 Under the ass umptions above, the value fun ction F (') of the controlled process is given by
if the expectation is well-defined. Here E(
firs t ent ry ofthe uncontrolled pro
cess
(3) is an expectation over the coo rdinates
+t = (x, t).
.X= a(x , t) into £1', conditional on a start fro m a poi
nt ~
Pro of Th e optimality equatio n for the
(4)
controlled process is
inf [! uT Qu + F1 + Fx(a + Bu) II
+!
outside £1', wit h the boundaryc
onditionF(~) =
tr(NFxx)] = 0
(5)
IK(') in £1'. It follows from (5) tha t
u = -Q -IB TF J with the con seq uen t reduction
~of
(6)
of the optimality equ atio n
Fr + Fx a- !Fx BQ - 1BTFJ +! Now, if (2) holds, the n the tran sfo rma
tr(NFxx)] = 0.
(7)
tion
1/1(') =
e-~
(8)
linearises (7) to
1/Jr + 1/Jxa +! tr(N1/Jxx) = 0
(' ¢; £1').
(9)
1 REDUCTION OF THE CONTROL PROBLEM
417
This is subject to the boundary condition (~ E
Y).
(10)
Now, if the right-hand member of (3) were denoted '1/J( ~) then (9), (10) are just the equation and boundary condition that the characterisation of the theorem would require of it. The solution of (9), (10) is unique, however, and must then be identifiable with the right-hand member of (3), if this is well-defined. Good definition will, for example, imply that the uncontrolled path from ~ will terminate in Y with probability one. D The theorem has two striking implications: (i) that the optimisation of control has been reduced to calculation of an expectation for the uncontrolled process, and (ii) that the risk-neutral ambition of minimising the expectation of cost C has been transformed into the 'risk-sensitive' evaluation of the expectation of exp( -~IK). This latter calculation is then perfectly set up for the application of large-deviation methods, if~ is large. One still has the first-passage problem of determining the distribution of terminal coordinate ~for the uncontrolled process. We follow this up in Section 3. Some cases can be solved exactly; for others, we assume N small (and so ~large) and deduce what amounts to a refined version of a large deviation evaluation. Note that the rule (6) for the optimal control becomes (11)
in terms of '1/J. The key conclusion (3) holds only for a diffusion process, and one of a rather particular form. One has to admit that, even though we have managed to break away from the limitation of attention to the diffusion case which has so marked the work of the last three decades, the diffusion processes do have a particular role and character. Exercises and comments (1) Suppose that the instantaneous cost has the form c(x) (9) becomes
+ !uT Qu. Then relation
!'i,C'l/J + '1/Jt + '1/Jxa +! tr(N'l/Jxx) = 0.
(12)
with solution (3) now modified to
This expression of the solution of (12) is just what is often referred to as the 'Feynman-Kac formula'. In the dynamic programming context it would be
CONTROLLED FIRST PASSAGE
418
regarded as a rather straightforward assertion, at least under circumstances for which the expectation is well defined. (2) Risk-sensitivity can be worked in very naturally. Consider the normalised risksensitive value function F(e) defined by e-BF = ext"E... (e-OC(zllx(t) = x). Show that, under the same assumptions as above, 7j; = e-(~<+B)F satisfies (9), so that (3) holds with "" replaced by "" + (}throughout. 2 THE LARGE DEVIATION EVALUATION
Theorem 23.1.1 casts the solution in a form (3) which seems made for large deviation evaluation in the limit of large ""- However, it turns out, as we shall see, that in its unrefined form this yields nothing but the deterministic version of the original stochastic control problem. Nevertheless, the establishment of this connection gives some insight in itself, and an indication of the path to a more refined approximation. Let us denote the control power matrix BQ- 1BT by J, so that the basic assumption (2) can be written simply N = ""- 1J. Then the unsealed DCF of the uncontrolled process (3) is aa(x, t) aJaT and the large deviation evaluation of the value function F(x, t) implied by (3) is
+!
F(x, t)
rv
infsup(JI (ai- aa- !aJaT) dr + IK([)). x(·) or(·)
{13)
1
Here the infimum over paths x(·) is constrained by the initial condition x(t) = x and the fact that the path must not enter !/' before the terminal moment t. The path is thus not necessarily a free one before this point: it may consist of free sections separated by grazing encounters with!/'. Theorem 24.2.1 The approximate evaluation (13) is just the value function ofthe deterministic version of the problem, and the path x( ·) determined by it is just the path ofthe optimally controlled deterministic process. Proof Take (13) as a defining equality, for simplicity. Then we have Fx the initial time) so that the optimal control on this basis would be
u = Q-iBTaT.
= -a (at (14)
Performing the a (·)-maximisation in (13) we obtain
x- a= ]aT= Bu
(15)
and are left with the expression
!
I
aJaT dr + IK
=!I
uT Qu dr + 1K
(16)
3 THE CASE OF LINEAR DYNAMICS
419
above. But which the path x( ·) must minimise, subject to the constraints indicated the cost and n equatio plant the but nothing are (16) relation (15) and expression sing minimi the and , problem control the of version function of the deterministic 0 . version this for path led control lly path nothing but the optima that one This may seem like a disappointing collapse. It is indeed an indication (or, effects tic stochas capture to is one if ch must refine the large-deviation approa process a for ves themsel st manife would rather, those stochastic effects which ting which was not completely LQG). However, it also indicates interes initial bed prescri n betwee costs control connections. The path which minimises le path and terminal points in the deterministic version is the most probab most The (3). version tic stochas between these points in the uncontrolled ined constra a to onds corresp probable first passage path in this second version tion. termina path in the first: that which does not enter f/ before 3 THE CASE OF LINEA R DYNAMICS Suppose that the plant equation takes the linear form
x= Ax+B u + E,
(17)
passage with, as ever, N = K:- 1J. Then the minimal value of the rate function for form the ~takes ~to of the uncontrolled process from
D(~, ~) = inf sup j 1[a? x- a:T Ax- !a:TJa:] dT. X
Q
t
and the It follows then from the control-equivalence proved in the last section Y' meet not does path calculations of Section 7.8, that, in the case when the free prematurely, this has the evaluation
(18) Here s = 7 - t and
(19) in the 1 In expression (18) we recognise just the K:- times the expression a x(t); of x value the exponent of the distribution of .X= x(t) conditional on 1 In V(s). K:matrix Gaussi an distribution with expectation eAsx and covariance n. the controlled/deterministic version we would see V(s) as the control Gramia (of~ density If K: is large then it is shown in Whittle (1995) that the first-passage the first conditional upon ~) for the uncontrolled process x = Ax + E is to approximation given by
(20)
420
CONTROLLED FIRST PASSAGE
and to the second approximation given by /(~1~)
= p(K-/21rt;zl V(s)ri/2e-~tD<€ll+o(Il.
(21) where pis a Jacobian term related to the transformation from x(1) for given 1 to a • set of coordinates for ~on the surface of !!'. That is, to the normal density of x ··.~ conditional on the value of x for fixed s, corrected by the factor p. For example, if the stopping time is prescribed and we take (21) as giving the -.".·_;.~1 distribution of x(1) for prescribed 1 then p = 1, trivially. If the process stops at a -· prescribed value of XI and one takes (21) as the density of (xz, X3, ... , Xn, t) at termination then p = ldX 1 / dtl evaluated at the termination point, where i( t) is the most probable path. However, these assertions are subject to the proviso that the most probable path from~ to~ (i.e. the Hamiltonian path determined by (22.39)) should not enter!!' before termination at ~. Should it do so, then for the coarser approximation (20) one must replace the path by that most probable path which avoids !!' before the final encounter at ~. This will consist of free segments separated by grazing encounters with!!' ,just as for the deterministic control problems of Sections 7.87.10. In such a case D( ~' ~) must be replaced in (20) by the evaluation for the !!'avoiding path. In the case ofassertion (21) the concept of 'the most probable path which avoids !!' before termination at~' may well be too simple; certainly such a path will clear !!'by some margin rather than graze it. We shall in fact consider a number of cases which are simple enough that the first passage distribution is known, and for which simple recommendatio ns on the margin ofclearance can be deduced. We shall use the standard notation for the normal density and integral: il>(x) =
1:
¢(y) dy.
The asymptotic expression 1- il>(x) = ¢(x)[x- 1 + o(x- 1)] for large positive x will be found useful. In Exercise 1 we derive the more refined version ¢(X)
J _ i!>(x)
_
-X
+ X -1 + ( -1) 0 X .
Exercises and comments (1) We can write 1
il>(x)
~(x)
=
100 exp[!(x2 x
y 2)] dy
=
Joroo exp( -xu- !u2) du
= 1oo e-xu(I- !u2 + .. ·)du = [x-1- x-3 + o(x-3)], whence (22) follows.
(22)
4 LANDING THE INERTIALESS PARTICLE
;;.
f:~
421
4 LANDING THE INERTIALESS PARTICLE the inertialess partic le Consi der the stochastic version of the 'landi ng proble m' for governed by the plant discussed in Section 7.9. The height x of the partic le is aim is then to bring the and t time at x height a equati on x = u + t. It starts from value of the cost ted expec the ise minim to as it to zero height in such a way i);r+s Qu2du + IK(t + s), where t +sis themo mento flandi ng. N Q, and we have = 1/ The hypotheses of the last section are satisfied, with K. 2 D = D((,~) = x /2NQs. is its time, t + s. We The only rando m coord inate of the termin ation point for first passag e from require then to know the distrib ution of s, the time taken €. But this distrib ution is height x to zero heigh t for the uncon trolled proce ss x = well known ; it has density
(s
~
0).
(23)
la (21), which is exact in Note that this is precisely what would be given by formu s) for prescr ibed sis the this case. The most probable path from (x, t) to (0, t + ian p offor mula (21) Jacob the that so T)xjs, straight-line path i( T) = (t + sof x = x( t + s) for y densit bility proba the equals xis. Multiplying this by (23). sion expres y prescr ibed s at x = 0 we obtain exactl s now becom es Relati on (3) for the value function of the contro lled proces
(24) where C(x, t, s) is the 'cost' expres sion
C(x, t, s) =
Qx2
2S + IK(t + s)
(25)
inistic version. The Recall now the conclusions of Sectio n 7.9 for the determ respect to s. If the with s) t, C(x, of value function was exactly the minim al value al landin g time optim the +sis t minim ising value iss (a function of x and t) then and the optim al control value at (x, t) is
u = -xjs.
(26)
on the optim al path. Note that t = t + s, the optim al landin g time, is consta nt al integr the define us let case, Retur ning to the stochastic
/[g(s)]
=
laoo g(s)e-C(x,t,s)/NQ ds,
for a function g( s). We see then from expression (24) that at (x, t) is
the optim al contro l rule
422
CON TRO LLED FIRS T PASSAGE
(27) This is close to the deterministic rule (26), but differs from it in two interesting respects. Firstly, there is the term Nix, whic h impels the particle to gain height whenever it is dangerously low. This is exac tly the cautionary effect we were seeking: the appreciation of the danger of a premature landing if height is low. The effect is of course absent in the determin istic case x = 0. The other poin t is the replacement of the factor s- 1 in (26) by /{s- 1) /I( 11 an average of s- 1 for the distribution (23) weighted by the factor exp( -IK/NQ). This average will indeed be close to s - 1 if sis not too small and N is small. However, in the final moments of landing the two evaluations will differ, as they mus t if the cautionary heightgaining term Nix is to be nullified. One can reasonably define the 'ideal approach path' as that for which the height x(t) at time t is the value of x minimising F(x, t). Since u is proportional to Fx then this is just the (x, t) locus on which u is zero (although control action is needed to keep it on this locus). Suppose that topt is the truly optimal landing point, in that it is the value oft minimising IK(t). The n !opt is certainly the ideal time at which to terminate, so that the cont rol (27) approximates (N jx) - xj (to - t) which is zero on the locus
x(t) = jN( top t- t).
(28)
This then describes the ideal approach path (necessarily for t < lopt). For N = 0 the ideal·approach is a ground-skimming one at infinitesmal heig ht-t o bring the particle from this height to zero height requ ires only infinitesmal control cost. If N > 0 then the appreciation of hazards induces the square-root approach of (28). The approach is steepest at the end of the path, as one tries to drop on to the .optimal point. Onc e we introduce inertia then this dropping will of course be modified by a levelling-out at the very end.
5 OBSTACLE AVOIDANCE FOR mE INE RTIALESS
PARTICLE
Suppose that it is not a question ofla ndin g the inertialess particle, but oflifting it over an obstacle which will be encounte red at time t 1. (We assume, as ever, constant horizontal velocityJ This could be regarded as a mountain peak which is so shar p and high relative to the surround ing terrain that, if one surmounts the peak itself, then one certainly clears all othe r hazards in the neighbourhood. We can then regard t 1 as a kind of 'term inati on time', in that flight truly terminates if one hits the mountain, and cont inues with a known future cost if one clears it. Since t 1 is known then the only first-passage coordinate which is unpredictable is the height y of the particle at time t 1 • Conditional on a curr ent position (x, t) the random variable y is norm ally distributed with mean x and variance Ns, whe res is, as ever, the time to go, t 1 - t.
~ -~
S PARTICLE 5 OBSTACLE AVOIDANCE FOR THE INERT IALES
423
lled process then Formula (3) for the value function F(x, t) of the contro becomes (29) '1/J(x, t) = e-F(x,t )/NQ = _1_ 00 e-C(x,t ,y)/NQ dy J21rNs -oo
1
where
C(x, t,y) =
Q(y- x)2 25
+ IK(y)
(30)
at time t 1. The optim al and IK(y) is the cost incurred if the particle is at height y control is (31) u = N'I/Jx/'1/J = y(x,s )- x s where
Jye-c dy y(x,s) = Je-c dy.
(32)
weighted by the factor That is, y(x, s) is an average of y based on its distribution the deterministic case in Yt exp{ -IK/ NQ). If one were aiming for a value y(tt) = y(x,s) can thus be value then the optimal control would be u = (y 1 - x)js. The seen as a provisional aiming point. g problem of the This may seem a simpler example than that of the landin distributed (over the previous section, in that the hazard is localised rather than is trivial. However, surface of the ground), and the first-passage distribution different contingencies the fact that IK may be discontinuous (reflecting the very care is needed in the of clearing or not clearing the peak) means that more evaluation of the integrals of (32) for small N cost prescription If the moun tain is of height h then the simplest terminal would be
R<(y)
~ {:
(x ~h) (x > h)
(33)
are equally disastrous, where K is a large positive constant. That is, all collisions deterministic case is all clearings equally happy. The optimal control rule in the be zero otherwise. The then that u should equal (h- x)fs for x ~hand should stochastic cautio n in r prope ses expres interest lies in seeing how rule (31) differing from this. In this special case evaluations (29) and (31) become (34) e-F(x,t )/NQ = 1 _ (() + ry(() ( 1 - t]) vfNTS¢J( () u = -'-:1-+-:(.:.....ry'-_--:-71)-::::':-'((-:-'-)
(35)
424
CONTROLLE D FIRST PASSAGE
where ( = (h- x)jVNs and 'rJ = exp( -KjNQ). We look for approximate evaluations of these expressions for small N, and, almost independently of this, for large K. For small Nand fixed K we recover the same feature as that observed in Section 7.9: that one will choose to crash on the mountain rather than make the exertions necessary for escape if condition (7.58) holds. We indicate the argument in Exercise 1. However, we see the significant results more clearly if we let K become indefinitely large and so "'indefinitely small. If we simply set "' = 0 in the expressions above then we obtain non-degenerate conclusions. This indicates, as for the example of Section 10.7 (see equation (10.36)), that control in the neighbourhood of the peak is vigorous enough that collision and the infinite penalty it would bring are avoided. (More specifically, the probability of collision is zero in the strong sense that this contingency of infinite cost contributes nothing to the expectation of cost.) Suppose ( large and negative. This means that (x- h)/VNs is large and positive, so that one is well above the peak relative to the variation one can expect in the time remaining. Then formula (35) yields (in the case "' = 0)
{36) which is indeed very small. Suppose ( large and positive, so that one is well below the peak relative to the variation one can expect in the time remaining. Appealing to relation (22) we then have Urv
ri:TI: h- X N YlYf.l{(+C 1) = - · +--.
s
a-x
(37)
That is, the optimal control is exactly the optimal deterministic control plus a stochastic 'lift' term reminiscent of that which we saw in the control (27) for the landing problem. It may seem strange that this lift term should become stronger as h - x becomes smaller, i.e. as one begins to come up level with the peak. However, it can be argued that the first (deterministic) term in the control rule (37) provides adequate lift for x well below h, and it is when x begins to come up level with h that it needs to be supplemented. It is moreover just then that the assumption oflarge (begins to fail. However, there is a version of the control rule which, although equivalent to (37) as far as terms of order N, carries real significance.
Theorem 24.5.2 At values ofxfor which (is large and positive the optimal control is
( ) u= h +Nd -x +oN. s
for small N, where d = s/ (h - x) is invariant on the optimal deterministic path.
(38)
S PARTICLE S OBSTACLE AVOIDANCE FOR THE INERT IALES
425
is that one behaves as That is, the effect of plant noise on the optim al contro l increa sed by an amou nt Nd, this ~; though the heigh t of the moun tain had been ~~ intended clearance being essentially consta nt until one is within distance o(N) of inistic path, the small er the peak. The steepe r the gradie nt g = d-l of the determ will be the safety margi n Nd thus allow ed
-
obtain
l then we Proof If we equate the expressions (37) and (38) for the contro tion is assert ed for d. The point of the theore m is that this correc
the evaluation to slow variation on the consta nt on the deterministic path, and so subject only 0 stochastic path. Exercises and comments (1) For (
= (h -
x)j v'Ns large and positive expression {34)yields e-F(x,t )/NQ ,....,
max [e-(h-x) 2/2Ns, e-K/NQ ]
in 1IN The second term if we retain only the domin ant factors: those exponential 2 /2s > K. This is exactly the x) Q(hin the square bracket will domin ate if be cheaper than that of condition deduc ed in (7.58) for the option of crashing to striving to clear the peak. cost is modif ied to (2) Suppose that the specification (33) of the 'termi nal' r control cost involved IK(y) = Qy2 j2sl for y >h. This would represent the furthe t a time s1 after havin g if one were required to bring the the particle to zero heigh by no more than is cleared the peak. There is then an incentive to clear the peak necessary. Show that the formula analogous to (34) is e-F(x,t )/NQ
=
~exp; [!(C?-
where (1
= (h- x)jVN s
t distance below the That is, ( 1 and (2 are respectively propo rtiona l to the curren time t 1 if one conat peak and the distan ce that one would be below the peak time t 1 + SJ. Note that tinued on the direct flight path to the final landin g point at
Q:x!- . Qh2 1)/2 = 2s1 - 2(s + s1)'
2
2
QN((2
- (
to the final landin g the difference in control costs between the straight-line paths tively. Show that in the point from the peak and from the starting point respec case K = +oo the optim al control is
426
CONTROLLED FIRST PASSAGE X
U=- --+ S
+ Si
NV>.. S
¢((2)
1 - <J?( (2) .
The two terms represent respectively the control that would take the particle to the final destination by the straight-line path and the correction needed to lift it over the peak. Show that in the case of ( 2 large and positive (when the straightline path would meet the mountain) the formula analogous to (37) is -I h -X s+ Si U=-s -+N [h ( -s-1- ) -xJ +o(N). (39) The first term sets a course for the peak. The second introduces a lift inversely proportional to the amount by which the particle is currently below the straightline path passing through peak and destination. We can recast the control rule (39) in the more significant form (38), but with the constant d now modified to d = (g + hfs1 1 where g = (h- x)js is the constant gradient of the optimal deterministic approach path. The term h/ s is 1 correspondingly the descent rate required on the other side of the mountain. The effect is then again that one aims to clear the peak by an amount Nd, this intended clearance showing only slow variation on the optimal approach path, but decreasing as the steepness of either required approach or subsequent descent is increased.
r
6 CRASH AVOIDANCE FOR THE INERTIAL PARTICLE
Consider the stochastic version of the inertial particle model of Section 7.10 in which the applied forces are the sum of control forces and process noise, so that the stochastic plant equation is
x= v, v= u+e. Then the condition (2) is again trivially satisfied, with ,.. = 1/ NQ. It would be interesting to consider both landing and obstacle-avoidance for this model. However, the most natural first problem to consider is that analysed in Section 7.10; that of pulling out of a dive with minimal expenditure of control energy That is, one begins with x > 0, v < 0 and wishes to minimise expected total control cost up to the moment when velocity v is first zero under the condition that height x should be non-negative at this this point (and so at earlier points). The moment when v becomes zero is to be regarded as the moment when one has pulled out of the dive. In a first treatment, costs incurred after this moment are not considered. Recall the results of the deterministic case. If no control is exerted then the particle of course crashes after a time= xjv. The optimal control is u = 2v2 j3x which brings the particle out of its dive (and grazing the ground) after a time -3xjv.
f
~\~·
PARTICLE 6 CRASH AVOIDANCE FOR THE INER TIAL
427
Relation (4) now becomes e-F(x, v)jNQ
= 1- p
+ Pe-K fNQ
(i.e. the probability for the where P = P(x, v) is the probability of a crash ity is first zero) and K is veloc when ive uncontrolled process that height is negat If we assume K infinite then the penalty incur red at a crash (assumed constant). is ol contr al this reduces simply to 1 - P, and the optim (40) u = -NP v/(1 - P). = x, v(O) = v. Then x(s) is Let us use s to denote elapsed time, with x(O) nce Ns 3 /3, and the probability normally distributed with mean x + vs and varia that x(s) is negative is ((), where
(41)
( = -(x + vs){ J;.
the limit- optimal path is likely Now, the point at which stochastic variation from on 2, just the grazing poin t of Secti of to take the path into !/ is, by the discussion to be at s = -3xj v. We see from the optim al deterministic path, which we know mises (, and so maximises the maxi h (41) that this is also the value of s whic cture: that the grazing poin ts conje a probability that x(s) ::::; 0. One is then led to the points which maxi mise just are of the optimally controlled deterministic path has strayed into !/. The path the probability that the uncontrolled stochastic one, as we shall see from this des conjecture is true in a class of cases which inclu the next section, but not generally. The value of (at this grazing point is
c=~3YJ;fv\3Jx· (Strictly, it is the logarithms of and we shall have P(x, v) "'P((.::::; 0) =((). ) To within terms of smal ler unity these quantities whose ratio is asymptotically order inN we thus have U=
/NT0
¢(() A.
(42)
v~ 1- (()
If (is large then we can appeal to the approximation
z.r
3N 2v
Urv---.
3x
(22) and deduce that
(43)
tic control plus a cauti onary This again follows the pattern: optimal determinis is inversely proportional to ction height-gaining correction. In this case the corre the obstacle-avoidance for As current velocity rather than curre nt height.
428
CONT ROLL ED FIRST PASSAGE
problem of the last section, the form of the correction term is somewhat counter-· intuitive, but is explained in the same way. It can also be given the more illuminating form of preservation of a near-constant cleara nce.
Theorem 24.6.2 At values ofx and vfor which (is large and positiv e the optimal control is
2v2 u = 3(x- Nd)
+ o(N)
(44)
for small N, where d = -9x2 J4v 3 is invariant on the deterministic path. That is, the effect of plant noise on the optimal control is that one behaves as though one had to miss the groun d by a margi n of Nd, this margi n being essentially consta nt until one is within distance o(N) of the ground. The stronger the control neede d to pull out of the dive, the smaller will be the safety margin Ndthu s allowed.
Proof If we equate the expressions (43) and (44) for the contro l then we obtain the evaluation assert ed for d. It follows from the analys is of Section 7.10 that vx- 213 is indee d const ant on the deterministic path. The point of the theorem is then again that this correc tion is const ant on the determ inistic path, and so subject only to slow variation on the stochastic path. 0
7 THE AVOIDANCE OF ABSORPTION GENERALLY The conclusions of the last section generalise nicely. Suppo se that we consider the stochastic version of the general avoidance problem of Section 7.11. That is, the plant equation has the linear form (17) with control cost! JuT Qu dr, and the stopping sets is the half-space aT x ~ b. As emphasised in Section 7.11, this last assum ption is not as special as it may appear. Let us norma lise aso that Ia I = 1; a modification of b to b + d then shifts the plane a distan ce d in the direction norma l to it. As in Section 7.11, we shall define the variable z = aT X. Since the process is time-homogeneous we may as well take the startin g and termin ation points as t = 0 and t = s, so that' = (x, 0), ~ = (.X, s), and sis the time remai ning to termin ation at x. We take the termin ation point s as the grazing point; the first time at which, if no further contro l were exerted, the uncontrolled deterministic path would henceforth avoid s. We may as well then assume that initial conditions are such that there is such an s; i.e. that xis such that some control is necessary to avoids. We again make the assum ption that the relation N = ~~:- 1 J holds, and assume 11: large and J fixed. We have then '!j;(x) := e-~
= x] = P[z(t) > b; 0 < t ~ slx(O) =
x]. (45)
r
7 THE AVOIDANCE OF ABSORI'TION GENERALLY
429
l
no longer enter s Here s is again the first time at which the path would n, and F (x) is fashio inistic determ ~·.···subsequently if it continued in an uncontrolled point. Note that to up .9' avoids ' the minim al cost of a path starting from x which s. The proces trolled uncon the that s is a rando m variable, now determined for to be is to[!' entry that so last equality holds in (45) if we assume that K = +oo, avoided at all costs. One might now conjecture that
1/;(x) = e-~ blx(O) = x].
(46)
s>O
ditionally, but let us As we shall see in a moment, the conjecture is not valid uncon z(s) for given s is ble variab m rando assume it valid for the moment. The 2 where K, I a(s) ce distributed normally with mean ll and varian
a= a(s)
=
J
aT V(s)a.
where V has the evaluation 14. Thus, (46) becomes (47)
value ofsis exactly where ( = foil I a, and s maxim ises(. That is, the extremising inistic case. determ the in point the value sdetermining the optimal grazing asymptotically the that ture conjec the Now, the conjecture (46) is equivalent to s determined time the at just .9' (large K) most probable .9'-avoiding path grazes n 2, this path Sectio of ent argum by the minimising condition in (46). Since, by the is effectively ture conjec is just the optim al deterministic .9'-avoiding path, the passage to of cost the that the grazing point of this path is that which maximises assertion this that 7.11.2 .9' with respect to time taken. But we know from Theor em 0. This = B aT and is valid under the conditions that the process is controllable statethe affect not last condition amounts to the assertion that control does to nts amou it process comp onent aT x directly. For the uncontrolled stochastic e . We deduc the assertion that aT Na = 0, or that aTE is zero in mean square
1 Theorem 24.7.1 Suppose that N = KBQ- BT, that aT B large K: for nistic system [A, B, ·]is controllable. Then, (i) The optimal[!'-avoiding control at xis
= 0 and that the determi-
1 u = (KQf 1BT'I/J!f'0"' (a/ri Qf BTeATsa l
~~~().
(48)
where ( = foil I a, and sis given the values which maximises (. (ii) This can befurther approximated (49)
430
CONT ROLL ED FIRST PASSAGE
where sis given the values and
The quantity dis then to be regarded as the clearance which is aimedfor at the grazing point; it is constant along the optimal deterministic path. Proof Asser tion (i) follows from the discussion befor e the theor em and its conclusion (47). Asser tion (ii) follows, as previously, by using the appro ximat ion (22) in formu la (48) and then recasting the 0(~~:- 1 ) corre ction term as a pertu rbatio n dof b. Since d depen ds upon curre nt coord inates only throu gh A/ a2, which is the termi nal value of >. for the optim al determ inisti c path, it is indee d invariant on
~~
0
In the contr ol rule (49) we indee d recognise the determ inisti c rule (7.78) with b modi fied to b + d. Notes on the literature The mater ial of sectio ns 1, 3 and 4 appea red in Whitt le and Gait (1970), the essential result being of cours e the reduc tion expre ssed in Theo rem 24.1.1. The usefulness of the logar ithmi c transf orma tion of F in the contr ol context has subsequently and indep enden tly becom e appar ent to other autho rs; e.g. Holla nd (1977), Flem ing (1978) and Benes, Shepp and Witse nhaus en (1980). The mater ial of the other sectio ns is new, as far as the autho r know s.
CHAPT ER25
lmperfoct Observation; Non-linear Filtering The conclusions of Section 23.2 transfer to the case of imperfect observation, with the consequence that we can obtain a risk- sensitive maximum principle for this case. The conclusions are quite illuminating. It remains true, as we have noted repeatedly, that the large-deviation approximation is in a sense a crude one, in that it gives inadequate expression to the stochastic character of the model at points where the limit-optimal path bifurcates or becomes subject to special constraints or penalties. However, the large deviation approach gives what is in fact rather a good treatment of state estimation, perhaps just because these phenomena do not then arise. We consequently give some space to the question of pure estimation in this general context: non-linear filtering. 1 SPECIFICA TION OF THE PLANT/OBSERVATION PROCESS We suppose that the observation y is, like the process variable x, a vector. It turns out to be natural to work to some degree in terms of the integrated observation z(t) = y(r) dr. In these terms the familiar linear white-noise model (12.26) would then become (1) = Cx+ry. x=Ax+Bu +t:,
f
z
so that both plant and observation relations are cast as flrst-order differential equations. It also has the effect that z is a regular stochastic process, whereas y, with its white noise component, can only be understood in a generalised sense. Of course, model (1) is a very special case of the Markov models we now consider. We assume that the plant/observation process has a time-homogeneous Markov character, and that the observation process is wholly subsidiary to the plant process in that the distribution of the increments 6x and 6z of x and z in the time interval (t, t + 8t) conditional on plant, control and observation histories at time t is in fact conditioned only by x( t) and u( t). The plant/observation process is then specified by the joint DCF
H(x, u, a., {3)
1{E[e = lim(8t)6t!O
0
.Sx+P 6zlx(t) = x, u(t) =
u]- 1}.
(2)
Note that H(x, u, a., 0) = H(x, u, a.), the DCF of the plant process already introduced in Chapter 23. The joint DCF for the linear specification (1) is
432
IMPERFECT OBSERVATION; NON-LINEAR FILTERING
H(x, u, a, ,8) = a(Ax + Bu)
+ ,8Cx + 21 [a ,8]
[N L] [o?] ,BT , LT
M
under our standard assumptions on plant and observation noise. When it comes to scaling of the joint process we shall assume that the joint DCF has the form KH(x, u, a/ K, .8/ K) in terms of a unsealed DCF H. Here"' is the scale parameter, which will be assumed large. We shall consider the joint process from time t = 0; a matter of convention rather than a real constraint. We must have an initial condition, however, in that we must have a distribution for x(O) (understood to be conditional on the information state at time t = 0). We shall assume that this is subject to the same scaling, in that there exists a rate function Do (x) such that
P(x(O) Ed)= exp{-K inf D 0 (x) xed
+ o(K)}
(4)
for large"' and appropriate d. 2 'CERTAINT Y EQUIVALENCE' AND THE SEPARATION PRINCIPLE We consider the risk-sensitive criterion E"( e-~
G(W(t)) = extE"[e-~<°C/W(t)].
(5)
7r
Here the operation 'ext' indicates the taking of a maximum or a minimum according as fJ is positive or negative, and W( t) is the information available at time t. G obeys the dynamic programmin g equation
G(W(t)) =ext E[G(W(t + 8!)/ W(t), u(t)J u(t)
+ o(c5t).
(6)
However, as in Chapter 12, it is more convenient to work in terms of the function
J(W(t)) =f(W(t))G( W(t)),
(7)
where f( W) is the density function of the information W relative to a Kindependent measure m on histories W. In terms of this the dynamic programming equation (6) becomes
J( W(t)) =ext u(t)
1 y
J( W(t + c5t))
where Jh indicates an integration over {y(r); t that the extremising u( t) is optimal. Define also the associated quantity
+ o(8t),
(8)
~ t + 8t} and we can assert
J*( W(t), x(t)) = f( W(t), x(t))E[e-~<°C 1 W(t), x(t)] /
(9)
PLE 2 CERTAINTY EQUIVALENCE' AND THE SEPARATION PRINCI
433
because all where C1 is the cost up to time t. The expectation is indepe ndent of 7f, Let us ined. determ been have and past the in are control values occurri ng ion express otic asympt .. provisionally assume that an
J*(W(t ),x(t)) =
e-~
(10)
and the holds, where P is indepe ndent of "'- Both the validity of the assumption next the in see shall we as theory, n eviatio large-d evaluation of P will follow from section. future Finally, recall the definition (23.7) of F(x, t); the scaled and normalised tion. observa state perfect with but , problem value function for the same control future and past the of ents equival otic The quantities P and Fare in fact asympt what looks stresses defined for the LEQG problem in Chapte r 16. We shall prove ructure d state-st the for these; of like a certainty equivalence principle in terms otically asympt holds it that in case case only, but greatly relaxed from the LEQG r, the Howeve valid. are ns for any model for which large-deviation assertio (in a le princip ence equival ty assertion is so much cruder than a true certain of rather speak shall but term, sense to be explained) that we shall not use the 'stress-extremising'.
tial-of*Theorem 25.2.1 The stress-extremising principle. Assume the exponen validnal provisio the and cost criterion E" (e-~
"'J e-~
exp{ !'.: i~f 8[P( W(t), x)
+ F(x, t)]}.
(11)
of (ii) Suppose that the optimal closed-loop form of u( t) is u(x( t), t) in the case state ct imperfe of case the in perfect state observation. Then the optimal value value observation is u(x(t), t) to within a term o(l) in!'.:, where x(t) is the minimising ofxin (11). can see If we write the final expression for lin (11) as exp{ -!'.:OS(t)} then one and all ble observa S( t) as a 'stress' extremised with respect to all quantities not indeed then F and control decisions as yet unmad e at time t'. The compo nents P .X of x value appear as past and future compo nents of stress and the minimising fact ant signific as the 'minimal-stress' estimate of the current state value x( t). The case the for is that the future stress Pis nothing but the normal ised value function looks so of perfect observation, and it is this which implies assertio n (ii), which much like an assertio n of certain ty equivalence. The proof is inductive. Relation (11) certain ly holds at the horizon point h; suppose it true at a time t + 8t ::::;; h. If u( t) = u then the relation
* Proof
434
IMPERFECT OBSERVATION; NON-LINEAR FILTERING
1
J*(W(t + 6t),x(t + &))
=
JJ*(W(t),x)e-~
8t)lx(t)
= x, u(t) = u) dx
follows from the definition (10) of J*. In virtue of this we can write the version (8) of the dynamic programmin g equation (6) as
J( W(t))
rv
e;t
= e;t
J Je-~~:B(P,+F,)[1-
e-KBP, E[e-KBc(x,u) 01-1\:BF(x(t+ot),t+ot) lx(t)
= x, u(t) = u]dx
i'i.BA(x, u, t) 6tj dx + o(i'i.8t)
(12)
say, where we have written P( W(t), x) and F(x, t) as P1 and F1• The extremising value of u in (12) is the optimal value of u( t), to within a term o( 1) for large K For given i'i. we must assume 8t so small that i'i.l5t is small. However, for large ,.. the dominating contribution to the integral in (12) will come from the value xof x minimising B( Pt + Ft). We then have
J( W(t), t)
rv
rv
[1 - ,..ei~f A(x, u, t) 8t]
J
e-KII(P,+Fr)dx + o(K, 8t).
[1 - ,..(;Jinf A(x, u, t) 8t]e-l
+ o(K, 8t) + o(1).
(13)
where the final remainder term is one that is small for large "'· Here again the infimising value of u is asymptotically optimal. But the recursion which the dynamic programmin g equation (23.8) implies for F implies that the minimal value of A~, u, t) with respect to u is zero, attained at u = u(x, t). Both assertions of the theorem then follow from (13). 0 It is assertion (ii) which has the appearance of a certainty-equivalence statement: that the optimal control is (in the limit oflarge "') that for the case of perfect observation with an estimate i(t) substituted for the unobserved value of x(t). In the risk-neutral case x(t) is just the value ofx minimising P( W(t), x); the large-deviation form of the most probable value. In the risk-sensitive case the estimate i( t) depends also upon both past and future costs in a manner familiar from analysis of the LEQG case in Chapter 16. However, a true certainty-equivalence interpretatio n does not hold in general. To see this, consider the risk-neutral discrete-time case and let i 1 be the projection estimateofx1 basedupon W1 = (Y1 , U1-I)O.IfQ(x) isaquadratic functionthe n
E[Q(xt)l Wr]
= Q(it) + tr(Qxx Vt)
where Qxx is the matrix of second-degree terms in Q( x) and V1 is the covariance matrix, conditional on W1, of the estimation error i 1 - x 1 • The proof of certainty equivalence in the LQG (risk-neutral) case relied on the fact that the term
(: ~:
3 THE RISK-SENSITIVE MAXIM UM PRINCI PLE
43.5
was historytr(Qxx V1 ) could be dropped from consideration because it none upon so and Wr. upon ence depend no independent. That is, it showed
version of past policy. It is dropped from consideration in the risk-neutral 1 . It will in fact be history~order of small, is it e the present case becaus because dependent in general, because of plant/observation non-linearity or is made imation approx The tions. observa 'of quality control actions affect the ce differen the that s ground the on x for .X of tion virtually the brute-force substitu is small. an This does not mean that the theorem is without content. For one thing, are sitivity risk-sen of effects the , another For d. appropriate estimate is deduce discerned. that the As in the LEQG case, the separation principle manifests itself in to deone allows t) x( state current of x value the provisional specification of • These evaluaF stress 1 future and P stress 1 past of couple the evaluations vely, in tions may be said to be concer ned with estimation and control respecti and the that the evaluation of Pr sums up the effect of current information W(t) l optima ation -observ perfect the of ination determ the evaluation of F 1 implies are t) u(x, control rule u(x, t). However, the actual extimate .X and control rule F + B(P 1 ) with respect to of 1 sation minimi the ing: recoupl a then derived by x = x(t).
3 THE RISK-S ENSIT IVE MAXI MUM PRINC IPLE in the last Just as for the LEQG case, the stress-extremising principle derived tegral section enables us to characterise the optima l control problem in time-in future and form. This form is made explicit by large-deviation evaluations of past le for the princip um maxim sitive risk-sen a to then leads stress, P and F, which case of imperfect observation. state the We have hithert o worked in terms of a fixed horizon h, but now entry of obvious generalisation to the case when termin ation occurs at the first IK(~) for termin al ~ = (x, t) to a prescribed stopping set//, with termina l cost value(. Theorem 25.3.1
Define the time-integral

$$\mathbb{J} = \theta^{-1}D_0(x(0)) + \int_0^{\bar t}\bigl[c(x, u) + \lambda^{\mathrm T}\dot x + \mu^{\mathrm T}\dot z - \theta^{-1}H(x, u, \theta\lambda^{\mathrm T}, \theta\mu^{\mathrm T})\bigr]\,d\tau + \mathbb{K}(\xi), \qquad (14)$$

where t̄ denotes the moment of termination.
Then, in the limit of large κ, the optimal value of u(t) in the cases θ > 0 (θ < 0) is that determined by seeking the infimum of 𝕁 with respect to controls undetermined at time t, the supremum (infimum) with respect to the conjugate variables λ(·) and μ(·), and the infimum (supremum) with respect to variables unobservable at time t.
In the case of negative θ the extremal operations must be applied sequentially in time, as indicated in equation (16.8).
Proof Split the path integral 𝕁 into the integrals 𝕁_t^- and 𝕁_t^+ for the separate ranges τ < t and τ ≥ t. Then we shall prove the theorem by establishing the asymptotic identifications

$$P(W(t), x) = \operatorname{ext}\,\mathbb{J}_t^-, \qquad F(x, t) = \operatorname{ext}\,\mathbb{J}_t^+,$$

where the extremum is that specified in the theorem, but subject to the constraint x(t) = x.
The extremal conditions with respect to future y(τ) yield μ(τ) = 0 (τ > t). But then the expression for 𝕁_t^+ reduces exactly to the expression for F(x, t) asserted in Theorem 23.2.1. The corresponding evaluation of P(W(t), x) follows by direct appeal to the rate-function evaluation (22.36). The course of the Markov process with state variable (x, z) for given u(·) (not necessarily optimal) over the time interval 0 ≤ τ ≤ t will have density f(X(t), Z(t)) = exp[-κD(t) + o(κ)], where X(t) denotes x-history to time t, etc. and
$$D(t) = D_0(x(0)) + \sup_{\alpha(\cdot),\,\beta(\cdot)} \int_0^t [\alpha\dot x + \beta\dot z - H(x, u, \alpha, \beta)]\,d\tau.$$
The large deviation property then implies that (9) holds, with the identification
$$\theta P(W(t), x) = \inf_{x(\cdot)}\,[\theta\,\mathbb{C}_t + D(t)],$$

where ℂ_t is the cost up to time t and the infimum is subject to x(t) = x. But this yields exactly the evaluation (14) if we set ż = y, α = θλᵀ and β = θμᵀ. □

The statement that one would generally recognise as a maximum principle follows if one writes down the stationarity conditions on the path integral (14). In terms of the effective Hamiltonian
$$\mathcal{H}(x, u, \lambda, \mu) = \theta^{-1}H(x, u, \theta\lambda^{\mathrm T}, \theta\mu^{\mathrm T}) - c(x, u)$$

these would be

$$\dot x = \frac{\partial\mathcal H}{\partial\lambda^{\mathrm T}}, \qquad \dot\lambda^{\mathrm T} = -\frac{\partial\mathcal H}{\partial x} \qquad (0 < \tau < \bar t),$$

$$\dot z = \frac{\partial\mathcal H}{\partial\mu^{\mathrm T}} \quad (0 < \tau < \bar t), \qquad \mu = 0 \quad (t < \tau \le \bar t),$$

u is prescribed (0 ≤ τ < t), ℋ is maximal in u (t ≤ τ < t̄), y is prescribed (0 ≤ τ ≤ t), plus the end conditions implied by the theorem.
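The first pair of these conditions is just the stationarity of the integral in (14) under free variation of λ(·) and x(·); the following is a minimal sketch of that calculation, assuming ℋ smooth and variations vanishing at the endpoints of the relevant interval.

```latex
\begin{aligned}
% vary \lambda(\cdot): the coefficient of \delta\lambda must vanish
\delta\lambda:\quad & \int_0^{\bar t}\Bigl[\delta\lambda^{\mathrm T}\dot x
   - \frac{\partial\mathcal H}{\partial\lambda}\,\delta\lambda\Bigr]d\tau = 0
   \;\Longrightarrow\; \dot x = \frac{\partial\mathcal H}{\partial\lambda^{\mathrm T}},\\
% vary x(\cdot): first integrate \lambda^{\mathrm T}\delta\dot x by parts
\delta x:\quad & \int_0^{\bar t}\Bigl[-\dot\lambda^{\mathrm T}
   - \frac{\partial\mathcal H}{\partial x}\Bigr]\delta x\,d\tau = 0
   \;\Longrightarrow\; \dot\lambda^{\mathrm T} = -\frac{\partial\mathcal H}{\partial x}.
\end{aligned}
```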
4 EXAMPLES AND SPECIAL CASES

Under the LQG assumptions of the usual quadratic cost function and joint DCF (3) one readily finds that the assertions of the last section agree with those derived for the state-structured LEQG case in Chapter 16, known in that case to be exact.

If we go to the risk-neutral limit we find that the asymptotically optimal control is just u(x̂(t), t), where u(x, t) is the optimal risk-neutral perfect-information control and x̂(t) the value of x minimising D(W(t), x): the large deviation version of the conditionally most probable value. As indicated above, this conclusion rests on nothing more sophisticated than the facts that x̂(t) - x(t) has zero expectation and a covariance matrix of order κ^{-1}.

If one formally lets θ tend to zero in the integral of (14) one obtains what seems like an excessive collapse, in that θ^{-1}H(x, u, θλᵀ, θμᵀ) reduces to λᵀa(x, u) + μᵀc(x, u), where c(x, u) is the expected value of y(t) conditional on the past. That is, the process seems to reduce to determinism in observation as well as plant, which is surely too much! However, one will then be left with a term μᵀ(y - c(x, u)) in the integral, whose extreme value with respect to μ will be infinite unless y is exactly equal to its expected value c(x, u). This simply reflects the fact that the rate function D for the process over a given time interval is nonzero unless the process lies on its expected (deterministic) path, so that in other cases D/θ will have an infinite limit as θ tends to zero.

Consideration of the behaviour of D on its own is the theme of our final section: the consideration of process statistics unskewed by costs.

5 PURE ESTIMATION: NON-LINEAR FILTERING

An interesting special case is that for which there is no aspect of cost or choice of control policy, and one is simply trying to infer the value of x(t) from knowledge of current observations W(t), possibly in the presence of past (and so known) controls. In the LQG case this inference is supplied by the Kalman filter, supplemented by the recursive calculation of associated matrices such as the covariance matrix V, as we saw in Chapter 12. The general problem is referred to as that of 'nonlinear filtering' by analogy with this case. If we specialise the material of this chapter we have then a large deviation treatment of the nonlinear filtering problem which is exact under LQG assumptions.

Since the controls are known and have no cost significance we can write W(t) simply as Y(t), the observation history. Consider the expression
$$D_t(Y(t), x) = \inf_{x(\cdot)}\Bigl[D_0(x(0)) + \sup_{\alpha(\cdot),\,\beta(\cdot)} \int_0^t [\alpha\dot x + \beta y - H(x, u, \alpha, \beta)]\,d\tau\Bigr] \qquad (15)$$
where the infimum is subject to x(t) = x. Then exp[-κD_t] is the large deviation approximation to the joint probability density of x(t) and Y(t) and so provides, to
within a Y(t)-dependent normalising factor, the large deviation approximation to the probability density of the current state value x(t) conditional on the observation history Y(t). However, as we have seen in Chapters 12 and 20 and Exercise 22.7.1, the updating of such conditional distributions is more naturally carried out in transform space. We shall in fact deduce a large deviation form of a forward updating equation for the corresponding MGF

$$M(\alpha, t) = E[e^{\alpha x(t)}\mid Y(t)]. \qquad (16)$$
*Theorem 25.5.1 Suppose that the Markov process {x(t), z(t)}, where ż = y, has DCF κH(x, u, κ^{-1}α, κ^{-1}β) with known past controls u. Then for large κ the conditional MGF (16) has the form

$$M(\alpha, t) = \exp[\kappa\psi(\kappa^{-1}\alpha, t)] \qquad (17)$$

for t > 0 if it has this form for t = 0. The unscaled CGF ψ obeys the updating equation

$$\frac{\partial\psi(\alpha, t)}{\partial t} = \sigma(\alpha) - \sigma(0) \qquad (18)$$

where

$$\sigma(\alpha) = \inf_\beta\,[H(\psi_\alpha, u, \alpha, \beta) - \beta y]. \qquad (19)$$

Proof Relation (17) will hold at t = 0 under the assumption (4), when we shall have

$$\psi(\alpha, 0) = \sup_x\,[\alpha x - D_0(x)]. \qquad (20)$$
The function ψ(α, t) will differ from

$$\Psi(\alpha, t) = \sup_x\,[\alpha x - D_t(Y(t), x)] = \sup_{x(\cdot)}\Bigl[\alpha x(t) - D_0(x(0)) - \sup_{\alpha(\cdot),\,\beta(\cdot)}\int_0^t [\alpha\dot x + \beta y - H(x, u, \alpha, \beta)]\,d\tau\Bigr] \qquad (21)$$
only by an additive term, independent of α. The point is that exp[κΨ(κ^{-1}α, t)] is the large deviation evaluation of the conditional MGF of x(t) times the probability density of the observation history Y(t). We must divide out this latter to evaluate the MGF; equivalently

$$\psi(\alpha, t) = \Psi(\alpha, t) - \Psi(0, t). \qquad (22)$$
Note the distinction in (21) between the vector function α(τ) (τ ≤ t) and the vector α; we shall see that there is an effective identification α(t) = α. The extrema with respect to x(·) and β(·) are unconstrained. A partial integration of (21) and an appeal to (20) allow us to rewrite Ψ as

$$\Psi(\alpha, t) = \sup_{x(\cdot)}\ \inf_{\alpha(\cdot),\,\beta(\cdot)}\Bigl[\psi(\alpha(0), 0) + \int_0^t [\dot\alpha x - \beta y + H(x, u, \alpha, \beta)]\,d\tau\Bigr] \qquad (23)$$
where the extremisation with respect to x(t) has indeed induced the constraint α(t) = α. Now, by the same methods by which the expression (23.10) for the value function was shown to satisfy the approximate dynamic programming equation (23.9), we deduce that expression (23) obeys the forward equation in time
$$\frac{\partial\Psi}{\partial t} = \inf_\beta\,\bigl[H(\Psi_\alpha, u, \alpha, \beta) - \beta y\bigr] = \sigma(\alpha). \qquad (24)$$

Relations (22) and (24) imply that ψ satisfies (18)/(19). □
Equation (18) is indeed the precise analogue of the dynamic programming equation (23.9). However, the exact version of (18) is not, in general, the precise analogue of the exact version (23.8) of (23.9). Relations (18) and (19) provide the updating equation for the conditioned x(t) distribution and so can be regarded as providing the large-deviation equivalent of the Kalman filter for the general case. In the LQG case, when H has the evaluation (1), then (18) is exact. We find that (18) is solved by
$$\psi(\alpha, t) = \alpha\hat x(t) + \tfrac{1}{2}\alpha V(t)\alpha^{\mathrm T}$$

if x̂(t) and V(t) obey the familiar updating equations for the state estimate and the covariance matrix of its error: yet another derivation of the Kalman filter and Riccati equation!

If we regard the state estimate x̂(t) as again the conditionally most probable value then, in the large deviation approximation, it will be the value maximising inf_α[κψ(κ^{-1}α, t) - αx]. The two extremisations yield the conditions α = 0 and x̂(t) = (∂ψ(α, t)/∂α)_{α=0}, so that x̂(t) is nothing but the large deviation evaluation of the expected value of x(t) conditional on Y(t). The fact that the mode and the mean of the conditional distribution agree is an indication of the implied supposition of a high degree of regularity in the distribution.

Relations (18) and (19) are fascinating in that they supply the natural updating in the most general case to which large deviation methods are applicable. However, they do not in general supply a finite-parameter updating (i.e. ψ(α, t) does not remain within a family of functions of α specified by a finite number of t-dependent parameters). One is updating a whole distribution, and will have such a reduction only in fortunate cases; in others it can only be forced by crude approximation. The LQG case of course supplies the standard fortunate
example: the conditional distribution remains Gaussian, and is parametrised by its time-dependent mean and covariance matrix.

However, for an interesting non-Gaussian example, consider again the problem of regulating particle numbers in a chamber, associated with the unscaled DCF (22.26). Suppose that one cannot observe the actual number n of particles present in the chamber, but can only register particles which collide with a sensor, each particle in the chamber having probability intensity ν of registering. One then has essentially a running count m(t) of the number of registrations. Defining the normalised cumulative observation z(t) = m(t)/κ, we then have the normalised DCF

$$H(x, u, \alpha, \beta) = u(e^\alpha - 1) + \rho x(e^{-\alpha} - 1) + \nu x(e^\beta - 1)$$

for x = n/κ and z. Thus σ(α), as defined by (19), has the evaluation

$$\sigma(\alpha) = u(e^\alpha - 1) + \rho\psi'(e^{-\alpha} - 1) - \nu\psi' + y - y\log y + y\log(\nu\psi')$$
and the updating relation (18) becomes

$$\frac{\partial\psi}{\partial t} = u(e^\alpha - 1) + \rho\psi'(e^{-\alpha} - 1) - \nu(\psi' - \psi_0') + y\log(\psi'/\psi_0') \qquad (25)$$

where ψ' = ∂ψ/∂α and ψ'_0 = [ψ']_{α=0}. Of course, ż = y does not exist in any naive sense (as it does not even in the conventional case (1), of observation corrupted by white noise) and we must interpret this last relation incrementally:

$$\delta\psi = [u(e^\alpha - 1) + \rho\psi'(e^{-\alpha} - 1) - \nu(\psi' - \psi_0')]\,\delta t + [\log(\psi'/\psi_0')]\,\delta z \qquad (26)$$

where δz is κ^{-1} times the number of particles registered in the time-increment δt.
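Relation (26) can be realised numerically by carrying ψ on a grid of α-values and applying the two increments as registrations arrive. The following is a minimal sketch, not taken from the text: the rates, the grid, the finite-difference evaluation of ψ' and the simulation of the hidden birth-death process are all illustrative assumptions, with κ = 1.

```python
import numpy as np

# Minimal sketch of the incremental update (26) for the conditional CGF
# psi(alpha, t) in the particle-counting example, with kappa = 1.
# All rates, the grid and the simulation below are illustrative assumptions.
u, rho, nu, dt = 2.0, 1.0, 0.5, 0.01    # injection, death, registration rates
alpha = np.linspace(-2.0, 2.0, 401)     # grid of transform arguments
x0 = 3.0                                # known initial particle number
psi = alpha * x0                        # psi(alpha, 0) = alpha x(0)

def step(psi, dz):
    """Apply one increment of (26); dz = registrations in (t, t + dt)."""
    dpsi = np.gradient(psi, alpha)          # psi' = d(psi)/d(alpha)
    dpsi0 = np.interp(0.0, alpha, dpsi)     # psi'_0: psi' evaluated at alpha = 0
    drift = (u * (np.exp(alpha) - 1.0)
             + rho * dpsi * (np.exp(-alpha) - 1.0)
             - nu * (dpsi - dpsi0))
    return psi + drift * dt + np.log(dpsi / dpsi0) * dz

rng = np.random.default_rng(0)
x = x0                                  # hidden particle count
for _ in range(2000):
    x = max(x + rng.poisson(u * dt) - rng.poisson(rho * x * dt), 0)
    psi = step(psi, rng.poisson(nu * x * dt))   # observe registrations, update
# the filtered estimate of the mean of x is psi'(0):
print(np.interp(0.0, alpha, np.gradient(psi, alpha)), x)
```

Note that the whole function ψ(·, t) is carried forward on the grid: in line with the remarks above, no finite set of parameters closes under the update.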
Remarkably, relation (26) is exact; see Exercise 1 below. However, it is still not finite-dimensional in character.

Exercises and comments
(1) Since we aim to prove that the updating equation (26) is exact, we may as well set κ = 1. Suppose that M(α, t) = ∫ e^{αx} p(dx, t). If a particle is registered (i.e. δz = 1) then, by Bayes' theorem, M changes instantaneously to

$$\int \nu x\,e^{\alpha x}\,p(dx, t)\Big/\int \nu x\,p(dx, t) = M_\alpha/M_{\alpha 0} = (\psi_\alpha/\psi_{\alpha 0})M.$$

Thus ψ suffers an instantaneous increment of log(ψ_α/ψ_{α0}), which is just what equation (26) asserts. If no particle is registered in (t, t + δt), so that δz = 0, then over the time increment δt the MGF M changes to

$$\text{const.}\int (1 + \Lambda\,\delta t)\,e^{\alpha x}(1 - \nu x\,\delta t)\,p(dx, t) = \text{const.}\int e^{\alpha x}\,[1 + H(x, \alpha)\,\delta t - \nu x\,\delta t]\,p(dx, t).$$
Here Λ is the infinitesimal generator of the x-process alone, H(x, α) = u(e^α - 1) + ρx(e^{-α} - 1) the corresponding DCF, and the constant is such as to retain the normalisation M(0, t + δt) = 1. This indeed implies relation (26) with δz = 0.

Notes on the literature
The material of this chapter is largely taken from Whittle (1991a). There is a very large literature on non-linear filtering; an appeal to large deviation ideas is explicit in the papers by Hijab (1984) and James and Baras (1988). These present what is essentially a recursive updating of the rate function D(Y(t), x) in the case when the stochastic element of both plant and observation dynamics is a diffusion, the updating equation being characterised with some justification as 'the wave equation of non-linear filtering'. However, this description could much more fittingly be applied to the general updating relation (18)/(19). That this is set in transform space is, as we have emphasised, natural in the inference context.
APPENDIX 1

Notation and Conventions

For discrete-time models the time variable t is assumed to take signed integer values; for continuous-time models it may take any value. A variable x which is time-dependent is a signal. In discrete time the value of x at time t is denoted x_t. More generally, if (…) is a bracketed expression then (…)_t denotes that expression with all quantities in the bracket evaluated at time t, unless otherwise indicated. In continuous time the time-dependence of x is either indicated x(t) or understood. In either discrete or continuous time the best estimate of x is denoted x̂. A circumflex is also used to denote a Laplace transform or z-transform, in Chapter 4 alone.

If {x_t} is a sequence of variables in discrete time then X_t denotes the history {x_τ; τ ≤ t} of this sequence up to time t. The starting point of this history may depend upon circumstances; it is usually either τ = 0 or τ = -∞. The simple capital X denotes a complete realisation: the course of the x-variable over the whole relevant time interval.

There are other standard modifiers for signals. The command value of x is denoted x^c and the deviation x - x^c denoted x*. The notation x̄ denotes either the equilibrium value or the terminal value of x, in different contexts. Do not confuse this with the effect of the overbar on complex quantities and operators, where it amounts to the operation of conjugation; see below. The bold symbol x is used in a system notation, when it denotes all internal variables of the system collectively.

A matrix is denoted by an italic capital: A. Its transpose is denoted by Aᵀ. If Q is a matrix then Q > 0 and Q ≥ 0 indicate respectively that Q is positive definite and positive semi-definite (and so understood to be symmetric). If A is a matrix with complex elements then Ā denotes the transpose of its complex conjugate. The overbar thus combines the operations of transposition and complex conjugation, which we regard as the conjugation operation for matrices.

If x is said to be a vector then it is a column vector unless otherwise indicated. If F(x) is a scalar function of x then ∂F/∂x is the row vector of first differentials of F with respect to the elements of x. We sometimes denote this by the convenient subscript notation F_x, and use F_xx to denote the square matrix of second differentials. If a(x) is itself a vector function of x then a_x denotes the matrix whose jkth element is the differential of the jth element of a with respect to the kth element of x. If H(p) is a scalar function of a row vector p then H_p is the column vector of differentials of H with respect to the elements of p.
The subscript notation is also sometimes used to distinguish sub-matrices in a partitioned matrix; see, for example, Sections 19.2 and 20.3.

Operators are generally denoted by a script capital. Important special operators are the identity operator ℐ, the backward shift operator 𝒯 with effect 𝒯x_t = x_{t-1}, and the differential operator 𝒟 with effect 𝒟x = dx/dt. A symbol 𝒜 will denote a distributed lag operator A(𝒯) = Σ_j A_j 𝒯^j in discrete time and a differential operator A(𝒟) = Σ_j A_j 𝒟^j in continuous time. The conjugate 𝒜̄ of 𝒜 is defined as A(𝒯^{-1})ᵀ in discrete time and A(-𝒟)ᵀ in continuous time. We shall often consider the corresponding generating functions A(z) and A(s), the complex scalars z and s corresponding to 𝒯 and 𝒟 respectively. These will also be denoted by 𝒜, the interpretation (operator or generating function) being clear from the context. In this case the conjugate 𝒜̄ is to be identified with A(z^{-1})ᵀ and A(-s)ᵀ in discrete- and continuous-time contexts respectively.

The shell or Gill characters 𝕊, ℂ and 𝔻 are used to denote stress, cost and the discrepancy component of stress respectively. The large-deviation evaluation of 𝔻 is proportional to the rate function, which we denote by D. The cost from time t is denoted ℂ_t, so if the process is obliged to terminate at a given horizon point h then the closing cost is denoted ℂ_h. This does not necessarily coincide with the terminal cost 𝕂, which is the cost incurred when the process is obliged to stop (possibly before h) because it has entered a stopping set 𝒮.

The notations max and min before an expression denote the operation of taking a maximum or minimum, with respect to a variable which is indicated if this is not obvious. Correspondingly, sup and inf denote the taking of a supremum or infimum; stat denotes evaluation at a stationary point; ext denotes the taking of an extremum (of a nature specified in the context).

The expectation operator is denoted by E, and E_π denotes the expectation under a policy π. Probability measure is indicated by P(·) and probability density sometimes by f(·). Conditional versions of these are denoted by E(·|·), P(·|·) and f(·|·) respectively. The covariance matrix E{[x - E(x)][y - E(y)]ᵀ} between two random vectors x and y is written V_xy or cov(x, y). We write cov(x, x) simply as cov(x). If V_xy = 0 then x and y are said to be orthogonal, expressed symbolically as x ⊥ y.

The modified equality symbol := (resp. =:) in an equation indicates that the left- (right-)hand member is defined by the expression in the right- (left-)hand member.

Notations adopted as standard throughout the text are listed below, but some symbols perform multiple duty.
Abbreviations
Abs   Absolute term in a power series expansion on the unit circle
AGF   Autocovariance generating function
CEP   Certainty equivalence principle
CGF   Cumulant generating function
DCF   Derivate characteristic function
MGF   Moment generating function
Tr   Trace (of a matrix)
Standard symbols

A, B, C   Coefficient matrices in the plant and observation equations
𝒜, ℬ, 𝒞   Operator versions of these for higher-order models
𝒦   The system operator (𝒜, ℬ)
C   Cost function
c(x, u)   Instantaneous cost function or rate
𝒟   The time differential operator
D   Rate function
𝔻   The discrepancy component of stress
d   Plant disturbance; various
E   Expectation operator
E_π   Expectation operator under policy π
F   Future value function; future stress
f   Transient cost; probability density; the operator in the Riccati equation
G   Total value function; the coefficient of the quadratic terms in a time-integral; a transfer function; controllability or observability Gramian
g   A function defining a policy u_t = g(x_t)
H   Hamiltonian; the innovation coefficient in the Kalman filter; the constant factor in a canonical factorisation; the matrix coefficient of highest-order derivatives
h   Horizon; various
ℐ   The identity operator
I   The identity matrix
𝕁   Time-integral
J   The control-power matrix
K   The matrix coefficient in the optimal feedback control
𝕂   A terminal cost function
L   Forward operator; covariance matrix of plant and observation noise
ℒ   The forward operator of the dynamic programming equation; Laplace transform operator
M   Forward operator in continuous time; the covariance matrix of observation noise; a moment generating function; a retirement reward; an upper bound on the magnitude of control
m   The dimension of the control variable u
N   The covariance matrix of plant noise (termed the noise power matrix in continuous time)
n   The dimension of the process variable x
P   Past stress; probability measure
p   The order of dynamics; the conjugate variable of the maximum principle
Q   The matrix of control costs
R   The matrix of process-variable costs
ℛ   The cost matrix in a system formulation
r   The dimension of the observation y
S   The cross-matrix of process- and control-variable costs
𝕊   Stress
𝒮   The stopping set
s   Time to go; the complex scalar argument of a transfer function corresponding to 𝒟
𝒯   The backward shift operator
t   Time; the present moment
U   The complete control realisation
U_t   Control history at time t
u   The control (action, decision) variable
V   A covariance matrix, or its risk-sensitive analogue; a value function under a prescribed policy
v(λ, μ)   The information analogue of c(x, u)
W_t   Information available at time t; a noise-scaling matrix
w   The command signal
X   The complete process realisation
X_t   Process history at time t
x   The process or state variable
Y   The complete observation realisation
Y_t   Observation history at time t
y   Observation
z   The complex scalar corresponding to 𝒯
α   Discount rate; the row vector argument in transform contexts; various
β   Discount factor; the row vector argument in transform contexts; various
Γ   Gain matrix
γ   Average cost, either direct or in a risk-sensitive sense
Δ   System error; estimation error
δ   Increment (as the time increment δt); Kronecker δ-function
ε   Plant noise
ζ   Innovation; primitive system input; coefficient of the linear term in a time-integral
η   Observation noise
θ   Risk-sensitivity parameter
λ   The Lagrange multiplier for the plant constraint; birth rate
μ   The Lagrange multiplier for the observation constraint; death rate
ν   A Gittins index; various
ξ   A sufficient variable; the combination (x, t) of state variable and time; the argument of a quadratic time-integral
Π   The matrix of a quadratic value function
π   Policy
σ   The coefficient of the linear term in a quadratic value function; a function occurring in the updating of posterior distributions
τ   A running time variable
Φ   A matrix of operators (or generating functions) occurring in forward optimisation; the normal integral
φ   A canonical factor of Φ; the normal density; various
Ψ   A matrix of operators (or generating functions) occurring in backward optimisation; a function occurring in the updating of posterior distributions
ψ   A canonical factor of Ψ; a cumulant-generating function; an expectation over first passage
χ   A cost-renormalisation of a risk-sensitive criterion; various
Ω   The information gain matrix
ω   Frequency
Modifiers, superscripts, etc.

x̄   Equilibrium value or terminal value of the variable x
𝒜̄   Conjugate of the operator or generating function 𝒜
θ̄   The critical value of the risk-sensitivity parameter θ
x^c   The command value of x
x*   The deviation x - x^c; limit-optimal path
x̂^(t)   The best (minimum stress) estimate of x based on information at time t
x̂_t   The best estimate of x_t based on current information: x̂_t^(t)
x̃_t   The best estimate of x_t based on current information and past costs alone
APPENDIX 2

The Structural Basis of Temporal Optimisation

A stochastic model of the system will specify the joint distribution, in some sense, of all relevant random variables (e.g. process variables and observations) for prescribed values of the control variables. These latter variables are then parametrising rather than conditioning, since they affect the distribution of the former variables but are not themselves initially defined as random variables. The information at a given time consists of all variable-values which are known at that time, both observations and control values.

The process model describes not merely the plant which is to be controlled, but also variables such as command signals. Either these signals have their whole course specified from the beginning (which is then part of the model specification) or they are generated by a model which is then part of the process model.

Consider optimisation in discrete time over the time interval t ≥ 0. As in the text, we shall use X_t, U_t and Y_t to denote process, control and observation histories up to time t, and W_t to denote information available at time t. More explicitly, W_t denotes the information available at the time the value of u_t is to be chosen, and so consists of (W_0, Y_t, U_{t-1}). That is, it includes recollection of previous controls and W_0, the prior information available when optimisation begins. We take for granted that it also implies knowledge of t itself: of clock time. Let us for simplicity take W_0 for granted, so that all expectations and probabilities are calculated for the prescribed W_0. Our aim is to show that the total value function
$$G(W_t) = \inf_\pi E_\pi(C\mid W_t) \qquad (1)$$

satisfies the optimality equation

$$G(W_t) = \inf_{u_t} E[G(W_{t+1})\mid W_t, u_t] \qquad (2)$$

and that the infimising value of u_t in (2) is the optimal value of control at time t. These assertions were taken almost as self-evident in Chapter 8, but they require proof at two levels. One is at the level of rigour; there may be technical problems due to the facts that the horizon is infinite or that the control may take values in an infinite set. We shall not concern ourselves with these issues, but rather with the much
more fundamental one of structure. The optimality equation is only 'self-evident' because one unconsciously makes structural assumptions. These assumptions and their consequences must be made explicit. That there is need for closer thought is evident from the fact that the conditional expectations in (1) and (2) are not well-defined as they stand, because the control history U_{t-1} or U_t is not defined as a random variable. The following discussion is an improved version of the first analysis of these matters, which was given in Whittle (1982), pp. 150-2.

Complete realisations X_∞, U_∞ and Y_∞ will be denoted simply by X, U and Y. It can be assumed without loss of generality that W = W_∞ gives complete information on the course of all variables. We shall use naive probability notations P(x) and P(x|y) for distributions and conditional distributions of random variables x and y, as though these variables were discrete-valued. All such formalism has an evident version or interpretation in more general cases. A subscript π, as in P_π(x), indicates the distribution induced under policy π; correspondingly for expectations.

The policy π is subject to the condition of realisability: that the value of the current control u_t may depend only on current observables W_t. We shall express this by saying that u_t must be W_t-measurable. By this we do not mean measurability in the technical sense of measure theory, but in the naive structural sense, that u_t can depend on no other variable than W_t. We shall in general allow randomised policies, in that the policy π is determined by specification of a conditional probability distribution P_π(u_t|W_t) for each t. This is convenient, even though the optimal policy may be taken as deterministic, in that it expresses u_t as a function of W_t for each t.

One must now distinguish between conditioning variables and parametrising variables. A model for the stochastic dynamics of the process would imply a specification of the probability distribution of X for any given U. However, this is not a distribution of X conditional on U, because U is not even defined as a random variable until a control policy has been specified. Rather, U is a parametrising variable: a variable on which the distribution of X depends. We shall write the distribution of X conditioned by Y and parametrised by Z as P(X|Y; Z). The specification of a stochastic model for the controlled process thus corresponds to the specification of the parametrised distribution P(X|; U). The full stochastic specification of both plant equation and observation structure corresponds to specification of P(X, Y|; U). We see then that we should more properly write P_π(u_t|W_t) as P_π(u_t|; W_t), because we are simply defining a u_t distribution which allows an arbitrary dependence upon W_t. Calculation of expectations for prescribed W_0 may also be a mixture of conditioning and parametrising. These distinctions turn out not to matter, but thanks only to the ordering imposed by a temporal structure.

The following are now the basic assumptions of a temporal optimisation problem.
(i) Separateness of model and policy.

$$P_\pi(X, Y, U) = P(X, Y\mid; U)\prod_{t=0}^\infty P_\pi(u_t\mid; W_t). \qquad (3)$$
(ii) Causality.

$$P(X_t, Y_t\mid; U) = P(X_t, Y_t\mid; U_{t-1}). \qquad (4)$$
The assumption W_t = (W_0, Y_t, U_{t-1}) implies the further properties

(iii) Non-anticipation. W_t is (Y_t, U_{t-1})-measurable.

(iv) Retention of information. W_{t-1} and U_{t-1} are W_t-measurable.

Conditions such as these are often taken for granted and not even listed, but the dynamic programming principle is not valid without them. The fact that information at time t includes that at time t - 1 would be expressed in some of the literature by saying that in (1), for example, one is conditioning on an increasing sequence of σ-fields constituting a filtration. We mention the point only to help the reader make the connection; this is not a language that we shall need.

Condition (i) factors the joint distribution of X, Y and U under policy π into terms dependent on the model and the policy respectively. Note that it is by specification of the policy that one completes the stochastic specification which allows U to be regarded as a random variable jointly with X and Y. Note that relation (3) also expresses realisability of the policy.

Condition (ii) does indeed express the vital causality condition: that the course of the process up to a given time cannot be affected by actions after that time. Here the distinction between parametrising and conditioning variables is crucial. In the sense expressed by (4) the variable x_t (for example) cannot be affected by u_τ (τ ≥ t). However, once a policy is specified, then x_t will in general show a stochastic dependence on these future control variables, simply because these future control variables will in general show a stochastic dependence upon x_t.

Condition (iv) expresses the fact that information is in principle never lost or discarded. ('In principle', because assumptions of the Markov type make it unnecessary to retain all information.) In particular, past decisions are recalled.

The aspect of optimisation is introduced by requiring π to be such as to minimise E_π(C) for a suitably specified cost function C = C(W). Usually the observations y are regarded as subsidiary to the process variable x, in that they are observations on the process variable. This dependent role would be expressed by some property such as

$$P(x_t, y_t\mid X_{t-1}, Y_{t-1};\, U_{t-1}) = P(x_t, y_t\mid X_{t-1};\, U_{t-1})$$

and the assumption that the cost function C may depend on X and U but not on Y. However, there is no need to make such assumptions; one can simply regard {x_t, y_t} as a stochastic process describing system evolution (and parametrised by U) of which only the component {y_t} is observable.
We shall now show that the dynamic programming principle follows from these assumptions. For notational simplicity we shall abbreviate P_π(u_t|; W_t) to P_{πt}. An unrestricted summation Σ will be a summation over all W. A symbol (W_t) under the summation sign will indicate that the summation is to be extended over all W consistent with a given W_t. Let us define

$$V(\pi, W_t) = E_\pi(C\mid W_t),$$

the total value function for a specified policy π. The conditional-expectation notation is legitimate, since W_t is defined as a random variable under the policy. Then V obeys the recursion

$$V(\pi, W_t) = E_\pi[V(\pi, W_{t+1})\mid W_t] \qquad (5)$$

in virtue of (iv) and the properties of a conditional expectation.

Lemma A2.1 V(π, W_t) is independent of policy before time t.
Proof We have

$$V(\pi, W_t) = \frac{\sum_{(W_t)} C(W)\,P_\pi(X, Y, U)}{\sum_{(W_t)} P_\pi(X, Y, U)}. \qquad (6)$$

Substituting expression (3) for P_π(X, Y, U) into this equation we find that the P_{πj} for j < t cancel out, since they depend on W only through W_t. □
Lemma A2.2 For any function φ of process history the expectation E_π[φ(X_{t+1}, Y_{t+1})|W_t, u_t] is independent of policy and can be written

$$E_\pi[\phi(X_{t+1}, Y_{t+1})\mid W_t, u_t] = \frac{\sum_{(W_t, u_t)} \phi(X_{t+1}, Y_{t+1})\,P(X_{t+1}, Y_{t+1}\mid; U_t)}{\sum_{(W_t, u_t)} P(X_{t+1}, Y_{t+1}\mid; U_t)}. \qquad (7)$$

Proof Relation (7) certainly holds if P(X_{t+1}, Y_{t+1}|; U_t) is replaced by P_π(X_{t+1}, Y_{t+1}, U_t). But, by the separateness and causality assumptions (i) and (ii),

$$P_\pi(X_{t+1}, Y_{t+1}, U_t) = P(X_{t+1}, Y_{t+1}\mid; U_t)\prod_{j=0}^t P_{\pi j}.$$

The P_{πj} terms cancel, leaving relation (7), which certainly has the implication asserted. □
Lemma A2.3 If P_{πt} is chosen optimally, for given t, then recursion (5) becomes

$$V(\pi, W_t) = \inf_{u_t} E[V(\pi, W_{t+1})\mid W_t, u_t],$$

where the expectation operator is independent of π and the minimising u_t is optimal, under prescription of π at other time points.

Proof We have
$$E_\pi[V(\pi, W_{t+1})\mid W_t] = \sum_{u_t} P_\pi(u_t\mid W_t)\,E[V(\pi, W_{t+1})\mid W_t, u_t] \qquad (8)$$
where the last expectation operator is independent of π, by Lemma A2.2. Indeed, a further appeal to Lemma A2.1 indicates that the final expectation is independent of P_{πt}, so that P_{πt} occurs in expression (8) only where it appears explicitly. The assertions of the lemma then follow. □

The desired conclusions are now immediate.
Theorem A2.4 The optimal total value function G obeys the dynamic programming equation (2), the expectation operator being independent of policy. The minimising value of u_t in (2) is the optimal value of control at time t.

This follows simply by application of Lemma A2.3 under the assumption that policy has been optimised after time t. The explicit form of the expectation in (2) follows from (7). Essentially, P_π(W_{t+1}|U_t) = P(W_{t+1}|; U_t).
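To illustrate how the optimality equation (2), now justified, is used in computation, here is a minimal backward-induction sketch for a finite problem in which W_t reduces to an observed state x_t and the cost is additive. The state space, transition array P and cost array c are invented for the example; none of this is taken from the appendix.

```python
import numpy as np

# Backward induction realising the optimality equation (2),
#   G_t(x) = min_u E[ c(x, u) + G_{t+1}(x_{t+1}) | x_t = x, u_t = u ],
# for an invented finite-state, finite-horizon problem with additive cost.
n_x, n_u, h = 4, 2, 12
rng = np.random.default_rng(1)
P = rng.random((n_u, n_x, n_x))
P /= P.sum(axis=2, keepdims=True)      # P[u, x, x'] = P(x' | x; u)
c = rng.random((n_x, n_u))             # instantaneous cost c(x, u)

G = np.zeros(n_x)                      # closing cost at the horizon
policy = np.zeros((h, n_x), dtype=int)
for t in reversed(range(h)):
    Q = c + np.einsum('uxy,y->xu', P, G)   # E[c + G_{t+1} | x, u]
    policy[t] = Q.argmin(axis=1)           # infimising u_t is optimal (Thm A2.4)
    G = Q.min(axis=1)
print(G)             # optimal total value function at t = 0
print(policy[0])     # optimal first decision for each state
```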
APPENDIX 3

Moment Generating Functions; Basic Properties

If x is a vector random variable then its moment generating function (MGF) is defined as

$$M(\alpha) = E(e^{\alpha x}), \qquad (1)$$

where α is then a row vector. This certainly exists for purely imaginary α, and the characteristic function M(iθ) is studied for real θ very much in its own right. Its relation to the MGF is exactly that of the Fourier transform to the Laplace transform. M(α) will exist for a range of real α if the tails of the x-distribution decay at least exponentially fast. In such cases the distribution of x and associated random variables has stronger properties, which are characterised immediately and simply in terms of the MGF, as we saw particularly in Chapter 22.
Theorem A3.1 The moment generating function M(α) is convex for real α. The set 𝒜 of real α for which M(α) is finite is convex and contains the value zero.

Proof M(α) is an average of functions e^{αx} which are convex in α, and so is itself convex. We know already that 0 ∈ 𝒜. Jensen's inequality

$$M(p\alpha + q\beta) \le pM(\alpha) + qM(\beta) \qquad (\alpha, \beta \in \mathcal{A};\; p, q \ge 0;\; p + q = 1)$$

for convex functions then implies that all elements of the convex hull of 𝒜 belong to 𝒜. That is, 𝒜 is convex. □

This convexity explains why the equation M(α) = 1 has at most two real roots in the scalar case, one at α = 0 and the other of sign opposite to that of

$$M'(0) = E(x).$$
Theorem A3.2 M(α) possesses derivatives of all orders in the interior of 𝒜, obtained by differentiating expression (1) under the expectation sign.

Proof This follows from the absolute convergence of the expectation thus defined. □

Existence of derivatives is of course closely related to the existence of moments, which are proportional to the derivatives of M(α) at α = 0. We see from the
theorem that, if 0 is an interior point of 𝒜, then moments of all orders exist. The classic case for which no moments of integral order exist is the Cauchy distribution: a distribution of scalar x with probability density proportional to (1 + x²)^{-1} and characteristic function exp(-|θ|).

We could have proved convexity of M(α) by appealing to the fact that the matrix of second differentials

$$\frac{\partial^2 M}{\partial\alpha_j\,\partial\alpha_k} = E(x_j x_k\,e^{\alpha x})$$

(where α_j is the jth element of α) is plainly non-negative definite. This is clumsy compared with the proof above, and makes an unnecessary appeal to the existence of differentials. However, we shall indeed use it in a moment. If we write the second differential in the left-hand member as M_{jk}(α) then M_{jk}(α)/M(α) can be identified as E^{(α)}(x_j x_k), where E^{(α)} is the expectation for the tilted distribution defined in Section 22.2. That is, the α-tilted expectation of a function φ(x) is

$$E^{(\alpha)}[\phi(x)] = \frac{E[\phi(x)\,e^{\alpha x}]}{M(\alpha)}.$$
The MG F of a sum of independent rand om variables is the product of MGFs. The cumulantgeneratingfunction (abb reviated to CG F) '1/J(a) =lo g M(a ) is then a natural quantity to consider, sinc e the CG F of a sum of independen t rand om variables is the sum of CGFs.
Theorem A3.3 The function ψ(α) is also convex in α.

Proof If we use the subscript notations M_j = ∂M/∂α_j and M_{jk} = ∂²M/∂α_j∂α_k, with argument α understood, then we have

$$\psi_{jk} = \frac{M_{jk}}{M} - \frac{M_j M_k}{M^2}.$$

But the matrix of these second differentials is just the covariance matrix of x on the tilted distribution, and so is non-negative definite. This proves convexity of ψ. □
References

Azencott, R. (1982) Formule de Taylor stochastique et développement asymptotique d'intégrales de Feynman. Séminaire de Probabilités XVI. Lecture Notes in Mathematics 921, 237-284. Springer, Berlin.
Azencott, R. (1984) Densité des diffusions en temps petits; développements asymptotiques. Séminaire de Probabilités XVIII. Lecture Notes in Mathematics 1059, 402-498. Springer, Berlin.
Bartlett, M.S. (1949) Some evolutionary stochastic processes. J. Roy. Statist. Soc. B, 11, 211-229.
Bartlett, M.S. (1955) Deterministic and stochastic models for recurrent epidemics. Proc. Third Berkeley Symposium (Ed. J. Neyman), IV, 81-109.
Bartlett, M.S. (1960) Stochastic Population Models in Ecology and Epidemiology. London.
Ben Arous, G. (1988) Méthodes de Laplace et de la phase stationnaire sur l'espace de Wiener. Stochastics, 25, 125-153.
Benes, V., Shepp, L.A. and Witsenhausen, H.S. (1980) Some soluble stochastic control processes. Stochastics, 4.
Brockett, R.W. (1970) Finite-dimensional Linear Systems. Wiley, New York.
Bucklew, J.A. (1990) Large Deviation Techniques in Decision, Simulation and Estimation. Wiley, New York.
Clark, C.W. (1976) Mathematical Bioeconomics. Wiley, New York.
Cox, D.R. and Smith, W.L. (1961) Queues. Methuen, London.
Davis, M.H.A. (1984) Piecewise-deterministic Markov processes: a general class of non-diffusion stochastic models. J. Roy. Statist. Soc. B, 46, 353-388.
Davis, M.H.A. (1986) Control of piecewise-deterministic processes via discrete-time dynamic programming. In Stochastic Differential Systems (Ed. M. Kohlmann). Lecture Notes in Control and Information Sciences, 78. Springer-Verlag, Berlin.
Davis, M.H.A. (1993) Markov Models and Optimization. Chapman and Hall, London.
Dembo, A. and Zeitouni, O. (1991) Large Deviations and Applications. A.K. Peters, Wellesley, USA.
Deuschel, J.-D. and Stroock, D. (1989) Large Deviations. Academic Press, New York.
Doyle, J.C., Francis, B.A. and Tannenbaum, A.R. (1992) Feedback Control Theory. Macmillan, New York.
Ellis, R. (1985) Entropy, Large Deviations and Statistical Mechanics. Springer, New York.
Fleming, W.H. (1971) Stochastic control for small noise intensities. SIAM J. Control Optim., 9, 473-517.
Fleming, W.H. (1978) Exit probabilities and optimal stochastic control. Applied Math. Optim., 4, 329-346.
Fleming, W.H. (1985) A stochastic control approach to some large deviations problems. In Recent Mathematical Methods in Dynamic Programming (Eds. Capuzzo Dolcetta et al.). Springer, Berlin.
Fleming, W.H. and Soner, H.M. (1992) Controlled Markov Processes and Viscosity Solutions. Springer, Berlin.
Fleming, W.H. and Tsai, C.P. (1981) Optimal exit probabilities and differential games. Applied Math. Optim., 7, 253-282.
Fragopoulos, D. (1994) H∞ Synthesis Theory using Polynomial System Representations. Ph.D. Thesis, Department of Electrical and Electronic Engineering, University of Strathclyde, Scotland.
Francis, B.A. (1987) A Course in H∞ Control Theory. Springer, Berlin.
Freidlin, M.I. and Wentzell, A.D. (1984) Random Perturbations of Dynamical Systems. Springer-Verlag, New York. (Russian original published in 1979 by Nauka, Moscow.)
Gale, D. (1960) The Theory of Linear Economic Models. McGraw-Hill, New York.
Gale, D. (1967) On optimal development in a multi-sector economy. Rev. Econ. Stud., 34, 1-18.
Gale, D. (1968) A mathematical theory of optimal economic development. Bull. Amer. Math. Soc., 74, 207-223.
Gärtner, J. (1977) On the large deviations from the invariant measure. Th. Prob. Appl., 22, 24-39.
Gibbens, R.J., Kelly, F.P. and Key, P.B. (1988) Dynamic alternative routing: modelling and behaviour. In Proceedings of the Twelfth International Teletraffic Congress. North Holland, Amsterdam.
Gibbens, R.J., Kelly, F.P. and Key, P.B. (1995) Dynamic alternative routing. In Routing in Communications Networks (Ed. M.A. Steenstrup). Prentice Hall, Englewood Cliffs, New Jersey.
Gihman, I.I. and Skorohod, A.V. (1972) Stochastic Differential Equations. Springer, Berlin.
Gihman, I.I. and Skorohod, A.V. (1979) The Theory of Stochastic Processes, Vol. III. Springer, Berlin.
Gittins, J.C. (1989) Multi-armed Bandit Allocation Indices. Wiley, Chichester.
Gittins, J.C. and Jones, D.M. (1974) A dynamic allocation index for the sequential design of experiments. In Progress in Statistics (Ed. J. Gani), pp. 241-266. North Holland, Amsterdam.
Glover, K. and Doyle, J.C. (1988) State-space formulae for all stabilizing controllers that satisfy an H∞-norm bound and relations to risk sensitivity. Systems & Control Letters, 11, 167-172.
Hagander, P. (1973) The use of operator factorisation for linear control and estimation. Automatica, 9, 623-631.
Hijab, O. (1984) Asymptotic Bayesian estimation of a first order equation with small diffusion. Ann. Prob., 12, 809-902.
Holland, C.J. (1977) A new energy characterisation of the smallest eigenvalue of the Schrödinger equation. Comm. Pure Appl. Math., 30, 755-765.
Holt, C., Modigliani, F., Muth, J.F. and Simon, H.A. (1960) Planning, Production, Inventories and Workforce. Prentice-Hall, Englewood Cliffs, New Jersey.
Howard, R.A. (1960) Dynamic Programming and Markov Processes. MIT Press and Wiley, New York.
Jacobson, D.H. (1973) Optimal stochastic linear systems with exponential performance criteria and their relation to deterministic differential games. IEEE Trans. Autom. Control, AC-18, 124-131.
Jacobson, D.H. (1977) Extensions of Linear-quadratic Control, Optimization and Matrix Theory. Academic Press, New York.
James, M.R. and Baras, J.S. (1988) Nonlinear filtering and large deviations: a PDE-control theoretic approach. Stochastics, 23, 391-412.
Kalaba, R. (1959) On nonlinear differential equations, the maximum operation and monotone convergence. J. Math. Mech., 8, 519-574.
Krishnan, A.R. and Ott, T.J. (1986) State-dependent routing for telephone traffic: theory and results. Proc. 25th IEEE Control and Decision Conference, 2124-2128.
Kumar, P.R. and van Schuppen, J.H. (1981) On the optimal control of stochastic processes with an exponential-of-integral performance index. J. Math. Anal. Appl., 80, 312-332.
Lande, R., Engen, S. and Saether, B.-E. (1994) Optimal harvesting, economic discounting and extinction risk in fluctuating populations. Nature, 372, 88-90.
Lande, R., Engen, S. and Saether, B.-E. (1995) Optimal harvesting of fluctuating populations with a risk of extinction. Am. Nat., 145, 728-745.
Martin-Löf, A. (1986) Entropy, a useful concept in risk theory. Scand. Actuarial J., ??, 223-235.
Miller, H.D. (1961) A convexity property in the theory of random variables on a finite Markov chain. Ann. Math. Statist., 32, 1260-1270.
Mustafa, D. and Glover, K. (1990) Minimum Entropy H∞ Control. Springer, Berlin.
Newton, G.C., Gould, L.A. and Kaiser, J.F. (1957) Analytical Design of Linear Feedback Controls. Wiley, New York.
Pollatschek, M. and Avi-Itzhak, B. (1969) Algorithms for stochastic games with geometrical interpretation. Man. Sci., 15, 399-413.
Pontryagin, L.S., Boltyanskii, V.G., Gamkrelidze, R.V. and Mishchenko, E.F. (1962) The Mathematical Theory of Optimal Processes. Interscience, New York.
Puterman, M.L. (1994) Markov Decision Processes. Wiley, New York.
Shwartz, A. and Weiss, A. (1995) Large Deviations for Performance Analysis. Chapman and Hall, London.
Speyer, J.L. (1976) An adaptive terminal guidance scheme based on an exponential cost criterion with applications to homing missile guidance. IEEE Trans. Autom. Control, AC-21, 371-375.
Speyer, J.L., Deyst, J. and Jacobson, D.H. (1974) Optimisation of stochastic linear systems with additive measurement and process noise using exponential performance criteria. IEEE Trans. Autom. Control, AC-19, 358-366.
Stroock, D. (1984) An Introduction to the Theory of Large Deviations. Springer, Berlin.
Stroock, D. and Varadhan, S. (1979) Multidimensional Diffusion Processes. Springer, Berlin.
Tegeder, R.W. (1993) Large Deviations, Hamiltonian Techniques and Applications in Biology. Ph.D. Thesis, University of Cambridge.
Tsitsiklis, J.N. (1986) A lemma on the MAB problem. IEEE Trans. Autom. Control, AC-31, 576-577.
Van Vleck, J.H. (1928) Proc. Natl. Acad. Sci. USA, 14, 178.
Vanderbei, R.J. and Weiss, A. (1988) Large Deviations and Their Application to Computer and Communications Systems. Circulated unpublished notes, AT&T Bell Laboratories.
Varadhan, S. (1984) Large Deviations and Applications. SIAM, Philadelphia.
Vidyasagar, M. (1985) Control System Synthesis: a Factorization Approach. MIT Press, Cambridge, Mass.
Weber, R. and Weiss, G. (1990) On an index policy for restless bandits. J. Appl. Prob., 27, 647-648.
Whittle, P. (1963) Prediction and Regulation. English Universities Press, London.
Whittle, P. (1980) Multi-armed bandits and the Gittins index. J. Roy. Statist. Soc., B 42, 143-149.
Whittle, P. (1981) Risk-sensitive linear/quadratic/Gaussian control. Adv. Appl. Prob., 13, 764-777.
Whittle, P. (1982) Optimisation over Time, Vol. 1. Wiley, Chichester.
Whittle, P. (1983a) Optimisation over Time, Vol. 2. Wiley, Chichester.
Whittle, P. (1983b) Prediction and Regulation. Second and revised edition of Whittle (1963). University of Minnesota Press and Blackwell, Oxford.
Whittle, P. (1986) The risk-sensitive certainty equivalence principle. In Essays in Time Series Analysis and Allied Processes (Ed. J. Gani), 383-388. Applied Probability Trust, Sheffield.
Whittle, P. (1988) Restless bandits: activity allocation in a changing world. In A Celebration of Applied Probability (Ed. J. Gani), J. Appl. Prob., 25A, 287-298.
Whittle, P. (1990a) Risk-sensitive Optimal Control. Wiley, Chichester and New York.
Whittle, P. (1990b) A risk-sensitive maximum principle. Syst. Contr. Lett., 15, 183-192.
Whittle, P. (1991a) A risk-sensitive maximum principle: the case of imperfect state observation. IEEE Trans. Autom. Control, AC-36, 793-801.
Whittle, P. (1991b) Likelihood and cost as path integrals. J. Roy. Statist. Soc., B 53, 505-529.
Whittle, P. (1995) Large-deviation expressions for the distribution of first-passage coordinates. Adv. Appl. Prob., 27, 692-710.
Whittle, P. and Gait, P. (1970) Reduction of a class of stochastic control problems. J. Inst. Math. Appl., 6, 131-140.
Whittle, P. and Horwood, J.W. (1995) Population extinction and optimal resource management. To appear in Phil. Trans. Roy. Soc. B.
Whittle, P. and Komarova, N. (1988) Policy improvement and the Newton-Raphson algorithm. Prob. Eng. Inf. Sci., 2, 249-255.
Whittle, P. and Kuhn, J. (1986) A Hamiltonian formulation of risk-sensitive, linear/quadratic/Gaussian control. Int. J. Control.
Willems, J.C. (1991) Paradigms and puzzles in the theory of dynamical systems. IEEE Trans. Autom. Control, AC-36, 259-294.
Willems, J.C. (1992) Feedback in a behavioural setting. In Models and Feedback: Theory and Applications (Eds. Isidori, A. and Tarn, T.J.), pp. 179-191. Birkhäuser.
Willems, J.C. (1993) LQ-control: a behavioural approach. In Proc. 1993 IEEE Conference on Decision and Control.
Wold, H.O.A. (1938) The Analysis of Stationary Time Series. Almquist and Wicksell, Uppsala.
Youla, D.C., Bongiorno, J.J. and Jabr, H.A. (1976) Modern Wiener-Hopf design of optimal controllers. Part I: The single-input-output case. IEEE Trans. Autom. Control, AC-21, 3-13. Part II: The multivariable case. IEEE Trans. Autom. Control, AC-21, 319-338.
Zames, G. (1981) Feedback and optimal sensitivity: model reference transformations, multiplicative seminorms, and approximate inverses. IEEE Trans. Autom. Control, AC-26, 301-320.
Index

Allocation, of activity 269-283
Average-cost optimisation 53-56, 138, 217-221, 314-316
Autocovariance 256-257
Autocovariance generating function (AGF) 257
Autoregressive processes and representations 262
Avoidance of hazards 160-163, 428-430
Back-logging 20
Backwards translation operator See Translation operator
Bang-bang control 31, 146
Bayesian statistics 287
Behavioural formulation 129
Blackmail 216, 220
Brownian motion See Wiener process
Bush problem 145-147
Calculus of variations 17
Call routing 226-228
Cart and pendulum model 35-37, 99
Canonical factorisation of operators 118, 129, 262, 263, 336, 339, 342-344, 345, 348, 352, 354, 361, 367
Causality 68, 451
Cayley-Hamilton theorem 99-100
Certainty equivalence 191, 230, 234-239, 298-302, 432-435
Change-point detection 289-291
Chernoff's inequality 387-388
Circulant process 261
Classic formulation 63-66
Clearance, optimal 424, 426, 428, 430
Closed-loop property 14, 25, 176, 191
Closing cost 12, 49
Command signal 64
Companion matrix 101
Compound Poisson process 184
Conditionally most probable estimate 237
Conjugate variable 135, 136
Consumption, optimal 17-19, 55-56, 143, 411-413
Control-power matrix 34, 306
Controllability 101-106
Cost, closing 12, 49
Cost, instantaneous 12
Cost, terminal 49
Cost, transient 54
Cost function 12-13
Covariance matrix 239
Cramér's theorem 380, 384-387
Crash avoidance 157-160
Cumulant generating function (CGF) 384
Derivate characteristic function (DCF) 181, 390
Detectability 109
Diffusion coefficient 187
Diffusion processes 186-187, 206-208
Direct trajectory optimisation 42-46, 131-166
Direct trajectory optimisation with LEQG structure 316-317, 371-377
Direct trajectory optimisation with LQ structure 116-121, 122-123, 126-129
Direct trajectory optimisation with LQG structure 331-369
Discounting 15, 29, 44, 49, 143, 178, 317-319
Discrepancy function 230-231
Domain of attraction 3, 93
Dosage, optimal 134-135
Drift coefficient 187
Dual control 285-286
Dual variable See Conjugate variable
Duality, of estimation and control 248-253
Dynamic lags 88-89
Dynamic programming equation 13-14, 28, 174, 176, 177, 288, 289, 339, 340, 347, 400
Eikonal equation 31
Entropy criterion 316
Equilibrium 92-93
Equilibrium point, optimal 42-46
Erlang function 228
Estimates, conditionally most probable 237
  linear least square 243
  minimal discrepancy 243
  projection 242-246
Euler condition 17
Euphoria 304
Excessive functions 51
Extinction 196-197, 207, 210
Factorisation See Canonical factorisation of operators
Feedback 14, 25, 63-66, 85-89
Feedback/feedforward rules 40, 119
Filters 67-85
Filter, action on stochastic processes 257-261
Filter inversion 72-74, 80-82
Filter, proper 84, 87
Filter, Kalman See Kalman filter
First-passage problems 30-31, 140-142, 201-205, 415-430
Final value theorem 83
Fluid approximation 392
Flypaper effect 204
Forward operator 48
Free form 299
Frequency response function 69
Future stress 305-307, 310, 311
Gain, effective 88
Gain matrix 24, 26
Gittins index 270-275
Gramian 104, 105, 108
Grazing of stopping set 157, 162
Growth, optimal 55-56, 133-135
H∞ criterion 321-329
H∞ norm 324-326
Hamiltonian 45, 136, 137, 396
Hamiltonian structure 395-397
Hardy class 325
Harvesting, optimal 2-5, 31-33, 61-62, 194-201, 206-213
Hedging 173
Homogeneous processes of independent increments (HPII) 182-184
Horizon 12
Imperfect observation 96-98, 229-253, 285-291, 308-312, 357-369, 431-441
Index, Gittins 270-275
Indexability 280
Infinite-horizon behaviour 26-28, 47-56, 111-115, 307, 312-313
Infinitesimal generator 177, 179, 390
Infinitely divisible distributions 183
Inertial particle 37, 155-164, 426-428
Inertialess particle 37, 153-154, 408-411, 421-426
Information 173, 449
Information state 229, 285, 288
Innovation 246-248, 361, 365, 376
Input-output formulation 63-89
Instability, of optimisation 52, 216
Instantaneous cost 12
Insurance 389
Jump process 179, 391, 397
Kalman filter 109, 230, 239-242, 248, 252, 309, 312, 361-369
Lagrange's equations 99
Laplace transform 81-83
Large deviation theory 198, 379-441
  and control 405-414
  and equilibrium distributions 399-400
  and expected exit times 401
  and first passage 415-430
  and imperfect observation 431-441
  and nonlinear filtering 437-441
  refinements 397-399
Linear least square estimate 243
Linearisation 93, 97-98
Loop operator 66
LEQG models and optimisation 234, 295-320, 316-317, 371-377
LQ models and optimisation 22-28, 33-38, 38-42, 59-61, 111-129, 147, 150-163
LQG models and optimisation 189-191, 201-202, 202-205, 331-369
LQG models with imperfect observation 229-253
Machine maintenance 221-222, 281-282, 289-291
Maximum principle (MP) 131-166
Maximum principle, risk-sensitive (RSMP) 407-408, 435-436
Minimal discrepancy estimate 243
Minimax criterion 169
Miss-distance 17
Modes, of a system 95, 106
Moment generating function (MGF) 182, 384, 455-456
Monotonicity, of operators 48
Moving average processes and representations 261-262
Multi-armed bandits 269-283
Neurotic breakdown 303
Newton-Raphson algorithm 57-59
Negative programming 50
Neighbouring optimal control 42-46
Noise power matrix 186
Nonlinear filtering 437-441
Notation 443-447
Observable 173, 449
Observer 109
Observability 106-109
Observation, imperfect 96-98, 229-253, 285-291, 308-312, 357-369, 431-441
Occupation times 388-389
Offset 89, 116, 123-126
Open-loop control 25, 191
Operator, forward 48
Operator, loop 66
Operator, translation 27
Operators, factorisation of See Canonical factorisation of operators
Optimal stopping 205-206
Optimality equation See Dynamic programming equation
Optimality conditions See Direct trajectory optimisation
Optimisation criteria, expressions for 265-268
Optimism 302
Orthogonal random variables 239
Parametrising variables 171, 174, 450
Past stress 308-310, 311-312
Pendulum models 34-36, 84-85, 94, 95
Pessimism 303
PID controllers 99
PI/NR algorithm 344, 348
Piecewise deterministic processes 180-181, 209-213
Plant 11, 64, 66, 85
Plant equation 11
Plant instability 128
Poisson process 183
Poisson stream 183
Pole cancellation 88
Policy 16, 174
Policy improvement 56-61, 215-228
  for call routing 226-228
  for machine maintenance 221-222
  for queueing models 222-226
Policy improvement and canonical factorisation 342-344, 348
Pontryagin maximum principle See Maximum principle
Positive programming 51
Posterior distribution 287
Prediction 264-265, 305, 350
Primer 149
Process variable 11
Production scheduling 20-21
Projection estimate 242-246
Proper filter 84-87
Queueing models 193-194, 222-226, 282
Rate function 380, 385, 394
Rational transfer function 71
Realisation 109
Recoupling 310, 374
Recurrence 219
Reference signal See Command signal
Regulation 23
Replica Markov processes 391
Reservoir optimisation 40-42, 165-166
Resonance 96
Restless bandits 277-283
Return difference 355
Riccati equation 23-24, 113, 240, 252, 305, 309, 311, 338, 340-341
Riccati equation, alternative form of 121-122, 253
Risk-sensitivity 172-173, 295-320, 406-414, 432-437
Robustness 326-328
Routh-Hurwitz criterion 109
Satellite model 98, 106, 109
Scaling of processes 195, 384, 391
Sensitivity 327-328
Separation principle 299, 432-435
Setpoint 23
Shot noise 184
Small gain theorem 327
Spectral density (function and matrix) 259-261
Stability 3, 92
Stability, internal 87
Stability, local 93
Stability matrix 40, 92
Stability, of filters 69, 71, 83-85
Stabilisability 103
State structure 11-12, 91-110, 175-176
Stationary policies 16, 115-121
Stationary processes 255-257
Stopping set 136, 139, 151, 394, 415
Stress 298
  future 301, 305-307, 311
  past 301, 308-310, 311-312
Submodularity 274
Switching locus 148
System formulation 125
Tangency condition 205-206
Temporal optimisation, bases of 449-453
Terminal cost 49
Terminal conditions 138-140, 205-206
Tilting, of a distribution 385
Time-homogeneity 16, 68
Time-integral methods 331-377, 407, 435-436 See also Direct trajectory optimisation
Time-integral methods, a generalised formulation 375-377
Time-invariance 16
Time-to-go 16
Tracking 38-42, 65, 115-121, 308
Transfer function 69, 75, 79
Transient cost 54
Transient response 68, 80
Transition intensity 179
Translation invariance 68
Translation operator 27
Transversality conditions 138-140, 163-165
Turnpike 56, 133
Twisting, of a process 395
Type number 88
Utility function 295
Value function 13, 174, 339, 347
von Neumann-Gale model 56, 134
White noise 185, 189
Wiener filter 264-265
Wiener process 184-187
Wold representation 263
z-transform 69, 77, 79
Zermelo's problem 144