Lecture Notes in Control and Information Sciences
Editor: M. Thoma

225

A.S. Poznyak and K. Najim

Learning Automata and Stochastic Optimization

Springer

Series Advisory Board
A. Bensoussan · M.J. Grimble · P. Kokotovic · H. Kwakernaak · J.L. Massey · Y.Z. Tsypkin

Authors
A.S. Poznyak, CINVESTAV, Instituto Politécnico Nacional, Departamento de Ingeniería Eléctrica, Sección de Control Automático, A.P. 14-740, 07000 México D.F., México
K. Najim, Institut National Polytechnique de Toulouse, Toulouse, France
ISBN 3-540-76154-3 Springer-Verlag Berlin Heidelberg New York

British Library Cataloguing in Publication Data
Pozniak, A.S. (Aleksandr Semenovich)
Learning automata and stochastic optimization. - (Lecture notes in control and information sciences; 225)
1. Mathematical optimization 2. Machine theory 3. Stochastic processes
I. Title II. Najim, Kaddour
519.2'3
ISBN 3540761543

Library of Congress Cataloging-in-Publication Data
A catalog record for this book is available from the Library of Congress.

Apart from any fair dealing for the purposes of research or private study, or criticism or review, as permitted under the Copyright, Designs and Patents Act 1988, this publication may only be reproduced, stored or transmitted, in any form or by any means, with the prior permission in writing of the publishers, or in the case of reprographic reproduction in accordance with the terms of licences issued by the Copyright Licensing Agency. Enquiries concerning reproduction outside those terms should be sent to the publishers.

© Springer-Verlag London Limited 1997
Printed in Great Britain

The use of registered names, trademarks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant laws and regulations and therefore free for general use. The publisher makes no representation, express or implied, with regard to the accuracy of the information contained in this book and cannot accept any legal responsibility or liability for any errors or omissions that may be made.

Typesetting: Camera ready by authors
Printed and bound at the Athenaeum Press Ltd, Gateshead
Printed on acid-free paper
To Tatyana and Michelle
Contents

0.1 Notations
0.2 Introduction

1 Stochastic Optimization
  1.1 Introduction
  1.2 Standard deterministic and stochastic nonlinear programming problem
  1.3 Standard stochastic programming problem
  1.4 Randomized strategies and equivalent smoothed problems
  1.5 Caratheodory theorem and stochastic linear problem on finite set
  1.6 Conclusion
  References

2 On Learning Automata
  2.1 Introduction
  2.2 Learning automaton
  2.3 Environment
  2.4 Reinforcement schemes
  2.5 Hierarchical structure of learning automata
  2.6 Normalization and projection
    2.6.1 Normalization procedure
    2.6.2 Projection algorithm
  2.7 Conclusion
  References

3 Unconstrained Optimization Problems
  3.1 Introduction
  3.2 Statement of the Stochastic Optimization Problem
  3.3 Learning Automata Design
  3.4 Nonprojectional Reinforcement Schemes
    3.4.1 Bush-Mosteller reinforcement scheme
    3.4.2 Shapiro-Narendra reinforcement scheme
    3.4.3 Varshavskii-Vorontsova reinforcement scheme
  3.5 Normalization Procedure and Optimization
  3.6 Reinforcement Schemes with Normalized Environment Response
    3.6.1 Bush-Mosteller reinforcement scheme with normalization procedure
    3.6.2 Shapiro-Narendra reinforcement scheme with normalization procedure
    3.6.3 Varshavskii-Vorontsova reinforcement scheme with normalization procedure
  3.7 Learning Stochastic Automaton with Changing Number of Actions
  3.8 Simulation results
    3.8.1 Learning automaton with fixed number of actions
    3.8.2 Learning automaton with changing number of actions
  3.9 Conclusion
  References

4 Constrained Optimization Problems
  4.1 Introduction
  4.2 Constrained optimization problem
    4.2.1 Problem formulation
    4.2.2 Equivalent linear programming problem
  4.3 Lagrange multipliers using regularization approach
  4.4 Optimization algorithm
    4.4.1 Learning automata
    4.4.2 Algorithm
  4.5 Convergence and convergence rate analysis
  4.6 Penalty function
  4.7 Properties of the optimal solution
  4.8 Optimization algorithm
  4.9 Numerical examples
    4.9.1 Lagrange multipliers approach
    4.9.2 Penalty function approach
  4.10 Conclusion
  References

5 Optimization of Nonstationary Functions
  5.1 Introduction
  5.2 Optimization of Nonstationary Functions
  5.3 Nonstationary learning systems
  5.4 Learning Automata and Random Environments
  5.5 Problem statement
  5.6 Some Properties of the Reinforcement Schemes
  5.7 Reinforcement Schemes with Normalization Procedure
    5.7.1 Bush-Mosteller reinforcement scheme
    5.7.2 Shapiro-Narendra reinforcement scheme
    5.7.3 Varshavskii-Vorontsova reinforcement scheme
  5.8 Simulation results
  5.9 Conclusion
  References

Appendix A: Theorems and Lemmas
Appendix B: Stochastic Processes

Index
0.1 Notations

The superscript convention will be used to index the discrete-time step. Throughout this book we use the following notation:

$\arg\min_x f(x)$     minimizing argument of $f(x)$
$\mathcal{B}_{n-1}$     $\sigma(u_1, \ldots, u_n; \xi_1, \ldots, \xi_{n-1})$
$c_n(i)$     conditional mathematical expectations of the environment responses
$\hat{c}_n(i)$     arithmetic averages
$c(i)$     limit of $c_n(i)$
$d_i$     diameter of the subset $X_i$
$E$     expectation operator
$\mathrm{en}\,X$     envelope of the set X
$e^N$     $= (1, \ldots, 1)^T \in R^N$
$e^{N(j)}(u_n)$     $= (0, 0, \ldots, 0, 1, 0, \ldots, 0)^T \in R^{N(j)}$ (the s-th component of this vector is equal to 1 if $u_n = u(s)$ and the other components are equal to zero)
$F_x(x)$     distribution function of the vector x
$\mathcal{F}_n$     σ-algebra (σ-field: a set of subsets of $\Omega$ for which probabilities are defined)
$f_0(\cdot)$     objective function
$f_j(\cdot)$     constraints $(j = 1, \ldots, m)$
$L_j$     Lipschitz constants
$L(x, \Lambda)$     Lagrange function
$L_{\mu\delta}(p, u)$     penalty function
$N$     number of automaton actions
$N(j)$     number of actions of the subset $V(j)$
$P$     probability measure
$p_n$     probability distribution at time n
$p_n(\alpha)$     probability associated with the optimal action
$p^*$     optimal strategy
$q_n$     probability distribution defined over all possible action subsets
$(\Omega, \mathcal{F}, P)$     probability space
$U$     set of actions
$u_n$     automaton output (action) at time n
$u(\alpha)$     optimal action
$v_j$     slack variables $(j = 1, \ldots, m)$
$V$     set of actions
$V(j)$     subset of actions
$V_0(\cdot)$     objective function
$V_j(\cdot)$     constraints $(j = 1, \ldots, m)$
$W_n$     Lyapunov function
$\{x_i\}$     quantification of the compact set X
$x^*$     optimal solution
$y_n$     observation at time n
$\gamma_n$     correction factor
$\delta(\cdot)$     δ-function
$\xi_n^j$     multi-teacher environment responses at time n $(j = 1, \ldots, m)$
$\lambda(j)$     Lagrange multipliers $(j = 1, \ldots, m)$
$\mu$     penalty coefficient
$\xi_n$     environment response
$\bar{\xi}_n$     normalized environment response
$O(x)$     $O(x)/x$ bounded when $x \to 0$
$o(x)$     $o(x)/x \to 0$ when $x \to 0$
$\Pi$     projection operator
$\phi_0(\cdot)$     objective function
$\phi_j(\cdot)$     constraints $(j = 1, \ldots, m)$
$\sigma_n^2(i)$     conditional variances of the environment responses
$\Phi_n$     loss function at time n
$\Omega$     basic space (events space)
$\omega_n$     observation noise at time n
0.2 Introduction
In recent decades there has been a steadily growing need for, and interest in, computational methods for solving optimization problems with or without constraints. Such methods play an important role in many fields (chemistry, mechanics, electrical engineering, economics, etc.). Optimization techniques have been gaining greater acceptance in many industrial applications, motivated by the demand for improved economy and better utilization of existing material resources. As Euler put it: "Nothing happens in the universe that does not have a sense of either certain maximum or minimum."

In this book we are primarily concerned with the use of learning automata as a tool for solving optimization problems. Learning systems have made a significant impact on many areas of engineering, including modelling, control, optimization, pattern recognition, signal processing and diagnosis. They are attractive and provide interesting methods for solving complex nonlinear problems characterized by a high level of uncertainty. Learning systems are expected to provide the capability to adjust the probability distribution on-line, based on the environment response; they are essentially feedback systems. The optimization problems are modelled as the behaviour of a learning automaton, or of a hierarchical structure of learning automata, operating in a random environment. We report new and efficient techniques to deal with different kinds (unconstrained, constrained) of stochastic optimization problems. The main advantage of learning automata over other optimization techniques is their general applicability, i.e., there are almost no conditions concerning the function to be optimized (continuity, differentiability, convexity, unimodality, etc.).

This book presents a solid background for the design of optimization procedures. Ample references are given at the end of each chapter to allow the reader to pursue his interest further. Several applications are described throughout the book to bridge the gap between theory and practice.

Stochastic processes such as martingales have extensive applications in stochastic problems. They arise naturally whenever one needs to consider mathematical expectations with respect to increasing information patterns. They will be used, together with Lyapunov functions, to state several theoretical results concerning the convergence and the convergence rate of learning systems. The Lyapunov approach offers a shortcut to proving global stability (convergence) of dynamical systems; a Lyapunov function often measures the energy of a physical system.

This book consists of five chapters. Chapter one deals with stochastic optimization problems. It is shown that stochastic optimization problems (with or without constraints) can be reduced to optimization problems on
finite sets and then solved using learning automata (operating in S-model environments, i.e., with continuous environment responses), which are essentially sequential machines.

An introduction to learning automata and several notions and definitions are presented in Chapter two. Both single and hierarchical structures of learning automata are considered. It is shown that the synthesis of several reinforcement schemes can be associated with the optimization of some functional. Chapters one and two constitute the core of this book.

Chapter three is concerned with stochastic unconstrained optimization problems. Learning automata with fixed and changing numbers of actions are used as a tool for solving this kind of optimization problem. A transformation called the "normalization procedure" is introduced to ensure that the environment response (automaton input) belongs to the unit segment. Several analytical results dealing with the convergence characteristics of the optimization algorithms are stated.

The stochastic constrained optimization problems are treated in Chapter four. This stochastic optimization problem is formulated and solved as the behaviour of variable-structure stochastic automata operating in multi-teacher environments. Two approaches are considered: Lagrange multipliers and penalty functions. The properties of the optimal solutions are presented. The convergence of the derived optimization algorithms is stated and the associated convergence rates are estimated.

Chapter five deals with the optimization of nonstationary (time-varying) functions, which can be encountered in several engineering problems. These optimization problems are associated with the behaviour of learning automata with continuous inputs (S-model environment) operating in nonstationary environments.

Numerical examples illustrate the performance of the optimization algorithms described in chapters three, four and five. Two appendices end this book: the first one contains several lemmas and their proofs, and the second one contains some definitions and properties concerning different kinds of convergence and martingales. A detailed table of contents provides a general idea of the scope of the book.

The content of this book, in the opinion of the authors, represents an important step in the systematic use of learning automata concepts in stochastic optimization problems.

We would like to thank our doctoral students E. Ikonen and E. G. Ramires, who have carried out most of the simulation studies presented in this book.

Professor A.S. Poznyak, Instituto Politécnico Nacional, México, Mexico
Professor K. Najim, Institut National Polytechnique de Toulouse, France
1 Stochastic Optimization

1.1 Introduction
Because of its practical significance, there has been considerable interest in the optimization literature. This field has seen tremendous growth in several industries (chemistry, data communication, etc.). Any optimization problem concerns the minimization of some function over some set which can be specified by a number of conditions (equality and inequality constraints) [6]-[7]. Maximization can be converted to minimization by a change of sign. The optimization problems associated with real systems involve a high level of uncertainty, disturbances and noisy measurements. In several optimization problems no information (probability distribution, etc.) is available, or the available information is incomplete.

Learning systems are adaptive machines [8]-[5]. They improve their behaviour with time and are useful whenever complete knowledge about an environment is unknown, expensive to obtain or impossible to quantify. A learning automaton is connected in a feedback loop to the random medium (environment), where the input of one is the output of the other. At every sampling period (iteration), the automaton chooses an action from a finite action set on the basis of its probability distribution. The selected action causes a reaction of the environment, which in turn is the input signal for the automaton (i.e., the environment establishes the relation between the actions of the automaton and the signal received at its input, which can be binary or continuous). With a given reinforcement scheme, the learning automaton recursively updates its probability distribution on the basis of the environment response.

One of our first goals in this chapter will be to present the various stochastic optimization problems in a unified manner and to show how learning automata can be used to solve this kind of optimization problem efficiently. Optimization methods based on learning systems belong to the class of random search techniques. The main advantage of random search over other direct search techniques is its general applicability, i.e., there are almost no conditions concerning the function to be optimized (continuity, unimodality, convexity, differentiability, etc.). The standard programming problem will be stated in the next section.
1.2 Standard deterministic and stochastic nonlinear programming problem
In this section, our main emphasis will be on the standard programming problem [5]-[6], which can be stated as follows:

$$ \inf_{x} f_0(x) \qquad (1.1) $$

$$ f_j(x) \le 0, \quad (j = 1, \ldots, m) \qquad (1.2) $$

$$ x \in X \subseteq R^M \qquad (1.3) $$
The function $f_0(x)$ is an objective function, sometimes also called a criterion function or cost function. In the unconstrained case the optimal solution requires the gradient of $f_0(x)$ to be null at the extremum. Notice that equality constraints of the form $\phi_s(x) = 0$, $(s = 1, \ldots, m_e)$ can be transformed into inequality constraints

$$ f_j(x) \le 0, \quad (j = m+1, \ldots, m+2m_e) $$

where

$$ f_j(x) := \phi_{j-m}(x), \quad (j = m+1, \ldots, m+m_e), \qquad f_j(x) := -\phi_{j-m-m_e}(x), \quad (j = m+m_e+1, \ldots, m+2m_e) $$

So, hereafter we will deal with the standard programming problem (1.1)-(1.3). The set X may be open or closed and can have a different nature, usually connected with the physical problem formulated by (1.1)-(1.3). The standard programming problem is said to be a discrete programming problem if the set X is discrete (a finite set). It is said to be a continuous programming problem if the set X is continuous (compact). The functions $f_j(x)$ $(j = 0, \ldots, m)$ involved in (1.1) and (1.2) are assumed to be continuous, Lipschitzian and twice differentiable.

In deterministic nonlinear programming problems, it is usually assumed that $x_n$ and $y_n$ are available at time n, where $y_n$ is defined as follows:

• zero-order methods

$$ y_n^T := (f_0(x_n), f_1(x_n), \ldots, f_m(x_n)) \in R^{m+1} $$

• first-order methods

$$ y_n^T := (\nabla^T f_0(x_n), \nabla^T f_1(x_n), \ldots, \nabla^T f_m(x_n)) \in R^{(m+1)M} $$
Our main objective is to construct an optimization procedure (strategy), i.e.,

$$ x_n = x_n(y_1, \ldots, y_n) \in X \qquad (1.4) $$

to solve the following constrained optimization problem

$$ f_0(x_n) \xrightarrow[n \to \infty]{} \inf_{x \in X} f_0(x) \qquad (1.5) $$

subject to

$$ \limsup_{n \to \infty} f_j(x_n) \le 0, \quad (j = 1, \ldots, m) $$

It is usually assumed that all the functions $f_j(x_n)$ $(j = 1, \ldots, m)$ are convex and smooth enough. Among the first-order methods, the Arrow-Hurwicz-Uzawa method [5] is commonly used. It is the recurrent version of the gradient technique applied to the corresponding Lagrange function using Slater's condition [7]-[8]:

$$ \exists \bar{x} \in X : f_j(\bar{x}) < 0, \quad (j = 1, \ldots, m) \qquad (1.6) $$

In practice the function to be optimized ($f_0(x)$) as well as the constraints ($f_j(x)$) display some random characteristics (incomplete knowledge). In this case the optimization problem has to be formulated as a stochastic programming problem.
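To make the first-order approach above concrete, the following is a minimal sketch of an Arrow-Hurwicz-Uzawa (primal-dual gradient) iteration for problem (1.1)-(1.3); the step sizes, the projection onto X and the test functions are illustrative assumptions, not taken from the book.

```python
import numpy as np

def arrow_hurwicz_uzawa(grad_f0, grads_fj, fj, x0, n_iter=2000,
                        step_x=1e-2, step_lam=1e-2, project=lambda x: x):
    """Primal-dual gradient iteration on the Lagrangian
    L(x, lam) = f0(x) + sum_j lam_j * f_j(x):
    descend in x, ascend in lam, keep lam >= 0."""
    x = np.asarray(x0, dtype=float)
    lam = np.zeros(len(fj))
    for _ in range(n_iter):
        # gradient of the Lagrangian with respect to x
        g = grad_f0(x) + sum(l * gj(x) for l, gj in zip(lam, grads_fj))
        x = project(x - step_x * g)                   # primal descent step
        viol = np.array([c(x) for c in fj])           # constraint values f_j(x)
        lam = np.maximum(lam + step_lam * viol, 0.0)  # dual ascent, lam >= 0
    return x, lam

# Illustrative use: minimize (x1-1)^2 + (x2-2)^2 subject to x1 + x2 - 2 <= 0
x_opt, lam_opt = arrow_hurwicz_uzawa(
    grad_f0=lambda x: 2 * (x - np.array([1.0, 2.0])),
    grads_fj=[lambda x: np.array([1.0, 1.0])],
    fj=[lambda x: x[0] + x[1] - 2.0],
    x0=np.zeros(2))
```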
1.3 Standard stochastic programming problem
A key feature of many practical problems is the presence of disturbances on both the optimized function and the constraints. The stochastic programming problem is defined in terms of the minimization of

$$ \inf_{x} E\{f_0(x, \omega)\} \qquad (1.7) $$

subject to

$$ E\{f_j(x, \omega)\} \le 0, \quad (j = 1, \ldots, m) \qquad (1.8) $$

$$ x \in X \subseteq R^M \qquad (1.9) $$
where $f_j(x, \omega)$ $(j = 0, 1, \ldots, m)$, for each fixed $x \in X$, are Borel functions (random variables) defined on a probability space $(\Omega, \mathcal{F}, P)$ and $\omega \in \Omega$. The probability space $(\Omega, \mathcal{F}, P)$ is defined as follows: $\Omega$ is the space of elementary events, $\mathcal{F}$ is the σ-algebra constructed from all subsets of $\Omega$, and $P$ is a probability measure defined on the measurable space $(\Omega, \mathcal{F})$. Similarly to the standard deterministic programming problem, the observation $y_n$ is defined as follows:
• zero-order methods

$$ y_n^T(\omega) := (f_0(x_n, \omega), f_1(x_n, \omega), \ldots, f_m(x_n, \omega)) \in R^{m+1} \qquad (1.10) $$

• first-order methods

$$ y_n^T(\omega) := (y_{n,0}^T(\omega), \ldots, y_{n,m}^T(\omega)) \in R^{(m+1)M} \qquad (1.11) $$

where

$$ E\{y_{n,j}^T(\omega) \mid \mathcal{F}_{n-1}\} \stackrel{a.s.}{=} \nabla \phi_j(x_n), \quad (j = 0, \ldots, m) \qquad (1.12) $$

$$ \mathcal{F}_{n-1} := \sigma\text{-algebra generated by } (y_1(\omega), \ldots, y_{n-1}(\omega)), \qquad \phi_j(x_n) := E\{f_j(x_n, \omega)\} \quad (j = 0, \ldots, m) \qquad (1.13) $$

The standard assumptions concerning the observation vector $y_n(\omega)$ are that the conditional expectations in (1.12) exist and that the corresponding conditional second moments are bounded almost surely. In view of (1.12), the observation vector can be interpreted as the gradient $\{\nabla \phi_j(x_n), j = 0, \ldots, m\}$ disturbed by a noise (vector) $\zeta_n^j = \zeta_n^j(\omega)$ $(j = 0, \ldots, m)$ with zero conditional mathematical expectation, i.e.,

$$ y_n^T(\omega) = \left[ \nabla^T \phi_0(x_n) + (\zeta_n^0(\omega))^T; \ \ldots; \ \nabla^T \phi_m(x_n) + (\zeta_n^m(\omega))^T \right] \qquad (1.14) $$

where

$$ E\{\zeta_n^j(\omega) \mid \mathcal{F}_{n-1}\} \stackrel{a.s.}{=} 0, \quad (j = 0, \ldots, m) \qquad (1.15) $$
If the statistical characteristics of all the random functions $f_j(x, \omega)$ $(j = 0, \ldots, m)$ are assumed to be known for all $x \in X$, then the standard stochastic optimization problem (1.7)-(1.9) can be rewritten as follows:

$$ \phi_0(x) = \int_{\Omega} f_0(x, \omega) F(d\omega) \to \inf_{x \in X} \qquad (1.16) $$

$$ \phi_j(x) = \int_{\Omega} f_j(x, \omega) F(d\omega) \le 0 \quad (j = 1, \ldots, m) \qquad (1.17) $$

where the distribution function is assumed to be known and the integrals in (1.16) and (1.17) are defined in the Lebesgue-Stieltjes sense.

Remark 1 The "chance constraints" [9]-[10]

$$ P\{\omega : \phi_j(\xi^0, \xi^1, \ldots, \xi^m) \ge a_j\} \le \varepsilon_j \qquad (1.18) $$
belong to the class of constraints (1.3). Indeed, this chance constraint can be written as follows:

$$ E\{\delta_n^j\} \le 0 \qquad (1.19) $$

where

$$ \delta_n^j = \chi\{\phi_j(\xi_n^0, \xi_n^1, \ldots, \xi_n^m) \ge a_j\} - \varepsilon_j \qquad (1.20) $$

and the indicator function $\chi(\cdot)$ is defined as follows

$$ \chi(A) = \begin{cases} 1 & \text{if } A \text{ is true} \\ 0 & \text{otherwise} \end{cases} $$

Remark 2 Substituting (1.20) into (1.3) leads to

$$ \limsup_{n \to \infty} \frac{1}{n} \sum_{t=1}^{n} \chi\{\phi_j(\xi_t^0, \xi_t^1, \ldots, \xi_t^m) \ge a_j\} \le \varepsilon_j \qquad (1.21) $$

In the stationary case, i.e., when the probability distribution of the vector $(\xi_t^0, \xi_t^1, \ldots, \xi_t^m)^T$ is stationary, and in view of the strong law of large numbers [11]-[12], it follows that inequalities (1.21) and (1.18) are equivalent, i.e.,

$$ \limsup_{n \to \infty} \frac{1}{n} \sum_{t=1}^{n} \chi\{\phi_j(\xi_t^0, \xi_t^1, \ldots, \xi_t^m) \ge a_j\} = \limsup_{n \to \infty} \frac{1}{n} \sum_{t=1}^{n} E\left\{\chi\{\phi_j(\xi_t^0, \xi_t^1, \ldots, \xi_t^m) \ge a_j\}\right\} = P\{\omega : \phi_j(\xi^0, \xi^1, \ldots, \xi^m) \ge a_j\} \le \varepsilon_j $$

In such a statement, this problem is equivalent to the standard deterministic programming problem (1.1)-(1.3).
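As a small illustration of Remark 2, the sketch below checks a chance constraint of the form (1.18) by the empirical average of indicator functions as in (1.21); the particular random quantity, threshold a_j and level ε_j are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def empirical_chance_constraint(sample, a_j, eps_j, n=100_000):
    """Estimate P{phi_j >= a_j} by the average of indicators chi{...}
    over n independent draws, as in (1.21), and compare it with eps_j."""
    draws = np.array([sample() for _ in range(n)])
    freq = np.mean(draws >= a_j)          # (1/n) * sum of chi{phi_j >= a_j}
    return freq, freq <= eps_j

# Illustrative constraint value: phi_j modelled as a Gaussian disturbance
freq, satisfied = empirical_chance_constraint(
    sample=lambda: rng.normal(loc=0.0, scale=1.0), a_j=2.0, eps_j=0.05)
print(f"estimated violation probability {freq:.4f}, within eps: {satisfied}")
```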
Before applying deterministic optimization techniques, it is necessary to calculate the integrals in (1.16) and (1.17). These integrals involve complex operations and cannot be calculated exactly. Several approaches have been proposed to avoid this problem. One approach consists of using the average of the observations $\{y_n(\omega)\}$ given by (1.10), i.e.,

$$ \lim_{n \to \infty} \frac{1}{n} \sum_{t=1}^{n} y_t^0(\omega) \stackrel{a.s.}{\to} \inf_{\{x_n\}} \qquad (1.22) $$

$$ \lim_{n \to \infty} \frac{1}{n} \sum_{t=1}^{n} y_t^j(\omega) \stackrel{a.s.}{\le} 0, \quad (j = 1, \ldots, m) \qquad (1.23) $$

where $\{x_n\}$ belongs to the class of all realizable strategies (1.4).
This stochastic programming problem (1.22)-(1.23) requires considerable computational effort: it is necessary to solve it (in parallel) for different realizations. Another approach is based on the implementation of stochastic approximation techniques [13]-[14]-[15] using the observations (measurements) of the gradients (1.11)-(1.12). The main ideas of this approach will be considered in this book.
1.4 Randomized strategies and equivalent smoothed problems

The link between the standard nonlinear programming and smoothed stochastic optimization problems will now be stated. For all $x \in X$, the (generally nonconvex) set $\Phi$ of the values of the functions $\phi_j(x)$, $(j = 0, \ldots, m)$ is defined as follows:

$$ \Phi = \{v \in R^{m+1} : v_j = \phi_j(x) \ (j = 0, \ldots, m), \ x \in X\} \qquad (1.24) $$
A typical nonconvex set is depicted in Figure 1.1. In this figure we have considered only one constraint. The coordinates of the point A are respectively the value of the cost function and the value of the considered constraint for a given value of x.

FIGURE 1.1. The set of possible values of $\phi_j(x)$, $(j = 0, \ldots, m)$.

For the standard nonlinear programming problem, the functions are assumed to be convex. The corresponding set $\Phi$ of all the possible values $(\phi_j(x))$ in the $(m+1)$-dimensional space is convex too, and as a consequence, many classical optimization techniques [16], including the stochastic approximation methods [13]-[14]-[15], can be used. For nonconvex functions $\phi_j(x)$, $(j = 0, \ldots, m)$, the set $\Phi$ is nonconvex too. To use the standard optimization techniques for solving this problem, which is naturally multimodal, it is necessary to solve several locally convex problems and to select the optimum among the obtained solutions.

Now, let us consider the stochastic nonlinear problem (1.16)-(1.17) where at least

• one of the functions $\phi_j(x)$ $(j = 0, \ldots, m)$ is nonconvex, or
• one of the functions $\phi_j(x)$ (or all of them) is nonsmooth, i.e., we must use only observations of zero order (1.10).

To solve such a nonlinear stochastic programming problem, which at first glance seems to be very complex, we will use the approach presented in [17]-[18]-[19]. Let us introduce the distribution function $F_x(x)$ of the vector $x = x(\omega) \in X$, assumed to be given on the same probability space $(\Omega, \mathcal{F}, P)$, and consider the following constrained optimization problem
$$ \bar{\phi}_0(p) := \int_{x \in X} \phi_0(x) p(x) dx \to \inf_{p(x)} \qquad (1.25) $$

$$ \bar{\phi}_j(p) := \int_{x \in X} \phi_j(x) p(x) dx \le 0 \qquad (1.26) $$
where $p(x)$ is the corresponding probability distribution, which is related to the distribution function $F_x$ by

$$ p(x) = \frac{\partial^M F_x(x)}{\partial x_1 \cdots \partial x_M} \qquad (1.27) $$

and satisfying the probability measure properties

$$ p(x) \ge 0 \quad \forall x \in X \subseteq R^M \qquad (1.28) $$

$$ \int \cdots \int p(v) \, dv = 1 \qquad (1.29) $$
We will consider not only absolutely continuous distribution functions satisfying (1.27) but also other distribution functions such as the δ-functions. The δ-function operates as follows:

$$ \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} \phi(x) \, \delta(x - x_0) \, dx = \phi(x_0) \qquad (1.30) $$
for any function $\phi(x)$, $x \in R^M$. Notice that, independently of the convexity properties of the initially defined functions $\phi_j(x)$ $(j = 0, \ldots, m)$, the smooth problem (1.25)-(1.26) subject to the probability measure constraints (1.28)-(1.29) is always a linear programming problem in the space of distributions $p(x)$ (including δ-functions (1.30)). The relation between the standard nonlinear programming problem (1.16)-(1.17) and the corresponding smooth problem (1.25)-(1.29) is stated by the following theorem.

Theorem 1. If
1. the initial standard stochastic nonlinear programming problem (1.16)-(1.17) has a solution $x^*$ (not necessarily unique), i.e., there exists a point $x^* \in X$ such that

$$ \phi_0(x^*) \le \phi_0(x) \quad \forall x \in X \qquad (1.31) $$

$$ \phi_j(x^*) \le 0, \quad (j = 1, \ldots, m) \qquad (1.32) $$

2. the constraints (1.17) of the initial problem (1.16) satisfy Slater's condition for $x = \bar{x} \in X$,

3. the set X is compact (closed and bounded),

4. $\phi_j(x)$ $(j = 1, \ldots, m)$ are continuous functions on X,

then the corresponding smooth problem (1.25)-(1.29) has a solution too, i.e.,

$$ \exists p^*(x) : \ \bar{\phi}_0(p^*) \le \bar{\phi}_0(p) \quad \forall p \in S^M, \qquad \bar{\phi}_j(p^*) \le 0, \quad (j = 1, \ldots, m) \qquad (1.33) $$

$$ p, p^* \in S^M := \left\{ p(x) : p(x) \ge 0 \ \forall x \in X \subseteq R^M, \ \int_{x \in X} p(x) dx = 1 \right\} \qquad (1.34) $$

and the minimal values $\phi_0(x^*)$ and $\bar{\phi}_0(p^*)$ of the cost functions in (1.31) and (1.33) are connected by the following relation:

$$ \phi_0(x^*) - \bar{\phi}_0(p^*) = (1 - z_2^*) \phi_0(x^*) - (1 - z_1^*) \phi_0(\bar{x}) - (z_1^* - z_2^*) \phi_0(x^{**}) \qquad (1.35) $$
where

$$ x^{**} = \arg \min_{x \in X / X_0} \phi_0(x) \qquad (1.36) $$

$$ X_0 := \{x \in X : \phi_j(x) \le 0 \ (j = 1, \ldots, m)\} \qquad (1.37) $$

$(z_1^*, z_2^*)$ are the solution of the linear programming problem

$$ g_0(z_1, z_2) := \phi_0(\bar{x}) + z_1 \left[ \phi_0(x^{**}) - \phi_0(\bar{x}) \right] + z_2 \left[ \phi_0(x^*) - \phi_0(x^{**}) \right] \to \inf_{z_1, z_2} $$

$$ g_j(z_1, z_2) := \phi_j(\bar{x}) + z_1 \left[ \phi_j(x^{**}) - \phi_j(\bar{x}) \right] + z_2 \left[ \phi_j(x^*) - \phi_j(x^{**}) \right] \le 0, \quad j = 1, \ldots, m; \qquad 0 \le z_2 \le z_1 \le 1 \qquad (1.38) $$

and $p^*$ is equal to

$$ p^*(x) = (1 - z_1^*) \delta(x - \bar{x}) + (z_1^* - z_2^*) \delta(x - x^{**}) + z_2^* \delta(x - x^*) \qquad (1.39) $$
Proof. Let us consider the following probability distribution representation

$$ p(x) = (1 - \alpha) \delta(x - \bar{x}) + \alpha q(x), \quad \alpha \in [0, 1] \qquad (1.40) $$

where $\bar{x} \in X$ is the Slater point (1.6) and $q(x)$ is any density function. This formulation (1.40) represents some kind of parametrization of the class of all density functions (any density function $p(x)$ can be expressed as a function of some density function $q(x)$ and the parameter $\alpha \in [0, 1]$). Using (1.40) it follows that

$$ \bar{\phi}_0(p) = (1 - \alpha) \phi_0(\bar{x}) + \alpha \bar{\phi}_0(q) = (1 - \alpha) \phi_0(\bar{x}) + \alpha \left[ \int_{X_0} \phi_0(x) q(x) dx + \int_{X / X_0} \phi_0(x) q(x) dx \right] \qquad (1.41) $$

As

$$ \phi_0(x^*) \le \phi_0(x) \quad \forall x \in X_0 \qquad (1.42) $$

from (1.41) we derive

$$ \bar{\phi}_0(p) \ge (1 - \alpha) \phi_0(\bar{x}) + \alpha \left[ \phi_0(x^*) q_0 + \phi_0(x^{**}) (1 - q_0) \right] \qquad (1.43) $$

with

$$ q_0 := P\{x \in X_0\} \qquad (1.44) $$
Notice that $x^{**}$ exists because the function $\phi_0(x)$ is continuous and the set X is compact. The lower estimate (1.43) is sharp, i.e., there exists a density $p(x)$ verifying the identity in (1.43), which is equal to

$$ p(x) = (1 - \alpha) \delta(x - \bar{x}) + \alpha \left[ q_0 \delta(x - x^*) + (1 - q_0) \delta(x - x^{**}) \right] \qquad (1.45) $$

Substituting (1.45) into (1.26) we obtain

$$ \bar{\phi}_j(p) = (1 - \alpha) \phi_j(\bar{x}) + \alpha \left[ \phi_j(x^*) q_0 + \phi_j(x^{**}) (1 - q_0) \right] \le 0 \qquad (1.46) $$

Let us introduce two new variables $z_1 := \alpha \in [0, 1]$ and $z_2 := \alpha q_0 \in [0, 1]$. Solving the linear programming problem (1.38) with respect to the variables $z_1$ and $z_2$, we obtain the minimum value $\bar{\phi}_0(p^*)$:

$$ \bar{\phi}_0(p^*) = \min_{\substack{z_1, z_2 : \ g_j(z_1, z_2) \le 0, \ 0 \le z_2 \le z_1 \le 1}} g_0(z_1, z_2) = g_0(z_1^*, z_2^*) \qquad (1.47) $$

and hence $p^*$ is given by (1.39). The theorem is proved. ∎
Corollary 1. For the unconstrained optimization problem, the minimal values of the cost functions associated respectively with the initial problem (1.16)-(1.17) and the smooth problem coincide, i.e.,

$$ \phi_0(x^*) = \bar{\phi}_0(p^*), \quad p^* = \delta(x - x^*) \qquad (1.48) $$

Proof. In this case $X_0 = X$, $X / X_0 = \emptyset$. It follows that $x^{**} = x^*$ and $\phi_0(x^*) = \phi_0(x^{**})$, and hence from (1.35) we obtain:

$$ \phi_0(x^*) - \bar{\phi}_0(p^*) = (1 - z_1^*) \left[ \phi_0(x^*) - \phi_0(\bar{x}) \right] \le 0, \quad \forall \bar{x} \in X \qquad (1.49) $$

From (1.38) it follows that

$$ g_0(z_1, z_2) = \phi_0(\bar{x}) + z_1 \left[ \phi_0(x^*) - \phi_0(\bar{x}) \right] \to \inf_{z_1, z_2 : \ 0 \le z_2 \le z_1 \le 1} $$

has a solution for $z_1^* = 1$, from which and (1.49) we obtain (1.48) for any $\bar{x} \in X$. The corollary is proved. ∎
FIGURE 1.2. The cost function $\phi_0(x)$ defined on the set $[-1, 1]$.
Remark 3 From (1.35) it follows that, in general, the minimal value of the cost function (1.25) of the smooth problem (1.25)-(1.26) can be less than the minimal value $\phi_0(x^*)$ of the cost function (1.16) of the deterministic problem (1.16)-(1.17). This is due to the fact that average constraints are weaker than deterministic constraints.

Consider the cost function $\phi_0(x)$ depicted in Figure 1.2. The deterministic problem formulated as

$$ \phi_0(x) \to \inf_{x \in X = [-1, 1]} $$

subject to

$$ -1 \le x \le 0 $$

has a solution

$$ x^* = 0, \quad \phi_0(x^*) = a > \frac{1}{2} $$

The corresponding smooth problem can be formulated as follows:

$$ E\{\phi_0(x)\} \to \inf_{p} $$

subject to

$$ -1 \le E\{x\} \le 0, \qquad p(x) = 0 \ \text{ for } x \notin X = [-1, 1] $$

Taking $p(x)$ equal to

$$ p(x) = \alpha \delta(x + 1) + (1 - \alpha) \delta(x - 1) \qquad (1.50) $$

we derive

$$ -1 \le \alpha(-1) + (1 - \alpha) \le 0 $$

from which it follows that

$$ \alpha \ge \frac{1}{2} $$

Calculating the cost function, we derive

$$ E\{\phi_0(x)\} = \alpha \phi_0(-1) + (1 - \alpha) \phi_0(1) = \alpha \phi_0(-1) $$

Minimizing the right-hand side with respect to $\alpha$, we obtain

$$ E\{\phi_0(x)\} \ge \frac{\phi_0(-1)}{2} = \frac{1}{2} $$

So, for (1.50) we obtain

$$ E\{\phi_0(x)\} < \phi_0(x^*) $$
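The following sketch checks this remark numerically; the piecewise-linear cost with $\phi_0(-1) = 1$, $\phi_0(1) = 0$ and $\phi_0(0) = a = 0.75$ is an assumed shape consistent with Figure 1.2, not the book's exact curve.

```python
import numpy as np

a = 0.75                                   # assumed value of phi_0(0) > 1/2

def phi0(x):
    """Assumed piecewise-linear cost with phi0(-1)=1, phi0(0)=a, phi0(1)=0."""
    return np.where(x < 0.0, a + (a - 1.0) * x, a * (1.0 - x))

# Deterministic problem: min phi0(x) subject to -1 <= x <= 0  ->  x* = 0
det_value = phi0(np.array(0.0))

# Smoothed problem with p(x) = alpha*delta(x+1) + (1-alpha)*delta(x-1),
# feasible when E{x} = 1 - 2*alpha lies in [-1, 0], i.e. alpha >= 1/2
alphas = np.linspace(0.5, 1.0, 501)
smoothed = alphas * phi0(np.array(-1.0)) + (1 - alphas) * phi0(np.array(1.0))
print(det_value, smoothed.min())           # 0.75 versus 0.5: E{phi0(x)} < phi0(x*)
```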
1.5 Caratheodory theorem and stochastic linear problem on finite set

Let us start with the definition of the envelope en X of a set X before introducing the Caratheodory theorem, which will be useful in the following developments (the significance of any definition, of course, resides in its consequences and applications).

Definition 1 The set en X is said to be the envelope of a set $X \in R^M$ if for any point $\tilde{x} \in$ en X there exist two points $x_1, x_2 \in X$ and a parameter $\alpha \in [0, 1]$ such that

$$ \tilde{x} = (1 - \alpha) x_1 + \alpha x_2 \qquad (1.51) $$

Let us recall the Caratheodory theorem from the geometric theory of convex sets [6].
Theorem 2. (Caratheodory [6]). Any point $\tilde{x}$ belonging to the envelope set en X of a set $X \in R^M$ can be expressed as a linear combination of only $(M+1)$ points $x_s \in X$ $(s = 1, \ldots, M+1)$, i.e.,

$$ \exists x_s \in X \ (s = 1, \ldots, M+1), \quad \alpha \in S^{M+1} := \left\{ \alpha : \alpha_s \ge 0, \ \sum_{s=1}^{M+1} \alpha_s = 1 \right\} $$

such that

$$ \tilde{x} = \sum_{s=1}^{M+1} \alpha_s x_s \in \text{en } X $$
The proof of this theorem is given in [6]. The next theorem discusses several properties of the set $\Phi$.

Theorem 3. On the class of continuous functions $\phi_j(x)$ $(j = 0, \ldots, m)$ the following properties hold:

1. $$ \bar{\Phi} = \{v : v_j = \bar{\phi}_j(p), \ p \in S^M\} \ \text{ is a convex set} \qquad (1.52) $$

2. $$ \bar{\Phi} = \text{en } \Phi \qquad (1.53) $$
Proof. To prove (1.52), let us consider two distributions $p_1(x)$ and $p_2(x) \in S^M$ and construct the set of values

$$ \bar{\phi}_j^{\alpha}(p_1, p_2) := (1 - \alpha) \bar{\phi}_j(p_1) + \alpha \bar{\phi}_j(p_2), \quad \alpha \in [0, 1] $$

If we show that for any $\alpha \in [0, 1]$ the point $\{\bar{\phi}_j^{\alpha}(p_1, p_2), \ j = 0, \ldots, m\}$ belongs to $\bar{\Phi}$, we establish the first result of this theorem. Indeed,

$$ \bar{\phi}_j^{\alpha}(p_1, p_2) = \int_{x \in X} \phi_j(x) \left[ (1 - \alpha) p_1(x) + \alpha p_2(x) \right] dx = \int_{x \in X} \phi_j(x) p_{\alpha}(x) dx $$

where

$$ p_{\alpha}(x) := (1 - \alpha) p_1(x) + \alpha p_2(x) \qquad (1.54) $$

It is easy to show that $p_{\alpha}(x) \in S^M$. Hence, $p_{\alpha}(x)$ is a distribution which corresponds to the point $\{\bar{\phi}_j^{\alpha}(p_1, p_2), \ j = 0, \ldots, m\} \in \bar{\Phi}$. This result is true for any $\alpha \in [0, 1]$ and any distributions $p_1(x), p_2(x) \in S^M$, which completes the proof of statement (1.52).

To prove (1.53), let us notice that any point $v \in \Phi$ corresponds to the distribution $p(x) = \delta(x - x_0)$ in (1.25), (1.26), i.e., $\phi_j(x_0) = \bar{\phi}_j(\delta(x - x_0))$, $(j = 0, \ldots, m)$. Hence, we have the following inclusion:

$$ \Phi \subseteq \bar{\Phi} $$
According to the Caratheodory theorem [6], any point $v \in$ en $\Phi$ can be expressed in the following form

$$ v = \sum_{s=1}^{m+2} \alpha_s \begin{pmatrix} \phi_0(x_s) \\ \vdots \\ \phi_m(x_s) \end{pmatrix}, \quad \alpha \in S^{m+2} := \left\{ \alpha : \alpha_s \ge 0, \ \sum_{s=1}^{m+2} \alpha_s = 1 \right\} $$

hence,

$$ v = \bar{\Phi}\!\left( \sum_{s=1}^{m+2} \alpha_s \delta(x - x_s) \right) $$

and, taking into account that the linear combination $\sum_{s=1}^{m+2} \alpha_s \delta(x - x_s)$ of δ-functions with $\alpha \in S^{m+2}$ is a distribution, we derive that

$$ \Phi \subseteq \text{en } \Phi \subseteq \bar{\Phi} \qquad (1.55) $$

Let us consider any $p(x) \in S^M$. We have

$$ \bar{\phi}_j(p) = \int_{x \in X} \phi_j(x) p(x) dx \qquad (1.56) $$

Let us prove that any $p(x)$ corresponding to (1.56) can be expressed in the following form

$$ p(x) = \sum_{s=1}^{m+2} \alpha_s \delta(x - x_s), \quad x, x_s \in X, \ \alpha \in S^{m+2} \qquad (1.57) $$

where the points $\{x_s\}$ satisfy the assumptions of this theorem. To prove (1.57), let us again use the Caratheodory theorem [6]. According to (1.52) the set $\bar{\Phi}$ is convex; then for any fixed distribution $p$ the corresponding vector

$$ \bar{\Phi}(p) := (\bar{\phi}_0(p), \ldots, \bar{\phi}_m(p))^T \in R^{m+1} $$
can be expressed as follows

$$ \bar{\Phi}(p) = \sum_{s=1}^{m+2} \alpha_s \bar{\Phi}(p_s) \qquad (1.58) $$

where $p_s$ is a distribution function. Because of the continuity of the functions $\phi_j(x)$ there exist points $x_s$ such that $\bar{\phi}_j(p_s) = \phi_j(x_s)$ $(j = 0, \ldots, m)$ and, taking into account (1.56), we conclude that

$$ \bar{\phi}_j(p_s) = \phi_j(x_s) = \int_{x \in X} \phi_j(x) \delta(x - x_s) dx $$

and hence, from (1.58), we finally derive that

$$ \bar{\Phi}(p) = \sum_{s=1}^{m+2} \alpha_s \int_{x \in X} \Phi(x) \delta(x - x_s) dx = \int_{x \in X} \Phi(x) \sum_{s=1}^{m+2} \alpha_s \delta(x - x_s) dx, \qquad \Phi(x) := (\phi_0(x), \ldots, \phi_m(x))^T \in R^{m+1} $$

So, the representation (1.57) holds. It means that

$$ \text{en } \Phi \supseteq \bar{\Phi} \qquad (1.59) $$

Combining (1.55) and (1.59) we finally obtain

$$ \bar{\Phi} = \text{en } \Phi $$

The theorem is proved. ∎
Remark 4 The statement (1.59) can also be proved by using Kall's theorem [20] (see Appendix A).
Indeed, to prove the correctness of assertion (1.57), let us first state a similar representation for $(m+3)$ points (a linear combination of a nonminimal number of points), i.e.,

$$ p(x) = \sum_{s=1}^{m+3} \alpha_s \delta(x - x_s), \quad x, x_s \in X, \ \alpha \in S^{m+3} \qquad (1.60) $$

We must prove that such $\alpha \in S^{m+3}$ and $x_s \in X$ exist. Substituting (1.60) into (1.56), we obtain

$$ \bar{\phi}_j(p) = \sum_{s=1}^{m+3} \alpha_s \phi_j(x_s) \qquad (1.61) $$

and eliminating $\alpha_{m+3}$ $\left( \alpha_{m+3} = 1 - \sum_{s=1}^{m+2} \alpha_s \right)$ from (1.61), we obtain

$$ \bar{\phi}_j(p) = \sum_{s=1}^{m+2} \alpha_s \phi_j(x_s) + \left( 1 - \sum_{s=1}^{m+2} \alpha_s \right) \phi_j(x_{m+3}) $$

or

$$ \bar{\phi}_j(p) - \phi_j(x_{m+3}) = \sum_{s=1}^{m+2} \alpha_s \left[ \phi_j(x_s) - \phi_j(x_{m+3}) \right] \qquad (1.62) $$

According to the continuity property of the functions $\phi_j(x)$ $(j = 0, \ldots, m)$ we can find points $\{x_s\}_{s=1,\ldots,m+1}$ such that $\det A_{m+1,m+1} \ne 0$, where $A_{m+1,m+1}$ is the submatrix containing the first $(m+1)$ columns of the matrix $A_{m+1,m+2} = \|a_{js}\|$, $a_{js} := \phi_j(x_s) - \phi_j(x_{m+3})$, $j = 0, \ldots, m$; $s = 1, \ldots, m+2$. Then, we can rewrite the last relation (1.62) in the following matrix form

$$ b = A_{m+1,m+2} \cdot \alpha \qquad (1.63) $$

where

$$ b^T = (b_0, \ldots, b_m), \quad b_j := \bar{\phi}_j(p) - \phi_j(x_{m+3}) $$

$$ A_{m+1,m+2} = [a_{js}], \quad a_{js} := \phi_j(x_s) - \phi_j(x_{m+3}) \quad (j = 0, \ldots, m; \ s = 1, \ldots, m+2) $$

$$ \alpha^T = (\alpha_1, \ldots, \alpha_{m+2}), \quad \alpha_s \ge 0 $$
Applying Kall's theorem [20] (see Appendix A), for the existence of the nonnegative vector $\alpha$ and the points $x_s$ $(s = 1, \ldots, m+1)$ satisfying the linear equation (1.63), it is necessary and sufficient that there exist $\mu \ge 0$ and $\lambda_j \le 0$ $(j = 1, \ldots, m+1)$ such that

$$ \mu b = \sum_{l=1}^{m+1} \lambda_l A_l $$
where $A_l$ is the $l$-th column of the matrix $A_{m+1,m+2}$. This relation can always be verified if the points $x_s$ $(s = 1, \ldots, m+3)$ are selected according to the following rule:

$$ \text{sign}\left( \phi_j(x_1) - \phi_j(x_{m+3}) \right) = -\text{sign}\left( \phi_j(x_{m+2}) - \phi_j(x_{m+3}) \right) \quad \forall j = 0, \ldots, m $$

According to the Caratheodory theorem, any linear combination of $(m+3)$ points can be represented as a linear combination of only $(m+2)$ points (1.61). Hence the formulation (1.63) takes place. The next important statement follows directly from this theorem.
Corollary 2. For the class of continuous functions, the stochastic programming problem (1.25)-(1.26) is equivalent to the following nonlinear programming problem

$$ \sum_{s=1}^{m+2} p(s) \phi_0(x_s) \to \inf_{p \in S^{m+2}, \ x_s \in X} \qquad (1.64) $$

$$ \sum_{s=1}^{m+2} p(s) \phi_j(x_s) \le 0 \quad (j = 1, \ldots, m) $$

Let us denote the corresponding minimal point by $p_s^{**}, x_s^{**}$ $(s = 1, \ldots, m+2)$. This point may not be unique.

The most important consequence of this theorem is the following one: the optimization problem is entirely characterized by

• a set of $(m+2)$ vectors $x_k \in X$, and
• $(m+2)$ values of the probability vector

$$ p_k \ge 0, \quad \sum_{k=1}^{m+2} p_k = 1, \quad (k = 1, \ldots, m+2) $$

and can be reformulated as follows

$$ \inf_{z \in Z} h_0(z) \qquad (1.65) $$

subject to

$$ h_j(z) \le 0, \quad (j = 1, \ldots, m) \qquad (1.66) $$

$$ z = \{x_0, \ldots, x_{m+1}, p_0, \ldots, p_{m+1}\} \qquad (1.67) $$

where $z \in R^S$, $S = (m+2)(M+1)$,
$$ x_k \in X, \quad p_k \ge 0, \quad \sum_{k=1}^{m+2} p_k = 1 $$

and

$$ h_j(z) = \sum_{k=1}^{m+2} \phi_j(x_k) p_k, \quad (j = 0, \ldots, m) \qquad (1.68) $$
The next theorem shows that this nonlinear programming problem (1.64), given on the compact set X, in the case of Lipschitzian functions $\phi_j(x)$ $(j = 0, \ldots, m)$ can be approximated by a corresponding linear stochastic problem on a finite set.
Theorem 4. If

1. X is a compact set of diameter D which can be partitioned into a number of nonempty subsets $X_k$ $(k = 1, \ldots, N)$ having no intersection, i.e.,

$$ X = \bigcup_{k=1}^{N} X_k, \quad X_i \cap X_j = \emptyset \ (i \ne j) \qquad (1.69) $$

2. the functions $\phi_j(x)$ $(j = 0, \ldots, m)$ are Lipschitzian on each subset $X_k$, i.e.,

$$ \left| \phi_j(x') - \phi_j(x'') \right| \le L_k^j \left\| x' - x'' \right\| \quad \forall x', x'' \in X_k \qquad (1.70) $$

then there exist fixed points $\{x_k^*\}$ $(x_k^* \in X_k, \ k = 1, \ldots, N)$ and a large enough positive integer N such that the discrete distribution

$$ p_{\varepsilon}^*(x) = \sum_{k=1}^{N} p^*(k) \delta(x - x_k^*) $$

with the vector $p^* \in S^N$ which is a solution of the linear stochastic programming problem

$$ \sum_{k=1}^{N} p(k) \phi_0(x_k^*) \to \inf_{p \in S^N} \qquad (1.71) $$

$$ \sum_{k=1}^{N} p(k) \phi_j(x_k^*) \le 0 \quad (j = 1, \ldots, m) \qquad (1.72) $$

satisfies the constraints (1.72) with ε-accuracy, i.e.,

$$ \bar{\phi}_j(p_{\varepsilon}^*) \le \varepsilon_j \quad (j = 1, \ldots, m) \qquad (1.73) $$
and the corresponding loss function $\bar{\phi}_0(p_{\varepsilon}^*)$ deviates from the minimal value $\bar{\phi}_0(p^{**})$ of the initial programming problem (1.64) by not more than ε, i.e.,

$$ \left| \sum_{k=1}^{N} p^*(k) \phi_0(x_k^*) - \sum_{s=1}^{m+2} p^{**}(s) \phi_0(x_s^{**}) \right| \le \varepsilon \qquad (1.74) $$

where

$$ \varepsilon_j := \frac{D}{N} \max_{k=1,\ldots,N} L_k^j \quad (j = 0, \ldots, m) \qquad (1.75) $$
Proof. Let $p^{**}, x_s^{**}$ $(s = 1, \ldots, m+2)$ be the optimal point (solution) of the problem (1.64), and let the partition of the compact set X be realized in such a way that each subset $X_k$ contains not more than one of the extremal points $x_s^{**}$. Hence, for any $(j = 0, \ldots, m)$ we obtain

$$ \sum_{k=1}^{N} p(k) \phi_j(x_k^*) = \sum_{k=1}^{N} p(k) \left[ \phi_j(x_k^*) - \sum_{s=1}^{m+2} \phi_j(x_s^{**}) \chi(x_s^{**} \in X_k) \right] + \sum_{k=1}^{N} p(k) \sum_{s=1}^{m+2} \phi_j(x_s^{**}) \chi(x_s^{**} \in X_k) $$

Let us denote by $x_{s_k}^{**}$ the optimal point $x_s^{**} \in X_k$. Then, for any $p \in S^N$,

$$ \sum_{k=1}^{N} p(k) \phi_j(x_k^*) = \sum_{k=1}^{N} p(k) \left[ \phi_j(x_k^*) - \phi_j(x_{s_k}^{**}) \right] + \sum_{k=1}^{N} p(k) \phi_j(x_{s_k}^{**}) \qquad (1.76) $$

For $(j = 1, \ldots, m)$ and for any $p \in S^N$ satisfying the constraints (1.72), we have

$$ \sum_{k=1}^{N} p(k) \phi_j(x_{s_k}^{**}) \le \sum_{k=1}^{N} p(k) \left| \phi_j(x_k^*) - \phi_j(x_{s_k}^{**}) \right| \le \sum_{k=1}^{N} p(k) L_k^j \left\| x_k^* - x_{s_k}^{**} \right\| \le \sum_{k=1}^{N} p(k) L_k^j \sup_{x, y \in X_k} \| x - y \| $$

Selecting

$$ d_k := \sup_{x, y \in X_k} \| x - y \| \le D/N $$

we derive

$$ \sum_{k=1}^{N} p(k) \phi_j(x_{s_k}^{**}) \le \frac{D}{N} \max_{k=1,\ldots,N} L_k^j := \varepsilon_j \quad (j = 1, \ldots, m) \qquad (1.77) $$
Using now the fact that the point $p^*$ minimizes the left-hand side of (1.76) for $j = 0$, we derive:

$$ \sum_{k=1}^{N} p^*(k) \phi_0(x_k^*) \le \sum_{k=1}^{N} p(k) \phi_0(x_k^*) = \sum_{k=1}^{N} p(k) \left[ \phi_0(x_k^*) - \phi_0(x_{s_k}^{**}) \right] + \sum_{k=1}^{N} p(k) \phi_0(x_{s_k}^{**}) $$

Then, for $p(k) = p^{**}(k)$ we finally derive

$$ \sum_{k=1}^{N} p^{**}(k) \phi_0(x_{s_k}^{**}) - \sum_{k=1}^{N} p^*(k) \phi_0(x_k^*) \le \sum_{k=1}^{N} p^{**}(k) \left| \phi_0(x_k^*) - \phi_0(x_{s_k}^{**}) \right| \le \sum_{k=1}^{N} p^{**}(k) L_k^0 \sup_{x, y \in X_k} \| x - y \| $$

Selecting $d_k \le D/N$ we obtain

$$ \sum_{k=1}^{N} p^{**}(k) \phi_0(x_{s_k}^{**}) - \sum_{k=1}^{N} p^*(k) \phi_0(x_k^*) \le \frac{D}{N} \max_{k=1,\ldots,N} L_k^0 $$

We can satisfy (1.74) with ε-accuracy if we choose

$$ \varepsilon \ge \frac{D}{N} \max_{k=1,\ldots,N} L_k^0 $$
The theorem is proved. ∎

Corollary 3. Under the assumptions of this theorem, to obtain an ε-approximation of the initial nonlinear programming problem (1.64) it is enough to use a partition of the given compact set X of diameter D into subsets $X_k$ $(k = 1, \ldots, N)$ with diameters

$$ d_k \le \frac{D}{N} \le \frac{\varepsilon}{\max_{k=1,\ldots,N} L_k^0} $$

and the integer N, characterizing the number of such subsets, must satisfy the following inequality:

$$ N \ge \frac{D \max_{k=1,\ldots,N} L_k^0}{\varepsilon} \qquad (1.78) $$
This corollary gives a means of determining the accuracy of the approximation of an optimization problem on a continuous set by optimization problems on a finite (discrete) set. In summary, let us notice that a discrete optimization problem can be formulated as the behaviour of a learning automaton with a finite set of actions operating in a random environment corresponding to the optimization problem to be solved. Indeed, the discrete set related to the optimization problem can be associated with the set of control actions of a learning automaton operating in a random environment. The loss functions related to this environment are equal to the values of the optimized function.
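To illustrate the ε-approximation of Theorem 4 and Corollary 3, the following sketch discretizes a one-dimensional problem and solves the resulting linear program (1.71)-(1.72) over the simplex $S^N$; the objective, constraint, Lipschitz bound and the use of scipy are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import linprog

# Illustrative problem on X = [-1, 1]: minimize phi0 subject to phi1 <= 0
phi0 = lambda x: (x - 0.3) ** 2
phi1 = lambda x: -x                        # feasible region: x >= 0

D, L0, eps = 2.0, 2.6, 0.05                # diameter, Lipschitz bound, accuracy
N = int(np.ceil(D * L0 / eps))             # number of subsets from (1.78)
x_pts = np.linspace(-1.0, 1.0, N)          # one representative point x_k* per subset

# Linear program over the simplex S^N: min sum_k p(k) * phi0(x_k*)
res = linprog(c=phi0(x_pts),
              A_ub=[phi1(x_pts)], b_ub=[0.0],   # sum_k p(k) * phi1(x_k*) <= 0
              A_eq=[np.ones(N)], b_eq=[1.0],    # sum_k p(k) = 1
              bounds=[(0.0, 1.0)] * N)
print(x_pts[np.argmax(res.x)], res.fun)    # mass concentrates near the optimum x = 0.3
```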
1.6 Conclusion

The main conclusions are the following:

• The linear stochastic optimization problem can arise directly (if it is initially formulated on a discrete set) or as an approximation, with ε-accuracy, of a stochastic optimization problem on a continuous set.
• Independently of the properties (convexity, differentiability, etc.) of the functions involved in a nonlinear programming problem, the ε-equivalent linear stochastic optimization problem on a discrete set is a well-defined problem and represents a linear programming problem in the space of distributions (a multidimensional simplex).

• The formulation of the stochastic optimization problem in the form (1.71) allows us to apply learning automata successfully for its solution.

In summary, this chapter shows how to tie stochastic optimization problems to the behaviour of learning automata operating in random environments (stationary, nonstationary, single and multi-teacher environments).
References

[1] Bertsekas D P 1982 Constrained Optimization and Lagrange Multiplier Methods. Academic Press, New York
[2] Rockafellar R T 1993 Lagrange Multipliers and Optimality. SIAM Review 35:183-238
[3] Najim K, Oppenheim G 1991 Learning Systems: Theory and Applications. IEE Proceedings Computer and Digital Techniques 138:183-192
[4] Najim K, Poznyak A S 1994 Learning Automata: Theory and Applications. Pergamon Press, Oxford
[5] Zangwill W I 1969 Nonlinear Programming: A Unified Approach. Prentice-Hall, Englewood Cliffs
[6] Rockafellar R T 1970 Convex Analysis. Princeton University Press, Princeton
[7] Spingarn J E, Rockafellar R T 1979 The generic nature of optimality conditions in nonlinear programs. Mathematics of Operations Research 4:425-430
[8] Zangwill W I, Garcia C B 1981 Pathways to Solutions, Fixed Points, and Equilibria. Prentice-Hall, Englewood Cliffs
[9] Charnes A, Cooper W W 1959 Chance constrained programming. Management Science 6:73-79
[10] Vajda S 1972 Probabilistic Programming. Academic Press, New York
[11] Ash R B 1972 Real Analysis and Probability. Academic Press, New York
[12] Doob J L 1953 Stochastic Processes. John Wiley & Sons, New York
[13] Kiefer J, Wolfowitz J 1952 Stochastic estimation of the maximum of a regression function. Ann. Math. Stat. 23:462-466
[14] Kushner H J, Clark D S 1978 Stochastic Approximation Methods for Constrained and Unconstrained Systems. Springer Verlag, Berlin
[15] Gelfand S B, Mitter S K 1991 Recursive stochastic algorithms for global optimization in R^d. SIAM J. Control and Optimization 29:999-1018
[16] Polak E 1971 Computational Methods in Optimization. Academic Press, New York
[17] Kaplinskii A I, Propoi A I 1970 Stochastic approach to nonlinear programming problems. Automation and Remote Control 31:448-459
[18] Kaplinskii A I, Poznyak A S, Propoi A I 1971 Optimality conditions for certain stochastic programming problems. Automation and Remote Control 32:1210-1218
[19] Kaplinskii A I, Poznyak A S, Propoi A I 1971 Some methods for the solution of stochastic programming problems. Automation and Remote Control 32:1609-1616
[20] Kall P 1966 Qualitative Aussagen zu einigen Problemen der stochastischen Programmierung. Z. Wahrsch. verw. Geb. 6:246-272
2 On Learning Automata

2.1 Introduction

Among the most attractive areas of research in the field of control engineering are adaptive and learning systems [1]-[2]-[3]-[4]-[5]-[6]-[7]-[8]-[9]. Learning systems have been shown to be an efficient tool for dealing with a large number of engineering problems [9]. They are information processing systems whose architecture and behaviour are inspired by the structure of biological systems (the organism is born with relatively little initial knowledge and learns appropriate actions through trial and error) [5]-[10]. The vocabulary and the concepts associated with learning automata are borrowed from biology and psychology. The experiments performed by Skinner simply illustrate the behaviour of a Reward/Inaction learning automaton: a pigeon was placed in a cage with a red disk mounted on one of the walls. If, by chance, the pigeon pecked at the red disk, it received a certain amount of grain, whereas if it pecked elsewhere, it received no reward. It was not long before the pigeon associated pecking the red disk with being rewarded with food. There are also a myriad of other examples where this simple kind of learning is evidenced by the modification of behaviour [11].

Adaptive control methods [2]-[3]-[4] generally address the problem of control of systems with unknown parameters but known structure. In adaptive control strategies, the behaviour of the system is slightly improved at every sampling period by estimating in real time the parameters (model or control law parameters) to attain the desired control objective. In learning automata [6], the probability distribution ($p_n$) is recursively updated to optimize some learning goal. Learning automata have attracted considerable interest in recent decades due to their potential usefulness in a variety of engineering problems characterized by nonlinearity and a high level of uncertainty [12]. In fact, later developments in stochastic control theory took into account uncertainties that might be present in a given process; stochastic control was effected by assuming that the probabilistic characteristics of the uncertainties are known. Frequently, the uncertainties are of a higher order, and even the probabilistic characteristics such as the distribution functions may not be completely known. It is then necessary to learn (acquire) additional information [13].

Broadly speaking, learning automata can be classified into three
categories: deterministic, fixed-structure and variable-structure. In deterministic automata, the transition and the output matrices are deterministic. A fixed-structure automaton is one whose transition and output functions are time-invariant. Variable-structure stochastic automata possess transition and output functions which evolve as the learning process proceeds [6]-[8]; the automaton structure changes as the system learns the information. The use of variable-structure stochastic automata leads to a reduction of states in comparison with deterministic automata.

A learning system interacts with an environment and learns the optimal action which the environment offers. It bases its decision on the information gained by selecting actions, seeing whether the choice is rewarded, and then updating its probability distribution. This cycle continues until the learning process is terminated. The environment is characterized by its penalty probabilities. An environment is said to be stationary if its penalty probabilities are constant; otherwise it is nonstationary. Notice that the term "learning" is used by several authors in neural network synthesis and in association with stochastic approximation techniques [9]-[13]-[14]-[15]. The reference [9] presents several theoretical results concerning the convergence and the estimation of the convergence rate and discusses in fair detail some of the applications of learning automata.

Most of the available studies relate to the behaviour of learning automata in stationary environments [6]-[7]-[8]-[9]. The problem concerning the behaviour of learning automata in nonstationary environments is difficult, and few results are known [6]-[7]-[16]-[17]-[18]-[19]-[20]-[21]-[22]. Baba and Mogami [22] have shown that an extended form of the scheme proposed by Thathachar and Ramakrishnan [23] ensures absolute expediency in a nonstationary environment having the property that there exists a unique path which receives the least sum of the penalty strengths in the sense of mathematical expectation. Poznyak and Najim [24] have studied the behaviour of learning automata in asymptotically stationary environments; several theoretical results were stated concerning the properties of reinforcement schemes, the normalized environment response and the optimal behaviour of different learning automata. A nonstationary environment also arises indirectly in connection with hierarchical systems of learning automata [6]-[23]. It has been shown in [9] that the use of a hierarchical system of learning automata accelerates the learning process; the latter reference also discusses in fair detail some of the applications of hierarchical structures of learning automata.

The conventional automaton model of a learning system consists of a stochastic automaton operating in a single environment. Nevertheless, an automaton can be made to operate in more than one random medium [25]-[26]. Baba [7] has introduced the concept of average weighted reward and several norms (expediency, ε-optimality) of the learning behaviours of stochastic automata in multi-teacher environments. He has also proposed a class of reinforcement schemes which directly use the responses from the
multi-teacher environment (the number of rewards).

A learning system is a sequential machine characterized by a set of actions, a probability distribution and a reinforcement scheme. An extensive literature has been dedicated to the behaviour of learning automata with a fixed action set [8]. The behaviour of automata for which the number of actions available at each time instant is time-varying has been studied by Thathachar and coauthors [27]-[28]. In that study [27], convergence results were stated for binary environment responses (P-model environment). An important aspect of convergence not considered there is the rate of convergence, which concerns the speed of operation of the automaton [27]. Learning automata with a changing number of actions are relevant to the modelling of several problems (CPU job scheduling, optimal paths in stochastic networks, etc.). Learning automata with continuous input (S-model environment) where the number of automaton actions changes in real time have been considered by Poznyak and Najim [28]. Learning automata with continuous inputs and changing number of actions have been used for optimization purposes [9]-[28]. The analysis and the statement of the convergence properties of learning systems are usually derived using the Lyapunov approach and martingale theory (see Appendix A) [29]-[30]-[31]-[32].
2.2 Learning automaton

A learning automaton may be considered as a system which modifies its control strategy on the basis of its experience in order to reach good control (optimization) performance in spite of unpredictable changes in the environment where it operates. In other words, learning automata should, by collecting and processing current information regarding the environment, be capable of changing their structure and parameters as time evolves to achieve the desired goal or the optimal performance (in some sense). An automaton is an adaptive discrete machine described by:

$$ \{\Xi, \ U, \ (\Omega, \mathcal{F}, P), \ \{\xi_n\}, \ \{u_n\}, \ p_n, \ c_n, \ T\} $$

where:

(i) $\Xi$ is the bounded set of automaton inputs.

(ii) U denotes the set $\{u(1), u(2), \ldots, u(N)\}$ of actions of the automaton.

(iii) $(\Omega, \mathcal{F}, P)$ is a probability space.

(iv) $\{\xi_n\}$ is a sequence of automaton inputs (environment responses, $\xi_n \in \Xi$) provided by the environment in a binary (P-model environment) or continuous (S-model environment) form.
(v) $\{u_n\}$ is a sequence of automaton outputs (actions).

(vi) $p_n = [p_n(1), p_n(2), \ldots, p_n(N)]^T$ is the probability distribution at time n:

$$ p_n(i) = P\{\omega : u_n = u(i) \mid \mathcal{F}_{n-1}\} \quad \text{and} \quad \sum_{i=1}^{N} p_n(i) = 1, \ \forall n $$

where $\mathcal{F}_n = \sigma(\xi_1, u_1, p_1; \ldots; \xi_n, u_n, p_n)$ is the σ-algebra generated by the corresponding events ($\xi_n \in \Xi$).

(vii) $c_n = [c_n(1), c_n(2), \ldots, c_n(N)]^T$ is the conditional mathematical expectation vector of the environment responses (at time n).

(viii) T represents the reinforcement scheme (updating scheme) which changes the probability vector $p_n$ to $p_{n+1}$:

$$ p_{n+1} = p_n + \gamma_n T_n(p_n; \{\xi_t\}_{t=1,\ldots,n}; \{u_t\}_{t=1,\ldots,n}), \qquad p_1(i) > 0 \ \forall i = 1, \ldots, N \qquad (2.1) $$

where $\gamma_n$ is a scalar correction factor and the vector $T_n(\cdot) = [T_n^1(\cdot), \ldots, T_n^N(\cdot)]^T$ satisfies the following conditions (for preserving the probability measure):

$$ \sum_{i=1}^{N} T_n^i(\cdot) = 0, \quad \forall n \qquad (2.2) $$

$$ p_n(i) + \gamma_n T_n^i(\cdot) \in [0, 1] \quad \forall n, \ \forall i = 1, \ldots, N \qquad (2.3) $$

This is the heart of the learning automaton. Different reinforcement schemes have been proposed in the literature. A reinforcement scheme can be linear or nonlinear. Sometimes it is advantageous to update $p_n$ according to different schemes depending on the intervals in which the value of $p_n$ lies. The loss function $\Phi_n$ associated with the learning automaton is given by

$$ \Phi_n = \frac{1}{n} \sum_{t=1}^{n} \xi_t \qquad (2.4) $$
It is a useful quantity for judging the behaviour of a learning automaton. A learning system is a stochastic automaton connected in a feedback loop with a random environment as shown in Figure 2.1. The description of environments and the learning automata will be given in the next sections.
FIGURE 2.1. Automaton-environment interaction.
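The following is a minimal sketch of the update loop (2.1)-(2.3) for a variable-structure automaton in an S-model environment; the particular reinforcement rule used here (a simple linear reward-type update) and the synthetic environment are illustrative assumptions, not the schemes analysed later in the book.

```python
import numpy as np

rng = np.random.default_rng(1)

class LearningAutomaton:
    """Variable-structure automaton: select u_n from p_n, observe xi_n in [0,1],
    then update p_n as in (2.1) while keeping it a probability vector ((2.2)-(2.3))."""
    def __init__(self, n_actions, gamma=0.05):
        self.p = np.full(n_actions, 1.0 / n_actions)     # p_1(i) > 0 for all i
        self.gamma = gamma

    def choose(self):
        return rng.choice(len(self.p), p=self.p)         # action index u_n

    def update(self, action, xi):
        # Illustrative linear reward-type rule: move p towards the chosen
        # action when the (normalized) response xi is small (small penalty).
        e = np.zeros_like(self.p); e[action] = 1.0
        T = (1.0 - xi) * (e - self.p)                    # components of T sum to zero (2.2)
        self.p = np.clip(self.p + self.gamma * T, 0.0, 1.0)
        self.p /= self.p.sum()                           # keep p on the simplex (2.3)

# Environment with unknown expected penalties c(i); the automaton should
# concentrate its probability on the action with the smallest c(i).
c = np.array([0.7, 0.2, 0.5])
la = LearningAutomaton(n_actions=3)
for _ in range(5000):
    u = la.choose()
    xi = np.clip(c[u] + 0.1 * rng.normal(), 0.0, 1.0)    # S-model response in [0, 1]
    la.update(u, xi)
print(la.p)   # most of the probability mass should be on action 1
```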
2.3
Environment
Learning systems are adaptive machines which interact with an environment and which dynamically learn the optimal action which the environment offers. The role of the environment (medium) is to establish the relation between the actions of the automaton and the signals received at its input. The environment (or medium) is a term that can cover just about anything; it includes all the external conditions and influences [6]. The environment produces random responses whose statistics depend on the current stimulus or input. The environment offers the automaton a finite set of actions, and the automaton is constrained to choose one of them. The environment is said to be a P-model environment if its response belongs to the set {0, 1} (binary input). It is said to be an S-model environment when its response takes an arbitrary value in the closed segment [0, 1] (continuous response). In a Q-model environment, the environment output can assume a finite number of values in the interval [0, 1]. In automatic control, the environment corresponds to the process to be controlled or optimized (to introduce a control signal into a system means to couple the system to its environment). The environment is characterized by its penalty probabilities c_i. An environment is said to be stationary if its penalty probabilities are constant; otherwise it is nonstationary. Having considered a single environment, we are ready for our next topic: multi-environments (multi-teacher). The remainder of this section is devoted to multi-teacher environments [7]. A typical learning automaton operating in a multi-teacher environment is depicted in Figure 2.2. It consists of a set of "teachers".
Each teacher reacts to the automaton response.
FIGURE 2.2. Multi-teacher environment.
These reactions (responses) at time n are denoted by ξ_n^j (j = 1, ..., m). They can be binary (P-model environment) or continuous (S-model environment). The automaton input ξ_n = F(ξ_n^1, ..., ξ_n^m) is constructed on the basis of these responses. Some examples of the transformation F(·) will be presented in Chapter 4. The attention given to this class of environments is due to their very interesting intrinsic properties. In fact they provide a simple means to represent some engineering problems. For example, in process control, each teacher can be associated with a state of the process to be controlled. In constrained optimization problems, one teacher can be associated with the cost function and the others with the different constraints.
2.4
Reinforcement schemes
Reinforcement schemes were originally proposed in an attempt to model animal learning [33] and have since found successful application in the field of learning automata. A reinforcement scheme can be compared to the recursive estimation procedures used in adaptive control. The reinforcement scheme (learning or updating algorithm) generates p_{n+1} from p_n. Several algorithms for adjusting the probabilities after each sampling period (interaction with the environment) have been proposed [8]-[9]. They are based on
incremental changes in the probabilities. The most commonly used is the linear updating algorithm proposed by Bush and Mosteller [33]. All the reinforcement schemes described in the literature can be considered as being solutions of optimization problems [34]. Let us introduce the following average penalty function

    J := Σ_{i=1}^N { Φ_i(p_n) E_i[1 − ξ^i] − Ψ_i(p_n) E_i[ξ^i] }        (2.5)
where the functions Φ_i(p_n) and Ψ_i(p_n) represent respectively the amount of change in the probability vector under reward (ξ^i = 0) and penalty (ξ^i = 1), ξ^i being the environment response to the action u(i). The interest here is in minimizing (2.5). According to the Kiefer-Wolfowitz approximation method [35], the reinforcement scheme which minimizes the function J is derived by setting the gradient of J equal to zero. If the selected action at time n is u(i), the following algorithm is obtained:

    p_{n+1}(i) = p_n(i) + γ_n [ ∂Φ_i(p_n)/∂p_n(i) (1 − ξ^i) − ∂Ψ_i(p_n)/∂p_n(i) ξ^i ]        (2.6)

    p_{n+1}(j) = p_n(j) − γ_n/(N − 1) [ ∂Φ_i(p_n)/∂p_n(i) (1 − ξ^i) − ∂Ψ_i(p_n)/∂p_n(i) ξ^i ],   j ≠ i        (2.7)
We have just derived one of the central results on reinforcement schemes. In general, the existing learning schemes differ in structure (linear, nonlinear, etc.) but they fall into the general framework (2.6)-(2.7). The functions associated with the Bush-Mosteller [33], Shapiro-Narendra [36] and Varshavskii-Vorontsova [37] reinforcement schemes are given in Table 2.1.

Table 2.1. Different reinforcement schemes and their corresponding functions Φ(·) and Ψ(·).

    Authors                     | Function Φ_i(p_n)              | Function Ψ_i(p_n)
    Bush and Mosteller          | p_n(i) − ½ p_n(i)²             | ½ p_n(i)²
    Varshavskii and Vorontsova  | cte + ½ p_n(i)² − ⅓ p_n(i)³    | Φ_i(·)
    Shapiro and Narendra        | p_n(i) − ½ p_n(i)²             | cte

The Bush-Mosteller and Shapiro-Narendra schemes are linear reinforcement schemes, while the Varshavskii-Vorontsova scheme is nonlinear. These schemes are rather general and should be efficiently adapted for use
in specific environments (S-model, asymptotically stationary environments, etc.). A more general method for constructing learning schemes has been presented. When the number of actions increases, the behaviour of the automaton becomes slow and the computer memory capacity required for the implementation of the learning system also increases. These problems can be avoided by using a hierarchical structure of automata.
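The sketch below is an illustrative implementation of the update (2.6)-(2.7) for a generic pair of functions Φ_i and Ψ_i supplied through their derivatives with respect to p_n(i); the derivative choices used as defaults (dPhi = 1 − p_i, dPsi = p_i) are only an assumption for the example, not a transcription of Table 2.1.

import numpy as np

def general_scheme_step(p, i, xi, gamma,
                        dPhi=lambda pi: 1.0 - pi,
                        dPsi=lambda pi: pi):
    """One step of (2.6)-(2.7): i is the selected action, xi in {0, 1} the response."""
    N = len(p)
    g = dPhi(p[i]) * (1.0 - xi) - dPsi(p[i]) * xi   # common gradient factor
    p_new = p - gamma * g / (N - 1)                 # update (2.7) for j != i
    p_new[i] = p[i] + gamma * g                     # update (2.6) for the chosen i
    return p_new

# Example: after a reward (xi = 0) the probability of the chosen action grows,
# while the others decrease and the probabilities still sum to one.
p = np.array([0.25, 0.25, 0.25, 0.25])
print(general_scheme_step(p, i=1, xi=0.0, gamma=0.1))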
2.5
Hierarchical structure of learning automata
A hierarchical structure of learning automata is depicted in Figure 2.3. It is composed, at its different levels, of single automata with a limited number of actions [9]-[22]-[23]-[38]. The first level of the hierarchy comprises a single automaton with N actions. The second level is composed of N single automata (of N actions each), and the k-th level is formed by N^{k-1} automata. The last level of a K-level hierarchy therefore contains N^{K-1} single automata. The hierarchical structure of automata operates as follows. The first level chooses an action at random (say u(1)). This activates the corresponding automaton in the second level (each action of the automaton contained in the first level is associated with an automaton belonging to the second level), which selects an action (say u(2/3)). This in turn activates an automaton in the third level of the hierarchy. At the last level the selected action (u(3/2/2)) interacts with the environment (environment input).
FIGURE 2.3. Hierarchical structure of learning automata.
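A minimal sketch of the action-selection mechanics just described is given below; it is an illustrative assumption about how a path of actions is drawn down the hierarchy, not the authors' code. For simplicity each node uses a uniform probability vector, whereas in a learning system each node keeps and updates its own p_n.

import numpy as np

def hierarchical_select(levels, N, rng):
    """Walk down the hierarchy: at each level an N-action automaton picks one
    action, which determines the automaton activated at the next level."""
    path = []
    for _ in range(levels):
        p = np.full(N, 1.0 / N)          # placeholder probability vector of this node
        path.append(int(rng.choice(N, p=p)) + 1)
    return path                           # e.g. [2, 3, 1] corresponds to u(2/3/1)

rng = np.random.default_rng(1)
print(hierarchical_select(levels=3, N=3, rng=rng))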
The hierarchical structure of learning automata can be used for several purposes [9]-[38] (optimization, multiobjective analysis, etc.). The probability measure condition

    Σ_{i=1}^N p_n(i) = 1

is not ensured when the environment response does not belong to the interval [0, 1]. It is then necessary to introduce some procedure for preserving the probability measure. Two solutions are commonly used: projection and normalization [9]-[34]. The next section concerns these procedures for environment response transformation and probability "scaling".
2.6
Normalization and projection
In this section, our main emphasis will be on normalization and projection procedures. When the probabilities do not remain in the probability variation domain [0, 1], it is necessary to use some procedure (operator or transformation) which guarantees that the probability measure remains satisfied at each time. The normalization procedure [39]-[40] processes the environment output (response) while the projection acts on the probabilities.
2.6.1
NORMALIZATION PROCEDURE
The normalization procedure has been introduced by Najim and Poznyak [9] and used in the context of learning automata with continuous input (S-model environment) for multimodal function optimization. We shall present a very simple normalization procedure. Let us consider the multi-teacher environment represented in Figure 2.2, and assume that the response of each teacher is binary, i.e., ξ_n^j ∈ {0, 1} (j = 1, ..., m). The normalized environment response ξ_n (automaton input) can be constructed as follows:

    ξ_n = (1/m) Σ_{j=1}^m ξ_n^j ∈ [0, 1]        (2.8)
This normalization procedure ensures that the automaton input belongs to the unit segment [0, 1]. More general methods for constructing normalized environment responses will be introduced and used in the next chapters. In what follows, the projection operator is described.
2.6.2
PROJECTION ALGORITHM
The projection procedure can be applied for binary and non-binary environment responses [34]. It is a useful procedure when the probabilities would not otherwise remain in the interval [0, 1]. The projection operator Π onto the simplex S is defined as

    Π(z) = { z    if z ∈ S
           { z*   if z ∉ S,  ||z − z*|| = min_{y∈S} ||z − y||

The projection operator Π(·) is characterized by the property ||Π(p) − Π(q)|| ≤ ||p − q||. The simplex S is defined as

    S = { p_n : Σ_{i=1}^N p_n(i) = 1,  p_n(i) ≥ 0,  i = 1, ..., N }
It consists of:

    C_N^1 = N vertices Γ_1^j,
    C_N^2 = ½ N(N − 1) edges (two-dimensional faces) Γ_2^j,
    ...,
    C_N^{N−1} = N  (N − 1)-dimensional faces Γ_{N−1}^j.

The point whose coordinates are all equal to zero (except the j-th, which equals 1) is the vertex Γ_1^j. The face Γ_m^j (m ≥ 2) is the subset Γ_m^j = {p : p ∈ D_m, p(i) ≥ 0} of the hyperplane D_m defined as follows:

    D_m = { p_n : Σ_{i=1}^N p_n(i) = 1,  p_n(i) ≥ 0,  i = 1, ..., N }
The projection of p_n is defined as follows:

    Π(p_n) = p_n* :  ||p_n − p_n*|| = min_{y∈S} ||p_n − y||
(2.9)
It is obvious that p_n* ∈ Γ_k^j for a certain k. Note that finding p_n* = Π(p_n) is equivalent to finding the point on the simplex S which is closest to the projection p_n(D_N) of the point p_n onto D_N. From the definition (2.9), it follows that

    min_{y∈S} ||y − p_n||² = min_{y∈S} ||(y − p_n(D_N)) + (p_n(D_N) − p_n)||² =
    = ||p_n(D_N) − p_n||² + min_{y∈S} ||y − p_n(D_N)||²

The following lemma gives the tool for calculating the projection Π(p_n) of p_n.

Lemma 1. The face Γ_{N−1}^j closest to the point p_n(D_N) has an orthogonal vector

    a_{N−1} = (1, 1, ..., 1, 0, 1, ..., 1)   (the zero in the j-th position)        (2.10)

The index j corresponds to the smallest component of the vector (point) p_n(D_N), i.e.,

    (p_n(D_N))_j = min_{l=1,...,N} (p_n(D_N))_l        (2.11)
Proof. The distance V between a given point z = (z_1, z_2, ...) ∈ D_N and the vertex Γ_1^k is equal to

    V²(z, Γ_1^k) = (z_k − 1)² + Σ_{i≠k} (z_i)² = Σ_{i=1}^N (z_i)² + (1 − 2 z_k)

Then,

    V²(z, Γ_1^l) − V²(z, Γ_1^k) = 2 (z_l − z_k)

i.e., the most distant vertex corresponds to the smallest component

    p_n(k) = min_l p_n(l)

Consequently, the face Γ_{N−1} which lies opposite to this vertex and is closest to the point z has an orthogonal vector a_{N−1} (2.10). □

The projection p_n(D_m) of p_n onto the hyperplane D_m (1 ≤ m ≤ N) is accomplished according to
    (p_n(D_m))_j = p_n(j) + [ (1 − Σ_{i=1}^m a(i) p_n(i)) / m ] a(j),   (j = 1, ..., m)        (2.12)

where the a(j) are the components of the vector which is orthogonal to the hyperplane D_m.
We note that the projection procedure is accomplished in a number of steps not exceeding N. The projection onto the simplex S_ε defined by

    S_ε = { p_n : Σ_{i=1}^N p_n(i) = 1,  p_n(i) ≥ ε,  i = 1, ..., N }

can be carried out by means of the change of variables

    p_n(i) = ε + (1 − Nε) p̃_n(i),   p̃_n ∈ S
The structure of the projection operator algorithm is indicated below. If p_n ∉ S then:
1. find p_n(D_m) according to (2.12) for m = N;
2. if p_n(D_m) ∉ S, then find the smallest component of the vector p_n(D_m) and the vector which is orthogonal to the closest face Γ_{N−1}^j (2.11).
The geometrical interpretation of the projection procedure for the two-dimensional case [p_n(1)  p_n(2)]^T is depicted in Figure 2.4. Five cases (P(i), i = a, b, c, d, e) have been considered. The coordinates of each point P are the probabilities p_n(1) and p_n(2). The thin lines represent the behaviour of the projection operator. We obtain the following mapping
    P(a) → A,  P(b) → B,  P(c) → C,  P(d) → C,  P(e) → C

For example, the coordinates (x_a, y_a) of the point A correspond to the projection of P(a). Referring to Figure 2.4, the projection operator can be considered as the map that associates with any point P(·) of the plane a point belonging to the segment BC. Normalization and projection have many nice properties. They are useful in the context of optimization and give the user greater flexibility. The disadvantage of projection is that it is time consuming. In other words, the use of the projection operator results in a relatively sluggish behaviour of the learning system. To deal with numerical problems which are due to round-off errors, the probability vector can be scaled by

    S_p = Σ_{i=1}^N p_n(i)
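A sketch of a Euclidean projection onto the probability simplex is given below. It follows the spirit of the stepwise procedure above (project onto the hyperplane, then zero out the coordinates that come out negative and repeat), but it is a generic algorithm, not a transcription of the book's operator.

import numpy as np

def project_onto_simplex(z):
    """Return the point of S = {p : sum p = 1, p >= 0} closest to z."""
    z = np.asarray(z, dtype=float)
    active = np.ones(len(z), dtype=bool)          # coordinates still allowed to be > 0
    p = z.copy()
    for _ in range(len(z)):                       # at most N passes, as noted above
        m = active.sum()
        # projection onto the hyperplane {sum of active coordinates = 1}, cf. (2.12)
        p[active] = z[active] + (1.0 - z[active].sum()) / m
        p[~active] = 0.0
        neg = active & (p < 0.0)
        if not neg.any():
            break
        active &= ~neg                            # fix the violating coordinates at zero
    return p

print(project_onto_simplex([0.7, 0.6, -0.4]))     # -> a valid probability vector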
FIGURE 2.4. Graphical interpretation of the projection procedure.
2.7
Conclusion
This chapter has been concerned with an introduction to learning automata. Several notions and definitions have been given. Both single and hierarchical structures of learning automata were defined. It has been shown that several reinforcement schemes can be associated with the minimization of some functional. The normalization procedure and the projection operator have been introduced for preserving the probability measure. Many of the results presented in this chapter are basic for the theoretical background and the behaviour of learning automata in different kinds of random environments (stationary, nonstationary, etc.). In the next chapters it will be shown that learning automata can be used to solve a large class of stochastic optimization problems and that they exhibit a satisfactory degree of robustness (insensitivity to uncertainties, etc.).
References

[1] Wiener N 1948 Cybernetics. The Technology Press/John Wiley, New York
[2] Bellman R 1973 Adaptive Control Processes - A Guided Tour. Princeton University Press, Princeton
[3] Åström K J, Wittenmark B 1984 Computer Controlled Systems: Theory and Design. Prentice-Hall, Englewood Cliffs
[4] Najim K 1988 Control of Liquid-Liquid Extraction Columns. Gordon and Breach, London
[5] Tsetlin M L 1973 Automaton Theory and Modeling of Biological Systems. Academic Press, New York
[6] Narendra K S, Thathachar M A L 1989 Learning Automata: An Introduction. Prentice-Hall, Englewood Cliffs
[7] Baba N 1984 New Topics in Learning Automata Theory and Applications. Springer-Verlag, Berlin
[8] Najim K, Oppenheim G 1991 Learning systems: theory and applications. IEE Proceedings Computers and Digital Techniques 138:183-192
[9] Najim K, Poznyak A S 1994 Learning Automata: Theory and Applications. Pergamon Press, Oxford
[10] Walter W G 1953 The Living Brain. Norton, New York
[11] Caudill M, Butler C 1990 Naturally Intelligent Systems. MIT Press, Cambridge
[12] Najim K 1989 Process Modeling and Control in Chemical Engineering. Marcel Dekker, New York
[13] Tsypkin Ya Z 1971 Adaptation and Learning in Automatic Systems. Academic Press, New York
[14] Pathak-Pal A, Pal S K 1987 Learning with mislabeled training samples using stochastic approximation. IEEE Transactions on Systems, Man, and Cybernetics 17:1072-1077
[15] Zurada J M 1992 Artificial Neural Systems. West Publishing Company, New York
[16] Srikantakumar P R, Narendra K S 1982 A learning model for routing in telephone networks. SIAM J. Control and Optimization 20:34-57
[17] Barto A, Anandan P, Anderson C W 1986 Cooperativity in networks of pattern recognizing stochastic learning automata. In: Narendra K S (ed) 1986 Adaptive and Learning Systems: Theory and Applications. Plenum Press, New York
[18] Narendra K S, Viswanathan R 1972 A two-level system of stochastic automata for periodic random environments. IEEE Trans. Syst. Man, and Cybern. 2:285-289
[19] Nedzelnitsky O V Jr, Narendra K S 1987 Nonstationary models of learning automata routing in data communication networks. IEEE Trans. Syst. Man, and Cybern. 17:1004-1015
[20] Narendra K S, Thathachar M A L 1980 On the behavior of learning automata in a changing environment with application to telephone traffic routing. IEEE Trans. Syst. Man, and Cybern. 10:262-269
[21] Koditschek D E, Narendra K S 1977 Fixed structure automata in a multi-teacher environment. IEEE Trans. Syst. Man, and Cybern. 7:616-624
[22] Baba N, Mogami Y 1988 Learning behaviours of hierarchical structure of stochastic automata in a nonstationary multi-teacher environment. Int. J. Systems Sci. 19:1345-1350
[23] Thathachar M A L, Ramakrishnan K R 1981 A hierarchical system of learning automata. IEEE Trans. Syst. Man, and Cybern. 11:236-241
[24] Poznyak A S, Najim K 1997 On the behaviour of learning automata in nonstationary environments. To appear in European Journal of Control
[25] Baba N 1983 The absolutely expedient nonlinear reinforcement schemes under the unknown multiteacher environment. IEEE Trans. Syst. Man, and Cybern. 13:100-107
[26] Baba N 1983 On the learning behaviours of variable-structure stochastic automaton in the general N-teacher environment. IEEE Trans. Syst. Man, and Cybern. 13:224-231
[27] Thathachar M A L, Harita B R 1987 Learning automata with changing number of actions. IEEE Trans. Syst. Man, and Cybern. 17:1095-1100
[28] Najim K, Poznyak A S 1996 Multimodal searching technique based on learning automata with continuous input and changing number of actions. IEEE Trans. on Systems, Man, and Cybernetics 26:666-673
[29] Doob J L 1953 Stochastic Processes. John Wiley & Sons, New York
[30] Robbins H, Siegmund D 1971 A convergence theorem for nonnegative almost supermartingales and some applications. In: Rustagi J S (ed) 1971 Optimizing Methods in Statistics. Academic Press, New York
[31] Nazin A V, Poznyak A S 1986 Adaptive Choice of Variants (in Russian). Nauka, Moscow
[32] Neveu J 1975 Discrete-Parameter Martingales. North-Holland Publishing, Amsterdam
[33] Bush R R, Mosteller F 1958 Stochastic Models for Learning. John Wiley & Sons, New York
[34] Poznyak A S 1975 Investigation of the convergence of algorithms for the functioning of learning stochastic automata. Automation and Remote Control 36:77-91
[35] Kiefer J, Wolfowitz J 1952 Stochastic estimation of the maximum of a regression function. Ann. Math. Stat. 23:462-466
[36] Shapiro I J, Narendra K S 1969 Use of stochastic automata for parameter self optimization with multimodal performance criteria. IEEE Trans. Syst. Man, and Cybern. 5:352-361
[37] Varshavskii V I, Vorontsova I P 1963 On the behavior of stochastic automata with variable structure. Automation and Remote Control 24:327-333
[38] Narendra K S, Parthasarathy K 1991 Learning automata approach to hierarchical multiobjective analysis. IEEE Trans. Syst. Man, and Cybern. 21:263-273
[39] Poznyak A S, Najim K, Chtourou M 1996 Learning automata with continuous inputs and their application for multimodal functions optimization. Int. J. of Systems Science 27:87-95
[40] Poznyak A S, Najim K, Chtourou M 1996 Analysis of the behaviour of multilevel hierarchical systems of learning automata and their application for multimodal functions optimization. Int. J. of Systems Science 27:97-112
3 Unconstrained Optimization Problems 3.1
Introduction
Several engineering problems require a multimodal function optimization strategy. Usually, the function f(x) to be optimized is not explicitly known; only noisy (disturbed) sample values of f(x) at various settings of x can be observed, making the usual numerical optimization procedures useless. Random search techniques [1]-[2], the model trust region technique [3], simulated annealing [4] and learning automata [5]-[6] have been widely used for the optimization of functions where more than one local optimum exists. Random search techniques are generally based on random sampling and search region contraction [1] or on stochastic approximation techniques [7]-[8]-[9]. In the model trust region technique [3], the step for a new iterate is obtained by minimizing a local quadratic function over a restricted spherical region centered on the current iterate. The simulated annealing method is suitable for the optimization of large scale systems and multimodal functions [4] and is based on the principles of thermodynamics involving the way liquids freeze and crystallize. Learning systems have made a significant impact on many areas of engineering including modelling, control, optimization, pattern recognition, signal processing, neural network synthesis, fuzzy logic processor training and diagnosis. They are attractive and provide interesting methods for solving complex nonlinear problems characterized by a high level of uncertainty [10]-[11]-[12]-[13]-[14]. The learning system consists of a learning automaton operating in a random environment (the problem to be solved). This chapter deals with the use of learning automata with continuous input (S-model environment) for solving unconstrained stochastic optimization problems (optimization of functions with nonunique stationary points where the observations are disturbed by random variables, etc.). This chapter presents a unified analysis of the commonly used reinforcement schemes. It is shown that unconstrained stochastic optimization problems given on compact sets are equivalent, with an ε-accuracy, to stochastic optimization problems given on finite sets. The first part of this chapter will be concerned with learning automata with a fixed number of actions while the second part will be dedicated to
learning automata where the number of automaton actions changes with time [15]-[16]-[17].
3.2 Statement of the Stochastic Optimization Problem

Let us consider the function f(x), a real-valued function of a vector parameter x ∈ X, where X is a compact set in R^M. We would like to find the value x = x* which minimizes this function, i.e.,

    x* = arg min_{x∈X⊂R^M} f(x)        (3.1)
There are almost no conditions concerning the function f(x) to be optimized (continuity, unimodality, differentiability, convexity, etc.). We are concerned with a global optimization problem for multimodal and nondifferentiable functions. Let y_n be the observation of the function f(x) at the point x_n ∈ X, i.e.,

    y_n = f(x_n) + w_n
(3.2)
where w_n is the observation noise (disturbance) at time n. The stochastic optimization problem on the given compact set which we intend to address is: using the observations {y_n}, construct the sequence {x_n} which converges (in some probability sense) to the optimal point x*. Consider now a quantification {X_i} of the admissible compact region X ⊂ R^M:

    X_i ⊂ X,   X_i ∩ X_j = ∅  (i ≠ j,  i, j = 1, ..., N),   ∪_{i=1}^N X_i = X ⊂ R^M        (3.3)
where x_n ∈ {x(1), x(2), ..., x(N)} := X̃, x(i) ∈ X_i. Here the points x(i) are some fixed points (for example, the center point of the corresponding subset X_i) and w_n = w_n(ω) is a random variable given on a probability space (Ω, F, P) (ω ∈ Ω, a space of elementary events) characterizing the observation noise associated with the point x_n. The next lemma states the connection between the original stochastic optimization problem on the given compact X, formulated above, and the corresponding stochastic optimization problem on the finite set X̃.

Lemma 1. Let us assume that on each subset X_i the optimized function f(x) is Lipschitzian, i.e., there exists a constant L_i such that for any x' and x'' ∈ X_i the following inequality is fulfilled

    |f(x') − f(x'')| ≤ L_i ||x' − x''||
Then, for any ε > 0 there exists a quantification {X_i} (i = 1, ..., N) of the admissible compact region X ⊂ R^M such that the minimal values of the optimized function f(x) on the compact X and on the discrete set X̃ differ by not more than ε. Moreover, for this purpose, the quantification number N must satisfy the following inequality

    N ≥ (D/ε) max_{i=1,...,N} L_i        (3.4)

where D := sup_{x,y∈X} ||x − y|| is the diameter of the given compact X.
Proof. Let us denote by x(α) the point of the discrete set X̃ where the minimal value of the optimized function is reached, i.e.,

    x(α) := arg min_{x∈X̃} f(x)        (3.5)

Then, using the Lipschitzian property we directly derive

    f(x(α)) − f(x*) = f(x(α)) − min_{x∈X} f(x) =

    = f(x(α)) − min_{x∈X} Σ_{i=1}^N f(x) χ(x ∈ X_i) =

    = max_{x∈X} Σ_{i=1}^N [f(x(α)) − f(x)] χ(x ∈ X_i) ≤

    ≤ max_{x∈X} Σ_{i=1}^N [f(x(i)) − f(x)] χ(x ∈ X_i) ≤

    ≤ max_{x∈X} Σ_{i=1}^N |f(x(i)) − f(x)| χ(x ∈ X_i) ≤ max_{x∈X} Σ_{i=1}^N L_i ||x(i) − x|| χ(x ∈ X_i) ≤

    ≤ max_{i=1,...,N} L_i d_i

Here d_i := sup_{x,y∈X_i} ||x − y|| is the diameter of the subset X_i. Taking into account that we can organize the quantification such that

    d_i ≤ max_{i=1,...,N} d_i ≤ D/N

we can always select N satisfying the condition

    max_{i=1,...,N} L_i d_i ≤ (D/N) max_{i=1,...,N} L_i ≤ ε
Lemma is proved. □
Based on this lemma we can formulate the following problem statement: using the observations {y_n}, construct the sequence {x_n} (x_n ∈ X̃) which converges (in some probability sense) to the optimal point (3.5).
The solution of this stochastic optimization problem, formulated on the discrete set X̃, leads with ε-accuracy to the solution of the initial stochastic optimization problem given on the compact X, where the accuracy level ε is related to the quantification number N by inequality (3.4). An increase of the quantification number N allows a decrease of the accuracy level ε. This optimization problem formulated on the discrete set X̃ will be stated and solved in terms of the behaviour of a learning automaton operating in a random environment. As a learning automaton is a sequential machine, the quantification number N will be associated with the number of actions of the learning automaton. The next section deals with the link between the previous stochastic optimization problem and learning systems.
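As an illustration of the quantification (3.3) (an assumed construction, not the book's own), the sketch below cuts a box in R^M into N = K^M congruent cells and returns their centers x(i), which then play the role of the automaton actions.

import numpy as np
from itertools import product

def quantify_box(lower, upper, K):
    """Return the centers x(1), ..., x(N) of a K-per-axis grid over the box
    [lower, upper] in R^M, giving N = K**M cells."""
    lower, upper = np.asarray(lower, float), np.asarray(upper, float)
    steps = (upper - lower) / K
    centers = [lower + steps * (np.array(idx) + 0.5)
               for idx in product(range(K), repeat=len(lower))]
    return np.array(centers)

centers = quantify_box(lower=[0.0, 0.0], upper=[1.0, 2.0], K=4)
print(len(centers), centers[0])        # 16 cells, first center at (0.125, 0.25)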
3.3
Learning Automata Design
A learning automaton is connected in a feedback loop to the random medium (environment), where the input of one is the output of the other. Let us consider the reinforcement schemes (updating schemes) which are the mechanisms used to change the probability vector p_n to p_{n+1}:

a) Nonprojectional Reinforcement Schemes

    p_{n+1} = p_n + γ_n T_n(p_n; {ξ_t}_{t=1,...,n}; {u_t}_{t=1,...,n})        (3.6)

    p_1(i) > 0,  ∀i = 1, ..., N

where γ_n is a scalar correction factor, the vector T_n(·) = [T_n^1(·), ..., T_n^N(·)]^T satisfies the following conditions (for preserving the probability measure):

    Σ_{i=1}^N T_n^i(·) = 0,  ∀n        (3.7)

    p_n(i) + γ_n T_n^i(·) ∈ [0, 1],  ∀n, ∀i = 1, ..., N        (3.8)

and {ξ_t}_{t=1,...,n} is the sequence of environment responses ξ_n ∈ R^1 which in our case will be constructed on the basis of the available data (observations).
For the nonprojectional schemes it is usually assumed that the environment response belongs to the closed unit segment, i.e., ξ_n ∈ [0, 1].

b) Projectional Reinforcement Schemes

    p_{n+1} = π_{ε_n}{ p_n + γ_n T_n(p_n; {ξ_t}_{t=1,...,n}; {u_t}_{t=1,...,n}) }        (3.9)

    p_1(i) > 0,  ∀i = 1, ..., N

where π_{ε_n}{·} is the projection operator onto the ε_n-simplex S_{ε_n}^N defined as follows

    S_{ε_n}^N := { p = (p_1, ..., p_N) : p(i) ≥ ε_n > 0,  Σ_{i=1}^N p(i) = 1 }        (3.10)
(for these schemes conditions (3.7) and (3.8) are not obligatory). Different reinforcement schemes satisfying conditions (3.7) and (3.8) are given in Table 2.1 of the previous chapter (see also [14]). In this study we shall be concerned with the Bush-Mosteller [18], Shapiro-Narendra [19] and Varshavskii-Vorontsova [20] reinforcement schemes. The loss function Φ_n (see Chapter 2) associated with the learning automaton is usually given by

    Φ_n = (1/n) Σ_{t=1}^n ξ_t        (3.11)
It is a useful quantity for judging the behaviour of a learning automaton. We will show in the sequel that if a stochastic automaton minimizes its loss function (finds the best control action x(α)) then it automatically solves the corresponding unconstrained stochastic optimization problem on a discrete set. Let us consider now a random stationary environment whose responses ξ_n are characterized by the following two properties:

(H1) The conditional mathematical expectations of the environment responses exist and are stationary, i.e.,

    E{ξ_n | F_{n−1} ∧ u_n = u(i)} a.s.= c(i) = const,   i = 1, ..., N        (3.12)

(H2) The conditional variances of the environment responses are bounded, i.e.,

    σ²(i) := E{(ξ_n − c(i))² | F_{n−1} ∧ u_n = u(i)},   max_{i=1,...,N} σ²(i) = σ̄² < ∞        (3.13)
Here F_n ⊆ F is the σ-algebra generated by the corresponding process, i.e.,

    F_n := σ(ξ_1, p_1, u_1; ...; ξ_n, p_n, u_n)
The next theorem states the equivalence between the problem of asymptotic minimization of the loss function (3.11) and the corresponding linear programming problem.
Theorem 1. The problem of "Optimization of the Learning Automata Behaviour", which consists of finding a sequence {u_n} of automaton control actions minimizing asymptotically the loss function (3.11), i.e.,

    lim sup_{n→∞} Φ_n → inf        (3.14)

under hypotheses (H1), (H2), coincides with the following linear programming problem

    Σ_{i=1}^N c(i) p(i) → min_{p∈S^N}        (3.15)

in the sense that their minimal values coincide:

    inf_{{u_n}} lim sup_{n→∞} Φ_n a.s.= min_{p∈S^N} Σ_{i=1}^N c(i) p(i) = c(α),   c(α) := min_{i=1,...,N} c(i)        (3.16)
Proof. N
It is clear that the minimal value of the function ~ c(i)p(i) is equal to c(a) i=1
and can be achieved with the optimal pure strategy p* defined as follows p* (i) = ~ii, (i = 1,..., N)
(3.17)
Indeed, N
E c(i)p(i) i=t
N
> min --p6S N
E
N
c(i)p(i)
> min --p6S N
i=l
E i=1
min
i=l,...,N
N
---- c(&) min E P ( i ) = c(c~) p 6 S N i=1
and
N
=
e(i)p*(i) i=1
c(i)p(i) =
3.3. Learning Automata Design
49
To prove the first equality in (3.16), let us rewrite q)n (3.4) in the following form: N
¢2n = E fn(i)~,~(i)
(3.18)
i=1
where the sequences {fn(i)} and {~n(i)} are defined as follows
f,~(i) E
1 '~ := - E n
X(ut = u(i))
t=l
~(u(i)'w)X(Ut=u(i))
*=~ ~
~X(ut=u(i))>O
if
0
if ~
x(ut = u(i)) = o
t=l
Taking into account assumption (H1), and according to lemma A.12 [14], for almost all co e / 3 ~ : =
co :
~(ut
= u(i))
..... o o
t=l
we have lim ~ ( i ) = T~
OC
c(i)
For almost all co ~ Bi we evidently have timsup l~r~(i)l < oo and, hence, lim A ( i ) = 0 But the vector fn which components are fn(i), belongs to the simplex S/v, and as a result any partial limit • of a the sequence {On} may be expressed, with probability 1, in the following form: N
• = Ep(i)c(~),
p c s N
i=t
where p is a partial limit of the sequence {fn}. Hence, N
a.$.
t i m s u p On > l i m i n f On > min n ~
n~o~
-- PESN
Ep(i)e(i ) i=1
and this inequality turn to an equality if
p,~(i) :=
P
{un =
u(i)I~-,~-, } =
p*(i),
(i = 1, ...,N)
Theorem is proved. ∎
The next theorem is fundamental. It shows that if a nonstationary strategy {p_n} which converges to the optimal strategy p* is used, then the sequence {Φ_n} of loss functions reaches its minimal value c(α).

Theorem 2. If under assumptions (H1), (H2) a reinforcement scheme generates a sequence {p_n} which converges with probability one to the optimal strategy p* (3.17), then the sequence {Φ_n} of loss functions reaches its minimal value c(α) with probability one, i.e., if

    ||p_n − p*|| = o_ω(1)  as n → ∞

then

    lim_{n→∞} Φ_n a.s.= c(α)        (3.19)
Proof.
In view of the strong law of large numbers [21]-[22] (assumptions under consideration), we have 1E~ ~
1
n
n
t=l
E {~tlgr,~_~} ~:2~ 0 n~oo
t=l
But 1 E n
E {~t [.T',~_,}
=
t=l
- Z n
1
=
=
n
i
- ~c(i)p'(i) n
=
~(i)p~(i)
t=l
n
t=l
c@+o~(i)
n
+ - ~c(i) ~
[p,~(i) - p * ( i ) l =
t=l
c(~)
Theorem is proved. • C o r o l l a r y 1. If there exists a continuous function W(p,~), which on the trajectories {Pn} possesses the following properties: 1. W(p) >_ W(p*) = 0 Vp E S iv 2. w ( p ~ )
~
o
then, this trajectory {Pn} is asymptotically optimal, i.e., p,~ a-. 8-, ~ p Tb~
O0
$
Proof.
follows immediately from the property of continuity and the considered assumptions. ∎
This scalar function W(p) is called a Lyapunov function. It will be used for the convergence analysis (some interesting properties) of different reinforcement schemes realizing an asymptotically optimal behaviour of the considered class of learning stochastic automata. Notice that the main obstacle to using the Lyapunov approach is finding a suitable Lyapunov function.
3.4
Nonprojectional Reinforcement schemes
Let us now consider some nonprojectional learning algorithms (reinforcement schemes) of the type (3.6), and the following Lyapunov function:

    W_n := (1 − p_n(α)) / p_n(α)
(3.20)
It is obvious that this function satisfies the first condition of the previous corollary. We would like to check the validity of the second condition for several commonly used reinforcement schemes applied to nonbinary environment responses, and then to show how to use them for optimization purposes.
3.4.1
BUSH-MOSTELLER REINFORCEMENT SCHEME
The Bush-Mosteller scheme [18]-[14] is described by the following recurrent equation:

    p_{n+1} = p_n + γ_n [ e(u_n) − p_n + ξ_n (e^N − N e(u_n)) / (N − 1) ]        (3.21)

where
(i) ξ_n ∈ Ξ = [0, 1],
(ii) e^N = (1, ..., 1)^T ∈ R^N,
(iii) e(u_n) = (0, 0, ..., 0, 1, 0, ..., 0)^T ∈ R^N (the i-th component of this vector is equal to 1 if u_n = u(i) and the other components are equal to zero),
(iv) γ_n ∈ (0, 1).
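A sketch of one Bush-Mosteller step (3.21) is given below; the surrounding loop, the environment and the choice of the time-varying γ_n are left to the caller.

import numpy as np

def bush_mosteller_step(p, i, xi, gamma):
    """p: current probability vector p_n; i: index of the selected action u_n;
    xi in [0, 1]: environment response; gamma in (0, 1): correction factor."""
    N = len(p)
    e_i = np.zeros(N)
    e_i[i] = 1.0                                   # e(u_n)
    e_N = np.ones(N)                               # e^N = (1, ..., 1)^T
    return p + gamma * (e_i - p + xi * (e_N - N * e_i) / (N - 1))

p = np.full(4, 0.25)
print(bush_mosteller_step(p, i=2, xi=0.0, gamma=0.1))   # reward: p(3) increases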
The key to being able to analyse the behaviour of a learning automaton using the Bush-Mosteller reinforcement scheme is the following theorem.

Theorem 3. If
1. the environment response sequence {ξ_n} satisfies assumptions (H1), (H2),
2. the optimal action u_n = u(α) is unique, i.e., min_{i≠α} c(i) > c(α) = 0,
3. the correction factor is selected as follows:

    γ_n = γ / (n + a),   γ ∈ (0, 1),  a > γ

then the Lyapunov function (3.20) possesses the property W(p_n) a.s.→ 0 and, hence, the reinforcement scheme (3.21) generates asymptotically an optimal sequence {p_n}, such that p_n a.s.→ p*.
Proof. Let us consider the following estimation:
p,~+l(i) = p,~(i) + %~[X(U,~ = u(i)) - pn(i)+ +~n [1 -- N x ( u n = u ( i ) ) ] / ( N -- 1)] = pn(i) (1 -- ~ ) +
+~--1
{ ~ + X(u,~ = u(i)) IN(1 - ~n) " 1]} >
(3.22)
n
>_p . ( i ) (1 - ~.) >_... > pl (i) 1 ] ( 1 - ~ ) > 0 t=l
l~¥om l e m m a A.4 [14], it follows t h a t
H'~o - ~ ) _ > t=l
{ ~.\o/ (a-:-x'~ "y ~-~ ,
, a>~
(0, 1)
"y= i, a > O
(3.23)
3.4. Nonprojectional Reinforcement schemes
53
The reinforcement scheme (3.21) leads to E { 1 =pn+l(°~) P~--~I (--~) /.T'~}..s.=
"{'
°~ ]~E -~
-i=1
1/.r,,_,
Au,, = u(i
,
-
E{P'~+I(a)/~'£--IAu'~ =u(i)}
1
)
;.(i)° ~ p,~(i)+s~
(3.24)
where
s~:~E =
p~
+~(a)
1
}
E{p.~+,(,~)/~-.._, n~. = ~,(i)}/~"-' A~ ="(i) ;..(~)
(3.25)
Let us now estimate the term sn (3.25): N 8n : E E { i=1
rn/~n-I
A~-_-~(i)} p~(i)--
N
EE{Y./~._,A.,, =.(~)};.(~)+ (3.2~) i~#a -kS{ E{pn+l(OL)/~-'~)-7-~..'~__ '~n-l-p'~A+l'/$'n ~'~= "//,(C~)}iX ~i-zU-'-~" pn+l~ (Ot)/-~"n--1A'n =='U,(OL)}X =
xp~(a) where y . = E{pn+l(a)/~',~-i Au~ = u(i)} - p n + l ( a ) p~+l (a)E {p,~+l(a)/:F,~_l A u~ = u(i)} Taking into account that P n + l (Ce) : pn(c~) + "Tn[Si,a -- p n ( i ) -[- ~n [1 -- N S i , a ] / ( N - 1)] --
= f
L we derive
p,~(a)(1 - ~.,) + %~ if u(i) = u(a) (~,~ '~'~" O) p,,(a)(1 - %~) +'7,~,~/(N - 1) if u(i) ¢ u(a)
54
3.4. Nonprojectional Reinforcement schemes
and from (3.26) we obtain N
8n = E E
"" {
= ~(~)},:',,(':)"-'
~(i) - ~,,,
=.:,.,,, Z E
,,-
{ gn/'~n-1A'-
}
/-~o-, A ~ =~'(~1
P'(i)
,,,..~. =
E{.., r
where
N-1 A,~ =: p,~(a)(1 - "y,~) and
B~ = (c(i) ~,,) -
Using then Cauchy-Bounyakovskii inequality we get ~f2C
1
N
S n ~ (/~--_1)2 E
2,
"y~C
,
[pn(a)(1--'Tn)+
1
N ~] Ep,~(i )=
1
--< ( / Y - - l ) 2 p2n(o0(1 - q'n) 2 [pn(a)(1 - q'n) +
"7~C
,,:#.
W,,
(N- 1)2 p~(~)(1_ ~)2 [p~(.)(1-~) + ~ ] where
c :::m~x ,,o V/E {(~(') - ~,,)~/-~,,-, A-° = ~,(,)} =m~", ,~,,,
o~ := E {(4i)
- ~.) ./.r,._, A ~ = ~(i)} c- :=max
c(i)
(3.27)
3.4. Nonprojectional Reinforcement schemes
55
Using the lower estimations (3.22) and (3.23) in (3.27), we finally obtain
sn <_ ~ C
Wn < ( N - 1 ) 2 p n ( a ) ( 1 _ ,yn)2 [p,~(c0( 1 - ' Y n ) + ~_~]] ~C
W~ <-- (]V----1)2pn(a)min {pn(c0; ~ <
~/2C
< } -
Wn
1 - (N - 1) 2 (a - 7) 2~ (n + a) 2(1-~)
(3.29)
=
= n-u('-'Y)C, (i + o(1)) W,~ C1 :=
"Y2C (a - ?,)2-~ (N - 1) 2
a.8.
for n _> n0(w), no(ca) < oo . Let us estimate the first term in (3.24):
N(
1
)
i=~ E{pn+l(cx)/~j-1Au,~ = u(i)} - i
N(
i
•
E{pn+l(OO/.~n-1Aun -- u(i)} 1
+ E{pn+l(O~)/Jzn-1Aun == u((~)}
N(
+
i
(
b ~ ( . ) ( 1 - - ~ ) +-~.] 1
p~(~)(1
<
p~(i) =
- i ) p~(i)+
- 1 )p~(a)
=
- 1 ) p,~(i)+ - 1 ) p.(a) < - i ) (i - p~(a)) +
(1 -- p~(c~)) (1 - 7~)p~(~)
- z~) + ~ - i
(3.28)
<_
(i - p,~(~)) +
56
3.4. Nonprojectional Reinforcement schemes (I - -
pn(a)) (i
p~(~)(1
- z.)
-
+
~)
-='a'
z~
[ i
~n
i -
W~
Pn (c~) N - 1
Using the estimations (3.22) and (3.23), we derive:
E
E{p.~I(a)/~-.-I Au.
u('i)} - 1
p,~(i) <
i=1
1
<
-
(i-7.)+
[
I -
c]
~V~ =
~ ( a - ~ ) - ~ Q Nc__~ (n+a)l-'~pl(o -1
l Fn-(l-7)C2(l
=
3'n
q ]w . = [l - ~ . ~ :c(3.30)
=
[i -- n-(1-7)Cfi2(1 +0(i)H-O(n-?))] =
Wn----
[I- n-('-~)C2(l +o(I))] 14~
with C~=
7c IN - 11p, (a) (a - ~)7
Combining (3.24), (3.29) and (3.30), we finally obtain: a.8~
_ Wn
[1- n-(1-7)C2(1 + o(1)) + n-2(1-7)C1 (1 + o(1))] = ---- Wn [I
- n-O-~)C2(l + o(i))]
(3.31)
with 7 2 max
i#~
(7 i
N'Tc-
C1 := ( a - 7 ) 2 ~ ( N - 1 ) 2, C 2 = [ N - 1 ] ( a - 7 )
7
(3.32)
Appealing to (3,31) and Robbins-Siegmund theorem [23] we obtain the convergence with probability one. Theorem is proved. •
Corollary 2. Under the assumptions of this theorem the convergence rate is described by

    E{W_{n+1}} ≤ (N − 1) ∏_{t=1}^n [ 1 − t^{−(1−γ)} C_2 (1 + o(1)) ] ≤        (3.33)

    ≤ (N − 1) exp{ − (C_2 − ε) n^γ },   0 < ε small enough

Proof.
This result follows directly from (3.31) after averaging and using the following inequalities

    ∏_{t=1}^n [ 1 − t^{−(1−γ)} C_2 (1 + o(1)) ] ≤ ∏_{t=1}^n [ 1 − t^{−(1−γ)} (C_2 − ε) ]

and

    ∏_{t=1}^n [ 1 − t^{−(1−γ)} C ] ≤ exp{ − (C − ε) n^γ }
Corollary is proved. ∎
The message in this theorem is that a learning automaton using the Bush-Mosteller reinforcement scheme selects asymptotically the optimal action u_n = u(α), i.e., its behaviour is asymptotically optimal. The learning process rate increases when the parameter γ in the reinforcement scheme (3.21) (see the third assumption of this theorem) is selected close to 1. The next subsection deals with the analysis of the Shapiro-Narendra reinforcement scheme.
3.4.2
SHAPIRO-NARENDRA REINFORCEMENT SCHEME
Consider now the following reinforcement scheme [19]-[14]:

    p_{n+1} = p_n + γ_n (1 − ξ_n) [ e(u_n) − p_n ]        (3.34)

with

    γ_n ∈ [0, 1],   ξ_n ∈ [0, 1],   p_1(i) = 1/N,
    e(u_n) = (0, ..., 0, 1, 0, ..., 0)^T ∈ R^N,   u_n = u(i),
    e^N = (1, ..., 1)^T ∈ R^N

We are now ready for our main results.
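A sketch of one Shapiro-Narendra step (3.34) is given below: the probability of the selected action is reinforced in proportion to (1 − ξ_n), so a small (favourable) response moves p_n strongly towards e(u_n), while ξ_n = 1 leaves it unchanged.

import numpy as np

def shapiro_narendra_step(p, i, xi, gamma):
    e_i = np.zeros(len(p))
    e_i[i] = 1.0                       # e(u_n)
    return p + gamma * (1.0 - xi) * (e_i - p)

p = np.full(5, 0.2)
print(shapiro_narendra_step(p, i=0, xi=0.25, gamma=0.1))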
Theorem 4. If
1. the environment response sequence {ξ_n} satisfies assumptions (H1), (H2),
2. the optimal action u_n = u(α) is unique, i.e., c^− := min_{i≠α} c(i) > c(α) ≥ 0,
3. the correction factor is selected as follows:

    γ_n = γ / (n + a),   γ ∈ (0, 1),  a > γ

then the Lyapunov function (3.20) possesses the following property: W(p_n) a.s.→ 0 and, hence, the reinforcement scheme (3.34) generates asymptotically an optimal sequence {p_n}, such that p_n a.s.→ p*.
Proof. We follow the lines of the proof described above. Let us consider the following estimation:
Pn+I (i) = p,~(i) + 7,, (1 -- ~n) [X(U,~ = u(i) ) -- p,~(i)] =
=p,,(i)[1-.),,,(1-5,,)]+7~(i-5~)x(u~=u(i))>_
(3.35)
_> p~(i) (1 - ~ ) _>... _> pl (i) 1 ] ( 1 - ~ ) > 0 t=l
T h e reinforcement scheme (3.34) leads to
E ~ 1 -~- P n,+~l ~-~ ( o 0 / f ,, "~ ~.~. L P,~+I t ) °~ ~E ~=1
=Z i=l
p,,+l (~)
i/~_~
J
A u~ = u(i )} p~(i) a.~.
E.pn+l,a,.~_-l^u,~=u,i,~-i
p,~(i)÷sn
(3.36)
3.4. Nonprojectional Reinforcement schemes where
N =
-
1 E{P.+I(c0/Jr.-1Au. =
1 p~
a)
u(i)}/~='~-~
sn
Let us now estimate the term
{
59
Au" = u(i) } p,,(i)
(3.37)
(3.37):
x ~ E f E , - p,,+~(o,) ,~. } o..,. s,~ = ~ ~ p,~+,(a)E,~ / " ' ~ - ' A u n = u(i) p,~(i) = N
= oEE{ ~=~ p,.(,:,)
c,
+ .~,.(1 - ¢.) (~~,. - p,,(c,)) x
Xp,~(c~) +"/n(1 -c(i))(5~,.-p,~(a))/'~'~-1 N : "/n E E ~=1
=
[ Ci [ 1 - p.,(,~)+'~,,(~,,,.-p,,(~,))\ ,7.¢,.(~,~,-p,.(,~)) + o ( p. 7.~,,(&,,.-p.(a)) (,~)+'~. (~,,.'p.(~))) ]
t
Pn(a) + Rfn (5i,,:, -- p.~(a))
P,~(cO)
=%~EE
/ ~ . _ , / ~ = ~(i)~ p.(i) L(1-7-) - ~ kO-~,,)JJ x
/
(-i) x [i - .~,,-(f = 40)]
+'YnE
x p,(o,)
=
C~ [ p.(,~)+~.O-p.~(o,)) kp.(~)+~.O-p.(,~))]j Pn(a) + "In (1 - pn(a)) x (1 -- p,.(~)) + ~,,,(1 c(~)) (1 - p , , ( ~ ) ) / ~ " - '
where Q = (c(i) -
~.)
and C~ = ( c ( a ) - ~ . )
A~,.
= ~,(,~1} p,.(~)
x
a._~88.
60
3.4. Nonprojectional Reinforcement schemes Using the Cauchy-Bounyakovskii inequality, it follows a.s. Sn ~
~2
p~(a)
(1 - ,.y,~)2 [1 - "/,~ (1 - c-)]
×
N
× E E {lC(i) --~nl [i 4-0(1)]/.7n-1A'~
+~
=
p ~ ( a ) (1 - p n ( a ) ) 2
×
[10n(a) + "Yn(1 - p,~(a))] 2 [p,~(a) + 7,~(1 - c(a)) (1 - p,~(a))] × E { I c ( a ) - ~nl[1
=
<'~C[I+o(1)]Wn
(1_,,/,~)211
~n(l_c_)]+
p~(a)w~
]<
b n ( a ) + "~n (1 -- pn(a))] 2 h a ( a ) + 7n(1 -- c(c0) (1 -- pn(a))] 2
< "y~C [1 + o (1)] W~ (1 + W~) -
~/~ C [1 + o (1)] W~
p.(a)
(3.38)
where C, cri2 and c- are defined as follows:
C:=maxIE{(c(i)-,#a ~ ) 2/~-~- I A
u~
= u(i)} = maai ix# ~
c- :----max c(i) Using inequalities (3.23) and (3.34) we derive: n
"ynpnl(a) < p11(a)
(1 -- t T a )
_ p-lCca ~- (n + a) "~ < a ~ Jn+a(a_,~).~ =
<
(3.39)
p l 1 ( a ) ' ~ [1 7t- o(1)] ~t_l+ ~ (a -
~)~
We finally obtain: a.8.
sn <_ n-2+~C1Wn
(3.40)
with Cl := pll(a)~/2 [1 + o(1)] c (a - 7) ~
(3.41)
3.4. Nonprojectional Reinforcement schemes
61
Let us estimate the first term in (3.36):
N(
1 A~- =
~=~1 E { p . + , ( o O / ~ X ,
N(
=E i=l
~(i)} - 1
p.(i) =
i
)
Pn(Ot) "~- ~/n(1 -- c ( i ) ) (~i,c~ - - p n ( o ~ ) )
N(
1 ~,~(~)(1
+
)
_ 1)
-y.(1
-
-
,~(a)(1 -- 3'n(1 - c-)) - 1
=
pn(i)+
~(i)))]
[p,~(c 0 + ~,~(1 - c(~)) (1 - p,~(a))t - 1
--< E
pn(i)
- 1
p,~(~) <
pn(i)+
p,~(~) (1 - p~(cO) [1 - ,~,~(1 - c(cO)l p,~(~) +-y,~(1 - c(~)) (1 - p~(~)) Using the Lyapunov function (3.20) we obtain:
E
i=I
E{pn+l(~)/~n-, Aun =
< W~ [1 - p ~ ( a ) ( 1 - 2~(1 - c - ) ) -
1 - ~n(1 - c-)
u(i)}
pn(i) <_
- i
_p~(a) [1 ~_-y~(1 _~_c(a))] + p~(a)
+ ~(1
]
- c ( ~ ) ) (1 - p ~ ( ~ ) )
J
(3.42) These terms can be expanded via Taylor series about ~/n, i.e., i -- pn(o~)(l
1 --
-- ?n(l
"~,~(1-
-
c-))
== i + ' T n ( l
-
c-)-
c-)
--pn(a) [1 - "72(1 -- c - ) 2] ÷ O('y~)
p2(a) [1 - "7,~(i - c(a))] p . ( ~ ) + "y.(1 -- c(a)) (1 - Pn(a)) = P " ( a ) - "y.(1 - c(a)) + O ( r , o , _ , ) Substituting these relations into (3.42) we obtain:
~=~ E{pn+l(OZ)/.~(_l Aun = u(i)} - 1 pn(i) <_
62
3.4. Nonprojectional Reinforcement schemes
If the estimations (3.22) and (3.23) are used, the following is obtained: i=~
E{p~+l(a)/Sr~l h u ~ = u(i)} - 1
pn(i) <
< w " ( i - 7--7-[~- - c(~)] + °( pl(c~) ~( a - : 2 ~ 7 ) ) n+a \ ,~+a}
-
=Wn(1
=
n+a'7[c--c(°z)]+O(n-2+7))
(3.43)
Combining (3.36), (3.40) and (3.43), we firmlly obtain: E {H~+l/~'n} ~'_ Wn (1
n+a~' It- - c ( a ) ] + O(n -2+'r) + n-2+vC1) =
From (3.44) and the Robbins-Siegmund theorem [23] we obtain the convergence with probability one. Theorem is proved. ∎
Corollary 3. Under the assumptions of this theorem the convergence rate is described by the formula

    E{W_{n+1}} ≤ (N − 1) ∏_{t=1}^n ( 1 − (γ/(t + a)) [c^− − c(α)] (1 + o(1)) ) ≤        (3.45)

    ≤ (N − 1) n^{−C},   C = [γ − ε][c^− − c(α)],   0 < ε small enough
Proof. This result follows directly from (3.44) after averaging and using the following inequalities

    ∏_{t=1}^n ( 1 − (γ/(t + a)) [c^− − c(α)] (1 + o(1)) ) ≤ ∏_{t=1}^n ( 1 − t^{−1} [γ − ε][c^− − c(α)] )

and

    ∏_{t=1}^n [ 1 − t^{−1} C ] ≤ exp{ −C Σ_{t=1}^n t^{−1} } ≤ exp{ −C ln n } = n^{−C}
Corollary is proved. ∎
From this theorem and its corollary it follows that a learning automaton using the Shapiro-Narendra reinforcement scheme and operating in a random stationary environment with nonbinary responses belonging to the unit segment selects asymptotically the optimal action u_n = u(α), i.e., its behaviour is asymptotically optimal. The learning process rate increases when the parameter γ in the reinforcement scheme (3.34) (see the third assumption of this theorem) is selected close to 1. It also increases when the difference [c^− − c(α)] increases. The next subsection deals with the analysis of the Varshavskii-Vorontsova reinforcement scheme [20].
3.4.3
VARSHAVSKII-VORONTSOVA REINFORCEMENT SCHEME
Consider now the reinforcement scheme described in [20]-[14]:

    p_{n+1} = p_n + γ_n p_n^T e(u_n) (1 − 2 ξ_n) [ e(u_n) − p_n ]
(3.46)
with

    γ_n ∈ [0, 1],   ξ_n ∈ [0, 1],   p_1(i) = 1/N,
    e(u_n) = (0, ..., 0, 1, 0, ..., 0)^T ∈ R^N,   u_n = u(i),
    e^N = (1, ..., 1)^T ∈ R^N
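A sketch of one Varshavskii-Vorontsova step (3.46) is given below. The step size is scaled by p_n^T e(u_n) = p_n(i), which is what makes the scheme quadratic in p_n, and the factor (1 − 2ξ_n) changes sign at ξ_n = 1/2 (responses above 1/2 act as penalties, responses below 1/2 as rewards).

import numpy as np

def varshavskii_vorontsova_step(p, i, xi, gamma):
    e_i = np.zeros(len(p))
    e_i[i] = 1.0                       # e(u_n)
    return p + gamma * p[i] * (1.0 - 2.0 * xi) * (e_i - p)

p = np.full(4, 0.25)
print(varshavskii_vorontsova_step(p, i=1, xi=0.9, gamma=0.5))  # penalty: p(2) shrinks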
In comparison with the two previous reinforcement schemes, which are linear with respect to the vector p_n, this scheme is nonlinear (quadratic) and, as a consequence, its properties differ from those stated in the previous subsections. In spite of its nonlinear character, the convergence analysis of this scheme can be done on the basis of the Lyapunov approach using the Lyapunov function (3.20).

Theorem 5. If
1. the environment response sequence {ξ_n} satisfies assumptions (H1), (H2),
2. the optimal action u_n = u(α) is unique and all the nonoptimal average losses (corresponding to the nonoptimal actions) are greater than 1/2, i.e., c^− := min_{i≠α} c(i) > 1/2 > c(α) ≥ 0,
3. the correction factor (learning rate) is selected as follows:

    γ_n ∈ ( 0, min{ [min{|b|; (1 − 2c(α))}]^{−1} ; 1 − ε } ),   Σ_n γ_n = ∞,
    0 < ε small enough,

then the Lyapunov function (3.20) possesses the property

    W(p_n) a.s.→ 0  as n → ∞

and, hence, the reinforcement scheme (3.46) generates asymptotically an optimal sequence {p_n}, such that

    p_n a.s.→ p*  as n → ∞
Proof. The reinforcement scheme (3.46) leads to
E{1-p,~+l(o~)/,~n~a.~ EN E
1
i/~_, A u ~ = ~(i)} ×
Xpn (i) "'~" :
=E
(3.47)
where s,~ = ~ E
=
1
E{p~+,(~)/~_~
A~
{P~+
1(
..... U(i)}/jzn-1
~)
}
A u n = u(i) pn(i)
(3.48)
Let us now estimate the term sn (3.48):
~E~E~-P~+I(a)/~r N = 2~/~ E
~=1
Xp~(~) + ~ ( i
{ E
}
....
p~(i) ( & - e(i)) p~(~) ~ ( T - ~ - i ~ y - p ~ ( ~ ) ) p ~ ( i )
×
(~,. - p~(~)) } .... - 2e(i)) (~,~ - p~(~))p~(i) - / 7 ~ - ' A u~ = ~(i) p~(i) = N
= 2 ~ ~ E [ p~(i) (& - ~(i)) [1 - c,~ + o(C~)] i:1
[
P~(a) + ~/n (6i,, -p,~((~))pn(i)
×
3.4. Nonprojectional Reinforcement schemes
X p~(a)
, A~,, =-(~)}~(~) °~
?~(1 -(~""--P~(~)) 2c(i))(fi,~ -p~(c~))p,~(i)' '~ '~
[ ~.~.p.(i) -[- ot~[1-,7~p,~(i)]lJ .~,,~,,p,~(i) ~] f p,~(i)(c(i)-~n) Lil-~t~p~(i)] t pn(o0 (I - ' ~ n p , ( i ) ) x
- - 4 " y n ENE
i¢~
1 x
65
I - "y,~(I - 2c(i))p,~(i) /J:,~_,
A~. = ~(i)} p.(i) ~
[__ "/n¢.(1--p,~(a)) ~_ O('tn~n(1-pn(a)) '~]
( ~ " - c(c0) L I+y"(1-P"(C0) "t-41'r,E
1+q'"(1-P'~(c0) 'J
1 -'b W,~ (1 - p~(o~))
(1 - p,~(,~))
x I + %~(1 - 2c(~)) (1 - p,~((~))
x
/y,~ , A~,, = ~(,~)} "2
< 4.),~ [1 + o(1)] ~-~ p~(i) E {Ic(i) -~,~t/:r,~ -, Au,~ = ~(i)} p~(a) ~ . (1 - %@n(i)) [1 - ?~ (t - 2c(i))p,~(i)]
+4"72[l+°(1)](1-p'~(a))E{l~-c(a)l/'U'~-liur~=u(a)}
(3.49)
where
-~,
~ {p,.+,(,~)/~,. , A . o = u(,)}
and
c.= Using the Cauehy-Bounyakovskii inequality, from (3.49) we get the following upper estimation:
s,~ < 4ff~C[l +o(1)]W,~ ( 1 - - ? ~ )
(3.50)
_< 87~C[1 + o(1)1W,, where
C:=maxIE{(~()'-~")~/Y"-A' "~=u()i}=max'~,'¢,~
,¢,,
4 :: E {(c(,)- ~,.)~/~o, A - - : ~(')} c - ::=:max
c(i)
Let us now calculate the first t e r m in (3.47):
£( ~=a
,
E{p,+l(c~)/$'n-1Au~
= u(i)}
-1
( ) o.s.
p,~ i
=
66
3.4. Nonprojectional Reinforcement schemes N
u,~t~J
-- 1 =
= Z p=(a) + 3'n(1 -- c(i)) (6~,. - pn(o~)) pn(i) i=I
_
1 ~
pn(a) . -~
p,,(i)
[1 - ~/=(1 -- 2c(i))pn(i)]
+
(3.51)
1 --1 1 + "/,(1 - 2c(a)) (1 - - p , ( a ) )
Let us now maximize the first term in (3.51) with respect to the components pn(i) (i ¢ v~) under the constraint N
Zpn(i)
(3.52)
= 1-p,~(c~)
i:/:c~
To do that, let us introduce the following variables xi:=pn(i),
ai:=7=[1-2c(i)],
i#a
It follows that iv
P t~)-,~':"
N
i¢=aZ 1 - "y.. [1 - 2c(i)1 p.(i)
= ~. 1 -X~aixi
:= F(x)
(3.53)
To maximize this function (3.53) under the constraint (3.52), let us introduce the following Lagrange function:
L(x, )~) := f ( x ) - )~
x~ - (1 - x , )
The optimal solution (x*, A*) satisfies the following conditions of optimality: 0 1 Oz L(z*,A*)-)2 A* = 0 V i ~ a N
From these optimality conditions, it follows that
,_ 11-z. N
i~ot
v~:=(1 '
1-z.)_i N
is~ot
3.4. Nonprojectional Reinforcement schemes
67
Hence
F(x) < F(x*)
1 - - x ~ )-1 N
= (I - x~) (I
E a; ~ From this inequality we derive
1 - p,~(a) 1 - ~ b [ 1 - pn(a)]
(3.54)
where
(3.55) Notice t h a t according to the first assumption of this theorem b<0 So, we can rewrite (3.54) as follows
N
;~(i)
1 - Pn(a)
p,~(i) ) <
(1-7=[1-2c(i)1
1 +,-r, Ibl [1 -
p,(~)]
(3.56)
Substituting (3.56) into (3.51) leads to: N (E{
i=-"~
1
Prtq-I (Ot)/JYJ--1 A Un = ~t(i)} <
t
1-
- 1 ) ;~(i) ° S
pn(o~)
-- p,~(~) 1 + ?,~ Ibl [1 -
p,~(o,)]
~,,~(i -- 2c(o0) ( t - p ~ ( ~ ) ) 1 + %~(1 - 2c(a))(1 - p ~ ( a ) )
= w.
1 " / n ( 1 - 2c(a))pn(tl) ] = 1 + - r , tbl [1 - p,~(a)] - 1 +%~(1 - 2c(a))(1 - p n ( a ) ) J
= W= (1 - 7~ [Ibl (1 - p=(~)) + (1 - 2c(a))pn(c0] + O(7~2)) <
<_W,~(1-,~nmin{Ibl;(1-2c(a))}+O(~) )
(3.57)
Combining (3.47),(3.51) and (3.57) gives rise to the following inequality:
E {Wn+l/.7:n-l A un = u(i) } a~" Wr~ (1- "Tnmin {[b[ ; B} + O('72)) (3.58) where
B = 1 - 2c(~)
From (3.58) and the Robbins-Siegmund theorem [23] we obtain the convergence with probability one. Theorem is proved. ∎
From this theorem it follows that a learning automaton using the Varshavskii-Vorontsova reinforcement scheme and operating in random stationary environments with nonbinary responses belonging to the unit segment selects asymptotically the optimal action u_n = u(α), i.e., its behaviour is asymptotically optimal.

Corollary 4. Under the assumptions of this theorem the gain (correction factor) γ_n can be selected as a constant (satisfying the third condition of this theorem), i.e., γ_n ≡ γ, and then the convergence rate is exponential and described by the formula

    E{W_{n+1}} ≤ (N − 1)(1 − C)^n        (3.59)

where C = γ min{|b|; (1 − 2c(α))}(1 − ε) ∈ (0, 1),  0 < ε small enough.
Proof. Follows directly from (3.58) after applying the operator of mathematical expectation. ∎
The rate of the learning process is high if the parameter γ in the reinforcement scheme (3.46) is close to its admissible upper value defined by the third condition of this theorem.

Corollary 5. If under the assumptions of this theorem

    γ_n = γ / (n + a),   a < γ ∈ (0, 1]

then the convergence rate is described by the formula

    E{W_{n+1}} ≤ (N − 1) ∏_{t=1}^n ( 1 − (γ/(t + a)) min{|b|; B} (1 + o(1)) ) ≤ (N − 1) n^{−C}        (3.60)

where B = 1 − 2c(α) and C = [γ − ε] min{|b|; (1 − 2c(α))},  0 < ε small enough.
Proof.
This result follows directly from (3.58) after averaging and using the following inequalities
    ∏_{t=1}^n ( 1 − (γ/(t + a)) min{|b|; (1 − 2c(α))} (1 + o(1)) ) ≤ ∏_{t=1}^n ( 1 − t^{−1} [γ − ε] min{|b|; (1 − 2c(α))} )

and

    ∏_{t=1}^n [ 1 − t^{−1} C ] ≤ exp{ −C Σ_{t=1}^n t^{−1} } ≤ exp{ −C ln n } = n^{−C}
Corollary is proved. • In this case the convergence rate also increases if the p a r a m e t e r ~ increases too. A remark is in order here. We have seen t h a t the Lyapunov function seems to be helpful in our analysis, and this will be reinforced as we proceed. The next section deals with the normalization procedure and the optimization algorithm.
3.5
Normalization Procedure and Optimization
Let us now show how to apply the nonprojectional reinforcement schemes for solving the stochastic optimization problem (3.5) on discrete sets using only the corresponding observations y_n (3.2) of the function f(x) to be optimized. It is evident that we cannot use these schemes directly for optimization purposes. In fact, environment responses have to belong to the unit interval [0, 1], but the available observations do not necessarily satisfy this condition. What to do? This is the central question of this book. The solution consists of introducing a mapping between the observations y_n and the automaton inputs ξ_n (environment responses). A procedure called the "normalization procedure", which establishes the connection between the environment response ξ_n and the available observations y_n, has been initially described in [14] in order to keep the automaton input within the unit segment [0, 1].
Definition 1. We will define the normalization procedure by the following relation:

    ξ_n := [ s̄_n(i) − min_j s̄_{n−1}(j) ]_+ / ( max_k [ s̄_n(k) − min_j s̄_{n−1}(j) ]_+ + 1 ),   u_n = u(i)        (3.61)
where

    s̄_n(i) := Σ_{t=1}^n y_t χ(u_t = u(i)) / Σ_{t=1}^n χ(u_t = u(i)),   i = 1, ..., N        (3.62)
    [x]_+ := { x   if x ≥ 0
             { 0   if x < 0
(3.63)
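A sketch of the normalization (3.61)-(3.63) is given below. The variable names are mine; the running averages s_bar of the observations, cf. (3.62), are turned into a response ξ_n in [0, 1) for the selected action, and the minimum over j is taken over the averages of the previous step, as in (3.61).

import numpy as np

def normalized_response(s_bar, s_bar_prev, i):
    """s_bar, s_bar_prev: current and previous vectors of per-action averages (3.62);
    i: index of the selected action u_n.  Returns xi_n as in (3.61)."""
    plus = lambda x: np.maximum(x, 0.0)                       # [x]_+ of (3.63)
    shift = s_bar_prev.min()
    return plus(s_bar[i] - shift) / (plus(s_bar - shift).max() + 1.0)

s_prev = np.array([2.0, 0.5, 1.5])
s_curr = np.array([2.1, 0.4, 1.6])
print(normalized_response(s_curr, s_prev, i=0))   # larger for worse (higher) averages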
The following assumptions concerning the properties of the observation noise {w_n} will be in force throughout this book.

(H3) The conditional mathematical expectations of the observation noise w_n are equal to zero for any time n = 1, 2, ..., i.e.,

    E{w_n | F_{n−1} ∧ u_n = u(i)} a.s.= 0

(It means that {w_n} is a sequence of martingale-differences.)

(H4) The conditional variances of the observation noises exist and are uniformly bounded, i.e.,

    σ_n²(i) := E{w_n² | F_{n−1} ∧ u_n = u(i)},   max_i sup_n σ_n²(i) =: σ² < ∞
The main properties of the normalization procedure (3.61) are presented in the following lemma.
Lemma 2. Assume that assumptions (H3) and (H4) hold and suppose that any reinforcement scheme (nonprojectional or projectional) generates the sequence {p_n} such that for any i = 1, ..., N

    Σ_{n=1}^∞ p_n(i) a.s.= ∞        (3.64)
Then, the normalized environment response ξ_n (3.61) possesses the following properties:
3.5. Normalization Procedure and Optimization
71
• The number of selections of each action is infinite, i.e., oo
~(~
= u(i)) ° 7 oo
vi = t,...,N
(3.65)
t=l
• The random variable "5,~(i) is asymptotically equal to the value of the function to be optimized for the corresponding point x(i) belonging to the finite set X , i.e., "Sn(i) = f ( x ( i ) ) +oo~(1),
Vi = 1, .... N
(3.66)
• For the selected action u,~ = u(i) at time n, the normalized environment reaction ~n is asymptotically equal to A(i), i.e.,
~ a.__s.A(i) + o~(i) E [0, 1),
u~ = u(i)
Vi = 1 , . . . , N
(3.67)
where A(i) :
f ( x ( i ) ) -- f ( x ( ~ ) ) E [0, 1) max [f(x(i)) -- f(x(a))] + 1
(3.68)
i
• For the optimal action u,~ = u(a), the normalized environment reaction is asymptotically equal to O, i.e.,
~ a.~. o~(1),
un = u(a)
(3.69)
Proof.
1. (3.65) follows directly from assumption (3.64) and the Borel-Cantelli lemma [22]. 2. Let us introduce the following sequence [y, - I ( ~ ( i ) ] x(u, = u(i) ) On(i ) := "cn(i) -- f ( x ( i ) )
= t=l
x ( ~ = u(i)) t=l
w~x(ut = u(i) ) _- t=l
,
i=
x ( ~ t = u(i)) t=l
which leads to the following recurrent form for On(i) On(i ) ~- (1 -- ) ~ n ( i ) ) 0 n _ l
( i ) -{- ) ~ n ( i ) w n
1,...,N
72
Normalization ProcedL~re and Optimization
3.5.
where An(i) := n z ( ~ n = ~(i)) E x(u~ = u(i)) t=l
T a k i n g into account ~ssumptions ( H 3 ) , ( H 4 )
we derive
~~ (1 - An(i)): (On_~(i)) 2 + (;n(i)) 2 a~(i) _< < (1 - An(i)) 2 (en_i(i)) 2 + (An(i)) 2 a 2
(3.70)
It is easy to see t h a t n
n
~ n (i) °----~
OO~
t:l
E~(i)
°<~
(:X)
t=l
In view of this facts a n d Robbins-Siegmund t h e o r e m [23], it follows On(i) ~:2~" 0 a n d hence n~o~
lim En(i) -- f(x(i)) Vi ---- 1, ..., N So, (3.66) is proved. 3. (3.67) follows directly from (3.66) a n d a s s u m p t i o n ( H a ) . 4. (3.69) is t h e consequence of (3.67) and (3.68). •
Corollary
6. If for some reinforcement scheme the following inequality
holds
~p~(i) >_0
T e
o,
(3.71)
t:l
Then, for any small positive e we have a.8.
1
~Sn(i) - f(x(i)) = O~(n]/2_¢_ ~)
(3.72)
and~ as a result~ for large enough n >_ no(w), we obtain 1 ~na's'A(i)-t-ow(nl/2_r_e)E[0,1),
Un:U(i)
Vi=I,...,N (3.73)
2 Un
O(
1
and 1 . E{&lun=~(i)}=A(i)+ o( nW~-~)
(37~)
3.5. Normalization Procedure and Optimization
73
Proof. Let us notice that according to the strong law of large numbers [21]-[22] and lemma A.13a in [14], it follows
! ~ [x(~, n t=l
~(i)) -p,(i)] .... o~(~ -(1/2-~))
and for ;~,,(i) =
x ( u ~ = ~(i))
t=l
X(ut = u(i))
x ( u . = u(i))
n
~ pt(i) + ow(n-O/2-~))
t=l
]
we obtain two estimations for large enough n :> n0(w):
~(~) _>
x(u. = ~(i))
> x ( ~ = u(i)) [1 - ~]
n [1 + oo~(n (1/2-,))] -
n
and
~ ( i ) <_
~(~ ~(i)) ?2 [O ( 5 ) +°w(n-(l/2-s))]
< --
_ x(~. = ~(i)) [1 + o~(1)1 o (n -'+~) _< _< x ( ~ = ~(i)) [1 + el o (n -1+~) From the last inequality it follows also that
/~2(i) ___~~(~Zn : U(i))[1 -t-~] 0 (Tt -2[1--r]) 1. In view of lemma A.14 given in [14], from (3.70) we derive 1
"Sn(i) - f(x(i)) a.=~.o,,(nl/2_~. ~ ) from which (3.72) and (3.73) follow. Averaging (3.70) and using lemma A.1 in [14] we obtain (3.74), from which according to Jensen's inequality and the following relations
¢{ we derive (3.75). Corollary is proved. •
}
1
74
3.5. Normalization Procedure and Optimization
FIGURE 3.1. Feedback connection between automaton and environment with normalized response. In the following we shall be concerned with the behaviour of learning automata with normalized environment response and their use for optimization purposes. The probability distribution Pn will be adjusted using the previous reinforcement schemes, where ~n is constructed according to the procedure (3.61) (see Figure 3.1). This Figure represents the feedback connection between the a u t o m a t o n and the environment for which the response is normalized. Taking into account the optimization problem to be solved, the environment response is constructed as it shown by Figure 3.2. The admissible compact region X is quantified into a set of subsets, and the loss function is calculated using the noisy observations (realizations) of the function to be optimized. The convergence analysis of the behaviour of learning a u t o m a t a using the reinforcement schemes described before and the normalization procedure is given in the next section.
FIGURE 3.2. Multimodal searching technique based on learning automaton.
3.6 Reinforcement Schemes with Normalized Environment Response
The aim of the following subsections is to examine and analyze the properties of S-model learning automata using, respectively, the Bush-Mosteller [18], Shapiro-Narendra [19] and Varshavskii-Vorontsova [20] reinforcement schemes. The environment response is normalized according to the procedure (3.61)-(3.63).
3.6.1
BUSH-MOSTELLER REINFORCEMENT SCHEME WITH NORMALIZATION PROCEDURE
The correction factor was considered constant (\gamma_n = \gamma = const.) in the original version of these reinforcement schemes. In this study, we assume that the correction factor \gamma_n is time-varying and is selected according to the following formula:

\gamma_n = \frac{\gamma}{n+a}, \qquad \gamma \in (0,1), \quad a > \gamma     (3.76)

The initial probabilities are assumed to be strictly positive (p_1(i) > 0, \forall i = 1, ..., N). The Bush-Mosteller scheme [18]-[14] is described by:

p_{n+1} = p_n + \gamma_n\left[e(u_n) - p_n + \hat\xi_n\,\frac{e_N - N e(u_n)}{N-1}\right]     (3.77)

with

\gamma_n = \frac{\gamma}{n+a}, \quad \gamma \in (0,1), \quad a > \gamma, \qquad \hat\xi_n \in [0,1)

e(u_n) = (0, ..., 0, 1, 0, ..., 0)^T \ \text{(the unit entry in the } i\text{-th position when } u_n = u(i)\text{)}, \qquad e_N = (1, ..., 1)^T \in R^N
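As an illustration, a minimal sketch of one Bush-Mosteller update (3.77) in Python; it assumes the normalized response xi_hat is already available, and the function name is illustrative.

def bush_mosteller_step(p, i, xi_hat, gamma_n):
    """One Bush-Mosteller update (3.77) of the probability vector p.

    p       -- current distribution p_n (list of length N, sums to 1)
    i       -- index of the selected action u_n = u(i)
    xi_hat  -- normalized environment response in [0, 1)
    gamma_n -- correction factor gamma / (n + a)
    """
    N = len(p)
    e = [1.0 if k == i else 0.0 for k in range(N)]        # e(u_n)
    return [p[k] + gamma_n * (e[k] - p[k]
                              + xi_hat * (1.0 - N * e[k]) / (N - 1))
            for k in range(N)]

# With p_1(i) > 0 the update keeps p a probability vector:
p = [0.25, 0.25, 0.25, 0.25]
p = bush_mosteller_step(p, i=2, xi_hat=0.4, gamma_n=0.1)
print(sum(p))   # remains 1 (up to rounding)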
Theorem 6. For the Bush-Mosteller scheme (3.77), condition (3.64) is satisfied and, if assumptions (H3) and (H4) hold and the optimal action is single, i.e.,

\min_{i \neq \alpha}\Delta(i) =: \Delta^* > 0     (3.78)

then the automaton selects asymptotically the global optimal point x(\alpha) and the loss function \Phi_n tends to its minimal possible value f(x(\alpha)), with probability one.
Proof. Let us consider the following estimation:

p_{n+1}(i) = p_n(i) + \gamma_n\left[\chi(u_n = u(i)) - p_n(i) + \hat\xi_n\,\frac{1 - N\chi(u_n = u(i))}{N-1}\right] =
= p_n(i)(1-\gamma_n) + \frac{\gamma_n}{N-1}\left[\chi(u_n = u(i))\left(N(1-\hat\xi_n) - 1\right) + \left(1 - \chi(u_n = u(i))\right)\hat\xi_n\right] \ge
\ge p_n(i)(1-\gamma_n) \ge \dots \ge p_1(i)\prod_{t=1}^{n}(1-\gamma_t)     (3.79)

From Lemma A.4 [14], it follows that

\prod_{t=1}^{n}(1-\gamma_t) \ge \left(\frac{a}{n+a}\right)^{\gamma} \quad \text{for } \gamma \in (0,1),\ a > \gamma \qquad \left(\text{and } \prod_{t=1}^{n}(1-\gamma_t) \ge \frac{a}{n+a} \text{ for } \gamma = 1,\ a > 0\right)     (3.80)

Substituting (3.80) into (3.79) leads to the desired result (3.64). Notice also that from (3.80) and (3.79) it follows that in (3.72) \tau = \gamma \in (0, 1/2), and hence in (3.73) we have

\bar{s}_n(i) - f(x(i)) \overset{a.s.}{=} o_\omega\!\left(\frac{1}{n^{1/2-\gamma-\varepsilon}}\right)     (3.81)

Performing calculations similar to (3.81) and (3.25), and taking into account the reinforcement scheme (3.77), we obtain the following inequality for s_n, which is similar to (3.29):

s_n \overset{a.s.}{\le} n^{-2(1-\gamma)}\,C_1\,(1+o(1))\,C_n\,W_n     (3.82)

where C_1 is a finite positive constant whose explicit expression contains the factor (a-\gamma)^{-2\gamma}(N-1)^{-2}, and

C_n := \sum_{i=1}^{N}\left(E\{\hat\xi_n \mid u_n = u(i)\} - \Delta(i)\right)^2 = o\!\left(n^{-1+2\gamma}\right)     (3.83)

Substituting (3.83) into (3.82) we get:
s_n \overset{a.s.}{\le} o\!\left(n^{-3+4\gamma}\right)W_n

Calculating the first term in (3.24), we substitute the Bush-Mosteller update (3.77) into E\{p_{n+1}(\alpha)\mid\mathcal{F}_{n-1}\wedge u_n = u(i)\}, use the conditional moment estimate (3.81) (with \tau = \gamma) and the lower bound p_n(\alpha) \ge p_1(\alpha)\prod_{t=1}^{n}(1-\gamma_t) following from (3.79)-(3.80), which in particular (taking into account (3.22), (3.23)) yields

\frac{\gamma_n}{p_n(\alpha)} \le \frac{c_2\,(1+o(1))}{n^{1-\gamma}}

with a finite positive constant c_2 depending on \gamma, a, N and p_1(\alpha). As a result, for n \ge n_0(\omega), n_0(\omega) < \infty,

\sum_{i=1}^{N}\left(E^{-1}\{p_{n+1}(\alpha)\mid\mathcal{F}_{n-1}\wedge u_n = u(i)\} - 1\right)p_n(i) \overset{a.s.}{\le} W_n\left(1 - \gamma_n\frac{N}{N-1}\Delta^*\right)\left(1 + o\!\left(n^{-1/2+2\gamma}\right)\right) + o\!\left(n^{-3/2+2\gamma}\right) =
= W_n\left(1 - \frac{C_2\,[1+o(1)]}{n^{1-\gamma}}\right) + o\!\left(n^{-3/2+2\gamma}\right)     (3.84)

where C_2 is a finite positive constant depending on \Delta^*, \gamma, a, N and p_1(\alpha). Combining (3.81), (3.82) and (3.84) we obtain

E\{W_{n+1}\mid\mathcal{F}_{n-1}\} \overset{a.s.}{\le} W_n\left(1 - \frac{C_2\,[1+o(1)]}{n^{1-\gamma}} + o\!\left(n^{-3+4\gamma}\right)\right) + o\!\left(n^{-3/2+2\gamma}\right) = W_n\left(1 - \frac{C_2\,[1+o(1)]}{n^{1-\gamma}}\right) + o\!\left(n^{-3/2+2\gamma}\right)     (3.85)
In view of the Robbins-Siegmund theorem [23], it follows that

W_n \overset{a.s.}{\to} 0, \qquad p_n(\alpha) \overset{a.s.}{\to} 1

Notice now that the hypotheses (H1) and (H2) follow from (H3) and (H4), and hence, according to the third theorem of this chapter, we obtain

\Phi_n \overset{a.s.}{\to} f(x(\alpha))

Theorem is proved. ■
This theorem shows that a learning automaton using the Bush-Mosteller reinforcement scheme with the normalization procedure described above asymptotically selects the optimal action, i.e., the optimal point x(\alpha) is asymptotically selected. We still have to show how to choose the design parameters of the optimization algorithm so as to increase the speed of convergence. The next corollary gives the estimation of the rate of optimization.

Corollary 7. Under the assumptions of this theorem, the convergence rate is given by

W_n \overset{a.s.}{=} o\!\left(\frac{1}{n^{\nu}}\right)

where the order \nu > 0 satisfies the condition 3/2 - 2\gamma - \nu > 1.
Proof. In view of Lemma A.14 in [14], applied with

\alpha_n := \frac{C_2\,[1+o(1)]}{n^{1-\gamma}}, \qquad \beta_n := o\!\left(n^{-3/2+2\gamma}\right), \qquad \mu_n := n^{\nu}, \qquad u_n := W_n

from (3.85) we obtain

\frac{3}{2} - 2\gamma - \nu > 1

Corollary is proved. ■
The next subsection deals with the analysis of the Shapiro-Narendra reinforcement scheme using the normalization procedure.
3.6.2
SHAPIRO-NARENDRA REINFORCEMENT SCHEME WITH NORMALIZATION PROCEDURE
The Shapiro-Narendra scheme [19]-[14] is described by:

p_{n+1} = p_n + \gamma_n\,(1 - \hat\xi_n)\left[e(u_n) - p_n\right], \qquad \hat\xi_n \in [0,1]     (3.86)

We shall analyse its behaviour when the environment responses \xi_n are constructed on the basis of the realizations of the function to be optimized (Figure 3.2) and the normalization procedure (Figure 3.1).
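For comparison with (3.77), a minimal sketch of one update of the form (3.86); as before, xi_hat denotes the normalized response and the function name is illustrative.

def shapiro_narendra_step(p, i, xi_hat, gamma_n):
    """One Shapiro-Narendra update (3.86): move p towards e(u_n) by a step
    proportional to (1 - xi_hat), i.e. the smaller the normalized loss,
    the stronger the reinforcement of the selected action."""
    step = gamma_n * (1.0 - xi_hat)
    return [(1.0 - step) * pk + (step if k == i else 0.0)
            for k, pk in enumerate(p)]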
Theorem 7. For the Shapiro-Narendra scheme [19] (3.86), assume that the optimal point x(\alpha) is single and suppose that assumptions (H3), (H4) hold. In addition, suppose that the correction factor satisfies the following conditions:

1.
\sum_{n=1}^{\infty}\gamma_n\prod_{t=1}^{n}(1-\gamma_t) = \infty, \qquad \sum_{n=1}^{\infty}\left(n^{-1+2\tau} + \gamma_n\right)\gamma_n\prod_{t=1}^{n}(1-\gamma_t)^{-1} < \infty     (3.87)

2.
\lim_{n\to\infty}\gamma_n\prod_{t=1}^{n}(1-\gamma_t)^{-1} =: c < p_1(\alpha)\,\Delta^*     (3.88)

3.
\frac{1}{n}\sum_{t=1}^{n}\prod_{s=1}^{t}(1-\gamma_s) \ge O\!\left(n^{-\tau}\right), \qquad \tau \in \left(0, \tfrac{1}{2}\right)     (3.89)

Then this reinforcement scheme selects asymptotically the optimal point x(\alpha), i.e.,

p_n(\alpha) \overset{a.s.}{\underset{n\to\infty}{\to}} 1

and the loss function \Phi_n tends to its minimal possible value f(x(\alpha)), with probability one.

Proof. The proof is similar in structure to the proof of the previous theorem. Let us estimate the lower bounds of the probabilities p_{n+1}(i):
p_{n+1}(i) = p_n(i) + \gamma_n(1-\hat\xi_n)\left[\chi(u_n = u(i)) - p_n(i)\right] =
= p_n(i)\left[1 - \gamma_n(1-\hat\xi_n)\right] + \gamma_n(1-\hat\xi_n)\,\chi(u_n = u(i)) \ge     (3.90)
\ge p_n(i)(1-\gamma_n) \ge \dots \ge p_1(i)\prod_{t=1}^{n}(1-\gamma_t)

From (3.87) and (3.90) it follows that

\sum_{n=1}^{\infty}p_n(i) = \infty, \qquad \forall i = 1, ..., N

As a result, according to the Borel-Cantelli lemma [22], we obtain

\sum_{n=1}^{\infty}\chi(u_n = u(i)) = \infty, \qquad \forall i = 1, ..., N     (3.91)

Let us consider again the following Lyapunov function

W_n := \frac{1 - p_n(\alpha)}{p_n(\alpha)}

Notice that in view of assumption (3.89), the relations (3.72) and (3.75) are fulfilled, i.e.,

\bar{s}_n(i) - f(x(i)) \overset{a.s.}{=} o_\omega\!\left(\frac{1}{n^{1/2-\tau-\varepsilon}}\right)     (3.92)

Taking into account the Shapiro-Narendra scheme (3.86) and (3.92), it follows:
Substituting the update (3.86) into W_{n+1} = (1 - p_{n+1}(\alpha))/p_{n+1}(\alpha), expanding E\{W_{n+1}\mid\mathcal{F}_{n-1}\wedge u_n = u(i)\} separately for i \neq \alpha and for i = \alpha, and using the conditional moment estimates (3.92), we arrive, after collecting terms, at a bound in which the probability p_n(\alpha) enters only through the ratio \gamma_n\,p_n^{-1}(\alpha). Inequality (3.90) gives

\gamma_n\,p_n^{-1}(\alpha) \le \gamma_n\,p_1^{-1}(\alpha)\prod_{t=1}^{n}(1-\gamma_t)^{-1}     (3.93)

and hence

E\{W_{n+1}\mid\mathcal{F}_n\} \overset{a.s.}{\le} W_n\left[1 - \gamma_n\left(\Delta^* - \gamma_n\,p_n^{-1}(\alpha)\right) + o\!\left(\gamma_n\,p_n^{-1}(\alpha)\right)\right] + o\!\left(\frac{\gamma_n}{n^{1-2\tau}}\right)\prod_{t=1}^{n}(1-\gamma_t)^{-1} + O\!\left(\gamma_n^{2}\prod_{t=1}^{n}(1-\gamma_t)^{-2}\right)
Using again inequality (3.90) and the assumptions of this theorem we derive:

E\{W_{n+1}\mid\mathcal{F}_n\} \overset{a.s.}{\le} W_n\left[1 - \gamma_n\left(\Delta^* - p_1^{-1}(\alpha)\,c + o(1)\right) + O\!\left(\gamma_n^2\right)\right] + o\!\left(\frac{\gamma_n}{n^{1-2\tau}}\right)\prod_{t=1}^{n}(1-\gamma_t)^{-1} + O\!\left(\gamma_n^{2}\prod_{t=1}^{n}(1-\gamma_t)^{-2}\right)     (3.94)

Taking into account assumption (3.88) of this theorem (\Delta^* - p_1^{-1}(\alpha)\,c > 0), assumption (3.87) and the Robbins-Siegmund theorem [23], we obtain

W_n \overset{a.s.}{\to} 0, \qquad p_n(\alpha) \overset{a.s.}{\to} 1

Theorem is proved. ■
The following corollary deals with the constraints associated with the correction factor \gamma_n.

Corollary 8. For the correction factor (3.76)

\gamma_n = \frac{\gamma}{n+a}, \qquad \gamma \in (0,1), \quad a > \gamma

assumptions (3.87), (3.88) and (3.89) will be satisfied if

\lim_{n\to\infty}\gamma_n\prod_{t=1}^{n}(1-\gamma_t)^{-1} = c = 0, \qquad \tau = \gamma

and:
1. the rate of optimization will be given by

n^{\nu}\,W_n \overset{a.s.}{\to} 0

where

\nu < \min\left\{1 - 3\gamma,\ \gamma\Delta^*\right\} := \nu(\gamma) < \nu^*     (3.95)
2. the maximum optimization rate \nu^* is equal to

\max_{\gamma}\nu(\gamma) = \nu^* = \frac{\Delta^*}{\Delta^* + 3}     (3.96)

and is reached for

\gamma = \gamma^* = \frac{1}{\Delta^* + 3}     (3.97)

Proof follows directly from inequality (3.94), which can be rewritten in recurrent form. Applying then Lemma A.14 in [14] for

u_n := W_n, \qquad \alpha_n := \gamma_n\left(\Delta^* + o(1)\right), \qquad \beta_n := o\!\left(\frac{1}{n^{2-3\gamma}}\right), \qquad \mu_n := n^{\nu}

we obtain the desired result. Corollary is proved. ■
These statements show that a learning automaton using the Shapiro-Narendra reinforcement scheme with the normalization procedure described above has an asymptotically optimal behaviour, and the global optimization objective is achieved with an \varepsilon-accuracy. The Varshavskii-Vorontsova reinforcement scheme will be considered in the following.
3.6.3
VARSHAVSKII-VORONTSOVA REINFORCEMENT SCHEME WITH NORMALIZATION PROCEDURE
The Varshavskii-Vorontsova scheme [20]-[14] is described by:

p_{n+1} = p_n + \gamma_n\,p_n^{T}e(u_n)\,(1 - 2\hat\xi_n)\left[e(u_n) - p_n\right]     (3.99)
This scheme belongs to the class of nonlinear reinforcement schemes. We shall analyse its ability to solve stochastic optimization problems on discrete sets. The following theorem covers the specific properties of the Varshavskii-Vorontsova scheme; a short sketch of the update itself is given below.
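A minimal sketch of the nonlinear update (3.99), with illustrative names; note that, unlike (3.77) and (3.86), the step size is modulated by the probability p_n(u_n) of the selected action.

def varshavskii_vorontsova_step(p, i, xi_hat, gamma_n):
    """One Varshavskii-Vorontsova update (3.99).  The effective step is
    gamma_n * p_n(u_n) * (1 - 2*xi_hat): it reinforces the selected action
    when the normalized loss xi_hat < 1/2 and penalizes it otherwise."""
    step = gamma_n * p[i] * (1.0 - 2.0 * xi_hat)
    return [pk + step * ((1.0 if k == i else 0.0) - pk)
            for k, pk in enumerate(p)]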
Theorem 8. For the Varshavskii-Vorontsova scheme (3.99), assume that the optimal point x(\alpha) is single and suppose that assumptions (H3), (H4) hold. In addition, suppose that the correction factor satisfies the following conditions:

1.
\sum_{n=1}^{\infty}\gamma_n\prod_{t=1}^{n}(1-\gamma_t) = \infty     (3.100)

2.
b := \left(\sum_{i\neq\alpha}\left[1 - 2\Delta(i)\right]^{-1}\right)^{-1} < 0, \qquad \sum_{n=1}^{\infty}\left(n^{-1+2\tau} + \gamma_n\right)\gamma_n\prod_{t=1}^{n}(1-\gamma_t)^{-1} < \infty     (3.101)

3.
\frac{1}{n}\sum_{t=1}^{n}\prod_{s=1}^{t}(1-\gamma_s) \ge O\!\left(n^{-\tau}\right), \qquad \tau \in \left(0, \tfrac{1}{2}\right)     (3.102)
Then the optimal point x(\alpha) is asymptotically selected, i.e.,

p_n(\alpha) \overset{a.s.}{\underset{n\to\infty}{\to}} 1

and the loss function \Phi_n tends to its minimal possible value f(x(\alpha)), with probability one.

Proof. Let us estimate the lower bounds of the probabilities p_{n+1}(i):

p_{n+1}(i) = p_n(i) + \gamma_n\,p_n^{T}e(u_n)\,(1-2\hat\xi_n)\left[\chi(u_n = u(i)) - p_n(i)\right] \ge
\ge p_n(i)\left[1 - \gamma_n\,p_n^{T}e(u_n)\,(1-2\hat\xi_n)\right] \ge p_n(i)\left[1 - \gamma_n\,p_n^{T}e(u_n)\right] \ge
\ge p_n(i)(1-\gamma_n) \ge \dots \ge p_1(i)\prod_{t=1}^{n}(1-\gamma_t)     (3.103)

From (3.100) and (3.103) it follows that

\sum_{n=1}^{\infty}p_n(i) = \infty, \qquad \forall i = 1, ..., N
As a result, and according to the Borel-Cantelli lemma [22], we obtain

\sum_{n=1}^{\infty}\chi(u_n = u(i)) = \infty, \qquad \forall i = 1, ..., N     (3.104)

Let us consider again the Lyapunov function

W_n := \frac{1 - p_n(\alpha)}{p_n(\alpha)}

Notice that, in view of assumption (3.102), relations (3.72) and (3.75) are fulfilled, i.e.,

\bar{s}_n(i) - f(x(i)) \overset{a.s.}{=} o_\omega\!\left(\frac{1}{n^{1/2-\tau-\varepsilon}}\right), \qquad E\{\hat\xi_n \mid u_n = u(i)\} = \Delta(i) + o\!\left(\frac{1}{n^{1/2-2\tau}}\right)     (3.105)
Taking into account the Varshavskii-Vorontsova scheme (3.99) and (3.105), it follows:

E\{W_{n+1}\mid\mathcal{F}_n\} \overset{a.s.}{=} \sum_{i=1}^{N}E\{W_{n+1}\mid\mathcal{F}_{n-1}\wedge u_n = u(i)\}\,p_n(i) =

= \sum_{i\neq\alpha}E\left\{\frac{1}{p_n(\alpha)\left[1 - \gamma_n\,p_n(i)(1-2\hat\xi_n)\right]} - 1 \,\Big|\, \mathcal{F}_{n-1}\wedge u_n = u(i)\right\}p_n(i) +
+ E\left\{\frac{1}{p_n(\alpha) + \gamma_n\,p_n(\alpha)(1-2\hat\xi_n)\left(1-p_n(\alpha)\right)} - 1 \,\Big|\, \mathcal{F}_{n-1}\wedge u_n = u(\alpha)\right\}p_n(\alpha)

Using (3.105) and expanding up to terms of order O(\gamma_n^2), we obtain

E\{W_{n+1}\mid\mathcal{F}_n\} \overset{a.s.}{\le} W_n\left[1 - \gamma_n\,p_n(\alpha)(1+\gamma_n)\right] + \gamma_n\sum_{i\neq\alpha}p_n^2(i)\left[1 - 2\Delta(i)\right] + o\!\left(\frac{\gamma_n}{n^{1-2\tau}}\,p_n^{-1}(\alpha)\right) + O\!\left(\gamma_n^2\,p_n(\alpha)\right)     (3.106)
a.s. for n \ge n_0(\omega), n_0(\omega) < \infty. Let us now maximize the second term in (3.106) with respect to the components p_n(i) (i \neq \alpha) under the constraint

\sum_{i\neq\alpha}p_n(i) = 1 - p_n(\alpha)     (3.107)

To do that, let us introduce the following variables

x_i := p_n(i), \qquad a_i := \gamma_n\left[1 - 2\Delta(i)\right], \qquad i \neq \alpha

It follows that

\gamma_n\sum_{i\neq\alpha}p_n^2(i)\left[1 - 2\Delta(i)\right] = \sum_{i\neq\alpha}a_i x_i^2 := F(x)     (3.108)

To maximize the function F(x) under the constraint (3.107), let us introduce the following Lagrange function:

L(x, \lambda) := F(x) - \lambda\left(\sum_{i\neq\alpha}x_i - (1 - x_\alpha)\right)

The optimal solution (x^*, \lambda^*) satisfies the following optimality conditions:

\frac{\partial}{\partial x_i}L(x^*, \lambda^*) = 2a_i x_i^* - \lambda^* = 0, \qquad \forall i \neq \alpha

\frac{\partial}{\partial\lambda}L(x^*, \lambda^*) = \sum_{i\neq\alpha}x_i^* - (1 - x_\alpha) = 0

From these optimality conditions, it can be seen that

x_i^* = \frac{\lambda^*}{2a_i}, \qquad \sum_{i\neq\alpha}x_i^* = \lambda^*\sum_{i\neq\alpha}(2a_i)^{-1} = 1 - x_\alpha, \qquad \lambda^* = \frac{1 - x_\alpha}{\sum_{i\neq\alpha}(2a_i)^{-1}}

Hence

F(x) \le F(x^*) = \frac{(1 - x_\alpha)^2}{\sum_{i\neq\alpha}a_i^{-1}}
From this inequality we derive

\gamma_n\sum_{i\neq\alpha}p_n^2(i)\left[1 - 2\Delta(i)\right] \le \gamma_n\,b\,(1 - p_n(\alpha))^2 = -\,\gamma_n\,|b|\,(1 - p_n(\alpha))^2     (3.109)

where

b := \left(\sum_{i\neq\alpha}\left[1 - 2\Delta(i)\right]^{-1}\right)^{-1} < 0     (3.110)

Substituting (3.109) into (3.106), we obtain

E\{W_{n+1}\mid\mathcal{F}_n\} \overset{a.s.}{\le} W_n\left[1 - \gamma_n\,p_n(\alpha)(1+\gamma_n)\right] - \gamma_n\,|b|\,(1 - p_n(\alpha))^2 + o\!\left(\gamma_n\prod_{t=1}^{n}(1-\gamma_t)^{-1}\right) + O\!\left(\gamma_n^{2}\prod_{t=1}^{n}(1-\gamma_t)^{-2}\right)     (3.111)

Taking into account the conditions (3.100)-(3.102), and in view of the Robbins-Siegmund theorem [23], we obtain

W_n \overset{a.s.}{\to} 0, \qquad p_n(\alpha) \overset{a.s.}{\to} 1

Theorem is proved. ■
Corollary 9. For example, condition (3.101) will be satisfied, as far as the environment is concerned, for the class of environments for which

\Delta(i) > \frac{1}{2}, \qquad \forall i \neq \alpha

(so that b < 0). This means that the minimal value of the optimized function must be sufficiently different from its values at the other points. The following corollary gives an example of the class of correction factors (gain sequences) satisfying condition (3.101).
Corollary 10. If we select

\gamma_n = \frac{\gamma}{n+a}, \qquad \gamma < a, \quad \gamma \in (0,1)

then the rate of optimization can be estimated as follows:

n^{\nu}\,W_n \overset{a.s.}{\underset{n\to\infty}{\to}} 0

where the order of optimization satisfies the condition

0 < \nu < \nu(\gamma) \le |b|^{-1}(N-1)^{-1}

and the best order of convergence is equal to

\nu(\gamma^*) = \nu^* = |b|^{-1}(N-1)^{-1}

and is reached for

\gamma = \gamma^* = \frac{1}{3}\left[1 - |b|^{-1}(N-1)^{-1}\right]

Proof follows directly from inequality (3.111) and Lemma A.14 in [14], applied with

u_n := W_n, \qquad \mu_n := n^{\nu}, \qquad \beta_n := \max\left\{o\!\left(\gamma_n\prod_{t=1}^{n}(1-\gamma_t)^{-1}\right),\ O\!\left(\gamma_n^{2}\prod_{t=1}^{n}(1-\gamma_t)^{-2}\right)\right\} = o\!\left(n^{-(2-3\gamma)}\right)

We obtain \nu < 1 - 3\gamma, and hence 0 < \nu < \nu(\gamma). Corollary is proved. ■
The point of this theorem is that a learning automaton using the Varshavskii-Vorontsova reinforcement scheme with the normalization procedure described above has an asymptotically optimal behaviour, i.e., this reinforcement scheme solves the stochastic optimization problem on finite sets if condition (3.101) is fulfilled. It is interesting to note that the previous theorems do not require any conditions (convexity, unimodality, differentiability, etc.) on the function to be optimized, as do most optimization techniques. In this part we have shown that learning automata with a fixed number of actions operating in an S-model environment and using a normalization procedure can successfully solve the stochastic optimization problem on finite sets and, as a result, can solve with an \varepsilon-accuracy the stochastic optimization problems related to nonconvex and nondifferentiable functions given on any compact sets. This approach has already been used for the adaptive selection of the optimal order of linear regression models [24]. The Bush-Mosteller reinforcement scheme with normalized automaton input has been used to adjust the probability distribution. It has been shown that this approach, while requiring from the user no special skills in stochastic processes, can be a practical tool that opens the way to new applications in process modelling, control and optimization. Another way to solve these stochastic optimization problems is to consider the behaviour of an S-model learning automaton with a changing number of actions. This approach will be briefly described in the next section.
3.7 Learning Stochastic Automaton with Changing Number of Actions
An extensive literature has been dedicated to the behaviour of learning automata with a fixed action set [10]. The behaviour of automata where the number of actions available at each time is time-varying has been studied by Thathachar and coauthors [15]. Convergence and convergence rate results have been stated by Poznyak and Najim [17] for learning automata when the action set is time-varying and the input received is continuous. The set of all actions of the automaton will be denoted by V = \{u(1), u(2), u(3), ..., u(N)\}, 2 \le N < \infty. This set will be partitioned into W subsets

\left(W = \sum_{k=2}^{N}C_N^k = 2^N - N - 1\right)

where V(j) represents the j-th action subset. The index j is assigned by ordering these subsets in a lexicographic manner [15], beginning with the double-action subsets, etc., and ending with the complete set of actions V, as illustrated by the sketch below.
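A short sketch of this enumeration, assuming the lexicographic ordering described above; the set V and the printed indices are illustrative.

from itertools import combinations

# Enumerating the W = 2**N - N - 1 action subsets of size >= 2,
# smallest subsets first, ending with the complete set V.
N = 4
V = [f"u({i})" for i in range(1, N + 1)]

subsets = [list(c) for k in range(2, N + 1) for c in combinations(V, k)]

print(len(subsets))          # 2**N - N - 1 = 11 for N = 4
for j, Vj in enumerate(subsets, start=1):
    print(j, Vj)             # V(j), j = 1, ..., W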
The probability distribution Q_n, defined over all possible action subsets, is

Q_n = \left\{q_n(1), q_n(2), ..., q_n(W)\right\}^{T}

where

q_n(i) := \mathrm{prob}\left[V_n = V(i)\right], \qquad V_n \text{ is the subset selected at time } n, \qquad \sum_{i=1}^{W}q_n(i) = 1, \quad \forall n.
This probability distribution is assumed to be a priori known. The learning automaton considered in this study operates as follows. Let V_n be the action subset selected at time n. The probability distribution p_n is adjusted according to the following stages [17]:

1) Calculate

K_n(j) = \sum_{i:\,u(i)\in V(j)}p_n(i) > 0, \qquad V_n = V(j)

2) Scale the probabilities of the actions in the selected subset:

p_n(i/j) = p_n(i)/K_n(j) \qquad \text{for all } i \text{ such that } u(i) \in V_n

3) Select randomly an action u_n from the subset V_n, according to the scaled probability distribution p_n(\cdot/j).

4) Use the Bush-Mosteller scheme to adapt p_n(\cdot/j). The probabilities p_n(i) such that u(i) \notin V_n remain unchanged.

5) Rescale the action probabilities of the actions of the selected subset. For all i such that u(i) \in V_n, do
P~+I (i/j) -= p~(i/j) + "~,~(eNO)(u,~) -- p~(i/j)+
[~N0) _ N(j)~N(j)(u~)] / )
Pn+I(/)) = P,~(z/2), *
i
'
*
"
'
i• j
where N(j) denotes the number of actions of the subset Vi,
2 <_ g ( j ) <_N.
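A minimal sketch of stages 1)-5), assuming the Bush-Mosteller update of Section 3.6.1 inside the selected subset; the names, the example subset and the parameter values are illustrative.

def changing_actions_step(p, subset, i_sel, xi_hat, gamma_n):
    """One adaptation cycle for an automaton with a changing action set:
    scale the probabilities of the selected subset V_n, apply a
    Bush-Mosteller update inside the subset, then rescale back.
    `subset` lists the action indices forming V_n, i_sel is u_n."""
    K = sum(p[i] for i in subset)                       # K_n(j)
    q = {i: p[i] / K for i in subset}                   # scaled p_n(i/j)
    Nj = len(subset)
    for i in subset:                                    # Bush-Mosteller inside V_n
        e_i = 1.0 if i == i_sel else 0.0
        q[i] += gamma_n * (e_i - q[i] + xi_hat * (1.0 - Nj * e_i) / (Nj - 1))
    new_p = list(p)
    for i in subset:                                    # rescale: p_{n+1}(i) = q(i) * K_n(j)
        new_p[i] = q[i] * K
    return new_p                                        # actions outside V_n unchanged

p = [0.1, 0.2, 0.3, 0.4]
p = changing_actions_step(p, subset=[1, 3], i_sel=3, xi_hat=0.2, gamma_n=0.1)
print(sum(p))   # total probability is preserved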
Under some assumptions, the asymptotic properties of this learning system have been stated in [16]. It has been shown, among other things, that

1. the learning automaton generates asymptotically an optimal pure strategy (0, ..., 0, 1, 0, ..., 0) \in R^N, i.e.,

p_n(\alpha) \overset{a.s.}{\underset{n\to\infty}{\to}} 1, \qquad f(x_\alpha) = \min_{j}f(x_j) < \min_{j\neq\alpha}f(x_j)

2. for N = 3, the "order of learning" for automata with a changing number of actions is equal to the "order of learning" of automata with a constant number of actions;

3. the desired accuracy \varepsilon is achieved by a single learning automaton which has a number of control actions (outputs) N not less than \max_i L_i.

Hence, the change of the action set from instant to instant slows down the learning process. Nevertheless, the behaviour of automata with a changing number of actions is useful for solving some engineering problems (optimization, etc.). We have presented iterative methods for solving stochastic unconstrained optimization problems using learning automata with fixed and changing numbers of actions and several commonly used reinforcement schemes. Results concerning the convergence and the estimation of the convergence rate have been stated. We shall illustrate by means of a numerical example how the optimization techniques based on learning automata with fixed and time-varying numbers of actions may be used to optimize a multimodal function.
3.8
Simulation results
The objective of this section is to give a glimpse of the power of the optimization approaches presented in the previous sections. The learning automata with continuous input (S-model environment) and with or without
variable subsets of actions (changing number of actions) described in the preceding sections were used to optimize the following multimodal function

f(x) = 0.3 + \frac{1}{\,\cdot\,}\left[\sum_{i=2}\cos\!\left((i - \cdot)\,x\right) + \sum_{i=2}\sin\!\left((i - \cdot)\,x\right)\right], \qquad x \in [108.5,\ 198.5]

which is depicted in Figure 3.3.
FIGURE 3.3. Multimodal function.
The next subsection deals with the use of learning automata with a fixed number of actions for multimodal function optimization.
3.8.1
LEARNING AUTOMATON WITH FIXED NUMBER OF ACTIONS
A learning automaton is used to optimize the multimodal function shown in Figure 3.3. The environment response corresponds to the realization of the function to be optimized. The action selected by the automaton
plays the role of the environment input, the argument x. At each time n (iteration), the optimization algorithm, based on a learning automaton with a fixed number of actions, performs the following steps:
Step 1. Choice of an action u(i) on the basis of the probability distribution p_n. The technique used by the algorithm to select one action u(i) among the N actions is based on the generation of a uniformly distributed random variable (any machine routine, e.g., RANDU, can be used to generate a uniformly distributed random variable).
Step 2. Compute the S-model environment response.
Step 3. Use this response to adjust the probability distribution according to the adopted reinforcement scheme.
Step 4. Return to step 1.
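The following self-contained sketch ties these steps together for the experiment of this subsection, assuming the Bush-Mosteller rule (3.77), a constant correction factor 0.35 and a min-max normalization; the objective below is only a stand-in for the multimodal function of Figure 3.3.

import random

N, gamma = 10, 0.35
x_grid = [108.5 + k * 90.0 / (N - 1) for k in range(N)]   # 10-point grid on [108.5, 198.5]
p = [1.0 / N] * N                                         # uniform initial probabilities

def f_noisy(x):
    """Placeholder noisy objective with minimum near x = 168.5."""
    return 0.3 + 0.05 * abs(x - 168.5) + random.gauss(0.0, 0.02)

y_lo, y_hi = float("inf"), float("-inf")
for n in range(1, 1500):
    # Step 1: select an action by inverse-CDF sampling from p_n
    r, acc, i = random.random(), 0.0, N - 1
    for k, pk in enumerate(p):
        acc += pk
        if r <= acc:
            i = k
            break
    # Step 2: observe a noisy realization and normalize it into [0, 1)
    y = f_noisy(x_grid[i])
    y_lo, y_hi = min(y_lo, y), max(y_hi, y)
    xi = 0.0 if y_hi == y_lo else (y - y_lo) / (y_hi - y_lo + 1e-12)
    # Step 3: Bush-Mosteller update (3.77)
    p = [p[k] + gamma * ((1.0 if k == i else 0.0) - p[k]
                         + xi * (1.0 - N * (1.0 if k == i else 0.0)) / (N - 1))
         for k in range(N)]

print(x_grid[p.index(max(p))])   # action with the highest final probability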
The interval [108.5, 198.5] was partitioned uniformly into 10 values. No prior information was utilized in the initialization of the probabilities, i.e.,

p_0 = \left(\frac{1}{10}, ..., \frac{1}{10}\right)^{T}

The learning automaton started with a purely uniform distribution over the actions. The correction factor \gamma was chosen to be equal to 0.35. The evolution of the probabilities (for the Bush-Mosteller (B-M), Shapiro-Narendra (S-N) and Varshavskii-Vorontsova (V-V) reinforcement schemes) associated with the optimal action (solution) is depicted in Figure 3.4. These probabilities tend to one. Figure 3.5 represents the evolution of the loss functions associated with the previous reinforcement schemes. In practice, it is impossible to obtain perfect measurements. To make the simulations more realistic and to test the performance of the optimization algorithm in the case of noisy observations, a disturbance of zero mean and finite variance equal to 0.02 has been added to the realizations (observations) of the optimized function f(x). The probabilities associated with the optimal solution (x* = 168.5) and related to the different reinforcement schemes are depicted in Figure 3.6. Figure 3.7 shows the evolution of the loss functions associated respectively with the Bush-Mosteller (B-M), Shapiro-Narendra (S-N) and Varshavskii-Vorontsova (V-V) reinforcement schemes. The results are similar when no disturbance is present in the system. The simulation results show that the reinforcement schemes (Bush-Mosteller, Shapiro-Narendra and Varshavskii-Vorontsova) converge to the optimal action. This behaviour was expected from the theoretical results stated in this chapter. The above results can be explained by the adaptive structure of
FIGURE 3.4. Evolution of the probabilities associated with the optimal action.
the algorithm and by the fact that at each time (iteration) an action is randomly selected on the basis of the environment response and of the probability distribution p_n. The next subsection presents some numerical simulations, from which we can verify the viability of the design and analysis concerning the use of learning automata with a changing number of actions, given in the second part of this chapter.
FIGURE 3.5. Evolution of the loss functions.
3.8.2
LEARNING AUTOMATON WITH CHANGING NUMBER OF ACTIONS
At every time step, the optimization algorithm based on learning automata with a changing number of actions performs the following steps:
Step 1. Choice of a subset V(j) on the basis of the probability distribution Q_n. The technique used by the algorithm to select one subset V(j) among the W subsets is based on the generation of a uniformly distributed random variable (any machine routine, e.g., RANDU, can be used to generate a uniformly distributed random variable).
Step 2. Choice of an action u(i) on the basis of the probability distribution p_n, using the same procedure as in step 1.
Step 3. Compute the S-model environment response.
FIGURE 3.6. Evolution of the probabilities associated with the optimal action.
Step 4. Use this response to adjust the probability distribution according to the given reinforcement scheme.
Step 5. Return to step 1.
The interval [108.5, 198.5] was partitioned uniformly into 10 values (it is up to the designer to specify the number of actions of the automaton). The number of actions can be selected according to the desired accuracy. No prior information was utilized in the initialization of the probabilities. The learning automaton started with a purely uniform distribution over the actions, i.e.,

p_0 = \left(\frac{1}{10}, ..., \frac{1}{10}\right)^{T}

and the distribution over the action subsets was likewise taken uniform.
FIGURE 3.7. Evolution of the loss functions.
The theory of learning systems assumes that very little a priori information is known about the random environment (the function to be optimized). Nevertheless, any available information can be introduced through the initial probability distribution or in the design of the environment response (normalization, etc.). The value of the correction factor \gamma was determined empirically; the value \gamma = 0.35 was chosen after some experimentation. Several experiments have been carried out. Figure 3.8 shows the evolution of the probability associated with the optimal state selected by the automaton. This optimal state corresponds to the value of x (x* = 168.5) which minimizes the function f(x) in the interval [108.5, 198.5]. The evolution of the loss function is depicted in Figure 3.9. To test the performance of the optimization algorithm in the case of noisy observations (although data from real systems are usually passed through a linear filter to remove disturbance-related components), a zero-mean disturbance was added to the function f(x). The dispersion of this disturbance is equal to 0.02. Figures 3.10 and 3.11 illustrate the performance of this optimization algorithm. Figure 3.10 represents the evolution of the probability associated
FIGURE 3.8. Evolution of the probability associated with the optimal action.
with the optimal action (solution). After approximately 300 iterations, this probability tends to one. The loss function, which tends to zero, is depicted in Figure 3.11. As a result of the effect of the noise (perturbation), the time needed for obtaining the optimal solution increases. The decay of the loss function is clearly revealed in Figure 3.11. The nature of convergence of the optimization algorithm is clearly reflected by Figures 3.4-3.11. However, not only is the convergence of the optimization algorithm important, but the convergence speed is also essential. It depends on the number of operations performed by the algorithm during an iteration as well as on the number of iterations needed for convergence. The studied reinforcement schemes behave as expected. They need few programming steps and little storage memory. In summary, to verify the results obtained by the theoretical analysis, a number of simulations have been carried out with a multimodal function. Notice that the suitable choice of the number of actions N is an important issue and depends on the desired accuracy. In order to overcome the high dimensionality of the number of actions, a hierarchical structure of learning
FIGURE 3.9. Evolution of the loss function.
automata can be considered. In this section we have presented the practical aspects associated with the use of learning automata with fixed and time-varying numbers of actions for solving unconstrained stochastic optimization problems. The presented simulation results illustrate the performance of this optimization approach, and show that complex structures such as multimodal functions can be learned by learning automata. Other applications were carried out on the basis of the algorithms presented in this chapter. They concern process control and optimization [25], neural network synthesis [6] [26], fuzzy logic processor training [27] and model order selection [24]. In [25] a learning automaton with continuous input has been used to solve an optimization problem related to the static control of a continuous stirred tank fermenter. In this study, the discrete values of the manipulated variable (dilution rate) were associated with the actions of the automaton. Neural network synthesis can be stated as an optimization problem involving numerous local optima and presenting a nonlinear character. The experimental results presented in [6] and [26] show the performance of learning automata when used
FIGURE 3.10. Evolution of the probability associated with the optimal action.
as optimizers. In [27] a novel method to train the rule base of a fuzzy logic processor is presented. This method is based on learning automata. Compared to more traditional gradient-based algorithms, the main advantages are that the gradient need not be computed (the calculation of the gradients is the most time-consuming task when training logic processors with gradient methods), and that the search for the global minimum is not fooled by local minima. An adaptive selection of the optimal order of linear regression models using a variable-structure stochastic learning automaton has been presented in [24]. The Akaike criterion has been derived and evaluated for stationary and nonstationary cases. The Bush-Mosteller reinforcement scheme with normalized automaton input has been used to adjust the probability distribution. Simulation results have been carried out to illustrate the feasibility and the performance of this model order selection approach. It has been shown that this approach, while requiring from the user no special skills in stochastic processes, can be a practical tool that opens the way to new applications in process modelling, control and optimization. The authors would like to thank Dr. Enso Ikonen for his assistance in carrying out these simulation results.
FIGURE 3.11. Evolution of the loss function.
The reinforcement schemes implemented in this chapter give very satisfactory results, and the analytically predictable behaviour is one of their important advantages over heuristic approaches like genetic algorithms and simulated annealing.
3.9
Conclusion
The aim of this chapter has been the development of a useful framework of computational methods for stochastic unconstrained optimization problems on finite sets using variable-structure stochastic automata, which incorporate the advantages and flexibility of some of the more recent developments in learning systems. Learning stochastic automata characterized by continuous input and a fixed or changing number of actions have been considered. The analysis of these stochastic automata has been given in a concise and unified manner, and some properties involving the convergence and the convergence rate have been stated. A comparison between the performance of learning automata with and without a changing number of actions has been made. Several simulation results have been presented to show the performance of learning automata when they are used for multimodal function optimization. In the next chapter we turn to the development of optimization algorithms, on the basis of learning automata, for solving stochastic constrained optimization problems on finite sets.
References

[1] Bi-Chong W, Luus R 1978 Reliability of optimization procedures for obtaining global optimum. AIChE Journal 24:619-626
[2] Dorea C C Y 1990 Stopping rules for a random optimization method. SIAM J. Control and Optimization 28:841-850
[3] Sorensen D C 1982 Newton's method with a model trust region modification. SIAM Journal Numer. Anal. 19:409-426
[4] Dolan W B, Cummings P T, Le Van M D 1989 Process optimization via simulated annealing: application to network design. AIChE Journal 35:725-736
[5] McMurtry G J, Fu K S 1966 A variable structure automaton used as a multimodal searching technique. IEEE Trans. on Automatic Control 11:379-387
[6] Kurz W, Najim K 1992 Synthese neuronaler Netze anhand strukturierter stochastischer Automaten. Nachrichten Neuronale Netze Journal 2:2-6
[7] Kushner H J 1972 Stochastic approximation algorithms for the local optimization of functions with nonunique stationary points. IEEE Trans. on Automatic Control 17:646-654
[8] Poznyak A S, Najim K, Chtourou M 1993 Use of recursive stochastic algorithm for neural networks synthesis. Applied Mathematical Modelling 17:444-448
[9] Najim K, Chtourou M 1994 Neural networks synthesis based on stochastic approximation algorithm. International Journal of Systems Science 25:1219-1222
[10] Narendra K S, Thathachar M A L 1989 Learning Automata: an Introduction. Prentice-Hall, Englewood Cliffs
[11] Baba N 1984 New Topics in Learning Automata Theory and Applications. Springer-Verlag, Berlin
[12] Lakshmivarahan S 1981 Learning Algorithms Theory and Applications. Springer-Verlag, Berlin
[13] Najim K, Oppenheim G 1991 Learning systems: theory and application. IEE Proceedings-E 138:183-192
[14] Najim K, Poznyak A S 1994 Learning Automata: Theory and Applications. Pergamon Press, Oxford
[15] Thathachar M A L, Harita B R 1987 Learning automata with changing number of actions. IEEE Trans. Syst. Man, and Cybern. 17:1095-1100
[16] Najim K, Poznyak A S 1996 Multimodal searching technique based on learning automata with continuous input and changing number of actions. IEEE Trans. Syst. Man, and Cybern. 26:666-673
[17] Poznyak A S, Najim K 1997 Learning automata with continuous input and changing number of actions, to appear in International Journal of Systems Science
[18] Bush R R, Mosteller F 1958 Stochastic Models for Learning. John Wiley & Sons, New York
[19] Shapiro I J, Narendra K S 1969 Use of stochastic automata for parameter self-optimization with multimodal performance criteria. IEEE Trans. Syst. Man, and Cybern. 5:352-361
[20] Varshavskii V I, Vorontsova I P 1963 On the behavior of stochastic automata with variable structure. Automation and Remote Control 24:327-333
[21] Ash R B 1972 Real Analysis and Probability. Academic Press, New York
[22] Doob J L 1953 Stochastic Processes. John Wiley & Sons, New York
[23] Robbins H, Siegmund D 1971 A convergence theorem for nonnegative almost supermartingales and some applications. In: Rustagi J S (ed) Optimizing Methods in Statistics. Academic Press, New York
[24] Poznyak A S, Najim K, Ikonen E 1996 Adaptive selection of the optimal order of linear regression models using learning automata. Int. J. of Systems Science 27:151-159
[25] Najim K, Mészáros A, Rusnák A 1997 A stochastic optimization algorithm based on learning automata. Journal a
[26] Najim K, Chtourou M, Thibault J 1992 Neural network synthesis using learning automata. Journal of Systems Engineering 2:192-197
[27] Ikonen E, Najim K 1997 Use of learning automata in distributed fuzzy logic processor training. IEE Proceedings-E
4
Constrained Optimization Problems

4.1
Introduction
In the previous chapter, we discussed how to solve stochastic unconstrained optimization problems using learning automata. The goal of this chapter is to solve stochastic constrained optimization problems using different approaches: Lagrange multipliers and penalty functions. Optimization is a large field of numerical research. Engineers (electrical, chemical, mechanical, etc.) and economists are often confronted with constrained optimization, where there are a priori limitations (constraints) on the allowed values of the independent variables or of some functions of these variables. There is now a strong and growing interest within the engineering community in the use of efficient optimization techniques for several purposes. Optimization techniques are commonly used as a framework for the formulation and solution of design problems [1]-[2]. For example, in the context of control, the objective of an on-line optimization scheme is to track the real process optimum as it changes with time. This must be achieved while allowing for disturbances to the process and ensuring that process constraints are not violated. In recent years, learning automata [3]-[4] have attracted the attention of scientists and technologists from a number of disciplines, and have been applied to a wide variety of practical problems in which a priori information is incomplete. In fact, observations measured from natural phenomena possess an inherent probabilistic structure. We hasten to note that in stochastic control theory it is assumed that the probability distributions are known. Unfortunately, it is the exception rather than the rule that the statistical characteristics of the considered processes are a priori known. In real situations, modelling always involves some element of approximation, since all real systems are, to some extent, nonlinear, time-varying and disturbed [5]. Descent algorithms such as steepest descent, conjugate gradient, quasi-Newton methods, methods of feasible directions, the projected gradient method with trust region, etc. are commonly used to solve optimization problems with or without constraints [6]-[7]-[8]-[9]-[10]-[11]-[12]-[13]. They are based on direct gradient measurements. When only noisy measurements are
available, algorithms based on gradient approximation from measurements, stochastic approximation algorithms and random search techniques [14] are appropriate for solving stochastic optimization problems. The main advantage of random search over other direct search techniques is its general applicability, i.e., there are almost no conditions concerning the function to be optimized (continuity, unimodality, convexity, etc.). Methods based on learning systems [3]-[15]-[16] belong to this class of random search techniques. The aim of this chapter is the development of a useful framework of computational methods for stochastic constrained optimization problems on finite sets using variable-structure stochastic automata, which incorporate the advantages and flexibility of some of the more recent developments in learning systems [4]. In recent years, learning systems [4]-[17]-[18] have been fairly exhaustively studied. This interest is essentially due to the fact that learning systems provide interesting methods for solving complex nonlinear problems characterized by a high level of uncertainty. Learning is defined as any relatively permanent change in behaviour resulting from past experience, and a learning system is characterized by its ability to improve its behaviour with time, in some sense tending towards an ultimate goal. Learning systems have been used for solving unconstrained optimization problems [4]. The stochastic constrained optimization problem considered here is equivalent to a stochastic linear programming problem, which is formulated and solved as the behaviour of a variable-structure stochastic automaton in a multi-teacher environment [4]-[19]. A reinforcement scheme derived from the Bush-Mosteller scheme [15] is used to adapt the probabilities associated with the actions of the automaton. The learning automaton operates in a multi-teacher environment. The environment response, which is constructed from the available data (cost function and constraints realizations), is normalized and used as the automaton input. In this chapter we shall be concerned with stochastic constrained optimization problems on finite sets using two approaches: Lagrange multipliers and penalty functions. The first part of this chapter is dedicated to the Lagrange multipliers approach. It is shown that the considered optimization problem is equivalent to a stochastic linear programming problem given on a multidimensional simplex. The Lagrange function associated with the latter is not strictly convex and, as a result of this fact, any attempt to apply directly the gradient technique for finding its saddle-point is doomed to failure. To avoid this problem we have introduced a regularization term in the corresponding Lagrange function, as in [20]. The Lipschitzian property of the saddle-point of the regularized Lagrange function with respect to the regularization parameter is proved. The second part of this chapter presents an alternative approach for solving the same stochastic constrained optimization problem. It is based on learning automata and the penalty function approach [12]-[21]-[22]-[23]-
[24]-[25]-[26]-[27]. The environment response is processed using a new normalization procedure. The properties of the optimal solution as well as the basic convergence characteristics of the optimization algorithm are stated. Our main theoretical results (convergence analysis) are stated using martingale theory and the Lyapunov approach. In fact, martingales arise naturally whenever one needs to consider conditional expectations with respect to increasing information patterns [5]-[28].
4.2
Constrained optimization problem
4.2.1
PROBLEM FORMULATION
Let us consider the following sequence of random vectors

\xi_n^{j}(u, \omega), \qquad u \in U := \{u(1), ..., u(N)\}; \quad \omega \in \Omega; \quad n = 1, 2, ...; \quad j = 0, 1, ..., m

which are given on a probability space (\Omega, \mathcal{F}, P), where

• u is a discrete variable defined on the set U
• \omega is a random variable defined on the set of elementary events \Omega.

Let us define the Borel functions \Phi_n^{j} (j = 0, 1, ..., m) as follows

\Phi_n^{j} := \frac{1}{n}\sum_{t=1}^{n}\xi_t^{j}(u_t, \omega)     (4.1)

where the sequence \{u_t\}_{t=1,...,n} of random variables is also defined on (\Omega, \mathcal{F}, P) and represents the optimization sequence ("optimization strategy"), which depends on the available information, i.e.,

u_n = u_n\left(\xi_t^{j}, u_s, \omega \mid j = 0, 1, ..., m;\ t = 1, ..., n;\ s = 1, ..., n-1\right)     (4.2)

Let us consider the following assumptions:

• (H1) The conditional mathematical expectation of the random variables \xi_n^{j}(u_n, \omega) for any fixed random variable u_n = u(i) is stationary, i.e.,

E\left\{\xi_n^{j}(u_n, \omega) \mid u_n = u(i) \wedge \mathcal{F}_{n-1}\right\} \overset{a.s.}{=} v_i^{j} \qquad (j = 0, 1, ..., m;\ i = 1, ..., N)

where

\mathcal{F}_{n-1} := \sigma\left(\xi_t^{j}, u_s, \omega \mid j = 0, 1, ..., m;\ t = 1, ..., n-1;\ s = 1, ..., n-1\right)

is the \sigma-algebra generated by the corresponding data.

• (H2) For any fixed random variable u_n = u(i), the absolute values of the random variables \xi_n^{j}(u_n, \omega) are uniformly bounded with probability one, i.e.,

\limsup_{n\to\infty}\left|\xi_n^{j}(u_n, \omega)\right| \overset{a.s.}{\le} a^{+} < \infty \qquad (j = 0, 1, ..., m)

The main stochastic optimization problem treated in this chapter may now be stated: under assumptions (H1) and (H2), find the optimization strategy \{u_n\} which minimizes, with probability 1, the average loss function

\limsup_{n\to\infty}\Phi_n^{0} \to \inf     (4.3)

under the constraints

\limsup_{n\to\infty}\Phi_n^{j} \overset{a.s.}{\le} 0, \qquad (j = 1, ..., m)     (4.4)
From (4.3) and (4.4) it follows that the sequence {4° } and the sequences {4~} (J = 1,...,m) correspond respectively to the realizations of the loss function (I)°n and the constraints defined by the functions (I){ (j = 1, ..., m). R e m a r k . The "chance constraints" [25]-[29] P { w : 0j(4°, 4¼, ..., ~ ~) _> aj} _< 3j
(4.5)
belong to the class of constraints (4.4). Indeed, this chance constraint can be written as follows: E{4~} _< 0 (4.6) where 4~ = ?( {¢j(4 °, 4¼, .-., 4~~) > ay} - ~j
(4.7)
and the indicator function X(')' is defined as follows ?((,4)
= f lifAistrue [ 0 otherwise
Substituting (4.7) into (4.4) leads to 71,
limsup n E
X {¢/(4°' d ' " " 4 ~ n ) >- ai} -3j
(4.8)
t=l
In the stationary case, i.e., the probability distribution of the vector 0 (4~, 4t,1 •--, 4tm T) is stationary, and in view of the strong law of large numbers [28]-[301, it follows that the inequalities (4.8) and (4.5) are equivalent, i.e.,
4.2. Constrained optimization problem
111
n
limsup 1 E X
{~j(£0,~1,...,(~) > aj}
t=l
= limsup -1 E E { x { ¢ j ( ~ o , n~cx~
n
~ 1, . . . , ~ ) > _ a~ } } =
t=l
= P {~d : 4)j(~°, "1 The problem (4.3)-(4.4) is asymptotically equivalent to a linear programming problem which will be stated in the next subsection. 4.2.2
EQUIVALENT LINEAR PROGRAMMING PROBLEM
Consider the following stochastic programming problem on a multidimensional simplex: N
Yo(p) := E
v°p(i) ---*sup
i=l
(4.9)
p6S
N
Vj(p) := E
vJiP(i) < O,
(j = 1,...,m)
(4.10)
i=l
S ~ :=
p = (p(1),...,p(N)) E I:tNf p(i) >_ O, E p ( i ) = 1
(4.11)
i=1
where SNdenotes the simplex in the Euclidean space R N and p is the probability distribution along with the optimization is done. The following theorem states the asymptotic equivalence between the basic problem (4.3)-(4.4) and the linear programming problem (4.9)-(4.1I). T h e o r e m l . Under assumptions (H1) and (H2):
1. The basic stochastic coru~trained optimization pro blem (4.3)- (4.4) has a solution if and only if the corresponding linear programming problem (4.9)-(4.11) has a solution too (may be not unique).
. These solutions coincide with probability one, i.e., inf @~ 0 a.s. = inf Vo(p) {u~} p~S N
such that the constraints (~.~) and (4.10) are simultaneously fulfilled.
112 4.3. Lagrange multipliers using regularization approach
Proof. Assume t h a t the basic problem (4.3)-(4.4) has a solution. Then, according to Lemma A.4-1 (appendix A), t h e sequence {un} satisfies asymptotically (n -~ co) the constraints (4.9) if and only if the sequence
IN(i) := -
n
X(Ut == u(i))
(4.12)
t=l
satisfy the set of constraints (4.10)-(4.11). Hence, inf ~o a~. V0(p*) (~n} -
(4.13)
where p* E S N is the solution of the linear programming problem (4.9)(4.11). Let us now consider the stationary optimization strategy {un} for which P { ~ : u~ = u(i)l~=~_a} ~'~" p'(i) According to the strong law of large numbers for random sequences containing martingale differences (see Lemma A.4-2 in appendix A) we derive lim ~
~'~" Vj(p*) (j == 1, ...,m)
~--* oo
Theorem is proved. • The next section deals with the properties of the Lagrange function and its saddle-point.
4.3 Lagrange multipliers using regularization approach It is well known [ii]-[26]-[27]-[8] that the stochastic linear programming problem discussed above is equivalent to the problem of finding the saddlepoint for the following Lagrange function m
L(p, A) := Vo(p) + E A(j)Vj(p)
(4.14)
i=1
given on the set S N × R~' where R ~ := ~ = ((~(1),...,~(m)): ~(i) > 0 i = 1 .... ,m}
(4.15)
In other words, the initial problem is equivalent to the following minmax problem L(p,A) --* inf sup (4.16) pC:S N ) ~ R T
4.3. Lagrange multipliers using regularization approach
113
According to the LagTange Multipliers Theory [7]-[26], any vector p* E S g is the solution of the linear programming problem (4.9)-(4.11) if and only if there exists a vector A* E R ~ such that for any other vectors p E S N and A C R m + the following saddle-point inequalities hold L(p*,A) < L(p*,A*) < L(p,A*)
(4.17)
The Lagrange function L(p, A) is not strictly convex and, as a consequence, any attempt to apply directly the gradient technique for finding its saddlepoint doomed to failure. The following example ilh~strates this fact. E x a m p l e . Consider the function L(p, A) for which their arguments belong to the single dimensional space ( N = m -- 1), and Vo(p) = 0, i.e., L(B, A) = Ap
Applying directly the gradient, technique (Arrow-Hurwicz-Uzava algorithm) [31] for finding its saddle-point which is equal to p* = A* = 0, we obtain Pn An
=
Pn-1 - %~V~,L(pn 1 , A n - l ) - - P,~--1 - %~A~ 1 An-1 + % V ~ L ( p n - I , A ~ - I ) = A,~_~ +~/nPn-1
"fn
E
R 1, " y ~ 0 ,
(4.18)
E%=oo
It is easy to show that this procedure generates a divergent sequence {Pn, An} which does not tend to the saddle-point p* = A* = O. Indeed, Pn :=P~ + A2~ = (1 +~f2) p,,_, = P o f l
(1+~/2) >-Po > 0
t=l
One approach for avoiding this problem consists of introducing a regularization term in the corresponding Lagrange function [7]-[20]: /~(p,A) = L(p,A) + 6 (tlpNU _ tlAN2)
(4.19)
For each regularizing parameter 6 > 0, this regularized Lagrange function La(p, A) is strictly convex on p and concave on A . The next theorem describes the dependence of the saddle-point (p* (5), A*(5)) of this regularized function with the regularizing parameter 5 and analyses its a~symptotic behaviour when 6 --* 0. T h e o r e m 2. For any sequence {6n} such that O<6n
5~ --~ 0
(4.20)
114
4.3. Lag-range multipliers using regularization a p p r o a c h
the sequence { (p* (Sn ) , )~*(~.))} of the saddle-points of the corresponding Lagrange function L6(p,A) (4.14) converges to the saddle point (p**,A**)of the initial Lagrange function L(p, A), which corresponds to the saddle point (p*, A') and has the minimal norm, i.e., if there exist many saddle-points (p*, A*) of the initial Lagrange function then, (p* (6.), A* (5.)) --* (p**, A**), n --* oo
where p'* := arg p* min 1 /~llP, lt2 + ,),*
i1~.112)
and (p*,A*) is the saddle-point of L(p, A). Proof. Let us introduce the following notations p~ := p*(6.), ;~;~ := A*(6.) Using inequalities (4.17) and definition (4.68) we derive Ls(p*(5.),A*) = L (p*(~i.),A*) + ~6 (llp.(e~)ll 2 - I1~.112) < < L (p*,A*(~.)) + 7
-
= L~(p*, A*(£~)) After dividing by 6 > 0, this inequality leads to
2 L • IIP*(5~)ll 2 + 11~'(6~)tt 2 _< ~ [ (p , ~ * ( ~ ) ) - L ( p * ( 6 ~ ) , ~ * ) ] + + IIp*ll2 + IIA*II2 <__lip*l] 2 + IIA*II2 <
oo
From this inequality it %llows that the sequence {p~, Jk~} is uniformly bounded on n. Hence, we (:an select a convergent subsequence { P ~ , )~k } k - ~ ' i.e., the following limits exist lim
k--*oa
Pn~ :----P, klim A" ~ o a nk : ~
It is clear that this limit point (~, ~) is a saddle-point of the initial Lagrange function L(p, A) (4.14). If it is unique then the theorem is proved. Now consider the case, when we are concerned with a set of saddle-points. The following inequality %
]
4.3. Lagrange multipliers using regularization approach
0 G (p -- pn) T VpL~(p, ~)
(,~ - ),~)T V~L~(p, ;~), V),,p
-
115
(4.21)
holds for any convex function. Let us suppose that for two different convergent subsequences there exist two different limits, i.e., *
k~c~
lim
s~oc
p~
--
:
lim
=p,
)~n~ : = ~
lim A*
s~oo
n~ : ~
~
For p = p*,,k := ,~* and n = nk or n = ns, inequality (4.21) and the properties (4.17) of the saddle-point [10] lead to
-
-- Pnk )
-
(m vO q-
j=l
*
)
= L ( B * , A * ) - L(p~,A*) + L ( p * , A ~ k ) - L(p*,A*)+
+6nk[( 6n~ [(
p" - P'~k) . ,T p , + ( A * - A* ~TA*]< '~kJ p* - - p Z k ) T p * + (A
T~.']
where v ° := (vO,...,vON) T, v j := ( V ~ , . . . , v ~ ) T j
= 1,...,m
Dividing both sides of the last inequality by 5n~, and for k --* obtain
co
,
we
for any saddle-points (p*,A*) of the Lagrange function L(p,.~). Similarly, for the other subsequence we can write
N
From these last inequalities it follows t h a t the points (~, ~) and (~, A) correspond to the minimum of the following quadratic function
!
2
÷ ,-e)
defined on the set of all possible saddle-points (p*,A*). This function is strictly convex, hence the minimum is unique. As a consequence, it follows
116 4.3. Lagrange multipliers using regularization approach
Theorem is proved. [] For 6 > 0, the regularized Lagrange function is a strictly convex function and, has the following useful property which will be given in the next lemma. L e m m a 1. For any p E S N and for any A E R m the following inequality holds: (p - p*(5)) r VpL~(p, A) - (A - A*(6)) v V~L~(p, A) >_
(lip- p*(5)ll2 ~11~- ,\*(5)112) where p" (6), A* (5) is the saddle-point of the regularized Lagrange function (4.68). Proof. From (4.68) we obtain (p - p*(5)) T Vpn~(p, A) - (A - A*(6)) T V~L~(p, A) =
= L~(p, A*(6)) - L~(p*(5), A) +
(lip- p*(e)ll 2 + IP,- "v (5)112)
T h e desired result follows immediately from the following inequality
n~(p, A*(6)) >_ n~(p*(5), A) which corresponds to the saddle-point condition (4.17). Lemma is proved. [] The next theorem states the Lipschitzian property for the saddle-point (p*(6), A*(6)) of the regularized Lagrange function /~(p,A) with respect to the regularization parameter 6. T h e o r e m 3. Under the same conditions as in Theorem 2, there exists a positive constant C such that for any nonnegative parameters 5t and 5s the following inequality holds: liP*(5,) -p*(Ts)ll + 11)¢(5t) - ~,'(5,)11 _< CISt - 6sl
4.3. Lagrange multipliers using regularization approach
117
Proof.
Let us consider the following sets Z0 :=
p,,k) :
p(i) = 1 i=1
z l ( i l .... ,i8) : = { ( p , ~ ) : p ( i k )
}
= 0, k = 1 . . . . . s } n Z 0
Z 2 ( j l .... , j r ) :=: {(p,X): A(jk) .... 0, k = 1,...,r} N Z0
Z3(il,---,is; Jl,-.-, jr) := ZI(il, ...,is) ~ Z2(Jl, ...,Jr) The total number of these sets is equal to 2m(2 g - 1). They will be denoted by ~k (k = 1, .... 2m(2 N - 1)). Let us a.~sociate with each set ~k the problem ~ of finding the saddle-point of the regularized Lagrange function L~(p, ~). The corresponding solution will be denoted by ( p ( ~ ) , /k(~3~)). It is clear that the saddle-point (p* (~),),* (5)) of the regularized Lagrange function L~(p, )~), defined on the set S g × R m + , coincides with the solution (P(~e), A ( ~ ) ) of one of the previous problems ~e. Notice that each problem ~ is a convex optimization problem with equality constraints and, as a consequence we can use the LagTange technique for deriving the necessary optimality conditions. The optimality conditions are:
VpL~(p, A, #) = 0,
V~L~(p,),,#) =: 0,
V,L~(p,A,#) = 0
(4.22)
where /;e(p,)~,#) := Le(p,)~) - #o
p(i) - 1
- ~#lkP(ik)
-- ~ _ , # 2 k , X ( j k )
k
k
and #lk and #2k are Lagrange multipliers. T h e linear system of equations (4.22) is equivalent to m
V 0 -t- ~ / ~ j V 3=1
j ~-5p-
#0 e N -- E l Z l k e ( i k ) k
-~
0
Z,2
=
0
e(
k)
k
(p, A) C Gk where eN :
(l,...,1) T E R N
e ( j k ) := (0, ..0, 1, O, O, ..., O)T
J~
(4.23)
118 4.4. Optimization algorithm From (4.35) it follows t h a t the expression of the optimal solution (saddlepoint) can be written in the following parametric form m+N
m+N bj,6
p(i) =
~=0 ,~+N E
~=0 , A ( j ) = ,~+N P s A s ai~6 ~ aj~6
(4.24)
s=O
8~0
For any 6 > 0, the corresponding Lagrange function is strictly convex and has a unique optimal point. It follows that the denominator of (4.24) is not equal to 0. When 6 Ls small enough the problem to be solved remains the t same (gl~) as 6 decreases. Let us assume that the r p - f i r s t coefficients ~s
(s = O, 1, ...,rp)
are equal to zero and, similarly assume that the rp - f i r s t
P ( 8 = O, 1 ..... rp" ) are also equal to zero. It follows: coefficients ais r
m + N - r~ - 1 ]
p ( i ) = 5 r"
i,s+%+l
s=l m+g
, rp := rp, -- rp,,
(4.25)
r~ := r x' - - r x"
(4.26)
a r; +l + s=:l E aPi,s+r~ +l The same form is valid for tile vector A: rn+N-r~,-
A(j)
=
2
1
m+g a~ 4 ~ a~ ,, 5 s 3,r~ +1 s=l j,s+r~ +1
,
In view of theorem 2 (boundness of the coordinates of the saddle-point) we conclude %>0, r~>0 The assertion of this theorem follows directly from (4.25)-(4.26) and the boundness of the parameter 6. Theorem is proved. B The next section deals with the use of learning a u t o m a t a for solving the equivalent optimization problem stated above.
4.4 Optimization algorithm It has been shown in section 4 that the stochastic constrained optimization problem (4.3)-(4.4) is asymptotically equivalent to the problem related to the determination of the saddle-points of the regularized Lagrange function L 6 ( p , A ) , using the realizations of the cost function and the constraints. This equivalent problem may be formulated and solved as the b e h a v i o u r
4.4. Optimization algorithm
!~
Teacher1 ~,¢~0~n
-----t
Teacher2
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
119
Normalization Procedure
----~ Teacher m+l ~ "
I
I FIGURE 4.1. Multi-teacher environment.
of a variable-stochastic a u t o m a t o n in a multi-teacher environment [4]-[19] (Figure 4.1). Referring to the schematic block diagram for the learning a u t o m a t o n operating in a multi-teacher environment (Figure 4.1), we note t h a t the normalization procedure processes as a mapping from the teachers responses ( ~ , j = 0, ..., m) to the learning a u t o m a t o n input (~n).
4.4.1
LEARNINGAUTOMATA
T h e role of the environment (medium) is to establish the relation between the actions of the a u t o m a t o n and the signals received at its input. For easy reference, the m a t h e m a t i c a l description of a stochastic a u t o m a t o n will be given below. An a u t o m a t o n is an adaptive discrete machine described by:
{z, u, n, {~,}, {~}, {p~},T} where: (i) E is the bounded set of a u t o m a t o n inputs. (ii) U denotes the set {u(1), u(2), ...... u ( N ) } of actions of the automaton. (iii) T~ = ( f~, Y, P ) a probability space.
120 4.4. Optimization algorithm (iv) {[=} is a sequence of automaton inputs (environment response, ~= E S) provided by the environment in a binary (P-model environment) or continuous (S-model environment) form. (v) {u=} is a sequence of automaton outputs (actions). (vi) p= = [p=(1),p=(2), ...,p=(N)] T is the conditional probability distribution at time n N
p~(i) = P{~o: u~, .....u(i)
/
9t-~_1},
~ p , ~ ( i ) = 1 Vn i=l
where ~'~ = a(~l, ul, Pl;---; ~,~, u=, pn) is the a-algebra generated by the corresponding events (.Tn E ~'). (vii) T represents the reinforcement scheme (updating scheme) which changes the probability vector p,, to P~+1:
}t=l ..... ~; {ut}t=l ..... ,~)
P~+I = P= +Tn:/~(Pn; {¢t p~(i)>0
Vi=l
(4.27)
..... N
where "Yn is a scalar correction factor and the vector
.... [Tj(.),..., satisfies the following conditions (for preserving probability measure): N
~T¢~(.) ........0 v,~
(4.28)
Z=I
p,,(i)+%(~,~(.) E [0,1]
Vn, Vi = 1 , . . . , g
(4.29)
This is the heart of the learning automaton.
4.4.2
ALGORITHM
Let un = u(i) be the action selected at time n. For fixed vector An and fixed regularization parameter 6 = 6n, the environment response (automaton input) ~n is defined as follows:
~-
~ p~(i) +~ '
~=
u(i)
(4.3o)
where ~ :=
(4.31)
4.4. Optimization algorithm
121
and ~T(oJ) := (~O(un,03),~l('Ltn,o3) .... ,~r~(Un,03)),
~n(03) • n m+l
(4.32)
The sequences { a n } and{t3n} are positive deterministic. They will be defined below. The transformation (4.30)-(4.31) is a normalization procedure [4]. In this study we shall be concerned with Bush-Mosteller [15] reinforcement scheme for adjusting the probability distribution p (other reinforcement schemes can be also used [4]). The Bush-Mosteller scheme [4]-[15], is described by:
Pn+a = Pn + "~,~ [e(u=) - p,~ + ~,~(e g - N e ( u , ~ ) ) / ( N - 1)] 1
pl(i) = ~ ,
(4.33)
(i = 1,...,N)
where
,~,~ •
[o, 1], ~,~ • [o, 1]
e(~)
:
(~,0...01
eN
=
i (1,..., 1) T •
~,
: u(i)
R N
The Lagrange multipliers An are adjusted according to the following algorithm: (4.34) A=+a(j) : [An(j) + 7 ~ ( ~ - 6nA~(j))] 0~+ "y~ > 0, Al(j) > 0 ,j = 1,...,m where the operator [-]0 ~: ~s defined as
{~ [x]0~::=
x >_ A+ x
0<~
0
x
and {A+ } is a monotically increasing sequence. The sequences an and ~3n will be selected according to the following lemma. L e m m a 2. If, for some positive monotieally decreasing {'Gt} (0 < Tn ~ O) and positive increasing { A + } (0 < A~ ~ c¢) sequences, the parameters an and 13n are given by ~n. : =
,~
~o+ + ~.+ E ~ j=l
, /7,~ : = a,~ +
+6~
1 + (N - 2)%~
122 4.4. Optimization algorithm where + .__ ~ ~ ~n .-- 2 (1 & ( N - 2 ) ~ ) ' ~+ := 1 - Tn and ~
:= ~-n(N - 1)
then, the automaton input belongs to (0, 1]:
~ n e [ (N-1)~-~21+(N-)%7 2 , ~ ; 1 ] C(O, 1]
Proof.
To prove this lemma we will use the induction method. Let us assume that + ~,~-1 e [ ~ - 1 , ~ n - 1 ] , from the Bush-Mosteller scheme (4.35), we derive p~(i) = p,~-~l(i) + ~ - 1 [X(u,~-i = u(i)) - p n - l ( i ) +
+~n-1[1 - NX(u,~-I = u ( i ) ) ] / ( N -- 1)] > 1 -- s,~-l, ~ - ~
>__pn-l(i)(1--~n-1)+'7~-lmin
>_pn_,(i)(1--~ ' n~.,)+~ ' n_lmin{i--~+_1;
>_
}
(4.35)
From the definition of ¢~_ 1 and ¢~_ 1, we derive =
-~_1,
~--
1
Consequently, it follows p,~(i) >_ min { p n - , ( i ) ; ~-~-, } >_ min{min {p,~-2(i); Tn-2} ; ~-n-, } =
= min {pu-2(i);~-,~-l} _> ... _> min {P1(i);~-~-1} = 1 Finally,
p,~(i) >_ ~-,~_i,vi = 1, . . . , N
(4.36)
Consider now the automaton input (~: I
~,~ ,~
~ = (1+(N-2)~) ~
y~ m
~0++~+Z~}+ j=l
+
4.5. Convergence and convergence rate analysgs
>-- ( I + ( N - - 2 ) T = )
[
Pn 7,~
Pr, T 2 ( N - 1)
1 + (N - 2)T,~
1 + ( Y - 2)~',,
¢;~-,, (1 + ( Y - 2)v,~)p,~ and
(C + ~ ) ~ ~n--~ ( I + ( N - - 2 ) T ~ ) p ~ < -
(~n+ + ~ ) 1 + (N - 2)7~
(~+ + ~ ) ~ -~ (1 + ( N - - 2 ) T ~ ) T ~ - I
123
< -
-
= 1 +T~(N--2) = 1 1 + (N - 2)~
Lemma is proved, m As might be expected from the construction of the automaton input (normalized environment response), the signal input ~n belongs to the interval This lemma leads to the following corollary C o r o l l a r y 1. The conditional mathematical expectation of ~,~ is given by
the aJfine transformation of the corresponding regularized Lagrange function (4.20):
T~(i) :-- E{$' [ (U'~= u(i)) A'Ta} -- p,~(i) I [a,~~
L~(p,~,A,,)+t3,, ]
where ~,~ and J3n are defined in the previous lemma. The asymptotic properties of this optimization algorithm are stated in the next section.
4.5
Convergence and convergence rate analysis
In this section we prove the convergence of the optimization algorithm described in the previous section and we estimate its convergence rate. T h e next theorems are our main results in this section. 4. Consider the optimization algorithm (4.33)-(4.34), under assumptions (1-11) and (H2), and in addition, assume that there exist two nonnegative sequences {~,~} and {~-~} such that
Theorem
1) 0<~
L0, 0 < ¢ ~ < - -
N+I
0,
-' _-oo,
124
4.5. Convergence and convergence rate analysis
2) the following limit d:=
lim
-n"n
h
h:=2
(
1-
or+ j=l
exists, 3) the sequence {7~} is con.strutted as follows
-r~-
N N - 1 ~n~/'~"
Then,
if 4) __
+
<
oo
then, the convergence to the saddle-point (p**,A**) (of the Lagrange function) which has the minimal norm, is ensured with probability one, i.e., W,~ : = ][p,~ - pnl] 2 +
]]An--
),~[I 2 ~:2¢ 0
and
¢ 5) "~,~+~ + I~,~+, - ~1 ~ ~'+~ _~ o then, we obtain the mean squares convergence
S{W,~} ,~-,o¢~°'{ IIp~- p*'II2+ II&"- ~**112}~-~-~ o Proof. For any time n >inf {t:
W,~+I <
Pn + 7n
I1~;11_< ~+}, (4,33) and (4.34) ]earl to e(u,~) - p,~ + ~,~
N - 1
J
,~+1
+
+ I1~,,~ +'y~ (~,~ - 6,~,~,~) - ,~7~+~ II ~ = =
II
p,~-p~+~,,~
[~(u,~)-p,~+~,~
N-1
_1-(,~+i-Pl)
II +
4.5. Convergence and convergence rate analysm
125
where
~,~ = (~,..., ~ ) ~ According to the following well-known inequality
lia + bll 2 ~ (1 + ~) Ilall 2 + (1 + ~-1) ilbll2 ,g > 0 which is valid for any ¢ = ¢~ > 0, it follows Wn+l -~ (1 + en)
I1
B,~-P~+7~
[
e(u,~)-pn+~,~
+
+(1 +~71)ii p;÷,-p;il ~+ (~ +~)JJ~-~:, + ~ (:~- ~ ) l l ~+ +(1 + ~;1)II~;+~ - ~II ~ In view of theorem 2, we derive Wn+l _< (1 +¢~)W~ +2"yn(1 + 6~)((p,~ _ p,,).TpA,~ + "y~. - - y,~( A , ~ - - A ~ ) T A ~ I + (4.37)
where A s := e(u,~) -p,~ +~,~
e N - Ne(u,,)
N-1
Let us calculate the conditional mathematical expectations and the estimation of the conditional second moments of the vectors APn and A~n. Notice t h a t Pn and As are ~-n-measurable. Hence, we have: 1. Conditional mathematical expectations I) Vectors A~ N
E {A~I~'~} = E E { A ~
l(u~ = u(i)) A ~'~}p~(i) ----
i= 1
~=]
N - 1
] p,,(i) =
N
t N-1
E ' r P ( i ) p ~ ( i ) [eN -- Ne(u(i))] i:1
(4.38)
126 4.5. Convergence and convergence rate analysis where 7P(i) := E { ( ~ !(u,~ = u(i)) AiY,~} 2) Vectors A~ N
E {A~Ibr.} = E E {A~ [ (u,~ = u(i)) A .7",~} p.(i) =
(4.39)
i--1
N i=1
where 2. Estimation of the conditional second moments 1) Vectors A n
E {IIA~II: i7.} <_ ~ + 2 iip.ll ~ where
(4.40)
2N Cp = 1 + N----L-]
2) Vectors A~ E {llA~tl2 I~n} < 2a~ + 25~ (A+) 2
~ := ~ (~+):
(4.41)
(4.42)
j=l Considering the conditional mathematical expectation of both sides of (4.37) and combining (4.38),(4.39),(4.40) and (4.4t) we obtain a*8.
E{W~+llf~} < ( t + ~ ) w . + 1
N
Ne(~(i))]) +
4.5. Convergence and convergence rate analysis
127
+(1 + ~nl)C 2 l~n+l -- ~nl 2 Observe that
N E{APnlg~'n}- N -1 1 Z'r~(i)pn(i) [eg - Ne(u(i))] ='=-'
i=l
a.__s.N - l i=l O t n ~ L ~ (Pn, An) "J~Bn [eN - Ne(u(i))]
(4.43)
as. _an VpL~ (Pn, An) - bne g where Nan an . - N---Z-~ ,
b~
N/3n '- N-~
(4.44)
and an and/3n are deterministic sequences, defined by lemma 2,
N
E {A~I.Tn } = Z T~(i)pn(i) -
6~A~
'~'='"V~L~,~(pn, A,~)
(4.45)
Using formulas (4.43) and (4.45) we derive E {Wn+l I ~ } ~_~ (1 + ~ ) ( 1 + 2"y2)Wn+ -2~/n(1
+ en) {an (Pn -- p~)T VpL~,, (Pn, An)-
(4.46) (4.47)
~%~(An - A~)T V~L~,~(pn, An) } + -~-"[2(1-{-~n) (C,p Jr-2 ~k"Yn] ('~nA ~ 2 IO'~ -{-52 (A+)2]) -~- (]" -{- ~nl)C2 I~n+l- ~n'2 Let
~An= anon Then, taking into account the strictly convexity property of the regularized Lagrange function L~,,(pn, An) (see Lemma 1) and (4.47), we obtain E {Wn+li~-,~} _ (1 + ~n) (1 -
"/n6na,~)Wn+
+~(1 ÷~°)(c~ ÷ 2~. [~ ÷ ~: (~.+)~])+(1 ÷ <1)c2 i~1-~.t 2 Finally,
128 4.5. Convergence and convergence rate analysis
E{W~+al2=~}
a.s,
<_ (1
(4.48)
+¢~)(1-',/,~6,,a,~)W,,+
16~+1 - 6.12 2 2,~2 ) ~ ) 2C3 ". + C2 ~n
where c1:=2
( Cp+2Cr
2
supa
, c 2 : = 2 C 2, c 3 : = 4 s u p a , ~
rz
r/,
According to assumption 2 of this theorem, it follows: art" =
~
-
-
j=l
and hence, we conclude (1 + ~ )
(1 - ~ 6 ~ a ~ )
_< 1 - ~ a ~
1
<
~,,~6,~a,~j <
_< 1 - ~,~,~a~ [1 - d + o(1)] Prom (4.48) we obtain tile following quasimartingale inequality E{W~+I[~'u} _< ( 1 - " m 6 , ~ a , ~ [ 1 - d + o ( 1 ) ] ) W n +
(4.49)
En It follows that under the assumptions of this theorem, all the conditions of the Robbins-Siegmund theorem [4]-[32] (see Appendix B), are satisfied. We conclude
W,~ ~:2,"W*,
Z
a.8,
",/,~6,~a,,W,~ < oo
and, in view of assumption 1, ~ "/,~5~v,~(A+)-1 = c~, this result implies r~l
t h a t there exists a subsequence W~ k which tends to zero and, hence W* ~ 0 . The mean squares convergence follows directly from (4.49) after applying the operator of mathematical expectation and lemma A5 in [4] for 6n = 0 and b = 0. Theorem is proved. •
4.5. Convergence and convergence rate analysis
129
C o r o l l a r y 2. If 1
~+ ='Xl + n X l n n ' "Tn
~-~= ( N + l ) ( l + n " l n n ) '
~--~0
n'~ ' 6,~
~'0 ~,
.Xl > 0
~'0
e,~ = --nS, ¢o < "~oAlh
(4.50) (4.5i)
% 6, e, ~ - > 0 , ;~_>0 then,
1) if ),+-y+T+$=a
2-y>t, 2(6+l)-e>l
(4.52)
then, conditions (1)-(4) of the previous theorem are fulfilled and, the almost surely convergence is ensured, i.e.,
2)if A + ' y + 7 - + 5 = e _< 1, -~ > A + T + 6 , 2 + 6 > e+-y+A+~- (4.53) then, conditions (I),(2), (3) and (5) of the previous theorem are fulfilled and, the mean squares convergence holds, i.e., Eiw~}
~o
Proof.
The proof follows directly by substituting (4.50)-(4.53) into the conditions of the previous theorem. • The estimation of the maximum convergence rate for the class of parameters (4.50)-(4.53) is given by the next theorem. T h e o r e m 5. Under the conditions of the previous theorem and for the class of parameters (~.50)-(~.53), there exists u > 0 such that 1 and E { W n } = o(n-~v) Wn a.j. o~(~__~) where the order v of convergence rate satisfies the following constraint 1 - < - (v,~,6)
<
=
and the m a x i m u m convergence rate •** is reached for
7=-y*=2,
e=e*=l.'c=7*=6=~*. .*(~*,e*,6*)
= v**
= -~, A
A* = O
130 4.5. Convergence and convergence rate analysis
Proof. From theorem 3, it follows: W~ := []p~ -p**[I 2 + IIA~ - M*]] 2 = I[(P~ - P ~ ) + (P~ -P**)[[ 2 + + [ [ ( M - AX) + (A~ - A**)[[ 2 _< 2 Ilp,~ - p 5 l [ 2 + 2 lip* - p**l[ 2 + + 2 IIA. - ~112 + 2 I1~;, - x**ll ~ < 2 w . + c ~
Multiplying b o t h sides of the previous inequality by lJn, we derive
v.w;, < 2 v . w . + ~,,,c6~ Selecting ttn = n ~ and in view of lemma A.3-2 (see Appendix A) and taking into account that v,~+l - v,~
~ + o(1)
vn
n
we obtain 0 < v < min { 2 7 - 1; 26 + 1 - ~,26} := v*(%6,5) where the positive parameters 7, 6, ~, T,and A satisfy the restrictions (4.5) A+7+T+5=e_<
1, 2 7 > 1 ,
2(6+1)--6>1
and the constraint (4.36) which is equivalent to 6_
5 _<w
and similarly it follows A - % 26} := v*('y,6,6)
0 < v < m i n { 2 " ~ - 1; 1 + 6 - ~ -
The solution of this optimization problem r'* ("/, 6, 6) --~max "r,~,5
is given by 27-t
= 1 +6--T--
A--^f = 25
4.6. Penalty function
131
or in equivalent form "7 = -3 _ _ ( ; ~ + ~ -
~) = ~ + ~ = i - ~ - ~ - - x
The smallest ~- maximizing "7 is equal to T=6 and, hence 2 "7-3
1 1 ~A=~+5=l-25-A
From these relations, we derive 1 Taking into account that A _> 0, we obtain 6<-6
1
and, consequently 1
2
. . . . . . 4- (5 ~ 2 --3 The optimal parameters are "7='7*=
2 , e=e*=l,
7 = ~-* = 6 = ~* = -1 A = A* = O 6'
The maximum convergence rate is achieved with these parameters and is equal to 1 ~*('7, e,~) - 2~* - 1 = ,** = 3 Similarly, after application of the mathematical expectation operator to (4.49), for E {Wn} and in view of lemma A.3-2 (see appendix A) and lemma A5 in [4], we obtain the desired results. Theorem is proved. • An increasing attention has been devoted to the use of penalty functions for the solution of nonlinear programming problems. The next part of this chapter deals with the penalty function approach.
4.6
Penalty function
Consider the following programming problem N
V o ( p ) = E v°i p ( i ) --* min i~l PESN
(4.54)
132 4.6. Penalty function N
v~p(i) <_ 0 (j = 1, ..., m)
Vj (p) = E
(4.55)
i=1
Let us assume that the set Pr of all vectors p E S g satisfying the constraints (4.55) is not empty, i.e., P~ := p E S ~ I Vj-(p) < 0 (j = 1..... m) # 0
(4.56)
This problem (4.54-4.55) is equivalent [10] to a linear programming problem with equality constraints
Vo(p) --*
miA~
(4.57)
pCSN,ueR~~
Vj(p) + ~/¢ = 0, ~j > 0 (j = 1 ..... m)
(4.58)
where gj are the nonnegative slack variables [26]-[27]. In this part we shall be concerned with the penalty function approach. The following penalty function has been considered by Poznyak [18] m
Vo(p) + ,
+uj] - + ) 2 , p>0
Z
(4.59)
i=l
where the penalty coefficient # -~ oo. The optimal solution of this penalty function is equivalent to the optimal solution of the following penalty function ttV0(p) +
[l/)(p) + gj]
, 0 < # --~ 0
(4.60)
i=1
This penalty function is more useful (practical point of view) because ~t --* 0 instead to infinity. Let us consider the following regularized penalty function [24]
L,,~ (p, ~) := pVo(p) + -~
(Hpll.2+ i1~112) + -~1
IIVP + ~ II2
(4.61)
where
V=- [~li.,.iVN] C R rrt+N /
k
~ T
Vj
(4.62)
We assume that the parameters # (penalty parameter) and 6 (regularization parameter) are positive and tend monotically to zero, i.e., 0<#,6~0
4.7. Properties of the optimal solution
133
The parameter 6 play the same role as in the first part. We shall be concerned by the following optimization problem n~,~ (p,~.) ~
in f
(4.63)
p~RN,uER~
For fixed # > 0 and ~ > 0, thks problem is strictly convex. Its solution is unique and will be denoted by
(p*(p, 6),~:*(#, 6)) e S N × R ~
(4.64)
The properties of the optimal solution are presented in the next section.
4.7
Properties of the optimal solution
The next theorem states the property of this solution (4.64) when 6 and # tend to zero. T h e o r e m 6. Let us assume that
1) the set Pr is not empty and the Slater's condition Vj (~) > 0 (j = 0 ..... m)
(4.65)
is fulfilled for some ~ E S N 2) The parameters # and 6 are time-varying, i.e., #=#~,
6=~,~, ( n = 1,2 ..... )
such that o<~.tO, --
~
o < 6.Lo
(4.66)
----4
(4.67)
0~
Then, p~ :=p*(p~,,6 ) ~ '~
p** (4.68)
where p** E P~ is the solution with the minimal weighted norm, of the linear programming problem (4.54), i.e., [Ip**II,+v'rv ~ Hp*Ni+yrv, p* E P* c Pr
(4.69)
where p* is any solution of (~.5~), P* is the set of all solutions and, u** is given by ~** = -Vp** (4.70) and I]xll2Q := xTQx.
134 4.7. Properties of the optimal solution Proof.
First, Let us prove [24] that the Hessian matrix associated with the penalty function (4.61) is strictly positive definite for any positive # and 5, i.e., g:~---
VpL~,5 (p,~) V~VpT L,,~ (p, ~)
VpVTLt, a (p,u~ u ' V~Lu,a (p, u~
> 0
(4.71)
To prove this result it is sufficient to show that the following matrix are strictly definite positive [33] v2pn,,, (p, ~) > 0, V~L~,, (p, ~) > 0,
Indeed, 1) 2 VpLt,,~ (p,~) = v T v ÷ S I N > 0 , IN :=diag{1,..,1} e R N
2) V2L,,6 (p,~) = (1 + 6)Im > 0
3) -1
T
= v T v + 8IN -- O-~3vTI,~V = 8IN + ( 1 - ~ v T v > 0
4) 1
= (1 + 8) I m - Y ( v T v + 6IN) -1 Y T In view of the matrix inversion lemma, it follows (1 + 8) I m - V ( v T v + 6IN) -1 v T = (1 + 8) Im-- 8 - I V (Ilv - V T (8I~ + v v T ) - I v )
.T =
From (4.71) it follows that the penalty function (4.61) is strictly convex and, as a consequence it has a unique minimal point on S N x R~. Let us denote it by (p*(#,5),~*(#,8)) E S N × R ~
4.7. Properties of the optimal solution
135
Using the strictly convex property, we conclude that 0< -
~-~*(~,5)
H
p - p• (#'5) ~ - ~*(~,6)
<
<_ (p - p*(#, 6)) T Vp it,,6 (p* (/.t, 6),~*(#, 5))+ ~ * ( # , 6 ) ) T v u L~,,6 (p*(#,5),~'(#,6))
+(~----
=
(4.72)
(p - p*(#,6)) T (t-~o + 6p*(tt, 6) + v T [V p * (t.t,6) +~*(~,6)1) + + (~ - ~'(~, 6)) ~ (vp" (,, 5) + (1 + 6)~-(~, 5))
Let p* is any solution of the linear programming problem (4.54)-(4.55). Dividing by # and replacing p and ~ respectively by p* and = - g p * - 6 V (p* - p*(#, 6)), we obtain 5 0 < 1# [Vo(p*) - Vo(p*(#,6))] + -fi (p* -p*(#,6))Tp*(.u,6) -
6 _1# ]]Vp*(~,6) +~*(#,5)H 2 + ; (Vp* - Yp*(tt, 6))T Vp*(#,6) < ~5 _ p, _< _ !. llVp* ( , , 5) + ~* (,, 5)tl ~ + ; (p* (., 6)) T [I~ +
(4.73)
VvV] p*(., 5)
For 0 < / ~ = #n ~ 0, 0 < 6 = 6n ~ 0, and taking into account assumption (4.67), we derive 0 > lim !
IIUp*(p,.6,~) +~*(~,~,6,~)112
(4.74)
and, as 0 < #~ ~ 0 it follows IIVp*(g~,6=) + ~ ' ( g ~ , 6 ~ ) f t 2 -~
0
(4.75)
n---* oo
Let us denote by p ~ any partial limit of the sequence {P*(#n, 6n)}. In view of (4.75), it follows that any partial limit ~ of the sequence {u*(#n, 5n)} is uniquely defined by ~*0¢2 = - V p * O Q As p ~ E S N, we derive II~Lli -< IIVpLII ~ tIvll < ~, i.e., {u'(tzn,6n)} is a bounded sequence. Let us now prove that
*
6
*
6
136 4.7. Properties of the optimal solution where 7rp. {-} is the projection operator onto the set P*. It has the following property [[x-~rp. {x}l [ _< [[x-p*ll Yx E R N, Vp* E P* C R N
(4.77)
Inequality (4.74) leads to
llvp*(~,~,6,,)
+~*(#~,6,,)t! 2 < c 1 , ~ , c l • (0,oo)
Hence, because each component of the optimal solution u*(#n, 5~) is nonnegative, we derive Y p * ( #,~ , 6,~) <_ x/ ~--~v/-fi~ e '~ - ~* ( #,~ , 6 ,~) < v f-C-~xFfi~ e '~
(e" := (1, ..., 1) T • n ~ ) This inequality mtLst be understandable in componentwise sense. Notice t h a t [[p*(#~,6,)-Trp. {p*(,,~, b,~)}l[2 :........ min
pC.:SN :Vp~O
[tp*(#n, 6•)_p[[2 <_
(4.78) <
max
- - q e S N :Vq< v/-C-11~fi~em
rain
pE S N :Vp
He - p[[2 := g(#=)
Let us introduce the following sequence ~)rt, :----
v~14h~
max yj(~)
j:=l,...,m
where ~ satisfies the Stater's condition [34]-[35] Hence, W,~ • (0, 1)
(4.79)
Let us consider the following transformation = (1 - V,~)q + ~ , , P • S N which transforms the set
into the set
{5• s ~ : v~_< o} Indeed, U ~ = (1 - ~,,~)Vq + ~,~Y~ < (1 - ~,~) x / ~ , / - ~ e ~ + ¢,~U(~) =
= = ~ n [mj=l a x..... l. ~'(~)em+V(~)]-
(4.80)
4.7.
g(#n) :
<
max rain p ~.~bnp "~es~:vv<_oveSN:Vp
max
-~s~:v~o
~Zb-~
_
137
Properties of the optimal solution
~bn ~ 2
\1-¢./
= Const \ l _ wn /
P
I]2<-
max
1t~ p-tl2
";~sN:v'~
-
:
-
from which follows the upper estimation (4.76) Dividing both sides of (4.73) by ~, we obtain
0 > (p*(~, ~ ) - ;*)~ [I + vTv] p'O,~, 6~)+ 1 + < t I v ; * ( ~ , ~ ) + ~*(~,6~)II ~ > >_ (p*O,,~,6,~) - p*)~' [I + vTv] p*(~,,~,~,~) >__ > (p*(~., ~.) - p*)~ [I + v ~ v ] p * ( ~ , 6 ~) + #~nv_ oT , t; • (u., ~.) - p*) = =
(P *(#,~, ~n) - p,)T [ I + v ~ v ] P*(gn, ~ . ) + -- ~n--T
,
.
~-~Vo ~p (~,,~,~,~) - ~-. {p*O,,. ~,~)}) + +#~--T ~-~v 0 (Trp. {p *(p.~, 6~ ) } - p*) >_
>_ (p.(.~, ~ ) _ p.)T [I + vTv] ; * ( ~ , 6~)+ -}-]I~ VT
0 (p* ( ~ , ~.,)- ~P. {p* ( ~ , 5 ~)})
0 0 where ~0T := (Vl,... , VN). From (4.76), it follows
0 >__(p (#n, ~) _ p . ) T
P*(#'~'
8,~ II~otlc (~'~)~
For n --* oo and in view of assumption (4.67), we finally obtain
o > (p~ - p*)~ [I + v ~ v ] p~
(4.81)
for any partial limit Po¢ This inequality can be interpreted as the necessary condition of optimality of the strictly convex function f ( x ) : : l x T (IN + V T V ) x on the convex compact set P* [26]:
(x - z*)TV f ( z *) >_ 0 V x ~ P * x : : p * , x* : a r g f ( x ) xeP*
138 4.7. Properties of the optimal solution But any strictly convex function has a unique minimum. It follows t h a t X* = arg f(x) = Po~ x~.P*
is unique (all partial limits of the sequence {p* (#n, 6n)} coincide). Because f(p*) is equal to the weighted norm of the vector p* E P*, we conclude that p ~ =p** Theorem is proved, m The next theorem states the Lipschitzian property of the optimal solution of the penalty function (4.61) with respect to the parameters # and 6. T h e o r e m 7. Under the assumptions of theorem 6, there exist two positive
constant C1 and C2 such, that [[P*(~I, 51) -- P*(#2, (~2)1t <-- C1 [#1 -~t21 + C2 [52 - t~l[
(4.82)
Proof. Similarly to the proof of Theorem 3, let us introduce the following set Zo :=
p(i) = 1 i=l
Zl(Q,...,is) := {(p,~): p ( i k ) = 0 , k = 1,...,s} N Z0 Z2(jl, ...,jr):= {(p, ~ ) : u ( j k ) = 0, k = 1, ..., r} A Z0 Z3(il, ...,is;j1, ...,jr) := Zl(il, ...,is) N Z2(jl, ...,jr) The total number of these sets is equal to 2 "~ (2 N - 1). They will be denoted by ~ (k = 1, ..., 2rn (2/v - 1)). Let us associate with each set ~k the problem Bk of the optimization of the penalty function Lt,,~(p, u) (4.61) on the set Gk. It is clear that the optimal solution of the initial problem (4.57-4.58) coincides with the solution of one of these problems (Bk), when tt, 5 --* 0. Notice t h a t each problem (/~k) concerns the optimization of the convex function Lu,~(p,~t ) (4.61) subject to equality constraints. Hence, we can use the Lagrange multipliers technique. The necessary optimality conditions are
Vp£t,,~(p, Yt, ,k) = O, V~£,,~(p, ~i.,,~) = O, V~£u,~(p, ~, )~) = 0 where
£u"(P'U'/k) := Lt"'(P'U'~)-/k°
"P(i)-- l ) - - ~
~lk~(ik)-E
These equations can be rewritten as follows:
#Vo + 6p + Y T [Vp + "5] - Aoe N - E A2ke(jk ) = 0 k
4.8. Optimization algorithm
Vp + (1 + 6) ~ - ~ , h k e ( i k ) = 0
139 (4.83)
k
(p, ~) ~ 6k This algebraic system of equations can be written in the following compact form
.Co - ~o~ N - E ~ e ( j ~ )
y T
]
~ )~lke(ik ) or in the following equivalent polynomial form /rn+N
p*(#,6) or ~(~,6) = .
.
",,
= ~+~---
~
r- = 0
/
\
(4.84) i=1 ,...,rn+N
' Let us assume that the r~-first coefficients c~i and a~ of the numerator of (4.84) and, the r:' -first coefficients h~ of the denominator of (4.84) are equal to zero. We can rewrite (4.84) in the following form: p* (#, 6) or ~(p, 6) = • c;,+1 + ~ ;i, + , , + --
,~+N-r~-i . .\ ~ 6s (c i ,. + < i + , , ) | 8~1 rn+N-r"
\
Ti--S
-1
s=l
i
/ l |
•r i -t-8
]
i=l,rn+N
where "Fi : - - ?~i - - 'Pi
In theorem 1 we have proved that p* (#, 6) and ~(#, 6) are bounded for any # and 6 --* 0. It follows: ri>_0 ( i = l , . . . , m + N ) from which the Lipschitzian property (4.82) follows. Theorem is proved. • The learning optimization algorithm and its properties (convergence and convergence rate) will be presented in the next section.
4.8
Optimization
algorithm
The optimization problem stated above will be solved using a learning automaton operating in a multi-teacher environment. We shall be again concerned with the Bush-M(xsteller reinforcement scheme [15].
140 4.8. Optimization algorithm
P,~+I = P~ + ?~ [e(u~)
- p ~ + ~,~(e N - N e ( u n ) ) / ( N
1 Pl(i) ...... N '
-
1)]
(4.85)
(i = l, ... , N )
where ~fn e(un)
E z
[0,1], ¢~ ~ [0,i] (o .... ,o,i,o...o)L
u,~=u(i)
i
(1, ..., 1) T E
e N
The automaton input ~ structed as follows:
R N
(multi-teacher environment response) is con~'~ -
(4.86)
p,~(i)
where --T V,
c5~ (2 n
----
i + (N - 2)T~
2(1 + i N - 2)T,~)
~+ = 1-T,~, ~; = % ~ ( N - 1 ) , 0 < T,~ I 0, ~+ Too and Vn-1 is the estimation of the matrix V (4.62) which elements (Vn-1)ij are constructed according to the following recurrent scheme:
(Un)iJ -- (Un--1)i'
t~l~(?-Q=~(i)) ~j
~l )~(Ut z U(i)) ] (i=
1.... ,N;j=
I ..... m)
In view of temma 2 it follows that [ (N - 1) (T~)2
]
The slack variables ~,~ will be adapted according to the following algorithm
~,.+l(j)= [ ~ ( j ) - ~X [ ~ + (i + e~) ~(J)]]o "+ ~>0,
~l(J)>0
(j=l
..... m)
(4.87)
The convergence properties of this optimization algorithm are stated by the following theorem.
4.8. Optimization algorithm
141
T h e o r e m 8. Consider the optimization algorithm (~.85), (~.86) and (4.87) subject to assumptions (H1) and (H2) and let I) the sequence {-y~} be selected as follows: "y,~ -
N N-1
a,~ %
(4.88)
2)
--
2g;[
an-1 _lt:=d
(4.89)
3) there exist a positive sequence ~-n ~ 0 such that
oo ~
~nTn~n ( ~ ) -1 : O0
(4.90)
then, if
4) .E :,
-'
~ i + ~ + v.
< oo
(4.91)
where
V "I~-, and
8n ::: ~=2_~"/n--1 ~ ,2]J q -7n--1 --'~u--1 [l#n -- #n+ll 2 -~-[~n --On+l[ (~~(n~W n u))22~ r ~ - - I -I '~nU--l'fn--1 [ [(,.,fnu)2 ( ~ ) 2
[]~n- ~n+ll 2 -~-I~n -- ~n-I-112]
then, W,~ : = [tpn - p~tl2 -I- --~--- II~,~ - ~ l t 2
"~--t
~:~rl.~(x:,0
where Pn "= P (Pn,6n), un := - V p ~
then, if
5) i
~.+
v~+
then,
E{W.}
--, 0 ~oo
-~ 0
(4.92)
142 4.8. Optimization algorithm
Proof. The optimization algorithm and the properties of the projection operator lead to
W,~+l -<- lpn, + ,~n, [e( u,~) _ pn, + ~n,1--~N--e(~u~) ] 7n,
u,~+l[[
= lip - p~,ll 2 + ~
+27n" (p _ p~) Av +
IIA~II 2 +
lip:. -
DI,~,' _
~:.,.lt2
%~-_..~1
p:,+, ÷
* 1I2÷
-- Pn+
=
II~ +
(I,~)2 IIA~II 2 +
÷ II~a - Un+l II~ - 2,~a (~n, - ~a) ~ A,~ ÷ 2 (~n, - ~:~)~ (~:~ - ~:~+1) -
-2"r~ ( ~
-
(~Y~.__.~ "Yn,-1 ~ + 1 ) T A~] + \TX 7,~-1]"~ II~n, - ~ A , ~ -
~'n,+, II
where
A~ : =e(u~)-pn,+~n,
1 Ne(u~) N-1 -
-
A,~ : = ~ n , + ( l + 6 ~ l ~ n , From theorem 7 it follows that Wn,+1 ~ Wn, + C v n , v / - ~ +2%~
(p -p~)A~ - z ~'~n,-1%~u (~,~ - ~ ) T An,u +~n, 7~,-1
(4.93) where ~n,=47~+2
(~n,--1) [(~'2]~n, [£n+l,2÷C2[~n-~n,+l[2] 1+~
÷
q'r~--1
7'-'2 "~n,"~X-- 1 (Cl Ip,n, - p,n,+, I + C2 le,, - e,,+, I) (C3 + t',;, V'~) -1-.~_1-- (Q'~) On ÷ 2
"/u-1
+
(~,~)~ on, + II~n, - ,-,n,+lll )
and 'rp.,
0n, :: ~ : ( ~ ; ~ ) ~ + m ( ~ )
j=l
2, c 3 > 0
Notice that tl~n,- ~,112 -< ~'~ wn, 3'n,
II~n, - ',n,+, II ~ *. lien, - ~11 ~ + 2 I1~;, - ~5+11[~ <
(4.94)
4.8. Optimization algorithm < 2 ~-" w . + 2 [c~ I ~ - ~°+ lJ 2 + ~
143
t~° - 6.+1t 2]
t
J
Using (4.41)-(4.43) we derive:
E{A~[.T',~} a'-~'-an [VpL~.,,~.(p,~,Sn)+o,.,(-x~-n)] -b,~eN
(4.95)
Here the term o ~ ( : ~ ) has appeared as a result of the application of the strong law of large numbers [28],[30]: V , ~ - V = o~
....
(_~_n~_n)
Also we have E ( ~ . + 5n15~.} ~'~' Vp,~ + (1 + e~)5. = V;L..,8. ( p n , ~ )
(4.96)
Taking into account the strictly convex property (4.72), we conclude that
2~,~ (p - p~)T E {A~t~'= } - 2"y: (5 n - 5;r+l) T E {~n -Jc Un[~Yn} a..~s. = --2~/.a,~ (P -- Pn)T VvL..,,~., (Pn, ~,~)-2~
(~
±
un+l) v ~ n . . , ~ ( p , t , 5 ~ ) +Tna,~o,.(~) <_
- -*
_<--23'na,~Amin(H) ([[Pn- P:II 2 + NS,,- 5:1[ 2)
+^/nano,.,(--v~-n) (4.97)
From (4,71) we derive
H = [ 5IN + VTV V
(1
VT ] [ vTv + 6) I.~ = Y
> [ 6IN -
0
0
vT ] [ 5IN 0 ] I.~ + 0 6Ira >-
] =6IN+ m
6i~
and hence, Amid(H) _> 5
(4.98)
Substituting (4.98) into (4.97) we derive
2%~(p - p~)T E {A~P~[J:.} - 2~/~ (5~ - 5~+1) T E {~. + 5.[~'~ } ~_~" <_ -2~.~.~.
-2~/.a.Sn
1
(lip. - ;;,ll ~ + 115. - 5~ll ~) + ~.~.o~(~1_< Ilp.-p~.]l 2
[]5, - 5~,.112 3,-1
+3,,ano,~(1-1r-) = ~/n
144 4.8. Optimization algorithm
= --2%~a.~.V~ + %~a.o~ (--~n~n)
(4.99)
Using (4.99) into (4.93) and taking into account that
after the application of the operator of conditional mathematical expectation and using the following inequality
2v,~x / ~ . <_v.Wn + vn, v,,, W,, >_0 we obtain E{Wn+l[Jr,~} _
1-2"y~m,~6.('~+) -I
1
(4.100) + C v . ] W. + 7 : .
~+ -1 1 (u.) o~(-~)+~.+Cv.
where +
:-[
+~Tn-1
+
~/'2-'N
~ - - ~ - (c, I . . - ..+,1 + c~ 16. - 6.+11) tc~ + 7.",/~.)
+
7n-1
+7,~-I (7~) ~ On+ 7n- 1
-{-2 ~'~Tn 7n--17"--1[(Tn) On -~-2
--
12
Taking into account assumption 3) of this theorem we can rewrite (4.100) as follows:
E{W,.+,:.} Y [~- 27:.e. (~X)-' (1-d + o(~)) +c~.] wo+ ~+ -~ 1 +%:-. ( u . ) o,.(-~)+~n+Cv.
(4.101)
From this inequality and Robbins-Siegmund theorem [32] (see Appendix B) under the assumptions of this theorem we obtain the convergence with probability one, i.e., W,~ %-~"0 The mean squares convergence follows from (4.101) after applying the operator of mathematical expectation to both sides of this inequality and using lemma A5 given in [4]. Theorem is proved, m
4.8. Optimization algorithm
145
C o r o l l a r y 3. If the parameters of th,e optimization algorithm, involved in the assumptions of theorem 7, belong to the following class of parameters "~o 6o ._ #o (4.102) ?%
T,~:: N _
'
/t 6 ~
?%p~
1 -+ Const + n ~ In n, Const > 0 l + n ~ l n n, u n : : ~, ~, #, T > 0 ,
u>O
then 1) the convergence with probability one is ensured, i.e.,
for 7+6+U+T=
1, ~ + ~ - + u >
1
~
(4.103)
2#> U+T, 25>U+T 0 := min{2"y, 2(1 + p ) - - u - - T , 2~--U+T.
I+#+"y--U,
2(1 + 6 ) - - U - - ' r , I+6+"y--U}>
1
2) the mean squares convergence is guaranteed, i.e.,
E
0
for 2 + 6 + U + T = 1, 26 < 1, 0 > 1, 2# > U+~', 26 > U+T (4.104) Proof. Notice that
1
~ : 0(~) From the conditions of theorem 7 and (4.102) follows the desired result. • The next theorem gives the estimation of the order of convergence rate of this optimization algorithm. T h e o r e m 9. Under the conditions of the previous theorem and for the class of parameters (4.102) there exist u > 0 such that W~ :
o~o
,E
14
:o
(1)
(4.1o5)
146 4.8. Optimization algorithm where the order y o f convergence rate satisfies the f o l l o w i n g upper estimation 1
u < u*('y,) <_ u*" = 3
and the m a x i m u m convergence rate u** is reached f o r 7 = 7 " = - 2 3' 6 = 6 " =
p =#*=
1 T =v*= ~,
u =u*=
~1
(4.106)
u * ( 7 * , t * , ~*) = u** Proof.
Notice that w;
= IIp~ - p**tl 2 + tlu:~ - ~ * ' I ] 2 <_
<_ 2vv~+ c ( ~ + ~) Multiplting both sides of the last inequality by vn, we obtain ~ W ~ < 2 ~ W ~ + C ~ ( , ~ + 6~) In view of lemma A.3-2 (see appendix A) for ~ := n v we derive that the order must satisfies the following inequalities: u
0--1,~+T+U--2,
+ 2# -- U -- T, l + 26 -- U -- T, 2#, 26
where the design parameters satisfy (4.8): "y+6+u+~=l,
~/+~-+u>-
2 # > u + T , 26 > u + T ,
1 2
8>I
and, similarly to the Lagrange multipliers case, 6_
1 ~--6=26,
0-i=i--6
or
6:6":#=#*
1
=?,o:
Taking into account that u+~- < 1
8*
1 u.
3'
1
3
4.9. Numerical examples
147
we conclude that O* = =
min
6-"/),2~,-u+'r,~+3,-u
=
min 2 % 2 ~ , - u + 7 " , ~ + , ~ - u
if the corresponding parameters are respectively equal to 2
1
Theorem is proved. • Apart from the selection of some design parameters (%, 5n, etc.), the Lagrange multipliers and penalty function approaches generate different optimization algorithms. In the next section we shall be concerned with the implementation aspects of these optimization algorithms.
4.9
Numerical examples
In this section computer simulations are presented to illustrate the performance of learning automata to solve constrained stochastic optimization problems on the basis of Lagrange multipliers and penalty function approaches. E x a m p l e 1 In the first simulations, the following numerical example has been considered 2p(1) + p(2) -~min pES ~
subject to p(1) + 4p(2) < 1.5 This problem can be rewritten as follows FT p -~min p~0
subject to Ap<_b where FT----[2-1]'
P=
p(2)
1
1
,
b=
[/] 1.5
The first line of the matrix A corresponds to the simplex constraint p(1) + p(2) = 1
148 4.9. Numerical examples
t~'1(1}
0.9 ~
0.8029
0.8
~ 0.7 0.5 0.5
s6o
io6o
is6o
2o'oo
zdoo
30'o0
3~oo
*, 00
Reratio~ i~umbe~ 0.5 ~
0.4
~ 0.3 0.1971
0.2 0.1
~o
,Q6o
15~o
zo'oo
z~'oo
00
Iterat:i¢~ number
F I G U R E 4.2. Evolution of the probabilities.
According to our notations, it follows: v°=2,
v°=1,
v] : 1 - 1 . 5 : - 0 . 5 ,
v22 = 4 - 1 . 5 = 2 . 5
The solution of this linear programming p'mblem is 5
p(1) = ~,
p(2) =
1
A two actions automaton has been considered. The automaton operates in a random environment which corresponds to the constrained stochastic optimization problem to be solved. Th,is environment produces an output (response) different (continuous) from the reward/inaction (penalty) signal which gives partial information about their states. 2 The second set of simulations concerns the following numerical example:
Example
p(1)+3p(2)+5p(3)+p(4)+p(5)--*min pE S ~
subject to 5p(1)+2p(2)÷p(3)+5p(4)+5p(5)<2
4.9. Numerical examples
149
0.2005
0.2004
0.2003
. • -0.2002 fi
,% 0.2001
$ 0.2
0.1999
s~o
1o6o
is~o
zdoo
2s'oo
3doo
3doo
Iterationv~umber
FIGURE 4.3. Evolution of the Lagrange multiplier. p(1) + 4p(2) + 2p(3) + p ( 4 ) + p ( 5 ) _< 3 This problem can be rewritten as follows
FT p --*rain p>o
subject to Ap<_ b where
p(1) p(2) FT
A
=
[2-3511],
p(3) p(4) p(5)
p=
[,,,, 5
2
1
5
5
1
4
2
1
1
,
b=
[1] 2
3
The first line of the matrix A corresponds to the simplex constraint p(1)+p(2)p(3)+p(4)+p(5)=l
~o
150
4.9. Numerical examples
x 0.001 A
8
FIGURE ~.~. Evolution of the loss function. According to our notatio,r~s, it .follows: v ° = 1, ~o = 3. ~o
v~
=
5-2=3,
v
=
5-
5, v o = 1 , v ° = 1
4 .... 2 - 2 = 0
2 = 3, v 51 = 5 - 2
vt=1-2=-1 3
v~
=
1-3=-2.
v~
4-3-:1,
v 42
=
1-3=--2,
v 52 =
1 -- 3 = --2
v~=2-3=-1
Notice that this constrained optimizagon problem has many solutions. For example: p ( 1 ) = 0.0370, p ( 2 ) = 0.5566, p(3) - 0.3333, p ( 4 ) = 0.0370, p ( 5 ) = 0 . 0 3 7 0 p ( 1 ) ----0.1111, p ( 2 ) = 0.5566, p ( 3 ) = 0.3333, p ( 4 ) = 0.0000, p ( 5 ) = 0.0000 p ( 1 ) = 0.0000, p ( 2 ) = 0.5566, p ( 3 ) = 0 . 3 3 3 3 , p(4) = 0.1111, p ( 5 ) = 0.0000 p ( 1 ) = 0 . 0 0 0 0 , p ( 2 ) = 0 . 5 5 6 6 , p ( 3 ) = 0 . 3 3 3 3 , p ( 4 ) = 0 . 0 0 0 0 , p ( 5 ) = 0.1111
A n automaton of five aetioT~s has been eo~r~sidered.
4.9. Numerical examples
151
2.2
$
1.8050
;~ ts 1.$ 1.4
~bo
lo6o
m60
2o'o0
2~o0
3o'o0
3¢00
Itefatio~ 'nUrr~be~"
0.5
0.0842
8
0
-0.5
-1
~6o
106o
ls6o
zdoo
zs'oo
3o'00
3~'oo
)0
Iteratian number
FIGURE 4.5. Evolution of the criterion and the constraint.
4.9.1
LAGRANGE MULTIPLIERS APPROACH
For every time n, the optimization algorithm based on the Lagrange multipliers approach performs the following steps:
• Step O. Choose the parameters setting of the adaptation algorithms (4.33) and (4.34), the number of actions N and, initialize the probability vector P] (Pl (i) := 71, i = 1, ..., N ) in the case where no prior information is available) and the other design parameters for example as follows AI(j)=I,
Al:== 1, a~
1
7+max
v~ , ( j : l , . . . , m )
1 - - 60, A + ---- 104 -t- i n n 61 - - N - 5 1
• Step 1. generate a uniformly distributed random variable, say z (z E (0, 1)) and select, the control action u(i) according to the following rule: the index i is given by i
......rain j subject to E p ~ ( j ) j=l
>_ z
152 4.9. Numerical examples Pn(1)
~0,2
~01 ~ £ n 0 1( ~, = Pn(Z) g 05
-
~ i
500
10'00
,
-
-
-
-
-
-
2o'0o
15;0 ,
aooo
'
0.4974
]
I
,o
n ~0.4,
x~
00 ~ Pn(3)
500
1B08
! 500
,
,
/
2000 .
25'00
3880
4
0.2399 J
o go.2
[
o,
~(4)
o.4° P~(5)
~oo '
s;o
.......... iooo
1500
,
2000
25'00
i
_
doo
?oo
2o'oo
o0857 25'00
3000 0.0499
~0.2 o.
3000
0
I
0
500
I 0;0
15'00 Ite ra tJon numb er
2000
25;0
3000
FIGURE 4.6. Evolution of the probabilities. • Step 2. generate (rn + 1) uniformly distributed r a n d o m variable, .say rl~ ( ~ E (0, 1), (j = 0 ..... m ) ) arm construct the vector ~n as follows i
vJ
• Step 3. construct the normalized a u t o m a t o n input (~ (environment response) according to (4,30) and (4.31). • Step J. adjust the probability distribution Pn and the LagTange multipliers An(j) according to (4.33) and (4.34). • Step 5. go to step 1 at the next time ~, + 1.
T h e p a r a m e t e r setting for the Lagrange multipliers optimization algor i t h m are the following: A1 = 0.2, 20 = 0.05, ~ = 0.66, 6o ::= 0.01, r = 6 = 0.16 Some simulations results related to example 1 are presented in the following.
4.9. Numerical examples
153
0.9 ~: 0.8
0.8259 0.7
0.$ 0.5
z~o
4~o
~6o
o6o
1obo Iz~o 14bo io~o 18~o
2 ]0
0.5
0.3
0.1741
0.2 0.1
z~o
460
66o
8~o
lo6o lz~o
146o
166o
18~o
)0
Iteraticm~ b e f FIGURE 4.7• Evolution of the probabilities. T h e variation of the components of the probability vector are depicted in Figure 4.2. These components converge respectively to pn(1) = 0.8029 a n d p~(2) = 0.1971. Figure 4.3 presents the evolution of the Lagrange multiplier A~(1) which converges after 2600 iterations. Figure 4.4 shows the loss function (~n) variation. T h e evolution of, respectively the criterion (~°n) and the constraint ((I)~) are given in Figure 4.5. The minimal value of the criterion is equal to 1.8050. T h e following simulation results concern the second example. T h e evolution of the components of the probability vector pn(i)(i ---- 1, ..., 5) is represented in Figure 4.6. T h e constraints are satisfied.drawn in Figure 4.9. The minimal value of the criterion is equal to 2.9704. The learning a u t o m a t o n provides a good optimization performance in the corresponding stochastic environment.
4.9.2
PENALTY
FUNCTION
APPROACH
We will now a t t e m p t to solve the previous constrained stochastic optimization problems using the penalty function approach. The mechanization of the optimization algorithm based on the penalty function is similar to the previous one. T h e p a r a m e t e r setting for the optimization algorithm based
154 4.9. Numerical examples
x 0.001 9
00 It~eati0~ number
FIGURE 4.8. Evolution of the loss function. on the penalty function approach are the following: 2 C o n s t = 20, ~0 = 0.05, " / = ~, 50 = 0.01, #0 = 0.05 1
1
A learning automaton operating in a multi-teacher environment (with continuous response) has been implemented to solve the previous optimization problems on the basis of penalty function approach as described in the previous section. For every time n, the optimization algorithm based on the penalty function approach performs a set of steps similar to the previous one (Lagrange approach). For the first example (automaton with two actions and one constraints), the evolution of the probabilities is depicted in Figure 4.7. The evolution of the loss function is depicted in Figure 4.8. The evolution of, respectively the criterion (go) and the constraint (~ln) are given in Figure 4.9. The criterion converges also to the same value 1.8244. These Figures indicate clearly the performance of learning automata when used for solving stochastic constrained optimization problems. For the second example, the evolution of the probabilities is shown in Figllre 4.10.
4.9. Numerical examples
155
2.2
.§
2 1.8244
~S 1.8 1.6 1.4
'260
460
~6o
860
w6o
iz6o
146o I~6o 18~o
2~ O0
Ite~atio~~ur~be~
C
0.5 0.025~3 0 -0.5
z~o
4~o
~6o
e6o
lo6o lz6o
146o 1~6o
186o
Itetati0~~urnbet FIGURE 4.9. Evolution of the criterion and the constraint. T h e constraint ( ~ ) can be seen that:
is satisfied5. The criterion converges to 3.t207. It
the learning a u t o m a t o n has efficient learning ability the rate of convergence is good. The Lag-range multipliers and penalty function approaches lead to the same results. The above simulation results imply t h a t learning a u t o m a t a can be used to solve constrained optimization problems. T h e y can carry out a powerful search for the optimal solution and do not involve problem such as functional knowledge and computing robustness. T h e authors would like t h a n k Mr. Eduardo Gomez Ramirez for his assistance in carrying out these simulation results.
156 4.10. Conclusion
0.
0.0008
I°
500
-~ 05 ~ "01
o. ~.
~
1 0.5
o
1
~_ -I~ o_
,
l
,
~
,
1000
1500
"
'
~
560
lo6o
156o
'
'
lO6O
156o
P
~ _4
560
)
'
os o
,
500 Pm(3j
o~
I~0o-
-
2o00
.....
0
os '
1o0o
500
1ooo
~
z:/oo
s o
0.5279' _. . . .
2doo
25'00 0.2218"
' 2o'oo
3000 _
~.o'o0
_
J
~.5'oo
. '. . . 0.092~ . . . 15oo ~.o'oo Iteratkm t~umbe¢
J
Joo
z~oo o.ls?J
I
zs'oo
o I ~ooo
FIGURE 4.10. Evolution of the probabilities.
4.10
Conclusion
In this chapter we have formulated and solved a stochastic constrained optimization problem on finite set using a variable structure learning automaton operating in a random multi-teacher environment (construction of the environment response on the basis of the available measurements). The equivalence of this problem with the stochastic linear programming problem has been stated. Two approaches have been considered: Lagrange multipliers and penalty function. Two new tools: regularization and normalization have been introduced. The asymptotic behaviour (convergence and convergence rate) of these optimization algorithms (learning systems) has been investigated using martingales and Lyapunov approach. It has been shown that the probabilities tend to the optimal strategy. This theoretical study clearly show the performance of the behaviour of learning a u t o m a t a as a tool for solving stochastic constrained optimization problems. The Lagrange multipliers and the penalty function approaches exhibit the same convergence rate (o~(n-1/3)). In the penalty function approach we have been concerned with the estimation of the matrix related to the constraints. It has been shown in [24] that the convergence rate (o~ (n-2/5)),
4.10. Conclusion
157
associated with projectional gradient scheme, is greater than the convergence rate of the optimization algorithm based on the Lagrange mulipliers and the penalty function procedures, presented in this chapter. Several simulation results have been presented. These results illustrate the feasibility and the performance of the learning automata in connection with the Lagrange multipliers and the penalty function approaches, to solve constrained stochastic optimization probler~s. It suffices to say, in conclusion, that learning automata form a cornerstone for solving several complex engineering problems. The previous chapters of this book have been mainly concerned with the use of learning a u t o m a t a operating in stationary random environments to solve stochastic optimization problems. However, in most situations there are a nmnber of nonstationary processes which have to be taken into consideration. Methods for extending, in some sense the approach based on learning a u t o m a t a to the optimization problen~s involving nonstationary processes are discussed in the following chapter.
References [1] Clarke F H, Dem'yanov V F, Gianessi F 1989 Nonsmooth Optimization and Related Topics. Plenum Press, New York [2] Najim K 1989 Process Modeling and Control in Chemical Engineering. Marcel Dekker, New York [3] Najim K, Oppenheim G 1991 Learning systems: theory and applications. IEE Proceedings Computer and Digital Techniques 138:183-192 [4] Najim K, Poznyak A S 1994 Learning Automata: Theory and Applications. Pergamon Press, Oxford [5] Wong E 1973 Recent progress in stochastic processes - a survey", IEEE Trans. on Information Theory. 19:262-275 [6] Bertsekas D P 1982 Constrained Optimization and Lagrange Multiplier Methods. Academic Press, New York [7] Rockafellar R T 1993 Lagrange multipliers and optimality. SIAM Review 35:183-238 [8] Rockafellar R T 1970 Convex Analysis. Princeton University Press, Princeton [9] Sp(~sito V A 1975 Linear and Nonlinear Programming. The Iowa State University Press/AMES [10] Martos B 1975 Nonlinear Programming Theory and Methods. NorthHolland Publishing Company, Amsterdam [11] Whittle P 1971 Optimization under Constraints. Wiley-Interscience, New York [12] D.G. Luenberger D G 1965 Introduction to Linear and Nonlinear Programming. Addison-Wesley, London [13] Avriel M 1976 Nonlinear Programming: Analysis and Methods. Prentice-Hall, Englewood Cliffs [14] Liukkonen J R, Levine A 1994 On convergence of iterated random maps. S I A M J. Control and Optimization 32:1752-1762
References
159
[15] Bush R R, Mosteller F 1958 Stochastic Models for Learning. John Wiley &; Sons, New York [16] Narendra K S, Thathachar M A L 1989 Learning Automata an Introduction. Prentice-Hall, Englewood Cliffs [17] Najim K, Poznyak A S 1996 Multimodal searching technique based on learning automata with continuous input and changing number of actions. IEEE Trans. on Systems, Man, and Cybernetics 26:666-673 [18] Poznyak A S 1973 Learning automata in stochastic programming problems. Automation and Remote Control 34:1608-1619 [19] Baba N 1984 New Topics in Learning Automata Theory and Applications. Springer-Verlag, Berlin [20] Polyak B T 1987 Introduction to Optimization. Optimization Software, Publication Division, New York [21] Kaplinskii A I, Propoi A I 1970 Stochastic approach to nonlinear programming problems. Automation and Remote Control 31:448-459 [22] Kaplinskii A I, Poznyak A S, Propoi A I 1971 Optimality conditions for certain stochastic programming problems. Automation and Remote Control 32:1210-1218 [23] Kaplinskii A I, Poznyak A S, Propoi A I 1971 Some methods for the solution of stochastic programming problems. Automation and Remote Control 32:1609-1616 [24] Nazin A V, Poznyak A S 1986 Adaptive Choice of Variants. (in Russian) Nauka, Mascow [25] Vajda S 1972 Probabilistic Programming. Academic Press, New York [26] Zangwill "W I 1969 Nonlinear Programming: A Unified Approach. Prentice-Halt, Englewood Cliffs [27] Garcia C B, Zangwill W I 1981 Pathways to Solutions, Fixed Points, and Equilibria. Prentice-Hall, Englewood Cliffs [28] Doob J L 1953 Stochastic Processes. John Wiley t~ Sons, New York [29] Charnes A, Cooper W W, Thompson G J 1964 Critical path analysis via chance constrained and stochastic programming. Operations Res. 12:460-470 [30] Ash R B 1972 Real Analysis and Probability. Academic Press, New York
160 References [31] Arrow K J,Hurwics L, Uzawa H 1961 Constraint qualifications and maximization probler~s. Naval Res. Log. Quart. 8:175-191 [32] Robbins H, Siegmund D 1971 A convergence theorem for nonnegative almost supermartingales and some applications. In: Rustagi J S (ed) 1971 Optimizing Methods in Statistics. Academic Press, New York [33] Albert A 1972 Regression and the Moore-Penrose Pseudoinverse. Academic Press, New York [34] Slater M 1950 Lagrange multipliers revisted: a contribution to nonlinear programming. CowIes Discussion Paper 403 [35] Spingarn J E, Rockafellar R T 1979 The generic nature of optimal conditions in nonlinear programs. Mathematics of Operation Research 4:425-430
5 Optimization of Nonstationary Functions 5.1
Introduction
In most engineering problems it is assumed that the disturbances are stationary. Hence this assumption on real sequence is really only one of the properties needed in order for the results in the stochastic analysis to be true. In practice, many optimization and control problems involve nonstationary functions a n d / o r nonstationary observation noise. For example: Planning and operation of power systems lead to the prediction of power load. The prediction is needed for a variety of times, ranging from years down to fractions of an hour (short-term and longterm prediction). The power load can be regarded as a nonstationary random process. It has a noticeable seasonal pattern and a periodic structure where the main period is one week. It is influenced by e.g. domestic and industrial needs and long or short term weather conditions [1]. In process design problems (process flow diagram and major equipments list determination), the objective function consists of the sum of the annualized investment cost plus the sum of the operating costs [2]. While the choice of the above objective function appears reasonabte, it fails to account for the fact that a large uncertainty may be involved in future operational elements, such as parameter drift (transfer coefficients, reaction constants, etc.), disturbances (internal and external), fluctuations of the value of the money, of the quality of the feedstreams, etc.. Most of these perturbations are nonstationary. In industrial operations the optimum conditions change as time passes. Any observations are obscured by disturbances, and it is not easy to distinguish between true changes in optimum conditions and the spurious fluctuations caused by random perturbations. • Nonstationary data appear in systems identification [3]-[4]. In this chapter the use of learning automata operating in asymptotically nonstationary environment in order to solve this type of optimization prob-
162 5.2. Optimization of Nonstationary Functions lems will be discussed. The results derived here enlarge considerably the area of applications to which learning automata may be applied. The optimization problen~s considered in this chapter are stated in the next section.
5.2
Optimization
of N o n s t a t i o n a r y
Functions
In this section we describe the general type of optimization problem we shall interested in. Let us consider the optimization of nonstationary functions defined on finite sets: f~(u,~) - ~ i n f , u~ e U == (u(1), ...,u(N)) (5.1) ~n
We assume that only disturbed data Yn of the function ft(u) are available, i.e.~
y~ = y~(~, ~) = f~(~) + ¢~
(5.2)
Let us assume that all nonstationary effects belong to the class o f " a s y m p totically stationary in average" processes, i.e. they satisfy the following assumptions: ( H 1 ) there exists
E {y~/u,~ = u(i) A ~ - 1 }
== f,~(u(i)) + E {;~/u~ = u(i) A J:,~-l} = y~(i)
(5.3) (H2)
Tt
!n ~L(i) ~ f(i)
(5.4)
The first assumption states some kind of restriction to the properties of the observation noise. For example, ( H 1 ) will be satisfied if at each time n the noise ~n has a bounded second moment, i.e.,
This property holds, for example, for Gaussian noises and is not true for noises having Cauchy distribution. The second property (assumption) represents some kind of "stationarity in average" and will be satisfied for optimized functions and noises which are stationary in average. The following examples illustrate these characteristics E x a m p l e 1 Let us consider the function given by (see Figure 5.1)
fn(u) = (u - u* s i n ~ n - e) 2 , w,c > 0 For independent centered normal distributed noises, it follows:
(5.5)
5.2. Optimization of Nonstationary Functions
163
ft[u[i))
uli)
) C t
FIGURE 5.1. Nonstationary function.
n1 ~-~.?t(i ) _= n ~ t~l
[ft(u(i)) + E {~t/ut -= u(i) A .~'l- 1}]
----
t~l
1 '~ = - ~_~ (u(i) - u* sin cot - e) 2 = (u(i) - c) 2 n
t=l
--2 (u(i) -- c) u* _1 ~ n
sin cot + (u*) 2 i ~ n
t=l
sin2 cot
t=l
and hence, assumption ( H 2 ) is fulfilled 1
r~
n ~?t(i)
-'~ (u(i)
7t$---}00
c) 2 + (u*)2 := B i ) 2
t=l
E x a m p l e 2 Independent Gaussian noises .Af(atcr2) where at is periodi-
cally time-varying and satisfies the .following constraint 1
'~ t=l
For this example, the previous assumptions hold. T h r o u g h o u t this chapter we will assume t h a t these two assumptions ( ( I l l ) a n d ( I t 2 ) ) are satisfied. Under the~e assumptions we will be inter-
164 5.3. Nonstationary learning systems ested in the following stochastic optimization problem given on a finite set: lim 1 -
n~o~
n
t=l
y~ --~ inf {ut}
(5.6)
This stochastic optimization problem (5.6) presents the correct mathematical formulation of the nonstationary optimization problem (5.1), using the observations (5.2). Hereafter we will associate these observations of the disturbed flmction to be optimized with the response of an environment response. Consequently, we can associate this stochastic optimization problem on finite sets (5.6) to the behaviour of a learning automaton operating in a random environment which is asymptotically stationary in average i.e.,
yt = ~
(5.7)
The next section presents an overview of nonstationary environments.
5.3
N o n s t a t i o n a r y learning s y s t e m s
Learning systems are an efficient tool to deal with a large number of engineering problems [5]-[6]-[7]-[8]. A learning system interacts with an environment and learns the optimal action which the environment offers. Most of the available studies relate to the behaviour of learning automata in stationary environments [5]-[8]. The problem concerning the behaviour of learning automata in nonstationary environments is difficult, and few results are known [7]-[91-[101-[111-[12]-[13]. Narendra and Viswanathan [13] considered periodically changing nonstationary random medium with unknown period. A nonstationary environment in which one action u s continues to have the minimum penalty ca even though all the penalty probabilities keep changing with time, i.e., cry(t, w)+ 6 < cj(t, co) holds for some ~, some 6 > 0, and for all j (j # a), and for all random factors co; has been studied by Baba and Sawaragi [14]. These results have been extended to the nonstationary multi-teacher environment [12]. Several basic norms (expediency, optimality, etc.) of the learning behaviour of stochastic automaton in multi-teacher environment are given in [12]. A variable-structure stochastic automaton where the penalty probabilities have been assumed to be time-varying and depending on the input action chosen, wz~s introduced by Narendra and Thathaehar [15]. Srikantakumar and Narendra [9] have developed an adaptive routine in telephone networks using learning methods. They have considered a nonstationary environment for which reward probabilities ci (p) and their derivatives are Lipschitz functions of all their arguments and
Oci (p) 0pi
>0Viand
Ocj (p) Oci (p) for j # i Opi < <
5.3. Nonstationary learning systerrLs
165
So, this environment is nonstationary because the probability vector $p$ changes with time, $p = p(n)$ (the probability vector is updated using a reinforcement scheme). An $L_{R-P}$ scheme parametrized at step $n$ by an intermediate parameter vector $\theta_n \in R^m$ has been suggested by Barto et al. [10]. This scheme has been used to solve a class of learning tasks that combines aspects of learning automaton tasks and supervised learning pattern-classification tasks (associative reinforcement tasks). In [16], the environment (the penalty probabilities) was modelled by a difference equation for the prediction of the transient behaviour of the learning system. Another source of nonstationary environments is the concept of multilevel systems of automata, which has been introduced by Thathachar and Ramakrishnan [17], and Najim and Poznyak [5]. A nonstationary environment arises indirectly in connection with hierarchical systems of learning automata [7]. It has been shown in [5] that the use of a hierarchical system of learning automata accelerates the learning process. The latter reference also discusses in fair detail some of the applications of hierarchical structures of learning automata. Baba and Mogami [11] have shown that an extended form of the scheme proposed by Thathachar and Ramakrishnan [17] ensures absolute expediency in a nonstationary environment having the property that there exists a unique path which receives the least sum of the penalty strengths in the sense of mathematical expectation.
In this work, we consider the behaviour of learning automata operating in a nonstationary environment. The conditional expectations of the environment responses are assumed to be time-varying. A normalization procedure is introduced to deal with environment responses which do not belong to the segment [0, 1]. The class of asymptotically stationary environments (in the average sense) is also introduced. We show that several nonstationary environments such as:
• Markovian environments
• Periodically changing environments
• Multi-teacher environments
• Environments with fixed optimal actions
belong to this class of asymptotically stationary environments. Several theoretical results are stated. These results concern the properties of reinforcement schemes, the normalized environment response and the asymptotically optimal behaviour of different learning automata. These results are divided into a succession of lemmas and theorems for greater clarity. Some key lemmas are given in Appendix A. The learning automaton and the nonstationary environment are described in the next section.
5.4 Learning Automata and Random Environments
In this section we consider the behaviour of a variable-structure stochastic automaton in a nonstationary environment. The interaction between the stochastic automaton and the environment is shown in Figure 5.2.
FIGURE 5.2. Learning automaton interacting with environment.
The role of the environment (medium) is to establish the relation between the actions of the automaton and the signals received at its input. Let us recall the definition of a learning automaton. An automaton is a sequential machine described by (see the previous chapter):
\[ \{\Xi,\ U,\ (\Omega, \mathcal{F}, P),\ \{\xi_n\},\ \{u_n\},\ \{p_n\},\ T\} \]
where:
(i) $\Xi$ is the bounded set of automaton inputs.
(ii) $U$ denotes the set $\{u(1), u(2), \dots, u(N)\}$ of actions of the automaton.
(iii) $(\Omega, \mathcal{F}, P)$ is a probability space.
(iv) $\{\xi_n\}$ is a sequence of automaton inputs (environment responses, $\xi_n \in \Xi$) provided by the environment in a binary (P-model environment) or continuous (S-model environment) form.
(v) $\{u_n\}$ is a sequence of automaton outputs (actions).
(vi) $p_n = [p_n(1), p_n(2), \dots, p_n(N)]^T$ is the probability distribution at time $n$:
\[ p_n(i) = P\{\omega : u_n = u(i) \mid \mathcal{F}_{n-1}\}, \qquad \sum_{i=1}^N p_n(i) = 1 \quad \forall n \]
where $\mathcal{F}_n = \sigma(\xi_1, u_1, p_1; \dots; \xi_n, u_n, p_n)$ is the $\sigma$-algebra generated by the corresponding events ($\xi_n$ is $\mathcal{F}_n$-measurable).
(vii) $c_n = [c_n(1), c_n(2), \dots, c_n(N)]^T$ is the vector of conditional mathematical expectations of the environment responses (at time $n$).
(viii) $T$ represents the reinforcement scheme (updating scheme) which changes the probability vector $p_n$ to $p_{n+1}$:
\[ p_{n+1} = p_n + \gamma_n T_n(p_n; \{\xi_t\}_{t=1,\dots,n}; \{u_t\}_{t=1,\dots,n}), \qquad p_1(i) > 0 \quad \forall i = 1,\dots,N \tag{5.8} \]
where $\gamma_n$ is a scalar correction factor and the vector $T_n(\cdot) = [T_n^1(\cdot), \dots, T_n^N(\cdot)]^T$ satisfies the following conditions (for preserving the probability measure):
\[ \sum_{i=1}^N T_n^i(\cdot) = 0 \qquad \forall n \tag{5.9} \]
\[ p_n(i) + \gamma_n T_n^i(\cdot) \in [0,1] \qquad \forall n,\ \forall i = 1,\dots,N \tag{5.10} \]
This is the heart of the learning automaton. Different reinforcement schemes satisfying the conditions (5.9), (5.10) are given in Table 2.1 [5]. In this study we shall be concerned with the Bush-Mosteller [18], Shapiro-Narendra [19] and Varshavskii-Vorontsova [20] reinforcement schemes. As in the previous chapters, the loss function $\Phi_n$ associated with the learning automaton is given by
\[ \Phi_n = \frac{1}{n} \sum_{t=1}^n \xi_t \tag{5.11} \]
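For readers who prefer an operational view, the interaction implied by (5.8)-(5.11) can be summarized in a few lines of code. The following sketch is illustrative only: the function names, the environment model and the step-size schedule are our own assumptions and not part of the original formulation.

```python
import numpy as np

def run_automaton(T, environment, N, n_steps, gamma=lambda n: 1.0 / (n + 10), seed=0):
    """Generic automaton/environment loop for the scheme (5.8).

    T(p, u, xi) must return a correction vector whose components sum to zero
    (condition (5.9)) and keep p + gamma*T inside [0, 1] (condition (5.10)).
    environment(u, n) returns the (possibly nonstationary) response xi_n.
    """
    rng = np.random.default_rng(seed)
    p = np.full(N, 1.0 / N)              # p_1(i) > 0 for every action
    total_loss = 0.0
    for n in range(1, n_steps + 1):
        u = rng.choice(N, p=p)           # draw the action u_n from p_n
        xi = environment(u, n)           # environment response xi_n
        total_loss += xi
        p = p + gamma(n) * T(p, u, xi)   # reinforcement scheme (5.8)
    return p, total_loss / n_steps       # final distribution and loss Phi_n (5.11)
```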
Let us consider the nonstationary environment which is characterized by the following two properties:
(H3) The conditional mathematical expectations of the environment responses exist, i.e.,
\[ E\{\xi_n \mid \mathcal{F}_{n-1} \wedge u_n = u(i)\} = c_n(i), \qquad \forall i = 1,\dots,N \tag{5.12} \]
and their arithmetic averages tend to a finite limit with probability one, i.e.,
\[ \frac{1}{n} \sum_{t=1}^n c_t(i) \ \xrightarrow[n \to \infty]{a.s.}\ \bar{c}(i) \tag{5.13} \]
(H4) The conditional variances of the environment responses are uniformly bounded, i.e.,
\[ E\{(\xi_n - c_n(i))^2 \mid \mathcal{F}_{n-1} \wedge u_n = u(i)\} = \sigma_n^2(i), \qquad \forall i = 1,\dots,N \tag{5.14} \]
\[ \max_i \sup_n \sigma_n^2(i) = \sigma^2 < \infty \]
Definition 3 A random environment satisfying conditions (5.12)-(5.14) will be said to be "asymptotically stationary in the average sense" and will be denoted by $\mathcal{K}$.
The following nonstationary environments, which have been considered by several authors, are asymptotically stationary in the average sense.
• Markovian environment [6]. A set of $M$ stationary environments whose conditional mathematical expectations of the environment responses are equal to $c^j(i)$ ($j = 1,\dots,M$) is considered. According to a stationary transition matrix $\pi = \|\pi_{ij}\|_{i,j=1,\dots,M}$ of an ergodic Markov chain, the automaton switches from one environment to another. In this case, we have:
\[ E\{\xi_n \mid \mathcal{F}_{n-1} \wedge u_n = u(i)\} = \sum_{j=1}^M c^j(i)\, r_n(j), \qquad \forall i = 1,\dots,N \]
\[ r_n(j) = \sum_{s=1}^M \pi_{sj}\, r_{n-1}(s) \ \xrightarrow[n \to \infty]{}\ r(j), \qquad \forall j = 1,\dots,M \]
\[ \frac{1}{n} \sum_{t=1}^n E\{\xi_t \mid \mathcal{F}_{t-1} \wedge u_t = u(i)\} \ \xrightarrow[n \to \infty]{}\ \bar{c}(i) = \sum_{j=1}^M c^j(i)\, r(j), \qquad \forall i = 1,\dots,N \]
(A small numerical check of this property is sketched after this list.)
• Periodically changing nonstationary environments [21]-[13]-[7]. For the class of sinusoidally varying environments we have:
\[ c_t(i) = c^0(i) + a(i) \sin\!\left(\frac{2\pi}{T}\, t\right), \qquad \frac{1}{n} \sum_{t=1}^n c_t(i) \ \xrightarrow[n \to \infty]{}\ \bar{c}(i) = c^0(i) \]
where $T$ denotes the period of the variation.
• Multi-teacher environments [11]-[12]. In this case, the input of the learning automaton corresponds to the arithmetical average of the continuous (or binary) outputs $\xi_n^j$ ($j = 1,\dots,M$) of the $M$ stationary environments ("teachers"), i.e.,
\[ \xi_n = \frac{1}{M}\left(\xi_n^1 + \dots + \xi_n^M\right) \]
Hence,
\[ c_n(i) := E\{\xi_n \mid \mathcal{F}_{n-1} \wedge u_n = u(i)\}, \qquad \frac{1}{n} \sum_{t=1}^n c_t(i) \ \xrightarrow[n \to \infty]{}\ \bar{c}(i) = \frac{1}{M} \sum_{j=1}^M \bar{c}^j(i), \qquad \forall i = 1,\dots,N \]
• Environments with fixed optimal actions [14]-[5]. In this case, it is assumed that there exists an action $u(\alpha)$ which remains optimal even though the characteristics of the environment keep changing with time:
\[ \limsup_{n \to \infty} E\{\xi_n \mid \mathcal{F}_{n-1} \wedge u_n = u(\alpha)\} < \liminf_{n \to \infty} E\{\xi_n \mid \mathcal{F}_{n-1} \wedge u_n = u(i)\}, \qquad \forall i \neq \alpha \]
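The Markovian example above can be checked numerically. The short sketch below simulates an ergodic switching environment and verifies that the running average of the conditional expectations approaches $\bar{c}(i) = \sum_j c^j(i)\, r(j)$; the transition matrix and the values $c^j(i)$ are arbitrary illustrative choices, not data from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
M, N = 3, 4                                    # 3 environments, 4 actions (illustrative)
pi = np.array([[0.8, 0.1, 0.1],
               [0.2, 0.7, 0.1],
               [0.3, 0.3, 0.4]])               # ergodic transition matrix (assumed)
c = rng.uniform(0.1, 0.9, size=(M, N))         # c^j(i): expectations in environment j

state, running = 0, np.zeros(N)
n_steps = 20000
for _ in range(n_steps):
    running += c[state]                        # c_n(i) of the currently active environment
    state = rng.choice(M, p=pi[state])         # Markovian switching

r = np.linalg.matrix_power(pi, 200)[0]         # stationary distribution r(j) of the chain
print(np.abs(running / n_steps - r @ c).max()) # small: averages approach c_bar(i) = sum_j c^j(i) r(j)
```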
The problem to be considered in this study is presented in the next section.
5.5 Problem statement
In the ideal situation (the conditional mathematical expectations of the environment responses are a priori known) we can easily define the optimal strategy $\{u_n^*\}$ as follows:
\[ u_n^* = u(\alpha_n), \qquad \alpha_n = \arg\min_i c_n(i), \qquad \forall n = 1, 2, \dots \tag{5.15} \]
The probability vector sequence $\{p_n^*\}$ corresponding to this optimal strategy (5.15) is equal to:
\[ p_n^*(\alpha_n) = 1, \qquad p_n^*(j) = 0, \quad j \neq \alpha_n, \qquad \forall n = 1, 2, \dots \tag{5.16} \]
The environment responses and the loss function corresponding to this optimal strategy will be denoted by $\{\xi_n^*\}$ and $\Phi_n^*$, respectively. In this chapter we shall be concerned with the following problem.
Problem statement: For the class $\mathcal{K}$ (5.12)-(5.14) of environments and the reinforcement scheme (5.8), estimate the maximum asymptotic deviation $J$ of $\Phi_n$ from $\Phi_n^*$, i.e.,
\[ J = J(\{p_n\}) = \sup_{\mathcal{K}}\, \limsup_{n \to \infty}\, [\Phi_n - \Phi_n^*] \tag{5.17} \]
5.6 Some Properties of the Reinforcement Schemes
In this section, the asymptotic deviation $J$ (5.17) is expressed as a function of the penalty probabilities $\{c_n\}$ (5.12) and the probability vector sequence $\{p_n\}$ (5.8), and it is shown that under assumptions (H3), (H4) and for a class of correction factors, this deviation tends to zero with probability one. The following lemmas state these results.
Lemma 1. Assume that (H3) and (H4) hold. Then the asymptotic deviation $J$ is given by:
\[ J \stackrel{a.s.}{=} \sup_{\mathcal{K}}\, \limsup_{n \to \infty} \frac{1}{n} \sum_{t=1}^n \sum_{i=1}^N [c_t(i) - c_t(\alpha_t)]\, p_t(i) \tag{5.18} \]
Proof. Let us introduce the following sequence
\[ \delta_n := \frac{1}{n} \sum_{t=1}^n (\xi_t - \xi_t^*) - \frac{1}{n} \sum_{t=1}^n \sum_{i=1}^N [c_t(i) - c_t(\alpha_t)]\, p_t(i) \tag{5.19} \]
As $n \to \infty$, this sequence tends (a.s.) to zero. Indeed, according to (H3), it follows
\[ \frac{1}{n} \sum_{t=1}^n \sum_{i=1}^N [c_t(i) - c_t(\alpha_t)]\, p_t(i) \stackrel{a.s.}{=} \frac{1}{n} \sum_{t=1}^n E\{(\xi_t - \xi_t^*) \mid \mathcal{F}_{t-1}\} \tag{5.20} \]
Then, using the previous equality (5.20) we derive the recurrent form of (5.19):
\[ \delta_n = \left(1 - \frac{1}{n}\right)\delta_{n-1} + \frac{1}{n}\left[(\xi_n - \xi_n^*) - E\{(\xi_n - \xi_n^*) \mid \mathcal{F}_{n-1}\}\right] \tag{5.21} \]
Taking the conditional expectation of $\delta_n^2$, and in view of (H4), we derive the following inequality:
\[ E\{\delta_n^2 \mid \mathcal{F}_{n-1}\} \stackrel{a.s.}{=} \left(1 - \frac{1}{n}\right)^2 \delta_{n-1}^2 + \frac{2}{n}\left(1 - \frac{1}{n}\right) E\big\{\delta_{n-1}\big[(\xi_n - \xi_n^*) - E\{(\xi_n - \xi_n^*) \mid \mathcal{F}_{n-1}\}\big] \,\big|\, \mathcal{F}_{n-1}\big\} + \frac{1}{n^2} E\big\{\big[(\xi_n - \xi_n^*) - E\{(\xi_n - \xi_n^*) \mid \mathcal{F}_{n-1}\}\big]^2 \,\big|\, \mathcal{F}_{n-1}\big\} \]
\[ = \left(1 - \frac{1}{n}\right)^2 \delta_{n-1}^2 + \frac{1}{n^2} E\big\{\big[(\xi_n - \xi_n^*) - E\{(\xi_n - \xi_n^*) \mid \mathcal{F}_{n-1}\}\big]^2 \,\big|\, \mathcal{F}_{n-1}\big\} \]
In view of Lemma A.11 [5], this inequality leads to
\[ \delta_n \ \xrightarrow[n \to \infty]{a.s.}\ 0 \]
This fact implies the desired result (5.18). $\square$
Lemma 2. Assume that assumptions (H3) and (H4) hold, and suppose that
\[ \gamma_n = \frac{\gamma}{n+a} \qquad (\gamma, a > 0) \tag{5.22} \]
Then, the asymptotic deviation $J$ can be expressed as follows:
\[ J \stackrel{a.s.}{=} \sup_{\mathcal{K}}\, \limsup_{n \to \infty}\, \left[ p_n^T\, \frac{1}{n} \sum_{t=1}^n \Delta_t \right] \tag{5.23} \]
where the components $\bar{\Delta}(i)$ of the limit vector $\bar{\Delta} = [\bar{\Delta}(1), \dots, \bar{\Delta}(N)]^T$ of these averages are equal to
\[ \bar{\Delta}(i) = \bar{c}(i) - \bar{c}(\alpha), \quad \forall i = 1,\dots,N, \qquad \alpha = \arg\min_i \bar{c}(i) \tag{5.24} \]
Proof. Let us introduce the following random variable $\tilde{\Phi}_n$:
\[ \tilde{\Phi}_n := \frac{1}{n} \sum_{t=1}^n \sum_{i=1}^N \Delta_t(i)\, p_t(i) \tag{5.25} \]
and consider the matrix version of Abel's identity:
\[ \sum_{t=n_0}^n A_t B_t = A_n \sum_{t=n_0}^n B_t - \sum_{t=n_0}^n [A_t - A_{t-1}] \sum_{s=n_0}^{t-1} B_s, \qquad A_t \in R^{m \times k},\ B_t \in R^{k \times l} \tag{5.26} \]
The proof of this identity is given in Appendix A (Lemma A.5-1). Based on Abel's identity (5.26), the functional (5.25) can be rewritten as:
\[ \tilde{\Phi}_n = \frac{1}{n} \sum_{t=1}^n p_t^T \Delta_t = p_n^T\, \frac{1}{n} \sum_{t=1}^n \Delta_t - \frac{1}{n} \sum_{t=1}^n (p_t - p_{t-1})^T \sum_{s=1}^{t-1} \Delta_s = p_n^T\, \frac{1}{n} \sum_{t=1}^n \Delta_t - \frac{1}{n} \sum_{t=1}^n (t-1)\, \gamma_{t-1}\, T_{t-1}^T\, \bar{\Delta}_{t-1} \]
Let us introduce the following functions:
\[ \Delta_t(i) := c_t(i) - c_t(\alpha_t), \qquad \bar{\Delta}_n := \frac{1}{n} \sum_{t=1}^n \Delta_t \tag{5.27} \]
Taking into account assumptions (H3), (H4) and (5.22) we conclude that
\[ \tilde{\Phi}_n = p_n^T \bar{\Delta}_n + o_\omega(1), \qquad n \to \infty \tag{5.28} \]
where $o_\omega(1)$ is a random bounded sequence which tends to zero with probability one. As $n \to \infty$, equation (5.28) leads to the desired result (5.23). $\blacksquare$
Lemma 3. Assume that assumptions (H3), (H4) and (5.22) hold. If the reinforcement scheme (5.8) possesses the following properties:
\[ \textrm{(P1)} \qquad p_n(\alpha) \ \xrightarrow[n \to \infty]{a.s.}\ 1 \tag{5.29} \]
\[ \textrm{(P2)} \qquad \frac{1}{n} \sum_{t=1}^n T_t \ \xrightarrow[n \to \infty]{a.s.}\ 0 \]
then the asymptotic deviation $J$ is equal to zero with probability one, i.e.,
\[ J \stackrel{a.s.}{=} 0 \tag{5.30} \]
or, in other words, the loss function $\Phi_n$ (5.11) tends to its minimal possible value:
\[ \Phi_n \ \xrightarrow[n \to \infty]{a.s.}\ \bar{c}(\alpha) \tag{5.31} \]
Proof. Using expression (5.28) and taking into account that $\bar{\Delta}(\alpha) = 0$, we derive (5.30). $\blacksquare$
In the following we shall be interested in the analysis of different reinforcement schemes which are supplied with normalized environment responses.
5.7 Reinforcement Schemes with Normalization Procedure
This section deals with the normalization procedure, which is useful when the environment responses do not belong to the segment [0, 1]. This normalization procedure was initially suggested by Najim and Poznyak [5] to solve optimization problems related to multimodal functions and for neural network synthesis. It is described by the following algorithm:
\[ \bar{\xi}_n = \frac{\big[s_n(i) - \min_j s_{n-1}(j)\big]_+}{\max_k \big[s_n(k) - \min_j s_{n-1}(j)\big]_+ + 1}, \qquad u_n = u(i) \tag{5.32} \]
where
\[ s_n(i) := \frac{\sum_{t=1}^n \xi_t\, \chi(u_t = u(i))}{\sum_{t=1}^n \chi(u_t = u(i))}, \qquad i = 1,\dots,N \tag{5.33} \]
\[ [x]_+ := \begin{cases} x & \text{if } x \ge 0 \\ 0 & \text{if } x < 0 \end{cases}, \qquad \chi(u_n = u(i)) := \begin{cases} 1 & \text{if } u_n = u(i) \\ 0 & \text{if } u_n \ne u(i) \end{cases} \tag{5.34} \]
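The normalization (5.32)-(5.34) can be implemented with a few running statistics per action. The class below is a sketch under our own naming conventions; it is not the authors' code.

```python
import numpy as np

class ResponseNormalizer:
    """Running normalization of environment responses, following (5.32)-(5.34)."""

    def __init__(self, N):
        self.resp_sum = np.zeros(N)    # sum of responses received for each action
        self.count = np.zeros(N)       # number of times each action was selected
        self.s_prev = np.zeros(N)      # s_{n-1}(j)

    def update(self, i, xi):
        """Record response xi for the selected action i and return the normalized value."""
        s_prev_min = self.s_prev.min()
        self.resp_sum[i] += xi
        self.count[i] += 1.0
        s = np.divide(self.resp_sum, self.count,
                      out=np.zeros_like(self.resp_sum), where=self.count > 0)  # (5.33)
        num = max(s[i] - s_prev_min, 0.0)                    # [s_n(i) - min_j s_{n-1}(j)]_+
        den = np.maximum(s - s_prev_min, 0.0).max() + 1.0    # max_k [.]_+ + 1
        self.s_prev = s
        return num / den                                     # normalized response in [0, 1)
```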
The properties of the normalized environment response are summarized in the following lemma.
Lemma 4. Assume that assumptions (H3) and (H4) hold and suppose that
the reinforcement scheme (5.8) generates the sequence $\{p_n\}$ such that for any $i = 1,\dots,N$
\[ \sum_{n=1}^\infty p_n(i) \stackrel{a.s.}{=} \infty \tag{5.35} \]
Then, the normalized environment response $\bar{\xi}_n$ (5.32) possesses the following properties:
• The number of selections of each action is infinite, i.e.,
\[ \sum_{t=1}^\infty \chi(u_t = u(i)) \stackrel{a.s.}{=} \infty, \qquad \forall i = 1,\dots,N \tag{5.36} \]
• The random variable $s_n(i)$ is asymptotically equal to the arithmetic average of the conditional expectations of the corresponding environment responses, i.e.,
\[ s_n(i) = \frac{1}{n} \sum_{t=1}^n c_t(i) + o_\omega(1), \qquad \forall i = 1,\dots,N \tag{5.37} \]
• For the selected action $u_n = u(i)$ at time $n$, the normalized environment reaction is asymptotically equal to $\bar{\Delta}(i)$, i.e.,
\[ \bar{\xi}_n \stackrel{a.s.}{=} \bar{\Delta}(i) + o_\omega(1), \qquad u_n = u(i), \quad \forall i = 1,\dots,N \tag{5.38} \]
• For the optimal action $u_n = u(\alpha)$, the normalized environment reaction is asymptotically equal to $0$, i.e.,
\[ \bar{\xi}_n \stackrel{a.s.}{=} o_\omega(1), \qquad u_n = u(\alpha) \tag{5.39} \]
Proof.
1. (5.36) follows directly from assumption (5.35) and the Borel-Cantelli lemma [22].
2. Let us introduce the following sequence
\[ \theta_n(i) := s_n(i) - \frac{\sum_{t=1}^n c_t(i)\, \chi(u_t = u(i))}{\sum_{t=1}^n \chi(u_t = u(i))}, \qquad i = 1,\dots,N \]
which leads to the following recurrent form of $\theta_n(i)$:
\[ \theta_n(i) = (1 - \lambda_n(i))\, \theta_{n-1}(i) + \lambda_n(i)\, [\xi_n - c_n(i)], \qquad \lambda_n(i) := \frac{\chi(u_n = u(i))}{\sum_{t=1}^n \chi(u_t = u(i))} \]
Taking into account that (see assumptions (H3), (H4))
\[ E\{[\xi_n - c_n(i)] \mid \mathcal{F}_{n-1} \wedge u_n = u(i)\} \stackrel{a.s.}{=} 0 \]
it follows that
\[ E\{\theta_n^2(i) \mid \mathcal{F}_{n-1}\} \stackrel{a.s.}{\le} (1 - \lambda_n(i))^2\, \theta_{n-1}^2(i) + \lambda_n^2(i)\, \sigma_n^2(i), \qquad \sum_{n=1}^\infty \lambda_n^2(i) \stackrel{a.s.}{<} \infty \]
In view of the Robbins-Siegmund theorem [23], it follows that $\theta_n(i) \stackrel{a.s.}{\to} 0$ and hence
\[ \lim_{n \to \infty} s_n(i) = \lim_{n \to \infty} \frac{\sum_{t=1}^n c_t(i)\, \chi(u_t = u(i))}{\sum_{t=1}^n \chi(u_t = u(i))} = \lim_{n \to \infty} \frac{1}{n} \sum_{t=1}^n c_t(i) = \bar{c}(i) \]
So, (5.37) is proved.
3. (5.38) follows directly from (5.37) and assumption (H3).
4. (5.39) is the consequence of (5.38). $\blacksquare$
FIGURE 5.3. Learning automaton with normalization procedure.
In the following we shall be concerned with the behaviour of learning automata with normalized environment responses. The probability distribution $p_n$ is adjusted using the previous reinforcement scheme (5.8) in which $\xi_n$ is replaced by $\bar{\xi}_n$ (see Figure 5.3). To illustrate this approach, the analysis of the properties of the Bush-Mosteller [18], Shapiro-Narendra [19] and Varshavskii-Vorontsova [20] reinforcement schemes is given in the following subsections. The correction factor was considered constant ($\gamma_n = \gamma = \text{const.}$) in the original version of these reinforcement schemes. In this study, we assume that the correction factor $\gamma_n$ is selected according to (5.22) with $\gamma < (1 + a)$. The initial probabilities are assumed to be strictly positive ($p_1(i) > 0$, $\forall i = 1,\dots,N$).
5.7.1 BUSH-MOSTELLER REINFORCEMENT SCHEME
The Bush-Mosteller scheme [18]-[5] is described by:
\[ p_{n+1} = p_n + \gamma_n \left[ e(u_n) - p_n + \bar{\xi}_n\, \frac{e_N - N e(u_n)}{N-1} \right] \tag{5.40} \]
where
\[ \bar{\xi}_n \in [0,1], \qquad e(u_n) = (0,\dots,0,\underbrace{1}_{i},0,\dots,0)^T \ \text{ for } u_n = u(i), \qquad e_N = (1,\dots,1)^T \in R^N \]
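A direct transcription of the update (5.40) is given below; it is a sketch only (the selected action is passed as an index, and the normalized response $\bar{\xi}_n$ is assumed to lie in $[0,1]$).

```python
import numpy as np

def bush_mosteller_step(p, u, xi_bar, gamma):
    """One Bush-Mosteller update (5.40); u is the index of the selected action."""
    N = p.size
    e_u = np.zeros(N)
    e_u[u] = 1.0                                     # e(u_n)
    e_N = np.ones(N)                                 # e_N = (1, ..., 1)^T
    T = e_u - p + xi_bar * (e_N - N * e_u) / (N - 1)
    return p + gamma * T                             # components of T sum to zero, so p stays a distribution
```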
The reinforcement scheme above is frequently used in the case of stationary environments. We shall analyze its behaviour in asymptotically stationary (in the average sense) media. This question is addressed in the following theorem.
Theorem 1. For the Bush-Mosteller scheme (5.40), condition (5.35) is satisfied, and if assumptions (H3) and (H4) hold, and the optimal action is unique, i.e.,
\[ \min_{i \ne \alpha} \bar{\Delta}(i) := \Delta^* > 0 \tag{5.41} \]
then the properties (P1) and (P2) are fulfilled and the loss function $\Phi_n$ tends to its minimal possible value $\bar{c}(\alpha)$ (5.31), with probability one.
Proof. Let us consider the following estimation:
\[ p_{n+1}(i) = p_n(i) + \gamma_n \left[ \chi(u_n = u(i)) - p_n(i) + \bar{\xi}_n\, \frac{1 - N\chi(u_n = u(i))}{N-1} \right] \ge p_n(i)(1 - \gamma_n) \ge \dots \ge p_1(i) \prod_{t=1}^n (1 - \gamma_t) \tag{5.42} \]
From Lemma A.4 [5] (see also Lemma A.3-3 in Appendix A), it follows that
\[ \prod_{t=1}^n (1 - \gamma_t) \ge \left( \frac{1+a}{n+1+a} \right)^{\gamma}, \quad \gamma \in (0,1),\ a > \gamma; \qquad \prod_{t=1}^n (1 - \gamma_t) = \frac{a}{n+a}, \quad \gamma = 1,\ a > 0 \tag{5.43} \]
Substitution of (5.43) into (5.42) leads to the desired result (5.35). In view of (5.27) and assumptions (H3), (H4) we deduce
\[ \bar{\Delta}_n(\alpha) = \left(1 - \frac{1}{n}\right) \bar{\Delta}_{n-1}(\alpha) + \frac{1}{n}\, \Delta_n(\alpha) \stackrel{a.s.}{\le} \left(1 - \frac{1}{n}\right) \bar{\Delta}_{n-1}(\alpha) + \frac{1}{n}\, \mathrm{const} \]
Using Lemma A.3-2 (see Appendix A) for
\[ u_n = \bar{\Delta}_n(\alpha), \qquad \alpha_n = \frac{1}{n}, \qquad \beta_n = \frac{1}{n}\, \mathrm{const}, \qquad v_n = n^{1-\varepsilon}, \quad \varepsilon \in (0,1) \]
we derive
\[ \bar{\Delta}_n(\alpha) \stackrel{a.s.}{=} o_\omega\!\left(\frac{1}{n^{1-\varepsilon}}\right) \tag{5.44} \]
The reinforcement scheme (5.40) and (5.44) lead to
\[ E\{[1 - p_{n+1}(\alpha)] \mid \mathcal{F}_n\} \stackrel{a.s.}{\le} [1 - p_n(\alpha)] \left(1 - \gamma_n\, \frac{\Delta^*}{N-1}\right) + \gamma_n\, o_\omega\!\left(\frac{1}{n^{1-\varepsilon}}\right) \tag{5.45} \]
for $n \ge n_0(\omega)$, $n_0(\omega) \stackrel{a.s.}{<} \infty$. Taking into account the normalization procedure ($\Delta^* \in (0,1)$), (5.45) and the Robbins-Siegmund theorem [23], we obtain
\[ p_n(\alpha) \ \xrightarrow[n \to \infty]{a.s.}\ 1 \]
The property (P1) is then fulfilled. To prove the fulfilment of property (P2) we use Lemma 1 and the Toeplitz lemma:
\[ \lim_{n \to \infty} \frac{1}{n} \sum_{t=1}^n T_t \stackrel{a.s.}{=} \lim_{n \to \infty} \frac{1}{n} \sum_{t=1}^n E\{T_t \mid \mathcal{F}_{t-1}\} \stackrel{a.s.}{=} \lim_{n \to \infty} E\{T_n \mid \mathcal{F}_n\} = \lim_{n \to \infty} \sum_{i=1}^N \left[ e(u(i)) - p_n + [\bar{\Delta}(i) + o_\omega(1)]\, \frac{e_N - N e(u(i))}{N-1} \right] p_n(i) \stackrel{a.s.}{=} \frac{e_N - N e(u(\alpha))}{N-1}\, \bar{\Delta}(\alpha) = 0 \]
Theorem is proved. $\square$
This theorem shows that a learning automaton using the Bush-Mosteller reinforcement scheme with the normalization procedure described above selects the optimal action asymptotically, i.e., its behaviour is asymptotically optimal. The next subsection deals with the analysis of the Shapiro-Narendra reinforcement scheme.
5.7.2 SHAPIRO-NARENDRA REINFORCEMENT SCHEME
The Shapiro-Narendra scheme [19]-[5] is described by:
\[ p_{n+1} = p_n + \gamma_n (1 - \bar{\xi}_n)\, [e(u_n) - p_n] \tag{5.46} \]
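The corresponding one-line update for (5.46) is sketched below, under the same conventions as the Bush-Mosteller sketch above.

```python
import numpy as np

def shapiro_narendra_step(p, u, xi_bar, gamma):
    """One Shapiro-Narendra update (5.46): move probability mass toward the
    selected action by an amount proportional to (1 - xi_bar)."""
    e_u = np.zeros(p.size)
    e_u[u] = 1.0
    return p + gamma * (1.0 - xi_bar) * (e_u - p)
```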
We shall analyze its behaviour in asymptotically stationary (in the average sense) media and state some analytical results. These results are described in the following theorem.
Theorem 2. For the Shapiro-Narendra scheme (5.46), assume that condition (5.35) is satisfied and suppose that assumptions (H3), (H4) and condition (5.41) hold. In addition, suppose that the correction factor satisfies the following condition:
\[ \lim_{n \to \infty} \gamma_n \prod_{t=1}^n (1 - \gamma_t)^{-1} := c < p_1(\alpha)\, \frac{\Delta^*}{1 - \Delta^*} \tag{5.47} \]
Then the properties (P1) and (P2) are fulfilled and the loss function $\Phi_n$ tends to its minimal possible value $\bar{c}(\alpha)$ (5.31), with probability one.
Proof. Let us estimate the lower bounds of the probabilities $p_{n+1}(i)$:
\[ p_{n+1}(i) = p_n(i) + \gamma_n (1 - \bar{\xi}_n)[\chi(u_n = u(i)) - p_n(i)] = p_n(i)[1 - \gamma_n(1 - \bar{\xi}_n)] + \gamma_n(1 - \bar{\xi}_n)\chi(u_n = u(i)) \ge p_n(i)(1 - \gamma_n) \ge \dots \ge p_1(i) \prod_{t=1}^n (1 - \gamma_t) \tag{5.48} \]
Substituting (5.43) into (5.48) leads to the desired result (5.35). In view of (5.27) and assumptions (H3) and (H4), and according to the proof of the previous theorem, we deduce that (5.44) is fulfilled. Let us consider the following Lyapunov function:
\[ W_n := \frac{1 - p_n(\alpha)}{p_n(\alpha)} \]
Taking into account the Shapiro-Narendra scheme and (5.44), it follows:
\[ E\{W_{n+1} \mid \mathcal{F}_n\} \stackrel{a.s.}{=} \sum_{i=1}^N E\{W_{n+1} \mid \mathcal{F}_{n-1} \wedge u_n = u(i)\}\, p_n(i) = \sum_{i \ne \alpha} \frac{1 - p_n(\alpha)[1 - \gamma_n(1 - \bar{\Delta}(i) + o_\omega(1))]}{p_n(\alpha)[1 - \gamma_n(1 - \bar{\Delta}(i) + o_\omega(1))]}\, p_n(i) + \frac{(1 - p_n(\alpha))[1 - \gamma_n(1 - o_\omega(1))]}{p_n(\alpha)[1 - \gamma_n(1 - o_\omega(1))] + \gamma_n(1 - o_\omega(1))}\, p_n(\alpha) \]
Replacing $\bar{\Delta}(i)$ by its minimal value $\Delta^*$ leads to the following inequality:
\[ E\{W_{n+1} \mid \mathcal{F}_n\} \stackrel{a.s.}{\le} W_n \left[ 1 - \gamma_n + \gamma_n (1 - \Delta^*) \left( (1 - \gamma_n) + \gamma_n\, p_n^{-1}(\alpha) \right) \right] + \gamma_n\, o_\omega(1) \tag{5.49} \]
for $n \ge n_0(\omega)$, $n_0(\omega) \stackrel{a.s.}{<} \infty$. Condition (5.47) gives
\[ \gamma_n\, p_n^{-1}(\alpha) \le \gamma_n\, p_1^{-1}(\alpha) \prod_{t=1}^n (1 - \gamma_t)^{-1} \ \xrightarrow[n \to \infty]{}\ p_1^{-1}(\alpha)\, c \tag{5.50} \]
Inequalities (5.50) and (5.49) imply
\[ E\{W_{n+1} \mid \mathcal{F}_n\} \stackrel{a.s.}{\le} W_n \left[ 1 - \gamma_n + \gamma_n (1 - \Delta^*) \left( (1 - \gamma_n) + p_1^{-1}(\alpha)(c + o(1)) \right) \right] + \gamma_n\, o_\omega(1) \]
Taking into account the normalization procedure ($\Delta^* \in (0,1)$), assumption (5.47) and the Robbins-Siegmund theorem [23], we obtain
\[ W_n \ \xrightarrow[n \to \infty]{a.s.}\ 0, \qquad p_n(\alpha) \ \xrightarrow[n \to \infty]{a.s.}\ 1 \]
The property (P1) is then fulfilled. To prove the fulfilment of property (P2) we use Lemma 1 and the Toeplitz lemma:
\[ \lim_{n \to \infty} \frac{1}{n} \sum_{t=1}^n T_t \stackrel{a.s.}{=} \lim_{n \to \infty} \frac{1}{n} \sum_{t=1}^n E\{T_t \mid \mathcal{F}_{t-1}\} \stackrel{a.s.}{=} \lim_{n \to \infty} E\{T_n \mid \mathcal{F}_n\} = \sum_{i=1}^N [1 - \bar{\Delta}(i)]\left[ e(u(i)) - \lim_{n \to \infty} p_n \right] \lim_{n \to \infty} p_n(i) \stackrel{a.s.}{=} [1 - \bar{\Delta}(\alpha)]\left[ e(u(\alpha)) - e(u(\alpha)) \right] = 0 \]
Theorem is proved. $\blacksquare$
The following corollary deals with the constraints associated with the correction factor $\gamma_n$.
Corollary 1. For the correction factor (5.22), assumption (5.47) will be satisfied if
\[ \gamma = 1, \qquad \lim_{n \to \infty} \gamma_n \prod_{t=1}^n (1 - \gamma_t)^{-1} = c := a^{-1} < p_1(\alpha)\, \frac{\Delta^*}{1 - \Delta^*} \]
These statements show that a learning automaton using the Shapiro-Narendra reinforcement scheme with the normalization procedure described above has an asymptotically optimal behaviour. The Varshavskii-Vorontsova reinforcement scheme is considered in the following.
5.7.3 VARSHAVSKII-VORONTSOVA REINFORCEMENT SCHEME
The Varshavskii-Vorontsova scheme [20]-[5] is described by:
\[ p_{n+1} = p_n + \gamma_n\, p_n^T e(u_n)\, (1 - 2\bar{\xi}_n)\, [e(u_n) - p_n] \tag{5.51} \]
This scheme belongs to the class of nonlinear reinforcement schemes. We shall analyze its behaviour in asymptotically stationary (in the average sense) media. The following theorem covers the specific properties of the Varshavskii-Vorontsova scheme. Its significance will be discussed below.
Theorem 3. For the Varshavskii-Vorontsova scheme (5.51), condition (5.35) is satisfied, and if assumptions (H3), (H4) and condition (5.41) hold, and in addition the nonstationary environment satisfies the following condition:
\[ b := \left( \sum_{i \ne \alpha} [1 - 2\bar{\Delta}(i)]^{-1} \right)^{-1} < 0 \tag{5.52} \]
then the properties (P1) and (P2) are fulfilled and the loss function $\Phi_n$ tends to its minimal possible value $\bar{c}(\alpha)$ (5.31), with probability one.
Proof. Let us estimate the lower bounds of the probabilities:
\[ p_{n+1}(i) = p_n(i) + \gamma_n\, p_n^T e(u_n)\, (1 - 2\bar{\xi}_n)\, [\chi(u_n = u(i)) - p_n(i)] \ge p_n(i)\left[1 - \gamma_n\, p_n^T e(u_n)\, (1 - 2\bar{\xi}_n)\right] \ge p_n(i)\left[1 - \gamma_n\, p_n^T e(u_n)\right] \ge p_n(i)(1 - \gamma_n) \ge \dots \ge p_1(i) \prod_{t=1}^n (1 - \gamma_t) \tag{5.53} \]
Substituting (5.43) into (5.53) gives us the desired result (5.35).
For convergence analysis, let us introduce the following Lyapunov function:
\[ W_n := \frac{1 - p_n(\alpha)}{p_n(\alpha)} \]
The reinforcement scheme (5.51), (5.44) and simple calculations lead to
\[ E\{W_{n+1} \mid \mathcal{F}_n\} \stackrel{a.s.}{=} \sum_{i=1}^N E\{W_{n+1} \mid \mathcal{F}_{n-1} \wedge u_n = u(i)\}\, p_n(i) = \frac{1}{p_n(\alpha)} \sum_{i \ne \alpha} \frac{p_n(i)}{1 - \gamma_n\, p_n(i)\,[1 - 2\bar{\Delta}(i) + o_\omega(1)]} - \frac{\gamma_n\,[1 - 2\bar{\Delta}(\alpha) + o_\omega(1)]\,(1 - p_n(\alpha))}{1 + \gamma_n\,[1 - 2\bar{\Delta}(\alpha) + o_\omega(1)]\,(1 - p_n(\alpha))} + \gamma_n\, o_\omega(1) \tag{5.54} \]
for $n \ge n_0(\omega)$, $n_0(\omega) \stackrel{a.s.}{<} \infty$.
Let us now maximize the first term in (5.54) with respect to the components $p_n(i)$ ($i \ne \alpha$) under the constraint
\[ \sum_{i \ne \alpha} p_n(i) = 1 - p_n(\alpha) \tag{5.55} \]
To do that, let us introduce the following variables
\[ x_i := p_n(i), \qquad a_i := \gamma_n\, [1 - 2\bar{\Delta}(i)], \qquad i \ne \alpha \]
It follows that
\[ \sum_{i \ne \alpha} \frac{p_n(i)}{1 - \gamma_n\, [1 - 2\bar{\Delta}(i)]\, p_n(i)} = \sum_{i \ne \alpha} \frac{x_i}{1 - a_i x_i} =: F(x) \tag{5.56} \]
To maximize the function $F(x)$ under the constraint (5.55), let us introduce the following Lagrange function:
\[ L(x, \lambda) := F(x) - \lambda \left( \sum_{i \ne \alpha} x_i - (1 - x_\alpha) \right) \]
The optimal solution $(x^*, \lambda^*)$ satisfies the following optimality conditions:
\[ \frac{\partial}{\partial x_i} L(x^*, \lambda^*) = \frac{1}{(1 - a_i x_i^*)^2} - \lambda^* = 0 \quad \forall i \ne \alpha, \qquad \frac{\partial}{\partial \lambda} L(x^*, \lambda^*) = \sum_{i \ne \alpha} x_i^* - (1 - x_\alpha) = 0 \]
From these optimality conditions, it can be seen that
\[ x_i^* = \frac{1 - x_\alpha}{a_i \sum_{j \ne \alpha} a_j^{-1}}, \qquad \frac{1}{\sqrt{\lambda^*}} = 1 - \frac{1 - x_\alpha}{\sum_{j \ne \alpha} a_j^{-1}} \]
Hence
\[ F(x) \le F(x^*) = (1 - x_\alpha)\left( 1 - \frac{1 - x_\alpha}{\sum_{i \ne \alpha} a_i^{-1}} \right)^{-1} \]
From this inequality we derive
\[ \sum_{i \ne \alpha} \frac{p_n(i)}{1 - \gamma_n\,[1 - 2\bar{\Delta}(i)]\, p_n(i)} \le \frac{1 - p_n(\alpha)}{1 - \gamma_n\, b\, [1 - p_n(\alpha)]} \tag{5.57} \]
where
\[ b := \left( \sum_{i \ne \alpha} [1 - 2\bar{\Delta}(i)]^{-1} \right)^{-1} \tag{5.58} \]
Substituting (5.57) into (5.54), we obtain
1
E{W.+I/~'.} < W.l_%~b[l_ p,~(a)]÷ +
1+7.
(1-pn(a))
-1
+";.o,.,(1)=
1 ~" P~(~) i-?.b[i- p ~ ( ~ ) l + i + C ( i - - ~ ( a ) )
]
+?"
o~,(1)
=
= w . [1 - ~ . (Ibl [1 - p~(~)] + p . ( ~ ) ) + O ( ~ ) ] + ~ . o~(1) T h e maximization of the right side of this inequality with respect to p n ( a ) E [0, 1] leads to E {Wn+,/~-~} a~. W~ [1 - "Ynmin {Ibl; 1} + o(~g)] + ~ o~(1)
Taking into account condition (5.52) and in view of the Robbins-Siegmund theorem [23], we obtain
\[ W_n \ \xrightarrow[n \to \infty]{a.s.}\ 0, \qquad p_n(\alpha) \ \xrightarrow[n \to \infty]{a.s.}\ 1 \]
The property (P1) is then fulfilled. Using the Toeplitz lemma gives us
\[ \lim_{n \to \infty} \frac{1}{n} \sum_{t=1}^n T_t \stackrel{a.s.}{=} \lim_{n \to \infty} \frac{1}{n} \sum_{t=1}^n E\{T_t \mid \mathcal{F}_{t-1}\} \stackrel{a.s.}{=} \lim_{n \to \infty} E\{T_n \mid \mathcal{F}_n\} \]
Then, by using Lemma 1, it follows:
\[ \lim_{n \to \infty} E\{T_n \mid \mathcal{F}_n\} = \sum_{i=1}^N [1 - 2\bar{\Delta}(i)]\, \lim_{n \to \infty} p_n^2(i) \left[ e(u(i)) - \lim_{n \to \infty} p_n \right] \stackrel{a.s.}{=} [1 - 2\bar{\Delta}(\alpha)]\left[ e(u(\alpha)) - e(u(\alpha)) \right] = 0 \]
Property (P2) is fulfilled. Theorem is proved. $\blacksquare$
The following corollary gives an example of the class of environments satisfying condition (5.52).
Corollary 2. Condition (5.52) will be satisfied, for example, for the class of environments for which
\[ \bar{\Delta}(i) > \frac{1}{2} \qquad \forall i \ne \alpha \]
The main point of this theorem is that a learning automaton using the Varshavskii-Vorontsova reinforcement scheme with the normalization procedure described above has an asymptotically optimal behaviour in the class of nonstationary environments for which condition (5.52) is fulfilled. The next section presents some simulation results.
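For completeness, the nonlinear update (5.51) can be sketched in the same style as the two previous schemes; the form below follows the expression used in the proof of Theorem 3 and is illustrative only.

```python
import numpy as np

def varshavskii_vorontsova_step(p, u, xi_bar, gamma):
    """One Varshavskii-Vorontsova update (5.51); the factor p[u] = p^T e(u_n)
    makes the scheme nonlinear in the probability vector."""
    e_u = np.zeros(p.size)
    e_u[u] = 1.0
    return p + gamma * p[u] * (1.0 - 2.0 * xi_bar) * (e_u - p)
```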
5.8 Simulation results
This section presents a simple numerical simulation, from which we can verify the viability of the design and analysis given in this chapter. Let us consider the optimization of a function described by (5.5) with $c = 10$, $u^* = 0.01$ and $w = 10$. The variation domain of the variable $u$ is discretised into a set of $N = 10$ values $[u(1), u(2), \dots, u(10)]$, $u(i) = 5 + i$.
No prior information is used initially, i.e., the probability vector is initialized to $p_0 = [\tfrac{1}{10}, \dots, \tfrac{1}{10}]^T$. The learning system operates as follows. At each time $n$ a normally distributed random variable $z$ is generated and an action $u(i)$ is selected on the basis of the probability distribution $p_n$. The selected action $u(i)$ is then used to construct the normalized automaton input, which in turn is used to adjust the probability distribution by means of a reinforcement scheme (Bush-Mosteller (B-M), Shapiro-Narendra (S-N) or Varshavskii-Vorontsova (V-V)). This procedure is repeated at each time step. The optimal action is $u(5)$ and is equal to 10. The probabilities associated with this optimal action (solution) and generated respectively by the Bush-Mosteller (B-M), Shapiro-Narendra (S-N) and Varshavskii-Vorontsova (V-V) reinforcement schemes are given in Figure 5.4.
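The experiment of this section can be reproduced in outline with the sketches introduced earlier (ResponseNormalizer, bush_mosteller_step, shapiro_narendra_step and varshavskii_vorontsova_step). The noisy objective below is only a stand-in for the function (5.5) defined earlier in the book, and all numerical choices (noise level, step-size schedule, horizon) are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
actions = np.array([5 + i for i in range(1, 11)])        # u(i) = 5 + i, N = 10, optimum at u(5) = 10

def response(i):
    u = actions[i]
    return abs(u - 10.0) / 5.0 + 0.05 * rng.standard_normal()   # stand-in objective, minimal at u(5)

N, n_steps = 10, 5000
for step in (bush_mosteller_step, shapiro_narendra_step, varshavskii_vorontsova_step):
    norm = ResponseNormalizer(N)
    p = np.full(N, 1.0 / N)
    for n in range(1, n_steps + 1):
        gamma = 50.0 / (n + 100)                          # correction factor of the form (5.22)
        i = rng.choice(N, p=p)
        xi_bar = norm.update(i, response(i))              # normalized automaton input
        p = step(p, i, xi_bar, gamma)
    print(step.__name__, p.argmax(), round(p.max(), 3))   # mass should drift toward index 4, i.e. u(5) = 10
```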
FIGURE 5.4. Evolution of the probabilities associated with the optimal action.
After a learning period equal to 450 iterations for the Varshavskii-Vorontsova scheme, 650 iterations for the Bush-Mosteller scheme and 750 iterations for the Shapiro-Narendra scheme,
these probabilities tend to one. The loss functions associated with the behaviour of the learning automata using the different reinforcement schemes are depicted in Figure 5.5.
FIGURE 5.5. Evolution of the loss functions.
The three reinforcement schemes considered in this study achieve a satisfactory decrease in the loss function value. The Shapiro-Narendra scheme (S-N) leads to a loss function which decreases more quickly than those related, respectively, to the Bush-Mosteller (B-M) and Varshavskii-Vorontsova (V-V) reinforcement schemes. Figures 5.4 and 5.5 illustrate the ability of the optimization algorithm to deal with the optimization of nonstationary functions. It can be seen that the optimization objective is achieved. This behaviour was expected from the theoretical results stated in this chapter. The above results can be explained by the learning ability of the optimization algorithm.
5.9 Conclusion
This chapter has focused on the use of variable-structure stochastic learning automata operating in nonstationary environments for solving optimization problems related to nonstationary functions. The conditional expectations of the environment responses (automaton inputs) were assumed to be time-varying. Different nonstationary environments (Markovian environments, periodically changing environments, multi-teacher environments, environments with fixed optimal actions) have been analyzed. A normalization procedure has been introduced to deal with environment responses which do not belong to the unit segment. The notion of asymptotically stationary environments (in the average sense) was also introduced. Several theoretical results have been stated. These results concern the asymptotically optimal behaviour of learning automata using the Bush-Mosteller, Shapiro-Narendra and Varshavskii-Vorontsova reinforcement schemes. Some simulation results have been presented to illustrate the use of learning automata operating in nonstationary environments for solving optimization problems related to nonstationary functions. The theoretical analysis presented in this chapter can be extended to learning automata which use other reinforcement schemes.
References
[1] Holst J 1977 Adaptive Prediction and Recursive Estimation. Department of Automatic Control, Lund Institute of Technology, Report LUTFD2/(TFRT-1013)/1-206
[2] Najim K 1989 Process Modeling and Control in Chemical Engineering. Marcel Dekker, New York
[3] Kawashima H 1981 Identification of autoregressive integrated processes. IFAC Congress, Kyoto, Japan
[4] Yaglom A M 1962 An Introduction to the Theory of Stationary Random Functions. Prentice-Hall, Englewood Cliffs
[5] Najim K, Poznyak A S 1994 Learning Automata: Theory and Applications. Pergamon Press, Oxford
[6] Tsetlin M L 1973 Automaton Theory and Modeling of Biological Systems. Academic Press, New York
[7] Narendra K S, Thathachar M A L 1989 Learning Automata: An Introduction. Prentice-Hall, Englewood Cliffs
[8] Najim K, Oppenheim G 1991 Learning systems: theory and applications. IEE Proceedings Computers and Digital Techniques 138:183-192
[9] Srikantakumar P R, Narendra K S 1982 A learning model for routing in telephone networks. SIAM J. Control and Optimization 20:34-57
[10] Barto A, Anandan P, Anderson C W 1986 Cooperativity in networks of pattern recognizing stochastic learning automata. In: Narendra K S (ed) 1986 Adaptive and Learning Systems: Theory and Applications. Plenum Press, New York
[11] Baba N, Mogami Y 1988 Learning behaviours of hierarchical structure of stochastic automata in a nonstationary multi-teacher environment. Int. J. Systems Sci. 19:1345-1350
[12] Baba N 1984 New Topics in Learning Automata Theory and Applications. Springer-Verlag, Berlin
[13] Narendra K S, Viswanathan R 1972 A two-level system of stochastic automata for periodic random environments. IEEE Trans. Syst. Man, and Cybern. 2:285-289
[14] Baba N, Sawaragi Y 1975 On the learning behaviour of stochastic automata under a nonstationary random environment. IEEE Trans. Syst. Man, and Cybern. 5:273-275
[15] Narendra K S, Thathachar M A L 1980 On the behavior of learning automata in a changing environment with application to telephone traffic routing. IEEE Trans. Syst. Man, and Cybern. 10:262-269
[16] Nedzelnitsky O V Jr, Narendra K S 1987 Nonstationary models of learning automata routing in data communication networks. IEEE Trans. Syst. Man, and Cybern. 17:1004-1015
[17] Thathachar M A L, Ramakrishnan K R 1981 A hierarchical system of learning automata. IEEE Trans. Syst. Man, and Cybern. 11:236-241
[18] Bush R R, Mosteller F 1958 Stochastic Models for Learning. John Wiley & Sons, New York
[19] Shapiro I J, Narendra K S 1969 Use of stochastic automata for parameter self-optimization with multimodal performance criteria. IEEE Trans. Syst. Man, and Cybern. 5:352-361
[20] Varshavskii V I, Vorontsova I P 1963 On the behavior of stochastic automata with variable structure. Automation and Remote Control 24:327-333
[21] Chandrasekaran B, Shen D W C 1967 Adaptation of stochastic automata in non-stationary environments. Proc. Natl. Electronics Conf. pp 39-44
[22] Doob J L 1953 Stochastic Processes. John Wiley & Sons, New York
[23] Robbins H, Siegmund D 1971 A convergence theorem for nonnegative almost supermartingales and some applications. In: Rustagi J S (ed) 1971 Optimizing Methods in Statistics. Academic Press, New York
Appendix A: Theorems and Lemmas
This appendix collects the theorems and lemmas needed for the proofs of several theorems which have been established in the previous chapters. Each lemma is denoted by A.i-j, where i represents the chapter number and j the order of classification within the considered chapter. The references cited in this appendix are given in the bibliography section of chapter 1.
Theorem (Kall's theorem, 1966). For the linear equation
\[ Ax = b \qquad (A \in R^{m \times n},\ b \in R^m,\ n \ge m+1) \tag{K1} \]
where the matrix $A$ has its first $m$ columns $A_1, \dots, A_m$ linearly independent, to have a nonnegative solution $x \in R^n$ for all $b \in R^m$, it is necessary and sufficient that there exist
\[ \mu_j \ge 0, \qquad \lambda_j < 0 \tag{K2} \]
such that
\[ \sum_{j=m+1}^n \mu_j A_j = \sum_{i=1}^m \lambda_i A_i \tag{K3} \]
where $A_j$ is the $j$-th column of the matrix $A$.
Remark 5 In fact, the matrix $A$ must have more than $m$ columns, because it is impossible that some $b = b_0$ and also $-b_0$ (such a situation is possible) be represented as a nonnegative linear combination of the same $m$ independent columns. Hence, $n \ge m+1$.
Proof.
1) Necessity. The vectors $A_1, \dots, A_m$ are independent; then there exist numbers $\beta_i$ ($i = 1,\dots,m$) such that
\[ b = \sum_{i=1}^m \beta_i A_i \tag{K4} \]
Then, for any $x \in R^n$ ($x_i \ge 0$) satisfying (K1) we can write:
\[ Ax = \sum_{j=1}^n x_j A_j = \sum_{i=1}^m \beta_i A_i \]
or
\[ \sum_{i=1}^m (\beta_i - x_i) A_i = \sum_{j=m+1}^n x_j A_j \]
It is clear that, if we want to prove this statement for any $b \in R^m$, there exists such a $b$ that for the given matrix $A$ the numbers $\beta_j$ will be nonpositive, and hence we can identify $\lambda_i$ in (K3) with the coefficients $(\beta_i - x_i)$ of the left-hand side of the last equality and $\mu_j$ in (K3) with the coefficients of the right-hand side, i.e.,
\[ \lambda_i = \beta_i - x_i, \qquad \mu_j = x_j \]
So, the necessity is proved.
2) Sufficiency. Taking into account the representation (K4) and the linear independence of the vectors $A_1, \dots, A_m$, we can conclude that for any fixed $b$ the corresponding values $\beta_j$ are unique. If all of them are nonnegative then the equation (K1) has the solution
\[ x_j = \begin{cases} \beta_j & \text{for } j = 1,\dots,m \\ 0 & \text{for } j = m+1,\dots,n \end{cases} \]
Let us now suppose that at least one of the $\beta_j$ is negative. Notice that the largest of the ratios $\beta_j / \lambda_j$ ($j \le m$) is positive if at least one of the $\beta_j$ found above is negative. Let us define $\gamma_0$ as follows
\[ \gamma_0 := \max_j |\beta_j| / |\lambda_j| \]
If relation (K3) is true then the following relation
\[ \sum_{j=m+1}^n \gamma_0\, \mu_j A_j = \sum_{i=1}^m \gamma_0\, \lambda_i A_i \]
is also true, and hence, taking into account the representation (K4), we derive:
\[ \sum_{i=1}^m (\gamma_0 \lambda_i - \beta_i + \beta_i) A_i = \sum_{i=1}^m (\gamma_0 \lambda_i - \beta_i) A_i + b = \sum_{j=m+1}^n \gamma_0\, \mu_j A_j \]
or
\[ b = \sum_{i=1}^m (\gamma_0 |\lambda_i| + \beta_i) A_i + \sum_{j=m+1}^n \gamma_0\, \mu_j A_j \]
Taking into account that
\[ \gamma_0 |\lambda_j| + \beta_j \ge 0, \qquad \gamma_0\, \mu_j \ge 0 \]
we can finally write
\[ b = \sum_{j=1}^n x_j A_j \qquad \text{with} \qquad x_j = \begin{cases} \gamma_0 |\lambda_j| + \beta_j \ge 0 & \text{for } j = 1,\dots,m \\ \gamma_0\, \mu_j \ge 0 & \text{for } j = m+1,\dots,n \end{cases} \]
Theorem is proved. $\blacksquare$
Lemma A.3-1. Let $\{\mathcal{F}_n\}$ be a sequence of $\sigma$-algebras and $\eta_n$, $\theta_n$, $\lambda_n$ and $\nu_n$ be $\mathcal{F}_n$-measurable nonnegative random variables such that
1. $E(\eta_n) < \infty$ for all $n$
2. $\sum_{n=1}^\infty E(\theta_n) < \infty$
3. $\sum_{n=1}^\infty \lambda_n \stackrel{a.s.}{=} \infty$, $\sum_{n=1}^\infty \nu_n < \infty$ a.s.
4. $E(\eta_{n+1} \mid \mathcal{F}_n) \stackrel{a.s.}{\le} (1 - \lambda_{n+1} + \nu_{n+1})\, \eta_n + \theta_n$
Then
\[ \lim_{n \to \infty} \eta_n \stackrel{a.s.}{=} 0 \]
Proof. In view of the assumptions of this lemma and the Robbins-Siegmund theorem (Robbins and Siegmund, 1971), it follows that
\[ \eta_n \ \xrightarrow[n \to \infty]{a.s.}\ \eta^* \qquad \text{and} \qquad \sum_{n=1}^\infty \lambda_{n+1}\, \eta_n \stackrel{a.s.}{<} \infty \]
As
\[ \sum_{n=1}^\infty \lambda_n \stackrel{a.s.}{=} \infty \]
there exists a subsequence $\eta_{n_k}$ which tends to zero with probability 1. Hence $\eta^* \stackrel{a.s.}{=} 0$. $\blacksquare$
Lemma A.3-2. Let $\{u_n\}$ be a sequence of nonnegative random variables, $u_n$ measurable with respect to the $\sigma$-algebra $\mathcal{F}_n$, for all $n = 1, 2, \dots$, such that
1. $E(u_{n+1} \mid \mathcal{F}_n)$ exists for all $n = 1, 2, \dots$
2. the following inequality holds
\[ E(u_{n+1} \mid \mathcal{F}_n) \le u_n (1 - \alpha_n) + \beta_n \]
where $\{\alpha_n\}$ and $\{\beta_n\}$ are sequences of nonrandom variables such that
\[ \alpha_n \in (0,1], \qquad \sum_{n=1}^\infty \alpha_n = \infty, \qquad \beta_n \ge 0, \qquad \sum_{n=1}^\infty v_{n+1}\, \beta_n < \infty \]
for some nonnegative sequence $\{v_n\}$ ($v_n > 0$, $n = 1, 2, \dots$)
3. the limit
\[ \lim_{n \to \infty} \frac{v_{n+1} - v_n}{\alpha_n v_n} := \mu < 1 \]
exists.
Then
\[ u_n = o_\omega\!\left(\frac{1}{v_n}\right) \ \xrightarrow{a.s.}\ 0 \]
with probability 1 when $v_n \to \infty$.
Proof. Let $\tilde{u}_n$ be the sequence defined as $\tilde{u}_n := u_n v_n$. Then, using assumption (2), we obtain
\[ E(\tilde{u}_{n+1} \mid \mathcal{F}_n) \le \tilde{u}_n (1 - \alpha_n)\, \frac{v_{n+1}}{v_n} + v_{n+1} \beta_n = \tilde{u}_n (1 - \alpha_n)\left( \frac{v_{n+1} - v_n}{v_n} + 1 \right) + v_{n+1} \beta_n \]
Taking into account assumption (3), we derive
\[ E(\tilde{u}_{n+1} \mid \mathcal{F}_n) \stackrel{a.s.}{\le} \tilde{u}_n \left[ 1 - \alpha_n (1 - \mu + o(1)) \right] + v_{n+1} \beta_n \]
From this inequality and the Robbins-Siegmund theorem (Robbins and Siegmund, 1971), we obtain $\tilde{u}_n \stackrel{a.s.}{\to} 0$, which is equivalent to $u_n = o_\omega(1/v_n)$. Lemma is proved. $\square$
Lemma A.3-3 (see ref. 31 of chapter 2). For the sequence $\{\gamma_n\}$ where
\[ \gamma_n = \frac{\gamma}{n+a} \]
the following relations are fulfilled:
a) for $\gamma \in (0,1)$ and $a > \gamma$:
\[ \prod_{k=1}^n \left(1 - \frac{\gamma}{k+a}\right) \ge \left( \frac{1+a}{n+1+a} \right)^{\gamma} \]
b) for $\gamma = 1$ and $a > 0$:
\[ \prod_{k=1}^n \left(1 - \frac{1}{k+a}\right) = \frac{a}{n+a} \]
Proof. The proof of assertion b) is evident. For assertion a), using the convexity property of the function $x \ln x$, it follows that
\[ (x + \Delta x) \ln(x + \Delta x) - x \ln x \ge \Delta x\, (1 + \ln x) \]
Taking into account this inequality, we obtain
\[ \prod_{k=1}^n (1 - \gamma_k) = \exp\left\{ \sum_{k=1}^n \ln\left(1 - \frac{\gamma}{k+a}\right) \right\} \ge \exp\left\{ \int_0^n \ln\left(1 - \frac{\gamma}{x+1+a}\right) dx \right\} \ge \left( \frac{1+a}{n+1+a} \right)^{\gamma} \]
Lemma is proved. $\blacksquare$
1) the random variables eOJ(w) (j 0,...,m) are the partial limits of the sequences { ~ } (j = 0, ...,m) for almost all w e f~ if and only if they can be written in the following form N
¢o(.~) = ~ o p ( i ) :
Vo(p)
i=1 N
• ,(~) = ~ v~p(i) := y,(p),
(j .... 1,..., m)
i=1
where the random vector p = p(w) C S g (defined by equation (4.11)) is a limit point of the vector sequence f~ .... (f~(1) .... ,f,~(N)) T defined by (equation (4.13)). 2) for almost all w E f~
Proof. Let us rewrite 4p~ (equation (4.1)) in the following form N
= E
f,,(z)~,~(u,,, w)
i=l
where n
J {
x(~, =~(i));~ 0,(i),,~) ,
i~(~,~) =
~
~(~, = ~(i)) > 0
t=l
0
,
~
~(~, = ~(i)) = 0
t= 1
are the current average loss functions for ut = u(i) (i = 1, ..., N). According to lemma A.12 (Najim and Poznyak, 1994) (see also Ash, 1972 ), for almost all
t=l
Appendix A: Theorems a n d L e m m a s
197
we have lira ~J G~(u~,w)= v i For almost all w ~ Bi, we have: •
=j
h~. G ( ~ , ~) < and lim f~(i)::= 0
n-~oo
The vector f,~ also belongs to the simplex S g . It follows that any partial limit ~2J(w) of a sequence { ~ } can be expressed in the following form: N
cJ(~) = ~ v ~ p ( i ) : = y~(p),
(j .... 0,...,m)
i=1
where p i~s a partial limit of the sequence {fn} and consequently, ~IiJ(w) C [m/inv/J,maxv~] ( j = O , . . . , m ) Lemma is proved. []
Lemma A.4-2. Let us assume that
1. the control strategy $\{u_n\}$ is stationary, i.e.,
\[ P\{\omega : u_n = u(i) \mid \mathcal{F}_{n-1}\} \stackrel{a.s.}{=} p(i) \]
2. the random variables $\beta_n^j(u(i), \omega)$ ($j = 0,\dots,m$; $i = 1,\dots,N$) have stationary conditional mathematical expectations and uniformly bounded conditional second moments, i.e.,
\[ E\{\beta_n^j(u(i), \omega) \mid u_n = u(i) \wedge \mathcal{F}_{n-1}\} \stackrel{a.s.}{=} v_i^j, \qquad E\left\{ \left( \beta_n^j(u(i), \omega) \right)^2 \,\Big|\, u_n = u(i) \wedge \mathcal{F}_{n-1} \right\} \stackrel{a.s.}{\le} \mathrm{const} < \infty \qquad \forall i \]
and
\[ \sum_{t=1}^\infty \chi(u_t = u(i)) \stackrel{a.s.}{=} \infty \]
Then
\[ s_n^j(i) := \frac{\sum_{t=1}^n \beta_t^j(u(i), \omega)\, \chi(u_t = u(i))}{\sum_{t=1}^n \chi(u_t = u(i))} \ \xrightarrow[n \to \infty]{a.s.}\ v_i^j \]
Proof. From the recurrent form of $s_n^j(i)$
\[ s_n^j(i) = s_{n-1}^j(i) \left[ 1 - \frac{\chi(u_n = u(i))}{\sum_{t=1}^n \chi(u_t = u(i))} \right] + \beta_n^j(u(i), \omega)\, \frac{\chi(u_n = u(i))}{\sum_{t=1}^n \chi(u_t = u(i))} \]
we derive
\[ E\left\{ \left( s_n^j(i) - v_i^j \right)^2 \,\Big|\, \mathcal{F}_{n-1} \right\} \stackrel{a.s.}{\le} \left( s_{n-1}^j(i) - v_i^j \right)^2 \left[ 1 - \frac{\chi(u_n = u(i))\,(2 + o_\omega(1))}{\sum_{t=1}^n \chi(u_t = u(i))} \right] + \left[ \frac{\chi(u_n = u(i))}{\sum_{t=1}^n \chi(u_t = u(i))} \right]^2 \mathrm{Const}(\omega) \]
where $o_\omega(1)$ is a random sequence tending to zero with probability one, and $\mathrm{Const}(\omega)$ is an almost surely bounded and positive random variable. Observe that
\[ \sum_{n=1}^\infty \frac{\chi(u_n = u(i))\,(2 + o_\omega(1))}{\sum_{t=1}^n \chi(u_t = u(i))} \stackrel{a.s.}{=} \infty \]
In view of the Robbins-Siegmund theorem (Robbins and Siegmund, 1971) (or Lemma A.9 in Najim and Poznyak, 1994), the previous inequality leads to the desired result. Lemma is proved. $\blacksquare$
The lemma used in chapter 5 is stated and proved in the following.
Lemma A.5-1 (the matrix version of Abel's identity).
\[ \sum_{t=n_0}^n A_t B_t = A_n \sum_{t=n_0}^n B_t - \sum_{t=n_0}^n [A_t - A_{t-1}] \sum_{s=n_0}^{t-1} B_s, \qquad A_t \in R^{m \times k},\ B_t \in R^{k \times l} \]
Proof. For $n = n_0$ we obtain:
\[ A_{n_0} B_{n_0} = A_{n_0} B_{n_0} - [A_{n_0} - A_{n_0-1}] \sum_{s=n_0}^{n_0-1} B_s = A_{n_0} B_{n_0} \]
The sum $\sum_{s=n_0}^{n_0-1} B_s$ in the previous equality is zero by virtue of the fact that the upper limit of this sum is less than the lower limit. We use induction: we note that the identity (Abel's identity) is true for $n_0$, assume that it is true for $n$, and prove that it is true for $n+1$:
\[ \sum_{t=n_0}^{n+1} A_t B_t = \sum_{t=n_0}^{n} A_t B_t + A_{n+1} B_{n+1} = A_n \sum_{t=n_0}^n B_t - \sum_{t=n_0}^n [A_t - A_{t-1}] \sum_{s=n_0}^{t-1} B_s + A_{n+1} B_{n+1} \]
\[ = \left( A_{n+1} \sum_{t=n_0}^n B_t + A_{n+1} B_{n+1} \right) - (A_{n+1} - A_n) \sum_{t=n_0}^n B_t - \sum_{t=n_0}^n [A_t - A_{t-1}] \sum_{s=n_0}^{t-1} B_s \]
\[ = A_{n+1} \sum_{t=n_0}^{n+1} B_t - \sum_{t=n_0}^{n+1} [A_t - A_{t-1}] \sum_{s=n_0}^{t-1} B_s \]
The identity (Abel's identity) is proved. $\blacksquare$
Appendix B: Stochastic Processes
In this appendix we review the important definitions and some properties concerning stochastic processes. A stochastic process $\{x_n, n \in N\}$ is a collection (family) of random variables indexed by a real parameter $n$ and defined on a probability space $(\Omega, \mathcal{F}, P)$, where $\Omega$ is the space of elementary events $\omega$, $\mathcal{F}$ the basic $\sigma$-algebra and $P$ the probability measure. A $\sigma$-algebra $\mathcal{F}$ is a set of subsets of $\Omega$ (a collection of subsets). $\mathcal{F}(x_n)$ denotes the $\sigma$-algebra generated by the set of random variables $x_n$. The $\sigma$-algebra represents the knowledge about the process at time $n$. A family $\mathcal{F} = \{\mathcal{F}_n, n \ge 0\}$ of $\sigma$-algebras satisfies the standard conditions: $\mathcal{F}_s \subseteq \mathcal{F}_n \subseteq \mathcal{F}$ for $s \le n$, $\mathcal{F}_0$ is augmented by the sets of measure zero of $\mathcal{F}$, and $\mathcal{F}_n^+ = \bigcap_{s > n} \mathcal{F}_s$.
Let $\{x_n\}$ be a sequence of random variables with distribution functions $\{F_n\}$; we say that:
Definition 7 $\{x_n\}$ converges in distribution (law) to a random variable $x$ with distribution function $F$ if the sequence $\{F_n\}$ converges to $F$. This is written
\[ x_n \ \xrightarrow{\text{law}}\ x \]
Definition 8 $\{x_n\}$ converges in probability to a random variable $x$ if, given $\varepsilon, \delta > 0$, there exists $n_0(\varepsilon, \delta)$ such that $\forall n > n_0$
\[ P(|x_n - x| > \varepsilon) < \delta \]
This is written
\[ x_n \ \xrightarrow{P}\ x \]
Definition 9 $\{x_n\}$ converges almost surely (with probability 1) to a random variable $x$ if, given $\varepsilon, \delta > 0$, there exists $n_0(\varepsilon, \delta)$ such that
\[ P(|x_n - x| < \varepsilon \quad \forall n > n_0) > 1 - \delta \]
This is written
\[ x_n \ \xrightarrow{a.s.}\ x \]
Definition 10 $\{x_n\}$ converges in quadratic mean to a random variable $x$ if
\[ \lim_{n \to \infty} E\left[ (x_n - x)^T (x_n - x) \right] = 0 \]
This is written
\[ x_n \ \xrightarrow{q.m.}\ x \]
The relationships between these convergence concepts are summarized in the following:
1. convergence in probability implies convergence in law;
2. convergence in quadratic mean implies convergence in probability;
3. convergence almost surely implies convergence in probability.
In general, the converse of these statements is false. Stochastic processes such as martingales have extensive applications in stochastic problems. They arise naturally whenever one needs to consider mathematical expectations with respect to increasing information patterns. They will be used to state several theoretical results concerning the convergence and the convergence rate of learning systems.
Definition 11 A sequence of random variables $\{x_n\}$ is said to be adapted to the sequence of increasing $\sigma$-algebras $\{\mathcal{F}_n\}$ if $x_n$ is $\mathcal{F}_n$-measurable for every $n$.
Definition 12 A stochastic process $\{x_n\}$ is a martingale if $E\{|x_n|\} < \infty$ and
\[ E\{x_{n+1} \mid \mathcal{F}_n\} \stackrel{a.s.}{=} x_n \]
Definition 13 A stochastic process $\{x_n\}$ is a supermartingale if
\[ E\{x_{n+1} \mid \mathcal{F}_n\} \stackrel{a.s.}{\le} x_n \]
Definition 14 A stochastic process $\{x_n\}$ is a submartingale if
\[ E\{x_{n+1} \mid \mathcal{F}_n\} \stackrel{a.s.}{\ge} x_n \]
The following theorems are useful for convergence analysis.
Theorem (Doob, 1953). Let $\{x_n, \mathcal{F}_n\}$ be a nonnegative supermartingale ($x_n \ge 0$) such that
\[ \sup_n E\{x_n\} < \infty \]
Then there exists a random variable $x$ (defined on the same probability space) such that $E\{x\} < \infty$ and
\[ x_n \to x \quad (a.s.) \]
Theorem (Robbins and Siegmund, 1971). Let $\{\mathcal{F}_n\}$ be a sequence of $\sigma$-algebras and $x_n$, $\alpha_n$, $\beta_n$ and $\xi_n$ be $\mathcal{F}_n$-measurable nonnegative random variables such that for all $n = 1, 2, \dots$ there exists $E\{x_{n+1} \mid \mathcal{F}_n\}$ and the following inequality is verified
\[ E\{x_{n+1} \mid \mathcal{F}_n\} \le x_n (1 + \alpha_n) + \beta_n - \xi_n \]
with probability one. Then, for all $\omega \in \Omega_0$ where
\[ \Omega_0 := \left\{ \omega : \sum_{n=1}^\infty \alpha_n < \infty,\ \sum_{n=1}^\infty \beta_n < \infty \right\} \]
the limit $\lim_{n \to \infty} x_n = x^*(\omega)$ exists, and the sum
\[ \sum_{n=1}^\infty \xi_n \]
converges. Since the literature on stochastic processes is extensive, we refer the reader to the books written respectively by Doob (1953) and Neveu (1975).
Index
A Abel's identity, 172, 199; Accuracy, 23, 43, 46, 85, 93; Adaptive, 27, 29, 31; Algebra, 30, 48, 167
B Binary, 36, 120; Borel-Cantelli, 71, 81, 87, 174; Bush-Mosteller, 33, 51, 75, 95, 108, 121, 139, 167, 176, 186
C Caratheodory theorem, 14, 16; Cauchy distribution, 162; Cauchy-Bounyakovskii, 54, 60, 65; Chance constraint, 7; Chance constraints, 110; Changing number of actions, 91, 94; Constraints, 4, 108, 132; Continuous, 32, 93, 166; Convergence, 29, 57, 62, 68, 79, 90, 145, 183; Convergence rate, 123, 129, 146; Convexity, 44, 108; Correction factor, 30, 46, 63, 68, 76, 89, 167, 176, 179; Cost function, 108
E Environment, 3, 28, 30, 47, 74, 164, 167; Euclidean, 111; Expectation, 6, 30, 109, 126, 128, 144, 165, 167, 171; Expediency, 28, 165
G Gaussian, 162-163; Gradient, 4, 33, 107, 113
H Hessian, 134; Hierarchical, 28, 34-35, 165
I Inaction, 27
J Jensen's inequality, 73
K Kall's theorem, 17, 191; Kiefer-Wolfowitz, 33
L Lagrange, 66, 88, 108, 112-114, 116, 121, 127, 138, 152, 183; Law of large numbers, 50, 73, 110; Learning automata, 3, 27, 32, 43, 118, 165; Lipschitzian, 20, 44, 108, 116, 138; Loss function, 30, 47, 74, 76, 99, 110, 167, 170, 173, 177, 179, 182; Lyapunov, 109; Lyapunov function, 29, 50, 52, 58, 61, 63-64, 81, 180, 183
M Markovian, 165, 168; Martingale, 29, 112; Matrix inversion lemma, 134; Minmax problem, 112; Multi-teacher, 24, 28, 31, 119, 139, 164-165; Multimodal, 43-44, 94, 173
N Najim, 28, 91, 173; Nonstationary, 28, 31, 50, 161-162, 164-165, 167, 182, 185, 187; Normalization, 35, 69, 79, 91, 109, 121, 165, 173, 178, 181
O Objective function, 4; Observation, 6, 44, 70, 162
P Penalty, 28, 31, 33, 108, 164-165; Penalty function, 108, 132, 153; Poznyak, 28, 91, 173; Probability, 3, 30, 33, 50, 74, 145, 165, 176; Probability measure, 9, 30, 35, 46, 120; Programming problem, 3-4, 10, 48, 132, 135; Projection, 36-37, 47, 136, 142
Q Quantification, 44
R Reinforcement scheme, 29-30, 52, 58, 64, 120, 167, 176, 183, 186; Reward, 27, 33, 164; Robbins-Siegmund, 56, 72, 84, 128, 144, 175, 185
S S-model environment, 29, 31, 91, 97, 120, 166; Saddle-point, 108, 112, 114, 116, 124; Shapiro-Narendra, 33, 57, 75, 80, 95, 167, 179, 186; Simplex, 36, 38, 47, 49, 108, 111; Slack variables, 132; Slater's condition, 5, 10, 133, 136; Stationary, 28, 31, 47, 63, 68, 112, 161, 168, 177, 179; Stochastic, 3, 6, 9, 108, 164; Stochastic approximation, 8, 43, 108
T Taylor series, 61; Toeplitz, 178, 181, 185
V Varshavskii-Vorontsova, 33, 63, 75, 85, 95, 167, 182, 186