MARKOV PROCESSES AND LEARNING MODELS
This is Volume 84 in MATHEMATICS IN SCIENCE AND ENGINEERING A series of monographs and textbooks Edited by RICHARD BELLMAN, University of Southern California The complete listing of the books in this series is available from the Publisher upon request.
MARKOV PROCESSES AND LEARNING MODELS M. FRANK NORMAN Department of Psychology University of Pennsylvania Philadelphia, Pennsylvania
ACADEMIC PRESS New York and London
1972
COPYRIGHT © 1972, BY ACADEMIC PRESS, INC. ALL RIGHTS RESERVED. NO PART OF THIS BOOK MAY BE REPRODUCED IN ANY FORM, BY PHOTOSTAT, MICROFILM, RETRIEVAL SYSTEM, OR ANY OTHER MEANS, WITHOUT WRITTEN PERMISSION FROM THE PUBLISHERS.
ACADEMIC PRESS, INC. 111 Fifth Avenue, New York, New York 10003
United Kingdom Edition published by ACADEMIC PRESS, INC. (LONDON) LTD.
24/28 Oval Road, London NW1
LIBRARY OF CONGRESS CATALOG CARD NUMBER: 70-182638
AMS (MOS) 1970 Subject Classifications: 60B10, 60F05, 60J05, 60J20, 60J70, 92A25; 60J35, 60J60, 62M05, 62M10, 62M15, 92A10
PRINTED IN THE UNITED STATES OF AMERICA
TO Sandy
Contents

PREFACE xi

0 Introduction
0.1 Experiments and Models 1
0.2 A General Theoretical Framework 12
0.3 Overview 13

Part I DISTANCE DIMINISHING MODELS

1 Markov Processes and Random Systems with Complete Connections
1.1 Markov Processes 21
1.2 Random Systems with Complete Connections 24

2 Distance Diminishing Models and Doeblin-Fortet Processes
2.1 Distance Diminishing Models 30
2.2 Transition Operators for Metric State Spaces 37

3 The Theorem of Ionescu Tulcea and Marinescu, and Compact Markov Processes
3.1 A Class of Operators 43
3.2 The Theorem of Ionescu Tulcea and Marinescu 45
3.3 Compact Markov Processes: Preliminaries 50
3.4 Ergodic Decomposition 52
3.5 Subergodic Decomposition 56
3.6 Regular and Absorbing Processes 61
3.7 Finite Markov Chains 63

4 Distance Diminishing Models with Noncompact State Spaces
4.1 A Condition on p 66
4.2 Invariant Subsets 71

5 Functions of Markov Processes
5.1 Introduction 73
5.2 Central Limit Theorem 75
5.3 Estimation of μ 80
5.4 Estimation of σ² 84
5.5 A Representation of σ² 90
5.6 Asymptotic Stationarity 93
5.7 Vector Valued Functions and Spectra 94

6 Functions of Events
6.1 The Process Xn′ = (En, Xn+1) 98
6.2 Unbounded Functions of Several Events 100

Part II SLOW LEARNING

7 Introduction to Slow Learning
7.1 Two Kinds of Slow Learning 109
7.2 Small Probability 111
7.3 Small Steps: Heuristics 114

8 Transient Behavior in the Case of Large Drift
8.1 A General Central Limit Theorem 116
8.2 Properties of f(t) 120
8.3 Proofs of (A) and (B) 124
8.4 Proof of (C) 126
8.5 Near a Critical Point 133

9 Transient Behavior in the Case of Small Drift
9.1 Diffusion Approximation in a Bounded Interval 137
9.2 Invariance 141
9.3 Semigroups 146

10 Steady-State Behavior
10.1 A Limit Theorem for Stationary Probabilities 152
10.2 Proof of the Theorem 154
10.3 A More Precise Approximation to E(Xn) 157

11 Absorption Probabilities
11.1 Bounded State Spaces 163
11.2 Unbounded State Spaces 169

Part III SPECIAL MODELS

12 The Five-Operator Linear Model
12.1 Criteria for Regularity and Absorption 176
12.2 The Mean Learning Curve 179
12.3 Interresponse Dependencies 183
12.4 Slow Learning 189

13 The Fixed Sample Size Model
13.1 Criteria for Regularity and Absorption 196
13.2 Mean Learning Curve and Interresponse Dependencies 199
13.3 Slow Learning 202
13.4 Convergence to the Linear Model 205

14 Additive Models
14.1 Criteria for Recurrence and Absorption 211
14.2 Asymptotic A1 Response Frequency 214
14.3 Existence of Stationary Probabilities 217
14.4 Uniqueness of the Stationary Probability 218
14.5 Slow Learning 221

15 Multiresponse Linear Models
15.1 Criteria for Regularity 226
15.2 The Distribution of Yn and Y∞ 229

16 The Zeaman-House-Lovejoy Models
16.1 A Criterion for Absorption 235
16.2 Expected Total Errors 237
16.3 The Overlearning Reversal Effect 239

17 Other Learning Models
17.1 Suppes' Continuous Pattern Model 244
17.2 Successive Discrimination 247
17.3 Signal Detection: Forced-Choice 249
17.4 Signal Detection: Yes-No 252

18 Diffusion Approximation in a Genetic Model and a Physical Model
18.1 Wright's Model 257
18.2 The Ehrenfest Model 262

References 263
List of Symbols 269
Index 271
Preface
This monograph presents some developments in probability theory that were motivated by stochastic learning models, and describes in considerable detail the implications of these developments for the models that instigated them. No attempt is made to establish the psychological utility of these models, but ample references are provided for the reader who wishes to pursue this question. In doing so he will quickly become aware that the difficulty of deriving predictions from these models has prevented them from developing to their full potential, whatever that may be. Since I am more a probabilist than a psychologist, I am in a position to regard this difficulty as a challenge rather than a nuisance.

The book has four main parts: Introduction (Chapter 0), Distance Diminishing Models (Part I), Slow Learning (Part II), and Special Models (Part III). Parts I and II and Chapter 14 in Part III (Additive Models) constitute the theoretical core. From the mathematical point of view, Part I develops a theory of Markov processes that move by random contraction of a metric space. Transition by random translation on the line is considered in Chapter 14. Part II presents an extensive theory of diffusion approximation of discrete time Markov processes that move "by small steps." Parts I and II are practically independent, so it would be almost as natural to read them in reverse order. Chapters 12-16 of Part III consider the various special models described in the Introduction in the light of Parts I and II. In addition, some
special properties of these models are obtained by ad hoc calculation. Chapter 17 takes up a number of other learning models briefly, and Chapter 18 spells out the implications of Part II for a population genetic model of S. Wright and for the Ehrenfest model for heat exchange. The chapters of Part III are almost independent, except that Chapter 13 depends heavily on Chapter 12. Open mathematical problems are easily visible throughout the book, but especially in Chapters 14-16.

The mathematical prerequisites for close reading of Parts I and II are analysis (integration, metric topology, functional analysis) and probability at about the first-year graduate level. The texts of Royden (1968) and Breiman (1968) would be excellent preparation. An acquaintance with the language and notation of these subjects will suffice if the reader is willing to skip most of the proofs. The prerequisites for Part III (excepting Chapter 14) are less stringent, and it can be used for reference purposes without studying Parts I and II systematically.

No previous exposure to mathematical learning theory is assumed, though it would be useful. The glimpse of this subject in Chapter 0 is adequate preparation for the rest of the book. The reader is referred to Atkinson, Bower, and Crothers (1965) for a balanced and thorough introduction. Those who are familiar with mathematical learning theory will notice that the emphasis on "continuous state models" in this book, at the expense of "finite state models," is the reverse of the emphasis in the psychological literature. In particular, the book makes no contribution to the analysis of models with very few states of learning. These models are quite well understood mathematically, and they have been extremely fruitful psychologically.

Concerning the numbering of formal statements, Theorem 5.2.3 is the third theorem of Section 5.2. Within Chapter 5, it is referred to as Theorem 2.3. Equations, lemmas, and definitions are numbered in the same way.
The symbol ∎ signifies the end of a proof. I have not attempted to reconstruct the history of the topics treated. In most cases, only the main source or immediate antecedent of a development is cited in the text. These sources often give additional information concerning prior and complementary results. The publications of the following individuals deserve special mention in connection with the parts indicated: C. T. Ionescu Tulcea (Part I), A. Khintchine (Part II), and R. R. Bush and F. Mosteller (Part III). Much of the research reported in this volume is my own, and a significant portion appears here for the first time. My work has been generously supported by the National Science Foundation under grants GP-7335 and GB-7946X to the University of Pennsylvania. Most of the writing was done in particularly pleasant circumstances at Rockefeller University on a Guggenheim Fellowship.
It is a pleasure to acknowledge the encouragement and assistance of a number of my teachers and colleagues, especially R. C. Atkinson, K. L. Chung, W. K. Estes, M. Kac, S. Karlin, and P. Suppes. I am also much obliged to R. Bellman for suggesting that I write a book for this series. The manuscript was expertly typed by Mrs. Maria Kudel and Miss Mary Ellen O’Brien. Finally, the book is dedicated to my wife, Sandy, in gratitude for her inexhaustible affection and patience.
0 Introduction
0.1. Experiments and Models

In this section we describe a few more or less standard experimental paradigms, and certain special models for subjects' behavior in such experiments. These models have stimulated and guided the development of the general mathematical theories presented in Parts I and II of this volume. In addition, they are of substantial psychological and mathematical interest in their own right. The implications of Parts I and II for these models, as well as numerous results of a more specialized character, are collected in Part III.

All but one of the experiments to be described consist of a sequence of trials. On each trial the subject is presented with a stimulus configuration, makes a response, and an outcome ensues. There are at least two response alternatives, and there may be infinitely many. The outcome or payoff typically has obvious positive or negative value to the subject, and would be expected to influence future responses accordingly. If an outcome raises, or at least does not lower, the probability of a response in the presence of a stimulus, it is said to reinforce the association between stimulus and response. The probabilities of different stimuli, and the conditional probabilities of
various outcomes, given the stimulus and response, are prescribed by the experimenter and are constant throughout the experiment. Simple learning experiments are those in which only one stimulus configuration is ever presented. The complementary case of two or more stimulus configurations is called discrimination or identification learning, since the subject must distinguish between stimuli if he is to be able to make appropriate responses (i.e., responses that yield preferred outcomes) to each. A. SIMPLE LEARNING WITH TWO RESPONSES
Three of the six models described in this section relate to experiments of this type. The two responses are denoted A1 and A0, and, in the most general case considered, response Ai can be followed by either of two outcomes, Oi1 or Oi0, where Oij reinforces Aj. The reinforcement probability parameters are

πij = P(Oij | Ai).

Of course, if there is no outcome Oij in a particular experiment, we take πij = 0, and, conversely, if πij = 0, we need not include Oij in the description of the experiment. The notation Oij emphasizes that the outcome O11 that follows A1 and reinforces A1 (perhaps presentation of food or money) may be of a totally different character than the outcome O01 that follows A0 and reinforces A1 (perhaps no food or loss of money). However, it is convenient, for most purposes, to redefine the outcomes in terms of their supposed reinforcing effects. Thus we introduce the new outcomes O1 and O0, where Oj indicates reinforcement of Aj, irrespective of the preceding response. Thus Ai Oj ("Ai is followed by Oj") means the same thing as Ai Oij. The trial number is indicated by a subscript. Thus Ain or Ojn denotes occurrence of Ai or Oj on trial n. We always call the first trial "trial 0," so n = 0, 1, 2, .... This is slightly more convenient mathematically than beginning the trial numbers with 1.
EXPERIMENTS. (i) Paired-associate learning (see Kintsch, 1970). A human subject is required to learn a "correct" response to each of a sequence of stimuli presented repetitively, as when a student learns the vocabulary of a foreign language from a deck of cards. Though this is basically a complex discrimination learning experiment with multiple responses, as a first approximation we may focus on successive presentations of the same item, ignoring interitem interactions, and code the subject's responses on that item as "correct" (A1) or "incorrect" (A0). If the subject is told the correct response after each of his responses, then O1 is the only outcome, π11 = π01 = 1, and we have an example of continuous reinforcement.
(ii) Prediction experiments (see Estes, 1964; Myers, 1970). A human subject is asked to predict which of two lights, 0 and 1, will flash on each trial. Response Ai is prediction of light i, and outcome Oj is flashing of light j. Monetary payoffs are sometimes used (with, of course, a larger prize for a correct prediction), but more often they are not. The special case of noncontingent outcomes, where the outcome probabilities do not depend on the response made (π11 = π01 = π), has received most attention experimentally. After performance has stabilized, it is often found that the proportion of A1 responses is very close to the probability π that A1 is reinforced, at least when the average of several subjects' data is considered. This probability matching is somewhat surprising, since frequency of correct prediction is maximized by always predicting the light which flashes most frequently. As one might expect, experiments with monetary payoffs tend to produce behavior closer to this optimal strategy. The condition P(A1n) − P(O1n) → 0 as n → ∞ defines a form of probability matching that is applicable to arbitrary outcome probabilities. Since

P(O1n) = π11 P(A1n) + π01 (1 − P(A1n)),

we have

P(A1n) − P(O1n) = (π01 + π10) P(A1n) − π01.

Therefore, if π01 + π10 > 0, probability matching in the above sense is equivalent to P(A1n) → λ, where

λ = π01/(π01 + π10).
Hence we refer to λ as the probability matching asymptote. It is a useful reference point for asymptotic performance.

(iii) T-Maze experiments (see Bitterman, 1965; Weinstock, North, Brody, and LoGuidice, 1965). On each trial a rat is placed at the bottom of a T-shaped alley, and proceeds to the end of the right or left arm (response A1 or A0, respectively), where he may or may not find food. Finding food reinforces the response just made, and finding no food reinforces the other response, though in some variants of this experiment it appears that the effect of not finding food may be nil or almost nil. Rats show no tendency to match response to outcome probability. Practically all of them develop a decided preference for the response with the highest probability of yielding food.
a barrier (A1) shortly after a warning signal. Otherwise (A0) he must clear the barrier to escape the shock. The only possible outcomes are avoidance and shock, and both appear to reinforce jumping, so π01 = π11 = 1.
MODELS. All of the models presented below allow for the possibility that an outcome will have no effect on response probability on some trials. This state of affairs is called ineffective conditioning and denoted C0; otherwise, conditioning is effective (C1). The probability of effective conditioning is a function only of the preceding response and outcome:

cij = P(C1 | Ai Oj).
In all three models, the effects of a subject's experience before trial n are summarized by a random variable that represents his probability of making response A1 on trial n. In the linear and stimulus sampling models, this random variable is denoted Xn. In additive models, this variable is called pn, and Xn is a certain transform of pn that is more useful for most purposes.

(i) A stimulus sampling model. Stimulus sampling theory was introduced by Estes in 1950. Atkinson and Estes (1963) gave a unified presentation of the theory, and Neimark and Estes (1967) collected many relevant papers. In the models considered here, it is postulated that the stimulus configuration consists of a population of N stimulus elements, each of which is conditioned to either A1 or A0. The subject samples s of these without replacement, and makes response A1 with probability m/s if the sample contains m elements conditioned to A1. If Aj is effectively reinforced, everything in the sample becomes conditioned to Aj. If conditioning is ineffective, no element changes its state of conditioning. Various sampling mechanisms have been considered, but it is usually assumed that the sample size s is fixed throughout the experiment, and all elements are equally likely to be sampled. We will restrict our attention to this fixed sample size model. In the special case s = 1, each "stimulus element" may be interpreted as a pattern of stimulation, and this special case is consequently referred to as the pattern model.

Let Xn be the proportion of elements in the stimulus population conditioned to A1 at the beginning of trial n. It is easy to compute the conditional probability, given Xn, of any succession of occurrences on trial n. For example, the probability of obtaining a sample with m elements conditioned to A1, making A1, and having A0 effectively reinforced, is

[C(NXn, m) C(N(1 − Xn), s − m) / C(N, s)] (m/s) π10 c10,

where C(a, b) denotes the binomial coefficient.
This event produces the decrement ΔXn = −m/N in Xn. The variable Xn gives the subject's A1 response probability before sampling, in the sense that

P(A1n | Xn) = Xn. (1.1)
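The sampling and conditioning cycle just described can be simulated directly. The sketch below is a hypothetical illustration (the function and parameter names are inventions, not notation from the text): it draws a sample of s elements without replacement, generates a response and an outcome, and reconditions the whole sample when conditioning is effective.

```python
import random

# One trial of the fixed sample size model (simulation sketch; the
# parameter matrices pi and c are hypothetical stand-ins).

def fixed_sample_trial(k, N, s, pi, c, rng=random):
    """k = number of elements conditioned to A1; returns the updated k.

    pi[i][j] = P(O_ij | A_i) and c[i][j] = P(C1 | A_i O_j)."""
    # Draw s elements without replacement; m of them are conditioned to A1.
    sample = rng.sample([1] * k + [0] * (N - k), s)
    m = sum(sample)
    i = 1 if rng.random() < m / s else 0        # response A_i, with P = m/s
    j = 1 if rng.random() < pi[i][1] else 0     # outcome O_j reinforces A_j
    if rng.random() < c[i][j]:                  # effective conditioning:
        k = k - m + (s if j == 1 else 0)        # whole sample becomes A_j
    return k
```

With c identically zero no element ever changes state, and with s = N an effective reinforcement of A1 conditions the entire population.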
(ii) A linear model. Linear models were introduced by Bush and Mosteller in 1951, and reached a high degree of development in a treatise published soon thereafter (Bush and Mosteller, 1955). The effectiveness of conditioning mechanism was first used in conjunction with linear models by Estes and Suppes (1959a). The five-operator linear model postulates a linear transformation

x′ = (1 − θ)x + θλ  (0 ≤ θ ≤ 1, 0 ≤ λ ≤ 1)  (1.2)

of A1 response probability x for each response-outcome-effectiveness sequence. Rewriting this as

x′ − x = θ(λ − x),

we see that the change in x is a proportion θ of the distance to the fixed point λ. If conditioning is ineffective, then θ = 0 and λ is immaterial. If O1 C1 occurs, then x′ − x ≥ 0 for all 0 ≤ x ≤ 1, so that, if θ > 0, we must have λ = 1. Similarly λ = 0 for O0 C1. Thus

ΔXn = θi1(1 − Xn)  if Ain O1n C1n,
    = −θi0 Xn      if Ain O0n C1n,
    = 0            if C0n.

The conditional probabilities that various operators are applicable are given by expressions like

P(A1n O0n C1n | Xn) = Xn π10 c10

that are common to all three of the models discussed here. If θij = 1, Ain Ojn C1n produces complete learning, in the sense that Xn+1 = j, irrespective of Xn. This fact permits us to regard certain "all-or-none" models as linear models. Norman (1964) has given an example in which this point of view is rather natural. It is shown in Chapter 13 that the linear and stimulus sampling models are quite closely related. For example, predictions of the stimulus sampling model converge, as N → ∞ and s/N → θ, to those of the linear model with θij = θ for all i and j.
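A single trial of the five-operator linear model can be sketched as a simulation; the parameter matrices below are hypothetical stand-ins, and the function name is invented for illustration.

```python
import random

# One trial of the five-operator linear model (simulation sketch;
# pi, c, and theta are hypothetical parameter matrices).

def linear_model_trial(x, pi, c, theta, rng=random):
    """x = P(A1); returns the updated A1 response probability.

    pi[i][j] = P(O_ij | A_i), c[i][j] = P(C1 | A_i O_j),
    theta[i][j] = learning rate for the event A_i O_j C1."""
    i = 1 if rng.random() < x else 0            # response A_i
    j = 1 if rng.random() < pi[i][1] else 0     # outcome O_j (reinforces A_j)
    if rng.random() < c[i][j]:                  # effective conditioning
        lam = 1.0 if j == 1 else 0.0            # fixed point: 1 for O1, 0 for O0
        x = (1 - theta[i][j]) * x + theta[i][j] * lam
    return x
```

Setting all theta[i][j] = 1 reproduces the complete-learning ("all-or-none") case: after an effective Ain Ojn C1n, the new value of x is j.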
(iii) Additive models. Luce (1959) proposed that the A1 response probability variable p be represented

p = v1/(v1 + v0)

in terms of the strength vi > 0 of Ai. In terms of the relative strength v = v1/v0 or x = ln v, this becomes

p = v/(v + 1) = e^x/(e^x + 1).  (1.3)

Beta models postulate that a learning experience multiplies each vi by a positive constant βi. Thus v is multiplied by β = β1/β0, and b = ln β is added to x. If the experience reinforces A1, then β ≥ 1 or b ≥ 0; if it reinforces A0, then β ≤ 1 or b ≤ 0. For the five-operator beta model, the value Xn of x on trial n thus satisfies

ΔXn = bij  if Ain Ojn C1n,  (1.4)

where bi1 ≥ 0 and bi0 ≤ 0. Alternatively,

vn+1 = βij vn  (1.5)

or

pn+1 = βij pn/(βij pn + 1 − pn)  (1.6)

if Ain Ojn C1n, where βi1 ≥ 1 and βi0 ≤ 1. Of course, ΔXn, Δvn, and Δpn are all 0 if C0n. Generalizing slightly, we may consider additive models, in which a learning experience effects a change x′ = x + b in the variable x (equivalently v′ = βv, where v = e^x and β = e^b), just as in the beta model, but the A1 response probability variable p = p(x) need not be of the form (1.3). It is natural to assume that p is continuous and strictly increasing with p(−∞) = 0 and p(∞) = 1, though some of our results require less. Given such qualitative restrictions, the precise form of p has remarkably little influence on those aspects of these models' behavior that are considered in Chapter 14. Five-operator additive models satisfy (1.4) and (1.5). If p is strictly increasing, there will also be an equation analogous to (1.6).
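The equivalence between the additive bookkeeping in x and the beta operator acting on p itself can be checked numerically; the values of x and b below are arbitrary test points, not parameters from any experiment discussed here.

```python
import math

# Beta-model bookkeeping, following (1.3)-(1.5): a learning experience
# adds b = ln(beta) to x, and the response probability is
# p(x) = e^x / (e^x + 1).  The test values of x and b are arbitrary.

def p_of_x(x):
    return math.exp(x) / (math.exp(x) + 1.0)

def beta_update_p(p, beta):
    # The induced operator on p itself, as in the text.
    return beta * p / (beta * p + 1.0 - p)

x, b = 0.3, 0.5
beta = math.exp(b)
# Adding b to x and then transforming agrees with operating on p directly.
assert abs(p_of_x(x + b) - beta_update_p(p_of_x(x), beta)) < 1e-12
```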
SYMMETRY. Symmetries in the experimental situation permit us to reduce the number of distinct parameters in these models. One type of symmetry is especially prevalent. In prediction experiments, and in T-maze experiments where the same amounts of food are used to reward left and right choices, it is natural to assume that the response-outcome pairs A1 O1 and A0 O0 that involve "success" are equally effective for learning, and that the same is true of the "failure" pairs A1 O0 and A0 O1. To say that they are equally effective means, first, that they have equal probabilities of producing effective conditioning, thus

c11 = c00 = c  and  c01 = c10 = c*.  (1.7)

In the linear and beta models, equal effectiveness of two response-outcome pairs also implies that the corresponding operators u and ū on A1 response probability x are complementary. This means that the new value 1 − ū(x) of 1 − x produced by ū is the same function of 1 − x that u is of x:

u(x) = 1 − ū(1 − x).

If ū(x) = (1 − θ)x, then

u(x) = 1 − (1 − θ)(1 − x) = (1 − θ)x + θ,

while if

ū(p) = βp/(βp + 1 − p),

then

u(p) = β⁻¹p/(β⁻¹p + 1 − p).

Thus complementarity of the success operators and complementarity of the failure operators reduce, respectively, to

θ11 = θ00 = θ  and  θ01 = θ10 = θ*  (1.8)

in the linear model, and

β11 = 1/β00 = β  and  β01 = 1/β10 = β*,

or, alternatively,

b11 = −b00 = b  and  b01 = −b10 = b*  (1.9)

in the beta model. This characterization of complementarity is valid for all additive models with p strictly increasing and p(−x) = 1 − p(x). The further condition that success and failure be equally effective, an extremely stringent symmetry condition, would take the form c = c* and, also, θ = θ* in the linear model and b = b* in additive models.† These assumptions are sometimes convenient mathematically, but most of our results do not require them.

†The pattern model is exceptional, since success has no effect and c11 has no role.
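The complementarity relation u(x) = 1 − ū(1 − x) between a beta operator with parameter β and one with parameter 1/β can be verified numerically at arbitrary test points:

```python
# Numerical check of operator complementarity u(x) = 1 - u_bar(1 - x)
# for the beta-model operators; the beta and x values are arbitrary.

def beta_op(p, beta):
    return beta * p / (beta * p + 1.0 - p)

for beta in (0.5, 2.0, 3.7):
    for x in (0.1, 0.42, 0.9):
        u = beta_op(x, 1.0 / beta)                  # complementary operator, 1/beta
        assert abs(u - (1.0 - beta_op(1.0 - x, beta))) < 1e-12
```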
B. SIMPLE LEARNING WITH MANY RESPONSES

As in the experiments described previously, the subject confronts the same stimulus configuration on every trial. But on trial n he makes a choice Yn from a set Y of alternatives that may have more than two elements, perhaps even a continuum of elements. This is followed by an outcome Zn from a set Z of possibilities. The conditional probability distribution

D(Yn, A) = P(Zn ∈ A | Yn)

of Zn, given Yn, is normally specified by the experimenter, and, at any rate, it does not vary over trials. Linear models for such experiments take the following form. Let Xn be a subject's choice distribution on trial n; i.e.,

P(Yn ∈ A | Xn) = Xn(A),

where A is any (measurable) subset of Y. For every response-outcome pair e = (y, z), there is a θe (0 ≤ θe ≤ 1) and a probability λe on Y such that, if Yn = y and Zn = z, then

Xn+1 = (1 − θe)Xn + θe λe.

If θe > 0, λe represents the asymptote of Xn under repeated occurrence of e. In typical experimental situations, one has enough intuition about this asymptote to place some restrictions on the form of λe.†
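For a finite response set Y the update Xn+1 = (1 − θe)Xn + θe λe is elementary to implement; the two-response distributions below are hypothetical illustrations, and the function name is invented.

```python
# Multiresponse linear update for a finite response set: distributions
# are dicts mapping response -> probability.  theta_e and lambda_e
# below are hypothetical parameters.

def linear_update(X, theta_e, lambda_e):
    """Mix the current choice distribution X toward lambda_e."""
    return {y: (1 - theta_e) * X[y] + theta_e * lambda_e.get(y, 0.0) for y in X}

X = {"left": 0.5, "right": 0.5}
X1 = linear_update(X, 0.2, {"left": 1.0})
assert abs(sum(X1.values()) - 1.0) < 1e-12   # still a probability distribution
```

Since the update is a convex combination of two probability distributions, the result is again a probability distribution, whatever θe and λe.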
A CONTINUOUS PREDICTION EXPERIMENT. Suppes (1959) considered a situation in which a subject predicts where on the rim of a large disk a spot of light will appear. Here Yn is the subject's prediction and Zn is the point subsequently illuminated on trial n. Suppes assumed that θe = θ does not depend on e and λe = λ(z, ·) does not depend on y. In addition, the distribution λ(z, ·) is symmetric about z and has mode at z. Suppes and Frankmann (1961) and Suppes, Rouanet, Levine, and Frankmann (1964) report experimental tests of this model and a closely related stimulus sampling model, also due to Suppes (1960), that is described in Section 17.1.

FREE-RESPONDING. A pigeon pecks a lighted "key" in a small experimental chamber (or "Skinner box"). Occasional pecks are reinforced by brief presentation of a grain hopper. In the experiments considered here, the experimenter specifies the probability π(y) that a peck y seconds after the last one (that is, a y-second interresponse time or IRT) is reinforced. Such experiments do not have trials in the usual sense, but one can consider each response as a choice of an IRT from Y = (0, ∞). In Norman's (1966) linear

†A different type of multiresponse linear model has been considered by Rouanet and Rosenberg (1964).
model, Y0 is the time until the first response, Yn is the nth IRT for n ≥ 1, Xn is a subject's distribution of Yn, and

Zn = 1  if Yn is reinforced,
   = 0  otherwise.

Clearly, P(Zn = 1 | Yn) = π(Yn). It is assumed that the entire effect of nonreinforcement is to decrease the rate of responding. Thus

Xn+1 = (1 − θ*)Xn + θ*τ*

if Zn = 0, where τ* has a very large expectation. Reinforcement of a y-second IRT is supposed to result in a compromise between two effects: an increase in the rate of responding, and an increase in the frequency of approximately y-second IRT's. If τ is a probability on Y with a very small expectation, Λ(y, ·) is a probability on Y with mode near y, and 0 < a < 1, then this compromise is represented by the assumption that

Xn+1 = (1 − θ)Xn + θ[(1 − a)τ + aΛ(y, ·)]

if Yn = y and Zn = 1. The case where y is a scale parameter of Λ(y, ·) [i.e., Λ(y, A) = η(A/y) for some probability η on Y] is especially interesting. Comparison of this model with data from some standard schedules of reinforcement indicates that realistic values of θ and θ* have θ*/θ extremely small.
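The two free-responding operators can be sketched in discretized form. Everything below is a hypothetical stand-in for the continuous objects in the text: the IRT grid, the distributions τ, τ*, and Λ(y, ·), and the parameter values are all invented for illustration.

```python
# Discretized sketch of the free-responding updates: the IRT
# distribution X is a probability vector over a finite grid of IRTs.

def mix(X, theta, target):
    """X' = (1 - theta) X + theta * target, componentwise."""
    return [(1 - theta) * x + theta * t for x, t in zip(X, target)]

def reinforced_update(X, theta, a, tau, lam_y):
    """X' = (1 - theta) X + theta [(1 - a) tau + a Lambda(y, .)]."""
    target = [(1 - a) * t + a * l for t, l in zip(tau, lam_y)]
    return mix(X, theta, target)

# Three-point grid of IRTs; each list is a probability vector.
X = [0.2, 0.5, 0.3]
tau = [0.9, 0.1, 0.0]        # small expected IRT: high response rate
tau_star = [0.0, 0.1, 0.9]   # large expected IRT: low response rate
lam_y = [0.1, 0.8, 0.1]      # concentrated near the reinforced IRT y

X_nonrein = mix(X, 0.05, tau_star)            # theta* taken small
X_rein = reinforced_update(X, 0.4, 0.5, tau, lam_y)
assert abs(sum(X_nonrein) - 1.0) < 1e-12 and abs(sum(X_rein) - 1.0) < 1e-12
```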
C. DISCRIMINATION LEARNING

One of the lateral arms of a T-maze is white, the other is black, and the positions of the two are interchangeable. The black arm is placed to the left on a randomly chosen half of the trials. This stimulus configuration is denoted (B, W), the other, (W, B). The rat's choice of arms may be described either with respect to brightness (B or W) or position (L or R). Reward could be correlated with either brightness or position, but let us consider an experiment in which the rat is fed if and only if he chooses the black arm. In many experiments of this type, training is continued until performance meets some criterion, and then the correct response is reversed, that is, switched to W. An interesting question is the effect of extra trials before reversal on the number of errors in the new problem. Though such overtraining leads to more errors early in reversal, it sometimes produces fewer total errors in reversal: the overlearning reversal effect. Clearly the rat must learn to observe or attend to brightness, the relevant stimulus dimension, if he is to be fed consistently. And, in so far as perceptual learning of this type takes place in overtraining, an overlearning reversal
effect is a possibility. The concept of attention to a stimulus dimension is central to the model described below. This model is a specialization, to the case of two stimulus dimensions, of a theory proposed by Zeaman and House (1963) in the context of discrimination learning in retarded children. See Estes (1970, Chapter 13) for a full discussion of the model in that context.

The animal is supposed to attend to either brightness (br) or position (po), and v denotes the probability of the former. If he attends to brightness, he chooses black with probability y. If he attends to position, his probability of going left is z. To summarize:

y = P(B | br),  v = P(br),  and  z = P(L | po).

The values of v, y, and z on trial n are Vn, Yn, and Zn. Trial-to-trial changes in these variables are determined by the same considerations that govern changes in response probability in simple learning with two responses. The probability v of attending to brightness increases if the rat attends to brightness and is fed, or does not attend to brightness and is not fed. Otherwise it decreases. The conditional response probabilities y and z change only if the rat attends to the corresponding dimension. Since food is always associated with B, y = P(B | br) increases whenever the rat notices brightness. If he attends to position on a (B, W) trial, z = P(L | po) increases, since food is (fortuitously!) on the left. On (W, B) trials, z decreases. These changes are summarized in Table 1, which also gives the conditional probabilities of various events, given Vn = v, Yn = y, and Zn = z. Choices are specified only as B or W, but, since stimulus configurations are given, laterality is implicit.

TABLE 1
EVENT EFFECTS AND PROBABILITIES IN A DISCRIMINATION LEARNING MODEL*
Event                     v    y    z    Probability

(B,W)  br  B  (fed)       +    +    0    vy/2
(B,W)  br  W  (no food)   −    +    0    v(1−y)/2
(W,B)  br  B  (fed)       +    +    0    vy/2
(W,B)  br  W  (no food)   −    +    0    v(1−y)/2
(B,W)  po  B  (fed)       −    0    +    (1−v)z/2
(B,W)  po  W  (no food)   +    0    +    (1−v)(1−z)/2
(W,B)  po  B  (fed)       −    0    −    (1−v)(1−z)/2
(W,B)  po  W  (no food)   +    0    −    (1−v)z/2

*Notation: + indicates an increase, − a decrease, 0 no change.
Zeaman and House stipulated that all of these changes are effected by linear operators like those in two-choice simple learning models, and, further, that there are only two learning rate parameters, one for v and one for y and z. We will forego the latter restriction at this juncture. Thus the first two entries under v in Table 1 mean that

Vn+1 = (1 − θ1)Vn + θ1  and  Vn+1 = (1 − θ1)Vn,

respectively, where 0 < θ1 < 1. This rather complex model describes the rat's choices with respect to both brightness and position. For example,

P(B | v, y, z) = P(B | br)v + P(B | po)(1 − v) = yv + (1 − v)/2,

and, similarly,

P(L | v, y, z) = z(1 − v) + v/2,

so that

P(Bn) = E(Yn Vn) + E(1 − Vn)/2  and  P(Ln) = E(Zn(1 − Vn)) + E(Vn)/2.

Note that the probability of a black (hence correct) response depends only on v and y. There is a simpler description of the transitions of these variables than that given in Table 1. Collapsing the pair of rows corresponding to each attention-response specification, we obtain the reduced model of Table 2. This reduction presupposes a natural lateral symmetry condition: The operators on v (and y) in the rows to be combined must have the same learning rate parameter, as well as the same limit point.

TABLE 2
EVENT EFFECTS AND PROBABILITIES FOR THE REDUCED MODEL
This model was proposed by Lovejoy (1966, Model I) as an explanation of the overlearning reversal effect. Though he noted that his theory was "quite close" to that of Zeaman and House, the precise nature of the relationship seems to have eluded him, since he later faulted his model for having "...completely disregarded the fact that sometimes an animal chooses one side and sometimes the other..." (Lovejoy, 1968, p. 17). This was held to be a serious omission because of the frequent occurrence of strong lateral biases or "position habits" early in acquisition. Within the full model of Table 1, these would be reflected in values of v near zero and values of z near zero or one. Henceforth we will refer to both the full and reduced models described in this subsection as Zeaman-House-Lovejoy (or ZHL) models. The merits and limitations of these models are discussed in Sutherland and Mackintosh's (1971) treatise on discrimination learning in animals.

The experimental situation that we have considered is an example of a simultaneous discrimination procedure, since both values of the relevant brightness dimension are present on each trial. In the comparable successive procedure, food would be available on the left, say, if both lateral arms are black, and on the right if both are white. A model for successive discrimination, due to Bush (1965) and built on the same psychological base as the ZHL models, is presented in Section 17.2.
0.2. A General Theoretical Framework

All of the examples given in the preceding section have the following structure. At the beginning of trial $n$, the subject is characterized by his state of learning $X_n$, which takes on values in a state space $X$. On this trial, an event $E_n$ occurs, in accordance with a probability distribution
$$p(X_n, G) = P(E_n \in G \mid X_n)$$
over subsets $G$ of an event space $E$. This, in turn, effects a transformation
$$X_{n+1} = u(X_n, E_n)$$
of state. In all of the examples, the subject's response is one coordinate of the event $E_n$, and, in the simple learning models, its outcome is another. Additional coordinates are as follows: number of elements in the stimulus sample conditioned to $A_1$ (fixed sample size model), effectiveness of conditioning (two-choice simple learning models), state of attention (ZHL models), and stimulus configuration (full ZHL model).

In all of the two-choice simple learning models, $X_n$ can be taken to be a subject's $A_1$ response probability, though we prefer a certain transform of this variable for additive models. Thus $X_n$ is one dimensional. In the multichoice models, $X_n$ is the choice distribution on trial $n$. Even though $X_n$ is multidimensional in this case, all of its "coordinates" $X_n(A)$ are of the same type: probabilities of sets of responses. Thus all of our examples of simple learning models are considered uniprocess models. The coordinates of the state variables $X_n = (U_n, V_n, Y_n, Z_n)$ and $X_n = (U_n, Y_n)$ of the ZHL models, on the other hand, do not possess this homogeneity. One of them, $U_n$, describes a perceptual learning process, while the others describe response learning, but under the influence of different stimuli. These are, therefore, examples of multiprocess models. Though there has been a tendency for uniprocess and multiprocess models to be used in conjunction with simple and discrimination learning, respectively, the association is not inevitable. For an example of a multiprocess simple learning model, see Bower (1959). In Sections 17.3 and 17.4, we consider the uniprocess models of Atkinson and Kinchla (1965) and Kac (1962, 1969) for signal detection experiments with two stimulus conditions.

Intermediate in generality between the various special models introduced in Section 0.1 and the general framework described above are two classes of models that will play a prominent role in subsequent developments. The finite state models are those whose state spaces have only a finite number of elements. In the fixed sample size model, for example,
$$X = \{j/N : j = 0, \ldots, N\}.$$
In distance diminishing models, $X$ is a metric space with metric $d$. Typically, all of the event operators $u(\cdot, e)$ are nonexpansive:
$$d(u(x, e), u(y, e)) \le d(x, y) \quad \text{for all } x \text{ and } y,$$
and some are contractive:
$$d(u(x, e), u(y, e)) < d(x, y) \quad \text{if } x \ne y.$$
The precise definition of this class of models is given in Section 2.1. For example, the operators (1.2) of the five-operator linear model satisfy
$$|x' - y'| = (1 - \theta)\, |x - y|.$$
Such an operator is nonexpansive, and contractive if $\theta > 0$. Given slight restrictions on their parameters, all of the uniprocess linear models discussed above, as well as the reduced ZHL model, are distance diminishing.
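The nonexpansive and contractive properties are easy to check numerically. The sketch below is a hypothetical illustration (the operator form $u(x) = (1 - \theta)x + \theta\lambda$ and the parameter values are assumptions for demonstration, not taken from the text) of how a linear event operator shrinks distances by the factor $1 - \theta$.

```python
# Hypothetical linear event operators u(x) = (1 - theta) * x + theta * lam.
# theta > 0 gives a contractive operator; theta = 0 gives an isometry,
# which is nonexpansive but not contractive.

def make_operator(theta, lam):
    """Linear learning operator with rate theta and limit point lam."""
    return lambda x: (1.0 - theta) * x + theta * lam

ops = [make_operator(0.2, 1.0), make_operator(0.0, 1.0)]

x, y = 0.3, 0.9
for op in ops:
    assert abs(op(x) - op(y)) <= abs(x - y)      # nonexpansive in all cases
print(abs(ops[0](x) - ops[0](y)) / abs(x - y))   # contraction factor 1 - theta
```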
0.3. Overview

In the general theoretical framework described at the beginning of the last section, the sequence $X_n$ of states is a Markov process. The event process $E_n$ is not Markovian, but the sequence $X_n' = (E_n, X_{n+1})$ of event-state pairs is. When $E_n$ includes a specification of the subject's response $Y_n$ on trial $n$, as is usually the case, we may consider $Y_n$ as a function either of $E_n$, $Y_n = g(E_n)$, or of the Markov process $X_n'$, $Y_n = f(X_n')$. Thus the study of learning models quickly leads to Markov processes.

Most of this volume is given over to the systematic investigation of Markov processes arising in or suggested by learning models. Considerations specific to events and responses provide a closely related secondary focus. Some of our results are rather general, and this generality has been emphasized in order to heighten mathematical interest and facilitate use in areas other than learning theory. But applications of general theorems to particular learning models and specific psychological problems are not neglected, and we have not hesitated to include results applicable only to special models. In addition to this multiplicity of levels of abstraction, a variety of different mathematical techniques and viewpoints are employed. We have been quite impressed by the range of analytic and probabilistic tools that yield important insights into the behavior of a few simple models.

Here is a brief survey of the contents of the book. Part I is concerned with distance diminishing models and the much simpler finite state models. Chapter 1 gives background material on Markov processes in abstract spaces and on our general theoretical framework. Chapter 2 provides preliminary results for distance diminishing models, and for a class of Markov processes in metric spaces, the Doeblin-Fortet processes, that includes their state sequences $X_n$. The ergodic theory of compact Markov processes, that is, Doeblin-Fortet processes in compact state spaces, is presented in Chapter 3. This theory is completely analogous to the theory of finite Markov chains, which is a special case. Some comparable results for distance diminishing models in noncompact state spaces are given in Chapter 4.
Chapter 5 contains a law of large numbers, a central limit theorem, and some estimation techniques for certain bounded real valued functions $f(X_n)$ of regular Markov processes $X_n$. This theory is applied in Chapter 6 to the processes $X_n' = (E_n, X_{n+1})$ in distance diminishing and finite state models for which $X_n$ is regular.

Part II deals with slow learning, to which Chapter 7 gives a full introduction. To study learning by small steps, we consider a family $X_n^\theta$ of Markov processes indexed by a parameter $\theta$ such that $\Delta X_n^\theta = O(\theta)$, and take limits as $\theta \to 0$. Diffusion approximations to the distribution of $X_n^\theta$ for the transient phase of the process, when $n$ is not too large, are obtained in Chapters 8 and 9. Approximations to stationary distributions and absorption probabilities are considered in Chapters 10 and 11, respectively. The form of these approximations is determined by the drift $E(\Delta X_n^\theta \mid X_n^\theta = x)$, and by the conditional variance (or covariance matrix in higher dimensions) of $\Delta X_n^\theta$, given $X_n^\theta = x$. Some special considerations apply to the case of small ($O(\theta^2)$) drift.
In Part III the methods of Parts I and II and some special techniques are applied to various special models. In order to gain a definite impression of the types of results obtained in Part III (and, therefore, in Parts I and II as well), let us consider some examples pertaining to the symmetric case of the five-operator linear model. In this model, the $A_1$ response probability satisfies the stochastic difference equation
$$\Delta X_n = \begin{cases} \theta(1 - X_n) & \text{with probability } X_n \pi_{11} c, \\ -\theta^* X_n & \text{with probability } X_n \pi_{10} c^*, \\ \theta^*(1 - X_n) & \text{with probability } (1 - X_n)\pi_{01} c^*, \\ -\theta X_n & \text{with probability } (1 - X_n)\pi_{00} c, \\ 0 & \text{otherwise.} \end{cases} \tag{3.1}$$
Suppose that $X_0 = x$, let $x_n = E(X_n)$, and let $f$ be the solution of the differential equation $df/dt = w(f)$, where $w(y) = E(\Delta X_n \mid X_n = y)/\theta$, for which $f(0) = x$. Then
$$x_n - f(n\theta) = O(\theta)$$
as long as $n\theta$ remains bounded. Furthermore, $\operatorname{var}(X_n) = O(\theta)$, and the distribution of $(X_n - f(n\theta))/\theta^{1/2}$ approaches normality as $\theta \to 0$.

The results that follow require certain restrictions on the model's parameters. Suppose, for simplicity, that success is effective, in the sense that $\theta c > 0$, and that either outcome can follow either response ($\pi_{ij} > 0$ for all $i$ and $j$). The cases of effective failure ($\theta^* c^* > 0$) and ineffective failure ($\theta^* c^* = 0$) must be distinguished. The former arises more frequently in practice.
When $\theta^* c^* > 0$, the process $X_n$ has no absorbing states and is regular, so that the distribution of $X_n$ converges (weakly) to a limit $\mu$ that does not depend on $X_0 = x$. Let $x_\infty$ be the expectation of $\mu$:
$$\lim_{n \to \infty} x_n = x_\infty = \int y\, \mu(dy),$$
and let $A_{1,n}$ be a subject's proportion of $A_1$ responses in the first $n$ trials. Then $A_{1,n}$ is asymptotically normally distributed as $n \to \infty$, with mean $x_\infty$ and variance proportional to $1/n$. The proportionality constant $\sigma^2$ can be consistently estimated on the basis of a single subject's data. The asymptotic $A_1$ response probability $x_\infty$ is bounded by the probability matching asymptote $\lambda = \pi_{01}/(\pi_{01} + \pi_{10})$ and the unique root $l$ in $(0, 1)$ of the quadratic equation $w(l) = 0$. In fact, if $A_1$ is the "better" response, in the sense that $\pi_{11} > \pi_{00}$ (or, equivalently, $\pi_{01} > \pi_{10}$), and if $\theta^* < 1$, then
$$\lambda < x_\infty < l \quad \text{if } \theta c > \theta^* c^*,$$
$$\lambda = x_\infty = l \quad \text{if } \theta c = \theta^* c^*,$$
$$\lambda > x_\infty > l \quad \text{if } \theta c < \theta^* c^*.$$
The quantity $l$ is an appropriate approximation to $x_\infty$ when $\theta$ and $\theta^*$ are small. For $w$, and thus $l$, are independent of $\theta$ along any line $\theta^*/\theta = k$, and $x_\infty \to l$ as $\theta \to 0$. In addition, the asymptotic distribution $\mu$ of $X_n$ is approximately normal with variance proportional to $\theta$ when $\theta$ is small.

The behavior of $X_n$ is radically different when $\theta^* c^* = 0$. Both 0 and 1 are absorbing states, and
$$P_x\left(\lim_{n \to \infty} X_n = 0 \text{ or } 1\right) = 1$$
for any initial state $x$. The probability
$$\phi(x) = P_x\left(\lim_{n \to \infty} X_n = 1\right)$$
that the process is attracted to 1 is fundamental. When $\pi_{00} = \pi_{11}$, $\phi(x) = x$. We now describe an approximation to $\phi$ that is valid when $\theta$ is small and $\pi_{00} - \pi_{11} = O(\theta)$. Suppose that $\pi_{11}$ is fixed and that $\pi_{00}$ approaches it along a line
$$\left(\frac{\pi_{00}}{\pi_{11}} - 1\right)\Big/\theta = k,$$
where $k \ne 0$. This constant is a measure of the relative attractiveness of $A_0$ and $A_1$. It follows that
$$E(\Delta X_n \mid X_n = x) = \theta^2 a(x)$$
and
$$E((\Delta X_n)^2 \mid X_n = x) = \theta^2 b(x) + O(\theta^3),$$
where $b(x) = \pi_{11}\, c\, x(1 - x)$ and $a(x) = -k\, b(x)$. As $\theta \to 0$, $\phi(x) \to \psi(x)$, where $\psi(x)$ is the solution of the differential equation
$$\frac{1}{2}\, b(x)\, \frac{d^2\psi}{dx^2}(x) + a(x)\, \frac{d\psi}{dx}(x) = 0$$
with $\psi(0) = 0$ and $\psi(1) = 1$; i.e.,
$$\psi(x) = \frac{e^{2kx} - 1}{e^{2k} - 1}.$$
Note that
$$\psi\left(\tfrac{1}{2}\right) = \frac{1}{e^k + 1},$$
which is very small when $k$ is large. Thus, if a subject has no initial response bias, and learning is slow, it is very unlikely that he will be absorbed on $A_1$ when $A_0$ is much more attractive. This is, of course, just what we would expect intuitively.
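Both the formula for $\psi$ and the absorption result lend themselves to simulation. The sketch below uses hypothetical parameter values with $\theta^* c^* = 0$ and $\pi_{00}/\pi_{11} = 1 + k\theta$, and compares the observed frequency of absorption at 1 with $\psi(x) = (e^{2kx} - 1)/(e^{2k} - 1)$; agreement is only approximate, since $\psi$ is the $\theta \to 0$ limit.

```python
import math
import random

# Ineffective failure (theta* c* = 0): only success events move X_n.
# Hypothetical parameters, with pi00 / pi11 = 1 + k * theta as in the text.
theta, c, k = 0.05, 1.0, 1.0
pi11 = 0.8
pi00 = pi11 * (1.0 + k * theta)

def absorbed_at_one(x, rng, max_trials=200_000):
    """Run one subject until X_n is numerically absorbed at 0 or 1."""
    for _ in range(max_trials):
        if x < 1e-4 or x > 1.0 - 1e-4:
            break
        r = rng.random()
        if r < x * pi11 * c:
            x += theta * (1.0 - x)          # effective success on A1
        elif r < x * pi11 * c + (1.0 - x) * pi00 * c:
            x -= theta * x                  # effective success on A0
        # all other events are ineffective
    return x > 0.5

def psi(x, k):
    return (math.exp(2.0 * k * x) - 1.0) / (math.exp(2.0 * k) - 1.0)

rng = random.Random(1)
x0, n = 0.5, 1500
freq = sum(absorbed_at_one(x0, rng) for _ in range(n)) / n
print(freq, psi(x0, k))                     # approximately equal for small theta
```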
Part I. DISTANCE DIMINISHING MODELS
1. Markov Processes and Random Systems with Complete Connections
1.1. Markov Processes
Our starting point is a measurable space $(X, \mathscr{B})$, and a stochastic kernel $K$ defined on $X \times \mathscr{B}$. For every $x \in X$, $K(x, \cdot)$ is a probability on $\mathscr{B}$, while for every $B \in \mathscr{B}$, $K(\cdot, B)$ is $\mathscr{B}$-measurable. A sequence $\mathscr{X} = \{X_n\}_{n \ge 0}$ of random vectors on a probability space $(\Omega, \mathscr{F}, P)$, with values in $(X, \mathscr{B})$, is a Markov process with (stationary) transition kernel $K$ if
$$P(X_{n+1} \in B \mid X_n, \ldots, X_0) = K(X_n, B) \tag{1.1}$$
almost surely (a.s.), for each $n \ge 0$ and $B \in \mathscr{B}$. The process has state space $(X, \mathscr{B})$ and initial distribution $p_0(B) = P(X_0 \in B)$. For any $(X, \mathscr{B})$, $K$, and $p_0$, there exists a corresponding $(\Omega, \mathscr{F}, P)$ and $\mathscr{X}$ (Neveu, 1965, V.2). Moreover, the distribution of such a process, which is the probability $Q$ on $\mathscr{B}^\infty = \mathscr{B} \times \mathscr{B} \times \cdots$ given by
$$Q(B) = P(\mathscr{X} \in B),$$
is completely determined by $p_0$ and $K$. In fact, $Q$ is the only probability on
$\mathscr{B}^\infty$ such that
$$Q(B_0 \times B_1 \times \cdots) = \int_{B_0} p_0(dx_0) \int_{B_1} K(x_0, dx_1) \cdots \int_{B_k} K(x_{k-1}, dx_k)$$
for any sequence $B_n$ in $\mathscr{B}$ such that $B_n = X$ for $n > k$. We sometimes write $Q_x(B)$ or $P_x(\mathscr{X} \in B)$ when $p_0 = \delta_x$ [the probability concentrated at $x$, also denoted $\delta(x, \cdot)$] and we wish to call attention to $x$.

For a (real or complex) scalar valued function $f$ on $X$, let $|f|$ be its supremum norm, $H(f)$ the closed convex hull of its range, and $\operatorname{osc}(f)$ the diameter of its range or, equivalently, the diameter of $H(f)$. Thus
$$|f| = \sup_{x \in X} |f(x)|, \quad H(f) = \operatorname{co}(\operatorname{range} f), \quad \text{and} \quad \operatorname{osc}(f) = \operatorname{diam} H(f).$$
Let $B(X)$ be the Banach space of bounded $\mathscr{B}$-measurable scalar valued functions on $X$ under the supremum norm. Let $M(X)$ be the Banach space of finite signed measures on $\mathscr{B}$, under the norm
$$|\mu| = \text{total variation of } \mu,$$
and let $P(X)$ be the probability measures on $\mathscr{B}$. If $\mu \in P(X)$ and $f \in B(X), then
$$\int f\, d\mu \in H(f). \tag{1.2}$$
If $\mu \in M(X)$ with $\mu(X) = 0$, then
$$|\mu| = 2 \sup_{B \in \mathscr{B}} \mu(B) \tag{1.3}$$
and
$$\left| \int f\, d\mu \right| \le \tfrac{1}{2}\, |\mu|\, \operatorname{osc}(f) \tag{1.4}$$
for $f \in B(X)$. The transition operators
$$Uf(x) = \int K(x, dy)\, f(y) \tag{1.5}$$
and
$$T\mu(B) = \int \mu(dx)\, K(x, B) \tag{1.6}$$
on $B(X)$ and $M(X)$, respectively, generalize left and right multiplication by the transition matrix in the theory of finite Markov chains. The second is the adjoint of the first, in the sense that
$$(\mu, Uf) = (T\mu, f), \tag{1.7}$$
where
$$(\mu, f) = \int f\, d\mu$$
for $\mu \in M(X)$ and $f \in B(X)$. Both $U$ and $T$ are positive ($Uf \ge 0$ if $f \ge 0$, $T\mu \ge 0$ if $\mu \ge 0$) and both are contractions ($|U|, |T| \le 1$). In addition, $T\mu \in P(X)$ if $\mu \in P(X)$, and $Uf = f$ if $f$ is constant. More generally, $Uf(x) \in H(f)$ by (1.2); hence,
$$H(Uf) \subset H(f). \tag{1.8}$$
It follows that $\operatorname{osc}(Uf) \le \operatorname{osc}(f)$. The powers of the transition operators satisfy
$$U^n f(x) = \int K^{(n)}(x, dy)\, f(y) \tag{1.9}$$
and
$$T^n \mu(B) = \int \mu(dx)\, K^{(n)}(x, B), \tag{1.10}$$
where $K^{(n)}$ is the $n$-step transition kernel, defined recursively by
$$K^{(0)}(x, \cdot) = \delta_x$$
and
$$K^{(n+1)}(x, B) = \int K(x, dy)\, K^{(n)}(y, B).$$
The probabilistic significance of $K^{(j)}$, $U^j$, and $T^j$ is clear from the formulas
$$P(X_{n+j} \in B \mid X_n, \ldots, X_0) = K^{(j)}(X_n, B) \quad \text{a.s.},$$
$$E(f(X_{n+j}) \mid X_n, \ldots, X_0) = U^j f(X_n) \quad \text{a.s.},$$
and
$$p_{n+j} = T^j p_n,$$
where $\mathscr{X} = \{X_n\}_{n \ge 0}$ is any Markov process with transition kernel $K$, and $p_n$ is the distribution of $X_n$. The first equation is obtained by applying the second to the indicator function $I_B(x)$ of $B$:
$$I_B(x) = \begin{cases} 1 & \text{if } x \in B, \\ 0 & \text{if } x \notin B. \end{cases}$$
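On a finite state space these formulas reduce to matrix algebra, which may help fix ideas. The sketch below (with an arbitrary illustrative 3-state kernel) realizes $U$ as left multiplication of a column vector by $K$ and $T$ as right multiplication of a row vector, and checks the adjointness relation (1.7) together with the facts that $U1 = 1$ and that $T$ preserves probabilities.

```python
# U f = K f (functions as column vectors) and T mu = mu K (measures as
# row vectors) for an arbitrary illustrative 3-state transition matrix.
K = [[0.5, 0.5, 0.0],
     [0.1, 0.6, 0.3],
     [0.0, 0.2, 0.8]]

def U(f):
    return [sum(K[x][y] * f[y] for y in range(3)) for x in range(3)]

def T(mu):
    return [sum(mu[x] * K[x][y] for x in range(3)) for y in range(3)]

def pair(mu, f):
    """(mu, f) = integral of f with respect to mu."""
    return sum(m * v for m, v in zip(mu, f))

f = [1.0, -2.0, 0.5]
mu = [0.2, 0.3, 0.5]
print(abs(pair(T(mu), f) - pair(mu, U(f))))   # adjointness (1.7): ~0

ones = U([1.0, 1.0, 1.0])                     # U 1 = 1
print(ones, sum(T(mu)))                       # T preserves total mass
```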
A set $B \in \mathscr{B}$ is stochastically closed if $B \ne \emptyset$ and $K(x, B) = 1$ for all $x \in B$. Thus $P(X_{n+1} \in B \mid X_n) = 1$ a.s. when $X_n \in B$. If a stochastically closed set contains a single element $a$, we say that $a$ is an absorbing state. A probability $\mu$ is stationary if $T\mu = \mu$. If the initial distribution $p_0$ of $\mathscr{X}$ is stationary, then $p_n = p_0$ for all $n$. In fact, $\mathscr{X}$ is then a strictly stationary stochastic process.

1.2. Random Systems with Complete Connections
In this section we begin with two measurable spaces $(X, \mathscr{B})$ and $(E, \mathscr{E})$, a stochastic kernel $p$ on $X \times \mathscr{E}$, and a transformation $u$ of $X \times E$ into $X$ that is measurable with respect to $\mathscr{B} \times \mathscr{E}$ and $\mathscr{B}$. Following Iosifescu (1963; see Iosifescu and Theodorescu, 1969), we call the system $((X, \mathscr{B}), (E, \mathscr{E}), p, u)$ a (homogeneous) random system with complete connections. A sequence $\mathscr{Y} = X_0, E_0, X_1, E_1, \ldots$ of random vectors on a probability space $(\Omega, \mathscr{F}, P)$ is an associated stochastic process if $X_n$ and $E_n$ take on values in $(X, \mathscr{B})$ and $(E, \mathscr{E})$, respectively,
$$X_{n+1} = u(X_n, E_n) \tag{2.1}$$
and
$$P(E_n \in A \mid X_n, E_{n-1}, \ldots) = p(X_n, A) \tag{2.2}$$
a.s., for each $A \in \mathscr{E}$. The processes $\{X_n\}_{n \ge 0}$ and $\{E_n\}_{n \ge 0}$ are, respectively, state and event sequences, $(X, \mathscr{B})$ is the state space, and $(E, \mathscr{E})$ is the event space. The distribution $p_0$ of $X_0$ is the initial distribution of an associated stochastic process.

The concept of a random system with complete connections may be regarded as a generalization and formalization of the notion of a stochastic learning model. Thus we will often call such a system a learning model or simply a model.† In this context, the state of learning $X_n$ characterizes a subject's response tendencies on trial $n$, and the event $E_n$ specifies those occurrences on trial $n$ that affect subsequent behavior. Typically $E_n$ includes a specification of the subject's response and its observable outcome or payoff. When the subject is in state $x$, the event has distribution $p(x, \cdot)$, and the transformation of state associated with the event $e$ is $u(\cdot, e)$. Three classes

† This terminology is slightly at variance with that in Chapter 0. For example, "the five-operator linear model" of Chapter 0 is a family of models indexed by the parameters $\theta$, $\theta^*$, $c$, $c^*$, and $\pi_{ij}$, according to the present terminology.
of learning models with which we will be especially concerned are the distance diminishing models defined in the next section, the additive models discussed in Chapter 14, and the finite state models.

DEFINITION 2.1. A finite state model is a learning model for which $X$ is a finite set and $\mathscr{B}$ contains all its subsets.†
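Equations (2.1) and (2.2) translate directly into a simulation loop. The sketch below is a hypothetical two-event linear model on $X = [0, 1]$ (the kernel $p$ and the operator $u$ are illustrative assumptions, not a model from the text): on each trial an event is drawn from $p(X_n, \cdot)$ and the state is updated by $u$.

```python
import random

# A hypothetical random system with complete connections on X = [0, 1]:
# two events, linear operators, event probabilities depending on the state.
def p_a(x):
    """p(x, {'a'}): probability of event 'a' in state x."""
    return x

def u(x, e):
    """State transformation u(x, e); both operators are linear."""
    return 0.9 * x + (0.1 if e == 'a' else 0.0)

def run(x0, n, rng):
    """Generate X0, E0, X1, E1, ... according to (2.1) and (2.2)."""
    xs, es = [x0], []
    x = x0
    for _ in range(n):
        e = 'a' if rng.random() < p_a(x) else 'b'
        x = u(x, e)                  # (2.1): X_{n+1} = u(X_n, E_n)
        es.append(e)
        xs.append(x)
    return xs, es

xs, es = run(0.5, 50, random.Random(2))
assert all(0.0 <= x <= 1.0 for x in xs)     # the state stays in X = [0, 1]
print(len(xs), len(es))                     # 51 50
```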
There is a stochastic process $\mathscr{Y}$ associated with any random system with complete connections and any initial distribution $p_0$ (Neveu, 1965, V.1). The distribution $Q$ of any such process is the unique extension to $\mathscr{B} \times \mathscr{E} \times \mathscr{B} \times \cdots$ of the measure on cylinder sets given by
$$Q(B_0 \times A_0 \times B_1 \times \cdots) = \int_{B_0} p_0(dx_0) \int_{A_0} p(x_0, de_0) \int_{B_1} \delta(u(x_0, e_0), dx_1) \cdots \int_{A_k} p(x_k, de_k),$$
where $B_n \in \mathscr{B}$ and $A_n \in \mathscr{E}$ for $n \le k$, and $B_n = X$ and $A_n = E$ for $n > k$.

Let $p_1 = p$ and, for $k \ge 1$, let
$$p_{k+1}(x, A) = \int p(x, de) \int p_k(u(x, e), de^k)\, I_A(e, e^k) \tag{2.3}$$
for $x \in X$ and $A \in \mathscr{E}^{k+1}$. Then $p_k$ is a stochastic kernel on $X \times \mathscr{E}^k$, and the equation
$$P((E_n, \ldots, E_{n+k-1}) \in A \mid X_n, E_{n-1}, \ldots) = p_k(X_n, A) \tag{2.4}$$
generalizes (2.2).
generalizes (2.2). Let p m ( x ,.) be the distribution of 8 = 6,; i.e.,
(2.4)
when Xo has distribution
P m (x, 4 = P X ( 8 E A ) for A E 9".Then p m is a stochastic kernel on X x Y" which extends pk in the sense that, if A E Y k , Pm(x,A
x E m ) = pk(x,A).
(2.5)
If S is the shift operator in E m :S{e,,},,20= {e,,+l}n,O, and bN= S N b = { E N + n } n 3 0 , then, a.s.3 P(BNE AIXN,E N - , , ...) = p m ( xA~) ,.
(2.6)
One of our prime objectives is to study state sequences from various classes of learning models. Such sequences are interesting in their own right, and provide an indispensable tool for the study of event sequences. The following simple observation is fundamental.

† Similarly, when $E$ is finite, we always assume that all subsets are in $\mathscr{E}$.
THEOREM 2.1. An associated state sequence $\mathscr{X}$ of a random system with complete connections is Markovian, with transition operator
$$Uf(x) = \int p(x, de)\, f(u(x, e)).$$
In finite state models, $\mathscr{X}$ is a finite Markov chain.

Proof. As a consequence of (2.1),
$$E(f(X_{n+1}) \mid X_n, X_{n-1}, \ldots) = E(f(u(X_n, E_n)) \mid X_n, X_{n-1}, \ldots)$$
for $f \in B(X)$; thus
$$E(f(X_{n+1}) \mid X_n, X_{n-1}, \ldots) = Uf(X_n)$$
by (2.2). ∎
by (2.2). A useful representation of the powers of the transition operator
w - ( x )= = (eo, ..
where e"
. , en-
s
(I is
Pn (x,W f ( u(x,4 )9
(2.7)
and u ( x , e") is defined iteratively:
u(u(x,e"), en) = u ( x , e n + ' ) .
(2.8)
Joint measurability of u ( x , e ) ensures that u ( x , e") is measurable with respect to B x X " , so the integral on the right in (2.7) makes sense and defines a measurable function of x. For any A E g'",
P ~ ( X , S - ~=A P) x ( L f NA~) =
= E x ( P ( s N €AIX,))
Ex(Pm(xN, A ) )
= U N p m( x ,A ) .
(2.9)
The sequence $X_n' = (E_n, X_{n+1})$ is also Markovian, and the application of an appropriate Markov process theory to this process in Section 6.1 yields additional information about event sequences in distance diminishing and finite state models. The Markov process $X_n'' = (X_n, E_n)$ can be used in the same way for finite state models. In finite state models with finite event spaces, both $X_n'$ and $X_n''$ are finite Markov chains. Let $\mathscr{X}' = \{X_n'\}_{n \ge 0}$ and $\mathscr{X}'' = \{X_n''\}_{n \ge 0}$.
THEOREM 2.2. The process $\mathscr{X}'$ is Markovian, with transition operator
$$U'f'(e, x) = \int p(x, de')\, f'(e', u(x, e')) \tag{2.10}$$
and initial distribution given by
$$E(f'(X_0')) = \int p_0(dx)\, \bar{f}'(x),$$
where $\bar{f}'(x) = \int p(x, de)\, f'(e, u(x, e))$.

Note that $U'f'(e, x) = \bar{f}'(x)$ does not depend on $e$, and that, if $f' \in B(X')$ depends only on $x$, $U'f'(e, x) = Uf'(x)$. Thus, for any $f' \in B(X')$,
$$U'^n f'(e, x) = U^{n-1} \bar{f}'(x). \tag{2.11}$$
THEOREM 2.3. The process $\mathscr{X}''$ is Markovian, with transition operator
$$U''f''(x, e) = \int p(u(x, e), de')\, f''(u(x, e), e')$$
and initial distribution
$$E(f''(X_0'')) = \int p_0(dx)\, \bar{f}''(x),$$
where
$$\bar{f}''(x) = \int p(x, de)\, f''(x, e).$$
The simple proof is omitted. It is easily shown that
$$U''^n f'' = (U^{n-1} \bar{f}'') \circ u \tag{2.12}$$
for $f'' \in B(X'')$ and $n \ge 1$, where $\circ$ indicates composition of functions; i.e., $g \circ u(x, e) = g(u(x, e))$.
REDUCTION. An important tool in the study of learning models with multidimensional state and event spaces is the reduction procedure that was applied to the full ZHL model in Section 0.1. Here we describe this procedure in abstract terms and note its properties. The starting point is a learning model $((X, \mathscr{B}), (E, \mathscr{E}), p, u)$, with which are associated state and event sequences $X_n$ and $E_n$. In addition, there are measurable spaces $(X^*, \mathscr{B}^*)$ and $(E^*, \mathscr{E}^*)$, and measurable transformations $\Phi$ and $\Psi$ of $X$ and $E$ onto $X^*$ and $E^*$, respectively. Intuitively, $x^* = \Phi(x)$ and $e^* = \Psi(e)$ represent simplified state and event variables. In the full ZHL model, they are projections:
$$\Phi(u, v, y, z) = (u, y)$$
and
$$\Psi(s, a, r) = (a, r),$$
where $s = (B, W)$ or $(W, B)$, $a = br$ or $po$, and $r = B$ or $W$.

We now give conditions under which $X_n^* = \Phi(X_n)$ and $E_n^* = \Psi(E_n)$ are state and event sequences for a learning model. Suppose that $\Phi(u(x, e))$ depends only on $x^*$ and $e^*$, and that, for any $A^* \in \mathscr{E}^*$, $p(x, \Psi^{-1}(A^*))$ depends only on $x^*$. In other words, there are functions $u^*$ on $X^* \times E^*$ and $p^*$ on $X^* \times \mathscr{E}^*$ such that
$$u^*(\Phi(x), \Psi(e)) = \Phi(u(x, e)) \tag{2.13}$$
and
$$p^*(\Phi(x), A^*) = p(x, \Psi^{-1}(A^*)). \tag{2.14}$$
Since $\Phi$ and $\Psi$ are onto, these functions are unique and $p^*(x^*, \cdot)$ is a probability.
THEOREM 2.4. If $u^*$ and $p^*(\cdot, A^*)$ are measurable, so that $((X^*, \mathscr{B}^*), (E^*, \mathscr{E}^*), p^*, u^*)$ is a learning model, then $X_n^* = \Phi(X_n)$ and $E_n^* = \Psi(E_n)$ are associated state and event sequences.

Proof. Substituting $X_n$ and $E_n$ for $x$ and $e$ in (2.13) and (2.14), we obtain
$$u^*(X_n^*, E_n^*) = \Phi(X_{n+1}) = X_{n+1}^*$$
and
$$p^*(X_n^*, A^*) = P(E_n \in \Psi^{-1}(A^*) \mid X_n, E_{n-1}, \ldots) = P(E_n^* \in A^* \mid X_n, E_{n-1}, \ldots)$$
a.s. And it follows from the latter equation that
$$p^*(X_n^*, A^*) = P(E_n^* \in A^* \mid X_n^*, E_{n-1}^*, \ldots). \quad ∎$$
The significance of this is that the reduced model of Theorem 2.4 may be far simpler than the model with which we began, and hence may yield useful information about $X_n^*$ and $E_n^*$. This is certainly the case for the ZHL models, since, as we shall see in Chapter 16, the reduced model is distance diminishing while the full model is not. It is not difficult to verify that the conditions for applicability of Theorem 2.4 are satisfied by the ZHL models. The most important of these conditions are (2.13) and (2.14).
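Conditions (2.13) and (2.14) can be verified mechanically for small finite models. The sketch below is a hypothetical four-state, two-event model (not one of the models in the text) that lumps states through $\Phi(x) = x \bmod 2$, with $\Psi$ the identity; the loops confirm that $u^*$ and $p^*$ are well defined.

```python
from itertools import product

# Hypothetical 4-state, 2-event model; Phi lumps states mod 2, Psi = identity.
X, E = (0, 1, 2, 3), ('g', 'h')
Phi = lambda x: x % 2

def u(x, e):
    return (x + 1) % 4 if e == 'g' else x

def p(x, e):
    base = 0.3 if Phi(x) == 0 else 0.6
    return base if e == 'g' else 1.0 - base

# Condition (2.13): Phi(u(x, e)) is a function of (Phi(x), e) alone.
u_star = {}
for x, e in product(X, E):
    key = (Phi(x), e)
    u_star.setdefault(key, Phi(u(x, e)))
    assert u_star[key] == Phi(u(x, e))

# Condition (2.14): p(x, {e}) is a function of (Phi(x), e) alone
# (Psi-inverse of {e} is {e} since Psi is the identity).
p_star = {}
for x, e in product(X, E):
    key = (Phi(x), e)
    p_star.setdefault(key, p(x, e))
    assert p_star[key] == p(x, e)

print("reduced model:", u_star, p_star)
```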
2. Distance Diminishing Models and Doeblin-Fortet Processes
This chapter begins our study of a class of random systems with complete connections with metric state spaces, and a class of Markov processes in metric spaces that includes their state sequences.
2.1. Distance Diminishing Models

Roughly speaking, a distance diminishing model is a random system with complete connections with a metric $d$ on its state space $X$, such that the distance $d(x', y')$ between $x' = u(x, e)$ and $y' = u(y, e)$ tends to be less than the distance $d(x, y)$ between $x$ and $y$. It might be assumed, for example, that the event operators $u(\cdot, e)$ are uniformly distance diminishing:
$$d(u(x, e), u(y, e)) \le r\, d(x, y)$$
for some $r < 1$ and all $x, y \in X$. A more general condition, first suggested by Isaac (1962) in a more restricted context, is
$$\int p(x, de)\, d(u(x, e), u(y, e)) \le r\, d(x, y).$$
Let
$$r_j = \sup_{x \ne y} \frac{1}{d(x, y)} \int p_j(x, de^j)\, d(u(x, e^j), u(y, e^j)),$$
where $e^j = (e_0, \ldots, e_{j-1})$. The definition given below uses the more general condition $r_1 < \infty$ and $r_k < 1$ for some $k \ge 1$. In addition, the functions $p(\cdot, A)$ are assumed to satisfy the same type of Lipschitz condition:
$$|p(x, A) - p(y, A)| \le C\, d(x, y) \quad \text{uniformly in } A.$$
For $1 \le j \le \infty$, let
$$\delta p_j(x, y, A) = p_j(x, A) - p_j(y, A)$$
and
$$R_j = \sup_{x \ne y} \frac{|\delta p_j(x, y, \cdot)|}{2\, d(x, y)} = \sup_{x \ne y} \sup_A \frac{\delta p_j(x, y, A)}{d(x, y)} \tag{1.2}$$
by (1.1.3) with $\mu = \delta p_j(x, y, \cdot)$. Since $p_{j+1}(x, A \times E) = p_j(x, A)$, $R_{j+1} \ge R_j$, and taking $j = 1$ in (1.2) we see that the uniform Lipschitz condition on $p(\cdot, A)$ given above is equivalent to $R_1 < \infty$.

DEFINITION 1.1. A random system with complete connections $((X, \mathscr{B}), (E, \mathscr{E}), p, u)$ is a distance diminishing model (with metric $d$) if $(X, d)$ is a metric space with Borel sets $\mathscr{B}$, $r_1 < \infty$, $r_k < 1$ for some $k \ge 1$, and $R_1 < \infty$. A more restrictive notion of a distance diminishing model was introduced by Norman (1968a).

Implicit in Definition 1.1 is the assumption that $d(u(x, e^j), u(y, e^j))$ is measurable in $e^j$ for each $x, y \in X$ and $j \ge 1$, so that $r_j$ is well defined. If $(X, d)$ is separable, $d(\cdot, \cdot)$ is jointly measurable and this condition is satisfied. Separability has other uses in this connection. If $(X, d)$ is separable, a sufficient condition for joint measurability of $u$, part of the definition of a random system with complete connections, is that $u(x, \cdot)$ be measurable for each $x \in X$ and $u(\cdot, e)$ be continuous for each $e \in E$. Separability is not a very restrictive condition in applications. Lemmas 1.1 and 1.2 give basic properties of the $r_j$ and $R_j$.

LEMMA 1.1. For all $i, j \ge 1$, $R_{i+j} \le r_i R_j + R_i$.

LEMMA 1.2. For all $i, j \ge 1$, $r_{i+j} \le r_i r_j$.
Proof of Lemma 1.1. If $A \in \mathscr{E}^{i+j}$, then
$$\delta p_{i+j}(x, y, A) = \int p_i(x, de^i)\, \delta p_j(x', y', A_{e^i}) + \int \delta p_i(x, y, de^i)\, p_j(y', A_{e^i}),$$
where $x' = u(x, e^i)$, $y' = u(y, e^i)$, and $A_{e^i}$ is the section of $A$ at $e^i$. Since $|\delta p_j(x', y', A_{e^i})| \le R_j\, d(x', y')$, we have
$$\left| \int p_i(x, de^i)\, \delta p_j(x', y', A_{e^i}) \right| \le R_j \int p_i(x, de^i)\, d(x', y') \le r_i R_j\, d(x, y).$$
It follows from Inequality (1.1.4) that
$$\left| \int \delta p_i(x, y, de^i)\, p_j(y', A_{e^i}) \right| \le R_i\, d(x, y),$$
and adding these bounds gives $R_{i+j} \le r_i R_j + R_i$. ∎

Proof of Lemma 1.2. Clearly
$$q = \int p_{i+j}(x, de^{i+j})\, d(u(x, e^{i+j}), u(y, e^{i+j})) = \int p_i(x, de^i) \int p_j(x', de^j)\, d(u(x', e^j), u(y', e^j)),$$
where $x' = u(x, e^i)$ and $y' = u(y, e^i)$. The inner integral is at most $r_j\, d(x', y')$, and
$$\int p_i(x, de^i)\, d(x', y') \le r_i\, d(x, y),$$
so $q \le r_i r_j\, d(x, y)$. ∎
Taking $i = 1$ in Lemmas 1.1 and 1.2, and recalling our assumption that $R_1 < \infty$ and $r_1 < \infty$, we see that $R_j < \infty$ and $r_j < \infty$ for all $j \ge 1$ in a distance diminishing model. The assumption that $r_k < 1$ permits us to say more.

THEOREM 1.1.
$$r^* = \lim_{j \to \infty} r_j^{1/j} < 1$$
and
$$R_\infty = \sup_{j \ge 1} R_j < \infty.$$

Proof. By Lemma 1.2, for any $n \ge j \ge 1$, $r_n \le r_j^q r_m$, where $q = [n/j]$ and $m = n - jq < j$. Thus
$$\limsup_{n \to \infty} r_n^{1/n} \le r_j^{1/j}$$
for all $j \ge 1$. Therefore the limit $r^*$ exists and $r^* \le r_j^{1/j}$ for all $j \ge 1$. Taking $j = k$ we see that $r^* < 1$.

Since $R_{j+1} \ge R_j$, $\sup_j R_j = \lim_j R_j$, and
$$(1 - r_i)\, R_{i+j} \le R_i$$
by Lemma 1.1. If $r_i < 1$ (e.g., $i = k$) we obtain
$$\sup_j R_j \le R_i/(1 - r_i) < \infty \tag{1.7}$$
on letting $j \to \infty$.

It remains to show that $R_\infty = \sup_j R_j$. As a consequence of (1.2.5), $R_\infty \ge R_j$ for all $j \ge 1$; hence, $R_\infty \ge \sup_j R_j$. To obtain the converse, let $A \in \mathscr{E}^\infty$, $x, y \in X$, and $\varepsilon > 0$ be given. Since $\mathscr{E}^\infty$ is generated by the algebra $\mathscr{E}' = \bigcup_{j=1}^\infty \mathscr{E}^j$ (here we identify $A \in \mathscr{E}^j$ and $A \times E^\infty \in \mathscr{E}^\infty$), there is an $A' \in \mathscr{E}'$ such that
$$p_\infty(x, A \bigtriangleup A') + p_\infty(y, A \bigtriangleup A') < \varepsilon$$
(Halmos, 1950, Theorem 13.D). Then
$$|\delta p_\infty(x, y, A)| \le |p_\infty(x, A) - p_\infty(x, A')| + |\delta p_\infty(x, y, A')| + |p_\infty(y, A') - p_\infty(y, A)|,$$
and $|\delta p_\infty(x, y, A')| \le \sup_j R_j\, d(x, y)$ by (1.2.5), so
$$|\delta p_\infty(x, y, A)| \le \varepsilon + \sup_j R_j\, d(x, y).$$
Since $\varepsilon$ is arbitrary, $|\delta p_\infty(x, y, A)| \le \sup_j R_j\, d(x, y)$. Thus $R_\infty \le \sup_j R_j$, so that $R_\infty = \sup_j R_j$. ∎
The key to understanding the state sequence in a distance diminishing model is to restrict the transition operator $U$ to bounded Lipschitz functions. The following notation will be needed. For $f \in B(X)$,
$$m(f) = \sup_{x \ne y} \frac{|f(x) - f(y)|}{d(x, y)}$$
and
$$\|f\| = m(f) + |f|.$$
Let
$$L(X) = \{f : \|f\| < \infty\}$$
be the set of bounded Lipschitz functions. In terms of $m$, (1.2) becomes
$$R_j = \sup_{A \in \mathscr{E}^j} m(p_j(\cdot, A)).$$
A few important properties of $L(X)$ will now be noted. First, $L(X)$ is a normed linear space with respect to both $|\cdot|$ and $\|\cdot\|$. Also, if $f_n \in L(X)$, $f \in B(X)$, and $|f_n - f| \to 0$ as $n \to \infty$, then $m(f) \le \liminf_n m(f_n)$, so that
$$\|f\| \le \liminf_n \|f_n\|. \tag{1.8}$$
In particular, $f \in L(X)$ and $\|f\| \le C$ if $\|f_n\| \le C$ for all $n$. Another consequence of (1.8) [together with the completeness of $B(X)$ and the inequality $|f| \le \|f\|$] is that $(L(X), \|\cdot\|)$ is complete. The inequality
$$m(fg) \le |f|\, m(g) + |g|\, m(f),$$
valid for $f, g \in B(X)$, implies that $L(X)$ is closed under multiplication, with $\|fg\| \le \|f\|\, \|g\|$. Thus $L(X)$ is a Banach algebra with unit 1. We now establish some elementary results concerning the action of $U$ on $L(X)$.

LEMMA 1.3. For all $f \in L(X)$ and $j \ge 1$,
$$|U^j f(x) - U^j f(y)| \le r_j\, m(f)\, d(x, y) + \tfrac{1}{2}\, |\delta p_j(x, y, \cdot)|\, \operatorname{osc}(f).$$

Proof. This follows directly from
$$U^j f(x) - U^j f(y) = \int p_j(x, de^j)\, [f(u(x, e^j)) - f(u(y, e^j))] + \int \delta p_j(x, y, de^j)\, f(u(y, e^j)),$$
the definition of $r_j$, and Inequality (1.1.4). ∎
DEFINITION 1.2. A transition operator $U$ for a metric state space $(X, d)$ is a Doeblin-Fortet operator if $U$ maps $L(X)$ into $L(X)$ boundedly with respect to $\|\cdot\|$, and if there are $k \ge 1$, $r < 1$, and $R < \infty$ such that
$$m(U^k f) \le r\, m(f) + R\, |f| \tag{1.9}$$
for all $f \in L(X)$. A corresponding Markov process is a Doeblin-Fortet process.

A special case of Theorem 1.2 was proved by Doeblin and Fortet (1937; see (3), p. 143) using computations similar to our own.

THEOREM 1.2. The transition operator for state sequences of a distance diminishing model is a Doeblin-Fortet operator.

Proof. It follows immediately from Lemma 1.3 that
$$m(U^j f) \le r_j\, m(f) + 2 R_j\, |f| \tag{1.10}$$
for all $j \ge 1$. Taking $j = 1$ and adding $|Uf| \le |f|$ we obtain
$$\|Uf\| \le r_1\, m(f) + (2R_1 + 1)\, |f| \le \max(r_1, 2R_1 + 1)\, \|f\|.$$
Thus $U$ is bounded on $L(X)$. Putting $j = k$ in (1.10) we obtain (1.9) with $r = r_k < 1$ and $R = 2R_k < \infty$. ∎

It follows immediately from (1.9) and $|Uf| \le |f|$ that
$$\|U^k f\| \le r\, \|f\| + R'\, |f|, \tag{1.11}$$
where $R' = R + 1$. This inequality, in a slightly different setting, is one of the main hypotheses of the Ionescu Tulcea-Marinescu theorem (Theorem 3.2.1), which is, in turn, the cornerstone of the theory of Doeblin-Fortet processes in compact state spaces ("compact Markov processes") presented in Chapter 3.

COMPACT STATE AND FINITE EVENT SPACES. Both the five-operator linear model and the reduced ZHL model have compact state spaces and finite event spaces. Proposition 1 gives a relatively simple criterion for such a model to be distance diminishing. If $w$ maps $X$ into $X$, let
$$l(w) = \sup_{x \ne y} \frac{d(w(x), w(y))}{d(x, y)}.$$

PROPOSITION 1. A model with compact state space and finite event space is distance diminishing if $m(p(\cdot, e)) < \infty$ and $l(u(\cdot, e)) \le 1$ for all $e \in E$, and if, for every $x \in X$, there are $j \ge 1$ and $e^j \in E^j$ such that
$$l(u(\cdot, e^j)) < 1 \quad \text{and} \quad p_j(x, e^j) > 0.$$
Proof. Clearly
$$R_1 \le \sum_{e \in E} m(p(\cdot, e)) < \infty$$
and
$$r_1 \le \sup_{e \in E} l(u(\cdot, e)) \le 1.$$
Thus it remains only to find a $k$ for which $r_k < 1$. Since $l(v \circ w) \le l(v)\, l(w)$,
$$l(u(\cdot, e^n)) \le 1 \tag{1.12}$$
and
$$l(u(\cdot, e^{n+1})) \le l(u(\cdot, e^n)) \tag{1.13}$$
if $e^{n+1} = (e_0, \ldots, e_n)$ and $e^n = (e_0, \ldots, e_{n-1})$. Let
$$g(y, n) = \sum_{e^n} p_n(y, e^n)\left(1 - l(u(\cdot, e^n))\right).$$
Then $g(\cdot, n)$ is continuous, $g(y, n+1) \ge g(y, n)$ by (1.13), and, if $j = j(x)$ and $e^j = e^j(x)$ are as in the statement of the proposition,
$$g(x, j) \ge p_j(x, e^j)\left(1 - l(u(\cdot, e^j))\right)$$
by (1.12). Thus $g(x, j) > 0$, and
$$x \in \{y : g(y, j) > g(x, j)/2\} = T(x).$$
But $T(x)$ is open, and $X$ is compact, so there is a finite subset $X'$ of $X$ such that
$$X = \bigcup_{x \in X'} T(x).$$
Let
$$k = \max_{x \in X'} j(x) \quad \text{and} \quad g = \min_{x \in X'} g(x, j(x)) > 0.$$
If $y \in X$, there is an $x \in X'$ such that $y \in T(x)$. Thus
$$g(y, k) \ge g(y, j(x)) > g(x, j(x))/2 \ge g/2.$$
Therefore
$$r_k \le 1 - g/2 < 1. \quad ∎$$
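For a concrete model the quantities appearing in Proposition 1 can be estimated on a grid. The sketch below is a hypothetical two-event linear model on $X = [0, 1]$ (the operators and probabilities are illustrative assumptions); it approximates $l(u(\cdot, e))$ for each event and the coefficient $r_1 = \sup_{x \ne y} \sum_e p(x, e)\, |u(x, e) - u(y, e)|/|x - y|$, which here is already less than 1.

```python
# Hypothetical two-event linear model: one contractive operator, one identity.
events = ('a', 'b')
theta = 0.2

def u(x, e):
    return (1.0 - theta) * x + theta if e == 'a' else x

def p(x, e):                         # event probabilities, Lipschitz in x
    q = 0.25 + 0.5 * x               # P(event 'a' | state x)
    return q if e == 'a' else 1.0 - q

grid = [i / 200 for i in range(201)]

def lip(w):
    """Approximate Lipschitz constant l(w) over the grid."""
    return max(abs(w(x) - w(y)) / abs(x - y)
               for x in grid for y in grid if x != y)

l_a = lip(lambda x: u(x, 'a'))       # (1 - theta) < 1: contractive
l_b = lip(lambda x: u(x, 'b'))       # 1: nonexpansive only

r1 = max(sum(p(x, e) * abs(u(x, e) - u(y, e)) for e in events) / abs(x - y)
         for x in grid for y in grid if x != y)
print(l_a, l_b, r1)                  # r1 = 1 - theta * min_x q(x) here
```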
2.2. Transition Operators for Metric State Spaces We begin our study of transition operators for Markov processes in metric spaces by introducing some terminology concerning the asymptotic behavior of U" or
0" = (l/n)Y'Uj /= 0
as n + 00, where U is a bounded linear operator on a Banach space (L, 11 .II). DEFINITION 2.1. The operator U is orderly if there is a bounded linear operator U" on L such that 1 8"-U* 11 + O as n+ 00. It is aperiodic if 11 U"- U" 11 +0 as n + 00. An orderly operator is ergodic if U"L is one dimensional. And an aperiodic, ergodic operator is regular. If U is a transition operator for a measurable space (X,1) and L c B ( X ) , we apply the same terminology to corresponding Markov processes. If 1 E L, then U" 1 = 1, so that ergodicity is equivalent to constancy of U"f for each f E L. The Markov processes considered in the next chapter are all orderly. They are ergodic when there is a unique "ergodic kernel," in which case there is a unique stationary probability p, and U"f = f d p . They are aperiodic if each ergodic kernel has "period" one, as is the case if each ergodic kernel is an absorbing state. An aperiodic operator is clearly orderly. The following simple lemma is very informative. LEMMA 2.1. If U is aperiodic, and V = U- U*,then
U"U"
for all n > 1, and
=
U",
VU" = U W V = O , U" = U" + V",
(2.3)
r ( V ) = lim llVnII1/"< 1.
(2.4)
n-r
(0
In the case of complex scalars, the quantity r ( V ) is V's spectral radius (Dunford and Schwartz, 1958, Lemma VII.3.4). It follows from (2.4) that for any r(V) < a < 1 there is a D = D, such that
Proof.
Since U"+lconverges to U"U, UU", and U", we have
U"U= UU"
=
U".
(2.6)
38
2.
DISTANCE DIMINISHING MODELS
Thus U^nU^∞ = U^∞, by iteration of the second equality. Equation (2.1) follows on letting n → ∞, and (2.2) follows from (2.1) and (2.6). The case n = 1 of (2.3) is the definition of V, and the general case follows by induction using (2.1) and (2.2). Since ||V^{m+n}|| ≤ ||V^m|| ||V^n||, the argument in the first paragraph of the proof of Theorem 1.1 shows that the limit r(V) exists, and r(V) ≤ ||V^n||^{1/n} for all n ≥ 1. By (2.3) and aperiodicity, ||V^n|| < 1 for n sufficiently large; hence, r(V) < 1. ∎
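On a finite state space these objects are matrices, and Lemma 2.1 can be checked directly. The following sketch (our illustration, not part of the text; the matrix is arbitrary) approximates U^∞ by a high power of a regular 2-state transition matrix and verifies (2.2)–(2.4):

```python
import numpy as np

# A 2-state regular (hence aperiodic) transition matrix U.
U = np.array([[0.9, 0.1],
              [0.4, 0.6]])

# U^n converges geometrically; every row of the limit U_inf is the
# stationary probability (0.8, 0.2).  A high power is a good proxy.
U_inf = np.linalg.matrix_power(U, 200)
V = U - U_inf

# (2.3): U^n = U_inf + V^n for every n >= 1.
n = 7
assert np.allclose(np.linalg.matrix_power(U, n),
                   U_inf + np.linalg.matrix_power(V, n))

# (2.2): V U_inf = U_inf V = 0.
assert np.allclose(V @ U_inf, 0.0)
assert np.allclose(U_inf @ V, 0.0)

# (2.4): r(V) < 1; here r(V) is the second eigenvalue of U, 0.5.
r = max(abs(np.linalg.eigvals(V)))
assert r < 1.0
```

Here the geometric rate in (2.5) is visible as well: ||V^n|| decays like 0.5^n.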
If K is a transition kernel with operator U, the kernel

K̄_n = (1/n) Σ_{j=0}^{n-1} K^{(j)}

corresponds to Ū_n. If U is orderly (or aperiodic) with respect to L = B(X) (meaning that |Ū_n − U^∞| → 0, or |U^n − U^∞| → 0, as n → ∞; regularity with respect to B(X) is of some importance in subsequent chapters), the question arises whether U^∞ is the transition operator of some stochastic kernel K^∞ and, if so, what can be said concerning the convergence of K̄_n(x) = K̄_n(x, ·) [or K^{(n)}(x)] to K^∞(x). Let K^∞(x, B) = U^∞1_B(x). Since

K̄_n(x, B) = Ū_n 1_B(x) → K^∞(x, B)

for all x and B ∈ ℬ, the Vitali-Hahn-Saks theorem (Neveu, 1965, Corollary 1 of Proposition IV.2.2) ensures that K^∞ is a stochastic kernel. It then follows from the linearity and continuity of U^∞ that U^∞ is the transition operator with kernel K^∞. Furthermore, the equality

|U − U'| = sup_{x∈X} |K(x) − K'(x)|,

in which |K(x) − K'(x)| denotes the total variation norm, for arbitrary transition operators U and U' with kernels K and K', enables us to translate statements concerning the rate at which |Ū_n − U^∞| (or |U^n − U^∞|) converges to 0 into statements about the convergence of K̄_n(x) [or K^{(n)}(x)] to K^∞(x). When (X, d) is a metric space and U is a transition operator that is orderly (or aperiodic) on L(X), e.g., an orderly Doeblin-Fortet operator, such questions lead to consideration of weak convergence in the space P(X) of probability measures on the Borel subsets ℬ of (X, d) (Parthasarathy, 1967, is a good general reference on this subject). A sequence μ_n in P(X) converges
weakly to μ ∈ P(X) if ∫ f dμ_n → ∫ f dμ for all real valued bounded continuous functions f (f ∈ C(X)). Equivalently,

μ(B̊) ≤ liminf_{n→∞} μ_n(B)   and   limsup_{n→∞} μ_n(B) ≤ μ(B̄)
for all B ∈ ℬ, where B̊ is the interior and B̄ is the closure of B. This is the most natural notion of convergence for probability measures on a metric space.

For ω ∈ M(X), let

||ω|| = sup{|∫ f dω| : f ∈ L'(X)}.

Then ||ω|| = 0 [i.e., ∫ f dω = 0 for all f ∈ L'(X)] implies that ω = 0, and ||·|| is a norm on M(X). The corresponding metric

Δ(μ, ν) = ||μ − ν||

on M(X) and P(X) has been carefully studied by Dudley (1966). If (X, d) is separable, weak convergence in P(X) is equivalent to Δ convergence. Furthermore, if (X, d) is complete as well as separable, P(X) is complete with respect to Δ.

Let U and U' be transition operators, bounded on L(X), corresponding to stochastic kernels K and K'. Then

||U − U'|| ≥ sup_{||f|| = 1} |Uf − U'f|   (2.7)

and

||U − U'|| ≥ sup_{||f|| = 1} m(Uf − U'f)   (2.8)

for f ∈ L(X). Interchanging the f and x suprema on the right in (2.7), and the f and x ≠ y suprema on the right in (2.8), and restricting to L'(X), we obtain the inequalities

sup_{x∈X} Δ(K(x), K'(x)) ≤ ||U − U'||   (2.9)

and

sup_{x≠y} ||(K(x) − K'(x)) − (K(y) − K'(y))||/d(x, y) ≤ ||U − U'||.   (2.10)

If U^∞ is the transition operator for a stochastic kernel K^∞, (2.9) and (2.10) permit us to translate statements about the rate at which ||Ū_n − U^∞|| (or ||U^n − U^∞||) converges to 0 into comparable statements about

sup_{x∈X} Δ(K̄_n(x), K^∞(x)),

etc.
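For point masses on the line, Δ has a simple closed form under the norm ||f|| = m(f) + |f| (a short optimization we carry out here for illustration; the book does not): Δ(δ_a, δ_b) = 2|a − b|/(2 + |a − b|). This exhibits Δ metrizing weak convergence, while the total variation distance between distinct point masses stays 2.

```python
# Delta(δ_a, δ_b) for point masses on X = R, d(x, y) = |x - y|, with
# the norm ||f|| = m(f) + |f|: maximizing |f(a) - f(b)| subject to
# ||f|| <= 1 splits the norm between slope and height; the optimal
# tent function has slope 2/(2 + d) and sup d/(2 + d), d = |a - b|.
def delta(a, b):
    d = abs(a - b)
    return 2 * d / (2 + d)

assert delta(0.0, 1e-6) < 1e-5           # point masses close in Δ
assert 1.99 < delta(0.0, 1000.0) < 2.0   # total variation would be 2
assert delta(0.0, 0.0) == 0.0
```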
We can now give sufficient conditions for U^∞ to be a transition operator.

THEOREM 2.1. If U is a transition operator that is orderly on L(X), and if (X, d) is complete and separable, then U^∞ is the transition operator for a stochastic kernel K^∞. For any x ∈ X and B ∈ ℬ,

∫ K^∞(x, dy) K(y, B) = ∫ K(x, dy) K^∞(y, B) = K^∞(x, B).   (2.11)

If U is ergodic, K^∞(x, ·) = K^∞ does not depend on x and is the unique stationary probability of the adjoint T of U.

The last two sentences also apply if U is orderly on B(X), where (X, ℬ) is an arbitrary measurable space.

Proof. Since U is orderly, ||Ū_m − Ū_n|| → 0 as m, n → ∞; hence, by (2.9), Δ(K̄_m(x), K̄_n(x)) → 0 as m, n → ∞. Since P(X) is complete with respect to Δ, there is a K^∞(x) ∈ P(X) to which K̄_n(x) converges weakly. Clearly

U^∞f(x) = ∫ K^∞(x, dy) f(y)   (2.12)

for each f ∈ L(X) and x ∈ X. This representation, and the fact that U^∞f ∈ B(X) for each f ∈ L(X), imply that K^∞(·, B) ∈ B(X) for each B ∈ ℬ, so that K^∞ is a stochastic kernel. We omit the proof. (Theorem 2.2 gives a much stronger result for Doeblin-Fortet operators.) Clearly

UŪ_n = Ū_nU = Ū_n + n^{-1}U^n − n^{-1}I.

Applying this to f ∈ L(X), and noting that UŪ_nf, Ū_nUf, and Ū_nf converge to UU^∞f, U^∞Uf, and U^∞f, respectively, in L(X), hence in B(X), while |U^nf| ≤ |f|, we obtain

UU^∞f = U^∞Uf = U^∞f,

from which (2.11) follows. If U is ergodic, U^∞f(x) = U^∞f(y) for all f ∈ L(X), so that K^∞(x, ·) = K^∞(y, ·) for all x, y ∈ X. The outer equality in (2.11) says that K^∞(x, ·) is stationary, so, in the ergodic case, K^∞(·) is stationary. If μ is a stationary probability,

(μ, f) = (T̄_nμ, f) = (μ, Ū_nf) → (μ, U^∞f) = U^∞f = (K^∞, f)

as n → ∞, for all f ∈ L(X). Hence μ = K^∞. ∎
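The distinction between orderly and aperiodic is already visible for finite chains. In the sketch below (our illustration), a deterministic 2-cycle has non-convergent powers U^n, yet the Cesàro averages Ū_n converge to a stochastic matrix K^∞ whose rows are the stationary probability, and the invariance relations of (2.11) hold:

```python
import numpy as np

# The deterministic 2-cycle: U^n alternates between I and U, so U is
# not aperiodic, but it is orderly: the Cesàro averages converge.
U = np.array([[0.0, 1.0],
              [1.0, 0.0]])

def cesaro(U, n):
    # Ū_n = (1/n) * sum_{j=0}^{n-1} U^j
    return sum(np.linalg.matrix_power(U, j) for j in range(n)) / n

K_inf = np.full((2, 2), 0.5)        # rows = stationary probability
assert np.allclose(cesaro(U, 1000), K_inf)

# (2.11): K_inf is invariant under composition with K on either side.
assert np.allclose(K_inf @ U, K_inf)
assert np.allclose(U @ K_inf, K_inf)
```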
The next result gives a rather special property of Doeblin-Fortet operators.
THEOREM 2.2. If U is an orderly Doeblin-Fortet operator, and if, for every x ∈ X, K^∞(x, ·) is a probability that satisfies (2.12) for all f ∈ L(X), then

sup_{B∈ℬ} m(K^∞(·, B)) < ∞.

For the proof we need this lemma.

LEMMA 2.2. Let A and B be nonempty subsets of X with

d(A, B) = inf_{x∈A, y∈B} d(x, y) > 0.

Let

p(x) = p(x, A, B) = d(x, A)/[d(x, A) + d(x, B)],

where d(x, A) = d({x}, A). Then p ∈ L(X), p(x) = 0 on A, p(x) = 1 on B, and 0 ≤ p(x) ≤ 1 for all x ∈ X.

Proof of lemma. By direct computation,

(p(x) − p(y)) d(x) d(y) = (d(x, A) − d(y, A)) d(y, B) + d(y, A)(d(y, B) − d(x, B)),

where d(x) = d(x, A) + d(x, B). But m(d(·, A)) ≤ 1 and m(d(·, B)) ≤ 1, so

|p(x) − p(y)| ≤ d(x, y)/d(x),

and d(x) ≥ d(A, B), so m(p) ≤ 1/d(A, B). ∎
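Lemma 2.2's separating function is easy to exhibit concretely. A sketch (our choice of X, A, and B, with p(x) = d(x, A)/[d(x, A) + d(x, B)] as in the lemma):

```python
import numpy as np

# p(x) = d(x, A) / [d(x, A) + d(x, B)] on X = R for A = (-inf, 0],
# B = [1, inf), so that d(A, B) = 1 and m(p) <= 1/d(A, B) = 1.
def d_A(x):
    return max(x, 0.0)          # d(x, A)

def d_B(x):
    return max(1.0 - x, 0.0)    # d(x, B)

def p(x):
    return d_A(x) / (d_A(x) + d_B(x))

assert p(-2.0) == 0.0 and p(3.0) == 1.0   # p = 0 on A, p = 1 on B
assert 0.0 <= p(0.3) <= 1.0

# Lipschitz bound m(p) <= 1, checked on a grid:
xs = np.linspace(-2.0, 3.0, 501)
vals = np.array([p(x) for x in xs])
slopes = np.abs(np.diff(vals)) / np.diff(xs)
assert slopes.max() <= 1.0 + 1e-9
```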
Proof of theorem. It follows from (1.9) that

m(U^{nk}f) ≤ r^n m(f) + R*|f|,

where R* = R/(1 − r). Replacing f by U^i f, 0 ≤ i ≤ k − 1, we obtain

m(U^j f) ≤ C r^{[j/k]} ||f|| + R*|f|,

where C = max{||U^i|| : 0 ≤ i ≤ k − 1}. Therefore

m(Ū_n f) ≤ (C/n) Σ_{j=0}^{n-1} r^{[j/k]} ||f|| + R*|f| ≤ (Ck/(1 − r)) n^{-1} ||f|| + R*|f|,

from which

m(U^∞f) ≤ R*|f|   (2.13)

follows on letting n → ∞. This holds for all f ∈ L(X).

Suppose that B is closed and B ≠ X or ∅. Then

A_n = {x : d(x, B) > 1/n}
is not empty for n sufficiently large, and d(A_n, B) ≥ 1/n > 0; thus p_n(x) = p(x, A_n, B) ∈ L(X). Also p_n(x) → 1_B(x) for all x ∈ X; thus U^∞p_n(x) → K^∞(x, B) for each x ∈ X. Therefore

|K^∞(x, B) − K^∞(y, B)| = lim_{n→∞} |U^∞p_n(x) − U^∞p_n(y)| ≤ R*d(x, y)   (2.14)

by (2.13). The inequality (2.14) holds trivially for B = X and B = ∅; hence it is valid for all closed sets B and x, y ∈ X.

If A is an arbitrary element of ℬ, then, by the regularity of ζ(·) = K^∞(x, ·) + K^∞(y, ·), there is a closed subset B of A such that ζ(A − B) < ε. Thus

|K^∞(x, A) − K^∞(y, A)| ≤ |(K^∞(x, A) − K^∞(y, A)) − (K^∞(x, B) − K^∞(y, B))| + |K^∞(x, B) − K^∞(y, B)|
≤ ζ(A − B) + |K^∞(x, B) − K^∞(y, B)| ≤ ε + R*d(x, y)

by (2.14). Since ε is arbitrary, we conclude that

m(K^∞(·, A)) ≤ R*

for all A ∈ ℬ. ∎
We close this section with a lemma that will be needed shortly. A real valued function f on X is upper semicontinuous if {x : f(x) < y} is open for all y. Clearly continuity implies this.

LEMMA 2.3. If U is a transition operator for a metric state space such that Uf is upper semicontinuous for all f ∈ L'(X), then K(·, B) is upper semicontinuous for all closed sets B.

Proof. The sequence {A_n} defined in the second paragraph of the preceding proof is increasing, so {p_n} is decreasing. Thus the sequence {Up_n} of upper semicontinuous functions decreases to K(·, B). But upper semicontinuity is preserved under decreasing pointwise limits. ∎
3 The Theorem of Ionescu Tulcea and Marinescu, and Compact Markov Processes
The material in this chapter and the next gives information about certain Doeblin-Fortet operators and corresponding Markov processes, thus for state sequences in certain distance diminishing models. The results in Section 3.1 are applicable to all Doeblin-Fortet operators, while those in later sections of this chapter require compactness of (X, d), so that corresponding Markov processes are compact according to Definition 3.1. Chapter 4 treats distance diminishing models whose state spaces satisfy weaker conditions such as boundedness.

3.1. A Class of Operators
Generalizing the situation in Chapter 2, we consider two complex Banach spaces (B, |·|) and (L, ||·||) with L ⊂ B. B is not necessarily a function space, though we use the notation f for one of its elements. It is assumed that

(a) if f_n ∈ L, f ∈ B, lim_{n→∞} |f_n − f| = 0, and ||f_n|| ≤ C for all n, then f ∈ L and ||f|| ≤ C.
A linear operator U from L into L is bounded with respect to both ||·|| and |·|_L, where the latter is the restriction of |·| to L. In addition,

(b) H = sup_{n≥0} |U^n|_L < ∞; and

(c) there is a k ≥ 1, an r < 1, and an R < ∞ such that

||U^k f|| ≤ r||f|| + R|f|   (1.1)

for all f ∈ L.

Only (b) and (c) are needed for the next lemma.

LEMMA 1.1. For all m ≥ 0 and f ∈ L,

||U^{mk} f|| ≤ r^m ||f|| + R'|f|,   (1.2)

where R' = (1 − r)^{-1}RH. Furthermore, J = sup_{n≥0} ||U^n|| < ∞.

Proof. Equation (1.2), which is obtained by iterating (1.1), implies that sup_m ||U^{mk}f|| < ∞ for each f ∈ L. By the uniform boundedness principle, the supremum D of ||U^{mk}|| is finite (Dunford and Schwartz, 1958,† Corollary II.3.21). The lemma then follows from

||U^{mk+i}|| ≤ D||U^i||   for 0 ≤ i < k. ∎
Definition 2.2.1 is applicable in the present context. The following useful condition for aperiodicity generalizes Lemma 1 of Norman (1970a).

THEOREM 1.1. If there is a sequence δ_n with limit 0, and, for each f ∈ L, there is a U^∞f ∈ B such that

|U^nf − U^∞f| ≤ δ_n||f||   (1.3)

for all n ≥ 1, then U is aperiodic.

Proof. By Lemma 1.1, ||U^nf|| ≤ J||f|| for all n; hence, U^∞f ∈ L and ||U^∞f|| ≤ J||f|| by (a). The operator U^∞ is clearly linear. By means of arguments similar to those in the proof of Lemma 2.2.1, but using (1.3) instead of ||U^n − U^∞|| → 0, we obtain (2.2.1), (2.2.2), and (2.2.3). Replacing f by V^nf, n ≥ 1, in (1.2), and noting that UV = V², we get

||V^{mk+n}f|| ≤ r^m||V^nf|| + R'|V^nf| ≤ r^mW||f|| + δ_nR'||f||,

where W = sup_{n≥1}||V^n|| ≤ 2J. Therefore

||V^{mk+n}|| ≤ r^mW + δ_nR'.
t This convenient reference on functional analysis is cited repeatedly in this chapter.
For arbitrary j ≥ 1 let m = [j/2k] and n = j − mk, so that j = mk + n. Then m, n → ∞ as j → ∞, so ||V^j|| → 0. ∎

3.2. The Theorem of Ionescu Tulcea and Marinescu
To the conditions (a), (b), and (c) of the last section, we now add

(d) if L' is a bounded subset of (L, ||·||), then U^kL' has compact closure in (B, |·|).

Under these hypotheses, Ionescu Tulcea and Marinescu (1950) obtained the representation of U^n given in the important theorem that follows. For any complex number λ, let
D(λ) = {f ∈ L : Uf = λf},

so that λ is an eigenvalue of U if and only if D(λ) ≠ {0}.

THEOREM 2.1. The set G of eigenvalues of U of modulus 1 has only a finite number p of elements. For each λ ∈ G, D(λ) is finite dimensional. There are bounded linear operators U_λ, λ ∈ G, and V on L such that

U^n = Σ_{λ∈G} λ^nU_λ + V^n for all n ≥ 1,   (2.1)

U_λU_{λ'} = 0 if λ ≠ λ', U_λ² = U_λ,   (2.2)

U_λV = VU_λ = 0,   (2.3)

U_λL = D(λ),   (2.4)

r(V) < 1.   (2.5)

The operator Σ_{λ∈G} λ^nU_λ has a finite-dimensional range, hence is strongly compact. Since ||V^n|| < 1 for n sufficiently large, it follows that U is quasi-strongly compact (see Loève, 1963, Section 32.2). The proof of the theorem is broken down into a number of lemmas.

LEMMA 2.1. If |λ| = 1, D(λ) is finite dimensional.
LEMMA 2.2. There are only a finite number of eigenvalues of modulus 1.

Note that U' = U^k satisfies (b), (c), and (d) with k' = 1. Furthermore, Uf = λf implies that U'f = λ^kf, and |λ^k| = |λ|. Thus D(λ) ⊂ D'(λ^k), so the former is finite dimensional if the latter is. Also λ^k ∈ G' if λ ∈ G, so G is finite if G' is. Hence it suffices to prove these lemmas for k = 1.

Proof of Lemma 2.1. If f ∈ D, where

D = D(λ) ∩ {f : |f| ≤ 1},
then ||f|| = ||λ^{-1}Uf|| = ||Uf|| ≤ r||f|| + R|f| ≤ r||f|| + R, so that

||f|| ≤ R/(1 − r).

Thus D is bounded, and, by (d), UD has compact closure in |·|. But D ⊂ λ^{-1}UD, and D is closed in (B, |·|), so D is compact. It follows that D(λ) is finite dimensional (Dunford and Schwartz, 1958, Theorem IV.3.5). ∎
Proof of Lemma 2.2. Suppose not. Let λ_1, λ_2, ... be a sequence of distinct elements of G and let f_n ∈ D(λ_n), f_n ≠ 0. The f_n are linearly independent. Let S(n) be the linear span of f_1, f_2, ..., f_n. Then S(n − 1) is a proper subspace of S(n), so that there is a sequence g_n such that g_n ∈ S(n), |g_n| = 1, and |g_n − f| ≥ 1/2 for all f ∈ S(n − 1) (Dunford and Schwartz, 1958, Lemma VII.4.3). If f = Σ_{i=1}^m a_if_i ∈ S(m), it is easy to see that zU^nf ∈ S(m) for all complex z and n ≥ 0, and that λ_m^{-n}U^nf − f ∈ S(m − 1) for all n ≥ 0. By Lemma 1.1,

||λ_j^{-(n-1)}U^{n-1}g_j|| ≤ r^{n-1}||g_j|| + R' ≤ 1 + R'

for n sufficiently large. Thus (d) implies that there is an increasing sequence j_i and a sequence n_i such that

h_i = λ_{j_i}^{-n_i}U^{n_i}g_{j_i}

converges in B. However,

d_i = |h_{i+1} − h_i| = |h_{i+1} − g_{j_{i+1}} − h_i + g_{j_{i+1}}|,

and

h_{i+1} − g_{j_{i+1}} ∈ S(j_{i+1} − 1),   h_i ∈ S(j_i) ⊂ S(j_{i+1} − 1).

Thus d_i ≥ 1/2. This contradiction establishes the lemma. ∎
LEMMA 2.3. If |λ| = 1 and λ is not an eigenvalue of U^k, then (λI − U^k)L = L.

Proof. Again it suffices to consider k = 1. Since ||U^n|| is bounded, r(U) ≤ 1, and so (λ'I − U)L = L for |λ'| > 1. For g ∈ L we seek an f ∈ L such that (λI − U)f = g. If g = 0, let f = 0. If g ≠ 0, let |λ_n| > 1, λ_n → λ, and f_n ∈ L with (λ_nI − U)f_n = g. Clearly f_n ≠ 0, so we may put f_n' = f_n/|f_n|. Then

f_n' = λ_n^{-1}Uf_n' + λ_n^{-1}g/|f_n|,   (2.6)

so that

||f_n'|| ≤ (R + ||g||/|f_n|)/(1 − r).   (2.7)

Suppose now that |f_n| → ∞. By (2.7), ||f_n'|| is bounded, so that, for some subsequence f_{n'}', Uf_{n'}' converges in B. By (2.6), f_{n'}' does too, and the limit f' belongs to L by (a). Since |U|_L < ∞, Uf_{n'}' → Uf', and (2.6) yields f' = λ^{-1}Uf'. Now 1 = |f_{n'}'| → |f'|, so f' ≠ 0, and λ is an eigenvalue of U. This contradicts our assumption, and rules out the possibility that |f_n| → ∞. Applying this result to subsequences of f_n, we see that d = sup_n |f_n| < ∞. Multiplication of (2.7) by |f_n| then yields

||f_n|| ≤ (Rd + ||g||)/(1 − r).

Consequently Uf_n has a subsequence Uf_{n'} that converges in B, and f_{n'} = λ_{n'}^{-1}(Uf_{n'} + g) converges in |·| to an element f of L. Then Uf_{n'} → Uf in |·|, and f = λ^{-1}(Uf + g). ∎
If |λ| = 1, let

U_λ^n = (1/n) Σ_{j=0}^{n-1} λ^{-j}U^j.

LEMMA 2.4. For every |λ| = 1 and f ∈ L there is a U_λf ∈ L to which U_λ^nf converges in B. The linear operators U_λ have |U_λ|_L ≤ H and ||U_λ|| ≤ J, and satisfy (2.1)-(2.4), where

V = U − Σ_{λ∈G} λU_λ.

This operator has no eigenvalues of modulus 1.

Proof. Let L~ be the closure of L in B. Then (L~, |·|) is a Banach space, and the unique continuous extension U~ of U to L~ is a bounded linear operator on (L~, |·|) with |U~^n| ≤ H for all n ≥ 0. Furthermore, for any f ∈ L, the sequence U_λ^nf = U~_λ^nf is strongly compact. For note first that

||U_λ^nf|| ≤ J||f||.   (2.8)

Also, for any n ≥ m ≥ 1,

U_λ^nf = λ^{-m}U^mU_λ^nf + ε_n,   (2.9)

where

ε_n = n^{-1}mU_λ^m(I − λ^{-n}U^n)f

converges to 0 in B as n → ∞. Compactness follows from (2.8), (2.9) with m = k, and (d).
As a consequence of a standard strong ergodic theorem (Dunford and Schwartz, 1958, Corollary VIII.5.3), U~_λ^nf converges in L~ for every f ∈ L~. When f ∈ L we denote the limit U_λf. Thus

lim_{n→∞} |U_λ^nf − U_λf| = 0   (2.10)

for all f ∈ L. By (2.10), (2.8), and (a), U_λf ∈ L and ||U_λf|| ≤ J||f||. Thus ||U_λ|| ≤ J. And it follows from (2.10) and (b) that |U_λ|_L ≤ H. Taking m = 1 in (2.9) and letting n → ∞ we obtain

λU_λf = UU_λf.   (2.11)

Thus

U_{λ'}^nU_λf = [(1/n) Σ_{j=0}^{n-1} (λ/λ')^j] U_λf.

The left side converges to U_{λ'}U_λf and the right to U_λf if λ = λ', and to 0 if λ ≠ λ'. This yields (2.2). Equation (2.11) also implies that U_λL ⊂ D(λ). If, on the other hand, f ∈ D(λ), then f = U_λ^nf for all n, so that f = U_λf and f ∈ U_λL. Thus (2.4) obtains for all |λ| = 1. The fact that VU_λ = 0 is a consequence of (2.2) and (2.11), and U_λV = 0 follows similarly from λU_λ = U_λU, which is, in turn, obtained from the case m = 1 of (2.9) on commuting U and U_λ^n. This establishes (2.3), and (2.1) follows from (2.2) and (2.3) by induction.

Suppose now that |λ| = 1, f ∈ L, and Vf = λf. Then

U_{λ'}f = λ^{-1}U_{λ'}Vf = 0

for all λ' ∈ G, so that Uf = Vf + Σ_{λ'∈G} λ'U_{λ'}f = Vf = λf. Thus f ∈ D(λ); if λ ∈ G this gives f = U_λf = 0, and if λ ∉ G then f = 0 since λ is not an eigenvalue of U. Hence λ is not an eigenvalue of V. ∎
LEMMA 2.5. If |λ| = 1 and ν ≥ 1, then λ is an eigenvalue of U^ν only if some νth root of λ is an eigenvalue of U.

Proof. Let λ_0^ν = λ and ω = exp(i2π/ν), so that the νth roots of λ are λ_p = ω^pλ_0, p = 0, 1, ..., ν − 1. Then

(1/ν) Σ_{p=0}^{ν-1} U_{λ_p}^{mν} = (1/mν) Σ_{j=0}^{mν-1} [Σ_{p=0}^{ν-1} ω^{-pj}] λ_0^{-j}U^j,

and the quantity in brackets is ν or 0, depending on whether or not j is a multiple of ν, so

(1/ν) Σ_{p=0}^{ν-1} U_{λ_p}^{mν} = (1/m) Σ_{k=0}^{m-1} λ^{-k}W^k,

where W = U^ν. Now W satisfies (b), (c), and (d) with k_W = k [see Lemma 1.1 for (c)]; thus Lemma 2.4 is applicable to W. Letting m → ∞ in the above equation we obtain

(1/ν) Σ_{p=0}^{ν-1} U_{λ_p} = W_λ.

If λ is an eigenvalue of W, (2.4) gives W_λ ≠ 0. Thus U_{λ_p} ≠ 0 for some λ_p, and, again by (2.4), this λ_p is an eigenvalue of U. ∎
LEMMA 2.6. The operator V satisfies (b), (c), and (d) with k_V = k.

Proof. First, (2.1), (b), and Lemma 2.4 give

|V^n|_L ≤ |U^n|_L + Σ_{λ∈G} |U_λ|_L ≤ (p + 1)H,

so V satisfies (b).

Let F = Σ_{λ∈G} U_λL. Then F is a finite-dimensional linear space by Lemma 2.1 and (2.4). Thus there is a constant K such that ||f|| ≤ K|f| for all f ∈ F. Hence

||V^kf|| ≤ ||U^kf|| + Σ_{λ∈G} ||U_λf|| ≤ r||f|| + R|f| + K Σ_{λ∈G} |U_λf| ≤ r||f|| + (R + pKH)|f|,

and (c) is satisfied. Finally, (d) follows from

V^k = U^k − Σ_{λ∈G} λ^kU_λ,

since U^k maps bounded subsets of (L, ||·||) into sets with compact closure in (B, |·|), while Σ_{λ∈G} λ^kU_λ maps them into bounded subsets of the finite-dimensional space F. ∎

LEMMA 2.7. r(V) < 1.

Proof. Since r(V) = r(V^k)^{1/k} (Dunford and Schwartz, 1958, Theorem VII.3.11), it suffices to show that r(V^k) < 1. By Lemma 2.4, V has no eigenvalues of modulus 1. Lemma 2.6 permits us to apply Lemma 2.5 to V, from which we conclude that V^k has no eigenvalues of modulus 1. Application of Lemma 2.3 to V then shows that the spectrum σ(V^k) of V^k is disjoint from the unit circle, hence from {z : |z| ≥ 1}. Since σ(V^k) is compact, r(V^k) < 1. ∎
All of the assertions of Theorem 2.1 have now been established.
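For a finite chain, Theorem 2.1 reduces to the spectral decomposition of the transition matrix. The sketch below (our illustration; U_λ is realized as the eigenprojection onto D(λ)) exhibits U^n = Σ_{λ∈G} λ^nU_λ + V^n with r(V) < 1 for a 3-state chain with G = {1, −1}:

```python
import numpy as np

# A 3-state chain: states 0, 1 cycle with period 2, state 2 is
# transient.  G = {1, -1}, and the third eigenvalue 0.4 generates V.
U = np.array([[0.0, 1.0, 0.0],
              [1.0, 0.0, 0.0],
              [0.3, 0.3, 0.4]])

w, R = np.linalg.eig(U)          # eigenvalues and right eigenvectors
L = np.linalg.inv(R)             # rows are the dual left eigenvectors

def proj(i):
    # rank-one eigenprojection onto D(w[i])
    return np.outer(R[:, i], L[i, :])

G = [i for i in range(3) if abs(abs(w[i]) - 1.0) < 1e-9]
V = (U - sum(w[i] * proj(i) for i in G)).real

# U^n = sum over G of lambda^n U_lambda, plus V^n:
n = 6
rec = sum(w[i] ** n * proj(i) for i in G).real \
    + np.linalg.matrix_power(V, n)
assert np.allclose(rec, np.linalg.matrix_power(U, n))

# r(V) < 1:
assert max(abs(np.linalg.eigvals(V))) < 1.0
```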
3.3. Compact Markov Processes: Preliminaries

In the remaining sections of this chapter we will consider a class of Markov processes whose transition operators satisfy the hypotheses of Theorem 2.1. As in Chapter 2, let (X, d) be a metric space with Borel sets ℬ, bounded measurable (complex valued) functions B(X), supremum norm |·|, and norm ||·|| = m(·) + |·| on the bounded Lipschitz functions L(X). Let K be a stochastic kernel in (X, ℬ) and let U be the corresponding transition operator (1.1.5) restricted to L(X).

DEFINITION 3.1. A Markov process X = {X_n}_{n≥0} with state space (X, d) and transition kernel K is compact if (X, d) is compact and U is a Doeblin-Fortet operator.

This terminology was introduced and the theory of such processes outlined by Norman (1968b). It was observed in Section 2.1 that (a) of Section 3.1 is satisfied by L = L(X) and B = B(X), whether or not (X, d) is compact. Since U is a transition operator, (b) of Section 3.1 holds with H = 1. A Doeblin-Fortet operator is assumed to satisfy (c) of Section 3.1. Finally, if L' ⊂ L(X) is bounded in ||·||, the same is true of U^kL'. Thus this set is bounded in B(X) and equicontinuous. Since (X, d) is compact, the Arzelà-Ascoli theorem (Dunford and Schwartz, 1958, IV.6.7) implies that its closure with respect to |·| is compact, and (d) of Section 3.2 obtains. Therefore U satisfies all of the hypotheses of Theorem 2.1.

THEOREM 3.1. Theorem 2.1 is applicable to the transition operator of a compact Markov process.
As a consequence of Theorem 2.1.2 we have the following important class of examples of compact Markov processes. PROPOSITION 1. If the state space of a distance diminishing model is compact, then any state sequence is a compact Markov process.
In addition, all finite Markov chains are compact Markov processes.

PROPOSITION 2. A Markov process in a finite set X is compact with respect to any metric d on X.

Proof. The metric space (X, d) is obviously compact, and B(X) = L(X) contains all complex valued functions on X. Clearly ||f|| ≤ R|f|, where

R = 1 + 2/min_{x≠x'} d(x, x').

Thus

||Uf|| ≤ R|Uf| ≤ R|f|,

so that ||U|| ≤ R, and (c) of Section 3.1 holds with k = 1 and r = 0. ∎
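Proposition 2's bound can be checked numerically; a sketch (our illustration, with an arbitrary 4-point subset of the line as X):

```python
import numpy as np

# Proposition 2 numerically: on a finite X every f is Lipschitz, with
# ||f|| = m(f) + |f| <= R |f|, where R = 1 + 2 / min_{x != x'} d(x, x').
pts = np.array([0.0, 0.7, 1.5, 4.0])   # X subset of R, d(x, y) = |x - y|
min_d = min(abs(a - b) for a in pts for b in pts if a != b)
R = 1 + 2 / min_d

rng = np.random.default_rng(0)
for _ in range(100):
    f = rng.normal(size=4)
    sup = np.abs(f).max()
    m = max(abs(f[i] - f[j]) / abs(pts[i] - pts[j])
            for i in range(4) for j in range(4) if i != j)
    assert m + sup <= R * sup + 1e-12
```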
For any compact Markov process, {U^nf}_{n≥0} is equicontinuous whenever f is continuous. For by Lemma 1.1 this is certainly true if f ∈ L(X), and L(X) is dense in the space C(X) of continuous functions by the Stone-Weierstrass theorem (Dunford and Schwartz, 1958, IV.6.17). If f ∈ C(X), let g ∈ L(X) with |f − g| ≤ ε/3, and let δ be so small that d(x, y) ≤ δ implies

|U^ng(x) − U^ng(y)| ≤ ε/3

for all n. Then

|U^nf(x) − U^nf(y)| ≤ |U^nf(x) − U^ng(x)| + |U^ng(x) − U^ng(y)| + |U^ng(y) − U^nf(y)| ≤ ε

for all n ≥ 0 and d(x, y) ≤ δ, as required. The theory of transition operators having this property (see Jamison, 1964, 1965; Jamison and Sine, 1969; Rosenblatt, 1964a, b, 1967) has guided our development at a number of junctures below. However, our stronger assumptions sometimes yield stronger results or permit simpler methods, and the theory that emerges is completely analogous to that of finite Markov chains.

Since U1 = 1, 1 is an eigenvalue of U, and all constant functions are corresponding eigenvectors. Clearly

Ū_n − U_1 = Σ_{λ∈G, λ≠1} [(1/n) Σ_{j=0}^{n-1} λ^j] U_λ + (1/n) Σ_{j=0}^{n-1} V^j,

so that

||Ū_n − U_1|| = O(1/n).

Thus U is orderly in the sense of Definition 2.2.1, with U^∞ = U_1. When, and only when, 1 is the only eigenvalue of modulus 1, U is aperiodic, and ||U^n − U_1|| → 0 geometrically. Since compactness implies completeness and separability, Theorem 2.2.1 is applicable to U. Theorem 2.2.2 is also. Theorem 3.2 summarizes some of these results in our present notation.
THEOREM 3.2. The operator U_1 is the transition operator for a stochastic kernel K_1. For any B ∈ ℬ, K_1(·, B) ∈ D(1), and, for any x ∈ X, K_1(x, ·) is stationary.

3.4. Ergodic Decomposition
The key concept in the analysis of compact Markov processes is that of an ergodic kernel. A subset F of X is an ergodic kernel if it is stochastically and topologically closed, and if it has no nonempty stochastically and topologically closed proper subsets.

LEMMA 4.1. Any stochastic kernel defined on the Borel subsets ℬ of a compact metric space (X, d) possesses an ergodic kernel. Any two distinct ergodic kernels are disjoint.
Proof. If F_1 and F_2 are ergodic kernels and F = F_1 ∩ F_2 ≠ ∅, then F is a stochastically and topologically closed subset of F_1 and F_2. Thus F_1 = F = F_2.

The ergodic kernels are clearly the minimal elements of the collection ℰ of nonempty stochastically and topologically closed subsets of X under the natural ordering ⊂. Zorn's lemma (Dunford and Schwartz, 1958, Theorem 1.2.7) implies that there is at least one such minimal element. For since X ∈ ℰ, ℰ is not empty. If 𝒜 is a totally ordered subset of ℰ, it has the finite intersection property; hence A = ∩𝒜 ≠ ∅. The Lindelöf theorem (Dunford and Schwartz, 1958, 1.4.14) implies that A = ∩𝒜' for some countable subcollection 𝒜' of 𝒜, from which it follows that A is stochastically closed. It is obviously topologically closed; hence A ∈ ℰ. Clearly A is a lower bound for 𝒜. ∎
If F ∈ ℬ is stochastically closed, then K̄_n(x, F) = 1 for all x ∈ F, so that K_1(x, F) = 1 if F is also topologically closed. It follows that the functions K_1(·, F), for different ergodic kernels F, are linearly independent. By Theorem 3.2, all such functions belong to D(1), so the number i of ergodic kernels does not exceed the dimension d of D(1). Let F_1, F_2, ..., F_i be the ergodic kernels, F = ∪_{j=1}^i F_j, and

g_j(x) = K_1(x, F_j).

THEOREM 4.1. The functions g_j are a basis for D(1), so there are d ergodic kernels. A compact Markov process is ergodic if and only if there is a unique ergodic kernel.

The second statement follows from the first. There is a unique ergodic kernel if and only if D(1) = U_1L(X) is one dimensional, i.e., contains only constants. Our proof of the first statement is based on a corollary to the following lemma.
LEMMA 4.2. If f is upper semicontinuous and f(x) ≤ Uf(x) for all x ∈ X, then f is constant on each ergodic kernel, and f attains its maximum on F.

Proof. Let

C_j = {x ∈ F_j : f(x) = max_{y∈F_j} f(y)}.

Since f restricted to F_j is upper semicontinuous and F_j is compact, C_j is a nonempty topologically closed subset of F_j. If x ∈ C_j we have

f(x) ≤ Uf(x) = ∫ K(x, dy) f(y) ≤ f(x),

since F_j is stochastically closed. Hence K(x, C_j) = 1, and C_j is stochastically closed. But F_j is an ergodic kernel, so C_j = F_j, and f is constant on F_j. Similarly, if

A = {x : f(x) = max_{y∈X} f(y)},

A is stochastically and topologically closed. Lemma 4.1 implies that there is an ergodic kernel for the compact metric space (A, d) and the restriction to A of the kernel K, and it is easily seen that this set is an ergodic kernel for (X, d) and K unrestricted. Hence A ⊃ F_j for some j, and f attains its maximum on F. ∎
COROLLARY. If f ∈ D(1), f is constant on each ergodic kernel. If |λ| = 1 and f ∈ D(λ), then |f(·)| is constant on each ergodic kernel, and f(x) = 0 for all x ∈ X if f(x) = 0 for all x ∈ F.

Proof. If f ∈ D(1), then re f ∈ D(1) and im f ∈ D(1), so both are constant on each ergodic kernel by Lemma 4.2. If |λ| = 1 and f ∈ D(λ), then g = |f(·)| is continuous and satisfies g ≤ Ug. ∎
Proof of Theorem 4.1. Suppose that f ∈ D(1). Let φ_j be the value of f on F_j and let δ = f − Σ_{j=1}^i φ_jg_j. Then δ ∈ D(1) and δ vanishes throughout F. Thus δ = 0; that is, f = Σ_{j=1}^i φ_jg_j. Therefore the linearly independent functions g_j span D(1). ∎
The expansion of K_1(·, B) in terms of the g_j turns out to be especially interesting. For any probability μ on the Borel subsets of a separable metric space, there is a smallest closed set with probability 1. This set, called the support of μ, is the set of points x such that μ(O) > 0 for all open sets O containing x (Parthasarathy, 1967, Theorem 2.1 of Chapter II).

THEOREM 4.2. There is a unique stationary probability μ_j with μ_j(F_j) = 1. The ergodic kernel F_j is the support of μ_j, and {μ_1, ..., μ_d} is a basis for {μ ∈ M(X) : Tμ = μ}. For all x ∈ X and B ∈ ℬ,

K_1(x, B) = Σ_{j=1}^d g_j(x)μ_j(B).   (4.1)
Proof. Let μ_1(B), ..., μ_d(B) be the unique constants such that (4.1) holds for all x ∈ X. Taking x ∈ F_j, we obtain K_1(x, B) = μ_j(B) for all B ∈ ℬ. By Theorem 3.2, μ_j is a stationary probability, and, clearly, μ_j(F_i) = δ_{ij}, so the μ_j are linearly independent. Suppose that μ ∈ M(X) and Tμ = μ. Then T̄_nμ = μ. But, for f ∈ L(X),

(T̄_nμ, f) = (μ, Ū_nf) → (μ, U_1f) = (T_1μ, f)

as n → ∞. Thus (μ, f) = (T_1μ, f) for all f ∈ L(X) and

μ = T_1μ = Σ_{j=1}^d (∫ g_j dμ) μ_j.

Therefore the μ_j are a basis for {μ ∈ M(X) : Tμ = μ}, and μ_j is the only stationary probability with μ_j(F_j) = 1. Since, by Lemmas 2.2.3 and 4.3, the support S_j of μ_j is a stochastically as well as topologically closed subset of F_j, S_j = F_j. ∎
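Theorems 4.1 and 4.2 are transparent for finite chains. In the sketch below (our illustration), states {0, 1} and {2, 3} are the ergodic kernels, state 4 is transient, and (4.1) is verified numerically (the chain is aperiodic, so K_1 can be computed as a high power of U):

```python
import numpy as np

# Two ergodic kernels F_1 = {0, 1}, F_2 = {2, 3}; state 4 is transient.
U = np.array([[0.5, 0.5, 0.0, 0.0, 0.0],
              [0.2, 0.8, 0.0, 0.0, 0.0],
              [0.0, 0.0, 0.6, 0.4, 0.0],
              [0.0, 0.0, 0.1, 0.9, 0.0],
              [0.2, 0.1, 0.3, 0.2, 0.2]])

# The chain is aperiodic, so the Cesàro limit K_1 is also the limit of
# the powers; a high power of U is an accurate proxy.
K1 = np.linalg.matrix_power(U, 200)

mu1 = K1[0]                  # stationary probability carried by F_1
mu2 = K1[2]                  # stationary probability carried by F_2
assert np.allclose(mu1 @ U, mu1)
assert np.allclose(mu2 @ U, mu2)

# g_j(x) = K_1(x, F_j); K_1(x, .) = g_1(x) mu1 + g_2(x) mu2  -- (4.1):
g1 = K1[:, :2].sum(axis=1)
g2 = K1[:, 2:4].sum(axis=1)
assert np.allclose(K1, np.outer(g1, mu1) + np.outer(g2, mu2))

# From the transient state, g_1(4) = 0.3/0.8 = 0.375 and g_1 + g_2 = 1:
assert abs(g1[4] - 0.375) < 1e-9
assert np.allclose(g1 + g2, 1.0)
```

Here g_1(4) is exactly the absorption probability into F_1 from the transient state, in line with Theorem 4.3 below.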
LEMMA 4.3. Let K be a stochastic kernel in a separable metric space (X, d), such that K(·, B) is upper semicontinuous for B closed. Then the support S of any stationary probability μ on ℬ is stochastically closed.

Proof. Clearly

1 = μ(S) = ∫ μ(dx) K(x, S).

Thus μ(S') = 1, where

S' = {x : K(x, S) = 1} = X − {x : K(x, S) < 1}.

Since K(·, S) is upper semicontinuous, S' is topologically closed. Therefore S' ⊃ S; i.e., S is stochastically closed. ∎

The next theorem shows that a compact Markov process converges with probability 1 to a random ergodic kernel, and that g_j(x) is the probability of convergence to F_j starting at x. Let

Ω_j = {ω : d(X_n(ω), F_j) → 0 as n → ∞}.

THEOREM 4.3. For any compact Markov process X,

P(∪_{j=1}^d Ω_j) = 1   (4.2)

and

g_j(x) = P_x(Ω_j).   (4.3)
If f ∈ L(X) and f vanishes on F, then U^nf = V^nf, so that ||U^nf|| → 0 geometrically as n → ∞.

Proof. For λ ∈ G,

|U_λ^nf(x)| ≤ U_1^ng(x),

where g(x) = |f(x)| ∈ L(X) if f ∈ L(X). As n → ∞ the left side converges to |U_λf(x)| and the right to U_1g(x). The latter is 0 if f, and thus g, vanishes on F, since K_1(x, X − F) = 0 by Theorem 4.2. So U_λf = 0, U^nf = V^nf, and

||U^nf|| ≤ ||V^n|| ||f||.

Since ||V^n|| converges geometrically to 0, the second statement of the theorem is proved.

Now f = d(·, F) ∈ L(X) and vanishes on F, so |U^nf| converges geometrically to 0. And

E[f(X_n)] = E[E[f(X_n)|X_0]] = E[U^nf(X_0)] ≤ |U^nf|.
Therefore

Σ_{n=0}^∞ E[d(X_n, F)] = Σ_{n=0}^∞ E[f(X_n)] < ∞,

so that Σ_{n=0}^∞ d(X_n, F) < ∞ a.s. In other words, P(Γ) = 1, where

Γ = {ω ∈ Ω : lim_{n→∞} d(X_n(ω), F) = 0}.

Since Ug_j = g_j,

E(g_j(X_{n+1})|X_n, ..., X_0) = g_j(X_n)

a.s. Thus g_j(X_0), g_j(X_1), ... is a martingale. Since it is bounded (by |g_j|), it converges on a set Γ_j with P(Γ_j) = 1 [Neveu, 1965, (I), p. 137].

Suppose now that {x_n} is any sequence in X such that d(x_n, F) → 0, although there is no j for which d(x_n, F_j) → 0. Then there are subsequences x_{n'} and x_{n''}, and j' ≠ j'', such that

d(x_{n'}, F_{j'}) → 0   and   d(x_{n''}, F_{j''}) → 0.

Then g_{j'}(x_{n'}) → 1 and g_{j'}(x_{n''}) → 0, so that g_{j'}(x_n) does not converge. From this we conclude that, if d(x_n, F) → 0 and g_j(x_n) converges for all j, there is some j such that d(x_n, F_j) → 0 as n → ∞. If

ω ∈ Γ ∩ ∩_{j=1}^d Γ_j,

then this result is applicable to x_n = X_n(ω). Thus there is a j = j(ω) such that ω ∈ Ω_j, and

Γ ∩ ∩_{j=1}^d Γ_j ⊂ ∪_{j=1}^d Ω_j.
Since the former has probability 1, the latter does too, and (4.2) is proved. For any ω ∈ ∪_{j=1}^d Ω_j (thus a.s.),

g_j(X_n(ω)) → 1_{Ω_j}(ω).

Therefore

g_j(x) = E_x(g_j(X_0)) = E_x(g_j(X_n)) → E_x(1_{Ω_j}) = P_x(Ω_j)

as n → ∞. This proves (4.3). ∎
In the remainder of this section we consider the asymptotic behavior of the proportion

ν_n(A) = (1/n) Σ_{m=0}^{n-1} 1_A(X_m)

of visits to the Borel set A among the first n steps. The random probability ν_n is the empirical distribution of X_0, ..., X_{n-1}. We shall see that, with probability 1, ν_n converges weakly to μ_{j(ω)}, where j(ω) is the index of the ergodic kernel that X_n approaches. Let

γ = Σ_{j=1}^d 1_{Ω_j} μ_j,

and observe that

∫ f dν_n = (1/n) Σ_{m=0}^{n-1} f(X_m) = (1/n) S_n(f)

for all f ∈ B(X). Thus we wish to prove:

THEOREM 4.4. For any compact Markov process X, the probability is 1 that (1/n)S_n(f) → γ(f) as n → ∞ for all f ∈ C(X).
Proof. A theorem due to Jamison (1965, Theorem 3.2) shows that, under the sole assumption that the sequence U^nf is equicontinuous for f ∈ C(X), the probability is 1 that (1/n)S_n(f) and U_1f(X_n) converge to the same limit for all f ∈ C(X). But if ω ∈ Ω_j, U_1f(X_n) → γ(f) for all f ∈ B(X). In view of (4.2), the proof is complete. ∎
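The simplest nontrivial instance of Theorem 4.4 is deterministic: a 2-cycle, whose empirical distribution converges although X_n does not (our illustration):

```python
import numpy as np

# Deterministic 2-cycle started at 0: X_n = n mod 2.  X_n does not
# converge, but the empirical distribution tends to mu = (1/2, 1/2).
path = [n % 2 for n in range(1001)]           # X_0, ..., X_1000

f = np.array([3.0, 7.0])                      # arbitrary f on {0, 1}
avg = sum(f[x] for x in path) / len(path)     # (1/n) S_n(f)
gamma_f = 0.5 * f[0] + 0.5 * f[1]             # (mu, f) = gamma(f)
assert abs(avg - gamma_f) < 0.01
```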
Chapter 5 gives additional information about the asymptotic behavior of the sums S_n(f) when f ∈ C(X) and U is regular.

3.5. Subergodic Decomposition

In the last section we saw that the probabilities g_j(x) of converging to the various ergodic kernels F_j are a basis for the set D(1) of eigenfunctions
for the eigenvalue 1 of U. Thus the structure of D(1) depends on the behavior of corresponding compact Markov processes outside of F = U:= 4. In this section we will see that compact Markov processes move cyclically with some period pi within each 4 , and that the set G of eigenvalues of U of modulus 1 is just the collection of all pith roots of unity for all j . Thus G reflects the behavior of processes inside of F. THEOREM 5.1. For each j = 1, ... ,d, there is a maximal pi 2 1 for which there are nonempty, pairwise disjoint, topologically closed sets 4"',1 < m < p j , with union 4 , such that K(x, F;") = I for X E (Fj'+pj = q).These sets are unique up to cyclic rearrangement. Furthermore
G = ∪_{j=1}^d C_{p_j},    (5.1)

where C_p is the set of pth roots of unity. Thus a compact Markov process is aperiodic if and only if p_j = 1 for all j. The integer p_j is called the period of F_j, and the F_j^m are the subergodic kernels of F_j.

Suppose that X' is a stochastically and topologically closed subset of X, d' is the restriction of d to X', ℬ' (= ℬ ∩ X') is the collection of Borel subsets of (X', d'), and K' is the restriction of K to X'; i.e., K'(x, B) = K(x, B) if x ∈ X' and B ∈ ℬ'. Since X' is stochastically closed, K' is a stochastic kernel, and, since X' is topologically closed, (X', d') is compact. The first step in the proof of Theorem 5.1 is to show that the corresponding transition operator U' is a Doeblin-Fortet operator, so that it corresponds to compact Markov processes according to Definition 3.1. If f' ∈ L(X'), the same is true of its real and imaginary parts f_1' and f_2'. There is an extension f_1 of f_1' to X such that ||f_1|| = ||f_1'||' (Dudley, 1966, Lemma 5), and similarly for f_2'. Then f = f_1 + i f_2 is an extension of f' with ||f|| ≤ ρ ||f'||' for some constant ρ, so that U' is bounded on L(X'). And

||U'^{mk} f'||' ≤ ||U^{mk} f|| ≤ r^m ||f|| + R' |f|

for some R' < ∞ by Lemma 1.1, so that U' satisfies (2.1.9) with k' = mk, if m is sufficiently large that r^m ρ < 1. Let K_j and U_j correspond in this way to X' = F_j, and let G_j be the eigenvalues of modulus 1 of U_j.
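For a finite chain the period of Theorem 5.1 can be computed directly: the period of a state i within its ergodic kernel is the greatest common divisor of the lengths n of all loops with K^(n)(i, {i}) > 0. A minimal numerical sketch (the matrices and the step cutoff are ours, chosen for these small examples; they are not from the text):

```python
import numpy as np
from math import gcd

def period(P, i):
    """gcd of all n >= 1 with P^n(i, i) > 0, for a finite stochastic matrix P."""
    d = P.shape[0]
    g = 0
    Q = np.eye(d)
    for n in range(1, 2 * d * d + 1):  # enough loop lengths for these examples
        Q = Q @ P
        if Q[i, i] > 0:
            g = gcd(g, n)
            if g == 1:
                break
    return g

# A deterministic 3-cycle 0 -> 1 -> 2 -> 0: one ergodic kernel of period 3,
# with subergodic kernels {0}, {1}, {2} visited cyclically.
P_cycle = np.array([[0.0, 1.0, 0.0],
                    [0.0, 0.0, 1.0],
                    [1.0, 0.0, 0.0]])
p_cycle = period(P_cycle, 0)

# A chain with all transition probabilities positive is aperiodic: period 1.
P_lazy = np.array([[0.5, 0.5],
                   [0.5, 0.5]])
p_lazy = period(P_lazy, 0)
```

The eigenvalues of P_cycle are exactly the cube roots of unity, in agreement with (5.1).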
LEMMA 5.1. G = ∪_{j=1}^d G_j.

Proof. If f ∈ D(λ), then f_j = f I_{F_j} ∈ D_j(λ) for all j. If, in addition, f ≠ 0, then f_j ≠ 0 for some j by the Corollary to Lemma 4.2, so that λ ∈ G_j. Thus G ⊂ ∪_{j=1}^d G_j. Conversely, suppose that λ ∉ G. For any f_j ∈ L(F_j) let f be a Lipschitz extension of f_j to X. Then U_{j,λ}^n f_j = (U_λ^n f)|F_j → 0 in ||·||_j; hence λ ∉ G_j. Therefore ∪_{j=1}^d G_j ⊂ G. ∎
It is easily seen that F_j is the only ergodic kernel of K_j. This fact is the basis for further analysis of U_j. Thus the next two lemmas deal with Doeblin-Fortet operators with compact state spaces (X, d) for which X is the only ergodic kernel.

LEMMA 5.2. If X is the only ergodic kernel, then G = C_p for some p ≥ 1.

LEMMA 5.3. There are nonempty, topologically closed sets X^1, ..., X^p, pairwise disjoint with union X, such that K(x, X^{m+1}) = 1 for all x ∈ X^m. If Y^1, ..., Y^q is another such collection with q ≥ p, then q = p and there is an integer v such that Y^m = X^{m+v} for all m.

Proof of Lemma 5.2. First we show that G is a group under complex multiplication. If λ_j ∈ G, j = 1, 2, and f_j is a corresponding eigenfunction, then |f_j(x)| = |f_j| ≠ 0 by the Corollary to Lemma 4.2. The equation U f_j(x) = λ_j f_j(x) then implies

K(x, {y : f_j(y) = λ_j f_j(x)}) = 1,    (5.2)

from which it follows that

U(f_1 f_2)(x) = λ_1 λ_2 (f_1 f_2)(x)

and

U f_1^{-1}(x) = λ_1^{-1} f_1^{-1}(x).

Since f_1 f_2 and f_1^{-1} ∈ L(X), λ_1 λ_2 and λ_1^{-1} ∈ G, and G is a group, as claimed.

We now show that any finite subgroup G of {λ : |λ| = 1} is C_p for some p. For real t, let e(t) = exp(i2πt). If λ ∈ G then, since G is finite, λ is a root of unity. Hence there are positive integers o(λ) and r(λ) such that r(λ) ≤ o(λ), (r(λ), o(λ)) = 1, and λ = e(r(λ)/o(λ)). Also e(1/o(λ)) ∈ G. For there are integers a and b such that 1 = a r(λ) + b o(λ); hence 1/o(λ) = a r(λ)/o(λ) + b, so that e(1/o(λ)) = λ^a ∈ G. It follows that C_{o(λ)} ⊂ G. Let p = max_{λ∈G} o(λ). Then, for any λ ∈ G, o(λ) | p. For there are integers a and b such that a o(λ) + b p = (o(λ), p), and e(1/o(λ)) and e(1/p) ∈ G. Thus e(t) ∈ G, where

t = (o(λ), p)/(o(λ) p).    (5.3)

Since o(e(t)) ≤ p,

r(e(t))/o(e(t)) ≥ 1/o(e(t)) ≥ 1/p,

so, by (5.3), o(λ) ≤ (o(λ), p); i.e., o(λ) = (o(λ), p). Thus o(λ) | p, as claimed. But then λ ∈ C_p, and G ⊂ C_p. Therefore G = C_p. ∎
Proof of Lemma 5.3. Uniqueness. Given Y^1, ..., Y^q, let ω = exp(i2π/q) and

g = Σ_{m=1}^q ω^m I_{Y^m}.

Since the Y^j are compact, d(Y^j, Y^k) > 0 if j ≠ k, so that g ∈ L(X). Then U I_{Y^m} = I_{Y^{m-1}}, and

Ug = Σ_{m=1}^q ω^m I_{Y^{m-1}} = ωg.

Clearly g ≠ 0, so ω ∈ G = C_p. Since q ≥ p, we must have q = p.

If λ ∈ G, D(λ) is one dimensional. For let f ∈ D(λ), f ≠ 0, and let x_0 ∈ X. Then f(x_0) ≠ 0, and, if f' ∈ D(λ) and c = f'(x_0)/f(x_0), Δ(x) = f'(x) − c f(x) belongs to D(λ) and vanishes at x_0. Since |Δ(x)| ≡ |Δ|, f' = cf. Let λ = exp(i2π/p) and

f = Σ_{n=1}^p λ^n I_{X^n}.

If g is as in the first paragraph of the proof, f, g ∈ D(λ), so there is a complex constant c such that

g = cf.    (5.4)

For any x_0 ∈ X there are 1 ≤ m, n ≤ p such that x_0 ∈ Y^m ∩ X^n. Evaluating (5.4) at x_0 we obtain λ^m = cλ^n, or c = λ^{-v}, where v = n − m. Thus

Σ_{m=1}^p λ^m I_{Y^m} = Σ_{n=1}^p λ^{n-v} I_{X^n} = Σ_{m=1}^p λ^m I_{X^{m+v}}.

Equating powers of λ we obtain Y^m = X^{m+v} for all m.

Existence. Let λ = exp(i2π/p), f ∈ D(λ), f ≠ 0, x_0 ∈ X, and

X^m = {x : f(x) = λ^m f(x_0)}

for m ≥ 0. Note that X^1, ..., X^p are topologically closed and pairwise disjoint, and that X^{m+p} = X^m. By (5.2), K(x, X^{m+1}) = 1 for x ∈ X^m. Since x_0 ∈ X^0, it follows by induction that X^m ≠ ∅. The set

X' = ∪_{m=1}^p X^m
is stochastically and topologically closed. Since X is an ergodic kernel, X' = X. ∎

Conclusion of proof of Theorem 5.1. Let p_j be the integer p and F_j^m the set X^m obtained by applying Lemmas 5.2 and 5.3 to K_j and F_j. If q_j ≥ p_j and Y_j^m, m = 1, ..., q_j, are nonempty, topologically closed, and pairwise disjoint, with union F_j and K(x, Y_j^{m+1}) = 1 for x ∈ Y_j^m, then the Y_j^m are topologically closed in (F_j, d_j) and K_j(x, Y_j^{m+1}) = 1 for x ∈ Y_j^m. Thus the uniqueness assertion of Lemma 5.3 gives q_j = p_j and Y_j^m = F_j^{m+v}. This is the uniqueness claimed by the theorem. Since G_j = C_{p_j}, (5.1) follows from Lemma 5.1. ∎
We now sketch some ramifications of the subergodic decomposition. Though these have their own interest, they play no role in subsequent developments.

For any n ≥ 1, U^n is a Doeblin-Fortet operator. The subergodic kernel F_j^m is an ergodic kernel for K^{(p_j)}. Let g_j^m be the U^{p_j}-invariant Lipschitz function that is one on F_j^m, and let μ_j^m be the T^{p_j}-invariant probability with support F_j^m. If X_n is a compact Markov process with transition kernel K and initial distribution α, then

g_j^m(x) = P(d(X_n, F_j^{m+n}) → 0 as n → ∞).    (5.5)

The set F_j^m is the unique ergodic kernel of K^{(p_j)}|F_j^m, μ_j^m|F_j^m is its only stationary probability, and the corresponding (Doeblin-Fortet) operator is regular. If λ^{p_j} = 1, let

h_{j,λ} = Σ_{m=1}^{p_j} λ^m g_j^m   and   ν_{j,λ} = p_j^{-1} Σ_{m=1}^{p_j} λ^{-m} μ_j^m.

We have U g_j^m = g_j^{m-1} and T μ_j^m = μ_j^{m+1}, from which it follows that U h_{j,λ} = λ h_{j,λ} and T ν_{j,λ} = λ ν_{j,λ}. The functions h_{j,1} and g_j belong to D(1) and agree on F, so, by the Corollary to Lemma 4.2,

h_{j,1} = g_j,   j = 1, ..., d.    (5.6)

Equations (5.5) and (5.6) amplify (4.3). Similarly ν_{j,1} is a stationary probability with ν_{j,1}(F_j) = 1; so, according to Theorem 4.2,

ν_{j,1} = μ_j,   j = 1, ..., d.

For λ ∈ G, let

U_λ f = Σ_{j : λ^{p_j} = 1} (∫ f dν_{j,λ}) h_{j,λ}.

The linear operator U_λ is bounded on L(X).
If f ∈ L(X), a direct analysis of U^n for n large shows that U_λ f(x) = U_λ° f(x) for all x ∈ F, where U_λ° is the corresponding spectral component of U. Since U_λ f and U_λ° f belong to D(λ), the Corollary to Lemma 4.2 implies that U_λ f = U_λ° f. Thus U_λ = U_λ°.
3.6. Regular and Absorbing Processes

Two types of aperiodic processes, regular and absorbing processes, are especially important in applications. According to Definition 2.2.1 and the paragraph that follows it, regularity means that ||U^n − U_∞|| → 0 as n → ∞, and U_∞ f(x) does not depend on x. Theorems 4.1 and 5.1 show that a compact Markov process is regular if and only if it has but one ergodic kernel, and this kernel has period 1. Alternatively, there is a unique stationary probability μ, and the distribution μ_n of X_n converges to μ for any initial distribution μ_0.

DEFINITION 6.1. A Doeblin-Fortet operator for a compact state space (X, d) (or a corresponding compact Markov process X_n) is absorbing if all ergodic kernels are unit sets: F_j = {a_j}.

Aperiodicity is obvious, and, by Theorem 4.3, X_n converges with probability 1 to a random absorbing state a_j, j = j(ω). If there are two or more absorbing states, the probability of convergence to each of them depends on the initial distribution. In particular, such a process is not regular.

We now give useful criteria for a compact Markov process to be regular or absorbing. These criteria are expressed in terms of the support σ_n(x) of K^{(n)}(x, ·).
THEOREM 6.1. A process is regular if and only if there is a y ∈ X such that

d(σ_n(x), y) → 0 as n → ∞ for all x ∈ X.    (6.1)

THEOREM 6.2. A process is absorbing if and only if there are absorbing states a_1, ..., a_d such that, for every x, there is a j = j(x) for which

d(σ_n(x), a_j) → 0 as n → ∞.    (6.2)
Proof of Theorem 6.1. Suppose that (6.1) obtains. For any x ∈ F_j^m and n = kp_j we have K^{(n)}(x, F_j^m) = 1. Thus σ_n(x) ⊂ F_j^m, so that

d(F_j^m, y) ≤ d(σ_n(x), y).

It then follows from (6.1), on letting n → ∞, that d(F_j^m, y) = 0; i.e., y ∈ F_j^m. Therefore y belongs to all subergodic kernels. Since the subergodic kernels are disjoint, there must be only one, and U is regular.

Suppose, conversely, that U is regular with ergodic kernel F. Let y ∈ F and, for any ε > 0, let O be the open sphere with radius ε and center y. Then

lim inf_{n→∞} K^{(n)}(x, O) ≥ μ(O) > 0,

the latter since the stationary probability μ has support F. Thus K^{(n)}(x, O) > 0, σ_n(x) ∩ O ≠ ∅, and d(σ_n(x), y) < ε for all n sufficiently large. ∎
Proof of Theorem 6.2. Assume (6.2). If F' is an ergodic kernel, let x ∈ F'. Then σ_n(x) ⊂ F', so that (6.2) implies that a_j ∈ F'. Since {a_j} is stochastically and topologically closed, {a_j} = F', and U is absorbing.

Suppose that U is absorbing. For any x ∈ X there is a j = j(x) such that g_j(x) > 0. If O is the open ε sphere about a_j, then

lim inf_{n→∞} K^{(n)}(x, O) ≥ K_∞(x, O) ≥ g_j(x) > 0.

Therefore K^{(n)}(x, O) > 0 and d(σ_n(x), a_j) < ε for n sufficiently large. ∎
Application of these criteria is facilitated by the following interrelationship among the sets σ_n(x).

THEOREM 6.3. For all m, n ≥ 0 and x ∈ X,

σ_{m+n}(x) = σ̄,    (6.3)

where σ = ∪_{y ∈ σ_m(x)} σ_n(y) and σ̄ is its closure. If the sets σ_m(x) and σ_n(y) are finite, then

σ_{m+n}(x) = σ.    (6.4)
Proof. First, K^{(n)}(y, σ̄) = 1 for y ∈ σ_m(x), since y ∈ σ_m(x) implies that σ̄ ⊃ σ_n(y). Hence

K^{(m+n)}(x, σ̄) = ∫ K^{(m)}(x, dy) K^{(n)}(y, σ̄) = 1,

and therefore σ̄ ⊃ σ_{m+n}(x).

To prove the reverse inclusion, we note first that

1 = K^{(m+n)}(x, σ_{m+n}(x)) = ∫ K^{(m)}(x, dy) K^{(n)}(y, σ_{m+n}(x)),

so that K^{(m)}(x, σ*) = 1, where

σ* = {y : K^{(n)}(y, σ_{m+n}(x)) = 1}.

Since σ_{m+n}(x) is closed, K^{(n)}(y, σ_{m+n}(x)) is upper semicontinuous by Lemma 2.2.3, and σ* is closed. Thus σ* ⊃ σ_m(x), so that y ∈ σ_m(x) implies that σ_{m+n}(x) ⊃ σ_n(y). In other words, σ_{m+n}(x) ⊃ σ; hence σ_{m+n}(x) ⊃ σ̄. This completes the proof of (6.3).

If σ_m(x) and the σ_n(y) are finite, σ is too. Hence σ̄ = σ, and (6.4) obtains. Using this equation and induction, we derive finiteness of σ_n(x) from finiteness of σ_1(x). ∎
3.7. Finite Markov Chains

When the state space X of a Markov process X = {X_n}_{n≥0} is a finite set, the process is a finite Markov chain. We noted in Section 3.3 that X is a compact space and X a compact process with respect to any metric d. In this section we collect various observations concerning specialization of the theory of compact Markov processes to this case. Though such specialization is an easy and instructive exercise, direct approaches to the theory of finite Markov chains (e.g., Kemeny and Snell, 1960; Feller, 1968; Chung, 1967) are much simpler.

The space B(X) = L(X) of all complex valued functions or vectors on X is finite dimensional, so the norms |·| and ||·|| on this space are equivalent. The same is true of the corresponding norms on (bounded) linear operators on these spaces. Thus the same operators are orderly, ergodic, aperiodic, and regular with respect to the two norms. The fact that X is aperiodic or regular in |·| for certain finite state learning models is the basis for the treatment of their event sequences in Chapter 6.

One wants to have matrix interpretations of the various statements about operators in the preceding sections. For every operator W on B(X), there is a unique complex valued function or matrix w = 𝒜(W) on X × X such that

Wf(x) = Σ_{y ∈ X} w(x, y) f(y),
or Wf = w·f, where dot denotes matrix multiplication. Thus, if U and K are the transition operator and transition kernel of X, the transition matrix P = 𝒜(U) has values or elements P(x, y) = K(x, {y}). The mapping 𝒜 is an algebra isomorphism between operators and matrices:

𝒜(aV + bW) = a𝒜(V) + b𝒜(W)

and

𝒜(VW) = 𝒜(V)·𝒜(W).

In fact, it is not difficult to show that it is an isometry, |w| = |W|, with respect to the norm

|w| = max_{x ∈ X} Σ_{y ∈ X} |w(x, y)|

on matrices. These facts yield the desired matrix interpretations. For example, in the aperiodic case

U^n = U_∞ + V^n,

where |V^n| → 0 geometrically, so, if P_∞ = 𝒜(U_∞) and v = 𝒜(V),

P^n = P_∞ + v^n,

where |v^n| → 0 geometrically.

All subsets of X are topologically closed, so an ergodic kernel is a stochastically closed set with no stochastically closed proper subsets. The support of a measure on X is just the set of its atoms. Theorem 3.4.2 gives the following representation of the matrix P_∞ = 𝒜(U_∞):

P_∞(x, y) = Σ_{j=1}^d g_j(x) μ_j({y}).

In the ergodic case d = 1, P_∞(x, y) = μ_1({y}) for all x; i.e., all rows of P_∞ are equal to the stationary distribution.

If x_n, x ∈ X, d(x_n, x) → 0 as n → ∞ if and only if x_n = x for (all sufficiently) large n. Similarly, if A ⊂ X, d(x_n, A) → 0 if and only if x_n ∈ A for large n. Thus, for example, in Theorem 3.4.3

Λ_j = {ω : X_n(ω) ∈ F_j for large n}.

The fact that the empirical distribution ν_n of X_0, ..., X_{n-1} converges weakly to μ_{j(ω)} (Theorem 3.4.4) with probability 1 is equivalent, in this context, to ν_n({x}) → μ_{j(ω)}({x}) with probability 1 for all x ∈ X. The words "topologically closed" can be deleted from Theorem 3.5.1. In (3.5.5)

g_j^m(x) = P(X_n ∈ F_j^{m+n} for large n).
3.7. FINITE MARKOV CHAINS
65
If P " ( x , y )> 0 we say that y can be reached from x in n steps, and if this holds for some n 2 0 we say that y can be reached from x. Theorems 3.6.1 and 3.6.2 can be rephrased in these terms as follows. A finite Markov chain is regular if and only if there is a y E X and, for every x E X, an integer N,, such that y can be reached from x in n steps if n 2 N, . A finite Markov chain is absorbing if and only if, from any state x, some absorbing state uj can be reached.
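These reachability criteria translate directly into computations on the transition matrix. A sketch (the function names, example matrices, and the step cutoff max_n are ours, chosen large enough for these small examples):

```python
import numpy as np

def is_regular(P, max_n=64):
    """Finite-chain version of Theorem 6.1: some y is reachable from every x
    in n steps for every sufficiently large n (checked up to max_n)."""
    d = P.shape[0]
    Q = np.eye(d)
    late = np.ones((d, d), dtype=bool)
    for n in range(1, max_n + 1):
        Q = Q @ P
        if n > max_n // 2:              # only large n matters
            late &= (Q > 0)
    return bool(late.all(axis=0).any())  # some column y positive for all x

def is_absorbing(P):
    """Finite-chain version of Theorem 6.2: from every x some absorbing
    state can be reached."""
    d = P.shape[0]
    A = (P > 0) | np.eye(d, dtype=bool)
    R = np.linalg.matrix_power(A.astype(int), d) > 0  # reachable in <= d steps
    absorbing = [j for j in range(d) if P[j, j] == 1.0]
    return all(R[x, absorbing].any() for x in range(d))

P_pos = np.array([[0.9, 0.1], [0.2, 0.8]])   # regular
P_cyc = np.array([[0.0, 1.0], [1.0, 0.0]])   # period 2, hence not regular
P_abs = np.array([[1.0, 0.0, 0.0],
                  [0.5, 0.5, 0.0],
                  [0.0, 0.5, 0.5]])          # absorbing; state 0 is absorbing
checks = (is_regular(P_pos), is_regular(P_cyc), is_absorbing(P_abs))
```

The periodic chain P_cyc illustrates why the criterion demands reachability at *all* large n, not just some n.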
4  Distance Diminishing Models with Noncompact State Spaces
In this chapter we consider a distance diminishing model ((X, d), (E, 𝒴), p, u) whose state sequences have transition operator U. The state space is not assumed to be compact. Theorems 1.1 and 2.2 give conditions for regularity of U. The key assumptions in Theorem 1.1 are that (X, d) is bounded and p(x, ·) has a lower bound that does not depend on x. Theorem 2.2 assumes "regularity in X'" for an "invariant" subset X' of X such that d(x, X') is bounded.

4.1. A Condition on p
Boundedness of (X, d) means, of course, that

b = sup_{x,y ∈ X} d(x, y) < ∞.    (1.1)

Our condition on p is this:

(a) There is a probability ν on 𝒴 and an α > 0 such that p(x, A) ≥ αν(A) for all A ∈ 𝒴.
If p'(x, A) = ν(A), then p_j'(x, A) = ν^j(A), where ν^j on 𝒴^j is the jth Cartesian power of ν on 𝒴. Consequently, if r_j' is defined by (2.1.1) with p_j' in place of p_j,

r_j' = sup_{x ≠ y} ∫ ν^j(de^j) d(u(x, e^j), u(y, e^j))/d(x, y).

The following condition supplements the assumption r_k < 1 in the definition of a distance diminishing model (Definition 2.1.1).

(b) There is a k' ≥ 1 such that r' = r'_{k'} < 1.
If, in addition to (a) and (b),

d(u(x, e), u(y, e)) ≤ d(x, y)    (1.3)

for all e ∈ E and x, y ∈ X, we can take k = k' in Definition 2.1.1. When α = 1 this is obvious, since p = p', so that r_j = r_j'. In any case

p_j(x, A) ≥ α^j ν^j(A)    (1.4)

for all j ≥ 1 and A ∈ 𝒴^j. Thus, if α < 1, the equation

p_j(x, A) = α^j ν^j(A) + (1 − α^j) p_j*(x, A)

defines a stochastic kernel p_j* on X × 𝒴^j, and

r_j ≤ α^j r_j' + (1 − α^j) r_j*,    (1.5)

where r_j* is defined by (2.1.1) with p_j* in place of p_j. However, (1.3) implies that

d(u(x, e^j), u(y, e^j)) ≤ d(x, y)

for all j, e^j, x, and y, so that r_j* ≤ 1 for any stochastic kernel p_j*. Hence (1.5) and (b) yield r_{k'} < 1, as claimed.

The following result is a refinement due to Norman (1970a, Theorem 1) of Theorem 1 of Ionescu Tulcea (1959).

THEOREM 1.1. Under (1.1), (a), and (b), U is regular.
The theorem is proved by combining several of our previous results with the following lemma.

LEMMA 1.1.

s = sup_{j≥1} s_j < 1,

where s_j is the supremum of |δp_j(x, y, A)| over x, y ∈ X and A ∈ 𝒴^j.
Proof of Lemma 1.1. First we define a family of stochastic kernels ν_{j,n} on X × 𝒴^j. For 1 ≤ j ≤ n let ν_{j,n}(x, ·) = ν^j, and for j > n let

ν_{j,n}(x, A) = ∫∫ ν^n(de^n) p_{j−n}(x', de^{*j−n}) I_A(e^j),

where x' = u(x, e^n) and e^{*j−n} = (e_n, ..., e_{j−1}). If j ≤ n, then clearly δν_{j,n}(x, y, A) = 0 for all x, y, and A. If j > n, then

δν_{j,n}(x, y, A) = ∫ ν^n(de^n) δp_{j−n}(x', y', de^{*j−n}) I_A(e^j).

Thus, by (1.1.4),

|δν_{j,n}(x, y, A)| ≤ ∫ ν^n(de^n) |δp_{j−n}(x', y', ·)| ≤ R_{j−n} ∫ ν^n(de^n) d(x', y') ≤ R_{j−n} r_n' d(x, y).

Theorem 2.1.1 and (1.1) then yield

|δν_{j,n}(x, y, A)| ≤ R_∞ r_n' b

for all j, n ≥ 1. Applying Lemma 2.1.2 to r_n' we obtain r_{ik'}' ≤ r'^i; hence

|δν_{j,ik'}(x, y, A)| ≤ R_∞ r'^i b = γ_i    (1.7)

for all i, j ≥ 1, x, y ∈ X, and A ∈ 𝒴^j. It follows immediately from (1.4) and the definition of ν_{j,n} that

p_j(x, A) ≥ α^n ν_{j,n}(x, A)

for all j, n ≥ 1, x ∈ X, and A ∈ 𝒴^j. Thus there is a stochastic kernel q_{j,n} on X × 𝒴^j such that

p_j(x, A) = α^n ν_{j,n}(x, A) + (1 − α^n) q_{j,n}(x, A).

Clearly

|δp_j(x, y, A)| ≤ α^n |δν_{j,n}(x, y, A)| + (1 − α^n) |δq_{j,n}(x, y, A)| ≤ α^n γ_i + (1 − α^n) = γ_i'

by (1.7) if n = ik'. Taking the suprema over A ∈ 𝒴^j, x, y ∈ X, and j ≥ 1 we obtain s ≤ γ_i' for all i ≥ 1. However, γ_i → 0, and thus γ_i' < 1 for i sufficiently large. ∎

Conclusion of proof of Theorem 1.1. By Lemma 2.1.3 and (1.1),
osc(U^j f) ≤ r_j b m(f) + s osc(f)

for f ∈ L(X), where s < 1 by the last lemma. Replacing f by U^n f and noting that m(U^n f) ≤ J ||f|| as a consequence of Lemma 3.1.1, we get

osc(U^{j+n} f) ≤ r_j b J ||f|| + s osc(U^n f).

Therefore

ε_{j+n} ≤ s ε_n + t_j,    (1.8)

where

ε_n = sup_{||f|| ≤ 1} osc(U^n f)    (1.9)

and t_j = r_j b J. Now ε_n is a nonincreasing sequence. Letting n → ∞ on both sides of (1.8) we obtain

lim_{n→∞} ε_n ≤ t_j/(1 − s)

for all j ≥ 1. But, according to Theorem 2.1.1, r_j → 0 and thus t_j → 0 as j → ∞. Therefore ε_n → 0 as n → ∞.

The closed convex hull H(U^n f) of the range of U^n f is compact, and, by (1.1.8),

H(U^{n+1} f) ⊂ H(U^n f);

hence

∩_{n=1}^∞ H(U^n f) ≠ ∅.

But, for f ∈ L(X),

diam H(U^n f) = osc(U^n f) → 0

as n → ∞, so this intersection contains a single point U_∞ f. Since U^n f(x) and U_∞ f are both in H(U^n f),

|U^n f(x) − U_∞ f| ≤ osc(U^n f)

for all x ∈ X. Thus

|U^n f − U_∞ f| ≤ osc(U^n f) ≤ ε_n ||f||    (1.10)

for all n ≥ 1. An application of Theorem 3.1.1 completes the proof. ∎
A NONMETRIC ANALOG OF THEOREM 1.1. The extreme case of a contractive operator is a constant operator: u(x) = u. Theorem 1.2 (p. 70) is an analog of Theorem 1.1 for learning models in which the operators u(·, e^k) corresponding to certain combinations e^k of events are constant. Suppes' (1960) stimulus sampling model for a continuum of responses is of this sort, and Section 17.1 describes an application of Theorem 1.2 to this model.

THEOREM 1.2. If a random system with complete connections satisfies (a) (p. 66) and (b') below, then U is regular on B(X).

(b') There is a k ≥ 1 and a G ∈ 𝒴^k such that ν^k(G) > 0 and u(x, e^k) = u(e^k) if e^k ∈ G.

The proof is much simpler than that of Theorem 1.1.
Proof. Clearly

p_k(x, A) ≥ α^k ν^k(A) ≥ α^k ν^k(A ∩ G) = c p°(A),

where c = α^k ν^k(G) > 0, and

p°(A) = ν^k(A ∩ G)/ν^k(G)

is a probability on 𝒴^k with p°(G) = 1. Thus there is a stochastic kernel p* such that

p_k(x, A) = c p°(A) + (1 − c) p*(x, A).

For f ∈ B(X),

U^k f(x) = c U° f(x) + (1 − c) U* f(x),

where

U° f(x) = ∫ p°(de^k) f(u(e^k))

and

U* f(x) = ∫ p*(x, de^k) f(u(x, e^k)).

Since U° f is constant,

osc(U^k f) = (1 − c) osc(U* f) ≤ (1 − c) osc(f),

and

osc(U^n f) ≤ (1 − c)^{[n/k]} osc(f)

for all n ≥ 0. On combination with (1.10) this yields

|U^n f − U_∞ f| ≤ (1 − c)^{[n/k]} osc(f)

for all n ≥ 0; i.e., U is regular on B(X). ∎
4.2. Invariant Subsets
DEFINITION 2.1. A nonempty subset X' of X is inuariunt for a distance diminishing model if u(X', e) c X' for all e E E. For a n y j 2 1, w j
=
(J u(X,e')
eJoEJ
is invariant. In multiresponse linear models, u(x,e) = (1 -eelx
+ &A,,
and any convex set that includes {A,: 8, > 0} is invariant. An invariant Bore1 set is clearly stochastically closed for a state sequence X.Furthermore, state sequences are attracted to invariant sets. THEOREM 2.1.
If X'
is inuariant, d(X,,, X ' ) 40 a s . as n 4 co.
Proof. Let d ( x ) = d ( x , X ' ) . This function is Lipschitz, though it need not be bounded. If x' E X', then u(x', en) E X', so that U n d ( x )=
s
p,,(x,de")d(u(x, e"),X ' )
< /p,(x, den) d(u(x, en), u(x', en)) < r,,d(x,x') . Taking the inf over x' E X ' , we obtain U"d(x) < r , d ( x ) ,
Therefore, if X o = x as.,
by Theorem 2.1.1, so that x.."=od(Xn)< 03 and d(Xn)+ O a s . The same conclusion is obtained for an arbitrary initial distribution by conditioning on x,. I
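Theorem 2.1 is easy to visualize in the linear-model example above. The following sketch (all parameter values are illustrative, not from the text) starts the state well outside the invariant interval spanned by the λ_e and watches d(X_n, X') shrink:

```python
import numpy as np

rng = np.random.default_rng(0)

theta = [0.3, 0.2]   # learning-rate parameters theta_e (illustrative)
lam = [0.0, 1.0]     # limit points lambda_e; the invariant set is X' = [0, 1]

x = 5.0              # initial state far from X'
dist = []
for n in range(200):
    dist.append(max(0.0, x - 1.0, -x))           # d(x, [0, 1])
    e = int(rng.integers(2))                     # a random event each trial
    x = (1 - theta[e]) * x + theta[e] * lam[e]   # u(x, e) = (1-th)x + th*lam

d0, d_last = dist[0], dist[-1]
```

Each operator contracts the distance to X' by at least the factor 1 − θ_e, so the recorded distances decay geometrically to 0, as Theorem 2.1 asserts.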
If X' is invariant and d', p', and u' are the restrictions of d, p, and u to X', then R_j' ≤ R_j and r_j' ≤ r_j, and ((X', d'), (E, 𝒴), p', u') is a distance diminishing model. This "submodel" sometimes satisfies the conditions for regularity given in Theorems 3.6.1 and 1.1 even when the "complete" model does not. Hence the importance of the following result of Norman (1970a, Theorem 2).

THEOREM 2.2. If X' is invariant, d(x, X') is bounded, and U' is regular, then U is regular.

Clearly, if U is regular, then the quantity ε_n defined by (1.9) converges to 0 as n → ∞. And the last paragraph of the proof of Theorem 1.1 establishes the converse. The corresponding equivalence is, of course, valid for U'. Thus our task reduces to showing that ε_n' → 0 implies ε_n → 0.
Proof. Let f ∈ L(X) and let f' = f|X'. For any x ∈ X and x' ∈ X', we can write

U^n f(x) = g_n(x, x') + h_n(x, x'),

where

g_n(x, x') = ∫ p_n(x, de^n) (f(u(x, e^n)) − f(u(x', e^n)))

and

h_n(x, x') = ∫ p_n(x, de^n) f'(u(x', e^n)).

Consequently

|U^n f(x) − U^n f(y)| ≤ |g_n(x, x')| + |g_n(y, y')| + |h_n(x, x') − h_n(y, y')|
≤ m(f) r_n d(x, x') + m(f) r_n d(y, y') + osc(f'),

since h_n(x, x') is in the closed convex hull of the range of f'. Take the infimum over x' and y' and then the supremum over x and y to get

osc(U^n f) ≤ 2 m(f) r_n b' + osc(f'),

where b' < ∞ is the supremum of d(x, X'). Then replace f by U^n f and use Lemma 3.1.1 and (U^n f)' = U'^n f' to obtain

osc(U^{2n} f) ≤ 2 b' J r_n ||f|| + ε_n' ||f'||'.

It follows that

ε_{2n} ≤ 2 b' J r_n + ε_n' → 0

as n → ∞. Since ε_n is nonincreasing, ε_n → 0, as was to be shown. ∎
5  Functions of Markov Processes

5.1. Introduction

This chapter is concerned with the time series analysis of processes of the form Y_j = f(X_j), where X_j is a Markov process and f is a real valued function of its states. The theory includes bounded Lipschitz functions of a regular Doeblin-Fortet process, as well as other examples described later. Here are our assumptions:

(a) X = {X_n}_{n≥0} is a Markov process with state space (X, ℬ), transition kernel K, and transition operator U.

(b) The functions to which the theory applies form a subset L of B'(X), the set of bounded measurable real valued functions on X. The class L is a Banach space under a norm ||·|| and 1 ∈ L. If f, g ∈ L then fg ∈ L and

||fg|| ≤ ||f|| ||g||.

Thus L is a Banach algebra. The supremum norm is continuous at 0 in L, or, equivalently,

C = sup_{f ≠ 0} |f|/||f|| < ∞.
(c) U = U_∞ + V is a regular operator on L.

In addition to L(X) functions of regular Doeblin-Fortet processes, the above assumptions are satisfied by a class of functions of a process X_n' = (X_n, E_n) from a distance diminishing model, assuming that the state sequence X_n is regular. This class includes all bounded measurable functions of E_n alone. Details of this important application of the theorems of this chapter are given in Section 6.1.

Another noteworthy special case is that in which L = B'(X) and ||f|| = |f|. Then U is regular if and only if Doeblin's condition is satisfied, there is only one ergodic set, and it is aperiodic (r = 1 and d_1 = 1 in the notation of Neveu, 1965, V.3). Functions of such processes have received much attention in the literature (see Iosifescu and Theodorescu, 1969, Section 1.2). If the state sequence of a random system with complete connections is of this type, the same is true of X_n' = (X_n, E_n), as is noted in Section 6.1. This provides one approach to functions of event sequences in finite state models, and in Suppes' continuous-response stimulus sampling model.

Before beginning in earnest, let us briefly survey the chapter's contents. Let

μ = U_∞ f,    S_n = Σ_{j=0}^{n-1} Y_j,    (1.1)

and

Z_n = (S_n − nμ)/n^{1/2}.

It is shown in Section 5.2 that Z_n is asymptotically normally distributed with mean 0 and variance

σ² = Σ_{u=-∞}^{∞} ρ_u,    (1.2)

where ρ_u is the asymptotic covariance of Y_n and Y_{n+u} [Z_n ~ N(0, σ²)]. Estimation of μ and σ² are considered, respectively, in Sections 5.3 and 5.4. Section 5.5 gives a representation of σ² in terms of U_∞ and K, and uses this representation to prove that σ² > 0 in an important special case. It is shown in Section 5.6 that the "shifted" process {Y_{N+j}}_{j≥0} approaches stationarity as N → ∞. Generalizations of these results to vector valued functions f and to spectral analysis are presented in Section 5.7.
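The flavor of the chapter's results can be seen in the simplest case, a two-state chain with f the indicator of state 1, where μ = U_∞f is the stationary probability of that state. A simulation sketch (the chain, seed, and sample size are illustrative, not from the text):

```python
import numpy as np

rng = np.random.default_rng(1)

# Two-state chain: P(0 -> 1) = 0.1 and P(1 -> 0) = 0.2, so the stationary
# probability of state 1 is 0.1/(0.1 + 0.2) = 1/3, i.e., mu = 1/3.
flip = {0: 0.1, 1: 0.2}
n = 200_000
x = 0
S = 0
for _ in range(n):
    S += x                      # Y_j = f(X_j) = X_j (indicator of state 1)
    if rng.random() < flip[x]:
        x = 1 - x               # switch states with the current flip rate

mean_hat = S / n                # by the strong law below, S_n / n -> mu a.s.
```

With n = 200,000 the sample mean is within a few thousandths of 1/3; the fluctuations are of order n^{-1/2}, which is the scaling behind Z_n.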
5.2. Central Limit Theorem

This section is devoted to establishing the following theorems.
THEOREM 2.1. S_n/n → μ a.s. as n → ∞.

THEOREM 2.2. E(Z_n²) = σ² + O(1/n).

THEOREM 2.3. Z_n ~ N(0, σ²) as n → ∞.
The proofs draw on many of the same estimates, so this background material is developed systematically, and the proof of each theorem is completed as its prerequisites become available. The proof of Theorem 2.3 is not finished until the end of the section. We shall assume that y = Umf= 0. The general case is obtained by applying this special case to f’=f-y. Let Fn = F ( X 0 , ..., X,,)
be the smallest Bore1 field with respect to which X o , ... , X,,are measurable. Note first that
Moments of Y,,.
E(Ym+jIF) = uif(Xm) = vif(Xm) (a.s.), since U w f = 0. Hence
IE(Ym+jIF)I
Ivy1 Cllvifll < CDa’ 1I f I Y
where a < 1, by (2.2.5). This is abbreviated
E(Y,,,+~IF,) = O(aj).
(2.1)
In this chapter we use such notation only when the order of magnitude is uniform over all relevant variables that do not appear on the right (rn 2 0 and o ESZ in this case). It follows that (2.2)
E(Yj) = E ( E ( Y ~ ~ F=~O) )( a j ) . Turning now to conditional second moments, we see that E( y m + j
yrn
+ k lFm) =
E(Yn,
+( j
A
k ) E(Yrn
+ ( j v k)lFm+
( j A k))lFm)
9
(2’3)
where j A k and j v k , are, respectively, the smaller and larger of j and k ; E(Ym
+j
yn,
+k
=
E(Ym
+ ( j A k ) 0 (a’k-”)IFm)
9
by (2.1); and (2.4)
since |Y_n| ≤ |f|, (2.4) implies

E(Y_{m+j} Y_{m+k}|ℱ_m) = O(a^{|j−k|}).    (2.5)

Moreover, E(Y_{m+j} Y_{m+k}|ℱ_m) = U^{j∧k}(f V^{|j−k|} f)(X_m), so that, with ρ_u = U_∞(f V^{|u|} f),

E(Y_m Y_{m+u}) → ρ_u    (2.6)

as m → ∞. This reconciles (2.6) and (1.1). Also

E(Y_{m+j} Y_{m+k}|ℱ_m) = ρ_{j−k} + O(a^{(j∧k)+|j−k|}).    (2.7)

Clearly

|ρ_u| ≤ |f V^{|u|} f| ≤ |f| |V^{|u|} f|,

so that ρ_u = O(a^{|u|}). Thus the series in (1.2) converges absolutely. Similar arguments applied to fourth moments show that, for 0 ≤ i ≤ j ≤ k ≤ l,

E(Y_i Y_j Y_k Y_l) = O(a^{(j−i)∨(l−k)}).    (2.10)
Moments of sums. Let

S_{m,n} = Σ_{j=m}^{m+n-1} Y_j

and

s_n = Σ_{j,k=0}^{n-1} ρ_{j−k}.

From (2.1) we get

E(S_{m,n}|ℱ_m) = Σ_{j=0}^{n-1} O(a^j) = O(1).    (2.11)

Similarly, (2.7) yields

E(S_{m,n}²|ℱ_m) = s_n + O(1),    (2.12)

since

Σ_{j,k=0}^{n-1} a^{(j∧k)+|j−k|} = O(1).

Now

s_n = Σ_{|u|<n} (n − |u|) ρ_u,

so that

s_n = nσ² + O(1),

since ρ_u = O(a^{|u|}). On combination with (2.12) this gives

E(S_{m,n}²|ℱ_m) = nσ² + O(1).    (2.13)

Since S_{0,n} = S_n, E(S_n²) = nσ² + O(1), which yields Theorem 2.2 on division by n.

Turning to the fourth moment,

E(S_{m,n}⁴) = Σ E(Y_i Y_j Y_k Y_l),
where the summands vary independently between m and m+n−1;

E(S_{m,n}⁴) ≤ 4! Σ' |E(Y_i Y_j Y_k Y_l)|,

where the prime indicates that summation is restricted to i ≤ j ≤ k ≤ l; and

Σ' |E(Y_i Y_j Y_k Y_l)| ≤ K Σ' a^{(j−i)∨(l−k)}

for some constant K, as a consequence of (2.10). Now

Σ' a^{(j−i)∨(l−k)} ≤ n² Σ_{s,t ≥ 0} a^{s∨t} = O(n²),

so

E(S_{m,n}⁴) = O(n²).    (2.14)

It follows that

E((S_n/n)⁴) = O(1/n²),    (2.15)

hence Σ_n (S_n/n)⁴ < ∞ a.s. and S_n/n → 0 a.s. This completes the proof of Theorem 2.1.

Conclusion of proof of Theorem 2.3. As a consequence of Theorem 2.2, σ², which was defined by the series (1.2), is nonnegative. We need not distinguish below between the cases σ² > 0 and σ² = 0, but, in the latter case,

Z_n ~ N(0, σ²) = δ_0

follows immediately from Theorem 2.2. Of the various estimates obtained above, only (2.11), (2.13), and

E(|S_{m,n}|³) = O(n^{3/2}),    (2.16)
which follows from (2.14), are needed henceforth. Our approach is drawn from a paper of Serfling (1968), to which the reader is referred for generalizations. At the end of Section 8.4 we show how, as an alternative to the following direct argument, Lemma 8.4.2 can be used to complete the proof.

Let 0 < ξ < 1, d = [n^ξ], and q = [n/d], so that d ~ n^ξ and q ~ n^{1−ξ}. Let

ζ_j = S_{jd}/√n

for j ≤ q. Clearly

Z_n − ζ_q = S_{qd,r}/√n,

where n = dq + r, so

E(|Z_n − ζ_q|³) = O((r/n)^{3/2})

by (2.16). Since 0 ≤ r < d = o(n), Z_n − ζ_q converges in probability to 0, so it suffices to show that ζ_q ~ N(0, σ²).

Since

Δζ_j = ζ_{j+1} − ζ_j = S_{jd,d}/√n,

(2.11), (2.13), and (2.16) give

E(Δζ_j|ℱ_{jd}) = O(1/√n),    (2.17)

E((Δζ_j)²|ℱ_{jd}) = (d/n)σ² + O(1/n),    (2.18)

and

E(|Δζ_j|³) = O((d/n)^{3/2})    (2.19)

for j < q. Let

h_j(λ) = E(exp(iλζ_j))

be the characteristic function of ζ_j. Then

h_{j+1}(λ) = E(exp(iλζ_j) exp(iλΔζ_j)) = E(exp(iλζ_j) E(exp(iλΔζ_j)|ℱ_{jd})).    (2.20)

Using the Taylor expansion

exp(it) = 1 + it − t²/2 + O(|t|³)

in conjunction with (2.17) and (2.18), we obtain

E(exp(iλΔζ_j)|ℱ_{jd}) = 1 − (λ²/2)(d/n)σ² + O(1/√n) + O(E(|Δζ_j|³|ℱ_{jd})).

Substitution into (2.20) and application of (2.19) yield

h_{j+1}(λ) = (1 − (λ²/2)(d/n)σ²) h_j(λ) + O(1/√n) + O((d/n)^{3/2}).

Thus, by iteration,

h_q(λ) = (1 − (λ²/2)(d/n)σ²)^q + O(q/√n) + O(q(d/n)^{3/2}).

Since dq/n → 1, the first term on the right converges to exp(−λ²σ²/2), and the third converges to zero. Clearly

q/√n ~ n^{1/2−ξ},

so the second term on the right converges to zero if ξ > 1/2. Choosing ξ in this way we obtain

h_q(λ) → exp(−λ²σ²/2),

from which Theorem 2.3 follows via the continuity theorem (Breiman, 1968, Theorem 8.28). ∎

5.3. Estimation of ρ_u
The natural estimator of the asymptotic autocovariance ρ_u of Y_n and Y_{n+u} is the sample autocovariance ρ̂_u given by

ρ̂_u = (n − u)^{-1} Σ_{i=0}^{n−u−1} (Y_i − Ȳ)(Y_{i+u} − Ȳ)    (3.1)

and ρ̂_{−u} = ρ̂_u for u ≥ 0, where Ȳ = S_n/n. This estimator is consistent in both the quadratic mean (q.m.) and a.s. senses.

THEOREM 3.1. As n → ∞, ρ̂_u → ρ_u in quadratic mean and almost surely.
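The estimator (3.1) is a one-liner. As a sanity check, a sketch with synthetic i.i.d. data (for which the true values are ρ_0 = Var(Y) and ρ_u = 0 for u ≥ 1; the data and seed are illustrative):

```python
import numpy as np

def rho_hat(Y, u):
    """Sample autocovariance (3.1): (n-u)^{-1} sum (Y_i - Ybar)(Y_{i+u} - Ybar)."""
    Y = np.asarray(Y, dtype=float)
    n = len(Y)
    Ybar = Y.mean()
    return float(np.sum((Y[:n - u] - Ybar) * (Y[u:] - Ybar)) / (n - u))

rng = np.random.default_rng(2)
Y = rng.standard_normal(100_000)   # i.i.d. N(0, 1): rho_0 = 1, rho_1 = 0
r0 = rho_hat(Y, 0)
r1 = rho_hat(Y, 1)
```

Both estimates are within O(n^{-1/2}) of their targets, in line with Theorem 3.1.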
Proof. Since ρ_{−u} = ρ_u, we can assume that u ≥ 0, and, since neither ρ̂_u nor ρ_u is affected if a constant is added to f, we can assume that U_∞ f = 0 without loss of generality. Clearly

ρ̂_u = ρ_u* + ε_n,    (3.2)

where

ρ_u* = (n − u)^{-1} Σ_{i=0}^{n−u−1} Y_i Y_{i+u}    (3.3)

and

ε_n = Ȳ² − Y' Ȳ − Y'' Ȳ,    (3.4)

with Y' = S_{0,n−u}/(n − u) and Y'' = S_{u,n−u}/(n − u). Consider ε_n first. By the Schwarz inequality,

E((Y' Ȳ)²) ≤ E^{1/2}(Y'⁴) E^{1/2}(Ȳ⁴) = O(1/(n−u)) O(1/n) = O(1/(n−u)n),

as a consequence of (2.14) and (2.15). The same estimate applies to the other two terms on the right in (3.4), so

E(ε_n²) = O(1/(n−u)n).    (3.5)

Thus ε_n → 0 q.m. and, since Σ_n E(ε_n²) < ∞, ε_n → 0 a.s., so it remains only to show that ρ_u* → ρ_u q.m. and a.s. If

W_i = Y_i Y_{i+u} − E(Y_i Y_{i+u}),

then, for k ≥ u,

E(W_i W_{i+k}) = E(W_i E(W_{i+k}|ℱ_{i+u})) = E(O(1) O(a^{k−u})) = O(a^{k−u})

by (2.7), while for u ≥ k ≥ 0, E(W_i W_{i+k}) = O(1). Therefore ρ_u* − ρ_u → 0 q.m. and a.s. as a consequence of Lemma 3.1. ∎
LEMMA 3.1. Let {W_i}_{i≥0} be a real valued stochastic process such that

|E(W_i W_{i+k})| ≤ a_k

for k ≥ 0, where a_k ≥ 0 and

A = Σ_{k=0}^∞ a_k < ∞.

Then

E((Σ_{i=0}^{n−1} W_i)²) ≤ 2An.    (3.7)

If c_{n+1} ≥ c_n > 0 and Σ_n c_n^{-2} < ∞, then

c_n^{-1} Σ_{i=0}^{n−1} W_i → 0    (3.8)

a.s. as n → ∞.

Lemma 3.1 and (2.5) yield an alternative proof of Theorem 2.1. The lemma gives much stronger estimates of the magnitude of Σ_i W_i than Theorems 2.1 and 3.1 require. For a fuller development of the method used to prove (3.8), see Serfling (1970a,b).
Proof. Note first that, for any real numbers b_0, ..., b_{n−1},

E((Σ_{i=0}^{n−1} b_i W_i)²) ≤ Σ_{i,j=0}^{n−1} |b_i| |b_j| a_{|i−j|}.    (3.9)

Thus

E((Σ_{i=0}^{n−1} W_i)²) ≤ Σ_{i,j=0}^{n−1} a_{|i−j|} ≤ 2An,

and (3.7) is proved. Let

T_n = Σ_{i=0}^{n−1} W_i/c_i.

Then, by (3.9) and the monotonicity of c_n,

E((T_n − T_m)²) ≤ 2A Σ_{i=m∧n}^{(m∨n)−1} c_i^{-2}.    (3.10)

Since Σ_i c_i^{-2} < ∞, T_n is q.m. Cauchy. Let T be its q.m. limit. Letting m → ∞ in (3.10) we see that

E((T − T_n)²) ≤ 2A Σ_{i=n−1}^∞ c_i^{-2}

for n ≥ 2. Thus the sequence E((T − T_{2^j})²) is summable, from which it follows that T_{2^j} → T a.s. Once it is established that T_n → T a.s., (3.8) follows via Kronecker's lemma (Neveu, 1965, p. 147).

To complete the proof that T_n → T a.s., it remains only to show that

Δ_j = max_{2^j ≤ n ≤ 2^{j+1}} |T_n − T_{2^j}| → 0

a.s. as j → ∞. For any 0 ≤ m ≤ j, the sum T_n − T_{2^j} of at most 2^j terms can be partitioned into blocks of 2^m consecutive terms each. Let V_{m,k} be the kth such block. Clearly

T_n − T_{2^j} = Σ'_m V_{m,k_m},

where the prime indicates that the sum need not include all m, and k_m is chosen appropriately. Hence, by the Schwarz inequality,

(T_n − T_{2^j})² ≤ (j+1) Σ'_m V_{m,k_m}²,

so that

E(Δ_j²) ≤ (j+1) Σ_{m=0}^j Σ_k E(V_{m,k}²) ≤ (j+1)² 2A Σ_i c_i^{-2}

by (3.10), where 2^j ≤ i < 2^{j+1}. It follows that Σ_j E(Δ_j²) < ∞ for j ≥ 1. Therefore Δ_j → 0 a.s. as j → ∞. ∎
I 5.4. Estimation of a’
If rs’ > 0 and s is a consistent estimator of a’, it follows from Theorem 2.3 that (S,-ny)/(ns)%
N
N(O, 1 ) .
(4.1)
This fact can serve as a basis for inference about y . Norman’s (1971b) discussion of inference about the mean of a second order stationary time series extends easily to the case at hand. Although (4.1) does not presuppose stationarity of { Y j } j , o , one expects the approximation to be better, for fixed n, when the process is nearly stationary than when it is not. Thus, in practice, it is advisable to disregard the initial, grossly transient, segment of the process. This amounts to considering a shifted process { Y N + j } j r O , to which our assumptions are equally applicable, and which approaches stationarity as N-P 00 according to Theorem 7.1. The purpose of this section is to display a suitable estimator s and to estimate its mean square error. From among a number of closely related possibilities, we single out the truncated estimator
s = Σ_{u=−t}^{t} ρ̂_u   (4.2)

[see (3.1) for ρ̂_u] on the basis of its simplicity and small bias. The truncation point t = t_n < n must be chosen judiciously. The danger of choosing t too large is amply illustrated by the fact that, if t = n − 1,

s = n^{−1} (Σ_{j=0}^{n−1} (Y_j − Ȳ))² = 0.

To achieve consistency, t should be small in comparison to n.

THEOREM 4.1. s → σ² in quadratic mean if t → ∞ and t/n → 0.
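As a numerical illustration, here is a sketch in Python (ours, not the text's). It assumes, consistently with the degeneracy at t = n − 1 noted above, that s sums the mean-centered sample autocovariances ρ̂_u over |u| ≤ t, and tries the estimator on a two-state chain whose σ² is known in closed form.

```python
import numpy as np

def rho_hat(y, u):
    """Sample autocovariance at lag u >= 0, centered at the sample mean (divisor n)."""
    n = len(y)
    yc = y - y.mean()
    return float(np.dot(yc[:n - u], yc[u:]) / n)

def s_truncated(y, t):
    """Truncated estimator: s = sum over |u| <= t of rho_hat_u."""
    return rho_hat(y, 0) + 2.0 * sum(rho_hat(y, u) for u in range(1, t + 1))

# Y_j = X_j for a two-state chain that switches state with probability q.
# Here rho_u = (1/4) r^|u| with r = 1 - 2q, so sigma^2 = (1/4)(1 + 2r/(1-r)).
rng = np.random.default_rng(0)
q, n = 0.3, 200_000
x = np.concatenate(([0], np.cumsum(rng.random(n - 1) < q) % 2))
y = x.astype(float)

r = 1.0 - 2.0 * q
sigma2 = 0.25 * (1.0 + 2.0 * r / (1.0 - r))
c = 0.5 * np.log(1.0 / r)                       # some 0 < c < ln(1/r), cf. (4.3)
t = int(round(float(np.log(n)) / (2.0 * c)))    # t ~ (1/2c) ln n, cf. (4.4)
print(t, s_truncated(y, t), sigma2)             # estimate should be near sigma^2
```

Note that `s_truncated(y[:50], 49)` vanishes up to rounding, which is exactly the t = n − 1 degeneracy mentioned above.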
THEOREM 4.2. If

0 < c < ln(1/r(V))   (4.3)

and

t − (1/2c) ln n = O(1),   (4.4)

then

(n/ln n) E((s − σ²)²) → 2σ⁴/c.   (4.5)

For r(V), see (2.2.4). Neither theorem requires σ² > 0, but, if this condition holds, (4.5) can be rewritten in the instructive form

E(((s/σ²) − 1)²) ~ 4t/n.
As we will see in Section 5.7, s is an example of a spectral estimator, and our proofs of the above theorems differ from arguments familiar in the theory of such estimators (Hannan, 1970, Chapter V) only in the simplifications attendant to this special case and the slight complications due to nonstationarity. A result concerning spectral estimators due to Parzen (1958, Eq. (5.6)) suggests that no estimator of σ² has mean square error of smaller order of magnitude than (ln n)/n under our assumptions.

Proofs.
As usual, we can assume that U∞f = 0. Clearly

s − σ² = Δ + ε − δ₁ − δ₂,   (4.6)

where

[for ρ_u* and ε, see (3.2), (3.3), and (3.4)]

and

We postpone until later the proof that

(n/t) E(Δ²) → 4σ⁴   (4.7)

as t → ∞ and t/n → 0. As a consequence of (3.9),
Thus

E(ε²) = o(t/n)   (4.8)

if t/n → 0. Clearly

hence

δ₁² = o(t/n).   (4.9)

Also

so that

δ₂² = O(α^{2t})   (4.10)

as n → ∞.
Theorem 4.1 follows immediately from (4.6) and (4.7)-(4.10). If t satisfies (4.4), where α = e^{−c}, then (4.10) yields

δ₂² = o(t/n).   (4.11)

Using this in conjunction with (4.7)-(4.9) and (4.6), we get

lim (n/t) E((s − σ²)²) = lim (n/t) E(Δ²) = 4σ⁴,

and (4.5) follows on noting that t ~ (1/2c) ln n. Inequalities (4.3) are equivalent to 1 > α > r(V). ∎

The remainder of the section is devoted to the proof of (4.7). Since
(4.12)
where
Also,

(4.13)

where

(4.14)

and
(4.15)

by (2.8). Combining (4.13), (4.14), and (4.15), we obtain

E((Y_iY_j − ρ_u)(Y_kY_l − ρ_v)) = ρ_{k−i}ρ_{l−j} + ρ_{l−i}ρ_{k−j} + q + O*.
In view of (4.12) it suffices to establish the following points:

(1/nt) ΣΣ ρ_{k−i}ρ_{l−j} = (1/nt) ΣΣ ρ_{l−i}ρ_{k−j} → 2σ⁴,   (4.16)

(1/nt) ΣΣ q → 0,   (4.17)

and (1/nt) ΣΣ O* → 0 as t → ∞ and t/n → 0. The last of these presents no difficulty, since

= O(1)O(n) = O(n)

and the other component of O* is similarly bounded.

Proof of (4.17). Clearly
According to (4.14), q is the difference between two terms, each of which is invariant under permutations of i, j, k, and l, so q also has this property.
Thus

|ΣΣ q| ≤ 4! Σ' |q|,   (4.18)

where Σ' is the sum over 0 ≤ i ≤ j ≤ k ≤ l ≤ n − 1. Over this range,

ρ_{k−i}ρ_{l−j} = O(α^{(k−i)+(l−j)})   (4.19)

and

ρ_{l−i}ρ_{k−j} = O(α^{l−i})   (4.20)

by (2.9). The equation preceding (2.6) gives

Q = E(Y_i Y_j)ρ_v + E(Y_i g(X_j)),   (4.21)

where g = f V^{k−j}(f V^{l−k} f).
The first term on the right in (4.21) is

ρ_u ρ_v + O(α^{j+l−k}),

as a consequence of (2.8) and (2.9), while the second is

E(Y_i U^{j−i} g(X_i)) = E(Y_i) U∞g + E(Y_i V^{j−i} g(X_i)) = O(α^{i+l−k} + α^{j−i}).

Thus

Q − ρ_u ρ_v = O(α^{j+l−k} + α^{i+l−k} + α^{j−i}).

Substituting this estimate, (4.19), and (4.20) into (4.14), we obtain a bound on |q| by a finite sum of powers of α, and the sum Σ' of each of these powers of α is O(n). Hence Σ'|q| is O(n) and, by (4.18), (4.17) follows.
Proof of (4.16). The equality in (4.16) is obtained by interchanging k and l and replacing v by −v in ΣΣ. The inner sum of the term on the left is

Σ ρ_{k−i} ρ_{k−i+u−v},

the sum extending over i and k with 0 ≤ i, i+v, k, k+u ≤ n − 1. If k is replaced by i + d, this reduces to
where M_n(u, v, d) is the number of i (possibly 0) such that 0 ≤ i, i+v, i+d, i+d+u ≤ n − 1. It is not difficult to show that

M_n(u, v, d) = max(0, n − B),   (4.22)

where

B = max(0, v, d, d+u) − min(0, v, d, d+u).
Introducing I(y) = 1 for |y| ≤ 1 and I(y) = 0 otherwise, we can write

(1/nt) ΣΣ ρ_{k−i}ρ_{l−j} = (1/t) Σ_{u,v,d=−∞}^{∞} (M_n(u, v, d)/n) ρ_d ρ_{d+v−u} I(u/t) I(v/t).
Under the further change of variables v = u + w, this becomes

(1/nt) ΣΣ ρ_{k−i}ρ_{l−j} = Σ_{d,w=−∞}^{∞} ρ_d ρ_{d+w} F_n(d, w),   (4.23)

where

F_n(d, w) = ∫_{−∞}^{∞} A_n(y) dy   (4.24)
and A_n(y) is the summand evaluated at u = [yt]. It follows easily from (4.22) that the factor M_n/n of A_n(y) converges to 1 as n → ∞ and t/n → 0. If |y| ≠ 1, the other factor converges to I²(y) = I(y) as t → ∞. Thus

A_n(y) → I(y),

except on a set of Lebesgue measure 0, as t → ∞ and t/n → 0. Furthermore, 0 ≤ A_n(y) ≤ 1 for all n and y, and A_n(y) = 0 if |y| > 2 (for then |u/t| > 1). Thus, by the dominated convergence theorem,

F_n(d, w) → ∫_{−∞}^{∞} I(y) dy = 2

as t → ∞ and t/n → 0, for every d and w.
Since 0 ≤ M_n ≤ n, it is clear from (4.24) that

|F_n(d, w)| ≤ (2t+1)/t ≤ 3.
Also
Thus the dominated convergence theorem can be applied to the right-hand side of (4.23) to obtain the limit

2 Σ_{d,w=−∞}^{∞} ρ_d ρ_{d+w} = 2σ⁴

given in (4.16). This completes the proof of (4.7). ∎

5.5. A Representation of σ²
To obtain (4.1), it was assumed that σ² > 0, so it is useful to have simple criteria for positivity of this constant. Since

|ρ_u| ≤ ρ_0

for all u, positivity of ρ_0 is certainly necessary for positivity of σ², but, in general, it is not sufficient. Theorem 5.1 gives a representation of σ² that permits us to show (Theorem 5.2) that ρ_0 > 0 implies σ² > 0 for the important special case of indicator functions f = I_B of B ∈ ℬ, provided, of course, that I_B ∈ L. It is anticipated that Theorem 5.1 will be useful for establishing positivity of σ² for other functions f ∈ L. For any g ∈ L, let

J(g)(x) = ∫ K(x, dy)(g(y) − Ug(x))² = Ug²(x) − (Ug)²(x).
Clearly J(g) ∈ L. If f ∈ L with U∞f = 0, let

f* = Σ_{j=0}^{∞} U^j f.

The series converges absolutely in L, and L is complete, so f* ∈ L. Clearly

(I − U)f* = f.   (5.1)

THEOREM 5.1. σ² = U∞J(f*).
This theorem generalizes a formula of Fréchet (1952, p. 84) for finite Markov chains.
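For a finite Markov chain the representation is easy to check numerically. In the sketch below (our notation: the transition matrix P plays the role of U and the stationary vector π that of U∞), f* is obtained from (I − U)f* = f by the standard fundamental-matrix device, and U∞J(f*) is compared with the series of limiting covariances Σ_j ρ_j.

```python
import numpy as np

P = np.array([[0.5, 0.3, 0.2],
              [0.1, 0.6, 0.3],
              [0.2, 0.2, 0.6]])

# stationary distribution: solve pi P = pi with sum(pi) = 1
A = np.vstack([P.T - np.eye(3), np.ones(3)])
pi = np.linalg.lstsq(A, np.array([0.0, 0.0, 0.0, 1.0]), rcond=None)[0]

f = np.array([1.0, -2.0, 3.0])
f = f - pi @ f                          # center so that U^inf f = 0

# f* = sum_{j>=0} P^j f, i.e. the solution of (I - P)f* = f with pi f* = 0;
# I - P is singular, so add the rank-one correction 1 pi^T
fstar = np.linalg.solve(np.eye(3) - P + np.outer(np.ones(3), pi), f)

# J(g) = U g^2 - (U g)^2 at g = f*; Theorem 5.1 gives sigma^2 = pi . J(f*)
J = P @ fstar**2 - (P @ fstar)**2
sigma2_repr = pi @ J

# direct series: sigma^2 = rho_0 + 2 sum_{j>=1} rho_j, rho_j = pi . (f * P^j f)
g, sigma2_series = f.copy(), pi @ (f * f)
for _ in range(400):
    g = P @ g
    sigma2_series += 2.0 * pi @ (f * g)

print(sigma2_repr, sigma2_series)       # the two should agree
```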
Proof.

σ² = Σ_{j=−∞}^{∞} ρ_j = U∞(f Σ_{j=−∞}^{∞} V^{|j|} f),

since the second series converges in L and U∞ is continuous on L. But

Σ_{j=−∞}^{∞} V^{|j|} f = 2f* − f = f* + Uf*

by (5.1), so that

f Σ_{j=−∞}^{∞} V^{|j|} f = (f* − Uf*)(f* + Uf*) = (f*)² − (Uf*)².

Thus

σ² = U∞(f*)² − U∞(Uf*)² = U∞U(f*)² − U∞(Uf*)² = U∞J(f*). ∎
Let ρ_0 and σ² correspond to I_B ∈ L or, equivalently, to f = I_B − b, where b = U∞I_B. Clearly 0 ≤ b ≤ 1 and

ρ_0 = U∞f² = b(1 − b),

so ρ_0 > 0 is equivalent to 0 < b < 1.

THEOREM 5.2. If 0 < b < 1, then σ² > 0.
Proof. Assume σ² = 0. We will show that b ∉ (0, 1). First,

E(|f*(X_{j+1}) − Uf*(X_j)|²) = E(J(f*)(X_j)) = E(U^j J(f*)(X_0)) → σ² = 0

by Theorem 5.1, so

E(|f*(X_{j+1}) − Uf*(X_j)|²) → 0.

Using (5.1) to eliminate Uf* on the left, and taking square roots, we obtain
where Y_j = f(X_j) and Y_j* = f*(X_j). Thus

where

so that

S_{n,n} − (Y_n* − Y_{2n}*) → 0   (5.2)

in probability as n → ∞. For any k, m ≥ 0,
E(Y_n*^k Y_{2n}*^m) = E(U^n(f*^k U^n f*^m)(X_0)) → μ_k μ_m

as n → ∞, where μ_k = U∞f*^k. Since Y_n* = O(1), it follows easily that there is a unique probability μ on the set R of real numbers such that ∫ ξ^k μ(dξ) = μ_k for all k, and the distribution of (Y_n*, Y_{2n}*) converges (weakly) to μ × μ as n → ∞. Therefore the distribution of Y_n* − Y_{2n}* converges to ν, where ν is the distribution of ξ − η when (ξ, η) has distribution μ × μ. By (5.2), the distribution of S_{n,n} also converges to ν. It follows that, for any ε > 0,

liminf_{n→∞} P(|S_{n,n}| < ε) ≥ ν{ξ: |ξ| < ε} = μ × μ{(ξ, η): |ξ − η| < ε} = ∫ μ(dη) μ{ξ: |ξ − η| < ε} = δ.

For any η in the support of μ, the integrand is positive. Thus δ > 0, and

P(|S_{n,n}| < ε) > 0
for n sufficiently large, where

(5.3)

(5.3) can be rewritten

But T_n(ω) is an integer, so b is within 2ε of an integer. Since this holds for all ε > 0, b is an integer, and b ∉ (0, 1). ∎
5.6. Asymptotic Stationarity
If the distribution p_0 of X_0 is stationary, then {X_n}_{n≥0} is a (strictly) stationary process, so {Y_n}_{n≥0} is too. If p_0 is not stationary, we still expect {Y_n}_{n≥0} to be nearly stationary if a sufficient number of observations at the beginning of the process are disregarded. In other words, we expect the shifted process 𝒴^N = {Y_{N+n}}_{n≥0} to approach stationarity as N → ∞. Theorem 6.1 gives a result of this type.

THEOREM 6.1. The finite-dimensional distributions of 𝒴^N converge weakly to those of a stationary process 𝒴^∞ as N → ∞.
The proof is based on the following lemma, special cases of which have been noted and used in preceding sections.

LEMMA 6.1. For any k ≥ 1 and g_0, ..., g_{k−1} ∈ L,

F(x; g_0, ..., g_{k−1}) = E_x(g_0(X_0) g_1(X_1) ··· g_{k−1}(X_{k−1})) ∈ L.

Proof of Lemma 6.1. For k = 1 we get F(·; g_0) = g_0 ∈ L. Suppose, inductively, that the assertion of the lemma holds for some k ≥ 1. Then

F(x; g_0, ..., g_k) = F(x; g_0', ..., g_{k−1}'),   (6.1)

where g_i' = g_i for 0 ≤ i < k − 1, and

g_{k−1}' = g_{k−1} · Ug_k ∈ L.

By hypothesis, the function on the right in (6.1) belongs to L, so the assertion of the lemma holds for k + 1.
Proof of Theorem 6.1. It follows immediately from the lemma that

U^N F(·; g_0, ..., g_{k−1}) → U∞F(·; g_0, ..., g_{k−1})

as N → ∞. For any nonnegative integers m_1, ..., m_K, and distinct nonnegative integers n_1, ..., n_K, we can apply this to k − 1 = max(n_1, ..., n_K) and

g_i = f^{m_j} if i = n_j,   g_i = 1 if i ≠ n_j for all j,

to obtain convergence of

E(Y_{N+n_1}^{m_1} ··· Y_{N+n_K}^{m_K}).

Since the variables Y_n are bounded (by |f|), such convergence of moments implies weak convergence of the joint distribution D^N(n_1, ..., n_K) of Y_{N+n_1}, ..., Y_{N+n_K}. But the distributions D^N(n_1, ..., n_K), for fixed N, are consistent, so the asymptotic distributions D^∞(n_1, ..., n_K) are too. Thus there is a distribution D^∞ on R^∞ with these finite-dimensional distributions (Neveu, 1965, p. 82). Clearly

D^N(n_1 + 1, ..., n_K + 1) = D^{N+1}(n_1, ..., n_K),

so

D^∞(n_1 + 1, ..., n_K + 1) = D^∞(n_1, ..., n_K),

and any process 𝒴^∞ with distribution D^∞ is stationary. ∎
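For a finite chain the moment convergence behind Theorem 6.1 can be watched directly. The sketch below (our two-state example) computes E(Y_N Y_{N+1}) = U^N F(x_0), with F(x) = f(x)·Uf(x) in the spirit of Lemma 6.1, and checks its approach to the stationary value U∞F.

```python
import numpy as np

P = np.array([[0.9, 0.1],
              [0.4, 0.6]])
f = np.array([0.0, 1.0])            # Y_n = f(X_n)

# stationary vector: pi P = pi, sum(pi) = 1
A = np.vstack([P.T - np.eye(2), np.ones(2)])
pi = np.linalg.lstsq(A, np.array([0.0, 0.0, 1.0]), rcond=None)[0]

F = f * (P @ f)                     # F(x) = E_x(Y_0 Y_1)
mu = np.array([1.0, 0.0])           # start in state 0
moments = []
for N in range(60):
    moments.append(mu @ F)          # E(Y_N Y_{N+1}) given X_0 = state 0
    mu = mu @ P

print(moments[0], moments[-1], pi @ F)   # initial, late, and stationary values
```

The convergence is geometric here, at the rate of the subdominant eigenvalue of P.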
5.7. Vector Valued Functions and Spectra

This section sketches some generalizations of the results of previous sections. Proofs are omitted, since they are slight extensions of arguments given earlier.
If l is a positive integer and f_i ∈ L for 1 ≤ i ≤ l, let f = (f_1, ..., f_l)*. Then the limit

P_u = lim_{n→∞} cov(Y_n, Y_{n+u})

exists, P_u = O(α^{|u|}), the matrix

Σ = Σ_{u=−∞}^{∞} P_u   (7.1)

is positive semidefinite, and the distribution of Z_n converges to the multivariate normal distribution with mean 0 and covariance matrix Σ:

Z_n ≈ N(0, Σ).   (7.3)

This multivariate central limit theorem can be proved by applying Theorem 2.3, the univariate case, to λ*f, for each l-vector λ. Just as in Theorem 6.1, the finite dimensional distributions of {Y_{N+n}}_{n≥0} converge to those of a stationary process as N → ∞. In the remainder of the section, it is assumed that L ⊂ B^c(X), rather than B^r(X), and that L is closed under complex conjugation (denoted f̄). If W and Y are complex valued random variables, let

cov(W, Y) = E((W − E(W))(Y − E(Y))¯),

the bar denoting complex conjugation.
For any f, g ∈ L, the limit

ρ_u(f, g) = lim_{n→∞} cov(f(X_n), g(X_{n+u}))

exists, and ρ_u(f, g) = O(α^{|u|}). The j, kth element of the matrix P_u in (7.1) is ρ_u(f_j, f_k). For real λ,
σ(f, g; λ) = (1/2π) Σ_{u=−∞}^{∞} ρ_u(f, g) e^{−iuλ}

is the cross spectral density function of f and g, and σ(f, f; λ) is the spectral density function of f. See Hannan (1970) for a full discussion of spectra and their estimation. Note that

2πσ(f, f; 0) = σ²
is the asymptotic variance in the univariate central limit theorem, while

2πσ(f_j, f_k; 0) = Σ_{jk}

is the j, kth element of the asymptotic covariance matrix of the multivariate central limit theorem (7.3). The quantity

is the finite Fourier transform of f(X_0), ..., f(X_{n−1}). The following relation between the cross spectral density function and finite Fourier transform generalizes Theorem 2.2:
cov(w(f; λ), w(g; λ)) = σ(f, g; λ) + O(1/n).
The natural estimator

where

converges to ρ_u(f, g) in quadratic mean and almost surely as n → ∞. The estimator

which generalizes s in (4.2), converges to σ(f, g; λ) in quadratic mean as t → ∞ and t/n → 0. If t is chosen in accordance with (4.3) and (4.4), and −π < λ ≤ π,

as n → ∞. Let
and, if U∞f = 0,

f_λ = (2π)^{−1/2} Σ_{j=0}^{∞} e^{ijλ} U^j f.

The representation

σ(f, g; λ) = U∞J(f_λ, g_λ)

generalizes Theorem 5.1.
6 Functions of Events

Throughout this chapter ((X, ℬ), (E, 𝒮), p, u) is a random system with complete connections and 𝒵 = X_0, E_0, X_1, E_1, ... is an associated stochastic process. We are often interested in functions g(E_n) of the event E_n or, more generally, in functions g(E_n, ..., E_{n+k−1}) of the events on k successive trials. In the learning context, g(E_n) might describe a subject's response on trial n, while g(E_n, ..., E_{n+k−1}) describes his responses on trials n through n + k − 1. This chapter analyzes such processes.
6.1. The Process X_n' = (E_n, X_{n+1})

Any function of E_n can be regarded as a function of X_n' = (E_n, X_{n+1}) and of X_n'' = (X_n, E_n). In this section we obtain information about bounded functions of E_n by applying the theory of Chapter 5 to functions of the processes 𝒳' = {X_n'}_{n≥0} and 𝒳'' = {X_n''}_{n≥0}, which are Markovian according to Theorems 1.2.2 and 1.2.3. We first consider 𝒳' for distance diminishing models whose state sequences are regular with respect to L(X). Then we give comparable results for 𝒳' and 𝒳'' for learning models in which U is regular with respect to B(X).
Let U be the transition operator for a state sequence 𝒳 of a distance diminishing model, let U' be the transition operator for 𝒳', and let X' = E × X. For f ∈ B(X') (real or complex) let

m'(f) = sup_{e∈E} m(f(e, ·)),

‖f‖' = m'(f) + |f|,

and

L' = {f ∈ B(X'): ‖f‖' < ∞}.
THEOREM 1.1. If U is regular with respect to the norm ‖·‖ on L(X), then 𝒳', U', L', and ‖·‖' satisfy the assumptions of Chapter 5 [i.e., (a), (b), and (c) of Section 5.1 or the corresponding assumptions in the complex case considered in Section 5.7].

It follows that all of the theorems of Chapter 5 are applicable to f(X_n') if f ∈ L'. We now call attention to some interesting subclasses of L'. If f(e, x) = g(e)h(x), where g ∈ B(E) and h ∈ L(X), then f ∈ L'. In fact, |f| = |g||h| and m'(f) = |g|m(h), so ‖f‖' = |g|‖h‖ < ∞. If f = h [i.e., g(e) ≡ 1], then m'(f) = m(h) and ‖f‖' = ‖h‖. If f = g [i.e., h(x) ≡ 1], then m'(f) = 0 and ‖f‖' = |g|. Thus B(E) and L(X) are naturally isometrically embedded in L', and all theorems of Chapter 5 are applicable to g(E_n), g ∈ B(E), and to h(X_n), h ∈ L(X). In the latter case, these results can be obtained more simply by applying the same theorems to 𝒳, U, L(X), and ‖·‖. So the primary interest in Theorem 1.1 is that it gives asymptotic properties of g(E_n).
COROLLARY. All of the results of Chapter 5 are applicable to g(E_n) for g ∈ B(E).

Proof of Theorem 1.1. By Theorem 1.2.2, assumption (a) of Section 5.1 is satisfied. We omit the elementary verification of (b) (and its complex analog). If f ∈ L',

f'(x) − f'(y) = ∫ p(x, de)(f(e, u(x, e)) − f(e, u(y, e))) + ∫ (p(x, de) − p(y, de)) f(e, u(y, e)),

so that (1.2.10) yields

m'(U'f) = m(f') ≤ m'(f) r_1 + 2|f| R_1.

Thus U' is a bounded linear operator on L'. If U is aperiodic with limit U∞, let

U'∞f(e, x) = U∞f'(x).   (1.1)
Then

‖U'^n f − U'∞f‖' = ‖U^{n−1}f' − U∞f'‖

by (1.2.11), so that

‖U'^n f − U'∞f‖' ≤ ‖U^{n−1} − U∞‖ ‖f'‖ ≤ ‖U^{n−1} − U∞‖ ‖U'‖' ‖f‖'.

Thus ‖U'^n − U'∞‖' → 0 as n → ∞. It follows immediately from (1.1) that, if U∞h is constant for each h ∈ L(X), then U'∞f is constant for each f ∈ L'. This completes the verification of (c). Finally, it is clear that, when L' ⊂ B^c(X'), L' is closed under conjugation (with ‖f̄‖' = ‖f‖'). ∎

There are interesting learning models whose transition operators U are regular with respect to the norm |·| on B(X). Finite state models furnish many examples, and Suppes' stimulus sampling model for a continuum of responses is another. Criteria for regularity in this sense are given in the last paragraph of Section 3.7 and in Theorem 4.1.2. According to Theorem 1.2 below, such regularity is inherited by both U' and U''.

THEOREM 1.2. If U is regular on B(X), then the assumptions of Chapter 5 are satisfied by 𝒳', U', L' = B(X'), and ‖·‖' = |·|, and by 𝒳'', U'', L'' = B(X''), and ‖·‖'' = |·|. As a consequence, all results of Chapter 5 are applicable to g(E_n) for g ∈ B(E).
Proof. Only (c) need be checked. If U'∞ is defined by (1.1),

|U'^n f − U'∞f| = |U^{n−1}f' − U∞f'|

by (1.2.11). Thus

|U'^n f − U'∞f| ≤ |U^{n−1} − U∞| |f'| ≤ |U^{n−1} − U∞| |f|,

so |U'^n − U'∞| → 0. Clearly U'∞f is constant. The case of U'' can be treated similarly, using (1.2.12) instead of (1.2.11). ∎
6.2. Unbounded Functions of Several Events

The approach of the last section to limit theorems for functions of the event sequence 𝓔 = {E_n}_{n≥0} was limited to bounded functions of a single event. In this section we obtain a strong law of large numbers and a central limit theorem for functions of several events satisfying suitable integrability conditions. Our approach leans heavily on that of Iosifescu and Theodorescu
(1969, proofs of Theorems 2.2.12 and 2.2.22), though their terminology differs slightly from ours.
DEFINITION 2.1. A random system with complete connections is uniformly aperiodic if there is a function r on X × 𝒮^∞ such that

φ_N = sup_{x∈X} sup_{A∈𝒮^∞} |p_∞(x, S^{−N}A) − r(x, A)| → 0

as N → ∞. If r(x, A) = r(A) does not depend on x, it is uniformly regular.

The probability p_∞(x, S^{−N}·) is the distribution of 𝓔^N = {E_{n+N}}_{n≥0} when X_0 = x a.s. The Vitali-Hahn-Saks theorem (Neveu, 1965, Corollary IV.2.1) ensures that r(x, ·) is a probability on 𝒮^∞, and, clearly, r(x, S^{−1}A) = r(x, A). Thus the sequence 𝓔^∞ = {F_n}_{n≥0} of coordinate functions on E^∞ is a stationary process with respect to r(x, ·). This probability is understood whenever 𝓔^∞ appears below. Theorem 2.1 links uniform aperiodicity and uniform regularity of learning models to aperiodicity and regularity of their transition operators.
THEOREM 2.1. A learning model is uniformly aperiodic (uniformly regular) if U is aperiodic (regular) on B(X). Then r(x, A) = U∞p_∞(x, A) and φ_N = O(α^N), where α < 1. The same conclusions obtain if U is aperiodic or regular on L(X) in a distance diminishing model.

Proof. Let r(x, A) be defined as above. Then

by (1.2.9); hence

|p_∞(x, S^{−N}A) − r(x, A)| ≤ |V^N| |p_∞(·, A)|   if U is aperiodic on B(X),

|p_∞(x, S^{−N}A) − r(x, A)| ≤ ‖V^N‖ ‖p_∞(·, A)‖   if U is aperiodic on L(X),

by (2.2.5). Theorem 2.1.1 says that R_1 < ∞ for a distance diminishing model. The definition of r ensures that it does not depend on its first variable when U is regular. ∎

The strong law of large numbers and central limit theorem given in Theorems 2.3 and 2.4 apply to uniformly regular models. Theorem 2.2 is of a rather different kind and applies to uniformly aperiodic models. Under the hypotheses of Theorem 2.1, φ_N converges to 0 geometrically, so the hypotheses Σφ_N < ∞ and Σφ_N^{1/2} < ∞ of Theorems 2.2 and 2.4, respectively, are certainly satisfied.
For A ∈ 𝒮^∞, let T be the total number of trials N for which 𝓔^N ∈ A, i.e.,

T = Σ_{N=0}^{∞} I_A(𝓔^N),

and let

h(x) = E_x(T).

THEOREM 2.2. If a model is uniformly aperiodic with Σφ_N < ∞, and if r(·, A) = 0, then h is the unique bounded measurable function such that

U^n h(x) → 0 as n → ∞ for all x,   (2.1)

and

h = p_∞(·, A) + Uh.   (2.2)

If the model is distance diminishing, h ∈ L(X). If U is aperiodic on B(X) or L(X), (2.1) is equivalent to U∞h = 0. Thus, for a distance diminishing model, h is the unique bounded Lipschitz function such that U∞h = 0 and (2.2) holds. If (X, d) is compact and U is absorbing according to Definition 3.6.1, the conditions r(·, A) = 0 and U∞g = 0 are equivalent to p_∞(a_i, A) = 0 and g(a_i) = 0 for all absorbing states a_i.
Proof.

h(x) = Σ_{N=0}^{∞} p_∞(x, S^{−N}A),

so h ∈ B(X). [In the distance diminishing case, ‖p_∞(·, S^{−N}A)‖ = O(α^N), so h ∈ L(X).] The series converges uniformly, so

Uh(x) = Σ_{N=0}^{∞} p_∞(x, S^{−(N+1)}A).

Thus Uh = h − p_∞(·, A), which is (2.2), and

U^n h(x) → 0

as n → ∞. If, on the other hand, h' ∈ B(X) satisfies (2.1) and (2.2), let Δ = h − h'.
Then Δ ∈ B(X) and Δ = UΔ by (2.2), so that

Δ(x) = U^nΔ(x) = U^n h(x) − U^n h'(x) → 0

as n → ∞ for all x. Thus Δ = 0; i.e., h = h'. ∎
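In the finite absorbing case the characterization of Theorem 2.2 reduces to a linear system. The following sketch is ours and purely illustrative: b(x) stands for p_∞(x, A), with hypothetical values vanishing at the absorbing state (the analog of r(·, A) = 0), and h is recovered both as the series Σ_N U^N b and by solving h = b + Uh on the transient states.

```python
import numpy as np

P = np.array([[0.5, 0.3, 0.2],
              [0.2, 0.5, 0.3],
              [0.0, 0.0, 1.0]])       # state 2 absorbing

b = np.array([0.4, 0.7, 0.0])        # hypothetical p_inf(x, A); zero at the absorbing state

# series solution h = sum_{N>=0} P^N b (converges since P^N b -> 0)
g, h_series = b.copy(), np.zeros(3)
for _ in range(500):
    h_series += g
    g = P @ g

# linear solve on the transient block: h_T = (I - Q)^{-1} b_T, h(absorbing) = 0
Q = P[:2, :2]
h = np.append(np.linalg.solve(np.eye(2) - Q, b[:2]), 0.0)

print(h, h_series)                   # equal up to series truncation
```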
THEOREM 2.3. If a model is uniformly regular, and if g is 𝒮^k measurable with ∫|g| dr < ∞, then

(1/n)S_n = (1/n) Σ_{j=0}^{n−1} g_j → γ = ∫ g dr

a.s. as n → ∞, where g_j = g(E_j, ..., E_{j+k−1}).
THEOREM 2.4. If a model is uniformly regular with Σφ_N^{1/2} < ∞, and if ∫ g² dr < ∞, then the series

σ² = Σ_{j=−∞}^{∞} ρ_j,   where ρ_j = ∫ g · (g ∘ S^{|j|}) dr − γ²,

converges absolutely, σ² ≥ 0, and

(S_n − nγ)/√n ≈ N(0, σ²)

as n → ∞.
In both theorems the distribution of X_0 is arbitrary. Note that

∫ |g|^β dr ≤ liminf_{j→∞} E(|g_j|^β)

for any β > 0 [see Loève, 1963, (i) of 11.4A], since the distribution of |g_j|^β converges to that of |g(F_0, ..., F_{k−1})|^β as j → ∞. Thus, for ∫|g|^β dr to be finite, it suffices that, for some initial distribution, E(|g_j|^β) have a bounded subsequence.

The proofs of these theorems are based on the lemmas that follow. Lemma 2.1, which requires only uniform aperiodicity, permits us to transpose results from 𝓔^∞ to 𝓔. Part (A) is relevant to Theorem 2.3, part (B) to Theorem 2.4.

LEMMA 2.1. (A) If A is a tail event (A ∈ 𝒯 = ∩_{n=0}^{∞} 𝒮_n, where 𝒮_n = {S^{−n}A: A ∈ 𝒮^∞}), then p_∞(x, A) = r(x, A) for all x ∈ X.

(B) If {a_n}_{n≥0} is a sequence of real numbers, {b_n}_{n≥0} is a sequence of positive numbers with b_n → ∞ as n → ∞,

T_n = Σ_{j=0}^{n−1} g_j*,   g_j* = g(F_j, ..., F_{j+k−1}),
and

d_n(t) = E_x(exp it(S_n b_n^{−1} − a_n)) − E(exp it(T_n b_n^{−1} − a_n)),

then d_n(t) → 0 as n → ∞ for all real t.

Proof. (A) Since A ∈ 𝒮_n, A = S^{−n}B for some B ∈ 𝒮^∞. But r(x, A) = r(x, B), so

|p_∞(x, A) − r(x, A)| = |p_∞(x, S^{−n}B) − r(x, B)| ≤ φ_n.

Since φ_n → 0 as n → ∞, (A) is proved.

(B) Let 0 ≤ m ≤ n. Then

|d_n(t)| ≤ E_x(|exp it(S_m b_n^{−1}) − 1|) + E(|exp it(T_m b_n^{−1}) − 1|) + |E_x(exp it((S_n − S_m) b_n^{−1} − a_n)) − E(exp it((T_n − T_m) b_n^{−1} − a_n))| ≤ c_x(m, n) + d(m, n) + 2φ_m,

by (1.1.4), where c_x(m, n) and d(m, n) → 0 as n → ∞. Thus

limsup_{n→∞} |d_n(t)| ≤ 2φ_m

for all m ≥ 0. This implies (B). ∎

Subsequent lemmas assume uniform regularity.
LEMMA 2.2. The process 𝓔^∞ is mixing, in the sense that, for any j ≥ 1, B ∈ 𝒮^j, and A ∈ 𝒮_{j+N},

|r(B ∩ A) − r(B)r(A)| ≤ φ_N r(B).

Proof. For any C ∈ 𝒮^∞,

P_x(𝓔^{n+N+1} ∈ C | E_n, ..., E_0) = E_x(P(𝓔^{n+N+1} ∈ C | X_{n+1}, E_n, ...) | E_n, ..., E_0)
= E_x(p_∞(X_{n+1}, S^{−N}C) | E_n, ..., E_0)
= p_∞(u(x, E_0, ..., E_n), S^{−N}C)   a.s.

Thus

|P_x(𝓔^{n+N+1} ∈ C | E_n, ..., E_0) − r(C)| ≤ φ_N   a.s.

If A = S^{−(j+N)}C, B ∈ 𝒮^j, and m = n − j + 1 ≥ 0, it follows that

The desired inequality is then obtained by letting n (and hence m) approach infinity on the extreme right and left. ∎
LEMMA 2.3. The tail field 𝒯 is trivial according to r.

Proof. Suppose that A ∈ 𝒯 and B ∈ 𝒮^j. Then A ∈ 𝒮_{j+N}, so, by Lemma 2.2,

|r(B ∩ A) − r(B)r(A)| ≤ φ_N r(B) → 0

as N → ∞. Thus

r(B ∩ A) = r(B)r(A).   (2.3)

But there is a sequence B_i ∈ ∪_j 𝒮^j such that r(A Δ B_i) → 0 as i → ∞; hence r(B_i) → r(A) and r(B_i ∩ A) → r(A). Thus, applying (2.3) to B_i and letting i → ∞, we obtain r(A) = r²(A); i.e., r(A) = 0 or 1. ∎
Conclusion of proof of Theorem 2.3. The process {g_n*}_{n≥0} is strictly stationary, and E(|g_n*|) < ∞. If A is an invariant event for this process, it is a tail event for this process (Breiman, 1968, Proposition 6.32), hence for 𝓔^∞. Thus, by Lemma 2.3, r(A) = 0 or 1, and {g_n*}_{n≥0} is ergodic. Therefore, the ergodic theorem for stationary processes (Breiman, 1968, Theorem 6.28) implies that T_n/n → γ with r probability 1. But the event T_n/n → γ (call it A) is a tail event for 𝓔^∞, so, by (A) of Lemma 2.1,

P_x(S_n/n → γ) = p_∞(x, A) = r(x, A) = r(A) = 1
for all x ∈ X. The same result for an arbitrary initial distribution is obtained by conditioning on X_0. ∎

Conclusion of proof of Theorem 2.4. There is a stationary process {g_n'}_{n=−∞}^{∞} such that {g_n'}_{n=0}^{∞} and {g_n*}_{n=0}^{∞} have the same distribution. It follows easily from Lemma 2.2 that g_n' satisfies condition (I) of Ibragimov (1962), with

φ(n) = φ_{n−k} for n ≥ k,   φ(n) = 1 for n < k.

Since Σ φ(n)^{1/2} < ∞, E(g_n') = γ, and E(g_n'²) = ∫ g² dr < ∞, Ibragimov's Theorem 1.5 (see also the first paragraph of its proof) is applicable to x_i = g_i' − γ, and yields all conclusions of Theorem 2.4, but with T_n in place of S_n. The latter difference is unessential according to (B) of Lemma 2.1. ∎
Part II SLOW LEARNING
7 Introduction to Slow Learning
7.1. Two Kinds of Slow Learning
Consider a learning model ((X, ℬ), (E, 𝒮), p, u) with state sequence X_n. There are at least two different types of slow learning. If the probability

P(X_{n+1} ≠ X_n | X_n = x) = p(x, {e: u(x, e) ≠ x})

of any change in x is small, we say that learning occurs with small probability. If, on the other hand, |ΔX_n| is small or, alternatively, |u(x, e) − x| is small, then we say that learning occurs by small steps. Note that the formulation of the latter notion requires that X be a subset of a normed linear space. In fact, all of our results on learning by small steps pertain to subsets of finite-dimensional Euclidean spaces. In order to study X_n when P(X_{n+1} ≠ X_n | X_n) or ΔX_n is "small," we consider families of processes X_n^ρ, indexed by a "learning rate" parameter ρ, such that these quantities are of the order of magnitude of ρ. The limiting behavior of X_n^ρ, suitably normalized, as ρ → 0 is then sought. The alternative notations c, θ, and τ are used for ρ, the first in connection with small probability and the last two for small steps.
Learning with small probability is normally attributable to large probability of events with identity operators [u(x, e) = x for all x]. In the five-operator models introduced in Section 0.1, the last coordinate of e can be taken to be a variable k whose value (1 or 0) indicates whether conditioning is effective or ineffective (C_1 or C_0). An identity operator applies whenever conditioning is ineffective. The possibility of ineffective conditioning can be introduced into any model ((X, ℬ), (E, 𝒮), p, u) as follows. Let X' = X, E' = E × {0, 1},

u'(x, (e, 1)) = u(x, e),   u'(x, (e, 0)) = x,

and

p'(x, de × {k}) = p(x, de) c_e   if k = 1,
p'(x, de × {k}) = p(x, de)(1 − c_e)   if k = 0,

where c_e is a measurable mapping of E into [0, 1], which represents the probability of effective conditioning given e. To study small values of c_e, we assume that c_e = c d_e, where 0 ≤ c, d_e ≤ 1. The function d_e is regarded as fixed, and limits are taken as c → 0. If K_c is the transition kernel for state sequences in the model described above, then

K_c(x, B) = (1 − c) δ_x(B) + cK_1(x, B).   (1.1)
[Proof: By definition K,(x, B ) = p’(x, B*), where B* = {el: u’(x,e’)E B } . But
B* n {el: k = I } = { e : u(x,e) E B } x (1) and
B*n(e’:k=O}= Thus
u x ,B) =
1
if x E’ B , Ex{O}
if X E B .
p(X,de)c, + a,(@
u(x.e)~B
from which linear dependence on c is apparent.
I]
Equation (1.1) is the starting point for the theory of learning with small probability presented in the next section. Learning by small steps usually involves step size parameters θ_e that are analogous to the step probability parameters c_e. Such parameters may be
introduced into a learning model ((X, ℬ), (E, 𝒮), p, u) by defining new event operators u'(·, e) as follows:

u'(x, e) − x = θ_e(u(x, e) − x)

or, equivalently,

u'(x, e) = (1 − θ_e)x + θ_e u(x, e).

This operator maps into X if X is a full linear space, or if X is convex and 0 ≤ θ_e ≤ 1. For examples, we turn to five-operator models, where the event e is of the form A_i O_j C_k or, more compactly, (i, j, k). The state space [0, 1] for the linear model is convex, and we may take θ_e = kθ_{ij} and u(x, e) = j for all x if k = 1. The state space (−∞, ∞) of the additive model is linear, and we may take θ_e = kb_{ij} and u(x, e) − x = 1 for all x if k = 1. To study small values of θ_e, it is assumed that θ_e = θη_e, where θ > 0 (and η_e ≥ 0 if X is only convex). The function η_e is fixed, and small values of θ are considered. If X_n^θ is a state sequence in such a model, it follows that the conditional distribution, given X_n^θ = x, of the normalized increment H_n^θ = ΔX_n^θ/θ does not depend on θ:
ℒ(H_n^θ | X_n^θ = x) = Π(x).   (1.2)

For

P(H_n^θ ∈ A | X_n^θ = x) = p(x, {e: (u'(x, e) − x)/θ ∈ A}) = p(x, {e: η_e(u(x, e) − x) ∈ A}).
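The θ-independence in (1.2) can be seen at a glance in code. The sketch below (ours; a hypothetical two-event linear model with η_e ≡ 1 and limit points λ_e) checks that the normalized increment (u'(x, e) − x)/θ equals u(x, e) − x for every θ.

```python
# slowed operator u'(x, e) = (1 - theta) x + theta u(x, e) for the linear model,
# where u(x, e) = lambda_e drives the state toward the limit point of event e
lambdas = {0: 0.0, 1: 1.0}          # hypothetical event limit points

def u(x, e):
    return lambdas[e]

def u_slow(x, e, theta):
    return (1.0 - theta) * x + theta * u(x, e)

x = 0.3
for e in (0, 1):
    increments = [(u_slow(x, e, th) - x) / th for th in (0.5, 0.1, 0.001)]
    # all entries equal u(x, e) - x, independently of theta
    print(e, increments)
```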
The theory of learning by small steps presented below encompasses slightly more general dependence on θ than the prototypical case (1.2). The need for greater generality is seen by noting that not only u but also X and p depend on θ = s/N in the fixed sample size model. Furthermore, it is often convenient to reformulate a model in terms of a transformed state variable h = h(x) [e.g., p = p(x) in additive models], and the dependence of the transformed state sequence on θ may be slightly more complex than (1.2). Conditions (1.1) and (1.2) refer only to state sequences, and this is characteristic of other conditions to be given later. Thus it is natural to frame the theory of slow learning as a collection of limit theorems for certain families of Markov processes. Applications to various special models then appear as implications of rather general results.

7.2. Small Probability

Let K be any transition kernel in a measurable space (X, ℬ) and, for any 0 < c < 1, let K_c = (1 − c)δ_x + cK. We are interested in the limiting behavior of K_c^{(n)} as c → 0. The following construction shows us what to expect.
If {X_j}_{j≥0} is a Markov process with kernel K, {Y_k}_{k≥1} is a sequence of random variables with P(Y_k = 1) = c and P(Y_k = 0) = 1 − c that are independent of each other and of {X_j}, and S(n) = Σ_{k=1}^{n} Y_k, then X_n^c = X_{S(n)} is Markovian with kernel K_c. Let N(t), t ≥ 0, be a Poisson process with mean interevent time 1 that is independent of {X_j}. As c → 0 and nc → t, the binomial distribution B(n, c) of S(n) converges to the Poisson distribution P(t) of N(t). Thus it is to be expected that the distribution K_c^{(n)}(x, ·) of X_n^c converges to the distribution L_t(x, ·) of X_{N(t)}. Theorem 2.1 shows somewhat more than this. Clearly

K_c^{(n)}(x, B) = Σ_{j=0}^{n} b_j K^{(j)}(x, B)   (2.1)

and

L_t(x, B) = Σ_{j=0}^{∞} p_j K^{(j)}(x, B),   (2.2)

where

b_j = C(n, j) c^j (1 − c)^{n−j}   and   p_j = e^{−t}(t^j/j!).

The limiting Markov process X_{N(t)} is of pseudo-Poisson type (Feller, 1971, p. 322).
THEOREM 2.1. For all x ∈ X, B ∈ ℬ, 0 < c < 1, and n ≥ 0,

|K_c^{(n)}(x, B) − L_{nc}(x, B)| ≤ nc(1 − e^{−c}).   (2.3)

Proof. By (2.1) and (2.2),

K_c^{(n)}(x, B) − L_{nc}(x, B) = Σ_{j=0}^{∞} (b_j − p_j) K^{(j)}(x, B)

(b_j = 0 for j > n), thus

K_c^{(n)}(x, B) − L_{nc}(x, B) ≤ Σ_{j: b_j > p_j} (b_j − p_j) = |B(n, c) − P(nc)|/2.

Replacing B by its complement we obtain

|K_c^{(n)}(x, B) − L_{nc}(x, B)| ≤ |B(n, c) − P(nc)|/2.   (2.4)

So (2.3) follows from the form of the Poisson theorem given by Lemma 2.1. ∎
LEMMA 2.1. |B(n, c) − P(nc)|/2 ≤ nc(1 − e^{−c}).

Proof. If P_i and Q_i are any Borel probabilities on the line,

|P_1 * P_2 − Q_1 * Q_2| ≤ |P_1 − Q_1| + |P_2 − Q_2|,   (2.5)

where * denotes convolution. For

P_1 * P_2 − Q_1 * Q_2 = P_1 * (P_2 − Q_2) + (P_1 − Q_1) * Q_2,   (2.6)

and

|P_1 * (P_2 − Q_2)(B)| ≤ sup_x |(P_2 − Q_2)(B − x)| ≤ |P_2 − Q_2|/2,

so that

|P_1 * (P_2 − Q_2)| ≤ |P_2 − Q_2|.

Using (2.5) inductively we obtain

|P^{*n} − Q^{*n}| ≤ n|P − Q|.

Since B(1, c)^{*n} = B(n, c) and P(c)^{*n} = P(nc), this yields

|B(n, c) − P(nc)| ≤ n|B(1, c) − P(c)|.

But

|B(1, c) − P(c)|/2 = Σ_{j: b_j > p_j} (b_j − p_j),

where b_j = B(1, c)({j}) and p_j = P(c)({j}), and

b_0 = 1 − c ≤ e^{−c} = p_0,

so

|B(1, c) − P(c)|/2 = b_1 − p_1 = c(1 − e^{−c}). ∎
The bounds in (2.3) and (2.5) are useful only when n is not too large relative to 1/c, as, for example, when nc is bounded or converges to t < ∞. Using the fact that the distribution functions of B(n, c) and P(nc) both approximate the normal distribution function with mean and variance nc when nc is large and c is small, it can be shown that

sup_{n≥0} |B(n, c) − P(nc)| → 0

as c → 0. Thus, according to (2.4),

sup_{n≥0, x∈X, B∈ℬ} |K_c^{(n)}(x, B) − L_{nc}(x, B)| → 0

as c → 0. For explicit bounds, see Vervaat (1969).
7.3. Small Steps: Heuristics

The assumptions of the theory of learning by small steps will be spelled out in detail in subsequent chapters, but enough was said in Section 7.1 to support an informal discussion of the kinds of results to be obtained in Chapters 8 and 9. Let {X_n^θ}_{n≥0} be a state sequence corresponding to the value θ of the learning rate parameter. In general, X_n^θ is an N-dimensional random (column) vector. We do not work directly with X_n^θ in Chapter 8 but, rather, with a linear normalization

Z_n^θ = (X_n^θ − f(nθ))/√θ,

where the function f is defined in (B) of Theorem 8.1.1. In Chapter 9, no spatial normalization is necessary. The time scales in Chapters 8 and 9 must also be handled differently. Let τ = θ and Y_n^τ = Z_n^θ in the former case, and let τ = θ² and Y_n^τ = X_n^θ in the latter. Just as in the case of learning with small probability, we will obtain approximations to ℒ(Y_n^τ) (the distribution of Y_n^τ) by distributions of the form ℒ(Y(nτ)), where Y(t), t ≥ 0, is a Markov process with continuous time parameter. However, unlike the pseudo-Poisson limit in the previous case, the process Y(t) is a diffusion; i.e., it has continuous sample paths. This reflects the basic assumption that ΔX_n^θ is small when θ is small. We will see in subsequent chapters that the variables Y_n^τ satisfy the fundamental equations
z." = ( X , , @ - f ( n W J B , where the function f is defined in (B) of Theorem 8.1.1. In Chapter 9, no spatial normalization is necessary. The time scales in Chapters 8 and 9 must also be handled differently. Let = 8 and Y; = Z,,@in the former case, and let 7 = 0 2 and Y;=X," in the latter. Just as in the case of learning with small probability, we will obtain approximations to Y (Y;) (the distribution of Y;) by distributions of the form Y ( Y ( n r ) ) ,where Y(t), t 2 0, is a Markov process with continuous time parameter. However, unlike the pseudoPoisson limit in the previous case, the process Y(t) is a diftiion; i.e., it has continuous sample paths. This reflects the basic assumption that AX,,@ is small when 8 is small. We will see in subsequent chapters that the variables Y; satisfy the fundamental equations
E(ΔY_n^τ | Y_n^τ = y) = τa(y, n, τ) + o(τ),
E((ΔY_n^τ)² | Y_n^τ = y) = τb(y, n, τ) + o(τ),    (3.1)
and
E(|ΔY_n^τ|³ | Y_n^τ = y) = o(τ),

where y² = yy* and |y|² = y*y (* indicates transposition), and

a(y, n, τ) → a(y, t)  and  b(y, n, τ) → b(y, t)    (3.2)
as τ → 0 and nτ → t. For any t and τ, let n = [t/τ] and Y^τ(t) = Y_n^τ. Then Y^τ(t+τ) = Y_{n+1}^τ, so (3.1) can be rewritten

τ⁻¹E(Y^τ(t+τ) − Y^τ(t) | Y^τ(t) = y) = a(y, n, τ) + o(1),
τ⁻¹E((Y^τ(t+τ) − Y^τ(t))² | Y^τ(t) = y) = b(y, n, τ) + o(1),
and
τ⁻¹E(|Y^τ(t+τ) − Y^τ(t)|³ | Y^τ(t) = y) = o(1).    (3.3)
Now nτ → t as τ → 0, so (3.2) applies and (3.3) suggests that

ℒ(Y^τ(t)) → ℒ(Y(t))    (3.4)

as τ → 0, where Y(t) is a diffusion such that

lim_{τ→0} τ⁻¹E(Y(t+τ) − Y(t) | Y(t) = y) = a(y, t),
lim_{τ→0} τ⁻¹E((Y(t+τ) − Y(t))² | Y(t) = y) = b(y, t),    (3.5)

and

lim_{τ→0} τ⁻¹E(|Y(t+τ) − Y(t)|³ | Y(t) = y) = 0.
More generally, we might expect convergence of the higher-dimensional distributions

ℒ(Y^τ(t_1), …, Y^τ(t_k)) → ℒ(Y(t_1), …, Y(t_k)),

where 0 < t_1 < t_2 < … < t_k, or even convergence of the distribution, over a suitable function space, of entire trajectories of the process,

ℒ(Y^τ(t), t ≤ T) → ℒ(Y(t), t ≤ T)

as τ → 0. If, instead of Y^τ(t), we consider the random polygonal curve with vertices Ȳ^τ(nτ) = Y_n^τ, i.e.,

Ȳ^τ(t) = (((n+1)τ − t)/τ) Y_n^τ + ((t − nτ)/τ) Y_{n+1}^τ

for nτ ≤ t ≤ (n+1)τ, the natural path space is the set C([0, T]) of continuous (vector valued) functions on [0, T]. For the sake of simplicity, we will focus our attention on the distribution at a single value of t in the chapters that follow.

The importance of this circle of ideas for us is that we shall now not be surprised to find that ℒ(Y^τ(t)) converges when (3.1) and certain auxiliary conditions obtain. The characterization of the limit as ℒ(Y(t)), while intuitively satisfying, is not always convenient for proving convergence, so alternative descriptions of the limit will be used later. Furthermore, plausible though the transition from (3.1) to (3.4) may be, the theorems currently available to justify it (e.g., Theorem 1 on p. 460 of Gikhman and Skorokhod, 1969) do not cover the cases considered in subsequent chapters. Thus we will have to provide our own theorems of this type.
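The scaling in (3.1) and (3.3) can be seen in a toy one-dimensional example (my illustration, not from the text): take the two-point increment ΔY = τa ± √τ σ, each sign with probability ½. The first two conditional moments, divided by τ, stay of order one, while the normalized third absolute moment vanishes like √τ:

```python
# Toy increment: dY = tau*a + sqrt(tau)*sigma*xi, with xi = +1 or -1, each w.p. 1/2.
# All conditional moments are computed exactly over the two outcomes.
def moments(tau, a=0.7, sigma=1.3):
    outcomes = [tau * a + tau**0.5 * sigma, tau * a - tau**0.5 * sigma]
    m1 = sum(outcomes) / 2                        # E(dY)   = tau*a
    m2 = sum(d * d for d in outcomes) / 2         # E(dY^2) = tau*sigma^2 + tau^2*a^2
    m3 = sum(abs(d) ** 3 for d in outcomes) / 2   # E|dY|^3 = O(tau^(3/2))
    return m1 / tau, m2 / tau, m3 / tau

for tau in (1e-2, 1e-4, 1e-6):
    r1, r2, r3 = moments(tau)
    print(tau, r1, r2, r3)  # r1 stays at a, r2 tends to sigma^2, r3 tends to 0
```

This is exactly the pattern (3.3) requires of Y^τ: drift and squared-increment rates converge, the third-moment rate is o(1).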
8. Transient Behavior in the Case of Large Drift
8.1. A General Central Limit Theorem

We now introduce the notation and state precisely the assumptions of Theorem 1.1, which is a general central limit theorem for learning by small steps. Let J be a bounded set of positive real numbers with inf J = 0. Let N be a positive integer, and let R^N be the set of N-tuples of real numbers, regarded as column vectors. For every θ ∈ J, {X_n^θ}_{n≥0} is a Markov process with stationary transition probabilities and state space a subset I_θ of R^N. Let I be the smallest closed convex set including all I_θ, θ ∈ J. Let H_n^θ be the normalized increment of X_n^θ,
H," = AX,"/B , and let w ( x , O), S ( x , 0), s(x, 0), and r ( x , 0) be, respectively, its conditional mean vector, cross moment matrix, covariance matrix, and absolute third moment, given X," = x : w ( x , e ) = E(H;IX," = x ) , s ( x , e)
=
E ( ( H , B ) ~ I X=. BXI,
s ( x , e) = E((H: - w ( x , e))zIx: =
e)
= s ( ~ , - w2(x, e) ,
and 116
Here x² = xx* and |x|² = x*x for x ∈ R^N, where * indicates transposition. For any N × N matrix A, let |A| denote its norm.
We assume that I_θ approximates I as θ → 0 in the sense that, for any x ∈ I,

(a.1) lim_{θ→0} inf_{y∈I_θ} |x − y| = 0.
Next we suppose that there are functions w(x) and s(x) on I that approximate w(x, θ) and s(x, θ) when θ is small, and that, in the former case, the error is O(θ):

(a.2) sup_{x∈I_θ} |w(x, θ) − w(x)| = O(θ)

and

(a.3) ε_θ = sup_{x∈I_θ} |s(x, θ) − s(x)| → 0

as θ → 0. Let

S(x) = s(x) + w²(x).
The function w is assumed to be differentiable, in the sense that there is an N × N matrix valued function w′ on I such that

(b.1) lim_{y→x, y∈I} |w(y) − w(x) − w′(x)(y − x)| / |y − x| = 0
for all x ∈ I. We assume that w′(x) is bounded,

(b.2) α = sup_{x∈I} |w′(x)| < ∞,

and that w′(x) and s(x) satisfy Lipschitz conditions,

(b.3) β = sup_{x,y∈I, x≠y} |w′(x) − w′(y)| / |x − y| < ∞

and

(b.4) sup_{x,y∈I, x≠y} |s(x) − s(y)| / |x − y| < ∞.
Finally, we suppose that r(x, θ) is bounded:

(c) r = sup_{θ∈J, x∈I_θ} r(x, θ) < ∞.
As a consequence of (c), |w(x, θ)| is bounded, so that the drift E(ΔX_n^θ | X_n^θ = x) is of the order of magnitude of θ. The possibility

E(ΔX_n^θ | X_n^θ = x) = O(θ²)

is not precluded. We refer to this as the case of small drift, while the general case is called the case of large drift. Note that w(x, θ) = O(θ), so, by (a.2), w(x) = 0, when the drift is small.

We can now survey briefly the main results of this and the next three chapters. Theorem 1.1 below describes the behavior of ℒ(X_n^θ) under the above assumptions, when θ is small and n is not too large, specifically, n = O(1/θ). Such a restriction on the size of n means that the theorem pertains to the transient behavior of the process {X_n^θ}_{n≥0}. Theorem 9.1.1 gives an approximation to ℒ(X_n^θ) that is valid for larger values of n, n = O(1/θ²), in certain families of processes with small drift. The final chapters of Part II provide small step approximations to the asymptotic behavior of real valued processes. Chapter 10 treats stationary probabilities, and Chapter 11 considers absorption probabilities.

For θ ∈ J and x ∈ I_θ, let

μ_n(θ, x) = E_x(X_n^θ)

and

ω_n(θ, x) = E_x(|X_n^θ − μ_n(θ, x)|²).
THEOREM 1.1. (A) ω_n(θ, x) = O(θ) uniformly in x ∈ I_θ and nθ ≤ T, for any T < ∞.

(B) For any x ∈ I, the differential equation

f′(t) = w(f(t))    (1.1)

has a unique solution f(t) = f(t, x) with f(0) = x. For all t ≥ 0, f(t) ∈ I, and

μ_n(θ, x) − f(nθ, x) = O(θ)    (1.2)

uniformly in x ∈ I_θ and nθ ≤ T.

(C) If x ∈ I, the matrix differential equation

A′(t) = −w′(f(t, x))* A(t)    (1.3)

has a unique solution A(t) = A(t, x) with A(0) the identity matrix, and A(t) is nonsingular. For any θ ∈ J, let {X_n^θ}_{n≥0} have initial state x_θ ∈ I_θ. If x ∈ I, and x_θ → x as θ → 0, then the distribution ℒ(Z_n^θ) of

Z_n^θ = (X_n^θ − f(nθ, x_θ))/√θ

converges to the normal distribution with mean 0 and covariance matrix

g(t, x) = ∫₀ᵗ [A(u)A(t)⁻¹]* s(f(u)) [A(u)A(t)⁻¹] du    (1.4)

as θ → 0 and nθ → t.
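As a toy illustration of the scalings in (A)–(C) (my example, not the author's): for the scalar model ΔX_n = θ(1 − X_n) or −θX_n, each with probability ½, one has w(x) = ½ − x and s(x) = ¼, so f(t) = ½ + (x − ½)e^{−t} and g(t) = (1 − e^{−2t})/8. The mean and variance recursions of this model are exactly computable and can be compared with f and θg:

```python
import math

theta, x0, T = 1e-3, 0.1, 1.0
n = int(T / theta)

# Exact moment recursions for dX = theta*(1-X) or -theta*X, each w.p. 1/2:
#   mean:     mu'  = mu + theta*(1/2 - mu)
#   variance: var' = (1-theta)^2 * var + theta^2/4   (conditional variance of H is 1/4)
mu, var = x0, 0.0
for _ in range(n):
    mu = mu + theta * (0.5 - mu)
    var = (1 - theta) ** 2 * var + theta ** 2 / 4

f = 0.5 + (x0 - 0.5) * math.exp(-T)   # solves f' = w(f), f(0) = x0
g = (1 - math.exp(-2 * T)) / 8        # solves g' = 2*w'(f)*g + s, g(0) = 0
print(mu, f)           # agree to O(theta), as in (1.2)
print(var / theta, g)  # Var(X_n)/theta approximates g(n*theta), as in (A) and (C)
assert abs(mu - f) < 10 * theta
assert abs(var / theta - g) < 0.01
```

The O(θ) agreement of the mean with f and the O(θ) size of the variance are exactly claims (B) and (A); var/θ approaching g(t) is the covariance statement of (C) in this one-dimensional case.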
The function g(t) = g(t, x) can be characterized, like f(t), as the unique solution of a differential equation with a given initial value. In this case, the equation is

g′(t) = w′(f(t)) g(t) + g(t) w′(f(t))* + s(f(t)),

and the initial value is 0.

Part (A) means that there is a constant C_T such that ω_n(θ, x) ≤ C_T θ for all θ ∈ J, x ∈ I_θ, and n ≤ T/θ. In (B), (1.2) has an analogous interpretation. If d is any metric on probabilities over R^N such that P_n → P weakly implies d(P_n, P) → 0 (e.g., Dudley's metric described in Section 2.2), it follows from (C) that

d(ℒ(Z_n^θ), N(0, g(nθ, x))) → 0, uniformly in nθ ≤ T,
as θ → 0. The routine argument, which we omit, is based on the compactness of the interval 0 ≤ t ≤ T and the ball |x| ≤ T, and the joint weak continuity of N(0, g(t, x)) in (t, x). Another corollary of Theorem 1.1 is convergence of the finite dimensional distributions

ℒ(Z_{n_1}^θ, …, Z_{n_k}^θ)

as θ → 0 and n_jθ → t_j. If (c) is replaced by a stronger assumption, it can also be shown that the distribution over C([0, T]) of the random polygonal curve Z^θ(t) with vertices Z^θ(nθ) = Z_n^θ converges weakly to the distribution of a diffusion as θ → 0.

If I_θ = I for all θ, (a.1) is certainly satisfied. And if, in addition, (7.1.2) holds, w(x, θ), s(x, θ), and r(x, θ) do not depend on θ, so (a.2) and (a.3) are satisfied, and the supremum over θ in (c) can be omitted. A simple and important example of this sort is the five-operator linear model with θ_ij = θq_ij, where q_ij ≥ 0 and 0 < θ < (max q_ij)⁻¹. In this case, w(x) and s(x) are polynomials, so that (b) holds, and |H_n^θ| ≤ max q_ij a.s., so (c) does too. Thus Theorem 1.1 is applicable.

Another example satisfying I_θ = I and (7.1.2) is the standard multidimensional central limit theorem. Let W_1, W_2, … be a sequence of independent and identically distributed random vectors, and let

X_n^θ = θ(W_1 + ⋯ + W_n)
(X_0^θ = 0). Then {X_n^θ}_{n≥0} is a Markov process in I = R^N. Since H_n^θ = W_{n+1}, we have

w(x, θ) = E(W_j) = w,
S(x, θ) = E(W_j²) = S,
s(x, θ) = covariance matrix of W_j = s,
and
r(x, θ) = E(|W_j|³) = r.

Thus, assuming r < ∞, the remaining assumption (b) is trivially satisfied. Clearly f(t, 0) = tw and g(t, x) = ts, so (C) with x_θ = x = 0 and θ = 1/n gives

ℒ((W_1 + ⋯ + W_n − nw)/√n) → N(0, s)

as n → ∞. This result falls short of the standard central limit theorem only because the assumption r < ∞ is stronger than necessary. In fact, the Lyapounov condition (c) can be replaced by the Lindeberg condition

lim_{K→∞} sup_{θ∈J, x∈I_θ} E(|H_n^θ|² I_{|H_n^θ| ≥ K} | X_n^θ = x) = 0
without altering the conclusions of Theorem 1.1, and this more general condition is equivalent to E(|W_j|²) < ∞ in the case considered above.

Theorem 1.1 was stated by Norman (1968c, Theorem 3.1), and proved in the special case N = 1, I_θ = I, and ℒ(H_n^θ | X_n^θ = x) depending only on x. Proofs of (A) and (B) are given in Section 8.3, the proof of (C) in Section 8.4. The latter is rather different from the comparable proof (that of Theorem 2.3) in the paper mentioned above.

8.2. Properties of f(t)
A number of arguments in the proofs of (B) and (C) of Theorem 1.1 hinge on properties of f(t) that follow from its characterization as the solution of (1.1). To avoid repeated interruptions of the main development, these properties are collected in this section. The simple proofs given are quite standard.

We consider first w(x). For x, y ∈ I, and 0 ≤ ρ ≤ 1, let h(ρ) = w(x + ρ(y − x)). It follows from (b.1) that

h′(ρ) = w′(x + ρ(y − x))(y − x);

hence, by (b.3), h′ is continuous. Thus the fundamental theorem of calculus gives

w(y) − w(x) = h(1) − h(0) = ∫₀¹ w′(x + ρ(y − x)) dρ (y − x),    (2.1)

† If F ⊂ Ω, I_F = I_F(ω) is 1 or 0, depending on whether or not ω ∈ F.
and (b.2) then implies that w is Lipschitz,

|w(y) − w(x)| ≤ α|y − x|.    (2.2)

Also,

w(y) − w(x) − w′(x)(y − x) = ∫₀¹ (w′(x + ρ(y − x)) − w′(x)) dρ (y − x),

so that

|w(y) − w(x) − w′(x)(y − x)| ≤ (β/2)|y − x|²    (2.3)

as a consequence of (b.3). Now

|w(x, θ)| ≤ r(x, θ)^{1/3} ≤ r^{1/3}    (2.4)

[see (c)], from which it follows, via (a.1) and (a.2), that

|w(x)| ≤ r^{1/3}.    (2.5)
If I = R^N, the method of successive approximations (see, for example, the proof of Theorem 6 in Chapter 6 of Birkhoff and Rota, 1969) can be used in conjunction with (2.2) to construct a solution f(t) = f(t, x) to (1.1) with f(0) = x. The more devious existence proof for general I is given in the next section. Whether or not I = R^N, there is at most one solution. For if f(t) and f̄(t) both satisfy (1.1) and start at x, then

f(t) − f̄(t) = ∫₀ᵗ (w(f(u)) − w(f̄(u))) du,

so that

|f(t) − f̄(t)| ≤ α ∫₀ᵗ |f(u) − f̄(u)| du

by (2.2). Thus f(t) = f̄(t) follows from this form of Gronwall's lemma (Hille, 1969, Corollary 2 of Theorem 1.5.7).

GRONWALL'S LEMMA. If c, k ≥ 0, and h(t) ≥ 0 is continuous, then

h(t) ≤ c + k ∫₀ᵗ h(u) du

for 0 ≤ t ≤ T implies

h(t) ≤ c e^{kt} for 0 ≤ t ≤ T.
Throughout the rest of the section we suppose that, for every x ∈ I, f(t) = f(t, x) is a solution of (1.1) with f(0) = x. For δ > 0,

f(t+δ) − f(t) = ∫_t^{t+δ} w(f(u)) du,

so

|f(t+δ) − f(t)| ≤ δ r^{1/3}    (2.6)

by (2.5). Also,

f(t+δ) − f(t) − δw(f(t)) = ∫_t^{t+δ} (w(f(u)) − w(f(t))) du;

thus

|f(t+δ) − f(t) − δw(f(t))| ≤ α ∫_t^{t+δ} |f(u) − f(t)| du ≤ (αr^{1/3}/2) δ²    (2.7)

by (2.6). Next, note that

f(t, y) − f(t, x) − (y − x) = ∫₀ᵗ (w(f(u, y)) − w(f(u, x))) du.

Therefore,

|f(t, y) − f(t, x)| ≤ |y − x| + α ∫₀ᵗ |f(u, y) − f(u, x)| du,    (2.8)

so that, by Gronwall's lemma,

|f(t, y) − f(t, x)| ≤ e^{αt} |y − x|,    (2.9)

so that

|f(t, y) − f(t, x) − (y − x)| ≤ α ∫₀ᵗ e^{αu} du |y − x| = (e^{αt} − 1)|y − x|.    (2.10)

Note that (e^{αt} − 1)/t is an increasing function of t; hence e^{αt} − 1 ≤ Ct, where C = (e^{αT} − 1)/T, for t ≤ T. From (2.6) and (2.9) we get

|f(t, y) − f(u, x)| ≤ |f(t, y) − f(u, y)| + |f(u, y) − f(u, x)| ≤ r^{1/3}|t − u| + e^{αu}|y − x|,

which converges to 0 as t → u and y → x. Thus f(·, ·) is jointly continuous.
The existence of solutions A(t) = A(t, x) and B(t) = B(t, x) of the linear differential equations (1.3) and

B′(t) = w′(f(t)) B(t)    (2.11)

with A(0) and B(0) the identity matrix can be proved by successive approximations, and, in view of (b.2), the uniqueness proof given above for (1.1) applies here also. Both A(t) and B(t) are invertible; in fact,

B(t)* = A(t)⁻¹,    (2.12)

since

(d/dt)(B(t)* A(t)) = ((d/dt)B(t))* A(t) + B(t)* (d/dt)A(t) = 0

and B(0)* A(0) is the identity matrix. The matrix B(t, x) is the differential at x of f(t, ·), as we will see below. This fact is needed in Section 8.5. Clearly,

F(t) = f(t, y) − f(t, x) − B(t, x)(y − x)

satisfies

F(t) = ∫₀ᵗ (w(f(u, y)) − w(f(u, x)) − w′(f(u, x)) B(u, x)(y − x)) du.

Thus

|F(t)| ≤ α ∫₀ᵗ |F(u)| du + (β/2) ∫₀ᵗ |f(u, y) − f(u, x)|² du

by (2.3), so that

|F(t)| ≤ α ∫₀ᵗ |F(u)| du + (β/4α)(e^{2αT} − 1)|y − x|²

for t ≤ T by (2.9). Thus Gronwall's lemma gives

|F(t)| ≤ (β/4α)(e^{2αT} − 1)|y − x|² e^{αt}

for t ≤ T. Taking T = t we obtain

|f(t, y) − f(t, x) − B(t, x)(y − x)| ≤ (β/4α) e^{αt} (e^{2αt} − 1)|y − x|².    (2.13)
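The fact behind (2.13) — B(t, x) is the derivative of the flow f(t, ·) — is easy to check numerically for a concrete drift. In this sketch the drift w(x) = −sin x (my choice, so w′(x) = −cos x) is integrated by Euler steps alongside the linear equation (2.11), and the result is compared with a finite-difference derivative of the flow:

```python
import math

def flow_and_B(x, t=1.0, m=100000):
    # Euler integration of f' = w(f) and B' = w'(f)*B in the scalar case,
    # with w(x) = -sin(x), w'(x) = -cos(x).
    h = t / m
    f, B = x, 1.0
    for _ in range(m):
        f, B = f + h * (-math.sin(f)), B + h * (-math.cos(f)) * B
    return f, B

x, eps = 0.8, 1e-5
f1, B = flow_and_B(x)
f2, _ = flow_and_B(x + eps)
fd = (f2 - f1) / eps   # finite-difference derivative of the flow at x
print(fd, B)
assert abs(fd - B) < 1e-3
```

Note that the discrete update for B is exactly the derivative of the discrete update for f, which is why the two quantities agree to within the finite-difference error.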
8.3. Proofs of (A) and (B)
In the rest of this chapter, we use the notation K for bounds that do not depend on θ, n, and x (e.g., α, β, and r), and C for bounds that are valid for all x and nθ ≤ T. In this section the process X_n = X_n^θ has initial state x ∈ I_θ, H_n = ΔX_n/θ, μ_n = μ_n(θ, x), ω_n = ω_n(θ, x), and Y_n = X_n − μ_n.

Proof of (A). Clearly

ω_{n+1} = ω_n + 2E(Y_n* ΔY_n) + E(|ΔY_n|²).    (3.1)

Since X_n ∈ I, and I is closed and convex, μ_n ∈ I, and w(μ_n) is defined. We have

E(Y_n* ΔY_n)/θ = E(Y_n*(H_n − w(μ_n)))

since E(Y_n*) = 0,

E(Y_n* ΔY_n)/θ = E(Y_n*(w(X_n, θ) − w(μ_n)))

because E(H_n | X_n) = w(X_n, θ),

E(Y_n* ΔY_n)/θ = E(Y_n*(w(X_n, θ) − w(X_n))) + E(Y_n*(w(X_n) − w(μ_n)))
≤ E(|Y_n| |w(X_n, θ) − w(X_n)|) + E(|Y_n| |w(X_n) − w(μ_n)|)
≤ Kθ E(|Y_n|) + K E(|Y_n|²)

by (a.2) and (2.2), and

E(Y_n* ΔY_n)/θ ≤ Kθ + Kω_n    (3.2)

since E(|Y_n|) ≤ (1 + ω_n)/2. Finally,

E(|ΔY_n|²)/θ² ≤ E(|H_n|²) ≤ K    (3.3)

as a consequence of (c). Using (3.2) and (3.3) in (3.1) we obtain

ω_{n+1} ≤ (1 + Kθ)ω_n + Kθ²,

which, on iteration, yields

ω_n ≤ Kθ² Σ_{j=0}^{n−1} (1 + Kθ)^j,

since ω_0 = 0. Thus

ω_n ≤ θ((1 + Kθ)^n − 1) ≤ θ(e^{KT} − 1)    (3.4)

for nθ ≤ T, which proves (A). ∎
Proof of (B). To begin with,

μ_{n+1} = μ_n + E(ΔX_n) = μ_n + θE(w(X_n, θ)),

so that, by (a.2), (2.3), and (A),

μ_{n+1} = μ_n + θw(μ_n) + c_n θ²,  |c_n| ≤ C.    (3.5)

If f(t) = f(t, x) satisfies (1.1) with f(0) = x, then, according to (2.7) with δ = θ and t = nθ, v_n = f(nθ) also satisfies (3.5). Since v_0 = x = μ_0, we expect to find v_n ≈ μ_n. However, at this point the existence of f(t, x) can only be guaranteed when I = R^N. Thus we shall extend w(x) to all of R^N, define f in terms of the extended function, prove (1.2), and then show that f(t) ∈ I, so that f satisfies (1.1).

According to (2.2) and (2.5), w(x) is bounded and Lipschitz. Thus all of the coordinate functions w_j(x) of w(x) are bounded Lipschitz real valued functions; hence they possess bounded Lipschitz extensions W_j(x) to R^N (Dudley, 1966, Lemma 5). The function W(x) with coordinates W_j(x) maps R^N into R^N and satisfies (2.2) and (2.5) for suitable constants α and r. It follows that, for every x ∈ I (in fact, for every x ∈ R^N), there is a unique function f(t) = f(t, x) such that

f′(t) = W(f(t))  and  f(0) = x,

and that (2.7) holds with W in place of w. Thus, if v_n = v_n(θ, x) = f(nθ, x),

v_{n+1} = v_n + θW(v_n) + k_n θ²,    (3.6)

where |k_n| ≤ K. Since μ_n ∈ I, (3.5) can be rewritten in the form

μ_{n+1} = μ_n + θW(μ_n) + c_n θ²,    (3.7)

|c_n| ≤ C. Subtracting (3.6) from (3.7) and letting d_n = μ_n − v_n, we obtain

d_{n+1} = d_n + θ(W(μ_n) − W(v_n)) + C_n θ²,

so that

|d_{n+1}| ≤ (1 + Kθ)|d_n| + Cθ².
If v_0 = μ_0 = x ∈ I_θ, then d_0 = 0, so iteration yields

|d_n| ≤ θC(e^{Knθ} − 1) ≤ θC(e^{KT} − 1)

for nθ ≤ T, and (1.2) is proved.

Suppose now that x ∈ I and t > 0. We wish to show that f(t, x) ∈ I, so that f(·, x) satisfies (1.1). As a consequence of (a.1), there is a function x_θ on J such that x_θ ∈ I_θ and x_θ → x as θ → 0. If n = [t/θ] then f(nθ, x_θ) → f(t, x) as θ → 0. But

μ_n(θ, x_θ) − f(nθ, x_θ) → 0

by (1.2), so μ_n(θ, x_θ) → f(t, x) as θ → 0. Since μ_n(θ, x_θ) ∈ I and I is closed, f(t, x) ∈ I, as claimed. ∎
8.4. Proof of (C)
In this section, o(θ) denotes a random variable such that

max_{n ≤ T/θ} E(|o(θ)|)/θ → 0

as θ → 0. The properties of Z_n = Z_n^θ that are crucial for (C) are given in the following lemma.
where

Γ = ω_m(θ, X_n) + |μ_m(θ, X_n) − v_m(θ, X_n)|².

But

Γ ≤ Cmθ²

by (3.4) and (1.2). Thus

E(|Z_{n+m} − Z_n|²) ≤ Cmθ + C(mθ)² E(|X_n − v_n|²)/θ ≤ Cmθ

if (n+m)θ ≤ T, by (4.7). ∎
Note that a_n^θ and b_n^θ are bounded and that

a_n^θ → a(t)  and  b_n^θ → b(t)    (4.8)

as θ → 0 and nθ → t, where a(t) = w′(f(t, x)) and b(t) = s(f(t, x)) are continuous. Thus (C) of Theorem 1.1 follows from Lemma 4.2.

LEMMA 4.2. Suppose that Z_n = Z_n^θ, θ ∈ J, nθ ≤ T, is any family of stochastic processes in R^N, with Z_0 = 0, that satisfies (4.1)-(4.4). Suppose further that a_n = a_n^θ and b_n = b_n^θ are bounded, and that (4.8) obtains as θ → 0 and nθ → t, where a(t) and b(t) are continuous. Suppose, finally, that χ(t) is continuous with χ(0) = 0. Then

ℒ(Z_n^θ) → N(0, g(t))

as θ → 0 and nθ → t, where

g(t) = ∫₀ᵗ [A(u)A(t)⁻¹]* b(u) [A(u)A(t)⁻¹] du

and A(t) is the solution of

A′(t) = −a(t)* A(t)
with A(0) the identity matrix.

Note that Lemma 4.2 does not assume that Z_n is Markovian. The lemma is proved below by a variant of the method of Rosen (1967a, b).

Proof. Using the third-order expansion of exp(iu) for real u we obtain, for any λ ∈ R^N,

E(exp(iλ*ΔZ_n) | Z_n) = 1 + iλ*E(ΔZ_n | Z_n) − λ*E((ΔZ_n)² | Z_n)λ/2 + k|λ|³ E(|ΔZ_n|³ | Z_n),
where |k| ≤ 1/6; hence

E(exp(iλ*ΔZ_n) | Z_n) = 1 + iθλ*a_n Z_n − θλ*b_n λ/2 + o(θ)

by (4.1)-(4.3). Substituting this into

E(exp(iλ*Z_{n+1})) = E(exp(iλ*Z_n) E(exp(iλ*ΔZ_n) | Z_n)),

we obtain

E(exp(iλ*Z_{n+1})) = E(exp(iλ*Z_n)) + θλ*a_n E(exp(iλ*Z_n) iZ_n) − θλ*b_n λ E(exp(iλ*Z_n))/2 + E(o(θ)).

Summation over n then yields

E(exp(iλ*Z_n)) − 1 = Σ_{j=0}^{n−1} [θλ*a_j E(exp(iλ*Z_j) iZ_j) − θλ*b_j λ E(exp(iλ*Z_j))/2] + ε,    (4.9)

where

ε = Σ_{j=0}^{n−1} E(o(θ)).    (4.10)

Let

h^θ(λ, t) = E(exp(iλ*Z_n)),  a^θ(t) = a_n^θ,  and  b^θ(t) = b_n^θ,

where n = [t/θ]. Note that

∇h^θ(λ, t) = E(exp(iλ*Z_n) iZ_n),

the gradient being taken with respect to λ. In these notations, (4.9) becomes

h^θ(λ, t) − 1 = ∫₀^{nθ} λ*a^θ(u) ∇h^θ(λ, u) du − ½ ∫₀^{nθ} λ*b^θ(u) λ h^θ(λ, u) du + ε.    (4.11)
Let θ_k be a sequence in J that converges to 0. Suppose that, for all t ≤ T, ℒ(Z_n^θ) converges weakly to a probability distribution ℒ(t), when θ → 0 along this sequence. We will show that ℒ(t) = N(0, g(t)). This requires a somewhat lengthy argument.

Let h(λ, t) be the characteristic function of ℒ(t). Then h^θ(λ, t) → h(λ, t). Furthermore, (4.4) yields

E(|Z_n|²) ≤ χ(nθ) ≤ χ(T),    (4.12)
from which it follows that

∫ |z|² ℒ(t)(dz) < ∞,    (4.13)

∇h(λ, t) = ∫ exp(iλ*z) iz ℒ(t)(dz),    (4.14)

and

∇h^θ(λ, t) → ∇h(λ, t)

(see Loève, 1963, Theorem A, p. 183). Next we show that h(λ, ·) and ∇h(λ, ·) are continuous. Let u > t and let m = [u/θ] − [t/θ], so that [u/θ] = m + n. Then

h^θ(λ, u) − h^θ(λ, t) = E(exp(iλ*Z_n)(exp(iλ*(Z_{n+m} − Z_n)) − 1)),

so that

|h^θ(λ, u) − h^θ(λ, t)| ≤ |λ| χ(mθ)^{1/2}    (4.15)

by the Schwarz inequality, and

|∇h(λ, u) − ∇h(λ, t)| ≤ (|λ| χ(T)^{1/2} + 1) χ(u − t)^{1/2}.    (4.16)

As a consequence of (4.12),

|∇h^θ(λ, t)| ≤ E(|Z_n|) ≤ χ(T)^{1/2},

so the integrands in (4.11) are bounded over u and θ. And, referring to (4.10),

|ε| ≤ Σ_{j ≤ T/θ} E(|o(θ)|) ≤ nθ max_{j ≤ T/θ} E(|o(θ)|)/θ → 0
as θ → 0. Thus we can let θ → 0 along θ_k in (4.11) to obtain

h(λ, t) − 1 = ∫₀ᵗ λ*a(u) ∇h(λ, u) du − ½ ∫₀ᵗ λ*b(u) λ h(λ, u) du.

Since the integrands are continuous in u, differentiation yields

∂h/∂t (λ, t) = λ*a(t) ∇h(λ, t) − ½ λ*b(t) λ h(λ, t).    (4.17)
To solve (4.17), let ξ ∈ R^N, λ(t) = A(t)ξ, and F(t) = h(λ(t), t). Then

λ′(t) = −a(t)* λ(t).

From this and (4.17) we obtain

F′(t) = λ′(t)* ∇h(λ(t), t) + ∂h/∂t (λ(t), t) = −½ λ(t)* b(t) λ(t) F(t).

The unique solution of this ordinary differential equation with F(0) = h(ξ, 0) = 1 is

F(t) = exp(−½ ∫₀ᵗ λ(u)* b(u) λ(u) du).

For ξ = A(t)⁻¹λ, λ(t) = λ, and λ(u) = A(u)A(t)⁻¹λ, so

h(λ, t) = exp(−½ λ* g(t) λ),
the characteristic function of N(0, g(t)). Thus ℒ(t) = N(0, g(t)), as claimed.

Suppose now that there is some t′ ≤ T such that ℒ(Z_{n′}^θ), where n′ = [t′/θ], does not converge to N(0, g(t′)) as θ → 0. We will show that this leads to a contradiction. There is a bounded continuous function G on R^N such that

E(G(Z_{n′}^θ)) ↛ ∫ G(z) N(0, g(t′))(dz) = 1.

Hence, there is a δ > 0 and, for every j, a θ_j′ < 1/j such that

|E(G(Z_{n′}^θ)) − 1| ≥ δ    (4.18)

if θ = θ_j′. Let n = [t/θ]. Using (4.12) and the diagonal method, we can construct a subsequence θ_k of θ_j′ such that ℒ(Z_n^θ) converges weakly to a probability ℒ̂(t) as θ → 0 for all rational t ≤ T. If ĥ(λ, t) is the characteristic function of ℒ̂(t), (4.16) applies to rational t and u. Since ĥ(λ, ·) is uniformly continuous on the rationals, it has a unique continuous extension to all of
[0, T], for which the same notation will be used. The extension satisfies (4.16); hence ĥ(λ, u) → ĥ(λ, t) uniformly in |λ| ≤ K as u → t. Taking t irrational and u rational, the continuity theorem implies that ĥ(λ, t) is the characteristic function of a probability ℒ̂(t) for all 0 ≤ t ≤ T. For any 0 ≤ t, u ≤ T,

|ĥ(λ, t) − h^θ(λ, t)| ≤ |ĥ(λ, t) − ĥ(λ, u)| + |ĥ(λ, u) − h^θ(λ, u)| + |h^θ(λ, u) − h^θ(λ, t)|.

Estimating the third term on the right by (4.15), taking u rational, and letting θ → 0 along θ_k, we obtain

lim sup |ĥ(λ, t) − h^θ(λ, t)| ≤ |ĥ(λ, t) − ĥ(λ, u)| + |λ| χ(|u − t|)^{1/2}.

Since the right side converges to 0 as u → t, h^θ(λ, t) → ĥ(λ, t), and thus ℒ(Z_n^θ) → ℒ̂(t) for all t ≤ T, as θ → 0 along θ_k. It follows that ℒ̂(t) = N(0, g(t)) for all t ≤ T; in particular, for t = t′. Hence ℒ(Z_{n′}^θ) → N(0, g(t′)), and E(G(Z_{n′}^θ)) → 1 as θ → 0 along θ_k. Since θ_k is a subsequence of θ_j′, this contradicts (4.18). Our conclusion is that ℒ(Z_n^θ) → N(0, g(t)) for all t ≤ T as θ → 0.

It remains only to relax the strict dependence of n on θ (n = [t/θ]). Let n′ be any integer between 0 and T/θ. As in (4.15),

|E(exp(iλ*Z_n)) − E(exp(iλ*Z_{n′}))| ≤ |λ| χ(|nθ − n′θ|)^{1/2},

which converges to 0 if θ → 0 and n′θ → t. Thus ℒ(Z_{n′}) → N(0, g(t)). ∎
Alternative proof of Theorem 5.2.3. Lemma 4.2 can be used in place of the final two paragraphs of Section 5.2, which complete the proof of Theorem 5.2.3. This simple exercise throws additional light on that theorem, and provides an example of an application of Lemma 4.2 to a non-Markovian process. In the notation of the final paragraphs of Section 5.2, we have a family {ζ_j} of normalized partial sums indexed by n, and we wish to show that ζ_n → N(0, σ²) as n → ∞. Taking the conditional expectation given ζ_j in (5.2.18) and letting θ = 1/n we obtain

E((Δζ_j)² | ζ_j) = θσ² + o(θ).

It follows from (5.2.19) that

E(|Δζ_j|³ | ζ_j) = o(θ),

and, if ξ > −½, (5.2.17) yields

E(Δζ_j | ζ_j) = o(θ).

These equations are instances of (4.2), (4.3), and (4.1), where a_j = 0 and b_j = σ². Thus (4.8) holds, with a(t) = 0 and b(t) = σ². From (5.2.16) we get

E(|ζ_{j+k} − ζ_j|²) ≤ Kkθ,

which is (4.4) with χ(t) = Kt. Thus Lemma 4.2 is applicable, and g(t) = tσ². Since qθ → 1 as n → ∞, ζ_q → N(0, σ²), as claimed. ∎

8.5. Near a Critical Point
Instead of subtracting f(nθ, x_θ), where X_0^θ = x_θ, from X_n^θ, as in (C) of Theorem 1.1, it is natural to consider the alternative normalization

z_n^θ = (X_n^θ − f(nθ, x))/√θ,

where x = lim_{θ→0} x_θ. Clearly

z_n^θ = Z_n^θ + (f(nθ, x_θ) − f(nθ, x))/√θ.

The first term is asymptotically N(0, g(t)) by Theorem 1.1, and, since z_0^θ = (x_θ − x)/√θ,

|(f(nθ, x_θ) − f(nθ, x))/√θ − B(nθ) z_0^θ| ≤ C√θ |z_0^θ|²

by (2.13). Since B(·) [see (2.11)] is continuous,

(f(nθ, x_θ) − f(nθ, x))/√θ → B(t) z

as θ → 0, nθ → t, and z_0^θ → z. Thus Theorem 1.1 has the corollary

ℒ(z_n^θ) → N(B(t)z, g(t)) = ℒ(t)    (5.1)

as θ → 0, nθ → t, and z_0^θ → z.

This result has special interest when x is a critical point, that is, a point such that w(x) = 0. In this case f(t) = x, so that

z_n^θ = (X_n^θ − x)/√θ.    (5.2)

Furthermore,

B(t) = e^{t w′(x)},  A(t) = e^{−t w′(x)*},    (5.3)

and

g(t) = ∫₀ᵗ e^{(t−u) w′(x)} s(x) e^{(t−u) w′(x)*} du.

Note that s(x) = S(x) − w²(x) = S(x). In the case N = 1, g(t) = t s(x) if
w′(x) = 0, and

g(t) = s(x)(e^{2t w′(x)} − 1)/2w′(x)    (5.4)

if w′(x) ≠ 0.

The result (5.1) suggests the possibility that, if z_0^θ → z,

ℒ^θ(∞) = lim_{n→∞} ℒ(z_n^θ) → lim_{t→∞} ℒ(t) = ℒ(∞)    (5.5)

as θ → 0. Elaborations of this statement assume rather different forms depending on the sign of w′(x), so we examine the cases w′(x) < 0 and w′(x) > 0 separately.

If w′(x) < 0, then, according to (5.3) and (5.4), B(t) → 0 and

g(t) → σ² = s(x)/2|w′(x)|,    (5.6)
so ℒ(∞) = N(0, σ²). If x is the only critical point [hence w(y) > 0 for y < x and w(y) < 0 for y > x], and if the asymptotic distribution ℒ^θ(∞) exists (perhaps in the sense of Cesàro convergence), then we expect it to be the unique stationary probability of the Markov process {z_n^θ}_{n≥0}. A generalization of (5.5) that does not presuppose the existence of ℒ^θ(∞) or uniqueness of the stationary probability is

ρ^θ → N(0, σ²)    (5.7)

as θ → 0, where ρ^θ is any stationary probability of z_n^θ, or, equivalently, the normalization (5.2) of any stationary probability of X_n^θ. We do not require z_0^θ → z for (5.7), since ρ^θ does not depend on z_0^θ and σ² does not depend on z. A theorem of this type is given in Section 10.1.

For any y ∈ R¹,

ℒ(t)(y, ∞) = Φ((B(t)z − y)/(g(t))^{1/2}),
where Φ is the standard normal distribution function. If w′(x) > 0 and s(x) > 0, g(t) → ∞ and B(t)/(g(t))^{1/2} → 1/σ; thus

lim_{t→∞} ℒ(t)(y, ∞) = Φ(z/σ)

for all y. If X_n^θ → ±∞ a.s. as n → ∞, then

lim_{n→∞} ℒ(z_n^θ)(y, ∞) = P(X_n^θ → ∞),

so (5.5) suggests

P(X_n^θ → ∞) → Φ(z/σ)    (5.8)
as θ → 0 and (x_θ − x)/√θ → z. A theorem of this type for additive models is proved in Section 14.5 with the aid of Theorem 11.2.1 and Lemma 5.1 below.

For proofs of (5.7) and (5.8), we need the following lemma about the moments of Δz_n^θ = ΔX_n^θ/√θ. The lemma is applicable to any N. Unlike Lemma 4.1, nothing is assumed about the distribution of z_0^θ; thus we are free to consider a stationary initial distribution later. Furthermore, o(θ) denotes a function of z_n and θ such that |o(θ)|/θ → 0, uniformly in n and z_n, as θ → 0.

Thus o(θ) in (5.11) satisfies

|o(θ)|/θ ≤ ε_θ + K√θ.

From (5.13) we get

|w(X_n, θ) − w′(x)(X_n − x)| ≤ Kθ + Kθ|z_n|²

by (a.2) and (2.3). Thus

|w(X_n, θ)/√θ − w′(x)z_n| ≤ K√θ(1 + |z_n|²),

which implies (5.10). Finally, (5.12) follows directly from (c) of Section 8.1. ∎
9. Transient Behavior in the Case of Small Drift
9.1. Diffusion Approximation in a Bounded Interval

The corollary (8.5.1) of Theorem 8.1.1 gives an approximation to the distribution of X_n^θ when θ is small, X_0^θ = x_θ is near a critical point x [i.e., a point such that w(x) = 0], and nθ is bounded. Some one-dimensional cases where w′(x) ≠ 0 and where this approximation continues to be useful when nθ is unbounded are described roughly in Section 8.5, and will be discussed further in subsequent chapters. In the case of small drift [i.e., w(x) ≡ 0], all points of I are critical, and w′(x) = 0. For a critical point x with w′(x) = 0, (8.5.1) gives

(X_n^θ − x)/√θ ∼ N((x_θ − x)/√θ, nθ s(x)),

or

X_n^θ ∼ N(x_θ, nθ² s(x)),

which suggests that

ℒ(X_n^θ) → N(x, t s(x))

as θ → 0, x_θ → x, and nθ → ∞ in such a way that nθ² → t. If s(x) > 0 and I
is bounded, this conjecture is definitely wrong, since ℒ(X_n^θ) is confined to I and the normal distribution N(x, t s(x)) is not. However, we shall show in this chapter that ℒ(X_n^θ) does converge as θ → 0, x_θ → x, and nθ² → t, in many cases of slow learning with small drift in bounded intervals. The limiting distribution is ℒ(X(t)), where X(t), t ≥ 0, is a certain diffusion in I with X(0) = x.

Under hypotheses (a)-(c) of Section 8.1, with w(x) ≡ 0, the conditional moments of ΔX_n^θ are of the form
E(ΔX_n | X_n = x) = O(τ),
E((ΔX_n)² | X_n = x) = τS(x) + o(τ),
and
E(|ΔX_n|³ | X_n = x) = O(τ^{3/2}),

where τ = θ². In order to describe the asymptotic behavior of X_n^θ as nτ → t, we must specify E(ΔX_n | X_n = x) more precisely as τa(x) + o(τ), and we must impose various conditions on a(x) and S(x) [which corresponds to b(x) below]. These remarks serve to place the assumptions that follow within the framework of the preceding chapter. However, these assumptions are self-contained, and it is not necessary to refer to Chapter 8 in order to apply the theory in this chapter.

Let J be a bounded set of positive real numbers with infimum 0, and let I = [d_0, d_1] be a closed bounded interval. For every τ ∈ J, let K_τ be a transition kernel in a Borel subset I_τ of I. Corresponding Markov processes are denoted X_n^τ. Suppose that
E(ΔX_n^τ | X_n^τ = x) = τa(x) + o(τ),
E((ΔX_n^τ)² | X_n^τ = x) = τb(x) + o(τ),    (1.1)
and
E(|ΔX_n^τ|³ | X_n^τ = x) = o(τ),

where o(τ) is uniform over I_τ; i.e.,

sup_{x∈I_τ} |o(τ)|/τ → 0

as τ → 0. It is assumed that a has three (bounded) continuous derivatives throughout I (a ∈ C³). We emphasize that a^{(j)}(x), j = 1, 2, 3, exists and is continuous even at the boundaries x = d_i. Furthermore, a(d_0) ≥ 0 and a(d_1) ≤ 0. Our conditions on b are rather unusual but, nonetheless, quite general, as we will see. It is assumed that b admits a factorization

b(x) = σ_0(x) σ_1(x),    (1.2)
where σ_i ∈ C³, σ_i(d_i) = 0, σ_i(x) > 0 for d_0 < x < d_1,

p(x) = σ_0(x)/(σ_0(x) + σ_1(x))

is nondecreasing over this interval, and, letting

p(d_i) = lim_{x→d_i} p(x),

p ∈ C³. These conditions imply that b ∈ C³, b(d_i) = 0, and b(x) > 0 for d_0 < x < d_1. A very broad class of functions b having suitable factorizations are those of the form

b(x) = (x − d_0)^j (d_1 − x)^k h(x),    (1.3)

where j and k are positive integers, h ∈ C³, and h(x) > 0 for all x ∈ I. Let

σ_0(x) = (x − d_0)^j (h(x))^{1/2}

and

σ_1(x) = (d_1 − x)^k (h(x))^{1/2}.

Since √h is in C³, σ_i is too; since σ_0/σ_1 is increasing, p is also increasing; and since σ_0 + σ_1 is positive throughout I, p ∈ C³. Another interesting class of examples are those with b(d_i) = 0, b(x) > 0 on (d_0, d_1), and √b ∈ C³. Here we can take σ_i = √b to obtain p(x) = ½.

Let 𝔅 be the collection of Borel subsets of I. A transition probability is a function P on [0, ∞) × I × 𝔅 such that P(t) = P(t; ·, ·) is a stochastic kernel, P(0; x, ·) = δ_x, and the Chapman-Kolmogorov equation
∫ P(s; x, dy) P(t; y, A) = P(s+t; x, A)    (1.4)

holds for all s, t, x, and A. Theorem 1.1 is our main result on transient behavior for slow learning with small drift.

THEOREM 1.1. If a and b satisfy the conditions given above, there is one and only one transition probability P such that

∫ (y − x) P(τ; x, dy) = τa(x) + o(τ),
∫ (y − x)² P(τ; x, dy) = τb(x) + o(τ),    (1.5)
and
∫ |y − x|³ P(τ; x, dy) = o(τ),
where o(τ) is uniform over x ∈ I. If (1.1) holds,

K_τ^{(n)}(x_τ, ·) = ℒ(X_n^τ | X_0^τ = x_τ) → P(t; x, ·)    (1.6)

weakly, as τ → 0, x_τ → x, and nτ → t.

The uniqueness of P follows immediately from (1.6). For if Q is a transition probability that satisfies (1.5), K_τ = Q(τ) is a family of transition kernels in I that satisfy (1.1), and K_τ^{(n)} = Q(nτ) by (1.4). Taking τ = t/n and x_τ = x in (1.6) we obtain Q(t; x, ·) → P(t; x, ·) as n → ∞; i.e., Q(t; x, ·) = P(t; x, ·). As a consequence of the last equation in (1.5), there is a diffusion X(t), t ≥ 0, with X(0) = x and P(X(s+t) ∈ A | X(s)) = P(t; X(s), A) a.s. (Lamperti, 1966, Theorem 24.1). In particular, ℒ(X(t)) = P(t; x, ·).
Under the same hypotheses as in Theorem 1.1, it can be shown that

ℒ(X_{n_1}^τ, …, X_{n_k}^τ) → ℒ(X(t_1), …, X(t_k))

as τ → 0, x_τ → x, and n_jτ → t_j, j = 1, …, k. If o(τ) in (1.1) is replaced by O(τ^{1+ν}) for some ν > 0, then ℒ(X^τ(t), t ≤ T) → ℒ(X(t), t ≤ T), where X^τ(t) is the random polygonal line with vertices X^τ(nτ) = X_n^τ.

The family of linear models discussed in the last three paragraphs of Chapter 0 provides a simple illustration of Theorem 1.1. Since failure is ineffective (θ*c* = 0), (0.3.1) specializes to

ΔX_n = θ(1 − X_n)   with probability X_n π_11 c,
ΔX_n = −θX_n        with probability (1 − X_n) π_00 c,    (1.7)
ΔX_n = 0            otherwise.
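The moment computations behind the parametrization of this model can be verified directly from (1.7). A sketch (parameter values are hypothetical; it assumes, as in the surrounding text, π_00/π_11 = 1 + θk, τ = θ², A = −kπ_11 c, and B = π_11 c):

```python
theta, k, c, pi11 = 1e-3, 2.0, 0.25, 0.8   # hypothetical parameter values
pi00 = pi11 * (1 + theta * k)              # the line pi00/pi11 = 1 + theta*k
tau = theta ** 2

def moments(x):
    # E(dX | X = x) and E(dX^2 | X = x) for the two-operator model (1.7)
    p_up, p_dn = x * pi11 * c, (1 - x) * pi00 * c
    m1 = theta * (1 - x) * p_up - theta * x * p_dn
    m2 = (theta * (1 - x)) ** 2 * p_up + (theta * x) ** 2 * p_dn
    return m1, m2

A, B = -k * pi11 * c, pi11 * c
for x in (0.2, 0.5, 0.9):
    m1, m2 = moments(x)
    a, b = A * x * (1 - x), B * x * (1 - x)
    print(m1 / tau, a)   # first moment rate: exactly tau*A*x*(1-x)
    print(m2 / tau, b)   # second moment rate: tau*B*x*(1-x) + o(tau)
```

The first conditional moment equals τAx(1 − x) exactly in this model, while the second differs from τBx(1 − x) only by a term of order θτ, which is the o(τ) allowed by (1.1).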
It is assumed that c > 0 and n 1 > 0 are fixed, and that xoo approaches n 1 as O + O along the line 1coo/x11 =
1
+ Ok,
where, for our present purposes, k can be any real constant. Taking and di = i, (1.1) holds with
(1.8) '5
= 8'
and b(x) = Bx(1-x), (1.9) where B = x II c and A = - knI c. The special case of Theorem 1. I corresponding to such functions a and b is contained in an earlier result (Norman, 1971a, Theorem 4.3). a(x) = A x ( l - x )
9.2. Invariance

In this section we introduce a special family of transition kernels L_τ satisfying (1.1), and show that, if K_τ is any other such family, then K_τ^(n)(x, ·) − L_τ^(n)(x, ·) converges to zero, uniformly over x ∈ I and nτ ≤ v, as τ → 0. Thus the behavior of K_τ^(n) when τ is small is invariant over all families of kernels satisfying (1.1). Such invariance is at once an important implication of Theorem 1.1 and a basic component of its proof. The remainder of the proof of Theorem 1.1 is given in the next section.

The family L_τ is distinguished by its exceptional simplicity, which permits us to establish a crucial property (Lemma 2.2) by a rather direct computation. Let v₁ = σ₁, v₀ = −σ₀, θ = √τ, p₁ = p, and p₀ = 1 − p. Then

    u_i(x) = u_i(x, τ) = x + τa(x) + θv_i(x)

and

    u_i′(x) = 1 + τa′(x) + θv_i′(x) ≥ 1 − τ|a′| − θ|v_i′| > 0

if τ < τ₀, for some τ₀ > 0 sufficiently small. The condition τ < τ₀ is assumed in all that follows. Now u_i(d₀) ≥ d₀, since a(d₀) ≥ 0 and v_i(d₀) ≥ 0; and u_i(d₁) ≤ d₁, since a(d₁) ≤ 0 and v_i(d₁) ≤ 0. Since u_i is increasing, u_i maps I into I. Thus the requirement that a transition from x to u_i(x) occurs with probability p_i(x) defines a transition kernel L = L_τ in I. The corresponding transition operator V = V_τ is

    Vf(x) = Σ f(u_i(x)) p_i(x),

where the summation is over i = 0 and 1. Since u_i ∈ C³ and p_i ∈ C³, V maps C = C(I) into C and C³ into C³.
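The two-point kernel can be sketched in code. The σ_i are fixed earlier in the chapter; the sketch below makes the symmetric choice σ₁ = σ₀ = √b(x) (hence p = q = 1/2), which is ours for illustration, and which satisfies σ₁σ₀ = b and σ₁p = σ₀q. The drift and variance functions are also illustrative.

```python
import math

def a(x):                 # illustrative drift
    return -0.9 * x * (1.0 - x)

def b(x):                 # illustrative variance coefficient (b >= 0 on [0,1])
    return 0.3 * x * (1.0 - x)

def L_kernel(x, tau):
    """One step of the two-point kernel L_tau, with the symmetric choice
    sigma_1 = sigma_0 = sqrt(b(x)), hence p = q = 1/2.  This satisfies
    sigma_1*sigma_0 = b and sigma_1*p = sigma_0*q; the particular choice
    is an assumption made for this sketch."""
    theta = math.sqrt(tau)
    s = math.sqrt(b(x))
    u1 = x + tau * a(x) + theta * s   # v_1(x) = +sigma_1(x)
    u0 = x + tau * a(x) - theta * s   # v_0(x) = -sigma_0(x)
    return [(u1, 0.5), (u0, 0.5)]

x, tau = 0.4, 1e-4
pts = L_kernel(x, tau)
m1 = sum(p * (u - x) for u, p in pts)
m2 = sum(p * (u - x) ** 2 for u, p in pts)
print(m1 / tau)                            # = a(x): first moment of Lemma 2.1
print((m2 - tau ** 2 * a(x) ** 2) / tau)   # = b(x): second moment, exactly
```

The computation reproduces the moment identities of Lemma 2.1 below: the first moment is exactly τa(x), and the second moment is exactly τb(x) + τ²a²(x).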
LEMMA 2.1. L_τ satisfies (1.1).

Proof. First,

    Σ (u_i(x) − x) p_i(x) = τa(x) + θ Σ v_i(x) p_i(x),

and

    Σ v_i(x) p_i(x) = σ₁(x)p(x) − σ₀(x)q(x) = 0    (2.1)

as a consequence of the definition of p (q = 1 − p). Next,

    Σ (u_i(x) − x)² p_i(x) = τ Σ v_i²(x) p_i(x) + τ²a²(x)

(the cross term vanishes by (2.1)), while |a| < ∞, and

    Σ v_i²(x) p_i(x) = σ₁²(x)p(x) + σ₀²(x)q(x) = σ₁(x)σ₀(x)q(x) + σ₀(x)σ₁(x)p(x)

by (2.1), so that

    Σ v_i²(x) p_i(x) = σ₁(x)σ₀(x) = b(x)

by (1.2). The last equation in (1.1), with O(τ^{3/2}) in place of o(τ), is a consequence of the boundedness of a and the v_i. ∎

For f ∈ C³, let

    ‖f‖ = |f′| + |f″| + |f‴|.
LEMMA 2.2. There is a γ > 0 such that

    ‖V_τ^n f‖ ≤ e^{γnτ} ‖f‖    (2.2)

for all n ≥ 0, τ < τ₀, and f ∈ C³.
Proof. We often suppress the argument x in the following derivation, writing, for example, p_i instead of p_i(x). This renders the notation |g| ambiguous, so we denote the supremum norm |g|_∞. Clearly,

    (Vf)′(x) = Σ f′(u_i) u_i′ p_i + Σ f(u_i) p_i′.

Since u_i′ ≥ 0,

    |Σ f′(u_i) u_i′ p_i| ≤ |f′|_∞ Σ u_i′ p_i ≤ |f′|_∞ (1 + θ|Σ v_i′ p_i| + τ|a′|_∞).

Also, p₀′ = −p₁′, so

    Σ f(u_i) p_i′ = (f(u₁) − f(u₀)) p₁′.

However, p₁′ = p′ ≥ 0, and hence

    |Σ f(u_i) p_i′| ≤ |f′|_∞ (u₁ − u₀) p′ = θ|f′|_∞ (σ₁ + σ₀) p′.    (2.3)

Differentiating (2.1) gives

    Σ v_i′ p_i + Σ v_i p_i′ = 0,    (2.5)

so the two contributions of order θ combine, via the mean value theorem, to give

    |(Vf)′|_∞ ≤ |f′|_∞ (1 + cτ) + cτ|f″|_∞    (2.6)

for some constant c that does not depend on x, τ, or f.
We next show that there is a constant c < ∞ such that

    |(Vf)″|_∞ ≤ |f″|_∞ (1 + cτ) + cτ|f′|_∞    (2.7)

for all τ < τ₀ and f ∈ C³. We use the notation c below for a number of different bounds that do not depend on x, τ, or f. By the chain and product rules for differentiation,

    (Vf)″(x) = Σ f″(u_i) u_i′² p_i + Σ f′(u_i) u_i″ p_i + 2 Σ f′(u_i) u_i′ p_i′ + Σ f(u_i) p_i″.

Since u_i′ = 1 + θv_i′ + O(τ) and u_i″ = θv_i″ + O(τ), this can be written

    (Vf)″(x) = A₁ + A₂ + B₁ + B₂ + C,    (2.8)

where

    A₁ = Σ f″(u_i) u_i′² p_i,    A₂ = 2 Σ f′(u_i) p_i′,
    B₁ = θ Σ f′(u_i)(2v_i′ p_i′ + v_i″ p_i),    B₂ = Σ f(u_i) p_i″,

and

    |C| ≤ cτ|f′|_∞.    (2.9)

We shall examine each of these components in turn. First,

    |A₁| ≤ |f″|_∞ Σ u_i′² p_i ≤ |f″|_∞ (1 + 2θ Σ v_i′ p_i + cτ),    (2.10)

and A₂ = 2(f′(u₁) − f′(u₀)) p₁′, so that

    |A₂| ≤ 2θ|f″|_∞ Σ v_i p_i′.    (2.11)

Adding (2.10) and (2.11) and using (2.5) we get

    |A₁| + |A₂| ≤ |f″|_∞ (1 + cτ).    (2.12)
Turning next to B₂, we see that

    B₂ = (f(u₁) − f(u₀)) p₁″ = Λ(u₁ − u₀) p₁″ = Λθ Σ v_i p_i″,

where

    Λ = (f(u₁) − f(u₀))/(u₁ − u₀)

[or f′(u₁) if u₁ = u₀]. Making use of the equality obtained by differentiating (2.5) we obtain

    B₂ = −Λθ Σ (2v_i′ p_i′ + v_i″ p_i).

Therefore

    B₁ + B₂ = θ Σ (f′(u_i) − Λ)(2v_i′ p_i′ + v_i″ p_i).

But

    |f′(u_i) − Λ| ≤ c|f″|_∞ θ,

so

    |B₁ + B₂| ≤ c|f″|_∞ τ.    (2.13)

Inequality (2.7) follows directly from (2.8), (2.9), (2.12), and (2.13). A similar computation yields

    |(Vf)‴|_∞ ≤ |f‴|_∞ (1 + cτ) + cτ(|f″|_∞ + |f′|_∞).

When this is added to (2.6) and (2.7),

    ‖Vf‖ ≤ (1 + cτ)‖f‖ ≤ e^{cτ} ‖f‖

results, and (2.2) with γ = c follows by induction. ∎
The next lemma is our main result in this section.

LEMMA 2.3. Let U_τ, τ ∈ J, be a family of transition operators whose kernels K_τ satisfy (1.1). For any v < ∞ and f ∈ C,

    sup_{nτ ≤ v} |U_τ^n f − V_τ^n f| → 0    (2.14)

as τ → 0.

In (2.14) and its proof, |g| is understood to be the supremum of |g| over the largest set on which g is defined. This is I in (2.14).

Proof. We first establish the basic inequality

    |Uⁿf − Vⁿf| ≤ Σ_{j=0}^{n−1} |U V^j f − V V^j f|    (2.15)
for f bounded and measurable. If x ∈ I and n ≥ 1,

    Uⁿf(x) − Vⁿf(x) = U(U^{n−1}f − V^{n−1}f)(x) + UV^{n−1}f(x) − VV^{n−1}f(x).

Since U is a contraction,

    |Uⁿf − Vⁿf| ≤ |U^{n−1}f − V^{n−1}f| + |UV^{n−1}f − VV^{n−1}f|,

and (2.15) follows by induction. If g ∈ C³, then

    g(y) = g(x) + (y − x)g′(x) + (y − x)²g″(x)/2 + φ|y − x|³|g‴|,

where |φ| ≤ 1/6. Integrating this equality with respect to K_τ(x, dy) and using (1.1) we obtain

    U_τ g(x) = g(x) + τΓg(x) + o(τ)‖g‖,    (2.16)

where

    Γg(x) = a(x)g′(x) + 2⁻¹b(x)g″(x),    (2.17)

and o(τ) is uniform over x ∈ I and g ∈ C³. Similarly,

    V_τ g(x) = g(x) + τΓg(x) + o(τ)‖g‖.

Subtracting this from (2.16), we get

    |U_τ g − V_τ g| ≤ τε‖g‖    (2.18)

for some function ε = ε(τ) of τ alone such that ε → 0 as τ → 0. If f ∈ C³, then g = V_τ^j f does too, so we can apply (2.18) to V_τ^j f. By (2.2),

    ‖V_τ^j f‖ ≤ e^{γv}‖f‖ = k

if jτ ≤ v, so

    |U_τ V_τ^j f − V_τ V_τ^j f| ≤ τεk.

This estimate, in conjunction with (2.15), yields

    |U_τ^n f − V_τ^n f| ≤ vkε

for nτ ≤ v. This implies (2.14) for f ∈ C³. Since C³ is dense in C, and Uⁿ and Vⁿ are contractions, (2.14) holds for all f ∈ C. ∎
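The invariance expressed by Lemma 2.3 can be illustrated numerically: two quite different kernels whose first two conditional moments agree produce nearly the same n-step expectations when nτ is held fixed. The drift a(x) = −x and variance b(x) = 1 below are an illustrative Ornstein-Uhlenbeck-type choice, not a model from the text, and the Monte Carlo tolerances are loose.

```python
import math, random

def step_two_point(x, tau, rng):
    """Two-point kernel in the spirit of Section 9.2: x + tau*a(x) +/- sqrt(tau)."""
    dx = math.sqrt(tau) if rng.random() < 0.5 else -math.sqrt(tau)
    return x - tau * x + dx

def step_three_point(x, tau, rng):
    """A different kernel with the same conditional mean tau*a(x) and variance
    tau*b(x): moves +/- sqrt(3*tau) with probability 1/6 each, else stays."""
    u, s = rng.random(), math.sqrt(3.0 * tau)
    dx = s if u < 1 / 6 else (-s if u < 1 / 3 else 0.0)
    return x - tau * x + dx

def mean_cos(step, x0=1.0, t=1.0, tau=0.01, paths=4000, seed=1):
    rng, total, n = random.Random(seed), 0.0, int(round(t / tau))
    for _ in range(paths):
        x = x0
        for _ in range(n):
            x = step(x, tau, rng)
        total += math.cos(x)
    return total / paths

est_two = mean_cos(step_two_point)
est_three = mean_cos(step_three_point)
# Both estimate E cos(X(1)) for the limiting diffusion; for this OU case
# X(1) ~ N(e^-1, (1 - e^-2)/2), so the exact value is
# cos(e^-1) * exp(-(1 - e^-2)/4).
print(est_two, est_three)
```

As τ shrinks both estimates converge to the same diffusion functional, which is exactly the content of (2.14).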
9.3. Semigroups

It is shown in this section that there is a transition probability P satisfying (1.5), and that

    T_t f(x) = ∫ P(t; x, dy) f(y),    (3.1)

where f ∈ C, has the following continuity properties: T_t f ∈ C, and

    |T_s f − T_t f| → 0  as  s → t.    (3.2)

Granting this, it is easy to complete the proof of (1.6). By Lemma 2.3,

    |U_τ^n f − V_τ^n f| → 0    (3.3)

as τ → 0, if n varies in such a way that nτ ≤ v. According to (1.5), K_τ = P(τ) satisfies (1.1), and according to (1.4), T_t has the semigroup property T_s T_t = T_{s+t}, so that T_τ^n = T_{nτ}. Thus, as a special case of (3.3),

    |T_{nτ} f − V_τ^n f| → 0,

and this can be combined with (3.3) to yield

    |U_τ^n f − T_{nτ} f| → 0

as τ → 0 and nτ ≤ v. But

    |T_{nτ} f − T_t f| → 0

as nτ → t, by (3.2); and

    T_t f(x_τ) → T_t f(x)

as x_τ → x, since T_t f ∈ C. Therefore, for all f ∈ C,

    U_τ^n f(x_τ) → T_t f(x)

as τ → 0, nτ → t, and x_τ → x, and this differs from (1.6) only notationally.

Before proceeding further, it is useful to summarize some of the terminology and basic results of semigroup theory. A family T_t, t ≥ 0, of bounded linear operators on a Banach space B is a semigroup if T₀ = E, the identity operator, and T_s T_t = T_{s+t} for all s, t ≥ 0. The semigroup is strongly continuous if, for each f ∈ B, T_t f is continuous in t with respect to the norm of B. The generator Γ of such a semigroup is the linear operator defined by

    Γf = lim_{t→0} t⁻¹(T_t f − f)

for f in the domain 𝒟(Γ) of Γ, the set of f for which this limit exists in the norm of B. The semigroup is uniquely determined by Γ [and 𝒟(Γ)]. If f ∈ 𝒟(Γ), then T_t f ∈ 𝒟(Γ), and

    (d/dt) T_t f = Γ T_t f = T_t Γf.    (3.4)

A strongly continuous semigroup on C = C(I) is contractive if |T_t f| ≤ |f| for all f ∈ C, and T_t maps nonnegative functions into nonnegative functions. A contractive semigroup is conservative if T_t 1 = 1. Let C² be the set of functions with two continuous derivatives throughout I.
LEMMA 3.1. There is a conservative semigroup T_t on C whose generator Γ has the following properties: (A) 𝒟(Γ) ⊇ C², and (B) for f ∈ C² and x ∈ I,

    Γf(x) = a(x)f′(x) + 2⁻¹b(x)f″(x).    (3.5)

Proof. Let d₀ < d < d₁,

    B(x) = ∫_d^x (2a(y)/b(y)) dy,    p(x)† = ∫_d^x e^{−B(y)} dy,

and

    m(x) = ∫_d^x 2b(y)⁻¹ e^{B(y)} dy.

Let 𝒟 be the set of f ∈ C satisfying these two conditions:

(i) f has two continuous (but not necessarily bounded) derivatives on (d₀, d₁). For d₀ < x < d₁, put

    Γ*f(x) = a(x)f′(x) + 2⁻¹b(x)f″(x) = (d/dm)(d/dp) f(x).    (3.6)

(ii) Γ*f(x) has finite limits [denoted Γ*f(d_i)] at d₀ and d₁.

Thus Γ* maps 𝒟 into C. The generator Γ of T_t is the restriction of Γ* to a subset of 𝒟 determined by the type of boundary at d₀ and d₁. Let

    α_i = ∫_d^{d_i} m(x) dp(x)    and    β_i = ∫_d^{d_i} p(x) dm(x),

† This function should not be confused with σ₀/(σ₀ + σ₁) in the preceding sections.
where the range of integration excludes d_i. The boundary d_i is

    regular    if α_i < ∞ and β_i < ∞,
    exit       if α_i < ∞ and β_i = ∞,
    entrance   if α_i = ∞ and β_i < ∞,
    natural    if α_i = ∞ and β_i = ∞.

Regular and exit boundaries are termed accessible; others are inaccessible. If d_i is inaccessible, let 𝒟_i = 𝒟. If d_i is exit, let

    𝒟_i = {f ∈ 𝒟 : Γ*f(d_i) = 0}.

Finally, if d_i is regular, let

    𝒟_i = {f ∈ 𝒟 : lim_{x→d_i} e^{B(x)} f′(x) = 0}

or, equivalently,

    𝒟_i = {f ∈ 𝒟 : (df/dp)(d_i) = 0}.
Let 𝒟(Γ) = 𝒟₀ ∩ 𝒟₁ and Γ = Γ*|𝒟(Γ). The fact that Γ generates a contractive semigroup T_t is a consequence of Feller's semigroup theory. The results needed here are all presented in Section 5 of Chapter II of Mandl (1968).

Note that 𝒟 ⊇ C², and (3.6) holds for all x ∈ I if f ∈ C². If 𝒟(Γ) ⊇ C², then Γf = Γ*f for f ∈ C², and (3.5) follows. Thus (B) follows from (A). Granted (A) and (B), 1 ∈ 𝒟(Γ) and Γ1 = 0; hence, by the outer equality in (3.4), dT_t 1/dt = 0. Since T₀1 = 1, T_t is conservative. Therefore it remains only to establish (A).

It suffices to show that 𝒟_i ⊇ C² for each i. If d_i is inaccessible this is certainly the case. If d_i is an exit boundary, we shall see below that a(d_i) = 0; hence Γ*f(d_i) = 0 for all f ∈ C². If d_i is a regular boundary, we shall see that (−1)^i a(d_i) > 0. But

    b(y) ≤ |b′| |d_i − y|,    (3.7)

so (−1)^i r(y) is bounded below by a positive multiple of 1/|d_i − y| near d_i. Therefore B(x) → −∞, and e^{B(x)} f′(x) → 0 for all f ∈ C², as x → d_i.

We now establish the facts cited above about a(d_i) when d_i is accessible. If d₁ is regular, then p(d₁) < ∞ and m(d₁) < ∞, so, by the Schwarz inequality,

    (p′(x) m′(x)/2)^{1/2} = 1/(b(x))^{1/2}

is integrable over (d, d₁). If b′(d₁) = 0, then b(y) = O((d₁ − y)²), and this would not be the case, so b′(d₁) < 0. But then a(d₁) = 0 would imply boundedness of a(y)/b(y), hence of B(x), and, in view of (3.7), m(d₁) = ∞ would follow. Therefore we must have a(d₁) < 0 if d₁ is regular. Similarly, a(d₀) > 0 if d₀ is regular.

Now m(d₁) = ∞ if d₁ is an exit boundary. Thus, to show that a(d₁) = 0 for such a boundary, it suffices to show that a(d₁) < 0 implies m(d₁) < ∞. If a(d₁) < 0 there is a ξ < d₁ and an α < 0 such that a(x) ≤ α if x ≥ ξ. In particular, B′(x) < 0, so that e^{B(x)} is decreasing for x ≥ ξ. Thus de^{B(x)}/dx is integrable over [ξ, d₁), and

    m′(x) = a(x)⁻¹ de^{B(x)}/dx

is too. The proof that a(d₀) = 0 if d₀ is an exit boundary is similar. ∎
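The classification of a boundary can be carried out numerically by truncating α_i and β_i at ε > 0 and letting ε decrease: a truncated integral that stabilizes is finite, one that keeps growing is infinite. The diffusion below, with a(x) = Ax(1−x) and b(x) = Bx(1−x) as in the linear-model example of Section 9.1 (the constants are illustrative), has α₀ < ∞ and β₀ = ∞, i.e., an exit boundary at 0, consistent with a(0) = 0. Magnitudes of the signed integrals are used throughout; the inner scale integrals are in closed form because r is constant here.

```python
import math

A, B, d = -0.9, 0.3, 0.5
r = 2.0 * A / B                         # r(x) = 2a(x)/b(x), constant for this model

def m1(x):                              # speed density m'(x) = 2 b(x)^-1 e^{B(x)}
    return 2.0 / (B * x * (1.0 - x)) * math.exp(r * (x - d))

def P_to_d(x):                          # | integral_x^d p'(y) dy |, p'(y) = e^{-B(y)}
    return (math.exp(-r * (x - d)) - 1.0) / r

def quad(f, lo, hi, n=20000):
    """Midpoint rule on a logarithmic grid; handles the 1/x blow-up of m' at 0."""
    u0, u1 = math.log(lo), math.log(hi)
    h = (u1 - u0) / n
    s = 0.0
    for i in range(n):
        x = math.exp(u0 + (i + 0.5) * h)
        s += f(x) * x
    return s * h

def alpha_beta(eps):
    """Truncated |alpha_0| = |int m dp| (via Fubini) and |beta_0| = |int p dm|."""
    def P_from_eps(y):                  # | integral_eps^y p'(t) dt |
        return (math.exp(-r * (eps - d)) - math.exp(-r * (y - d))) / r
    alpha = quad(lambda y: m1(y) * P_from_eps(y), eps, d)
    beta = quad(lambda x: m1(x) * P_to_d(x), eps, d)
    return alpha, beta

(a1, b1), (a2, b2), (a3, b3) = (alpha_beta(e) for e in (1e-2, 1e-4, 1e-6))
print(a1, a2, a3)   # stabilizes: alpha_0 finite
print(b1, b2, b3)   # grows like log(1/eps): beta_0 infinite, so 0 is an exit boundary
```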
Since T_t, t ≥ 0, is a conservative semigroup, there is a transition probability P that satisfies (3.1) for all f ∈ C and x ∈ I.

LEMMA 3.2. P satisfies (1.5).

Proof. Let

    Ω_τ = τ⁻¹(T_τ − E) − Γ,

and let f(y) = (y − x)². Since f(x) = 0, and Γf(x) = b(x) by (B) of Lemma 3.1,

    δ_τ(x) = τ⁻¹ ∫ (y − x)² P(τ; x, dy) − b(x) = Ω_τ f(x),

so that |δ_τ(x)| ≤ |Ω_τ f|. But f = J² − 2xJ + x²1, where J(y) = y and 1(y) = 1, and Ω_τ is linear with Ω_τ 1 = 0, so

    |δ_τ(x)| ≤ |Ω_τ J²| + 2|x| |Ω_τ J|

and, letting c be the maximum of 2|x| for x ∈ I,

    |δ_τ| ≤ |Ω_τ J²| + c|Ω_τ J|.

By (A) of Lemma 3.1, J^j ∈ 𝒟(Γ) for all j ≥ 0; thus |Ω_τ J^j| → 0 and |δ_τ| → 0 as τ → 0. This establishes the second equation in (1.5). Applying the same argument to f(y) = y − x and f(y) = (y − x)⁴, we obtain the first equation in (1.5) and |ε_τ′| → 0, where

    ε_τ′(x) = τ⁻¹ ∫ (y − x)⁴ P(τ; x, dy).

But

    τ⁻¹ ∫ |y − x|³ P(τ; x, dy) ≤ (|ε_τ′| |δ_τ + b|)^{1/2}

by the Schwarz inequality, and |δ_τ + b| ≤ |δ_τ| + |b| is bounded as τ → 0. Hence τ⁻¹ ∫ |y − x|³ P(τ; x, dy) → 0, which is the last equation in (1.5). ∎
This completes the proof of Theorem 1.1.

The proof of Lemma 3.2 uses only the properties of T_t given in Lemma 3.1. Since, according to Theorem 1.1, there is only one transition probability P satisfying (1.5), we conclude that there is only one semigroup T_t with these properties.

Let D be the set of all f ∈ C² for which f″ is Lipschitz; i.e.,

    D = {f ∈ C² : m(f″) < ∞},

where

    m(g) = sup_{x ≠ y} |g(x) − g(y)|/|x − y|.

For f ∈ D, let

    ‖f‖ = |f′| + |f″| + m(f″).

Clearly C³ ⊂ D and m(f″) = |f‴| on C³, so this seminorm is an extension of ‖·‖ defined previously on C³. It can be shown that, if f_k ∈ D, {‖f_k‖}_{k≥0} is bounded, and |f_k − f| → 0 as k → ∞, then f ∈ D and

    ‖f‖ ≤ lim inf_{k→∞} ‖f_k‖.

Thus, taking n = [t/τ] in (2.2) and letting τ → 0, we see that, for any f ∈ C³ and t ≥ 0, T_t f ∈ D and

    ‖T_t f‖ ≤ e^{γt} ‖f‖.    (3.8)

This bound is very interesting in its own right. Furthermore, were it available from another source, T_t could be used in place of V_τ in Section 9.2. This approach would be reminiscent of that of Khintchine (1948, Section 1 of Chapter 3), who postulated properties analogous to (3.8).
DENSITIES. For d₀ < x < d₁, the restriction of the distribution P(t; x, ·) to (d₀, d₁) has a density ξ(t; x, ·) with respect to Lebesgue measure. To describe this density, we need the boundary terminology and the notations 𝒟, 𝒟_i, Γ*, m(x), and p(x) from the proof of Lemma 3.1. Let 𝒟̂_i = 𝒟_i unless d_i is an exit boundary, in which case

    𝒟̂_i = {f ∈ 𝒟 : f(d_i) = 0}.

Let Γ̂ be the restriction of Γ* to 𝒟̂₀ ∩ 𝒟̂₁. If neither d₀ nor d₁ is a natural boundary,

    ξ(t; x, y) = Σ_{n≥1} e^{λ_n t} φ_n(x) φ_n(y) m′(y),    (3.9)

where 0 ≥ λ₁ > λ₂ > ⋯ are the eigenvalues of Γ̂, and φ₁, φ₂, ... are the corresponding eigenfunctions, normalized so that ∫ φ_n² dm = 1 (Elliott, 1955). The series converges uniformly over t ≥ δ and e₀ ≤ x, y ≤ e₁, for any δ > 0 and d₀ < e₀ < e₁ < d₁. A counterpart of (3.9) for natural boundaries is given by McKean (1956). If d_i is not an exit boundary,

    h_i(t; x) = P(t; x, {d_i}) = 0.

For an exit boundary d_i, formulas (3.10) and (3.11) relate h_i to ξ: (3.10) applies if the other boundary is not exit, and (3.11) if both boundaries are exit; in these formulas p₀ = 1 − p₁, and the integrals are over (d₀, d₁). We omit the proofs of these equalities.
10. Steady-State Behavior

10.1. A Limit Theorem for Stationary Probabilities
Assumptions (a), (b), and (c) of Section 8.1 are in force throughout this chapter, and N = 1. We are concerned here with the limit as θ → 0 of stationary probabilities of nonabsorbing or recurrent processes. Typically, the distribution of X_n^θ converges as n → ∞ to a stationary probability that does not depend on the distribution of X₀^θ, but, whether or not this is the case, stationary probabilities represent the possible modes of steady-state behavior of {X_n^θ}_{n≥0}. Let 𝒮_θ be the set of stationary probabilities (with finite fourth moments if I is unbounded) of the transition kernel K_θ of this process. Naturally, we assume 𝒮_θ ≠ ∅.

The processes considered in this chapter are distinguished by the existence of a (critical) point λ in the interior of I, such that E(ΔX_n^θ | X_n^θ = x) is positive when x < λ and negative when x > λ if θ is small. More precisely:

(d) There is an interior point λ of I such that

    w(x) > 0    if x < λ,
    w(x) = 0    if x = λ,
    w(x) < 0    if x > λ,

and w′(λ) < 0.

Theorem 1.1 requires additional assumptions when I is unbounded, as in additive learning models. Henceforth, we assume that, for any θ ∈ J, μ_θ = L(X₀^θ) ∈ 𝒮_θ. Then L(X_n^θ) = μ_θ for all n; in fact, X_n^θ and

    Z_n^θ = (X_n^θ − λ)/√θ

are strictly stationary processes.

THEOREM 1.1. Suppose that (i) I = [c, d] is bounded, or that (ii.1) Q(x, θ) = E((H_n^θ)⁴ | X_n^θ = x) is bounded, and (ii.2) there are A > 0 and B > 0 such that

    (λ − x) w(x, θ) ≥ B

for all θ ∈ J and x ∈ I_θ with |x − λ| > A. Then

    (A) E((X_n^θ − λ)²) = O(θ), and
    (B) Z_n^θ → N(0, σ²) as θ → 0, where σ² = S(λ)/2|w′(λ)|.

Theorem 1.1 under condition (i) is similar to the central limit theorem of Norman and Graham (1968), while (ii) generalizes Theorem 6 of Norman (1970b). Note that (A) implies

    E(X_n^θ) − λ = O(√θ).    (1.1)
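Part (B) can be illustrated by simulation. The sketch below uses a linear model with drift θw(x), w(x) = qc(π − x), which is the noncontingent case treated later in this chapter; the parameter values are illustrative. Here λ = π, |w′(λ)| = qc, S(λ) = q²cπ(1 − π), so σ² = S(λ)/2|w′(λ)| = qπ(1 − π)/2, and the stationary variance of X_n^θ should be close to θσ² for small θ.

```python
import random

theta, q, c, pi = 0.05, 1.0, 0.5, 0.7    # illustrative parameters

def step(x, rng):
    u = rng.random()
    if u < pi * c:                        # "success" event: move up by theta*q*(1-x)
        return x + theta * q * (1.0 - x)
    if u < c:                             # "failure" event: move down by theta*q*x
        return x - theta * q * x
    return x                              # no conditioning event

rng = random.Random(7)
x, xs = 0.5, []
for n in range(205000):
    x = step(x, rng)
    if n >= 5000:                         # discard burn-in
        xs.append(x)
mean = sum(xs) / len(xs)
var = sum((t - mean) ** 2 for t in xs) / len(xs)
sigma2 = q * pi * (1 - pi) / 2.0          # predicted S(lambda)/(2|w'(lambda)|)
print(mean)          # close to lambda = pi = 0.7, as in (1.1)
print(var / theta)   # close to sigma2 = 0.105, as in part (B)
```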
An improved estimate, valid under (i), (ii.1), and additional conditions, is obtained in Section 10.3. Note also that (B) differs only notationally from (8.5.7).

The five-operator linear model with θ_ij = θq_ij and with π_ij and c_ij fixed provides the simplest example of a family X_n^θ of processes to which Theorem 1.1 can be applied. We have already noted that it satisfies (a), (b), and (c) of Section 8.1. For this model

    E(ΔX_n | X_n = x) = θw(x),

where w is a quadratic polynomial with w(0) = q₀₁π̃₀₁ and −w(1) = q₁₀π̃₁₀, in which π̃_ij = π_ij c_ij. If both of these quantities are positive, (d) is satisfied. Furthermore, neither 0 nor 1 is absorbing, and it will be shown in Section 12.1 that there is a unique stationary probability μ_θ, to which L(X_n^θ) converges as n → ∞, regardless of L(X₀^θ). Since I = [0, 1], all of the conditions of (i) of Theorem 1.1 are met.

10.2. Proof of the Theorem
As in Chapter 8, we denote a variety of bounds by K. Proof of (A) under (i). Clearly
(xn+l-1)2 = (Xn-1)2
+ 2(Xn-1)dXn + (dXn)2,
so, taking expectations on both sides, canceling E((X,,-A)’) on the left and right, and dividing by 20 we obtain
o = E((x,,-1) W ( X , , , el) + e E p ( x , ,, q / 2 .
Since S(x, 0) < K by (c) of Section 8.1,
0
0)) + KO.
(2.1)
< E((Xn-A) (w(xn, 0) - w(X,)))+ KO < KO,
(2.2)
< E((X,,-A)w(X,,
Thus E((A-Xn) w(Xn))
by (a.2) and the boundedness of I. Let
Since p is positive and continuous by (d), and I is compact, there is a k > 0 such that p ( x ) > k for all x E I. Thus (A-X,,)w(X,,) = (l-X,,)Zp(X,,) > k(X,,-A)’, which yields (A) on substitution into (2.2).
I
Proof of (A) under (ii). Since, by (ii.2),
(Xn-1)w(Xn, 0) < 0 a s . for IX,,-Al > A, (2.1) yields
o < E((x,,-+(x,,, e ) q + K e ,
where
G = IIX,-A16A.
(2.3)
155
10.2. PROOF OF THE THEOREM
Thus
~((1 - x,,)w (x,,)G) G E((x,,- 1)(W (x,,, e) - W (x,,)) G)
< KO.
But (2.3) is valid for IX,,-1l < A , so a = E ( ( ~ zx,,)~G) G
+ Ke
~e .
(2.4)
The binomial expansion of ((X,,-1)+AX,,)4 is (X,,,
- 4 4
=
+ 4(Xn-1)3dXn + 6(X,,-1)2(AX,,)2 + 4(X,,-1)
(Xn-L)4
+ (AX,J4.
(2.5)
156
10. STEADY-STATE BEHAVIOR
Now (xn-43W(x,,,
e)
e)((i-c)+c)
=( ~ , , - 4 3 ~ ( ~ , , ,
< - ~ ( x ~ - , q ~ ( i+-( -x ,c, -)4 3 ( W ( x n , e ) -
by (ii.2) and (d), so that (X,,-43w(X,,, 8)
< -B(X,,-A)’(l-G)
W(X,,))G
+ KO(X,,-A)’
by (a.2) of Section 8.1. This inequality and (2.7) yield B/3 = BE((Xn-A)’(1-G))
Thus
< KOE((X,,-I)’) + KO3.
B/3 < K8(a+/3) G KOp
+ KO2
by (2.4), and B/3 < BPI2
for 8 < 6 = B/2K. Thus
+ KO3
+ KO’
/3< KO2. Adding this to (2.4), we obtain
(A).
I
Part (B) of Theorem 1.1 is proved by noting that, as a consequence of (A) and Lemma 8.5.1, Lemma 2.1 below applies to z.”. LEMMA2.1. For every 8 E J, let {z.”},,,~ be a real valued stationary process. Suppose that
and a.s., where a < 0 and as 8+0. Then as 8+0, where
(i’
= b/21al.
This lemma is a steady-state analog of Lemma 8.4.2. Both results apply to non-Markovian processes. The proof of Lemma 2.1 is much simpler than that of Lemma 8.4.2.
157
10.3. APPRO XIMA TION TO E(X.9
Proof.
As in the proof of Lemma 8.4.2, E(eiodzn(z,,)= 1
where Ikl <&;thus he(w) = J7(eiOZn+l) = he(o)
+ iwE(dz,,(z,,)
-o2E((dzJ2IZn)/2+ kIw13E(Idzn13l z n ) , = E(eimZnE(eiodz nIzn))
+ iwE(eioZmE(dz,,lz,,))
+
-w2E (eioZn ~ ( ( d z , , ) ~ l z . ) ) / 2( w ( 3 ~ ( e i o z n k ~ ( ( 1d2z.))., , ( 3
Canceling he(w) and dividing by w we obtain
0 = iE(eioZnE(dznlz,,)) - oE(ei"z~E((dzn)2~zn))/2 +w2E(eio'nkE()dzn13I., ) . Thus, by (2.8),
0 = itlaE(eioznz,,)- w8bE(eiwzn)/2
+ E(o(e))+ o E ( o ( e ) )+ wZE(o(e))
[the first and last equations in (2.8) imply that E(lz,,l)< 00 when 8 is small], or
0 = (d/dw)he(o) + wo2he(o)+ &(a),
where as 8
-
6 = suple(o)l/(l+w2) 0
0. Since he(0) = 1, it follows that he(o) = exp(-0202/2) (I -
But
+0
I[
exp(02x'/2)~(x)dxl < 6
as 8-0;
exp(02x2/2)r(x)dx).
lo'
exp(02x2/2)(1+x2)dx
thus he(o)-+exp(-0202/2), and z:
-
N(0,02).
+0
I
10.3. A More Precise Approximation to E(X,,B)
THEOREM 3.1. (ii.l), then
If the assumptions of Theorem l . l ( i ) are augmented by E((x:-A)~) =
o(e2).
(3.1)
158
10. STEAD Y-STATE BEHA VIOR
(3.2)
uniformly ouer x
E
lo,where u and w n are Lipschitz, then
E(x~= ) A
+ ey + o(e),
where y = (aZw”(A)/2
(3.3)
+ u(A))/lw’(L)I.
Proof. The derivation of (2.7) is valid under our present assumptions (integrability is not a problem here). Thus E((A-Xn)’w(Xn))
< E((Xn-A)3 ( w ( X n , 0) - w(Xn>)) +KOE((X,,-A)’)
Using (2.3) on the left, and (a.2) and IX,,-Al E((X, - A)“)
+ KO3.
< K on the right, we obtain
< KBE((X, - A)’) + KO3 < KO’
by (A) of Theorem 1.1. To prove (3.3), we note first that
o = E(dx,,)/e = E ( ~ ( x , ,e)) ,
+ eqU(x,,))+ o(e).
= E(~(X,,))
Also
(3.4)
159
10.3. APPROXIMATION TO E ( X / )
Since E(z.4) < K for all 0 by (3.1), (B) of Theorem l.l(i) implies that E(z,Z)-+a2as 8 4 0 (Lo&ve, 1963, Corollary to B, p. 184). Thus
+
I W ~ A ) I ( E ( X , , ) - A=) e [ w y ~ ) a ~ / 2 + ~ (o(e). ~)1
I
Let he(w)= E(eiwzn). Then
(d/dw)he(0)= i ~ ( z , , = ) iJer
+o ( f l
by (3.3). A heuristic derivation of asymptotic expansions of he(w) and its w derivatives in powers of f l has been given by Norman (1968b, Section IILA), under hypotheses somewhat more restrictive than those of Theorem 3.1. It is easy to see that Theorem 3.1 is applicable to the family of linear models considered in the last paragraph of Section 10.1. Since w(x,O) = w(x) for these models, u(x) = 0. The application of Theorem 3.1 that follows is of a rather different kind. Consider a model for simple learning with two responses (see Subsection A of Section O.l), whose state variable x represents A , response probability. Assuming only that all response-outcome pairs A i Oj are equally effective for learning (see the paragraphs on “Symmetry”), and incorporating a stepsize parameter 0 > 0, we obtain the equations
L
t’F(Xn)
AX,, = -BF(l-X,) where F(x) 2 0 for all 0 < x Cij
< 1, and
if OlnCln, if OOnCln,
(3.5)
if Con,
= P ( C l , ~ A i , O j n= ) c
>0
for all i and j . Suppose that outcomes are noncontingent; that is, nI1= nol= n, where 0 72 1. Thus the three rows in (3.5) have probabilities nc, (1 - n) c, and 1 - c, respectively. Five-operator linear models with Oij = 8q and q > 0 satisfy ( 3 . 9 , with
-= -=
F(x) = q ( l - x ) . In this case,
E(dXnIXn) = @c(n-Xx,),
so that
AE(x,,) = eqc(n-E(x,,)) and
(3.6)
160
10. STEADY-STATE BEHAVIOR
Thus probability matching, x,
(A,@
= lim E(X,,) = n , n- m
(3.7)
is predicted for all 8 and A. The question arises whether there are functions besides (3.6) that predict (3.7), or various generalizations such as xm(n,e)=
or xm(.,e) =
IT
+ o(i)
(3.8)
+ o(e)
(3.9)
for all 0 < A < 1 . We will show that, within a wide class of functions to be described shortly, only those of the form (3.6) predict (3.9) [or (3.7), which is a special case]. There are, however, other functions that predict (3.8). In order to keep X,, within [0,1], it is certainly necessary to assume F(l) = 0, and, to meet the smoothness conditions of Theorem 3.1, we assume that F has a bounded third derivative. The final and most important condition concerning F is that F’(x) < 0 for all 0 < x < 1 . If 0 < x < 1, let G ( x ) = F(x)/(l - x )
.
COROLLARY. Under these conditions, (3.8) holds for some 0 < IT < 1 i f and only if G(A) = G(l-z),
(3.10)
and (3.9) is satisfied for all 0 < IT < 1 if and only iYF is given by (3.6).
The result concerning (3.8) is similar to Theorem 4 of Norman and Yellott (1966). Proof. We first show that, if 8 Q K = IF’I-’, then the model with events 0, C,, OoC , , and Co is distance diminishing, and its state sequences {X,,}n30are regular compact Markov processes. Let ul(x) = x
+ BF(x)
and
uo(x) = x - OF(1 - x )
be the operators for the events 0,C , and 0,C , . Then, for all 0 < x < 1,
0
< u ~ ( x<) X+81F’1(1-X) < 1 ,
and similarly, 0 < uo(x) Q 1, so the model is at least well defined. Furthermore, 1 > u;(x) = 1
+ OF’@) 2 1 - fllF’I 2 0,
so, since u,’ is continuous, 1u,’l < 1. The same is true of u,; thus, in the
161
10.3. APPROXIMATION TO E(X.9
notation of Proposition 1 of Section 2.1, < 1. Since p ( x , 0 , C , ) = RC > 0 for all x, this proposition shows that the model is distance diminishing. Moreover, if u y ) is the nth iterate of u, , u y ) ( x )E o,,(x) for all n 2 0, and u y ) ( x )-, 1 as n -,00 for all 0 < x < 1, so X,, is regular by Theorem 3.6.1. Thus there is a unique stationary probability PO, and U ( X , , ) + p e as n-, 00, so that x, (n,8) =
J- xpe (dx)
It is easy to see that all of the hypotheses of Theorem 3.1 are satisfied. To check (d) of Section 10.1, for example, we note that w ( x ) = w,(x)
=
F(x)nc-F(l-x)(l-n)c;
(3.1 1 )
thus w ( 0 ) = F(0) nc > 0, w(1) = -F(0)(1 -n)c < 0, and w’(x) = F ’ ( x ) n c + F ’ ( 1 - x ) ( 1 - n ) c
<0
(3.12)
for all O<x< 1. By (3.3) [or ( 1 . 1 ) ] , x , ( n , O ) + L = l ( r r ) as 8-0, so (3.8) is equivalent to L(n) = n or w,(n) = 0. In view of (3.1 l ) , this means that F(7t)n = F ( l - n ) ( l - n ) ,
which is equivalent to (3.10). We have already seen that (3.6) implies (3.7), hence (3.9). Suppose now that (3.9) is satisfied for all 0 < n < 1. Then G ( x ) = G(1 -x), so (3.1 1 ) yields W(X) =
G ( x )( I L - X ) C .
(3.13)
Furthermore, y = y(n) = 0, by (3.3). Since u ( x ) = 0, and S(n) = F Z ( n ) n c + F 2 ( 1 - n ) ( l - n ) c
>0
so that oz>O, this implies w”(n)=O. But w”(n)= - 2 G ( n ) c , by (3.13); therefore G’(n)=O. It follows that there is a positive constant q such that G(x) = q or F ( x ) = q ( l -x) on (0, I). Since both F ( x ) and q ( l - x ) are continuous at 0 and 1 , (3.6) holds throughout [0,1]. It is not difficult to display functions F, other than those in (3.6), that satisfy (3.10) and thus (3.8) for all 0 < 71 < 1. Let F ( x ) = G ( x ) ( l -x), where C ( X )= k
+ H ( x ( 1-x)),
k > 0, H ( y ) 2 0, H has three bounded derivatives in [0,4], and 0 < ]H’I < k.
162
10. STEADY-STATE BEHAVIOR
Any such function is of the type considered in the corollary, for IG'(x)) < IH'I
< G(x),
and thus F'(x) = G ( x ) (1-x) - G ( x ) < 0. Since H is not constant, neither is G, but G ( x ) = G(l - x ) .
11 0 Absorption Probabilities
11.1. Bounded State Spaces In this chapter we study real Markov processes that are attracted to the upper and lower boundaries of their state spaces, and attempt to approximate the asymptotic mass at the upper boundary. Let I = [c, d ] be a closed bounded interval. For every T E J, K, is a transition kernel in a Bore1 set I, such that
and Ir * - I, n ( c , d ) #
0.
Corresponding Markov processes are denoted X i . All states of I,' v I,are absorbing, and, for any X E I,*, the asymptotic distribution of X i when X,l = x is concentrated on these absorbing states:
164
I I. ABSORPTION PROBABILITIES
weakly as n+
00,
where
Kpo(x,I,' u IT--) = 1. [Of course K!")(x,.) = a,(.)
= Km (x,
for x
a )
4 r ( x ) = KpO (x, 1,'
(1.3)
u IT-.]Let
E I,'
)
be the asymptotic probability of I,'. Suppose further that there are functions a ( x ) and b(x) > 0 on ( c , d ) such that
+ E((dX,I)*1Xi = x ) = rb(x) + E(dXiIX: = x )
=
ra(x)
~ ( t ) , O(?),
(1.4)
and
E(ldX;('IX;= x ) = o ( r ) , where the quantities
O(T)
satisfy
6
= sup lo(r)/rb(x)l + 0 =I,*
as r+O. Suppose finally that the function
r(x) = W x ) / b ( x ) on ( c , d ) can be extended to a continuously differentiable function on I. Note that, although (1.4) has the same form as (9.1.1), our conditions on o(r), a, and b differ from those in Section 9.1. Let v/ be the solution of v/"(x)+ r(x)v/'(x)= 0,
(1.6)
such that ~ ( c = ) 0 and v/(d)= 1. The following theorem combines results of Khintchine (1948, Section 2 of Chapter 3) and Norman (1971a, Theorem 6.1).
THEOREM 1.1. Under these conditions,
l4r-v/l
=
suqI+r(x)-v/(x)I + O XI,
as t + O .
Let U = U, and U" = Urmbe the transition operators corresponding to K, and Krm.If y(x) is a bounded continuous function on I,, lim U:y(x) = Urmy(x)
n-r m
for all x
E
I, by (1.2). Since IU:y(x)I is bounded by the supremum of Iy(x)l
165
11.1. BOUNDED STATE SPACES
over I,, lim U:y(x) = lim U, U:y(x) = U, UTmy(x)
n- m
n-
00
by the bounded convergence theorem, so that
u,Oo~ (XI = ur ~ P " (XI Y * If y ( x ) is 1 on I,' and 0 on I;, U,Ooy(x)= $,(x); thus
u,4, (x)
=
4 r (x)
*
This equation is closely related to (1.6). Expanding 4 = 4, formally about
x E I,* and using (1.4)we obtain
0 = E(4(Xn+1) = za(x) &(x)
or, dividing by rb(x)/2, +"(x)
- 4(x)lxn = X) (XI + z b$"(x) + o(z) , 2
+ r ( x )#(x)
=
o(1).
Since 4(x) is 1 on I,' and 0 on I,-, the conclusion of Theorem 1.1 appears reasonable. We present some applications before giving a proof. Theorem 1.1 applies to the family of linear models with small drift considered at the end of Section 9.1. In that case, I, = I= [0,1], so I: = {I} and I,- = (0). As we shall see in Section 12.1, Px(X:-+O or X i + 1) = 1
for all 0 < x < 1, which implies (1.2) and (1.3). The functions a and b for (1.4) are given in (9.1.9), and r ( x ) = -2k is certainly continuously differentiable on I, even though b ( i ) = 0. The remaining assumption (1.5) is easily verified. As was noted at the end of Chapter 0,
w (x)
- 1)
= (eZkX- l)/(ezk
for k # 0. The following application is of the type considered by Khintchine in the section of his monograph cited above. Suppose that, for every T E J, K, is a transition kernel in a Bore1 set I, satisfying (1.1). It is not assumed that states of I,' u I,- are absorbing, but, rather, that X: eventually enters this set wherever it starts: P,(X:
E I,'
u 1,- for some n 2 0) = 1 .
Let N = N' be the first index n such that X,t
E I,'
(1.7)
u I,-. Thus XN' is the
166
ABSORPTION PRO BABILl TIES
I I.
first point of I,‘ u I,- to be reached, and wr(x) = Px(X,la 4 is the probability that X,l goes above d before it goes below c. Suppose that K , satisfies (1.4), where a and b are continuously differentiable and b ( x ) > 0 throughout [c, d], and the terms O ( T ) satisfy
[It follows that r ( x ) = 2a(x)/b(x) is continuously differentiable throughout I, and (1.5) holds.] COROLLARY.Under these conditions, Iw,-yyI of (1.6) with ~ ( c=) 0 and ~ ( d=)1.
+ 0,
where ty is the solution
This is proved by modifying K, to make states of I,‘ u I,- absorbing, and applying Theorem 1.1. Clearly, Y: = Xi,, (n A N = min {n,N}) is a Markov process in I,, with kernel
For n 2 N , Y,l = X,l; thus, for all Bore1 subsets A of I,, L y ( X , A ) = Px(Y,l E A ) + P X ( X iE A ) = L,m(X,A).
It follows that L, satisfies (1.2) and (1.3), and wr
(x) = LP“( x , 1,’
*
Since K,(x, .) = L,(x, .) on I,*, L, satisfies (1.4), where, as was noted above, (1.5) holds. Thus L, satisfies all of the assumptions of Theorem 1.1. Proof of Theorem 1.1.
Let f = f , be the solution of
f ” ( x ) + r(x)f’(x) = E
in I, such that f ’ ( c ) = 1 and f ( c ) = 0; i.e.,
f ( x ) = l’f’(Y)dY, where f ’ ( x ) = e-B(x)( l + ~ [ e ’ ( ~ ) d y )
and
167
11.1. BOUNDED STATE SPACES
Let
I E ~ be
sufficiently small that f ' ( x ) > 0 for all x E I. Then f(d) > 0, so
g(x) = g,(x) = f ( x ) / f ( d ) is the solution of
g"(x)
+ r(x)g'(x) = Elf(d) = E"
(1.9)
in I, such that g(c) = 0 and g ( d ) = 1. Since g' and r are differentiable on I, g" is too, as a consequence of (1.9). Differentiating (1.9) we get
+
~ ( ~ ' ( x r'(x)g'(x) )
+ r(x)g"(x) = 0 ,
from which we see that g ( 3 ) is continuous throughout I. It is easily shown of g. to R' such that that, for any a > 0, there is an extension h =
(1.11) for x < c. (If I, c I, this extension is unnecessary, and the proof can be simplified slightly.) For j = 1,2, let
dj = SUP Ih"'(x)I XSl
.
Let U = U , be the transition operator with kernel K,. From (1.4) we get
where
101 < 1, for x E I,*.
Division by rb(x)/2 yields
where ai is a function only o f t , such that ai-t0 as r+O. Suppose now that E # 0, so that E" # 0. If r is sufficiently small 4hat the quantity on the right , Uh(x)-h(x) has the same sign as is less than [we write r < r ( ~ , a ) ]then F. on I,*. Since Uh(x) = h ( x ) on I,' u It-, we conclude that
for all x E I, and
7
Uh(x) 2 h(x)
if
E
> 0,
Uh(x) < h ( x )
if
E
< 0,
< r(c,a).
168
11. ABSORPTION PROBABILITIES
Applying the positive operator U^n to both sides, we see that U^n h(x) is a monotonic sequence; hence

U^n h(x) ≥ h(x) if ε > 0,   U^n h(x) ≤ h(x) if ε < 0.

Since h is continuous, (1.2) implies that

U^∞ h(x) ≥ h(x) if ε > 0,   U^∞ h(x) ≤ h(x) if ε < 0,   (1.12)

where U^∞ = U_τ^∞ is the transition operator with kernel K_τ^∞. As a consequence of (1.3),

U^∞ h(x) − φ_τ(x) = ∫_{I_τ⁺} (h(y) − 1) K_τ^∞(x, dy) + ∫_{I_τ⁻} h(y) K_τ^∞(x, dy);

thus

|U^∞ h(x) − φ_τ(x)| ≤ α K^∞(x, I_τ⁺ ∪ I_τ⁻) = α

by (1.10) and (1.11). Henceforth, we assume x ∈ I_τ*, so that h(x) = g_ε(x).
By (1.12),

φ_τ(x) = U^∞ h(x) + (φ_τ(x) − U^∞ h(x)) ≥ g_ε(x) − α   if ε > 0,

and

φ_τ(x) ≤ g_ε(x) + α   if ε < 0,

for τ < τ(ε, α). Finally,

φ_τ(x) − ψ(x) ≥ −|g_ε(x) − ψ(x)| − α   if ε > 0,
φ_τ(x) − ψ(x) ≤ |g_ε(x) − ψ(x)| + α   if ε < 0;

hence, taking ε > 0,

|φ_τ − ψ| ≤ |g_ε − ψ| ∨ |g_{−ε} − ψ| + α

for all τ < τ(ε, α) ∧ τ(−ε, α). But ψ = g_0 and |g_ε − g_0| → 0 as ε → 0, so the right-hand side can be made arbitrarily small by choosing ε and α sufficiently small. Therefore |φ_τ − ψ| → 0 as τ → 0. ∎
11.2. Unbounded State Spaces
In this section, we treat processes in unbounded state spaces. For every τ ∈ J, K_τ is a transition kernel in a Borel subset I_τ of R^1 with

sup I_τ = ∞   and   inf I_τ = −∞.

As usual, X_n^τ is a corresponding Markov process. In place of (1.2) and (1.3), it is assumed that

P_x(X_n^τ → ∞ or X_n^τ → −∞ as n → ∞) = 1

for all x ∈ R^1. Let

φ_τ(x) = P_x(X_n^τ → ∞),

and suppose that

lim_{x→−∞} limsup_{τ→0} φ_τ(x) = lim_{x→∞} limsup_{τ→0} (1 − φ_τ(x)) = 0.   (2.1)

We assume that (1.4) holds, where a and b are continuously differentiable and b(x) > 0 throughout R^1, and, in place of (1.5), that the corresponding remainder condition holds uniformly on |x| ≤ k as τ → 0, for each k < ∞. Finally, we assume that

ν = ∫_{−∞}^{∞} e^{−B(x)} dx < ∞,

where B(x) = ∫_0^x r(y) dy. Let ψ be the solution of (1.6) with ψ(−∞) = 0 and ψ(∞) = 1; i.e.,

ψ(x) = ∫_{−∞}^{x} e^{−B(y)} dy / ν.

Theorem 2.1 is completely analogous to Theorem 1.1.

THEOREM 2.1. Under these conditions,

|φ_τ − ψ| = sup_{x ∈ I_τ} |φ_τ(x) − ψ(x)| → 0

as τ → 0.
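Theorem 2.1 identifies the limiting absorption probability with the explicit integral ψ(x) = ∫_{−∞}^x e^{−B(y)} dy / ν. As a numerical illustration (a sketch, not from the text, with the assumed choice r(x) = 2x, so that B(x) = x², ν = √π, and ψ has the closed form (1 + erf(x))/2):

```python
import math

def psi(x, lo=-8.0, n=20000):
    """Trapezoidal approximation of psi(x) = (1/nu) * int_{-inf}^{x} e^{-B(y)} dy,
    for the assumed drift-to-variance ratio r(y) = 2y, i.e. B(y) = y**2."""
    nu = math.sqrt(math.pi)                 # closed-form normalizer for this B
    h = (x - lo) / n
    ys = [lo + i * h for i in range(n + 1)]
    vals = [math.exp(-y * y) for y in ys]   # e^{-B(y)}; tail below lo is negligible
    integral = h * (sum(vals) - 0.5 * (vals[0] + vals[-1]))
    return integral / nu

# Agreement with the closed form (1 + erf(x)) / 2:
for xx in (-1.0, 0.0, 1.0):
    assert abs(psi(xx) - (1 + math.erf(xx)) / 2) < 1e-5
```

The truncation point lo = −8 and the choice of r are illustrative assumptions only; any r with ∫ e^{−B} < ∞ fits the theorem's hypotheses.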
This result will be applied to additive learning models in Chapter 14.
Proof. For any positive constant k, let N be the first index n for which |X_n| ≥ k, and let

φ_τk(x) = P_x(X_N ≥ k).

Also, let ψ_k(x) be the solution of (1.6) on |x| ≤ k for which ψ_k(k) = 1 and ψ_k(−k) = 0; i.e.,

ψ_k(x) = (ψ(x) − ψ(−k)) / (ψ(k) − ψ(−k)).

Take ψ_k(x) = 1 for x > k and ψ_k(x) = 0 for x < −k. It follows from the corollary to Theorem 1.1 that

α_τk = |φ_τk − ψ_k| → 0   as τ → 0.   (2.4)

And it is easily shown that

β_k = |ψ_k − ψ| → 0   as k → ∞.   (2.5)

Suppose now that

γ_k = limsup_{τ→0} |φ_τ − φ_τk| → 0   as k → ∞.   (2.6)

Clearly

|φ_τ − ψ| ≤ |φ_τ − φ_τk| + α_τk + β_k,

so that

limsup_{τ→0} |φ_τ − ψ| ≤ γ_k + β_k

by (2.4), and the theorem would follow from (2.5) and (2.6). Thus it remains only to demonstrate (2.6). Clearly

|φ_τ(x) − φ_τk(x)| ≤ P_x(G_τ − G_τk) + P_x(G_τk − G_τ),   (2.7)

where G_τ and G_τk are the events

G_τ = (lim_{n→∞} X_n = ∞)   and   G_τk = (lim_{n→∞} X_{n∧N} = ∞).
Similarly, P_x(G_τk − G_τ) admits a corresponding bound. Hence, by (2.7), |φ_τ(x) − φ_τk(x)| is bounded by terms involving φ_τ near ±k. But (2.1) implies that the right-hand side converges to 0 as k → ∞, so (2.6) is proved. ∎
Part III. SPECIAL MODELS
12. The Five-Operator Linear Model
Let us first recall the notation of Section 0.1 for two-choice simple learning experiments. There are two responses, A_1 and A_0, each of which can be followed by an outcome O_j that reinforces A_j, and conditioning is effective (C_1) or ineffective (C_0). Occurrences on trial n are indicated by adding a subscript n. The reinforcement schedule and effectiveness of conditioning parameters are

π_ij = P(O_jn | A_in)   and   c_ij = P(C_1n | A_in O_jn).

Let

Π_ij = π_ij c_ij = P(O_jn C_1n | A_in).

Technically, occurrences like A_in are subsets of an underlying sample space. It is convenient to use the same notation for their indicator random variables. Thus

A_1n = 1 if A_1n occurs, and A_1n = 0 if A_0n occurs.
It will always be clear from the context whether a set or its indicator is meant.

For the five-operator linear model, the state random variable X_n can be interpreted as a subject's probability of A_1n, and the event random variable is

E_n = (A_1n, O_1n, C_1n).

The model is defined within the framework of Section 0.2 by the following specifications.

State space: X = [0, 1].
Event space: E = {0, 1} × {0, 1} × {0, 1}.
Transformation of x effected by e = (i, j, k):

u(x, e) = x + kθ_ij(j − x).

Probability of e given x:

p(x, e) = x π_1j c_1j      if i = 1, k = 1,
          x π_1j c'_1j     if i = 1, k = 0,
          x' π_0j c_0j     if i = 0, k = 1,
          x' π_0j c'_0j    if i = 0, k = 0,

where x' = 1 − x and c'_ij = 1 − c_ij. More precisely, these relations define a family of models indexed by the parameters θ_ij, π_ij, and c_ij, subject to

0 ≤ θ_ij, π_ij, c_ij ≤ 1   and   Σ_j π_ij = 1.

Two useful auxiliary notations are

θ_i = Σ_j θ_ij Π_ij   and   ω_i = θ_ii' Π_ii'.

Note that ω_1 > 0 if and only if θ_10, π_10, and c_10 are all positive, while θ_1 > 0 if and only if there is some j such that θ_1j, π_1j, and c_1j are positive.

12.1. Criteria for Regularity and Absorption

The quantities θ_i determine whether or not the model is distance diminishing with respect to the natural metric d(x, y) = |x − y| on X.
THEOREM 1.1. The model is distance diminishing if and only if θ_0 > 0 and θ_1 > 0.
Proof. If θ_i > 0, there is a j such that θ_ij > 0, π_ij > 0, and c_ij > 0. Hence, in the notation of Proposition 1 of Section 2.1, l(u(·, e)) = 1 − θ_ij < 1 for e = (i, j, 1), and p(x, e) > 0 if x ≠ i'. Thus, if θ_0 > 0 and θ_1 > 0, it follows from this proposition that the model is distance diminishing. If, conversely, θ_0 = 0 or θ_1 = 0, it is not difficult to show that r_k = 1 for all k ≥ 1, so that the model is not distance diminishing. ∎
The qualitative asymptotic behavior of state and event sequences depends on which states, if any, are absorbing. This is determined by the quantities ω_i.

LEMMA 1.1. The state i is absorbing if and only if ω_i = 0. A state x ∈ (0, 1) is absorbing if and only if θ_0 = θ_1 = 0, in which case all states are absorbing and the model is said to be trivial.

In view of Theorem 1.1, a distance diminishing model is not subject to this form of degeneracy.

Proof. If e = (i, i', 1), ω_i > 0 is equivalent to

u(i, e) ≠ i   and   p(i, e) = Π_ii' > 0,

which conditions are necessary and sufficient for i not to be absorbing. A state x ∈ (0, 1) is absorbing if and only if θ_ij Π_ij = 0 for all i and j; i.e., θ_0 = θ_1 = 0. ∎
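The criteria of Theorem 1.1 and Lemma 1.1 are easy to mechanize. The following sketch (illustrative only; the function name and parameter layout are not from the text) computes θ_i and ω_i from the arrays θ_ij, π_ij, c_ij:

```python
# Compute theta_i, omega_i and apply Theorem 1.1 / Lemma 1.1.
# Indexing convention (an assumption of this sketch): theta[i][j], pi_[i][j], c[i][j].

def summaries(theta, pi_, c):
    Pi = [[pi_[i][j] * c[i][j] for j in range(2)] for i in range(2)]   # Pi_ij
    th = [sum(theta[i][j] * Pi[i][j] for j in range(2)) for i in range(2)]
    om = [theta[i][1 - i] * Pi[i][1 - i] for i in range(2)]            # omega_i
    return {
        "theta": th,
        "omega": om,
        "distance_diminishing": th[0] > 0 and th[1] > 0,               # Theorem 1.1
        "absorbing_states": [i for i in range(2) if om[i] == 0],       # Lemma 1.1
    }

# Example: success-only learning (theta_ii' = 0), so both 0 and 1 are absorbing,
# yet the model is still distance diminishing.
info = summaries(theta=[[0.1, 0.0], [0.0, 0.1]],
                 pi_=[[0.5, 0.5], [0.5, 0.5]],
                 c=[[1.0, 1.0], [1.0, 1.0]])
assert info["absorbing_states"] == [0, 1]
assert info["distance_diminishing"]
```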
Since X is compact, a state sequence X_n of a distance diminishing five-operator linear model is a compact Markov process (Proposition 1 of Section 3.3). Except for one degenerate case, such state sequences are either regular or absorbing processes. We recall that, for a regular process, the distribution D(X_n) of X_n converges to a limit μ that does not depend on D(X_0), while, for an absorbing process, X_n converges almost surely (a.s.) to a random absorbing state. For further discussion of regular and absorbing compact Markov processes, see Section 3.6.

THEOREM 1.2. If either 0 or 1 is an absorbing state for a distance diminishing linear model, then X_n is an absorbing process.

THEOREM 1.3. If neither 0 nor 1 is absorbing, then the model is distance diminishing, and either (a) ω_0 = ω_1 = 1, or (b) X_n is regular.
The proof of Theorem 1.2 is based on the following lemma. The set of possible values of X_n when X_0 = x is denoted σ_n(x).

LEMMA 1.2. If θ_i > 0 and i is absorbing, then d(σ_n(x), i) → 0 as n → ∞, for all x ≠ i'.

Proof. Since θ_i > 0 and ω_i = 0, θ_ii > 0 and Π_ii > 0. Let x^0 = x and

x^n = u(x^{n−1}, (i, i, 1)).

Assuming, inductively, that x^{n−1} ∈ σ_{n−1}(x), and noting that x^{n−1} ≠ i', we see that

x^n ∈ σ_1(x^{n−1}) ⊂ σ_n(x).

Thus x^n ∈ σ_n(x) for all n ≥ 0, and

d(σ_n(x), i) ≤ |x^n − i| = (1 − θ_ii)^n |x − i| → 0. ∎

Proof of Theorem 1.2. Suppose that i is absorbing. (1) If i' is absorbing too, then d(σ_n(i'), i') = 0 for all n. In conjunction with Lemma 1.2, this implies that X_n is absorbing, according to Theorem 3.6.2. (2) If i' is not absorbing, then there is an x ≠ i' such that x ∈ σ_1(i'). So σ_{n−1}(x) ⊂ σ_n(i'), and

d(σ_n(i'), i) ≤ d(σ_{n−1}(x), i) → 0

by Lemma 1.2. Thus, again, Theorem 3.6.2 is applicable. ∎
Proof of Theorem 1.3. If neither endpoint is absorbing, then θ_i ≥ ω_i > 0 for i = 0, 1, so that the model is distance diminishing by Theorem 1.1. Suppose now that ω_i < 1. We will show that d(σ_n(x), i') → 0 as n → ∞ for all x ∈ X, so that regularity follows from Theorem 3.6.1. Consider first the case x ≠ i'.

(1) If θ_ii' < 1, let x^0 = x and

x^n = u(x^{n−1}, (i, i', 1)).

Since x^{n−1} ≠ i', x^n ∈ σ_1(x^{n−1}). By induction, x^n ∈ σ_n(x). But θ_ii' > 0, so

d(σ_n(x), i') ≤ d(x^n, i') = (1 − θ_ii')^n |x − i'| → 0.

(2) If Π_ii' < 1, there is an event e such that

|u(x, e) − i'| ≥ |x − i'|

and p(x, e) > 0, so that u(x, e) ∈ σ_1(x). Therefore x^n = u(x^{n−1}, e) ≠ i' and x^n ∈ σ_n(x). Thus if θ_ii' = 1, so that i' ∈ σ_1(x^{n−1}), we have i' ∈ σ_n(x) for all n ≥ 1.

Finally, consider x = i'. Since i' is not absorbing, there is a y ≠ i' in σ_1(i'). Thus σ_n(i') ⊃ σ_{n−1}(y), and

d(σ_n(i'), i') ≤ d(σ_{n−1}(y), i') → 0

as n → ∞ by the previous case.

When ω_0 = ω_1 = 1, the process X_n moves to 0 or 1 on its first step and thereafter alternates between these two states. Clearly the process is ergodic, with ergodic kernel {0, 1} and period 2. Such cyclic models are of no interest psychologically.
12.2. The Mean Learning Curve

The mean learning curve

x_n = P(A_1n) = E(X_n)

is the traditional starting point for both mathematical and empirical studies of models for learning in two-choice experiments. The standard tactic for obtaining information about x_n is to use the equation

Δx_n = E(W(X_n)),   (2.1)

where

W(X_n) = E(ΔX_n | X_n),

to obtain a relation between x_n and x_{n+1}. Clearly,

E(A_1n ΔX_n | X_n = x) = θ_11 xx' Π_11 − θ_10 x² Π_10 = θ_1 xx' − ω_1 x,
E(A_0n ΔX_n | X_n = x) = −θ_00 xx' Π_00 + θ_01 x'² Π_01 = ω_0 x' − θ_0 xx'.   (2.2)

These equations are at the root of most of the computations in this chapter. Adding them, we obtain

W(x) = (θ_1 − θ_0)xx' − ω_1 x + ω_0 x'.

Let

δ = θ_1 − θ_0,   ω = ω_1 + ω_0,

and, if ω > 0 (i.e., 0 and 1 are not both absorbing),

λ = ω_0/ω.
In terms of these quantities,

W(x) − δxx' = ω_0 x' − ω_1 x = ω(λ − x) if ω > 0, and = 0 if ω = 0.   (2.3)

Substitution into (2.1) then yields

Δx_n − δE(X_n X_n') = ω_0 x_n' − ω_1 x_n = ω(λ − x_n) if ω > 0, and = 0 if ω = 0,   (2.4)
which is the relation sought.

We first consider the asymptotic A_1 response probability

x_∞ = lim_{n→∞} x_n.
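Relation (2.4) rests on the drift identity W(x) = δxx' + ω_0x' − ω_1x. A quick check of this identity, enumerating all eight events (i, j, k) of the model (the parameter values below are hypothetical):

```python
# Verify that the one-trial drift obtained by enumerating every event (i, j, k)
# equals W(x) = delta*x*x' + omega0*x' - omega1*x.  Parameters are made up.

theta = [[0.05, 0.08], [0.07, 0.10]]   # theta[i][j]
pi_   = [[0.6, 0.4], [0.3, 0.7]]       # pi_[i][j], rows sum to 1
c     = [[0.9, 0.8], [0.7, 0.6]]       # c[i][j]

Pi = [[pi_[i][j] * c[i][j] for j in range(2)] for i in range(2)]
th = [sum(theta[i][j] * Pi[i][j] for j in range(2)) for i in range(2)]
omega = [theta[i][1 - i] * Pi[i][1 - i] for i in range(2)]
delta = th[1] - th[0]

def drift_by_enumeration(x):
    total = 0.0
    for i in range(2):
        p_resp = x if i == 1 else 1 - x            # P(response A_i | X = x)
        for j in range(2):
            for k in range(2):
                p_eff = c[i][j] if k == 1 else 1 - c[i][j]
                p = p_resp * pi_[i][j] * p_eff      # p(x, e)
                total += p * k * theta[i][j] * (j - x)   # Delta X for event e
    return total

def W(x):
    return delta * x * (1 - x) + omega[0] * (1 - x) - omega[1] * x

for x in (0.0, 0.25, 0.5, 0.9):
    assert abs(drift_by_enumeration(x) - W(x)) < 1e-12
```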
Later in the section we return to the problem of computing or approximating x_n. It is assumed that the model is distance diminishing and (in the case of no absorbing states) noncyclic, so that the limit x_∞ exists.

If both 0 and 1 are absorbing states, the process X_n is absorbing. Thus the probability is 1 that X_n converges to either 0 or 1, and

x_∞ = P(X_n → 1)

depends on the distribution of X_0. The only case in which x_∞ is known exactly is δ = 0. Then (2.4) gives Δx_n = 0, so that x_∞ = x_0. An approximation to x_∞ when δ, θ_1, and θ_0 are small is given in Section 12.4. If i is the only absorbing state, X_n → i a.s., and x_∞ = i.

The quantity x_∞ is of particular interest when there are no absorbing states. Let

Ā_1n = (1/n) Σ_{m=0}^{n−1} A_1m

be the proportion of A_1 responses in the first n trials. The corollary to Theorem 6.1.1 implies that Ā_1n → x_∞ a.s., and that

√n (Ā_1n − x_∞) is asymptotically N(0, σ²),   (2.5)

where

σ² = x_∞ x_∞' + 2 Σ_{j=1}^{∞} ρ_j

and

ρ_j = lim_{n→∞} P(A_1n A_1,n+j) − x_∞².

Once σ² has been estimated, this result can be used to construct confidence
intervals for or test hypotheses about x_∞ on the basis of a single subject's proportion Ā_1n of A_1 responses. When a formula for σ² like (3.23) is available, the problem of estimating σ² reduces to that of estimating the parameters that appear therein. A model-free approach to the estimation of σ² is described in Section 5.4. It is worth noting that the quantities x_∞, ρ_j, and σ² do not depend on the distribution of X_0.

Since W is quadratic with W(0) = ω_0 > 0 and W(1) = −ω_1 < 0, this function has a unique root Λ in (0, 1).
THEOREM 2.1. If there are no absorbing states, and θ_01 < 1 or θ_10 < 1, then

λ < x_∞ < Λ   if δ > 0,
λ = x_∞ = Λ   if δ = 0,   (2.6)
λ > x_∞ > Λ   if δ < 0.
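The ordering of λ and Λ in (2.6) can be checked directly: since W(λ) = δλλ', the root Λ of W lies to the right of λ exactly when δ > 0. A sketch with hypothetical parameter values:

```python
# For given delta, omega0, omega1, locate the root Lambda of
# W(x) = delta*x*x' + omega0*x' - omega1*x by bisection (W(0) > 0 > W(1))
# and compare it with lambda = omega0 / (omega0 + omega1).

def lam_and_root(delta, omega0, omega1):
    lam = omega0 / (omega0 + omega1)
    W = lambda x: delta * x * (1 - x) + omega0 * (1 - x) - omega1 * x
    lo, hi = 0.0, 1.0
    for _ in range(200):
        mid = (lo + hi) / 2
        lo, hi = (mid, hi) if W(mid) > 0 else (lo, mid)
    return lam, (lo + hi) / 2

lam, Lam = lam_and_root(0.08, 0.1, 0.2)
assert Lam > lam                     # delta > 0: lambda < Lambda
lam, Lam = lam_and_root(-0.08, 0.1, 0.2)
assert Lam < lam                     # delta < 0: lambda > Lambda
```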
In the symmetric case, θ_ii = θ, θ_ii' = θ*, c_ii = c, c_ii' = c*, and the quantities δ and λ reduce to

δ = (θc − θ*c*)(π_01 − π_10)   (2.7)

and

λ = π_01/(π_01 + π_10).

We noted in Section 0.1 that, for any two-choice simple learning model, this ratio is the value of P(A_1∞) associated with probability matching: P(A_1∞) = P(O_1∞). The quantity Λ is the expected operator approximation to x_∞ (see p. 183). It is especially useful in the case of learning by small steps, as we shall see in Section 12.4.

Proof of Theorem 2.1. Letting n → ∞ in (2.4), we obtain

0 = δα + ω(λ − x_∞);   (2.8)

hence x_∞ − λ = δα/ω, where

α = ∫ xx' μ(dx)

and μ is the stationary probability of X_n. Clearly α ≥ 0. To obtain the relations between x_∞ and λ listed in (2.6), it remains only to show that α > 0.

If α = 0, then μ({0, 1}) = 1. If μ({i}) = 1, then, by Lemma 3.4.3, i is absorbing, contrary to assumption. Hence μ({i}) < 1 for both i, and {0, 1} is the support of μ. Therefore {0, 1} is stochastically closed, according to Lemma 3.4.3. If e = (i, i', 1), ω_i > 0 implies u(i, e) ∈ σ_1(i), and u(i, e) ≠ i. Thus u(i, e) = i'; that is, θ_ii' = 1, for both i. Since we are assuming that this is not the case, we must have α > 0.

Note that

α − x_∞x_∞' = −∫ x² μ(dx) + x_∞² = −v_μ,

where

v_μ = ∫ x² μ(dx) − x_∞²   (2.9)

is the variance of μ. Thus (2.8) yields

0 = δ(x_∞x_∞' − v_μ) + ω(λ − x_∞),

or W(x_∞) = δv_μ. If v_μ = 0, x_∞ would support μ, hence would be absorbing, contrary to assumption. Thus v_μ > 0, and the right-hand relations in (2.6) follow immediately. ∎

We now give some results for x_n analogous to those in Theorem 2.1. There are no restrictions on the parameters of the model other than those specifically mentioned below. Note first that, when δ = 0 and ω > 0, (2.4) gives x_{n+1}
− λ = (1 − ω)(x_n − λ),

so that

x_n = λ + (1 − ω)^n (x_0 − λ).   (2.10)

Consider now the case δ > 0 (δ < 0 is similar). Rewriting (2.4),

x_{n+1} = δE(X_n X_n') + (1 − ω)x_n + ω_0,

and noting that E(X_n X_n') ≤ x_n x_n', we obtain

t(x_n) ≤ x_{n+1} ≤ u(x_n),   (2.11)

where

t(x) = (1 − ω)x + ω_0

and

u(x) = δxx' + (1 − ω)x + ω_0.

Let l_0 = Λ_0 = x_0, l_{n+1} = t(l_n), and Λ_{n+1} = u(Λ_n). When ω = 0, l_n = x_0, and, when ω > 0, l_n is the quantity on the right in (2.10). If 0 < ω < 2, l_n → λ as n → ∞. The function u is just the expected
operator:

E(X_{n+1} | X_n = x) = x + W(x) = u(x),

so Λ_n is the expected operator approximation to x_n (Sternberg, 1963, Section 3.2). If Λ_n converges to a limit Λ as n → ∞, then Λ = u(Λ); i.e., W(Λ) = 0.

THEOREM 2.2. Suppose δ > 0. If ω ≤ 1, then l_n ≤ x_n. Under the stronger condition ω + δ ≤ 1, x_n ≤ Λ_n.

Proof. If ω ≤ 1, dt(x)/dx ≥ 0. Thus l_n ≤ x_n implies

l_{n+1} = t(l_n) ≤ t(x_n) ≤ x_{n+1}

by (2.11). Also

(du/dx)(x) = δ(1 − 2x) + (1 − ω) ≥ 1 − (ω + δ)

for x ≤ 1. Thus ω + δ ≤ 1 implies du(x)/dx ≥ 0 for 0 ≤ x ≤ 1. Noting that x_0 = Λ_0, the above argument yields x_n ≤ Λ_n. ∎

An approximation to x_n, valid when the parameters θ_ij are small, is described in Section 12.4. This approximation is closely related to Λ_n.
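The bounds of Theorem 2.2 are easy to exercise numerically. In the following sketch (hypothetical parameters with δ > 0 and ω + δ ≤ 1), the lower iterates l_n converge to λ and the upper iterates Λ_n converge to the root Λ of W:

```python
# Iterate l_{n+1} = t(l_n) and Lam_{n+1} = u(Lam_n) and check the sandwich
# structure of Theorem 2.2.  Parameter values are hypothetical.

delta, omega0, omega1 = 0.05, 0.10, 0.15
omega = omega0 + omega1
assert delta > 0 and omega + delta <= 1

t = lambda x: (1 - omega) * x + omega0
u = lambda x: delta * x * (1 - x) + (1 - omega) * x + omega0
W = lambda x: delta * x * (1 - x) + omega0 * (1 - x) - omega1 * x

l = Lam = 0.3                       # l_0 = Lambda_0 = x_0
for _ in range(500):
    assert l <= Lam + 1e-12         # lower iterate never exceeds upper iterate
    l, Lam = t(l), u(Lam)

assert abs(l - omega0 / omega) < 1e-9    # l_n -> lambda
assert abs(W(Lam)) < 1e-9                # Lambda_n -> Lambda, a root of W
```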
12.3. Interresponse Dependencies

This section treats the relationship between responses on two different trials. Successive trials are considered first. Let a_n denote response alternation between trials n and n + 1, i.e., a_n = A_1nA_0,n+1 ∪ A_0nA_1,n+1.
By (2.2),

P(A_1n A_0,n+1) = (1 − θ_1)E(X_nX_n') + ω_1 x_n

and

P(A_0n A_1,n+1) = (1 − θ_0)E(X_nX_n') + ω_0 x_n',

and adding these equations gives

P(a_n) = (2 − θ_0 − θ_1)E(X_nX_n') + ω_0 x_n' + ω_1 x_n.   (3.1)

To use (3.1) as it stands, we must know x_n and E(X_nX_n'). The only case in which a simple exact formula for x_n is available is δ = 0. If, in addition,

δ_2 = s_1 − s_0 = 0,

where

s_i = Σ_j θ²_ij Π_ij,

then E(X_nX_n') can be calculated from (3.15). Equations (3.9)–(3.11) and Theorem 3.2 give expressions for some derivatives of P(a_n) under these rather restrictive conditions. Our immediate objective is to treat the "difficult" case δ ≠ 0 by an altogether different method. Some slow learning approximations to quantities involving alternations are given in Section 12.4.

When δ ≠ 0, comparison of (2.4) and (3.1) suggests that the troublesome term E(X_nX_n') be eliminated between them to obtain the relation

δP(a_n) − ξΔx_n = δ(ω_0x_n' + ω_1x_n) − ξ(ω_0x_n' − ω_1x_n)
              = −2(1−θ_1)ω_0x_n' + 2(1−θ_0)ω_1x_n,   (3.5)

where ξ = 2 − θ_0 − θ_1, between the mean learning and alternation curves. Even though (3.5) relates the "observable" quantities P(a_n), x_n, and x_{n+1}, it is not easy to test directly. We now note some immediate consequences that are more amenable to comparison with data. It is assumed that the model is distance diminishing and noncyclic. Then the limit

P(a_∞) = lim_{n→∞} P(a_n)
exists, and (3.5) yields

δP(a_∞) = −2(1−θ_1)ω_0 x_∞' + 2(1−θ_0)ω_1 x_∞.   (3.6)

This result is interesting only in the case of no absorbing states. If there are absorbing states, P(a_∞) = 0. In fact, E(#a_n) < ∞, by Theorem 6.2.2, where #a_n is the total number of trials on which a_n occurs. If 0 is the only absorbing state, E(#A_1n) < ∞ too. Summation of (3.5) over 0 ≤ n ≤ N yields

δ Σ_{n=0}^{N} P(a_n) − ξ(x_{N+1} − x_0) = −2(1−θ_1)ω_0 Σ_{n=0}^{N} x_n' + 2(1−θ_0)ω_1 Σ_{n=0}^{N} x_n.

When N → ∞ this becomes

δE(#a_n) = 2(1−θ_0)ω_1 E(#A_1n) − ξx_0   (3.7)

if 0 is the only absorbing state, and

δE(#a_n) = ξ(x_∞ − x_0)   (3.8)

if both 0 and 1 are absorbing. For a derivation of (3.8) via the functional equation (6.2.2), see Norman (1968d, Theorem 2).

We now turn our attention to the case δ = 0. If there are no absorbing states, (3.1) yields

P(a_∞) = 2(1−θ_1)E(X_∞X_∞') + 2ωλλ',   (3.9)

where E(X_∞X_∞') = lim_{n→∞} E(X_nX_n'). Summation of (3.1) over all n gives (3.10), as a consequence of (2.10), if 0 is the only absorbing state, and (3.11) in the case of two absorbing states. The quadratic means in these equations can be calculated explicitly when δ_2 = 0. Let

g = s_1 + 2(ω − t_0 − t_1),

where

t_i = θ²_ii' Π_ii'.
THEOREM 3.2. Suppose that δ = 0 and δ_2 = 0. If there are no absorbing states,

gE(X_∞X_∞') = ωλλ'(2 − θ_01 − θ_10).   (3.12)

If 0 is the only absorbing state,

g Σ_{n=0}^{∞} E(X_nX_n') = E(X_0X_0') + (1 − θ_10)ω_1 E(#A_1n).   (3.13)

Finally, if both endpoints are absorbing,

g Σ_{n=0}^{∞} E(X_nX_n') = E(X_0X_0').   (3.14)
Proof. We have

X_{n+1}X'_{n+1} = X_nX_n' + (X_n' − X_n)ΔX_n − (ΔX_n)²;

thus

E(X_{n+1}X'_{n+1} | X_n = x) = xx' + (x' − x)W(x) − Z(x),

where Z(x) = E((ΔX_n)² | X_n = x). Since δ = 0, (2.3) yields

(x' − x)W(x) = ω_0x' + ω_1x − 2ωxx'.

And

Z(x) = x(θ²_11 x'² Π_11 + θ²_10 x² Π_10) + x'(θ²_01 x'² Π_01 + θ²_00 x² Π_00)
     = x(s_1x'² − 2t_1x' + t_1) + x'(s_0x² − 2t_0x + t_0)
     = s_1xx' − 2(t_0 + t_1)xx' + t_0x' + t_1x,

since s_1 = s_0. Therefore, since t_i = θ_ii' ω_i,

E(X_{n+1}X'_{n+1} | X_n = x) = xx' − gxx' + (1−θ_01)ω_0x' + (1−θ_10)ω_1x,   (3.15)

so that

ΔE(X_nX_n') = −gE(X_nX_n') + (1−θ_01)ω_0x_n' + (1−θ_10)ω_1x_n.   (3.16)

Letting n → ∞ in (3.16), we obtain (3.12). Summation of (3.16) over n ≥ 0 yields (3.13) and (3.14). ∎

Some light is shed on the meaning of the condition δ = δ_2 = 0 by considering the symmetric case. The analog of (2.7) is
δ_2 = (θ²c − θ*²c*)(π_01 − π_10).   (3.17)

From this it follows that δ = δ_2 = 0 if and only if π_01 = π_10, or θ = θ* and c = c*. For the latter condition clearly implies the former, and, under the former, π_01 ≠ π_10 implies

θc = θ*c*   and   θ²c = θ*²c*.

Since the model is distance diminishing, either θc or θ*c* is positive; thus both are positive. Dividing the second equation by the first yields θ = θ*, from which c = c* follows.

The condition π_10 = π_01 means that the probability π_ii that A_i is "successful" does not depend on i. Yellott (1969) showed that such noncontingent success schedules are especially useful in assessing the relative merits of the linear model with θ = θ* and c = c* and the pattern model with c = c*. Either π_10 = π_01, or θ = θ* and c = c*, is compatible with no absorbing states. In the second case, (3.12) reduces to (3.18). Generally speaking, explicit computation in symmetric models with δ = δ_2 = 0 is limited mainly by one's stamina. When θ = θ* and c = c*, such computations often simplify somewhat if outcomes are noncontingent (π_01 = π_11).
AUTOCOVARIANCES. We now obtain expressions for the asymptotic response autocovariance function ρ_j and the important quantity σ² [see (2.5)] when δ = 0. Like (3.9), these expressions involve E(X_∞X_∞'), our formulas [(3.12) and (3.18)] for which require δ_2 = 0.

THEOREM 3.3. If δ = 0 and X_n has no absorbing states and is noncyclic, then

ρ_j = (1−ω)^{j−1} ((1−ω)λλ' − (1−θ_1)E(X_∞X_∞'))   (3.19)

for j ≥ 1, and

σ² = 2((1 − ω/2)λλ' − (1−θ_1)E(X_∞X_∞'))/ω.   (3.20)

Proof. For j ≥ 1,

P(A_1n A_1,n+j) = E(E(A_1n A_1,n+j | E_n, X_{n+1}))
               = E(A_1n E(A_1,n+j | X_{n+1}))
               = E(A_1n (λ + (1−ω)^{j−1}(X_{n+1} − λ)))

by (2.10), so that

P(A_1n A_1,n+j) = λP(A_1n) + (1−ω)^{j−1}(E(A_1n X_{n+1}) − λP(A_1n)).   (3.21)
And, by a computation like that in (3.3),

E(A_1n X_{n+1}) = (1−θ_1)E(X_n²) + (θ_1 − ω_1)x_n.   (3.22)

Substituting this expression into (3.21) and taking the limit, we obtain

ρ_j = (1−ω)^{j−1} ((1−θ_1)E(X_∞²) + (θ_1 − ω_1)λ − λ²),

from which (3.19) follows. Consequently,

σ² = λλ' + (2/ω)((1−ω)λλ' − (1−θ_1)E(X_∞X_∞')),

which reduces to (3.20). ∎
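Formulas (3.12), (3.19), and (3.20) fit together: the σ² of (2.5) equals λλ' + 2Σ_{j≥1} ρ_j. The following sketch checks this numerically for a symmetric model with δ = δ_2 = 0 (equal θ's and c's, π_01 = π_10; the numbers are hypothetical):

```python
# Check sigma^2 from (3.20) against lam*lam' + 2*sum(rho_j) with rho_j from (3.19)
# and E(X_inf X_inf') from (3.12).  All parameters hypothetical; theta_ij = th,
# c_ij = cc for all i, j, and pi01 = pi10 (so delta = delta_2 = 0).

th, cc = 0.1, 0.8
pi01 = pi10 = 0.3
lam = pi01 / (pi01 + pi10)            # = 1/2 here
lamp = 1 - lam
omega0, omega1 = th * cc * pi01, th * cc * pi10
omega = omega0 + omega1
theta1 = th * cc                      # theta_1 = sum_j th * Pi_1j = th * cc
t0, t1 = th * omega0, th * omega1     # t_i = theta_ii' * omega_i
s = th * th * cc                      # s_1 = s_0
g = s + 2 * (omega - t0 - t1)
E = omega * lam * lamp * (2 - th - th) / g                         # (3.12)

rho = lambda j: (1 - omega) ** (j - 1) * ((1 - omega) * lam * lamp
                                          - (1 - theta1) * E)      # (3.19)
sigma2 = 2 * ((1 - omega / 2) * lam * lamp - (1 - theta1) * E) / omega  # (3.20)

series = lam * lamp + 2 * sum(rho(j) for j in range(1, 4000))
assert abs(series - sigma2) < 1e-10
```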
When θ_ij = θ and c_ij = c for all i and j, (3.18) and (3.20) can be combined to yield (3.23).

We conclude this section by considering two special topics very briefly.

CONTINUOUS REINFORCEMENT. There is one class of models for which a great many analytic results are known even when δ ≠ 0. These are the continuous reinforcement models (i.e., π_00 = π_10 = 1) with c_00 = c_10 = 1. The reader is referred to Bush (1959), Tatsuoka and Mosteller (1959), and Sternberg [1963, Eq. (80)] for this interesting development. A number of predictions for continuous reinforcement models with c_00 = c_10 and θ_00 = θ_10 (and thus δ = δ_2 = 0) are given by Norman (1964).

CONTINUA OF OPERATORS. It may happen that one or both responses in a two-response experiment have more than two experimenter-defined outcomes, or, alternatively, one such outcome can produce several effects. Say response A_i can be followed by outcomes O_ia, where a belongs to a discrete index set 𝒜, with probability π_i(a) [Σ_a π_i(a) = 1], in which case the operator

u(x, (i, a)) = (1 − θ_ia)x + γ_ia,

with γ_ia = θ_ia or 0, is applied. In fact, we can even consider an arbitrary (measurable) index space 𝒜, in which case π_i is a probability on the index space. Most of the results of this and the preceding sections carry over to this generalized linear model if we take
where

ω_ia = |u(i, (i, a)) − i| = γ_ia if i = 0, and = θ_ia − γ_ia if i = 1.

Since the behavior of state and response sequences depends only on the distribution of (θ_ia, ω_ia) induced by π_i, specification of this distribution, rather than π_i itself, would suffice for the study of these processes.

12.4. Slow Learning
This section presents various approximations that apply when the step-size parameters θ_ij are small. Two different types of variation of the model's parameters are considered. The first of these produces large drift, while the second leads to small drift.

LARGE DRIFT. Here we assume that θ_ij = θη_ij, where θ varies and η_ij ≥ 0 is fixed, as are c_ij and π_ij. Under these conditions, the following quantities do not depend on θ:†

θ̄_i = θ_i/θ = Σ_j η_ij Π_ij,
δ̄ = δ/θ = θ̄_1 − θ̄_0,
ω̄_i = ω_i/θ = η_ii' Π_ii',
w(x) = W(x)/θ = δ̄xx' + ω̄_0x' − ω̄_1x,   (4.1)

and

S(x) = E((ΔX_n/θ)² | X_n = x) = η²_11Π_11 x'²x + η²_10Π_10 x³ + η²_01Π_01 x'³ + η²_00Π_00 x²x'.

Let s(x) = S(x) − w²(x). According to Theorem 8.1.1, if X_0 = x a.s., and if nθ is bounded, then, as θ → 0,

(X_n − f(nθ))/√θ is asymptotically N(0, g(nθ)),   (4.2)

where f and g satisfy the differential equations

df/dt = w(f)   (4.3)

and

dg/dt = 2w'(f(t))g + s(f(t))   (4.4)

† In this section, prime is used to denote reflection about ½ (x' = 1 − x) and d/dt to denote differentiation, with one exception: w' is the derivative of w.
and the initial conditions f(0) = x and g(0) = 0. The normality assertion of (4.2) and the precise value of g(nθ) are of less importance than the fact that the distribution of X_n is tightly clustered about f(nθ) when θ is small.

We now make some observations about f and g. Except where indicated, these do not depend on the special form of w and s. First, if w(x) = 0, then f(t) = x and

g(t) = s(x)(e^{2w'(x)t} − 1)/2w'(x),

assuming w'(x) ≠ 0. When w'(x) = 0, g(t) = t s(x). If w(x) ≠ 0, then the quantity in (4.3) never vanishes (Norman, 1968c, Lemma 5.1). Thus we can integrate (df/dt)/w(f) = 1 to obtain H(f(t)) = t, where

H(y) = ∫_x^y du/w(u).   (4.5)

For the linear model, w is at worst a quadratic polynomial, so H is easily computed and then inverted to give f. The most difficult case is that in which

w(x) = −δ̄(x − Λ)(x − ζ)

has distinct roots Λ and ζ. Let Λ be the root such that

w'(Λ) = −δ̄(Λ − ζ) < 0.   (4.6)

Using the partial fraction representation

1/w(u) = (1/w'(Λ)) (1/(u − Λ) − 1/(u − ζ))

in (4.5), we obtain

(f(t) − Λ)/(f(t) − ζ) = ((x − Λ)/(x − ζ)) e^{w'(Λ)t}.   (4.7)

As a consequence of (4.6), f(t) → Λ as t → ∞. Of course, 0 ≤ Λ ≤ 1, but it is possible that 0 ≤ ζ ≤ 1 also. For example, in the two-absorbing-barrier case ω_1 = ω_0 = 0, Λ = 1 and ζ = 0 if δ̄ > 0, while Λ = 0 and ζ = 1 if δ̄ < 0.
When w(x) ≠ 0, f is strictly monotonic, so we can write g(t) = G(f(t)). As a consequence of (4.4) and (4.3),

w(f) dG/df = 2w'(f)G(f) + s(f).

The solution with G(x) = 0 is

G(f) = w²(f) ∫_x^f s(u) w(u)^{−3} du.   (4.8)

No absorbing states. The models considered above have no absorbing states if and only if ω̄_1 > 0 and ω̄_0 > 0. In this case, w has a unique root Λ in [0, 1], and, in fact, 0 < Λ < 1. Since W = θw, Λ is also the unique root of W, referred to in Theorem 2.1, for any θ > 0. Assuming that θη_ii' ≤ 1,

D((X_n − Λ)/√θ) → D_θ^∞

as n → ∞, where D_θ^∞ does not depend on D(X_0). By Theorem 10.1.1, D_θ^∞ → N(0, σ²) as θ → 0, where

σ² = s(Λ)/2|w'(Λ)|.

Clearly σ² > 0. As in the transient case, asymptotic normality is not as significant as the clustering of lim_{n→∞} D(X_n) about Λ when θ is small.

Let

x_∞ = lim_{n→∞} E(X_n)   and   E((X_∞ − Λ)²) = lim_{n→∞} E((X_n − Λ)²).

As a consequence of (10.3.1),

E((X_∞ − Λ)²) = θσ² + o(θ),   (4.9)

while (10.3.3) gives

x_∞ = Λ + θγ + o(θ),   (4.10)

where

γ = σ²δ̄/w'(Λ).   (4.11)

Thus Λ + θγ is a better approximation to x_∞ than Λ is, when θ is small. Note that γ has the same sign as −δ̄ (or −δ), as Theorem 2.1 requires.
It is fairly obvious that P(a_∞) → 2ΛΛ' as θ → 0. A more precise approximation to P(a_∞) can be derived from (4.9) and (4.10). Letting n → ∞ in (3.1), we obtain

P(a_∞) = ξE(X_∞X_∞') + ω_0x_∞' + ω_1x_∞.

Since

xx' = ΛΛ' + (1 − 2Λ)(x − Λ) − (x − Λ)²,

E(X_∞X_∞') = ΛΛ' + (1 − 2Λ)(x_∞ − Λ) − E((X_∞ − Λ)²).

Thus, by (4.9) and (4.10),

E(X_∞X_∞') = ΛΛ' + θρ + o(θ),

where

ρ = (1 − 2Λ)γ − σ² = σ²ω̄/w'(Λ).

But

ξ = 2 − θ(θ̄_1 + θ̄_0)   and   ω_0x_∞' + ω_1x_∞ = θ(ω̄_0Λ' + ω̄_1Λ) + o(θ),

so

P(a_∞) = 2ΛΛ' + [2ρ − (θ̄_1 + θ̄_0)ΛΛ' + ω̄_0Λ' + ω̄_1Λ]θ + o(θ).   (4.12)
SMALL DRIFT. Except in special cases to be identified below, it is necessary to let both π_ij and θ_ij depend on an auxiliary parameter θ in order to obtain slow learning with small drift in linear models. Let τ = θ², and suppose that π_ij and θ_ij vary with θ in such a way that the following conditions are met:

θ_ii = O(θ),   (4.13)
θ²_ii Π_ii = τβ_i + o(τ),   (4.14)
θ_11Π_11 − θ_00Π_00 = τα + o(τ),   (4.15)
ω_i = θ_ii' Π_ii' = τa_i + o(τ),   (4.16)
θ²_ii' Π_ii' = o(τ).   (4.17)

In these equations, i = 1 and 0, α is any real constant, a_i ≥ 0, β_i ≥ 0, and β_0 > 0 or β_1 > 0.

The quantity θ_ij Π_ij takes into account the two learning-rate parameters θ_ij and c_ij associated with A_iO_j, as well as the probability π_ij that O_j follows A_i. Thus it is natural to refer to it as the weight of O_j given A_i. According to (4.15), the difference between the weights of success given A_1 and A_0 is very small (O(τ)). And (4.16) implies that the weight of failure given either response is very small. This might mean that θ_ii' = O(τ); or θ_ii' = O(θ) and Π_ii' = O(θ); or Π_ii' = O(τ). If θ_ii' = o(1) or a_i = 0, then (4.16) implies (4.17). Finally, we note that, when β_i > 0, (4.13) and (4.14) imply

liminf_{θ→0} π_ii > 0.
It is interesting to inquire as to the conditions under which our "large drift" scheme satisfies (4.13)–(4.17). Substituting θ_ij = θη_ij into (4.15) and (4.16), we obtain the equations

η_11Π_11 = η_00Π_00   and   η_01Π_01 = η_10Π_10 = 0,

which are sufficient as well as necessary.

We now show that, under (4.13)–(4.17), all of the hypotheses of Theorem 9.1.1 are satisfied. Note first that δ = α*τ + o(τ), where α* = α + a_1 − a_0. Therefore, letting

a(x) = α*xx' + a_0x' − a_1x,

E(ΔX_n | X_n = x) = W(x) = τa(x) + o(τ).   (4.18)

Similarly, (4.14) and (4.17) imply that

E((ΔX_n)² | X_n = x) = τb(x) + o(τ),   (4.19)

where

b(x) = (β_1x' + β_0x)xx'.

Finally, θ³_ii Π_ii = O(θτ) by (4.13) and (4.14), and θ³_ii' Π_ii' = o(τ) by (4.17), so

E(|ΔX_n|³ | X_n = x) = o(τ).   (4.20)

All o(τ)'s are uniform over x, and the functions a and b satisfy the requirements of the theorem. It follows that, if X_0 = x a.s. and nτ is bounded, then D(X_n) → P(nτ; x, ·) as θ → 0, where P is the transition probability that satisfies (9.1.5).
Two absorbing states. Suppose that (4.13), (4.14), and (4.15) hold, and that ω_0 = ω_1 = 0 for all θ, so that 0 and 1 are absorbing. Then (4.16) and (4.17) are satisfied, and a(x) in (4.18) reduces to a(x) = αxx'. If, in addition, β_0 > 0 and β_1 > 0, let

r(x) = 2a(x)/b(x) = 2α/(β_1x' + β_0x),

and let ψ be the solution of

ψ''(x) + r(x)ψ'(x) = 0   (4.21)

with ψ(0) = 0 and ψ(1) = 1. According to Theorem 11.1.1,

φ_θ(x) = P_x(X_n → 1 as n → ∞) → ψ(x)   (4.22)

as θ → 0. Combining this with (3.8) we obtain

E_x(#a_n) ~ 2(ψ(x) − x)/τα

as θ → 0, if α ≠ 0.

To justify application of Theorem 11.1.1, it is necessary to verify that the o(τ)'s in (4.18)–(4.20) satisfy

o(τ)/τb(x) → 0 uniformly in x   (4.23)

as τ → 0. In the case of (4.18),

o(τ) = (δ − τα)xx';

hence

|o(τ)/τb(x)| ≤ |δ/τ − α|/min(β_0, β_1),

and (4.23) follows. The other two equations can be handled similarly, using

E((ΔX_n)² | X_n = x) = (θ²_11Π_11 x' + θ²_00Π_00 x)xx'.
It is not difficult to solve (4.21). Clearly,

(dψ/dx)(x) = C exp(−∫_0^x r(u) du),

where C > 0. If β_1 ≠ β_0, and p = 2α/(β_1 − β_0), this yields

(dψ/dx)(x) = C(β_1x' + β_0x)^p.

Thus, if p ≠ −1,

ψ(x) = (h(x) − h(0))/(h(1) − h(0)),   (4.24)

where

h(x) = (β_1x' + β_0x)^{p+1}.

When β_1 = β_0 and α ≠ 0, as in the example at the end of Chapter 0, (4.24) holds with

h(x) = exp(−2αx/β_1).
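Formula (4.24) can be verified numerically by finite differences: ψ should satisfy (4.21). A sketch with hypothetical α and β_0 ≠ β_1:

```python
# Check that psi from (4.24) satisfies psi'' + r(x) psi' = 0, with
# r(x) = 2*alpha/(b1*x' + b0*x).  alpha, b0, b1 are made-up values.

alpha, b0, b1 = 0.4, 0.6, 0.9
p = 2 * alpha / (b1 - b0)

h = lambda x: (b1 * (1 - x) + b0 * x) ** (p + 1)
psi = lambda x: (h(x) - h(0.0)) / (h(1.0) - h(0.0))
r = lambda x: 2 * alpha / (b1 * (1 - x) + b0 * x)

eps = 1e-5
for x in (0.2, 0.5, 0.8):
    d1 = (psi(x + eps) - psi(x - eps)) / (2 * eps)            # psi'
    d2 = (psi(x + eps) - 2 * psi(x) + psi(x - eps)) / eps**2  # psi''
    assert abs(d2 + r(x) * d1) < 1e-3                         # (4.21) residual
```

Boundary values ψ(0) = 0 and ψ(1) = 1 hold by construction of (4.24).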
13. The Fixed Sample Size Model
In the first three sections of this chapter we will give an account of the fixed sample size model that closely parallels the treatment of the linear model in the last chapter. In fact, most of our results for the linear model apply without change to this stimulus sampling model. The close relationship between these models is further emphasized in Section 13.4, where it is shown that the distributions of the state and event sequences of certain linear models are limits of the corresponding distributions for sequences of fixed sample size models.

We recall that, in the fixed sample size model, the state variable x represents the proportion of elements in the total stimulus population conditioned to A_1. The event variable e = (m, i, j, k) gives the number of elements in the sample conditioned to A_1, and the response, outcome, and effectiveness of reinforcement indices. It is a finite state model defined formally as follows.

State space: X = {v/N : 0 ≤ v ≤ N}.
Event space: E = {(m, i, j, k) : 0 ≤ m ≤ s, 0 ≤ i, j, k ≤ 1}.
Transformation of x corresponding to e = (m, i, j, k):

u(x, e) = x + kθ(j − m/s),

where θ = s/N.
Probability of e given x:

p(x, e) = H(m, x) L(m/s; i, j, k),

where H(m, x) is the hypergeometric distribution

H(m, x) = C(Nx, m) C(Nx', s − m) / C(N, s)

(C(n, r) denoting a binomial coefficient), and L is the event probability function for the linear model:

L(y; i, j, k) = y π_1j c_1j     if i = 1, k = 1,
               y π_1j c'_1j    if i = 1, k = 0,
               y' π_0j c_0j    if i = 0, k = 1,
               y' π_0j c'_0j   if i = 0, k = 0.

The restrictions on the model's parameters are

1 ≤ s ≤ N,   0 ≤ π_ij, c_ij ≤ 1,   Σ_j π_ij = 1.

The formula given above for u(x, e) applies only if p(x, e) > 0. The definition in other cases is arbitrary. For example, if m > Nx, j = 0, and k = 1, the formula gives

u(x, e) = x − m/N < 0.

However, in this case H(m, x) = 0, so p(x, e) = 0.

With the following notations, many of the formulas of the last chapter become applicable to the fixed sample size model:

θ_i = ζ(Π_i0 + Π_i1),   ω_i = θΠ_ii',

where

ζ = (s − 1)/(N − 1),   Π_ij = π_ij c_ij.

It is understood that ζ = 0 if N = 1. Then ζ = 0 if and only if s = 1 (the pattern model), in which case θ_i = 0 too.
13.1. Criteria for Regularity and Absorption
A state sequence X_n for a fixed sample size model is a finite Markov chain. The theory of such chains is discussed in Section 3.7, and in standard sources (e.g., Kemeny and Snell, 1960).
13.1. REGULARITY AND ABSORPTION
197
The following simple properties of the transition kernel K of X,,are used repeatedly below. LEMMA1.1. If Uii,> 0 and x # i’, then, starting at x, the process can move closer to i’:
LEMMA1.2. If (s- 1)nii > 0 and 0 < x < 1, then the process can move closer to i :
K ( x , { y : ly-il < Ix-it}) > 0 . Proofs. In the first case, some event of the following sort has positive probability : The sample contains elements conditioned to A,, response A , is made, and Ai. is effectively reinforced. Any such event moves the process toward i’. In the second case, since O < x < 1 and s 2 2, a sample can be drawn containing elements conditioned to each response. Then A, can be made and effectively reinforced. As for the linear model, the presence of absorbing states exerts a decisive influence on X,,. The criterion for i to be absorbing is the same as in the linear model.
LEMMA 1.3. The state i is absorbing if and only if ωᵢ = 0. A state 0 < x < 1 is absorbing if and only if π̄ᵢᵢ′ = 0 and (s − 1)π̄ᵢᵢ = 0 for i = 0 and 1. Then all states are absorbing and the model is said to be trivial.

Proof. If π̄ᵢᵢ′ = 0, i is certainly absorbing, while if π̄ᵢᵢ′ > 0, it is not, according to Lemma 1.1. But ωᵢ > 0 if and only if π̄ᵢᵢ′ > 0. If 0 < x < 1 is absorbing, π̄ᵢᵢ′ = 0 and (s − 1)π̄ᵢᵢ = 0 follow from Lemmas 1.1 and 1.2. Suppose, conversely, that the latter conditions hold. If s = 1, then the response made is the one to which the sampled element is conditioned, and a change of state occurs only if the other response is effectively reinforced, which has probability 0. If s ≠ 1, then π̄ᵢᵢ = 0, i = 0, 1, and no response is effectively reinforced with positive probability. Thus, again, all states are absorbing. ∎

Theorems 1.1 and 1.2 differ only slightly from the comparable theorems (12.1.2 and 12.1.3) for the linear model. All states can be absorbing in Theorem 1.1, but the distance diminishing assumption of Theorem 12.1.2 rules this out. And there is a cyclic case in addition to the one (ω₁ = ω₀ = 1) given by Theorem 12.1.3.

THEOREM 1.1. If there is an absorbing state, then Xₙ is an absorbing process.
THEOREM 1.2. If there are no absorbing states, then either (a) π̄₀₁ = π̄₁₀ = 1 and s = 1 or N, in which case Xₙ is ergodic with period 2, or (b) Xₙ is regular.
Proof of Theorem 1.1. Suppose i′ is absorbing and i is not. The latter implies π̄ᵢᵢ′ > 0, according to Lemma 1.3. Thus, if x ≠ i′, K⁽ⁿ⁾(x, i′) = K⁽ⁿ⁾(x, {i′}) > 0 for some n ≤ N by Lemma 1.1. Hence the criterion for absorption given at the end of Section 3.7 is met. If some 0 < x < 1 is absorbing, Lemma 1.3 shows that all states are absorbing. Suppose now that this is not the case. By the same lemma, π̄ⱼⱼ′ > 0 or (s − 1)π̄ⱼⱼ > 0 for some j. Lemmas 1.1 and 1.2 then imply that, if 0 < x < 1, K⁽ⁿ⁾(x, j) > 0 for some n ≤ N − 1. Thus the process is absorbing if both 0 and 1 are absorbing states. ∎

Proof of Theorem 1.2. By Lemma 1.1, both 0 and 1 can be reached from any state, so Xₙ is ergodic, and both 0 and 1 belong to the single ergodic kernel F. Clearly K(0, s/N) > 0 and K(s/N, 0) > 0, so K⁽²⁾(0, 0) > 0. Since the period p of F divides all return times, p ≤ 2. If π̄ᵢᵢ′ < 1 for some i, then K(i, i) = 1 − π̄ᵢᵢ′ > 0, so p = 1 and the process is regular. If 1 < s < N, then K⁽³⁾(0, 0) > 0, so, again, p = 1. For π̄₀₁ > 0 and π̄₁₀ > 0 imply K(0, s/N) > 0 and K(1/N, 0) > 0. If, in addition, 1 < s < N, then, starting at x = s/N, a sample with s − 1 elements conditioned to A₁ can be drawn, A₁ made, and A₀ effectively reinforced, leading to state 1/N. Thus K(s/N, 1/N) > 0, and

    K⁽³⁾(0, 0) ≥ K(0, s/N) K(s/N, 1/N) K(1/N, 0) > 0,

as claimed. Suppose now that π̄₀₁ = π̄₁₀ = 1. If s = N, then K(0, 1) = K(1, 0) = 1, so that p = 2. If s = 1, then

    K(x, {y : |y − x| = 1/N}) = 1,

so, again, p = 2. ∎
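For small N the kernel K can be tabulated exactly, and the dichotomy of Theorems 1.1 and 1.2 checked by matrix computation. The sketch below is our own (helper names are assumptions; math.comb supplies the hypergeometric weights).

```python
from math import comb

def kernel(N, s, pi, c):
    """Transition matrix K[v][w] for v = N*X_n, states v = 0, ..., N."""
    K = [[0.0] * (N + 1) for _ in range(N + 1)]
    for v in range(N + 1):
        for m in range(max(0, s - (N - v)), min(s, v) + 1):
            H = comb(v, m) * comb(N - v, s - m) / comb(N, s)   # H(m, x)
            for i in (0, 1):
                pA = m / s if i else 1 - m / s                  # P(response A_i)
                for j in (0, 1):
                    pe = H * pA * pi[i][j]
                    K[v][v + j * s - m] += pe * c[i][j]         # effective
                    K[v][v] += pe * (1 - c[i][j])               # ineffective
    return K

def matpow(K, n):
    M = K
    for _ in range(n - 1):
        M = [[sum(M[a][b] * K[b][d] for b in range(len(K)))
              for d in range(len(K))] for a in range(len(K))]
    return M

N, s = 6, 2
pi = [[0.4, 0.6], [0.5, 0.5]]
c = [[0.8, 0.8], [0.8, 0.8]]
K = kernel(N, s, pi, c)
assert all(abs(sum(row) - 1) < 1e-12 for row in K)
# Here pi_bar_01 and pi_bar_10 are positive and less than 1, so Theorem 1.2
# gives regularity: a sufficiently high power of K is strictly positive.
assert all(x > 0 for row in matpow(K, 2 * N) for x in row)
```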
Part I of this volume showed that the theories of finite state models and distance diminishing models with compact state spaces are completely parallel. Thus the remarks in Sections 12.2 and 12.3 about the asymptotic behavior of distance diminishing linear models are applicable to nontrivial fixed sample size models with the same absorbing states. For example, the proportion Ā₁ₙ of A₁ responses in the first n trials is asymptotically normal with mean P(A₁ₙ) and variance σ²/n, where σ² is given by (12.2.5), if the model has no absorbing states and is noncyclic.
13.2. Mean Learning Curve and Interresponse Dependencies

Our first order of business in this section is to find suitable expressions for

    P(Aᵢₙ Oⱼₙ Cₖₙ | Xₙ)   and   E((ΔXₙ) Aᵢₙ Oⱼₙ Cₖₙ | Xₙ),

and various quantities that derive from them. These formulas are compared with analogous expressions for the linear model. Some differences are noted, but there are important similarities that are exploited in the remainder of the section.

If f is any complex valued function on [−1, 1],

    E(f(ΔXₙ) Aᵢₙ Oⱼₙ Cₖₙ | Xₙ = x) = M[f(kθ(j − m/s)) L(m/s; i, j, k)],   (2.1)
where M denotes expectation with respect to the distribution H(m, x). If f(y) = 1, (2.1) and M(m/s) = x yield

    P(Aᵢₙ Oⱼₙ Cₖₙ | Xₙ = x) = L(x; i, j, k),

just as in the linear model. Summing over j and k for i = 1, we obtain

    P(A₁ₙ | Xₙ = x) = x,

so that

    P(A₁ₙ) = E(Xₙ) = x̄ₙ.

If f(y) = y, (2.1) reduces to

    E((ΔXₙ) Aᵢₙ Oⱼₙ Cₖₙ | Xₙ = x) = kθ M[(j − m/s) L(m/s; i, j, k)].
From

    M((m/s − x)²) = (1/s)((N − s)/(N − 1)) xx′   (2.2)

[Wilks, 1962, Eq. (6.1.5)], it follows that

    M((1 − m/s)(m/s)) = ρxx′,   M((m/s)²) = x − ρxx′,   M((1 − m/s)²) = x′ − ρxx′,   (2.3)

where

    ρ = N(s − 1)/(s(N − 1)) = ζ/θ.   (2.4)
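Identities (2.2)–(2.4) are easy to confirm by direct summation against the hypergeometric weights; the helper below is our own.

```python
from math import comb

def M(f, N, s, v):
    """E[f(m/s)] for m hypergeometric: sample of size s from N, v marked."""
    lo, hi = max(0, s - (N - v)), min(s, v)
    return sum(f(m / s) * comb(v, m) * comb(N - v, s - m) / comb(N, s)
               for m in range(lo, hi + 1))

N, s, v = 30, 7, 12
x = v / N
xp = 1 - x
rho = N * (s - 1) / (s * (N - 1))
var = M(lambda y: (y - x) ** 2, N, s, v)
assert abs(var - (1 / s) * (N - s) / (N - 1) * x * xp) < 1e-12         # (2.2)
assert abs(M(lambda y: (1 - y) * y, N, s, v) - rho * x * xp) < 1e-12   # (2.3)
assert abs(M(lambda y: y * y, N, s, v) - (x - rho * x * xp)) < 1e-12   # (2.3)
```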
Thus E((ΔXₙ)AᵢₙOⱼₙCₖₙ | Xₙ = x) and E(ΔXₙ | Xₙ = x, AᵢₙOⱼₙCₖₙ) vanish when k = 0, and for k = 1 they are given by the following table:

    i  j  k    E((ΔXₙ)AᵢₙOⱼₙCₖₙ | Xₙ = x)    E(ΔXₙ | Xₙ = x, AᵢₙOⱼₙCₖₙ)
    1  1  1    ζπ̄₁₁xx′                       ζx′
    1  0  1    −θ(1 − ρx′)xπ̄₁₀              −θ(1 − ρx′)
    0  1  1    θ(1 − ρx)x′π̄₀₁               θ(1 − ρx)
    0  0  1    −ζπ̄₀₀xx′                      −ζx
                                                            (2.5)

Though all rows are linear functions of x, the second and third differ essentially from the corresponding expressions for any five-operator linear model. This is most striking for the pattern model, where they are, respectively, −1/N and 1/N for all x. Returning to (2.5), and adding the first and second and the third and fourth rows, we get

    E(ΔXₙ Aᵢₙ | Xₙ = x) = (θ₁x′ − ω₁)x   if i = 1,
                          (ω₀ − θ₀x)x′   if i = 0,   (2.6)
which is identical to the corresponding linear model expression (12.2.2). Most of the formulas in Sections 12.2 and 12.3 follow directly from (12.2.2), and thus apply to the fixed sample size model. We now discuss these results in more detail.

Considering Section 12.2 first, the expressions (12.2.3) for W(x) = E(ΔXₙ | Xₙ = x) and (12.2.4) for Δx̄ₙ apply here. As in the linear model,

    δ = θ₁ − θ₀,   ω̄ = ω₀ + ω₁,   and   λ = ω₀/ω̄ = π̄₀₁/(π̄₀₁ + π̄₁₀).

If the condition θᵢᵢ′ < 1 in Theorem 12.2.1 is replaced by s < N, the system
(12.2.6) of bounds for x̄ₙ is valid in the present context. Clearly,

    δ = ζ[(π̄₁₁ + π̄₁₀) − (π̄₀₀ + π̄₀₁)],

so that δ = 0 if and only if s = 1 or

    π̄₁₁ + π̄₁₀ = π̄₀₀ + π̄₀₁.   (2.7)

In the symmetric case, cᵢᵢ = c, cᵢᵢ′ = c*, we have

    δ = ζ(c − c*)(π₀₁ − π₁₀),
which is analogous to (12.2.7). Finally, the formula (12.2.10) for x̄ₙ when δ = 0 and the bounds for x̄ₙ when δ > 0 given by Theorem 12.2.2 apply here.

Turning to Section 12.3, the expression (12.3.1) for P(aₙ) is valid for the fixed sample size model, as is the relation (12.3.5) between P(aₙ) and x̄ₙ when δ ≠ 0 and its corollaries (12.3.6)–(12.3.8). When δ = 0, P(aₙ) and E(#aₙ) are related to E(XₙXₙ′) according to (12.3.9)–(12.3.11). Under the same assumption (no analog of "S² = 0" is needed), the expression (12.3.15) for E(Xₙ₊₁X′ₙ₊₁ | Xₙ = x) applies, with aᵢ = ωᵢ + 2(1 − ρ)θ and with θᵢᵢ′ replaced by θ, as is shown in the next paragraph. The formulas for E(XₙXₙ′) and ΣₙE(XₙXₙ′) in Theorem 12.3.2 follow. In the symmetric case with c = c*, there is an analog to (12.3.18). Finally, Theorem 12.3.3 on response autocovariance when δ = 0 holds here.
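As a check on the identification of δ, ω₀, and ω₁, the sketch below (our own helper, not the book's) computes E(ΔXₙ | Xₙ = x) by enumerating all events (m, i, j, k) and compares it with W(x) = δxx′ + ω₀x′ − ω₁x at every state.

```python
from math import comb

def drift(N, s, v, pi, c):
    """Exact E(dX | X = v/N) by enumeration of events (m, i, j, k)."""
    d = 0.0
    for m in range(max(0, s - (N - v)), min(s, v) + 1):
        H = comb(v, m) * comb(N - v, s - m) / comb(N, s)
        for i in (0, 1):
            pA = m / s if i else 1 - m / s
            for j in (0, 1):
                # only effective reinforcement (k = 1) changes the state
                d += H * pA * pi[i][j] * c[i][j] * (s / N) * (j - m / s)
    return d

N, s = 25, 4
pi = [[0.2, 0.8], [0.6, 0.4]]
c = [[0.9, 0.7], [0.5, 0.3]]
pib = [[pi[i][j] * c[i][j] for j in (0, 1)] for i in (0, 1)]
zeta, theta = (s - 1) / (N - 1), s / N
th = [zeta * (pib[i][0] + pib[i][1]) for i in (0, 1)]
om0, om1 = theta * pib[0][1], theta * pib[1][0]
delta = th[1] - th[0]
for v in range(N + 1):
    x = v / N
    W = delta * x * (1 - x) + om0 * (1 - x) - om1 * x
    assert abs(drift(N, s, v, pi, c) - W) < 1e-12
```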
Proof of (12.3.15). We will obtain an expression for

    Ξ(x) = E((ΔXₙ)² | Xₙ = x),

from which (12.3.15) follows just as in the proof of Theorem 12.3.2. Letting y = m/s,

    E((ΔXₙ)²A₁ₙ | Xₙ = x) = θ²[M(y′²y)π̄₁₁ + M((1 − y′)²y)π̄₁₀]
        = θ²[M(y′²y)(π̄₁₁ + π̄₁₀) − 2M(y′y)π̄₁₀ + M(y)π̄₁₀]
        = θ²(π̄₁₁ + π̄₁₀)M(y′²y) − 2ζω₁xx′ + θω₁x.   (2.8)

Similarly,

    E((ΔXₙ)²A₀ₙ | Xₙ = x) = θ²(π̄₀₀ + π̄₀₁)M(y²y′) − 2ζω₀xx′ + θω₀x′.   (2.9)
As was noted earlier, if δ = 0 and s ≠ 1, (2.7) holds. Thus

    θ²(π̄₁₁ + π̄₁₀)M(y′²y) + θ²(π̄₀₀ + π̄₀₁)M(y²y′) = θ²(π̄₁₁ + π̄₁₀)M(yy′) = θθ₁xx′,

and this is valid even when s = 1, since both sides vanish in that case. Therefore, addition of (2.8) and (2.9) yields

    Ξ(x) = (θθ₁ − 2ζω̄)xx′ + θω₁x + θω₀x′,

the desired equation. ∎
In the fixed sample size model with δ = 0, just as in certain linear models satisfying this condition, one can obtain complicated formulas for practically any quantity of interest by rather straightforward computation. Some ingenious work has led to elegant expressions for the distribution of NXₙ in the pattern model. This development was begun by Estes (1959) and carried forward by Chia (1970). Estes also showed that the asymptotic distribution of NXₙ is binomial, with parameters N and λ. The assumption that c₀₁ = c₁₀ in these papers is unessential.
13.3. Slow Learning

The implications of Part II for the fixed sample size model are essentially the same as for the linear model (see Section 12.4), but slight additional effort is required to verify the hypotheses of the relevant theorems. This section is mainly devoted to checking these hypotheses. We assume throughout that s is fixed, so θ → 0 if and only if N → ∞.

LARGE DRIFT. For easy reference we quote (12.2.3):

    E(ΔXₙ | Xₙ = x) = W(x) = δxx′ + ω₀x′ − ω₁x.   (3.1)

Division by θ yields

    w(x, θ) = (δ/θ)xx′ + π̄₀₁x′ − π̄₁₀x   (3.2)

for x in Iθ = {v/N : 0 ≤ v ≤ N}. Now

    ζ/θ = ((s − 1)/s)(1/(1 − 1/N)) = ((s − 1)/s)(1/(1 − θ/s));
hence

    δ/θ = δ̄ + θδ̄/s + O(θ²),   (3.3)

where

    δ̄ = ((s − 1)/s)[(π̄₁₁ + π̄₁₀) − (π̄₀₀ + π̄₀₁)].

Combining (3.2) and (3.3), we obtain

    w(x, θ) = w(x) + θv(x) + O(θ²),   (3.4)

where

    w(x) = δ̄xx′ + π̄₀₁x′ − π̄₁₀x   and   v(x) = δ̄xx′/s.
This takes care of assumption (a.2) of Theorem 8.1.1 and also (10.3.2) of Theorem 10.3.1. As a consequence of (2.1),

    s(x, θ) = E((ΔXₙ/θ)² | Xₙ = x) = Σᵢⱼ M[(j − m/s)² L(m/s; i, j, 1)].   (3.5)

It is easy to see that

    H(m, x) = C(s, m) [x(x − 1/N)⋯(x − (m − 1)/N) x′(x′ − 1/N)⋯(x′ − (s − m − 1)/N)] / [(1 − 1/N)⋯(1 − (s − 1)/N)]
            → bₘ(x) = C(s, m) xᵐ x′^(s−m)

as N → ∞, uniformly over 0 ≤ x ≤ 1. It follows that

    s(x, θ) = s(x) + o(1)   (3.6)

uniformly over x ∈ Iθ, where

    s(x) = Σᵢⱼ M*[(j − m/s)² L(m/s; i, j, 1)],

and M* is the expectation with respect to the binomial distribution bₘ(x). Condition (a.3) of Theorem 8.1.1 follows from (3.4) and (3.6). The smoothness conditions (b) on w(x) and S(x) = s(x) − w(x)² are trivially satisfied, and (c) follows from |ΔXₙ| ≤ θ. Thus all of the assumptions of Theorem 8.1.1 hold.
When π̄₀₁ > 0 and π̄₁₀ > 0, so that there are no absorbing states, w has a unique root λ in [0, 1]. Thus Theorems 10.1.1 and 10.3.1 apply to the stationary distribution μθ of Xₙ. Since W(x) is not quite proportional to w(x), the root λθ of W(x) [or of w(x, θ)], which figures in (12.2.6), need not equal the root λ of w(x); however, λθ → λ as θ → 0. The formula (12.4.11) for γ is no longer valid, since v(x) need not vanish, but (12.4.12) holds if we take p = (1 − 2λ)γ − a₂ and replace θ₀ + θ₁ and aᵢ by δ̄ and π̄ᵢᵢ′, respectively.
SMALL DRIFT. Under the conditions

    π̄₁₁ = β + o(θ),   β > 0,   (3.7)
    π̄₁₁ − π̄₀₀ = aθ + o(θ),   (3.8)
    π̄ᵢᵢ′ = θaᵢ + o(θ),   (3.9)

and s > 1, which are analogous to (12.4.13)–(12.4.17), Theorem 9.1.1 is applicable, with τ = θ²,

    a(x) = ((s − 1)/s)(a + a₁ − a₀)xx′ + a₀x′ − a₁x,

and

    b(x) = β((s − 1)/s)xx′.

In the case π̄₀₁ = π̄₁₀ = 0 of two absorbing barriers, a(x) reduces to

    a(x) = ((s − 1)/s) a xx′,

and Theorem 11.1.1 applies also. We will verify the hypotheses of the latter theorem only. Since

    ρ = (s − 1)/s + O(θ),   (3.10)

(3.8) gives

    δ = τ((s − 1)/s)a + o(τ)
and

    E(ΔXₙ | Xₙ = x) = τa(x) + o(τ).

Note that the o(τ) in the last equation has a factor xx′, so

    max |o(τ)/τb(x)| → 0,   (3.11)

the maximum being over 0 < x < 1, x ∈ Iθ, as τ → 0. Next, by (3.5),

    E((ΔXₙ/θ)² | Xₙ = x) = βM(yy′) + M(y′²y)t₁ + M(y²y′)t₀,

where y = m/s and tᵢ = π̄ᵢᵢ − β. But (2.4) and (3.10) imply that

    M(yy′) = ((s − 1)/s)xx′ + o(1)xx′,

and (3.7) and (3.8) yield

    M(y′²y)|t₁| + M(y²y′)|t₀| ≤ KθM(yy′) ≤ Kθxx′.

Therefore

    E((ΔXₙ)² | Xₙ = x) = τb(x) + o(τ),   (3.12)

and (3.11) holds. Finally,
    E(|ΔXₙ|³ | Xₙ = x) ≤ θE((ΔXₙ)² | Xₙ = x) = O(τ^(3/2))

by (3.12). Thus (11.1.4) and (11.1.5) hold, and the other conditions of Theorem 11.1.1 are easy to check.

13.4. Convergence to the Linear Model

We begin by noting that, according to (2.2), the variance of m/s is small when s and N are large:

    M((m/s − x)²) ≤ 1/4s.   (4.1)

Thus m/s tends to approximate x under these conditions, so that, referring to the definition of u at the beginning of the chapter, u(x, (m, i, j, k)) approximates x + kθ(j − x), the corresponding event operator for the linear model with θᵢⱼ = θ. Furthermore, the probability of Aᵢₙ Oⱼₙ Cₖₙ given Xₙ = x is L(x; i, j, k) in either model. This suggests that, if s and N approach infinity in such a way that θ → θ*, while the distribution p of X₀ converges to a distribution p*, then the joint distribution 𝒟(n; s, N, p) of X₀, E₀, X₁, …, Xₙ for the fixed sample size model should converge, in some sense, to the comparable distribution 𝒟(n; θ*, p*) for the linear model with θᵢⱼ = θ* and initial distribution p*. In this section, it is shown that this convergence does indeed occur. Of course, Eₙ above refers to the indices (i, j, k), that part of the event for the fixed sample size model that has a counterpart in the linear model. This notation is used throughout the section.

The intuitive notion that the linear model is a stimulus sampling model with an infinity of stimulus elements is deeply ingrained in this subject. On the mathematical side, a technical report by Estes and Suppes (1959b) presents a general but arduous approach to convergence of event probabilities. Theorem 4.1, the proof of which is not difficult, is the first result of this kind to be published.

THEOREM 4.1. For any n ≥ 0, 0 < θ* < 1, and distribution p* on [0, 1], 𝒟(n; s, N, p) converges to 𝒟(n; θ*, p*) as s → ∞, θ → θ*, and p → p*.
The convergence to which the theorem refers is weak sequential convergence in the following sense. Let s = sⱼ, N = Nⱼ, and p = pⱼ be any sequences such that sⱼ → ∞, sⱼ/Nⱼ → θ*, and pⱼ converges weakly to p* (see Section 2.2) as j → ∞. Let ℰⁿ⁻¹ = (E₀, …, Eₙ₋₁), and let [ℰⁿ⁻¹ ∈ A] be the random variable that is one if ℰⁿ⁻¹ ∈ A and zero otherwise. Then

    E([ℰⁿ⁻¹ ∈ A] h(X₀, …, Xₙ)) → E*([ℰⁿ⁻¹ ∈ A] h(X₀, …, Xₙ))   (4.2)

as j → ∞, E* denoting expectation for the linear model, for any

    A ⊂ {0, 1} × {0, 1} × ⋯   (3n times)

and bounded continuous real valued function h. The cases where A is the whole event space and where h(x₀, …, xₙ) = 1 are, of course, of particular interest.

Proof. Let us denote

    E(f(ΔXₙ) Aᵢₙ Oⱼₙ Cₖₙ | Xₙ = x)

by J(x) in the fixed sample size model and by J*(x) in the linear model. An expression for J(x) is given by (2.1), and

    J*(x) = f(kθ*(j − x)) L(x; i, j, k).

We first show that, if f has a bounded second derivative,

    J(x) = J*(x) + ε(x),   (4.3)

where ε(x) → 0 uniformly in x as s → ∞ and θ → θ*. If g has a bounded second derivative on [0, 1], the second-order Taylor expansion

    g(y) = g(x) + (y − x)g′(x) + γ|g″|(y − x)²,
where |γ| ≤ 1/2 and |g″| = sup_x |g″(x)|, yields

    |M(g(m/s)) − g(x)| ≤ |g″| M((m/s − x)²)/2 ≤ |g″|/8s

by (4.1). If f has a bounded second derivative on [−1, 1] and

    g(y) = f(kθ(j − y)) L(y),

where L(y) = L(y; i, j, k), it is easy to see that

    |g″| ≤ |f″| + 2|f′|,

so that

    |M[f(kθ(j − m/s)) L(m/s)] − f(kθ(j − x)) L(x)| ≤ (|f″| + 2|f′|)/8s.

But

    |f(kθ(j − x)) L(x) − f(kθ*(j − x)) L(x)| ≤ |θ − θ*| |f′|.

Thus

    |J(x) − J*(x)| ≤ (|f″| + 2|f′|)/8s + |θ − θ*| |f′|,

as required.

The proof of the theorem now proceeds by induction. All limits below are taken along an arbitrary but fixed sequence (s, N, p) = (sⱼ, Nⱼ, pⱼ) with the required asymptotic behavior. Since 𝒟(0; s, N, p) = p and 𝒟(0; θ*, p*) = p*, there is nothing to prove when n = 0. Suppose that (4.2) holds for some n ≥ 0 and all A and h. We must show that n can be replaced by n + 1. It clearly suffices to consider unit sets A = {(e₀, …, eₙ)}. Furthermore, it follows from the continuity theorem for multidimensional distributions (Breiman, 1968, Theorem 11.6) that only functions of the form

    h(x₀, …, xₙ₊₁) = exp(i Σ_{ν=0}^{n+1} tᵥxᵥ)

need be considered. Now

    E([ℰⁿ = eⁿ] h(X₀, …, Xₙ₊₁)) = E([ℰⁿ⁻¹ = eⁿ⁻¹] exp(i Σ_{ν=0}^{n} tᵥXᵥ + i tₙ₊₁Xₙ) J(Xₙ)),

where (i, j, k) and f(y) in the definition of J(x) are eₙ and exp(i tₙ₊₁ y);
by (4.3), this differs from the corresponding expression with J*(Xₙ) in place of J(Xₙ) by at most max_x |ε(x)| → 0; and

    E([ℰⁿ⁻¹ = eⁿ⁻¹] exp(i Σ_{ν=0}^{n} tᵥXᵥ + i tₙ₊₁Xₙ) J*(Xₙ)) → E*([ℰⁿ⁻¹ = eⁿ⁻¹] exp(i Σ_{ν=0}^{n} tᵥXᵥ + i tₙ₊₁Xₙ) J*(Xₙ))

by the induction hypothesis (4.2), since J*(x) is bounded and continuous. The quantity on the right reduces to E*([ℰⁿ = eⁿ] h(X₀, …, Xₙ₊₁)), which completes the induction. ∎
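Theorem 4.1 can be illustrated at the level of one-step expectations: with θ = s/N held fixed, E f(X₁) under the fixed sample size model approaches its linear model counterpart as s grows. The exact enumeration below is our own construction, not the book's.

```python
from math import comb

def E_fixed(f, N, s, v, pi, c):
    """E[f(X_1) | X_0 = v/N] in the fixed sample size model (exact)."""
    x, th, out = v / N, s / N, 0.0
    for m in range(max(0, s - (N - v)), min(s, v) + 1):
        H = comb(v, m) * comb(N - v, s - m) / comb(N, s)
        for i in (0, 1):
            pA = m / s if i else 1 - m / s
            for j in (0, 1):
                pe = H * pA * pi[i][j]
                out += pe * c[i][j] * f(x + th * (j - m / s))  # effective
                out += pe * (1 - c[i][j]) * f(x)               # ineffective
    return out

def E_linear(f, x, th, pi, c):
    """Same expectation for the linear model with theta_ij = th."""
    out = 0.0
    for i in (0, 1):
        pA = x if i else 1 - x
        for j in (0, 1):
            pe = pA * pi[i][j]
            out += pe * c[i][j] * f(x + th * (j - x))
            out += pe * (1 - c[i][j]) * f(x)
    return out

pi = [[0.3, 0.7], [0.4, 0.6]]
c = [[0.6, 0.8], [0.9, 0.5]]
f = lambda y: y * y
# s -> infinity with theta = s/N = 0.2 and x = 0.4 held fixed
errs = [abs(E_fixed(f, 5 * k, k, 2 * k, pi, c) - E_linear(f, 0.4, 0.2, pi, c))
        for k in (5, 10, 20, 40)]
assert errs[-1] < errs[0]
assert errs[-1] < 0.05
```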
14 Additive Models
In additive models for simple two-choice experiments, A₁ response probability is a Borel measurable function p(x) on the state space X = R¹, and state transformations are translations:

    u(x, e) = x + bₑ.

An event e = (i, z) is determined by a response Aᵢ and a consequence z, which is drawn from a measurable space (Z, 𝒵) in accordance with a specified distribution

    πᵢ(D) = P(z ∈ D | Aᵢ).

Thus the distribution p(x, ·) of e given x satisfies

    p(x, {(1, z) : z ∈ D}) = p(x)π₁(D)

and

    p(x, {(0, z) : z ∈ D}) = q(x)π₀(D),

where q(x) = 1 − p(x). It is easy to see that πᵢ and bₑ affect the distribution of the (Markovian) sequence (Xₙ, A₁ₙ) of states and responses only through the conditional distribution

    Qᵢ(B) = P(bₑ ∈ B | Aᵢ) = πᵢ({z : b_{iz} ∈ B})

of bₑ given Aᵢ.

Though it is natural to require that p meet various criteria of smoothness, it is unnecessary to do so for Sections 14.1 and 14.2. Of course, 0 < p(x) < 1. The interpretation of x or u = eˣ as an "A₁ response strength" variable suggests that p be strictly increasing, with p(∞) = 1 and p(−∞) = 0. The latter conditions are assumed, but, in place of monotonicity, it suffices for the time being to suppose that p(x) is bounded away from 0 and 1 on finite intervals. Equivalently, p(xₙ) → 1 if and only if xₙ → ∞, and p(xₙ) → 0 if and only if xₙ → −∞. All of these conditions are satisfied in beta models, which are additive models with p(x) = u/(u + 1). Our final assumption is that the moment generating functions

    Mᵢ(λ) = E(e^{λbₑ} | Aᵢ) = ∫_{−∞}^{∞} e^{λy} Qᵢ(dy)
exist for λ in some open interval Λ containing 0. The significance of these quantities is suggested by the equation

    E(uₙ₊₁^λ / uₙ^λ | uₙ) = M₁(λ)pₙ + M₀(λ)qₙ

a.s., where uₙ = e^{Xₙ}, pₙ = p(Xₙ), and qₙ = q(Xₙ), which shows that they determine the tendency of uₙ^λ to increase or decrease. If k ≥ 0 and λ ∈ Λ, ∫ yᵏ e^{λy} Qᵢ(dy) and Mᵢ⁽ᵏ⁾(λ) exist and are equal. In particular, if

    mᵢ = E(bₑ | Aᵢ) = ∫ y Qᵢ(dy),

then mᵢ = Mᵢ′(0), so that mᵢ determines the departure of Mᵢ(λ) from Mᵢ(0) = 1 when |λ| is small. If mᵢ < 0, then Mᵢ(λ) < 1 for λ > 0 sufficiently small, while if mᵢ > 0, then Mᵢ(λ) < 1 for negative λ sufficiently close to 0. Actually, since Mᵢ″(λ) = ∫ y² e^{λy} Qᵢ(dy) ≥ 0, Mᵢ is convex, so that Mᵢ(λ) < 1 implies that Mᵢ(ω) < 1 whenever ω is between λ and 0. Henceforth it is understood that an argument of Mᵢ is an element of Λ.

Analogous to five-operator linear models, five-operator additive models are defined by z = (j, k), where Aⱼ is the response reinforced and k = 1 or 0 depending on whether or not conditioning is effective. Thus b_{ij0} = 0, and, letting bᵢⱼ = b_{ij1}, bᵢ₁ ≥ 0 and bᵢ₀ ≤ 0. Also,
    πᵢ(j, k) = πᵢⱼcᵢⱼ         if k = 1,
               πᵢⱼ(1 − cᵢⱼ)    if k = 0,

where πᵢⱼ is the probability of reinforcing Aⱼ after Aᵢ, and cᵢⱼ is the attendant
probability of effective conditioning. In this case, mᵢ = Σⱼ bᵢⱼπᵢⱼcᵢⱼ. As for the linear model, the symmetric case, defined here by

    c₁₁ = c₀₀ = c,   c₀₁ = c₁₀ = c*,
    b₁₁ = −b₀₀ = b,   b₀₁ = −b₁₀ = b*,

is of particular interest. Much of this chapter follows Norman (1970b).

14.1. Criteria for Recurrence and Absorption

In this section we establish Theorem 1.1, which shows that the qualitative asymptotic behavior of pₙ (or Xₙ, or uₙ) is determined by the signs of the mean increments mᵢ.
THEOREM 1.1. (A) If m₀ > 0 and m₁ > 0, lim pₙ = 1 a.s.
(B) If m₀ < 0 and m₁ < 0, lim pₙ = 0 a.s.
(C) If m₀ < 0 and m₁ > 0, g₀(x) + g₁(x) = 1, where gᵢ(x) = Pₓ(lim pₙ = i). In addition, g₀(x) > 0, g₁(x) > 0, g₀(x) → 1 as x → −∞, and g₁(x) → 1 as x → ∞.
(D) If m₀ > 0 and m₁ < 0, then lim sup pₙ = 1 and lim inf pₙ = 0 a.s.
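The quantities driving this trichotomy are the mean increments mᵢ = Mᵢ′(0). For a two-point increment distribution everything is computable directly; the numbers below are illustrative assumptions of ours, not the book's.

```python
import math

def M(lam, Q):
    """MGF of a discrete increment distribution Q = {y: prob}."""
    return sum(p * math.exp(lam * y) for y, p in Q.items())

def mean(Q):
    return sum(p * y for y, p in Q.items())

# After A1, suppose the increment is +0.1 with prob .6 and -0.3 with prob .4.
Q1 = {0.1: 0.6, -0.3: 0.4}
m1 = mean(Q1)                 # m1 = 0.06 - 0.12 = -0.06 < 0
assert m1 < 0
# m1 = M1'(0) < 0 and M1(0) = 1, so M1(lam) < 1 for small lam > 0 ...
assert M(0.1, Q1) < 1
# ... and, by convexity, M1 stays below 1 between that point and 0.
assert M(0.05, Q1) < 1
```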
The proof is based on a number of lemmas, some of which will find other uses later. Let

    F_λ(u) = M₁(λ)p(x) + M₀(λ)q(x),   u = eˣ,

so that

    E(uₙ₊₁^λ | uₙ) = F_λ(uₙ)uₙ^λ.   (1.1)

Note that F_λ(u) → M₁(λ) as u → ∞.

LEMMA 1.1. If m₁ < 0, lim inf Xₙ < ∞ a.s. If λ > 0 and M₁(λ) < 1, there are constants B(λ) and C(λ) < 1 such that

    Eₓ(uₙ^λ) ≤ u^λ Cⁿ(λ) + B(λ).   (1.2)

Similarly, lim sup Xₙ > −∞ a.s. if m₀ > 0, and (1.2) holds if λ < 0 and M₀(λ) < 1.

Proof. For V sufficiently large,

    C(λ) = sup_{u≥V} F_λ(u) < 1,
so that, by (1.1), E(uₙ₊₁^λ | uₙ) ≤ C(λ)uₙ^λ for uₙ ≥ V. But for uₙ < V,

    E(uₙ₊₁^λ | uₙ) ≤ max_i (Mᵢ(λ)) V^λ = D(λ).

Therefore

    E(uₙ₊₁^λ | uₙ) ≤ C(λ)uₙ^λ + D(λ)

a.s., so that

    Eₓ(uₙ₊₁^λ) ≤ C(λ)Eₓ(uₙ^λ) + D(λ).

Taking B(λ) = D(λ)/(1 − C(λ)), (1.2) follows by induction. Letting n → ∞ in (1.2), we obtain

    B(λ) ≥ lim inf Eₓ(uₙ^λ) ≥ Eₓ(lim inf uₙ^λ)

by Fatou's lemma. Thus lim inf uₙ^λ < ∞ and lim inf Xₙ < ∞ a.s. if P(X₀ = x) = 1. It follows immediately that lim inf Xₙ < ∞ a.s. for any initial distribution. ∎
For two events A and A*, we say that A implies A* a.s. if P(A − A*) = 0. Similarly, A and A* are a.s. equivalent if P(A Δ A*) = 0.

LEMMA 1.2. If m₁ > 0, then lim sup Xₙ = ∞ implies lim Xₙ = ∞ a.s. If λ < 0, M₁(λ) < 1, and V is sufficiently large that F_λ(u) < 1 for all u ≥ V, then

    g₀(x) ≤ (u/V)^λ   (1.3)

(u = eˣ). An analogous result holds if m₀ < 0.

Proof. Let Hₙ = min(uₙ^λ, V^λ). This process is a supermartingale. For

    E(Hₙ₊₁ | uₙ, …, u₀) ≤ E(uₙ₊₁^λ | uₙ) ≤ uₙ^λ = Hₙ

a.s. if uₙ ≥ V, while

    E(Hₙ₊₁ | uₙ, …, u₀) ≤ V^λ = Hₙ

a.s. if uₙ < V. Thus

    E(Hₙ₊₁ | uₙ, …, u₀) ≤ Hₙ

a.s., as claimed. Since Hₙ ≥ 0, lim Hₙ exists a.s. [Neveu, 1965, (1), p. 137]. If lim sup Xₙ = ∞, lim inf Hₙ = 0. But the latter implies lim Hₙ = 0 a.s., which in turn implies lim Xₙ = ∞. Thus lim sup Xₙ = ∞ implies lim Xₙ = ∞ a.s.
To obtain the last assertion of the lemma, note that

    Eₓ(Hₙ) ≤ min(u^λ, V^λ) ≤ u^λ.   (1.4)

But lim Hₙ ≥ V^λ 1_A, where A is the event lim Xₙ = −∞. Thus letting n → ∞ in (1.4) yields (1.3). ∎

The last lemma is of a rather different kind.

LEMMA 1.3. If m₀ > 0 or m₁ > 0, P(lim sup Xₙ ∈ X) = 0.
Similarly, if m₀ < 0 or m₁ < 0, the probability that lim inf Xₙ is real is 0.

Proof. If mᵢ > 0, then Qᵢ([2ε, ∞)) > 0 for some ε > 0. For any x,

    P(Xₙ₊₁ > x + ε | Xₙ, …, X₀) = Q₁([x − Xₙ + ε, ∞))pₙ + Q₀([x − Xₙ + ε, ∞))qₙ
        ≥ Q₁([2ε, ∞))αₓ + Q₀([2ε, ∞))βₓ = γₓ

a.s. if |Xₙ − x| < ε, where αₓ = inf_{|y−x|≤ε} p(y) and βₓ = inf_{|y−x|≤ε} q(y). Since p is bounded away from 0 and 1 over finite intervals, αₓ > 0 and βₓ > 0; thus γₓ > 0. It follows that |Xₙ − x| < ε infinitely often (i.o.) implies

    Σ_{n=0}^{∞} P(Xₙ₊₁ > x + ε | Xₙ, …, X₀) = ∞

a.s. The latter event is a.s. equivalent to Xₙ ≥ x + ε i.o. (Neveu, 1965, Corollary to Proposition IV.6.3). Thus the probability that |Xₙ − x| < ε i.o. but not Xₙ ≥ x + ε i.o. is 0. However, |lim sup Xₙ − x| < ε implies that |Xₙ − x| < ε i.o. but not Xₙ ≥ x + ε i.o., so

    P(|lim sup Xₙ − x| < ε) = 0.

Since X is a denumerable union of intervals of the form (x − ε, x + ε), the lemma follows. ∎

Proof of Theorem 1.1. Suppose m₁ > 0. By Lemma 1.3, lim sup Xₙ = −∞ or lim sup Xₙ = ∞ a.s. By Lemma 1.2, lim sup Xₙ = ∞ implies lim Xₙ = ∞ a.s.; hence lim sup Xₙ = −∞ or lim Xₙ = ∞ a.s. When m₀ > 0, lim sup Xₙ > −∞ a.s. by (an analog of) Lemma 1.1, so that, when m₀ > 0 and m₁ > 0, lim Xₙ = ∞ a.s. This proves (A), and (B) is similar.

Returning to the case where our only assumption is m₁ > 0, we have lim Xₙ = −∞ or lim Xₙ = ∞ a.s. Thus g₀(x) + g₁(x) = 1. It follows from (1.3) that g₀(x) → 0; hence g₁(x) → 1 as x → ∞. In particular, there is a constant d such that g₁(x) > 0 for x ≥ d. It is easily shown that, for any n ≥ 1
and x ∈ X,
    Pₓ(ΔXⱼ ≥ ε, j = 0, …, n − 1) ≥ (δc)ⁿ,

where c = inf_{y≥x} p(y) > 0 and δ = Q₁([ε, ∞)) > 0 for ε sufficiently small. Hence Pₓ(Xₙ ≥ d) > 0 if n is sufficiently large that x + nε ≥ d. But g₁(x) = Eₓ(g₁(Xₙ)), so g₁(x) > 0 for all x ∈ X. The other assertions in (C) follow similarly from m₀ < 0.

Suppose m₁ < 0. Then lim inf Xₙ = ±∞ a.s. (Lemma 1.3) and lim inf Xₙ < ∞ a.s. (Lemma 1.1), so lim inf Xₙ = −∞ a.s. Thus lim inf pₙ = 0 a.s. Similarly, m₀ > 0 implies lim sup pₙ = 1 a.s., and (D) is proved. ∎

If p(x) = O(e^{αx}) as x → −∞ for some α > 0, as in the beta model, the conclusion in (B) of Theorem 1.1 can be strengthened to Σₙ pₙ < ∞ a.s. In fact

    Eₓ(Σ_{n=0}^{∞} pₙ) < ∞
for all x ∈ X (Norman, 1970b, Theorem 2).

The recurrent case m₁ < 0, m₀ > 0 is the subject of the next three sections.

14.2. Asymptotic A₁ Response Frequency
As in zero-absorbing-barrier linear and stimulus sampling models, A₁ response frequency

    Ā₁ₙ = n⁻¹ Σ_{m=0}^{n−1} A₁ₘ

(A₁ₘ being the indicator of response A₁ on trial m) and average A₁ response probability

    p̄ₙ = n⁻¹ Σ_{m=0}^{n−1} pₘ

admit strong laws of large numbers in recurrent additive models. Furthermore, there is a simple expression

    ζ = m₀/(m₀ − m₁)   (2.1)

for the a.s. limit of these sequences in the additive case. Theorem 2.1 also shows that the expectations of √n(Ā₁ₙ − ζ) and √n(p̄ₙ − ζ) converge to 0, and their second moments are bounded. Presumably their distributions are asymptotically normal under suitable additional conditions yet to be found. It would also be of interest to have laws of large numbers and central limit theorems for other functions g(Eₙ) and f(Xₙ) of events and states.
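A quick simulation of a recurrent beta model illustrates (2.1). The parameter choices below are our own: the increments are degenerate at b₁ < 0 after A₁ and b₀ > 0 after A₀, so m₁ = b₁ and m₀ = b₀, and the A₁ response frequency should settle near ζ = m₀/(m₀ − m₁).

```python
import math
import random

def p(x):                      # beta model response probability, u = e^x
    return 1 / (1 + math.exp(-x))

b1, b0 = -0.05, 0.1            # m1 < 0 < m0: the recurrent case
zeta = b0 / (b0 - b1)          # predicted limiting A1 frequency = 2/3
rng = random.Random(0)
x, a1 = 0.0, 0
n = 200_000
for _ in range(n):
    if rng.random() < p(x):    # response A1
        a1 += 1
        x += b1
    else:                      # response A0
        x += b0
assert abs(a1 / n - zeta) < 0.02
```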
THEOREM 2.1. Suppose m₁ < 0 and m₀ > 0, and let K < ∞. Then (A) Ā₁ₙ → ζ and p̄ₙ → ζ a.s.; and (B), uniformly over |x| ≤ K,

    Eₓ(n(Ā₁ₙ − ζ)²) = O(1),   (2.2)
    Eₓ(n(p̄ₙ − ζ)²) = O(1),   (2.3)
    Eₓ(p̄ₙ) − ζ = O(1/n).   (2.4)

Proof. Let Vₙ = ΔXₙ − E(ΔXₙ | X₀, …, Xₙ) and Wₙ = A₁ₙ − pₙ. These are martingale differences, from which it follows that E(VₘVₙ) = 0 and E(WₘWₙ) = 0 if m ≠ n. Furthermore, if sᵢ = ∫x² Qᵢ(dx) and max sᵢ = s̄, then

    E(Vₙ²) ≤ s̄   (2.5)

and

    E(Wₙ²) ≤ 1/4,   (2.6)

so that

    n⁻¹ Σ_{m<n} Vₘ → 0  and  n⁻¹ Σ_{m<n} Wₘ → 0  a.s.   (2.7)

Note that

    E(ΔXₙ | Xₙ) = m₁pₙ + m₀qₙ = −δ(pₙ − ζ),

where δ = m₀ − m₁. Averaging over the first n trials thus gives

    Ā₁ₙ = p̄ₙ + n⁻¹ Σ_{m<n} Wₘ   (2.8)

and

    (Xₙ − x)/n = −δ(p̄ₙ − ζ) + n⁻¹ Σ_{m<n} Vₘ.   (2.9)

Now if λ > 0 is sufficiently small,

    Eₓ(e^{λXₙ}) ≤ e^{λx} + B(λ)  and  Eₓ(e^{−λXₙ}) ≤ e^{−λx} + B(−λ)

by (1.2). Adding these equations, and observing that e^y + e^{−y} ≥ y², we obtain

    λ²Eₓ(Xₙ²) ≤ 2 cosh λx + B,

where B = B(λ) + B(−λ). Thus

    Eₓ(Xₙ²) ≤ (2 cosh λK + B)/λ² = J

for all n ≥ 0 and |x| ≤ K. Clearly

    Σ_{n≥1} Eₓ((Xₙ/n)²) ≤ J Σ_{n≥1} n⁻² < ∞,

so Xₙ/n → 0 a.s. This, in conjunction with (2.7) and (2.9), yields

    p̄ₙ → ζ  a.s.

But then (2.8) implies Ā₁ₙ → ζ a.s., and (A) is proved.

The expectation of (2.9) is

    0 = Eₓ(Xₙ)/n − x/n + δ(Eₓ(p̄ₙ) − ζ);   (2.10)

therefore

    δ|Eₓ(p̄ₙ) − ζ| ≤ (|x| + |Eₓ(Xₙ)|)/n ≤ (K + √J)/n

for |x| ≤ K, as a consequence of (2.10). Thus (2.4) is established. Equation (2.9) and Minkowski's inequality yield

    δ(Eₓ(n(p̄ₙ − ζ)²))^{1/2} ≤ (K + √J)/√n + √s̄

for |x| ≤ K, by (2.10) and (2.5). This yields (2.3) and, with (2.6), (2.2). ∎
We now consider how mᵢ and ζ depend on the parameters of symmetric five-operator additive models. If v = bc + b*c* = 0, Xₙ = X₀ a.s., so assume v > 0 and let r = bc/v. The quantities bc and b*c* determine the efficacy of success and failure, respectively, and r measures their relative magnitude. A simple computation shows that

    m₁ = v(r − π₁₀)  and  m₀ = v(π₀₁ − r).

Suppose that π₀₁ ≥ π₁₀, or, equivalently, π₁₁ ≥ π₀₀. This means that A₁ is more likely to be successful than A₀. The opposite case is similar. Then recurrence (m₁ < 0, m₀ > 0) occurs when r < π₁₀. Under this condition

    ζ = (π₀₁ − r)/(π₀₁ + π₁₀ − 2r).

If π₀₁ = π₁₀, ζ = 1/2. Otherwise ζ is a strictly increasing convex function of r that runs from π₀₁/(π₀₁ + π₁₀) at r = 0 to 1 at r = π₁₀. Thus the model predicts that asymptotic A₁ response frequency ζ will overshoot the probability matching asymptote π₀₁/(π₀₁ + π₁₀) over the range 0 < r < π₁₀. If π₁₀ < r < π₀₁, then m₁ > 0 and m₀ > 0, so pₙ → 1 a.s. according to Theorem 1.1. Thus success probability π₁₁pₙ + π₀₀qₙ is maximized asymptotically. Finally, if r > π₀₁, m₁ > 0 and m₀ < 0, so pₙ → 1 or pₙ → 0 a.s., and both limits have positive probability.
14.3. Existence of Stationary Probabilities

If p is continuous, the Markov process Xₙ always possesses a stationary probability in the recurrent case. Furthermore, the moment generating functions of stationary probabilities are bounded by the function B(λ) of Lemma 1.1. Let 𝒮 be the set of stationary probabilities.

THEOREM 3.1. If m₀ > 0, m₁ < 0, and p is continuous, then 𝒮 is not empty. In fact, for every stochastically and topologically closed subset Y of X, there is a μ ∈ 𝒮 such that μ(Y) = 1. If λ > 0 and M₁(λ) < 1, or λ < 0 and M₀(λ) < 1, then ∫e^{λx} μ(dx) ≤ B(λ) for all μ ∈ 𝒮.

Proof. The sequence of averaged kernels

    P̄ₙ(ξ, ·) = n⁻¹ Σ_{m<n} K⁽ᵐ⁾(ξ, ·)

(see Section 2.2 for the notation) is conditionally weakly compact. For, as in the proof of Theorem 2.1, there is a constant J such that E_ξ(Xₙ²) ≤ J for all n. Hence

    P̄ₙ(ξ, Iₐ) ≤ J/a²,

where Iₐ = (−∞, −a) ∪ (a, ∞), for all a > 0 and n ≥ 0. Therefore, P̄ₙ(ξ, ·) is uniformly tight, thus conditionally weakly compact (Parthasarathy, 1967, Theorem II.6.7).

Suppose now that Y is stochastically and topologically closed (e.g., Y = X). Let ξ ∈ Y and μⱼ = P̄ₙⱼ(ξ, ·) be a subsequence of P̄ₙ(ξ, ·) that converges weakly to a probability μ. Since Y is stochastically closed, μⱼ(Y) = 1 for all j, and, since Y is topologically closed, lim sup μⱼ(Y) ≤ μ(Y). Thus μ(Y) = 1. In addition, μ is stationary. For, by virtue of the continuity of p, U maps bounded continuous functions into bounded continuous functions, and this implies, via (1.1.7), that T is continuous with respect to weak
convergence. Thus Tμⱼ → Tμ weakly, and

    (T^{nⱼ}δ_ξ − δ_ξ)/nⱼ = Tμⱼ − μⱼ → Tμ − μ

weakly, as j → ∞. But the left side converges weakly to 0, so Tμ = μ, as claimed.

Suppose now that μ ∈ 𝒮, and let λ > 0 and M₁(λ) < 1, or λ < 0 and M₀(λ) < 1. For any d > 0, let F(x) = min(e^{λx}, d). Then

    ∫F dμ = ∫F dTⁿμ = ∫UⁿF dμ.

Since UⁿF(x) ≤ d for all n and x, Fatou's lemma gives

    ∫F dμ ≤ ∫ lim sup UⁿF dμ.

But UⁿF(x) ≤ Eₓ(uₙ^λ), so ∫F dμ ≤ B(λ) as a consequence of (1.2). As d → ∞, the left-hand side converges to ∫e^{λx} dμ, so ∫e^{λx} dμ ≤ B(λ). ∎
In Section 14.5 it is shown that, if p is increasing and sufficiently smooth, then stationary probabilities are approximately normal, with mean p⁻¹(ζ) and variance of the order of magnitude of the increments bₑ, when these are small.

14.4. Uniqueness of the Stationary Probability

No necessary and sufficient condition is known for uniqueness of stationary probabilities of state sequences of recurrent additive models. The classical sufficient condition for uniqueness of stationary probabilities of general Markov chains is indecomposability, the nonexistence of disjoint stochastically closed sets (Breiman, 1968, Theorem 7.16). This criterion has the following corollary for additive models.

THEOREM 4.1. If Q₁ or Q₀ has a positive density with respect to Lebesgue measure L, then there is at most one stationary probability.
Proof. Suppose that Qᵢ(B) = ∫_B fᵢ(x) L(dx), where fᵢ(x) > 0 for all x ∈ X. Since p(x) > 0 and q(x) > 0,

    0 = K(x, B) = p(x)Q₁(B − x) + q(x)Q₀(B − x)

implies that Qᵢ(B − x) = 0; thus L(B − x) = 0, so that L(B) = 0. If A and B are stochastically closed, x ∈ A, and y ∈ B, then K(x, Aᶜ) = 0 and K(y, Bᶜ) = 0, so L(Aᶜ) = 0 and L(Bᶜ) = 0. Therefore, A ∩ B ≠ ∅, and K is indecomposable. ∎
However, indecomposability is not necessary for uniqueness in recurrent additive models. Consider the case Qᵢ({bᵢ}) = 1, where b₁ = m₁ < 0 and b₀ = m₀ > 0, so that Xₙ has the simple transition law

    ΔXₙ = b₁  with probability p(Xₙ),
          b₀  with probability q(Xₙ).   (4.1)

Let

    G = {n₀b₀ + n₁b₁ : n₀, n₁ are integers}.

Then x + G is stochastically closed for any x ∈ X, and the collection of distinct, hence disjoint, sets of this form is nondenumerable. So Xₙ is certainly not indecomposable. But uniqueness can obtain.

THEOREM 4.2. If p is continuous and nondecreasing, and b₀/b₁ is irrational, then there is a unique stationary probability.

According to Theorem 3.1, a necessary condition for uniqueness is irreducibility, the nonexistence of disjoint stochastically and topologically closed subsets of X. If b₀/b₁ is rational, the sets x + G are topologically closed, so uniqueness fails. On the other hand, if b₀/b₁ is irrational, X is the only stochastically and topologically closed set. For if Y is stochastically closed and x ∈ Y, then x + G⁺ ⊂ Y, where

    G⁺ = {n₀b₀ + n₁b₁ : n₀, n₁ ≥ 0}.

And x + G⁺ is dense, so Y = X if Y is topologically closed. Hence if b₀/b₁ is irrational, irreducibility holds as well as uniqueness. It remains to be seen whether irreducibility is sufficient for uniqueness in the general recurrent additive model with continuous nondecreasing p. The proof of Theorem 4.2, which is due to James Pickands, III and the author, is published here for the first time.

Proof. The continuity of T ensures that 𝒮 is weakly closed, and the inequality

    ∫cosh(λx) μ(dx) ≤ (B(λ) + B(−λ))/2,

valid for all μ ∈ 𝒮 if λ > 0 is sufficiently small according to Theorem 3.1, implies that 𝒮 is weakly compact. Therefore, by the Krein–Milman theorem (Dunford and Schwartz, 1958, Theorem V.8.4) applied to the finite signed measures on X with the weak topology, the convex set 𝒮 is the closed convex hull of its extremal points. Hence it suffices to show that there is a unique extremal stationary probability. A stationary probability μ is extremal if and only if it is ergodic, that is, Ūₙf → ∫f dμ in L₂(μ) for every f ∈ L₂(μ), where Ūₙ = n⁻¹ Σ_{m<n} Uᵐ (Rosenblatt, 1967, Theorem 2).
14. ADDITIVE MODELS
In order to study the dependence of Uⁿf(x) on x, we define a collection of Markov processes Xₙ(x) on the same probability space. Let τₙ, n ≥ 0, be a sequence of independent random variables, each uniformly distributed on [0, 1]. Let X₀(x) = x, and, given Xₙ(x), let

Xₙ₊₁(x) = Xₙ(x) + b₁ if τₙ ≤ pₙ(x),
Xₙ₊₁(x) = Xₙ(x) + b₀ if τₙ > pₙ(x),

where pₙ(x) = p(Xₙ(x)). It is easy to see that, for each x ∈ X, Xₙ(x) has the transition law (4.1). The processes Xₙ(x) have the following important property: If

0 < x − y < b₀ + |b₁| = b,

then

Xₙ(x) − Xₙ(y) = x − y or x − y − b   (4.2)

for all n ≥ 0. For this is certainly true for n = 0. Suppose, inductively, that Xₙ(x) − Xₙ(y) = x − y. Then pₙ(x) ≥ pₙ(y), since p is nondecreasing. If τₙ ≤ pₙ(y) or τₙ > pₙ(x), then ΔXₙ(x) = ΔXₙ(y), so Xₙ₊₁(x) − Xₙ₊₁(y) = x − y. If pₙ(y) < τₙ ≤ pₙ(x), then ΔXₙ(x) = b₁ and ΔXₙ(y) = b₀, so Xₙ₊₁(x) − Xₙ₊₁(y) = x − y − b. The case Xₙ(x) − Xₙ(y) = x − y − b is similar, so (4.2) follows.
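The monotone coupling above is easy to check numerically. The sketch below is our illustration, not the book's: we pick a concrete logistic p (continuous and nondecreasing) and increments b₀ = √2, b₁ = −1, so that b₀/b₁ is irrational, and drive both copies of the process (4.1) with the same uniform variables.

```python
import math
import random

# Illustrative coupling for Theorem 4.2; the logistic p and the particular
# increments b0 = sqrt(2), b1 = -1 are our own choices for the demonstration.
b1, b0 = -1.0, math.sqrt(2.0)
b = b0 - b1                      # b = b0 + |b1|

def p(x):
    # a continuous, nondecreasing probability of taking the b1-increment
    return 1.0 / (1.0 + math.exp(-x))

def coupled_paths(x, y, steps, rng):
    """Run X_n(x) and X_n(y) with the same uniform variables, as in the text."""
    X, Y = x, y
    for _ in range(steps):
        u = rng.random()
        X += b1 if u <= p(X) else b0
        Y += b1 if u <= p(Y) else b0
        # property (4.2): for 0 < x - y < b the difference stays in {x-y, x-y-b}
        assert min(abs((X - Y) - (x - y)), abs((X - Y) - (x - y - b))) < 1e-9
    return X, Y

coupled_paths(0.5, 0.0, 2000, random.Random(0))
```

The in-loop assertion is exactly property (4.2); it never fires because the two paths move together except when the uniform variable falls between the two response probabilities.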
Suppose now that f is a bounded nondecreasing differentiable function on X such that |f′| = sup_x |f′(x)| < ∞. Then, if 0 < x − y < b,

Uⁿf(x) − Uⁿf(y) = E(Δ), where Δ = f(Xₙ(x)) − f(Xₙ(y)) ≤ |f′|(x − y)

by (4.2). Thus

Uⁿf(x) − Uⁿf(y) ≤ |f′|(x − y)   (4.3)

for all n ≥ 0.
Let μ₁ and μ₂ be extremal, hence ergodic, stationary probabilities. Since Uⁿf → ∫f dμ₁ in L₂(μ₁), there is a subsequence n′ such that U^{n′}f → ∫f dμ₁ μ₁-a.s. But U^{n′}f → ∫f dμ₂ in L₂(μ₂), so there is a subsequence n* of n′ such that U^{n*}f → ∫f dμᵢ on a set Aᵢ with μᵢ(Aᵢ) = 1, i = 1, 2. The support of μᵢ is X. For it is topologically closed by definition and stochastically closed by Lemmas 2.2.3 and 3.4.3, and we have already noted that X is the only stochastically and topologically closed set when b₀/b₁ is irrational. It follows that Aᵢ is dense. If 0 < x − y < b, x ∈ A₁, and y ∈ A₂, then letting n → ∞ along the subsequence n* in (4.3) we obtain

∫f dμ₁ − ∫f dμ₂ ≤ |f′|(x − y).

Since there are points y in A₂ with x − y positive and arbitrarily small, ∫f dμ₁ ≤ ∫f dμ₂. By symmetry, ∫f dμ₂ ≤ ∫f dμ₁, so

∫f dμ₁ = ∫f dμ₂   (4.4)

for all bounded nondecreasing differentiable functions f with |f′| < ∞. If g is such a function, then, for any y ∈ X and σ > 0,

f_σ(x) = g((x − y)/σ)

is too, and, if g(∞) = 1 and g(x) = 0 for x ≤ 0, then

f_σ(x) → 1_{(y,∞)}(x)

monotonically as σ → 0. Thus, applying (4.4) to f_σ and taking the limit as σ → 0, we obtain μ₁((y, ∞)) = μ₂((y, ∞)) for all y ∈ X, from which it follows that μ₁ = μ₂. ∎

If a recurrent additive model with p continuous has a unique stationary probability μ, then Uⁿf(x) → ∫f dμ for all x ∈ X and bounded continuous f. For we saw in the proof of Theorem 3.1 that the averaged kernels K̄⁽ⁿ⁾(x, ·) form a weakly conditionally compact family, and that every weak subsequential limit is stationary, hence is μ. It follows easily that K̄⁽ⁿ⁾(x, ·) converges weakly to μ. One wonders whether, under these conditions, Uⁿf converges uniformly over compact sets. This is true for f = p by (B) of Theorem 2.1. In closing this section, we note the lack of criteria for aperiodicity, that is, convergence of K⁽ⁿ⁾(x, ·) itself as opposed to the averages K̄⁽ⁿ⁾(x, ·). Our only result worth mentioning is negative: Xₙ is not aperiodic under the hypotheses of Theorem 4.2.

14.5. Slow Learning
Suppose that b_e = θa_e, where θ > 0, a_e is a fixed real measurable function of the event e, and p(x) and the conditional distributions of a_e are also fixed. This section describes the limiting behavior of Xₙ = Xₙ^θ and pₙ = p(Xₙ) as θ → 0.
If

L_i(ω) = E(exp(ω a_e) | A_i) < ∞ for |ω| < ε,

then M_i(λ) = L_i(θλ) < ∞ for |λ| < ε/θ. The moments of b_e and a_e are linearly related:

m_i = E(b_e | A_i) = θ E(a_e | A_i) = θ m̄_i,
s_i = E(b_e² | A_i) = θ² E(a_e² | A_i) = θ² s̄_i,
r_i = E(|b_e|³ | A_i) = θ³ E(|a_e|³ | A_i) = θ³ r̄_i.
In addition,

w(x) = E(ΔXₙ/θ | Xₙ = x) = m̄₁ p(x) + m̄₀ q(x),
S(x) = E((ΔXₙ/θ)² | Xₙ = x) = s̄₁ p(x) + s̄₀ q(x),   (5.1)
r(x) = E(|ΔXₙ/θ|³ | Xₙ = x) = r̄₁ p(x) + r̄₀ q(x).
To apply Theorem 8.1.1, we take Z = Z^θ = R¹, and note that the approximation assumption (a) is trivially fulfilled. Assuming that p(x) has two bounded derivatives, as in beta models, the smoothness conditions (b) on w(x) and s(x) = S(x) − w(x)² are satisfied. Finally, (c) follows from r(x) ≤ maxᵢ r̄ᵢ. As a consequence of Theorem 8.1.1,

(Xₙ − f(nθ))/√θ ~ N(0, g(nθ))

when θ → 0 and nθ is bounded. As usual, f′(t) = w(f(t)) and f(0) = x = X₀. If w(x) ≠ 0, H(f(t)) = t and g(t) = G(f(t)), where H and G are given by (12.4.5) and (12.4.8), respectively. Letting P(t) = p(f(t)), it follows that

(pₙ − P(nθ))/√θ ~ N(0, p′(f(nθ))² g(nθ)).

Clearly P(0) = p(x), and

P′(t) = p′(f(t)) (m̄₁ P + m̄₀ (1 − P)).

In beta models, p′ = pq, so P satisfies the autonomous differential equation

P′(t) = P(1 − P)(m̄₁ P + m̄₀ (1 − P)).
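The mean-path approximation f′ = w(f) can be checked against a direct simulation. The instance below is our own (not the book's): the two-point case a_e = −1 on A₁, a_e = +1 on A₀, with a logistic p, so that w(x) = m̄₁p(x) + m̄₀q(x) as in (5.1).

```python
import math
import random

# Sketch: for small theta, the Monte Carlo mean of X_n tracks the solution of
# f'(t) = w(f(t)) at t = n*theta.  The logistic p and two-point increments
# are illustrative assumptions.
theta = 0.01
m1bar, m0bar = -1.0, 1.0

def p(x):
    return 1.0 / (1.0 + math.exp(-x))

def w(x):
    # drift w(x) = m1bar*p(x) + m0bar*q(x), cf. (5.1)
    return m1bar * p(x) + m0bar * (1.0 - p(x))

def euler_f(x0, t_end, h=1e-3):
    f = x0
    for _ in range(int(t_end / h)):
        f += h * w(f)
    return f

def mc_mean(x0, n, reps, rng):
    total = 0.0
    for _ in range(reps):
        x = x0
        for _ in range(n):
            x += theta * (m1bar if rng.random() <= p(x) else m0bar)
        total += x
    return total / reps

n = 200                                   # so t = n*theta = 2.0
approx = euler_f(1.0, n * theta)
sim = mc_mean(1.0, n, 400, random.Random(0))
# sim and approx agree to O(theta) plus Monte Carlo noise
```

With these choices w(0) = 0, so the mean path started at x = 1 relaxes toward 0, and the simulated mean follows it to within the theorem's O(θ) error.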
We mention that, if p′(y) > 0 for all y, and p has certain other regularities, asymptotic normality of pₙ as θ → 0 can be obtained by applying Theorem 8.1.1 directly to pₙ. In the remainder of the section, we develop approximations to stationary probabilities and absorption probabilities. Suppose that p′(x) > 0 for all x ∈ R¹. If m̄₁ and m̄₀ have opposite signs, there is a unique ξ ∈ R¹ such that

p(ξ) = ζ = m̄₀/(m̄₀ − m̄₁)

or, equivalently, w(ξ) = 0. Also,

w′(ξ) = (m̄₁ − m̄₀) p′(ξ) ≠ 0,

so that

σ² = s(ξ)/(2|w′(ξ)|)

is well defined (and greater than 0). Let

Zₙ = (Xₙ − ξ)/√θ.

If m̄₁ < 0 and m̄₀ > 0, then, for every θ, Xₙ has at least one stationary probability μ^θ, and this distribution has finite fourth moment (Theorem 3.1). Let X₀, hence Xₙ, have distribution μ^θ. Since w′(ξ) < 0, (d) of Theorem 10.1.1 is satisfied, and the remaining hypotheses (ii.1) and (ii.2) of the second part of the theorem are easily checked. The conclusion is that

Zₙ ~ N(0, σ²)

as θ → 0, from which

(pₙ − ζ)/√θ ~ N(0, p′(ξ)² σ²)

follows.
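The stationary-fluctuation approximation is also easy to test numerically. In the sketch below (our example, not the book's), p is logistic and a_e = −1 on A₁, +1 on A₀; then ξ = 0, w′(ξ) = −1/2, s(ξ) = 1, so σ² = s(ξ)/(2|w′(ξ)|) = 1 and (Xₙ − ξ)/√θ should be approximately N(0, 1).

```python
import math
import random

# Long-run simulation of the recurrent case (m1bar < 0 < m0bar); the model
# instance is an illustrative assumption.
theta = 0.005

def p(x):
    return 1.0 / (1.0 + math.exp(-x))

rng = random.Random(0)
x = 0.0
burn, keep = 20000, 200000
zs = []
for i in range(burn + keep):
    x += theta * (-1.0 if rng.random() <= p(x) else 1.0)
    if i >= burn:
        zs.append(x / math.sqrt(theta))   # Z_n = (X_n - xi)/sqrt(theta), xi = 0

mean = sum(zs) / len(zs)
var = sum((z - mean) ** 2 for z in zs) / len(zs)
# mean should be near 0 and var near sigma^2 = 1 for small theta
```

The empirical variance of Zₙ converges to σ² as θ → 0; at θ = 0.005 it already agrees to within a few percent plus sampling noise.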
If m̄₀ < 0 and m̄₁ > 0, then, for every θ > 0 and initial state x, Xₙ → ±∞ a.s. by (C) of Theorem 1.1, and we wish to approximate

g₁(x) = P_x(Xₙ → ∞).

Note that Xₙ → ∞ if and only if Zₙ → ∞, so

g₁(x) = P(Zₙ → ∞) = b(z),

where z = (x − ξ)/√θ. By applying Theorem 11.2.1 to the process Zₙ and the parameter

τ = θ (not θ²!),

we shall show that

sup_{x∈R¹} |g₁(x) − Φ(z/σ)| → 0

as θ → 0. Here Φ is the standard normal cumulative distribution function. Since ξ is a critical point of w, Lemma 8.5.1 yields

E(ΔZₙ | Zₙ = z) = θ w′(ξ) z + o(θ),
E((ΔZₙ)² | Zₙ = z) = θ s(ξ) + o(θ),
E(|ΔZₙ|³ | Zₙ = z) = o(θ),

where, for any k > 0, the o(θ) terms are uniform over |z| ≤ k as θ → 0. Thus (11.1.4) and (11.2.2) obtain, and the function r(z) defined
below (11.1.5) is z/σ². It follows that

B(z) = ∫₀^z r(y) dy = z²/2σ²,

and e^{−B(z)} is integrable over R¹, as Theorem 11.2.1 requires. The function Φ(z/σ) is the solution of

ψ″(z) + r(z) ψ′(z) = 0

with limits 0 and 1 at −∞ and ∞. Thus, it remains only to show that

lim_{z→−∞, θ→0} φ_θ(z) = lim_{z→∞, θ→0} (1 − φ_θ(z)) = 0,   (5.2)

where φ_θ(z) = g₁(ξ + z√θ).
Proof. The calculation that follows will put us in position to use Lemma 1.2. There is a constant c > 0 such that, for |ω| < ε/2,

L_i(ω) ≤ 1 + ω m̄_i + c ω².

Hence, if |λθ| < ε/2,

F_λ(x) − 1 ≤ λθ(m̄₁ p + m̄₀ q) + c λ²θ² = λθ γ(p − ζ) + c λ²θ²,

where γ = m̄₁ − m̄₀ > 0. Taking λ = −1/√θ, we obtain

F_λ(x) − 1 ≤ −√θ(γ(p(x) − ζ) − c√θ)

if √θ < ε/2. If θ is still smaller, say θ < ε′, there is a unique x_θ > ξ such that

p(x_θ) − ζ = c√θ/γ,

and x ≥ x_θ implies that F_λ(x) ≤ 1. Thus (1.3) with V = e^{λx} takes the form

1 − φ_θ(z) ≤ exp(−(x − x_θ)/√θ) = e^{z_θ} e^{−z},

where z = (x − ξ)/√θ and z_θ = (x_θ − ξ)/√θ. Now

z_θ → c/(γ p′(ξ))

as θ → 0. Therefore z_θ is bounded, and there is a constant K such that

1 − φ_θ(z) ≤ K e^{−z}

for all z ∈ R¹, if θ < ε′. The second limit in (5.2) follows immediately, and the other holds by symmetry. ∎
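The absorption-probability approximation g₁(x) ≈ Φ(z/σ) can be seen directly by Monte Carlo. The instance below is our own: reversing the signs of the earlier example (a_e = +1 on A₁, −1 on A₀, logistic p) makes ξ = 0 an unstable critical point with σ = 1.

```python
import math
import random

# Monte Carlo sketch of g1(x) = P_x(X_n -> +infinity); the model instance and
# the escape threshold 2.5 are illustrative assumptions.
theta = 0.01

def p(x):
    return 1.0 / (1.0 + math.exp(-x))

def Phi(z):
    # standard normal cumulative distribution function
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def g1_hat(x0, reps, rng):
    up = 0
    for _ in range(reps):
        x = x0
        while abs(x) < 2.5:          # far from xi the drift takes over
            x += theta * (1.0 if rng.random() <= p(x) else -1.0)
        up += x > 0
    return up / reps

est = g1_hat(1.0 * math.sqrt(theta), 1200, random.Random(0))
# for z = 1, est should be near Phi(1) ≈ 0.84, up to O(sqrt(theta)) error
```

Once |x| exceeds the threshold the drift is bounded away from 0, so the sign of the exit side determines the limit of the path to good accuracy.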
15. Multiresponse Linear Models
The distinctive feature of the models considered in this chapter is that the subject's response y is an element of an arbitrary measurable space (Y, 𝒴). The state x is the distribution of y, and the state space (X, d) is a set of probabilities on 𝒴, with total variation distance d(x, x′) = |x − x′|. The response y is followed by an outcome z from a measurable space (Z, 𝒵). Its distribution is Π(y, ·), where Π is a stochastic kernel. Thus the event e = (y, z) has distribution

p(x, de) = x(dy) Π(y, dz).

The event space is (E, ℰ) = (Y, 𝒴) × (Z, 𝒵). Occurrence of e transforms x into

u(x, e) = (1 − θ_e)x + γ_e,

where γ_e is a measure on 𝒴 with γ_e(Y) = θ_e, e ↦ γ_e is a measurable mapping into (M(Y), d), and

λ_e = θ_e⁻¹ γ_e ∈ X

if θ_e > 0. We assume that X is convex, so that u(x, e) ∈ X, and separable, so that u is jointly measurable. Both conditions are satisfied if, for example, 𝒴 is countably generated and X is the set of probabilities absolutely continuous with respect to a fixed σ-finite measure. Some interesting characterizations of transformations of probabilities of the form x → (1−θ)x + θλ are given by Blau (1961).

The linear free-responding model and Suppes' continuous response model discussed in Chapter 0 are of this type. In the free-responding model, y is an interresponse time and z is the reinforcement indicator variable, so Y = (0, ∞) and Z = {0, 1}. Also, θ_{y1} = θ and θ_{y0} = θ* do not depend on y, and

λ_{yz} = τ* if z = 0,
λ_{yz} = (1−α)τ + αΛ(y, ·) if z = 1,

where τ and τ* are probabilities on 𝒴, and Λ is a stochastic kernel on Y × 𝒴. The probability Π(y, {1}) of reinforcing a y-second interresponse time is denoted u(y). In Suppes' model, y and z are the predicted and "correct" points on, say, the unit circle Y = Z in the complex plane, and λ_{yz} = λ(z, ·) does not depend on y. Generalizing Suppes' assumption that θ_e is constant, we assume that θ_e = θ(z) depends only on z. The five-operator linear model is also a special case; take Y = {0, 1}, Z = {0, 1} × {0, 1}, θ_{i,(j,k)} = kθ_{ij}, λ_{i,(j,k)}({j}) = 1, and Π(i; j, k) = π_{ij} c_{ij} if k = 1 and π_{ij}(1 − c_{ij}) if k = 0. Of course, here the real variable x({1}) characterizes the probability x, and it is natural to describe the model in terms of this variable, as we do in Chapter 12.

Response and outcome random variables are denoted Yₙ and Zₙ; thus Eₙ = (Yₙ, Zₙ). Other useful notations are

θ_y = ∫ Π(y, dz) θ_{yz}

and

γ_y(A) = ∫ Π(y, dz) γ_{yz}(A).
15.1. Criteria for Regularity

We first give a sufficient condition for the model to be distance diminishing.

THEOREM 1.1. If inf_y θ_y > 0, the model is distance diminishing.
Proof.

r₁ = sup_x ∫ p(x, de)(1 − θ_e) = 1 − inf_x ∫ p(x, de) θ_e.

But

∫ p(x, de) θ_e = ∫ x(dy) θ_y ≥ inf_y θ_y,

so that

r₁ ≤ 1 − inf_y θ_y < 1. ∎
Theorems 1.2 and 1.3 give criteria for Xₙ to be regular with respect to the bounded Lipschitz functions L(X). The condition inf_y θ_y > 0 is assumed in Theorem 1.2 and follows from the assumptions of Theorem 1.3, so both relate to distance diminishing models. Both theorems are corollaries of results in Chapter 4.

THEOREM 1.2. If inf_y θ_y > 0, and if there are a > 0 and φ ∈ P(Y) such that λ_e(A) ≥ aφ(A) for all A ∈ 𝒴 and e ∈ E with θ_e > 0, then Xₙ is regular.

THEOREM 1.3. If there are a > 0 and ν ∈ P(Z) such that Π(y, B) ≥ aν(B) for all y ∈ Y and B ∈ 𝒵, if γ_{yz}(A) = γ(z, A) does not depend on y [thus θ_{yz} = θ(z) does not either], and if ∫ ν(dz) θ(z) > 0, then Xₙ is regular.

In the free-responding model,

λ_e ≥ (1 − α)(τ ∧ τ*),

where τ ∧ τ* is the infimum of τ and τ* (see Neveu, 1965, p. 107), so λ_e has a lower bound of the type required by Theorem 1.2 if α < 1 and τ and τ* are not mutually singular. The condition in Theorem 1.3 that the reinforcement function have a response independent component aν is very natural in applications of Suppes' model. The extreme case of noncontingent outcomes, Π(y, ·) = ν, has received much attention in the experimental literature.
Proof of Theorem 1.2. Clearly X is bounded: d(x, x′) ≤ |x| + |x′| = 2. Let X′ be the convex hull of {λ_e : θ_e > 0}. Then X′ is invariant (Definition 4.2.1), so, by Theorem 4.2.2, it suffices to show that the transition operator U′ for the corresponding submodel is regular. If x ∈ X′, there are nonnegative p_i with Σ p_i = 1, and e(i) ∈ E with θ_{e(i)} > 0, such that x = Σ p_i λ_{e(i)}. Thus

x(A) ≥ Σ p_i aφ(A) = aφ(A),

so that

p(x, A) ≥ aν(A),

where

ν(A) = ∫ φ(dy) ∫ Π(y, dz) 1_A(y, z).

Thus the submodel satisfies (a) of Section 4.1. Furthermore,

r₁′ = sup_{x∈X′} ∫ p(x, de)(1 − θ_e) ≤ 1 − inf_y θ_y < 1.

Therefore U′ is regular, according to Theorem 4.1.1. ∎
Proof of Theorem 1.3. The transition operator

Uf(x) = ∫ p(x, de) f(u(x, e))

is the same as for the (reduced) model with X* = X, E* = Z, u*(x, z) = u(x, e), and p*(x, B) = ∫ x(dy) Π(y, B). But

p*(x, B) ≥ aν(B),

and

r₁* ≤ 1 − a ∫ ν(dz) θ(z) < 1,

so U is regular by Theorem 4.1.1. ∎

When θ_y > 0 for all y ∈ Y,

λ_y(A) = γ_y(A)/θ_y

is a stochastic kernel. The following lemma is needed in the next section.
LEMMA 1.1. Under the hypotheses of either Theorem 1.2 or Theorem 1.3, there are a c > 0 and a probability ψ on 𝒴 such that

λ_y(A) ≥ cψ(A)   (1.1)

for all y and A.
Proof. Under the hypotheses of Theorem 1.2,

λ_y(A) = θ_y⁻¹ ∫_{θ_{yz}>0} Π(y, dz) θ_{yz} λ_{yz}(A) ≥ aφ(A),

so (1.1) holds with c = a and ψ = φ. Under the assumptions of Theorem 1.3,

λ_y(A) = θ_y⁻¹ ∫ Π(y, dz) γ(z, A) ≥ θ_y⁻¹ a ∫ ν(dz) γ(z, A) ≥ cψ(A),

where

c = a ∫ ν(dz) θ(z) / sup_y θ_y and ψ(A) = ∫ ν(dz) γ(z, A) / ∫ ν(dz) θ(z). ∎
15.2. The Distribution of Yₙ and Y∞

Let xₙ be the unconditional distribution of Yₙ; i.e., xₙ(A) = P(Yₙ ∈ A). Then, for any bounded measurable function g on Y,

∫ g dxₙ₊₁ = ∫ g dxₙ − E((∫ g dXₙ)(∫ Xₙ(dy) θ_y)) + E(∫ Xₙ(dy) ∫ γ_y(dy′) g(y′)),   (2.1)

which is analogous to (12.2.4). If θ_y = θ̄ for all y, then ∫ Xₙ(dy) θ_y = θ̄ a.s., so that xₙ₊₁ is obtained from xₙ via the linear transformation

xₙ₊₁(A) = (1 − θ̄) xₙ(A) + ∫ xₙ(dy) γ_y(A).

Constancy of θ_y generalizes the condition b = 0 in Chapter 12. It is satisfied in the free-responding model with u(y) = p, for then θ_y = θp + θ*(1 − p). It holds also in Suppes' model if θ(z) = θ or if Π(y, ·) = ν. In the latter case, λ_y = θ̄⁻¹ ∫ ν(dz) γ(z, ·) for all y, so

xₙ₊₁ = (1 − θ̄) xₙ + θ̄ λ̄, or xₙ = λ̄ + (1 − θ̄)ⁿ (x₀ − λ̄),

where λ̄ denotes this common value of λ_y.
Suppose now that the model is distance diminishing and that Xₙ is regular. By Theorem 6.2.1, the model is uniformly regular with φₙ = O(αⁿ), where α < 1. From Section 6.2, we recall that the sequence {Ēₙ}_{n≥0} of coordinate functions on E^∞ is a stationary process, with respect to the asymptotic distribution Γ of {E_{n+N}}_{n≥0} as N → ∞. Let Y_{∞+n} and Z_{∞+n} be the coordinates of Ēₙ: Ēₙ = (Y_{∞+n}, Z_{∞+n}), and let x_∞ be the distribution of Y_∞ = Y_{∞+0}:

x_∞(A) = Γ(Y_∞ ∈ A)

for A ∈ 𝒴. Then

|xₙ(A) − x_∞(A)| ≤ φₙ,

so |xₙ − x_∞| → 0 geometrically as n → ∞. If the real valued measurable function g on Y is positive or integrable,

∫ g(y) x_∞(dy) = ∫ g(Y_∞(ē)) Γ(dē)

is denoted ḡ. For example, if Y = (0, ∞), the real powers g(y) = y^β are of special interest, while if Y is the unit circle, we might consider g(y) = (arg y)^k. The asymptotic expectation ḡ can be estimated as accurately as desired by the corresponding sample mean from a single subject: If ∫ |g(y)| x_∞(dy) < ∞,

(1/n) Sₙ = (1/n) Σ_{j=0}^{n−1} g(Y_j) → ḡ

a.s. as n → ∞, by Theorem 6.2.3. And, if ∫ g²(y) x_∞(dy) < ∞, Theorem 6.2.4 shows that

(Sₙ − nḡ)/√n ~ N(0, σ²),

where σ² is the usual sum of the autocovariances

ρ_j = cov(g(Y_∞), g(Y_{∞+j})).
At the end of the section, we consider a case in which Y = (0, ∞), and determine the powers β for which ∫ y^β x_∞(dy) < ∞. To get information about x_∞, we return to (2.1). Letting n → ∞ and canceling x_∞(A) yields

x_∞(A) ∫ x_∞(dy) θ_y = ∫ x_∞(dy) γ_y(A) − C(A),   (2.2)

where C(A) is the limiting covariance of Xₙ(A) and ∫ Xₙ(dy) θ_y. Neglecting the term C(A), we obtain the nonlinear integral equation

x*(A) ∫ x*(dy) θ_y = ∫ x*(dy) γ_y(A).

If there is a unique solution x*, it is analogous to the approximation to x_∞ in Theorem 12.2.1. Since that approximation is good for small θ_e, we expect x_∞ ≈ x* when θ_e is small.
Throughout the remainder of the section, it is assumed that θ_y = θ > 0 for all y ∈ Y. Then C(A) = 0, so that

x_∞(A) = ∫ x_∞(dy) λ_y(A).   (2.3)

We now solve this equation under the additional condition (1.1), which, according to Lemma 1.1, is no more restrictive than our sufficient conditions for regularity. If c = 1 in (1.1), then λ_y = ψ and x_∞ = ψ. If c < 1, let ξ be the stochastic kernel defined by

λ_y(A) = cψ(A) + (1 − c) ξ(y, A).

Substitution of this expression into (2.3) yields

x_∞(A) = cψ(A) + (1 − c) ∫ x_∞(dy) ξ(y, A),

which leads, on iteration, to

x_∞(A) = c Σ_{j=0}^{k} (1−c)^j ∫ ψ(dy) ξ^{(j)}(y, A) + (1−c)^{k+1} ∫ x_∞(dy) ξ^{(k+1)}(y, A),   (2.4)

where ξ^{(0)}(y, ·) is the unit mass at y. Since the last term is bounded by (1−c)^{k+1}, and c > 0, passage to the limit yields

x_∞(A) = ∫ ψ(dy) ξ′(y, A),   (2.5)

where

ξ′(y, A) = c Σ_{j=0}^{∞} (1−c)^j ξ^{(j)}(y, A).   (2.6)
In the free-responding model with u(y) = p,

γ_y = (1−p) θ* τ* + p θ ((1−α)τ + αΛ(y, ·)),

so that we can take

cψ = θ̄⁻¹ ((1−p) θ* τ* + p θ (1−α) τ)

in (1.1), where θ̄ = (1−p)θ* + pθ, and ξ = Λ. The case Λ(y, A) = η(A/y) is of special interest. Consider now the general model with Y = (0, ∞) and ξ(y, A) = η(A/y). Then

ξ^{(j)}(y, A) = η^{(j)}(A/y),

where η^{(0)} = δ₁, η^{(1)} = η, and η^{(j)} is the jth convolution power of η:

η^{(j+1)}(A) = ∫ η(dy) η^{(j)}(A/y).

Also,

ξ′(y, A) = η′(A/y),

where

η′(A) = c Σ_{j=0}^{∞} (1−c)^j η^{(j)}(A).

The solution (2.5) takes a particularly simple form in terms of the Mellin transforms

x̂_∞(β) = ∫ y^β x_∞(dy),

ψ̂(β), and η̂(β). Here β may be positive or negative, and 0 ≤ x̂_∞(β) ≤ ∞. Since the Mellin transform of a convolution is the product of the transforms (whether or not these are finite), (2.5) yields

x̂_∞(β) = ψ̂(β) η̂′(β).

And it is easy to show that (2.6) can be transformed term by term to obtain

η̂′(β) = c Σ_{j=0}^{∞} (1−c)^j [η̂(β)]^j.

Thus x̂_∞(β) < ∞ if and only if ψ̂(β) < ∞ and (1−c) η̂(β) < 1, in which case

x̂_∞(β) = c ψ̂(β) / (1 − (1−c) η̂(β)).

This formula can also be obtained directly, by transforming both sides of (2.4). If η̂(β₀) < ∞ for some β₀ > 0, then η̂(β) → 1 as β ↓ 0, so (1−c) η̂(β) < 1 for β sufficiently small. If, as would normally be the case in the free-responding model, η̂(β) → ∞ as β → ∞, then x̂_∞(β) = ∞ for β sufficiently large. Similar considerations apply to negative β.
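The Mellin-transform formula has a direct sampling interpretation: draw Y from ψ and multiply it by a geometric number of independent η-factors. The numerical sketch below uses distributions of our own choosing (ψ the law of e^U, η the law of e^{−U}, U uniform on (0,1)), for which both transforms are elementary.

```python
import math
import random

# Check x_inf_hat(beta) = c*psi_hat(beta)/(1 - (1-c)*eta_hat(beta)) at beta=1.
# The particular psi, eta, and c are illustrative assumptions.
c = 0.5
beta = 1.0

psi_hat = math.e - 1.0              # (e^beta - 1)/beta at beta = 1
eta_hat = 1.0 - math.exp(-1.0)      # (1 - e^-beta)/beta at beta = 1
formula = c * psi_hat / (1.0 - (1.0 - c) * eta_hat)

rng = random.Random(0)
reps = 100000
total = 0.0
for _ in range(reps):
    y = math.exp(rng.random())       # draw from psi
    while rng.random() > c:          # j eta-factors with probability c(1-c)^j
        y *= math.exp(-rng.random()) # one eta factor
    total += y ** beta
est = total / reps
# est should agree with `formula` up to Monte Carlo error
```

The sampling scheme is exactly the mixture (2.5)-(2.6): the geometric index j selects the jth multiplicative convolution power η^{(j)}.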
16. The Zeaman-House-Lovejoy Models

The Zeaman-House-Lovejoy or ZHL models were described in Section 0.1. We recall that a rat is fed when he chooses the black arm (B) of a T-maze, instead of the white one (W). According to the models, he attends to brightness (br), rather than position (po), with probability v, and, given attention to brightness, he enters the black arm with probability y:

v = P(br),  y = P(B | br).

In the full model, there is another variable, z, the probability of going left given attention to position, and x = (v, y, z) is the state variable. Observe that, if v = 1 and y = 1, the rat attends to brightness and makes the correct response with probability 1. The consequent state transformation does not change x, so all states x = (1, 1, z), 0 ≤ z ≤ 1, are absorbing. But Theorem 3.4.1 implies that a distance diminishing model with compact state space has at most a finite number of absorbing states. Therefore, the full ZHL model is not distance diminishing. Furthermore, very little is known about it, beyond what follows from our study of the reduced model. The small-step-size theory of Chapter 8 is applicable, but its implications have not been worked out in detail.
The reduced ZHL model has state variable x = (v, y) and state space X = [0,1] × [0,1], with Euclidean metric d. Its events, event probabilities, and state transformations are listed in Table 1, which is an amplification of Table 2 of Section 0.1. The row numbers in Table 1 provide convenient labels for events. Of course, all φ's and θ's are in the closed unit interval.

TABLE 1. Event Effects and Probabilities for the Reduced ZHL Model (θ̄ = 1 − θ). Columns: Event, Δv, Δy, Probability. [The body of the table is not recoverable from this transcription.]
The relationship between stochastic processes associated with the full and reduced ZHL models is exactly what would be expected intuitively. Let

Xₙ = (Vₙ, Yₙ, Zₙ) and Eₙ = (sₙ, aₙ, rₙ)

be any state and event sequences for the full model. Thus sₙ is the stimulus configuration [(B, W) or (W, B)] on trial n, aₙ is the dimension attended to, and rₙ is the rat's choice (B or W). Then, by Theorem 1.2.4,

Xₙ′ = (Vₙ, Yₙ) and Eₙ′ = (aₙ, rₙ)

are state and event sequences for the reduced model. Thus any theorem concerning Vₙ, Yₙ, aₙ, and rₙ in the reduced model applies also to the full model. Bearing this in mind, we restrict our attention in the remainder of this chapter to the reduced model, and refer to it simply as "the ZHL model."

16.1. A Criterion for Absorption
THEOREM 1.1. If φ₁, φ₂, θ₁, θ₂ > 0, the ZHL model is distance diminishing.

THEOREM 1.2. Its state sequences are absorbing, with single absorbing state ξ = (1, 1).

In particular, Xₙ → ξ a.s. as n → ∞, by Theorem 3.4.3. Also, since P(ξ, po ∪ W) = 0, E(#(po ∪ W)) < ∞ by Theorem 6.2.2, where #(po ∪ W) is the total number of trials on which either a perceptual or an overt error occurs. In particular, #(po ∪ W) < ∞ a.s.; that is, subjects learn with probability 1. This conclusion may appear obvious, but no simple, direct proof is known.

Proof of Theorem 1.1. We shall use Proposition 1 of Section 2.1. The functions p(·, e) have bounded partial derivatives, so m(p(·, e)) < ∞. All state transformations have the form

u(x, e) = ((1−φ)v + φv*, (1−θ)y + θy*),

where v* and y* are 0 or 1 and φ, θ are the parameters appropriate to e. Thus

u(x, e) − u(x′, e) = ((1−φ)(v − v′), (1−θ)(y − y′)),

d²(u(x, e), u(x′, e)) = (1−φ)²(v − v′)² + (1−θ)²(y − y′)² ≤ (1 − (φ ∧ θ))² d²(x, x′),

and

l(u(·, e)) ≤ 1 − (φ ∧ θ) ≤ 1.

Furthermore, l(u(·, 1)) < 1, since φ₁ > 0 and θ₁ > 0. Let P = {x : v > 0 and y > 0}. If x ∈ P, p(x, 1) > 0, so we can take j = 1 and d = 1 in Proposition 1. For x ∉ P, we display, in the next paragraph, a succession ē = (e₀, ..., e_{n−1}) of events such that u(x, ē) ∈ P and pₙ(x, ē) > 0. Then, if j = n + 1 and d = (e₀, ..., e_{n−1}, 1),

l(u(·, d)) ≤ l(u(·, ē)) l(u(·, 1)) < 1,

and p_j(x, d) = pₙ(x, ē) p(u(x, ē), 1) > 0, as Proposition 1 requires.

If φ₂ < 1, ē can be chosen as follows (+ indicates positivity): [the small table, giving for each pattern of vanishing coordinates of x a suitable sequence of events 2 and 4, is not recoverable from this transcription]. If φ₂ = 1, then each 2 must be followed by 4 to restore positivity of v. ∎
Proof of Theorem 1.2. Let

λ(x) = u(x, ē) if x ∉ P,  λ(x) = x if x ∈ P,

where ē and P are as in the previous proof, and let σₙ(x) be the set of possible values of Xₙ when X₀ = x. Then λ(x) ∈ P ∩ σₙ(x) (n = 0 if x ∈ P). But u(·, 1) maps P ∩ σₘ(x) into P ∩ σₘ₊₁(x), so induction yields

u(λ(x), 1^{(k)}) ∈ P ∩ σₙ₊ₖ(x)

for all k ≥ 1, where 1^{(k)} = (1, 1, ..., 1) (k times). It follows that

d(σₙ₊ₖ(x), ξ) ≤ d(u(λ(x), 1^{(k)}), ξ) = d(u(λ(x), 1^{(k)}), u(ξ, 1^{(k)})) ≤ l(u(·, 1))^k d(λ(x), ξ) → 0

as k → ∞, since u(ξ, 1^{(k)}) = ξ. An application of Theorem 3.6.2 completes the proof. ∎
16.2. Expected Total Errors

Throughout this section we suppose, following Zeaman and House (1963), that

φ₁ = φ₂ = φ₃ = φ₄ = φ > 0 and θ₁ = θ₂ = θ > 0.

Theorem 2.1 gives the expected total numbers of overt and perceptual errors under these conditions.

THEOREM 2.1. For any 0 ≤ v, y ≤ 1,

E_x(#W) = v̄/φ + 2ȳ/θ   (2.1)

and

E_x(#PO) = 2v̄/φ + 2ȳ/θ,   (2.2)

where v̄ = 1 − v and ȳ = 1 − y.
Proof. Events 2, 3, and 4 have probability 0 at ξ, so, as a consequence of Theorem 6.2.2, g_k(x) = E_x(#k) is the unique Lipschitz function such that

g_k(x) = p(x, k) + U g_k(x)   (2.3)

and g_k(ξ) = 0. This characterization is now used to establish formulas for these expectations. Table 1 gives

Uy − y = E(Δy | x) = θ(vyȳ + vȳȳ) = θvȳ;   (2.4)

hence

ȳ − Uȳ = θvȳ

and

ȳ/θ = vȳ + Uȳ/θ.   (2.5)

Since vȳ = p(x, 2), and ȳ/θ is Lipschitz and vanishes at ξ,

g₂(x) = ȳ/θ.   (2.6)

Returning to Table 1,

Uv − v = E(Δv | x) = φ(vv̄y − v²ȳ − vv̄/2 + v̄²/2).

Rewriting y in the first term on the right as 1 − ȳ, we get

Uv − v = φ(−vȳ + v̄/2),   (2.7)

from which it follows that

v̄/φ = −vȳ + v̄/2 + Uv̄/φ.

When (2.5) is added to this equation, the result is

v̄/φ + ȳ/θ = v̄/2 + U(v̄/φ + ȳ/θ).

But p(x, 3) = p(x, 4) = v̄/2, and v̄/φ + ȳ/θ is Lipschitz and vanishes at ξ, so

g₃(x) = g₄(x) = v̄/φ + ȳ/θ.   (2.8)

Formulas (2.1) and (2.2) follow from (2.6) and (2.8) on noting that

E_x(#W) = g₂(x) + g₄(x) and E_x(#PO) = g₃(x) + g₄(x). ∎
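Theorem 2.1 lends itself to a direct Monte Carlo check. The simulation below implements the reduced model as we read Table 1 (reward moves the attended probability toward 1 at rate θ or φ, nonreward of the attended cue moves v toward 0, and position-attention trials leave y unchanged); this reading, and the parameter values, are our assumptions.

```python
import random

# Count overt errors (W responses) until learning is essentially complete and
# compare the mean to E_x(#W) = (1-v)/phi + 2*(1-y)/theta from (2.1).
phi, theta = 0.2, 0.2

def total_W_errors(v, y, rng):
    errors = 0
    while (1.0 - v) + (1.0 - y) > 1e-6:
        if rng.random() <= v:                 # attend brightness
            if rng.random() <= y:             # choose B: rewarded
                v += phi * (1.0 - v)
            else:                             # choose W: overt error
                errors += 1
                v -= phi * v
            y += theta * (1.0 - y)
        else:                                 # attend position (perceptual error)
            if rng.random() <= 0.5:           # choose B: rewarded
                v -= phi * v
            else:                             # choose W: overt error
                errors += 1
                v += phi * (1.0 - v)
    return errors

rng = random.Random(0)
reps = 4000
est = sum(total_W_errors(0.5, 0.5, rng) for _ in range(reps)) / reps
exact = (1.0 - 0.5) / phi + 2.0 * (1.0 - 0.5) / theta   # = 7.5 by (2.1)
```

Because (2.1) depends only on the conditional means E(Δv|x) and E(Δy|x), it is exact for any model with these transition rules, and the sample mean settles near 7.5.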
We now consider reversal of the correct response. If, on trial n and thereafter, food is placed in the white arm, (2.1) implies that the expected total number of (overt) errors on the new problem, given Xₙ, is V̄ₙ/φ + 2Yₙ/θ. Thus the unconditional expected number of errors is

Ψₙ = E(V̄ₙ)/φ + 2E(Yₙ)/θ.

The change in Ψₙ as the result of one additional trial before reversal is then

ΔΨₙ = −E(ΔVₙ)/φ + 2E(ΔYₙ)/θ.

In view of (2.4) and (2.7),

ΔΨₙ = E(3Vₙ Ȳₙ − V̄ₙ/2).   (2.9)

This formula is needed for the study of the overlearning reversal effect in the next section.

16.3. The Overlearning Reversal Effect
To apply Theorem 8.1.1 to the ZHL model, we simply assume that the parameters φⱼ = γⱼθ and θⱼ = ηⱼθ depend linearly on a parameter θ > 0. Then X^θ = X is the closed unit square, the law of ΔXₙ/θ given Xₙ = x does not depend on θ,

w(x) = E(ΔXₙ/θ | Xₙ = x) and S(x) = E((ΔXₙ/θ)² | Xₙ = x)

have polynomial coordinate functions, and ΔXₙ/θ is bounded, so it is easy to see that all of the assumptions of the theorem hold. Let X₀ = x a.s., and let f satisfy

f′(t) = w(f(t))   (3.1)

with f(0) = x. The theorem says that, for any T < ∞,

E(Xₙ) − f(nθ) = O(θ)   (3.2)

and

E(|Xₙ − E(Xₙ)|²) = O(θ)   (3.3)

uniformly over x ∈ X and nθ ≤ T. Furthermore, Xₙ is approximately normally distributed when θ is small and nθ is bounded.

We now revert to the special case θ₁ = θ₂ = θ and φⱼ = φ, 1 ≤ j ≤ 4, considered in the last section, and assume that φ = γθ, where γ > 0. This parameter is thus the ratio of the perceptual and response learning rates. We will study the asymptotic behavior of f(t), and then use the information obtained to show which values of γ produce an overlearning reversal effect when θ is small.

Let v(t) and y(t) be the coordinate functions of f(t): f(t) = (v(t), y(t)). It follows from (2.4) and (2.7) that the coordinate equations of (3.1) are

v′(t) = γ(v̄/2 − vȳ)   (3.4)

and

y′(t) = vȳ.   (3.5)

Since f(t) ∈ X, 0 ≤ v(t), y(t) ≤ 1. The values v = 1 and y = 1 are associated with perfect learning. It is clear from (3.5) that y(t) is nondecreasing. Thus, if y(0) = 1, then y(t) ≡ 1, and (3.4) gives v̄(t) = v̄(0)e^{−γt/2}. Throughout the remainder of the section, it is assumed that y(0) < 1. According to Lemma 3.1, v(t) and y(t) both converge to 1 as t → ∞. It should be noted that v(t) need not be monotonic. By (3.4), v′(0) ≥ 0 if v(0) ≤ ν, and v′(0) < 0 if v(0) > ν, where

ν = 1/(1 + 2ȳ(0)).

In the former case, v′(t) > 0 for all t > 0. In the latter, there is an s > 0 such that v′(t) < 0 for t < s, v′(s) = 0, and v′(t) > 0 for t > s. Since these facts are not used below, their proofs are omitted.
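The mean-path system (3.4)-(3.5) can be integrated numerically; the sketch below (step size, horizon, and initial state are illustrative choices of ours) shows both coordinates converging to 1 and exhibits the sign behavior of p(t) = 3vȳ − v̄/2 that drives Theorem 3.1 below.

```python
# Euler integration of v'(t) = gamma*((1-v)/2 - v*(1-y)), y'(t) = v*(1-y).
def mean_path(gamma, v0, y0, t_end, h=1e-3):
    v, y = v0, y0
    for _ in range(int(t_end / h)):
        dv = gamma * ((1.0 - v) / 2.0 - v * (1.0 - y))
        dy = v * (1.0 - y)
        v += h * dv
        y += h * dy
    return v, y

v1, y1 = mean_path(1.0, 0.1, 0.1, 20.0)
v4, y4 = mean_path(4.0, 0.1, 0.1, 20.0)
# v(t), y(t) -> 1 in both cases, while the sign of p(t) = 3*v*(1-y) - (1-v)/2
# at large t separates gamma < 3 (negative) from gamma > 3 (positive)
p1 = 3 * v1 * (1 - y1) - (1 - v1) / 2
p4 = 3 * v4 * (1 - y4) - (1 - v4) / 2
```

For γ = 1 the residual term v̄/2 dominates (p(t) < 0, the overlearning reversal effect), while for γ = 4 the attention term 3vȳ dominates (p(t) > 0), in agreement with Lemma 3.1 and Theorem 3.1.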
LEMMA 3.1. For any γ > 0,

ȳ(t) ~ a e^{−t}   (3.6)

as t → ∞, where a = ȳ(0) exp 2(ȳ(0) + v̄(0)/γ). If γ < 2,

v̄(t) ~ β e^{−γt/2}   (3.7)

for some β > 0. If γ = 2,

v̄(t) ~ 2at e^{−t},   (3.8)

and, if γ > 2,

v̄(t) ~ (2γa/(γ−2)) e^{−t}.   (3.9)
Proof. Equations (3.4) and (3.5) are equivalent to

v̄′(t) = γ(vȳ − v̄/2)   (3.10)

and

ȳ′(t) = −vȳ.   (3.11)

Thus, if Z(t) = v̄/γ + ȳ,

Z′(t) = −v̄/2.   (3.12)

Therefore, Z(t) ≥ 0 is nonincreasing, so that Z(∞) = lim_{t→∞} Z(t) exists. Since y(∞) exists, v(∞) does too. Integrating both sides of (3.12), we obtain

Z(0) − Z(t) = ∫₀ᵗ v̄(s) ds/2;

hence

Z(0) − Z(∞) = ∫₀^∞ v̄(s) ds/2.

Finiteness of the integral implies v(∞) = 1. It follows from (3.11) that

ȳ(t) = ȳ(0) exp(−∫₀ᵗ v(s) ds) = ȳ(0) exp(−t) exp(∫₀ᵗ v̄(s) ds).

Thus,

ȳ(t) ~ ȳ(0) exp(2(Z(0) − Z(∞))) exp(−t).

In particular, y(∞) = 1, so Z(∞) = 0, and (3.6) is proved. Solving (3.10) for v̄ in terms of q = γvȳ, we obtain

v̄(t) = e^{−γt/2}(v̄(0) + ∫₀ᵗ e^{γs/2} q(s) ds).   (3.13)

If γ < 2, (3.6) implies that e^{γs/2} q(s) is integrable over the whole positive axis. Furthermore,

β = v̄(0) + ∫₀^∞ e^{γs/2} q(s) ds > 0.

For this is certainly true if v̄(0) > 0, and, if v̄(0) = 0, then q(s) > 0 for s sufficiently small, which implies positivity of β. Thus (3.7) obtains. If γ = 2, e^{γs/2} q(s) → 2a as s → ∞; hence

v̄(t) eᵗ/t = v̄(0)/t + (1/t) ∫₀ᵗ eˢ q(s) ds → 2a

when t → ∞, as required for (3.8). Finally, if γ > 2,

∫₀ᵗ e^{γs/2} q(s) ds ~ (2γa/(γ−2)) e^{(γ/2−1)t}.

This estimate and (3.13) yield (3.9). ∎

Throughout what follows, the initial value x = (v, y) of Xₙ = Xₙ^θ and f(t) is fixed and independent of θ, with y < 1. At the end of the last section, we considered Ψₙ = Ψₙ^θ, the expected number of errors following a reversal at trial n, and obtained the expression (2.9) for the change in this quantity due to one additional trial before reversal. This expression and Lemma 3.1 are the basis for Theorem 3.1.
THEOREM 3.1. If γ < 3 (γ > 3), there is a T₀ with the following property. For any T > T₀, there is an ε = ε_T > 0 such that, if θ < ε, then ΔΨₙ^θ < 0 (ΔΨₙ^θ > 0) for all T₀ ≤ nθ ≤ T.

The theorem is concerned with large values of n (n ≥ T₀/θ), or, loosely speaking, with overtraining. It says that, when θ is small, overtraining facilitates reversal (the overlearning reversal effect) if γ < 3, but retards reversal if γ > 3. This is consonant with the intuition that, when the perceptual learning rate φ = γθ is small relative to the response learning rate θ, the gain in attention to the relevant stimulus dimension with overtraining should more than compensate for the slight further strengthening of incorrect stimulus-response associations. It is, perhaps, surprising that the overlearning reversal effect is predicted for values of γ almost as large as 3. Of course, to obtain an experimentally measurable effect of overtraining, it is necessary to give a reasonable number k of overtraining trials. The change in expected total errors due to overtraining is then

Ψ^θ_{n+k} − Ψ^θ_n = Σ_{j=n}^{n+k−1} ΔΨ^θ_j.
Proof. According to (2.9), ΔΨₙ^θ = E(P(Xₙ)), where P is the polynomial

P(x) = 3vȳ − v̄/2.

Clearly

|ΔΨₙ^θ − P(f(nθ))| ≤ |E(P(Xₙ)) − P(μₙ)| + |P(μₙ) − P(f(nθ))|,

where μₙ = E(Xₙ). Since the first partial derivatives of P are bounded over X,

|P(μₙ) − P(f(nθ))| ≤ C|μₙ − f(nθ)| ≤ Cθ

by (3.2). And, since the second partial derivatives of P are bounded,

|E(P(Xₙ)) − P(μₙ)| ≤ C E(|Xₙ − μₙ|²) ≤ Cθ

by (3.3). Thus, for any T < ∞, there is a C = C_T such that

|ΔΨₙ^θ − P(f(nθ))| ≤ Cθ   (3.14)

for all nθ ≤ T. Let

p(t) = P(f(t)) = 3v(t)ȳ(t) − v̄(t)/2.

It follows easily from Lemma 3.1 that, as t → ∞,

p(t) e^{γt/2} → −β/2 if γ < 2,
p(t) eᵗ/t → −a if γ = 2,
p(t) eᵗ → 2a(γ−3)/(γ−2) if γ > 2.

Therefore, if γ < 3, there is a T₀ such that p(t) < 0 for t ≥ T₀. And, if γ > 3, there is a T₀ such that p(t) > 0 for t ≥ T₀. Suppose that γ < 3, and let T > T₀. Since −p(t) is continuous and strictly positive on [T₀, T],

δ = min_{T₀ ≤ t ≤ T} −p(t) > 0.

If T₀ ≤ nθ ≤ T, then

ΔΨₙ^θ − p(nθ) ≤ Cθ

by (3.14); hence

ΔΨₙ^θ ≤ p(nθ) + Cθ ≤ −δ + Cθ < 0

if θ < ε = δ/C. Thus ΔΨₙ^θ < 0, as was to be shown. The analogous assertion for γ > 3 is proved similarly. ∎
17. Other Learning Models
17.1. Suppes’ Continuous Pattern Model Suppes proposed two models for simple learning experiments with continua of responses, such as the continuous prediction experiment described in Section 0.1. His linear model was considered in Chapter 15. The other is an analog of the pattern model for two-choice experiments (Suppes, 1960). This section treats the ergodic theory of a slight generalization of the latter model. There is a set S of N stimulus patterns s, and a measurable response space (Y, ’9).The state x of learning is a vector (x,, s E S) of points in Y. As a trial begins, the subject samples a single stimulus pattern s with probability I / N , and makes a response y according to a distribution As(xsrdy). If, for example, Y is a circle, x, might be the mode of this distribution. Given y , the outcome , Conditioning is effective or “correct response” z has distribution ~ ( ydz). ( k = I ) with probability c ( y , z), and ineffective (k =0) otherwise. The event variable is e = (s,y,z, k). If conditioning is effective, x, is changed to z,
while $x_t$, $t \ne s$, remains unchanged:
$$u(x,e)_s = z, \qquad u(x,e)_t = x_t \quad\text{for } t \ne s.$$
If conditioning is ineffective, $u(x,e) = x$. For this model, the distribution $p(x, de)$ of $e$ given $x$ is
$$p(x, de) = N^{-1} c(y,z)^k (1 - c(y,z))^{1-k}\, \pi(y, dz)\, A_s(x_s, dy).$$
Naturally, we assume that $A_s(\cdot, D)$, $\pi(\cdot, D)$, and $c$ are measurable. Let $K$ and $U$ be the transition kernel and transition operator for a state sequence $X_n$, and let $B(X)$ be the set of bounded measurable scalar-valued functions on $X$. Event, response, and outcome sequences are denoted $E_n$, $Y_n$, and $Z_n$, respectively. The following result is analogous to Theorem 15.1.3 for multiresponse linear models.

THEOREM 1.1. If $c(y,z)$ and $\pi(y,D)$ have lower bounds of the form $c(y,z) \ge c(z)$ and $\pi(y,D) \ge \mu(D)$, where $c(z)$ is a nonnegative measurable function, $\mu$ is a measure, and $a = \int c(z)\,\mu(dz) > 0$, then $U$ is regular with respect to $B(X)$.

The conclusion is equivalent to $K^{(n)}(x,B) \to K^\infty(B)$ as $n \to \infty$, uniformly over $x$ and $B$. The limit $K^\infty$ is the unique stationary probability of $K$.
Proof. Note that $u(x,e) = u^*(x,e^*)$ depends on $x$ and $e^* = (s,z,k)$, but not on $y$. The functions $u^*$ and $p^*(x, A^*) = p(x, e^* \in A^*)$ define a (reduced) model with states $x^* = x$ and events $e^*$. If $f \in B(X)$,
$$Uf(x) = \int p(x, de)\, f(u(x,e)) = \int p^*(x, de^*)\, f(u^*(x,e^*)) = U^*f(x).$$
Thus $U = U^*$, and it suffices to show that $U^*$ is regular. This is accomplished via Theorem 4.1.2.
For $s \in S$ and $D \in \mathscr{Y}$,
$$p^*(x, \{s\} \times D \times \{1\}) = N^{-1} \int A_s(x_s, dy) \int_D \pi(y, dz)\, c(y,z) \ge N^{-1} \int_D \mu(dz)\, c(z) = a\nu(s, D, 1). \tag{1.1}$$
This function $\nu$ has a unique extension to a probability on the event space $(E^*, \mathscr{Y}^*)$, and it follows from (1.1) that
$$p^*(x, A^*) \ge a\nu(A^*)$$
for all $A^* \in \mathscr{Y}^*$, as hypothesis (a) of Theorem 4.1.2 requires. Let $e_n^* = (s_n, z_n, k_n)$, $e^N = (e_0^*, \ldots, e_{N-1}^*)$, and
$$G = \{e^N : s_m \ne s_n \text{ for } m \ne n, \text{ and } k_n = 1 \text{ for all } n\}.$$
If $e^N \in G$, $u^*(x, e^N)_{s_n} = z_n$ for $n \le N-1$, so $u^*(x, e^N) = u^*(e^N)$ does not depend on $x$. Finally,
$$\nu^N(e^N) = \prod_{n=0}^{N-1} \nu(s_n, Y, 1) = 1/N^N,$$
so $\nu^N(G) = N!/N^N > 0$, and (b') of Theorem 4.1.2 obtains. ∎
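The regularity asserted by Theorem 1.1 can be illustrated by simulation. The sketch below instantiates a special case of the model; every concrete choice in it (five patterns, $Y$ the unit circle, a wrapped-Gaussian response kernel, a uniform outcome distribution $\pi$, constant conditioning probability $c = 1/2$) is an illustrative assumption, not taken from the text. With these choices $c(y,z) \ge c(z) = 1/2$ and $\pi(y,D) \ge \mu(D)$ for $\mu$ Lebesgue measure, so $a = 1/2 > 0$ and Theorem 1.1 applies.

```python
import random

# Monte Carlo sketch of (a special case of) Suppes' continuous pattern model.
# N = 5 patterns, Y = the circle [0, 1), response kernel A_s = wrapped normal
# centered at x_s, outcome kernel pi = uniform, conditioning probability
# c(y, z) = 1/2: all illustrative assumptions.
N = 5

def trial(x, rng):
    """One trial: sample s, response y, outcome z, conditioning indicator k."""
    s = rng.randrange(N)                       # pattern sampled w.p. 1/N
    y = (x[s] + rng.gauss(0.0, 0.1)) % 1.0     # response near x_s, on the circle
    z = rng.random()                           # "correct response" ~ uniform
    if rng.random() < 0.5:                     # conditioning effective (k = 1)
        x = list(x)
        x[s] = z                               # x_s is replaced by z
    return x, y

def simulate(n_trials, x0, seed=0):
    rng = random.Random(seed)
    x = list(x0)
    responses = []
    for _ in range(n_trials):
        x, y = trial(x, rng)
        responses.append(y)
    return x, responses

state, resp = simulate(20_000, [0.0] * N)
print(state)
```

Running the chain from two different initial vectors yields statistically indistinguishable late-trial response distributions, which is the content of the uniform convergence $K^{(n)}(x,B) \to K^\infty(B)$.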
When $U$ is regular on $B(X)$, the model is uniformly regular (Theorem 6.2.1), so that $\{E_{j+n}\}_{n \ge 0}$ possesses an asymptotic distribution $\Gamma$ as $j \to \infty$. If $g$ is a measurable real-valued function on $(Y \times Z)^m$,
$$g_j = g(Y_j, Z_j, \ldots, Y_{j+m-1}, Z_{j+m-1}),$$
and
$$\int g^2\, d\Gamma = \int g^2(y_0, z_0, \ldots, y_{m-1}, z_{m-1})\, \Gamma(de^m) < \infty,$$
then
$$(1/n)S_n = (1/n) \sum_{j=0}^{n-1} g_j \to \bar{g} = \int g\, d\Gamma$$
a.s. as $n \to \infty$ (Theorem 6.2.3), and $S_n$ is asymptotically normally distributed with mean $n\bar{g}$ and variance proportional to $n$ (Theorem 6.2.4). Finally, by Theorem 6.1.2, the results of Chapter 5 apply to $g_j$ when $m = 1$ and $g$ is bounded.

17.2. Successive Discrimination
In Chapter 16 we considered the ZHL model for simultaneous discrimination learning, in which the two values of the relevant stimulus dimension are present on each trial. In this section we discuss a similar model for successive discrimination, where only one value appears per trial. Suppose, for example, that a stimulus card overhead at the choice point of a T-maze may be black or white. When it is black (an $s_0$ trial), the rat is fed if he goes to the left side ($A_0$ response), while, if it is white ($s_1$), he is fed if he goes to the right ($A_1$). The two trial types are randomly interspersed and equiprobable. Bush (1965) developed some suggestions of L. B. Wyckoff, Jr., into the model described next, which, following Bush, we call the Wyckoff model. The state $x$ has four coordinates, $v$, $y_1$, $y_0$, and $z$. The variable $v$ gives the rat's probability of observing or attending to the stimulus card ($a$), while $y_i$ is the probability of making the correct $A_i$ response on an $s_i$ trial, given $a$. Finally, $z$ is the probability of $A_1$ on either type of trial, given inattention ($\bar{a}$). To summarize:
$$P(s_i) = \tfrac{1}{2}, \quad P(a) = v, \quad P(A_i \mid s_i a) = y_i, \quad P(A_1 \mid \bar{a}) = z.$$
Events $e$ are of the form $s_i a A_j$ or $s_i \bar{a} A_j$. It remains only to indicate how various events affect $x$. The probability $v$ of attention increases if the rat attends and gets food or does not attend and does not get food, and decreases otherwise. The variable $y_i$ increases on $s_i a$ trials, regardless of the subject's response, and otherwise does not change. The probability $z$ of going right changes only on $\bar{a}$ trials, increasing or decreasing depending on whether food is on the right or left, just as in
models for simple learning. All of these changes are effected by simple linear operators. Thus, for example, the possible values of $\Delta v$ after $\bar{a}$ and their probabilities of occurrence are
$$-\phi_{11} v \quad\text{if } s_1 \bar{a} A_1, \quad\text{with probability } \bar{v}z/2,$$
$$-\phi_{00} v \quad\text{if } s_0 \bar{a} A_0, \quad\text{with probability } \bar{v}\bar{z}/2,$$
$$\phi_{10} \bar{v} \quad\text{if } s_1 \bar{a} A_0, \quad\text{with probability } \bar{v}\bar{z}/2,$$
$$\phi_{01} \bar{v} \quad\text{if } s_0 \bar{a} A_1, \quad\text{with probability } \bar{v}z/2,$$
where $\bar{v} = 1 - v$. Under the natural lateral symmetry conditions $\phi_{11} = \phi_{00} = \phi_3$ and $\phi_{10} = \phi_{01} = \phi_4$, the two events involving correct (C) responses, as well as the two events involving incorrect (I) responses, can be combined into two new events, $\bar{a}C$ and $\bar{a}I$, which produce $\Delta v$'s of $-\phi_3 v$ and $\phi_4 \bar{v}$, respectively. The probability $\bar{v}/2$ of these new events does not depend on $z$. Relabeling the four events involving $a$ to indicate whether the subject's choice was correct or not, we obtain the reduced model described in Table 1. The fact that there are four parameters in lines 1-4 instead of eight reflects lateral symmetry conditions like those that made the reduction possible. The new state variable is, of course, $x = (v, y_1, y_0)$.

TABLE 1. Event effects ($\Delta v$, $\Delta y_1$, $\Delta y_0$) and probabilities for the reduced Wyckoff model ($\bar{v} = 1 - v$). [Table body not legible in this copy.]
Our subsequent discussion focuses on this model, whose properties are quite analogous to those of the reduced ZHL model. Since the proofs are also similar, they are omitted. If $\phi_3, \phi_4, \theta, \theta^* > 0$, the Wyckoff model is distance diminishing and absorbing, with single absorbing state $(1, 1, 1)$. Thus $V_n \to 1$ and $Y_{in} \to 1$ a.s. as $n \to \infty$, where $V_n$ and $Y_{in}$ are the values of $v$ and $y_i$ on trial $n$. The expected total numbers $E(\#I)$ and $E(\#\bar{a})$ of overt and perceptual errors are finite. Under the additional conditions $\phi_j = \phi$,
$1 \le j \le 4$, and $\theta = \theta^*$,
$$E_x(\#I) = \bar{v}/\phi + 2(\bar{y}_1 + \bar{y}_0)/\theta$$
and
$$E_x(\#\bar{a}) = 2\bar{v}/\phi + 2(\bar{y}_1 + \bar{y}_0)/\theta.$$
Let $\phi = \gamma\theta$, where $\gamma > 0$ is fixed. If $n\theta$ is bounded, then $\mathrm{var}(V_n) = O(\theta)$, $\mathrm{var}(Y_{in}) = O(\theta)$,
$$E(V_n) = v(n\theta) + O(\theta), \quad\text{and}\quad E(Y_{in}) = y_i(n\theta) + O(\theta),$$
where
$$v'(t) = (\gamma/2)(\bar{v} - v(\bar{y}_1 + \bar{y}_0)), \qquad y_i'(t) = \bar{y}_i v/2,$$
$v(0) = V_0$, and $y_i(0) = Y_{i0}$ a.s. Introducing $\bar{y} = \bar{y}_0 + \bar{y}_1$ into the equation for $v'$ and the sum of the equations for $y_1'$ and $y_0'$, we obtain
$$\bar{v}'(t) = (\gamma/2)(v\bar{y} - \bar{v})$$
and
$$\bar{y}'(t) = -\bar{y}v/2.$$
These equations are similar to (16.3.10) and (16.3.11), and can be put to the same use. When $\theta$ is small, an overlearning reversal effect is predicted for $\gamma < 3$ but not for $\gamma > 3$. More precisely, Theorem 16.3.1 holds with 3 in place of 2.
17.3. Signal Detection: Forced-Choice

This section and the next treat learning models for two types of signal detection experiments. In the forced-choice paradigm, the subject must specify which of two temporal intervals or spatial locations a signal occurred in. In yes-no experiments, the subject must state whether or not there was a signal in a single observation interval. When there is feedback concerning correctness of responses, it is natural to assume that the subject can make use of it to improve his performance. In this way we are led to consider learning models for signal detection. The model of Atkinson and Kinchla (1965; see also Atkinson, Bower, and Crothers, 1965, Chapter 5) for forced-choice experiments is discussed below. Friedman, Carterette, Nakatani, and
Ahumada (1968) have considered a similar model for the yes-no paradigm. Kac's (1962, 1969) yes-no model, a predecessor of the forced-choice model of Dorfman and Biderman (1971), is taken up in the next section. Suppose that a signal occurs in temporal interval $S_1$ with probability $\eta\gamma$, and in $S_2$ with probability $\eta(1-\gamma)$. If $\eta < 1$, then, unbeknown to the subject, there is no signal in either interval ($S_0$) with probability $1-\eta$. On an $S_i$ trial, $i = 1$ or 2, a detection occurs with probability $\sigma$ and produces sensory state $s_i$; otherwise sensory state $s_0$ results. This state always occurs on $S_0$ trials. When the subject is in state $s_1$ or $s_2$, he makes the appropriate response ($A_1$ or $A_2$). In state $s_0$, he makes response $A_1$ with probability $x$. It is this bias parameter that undergoes a learning process. After his response, the subject is told that the signal occurred in the first ($O_1$) or second ($O_2$) interval. On signal trials this feedback is correct, but, on $S_0$ trials, $O_1$ and $O_2$ occur with probabilities $\pi$ and $1-\pi$. Events in this model are of the form $e = S_i s_j A_k O_l$, and their conditional probabilities given $x$ are easily derived from the above specifications. For example,
$$p(x,e) = \eta\gamma(1-\sigma)x \quad\text{if } e = S_1 s_0 A_1 O_1, \qquad p(x,e) = (1-\eta)x(1-\pi) \quad\text{if } e = S_0 s_0 A_1 O_2.$$
The response bias parameter $x$ changes only when detection fails to occur, increasing after $O_1$ and decreasing after $O_2$. Linear state transformations are assumed:
$$u(x,e) = (1-\theta_1)x + \theta_1 \ \text{ if } s_0 O_1; \quad (1-\theta_2)x \ \text{ if } s_0 O_2; \quad x \ \text{ otherwise},$$
where $0 < \theta_j \le 1$. Note that
$$P(x, s_0 O_1) = \eta\gamma(1-\sigma) + (1-\eta)\pi = g_1$$
and
$$P(x, s_0 O_2) = \eta(1-\gamma)(1-\sigma) + (1-\eta)(1-\pi) = g_2$$
do not depend on $x$. Let $X_n$ be the value of $x$ on trial $n$. Clearly,
$$E(\Delta X_n \mid X_n) = \theta_1(1-X_n)g_1 - \theta_2 X_n g_2,$$
so
$$\Delta E(X_n) = \theta_1 g_1(1 - E(X_n)) - \theta_2 g_2 E(X_n),$$
and, letting $n \to \infty$,
$$x_\infty = \lim_{n\to\infty} E(X_n) = \theta_1 g_1/(\theta_1 g_1 + \theta_2 g_2).$$
The asymptotic probabilities of appropriate and inappropriate $A_1$ responses ("hits" and "false alarms") are
$$h = \lim_{n\to\infty} P(A_{1n} \mid S_{1n}) = \sigma + (1-\sigma)x_\infty$$
and
$$f = \lim_{n\to\infty} P(A_{1n} \mid S_{2n}) = (1-\sigma)x_\infty.$$
Thus, however $\eta$, $\gamma$, $\pi$, and $\theta_j$ may vary, $(f,h)$ is confined to the line $h = \sigma + f$, which is the receiver operating characteristic or ROC curve predicted by the model. Given a single subject's data for an $n$-trial block, the natural estimators of $f$ and $h$ are the proportions $F_n$ and $H_n$ of $A_1$ responses on $S_2$ and $S_1$ trials. It is not difficult to show that, if $\sigma < 1$ or $\eta < 1$, this model is distance diminishing, and the process $X_n$ is regular. Thus, according to the corollary to Theorem 6.1.1, the results of Chapter 5 apply to functions $g(E_n)$ of the event sequence $E_n$. To illustrate one of these results, we show that $(F_n, H_n)$ is asymptotically normally distributed, with mean $(f, h)$ and covariance matrix proportional to $1/n$, as $n \to \infty$. Clearly
$$F_n = \overline{S_2 A_1}/\overline{S_2} \quad\text{and}\quad H_n = \overline{S_1 A_1}/\overline{S_1},$$
where the bars denote block means of indicator variables (for instance, $\overline{S_1} = n^{-1}\sum_{j=0}^{n-1} S_{1j}$), and $S_{ij}$ is 1 or 0 depending on whether or not $S_i$ occurs on trial $j$. Now $S_{1j}$, $S_{2j}$, and the comparable indicators for $S_1 A_1$ and $S_2 A_1$ are functions of $E_j$. It follows from the multivariate central limit theorem described in Section 5.7 that
$$\bar{y}_n = (\overline{S_2},\ \overline{S_2 A_1},\ \overline{S_1},\ \overline{S_1 A_1})$$
is asymptotically multivariate normal, with mean
$$y = (\eta(1-\gamma),\ \eta(1-\gamma)f,\ \eta\gamma,\ \eta\gamma h)$$
and covariance matrix $\Sigma/n$, where $\Sigma$ is given by the series (5.7.2). But the
function, call it $\Psi$, from $\bar{y}_n$ to $(F_n, H_n)$ [see (3.1)] is continuously differentiable with $\Psi(y) = (f, h)$, provided that $\eta > 0$ and $0 < \gamma < 1$. Hence
$$\sqrt{n}\,(F_n - f,\ H_n - h) \Rightarrow N(0, \Lambda),$$
where
$$\Lambda = \Psi'(y)^* \Sigma\, \Psi'(y),$$
$\Psi'(y)$ is the derivative of $\Psi$ at $y$, and $*$ indicates transposition.
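A small Monte Carlo experiment confirms the ROC line $h = \sigma + f$ and the formula for $x_\infty$. All parameter values in the sketch below are arbitrary illustrative choices:

```python
import random

# Monte Carlo sketch of the Atkinson-Kinchla forced-choice model.
# All parameter values are illustrative choices.
eta, gam, sigma, pi_ = 0.8, 0.5, 0.4, 0.5      # trial mix, detection, feedback
th1, th2 = 0.05, 0.05                          # learning rates

def simulate(n, x=0.5, seed=1):
    rng = random.Random(seed)
    hits = fas = n1 = n2 = 0
    for _ in range(n):
        r = rng.random()
        trial = 1 if r < eta * gam else (2 if r < eta else 0)   # S1, S2, or S0
        detected = trial != 0 and rng.random() < sigma
        if detected:
            resp = trial                        # states s1, s2: correct response
        else:
            resp = 1 if rng.random() < x else 2 # state s0: bias x toward A1
        if trial == 1:
            n1 += 1; hits += (resp == 1)
        elif trial == 2:
            n2 += 1; fas += (resp == 1)
        # feedback: correct on signal trials; O1 w.p. pi_ on S0 trials
        o1 = (trial == 1) or (trial == 0 and rng.random() < pi_)
        if not detected:                        # x changes only in state s0
            x = (1 - th1) * x + th1 if o1 else (1 - th2) * x
    return hits / n1, fas / n2

h, f = simulate(200_000)
g1 = eta * gam * (1 - sigma) + (1 - eta) * pi_
g2 = eta * (1 - gam) * (1 - sigma) + (1 - eta) * (1 - pi_)
x_inf = th1 * g1 / (th1 * g1 + th2 * g2)
print(h - f, sigma)                            # ROC line: h - f should be near sigma
```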
17.4. Signal Detection: Yes-No

Suppose that, on every trial, the subject has an internal response $y$ to the stimulus (signal or noise) that is present. If there is a signal ($S_1$), this variable has distribution function $F_1$; otherwise ($S_0$), its distribution function is $F_0$. The sequence of trial types is independent, and $P(S_i) = p_i > 0$. The subject declares that the signal was present ($A_1$) if $y$ exceeds a criterion $x$, and that it was absent ($A_0$) if $y \le x$. He is then told whether he was right or wrong. Kac (1962, 1969) considered the possibility that the criterion $x$ in such a system undergoes a learning process. Specifically, he assumed that $x$ is incremented by $\Delta > 0$ if there is a false alarm ($S_0 A_1$) and decremented by $\Delta$ in the case of a missed signal ($S_1 A_0$). Dropping the assumption of equal changes, we arrive at a learning model with state $x$, events $e = S_i A_j$, state transformations
$$u(x,e) = x + \Delta_0 \ \text{ if } e = S_0 A_1; \quad x + \Delta_1 \ \text{ if } e = S_1 A_0; \quad x \ \text{ otherwise},$$
where $\Delta_0 > 0 > \Delta_1$, and event probabilities
$$p(x, S_0 A_1) = p_0(1 - F_0(x)) \quad\text{and}\quad p(x, S_1 A_0) = p_1 F_1(x).$$
In view of the form of the state transformations, it is not surprising that Kac's model behaves much like the recurrent additive models of Chapter 14. The exposition that follows skips a number of proofs that are similar to ones in that chapter. It is, of course, part of the intuitive background of the detection problem that the signal should tend to produce a larger internal response than the noise, e.g., $\int y\, dF_1 > \int y\, dF_0$, but none of our results depend on this condition. Continuity and strict monotonicity of $F_0$ and $F_1$ suffice for everything except slow learning.
State sequences in this model oscillate, in the sense that
$$\limsup_{n\to\infty} X_n = \infty \quad\text{and}\quad \liminf_{n\to\infty} X_n = -\infty$$
a.s. If $u = e^x$, $u_n = e^{X_n}$, and $\lambda \in R^1$, there are constants $C(\lambda) < 1$ and $B(\lambda)$ such that
$$E_x(u_n^\lambda) \le u^\lambda C^n(\lambda) + B(\lambda) \tag{4.1}$$
for all $n \ge 0$. Taking $\lambda = \pm 1$ and adding, we see that, for any fixed $x$, $E_x(X_n^2)$ is a bounded sequence. The counterpart of (A) of Theorem 14.2.1 is especially interesting. Let $a_{0n}$ and $a_{1n}$ be the proportions of false alarms and missed signals in an $n$-trial block:
$$a_{in} = \sum_{m=0}^{n-1} S_{im} A_{\bar{i}m} \Big/ \sum_{m=0}^{n-1} S_{im}, \qquad \bar{i} = 1 - i,$$
and put $a_n = (a_{0n}, a_{1n})$. Let $L$ be the line
$$L = \{(a_0, a_1) : \Delta_0 p_0 a_0 + \Delta_1 p_1 a_1 = 0\},$$
and let $d(a, L)$ be the distance from a point $a$ in the plane to $L$.

THEOREM 4.1. As $n \to \infty$, $d(a_n, L) \to 0$ a.s.

Proof. Let $X_0 = x$ a.s. Clearly
$$\Delta X_m = \sum_i \Delta_i S_{im} A_{\bar{i}m},$$
so
$$X_n/n - x/n = \sum_i \Delta_i \sum_{m=0}^{n-1} S_{im} A_{\bar{i}m}/n. \tag{4.2}$$
Since $E(X_n^2)$ is bounded, $X_n/n \to 0$ a.s. By the strong law of large numbers,
$$\sum_{m=0}^{n-1} S_{im}/n \to p_i$$
a.s., as $n \to \infty$. Hence, multiplying the $i$th term on the right in (4.2) by this ratio, we get
$$\sum_i \Delta_i p_i a_{in} \to 0$$
a.s., and $d(a_n, L)$ is proportional to this sum. ∎
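Theorem 4.1 is easy to check by simulation. In the sketch below the internal responses are Gaussian (an illustrative assumption; the theorem needs only continuity and strict monotonicity of $F_0$ and $F_1$), and the step sizes are arbitrary choices:

```python
import random

# Simulation sketch of Kac's yes-no model, checking Theorem 4.1: the
# false-alarm and miss proportions (a0, a1) approach the line
# Delta0*p0*a0 + Delta1*p1*a1 = 0.  Gaussian internal responses and the
# particular step sizes are illustrative assumptions.
p1, p0 = 0.5, 0.5                     # P(S1), P(S0)
d0, d1 = 0.3, -0.2                    # Delta_0 > 0 > Delta_1

def simulate(n, seed=2):
    rng = random.Random(seed)
    x = 0.0                           # criterion, X_0 = 0
    s0 = s1 = fa = miss = 0
    for _ in range(n):
        signal = rng.random() < p1
        y = rng.gauss(1.0 if signal else 0.0, 1.0)   # internal response
        if signal:
            s1 += 1
            if y <= x:                # missed signal: criterion drops
                miss += 1
                x += d1
        else:
            s0 += 1
            if y > x:                 # false alarm: criterion rises
                fa += 1
                x += d0
    return fa / s0, miss / s1

a0, a1 = simulate(500_000)
print(d0 * p0 * a0 + d1 * p1 * a1)    # -> near 0
```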
Let $K$ be the transition kernel of $X_n$, and let
$$\bar{K}^n(x,B) = (1/n) \sum_{j=0}^{n-1} K^{(j)}(x,B).$$
THEOREM 4.2. For any $x \in R^1$, $\bar{K}^n(x, \cdot)$ converges to a stationary probability $K^\infty(x, \cdot)$. There is a unique stationary probability if and only if $\Delta_0/\Delta_1$ is irrational.

It follows from (4.1) that
$$\int K^\infty(x, dy)\, e^{\lambda y} \le B(\lambda)$$
for all $x$ and $\lambda$. Additional information about $K^\infty$ when $\Delta_0/\Delta_1$ is rational is developed in the course of the proof.

Proof. Suppose that $\Delta_0/\Delta_1$ is irrational. Since $p_i > 0$ and $F_i$ is strictly increasing,
$$\sigma(x) = p_1 F_1(x) + p_0(1 - F_0(x)) > 0$$
and
$$\tilde{p}(x) = p_1 F_1(x)/\sigma(x)$$
is strictly increasing. It is also continuous, with $\tilde{p}(-\infty) = 0$ and $\tilde{p}(\infty) = 1$. Thus the Markov process $\tilde{X}_n$ with transitions
$$\Delta\tilde{X}_n = \Delta_1 \ \text{ with probability } \tilde{p}(\tilde{X}_n); \quad \Delta_0 \ \text{ with probability } 1 - \tilde{p}(\tilde{X}_n),$$
has a unique stationary probability, according to Theorem 14.4.2. If $\tilde{K}$ is its transition kernel, then
$$K(x,B) = \sigma(x)\tilde{K}(x,B) + (1 - \sigma(x))\delta_x(B).$$
It follows that $\mu \to \tilde{\mu}$, where
$$\tilde{\mu}(B) = \int_B \sigma(x)\,\mu(dx) \Big/ \int \sigma(x)\,\mu(dx),$$
is a one-to-one mapping of stationary probabilities of $K$ into stationary probabilities of $\tilde{K}$. Thus $K$ has at most one stationary probability. The argument in the second-to-last paragraph of Section 14.4 then yields weak convergence of $\bar{K}^n(x, \cdot)$ to the unique stationary probability $\mu$. If $\Delta_0/\Delta_1$ is rational, let $j$ and $k$ be relatively prime positive integers such that $\Delta_0/|\Delta_1| = j/k$, and let
$$\varepsilon = \Delta_0/j = |\Delta_1|/k.$$
It is not difficult to show that
$$G^+ = \{n_0\Delta_0 + n_1\Delta_1 : n_0, n_1 \ge 0\} = \varepsilon Z, \tag{4.3}$$
where $Z$ is the set of all integers. Clearly $C_x = x + G^+$ is stochastically closed, so the restriction $K_x$ of $K$ to $C_x$ is the transition kernel for a denumerable Markov chain. As a consequence of (4.3), any two states communicate, so
$$\lim_{n\to\infty} \bar{K}_x^n(y, z) = \pi(z)$$
for all $y, z \in C_x$ (Parzen, 1962, Theorem 8A, p. 249). But
$$\bar{K}_x^n(y, \{z : |z| \le k\}) \to 1 \quad\text{as } k \to \infty,$$
uniformly in $n$, from which it follows that $\sum_{z \in C_x} \pi(z) = 1$, and
$$\lim_{n\to\infty} \bar{K}_x^n(y, A) = \sum_{z \in A} \pi(z) = \pi(A)$$
for any $A \subset C_x$. The distribution $\pi$ is the unique stationary probability of $K_x$, and $\pi(z) > 0$ for all $z \in C_x$ (Parzen, 1962, Theorem 8B, p. 249, and Theorem 8C, p. 251). Therefore
$$\lim_{n\to\infty} \bar{K}^n(x,B) = \pi(B \cap C_x) = K^\infty(x,B)$$
for all Borel sets $B$, $K^\infty(x, \cdot)$ is the unique stationary probability of $K$ with support $C_x$, and $K^\infty(x+\varepsilon, B) = K^\infty(x, B - \varepsilon)$. ∎
To study slow learning, we fix $\rho = \Delta_1/\Delta_0 < 0$, and let $\Delta_0 = \theta \to 0$. The moments
$$w(x) = E(\Delta X_n/\theta \mid X_n = x) = \rho p_1 F_1(x) + p_0(1 - F_0(x))$$
and
$$S(x) = E((\Delta X_n/\theta)^2 \mid X_n = x) = \rho^2 p_1 F_1(x) + p_0(1 - F_0(x))$$
do not depend on $\theta$, so (a) of Section 8.1 is satisfied, with $I_0 = R^1$. If $F_i$ has two bounded derivatives, $w(x)$ and $s(x) = S(x) - w^2(x)$ satisfy the smoothness condition (b). Finally, $|\Delta X_n/\theta|$ is bounded by the maximum of 1 and $|\rho|$, so (c) holds, and Theorem 8.1.1 is applicable. Thus, if $X_0^\theta = x$ does not depend on $\theta$,
$$\theta^{-1/2}(X_n^\theta - f(n\theta)) \Rightarrow N(0, g(t)) \tag{4.4}$$
as $\theta \to 0$ and $n\theta \to t$, where
$$f'(t) = w(f(t)), \qquad g'(t) = 2w'(f(t))g(t) + s(f(t)),$$
$f(0) = x$, and $g(0) = 0$.
It follows that $\mathscr{L}(X_n^\theta) \to \delta_{f(t)}$. Kac (1962) gave two interesting heuristic derivations of this result, describing the limit function via the equation $f'(t) = w(f(t))$, which we mentioned in Section 12.4 [see (12.4.5)]. Since $w(x)$ is continuous and strictly decreasing, with $w(-\infty) = p_0 > 0$ and $w(\infty) = \rho p_1 < 0$, there is a unique $\xi$ such that $w(\xi) = 0$. As $t \to \infty$, $f(t) \to \xi$, and, if $w'(\xi) < 0$ (i.e., $F_1'(\xi) > 0$ or $F_0'(\xi) > 0$),
$$g(t) \to \sigma^2 = s(\xi)/2|w'(\xi)|.$$
In view of (4.4), one expects the limiting distributions $K^\infty(x, \cdot) = K^{\infty,\theta}(x, \cdot)$ to be asymptotically normal, with mean $\xi$ and variance $\theta\sigma^2$, as $\theta \to 0$. It follows from Theorem 10.1.1 that this is the case.
18 · Diffusion Approximation in a Genetic Model and a Physical Model

This chapter treats diffusion approximation in a population genetic model of S. Wright and in the Ehrenfest model for heat exchange. Since sequences of states in these models are finite Markov chains, their "small step size" theory has much in common with that of stimulus sampling models (Section 13.3).

18.1. Wright's Model

Our description of this model is based on Ewens (1969, Section 4.8). The reader should also consult that volume for references to the extensive literature on diffusion approximation in population genetics. The work of Karlin and McGregor (1964a,b) is of particular interest. Consider a diploid population of $2M$ individuals. At a certain chromosomal locus there are two alleles, $A_1$ and $A_2$. The number $i$ of $A_1$ genes in the population thus ranges from 0 to $2M$, and the proportion, $x = i/2M$, from 0 to 1. The $A_1$ gene frequencies $i$ (or $x$) in successive generations are assumed to form a Markov chain, with transition probabilities
$$p_{ij} = \binom{2M}{j} \pi_i^j (1 - \pi_i)^{2M-j},$$
where
$$\pi_i = (1-u)\pi_i^* + v(1 - \pi_i^*)$$
and
$$\pi_i^* = \frac{(1+s_1)x^2 + (1+s_2)x(1-x)}{(1+s_1)x^2 + 2(1+s_2)x(1-x) + (1-x)^2}.$$
The constant $u$ represents the probability that an $A_1$ gene mutates to $A_2$, while $v$ is the probability that $A_2$ mutates to $A_1$. The genotypes $A_1A_1$, $A_1A_2$, and $A_2A_2$ have fitnesses $1+s_1$, $1+s_2$, and 1, respectively, so $s_1$ and $s_2$ control selective pressures. Let $X_n$ be the proportion of $A_1$ genes in generation $n$. The standard formulas for the mean and variance of the binomial distribution give
$$E(X_{n+1} \mid X_n = x) = \pi_i \quad\text{and}\quad \mathrm{var}(X_{n+1} \mid X_n = x) = \pi_i(1-\pi_i)/2M. \tag{1.1}$$
Thus
$$E(\Delta X_n \mid X_n = x) = \pi_i - x, \tag{1.2}$$
and, since $\pi_i(1-\pi_i) = x(1-x) + O(|\pi_i - x|)$,
$$\mathrm{var}(\Delta X_n \mid X_n = x) = x(1-x)/2M + O(|\pi_i - x|/2M). \tag{1.3}$$
The theory of Part II has a number of implications for the behavior of $X_n = X_n^M$ as $M \to \infty$ and the parameters $u$, $v$, and $s_i \to 0$. In order to study small values of $u$, $v$, and $s_i$, we let $u = \bar{u}p$, $v = \bar{v}p$, and $s_i = \bar{s}_ip$, where $\bar{u} \ge 0$, $\bar{v} \ge 0$, and $\bar{s}_i$ are fixed, and $p \to 0$. Under these conditions, it is shown later that
$$E(\Delta X_n \mid X_n = x) = pw(x) + O(p^2), \tag{1.4}$$
where
$$w(x) = \bar{v} - (\bar{u}+\bar{v})x + x(1-x)(\bar{s}_2 + (\bar{s}_1 - 2\bar{s}_2)x).$$
Also,
$$\mathrm{var}(\Delta X_n \mid X_n = x) = x(1-x)/2M + O(p/2M). \tag{1.5}$$
It remains only to specify the relative magnitudes of $p$ and $1/2M$. We will consider two possibilities: $p = 1/(2M)^{1/2}$ and $p = 1/2M$. These will be referred to as the cases of large and small drift, respectively. The results
described next follow immediately from the indicated theorems in Part II. The hypotheses of these theorems are verified at the end of the section.

LARGE DRIFT: TRANSIENT BEHAVIOR. Let
$$I_M = \{i/2M : 0 \le i \le 2M\},$$
and, for any $0 \le x \le 1$, let $f(t) = f(t,x)$ and $g(t) = g(t,x)$ be the solutions of the differential equations
$$f'(t) = w(f(t))$$
and
$$g'(t) = 2w'(f(t))g(t) + f(t)(1 - f(t)),$$
with $f(0) = x$ and $g(0) = 0$. Suppose that $X_0^M$ is concentrated at a point $x_M \in I_M$, and that $x_M \to x$ as $M \to \infty$. Let $v_n^M = f(n/(2M)^{1/2}, x_M)$. By Theorem 8.1.1,
$$(2M)^{1/4}(X_n^M - v_n^M) \Rightarrow N(0, g(t,x)) \tag{1.6}$$
as $M \to \infty$ and $n/(2M)^{1/2} \to t$. In particular, the deterministic approximation $v_n^M$ to $X_n^M$ should be highly satisfactory when $M$ is large and $n/(2M)^{1/2}$ is bounded.
as M + a and n/(2M)" + t. In particular, the deterministic approximation v,M to X,," should be highly satisfactory when M is large and n/(2M)" is bounded. LARGEDRIFT: STEADY-STATE BEHAVIOR. Assume that ij = w(0) > 0 and U = - w(1) > 0, that w has a unique root I in (0, l), and that w'(I) < 0. As t + a , f ( t ) + r l and g ( t ) + n2 = I(l-rl)/2lw'(I)l.
The process XnM has no absorbing states, and the asymptotic distribution
pmM = lim u ( ( ~ M ) ~ I ( x , M - A ) ) n+ m
does not depend on the distribution of XoM. Theorem 10.1.1 shows that lim 9," = N(0,n2),
M-r 00
just as (1.6) suggests. Theorem 10.3.1 gives a formula for a constant y such that lim E(X,M) = I
n-r m
+ y/(2M)" + 0(1/(2M)").
SMALL DRIFT: TRANSIENTS. Suppose now that $p = 1/2M$, and assume, once again, that $X_0^M = x_M \to x$ as $M \to \infty$. According to Theorem 9.1.1, there is a unique transition probability $P(t; x, A)$ such that
$$\int (y-x)\, P(\tau; x, dy) = \tau w(x) + o(\tau),$$
$$\int (y-x)^2\, P(\tau; x, dy) = \tau x(1-x) + o(\tau),$$
$$\int |y-x|^3\, P(\tau; x, dy) = o(\tau),$$
where the $o$'s are uniform over $0 \le x \le 1$. As $M \to \infty$ and $n/2M \to t$,
$$\mathscr{L}(X_n^M) \to P(t; x, \cdot).$$
Kimura (1964) discusses the eigenfunction expansion (9.3.9) of the density of $P(t; x, \cdot)$.
SMALL DRIFT: ABSORPTION PROBABILITIES. If $\bar{u} = 0$ and $\bar{v} = 0$, both 0 and 1 are absorbing states for $X_n^M$, and the process is eventually absorbed at one or the other a.s. Let
$$\phi_M(x) = P_x(X_n^M \text{ is absorbed at } 1).$$
The function $w$ reduces to
$$w(x) = x(1-x)(\bar{s}_2 + (\bar{s}_1 - 2\bar{s}_2)x),$$
so that
$$r(x) = 2w(x)/x(1-x) = 2(\bar{s}_2 + (\bar{s}_1 - 2\bar{s}_2)x).$$
Let $\psi$ be the solution of
$$\psi''(x) + r(x)\psi'(x) = 0$$
with $\psi(0) = 0$ and $\psi(1) = 1$. By Theorem 11.1.1,
$$\phi_M(x_M) \to \psi(x)$$
as $M \to \infty$.

Proofs. To check the hypotheses of the theorems cited above, we must obtain suitable estimates of $\pi_i - x$ and $E(|\Delta X_n|^3 \mid X_n = x)$. Note first that
$$\pi_i - x = v - (u+v)x + (1-u-v)(\pi_i^* - x). \tag{1.7}$$
Also,
$$\pi_i^* = x + \frac{x(1-x)(s_1 x + s_2(1-2x))}{1 + s_1 x^2 + 2s_2 x(1-x)}, \tag{1.8}$$
so that
$$\pi_i^* - x = x(1-x)\, \frac{s_1 x + s_2(1-2x)}{1 + s_1 x^2 + 2s_2 x(1-x)}$$
and
$$\pi_i^* - x = x(1-x)(s_1 x + s_2(1-2x))(1 + O(p)). \tag{1.9}$$
Combining (1.2), (1.7), and (1.9), we obtain (1.4), and, of course,
$$\pi_i - x = O(p), \tag{1.10}$$
an estimate that will be used repeatedly. All $O$'s in this section are uniform over $x$. Since the conditional distribution of $2MX_{n+1}$ given $x$ is binomial,
$$(2M)^4 E((X_{n+1} - \pi_i)^4 \mid X_n = x) = 2M\pi_i(1-\pi_i)\big(1 + 3(2M-2)\pi_i(1-\pi_i)\big).$$
Thus
$$E((X_{n+1} - \pi_i)^4 \mid X_n = x) \le \pi_i(1-\pi_i)/(2M)^2. \tag{1.11}$$
Since $|y+z|^p \le 2^{p-1}(|y|^p + |z|^p)$ for $p \ge 1$, and $p \le 1/(2M)^{1/2}$ for large or small drift, (1.10) and (1.11) yield
$$E(|\Delta X_n|^4 \mid X_n = x) = O((2M)^{-2}).$$
Therefore,
$$E(|\Delta X_n|^3 \mid X_n = x) = O((2M)^{-3/2}). \tag{1.12}$$
Consider first the case of large drift, and let $\theta = p = 1/(2M)^{1/2}$. It follows easily from (1.4), (1.5), and (1.12) that Theorem 8.1.1 applies, with $s(x) = x(1-x)$. When $\bar{u} > 0$, $\bar{v} > 0$, $w$ has a unique root $\lambda$ in $(0,1)$, and $w'(\lambda) < 0$, the conditions of Theorem 10.1.1 are also met. An expansion
$$E(\Delta X_n \mid X_n = x) = pw(x) + p^2v(x) + O(p^3)$$
of the type required by Theorem 10.3.1 can be obtained from (1.7) and (1.8). We turn now to small drift, and let $\tau = p = 1/2M$. By (1.5) and (1.10),
$$E((\Delta X_n)^2 \mid X_n = x) = x(1-x)/2M + O((2M)^{-2}), \tag{1.13}$$
which, together with (1.4) and (1.12), shows that Theorem 9.1.1 applies, with $a(x) = w(x)$ and $b(x) = x(1-x)$. For Theorem 11.1.1, it suffices to show that the errors $O(h)$ in (1.4), (1.12), and (1.13) are $O(x(1-x)h)$ when $\bar{u} = \bar{v} = 0$. In the case of (1.4), this follows from (1.9), which also yields the refinement
$$\pi_i - x = \pi_i^* - x = O(x(1-x)p) \tag{1.14}$$
of (1.10). This and (1.3) give
$$E((\Delta X_n)^2 \mid X_n = x) = x(1-x)/2M + O(x(1-x)(2M)^{-2}).$$
Applying the Schwarz inequality to $Y^3 = Y \cdot Y^2$, where $Y = |X_{n+1} - \pi_i|$, and using (1.1) and (1.11), we obtain
$$E(|X_{n+1} - \pi_i|^3 \mid X_n = x) \le \pi_i(1-\pi_i)(2M)^{-3/2}.$$
But $\pi_i(1-\pi_i) = O(x(1-x))$, so, using (1.14) once again,
$$E(|\Delta X_n|^3 \mid X_n = x) = O(x(1-x)(2M)^{-3/2}). \qquad ∎$$
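The absorption-probability result can be checked by simulation. For the illustrative special case $\bar{s}_1 = 2\bar{s}_2$ (genic selection, chosen here because it makes $r$ constant and $\psi$ elementary; the choice is not from the text), $\psi(x) = (1 - e^{-2\bar{s}_2 x})/(1 - e^{-2\bar{s}_2})$:

```python
import math, random

# Sketch of the small-drift absorption result: with ubar = vbar = 0 and
# s_i = sbar_i/2M, the fixation probability phi_M(x) approaches the solution
# psi of psi'' + r*psi' = 0, psi(0) = 0, psi(1) = 1.  Taking sbar_1 = 2*sbar_2
# (an illustrative special case) makes r = 2*sbar_2 constant, so
#   psi(x) = (1 - exp(-2*sbar_2*x)) / (1 - exp(-2*sbar_2)).
M2, sbar2 = 60, 1.0
s2 = sbar2 / M2                   # small drift: p = 1/2M
s1 = 2.0 * s2

def next_freq(x, rng):
    num = (1 + s1) * x * x + (1 + s2) * x * (1 - x)
    den = (1 + s1) * x * x + 2 * (1 + s2) * x * (1 - x) + (1 - x) ** 2
    pi = num / den                # post-selection A1 frequency (no mutation)
    return sum(rng.random() < pi for _ in range(M2)) / M2

def fixation_prob(x0, replicas, seed=5):
    rng = random.Random(seed)
    fixed = 0
    for _ in range(replicas):
        x = x0
        while 0.0 < x < 1.0:      # iterate until absorption at 0 or 1
            x = next_freq(x, rng)
        fixed += (x == 1.0)
    return fixed / replicas

x0 = 0.5
psi = (1 - math.exp(-2 * sbar2 * x0)) / (1 - math.exp(-2 * sbar2))
sim = fixation_prob(x0, replicas=400)
print(sim, psi)
```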
18.2. The Ehrenfest Model

There are $2M$ balls, distributed between two urns, I and II. The number in urn I on trial $n$ is $J_n = J_n^M$. On any trial, a ball is selected at random and shifted to the other urn. Thus,
$$\Delta J_n = +1 \ \text{ with probability } 1 - J_n/2M; \quad -1 \ \text{ with probability } J_n/2M.$$
The physical significance of this model is discussed by Kac (1947, 1969), who also gives an explicit expression for its higher-order transition probabilities [1947, Eq. (62)]. We note that $J_n$ has the same transition probabilities as the number of stimulus elements conditioned to response $A_1$ in the pattern model with $\pi_{01} = \pi_{10} = c_{01} = c_{10} = 1$. It is easy to see that Theorem 8.1.1 is applicable to $X_n = (J_n - M)/M$, with $\theta = 1/M$. Clearly, $\Delta X_n/\theta = \Delta J_n$, so
$$w(x) = E(\Delta J_n \mid X_n = x) = -x$$
and
$$s(x) = \mathrm{var}(\Delta J_n \mid X_n = x) = 1 - x^2.$$
Let $J_0^M = j_M$ a.s., where $(j_M - M)/M \to x$, and let
$$g(t,x) = (1 - e^{-2t})/2 - x^2te^{-2t}.$$
It follows from the theorem that
$$M^{1/2}(X_n^M - f(n/M)) \Rightarrow N(0, g(t,x)), \qquad f(t) = xe^{-t},$$
as $M \to \infty$ and $n/M \to t$. Consequently, if $(j_M - M)/\sqrt{M} \to z$,
$$\mathscr{L}((J_n^M - M)/\sqrt{M}) \to N(ze^{-t}, (1 - e^{-2t})/2).$$
This was noted by Kac (1947, p. 384).
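Kac's normal approximation is easy to check by simulation; the values $M = 500$, $z = 1$, $t = 1$ below are arbitrary illustrative choices:

```python
import math, random

# Simulation sketch of the Ehrenfest urn, checking the normal approximation
# (J_n - M)/sqrt(M) ~ N(z e^{-t}, (1 - e^{-2t})/2) as n/M -> t.
M = 500

def run(j0, n, rng):
    j = j0
    for _ in range(n):
        if rng.random() < j / (2 * M):
            j -= 1                # ball drawn from urn I, moved to urn II
        else:
            j += 1                # ball drawn from urn II, moved to urn I
    return j

rng = random.Random(6)
z, t = 1.0, 1.0
j0 = M + int(z * math.sqrt(M))    # (j0 - M)/sqrt(M) ~ z
n = int(t * M)
samples = [(run(j0, n, rng) - M) / math.sqrt(M) for _ in range(2000)]
mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
print(mean, var)                  # compare with z*exp(-1) and (1 - exp(-2))/2
```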
References
R. C. Atkinson, G. H. Bower, and E. J. Crothers, 1965. An Introduction to Mathematical Learning Theory. Wiley, New York.
R. C. Atkinson and W. K. Estes, 1963. Stimulus sampling theory, in Handbook of Mathematical Psychology (R. D. Luce, R. R. Bush, and E. Galanter, Eds.), Vol. II, pp. 121-268. Wiley, New York.
R. C. Atkinson and R. A. Kinchla, 1965. A learning model for forced-choice detection experiments, British Journal of Mathematical and Statistical Psychology 18, 183-206.
G. Birkhoff and G. C. Rota, 1969. Ordinary Differential Equations, 2nd ed. Ginn (Blaisdell), Boston, Massachusetts.
M. E. Bitterman, 1965. Phyletic differences in learning, American Psychologist 20, 396-410.
J. H. Blau, 1961. Transformation of probabilities, Proceedings of the American Mathematical Society 12, 511-518.
G. H. Bower, 1959. Choice-point behavior, in Studies in Mathematical Learning Theory (R. R. Bush and W. K. Estes, Eds.), pp. 109-124. Stanford Univ. Press, Stanford, California.
L. Breiman, 1968. Probability. Addison-Wesley, Reading, Massachusetts.
R. R. Bush, 1959. Sequential properties of linear models, in Studies in Mathematical Learning Theory (R. R. Bush and W. K. Estes, Eds.), pp. 215-227. Stanford Univ. Press, Stanford, California.
R. R. Bush, 1965. Identification learning, in Handbook of Mathematical Psychology (R. D. Luce, R. R. Bush, and E. Galanter, Eds.), Vol. III, pp. 161-203. Wiley, New York.
R. R. Bush and F. Mosteller, 1955. Stochastic Models for Learning. Wiley, New York.
R. R. Bush and F. Mosteller, 1959. A comparison of eight models, in Studies in Mathematical Learning Theory (R. R. Bush and W. K. Estes, Eds.), pp. 293-307. Stanford Univ. Press, Stanford, California.
A. B. Chia, 1970. Spectral representations of multi-element pattern models, Journal of Mathematical Psychology 7, 150-162.
K. L. Chung, 1967. Markov Chains, 2nd ed. Springer-Verlag, New York.
W. Doeblin and R. Fortet, 1937. Sur des chaînes à liaisons complètes, Bulletin de la Société Mathématique de France 65, 132-148.
D. D. Dorfman and M. Biderman, 1971. A learning model for a continuum of sensory states, Journal of Mathematical Psychology 8, 264-285.
R. M. Dudley, 1966. Convergence of Baire measures, Studia Mathematica 27, 251-268.
N. Dunford and J. T. Schwartz, 1958. Linear Operators, Part I. Wiley (Interscience), New York.
J. Elliott, 1955. Eigenfunction expansions associated with certain singular differential operators, Transactions of the American Mathematical Society 78, 400-425.
W. K. Estes, 1959. Component and pattern models with Markovian interpretations, in Studies in Mathematical Learning Theory (R. R. Bush and W. K. Estes, Eds.), pp. 9-52. Stanford Univ. Press, Stanford, California.
W. K. Estes, 1964. Probability learning, in Categories of Human Learning (A. W. Melton, Ed.), pp. 89-128. Academic Press, New York.
W. K. Estes, 1970. Learning Theory and Mental Development. Academic Press, New York.
W. K. Estes and P. Suppes, 1959a. Foundations of linear models, in Studies in Mathematical Learning Theory (R. R. Bush and W. K. Estes, Eds.), pp. 137-179. Stanford Univ. Press, Stanford, California.
W. K. Estes and P. Suppes, 1959b. Foundations of statistical learning theory, II. The stimulus sampling model, Technical Report No. 26, Institute for Mathematical Studies in the Social Sciences, Stanford, California.
W. J. Ewens, 1969. Population Genetics. Methuen, London.
W. Feller, 1968. An Introduction to Probability Theory and Its Applications, Vol. I, 3rd ed. Wiley, New York.
W. Feller, 1971. An Introduction to Probability Theory and Its Applications, Vol. II, 2nd ed. Wiley, New York.
M. Fréchet, 1952. Recherches Théoriques Modernes sur le Calcul des Probabilités, Vol. II, 2nd ed. Méthode des fonctions arbitraires. Théorie des événements en chaîne dans le cas d'un nombre fini d'états possibles. Gauthier-Villars, Paris.
M. P. Friedman, E. C. Carterette, L. Nakatani, and A. Ahumada, 1968. Comparison of some learning models for response bias in signal detection, Perception and Psychophysics 3, 5-11.
I. I. Gikhman and A. V. Skorokhod, 1969. Introduction to the Theory of Random Processes. Saunders, Philadelphia.
P. R. Halmos, 1950. Measure Theory. Van Nostrand-Reinhold, Princeton, New Jersey.
E. J. Hannan, 1970. Multiple Time Series. Wiley, New York.
E. Hille, 1969. Lectures on Ordinary Differential Equations. Addison-Wesley, Reading, Massachusetts.
H. S. Hoffman, 1965. Theory construction through computer simulation, in Classical Conditioning (W. F. Prokasy, Ed.), pp. 107-117. Appleton, New York.
I. A. Ibragimov, 1962. Some limit theorems for stationary processes, Theory of Probability and Its Applications 7, 349-382.
C. T. Ionescu Tulcea, 1959. On a class of operators occurring in the theory of chains of infinite order, Canadian Journal of Mathematics 11, 112-121.
C. T. Ionescu Tulcea and G. Marinescu, 1950. Théorie ergodique pour des classes d'opérations non complètement continues, Annals of Mathematics 52, 140-147.
M. Iosifescu, 1963. Random systems with complete connections with an arbitrary set of states, Revue de Mathématiques Pures et Appliquées 8, 611-645.
M. Iosifescu and R. Theodorescu, 1969. Random Processes and Learning. Springer-Verlag, New York.
R. Isaac, 1962. Markov processes and unique stationary probability measures, Pacific Journal of Mathematics 12, 273-286.
B. Jamison, 1964. Asymptotic behavior of successive iterates of continuous functions under a Markov operator, Journal of Mathematical Analysis and Applications 9, 203-214.
B. Jamison, 1965. Ergodic decompositions induced by certain Markov operators, Transactions of the American Mathematical Society 117, 451-468.
B. Jamison and R. Sine, 1969. Irreducible almost periodic Markov operators, Journal of Mathematics and Mechanics 18, 1043-1057.
M. Kac, 1947. Random walk and the theory of Brownian motion, American Mathematical Monthly 54, 369-391.
M. Kac, 1962. A note on learning signal detection, IRE Transactions on Information Theory 8, 126-128.
M. Kac, 1969. Some mathematical models in science, Science 166, 695-699.
S. Karlin and J. McGregor, 1964a. On some stochastic models in genetics, in Stochastic Models in Medicine and Biology (J. Gurland, Ed.), pp. 245-271. Univ. of Wisconsin Press, Madison, Wisconsin.
S. Karlin and J. McGregor, 1964b. Direct product branching processes and related Markov chains, Proceedings of the National Academy of Sciences 51, 598-602.
J. G. Kemeny and J. L. Snell, 1960. Finite Markov Chains. Van Nostrand-Reinhold, Princeton, New Jersey.
A. Khintchine, 1948. Asymptotische Gesetze der Wahrscheinlichkeitsrechnung. Chelsea, New York.
M. Kimura, 1964. Diffusion models in population genetics, Journal of Applied Probability 1, 177-232.
W. Kintsch, 1970. Learning, Memory, and Conceptual Processes. Wiley, New York.
J. Lamperti, 1966. Probability. Benjamin, New York.
M. Loève, 1963. Probability Theory, 3rd ed. Van Nostrand-Reinhold, Princeton, New Jersey.
E. Lovejoy, 1966. Analysis of the overlearning reversal effect, Psychological Review 73, 87-103.
E. Lovejoy, 1968. Attention in Discrimination Learning. Holden-Day, San Francisco.
R. D. Luce, 1959. Individual Choice Behavior. Wiley, New York.
P. Mandl, 1968. Analytical Treatment of One-Dimensional Markov Processes. Springer-Verlag, New York.
H. P. McKean, Jr., 1956. Elementary solutions for certain parabolic partial differential equations, Transactions of the American Mathematical Society 82, 519-548.
J. L. Myers, 1970. Sequential choice behavior, in The Psychology of Learning and Motivation (G. H. Bower, Ed.), Vol. 4, pp. 109-170. Academic Press, New York.
E. D. Neimark and W. K. Estes, 1967. Stimulus Sampling Theory. Holden-Day, San Francisco.
J. Neveu, 1965. Mathematical Foundations of the Calculus of Probability. Holden-Day, San Francisco.
M. F. Norman, 1964. Incremental learning on random trials, Journal of Mathematical Psychology 1, 336-350.
REFERENCES
M. F. Norman, 1966. An approach to free-responding on schedules that prescribe reinforcement probability as a function of inter-response time, Journal of Mathematical Psychology 3, 235-268.
M. F. Norman, 1968a. Some convergence theorems for stochastic learning models with distance diminishing operators, Journal of Mathematical Psychology 5, 61-101.
M. F. Norman, 1968b. Mathematical learning theory, in Mathematics of the Decision Sciences (G. B. Dantzig and A. F. Veinott, Eds.), Part 2, pp. 283-313. American Mathematical Society, Providence, Rhode Island.
M. F. Norman, 1968c. Slow learning, British Journal of Mathematical and Statistical Psychology 21, 141-159.
M. F. Norman, 1968d. On the linear model with two absorbing barriers, Journal of Mathematical Psychology 5, 225-241.
M. F. Norman, 1970a. A uniform ergodic theorem for certain Markov operators on Lipschitz functions on bounded metric spaces, Zeitschrift für Wahrscheinlichkeitstheorie und verwandte Gebiete 15, 51-56.
M. F. Norman, 1970b. Limit theorems for additive learning models, Journal of Mathematical Psychology 7, 1-11.
M. F. Norman, 1971a. Slow learning with small drift in two-absorbing-barrier models, Journal of Mathematical Psychology 8, 1-21.
M. F. Norman, 1971b. Statistical inference with dependent observations: extensions of classical procedures, Journal of Mathematical Psychology 8, 444-451.
M. F. Norman and N. V. Graham, 1968. A central limit theorem for families of stochastic processes indexed by a small average step size parameter, and some applications to learning models, Psychometrika 33, 441-449.
M. F. Norman and J. I. Yellott, Jr., 1966. Probability matching, Psychometrika 31, 43-60.
K. R. Parthasarathy, 1967. Probability Measures on Metric Spaces. Academic Press, New York.
E. Parzen, 1958. On asymptotically efficient consistent estimates of the spectral density function of a stationary time series, Journal of the Royal Statistical Society, Series B 20, 303-322.
E. Parzen, 1962. Stochastic Processes. Holden-Day, San Francisco.
B. Rosén, 1967a. On the central limit theorem for sums of dependent random variables, Zeitschrift für Wahrscheinlichkeitstheorie und verwandte Gebiete 7, 48-82.
B. Rosén, 1967b. On asymptotic normality of sums of dependent random vectors, Zeitschrift für Wahrscheinlichkeitstheorie und verwandte Gebiete 7, 95-102.
M. Rosenblatt, 1964a. Equicontinuous Markov operators, Theory of Probability and its Applications 9, 205-222.
M. Rosenblatt, 1964b. Almost periodic transition operators acting on the continuous functions on a compact space, Journal of Mathematics and Mechanics 13, 837-847.
M. Rosenblatt, 1967. Transition probability operators, in Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability (L. M. Le Cam and J. Neyman, Eds.), Vol. II, Part 2, pp. 473-483. Univ. of California Press, Berkeley, California.
H. Rouanet and S. Rosenberg, 1964. Stochastic models for the response continuum in a determinate situation: comparisons and extensions, Journal of Mathematical Psychology 1, 215-232.
H. L. Royden, 1968. Real Analysis, 2nd ed. Macmillan, New York.
R. J. Serfling, 1968. Contributions to central limit theory for dependent variables, Annals of Mathematical Statistics 39, 1158-1175.
R. J. Serfling, 1970a. Moment inequalities for the maximum cumulative sum, Annals of Mathematical Statistics 41, 1227-1234.
R. J. Serfling, 1970b. Convergence properties of Sn under moment restrictions, Annals of Mathematical Statistics 41, 1235-1248.
S. Sternberg, 1963. Stochastic learning theory, in Handbook of Mathematical Psychology (R. D. Luce, R. R. Bush, and E. Galanter, Eds.), Vol. II, pp. 1-120. Wiley, New York.
P. Suppes, 1959. A linear model for a continuum of responses, in Studies in Mathematical Learning Theory (R. R. Bush and W. K. Estes, Eds.), pp. 400-414. Stanford Univ. Press, Stanford, California.
P. Suppes, 1960. Stimulus-sampling theory for a continuum of responses, in Mathematical Methods in the Social Sciences, 1959 (K. J. Arrow, S. Karlin, and P. Suppes, Eds.), pp. 348-365. Stanford Univ. Press, Stanford, California.
P. Suppes and R. W. Frankmann, 1961. Test of stimulus sampling theory for a continuum of responses with unimodal noncontingent determinate reinforcement, Journal of Experimental Psychology 61, 122-132.
P. Suppes, H. Rouanet, M. Levine, and R. W. Frankmann, 1964. Empirical comparison of models for a continuum of responses with noncontingent bimodal reinforcement, in Studies in Mathematical Psychology (R. C. Atkinson, Ed.), pp. 358-379. Stanford Univ. Press, Stanford, California.
N. S. Sutherland and N. J. Mackintosh, 1971. Mechanisms of Animal Discrimination Learning. Academic Press, New York.
M. Tatsuoka and F. Mosteller, 1959. A commuting-operator model, in Studies in Mathematical Learning Theory (R. R. Bush and W. K. Estes, Eds.), pp. 228-247. Stanford Univ. Press, Stanford, California.
J. Theios, 1963. Simple conditioning as two-stage all-or-none learning, Psychological Review 70, 403-417.
W. Vervaat, 1969. Upper bounds for the distance in total variation between the binomial or negative binomial and the Poisson distribution, Statistica Neerlandica 23, 79-86.
S. Weinstock, A. J. North, A. L. Brody, and J. LoGuidice, 1965. Probability learning in a T-maze with noncorrection, Journal of Comparative and Physiological Psychology 60, 76-81.
S. S. Wilks, 1962. Mathematical Statistics. Wiley, New York.
J. I. Yellott, Jr., 1969. Probability learning with noncontingent success, Journal of Mathematical Psychology 6, 541-575.
D. Zeaman and B. J. House, 1963. The role of attention in retardate discrimination learning, in Handbook of Mental Deficiency (N. R. Ellis, Ed.), pp. 159-223. McGraw-Hill, New York.
List of Symbols
These symbols are used in the same or similar ways in two or more chapters. The definitions given on the pages indicated do not cover all local variations in meaning.
S, 25
S(x), 117
s(x, e), 116
T, 74, 153
C(X), 61
TP, 22
Z, 138
e, 5, 111, 116, 195
u(x, e), 12, 24
u(x, A), 26
U, 22
Ut, 144
U^n, 23
U', U'', 37
v(x), 158
V, 37, 45
w(x), 15, 117
w'(x), 117
w(x, e), 116
xn, 15, 179
xn, 16, 180
X, Xn, 12, 21, 24
Xn', 13, 26
Xn^e, 14, 116
Σ, 21, 24
#, 185
∧, ∨, 75
(μ, f), 23
|f|, |μ|, 22
‖f‖, 34, 37, 142
x*, |x|, x², 117
σ²,
Absorbing state, 24
Absorption, criteria for, 61, 65, 176, 196, 211, 235
Adjoint, 23
Ahumada, A., 250, 264
Aperiodic operator, 37
Atkinson, R. C., 4, 13, 249, 263
Attention, 9, 234, 247
Autocovariance
  asymptotic, 80, 187
  sample, 80
Biderman, M., 250, 264
Birkhoff, G., 121, 263
Bitterman, M. E., 3, 263
Blau, J. H., 226, 263
Boundaries, 147, 148
Bower, G. H., 13, 249, 263
Breiman, L., 80, 105, 207, 218, 263
Brody, A. L., 3, 267
Bush, R. R., 3, 5, 12, 188, 247, 263
Carterette, E. C., 249, 264
Central limit theorems, 75, 95, 116, 153
Chapman-Kolmogorov equation, 139
Chia, A. B., 202, 264
Chung, K. L., 63, 264
Complementary operators, 7
Continua of operators, 188
Critical point, 133
Cross spectral density function, 95
Crothers, E. J., 249, 263
Cyclic models, 179, 197
Diffusion approximation, see also Slow learning
  in bounded interval, 137
  in population genetics, 257
Discrimination learning, 2, 9
Distance diminishing models, 13, 30, 31
  with noncompact state spaces, 66
Doeblin, W., 35, 264
Doeblin-Fortet operator, 35
  absorbing, 61
INDEX
Dorfman, D. D., 250, 264
Drift, 14, 118
Dudley, R. M., 39, 57, 125, 264
Dunford, N., 37, 44, 46, 48-52, 219, 264
Effective conditioning, 4
Ehrenfest model, 262
Eigenvalue, 45
Elliott, J., 151, 264
Ergodic kernel, 52
Ergodic operator, 37
Estes, W. K., 3-5, 10, 202, 206, 263-265
Estimation
  of cross spectral density, 96
  of &, 80
  of σ², 84, 181
Event, 12, 24
  function of, 14, 98
  sequence, 24
  space, 12, 24
Ewens, W. J., 257, 264
Expected operator, 183
  approximation, 181, 183
Experiments
  avoidance, 3
  continuous prediction, 8
  forced-choice detection, 249
  free-responding, 8
  paired-associate, 2
  prediction, 3
  simultaneous discrimination, 12
  successive discrimination, 12, 247
  T-maze, 3
  yes-no detection, 249, 252
Failure, 6
Feller, W., 63, 112, 264
Finite Fourier transform, 96
Finite Markov chain, 63
Finite state model, 13, 25
Fortet, R., 35, 264
Frankmann, R. W., 8, 267
Fréchet, M., 91, 264
Friedman, M. P., 249, 264
Gikhman, I. I., 115, 264
Graham, N. V., 153, 266
Gronwall's lemma, 121
Halmos, P. R., 33, 264
Hannan, E. J., 85, 95, 264
Hille, E., 121, 264
Hoffman, H. S., 3, 264
House, B. J., 10, 237, 267
Ibragimov, I. A., 105, 264
Implication, a.s., 212
Indicator function, 23
Ineffective conditioning, 4
Initial distribution, 21, 24
Interresponse dependencies, 183, 199
Interresponse time (IRT), 8
Invariance, 141
Invariant subsets, 71
Ionescu Tulcea, C. T., 45, 67, 264, 265
Ionescu Tulcea-Marinescu theorem, 45
Iosifescu, M., 24, 74, 100, 265
Isaac, R., 30, 265
Jamison, B., 51, 56, 265
Kac, M., 13, 250, 252, 256, 262, 265
Karlin, S., 257, 265
Kemeny, J. G., 63, 196, 265
Khintchine, A., 150, 164, 265
Kimura, M., 260, 265
Kinchla, R. A., 13, 249, 263
Kintsch, W., 2, 265
Lamperti, J., 140, 265
Learning model, 24
Learning rate parameter, 109
Levine, M., 8, 267
Lipschitz condition, 31
Lipschitz function, 34
Loève, M., 45, 103, 130, 159, 265
LoGuidice, J., 3, 267
Lovejoy, E., 12, 265
Luce, R. D., 6, 265
McGregor, J., 257, 265
McKean, H. P., Jr., 151, 265
Mackintosh, N. J., 12, 267
Mandl, P., 148, 265
Marinescu, G., 45, 265
Markov processes, 21
  absorbing, 61
  aperiodic, 37
  compact, 14, 50
  diffusion, 114
  Doeblin-Fortet, 14, 30, 35
  ergodic, 37
  functions of, 73
  indecomposable, 218
  irreducible, 219
  orderly, 37
  pseudo-Poisson, 112
  regular, 37, 61
Mean learning curve, 179, 199
Mosteller, F., 3, 5, 188, 263, 264, 267
Multiprocess models, 13
Myers, J. L., 3, 265
Nakatani, L., 249, 264
Neimark, E. D., 4, 265
Neveu, J., 21, 25, 38, 55, 74, 83, 94, 101, 212, 213, 227, 265
Norman, M. F., 5, 8, 31, 44, 50, 67, 72, 84, 120, 140, 153, 159, 160, 164, 185, 188, 190, 211, 214, 265, 266
North, A. J., 3, 267
Orderly operator, 37
Outcome, 1
  noncontingent, 3
Overlearning reversal effect, 9, 239
Parthasarathy, K. R., 38, 53, 217, 266
Parzen, E., 85, 255, 266
Period, 57
Pickands, J., 111, 219
Probability matching, 3, 160
  asymptote, 3
Random polygonal curve, 115
Random system with complete connections, 24
  uniformly aperiodic, 101
  uniformly regular, 101
Receiver operating characteristic (ROC), 251
Recurrence, criterion for, 211
Recurrent case, 214
Reduced model, 28
  Wyckoff, 248
  Zeaman-House-Lovejoy, 11, 235
Reduction, 27
Regular operator, 37
Regularity, criteria for, 61, 65, 176, 196, 226
Reinforcement, 1
  continuous, 2, 188
Response, 1
Rosén, B., 128, 266
Rosenberg, S., 8, 266
Rosenblatt, M., 51, 219, 266
Rota, G. C., 121, 263
Rouanet, H., 8, 266, 267
Royden, H. L., 266
Schwartz, J. T., 37, 44, 46, 48-52, 219, 264
Semigroup, 146
  generator, 146
Serfling, R. J., 79, 82, 266, 267
Simple learning, 2
Sine, R., 51, 265
Skorokhod, A. V., 115, 264
Slow learning
  absorption probabilities and, 163
  near critical point, 133
  introduction, 109
  with large drift, 118
  in simple learning models, 189, 202, 221
  with small drift, 118
  small probability, 109, 111
  small steps, 14, 109, 114
  stationary probabilities and, 152
  transient behavior and, 116, 137
Snell, J. L., 63, 196, 265
Special learning models
  additive, 6, 209
  all-or-none, 5
  Atkinson-Kinchla, 249
  beta, 6, 210
  continuous pattern, 244
  five-operator linear, 5, 175
  fixed sample size, 4, 195
  Kac, 252
  multiresponse linear, 8, 225
  pattern, 4, 196
  Wyckoff, 247
  Zeaman-House-Lovejoy (ZHL), 12, 234
Spectral density function, 95
State of learning, 12, 24
  sequence, 24
  space, 12, 21, 24
Stationary probability, 24
  existence, 217
  uniqueness, 218
Sternberg, S., 183, 188, 267
Stimulus, 1
Stimulus sampling theory, 4
Stochastic kernel, 21
Stochastically closed set, 24
Subergodic kernel, 57
Success, 6
  noncontingent, 187
Suppes, P., 5, 8, 69, 206, 244, 264, 267
Support, 53
Supremum norm, 22
Sutherland, N. S., 12, 267
Symmetry, 6
Tatsuoka, M., 188, 267
Theios, J., 3, 267
Theodorescu, R., 24, 74, 100, 265
Time series analysis, 73
Transition kernel, 21
  n-step, 23
Transition operators, 22
  for metric state spaces, 37
Transition probability, 139
  density of, 150
Trials, 1
Trivial model, 177, 197
Uniprocess models, 13
Upper semicontinuous function, 42
Vervaat, W., 113, 267
Weak convergence, 38, 39
Weinstock, S., 3, 267
Wilks, S. S., 199, 267
Wright's model, 257
Yellott, J. I., Jr., 160, 187, 266, 267
Zeaman, D., 10, 237, 267