Graduate Texts in Mathematics
230
Editorial Board: S. Axler, F. W. Gehring, K. A. Ribet
Daniel W. Stroock
An Introduction to Markov Processes
Springer
Daniel W. Stroock
MIT, Department of Mathematics, Rm. 272
77 Massachusetts Ave
Cambridge, MA 02139-4307, USA
dws@math.mit.edu
Editorial Board

S. Axler
Mathematics Department
San Francisco State University
San Francisco, CA 94132, USA
axler@sfsu.edu

F. W. Gehring
Mathematics Department, East Hall
University of Michigan
Ann Arbor, MI 48109, USA
fgehring@math.lsa.umich.edu

K. A. Ribet
Mathematics Department
University of California at Berkeley
Berkeley, CA 94720-3840, USA
ribet@math.berkeley.edu
Mathematics Subject Classification (2000): 60-01, 60J10, 60J27

ISSN 0072-5285
ISBN 3-540-23499-3 Springer Berlin Heidelberg New York
Library of Congress Control Number: 20041113930

This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer. Violations are liable for prosecution under the German Copyright Law.

Springer is a part of Springer Science+Business Media
springeronline.com

© Springer-Verlag Berlin Heidelberg 2005
Printed in Germany

The use of general descriptive names, registered names, trademarks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

Typesetting: Camera-ready by the translator
Cover design: design & production GmbH, Heidelberg
Printed on acid-free paper
This book is dedicated to my longtime colleague: Richard A. Holley
Contents

Preface

Chapter 1. Random Walks: A Good Place to Begin
  1.1. Nearest Neighbor Random Walks on Z
    1.1.1. Distribution at Time n
    1.1.2. Passage Times via the Reflection Principle
    1.1.3. Some Related Computations
    1.1.4. Time of First Return
    1.1.5. Passage Times via Functional Equations
  1.2. Recurrence Properties of Random Walks
    1.2.1. Random Walks on Z^d
    1.2.2. An Elementary Recurrence Criterion
    1.2.3. Recurrence of Symmetric Random Walk in Z^2
    1.2.4. Transience in Z^3
  1.3. Exercises

Chapter 2. Doeblin's Theory for Markov Chains
  2.1. Some Generalities
    2.1.1. Existence of Markov Chains
    2.1.2. Transition Probabilities & Probability Vectors
    2.1.3. Transition Probabilities and Functions
    2.1.4. The Markov Property
  2.2. Doeblin's Theory
    2.2.1. Doeblin's Basic Theorem
    2.2.2. A Couple of Extensions
  2.3. Elements of Ergodic Theory
    2.3.1. The Mean Ergodic Theorem
    2.3.2. Return Times
    2.3.3. Identification of π
  2.4. Exercises

Chapter 3. More about the Ergodic Theory of Markov Chains
  3.1. Classification of States
    3.1.1. Classification, Recurrence, and Transience
    3.1.2. Criteria for Recurrence and Transience
    3.1.3. Periodicity
  3.2. Ergodic Theory without Doeblin
    3.2.1. Convergence of Matrices
    3.2.2. Abel Convergence
    3.2.3. Structure of Stationary Distributions
    3.2.4. A Small Improvement
    3.2.5. The Mean Ergodic Theorem Again
    3.2.6. A Refinement in the Aperiodic Case
    3.2.7. Periodic Structure
  3.3. Exercises

Chapter 4. Markov Processes in Continuous Time
  4.1. Poisson Processes
    4.1.1. The Simple Poisson Process
    4.1.2. Compound Poisson Processes on Z^d
  4.2. Markov Processes with Bounded Rates
    4.2.1. Basic Construction
    4.2.2. The Markov Property
    4.2.3. The Q-Matrix and Kolmogorov's Backward Equation
    4.2.4. Kolmogorov's Forward Equation
    4.2.5. Solving Kolmogorov's Equation
    4.2.6. A Markov Process from its Infinitesimal Characteristics
  4.3. Unbounded Rates
    4.3.1. Explosion
    4.3.2. Criteria for Non-explosion or Explosion
    4.3.3. What to Do When Explosion Occurs
  4.4. Ergodic Properties
    4.4.1. Classification of States
    4.4.2. Stationary Measures and Limit Theorems
    4.4.3. Interpreting π̂_ii
  4.5. Exercises

Chapter 5. Reversible Markov Processes
  5.1. Reversible Markov Chains
    5.1.1. Reversibility from Invariance
    5.1.2. Measurements in Quadratic Mean
    5.1.3. The Spectral Gap
    5.1.4. Reversibility and Periodicity
    5.1.5. Relation to Convergence in Variation
  5.2. Dirichlet Forms and Estimation of β
    5.2.1. The Dirichlet Form and Poincaré's Inequality
    5.2.2. Estimating β₊
    5.2.3. Estimating β₋
  5.3. Reversible Markov Processes in Continuous Time
    5.3.1. Criterion for Reversibility
    5.3.2. Convergence in L²(π̂) for Bounded Rates
    5.3.3. L²(π̂)-Convergence Rate in General
    5.3.4. Estimating λ
  5.4. Gibbs States and Glauber Dynamics
    5.4.1. Formulation
    5.4.2. The Dirichlet Form
  5.5. Simulated Annealing
    5.5.1. The Algorithm
    5.5.2. Construction of the Transition Probabilities
    5.5.3. Description of the Markov Process
    5.5.4. Choosing a Cooling Schedule
    5.5.5. Small Improvements
  5.6. Exercises

Chapter 6. Some Mild Measure Theory
  6.1. A Description of Lebesgue's Measure Theory
    6.1.1. Measure Spaces
    6.1.2. Some Consequences of Countable Additivity
    6.1.3. Generating σ-Algebras
    6.1.4. Measurable Functions
    6.1.5. Lebesgue Integration
    6.1.6. Stability Properties of Lebesgue Integration
    6.1.7. Lebesgue Integration in Countable Spaces
    6.1.8. Fubini's Theorem
  6.2. Modeling Probability
    6.2.1. Modeling Infinitely Many Tosses of a Fair Coin
  6.3. Independent Random Variables
    6.3.1. Existence of Lots of Independent Random Variables
  6.4. Conditional Probabilities and Expectations
    6.4.1. Conditioning with Respect to Random Variables

Notation

References

Index
Preface
To some extent, it would be accurate to summarize the contents of this book as an intolerably protracted description of what happens when either one raises a transition probability matrix $P$ (i.e., all entries $(P)_{ij}$ are non-negative and each row of $P$ sums to 1) to higher and higher powers or one exponentiates $R(P - I)$, where $R$ is a diagonal matrix with non-negative entries. Indeed, when it comes right down to it, that is all that is done in this book. However, I, and others of my ilk, would take offense at such a dismissive characterization of the theory of Markov chains and processes with values in a countable state space, and a primary goal of mine in writing this book was to convince its readers that our offense would be warranted.

The reason why I, and others of my persuasion, refuse to consider the theory here as no more than a subset of matrix theory is that to do so is to ignore the pervasive role that probability plays throughout. Namely, probability theory provides a model which both motivates and provides a context for what we are doing with these matrices. To wit, even the term "transition probability matrix" lends meaning to an otherwise rather peculiar set of hypotheses to make about a matrix. Namely, it suggests that we think of the matrix entry $(P)_{ij}$ as giving the probability that, in one step, a system in state $i$ will make a transition to state $j$. Moreover, if we adopt this interpretation for $(P)_{ij}$, then we must interpret the entry $(P^n)_{ij}$ of $P^n$ as the probability of the same transition in $n$ steps. Thus, as $n \to \infty$, $P^n$ is encoding the long-time behavior of a randomly evolving system for which $P$ encodes the one-step behavior, and, as we will see, this interpretation will guide us to an understanding of $\lim_{n\to\infty} (P^n)_{ij}$. In addition, and perhaps even more important, is the role that probability plays in bridging the chasm between mathematics and the rest of the world.
Indeed, it is the probabilistic metaphor which allows one to formulate mathematical models of various phenomena observed in both the natural and social sciences. Without the language of probability, it is hard to imagine how one would go about connecting such phenomena to $P^n$.

In spite of the propaganda at the end of the preceding paragraph, this book is written from a mathematician's perspective. Thus, for the most part, the probabilistic metaphor will be used to elucidate mathematical concepts rather than to provide mathematical explanations for non-mathematical phenomena. There are two reasons for my having chosen this perspective. First, and foremost, is my own background. Although I have occasionally tried to help people who are engaged in various sorts of applications, I have not accumulated a large store of examples which are easily translated into terms which are appropriate for a book at this level. In fact, my experience has taught me that people engaged in applications are more than competent to handle the routine problems which they encounter, and that they come to someone like me only as a last resort. As a consequence, the questions which they
ask me tend to be quite difficult and the answers to those few which I can solve usually involve material which is well beyond the scope of the present book. The second reason for my writing this book in the way that I have is that I think the material itself is of sufficient interest to stand on its own. In spite of what funding agencies would have us believe, mathematics qua mathematics is a worthy intellectual endeavor, and I think there is a place for a modern introduction to stochastic processes which is unabashed about making mathematics its top priority.

I came to this opinion after several semesters during which I taught the introduction to stochastic processes course offered by the M.I.T. department of mathematics. The clientele for that course has been an interesting mix of undergraduate and graduate students, less than half of whom concentrate in mathematics. Nonetheless, most of the students who stay with the course have considerable talent and appreciation for mathematics, even though they lack the formal mathematical training which is requisite for a modern course in stochastic processes, at least as such courses are now taught in mathematics departments to their own graduate students. As a result, I found no ready-made choice of text for the course. On the one hand, the most obvious choice is the classic text A First Course in Stochastic Processes, either the original one by S. Karlin or the updated version [4] by S. Karlin and H. Taylor. Their book gives a no-nonsense introduction to stochastic processes, especially Markov processes, on a countable state space, and its consistently honest, if not always easily assimilated, presentation of proofs is complemented by a daunting number of examples and exercises.
On the other hand, when I began, I feared that adopting Karlin and Taylor for my course would be a mistake of the same sort as adopting Feller's book for an undergraduate introduction to probability, and this fear prevailed the first two times I taught the course. However, after using, and finding wanting, two derivatives of Karlin's classic, I took the plunge and assigned Karlin and Taylor's book. The result was very much the one which I predicted: I was far more enthusiastic about the text than were my students. In an attempt to make Karlin and Taylor's book more palatable for the students, I started supplementing their text with notes in which I tried to couch the proofs in terms which I hoped they would find more accessible, and my efforts were rewarded with a quite positive response from my students. In fact, as my notes became more and more extensive and began to diminish the importance of the book, I decided to convert them into what is now this book, although I realize that my decision to do so may have been stupid. For one thing, the market is already close to glutted with books which purport to cover this material. Moreover, some of these books are quite popular, although my experience with them leads me to believe that their popularity is not always correlated with the quality of the mathematics they contain. Having made that pejorative comment, I will not make public which are the books which led me to this conclusion. Instead, I will only mention the books on this topic, besides Karlin and Taylor's, which I very much liked. Namely,
J. Norris's book [5] is an excellent introduction to Markov processes which, at the same time, provides its readers with a good place to exercise their measure-theoretic skills. Of course, Norris's book is only appropriate for students who have measure-theoretic skills to exercise. On the other hand, for students who possess those skills, Norris's book is a place where they can see measure theory put to work in an attractive way. In addition, Norris has included many interesting examples and exercises which illustrate how the subject can be applied. The present book includes most of the mathematical material contained in [5], but the proofs here demand much less measure theory than his do. In fact, although I have systematically employed measure-theoretic terminology (Lebesgue's Dominated Convergence Theorem, the Monotone Convergence Theorem, etc.), which is explained in Chapter 6, I have done so only to familiarize my readers with the jargon which they will encounter if they delve more deeply into the subject. In fact, because the state spaces in this book are countable, the applications which I have made of Lebesgue's theory are, with one notable exception, entirely trivial. The one exception, which is made in §6.2, is that I have included a proof that there exist countably infinite families of mutually independent random variables. Be that as it may, the reader who is ready to accept that such families exist has no need to consult Chapter 6 except for terminology and the derivation of a few essentially obvious facts about series. For more advanced students, an excellent treatment of Markov chains on a general state space can be found in the book [6] by D. Revuz.

The organization of this book should be more or less self-evident from the table of contents. In Chapter 1, I give a bare-hands treatment of the basic facts, with particular emphasis on recurrence and transience, about nearest neighbor random walks on the square, $d$-dimensional lattice $\mathbb{Z}^d$.
Chapter 2 introduces the study of ergodic properties, and this becomes the central theme which ties together Chapters 2 through 5. In Chapter 2, the systems under consideration are Markov chains (i.e., the time parameter is discrete), and the driving force behind the development there is an idea which was introduced by Doeblin. Restricted as the applicability of Doeblin's idea may be, it has the enormous advantage over the material in Chapters 3 and 4 that it provides an estimate on the rate at which the chain is converging to its equilibrium distribution. After giving a reasonably thorough account of Doeblin's theory, in Chapter 3 I study the ergodic properties of Markov chains which do not necessarily satisfy Doeblin's condition. The main result here is the one summarized in equation (3.2.15). Even though it is completely elementary, the derivation of (3.2.15) is, without doubt, the most demanding piece of analysis in the entire book. So far as I know, every proof of (3.2.15) requires work at some stage. In supposedly "simpler" proofs, the work is hidden elsewhere (either in measure theory, as in [5] and [6], or in operator theory, as in [2]). The treatment given here, which is a re-working of the one in [4] based on Feller's renewal theorem, demands nothing more of the reader than a thorough understanding of arguments involving limits superior, limits inferior, and their
role in proving that limits exist. In Chapter 4, Markov chains are replaced by continuous-time Markov processes (still on a countable state space). I do this first in the case when the rates are bounded and therefore problems of possible explosion do not arise. Afterwards, I allow for unbounded rates and develop criteria, besides boundedness, which guarantee non-explosion. The remainder of the chapter is devoted to transferring the results obtained for Markov chains in Chapter 3 to the continuous-time setting. Aside from Chapter 6, which is more like an appendix than an integral part of the book, the book ends with Chapter 5. The goal in Chapter 5 is to obtain quantitative results, reminiscent of, if not as strong as, those in Chapter 2, when Doeblin's theory either fails entirely or yields rather poor estimates. The new ingredient in Chapter 5 is the assumption that the chain or process is reversible (i.e., the transition probability is self-adjoint in the $L^2$-space of its stationary distribution), and the engine which makes everything go is the associated Dirichlet form. In the final section, the power of the Dirichlet form methodology is tested in an analysis of the Metropolis (a.k.a. simulated annealing) algorithm. Finally, as I said before, Chapter 6 is an appendix in which the ideas and terminology of Lebesgue's theory of measure and integration are reviewed. The one substantive part of Chapter 6 is the construction, alluded to earlier, in §6.2.1.

Finally, I have reached the traditional place reserved for thanking those individuals who, either directly or indirectly, contributed to this book. The principal direct contributors are the many students who suffered with various and spontaneously changing versions of this book. I am particularly grateful to Adela Popescu, whose careful reading brought to light many minor and a few major errors which have been removed and, perhaps, replaced by new ones.
Thanking, or even identifying, the indirect contributors is trickier. Indeed, they include all the individuals, both dead and alive, from whom I received my education, and I am not about to bore you with even a partial list of who they were or are. Nonetheless, there is one person who, over a period of more than ten years, patiently taught me to appreciate the sort of material treated here. Namely, Richard A. Holley, to whom I have dedicated this book, is a true probabilist. To wit, for Dick, intuitive understanding usually precedes his mathematically rigorous comprehension of a probabilistic phenomenon. This statement should lead no one to doubt Dick's powers as a rigorous mathematician. On the contrary, his intuitive grasp of probability theory not only enhances his own formidable mathematical powers, it has saved me and others from blindly pursuing flawed lines of reasoning. As all who have worked with him know, reconsider what you are saying if ever, during some diatribe into which you have launched, Dick quietly says "I don't follow that." In addition to his mathematical prowess, every one of Dick's many students will attest to his wonderful generosity. I was not his student, but I was his colleague, and I can assure you that his generosity is not limited to his students.

Daniel W. Stroock, August 2004
CHAPTER 1
Random Walks
A Good Place to Begin
The purpose of this chapter is to discuss some examples of Markov processes which can be understood even before the term "Markov process" is. Indeed, anyone who has been introduced to probability theory will recognize that these processes all derive from consideration of elementary "coin tossing."

1.1 Nearest Neighbor Random Walks on $\mathbb{Z}$

Let $p$ be a fixed number from the open interval $(0,1)$, and suppose that $\{B_n : n \in \mathbb{Z}^+\}$¹ is a sequence of $\{-1,1\}$-valued, identically distributed Bernoulli random variables² which are 1 with probability $p$. That is, for any $n \in \mathbb{Z}^+$ and any $\epsilon = (\epsilon_1, \dots, \epsilon_n) \in \{-1,1\}^n$,

$$\mathbb{P}(B_1 = \epsilon_1, \dots, B_n = \epsilon_n) = p^{N(\epsilon)} q^{n - N(\epsilon)}, \quad \text{where } q \equiv 1 - p \text{ and}$$

(1.1.1)  $N(\epsilon) \equiv \#\{m : \epsilon_m = 1\} = \dfrac{n + S_n(\epsilon)}{2}$ when $S_n(\epsilon) \equiv \displaystyle\sum_{m=1}^{n} \epsilon_m$.

Next, set

(1.1.2)  $X_0 = 0$ and $X_n = \displaystyle\sum_{m=1}^{n} B_m$ for $n \in \mathbb{Z}^+$.
The existence of the family $\{B_n : n \in \mathbb{Z}^+\}$ is the content of §6.2.1. The above family of random variables $\{X_n : n \in \mathbb{N}\}$ is often called a nearest neighbor random walk on $\mathbb{Z}$. Nearest neighbor random walks are examples of Markov processes, but the description which we have just given is the one which would be given in elementary probability theory, as opposed to a course, like this one, devoted to stochastic processes. Namely, in the study of stochastic processes the description should emphasize the dynamic aspects of the family.

¹ $\mathbb{Z}$ is used to denote the set of all integers, of which $\mathbb{N}$ and $\mathbb{Z}^+$ are, respectively, the non-negative and positive members.
² For historical reasons, mutually independent random variables which take only two values are often said to be Bernoulli random variables.

Thus, a stochastic-process-oriented description might replace (1.1.2) by

(1.1.3)  $\mathbb{P}(X_0 = 0) = 1$ and $\mathbb{P}(X_n - X_{n-1} = \epsilon \mid X_0, \dots, X_{n-1}) = \begin{cases} p & \text{if } \epsilon = 1 \\ q & \text{if } \epsilon = -1, \end{cases}$
where $\mathbb{P}(X_n - X_{n-1} = \epsilon \mid X_0, \dots, X_{n-1})$ denotes the conditional probability (cf. §6.4.1) that $X_n - X_{n-1} = \epsilon$ given $\sigma(\{X_0, \dots, X_{n-1}\})$. Notice that (1.1.3) is indeed a more dynamic description than the one in (1.1.2). Specifically, it says that the process starts from 0 at time $n = 0$ and proceeds so that, at each time $n \in \mathbb{Z}^+$, it moves one step forward with probability $p$ or one step backward with probability $q$, independent of where it has been before time $n$.

1.1.1. Distribution at Time $n$: In this subsection, we will present two approaches to computing $\mathbb{P}(X_n = m)$. The first computation is based on the description given in (1.1.2). Namely, from (1.1.2) it is clear that $\mathbb{P}(|X_n| \le n) = 1$. In addition, it is clear that

$$n \text{ odd} \implies \mathbb{P}(X_n \text{ is odd}) = 1 \quad \text{and} \quad n \text{ even} \implies \mathbb{P}(X_n \text{ is even}) = 1.$$

Finally, given $m \in \{-n, \dots, n\}$ with the same parity as $n$ and a string $\epsilon = (\epsilon_1, \dots, \epsilon_n) \in \{-1,1\}^n$ with $S_n(\epsilon) = m$, $N(\epsilon) = \frac{n+m}{2}$ (cf. (1.1.1)), and so

$$\mathbb{P}(B_1 = \epsilon_1, \dots, B_n = \epsilon_n) = p^{\frac{n+m}{2}} q^{\frac{n-m}{2}}.$$
Hence, because, when $\binom{\ell}{k} \equiv \frac{\ell!}{k!(\ell-k)!}$ is the binomial coefficient "$\ell$ choose $k$," there are $\binom{n}{\frac{m+n}{2}}$ such strings $\epsilon$, we see that

(1.1.4)  $\mathbb{P}(X_n = m) = \dbinom{n}{\frac{m+n}{2}}\, p^{\frac{n+m}{2}} q^{\frac{n-m}{2}}$ if $m \in \mathbb{Z}$, $|m| \le n$, and $m$ has the same parity as $n$,
and is 0 otherwise. Our second computation of the same probability will be based on the more dynamic description given in (1.1.3). To do this, we introduce the notation $(P_n)_m \equiv \mathbb{P}(X_n = m)$. Obviously, $(P_0)_m = \delta_{0,m}$, where $\delta_{k,\ell}$ is the Kronecker symbol which is 1 when $k = \ell$ and 0 otherwise. Further, from (1.1.3), we see that $\mathbb{P}(X_n = m)$ equals
$$\mathbb{P}(X_{n-1} = m-1 \ \& \ X_n = m) + \mathbb{P}(X_{n-1} = m+1 \ \& \ X_n = m) = p\,\mathbb{P}(X_{n-1} = m-1) + q\,\mathbb{P}(X_{n-1} = m+1).$$

That is,

(1.1.5)  $(P_n)_m = p\,(P_{n-1})_{m-1} + q\,(P_{n-1})_{m+1}.$
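Although not part of the text, the agreement between the closed form (1.1.4) and the recursion (1.1.5) is easy to check numerically; the following Python sketch does so for an arbitrarily chosen $p = 0.6$ (all function names are ours):

```python
# Sanity check (not from the text): compare the closed form (1.1.4)
# with the recursion (1.1.5).  p = 0.6 is an arbitrary choice.
from math import comb

p = 0.6
q = 1.0 - p

def pmf_closed(n, m):
    # (1.1.4): P(X_n = m) = C(n, (m+n)/2) p^((n+m)/2) q^((n-m)/2),
    # and 0 unless |m| <= n and m has the same parity as n.
    if abs(m) > n or (n - m) % 2 != 0:
        return 0.0
    k = (n + m) // 2
    return comb(n, k) * p**k * q**(n - k)

def pmf_recursive(n):
    # (1.1.5): (P_n)_m = p (P_{n-1})_{m-1} + q (P_{n-1})_{m+1}, (P_0)_m = delta_{0,m}.
    P = {0: 1.0}
    for _ in range(n):
        Q = {}
        for m, mass in P.items():
            Q[m + 1] = Q.get(m + 1, 0.0) + p * mass  # step forward with prob p
            Q[m - 1] = Q.get(m - 1, 0.0) + q * mass  # step backward with prob q
        P = Q
    return P

P10 = pmf_recursive(10)
assert all(abs(P10[m] - pmf_closed(10, m)) < 1e-12 for m in P10)
assert abs(sum(P10.values()) - 1.0) < 1e-12
```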
Obviously, (1.1.5) provides a complete, albeit implicit, prescription for computing the numbers $(P_n)_m$, and one can easily check that the numbers given by (1.1.4) satisfy this prescription. Alternatively, one can use (1.1.5) plus induction on $n$ to see that $(P_n)_m = 0$ unless $m = 2\ell - n$ for some $0 \le \ell \le n$ and that $(C_n)_\ell = (C_{n-1})_{\ell-1} + (C_{n-1})_\ell$ when $(C_n)_\ell \equiv p^{-\ell} q^{\ell-n} (P_n)_{2\ell-n}$. In other words, the coefficients $\{(C_n)_\ell : n \in \mathbb{N} \ \& \ 0 \le \ell \le n\}$ are given by Pascal's triangle and are therefore the binomial coefficients.

1.1.2. Passage Times via the Reflection Principle: More challenging than the computation in §1.1.1 is finding the distribution of the first passage time to a point $a \in \mathbb{Z}$. That is, given $a \in \mathbb{Z} \setminus \{0\}$, set³

(1.1.6)  $\zeta_a \equiv \inf\{n \ge 1 : X_n = a\}$  ($\equiv \infty$ when $X_n \ne a$ for any $n \ge 1$).

Then $\zeta_a$ is the first passage time to $a$, and our goal here is to find its distribution. Equivalently, we want an expression for $\mathbb{P}(\zeta_a = n)$, and clearly, by the considerations in §1.1.1, we need only worry about $n$'s which satisfy $n \ge |a|$ and have the same parity as $a$. Again we will present two approaches to this problem, here based on (1.1.2) and in §1.1.5 on (1.1.3). To carry out the one based on (1.1.2), assume that $a \in \mathbb{Z}^+$, suppose that $n \in \mathbb{Z}^+$ has the same parity as $a$, and observe first that
$$\mathbb{P}(\zeta_a = n) = \mathbb{P}(X_n = a \ \& \ \zeta_a > n-1) = p\,\mathbb{P}(\zeta_a > n-1 \ \& \ X_{n-1} = a-1).$$

Hence, it suffices for us to compute $\mathbb{P}(\zeta_a > n-1 \ \& \ X_{n-1} = a-1)$. For this purpose, note that for any $\epsilon \in \{-1,1\}^{n-1}$ with $S_{n-1}(\epsilon) = a-1$, the event $\{(B_1, \dots, B_{n-1}) = \epsilon\}$ has probability $p^{\frac{n+a}{2}-1} q^{\frac{n-a}{2}}$. Thus,

(*)  $\mathbb{P}(\zeta_a = n) = N(n,a)\, p^{\frac{n+a}{2}} q^{\frac{n-a}{2}},$

where $N(n,a)$ is the number of $\epsilon \in \{-1,1\}^{n-1}$ with the properties that $S_\ell(\epsilon) \le a-1$ for $0 \le \ell \le n-1$ and $S_{n-1}(\epsilon) = a-1$. That is, everything comes down to the computation of $N(n,a)$. Alternatively, since $N(n,a) = \binom{n-1}{\frac{n+a}{2}-1} - N'(n,a)$, where $N'(n,a)$ is the number of $\epsilon \in \{-1,1\}^{n-1}$ such that $S_{n-1}(\epsilon) = a-1$ and $S_\ell(\epsilon) \ge a$ for some $\ell \le n-1$, we need only compute $N'(n,a)$. For this purpose we will use a beautiful argument known as the reflection principle. Namely, consider the set $\mathcal{P}(n,a)$ of paths $(S_0, \dots, S_{n-1}) \in \mathbb{Z}^n$ with the properties that $S_0 = 0$, $S_m - S_{m-1} \in \{-1,1\}$ for $1 \le m \le n-1$, and $S_m \ge a$ for some $1 \le m \le n-1$. Clearly, $N'(n,a)$ is the number of paths in the set $L(n,a)$ consisting of those $(S_0, \dots, S_{n-1}) \in \mathcal{P}(n,a)$ for which $S_{n-1} = a-1$, and, by an application of the reflection principle, we will show that the set $L(n,a)$ has the same number of elements as the set $U(n,a)$ whose elements are those paths $(S_0, \dots, S_{n-1}) \in \mathcal{P}(n,a)$ for which $S_{n-1} = a+1$. Since $(S_0, \dots, S_{n-1}) \in U(n,a)$ if and only if $S_0 = 0$, $S_m - S_{m-1} \in \{-1,1\}$

³ As the following indicates, we take the infimum over the empty set to be $+\infty$.
for all $1 \le m \le n-1$, and $S_{n-1} = a+1$, we already know how to count them: there are $\binom{n-1}{\frac{n+a}{2}}$ of them. Hence, all that remains is to provide the advertised application of the reflection principle. To this end, for a given $S = (S_0, \dots, S_{n-1}) \in \mathcal{P}(n,a)$, let $\ell(S)$ be the smallest $0 \le k \le n-1$ for which $S_k \ge a$, and define the reflection $\mathfrak{R}(S) = (\hat{S}_0, \dots, \hat{S}_{n-1})$ of $S$ so that $\hat{S}_m = S_m$ if $0 \le m \le \ell(S)$ and $\hat{S}_m = 2a - S_m$ if $\ell(S) < m \le n-1$. Clearly, $\mathfrak{R}$ maps $L(n,a)$ into $U(n,a)$ and $U(n,a)$ into $L(n,a)$. In addition, $\mathfrak{R}$ is an involution: its composition with itself is the identity map. Hence, as a map from $L(n,a)$ to $U(n,a)$, $\mathfrak{R}$ must be both one-to-one and onto, and so $L(n,a)$ and $U(n,a)$ have the same number of elements. We have now shown that

$$N'(n,a) = \binom{n-1}{\frac{n+a}{2}} \quad\text{and therefore that}\quad N(n,a) = \binom{n-1}{\frac{n+a}{2}-1} - \binom{n-1}{\frac{n+a}{2}}.$$
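As a sanity check (not in the original text), the count $N(n,a)$ can be verified against the formula above by brute-force enumeration of the strings $\epsilon \in \{-1,1\}^{n-1}$; the function names below are illustrative only:

```python
# Enumeration check (not from the text) of the reflection-principle count:
# N(n,a) = C(n-1, (n+a)/2 - 1) - C(n-1, (n+a)/2).
from itertools import product
from math import comb

def N_direct(n, a):
    # Count eps in {-1,1}^(n-1) whose partial sums stay <= a-1 and end at a-1.
    count = 0
    for eps in product((-1, 1), repeat=n - 1):
        s, ok = 0, True
        for e in eps:
            s += e
            if s >= a:        # path reached a too early
                ok = False
                break
        if ok and s == a - 1:
            count += 1
    return count

def N_formula(n, a):
    k = (n + a) // 2
    return comb(n - 1, k - 1) - comb(n - 1, k)

for a in (1, 2, 3):
    for n in range(a, 13, 2):   # n >= a with the same parity as a
        assert N_direct(n, a) == N_formula(n, a)
```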
Finally, after plugging this into (*), we arrive at

$$\mathbb{P}(\zeta_a = n) = \left[\binom{n-1}{\frac{n+a}{2}-1} - \binom{n-1}{\frac{n+a}{2}}\right] p^{\frac{n+a}{2}} q^{\frac{n-a}{2}},$$

which simplifies to the remarkably simple expression

$$\mathbb{P}(\zeta_a = n) = \frac{a}{n}\binom{n}{\frac{n+a}{2}}\, p^{\frac{n+a}{2}} q^{\frac{n-a}{2}} = \frac{a}{n}\,\mathbb{P}(X_n = a).$$

The computation when $a < 0$ can be carried out either by repeating the argument just given or, after reversing the roles of $p$ and $q$, applying the preceding result to $-a$. However one arrives at it, the general result is that

(1.1.7)  $a \ne 0 \implies \mathbb{P}(\zeta_a = n) = \dfrac{|a|}{n}\dbinom{n}{\frac{n+a}{2}}\, p^{\frac{n+a}{2}} q^{\frac{n-a}{2}} = \dfrac{|a|}{n}\,\mathbb{P}(X_n = a)$

for $n \ge |a|$ with the same parity as $a$, and is 0 otherwise.

1.1.3. Some Related Computations: Although the formula in (1.1.7) is elegant, it is not particularly transparent. In particular, it is not at all evident how one can use it to determine whether $\mathbb{P}(\zeta_a < \infty) = 1$. To carry out this computation, let $a > 0$ be given, and write $\zeta_a = f_a(B_1, \dots, B_n, \dots)$, where $f_a$ is the function which maps $\{-1,1\}^{\mathbb{Z}^+}$ into $\mathbb{Z}^+ \cup \{\infty\}$ so that, for each $n \in \mathbb{N}$,

$$f_a(\epsilon_1, \dots, \epsilon_n, \dots) > n \iff \sum_{m=1}^{\ell} \epsilon_m < a \ \text{ for } 1 \le \ell \le n.$$

Because the event $\{\zeta_a = m\}$ depends only on $(B_1, \dots, B_m)$ and

(1.1.8)  $\zeta_a = m \implies \zeta_{a+1} = m + \zeta_1 \circ \Sigma^m$, where $\zeta_1 \circ \Sigma^m \equiv f_1(B_{m+1}, \dots, B_{m+n}, \dots)$,
$\{\zeta_a = m \ \& \ \zeta_{a+1} < \infty\} = \{\zeta_a = m\} \cap \{\zeta_1 \circ \Sigma^m < \infty\}$, and $\{\zeta_a = m\}$ is independent of $\{\zeta_1 \circ \Sigma^m < \infty\}$. In particular, this leads to
$$\mathbb{P}(\zeta_{a+1} < \infty) = \sum_{m=1}^{\infty} \mathbb{P}(\zeta_a = m \ \& \ \zeta_{a+1} < \infty) = \sum_{m=1}^{\infty} \mathbb{P}(\zeta_a = m)\,\mathbb{P}(\zeta_1 \circ \Sigma^m < \infty) = \mathbb{P}(\zeta_1 < \infty) \sum_{m=1}^{\infty} \mathbb{P}(\zeta_a = m) = \mathbb{P}(\zeta_1 < \infty)\,\mathbb{P}(\zeta_a < \infty),$$
since $(B_{m+1}, \dots, B_{m+n}, \dots)$ and $(B_1, \dots, B_n, \dots)$ have the same distribution and therefore so do $\zeta_1 \circ \Sigma^m$ and $\zeta_1$. The same reasoning applies equally well when $a < 0$, only now with $-1$ playing the role of $1$. In other words, we have proved that

(1.1.9)  $\mathbb{P}(\zeta_a < \infty) = \mathbb{P}\bigl(\zeta_{\mathrm{sgn}(a)} < \infty\bigr)^{|a|}$ for $a \in \mathbb{Z}\setminus\{0\}$,

where $\mathrm{sgn}(a)$, the signum of $a$, is $1$ or $-1$ according to whether $a > 0$ or $a < 0$. In particular, this shows that $\mathbb{P}(\zeta_1 < \infty) = 1 \implies \mathbb{P}(\zeta_a < \infty) = 1$ and $\mathbb{P}(\zeta_{-1} < \infty) = 1 \implies \mathbb{P}(\zeta_{-a} < \infty) = 1$ for all $a \in \mathbb{Z}^+$. In view of the preceding, we need only look at $\mathbb{P}(\zeta_1 < \infty)$. Moreover, by the Monotone Convergence Theorem, Theorem 6.1.9,

$$\mathbb{P}(\zeta_1 < \infty) = \lim_{s\nearrow 1}\mathbb{E}\bigl[s^{\zeta_1}\bigr] = \lim_{s\nearrow 1}\sum_{n=1}^{\infty} s^{2n-1}\,\mathbb{P}(\zeta_1 = 2n-1).$$
Applying (1.1.7) with $a = 1$, we know that

$$\mathbb{P}(\zeta_1 = 2n-1) = \frac{1}{2n-1}\binom{2n-1}{n}p^n q^{n-1}.$$

Next, note that

$$\frac{1}{2n-1}\binom{2n-1}{n} = \frac{\bigl(2(n-1)\bigr)!}{n!\,(n-1)!} = \frac{4^{n-1}}{n!}\prod_{m=1}^{n-1}\Bigl(m - \tfrac12\Bigr) = (-1)^{n-1}\,\frac{4^n}{2}\binom{\frac12}{n},$$

where⁴, for any $\alpha \in \mathbb{R}$,

$$\binom{\alpha}{n} = \begin{cases} 1 & \text{if } n = 0 \\ \dfrac{\alpha(\alpha-1)\cdots(\alpha-n+1)}{n!} & \text{if } n \in \mathbb{Z}^+ \end{cases}$$

⁴ In the preceding, we have adopted the convention that $\prod_{j=k}^{\ell} a_j = 1$ if $\ell < k$.
1 RANDOM WALKS
is the generalized binomial coefficient which gives the coefficient of $x^n$ in the Taylor expansion of $(1+x)^{\alpha}$ around $x = 0$. Hence,

$$\sum_{n=1}^{\infty} s^{2n-1}\,\mathbb{P}(\zeta_1 = 2n-1) = -\frac{1}{2qs}\sum_{n=1}^{\infty}\binom{\frac12}{n}\bigl(-4pqs^2\bigr)^n = \frac{1 - \sqrt{1 - 4pqs^2}}{2qs},$$
and so

(1.1.10)  $\mathbb{E}\bigl[s^{\zeta_1}\bigr] = \dfrac{1 - \sqrt{1 - 4pqs^2}}{2qs}$ for $|s| < 1$.

Of course, by symmetry, one can reverse the roles of $p$ and $q$ to obtain

(1.1.11)  $\mathbb{E}\bigl[s^{\zeta_{-1}}\bigr] = \dfrac{1 - \sqrt{1 - 4pqs^2}}{2ps}$ for $|s| < 1$.
By letting $s \nearrow 1$ in (1.1.10) and noting that $1 - 4pq = (p+q)^2 - 4pq = (p-q)^2$, we see that⁵

$$\lim_{s\nearrow 1}\mathbb{E}\bigl[s^{\zeta_1}\bigr] = \frac{1 - |p-q|}{2q},$$

and so

$$\mathbb{P}(\zeta_1 < \infty) = \frac{1 - |p-q|}{2q} = \frac{p\wedge q}{q} = \begin{cases} 1 & \text{if } p \ge q \\ \frac{p}{q} & \text{if } p < q. \end{cases}$$

Of course, $\mathbb{P}(\zeta_{-1} < \infty)$ is given by the same formula, only with the roles of $p$ and $q$ reversed. Thus,

(1.1.12)  $\mathbb{P}(\zeta_a < \infty) = \begin{cases} 1 & \text{if } a \in \mathbb{Z}^+ \;\&\; p \ge q \text{ or } -a \in \mathbb{Z}^+ \;\&\; p \le q \\ \bigl(\frac{p}{q}\bigr)^{a} & \text{if } a \in \mathbb{Z}^+ \;\&\; p < q \\ \bigl(\frac{q}{p}\bigr)^{-a} & \text{if } -a \in \mathbb{Z}^+ \;\&\; p > q. \end{cases}$
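The case $p < q$ of (1.1.12) lends itself to a quick Monte Carlo check. The sketch below is our own illustration (names and parameters are arbitrary); note that the finite horizon truncates the event $\{\zeta_a < \infty\}$ and so biases the estimate slightly downward, which is harmless here because the walk has negative drift.

```python
import random

def hit_prob_mc(a, p, trials=5000, horizon=1000, seed=7):
    """Fraction of simulated walks that reach level a > 0 within
    `horizon` steps; an estimate of P(zeta_a < infinity)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        s = 0
        for _ in range(horizon):
            s += 1 if rng.random() < p else -1
            if s == a:          # first passage to level a
                hits += 1
                break
    return hits / trials

# (1.1.12): for p < q, the walk reaches a > 0 with probability (p/q)^a
p, a = 0.4, 2
exact = (p / (1 - p))**a        # (2/3)^2 = 4/9
assert abs(hit_prob_mc(a, p) - exact) < 0.05
```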
1.1.4. Time of First Return: Having gone to so much trouble to arrive at (1.1.12), it is only reasonable to draw from it a famous conclusion about the recurrence properties of nearest neighbor random walks on $\mathbb{Z}$. Namely, let

$$\rho_0 \equiv \inf\{n \ge 1 : X_n = 0\} \quad (\,\equiv \infty \text{ if } X_n \ne 0 \text{ for all } n \ge 1)$$

be the time of first return to 0. Then, by precisely the same sort of reasoning which allowed us to arrive at (1.1.9), we see that $\mathbb{P}(X_1 = 1 \;\&\; \rho_0 < \infty) = p\,\mathbb{P}(\zeta_{-1} < \infty)$ and $\mathbb{P}(X_1 = -1 \;\&\; \rho_0 < \infty) = q\,\mathbb{P}(\zeta_1 < \infty)$, and so, by (1.1.12),

(1.1.13)  $\mathbb{P}(\rho_0 < \infty) = 2(p \wedge q).$
⁵ We use $a \wedge b$ to denote the minimum $\min\{a, b\}$ of $a, b \in \mathbb{R}$.
In other words, the random walk $\{X_n : n \ge 0\}$ will return to 0 with probability 1 if and only if it is symmetric in the sense that $p = \frac12$. By sharpening the preceding a little, one sees that $\mathbb{P}(X_1 = 1 \;\&\; \rho_0 = 2n) = p\,\mathbb{P}(\zeta_{-1} = 2n-1)$ and $\mathbb{P}(X_1 = -1 \;\&\; \rho_0 = 2n) = q\,\mathbb{P}(\zeta_1 = 2n-1)$, and so, by (1.1.10) and (1.1.11),

(1.1.14)  $\mathbb{E}\bigl[s^{\rho_0}\bigr] = 1 - \sqrt{1 - 4pqs^2}$ for $|s| < 1$.

Hence,

$$\mathbb{E}\bigl[\rho_0 s^{\rho_0}\bigr] = s\frac{d}{ds}\mathbb{E}\bigl[s^{\rho_0}\bigr] = \frac{4pqs^2}{\sqrt{1 - 4pqs^2}} \quad\text{for } |s| < 1,$$

and therefore, since⁶ $\mathbb{E}\bigl[\rho_0 s^{\rho_0}\bigr] \nearrow \mathbb{E}[\rho_0,\, \rho_0 < \infty]$ as $s \nearrow 1$,

$$\mathbb{E}[\rho_0,\, \rho_0 < \infty] = \frac{4pq}{|p-q|},$$

which, in conjunction with (1.1.13), means that⁷

(1.1.15)  $\mathbb{E}[\rho_0 \mid \rho_0 < \infty] = \dfrac{2(p\vee q)}{|p-q|} = 1 + \dfrac{1}{|p-q|}.$
The conclusions drawn in the preceding provide significant insight into the behavior of nearest neighbor random walks on $\mathbb{Z}$. In the first place, they say that when the random walk is symmetric, it returns to 0 with probability 1, but the expected amount of time it takes to do so is infinite. Secondly, when the random walk is not symmetric, it will, with positive probability, fail to return. On the other hand, in the non-symmetric case, the behavior of the trajectories is interesting. Namely, (1.1.13) in combination with (1.1.15) says that either they fail to return at all or they return relatively quickly.

1.1.5. Passage Times via Functional Equations: We close this discussion of passage times for nearest neighbor random walks with a less computational derivation of (1.1.10). For this purpose, set $u_a(s) = \mathbb{E}\bigl[s^{\zeta_a}\bigr]$ for $a \in \mathbb{Z}\setminus\{0\}$ and $s \in (-1,1)$. Given $a \in \mathbb{Z}^+$, we use the ideas in §1.1.3, especially (1.1.8), to arrive at
$$u_{a+1}(s) = \sum_{m=1}^{\infty}\mathbb{E}\bigl[s^{\zeta_{a+1}},\, \zeta_a = m\bigr] = \sum_{m=1}^{\infty} s^m\,\mathbb{P}(\zeta_a = m)\,u_1(s) = u_a(s)\,u_1(s).$$
⁶ When $X$ is a random variable and $A$ is an event, we will often use $\mathbb{E}[X, A]$ to denote $\mathbb{E}\bigl[X\mathbf{1}_A\bigr]$.
⁷ $a \vee b$ is used to denote the maximum $\max\{a, b\}$ of $a, b \in \mathbb{R}$.
Similarly, if $-a \in \mathbb{Z}^+$, then $u_{a-1}(s) = u_a(s)\,u_{-1}(s)$. Hence

(1.1.16)  $u_a(s) = u_{\mathrm{sgn}(a)}(s)^{|a|}$ for $a \in \mathbb{Z}\setminus\{0\}$ and $|s| < 1$.
Continuing with the same line of reasoning and using (1.1.16) with $a = 1$, we also have

$$u_1(s) = \mathbb{E}\bigl[s^{\zeta_1},\, X_1 = 1\bigr] + \mathbb{E}\bigl[s^{\zeta_1},\, X_1 = -1\bigr] = ps + s\,\mathbb{E}\bigl[s^{\zeta_2\circ\Sigma^1},\, X_1 = -1\bigr] = ps + qs\,u_2(s) = ps + qs\,u_1(s)^2.$$

Hence, by the quadratic formula,

$$u_1(s) = \frac{1 \pm \sqrt{1 - 4pqs^2}}{2qs}.$$

Because $\mathbb{P}(\zeta_1 \text{ is odd}) = 1$, $u_1(-s) = -u_1(s)$. At the same time,

$$s \in (0,1) \implies \frac{1 + \sqrt{1 - 4pqs^2}}{2qs} > \frac{1 + \sqrt{1 - 4pq}}{2q} = \frac{p\vee q}{q} \ge 1.$$

Hence, since $s \in (0,1) \implies u_1(s) < 1$, we can eliminate the "+" solution and thereby arrive at a second derivation of (1.1.10). In fact, after combining this
with (1.1.16), we have shown that

(1.1.17)  $\mathbb{E}\bigl[s^{\zeta_a}\bigr] = \begin{cases} \Bigl(\dfrac{1 - \sqrt{1 - 4pqs^2}}{2qs}\Bigr)^{a} & \text{if } a \in \mathbb{Z}^+ \\[6pt] \Bigl(\dfrac{1 - \sqrt{1 - 4pqs^2}}{2ps}\Bigr)^{-a} & \text{if } -a \in \mathbb{Z}^+ \end{cases}$ for $|s| < 1$.
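The functional equation $u_1(s) = ps + qs\,u_1(s)^2$ and the reason for discarding the "+" root can both be checked numerically. A small sketch of our own (the function name is hypothetical):

```python
from math import sqrt

def u1_minus(p, s):
    """The "-" root of the quadratic qs*u^2 - u + ps = 0, i.e. (1.1.10)."""
    q = 1 - p
    return (1 - sqrt(1 - 4*p*q*s*s)) / (2*q*s)

for p in (0.25, 0.5, 0.75):
    q = 1 - p
    for s in (0.1, 0.5, 0.95):
        u = u1_minus(p, s)
        # u solves the functional equation u = ps + qs*u^2 ...
        assert abs(u - (p*s + q*s*u*u)) < 1e-12
        # ... and is a legitimate value of E[s^zeta_1], i.e. below 1,
        assert 0 < u < 1
        # while the "+" root exceeds 1 on (0,1) and must be discarded
        u_plus = (1 + sqrt(1 - 4*p*q*s*s)) / (2*q*s)
        assert u_plus > 1
```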
1.2 Recurrence Properties of Random Walks
In §1.1.4, we studied the time $\rho_0$ of first return of a nearest neighbor random walk to 0. As we will see in Chapters 2 and 3, times of first return are critical (cf. §2.3.2) for an understanding of the long time behavior of random walks and related processes. Indeed, when the random walk returns to 0, it starts all over again. Thus, if it returns with probability 1, then the entire history of the walk will consist of epochs, each epoch being a sojourn which begins and ends at 0. Because it marks the time at which one epoch ends and a second, identically distributed, one begins, a time of first return is often called a recurrence time, and the walk is said to be recurrent if $\mathbb{P}(\rho_0 < \infty) = 1$. Walks which are not recurrent are said to be transient.

In this section, we will discuss the recurrence properties of nearest neighbor random walks. Of course, we already know (cf. (1.1.13)) that a nearest neighbor random walk on $\mathbb{Z}$ is recurrent if and only if it is symmetric. Thus, our interest here will be in higher dimensional analogs. In particular, in the hope that it will be convincing evidence that recurrence is subtle, we will show that the recurrence of the nearest neighbor, symmetric random walk on $\mathbb{Z}$ persists when $\mathbb{Z}$ is replaced by $\mathbb{Z}^2$ but disappears in $\mathbb{Z}^3$.
1.2.1. Random Walks on $\mathbb{Z}^d$: To describe the analog on $\mathbb{Z}^d$ of a nearest neighbor random walk on $\mathbb{Z}$, we begin by thinking of $N_1 \equiv \{-1, 1\}$ as the set of nearest neighbors in $\mathbb{Z}$ of 0. It should then be clear why the set $N_d$ of nearest neighbors of the origin in $\mathbb{Z}^d$ consists of the $2d$ points in $\mathbb{Z}^d$ for which $(d-1)$ coordinates are 0 and the remaining coordinate is in $N_1$. Next, we replace the $N_1$-valued Bernoulli random variables in §1.1 by their $d$-dimensional analogs, namely: independent, identically distributed $N_d$-valued random variables $B_1, \dots, B_n, \dots$.⁸ Finally, a nearest neighbor random walk on $\mathbb{Z}^d$ is a family $\{X_n : n \ge 0\}$ of the form

$$X_0 = 0 \quad\text{and}\quad X_n = \sum_{m=1}^{n} B_m \text{ for } n \ge 1.$$
The equivalent, stochastic process oriented description of $\{X_n : n \ge 0\}$ is $\mathbb{P}(X_0 = 0) = 1$ and, for $n \ge 1$ and $\mathbf{e} \in N_d$,

(1.2.1)  $\mathbb{P}\bigl(X_n - X_{n-1} = \mathbf{e} \mid X_0, \dots, X_{n-1}\bigr) = p_{\mathbf e}, \quad\text{where } p_{\mathbf e} \equiv \mathbb{P}(B_1 = \mathbf e).$

When $B_1$ is uniformly distributed on $N_d$, the random walk is said to be symmetric.
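The two step description of the increments implicit in (1.2.1) translates directly into a sampler. The following sketch is our own illustration (not from the text): it draws a path of the symmetric nearest neighbor walk on $\mathbb{Z}^d$ by first picking the coordinate that moves and then its sign.

```python
import random

def nn_walk(d, n, seed=0):
    """Sample X_0, ..., X_n for the symmetric nearest neighbor walk on Z^d:
    each increment picks one of the d coordinates uniformly at random and
    then moves it by +1 or -1 with equal probability."""
    rng = random.Random(seed)
    x = [0] * d
    path = [tuple(x)]
    for _ in range(n):
        i = rng.randrange(d)            # which coordinate moves
        x[i] += rng.choice((-1, 1))     # and in which direction
        path.append(tuple(x))
    return path

path = nn_walk(3, 50)
# every increment is a nearest neighbor step: exactly one coordinate
# changes, and it changes by exactly 1
for a, b in zip(path, path[1:]):
    assert sum(abs(u - v) for u, v in zip(a, b)) == 1
```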
In keeping with the notation and terminology introduced above, we define the time $\rho_0$ of first return to the origin to be equal to $n$ if $n \ge 1$, $X_n = 0$, and $X_m \ne 0$ for $1 \le m < n$, and we take $\rho_0 = \infty$ if no such $n \ge 1$ exists. Also, we will say that the walk is recurrent or transient according to whether $\mathbb{P}(\rho_0 < \infty)$ is 1 or strictly less than 1.

1.2.2. An Elementary Recurrence Criterion: Given $n \ge 1$, let $\rho_0^{(n)}$ be the time of the $n$th return to 0. That is, $\rho_0^{(1)} = \rho_0$ and, for $n \ge 2$,

$$\rho_0^{(n-1)} < \infty \implies \rho_0^{(n)} = \inf\bigl\{m > \rho_0^{(n-1)} : X_m = 0\bigr\} \quad\text{and}\quad \rho_0^{(n-1)} = \infty \implies \rho_0^{(n)} = \infty.$$

Equivalently, if $g : (N_d)^{\mathbb{Z}^+} \to \mathbb{Z}^+ \cup \{\infty\}$ is determined so that

$$g(\mathbf{e}_1, \dots, \mathbf{e}_\ell, \dots) > n \quad\text{if}\quad \sum_{\ell=1}^{m}\mathbf{e}_\ell \ne 0 \text{ for } 1 \le m \le n,$$

then $\rho_0 = g(B_1, \dots, B_\ell, \dots)$, and $\rho_0^{(n)} = m \implies \rho_0^{(n+1)} = m + \rho_0 \circ \Sigma^m$, where $\rho_0 \circ \Sigma^m$ is equal to $g(B_{m+1}, \dots, B_{m+\ell}, \dots)$. In particular, this leads to

$$\mathbb{P}\bigl(\rho_0^{(n+1)} < \infty\bigr) = \sum_{m=1}^{\infty}\mathbb{P}\bigl(\rho_0^{(n)} = m \;\&\; \rho_0 \circ \Sigma^m < \infty\bigr)$$
⁸ The existence of the $B_n$'s can be seen as a consequence of Theorem 6.3.2. Namely, let $\{U_n : n \in \mathbb{Z}^+\}$ be a family of mutually independent random variables which are uniformly distributed on $[0,1)$, let $(\mathbf{k}_1, \dots, \mathbf{k}_{2d})$ be an ordering of the elements of $N_d$, set $\beta_0 = 0$ and $\beta_m = \sum_{\ell=1}^{m}\mathbb{P}(B_1 = \mathbf{k}_\ell)$ for $1 \le m \le 2d$, define $F : [0,1) \to N_d$ so that $F \restriction [\beta_{m-1}, \beta_m) = \mathbf{k}_m$, and set $B_n = F(U_n)$.
since $\{\rho_0^{(n)} = m\}$ depends only on $(B_1, \dots, B_m)$, and is therefore independent of $\rho_0 \circ \Sigma^m$, and the distribution of $\rho_0 \circ \Sigma^m$ is the same as that of $\rho_0$. Thus, we have proved that

(1.2.2)  $\mathbb{P}\bigl(\rho_0^{(n)} < \infty\bigr) = \mathbb{P}(\rho_0 < \infty)^n$ for $n \ge 1$.

One dividend of (1.2.2) is that it supports the epochal picture given above about the structure of recurrent walks. Namely, it says that if the walk returns once to 0 with probability 1, then, with probability 1, it will do so infinitely often. This observation has many applications. For example, it shows that if the mean value of the $i$th coordinate of $B_1$ is different from 0, then $\{X_n : n \ge 0\}$ must be transient. To see this, use $Y_n$ to denote the $i$th coordinate of $X_n$, and observe that $\{Y_n - Y_{n-1} : n \ge 1\}$ is a sequence of mutually independent, identically distributed $\{-1, 0, 1\}$-valued random variables with mean value $\mu_i \ne 0$. But, by the Strong Law of Large Numbers (cf. Exercise 1.3.4 below), this means that $\frac{Y_n}{n} \to \mu_i \ne 0$ with probability 1, which is possible only if $|X_n| \ge |Y_n| \to \infty$ with probability 1, and clearly this eliminates the possibility that, even with positive probability, $X_n = 0$ infinitely often.

A second dividend of (1.2.2) is the following. Define

$$T_0 = \sum_{n=0}^{\infty}\mathbf{1}_{\{0\}}(X_n)$$

to be the total time that $\{X_n : n \ge 0\}$ spends at the origin. Since $X_0 = 0$, $T_0 \ge 1$. Moreover, for $n \ge 1$, $T_0 > n \iff \rho_0^{(n)} < \infty$. Hence, by (1.2.2),
$$\mathbb{E}[T_0] = \sum_{n=0}^{\infty}\mathbb{P}(T_0 > n) = 1 + \sum_{n=1}^{\infty}\mathbb{P}\bigl(\rho_0^{(n)} < \infty\bigr) = 1 + \sum_{n=1}^{\infty}\mathbb{P}(\rho_0 < \infty)^n,$$

and so

(1.2.3)  $\mathbb{E}[T_0] = \dfrac{1}{1 - \mathbb{P}(\rho_0 < \infty)} = \dfrac{1}{\mathbb{P}(\rho_0 = \infty)}.$
Before applying (1.2.3) to the problem of recurrence, it is interesting to note that $T_0$ is a random variable for which the following peculiar dichotomy holds:

(1.2.4)  $\mathbb{P}(T_0 < \infty) > 0 \implies \mathbb{E}[T_0] < \infty \quad\text{and}\quad \mathbb{E}[T_0] = \infty \implies \mathbb{P}(T_0 = \infty) = 1.$

Indeed, if $\mathbb{P}(T_0 < \infty) > 0$, then, with positive probability, $X_n$ cannot be 0 infinitely often and so, by (1.2.2), $\mathbb{P}(\rho_0 < \infty) < 1$, which, by (1.2.3), means that $\mathbb{E}[T_0] < \infty$. On the other hand, if $\mathbb{E}[T_0] = \infty$, then (1.2.3) implies that $\mathbb{P}(\rho_0 < \infty) = 1$ and therefore, by (1.2.2), that $\mathbb{P}(T_0 > n) = \mathbb{P}\bigl(\rho_0^{(n)} < \infty\bigr) = 1$ for all $n \ge 1$. Hence (cf. (6.1.3)), $\mathbb{P}(T_0 = \infty) = 1$.
1.2.3. Recurrence of Symmetric Random Walk in $\mathbb{Z}^2$: The most frequent way that (1.2.3) gets applied to determine recurrence is in conjunction with the formula
(1.2.5)  $\mathbb{E}[T_0] = \displaystyle\sum_{n=0}^{\infty}\mathbb{P}(X_n = 0).$
Although the proof of (1.2.5) is essentially trivial (cf. Theorem 6.1.15):

$$\mathbb{E}[T_0] = \mathbb{E}\biggl[\sum_{n=0}^{\infty}\mathbf{1}_{\{0\}}(X_n)\biggr] = \sum_{n=0}^{\infty}\mathbb{E}\bigl[\mathbf{1}_{\{0\}}(X_n)\bigr] = \sum_{n=0}^{\infty}\mathbb{P}(X_n = 0),$$

in conjunction with (1.2.3) it becomes powerful. Namely, it says that

(1.2.6)  $\{X_n : n \ge 0\}$ is recurrent if and only if $\displaystyle\sum_{n=0}^{\infty}\mathbb{P}(X_n = 0) = \infty$,
and, since $\mathbb{P}(X_n = 0)$ is more amenable to estimation than quantities which involve knowing the trajectory at more than one time, this is valuable information.

In order to apply (1.2.6) to symmetric random walks, it is important to know that when the walk is symmetric, then 0 is the most likely place for the walk to be at any even time. To verify this, note that if $\mathbf{k} \in \mathbb{Z}^d$,

$$\mathbb{P}(X_{2n} = \mathbf{k}) = \sum_{\boldsymbol\ell\in\mathbb{Z}^d}\mathbb{P}\bigl(X_n = \boldsymbol\ell \;\&\; X_{2n} - X_n = \mathbf{k} - \boldsymbol\ell\bigr) = \sum_{\boldsymbol\ell\in\mathbb{Z}^d}\mathbb{P}(X_n = \boldsymbol\ell)\,\mathbb{P}(X_n = \mathbf{k} - \boldsymbol\ell) \le \sum_{\boldsymbol\ell\in\mathbb{Z}^d}\mathbb{P}(X_n = \boldsymbol\ell)^2,$$

where, in the passage to the last expression, we have applied Schwarz's inequality (cf. Exercise 1.3.1 below). Up to this point we have not used symmetry. However, if the walk is symmetric, then $\mathbb{P}(X_n = \boldsymbol\ell) = \mathbb{P}(X_n = -\boldsymbol\ell)$, and so the last expression in the preceding can be continued as

$$\sum_{\boldsymbol\ell\in\mathbb{Z}^d}\mathbb{P}(X_n = \boldsymbol\ell)\,\mathbb{P}(X_{2n} - X_n = -\boldsymbol\ell) = \mathbb{P}(X_{2n} = 0).$$

Thus,

(1.2.7)  $\{X_n : n \ge 0\}$ symmetric $\implies \mathbb{P}(X_{2n} = 0) = \max_{\mathbf{k}\in\mathbb{Z}^d}\mathbb{P}(X_{2n} = \mathbf{k}).$
To develop a feeling for how these considerations get applied, we begin by using them to give a second derivation of the recurrence of the nearest neighbor, symmetric random walk on $\mathbb{Z}$. For this purpose, note that, because $\mathbb{P}(|X_n| \le n) = 1$, (1.2.7) implies that

$$1 = \sum_{\ell=-2n}^{2n}\mathbb{P}(X_{2n} = \ell) \le (4n+1)\,\mathbb{P}(X_{2n} = 0),$$

and therefore, since the harmonic series diverges, that $\sum_{n=0}^{\infty}\mathbb{P}(X_n = 0) = \infty$.

The analysis for the symmetric, nearest neighbor random walk in $\mathbb{Z}^2$ requires an additional ingredient. Namely, the $d$-dimensional analog of the preceding line of reasoning would lead to $\mathbb{P}(X_{2n} = 0) \ge (4n+1)^{-d}$, which is inconclusive except when $d = 1$. In order to do better, we need to use the fact that
(1.2.8)  $\mathbb{E}\bigl[|X_n|^2\bigr] = n.$

To prove (1.2.8), note that each coordinate of $B_n$ is a random variable with mean value 0 and variance $\frac1d$. Hence, because the $B_n$'s are mutually independent, the second moment of each coordinate of $X_n$ is $\frac{n}{d}$. Knowing (1.2.8), Markov's inequality (6.1.12) says that

$$\mathbb{P}\bigl(|X_{2n}| \ge 2\sqrt{n}\bigr) \le \frac{\mathbb{E}\bigl[|X_{2n}|^2\bigr]}{4n} = \frac12,$$

which allows us to sharpen the preceding argument to give⁹

$$\frac12 \le \mathbb{P}\bigl(|X_{2n}| < 2\sqrt{n}\bigr) = \sum_{|\boldsymbol\ell| < 2\sqrt{n}}\mathbb{P}(X_{2n} = \boldsymbol\ell) \le \bigl(4\sqrt{n} + 1\bigr)^d\,\mathbb{P}(X_{2n} = 0) \le 2^{d-1}\bigl(4^d n^{\frac d2} + 1\bigr)\mathbb{P}(X_{2n} = 0).$$
That is, we have now shown that

(1.2.9)  $\mathbb{P}(X_{2n} = 0) \ge \dfrac{1}{2^d\bigl(4^d n^{\frac d2} + 1\bigr)}$

for the symmetric, nearest neighbor random walk on $\mathbb{Z}^d$. In particular, when $d = 2$, this proves that the symmetric, nearest neighbor random walk on $\mathbb{Z}^2$ is recurrent.
⁹ For any $a, b \in [0, \infty)$ and $p \in [1, \infty)$, $(a+b)^p \le 2^{p-1}(a^p + b^p)$. This can be seen as an application of Jensen's inequality (cf. Exercise 5.6.2), which, in this case, is simply the statement that $x \in [0,\infty) \mapsto x^p$ is convex.
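In $d = 2$ the probabilities $\mathbb{P}(X_{2n} = 0)$ can be computed exactly by convolving the one step distribution, which makes the polynomial decay behind (1.2.9) concrete. The sketch below is our own (it also checks the classical identity $\mathbb{P}(X_{2n} = 0) = \bigl(\binom{2n}{n}/4^n\bigr)^2$ for $d = 2$, a known fact that is not derived in this text):

```python
from fractions import Fraction
from math import comb

def p_origin_2d(steps):
    """Exact P(X_steps = 0) for the symmetric nearest neighbor walk on
    Z^2, by dynamic programming with rational weights."""
    dist = {(0, 0): Fraction(1)}
    for _ in range(steps):
        new = {}
        for (x, y), pr in dist.items():
            for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                key = (x + dx, y + dy)
                new[key] = new.get(key, Fraction(0)) + pr / 4
        dist = new
    return dist.get((0, 0), Fraction(0))

for n in (1, 2, 3, 4):
    # classical d = 2 identity, consistent with decay of order 1/n
    assert p_origin_2d(2 * n) == Fraction(comb(2*n, n), 4**n)**2
```

Since $\binom{2n}{n}/4^n$ is of order $n^{-1/2}$, the squared value decays like $1/n$, so the return-probability series diverges for $d = 2$, exactly as (1.2.6) requires for recurrence.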
1.2.4. Transience in $\mathbb{Z}^3$: Although (1.2.9) was sufficient to prove recurrence for the symmetric, nearest neighbor random walk in $\mathbb{Z}^2$, it only leaves open the possibility of transience in $\mathbb{Z}^d$ for $d \ge 3$. Thus, in order to nail down the question when $d \ge 3$, we will need to see how good an estimate (1.2.9) really is. In particular, it would suffice to prove that there is an upper bound of the same form. To get an upper bound which complements the lower bound in (1.2.9), we first do so in the case when $d = 1$. For this purpose, let $0 \le \ell \le n$ be given, and observe that

$$\frac{\mathbb{P}(X_{2n} = 2\ell)}{\mathbb{P}(X_{2n} = 0)} = \frac{(n!)^2}{(n+\ell)!\,(n-\ell)!} = \frac{n(n-1)\cdots(n-\ell+1)}{(n+\ell)(n+\ell-1)\cdots(n+1)} = \prod_{k=0}^{\ell-1}\Bigl(1 - \frac{\ell}{n+\ell-k}\Bigr) \ge \Bigl(1 - \frac{\ell}{n+1}\Bigr)^{\ell}.$$
Now recall that

(1.2.10)  $\log(1-x) = -\displaystyle\sum_{m=1}^{\infty}\frac{x^m}{m}$ for $|x| < 1$,

and therefore that $\log(1-x) \ge -\frac{3x}{2}$ for $0 \le x \le \frac12$. Hence, the preceding shows that

$$\frac{\mathbb{P}(X_{2n} = 2\ell)}{\mathbb{P}(X_{2n} = 0)} \ge \exp\Bigl(\ell\log\Bigl(1 - \frac{\ell}{n+1}\Bigr)\Bigr) \ge e^{-\frac{3\ell^2}{2(n+1)}} \ge e^{-2}$$

as long as $0 \le \ell \le \sqrt{n+1}$. Because $\mathbb{P}(X_{2n} = -2\ell) = \mathbb{P}(X_{2n} = 2\ell)$, we can now say that $\mathbb{P}(X_{2n} = 0) \le e^2\,\mathbb{P}(X_{2n} = 2\ell)$ for $|\ell| \le \sqrt{n}$. But, because $\sum_{\ell}\mathbb{P}(X_{2n} = 2\ell) = 1$, this means that $(2\sqrt{n} - 1)\,\mathbb{P}(X_{2n} = 0) \le e^2$, and so

(1.2.11)  $\mathbb{P}(X_{2n} = 0) \le \dfrac{e^2}{2\sqrt{n} - 1}$

when $\{X_n : n \ge 0\}$ is the symmetric, nearest neighbor random walk on $\mathbb{Z}$.

If, as they most definitely are not, the coordinates of the symmetric, nearest neighbor random walk were independent, then (1.2.11) would yield the sort of upper bound for which we are looking. Thus it is reasonable to examine to what extent we can relate the symmetric, nearest neighbor random walk on $\mathbb{Z}^d$ to $d$ mutually independent symmetric, nearest neighbor random walks $\{X_{i,n} : n \ge 0\}$, $1 \le i \le d$, on $\mathbb{Z}$. To this end, refer to (1.2.1), in which $p_{\mathbf e} = \frac{1}{2d}$, and think of choosing $X_n - X_{n-1}$ in two steps: first choose the coordinate which is to be non-zero and then choose whether it is to be $+1$ or $-1$. With this in mind, let $\{I_n : n \ge 1\}$ be a sequence of $\{1, \dots, d\}$-valued, mutually
independent, uniformly distributed random variables which are independent of $\{X_{i,n} : 1 \le i \le d \;\&\; n \ge 0\}$, set, for $1 \le i \le d$, $N_{i,0} = 0$ and $N_{i,n} = \sum_{m=1}^{n}\mathbf{1}_{\{i\}}(I_m)$ when $n \ge 1$, and consider the sequence $\{Y_n : n \ge 0\}$ given by

(1.2.12)  $Y_n = \bigl(X_{1,N_{1,n}}, \dots, X_{d,N_{d,n}}\bigr).$

Without too much effort, one can check that $\{Y_n : n \ge 0\}$ satisfies the conditions in (1.2.1) for the symmetric, nearest neighbor random walk on $\mathbb{Z}^d$ and therefore has the same distribution as $\{X_n : n \ge 0\}$. In particular, by (1.2.11),
$$\mathbb{P}(X_{2n} = 0) = \sum_{\mathbf m\in\mathbb{N}^d}\mathbb{P}\bigl(X_{i,2m_i} = 0 \;\&\; N_{i,2n} = 2m_i \text{ for } 1 \le i \le d\bigr)$$
$$= \sum_{\substack{\mathbf m\in\mathbb{N}^d\\ m_1\wedge\cdots\wedge m_d \ge \frac{n}{2d}}}\Bigl(\,\prod_{i=1}^{d}\mathbb{P}\bigl(X_{i,2m_i} = 0\bigr)\Bigr)\,\mathbb{P}\bigl(N_{i,2n} = 2m_i \text{ for } 1 \le i \le d\bigr) \;+ \sum_{\substack{\mathbf m\in\mathbb{N}^d\\ m_1\wedge\cdots\wedge m_d < \frac{n}{2d}}}\cdots$$
$$\le e^{2d}\Bigl(2\sqrt{\tfrac{n}{2d}} - 1\Bigr)^{-d} + \mathbb{P}\bigl(N_{i,2n} \le \tfrac{n}{d} \text{ for some } 1 \le i \le d\bigr).$$
Thus, we will have proved that there is a constant $A(d) < \infty$ such that

(1.2.13)  $\mathbb{P}(X_{2n} = 0) \le A(d)\,n^{-\frac d2}, \quad n \ge 1,$

once we show that there is a constant $B(d) < \infty$ such that

(1.2.14)  $\mathbb{P}\bigl(N_{i,2n} \le \tfrac{n}{d} \text{ for some } 1 \le i \le d\bigr) \le B(d)\,n^{-\frac d2}, \quad n \ge 1.$

In particular, this will complete the proof that

$$d \ge 3 \implies \sum_{n=0}^{\infty}\mathbb{P}(X_{2n} = 0) < \infty$$

and therefore that the symmetric, nearest neighbor random walk in $\mathbb{Z}^d$ is transient when $d \ge 3$.

To prove (1.2.14), first note that

$$\mathbb{P}\bigl(N_{i,2n} \le \tfrac{n}{d} \text{ for some } 1 \le i \le d\bigr) \le d\,\mathbb{P}\bigl(N_{1,2n} \le \tfrac{n}{d}\bigr).$$
Next, write $N_{1,n} = \sum_{1}^{n} Z_m$ where $Z_m = \mathbf{1}_{\{1\}}(I_m)$, and observe that $\{Z_m : m \ge 1\}$ is a sequence of $\{0,1\}$-valued Bernoulli random variables such that $\mathbb{P}(Z_m = 1) = p \equiv \frac1d$. In particular, for any $\lambda \in \mathbb{R}$,

$$\mathbb{E}\Bigl[\exp\Bigl(\lambda\Bigl(np - \sum_{m=1}^{n} Z_m\Bigr)\Bigr)\Bigr] = e^{n\psi(\lambda)} \quad\text{where } \psi(\lambda) = \log\bigl(pe^{-\lambda q} + qe^{\lambda p}\bigr).$$

Since $\psi(0) = \psi'(0) = 0$, and

$$\psi''(\lambda) = \frac{pq\,e^{\lambda(p-q)}}{\bigl(qe^{\lambda p} + pe^{-\lambda q}\bigr)^2} = x_\lambda(1 - x_\lambda) \le \frac14, \quad\text{where } x_\lambda \equiv \frac{qe^{\lambda p}}{qe^{\lambda p} + pe^{-\lambda q}},$$

Taylor's formula allows us to conclude that

(1.2.15)  $\psi(\lambda) \le \dfrac{\lambda^2}{8}$ for $\lambda \in \mathbb{R}$.

Starting from (1.2.15), there are many ways to arrive at (1.2.14). For example, for any $\lambda > 0$ and $R > 0$, Markov's inequality (6.1.12) plus (1.2.15) say that

$$\mathbb{P}\Bigl(\sum_{m=1}^{n} Z_m \le np - nR\Bigr) \le e^{-\lambda nR}\,\mathbb{E}\Bigl[e^{\lambda\bigl(np - \sum_1^n Z_m\bigr)}\Bigr] \le e^{-\lambda nR + \frac{n\lambda^2}{8}},$$

which, when $\lambda = 4R$, gives

(1.2.16)  $\mathbb{P}\Bigl(\displaystyle\sum_{m=1}^{n} Z_m \le np - nR\Bigr) \le e^{-2nR^2}.$

Returning to the notation used earlier and using the remark with which our discussion of (1.2.14) began, one sees from (1.2.16) that

$$\mathbb{P}\bigl(N_{i,2n} \le \tfrac{n}{d} \text{ for some } 1 \le i \le d\bigr) \le d\,e^{-\frac{n}{d^2}},$$

which is obviously far more than is required by (1.2.14).
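The exponential bound (1.2.16) can be compared against simulation. In the sketch below (our own; the trial count and parameters are arbitrary), the empirical lower tail of a Bernoulli sum does sit under $e^{-2nR^2}$:

```python
import random
from math import exp

def lower_tail_mc(n, p, R, trials=20000, seed=1):
    """Monte Carlo estimate of P(Z_1 + ... + Z_n <= np - nR) for
    independent {0,1}-valued Z_m with P(Z_m = 1) = p."""
    rng = random.Random(seed)
    cutoff = n * p - n * R
    hits = 0
    for _ in range(trials):
        s = sum(1 for _ in range(n) if rng.random() < p)
        if s <= cutoff:
            hits += 1
    return hits / trials

n, p, R = 200, 0.5, 0.1
chernoff = exp(-2 * n * R * R)        # right-hand side of (1.2.16)
assert lower_tail_mc(n, p, R) <= chernoff
```

With these numbers the true tail probability is roughly a tenth of the Chernoff bound, which illustrates that (1.2.16) is generous but of the right exponential order.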
The argument which we have used in this subsection is an example of an extremely powerful method known as coupling. Loosely speaking, the coupling method entails writing a random variable about which one wants to know more (in our case $X_{2n}$) as a function of random variables about which one knows quite a bit (in our case $\{(X_{i,m}, N_{i,m}) : 1 \le i \le d \;\&\; m \ge 0\}$). Of course, the power of the method reflects the power of its user: there are lots of ways in which to couple a random variable to other random variables, but most of them are useless.
1.3 Exercises
EXERCISE 1.3.1. Schwarz's inequality comes in many forms, the most elementary of which is the statement that, for any $\{a_n : n \in \mathbb{Z}\} \subseteq \mathbb{R}$ and $\{b_n : n \in \mathbb{Z}\} \subseteq \mathbb{R}$,

$$\sum_{n\in\mathbb{Z}} |a_n b_n| \le \Bigl(\sum_{n\in\mathbb{Z}} a_n^2\Bigr)^{\frac12}\Bigl(\sum_{n\in\mathbb{Z}} b_n^2\Bigr)^{\frac12}.$$

Moreover, when the right hand side is finite, then equality holds if and only if there is an $\alpha \in \mathbb{R}$ for which either $b_n = \alpha a_n$, $n \in \mathbb{Z}$, or $a_n = \alpha b_n$, $n \in \mathbb{Z}$. Here is an outline of one proof of these statements.

(a) Begin by showing that it suffices to treat the case in which $a_n = 0 = b_n$ for all but a finite number of $n$'s.

(b) Given a real, quadratic polynomial $P(x) = Ax^2 + 2Bx + C$, use the quadratic formula to see that $P \ge 0$ everywhere if and only if $C \ge 0$ and $B^2 \le AC$. Similarly, show that $P > 0$ everywhere if and only if $C > 0$ and $B^2 < AC$.

(c) Assuming that $a_n = 0 = b_n$ for all but a finite number of $n$'s, set $P(x) = \sum_n (a_n x + b_n)^2$, and apply (b) to get the desired conclusions. Finally, use (a) to remove the restriction on the $a_n$'s and $b_n$'s.

EXERCISE 1.3.2. Let $\{Y_n : n \ge 1\}$ be a sequence of mutually independent, identically distributed random variables satisfying $\mathbb{E}\bigl[|Y_1|\bigr] < \infty$. Set $X_n = \sum_{m=1}^{n} Y_m$ for $n \ge 1$. The Weak Law of Large Numbers says that

$$\lim_{n\to\infty}\mathbb{P}\Bigl(\Bigl|\frac{X_n}{n} - \mathbb{E}[Y_1]\Bigr| \ge \epsilon\Bigr) = 0 \quad\text{for each } \epsilon > 0.$$
In fact,

(1.3.3)  $\lim_{n\to\infty}\mathbb{E}\Bigl[\Bigl|\dfrac{X_n}{n} - \mathbb{E}[Y_1]\Bigr|\Bigr] = 0,$

from which the above follows as an application of Markov's inequality. Here are steps which lead to (1.3.3).

(a) First reduce to the case when $\mathbb{E}[Y_1] = 0$. Next, assume that $\mathbb{E}\bigl[Y_1^2\bigr] < \infty$, and show that

$$\mathbb{E}\Bigl[\Bigl(\frac{X_n}{n}\Bigr)^2\Bigr] = \frac{\mathbb{E}\bigl[Y_1^2\bigr]}{n} \longrightarrow 0.$$

Hence the result is proved when $Y_1$ has a finite second moment.
(b) Given $R > 0$, set $\check Y_n^{(R)} = Y_n\mathbf{1}_{[0,R)}\bigl(|Y_n|\bigr) - \mathbb{E}\bigl[Y_n,\, |Y_n| < R\bigr]$ and $\check X_n^{(R)} = \sum_{m=1}^{n}\check Y_m^{(R)}$. Note that, for any $R > 0$,

$$\mathbb{E}\biggl[\Bigl|\frac{X_n}{n}\Bigr|\biggr] \le \mathbb{E}\biggl[\Bigl|\frac{\check X_n^{(R)}}{n}\Bigr|\biggr] + 2\,\mathbb{E}\bigl[|Y_1|,\, |Y_1| \ge R\bigr] \le \mathbb{E}\biggl[\Bigl(\frac{\check X_n^{(R)}}{n}\Bigr)^{\!2}\biggr]^{\frac12} + 2\,\mathbb{E}\bigl[|Y_1|,\, |Y_1| \ge R\bigr],$$
and use this, together with the Monotone Convergence Theorem, to complete the proof of (1.3.3).

EXERCISE 1.3.4. Refer to Exercise 1.3.2. The Strong Law of Large Numbers says that the statement in the Weak Law can be improved to the statement that $\frac{X_n}{n} \to \mathbb{E}[Y_1]$ with probability 1. The proof of the Strong Law when one assumes only that $\mathbb{E}\bigl[|Y_1|\bigr] < \infty$ is a bit tricky. However, if one is willing to assume that $\mathbb{E}\bigl[Y_1^4\bigr] < \infty$, then a proof can be based on the same type of argument which leads to the Weak Law. Let $\{Y_n\}_1^{\infty}$ be a sequence of mutually independent random variables with the property that $M \equiv \sup_n \mathbb{E}\bigl[|Y_n|^4\bigr] < \infty$, and prove that, with probability 1, $\lim_{n\to\infty}\frac1n\sum_{m=1}^{n}\bigl(Y_m - \mathbb{E}[Y_m]\bigr) = 0$. Note that we have not yet assumed that they are identically distributed, but when we add this assumption we get $\lim_{n\to\infty}\frac1n\sum_{m=1}^{n} Y_m = \mathbb{E}[Y_1]$ with probability 1. Here is an outline.

(a) Begin by reducing to the case when $\mathbb{E}[Y_n] = 0$ for all $n \in \mathbb{Z}^+$.

(b) After writing

$$\mathbb{E}\bigl[X_n^4\bigr] = \sum_{k_1,\dots,k_4 = 1}^{n}\mathbb{E}\bigl[Y_{k_1} Y_{k_2} Y_{k_3} Y_{k_4}\bigr]$$

and noting that the only terms which do not vanish are those for which each index is equal to at least one other index, conclude that

$(*)$  $\mathbb{E}\bigl[X_n^4\bigr] \le 3Mn^2.$
(c) Starting from $(*)$, show that

$$\mathbb{P}\Bigl(\Bigl|\frac{X_n}{n}\Bigr| \ge \epsilon\Bigr) \le \frac{3M}{\epsilon^4 n^2}$$

for all $\epsilon > 0$. This is the Weak Law of Large Numbers for independent random variables with bounded fourth moments. Of course, the use of four moments here is somewhat ridiculous since the argument using only two moments is easier.

(d) Starting again from $(*)$ and using (6.1.4), show that

$$\mathbb{P}\biggl(\,\bigcup_{n=m+1}^{\infty}\Bigl\{\Bigl|\frac{X_n}{n}\Bigr| \ge \epsilon\Bigr\}\biggr) \le \frac{3M}{\epsilon^4}\sum_{n=m+1}^{\infty}\frac{1}{n^2} \le \frac{3M}{\epsilon^4 m} \longrightarrow 0 \quad\text{as } m \to \infty \text{ for all } \epsilon > 0.$$
(e) Use the definition of convergence plus (6.1.4) to show that

$$\mathbb{P}\Bigl(\frac{X_n}{n} \not\longrightarrow 0\Bigr) = \mathbb{P}\biggl(\bigcup_{N=1}^{\infty}\bigcap_{m=1}^{\infty}\bigcup_{n=m}^{\infty}\Bigl\{\Bigl|\frac{X_n}{n}\Bigr| \ge \frac1N\Bigr\}\biggr) \le \sum_{N=1}^{\infty}\mathbb{P}\biggl(\bigcap_{m=1}^{\infty}\bigcup_{n=m}^{\infty}\Bigl\{\Bigl|\frac{X_n}{n}\Bigr| \ge \frac1N\Bigr\}\biggr).$$

Finally, apply the second line of (6.1.3) plus (d) above to justify

$$\mathbb{P}\biggl(\bigcap_{m=1}^{\infty}\bigcup_{n=m}^{\infty}\Bigl\{\Bigl|\frac{X_n}{n}\Bigr| \ge \frac1N\Bigr\}\biggr) = \lim_{m\to\infty}\mathbb{P}\biggl(\bigcup_{n=m}^{\infty}\Bigl\{\Bigl|\frac{X_n}{n}\Bigr| \ge \frac1N\Bigr\}\biggr) = 0$$

for each $N \in \mathbb{Z}^+$. Hence, with probability 1, $\frac{X_n}{n} \to 0$, which is the Strong Law of Large Numbers for independent random variables with bounded fourth moments.

EXERCISE 1.3.5. Readers who know DeMoivre's proof of the Central Limit Theorem will have realized that the estimate in (1.2.11) is a poor man's substitute for what one can get as a consequence of Stirling's formula
(1.3.6)  $n! \sim \sqrt{2\pi n}\,\Bigl(\dfrac{n}{e}\Bigr)^{n}$ as $n \to \infty$,

meaning that the ratio of the quantities on the two sides of "$\sim$" tends to 1. Indeed, given (1.3.6), show that

$$\mathbb{P}(X_{2n} = 0) \sim \frac{1}{\sqrt{\pi n}}\,.$$

Next, give a proof of (1.3.6) based on the following line of reasoning.
(a) Let $T_1, \dots, T_n$ be mutually independent, unit exponential random variables,¹⁰ and show that for any $0 < R \le \sqrt{n}$,

$$1 - \frac{1}{R^2} \le \mathbb{P}\Bigl(-R \le \frac{T_1 + \cdots + T_n - n}{\sqrt{n}} \le R\Bigr) = \frac{1}{(n-1)!}\int_{n - \sqrt{n}R}^{n + \sqrt{n}R} t^{n-1} e^{-t}\,dt.$$

(b) Make a change of variables followed by elementary manipulations to show that

$$\frac{1}{(n-1)!}\int_{n-\sqrt{n}R}^{n+\sqrt{n}R} t^{n-1}e^{-t}\,dt = \frac{n^{n}e^{-n}\sqrt{n}}{n!}\int_{-R}^{R} e^{-\frac{\sigma^2}{2} + E_n(\sigma)}\,d\sigma,$$

where

$$E_n(\sigma) \equiv (n-1)\log\Bigl(1 + \frac{\sigma}{\sqrt{n}}\Bigr) - \sqrt{n}\,\sigma + \frac{\sigma^2}{2}.$$

(c) As an application of the Taylor series for $\log(1+x)$ (cf. (1.2.10)), show that $E_n(\sigma) \to 0$ uniformly for $|\sigma| \le R$ when $n \to \infty$, and combine this with the results in (a) and (b) to arrive at

$$1 - \frac{1}{R^2} \le \varliminf_{n\to\infty}\frac{n^n e^{-n}\sqrt{n}}{n!}\int_{-R}^{R} e^{-\frac{\sigma^2}{2}}\,d\sigma \quad\text{and}\quad \varlimsup_{n\to\infty}\frac{n^n e^{-n}\sqrt{n}}{n!}\int_{-R}^{R} e^{-\frac{\sigma^2}{2}}\,d\sigma \le 1.$$

Because $\int_{-\infty}^{\infty} e^{-\frac{\sigma^2}{2}}\,d\sigma = \sqrt{2\pi}$, it is clear that (1.3.6) follows after one lets $R \nearrow \infty$.
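Stirling's formula (1.3.6) is also easy to test numerically; working with logarithms avoids overflow for large $n$. A brief check of our own (using the standard fact $\operatorname{lgamma}(n+1) = \log n!$):

```python
from math import lgamma, log, pi

def log_stirling(n):
    """log of sqrt(2*pi*n) * (n/e)^n, the right-hand side of (1.3.6)."""
    return 0.5 * log(2 * pi * n) + n * (log(n) - 1)

# log n! - log_stirling(n) is positive and of order 1/(12n), so the
# ratio of the two sides of (1.3.6) tends to 1
for n in (10, 100, 1000, 10000):
    diff = lgamma(n + 1) - log_stirling(n)   # lgamma(n+1) = log n!
    assert 0 < diff < 1 / (10 * n)
```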
EXERCISE 1.3.7. The argument in §1.2.3 is quite robust. Indeed, let $\{X_n : n \ge 0\}$ be any symmetric random walk on $\mathbb{Z}^2$ whose jumps have finite second moment. That is, $X_0 = 0$, $\{X_n - X_{n-1} : n \ge 1\}$ are mutually independent, identically distributed, symmetric ($X_1$ has the same distribution as $-X_1$), $\mathbb{Z}^2$-valued random variables with finite second moment. Show that $\{X_n : n \ge 0\}$ is recurrent in the sense that $\mathbb{P}(\exists\, n \ge 1\;\, X_n = 0) = 1$.

¹⁰ A unit exponential random variable is a random variable $T$ for which $\mathbb{P}(T > t) = e^{-t\vee 0}$.
EXERCISE 1.3.8. Let $\{X_n : n \ge 0\}$ be a random walk on $\mathbb{Z}^d$: $X_0 = 0$, $\{X_n - X_{n-1} : n \ge 1\}$ are mutually independent, identically distributed, $\mathbb{Z}^d$-valued random variables. Further, for each $1 \le i \le d$, let $(X_n)_i$ be the $i$th coordinate of $X_n$. If, for some $C < \infty$ and $(\alpha_1, \dots, \alpha_d) \in [0,\infty)^d$ with $\sum_{1}^{d}\alpha_i > 1$,

$$\mathbb{P}\bigl((X_n)_i = 0\bigr) \le C n^{-\alpha_i}, \quad n \ge 1,$$

show that $\{X_n : n \ge 0\}$ is transient in the sense that $\mathbb{P}(\exists\, n \ge 1\;\, X_n = 0) < 1$.

EXERCISE 1.3.9. Let $\{X_n : n \ge 0\}$ be a random walk on $\mathbb{Z}^d$, as in the preceding. Given $\mathbf{k} \in \mathbb{Z}^d$, set

$$T_{\mathbf k} = \sum_{n=0}^{\infty}\mathbf{1}_{\{\mathbf k\}}(X_n) \quad\text{and}\quad \zeta_{\mathbf k} = \inf\{n \ge 0 : X_n = \mathbf k\}.$$
Show that

(1.3.10)  $\mathbb{E}[T_{\mathbf k}] = \mathbb{P}(\zeta_{\mathbf k} < \infty)\,\mathbb{E}[T_{\mathbf 0}] = \dfrac{\mathbb{P}(\zeta_{\mathbf k} < \infty)}{\mathbb{P}(\rho_0 = \infty)},$

where $\rho_0 = \inf\{n \ge 1 : X_n = 0\}$ is the time of first return to 0. In particular, if $\{X_n : n \ge 0\}$ is transient in the sense described in the preceding exercise, show that

$$\mathbb{E}\Bigl[\sum_{n=0}^{\infty}\mathbf{1}_{B(r)}(X_n)\Bigr] < \infty \quad\text{for all } r \in (0, \infty),$$

where $B(r) = \{\mathbf k : |\mathbf k| \le r\}$; and from this conclude that $|X_n| \to \infty$ with probability 1. On the other hand, if $\{X_n : n \ge 0\}$ is recurrent, show that $X_n = 0$ infinitely often with probability 1. Hence, either $\{X_n : n \ge 0\}$ is recurrent and $X_n = 0$ infinitely often with probability 1, or it is transient and $|X_n| \to \infty$ with probability 1.

EXERCISE 1.3.11. Take $d = 1$ in the preceding, $X_0 = 0$, and $\{X_n - X_{n-1} : n \ge 1\}$ to be mutually independent, identically distributed random variables for which $0 < \mathbb{E}\bigl[|X_1|\bigr] < \infty$ and $\mathbb{E}[X_1] = 0$. By a slight variation on the argument given in §1.2.2, we will show here that this random walk is recurrent but that

$$\varlimsup_{n\to\infty} X_n = \infty \quad\text{and}\quad \varliminf_{n\to\infty} X_n = -\infty \quad\text{with probability } 1.$$
(a) First show that it suffices to prove that $\sup_n X_n = \infty$ and that $\inf_n X_n = -\infty$. Next, use the Weak Law of Large Numbers (cf. Exercise 1.3.2) to show that

$$\lim_{n\to\infty}\max_{1\le m\le n}\frac{\mathbb{E}\bigl[|X_m|\bigr]}{n} = 0.$$
(b) For $n \ge 1$, set $T_k^{(n)} = \sum_{m=0}^{n-1}\mathbf{1}_{\{k\}}(X_m)$, show that $\mathbb{E}\bigl[T_k^{(n)}\bigr] \le \mathbb{E}\bigl[T_0^{(n)}\bigr]$ for all $k \in \mathbb{Z}$, and use this to arrive at

$$\bigl(4\mu(n) + 1\bigr)\,\mathbb{E}\bigl[T_0^{(n)}\bigr] \ge \frac{n}{2}, \quad\text{where } \mu(n) \equiv \max_{0\le m\le n}\mathbb{E}\bigl[|X_m|\bigr].$$

Finally, apply part (a) to conclude that $\mathbb{E}[T_0] = \infty$. Hence, by (1.3.10), $\mathbb{P}(\rho_0 < \infty) = 1$, and so $\{X_n : n \ge 0\}$ is recurrent.

(c) To complete the program, proceed as in the derivation of (1.2.2) to pass from (b) to

$(*)$  $\mathbb{P}\bigl(\rho_0^{(m)} < \infty\bigr) = 1$ for all $m \ge 1$,

where $\rho_0^{(m)}$ is the time of the $m$th return to 0. Next, for $r \in \mathbb{Z}^+$, set $\eta_r = \inf\{n \ge 0 : X_n \ge r\}$, show that $\epsilon \equiv \mathbb{P}(\eta_1 > \rho_0) < 1$, and conclude that $\mathbb{P}\bigl(\eta_1 > \rho_0^{(m)}\bigr) \le \epsilon^m$. Now, combine this with $(*)$ to get $\mathbb{P}(\eta_1 < \infty) = 1$. Finally, argue that

$$\mathbb{P}(\eta_{r+1} < \infty) \ge \mathbb{P}(\eta_r < \infty)\,\mathbb{P}(\eta_1 < \infty),$$

and therefore that $\mathbb{P}(\eta_r < \infty) = 1$ for all $r \ge 1$. Since this means that, with probability 1, $\sup_n X_n \ge r$ for all $r \ge 1$, it follows that $\sup_n X_n = \infty$ with probability 1. To prove that $\inf_n X_n = -\infty$ with probability 1, simply replace $\{X_n : n \ge 0\}$ by $\{-X_n : n \ge 0\}$.

EXERCISE 1.3.12.¹¹ Here is an interesting application of one dimensional random walks to elementary queuing theory. Queuing theory deals with the
distribution of the number of people waiting to be served (i.e., the length of the queue) when, during each time interval, the number of people who arrive and the number of people who are served are random. The queuing model which we will consider here is among the simplest. Namely, we will assume that, during the time interval $[n-1, n)$, the number of people who arrive minus the number who can be served is given by a $\mathbb{Z}$-valued random variable $E_n$. Further, we assume that the $E_n$'s are mutually independent and identically distributed random variables satisfying $0 < \mathbb{E}\bigl[|E_1|\bigr] < \infty$. The associated queue is, apart from the fact that there are never a negative number of people waiting, the random walk $\{X_n : n \ge 0\}$ determined by the $E_n$'s: $X_0 = 0$ and $X_n = \sum_{m=1}^{n} E_m$. To take into account the prohibition against having a queue of negative length, the queuing model $\{Q_n : n \ge 0\}$ is given by the prescription $Q_0 = 0$ and $Q_n = (Q_{n-1} + E_n)^+$ for $n \ge 1$.

(a) Show that, for each $n \ge 0$, the distribution of $Q_n$ is the same as that of $M_n \equiv \max_{0\le m\le n} X_m$.

¹¹ So far as I know, this example was invented by Wm. Feller.
(b) Set $M_\infty = \lim_{n\to\infty} M_n \in \mathbb{N}\cup\{\infty\}$, and, as a consequence of (a), arrive at

$$\lim_{n\to\infty}\mathbb{P}(Q_n = j) = \mathbb{P}(M_\infty = j) \quad\text{for } j \in \mathbb{N}.$$

(c) Set $\mu = \mathbb{E}[E_1]$. The Weak Law of Large Numbers says that, for each $\epsilon > 0$, $\mathbb{P}\bigl(|X_n - n\mu| \ge n\epsilon\bigr) \to 0$ as $n \to \infty$. In particular, when $\mu > 0$, show that $\mathbb{P}(M_\infty = \infty) = 1$. When $\mu = 0$, use Exercise 1.3.11 to reach the same conclusion. Hence, when $\mathbb{E}[E_1] \ge 0$, $\mathbb{P}(Q_n = j) \to 0$ for all $j \in \mathbb{N}$. That is, when the expected number of arrivals is at least as large as the expected number of people served, then, with probability 1, the queue grows infinitely long.

(d) Now assume that $\mu = \mathbb{E}[E_1] < 0$. Then the Strong Law of Large Numbers (cf. Exercise 1.3.4 for the case when $E_1$ has a finite fourth moment and Theorem 1.4.11 in [9] for the general case) says that $\frac{X_n}{n} \to \mu$ with probability 1. In particular, conclude that $M_\infty < \infty$ with probability 1 and therefore that $\sum_{j\in\mathbb{N}}\nu_j = 1$, where $\nu_j \equiv \lim_{n\to\infty}\mathbb{P}(Q_n = j) = \mathbb{P}(M_\infty = j)$.
(e) Specialize to the case when the $E_m$'s are $\{-1,1\}$-valued Bernoulli random variables with $p = \mathbb{P}(E_1 = 1) \in (0,1)$, and set $q = 1 - p$. Use the calculations in (1.1.12) to show that

$$\lim_{n\to\infty}\mathbb{P}(Q_n = j) = \begin{cases} 0 & \text{if } p \ge q \\ \dfrac{q-p}{q}\Bigl(\dfrac{p}{q}\Bigr)^{j} & \text{if } p < q. \end{cases}$$

(f) Generalize (e) to the case when $E_m \in \{-1, 0, 1\}$, $p = \mathbb{P}(E_1 = 1)$, and $q = \mathbb{P}(E_1 = -1)$. The idea is that $M_\infty$ in this case has the same distribution as $\sup_n Y_n$, where $\{Y_n : n \ge 0\}$ is the random walk corresponding to $\{-1,1\}$-valued Bernoulli random variables which are 1 with probability $\frac{p}{p+q}$.
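Part (e) can be illustrated by simulating the queue recursion directly. The sketch below is our own (names and parameters are arbitrary); it uses the long run empirical distribution of a single trajectory as a stand-in for $\lim_n \mathbb{P}(Q_n = j)$, which is justified for this well-behaved chain but not something proved in the text.

```python
import random

def queue_freqs(p, n_steps=50000, seed=3):
    """Simulate Q_n = (Q_{n-1} + E_n)^+ with P(E = 1) = p, P(E = -1) = 1-p,
    and return the empirical frequency of each queue length seen."""
    rng = random.Random(seed)
    q_len, counts = 0, {}
    for _ in range(n_steps):
        step = 1 if rng.random() < p else -1
        q_len = max(q_len + step, 0)    # the queue cannot go negative
        counts[q_len] = counts.get(q_len, 0) + 1
    return {j: c / n_steps for j, c in counts.items()}

p, q = 0.3, 0.7
freq = queue_freqs(p)
for j in range(4):
    limit = (q - p) / q * (p / q)**j    # geometric limit law from (e)
    assert abs(freq.get(j, 0.0) - limit) < 0.02
```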
CHAPTER 2

Doeblin's Theory for Markov Chains
In this chapter we begin in earnest our study of Markov processes. Like the random walks in Chapter 1, the processes with which we will be dealing here take only countably many values and have a discrete (as opposed to continuous) time parameter. In fact, in many ways, these processes are the simplest generalizations of random walks. To be precise, random walks proceed in such a way that the distribution of their increments is independent of everything which has happened before the increment takes place. The processes at which we will be looking now proceed in such a way that the distribution of their increments depends on where they are at the time of the increment but not on where they were in the past. A process with this sort of dependence property is said to have the Markov property and is called a Markov chain.¹ The set $\mathbb{S}$ in which a process takes its values is called its state space, and, as we said, our processes will have state spaces which are either finite or countably infinite. Thus, at least for theoretical purposes, there is no reason for us not to think of $\mathbb{S}$ as the set $\{1, \dots, N\}$ or $\mathbb{Z}^+$, depending on whether $\mathbb{S}$ is finite or countably infinite. On the other hand, always taking $\mathbb{S}$ to be one of these has the disadvantage that it may mask important properties. For example, it would have been a great mistake to describe the nearest neighbor random walk on $\mathbb{Z}^2$ after mapping $\mathbb{Z}^2$ isomorphically onto $\mathbb{Z}^+$.
2.1 Some Generalities

Before getting started, there are a few general facts which we will need to know about Markov chains. A Markov chain on a finite or countably infinite state space $\mathbb{S}$ is a family of $\mathbb{S}$-valued random variables $\{X_n : n \ge 0\}$ with the property that, for all $n \ge 0$ and $(i_0, \dots, i_n, j) \in \mathbb{S}^{n+2}$,

(2.1.1)  $\mathbb{P}\bigl(X_{n+1} = j \mid X_0 = i_0, \dots, X_n = i_n\bigr) = (\mathbf{P})_{i_n j},$

where $\mathbf{P}$ is a matrix all of whose entries are non-negative and each of whose rows sums to 1. Equivalently (cf. §6.4.1),

(2.1.2)  $\mathbb{P}\bigl(X_{n+1} = j \mid X_0, \dots, X_n\bigr) = (\mathbf{P})_{X_n j}.$

¹ The term "chain" is commonly applied to processes with a discrete time parameter.
It should be clear that (2.1.2) is a mathematically precise expression of the idea that, when a Markov chain jumps, the distribution of where it lands depends only on where it was at the time when it jumped and not on where it was in the past.
2.1.1. Existence of Markov Chains: For obvious reasons, a matrix P whose entries are non-negative and each of whose rows sums to 1 is called a transition probability matrix: it gives the probability that the Markov chain will move to the state j at time n + 1 given that it is at state i at time n, independent of where it was prior to time n. Further, it is clear that only a transition probability matrix could appear on the right of (2.1.1). What may not be so immediate is that one can go in the opposite direction. Namely, let μ be a probability vector² and P a transition probability matrix. Then there exists a Markov chain {X_n : n ≥ 0} with initial distribution μ and transition probability matrix P. That is, ℙ(X_0 = i) = (μ)_i and (2.1.1) holds.

To prove the preceding existence statement, one can proceed as follows. Begin by assuming, without loss in generality, that § is either {1, ..., N} or ℤ⁺. Next, given i ∈ §, set β(i, 0) = 0 and β(i, j) = Σ_{k=1}^{j} (P)_{ik} for j ≥ 1, and define F : § × [0, 1) → § so that F(i, u) = j if β(i, j − 1) ≤ u < β(i, j). In addition, set α(0) = 0 and α(i) = Σ_{k=1}^{i} (μ)_k for i ≥ 1, and define f : [0, 1) → § so that f(u) = i if α(i − 1) ≤ u < α(i). Finally, let {U_n : n ≥ 0} be a sequence of mutually independent random variables (cf. Theorem 6.3.2) which are uniformly distributed on [0, 1), and set

(2.1.3)  X_n = f(U_0) if n = 0  and  X_n = F(X_{n−1}, U_n) if n ≥ 1.

We will now show that the sequence {X_n : n ≥ 0} in (2.1.3) is a Markov chain with the required properties. For this purpose, suppose that (i_0, ..., i_n) ∈ §^{n+1}, and observe that

ℙ(X_0 = i_0, ..., X_n = i_n)
  = ℙ(U_0 ∈ [α(i_0 − 1), α(i_0)) & U_m ∈ [β(i_{m−1}, i_m − 1), β(i_{m−1}, i_m)) for 1 ≤ m ≤ n)
  = (μ)_{i_0} (P)_{i_0 i_1} ⋯ (P)_{i_{n−1} i_n}.
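The construction in (2.1.3) is exactly the inverse-distribution-function method used to simulate a chain in practice. A minimal sketch follows; the two-state μ and P are arbitrary choices made for illustration, not data from the text:

```python
import random

# Hypothetical two-state example: S = {0, 1}, initial distribution mu,
# transition probability matrix P (each row sums to 1).
mu = [0.5, 0.5]
P = [[0.9, 0.1],
     [0.4, 0.6]]

def sample_from(weights, u):
    # Return j such that beta(j - 1) <= u < beta(j), where beta is the
    # running cumulative sum of the weights; this plays the role of the
    # maps F(i, .) and f(.) in the text.
    total = 0.0
    for j, w in enumerate(weights):
        total += w
        if u < total:
            return j
    return len(weights) - 1  # guard against floating-point rounding

def run_chain(n_steps, rng):
    # X_0 = f(U_0) and X_n = F(X_{n-1}, U_n), as in (2.1.3).
    x = sample_from(mu, rng.random())
    path = [x]
    for _ in range(n_steps):
        x = sample_from(P[x], rng.random())
        path.append(x)
    return path

rng = random.Random(0)
path = run_chain(10, rng)
```

The independence of the uniforms U_n is what makes the resulting sequence a Markov chain with transition matrix P.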
2.1.2. Transition Probabilities & Probability Vectors: Notice that the use of matrix notation here is clever. To wit, if μ is the row vector with ith entry (μ)_i = ℙ(X_0 = i), then μ is called the initial distribution of the chain and

(2.1.4)  ℙ(X_n = j) = (μP^n)_j  for all n ≥ 0 and j ∈ §,

where we have adopted the convention that P⁰ is the identity matrix and P^n = PP^{n−1} for n ≥ 1.³ To check (2.1.4), let n ≥ 1 be given, and note that, by (2.1.1) and induction,

ℙ(X_0 = i_0, ..., X_{n−1} = i_{n−1}, X_n = j) = (μ)_{i_0} (P)_{i_0 i_1} ⋯ (P)_{i_{n−1} j},

and so (2.1.4) results after one sums with respect to (i_0, ..., i_{n−1}). Obviously, (2.1.4) is the statement that the row vector μP^n is the distribution of the Markov chain at time n if μ is its initial distribution (i.e., its distribution at time 0). Alternatively, P^n is the n-step transition probability matrix: (P^n)_{ij} is the conditional probability that X_{m+n} = j given that X_m = i. Hence, (μP^n)_j = Σ_{i∈§} (μ)_i ℙ(X_n = j | X_0 = i).

² A probability vector is a row vector whose coordinates are non-negative and sum to 1.
For future reference, we will introduce here an appropriate way in which to measure the length of row vectors when they are being used to represent measures. Namely, given a row vector ρ, we set

(2.1.5)  ‖ρ‖_v = Σ_{i∈§} |(ρ)_i|,

where the subscript "v" is used in recognition that this is the notion of length which corresponds to the variation norm on the space of measures. The basic reason for our making this choice of norm is that

(2.1.6)  ‖ρP‖_v ≤ ‖ρ‖_v,

since, by Theorem 6.1.15,

‖ρP‖_v = Σ_{j∈§} | Σ_{i∈§} (ρ)_i (P)_{ij} | ≤ Σ_{i∈§} |(ρ)_i| Σ_{j∈§} (P)_{ij} = ‖ρ‖_v.

Notice that this is a quite different way of measuring the length from the way Euclid would have: he would have used

(2.1.7)  ‖ρ‖_2 = ( Σ_{i∈§} (ρ)_i² )^{1/2}.

On the other hand, at least when § is finite, these two norms are comparable. Namely,

‖ρ‖_2 ≤ ‖ρ‖_v ≤ √(#§) ‖ρ‖_2,

where #§ denotes the cardinality of §. The first inequality is easily seen by squaring both sides, and the second is an application of Schwarz's inequality (cf. Exercise 1.3.1). Moreover, ‖·‖_v is a

³ The reader should check for themselves that P^n is again a transition probability matrix for all n ∈ ℕ: all entries are non-negative and each row sums to 1.
norm (i.e., measure of length) in the sense that ‖ρ‖_v = 0 if and only if ρ = 0 and that it satisfies the triangle inequality: ‖ρ + ρ′‖_v ≤ ‖ρ‖_v + ‖ρ′‖_v. Finally, Cauchy's convergence criterion holds for ‖·‖_v. That is, if {ρ_n}₁^∞ is a sequence in ℝ^§, then there exists a ρ ∈ ℝ^§ for which ‖ρ_n − ρ‖_v → 0 if and only if {ρ_n}₁^∞ is Cauchy convergent:

lim_{m→∞} sup_{n>m} ‖ρ_n − ρ_m‖_v = 0.

As usual, the "only if" direction is an easy application of the triangle inequality: ‖ρ_n − ρ_m‖_v ≤ ‖ρ_n − ρ‖_v + ‖ρ − ρ_m‖_v. To go the other direction, suppose that {ρ_n}₁^∞ is Cauchy convergent, and observe that each coordinate of {ρ_n}₁^∞ must be Cauchy convergent as a sequence of real numbers. Hence, by Cauchy's criterion for real numbers, there exists a ρ to which {ρ_n}₁^∞ converges in the sense that each coordinate of the ρ_n's tends to the corresponding coordinate of ρ. Thus, by Fatou's Lemma, Theorem 6.1.10,

‖ρ − ρ_m‖_v = Σ_{i∈§} |(ρ)_i − (ρ_m)_i| ≤ lim inf_{n→∞} Σ_{i∈§} |(ρ_n)_i − (ρ_m)_i| → 0  as m → ∞.
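In code, the variation norm and the contraction property (2.1.6) look as follows; the matrix and row vector below are arbitrary illustrations, not taken from the text:

```python
# Sketch: the variation norm ||rho||_v and the inequality ||rho P||_v <= ||rho||_v,
# checked on a small, arbitrarily chosen transition matrix.
P = [[0.2, 0.8],
     [0.7, 0.3]]

def var_norm(rho):
    # ||rho||_v = sum over i of |(rho)_i|, as in (2.1.5)
    return sum(abs(x) for x in rho)

def row_times_matrix(rho, P):
    # (rho P)_j = sum over i of (rho)_i (P)_{ij}
    n = len(P)
    return [sum(rho[i] * P[i][j] for i in range(n)) for j in range(n)]

rho = [0.6, -0.6]          # a row vector whose coordinates sum to 0
rho_P = row_times_matrix(rho, P)
```

Row vectors with coordinates summing to 0 are exactly the ones that appear later in (2.2.2), which is why this example uses one.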
2.1.3. Transition Probabilities and Functions: As we saw in §2.1.2, the representation of the transition probability as a matrix and the initial distribution as a row vector facilitates the representation of the distribution at later times. In order to understand how to get the analogous benefit when computing expected values of functions, think of a function f on the state space § as the column vector f whose jth coordinate is the value of the function f at j. Clearly, if μ is the row vector which represents the probability measure μ on {1, ..., N} and f is the column vector which represents a function f which is either non-negative or bounded, then μf = Σ_{i∈§} f(i)μ({i}) is the expected value of f with respect to μ. Similarly, the column vector P^n f represents that function whose value at i is the conditional expectation value of f(X_n) given that X_0 = i. Indeed,

𝔼[f(X_n) | X_0 = i] = Σ_{j∈§} f(j) ℙ(X_n = j | X_0 = i) = Σ_{j∈§} (P^n)_{ij} (f)_j = (P^n f)_i.

More generally, if f is either a non-negative or bounded function on § and f is the column vector which it determines, then, for 0 ≤ m ≤ n,

(2.1.8)  𝔼[f(X_n) | X_0 = i_0, ..., X_m = i_m] = (P^{n−m} f)_{i_m},  or, equivalently,  𝔼[f(X_n) | X_0, ..., X_m] = (P^{n−m} f)_{X_m},
since

𝔼[f(X_n) | X_0 = i_0, ..., X_m = i_m] = Σ_{j∈§} f(j) ℙ(X_n = j | X_0 = i_0, ..., X_m = i_m) = Σ_{j∈§} f(j)(P^{n−m})_{i_m j} = (P^{n−m} f)_{i_m}.

In particular, if μ is the initial distribution of {X_n : n ≥ 0}, then

(2.1.9)  𝔼[f(X_n)] = μP^n f,

since 𝔼[f(X_n)] = Σ_i (μ)_i 𝔼[f(X_n) | X_0 = i].

Notice that, just as ‖·‖_v was the appropriate way to measure the length of row vectors when we were using them to represent measures, the appropriate way to measure the length of column vectors which represent functions is with the uniform norm ‖·‖_u:

(2.1.10)  ‖f‖_u = sup_{j∈§} |(f)_j|.

The reason why ‖·‖_u is the norm of choice here is that |μf| ≤ ‖μ‖_v ‖f‖_u, since

|μf| ≤ Σ_{i∈§} |(μ)_i| |(f)_i| ≤ ‖μ‖_v ‖f‖_u.

In particular, we have the complement to (2.1.6):

(2.1.11)  ‖Pf‖_u ≤ ‖f‖_u.

2.1.4. The Markov Property: By definition, if μ is the initial distribution of {X_n : n ≥ 0}, then

(2.1.12)  ℙ(X_0 = i_0, ..., X_n = i_n) = (μ)_{i_0} (P)_{i_0 i_1} ⋯ (P)_{i_{n−1} i_n}.

Hence, if m, n ≥ 1 and F : §^{n+1} → ℝ is either bounded or non-negative, then

𝔼[F(X_m, ..., X_{m+n}), X_0 = i_0, ..., X_m = i_m]
  = Σ_{(j_1,...,j_n)∈§^n} F(i_m, j_1, ..., j_n) (μ)_{i_0} (P)_{i_0 i_1} ⋯ (P)_{i_{m−1} i_m} (P)_{i_m j_1} ⋯ (P)_{j_{n−1} j_n}
  = 𝔼[F(X_0, ..., X_n) | X_0 = i_m] ℙ(X_0 = i_0, ..., X_m = i_m).

Equivalently, we have now proved the Markov property in the form

(2.1.13)  𝔼[F(X_m, ..., X_{m+n}) | X_0 = i_0, ..., X_m = i_m] = 𝔼[F(X_0, ..., X_n) | X_0 = i_m].

2.2 Doeblin's Theory
In this section we will introduce an elementary but basic technique, due to Doeblin, which will allow us to study the long time distribution of a Markov chain, particularly ones on a finite state space.
2.2.1. Doeblin's Basic Theorem: For many purposes, what one wants to know about a Markov chain is its distribution after a long time, and, at least when the state space is finite, it is reasonable to think that the distribution of the chain will stabilize. To be more precise, if one is dealing with a chain which can go in a single step from some state i to any state j with positive probability, then, because there are only a finite number of states, a pigeonhole argument shows that this state is going to be visited again and again and that, after a while, the chain's initial distribution is going to get "forgotten." In other words, we are predicting for such a chain that μP^n will, for sufficiently large n, be nearly independent of μ. In particular, this would mean that μP^n = (μP^{n−m})P^m is very nearly equal to μP^m when m is large and therefore, by Cauchy's convergence criterion, that π = lim_{n→∞} μP^n exists. In addition, if this were the case, then we would have that π = lim_{n→∞} μP^{n+1} = lim_{n→∞} (μP^n)P = πP. That is, π would have to be a left eigenvector for P with eigenvalue 1. A probability vector π is, for obvious reasons, called a stationary distribution for the transition probability matrix P if π = πP.

Although we were thinking about finite state spaces in the preceding discussion, there are situations in which these musings apply even to infinite state spaces. Namely, if, no matter where the chain starts, it has a positive probability of visiting some fixed state, then, as the following theorem shows, it will stabilize.
2.2.1 DOEBLIN'S THEOREM. Let P be a transition probability matrix with the property that, for some state j_0 ∈ § and ε > 0, (P)_{i j_0} ≥ ε for all i ∈ §. Then P has a unique stationary probability vector π, (π)_{j_0} ≥ ε, and, for all initial distributions μ,

‖μP^n − π‖_v ≤ (1 − ε)^n ‖μ − π‖_v  for all n ≥ 0.
PROOF: The key to the proof lies in the observations that if ρ ∈ ℝ^§ is a row vector with ‖ρ‖_v < ∞, then

(2.2.2)  Σ_{j∈§} (ρP)_j = Σ_{i∈§} (ρ)_i  and  Σ_{i∈§} (ρ)_i = 0 ⟹ ‖ρP^n‖_v ≤ (1 − ε)^n ‖ρ‖_v for n ≥ 1.

The first of these is trivial, because, by Theorem 6.1.15,

Σ_{j∈§} (ρP)_j = Σ_{j∈§} Σ_{i∈§} (ρ)_i (P)_{ij} = Σ_{i∈§} (ρ)_i Σ_{j∈§} (P)_{ij} = Σ_{i∈§} (ρ)_i.

As for the second, we note that, by an easy induction argument, it suffices to check it when n = 1. Next, suppose that Σ_i (ρ)_i = 0, and observe that

|(ρP)_j| = | Σ_{i∈§} (ρ)_i (P)_{ij} | = | Σ_{i∈§} (ρ)_i ((P)_{ij} − εδ_{j_0,j}) | ≤ Σ_{i∈§} |(ρ)_i| ((P)_{ij} − εδ_{j_0,j}),

and therefore that

‖ρP‖_v ≤ Σ_{j∈§} ( Σ_{i∈§} |(ρ)_i| ((P)_{ij} − εδ_{j_0,j}) ) = Σ_{i∈§} |(ρ)_i| ( Σ_{j∈§} ((P)_{ij} − εδ_{j_0,j}) ) = (1 − ε)‖ρ‖_v.

Now let μ be a probability vector, and set μ_n = μP^n. Then, because μ_n = μ_{n−m}P^m and Σ_i ((μ_{n−m})_i − (μ)_i) = 1 − 1 = 0,

‖μ_n − μ_m‖_v = ‖(μ_{n−m} − μ)P^m‖_v ≤ (1 − ε)^m ‖μ_{n−m} − μ‖_v ≤ 2(1 − ε)^m

for 1 ≤ m < n. Hence, {μ_n}₁^∞ is Cauchy convergent, and therefore there exists a π for which ‖μ_n − π‖_v → 0. Since each μ_n is a probability vector, it is clear that π must also be a probability vector. In addition, π = lim_{n→∞} μP^{n+1} = (lim_{n→∞} μP^n)P = πP, and so π is stationary. In particular, (π)_{j_0} = Σ_{i∈§} (π)_i (P)_{i j_0} ≥ ε.

Finally, if ν is any probability vector, then

‖νP^n − π‖_v = ‖(ν − π)P^n‖_v ≤ (1 − ε)^n ‖ν − π‖_v,
which, of course, proves both the stated convergence result and the uniqueness of π as the only stationary probability vector for P. □

It is instructive to understand what Doeblin's Theorem says in the language of spectral theory. Namely, as an operator on the space of bounded functions (a.k.a. column vectors with finite uniform norm), P has the function 1 as a right eigenfunction with eigenvalue 1: P1 = 1. Thus, at least if § is finite, general principles say that there should exist a row vector which is a left eigenvector of P with eigenvalue 1. Moreover, because 1 and the entries of P are real, this left eigenvector can be taken to have real components. Thus, from the spectral point of view, it is no surprise that there is a non-zero row vector μ ∈ ℝ^§ with the property that μP = μ. On the other hand, standard spectral theory would not predict that μ can be chosen to have non-negative components, and this is the first place where Doeblin's Theorem gives information which is not readily available from spectral theory, even when § is finite.

To interpret the estimate in Doeblin's Theorem, let M₁(§; ℂ) denote the space of row vectors ν ∈ ℂ^§ with ‖ν‖_v = 1. Then

‖νP‖_v ≤ 1 for all ν ∈ M₁(§; ℂ),  and so  sup{ |α| : α ∈ ℂ & νP = αν for some ν ∈ M₁(§; ℂ) } ≤ 1.

Moreover, if νP = αν for some α ≠ 1, then ν1 = ν(P1) = (νP)1 = α ν1, and therefore ν1 = 0. Thus, the estimate in (2.2.2) says that all eigenvalues of P which are different from 1 have absolute value dominated by 1 − ε. That is, the entire spectrum of P lies in the complex unit disk, 1 is a simple eigenvalue, and all the other eigenvalues lie in the disk of radius 1 − ε. Finally, although general spectral theory fails to predict Doeblin's Theorem, it should be said that there is a spectral theory, the one initiated by Frobenius and developed further by Kakutani, which does cover Doeblin's results. The interested reader should consult Chapter VIII in [2].
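A numerical sketch of Theorem 2.2.1 may make the geometric rate concrete. The 3×3 matrix below is an assumed example (its 0th column has minimum 0.2, so the hypothesis holds with j_0 = 0 and ε = 0.2), and the stationary vector is approximated by iteration rather than solved exactly:

```python
# Doeblin's Theorem, numerically: ||mu P^n - pi||_v <= (1 - eps)^n ||mu - pi||_v.
P = [[0.5, 0.3, 0.2],
     [0.2, 0.6, 0.2],
     [0.3, 0.3, 0.4]]
EPS = 0.2  # min over i of (P)_{i,0}

def step(mu, P):
    n = len(P)
    return [sum(mu[i] * P[i][j] for i in range(n)) for j in range(n)]

def var_norm(rho):
    return sum(abs(x) for x in rho)

# Approximate the stationary vector by iterating long enough.
pi = [1.0, 0.0, 0.0]
for _ in range(200):
    pi = step(pi, P)

# Check the geometric bound along one trajectory of distributions.
mu = [0.0, 0.0, 1.0]
dist0 = var_norm([m - p for m, p in zip(mu, pi)])
ok = True
for n in range(1, 30):
    mu = step(mu, P)
    gap = var_norm([m - p for m, p in zip(mu, pi)])
    ok = ok and gap <= (1 - EPS) ** n * dist0 + 1e-12
```

The observed convergence is usually faster than the bound, since 1 − ε only dominates the second-largest eigenvalue modulus.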
2.2.2. A Couple of Extensions: An essentially trivial extension of Theorem 2.2.1 is provided by the observation that, for any M ≥ 1 and ε > 0,⁴

(2.2.3)  min_{i∈§} (P^M)_{i j_0} ≥ ε ⟹ ‖μP^n − π‖_v ≤ 2(1 − ε)^{[n/M]}

for all probability vectors μ and a unique stationary probability vector π. To see this, let π be the stationary probability vector for P^M, the one guaranteed by Theorem 2.2.1, and note that, for any probability vector μ, any m ∈ ℕ, and any 0 ≤ r < M,

‖μP^{mM+r} − π‖_v = ‖(μP^r)(P^M)^m − π‖_v ≤ (1 − ε)^m ‖μP^r − π‖_v ≤ 2(1 − ε)^m.

Thus (2.2.3) has been proved, and the argument needed to show that π is the one and only stationary measure for P is the same as the one given in the proof of Theorem 2.2.1.

The next extension is a little less trivial. In order to appreciate the point which it is addressing, one should keep in mind the following example. Namely, consider the transition probability matrix

P = [ 0  1 ]
    [ 1  0 ]

on {1, 2}. Obviously, this two state chain goes in a single step from one state to the other. Thus, it certainly visits all its states. On the other hand, it does not satisfy

⁴ Here and elsewhere, we use [s] to denote the integer part of s ∈ ℝ. That is, [s] is the largest integer dominated by s.
the hypothesis in (2.2.3): (P^n)_{ij} = 0 if either i = j and n is odd or i ≠ j and n is even. Thus, it should not be surprising that the conclusion in (2.2.3) fails to hold for this P. Indeed, it is easy to check that although (½, ½) is the one and only stationary probability vector for P,

‖(1, 0)P^n − (½, ½)‖_v = 1  for all n ≥ 0.

As we will see later (cf. §3.1.3), the problems encountered here stem from the fact that (P^n)_{11} > 0 only if n is even.

In spite of the problems raised by the preceding example, one should expect
(2. 2.4) Although the matrix A n is again a transition probability matrix, it is not describing transitions but instead it is giving the average amount of time that the chain will visit states. To be precise, because
n- 1 {j}(X ) Xo i , n- 1 IP'(X j I Xo i) [ � 1; (An )ij = � �0 l mI l m (An )iJ is the expected value of the average time spent at state j during the time interval [0, - 1] given that i was the state from which the chain started. =
=
=
lE
=
n
Experience teaches us that data becomes much more forgiving when it is averaged, and the present situation is no exception. Indeed, continuing with the example given above, observe that, for any probability vector μ,

‖μA_n − (½, ½)‖_v ≤ 1/n  for all n ≥ 1.
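A sketch of this averaging effect for the two-state flip chain, exploiting the fact that P^n is either the identity (n even) or the swap (n odd):

```python
# The flip chain: mu P^n oscillates and never converges, but the Cesaro
# averages A_n do, with ||mu A_n - (1/2, 1/2)||_v <= 1/n.
def mu_P_n(mu, n):
    # P swaps the two coordinates, so P^n is identity for even n, swap for odd n.
    return mu if n % 2 == 0 else [mu[1], mu[0]]

def mu_A_n(mu, n):
    # mu A_n = (1/n) * sum over m < n of mu P^m
    acc = [0.0, 0.0]
    for m in range(n):
        v = mu_P_n(mu, m)
        acc[0] += v[0]
        acc[1] += v[1]
    return [acc[0] / n, acc[1] / n]

mu = [1.0, 0.0]
bounds_hold = all(
    abs(mu_A_n(mu, n)[0] - 0.5) + abs(mu_A_n(mu, n)[1] - 0.5) <= 1.0 / n + 1e-12
    for n in range(1, 50)
)
```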
What follows is a statement which shows that this sort of conclusion is quite general.

2.2.5 THEOREM. Suppose that P is a transition probability matrix on §. If, for some M ∈ ℤ⁺, j_0 ∈ §, and ε > 0, (A_M)_{i j_0} ≥ ε for all i ∈ §, then there is precisely one stationary probability vector π for P, (π)_{j_0} ≥ ε, and

‖μA_n − π‖_v ≤ (M − 1)/(nε)  for any probability vector μ.

PROOF: To get started, let π be the unique stationary probability vector which Theorem 2.2.1 guarantees for A_M. Then, because any μ which is stationary for P is certainly stationary for A_M, it is clear that π is the only candidate for P-stationarity. Moreover, to see that π is P-stationary, observe that, because P commutes with A_M, (πP)A_M = (πA_M)P = πP. Hence, πP is stationary for A_M and therefore, by uniqueness, must be equal to π. That is, π = πP.
In order to prove the asserted convergence result, we will need an elementary property of averaging procedures. Namely, for any probability vector μ,

(2.2.6)  ‖μA_nA_M − μA_n‖_v ≤ (M − 1)/n.

To check this, first note that, by the triangle inequality,

‖μA_nA_M − μA_n‖_v ≤ (1/M) Σ_{k=0}^{M−1} ‖μP^kA_n − μA_n‖_v.

Second, for each k ≥ 0,

μP^kA_n − μA_n = (1/n) ( Σ_{m=n}^{n+k−1} μP^m − Σ_{m=0}^{k−1} μP^m ),

and so ‖μP^kA_n − μA_n‖_v ≤ 2k/n. Hence, after combining this with the first observation, we are led to

‖μA_nA_M − μA_n‖_v ≤ (1/M) Σ_{k=0}^{M−1} (2k/n) = (M − 1)/n,

which is what we wanted.

To complete the proof of Theorem 2.2.5 from here, assume that (A_M)_{i j_0} ≥ ε for all i, and, as above, let π be the unique stationary probability vector for P. Then, π is also the unique stationary probability vector for A_M, and so, by the estimate in the second line of (2.2.2) applied to A_M,

‖(μA_n − π)A_M‖_v ≤ (1 − ε)‖μA_n − π‖_v,

which, in conjunction with (2.2.6), leads to

‖μA_n − π‖_v ≤ ‖μA_n − μA_nA_M‖_v + ‖μA_nA_M − π‖_v ≤ (M − 1)/n + (1 − ε)‖μA_n − π‖_v.
Finally, after elementary rearrangement, this gives the required result. □

2.3 Elements of Ergodic Theory

In the preceding section we saw that, under suitable conditions, either μP^n or μA_n converges and that the limit is the unique stationary probability vector π for P. In the present section, we will provide a more probabilistically oriented interpretation of these results. In particular, we will give a probabilistic
interpretation of π. This will be done again, by entirely different methods, in Chapter 3.

Before going further, it will be useful to have summarized our earlier results in the form (cf. (2.2.3), and remember that |μf| ≤ ‖μ‖_v‖f‖_u)⁵

(2.3.1)  sup_j inf_i (P^M)_{ij} ≥ ε ⟹ ‖P^n f − πf‖_u ≤ 2(1 − ε)^{[n/M]} ‖f‖_u

and (cf. Theorem 2.2.5)

(2.3.2)  sup_j inf_i (A_M)_{ij} ≥ ε ⟹ ‖A_n f − πf‖_u ≤ (M − 1)/(nε) ‖f‖_u

when f is a bounded column vector.

2.3.1. The Mean Ergodic Theorem: Let {X_n : n ≥ 0} be a Markov chain with transition probability P. Obviously,
(2.3.3)  T̄_j^{(n)} ≡ (1/n) Σ_{m=0}^{n−1} 1_{{j}}(X_m)

is the average amount of time that the chain spends at j before time n. Thus, if μ is the initial distribution of the chain (i.e., (μ)_i = ℙ(X_0 = i)), then (μA_n)_j = 𝔼[T̄_j^{(n)}], and so, when it applies, Theorem 2.2.5 implies that 𝔼[T̄_j^{(n)}] → (π)_j as n → ∞. Here we will be proving that the random variables T̄_j^{(n)} themselves, not just their expected values, tend to (π)_j as n → ∞. Such results come under the heading of ergodic theory. Ergodic theory is the mathematics of the principle, first enunciated by the physicist J.W. Gibbs in connection with the kinetic theory of gases, which asserts that the time-average over a particular trajectory of a random dynamical system will approximate the equilibrium state of that system. Unfortunately, in spite of results, like those given here, confirming this principle, even now, nearly 150 years after Gibbs, there are essentially no physically realistic situations in which Gibbs's principle has been mathematically confirmed.

2.3.4 MEAN ERGODIC THEOREM. Under the hypotheses in Theorem 2.2.5,

sup_{j∈§} 𝔼[ (T̄_j^{(n)} − (π)_j)² ] ≤ 2(M − 1)/(nε)  for all n ≥ 1.
(See (2.3.10) below for a more refined, less quantitative version.) More generally, for any bounded function f on § and all n ≥ 1:

𝔼[ ( (1/n) Σ_{m=0}^{n−1} f(X_m) − πf )² ] ≤ 8(M − 1)/(nε) ‖f‖_u²,
where f denotes the column vector determined by f.

⁵ Here, and elsewhere, we abuse notation by using a constant to stand for the associated constant function.
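A quick simulation sketch of what the Mean Ergodic Theorem asserts: the empirical occupation fraction of each state stabilizes near π. The two-state matrix below is an arbitrary example whose stationary vector one can compute by hand to be π = (2/3, 1/3):

```python
import random

# Empirical occupation fractions for a two-state chain with pi = (2/3, 1/3).
P = [[0.9, 0.1],
     [0.2, 0.8]]
PI = [2.0 / 3.0, 1.0 / 3.0]   # solves pi = pi P

rng = random.Random(42)
x = 0
counts = [0, 0]
N = 200_000
for _ in range(N):
    counts[x] += 1
    # one transition step: go to state 0 with probability P[x][0]
    x = 0 if rng.random() < P[x][0] else 1

occupation = [c / N for c in counts]   # the random variables T-bar_j^(N)
```

The theorem quantifies how fast the mean-square deviation of these fractions from π decays in N.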
PROOF: Let f̄ be the column vector determined by the function f̄ = f − πf. Obviously,

(1/n) Σ_{m=0}^{n−1} f(X_m) − πf = (1/n) Σ_{m=0}^{n−1} f̄(X_m),

and so

( (1/n) Σ_{m=0}^{n−1} f(X_m) − πf )² = (1/n²) Σ_{k,ℓ=0}^{n−1} f̄(X_k) f̄(X_ℓ) ≤ (2/n²) Σ_{0≤k≤ℓ<n} f̄(X_k) f̄(X_ℓ).

Hence,

𝔼[ ( (1/n) Σ_{m=0}^{n−1} f(X_m) − πf )² ] ≤ (2/n²) Σ_{k=0}^{n−1} 𝔼[ f̄(X_k) Σ_{ℓ=k}^{n−1} f̄(X_ℓ) ]
  = (2/n²) Σ_{k=0}^{n−1} 𝔼[ f̄(X_k) Σ_{m=0}^{n−1−k} (P^m f̄)(X_k) ]
  = (2/n²) Σ_{k=0}^{n−1} (n − k) 𝔼[ f̄(X_k)(A_{n−k} f̄)(X_k) ].

But, by (2.3.2), ‖A_{n−k} f̄‖_u ≤ (M − 1)/((n − k)ε) ‖f̄‖_u, and so, since ‖f̄‖_u ≤ 2‖f‖_u,

(n − k) 𝔼[ f̄(X_k)(A_{n−k} f̄)(X_k) ] ≤ (M − 1)/ε ‖f̄‖_u².

After plugging this into the preceding, we get the second result. To get the first, simply take f = 1_{{j}} and observe that, in this case, ‖f̄‖_u ≤ 1. □

2.3.2. Return Times: As the contents of §§1.1 and 1.2 already indicate,
return times ought to play an important role in the analysis of the long time behavior of Markov chains. In particular, if ρ_j^{(0)} ≡ 0 and, for m ≥ 1, the time of mth return to j is defined so that ρ_j^{(m)} = ∞ if ρ_j^{(m−1)} = ∞ or X_n ≠ j for every n > ρ_j^{(m−1)}, and ρ_j^{(m)} = inf{n > ρ_j^{(m−1)} : X_n = j} otherwise, then we say that j is recurrent or transient depending on whether ℙ(ρ_j^{(1)} < ∞ | X_0 = j) = 1 or not; and we can hope that when j is recurrent, then the history of the chain breaks into epochs which are punctuated by the successive returns to j. In this subsection we will provide evidence which bolsters that hope.
Notice that ρ_j ≡ ρ_j^{(1)} ≥ 1 and, for n ≥ 1,
(2.3.5)  1_{(n,∞]}(ρ_j) = F_{n,j}(X_0, ..., X_n),  where F_{n,j}(z_0, ..., z_n) = 1 if z_m ≠ j for 1 ≤ m ≤ n, and 0 otherwise.

In particular, this shows that the event {ρ_j > n} is a measurable function of (X_0, ..., X_n). More generally, because

1_{(n,∞]}(ρ_j^{(m+1)}) = 1_{[n,∞]}(ρ_j^{(m)}) + Σ_{ℓ=1}^{n−1} 1_{{ℓ}}(ρ_j^{(m)}) F_{n−ℓ,j}(X_ℓ, ..., X_n),

an easy inductive argument shows that, for each m and n, {ρ_j^{(m)} > n} is a measurable function of (X_0, ..., X_n).
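Return times can also be explored numerically. The chain below is an assumed two-state example (not from the text) with stationary vector π = (2/3, 1/3), so the empirical mean of ρ_0 should come out near 1/(π)_0 = 1.5, as the identification of π in (2.3.9) below asserts:

```python
import random

# Estimate the mean return time to state 0 for a toy two-state chain.
P = [[0.9, 0.1],
     [0.2, 0.8]]

rng = random.Random(7)

def next_state(x):
    return 0 if rng.random() < P[x][0] else 1

returns = []
for _ in range(20_000):
    x = next_state(0)          # start at 0 and take the first step
    steps = 1
    while x != 0:              # rho_0 = first time the chain is back at 0
        x = next_state(x)
        steps += 1
    returns.append(steps)

mean_return = sum(returns) / len(returns)   # should be close to 1.5
```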
=
.
::; m ::;
>
+
mE
For all m E
and
>
E
E
=
In particular, if
is recurrent, then
for all m E In fact, if is recurrent, then, conditional on : m ::=: is a sequence of mutually independent random variables each of which has the same distribution as
PROOF: To prove the first statement, we apply (2. 1 .13) and the Monotone Convergence Theorem, Theorem 6.1.9, to justify
IP'(p)m) < oo I Xo = i) = nL= 1 IP'(p)m- 1) = n P)m) < oo I Xo i) = L tl�oo E [1 - FN,j (Xn , . . . , Xn+N ), p)m - 1) = n I Xo = i] n=001 . . . ,XN) E[ l FN, j (Xo, I Xo = j]IP'(p)m - 1 ) n I Xo = i) L N--lim + oo n=1 limoo IP'(pj N I Xo = j )IP'(p( m - 1 ) = n I Xo = i) =" � N---+ n= 1 IP'(pj < oo I Xo = j)IP'(p)m- 1) < oo I Xo i). Turning to the second statement, note that it suffices for us prove that IP'(Pj(m+ 1) n nm I Xo = . Pj( 1) = n1 , . . . , Pj(m) = nm) = IP'(pj n I Xo = j ) . 00
&
=
00
=
=
00
::;
J
=
=
>
>
+
J,
2 M ARKOV CHAINS
36
But, again by ( 2.1.13 ) , the expression on the left is equal to
IE [Fn,J (Xn= ' . . . ,Xn= +n ) I Xo - ( 1) - n1 , . . . ,pj(m) - nmJ = IE[Fn,j (Xo, . . . ,Xn ) I Xo = jJ = lP'(pj > n I Xo = j). _
J , Pj .
_
_
D
Reasoning as we did in §1 .2.2, we can derive from the first part of Theorem 2.3.6: < . i)
IE[T I X0 = �·] = + lP'(lP'(pj. - oooo!I XXoo -= ') 2.3. 7) IE[Tj I Xo j] = oo lP' (TJ = oo I Xo j) = 1 IE[Tj I X0 = jJ oo lP' (Tj oo I Xo j) 1, where Tj = :2..: :;=;: 0 l{j } (Xm ) i s the total time the chain spends in the state j. Indeed, because m lP' (T > m I X0 - �.) - { lP'lP'(PJ( j(m)+l) ooooI XoI X=o j-) �') ifif ii =I jj, 5t,)
J
(
=
J
P1 -
=
{===}
<
<
{===}
=
=
<
J
<
P
all three parts of ( 2.3. 7) follow immediately from the first part of Theorem 2.3.6. Of course, from ( 2.3. 7) we know that is recurrent if and only if j] In particular, under the conditions in Theorem 2.2.5, we know that 0, and so
j IE [Tj I Xo = oo. (An )j0j0 (n)j0 > IE[Tj0 I Xo Jo] m""=O (Pm )j0j0 = n---+limoo n(An)j0j0 = That is, the conditions in Theorem 2.2.5 imply that jo is recurrent, and as we are about to demonstrate, we can say much more. To facilitate the statement of the next result, we will say that j is accessible from i and will write i--+j if (P n ) ij > 0 for some n 2: 0. Equivalently, i--+j if and only if i = j or lP' ( pj oo! Xo i) > 0. 2 . 3 . 8 THEOREM. infi ( AM )ijo 2: 2: 1, Jo , 0. j0--+j. j0--+j, j > (0, oo). IE [p� I Xo = j] oo PROOF: First suppose that j0-fj. Equivalently, lP' ( pj = oo! Xo = j0) = 1 . At the same time, because (AM ) JJo 2: there exists an 1 m such that (Pm )Jjo > 0, and so lP'(pJm) = oo I Xo = j) 2: lP'(PJm) = oo Xm = Jo I Xo = j) lim IE[FN,j (Xm , . . . ,Xm+ N ), Xm = jo i Xo = j ] = N-+oo = lim lE[FN,j (Xo, . . . ,XN) I Xo = Jo]lP'(Xm = Jo i Xo =j ) N-+oo lP'(pj = oo I Xo = Jo) (Pm)jjo > 0. =
-----+
00
=
=
<
E
Then
<
00 .
L__,;
=
Assume that is recurrent if and only for all p E
E for some M
Moreover,
if
E,
::;
&
=
if
<
M
and then
2.3 Elements of Ergodic Theory
37
Hence, by Theorem 2.3.6, j cannot be recurrent. We next show that
)ij > 0 for some M' 2: 1. To this end, choose m E N so that (P m )joj > 0. Then, for all i §, M 1 �1 m )j0j (P (P 2: (P ) )i j i j (Am+M)ij = m + M �-1 0 m + M 0 0 = = £ £ M M m (AM)ijo(P )joJ 2: m +� (Pm )Joj > 0. m+M In view of * and what we have already shown, it suffices to show that if infi ( AM)ij E for some E > 0 and M E z+ . For this JE[p� JX0 = j] purpose, set u (n, i) IP'(p1 > nMJ X0 i) for n z + and i §. Then, by (2. 1 . 13), u (n + 1 , i) = 2:: IP'(pj > (n + 1)M & XnM = k I Xo = i) kE§ = I: JE [FM, j (XnM , . . . ,X( n +1 ) M ), Pi > nM & XnM = k I Xo = i] kE§ = 2: 1P'(p1 > M I Xo = k)IP'(p1 > nM XnM = k I Xo = i) kE§ = 2:: u 1 , k)IP'(PJ > n M & XnM = k I Xo = i) . kE§ Hence, u ( n + 1, i) :::; Uu (n, i) where U maxkE§ u(1, k). Finally, since u ( 1 , k) = - IP'(pj :::; M JXo = k) and IP'(PJ :::; M I Xo = k) O:Smax m<M (Pm ) kj 2: (AM) kj E, U :::; 1 - c. In particular, this means that u ( n + 1 , j) :::; (1 - c)u ( n, j), and therefore that IP'(p1 > nMJ X0 = j) :::; ( 1 - c) n , from which lE [p� I Xo = J ] = 2:: nPJP'(p1 = nJ X0 = j) n= 1mM :::; 2:: ( mM) P m= 1 n= (m- l)M+1 IP'(pj = n I Xo = j) :::; MP 2:: mPJP'(p1 > (m - 1)MJ Xo = j) m= 1 :::; MP 2:: mP ( 1 - E) m - 1 m= 1 o
(*)
J
-->
j
==?
inf(AM' "
E
1
C
L.....-
( )
L.....-
C
2:
< oo
E
=
=
E
&
(
=
1
2:
2:
00
00
00
00
< oo
follows immediately. □
2.3.3. Identification of π: Under the conditions in Theorem 2.2.5, we know that there is precisely one P-stationary probability vector π. In this section, we will give a probabilistic interpretation of (π)_j. Namely, we will show that
sup_{j∈§} inf_{i∈§} (A_M)_{ij} > 0 for some M ≥ 1 ⟹

(2.3.9)  (π)_j = 1 / 𝔼[ρ_j | X_0 = j]  (= 0 if j is transient).

The idea for the proof of (2.3.9) is that, on the one hand (cf. (2.3.3)),

𝔼[T̄_j^{(n)}] → (π)_j  as n → ∞,
while, on the other hand,

Σ_{ℓ=0}^{ρ_j^{(m)} − 1} 1_{{j}}(X_ℓ) = m.

Thus, since ρ_j^{(m)} is the sum of m mutually independent copies of ρ_j, the preceding combined with the Weak Law of Large Numbers should lead to (2.3.9).
To carry out the program suggested above, we will actually prove a stronger result. Namely, we will show that, for each j ∈ §,⁶

(2.3.10)  ℙ( lim_{n→∞} T̄_j^{(n)} = 1/𝔼[ρ_j | X_0 = j] | X_0 = j ) = 1.

In particular, because 0 ≤ T̄_j^{(n)} ≤ 1, Lebesgue's Dominated Convergence Theorem, Theorem 6.1.11, says that (2.3.9) follows from (2.3.10). Thus, we need only prove (2.3.10). To this end, choose M, j_0, and ε > 0 so that (A_M)_{i j_0} ≥ ε for all i. If j_0 does not lead to j, then, by Theorem 2.3.8, j is transient, and so, by (2.3.7), ℙ(T_j < ∞ | X_0 = j) = 1. Hence,
=J
-(n)
M,
E >
(
)
;::: E
< oo
Statements like the one which follows are called individual ergodic theorems because they, distinguished from the first part of Theorem 2.3.4, are about convergence with probability 1 as opposed to convergence in mean. See Exercise 3.3.9 below for more information.
6
as
2.3 Elements of Ergodic Theory
39
lP'(pj Tj X0
n conditional on = j, Tj ) :<=: � 0 with probability 1. At the same time, because j is transient, = oo I = j) > 0, and so I = j] = oo. Hence, we have proved (2.3. 10) in the case when )o -fj. Next assume that j0--'>j. Then, again by Theorem 2.3.8, = j] < oo ) 1) m m and, conditional on m ;:::: 1 } is a sequence of mutually = j, {PJ - PJ independent random variables with the same distribution as Pj · In particular, by the Strong Law of Large Numbers (cf. Exercise 1 .3.4)
X0
Xo
lP'
(
lim ---+ oo
m
JE[pj Xo lE [pj i Xo
---+
PJm) m
:
=
rj
I Xo ) =j
On the other hand, for any m
;::::
=1
where rj
=
lE[pji Xo
=
j] .
1,
and
while, since PJm )
;::::
m,
I T(Pj -
(rn)
1
) - r-:-1 1
1
Hence,
Finally, by taking mn
.
=
1
-I m I ( m)
r1J·
0._
- r1· .
[ -[!; J we get n
I I -mn
Pj(m, ) - r· rj
2 3 -(n I T ) - r-:- 1 I :<=: - + 1
:<=:
1
---+
0 as
n
----'> oo .
Notice that (2.3.10) is precisely the sort of statement for which Gibbs was looking. That is, it says that, with probability 1 , when one observes an individual path, the average time which it spends in each state tends, as one observes for a longer and longer time, to the probability which the equilibrium (i.e. , stationary) distribution assigns to that state.
2.4 Exercises
EXERCISE 2.4.1. In this exercise we will give a probabilistic interpretation of the adjoint of a transition probability matrix with respect to a stationary distribution. That is, suppose that the transition probability matrix P admits a stationary distribution μ, assume (μ)_i > 0 for each i ∈ §, and determine the matrix Pᵀ by

(Pᵀ)_{ij} = (P)_{ji}(μ)_j / (μ)_i.

(a) Show that Pᵀ is a transition probability matrix for which μ is again a stationary distribution.

(b) Use ℙ and ℙᵀ to denote probabilities computed for the chains determined, respectively, by P and Pᵀ with initial distribution μ, and show that these chains are the reverse of one another in the sense that, for each n ≥ 0, the distribution of (X_0, ..., X_n) under ℙᵀ is the same as the distribution of (X_n, ..., X_0) under ℙ. That is,

ℙᵀ(X_0 = i_0, ..., X_n = i_n) = ℙ(X_0 = i_n, ..., X_n = i_0)
for all n ≥ 0 and (i_0, ..., i_n) ∈ §^{n+1}.

EXERCISE 2.4.2. The Doeblin theory applies particularly well to chains on a finite state space. For example, suppose that P is a transition probability matrix on an N element state space §, and show that there exists an ε > 0 such that (A_N)_{i j_0} ≥ ε for all i ∈ § if and only if i→j_0 for all i ∈ §. In particular, if such a j_0 exists, conclude that, for all probability vectors μ,

‖μA_n − π‖_v ≤ 2(N − 1)/(nε),  n ≥ 1,
where π is the unique stationary probability vector for P.

EXERCISE 2.4.3. Here is a version of Doeblin's Theorem which sometimes gives a slightly better estimate. Namely, assume that (P)_{ij} ≥ ε_j for all (i, j), and set ε = Σ_j ε_j. If ε > 0, show that the conclusion of Theorem 2.2.1 holds and that (π)_i ≥ ε_i for each i ∈ §.

EXERCISE 2.4.4. Assume that P is a transition probability matrix on the finite state space §, and show that j ∈ § is recurrent if and only if 𝔼[ρ_j | X_0 = j] < ∞. Of course, the "if" part is trivial and has nothing to do with the
finiteness of the state space.
EXERCISE 2.4.5. Again assume that P is a transition probability matrix on the finite state space §. In addition, assume that P is doubly stochastic in the sense that each of its columns as well as each of its rows sums to 1. Under the condition that every state is accessible from every other state, show that 𝔼[ρ_j | X_0 = j] = #§ for each j ∈ §.
EXERCISE 2.4.6. In order to test how good Doeblin's Theorem is, consider the case when § = {1, 2} and

P = [ 1 − α    α  ]
    [   β    1 − β ]

for some (α, β) ∈ (0, 1). Show that π = (α + β)^{−1}(β, α) and that

max{ ‖νP − π‖_v : ν is a probability vector } = (2(α ∨ β)/(α + β)) |α + β − 1|.
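A numerical check of the first claim, with one arbitrary choice of (α, β):

```python
# For P = [[1-a, a], [b, 1-b]], the stationary vector is pi = (b, a)/(a + b).
a, b = 0.3, 0.5
P = [[1 - a, a],
     [b, 1 - b]]
pi = [b / (a + b), a / (a + b)]

# Verify stationarity pi = pi P directly.
pi_P = [pi[0] * P[0][0] + pi[1] * P[1][0],
        pi[0] * P[0][1] + pi[1] * P[1][1]]
```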
j (J.L*m ) j = "2�) J.L* (m -1) )j - i f.Li · i =O
Then the transition probability matrix P is given by (P) ij = (J.L*i )j . The first interesting question which one should ask about this model is what it predicts will be the probability of eventual extinction. That is, what is limn--> oo lP'(Xn = 0)? A na'ive guess is that eventual extinction should occur or should not occur depending on whether the expected number "Y l:�= O kf.Lk of progeny is strictly less or strictly greater than 1 , with the case when the expected number is precisely 1 being more ambiguous. In order to verify this guess and remove trivial special cases, we make the assumptions that (J.L) o > 0, (J.L) o + (J.Lh < 1, and "Y l:�=o k (J.L)k < oo . ( a) Set j ( s) = 2:: %': 0 Sk f.Lk for s E [0, 1], and define r n (s) inductively so that r 0 (s) = s and r n = f o r (n - 1 ) for n :2: 1 . Show that "Y = f'(1) and that =
=
()()
= 2: sj (Pn ) ij for s E [0, 1] and i ::::: 0. j =O Hint: Begin by showing that f (s) i = l:j::0 sj (J.L*i ) j.
r n ( s) i = lE [sXn I Xo = i]
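The closed form in Exercise 2.4.6 can be spot-checked numerically: for the two-state chain $P = \begin{psmallmatrix}1-\alpha & \alpha\\ \beta & 1-\beta\end{psmallmatrix}$ the worst-case variation distance $\max_\nu\|\nu P^n - \pi\|_{\mathrm{v}}$ is $\tfrac{2(\alpha\vee\beta)}{\alpha+\beta}|\alpha+\beta-1|^n$, and since the maximum over probability vectors is attained at a point mass, it suffices to look at the rows of $P^n$. The parameter values below are an arbitrary illustrative choice.

```python
import numpy as np

alpha, beta = 0.3, 0.5          # assumed parameters for the demonstration
P = np.array([[1 - alpha, alpha],
              [beta, 1 - beta]])
pi = np.array([beta, alpha]) / (alpha + beta)

for n in range(1, 6):
    Pn = np.linalg.matrix_power(P, n)
    # max over probability vectors nu of ||nu P^n - pi||_v is attained at a row.
    worst = max(np.abs(Pn[i] - pi).sum() for i in range(2))
    exact = 2 * max(alpha, beta) / (alpha + beta) * abs(alpha + beta - 1) ** n
    print(n, worst, exact)      # the two columns agree
```

The second eigenvalue of $P$ is $1-\alpha-\beta$, which is why the exact rate $|\alpha+\beta-1|^n$ can beat the Doeblin bound.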
(b) Observe that $s \in [0,1] \mapsto f(s) - s$ is a continuous function which is positive at $s = 0$, zero at $s = 1$, and smooth and strictly convex (i.e., $f'' > 0$) on $(0,1)$. Conclude that either $\gamma \le 1$ and $f(s) > s$ for all $s \in [0,1)$, or $\gamma > 1$ and there is exactly one $\alpha \in (0,1)$ at which $f(\alpha) = \alpha$.

(c) Referring to the preceding, show that
$$\gamma \le 1 \implies \lim_{n\to\infty} \mathbb{E}\bigl[s^{X_n} \mid X_0 = i\bigr] = 1 \quad \text{for all } s \in (0,1]$$
and that
$$\gamma > 1 \implies \lim_{n\to\infty} \mathbb{E}\bigl[s^{X_n} \mid X_0 = i\bigr] = \alpha^i \quad \text{for all } s \in (0,1).$$

(d) Based on (c), conclude that $\gamma \le 1 \implies \mathbb{P}(X_n \to 0 \mid X_0 = i) = 1$ and that
$$\gamma > 1 \implies \lim_{n\to\infty} \mathbb{P}(X_n = 0 \mid X_0 = i) = \alpha^i \ \text{ and } \ \lim_{n\to\infty} \mathbb{P}(1 \le X_n \le L \mid X_0 = i) = 0 \ \text{ for all } L \ge 1.$$
The last conclusion has the ominous implication that, when the expected number of progeny is larger than 1, then the population either becomes extinct or, what may be worse, grows indefinitely. EXERCISE 2 . 4 . 8 . Continue with the setting and notion in Exercise 2.4.7. We want to show in this exercise that there are significant differences between the cases when 1 < 1 and 1 = 1. = = ( a) Show that Hence, when 1 < 1, the expected size of the population goes to 0 at an exponential rate. On the other hand, when 1 = 1 , the expected size remains constant, this in spite of the fact that = i) 1 . Thus, when 1 = 1 , we have a typical situation of = the sort which demonstrates why Lebesgue had to make the hypotheses he did in his dominated convergence theorem, Theorem 6.1.11. In the present case, the explanation is simple: as --" oo , with large probability = 0 but, nonetheless, with positive probability is enormous. (b) Let be the time of first return to 0. Show that
lE[Xn I Xo i] iJn .
lP'(Xn OI Xo
------>
n Xn Xn p0 lP'(po ::::: ni Xo i) lP'(Xn OIXo i) (fo (n- l ) (Jlo )r, =
=
=
=
=
and use this to get the estimate
JE [p0 I X0 i] ::::: JE[p0 I X0 i] i lP'(po ni Xo
In particular, this shows that = < oo when 1 < 1 . ( c ) Now assume that 1 = 1. Under the additional condition that f3 = 1) = r !"(1) = Lk> 2 k(k - 1 )f.lk < oo , start from = and show that = oo for all 2': 1 .
(n- l) (Jlo ),
2.4 Exercises
43
Hint: Begin by showing that
for n > m. Next, use this to show that oo >
E[poi Xo = 1] = 1 + L0 (1 - r n (J.Lo )) 00
would lead to a contradiction. (d) Here we want to show that the conclusion in (c) will, in general, be false without the finiteness condition on the second derivative. To see this, let + e '\'oo k E ( 0 1 ) be given, and check that j ( s ) s + ( 1 �s)1 +0 = L.... k =O s J.lk, where . . . , J.lk , . . . ) is a probability vector for which J.lk > 0 unless k = 1 . J.L = Now use this choice of J.L to see that, when the second derivative condition in (c) fails, = 1] can be finite even though 1 = 1 . Hint: Set an = 1 note that a n - an + l = and use this first 1 and then that there exist 0 < c2 < c2 < oo such that to see that anan+ l c1 :<:::: a;;:! 1 - a-;;, 0 :<:::: c2 for all n � 1 . Conclude that ( 0 > = 1) tends to 0 like n - � . EXERCISE 2 . 4 . 9 . The idea underlying this exercise was introduced by J.L. Doob and is called7 Doob 's h-transformation. Let P is a transition probability matrix on the state space §. Next, let 0 i= f c;;; § be given, set = inf{ n � 1 : E f}, and assume that
B (J.L, o ,
='
E[p0 IXo n r (J.Lo ), --+
Xn
J.Loa�+& , lP' p ni Xo
Pr
h(i)
=
lP'(pr = I Xo = i) > 0 00
for all i E §
=
§ \ r.
( a) Show that h(i ) = Lj E §(P)iJ h(j) for all i E §, and conclude that the matrix P given by (P)ij = hCi (P)iJh(j) for (i, j) E (§) 2 is a transition ) probability matrix on §. (b) For all n E N and (j0 , . . . E (§) n + l , show that, for each i E §,
,Jn ) P(Xo Jo, . . . , Xn Jn I Xo = i) = lP'(Xo = Jo, ... , Xn = Jn I Pr = oo & Xo = i) , =
=
where lF is used here to denote probabilities computed for the Markov chain on § whose transition probability matrix is P. That is, the Markov chain determined by P is the Markov chain determined by P conditioned to never hit r. 7
The "h" comes from the connection with harmonic functions.
44
2 MARKOV CHAINS
EXERCISE 2 . 4 . 1 0 . Here is another example of an h-transform. Namely, as sume that j0 E § is transient but that i ----*]o for all i E §.8 Set h(i) = lP' ( PJo < ooiXo = i) for i -=/:- Jo and h(jo) = 1 . (a) After checking that h(i) > 0 for all i E §, define P so that if i = Jo if i -=1- j0. Show that P is again a transition probability matrix. (b) Using lF to denote probabilities computed relative to the chain deter mined by P, show that
t
lF (PJo > n I Xo = i) = n < Pio < oo I Xo = i ) h i) IP'( for all n E N and i -=/:- Jo. (c ) Starting from the result in (b) , show that j0 is recurrent for the chain determined by P.
8
B y Exercise 2.4.2, this i s possible only if § i n infinite.
CHAPTER 3
More about the Ergodic Theory of M arkov Chains
In Chapter 2, all of our considerations centered around one form or another of the Doeblin condition which says that there is a state which can be reached from any other state at a uniformly fast rate. Although there are lots of chains on an infinite state space which satisfy his condition, most do not. As a result, many chains on an infinite state space will not even admit a stationary probability distribution. Indeed, the fact that there are infinitely many states means that there is enough space for the chain to "get lost and disappear." There are two ways in which this can happen. Namely, the chain can disappear because, like the a nearest neighbor, non-symmetric random walk in Z (cf. (1.1.13)) or even the symmetric one in Z3 (cf. §1.2.4), it may have no recurrent states and, as a consequence, will spend a finite amount of time in any given state. A more subtle way for the chain to disappear is for it to be recurrent but not sufficiently recurrent for there to exist a stationary distribution. Such an example is the symmetric, nearest neighbor random walk in Z which is recurrent, but just barely so. In particular, although this random walk returns infinitely often to the place where is begins, it does so at too sparse a set of times. More precisely, by (1.2.7) and (1.2.13), if is the transition probability matrix for the symmetric, nearest neighbor random walk on Z, then = = 0) :S A(1)n - ! ----+ 0, :S and so, if were a probability vector which was stationary for then, by Lebesgue's Dominated Convergence Theorem, Theorem 6.1.11, we would have the contradiction (JL)J lim = 0 for all j E Z.
P
(P2n )ij (P2n )ii IP'(X2n JL =
"' (JL)i(P2n )ij n-----1> CXJ � iEZ
P,
In this chapter, we will see that ergodic properties can exist in the absence of Doeblin's condition. However, as we will see, what survives does so in a weaker form. Specifically, we will no longer be looking for convergence in the ll v-norm and instead will settle for pointwise convergence. That is, we will be looking for results of the form ----+ for each j rather than 1r ll v ----+ 0.
I I JLP ·
(JLP)1 (1r)j
46
3 MORE ERGODIC THEORY OF MARKOV CHAINS 3 . 1 Classification of States
In this section we deal with a topic which was hinted at but not explic itly discussed in Chapter 2. Namely, because we will no longer be making an assumption like Doeblin's, it will be necessary to take into account the possibility that the chain sees the state space as a union of disjoint parts. To be precise, given a pair (i, j) of states, recall that we write i-->j and say that j is accessible from i if, with positive probability, the chain can go from state > 0 for some n E N. Notice that accessibility is i to state j. That is, transitive in the sense that
(Pn )ij
(3.1.1)
(Pm+n )iR Lk (Pm )ik(Pn ) kc ;::: (Pm )ij(Pn )p! =
> 0.
If i and j are accessible from one another in the sense that i-->j and j-->i, then we write i,....j and say that i communicates with j. It should be clear that ,.... is an equivalence relation. To wit, because = 1 , i,....i , and it is trivial that j,....i if if-+j. Finally, if i,....j and j,.... £, then (3. 1 . 1 ) makes it obvious that i,....£ . Thus, ",...." leads to a partitioning of the state space into equivalence classes made up of communicating states. That is, for each state i, the communicating equivalence class [i] of i is the set of states j such that i,....j ; and, for every pair (i, j) , either [i] = [j] or [i] n [j] = 0. In the case when every state communicates with every other state, we say that the chain is irreducible. 3 . 1 . 1 . Classification, Recurrence, and Transience: In this subsection, we will show that recurrence and transience are communicating class prop erties. That is, either all members of a communicating equivalence class are recurrent or all members are transient. Recall (cf. §2.3.2) that PJ is the time of first return to j and that we say j < ooiXo = j) is equal to is recurrent or transient according to whether or strictly less than 1 . 3 . 1 . 2 THEOREM. Assume that i is recurrent and that j i= i. Then i-->j if and only if lP' ( PJ < P i Xo = i) > 0. Moreover, if i-->j, then < oo i Xo = £) = 1 for any , £ E {i, j} 2 . In particular, i-->j implies that if-+j and that j is
(P0)ii
JID(pj
JID(pk
(k ) i
recurrent.
PROOF: Given j i= i and n ;::: 1 , set (cf. (2.3.5))
Gn (ko, . . . , kn) (Fn- l , i (ko, . . . , kn- l ) - Fn,i(ko, . . . , kn))Fn,j (ko, . . . , kn)· If { p;m ) : ;::: 0} are defined as in §2.3.2, then, by (2.1.2), =
m
3.1 Classification of States
47
lP' (Pi(m+1) < Pi I Xo = ') £=1"' IP'(Pi(m) = Pi(m+ 1 ) < Pi I Xo = ') "' "' IP'(Pi(m) Pi(m+1) < Pi I Xo ) £=1 n=1 = £L,n= 1 lE[Gn (Xe, . . . ,X£+n ), p�m) = I! < PJ I Xo = i] = £,Ln=1 lE[Gn (Xo, ... ,Xn ) I Xo i]lP'(p�m) = I! < PJ I Xo = i) e,n=1 00
z
00
00
_
-�� _
=
n
n t.
�
_ n - t-
- t-,
+
&
z
-
n
_
z'
00
00
=
and so
IP'(p�m) < PJ I Xo = i) = IP'(pi < PJ I Xo = i) m . suppose that i -+j but lP'(pj < Pi I Xo = i) = 0. Then lP'(pi <:::: PJ I Xo = so, because lP' ( pj Pii Xo = i) lP' ( pi < ooi Xo = i) = 1, lP' ( pi < i) Now = 1 ,=and Hence, by (3.1.3), this means that lP' ( p�m) < PJ I Xo = i) = 1 JforP IXallo i) =1 ,1 .which, since p( m ) leads to lP'(pj = ooi Xo = i) = and therefore rules out i-+j. That is, we have now shown that i -+j ===? lP'(pj < implication needs no comment. PiiToXo provei) >that0, andi-+jthe opposite lP'(pi < ooi Xo = j) = 1 , first observe that IP'(PJ < Pi < oo I Xo = i) = nlim--->oo mL= 1 IP'(PJ < Pi + I Xo i) nl_i_,� mL= 1 lE[1 - Fn,i (Xm , . . . ,Xm+n), PJ < Pi I Xo = i] lim "' IP'(pj = < Pi I Xo = i)lE[ 1 - Fn ,i (Xo, . . . , Xn ) I Xo j ] = n-+cx: :> m�= 1 = IP'(PJ < Pi I Xo = i)IP'(pi < oo I Xo = j) . Thus, after combining this with lP' ( pi < ooi Xo = i) 1 , we have IP'(PJ < Pi I Xo = i) = IP'(PJ < Pi I Xo = i)IP'(pi < oo I Xo = j), which, because lP'( PJ < Pi I Xo = i) > 0, is possible only if lP' ( pi < ooi Xo = j) 1 . In particular, we have now proved that j-+i and therefore that i+-+j. j =/:- i
(3.1 .3)
===?
=1-
m
�
�
� m,
1
=
===?
00
=
<:::: m
m
n
=
00
=
=
m
00
=
m
=
=
48
3 MORE ERGODIC THEORY OF MARKOV CHAINS Similarly,
JPl (pJ < oo I Xo = i) = JPl(pJ < Pi I Xo = i) + JPl (pi < PJ < oo I Xo = i ) = JPl(pJ < Pi I Xo = i ) + JPl(pi < PJ I Xo = i ) JPl(pJ < oo I Xo = i ) , and so
JPl(pJ < oo I Xo Hence, i--+j Finally,
=?
=
i )JPl(PJ < Pi I Xo
JPl( pJ < oo i Xo
JPl(pi < PJ < oo I Xo = J)
=
=
=
i)
=
JPl(pJ < oo I Xo
=
i) .
i) = 1.
JPl(pJ < oo I Xo = i) JPl(pi < PJ I Xo
Hence, because we now know that JPl ( pi < oo i Xo = j)
i ) when i--+j, we see that i--+j implies
=
= J) .
1 = JPl ( p1 < oo i Xo =
JPl(pj < oo I Xo = J) = JPl(pj < Pi I Xo = J) + JPl (pi < PJ < oo I Xo = j ) = JPl(pj < Pi I Xo = j ) + JPl(pj < oo I Xo = i )JPl(pi < Pj I Xo = j ) = 1 , since JPl ( pi = PJ IXo = j) ::::: JPl ( pi = ooi Xo = j ) = 0 . 0 As an immediate consequence of Theorem 3.1 .2 we have the following corol lary. 3 . 1 . 4 COROLLARY. Ifi ...... j, then j is recurrent (transient) if and only if i is. Moreover, if i is recurrent, then JPl ( p1 < oo i Xo = i) is either 1 or 0 according whether or not i communicates with j. In particular, if i is recurrent, then (P n ) iJ = 0 for all n 2: 0 and all j which do not communicate with i. When a chain is irreducible, all or none of its states possess any particular communicating class property. Hence, when a chain is irreducible, we will say that it is recurrent or transient if any one, and therefore all, of its states is. 3.1.2. Criteria for Recurrence and Transience: There are many tests which can help determine whether a state is recurrent, but no one of them works in all circumstances. In this subsection, we will develop a few of the most common of these tests. Throughout, we will use u to denote the column vector determined by a function § R We begin with a criterion for transience. u :
3 . 1 .5 THEOREM.
that (P u) i transient.
If
u
----+
is a non-negative function on § with the property then (Pu)j < (u)1 for some j E § implies j is
::::: ( u) i for all i E §,
3.1 Classification of States PROOF : Set f = u - Pu, and note that, for all
n
49
2: 1,
n-1 u(j ) 2: (u)j - (P n u) j = L ((P m u) j - (Pm+1 u)j) m =O n-1 n-1 = L (P m f)j ::0: (f) j L (P m )jj · m=O m=O Thus lE[Tj iXo = j] = 2:::=0(P m )jj ::; (i)] < oo, which, by (2.3.7) , means that j is transient. D In order to prove our next criterion, we will need the following special case of a general result known as Doob 's Stopping Time Theorem. LEMMA 3 . 1 . 6 . Assume that u : § ----7 IR is bounded below and that r is a non empty subset of §. If (P u)i ::; u( i) for all i tf. r and pr = inf{n :;:: 1 : Xn E f}, then IE [u(Xn/\pr ) I Xo = i] ::; u(i) for all n ::0: 0 and i E §. Moreover, if the inequality in the hypothesis is replaced by equality, then the inequality in the conclusion can be replaced by equality.
PROOF: Set An = {Pr > n } . Then, An is measurable with respect to (Xo , . . . , Xn ), and so, by (2.1.1) , for any i,
lE [u(X(n+1) /\pr ) I Xo = i] = lE [u(Xn/\pr), AnC I Xo i] + L lE [u(Xn+ d, An n {Xn = k} I Xo = i] k f/:.S = IE [u(Xn/\pr ), An C I Xo = i] + l: IE [(Pu)k, An n {Xn = k} I Xo = i] k f/:.S ::; lE [u(Xn/\pr ), An C I Xo = i] + lE [u(Xn/\pr ), An I Xo = i] = lE [u(Xn/\pr ) I Xo = i] . =
Clearly, the same argument works just as well in the case of equality.
D
THEOREM 3 . 1 .7. Assume that j is recurrent, and set C = {i : i�j}. If u § [0, oo) is a bounded function and either u( i) = (P u) i or u(j ) > u(i) :;:: (P u) i for all i E C \ {j} , then u is constant on C. On the other hand, if j is transient, then the function u given by :
----+
u( i) =
{1
if i = j lP'(pj < ooiXo = i) if i of. j
is a bounded, non-constant solution to
u( i) = (Pu) i for all i of. j.
50
3 MORE ERGODIC THEORY OF M ARKOV CHAINS
PROOF: In proving the first part, we will assume, without loss in generality, that C §. Now suppose that j is recurrent and that u(i) = (Pu)i for i -=!= j. By applying Lemma 3.1.6 with r = {j}, we see that, for i -=!= j, =
u(i) = u(j) lP'(PJ :::; n I Xo = i ) + E [u(Xn ), PJ > n I Xo = i] . Hence, since, by Theorem 3.1.2, lP' ( p1 < oo i Xo = i) = 1 and u is bounded, we get u( i) = u(j) after letting n oo . Next assume that u(j) :::=: u( i) :::=: (Pu) i for all i -=!= j. Then, again by Lemma 3.1.6, we have ---t
u(j) :::=: u(i) :::=: u(j) lP'(PJ :::; n I Xo = i ) + E [u(Xn ) , PJ > n I Xo i] , which leads to the required conclusion when n oo . To prove the second part, let u be given by the prescription described, and =
---t
begin by observing that, because j is transient,
From this, one sees first that P11 < 1 and then that inf#j u(i) < 1 u(j). That is, u is bounded and non-constant. At the same time, when i -=!= j, by conditioning on what happens at time 1 , we know that =
u(i) = lP'(PJ < oo I Xo i ) =
=
P ij + L pik lP'(PJ
<
k=h
oo I Xo = k)
=
(Pu) i .
D
3 . 1 . 8 THEOREM. Let {Bm : m :::=: 0} be a non-decreasing sequence of non-empty subsets of § with the property that
lP'(::Jn E N Xn tf_ Bm I Xo = j ) = 1
for some j
E Bo
and all m :::=: 0.
If there to exists a non-negative function u satisfying (Pu) i :::; u(i), i -=!= j and am infi \t B= Ui oo as m ---t oo , then j is recurrent. PROOF: For each m ::::: 0, set rm = {j} u BmC, and take Prm inf { n ::::: 1 : Xn E fm} = PJ 1\ Tm, where Tm inf { n :::=: 1 : Xn tf_ Bm} · By Lemma 3.1.6, =
---+
=
=
j) = 1, we conclude, after am lP'(Tm :::; PJ IXo = j) for all m :::=: 0, and therefore limm--+oo lP'( Tm :::; PJ IXo = j) = 0. But this means that for all n letting n
:::=:
---t
0. Hence, because lP' ( Tm < oo i Xo
oo, that u(j) :::=:
lP'(PJ < oo I Xo = j) and so j is recurrent.
D
=
lP'(PJ < Tm I Xo = j) = 1 - lP'(Tm :::; PJ I Xo = j ) / 1, :::=:
3.1 Classification of States
51
{Fm
: m 2': 0} be COROLLARY 3 . 1 .9 . Assume that P is irreducible, and let a non-decreasing sequence of non-empty, finite subsets of the state space. If j E and there exists u a non-negative function which satisfies ( Pu) i :::; u(i), i -!=- j and infi � F, ui -----+ then j is recurrent.
F0 XXn n
oo,
PROOF: In view of Theorem 3. 1 .8, it suffices for us to check that 1P'(3n E N � = j) = 1 for all m 2': 0. To this end, let = inf{n 2': 1 : � By irreducibility, < = i) 0 for all m and i. Hence, because is finite, for each m there exists a E (0, 1) and 2': 1 for all i E such that = i) :::; But this means that j ) equals +
FFm i X}·o m Fm lP' (Tm > (£ lP'(Tl)Nm >m INXom i Xo L lP'(Tm > (£ l)Nm =
+
&
iEFm
lP'(Tm ooi Xo Bm> Tm Nm Fm . Bm Xm, i I Xo j) =
=
iEF,
lP'(Tm > £Nm i Xo
lP'(Tm ooi Xo
= = j) :S 8� , and so = j) = 0. D Remark: The preceding criteria are examples, of which there are many oth ers, which relate recurrence of j E § to the existence or non-existence of certain types of functions which satisfy either Pu = u or Pu :::; u on § \ {j}.
Thus,
All these criteria can be understood as mathematical implementations of the intuitive idea that
( ) Xn -1=- j, u(Xn)
u(i) =
Pu
i
or (Pu) i :::; u(i) for i -!=- j,
implies that as long as will be "nearly constant" or "nearly non-increasing" as n increases. The sense in which these "nearly's" should be interpreted is the subject of martingale theory, and our proofs of these criteria would have been simplified had we been able to call on martingale theory. 3.1.3. Periodicity: Periodicity is another important communicating class property. In order to describe this property, we must recall Euclid's concept of the greatest common divisor gcd(S) of a non-empty subset S c;;; Z. Namely, we say that d E z+ is a common divisor of S and write d iS if J E Z for every s E S. Clearly, if S = {0}, then d S for every d E z + , and so we take gcd(S) On the other hand, if S -1=- {0}, then no common divisor of S can be larger than min { l s l : s E S \ {0} } , and so we know that gcd(S) < Our interest in this concept comes from the role it plays in the ergodic theory of Markov chains. Namely, as we will see below, it allows us to distin guish between the chains for which powers of the transition probability matrix converge and those for which it is necessary to take averages. More precisely, given a state i, set =
(3. 1 . 10)
i
oo.
S(i)
=
{n 2': 0 :
(Pn )i > 0}
oo.
and d(i) = gcd ( S(i) ) .
52
3 MORE ERGODIC THEORY OF MARKOV CHAINS
Then d( i) is called the period of the state i, and, i is said to be aperiodic if d(i) = 1 . As we will see, averaging is required unless i is aperiodic. However, before we get into this connection with ergodic theory, we need to take care of a few mundane matters. In the first place, the period is a communicating class property:
(3.1.11) i'-4j ===? d(i) = d(j). To see this, assume that (P m )ij > 0 and (P n )ji > 0, and let d be a common divisor of S(i). Then for any k E S(j), (P m + k +n )ii � (P m )ij (P k )jj (Pn )j i > 0, and so m + k + n E S(i). Hence dl {m + k + n : k E S(j)}. But, because m + n E S(i) , and therefore dl(m + n ) , this is possible only if d divides S(j), and so we now know that d(i) ::;; d(j). After reversing the roles of i and j, one sees that d(j) ::;; d(i), which means that d(i) must equal d(j). We next need the following elementary fact from number theory. 3 . 1 . 12 THEOREM. Given 0 # S <;_;; Z with S # {0}, gcd(S) :'S min { l81 8 E S \ {0}} and equality holds if and only if {gcd(S) , -gcd(S)} n S # 0. More generally, there always exists an M E z + , { am } J"f <;_;; Z, and {8m}f1 <;_;; S such 2 that gcd(S) = 2:: � am8m. Finally, if S <;_;; N and (81 , 82) E S ===? 81 + 82 E S, then there exists an M E z + such that {m gcd(S) : m � M } = {8 E S : 8 � Mgcd(S)}. :
PROOF: The first assertion needs no comment. To prove the second assertion, let S be the smallest subset of Z which contains S and has the property that 2 (81 , 82) E S ===? 81 ± 82 E S. As is easy to check, S coincides with the subset of Z whose elements can be expressed in the form 2:: � am 8m for some M � 1 , {am }J"f <;_;; Z, and {8m }J"f <;_;; S . In particular, this means that gcd(S) I S , and so gcd(S) ::;; gcd( S) . On the other hand, because S <;_;; S, gcd( S ) I S. Hence, gcd(S) = gcd( S ) , and so, by the first part, we will be done once we show that gcd( S ) E S. To this end, let m = min{8 E z + 8 E S } . We already know that gcd( S) ::;; m. Thus, to prove the equality, we need only check that mi S . But, by the Euclidean algorithm, for any 8 E S, we can write 8 = am + r for some (a, r) E Z2 with 0 ::;; r < m. In particular, r = 8 - am E S. Hence, if r # 0, then r would contradict the condition that m is the smallest positive element of S. To prove the final assertion, first note that it suffices to prove that there is an M E z + such that {m gcd(S) : m � M } <;_;; S. To this end, begin by checking that, under the stated hypothesis, S = { 82 - 81 : (81 , 82) E S U {0} } . Thus, gcd(S) = 82 - 81 for some 81 E S U {0} and 82 E S \ {0}. If 81 = 0, then m gcd(S) = m82 E S for all m E z + , and so we can take M = 1. If 81 # 0, choose a E z+ so that 81 = a gcd(S) . Then, for any (m, r) E N2 with 0 ::;; r < a, 2 (a + ma + r)gcd(S) m81 + r82 + (a - r)81 = (m + a - r)81 + r82 E S. :
=
53
3.2 Ergodic Theory without Doeblin
Hence, after another application of the Euclidean algorithm, we see that we can take = a2. D As an immediate consequence of Theorem 3. 1. 12, we see that
M
d(i) < oo In particular, 1 (3. 1. 13)
==?
i is aperiodic
(3.1 . 14)
(Pnd(i) )ii > 0 for all sufficiently large E z+ . (Pn )ii > 0 for all sufficiently large E z+ . n
n
{==}
We close this subsection with an application of these considerations to the ergodic theory of Markov chains on a finite state space. 3 . 1 . 15 COROLLARY. Suppose that P is an transition probability matrix on a finite state space §. If there is an aperiodic state § such that for every §, then there exists an and an E > 0 such that � E for all i E §. In particular, (cf (2.3.9))
j0 E -+ j0 i M Ei M E z+ (P )ijo E z+ I JLPn - rrl l v 2(1 - E)l-Ri·l PROOF: Because j0 is aperiodic, we know that there is an M0 E N such that Further, because i --+jo, there exists an m(i) E z + (Psuchn)io]othat>(P0 mfor(i)all)ijo >� 0.Mo.Hence, n )ijo > M0 for all m(i) + Mo . Finally, (P take M Mo +maxi E§ m(i) , E min (P ) ij , and apply (2.2.3) . :::::
for all n
and initial distributions
JL .
n
n �
=
=
i Es
D
o
3 . 2 Ergodic Theory without Doeblin
In this section, we will see to what extent the results obtained in Chap ter 2 for Markov chains which satisfy Doeblin's condition can be reproduced without Doeblin's condition. The progression which we will adopt here runs in the opposite direction from that in Chapter 2. That is, here we will start with the most general but weakest form of the result and afterwards will see what can be done to refine it. 3.2.1. Convergence of Matrices: Because we will be looking at power series involving matrices which may have infinitely many entries, it will be important for us to be precise about what is the class of matrices with which we are dealing and in what sense our series are converging. For our purposes, the most natural class will be of matrices M for which (3.2.1)
I M f l u,v = supiE§ LjE§ I (M)ij l
Mu,v ( I ) l u,v
is finite, and the set of all such matrices will be denoted by § . An easy calculation shows that u § ) is a vector space over JR. and that is a good norm on That is,
Mu,v (S).M ,v ( I M I u,v
=
·
0 if and only if
M
=
0,
1 The "if" part of the following statement depends on the existence of infinitely many prime numbers.
54
3 MORE ERGODIC THEORY OF MARKOV CHAINS
l o:MI I u,v = l a ii i M I u,v for a lR, and l i M + M'l l u,v � I M I u,v + I M'I I u,v · Slightly less obvious is the fact that exists and ( 3.2.2) (M , M') E Mu,v (§) 2 MM' I MM'I I u,v � I M I u,v i M'I I u,v · E
==?
To see this, observe first that, since
k I (M)iki i (M') kJ I � (Lk I (M)ikl ) supk I(M') kJI < oo, the sum in (MM')ij = Lk (M)ik (M')kj is absolutely convergent. In addition, for each i, L I ( MM' )ijl � L L I ( M)iki i ( M' ) kjl j j k L
and so the inequality in ( 3.2.2) follows. We next want to show that the metric which determines on is complete. In order to do this, it will be useful to know that if then for each (i, j) lim ( 3.2.3) � lim ==? sup
I · l u,v
Mu,v (§),
{ MnMu,}o='v (§) �
Mij n-+CXJ (Mn)ij L I Mij l n---+ oo I Mnl l u , v · j Indeed, by Fatou's Lemma, Theorem 6.1.10, L I Mij l nlim L I ( Mn)iJ I for each i §, -+CXJ j j =
2
:S:
E§
E
E§
and so ( 3.2.3) is proved. Knowing ( 3.2.3) , the proof of completeness goes as follows. Assume that is Cauchy convergent: lim � = 0. Obviously, for each i, j) E is Cauchy convergent as a sequence in JR. Hence, there is a matrix such that, for each (i, j), Furthermore, by ( 3.2.3) , =
{Mn }o=' Mu,v (§) I · l u, ( §2 , { ( )ij }(f' m-+= supn>m l i Mn Mm l u , v MnM (M)ij limn_,=(Mn )ij · li M - Mm l u,v n---+limoo li Mn - Mm l u,v n>supm l i Mn - Mm l u , v · Hence, l i M -M m l u , v ----+ 0. v-
:S:
:S:
3.2 Ergodic Theory without Doeblin
55
3.2.2. Abel Convergence: As we said in the introduction, we will begin with the weakest form of the convergence results at which we are aiming. That is, rather than attempting to prove the convergence of ( Pn ) ij or even the Cesaro means � 2:: �,";;0 (P m ) ij as n oo, we will begin by studying the Abel sums (1 - s) I::= o s m (P m ) ij as s / 1 . We will say that a bounded sequence {x n }o c::; lR is Abel convergent to x if ->
00
lim ( 1 - s) L SnX n = X.
s/ 1
n= 1
It should be clear that Abel convergence is weaker than (i.e. , is implied by) ordinary convergence. 2 Indeed, if Xn x, then, since (1 - s) 2::: � sn 1, for any
N:
-----+
00
::; (1 - s) L sn l x - Xnl n=O Hence,
l
lim x - (1 - s)
s/1
::;
=
N(1 - s) sup l x - Xnl
nEN
f1 snxn l ::; Nlim
-+ oo
+
sup l x - Xn l ·
n?:_N
sup l x - xn i = O
n ?:_ N
if Xn x. On the other hand, although { ( -1) n } f fails to converge to anything, -----+
00
(1 - s) L sn (-1) n n=O
=
1 (1 - s) 1 + S -----+ 0 as s / 1 .
That is, Abel convergence does not, in general imply ordinary convergence. With the preceding in mind, we set (3.2.4)
R(s)
00
=
(1 - s) L snpn for s E [0, 1). n=O
However, before accepting this definition, it is necessary to check that the above series converges. To this end, first note that, because pn is a transition probability matrix for each n 2: 0, I IPn ll u, v = 1 . Hence, for 0 ::; m < n,
2 In Exercise 3.3. 1 below, it is shown that Abel convergence is also weaker than Cesaro convergence.
56
3 MORE ERGODIC THEORY OF MARKOV CHAINS
and so, by the completeness proved above, the series in (3.2.4) converges with respect to the II ll u,v -norm. Our goal here is to prove that ·
(3.2.5)
1 lE[ X o =jl P JI ( ) (R(s)) = iJ { lP'(pj oo i Xo = i) jj ii #=jj, renewal equation (Pn )ij = mLn=1 f(m)ij (pn-m )jj f(m)iJ lP'(PJ I Xo = i) , (Pn )ij mLn=1 lP'(Xn = j PJ = m I Xo i) = mLn=1 lP' (Xn-m = j I Xo = j) lP' (pJ = m I Xo = i) . s }(s)ij mL=1 smf(m)ij = lE [sPj I Xo = i] , lim
1r
'1
s/ 1
=
<
7r
if if
and the key to our doing so lies in the (3.2.6)
for n � 1,
where
=m
=
which is an elementary application of (2. 1 . 13): =
=
&
Next, for E [0, 1) , set
00
=
and, starting from (3.2.6), conclude that
(R(s)) ij = s)6i,j s) � sn (,t1 f(m)ij (pn-m )jj ) = s)6i,j s) %;;1 smf(m)ij Ct sn-m (Pn-m)jj ) = s)6i,J + }(s)ij (R(s)) jj ' (1 -
(1 -
+
(1 -
+ (1 -
(1 -
That is, (3.2.7)
(R(s) ) ij = s)6i,j + }(s)iJ (R(s) ) JJ s -s (R(s)) jj = 1 }(s)JJ for
(1 -
Given (3.2.7) , (3.2.5) is easy. Namely,
1-
E
[0, 1).
3.2 Ergodic Theory without Doeblin
57
Hence, if j is transient, and therefore ] (1)jj < 1 , 7rjj = hm (R(s) ) JJ. . = 0 = [ I 1 = s/ 1 lE Pj Xo Jl 0
On the other hand, if j is recurrent, then, since
m- 1 1 - s m "' --= L... sc / m, 1-s C=O the Monotone Convergence Theorem says that
� 1 - ] (s)jj � 1 - sm = L... � f(m)jj / L... mf(m)jj = lE [PJ I X0 = J]. 1 m =1 m=1 _
8
as s / 1 . At the same time, when i ;i j, 3.2.3. Structure of Stationary Distributions: We will say that a prob ability vector J.t is P-stationary and will write J.t E Stat(P) if f.t = J.LP. Obvi ously, if J.t is stationary, then J.t = J.LR(s) for each s E [0, 1). Hence, by (3.2.5) and Lebesgue's Dominated Convergence Theorem,
If j is transient, then 1rij 0. On the other hand, if j is recurrent, then, by Theorem 3.1.2, 1rij is either 1rjj or 0, according to whether i+-+j or i 'fj. Hence, in either case, we have that =
(3.2.8)
J.t
E Stat(P)
===} (J.t)j = .
(L: ) 'l,<:---+J
( J.t ) i 1rjj·
We next want to show that (3.2.9)
1rjj > 0 and C {i : i+-+j } ===} n ° E Stat(P) when (n° )i l c (i)1rii· =
=
To do this, first note that 7rjj > 0 only if j is recurrent. Thus, all i E C are recurrent and, for each s E (0, 1), (R(s) ) kc > 0 ¢===? (k, f!.) E C2. In particular, (n° ) i = lims/ 1 (R(s) ) ji for all i, and therefore, by Fatou's Lemma,
3 MORE ERGODIC THEORY OF MARKOV CHAINS
58
Similarly, for any i,

(π^C P)_i = Σ_{k∈C} π_kk (P)_ki ≤ lim_{s↗1} Σ_{k∈C} (R(s))_jk (P)_ki = (π^C)_i,

since

Σ_{k∈C} (R(s))_jk (P)_ki = ( (R(s) - (1-s)I)/s )_ji → π_ji = (π^C)_i as s ↗ 1.

But if strict inequality were to hold for some i, then, by Fubini's Theorem, Theorem 6.1.15, we would have the contradiction

Σ_k (π^C)_k = Σ_i Σ_{k∈C} (π^C)_k (P)_ki = Σ_i (π^C P)_i < Σ_i (π^C)_i = Σ_k (π^C)_k.

Hence, we now know that π^C = π^C P. Finally, to prove that π^C ∈ Stat(P), we still have to check that Σ_i (π^C)_i = 1. However, we have already checked that π^C = π^C P, and so we know that π^C = π^C R(s). Therefore, since, as we already showed, Σ_i (π^C)_i ≤ 1, Lebesgue's Dominated Convergence Theorem justifies

0 < π_jj = Σ_i (π^C)_i (R(s))_ij → ( Σ_i (π^C)_i ) π_jj as s ↗ 1,

which is possible only if Σ_i (π^C)_i = 1.

Before summarizing the preceding as a theorem, we need to recall that a subset A of a linear space is said to be a convex set if (1-θ)a + θa' ∈ A for all a, a' ∈ A and θ ∈ [0,1], and that b ∈ A is an extreme point of the convex set A if b = (1-θ)a + θa' for some θ ∈ (0,1) and a, a' ∈ A implies that a = a' = b. In addition, we need the notion of positive recurrence. Namely, we say that j ∈ S is positive recurrent if E[ρ_j | X_0 = j] < ∞. Obviously, only recurrent states can be positive recurrent. On the other hand, (1.1.13) together with (1.1.15) show that there can exist null recurrent states, those which are recurrent but not positive recurrent.

3.2.10 THEOREM. Stat(P) is a convex subset of R^S. Moreover, Stat(P) ≠ ∅ if and only if there is at least one positive recurrent state j ∈ S. In fact, for any μ ∈ Stat(P), (3.2.8) holds, and μ is an extreme point in Stat(P) if and only if there is a communicating class C of positive recurrent states for which (cf. (3.2.9)) μ = π^C. In particular, (μ)_j = 0 for any transient state j and, for any recurrent state j, either (μ)_i is strictly positive or it is 0 simultaneously for all states i which communicate with j.
PROOF: The only statements not already covered are the characterization of the extreme points of Stat(P) and the final assertion in the case when j is recurrent.
3.2 Ergodic Theory without Doeblin
In view of (3.2.8), the final assertion when j is recurrent comes down to showing that if j is positive recurrent and i↔j, then i is positive recurrent. To this end, suppose that j is positive recurrent, set C = {i : i↔j}, and let i ∈ C be given. Then π^C P^n = π^C for all n ≥ 0, and therefore, by choosing n so that (P^n)_ji > 0, we see that (π^C)_i ≥ (π^C)_j (P^n)_ji > 0.

To handle the characterization of extreme points, first suppose that μ ≠ π^C for any communicating class C of positive recurrent states. Then, by (3.2.8), there must exist non-communicating, positive recurrent states j and j' for which (μ)_j > 0 < (μ)_j'. But, again by (3.2.8), this means that μ = θπ^C + (1-θ)ν, where C = {i : i↔j}, θ = Σ_{i∈C} (μ)_i ∈ (0,1), and (ν)_i equals 0 or (1-θ)^{-1}(μ)_i depending on whether i is or is not in C. Clearly ν ∈ Stat(P) and, because (ν)_j' > 0 = (π^C)_j', ν ≠ π^C. Hence, μ cannot be extreme. Equivalently, every extreme μ is π^C for some communicating class C of positive recurrent states. Conversely, given a communicating class C of positive recurrent states, suppose that π^C = (1-θ)μ + θν for some θ ∈ (0,1) and pair (μ,ν) ∈ Stat(P)². Then (μ)_i = 0 for all i ∉ C, and so, by (3.2.8), we see first that μ = π^C and then, as a consequence, that ν = π^C as well. □

The last part of Theorem 3.2.10 shows that positive recurrence is a communicating class property. Indeed, if j is positive recurrent and C = {i : i↔j}, then π^C ∈ Stat(P), and so, for each i ∈ C, π_ii = (π^C)_i > 0. In particular, if a chain is irreducible, we are justified in saying it is positive recurrent if any one of its states is. See Exercise 3.3.3 below for a criterion which guarantees positive recurrence.

3.2.4. A Small Improvement: The next step in our program will be the replacement of Abel convergence by Cesàro convergence. That is, we will show that (3.2.5) can be replaced by

(3.2.11)  lim_{n→∞} (A_n)_ij = π_ij,  where A_n = (1/n) Σ_{m=0}^{n-1} P^m.
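The content of (3.2.11) is easy to see numerically. The sketch below is not part of the text, and the two-state "flip" chain is just an illustrative choice: it has period 2, so (P^n)_00 oscillates between 0 and 1, while the Cesàro averages (A_n)_00 converge to 1/2 = π_00.

```python
# Numerical sanity check of (3.2.11): the Cesaro averages A_n = (1/n) sum_{m<n} P^m
# converge even when P^n itself does not.

def mat_mul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B))) for j in range(len(B[0]))]
            for i in range(len(A))]

def cesaro_average(P, n):
    d = len(P)
    power = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]  # P^0 = I
    acc = [[0.0] * d for _ in range(d)]
    for _ in range(n):
        for i in range(d):
            for j in range(d):
                acc[i][j] += power[i][j] / n
        power = mat_mul(power, P)
    return acc

P = [[0.0, 1.0], [1.0, 0.0]]      # the "flip" chain: period 2, (P^n)_00 oscillates
A = cesaro_average(P, 1000)
print(A[0][0], A[0][1])            # both near 1/2
```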
As is shown in Exercise 3.3.1, Cesàro convergence does not, in general, follow from Abel convergence. In fact, general results which say when Abel convergence implies Cesàro convergence can be quite delicate and, because the original one was proved by a man named Tauber, they are known as Tauberian theorems. Fortunately, the Tauberian theorem required here is quite straightforward. A key role in our proof will be played by the following easy estimate:
(3.2.12)  {a_m}_0^∞ ⊆ [0,1] & A_n = (1/n) Σ_{ℓ=0}^{n-1} a_ℓ ⟹ |A_n - A_{n-m}| ≤ m/n for 0 ≤ m < n.
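A brute-force check of (3.2.12) is an easy illustration (not a proof); the particular [0,1]-valued sequence below is arbitrary, since any one works.

```python
# Verify the elementary estimate (3.2.12): for {a_m} in [0,1] and
# A_n = (1/n) sum_{l<n} a_l, one has |A_n - A_{n-m}| <= m/n.

a = [((7 * m * m + 3) % 11) / 10.0 for m in range(200)]   # arbitrary values in [0,1]

def A(n):
    return sum(a[:n]) / n

violations = [(m, n) for n in range(2, 200) for m in range(1, n)
              if abs(A(n) - A(n - m)) > m / n + 1e-12]
print(violations)   # expected: []
```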
The proof is:

A_n - A_{n-m} = (1/n) Σ_{ℓ=n-m}^{n-1} a_ℓ - ( 1/(n-m) - 1/n ) Σ_{ℓ=0}^{n-m-1} a_ℓ,

and, because the a_ℓ's lie in [0,1], each of the two terms on the right lies in [0, m/n].

3.2.13 LEMMA. For all (i,j), lim sup_{n→∞} (A_n)_ij ≤ e π_jj. In addition, for any j and any subsequence {n_ℓ : ℓ ≥ 0} ⊆ N with the property that (A_{n_ℓ})_jj → α, one has (A_{n_ℓ})_ij → P(ρ_j < ∞ | X_0 = i) α for each i ≠ j.
PROOF: To prove the first part, observe that

(A_n)_ij ≤ (1 - 1/n)^{-n} (1/n) Σ_{m=0}^{n-1} (1 - 1/n)^m (P^m)_ij ≤ (1 - 1/n)^{-n} ( R(1 - 1/n) )_ij,

which, together with (1.2.10) and (3.2.5), shows that lim sup_{n→∞} (A_n)_ij ≤ e π_ij ≤ e π_jj.

To handle the second part, use (3.2.6) to arrive at

(A_n)_ij = Σ_{m=1}^{n-1} f^(m)_ij (1 - m/n) (A_{n-m})_jj  for i ≠ j.

Hence,

| (A_n)_ij - P(ρ_j < n | X_0 = i) α | ≤ Σ_{m=1}^{n-1} f^(m)_ij ( m/n + | (A_{n-m})_jj - α | ),

where, in the second inequality, we have used α ≤ 1 plus Σ_{m=1}^∞ f^(m)_ij ≤ 1. Finally, for each fixed m, m/n_ℓ → 0 and, by (3.2.12), |(A_{n_ℓ - m})_jj - α| → 0 as ℓ → ∞, and so, by Lebesgue's Dominated Convergence Theorem, the right hand side tends to 0 when n = n_ℓ and ℓ → ∞, which gives the desired conclusion. □

We can now complete the proof of (3.2.11). Namely, if π_jj = 0, then the first part of Lemma 3.2.13 says that lim_{n→∞} (A_n)_ij = 0 = π_ij for all i. Thus, assume that π_jj > 0. In this case Theorem 3.2.10 guarantees that j must be positive recurrent and π^C ∈ Stat(P) when C = {i : i↔j}. In particular, π_jj = Σ_{i∈C} (π^C)_i (A_n)_ij for every n. At the same time, if α⁺ = lim sup_{n→∞} (A_n)_jj and the subsequence {n_ℓ : ℓ ≥ 0} is chosen so that (A_{n_ℓ})_jj → α⁺, then, by the second part of Lemma 3.2.13 and Corollary 3.1.4,

i ∈ C ⟹ lim_{ℓ→∞} (A_{n_ℓ})_ij = α⁺.
Hence, after putting these two remarks together, we arrive at

π_jj = lim_{ℓ→∞} Σ_{i∈C} (π^C)_i (A_{n_ℓ})_ij = ( Σ_{i∈C} (π^C)_i ) α⁺ = α⁺.

Similarly, if α⁻ = lim inf_{n→∞} (A_n)_jj, we can show that α⁻ = π_jj, and so we now know that

π_jj > 0 ⟹ lim_{n→∞} (A_n)_jj = π_jj,

which, after another application of the second part of Lemma 3.2.13, means that we have proved (3.2.11).

3.2.5. The Mean Ergodic Theorem Again: Just as we were able to use Theorem 2.2.5 in §2.3.1 to prove Theorem 2.3.4, so here we can use (3.2.11) to prove the following version of the mean ergodic theorem.

3.2.14 THEOREM. Let C be a communicating class of positive recurrent states. If P(X_0 ∈ C) = 1, then, for each j ∈ C,

lim_{n→∞} E[ ( (1/n) Σ_{m=0}^{n-1} 1_{{j}}(X_m) - π_jj )² ] = 0.
See Exercises 3.3.9 and 3.3.11 below for a more refined statement.
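Theorem 3.2.14 can be illustrated by Monte Carlo. The sketch below is not from the text: the three-state chain and the sample sizes are illustrative choices, and the stationary vector (1/2, 1/3, 1/6) was computed by solving πP = π for this particular P.

```python
# Monte Carlo illustration of Theorem 3.2.14: the mean-square deviation of the
# occupation-time average of state 0 from pi_00 shrinks as n grows.

import random

P = {0: [(0, 0.5), (1, 0.5)],
     1: [(0, 0.25), (1, 0.25), (2, 0.5)],
     2: [(0, 1.0)]}
pi = [0.5, 1.0 / 3.0, 1.0 / 6.0]     # stationary vector of this chain

def step(i, rng):
    u, acc = rng.random(), 0.0
    for j, p in P[i]:
        acc += p
        if u < acc:
            return j
    return P[i][-1][0]

def mean_square_error(n, trials, target_j, pi_jj, rng):
    total = 0.0
    for _ in range(trials):
        x, visits = 0, 0
        for _ in range(n):
            visits += (x == target_j)
            x = step(x, rng)
        total += ((visits / n) - pi_jj) ** 2
    return total / trials

rng = random.Random(42)
small_n = mean_square_error(50, 200, 0, pi[0], rng)
large_n = mean_square_error(2000, 200, 0, pi[0], rng)
print(small_n, large_n)   # the error at n = 2000 is far smaller than at n = 50
```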
PROOF: Since P(X_m ∈ C for all m ∈ N) = 1, without loss in generality we may and will assume that C is the whole state space. In keeping with this assumption, we will set π = π^C. Next note that if μ_i = P(X_0 = i), then

E[ ( (1/n) Σ_{m=0}^{n-1} 1_{{j}}(X_m) - π_jj )² ] = Σ_{i∈S} μ_i E[ ( (1/n) Σ_{m=0}^{n-1} 1_{{j}}(X_m) - π_jj )² | X_0 = i ],

and so it suffices to prove that each of the conditional expectations on the right tends to 0. But, because π_ii > 0 for all i ∈ S and the same decomposition with initial distribution π dominates π_ii times the conditional expectation given X_0 = i, it is enough to prove the result when π is the initial distribution of the Markov chain. Hence, from now on, we will be making this assumption along with C = S.
Now let f be the column vector whose ith component is 1_{{j}}(i) - π_jj. Then, just as in the proof of Theorem 2.3.4,

E^π[ ( (1/n) Σ_{m=0}^{n-1} f(X_m) )² ] ≤ (2/n²) Σ_{m=0}^{n-1} Σ_{m'=m}^{n-1} E^π[ f(X_m) f(X_{m'}) ].

Since π ∈ Stat(P), E^π[ f(X_m) f(X_{m'}) ] = Σ_i π_ii (f)_i (P^{m'-m} f)_i, and therefore the preceding becomes

E^π[ ( (1/n) Σ_{m=0}^{n-1} f(X_m) )² ] ≤ (2/n²) Σ_{m=1}^{n} m π(f A_m f),  where (f A_m f)_i ≡ (f)_i (A_m f)_i.

Finally, (A_m f)_i = (A_m)_ij - π_jj, and so, by (3.2.11) and Lebesgue's Dominated Convergence Theorem, for each ε > 0 there exists an N_ε ∈ Z⁺ such that |π(f A_m f)| ≤ ε for m > N_ε. Hence, we find that

lim_{n→∞} E^π[ ( (1/n) Σ_{m=0}^{n-1} 1_{{j}}(X_m) - π_jj )² ] ≤ lim_{n→∞} (2/n²) Σ_{m=1}^{N_ε} m |π(f A_m f)| + lim_{n→∞} (2ε/n²) Σ_{m=N_ε+1}^{n} m ≤ ε. □
3.2.6. A Refinement in the Aperiodic Case: Our goal in this subsection is to prove that (cf. (3.2.5))

(3.2.15)  if j is transient or aperiodic, then lim_{n→∞} (P^n)_ij = π_ij for all i ∈ S.

Of course, the case when j is transient requires very little effort. Namely, by (2.3.7), we have that

j transient ⟹ Σ_{n=0}^∞ (P^n)_ij ≤ E[T_j | X_0 = j] < ∞,

and therefore that lim_{n→∞} (P^n)_ij = 0. At the same time, because P(ρ_j = ∞ | X_0 = j) > 0 when j is transient, π_ij ≤ π_jj = 0. Thus, from now on, we will concentrate on the case when j is recurrent and aperiodic. Our first step is the observation that (cf. §2.3.2)

(3.2.16)  if j is aperiodic, then there exists an N ∈ Z⁺ such that max_{1≤m≤n} P(ρ_j^(m) = n | X_0 = j) > 0 for all n ≥ N.

To check this, use (3.1.14) to produce an N ∈ Z⁺ such that (P^n)_jj > 0 for all n ≥ N. Then, since

(P^n)_jj = Σ_{m=1}^{n} P(ρ_j^(m) = n | X_0 = j) for n ≥ 1,

(3.2.16) is clear. The second, and key, step is contained in the following lemma.

3.2.17 LEMMA. Assume that j is aperiodic and recurrent, and set α_j⁺ = lim sup_{n→∞} (P^n)_jj and α_j⁻ = lim inf_{n→∞} (P^n)_jj. Then there exist subsequences {n_ℓ⁺ : ℓ ≥ 1} and {n_ℓ⁻ : ℓ ≥ 1} such that

lim_{ℓ→∞} (P^{n_ℓ^± - r})_jj = α_j^±  for all r ≥ 0.
PROOF: Choose a subsequence {n_ℓ : ℓ ≥ 1} so that (P^{n_ℓ})_jj → α_j⁺, and, using (3.2.16), choose N ≥ 1 so that max_{1≤m≤n} P(ρ_j^(m) = n | X_0 = j) > 0 for all n ≥ N. Given r ≥ N, choose 1 ≤ m ≤ r so that δ ≡ P(ρ_j^(m) = r | X_0 = j) > 0. Now, for any M ∈ Z⁺, observe that, when n_ℓ ≥ M + r, (P^{n_ℓ})_jj is equal to

P(X_{n_ℓ} = j & ρ_j^(m) = r | X_0 = j) + P(X_{n_ℓ} = j & ρ_j^(m) ≠ r | X_0 = j)
= δ (P^{n_ℓ - r})_jj + P(X_{n_ℓ} = j & ρ_j^(m) ≠ r | X_0 = j).

Furthermore,

P(X_{n_ℓ} = j & ρ_j^(m) ≠ r | X_0 = j) ≤ Σ_{k=1, k≠r}^{n_ℓ - M} P(ρ_j^(m) = k | X_0 = j) (P^{n_ℓ - k})_jj + P(ρ_j^(m) > n_ℓ - M | X_0 = j),

while

Σ_{k=1, k≠r}^{n_ℓ - M} P(ρ_j^(m) = k | X_0 = j) (P^{n_ℓ - k})_jj ≤ (1 - δ) sup_{n ≥ M} (P^n)_jj.

Hence, since j is recurrent and therefore P(ρ_j^(m) < ∞ | X_0 = j) = 1, after letting ℓ → ∞, we get

α_j⁺ ≤ δ lim sup_{ℓ→∞} (P^{n_ℓ - r})_jj + (1 - δ) sup_{n ≥ M} (P^n)_jj.

Since this is true for all M ≥ 1, it leads to

α_j⁺ ≤ δ lim sup_{ℓ→∞} (P^{n_ℓ - r})_jj + (1 - δ) α_j⁺,

which implies lim sup_{ℓ→∞} (P^{n_ℓ - r})_jj ≥ α_j⁺. But obviously lim sup_{ℓ→∞} (P^{n_ℓ - r})_jj ≤ α_j⁺, and so we have shown that lim_{ℓ→∞} (P^{n_ℓ - r})_jj = α_j⁺ for all r ≥ N. Now choose L so that n_L ≥ N, take n_ℓ⁺ = n_{ℓ+L} - N, and conclude that lim_{ℓ→∞} (P^{n_ℓ⁺ - r})_jj = α_j⁺ for all r ≥ 0. The construction of {n_ℓ⁻ : ℓ ≥ 1} is essentially the same and is left as an exercise. □

3.2.18 LEMMA. If j is aperiodic and recurrent, then lim sup_{n→∞} (P^n)_jj ≤ π_jj. Furthermore, if the subsequences {n_ℓ^± : ℓ ≥ 1} are the ones described in Lemma 3.2.17, then lim_{ℓ→∞} (P^{n_ℓ^±})_ij = α_j^± for any i with i↔j.
PROOF: The second assertion follows from Lemma 3.2.17 in the same way as the second part of Lemma 3.2.13 followed from (3.2.6): write (P^{n_ℓ^±})_ij = Σ_{m=1}^{n_ℓ^±} f^(m)_ij (P^{n_ℓ^± - m})_jj, and apply Lebesgue's Dominated Convergence Theorem together with P(ρ_j < ∞ | X_0 = i) = 1 for i↔j.

Turning to the first assertion, we again use the result in Lemma 3.2.17 to obtain

α_j⁺ Σ_{r=1}^{N} P(ρ_j ≥ r | X_0 = j) = lim_{ℓ→∞} Σ_{r=1}^{N} P(ρ_j ≥ r | X_0 = j) (P^{n_ℓ⁺ - r})_jj

for all N ≥ 1. Thus, if we show that

(*)  Σ_{r=1}^{N} P(ρ_j ≥ r | X_0 = j) (P^{n-r})_jj ≤ 1 for n ≥ N ≥ 1,

then we will know that

α_j⁺ E[ρ_j | X_0 = j] = α_j⁺ Σ_{r=1}^{∞} P(ρ_j ≥ r | X_0 = j) ≤ 1,

which is equivalent to the first assertion.
To prove (*), note that, for any n ≥ 1,

(P^n)_jj = Σ_{r=1}^{n} P(ρ_j = r | X_0 = j) (P^{n-r})_jj
= Σ_{r=1}^{n} P(ρ_j ≥ r | X_0 = j) (P^{n-r})_jj - Σ_{r=2}^{n+1} P(ρ_j ≥ r | X_0 = j) (P^{n+1-r})_jj,

and so, since P(ρ_j ≥ 1 | X_0 = j) = 1,

Σ_{r=1}^{n+1} P(ρ_j ≥ r | X_0 = j) (P^{n+1-r})_jj = Σ_{r=1}^{n} P(ρ_j ≥ r | X_0 = j) (P^{n-r})_jj

for all n ≥ 1. But Σ_{r=1}^{n} P(ρ_j ≥ r | X_0 = j) (P^{n-r})_jj = 1 when n = 1, and so we have now proved that

Σ_{r=1}^{N} P(ρ_j ≥ r | X_0 = j) (P^{n-r})_jj ≤ Σ_{r=1}^{n} P(ρ_j ≥ r | X_0 = j) (P^{n-r})_jj = 1
for all n ≥ N ≥ 1. □

We can now complete the proof of (3.2.15) when j is recurrent and aperiodic. By the first part of Lemma 3.2.18, we know that lim_{n→∞} (P^n)_jj = 0 if π_jj = 0. Thus, if π_jj = 0, then, for any i,

lim_{n→∞} (P^n)_ij = lim_{n→∞} Σ_{r=1}^{n} P(ρ_j = r | X_0 = i) (P^{n-r})_jj = 0 = π_ij.

In order to handle the case when j is positive recurrent, set C = {i : i↔j}, and take π^C accordingly. Then π^C ∈ Stat(P). In particular, by the last part of Lemma 3.2.18 and Lebesgue's Dominated Convergence Theorem,

π_jj = Σ_{i∈C} (π^C)_i (P^{n_ℓ^±})_ij → α_j^± Σ_{i∈C} (π^C)_i = α_j^±,

and so (P^n)_jj → π_jj. Finally, if i ≠ j, then, again by Lebesgue's theorem,

(P^n)_ij = Σ_{r=1}^{n} P(ρ_j = r | X_0 = i) (P^{n-r})_jj → P(ρ_j < ∞ | X_0 = i) π_jj = π_ij.
3.2.7. Periodic Structure: The preceding result allows us to give a finer analysis even in the case when the period is not 1. Namely, consider a Markov chain with transition probability matrix P which is irreducible and recurrent on S, and assume that its period is d ≥ 2. The basic result in this subsection is that there exists a partition of S into subsets S_r, 0 ≤ r < d, with the properties that

(1) (P^{md+r})_jk > 0 ⟹ r(k) - r(j) = r mod d,
(2) r(k) - r(j) = r mod d ⟹ (P^{md+r})_jk > 0 < (P^{md-r})_kj for all sufficiently large m ≥ 1,
(3) for each 0 ≤ r < d, the restriction of P^d to S_r is an aperiodic, recurrent, irreducible transition probability matrix,

where we have used r(j) to denote the 0 ≤ r < d for which j ∈ S_r.

To prove that this decomposition exists, we begin by noting that, for 0 ≤ r < d,

(*)  ∃ m ≥ 0 (P^{md+r})_ij > 0 ⟹ ∃ n ≥ 1 (P^{nd-r})_ji > 0.

Indeed, by irreducibility and the Euclidean algorithm, we know that there exist an m' ≥ 0 and 0 ≤ r' < d such that (P^{m'd+r'})_ji > 0. Furthermore, by (3.1.13), (P^{(m'+m'')d+r'})_ji ≥ (P^{m'd+r'})_ji (P^{m''d})_ii > 0 for all sufficiently large m''s, and so we may and will assume that m' ≥ 1. But then (P^{(m+m')d+(r+r')})_ii > 0, and so d | (r+r'), which, because 0 ≤ r, r' < d, means that r' = 0 if r = 0 and that r' = d - r if r ≥ 1.

Starting from (*), it is easy to see that, for each pair (i,j) ∈ S², there is a unique 0 ≤ r < d such that (P^{md+r})_ij > 0 for some m ≥ 0. Namely, suppose that (P^{md+r})_ij > 0 < (P^{m'd+r'})_ij for some m, m' ∈ N and 0 ≤ r, r' < d. Then, by (*), there exists an n ≥ 1 such that (P^{nd-r})_ji > 0, and so (P^{(m'+n)d+(r'-r)})_ii > 0. Since this means that d | (r'-r), we have proved that r = r'.

Now, let i₀ be a fixed reference point in S, and, for each 0 ≤ r < d, define S_r to be the set of j such that there exists an m ≥ 0 for which (P^{md+r})_{i₀j} > 0. By the preceding, we know that the S_r's are mutually disjoint. In addition, by irreducibility and the Euclidean algorithm, S = ⋃_{r=0}^{d-1} S_r. Turning to the proof of property (1), use (*) to choose n ≥ 0 and n' ≥ 1 so that (P^{nd+r(j)})_{i₀j} > 0 < (P^{n'd-r(k)})_{ki₀}. Then (P^{(n+m+n')d+(r(j)+r-r(k))})_{i₀i₀} > 0, and so d | (r(j)+r-r(k)). Equivalently, r(k) - r(j) = r mod d. Conversely, if r(k) - r(j) = r mod d, choose n ≥ 0 and n' ≥ 1 so that (P^{nd+r(k)})_{i₀k} > 0 and (P^{n'd-r(j)})_{ji₀} > 0. Then (P^{(n+m+n')d+r})_jk > 0 for any m ≥ 1 satisfying (P^{md})_{i₀i₀} > 0. Since, by (3.1.13), (P^{md})_{i₀i₀} > 0 for all sufficiently large m's, this completes the left hand inequality in (2), and the right hand inequality in (2) is proved in the same way. Finally, to check (3), note that, from (1), the restriction of P^d to S_r is a transition probability matrix, and, by (2), it is both irreducible and aperiodic.
The existence of such a partition has several interesting consequences. In the first place, it says that the chain proceeds through the state space in a
cyclic way: if it starts from i, then after n steps it is in S_{r(i)+n}, where the addition in the subscript should be interpreted modulo d. In fact, with this convention for addition, we have that

(3.2.19)  P^n 1_{S_{r+n}} = 1_{S_r}.

To see this, simply observe that, on the one hand, (P^n 1_{S_{r+n}})_i = 0 unless i ∈ S_r, while, on the other hand, Σ_{r=0}^{d-1} (P^n 1_{S_{r+n}})_i = 1. Hence, i ∉ S_r ⟹ (P^n 1_{S_{r+n}})_i = 0, whereas i ∈ S_r ⟹ (P^n 1_{S_{r+n}})_i = 1.

Secondly, because, for each 0 ≤ r < d, the restriction of P^d to S_r is an irreducible, recurrent, and aperiodic transition probability matrix, we know that, for each 0 ≤ r < d and j ∈ S_r, there exists a π_jj^(r) ∈ [0,1] with the property that (P^{md})_ij → π_jj^(r) for all (i,j) ∈ S_r². More generally,

(3.2.20)  lim_{m→∞} (P^{md+s})_ij = π_jj^(r(j)) if r(j) - r(i) = s mod d, and = 0 otherwise.

In particular, if (i,j) ∈ S_r², then

(A_{nd})_ij = (1/nd) Σ_{m=0}^{n-1} Σ_{s=0}^{d-1} (P^{md+s})_ij = (1/d) (1/n) Σ_{m=0}^{n-1} (P^{md})_ij → π_jj^(r) / d.

Hence, since we already know that (A_n)_ij → π_jj, it follows that

(3.2.21)  π_jj^(r) = d π_jj for 0 ≤ r < d and j ∈ S_r.

In the case when P is positive recurrent on S, so is the restriction of P^d to each S_r, and therefore Σ_{j∈S_r} π_jj^(r) = 1. Thus, in the positive recurrent case, (3.2.21) leads to the interesting conclusion that π^S assigns probability 1/d to each S_r. See Exercises 3.3.12 and 5.6.7 below for other applications of these considerations.
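A small numerical sketch (not part of the text; the 4-cycle walk is an illustrative choice) shows the cyclic decomposition at work: for the nearest-neighbor walk on the 4-cycle, d = 2, S_0 = {0,2}, S_1 = {1,3}, and (3.2.21) reads π^(r)_jj = d π_jj, here 2 × (1/4) = 1/2.

```python
# One step of P^2 stays inside S_0, and (P^{2m})_00 converges to d * pi_00 = 0.5.

def mat_mul(A, B):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)] for i in range(n)]

P = [[0.0, 0.5, 0.0, 0.5],
     [0.5, 0.0, 0.5, 0.0],
     [0.0, 0.5, 0.0, 0.5],
     [0.5, 0.0, 0.5, 0.0]]

P2 = mat_mul(P, P)
Pn = P
for _ in range(999):            # P^1000: an even power
    Pn = mat_mul(Pn, P)

print(P2[0][1], P2[0][0])       # 0.0 and 0.5: P^2 does not leave S_0
print(Pn[0][0])                 # 0.5 = pi^(0)_00 = d * pi_00
```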
3.3 Exercises

EXERCISE 3.3.1. Just as Cesàro convergence is strictly weaker than (i.e., it is implied by but does not imply) ordinary convergence, so, in this exercise, we will show that Abel convergence is strictly weaker than Cesàro convergence.

(a) Assume that {a_n}_0^∞ ⊆ R and that the radius of convergence of Σ_n a_n s^n is at least 1. That is, lim sup_{n→∞} |a_n|^{1/n} ≤ 1. Set R(s) = (1-s) Σ_{n=0}^∞ s^n a_n for s ∈ [0,1) and A_n = (1/n) Σ_{m=0}^{n-1} a_m for n ≥ 1, and show that, for s ∈ [0,1),

R(s) = (1-s)² Σ_{n=1}^∞ n s^{n-1} A_n.

Use this to conclude that A_n → a ∈ R ⟹ lim_{s↗1} R(s) = a.

(b) Take a_n = (-1)^n (n+1) for n ≥ 0, check that the radius of convergence of Σ_n a_n s^n is 1, and show that

R(s) = (1-s)/(1+s)²  and  (1/n) Σ_{m=0}^{n-1} a_m = { (n+1)/(2n) if n is odd; -1/2 if n is even }.

Hence, {a_n}_0^∞ is Abel convergent to 0 but is Cesàro divergent.
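A numerical companion to part (b) (not part of the original exercise) makes the contrast concrete: the Abel means of a_n = (-1)^n (n+1) tend to 0 as s ↗ 1, while the Cesàro means oscillate near ±1/2.

```python
# Abel vs. Cesaro for a_n = (-1)^n (n+1).

def abel(s, terms=100000):
    return (1 - s) * sum(((-1) ** n) * (n + 1) * s ** n for n in range(terms))

def cesaro(n):
    return sum(((-1) ** m) * (m + 1) for m in range(n)) / n

print(abel(0.999))                 # close to (1-s)/(1+s)^2, which is tiny near s = 1
print(cesaro(1000), cesaro(1001))  # -0.5 and about +0.5
```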
EXERCISE 3.3.2. Recall the queuing model in Exercise 1.3.12. Show that {Q_n : n ≥ 0} is an N-valued Markov chain conditioned to start from 0, and write down the transition probability matrix for this chain. Further, for this chain: show that 0 is transient if E[B₁] > 0, 0 is null recurrent if E[B₁] = 0, and that 0 is positive recurrent if E[B₁] < 0. In order to handle the case when E[B₁] = 0, you might want to refer to Exercise 1.3.11.

EXERCISE 3.3.3. Here is a test for positive recurrence. Namely, given a transition probability matrix P and an element j of the state space S, set C = {i ∈ S : i↔j}. Assume u is a non-negative function on C with the property that

(u)_i ≥ (Pu)_i + ε for all i ∈ C \ {j}

for some ε > 0.

(a) Begin by showing that

E[ u(X_{n∧ρ_j}) | X_0 = i ] + ε E[ n∧ρ_j | X_0 = i ] ≤ u(i) for all n ≥ 0 and i ∈ C \ {j},
and use this to conclude that j is positive recurrent.

(b) Suppose that S = Z and that |i| ≥ Σ_j |j| (P)_ij + ε for all i ∈ Z \ {0}. Show that 0 is positive recurrent for the chain determined by P.

EXERCISE 3.3.4. Consider the nearest neighbor random walk on Z which moves forward with probability p ∈ (1/2, 1) and backward with probability q = 1 - p. In other words, we are looking at the Markov chain on Z whose transition probability matrix P is given by (P)_ij = p if j = i+1, (P)_ij = q if j = i-1, and (P)_ij = 0 if |j-i| ≠ 1. Obviously, this chain is irreducible, and the results in §§1.2.1-1.2.2 show that 0 is transient. Thus, the considerations in Exercise 2.4.10 apply.

(a) Let P̂ be constructed from P by the prescription in Exercise 2.4.10 when j₀ = 0, and, using (1.1.12), show that

(P̂)_ij = { p if i ≤ 0 & j = i+1 or i ≥ 1 & j = i-1; q if i ≤ 0 & j = i-1 or i ≥ 1 & j = i+1; 0 otherwise. }

(b) On the basis of Exercise 2.4.10, we know that 0 is recurrent for the chain determined by P̂. Moreover, by part (b) of Exercise 3.3.3, one can check that it is positive recurrent. In fact, by combining part (b) of Exercise 2.4.10 with the computations in §1.1.4, show that

E^P̂[ρ₀ | X₀ = 0] = 2p/(p-q),

where the superscript P̂ is used to indicate that the expectation value is taken relative to the chain determined by P̂.
(c) Since P is irreducible, so is P̂. Hence, since 0 is positive recurrent for the chain determined by P̂, there is a unique stationary probability vector π̂ for P̂. Find this vector and use it to show that

E^P̂[ρ_j | X₀ = j] = { 2p/(p-q) if j ∈ {0,1}; 2p^{1+|j|}/(q^{|j|}(p-q)) if j ≤ -1; 2p^j/(q^{j-1}(p-q)) if j ≥ 2. }
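The stationary vector in (c) can be checked mechanically. The sketch below is not part of the exercise: it assumes the geometric form π̂_j ∝ (q/p)^{|j|} for j ≤ 0 and (q/p)^{j-1} for j ≥ 1 with normalizing constant (p-q)/(2p), and verifies both stationarity and the value E^P̂[ρ₀ | X₀ = 0] = 2p/(p-q) for one choice of p.

```python
# Check that the candidate stationary vector satisfies pi_hat P_hat = pi_hat.

p, q = 0.75, 0.25
L = 60                                   # truncation window [-L, L]
states = list(range(-L, L + 1))

def pi_hat(j):
    w = (q / p) ** (-j) if j <= 0 else (q / p) ** (j - 1)
    return w * (p - q) / (2 * p)         # constant from the two geometric sums

def P_hat(i, j):
    toward = (i <= 0 and j == i + 1) or (i >= 1 and j == i - 1)
    away = (i <= 0 and j == i - 1) or (i >= 1 and j == i + 1)
    return p if toward else q if away else 0.0

# stationarity at every j away from the truncation edges
err = max(abs(sum(pi_hat(i) * P_hat(i, j) for i in states) - pi_hat(j))
          for j in range(-L + 1, L))
print(err < 1e-12, abs(1 / pi_hat(0) - 2 * p / (p - q)) < 1e-9)   # True True
```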
EXERCISE 3.3.5. In §1.2.4, we worked quite hard to prove that the nearest neighbor, symmetric random walk on Z³ is transient, and one might hope that the criteria provided in §3.1.2 would allow us to avoid working so hard. However, even if one knows which function u ought to be plugged into Theorem 3.1.5 or 3.1.7, the computation needed to show that it works is rather delicate. Namely, show that if a > 0 is sufficiently large and

u(k) = ( a² + Σ_{i=1}^3 (k)_i² )^{-1/2} for k ∈ Z³,

then (Pu)_k ≤ u(k), where u is the column vector determined by the function u and P is the transition probability matrix for the nearest neighbor random walk on Z³. That is,

(P)_kℓ = { 1/6 if Σ_{i=1}^3 |(k)_i - (ℓ)_i| = 1; 0 otherwise. }

What follows are some hints.

(a) Let k ∈ Z³ be given, and set

M = 1 + a² + Σ_{i=1}^3 (k)_i²  and  x_i = (k)_i / M for 1 ≤ i ≤ 3.

Show that (Pu)_k ≤ u(k) if and only if

(1 - 1/M)^{-1/2} ≥ (1/3) Σ_{i=1}^3 [ (1+2x_i)^{1/2} + (1-2x_i)^{1/2} ] / [ 2 (1 - 4x_i²)^{1/2} ].

(b) Show that (1 - 1/M)^{-1/2} ≥ 1 + 1/(2M) and that (1+ξ)^{1/2} + (1-ξ)^{1/2} ≤ 2 - ξ²/4 for |ξ| ≤ 1, and conclude that (Pu)_k ≤ u(k) if

1 + 1/(2M) ≥ (1 - |x|²/6) (1/3) Σ_{i=1}^3 (1 - 4x_i²)^{-1/2},

where |x| = √(Σ_i x_i²) is the Euclidean length of x = (x₁, x₂, x₃).
(c) Show that there is a constant C < ∞ such that, as long as a ≥ 1,

(1/3) Σ_{i=1}^3 (1 - 4x_i²)^{-1/2} ≤ 1 + (2/3)|x|² + C|x|⁴,

and put this together with the preceding to conclude that we can take any a > 0 with a² + 1 ≥ 2C.

An analogous computation shows that, for each d ≥ 3 and all sufficiently large a > 0, the function (a² + Σ_{i=1}^d (k)_i²)^{-(d-2)/2} can be used to prove that the nearest neighbor, symmetric random walk is transient in Z^d. The reason why one does not get a contradiction when d = 2 is that the non-constant, non-negative functions which satisfy Pu ≤ u when d = 2 are of the form log(a² + (k)₁² + (k)₂²) and therefore do not achieve their maximum value at 0.

EXERCISE 3.3.6. As we said at the beginning of this chapter, we would be content here with statements about the convergence of either {(P^n)_ij : n ≥ 0} or {(A_n)_ij : n ≥ 0} for each (i,j) ∈ S². However, as we are about to see, there are circumstances in which the pointwise results we have just obtained self-improve.

(a) Assume that j is positive recurrent, and set C = {i : i↔j}. Given a probability vector μ with the property that Σ_{i∉C} (μ)_i = 0, show that, in general, (μA_n)_i → π_ii and, when j is aperiodic, (μP^n)_i → π_ii for each i ∈ C.

(b) Here is an interesting fact about convergence of series. Namely, for each m ∈ N, let {a_{m,n} : n ≥ 0} be a sequence of real numbers which converges to a real number b_m as n → ∞. Further, assume that, for each n ∈ N, the sequence {a_{m,n} : m ≥ 0} is absolutely summable. Finally, assume that

Σ_{m=0}^∞ |a_{m,n}| → Σ_{m=0}^∞ |b_m| < ∞ as n → ∞.

Show that

lim_{n→∞} Σ_{m=0}^∞ |a_{m,n} - b_m| = 0.

Hint: Using the triangle inequality, show that 0 ≤ |a_{m,n}| + |b_m| - |a_{m,n} - b_m| ≤ 2|b_m|, and apply Lebesgue's Dominated Convergence Theorem to conclude that

Σ_{m=0}^∞ ( |a_{m,n}| + |b_m| - |a_{m,n} - b_m| ) → 2 Σ_{m=0}^∞ |b_m| as n → ∞.
(c) Return to the setting in (a), and use (b) together with the result in (a) to show that, in general, ‖μA_n - π^C‖_v → 0, and ‖μP^n - π^C‖_v → 0 when j is aperiodic. In particular, for each probability vector μ with Σ_{i∈C} (μ)_i = 1 and each bounded f,

E^μ[ f(X_n) ] → Σ_{j∈C} (π^C)_j (f)_j,

where f is the column vector determined by a function f. Of course, this is still far less than what we had under Doeblin's condition since his condition provided us with a rate of convergence which was independent of μ. In general, no such uniform rate of convergence will exist.

EXERCISE 3.3.7. Here is an important interpretation of π^C when C is a positive recurrent communicating class. Namely, let i be a recurrent state and, for k ∈ S, let μ_k be the expected number of times the chain visits k before returning to i given that the chain started from i:

μ_k = E[ Σ_{m=0}^{ρ_i - 1} 1_{{k}}(X_m) | X_0 = i ].

Determine the row vector μ ∈ [0,∞]^S by (μ)_k = μ_k.

(a) Show that, for all j ∈ S,

Σ_k μ_k (P)_kj = μ_j.
Thus, without any further assumptions about i, μ is P-stationary in the sense that μ = μP.

(b) Clearly μ_i = 1, and Σ_j μ_j = ∞ unless i is positive recurrent. Nonetheless, show that μ_j = 0 unless i↔j and that μ_j ∈ (0,∞) if i↔j. Hint: Show that μ_j ≥ P(ρ_j < ρ_i | X_0 = i), and observe that, given a visit to j, the number of further visits to j before ρ_i is geometric with success probability P(ρ_i < ρ_j | X_0 = j) > 0.

(c) If i is positive recurrent, show that

μ / Σ_k μ_k = π^C.

Equivalently, when i is positive recurrent, (π^C)_j = μ_j / E[ρ_i | X_0 = i]. In words, (π^C)_j is the relative expected amount of time the chain spends at j before returning to i.
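The interpretation in (c) is easy to check exactly on a small chain. The sketch below is not from the text; the 3-state matrix is an arbitrary illustrative choice, the reference state is i = 0, and μ is computed by first-step analysis on the taboo chain (visits before hitting 0), after which μ normalized by its total mass is verified to be stationary.

```python
# Expected visits per excursion from state 0, and the induced stationary vector.

P = [[0.1, 0.6, 0.3],
     [0.4, 0.4, 0.2],
     [0.5, 0.25, 0.25]]

# G[k][l] = expected visits to state l+1, starting from k+1, before hitting 0
# (counting time 0); G = (I - Q)^{-1} for the 2x2 taboo block Q on {1, 2}.
Q = [[P[1][1], P[1][2]], [P[2][1], P[2][2]]]
det = (1 - Q[0][0]) * (1 - Q[1][1]) - Q[0][1] * Q[1][0]
G = [[(1 - Q[1][1]) / det, Q[0][1] / det],
     [Q[1][0] / det, (1 - Q[0][0]) / det]]

mu = [1.0,                                   # one visit to 0 (at time 0) per excursion
      P[0][1] * G[0][0] + P[0][2] * G[1][0],
      P[0][1] * G[0][1] + P[0][2] * G[1][1]]
E_rho = sum(mu)                              # E[rho_0 | X_0 = 0]

pi = [m / E_rho for m in mu]
err = max(abs(sum(pi[i] * P[i][j] for i in range(3)) - pi[j]) for j in range(3))
print(err)                                    # essentially 0: pi is stationary
```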
EXERCISE 3.3.8. We continue with the program initiated in Exercise 3.3.7 but assume now that the reference point i is null recurrent. In this case, Σ_j (μ)_j = ∞ when μ ∈ [0,∞)^S is the P-stationary measure introduced in Exercise 3.3.7. In this exercise we will show that, up to a multiplicative constant, μ is the only P-stationary ν ∈ [0,∞)^S with the property that (ν)_j = 0 unless i→j (and therefore i↔j). Equivalently, given such a ν, ν = (ν)_i μ.

(a) Assume that ν ∈ [0,∞)^S satisfies ν = νP. If (ν)_i = 1, show that, for all j ∈ S and n ≥ 0,

(ν)_j = Σ_{k≠i} (ν)_k P(X_n = j & ρ_i > n | X_0 = k) + E[ Σ_{m=0}^{n∧(ρ_i - 1)} 1_{{j}}(X_m) | X_0 = i ].

Hint: Work by induction on n ≥ 0. When n = 0 there is nothing to do. To carry out the inductive step, use (2.1.1) and Fubini's Theorem to show that

Σ_{k≠i} (ν)_k P(X_n = j & ρ_i > n | X_0 = k) = Σ_{k≠i} (νP)_k P(X_n = j & ρ_i > n | X_0 = k) = Σ_ℓ (ν)_ℓ P(X_{n+1} = j & ρ_i > n+1 | X_0 = ℓ).

(b) Assuming that ν = νP and (ν)_j = 0 unless i→j, show that ν = (ν)_i μ. Hint: First show that ν = 0 if (ν)_i = 0, and thereby reduce to the case when (ν)_i = 1. Starting from the result in (a), apply the Monotone Convergence Theorem to see that the right hand side tends to (μ)_j as n → ∞.
EXERCISE 3.3.9. Let C be a communicating class of positive recurrent states. The reason why Theorem 3.2.14 is called a "mean ergodic theorem" is that the asserted convergence is taking place in the sense of mean square convergence. Of course, mean square convergence implies convergence in probability, but, in general, it cannot be used to get convergence with probability 1. Nonetheless, as we will show here, when P(X_0 ∈ C) = 1 and j ∈ C,

(3.3.10)  lim_{n→∞} (1/n) Σ_{m=0}^{n-1} 1_{{j}}(X_m) = π_jj with probability 1.

Observe that, in §2.3.3, we proved the individual ergodic theorem (2.3.10) when Doeblin's condition holds, and (3.3.10) says that the same sort of individual ergodic theorem holds even when Doeblin's condition is not present. In fact, there is a very general result, of which (3.3.10) is a very special case,
which was proved originally by G.D. Birkhoff. However, we will not follow Birkhoff and instead, as we did in §2.3.3, we base our proof on the Strong Law of Large Numbers, although this time we need the full statement (cf. Theorem 1.4.11 in [9]) which holds for averages of mutually independent, identically distributed, integrable random variables.

(a) Show that it suffices to prove the result when P(X_0 = i) = 1 for some i ∈ C.

(b) Set ρ_i^(0) = 0, and use ρ_i^(m) to denote the time of the mth return to i. If

T_m = ρ_i^(m) - ρ_i^(m-1)  and  Y_m = Σ_{ℓ=ρ_i^(m-1)}^{ρ_i^(m) - 1} 1_{{j}}(X_ℓ),

show that, conditional on X_0 = i, both {T_m : m ≥ 1} and {Y_m : m ≥ 1} are sequences of mutually independent, identically distributed, non-negative, integrable random variables. In particular, as an application of the Strong Law of Large Numbers and (3.2.5) plus the result in Exercise 3.3.7, conclude that, conditional on X_0 = i,

(*)  lim_{m→∞} ρ_i^(m)/m = 1/π_ii  and  lim_{m→∞} (1/m) Σ_{ℓ=0}^{ρ_i^(m) - 1} 1_{{j}}(X_ℓ) = π_ij/π_ii

with probability 1. Hence, (1/ρ_i^(m)) Σ_{ℓ=0}^{ρ_i^(m) - 1} 1_{{j}}(X_ℓ) → π_ij with probability 1.

(c) In view of the results in (a) and (b), we will be done once we check that, conditional on X_0 = i,

lim_{n→∞} | (1/n) Σ_{ℓ=0}^{n-1} 1_{{j}}(X_ℓ) - (1/ρ_i^(m_n)) Σ_{ℓ=0}^{ρ_i^(m_n) - 1} 1_{{j}}(X_ℓ) | = 0

with probability 1, where m_n is the Z⁺-valued random variable determined so that ρ_i^(m_n - 1) ≤ n < ρ_i^(m_n). To this end, first show that

| (1/n) Σ_{ℓ=0}^{n-1} 1_{{j}}(X_ℓ) - (1/ρ_i^(m_n)) Σ_{ℓ=0}^{ρ_i^(m_n) - 1} 1_{{j}}(X_ℓ) | ≤ T_{m_n}/m_n.

Next, from the first part of (*), show that P(lim_{n→∞} m_n = ∞ | X_0 = i) = 1. Finally, check that, for any ε > 0,

P( sup_{m≥M} T_m/m ≥ ε | X_0 = i ) ≤ Σ_{m=M}^∞ P(ρ_i ≥ mε | X_0 = i) ≤ (1/ε) E[ρ_i, ρ_i ≥ Mε | X_0 = i],

and use this to complete the proof of (3.3.10).
(d) Introduce the empirical measure L_n, which is the random probability vector measuring the average time spent at points. That is, (L_n)_i = (1/n) Σ_{m=0}^{n-1} 1_{{i}}(X_m). By combining the result proved here with the one in (b) of Exercise 3.3.6, conclude that lim_{n→∞} ‖L_n - π^C‖_v = 0 with probability 1 when P(X_0 ∈ C) = 1.
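A path-wise illustration of (3.3.10) and (d) (a sketch, not from the text; the small chain and its stationary vector are the same illustrative example used earlier): along one long trajectory, the empirical measure approaches π in total variation.

```python
# Empirical measure of a single long trajectory vs. the stationary vector.

import random

P_rows = {0: ((0, 0.5), (1, 0.5)),
          1: ((0, 0.25), (1, 0.25), (2, 0.5)),
          2: ((0, 1.0),)}
pi = {0: 0.5, 1: 1.0 / 3.0, 2: 1.0 / 6.0}

rng = random.Random(3)
x, counts, n = 0, {0: 0, 1: 0, 2: 0}, 200000
for _ in range(n):
    counts[x] += 1
    u, acc = rng.random(), 0.0
    for j, p in P_rows[x]:
        acc += p
        if u < acc:
            x = j
            break

tv = 0.5 * sum(abs(counts[i] / n - pi[i]) for i in pi)
print(tv)   # small: the empirical measure is close to pi
```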
1 n ( lP' n---+limoo _.!_n "' l {j}(Xm )
= 0 = 1. m=O ) When j is transient, IE [2.':� l {j} (Xm )] and therefore the result is trivial. To handle j are null recurrent, begin by noting that it suffices to handle the case when lP'(Xo 1. Next, note that, conditional on X0 j, m+l) - PJm) 0} isj) a=sequence jz{p+-valued of independent, identically distributed, random variables, and apply the Strong Law of Large Numbers to see that, for any R 1 , PJm) H(R) I Xo j lP' ( lim m---+00 m ) ) lP' (mlim---+00 / R H(R) I X0 = j) = 1 , where H(r) �IE[pj R! Xo = j] / R/ Hence, given X0 = j, (=) P� with probability 1 . Finally, check that, for any 0, [nc]) J P ( ( sup _.!_ � l{j} (Xm ) E I Xo = j ::; lP' sup lP' n?_N n 0 n?_N n ::; �f I X0 = j) ) ::; lP' (msup?_Nc (m) ::; -f1 1 Xo j) , and combine this with the preceding to reach the desired conclusion. EXERCISE 3 . 3 . 1 2 . When j is aperiodic for P, we know that limn __,00 (P n ) ij exists for all i § and is 0 unless j is positive recurrent. When d(j) 1 and j is positive recurrent, show that limn--. oo ( P n )jj will fail to exist. On the other hand, even if d(j) 1 , show that limn--. oo(P n )ij = 0 for any j § which is not positive recurrent. L....t
< oo
=
:
=
m ?: ?:
m
J
?:
=
1\
m
?:
oo
1\
=
-?
?:
as
oo .
E>
oo
?:
!!.i_ m
=
>
E
>
E
CHAPTER 4
Markov Processes in Continuous Time
Up until now we have been dealing with Markov processes for which time has a discrete parameter n ∈ N. In this chapter we will introduce Markov processes for which time has a continuous parameter t ∈ [0,∞), even though our processes will continue to take their values in a countable state space S.

4.1 Poisson Processes
Just as Markov processes with independent, identically distributed increments (a.k.a. random walks) on Z^d are the simplest discrete parameter Markov processes, so the simplest continuous time Markov processes are those whose increments are mutually independent and homogeneous in the sense that the distribution of an increment depends only on the length of the time interval over which the increment is taken. More precisely, we will be dealing in this section with Z^d-valued stochastic processes {X(t) : t ≥ 0} with the property that P(X(0) = 0) = 1 and

P( X(t₁) - X(t₀) = j₁, ..., X(t_n) - X(t_{n-1}) = j_n ) = ∏_{m=1}^n P( X(t_m - t_{m-1}) = j_m )

for all n ≥ 1, 0 ≤ t₀ < ··· < t_n, and (j₁,...,j_n) ∈ (Z^d)^n.

4.1.1. The Simple Poisson Process: The simple Poisson process is the N-valued stochastic process {N(t) : t ≥ 0} which starts from 0 (i.e., N(0) = 0), sits at 0 for a unit exponential holding time E₁ (i.e., N(t) = 0 for 0 ≤ t < E₁ and P(E₁ > t) = e^{-t}), at time E₁ moves to 1, where it remains for an independent unit exponential holding time E₂, moves to 2 at time E₁ + E₂, etc. More precisely, if {E_n : n ≥ 1} is a sequence of mutually independent, unit exponential random variables and
J_n = { 0 when n = 0; Σ_{m=1}^n E_m when n ≥ 1 },

then the stochastic process {N(t) : t ≥ 0} given by

(4.1.1)  N(t) = max{ n ≥ 0 : J_n ≤ t }
is a simple Poisson process. When thinking about the meaning of (4.1.1), keep in mind that, because, with probability 1,

E_n > 0 for all n ≥ 1 and Σ_{m=1}^∞ E_m = ∞,
with probability 1 the path t ∈ [0,∞) ↦ N(t) is piecewise constant, right continuous, and, when it jumps, it jumps by +1: N(t) - N(t-) ∈ {0,1} for all t > 0, where N(t-) ≡ lim_{s↗t} N(s) is the left limit of N(·) at t.

We now want to show that {N(t) : t ≥ 0} moves along in independent, homogeneous increments.¹ That is, we want to show that, for each s, t ∈ [0,∞), N(s+t) - N(s) is independent of (cf. §6.1.3) σ({N(τ) : τ ∈ [0,s]}) and has the same distribution as N(t):
(4.1.2)  P( N(s+t) - N(s) = n | N(τ), τ ∈ [0,s] ) = P( N(t) = n ), n ∈ N.
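The construction (4.1.1) is easy to simulate. The sketch below is not from the text: it draws unit exponential holding times, counts jump times in an interval, and checks empirically that the mean of N(2) and of the increment N(s+2) - N(s) are both near 2, as (4.1.2) asserts they should be.

```python
# Simulating the simple Poisson process from its jump times J_n.

import math
import random

def sample_N(t, rng):
    """One sample of N(t): count partial sums of unit exponentials that land in [0, t]."""
    J, n = 0.0, 0
    while True:
        J += -math.log(1.0 - rng.random())   # next unit-exponential holding time
        if J > t:
            return n
        n += 1

def increment(s, t, rng):
    """One sample of N(s+t) - N(s): count jump times falling in (s, s+t]."""
    J, count = 0.0, 0
    while J <= s + t:
        J += -math.log(1.0 - rng.random())
        if s < J <= s + t:
            count += 1
    return count

rng = random.Random(7)
trials = 20000
mean_N = sum(sample_N(2.0, rng) for _ in range(trials)) / trials
mean_inc = sum(increment(5.0, 2.0, rng) for _ in range(trials)) / trials
print(mean_N, mean_inc)   # both near E[N(2)] = 2
```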
Equivalently, what we have to check is that, when (s,t) ∈ [0,∞)² and A ∈ σ({N(τ) : τ ∈ [0,s]}),

P( {N(s+t) - N(s) ≥ n} ∩ A ) = P( N(t) ≥ n ) P(A) for all n ∈ N.

Since this is trivial when n = 0, we will assume that n ∈ Z⁺. In addition, since we can always write A as the disjoint union of the sets A ∩ {N(s) = m}, m ∈ N, and each of these is again in σ({N(τ) : τ ∈ [0,s]}), we may and will assume that, for some m, N(s) = m on A. But in that case we can write A = {J_{m+1} > s} ∩ B, where B ∈ σ({E₁,...,E_m}) and J_m ≤ s on B. Hence, since N(s+t) ≥ m+n ⟺ J_{m+n} ≤ s+t, and σ({J_{m+ℓ} - J_m : ℓ ≥ 1}) is independent of σ({E_k : 1 ≤ k ≤ m}), an application of (6.4.2) shows that

P( {N(s+t) - N(s) ≥ n} ∩ A ) = P( {J_{m+n} ≤ s+t} ∩ {J_{m+1} > s} ∩ B )
= P( {J_{m+n} - J_m ≤ s+t-J_m} ∩ {J_{m+1} - J_m > s-J_m} ∩ B ) = E[ v(J_m), B ],
where, for � E [0, s] ,
v(�) IP' (Pm+n - Im � s + t - 0 n Pm+1 - Im > s - 0) = IP'( Pn � s + t - 0 n {E1 > s - 0) = IP'( Pn � s + t - 0 n {En > s - 0) = IP'( Pn - 1 + En � s + t - 0 n {En > s - 0) = lE [w (�, En ), En > s - �] when w ( � , IP'(Jn- 1 � s + t - � - for � E [0, s] and E [s - �' s + t - �] . Up to this point we have not used any property of exponential random =
TJ )
=
TJ )
TJ
variables other than that they of positive. However, in our next step we will
¹ Actually, our primary goal here is to develop a line of reasoning which will serve us well later on. A more straightforward, but less revealing, proof that the simple Poisson process has independent, homogeneous increments is given in Exercise 4.5.1 below.
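The construction above translates directly into code. The following is an illustrative sketch (not from the text): given holding times $E_1,E_2,\dots$, it builds the jump times $J_n$ and evaluates $N(t)=\max\{n\ge 0:J_n\le t\}$, which makes the right continuity of the path visible at the jump times.

```python
import itertools

def poisson_path(holding_times):
    """Build the jump times J_n = E_1 + ... + E_n from holding times and
    return them together with the counting function N(t) = max{n : J_n <= t}."""
    jump_times = list(itertools.accumulate(holding_times))

    def N(t):
        # number of jump times J_n that have occurred by time t
        return sum(1 for J in jump_times if J <= t)

    return jump_times, N

# Deterministic holding times, chosen only to make the path shape visible;
# in the process itself they would be independent unit exponentials.
jumps, N = poisson_path([0.5, 1.0, 0.3])
# N is 0 on [0, 0.5), jumps to 1 at t = 0.5, to 2 at t = 1.5, to 3 at t = 1.8
```

Note that $N(J_n)=n$, not $n-1$: the path is right continuous at each jump, exactly as described above.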
4.1 Poisson Processes
use their characteristic property, namely, the fact that an exponential random variable "has no memory." That is,
$$\mathbb{P}(E>a+b\mid E>a)=\mathbb{P}(E>b),$$
from which it is an easy step to
$$\mathbb{E}\bigl[f(E),\,E>a\bigr]=e^{-a}\,\mathbb{E}\bigl[f(a+E)\bigr]\tag{4.1.3}$$
for any non-negative, $\mathcal{B}_{[0,\infty)}$-measurable function $f$. In particular, this means that
$$v(\xi)=\mathbb{E}\bigl[w(\xi,E_n),\,E_n>s-\xi\bigr]=e^{-(s-\xi)}\,\mathbb{E}\bigl[w(\xi,E_n+s-\xi)\bigr]=e^{-(s-\xi)}\,\mathbb{P}\bigl(J_{n-1}\le t-E_n\bigr)=e^{-(s-\xi)}\,\mathbb{P}\bigl(J_n\le t\bigr)=e^{-(s-\xi)}\,\mathbb{P}\bigl(N(t)\ge n\bigr).$$
Hence, we have now shown that
$$\mathbb{P}\bigl(\{N(s+t)-N(s)\ge n\}\cap A\bigr)=\mathbb{E}\bigl[e^{-(s-J_m)},B\bigr]\,\mathbb{P}\bigl(N(t)\ge n\bigr).$$
Finally, since
$$\mathbb{P}(A)=\mathbb{P}\bigl(\{J_{m+1}>s\}\cap B\bigr)=\mathbb{P}\bigl(\{E_{m+1}>s-J_m\}\cap B\bigr)=\mathbb{E}\bigl[e^{-(s-J_m)},B\bigr],$$
the proof of (4.1.2) is complete. Hence, we now know that the simple Poisson process $\{N(t):t\ge 0\}$ has homogeneous and mutually independent increments.
Before moving on, we must still find out what is the distribution of $N(t)$. But the sum of $n$ mutually independent, unit exponential random variables has a $\Gamma(n)$-distribution, and so
$$\mathbb{P}\bigl(N(t)=n\bigr)=\mathbb{P}\bigl(J_n\le t<J_{n+1}\bigr)=\mathbb{P}\bigl(J_n\le t\bigr)-\mathbb{P}\bigl(J_{n+1}\le t\bigr)=\frac{1}{(n-1)!}\int_0^t\tau^{n-1}e^{-\tau}\,d\tau-\frac{1}{n!}\int_0^t\tau^{n}e^{-\tau}\,d\tau=\frac{t^n e^{-t}}{n!}.$$
That is, $N(t)$ is a Poisson random variable with mean $t$. More generally, when we combine this with (4.1.2), we get that, for all $s,t\in[0,\infty)$ and $n\in\mathbb{N}$,
$$\mathbb{P}\bigl(N(s+t)-N(s)=n\bigm|N(\tau),\,\tau\in[0,s]\bigr)=e^{-t}\,\frac{t^n}{n!}.\tag{4.1.4}$$
Alternatively, again starting from (4.1.2), we can now give the following Markovian description of $\{N(t):t\ge 0\}$:
$$\mathbb{P}\bigl(N(s+t)=n\bigm|N(\tau),\,\tau\in[0,s]\bigr)=e^{-t}\,\frac{t^{\,n-N(s)}}{\bigl(n-N(s)\bigr)!}\,\mathbf{1}_{[0,n]}\bigl(N(s)\bigr).\tag{4.1.5}$$

4.1.2. Compound Poisson Processes on $\mathbb{Z}^d$: Having constructed the simplest Poisson process, we can easily construct a rich class of processes which are the continuous time analogs of random walks on $\mathbb{Z}^d$. Namely, suppose that $\mu$
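The identity $\mathbb{P}(N(t)=n)=e^{-t}t^n/n!$ can be checked against the holding-time construction by Monte Carlo; the sketch below (sample size and tolerance are arbitrary choices, not from the text) compares empirical frequencies with the Poisson weights.

```python
import math
import random

def sample_N(t, rng):
    """Sample N(t) by accumulating unit exponential holding times E_n
    until the running sum J_n first exceeds t."""
    J, n = 0.0, 0
    while True:
        J += rng.expovariate(1.0)   # a unit exponential holding time
        if J > t:
            return n
        n += 1

rng = random.Random(42)
t, trials = 1.5, 20000
counts = [sample_N(t, rng) for _ in range(trials)]
freqs = {n: counts.count(n) / trials for n in range(4)}
exact = {n: math.exp(-t) * t ** n / math.factorial(n) for n in range(4)}
# freqs[n] should match exact[n] up to Monte Carlo error of a percent or two
```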
is a probability vector on $\mathbb{Z}^d$ which gives $0$ mass to the origin $\mathbf{0}$. Then the compound Poisson process with jump distribution $\mu$ and rate $R\in(0,\infty)$ is the stochastic process $\{X(t):t\ge 0\}$ which starts at $\mathbf{0}$, sits there for an exponential holding time having mean value $R^{-1}$, at which time it jumps by the amount $k\in\mathbb{Z}^d$ with probability $(\mu)_k$, sits where it lands for another, independent holding time with mean $R^{-1}$, jumps again a random amount with distribution $\mu$, etc. Thus, the simple Poisson process is the case when $d=1$, $(\mu)_1=1$, and $R=1$. In particular, the amount by which the simple Poisson process jumps is deterministic, whereas the amount by which a compound Poisson process jumps will, in general, be random.
To construct a compound Poisson process, let $\{B_n:n\ge 1\}$ be a sequence of mutually independent $\mathbb{Z}^d$-valued random variables with distribution $\mu$, introduce the random walk $X_0=\mathbf{0}$ and $X_n=\sum_{m=1}^{n}B_m$ for $n\ge 1$, and define $\{X(t):t\ge 0\}$ so that $X(t)=X_{N(Rt)}$, where $\{N(t):t\ge 0\}$ is a simple Poisson process which is independent of the $B_m$'s. The existence of all these random variables is guaranteed (cf. the footnote in §1.2.1) by Theorem 6.3.2. Obviously, $X(0)=\mathbf{0}$ and $t\in[0,\infty)\longmapsto X(t)\in\mathbb{Z}^d$ is a piecewise constant, right continuous $\mathbb{Z}^d$-valued path. In addition, because $(\mu)_{\mathbf{0}}=0$, it is clear² that the number of jumps that $t\rightsquigarrow X(t)$ makes during a time interval $(s,t]$ is precisely $N(Rt)-N(Rs)$ and that $X_n-X_{n-1}$ is the amount of the $n$th jump of $t\rightsquigarrow X(t)$. Thus, if $J_0=0$ and, for $n\ge 1$, $J_n$ is the time of the $n$th jump of $t\rightsquigarrow X(t)$, then $N(Rt)=n\iff J_n\le t<J_{n+1}$, and $X(J_n)-X(J_{n-1})=X_n-X_{n-1}$. Equivalently, if $\{E_n:n\ge 1\}$ denotes the sequence of unit exponential random variables out of which $\{N(t):t\ge 0\}$ is built, then $J_n-J_{n-1}=\frac{E_n}{R}$, $X(t)-X(t-)=\mathbf{0}$ for $t\in(J_{n-1},J_n)$, and $X(J_n)-X(J_{n-1})=B_n$. Hence, $\{X(t):t\ge 0\}$ is indeed a compound Poisson process with jump distribution $\mu$ and rate $R$.
We next want to show that a compound Poisson process moves along in homogeneous, mutually independent increments:
$$\mathbb{P}\bigl(X(s+t)-X(s)=k\bigm|X(\tau),\,\tau\in[0,s]\bigr)=\mathbb{P}\bigl(X(t)=k\bigr),\qquad k\in\mathbb{Z}^d.\tag{4.1.6}$$
For this purpose, we use the representation $X(t)=X_{N(Rt)}$ introduced above. Given $A\in\sigma(\{X(\tau):\tau\in[0,s]\})$, we need to show that
$$\mathbb{P}\bigl(\{X(s+t)-X(s)=k\}\cap A\bigr)=\mathbb{P}\bigl(X(s+t)-X(s)=k\bigr)\,\mathbb{P}(A),$$
and, just as in the derivation of (4.1.2), we will, without loss in generality, assume that, for some $m\in\mathbb{N}$, $N(Rs)=m$ on $A$. But then $A$ is independent of $\sigma\bigl(\{X_{m+n}-X_m:n\ge 0\}\cup\{N(R(s+t))-N(Rs)\}\bigr)$, and so
$$\begin{aligned}\mathbb{P}\bigl(\{X(s+t)-X(s)=k\}\cap A\bigr)&=\sum_{n=0}^{\infty}\mathbb{P}\bigl(\{X(s+t)-X(s)=k\ \&\ N(R(s+t))-N(Rs)=n\}\cap A\bigr)\\&=\sum_{n=0}^{\infty}\mathbb{P}\bigl(\{X_{m+n}-X_m=k\ \&\ N(R(s+t))-N(Rs)=n\}\cap A\bigr)\\&=\sum_{n=0}^{\infty}\mathbb{P}\bigl(X_n=k\bigr)\,\mathbb{P}\bigl(N(Rt)=n\bigr)\,\mathbb{P}(A)\\&=\sum_{n=0}^{\infty}\mathbb{P}\bigl(X_n=k\ \&\ N(Rt)=n\bigr)\,\mathbb{P}(A)=\mathbb{P}\bigl(X(t)=k\bigr)\,\mathbb{P}(A).\end{aligned}$$
Hence, (4.1.6) is proved.
Finally, to compute the distribution of $X(t)$, begin by recalling that the distribution of the sum of $n$ independent, identically distributed random variables is the $n$-fold convolution of their distribution. Thus, $\mathbb{P}(X_n=k)=(\mu^{*n})_k$, where $(\mu^{*0})_k=\delta_{\mathbf{0},k}$ is the point mass at $\mathbf{0}$ and
$$(\mu^{*n})_k=\sum_{j\in\mathbb{Z}^d}(\mu^{*(n-1)})_{k-j}\,(\mu)_j\quad\text{for }n\ge 1.$$
Hence,
$$\mathbb{P}\bigl(X(t)=k\bigr)=\sum_{n=0}^{\infty}\mathbb{P}\bigl(X_n=k\ \&\ N(Rt)=n\bigr)=e^{-Rt}\sum_{n=0}^{\infty}\frac{(Rt)^n}{n!}\,(\mu^{*n})_k.$$
Putting this together with (4.1.6), we now see that, for $A\in\sigma(\{X(\tau):\tau\in[0,s]\})$,
$$\begin{aligned}\mathbb{P}\bigl(\{X(s+t)=k\}\cap A\bigr)&=\sum_{j\in\mathbb{Z}^d}\mathbb{P}\bigl(\{X(s+t)=k\}\cap A\cap\{X(s)=j\}\bigr)\\&=\sum_{j\in\mathbb{Z}^d}\mathbb{P}\bigl(\{X(s+t)-X(s)=k-j\}\cap A\cap\{X(s)=j\}\bigr)\\&=\sum_{j\in\mathbb{Z}^d}\bigl(\mathbf{P}(t)\bigr)_{jk}\,\mathbb{P}\bigl(A\cap\{X(s)=j\}\bigr)=\mathbb{E}\bigl[\bigl(\mathbf{P}(t)\bigr)_{X(s)k},A\bigr],\end{aligned}$$
where
$$\bigl(\mathbf{P}(t)\bigr)_{jk}=e^{-Rt}\sum_{n=0}^{\infty}\frac{(Rt)^n}{n!}\,(\mu^{*n})_{k-j}.\tag{4.1.7}$$
² This is the reason for our having assumed that $(\mu)_{\mathbf{0}}=0$. However, one should realize that this assumption causes no loss in generality. Namely, if $(\mu)_{\mathbf{0}}=1$, then the resulting compound process would be trivial: it would never move. On the other hand, if $(\mu)_{\mathbf{0}}\in(0,1)$, then we could replace $\mu$ by $\hat{\mu}$, where $(\hat{\mu})_{\mathbf{0}}=0$ and $(\hat{\mu})_k=(1-(\mu)_{\mathbf{0}})^{-1}(\mu)_k$ when $k\ne\mathbf{0}$, and $R$ by $\hat{R}=(1-(\mu)_{\mathbf{0}})R$. The compound Poisson process corresponding to $\hat{\mu}$ and $\hat{R}$ would have exactly the same distribution as the one corresponding to $\mu$ and $R$.
Equivalently, we have proved that $\{X(t):t\ge 0\}$ is a continuous time Markov process with transition probability $t\rightsquigarrow\mathbf{P}(t)$ in the sense that
$$\mathbb{P}\bigl(X(s+t)=k\bigm|X(\sigma),\,\sigma\in[0,s]\bigr)=\bigl(\mathbf{P}(t)\bigr)_{X(s)k}.\tag{4.1.8}$$
Observe that, as a consequence of (4.1.8), we find that $\{\mathbf{P}(t):t\ge 0\}$ is a semigroup. That is, it satisfies the Chapman–Kolmogorov equation
$$\mathbf{P}(s+t)=\mathbf{P}(s)\mathbf{P}(t),\qquad s,t\in[0,\infty).\tag{4.1.9}$$
Indeed,
$$\bigl(\mathbf{P}(s+t)\bigr)_{\mathbf{0}k}=\sum_{j\in\mathbb{Z}^d}\mathbb{P}\bigl(X(s+t)=k\ \&\ X(s)=j\bigr)=\sum_{j\in\mathbb{Z}^d}\bigl(\mathbf{P}(t)\bigr)_{jk}\bigl(\mathbf{P}(s)\bigr)_{\mathbf{0}j}=\bigl(\mathbf{P}(s)\mathbf{P}(t)\bigr)_{\mathbf{0}k},$$
from which the asserted matrix equality follows immediately when one remembers that $(\mathbf{P}(\tau))_{k\ell}=(\mathbf{P}(\tau))_{\mathbf{0}(\ell-k)}$.
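The representation $X(t)=X_{N(Rt)}$ is also the natural way to simulate a compound Poisson path. A minimal one-dimensional sketch (the jump distribution below is invented for illustration and gives $0$ mass to the origin, as required):

```python
import random

def compound_poisson(t, R, mu, rng):
    """X(t) = X_{N(Rt)}: a rate-R exponential clock drives the jumps, and
    each jump is an independent sample from the distribution mu on Z (d = 1)."""
    values, weights = zip(*mu.items())
    clock, x = 0.0, 0
    while True:
        clock += rng.expovariate(R)      # holding times have mean 1/R
        if clock > t:
            return x
        x += rng.choices(values, weights)[0]

mu = {1: 0.7, -1: 0.3}                   # illustrative jump distribution
rng = random.Random(7)
samples = [compound_poisson(1.0, 2.0, mu, rng) for _ in range(20000)]
# E[X(1)] = R * t * sum_k k (mu)_k = 2 * 0.4 = 0.8, up to Monte Carlo error
```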
4.2 Markov Processes with Bounded Rates
There are two directions in which one can generalize the preceding without destroying the Markov property. For one thing, one can make the distribution of jumps depend on where the process is at the time it makes the jump. This change comes down to replacing the random walk in the compound Poisson process by a more general Markov chain. The second way to increase the randomness is to make the rate of jumping depend on the place where the process is waiting before it jumps. That is, instead of the holding times all having the same mean value, the holding time at a particular state will depend on that state.

4.2.1. Basic Construction: Let $\mathbb{S}$ be a countable state space and $P$ a transition probability matrix with the property that $(P)_{ii}=0$ for all $i\in\mathbb{S}$.³ Further, let $\mathfrak{R}=\{R_i:i\in\mathbb{S}\}\subseteq[0,\infty)$ be a family of rates. Then a continuous time Markov process on $\mathbb{S}$ with rates $\mathfrak{R}$ and transition probability matrix $P$ is an $\mathbb{S}$-valued family $\{X(t):t\ge 0\}$ of random variables with the properties that
(a) $t\rightsquigarrow X(t)$ is piecewise constant and right continuous,
(b) if $J_0=0$ and, for $n\ge 1$, $J_n$ is the time of the $n$th jump of $t\rightsquigarrow X(t)$, then
$$\mathbb{P}\bigl(J_n>J_{n-1}+t\ \&\ X(J_n)=j\bigm|X(\tau),\,\tau\in[0,J_n)\bigr)=e^{-tR_{X(J_{n-1})}}\,(P)_{X(J_{n-1})j}\quad\text{on }\{J_{n-1}<\infty\}.\tag{4.2.1}$$
³ We make this assumption for the same reason as we assumed in §4.1.2 that $(\mu)_{\mathbf{0}}=0$, and, just as it resulted in no loss in generality there, so it does not reduce the generality here.
Our first task is to see that, together with the initial distribution, the preceding completely determines the distribution of $\{X(t):t\ge 0\}$, and, for reasons which will become clear soon, we will restrict our attention throughout this section to the case when the rates $\mathfrak{R}$ are bounded in the sense that $\sup_i R_i<\infty$. In fact, for the moment, we will also assume that the rates $\mathfrak{R}$ are non-degenerate in the sense that $\mathfrak{R}\subseteq(0,\infty)$. Since it is clear that non-degeneracy implies that $\mathbb{P}(J_n<\infty)=1$ for each $n\ge 0$, we may (cf. (6.1.5)) and will assume that $J_n<\infty$ for all $n\ge 0$. Now set $X_n=X(J_n)$ and $E_n=R_{X_{n-1}}(J_n-J_{n-1})$ for $n\ge 1$, and observe, by (b) in (4.2.1), that $\{X_n:n\ge 0\}$ is a Markov chain with transition probability matrix $P$ and the same initial distribution as the distribution of $X(0)$, that $\{E_n:n\ge 1\}$ is a sequence of mutually independent, unit exponential random variables, and that $\sigma(\{X_n:n\ge 0\})$ is independent of $\sigma(\{E_n:n\ge 1\})$. Thus, the joint distribution of $\{X_n:n\ge 0\}$ and $\{E_n:n\ge 1\}$ is uniquely determined. Moreover, $\{X(t):t\ge 0\}$ can be recovered from $\{X_n:n\ge 0\}\cup\{E_n:n\ge 1\}$. Namely, given $(e_1,\dots,e_n,\dots)\in(0,\infty)^{\mathbb{Z}^+}$ and $(j_0,\dots,j_n,\dots)\in\mathbb{S}^{\mathbb{N}}$, define
$$\Phi^{(\mathfrak{R},P)}\bigl(t;(e_1,\dots,e_n,\dots),(j_0,\dots,j_n,\dots)\bigr)=j_n\quad\text{for }\xi_n\le t<\xi_{n+1},\tag{4.2.2}$$
where $\xi_0=0$ and $\xi_n=\sum_{m=1}^{n}R_{j_{m-1}}^{-1}e_m$ when $n\ge 1$. Clearly,
$$X(t)=\Phi^{(\mathfrak{R},P)}\bigl(t;(E_1,\dots,E_n,\dots),(X_0,\dots,X_n,\dots)\bigr)\quad\text{for }0\le t<\sum_{m=1}^{\infty}R_{X_{m-1}}^{-1}E_m.\tag{4.2.3}$$
Thus, we will know that the distribution of $\{X(t):t\ge 0\}$ is uniquely determined once we check that $\sum_{m=1}^{\infty}R_{X_{m-1}}^{-1}E_m=\infty$ with probability 1. But this is precisely why we made the assumption that $\mathfrak{R}$ is bounded. Namely, by either the Strong Law of Large Numbers or explicit calculation (e.g. of $\mathbb{E}\bigl[\exp\bigl(-\sum_1^{\infty}E_m\bigr)\bigr]$), we know that $\sum_1^{\infty}E_m=\infty$ with probability 1. Hence, the boundedness of $\mathfrak{R}$ is more than enough to guarantee that $\sum_1^{\infty}R_{X_{m-1}}^{-1}E_m=\infty$ with probability 1.
To handle⁴ the degenerate case, the one when some of the $R_i$'s may be $0$, set $\mathbb{S}_0=\{i:R_i=0\}$ and determine $\hat{\mathfrak{R}}$ so that $\hat{R}_i=R_i$ if $i\notin\mathbb{S}_0$ and $\hat{R}_i=1$ if $i\in\mathbb{S}_0$. Our goal is to show that the distribution of $\{X(t):t\ge 0\}$ is the same as that of $\{\hat{X}(t\wedge\zeta):t\ge 0\}$, where $\{\hat{X}(t):t\ge 0\}$ is a process with rates
⁴ The discussion which follows is a little technical and need not be fully assimilated in order to proceed.
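The recipe (4.2.2)/(4.2.3) — jump chain plus state-dependent exponential holding times — translates directly into a simulation. A sketch, with the convention that a state with $R_i=0$ traps the path forever; the rates and $P$ below are invented for illustration:

```python
import random

def sample_path(x0, rates, P, t_max, rng):
    """Simulate (4.2.2)/(4.2.3): from state i wait an Exp(R_i) holding time
    (that is, E_n / R_i with E_n unit exponential), then jump according to
    row i of P; a state with R_i = 0 traps the path forever."""
    times, states = [0.0], [x0]
    t, i = 0.0, x0
    while True:
        if rates[i] == 0.0:
            return times, states         # degenerate state: no further jumps
        t += rng.expovariate(rates[i])
        if t > t_max:
            return times, states
        js, ps = zip(*P[i].items())
        i = rng.choices(js, ps)[0]
        times.append(t)
        states.append(i)

# Illustrative three-state example; note (P)_{ii} = 0, and state 2 is a trap.
rates = {0: 1.0, 1: 2.0, 2: 0.0}
P = {0: {1: 1.0}, 1: {0: 0.5, 2: 0.5}, 2: {}}
times, states = sample_path(0, rates, P, 50.0, random.Random(1))
```

The trap convention is exactly the role played by $\mathbb{S}_0$ in the degenerate case discussed above: the simulated path is the $\hat{\mathfrak{R}}$-process stopped when it hits a zero-rate state.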
$\hat{\mathfrak{R}}$ and transition probability matrix $P$ and $\zeta=\inf\{t\ge 0:\hat{X}(t)\in\mathbb{S}_0\}$ is the first time $t\rightsquigarrow\hat{X}(t)$ hits $\mathbb{S}_0$. That is, we are claiming that the distribution of the process $\{X(t):t\ge 0\}$ is that of the process $\{\hat{X}(t):t\ge 0\}$ stopped when it hits $\mathbb{S}_0$.
To verify the preceding assertion, let $\{\hat{X}_n^{(i)}:n\ge 0\ \&\ i\in\mathbb{S}_0\}$ be a family of $\mathbb{S}$-valued random variables and $\{\hat{E}_n:n\ge 1\}$ a family of $(0,\infty)$-valued random variables with the properties that
(1) $\sigma(\{\hat{X}_n^{(i)}:n\ge 0\ \&\ i\in\mathbb{S}_0\})$, $\sigma(\{\hat{E}_n:n\ge 1\})$, and $\sigma(\{X(t):t\ge 0\})$ are mutually independent,
(2) for each $i\in\mathbb{S}_0$, $\{\hat{X}_n^{(i)}:n\ge 0\}$ is a Markov chain starting from $i$ with transition probability matrix $P$,
(3) $\{\hat{E}_n:n\ge 1\}$ are mutually independent, unit exponential random variables.
Then, for each $i\in\mathbb{S}_0$, the process $\{\hat{X}^{(i)}(t):t\ge 0\}$ given by
$$\hat{X}^{(i)}(t)=\Phi^{(\hat{\mathfrak{R}},P)}\bigl(t;(\hat{E}_1,\dots,\hat{E}_n,\dots),(\hat{X}_0^{(i)},\dots,\hat{X}_n^{(i)},\dots)\bigr)$$
starts at $i$ and is associated with the rates $\hat{\mathfrak{R}}$ and transition probability matrix $P$. Finally, define $\{\hat{X}(t):t\ge 0\}$ so that
$$\hat{X}(t)=\begin{cases}X(t)&\text{if }t<\zeta\\ \hat{X}^{(X(\zeta))}(t-\zeta)&\text{if }t\ge\zeta.\end{cases}$$
Obviously, $X(t)=\hat{X}(t\wedge\zeta)$. Hence, we will be done once we show that the distribution of $\{\hat{X}(t):t\ge 0\}$ is that of a process associated with the rates $\hat{\mathfrak{R}}$ and transition probability matrix $P$. To see this, set $\hat{J}_0=0$ and, for $m\ge 1$, let $\hat{J}_m$ be the time of the $m$th jump of $t\rightsquigarrow\hat{X}(t)$. Next, suppose that $A\in\sigma(\{\hat{X}(\tau):\tau\in[0,\hat{J}_n)\})$. We need to show that
$$\mathbb{P}\bigl(\{\hat{J}_n>\hat{J}_{n-1}+t\ \&\ \hat{X}(\hat{J}_n)=j\}\cap A\bigr)=\mathbb{E}\bigl[e^{-t\hat{R}_{\hat{X}(\hat{J}_{n-1})}}\,(P)_{\hat{X}(\hat{J}_{n-1})j},A\bigr],\tag{$*$}$$
and because we can always write $A$ as the disjoint union of sets of the form $\{\hat{X}(\hat{J}_m)=j_m\text{ for }0\le m<n\}$, we may and will assume that $A$ itself is of this form. If $R_{j_m}>0$ for each $0\le m<n$, then $A\in\sigma(\{X(\sigma):\sigma\in[0,J_n)\})$, and $(\hat{J}_\ell,\hat{X}(\hat{J}_\ell))=(J_\ell,X(J_\ell))$ for $0\le\ell\le n$ on $A$. Thus ($*$) holds in this case. If $j_\ell\in\mathbb{S}_0$ for some $0\le\ell<n$, use $m$ to denote the first such $\ell$, set $i=j_m$, and let $\{\hat{J}_\ell^{(i)}:\ell\ge 0\}$ be the jump times of $t\rightsquigarrow\hat{X}^{(i)}(t)$. Then we can write $A=B\cap C$, where $B=\{X(J_\ell)=j_\ell\text{ for }0\le\ell\le m\}$ and
$$C=\begin{cases}\bigl\{\hat{X}^{(i)}(\hat{J}_\ell^{(i)})=j_{m+\ell}\text{ for }1\le\ell\le n-1-m\bigr\}&\text{if }0\le m<n-1\\ \text{the whole sample space}&\text{if }m=n-1.\end{cases}$$
Hence,
$$\mathbb{P}\bigl(\{\hat{J}_n>\hat{J}_{n-1}+t\ \&\ \hat{X}(\hat{J}_n)=j\}\cap A\bigr)=\mathbb{P}\bigl(\{\hat{J}^{(i)}_{n-m}>\hat{J}^{(i)}_{n-m-1}+t\ \&\ \hat{X}^{(i)}(\hat{J}^{(i)}_{n-m})=j\}\cap C\bigr)\,\mathbb{P}(B),$$
and so ($*$) holds in this case also.
In view of the preceding, we know that the distribution of $\{X(t):t\ge 0\}$ is uniquely determined by (4.2.1) plus its initial distribution, and, in fact, that its distribution is the same as that of the process determined by (4.2.3). Of course, in the degenerate case, although (4.2.3) holds, the exponential random variables $\{E_n:n\ge 1\}$ and the Markov chain $\{X_n:n\ge 0\}$ will not be functions of the process $\{X(t):t\ge 0\}$. However, now that we have proved the uniqueness of their distribution, there is no need to let this bother us, and so we may and will always assume that our process is given by (4.2.3).

4.2.2. The Markov Property: We continue with the assumption that the rates $\mathfrak{R}$ are bounded, and we now want to check that the process $\{X(t):t\ge 0\}$ described above possesses the Markov property:
$$\mathbb{P}\bigl(X(s+t)=j\bigm|X(\tau),\,\tau\in[0,s]\bigr)=\bigl(P(t)\bigr)_{X(s)j},\quad\text{where }\bigl(P(t)\bigr)_{ij}=\mathbb{P}\bigl(X(t)=j\bigm|X(0)=i\bigr).\tag{4.2.4}$$
For this purpose, it will be useful to have checked that, for $A\in\sigma(\{X(\tau):\tau\in[0,s]\})$ with $X(s)=i$ on $A$,
$$A_m\equiv A\cap\{J_m\le s<J_{m+1}\}=\{E_{m+1}>R_i(s-J_m)\}\cap B_m,\quad\text{where }B_m\in\sigma\bigl(\{E_1,\dots,E_m\}\cup\{X_0,\dots,X_m\}\bigr).\tag{4.2.5}$$
Then
$$\mathbb{P}\bigl(\{X(s+t)=j\}\cap A\bigr)=\sum_{m=0}^{\infty}\mathbb{P}\bigl(\{X(s+t)=j\}\cap A_m\bigr)=\sum_{m=0}^{\infty}\mathbb{P}\bigl(\{X(s+t)=j\ \&\ E_{m+1}>R_i(s-J_m)\}\cap B_m\bigr)$$
and, by (4.2.3), (4.2.5), and (4.1.3),
$$\begin{aligned}&\mathbb{P}\bigl(\{X(s+t)=j\ \&\ E_{m+1}>R_i(s-J_m)\}\cap B_m\bigr)\\&\quad=\mathbb{P}\Bigl(\bigl\{\Phi^{(\mathfrak{R},P)}\bigl(t;(E_{m+1}-R_i(s-J_m),E_{m+2},\dots,E_{m+n},\dots),(i,X_{m+1},\dots,X_{m+n},\dots)\bigr)=j\bigr\}\cap\{E_{m+1}>R_i(s-J_m)\}\cap B_m\Bigr)\\&\quad=\mathbb{P}\bigl(X(t)=j\bigm|X(0)=i\bigr)\,\mathbb{E}\bigl[e^{-R_i(s-J_m)},B_m\bigr]=\bigl(P(t)\bigr)_{ij}\,\mathbb{P}(A_m),\end{aligned}$$
from which (4.2.4) is now an immediate consequence.
Just as (4.1.8) implied the semigroup property (cf. (4.1.9)) for the transition probability matrices there, so (4.2.4) implies that the family $\{P(t):t\ge 0\}$ is a semigroup. In fact, the proof is the same as it was there:
$$\bigl(P(s+t)\bigr)_{ij}=\sum_{k\in\mathbb{S}}\mathbb{P}\bigl(X(s+t)=j\ \&\ X(s)=k\bigm|X(0)=i\bigr)=\sum_{k\in\mathbb{S}}\bigl(P(t)\bigr)_{kj}\bigl(P(s)\bigr)_{ik}=\bigl(P(s)P(t)\bigr)_{ij}.$$
In addition, it is important to realize that, at least in principle, the distribution of a Markov process is uniquely determined by its initial distribution together with the semigroup of transition probability matrices $\{P(t):t>0\}$. To be precise, suppose that $\{X(t):t\ge 0\}$ is a family of $\mathbb{S}$-valued random variables for which (4.2.4) holds, and let $\mu$ be its initial distribution (i.e., the distribution of $X(0)$). Then, for all $n\ge 1$, $0=t_0<t_1<\cdots<t_n$, and $j_0,\dots,j_n\in\mathbb{S}$,
$$\mathbb{P}\bigl(X(t_m)=j_m\text{ for }0\le m\le n\bigr)=(\mu)_{j_0}\bigl(P(t_1-t_0)\bigr)_{j_0j_1}\cdots\bigl(P(t_n-t_{n-1})\bigr)_{j_{n-1}j_n}.\tag{4.2.6}$$
To verify this, first note that, by (4.2.4),
$$\mathbb{P}\bigl(X(t_0)=j_0\ \&\ X(t_1)=j_1\bigr)=(\mu)_{j_0}\bigl(P(t_1-t_0)\bigr)_{j_0j_1}.$$
Thus (4.2.6) holds for $n=1$. Now let $n\ge 2$ be given, assume that (4.2.6) holds for $n-1$, and set $A=\{X(t_m)=j_m:0\le m\le n-1\}$. Then, by (4.2.4),
$$\mathbb{P}\bigl(X(t_m)=j_m\text{ for }0\le m\le n\bigr)=\mathbb{P}\bigl(\{X(t_n)=j_n\}\cap A\bigr)=\bigl(P(t_n-t_{n-1})\bigr)_{j_{n-1}j_n}\,\mathbb{P}(A),$$
and so (4.2.6) follows from the induction hypothesis. Finally, by the application of Theorem 6.1.6 given at the end of §6.1.4, the distribution of $\{X(t):t\ge 0\}$ is completely determined by the probabilities it assigns to sets of the form $\{X(t_m)=j_m\text{ for }0\le m\le n\}$, and so our uniqueness assertion has now been justified.
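For a two-state chain the semigroup is available in closed form, which makes the Chapman–Kolmogorov equation (4.1.9) and the finite-dimensional formula (4.2.6) easy to check numerically. The closed form below is standard and is stated here without derivation:

```python
import math

def P2(t, a, b):
    """Closed-form semigroup for the two-state chain with Q = [[-a, a], [b, -b]]:
    P(t) = pi + exp(-(a+b)t) (I - pi), where pi has rows (b, a)/(a+b)."""
    s, e = a + b, math.exp(-(a + b) * t)
    return [[(b + a * e) / s, (a - a * e) / s],
            [(b - b * e) / s, (a + b * e) / s]]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

a, b = 1.3, 0.4
lhs = P2(0.7 + 1.1, a, b)                    # P(s + t)
rhs = matmul(P2(0.7, a, b), P2(1.1, a, b))   # P(s) P(t)
# entrywise agreement of lhs and rhs is (4.1.9); a product such as
# mu_0 * P2(t1, a, b)[0][1] * P2(t2 - t1, a, b)[1][0] is an instance of (4.2.6)
```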
4.2.3. The Q-Matrix and Kolmogorov's Backward Equation: As we saw at the end of the preceding section, apart from its initial distribution, the distribution of a Markov process is completely determined by the semigroup $\{P(t):t\ge 0\}$. Thus, it is important to develop methods for calculating the transition probabilities $P(t)$ directly from the data contained in the rates $\mathfrak{R}$ and the transition probability matrix $P$. Based on one's experience with real valued functions, one should suspect that (4.1.9) means that $P(t)$ ought to be expressible as $e^{tQ}$ for some $Q$. In fact, $Q$ ought to be obtainable by differentiating $t\mapsto P(t)$ at $t=0$. As the first step in our program to give substance to the preceding speculations, we will prove that
$$\bigl(P(t)\bigr)_{ij}=\delta_{i,j}e^{-tR_i}+R_i\int_0^t e^{-\tau R_i}\sum_{k\in\mathbb{S}}(P)_{ik}\bigl(P(t-\tau)\bigr)_{kj}\,d\tau.\tag{$*$}$$
Clearly, there is nothing to do when $R_i=0$, and so we now assume that $R_i>0$. Because
$$\bigl(P(t)\bigr)_{ij}=\delta_{i,j}\,\mathbb{P}\bigl(E_1>tR_i\bigm|X(0)=i\bigr)+\mathbb{P}\bigl(E_1\le tR_i\ \&\ X(t)=j\bigm|X(0)=i\bigr)$$
and (cf. (4.2.3) and (4.2.5))
$$\begin{aligned}\mathbb{P}\bigl(E_1\le tR_i\ \&\ X(t)=j\bigm|X(0)=i\bigr)&=\mathbb{P}\Bigl(\Phi^{(\mathfrak{R},P)}\bigl(t-R_i^{-1}E_1;(E_2,\dots,E_n,\dots),(X_1,\dots,X_n,\dots)\bigr)=j\ \&\ E_1\le tR_i\Bigm|X_0=i\Bigr)\\&=\mathbb{E}\Bigl[\bigl(P(t-R_i^{-1}E_1)\bigr)_{X_1j},\,E_1\le R_it\Bigm|X_0=i\Bigr]=R_i\int_0^t e^{-\tau R_i}\sum_{k\in\mathbb{S}}(P)_{ik}\bigl(P(t-\tau)\bigr)_{kj}\,d\tau,\end{aligned}$$
we have completed the proof of ($*$).
The expression in ($*$) is an integrated version of a renowned equation due to Kolmogorov. Namely, when one differentiates ($*$) with respect to $t$, one arrives at Kolmogorov's backward equation:
$$\frac{d}{dt}\bigl(P(t)\bigr)_{ij}=-R_i\bigl(P(t)\bigr)_{ij}+R_i\sum_{k\in\mathbb{S}}(P)_{ik}\bigl(P(t)\bigr)_{kj},$$
which, when written in matrix notation, becomes
$$\frac{d}{dt}P(t)=QP(t)\quad\text{with }P(0)=I,\ \text{ when }Q=R(P-I),\tag{4.2.7}$$
where $R$ is the diagonal matrix whose $i$th diagonal entry is $R_i$. The reason for the adjective "backward" is that (4.2.7) describes the evolution of $t\rightsquigarrow(P(t))_{ij}$
as a function of time $t$ and its backward variable $i$, so called because, if one adopts the perspective of someone traveling along the path $t\rightsquigarrow X(t)$, then $i$, being the place where he started, is the variable he sees when he is "looking backward." Unfortunately, the terminology for the matrix $Q$ is much less inspired. Namely, probabilists call any matrix whose off-diagonal entries are non-negative and whose rows sum to $0$ a Q-matrix.

4.2.4. Kolmogorov's Forward Equation: Recall the norm $\|\cdot\|_{u,v}$ introduced at the beginning of §3.2.1. Starting from (4.2.7), we have
$$P(h)-I-hQ=\int_0^h Q\bigl(P(\tau)-I\bigr)\,d\tau,$$
and so
$$\bigl\|P(h)-I-hQ\bigr\|_{u,v}\le\frac{h^2}{2}\,\|Q\|_{u,v}^2.\tag{4.2.8}$$
Because, by the semigroup property (cf. (4.1.9)),
$$P(t+h)-P(t)-hP(t)Q=P(t)\bigl(P(h)-I-hQ\bigr),$$
we can pass from the above to Kolmogorov's forward equation
$$\frac{d}{dt}P(t)=P(t)Q\quad\text{with }P(0)=I,\tag{4.2.9}$$
so called because it describes the evolution of $t\rightsquigarrow P(t)$ as a function of the forward variable $j$: the variable which the traveler sees when he "looks forward" in time.

4.2.5. Solving Kolmogorov's Equation: As we mentioned earlier, (4.1.9) suggests that $P(t)=e^{tP'(0)}$, and, thanks to (4.2.7), we now know that $P'(0)=Q$. That is, we are guessing that $P(t)=e^{tQ}$, where the meaning of the exponential is given by the power series
$$e^{M}=\sum_{m=0}^{\infty}\frac{M^m}{m!}\quad\text{for }M\in M_{u,v}(\mathbb{S}).\tag{4.2.10}$$
Because $\|M^m\|_{u,v}\le\|M\|_{u,v}^m$, there is no question that the preceding series converges for every $M\in M_{u,v}(\mathbb{S})$. In fact,
$$\Bigl\|e^{M}-\sum_{m=0}^{n-1}\frac{M^m}{m!}\Bigr\|_{u,v}\le\frac{\|M\|_{u,v}^n}{n!}\,e^{\|M\|_{u,v}}.\tag{4.2.11}$$
Also, it is easy (cf. Exercise 4.5.2 below) to check that
$$e^{M_1+M_2}=e^{M_1}e^{M_2}\quad\text{when }M_1M_2=M_2M_1.\tag{4.2.12}$$
In particular, this means that $t\rightsquigarrow e^{tQ}$ has the semigroup property: $e^{(s+t)Q}=e^{sQ}e^{tQ}$. Finally, because
$$e^{(t+h)Q}-e^{tQ}=e^{tQ}\bigl(e^{hQ}-I\bigr)\quad\text{and}\quad\Bigl\|\frac{e^{hQ}-I}{h}-Q\Bigr\|_{u,v}\longrightarrow 0\ \text{as }h\to 0,$$
we see that $t\rightsquigarrow e^{tQ}$ is differentiable and $\frac{d}{dt}e^{tQ}=e^{tQ}Q$. With these preparations, it is an easy matter to complete the identification
$$P(t)=e^{tQ}=\sum_{m=0}^{\infty}\frac{t^mQ^m}{m!}\quad\text{for all }t\in[0,\infty).\tag{4.2.13}$$
Namely, by the preceding and (4.2.7), we see that, for any $t>0$ and $\tau\in[0,t]$,
$$\frac{d}{d\tau}\Bigl(e^{(t-\tau)Q}P(\tau)\Bigr)=-e^{(t-\tau)Q}QP(\tau)+e^{(t-\tau)Q}QP(\tau)=0,$$
and so $\tau\in[0,t]\longmapsto e^{(t-\tau)Q}P(\tau)\in M_{u,v}(\mathbb{S})$ is constant; comparing the values at $\tau=0$ and $\tau=t$ gives (4.2.13).
Before closing this discussion, there is an important point which should be addressed. Namely, we have proved, via (4.2.13), that, for each $t\ge 0$, $e^{tQ}$ is a transition probability matrix, although this fact is not at all evident from the power series expression for $e^{tQ}$. In particular, without further considerations, it is far from clear why the entries of $e^{tQ}$ are non-negative or why $\|e^{tQ}\|_{u,v}=1$, independent of $\|Q\|_{u,v}$. Thus, we will now provide another way to see these properties, one that does not require the identification of $e^{tQ}$ as $P(t)$. Namely, set
$$(\hat{P})_{ij}=\begin{cases}\dfrac{(Q)_{ij}}{M}&\text{if }i\ne j\\[4pt] 1+\dfrac{(Q)_{ii}}{M}&\text{if }i=j,\end{cases}\qquad\text{where }M=\sup_i R_i=\sup_i\bigl(-(Q)_{ii}\bigr).$$
Then $\hat{P}$ is a transition probability matrix, $Q=M(\hat{P}-I)$, and $I$ commutes with $\hat{P}$. Hence, because
$$e^{tQ}=e^{tM(\hat{P}-I)}=e^{-tM}e^{tM\hat{P}}=e^{-tM}\sum_{m=0}^{\infty}\frac{(tM)^m}{m!}\,\hat{P}^m,$$
the non-negativity of the entries of $e^{tQ}$ is obvious from this representation, as is the fact that
$$\bigl\|e^{tQ}\bigr\|_{u,v}\le e^{-tM}\sum_{m=0}^{\infty}\frac{(tM)^m}{m!}\,\bigl\|\hat{P}^m\bigr\|_{u,v}=1.$$
In order to appreciate what is going on here, it is well to notice that this line of reasoning works only because $t\ge 0$: when $t<0$, the entries of $e^{tQ}$ need not be non-negative and $\|e^{tQ}\|_{u,v}$ may be as large as $e^{|t|\,\|Q\|_{u,v}}$.

4.2.6. A Markov Process from its Infinitesimal Characteristics: In some applications the most natural way to describe a Markov process $\{X(t):t\ge 0\}$ in terms of its rates $\mathfrak{R}$ and the underlying transition probability matrix $P$ is to specify its initial distribution and say that, given its past $\sigma(\{X(\tau):\tau\in[0,t]\})$ up until time $t\ge 0$, the probability of its having moved away from $X(t)$ at time $t+h$ (where $h>0$ is small) will be approximately $hR_{X(t)}$, and, given $\sigma(\{X(\tau):\tau\in[0,t]\}\cup\{X(t+h)\ne X(t)\})$ (i.e., both its past and that it has moved), the probability that $X(t+h)=j$ is approximately $(P)_{X(t)j}$. In such a description, $\mathfrak{R}$ and $P$ become the infinitesimal characteristics of the process, and the question is whether the distribution of the process can be reconstructed from this description. However, before attempting such a reconstruction, we will make the description more quantitative by insisting that there exist an $\varepsilon:(0,\infty)\to(0,\infty)$ which tends to $0$ at $0$ and for which
$$\bigl|\mathbb{P}\bigl(X(t+h)\ne X(t)\bigm|X(\tau),\,\tau\in[0,t]\bigr)-hR_{X(t)}\bigr|\le h\,\varepsilon(h)$$
and
$$\bigl|\mathbb{P}\bigl(X(t+h)=j\bigm|X(\tau),\,\tau\in[0,t],\ \&\ X(t+h)\ne X(t)\bigr)-(P)_{X(t)j}\bigr|\le\varepsilon(h).$$
Clearly, the first of these is equivalent to
$$\bigl|\mathbb{P}\bigl(X(t+h)=X(t)\bigm|X(\tau),\,\tau\in[0,t]\bigr)-1-h(Q)_{X(t)X(t)}\bigr|\le h\,\varepsilon(h),$$
whereas the two together lead to
$$\bigl|\mathbb{P}\bigl(X(t+h)=j\bigm|X(\tau),\,\tau\in[0,t]\bigr)-h(Q)_{X(t)j}\bigr|\le h\,\varepsilon(h)\bigl(R_{X(t)}+\varepsilon(h)\bigr)$$
when $j\ne X(t)$. Hence, because we are assuming that the rates are bounded, the above description implies that
$$\bigl|\mathbb{P}\bigl(X(t+h)=j\bigm|X(\tau),\,\tau\in[0,t]\bigr)-\delta_{X(t),j}-h(Q)_{X(t)j}\bigr|\le h\,\varepsilon'(h),\tag{$*$}$$
where, again, $\varepsilon'(h)\to 0$ as $h\searrow 0$.
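The uniformization identity $e^{tQ}=e^{-tM}\sum_m\frac{(tM)^m}{m!}\hat{P}^m$ of §4.2.5 also gives a numerically convenient way to compute $e^{tQ}$ while keeping its positivity manifest. A sketch for an invented three-state chain (truncation length and parameters are arbitrary choices):

```python
import math

def make_Q(rates, P, n):
    """Q = R(P - I), cf. (4.2.7)."""
    return [[rates[i] * (P[i].get(j, 0.0) - (1.0 if i == j else 0.0))
             for j in range(n)] for i in range(n)]

def exp_tQ(Q, t, M, terms=80):
    """Uniformization: with Phat = I + Q/M (row stochastic when M >= sup R_i),
    e^{tQ} = e^{-tM} * sum_m ((tM)^m / m!) Phat^m, a series of non-negative terms."""
    n = len(Q)
    Phat = [[Q[i][j] / M + (1.0 if i == j else 0.0) for j in range(n)]
            for i in range(n)]
    power = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    out = [[0.0] * n for _ in range(n)]
    coeff = math.exp(-t * M)             # e^{-tM} (tM)^m / m!, starting at m = 0
    for m in range(terms):
        for i in range(n):
            for j in range(n):
                out[i][j] += coeff * power[i][j]
        power = [[sum(power[i][k] * Phat[k][j] for k in range(n))
                  for j in range(n)] for i in range(n)]
        coeff *= t * M / (m + 1)
    return out

# Invented three-state example with bounded rates; (P)_{ii} = 0 throughout.
rates = {0: 1.0, 1: 3.0, 2: 2.0}
P = {0: {1: 1.0}, 1: {0: 0.5, 2: 0.5}, 2: {0: 1.0}}
Pt = exp_tQ(make_Q(rates, P, 3), t=0.7, M=3.0)
# every row of e^{tQ} is a probability vector, as promised for t >= 0
```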
We will now show that ($*$) is sufficient to guarantee that (4.2.4) is satisfied with $P(t)=e^{tQ}$ and therefore, by the result discussed toward the end of §4.2.2, that $\{X(t):t\ge 0\}$ is a Markov process corresponding to rates $\mathfrak{R}$ and transition probability matrix $P$. To this end, given $s\ge 0$ and $A\in\sigma(\{X(\tau):\tau\in[0,s]\})$, determine the row vectors $\{\mu(t):t\ge 0\}$ so that $(\mu(t))_j=\mathbb{P}\bigl(\{X(s+t)=j\}\cap A\bigr)$ for each $j\in\mathbb{S}$. Then, because
$$(\mu(t+h))_j=\mathbb{E}\bigl[\mathbb{P}\bigl(X(s+t+h)=j\bigm|X(\tau),\,\tau\in[0,s+t]\bigr),A\bigr]\quad\text{and}\quad(\mu(t))_j=\mathbb{E}\bigl[\delta_{X(s+t),j},A\bigr],$$
($*$) tells us that
$$h\,\varepsilon'(h)\ge\Bigl|(\mu(t+h))_j-(\mu(t))_j-h\,\mathbb{E}\bigl[(Q)_{X(s+t)j},A\bigr]\Bigr|=\Bigl|(\mu(t+h))_j-(\mu(t))_j-h\bigl(\mu(t)Q\bigr)_j\Bigr|\quad\text{for }t\ge 0\text{ and }h>0.$$
Hence, $\frac{d}{dt}\mu(t)=\mu(t)Q$ for $t>0$, and so
$$\mu(t)=\mu(0)e^{tQ}=\mu(0)P(t),$$
from which we conclude that
$$\mathbb{P}\bigl(\{X(s+t)=j\}\cap A\bigr)=\bigl(\mu(0)P(t)\bigr)_j,\quad\text{which is equivalent to}\quad\mathbb{P}\bigl(\{X(s+t)=j\}\cap A\bigr)=\mathbb{E}\bigl[(P(t))_{X(s)j},A\bigr].$$
That is, we have now shown that ($*$) implies (4.2.4), and therefore, by the final part of §4.2.2, we see that the distribution of $\{X(t):t\ge 0\}$ is that of a Markov process corresponding to rates $\mathfrak{R}$ and transition probability $P$.

4.3 Unbounded Rates

Thus far we have been assuming that the rates $\mathfrak{R}=\{R_i:i\in\mathbb{S}\}$ are bounded, and the essential application which we made of this assumption came in §4.2.1. Namely, a bound on the rates guaranteed that $J_n\nearrow\infty$ with probability 1 and therefore that our Markov process was completely determined for all time. As we will see below (cf. Exercises 4.5.5 and 4.5.6), when the rates are unbounded, $J_\infty=\lim_{n\to\infty}J_n$ may be finite with positive probability, in which case our description fails to tell the process what to do during the time interval $[J_\infty,\infty)$. In this section, we will give conditions, other than a bound on $\mathfrak{R}$, which guarantee that $J_\infty=\infty$ with probability 1. In addition, we will give a very cursory discussion of what one can do to salvage the situation when $J_\infty<\infty$ with positive probability.
4.3.1. Explosion: The setting here is the same as the one in §4.2.1, only now we are no longer assuming that the rates are bounded. Thus, if $t\rightsquigarrow X(t)$ is given by the prescription in (4.2.3), $J_n$ denotes the time of the $n$th jump (i.e., the $n$th $t$ for which $X(t)\ne X(t-)$), and $J_\infty=\lim_{n\to\infty}J_n$, then the distribution of $t\rightsquigarrow X(t)$ is uniquely determined only until time $J_\infty$. Our first step is to show that, with probability 1, $J_\infty$ coincides with the time when the process explodes out of the state space. To be precise, choose an exhaustion $\{F_N:N\ge 1\}$ of $\mathbb{S}$ by non-empty, finite subsets. That is, for each $N$, $F_N$ is a non-empty, finite subset of $\mathbb{S}$, $F_N\subseteq F_{N+1}$, and $\mathbb{S}=\bigcup_N F_N$. Next, take $\mathfrak{R}^{(N)}$ to be the set of rates given by
$$R_i^{(N)}=\begin{cases}R_i&\text{if }i\in F_N\\ 0&\text{if }i\notin F_N.\end{cases}$$
Then, for each $N\ge 1$, (4.2.3) with $\mathfrak{R}^{(N)}$ replacing $\mathfrak{R}$ determines a Markov process $\{X^{(N)}(t):t\ge 0\}$. Moreover, if $\zeta_N=\inf\{t\ge 0:X^{(N)}(t)\notin F_N\}$, then $X^{(N+1)}(t)=X^{(N)}(t)$ for $t\in[0,\zeta_N]\cap[0,\infty)$. Hence, $\zeta_N\le\zeta_{N+1}$, and so the explosion time $e=\lim_{N\to\infty}\zeta_N$ exists (in $[0,\infty]$).

4.3.1 THEOREM. $e=J_\infty$ with probability 1, and, for each $N\ge 1$, $X^{(N)}(t)=X(t\wedge\zeta_N)$ for $t\in[0,\infty)$. In particular, if $\mathbb{P}(e=\infty)=1$, then the distribution of $X(0)$ together with the description in (4.2.1) uniquely determine the distribution of a process $\{X(t):t\ge 0\}$.
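The dichotomy behind Theorem 4.3.1 can be seen numerically on a pure-birth chain $0\to1\to2\to\cdots$: when $\sum_i R_i^{-1}<\infty$ the jump times accumulate at a finite $J_\infty$ (explosion), while bounded rates force $J_n\nearrow\infty$. A sketch, with the rates invented for illustration (cf. Exercise 4.5.5):

```python
import random

def J_partial(rate_of, n_jumps, rng):
    """Partial sum J_n = sum of Exp(R_m) holding times for the pure-birth
    chain 0 -> 1 -> 2 -> ...; J_infinity is its increasing limit."""
    return sum(rng.expovariate(rate_of(m)) for m in range(n_jumps))

rng = random.Random(3)
# With R_i = 2^i, E[J_infinity] = sum_i 2^{-i} = 2 < infinity, so J_infinity
# is finite a.s. and the process explodes; with R_i = 1 (bounded), J_n is a
# sum of n unit exponentials and tends to infinity a.s.
explosive = [J_partial(lambda i: 2.0 ** i, 200, rng) for _ in range(2000)]
mean_J = sum(explosive) / len(explosive)     # close to E[J_infinity] = 2
```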
PROOF: Because $\{e\ne J_\infty\}$ can be written as the union of the sets $\{e>T\ge J_\infty\}\cup\{J_\infty>T\ge e\}$ as $T$ runs over the positive rationals, in order to prove that $e=J_\infty$ with probability 1, (6.1.5) says that it suffices for us to show that $\mathbb{P}(e>T\ge J_\infty)=0=\mathbb{P}(J_\infty>T\ge e)$ for each $T>0$.
To this end, first suppose that $\mathbb{P}(e>T\ge J_\infty)>0$ for some $T$. Then there exists an $N$ such that $\mathbb{P}(\zeta_N>T\ge J_\infty)>0$. On the other hand,
$$\zeta_N>T\ge J_\infty\implies T\ge J_\infty\ge r_N^{-1}\sum_{m=1}^{\infty}E_m,\quad\text{where }r_N=\sup_{j\in F_N}R_j,$$
and therefore we are led to the contradiction $\mathbb{P}\bigl(\sum_1^{\infty}E_m\le r_NT\bigr)>0$.
Next suppose that $\mathbb{P}(J_\infty>T\ge e)>0$. Then there exists an $n\ge 1$ such that $\mathbb{P}(J_n>T\ge e)>0$. On the other hand, if $\mu$ is the distribution of $X(0)$, then, since $\zeta_N\le T<J_n$ can happen only if one of $X_0,\dots,X_n$ lies outside $F_N$,
$$\mathbb{P}\bigl(J_n>T\ge e\bigr)\le\mathbb{P}\bigl(J_n>T\ge\zeta_N\bigr)\le\sum_{m=0}^{n}\sum_{j\notin F_N}(\mu P^m)_j\longrightarrow 0\quad\text{as }N\to\infty.$$
That is, we know that $\mathbb{P}(J_n>T\ge e)=0$.
Given the preceding, it should be clear that X(t) = X ( Nl (t) as long as t E [O, (N ) and that X((N ) = X ( N ) ((N ) if (N < oo. Hence, X(t A (N ) = X ( N l (t) for all N E N and t :::=: 0. Finally, because (4.2.1), with 9'{( N ) replacing 91, together with the initial distribution uniquely determine the distribution of {X( Nl (t) : t :::=: 0}, it follows that, for each N E N, the distribution of {X(t A (N ) : t :::=: 0} is uniquely determined by the initial distribution and (4.2.1), and therefore, when = oo) = 1 , so is the distribution of {X(t) :
t ::::: 0}.
IP'(e
D
IP'(e
ooiX(O) = i)
1 for all i E § and (P(t)) ij IP'(X(t) = J IX(O) = i ) , then, for each initial distribution and T > 0, 4.3. 2 COROLLARY. If
=
=
J-t
=:=
11 �-tp ( N ) (t) - J-LP(t) llv = 0, -+ oo tEsup Nlim (O,T]
(4.3.3)
: t > 0} is the semigroup determined by 91( N ) and P. Moreover, {X(t) : t :::=: 0} satisfies the Markov property in (4.2.4). Finally, {P(t) : t :::=: 0} where {P( N )
is a semigroup which satisfies Kolmogorov's backward equation in the sense 2 that, for each ( i, j) E § ,
( 4.3.4) PROOF: First note that
IP'(e
=
oo) =
LiE§ (J-t)i!P'(e
and therefore that lim N -+ oo IP'((N ::::; T) time, by Theorem 4.3.1,
=
=
oo I X(O) = i )
=
1,
0 for each T E [0, oo ) . At the same
for all 0 ::::; t ::::; T and j E §. Thus, sup 11�-tp ( N ) (t) - J-LP(t) ll v t E(O,T]
::S:
IP'((N ::S: T) ------* 0 as N ----* oo.
To prove the Markov property (4.2.4) , let A E a ( {X(T) : T E [0, s]}) be given, and assume that X ( s) = i on A. Then, since A n { (N > s} E a ( {X( N ) (T) : T E [O, s]}) , the Markov property for {X(N l (t) : t :::=: 0} plus the result in Theorem 4.3.1 justify lim IP'({xC N l (s + t) = j} n A n {(N > s}) IP' ( {X(s + t) = j} n A) = N-+oo N = lim (p ( ) (t)) . . IP'( A n { (N > s}) = (P (t)) IP' (A) . N-+oo D
D
4 MARKOV PROCESSES IN CONTINUOUS TIME
92
Of course, once we know that ( 4.2.4 ) holds, then, by exactly the same argu ment with which we derived ( 4. 1 .9) in the bounded case, we know that the {P(t) : t 2': 0} here is also a semigroup. To check ( 4.3.4 ) , note that, by ( 4.2.7) applied to {P ( Nl(t) : t 2': 0},
(p (N) (t)) ij = bi,j +
lt (Qp(N) (T)) ij dT
as soon as N is large enough that i E FN . Hence, since 1 2': (pCN) ( T)) kj (P(T)) kj while Lk I(Q) ikl = 2Ri < oo, (4.3.4) follows by Lebesgue's Domi nated Convergence Theorem. D 4.3.2. Criteria for Non-explosion or Explosion: In this subsection we will first develop two criteria which guarantee non-explosion: e = oo with probability 1. We will also give a condition which guarantees explosion with probability 1. 4 . 3 . 5 THEOREM. Let P be a transition probability matrix satisfying (P) ii = 0 for all i E §, and let J-L be a probability vector with the property that (J-L) i = 0 unless i is recurrent for P . Then for every choice of rates 91, there is --+
no explosion of the process described in 4.2. 1 ) with initial distribution
(
J-L .
PROOF: First observe that it is enough to handle the case when the rates are non-degenerate. Indeed, because what we are trying to check is that J_∞ = Σ_{m=1}^∞ R_{X_{m−1}}^{−1} E_m = ∞ with probability 1, it is clear that making the rates smaller can only make explosion less likely. In addition, because P(𝔢 = ∞) = Σ_{i∈§} (μ)_i P(𝔢 = ∞ | X(0) = i), it suffices for us to show that P(𝔢 = ∞ | X(0) = i) = 1 whenever i is recurrent for P.

Because we are assuming that the rates are non-degenerate, (4.2.3) guarantees that the points visited by {X(t) : t ∈ [0, J_∞)} will be the same as the points visited by {X_n : n ≥ 0}. In particular, for any N ≥ 1,

    P((σ^{(N)})_i < ζ_N | X(0) = i) = θ_N ≡ P(∃ n ≥ 1  X_n = i & X_m ∈ F_N for 0 ≤ m ≤ n | X_0 = i),

where (σ^{(N)})_i is the first time after J_1^{(N)} that t ↦ X^{(N)}(t) returns to i. At the same time, essentially the same argument as was used to prove Theorem 2.3.6 shows that P((σ^{(N)})_i^{(m)} < ζ_N | X(0) = i) = θ_N^m, where (σ^{(N)})_i^{(m)} is defined inductively so that (σ^{(N)})_i^{(1)} = (σ^{(N)})_i and

    (σ^{(N)})_i^{(m+1)} = inf{t ≥ J_{n+1}^{(N)} : X^{(N)}(t) = i}  if (σ^{(N)})_i^{(m)} = J_n^{(N)} < ∞,
    (σ^{(N)})_i^{(m+1)} = ∞  if (σ^{(N)})_i^{(m)} = ∞,

and {J_n^{(N)} : n ≥ 0} are the jump times of t ↦ X^{(N)}(t). Thus, if i is recurrent for P and therefore θ_N → 1 as N → ∞, we know that

    lim_{N→∞} P((σ^{(N)})_i^{(m)} < ζ_N | X(0) = i) = 1  for each m ≥ 1.

On the other hand, since (σ^{(N)})_i^{(1)} ≥ J_1^{(N)} and

    (σ^{(N)})_i^{(m)} = J_ℓ^{(N)}  ⟹  (σ^{(N)})_i^{(m+1)} − (σ^{(N)})_i^{(m)} ≥ J_{ℓ+1}^{(N)} − J_ℓ^{(N)} = E_{ℓ+1}/R_i,

we have

    P((σ^{(N)})_i^{(m)} ≤ T | X(0) = i) ≤ P(Σ_{ℓ=1}^m E_ℓ ≤ R_i T) ≤ (T R_i)^m / m!.

But, for all m ≥ 1 and N ≥ 1,

    P(ζ_N ≤ T | X(0) = i) ≤ P(ζ_N ≤ (σ^{(N)})_i^{(m)} | X(0) = i) + P((σ^{(N)})_i^{(m)} ≤ T | X(0) = i),

and so, after first letting N → ∞ and then m → ∞, we see that, when i is recurrent for P, P(𝔢 ≤ T | X(0) = i) = 0. □
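The representation 𝔢 = J_∞ = Σ_m R_{X_{m−1}}^{−1} E_m used in the preceding proof also makes explosion easy to probe by simulation. The sketch below (with hypothetical rates of my choosing, not from the text) simulates the jump times of a pure-birth chain, for which X_m = m: when Σ_i R_i^{−1} < ∞ the partial sums J_n stabilize (explosion in finite time), while for slower rates they grow without bound.

```python
import random

def simulate_jump_times(rate, n_jumps, rng):
    """Partial sums J_n = sum_{m <= n} E_m / R_{X_{m-1}} for a pure-birth
    chain (X_m = m), where the E_m are independent unit exponentials."""
    t, times = 0.0, []
    for m in range(n_jumps):
        t += rng.expovariate(1.0) / rate(m)   # holding time at state m
        times.append(t)
    return times

rng = random.Random(42)
# Summable reciprocal rates: J_n converges, so the process explodes.
fast = simulate_jump_times(lambda i: (i + 1) ** 2, 10_000, rng)
# Non-summable reciprocal rates: J_n -> infinity, no explosion.
slow = simulate_jump_times(lambda i: i + 1, 10_000, rng)

print(round(fast[-1], 3))   # stays near sum 1/(i+1)^2 = pi^2/6 in mean
print(round(slow[-1], 1))   # comparable to the harmonic sum ~ log(10000)
```

The contrast is exactly the dichotomy of Exercise 4.5.5 below: the first choice of rates explodes with probability 1, the second never does.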
We group our second non-explosion criterion together with our criterion for explosion as two parts of the same theorem. In this theorem, the process is determined by (4.2.1) with rates ℜ and transition probability P.

4.3.6 THEOREM. If there exists a non-negative function u on § with the properties that U_N ≡ inf_{j∉F_N} u(j) → ∞ as N → ∞ and, for some α ∈ [0, ∞),

    Σ_{j∈§} (P)_{ij} u(j) ≤ (1 + α/R_i) u(i)  whenever i ∈ § and R_i > 0,

then P(𝔢 = ∞ | X(0) = i) = 1 for all i ∈ §. On the other hand, if, for some i ∈ §, R_j > 0 whenever i→j and there exists a non-negative function u on § with the property that, for some ε > 0,

    Σ_{k : i→k} (P)_{jk} u(k) ≤ u(j) − ε/R_j  whenever i→j,

then P(𝔢 = ∞ | X(0) = i) = 0.

PROOF: To prove the first part, for each N ≥ 1, set u^{(N)}(j) = u(j) when j ∈ F_N and u^{(N)}(j) = U_N when j ∉ F_N. It is an easy matter to check that if Q^{(N)} = R^{(N)}(P − I), where (R^{(N)})_{ij} = R_i^{(N)} δ_{i,j}, and u^{(N)} is the column vector determined by (u^{(N)})_i = u^{(N)}(i), then, for all i ∈ §, (Q^{(N)} u^{(N)})_i ≤ α u^{(N)}(i). Hence, by Kolmogorov's forward equation (4.2.9) applied to {P^{(N)}(t) : t ≥ 0},

    (d/dt)(P^{(N)}(t) u^{(N)})_i ≤ α (P^{(N)}(t) u^{(N)})_i,

and so (P^{(N)}(T) u^{(N)})_i ≤ e^{αT} u^{(N)}(i). But, since X^{(N)}(T) ∉ F_N on {ζ_N ≤ T} and u^{(N)} = U_N off of F_N, this means that

    U_N P(ζ_N ≤ T | X(0) = i) ≤ E[u^{(N)}(X^{(N)}(T)) | X(0) = i] = (P^{(N)}(T) u^{(N)})_i ≤ e^{αT} u^{(N)}(i) = e^{αT} u(i),

and so, by the Monotone Convergence Theorem, we have now proved that

    P(𝔢 ≤ T | X(0) = i) ≤ lim_{N→∞} P(ζ_N ≤ T | X(0) = i) ≤ lim_{N→∞} e^{αT} u(i)/U_N = 0.

In proving the second assertion, we may and will assume that i→j for all j ∈ §. Now take u^{(N)}(j) = u(j) if j ∈ F_N and u^{(N)}(j) = 0 if j ∉ F_N, and use u^{(N)} to denote the associated column vector. Clearly (Q^{(N)} u^{(N)})_j is either less than or equal to −ε or equal to 0 according to whether j ∈ F_N or j ∉ F_N. Hence, again by Kolmogorov's forward equation,

    (d/dt)(P^{(N)}(t) u^{(N)})_i ≤ −ε Σ_{j∈F_N} (P^{(N)}(t))_{ij},

and so

    E[u^{(N)}(X^{(N)}(T)) | X^{(N)}(0) = i] − u^{(N)}(i) ≤ −ε E[∫_0^T 1_{F_N}(X^{(N)}(t)) dt | X^{(N)}(0) = i] = −ε E[T ∧ ζ_N | X^{(N)}(0) = i].

But, since u^{(N)} ≥ 0, this means that E[ζ_N | X^{(N)}(0) = i] ≤ u(i)/ε for all N, and so E[𝔢 | X(0) = i] ≤ u(i)/ε < ∞. □

4.3.3. What to Do When Explosion Occurs: Although I do not intend to repent, I feel compelled to admit that we have ignored what, from the mathematical standpoint, is the most interesting aspect of the theory under consideration. Namely, so far we have said nothing about the options one has when explosion occurs with positive probability, and in this subsection we will discuss only the most banal of the many choices available.

If one thinks of {𝔢 < ∞} as the event that the process escapes its state space in a finite amount of time, then continuation of the process after 𝔢 might entail the introduction of at least one new state. Indeed, the situation here is very similar to the one encountered by topologists when they want to "compactify" a space. The simplest compactification of a separable, locally compact space is the one-point compactification, the compactification which recognizes escape to infinity but ignores all details about the route taken. The analog of one-point compactification in the present context is, at the time of explosion, to send the process to an absorbing point Δ ∉ §. That is, one defines t ∈ [0, ∞) ↦ X(t) ∈ § ∪ {Δ} by the prescription in (4.2.3) (remember that, by Theorem 4.3.1, 𝔢 = J_∞ with probability 1) as long as t ∈ [0, J_∞) and takes X(t) = Δ for t ∈ [J_∞, ∞). For various reasons, none of which I will explain, this extension is called the minimal extension of the process.

The minimal extension has the virtue that it always works. On the other hand, it, like the one-point compactification, has the disadvantage that it completely masks all the fine structure of any particular case. For example, when § = ℤ, explosion can occur because, given that 𝔢 < ∞, lim_{t↗𝔢} X(t) = +∞ with probability 1, or because, although lim_{t↗𝔢} |X(t)| = ∞ with probability 1, both lim_{t↗𝔢} X(t) = +∞ and lim_{t↗𝔢} X(t) = −∞ occur with positive probability. In the latter case, one might want to record which of the two possibilities occurred, and this could be done by introducing two new absorbing states, Δ₊ for those paths which escape via +∞ and Δ₋ for those which escape via −∞.
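The per-state hypotheses of Theorem 4.3.6 are mechanical to verify on concrete models. As an illustration (the chain, rates, and Lyapunov function below are hypothetical choices of mine, not from the text), the following sketch checks the non-explosion condition Σ_j (P)_{ij} u(j) ≤ (1 + α/R_i) u(i) for a pure-birth chain with R_i = i + 1 and u(i) = i + 1, a function for which inf_{j∉F_N} u(j) → ∞; the inequality reduces to i + 2 ≤ i + 1 + α, which holds exactly when α ≥ 1.

```python
def check_nonexplosion(P_row, R, u, alpha, states):
    """Test the Theorem 4.3.6 non-explosion condition
    sum_j P_ij u(j) <= (1 + alpha/R_i) u(i) on the given states.
    P_row(i) returns the non-zero entries of row i as (j, prob) pairs."""
    for i in states:
        if R(i) == 0:
            continue  # the condition is only required where R_i > 0
        lhs = sum(p * u(j) for j, p in P_row(i))
        if lhs > (1 + alpha / R(i)) * u(i) + 1e-12:
            return False, i          # condition fails at state i
    return True, None

# Pure-birth chain: P_{i,i+1} = 1, rates R_i = i + 1, Lyapunov u(i) = i + 1.
ok, bad = check_nonexplosion(
    P_row=lambda i: [(i + 1, 1.0)],
    R=lambda i: i + 1,
    u=lambda i: i + 1,
    alpha=1.0,
    states=range(10_000),
)
print(ok)  # the inequality i + 2 <= i + 1 + alpha holds since alpha >= 1
```

The same loop, with the inequality replaced by Σ_k (P)_{jk} u(k) ≤ u(j) − ε/R_j, would probe the explosion half of the theorem.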
Alternatively, rather than thinking of the explosion time as a time to banish the process from §, one can turn it into a time of renewal by redistributing the process over § at time 𝔢 and running it again until it explodes, etc. Obviously, there is an infinity of possibilities. Suffice it to say that the preceding discussion hardly scratches the surface.

4.4 Ergodic Properties
In this section we will examine the ergodic behavior of the Markov processes which we have been discussing in this chapter. Our running assumption will be that the process with which we are dealing is a continuous-time Markov process of the sort described in §4.2.1 under the condition that there is no explosion.

4.4.1. Classification of States: Just as in §3.1, we begin by classifying the states of §. In order to make our first observation, write Q = R(P − I), where R is the diagonal matrix of rates from ℜ and P is a transition probability matrix whose diagonal entries are 0. Next, define P^ℜ to be the transition probability matrix given by⁵

(4.4.1)    (P^ℜ)_{ij} = (P)_{ij} if R_i > 0, and (P^ℜ)_{ij} = δ_{i,j} if R_i = 0.

Obviously, it is again true that Q = R(P^ℜ − I). In addition,

(4.4.2)    i→j relative to P^ℜ ⟺ ∃ n ≥ 0 (Qⁿ)_{ij} > 0 ⟺ (P(t))_{ij} > 0 for all t > 0.

Because of (4.2.13), this is obvious when ℜ is bounded, and, in general, it follows from the bounded case plus (cf. the notation in §4.3.1) lim_{N→∞} (P^{(N)}(t))_{ij} = (P(t))_{ij}.

⁵ It is worth noting that, as distinguished from P, P^ℜ is completely determined by Q.
On the basis of (4.4.2), we will write i→j when any one of the equivalent conditions in (4.4.2) holds, and we will say that i Q-communicates with j, written i↔j, when i→j and j→i. In this connection, § will be said to be Q-irreducible if all states communicate with all other states.

We turn next to the question of recurrence and transience. Because

    R_i = 0 ⟹ P(X(t) = i for all t ≥ 0 | X(0) = i) = 1,

it is obvious that i should be considered recurrent when R_i = 0. On the other hand, since in the continuous-time setting there is no "first positive time," it is less obvious what should be meant by recurrence of i when R_i > 0. However, if one adopts the attitude that time 1 in the discrete-time context represents the first time that the chain can move, then it becomes clear that the role of 1 there should be played here by the first jump time J_1, and therefore that the role of ρ_i is played by

(4.4.3)    σ_i = inf{t ≥ J_1 : X(t) = i}.

Hence, we will say that i is Q-recurrent if either R_i = 0 or P(σ_i < ∞ | X(0) = i) = 1, and we will say that it is Q-transient if it is not Q-recurrent. Our next observation is that i is Q-recurrent if and only if it is recurrent with respect to the transition probability matrix P^ℜ. Starting from this observation, we see that both Theorem 3.1.2 and Corollary 3.1.4 apply to the present setting. In particular, Q-recurrence and Q-transience are Q-communicating class properties.

We next develop the analog for continuous time of the relations given in (2.3.7). Namely, set σ_i^{(1)} = σ_i and, for m ≥ 1, σ_i^{(m+1)} = ∞ if σ_i^{(m)} = ∞ and σ_i^{(m+1)} = inf{t ≥ J_{ℓ+1} : X(t) = i} if σ_i^{(m)} = J_ℓ < ∞. Then, just as in the proof of Theorem 2.3.6, one sees that

    E[∫_{σ_i^{(m)}}^{σ_i^{(m+1)}} 1_{{i}}(X(t)) dt, σ_i^{(m)} < ∞ | X(0) = i]
      = E[∫_0^{σ_i} 1_{{i}}(X(t)) dt | X(0) = i] P(σ_i^{(m)} < ∞ | X(0) = i)
      = E[J_1 | X(0) = i] P(σ_i < ∞ | X(0) = i)^m.

In addition, one can easily check that, for i ≠ j,

    E[∫_0^∞ 1_{{j}}(X(t)) dt | X(0) = i] = E[∫_0^∞ 1_{{j}}(X(t)) dt | X(0) = j] P(σ_j < ∞ | X(0) = i).

Hence, by exactly the argument with which we passed from Theorem 2.3.6 to (2.3.7), we now know that

(4.4.4)    E[∫_0^∞ 1_{{j}}(X(t)) dt | X(0) = i] = (1/R_j)(δ_{i,j} + P(σ_j < ∞ | X(0) = i)/P(σ_j = ∞ | X(0) = j)),
           E[∫_0^∞ 1_{{i}}(X(t)) dt | X(0) = i] = ∞ ⟺ P(∫_0^∞ 1_{{i}}(X(t)) dt = ∞ | X(0) = i) = 1,
           E[∫_0^∞ 1_{{i}}(X(t)) dt | X(0) = i] < ∞ ⟺ P(∫_0^∞ 1_{{i}}(X(t)) dt < ∞ | X(0) = i) = 1.
Our final goal in this subsection is to prove the following statement.

4.4.5 THEOREM. For any given state i ∈ §, the following are equivalent:
(1) i is Q-recurrent.
(2) There is a t ∈ (0, ∞) such that i is recurrent relative to the transition probability matrix P(t).
(3) i is recurrent relative to P(t) for all t ∈ (0, ∞).
PROOF: We will prove this equivalence by checking that the same statement holds when "recurrent" is replaced throughout by "transient." To this end, first observe that

    E[∫_0^∞ 1_{{i}}(X(t)) dt | X(0) = i] = ∫_0^∞ (P(t))_{ii} dt,

and therefore, from the first line of (4.4.4), that

(4.4.6)    i is Q-transient ⟺ ∫_0^∞ (P(t))_{ii} dt < ∞.

Next, notice that, for 0 ≤ s < t,

(4.4.7)    (P(t))_{ii} ≥ (P(t−s))_{ii} (P(s))_{ii} ≥ e^{−(t−s)R_i} (P(s))_{ii},

since (P(h))_{ii} ≥ P(J_1 > h | X(0) = i) = e^{−hR_i}. Hence, for any t > 0 and n ∈ ℕ,

    (P(t)^{n+1})_{ii} = (P((n+1)t))_{ii} ≥ e^{−tR_i} (P(τ))_{ii}  for all τ ∈ [nt, (n+1)t]

and

    e^{−tR_i} (P(t)ⁿ)_{ii} = e^{−tR_i} (P(nt))_{ii} ≤ (P(τ))_{ii}  for all τ ∈ [nt, (n+1)t].

Since this means that

    t e^{−tR_i} Σ_{n=0}^∞ (P(t)ⁿ)_{ii} ≤ ∫_0^∞ (P(τ))_{ii} dτ ≤ t e^{tR_i} Σ_{n=0}^∞ (P(t)^{n+1})_{ii},

the asserted implications are now immediate from (4.4.6). □

4.4.2. Stationary Measures and Limit Theorems: In this subsection, we will complete our program by proving the following basic result.

4.4.8 THEOREM. For each j ∈ §,
    π̄_{jj} ≡ lim_{t→∞} (P(t))_{jj}  exists

and

    lim_{t→∞} (P(t))_{ij} = π̄_{ij} ≡ P(σ_j < ∞ | X(0) = i) π̄_{jj}  for i ≠ j.

Moreover, when π̄_{jj} > 0 and C = {i : i↔j}, then π̄_{ii} > 0 for all i ∈ C, and, for each s > 0, the row vector π̄_C determined by (π̄_C)_i = 1_C(i) π̄_{ii} is the unique probability vector μ ∈ Stat(P(s)) for which (μ)_k = 0 when k ∉ C. In fact, if μ ∈ Stat(P(s)) for some s > 0, then, for each j ∈ §, (μ)_j = Σ_{i∈§} (μ)_i π̄_{ij}.
PROOF: We begin with the following continuous-time version of the renewal equation (cf. (3.2.6)):

(4.4.9)    (P(t))_{ij} = δ_{i,j} e^{−tR_i} + E[(P(t − σ_j))_{jj}, σ_j ≤ t | X(0) = i].

The proof of (4.4.9) runs as follows. First, write

    (P(t))_{ij} = P(X(t) = j & J_1 > t | X(0) = i) + P(X(t) = j & J_1 ≤ t | X(0) = i).

Clearly, the first term on the right is 0 unless i = j, in which case it is equal to e^{−tR_i}. To handle the second term, write it as

    Σ_{m=1}^∞ P(X(t) = j & σ_j = J_m ≤ t | X(0) = i),

and observe that (cf. (4.2.3) and (4.2.5))

    P(X(t) = j & σ_j = J_m ≤ t | X(0) = i)
      = P(Φ_{ℜ,P}(t − J_m; (E_{m+1}, …, E_{m+n}, …), (j, X_{m+1}, …, X_{m+n}, …)) = j & σ_j = J_m ≤ t | X_0 = i)
      = E[(P(t − J_m))_{jj}, σ_j = J_m ≤ t | X(0) = i].

Hence, after summing this over m ≥ 1 and combining this result with the preceding, one arrives at (4.4.9).

Knowing (4.4.9), we see that the first part of the theorem will be proved once we treat the case when i = j. To this end, we begin by observing that, because, by (4.4.7), (P(s))_{ii} ≥ e^{−sR_i} > 0 for all s > 0 and i ∈ §, each i is aperiodic relative to P(s), and so, by (3.2.15), we know that π(s)_{ii} ≡ lim_{n→∞} (P(s)ⁿ)_{ii} exists for all s > 0 and i ∈ §. We want to show that π(1)_{ii} = lim_{t→∞} (P(t))_{ii}, and when π(1)_{ii} = 0, this is easy. Indeed, by (4.4.7),

    (P(t))_{ii} ≤ e^{R_i} (P(1)^{[t]+1})_{ii} → 0 as t → ∞,

where [t] denotes the integer part of t.

In view of the preceding, what remains to be proved in the first assertion is that lim_{t→∞} (P(t))_{ii} = π(1)_{ii} when π(1)_{ii} > 0, and the key step in our proof will be the demonstration that π(s)_{ii} = π(1)_{ii} for all s > 0, a fact which we already know when π(1)_{ii} = 0. Thus, assume that π(1)_{ii} > 0, and let C be the Q-communicating class of i. By (4.4.2), C is also the communicating class of i relative to P(s) for every s > 0. In addition, because π(1)_{ii} > 0, i is recurrent, in fact positive recurrent, relative to P(1), and therefore, by Theorem 4.4.5, it is also recurrent relative to P(s) for all s > 0. Now determine the row vector π(1)_C so that (π(1)_C)_j = 1_C(j) π(1)_{jj}. Then, because π(1)_{ii} > 0, we know, by Theorem 3.2.10, that π(1)_C is the one and only μ ∈ Stat(P(1)) which vanishes off of C. Next, given s > 0, consider μ ≡ π(1)_C P(s). Then μ is a probability vector and, because (P(s))_{jk} = 0 when j ∈ C and k ∉ C, μ vanishes off of C. Also, because

    μP(1) = π(1)_C P(s)P(1) = π(1)_C P(s+1) = π(1)_C P(1)P(s) = π(1)_C P(s) = μ,

μ is stationary for P(1). Hence, by uniqueness, we conclude that π(1)_C P(s) = π(1)_C, and therefore that π(1)_C is a stationary measure for P(s) which vanishes off of C. But, by (3.2.8) and the fact that C is the communicating class of i relative to P(s), this means that π(s)_{jj} = π(1)_{jj} for all j ∈ C.

To complete the proof of the first part from here, note that, by (4.4.7), for each s > 0 and t ∈ [ns, (n+1)s],

    e^{−sR_i} (P(ns))_{ii} ≤ (P(t))_{ii} ≤ e^{sR_i} (P((n+1)s))_{ii},

and so, since π(s)_{ii} = π(1)_{ii} and P(ns) = P(s)ⁿ,

    e^{−sR_i} π(1)_{ii} ≤ lim inf_{t→∞} (P(t))_{ii} ≤ lim sup_{t→∞} (P(t))_{ii} ≤ e^{sR_i} π(1)_{ii}.

Now let s ↘ 0, and simply define π̄_{jj} = π(1)_{jj}.

Given the first part, the proof of the second part is easy. Namely, if π̄_{jj} > 0 and C = {i : i↔j}, then, for each s > 0, we know that C is the communicating class of j relative to P(s) and that π̄_C = π(s)_C ∈ Stat(P(s)). Conversely, if μ ∈ Stat(P(s)) for some s > 0, then μ = μP(ns) for every n ≥ 1, and so, by Lebesgue's Dominated Convergence Theorem, (μ)_j = Σ_{i∈§} (μ)_i π̄_{ij} for each j ∈ §. □
With the preceding result in mind, we will say that i is Q-positive recurrent if π̄_{ii} > 0 and will say that i is Q-null recurrent if i is Q-recurrent but not Q-positive recurrent. From the preceding, we already know that Q-positive recurrence is a Q-communicating class property.
The following corollary comes at very little additional cost.
4.4.10 COROLLARY (Mean Ergodic Theorem). Assume that j is Q-positive recurrent and that P(X(0)↔j) = 1. Then

    lim_{T→∞} E[((1/T) ∫_0^T 1_{{j}}(X(t)) dt − π̄_{jj})²] = 0.

See Exercise 4.5.10 for the more refined version.
PROOF: The proof is really just an obvious transcription to the continuous setting of the argument used to prove Theorem 3.2.14. Namely, set C = {i : j↔i}. By precisely the argument given there, it suffices to handle the case when π̄_C is the initial distribution. Thus, if f = 1_{{j}} − π̄_{jj}, what we need to show is that

    lim_{T→∞} E[((1/T) ∫_0^T f(X(t)) dt)²] = 0

when π̄_C is the distribution of X(0). But, because π̄_C is P(t)-stationary,

    E[((1/T) ∫_0^T f(X(t)) dt)²] = (2/T²) ∫_0^T (∫_0^t α(t − s) ds) dt = (2/T) ∫_0^T (1 − t/T) α(t) dt,

where α(t) ≡ π̄_C (f P(t)f), f being the column vector corresponding to the function f and f P(t)f being the column vector determined by (f P(t)f)_i = f(i)(P(t)f)_i. Finally, since, for each i ∈ C, lim_{t→∞} (P(t)f)_i = 0 and therefore lim_{t→∞} α(t) = 0, we conclude that

    |(2/T) ∫_0^T (1 − t/T) α(t) dt| ≤ (2/T) ∫_0^T |α(t)| dt → 0 as T → ∞. □
4.4.3. Interpreting π̄_{ij}: Although we have shown that the limit π̄_{ij} = lim_{t→∞} (P(t))_{ij} exists, we have yet to give an expression, analogous to the one in (3.2.5), for π̄_{ii}. However, as we are about to see, such an expression is readily available from (4.4.9). Namely, for each α > 0, set

    L(α)_{ij} = α E[∫_0^∞ e^{−αt} 1_{{j}}(X(t)) dt | X(0) = i] = α ∫_0^∞ e^{−αt} (P(t))_{ij} dt.

Because π̄_{ii} = lim_{t→∞} (P(t))_{ii}, the second of these representations makes it easy to identify π̄_{ii} as lim_{α↘0} L(α)_{ii}. At the same time, from (4.4.9), we see that

    L(α)_{ii} = α/(α + R_i) + E[e^{−ασ_i} | X(0) = i] L(α)_{ii}.

Hence, we have now shown that

    π̄_{ii} = 1 if R_i = 0, and π̄_{ii} = (R_i E[σ_i | X(0) = i])^{−1} if R_i > 0,

which, in conjunction with the second equality in Theorem 4.4.8, proves that

(4.4.11)    π̄_{ij} = δ_{i,j} + P(σ_j < ∞ | X(0) = i)  if R_j = 0,
            π̄_{ij} = P(σ_j < ∞ | X(0) = i)/(R_j E[σ_j | X(0) = j])  if R_j > 0.

Of course, as an immediate corollary of (4.4.11), we now know that i is positive recurrent if and only if either R_i = 0 or E[σ_i | X(0) = i] < ∞.
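The identity π̄_{ii} = (R_i E[σ_i | X(0) = i])^{−1} lends itself to a numerical check. In the sketch below (a hypothetical three-state chain of my own), the limit distribution is obtained as the stationary row vector of the uniformized chain K = I + Q/λ with λ ≥ max_i R_i — a vector μ satisfies μK = μ exactly when μQ = 0 — and E[σ_0 | X(0) = 0] is estimated by simulating returns to state 0.

```python
import random

rng = random.Random(7)
R = [1.0, 2.0, 3.0]                                  # hypothetical rates
P = [[0.0, 0.5, 0.5], [0.5, 0.0, 0.5], [0.5, 0.5, 0.0]]

# Stationary distribution via power iteration on K = I + Q/lam.
lam = max(R)
K = [[(1.0 if i == j else 0.0) + R[i] * (P[i][j] - (i == j)) / lam
      for j in range(3)] for i in range(3)]
pi = [1 / 3] * 3
for _ in range(1000):
    pi = [sum(pi[i] * K[i][j] for i in range(3)) for j in range(3)]

# Monte Carlo estimate of E[sigma_0 | X(0) = 0], the first return time to 0.
def return_time():
    t, state = 0.0, 0
    while True:
        t += rng.expovariate(R[state])               # exponential holding time
        state = rng.choices([0, 1, 2], weights=P[state])[0]
        if state == 0:
            return t

est = sum(return_time() for _ in range(40_000)) / 40_000
print(round(pi[0], 4))                # about 6/11 = 0.5455 for this chain
print(round(1.0 / (R[0] * est), 4))  # should be close to pi[0]
```

Here the embedded chain is doubly stochastic, so π̄_i ∝ 1/R_i, and the two printed numbers agree up to Monte Carlo error.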
4.5 Exercises
EXERCISE 4.5.1. The purpose of this exercise is to give another derivation of (4.1.5). Thus, let {E_n : n ≥ 1} be a sequence of mutually independent, unit exponential random variables, and define {J_n : n ≥ 0} and {N(t) : t ≥ 0} accordingly, as in §4.1.1. Given 0 = t_0 < ··· < t_ℓ and 0 ≤ n_1 ≤ ··· ≤ n_ℓ, use the change of variables formula for multi-dimensional integrals to justify:

    P(N(t_1) = n_1, …, N(t_ℓ) = n_ℓ) = P(J_{n_1} ≤ t_1 < J_{n_1+1}, …, J_{n_ℓ} ≤ t_ℓ < J_{n_ℓ+1})
      = ∫···∫_A exp(−Σ_{j=1}^{n_ℓ+1} ξ_j) dξ_1 ··· dξ_{n_ℓ+1}
      = ∫···∫_B e^{−η_{n_ℓ+1}} dη_1 ··· dη_{n_ℓ+1}
      = e^{−t_ℓ} Π_{j=1}^ℓ vol(Γ_j) = Π_{j=1}^ℓ e^{−(t_j − t_{j−1})} (t_j − t_{j−1})^{n_j − n_{j−1}}/(n_j − n_{j−1})!,

where

    A = {(ξ_1, …, ξ_{n_ℓ+1}) ∈ (0, ∞)^{n_ℓ+1} : Σ_{k=1}^{n_j} ξ_k ≤ t_j < Σ_{k=1}^{n_j+1} ξ_k for 1 ≤ j ≤ ℓ},
    B = {(η_1, …, η_{n_ℓ+1}) ∈ (0, ∞)^{n_ℓ+1} : η_i < η_{i+1} for 1 ≤ i ≤ n_ℓ & η_{n_j} ≤ t_j < η_{n_j+1} for 1 ≤ j ≤ ℓ},
    Γ_j = {(u_1, …, u_{n_j−n_{j−1}}) ∈ ℝ^{n_j−n_{j−1}} : t_{j−1} ≤ u_1 < ··· < u_{n_j−n_{j−1}} ≤ t_j},

with the obvious modifications when n_j = n_{j−1} for some 1 ≤ j ≤ ℓ.
EXERCISE 4.5.2. Let M_1 and M_2 be commuting elements of M_{u,v}(§). After checking that

    (M_1 + M_2)^m = Σ_{ℓ=0}^m (m choose ℓ) M_1^ℓ M_2^{m−ℓ}  for all m ∈ ℕ,

verify (4.2.12).

EXERCISE 4.5.3. Given a Q-recurrent state i, set C = {j : i↔j}, and show that R_i > 0 ⟹ R_j > 0 for all j ∈ C.

EXERCISE 4.5.4. In this exercise we will give the continuous-time version of the ideas in Exercise 2.4.1. For this purpose, assume that § is irreducible and positive recurrent with respect to Q, and use π̄ to denote the unique
probability vector which is stationary for each P(t), t > 0. Next, determine the adjoint semigroup {P(t)^⊤ : t ≥ 0} so that

    (P(t)^⊤)_{ij} = (π̄)_j (P(t))_{ji} / (π̄)_i.

(a) Show that P(t)^⊤ is a transition probability matrix and that π̄P(t)^⊤ = π̄ for each t ≥ 0. In addition, check that {P(t)^⊤ : t ≥ 0} is the semigroup determined by the Q-matrix Q^⊤, where

    (Q^⊤)_{ij} = (π̄)_j (Q)_{ji} / (π̄)_i.

(b) Let P and P^⊤ denote the probabilities computed for the Markov processes corresponding, respectively, to Q and Q^⊤ with initial distribution π̄. Show that P^⊤ is the reverse of P in the sense that, for each n ∈ ℕ, 0 = t_0 < t_1 < ··· < t_n, and (j_0, …, j_n) ∈ §^{n+1},

    P^⊤(X(t_m) = j_m for 0 ≤ m ≤ n) = P(X(t_n − t_m) = j_m for 0 ≤ m ≤ n).

(c) Define P^⊤ so that
    (P^⊤)_{ij} = (π̄)_j R_j (P)_{ji} / ((π̄)_i R_i) for i ≠ j, and (P^⊤)_{ii} = 0.

Show that P^⊤ is again a transition probability matrix, and check that Q^⊤ = R(P^⊤ − I).

EXERCISE 4.5.5. Take § = ℕ and (P)_{ij} equal to 1 or 0 according to whether j = i + 1 or not. Given a set of strictly positive rates ℜ, show that, no matter what its initial distribution, the Markov process determined by ℜ and P explodes with probability 1 if Σ_{i∈ℕ} R_i^{−1} < ∞ and does not explode if Σ_{i∈ℕ} R_i^{−1} = ∞.

EXERCISE 4.5.6. Here is a more interesting example of explosion. Take § = ℤ³, and let (cf. Exercise 3.3.5) P be the transition probability matrix for the symmetric, nearest-neighbor random walk on ℤ³. Given a set of positive rates ℜ with the property that Σ_{k∈ℤ³} R_k^{−1} < ∞, show that, starting at every k ∈ ℤ³, explosion occurs with probability 1 when (Q)_{kℓ} = R_k((P)_{kℓ} − δ_{k,ℓ}).
Hint: Apply the criterion in the second half of Theorem 4.3.6 to a function of the form

    u(ℓ) = (a² + Σ_{i=1}^3 ((k)_i − (ℓ)_i)²)^{−1/2}, ℓ ∈ ℤ³,

and use the computation in Exercise 3.3.5.
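Exercise 4.5.7 below observes that explicit expressions for P(t) are seldom available even when § is finite. Numerically, a standard and stable alternative to diagonalization is uniformization: taking λ ≥ max_i R_i and K = I + Q/λ, one has P(t) = Σ_{n≥0} e^{−λt} (λt)ⁿ/n! · Kⁿ, a Poisson-weighted mixture of powers of a stochastic matrix. A minimal sketch (mine, not from the text):

```python
import math

def uniformized_Pt(Q, t, n_terms=60):
    """Compute P(t) = e^{tQ} by uniformization: with lam >= max_i R_i and
    K = I + Q/lam, P(t) = sum_n e^{-lam t} (lam t)^n / n! * K^n."""
    d = len(Q)
    lam = max(-Q[i][i] for i in range(d)) or 1.0
    K = [[(1.0 if i == j else 0.0) + Q[i][j] / lam for j in range(d)]
         for i in range(d)]
    term = [[float(i == j) for j in range(d)] for i in range(d)]   # K^0
    weight = math.exp(-lam * t)                                    # Poisson mass at 0
    total = [[weight * term[i][j] for j in range(d)] for i in range(d)]
    for n in range(1, n_terms):
        term = [[sum(term[i][k] * K[k][j] for k in range(d))
                 for j in range(d)] for i in range(d)]             # K^n
        weight *= lam * t / n                                      # Poisson mass at n
        for i in range(d):
            for j in range(d):
                total[i][j] += weight * term[i][j]
    return total

# Two-state chain with rates 1 and 2, for which (P(t))_00 = (2 + e^{-3t})/3.
Q = [[-1.0, 1.0], [2.0, -2.0]]
Pt = uniformized_Pt(Q, t=0.7)
print([round(x, 6) for x in Pt[0]])  # each row sums to 1
```

Every partial sum is a convex combination of stochastic matrices, so, unlike a truncated Taylor series for e^{tQ}, the computation never produces negative entries.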
EXERCISE 4.5.7. Even when § is finite, writing down a reasonably explicit expression for the solution to (4.2.7) is seldom easy and often impossible. Nonetheless, a little linear algebra often does quite a lot of good. Throughout this exercise, Q is a Q-matrix on the state space §, and it is assumed that the associated Markov process exists (i.e., does not explode) starting from any point.
(a) If u ∈ ℂ^§ is a bounded, non-zero, right eigenvector for Q with eigenvalue α ∈ ℂ, show that the real part of α must be less than or equal to 0.
(b) Assume that N = #§ < ∞ and that Q admits a complete set of linearly independent, right eigenvectors u_1, …, u_N ∈ ℂ^§ with associated eigenvalues α_1, …, α_N. Let U be the matrix whose mth column is u_m, and show that e^{tQ} = U Λ(t) U^{−1}, where Λ(t) is the diagonal matrix whose mth diagonal entry is e^{tα_m}.

EXERCISE 4.5.8. Here is the continuous analog of the result in Exercise 3.3.7. Namely, assume that i is Q-recurrent and R_i > 0, set

    μ_j = E[∫_0^{σ_i} 1_{{j}}(X(t)) dt | X(0) = i] for j ∈ §,

and let μ̂ be the row vector given by (μ̂)_j = μ_j for each j ∈ §.
(a) Show that μ_i = R_i^{−1}, that μ_j < ∞ for all j ∈ §, and that μ_j > 0 if and only if i→j relative to Q.
(b) Show that μ̂ = μ̂P(t) for all t > 0. Hint: Check that … and that …
(c) In particular, if i is Q-positive recurrent and C = {j : i↔j}, show that μ̂ = (Σ_j μ_j) π̄_C. Equivalently,

    (π̄_C)_j = E[∫_0^{σ_i} 1_{{j}}(X(t)) dt | X(0) = i] / E[σ_i | X(0) = i].
EXERCISE 4.5.9. Having given the continuous-time analog of Exercise 3.3.7, we now want to give the continuous-time analog of Exercise 3.3.8. For this purpose, again let i be a Q-recurrent element of § with R_i > 0, and define the measure μ̂ accordingly, as in Exercise 4.5.8. Next, assume that ν ∈ [0, ∞)^§ satisfies the conditions that (ν)_i > 0, that (ν)_j = 0 unless P(σ_j < ∞ | X(0) = i) = 1, and that νQ = 0 in the sense that R_j(ν)_j = Σ_{k≠j} (ν)_k (Q)_{kj} for all j ∈ §.⁶ The goal is to show that ν = R_i (ν)_i μ̂. Equivalently,

    (ν)_j = R_i (ν)_i E[∫_0^{σ_i} 1_{{j}}(X(t)) dt | X(0) = i]  for all j ∈ §.

In particular, by Exercise 4.5.8, this will mean that ν = νP(t) for all t > 0. In the following, the transition probability P is chosen so that its diagonal entries vanish and Q = R(P − I), where, as usual, R is the diagonal matrix whose jth diagonal entry is R_j for each j ∈ §.
(a) Begin by showing that i is P-recurrent.
(b) Define the row vector v̂ so that (v̂)_j = R_j(ν)_j / (R_i(ν)_i), and show that v̂ = v̂P.
(c) By combining (b) with Exercise 3.3.7, show that

    (v̂)_j = E[Σ_{m=0}^{ρ_i − 1} 1_{{j}}(X_m) | X_0 = i],

where {X_n : n ≥ 0} is the Markov chain with transition probability matrix P.
(d) Show that

    R_j E[∫_0^{σ_i} 1_{{j}}(X(t)) dt | X(0) = i] = E[Σ_{m=0}^{ρ_i − 1} 1_{{j}}(X_m) | X_0 = i],

and, after combining this with (c), arrive at the desired conclusion.

EXERCISE 4.5.10. Following the strategy used in Exercise 3.3.9, show that, under the hypotheses in Corollary 4.4.10, one has the following continuous version of the individual ergodic theorem:
    P(lim_{T→∞} ‖L_T − π̄_C‖_v = 0) = 1,

where C = {i : j↔i} and L_T is the empirical measure determined by

    (L_T)_i = (1/T) ∫_0^T 1_{{i}}(X(t)) dt.

In addition, following the strategy in Exercise 3.3.11, show that, for any initial distribution,

    P(lim_{T→∞} (1/T) ∫_0^T 1_{{j}}(X(t)) dt = 0) = 1

when j is not Q-positive recurrent.
⁶ The reason why the condition νQ = 0 needs amplification is that, because Q has negative diagonal entries, possible infinities could cause ambiguities.
EXERCISE 4.5.11. Given a Markov process {X(t) : t ≥ 0} with Q-matrix Q, one can produce a Markov process with Q-matrix MQ by speeding up the clock of t ↦ X(t) by a factor of M. That is, {X(Mt) : t ≥ 0} is a Markov process with Q-matrix MQ. The purpose of this exercise is to show how to carry out the analogous procedure, known in the literature as random time change, for variable rates. To be precise, let P be a transition probability matrix on § with (P)_{ii} = 0 for all i. Choose and fix an i ∈ §, let {X_n : n ≥ 0} be a Markov chain starting from i with transition probability matrix P and {N(t) : t ≥ 0} a simple Poisson process which is independent of {X_n : n ≥ 0}, and set X⁰(t) = X_{N(t)} for t ≥ 0. In particular, {X⁰(t) : t ≥ 0} is a Markov process with Q-matrix (P − I) starting from i. Finally, let ℜ be a set of positive rates, and take Q = R(P − I) accordingly.
(a) Define

    A(t) = ∫_0^t R_{X⁰(τ)}^{−1} dτ  for t ∈ [0, ∞),

observe that t ↦ A(t) is strictly increasing, set A(∞) = lim_{t↗∞} A(t) ∈ (0, ∞], and use s ∈ [0, A(∞)) ↦ A^{−1}(s) ∈ [0, ∞) to denote the inverse of t ↦ A(t). Show that

    A^{−1}(s) = ∫_0^s R_{X⁰(A^{−1}(σ))} dσ,  s ∈ [0, A(∞)).

(b) Set X(s) = X⁰(A^{−1}(s)) for s ∈ [0, A(∞)). Define J⁰_0 = J_0 = 0 and, for n ≥ 1, J⁰_n and J_n to be the times of the nth jump of t ↦ X⁰(t) and s ↦ X(s), respectively. After noting that J_n = A(J⁰_n), conclude that, for each n ≥ 1, s > 0, and j ∈ §,

    P(J_n − J_{n−1} > s & X(J_n) = j | X(σ), σ ∈ [0, J_{n−1}]) = e^{−s R_{X(J_{n−1})}} (P)_{X(J_{n−1}) j}  on {J_{n−1} < ∞}.

(c) Show that the explosion time for the Markov process starting from i with Q-matrix Q has the same distribution as A(∞). In particular, if A(∞) = ∞ with probability 1, use (b) to conclude that {X(s) : s ≥ 0} is a Markov process starting from i with Q-matrix Q.
(d) As a consequence of these considerations, show that if the Markov process corresponding to Q does not explode, then neither does the one corresponding to Q′, where Q′ is related to Q by (Q′)_{ij} = a_i(Q)_{ij} and {a_i : i ∈ §} is a bounded subset of (0, ∞).
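The mechanism behind Exercise 4.5.11 — the time change stretches each unit-rate holding time of X⁰ at state i into an Exp(R_i) holding time of X, since A grows at speed 1/R_i while X⁰ sits at i — can be seen in a short simulation (the two-state chain and rates below are hypothetical choices of mine):

```python
import random

rng = random.Random(11)
R = [0.5, 4.0]                        # rates; the chain alternates 0 -> 1 -> 0 -> ...

# Over a holding interval of length h at state i, A increases by h / R_i, so
# the jump times of X(s) = X^0(A^{-1}(s)) satisfy J_n - J_{n-1} = h_n / R_i.
held = {0: [], 1: []}
state = 0
for _ in range(20_000):
    h = rng.expovariate(1.0)          # holding time of X^0, always Exp(1)
    held[state].append(h / R[state])  # corresponding holding time of X
    state = 1 - state

means = [sum(held[i]) / len(held[i]) for i in (0, 1)]
print([round(m, 2) for m in means])   # near 1/R_0 = 2.0 and 1/R_1 = 0.25
```

The empirical mean holding time at each state matches 1/R_i, as the Q-matrix Q = R(P − I) demands.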
CHAPTER 5

Reversible Markov Processes

This chapter is devoted to the study of a class of Markov processes which admit an initial distribution with respect to which they are reversible in the sense that, on every time interval, the distribution of the process is the same when it is run backwards as when it is run forwards. That is, for any n ≥ 1 and (i_0, …, i_n) ∈ §^{n+1}, in the discrete-time setting,

(5.0.1)    P(X_m = i_m for 0 ≤ m ≤ n) = P(X_{n−m} = i_m for 0 ≤ m ≤ n),

and, in the continuous-time setting,

(5.0.2)    P(X(t_m) = i_m for 0 ≤ m ≤ n) = P(X(t_n − t_m) = i_m for 0 ≤ m ≤ n)

whenever 0 = t_0 < ··· < t_n. Notice that the initial distribution of such a process is necessarily stationary. Indeed, depending on whether the setting is that of discrete or continuous time, we have

    P(X_0 = i & X_n = j) = P(X_n = i & X_0 = j)  or  P(X(0) = i & X(t) = j) = P(X(t) = i & X(0) = j),

from which stationarity follows after one sums over j. In fact, what the preceding argument reveals is that reversibility says that the joint distribution of, depending on the setting, (X_0, X_n) or (X(0), X(t)) is the same as that of (X_n, X_0) or (X(t), X(0)). This should be contrasted with stationarity, which gives equality only for the marginal distribution of the first components of these.

In view of the preceding, one should suspect that reversible Markov processes have ergodic properties which are better than those of general stationary processes, and in this chapter we will examine some of these special properties.

5.1 Reversible Markov Chains
In this section we will discuss irreducible, reversible Markov chains. Because the initial distribution of such a chain is stationary, we know (cf. Theorem 3.2.10) that the chain must be positive recurrent and that the initial distribution must be the probability vector (cf. (3.2.9)) π = π_§ whose ith component
is (π)_i = E[ρ_i | X_0 = i]^{−1}. Thus, if P is the transition probability matrix, then, by taking n = 1 in (5.0.1), we see that

    (π)_i (P)_{ij} = P(X_0 = i & X_1 = j) = P(X_0 = j & X_1 = i) = (π)_j (P)_{ji}.

That is, P satisfies¹

(5.1.1)    (π)_i (P)_{ij} = (π)_j (P)_{ji},

the condition of detailed balance. Conversely, (5.1.1) implies reversibility. To see this, one works by induction on n ≥ 1 to check that

    (π)_{i_0} (P)_{i_0 i_1} ··· (P)_{i_{n−1} i_n} = (π)_{i_n} (P)_{i_n i_{n−1}} ··· (P)_{i_1 i_0},

which is equivalent to (5.0.1).
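The induction just described is easy to confirm numerically: under detailed balance (5.1.1), every finite path carries the same probability as its reversal, which is exactly (5.0.1). The sketch below uses a hypothetical three-state chain built from symmetric conductances (my own choice of example), for which (5.1.1) holds by construction.

```python
from itertools import product

c = [[0, 1, 2], [1, 0, 3], [2, 3, 0]]           # symmetric conductances c_ij = c_ji
ci = [sum(row) for row in c]
P = [[c[i][j] / ci[i] for j in range(3)] for i in range(3)]
pi = [ci[i] / sum(ci) for i in range(3)]         # pi_i P_ij = c_ij / sum(c): detailed balance

def path_prob(path):
    """pi_{i_0} P_{i_0 i_1} ... P_{i_{n-1} i_n} for a tuple of states."""
    p = pi[path[0]]
    for a, b in zip(path, path[1:]):
        p *= P[a][b]
    return p

# (5.1.1) => (5.0.1): every path has the same probability as its reversal.
assert all(
    abs(path_prob(w) - path_prob(w[::-1])) < 1e-12
    for w in product(range(3), repeat=4)
)
print("detailed balance and path reversibility hold")
```

The chain is the random walk on a weighted triangle; any chain of this conductance form is reversible, which is why such examples are the standard test bed for (5.1.1).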
5.1.1. Reversibility from Invariance: As we have already seen, reversibility implies invariance, and it should be clear that the converse is false. On the other hand, there are two canonical ways in which one can pass from an irreducible transition probability P with stationary distribution π to a transition probability for which π is reversible. Namely, define the adjoint P^⊤ of P so that

(5.1.2)    (P^⊤)_{ij} = (π)_j (P)_{ji} / (π)_i.

Obviously, π is reversible for P if and only if P = P^⊤. More generally, because πP = π, P^⊤ is again a transition probability. In addition, one can easily verify that both

(5.1.3)    (P + P^⊤)/2 and P^⊤P

are transition probabilities which are reversible with respect to π. As is explained in Exercise 5.6.9 below, each of these constructions has its own virtue.

¹ The reader who did Exercise 2.4.1 should recognize that the condition (5.1.1) is precisely the same as the statement that P = P^⊤. In particular, if one knows the conclusion of that exercise, then one has no need for the discussion which follows.

5.1.2. Measurements in Quadratic Mean: For reasons which will become increasingly clear, it turns out that we will here want to measure the size of functions using a Euclidean norm rather than the uniform norm ‖f‖_u. Namely, we will use the norm

(5.1.4)    ‖f‖_{2,π} ≡ (⟨f²⟩_π)^{1/2}, where ⟨g⟩_π ≡ Σ_{i∈§} (π)_i g(i)
5.1 Reversible Markov Chains
109
is the expected value of g with respect to Because ( 11" ) i > 0 for each i E §, it is clear that l l f l l 2 ,1r = 0 {==} f 0. In addition, if we define the inner product ( f, g) 1r to be ( fg) 7r , then, for any t > 0, 71" .
=
± C1
f ll�,7r = t2 I I J I I �,7r ± 2 ( f, g) 7r + 2 ll g ll �,7r ' ll t f and so I ( f, g) 7r I :::; t2 11 fll §,7r + r 2 ll g ll §,7r for all t > 0. To get the best estimate, we minimize the right hand side with respect to t > 0. When either f or g is identically 0, then we see that ( f, g) 7r = 0 by letting t --+ oo or t --+ 0. If 1 o :::::
C
neither f nor g vanishes identically, we can do best by taking t = Hence, in any case, we arrive at Schwarz 's inequality
( ������:: )
4 .
(5.1.5) Given Schwarz's inequality, we know That is, we have the triangle inequality:
(5.1.6) Thus, if L 2 (11") denotes the space of f for which llf l l 2 ,7r < oo, then L2 ( 11" ) is a linear space for which ( f , g)'-"' I I ! - g l l 2 , 7r is a metric. In fact, this metric space is complete since, if limm _,00 supn > m l l fn - fm l l 2 ,1r = 0, then { fn (i) n 2:: 0} is Cauchy convergent in � for each i E §, and therefore there exists a limit function f such that fn(i) f(i) for each i E §. Moreover, by Fatou's Lemma, :
____,
ll f - fm ll2.1r :::; lim ll fn - fm l l 2 ,1r n ---> oo
____,
0 as
m --+ 00.
In this chapter, we will use the notation
Pf(i) = L f (j ) ( P ) ij jE§
= (Pf) i,
where f is the column vector determined by f. When f is bounded, it is clear that P f(i) is well-defined and that l i P f llu :::; l l f llu , where ll g l l u supi E§ lg(i) l = ll gll u when g is the column vector determined by g . We next want show that, even when f E L 2 ( ) P f ( i) is well-defined for each i E § and that P is a contraction in L 2 ( 11" ) l i P f l l 2 ,1r :::; ll f l l 2 , 1r · To check that P f(i) is well-defined, we will show that the the series LjE§ f (j ) ( P)ij is absolutely convergent. But, by the form of Schwarz's inequality in Exercise 1.3.1, =
11"
,
:
L I J (j ) I ( P ) ij jE§
:::::
I I J I I 2 ,7r
(
)
(P ) 2. . ! L ( ) J jE§ ']
11"
'
5 REVERSIBLE MARKOV PROCESSES
1 10 and, by (5. 1 .1 ),
As for the estimate li P f l brr :::; IIJII 2 ,1r , we use Exercise 5.6.2 below together with I;1 (P) ij 1 to see that (Pf(i)) 2 :::; Pj 2 (i) for each i. Thus, since 1r is P-stationary, =
(5.1 .7) An important consequence of (5.1 .7) is the fact that, in general (cf. (2.2.4)) , (5.1 .8) and (5. 1.9) P is aperiodic
===?
lim--+ CQ II P n f - (!) 1r ll 2 '1r = 0 for all f E L2 (rr). n-
To see these, first observe that, by (3.2. 11) , (3.2. 15), and Lebesgue's Domi nated Convergence Theorem, there is nothing to do when f vanishes off of a finite set. Thus, if { FN N ?: 1 } is an exhaustion of § by finite sets and if, for f E L 2 (rr), !N lFN f, then, for each N E z+ , :
=
IIAnf - (!) 7r l l 2 ,1r :S II An( ! - fN ) ll 2,7r + IIAnfN - (JN)1r ll 2,7r + (IJN - J l ) 7r :S 2 IIJ - JN II 2 ,7r + IIAnfN - (JN)7r ll 2 ,7r , where, in the passage to the second line, we have used IIAng ll 2 , 1r :S l l g l l 2 ,1r , which follows immediately from (5.1.7) . Thus, for each N, limn -+oo IIAnf (!) 1r l l 2 ,1r :S 2 11 ! - !N I I 2 . 1r , which gives (5.1 .8) when N --+ oo. The argument
for (5. 1.9) is essentially the same, and, as is shown in Exercise 5.6.6 below, all these results hold even in the non-reversible setting. Finally, (5.1.1) leads to
(i ,j )
( i ,j )
In other words, P is symmetric on L 2 ( 1r ) in the sense that (5. 1 .10) 5. 1.3. The Spectral Gap: The equation (5. 1 .7) combined with (5.1 . 10) say that P a self-adjoint contraction on the Hilbert space2 is L 2 (rr). For the 2 A Hilbert space is a vector space equipped with an inner product which determines a norm for which the associated metric is complete.
reader who is unfamiliar with these concepts at this level of abstraction, think about the case when $§ = \{1, \ldots, N\}$. Then the space of functions $f : § \to \mathbb{R}$ can be identified with $\mathbb{R}^N$. (Indeed, we already made this identification when we gave, in §2.1.3, the relation between functions and column vectors.) After making this identification, the inner product on $L^2(\pi)$ becomes the inner product $\sum_i (\pi)_i(\mathbf{v})_i(\mathbf{w})_i$ for column vectors $\mathbf{v}$ and $\mathbf{w}$ in $\mathbb{R}^N$. Hence (5.1.7) says that the matrix $P$ acts as a symmetric contraction on $\mathbb{R}^N$ with this inner product. Alternatively, if $\hat P = \Pi^{\frac12}P\,\Pi^{-\frac12}$, where $\Pi$ is the diagonal matrix whose $i$th diagonal entry is $(\pi)_i$, then, by (5.1.1), $\hat P$ is symmetric with respect to the standard inner product $(\mathbf{v},\mathbf{w})_{\mathbb{R}^N} = \sum_i (\mathbf{v})_i(\mathbf{w})_i$ on $\mathbb{R}^N$. Moreover, because, by (5.1.7), $\|\hat P\mathbf{v}\|_{\mathbb{R}^N} = \|Pg\|_{2,\pi} \le \|g\|_{2,\pi} = \|\mathbf{v}\|_{\mathbb{R}^N}$, where $g$ is the function determined by the column vector $\Pi^{-\frac12}\mathbf{v}$, we see that, as an operator on $\mathbb{R}^N$, $\hat P$ is length contracting. Now, by the standard theory of symmetric matrices on $\mathbb{R}^N$, we know that $\hat P$ admits eigenvalues $1 \ge \lambda_1 \ge \cdots \ge \lambda_N \ge -1$ with associated eigenvectors $(\mathbf{e}_1, \ldots, \mathbf{e}_N)$ which are orthonormal for $(\,\cdot\,,\,\cdot\,)_{\mathbb{R}^N}$: $(\mathbf{e}_k, \mathbf{e}_\ell)_{\mathbb{R}^N} = \delta_{k,\ell}$. Moreover, because $\sqrt{(\pi)_i} = \sum_{j=1}^N (\hat P)_{ij}\sqrt{(\pi)_j}$, we know that $\lambda_1 = 1$ and can take $(\mathbf{e}_1)_i = \sqrt{(\pi)_i}$. Finally, by setting $\mathbf{g}_\ell = \Pi^{-\frac12}\mathbf{e}_\ell$ and letting $g_\ell$ be the associated function on §, we see that $Pg_\ell = \lambda_\ell g_\ell$, $g_1 \equiv 1$, and $(g_k, g_\ell)_\pi = \delta_{k,\ell}$.

To summarize, when § has $N$ elements, we have shown that $P$ on $L^2(\pi)$ has eigenvalues $1 = \lambda_1 \ge \cdots \ge \lambda_N \ge -1$ with corresponding eigenfunctions $g_1, \ldots, g_N$ which are orthonormal with respect to $(\,\cdot\,,\,\cdot\,)_\pi$. Of course, since $L^2(\pi)$ has dimension $N$ and, by orthonormality, the $g_\ell$'s are linearly independent, $(g_1, \ldots, g_N)$ is an orthonormal basis in $L^2(\pi)$. In particular,

(5.1.11) $P^n f - \langle f \rangle_\pi = \sum_{\ell=2}^N \lambda_\ell^n\,(f, g_\ell)_\pi\,g_\ell$ for all $n \ge 0$ and $f \in L^2(\pi)$,

and so

$$\|P^n f - \langle f \rangle_\pi\|_{2,\pi}^2 = \sum_{\ell=2}^N \lambda_\ell^{2n}(f, g_\ell)_\pi^2 \le (1-\beta)^{2n}\|f - \langle f \rangle_\pi\|_{2,\pi}^2,$$

where $\beta = (1-\lambda_2) \wedge (1+\lambda_N)$ is the spectral gap between $\{-1, 1\}$ and $\{\lambda_2, \ldots, \lambda_N\}$. In other words,

(5.1.12) $\|P^n f - \langle f \rangle_\pi\|_{2,\pi} \le (1-\beta)^n\|f - \langle f \rangle_\pi\|_{2,\pi}$ for all $n \ge 0$ and $f \in L^2(\pi)$.
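The finite-state picture lends itself to a direct numerical check. The sketch below is not from the text: the 3-state birth-and-death chain, its reversible $\pi$, and the test functions are all illustrative choices. For this chain the eigenvalues are $1$, $\tfrac12$, $0$, so $\beta = (1-\tfrac12)\wedge(1+0) = \tfrac12$; the code verifies detailed balance, the symmetry (5.1.10), the contraction (5.1.7), and the decay (5.1.12).

```python
# Check, on an illustrative 3-state reversible chain, that detailed balance
# (5.1.1) makes P a self-adjoint contraction on L^2(pi) and that the
# spectral-gap estimate (5.1.12) holds.  Eigenvalues here: 1, 1/2, 0,
# so beta = 1/2.
import math

P  = [[0.5, 0.5, 0.0],
      [0.25, 0.5, 0.25],
      [0.0, 0.5, 0.5]]
pi = [0.25, 0.5, 0.25]
beta = 0.5

for i in range(3):
    for j in range(3):
        assert abs(pi[i]*P[i][j] - pi[j]*P[j][i]) < 1e-12   # (5.1.1)

def apply_P(f):
    return [sum(P[i][j]*f[j] for j in range(3)) for i in range(3)]

def ip(u, v):                                    # (u, v)_pi
    return sum(p*a*b for p, a, b in zip(pi, u, v))

f, g = [1.0, -1.0, 2.0], [0.5, 0.0, -3.0]
assert abs(ip(f, apply_P(g)) - ip(apply_P(f), g)) < 1e-12   # (5.1.10)
assert ip(apply_P(f), apply_P(f)) <= ip(f, f) + 1e-12       # (5.1.7)

mean = ip(f, [1.0, 1.0, 1.0])
dev0 = math.sqrt(ip([x-mean for x in f], [x-mean for x in f]))
cur = f
for n in range(1, 15):
    cur = apply_P(cur)
    dev = math.sqrt(ip([x-mean for x in cur], [x-mean for x in cur]))
    assert dev <= (1-beta)**n * dev0 + 1e-12                 # (5.1.12)
```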
When § is not finite, it will not be true in general that one can find an orthonormal basis of eigenfunctions for $P$. Instead, the closest approximation to the preceding development requires a famous result, known as the Spectral Theorem (cf. §107 in [7]), about bounded, symmetric operators on a Hilbert space. Nonetheless, seeing as it is the estimate (5.1.12) in which we are most interested, we can get away without having to invoke the Spectral Theorem. To be more precise, observe that (5.1.12) holds when

(5.1.13) $\beta = 1 - \sup\bigl\{\|Pf - \langle f \rangle_\pi\|_{2,\pi} : f \in L^2(\pi) \text{ with } \|f\|_{2,\pi} = 1\bigr\}.$

To see this, first note that, when $\beta$ is given by (5.1.13),

$$\|Pf - \langle f \rangle_\pi\|_{2,\pi} = \|f\|_{2,\pi}\left\|P\!\left(\frac{f}{\|f\|_{2,\pi}}\right) - \left\langle\frac{f}{\|f\|_{2,\pi}}\right\rangle_{\!\pi}\right\|_{2,\pi} \le (1-\beta)\|f\|_{2,\pi} \quad \text{when } f \in L^2(\pi)\setminus\{0\},$$

and that $\|Pf - \langle f \rangle_\pi\|_{2,\pi} \le (1-\beta)\|f\|_{2,\pi}$ trivially when $f = 0$. Next, because $\langle P^n f \rangle_\pi = \langle f \rangle_\pi$ for all $n \ge 0$,

$$\|P^{n+1}f - \langle f \rangle_\pi\|_{2,\pi} = \|P(P^n f) - \langle P^n f \rangle_\pi\|_{2,\pi} \le (1-\beta)\|P^n f - \langle P^n f \rangle_\pi\|_{2,\pi} = (1-\beta)\|P^n f - \langle f \rangle_\pi\|_{2,\pi}.$$

Thus, by induction on $n$, $\|P^n f - \langle f \rangle_\pi\|_{2,\pi} \le (1-\beta)^n\|f\|_{2,\pi}$ for all $n \ge 0$ and $f \in L^2(\pi)$. Hence, if $f \in L^2(\pi)$ and $\bar f = f - \langle f \rangle_\pi$, then

$$\|P^n f - \langle f \rangle_\pi\|_{2,\pi} = \|P^n \bar f\|_{2,\pi} \le (1-\beta)^n\|\bar f\|_{2,\pi} = (1-\beta)^n\|f - \langle f \rangle_\pi\|_{2,\pi}.$$

That is, (5.1.12) holds with the $\beta$ in (5.1.13). Observe that when § is finite the $\beta$ in (5.1.13) coincides with the one in the preceding paragraph. Hence, we have made a first step toward generalizing the contents of that paragraph to situations when (5.1.11) does not apply.

5.1.4. Reversibility and Periodicity: Clearly, the constant $\beta$ in (5.1.13) can be as small as 0, in which case (5.1.12) tells us nothing. There are three ways in which this might happen. One way is that there exists an $f \in L^2(\pi)$ with the property that $\|f\|_{2,\pi} = 1$, $\langle f \rangle_\pi = 0$, and $Pf = f$. However, irreducibility rules out the existence of such an $f$. Indeed, because of irreducibility, we would have (cf. (5.1.8), and note that $Pf = f$ implies $A_n f = f$) the contradiction that $0 = \langle f \rangle_\pi = \lim_{n\to\infty}(A_n f)(i) = f(i)$ for all $i \in §$ and that $f(i) \ne 0$ for some $i \in §$. Thus, we can ignore this possibility because it never occurs. A second possibility is that there exists an $f \in L^2(\pi)$ with $\|f\|_{2,\pi} = 1$ such that $Pf = -f$. In fact, if $f$ is such a function, then $\langle f \rangle_\pi = \langle Pf \rangle_\pi = -\langle f \rangle_\pi$, and so $\langle f \rangle_\pi = 0$. Hence, we would have that $\|Pf - \langle f \rangle_\pi\|_{2,\pi} = 1$ and therefore that $\beta = 0$. The third possibility is that there is no non-zero solution to $Pf = -f$ but that, nonetheless, there exists a sequence $\{f_n\}_1^\infty \subseteq L^2(\pi)$ with $\|f_n\|_{2,\pi} = 1$ and $\langle f_n \rangle_\pi = 0$ such that $\|Pf_n\|_{2,\pi}$ tends to 1. Because the analysis of this last possibility requires the Spectral Theorem, we will not deal with it. However, as the next theorem shows, the second possibility has a pleasing and simple probabilistic interpretation. See Exercise 5.6.7 below for an extension of these considerations to non-reversible $P$'s.
5.1.14 THEOREM. If $P$ is an irreducible transition probability for which there is a reversible initial distribution, which is necessarily $\pi$, then the period of $P$ is either 1 or 2. Moreover, the period is 2 if and only if there exists an $f \in L^2(\pi)\setminus\{0\}$ for which $f = -Pf$.

PROOF: We begin by showing that the period $d$ must be less than or equal to 2. To this end, remember that, because of irreducibility, $(\pi)_i > 0$ for all $i$'s. Hence, the detailed balance condition (5.1.1) implies that $(P)_{ij} > 0 \iff (P)_{ji} > 0$. In particular, since, for each $i$, $(P)_{ij} > 0$ for some $j$ and therefore $(P^2)_{ii} = \sum_j (P)_{ij}(P)_{ji} > 0$, we see that the period must divide 2. To complete the proof at this point, first suppose that $d = 1$. If $f \in L^2(\pi)$ satisfies $f = -Pf$, then, as noted before, $\langle f \rangle_\pi = 0$, and yet, because of aperiodicity and (5.1.9), $\lim_{n\to\infty}P^n f(i) = \langle f \rangle_\pi = 0$ for each $i \in §$. Since $f = P^{2n}f$ for all $n \ge 0$, this means that $f \equiv 0$. Conversely, if $d = 2$, take $§_0$ and $§_1$ accordingly, as in §3.2.7, and consider $f = \mathbf{1}_{§_0} - \mathbf{1}_{§_1}$. Because of (3.2.19), $Pf = -f$, and clearly $\|f\|_{2,\pi} = 1$. □

As an immediate corollary of the preceding, we can give the following graph theoretic picture of aperiodicity for irreducible, reversible Markov chains. Namely, if we use $P$ to define a graph structure in which the elements of § are the "vertices" and an "edge" between $i$ and $j$ exists if and only if $(P)_{ij} > 0$, then the first part of Theorem 5.1.14, in combination with the considerations in §3.2.7, says that the resulting graph is bipartite (i.e., splits into two parts in such a way that all edges run from one part to the other) if and only if the chain fails to be aperiodic, and the second part says that this is possible if and only if there exists an $f \in L^2(\pi)\setminus\{0\}$ satisfying $Pf = -f$.

5.1.5. Relation to Convergence in Variation: Before discussing methods for finding or estimating the $\beta$ in (5.1.13), it might be helpful to compare the sort of convergence result contained in (5.1.12) to the sort of results we have been getting heretofore. To this end, first observe that

$$\bigl|P^n f(i) - \langle f \rangle_\pi\bigr| = \Bigl|\sum_j \bigl((P^n)_{ij} - (\pi)_j\bigr)f(j)\Bigr| \le \|f\|_{\mathrm u}\,\|\delta_i P^n - \pi\|_{\mathrm v}.$$

In particular, if one knows, as one does when Theorem 2.2.1 applies, that

(*) $\sup_i \|\delta_i P^n - \pi\|_{\mathrm v} \le C(1-\epsilon)^n$

for some $C < \infty$ and $\epsilon \in (0,1]$, then one has that

$$\|P^n f - \langle f \rangle_\pi\|_{\mathrm u} \le C(1-\epsilon)^n\|f\|_{\mathrm u},$$

which looks a lot like (5.1.12). Indeed, the only difference is that on the right hand side of (5.1.12), $C = 1$ and the norm is $\|f\|_{2,\pi}$ instead of $\|f\|_{\mathrm u}$. Thus, one should suspect that (*) implies that the $\beta$ in (5.1.12) is at least as large as the $\epsilon$ in (*). If, as will always be the case when § is finite, there exists a $g \in L^2(\pi)$
with the properties that $\|g\|_{2,\pi} = 1$, $\langle g \rangle_\pi = 0$, and either $Pg = (1-\beta)g$ or $Pg = -(1-\beta)g$, this suspicion is easy to verify. Namely, set $\tilde f = g\,\mathbf{1}_{[-R,R]}(g)$, where $R > 0$ is chosen so that $\alpha \equiv (\tilde f, g)_\pi \ge \frac12$, and set $f = \tilde f - \langle \tilde f \rangle_\pi$. Then, after writing $f = \alpha g + (f - \alpha g)$ and noting that $(g, f - \alpha g)_\pi = 0$, we see that, for any $n \ge 0$,

$$\|P^n f\|_{2,\pi}^2 = \alpha^2(1-\beta)^{2n} \pm 2\alpha(1-\beta)^n\bigl(g, P^n(f-\alpha g)\bigr)_\pi + \|P^n(f-\alpha g)\|_{2,\pi}^2 \ge \tfrac14(1-\beta)^{2n},$$

since $\bigl(g, P^n(f-\alpha g)\bigr)_\pi = \bigl(P^n g, f-\alpha g\bigr)_\pi = \bigl(\pm(1-\beta)\bigr)^n(g, f-\alpha g)_\pi = 0$. On the other hand, $\|P^n f\|_{2,\pi}^2 \le C^2(1-\epsilon)^{2n}\|f\|_{\mathrm u}^2 \le (2CR)^2(1-\epsilon)^{2n}$. Thus, $\tfrac14(1-\beta)^{2n} \le (2CR)^2(1-\epsilon)^{2n}$ for all $n \ge 0$, which is possible only if $\beta \ge \epsilon$. When no such $g$ exists, the same conclusion holds, only one has to invoke the Spectral Theorem in order to arrive at it.

As the preceding shows, uniform estimates on the variation distance between $\mu P^n$ and $\pi$ imply estimates like the one in (5.1.12). However, going in the opposite direction is not always possible. To examine what can be done, let a probability vector $\mu$ be given, and define $f$ so that $f(i) = \frac{(\mu)_i}{(\pi)_i}$. Then, since $\langle f \rangle_\pi = 1$ and $\langle g \rangle_\pi^2 \le \langle g^2 \rangle_\pi$ for all $g \in L^2(\pi)$,

$$\|\mu P^n - \pi\|_{\mathrm v} = \sum_j \bigl|(\mu P^n)_j - (\pi)_j\bigr| = \sum_j (\pi)_j\bigl|P^n f(j) - \langle f \rangle_\pi\bigr| \le \|P^n f - \langle f \rangle_\pi\|_{2,\pi},$$

where we have used (5.1.1) to write $(\mu P^n)_j = (\pi)_j P^n f(j)$. Hence, (5.1.12) implies that

(5.1.15) $\|\mu P^n - \pi\|_{\mathrm v} \le \|f - \langle f \rangle_\pi\|_{2,\pi}(1-\beta)^n = \left(\sum_i \frac{(\mu)_i^2}{(\pi)_i} - 1\right)^{\frac12}(1-\beta)^n.$

In the case when § is finite, and therefore there exists an $A \in (0,1)$ for which $(\pi)_i \ge A$, (5.1.15) yields the Doeblin type estimate

$$\|\mu P^n - \pi\|_{\mathrm v} \le \left(\frac{1}{A} - 1\right)^{\frac12}(1-\beta)^n \quad \text{for all probability vectors } \mu.$$
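The inequality (5.1.15) is easy to observe in action. The sketch below is not from the text; it uses an illustrative 3-state reversible chain (spectral gap $\beta = \tfrac12$) started from the point mass $\mu = \delta_0$ and checks the bound at each step.

```python
# Check of (5.1.15):
#   ||mu P^n - pi||_v <= (sum_i mu_i^2/pi_i - 1)^{1/2} (1-beta)^n
# for an illustrative 3-state chain with beta = 1/2, started at mu = delta_0.
import math

P  = [[0.5, 0.5, 0.0],
      [0.25, 0.5, 0.25],
      [0.0, 0.5, 0.5]]
pi = [0.25, 0.5, 0.25]
beta = 0.5

mu = [1.0, 0.0, 0.0]
const = math.sqrt(sum(m*m/p for m, p in zip(mu, pi)) - 1)

for n in range(1, 15):
    # one step of the chain: mu <- mu P
    mu = [sum(mu[i]*P[i][j] for i in range(3)) for j in range(3)]
    var_dist = sum(abs(m - p) for m, p in zip(mu, pi))
    assert var_dist <= const * (1 - beta)**n + 1e-12
```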
However, when § is infinite, (5.1.15), as distinguished from Doeblin, does not give a rate of convergence which is independent of $\mu$. In fact, unless $\sum_i \frac{(\mu)_i^2}{(\pi)_i} < \infty$, it gives no information at all. Thus, one might ask why we are considering estimates like (5.1.15) when Doeblin does as well and sometimes does better. The answer is that, although Doeblin may do well when it works, it seldom works when § is infinite and, even when § is finite, it usually gives a far less than optimal rate of convergence. See, for example, Exercise 5.6.15 below.

5.2 Dirichlet Forms and Estimation of β

Our purpose in this section will be to find methods for estimating the optimal (i.e., largest) value of $\beta$ for which (5.1.12) holds. Again, we will assume that the chain is reversible and irreducible. Later, we will add the assumption that it is aperiodic.

5.2.1. The Dirichlet Form and Poincaré's Inequality: Our first step requires us to find other expressions for the right hand side of (5.1.13). To this end, we begin with the observation that

(5.2.1) $1 - \beta = \sup\bigl\{\|Pf\|_{2,\pi} : f \in L_0^2(\pi) \text{ with } \|f\|_{2,\pi} = 1\bigr\}$, where $L_0^2(\pi) \equiv \{f \in L^2(\pi) : \langle f \rangle_\pi = 0\}$.

It is obvious that the supremum on the right hand side of (5.1.13) dominates the supremum on the right above. On the other hand, if $f \in L^2(\pi)$ with $\|f\|_{2,\pi} = 1$, then either $\|f - \langle f \rangle_\pi\|_{2,\pi} = 0$ and therefore $\|Pf - \langle f \rangle_\pi\|_{2,\pi} = 0$, or $1 \ge \|f - \langle f \rangle_\pi\|_{2,\pi} > 0$, in which case

$$\|Pf - \langle f \rangle_\pi\|_{2,\pi} = \|f - \langle f \rangle_\pi\|_{2,\pi}\left\|P\!\left(\frac{f - \langle f \rangle_\pi}{\|f - \langle f \rangle_\pi\|_{2,\pi}}\right)\right\|_{2,\pi}$$

is also dominated by the right hand side of (5.2.1). To go further, we need to borrow the following simple result from the theory of symmetric operators on a Hilbert space. Namely,

(5.2.2) $\sup\bigl\{\|Pf\|_{2,\pi} : f \in L_0^2(\pi)\ \&\ \|f\|_{2,\pi} = 1\bigr\} = \sup\bigl\{|(f, Pf)_\pi| : f \in L_0^2(\pi)\ \&\ \|f\|_{2,\pi} = 1\bigr\}.$

That the right hand side of (5.2.2) is dominated by the left is Schwarz's inequality: $|(f, Pf)_\pi| \le \|f\|_{2,\pi}\|Pf\|_{2,\pi}$. To prove the opposite inequality, let $f \in L_0^2(\pi)$ with $\|f\|_{2,\pi} = 1$ be given, assume that $\|Pf\|_{2,\pi} > 0$, and set $g = \frac{Pf}{\|Pf\|_{2,\pi}}$. Then $g \in L_0^2(\pi)$ and $\|g\|_{2,\pi} = 1$. Hence, if $\gamma$ denotes the supremum on the right hand side of (5.2.2), then, from the symmetry of $P$,

$$4\|Pf\|_{2,\pi} = 4(g, Pf)_\pi = \bigl((f+g), P(f+g)\bigr)_\pi - \bigl((f-g), P(f-g)\bigr)_\pi \le \gamma\bigl(\|f+g\|_{2,\pi}^2 + \|f-g\|_{2,\pi}^2\bigr) = 2\gamma\bigl(\|f\|_{2,\pi}^2 + \|g\|_{2,\pi}^2\bigr) = 4\gamma.$$
The advantage won for us by (5.2.2) is that, in conjunction with (5.2.1), it shows that

(5.2.3) $\beta = \beta_+ \wedge \beta_-$, where $\beta_\pm \equiv \inf\bigl\{(f, (I \mp P)f)_\pi : f \in L_0^2(\pi)\ \&\ \|f\|_{2,\pi} = 1\bigr\}.$

Notice that, because $(I-P)f = (I-P)(f - \langle f \rangle_\pi)$,

(5.2.4) $\beta_+ = \inf\bigl\{(f, (I-P)f)_\pi : f \in L^2(\pi)\ \&\ \operatorname{Var}_\pi(f) \equiv \|f - \langle f \rangle_\pi\|_{2,\pi}^2 = 1\bigr\}.$

At the same time, because

$$\bigl(f, (I+P)f\bigr)_\pi = \bigl(\bar f, (I+P)\bar f\bigr)_\pi + 2\langle f \rangle_\pi^2, \quad \text{where } \bar f = f - \langle f \rangle_\pi,$$

it is clear that the infimum in the definition of $\beta_-$ is the same whether we consider all $f \in L^2(\pi)$ or just those in $L_0^2(\pi)$. Hence, another expression for $\beta_-$ is

(5.2.5) $\beta_- = \inf\bigl\{(f, (I+P)f)_\pi : f \in L^2(\pi)\ \&\ \|f\|_{2,\pi} = 1\bigr\}.$

Two comments are in order here. First, when $P$ is non-negative definite, abbreviated by $P \ge 0$, in the sense that $(f, Pf)_\pi \ge 0$ for all $f \in L^2(\pi)$, then $\beta_+ \le 1 \le \beta_-$, and so $\beta = \beta_+$. Hence,

(5.2.6) $P \ge 0 \implies \beta = \inf\bigl\{(f, (I-P)f)_\pi : f \in L^2(\pi)\ \&\ \operatorname{Var}_\pi(f) = 1\bigr\}.$

Second, by Theorem 5.1.14, we know that, in general, $\beta = 0$ unless the chain is aperiodic, and (cf. (5.1.11) and the discussion at the beginning of §5.1.4), when § is finite, that $\beta > 0$ if and only if the chain is aperiodic.

The expressions for $\beta_+$ and $\beta_-$ in (5.2.4) and (5.2.5) lead to important calculational tools. Namely, observe that, by (5.1.1),

$$\bigl(f, (I-P)f\bigr)_\pi = \sum_{(i,j)} f(i)(\pi)_i(P)_{ij}\bigl(f(i)-f(j)\bigr) = \sum_{(i,j)} f(i)(\pi)_j(P)_{ji}\bigl(f(i)-f(j)\bigr) = \sum_{(i,j)} f(j)(\pi)_i(P)_{ij}\bigl(f(j)-f(i)\bigr).$$

Hence, when we add the second expression to the last, we find that
(5.2.7) $\bigl(f, (I-P)f\bigr)_\pi = \mathcal{E}(f,f)$, where

(5.2.8) $\mathcal{E}(f,f) \equiv \frac12\sum_{i \ne j}(\pi)_i(P)_{ij}\bigl(f(j)-f(i)\bigr)^2.$

Because the quadratic form $\mathcal{E}(f,f)$ is a discrete analog of the famous quadratic form $\int |\nabla f|^2(x)\,dx$ introduced by Dirichlet, it is called a Dirichlet form. Extending this metaphor, one interprets $\beta_+$ as the Poincaré constant in the Poincaré inequality

(5.2.9) $\operatorname{Var}_\pi(f) \le \frac{1}{\beta_+}\,\mathcal{E}(f,f).$

To make an analogous application of (5.2.5), observe that

$$\sum_{(i,j)}\bigl(f(j)+f(i)\bigr)^2(\pi)_i(P)_{ij} = \langle Pf^2 \rangle_\pi + 2(f, Pf)_\pi + \|f\|_{2,\pi}^2 = 2\|f\|_{2,\pi}^2 + 2(f, Pf)_\pi,$$

and therefore

(5.2.10) $\bigl(f, (I+P)f\bigr)_\pi = \tilde{\mathcal{E}}(f,f) \equiv \frac12\sum_{(i,j)}(\pi)_i(P)_{ij}\bigl(f(j)+f(i)\bigr)^2.$

Hence

(5.2.11) $\beta_- = \inf\bigl\{\tilde{\mathcal{E}}(f,f) : f \in L^2(\pi)\ \&\ \operatorname{Var}_\pi(f) = 1\bigr\} = \inf\bigl\{\tilde{\mathcal{E}}(f,f) : f \in L_0^2(\pi)\ \&\ \|f\|_{2,\pi} = 1\bigr\}.$

In order to give an immediate application of (5.2.9) and (5.2.11), suppose (cf. Exercise 2.4.3) that $(P)_{ij} \ge \epsilon_j$ for all $(i,j)$, set $\epsilon = \sum_j \epsilon_j$, assume that $\epsilon > 0$, and define the probability vector $\mu$ so that $(\mu)_j = \frac{\epsilon_j}{\epsilon}$. Then, by Schwarz's inequality for expectations with respect to $\mu$ and the variational characterization of $\operatorname{Var}_\pi(f)$ as the minimum value of $a \rightsquigarrow \langle (f-a)^2 \rangle_\pi$,

$$2\mathcal{E}(f,f) = \sum_{(i,j)}(\pi)_i(P)_{ij}\bigl(f(j)-f(i)\bigr)^2 \ge \epsilon\sum_{(i,j)}\bigl(f(j)-f(i)\bigr)^2(\pi)_i(\mu)_j \ge \epsilon\sum_i\bigl(\langle f \rangle_\mu - f(i)\bigr)^2(\pi)_i \ge \epsilon\operatorname{Var}_\pi(f),$$

and, similarly, $\tilde{\mathcal{E}}(f,f) \ge \frac{\epsilon}{2}\operatorname{Var}_\pi(f)$. Hence, by (5.2.9), (5.2.11), and (5.2.3), $\beta \ge \frac{\epsilon}{2}$. Of course, this result is significantly less strong than the one we get by combining Exercise 2.4.3 with the reasoning in §5.1.4. Namely, by that exercise we know that $\|\delta_i P^n - \pi\|_{\mathrm v} \le 2(1-\epsilon)^n$, and so the reasoning in §5.1.4 tells us that $\beta \ge \epsilon$, which is twice as good as the estimate we are getting here. On the other hand, as we already advertised, there are circumstances when the considerations here yield results when a Doeblin-type approach does not.

5.2.2. Estimating β₊: The origin of many applications of (5.2.9) and (5.2.11) to estimate $\beta_+$ and $\beta_-$ is the simple observation that

(5.2.12) $\operatorname{Var}_\pi(f) = \frac12\sum_{(i,j)}\bigl(f(i)-f(j)\bigr)^2(\pi)_i(\pi)_j,$
which is easily checked by expanding $\bigl(f(i)-f(j)\bigr)^2$ and seeing that the sum on the right equals $2\langle f^2 \rangle_\pi - 2\langle f \rangle_\pi^2$. The importance of (5.2.12) is that it expresses $\operatorname{Var}_\pi(f)$ in terms of differences between the values of $f$ at different points in §, and clearly $\mathcal{E}(f,f)$ is also given in terms of such differences. However, the differences which appear in $\mathcal{E}(f,f)$ are only between the values of $f$ at pairs of points $(i,j)$ for which $(P)_{ij} > 0$, whereas the right hand side of (5.2.12) entails sampling all pairs $(i,j)$. Thus, in order to estimate $\operatorname{Var}_\pi(f)$ in terms of $\mathcal{E}(f,f)$, it is necessary to choose, for each $(i,j)$ with $i \ne j$, a path $p(i,j) = (k_0, \ldots, k_n) \in §^{n+1}$ with $k_0 = i$ and $k_n = j$, which is allowable in the sense that $(P)_{k_{m-1}k_m} > 0$ for each $1 \le m \le n$, and to write

$$\bigl(f(i)-f(j)\bigr)^2 = \left(\sum_{e \in p(i,j)} \Delta_e f\right)^2,$$

where the summation in $e$ is taken over the oriented segments $(k_{m-1}, k_m)$ in the path $p(i,j)$, and, for $e = (k,\ell)$, $\Delta_e f \equiv f(\ell) - f(k)$. At this point there are various ways in which one can proceed. For example, given any $\{a(e) : e \in p(i,j)\} \subseteq (0,\infty)$, Schwarz's inequality (cf. Exercise 1.3.1) shows that the quantity on the right is dominated by

$$\left(\sum_{e \in p(i,j)}\frac{a(e)}{\rho(e)}\right)\left(\sum_{e \in p(i,j)}\frac{\rho(e)}{a(e)}(\Delta_e f)^2\right) \le \left(\max_{e \in p(i,j)}\frac{1}{a(e)}\sum_{e' \in p(i,j)}\frac{a(e')}{\rho(e')}\right)\sum_{e \in p(i,j)}(\Delta_e f)^2\rho(e),$$

where $\rho(e) \equiv (\pi)_k(P)_{k\ell}$ when $e = (k,\ell)$. Thus, for any selection of paths $\mathcal{P} = \{p(i,j) : (i,j) \in §^2\setminus D\}$ ($D$ here denotes the diagonal $\{(i,j) \in §^2 : i = j\}$) and coefficients $A = \{a(e,p) : e \in p \in \mathcal{P}\} \subseteq (0,\infty)$,

$$\operatorname{Var}_\pi(f) \le \frac12\sum_{p \in \mathcal{P}} w_A(p)\sum_{e \in p}(\Delta_e f)^2\rho(e) = \frac12\sum_e (\Delta_e f)^2\rho(e)\sum_{p \ni e} w_A(p) \le W(\mathcal{P},A)\,\mathcal{E}(f,f),$$

where

$$w_A(p) = (\pi)_i(\pi)_j\max_{e \in p}\frac{1}{a(e,p)}\sum_{e' \in p}\frac{a(e',p)}{\rho(e')} \quad \text{if } p = p(i,j),$$

and

$$W(\mathcal{P},A) = \sup_e\sum_{p \ni e} w_A(p).$$
Hence, we have now shown that

(5.2.13) $\beta_+ \ge \dfrac{1}{W(\mathcal{P},A)}$

for every choice of allowable paths $\mathcal{P}$ and coefficients $A$. The most effective applications of (5.2.13) depend on making a choice of $\mathcal{P}$ and $A$ which takes advantage of the particular situation under consideration. Given a selection $\mathcal{P}$ of paths, one of the most frequently made choices of $A$ is to take $a(e,p) \equiv 1$. In this case, (5.2.13) gives

(5.2.14) $\beta_+ \ge \dfrac{1}{W(\mathcal{P})}$, where $W(\mathcal{P}) = \sup_e\sum_{p \ni e}\bigl(\pi(p)\bigr)_-\bigl(\pi(p)\bigr)_+\sum_{e' \in p}\frac{1}{\rho(e')},$
where $\bigl(\pi(p)\bigr)_- = (\pi)_i$ and $\bigl(\pi(p)\bigr)_+ = (\pi)_j$ when $p$ begins at $i$ and ends at $j$. Finally, it should be recognized that, in general, (5.2.13) gives no information. Indeed, although irreducibility guarantees that there is always at least one path connecting every pair of points, when § is infinite there is no guarantee that $\mathcal{P}$ and $A$ can be chosen so that $W(\mathcal{P},A) < \infty$. Moreover, even when § is finite, and therefore $W(\mathcal{P},A) < \infty$ for every choice, only a judicious choice will make (5.2.13) yield a good estimate.

5.2.3. Estimating β₋: The estimation of $\beta_-$ starting from (5.2.11) is a bit more contrived than the one of $\beta_+$ starting from (5.2.9). For one thing, we already know (cf. Theorem 5.1.14) that $\beta_- = 0$ unless the chain is aperiodic. Thus, we will now require that the chain be aperiodic. As a consequence of aperiodicity and irreducibility, we know (cf. (3.1.13)) that, for each $i \in §$, there always is a path $p(i) = (k_0, \ldots, k_{2n+1})$ which is allowable (i.e., $(P)_{k_{m-1}k_m} > 0$ for each $1 \le m \le 2n+1$), is closed (i.e., $k_0 = k_{2n+1}$), and starts at $i$ (i.e., $k_0 = i$). Note our insistence that this path have an odd number of steps. The reason for our doing so is that when the number of steps is odd, an elementary exercise in telescoping sums shows that

$$2f(i) = \sum_{m=0}^{2n}(-1)^m\bigl(f(k_m) + f(k_{m+1})\bigr)$$

for each $i \in §$. Thus, if $\mathcal{P}$ is a selection of such paths, one path $p(i)$ for each $i \in §$, and we make an associated choice of coefficients $A = \{a(e,p) : e \in p \in \mathcal{P}\} \subseteq (0,\infty)$, then, just as in the preceding section,

$$4\|f\|_{2,\pi}^2 = \sum_i\left(\sum_{e \in p(i)}(-1)^{m(e)}\Sigma_e f\right)^2(\pi)_i \le \sum_{p \in \mathcal{P}}\pi(p)\left(\sum_{e \in p}\frac{a(e,p)}{\rho(e)}\right)\left(\sum_{e \in p}\frac{\rho(e)}{a(e,p)}(\Sigma_e f)^2\right),$$
where, when $p = (k_0, \ldots, k_{2n+1})$, $\pi(p) = (\pi)_{k_0}$, and, for $0 \le m \le 2n$, $m(e) = m$ and $\Sigma_e f \equiv f(k_m) + f(k_{m+1})$ if $e = (k_m, k_{m+1})$. Hence, if

$$w(p) = \pi(p)\max_{e \in p}\frac{1}{a(e,p)}\sum_{e' \in p}\frac{a(e',p)}{\rho(e')},$$

then

$$2\|f\|_{2,\pi}^2 \le \bar W(\mathcal{P},A)\,\tilde{\mathcal{E}}(f,f), \quad \text{where } \bar W(\mathcal{P},A) = \sup_e\sum_{p \ni e} w(p),$$

and, by (5.2.11), this proves that

$$\beta_- \ge \frac{2}{\bar W(\mathcal{P},A)}$$

for any choice of paths $\mathcal{P}$ and coefficients $A$ satisfying the stated requirements. When we take $a(e,p) \equiv 1$, this specializes to

(5.2.15) $\beta_- \ge \dfrac{2}{\bar W_1(\mathcal{P})}$, where $\bar W_1(\mathcal{P}) = \sup_e\sum_{p \ni e}\pi(p)\sum_{e' \in p}\frac{1}{\rho(e')}.$

It should be emphasized that the preceding method for getting estimates on $\beta_-$ is inherently flawed in that it appears incapable of recognizing spectral properties of $P$ like non-negative definiteness. That is, when $P$ is non-negative definite, then $\beta_- \ge 1$, but it seems unlikely that one could get that conclusion out of the arguments being used here. On the other hand, because our real interest is in $\beta = \beta_+ \wedge \beta_-$ and the estimates given by (5.2.14) and (5.2.15) are likely to be comparable, the inadequacy of (5.2.15) causes less damage than one otherwise might fear.
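The path-method bound (5.2.14) is concrete enough to compute. The sketch below is not from the text: it takes the lazy nearest-neighbor walk on $\{0,1,2,3\}$ with uniform $\pi$ (an illustrative choice), joins each ordered pair by the monotone path through the intervening states, and evaluates $W(\mathcal{P})$. For this walk the exact gap is $(2-\sqrt 2)/4 \approx 0.146$ (a standard fact about the 4-point path graph's Laplacian spectrum), so the resulting bound $\beta_+ \ge 1/8$ is off only by a modest factor, as is typical of canonical paths.

```python
# Canonical-paths computation of (5.2.14) for the lazy nearest-neighbor
# walk on {0,1,2,3} with uniform pi.  Each ordered pair (i,j) is joined
# by the monotone path; rho(e) = pi_k P_kl for a directed edge e = (k,l).
n = 4
pi = [1.0/n]*n

def Pentry(i, j):
    if abs(i - j) == 1:
        return 0.25
    if i == j:                       # holding probability
        return 1.0 - 0.25*((i > 0) + (i < n-1))
    return 0.0

def path(i, j):                      # directed edges of the monotone path
    step = 1 if j > i else -1
    return [(k, k+step) for k in range(i, j, step)]

rho = {(k, l): pi[k]*Pentry(k, l)
       for k in range(n) for l in range(n) if abs(k - l) == 1}

# load[e] = sum over paths p through e of pi_i pi_j sum_{e' in p} 1/rho(e')
load = {e: 0.0 for e in rho}
for i in range(n):
    for j in range(n):
        if i == j:
            continue
        p = path(i, j)
        weight = pi[i]*pi[j]*sum(1.0/rho[e2] for e2 in p)
        for e in p:
            load[e] += weight

W = max(load.values())
assert abs(W - 8.0) < 1e-12          # so (5.2.14) gives beta_+ >= 1/8
assert 1.0/W <= (2 - 2**0.5)/4 + 1e-12   # consistent with the exact gap
```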
5.3 Reversible Markov Processes in Continuous Time

Here we will see what the preceding theory looks like in the continuous-time context and will learn that it is both easier and more aesthetically pleasing there. We will be working with the notation and theory developed in Chapter 4. In particular, $Q$ will denote a matrix of the form $R(P - I)$, where $R$ is a diagonal matrix whose diagonal entries are the rates $\mathfrak{R} = \{R_i : i \in §\}$ and $P$ is a transition probability matrix. We will assume throughout that $Q$ is irreducible in the sense that (cf. (4.4.2)), for each pair $(i,j) \in §^2$, $(Q^n)_{ij} > 0$ for some $n \ge 0$.

5.3.1. Criterion for Reversibility: Let $Q$ be given, assume that the associated Markov process never explodes (cf. §4.3.1), and use $\{P(t) : t \ge 0\}$ to denote the semigroup determined by $Q$ (cf. Corollary 4.3.2). Our purpose
in this subsection is to show that if $\hat\mu$ is a probability vector for which the detailed balance condition

(5.3.1) $(\hat\mu)_i(Q)_{ij} = (\hat\mu)_j(Q)_{ji}$ for all $(i,j) \in §^2$

holds relative to $Q$, then the detailed balance condition

(5.3.2) $(\hat\mu)_i\bigl(P(t)\bigr)_{ij} = (\hat\mu)_j\bigl(P(t)\bigr)_{ji}$ for all $t > 0$ and $(i,j) \in §^2$

also holds. The proof that (5.3.1) implies (5.3.2) is trivial in the case when the rates $\mathfrak{R}$ are bounded. Indeed, all that we need to do in that case is first use a simple inductive argument to check that (5.3.1) implies $(\hat\mu)_i(Q^n)_{ij} = (\hat\mu)_j(Q^n)_{ji}$ for all $n \ge 0$ and then use the expression for $P(t)$ given in (4.2.13). When the rates are unbounded, we will use the approximation procedure introduced in §4.3.1. Namely, refer to §4.3.1, and take $Q^{(N)}$ corresponding to the choice of rates $\mathfrak{R}^{(N)}$ described there. Equivalently, take $(Q^{(N)})_{ij}$ to be $(Q)_{ij}$ if $i \in F_N$ and $0$ when $i \notin F_N$. Using induction, one finds first that $\bigl((Q^{(N)})^n\bigr)_{ij} = 0$ for all $n \ge 1$ and $i \notin F_N$ and second that $(\hat\mu)_i\bigl((Q^{(N)})^n\bigr)_{ij} = (\hat\mu)_j\bigl((Q^{(N)})^n\bigr)_{ji}$ for all $n \ge 1$ and $(i,j) \in F_N^2$. Hence, if $\{P^{(N)}(t) : t > 0\}$ is the semigroup determined by $Q^{(N)}$, then, since the rates for $Q^{(N)}$ are bounded, (4.2.13) shows that

(5.3.3) $\bigl(P^{(N)}(t)\bigr)_{ij} = \delta_{i,j}$ if $i \notin F_N$, and $(\hat\mu)_i\bigl(P^{(N)}(t)\bigr)_{ij} = (\hat\mu)_j\bigl(P^{(N)}(t)\bigr)_{ji}$ if $(i,j) \in F_N^2$.

In particular, because, by (4.3.3), $\bigl(P^{(N)}(t)\bigr)_{ij} \longrightarrow \bigl(P(t)\bigr)_{ij}$, we are done.

As a consequence of the preceding, we now know that (5.3.1) implies that $\hat\mu$ is stationary for $P(t)$. Hence, because we are assuming that $Q$ is irreducible, the results in §4.4.2 and §4.4.3 allow us to identify $\hat\mu$ as the probability vector $\hat\pi = \hat\pi_§$ introduced in Theorem 4.4.8 and discussed in §4.4.3, especially (4.4.11). To summarize, if $\hat\mu$ is a probability vector for which (5.3.1) holds, then $\hat\mu = \hat\pi$.

5.3.2. Convergence in $L^2(\hat\pi)$ for Bounded Rates: In view of the results just obtained, from now on we will be assuming that $\hat\pi$ is a probability vector for which (5.3.1) holds when $\hat\mu = \hat\pi$. In particular, this means that

(5.3.4) $(\hat\pi)_i\bigl(P(t)\bigr)_{ij} = (\hat\pi)_j\bigl(P(t)\bigr)_{ji}$ for all $t > 0$ and $(i,j) \in §^2$.
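The implication (5.3.1) ⟹ (5.3.2) can be seen numerically for a small generator. The sketch below is not from the text: the 3-state birth-and-death generator $Q$ and its reversible vector are illustrative choices, and $P(t) = e^{tQ}$ is computed by a plain Taylor series, which is adequate at this size (it is not a robust general-purpose matrix exponential).

```python
# (5.3.1) => (5.3.2): if Q satisfies detailed balance with respect to mu,
# then so does P(t) = e^{tQ}.  Illustrative 3-state generator; matrix
# exponential via truncated Taylor series.
Q  = [[-1.0, 1.0, 0.0],
      [0.5, -1.0, 0.5],
      [0.0, 1.0, -1.0]]
mu = [0.25, 0.5, 0.25]

for i in range(3):
    for j in range(3):
        assert abs(mu[i]*Q[i][j] - mu[j]*Q[j][i]) < 1e-12   # (5.3.1)

def mat_mul(A, B):
    return [[sum(A[i][k]*B[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def expm(t, terms=40):
    term = [[float(i == j) for j in range(3)] for i in range(3)]  # (tQ)^0/0!
    S = [row[:] for row in term]
    for k in range(1, terms):
        term = mat_mul(term, [[Q[i][j]*t/k for j in range(3)] for i in range(3)])
        S = [[S[i][j] + term[i][j] for j in range(3)] for i in range(3)]
    return S

Pt = expm(0.7)
for i in range(3):
    assert abs(sum(Pt[i]) - 1.0) < 1e-10                     # stochastic rows
    for j in range(3):
        assert abs(mu[i]*Pt[i][j] - mu[j]*Pt[j][i]) < 1e-10  # (5.3.2)
```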
Knowing (5.3.4), one is tempted to use the ideas in §5.2 to get an estimate on the rate, as measured by convergence in $L^2(\hat\pi)$, at which $P(t)f$ tends to $\langle f \rangle_{\hat\pi}$. To be precise, first note that, for each $h > 0$,

$$\bigl(f, P(h)f\bigr)_{\hat\pi} = \bigl\|P\bigl(\tfrac{h}{2}\bigr)f\bigr\|_{2,\hat\pi}^2 \ge 0,$$
and therefore (cf. (5.1.13), (5.2.6), and (5.2.7)) that

$$\beta(h) \equiv 1 - \sup\bigl\{|(f, P(h)f)_{\hat\pi}| : f \in L_0^2(\hat\pi) \text{ with } \|f\|_{2,\hat\pi} = 1\bigr\} = \inf\bigl\{\bigl(f, (I-P(h))f\bigr)_{\hat\pi} : \operatorname{Var}_{\hat\pi}(f) = 1\bigr\} = \inf\bigl\{\mathcal{E}_h(f,f) : \operatorname{Var}_{\hat\pi}(f) = 1\bigr\},$$

where

$$\mathcal{E}_h(f,f) = \frac12\sum_{i \ne j}(\hat\pi)_i\bigl(P(h)\bigr)_{ij}\bigl(f(j)-f(i)\bigr)^2$$

is the Dirichlet form for $P(h)$ on $L^2(\hat\pi)$. Hence (cf. (5.1.12)), for any $t > 0$ and $n \in \mathbb{Z}^+$,

$$\bigl\|P(t)f - \langle f \rangle_{\hat\pi}\bigr\|_{2,\hat\pi} \le \bigl(1 - \beta\bigl(\tfrac{t}{n}\bigr)\bigr)^n\bigl\|f - \langle f \rangle_{\hat\pi}\bigr\|_{2,\hat\pi}.$$

To take the next step, we add the assumption that the rates $\mathfrak{R}$ are bounded. One can then use (4.2.8) to see that, uniformly in $f$ satisfying $\operatorname{Var}_{\hat\pi}(f) = 1$, $h^{-1}\mathcal{E}_h(f,f) \to \mathcal{E}_Q(f,f)$ as $h \searrow 0$, where

(5.3.5) $\mathcal{E}_Q(f,f) = \frac12\sum_{i \ne j}(\hat\pi)_i(Q)_{ij}\bigl(f(j)-f(i)\bigr)^2;$

and from this it follows that the limit $\lim_{h \searrow 0} h^{-1}\beta(h)$ exists and is equal to

(5.3.6) $\lambda \equiv \inf\bigl\{\mathcal{E}_Q(f,f) : \operatorname{Var}_{\hat\pi}(f) = 1\bigr\}.$

Thus, at least when the rates are bounded, we know that

(5.3.7) $\bigl\|P(t)f - \langle f \rangle_{\hat\pi}\bigr\|_{2,\hat\pi} \le e^{-\lambda t}\bigl\|f - \langle f \rangle_{\hat\pi}\bigr\|_{2,\hat\pi}$ for all $t \ge 0$ and $f \in L^2(\hat\pi)$.

5.3.3. $L^2(\hat\pi)$-Convergence Rate in General: When the rates are unbounded, the preceding line of reasoning is too naïve. In order to treat the unbounded case, one needs to make some additional observations, all of which have their origins in the following lemma.³

5.3.8 LEMMA. Given $f \in L^2(\hat\pi)$, the function $t \in [0,\infty) \mapsto \|P(t)f\|_{2,\hat\pi}^2$ is continuous, non-increasing, non-negative, and convex. In particular, $t \in (0,\infty) \mapsto \frac{(f,(I-P(t))f)_{\hat\pi}}{t}$ is non-increasing, and therefore

$$\lim_{h \searrow 0}\frac{\|f\|_{2,\hat\pi}^2 - \|P(h)f\|_{2,\hat\pi}^2}{h}$$

exists in $[0,\infty]$.

³ If one knows spectral theory, especially Stone's Theorem, the rather cumbersome argument which follows can be avoided.
In fact (cf. (5.3.5)),
$$\lim_{h \searrow 0}\frac{\|f\|_{2,\hat\pi}^2 - \|P(h)f\|_{2,\hat\pi}^2}{h} \ge 2\,\mathcal{E}_Q(f,f).$$

PROOF: Let $f \in L^2(\hat\pi)$ be given, and (cf. the notation in §4.3.1) define $f_N = \mathbf{1}_{F_N}f$. Then, by Lebesgue's Dominated Convergence Theorem, $\|f - f_N\|_{2,\hat\pi} \to 0$ as $N \to \infty$. Because $\|P(t)g\|_{2,\hat\pi} \le \|g\|_{2,\hat\pi}$ for all $t > 0$ and $g \in L^2(\hat\pi)$, we know that

$$\bigl|\|P(t)f\|_{2,\hat\pi}^2 - \|P(t)f_N\|_{2,\hat\pi}^2\bigr| \longrightarrow 0$$

uniformly in $t \in (0,\infty)$ as $N \to \infty$. Hence, by part (a) of Exercise 5.6.1 below, in order to prove the initial statement, it suffices to do so when $f$ vanishes off of $F_M$ for some $M$. Now let $f$ be a function which vanishes off of $F_M$, and set $\psi(t) = \|P(t)f\|_{2,\hat\pi}^2$. At the same time, set $\psi_N(t) = \|P^{(N)}(t)f\|_{2,\hat\pi}^2$ for $N \ge M$. Then, because, by (4.3.3), $\psi_N \to \psi$ uniformly on finite intervals, another application of part (a) in Exercise 5.6.1 allows us to restrict our attention to the $\psi_N$'s. That is, we will have proved that $\psi$ is a continuous, non-increasing, non-negative, convex function as soon as we show that each $\psi_N$ is. The non-negativity requires no comment. To prove the other properties, we apply (4.2.7) to see that

$$\dot\psi_N(t) = \bigl(Q^{(N)}P^{(N)}(t)f,\,P^{(N)}(t)f\bigr)_{\hat\pi} + \bigl(P^{(N)}(t)f,\,Q^{(N)}P^{(N)}(t)f\bigr)_{\hat\pi}.$$

Next, by the first line of (5.3.3), we know that, because $N \ge M$, $P^{(N)}(t)f$ vanishes off of $F_N$, and so, because $(\hat\pi)_i Q^{(N)}_{ij} = (\hat\pi)_j Q^{(N)}_{ji}$ for $(i,j) \in F_N^2$, the preceding becomes

$$\dot\psi_N(t) = 2\bigl(Q^{(N)}P^{(N)}(t)f,\,P^{(N)}(t)f\bigr)_{\hat\pi}.$$

Similarly, we see that

$$\ddot\psi_N(t) = 2\bigl(Q^{(N)}P^{(N)}(t)f,\,Q^{(N)}P^{(N)}(t)f\bigr)_{\hat\pi} + 2\bigl(P^{(N)}(t)f,\,(Q^{(N)})^2P^{(N)}(t)f\bigr)_{\hat\pi} = 4\bigl\|Q^{(N)}P^{(N)}(t)f\bigr\|_{2,\hat\pi}^2 \ge 0.$$

Clearly, the second of these proves the convexity of $\psi_N$. In order to see that the first implies that $\psi_N$ is non-increasing, we will show that

(*) $\bigl(g, -Q^{(N)}g\bigr)_{\hat\pi} = \frac12\sum_{\substack{(i,j) \in F_N^2 \\ i \ne j}}(\hat\pi)_i(Q)_{ij}\bigl(g(j)-g(i)\bigr)^2 + \sum_{i \in F_N}(\hat\pi)_i V_i^{(N)}g(i)^2$

if $g = 0$ off $F_N$ and $V_i^{(N)} \equiv \sum_{j \notin F_N} Q_{ij}$ for $i \in F_N$.
To check (*), first observe that, since $g$ vanishes off $F_N$ and $\sum_{j \in F_N}(Q)_{ij} = -V_i^{(N)} + (-(Q)_{ii} - \sum_{j \in F_N, j \ne i}(Q)_{ij}) \cdot 0$ forces $(Q)_{ii} = -V_i^{(N)} - \sum_{j \in F_N, j \ne i}(Q)_{ij}$,

$$\bigl(g, -Q^{(N)}g\bigr)_{\hat\pi} = -\sum_{\substack{(i,j) \in F_N^2 \\ i \ne j}}(\hat\pi)_i(Q)_{ij}\,g(i)\bigl(g(j)-g(i)\bigr) + \sum_{i \in F_N}(\hat\pi)_i V_i^{(N)}g(i)^2.$$

Next, use $(\hat\pi)_i(Q)_{ij} = (\hat\pi)_j(Q)_{ji}$ for $(i,j) \in F_N^2$ to see that

$$-\sum_{\substack{(i,j) \in F_N^2 \\ i \ne j}}(\hat\pi)_i(Q)_{ij}\,g(i)\bigl(g(j)-g(i)\bigr) = \sum_{\substack{(i,j) \in F_N^2 \\ i \ne j}}(\hat\pi)_i(Q)_{ij}\,g(j)\bigl(g(j)-g(i)\bigr),$$

and, after averaging the two expressions, thereby arrive at (*). Finally, apply (*) with $g = P^{(N)}(t)f$ to conclude that $\dot\psi_N \le 0$.

Turning to the second and third assertions, let $f$ be any element of $L^2(\hat\pi)$. Now that we know that the corresponding $\psi$ is a continuous, non-increasing, non-negative, convex function, it is easy (cf. part (d) in Exercise 5.6.1) to check that $t \rightsquigarrow \frac{\psi(0)-\psi(t)}{t}$ is non-increasing and therefore that $\lim_{h \searrow 0}\frac{\psi(0)-\psi(h)}{h}$ exists in $[0,\infty]$. Next, remember (5.2.7), apply it when $P = P(2h)$, and conclude that

$$\psi(0) - \psi(h) = \bigl(f, (I - P(2h))f\bigr)_{\hat\pi} = \frac12\sum_{i \ne j}(\hat\pi)_i\bigl(P(2h)\bigr)_{ij}\bigl(f(j)-f(i)\bigr)^2.$$

Hence, since, by (4.3.4),

$$\lim_{h \searrow 0}\frac{\bigl(P(2h)\bigr)_{ij}}{h} = 2(Q)_{ij} \quad \text{for } i \ne j,$$

the required inequality follows after an application of Fatou's Lemma. □

5.3.9 LEMMA. If $0 < s < t$, then for any $f \in L^2(\hat\pi)$

$$\frac{\|f\|_{2,\hat\pi}^2 - \|P(s)f\|_{2,\hat\pi}^2}{s} \ge \frac{\|P(s)f\|_{2,\hat\pi}^2 - \|P(t)f\|_{2,\hat\pi}^2}{t-s} \ge 2\,\mathcal{E}_Q\bigl(P(t)f, P(t)f\bigr).$$

PROOF: Set $\psi(t) = \|P(t)f\|_{2,\hat\pi}^2$. We know that $\psi$ is a continuous, non-increasing, non-negative, convex function. Hence, by part (a) of Exercise 5.6.1,

$$\frac{\psi(0)-\psi(s)}{s} \ge \frac{\psi(s)-\psi(t)}{t-s} \ge \frac{\psi(t)-\psi(t+h)}{h}$$
for any $h > 0$. Moreover, because

$$\frac{\psi(t)-\psi(t+h)}{h} = \frac{\|P(t)f\|_{2,\hat\pi}^2 - \|P(h)P(t)f\|_{2,\hat\pi}^2}{h},$$

the last part of Lemma 5.3.8 applied with $P(t)f$ replacing $f$ yields the second asserted estimate. □

With the preceding at hand, we can now complete our program. Namely, by writing

$$\|f\|_{2,\hat\pi}^2 - \|P(t)f\|_{2,\hat\pi}^2 = \sum_{m=0}^{n-1}\Bigl(\bigl\|P\bigl(\tfrac{mt}{n}\bigr)f\bigr\|_{2,\hat\pi}^2 - \bigl\|P\bigl(\tfrac{(m+1)t}{n}\bigr)f\bigr\|_{2,\hat\pi}^2\Bigr),$$

we can use the result in Lemma 5.3.9 to obtain the estimate

$$\|f\|_{2,\hat\pi}^2 - \|P(t)f\|_{2,\hat\pi}^2 \ge \frac{2t}{n}\sum_{m=1}^n \mathcal{E}_Q\bigl(P\bigl(\tfrac{mt}{n}\bigr)f,\,P\bigl(\tfrac{mt}{n}\bigr)f\bigr).$$

Hence, if $\lambda$ is defined as in (5.3.6), then, for any $f \in L_0^2(\hat\pi)$,

$$\|f\|_{2,\hat\pi}^2 - \|P(t)f\|_{2,\hat\pi}^2 \ge \frac{2t\lambda}{n}\sum_{m=1}^n\bigl\|P\bigl(\tfrac{mt}{n}\bigr)f\bigr\|_{2,\hat\pi}^2,$$

which, when $n \to \infty$, leads to

$$\|f\|_{2,\hat\pi}^2 - \|P(t)f\|_{2,\hat\pi}^2 \ge 2\lambda\int_0^t \|P(\tau)f\|_{2,\hat\pi}^2\,d\tau.$$

Finally, by Gronwall's inequality (cf. Exercise 5.6.4), the preceding yields the estimate $\|P(t)f\|_{2,\hat\pi}^2 \le e^{-2\lambda t}\|f\|_{2,\hat\pi}^2$. After replacing a general $f \in L^2(\hat\pi)$ by $f - \langle f \rangle_{\hat\pi}$, we have now proved that (5.3.7) holds even when $\mathfrak{R}$ is unbounded.

5.3.4. Estimating λ: Proceeding in exactly the same way that we did in §5.2.2, we can estimate the $\lambda$ in (5.3.6) in the same way as we estimated $\beta_+$ there. Namely, we make a selection $\mathcal{P}$ consisting of paths $p(i,j)$, one for each pair $(i,j) \in §^2\setminus D$, with the properties that if $p(i,j) = (k_0, \ldots, k_n)$, then $k_0 = i$, $k_n = j$, and $p(i,j)$ is allowable in the sense that $(Q)_{k_{m-1}k_m} > 0$ for each $1 \le m \le n$. Then, just as in §5.2.2, we can say that

(5.3.10) $\lambda \ge \dfrac{1}{W(\mathcal{P})}$, where $W(\mathcal{P}) = \sup_e\sum_{p \ni e}\bigl(\hat\pi(p)\bigr)_-\bigl(\hat\pi(p)\bigr)_+\sum_{e' \in p}\frac{1}{\rho(e')},$

where the supremum is over oriented edges $e = (k,\ell)$ with $(Q)_{k\ell} > 0$, the first sum is over $p \in \mathcal{P}$ in which the edge $e$ appears, the second sum is over edges $e'$ which appear in the path $p$, $\bigl(\hat\pi(p)\bigr)_- = (\hat\pi)_i$ if the path $p$ starts at $i$, $\bigl(\hat\pi(p)\bigr)_+ = (\hat\pi)_j$ if the path $p$ ends at $j$, and $\rho(e') = (\hat\pi)_k(Q)_{k\ell}$ if $e' = (k,\ell)$.
5.4 Gibbs States and Glauber Dynamics
Loosely speaking, the physical principle underlying statistical mechanics can be summarized in the statement that, when a system is in equilibrium, states with lower energy are more likely than those with higher energy. In fact, J.W. Gibbs sharpened this statement by saying that the probability of a state $i$ will be proportional to $e^{-\frac{H(i)}{kT}}$, where $k$ is the Boltzmann constant, $T$ is temperature, and $H(i)$ is the energy of the system when it is in state $i$. For this reason, a distribution which assigns probabilities in this Gibbsian manner is called a Gibbs state. Since a Gibbs state is to be a model of equilibrium, it is only reasonable to ask for which dynamics it is the equilibrium. From our point of view, this means that we should seek a Markov process for which the Gibbs state is the stationary distribution. Further, because dynamics in physics should be reversible, we should be looking for Markov processes which are reversible with respect to the Gibbs state, and, because such processes were introduced in this context by R. Glauber, we will call a Markov process which is reversible with respect to a Gibbs state a Glauber dynamics for that Gibbs state. In this section, we will give a rather simplistic treatment of Gibbs states and their associated Glauber dynamics.

5.4.1. Formulation: Throughout this section, we will be working in the following setting. As usual, § is either a finite or countably infinite space. On § there is given some "natural" background assignment $\nu \in (0,\infty)^{§}$ of (not necessarily summable) weights, which should be thought of as a row vector. In many applications, $\nu$ is uniform: it assigns each $i$ weight $1$; but in other situations it is convenient to not have to assume that it is uniform. Next, there is a function $H : § \to [0,\infty)$ (alias, the energy function) with the property that
(5.4.1)  $Z(\beta) = \sum_{i \in §} e^{-\beta H(i)}(\nu)_i < \infty$ for each $\beta \in (0,\infty)$.
In the physics metaphor, $\beta = \frac{1}{kT}$ is, apart from Boltzmann's constant $k$, the reciprocal temperature, and physicists would call $\beta \rightsquigarrow Z(\beta)$ the partition function. Finally, for each $\beta \in (0,\infty)$, the Gibbs state $\gamma(\beta)$ is the probability vector given by

(5.4.2)  $(\gamma(\beta))_i = \dfrac{e^{-\beta H(i)}(\nu)_i}{Z(\beta)}$ for $i \in §$.

From a physical standpoint, everything of interest is encoded in the partition function. For example, it is elementary to compute both the average and variance of the energy by taking logarithmic derivatives:

(5.4.3)  $\langle H\rangle_{\gamma(\beta)} = -\dfrac{d}{d\beta}\log Z(\beta)$

and

$\mathrm{Var}_{\gamma(\beta)}(H) = \dfrac{d^2}{d\beta^2}\log Z(\beta)$.
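The partition function identities above are easy to verify numerically. The small energy landscape and inverse temperature below are hypothetical choices made only for illustration; the log-derivatives of $Z$ are compared against the directly computed mean and variance of $H$ under the Gibbs state.

```python
import math

# Hypothetical energy landscape on S = {0,1,2,3} with uniform weights nu.
H = [0.0, 1.0, 2.0, 1.5]
nu = [1.0, 1.0, 1.0, 1.0]

def Z(beta):
    # (5.4.1): partition function
    return sum(math.exp(-beta * H[i]) * nu[i] for i in range(len(H)))

def gibbs(beta):
    # (5.4.2): Gibbs state
    z = Z(beta)
    return [math.exp(-beta * H[i]) * nu[i] / z for i in range(len(H))]

beta = 0.7
g = gibbs(beta)
assert abs(sum(g) - 1.0) < 1e-12

# (5.4.3): <H> = -d/dbeta log Z and Var(H) = d^2/dbeta^2 log Z,
# checked against centered finite differences.
h = 1e-5
mean_H = sum(g[i] * H[i] for i in range(len(H)))
var_H = sum(g[i] * (H[i] - mean_H) ** 2 for i in range(len(H)))
logZ = lambda b: math.log(Z(b))
num_mean = -(logZ(beta + h) - logZ(beta - h)) / (2 * h)
num_var = (logZ(beta + h) - 2 * logZ(beta) + logZ(beta - h)) / h ** 2
assert abs(mean_H - num_mean) < 1e-6
assert abs(var_H - num_var) < 1e-4
```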
The final ingredient is the description of the Glauber dynamics. For this purpose, we start with a matrix A all of whose entries are non-negative and whose diagonal entries are 0. Further, we assume that A is irreducible in the sense that
(5.4.4)  $\sup_{n \ge 0}(A^n)_{ij} > 0$ for all $(i,j) \in §^2$

and that it is reversible in the sense that

(5.4.5)  $(\nu)_i(A)_{ij} = (\nu)_j(A)_{ji}$ for all $(i,j) \in §^2$.

Finally, we insist that

(5.4.6)  $\sum_{j \in §} e^{-\beta H(j)}(A)_{ij} < \infty$ for each $i \in §$ and $\beta > 0$.
At this point there are many ways in which to construct a Glauber dynam ics. However, for our purposes, the one which will serve us best is the one whose Q-matrix is given by
(5.4.7)  $(Q(\beta))_{ij} = e^{-\beta(H(j)-H(i))^+}(A)_{ij}$ when $j \ne i$ and $(Q(\beta))_{ii} = -\sum_{j \ne i}(Q(\beta))_{ij}$,

where $a^+ = a \vee 0$ is the non-negative part of the number $a \in \mathbb{R}$. Because
(5.4.8)  $(\gamma(\beta))_i(Q(\beta))_{ij} = \dfrac{e^{-\beta(H(i)\vee H(j))}(\nu)_i(A)_{ij}}{Z(\beta)}$ for $j \ne i$,

which, by (5.4.5), is symmetric in $(i,j)$, $\gamma(\beta)$ clearly is reversible for $Q(\beta)$. There are many other possibilities, and the optimal choice is often dictated by special features of the situation under consideration. However, whatever choice is made, it should be made in such a way that, for each $\beta > 0$, $Q(\beta)$ determines a Markov process which never explodes.

5.4.2. The Dirichlet Form: In this subsection we will modify the ideas developed in §5.2.3 to get a lower bound on
(5.4.9)  $\lambda_\beta = \inf\{\mathcal{E}_\beta(f,f) : \mathrm{Var}_\beta(f) = 1\}$ where $\mathcal{E}_\beta(f,f) = \frac{1}{2}\sum_{j \ne i}(\gamma(\beta))_i(Q(\beta))_{ij}\bigl(f(j) - f(i)\bigr)^2$

and $\mathrm{Var}_\beta(f)$ is shorthand for $\mathrm{Var}_{\gamma(\beta)}(f)$, the variance of $f$ with respect to $\gamma(\beta)$. For this purpose, we introduce the notation

$$\mathrm{Elev}(p) = \max_{0 \le m \le n} H(i_m) \quad\text{and}\quad e(p) = \mathrm{Elev}(p) - H(i_0) - H(i_n)$$
for a path $p = (i_0,\dots,i_n)$. Then, when $Q(\beta)$ is given by (5.4.7), one sees that, when $p = (i_0,\dots,i_n)$,

$$w_\beta(p) := \sum_{e' \in p} \frac{(\gamma(\beta))_{i_0}(\gamma(\beta))_{i_n}}{(\gamma(\beta))_k(Q(\beta))_{k\ell}} \le Z(\beta)^{-1} e^{\beta e(p)}\, w(p), \quad\text{where } w(p) = \sum_{e' \in p} \frac{(\nu)_{i_0}(\nu)_{i_n}}{(\nu)_k(A)_{k\ell}},$$

the sums being over the oriented edges $e' = (k,\ell)$ of $p$. Hence, for any choice of paths $P$, we know that (cf. (5.3.10))

$$W_\beta(P) = \sup_e \sum_{p \ni e} w_\beta(p) \le Z(\beta)^{-1} e^{\beta E(P)}\, W(P),$$

where

$$W(P) = \sup_e \sum_{p \ni e} w(p) \quad\text{and}\quad E(P) = \sup_{p \in P} e(p),$$

and therefore that

(5.4.10)  $\lambda_\beta \ge \dfrac{Z(\beta)\, e^{-\beta E(P)}}{W(P)}$.
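The Glauber rates (5.4.7) and the detailed balance identity (5.4.8) are easy to check by machine. The nearest-neighbor landscape below is an illustrative stand-in, not an example from the text; the code builds $Q(\beta)$, verifies $\gamma_i Q_{ij} = \gamma_j Q_{ji}$, and evaluates the Dirichlet form of (5.4.9).

```python
import math

# Illustrative data: 4 states on a line, nu uniform, A the nearest-neighbor
# adjacency matrix (symmetric, zero diagonal).
H = [0.0, 2.0, 1.0, 3.0]
n = len(H)
A = [[1.0 if abs(i - j) == 1 else 0.0 for j in range(n)] for i in range(n)]

def glauber_Q(beta):
    # (5.4.7): off-diagonal rates e^{-beta (H(j)-H(i))^+} A_ij
    Q = [[math.exp(-beta * max(H[j] - H[i], 0.0)) * A[i][j] if j != i else 0.0
          for j in range(n)] for i in range(n)]
    for i in range(n):
        Q[i][i] = -sum(Q[i][j] for j in range(n) if j != i)
    return Q

def gibbs(beta):
    w = [math.exp(-beta * H[i]) for i in range(n)]
    z = sum(w)
    return [wi / z for wi in w]

beta = 1.3
Q, g = glauber_Q(beta), gibbs(beta)
# (5.4.8): detailed balance gamma_i Q_ij = gamma_j Q_ji
for i in range(n):
    for j in range(n):
        if i != j:
            assert abs(g[i] * Q[i][j] - g[j] * Q[j][i]) < 1e-12

def dirichlet(f):
    # the Dirichlet form of (5.4.9)
    return 0.5 * sum(g[i] * Q[i][j] * (f[j] - f[i]) ** 2
                     for i in range(n) for j in range(n) if j != i)

assert dirichlet([1.0] * n) == 0.0  # constants have zero Dirichlet energy
```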
On the one hand, it is clear that (5.4.10) gives information only when $W(P) < \infty$. At the same time, it shows that, at least if one's interest is in large $\beta$'s, then it is important to choose $P$ so that $E(P)$ is as small as possible. When § is finite, reconciling these two demands creates no problem. Indeed, the finiteness of § guarantees that $W(P)$ will be finite for every choice of allowable paths. In addition, finiteness allows one to find for each $(i,j)$ a path $p(i,j)$ which minimizes $\mathrm{Elev}(p)$ among allowable paths from $i$ to $j$, and clearly any $P$ consisting of such paths will minimize $E(P)$. Of course, it is sensible to choose such a $P$ so as to minimize $W(P)$ as well. In any case, whenever § is finite and $P$ consists of paths $p(i,j)$ which minimize $\mathrm{Elev}(p)$ among paths $p$ between $i$ and $j$, $E(P)$ has a nice interpretation. Namely, think of § as being sites on a map and of $H$ as giving the altitude of the sites. That is, in this metaphor, $H(i)$ is the distance of $i$ "above sea level." Without loss in generality, we will assume that at least one site $k_0$ is at sea level: $H(k_0) = 0$.⁴ When such a $k_0$ exists, the metaphorical interpretation of $E(P)$ is as the least upper bound on the altitude a hiker must gain, no matter where he starts or what allowable path he chooses to follow, in order to reach the sea. To see this, first observe that if $p$ and $p'$ are a pair of allowable paths and if the end point of $p$ is the initial point of $p'$, then the path $q$ obtained by concatenating $p$ and $p'$ is allowable and $\mathrm{Elev}(q) = \mathrm{Elev}(p) \vee \mathrm{Elev}(p')$: if $p = (i_0,\dots,i_n)$ and $p' = (i'_0,\dots,i'_{n'})$ with $i_n = i'_0$, then $q = (i_0,\dots,i_n,i'_1,\dots,i'_{n'})$. Hence, for any $(i,j)$, $e\bigl(p(i,j)\bigr) \le e\bigl(p(i,k_0)\bigr) \vee e\bigl(p(j,k_0)\bigr)$, from which it should be clear that $E(P) = \max_{i} e\bigl(p(i,k_0)\bigr)$. Finally, since, for each $i$, $e\bigl(p(i,k_0)\bigr) = H(\ell) - H(i)$, where $\ell$ is a highest point along the path $p(i,k_0)$, the explanation is complete. When § is infinite, the same interpretation is valid in various circumstances. For example, it applies when $H$ "tends to infinity at infinity" in the sense that $\{i : H(i) \le M\}$ is finite for each $M < \infty$.

⁴ If that is not already so, we can make it so by choosing $k_0$ to be a point at which $H$ takes its minimum value and replacing $H$ by $H - H(k_0)$. Such a replacement leaves both $\gamma(\beta)$ and $Q(\beta)$, as well as the quantity on the right hand side of (5.4.10), unchanged.

When § is finite, we can show that, at least for large $\beta$, (5.4.10) is quite good. To be precise, we have the following result:

5.4.11 THEOREM. Assume that § is finite and that $Q(\beta)$ is given by (5.4.7). Set $m = \min_{i \in §} H(i)$ and $§_0 = \{i : H(i) = m\}$, and let $e$ be the minimum value $E(P)$ takes as $P$ runs over all selections of allowable paths. Then $e \ge -m$, and $e = -m$ if and only if for each $(i,j) \in § \times §_0$ there is an allowable path $p$ from $i$ to $j$ with $\mathrm{Elev}(p) = H(i)$. (See also Exercise 5.6.5 below.) More generally, whatever the value of $e$, there exist constants $0 < c_- \le c_+ < \infty$, which are independent of $H$, such that

$$c_- e^{-\beta e} \le \lambda_\beta \le c_+ e^{-\beta e} \quad\text{for all } \beta \ge 0.$$
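The quantity $e$ in the hiker metaphor can be computed with a minimax variant of Dijkstra's algorithm: for each starting site one finds the smallest possible maximal elevation over allowable paths to the sea-level site. The little landscape below is a hypothetical example, not one from the text.

```python
import heapq

# Hypothetical landscape: a path graph with a local minimum at state 0 and the
# global minimum (sea level) at state 4; edges connect neighbors.
H = [1.0, 3.0, 0.5, 2.0, 0.0]
edges = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
k0 = 4  # H[k0] == 0, the "sea level" site

def min_elevation(start):
    """Smallest possible Elev(p) = max_m H(i_m) over allowable paths from
    `start` to k0 (a minimax variant of Dijkstra's algorithm)."""
    best = {start: H[start]}
    heap = [(H[start], start)]
    while heap:
        elev, i = heapq.heappop(heap)
        if i == k0:
            return elev
        if elev > best.get(i, float("inf")):
            continue
        for j in edges[i]:
            cand = max(elev, H[j])
            if cand < best.get(j, float("inf")):
                best[j] = cand
                heapq.heappush(heap, (cand, j))
    return float("inf")

# e = max_i [ min_p Elev(p(i, k0)) - H(i) ]: the worst-case altitude a hiker
# must gain to reach the sea (here m = 0, so the theorem's e is >= 0).
e = max(min_elevation(i) - H[i] for i in range(len(H)))
# Starting from the local minimum 0, one must climb over H = 3 at state 1.
assert min_elevation(0) == 3.0
assert e == 3.0 - 1.0
```

By Theorem 5.4.11, this value of $e$ governs the exponential rate at which the spectral gap $\lambda_\beta$ closes as $\beta \to \infty$.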
PROOF: Because neither $\gamma(\beta)$ nor $Q(\beta)$ is changed if $H$ is replaced by $H - m$, whereas $e$ changes to $e + m$, we may and will assume that $m = 0$. Choose a collection $P = \{p(i,j) : (i,j) \in §^2\}$ of allowable paths so that, for each $(i,j)$, $p(i,j)$ minimizes $e(p)$ over allowable paths from $i$ to $j$. Next choose and fix a $k_0 \in §_0$. By the reasoning given above,

(*)  $e = \max_{i \in §} e\bigl(p(i,k_0)\bigr)$.

In particular, since $e\bigl(p(i,k_0)\bigr) = \mathrm{Elev}\bigl(p(i,k_0)\bigr) - H(i) \ge 0$ for all $i \in §$, this proves that $e \ge 0$ and makes it clear that $e = 0$ if and only if $H(i) = \mathrm{Elev}\bigl(p(i,k_0)\bigr)$ for all $i \in §$.

Turning to the lower bound for $\lambda_\beta$, observe that, because $m = 0$, $Z(\beta) \ge (\nu)_{k_0} > 0$ and therefore, by (5.4.10), that we can take $c_- = \frac{(\nu)_{k_0}}{W(P)}$. Finally, to prove the upper bound, choose $\ell_0 \in § \setminus \{k_0\}$ so that $e(p_0) = e$ when $p_0 = p(\ell_0,k_0)$, let $\Gamma$ be the set of $i \in §$ with the property that either $i = k_0$ or $\mathrm{Elev}\bigl(p(i)\bigr) < \mathrm{Elev}(p_0)$ for the path $p(i) = p(i,k_0) \in P$ from $i$ to $k_0$, and set $f = \mathbf{1}_\Gamma$. Then, because $k_0 \in \Gamma$ and $\ell_0 \notin \Gamma$,

$$\mathrm{Var}_{\gamma(\beta)}(f) = \Bigl(\sum_{i \in \Gamma}(\gamma(\beta))_i\Bigr)\Bigl(\sum_{j \notin \Gamma}(\gamma(\beta))_j\Bigr) \ge (\gamma(\beta))_{k_0}(\gamma(\beta))_{\ell_0} = \frac{(\nu)_{k_0}(\nu)_{\ell_0}\, e^{-\beta(H(k_0)+H(\ell_0))}}{Z(\beta)^2}.$$

At the same time,

$$\mathcal{E}_\beta(f,f) = \sum_{(i,j) \in \Gamma \times \Gamma^\complement}(\gamma(\beta))_i(Q(\beta))_{ij} = Z(\beta)^{-1}\sum_{(i,j) \in \Gamma \times \Gamma^\complement} e^{-\beta(H(i)\vee H(j))}(\nu)_i(A)_{ij}.$$
But if $i \in \Gamma \setminus \{k_0\}$, $j \notin \Gamma$, and $(A)_{ij} > 0$, then $H(j) \ge \mathrm{Elev}(p_0)$. To see this, consider the path $q$ obtained by going in one step from $j$ to $i$ and then following $p(i)$ from $i$ to $k_0$. Clearly, $q$ is an allowable path from $j$ to $k_0$, and therefore $\mathrm{Elev}(q) \ge \mathrm{Elev}\bigl(p(j)\bigr) \ge \mathrm{Elev}(p_0)$. But this means that

$$\mathrm{Elev}(p_0) \le \mathrm{Elev}\bigl(p(j)\bigr) \le \mathrm{Elev}(q) = \mathrm{Elev}\bigl(p(i)\bigr) \vee H(j),$$

which, together with $\mathrm{Elev}\bigl(p(i)\bigr) < \mathrm{Elev}(p_0)$, forces the conclusion that $H(j) \ge \mathrm{Elev}(p_0)$. Even easier is the observation that $H(j) \ge \mathrm{Elev}(p_0)$ if $j \notin \Gamma$ and $(A)_{k_0 j} > 0$, since in that case the path $(j,k_0)$ is allowable and

$$H(j) = \mathrm{Elev}\bigl((j,k_0)\bigr) \ge \mathrm{Elev}\bigl(p(j)\bigr) \ge \mathrm{Elev}(p_0).$$

Hence, after plugging this into the preceding expression for $\mathcal{E}_\beta(f,f)$, we get

$$\mathcal{E}_\beta(f,f) \le Z(\beta)^{-1} e^{-\beta\,\mathrm{Elev}(p_0)}\sum_{(i,j) \in \Gamma \times \Gamma^\complement}(\nu)_i(A)_{ij},$$

which, because $e = \mathrm{Elev}(p_0) - H(k_0) - H(\ell_0)$, means that

$$\lambda_\beta \le \frac{\mathcal{E}_\beta(f,f)}{\mathrm{Var}_{\gamma(\beta)}(f)} \le \frac{Z(\beta)\, e^{-\beta e}}{(\nu)_{k_0}(\nu)_{\ell_0}}\sum_{(i,j) \in \Gamma \times \Gamma^\complement}(\nu)_i(A)_{ij}.$$

Finally, because $Z(\beta) \le \sum_{i \in §}(\nu)_i < \infty$, the upper bound follows. $\square$

5.5 Simulated Annealing
This concluding section deals with an application of the ideas in the preceding section. Namely, given a function $H : § \to [0,\infty)$, we want to describe a procedure, variously known as simulated annealing or the Metropolis algorithm, for locating a place where $H$ achieves its minimum value. In order to understand the intuition which underlies this procedure, let $A$ be a matrix of the sort discussed in §5.4, assume that $0$ is the minimum value of $H$, set $§_0 = \{i : H(i) = 0\}$, and think about dynamic procedures which would lead you from any initial point to $§_0$ via paths which are allowable according to $A$ (i.e., $(A)_{k\ell} > 0$ if $k$ and $\ell$ are successive points along the path). One procedure is based on the steepest descent strategy. That is, if one is at $k$, one moves to any one of the points $\ell$ for which $(A)_{k\ell} > 0$ and $H(\ell)$ is minimal if $H(\ell) \le H(k)$ for at least one such point, and one stays put if $H(\ell) > H(k)$ for every $\ell$ with $(A)_{k\ell} > 0$. This procedure works beautifully as long as you avoid, in the metaphor suggested at the end of §5.4, getting trapped in some "mountain valley." The point is that the steepest descent procedure is the most efficient strategy for getting to some local minimum of $H$. However, if that minimum is not global, then, in general, you will get stuck! Thus, if you are going to avoid this fate, occasionally you will have to go "up hill" even when you have the option to go "down hill." However, unless you have detailed a priori knowledge of the whole terrain, there is no way to know when you should decide to do so. For this reason, it may be best to abandon rationality and let the decision be made randomly. Of course, after a while, you should hope that you will have worked your way out of the mountain valleys and that a steepest descent strategy should become increasingly reasonable.

5.5.1. The Algorithm: In order to eliminate as many technicalities as possible, we will assume throughout that § is finite and has at least 2 elements. Next, let $H : § \to [0,\infty)$ be the function for which we want to locate a place where it achieves its minimum, and, without loss in generality, we will assume that $0$ is its minimum. Now take $\nu$ so that $(\nu)_i = 1$ for all $i \in §$, and choose a matrix $A$ so that $(A)_{ii} = 0$ for all $i \in §$, $(A)_{ij} = (A)_{ji} \ge 0$ if $j \ne i$, and $A$ is irreducible (cf. (5.4.4)) on §. In practice, the selection of $A$ should be made so that the evaluation of $H(j) - H(i)$ when $(A)_{ij} > 0$ is as "easy as possible." For example, if § has some sort of natural neighborhood structure with respect to which § is connected and the computation of $H(j) - H(i)$ when $j$ is a neighbor of $i$ requires very little time, then it is reasonable to take $A$ so that $(A)_{ij} = 0$ unless $j \ne i$ is a neighbor of $i$. Now define $\gamma(\beta)$ as in (5.4.2) and $Q(\beta)$ as in (5.4.7). Clearly, $\gamma(0)$ is just the normalized, uniform distribution on §: $(\gamma(0))_i = L^{-1}$, where $L = \#§ \ge 2$ is the number of elements in §. On the one hand, as $\beta$ gets larger, $\gamma(\beta)$ becomes more concentrated on $§_0$. More precisely, since $Z(\beta) \ge \#§_0 \ge 1$,
(5.5.1)  $\sum_{i \notin §_0}(\gamma(\beta))_i \le \sum_{i \notin §_0} e^{-\beta H(i)} \longrightarrow 0$ as $\beta \to \infty$.

On the other hand, as $\beta$ gets larger, Theorem 5.4.11 says that, at least when $e > 0$, $\lambda(\beta)$ will be getting smaller. Thus, we are confronted by a conflict. In view of the introductory discussion, this conflict between the virtues of taking $\beta$ large, which is tantamount to adopting an approximately steepest descent strategy, versus those of taking $\beta$ small, which is tantamount to keeping things fluid and thereby diminishing the danger of getting trapped, should be expected. Moreover, a resolution is suggested at the end of that discussion. Namely, in order to maximize the advantages of each, one should start with $\beta = 0$ and allow $\beta$ to increase with time.⁵ That is, we will make $\beta$ an increasing, continuous function $t \rightsquigarrow \beta(t)$ with $\beta(0) = 0$. In the interest of unencumbering our formulae, we will adopt the notation

$$Z(t) = Z\bigl(\beta(t)\bigr), \quad \gamma_t = \gamma\bigl(\beta(t)\bigr), \quad \langle\,\cdot\,\rangle_t = \langle\,\cdot\,\rangle_{\gamma_t}, \quad \|\cdot\|_{2,t} = \|\cdot\|_{2,\gamma_t}, \quad \mathrm{Var}_t = \mathrm{Var}_{\gamma_t},$$
$$Q(t) = Q\bigl(\beta(t)\bigr), \quad \mathcal{E}_t = \mathcal{E}_{\beta(t)}, \quad\text{and}\quad \lambda_t = \lambda_{\beta(t)}.$$

⁵ Actually, there is good reason to doubt that monotonically increasing $\beta$ is the best way to go. Indeed, the name "simulated annealing" derives from the idea that what one wants to do is simulate the annealing process familiar to chemists, material scientists, skilled carpenters, and followers of Metropolis. Namely, what these people do is alternately heat and cool to achieve their goal, and there is reason to believe we should be following their example. However, I have chosen not to follow them on the unforgivable, but understandable, grounds that my analysis is capable of handling only the monotone case.
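A discrete-time caricature of the algorithm just described is easy to code: propose a neighbor uniformly and accept the move with probability $e^{-\beta(H(j)-H(i))^+}$, the same bias that appears in (5.4.7). The landscape, schedule, and run lengths below are illustrative choices, and this sketch replaces the continuous-time process of the text by discrete sweeps.

```python
import math
import random

# One annealing run: uniform neighbor proposals, Metropolis-style acceptance
# with probability e^{-beta (H(j)-H(i))^+}; beta_of_t is the cooling schedule.
def anneal(H, neighbors, beta_of_t, steps, x0, rng):
    x = x0
    for t in range(steps):
        j = rng.choice(neighbors[x])
        if rng.random() < math.exp(-beta_of_t(t) * max(H[j] - H[x], 0.0)):
            x = j
    return x

H = [1.0, 3.0, 0.0]                    # local minimum at 0, global minimum at 2
neighbors = {0: [1], 1: [0, 2], 2: [1]}
rng = random.Random(42)
# A slowly increasing beta lets the walk climb over the barrier at state 1
# early on and then freeze near the global minimum.
hits = sum(anneal(H, neighbors, lambda t: 0.01 * t, 2000, 0, rng) == 2
           for _ in range(20))
assert hits >= 10                      # most runs end at the global minimum
```

With $\beta$ held fixed, each sweep of this chain is an ordinary reversible (Glauber) step for the Gibbs state $\gamma(\beta)$; the interest here lies entirely in letting $\beta$ grow.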
Because, in the physical model, $\beta$ is proportional to the reciprocal of temperature and $\beta$ increases with time, $t \rightsquigarrow \beta(t)$ is called the cooling schedule.

5.5.2. Construction of the Transition Probabilities: Because the Q-matrix here is time dependent, the associated transition probabilities will be time-inhomogeneous. Thus, instead of $t \rightsquigarrow Q(t)$ determining a one-parameter family of transition probability matrices, for each $s \in [0,\infty)$ it will determine a map $t \rightsquigarrow P(s,t)$ from $[s,\infty)$ into transition probability matrices by the time-inhomogeneous Kolmogorov forward equation
(5.5.2)  $\dfrac{d}{dt}P(s,t) = P(s,t)\,Q(t)$ on $(s,\infty)$ with $P(s,s) = I$.
Although (5.5.2) is not exactly covered by our earlier analysis of Kolmogorov equations, it nearly is. To see this, we solve (5.5.2) via an approximation procedure in which $Q(t)$ is replaced on the right hand side by

$$Q^{(N)}(t) = Q\bigl([t]_N\bigr) \quad\text{where } [t]_N = \frac{m}{N} \text{ for } t \in \Bigl[\frac{m}{N}, \frac{m+1}{N}\Bigr).$$

The solution $t \rightsquigarrow P^{(N)}(s,t)$ to the resulting equation is then given by the prescription $P^{(N)}(s,s) = I$ and

$$P^{(N)}(s,t) = P^{(N)}\bigl(s,\, s \vee [t]_N\bigr)\, e^{(t - s\vee[t]_N)\,Q([t]_N)} \quad\text{for } t > s.$$
As this construction makes obvious, $P^{(N)}(s,t)$ is a transition probability matrix for each $N \ge 1$ and $t \ge s$, and $(s,t) \rightsquigarrow P^{(N)}(s,t)$ is continuous. Moreover,

$$\|P^{(N)}(s,t) - P^{(M)}(s,t)\|_{u,v} \le \int_s^t \|Q([\tau]_N) - Q([\tau]_M)\|_{u,v}\,\|P^{(N)}(s,\tau)\|_{u,v}\,d\tau + \int_s^t \|Q([\tau]_M)\|_{u,v}\,\|P^{(N)}(s,\tau) - P^{(M)}(s,\tau)\|_{u,v}\,d\tau.$$

But

$$\|Q(t_1) - Q(t_2)\|_{u,v} \le \|A\|_{u,v}\,\|H\|_{\mathrm u}\,|\beta(t_1) - \beta(t_2)|, \quad \|Q(t)\|_{u,v} \le \|A\|_{u,v}, \quad\text{and}\quad \|P^{(N)}(s,\tau)\|_{u,v} \le 1,$$

and so

$$\|P^{(N)}(s,t) - P^{(M)}(s,t)\|_{u,v} \le \|A\|_{u,v}\|H\|_{\mathrm u}\int_s^t \bigl|\beta([\tau]_N) - \beta([\tau]_M)\bigr|\,d\tau + \|A\|_{u,v}\int_s^t \|P^{(N)}(s,\tau) - P^{(M)}(s,\tau)\|_{u,v}\,d\tau.$$

Hence, after an application of Gronwall's inequality, we find that

$$\sup_{0 \le s \le t \le T}\|P^{(N)}(s,t) - P^{(M)}(s,t)\|_{u,v} \le \|A\|_{u,v}\|H\|_{\mathrm u}\, e^{\|A\|_{u,v}T}\int_0^T \bigl|\beta([\tau]_N) - \beta([\tau]_M)\bigr|\,d\tau.$$
Because $\tau \rightsquigarrow \beta(\tau)$ is continuous, this proves that the sequence $\{P^{(N)}(s,t) : N \ge 1\}$ is Cauchy convergent in the sense that, for each $T > 0$,

$$\lim_{M,N \to \infty}\;\sup_{0 \le s \le t \le T}\|P^{(N)}(s,t) - P^{(M)}(s,t)\|_{u,v} = 0.$$

As a consequence, we know that there exists a continuous $(s,t) \rightsquigarrow P(s,t)$ to which the $P^{(N)}(s,t)$'s converge with respect to $\|\cdot\|_{u,v}$ uniformly on finite intervals. In particular, for each $t \ge s$, $P(s,t)$ is a transition probability matrix and $t \in [s,\infty) \mapsto P(s,t)$ is a continuous solution to

$$P(s,t) = I + \int_s^t P(s,\tau)\,Q(\tau)\,d\tau, \quad t \in [s,\infty),$$

which is the equivalent, integrated form of (5.5.2). Furthermore, if $t \in [s,\infty) \mapsto \mu_t \in \mathbf{M}_1(§)$ is continuously differentiable, then

(5.5.3)  $\dot\mu_t = \mu_t Q(t)$ for $t \in [s,\infty)$ $\iff$ $\mu_t = \mu_s P(s,t)$ for $t \in [s,\infty)$.

Since the "if" assertion is trivial, we turn to the "only if" statement. Thus, suppose that $t \in [s,\infty) \mapsto \mu_t \in \mathbf{M}_1(§)$ satisfying $\dot\mu_t = \mu_t Q(t)$ is given, and set $\omega_t = \mu_t - \mu_s P(s,t)$. Then

$$\omega_t = \int_s^t \omega_\tau\, Q(\tau)\,d\tau,$$

and so

$$\|\omega_t\|_{\mathrm v} \le \|A\|_{u,v}\int_s^t \|\omega_\tau\|_{\mathrm v}\,d\tau.$$

Hence, after another application of Gronwall's inequality, we see that $\omega_t = 0$ for all $t \ge s$. Of course, by applying this uniqueness result when $\mu_t = \delta_i P(s,t)$ for each $i \in §$, we learn that $(s,t) \rightsquigarrow P(s,t)$ is the one and only solution to (5.5.2). In addition, it leads to the following time-inhomogeneous version of the Chapman–Kolmogorov equation:

(5.5.4)  $P(r,t) = P(r,s)\,P(s,t)$ for $0 \le r \le s \le t$.

Indeed, set $\mu_t = \delta_i P(r,t)$ for $t \ge s$, note that $t \rightsquigarrow \mu_t$ satisfies (5.5.3) with $\mu_s = \delta_i P(r,s)$, and conclude that $\mu_t = \delta_i P(r,s)P(s,t)$.
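The approximation scheme just described can be imitated numerically with a forward-Euler product in place of the piecewise matrix exponentials. The landscape, linear schedule, and grid size below are illustrative; because the factors are taken over one fixed global time grid, the discrete analogue of (5.5.4) holds exactly.

```python
import math

# Forward-Euler sketch of the forward equation (5.5.2), dP(s,t)/dt = P(s,t)Q(t),
# P(s,s) = I, for Glauber rates built from a toy landscape and linear cooling.
H = [0.0, 1.0, 0.5]
n = len(H)
A = [[0.0 if i == j else 1.0 for j in range(n)] for i in range(n)]
beta = lambda t: 0.5 * t

def Q(t):
    M = [[math.exp(-beta(t) * max(H[j] - H[i], 0.0)) * A[i][j] if j != i else 0.0
          for j in range(n)] for i in range(n)]
    for i in range(n):
        M[i][i] = -sum(M[i][j] for j in range(n) if j != i)
    return M

def P(s, t, steps_per_unit=2000):
    """Euler product over a fixed global grid, so that P(r,t) = P(r,s)P(s,t)
    holds exactly for grid-aligned r <= s <= t (cf. (5.5.4))."""
    nsteps = round((t - s) * steps_per_unit)
    dt = (t - s) / nsteps
    Pst = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for k in range(nsteps):
        Qk = Q(s + k * dt)
        Pst = [[Pst[i][j] + dt * sum(Pst[i][l] * Qk[l][j] for l in range(n))
                for j in range(n)] for i in range(n)]
    return Pst

def matmul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

P02, prod = P(0.0, 2.0), matmul(P(0.0, 1.0), P(1.0, 2.0))
for row in P02:                      # rows stay probability vectors
    assert abs(sum(row) - 1.0) < 1e-9 and min(row) > 0.0
# time-inhomogeneous Chapman-Kolmogorov: P(0,2) = P(0,1) P(1,2)
assert all(abs(P02[i][j] - prod[i][j]) < 1e-9 for i in range(n) for j in range(n))
```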
5.5.3. Description of the Markov Process: Given a probability vector $\mu$, we now want to construct a Markov process $\{X(t) : t \ge 0\}$ which has $\mu$ as its initial distribution and $(s,t) \rightsquigarrow P(s,t)$ as its transition mechanism, in the sense that

(5.5.5)  $\mathbb{P}\bigl(X(0) = i\bigr) = (\mu)_i$  and  $\mathbb{P}\bigl(X(t) = j \bigm| X(\sigma),\ \sigma \in [0,s]\bigr) = P(s,t)_{X(s)j}$.

The idea which we will use is basically the same as the one which we used in §4.2.1 and §2.1.1. However, life here is made more complicated by the fact that the time inhomogeneity forces us to have an uncountable number of random variables at hand: a pair for each $(t,i) \in [0,\infty) \times §$. To handle this situation, take, without loss in generality, $§ = \{1,\dots,L\}$ and, for $(t,i,j) \in [0,\infty) \times §^2$, set

$$S(t,i,j) = \sum_{\ell=1}^{j} e^{-\beta(t)(H(\ell)-H(i))^+}(A)_{i\ell}, \quad\text{take } S(t,i,0) = 0,$$

and define, for $u \in [0,1)$,

$$\Psi(t,i,u) = j \quad\text{if } \frac{S(t,i,j-1)}{S(t,i,L)} \le u < \frac{S(t,i,j)}{S(t,i,L)}.$$

Also, determine $T : [0,\infty) \times § \times [0,\infty) \to [0,\infty)$ by

$$T(t,i,\xi) = \inf\Bigl\{\tau \ge 0 : \int_0^\tau S(t+\sigma,\, i,\, L)\,d\sigma \ge \xi\Bigr\}.$$

Next, let $X_0$ be an §-valued random variable with distribution $\mu$, let $\{E_n : n \ge 1\}$ be a sequence of unit exponential random variables which are independent of each other and of $X_0$, and let $\{U_n : n \ge 1\}$ be a sequence of random variables which are uniformly distributed on $[0,1)$ and independent of each other and of $\sigma\bigl(\{X_0\} \cup \{E_n : n \ge 1\}\bigr)$. Finally, set $J_0 = 0$ and $X(0) = X_0$, and, when $n \ge 1$, use induction to define

$$J_n - J_{n-1} = T\bigl(J_{n-1}, X(J_{n-1}), E_n\bigr), \qquad X(J_n) = \Psi\bigl(J_n, X(J_{n-1}), U_n\bigr),$$

and $X(t) = X(J_{n-1})$ for $J_{n-1} \le t < J_n$. Without substantial change, the reasoning given in §2.1.1 combined with that in §4.2.2 allows one to show that (5.5.5) holds.

5.5.4. Choosing a Cooling Schedule: In this section, we will give a rational basis on which to choose the cooling schedule $t \rightsquigarrow \beta(t)$. For this purpose, it is essential to keep in mind what it is that we are attempting to do. Namely, we are trying to have the Markov process $\{X(t) : t \ge 0\}$ seek out the set $§_0 = \{j : H(j) = 0\}$ in the sense that, as $t \to \infty$, $\mathbb{P}\bigl(X(t) \notin §_0\bigr)$ should tend to $0$ as fast as possible, and the way we hope to accomplish this is by making the distribution of $X(t)$ look as much like $\gamma_t$ as possible. Thus, on the one hand, we need to give $\{X(t) : t \ge 0\}$ enough time to equilibrate, so that the distribution of $X(t)$ will look a lot like $\gamma_t$. On the other hand, in spite of the fact that it may inhibit equilibration, unless we make $\beta(t)$ increase to infinity,
there is no reason for our wanting to make the distribution of $X(t)$ look like $\gamma_t$.
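Before turning to the choice of schedule, the jump construction of §5.5.3 can be sketched numerically: the holding time $T(t,i,\xi)$ is realized as the first passage of the integrated jump rate past a unit exponential $\xi$, and $\Psi$ selects the next state. The landscape, schedule, and step size $dt$ are illustrative choices, and the time integral is discretized rather than inverted exactly.

```python
import math
import random

# Numerical sketch of the jump construction in 5.5.3 (toy data throughout).
H = [1.0, 2.0, 0.0]
L = len(H)
A = [[0.0 if i == j else 1.0 for j in range(L)] for i in range(L)]
beta = lambda t: 0.3 * t

def S(t, i, j):
    # S(t,i,j) = sum_{l <= j} e^{-beta(t)(H(l)-H(i))^+} (A)_{il};
    # S(t,i,L) is the total jump rate out of i at time t, i.e. -(Q(t))_{ii}.
    return sum(math.exp(-beta(t) * max(H[l] - H[i], 0.0)) * A[i][l]
               for l in range(j))

def Psi(t, i, u):
    # next state: the j with S(t,i,j)/S(t,i,L) <= u < S(t,i,j+1)/S(t,i,L)
    total = S(t, i, L)
    for j in range(L):
        if u < S(t, i, j + 1) / total:
            return j
    return i

def sample_path(x0, horizon, rng, dt=1e-3):
    t, x, path = 0.0, x0, [(0.0, x0)]
    xi, acc = rng.expovariate(1.0), 0.0
    while t < horizon:
        acc += S(t, x, L) * dt       # integrate the jump rate forward in time
        t += dt
        if acc >= xi:                # the exponential clock rings: jump
            x = Psi(t, x, rng.random())
            path.append((t, x))
            xi, acc = rng.expovariate(1.0), 0.0
    return path

path = sample_path(0, 20.0, random.Random(7))
assert path[0] == (0.0, 0)
assert all(state in range(L) for _, state in path)
assert all(t1 < t2 for (t1, _), (t2, _) in zip(path, path[1:]))
```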
In order to understand how to deal with the concerns raised above, let $\mu$ be a fixed initial distribution, and let $\mu_t$ be the distribution at time $t \ge 0$ of the Markov process $\{X(t) : t \ge 0\}$ described in §5.5.3 with initial distribution $\mu$. Equivalently, $\mu_t = \mu P(0,t)$, where $\{P(s,t) : 0 \le s \le t < \infty\}$ is the family of transition probability matrices constructed in §5.5.2. Next, define $f_t : § \to [0,\infty)$ so that

$$f_t(i) = \frac{(\mu_t)_i}{(\gamma_t)_i} \quad\text{for } t \ge 0 \text{ and } i \in §.$$

It should be obvious that the size of $f_t$ provides a good measure of the extent to which $\mu_t$ resembles $\gamma_t$. For example, by Schwarz's inequality and (5.5.1),

(5.5.6)  $\mathbb{P}\bigl(X(t) \notin §_0\bigr) = \bigl\langle \mathbf{1}_{§\setminus§_0},\, f_t\bigr\rangle_t \le \|f_t\|_{2,t}\Bigl(\sum_{i \notin §_0} e^{-\beta(t)H(i)}\Bigr)^{\frac12},$

and so we will have made progress if we can keep $\|f_t\|_{2,t}$ under control. With the preceding in mind, assume that $t \rightsquigarrow \beta(t)$ is continuously differentiable, note that this assumption makes $t \rightsquigarrow \|f_t\|_{2,t}^2$ also continuously differentiable, and, in fact, that
$$\frac{d}{dt}\|f_t\|_{2,t}^2 = \frac{d}{dt}\sum_i \bigl(f_t(i)\bigr)^2\,\frac{e^{-\beta(t)H(i)}}{Z(t)} = 2\langle f_t, \dot f_t\rangle_t - \dot\beta(t)\bigl\langle \bigl(H - \langle H\rangle_t\bigr)f_t,\, f_t\bigr\rangle_t,$$

since, by (5.4.3), $\dot Z(t) = -\dot\beta(t)Z(t)\langle H\rangle_t$. On the other hand, we can compute this same derivative another way. Namely, because $\langle P(0,t)g\rangle_\mu = \langle g\rangle_{\mu_t} = \langle f_t, g\rangle_t$ for any function $g$,

$$\|f_t\|_{2,t}^2 = \langle f_t\rangle_{\mu_t} = \bigl\langle P(0,t)f_t\bigr\rangle_\mu,$$

and so we can use (5.5.2) to see that

$$\frac{d}{dt}\|f_t\|_{2,t}^2 = \bigl\langle P(0,t)Q(t)f_t\bigr\rangle_\mu + \bigl\langle P(0,t)\dot f_t\bigr\rangle_\mu = -\mathcal{E}_t(f_t,f_t) + \langle f_t, \dot f_t\rangle_t.$$

Thus, after combining these to eliminate the term containing $\dot f_t$, we arrive at

$$\frac{d}{dt}\|f_t\|_{2,t}^2 = -2\mathcal{E}_t(f_t,f_t) + \dot\beta(t)\bigl\langle \bigl(H - \langle H\rangle_t\bigr)f_t,\, f_t\bigr\rangle_t \le -2\lambda_t\,\mathrm{Var}_t(f_t) + \dot\beta(t)\,\|H\|_{\mathrm u}\,\|f_t\|_{2,t}^2,$$

where, in passing to the second line, we have used the fact that $\langle f_t\rangle_t = 1$ and therefore that $\mathrm{Var}_t(f_t) = \|f_t\|_{2,t}^2 - 1$. Putting all this together, we now know that

$$\|H\|_{\mathrm u}\,\dot\beta(t) \le \lambda_t \implies \frac{d}{dt}\|f_t\|_{2,t}^2 \le -\lambda_t\|f_t\|_{2,t}^2 + 2\lambda_t.$$

The preceding differential inequality for $\|f_t\|_{2,t}^2$ is easy to integrate. Namely, with $\Lambda(t) = \int_0^t \lambda_\tau\,d\tau$, it says that

$$\frac{d}{dt}\Bigl(e^{\Lambda(t)}\bigl(\|f_t\|_{2,t}^2 - 2\bigr)\Bigr) \le 0.$$

Hence,

$$\|f_t\|_{2,t}^2 \le e^{-\Lambda(t)}\|f_0\|_{2,0}^2 + 2\bigl(1 - e^{-\Lambda(t)}\bigr) \le \|f_0\|_{2,0}^2 \vee 2.$$

Moreover, since $(\gamma_0)_i = L^{-1}$, where $L = \#§ \ge 2$, $\|f_0\|_{2,0}^2 \le L$, and so

(5.5.7)  $\|H\|_{\mathrm u}\,\dot\beta(t) \le \lambda_t$ for all $t \ge 0$ $\implies$ $\|f_t\|_{2,t}^2 \le L$ for all $t \ge 0$.

The final step is to find out how to choose $t \rightsquigarrow \beta(t)$ so that it satisfies the condition in (5.5.7); and, of course, we are only interested in the case when $§_0 \ne §$ or, equivalently, $\|H\|_{\mathrm u} > 0$. Next, by Theorem 5.4.11, we know that $\lambda_t \ge c_- e^{-\beta(t)e}$. Hence, we can take

(5.5.8)  $\beta(t) = \dfrac{1}{e}\log\Bigl(1 + \dfrac{c_- e\, t}{\|H\|_{\mathrm u}}\Bigr)$ when $e > 0$ and $\beta(t) = \dfrac{c_- t}{\|H\|_{\mathrm u}}$ when $e = 0$,

the case when $e = 0$ being obtained by an obvious limit procedure. After putting this together with (5.5.7) and (5.5.6), we have now proved that when $\beta(t)$ is given by (5.5.8), then

(5.5.9)  $\mathbb{P}\bigl(X(t) \notin §_0\bigr) \le \sqrt{L}\,\Bigl(\sum_{i \notin §_0}\Bigl(1 + \frac{c_- e\, t}{\|H\|_{\mathrm u}}\Bigr)^{-\frac{H(i)}{e}}\Bigr)^{\frac12}$ when $e > 0$ and $\mathbb{P}\bigl(X(t) \notin §_0\bigr) \le \sqrt{L}\,\Bigl(\sum_{i \notin §_0} e^{-\frac{c_- H(i)\,t}{\|H\|_{\mathrm u}}}\Bigr)^{\frac12}$ when $e = 0$.
Remark: The result in (5.5.9) when $e = 0$ deserves some further comment. In particular, it should be observed that $e = 0$ does not guarantee success for a steepest descent strategy. Indeed, $e = 0$ only means that each $i$ can be connected to $§_0$ by an allowable path along which $H$ is non-increasing (cf. Exercise 5.6.5); it does not rule out the possibility that, when using steepest descent, one will choose a bad path and get stuck. Thus, even in this situation, one needs enough randomness to hunt around until one finds a good path.
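The schedule (5.5.8) is simple enough to implement directly. The constants $c_-$, $e$, and $\|H\|_{\mathrm u}$ below are hypothetical stand-ins; the code checks the defining property behind (5.5.7), namely $\|H\|_{\mathrm u}\,\dot\beta(t) = c_- e^{-e\beta(t)}$, by finite differences, and that the $e = 0$ case is the limit of the logarithmic one.

```python
import math

# The cooling schedule (5.5.8), with illustrative constants.
def cooling_schedule(c_minus, e, H_sup):
    if e > 0:
        return lambda t: math.log(1.0 + c_minus * e * t / H_sup) / e
    return lambda t: c_minus * t / H_sup

c_minus, e, H_sup = 0.5, 2.0, 3.0
beta = cooling_schedule(c_minus, e, H_sup)
assert beta(0.0) == 0.0
# ||H||_u beta'(t) = c_- e^{-e beta(t)}, so the condition in (5.5.7) holds
# whenever lambda_t >= c_- e^{-beta(t) e} (Theorem 5.4.11).
for t in (0.5, 2.0, 10.0):
    h = 1e-6
    dbeta = (beta(t + h) - beta(t - h)) / (2 * h)
    assert abs(H_sup * dbeta - c_minus * math.exp(-e * beta(t))) < 1e-5
# As e -> 0 the logarithmic schedule degenerates to the linear one.
near0 = cooling_schedule(c_minus, 1e-6, H_sup)
assert abs(near0(4.0) - c_minus * 4.0 / H_sup) < 1e-6
```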
5.5.5. Small Improvements: An observation, which is really only of interest when $e > 0$, is that one can carry out the same sort of analysis to control

$$\|f_t\|_{q,t} = \bigl(\langle |f_t|^q\rangle_t\bigr)^{\frac{1}{q}} \quad\text{for each } q \in [2,\infty).$$

As a result, one can show that there is, for each $\delta \in (0,1)$, a cooling schedule which improves the polynomial rate at which $\mathbb{P}(X(t) \notin §_0)$ goes to $0$, the relationship between $\delta$ and $q$ being given by $\delta = 1 - \frac{2}{q}$. To carry this out, one begins by computing $\frac{d}{dt}\|f_t\|_{q,t}^q$ twice, once for each of the two expressions for $\|f_t\|_{q,t}^q$ which generalize those used in §5.5.4. One then eliminates $\dot f_t$ from the resulting equations and thereby arrives at a differential relation in which the conjugate exponent $q' = \frac{q}{q-1}$ appears. At this point one has to show that the resulting Dirichlet form term dominates, and a little thought makes it clear that this comes down to checking an elementary inequality for pairs $(a,b) \in [0,\infty)^2$, which, when looked at correctly, follows from the Fundamental Theorem of Calculus plus Schwarz's inequality. Hence, in conjunction with the preceding and (5.3.6), one finds that $\frac{d}{dt}\|f_t\|_{q,t}^q$ is controlled in terms of $\lambda_t$, $\|f_t\|_{q,t}^q$, and $\langle f_t^{q-2}\rangle_t$.

In order to proceed further, we must learn how to control $\langle f_t^{q-2}\rangle_t$ in terms of $\|f_t\|_{q,t}$. In the case when $q = 2$, this quantity caused no problem because we knew it was equal to $1$. When $q > 2$, we no longer have so much control over it. Nonetheless, by first writing $f_t^{q-2} = \bigl(f_t^{q-1}\bigr)^{\frac{q-2}{q-1}}$ and then using part (c) of Exercise 5.6.2 to see that

$$\bigl\langle \bigl(f_t^{q-1}\bigr)^{\frac{q-2}{q-1}}\bigr\rangle_t \le \langle f_t^{q-1}\rangle_t^{\frac{q-2}{q-1}},$$

we arrive at $\langle f_t^{q-2}\rangle_t \le \|f_t\|_{q,t}^{q-2}$. Armed with this estimate, we obtain the differential inequality

$$\frac{d}{dt}\|f_t\|_{q,t}^q \le -\lambda_t\,\|f_t\|_{q,t}^q + 4\lambda_t\,\bigl(\|f_t\|_{q,t}^q\bigr)^{1-\frac{2}{q}}.$$

Finally, by taking $\beta(t)$ to be a logarithmic cooling schedule of the same form as the one in (5.5.8), but with a $q$-dependent rate, the preceding inequality can be integrated and made to yield the rate of decay for $\mathbb{P}(X(t) \notin §_0)$ advertised above.
Remark: Actually, it is possible to do even better if one is prepared to combine the preceding line of reasoning, which is basically a consequence of Poincaré's inequality, with analytic ideas which come under the general heading of Sobolev inequalities. The interested reader might want to consult [3], which is the source from which the contents of this whole section are derived.

5.6 Exercises
EXERCISE 5.6.1. A function $\psi : [0,\infty) \to \mathbb{R}$ is said to be convex if the graph of $\psi$ lies below the secant connecting any pair of points on its graph. That is, if it satisfies

(*)  $\psi\bigl((1-\theta)s + \theta t\bigr) \le (1-\theta)\psi(s) + \theta\psi(t)$ for all $0 \le s < t$ and $\theta \in [0,1]$.

This exercise deals with various properties of convex functions, all of which turn on the property that the slope of a convex function is non-decreasing.

(a) If $\{\psi_n\}_1^\infty \cup \{\psi\}$ are functions on $[0,\infty)$ and $\psi_n(t) \to \psi(t)$ for each $t \in [0,\infty)$, show that $\psi$ is non-increasing if each of the $\psi_n$'s is and that $\psi$ is convex if each of the $\psi_n$'s is.

(b) If $\psi : [0,\infty) \to \mathbb{R}$ is continuous and twice continuously differentiable on $(0,\infty)$, show that $\psi$ is convex on $[0,\infty)$ if and only if $\ddot\psi \ge 0$ on $(0,\infty)$.

Hint: The "only if" part is an easy consequence of

$$\ddot\psi(t) = \lim_{h \searrow 0}\frac{\psi(t+h) + \psi(t-h) - 2\psi(t)}{h^2} \quad\text{for } t \in (0,\infty).$$

To prove the "if" statement, let $0 < s < t$ be given, and for $\epsilon > 0$ set

$$\varphi_\epsilon(\theta) = \psi\bigl((1-\theta)s + \theta t\bigr) - (1-\theta)\psi(s) - \theta\psi(t) - \epsilon\,\theta(1-\theta), \quad \theta \in [0,1].$$

Note that $\ddot\varphi_\epsilon > 0$ on $(0,1)$. Hence, by the second derivative test, $\varphi_\epsilon$ has no interior maximum, and, since $\varphi_\epsilon(0) = 0 = \varphi_\epsilon(1)$, it follows that $\varphi_\epsilon \le 0$ on $[0,1]$; now let $\epsilon \searrow 0$.

(c) If $\psi$ is convex on $[0,\infty)$, show that, for each $s \in [0,\infty)$,

$$t \in (s,\infty) \mapsto \frac{\psi(s) - \psi(t)}{t - s} \text{ is non-increasing.}$$

(d) If $\psi$ is convex on $[0,\infty)$ and $0 \le s < t \le u < w$, show that

$$\frac{\psi(s) - \psi(t)}{t - s} \ge \frac{\psi(u) - \psi(w)}{w - u}.$$

Hint: Reduce to the case when $u = t$.
EXERCISE 5.6.2. Given a probability vector $\mu \in [0,1]^{§}$, there are many ways to prove that $\langle f\rangle_\mu^2 \le \langle f^2\rangle_\mu$ for any $f \in L^2(\mu)$. For example, one can get this inequality as an application of Schwarz's inequality $|\langle f,g\rangle_\mu| \le \|f\|_{2,\mu}\|g\|_{2,\mu}$ by taking $g = \mathbf{1}$. Alternatively, one can use $0 \le \mathrm{Var}_\mu(f) = \langle f^2\rangle_\mu - \langle f\rangle_\mu^2$. However, neither of these approaches reveals the essential role that convexity plays here. Namely, the purpose of this exercise is to show that for any non-decreasing, continuous, convex function $\psi : [0,\infty) \to [0,\infty)$ and any $f : § \to [0,\infty)$,

(5.6.3)  $\psi\bigl(\langle f\rangle_\mu\bigr) \le \langle \psi \circ f\rangle_\mu$,

where the meaning of the left hand side when $\langle f\rangle_\mu = \infty$ is given by taking $\psi(\infty) = \lim_{t \nearrow \infty}\psi(t)$. The inequality (5.6.3) is an example of a more general statement known as Jensen's inequality (cf. Theorem 6.1.1 in [8]).

(a) Use induction on $n \ge 2$ to show that

$$\psi\Bigl(\sum_{m=1}^{n}\theta_m x_m\Bigr) \le \sum_{m=1}^{n}\theta_m\,\psi(x_m) \quad\text{for } (\theta_1,\dots,\theta_n) \in [0,1]^n \text{ with } \sum_{m=1}^{n}\theta_m = 1 \text{ and } (x_1,\dots,x_n) \in [0,\infty)^n.$$

(b) Let $\{F_N\}_1^\infty$ be a non-decreasing exhaustion of § by finite sets satisfying $\mu(F_N) = \sum_{i \in F_N}(\mu)_i > 0$, apply part (a) to see that

$$\psi\Bigl(\frac{\sum_{i \in F_N} f(i)(\mu)_i}{\mu(F_N)}\Bigr) \le \frac{\sum_{i \in F_N}\psi\bigl(f(i)\bigr)(\mu)_i}{\mu(F_N)}$$

for each $N$, and get the asserted result after letting $N \to \infty$.

(c) As an application of (5.6.3), show that, for any $0 < p \le q < \infty$ and $f : § \to [0,\infty)$, $\langle f^p\rangle_\mu^{\frac{1}{p}} \le \langle f^q\rangle_\mu^{\frac{1}{q}}$.
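Both (5.6.3) and the power-mean monotonicity of part (c) are easy to probe numerically. The probability vector and function values below are arbitrary illustrative data; $\psi(x) = x^2$ is used as a concrete non-decreasing convex function.

```python
# Numerical check of Jensen's inequality (5.6.3) with psi(x) = x^2, and of the
# power-mean monotonicity of part (c); the data are arbitrary.
mu = [0.1, 0.2, 0.3, 0.4]        # a probability vector
f = [3.0, 1.0, 4.0, 1.5]

mean = sum(m * x for m, x in zip(mu, f))
# Jensen: psi(<f>_mu) <= <psi o f>_mu
assert mean ** 2 <= sum(m * x ** 2 for m, x in zip(mu, f))

def power_mean(p):
    return sum(m * x ** p for m, x in zip(mu, f)) ** (1.0 / p)

# part (c): p <= q  ==>  <f^p>^{1/p} <= <f^q>^{1/q}
ps = [0.5, 1.0, 2.0, 3.0, 7.0]
assert all(power_mean(p) <= power_mean(q) + 1e-12
           for p, q in zip(ps, ps[1:]))
```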
EXERCISE 5.6.4. Gronwall's is an inequality which has many forms, the most elementary of which states that if $u : [0,T] \to [0,\infty)$ is a continuous function which satisfies

$$u(t) \le A + B\int_0^t u(\tau)\,d\tau \quad\text{for } t \in [0,T],$$

then $u(t) \le Ae^{Bt}$ for $t \in [0,T]$. Prove this form of Gronwall's inequality.

Hint: Set $U(t) = \int_0^t u(\tau)\,d\tau$, show that $\dot U(t) \le A + BU(t)$, and conclude that $U(t) \le \frac{A}{B}\bigl(e^{Bt} - 1\bigr)$.
EXERCISE 5.6.5. Define $e$ as in Theorem 5.4.11, and show that $e = -m$ if and only if for each $(i,j) \in § \times §_0$ there is an allowable path $(i_0,\dots,i_n)$ starting at $i$ and ending at $j$ along which $H$ is non-increasing. That is, $i_0 = i$, $i_n = j$, and, for each $1 \le m \le n$, $(A)_{i_{m-1}i_m} > 0$ and $H(i_m) \le H(i_{m-1})$.

EXERCISE 5.6.6. This exercise deals with the material in §5.1.3 and demonstrates that, with the exception of (5.1.10), more or less everything in that section extends to general irreducible, positive recurrent $P$'s, whether or not they are reversible. Again let $\pi$ be the unique $P$-stationary probability vector. In addition, for the exercise which follows, it will be important to consider the space $L^2(\pi;\mathbb{C})$ of $\mathbb{C}$-valued, $\pi$-square-integrable functions.

(a) Define $P^\top$ as in (5.1.2), show that

$$1 \ge (P^\top P)_{ii} = (\pi)_i\sum_{j \in §}\frac{(P)_{ij}^2}{(\pi)_j},$$

and conclude that the series in the definition $(Pf)(i) = \sum_{j \in §} f(j)(P)_{ij}$ is absolutely convergent for each $i \in §$ and $f \in L^2(\pi;\mathbb{C})$.

(b) Let $d$ be the period of $P$ and $(§_0,\dots,§_{d-1})$ a cyclic decomposition of §. Show that, for each integer $0 \le m < d$, the bounded function $f = \sum_{r=0}^{d-1} e^{\frac{2\pi\sqrt{-1}\,mr}{d}}\mathbf{1}_{§_r}$ satisfies $Pf = e^{\frac{2\pi\sqrt{-1}\,m}{d}}f$.

(c) Suppose that $f \in L^2(§;\mathbb{C})$ is a non-trivial, bounded solution to $Pf = e^{\frac{2\pi\sqrt{-1}\,m}{d}}f$ for some $m \in \mathbb{Z}$, and let $(§_0,\dots,§_{d-1})$ be a cyclic decomposition of §. Show that, for each $0 \le r < d$, $f \upharpoonright §_r = e^{\frac{2\pi\sqrt{-1}\,mr}{d}}c_0$, where $c_0 \in \mathbb{C}\setminus\{0\}$. In particular, up to a multiplicative constant, for each integer $0 \le m < d$ there is exactly one non-trivial, bounded $f \in L^2(§;\mathbb{C})$ satisfying $Pf = e^{\frac{2\pi\sqrt{-1}\,m}{d}}f$.

(d) Let $\mathcal{H}$ denote the subspace of $f \in L^2(\pi;\mathbb{C})$ satisfying $Pf = \theta f$ for some $\theta \in \mathbb{C}$ with $|\theta| = 1$. By combining parts (b) and (c), show that $d$ is the dimension of $\mathcal{H}$ as a vector space over $\mathbb{C}$.

(e) Assume that $(P)_{ij} > 0 \implies (P)_{ji} > 0$ for all $(i,j) \in §^2$. Show that $d \le 2$ and that $d = 2$ if and only if there is a non-trivial, bounded $f : § \to \mathbb{R}$ satisfying $Pf = -f$. Thus, this is the only property of reversible transition probability matrices of which we made essential use in Theorem 5.1.14.
EXERCISE 5.6.8. This exercise provides another way to think about the relationship between non-negative definiteness and aperiodicity. Namely, let P be a not necessarily irreducible transition probability matrix on 𝕊, and assume that μ is a probability vector for which the detailed balance condition (μ)_i(P)_{ij} = (μ)_j(P)_{ji}, (i,j) ∈ 𝕊², holds. Further, assume that P is non-negative definite in L²(μ): (f, Pf)_μ ≥ 0 for all bounded f : 𝕊 → ℝ. Show that (μ)_i > 0 ⟹ (P)_{ii} > 0 and therefore that i is aperiodic if (μ)_i > 0. What follows are steps which lead to this conclusion.
(a) Define the matrix A so that (A)_{ij} = (1_{{i}}, P1_{{j}})_μ, and show that A is symmetric (i.e., (A)_{ij} = (A)_{ji}) and non-negative definite in the sense that Σ_{ij} (A)_{ij}(x)_i(x)_j ≥ 0 for any x ∈ ℝ^𝕊 with only a finite number of non-vanishing entries.
(b) Given i ≠ j, consider the plane {α1_{{i}} + β1_{{j}} : α, β ∈ ℝ} in L²(μ), and, using the argument with which we derived (5.1.5), show that (A)²_{ij} ≤ (A)_{ii}(A)_{jj}. In particular, if (A)_{ii} = 0, then (A)_{ij} = 0 for all j ∈ 𝕊.
(c) Complete the proof by noting that Σ_{j∈𝕊} (A)_{ij} = (μ)_i.
(d) After examining the argument, show that we did not need P to be non-negative definite but only that, for a given i ∈ 𝕊, each of the 2×2 submatrices

    ( (A)_{ii}  (A)_{ij} )
    ( (A)_{ji}  (A)_{jj} )

be non-negative definite.

EXERCISE 5.6.9. Let P be an irreducible, positive recurrent transition probability matrix with stationary distribution π. Refer to (5.1.2), and show that the first construction in (5.1.3) is again irreducible whereas the second one need not be. Also, show that the period of the first construction is never greater than that of P and that the second construction is always non-negative definite. Thus, if P^⊤P is irreducible, it is necessarily aperiodic.
5 REVERSIBLE MARKOV PROCESSES
EXERCISE 5.6.10. In many applications, the structure of a Markov chain is determined by a finite, simple graph (𝕊, E). To be precise, the state space 𝕊 is the set of vertices of the graph and E is a symmetric subset of 𝕊², the set of edges; and simplicity means that there are no loops or double edges. Next, we say that j is a nearest neighbor of i, denoted by j ∈ N(i), if and only if (i,j) ∈ E. Throughout, we assume that the degree d(i) ≡ #N(i) is positive for each i ∈ 𝕊. The transition probability P associated with (𝕊, E) is determined so that (P)_{ij} = 1/d(i) if j ∈ N(i) and 0 otherwise. The interested reader will find more examples in [1].
(a) Show that P is irreducible if and only if the graph is connected. That is, if and only if to each pair (i,j) ∈ 𝕊² there corresponds a finite sequence (i₀,…,i_n) ∈ 𝕊^{n+1} such that i = i₀, j = i_n, and (i_{m−1}, i_m) ∈ E for 1 ≤ m ≤ n. In addition, assuming irreducibility, show that the chain is aperiodic if and only if the graph is not bipartite. That is, if and only if there is no non-empty 𝕊′ ⊊ 𝕊 with the property that every edge connects a point in 𝕊′ to one in 𝕊∖𝕊′.
(b) Define the probability vector π̂ by (π̂)_i = d(i)/(2#E), and show that π̂ is reversible for P.
(c) Assuming that the graph is connected, choose a set 𝒫 = {p(i,j) : (i,j) ∈ 𝕊²∖D} of allowable paths, as in §5.2.2, and show that

(5.6.11)    β₊ ≤ 1 − 2#E/(D² L(𝒫) B(𝒫)),

where D = max_{i∈𝕊} d(i), L(𝒫) is the maximal length (i.e., number of edges) of the paths in 𝒫, and B(𝒫), the bottleneck coefficient, is the maximal number of paths p ∈ 𝒫 which cross over an edge e ∈ E. Obviously, if one chooses 𝒫 to consist of geodesics (i.e., paths of minimal length connecting their end points), then L(𝒫) is just the diameter of (𝕊, E) and, as such, is as small as possible. On the other hand, because it may force there to be bad bottlenecks (i.e., many paths traversing a given edge), choosing geodesics may not be optimal.
(d) Assuming that the graph is connected and not bipartite, choose 𝒫 = {p(i) : i ∈ 𝕊} to be a set of allowable closed paths of odd length, and show that

(5.6.12)    β₋ ≥ −1 + 2/(D L(𝒫) B(𝒫)).

EXERCISE 5.6.13. Here is an example to which the considerations in Exercise 5.6.10 apply and give close to optimal results. Namely, let N ≥ 2, and consider the set 𝕊 in the complex plane ℂ consisting of the Nth roots of unity. Next, take E to be the collection of pairs of adjacent roots of unity, that is, pairs of the form (e^{√−1·2π(m−1)/N}, e^{√−1·2πm/N}). Finally, take P and π̂ accordingly, as in the preceding.
(a) As an application of (c) in the preceding (taking 𝒫 to consist of geodesics), show that β₊ ≤ 1 − 4N/((N−1)(N²−1)), so that 1 − β₊ is at least of order N⁻².
(b) Assume that N is odd, and use (d) above to show that β₋ ≥ −1 + N⁻².
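The objects in Exercises 5.6.10 and 5.6.13 are easy to build and test numerically. The sketch below (variable names are ours) constructs the nearest-neighbor random walk on the N-cycle and verifies that the rows of P are probability vectors and that π̂_i = d(i)/2#E satisfies detailed balance:

```python
N = 7  # an odd number of vertices, as in Exercise 5.6.13
# adjacency of the N-cycle: vertex m is joined to m-1 and m+1 (mod N)
nbrs = {m: [(m - 1) % N, (m + 1) % N] for m in range(N)}
num_edges = sum(len(v) for v in nbrs.values()) // 2  # #E = N for the cycle

# (P)_ij = 1/d(i) for j in N(i), 0 otherwise
P = {(i, j): 1.0 / len(nbrs[i]) for i in nbrs for j in nbrs[i]}
# pi_hat_i = d(i) / (2 #E)
pi_hat = {i: len(nbrs[i]) / (2.0 * num_edges) for i in nbrs}

# rows of P sum to 1, and pi_hat is a probability vector
assert all(abs(sum(P.get((i, j), 0.0) for j in range(N)) - 1.0) < 1e-12 for i in range(N))
assert abs(sum(pi_hat.values()) - 1.0) < 1e-12
# detailed balance: pi_i P_ij = pi_j P_ji for every edge, so pi_hat is reversible for P
assert all(abs(pi_hat[i] * P[i, j] - pi_hat[j] * P[j, i]) < 1e-12 for (i, j) in P)
```

The same construction works for any finite simple graph once `nbrs` is replaced by its adjacency lists.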
EXERCISE 5.6.14. In this exercise we will give a very cursory introduction to a class of reversible Markov processes which provide somewhat naive mathematical models of certain physical systems. In the literature these are often called, for reasons which will be clear shortly, spin-flip systems, and they are among the earliest examples of Glauber dynamics. Here the state space 𝕊 = {−1,1}^N is to be thought of as the configuration space for a system of N particles, each of which has "spin" +1 or −1. Because it is more conventional, we will use ω = (ω₁,…,ω_N) or η = (η₁,…,η_N) to denote generic elements of 𝕊. Given ω ∈ 𝕊 and 1 ≤ k ≤ N, ω̂^k will be the configuration obtained from ω by "flipping" its kth spin. That is, ω̂^k = (ω₁,…,ω_{k−1},−ω_k,ω_{k+1},…,ω_N). Next, given ℝ-valued functions f and g on 𝕊, define

    Γ(f,g)(ω) = Σ_{k=1}^N (f(ω̂^k) − f(ω))(g(ω̂^k) − g(ω)),

which is a discrete analog of the dot product of the gradient of f with the gradient of g. Finally, given a probability vector μ with (μ)_ω > 0 for all ω ∈ 𝕊, define

    (Q^μ)_{ωη} = √((μ)_η/(μ)_ω)               if η ∈ {ω̂^1,…,ω̂^N},
    (Q^μ)_{ωη} = −Σ_{k=1}^N √((μ)_{ω̂^k}/(μ)_ω)  if η = ω,
    (Q^μ)_{ωη} = 0                            otherwise.

(a) Check that Q^μ is an irreducible Q-matrix on 𝕊 and that the detailed balance condition (μ)_ω (Q^μ)_{ωη} = (μ)_η (Q^μ)_{ηω} holds. In addition, show that the associated Dirichlet form satisfies

    ℰ^μ(f,f) = ½ Σ_{ω∈𝕊} Σ_{k=1}^N √((μ)_ω (μ)_{ω̂^k}) (f(ω̂^k) − f(ω))².

(b) Let λ be the uniform probability vector on 𝕊. That is, (λ)_ω = 2^{−N} for each ω ∈ 𝕊. Show that

    ℰ^λ(f,f) ≤ (1/m_μ) ℰ^μ(f,f), where m_μ ≡ 2^N min_{ω∈𝕊} (μ)_ω.

(c) For each S ⊆ {1,…,N}, define χ_S : 𝕊 → {−1,1} so that χ_S(ω) = Π_{k∈S} ω_k. In particular, χ_∅ ≡ 1. Show that {χ_S : S ⊆ {1,…,N}} is an orthonormal basis in L²(λ) and that Q^λ χ_S = −2(#S)χ_S, and conclude from these that Var_λ(f) ≤ ½ ℰ^λ(f,f).
(d) By combining (b) with (c), show that β_μ Var_μ(f) ≤ ℰ^μ(f,f), where β_μ = 2m_μ/M_μ and M_μ ≡ 2^N max_{ω∈𝕊} (μ)_ω. In particular, if {P_t^μ : t ≥ 0} is the semigroup of transition probability matrices determined by Q^μ, conclude that ‖P_t^μ f − ⟨f⟩_μ‖_{2,μ} ≤ e^{−β_μ t} ‖f − ⟨f⟩_μ‖_{2,μ}.

EXERCISE 5.6.15. Refer to parts (b) and (c) in Exercise 5.6.14. It is somewhat surprising that the spectral gap for the uniform probability λ is 2, independent of N. In particular, this means that if {P^{(N)}(t) : t ≥ 0} is the semigroup determined by the Q-matrix

    (Q^{(N)})_{ωη} = 1   if η = ω̂^k for some 1 ≤ k ≤ N,
    (Q^{(N)})_{ωη} = −N  if η = ω,
    (Q^{(N)})_{ωη} = 0   otherwise

on {−1,1}^N, then ‖P^{(N)}(t)f − ⟨f⟩_{λ^{(N)}}‖_{2,λ^{(N)}} ≤ e^{−2t} ‖f‖_{2,λ^{(N)}} for t ≥ 0 and f ∈ L²(λ^{(N)}), where λ^{(N)} is the uniform probability measure on {−1,1}^N. In fact, the situation here provides convincing evidence that the theory developed in this chapter works in situations where Doeblin's theory is doomed to failure. Indeed, the purpose of this exercise is to prove that, for any t > 0 and ω ∈ {−1,1}^N, lim_{N→∞} ‖μ^{(N)}(t,ω) − λ^{(N)}‖_v = 2 when (μ^{(N)}(t,ω))_η = (P^{(N)}(t))_{ωη}.
(a) Begin by showing that ‖μ^{(N)}(t,ω) − λ^{(N)}‖_v is independent of ω.
(b) Show that for any two probability vectors ν and ν′ on a countable space 𝕊, 2 ≥ ‖ν − ν′‖_v ≥ 2|ν(A) − ν′(A)| for all A ⊆ 𝕊.
(c) For 1 ≤ k ≤ N, define the random variable X_k on {−1,1}^N so that X_k(η) = η_k if η = (η₁,…,η_N). Show that, under λ^{(N)}, the X_k's are mutually independent, {−1,1}-valued Bernoulli random variables with expectation value 0. Next, let t > 0 be given, and set μ^{(N)} = μ^{(N)}(t,ω), where ω is the element of {−1,1}^N whose coordinates are all −1. Show that, under μ^{(N)}, the X_k's are mutually independent, {−1,1}-valued Bernoulli random variables with expectation value −e^{−2t}.
(d) Continuing in the setting of (c), let A^{(N)} be the set of η for which N^{−1} Σ_{k=1}^N X_k(η) ≤ −½e^{−2t}, and show that μ^{(N)}(A^{(N)}) − λ^{(N)}(A^{(N)}) ≥ 1 − 8e^{4t}/N. In fact, by using the sort of estimate developed at the end of §1.2.4, especially (1.2.16), one can sharpen this and get μ^{(N)}(A^{(N)}) − λ^{(N)}(A^{(N)}) ≥ 1 − 2 exp(−N e^{−4t}/8).
Hint: Use the usual Chebychev estimate with which the Weak Law is proved.
(e) By combining the preceding, conclude that ‖μ^{(N)}(t,ω) − λ^{(N)}‖_v ≥ 2(1 − 8e^{4t}/N) for all t > 0, N ∈ ℤ⁺, and ω ∈ {−1,1}^N.
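The divergence asserted in Exercise 5.6.15 can be watched happen for moderate N: granting the product structure derived in part (c) (under μ^{(N)}(t,ω) with ω = (−1,…,−1), the spins are independent with mean −e^{−2t}), the variation distance to the uniform measure is an explicit finite sum. A sketch, with function and variable names of our own choosing:

```python
import math
from itertools import product

def tv_to_uniform(N, t):
    # Under mu^(N)(t, omega) with omega = (-1,...,-1), the spins are independent
    # with P(eta_k = +1) = (1 - e^{-2t})/2, so mu^(N) is a product measure.
    p_plus = (1.0 - math.exp(-2.0 * t)) / 2.0
    tv = 0.0
    for eta in product([-1, 1], repeat=N):
        k = sum(1 for s in eta if s == 1)          # number of +1 spins
        mu = (p_plus ** k) * ((1.0 - p_plus) ** (N - k))
        tv += abs(mu - 2.0 ** (-N))                # ||mu - lambda||_v = sum |mu - 2^{-N}|
    return tv

t = 1.0
dists = [tv_to_uniform(N, t) for N in (2, 6, 10, 14)]
# the distance grows toward 2 as N grows, even though the spectral gap is 2 for all N
assert all(a <= b + 1e-12 for a, b in zip(dists, dists[1:]))
assert dists[0] < dists[-1] < 2.0
```

For fixed t the computed distances increase with N; this is precisely the sense in which Doeblin's theory is doomed to failure here while the spectral-gap estimate remains uniformly good.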
CHAPTER 6
Some Mild Measure Theory
On Easter 1933, A.N. Kolmogorov published Foundations of Probability, a book which set out the foundations on which most of probability theory has rested ever since. Because Kolmogorov's model is given in terms of Lebesgue's theory of measures and integration, its full appreciation requires a thorough understanding of that theory. Thus, although it is far too sketchy to provide anything approaching such an understanding, this chapter is an attempt to provide an introduction to Lebesgue's ideas and to Kolmogorov's application of them in his model of probability theory.

6.1 A Description of Lebesgue's Measure Theory
In this section, we will introduce the terminology used in Lebesgue's theory. However, we will systematically avoid giving rigorous proofs of any hard results. There are many places in which these proofs can be found, one of them being [8].

6.1.1. Measure Spaces: The essential components in measure theory are a set Ω, the space; a collection ℱ of subsets of Ω, the collection of measurable subsets; and a function μ from ℱ into [0,∞], called the measure. Being a space on which a measure might exist, the pair (Ω,ℱ) is called a measurable space, and when a measurable space (Ω,ℱ) comes equipped with a measure μ, the triple (Ω,ℱ,μ) is called a measure space. In order to avoid stupid trivialities, we will always assume that the space Ω is non-empty. Also, we will assume that the collection ℱ of measurable sets forms a σ-algebra over Ω:

    Ω ∈ ℱ,   A ∈ ℱ ⟹ A∁ ≡ Ω∖A ∈ ℱ,   and   {A_n}₁^∞ ⊆ ℱ ⟹ ⋃_{n=1}^∞ A_n ∈ ℱ.

It is important to emphasize that, as distinguished from point-set topology (i.e., the description of open and closed sets), only finite or countable set-theoretic operations are permitted in measure theory.
Using elementary set-theoretic manipulations, it is easy to check that

    A, B ∈ ℱ ⟹ A∩B ∈ ℱ and B∖A ∈ ℱ,   and   {A_n}₁^∞ ⊆ ℱ ⟹ ⋂_{n=1}^∞ A_n ∈ ℱ.
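On a finite Ω these closure operations can be carried out mechanically: starting from any collection of sets, repeatedly adjoining complements and unions produces the smallest σ-algebra containing it (the σ(𝒞) of §6.1.3 below). A sketch, with names of our own choosing:

```python
def generate_sigma_algebra(omega, C):
    """Close a collection C of subsets of a finite set omega under
    complements and pairwise unions (which suffices when omega is finite)."""
    omega = frozenset(omega)
    F = {frozenset(), omega} | {frozenset(A) for A in C}
    changed = True
    while changed:
        changed = False
        for A in list(F):
            if omega - A not in F:            # closure under complements
                F.add(omega - A); changed = True
        for A in list(F):
            for B in list(F):
                if A | B not in F:            # closure under (finite) unions
                    F.add(A | B); changed = True
    return F

F = generate_sigma_algebra({1, 2, 3, 4}, [{1}, {1, 2}])
# the atoms of sigma({ {1}, {1,2} }) over {1,2,3,4} are {1}, {2}, {3,4}
assert frozenset({2}) in F and frozenset({3, 4}) in F
assert len(F) == 8  # the 2^3 unions of the three atoms
```

Intersections and differences then come for free via De Morgan's laws, exactly as in the display above.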
Finally, the measure μ will be a function which assigns¹ 0 to ∅ and is countably additive in the sense that

(6.1.1)    {A_n}₁^∞ ⊆ ℱ and A_m∩A_n = ∅ when m ≠ n ⟹ μ(⋃_{n=1}^∞ A_n) = Σ_{n=1}^∞ μ(A_n).

¹ In view of additivity, it is clear that either μ(∅) = 0 or μ(A) = ∞ for all A ∈ ℱ. Indeed, by additivity, μ(∅) = μ(∅∪∅) = 2μ(∅), and therefore μ(∅) is either 0 or ∞. Moreover, if μ(∅) = ∞, then μ(A) = μ(A∪∅) = μ(A) + μ(∅) = ∞ for all A ∈ ℱ.

In particular, for A, B ∈ ℱ,

(6.1.2)    A ⊆ B ⟹ μ(B) = μ(A) + μ(B∖A) ≥ μ(A)
           μ(A∩B) < ∞ ⟹ μ(A∪B) = μ(A) + μ(B) − μ(A∩B).

The first line comes from writing B as the union of the disjoint sets A and B∖A, and the second line comes from writing A∪B as the union of the disjoint sets A and B∖(A∩B) and then applying the first line to B and A∩B. The finiteness condition is needed when one moves the term μ(A∩B) to the left-hand side in μ(B) = μ(A∩B) + μ(B∖(A∩B)). That is, one wants to avoid having to subtract ∞ from ∞.

When the set Ω is finite or countable, there is no problem constructing measure spaces. Namely, one can take ℱ = {A : A ⊆ Ω}, the set of all subsets of Ω, and make any assignment of ω ∈ Ω ↦ μ({ω}) ∈ [0,∞], at which point countable additivity demands that we take

    μ(A) = Σ_{ω∈A} μ({ω}) for A ⊆ Ω.

However, when Ω is uncountable, it is far from obvious that interesting measures can be constructed on a non-trivial collection of measurable sets. Indeed, it is reasonable to think that Lebesgue's most significant achievement was his construction of a measure space in which Ω = ℝ, ℱ is a σ-algebra of which every interval (open, closed, or semi-closed) is an element, and μ assigns each interval its length (i.e., μ(I) = b − a if I is an interval whose right and left end points are b and a).

Although the terminology is misleading, a measure space (Ω,ℱ,μ) is said to be finite if μ(Ω) < ∞. That is, the "finiteness" here is not determined by
the size of Ω but instead by how large Ω looks to μ. Even if a measure space is not finite, it may be decomposable into a countable number of finite pieces, in which case it is said to be σ-finite. Equivalently, (Ω,ℱ,μ) is σ-finite if there exists {Ω_n}₁^∞ ⊆ ℱ such that Ω_n ↗ Ω² and μ(Ω_n) < ∞ for each n ≥ 1. Thus, for example, both Lebesgue's measure on ℝ and counting measure on ℤ are σ-finite but not finite.

6.1.2. Some Consequences of Countable Additivity: Countable additivity is the sine qua non in this subject. In particular, it leads to the following continuity properties of measures:

(6.1.3)    {A_n}₁^∞ ⊆ ℱ and A_n ↗ A ⟹ μ(A_n) ↗ μ(A), and
           {A_n}₁^∞ ⊆ ℱ, μ(A₁) < ∞, and A_n ↘ A ⟹ μ(A_n) ↘ μ(A).

² We write A_n ↗ A when A_n ⊆ A_{n+1} for all n ≥ 1 and A = ⋃₁^∞ A_n. Similarly, A_n ↘ A means that A_n ⊇ A_{n+1} for all n ≥ 1 and A = ⋂₁^∞ A_n. Obviously, A_n ↗ A if and only if A_n∁ ↘ A∁.

Although we do not intend to prove many of the results discussed in this chapter, the proofs of those in (6.1.3) are too basic and easy to omit. Indeed, to prove the first line, simply take B₁ = A₁ and B_{n+1} = A_{n+1}∖A_n. Then B_n ∈ ℱ for all n ≥ 1, B_m∩B_n = ∅ when m ≠ n, ⋃_{m=1}^n B_m = A_n, and ⋃_{n=1}^∞ B_n = A. Hence, by (6.1.1),

    μ(A) = Σ_{n=1}^∞ μ(B_n) = lim_{n→∞} Σ_{m=1}^n μ(B_m) = lim_{n→∞} μ(A_n).

To prove the second line, begin by noting that A₁∖A_n ↗ A₁∖A, and therefore μ(A₁) − μ(A_n) = μ(A₁∖A_n) ↗ μ(A₁∖A) = μ(A₁) − μ(A). Hence, since μ(A₁) < ∞, μ(A_n) ↘ μ(A). Just as in the proof of the second line in (6.1.2), we need the finiteness condition here to avoid being forced to subtract ∞ from ∞.

Another important consequence of countable additivity is countable subadditivity:

(6.1.4)    {A_n}₁^∞ ⊆ ℱ ⟹ μ(⋃_{n=1}^∞ A_n) ≤ Σ_{n=1}^∞ μ(A_n).

Like the preceding, this is easy. Namely, if B₁ = A₁ and B_{n+1} = A_{n+1}∖⋃_{m=1}^n A_m, then μ(B_n) ≤ μ(A_n) and

    μ(⋃_{n=1}^∞ A_n) = μ(⋃_{n=1}^∞ B_n) = Σ_{n=1}^∞ μ(B_n) ≤ Σ_{n=1}^∞ μ(A_n).

A particularly important consequence of (6.1.4) is the fact that

(6.1.5)    μ(⋃_{n=1}^∞ A_n) = 0 if μ(A_n) = 0 for each n ≥ 1.
That is, the countable union of sets each of which has measure 0 is again a set having measure 0. Here one begins to see the reason for restricting oneself to countable operations in measure theory. Namely, it is certainly not true that the uncountable union of sets having measure 0 will necessarily have measure 0. For example, in the case of Lebesgue's measure on ℝ, the second line of (6.1.3) implies

    μ({x}) = lim_{δ↘0} μ((x−δ, x+δ)) = lim_{δ↘0} 2δ = 0

for each point x ∈ ℝ, and yet (0,1) = ⋃_{x∈(0,1)} {x} has measure 1.

6.1.3. Generating σ-Algebras: Very often one wants to make sure that a certain collection of subsets will be among the measurable subsets, and for this reason it is important to know the following constructions. First, suppose that 𝒞 is a collection of subsets of Ω. Then there is a smallest σ-algebra σ(𝒞), called the σ-algebra generated by 𝒞, over Ω which contains 𝒞. Namely, consider the collection of all the σ-algebras over Ω which contain 𝒞. This collection is non-empty because {A : A ⊆ Ω} is an element. In addition, as is easily verified, the intersection of any collection of σ-algebras is again a σ-algebra. Hence, σ(𝒞) is the intersection of all the σ-algebras which contain 𝒞. When Ω is a topological space and 𝒞 is the collection of all open subsets of Ω, then σ(𝒞) is called the Borel σ-algebra and is denoted by B_Ω.

One of the most important reasons for knowing how a σ-algebra is generated is that one can often check properties of measures on σ(𝒞) by making sure that the property holds on 𝒞. An important example of such a result is the one in the following uniqueness theorem.

6.1.6 THEOREM. Suppose that (Ω,ℱ) is a measurable space and that 𝒞 ⊆ ℱ includes Ω and is closed under intersection (i.e., A∩B ∈ 𝒞 whenever A, B ∈ 𝒞). If μ and ν are a pair of finite measures on (Ω,ℱ) and if μ(A) = ν(A) for each A ∈ 𝒞, then μ(A) = ν(A) for all A ∈ σ(𝒞).

PROOF: We will say that 𝒮 ⊆ ℱ is good if

(i) A, B ∈ 𝒮 and A ⊆ B ⟹ B∖A ∈ 𝒮,
(ii) A, B ∈ 𝒮 and A∩B = ∅ ⟹ A∪B ∈ 𝒮,
(iii) {A_n}₁^∞ ⊆ 𝒮 and A_n ↗ A ⟹ A ∈ 𝒮.

Notice that if 𝒮 is good, Ω ∈ 𝒮, and A, B ∈ 𝒮 ⟹ A∩B ∈ 𝒮, then 𝒮 is a σ-algebra. Indeed, because of (i) and (iii), all that one has to do is check that A∪B ∈ 𝒮 whenever A, B ∈ 𝒮. But, because A∪B = (A∖(A∩B))∪B, this is clear from (i), (ii), and the fact that 𝒮 is closed under intersections. In addition, observe that if 𝒮 is good, then, for any 𝒟 ⊆ ℱ, 𝒮′ = {A ∈ 𝒮 : A∩B ∈ 𝒮 for all B ∈ 𝒟} is again good.
Now set ℬ = {A ∈ ℱ : μ(A) = ν(A)}. From the properties of finite measures, in particular (6.1.2) and (6.1.3), it is easy to check that ℬ is good. Moreover, by assumption, 𝒞 ⊆ ℬ. Thus, if ℬ′ = {A ∈ ℬ : A∩C ∈ ℬ for all C ∈ 𝒞}, then, by the preceding observation, ℬ′ is again good. In addition, because 𝒞 is closed under intersection, 𝒞 ⊆ ℬ′. Similarly, ℬ″ = {A ∈ ℬ′ : A∩B ∈ ℬ′ for all B ∈ ℬ′} is also good, and, by the definition of ℬ′, 𝒞 ⊆ ℬ″. Finally, if A, A′ ∈ ℬ″ and B ∈ ℬ′, then (A∩A′)∩B = A∩(A′∩B) ∈ ℬ′, and so A∩A′ ∈ ℬ″. Hence, ℬ″ is a σ-algebra which contains 𝒞, ℬ″ ⊆ ℬ, and therefore μ equals ν on σ(𝒞). □

6.1.4. Measurable Functions: Given a pair (Ω₁,ℱ₁) and (Ω₂,ℱ₂) of measurable spaces, we will say that the map F : Ω₁ → Ω₂ is measurable if the inverse images of sets in ℱ₂ are elements of ℱ₁: F⁻¹(Γ) ∈ ℱ₁ for every Γ ∈ ℱ₂.³ Notice that if ℱ₂ = σ(𝒞), F is measurable if F⁻¹(C) ∈ ℱ₁ for each C ∈ 𝒞. In particular, if Ω₁ and Ω₂ are topological spaces and ℱᵢ = B_{Ωᵢ}, then every continuous map from Ω₁ to Ω₂ is measurable.

It is important to know that when Ω₂ = ℝ and ℱ₂ = B_ℝ, measurability is preserved under sequential limit operations. To be precise, if {f_n}₁^∞ is a sequence of ℝ-valued measurable functions from (Ω,ℱ) to (ℝ,B_ℝ), then

(6.1.7)    ω ↦ sup_n f_n(ω) and ω ↦ inf_n f_n(ω),
           ω ↦ lim sup_{n→∞} f_n(ω) and ω ↦ lim inf_{n→∞} f_n(ω)

are measurable. For example, the first of these can be proved by the following line of reasoning. Begin with the observation that B_ℝ = σ(𝒞) when 𝒞 = {(a,∞) : a ∈ ℝ}. Hence f : Ω → ℝ will be measurable if and only if⁴ {f > a} ∈ ℱ for each a ∈ ℝ, and so the first function in (6.1.7) is measurable because

    {sup_n f_n > a} = ⋃_n {f_n > a} for every a ∈ ℝ.

As a consequence of the second line in (6.1.7), we know that the set Δ of points ω at which the limit lim_{n→∞} f_n(ω) exists is measurable, and ω ↦ lim_{n→∞} f_n(ω) is measurable if Δ = Ω.

Finally, measurable functions give rise to important instances of the construction in the preceding subsection. Namely, suppose that the space Ω₁ and the measurable space (Ω₂,ℱ₂) are given, and let 𝔉 be some collection of maps from Ω₁ into Ω₂. Then the σ-algebra σ(𝔉) over Ω₁ generated by 𝔉 is the smallest σ-algebra over Ω₁ with respect to which every element of 𝔉 is measurable. Equivalently, σ(𝔉) = σ(𝒞) when 𝒞 = {F⁻¹(Γ) : F ∈ 𝔉 & Γ ∈ ℱ₂}.

³ The reader should notice the striking similarity between this definition and the one for continuity in terms of inverse images of open sets.
⁴ When there is no ambiguity caused by doing so, we use {F ∈ Γ} to stand for {ω : F(ω) ∈ Γ}.
Finally, suppose that ℱ₂ = σ(𝒞₂), where 𝒞₂ ⊆ ℱ₂ contains Ω₂ and is closed under intersection (i.e., A∩B ∈ 𝒞₂ if A, B ∈ 𝒞₂), and let 𝒞₁ be the collection of sets of the form F₁⁻¹(A₁)∩⋯∩F_n⁻¹(A_n) for n ∈ ℤ⁺, {F₁,…,F_n} ⊆ 𝔉, and {A₁,…,A_n} ⊆ 𝒞₂. Then σ(𝔉) = σ(𝒞₁), Ω₁ ∈ 𝒞₁, and 𝒞₁ is closed under intersection. In particular, by Theorem 6.1.6, if μ and ν are a pair of finite measures on (Ω₁,ℱ₁) and

    μ({ω₁ : F₁(ω₁) ∈ A₁, …, F_n(ω₁) ∈ A_n}) = ν({ω₁ : F₁(ω₁) ∈ A₁, …, F_n(ω₁) ∈ A_n})

for all n ∈ ℤ⁺, {F₁,…,F_n} ⊆ 𝔉, and {A₁,…,A_n} ⊆ 𝒞₂, then μ equals ν on σ(𝔉).
6.1.5. Lebesgue Integration: Given a measure space (Ω,ℱ,μ), Lebesgue's theory of integration begins by giving a prescription for defining the integral with respect to μ of all non-negative, measurable functions (i.e., all measurable maps f from (Ω,ℱ) into⁵ ([0,∞], B_{[0,∞]})). Namely, his theory says that when 1_A is the indicator function of a set A ∈ ℱ and a ∈ [0,∞), then the integral of the function⁶ a1_A should be equal to aμ(A). He then insists that the integral should be additive in the sense that the integral of f₁ + f₂ should be the sum of the integrals of f₁ and f₂. In particular, this means that if f is a non-negative, measurable function which is simple, in the sense that it takes on only a finite number of values, then the integral ∫f dμ of f must be

    Σ_{x∈[0,∞)} x μ(f⁻¹({x})),

where, because f⁻¹({x}) = ∅ for all but a finite number of x, the sum involves only a finite number of non-zero terms. Of course, before one can insist on this additivity of the integral, one is obliged to check that additivity is consistent. Specifically, it is necessary to show that Σ_{m=1}^M a_m μ(A_m) = Σ_{m=1}^{M′} a′_m μ(A′_m) when Σ_{m=1}^M a_m 1_{A_m} = f = Σ_{m=1}^{M′} a′_m 1_{A′_m}. However, once this consistency has been established, one knows how to define the integral of any non-negative, measurable, simple function in such a way that the integral is additive and gives the obvious answer for indicator functions. In particular, additivity implies monotonicity: ∫f dμ ≤ ∫g dμ when f ≤ g.

To complete Lebesgue's program for non-negative functions, one has first to observe that if f : Ω → [0,∞] is measurable, then there exists a sequence {φ_n}₁^∞ of non-negative, measurable, simple functions with the property that φ_n(ω) ↗ f(ω) for each ω ∈ Ω. For example, one can take

    φ_n = Σ_{m=1}^{4ⁿ} m2⁻ⁿ 1_{A_{m,n}}, where A_{m,n} = {ω : m2⁻ⁿ ≤ f(ω) < (m+1)2⁻ⁿ}.

⁵ In this context, we are thinking of [0,∞] as the compact metric space obtained by mapping [0,1] onto [0,∞] via the map t ∈ [0,1] ↦ tan(πt/2).
⁶ In measure theory, the convention which works best is to take 0·∞ = 0.
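Lebesgue's dyadic approximation is simple to implement; the sketch below (our own illustration) checks, on a purely atomic measure, that ∫φ_n dμ increases with n toward Σ_ω f(ω)μ({ω}):

```python
def phi_n(x, n):
    # Lebesgue's simple approximation: phi_n = m 2^{-n} on {m 2^{-n} <= f < (m+1) 2^{-n}},
    # with 1 <= m <= 4^n, so values above 2^n are cut off at 2^n
    m = int(x * 2 ** n)            # largest m with m 2^{-n} <= x
    return min(m, 4 ** n) / 2 ** n

# a finite measure space: Omega = {0,1,2,3} with atomic weights mu
mu = {0: 0.5, 1: 1.0, 2: 0.25, 3: 2.0}
f = {0: 0.3, 1: 1.7, 2: 2.4, 3: 0.0}

exact = sum(f[w] * mu[w] for w in mu)
approx = [sum(phi_n(f[w], n) * mu[w] for w in mu) for n in range(1, 12)]
assert all(a <= b + 1e-15 for a, b in zip(approx, approx[1:]))  # monotone in n
assert abs(approx[-1] - exact) < 1e-2                            # increasing to the integral
```

The per-point error is at most 2⁻ⁿ, so the integrals converge at that rate times the total mass.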
Given {φ_n}₁^∞ accordingly, one can show that lim_{n→∞} ∫φ_n dμ exists in [0,∞] and is the same for every such approximating sequence, and so one can complete the definition for non-negative f by taking ∫f dμ = lim_{n→∞} ∫φ_n dμ. Finally, for a general measurable f with values in (−∞,∞], one introduces f⁺ = f∨0 and f⁻ = −(f∧0), says that f is integrable when ∫|f| dμ < ∞, and observes that for integrable f

(6.1.8)    ∫f dμ ≡ ∫f⁺ dμ − ∫f⁻ dμ

is well defined.

6.1.6. Stability Properties of Lebesgue Integration: The power of Lebesgue's theory derives from the stability of the integral it defines, and its
stability is made manifest in the following three famous theorems. Throughout, (Ω,ℱ,μ) is a measure space to which all references about measurability and integration refer. Also, the functions here are assumed to take their values in (−∞,∞].

6.1.9 THEOREM. (Monotone Convergence) Suppose that {f_n}₁^∞ is a non-decreasing sequence of measurable functions, and, for each ω ∈ Ω, set f(ω) = lim_{n→∞} f_n(ω). If there exists a fixed integrable function g which is dominated by each f_n, then ∫f_n dμ ↗ ∫f dμ. If, instead, {f_n}₁^∞ is non-increasing and if there exists a fixed integrable function g which dominates each f_n, then ∫f_n dμ ↘ ∫f dμ.

6.1.10 THEOREM. (Fatou's Lemma) Given any sequence {f_n}₁^∞ of measurable functions, all of which dominate some fixed integrable function g,

    ∫ lim inf_{n→∞} f_n dμ ≤ lim inf_{n→∞} ∫ f_n dμ.

If, instead, there is some fixed integrable function g which dominates all of the f_n's, then

    lim sup_{n→∞} ∫ f_n dμ ≤ ∫ lim sup_{n→∞} f_n dμ.

6.1.11 THEOREM. (Lebesgue's Dominated Convergence) Suppose that {f_n}₁^∞ is a sequence of measurable functions, and assume that there is a fixed integrable function g with the property that μ({ω : |f_n(ω)| > g(ω)}) = 0 for each n ≥ 1. Further, assume that there exists a measurable function f to which {f_n}₁^∞ converges in either one of the following two senses:

(a) μ({ω : f(ω) ≠ lim_{n→∞} f_n(ω)}) = 0;
(b) lim_{n→∞} μ({ω : |f_n(ω) − f(ω)| ≥ ε}) = 0 for all ε > 0.

Then

    lim_{n→∞} ∫ |f_n − f| dμ = 0, and therefore ∫ f_n dμ → ∫ f dμ.

Before proceeding, a word should be said about the conditions in Lebesgue's Dominated Convergence Theorem. Because integrals cannot "see" what is happening on a set of measure 0 (i.e., an A ∈ ℱ for which μ(A) = 0), it is natural that conditions which guarantee certain behavior of integrals should be conditions which need only hold on the complement of a set of measure 0. Thus, condition (a) says that f_n(ω) → f(ω) for all ω outside of a set of measure 0. In the jargon, conditions which hold off of a set of measure 0 are said to hold almost everywhere. In this terminology, the first hypothesis is that |f_n| ≤ g almost everywhere and (a) is saying that {f_n}₁^∞ converges to f almost everywhere, and these statements would be abbreviated
in the literature by something like f_n ≤ g a.e. and f_n → f a.e.

The condition in (b) is related to, but significantly different from, the one in (a). In particular, it does not guarantee that {f_n(ω)}₁^∞ converges for any ω ∈ Ω. For example, take μ to be the measure of Lebesgue on ℝ described above, and take f_{m+2ⁿ} = 1_{[m2⁻ⁿ,(m+1)2⁻ⁿ)} for n ≥ 0 and 0 ≤ m < 2ⁿ. Then μ({ω : f_{m+2ⁿ}(ω) ≠ 0}) = 2⁻ⁿ, and so lim_{n→∞} μ({ω : |f_n(ω)| ≥ ε}) = 0 for all ε > 0. On the other hand, for each ω ∈ [0,1), lim sup_{n→∞} f_n(ω) = 1 but lim inf_{n→∞} f_n(ω) = 0. Thus, (b) most definitely does not imply (a). Conversely, although (a) implies (b) when μ(Ω) < ∞, (a) does not imply (b) when μ(Ω) = ∞. To wit, again when μ is Lebesgue's measure, consider f_n = 1_{ℝ∖[−n,n]}.

In connection with the preceding discussion, there is a basic estimate, known as Markov's inequality, which plays a central role in all measure-theoretic analysis. Namely, because, for any λ > 0, λ1_{[λ,∞]}∘f ≤ f1_{[λ,∞]}∘f ≤ |f|,

(6.1.12)    μ({ω : f(ω) ≥ λ}) ≤ (1/λ) ∫_{{ω : f(ω)≥λ}} f dμ ≤ (1/λ) ∫ |f| dμ.

In particular, this leads to the conclusion that

    lim_{n→∞} ∫ |f_n − f| dμ = 0 ⟹ lim_{n→∞} μ({ω : |f_n(ω) − f(ω)| ≥ ε}) = 0

for all ε > 0. That is, the condition (b) is necessary for the conclusion in Lebesgue's theorem. In addition, (6.1.12) proves that ∫|f| dμ = 0 ⟹ μ({ω : |f(ω)| ≥ ε}) = 0 for all ε > 0, and therefore, by (6.1.3),

(6.1.13)    ∫ |f| dμ = 0 ⟹ μ({ω : f(ω) ≠ 0}) = 0.
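Markov's inequality (6.1.12) can be sanity-checked on an atomic measure; in this sketch (names ours), μ is a finite weighted measure on four points:

```python
def integral(f, mu):
    # Lebesgue integral of f against a purely atomic measure mu
    return sum(f(w) * p for w, p in mu.items())

mu = {0: 0.2, 1: 0.5, 2: 1.3, 3: 0.1}   # an atomic measure (not necessarily a probability)
f = lambda w: float(w * w)               # a non-negative measurable function

for lam in (0.5, 1.0, 2.0, 4.0):
    tail = sum(p for w, p in mu.items() if f(w) >= lam)      # mu({f >= lam})
    tail_int = sum(f(w) * p for w, p in mu.items() if f(w) >= lam)
    # Markov: lam * mu({f >= lam}) <= integral over {f >= lam} <= integral of |f|
    assert lam * tail <= tail_int + 1e-12 <= integral(f, mu) + 1e-12
```

Both inequalities in (6.1.12) are checked, not just the outer one.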
Finally, the role of the Lebesgue dominant g is made clear by considering {f_n}₁^∞ when either f_n = n1_{(0,1/n]} or f_n = 1_{[n−1,n)} and μ is Lebesgue's measure.

6.1.7. Lebesgue Integration in Countable Spaces: In this subsection we will see what Lebesgue's theory looks like in the relatively trivial case when Ω is countable, ℱ is the collection of all subsets of Ω, and the measure μ is σ-finite. As we pointed out in §6.1.1, specifying a measure on (Ω,ℱ) is tantamount to assigning a non-negative number to each element of Ω, and, because we want our measures to be σ-finite, no element is assigned ∞. Because the elements of Ω can be counted, there is no reason not to count them. Thus, in the case when Ω is finite, there is no loss in generality if we write Ω = {1,…,N}, where N = #Ω is the number of elements in Ω, and, similarly, when Ω is countably infinite, we might as well, at least for abstract purposes, think of it as being the set ℤ⁺ of positive integers. In fact, in order to
avoid having to worry about the finite and countably infinite cases separately, we will embed the finite case into the infinite one by simply noting that the theory for {1,…,N} is exactly the same as the theory for ℤ⁺ restricted to measures for which μ({ω}) = 0 when ω > N. Finally, in order to make the notation here conform with the notation in the rest of the book, we will use 𝕊 in place of Ω, use i, j, or k to denote generic elements of 𝕊, and identify a measure μ with the row vector μ ∈ [0,∞)^𝕊 given by (μ)_i = μ({i}).

The first thing to observe is that

(6.1.14)    ∫ f dμ = Σ_{i∈𝕊} f(i)(μ)_i

whenever either ∫f⁺ dμ or ∫f⁻ dμ is finite. Indeed, if φ is simple, a₁,…,a_L are the distinct values φ takes, and A_ℓ = {i : φ(i) = a_ℓ}, then

    ∫ φ dμ = Σ_{ℓ=1}^L a_ℓ μ(A_ℓ) = Σ_{ℓ=1}^L a_ℓ Σ_{i∈A_ℓ} (μ)_i = Σ_{ℓ=1}^L Σ_{i∈A_ℓ} φ(i)(μ)_i = Σ_{i∈𝕊} φ(i)(μ)_i.

Second, given any f ≥ 0 on 𝕊, set φ_n(i) = f(i) if 1 ≤ i ≤ n and φ_n(i) = 0 if i > n. Then

    ∫ f dμ = lim_{n→∞} ∫ φ_n dμ = lim_{n→∞} Σ_{1≤i≤n} f(i)(μ)_i = Σ_{i∈𝕊} f(i)(μ)_i.

Finally, if either ∫f⁺ dμ or ∫f⁻ dμ is finite, then it is clear that

    ∫ f dμ = ∫ f⁺ dμ − ∫ f⁻ dμ = Σ_{{i : f(i)≥0}} f(i)(μ)_i + Σ_{{i : f(i)<0}} f(i)(μ)_i = Σ_{i∈𝕊} f(i)(μ)_i.
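In the countable setting, (6.1.14) can be verified directly; the sketch below (names ours) computes ∫f dμ on a truncation of ℤ⁺ three ways — over the whole state space, via f⁺ − f⁻, and as the limit of the truncations φ_n used above:

```python
# mu_i = 2^{-i} on a finite truncation {1, ..., M} of Z^+
M = 30
mu = {i: 2.0 ** (-i) for i in range(1, M + 1)}
f = {i: (-1) ** i * i for i in range(1, M + 1)}   # an integrable, sign-changing f

# (6.1.14): the integral is the weighted sum over the state space
direct = sum(f[i] * mu[i] for i in mu)

# via f = f^+ - f^-
plus = sum(f[i] * mu[i] for i in mu if f[i] >= 0)
minus = -sum(f[i] * mu[i] for i in mu if f[i] < 0)
assert abs((plus - minus) - direct) < 1e-12

# via the truncations phi_n(i) = f(i) for i <= n, 0 otherwise
trunc = [sum(f[i] * mu[i] for i in range(1, n + 1)) for n in range(1, M + 1)]
assert abs(trunc[-1] - direct) < 1e-12
```

All three computations are finite rearrangements of the same absolutely convergent sum, which is why they must agree.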
We next want to see what the "big three" look like in this context.

The Monotone Convergence Theorem: First observe that it suffices to treat the case when 0 ≤ f_n ↗ f. Indeed, one can reduce each of these statements to that case by replacing f_n with f_n − g or g − f_n. When 0 ≤ f_n ↗ f, it is obvious that

    Σ_{i∈𝕊} f_n(i)(μ)_i ≤ Σ_{i∈𝕊} f_{n+1}(i)(μ)_i ≤ Σ_{i∈𝕊} f(i)(μ)_i.

Thus, all that remains is to note that

    lim_{n→∞} Σ_{i∈𝕊} f_n(i)(μ)_i ≥ lim_{n→∞} Σ_{i=1}^L f_n(i)(μ)_i = Σ_{i=1}^L f(i)(μ)_i

for each L ∈ ℤ⁺, and therefore that the desired result follows after one lets L ↗ ∞.
155
Fatou 's Lemma: Again one can reduce to the case when fn 2 0 and the limit being taken is the limit inferior. But in this case, lim L fn ( i) ( JL ) i = mlimoo ninfm L fn (i) (JL )i -> ?: n -> oo iE § i E§
where the last equality follows from the Monotone Convergence Theorem ap plied to 0 ::=; infn?:m fn / limn _, oo fn ·
Lebesgue 's Dominated Convergence Theorem: First note that, if we elimi nate those i E § for which (JL ) i = 0, none of the conclusions change. Thus, we will, from now on assume that (JL) i > 0 for all i E §. Next observe that, under this assumption, the hypotheses become supn l fn ( i ) / ::; g ( i ) and fn (i) --+ f (i) for each i E §. In particular, I f / ::; g. Hence, by considering fn - f instead of { fn } r' and replacing g by 2g, we may and will assume that f =.= 0. Now let E > 0 be given, and choose L so that Li >L g(i) (JL) i < E. Then
6 . 1 .8. Fubini's Theorem: Fubini's Theorem deals with products of mea sure spaces. Namely, given measurable spaces (D1 , F1 ) and (D2, F2) , the product of F1 and F2 is the O"-algebra F1 x F2 over D1 x D2 which is gen erated by the set { A1 x Az : A1 E F1 & A2 E F2} of measurable rectangles. An important technical fact about this construction is that, if f is a measur able function on (D1 x Dz, F1 x Fz) , then, for each w1 E D1 and w2 E D2 , Wz-v-+ j (w1 , wz) and w1.,... j(w1 , wz ) are measurable functions on, respectively, (Dz , Fz) and (D1 , Fl ) . In the following statement, it is important to emphasize exactly which vari able is being integrated. For this reason, we use the more detailed notation J0 f(w) p,(dw) instead of the more abbreviated J f dp,.
6.1.15 THEOREM. (Fubini)⁷ Let (Ω₁,ℱ₁,μ₁) and (Ω₂,ℱ₂,μ₂) be a pair of σ-finite measure spaces, and set Ω = Ω₁×Ω₂ and ℱ = ℱ₁×ℱ₂. Then there is a unique measure μ = μ₁×μ₂ on (Ω,ℱ) with the property that μ(A₁×A₂) = μ₁(A₁)μ₂(A₂) for all A₁ ∈ ℱ₁ and A₂ ∈ ℱ₂. Moreover, if f is a non-negative, measurable function on (Ω,ℱ), then both

⁷ Although this theorem is usually attributed to Fubini, it seems that Tonelli deserves, but seldom receives, a good deal of credit for it.
    ω₁ ↦ ∫_{Ω₂} f(ω₁,ω₂) μ₂(dω₂) and ω₂ ↦ ∫_{Ω₁} f(ω₁,ω₂) μ₁(dω₁)

are measurable functions, and

    ∫_{Ω₁} (∫_{Ω₂} f(ω₁,ω₂) μ₂(dω₂)) μ₁(dω₁) = ∫_Ω f dμ = ∫_{Ω₂} (∫_{Ω₁} f(ω₁,ω₂) μ₁(dω₁)) μ₂(dω₂).

Finally, for any integrable function f on (Ω,ℱ,μ),

    A₁ = {ω₁ : ∫_{Ω₂} |f(ω₁,ω₂)| μ₂(dω₂) < ∞} ∈ ℱ₁,   A₂ = {ω₂ : ∫_{Ω₁} |f(ω₁,ω₂)| μ₁(dω₁) < ∞} ∈ ℱ₂,

the functions

    ω₁ ↦ f₁(ω₁) ≡ 1_{A₁}(ω₁) ∫_{Ω₂} f(ω₁,ω₂) μ₂(dω₂) and ω₂ ↦ f₂(ω₂) ≡ 1_{A₂}(ω₂) ∫_{Ω₁} f(ω₁,ω₂) μ₁(dω₁)

are integrable, and

    ∫_{Ω₁} f₁ dμ₁ = ∫_Ω f dμ = ∫_{Ω₂} f₂ dμ₂.
In the case when Ω₁ and Ω₂ are countable, all this becomes very easy. Namely, in the notation which we used in §6.1.7, the measure μ₁×μ₂ corresponds to the row vector μ₁×μ₂ ∈ [0,∞)^{𝕊₁×𝕊₂} given by (μ₁×μ₂)_{(i₁,i₂)} = (μ₁)_{i₁}(μ₂)_{i₂}, and so, by (6.1.14), Fubini's Theorem reduces to the statement that

    Σ_{i₁∈𝕊₁} (Σ_{i₂∈𝕊₂} a_{i₁i₂}) = Σ_{(i₁,i₂)∈𝕊₁×𝕊₂} a_{i₁i₂} = Σ_{i₂∈𝕊₂} (Σ_{i₁∈𝕊₁} a_{i₁i₂})

when {a_{i₁i₂} : (i₁,i₂) ∈ 𝕊₁×𝕊₂} ⊆ (−∞,∞] satisfies either a_{i₁i₂} ≥ 0 for all (i₁,i₂) or Σ_{(i₁,i₂)} |a_{i₁i₂}| < ∞. In proving this, we may and will assume that 𝕊₁ = ℤ⁺ = 𝕊₂ throughout and will start with the case when a_{i₁i₂} ≥ 0. Given any pair (n₁,n₂) ∈ (ℤ⁺)²,

    Σ_{(i₁,i₂)∈𝕊₁×𝕊₂} a_{i₁i₂} ≥ Σ_{i₂=1}^{n₂} Σ_{i₁=1}^{n₁} a_{i₁i₂}.

Hence, by first letting n₁ → ∞ and then letting n₂ → ∞, we arrive at
6 . 2 Modeling Probability Similarly, for any n
E
(i1 ,i2)§1 x §2 i1Vi2$ n
z+ ,
aid2
and so, after letting n < oo ,
I: (i1 ,i2) l ai,i2 i
=
----+
t (t )
i2= l i1=l
ai1i2
:S:
157
(
)
L L ai1i2 , i2ES1 i2ES2
oo , we get the opposite inequality. Next, when
L l aid2 1 i1E§1
< oo
for all
i2 E §2,
and so
L ai,i2 (i,,i2)ES1 xS2
=
a2-:-1 2.2 L a� i2 (i, ,i2)ES1 xS2 ) (i, ,i2 ES1 xS2
(
)
� (�, a� i2 )
i 2
(
)
2.:: 2.:: ai,i2 . 2.:: ai,i2 i2 ES2 i, E§, i2i2$ES2 i1 E§, n Finally, after reversing the roles 1 and 2, we get the relation with the order
= E.� 2.::
of summation reversed.
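As a quick numerical sanity check of the countable-state statement above, the following sketch compares the unordered sum with the two iterated sums for an absolutely summable double array. The array a and its truncation level are hypothetical choices for illustration, not taken from the text.

```python
from itertools import product

# Hypothetical summable double array a_{i1 i2} = (-1)^(i1+i2) / 2^(i1+i2),
# truncated to a finite grid so the three sums can be compared directly.
N = 12
a = {(i1, i2): (-1) ** (i1 + i2) / 2 ** (i1 + i2)
     for i1, i2 in product(range(1, N + 1), repeat=2)}

total = sum(a.values())                                   # unordered sum
by_rows = sum(sum(a[(i1, i2)] for i2 in range(1, N + 1))  # inner sum over i2
              for i1 in range(1, N + 1))
by_cols = sum(sum(a[(i1, i2)] for i1 in range(1, N + 1))  # inner sum over i1
              for i2 in range(1, N + 1))

# Both iterated sums agree with the unordered sum, as Fubini predicts.
assert abs(total - by_rows) < 1e-12 and abs(total - by_cols) < 1e-12
```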
6.2 Modeling Probability

To understand how these considerations relate to probability theory, think about the problem of modeling the tosses of a fair coin. When the game ends after the nth toss, a Kolmogorov model is provided by taking Ω = {0,1}ⁿ, ℱ the set of all subsets of Ω, and setting μ({ω}) = 2⁻ⁿ for each ω ∈ Ω. More generally, any measure space in which Ω has total measure 1 can be thought of as a model of probability, for which reason such a measure space is called a probability space, and the measure μ is called a probability measure and is often denoted by ℙ. In this connection, when dealing with probability spaces, one's intuition is aided by extending the metaphor to other objects. Namely, one calls Ω the sample space, its elements are called sample points, the elements of ℱ are called events, and the number that ℙ assigns an event is called the probability of that event. In addition, a measurable map is called a random variable, it tends to be denoted by X instead of F, and, when it is ℝ-valued, its integral is called its expected value. Moreover, the latter convention is reflected in the use of 𝔼[X], or, when more precision is required, 𝔼^ℙ[X], to denote ∫ X dℙ. Also, 𝔼[X, A] or 𝔼^ℙ[X, A] is used to denote ∫_A X dℙ, the expected value of X on the event A. Finally, the distribution of a random variable whose values lie in the measurable space (E, ℬ) is the probability measure X*ℙ on (E, ℬ) given by (X*ℙ)(B) = ℙ(X ∈ B). In particular, when X is ℝ-valued, its distribution function F_X is defined so that F_X(x) = (X*ℙ)((−∞, x]) = ℙ(X ≤ x). Obviously, F_X is non-decreasing. Moreover, as an application of (6.1.3), one can show that F_X is continuous from the right, in the sense that F_X(x) = lim_{y↘x} F_X(y) for each x ∈ ℝ, that lim_{x→−∞} F_X(x) = 0, and that lim_{x→∞} F_X(x) = 1. At the same time, one sees that ℙ(X < x) = F_X(x−) := lim_{y↗x} F_X(y) and, as a consequence, that ℙ(X = x) = F_X(x) − F_X(x−) is the jump in F_X at x.
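The finite fair-coin model just described can be written out directly. The event "at least three heads" and the random variable "number of heads" below are illustrative choices, not examples from the text.

```python
from itertools import product
from fractions import Fraction

# Kolmogorov model for n tosses of a fair coin: Omega = {0,1}^n,
# F = all subsets of Omega, P({omega}) = 2^(-n).
n = 4
omega_space = list(product((0, 1), repeat=n))
P = {w: Fraction(1, 2 ** n) for w in omega_space}

def prob(event):
    """Probability of an event (a subset of Omega)."""
    return sum(P[w] for w in event)

# Example event: "at least three heads" (counting 1 as heads).
# It contains C(4,3) + C(4,4) = 5 sample points, so its probability is 5/16.
A = {w for w in omega_space if sum(w) >= 3}
assert prob(A) == Fraction(5, 16)

# X = number of heads is a random variable; its distribution function
# F_X(x) = P(X <= x) is non-decreasing, 0 at -infinity, and 1 at +infinity.
def F_X(x):
    return prob({w for w in omega_space if sum(w) <= x})

assert F_X(-1) == 0 and F_X(n) == 1 and F_X(1) <= F_X(2)
```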
6.2.1. Modeling Infinitely Many Tosses of a Fair Coin: Just as in the general theory of measure spaces, the construction of probability spaces presents no analytic (as opposed to combinatorial) problems as long as the sample space is finite or countable. The only change from the general theory is that the assignment of probabilities to the sample points must satisfy the condition that Σ_{ω∈Ω} ℙ({ω}) = 1. The technical analytic problems arise when Ω is uncountable. For example, suppose that, instead of stopping the game after n tosses, one thinks about a coin tossing game of indefinite duration. Clearly, the sample space will now be Ω = {0,1}^{ℤ⁺}. In addition, when A ⊆ Ω depends only on tosses 1 through n, then A should be a measurable event and the probability assigned to A should be the same as the probability that would have been assigned had the game stopped after the nth toss. That is, if Γ ⊆ {0,1}ⁿ and⁸ A = {ω ∈ Ω : (ω(1), ..., ω(n)) ∈ Γ}, then ℙ(A) should equal 2⁻ⁿ #Γ.

Continuing with the example of an infinite coin tossing game, one sees (cf. (6.1.3)) that, for any fixed η ∈ Ω, ℙ({η}) is equal to

$$\lim_{n\to\infty} \mathbb P\bigl(\{\omega : (\omega(1),\ldots,\omega(n)) = (\eta(1),\ldots,\eta(n))\}\bigr) \;=\; \lim_{n\to\infty} 2^{-n} \;=\; 0.$$
Hence, in this case, nothing is learned from the way in which probability is assigned to points: every sample point has probability 0. In fact, it is far from obvious that there exists a probability measure on Ω with the required properties. Nonetheless, as we will now prove, such a measure does exist.

To get started, for each n ≥ 1, let Πₙ : Ω → Ωₙ = {0,1}ⁿ be the projection map given by ω ↝ (ω(1), ..., ω(n)), and set

$$\mathcal A_n = \{\Pi_n^{-1}(\Gamma) : \Gamma \subseteq \Omega_n\}.$$

Equivalently, A ∈ 𝒜ₙ if and only if A depends on the first n coordinates, in the sense that ω ∈ A and Πₙ(ω′) = Πₙ(ω) ⟹ ω′ ∈ A. Next, define ℙₙ on 𝒜ₙ so that ℙₙ(A) = 2⁻ⁿ #Πₙ(A). Clearly 𝒜ₙ ⊆ 𝒜ₙ₊₁. In fact, if A ∈ 𝒜ₙ, then Πₙ₊₁(A) = Πₙ(A) × {0,1}, and so ℙₙ₊₁(A) = ℙₙ(A). Hence, we can unambiguously define ℙ on 𝒜 = ⋃_{n=1}^∞ 𝒜ₙ so that ℙ(A) = ℙₙ(A) when A ∈ 𝒜ₙ. Moreover, if A, A′ ∈ 𝒜 are disjoint, then, by choosing n so that A, A′ ∈ 𝒜ₙ, we see that ℙ(A ∪ A′) = ℙₙ(A ∪ A′) = ℙₙ(A) + ℙₙ(A′) = ℙ(A) + ℙ(A′).
⁸ It is convenient here to identify Ω with the set of mappings ω from ℤ⁺ into {0,1}. Thus, we will use ω(n) to denote the "nth coordinate" of ω.
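The consistency ℙₙ₊₁(A) = ℙₙ(A) for an event depending only on the first n tosses can be checked concretely; the particular event below is a hypothetical choice.

```python
from itertools import product
from fractions import Fraction

# An event A depending on the first n tosses is determined by a set Gamma of
# length-n prefixes, and P_n(A) = 2^(-n) * #Gamma.  Viewing the same event
# through prefixes of length n+1 must not change its probability.
def P_n(gamma, n):
    return Fraction(len(gamma), 2 ** n)

# Hypothetical event: "first toss is 1 and second toss is 0", as a set of
# prefixes of length 2, and again as a set of prefixes of length 3.
gamma2 = {(1, 0)}
gamma3 = {(1, 0, b) for b in (0, 1)}   # Pi_3(A) = Pi_2(A) x {0,1}

assert P_n(gamma2, 2) == P_n(gamma3, 3) == Fraction(1, 4)
```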
Before taking the next step, it will be convenient to introduce a metric on Ω for which all elements of 𝒜 are open sets. Namely, we take

$$\rho(\omega, \omega') = \sum_{n=1}^{\infty} 2^{-n}\,|\omega(n) - \omega'(n)|.$$

It is an easy matter to check that ρ is a metric on Ω. In fact, it is a metric for which the notion of convergence corresponds to convergence of each coordinate. In addition, if ω ∈ A ∈ 𝒜ₙ, then ρ(ω′, ω) < 2⁻ⁿ ⟹ ω′ ∈ A. Hence, each A ∈ 𝒜 is open relative to the topology induced by ρ. At the same time, each A ∈ 𝒜 is also closed in the ρ-topology. Indeed, if A ∈ 𝒜ₙ, {ωₖ}₁^∞ ⊆ A, and ρ(ωₖ, ω) → 0, then there is a k ∈ ℤ⁺ for which ρ(ω, ωₖ) < 2⁻ⁿ, and so, for this k, Πₙ(ω) = Πₙ(ωₖ).

Another fact which we will need is that Ω is compact in the ρ-topology. To see this, suppose that {ωₖ}₁^∞ is a sequence in Ω. To produce a convergent subsequence, we employ a diagonalization procedure. That is, because, for each m, either ωₖ(m) = 0 for infinitely many k ∈ ℤ⁺ or ωₖ(m) = 1 for infinitely many k ∈ ℤ⁺, we can use induction to construct {k_{m,ℓ} : (m, ℓ) ∈ (ℤ⁺)²} so that {k_{1,ℓ} : ℓ ∈ ℤ⁺} is an increasing enumeration of {k : ωₖ(1) = 0} or of {k : ωₖ(1) = 1} depending on whether ωₖ(1) = 0 for infinitely or finitely many k's, and, when m > 1, {k_{m,ℓ} : ℓ ∈ ℤ⁺} is an increasing enumeration of {k_{m−1,ℓ} : ω_{k_{m−1,ℓ}}(m) = 0} or of {k_{m−1,ℓ} : ω_{k_{m−1,ℓ}}(m) = 1} depending on whether ω_{k_{m−1,ℓ}}(m) = 0 for infinitely or finitely many ℓ's. Now set k_ℓ = k_{ℓ,ℓ}, and check that {ω_{k_ℓ} : ℓ ∈ ℤ⁺} is a convergent subsequence of {ωₖ}₁^∞.

Finally, there is another property which we will need to know about. Namely, for each open set G ⊆ Ω there is a sequence {Aₘ}₁^∞ ⊆ 𝒜 of mutually disjoint sets such that Aₘ ∈ 𝒜ₘ and G = ⋃₁^∞ Aₘ. To produce {Aₘ}₁^∞, one can proceed as follows. Choose A₁ to be the largest element A ∈ 𝒜₁ with the property that A ⊆ G. Equivalently, A₁ is the union of all the A ∈ 𝒜₁ contained in G. Next, given A_ℓ for 1 ≤ ℓ ≤ m, choose A_{m+1} to be the largest A ∈ 𝒜_{m+1} contained in G \ ⋃₁^m A_ℓ. Obviously, these Aₘ's are mutually disjoint.
To see that they cover G, suppose that ω ∈ G, and choose n ≥ 2 so that ω′ ∈ G whenever ρ(ω′, ω) < 2^{−n+1}. Then A = {ω′ : Πₙ(ω′) = Πₙ(ω)} ⊆ G, and either ω ∈ ⋃₁^{n−1} Aₘ or A ∩ ⋃₁^{n−1} Aₘ = ∅, in which case ω ∈ A ⊆ Aₙ.

Having made these preparations, we can continue our construction. First, define ℙ̄(Γ) for Γ ⊆ Ω to be the infimum of Σ₁^∞ ℙ(Aₘ) over all countable covers {Aₘ}₁^∞ ⊆ 𝒜 of Γ. Equivalently, since all elements of 𝒜 are open and all open sets can be written as the union of countably many elements of 𝒜,
(6.2.1)   $$\bar{\mathbb P}(\Gamma) = \inf\bigl\{\bar{\mathbb P}(G) : G \supseteq \Gamma \text{ and } G \text{ open}\bigr\}.$$
The following lemma contains several elementary facts about ℙ̄.
6.2.2 LEMMA. For each A ∈ 𝒜, ℙ̄(A) = ℙ(A). More generally, for all Γ₁ ⊆ Γ₂ ⊆ Ω, ℙ̄(Γ₁) ≤ ℙ̄(Γ₂). Moreover, for any open G ⊆ Ω and any countable, exact cover {Aₘ}₁^∞ of G by mutually disjoint elements of 𝒜, ℙ̄(G) = Σ₁^∞ ℙ(Aₘ), and, in general, for any sequence {Γₖ}₁^∞ of subsets of Ω,

(6.2.3)   $$\bar{\mathbb P}\Bigl(\bigcup_{k=1}^{\infty} \Gamma_k\Bigr) \le \sum_{k=1}^{\infty} \bar{\mathbb P}(\Gamma_k).$$

Finally, if F and F′ are disjoint closed subsets of Ω, then ℙ̄(F ∪ F′) = ℙ̄(F) + ℙ̄(F′).
PROOF: The second assertion is obvious, since any cover of Γ₂ is also a cover of Γ₁. Next, suppose that A ∈ 𝒜. Obviously, ℙ̄(A) ≤ ℙ(A). On the other hand, if {Aₘ}₁^∞ is a cover of A by elements of 𝒜, then, by the Heine–Borel property of compact sets, we can find (remember that A is closed and therefore compact) an n such that A ⊆ ⋃₁ⁿ Aₘ. Now choose N so that {A} ∪ {Aₘ}₁ⁿ ⊆ 𝒜_N. Then

$$\mathbb P(A) = \mathbb P_N(A) \le \sum_{1}^{n} \mathbb P_N(A_m) = \sum_{1}^{n} \mathbb P(A_m) \le \sum_{1}^{\infty} \mathbb P(A_m),$$

and so we can conclude that ℙ(A) ≤ ℙ̄(A).

Next let G be open, and suppose that {Aₘ}₁^∞ is an exact cover of G by mutually disjoint elements of 𝒜. By definition, ℙ̄(G) ≤ Σ₁^∞ ℙ(Aₘ). On the other hand, it is clear that, because G ⊇ ⋃₁ⁿ Aₘ ∈ 𝒜, we know that

$$\sum_{1}^{n} \mathbb P(A_m) = \mathbb P_N\Bigl(\bigcup_{1}^{n} A_m\Bigr) = \bar{\mathbb P}\Bigl(\bigcup_{1}^{n} A_m\Bigr) \le \bar{\mathbb P}(G),$$

where N is chosen so that {Aₘ}₁ⁿ ⊆ 𝒜_N. Hence, the desired conclusion follows after one lets n → ∞.

Now suppose that {Γₖ}₁^∞ is a sequence of subsets of Ω. Given ε > 0, choose for each k ∈ ℤ⁺ a countable cover {A_{k,ℓ} : ℓ ∈ ℤ⁺} ⊆ 𝒜 of Γₖ so that Σ_ℓ ℙ(A_{k,ℓ}) ≤ ℙ̄(Γₖ) + 2⁻ᵏε. Then {A_{k,ℓ} : (k, ℓ) ∈ (ℤ⁺)²} ⊆ 𝒜 is a countable cover of ⋃ₖ Γₖ, and so (by Fubini's Theorem for countable measure spaces)

$$\bar{\mathbb P}\Bigl(\bigcup_k \Gamma_k\Bigr) \le \sum_k \sum_\ell \mathbb P(A_{k,\ell}) \le \sum_k \bar{\mathbb P}(\Gamma_k) + \varepsilon.$$
Thus, ℙ̄(⋃ₖ Γₖ) ≤ Σₖ ℙ̄(Γₖ).

Finally, given disjoint, closed subsets F and F′ of Ω, the preceding implies that ℙ̄(F ∪ F′) ≤ ℙ̄(F) + ℙ̄(F′). To get the opposite inequality, first note that, because they are compact, there is an N ≥ 2 such that ρ(ω, ω′) > 2^{−N+1} for all ω ∈ F and ω′ ∈ F′. Hence, if B = Π_N^{−1}(Π_N(F)) and B′ = Π_N^{−1}(Π_N(F′)), then F ⊆ B ∈ 𝒜, F′ ⊆ B′ ∈ 𝒜, and B ∩ B′ = ∅, since η ∈ B ∩ B′ would imply there exist ω ∈ F and ω′ ∈ F′ such that ρ(ω, ω′) ≤ ρ(ω, η) + ρ(η, ω′) ≤ 2^{−N+1}. Now suppose that {Aₘ}₁^∞ ⊆ 𝒜 is a countable cover of F ∪ F′, and set Bₘ = B ∩ Aₘ and B′ₘ = B′ ∩ Aₘ. Then {Bₘ}₁^∞ ⊆ 𝒜 and {B′ₘ}₁^∞ ⊆ 𝒜 are countable covers of F and F′. In addition, because Bₘ and B′ₘ are disjoint elements of 𝒜, ℙ(Aₘ) ≥ ℙ(Bₘ ∪ B′ₘ) = ℙ(Bₘ) + ℙ(B′ₘ). Hence,

$$\sum_m \mathbb P(A_m) \ge \sum_m \mathbb P(B_m) + \sum_m \mathbb P(B'_m) \ge \bar{\mathbb P}(F) + \bar{\mathbb P}(F'). \qquad \Box$$
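For a concrete instance of the exact-cover identity ℙ̄(G) = Σ ℙ(Aₘ), consider the open set G = {ω : ω(k) = 1 for some k}, which decomposes into the mutually disjoint cylinder sets Aₘ = {ω : ω(1) = ⋯ = ω(m−1) = 0, ω(m) = 1} with Aₘ ∈ 𝒜ₘ. The choice of G here is illustrative, not taken from the text.

```python
from fractions import Fraction

# A_m fixes the first m coordinates, so Pi_m(A_m) is a single point of
# {0,1}^m and P(A_m) = 2^-m * 1.
def P_cylinder(m):
    return Fraction(1, 2 ** m)

# Partial sums of sum_m P(A_m) equal 1 - 2^-n, increasing to 1, which is
# consistent with the fact that G misses only the single point (0, 0, 0, ...),
# a point of probability 0.
partial = [sum(P_cylinder(m) for m in range(1, n + 1)) for n in (1, 5, 30)]
assert partial[-1] == 1 - Fraction(1, 2 ** 30)
```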
The final ingredient in our construction is the specification of which subsets of Ω are to be measurable. Although it may not be immediately apparent why, we will say that Γ ⊆ Ω is measurable and will write Γ ∈ ℬ if, for each ε > 0, there exists an open G ⊇ Γ such that ℙ̄(G \ Γ) ≤ ε. Obviously, every open set is measurable. At the same time, it should be clear that Γ ∈ ℬ if ℙ̄(Γ) = 0. Indeed, by (6.2.1), if ℙ̄(Γ) = 0, then, for each ε > 0 we can find an open G ⊇ Γ such that ℙ̄(G \ Γ) ≤ ℙ̄(G) ≤ ε. However, it is less obvious that every closed set F ⊆ Ω is measurable. To see this, let ε > 0 be given, and choose an open set G ⊇ F so that ℙ̄(G) ≤ ℙ̄(F) + ε. Then G \ F is open, and so we can write G \ F = ⋃ₘ Aₘ, where the Aₘ's are mutually disjoint elements of 𝒜. Hence, by Lemma 6.2.2, ℙ̄(G \ F) = Σ₁^∞ ℙ(Aₘ). On the other hand, for each n ≥ 1, Bₙ = ⋃₁ⁿ Aₘ is a closed subset of G \ F, and so, by the first and last parts of Lemma 6.2.2,

$$\sum_{1}^{n} \mathbb P(A_m) = \mathbb P(B_n) = \bar{\mathbb P}(B_n) = \bar{\mathbb P}(F \cup B_n) - \bar{\mathbb P}(F) \le \bar{\mathbb P}(G) - \bar{\mathbb P}(F) \le \varepsilon.$$

Thus, ℙ̄(G \ F) = Σ₁^∞ ℙ(Aₘ) ≤ ε.
Now that we know open sets, closed sets, and sets of ℙ̄-measure 0 are all measurable, we can show that ℬ is a σ-algebra. To this end, first suppose that {Γₖ}₁^∞ ⊆ ℬ, and, given ε > 0, choose open sets Gₖ ⊇ Γₖ so that ℙ̄(Gₖ \ Γₖ) ≤ 2⁻ᵏε. Then G = ⋃₁^∞ Gₖ is open, G ⊇ Γ = ⋃₁^∞ Γₖ, and

$$\bar{\mathbb P}(G \setminus \Gamma) \le \bar{\mathbb P}\Bigl(\bigcup_{k=1}^{\infty} (G_k \setminus \Gamma_k)\Bigr) \le \sum_{k=1}^{\infty} \bar{\mathbb P}(G_k \setminus \Gamma_k) \le \varepsilon.$$
Hence, Γ ∈ ℬ, and so ℬ is closed under countable unions. To see that it is also closed under complementation, let Γ ∈ ℬ be given, and, for each n ≥ 1, choose an open set Gₙ ⊇ Γ so that ℙ̄(Gₙ \ Γ) ≤ 1/n. Then D = ⋂₁^∞ Gₙ ⊇ Γ and ℙ̄(D \ Γ) ≤ ℙ̄(Gₙ \ Γ) ≤ 1/n for all n ≥ 1. Thus, ℙ̄(D \ Γ) = 0, and so D \ Γ ∈ ℬ. Now set Fₙ = Gₙ∁ and C = ⋃₁^∞ Fₙ. Because each Fₙ is closed, and therefore measurable, C ∈ ℬ. Hence, since Γ∁ = C ∪ (D \ Γ) ∈ ℬ, we are done.

6.2.4 THEOREM. Referring to the preceding, ℬ₀ = σ(𝒜) and Γ ∈ ℬ if and only if there exist C, D ∈ ℬ₀ such that C ⊆ Γ ⊆ D and ℙ̄(D \ C) = 0. Next, again use ℙ to denote the restriction ℙ̄ ↾ ℬ of ℙ̄ to ℬ. Then (Ω, ℬ, ℙ) is a probability space with the property that ℙ ↾ 𝒜ₙ = ℙₙ for every n ≥ 1.
PROOF: We have already established that ℬ is a σ-algebra. Since all elements of 𝒜 are open, the equality ℬ₀ = σ(𝒜) follows from the fact that every open set can be written as the countable union of elements of 𝒜.

To prove the characterization of ℬ in terms of ℬ₀, first observe that, since open sets are in ℬ, ℬ₀ ⊆ ℬ. Moreover, since sets of ℙ̄-measure 0 are also in ℬ, C ⊆ Γ ⊆ D for C, D ∈ ℬ₀ with ℙ̄(D \ C) = 0 implies that Γ = C ∪ (Γ \ C) ∈ ℬ, since ℙ̄(Γ \ C) ≤ ℙ̄(D \ C) = 0. Conversely, if Γ ∈ ℬ, then we can find sequences {Gₙ}₁^∞ and {Hₙ}₁^∞ of open sets such that Gₙ ⊇ Γ, Hₙ ⊇ Γ∁, and ℙ̄(Gₙ \ Γ) ∨ ℙ̄(Hₙ \ Γ∁) ≤ 1/n. Thus, if D = ⋂₁^∞ Gₙ and C = ⋃₁^∞ Hₙ∁, then C, D ∈ ℬ₀, C ⊆ Γ ⊆ D, and

$$\bar{\mathbb P}(D \setminus C) \le \bar{\mathbb P}(G_n \setminus \Gamma) + \bar{\mathbb P}(H_n \setminus \Gamma^{\complement}) \le \frac{2}{n}$$

for all n ≥ 1. Hence, ℙ̄(D \ C) = 0.

All that remains is to check that ℙ is countably additive on ℬ. To this end, let {Γₖ}₁^∞ ⊆ ℬ be a sequence of mutually disjoint sets, and set Γ = ⋃₁^∞ Γₖ. By the second part of Lemma 6.2.2, we know that ℙ̄(Γ) ≤ Σ₁^∞ ℙ̄(Γₖ). To get the opposite inequality, let ε > 0 be given, and, for each k, choose an open Gₖ ⊇ Γₖ∁ so that ℙ̄(Gₖ \ Γₖ∁) ≤ 2⁻ᵏε. Then each Fₖ = Gₖ∁ is closed, and

$$F_k \subseteq \Gamma_k \quad\text{with}\quad \bar{\mathbb P}(\Gamma_k \setminus F_k) = \bar{\mathbb P}(G_k \setminus \Gamma_k^{\complement}) \le 2^{-k}\varepsilon.$$

In addition, because the Γₖ's are mutually disjoint, so are the Fₖ's. Hence, by the last part of Lemma 6.2.2, we know first that

$$\sum_{1}^{n} \bar{\mathbb P}(F_k) = \bar{\mathbb P}\Bigl(\bigcup_{1}^{n} F_k\Bigr) \le \bar{\mathbb P}(\Gamma) \quad\text{for all } n \ge 1,$$

and then that

$$\sum_{1}^{\infty} \bar{\mathbb P}(\Gamma_k) \le \sum_{1}^{\infty} \bar{\mathbb P}(F_k) + \varepsilon \le \bar{\mathbb P}(\Gamma) + \varepsilon. \qquad \Box$$
6.3 Independent Random Variables

In Kolmogorov's model, independence is best described in terms of σ-algebras. Namely, if (Ω, ℱ, ℙ) is a probability space and ℱ₁ and ℱ₂ are σ-subalgebras (i.e., are σ-algebras which are subsets) of ℱ, then we say that ℱ₁ and ℱ₂ are independent if ℙ(Γ₁ ∩ Γ₂) = ℙ(Γ₁)ℙ(Γ₂) for all Γ₁ ∈ ℱ₁ and Γ₂ ∈ ℱ₂. It should be comforting to recognize that, when A₁, A₂ ∈ ℱ and, for i ∈ {1,2}, ℱᵢ = σ({Aᵢ}) = {∅, Aᵢ, Aᵢ∁, Ω}, then, as is easily checked, ℱ₁ is independent of ℱ₂ precisely when, in the terminology of elementary probability theory, "A₁ is independent of A₂": ℙ(A₁ ∩ A₂) = ℙ(A₁)ℙ(A₂).

The notion of independence gets inherited by random variables. Namely, the members of a collection {X_α : α ∈ I} of random variables on (Ω, ℱ, ℙ)
are said to be mutually independent if, for each pair of disjoint subsets J₁ and J₂ of I, the σ-algebras σ({X_α : α ∈ J₁}) and σ({X_α : α ∈ J₂}) are independent. One can use Theorem 6.1.6 to show that this definition is equivalent to saying that if X_α takes its values in the measurable space (E_α, ℬ_α), then, for every finite subset {α_m}₁ⁿ of distinct elements of I and choice of B_{α_m} ∈ ℬ_{α_m}, 1 ≤ m ≤ n,

$$\mathbb P\bigl(X_{\alpha_m} \in B_{\alpha_m} \text{ for } 1 \le m \le n\bigr) = \prod_{m=1}^{n} \mathbb P\bigl(X_{\alpha_m} \in B_{\alpha_m}\bigr).$$

As a dividend of this definition, it is essentially obvious that if {X_α : α ∈ I} are mutually independent and if, for each α ∈ I, F_α is a measurable map on the range of X_α, then {F_α(X_α) : α ∈ I} are again mutually independent. Finally, by starting with simple functions, one can show that if {Xₘ}₁ⁿ are mutually independent and, for each 1 ≤ m ≤ n, fₘ is a measurable ℝ-valued function on the range of Xₘ, then

$$\mathbb E\bigl[f_1(X_1)\cdots f_n(X_n)\bigr] = \prod_{m=1}^{n} \mathbb E\bigl[f_m(X_m)\bigr]$$
whenever the fₘ are all bounded or are all non-negative.

6.3.1. Existence of Lots of Independent Random Variables: In the preceding section, we constructed a countable family of mutually independent random variables. Namely, if (Ω, ℬ, ℙ) is the probability space discussed in Theorem 6.2.4 and Xₘ(ω) = ω(m) is the mth coordinate of ω, then, for any choice of n ≥ 1 and (η₁, ..., ηₙ) ∈ {0,1}ⁿ, ℙ(Xₘ = ηₘ, 1 ≤ m ≤ n) = 2⁻ⁿ = ∏₁ⁿ ℙ(Xₘ = ηₘ). Thus, the random variables {Xₘ}₁^∞ are mutually independent. Mutually independent random variables, like these, which take on only two values are called Bernoulli random variables.

As we are about to see, Bernoulli random variables can be used as building blocks to construct many other families of mutually independent random variables. The key to such constructions is contained in the following lemma.

6.3.1 LEMMA. Given any family {Bₘ}₁^∞ of mutually independent, {0,1}-valued Bernoulli random variables satisfying ℙ(Bₘ = 0) = ½ = ℙ(Bₘ = 1) for all m ∈ ℤ⁺, set U = Σ₁^∞ 2⁻ᵐ Bₘ. Then U is uniformly distributed on [0,1). That is, ℙ(U ≤ u) is 0 if u < 0, u if u ∈ [0,1), and 1 if u ≥ 1.

PROOF: Given N ≥ 1 and 0 ≤ n < 2ᴺ, we want to show that

(*)   $$\mathbb P\bigl(n2^{-N} < U \le (n+1)2^{-N}\bigr) = 2^{-N}.$$

To this end, note that n2⁻ᴺ < U ≤ (n+1)2⁻ᴺ if and only if either Σ₁ᴺ 2⁻ᵐBₘ = (n+1)2⁻ᴺ and Bₘ = 0 for m > N, or Σ₁ᴺ 2⁻ᵐBₘ = n2⁻ᴺ and Bₘ = 1 for some m > N. Hence, since ℙ(Bₘ = 0 for all m > N) = 0, the left-hand side of (*) is equal to the probability that Σ₁ᴺ 2⁻ᵐBₘ = n2⁻ᴺ. However, elementary considerations show that, for any 0 ≤ n < 2ᴺ there is exactly one choice of (η₁, ..., η_N) ∈ {0,1}ᴺ for which Σ₁ᴺ 2⁻ᵐηₘ = n2⁻ᴺ. Hence,

$$\mathbb P\Bigl(\sum_{m=1}^{N} 2^{-m} B_m = n2^{-N}\Bigr) = \mathbb P\bigl(B_m = \eta_m \text{ for } 1 \le m \le N\bigr) = 2^{-N}.$$
Having proved (*), the rest is easy. Namely, since ℙ(U = 0) = ℙ(Bₘ = 0 for all m ≥ 1) = 0, (*) tells us that, for any 1 ≤ k ≤ 2ᴺ,

$$\mathbb P\bigl(U \le k2^{-N}\bigr) = \sum_{m=0}^{k-1} \mathbb P\bigl(m2^{-N} < U \le (m+1)2^{-N}\bigr) = k2^{-N}.$$

Hence, because u ↝ ℙ(U ≤ u) is continuous from the right, it is now clear that F_U(u) = u for all u ∈ [0,1). Finally, since ℙ(U ∈ [0,1]) = 1, this completes
the proof. □

Now let I be a non-empty, finite or countably infinite set. Then I × ℤ⁺ is countable, and so we can construct a 1-to-1 map (α, n) ↝ N(α, n) from I × ℤ⁺ onto ℤ⁺. Next, for each α ∈ I, define ω ∈ Ω = {0,1}^{ℤ⁺} ↦ X_α(ω) ∈ Ω so that the nth coordinate of X_α(ω) is ω(N(α, n)). Then, as random variables on the probability space (Ω, ℬ, ℙ) in Theorem 6.2.4, {X_α : α ∈ I} are mutually independent and each has distribution ℙ. Hence, if Φ : Ω → [0,1] is the continuous map given by

$$\Phi(\eta) = \sum_{m=1}^{\infty} 2^{-m}\eta(m) \quad\text{for } \eta \in \Omega,$$

and if U_α = Φ(X_α), then the random variables {U_α = Φ(X_α) : α ∈ I} are mutually independent and, by Lemma 6.3.1, each is uniformly distributed on [0,1].

The final step in this section combines the preceding construction with the well-known fact that any ℝ-valued random variable can be represented in terms of a uniform random variable. More precisely, a map F : ℝ → [0,1] is called a distribution function if F is non-decreasing, continuous from the right, and tends to 0 at −∞ and 1 at +∞. Given such an F, define

$$F^{-1}(u) = \inf\{x \in \mathbb R : F(x) \ge u\} \quad\text{for } u \in [0,1].$$

Notice that, by right continuity, F(x) ≥ u ⟺ F⁻¹(u) ≤ x. Hence, if U is uniformly distributed on [0,1], then F⁻¹(U) is a random variable whose distribution function is F.

6.3.2 THEOREM. Let (Ω, ℬ, ℙ) be the probability space in Theorem 6.2.4. Given any finite or countably infinite index set I and a collection {F_α : α ∈ I} of distribution functions, there exist mutually independent random variables {X_α : α ∈ I} on (Ω, ℬ, ℙ) with the property that, for each α ∈ I, F_α is the distribution of X_α.
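A minimal sketch of the two constructions just given, with a hypothetical two-point target distribution. Truncating the binary series U = Σ 2⁻ᵐBₘ at N bits is an approximation introduced for the computation, not part of the theorem.

```python
from itertools import product
from fractions import Fraction

# Truncating U = sum 2^(-m) B_m at N fair bits gives a random variable
# uniform on the dyadic grid {k 2^-N : 0 <= k < 2^N}.  Pushing it through
# the generalized inverse F^{-1}(u) = inf{x : F(x) >= u} of a hypothetical
# two-point distribution function recovers that distribution up to 2^-N.
N = 10

def F_inverse(u):
    # Target: F(x) = 0 for x < 0, 3/10 for 0 <= x < 1, and 1 for x >= 1,
    # i.e. P(X = 0) = 3/10 and P(X = 1) = 7/10.
    return 0 if u <= Fraction(3, 10) else 1

# Enumerate all bit strings (each of probability 2^-N) and push forward.
mass = {0: Fraction(0), 1: Fraction(0)}
for bits in product((0, 1), repeat=N):
    u = sum(Fraction(b, 2 ** (m + 1)) for m, b in enumerate(bits))
    mass[F_inverse(u)] += Fraction(1, 2 ** N)

# The pushed-forward mass at 0 is P(U_trunc <= 3/10), within 2^-N of 3/10.
assert abs(mass[0] - Fraction(3, 10)) <= Fraction(1, 2 ** N)
assert mass[0] + mass[1] == 1
```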
6.4 Conditional Probabilities and Expectations
Just as they are to independence, σ-algebras are central to Kolmogorov's definition of conditioning. Namely, given a probability space (Ω, ℱ, ℙ), a sub-σ-algebra Σ, and a random variable X which is non-negative or integrable, Kolmogorov says that the random variable X_Σ is a conditional expectation of X given Σ if X_Σ is a random variable, non-negative or integrable as the case may be, which is measurable with respect to Σ (i.e., σ({X_Σ}) ⊆ Σ) and satisfies

(6.4.1)   $$\mathbb E[X_\Sigma, \Gamma] = \mathbb E[X, \Gamma] \quad\text{for all } \Gamma \in \Sigma.$$
When X is the indicator function of a set B ∈ ℱ, then the term conditional expectation is replaced by conditional probability.

To understand that this definition is an extension of the one given in elementary probability courses, begin by considering the case when Σ is the trivial σ-algebra {∅, Ω}. Because only constant random variables are measurable with respect to {∅, Ω}, it is clear that the one and only conditional expectation of X will be 𝔼[X]. Next, suppose that Σ = σ({A}) = {∅, A, A∁, Ω} for some A ∈ ℱ with ℙ(A) ∈ (0,1). In this situation, it is an easy matter to check that, for any B ∈ ℱ,

$$\omega \rightsquigarrow \frac{\mathbb P(B \cap A)}{\mathbb P(A)}\,\mathbf 1_A(\omega) + \frac{\mathbb P(B \cap A^{\complement})}{\mathbb P(A^{\complement})}\,\mathbf 1_{A^{\complement}}(\omega)$$

is a conditional probability of B given Σ. That is, the quantity ℙ(B∩A)/ℙ(A), which in elementary probability theory would be called "the conditional probability ℙ(B|A) of B given A," appears here as the value on A of the map ω ↝ ℙ(B|Σ)(ω). More generally, if Σ is generated by a finite or countable partition 𝒫 ⊆ ℱ of Ω, then, for any non-negative or integrable random variable X,

$$\omega \rightsquigarrow \sum_{\{A \in \mathcal P \,:\, \mathbb P(A) > 0\}} \frac{\mathbb E[X, A]}{\mathbb P(A)}\,\mathbf 1_A(\omega)$$
will be a conditional expectation of X given Σ.

Of course, Kolmogorov's definition brings up two essential questions: existence and uniqueness. A proof of existence in general can be done in any one of many ways. For instance, when 𝔼[X²] < ∞, one can easily see that, just as the constant 𝔼[X] is the minimizer of 𝔼[(X − X′)²] among all constant random variables X′, so X_Σ will have to be the minimizer of 𝔼[(X − X′)²] among all Σ-measurable random variables X′. In this way, the problem of existence can be related to a problem of orthogonal projection in the space of all square-integrable random variables, and, although they are outside the scope of this book, such projection results are familiar to anyone who has studied the theory of Hilbert spaces.

Uniqueness, on the other hand, is both easier and more subtle than existence. Namely, there is no naïve uniqueness statement here, because, in
general, there will be uncountably many ways to take X_Σ.⁹ On the other hand, every choice differs from any other choice on a set of ℙ-measure 0. To see this, suppose that X′_Σ is a second non-negative random variable which satisfies (6.4.1). Then A = {X′_Σ > X_Σ} ∈ Σ, and so the only way that (6.4.1) can hold is if ℙ(A) = 0. Similarly, ℙ(X_Σ > X′_Σ) = 0, and so ℙ(X_Σ ≠ X′_Σ) = 0.

In spite of the ambiguity caused by the sort of uniqueness problems just discussed, it is common to ignore, in so far as possible, this ambiguity and proceed as if a random variable possesses only one conditional expectation with respect to a given σ-algebra. In this connection, the standard notation for a conditional expectation of X given Σ is 𝔼[X|Σ] or, when X = 1_B, ℙ(B|Σ), which is the notation which we adopted in the earlier chapters of this book.

6.4.1. Conditioning with Respect to Random Variables: In this book, essentially all conditioning is done when Σ = σ(𝔉) (cf. §6.1.4) for some family 𝔉 of measurable functions on (Ω, ℱ, ℙ). When Σ has this form, the conditional expectation of a random variable X will be a measurable function of the functions in 𝔉. For example, if 𝔉 = {F₁, ..., Fₙ} and the functions Fₘ all take their values in a countable space 𝕊, a conditional expectation 𝔼[X | σ(𝔉)] of X given σ(𝔉) has the form Φ(F₁, ..., Fₙ), where Φ(i₁, ..., iₙ) is equal to

$$\frac{\mathbb E[X,\; F_1 = i_1, \ldots, F_n = i_n]}{\mathbb P(F_1 = i_1, \ldots, F_n = i_n)} \quad\text{or}\quad 0$$

according to whether ℙ(F₁ = i₁, ..., Fₙ = iₙ) is positive or 0. In order to emphasize that conditioning with respect to σ(𝔉) results in a function of 𝔉, we use the notation 𝔼[X|𝔉] or ℙ(B|𝔉) instead of 𝔼[X|σ(𝔉)] or ℙ(B|σ(𝔉)).

To give more concrete examples of what we are talking about, first suppose that X and Y are independent random variables with values in some countable space 𝕊, and set Z = F(X, Y), where F : 𝕊² → ℝ is bounded. Then

(6.4.2)   $$\mathbb E[Z \mid X] = v(X), \quad\text{where } v(i) = \mathbb E[F(i, Y)] \text{ for } i \in \mathbb S.$$

A less trivial example is provided by our discussion of Markov chains. In Chapter 2, we encoded the Markov property in equations like

$$\mathbb P\bigl(X_{n+1} = j \mid X_0, \ldots, X_n\bigr) = (\mathbf P)_{X_n j},$$

which displays this conditioning as a function, namely (i₀, ..., iₙ) ↝ (P)_{iₙ j}, of the random variables in terms of which the condition is made. (Of course, the distinguishing feature of the Markov property is that the function depends only on iₙ and not on (i₀, ..., i_{n−1}).) Similarly, when we discussed Markov processes with a continuous time parameter, we wrote

$$\mathbb P\bigl(X(t) = j \mid X(\sigma),\ \sigma \in [0, s]\bigr) = \bigl(\mathbf P(t - s)\bigr)_{X(s)\,j},$$

which again makes it explicit that the conditioned quantity is a function of the random variables on which the condition is imposed.
9 This non-uniqueness is the reason for our use of the article "a" instead of "the" in front of "conditional expectation."
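Formula (6.4.2) and the defining property (6.4.1) can be checked directly on a small discrete example; the distributions of X and Y and the function F below are hypothetical choices.

```python
from fractions import Fraction

# For independent X and Y with countable ranges and Z = F(X, Y), the
# function v(i) = E[F(i, Y)] gives a version v(X) of E[Z | X].
pX = {0: Fraction(1, 4), 1: Fraction(3, 4)}   # hypothetical law of X
pY = {0: Fraction(1, 2), 1: Fraction(1, 2)}   # hypothetical law of Y

def F(i, j):
    return i + 2 * j

v = {i: sum(F(i, j) * q for j, q in pY.items()) for i in pX}

# Property (6.4.1) on the generating events {X = i}:
# E[Z, X = i] = E[v(X), X = i], using that the joint law is the product law.
for i, p in pX.items():
    lhs = sum(F(i, j) * p * q for j, q in pY.items())
    assert lhs == v[i] * p

# v(0) = 0 + 2*(1/2) = 1 and v(1) = 1 + 2*(1/2) = 2.
assert v == {0: 1, 1: 2}
```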
Notation

ℤ & ℤ⁺ : set of all integers and the subset of positive integers
ℕ : set of non-negative integers
#S : number of elements in the set S
A∁ : complement of the set A
1_A : indicator function of the set A: 1_A(x) = 1 if x ∈ A and 1_A(x) = 0 if x ∉ A
F ↾ S : restriction of the function F to the set S
a ∧ b & a ∨ b : minimum and maximum of a, b ∈ ℝ
a⁺ & a⁻ : positive part a ∨ 0 & negative part (−a) ∨ 0 of a ∈ ℝ
i → j & i ↔ j : state j is accessible from i & communicates with i
δ_{ij} : Kronecker delta: δ_{ij} is 1 or 0 depending on whether i is equal or unequal to j
δᵢ : point measure at i ∈ 𝕊: (δᵢ)_j = δ_{ij} for j ∈ 𝕊
𝔼[X, A] : expected value of X on the event A (§6.2)
𝔼[X|A] & 𝔼[X|Σ] : conditional expectation of X given the event A & the σ-algebra Σ (§6.4)
⟨φ⟩_π : alternative notation for Σ_{i∈𝕊} φ(i)(π)ᵢ (§3.1)
(φ, ψ)_π & ‖φ‖_{2,π} : inner product and norm in L²(π) (§5.1.2, (5.1.4))
Stat(P) : set of stationary distributions for the transition probability matrix P (§3.2.3)
‖μ‖_v : variation norm of the row vector μ ((2.1.5))
‖f‖_u : uniform norm of the function f ((2.1.10))
‖M‖_{u,v} : uniform-variation norm of the matrix M ((3.2.1))
Var_μ(f) : variance of f relative to the probability vector μ
References
1. Diaconis, P. & Stroock, D., Geometric bounds for eigenvalues of Markov chains, Ann. Appl. Probab. 1 #1 (1991), 36–61.
2. Dunford, N. & Schwartz, J., Linear Operators, Part I, Wiley Classics Library, Wiley-Interscience, NY, 1988.
3. Holley, R. & Stroock, D., Simulated annealing via Sobolev inequalities, Comm. Math. Phys. 115 #4 (1988), 553–569.
4. Karlin, S. & Taylor, H., A First Course in Stochastic Processes, 2nd ed., Academic Press, NY, 1975.
5. Norris, J. R., Markov Chains, Cambridge Series in Statistical & Probabilistic Mathematics, Cambridge Univ. Press, Cambridge, U.K., 1997.
6. Revuz, D., Markov Chains, Mathematical Library, vol. 11, North-Holland, Amsterdam & New York, 1984.
7. Riesz, F. & Sz.-Nagy, B., Functional Analysis, translated from the French edition by F. Boron, reprint of 1955 original, Dover Books, NY, 1990.
8. Stroock, D., A Concise Introduction to the Theory of Integration, 3rd ed., Birkhäuser, Boston, MA, 1998.
9. Stroock, D., Probability Theory, An Analytic View, 2nd ed., Cambridge Univ. Press, NY, 2000.
Index
A
accessible, 36, 46 adjoint, 40, 103, 108 allowable path, 118, 125 almost everywhere, 152 aperiodic, 52
B
backward variable, 86 Bernoulli random variable, 163 binomial coefficient, 2 generalized, 6 branching process, 41 extinction, 41 C
Cauchy convergent, 26 Chapman-Kolmogorov equation, 80 time-inhomogeneous, 133 communicates, 46 Q-communicates, 96 communicating class properties, 46 conditional expectation, 165 conditional probability, 2, 165 continuity properties of measures, 147 convex function, 138 convex set, 58 convolution, 79 cooling schedule, 132 countably additive, 146 coupling, 15 D
δ_{i,j}, the Kronecker symbol, 2 detailed balance condition, 108, 121 Dirichlet form, 116 distribution function, 158, 164 initial, 24 of a random variable, 157 Doeblin's Theorem, 28 basic, 28 sharpened, 40 spectral theory, 29 Doob's h-transformation, 43 Doob's Stopping Time Theorem, 49 doubly stochastic, 40 E
empirical measure, 74, 105 ergodic theorem empirical measure, 74, 105 individual, 72, 105 mean, 33, 61, 100 ergodic theory, 33 event, 157 exhaustion, 90 expected value, 157 explosion time, 90 extreme point, 58 F
Fatou's Lemma, 155 finite measure space, 146 first passage time, 3 forward variable, 86 Fubini's Theorem, 155 G
generalized binomial coefficient, 6 Gibbs state, 126 Glauber dynamics, 126 greatest common divisor, 51 Gronwall's inequality, 125, 140 H
homogeneous increments, 75
I independence of σ-algebras, 162 of random variables, 163 individual ergodic theorem, 38, 72, 105 infinitesimal characteristics, 88 initial distribution, 24 inner product, 109 integer part [s] of s, 30 integrable function, 151 irreducible, 46 Q-irreducible, 96 J
Jensen's inequality, 139 K
Kolmogorov's equation backward, 85, 91 forward, 86 time-inhomogeneous, 132 Kronecker symbol, 2 L
Lebesgue's Dominated Convergence Theorem, 155 M
Markov chain, 23 initial distribution, 24 Markov process, 80 rates for, 80 transition probability for, 80 Markov property, 23, 27, 83 Markov's inequality, 153 measurable map, 149 space, 145 subsets, 145 measure, 145 measure space, 145 σ-finite, 147 finite, 146 Metropolis algorithm, 130 minimal extension, 95 Monotone Convergence Theorem, 154 monotonicity of integral, 150 mutually independent random variables, 163
N non-explosion, 92 non-negative definite, 116 norm, 26 null recurrent, 58 P partition function, 126 period of a state, 52 Poincaré inequality, 117 Poincaré constant, 116 Poisson process compound with jump distribution μ and rate R, 78 simple, 75 positive recurrent, 58, 101 probability, 157 measure, 157 space, 157 vector, 24 P-stationary, 57 Q Q-null recurrent, 100 Q-positive recurrent, 100 Q-matrix, 86 queuing model, 21, 68 queuing theory, 21 R
random time change, 106 random variable, 157 Bernoulli, 163 random walk symmetric, 7, 9 rates, 80 bounded, 81 degenerate, 81 non-degenerate, 81 recurrence, 6 recurrence time, 8 recurrent, 8, 9, 20, 34, 46, 48 null, 58 positive, 58 reflection principle, 3 renewal equation, 56 reverse, 40, 103 reversible, 107
S sample point, 157 sample space, 157 Schwarz's inequality, 16, 109, 139 semigroup, 80 set of measure 0, 152 σ-algebra, 145 Borel, 148 generated by, 148, 149 smallest containing, 148 σ-finite measure space, 147 signum, 5 simple function, 150 simple Poisson process, 75 simulated annealing algorithm, 130 spectral gap, 111 spectral theory, 29, 112 spin-flip systems, 143 state space, 23 stationary distribution, 28, 57, 71 null recurrent case, 72, 104 uniqueness, 72, 104 Stirling's formula, 18 Strong Law of Large Numbers, 17 subadditivity of measures, 147 symmetric on L², 110 T
time of first return, 6, 9, 46 time-inhomogeneous, 132 Chapman-Kolmogorov equation, 133 Kolmogorov's forward equation, 132 transition probability, 132 transient, 8, 9, 20, 34, 46, 48 transition probability matrix, 24 time-inhomogeneous, 132 triangle inequality, 26 U uniform norm ‖·‖_u, 27 V variation norm ‖μ‖_v, 25 W
Weak Law of Large Numbers, 16, 18
(continuedfrom page ii)
64 EDWARDS. Fourier Series. Vol. I. 2nd ed.
65 WELLS. Differential Analysis on Complex Manifolds. 2nd ed.
66 WATERHOUSE. Introduction to Affine Group Schemes.
67 SERRE. Local Fields.
68 WEIDMANN. Linear Operators in Hilbert Spaces.
69 LANG. Cyclotomic Fields II.
70 MASSEY. Singular Homology Theory.
71 FARKAS/KRA. Riemann Surfaces. 2nd ed.
72 STILLWELL. Classical Topology and Combinatorial Group Theory. 2nd ed.
73 HUNGERFORD. Algebra.
74 DAVENPORT. Multiplicative Number Theory. 3rd ed.
75 HOCHSCHILD. Basic Theory of Algebraic Groups and Lie Algebras.
76 IITAKA. Algebraic Geometry.
77 HECKE. Lectures on the Theory of Algebraic Numbers.
78 BURRIS/SANKAPPANAVAR. A Course in Universal Algebra.
79 WALTERS. An Introduction to Ergodic Theory.
80 ROBINSON. A Course in the Theory of Groups. 2nd ed.
81 FORSTER. Lectures on Riemann Surfaces.
82 BOTT/TU. Differential Forms in Algebraic Topology.
83 WASHINGTON. Introduction to Cyclotomic Fields. 2nd ed.
84 IRELAND/ROSEN. A Classical Introduction to Modern Number Theory. 2nd ed.
85 EDWARDS. Fourier Series. Vol. II. 2nd ed.
86 VAN LINT. Introduction to Coding Theory. 2nd ed.
87 BROWN. Cohomology of Groups.
88 PIERCE. Associative Algebras.
89 LANG. Introduction to Algebraic and Abelian Functions. 2nd ed.
90 BRØNDSTED. An Introduction to Convex Polytopes.
91 BEARDON. On the Geometry of Discrete Groups.
92 DIESTEL. Sequences and Series in Banach Spaces.
93 DUBROVIN/FOMENKO/NOVIKOV. Modern Geometry-Methods and Applications. Part I. 2nd ed.
94 WARNER. Foundations of Differentiable Manifolds and Lie Groups.
95 SHIRYAEV. Probability. 2nd ed.
96 CONWAY. A Course in Functional Analysis. 2nd ed.
97 KOBLITZ. Introduction to Elliptic Curves and Modular Forms. 2nd ed.
98 BRÖCKER/TOM DIECK. Representations of Compact Lie Groups.
99 GROVE/BENSON. Finite Reflection Groups. 2nd ed.
100 BERG/CHRISTENSEN/RESSEL. Harmonic Analysis on Semigroups: Theory of Positive Definite and Related Functions.
101 EDWARDS. Galois Theory.
102 VARADARAJAN. Lie Groups, Lie Algebras and Their Representations.
103 LANG. Complex Analysis. 3rd ed.
104 DUBROVIN/FOMENKO/NOVIKOV. Modern Geometry-Methods and Applications. Part II.
105 LANG. SL₂(R).
106 SILVERMAN. The Arithmetic of Elliptic Curves.
107 OLVER. Applications of Lie Groups to Differential Equations. 2nd ed.
108 RANGE. Holomorphic Functions and Integral Representations in Several Complex Variables.
109 LEHTO. Univalent Functions and Teichmüller Spaces.
110 LANG. Algebraic Number Theory.
111 HUSEMÖLLER. Elliptic Curves. 2nd ed.
112 LANG. Elliptic Functions.
113 KARATZAS/SHREVE. Brownian Motion and Stochastic Calculus. 2nd ed.
114 KOBLITZ. A Course in Number Theory and Cryptography. 2nd ed.
115 BERGER/GOSTIAUX. Differential Geometry: Manifolds, Curves, and Surfaces.
116 KELLEY/SRINIVASAN. Measure and Integral. Vol. I.
117 J.-P. SERRE. Algebraic Groups and Class Fields.
118 PEDERSEN. Analysis Now.
119 ROTMAN. An Introduction to Algebraic Topology.
120 ZIEMER. Weakly Differentiable Functions: Sobolev Spaces and Functions of Bounded Variation.
121 LANG. Cyclotomic Fields I and II. Combined 2nd ed.
122 REMMERT. Theory of Complex Functions. Readings in Mathematics
123 EBBINGHAUS/HERMES et al. Numbers. Readings in Mathematics
124 DUBROVIN/FOMENKO/NOVIKOV. Modern Geometry-Methods and Applications. Part III.
125 BERENSTEIN/GAY. Complex Variables: An Introduction.
126 BOREL. Linear Algebraic Groups. 2nd ed.
127 MASSEY. A Basic Course in Algebraic Topology.
128 RAUCH. Partial Differential Equations.
129 FULTON/HARRIS. Representation Theory: A First Course. Readings in Mathematics
130 DODSON/POSTON. Tensor Geometry.
131 LAM. A First Course in Noncommutative Rings. 2nd ed.
132 BEARDON. Iteration of Rational Functions.
133 HARRIS. Algebraic Geometry: A First Course.
134 ROMAN. Coding and Information Theory.
135 ROMAN. Advanced Linear Algebra.
136 ADKINS/WEINTRAUB. Algebra: An Approach via Module Theory.
137 AXLER/BOURDON/RAMEY. Harmonic Function Theory. 2nd ed.
138 COHEN. A Course in Computational Algebraic Number Theory.
139 BREDON. Topology and Geometry.
140 AUBIN. Optima and Equilibria. An Introduction to Nonlinear Analysis.
141 BECKER/WEISPFENNING/KREDEL. Gröbner Bases. A Computational Approach to Commutative Algebra.
142 LANG. Real and Functional Analysis. 3rd ed.
143 DOOB. Measure Theory.
144 DENNIS/FARB. Noncommutative Algebra.
145 VICK. Homology Theory. An Introduction to Algebraic Topology. 2nd ed.
146 BRIDGES. Computability: A Mathematical Sketchbook.
147 ROSENBERG. Algebraic K-Theory and Its Applications.
148 ROTMAN. An Introduction to the Theory of Groups. 4th ed.
149 RATCLIFFE. Foundations of Hyperbolic Manifolds.
150 EISENBUD. Commutative Algebra with a View Toward Algebraic Geometry.
151 SILVERMAN. Advanced Topics in the Arithmetic of Elliptic Curves.
152 ZIEGLER. Lectures on Polytopes.
153 FULTON. Algebraic Topology: A First Course.
154 BROWN/PEARCY. An Introduction to Analysis.
155 KASSEL. Quantum Groups.
156 KECHRIS. Classical Descriptive Set Theory.
157 MALLIAVIN. Integration and Probability.
158 ROMAN. Field Theory.
159 CONWAY. Functions of One Complex Variable II.
160 LANG. Differential and Riemannian Manifolds.
161 BORWEIN/ERDÉLYI. Polynomials and Polynomial Inequalities.
162 ALPERIN/BELL. Groups and Representations.
163 DIXON/MORTIMER. Permutation Groups.
164 NATHANSON. Additive Number Theory: The Classical Bases.
165 NATHANSON. Additive Number Theory: Inverse Problems and the Geometry of Sumsets.
166 SHARPE. Differential Geometry: Cartan's Generalization of Klein's Erlangen Program.
167 MORANDI. Field and Galois Theory.
168 EWALD. Combinatorial Convexity and Algebraic Geometry.
169 BHATIA. Matrix Analysis.
170 BREDON. Sheaf Theory. 2nd ed.
171 PETERSEN. Riemannian Geometry.
172 REMMERT. Classical Topics in Complex Function Theory.
173 DIESTEL. Graph Theory. 2nd ed.
174 BRIDGES. Foundations of Real and Abstract Analysis.
175 LICKORISH. An Introduction to Knot Theory.
176 LEE. Riemannian Manifolds.
177 NEWMAN. Analytic Number Theory.
178 CLARKE/LEDYAEV/STERN/WOLENSKI. Nonsmooth Analysis and Control Theory.
179 DOUGLAS. Banach Algebra Techniques in Operator Theory. 2nd ed.
180 SRIVASTAVA. A Course on Borel Sets.
181 KRESS. Numerical Analysis.
182 WALTER. Ordinary Differential Equations.
183 MEGGINSON. An Introduction to Banach Space Theory.
184 BOLLOBAS. Modern Graph Theory.
185 COX/LITTLE/O'SHEA. Using Algebraic Geometry.
186 RAMAKRISHNAN/VALENZA. Fourier Analysis on Number Fields.
187 HARRIS/MORRISON. Moduli of Curves.
188 GOLDBLATT. Lectures on the Hyperreals: An Introduction to Nonstandard Analysis.
189 LAM. Lectures on Modules and Rings.
190 ESMONDE/MURTY. Problems in Algebraic Number Theory. 2nd ed.
191 LANG. Fundamentals of Differential Geometry.
192 HIRSCH/LACOMBE. Elements of Functional Analysis.
193 COHEN. Advanced Topics in Computational Number Theory.
194 ENGEL/NAGEL. One-Parameter Semigroups for Linear Evolution Equations.
195 NATHANSON. Elementary Methods in Number Theory.
196 OSBORNE. Basic Homological Algebra.
197 EISENBUD/HARRIS. The Geometry of Schemes.
198 ROBERT. A Course in p-adic Analysis.
199 HEDENMALM/KORENBLUM/ZHU. Theory of Bergman Spaces.
200 BAO/CHERN/SHEN. An Introduction to Riemann-Finsler Geometry.
201 HINDRY/SILVERMAN. Diophantine Geometry: An Introduction.
202 LEE. Introduction to Topological Manifolds.
203 SAGAN. The Symmetric Group: Representations, Combinatorial Algorithms, and Symmetric Functions.
204 ESCOFIER. Galois Theory.
205 FELIX/HALPERIN/THOMAS. Rational Homotopy Theory. 2nd ed.
206 MURTY. Problems in Analytic Number Theory. Readings in Mathematics
207 GODSIL/ROYLE. Algebraic Graph Theory.
208 CHENEY. Analysis for Applied Mathematics.
209 ARVESON. A Short Course on Spectral Theory.
210 ROSEN. Number Theory in Function Fields.
211 LANG. Algebra. Revised 3rd ed.
212 MATOUSEK. Lectures on Discrete Geometry.
213 FRITZSCHE/GRAUERT. From Holomorphic Functions to Complex Manifolds.
214 JOST. Partial Differential Equations.
215 GOLDSCHMIDT. Algebraic Functions and Projective Curves.
216 D. SERRE. Matrices: Theory and Applications.
217 MARKER. Model Theory: An Introduction.
218 LEE. Introduction to Smooth Manifolds.
219 MACLACHLAN/REID. The Arithmetic of Hyperbolic 3-Manifolds.
220 NESTRUEV. Smooth Manifolds and Observables.
221 GRÜNBAUM. Convex Polytopes. 2nd ed.
222 HALL. Lie Groups, Lie Algebras, and Representations: An Elementary Introduction.
223 VRETBLAD. Fourier Analysis and Its Applications.
224 WALSCHAP. Metric Structures in Differential Geometry.
225 BUMP. Lie Groups.
226 ZHU. Spaces of Holomorphic Functions in the Unit Ball.
227 MILLER/STURMFELS. Combinatorial Commutative Algebra.
228 DIAMOND/SHURMAN. A First Course in Modular Forms.
229 EISENBUD. The Geometry of Syzygies.
230 STROOCK. An Introduction to Markov Processes.